Re: [PATCH] kvm: lapic: fix broken vcpu hotplug

2020-06-23 Thread Igor Mammedov
On Mon, 22 Jun 2020 18:47:57 +0200
Paolo Bonzini  wrote:

> On 22/06/20 18:08, Igor Mammedov wrote:
> > Guest fails to online hotplugged CPU with error
> >   smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
> > 
> > It's caused by the fact that kvm_apic_set_state(), which used to call
> > recalculate_apic_map() unconditionally and thereby pulled the hotplugged
> > CPU into the apic map, now updates the map conditionally [1] on a state
> > change, which doesn't happen in this case, so the apic map update is skipped.
> > 
> > Note:
> > during kvm_arch_vcpu_create() the new CPU is not yet visible to
> > kvm_recalculate_apic_map(), so all related update calls end up
> > as no-ops, and only the follow-up kvm_apic_set_state() used to trigger
> > the map update that accounted for the hotplugged CPU.
> > Fix the issue by forcing an unconditional update from kvm_apic_set_state(),
> > like it used to be.
> > 
> > 1)
> > Fixes: 4abaffce4d25a ("KVM: LAPIC: Recalculate apic map in batch")
> > Signed-off-by: Igor Mammedov 
> > ---
> > PS:
> > it's an alternative to the full revert of [1] that I posted earlier:
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2205600.html
> > so feel free to pick whichever is better by now
> > ---
> >  arch/x86/kvm/lapic.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 34a7e0533dad..5696831d4005 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2556,6 +2556,7 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > int r;
> >  
> > +   apic->vcpu->kvm->arch.apic_map_dirty = true;
> > kvm_lapic_set_base(vcpu, vcpu->arch.apic_base);
> > /* set SPIV separately to get count of SW disabled APICs right */
> > apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
> >   
> 
> Queued, but it's better to set apic_map_dirty just before the call to
> kvm_recalculate_apic_map, or you can have a variant of the race that you
> pointed out.

Here I was also worried about the failure path, which sits just before the
normal kvm_recalculate_apic_map() and has its own kvm_recalculate_apic_map()
call (see the sketch below).

But I'm not sure whether we should force a map update in that case.
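For reference, a rough sketch of the control flow in question, reconstructed
from the hunks quoted in this thread and from memory of the code at the time
(approximate, not verbatim kernel source):

/* Approximate sketch of kvm_apic_set_state() -- not verbatim kernel code. */
int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
{
	struct kvm_lapic *apic = vcpu->arch.apic;
	int r;

	/* The queued fix sets the dirty flag here, at the top ... */
	apic->vcpu->kvm->arch.apic_map_dirty = true;

	kvm_lapic_set_base(vcpu, vcpu->arch.apic_base);
	/* set SPIV separately to get count of SW disabled APICs right */
	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));

	r = kvm_apic_state_fixup(vcpu, s, true);
	if (r) {
		/* ... the failure path discussed above: it has its own
		 * recalculation, and it is unclear whether the forced
		 * update should apply here as well. */
		kvm_recalculate_apic_map(vcpu->kvm);
		return r;
	}
	memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s));

	/* ... whereas Paolo's suggestion is to set apic_map_dirty right
	 * before this call, closing the race window in between. */
	kvm_recalculate_apic_map(vcpu->kvm);
	kvm_apic_set_version(vcpu);

	/* remaining APIC state restore elided in this sketch */
	return 0;
}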

> 
> Paolo
> 



[PATCH] kvm: lapic: fix broken vcpu hotplug

2020-06-22 Thread Igor Mammedov
Guest fails to online hotplugged CPU with error
  smpboot: do_boot_cpu failed(-1) to wakeup CPU#4

It's caused by the fact that kvm_apic_set_state(), which used to call
recalculate_apic_map() unconditionally and thereby pulled the hotplugged
CPU into the apic map, now updates the map conditionally [1] on a state
change, which doesn't happen in this case, so the apic map update is skipped.

Note:
during kvm_arch_vcpu_create() the new CPU is not yet visible to
kvm_recalculate_apic_map(), so all related update calls end up
as no-ops, and only the follow-up kvm_apic_set_state() used to trigger
the map update that accounted for the hotplugged CPU.
Fix the issue by forcing an unconditional update from kvm_apic_set_state(),
like it used to be.

1)
Fixes: 4abaffce4d25a ("KVM: LAPIC: Recalculate apic map in batch")
Signed-off-by: Igor Mammedov 
---
PS:
it's an alternative to the full revert of [1] that I posted earlier:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2205600.html
so feel free to pick whichever is better by now
---
 arch/x86/kvm/lapic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 34a7e0533dad..5696831d4005 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2556,6 +2556,7 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
struct kvm_lapic *apic = vcpu->arch.apic;
int r;
 
+   apic->vcpu->kvm->arch.apic_map_dirty = true;
kvm_lapic_set_base(vcpu, vcpu->arch.apic_base);
/* set SPIV separately to get count of SW disabled APICs right */
apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
-- 
2.26.2



[PATCH] Revert "KVM: LAPIC: Recalculate apic map in batch"

2020-06-22 Thread Igor Mammedov
Guest fails to online hotplugged CPU with error
  smpboot: do_boot_cpu failed(-1) to wakeup CPU#4

It's caused by the fact that kvm_apic_set_state(), which used to call
recalculate_apic_map() unconditionally and thereby pulled the hotplugged
CPU into the apic map, now updates the map conditionally on a state
change, which doesn't happen in this case, so the apic map update is skipped.

The subject commit also introduces a race which can lose apic map update
events: if one CPU is already in the process of updating the apic map
and another CPU flags the map as dirty to ask for another update, the
first CPU will clear the update request set by the second CPU when the
update in progress is finished.

  CPU1 kvm_recalculate_apic_map()
 .. updating
  CPU2 apic_map_dirty = true
  CPU1 apic_map_dirty = false
  CPU2 kvm_recalculate_apic_map()
 nop due to apic_map_dirty == false

This reverts commit 4abaffce4d25aa41392d2e81835592726d757857.

Signed-off-by: Igor Mammedov 
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/lapic.h|  1 -
 arch/x86/kvm/lapic.c| 46 +++--
 arch/x86/kvm/x86.c  |  1 -
 4 files changed, 10 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8998e97457f..bbf066217c91 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -943,7 +943,6 @@ struct kvm_arch {
atomic_t vapics_in_nmi_mode;
struct mutex apic_map_lock;
struct kvm_apic_map *apic_map;
-   bool apic_map_dirty;
 
bool apic_access_page_done;
unsigned long apicv_inhibit_reasons;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 754f29beb83e..7d08ccfb69d0 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -81,7 +81,6 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
 void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
-void kvm_recalculate_apic_map(struct kvm *kvm);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val);
 int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 34a7e0533dad..e99d74288b9e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -169,28 +169,14 @@ static void kvm_apic_map_free(struct rcu_head *rcu)
kvfree(map);
 }
 
-void kvm_recalculate_apic_map(struct kvm *kvm)
+static void recalculate_apic_map(struct kvm *kvm)
 {
struct kvm_apic_map *new, *old = NULL;
struct kvm_vcpu *vcpu;
int i;
u32 max_id = 255; /* enough space for any xAPIC ID */
 
-   if (!kvm->arch.apic_map_dirty) {
-   /*
-* Read kvm->arch.apic_map_dirty before
-* kvm->arch.apic_map
-*/
-   smp_rmb();
-   return;
-   }
-
mutex_lock(&kvm->arch.apic_map_lock);
-   if (!kvm->arch.apic_map_dirty) {
-   /* Someone else has updated the map. */
-   mutex_unlock(&kvm->arch.apic_map_lock);
-   return;
-   }
 
kvm_for_each_vcpu(i, vcpu, kvm)
if (kvm_apic_present(vcpu))
@@ -255,12 +241,6 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
old = rcu_dereference_protected(kvm->arch.apic_map,
lockdep_is_held(&kvm->arch.apic_map_lock));
rcu_assign_pointer(kvm->arch.apic_map, new);
-   /*
-* Write kvm->arch.apic_map before
-* clearing apic->apic_map_dirty
-*/
-   smp_wmb();
-   kvm->arch.apic_map_dirty = false;
mutex_unlock(&kvm->arch.apic_map_lock);
 
if (old)
@@ -282,20 +262,20 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
else
static_key_slow_inc(&apic_sw_disabled.key);
 
-   apic->vcpu->kvm->arch.apic_map_dirty = true;
+   recalculate_apic_map(apic->vcpu->kvm);
}
 }
 
 static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
 {
kvm_lapic_set_reg(apic, APIC_ID, id << 24);
-   apic->vcpu->kvm->arch.apic_map_dirty = true;
+   recalculate_apic_map(apic->vcpu->kvm);
 }
 
 static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
 {
kvm_lapic_set_reg(apic, APIC_LDR, id);
-   apic->vcpu->kvm->arch.apic_map_dirty = true;
+   recalculate_apic_map(apic->vcpu->kvm);
 }
 
 static inline u32 kvm_apic_calc_x2apic_ldr(u32 id)
@@ -311,7 +291,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
 
kvm_lapic_set_reg(apic, APIC_ID, id);
kvm_lapic_set_reg(apic, APIC_LDR, ldr);
-   apic->vcpu->kvm->arc

Re: [PATCH v3] KVM: LAPIC: Recalculate apic map in batch

2020-06-21 Thread Igor Mammedov
On Fri, 19 Jun 2020 16:10:43 +0200
Paolo Bonzini  wrote:

> On 19/06/20 14:36, Igor Mammedov wrote:
> > qemu-kvm -m 2G -smp 4,maxcpus=8  -monitor stdio
> > (qemu) device_add qemu64-x86_64-cpu,socket-id=4,core-id=0,thread-id=0
> > 
> > in guest fails with:
> > 
> >  smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
> > 
> > which makes me suspect that  INIT/SIPI wasn't delivered
> > 
> > Is it a known issue?
> >   
> 
> No, it isn't.  I'll revert.
> 
> Paolo
> 

Following fixes immediate issue:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 34a7e0533dad..6dc177da19da 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2567,6 +2567,7 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
}
memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s));
 
+   apic->vcpu->kvm->arch.apic_map_dirty = true;
kvm_recalculate_apic_map(vcpu->kvm);
kvm_apic_set_version(vcpu);

The problem is that during kvm_arch_vcpu_create() the new vcpu is not visible to
kvm_recalculate_apic_map(), so however many times the map update was called
during it, it didn't affect the apic map.

What broke hotplug is that kvm_vcpu_ioctl_set_lapic -> kvm_apic_set_state,
which is called after the new vcpu is visible, used to make an unconditional
update which pulled in the new vcpu, but with this patch the map update is gone
since the state hasn't actually changed, so we lost the one call of
kvm_recalculate_apic_map() which did actually matter.

It happens to work for vcpus present at boot just by luck
(the BSP updates SPIV after all vcpus have been created, which triggers
kvm_recalculate_apic_map()).

I'm not sending a formal patch yet, since I have doubts about the subject patch.

The following sequence looks like a race that can cause lost map update events:

 cpu1                                   cpu2

 apic_map_dirty = true
 kvm_recalculate_apic_map:
   passes the dirty check
   mutex_lock(&kvm->arch.apic_map_lock);
   if (!kvm->arch.apic_map_dirty)
   ... in the process of updating the map
                                        other calls set apic_map_dirty = true,
                                        which might be too late for the affected cpu
   apic_map_dirty = false
                                        kvm_recalculate_apic_map:
                                          bails out on
                                          if (!kvm->arch.apic_map_dirty)

It's safer to revert this patch for now, as you suggested earlier.

If you prefer to keep it, I'll post the above fixup as a patch (see also the
illustration below).
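Purely as an illustration (this is not the actual kernel fix, and it assumes
apic_map_dirty were an atomic_t rather than the plain bool the patch declares),
one generic way to avoid the lost-update window above is to consume the dirty
flag before rebuilding, so a request that arrives during the rebuild stays
pending and triggers the next recalculation:

/* Illustration only -- not the actual kernel fix.
 * Assumes kvm->arch.apic_map_dirty is an atomic_t, which is NOT how the
 * patch under discussion declares it (it uses a plain bool). */
static void recalculate_if_dirty(struct kvm *kvm)
{
	/* Consume the flag up front: a request set after this point stays
	 * pending and triggers the next recalculation instead of being
	 * silently cleared at the end of this one. */
	if (!atomic_xchg(&kvm->arch.apic_map_dirty, 0))
		return;

	mutex_lock(&kvm->arch.apic_map_lock);
	/* ... rebuild and rcu_assign_pointer() the new apic map here ... */
	mutex_unlock(&kvm->arch.apic_map_lock);
}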



Re: [PATCH v3] KVM: LAPIC: Recalculate apic map in batch

2020-06-19 Thread Igor Mammedov
On Fri, 19 Jun 2020 16:10:43 +0200
Paolo Bonzini  wrote:

> On 19/06/20 14:36, Igor Mammedov wrote:
> > qemu-kvm -m 2G -smp 4,maxcpus=8  -monitor stdio
> > (qemu) device_add qemu64-x86_64-cpu,socket-id=4,core-id=0,thread-id=0
> > 
> > in guest fails with:
> > 
> >  smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
> > 
> > which makes me suspect that  INIT/SIPI wasn't delivered
> > 
> > Is it a known issue?
> >   
> 
> No, it isn't.  I'll revert.
Wait for a day or two;

I'll try to come up with a fix over the weekend.

> Paolo
> 



Re: [PATCH v3] KVM: LAPIC: Recalculate apic map in batch

2020-06-19 Thread Igor Mammedov
On Wed, 26 Feb 2020 10:41:02 +0800
Wanpeng Li  wrote:

> From: Wanpeng Li 
> 
> In the vCPU reset and set APIC_BASE MSR path, the apic map will be recalculated
> several times; each recalculation consumes 10+ us, as observed by ftrace in my
> non-overcommit environment, due to the expensive memory allocation/mutex/rcu
> operations. This patch optimizes it by recalculating the apic map in batch. I hope
> this can benefit the serverless scenario, which can frequently create/destroy
> VMs.
> 
> Before patch:
> 
> kvm_lapic_reset  ~27us
> 
> After patch:
> 
> kvm_lapic_reset  ~14us 
> 
> Observed by ftrace, improve ~48%.
> 
> Signed-off-by: Wanpeng Li 
This broke CPU hotplug:

qemu-kvm -m 2G -smp 4,maxcpus=8  -monitor stdio
(qemu) device_add qemu64-x86_64-cpu,socket-id=4,core-id=0,thread-id=0

in guest fails with:

 smpboot: do_boot_cpu failed(-1) to wakeup CPU#4

which makes me suspect that INIT/SIPI wasn't delivered.

Is it a known issue?

> ---
> v2 -> v3:
>  * move apic_map_dirty to kvm_arch
>  * add the suggestions from Paolo
> 
> v1 -> v2:
>  * add apic_map_dirty to kvm_lapic
>  * error condition in kvm_apic_set_state, do recalculate unconditionally
> 
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c| 46 
> -
>  arch/x86/kvm/lapic.h|  1 +
>  arch/x86/kvm/x86.c  |  1 +
>  4 files changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 40a0c0f..4380ed1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -920,6 +920,7 @@ struct kvm_arch {
>   atomic_t vapics_in_nmi_mode;
>   struct mutex apic_map_lock;
>   struct kvm_apic_map *apic_map;
> + bool apic_map_dirty;
>  
>   bool apic_access_page_done;
>   unsigned long apicv_inhibit_reasons;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index afcd30d..de832aa 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -164,14 +164,28 @@ static void kvm_apic_map_free(struct rcu_head *rcu)
>   kvfree(map);
>  }
>  
> -static void recalculate_apic_map(struct kvm *kvm)
> +void kvm_recalculate_apic_map(struct kvm *kvm)
>  {
>   struct kvm_apic_map *new, *old = NULL;
>   struct kvm_vcpu *vcpu;
>   int i;
>   u32 max_id = 255; /* enough space for any xAPIC ID */
>  
> + if (!kvm->arch.apic_map_dirty) {
> + /*
> +  * Read kvm->arch.apic_map_dirty before
> +  * kvm->arch.apic_map
> +  */
> + smp_rmb();
> + return;
> + }
> +
>   mutex_lock(&kvm->arch.apic_map_lock);
> + if (!kvm->arch.apic_map_dirty) {
> + /* Someone else has updated the map. */
> + mutex_unlock(&kvm->arch.apic_map_lock);
> + return;
> + }
>  
>   kvm_for_each_vcpu(i, vcpu, kvm)
>   if (kvm_apic_present(vcpu))
> @@ -236,6 +250,12 @@ static void recalculate_apic_map(struct kvm *kvm)
>   old = rcu_dereference_protected(kvm->arch.apic_map,
>   lockdep_is_held(&kvm->arch.apic_map_lock));
>   rcu_assign_pointer(kvm->arch.apic_map, new);
> + /*
> +  * Write kvm->arch.apic_map before
> +  * clearing apic->apic_map_dirty
> +  */
> + smp_wmb();
> + kvm->arch.apic_map_dirty = false;
>   mutex_unlock(&kvm->arch.apic_map_lock);
>  
>   if (old)
> @@ -257,20 +277,20 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
>   else
>   static_key_slow_inc(&apic_sw_disabled.key);
>  
> - recalculate_apic_map(apic->vcpu->kvm);
> + apic->vcpu->kvm->arch.apic_map_dirty = true;
>   }
>  }
>  
>  static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
>  {
>   kvm_lapic_set_reg(apic, APIC_ID, id << 24);
> - recalculate_apic_map(apic->vcpu->kvm);
> + apic->vcpu->kvm->arch.apic_map_dirty = true;
>  }
>  
>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>  {
>   kvm_lapic_set_reg(apic, APIC_LDR, id);
> - recalculate_apic_map(apic->vcpu->kvm);
> + apic->vcpu->kvm->arch.apic_map_dirty = true;
>  }
>  
>  static inline u32 kvm_apic_calc_x2apic_ldr(u32 id)
> @@ -286,7 +306,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
>  
>   kvm_lapic_set_reg(apic, APIC_ID, id);
>   kvm_lapic_set_reg(apic, APIC_LDR, ldr);
> - recalculate_apic_map(apic->vcpu->kvm);
> + apic->vcpu->kvm->arch.apic_map_dirty = true;
>  }
>  
>  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> @@ -1912,7 +1932,7 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>   case APIC_DFR:
>   if (!apic_x2apic_mode(apic)) {
>   kvm_lapic_set_reg(apic, APIC_DFR, val | 0x0FFF);
> - recalculate_apic_map(apic->vcpu->kvm)

[PATCH v2] KVM: s390: kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset()

2019-09-11 Thread Igor Mammedov
If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on a memslot before calling
kvm_s390_vm_start_migration(), the kernel will oops with:

  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address:  TEID: 0483
  Fault in home space mode while using kernel ASCE.
  AS:02a2000b R2:0001bff8c00b R3:0001bff88007 
S:0001bff91000 P:003d
  Oops: 0004 ilc:2 [#1] SMP
  ...
  Call Trace:
  ([<001f804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm])
   [<001f804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm]
   [<001f804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm]
   [<008ba572>] do_vfs_ioctl+0x1d2/0xe58
   [<008bb284>] ksys_ioctl+0x8c/0xb8
   [<008bb2e2>] sys_ioctl+0x32/0x40
   [<0175552c>] system_call+0x2b8/0x2d8
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
   [<00dbaf60>] __memset+0xc/0xa0

due to ms->dirty_bitmap being NULL, which might crash the host.

Make sure that ms->dirty_bitmap is set before using it, or
return -EINVAL otherwise.

Fixes: afdad61615cc ("KVM: s390: Fix storage attributes migration with memory slots")
Signed-off-by: Igor Mammedov 
---
Cc: sta...@vger.kernel.org # v4.19+

v2:
   - remove the no-longer-true 'warning' clause from the commit message
v2:
   - drop WARN()

 arch/s390/kvm/kvm-s390.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f329dcb3f44c..2a40cd3e40b4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1018,6 +1018,8 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
/* mark all the pages in active slots as dirty */
for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
ms = slots->memslots + slotnr;
+   if (!ms->dirty_bitmap)
+   return -EINVAL;
/*
 * The second half of the bitmap is only used on x86,
 * and would be wasted otherwise, so we put it to good
-- 
2.18.1



[PATCH] KVM: s390: kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset()

2019-09-10 Thread Igor Mammedov
If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on a memslot before calling
kvm_s390_vm_start_migration(), the kernel will oops with:

  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address:  TEID: 0483
  Fault in home space mode while using kernel ASCE.
  AS:02a2000b R2:0001bff8c00b R3:0001bff88007 
S:0001bff91000 P:003d
  Oops: 0004 ilc:2 [#1] SMP
  ...
  Call Trace:
  ([<001f804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm])
   [<001f804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm]
   [<001f804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm]
   [<008ba572>] do_vfs_ioctl+0x1d2/0xe58
   [<008bb284>] ksys_ioctl+0x8c/0xb8
   [<008bb2e2>] sys_ioctl+0x32/0x40
   [<0175552c>] system_call+0x2b8/0x2d8
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
   [<00dbaf60>] __memset+0xc/0xa0

due to ms->dirty_bitmap being NULL, which might crash the host.

Make sure that ms->dirty_bitmap is set before using it or
print a warning and return -EINVAL otherwise.

Fixes: afdad61615cc ("KVM: s390: Fix storage attributes migration with memory slots")
Signed-off-by: Igor Mammedov 
---
Cc: sta...@vger.kernel.org # v4.19+

v2:
   - drop WARN()

 arch/s390/kvm/kvm-s390.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f329dcb3f44c..2a40cd3e40b4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1018,6 +1018,8 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
/* mark all the pages in active slots as dirty */
for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
ms = slots->memslots + slotnr;
+   if (!ms->dirty_bitmap)
+   return -EINVAL;
/*
 * The second half of the bitmap is only used on x86,
 * and would be wasted otherwise, so we put it to good
-- 
2.18.1



Re: [PATCH] kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset()

2019-09-09 Thread Igor Mammedov
On Mon, 9 Sep 2019 17:47:37 +0200
David Hildenbrand  wrote:

> On 09.09.19 16:55, Igor Mammedov wrote:
> > If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on memslot before calling
> > kvm_s390_vm_start_migration(), kernel will oops with:
> >   
> 
> We usually have the subject in a "KVM: s390: ..." format. Like
> 
> "KVM: s390: check dirty_bitmap before using it as target for memset()"
> 
> >   Unable to handle kernel pointer dereference in virtual kernel address 
> > space
> >   Failing address:  TEID: 0483
> >   Fault in home space mode while using kernel ASCE.
> >   AS:02a2000b R2:0001bff8c00b R3:0001bff88007 
> > S:0001bff91000 P:003d
> >   Oops: 0004 ilc:2 [#1] SMP
> >   ...
> >   Call Trace:
> >   ([<001f804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm])
> >[<001f804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm]
> >[<001f804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm]
> >[<008ba572>] do_vfs_ioctl+0x1d2/0xe58
> >[<008bb284>] ksys_ioctl+0x8c/0xb8
> >[<008bb2e2>] sys_ioctl+0x32/0x40
> >[<0175552c>] system_call+0x2b8/0x2d8
> >   INFO: lockdep is turned off.
> >   Last Breaking-Event-Address:
> >[<00dbaf60>] __memset+0xc/0xa0
> > 
> > due to ms->dirty_bitmap being NULL, which migh crash the host.  
> 
> s/migh/might/
> 
> > 
> > Make sure that ms->dirty_bitmap is set before using it or
> > print a warning and return -ENIVAL otherwise.
> > 
> > Signed-off-by: Igor Mammedov 
> > ---
> > 
> > PS:
> >   keeping it private for now as issue might DoS host,
> >   I'll leave it upto maintainers to decide if it should be handled as 
> > security
> >   bug (I'm not sure what process for handling such bugs should be used).
> > 
> > 
> >  arch/s390/kvm/kvm-s390.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> > index f329dcb3f44c..dfba51c9d60c 100644
> > --- a/arch/s390/kvm/kvm-s390.c
> > +++ b/arch/s390/kvm/kvm-s390.c
> > @@ -1018,6 +1018,10 @@ static int kvm_s390_vm_start_migration(struct kvm 
> > *kvm)
> > /* mark all the pages in active slots as dirty */
> > for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
> > ms = slots->memslots + slotnr;
> > +   if (!ms->dirty_bitmap) {
> > +   WARN(1, "ms->dirty_bitmap == NULL\n");
> > +   return -EINVAL;
> > +   }  
> 
> if (WARN_ON_ONCE(!ms->dirty_bitmap))
>   return -EINVAL;
> 
> But I wonder if the WARN is really needed. (or simply a wrong usage of the 
> interface)
I added the WARN because there is no visible sign that something
went wrong in userspace; this way at least we would have a trace of
the invalid API usage.

But I can drop it if you prefer.

> 
> 
> > /*
> >  * The second half of the bitmap is only used on x86,
> >  * and would be wasted otherwise, so we put it to good
> >   
> 
> You should add
> 
> Fixes: afdad61615cc ("KVM: s390: Fix storage attributes migration with memory 
> slots")
> Cc: sta...@vger.kernel.org # v4.19+
> 
> Thanks!



[PATCH] kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset()

2019-09-09 Thread Igor Mammedov
If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on a memslot before calling
kvm_s390_vm_start_migration(), the kernel will oops with:

  Unable to handle kernel pointer dereference in virtual kernel address space
  Failing address:  TEID: 0483
  Fault in home space mode while using kernel ASCE.
  AS:02a2000b R2:0001bff8c00b R3:0001bff88007 
S:0001bff91000 P:003d
  Oops: 0004 ilc:2 [#1] SMP
  ...
  Call Trace:
  ([<001f804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm])
   [<001f804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm]
   [<001f804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm]
   [<008ba572>] do_vfs_ioctl+0x1d2/0xe58
   [<008bb284>] ksys_ioctl+0x8c/0xb8
   [<008bb2e2>] sys_ioctl+0x32/0x40
   [<0175552c>] system_call+0x2b8/0x2d8
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
   [<00dbaf60>] __memset+0xc/0xa0

due to ms->dirty_bitmap being NULL, which might crash the host.

Make sure that ms->dirty_bitmap is set before using it or
print a warning and return -EINVAL otherwise.

Signed-off-by: Igor Mammedov 
---

PS:
  keeping it private for now as the issue might DoS the host;
  I'll leave it up to the maintainers to decide whether it should be handled as a
  security bug (I'm not sure which process should be used for handling such bugs).


 arch/s390/kvm/kvm-s390.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f329dcb3f44c..dfba51c9d60c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1018,6 +1018,10 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
/* mark all the pages in active slots as dirty */
for (slotnr = 0; slotnr < slots->used_slots; slotnr++) {
ms = slots->memslots + slotnr;
+   if (!ms->dirty_bitmap) {
+   WARN(1, "ms->dirty_bitmap == NULL\n");
+   return -EINVAL;
+   }
/*
 * The second half of the bitmap is only used on x86,
 * and would be wasted otherwise, so we put it to good
-- 
2.18.1



Re: cpu/hotplug: broken sibling thread hotplug

2019-01-28 Thread Igor Mammedov
On Mon, 28 Jan 2019 06:52:52 -0600
Josh Poimboeuf  wrote:

> On Mon, Jan 28, 2019 at 11:13:04AM +0100, Igor Mammedov wrote:
> > On Fri, 25 Jan 2019 11:02:03 -0600
> > Josh Poimboeuf  wrote:
> >   
> > > On Fri, Jan 25, 2019 at 10:36:57AM -0600, Josh Poimboeuf wrote:  
> > > > How about this patch?  It's just a revert of 73d5e2b47264 and
> > > > bc2d8d262cba, plus the 1-line vmx_vm_init() change.  If it looks ok to
> > > > you, I can clean it up and submit an official version.
> > > 
> > > This one actually compiles...  
> > 
> > Looks good to me,
> > (one small question below)
> > 
> >  
> > [...]  
> > >  static inline bool cpu_smt_allowed(unsigned int cpu)
> > >  {
> > > - if (topology_is_primary_thread(cpu))
> > > + if (cpu_smt_control == CPU_SMT_ENABLED)
> > >   return true;
> > >  
> > > - /*
> > > -  * If the CPU is not a 'primary' thread and the booted_once bit is
> > > -  * set then the processor has SMT support. Store this information
> > > -  * for the late check of SMT support in cpu_smt_check_topology().
> > > -  */
> > > - if (per_cpu(cpuhp_state, cpu).booted_once)
> > > - cpu_smt_available = true;
> > > -
> > > - if (cpu_smt_control == CPU_SMT_ENABLED)
> > > + if (topology_is_primary_thread(cpu))
> > >   return true;  
> > why did you swap cpu_smt_control and topology_is_primary_thread checks?  
> 
> That's just the revert of bc2d8d262cba5.
OK, waiting for the formal patch.
Please CC k...@vger.kernel.org so Paolo has a chance to review it
(and pick it up if Thomas is OK with the approach).



Re: cpu/hotplug: broken sibling thread hotplug

2019-01-28 Thread Igor Mammedov
On Fri, 25 Jan 2019 11:02:03 -0600
Josh Poimboeuf  wrote:

> On Fri, Jan 25, 2019 at 10:36:57AM -0600, Josh Poimboeuf wrote:
> > How about this patch?  It's just a revert of 73d5e2b47264 and
> > bc2d8d262cba, plus the 1-line vmx_vm_init() change.  If it looks ok to
> > you, I can clean it up and submit an official version.  
> 
> This one actually compiles...

Looks good to me,
(one small question below)

 
[...]
>  static inline bool cpu_smt_allowed(unsigned int cpu)
>  {
> - if (topology_is_primary_thread(cpu))
> + if (cpu_smt_control == CPU_SMT_ENABLED)
>   return true;
>  
> - /*
> -  * If the CPU is not a 'primary' thread and the booted_once bit is
> -  * set then the processor has SMT support. Store this information
> -  * for the late check of SMT support in cpu_smt_check_topology().
> -  */
> - if (per_cpu(cpuhp_state, cpu).booted_once)
> - cpu_smt_available = true;
> -
> - if (cpu_smt_control == CPU_SMT_ENABLED)
> + if (topology_is_primary_thread(cpu))
>   return true;
why did you swap cpu_smt_control and topology_is_primary_thread checks?

>  
>   /*



cpu/hotplug: broken sibling thread hotplug

2019-01-24 Thread Igor Mammedov
In case the guest is booted with one CPU present and a sibling CPU
is hotplugged later [1], the sibling stays offline since SMT
is disabled.

This bisects to
 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
which used __max_smt_threads to decide whether to disable SMT, and in
case [1] only the primary CPU thread is present at boot, hence SMT
is disabled.

Later, bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation")
rewrote the code path, but the evaluation criteria still depend on
a sibling thread being present at boot time, so the problem persists.

1) QEMU -smp 1,sockets=2,cores=1,threads=2 -monitor stdio ...
   # hotplug sibling thread
   (qemu) device_add qemu64-x86_64-cpu,socket-id=0,core-id=0,thread-id=1

I've failed to find the reasoning behind this statement:
  "
  cpu/hotplug: detect SMT disabled by BIOS

If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
  "

The questions are:
 1: why wasn't cpu_smt_check_topology_early() at check_bugs()
    sufficient to detect SMT disabled in BIOS, and
 2: why was the side effect of siblings being present at boot used
    to keep SMT enabled?

The following quick hack fixes the sibling issue, but it effectively
means reverting both commits mentioned above, so we are back to the
original issue "If SMT is disabled in BIOS, ..." whose roots I wasn't
able to locate. (See also the note after the hack below.)

---
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 91d5c38..44df8cd 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -415,7 +415,7 @@ void __init cpu_smt_check_topology_early(void)
  */
 void __init cpu_smt_check_topology(void)
 {
-   if (!cpu_smt_available)
+   if (!topology_smt_supported())
cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
 }
---
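For context, the check used by the hack boils down to whether CPUID reports
more than one sibling, independent of whether a sibling thread was actually
present at boot. This is an approximate x86 definition quoted from memory, so
the exact form may differ between kernel versions:

/* Approximate x86 definition, from memory -- may differ by kernel version.
 * SMT is considered supported when CPUID reports more than one sibling,
 * regardless of whether a sibling thread was online (or even present)
 * at boot, which is exactly what the hotplugged-sibling case needs. */
bool topology_smt_supported(void)
{
	return smp_num_siblings > 1;
}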


Re: [Qemu-devel] [PATCH v11 2/6] ACPI: Add APEI GHES Table Generation support

2017-09-01 Thread Igor Mammedov
On Fri, 1 Sep 2017 17:58:55 +0800
gengdongjiu  wrote:

> Hi Igor,
> 
> On 2017/8/29 18:20, Igor Mammedov wrote:
> > On Fri, 18 Aug 2017 22:23:43 +0800
> > Dongjiu Geng  wrote:
[...]
> >   
> >> +void ghes_build_acpi(GArray *table_data, GArray *hardware_error,
> >> +BIOSLinker *linker)
> >> +{
> >> +uint32_t ghes_start = table_data->len;
> >> +uint32_t address_size, error_status_address_offset;
> >> +uint32_t read_ack_register_offset, i;
> >> +
> >> +address_size = sizeof(struct AcpiGenericAddress) -
> >> +offsetof(struct AcpiGenericAddress, address);  
> > it's confusing name for var,
> > AcpiGenericAddress::address is fixed unsigned 64 bit integer per spec
> > also, I'm not sure why it's needed at all.  
>  it is because other people have concern about where does the "unsigned 64 
> bit integer"
>  come from, they are confused about the "unsigned 64 bit integer"
>  so they suggested use sizeof. anyway I will directly use unsigned 64 bit 
> integer.
Maybe a properly named macro instead of sizeof(foo) would do the job.
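A minimal sketch of that suggestion (the macro name here is made up for
illustration; the value reflects that the AcpiGenericAddress "address" member
is a fixed 64-bit field per the ACPI spec):

/* Hypothetical sketch -- the macro name is made up for illustration.
 * The AcpiGenericAddress "address" member is a fixed 64-bit field per the
 * ACPI spec, so a named constant is clearer than deriving the size with
 * sizeof()/offsetof(). */
#define GAS_ADDRESS_FIELD_SIZE 8 /* bytes */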


[...]
> >> +
> >> +/* Build error status address*/
> >> +build_address(table_data, linker, error_status_address_offset + i 
> >> *
> >> +sizeof(AcpiGenericHardwareErrorSourceV2), i * address_size,
> >> +AML_SYSTEM_MEMORY, 0x40, 0, 4 /* QWord access */);  
> > just do something like this instead of build_address():
> > build_append_gas()
> > bios_linker_loader_add_pointer()  
> Thanks for the suggestion.
> 
> > 
> > also register width 0x40 looks suspicious, where does it come from?
> > While at it do you have a real hardware which has HEST table that you re 
> > trying to model after?
> > I'd like to see HEST and other related tables from it.  
> 
> Igor, what is your suspicious point? The register width 0x40 come from our 
> host BIOS record to the System Memory space.

maybe s/0x40/ERROR_STATUS_BLOCK_POINTER_SIZE/

[...]


Re: [Qemu-devel] [PATCH v11 2/6] ACPI: Add APEI GHES Table Generation support

2017-08-29 Thread Igor Mammedov
On Fri, 18 Aug 2017 22:23:43 +0800
Dongjiu Geng  wrote:

> This implements APEI GHES Table by passing the error CPER info
> to the guest via a fw_cfg_blob. After a CPER info is recorded, an
> SEA(Synchronous External Abort)/SEI(SError Interrupt) exception
> will be injected into the guest OS.

It's a fairly complex patch/functionality, so I've mostly just skimmed it and
commented only on the structure of the patch and on the changes I'd like to
see so that it becomes more structured and reviewable.

I'd suggest adding a doc patch first which describes how this is
supposed to work between QEMU/firmware/guest OS, with the expected
flows.
 
> Below is the table layout; the max number of error sources is 11,
> classified by notification type.
> 
> [ASCII-art layout diagram, mangled by line wrapping in the archive. It shows
>  the HEST table in etc/acpi/tables with entries GHES0..GHES10, whose
>  error_status_address and read_ack_register fields point into the
>  etc/hardware_errors blob; that blob holds the address registers
>  (status_address0..status_address10), the ack values (ack_value0..ack_value10)
>  and the Error Status Data Blocks 0..10 containing the CPER records.]
These diagrams show the relations between the tables, which is not necessarily
bad, but as a layout description they are useless.
 * There is probably not much point in including the HEST table here; it's
   described well enough in the spec, so you might just put a reference here.
 * These diagrams should go into the doc/spec patch.
 * When you describe a layout you need to show what is located at which offsets,
   in which order, and in which blob/file. See the ACPI spec for an example
   and/or docs/specs/acpi_nvdimm.txt and docs/specs/acpi_mem_hotplug.txt for
   inspiration.

> For GHESv2 error source, the OSPM must acknowledges the error via Read Ack 
> register.
> so user space must check the ack value to avoid read-write race condition.
> 
> Signed-off-by: Dongjiu Geng 
> ---
>  hw/acpi/aml-build.c |   2 +
>  hw/acpi/hest_ghes.c | 345 
> 
>  hw/arm/virt-acpi-build.c|   6 +
>  include/hw/acpi/aml-build.h |   1 +
>  include/hw/acpi/hest_ghes.h |  47 ++
>  5 files changed, 401 insertions(+)
>  create mode 100644 hw/acpi/hest_ghes.c
>  create mode 100644 include/hw/acpi/hest_ghes.h
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 36a6cc4..6849e5f 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> 

Re: [RFC PATCH 0/5] mm, memory_hotplug: allocate memmap from hotadded memory

2017-08-01 Thread Igor Mammedov
On Mon, 31 Jul 2017 19:58:30 +0200
Gerald Schaefer  wrote:

> On Mon, 31 Jul 2017 17:53:50 +0200
> Michal Hocko  wrote:
> 
> > On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:  
> > > On Mon, 31 Jul 2017 14:53:19 +0200
> > > Michal Hocko  wrote:
> > >   
> > > > On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:  
> > > > > On Fri, 28 Jul 2017 14:19:41 +0200
> > > > > Michal Hocko  wrote:
> > > > >   
> > > > > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:  
> > > > > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > > > > [...]  
> > > > > > > > This does not seems to be an opt-in change ie if i am reading 
> > > > > > > > patch 3
> > > > > > > > correctly if an altmap is not provided to __add_pages() you 
> > > > > > > > fallback
> > > > > > > > to allocating from begining of zone. This will not work with 
> > > > > > > > HMM ie
> > > > > > > > device private memory. So at very least i would like to see 
> > > > > > > > some way
> > > > > > > > to opt-out of this. Maybe a new argument like bool 
> > > > > > > > forbid_altmap ?  
> > > > > > > 
> > > > > > > OK, I see! I will think about how to make a sane api for that.  
> > > > > > 
> > > > > > This is what I came up with. s390 guys mentioned that I cannot 
> > > > > > simply
> > > > > > use the new range at this stage yet. This will need probably some 
> > > > > > other
> > > > > > changes but I guess we want an opt-in approach with an arch veto in 
> > > > > > general.
> > > > > > 
> > > > > > So what do you think about the following? Only x86 is update now 
> > > > > > and I
> > > > > > will split it into two parts but the idea should be clear at least. 
> > > > > >  
> > > > > 
> > > > > This looks good, and the kernel will also boot again on s390 when 
> > > > > applied
> > > > > on top of the other 5 patches (plus adding the s390 part here).  
> > > > 
> > > > Thanks for testing Gerald! I am still undecided whether the arch code
> > > > should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> > > > it when it is supported. My last post did the later but the first one
> > > > sounds like a more clear API to me. I will keep thinking about it.
> > > > 
> > > > Anyway, did you have any chance to consider mapping the new physical
> > > > memory range inside arch_add_memory rather than during online on s390?  
> > > 
> > > Well, it still looks like we cannot do w/o splitting up add_memory():
> > > 1) (only) set up section map during our initial memory detection, w/o
> > > allocating struct pages, so that the sysfs entries get created also for
> > > our offline memory (or else we have no way to online it later)
> > > 2) set up vmemmap and allocate struct pages with your new altmap approach
> > > during our MEM_GOING_ONLINE callback, because only now the memory is 
> > > really
> > > accessible  
> > 
> > As I've tried to mentioned in my other response. This is not possible
> > because there are memory hotplug usecases which never do an explicit
> > online.  
> 
> Of course the default behaviour should not change, we only need an option
> to do the "2-stage-approach". E.g. we would call __add_pages() from our
> MEM_GOING_ONLINE handler, and not from arch_add_memory() as before, but
> then we would need some way to add memory sections (for generating sysfs
> memory blocks) only, without allocating struct pages. See also below.
> 
> > 
> > I am sorry to ask again. But why exactly cannot we make the range
> > accessible from arch_add_memory on s390?  
> 
> We have no acpi or other events to indicate new memory, both online and
> offline memory needs to be (hypervisor) defined upfront, and then we want
> to be able to use memory hotplug for ballooning during runtime.
> 
> Making the range accessible is equivalent to a hypervisor call that assigns
> the memory to the guest. The problem with arch_add_memory() is now that
> this gets called from add_memory(), which we call during initial memory
> detection for the offline memory ranges. At that time, assigning all
> offline memory to the guest, and thus making it accessible, would break
> the ballooning usecase (even if it is still offline in Linux, the
> hypervisor could not use it for other guests any more).
> 
> The main difference to other architectures is that we can not simply
> call add_memory() (and thus arch_add_memory()) at the time when the
> offline memory is actually supposed to get online (e.g. triggered by acpi).
> We rather need to somehow make sure that the offline memory is detected
> early, and sysfs entries are created for it, so that it can be set online
> later on demand.
> 
> Maybe our design to use add_memory() for offline ranges during memory
> detection was wrong, or overkill, since we actually only need to establish
> a memory section, if I understood the sysfs code right. But I currently
> see no other way to make sure that we get the sysfs attributes. And of
> course the presence of users that work on offline struct pages, like
> valid_zones, is also not help

Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE

2017-05-24 Thread Igor Mammedov
On Wed, 24 May 2017 14:24:11 +0200
Michal Hocko  wrote:

[...]
> index facc20a3f962..ec7d6ae01c96 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2246,8 +2246,11 @@
[...]
> + movable. This means that the memory of such nodes
> + will be usable only for movable allocations which
> + rules out almost all kernel allocations. Use with
> + caution!
Maybe a dumb question, but is it really true that the kernel won't ever
do kernel allocations from the movable zone?

Looking at kmalloc (slab), we can get here:

get_page_from_freelist() ->
rmqueue() ->
__rmqueue() ->
__rmqueue_fallback() ->
find_suitable_fallback()

and it might return a movable fallback, so a page could be stolen from there.


Re: [PATCH -v2 0/9] mm: make movable onlining suck less

2017-04-11 Thread Igor Mammedov
On Tue, 11 Apr 2017 11:23:07 +0200
Michal Hocko  wrote:

> On Tue 11-04-17 08:38:34, Igor Mammedov wrote:
> > for issue2:
> > -enable-kvm -m 2G,slots=4,maxmem=4G -smp 4 -numa node -numa node \
> > -drive if=virtio,file=disk.img -kernel bzImage -append 'root=/dev/vda1' \
> > -object memory-backend-ram,id=mem1,size=256M -object 
> > memory-backend-ram,id=mem0,size=256M \
> > -device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device 
> > pc-dimm,id=dimm0,memdev=mem0,slot=0,node=1  
> 
> I must be doing something wrong here...
> qemu-system-x86_64 -enable-kvm -monitor telnet:127.0.0.1:,server,nowait 
> -net nic -net user,hostfwd=tcp:127.0.0.1:-:22 -serial 
> file:test.qcow_serial.log -enable-kvm -m 2G,slots=4,maxmem=4G -smp 4 -numa 
> node -numa node -object memory-backend-ram,id=mem1,size=256M -object 
> memory-backend-ram,id=mem0,size=256M -device 
> pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device 
> pc-dimm,id=dimm0,memdev=mem0,slot=0,node=1 -drive 
> file=test.qcow,if=ide,index=0
> 
> for i in $(seq 0 3)
> do
>   sh probe_memblock.sh $i
> done
The dimm-to-node mapping comes from the ACPI subsystem (the _PXM object in the
memory device), which adds memory blocks automatically on hotplug.

You probably don't have the ACPI_HOTPLUG_MEMORY config option enabled.

> 
> # ls -l /sys/devices/system/memory/memory3?/node*
> lrwxrwxrwx 1 root root 0 Apr 11 11:21 
> /sys/devices/system/memory/memory32/node0 -> ../../node/node0
> lrwxrwxrwx 1 root root 0 Apr 11 11:21 
> /sys/devices/system/memory/memory33/node0 -> ../../node/node0
> lrwxrwxrwx 1 root root 0 Apr 11 11:21 
> /sys/devices/system/memory/memory34/node0 -> ../../node/node0
> lrwxrwxrwx 1 root root 0 Apr 11 11:21 
> /sys/devices/system/memory/memory35/node0 -> ../../node/node0
> 
> all of them end in the same node0
> 
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal Movable
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
> /sys/devices/system/memory/memory35/valid_zones:Normal Movable
> 



Re: [PATCH -v2 0/9] mm: make movable onlining suck less

2017-04-11 Thread Igor Mammedov
On Tue, 11 Apr 2017 10:41:42 +0200
Michal Hocko  wrote:

> On Tue 11-04-17 10:01:52, Igor Mammedov wrote:
> > On Mon, 10 Apr 2017 16:56:39 +0200
> > Michal Hocko  wrote:  
> [...]
> > > > #echo online_kernel > memory32/state
> > > > write error: Invalid argument
> > > > // that's not what's expected
> > > 
> > > this is proper behavior with the current implementation. Does anything
> > > depend on the zone reusing?  
> > if we didn't have zone imbalance issue in design,
> > the it wouldn't matter but as it stands it's not
> > minore issue.
> > 
> > Consider following,
> > one hotplugs some memory and onlines it as movable,
> > then one needs to hotplug some more but to do so 
> > one one needs more memory from zone NORMAL and to keep
> > zone balance some memory in MOVABLE should be reonlined
> > as NORMAL  
> 
> Is this something that we absolutely have to have right _now_? Or are you
> OK if I address this in follow up series? Because it will make the
> current code slightly more complex and to be honest I would rather like
> to see this "core" merge and build more on top.
It's fine by me to do it on top.


Re: [PATCH -v2 0/9] mm: make movable onlining suck less

2017-04-11 Thread Igor Mammedov
On Mon, 10 Apr 2017 16:56:39 +0200
Michal Hocko  wrote:

> On Mon 10-04-17 16:27:49, Igor Mammedov wrote:
> [...]
> > Hi Michal,
> > 
> > I've given series some dumb testing, see below for unexpected changes I've 
> > noticed.
> > 
> > Using the same CLI as above plus hotpluggable dimms present at startup
> > (it still uses hotplug path as dimms aren't reported in e820)
> > 
> > -object memory-backend-ram,id=mem1,size=256M -object 
> > memory-backend-ram,id=mem0,size=256M \
> > -device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device 
> > pc-dimm,id=dimm0,memdev=mem0,slot=0,node=0
> > 
> > so dimm1 => memory3[23] and dimm0 => memory3[45]
> > 
> > #issue1:
> > unable to online memblock as NORMAL adjacent to onlined MOVABLE
> > 
> > 1: after boot
> > memory32:offline removable: 0  zones: Normal Movable
> > memory33:offline removable: 0  zones: Normal Movable
> > memory34:offline removable: 0  zones: Normal Movable
> > memory35:offline removable: 0  zones: Normal Movable
> > 
> > 2: online as movable 1st dimm
> > 
> > #echo online_movable > memory32/state
> > #echo online_movable > memory33/state
> > 
> > everything is as expected:
> > memory32:online removable: 1  zones: Movable
> > memory33:online removable: 1  zones: Movable
> > memory34:offline removable: 0  zones: Movable
> > memory35:offline removable: 0  zones: Movable
> > 
> > 3: try to offline memory32 and online as NORMAL
> > 
> > #echo offline > memory32/state
> > memory32:offline removable: 1  zones: Normal Movable
> > memory33:online removable: 1  zones: Movable
> > memory34:offline removable: 0  zones: Movable
> > memory35:offline removable: 0  zones: Movable  
> 
> OK, this is not expected. We are not shifting zones anymore so the range
> which was online_movable will not become available to the zone Normal.
> So this must be something broken down the show_valid_zones path. I will
> investigate.
> 
> > 
> > #echo online_kernel > memory32/state
> > write error: Invalid argument
> > // that's not what's expected  
> 
> this is proper behavior with the current implementation. Does anything
> depend on the zone reusing?
If we didn't have the zone imbalance issue in the design, then it
wouldn't matter, but as it stands it's not a minor issue.

Consider the following:
one hotplugs some memory and onlines it as movable,
then one needs to hotplug some more, but to do so
one needs more memory in zone NORMAL, and to keep the
zone balance some memory in MOVABLE should be re-onlined
as NORMAL.




Re: [PATCH -v2 0/9] mm: make movable onlining suck less

2017-04-10 Thread Igor Mammedov
On Mon, 10 Apr 2017 18:09:41 +0200
Michal Hocko  wrote:

> On Mon 10-04-17 16:27:49, Igor Mammedov wrote:
> [...]
> > -object memory-backend-ram,id=mem1,size=256M -object 
> > memory-backend-ram,id=mem0,size=256M \
> > -device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device 
> > pc-dimm,id=dimm0,memdev=mem0,slot=0,node=0  
> 
> are you sure both of them should be node=0?
> 
> What is the full comman line you use?
CLI for issues 1 and 3:
-enable-kvm -m 2G,slots=4,maxmem=4G -smp 4 -numa node -numa node \
-drive if=virtio,file=disk.img -kernel bzImage -append 'root=/dev/vda1' \
-object memory-backend-ram,id=mem1,size=256M -object memory-backend-ram,id=mem0,size=256M \
-device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device pc-dimm,id=dimm0,memdev=mem0,slot=0,node=0

for issue2:
-enable-kvm -m 2G,slots=4,maxmem=4G -smp 4 -numa node -numa node \
-drive if=virtio,file=disk.img -kernel bzImage -append 'root=/dev/vda1' \
-object memory-backend-ram,id=mem1,size=256M -object memory-backend-ram,id=mem0,size=256M \
-device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device pc-dimm,id=dimm0,memdev=mem0,slot=0,node=1


Re: [PATCH -v2 0/9] mm: make movable onlining suck less

2017-04-10 Thread Igor Mammedov
On Mon, 10 Apr 2017 13:03:42 +0200
Michal Hocko  wrote:

> Hi,
> The last version of this series has been posted here [1]. It has seen
> some more serious testing (thanks to Reza Arbab) and fixes for the found
> issues. I have also decided to drop patch 1 [2] because it turned out to
> be more complicated than I initially thought [3]. Few more patches were
> added to deal with expectation on zone/node initialization.
> 
> I have rebased on top of the current mmotm-2017-04-07-15-53. It
> conflicts with HMM because it touches memory hotplug as
> well. We have discussed [4] with Jérôme and he agreed to
> rebase on top of this rework [5] so I have reverted his series
> before applyig mine. I will help him to resolve the resulting
> conflicts. You can find the whole series including the HMM revers in
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git branch
> attempts/rewrite-mem_hotplug
> 
> Motivation:
> Movable onlining is a real hack with many downsides - mainly
> reintroduction of lowmem/highmem issues we used to have on 32b systems -
> but it is the only way to make the memory hotremove more reliable which
> is something that people are asking for.
> 
> The current semantic of memory movable onlinening is really cumbersome,
> however. The main reason for this is that the udev driven approach is
> basically unusable because udev races with the memory probing while only
> the last memory block or the one adjacent to the existing zone_movable
> are allowed to be onlined movable. In short the criterion for the
> successful online_movable changes under udev's feet. A reliable udev
> approach would require a 2 phase approach where the first successful
> movable online would have to check all the previous blocks and online
> them in descending order. This is hard to be considered sane.
> 
> This patchset aims at making the onlining semantic more usable. First of
> all it allows to online memory movable as long as it doesn't clash with
> the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE
> cannot overlap. Currently I preserve the original ordering semantic so
> the zone always precedes the movable zone but I have plans to remove this
> restriction in future because it is not really necessary.
> 
> First 3 patches are cleanups which should be ready to be merged right
> away (unless I have missed something subtle of course).
> 
> Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
> 
> Patch 5 deals with implicit assumptions of register_one_node on pgdat
> initialization.
> 
> Patch 6 is the core of the change. In order to make it easier to review
> I have tried it to be as minimalistic as possible and the large code
> removal is moved to patch 9.
> 
> Patch 7 is a trivial follow up cleanup. Patch 8 fixes sparse warnings
> and finally patch 9 removes the unused code.
> 
> I have tested the patches in kvm:
> # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa 
> node,mem=1G -numa node,mem=1G ...
> 
> and then probed the additional memory by
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

Hi Michal,

I've given the series some dumb testing; see below for the unexpected changes I've noticed.

Using the same CLI as above, plus hotpluggable dimms present at startup
(they still use the hotplug path as the dimms aren't reported in e820):

-object memory-backend-ram,id=mem1,size=256M -object memory-backend-ram,id=mem0,size=256M \
-device pc-dimm,id=dimm1,memdev=mem1,slot=1,node=0 -device pc-dimm,id=dimm0,memdev=mem0,slot=0,node=0

so dimm1 => memory3[23] and dimm0 => memory3[45]

#issue1:
unable to online memblock as NORMAL adjacent to onlined MOVABLE

1: after boot
memory32:offline removable: 0  zones: Normal Movable
memory33:offline removable: 0  zones: Normal Movable
memory34:offline removable: 0  zones: Normal Movable
memory35:offline removable: 0  zones: Normal Movable

2: online as movable 1st dimm

#echo online_movable > memory32/state
#echo online_movable > memory33/state

everything is as expected:
memory32:online removable: 1  zones: Movable
memory33:online removable: 1  zones: Movable
memory34:offline removable: 0  zones: Movable
memory35:offline removable: 0  zones: Movable

3: try to offline memory32 and online as NORMAL

#echo offline > memory32/state
memory32:offline removable: 1  zones: Normal Movable
memory33:online removable: 1  zones: Movable
memory34:offline removable: 0  zones: Movable
memory35:offline removable: 0  zones: Movable

#echo online_kernel > memory32/state
write error: Invalid argument
// that's not what's expected

memory32:offline removable: 1  zones: Normal Movable
memory33:online removable: 1  zones: Movable
memory34:offline removable: 0  zones: Movable
memory35:offline removable: 0  zones: Movable


==
#issue2: dimm1 assigned to node 1 on qemu CLI
memblock is onlined as movable by default

// after boot
memory32:offline removable: 1  zones: Normal
memory33:offline remo

Re: [PATCH v1 1/6] mm: get rid of zone_is_initialized

2017-04-05 Thread Igor Mammedov
On Wed, 5 Apr 2017 10:14:00 +0200
Michal Hocko  wrote:

> On Fri 31-03-17 09:39:54, Michal Hocko wrote:
> > Fixed screw ups during the initial patch split up as per Hillf
> > ---
> > From 8be6c5e47de66210e47710c80e72e8abd899017b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Wed, 29 Mar 2017 15:11:30 +0200
> > Subject: [PATCH] mm: get rid of zone_is_initialized
> > 
> > There shouldn't be any reason to add initialized when we can tell the
> > same thing from checking whether there are any pages spanned to the
> > zone. Remove zone_is_initialized() and replace it by zone_is_empty
> > which can be used for the same set of tests.
> > 
> > This shouldn't have any visible effect  
> 
> I've decided to drop this patch. My main motivation was to simplify
> the hotplug workflow/ The situation is more hairy than I expected,
> though. On one hand all zones should be initialized early during the
> hotplug in add_memory_resource but direct users of arch_add_memory will
> need this to be called I suspect. Let's just keep the current status quo
> and clean up it later. It is not really needed for this series.
Would you post v2 with all fixups you've done so far?


Re: [PATCH 0/6] mm: make movable onlining suck less

2017-04-03 Thread Igor Mammedov
On Mon, 3 Apr 2017 13:55:46 +0200
Michal Hocko  wrote:

> On Thu 30-03-17 13:54:48, Michal Hocko wrote:
> [...]
> > Any thoughts, complains, suggestions?  
> 
> Anyting? I would really appreciate a feedback from IBM and Futjitsu guys
> who have shaped this code last few years. Also Igor and Vitaly seem to
> be using memory hotplug in virtualized environments. I do not expect
> they would see a huge advantage of the rework but I would appreciate
> to give it some testing to catch any potential regressions.
I really appreciate this rework as it simplifies the code a bit and would
potentially allow me/Vitaly to make auto-online work with the movable zone as well.

I'll try to test the series within this week.

> 
> I plan to repost the series and would like to prevent from pointless
> submission if there are any obvious issues.
> 
> Thanks!



Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-17 Thread Igor Mammedov
On Thu, 16 Mar 2017 20:01:25 +0100
Andrea Arcangeli  wrote:

[...]
> If we can make zone overlap work with a 100% overlap across the whole
> node that would be a fine alternative, the zoneinfo.py output will
> look weird, but if that's the only downside it's no big deal. With
> sticky movable pageblocks it'll all be ZONE_NORMAL, with overlap it'll
> all be both ZONE_NORMAL and ZONE_MOVABLE at the same time.
It looks like I'm not getting the idea behind zone overlap, so:

We potentially have a flag saying that a hotplugged block is removable,
so on hotplug we could register such blocks with zone MOVABLE by default;
however, here the zone balance issue comes in, so we can't do that until
it is solved.

As Vitaly suggested, we could steal (convert) existing blocks from
the border of the MOVABLE zone into zone NORMAL when there isn't enough
memory in zone NORMAL to accommodate the page table extension for a
newly arrived memory block. That would make the memory module
containing the stolen block non-removable, but that may be an acceptable
sacrifice to keep the system alive. On an attempt to remove it, the
kernel could potentially even inform the hardware (hypervisor) that the
memory module became non-removable, using the _OST ACPI method.


> Thanks,
> Andrea



Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-14 Thread Igor Mammedov
On Mon, 13 Mar 2017 13:28:25 +0100
Michal Hocko  wrote:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > On Thu, 9 Mar 2017 13:54:00 +0100
> > Michal Hocko  wrote:
> > 
> > [...]  
> > > > It's major regression if you remove auto online in kernels that
> > > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > > while breaking useful functionality doesn't make sense.
> > > > 
> > > > I would ACK config option removal if auto online keeps working
> > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > > and keep kernel CLI option to override default.
> > > > 
> > > > That doesn't mean that others will agree with flipping default,
> > > > that's why config option has been added.
> > > > 
> > > > Now to sum up what's been discussed on this thread, there were 2
> > > > different issues discussed:
> > > >   1) memory hotplug: remove in kernel auto online for all
> > > >  except of hyperv/xen
> > > > 
> > > >- suggested RFC is not acceptable from virt point of view
> > > >  as it regresses guests on top of x86 kvm/vmware which
> > > >  both use ACPI based memory hotplug.
> > > > 
> > > >- udev/userspace solution doesn't work in practice as it's
> > > >  too slow and unreliable when system is under load which
> > > >  is quite common in virt usecase. That's why auto online
> > > >  has been introduced in the first place.
> > > 
> > > Please try to be more specific why "too slow" is a problem. Also how
> > > much slower are we talking about?  
> >
> > In virt case on host with lots VMs, userspace handler
> > processing could be scheduled late enough to trigger a race
> > between (guest memory going away/OOM handler) and memory
> > coming online.  
> 
> Either you are mixing two things together or this doesn't really make
> much sense. So is this a balloning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.
It doesn't have to be a user-initiated action.
Think of pay-as-you-go phone plans: the same use case applies to cloud
environments, where what is typically a user-initiated hotplug action
on baremetal can easily be automated to hotplug on demand.


> [...]
> > > >- currently if one wants to use online_movable,
> > > >  one has to either
> > > >* disable auto online in kernel OR
> > > 
> > > which might not just work because an unmovable allocation could have
> > > made the memblock pinned.  
> >
> > With memhp_default_state=offline on kernel CLI there won't be any
> > unmovable allocation as hotplugged memory won't be onlined and
> > user can online it manually. So it works for non default usecase
> > of playing with memory hot-unplug.  
> 
> I was talking about the case when the auto_online was true, of course,
> e.g. depending on the config option which you've said is enabled in
> Fedora kernels.
>
> [...] 
> > > >  I'm in favor of implementing that in kernel as it keeps
> > > >  kernel internals inside kernel and doesn't need
> > > >  kernel API to be involved (memory blocks in sysfs,
> > > >  online_kernel, online_movable)
> > > >  There would be no need in userspace which would have to
> > > >  deal with kernel zoo and maintain that as well.
> > > 
> > > The kernel is supposed to provide a proper API and that is sysfs
> > > currently. I am not entirely happy about it either but pulling a lot of
> > > code into the kernel is not the rigth thing to do. Especially when
> > > different usecases require different treatment.  
> >
> > If it could be done from kernel side alone, it looks like a better way
> > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > possible to implement without adding a lot of kernel code.  
> 
> But this is not how we do the kernel development. We provide the API so
> that userspace can implement the appropriate policy on top. We do not
> add random knobs to implement the same thing in

Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Mon, 13 Mar 2017 11:43:02 +0100
Michal Hocko  wrote:

> On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > On Fri, 10 Mar 2017 14:58:07 +0100  
> [...]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > > hotplug
> > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > > [0.00] Zone ranges:
> > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > > [0.00]   Normal   empty
> > > [0.00] Movable zone start for each node
> > > [0.00] Early memory node ranges
> > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > [0.00]   node   0: [mem 0x0010-0x3fff]
> > > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > > 
> > > so there is neither any normal zone nor movable one at the boot time.  
> > it could be if hotpluggable memory were present at boot time in E820 table
> > (if I remember right when running on hyperv there is movable zone at boot 
> > time),
> > 
> > but in qemu hotpluggable memory isn't put into E820,
> > so zone is allocated later when memory is enumerated
> > by ACPI subsystem and onlined.
> > It causes less issues wrt movable zone and works for
> > different versions of linux/windows as well.
> > 
> > That's where in kernel auto-onlining could be also useful,
> > since user would be able to start-up with small
> > non removable memory plus several removable DIMMs
> > and have all the memory onlined/available by the time
> > initrd is loaded. (missing piece here is onlining
> > removable memory as movable by default).  
> 
> Why we should even care to online that memory that early rather than
> making it available via e820?

It's not forbidden by the spec and has fewer complications
when it comes to removable memory. Declaring it in E820
would add the following limitations/drawbacks:
 - firmware would have to be able to exclude removable memory
   from its own usage (currently neither SeaBIOS nor EFI has to
   know/care about it) => less qemu-guest ABI to maintain.
 - the OS would have to be taught to avoid/move (early) nonmovable
   allocations out of removable address ranges.
   There were patches targeting that in recent kernels,
   but it won't work with older kernels that don't have them,
   which limits the range of OSes that could run on QEMU
   and do memory removal.

The E820-less approach works reasonably well with a wide range
of guest OSes and is less complex than if removable memory
were present in E820. Hence I don't have a compelling
reason to introduce removable memory in E820, as it
only adds to hot(un)plug issues.

I have an off-tree QEMU hack that puts hot-removable
memory added with "-device pc-dimm" on the CLI into E820,
to experiment with. It could be useful for playing
with zone layouts at boot time, so if you are
interested I can rebase it on top of current master
and post it here to play with.


Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-13 Thread Igor Mammedov
On Thu, 9 Mar 2017 13:54:00 +0100
Michal Hocko  wrote:

[...]
> > It's major regression if you remove auto online in kernels that
> > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > while breaking useful functionality doesn't make sense.
> > 
> > I would ACK config option removal if auto online keeps working
> > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > and keep kernel CLI option to override default.
> > 
> > That doesn't mean that others will agree with flipping default,
> > that's why config option has been added.
> > 
> > Now to sum up what's been discussed on this thread, there were 2
> > different issues discussed:
> >   1) memory hotplug: remove in kernel auto online for all
> >  except of hyperv/xen
> > 
> >- suggested RFC is not acceptable from virt point of view
> >  as it regresses guests on top of x86 kvm/vmware which
> >  both use ACPI based memory hotplug.
> > 
> >- udev/userspace solution doesn't work in practice as it's
> >  too slow and unreliable when system is under load which
> >  is quite common in virt usecase. That's why auto online
> >  has been introduced in the first place.  
> 
> Please try to be more specific why "too slow" is a problem. Also how
> much slower are we talking about?
In the virt case, on a host with lots of VMs, userspace handler
processing could be scheduled late enough to trigger a race
between (guest memory going away / OOM handler) and memory
coming online.

>  
> >   2) memory unplug: online memory as movable
> > 
> >- doesn't work currently with udev rule due to kernel
> >  issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> These should be fixed
>  
> >- could be fixed both for in kernel auto online and udev
> >  with following patch:
> >  https://bugzilla.redhat.com/attachment.cgi?id=1146332
> >  but fixing it this way exposes zone disbalance issues,
> >  which are not present in current kernel as blocks are
> >  onlined in Zone Normal. So this is area to work and
> >  improve on.
> > 
> >- currently if one wants to use online_movable,
> >  one has to either
> >* disable auto online in kernel OR  
> 
> which might not just work because an unmovable allocation could have
> made the memblock pinned.
With memhp_default_state=offline on the kernel CLI there won't be any
unmovable allocations, as hotplugged memory won't be onlined and the
user can online it manually. So it works for the non-default use case
of playing with memory hot-unplug.
 
> >* remove udev rule that distro ships
> >  AND write custom daemon that will be able to online
> >  block in right zone/order. So currently whole
> >  online_movable thing isn't working by default
> >  regardless of who onlines memory.  
> 
> my epxperience with onlining full nodes as movable shows this works just
> fine (with all the limitations of the movable zones but that is a
> separate thing). I haven't played with configurations where movable
> zones are sharing the node with other zones.
I don't have access to such a baremetal configuration to play
with anymore.


> >  I'm in favor of implementing that in kernel as it keeps
> >  kernel internals inside kernel and doesn't need
> >  kernel API to be involved (memory blocks in sysfs,
> >  online_kernel, online_movable)
> >  There would be no need in userspace which would have to
> >  deal with kernel zoo and maintain that as well.  
> 
> The kernel is supposed to provide a proper API and that is sysfs
> currently. I am not entirely happy about it either but pulling a lot of
> code into the kernel is not the rigth thing to do. Especially when
> different usecases require different treatment.
If it could be done from the kernel side alone, it looks like a better way
to me not to involve userspace at all. And for ACPI-based x86/ARM it's
possible to implement without adding a lot of kernel code.
That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
so we could continue improving kernel-only auto-onlining
and fixing the current memory hot(un)plug issues without affecting
other platforms/users that are not interested in it.
(PS: I don't care much about the sysfs knob for setting auto-onlining,
as the kernel CLI override with memhp_default_state seems
sufficient to me)
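
For reference, the two controls I mean look like this in practice (a sketch;
see Documentation/kernel-parameters.txt and memory-hotplug.txt for the
authoritative spelling):

  # compile-time default can be overridden on the kernel command line:
  #   memhp_default_state=online     (or offline)
  # or flipped at runtime, which affects blocks added afterwards:
  echo online > /sys/devices/system/memory/auto_online_blocks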


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Fri, 10 Mar 2017 14:58:07 +0100
Michal Hocko  wrote:

> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
> 
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> > On Tue 07-03-17 13:40:04, Igor Mammedov wrote:  
> > > On Mon, 6 Mar 2017 15:54:17 +0100
> > > Michal Hocko  wrote:
> > >   
> > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:  
> > [...]  
> > > > > in current mainline kernel it triggers following code path:
> > > > > 
> > > > > online_pages()
> > > > >   ...
> > > > >if (online_type == MMOP_ONLINE_KERNEL) {   
> > > > >   
> > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > > > > &zone_shift))
> > > > > return -EINVAL;
> > > > 
> > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here  
> > > pretty much, reproducer is above so try and see for yourself  
> > 
> > I will play with this...  
> 
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G 
> which generated
'mem' here distributes boot memory specified by "-m 2G" and does not
include memory specified by -device pc-dimm.

> [...]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug
> [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> 0x0010-0x3fff] -> [mem 0x-0x3fff]
> [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x1000-0x00ff]
> [0.00]   DMA32[mem 0x0100-0x7ffd]
> [0.00]   Normal   empty
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x1000-0x0009efff]
> [0.00]   node   0: [mem 0x0010-0x3fff]
> [0.00]   node   1: [mem 0x4000-0x7ffd]
> 
> so there is neither any normal zone nor movable one at the boot time.
it could be, if hotpluggable memory were present at boot time in the E820 table
(if I remember right, when running on hyperv there is a movable zone at boot time),

but in qemu hotpluggable memory isn't put into E820,
so the zone is allocated later, when memory is enumerated
by the ACPI subsystem and onlined.
That causes fewer issues wrt the movable zone and works for
different versions of linux/windows as well.

That's where in-kernel auto-onlining could also be useful,
since a user would be able to start up with small
non-removable memory plus several removable DIMMs
and have all the memory onlined/available by the time
initrd is loaded. (The missing piece here is onlining
removable memory as movable by default.)
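
(The setup I have in mind is roughly the following, a sketch based on the
CLI used elsewhere in this thread; sizes and ids are arbitrary:

  qemu-system-x86_64 -enable-kvm \
    -m size=1G,slots=4,maxmem=8G \
    -object memory-backend-ram,id=m1,size=1G \
    -device pc-dimm,id=dimm1,memdev=m1 \
    /path/to/guest_image

i.e. a small non-removable boot RAM plus a removable pc-dimm that is present
from the start and gets onlined during boot when auto-online is enabled.)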


> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
You can also specify the node a pc-dimm goes to with the 'node' property,
if it should go to a node other than node 0.

device_add pc-dimm,id=dimm1,memdev=mem1,node=1


> unfortunatelly the memory didn't show up automatically and I got
> [  116.375781] acpi PNP0C80:00: Enumeration failure
it should work,
do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?
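
(a quick way to check, assuming the config is exposed in the usual places:

  grep ACPI_HOTPLUG_MEMORY /boot/config-$(uname -r)
  # or, if the kernel was built with CONFIG_IKCONFIG_PROC:
  zcat /proc/config.gz | grep ACPI_HOTPLUG_MEMORY
)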
 
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x1



Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-07 Thread Igor Mammedov
On Mon, 6 Mar 2017 15:54:17 +0100
Michal Hocko  wrote:

> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> > On Fri, 3 Mar 2017 09:27:23 +0100
> > Michal Hocko  wrote:
> >   
> > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:  
> > > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > > Michal Hocko  wrote:
> > > > 
> > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]

> > > > > memblocks. If that doesn't work I would call it a bug.
> > > > It's rather an implementation constrain than a bug
> > > > for details and workaround patch see
> > > >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
> > > 
> > > "You are not authorized to access bug #1314306"  
> > Sorry,
> > I've made it public, related comments and patch should be accessible now
> > (code snippets in BZ are based on older kernel but logic is still the same 
> > upstream)
> >
> > > could you paste the reasoning here please?  
> > sure here is reproducer:
> > start VM with CLI:
> >   qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
> >   -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 
> > \
> >   /path/to/guest_image
> > 
> > then in guest dimm1 blocks are from 32-39
> > 
> >   echo online_movable > /sys/devices/system/memory/memory32/state
> > -bash: echo: write error: Invalid argument
> > 
> > in current mainline kernel it triggers following code path:
> > 
> > online_pages()
> >   ...
> >if (online_type == MMOP_ONLINE_KERNEL) { 
> > 
> > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > &zone_shift))
> > return -EINVAL;  
> 
> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
pretty much, the reproducer is above so try and see for yourself

[...]
> [...]
> > > > > > Which means simple udev rule isn't usable since it gets event from
> > > > > > the first to the last hotplugged block order. So now we would have
> > > > > > to write a daemon that would
> > > > > >  - watch for all blocks in hotplugged memory appear (how would it 
> > > > > > know)
> > > > > >  - online them in right order (order might also be different 
> > > > > > depending
> > > > > >on kernel version)
> > > > > >-- it becomes even more complicated in NUMA case when there are
> > > > > >   multiple zones and kernel would have to provide user-space
> > > > > >   with information about zone maps
> > > > > > 
> > > > > > In short current experience shows that userspace approach
> > > > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > > > >fast and/or under memory pressure) when udev (or something else
> > > > > >might be killed)  
> > > > > 
> > > > > yeah and that is why the patch does the onlining from the kernel.
> > > > onlining in this patch is limited to hyperv and patch breaks
> > > > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > > > hotplug path.
> > > 
> > > Those can use the udev or do you see any reason why they couldn't?  
> >
> > Reasons are above, under >>>> and >> quotations, patch breaks
> > what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some
> > user-space process could be killed if hotplugged memory isn't onlined
> > fast enough leading to service termination and/or memory not
> > being onlined at all (if udev is killed)  
> 
> OK, so from the discussion so far I have learned that this would be
> problem _only_ if we are trying to hotplug a _lot_ of memory at once
> (~1.5% of the online memory is needed).  I am kind of skeptical this is
> a reasonable usecase. Who is going to hotadd 8G to 256M machine (which
> would eat half of the available memory which is still quite far from
> OOM)? Even if the memory balloning uses hotplug then such a grow sounds
> a bit excessive.
The slow-and-killable udev issue doesn't really depend on the
amount of hotplugged memory, since it's onlined in blocks
(128M on x86_64). Considering that memory is currently onlined
as zone NORMAL, the kernel doesn't have any issues adding more
follow-up blocks of memory.
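
(for reference, block size and block count are visible in sysfs, e.g.:

  cat /sys/devices/system/memory/block_size_bytes   # 8000000 hex = 128M on x86_64
  ls -d /sys/devices/system/memory/memory* | wc -l
)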

> > Currently udev rule is not usable 

Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-03 Thread Igor Mammedov
On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko  wrote:

> On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > On Thu, 2 Mar 2017 15:28:16 +0100
> > Michal Hocko  wrote:
> >   
> > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > [...]  
> > > > When trying to support memory unplug on guest side in RHEL7,
> > > > experience shows otherwise. Simplistic udev rule which onlines
> > > > added block doesn't work in case one wants to online it as movable.
> > > > 
> > > > Hotplugged blocks in current kernel should be onlined in reverse
> > > > order to online blocks as movable depending on adjacent blocks zone.
> > > 
> > > Could you be more specific please? Setting online_movable from the udev
> > > rule should just work regardless of the ordering or the state of other
> > > memblocks. If that doesn't work I would call it a bug.  
> > It's rather an implementation constrain than a bug
> > for details and workaround patch see
> >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> "You are not authorized to access bug #1314306"
Sorry,
I've made it public; related comments and the patch should be accessible now
(code snippets in the BZ are based on an older kernel, but the logic is still
the same upstream).
 
> could you paste the reasoning here please?
sure, here is a reproducer:
start a VM with this CLI:
  qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
  -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
  /path/to/guest_image

then in guest dimm1 blocks are from 32-39

  echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

in current mainline kernel it triggers following code path:

online_pages()
  ...
  if (online_type == MMOP_ONLINE_KERNEL) {
          if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
                  return -EINVAL;

zone_can_shift()
  ...
  if (idx < target) {
          /* pages must be at end of current zone */
          if (pfn + nr_pages != zone_end_pfn(zone))
                  return false;

since we are trying to online as movable a section that is not the last
section in ZONE_NORMAL.
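
(as a side note, the workaround for this particular reproducer is to online
the blocks from the last one down, so each block being onlined sits at the
end of ZONE_NORMAL; a sketch, assuming dimm1 still maps to blocks 32-39:

  for i in $(seq 39 -1 32); do
      echo online_movable > /sys/devices/system/memory/memory$i/state
  done
)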

Here is what makes hotplugged memory end up in ZONE_NORMAL:
 acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
   -> arch/x86/mm/init_64.c  

 /*
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
...
        struct zone *zone = pgdat->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);

i.e. all hot-plugged memory modules always go to ZONE_NORMAL,
and only the first/last block in a zone is allowed to be moved
to another zone. Patch [1] tries to fix the issue by assigning
the removable memory resource to the movable zone, so hotplugged+removable
blocks look like:
  movable normal, movable, movable
instead of current:
  normal, normal, normal movable

but then, with this fixed as suggested, auto-online by default
should work just fine in a kernel with normal and movable zones,
without any need for user-space.

> > patch attached there is limited by another memory hotplug
> > issue, which is NORMAL/MOVABLE zone balance, if kernel runs
> > on configuration where the most of memory is hot-removable
> > kernel might experience lack of memory in zone NORMAL.  
> 
> yes and that is an inherent problem of movable memory.
> 
> > > > Which means simple udev rule isn't usable since it gets event from
> > > > the first to the last hotplugged block order. So now we would have
> > > > to write a daemon that would
> > > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > > >  - online them in right order (order might also be different depending
> > > >on kernel version)
> > > >-- it becomes even more complicated in NUMA case when there are
> > > >   multiple zones and kernel would have to provide user-space
> > > >   with information about zone maps
> > > > 
> > > > In short current experience shows that userspace approach
> > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > >fast and/or under memory pressure) when udev (or something else
> > > >might be killed)
> > > 
> > > yeah and that is why the

Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-02 Thread Igor Mammedov
On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko  wrote:

> On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> [...]
> > When trying to support memory unplug on guest side in RHEL7,
> > experience shows otherwise. Simplistic udev rule which onlines
> > added block doesn't work in case one wants to online it as movable.
> > 
> > Hotplugged blocks in current kernel should be onlined in reverse
> > order to online blocks as movable depending on adjacent blocks zone.  
> 
> Could you be more specific please? Setting online_movable from the udev
> rule should just work regardless of the ordering or the state of other
> memblocks. If that doesn't work I would call it a bug.
It's rather an implementation constraint than a bug;
for details and a workaround patch see
 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
The patch attached there is limited by another memory hotplug
issue, which is the NORMAL/MOVABLE zone balance: if the kernel runs
on a configuration where most of the memory is hot-removable,
the kernel might experience a lack of memory in zone NORMAL.

> 
> > Which means simple udev rule isn't usable since it gets event from
> > the first to the last hotplugged block order. So now we would have
> > to write a daemon that would
> >  - watch for all blocks in hotplugged memory appear (how would it know)
> >  - online them in right order (order might also be different depending
> >on kernel version)
> >-- it becomes even more complicated in NUMA case when there are
> >   multiple zones and kernel would have to provide user-space
> >   with information about zone maps
> > 
> > In short current experience shows that userspace approach
> >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> >fast and/or under memory pressure) when udev (or something else
> >might be killed)  
> 
> yeah and that is why the patch does the onlining from the kernel.
onlining in this patch is limited to hyperv, and the patch breaks
auto-online on x86 kvm/vmware/baremetal, as they reuse the same
hotplug path.

> > > Can you imagine any situation when somebody actually might want to have
> > > this knob enabled? From what I understand it doesn't seem to be the
> > > case.  
> > For x86:
> >  * this config option is enabled by default in recent Fedora,  
> 
> How do you want to support usecases which really want to online memory
> as movable? Do you expect those users to disable the option because
> unless I am missing something the in kernel auto onlining only supporst
> regular onlining.
The current auto-onlining config option does what it's been designed for,
i.e. it onlines hotplugged memory.
It's possible for a non-average Fedora user to override the default
(commit 86dd995d6) if she/he needs non-default behavior
(i.e. the user knows how to online manually and/or can write
a daemon that would handle all the nuances of the kernel in use).

For the rest, when Fedora is used in the cloud and the user increases memory
via the management interface of whatever cloud she/he uses, it just works.

So it's the distribution's choice to pick a default that makes the
majority of its user base happy, and this patch removes that choice
without taking it into consideration.

How to online memory is a different issue, not related to this patch;
the current default of onlining as ZONE_NORMAL works well for scaling
up VMs.

Memory unplug is rather new and doesn't work reliably so far;
moving onlining to user-space won't really help. Further work
needs to be done so that it works reliably.

Now, about the question of onlining removable memory as movable:
the x86 kernel is able to get the information on whether hotadded memory
is removable from the ACPI subsystem and online it as movable without any
intervention from user-space, where it's hard to do so,
as patch [1] shows.
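
(one place where this information is visible on the guest side is the
SRAT ranges marked 'hotplug' in the boot log, e.g. a quick check:

  dmesg | grep -E 'SRAT:.*hotplug'
)
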
The problem is still being researched, and when we figure out how to fix
the hot-remove issues we might enable auto-onlining by default for x86.


Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

2017-03-02 Thread Igor Mammedov
On Mon 27-02-17 16:43:04, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:  
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.  
> > 
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> > 
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> > 
> > Auto-onlining of standby memory is the last thing we want.
I don't know much about anything other than x86, so all comments
below are from that point of view;
architectures that don't need auto-online can keep the current default.

> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.  
> > 
> > This also is a reason why auto-onlining doesn't seem be the best way.  

When trying to support memory unplug on the guest side in RHEL7,
experience shows otherwise. A simplistic udev rule which onlines an
added block doesn't work in case one wants to online it as movable.

Hotplugged blocks in the current kernel should be onlined in reverse
order to be onlined as movable, depending on the adjacent blocks' zone.
Which means a simple udev rule isn't usable, since it gets events in
first-to-last hotplugged block order. So now we would have
to write a daemon that would
 - watch for all blocks in hotplugged memory to appear (how would it know)
 - online them in the right order (the order might also be different depending
   on the kernel version)
   -- it becomes even more complicated in the NUMA case, when there are
      multiple zones and the kernel would have to provide user-space
      with information about zone maps

In short, current experience shows that the userspace approach
 - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
   fast and/or under memory pressure) when udev (or something else)
   might be killed
 - doesn't reduce overall system complexity, it only gets worse
   as the user-space handler needs to know a lot about kernel internals
   and implementation details/kernel versions to work properly

It might not be easy, but doing onlining in the kernel, on the other hand, is:
 - faster
 - more reliable (can't be killed under memory pressure)
 - the kernel has access to all the info needed for onlining and to how its
   internals work, so it can implement auto-online correctly
 - since there is no need to maintain an ABI for user-space
   (zone layout/ordering/maybe something else), the kernel is
   free to change its internal implementation without breaking userspace
   (currently hotplug+unplug doesn't work reliably and we might
   need something more flexible than zones)
That's the direction of the research in progress, i.e. making the kernel
implementation better instead of putting the responsibility on
user-space to deal with kernel shortcomings.

> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.
For x86:
 * this config option is enabled by default in recent Fedora,
 * RHEL6 has shipped similar downstream patches doing the same thing for years,
 * RHEL7 has a udev rule (because there was no kernel-side solution at fork time)
   that auto-onlines memory unconditionally; Vitaly might backport the kernel
   solution later when he has time.
Not the Linux kernel, but an auto-online policy is also what Windows uses,
both in baremetal and in guest configurations.

That somewhat shows that the current upstream defaults on x86
might not be what end users wish for.

When auto_online_blocks was introduced, Vitaly was
conservative and left the current upstream defaults where they were,
lest it break someone else's setup, while allowing downstreams to
set their own auto-online policy; eventually we might switch it
upstream too.


Re: regression since 4.8 and newer in select_idle_siblings()

2016-10-18 Thread Igor Mammedov
On Tue, 18 Oct 2016 16:02:07 +0200
Mike Galbraith  wrote:

> On Tue, 2016-10-18 at 15:40 +0200, Igor Mammedov wrote:
> > kernel crashes at runtime  due null pointer dereference at
> >   select_idle_sibling()  
> >  -> select_idle_cpu()  
> >  ...
> >  u64 avg_cost = this_sd->avg_scan_cost;
> > 
> > regression bisects to:
> >   commit 10e2f1acd0106c05229f94c70a344ce3a2c8008b
> >   Author: Peter Zijlstra 
> >   sched/core: Rewrite and improve select_idle_siblings()  
> 
> http://git.kernel.org/tip/9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441

Thanks, the above patch fixes the issue for me.

> Already fixed, and will land in Linus' tree soon.
> 
>   -Mike



regression since 4.8 and newer in select_idle_siblings()

2016-10-18 Thread Igor Mammedov
kernel crashes at runtime due to a null pointer dereference at
  select_idle_sibling()
 -> select_idle_cpu()
 ...
 u64 avg_cost = this_sd->avg_scan_cost;

regression bisects to:
  commit 10e2f1acd0106c05229f94c70a344ce3a2c8008b
  Author: Peter Zijlstra 
  sched/core: Rewrite and improve select_idle_siblings()

to reproduce the crash at runtime, start a VM with:
 qemu-system-x86_64 [-enable-kvm] \
-smp 4,sockets=2 \
linux48_disk.img

and offline cpu1 in guest:
 echo 0 > /sys/devices/system/cpu/cpu1/online

as a result the guest panics immediately, or with some small delay,
from some path that triggers access to select_idle_sibling().


To reproduce the crash at boot, start a VM with a recent QEMU (since 2.7):
 qemu-2.7/qemu-system-x86_64
-smp 1,sockets=2,cores=2,threads=1,maxcpus=4 \
-device qemu64-x86_64-cpu,socket-id=1,core-id=0,thread-id=0 \
-device qemu64-x86_64-cpu,socket-id=1,core-id=1,thread-id=0 \
-kernel bzImage_v48 [-enable-kvm]


=== one of the panics ===
[0.688680] BUG: unable to handle kernel NULL pointer dereference at 
0078
[0.688685] IP: [] select_idle_sibling+0x172/0x3b0
[0.688686] PGD 0 
[0.688687] Oops:  [#1] SMP
[0.688690] CPU: 0 PID: 109 Comm: kworker/u8:2 Not tainted 4.8.0-rc8+ #675
[0.688690] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
[0.688694] Workqueue: events_unbound async_run_entry_fn
[0.688695] task: 88007c258000 task.stack: 88007c3b
[0.688697] RIP: 0010:[]  [] 
select_idle_sibling+0x172/0x3b0
[0.688697] RSP: :88007c3b3bb0  EFLAGS: 00010007
[0.688698] RAX: 051b RBX: 0004 RCX: 0001
[0.688699] RDX: 0040 RSI: 0004 RDI: 88007d00a008
[0.688699] RBP: 88007c3b3c10 R08:  R09: 
[0.688700] R10:  R11: 0001 R12: 0002
[0.688700] R13: 88007d00a008 R14:  R15: 0004
[0.688701] FS:  () GS:88007fc0() 
knlGS:
[0.688702] CS:  0010 DS:  ES:  CR0: 80050033
[0.688703] CR2: 0078 CR3: 01c06000 CR4: 06f0
[0.688705] Stack:
[0.688707]  0001 88007c80e480 a118 
88007c282900
[0.688708]  0001 0002 0002 
88007c80e600
[0.688709]  88007c282900 00018ec0  

[0.688710] Call Trace:
[0.688712]  [] select_task_rq_fair+0x717/0x730
[0.688713]  [] ? update_curr+0xc7/0x150
[0.688715]  [] ? __enqueue_entity+0x6c/0x70
[0.688718]  [] try_to_wake_up+0x104/0x390
[0.688719]  [] wake_up_process+0x15/0x20
[0.688724]  [] scsi_eh_wakeup+0x33/0xa0
[0.688725]  [] scsi_schedule_eh+0x4c/0x60
[0.688728]  [] ata_std_sched_eh+0x3f/0x60
[0.688729]  [] ata_port_schedule_eh+0x13/0x20
[0.688730]  [] __ata_port_probe+0x44/0x60
[0.688731]  [] ata_port_probe+0x20/0x40
[0.688732]  [] async_port_probe+0x2e/0x60
[0.688734]  [] async_run_entry_fn+0x39/0x140
[0.688736]  [] process_one_work+0x152/0x400
[0.688738]  [] worker_thread+0x125/0x4b0
[0.688739]  [] ? process_one_work+0x400/0x400
[0.688740]  [] kthread+0xd8/0xf0
[0.688744]  [] ret_from_fork+0x1f/0x40
[0.688745]  [] ? __kthread_parkme+0x70/0x70
[0.688757] Code: c7 c0 20 dd 00 00 65 48 03 05 c3 bd f2 7e 4c 8b 30 48 c7 
c0 c0 8e 01 00 65 48 03 05 b1 bd f2 7e 48 8b 80 c8 09 00 00 48 c1 e8 09 <49> 39 
46 78 0f 87 29 02 00 00 65 8b 3d 9d bd f2 7e e8 b8 c8 ff 
[0.688758] RIP  [] select_idle_sibling+0x172/0x3b0
[0.688759]  RSP 
[0.688759] CR2: 0078
[0.688762] ---[ end trace f10266de945b1779 ]---


[PATCH 1/2] x86/x2apic: fix NULL pointer def during boot

2016-08-10 Thread Igor Mammedov
Fixes crash at boot for me.

Small nit wrt subj

s/def/deref/



Re: [PATCH FIX 4.6+] bcma: add PCI ID for Foxconn's BCM43142 device

2016-07-12 Thread Igor Mammedov
On Mon, 11 Jul 2016 23:01:36 +0200
Rafał Miłecki  wrote:

> After discovering there are 2 very different 14e4:4365 PCI devices we
> made ID tables less generic. Back then we believed there are only 2
> such devices:
> 1) 14e4:4365 1028:0016 with SoftMAC BCM43142 chipset
> 2) 14e4:4365 14e4:4365 with FullMAC BCM4366 chipset
> 
> From the recent report it appears there is also 14e4:4365 105b:e092
> which should be claimed by bcma. Add back support for it.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=121881
> Fixes: 515b399c9a20 ("bcma: claim only 14e4:4365 PCI Dell card with
> SoftMAC BCM43142") Reported-by: Igor Mammedov 
> Signed-off-by: Rafał Miłecki 
> Cc: Stable  [4.6+]
Tested-by: Igor Mammedov 

> ---
>  drivers/bcma/host_pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/bcma/host_pci.c b/drivers/bcma/host_pci.c
> index cae5385..bd46569 100644
> --- a/drivers/bcma/host_pci.c
> +++ b/drivers/bcma/host_pci.c
> @@ -295,6 +295,7 @@ static const struct pci_device_id
> bcma_pci_bridge_tbl[] = { { PCI_DEVICE(PCI_VENDOR_ID_BROADCOM,
> 0x4359) }, { PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, 0x4360) },
>   { PCI_DEVICE_SUB(PCI_VENDOR_ID_BROADCOM, 0x4365,
> PCI_VENDOR_ID_DELL, 0x0016) },
> + { PCI_DEVICE_SUB(PCI_VENDOR_ID_BROADCOM, 0x4365,
> PCI_VENDOR_ID_FOXCONN, 0xe092) },
> { PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, 0x43a0) },
> { PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, 0x43a9) },
> { PCI_DEVICE(PCI_VENDOR_ID_BROADCOM, 0x43aa) },



Re: [PATCH v4 2/2] KVM: move vcpu id checking to archs

2016-04-22 Thread Igor Mammedov
On Fri, 22 Apr 2016 11:25:38 +0200
Greg Kurz  wrote:

> Hi Radim !
> 
> On Thu, 21 Apr 2016 19:36:11 +0200
> Radim Krčmář  wrote:
> 
> > 2016-04-21 18:45+0200, Greg Kurz:  
> > > On Thu, 21 Apr 2016 18:00:19 +0200
> > > Radim Krčmář  wrote:
> > >> 2016-04-21 16:20+0200, Greg Kurz:
[...]
> > >  
> > > maybe later
> > > if we have other scenarios where vcpu ids need to cross the limit ?
> > 
> > x86 is going to have that soon too -- vcpu_id will be able to range from
> > 0 to 2^32-1 (or 2^31), but MAX_CPUS related data structures probably
> > won't be improved to actually scale, so MAX_CPUS will remain lower.
> >  
That's not true; x86 is going to stick with KVM_MAX_VCPUS / qemu's max_cpus.
The only thing that is going to change is that the max supported APIC ID
value will be in the range 0 to 2^32-1 vs the current 8-bit one,
and since the APIC ID is not the vcpu_id, it won't affect vcpu_id.
 
> 
> Do you have some pointers to share so that we can see the broader picture ?
> 
> Thanks !
> 
> --
> Greg
> 



Re: [PATCH v2] memory-hotplug: add automatic onlining policy for the newly added memory

2016-01-04 Thread Igor Mammedov
On Mon, 04 Jan 2016 11:47:12 +0100
Vitaly Kuznetsov  wrote:

> Andrew Morton  writes:
> 
> > On Tue, 22 Dec 2015 17:32:30 +0100 Vitaly Kuznetsov  
> > wrote:
> >  
> >> Currently, all newly added memory blocks remain in 'offline' state unless
> >> someone onlines them, some linux distributions carry special udev rules
> >> like:
> >> 
> >> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", 
> >> ATTR{state}="online"
> >> 
> >> to make this happen automatically. This is not a great solution for virtual
> >> machines where memory hotplug is being used to address high memory pressure
> >> situations as such onlining is slow and a userspace process doing this
> >> (udev) has a chance of being killed by the OOM killer as it will probably
> >> require to allocate some memory.
> >> 
> >> Introduce default policy for the newly added memory blocks in
> >> /sys/devices/system/memory/hotplug_autoonline file with two possible
> >> values: "offline" which preserves the current behavior and "online" which
> >> causes all newly added memory blocks to go online as soon as they're added.
> >> The default is "online" when MEMORY_HOTPLUG_AUTOONLINE kernel config option
> >> is selected.  
> >
> > I think the default should be "offline" so vendors can ship kernels
> > which have CONFIG_MEMORY_HOTPLUG_AUTOONLINE=y while being
> > back-compatible with previous kernels.
> >  
> 
> (sorry for the delayed response, just picking things up after holidays)
> 
> I was under an (wrong?) impression that in the majority of use cases
> users want to start using their newly added memory right away and that's
> what distros will ship. As an alternative to making the feature off by
> default I can suggest making CONFIG_MEMORY_HOTPLUG_AUTOONLINE a tristate
> switch (no feature, default offline, default online).
That would probably satisfy every distro;
the only question is why you need 'no feature',
wouldn't 'default offline' cover the current state?

> 
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -2537,6 +2537,8 @@ bytes respectively. Such letter suffixes can also be 
> >> entirely omitted.
> >>shutdown the other cpus.  Instead use the REBOOT_VECTOR
> >>irq.
> >>  
> >> +  nomemhp_autoonline  Don't automatically online newly added memory.
> >> +  
> >
> > This wasn't mentioned in the changelog.  Why do we need a boot
> > parameter as well as the sysfs knob?
if the 'default online' policy is set then we need a kernel option to disable
auto-onlining at kernel boot time (when the ACPI tables are parsed, for x86) if
needed, and vice versa for 'default offline' to enable auto-onlining at kernel
boot time.

For RHEL we would probably use the 'default online' policy, like
we do in RHEL6 with a custom patch.

> >  
> 
> I was thinking about some faulty hardware (e.g. reporting new memory
> blocks which for some reason are not really usable) and an easy way to
> make such hardware work.
> 
> >> +config MEMORY_HOTPLUG_AUTOONLINE
> >> +  bool "Automatically online hot-added memory"
> >> +  depends on MEMORY_HOTPLUG_SPARSE
> >> +  help
> >> +When memory is hot-added, it is not at ready-to-use state, a special  
> >
> > "When memory is hot-added it is not in a ready-to-use state.  A special"
> >  
> >> +userspace action is required to online the newly added blocks. With
> >> +this option enabled, the kernel will try to online all newly added
> >> +memory automatically.
> >> +
> >>
> >> ...
> >>  
> 



Re: [PATCH] memory-hotplug: don't BUG() in register_memory_resource()

2015-12-18 Thread Igor Mammedov
On Fri, 18 Dec 2015 15:50:24 +0100
Vitaly Kuznetsov  wrote:

> Out of memory condition is not a bug and while we can't add new memory in
> such case crashing the system seems wrong. Propagating the return value
> from register_memory_resource() requires interface change.
> 
> Cc: Andrew Morton 
> Cc: Tang Chen 
> Cc: Naoya Horiguchi 
> Cc: Xishi Qiu 
> Cc: Sheng Yong 
> Cc: David Rientjes 
> Cc: Zhu Guihua 
> Cc: Dan Williams 
> Cc: David Vrabel 
> Cc: Igor Mammedov 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  mm/memory_hotplug.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 67d488a..9392f01 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -127,11 +127,13 @@ void mem_hotplug_done(void)
>  }
>  
>  /* add this memory to iomem resource */
> -static struct resource *register_memory_resource(u64 start, u64 size)
> +static int register_memory_resource(u64 start, u64 size,
> + struct resource **resource)
>  {
>   struct resource *res;
>   res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> - BUG_ON(!res);
> + if (!res)
> + return -ENOMEM;
>  
>   res->name = "System RAM";
>   res->start = start;
> @@ -140,9 +142,10 @@ static struct resource *register_memory_resource(u64 
> start, u64 size)
>   if (request_resource(&iomem_resource, res) < 0) {
>   pr_debug("System RAM resource %pR cannot be added\n", res);
>   kfree(res);
> - res = NULL;
> + return -EEXIST;
>   }
> - return res;
> + *resource = res;
> + return 0;
>  }
>  
>  static void release_memory_resource(struct resource *res)
> @@ -1311,9 +1314,9 @@ int __ref add_memory(int nid, u64 start, u64 size)
>   struct resource *res;
>   int ret;
>  
> - res = register_memory_resource(start, size);
> - if (!res)
> -     return -EEXIST;
> + ret = register_memory_resource(start, size, &res);
> + if (ret)
> + return ret;
>  
>   ret = add_memory_resource(nid, res);
>   if (ret < 0)

Reviewed-by: Igor Mammedov 


[tip:x86/mm] x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-06 Thread tip-bot for Igor Mammedov
Commit-ID:  ec941c5ffede4d788b9fc008f9eeca75b9e964f5
Gitweb: http://git.kernel.org/tip/ec941c5ffede4d788b9fc008f9eeca75b9e964f5
Author: Igor Mammedov 
AuthorDate: Fri, 4 Dec 2015 14:07:06 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 6 Dec 2015 12:46:31 +0100

x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

when memory hotplug enabled system is booted with less
than 4GB of RAM and then later more RAM is hotplugged
32-bit devices stop functioning with following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if x86_64 system were booted
with RAM less than 4GB, it doesn't enable SWIOTLB and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking max possible PFN when parsing
memory affinity structures from SRAT ACPI table and
enable SWIOTLB if there is hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527)

It also fixes the HyperV, VMWare with emulated devices
which are affected by this issue as well.

Signed-off-by: Igor Mammedov 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: akata...@vmware.com
Cc: fujita.tomon...@lab.ntt.co.jp
Cc: konrad.w...@oracle.com
Cc: pbonz...@redhat.com
Cc: rev...@redhat.com
Cc: r...@redhat.com
Link: 
http://lkml.kernel.org/r/1449234426-273049-3-git-send-email-imamm...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/pci-swiotlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index adf0392..7c577a1 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -88,7 +88,7 @@ int __init pci_swiotlb_detect_4gb(void)
 {
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
 #ifdef CONFIG_X86_64
-   if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+   if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
swiotlb = 1;
 #endif
return swiotlb;


[tip:x86/mm] x86/mm: Introduce max_possible_pfn

2015-12-06 Thread tip-bot for Igor Mammedov
Commit-ID:  8dd3303001976aa8583bf20f6b93590c74114308
Gitweb: http://git.kernel.org/tip/8dd3303001976aa8583bf20f6b93590c74114308
Author: Igor Mammedov 
AuthorDate: Fri, 4 Dec 2015 14:07:05 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 6 Dec 2015 12:46:31 +0100

x86/mm: Introduce max_possible_pfn

max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.

By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with highest PFN of
hotpluggable memory ranges declared in ACPI SRAT table
if any present.

Signed-off-by: Igor Mammedov 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: akata...@vmware.com
Cc: fujita.tomon...@lab.ntt.co.jp
Cc: konrad.w...@oracle.com
Cc: pbonz...@redhat.com
Cc: rev...@redhat.com
Cc: r...@redhat.com
Link: 
http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imamm...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/setup.c | 2 ++
 arch/x86/mm/srat.c  | 2 ++
 include/linux/bootmem.h | 4 
 mm/bootmem.c| 1 +
 mm/nobootmem.c  | 1 +
 5 files changed, 10 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 29db25f..16a8465 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
if (mtrr_trim_uncached_memory(max_pfn))
max_pfn = e820_end_of_ram_pfn();
 
+   max_possible_pfn = max_pfn;
+
 #ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
find_low_pfn_range();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index c2aea63..b5f8218 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -203,6 +203,8 @@ acpi_numa_memory_affinity_init(struct 
acpi_srat_mem_affinity *ma)
pr_warn("SRAT: Failed to mark hotplug range [mem 
%#010Lx-%#010Lx] in memblock\n",
(unsigned long long)start, (unsigned long long)end - 1);
 
+   max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
+
return 0;
 out_err_bad_srat:
bad_srat();
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index f589222..35b22f9 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -19,6 +19,10 @@ extern unsigned long min_low_pfn;
  * highest page
  */
 extern unsigned long max_pfn;
+/*
+ * highest possible page
+ */
+extern unsigned long long max_possible_pfn;
 
 #ifndef CONFIG_NO_BOOTMEM
 /*
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 3b63807..91e32bc 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -33,6 +33,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long long max_possible_pfn;
 
 bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index e57cf24..99feb2b 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -31,6 +31,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long long max_possible_pfn;
 
 static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
u64 goal, u64 limit)


[PATCH v3 0/2] x86: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
changes since v2:
  - moved 'max_possible_pfn' tracking hunk to the 1st patch
  - make 'max_possible_pfn' 64-bit
  - simplify condition to oneliner as suggested by Ingo
changes since v1:
  - reduce # of #ifdefs by introducing max_possible_pfn
global variable
  - don't check 'acpi_no_memhotplug=1' for disabling
SWIOTLB initialization, since existing 'no_iommu'
kernel option could be used to the same effect.
  - split into 2 patches
  - 1st adds max_possible_pfn,
  - 2nd fixes bug enabling SWIOTLB when it's needed

when memory hotplug enabled system is booted with less
than 4GB of RAM and then later more RAM is hotplugged
32-bit devices stop functioning with following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if x86_64 system were booted
with RAM less than 4GB, it doesn't enable SWIOTLB and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking max possible PFN when parsing
memory affinity structures from SRAT ACPI table and
enable SWIOTLB if there is hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527)
It also fixes the HyperV, VMWare with emulated devices
which are affected by this issue as well.
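
A quick way to check on a running guest that SWIOTLB actually got set up
(a sketch; the exact dmesg wording differs between kernel versions):

  dmesg | grep -i 'software io tlb'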

ref to v2:
  https://lkml.org/lkml/2015/12/4/151
ref to v1:
  https://lkml.org/lkml/2015/11/30/594

Igor Mammedov (2):
  x86: introduce max_possible_pfn
  x86_64: enable SWIOTLB if system has SRAT memory regions above
MAX_DMA32_PFN

 arch/x86/kernel/pci-swiotlb.c | 2 +-
 arch/x86/kernel/setup.c   | 2 ++
 arch/x86/mm/srat.c| 2 ++
 include/linux/bootmem.h   | 4 
 mm/bootmem.c  | 1 +
 mm/nobootmem.c| 1 +
 6 files changed, 11 insertions(+), 1 deletion(-)

-- 
1.8.3.1



[PATCH v3 1/2] x86: introduce max_possible_pfn

2015-12-04 Thread Igor Mammedov
max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.

By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with highest PFN of
hotpluggable memory ranges declared in ACPI SRAT table
if any present.

Signed-off-by: Igor Mammedov 
---
v3:
 - make 'max_possible_pfn' 64-bit
 - simplify condition to oneliner as suggested by Ingo
---
 arch/x86/kernel/setup.c | 2 ++
 arch/x86/mm/srat.c  | 2 ++
 include/linux/bootmem.h | 4 
 mm/bootmem.c| 1 +
 mm/nobootmem.c  | 1 +
 5 files changed, 10 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 29db25f..16a8465 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
if (mtrr_trim_uncached_memory(max_pfn))
max_pfn = e820_end_of_ram_pfn();
 
+   max_possible_pfn = max_pfn;
+
 #ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
find_low_pfn_range();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index c2aea63..b5f8218 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -203,6 +203,8 @@ acpi_numa_memory_affinity_init(struct 
acpi_srat_mem_affinity *ma)
pr_warn("SRAT: Failed to mark hotplug range [mem 
%#010Lx-%#010Lx] in memblock\n",
(unsigned long long)start, (unsigned long long)end - 1);
 
+   max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
+
return 0;
 out_err_bad_srat:
bad_srat();
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index f589222..35b22f9 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -19,6 +19,10 @@ extern unsigned long min_low_pfn;
  * highest page
  */
 extern unsigned long max_pfn;
+/*
+ * highest possible page
+ */
+extern unsigned long long max_possible_pfn;
 
 #ifndef CONFIG_NO_BOOTMEM
 /*
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 3b63807..91e32bc 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -33,6 +33,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long long max_possible_pfn;
 
 bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index e57cf24..99feb2b 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -31,6 +31,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long long max_possible_pfn;
 
 static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
u64 goal, u64 limit)
-- 
1.8.3.1



[PATCH v3 2/2] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
when a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking the max possible PFN when parsing
memory affinity structures from the SRAT ACPI table and
enabling SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527).
It also fixes HyperV and VMware guests with emulated devices,
which are affected by this issue as well.

Signed-off-by: Igor Mammedov 
---
v3:
 - moved 'max_possible_pfn' tracking hunk to the patch
   that introduces it.
---
 arch/x86/kernel/pci-swiotlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index adf0392..7c577a1 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -88,7 +88,7 @@ int __init pci_swiotlb_detect_4gb(void)
 {
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
 #ifdef CONFIG_X86_64
-   if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+   if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
swiotlb = 1;
 #endif
return swiotlb;
-- 
1.8.3.1



Re: [PATCH v2 2/2] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
On Fri, 4 Dec 2015 12:49:49 +0100
Ingo Molnar  wrote:

> 
> * Igor Mammedov  wrote:
> 
> > when memory hotplug enabled system is booted with less
> > than 4GB of RAM and then later more RAM is hotplugged
> > 32-bit devices stop functioning with following error:
> > 
> >  nommu_map_single: overflow 327b4f8c0+1522 of device mask 
> > 
> > the reason for this is that if x86_64 system were booted
> > with RAM less than 4GB, it doesn't enable SWIOTLB and
> > when memory is hotplugged beyond MAX_DMA32_PFN, devices
> > that expect 32-bit addresses can't handle 64-bit addresses.
> > 
> > Fix it by tracking max possible PFN when parsing
> > memory affinity structures from SRAT ACPI table and
> > enable SWIOTLB if there is hotpluggable memory
> > regions beyond MAX_DMA32_PFN.
> > 
> > It fixes KVM guests when they use emulated devices
> > (reproduces with ata_piix, e1000 and usb devices,
> >  RHBZ: 1275941, 1275977, 1271527)
> > It also fixes the HyperV, VMWare with emulated devices
> > which are affected by this issue as well.
> > 
> > Signed-off-by: Igor Mammedov 
> > ---
> >  arch/x86/kernel/pci-swiotlb.c | 2 +-
> >  arch/x86/mm/srat.c| 3 +++
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> > index adf0392..7c577a1 100644
> > --- a/arch/x86/kernel/pci-swiotlb.c
> > +++ b/arch/x86/kernel/pci-swiotlb.c
> > @@ -88,7 +88,7 @@ int __init pci_swiotlb_detect_4gb(void)
> >  {
> > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> >  #ifdef CONFIG_X86_64
> > -   if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > +   if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
> > swiotlb = 1;
> 
> Ok, this series looks a lot better!
> 
> >  #endif
> > return swiotlb;
> > diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> > index c2aea63..a26bdbe 100644
> > --- a/arch/x86/mm/srat.c
> > +++ b/arch/x86/mm/srat.c
> > @@ -203,6 +203,9 @@ acpi_numa_memory_affinity_init(struct 
> > acpi_srat_mem_affinity *ma)
> > pr_warn("SRAT: Failed to mark hotplug range [mem 
> > %#010Lx-%#010Lx] in memblock\n",
> > (unsigned long long)start, (unsigned long long)end - 1);
> >  
> > +   if (max_possible_pfn < PFN_UP(end - 1))
> > +   max_possible_pfn = PFN_UP(end - 1);
> 
> Small nit, why not write this as something like:
> 
>   max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
That doesn't pass strict type check:

include/linux/kernel.h:730:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
  (void) (&_max1 == &_max2);  \
 ^
arch/x86/mm/srat.c:206:21: note: in expansion of macro ‘max’
  max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));

I can change max_possible_pfn to u64 to match PFN_UP(end - 1) type.
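
For reference, a rough sketch of the two ways around the max() type
warning (the function names below are made up for illustration, only
the max()/max_t() usage matters):

#include <linux/kernel.h>	/* max(), max_t() */
#include <linux/pfn.h>		/* PFN_UP() */

/* Option 1 (what v3 does): widen the variable so both arguments
 * of max() have the same 64-bit type as PFN_UP(end - 1).
 */
unsigned long long max_possible_pfn;

static void track_hotplug_pfn(u64 end)		/* illustrative name */
{
	max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
}

/* Option 2: keep 'unsigned long' and let max_t() cast both
 * arguments to an explicitly chosen common type.
 */
unsigned long max_possible_pfn_ul;		/* illustrative name */

static void track_hotplug_pfn_ul(u64 end)
{
	max_possible_pfn_ul = max_t(unsigned long long, max_possible_pfn_ul,
				    PFN_UP(end - 1));
}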

> 
> ?
> 
> I'd also move this second hunk to the first patch, because logically it 
> belongs 
> there (or into a third patch).
> 
> Thanks,
> 
>   Ingo



Re: [PATCH v2 2/2] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
On Fri, 4 Dec 2015 12:49:49 +0100
Ingo Molnar  wrote:

> 
> * Igor Mammedov  wrote:
> 
> > when memory hotplug enabled system is booted with less
> > than 4GB of RAM and then later more RAM is hotplugged
> > 32-bit devices stop functioning with following error:
> > 
> >  nommu_map_single: overflow 327b4f8c0+1522 of device mask 
> > 
> > the reason for this is that if x86_64 system were booted
> > with RAM less than 4GB, it doesn't enable SWIOTLB and
> > when memory is hotplugged beyond MAX_DMA32_PFN, devices
> > that expect 32-bit addresses can't handle 64-bit addresses.
> > 
> > Fix it by tracking max possible PFN when parsing
> > memory affinity structures from SRAT ACPI table and
> > enable SWIOTLB if there is hotpluggable memory
> > regions beyond MAX_DMA32_PFN.
> > 
> > It fixes KVM guests when they use emulated devices
> > (reproduces with ata_piix, e1000 and usb devices,
> >  RHBZ: 1275941, 1275977, 1271527)
> > It also fixes the HyperV, VMWare with emulated devices
> > which are affected by this issue as well.
> > 
> > Signed-off-by: Igor Mammedov 
> > ---
> >  arch/x86/kernel/pci-swiotlb.c | 2 +-
> >  arch/x86/mm/srat.c| 3 +++
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> > index adf0392..7c577a1 100644
> > --- a/arch/x86/kernel/pci-swiotlb.c
> > +++ b/arch/x86/kernel/pci-swiotlb.c
> > @@ -88,7 +88,7 @@ int __init pci_swiotlb_detect_4gb(void)
> >  {
> > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> >  #ifdef CONFIG_X86_64
> > -   if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > +   if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
> > swiotlb = 1;
> 
> Ok, this series looks a lot better!
> 
> >  #endif
> > return swiotlb;
> > diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> > index c2aea63..a26bdbe 100644
> > --- a/arch/x86/mm/srat.c
> > +++ b/arch/x86/mm/srat.c
> > @@ -203,6 +203,9 @@ acpi_numa_memory_affinity_init(struct 
> > acpi_srat_mem_affinity *ma)
> > pr_warn("SRAT: Failed to mark hotplug range [mem 
> > %#010Lx-%#010Lx] in memblock\n",
> > (unsigned long long)start, (unsigned long long)end - 1);
> >  
> > +   if (max_possible_pfn < PFN_UP(end - 1))
> > +   max_possible_pfn = PFN_UP(end - 1);
> 
> Small nit, why not write this as something like:
> 
>   max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
> 
> ?
> 
> I'd also move this second hunk to the first patch, because logically it 
> belongs 
> there (or into a third patch).
sure, I'll move it to the 1st patch

> 
> Thanks,
> 
>   Ingo



[PATCH v2 2/2] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
when a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking the max possible PFN when parsing
memory affinity structures from the SRAT ACPI table and
enabling SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527).
It also fixes HyperV and VMware guests with emulated devices,
which are affected by this issue as well.

Signed-off-by: Igor Mammedov 
---
 arch/x86/kernel/pci-swiotlb.c | 2 +-
 arch/x86/mm/srat.c| 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index adf0392..7c577a1 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -88,7 +88,7 @@ int __init pci_swiotlb_detect_4gb(void)
 {
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
 #ifdef CONFIG_X86_64
-   if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+   if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
swiotlb = 1;
 #endif
return swiotlb;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index c2aea63..a26bdbe 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -203,6 +203,9 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
	pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
		(unsigned long long)start, (unsigned long long)end - 1);
 
+   if (max_possible_pfn < PFN_UP(end - 1))
+   max_possible_pfn = PFN_UP(end - 1);
+
return 0;
 out_err_bad_srat:
bad_srat();
-- 
1.8.3.1



[PATCH v2 1/2] x86: introduce max_possible_pfn

2015-12-04 Thread Igor Mammedov
max_possible_pfn will be used for tracking the max possible
PFN for memory that isn't present in the E820 table and
could be hotplugged later.

By default max_possible_pfn is initialized with max_pfn,
but a follow-up patch will update it with the highest PFN of the
hotpluggable memory ranges declared in the ACPI SRAT table,
if any are present.

Signed-off-by: Igor Mammedov 
---
 arch/x86/kernel/setup.c | 2 ++
 include/linux/bootmem.h | 4 
 mm/bootmem.c| 1 +
 mm/nobootmem.c  | 1 +
 4 files changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 29db25f..16a8465 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1048,6 +1048,8 @@ void __init setup_arch(char **cmdline_p)
if (mtrr_trim_uncached_memory(max_pfn))
max_pfn = e820_end_of_ram_pfn();
 
+   max_possible_pfn = max_pfn;
+
 #ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
find_low_pfn_range();
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index f589222..5b39aa5 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -19,6 +19,10 @@ extern unsigned long min_low_pfn;
  * highest page
  */
 extern unsigned long max_pfn;
+/*
+ * highest possible page
+ */
+extern unsigned long max_possible_pfn;
 
 #ifndef CONFIG_NO_BOOTMEM
 /*
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 3b63807..221a04b 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -33,6 +33,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long max_possible_pfn;
 
 bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index e57cf24..08614c8 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -31,6 +31,7 @@ EXPORT_SYMBOL(contig_page_data);
 unsigned long max_low_pfn;
 unsigned long min_low_pfn;
 unsigned long max_pfn;
+unsigned long max_possible_pfn;
 
 static void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
u64 goal, u64 limit)
-- 
1.8.3.1



[PATCH v2 0/2] x86: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
changes since v1:
  - reduce # of #ifdefs by introducing max_possible_pfn
global variable
  - don't check 'acpi_no_memhotplug=1' for disabling
SWIOTLB initialization, since existing 'no_iommu'
kernel option could be used to the same effect.
  - split into 2 patches
  - 1st adds max_possible_pfn,
  - 2nd fixes bug enabling SWIOTLB when it's needed

when a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking the max possible PFN when parsing
memory affinity structures from the SRAT ACPI table and
enabling SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527).
It also fixes HyperV and VMware guests with emulated devices,
which are affected by this issue as well.

ref to v1:
  https://lkml.org/lkml/2015/11/30/594

Igor Mammedov (2):
  x86: introduce max_possible_pfn
  x86_64: enable SWIOTLB if system has SRAT memory regions above
MAX_DMA32_PFN

 arch/x86/kernel/pci-swiotlb.c | 2 +-
 arch/x86/kernel/setup.c   | 2 ++
 arch/x86/mm/srat.c| 3 +++
 include/linux/bootmem.h   | 4 
 mm/bootmem.c  | 1 +
 mm/nobootmem.c| 1 +
 6 files changed, 12 insertions(+), 1 deletion(-)

-- 
1.8.3.1



Re: [PATCH] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-12-04 Thread Igor Mammedov
On Fri, 4 Dec 2015 09:20:50 +0100
Ingo Molnar  wrote:

> 
> * Igor Mammedov  wrote:
> 
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index 94c18eb..53d7951 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -147,6 +147,7 @@ static inline void disable_acpi(void) { }
> >  #ifdef CONFIG_ACPI_NUMA
> >  extern int acpi_numa;
> >  extern int x86_acpi_numa_init(void);
> > +unsigned long acpi_get_max_possible_pfn(void);
> >  #endif /* CONFIG_ACPI_NUMA */
> >  
> >  #define acpi_unlazy_tlb(x) leave_mm(x)
> > @@ -170,4 +171,8 @@ static inline pgprot_t 
> > arch_apei_get_mem_attribute(phys_addr_t addr)
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_ACPI_HOTPLUG_MEMORY
> > +extern bool acpi_no_memhotplug;
> > +#endif /* CONFIG_ACPI_HOTPLUG_MEMORY */
> > +
> >  #endif /* _ASM_X86_ACPI_H */
> > diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> > index adf0392..61d5ba5 100644
> > --- a/arch/x86/kernel/pci-swiotlb.c
> > +++ b/arch/x86/kernel/pci-swiotlb.c
> > @@ -88,7 +88,11 @@ int __init pci_swiotlb_detect_4gb(void)
> >  {
> > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> >  #ifdef CONFIG_X86_64
> > +#ifdef CONFIG_ACPI_NUMA
> > +   if (!no_iommu && acpi_get_max_possible_pfn() > MAX_DMA32_PFN)
> > +#else
> > if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > +#endif
> > swiotlb = 1;
> >  #endif
> > return swiotlb;
> > diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
> > index c2aea63..21b33f0 100644
> > --- a/arch/x86/mm/srat.c
> > +++ b/arch/x86/mm/srat.c
> > @@ -153,10 +153,20 @@ acpi_numa_processor_affinity_init(struct 
> > acpi_srat_cpu_affinity *pa)
> >  
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  static inline int save_add_info(void) {return 1;}
> > +static unsigned long max_possible_pfn __initdata;
> >  #else
> >  static inline int save_add_info(void) {return 0;}
> >  #endif
> >  
> > +unsigned long __init acpi_get_max_possible_pfn(void)
> > +{
> > +#ifdef CONFIG_ACPI_HOTPLUG_MEMORY
> > +   if (!acpi_no_memhotplug)
> > +   return max_possible_pfn;
> > +#endif
> > +   return max_pfn;
> > +}
> > +
> >  /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
> >  int __init
> >  acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> > @@ -203,6 +213,11 @@ acpi_numa_memory_affinity_init(struct 
> > acpi_srat_mem_affinity *ma)
> > pr_warn("SRAT: Failed to mark hotplug range [mem 
> > %#010Lx-%#010Lx] in memblock\n",
> > (unsigned long long)start, (unsigned long long)end - 1);
> >  
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +   if (max_possible_pfn < PFN_UP(end - 1))
> > +   max_possible_pfn = PFN_UP(end - 1);
> > +#endif
> > +
> > return 0;
> >  out_err_bad_srat:
> > bad_srat();
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 6b0d3ef..ae38f57 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -357,7 +357,7 @@ static void acpi_memory_device_remove(struct 
> > acpi_device *device)
> > acpi_memory_device_free(mem_device);
> >  }
> >  
> > -static bool __initdata acpi_no_memhotplug;
> > +bool __initdata acpi_no_memhotplug;
> >  
> >  void __init acpi_memory_hotplug_init(void)
> >  {
> 
> So I don't disagree with the fix in principle, but the implementation here is 
> rather ugly - it spreads new non-obvious #ifdefs across various critical 
> parts of 
> the kernel.
> 
> For example this:
> 
> >  #ifdef CONFIG_X86_64
> > +#ifdef CONFIG_ACPI_NUMA
> > +   if (!no_iommu && acpi_get_max_possible_pfn() > MAX_DMA32_PFN)
> > +#else
> > if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > +#endif
> > swiotlb = 1;
> >  #endif
> 
> could be cleaned up by introducing a proper max_possible_pfn variable, and 
> setting 
> it from the ACPI code - instead of exporting acpi_get_max_possible_pfn().
Thanks for the review,
I'll try to drop the #ifdefs as suggested and split it
into several patches.

> 
> Another pattern is:
> 
> > +#ifdef CONFIG_ACPI_HOTPLUG_MEMORY
> > +   if (!acpi_no_memhotplug)
on second thought,
we don't need this knob to force disabling
SWIOTLB initialization, since the existing 'no_iommu' kernel
option can be used to the same effect.

[PATCH] x86_64: enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN

2015-11-30 Thread Igor Mammedov
when a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:

 nommu_map_single: overflow 327b4f8c0+1522 of device mask 

the reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.

Fix it by tracking the max possible PFN when parsing
memory affinity structures from the SRAT ACPI table and
enabling SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.

It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
 RHBZ: 1275941, 1275977, 1271527).
It also fixes HyperV and VMware guests with emulated devices,
which are affected by this issue as well.

Systems that have less than 4GB of RAM and do not use
memory hotplug but still have hotplug regions in SRAT
(i.e. a broken BIOS that can't disable mem hotplug)
can disable memory hotplug with 'acpi_no_memhotplug = 1'
to avoid automatic SWIOTLB initialization.

Tested on QEMU/KVM and HyperV.

Signed-off-by: Igor Mammedov 
---
 arch/x86/include/asm/acpi.h|  5 +
 arch/x86/kernel/pci-swiotlb.c  |  4 
 arch/x86/mm/srat.c | 15 +++
 drivers/acpi/acpi_memhotplug.c |  2 +-
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 94c18eb..53d7951 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -147,6 +147,7 @@ static inline void disable_acpi(void) { }
 #ifdef CONFIG_ACPI_NUMA
 extern int acpi_numa;
 extern int x86_acpi_numa_init(void);
+unsigned long acpi_get_max_possible_pfn(void);
 #endif /* CONFIG_ACPI_NUMA */
 
 #define acpi_unlazy_tlb(x) leave_mm(x)
@@ -170,4 +171,8 @@ static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 }
 #endif
 
+#ifdef CONFIG_ACPI_HOTPLUG_MEMORY
+extern bool acpi_no_memhotplug;
+#endif /* CONFIG_ACPI_HOTPLUG_MEMORY */
+
 #endif /* _ASM_X86_ACPI_H */
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index adf0392..61d5ba5 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -88,7 +88,11 @@ int __init pci_swiotlb_detect_4gb(void)
 {
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
 #ifdef CONFIG_X86_64
+#ifdef CONFIG_ACPI_NUMA
+   if (!no_iommu && acpi_get_max_possible_pfn() > MAX_DMA32_PFN)
+#else
if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+#endif
swiotlb = 1;
 #endif
return swiotlb;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index c2aea63..21b33f0 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -153,10 +153,20 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 static inline int save_add_info(void) {return 1;}
+static unsigned long max_possible_pfn __initdata;
 #else
 static inline int save_add_info(void) {return 0;}
 #endif
 
+unsigned long __init acpi_get_max_possible_pfn(void)
+{
+#ifdef CONFIG_ACPI_HOTPLUG_MEMORY
+   if (!acpi_no_memhotplug)
+   return max_possible_pfn;
+#endif
+   return max_pfn;
+}
+
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 int __init
 acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
@@ -203,6 +213,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
	pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
		(unsigned long long)start, (unsigned long long)end - 1);
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+   if (max_possible_pfn < PFN_UP(end - 1))
+   max_possible_pfn = PFN_UP(end - 1);
+#endif
+
return 0;
 out_err_bad_srat:
bad_srat();
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef..ae38f57 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -357,7 +357,7 @@ static void acpi_memory_device_remove(struct acpi_device *device)
acpi_memory_device_free(mem_device);
 }
 
-static bool __initdata acpi_no_memhotplug;
+bool __initdata acpi_no_memhotplug;
 
 void __init acpi_memory_hotplug_init(void)
 {
-- 
1.8.3.1



Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()

2015-10-16 Thread Igor Mammedov
On Fri, 16 Oct 2015 11:49:36 +0800
Gonglei  wrote:

> On 2015/10/15 22:39, Christoph Lameter wrote:
> > On Thu, 15 Oct 2015, Gonglei (Arei) wrote:
> > 
> >> [0.016000] Call Trace:
> >> [0.016000]  [] dump_trace+0x6c/0x2d0
> >> [0.016000]  [] dump_stack+0x69/0x71
> >> [0.016000]  [] panic+0x78/0x199
> >> [0.016000]  [] do_exit+0x26f/0x360
> >> [0.016000]  [] oops_end+0xe1/0xf0
> >> [0.016000]  [] __bad_area_nosemaphore+0x155/0x230
> >> [0.016000]  [] page_fault+0x1f/0x30
> >> [0.016000]  [] kmem_cache_alloc_node+0xbf/0x140
> >> [0.016000]  [] alloc_cpumask_var_node+0x16/0x70
> >> [0.016000]  [] native_send_call_func_ipi+0x18/0xf0
> >> [0.016000]  [] smp_call_function_many+0x1ae/0x250
> >> [0.016000]  [] smp_call_function+0x20/0x30
> >> [0.016000]  [] set_mtrr+0x5a/0x140
> >> [0.016000]  [] smp_callin+0xf0/0x1b4
> >> [0.016000]  [] start_secondary+0xe/0xb5
> > 
> > This happened during IPI processing?
> > 
> >> crash> p cache_cache
> > 
> > Arg. This is the SLAB allocator. You cannot enable debugging without
> > rebuilding the kernel with CONFIG_SLAB_DEBUG.
> > 
> >> smp_processor_id() return 14, the CPU14, but the CPU14 is *stuck*, so 
> >> cache=
> >> p->array[14] is NULL,
> >> why did this situation happen? And cause NULL pointer accessing? Is this a 
> >> =
> >> kernel bug?
> > 
> > Its likely a bug in some obscure code in a driver that corrupted memory or
> > messed up the way memory was handled. set_mtrr()? What was going on at the
> > time? A special graphics driver being loaded? That could cause issues.
> > 
> 
> It seems that the problem was fixed by Igor, right?
>   https://lkml.org/lkml/2014/3/6/257
That might help.
"stuck" CPU14 means that the master CPU has given up on the attempt
to online the AP and tried to clean it up from the various maps,
*but* the AP is still running and that may lead to unexpected
behavior.

> 
> Cced Igor Mammedov.
> 
> Regards,
> -Gonglei
> 



[PATCH] kvm: svm: reset mmu on VCPU reset

2015-09-18 Thread Igor Mammedov
When an INIT/SIPI sequence is sent to a VCPU which was previously
in use by the OS, VMRUN might fail with:

 KVM: entry failed, hardware error 0x
 EAX= EBX= ECX= EDX=06d3
 ESI= EDI= EBP= ESP=
 EIP= EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =9a00 0009a000  9a00
 [...]
 CR0=6010 CR2=b6f3e000 CR3=01942000 CR4=07e0
 [...]
 EFER=

with corresponding SVM error:
 KVM: FAILED VMRUN WITH VMCB:
 [...]
 cpl:0efer: 1000
 cr0:80010010 cr2:  7fd7fe85bf90
 cr3:000187d0c000 cr4:  0020
 [...]

What happens is that the VCPU state right after offlining is:
CR0: 0x80050033  EFER: 0xd01  CR4: 0x7e0
  -> long mode with CR3 pointing to longmode page tables

and when the VCPU gets INIT/SIPI, the following transition happens:
CR0: 0 -> 0x6010 EFER: 0x0  CR4: 0x7e0
  -> paging disabled with stale CR3

However, under the hood SVM puts the VCPU in Paged Real Mode*,
which effectively translates CR0 0x6010 -> 80010010 after

   svm_vcpu_reset()
   -> init_vmcb()
   -> kvm_set_cr0()
   -> svm_set_cr0()

but from the kvm_set_cr0() perspective the CR0: 0 -> 0x6010 transition
changes only caching bits, and
commit d81135a57aa6
 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
regressed svm_vcpu_reset(), which relied on the MMU being reset.

As a result, VMRUN after svm_vcpu_reset() tries to run the
VCPU in Paged Real Mode with a stale MMU context (long-mode page tables),
which causes some AMD CPUs** to bail out with VMEXIT_INVALID.

Fix the issue by unconditionally resetting the MMU context
at init_vmcb() time.

--
* AMD64 Architecture Programmer’s Manual,
Volume 2: System Programming, rev: 3.25
  15.19 Paged Real Mode
** Opteron 1216

Signed-off-by: Igor Mammedov 
---
 arch/x86/kvm/svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index fdb8cb6..89173af 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1264,6 +1264,7 @@ static void init_vmcb(struct vcpu_svm *svm, bool init_event)
 * It also updates the guest-visible cr0 value.
 */
(void)kvm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
+   kvm_mmu_reset_context(&svm->vcpu);
 
save->cr4 = X86_CR4_PAE;
/* rdx = ?? */
-- 
1.8.3.1



Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-30 Thread Igor Mammedov
On Thu, 30 Jul 2015 09:33:57 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jul 30, 2015 at 08:26:03AM +0200, Igor Mammedov wrote:
> > On Wed, 29 Jul 2015 18:28:26 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
> > > > although now there is vhost module max_mem_regions option
> > > > to set custom limit it doesn't help for default setups,
> > > > since it requires administrator manually set a higher
> > > > limit on each host. Which complicates servers deployments
> > > > and management.
> > > > Rise limit to the same value as KVM has (509 slots max),
> > > > so that default deployments would work out of box.
> > > > 
> > > > Signed-off-by: Igor Mammedov 
> > > > ---
> > > > PS:
> > > > Users that would want to lock down vhost could still
> > > > use max_mem_regions option to set lower limit, but
> > > > I expect it would be minority.
> > > 
> > > I'm not inclined to merge this.
> > > 
> > > Once we change this we can't take it back. It's not a decision
> > > to be taken lightly.
> > considering that continuous HVA idea has failed, why would you
> > want to take limit back in the future if we rise it now?
> 
> I'm not sure.
> 
> I think you merely demonstrated it's a big change for userspace -
> not that it's unfeasible.
> 
> Alternatively, if we want an unlimited size table, we should keep it
> in userspace memory.
btw:
if the table were a simple array and the kernel did an inefficient
linear scan to do the translation, then I guess we could use
userspace memory.

But I'm afraid we can't trust userspace in case of a more elaborate
structure. Even if it's just a binary search over a sorted array,
it would be possible for userspace to hang a kernel thread in
translate_desc() by providing a corrupted or wrongly sorted table.
And we can't afford table validation on the hot path.
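
To make the concern concrete, here is a rough sketch of the kind of
sorted-array lookup vhost does on the translation hot path (names and
types are simplified stand-ins, loosely modeled on the bsearch-based
lookup the earlier sort patch enables); its correctness silently
depends on the table being sorted and non-overlapping, which is why
the kernel sorts and keeps its own private copy instead of trusting a
userspace-resident table:

#include <linux/bsearch.h>
#include <linux/types.h>

/* Simplified stand-in for struct vhost_memory_region. */
struct region {
	u64 guest_phys_addr;
	u64 memory_size;
	u64 userspace_addr;
};

static int region_cmp(const void *key, const void *elt)
{
	u64 gpa = *(const u64 *)key;
	const struct region *r = elt;

	if (gpa < r->guest_phys_addr)
		return -1;
	if (gpa >= r->guest_phys_addr + r->memory_size)
		return 1;
	return 0;
}

/* O(log n) GPA -> region lookup.  Valid only if 'regions' is sorted
 * by guest_phys_addr and entries don't overlap; a corrupted or
 * unsorted table breaks the search, so vhost sorts its own copy at
 * VHOST_SET_MEM_TABLE time instead of validating per translation.
 */
static const struct region *find_region(const struct region *regions,
					unsigned int nregions, u64 gpa)
{
	return bsearch(&gpa, regions, nregions, sizeof(*regions), region_cmp);
}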


> 
> > > 
> > > And memory hotplug users are a minority.  Out of these, users with a
> > > heavily fragmented PA space due to hotplug abuse are an even smaller
> > > minority.
> > > 
> > > > ---
> > > >  include/uapi/linux/vhost.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > > > index 2511954..92657bf 100644
> > > > --- a/include/uapi/linux/vhost.h
> > > > +++ b/include/uapi/linux/vhost.h
> > > > @@ -140,7 +140,7 @@ struct vhost_memory {
> > > >  #define VHOST_MEM_MAX_NREGIONS_NONE 0
> > > >  /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> > > >   * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS 
> > > > support. */
> > > > -#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > > > +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
> > > >  
> > > >  /* VHOST_NET specific defines */
> > > >  
> > > > -- 
> > > > 1.8.3.1


Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-30 Thread Igor Mammedov
On Thu, 30 Jul 2015 09:33:57 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jul 30, 2015 at 08:26:03AM +0200, Igor Mammedov wrote:
> > On Wed, 29 Jul 2015 18:28:26 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
> > > > although now there is vhost module max_mem_regions option
> > > > to set custom limit it doesn't help for default setups,
> > > > since it requires administrator manually set a higher
> > > > limit on each host. Which complicates servers deployments
> > > > and management.
> > > > Rise limit to the same value as KVM has (509 slots max),
> > > > so that default deployments would work out of box.
> > > > 
> > > > Signed-off-by: Igor Mammedov 
> > > > ---
> > > > PS:
> > > > Users that would want to lock down vhost could still
> > > > use max_mem_regions option to set lower limit, but
> > > > I expect it would be minority.
> > > 
> > > I'm not inclined to merge this.
> > > 
> > > Once we change this we can't take it back. It's not a decision
> > > to be taken lightly.
in addition, you already gave out control over the limit by allowing
it to be raised via a module parameter. Raising the default is just a way
to reduce pain for users that would try to use more than 64
slots.

> > considering that continuous HVA idea has failed, why would you
> > want to take limit back in the future if we rise it now?
> 
> I'm not sure.
> 
> I think you merely demonstrated it's a big change for userspace -
> not that it's unfeasible.
It's not a big change but rather an ugly one, being unportable,
enforcing unnecessary (and not really reasonable) restrictions
on memory backends and changing the memory unplug management workflow
depending on whether HVA is used or not.

> Alternatively, if we want an unlimited size table, we should keep it
> in userspace memory.
this patch doesn't propose an unlimited table size.
With the proposed limit we are talking about at most order-4 allocations
that can fall back to vmalloc if needed. And it makes vhost consistent
with KVM's limit, which has a similar table.
The proposed limit also neatly works around corner cases of
existing old userspace that can't deal with it when it hits the limit.


> 
> > > 
> > > And memory hotplug users are a minority.  Out of these, users with a
> > > heavily fragmented PA space due to hotplug abuse are an even smaller
> > > minority.
> > > 
> > > > ---
> > > >  include/uapi/linux/vhost.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > > > index 2511954..92657bf 100644
> > > > --- a/include/uapi/linux/vhost.h
> > > > +++ b/include/uapi/linux/vhost.h
> > > > @@ -140,7 +140,7 @@ struct vhost_memory {
> > > >  #define VHOST_MEM_MAX_NREGIONS_NONE 0
> > > >  /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> > > >   * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS 
> > > > support. */
> > > > -#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > > > +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
> > > >  
> > > >  /* VHOST_NET specific defines */
> > > >  
> > > > -- 
> > > > 1.8.3.1



Re: [PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-29 Thread Igor Mammedov
On Wed, 29 Jul 2015 18:28:26 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jul 29, 2015 at 04:29:23PM +0200, Igor Mammedov wrote:
> > although now there is vhost module max_mem_regions option
> > to set custom limit it doesn't help for default setups,
> > since it requires administrator manually set a higher
> > limit on each host. Which complicates servers deployments
> > and management.
> > Rise limit to the same value as KVM has (509 slots max),
> > so that default deployments would work out of box.
> > 
> > Signed-off-by: Igor Mammedov 
> > ---
> > PS:
> > Users that would want to lock down vhost could still
> > use max_mem_regions option to set lower limit, but
> > I expect it would be minority.
> 
> I'm not inclined to merge this.
> 
> Once we change this we can't take it back. It's not a decision
> to be taken lightly.
considering that the continuous HVA idea has failed, why would you
want to take the limit back in the future if we raise it now?

> 
> And memory hotplug users are a minority.  Out of these, users with a
> heavily fragmented PA space due to hotplug abuse are an even smaller
> minority.
> 
> > ---
> >  include/uapi/linux/vhost.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > index 2511954..92657bf 100644
> > --- a/include/uapi/linux/vhost.h
> > +++ b/include/uapi/linux/vhost.h
> > @@ -140,7 +140,7 @@ struct vhost_memory {
> >  #define VHOST_MEM_MAX_NREGIONS_NONE 0
> >  /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> >   * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
> > -#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
> >  
> >  /* VHOST_NET specific defines */
> >  
> > -- 
> > 1.8.3.1



Re: [PATCH 1/2] vhost: add ioctl to query nregions upper limit

2015-07-29 Thread Igor Mammedov
On Wed, 29 Jul 2015 17:43:17 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jul 29, 2015 at 04:29:22PM +0200, Igor Mammedov wrote:
> > From: "Michael S. Tsirkin" 
> > 
> > Userspace currently simply tries to give vhost as many regions
> > as it happens to have, but you only have the mem table
> > when you have initialized a large part of VM, so graceful
> > failure is very hard to support.
> > 
> > The result is that userspace tends to fail catastrophically.
> > 
> > Instead, add a new ioctl so userspace can find out how much kernel
> > supports, up front. This returns a positive value that we commit to.
> > 
> > Also, document our contract with legacy userspace: when running on an
> > old kernel, you get -1 and you can assume at least 64 slots.  Since 0
> > value's left unused, let's make that mean that the current userspace
> > behaviour (trial and error) is required, just in case we want it back.
> 
> What's wrong with reading the module parameter value? It's there in
> sysfs ...
in most cases it would work, but a distro doesn't have to mount
sysfs under /sys, so it adds to the app the burden of discovering
where sysfs is mounted and which module to look in for the parameter.

So IMHO, sysfs is the more human-oriented interface,
while an ioctl is a more stable API for apps.
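
As a rough userspace sketch of how an app would consume the ioctl
(this assumes the uapi additions from this series are present in
<linux/vhost.h>; error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Ask the kernel how many regions VHOST_SET_MEM_TABLE will accept. */
static int vhost_max_mem_regions(int vhost_fd)
{
	int r = ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS);

	if (r < 0)	/* old kernel without the ioctl */
		return VHOST_MEM_MAX_NREGIONS_DEFAULT;	/* assume 64 */
	if (r == VHOST_MEM_MAX_NREGIONS_NONE)
		return 0;	/* no static limit: trial and error */
	return r;
}

int main(void)
{
	int fd = open("/dev/vhost-net", O_RDWR);

	if (fd < 0) {
		perror("open /dev/vhost-net");
		return 1;
	}
	printf("max mem regions: %d\n", vhost_max_mem_regions(fd));
	return 0;
}
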

> 
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > Signed-off-by: Igor Mammedov 
> > ---
> >  drivers/vhost/vhost.c  |  7 ++-
> >  include/uapi/linux/vhost.h | 17 -
> >  2 files changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index eec2f11..76dc0cf 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -30,7 +30,7 @@
> >  
> >  #include "vhost.h"
> >  
> > -static ushort max_mem_regions = 64;
> > +static ushort max_mem_regions = VHOST_MEM_MAX_NREGIONS_DEFAULT;
> >  module_param(max_mem_regions, ushort, 0444);
> >  MODULE_PARM_DESC(max_mem_regions,
> > "Maximum number of memory regions in memory map. (default: 64)");
> > @@ -944,6 +944,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
> > ioctl, void __user *argp)
> > long r;
> > int i, fd;
> >  
> > +   if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
> > +   r = max_mem_regions;
> > +   goto done;
> > +   }
> > +
> > /* If you are not the owner, you can become one */
> > if (ioctl == VHOST_SET_OWNER) {
> > r = vhost_dev_set_owner(d);
> > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > index ab373191..2511954 100644
> > --- a/include/uapi/linux/vhost.h
> > +++ b/include/uapi/linux/vhost.h
> > @@ -80,7 +80,7 @@ struct vhost_memory {
> >   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
> >  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> >  
> > -/* Set up/modify memory layout */
> > +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. 
> > */
> >  #define VHOST_SET_MEM_TABLE_IOW(VHOST_VIRTIO, 0x03, struct 
> > vhost_memory)
> >  
> >  /* Write logging setup. */
> > @@ -127,6 +127,21 @@ struct vhost_memory {
> >  /* Set eventfd to signal an error */
> >  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct 
> > vhost_vring_file)
> >  
> > +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
> > + * Returns:
> > + * 0 < value <= MAX_INT - gives the upper limit, higher values will fail
> > + * 0 - there's no static limit: try and see if it works
> > + * -1 - on failure
> > + */
> > +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
> > +
> > +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
> > + * try and it'll work if you are lucky. */
> > +#define VHOST_MEM_MAX_NREGIONS_NONE 0
> > +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> > + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
> > +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > +
> >  /* VHOST_NET specific defines */
> >  
> >  /* Attach virtio net ring to a raw socket, or tap device.
> > -- 
> > 1.8.3.1



[PATCH 2/2] vhost: increase default limit of nregions from 64 to 509

2015-07-29 Thread Igor Mammedov
although there is now a vhost module max_mem_regions option
to set a custom limit, it doesn't help for default setups,
since it requires the administrator to manually set a higher
limit on each host, which complicates server deployment
and management.
Raise the limit to the same value as KVM has (509 slots max),
so that default deployments work out of the box.

Signed-off-by: Igor Mammedov 
---
PS:
Users that want to lock down vhost could still
use the max_mem_regions option to set a lower limit, but
I expect they would be a minority.
---
 include/uapi/linux/vhost.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 2511954..92657bf 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -140,7 +140,7 @@ struct vhost_memory {
 #define VHOST_MEM_MAX_NREGIONS_NONE 0
 /* We support at least as many nregions in VHOST_SET_MEM_TABLE:
  * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
-#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 509
 
 /* VHOST_NET specific defines */
 
-- 
1.8.3.1



[PATCH 1/2] vhost: add ioctl to query nregions upper limit

2015-07-29 Thread Igor Mammedov
From: "Michael S. Tsirkin" 

Userspace currently simply tries to give vhost as many regions
as it happens to have, but you only have the mem table
when you have initialized a large part of the VM, so graceful
failure is very hard to support.

The result is that userspace tends to fail catastrophically.

Instead, add a new ioctl so userspace can find out how much the kernel
supports, up front. This returns a positive value that we commit to.

Also, document our contract with legacy userspace: when running on an
old kernel, you get -1 and you can assume at least 64 slots.  Since the 0
value is left unused, let's make it mean that the current userspace
behaviour (trial and error) is required, just in case we want it back.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c  |  7 ++-
 include/uapi/linux/vhost.h | 17 -
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..76dc0cf 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 
 #include "vhost.h"
 
-static ushort max_mem_regions = 64;
+static ushort max_mem_regions = VHOST_MEM_MAX_NREGIONS_DEFAULT;
 module_param(max_mem_regions, ushort, 0444);
 MODULE_PARM_DESC(max_mem_regions,
"Maximum number of memory regions in memory map. (default: 64)");
@@ -944,6 +944,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
long r;
int i, fd;
 
+   if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
+   r = max_mem_regions;
+   goto done;
+   }
+
/* If you are not the owner, you can become one */
if (ioctl == VHOST_SET_OWNER) {
r = vhost_dev_set_owner(d);
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index ab373191..2511954 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -80,7 +80,7 @@ struct vhost_memory {
  * Allows subsequent call to VHOST_OWNER_SET to succeed. */
 #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
 
-/* Set up/modify memory layout */
+/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. */
 #define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
 
 /* Write logging setup. */
@@ -127,6 +127,21 @@ struct vhost_memory {
 /* Set eventfd to signal an error */
 #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
 
+/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
+ * Returns:
+ * 0 < value <= MAX_INT - gives the upper limit, higher values will fail
+ * 0 - there's no static limit: try and see if it works
+ * -1 - on failure
+ */
+#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
+
+/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
+ * try and it'll work if you are lucky. */
+#define VHOST_MEM_MAX_NREGIONS_NONE 0
+/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
+ * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
+#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
+
 /* VHOST_NET specific defines */
 
 /* Attach virtio net ring to a raw socket, or tap device.
-- 
1.8.3.1



[PATCH 0/2] vhost: add ioctl to query nregions limit and rise default limit

2015-07-29 Thread Igor Mammedov

Igor Mammedov (1):
  vhost: increase default limit of nregions from 64 to 509

Michael S. Tsirkin (1):
  vhost: add ioctl to query nregions upper limit

 drivers/vhost/vhost.c  |  7 ++-
 include/uapi/linux/vhost.h | 17 -
 2 files changed, 22 insertions(+), 2 deletions(-)

-- 
1.8.3.1



Re: [PATCH v4 2/2] vhost: add max_mem_regions module parameter

2015-07-16 Thread Igor Mammedov
On Thu,  2 Jul 2015 15:08:11 +0200
Igor Mammedov  wrote:

> it became possible to use a bigger amount of memory
> slots, which is used by memory hotplug for
> registering hotplugged memory.
> However QEMU crashes if it's used with more than ~60
> pc-dimm devices and vhost-net enabled since host kernel
> in module vhost-net refuses to accept more than 64
> memory regions.
> 
> Allow to tweak limit via max_mem_regions module paramemter
> with default value set to 64 slots.
Michael,

what was the reason not to raise the default?
As much as I think about it, I can't come up with one.

Making it a module option doesn't make much sense,
since old userspace will crash on new kernels anyway
until admins learn that there is a module option
to raise the limit.

Raising the default limit on par with KVM's will help
to avoid those crashes and would allow dropping the
unnecessary option, reducing user confusion.

> 
> Signed-off-by: Igor Mammedov 
> ---
>  drivers/vhost/vhost.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 6488011..9a68e2e 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -29,8 +29,12 @@
>  
>  #include "vhost.h"
>  
> +static ushort max_mem_regions = 64;
> +module_param(max_mem_regions, ushort, 0444);
> +MODULE_PARM_DESC(max_mem_regions,
> + "Maximum number of memory regions in memory map. (default: 64)");
> +
>  enum {
> - VHOST_MEMORY_MAX_NREGIONS = 64,
>   VHOST_MEMORY_F_LOG = 0x1,
>  };
>  
> @@ -696,7 +700,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
> vhost_memory __user *m)
>   return -EFAULT;
>   if (mem.padding)
>   return -EOPNOTSUPP;
> - if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
> + if (mem.nregions > max_mem_regions)
>   return -E2BIG;
>   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
>   if (!newmem)



[PATCH] fixup! vhost: extend memory regions allocation to vmalloc

2015-07-15 Thread Igor Mammedov
callers of vhost_kvzalloc() expect the same behaviour on
allocation error as from kmalloc/vmalloc, i.e. a NULL return
value. So just return the vzalloc() return value instead of
returning ERR_PTR(-ENOMEM).

issue introduced by
  4de7255f7d2be5e51664c6ac6011ffd6e5463571 in vhost-next tree

Spotted-by: Dan Carpenter 
Suggested-by: Julia Lawall 
Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a9fe859..3702487 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -683,11 +683,8 @@ static void *vhost_kvzalloc(unsigned long size)
 {
void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
 
-   if (!n) {
+   if (!n)
n = vzalloc(size);
-   if (!n)
-   return ERR_PTR(-ENOMEM);
-   }
return n;
 }
 
-- 
1.8.3.1



Re: [PATCH] vhost: fix build failure on SPARC

2015-07-13 Thread Igor Mammedov
On Mon, 13 Jul 2015 23:19:59 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Jul 13, 2015 at 08:15:30PM +0200, Igor Mammedov wrote:
> > while on x86 target vmalloc.h is included indirectly through
> > other heaedrs, it's not included on SPARC.
> > Fix issue by including vmalloc.h directly from vhost.c
> > like it's done in vhost/net.c
> > 
> > Signed-off-by: Igor Mammedov 
> 
> It's your vmalloc patch that introduces the issue, right?
> I squashed this fix into that one.
Thanks!
yep, that's right. I've sent this patch as a reply to the vmalloc
patch that introduced the issue but forgot to mention the commit
that introduced it.

> 
> You can just send fixup! patches in cases like this
> in the future, or at least mention the commit that
> breaks the build.
ok, I'll do it next time.

> 
> 
> > ---
> >  drivers/vhost/vhost.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 9a68e2e..a9fe859 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > -- 
> > 1.8.3.1



[PATCH] vhost: fix build failure on SPARC

2015-07-13 Thread Igor Mammedov
while on the x86 target vmalloc.h is included indirectly through
other headers, it's not included on SPARC.
Fix the issue by including vmalloc.h directly from vhost.c,
like it's done in vhost/net.c.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9a68e2e..a9fe859 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



Re: [PATCH v4 0/2] vhost: support more than 64 memory regions

2015-07-08 Thread Igor Mammedov
On Thu,  2 Jul 2015 15:08:09 +0200
Igor Mammedov  wrote:

> changes since v3:
>   * rebased on top of vhost-next branch
> changes since v2:
>   * drop cache patches for now as suggested
>   * add max_mem_regions module parameter instead of unconditionally
> increasing limit
>   * drop bsearch patch since it's already queued
> 
> References to previous versions:
> v2: https://lkml.org/lkml/2015/6/17/276
> v1: http://www.spinics.net/lists/kvm/msg117654.html
> 
> Series allows to tweak vhost's memory regions count limit.
> 
> It fixes VM crashing on memory hotplug due to vhost refusing
> accepting more than 64 memory regions with max_mem_regions
> set to more than 262 slots in default QEMU configuration.
> 
> Igor Mammedov (2):
>   vhost: extend memory regions allocation to vmalloc
>   vhost: add max_mem_regions module parameter
> 
>  drivers/vhost/vhost.c | 28 ++--
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 

ping


[PATCH v4 1/2] vhost: extend memory regions allocation to vmalloc

2015-07-02 Thread Igor Mammedov
with a large number of memory regions we could end up with
high-order allocations, and kmalloc could fail if the
host is under memory pressure.
Considering that the memory regions array is used on the hot path,
try harder to allocate using kmalloc and, if it fails, resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
into the guest.

I'll still look at a QEMU-side solution to reduce the number of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large number of memory
regions.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 71bb468..6488011 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -544,7 +544,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
fput(dev->log_file);
dev->log_file = NULL;
/* No one will access memory at this point */
-   kfree(dev->memory);
+   kvfree(dev->memory);
dev->memory = NULL;
WARN_ON(!list_empty(&dev->work_list));
if (dev->worker) {
@@ -674,6 +674,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -686,7 +698,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
return -EOPNOTSUPP;
if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
return -E2BIG;
-   newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
@@ -700,7 +712,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
oldmem = d->memory;
@@ -712,7 +724,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
d->vqs[i]->memory = newmem;
mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH v4 2/2] vhost: add max_mem_regions module parameter

2015-07-02 Thread Igor Mammedov
it became possible to use a bigger number of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net enabled, since the host kernel's
vhost-net module refuses to accept more than 64
memory regions.

Allow tweaking the limit via the max_mem_regions module parameter,
with the default value set to 64 slots.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6488011..9a68e2e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,8 +29,12 @@
 
 #include "vhost.h"
 
+static ushort max_mem_regions = 64;
+module_param(max_mem_regions, ushort, 0444);
+MODULE_PARM_DESC(max_mem_regions,
+   "Maximum number of memory regions in memory map. (default: 64)");
+
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
@@ -696,7 +700,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EFAULT;
if (mem.padding)
return -EOPNOTSUPP;
-   if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+   if (mem.nregions > max_mem_regions)
return -E2BIG;
newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
-- 
1.8.3.1
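
As a usage note, a minimal sketch of how a management tool might read the
new limit; it assumes the parameter is exposed by the vhost core module
through the usual module_param sysfs path (the 0444 mode above makes it
world-readable) and falls back to the historical 64 when the file is
absent, as on older kernels:

#include <stdio.h>

static unsigned int vhost_max_mem_regions(void)
{
	FILE *f = fopen("/sys/module/vhost/parameters/max_mem_regions", "r");
	unsigned int limit = 64;	/* legacy default */

	if (f) {
		if (fscanf(f, "%u", &limit) != 1)
			limit = 64;
		fclose(f);
	}
	return limit;
}

int main(void)
{
	printf("vhost accepts up to %u memory regions\n",
	       vhost_max_mem_regions());
	return 0;
}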



[PATCH v4 0/2] vhost: support more than 64 memory regions

2015-07-02 Thread Igor Mammedov
changes since v3:
  * rebased on top of vhost-next branch
changes since v2:
  * drop cache patches for now as suggested
  * add max_mem_regions module parameter instead of unconditionally
increasing limit
  * drop bsearch patch since it's already queued

References to previous versions:
v2: https://lkml.org/lkml/2015/6/17/276
v1: http://www.spinics.net/lists/kvm/msg117654.html

The series allows tweaking vhost's memory regions count limit.

It fixes a VM crash on memory hotplug due to vhost refusing
to accept more than 64 memory regions, with max_mem_regions
set to more than 262 slots for the default QEMU configuration.

Igor Mammedov (2):
  vhost: extend memory regions allocation to vmalloc
  vhost: add max_mem_regions module parameter

 drivers/vhost/vhost.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

-- 
1.8.3.1



[PATCH v3 2/2] vhost: add max_mem_regions module parameter

2015-07-01 Thread Igor Mammedov
It became possible to use a bigger number of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net enabled, since the host
kernel's vhost-net module refuses to accept more than 64
memory regions.

Allow tweaking the limit via the max_mem_regions module
parameter, with the default value set to 64 slots.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..5905cd7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,8 +29,12 @@
 
 #include "vhost.h"
 
+static ushort max_mem_regions = 64;
+module_param(max_mem_regions, ushort, 0444);
+MODULE_PARM_DESC(max_mem_regions,
+   "Maximum number of memory regions in memory map. (default: 64)");
+
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
@@ -623,7 +627,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EFAULT;
if (mem.padding)
return -EOPNOTSUPP;
-   if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+   if (mem.nregions > max_mem_regions)
return -E2BIG;
newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
-- 
1.8.3.1



[PATCH v3 0/2] vhost: support more than 64 memory regions

2015-07-01 Thread Igor Mammedov
changes since v2:
  * drop cache patches for now as suggested
  * add max_mem_regions module parameter instead of unconditionally
increasing limit
  * drop bsearch patch since it's already queued

References to previous versions:
v2: https://lkml.org/lkml/2015/6/17/276
v1: http://www.spinics.net/lists/kvm/msg117654.html

The series allows tweaking vhost's memory regions count limit.

It fixes a VM crash on memory hotplug due to vhost refusing
to accept more than 64 memory regions, with max_mem_regions
set to more than 262 slots for the default QEMU configuration.

Igor Mammedov (2):
  vhost: extend memory regions allocation to vmalloc
  vhost: add max_mem_regions module parameter

 drivers/vhost/vhost.c | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

-- 
1.8.3.1



[PATCH v3 1/2] vhost: extend memory regions allocation to vmalloc

2015-07-01 Thread Igor Mammedov
With a large number of memory regions we could end up with
high order allocations, and kmalloc could fail if the
host is under memory pressure.
Considering that the memory regions array is used on a hot path,
try harder to allocate using kmalloc, and if that fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at a QEMU-side solution to reduce the number of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large number of memory
regions.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f1e07b8..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
fput(dev->log_file);
dev->log_file = NULL;
/* No one will access memory at this point */
-   kfree(dev->memory);
+   kvfree(dev->memory);
dev->memory = NULL;
WARN_ON(!list_empty(&dev->work_list));
if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const 
void *p2)
return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user 
*m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,21 +625,21 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EOPNOTSUPP;
if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
return -E2BIG;
-   newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
memcpy(newmem, &mem, size);
if (copy_from_user(newmem->regions, m->regions,
   mem.nregions * sizeof *m->regions)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
d->vqs[i]->memory = newmem;
mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-25 Thread Igor Mammedov
On Wed, 24 Jun 2015 17:08:56 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 24, 2015 at 04:52:29PM +0200, Igor Mammedov wrote:
> > On Wed, 24 Jun 2015 16:17:46 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 24, 2015 at 04:07:27PM +0200, Igor Mammedov wrote:
> > > > On Wed, 24 Jun 2015 15:49:27 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > Userspace currently simply tries to give vhost as many regions
> > > > > as it happens to have, but you only have the mem table
> > > > > when you have initialized a large part of VM, so graceful
> > > > > failure is very hard to support.
> > > > > 
> > > > > The result is that userspace tends to fail catastrophically.
> > > > > 
> > > > > Instead, add a new ioctl so userspace can find out how much
> > > > > kernel supports, up front. This returns a positive value that
> > > > > we commit to.
> > > > > 
> > > > > Also, document our contract with legacy userspace: when
> > > > > running on an old kernel, you get -1 and you can assume at
> > > > > least 64 slots.  Since 0 value's left unused, let's make that
> > > > > mean that the current userspace behaviour (trial and error)
> > > > > is required, just in case we want it back.
> > > > > 
> > > > > Signed-off-by: Michael S. Tsirkin 
> > > > > Cc: Igor Mammedov 
> > > > > Cc: Paolo Bonzini 
> > > > > ---
> > > > >  include/uapi/linux/vhost.h | 17 -
> > > > >  drivers/vhost/vhost.c  |  5 +
> > > > >  2 files changed, 21 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/include/uapi/linux/vhost.h
> > > > > b/include/uapi/linux/vhost.h index ab373191..f71fa6d 100644
> > > > > --- a/include/uapi/linux/vhost.h
> > > > > +++ b/include/uapi/linux/vhost.h
> > > > > @@ -80,7 +80,7 @@ struct vhost_memory {
> > > > >   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
> > > > >  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> > > > >  
> > > > > -/* Set up/modify memory layout */
> > > > > +/* Set up/modify memory layout: see also
> > > > > VHOST_GET_MEM_MAX_NREGIONS below. */ #define
> > > > > VHOST_SET_MEM_TABLE   _IOW(VHOST_VIRTIO, 0x03, struct
> > > > > vhost_memory) /* Write logging setup. */
> > > > > @@ -127,6 +127,21 @@ struct vhost_memory {
> > > > >  /* Set eventfd to signal an error */
> > > > >  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct
> > > > > vhost_vring_file) 
> > > > > +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE
> > > > > arguments.
> > > > > + * Returns:
> > > > > + *   0 < value <= MAX_INT - gives the upper limit,
> > > > > higher values will fail
> > > > > + *   0 - there's no static limit: try and see if it
> > > > > works
> > > > > + *   -1 - on failure
> > > > > + */
> > > > > +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
> > > > > +
> > > > > +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no
> > > > > static limit:
> > > > > + * try and it'll work if you are lucky. */
> > > > > +#define VHOST_MEM_MAX_NREGIONS_NONE 0
> > > > is it needed? we always have a limit,
> > > > or don't have IOCTL => -1 => old try and see way
> > > > 
> > > > > +/* We support at least as many nregions in
> > > > > VHOST_SET_MEM_TABLE:
> > > > > + * for use on legacy kernels without
> > > > > VHOST_GET_MEM_MAX_NREGIONS support. */ +#define
> > > > > VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > > > ^^^ not used below,
> > > > if it's for legacy then perhaps s/DEFAULT/LEGACY/ 
> > > 
> > > The assumption was that userspace detecting old kernels will just
> > > use 64, this means we do want a flag to get the old way.
> > > 
> > > OTOH if you won't think it's useful, let me know.
> > this header will be synced into QEMU's tree so that we could use
> > this define there, isn't it? IMHO then _LEGACY is more exact
>

Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-24 Thread Igor Mammedov
On Wed, 24 Jun 2015 16:17:46 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 24, 2015 at 04:07:27PM +0200, Igor Mammedov wrote:
> > On Wed, 24 Jun 2015 15:49:27 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > Userspace currently simply tries to give vhost as many regions
> > > as it happens to have, but you only have the mem table
> > > when you have initialized a large part of VM, so graceful
> > > failure is very hard to support.
> > > 
> > > The result is that userspace tends to fail catastrophically.
> > > 
> > > Instead, add a new ioctl so userspace can find out how much kernel
> > > supports, up front. This returns a positive value that we commit to.
> > > 
> > > Also, document our contract with legacy userspace: when running on an
> > > old kernel, you get -1 and you can assume at least 64 slots.  Since 0
> > > value's left unused, let's make that mean that the current userspace
> > > behaviour (trial and error) is required, just in case we want it back.
> > > 
> > > Signed-off-by: Michael S. Tsirkin 
> > > Cc: Igor Mammedov 
> > > Cc: Paolo Bonzini 
> > > ---
> > >  include/uapi/linux/vhost.h | 17 -
> > >  drivers/vhost/vhost.c  |  5 +
> > >  2 files changed, 21 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > > index ab373191..f71fa6d 100644
> > > --- a/include/uapi/linux/vhost.h
> > > +++ b/include/uapi/linux/vhost.h
> > > @@ -80,7 +80,7 @@ struct vhost_memory {
> > >   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
> > >  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> > >  
> > > -/* Set up/modify memory layout */
> > > +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS 
> > > below. */
> > >  #define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct 
> > > vhost_memory)
> > >  
> > >  /* Write logging setup. */
> > > @@ -127,6 +127,21 @@ struct vhost_memory {
> > >  /* Set eventfd to signal an error */
> > >  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct 
> > > vhost_vring_file)
> > >  
> > > +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
> > > + * Returns:
> > > + *   0 < value <= MAX_INT - gives the upper limit, higher values 
> > > will fail
> > > + *   0 - there's no static limit: try and see if it works
> > > + *   -1 - on failure
> > > + */
> > > +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
> > > +
> > > +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static 
> > > limit:
> > > + * try and it'll work if you are lucky. */
> > > +#define VHOST_MEM_MAX_NREGIONS_NONE 0
> > is it needed? we always have a limit,
> > or don't have IOCTL => -1 => old try and see way
> > 
> > > +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> > > + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. 
> > > */
> > > +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
> > ^^^ not used below,
> > if it's for legacy then perhaps s/DEFAULT/LEGACY/ 
> 
> The assumption was that userspace detecting old kernels will just use 64,
> this means we do want a flag to get the old way.
> 
> OTOH if you won't think it's useful, let me know.
This header will be synced into QEMU's tree so that we can use this define
there, won't it? IMHO _LEGACY is then a more exact description of the macro.

As for the 0 return value, -1 is just fine for detecting old kernels (i.e. try
and see if it works), so 0 looks unnecessary, but it doesn't hurt in any way
either. For me a limit or -1 is enough to try to fix userspace.

> 
> > > +
> > >  /* VHOST_NET specific defines */
> > >  
> > >  /* Attach virtio net ring to a raw socket, or tap device.
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 9e8e004..3b68f9d 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -917,6 +917,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned 
> > > int ioctl, void __user *argp)
> > >   long r;
> > >   int i, fd;
> > >  
> > > + if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
> > > + r = VHOST_MEMORY_MAX_NREGIONS;
> > > + goto done;
> > > + }
> > > +
> > >   /* If you are not the owner, you can become one */
> > >   if (ioctl == VHOST_SET_OWNER) {
> > >   r = vhost_dev_set_owner(d);

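To make the intended userspace flow concrete, here is a sketch of how
QEMU-like userspace might probe the limit. Note that
VHOST_GET_MEM_MAX_NREGIONS is only the RFC ioctl proposed in the quoted
patch (not merged uapi at this point), so it is defined locally below, and
/dev/vhost-net is just an example device:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

#ifndef VHOST_GET_MEM_MAX_NREGIONS
#define VHOST_GET_MEM_MAX_NREGIONS _IO(VHOST_VIRTIO, 0x23)	/* from the RFC */
#endif

static int probe_max_regions(int vhost_fd)
{
	int r = ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS);

	if (r > 0)
		return r;	/* kernel advertises a hard limit */
	if (r == 0)
		return 0;	/* no static limit: keep the trial-and-error path */
	return 64;		/* old kernel without the ioctl: assume 64 */
}

int main(void)
{
	int fd = open("/dev/vhost-net", O_RDWR);

	if (fd < 0)
		return 1;
	printf("max regions: %d\n", probe_max_regions(fd));
	close(fd);
	return 0;
}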


Re: [PATCH RFC] vhost: add ioctl to query nregions upper limit

2015-06-24 Thread Igor Mammedov
On Wed, 24 Jun 2015 15:49:27 +0200
"Michael S. Tsirkin"  wrote:

> Userspace currently simply tries to give vhost as many regions
> as it happens to have, but you only have the mem table
> when you have initialized a large part of VM, so graceful
> failure is very hard to support.
> 
> The result is that userspace tends to fail catastrophically.
> 
> Instead, add a new ioctl so userspace can find out how much kernel
> supports, up front. This returns a positive value that we commit to.
> 
> Also, document our contract with legacy userspace: when running on an
> old kernel, you get -1 and you can assume at least 64 slots.  Since 0
> value's left unused, let's make that mean that the current userspace
> behaviour (trial and error) is required, just in case we want it back.
> 
> Signed-off-by: Michael S. Tsirkin 
> Cc: Igor Mammedov 
> Cc: Paolo Bonzini 
> ---
>  include/uapi/linux/vhost.h | 17 -
>  drivers/vhost/vhost.c  |  5 +
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index ab373191..f71fa6d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -80,7 +80,7 @@ struct vhost_memory {
>   * Allows subsequent call to VHOST_OWNER_SET to succeed. */
>  #define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
>  
> -/* Set up/modify memory layout */
> +/* Set up/modify memory layout: see also VHOST_GET_MEM_MAX_NREGIONS below. */
>  #define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
>  
>  /* Write logging setup. */
> @@ -127,6 +127,21 @@ struct vhost_memory {
>  /* Set eventfd to signal an error */
>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>  
> +/* Query upper limit on nregions in VHOST_SET_MEM_TABLE arguments.
> + * Returns:
> + *   0 < value <= MAX_INT - gives the upper limit, higher values will fail
> + *   0 - there's no static limit: try and see if it works
> + *   -1 - on failure
> + */
> +#define VHOST_GET_MEM_MAX_NREGIONS   _IO(VHOST_VIRTIO, 0x23)
> +
> +/* Returned by VHOST_GET_MEM_MAX_NREGIONS to mean there's no static limit:
> + * try and it'll work if you are lucky. */
> +#define VHOST_MEM_MAX_NREGIONS_NONE 0
is it needed? we always have a limit,
or don't have IOCTL => -1 => old try and see way

> +/* We support at least as many nregions in VHOST_SET_MEM_TABLE:
> + * for use on legacy kernels without VHOST_GET_MEM_MAX_NREGIONS support. */
> +#define VHOST_MEM_MAX_NREGIONS_DEFAULT 64
^^^ not used below,
if it's for legacy then perhaps s/DEFAULT/LEGACY/ 

> +
>  /* VHOST_NET specific defines */
>  
>  /* Attach virtio net ring to a raw socket, or tap device.
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 9e8e004..3b68f9d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -917,6 +917,11 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
> ioctl, void __user *argp)
>   long r;
>   int i, fd;
>  
> + if (ioctl == VHOST_GET_MEM_MAX_NREGIONS) {
> + r = VHOST_MEMORY_MAX_NREGIONS;
> + goto done;
> + }
> +
>   /* If you are not the owner, you can become one */
>   if (ioctl == VHOST_SET_OWNER) {
>   r = vhost_dev_set_owner(d);



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-22 Thread Igor Mammedov
On Fri, 19 Jun 2015 18:33:39 +0200
"Michael S. Tsirkin"  wrote:

> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > > > We could, but I/O is just an example.  It can be I/O, a network ring,
> > > > whatever.  We cannot audit all address_space_map uses.
> > > > 
> > >
> > > No need to audit them all: defer device_add using an hva range until
> > > address_space_unmap drops using hvas in range drops reference count to
> > > 0.
> > 
> > That could be forever.  You certainly don't want to lockup the monitor
> > forever just because a device model isn't too friendly to memory hot-unplug.
> 
> We can defer the addition, no need to lockup the monitor.
> 
> > That's why you need to audit them (also, it's perfectly in the device
> > model's right to use address_space_unmap this way: it's the guest that's
> > buggy and leaves a dangling reference to a region before unplugging it).
> > 
> > Paolo
> 
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.
So far HVA is unusable even if we make this assumption and let the guest
crash. virtio-net doesn't work with it anyway: translation of GPA to HVA for
descriptors works as expected (correctly), but virtio backed by the
vhost+HVA hack still can't send/receive packets.

That's why I prefer to merge the kernel solution first, as a stable solution
that doesn't introduce any issues, and work on the userspace approach on
top of that.

Hopefully it can be done, but we would still need time
to iron out the side effects/issues it causes or could cause before that
fix becomes stable enough for production.



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 16:47:33 +0200
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> > > On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> > >>
> > >>
> > >> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > >>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > >>>> Lets leave decision upto users instead of making them live with
> > >>>> crashing guests.
> > >>>
> > >>> Come on, let's fix it in userspace.
> > >>
> > >> It's not trivial to fix it in userspace.  Since QEMU uses RCU there
> > >> isn't a single memory map to use for a linear gpa->hva map.
> > > 
> > > Could you elaborate?
> > > 
> > > I'm confused by this mention of RCU.
> > > You use RCU for accesses to the memory map, correct?
> > > So memory map itself is a write side operation, as such all you need to
> > > do is take some kind of lock to prevent conflicting with other memory
> > > maps, do rcu sync under this lock.
> > 
> > You're right, the problem isn't directly related to RCU.  RCU would be
> > easy to handle by using synchronize_rcu instead of call_rcu.  While I
> > identified an RCU-related problem with Igor's patches, it's much more
> > entrenched.
> > 
> > RAM can be used by asynchronous operations while the VM runs, between
> > address_space_map and address_space_unmap.  It is possible and common to
> > have a quiescent state between the map and unmap, and a memory map
> > change can happen in the middle of this.  Normally this is not a
> > problem, because changes to the memory map do not make the hva go away
> > (memory regions are reference counted).
> 
> Right, so you want mmap(MAP_NORESERVE) when that reference
> count becomes 0.
> 
> > However, with Igor's patches a memory_region_del_subregion will cause a
> > mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> > 
> > I guess one way to do it would be to alias the same page in two places,
> > one for use by vhost and one for use by everything else.  However, the
> > kernel does not provide the means to do this kind of aliasing for
> > anonymous mmaps.
> > 
> > Paolo
> 
> Basically pages go away on munmap, so won't simple
>   lock
>   munmap
>   mmap(MAP_NORESERVE)
>   unlock
> do the trick?
at what time are you suggesting to do this?





Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 13:41:22 +0200
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > Lets leave decision upto users instead of making them live with
> > crashing guests.
> 
> Come on, let's fix it in userspace.
I'm not abandoning the userspace approach either, but it might take time
to implement in a robust manner, as it's much more complex and has many
more places to backfire than a straightforward kernel fix which will
work for both old userspace and new.



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Thu, 18 Jun 2015 11:50:22 +0200
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 18:30:02 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 17:38:40 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > 
> > > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > > > it needs to consume memory it desires.
> > > > > > > > > 
> > > > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > > > dropped.
> > > > > > > > 
> > > > > > > > Then what's the concern anyway?
> > > > > > > > 
> > > > > > > > Paolo
> > > > > > > 
> > > > > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > > > a ton of these fds.
> > > > > > if privileged tool gives out unlimited amount of fds then it
> > > > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > > > 
> > > > > 
> > > > > Of course it does not give out unlimited fds, there's a way
> > > > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > > > uses vhost, it should become clear I think.
> > > > then it just means that tool has to take into account a new limits
> > > > to partition host in sensible manner.
> > > 
> > > Meanwhile old tools are vulnerable to OOM attacks.
> > I've chatted with libvirt folks, it doesn't care about how much memory
> > vhost would consume nor do any host capacity planning in that regard.
> 
> Exactly, it's up to host admin.
> 
> > But lets assume that there are tools that do this so
> > how about instead of hardcoding limit make it a module parameter
> > with default set to 64. That would allow users to set higher limit
> > if they need it and nor regress old tools. it will also give tools
> > interface for reading limit from vhost module.
> 
> And now you need to choose between security and functionality :(
There is no conflict here and it's not about choosing.
If the admin has a method to estimate guest memory footprint
to do capacity partitioning, then he would need to redo the
partitioning, taking the new footprint into account, when
he/she raises the limit manually.

(BTW, libvirt has tried and reverted patches that were trying to
predict required memory; the admin might be able to do it manually
better, but how to do that is another topic and it's not related
to this thread.)

Let's leave the decision up to users instead of making them live with
crashing guests.

> 
> > > 
> > > > Exposing limit as module parameter might be of help to tool for
> > > > getting/setting it in a way it needs.
> > > 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-18 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > it needs to consume memory it desires.
> > > > > > > 
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > > 
> > > > > > Then what's the concern anyway?
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > 
> > > 
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
> 
> Meanwhile old tools are vulnerable to OOM attacks.
I've chatted with libvirt folks: it doesn't care about how much memory
vhost would consume, nor does it do any host capacity planning in that
regard.

But let's assume that there are tools that do this, so how about, instead
of hardcoding the limit, making it a module parameter with the default
set to 64? That would allow users to set a higher limit if they need it
and not regress old tools. It will also give tools an interface for
reading the limit from the vhost module.

> 
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
> 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:47:18 +0200
Paolo Bonzini  wrote:

> 
> 
> On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> 
> 
>  On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> > Meanwhile old tools are vulnerable to OOM attacks.
> 
>  For each vhost device there will be likely one tap interface,
>  and I suspect that it takes way, way more than 16KB of memory.
> >>>
> >>> That's not true. We have a vhost device per queue, all queues
> >>> are part of a single tap device.
> >>
> >> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
> > 
> > That's up to you as a kvm maintainer :)
> 
> Not easy, when the CPU alone requires three (albeit non-consecutive)
> pages for the VMCS, the APIC access page and the EPT root.
> 
> > People are already concerned about vhost device
> > memory usage, I'm not happy to define our user/kernel interface
> > in a way that forces even more memory to be used up.
> 
> So, the questions to ask are:
> 
> 1) What is the memory usage like immediately after vhost is brought
> up, apart from these 16K?
> 
> 2) Is there anything in vhost that allocates a user-controllable
> amount of memory?
> 
> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?  Does it depend on the size of the
> virtqueues?
> 
> 4) Would it make sense to share memory regions between multiple vhost
> devices?  Would it be hard to implement?  It would also make memory
> operations O(1) rather than O(#cpus).
> 
> Paolo

In addition to that, could vhost share the memory map with KVM, i.e. use
its memslots instead of duplicating them?


Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in
> > > > > > > > any way, userspace still can create as many vhost
> > > > > > > > instances as it needs to consume memory it desires.
> > > > > > > 
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > > 
> > > > > > Then what's the concern anyway?
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Each fd now ties up 16K of kernel memory.  It didn't use to,
> > > > > so priveledged tool could safely give the unpriveledged
> > > > > userspace a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be
> > > > DoSed.
> > > > 
> > > 
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
> 
> Meanwhile old tools are vulnerable to OOM attacks.
Let's keep the old limit by default and allow overriding it via a module
parameter; that way old tools won't be affected and new tools
can set the limit the way they need.
That will accommodate the current slot-hungry userspace as well as a new
one with a continuous HVA, and won't regress old tools.

> 
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
> 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 17:38:40 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 16:32:02 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > I don't think it's a valid concern in this case,
> > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > way, userspace still can create as many vhost instances as
> > > > > > it needs to consume memory it desires.
> > > > > 
> > > > > Not really since vhost char device isn't world-accessible.
> > > > > It's typically opened by a priveledged tool, the fd is
> > > > > then passed to an unpriveledged userspace, or permissions
> > > > > dropped.
> > > > 
> > > > Then what's the concern anyway?
> > > > 
> > > > Paolo
> > > 
> > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > priveledged tool could safely give the unpriveledged userspace
> > > a ton of these fds.
> > if privileged tool gives out unlimited amount of fds then it
> > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > 
> 
> Of course it does not give out unlimited fds, there's a way
> for the sysadmin to specify the number of fds. Look at how libvirt
> uses vhost, it should become clear I think.
Then it just means that the tool has to take the new limits into account
to partition the host in a sensible manner.
Exposing the limit as a module parameter might be of help to the tool for
getting/setting it in the way it needs.


Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 16:32:02 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > Considering userspace can be malicious, I guess yes.
> > > > I don't think it's a valid concern in this case,
> > > > setting limit back from 509 to 64 will not help here in any way,
> > > > userspace still can create as many vhost instances as it needs
> > > > to consume memory it desires.
> > > 
> > > Not really since vhost char device isn't world-accessible.
> > > It's typically opened by a priveledged tool, the fd is
> > > then passed to an unpriveledged userspace, or permissions dropped.
> > 
> > Then what's the concern anyway?
> > 
> > Paolo
> 
> Each fd now ties up 16K of kernel memory.  It didn't use to, so
> priveledged tool could safely give the unpriveledged userspace
> a ton of these fds.
If a privileged tool gives out an unlimited number of fds, then it
doesn't matter whether an fd ties up 4K or 16K; the host could still be
DoSed.




[PATCH v2 0/6] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
Reference to the previous version's discussion:
[PATCH 0/5] vhost: support upto 509 memory regions
http://www.spinics.net/lists/kvm/msg117654.html

Changelog v1->v2:
  * fix spelling errors
  * move "vhost: support upto 509 memory regions" to the end of queue
  * move kvfree() from 1/6 to 2/6 where it belongs
  * add vhost module parameter to enable/disable translation caching

The series extends vhost to support up to 509 memory regions,
and adds some vhost translate_desc() performance improvements
so it won't regress when memslots are increased to 509.

It fixes a running VM crashing during memory hotplug due
to vhost refusing to accept more than 64 memory regions.

It's only a host-kernel-side fix to make it work with QEMU
versions that support memory hotplug. But I'll continue
to work on a QEMU-side solution to reduce the number of memory
regions to make things even better.

Performance-wise, for a guest (in my case 3 memory regions) running
netperf's UDP_RR workload, translate_desc() execution time as a share of
the total workload is:

Memory      |1G RAM|cached|non cached
regions #   |  3   |  53  |  53
------------+------+------+----------
upstream    | 0.3% |  -   | 3.5%
this series | 0.2% | 0.5% | 0.7%

where the "non cached" column reflects a thrashing workload with constant
cache misses. More details on timing are in the respective patches.

Igor Mammedov (6):
  vhost: use binary search instead of linear in find_region()
  vhost: extend memory regions allocation to vmalloc
  vhost: add per VQ memory region caching
  vhost: translate_desc: optimization for desc.len < region size
  vhost: add 'translation_cache' module parameter
  vhost: support upto 509 memory regions

 drivers/vhost/vhost.c | 105 ++
 drivers/vhost/vhost.h |   1 +
 2 files changed, 82 insertions(+), 24 deletions(-)

-- 
1.8.3.1



[PATCH v2 3/6] vhost: add per VQ memory region caching

2015-06-17 Thread Igor Mammedov
That brings the translate_desc() cost down to around 210ns
if the accessed descriptors are from the same memory region.

Signed-off-by: Igor Mammedov 
---
that's the access pattern the netperf/iperf workloads showed during testing.
---
 drivers/vhost/vhost.c | 16 +---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..5c39a1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -200,6 +200,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call = NULL;
vq->log_ctx = NULL;
vq->memory = NULL;
+   vq->cached_reg = 0;
 }
 
 static int vhost_worker(void *data)
@@ -649,6 +650,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
for (i = 0; i < d->nvqs; ++i) {
mutex_lock(&d->vqs[i]->mutex);
d->vqs[i]->memory = newmem;
+   d->vqs[i]->cached_reg = 0;
mutex_unlock(&d->vqs[i]->mutex);
}
kvfree(oldmem);
@@ -936,11 +938,17 @@ done:
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-__u64 addr, __u32 len)
+__u64 addr, __u32 len,
+int *cached_reg)
 {
const struct vhost_memory_region *reg;
int start = 0, end = mem->nregions;
 
+   reg = mem->regions + *cached_reg;
+   if (likely(addr >= reg->guest_phys_addr &&
+   reg->guest_phys_addr + reg->memory_size > addr))
+   return reg;
+
while (start < end) {
int slot = start + (end - start) / 2;
reg = mem->regions + slot;
@@ -952,8 +960,10 @@ static const struct vhost_memory_region 
*find_region(struct vhost_memory *mem,
 
reg = mem->regions + start;
if (addr >= reg->guest_phys_addr &&
-   reg->guest_phys_addr + reg->memory_size > addr)
+   reg->guest_phys_addr + reg->memory_size > addr) {
+   *cached_reg = start;
return reg;
+   }
return NULL;
 }
 
@@ -1107,7 +1117,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 
addr, u32 len,
ret = -ENOBUFS;
break;
}
-   reg = find_region(mem, addr, len);
+   reg = find_region(mem, addr, len, &vq->cached_reg);
if (unlikely(!reg)) {
ret = -EFAULT;
break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..68bd00f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,6 +106,7 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+   int cached_reg;
 };
 
 struct vhost_dev {
-- 
1.8.3.1



[PATCH v2 2/6] vhost: extend memory regions allocation to vmalloc

2015-06-17 Thread Igor Mammedov
With a large number of memory regions we could end up with
high order allocations, and kmalloc could fail if the
host is under memory pressure.
Considering that the memory regions array is used on a hot path,
try harder to allocate using kmalloc, and if that fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at a QEMU-side solution to reduce the number of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large number of memory
regions.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f1e07b8..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
fput(dev->log_file);
dev->log_file = NULL;
/* No one will access memory at this point */
-   kfree(dev->memory);
+   kvfree(dev->memory);
dev->memory = NULL;
WARN_ON(!list_empty(&dev->work_list));
if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const 
void *p2)
return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+   void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+   if (!n) {
+   n = vzalloc(size);
+   if (!n)
+   return ERR_PTR(-ENOMEM);
+   }
+   return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user 
*m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,21 +625,21 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
return -EOPNOTSUPP;
if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
return -E2BIG;
-   newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+   newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
 
memcpy(newmem, &mem, size);
if (copy_from_user(newmem->regions, m->regions,
   mem.nregions * sizeof *m->regions)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
-   kfree(newmem);
+   kvfree(newmem);
return -EFAULT;
}
oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
d->vqs[i]->memory = newmem;
mutex_unlock(&d->vqs[i]->mutex);
}
-   kfree(oldmem);
+   kvfree(oldmem);
return 0;
 }
 
-- 
1.8.3.1



[PATCH v2 1/6] vhost: use binary search instead of linear in find_region()

2015-06-17 Thread Igor Mammedov
For default region layouts performance stays the same
as with linear search, i.e. it takes around 210ns on average for
translate_desc(), which inlines find_region().

But it scales better with a larger number of regions:
235ns for binary search vs 300ns for linear search with 55 memory regions,
and the values will stay about the same when the allowed number
of slots is increased to 509, as has been done in KVM.

Signed-off-by: Igor Mammedov 
---
v2:
  move kvfree() to 2/2 where it belongs
---
 drivers/vhost/vhost.c | 36 +++-
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..f1e07b8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vhost.h"
 
@@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
+{
+   const struct vhost_memory_region *r1 = p1, *r2 = p2;
+   if (r1->guest_phys_addr < r2->guest_phys_addr)
+   return 1;
+   if (r1->guest_phys_addr > r2->guest_phys_addr)
+   return -1;
+   return 0;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user 
*m)
 {
struct vhost_memory mem, *newmem, *oldmem;
@@ -612,6 +623,8 @@ static long vhost_set_memory(struct vhost_dev *d, struct 
vhost_memory __user *m)
kfree(newmem);
return -EFAULT;
}
+   sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
+   vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
kfree(newmem);
@@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 __u64 addr, __u32 len)
 {
-   struct vhost_memory_region *reg;
-   int i;
+   const struct vhost_memory_region *reg;
+   int start = 0, end = mem->nregions;
 
-   /* linear search is not brilliant, but we really have on the order of 6
-* regions in practice */
-   for (i = 0; i < mem->nregions; ++i) {
-   reg = mem->regions + i;
-   if (reg->guest_phys_addr <= addr &&
-   reg->guest_phys_addr + reg->memory_size - 1 >= addr)
-   return reg;
+   while (start < end) {
+   int slot = start + (end - start) / 2;
+   reg = mem->regions + slot;
+   if (addr >= reg->guest_phys_addr)
+   end = slot;
+   else
+   start = slot + 1;
}
+
+   reg = mem->regions + start;
+   if (addr >= reg->guest_phys_addr &&
+   reg->guest_phys_addr + reg->memory_size > addr)
+   return reg;
return NULL;
 }
 
-- 
1.8.3.1
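
As a standalone illustration (not kernel code) of the search above: the
comparator sorts regions by guest_phys_addr in descending order (it returns
1 when r1 < r2), so the loop is effectively a lower-bound search for the
first region whose start is <= addr. The sketch below follows the same
logic and adds an explicit bounds check for the miss case:

#include <stdio.h>
#include <stdint.h>

struct region { uint64_t gpa, size; };

static const struct region *find(const struct region *r, int n, uint64_t addr)
{
	int start = 0, end = n;

	while (start < end) {
		int slot = start + (end - start) / 2;

		if (addr >= r[slot].gpa)
			end = slot;		/* candidate found, look further left */
		else
			start = slot + 1;	/* region starts above addr, go right */
	}
	if (start < n && addr >= r[start].gpa &&
	    addr - r[start].gpa < r[start].size)
		return &r[start];
	return NULL;
}

int main(void)
{
	/* descending by gpa, as the sort() call in the patch arranges them */
	static const struct region regs[] = {
		{ 0x100000000ULL, 0x40000000ULL },	/* 4G..5G  */
		{ 0x000100000ULL, 0xbff00000ULL },	/* 1M..3G  */
		{ 0x000000000ULL, 0x000a0000ULL },	/* 0..640K */
	};
	const struct region *hit = find(regs, 3, 0x12345678ULL);

	printf("0x12345678 -> %s\n", hit ? "found" : "not found");
	return 0;
}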



[PATCH v2 6/6] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger number of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net, since the host kernel's
vhost-net module refuses to accept more than 64
memory regions.

Increase the VHOST_MEMORY_MAX_NREGIONS limit from 64 to 509
to match KVM_USER_MEM_SLOTS, fixing the issue for vhost-net
and current QEMU versions.

Signed-off-by: Igor Mammedov 
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 78290b7..e93023e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,7 +37,7 @@ MODULE_PARM_DESC(translation_cache,
" Set to 0 to disable. (default: 1)");
 
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
+   VHOST_MEMORY_MAX_NREGIONS = 509,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-- 
1.8.3.1



[PATCH v2 4/6] vhost: translate_desc: optimization for desc.len < region size

2015-06-17 Thread Igor Mammedov
When translating descriptors, they are typically smaller than
the memory region that holds them and are translated into 1 iov
entry, so it's not necessary to check the remaining length
twice and calculate the used length and next address
in such cases.

Replace the remaining-length and 'size' increment branches
with a single remaining-length check and execute the
next-iov steps only when needed.

It saves a tiny 2% of translate_desc() execution time.

Signed-off-by: Igor Mammedov 
---
PS:
I'm not sure if iov_size > 0 is always true; if it's not,
it's better to drop this patch.
---
 drivers/vhost/vhost.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5c39a1e..5bcb323 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -,12 +,8 @@ static int translate_desc(struct vhost_virtqueue *vq, 
u64 addr, u32 len,
int ret = 0;
 
mem = vq->memory;
-   while ((u64)len > s) {
+   do {
u64 size;
-   if (unlikely(ret >= iov_size)) {
-   ret = -ENOBUFS;
-   break;
-   }
reg = find_region(mem, addr, len, &vq->cached_reg);
if (unlikely(!reg)) {
ret = -EFAULT;
@@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, 
u64 addr, u32 len,
}
_iov = iov + ret;
size = reg->memory_size - addr + reg->guest_phys_addr;
-   _iov->iov_len = min((u64)len - s, size);
_iov->iov_base = (void __user *)(unsigned long)
(reg->userspace_addr + addr - reg->guest_phys_addr);
+   ++ret;
+   if (likely((u64)len - s < size)) {
+   _iov->iov_len = (u64)len - s;
+   break;
+   }
+
+   if (unlikely(ret >= iov_size)) {
+   ret = -ENOBUFS;
+   break;
+   }
+   _iov->iov_len = size;
s += size;
addr += size;
-   ++ret;
-   }
+   } while (1);
 
return ret;
 }
-- 
1.8.3.1



[PATCH v2 5/6] vhost: add 'translation_cache' module parameter

2015-06-17 Thread Igor Mammedov
By default, translation of virtqueue descriptors is done with
caching enabled, but caching only adds extra cost
in the case of a thrashing workload where the majority of descriptors
are translated to different memory regions.
So add an option to exclude the cache-miss cost in such cases.

Performance with caching enabled for a sequential workload
doesn't seem to be affected much vs the version without the static-key
switch, i.e. still the same 0.2% of total time, with the key (NOPs)
consuming 5ms on a 5-minute workload.

Signed-off-by: Igor Mammedov 
---
I don't have a test case for a thrashing workload though, but the jmp
instruction adds up to ~6ms (55M instructions), versus around 24ms for
the caching it excludes, on a 5-minute workload.
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5bcb323..78290b7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -29,6 +29,13 @@
 
 #include "vhost.h"
 
+struct static_key translation_cache_key = STATIC_KEY_INIT_TRUE;
+static bool translation_cache = true;
+module_param(translation_cache, bool, 0444);
+MODULE_PARM_DESC(translation_cache,
+   "Enables/disables virtqueue descriptor translation caching,"
+   " Set to 0 to disable. (default: 1)");
+
 enum {
VHOST_MEMORY_MAX_NREGIONS = 64,
VHOST_MEMORY_F_LOG = 0x1,
@@ -944,10 +951,12 @@ static const struct vhost_memory_region 
*find_region(struct vhost_memory *mem,
const struct vhost_memory_region *reg;
int start = 0, end = mem->nregions;
 
-   reg = mem->regions + *cached_reg;
-   if (likely(addr >= reg->guest_phys_addr &&
-   reg->guest_phys_addr + reg->memory_size > addr))
-   return reg;
+   if (static_key_true(&translation_cache_key)) {
+   reg = mem->regions + *cached_reg;
+   if (likely(addr >= reg->guest_phys_addr &&
+   reg->guest_phys_addr + reg->memory_size > addr))
+   return reg;
+   }
 
while (start < end) {
int slot = start + (end - start) / 2;
@@ -1612,6 +1621,9 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
 static int __init vhost_init(void)
 {
+   if (!translation_cache)
+   static_key_slow_dec(&translation_cache_key);
+
return 0;
 }
 
-- 
1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 13:51:56 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > > So far it's kernel limitation and this patch fixes crashes
> > > > that users see now, with the rest of patches enabling performance
> > > > not to regress.
> > > 
> > > When I say regression I refer to an option to limit the array
> > > size again after userspace started using the larger size.
> > Is there a need to do so?
> 
> Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case; setting the limit
back from 509 to 64 will not help here in any way, since userspace
can still create as many vhost instances as it needs to consume
the memory it desires.

> 
> > Userspace that cares about memory footprint won't use many slots
> > keeping it low and user space that can't do without many slots
> > or doesn't care will have bigger memory footprint.
> 
> We really can't trust userspace to do the right thing though.
> 



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 12:46:09 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 12:11:09 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 09:39:06 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > 
> > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > > > 
> > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > > > since commit
> > > > > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > > > 
> > > > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > > > registering hotplugged memory.
> > > > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > > > memory regions.
> > > > > > > > > > 
> > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > > > 
> > > > > > > > > It was 64, not 65.
> > > > > > > > > 
> > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Igor Mammedov 
> > > > > > > > > 
> > > > > > > > > Still thinking about this: can you reorder this to
> > > > > > > > > be the last patch in the series please?
> > > > > > > > sure
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Also - 509?
> > > > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > > > KVM's allotment of memory slots for userspace side.
> > > > > > > 
> > > > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > > > why we need to match this exactly.
> > > > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > > > at 256+[initial fragmented memory]
> > > > > 
> > > > > But what's the point? We allocate 32 bytes per slot.
> > > > > 300*32 = 9600 which is more than 8K, so we are doing
> > > > > an order-3 allocation anyway.
> > > > > If we could cap it at 8K (256 slots) that would make sense
> > > > > since we could avoid wasting vmalloc space.
> > > > 256 is amount of hotpluggable slots  and there is no way
> > > > to predict how initial memory would be fragmented
> > > > (i.e. amount of slots it would take), if we guess wrong
> > > > we are back to square one with crashing userspace.
> > > > So I'd stay consistent with KVM's limit 509 since
> > > > it's only limit, i.e. not actual amount of allocated slots.
> > > > 
> > > > > I'm still not very happy with the whole approach,
> > > > > giving userspace ability allocate 4 whole pages
> > > > > of kernel memory like this.
> > > > I'm working in parallel so that userspace won't take so
> > > > many slots but it won't prevent its current versions
> > > > crashing due to kernel limitation.
> > > 
> > > Right but at least it's not a regression. If we promise userspace to
> >

Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 12:11:09 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 09:39:06 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > "Michael S. Tsirkin"  wrote:
> > > > > > 
> > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > since commit
> > > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > 
> > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > registering hotplugged memory.
> > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > memory regions.
> > > > > > > > 
> > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > 
> > > > > > > It was 64, not 65.
> > > > > > > 
> > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > 
> > > > > > > > Signed-off-by: Igor Mammedov 
> > > > > > > 
> > > > > > > Still thinking about this: can you reorder this to
> > > > > > > be the last patch in the series please?
> > > > > > sure
> > > > > > 
> > > > > > > 
> > > > > > > Also - 509?
> > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > KVM's allotment of memory slots for userspace side.
> > > > > 
> > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > why we need to match this exactly.
> > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > at 256+[initial fragmented memory]
> > > 
> > > But what's the point? We allocate 32 bytes per slot.
> > > 300*32 = 9600 which is more than 8K, so we are doing
> > > an order-3 allocation anyway.
> > > If we could cap it at 8K (256 slots) that would make sense
> > > since we could avoid wasting vmalloc space.
> > 256 is amount of hotpluggable slots  and there is no way
> > to predict how initial memory would be fragmented
> > (i.e. amount of slots it would take), if we guess wrong
> > we are back to square one with crashing userspace.
> > So I'd stay consistent with KVM's limit 509 since
> > it's only limit, i.e. not actual amount of allocated slots.
> > 
> > > I'm still not very happy with the whole approach,
> > > giving userspace ability allocate 4 whole pages
> > > of kernel memory like this.
> > I'm working in parallel so that userspace won't take so
> > many slots but it won't prevent its current versions
> > crashing due to kernel limitation.
> 
> Right but at least it's not a regression. If we promise userspace to
> support a ton of regions, we can't take it back later, and I'm concerned
> about the memory usage.
> 
> I think it's already safe to merge the binary lookup patches, and maybe
> cache and vmalloc, so that the remaining patch will be small.
It isn't a regression with switching to binary search and increasing
slots to 509 either; performance wise it's more on the improvement side.
And I was thinking about memory usage as well; that's why I've dropped
the faster radix tree in favor of a more compact array. I can't do better
on the kernel side of the fix (rough numbers after this message).

Yes, we will give userspace the ability to use more slots and lock up
more memory if it's not able to consolidate memory regions, but that
leaves an option for the user to run a guest with vhost performance
vs crashing it at runtime.

userspace/targets that
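
Rough numbers behind the memory-usage trade-off above, assuming the 32 bytes
per region quoted earlier in the thread and 4 KiB pages (raw array sizes only;
the allocator rounds these up to its bucket sizes, which is what the
page-order remarks refer to):

#include <stdio.h>

int main(void)
{
	const int bytes_per_region = 32;	/* per the thread above */
	const int page = 4096;
	const int slots[] = { 64, 256, 300, 509 };
	unsigned i;

	for (i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
		int bytes = slots[i] * bytes_per_region;
		printf("%3d regions: %5d bytes (~%d pages raw)\n",
		       slots[i], bytes, (bytes + page - 1) / page);
	}
	return 0;
}

That is roughly 2 KiB for the old 64-region limit vs 16 KiB for 509 regions
per region array, which is the cost being weighed against memory hotplug
simply not working with vhost.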

Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 09:39:06 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 08:34:26 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > since commit
> > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > 
> > > > > > it became possible to use a bigger amount of memory
> > > > > > slots, which is used by memory hotplug for
> > > > > > registering hotplugged memory.
> > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > in module vhost-net refuses to accept more than 65
> > > > > > memory regions.
> > > > > > 
> > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > 
> > > > > It was 64, not 65.
> > > > > 
> > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > 
> > > > > > Signed-off-by: Igor Mammedov 
> > > > > 
> > > > > Still thinking about this: can you reorder this to
> > > > > be the last patch in the series please?
> > > > sure
> > > > 
> > > > > 
> > > > > Also - 509?
> > > > userspace memory slots in terms of KVM, I made it match
> > > > KVM's allotment of memory slots for userspace side.
> > > 
> > > Maybe KVM has its reasons for this #. I don't see
> > > why we need to match this exactly.
> > np, I can cap it at safe 300 slots but it's unlikely that it
> > would take cut off 1 extra hop since it's capped by QEMU
> > at 256+[initial fragmented memory]
> 
> But what's the point? We allocate 32 bytes per slot.
> 300*32 = 9600 which is more than 8K, so we are doing
> an order-3 allocation anyway.
> If we could cap it at 8K (256 slots) that would make sense
> since we could avoid wasting vmalloc space.
256 is the number of hotpluggable slots, and there is no way
to predict how fragmented the initial memory will be
(i.e. how many slots it will take); if we guess wrong
we are back to square one with crashing userspace.
So I'd stay consistent with KVM's limit of 509, since
it's only a limit, i.e. not the actual number of allocated slots.

> I'm still not very happy with the whole approach,
> giving userspace ability allocate 4 whole pages
> of kernel memory like this.
I'm working in parallel on making userspace not take so
many slots, but that won't prevent its current versions
from crashing due to the kernel limitation.

 
> > > > > I think if we are changing this, it'd be nice to
> > > > > create a way for userspace to discover the support
> > > > > and the # of regions supported.
> > > > That was my first idea before extending KVM's memslots
> > > > to teach kernel to tell qemu this number so that QEMU
> > > > at least would be able to check if new memory slot could
> > > > be added but I was redirected to a more simple solution
> > > > of just extending vs everdoing things.
> > > > Currently QEMU supports upto ~250 memslots so 509
> > > > is about twice high we need it so it should work for near
> > > > future
> > > 
> > > Yes but old kernels are still around. Would be nice if you
> > > can detect them.
> > > 
> > > > but eventually we might still teach kernel and QEMU
> > > > to make things more robust.
> > > 
> > > A new ioctl would be easy to add, I think it's a good
> > > idea generally.
> > I can try to do something like this on top of this series.
> > 
> > > 
> > > > > 
> > > > > 
> > > > > > ---
> > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > index 99931a0..6a18c92 100644
> > > > > > --- a/drivers/vhost/vhost.c
> > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > @@ -30,7 +30,7 @@
> > > > > >  #include "vhost.h"
> > > > > >  
> > > > > >  enum {
> > > > > > -   VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > +   VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > >  };
> > > > > >  
> > > > > > -- 
> > > > > > 1.8.3.1



Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 08:31:23 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 12:19:15AM +0200, Igor Mammedov wrote:
> > On Tue, 16 Jun 2015 23:16:07 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > > > Series extends vhost to support upto 509 memory regions,
> > > > and adds some vhost:translate_desc() performance improvemnts
> > > > so it won't regress when memslots are increased to 509.
> > > > 
> > > > It fixes running VM crashing during memory hotplug due
> > > > to vhost refusing accepting more than 64 memory regions.
> > > > 
> > > > It's only host kernel side fix to make it work with QEMU
> > > > versions that support memory hotplug. But I'll continue
> > > > to work on QEMU side solution to reduce amount of memory
> > > > regions to make things even better.
> > > 
> > > I'm concerned userspace work will be harder, in particular,
> > > performance gains will be harder to measure.
> > it appears so, so far.
> > 
> > > How about a flag to disable caching?
> > I've tried to measure cost of cache miss but without much luck,
> > difference between version with cache and with caching removed
> > was within margin of error (±10ns) (i.e. not mensurable on my
> > 5min/10*10^6 test workload).
> 
> Confused. I thought it was very much measureable.
> So why add a cache if you can't measure its effect?
I haven't been able to measure the immediate delta between function
start/end with precision better than 10ns; perhaps the method used
(SystemTap) is to blame.
But it's still possible to measure it indirectly, like the 2% in 5/5.

> 
> > Also I'm concerned about adding extra fetch+branch for flag
> > checking will make things worse for likely path of cache hit,
> > so I'd avoid it if possible.
> > 
> > Or do you mean a simple global per module flag to disable it and
> > wrap thing in static key so that it will be cheap jump to skip
> > cache?
> 
> Something like this, yes.
ok, will do.

> 
> > > > Performance wise for guest with (in my case 3 memory regions)
> > > > and netperf's UDP_RR workload translate_desc() execution
> > > > time from total workload takes:
> > > > 
> > > > Memory  |1G RAM|cached|non cached
> > > > regions #   |  3   |  53  |  53
> > > > 
> > > > upstream| 0.3% |  -   | 3.5%
> > > > 
> > > > this series | 0.2% | 0.5% | 0.7%
> > > > 
> > > > where "non cached" column reflects trashing wokload
> > > > with constant cache miss. More details on timing in
> > > > respective patches.
> > > > 
> > > > Igor Mammedov (5):
> > > >   vhost: use binary search instead of linear in find_region()
> > > >   vhost: extend memory regions allocation to vmalloc
> > > >   vhost: support upto 509 memory regions
> > > >   vhost: add per VQ memory region caching
> > > >   vhost: translate_desc: optimization for desc.len < region size
> > > > 
> > > >  drivers/vhost/vhost.c | 95
> > > > +--
> > > > drivers/vhost/vhost.h |  1 + 2 files changed, 71 insertions(+),
> > > > 25 deletions(-)
> > > > 
> > > > -- 
> > > > 1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-17 Thread Igor Mammedov
On Wed, 17 Jun 2015 08:34:26 +0200
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > On Tue, 16 Jun 2015 23:14:20 +0200
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > since commit
> > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > 
> > > > it became possible to use a bigger amount of memory
> > > > slots, which is used by memory hotplug for
> > > > registering hotplugged memory.
> > > > However QEMU crashes if it's used with more than ~60
> > > > pc-dimm devices and vhost-net since host kernel
> > > > in module vhost-net refuses to accept more than 65
> > > > memory regions.
> > > > 
> > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > 
> > > It was 64, not 65.
> > > 
> > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > 
> > > > Signed-off-by: Igor Mammedov 
> > > 
> > > Still thinking about this: can you reorder this to
> > > be the last patch in the series please?
> > sure
> > 
> > > 
> > > Also - 509?
> > userspace memory slots in terms of KVM, I made it match
> > KVM's allotment of memory slots for userspace side.
> 
> Maybe KVM has its reasons for this #. I don't see
> why we need to match this exactly.
np, I can cap it at a safe 300 slots, but it's unlikely that it
would cut off even 1 extra hop, since it's capped by QEMU
at 256+[initial fragmented memory]

> 
> > > I think if we are changing this, it'd be nice to
> > > create a way for userspace to discover the support
> > > and the # of regions supported.
> > That was my first idea before extending KVM's memslots
> > to teach kernel to tell qemu this number so that QEMU
> > at least would be able to check if new memory slot could
> > be added but I was redirected to a more simple solution
> > of just extending vs everdoing things.
> > Currently QEMU supports upto ~250 memslots so 509
> > is about twice high we need it so it should work for near
> > future
> 
> Yes but old kernels are still around. Would be nice if you
> can detect them.
> 
> > but eventually we might still teach kernel and QEMU
> > to make things more robust.
> 
> A new ioctl would be easy to add, I think it's a good
> idea generally.
I can try to do something like this on top of this series.

> 
> > > 
> > > 
> > > > ---
> > > >  drivers/vhost/vhost.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 99931a0..6a18c92 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -30,7 +30,7 @@
> > > >  #include "vhost.h"
> > > >  
> > > >  enum {
> > > > -   VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > +   VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > VHOST_MEMORY_F_LOG = 0x1,
> > > >  };
> > > >  
> > > > -- 
> > > > 1.8.3.1
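
On the discovery ioctl floated in this exchange: nothing like it exists in
the posted series, but purely as a hypothetical illustration, userspace
probing with a graceful fallback to the historical limit could look roughly
like this (the ioctl name and number below are invented for the sketch; only
VHOST_VIRTIO and the 64-region fallback come from existing code):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* HYPOTHETICAL ioctl, for illustration only; not part of this series */
#ifndef VHOST_GET_MAX_MEM_REGIONS
#define VHOST_GET_MAX_MEM_REGIONS _IOR(VHOST_VIRTIO, 0x70, unsigned int)
#endif

int main(void)
{
	unsigned int max_regions = 64;	/* historical VHOST_MEMORY_MAX_NREGIONS */
	int fd = open("/dev/vhost-net", O_RDWR);

	if (fd < 0) {
		perror("open /dev/vhost-net");
		return 1;
	}
	if (ioctl(fd, VHOST_GET_MAX_MEM_REGIONS, &max_regions) < 0)
		max_regions = 64;	/* old kernel: no discovery ioctl */
	printf("max vhost memory regions: %u\n", max_regions);
	close(fd);
	return 0;
}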



Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:16:07 +0200
"Michael S. Tsirkin"  wrote:

> On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > Series extends vhost to support upto 509 memory regions,
> > and adds some vhost:translate_desc() performance improvemnts
> > so it won't regress when memslots are increased to 509.
> > 
> > It fixes running VM crashing during memory hotplug due
> > to vhost refusing accepting more than 64 memory regions.
> > 
> > It's only host kernel side fix to make it work with QEMU
> > versions that support memory hotplug. But I'll continue
> > to work on QEMU side solution to reduce amount of memory
> > regions to make things even better.
> 
> I'm concerned userspace work will be harder, in particular,
> performance gains will be harder to measure.
it appears so, so far.

> How about a flag to disable caching?
I've tried to measure the cost of a cache miss but without much luck;
the difference between the version with the cache and the one with
caching removed was within the margin of error (±10ns), i.e. not
measurable on my 5min/10*10^6 test workload.
Also I'm concerned that adding an extra fetch+branch for flag
checking will make things worse for the likely path of a cache hit,
so I'd avoid it if possible.

Or do you mean a simple global per-module flag to disable it, wrapped
in a static key so that it will be a cheap jump to skip the cache?
 
> > Performance wise for guest with (in my case 3 memory regions)
> > and netperf's UDP_RR workload translate_desc() execution
> > time from total workload takes:
> > 
> > Memory  |1G RAM|cached|non cached
> > regions #   |  3   |  53  |  53
> > 
> > upstream| 0.3% |  -   | 3.5%
> > 
> > this series | 0.2% | 0.5% | 0.7%
> > 
> > where "non cached" column reflects trashing wokload
> > with constant cache miss. More details on timing in
> > respective patches.
> > 
> > Igor Mammedov (5):
> >   vhost: use binary search instead of linear in find_region()
> >   vhost: extend memory regions allocation to vmalloc
> >   vhost: support upto 509 memory regions
> >   vhost: add per VQ memory region caching
> >   vhost: translate_desc: optimization for desc.len < region size
> > 
> >  drivers/vhost/vhost.c | 95
> > +--
> > drivers/vhost/vhost.h |  1 + 2 files changed, 71 insertions(+), 25
> > deletions(-)
> > 
> > -- 
> > 1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:14:20 +0200
"Michael S. Tsirkin"  wrote:

> On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > since commit
> >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > 
> > it became possible to use a bigger amount of memory
> > slots, which is used by memory hotplug for
> > registering hotplugged memory.
> > However QEMU crashes if it's used with more than ~60
> > pc-dimm devices and vhost-net since host kernel
> > in module vhost-net refuses to accept more than 65
> > memory regions.
> > 
> > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> 
> It was 64, not 65.
> 
> > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > 
> > Signed-off-by: Igor Mammedov 
> 
> Still thinking about this: can you reorder this to
> be the last patch in the series please?
sure

> 
> Also - 509?
It's the number of userspace memory slots in terms of KVM; I made it
match KVM's allotment of memory slots for the userspace side.

> I think if we are changing this, it'd be nice to
> create a way for userspace to discover the support
> and the # of regions supported.
That was my first idea before extending KVM's memslots: teach the
kernel to tell QEMU this number so that QEMU at least would be able
to check whether a new memory slot could be added, but I was
redirected to a simpler solution of just extending vs overdoing things.
Currently QEMU supports up to ~250 memslots, so 509 is about twice as
high as we need; it should work for the near future, but eventually we
might still teach the kernel and QEMU to make things more robust.

> 
> 
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 99931a0..6a18c92 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -30,7 +30,7 @@
> >  #include "vhost.h"
> >  
> >  enum {
> > -   VHOST_MEMORY_MAX_NREGIONS = 64,
> > +   VHOST_MEMORY_MAX_NREGIONS = 509,
> > VHOST_MEMORY_F_LOG = 0x1,
> >  };
> >  
> > -- 
> > 1.8.3.1



  1   2   3   >