Re: use generic DMA mapping code in powerpc V4

2019-02-03 Thread Christoph Hellwig
On Sun, Feb 03, 2019 at 05:49:02PM +0100, Christian Zigotzky wrote:
> OK, next step: b50f42f0fe12965ead395c76bcb6a14f00cdf65b (powerpc/dma: use 
> the dma_direct mapping routines)
>
> git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a
>
> git checkout b50f42f0fe12965ead395c76bcb6a14f00cdf65b
>
> Results: The X1000 and X5000 boot but unfortunately the P.A. Semi Ethernet 
> doesn't work.

Are there any interesting messages in the boot log?  Can you send me
the dmesg?


Re: [PATCH] dmaengine: fsldma: Add 64-bit I/O accessors for powerpc64

2019-02-03 Thread Vinod Koul
On 25-01-19, 05:54, Peng Ma wrote:
> Hi Vinod,
> 
> Sorry to reply late.
> 1: This patch has already been sent to the patchwork.
>   Please see the patch link: https://patchwork.kernel.org/patch/10741521/
> 2: I have already compiled the fsl patches on arm and powerpc after applying
> https://patchwork.kernel.org/patch/10741521/
>   The compile is successful, please let me know the reported regression
> results, thanks very much.

And I thought there were further comments on this patch. I have applied
it; please watch for any reported failures.
-- 
~Vinod


Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-02-03 Thread David Gibson
On Sat, Jan 26, 2019 at 09:25:04AM +0100, Cédric Le Goater wrote:
> Was there a crashing.org shutdown?
> 
>   Received: from gate.crashing.org (gate.crashing.org [63.228.1.57])
>   by in5.mail.ovh.net (Postfix) with ESMTPS id 43mYnj0nrlz1N7KC
>   for ; Fri, 25 Jan 2019 22:38:00 +0000 (UTC)
>   Received: from localhost (localhost.localdomain [127.0.0.1])
>   by gate.crashing.org (8.14.1/8.14.1) with ESMTP id x0NLZf4K021092;
>   Wed, 23 Jan 2019 15:35:43 -0600
> 
> 
> On 1/23/19 10:35 PM, Benjamin Herrenschmidt wrote:
> > On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
> >>  Event Assignment Structure, a.k.a. IVE (Interrupt Virtualization Entry)
> >>
> >> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> >> Linux should be adjusted ...
> > 
> > All the names changed between the HW design and the "architecture"
> > document. The HW guys use the old names, the architecture the new
> > names, and Linux & OPAL mostly use the old ones because frankly the new
> > names suck big time.
> 
> Well, it does not make XIVE any clearer ... I did prefer the v1 names
> but there was some naming overlap in the concepts. 
> 
> >> It would be good to talk a little about the nested support (offline 
> >> may be) to make sure that we are not missing some major interface that 
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> > guests. That said, I haven't looked much at how nested is designed.
> > 
> > The size of the VP space is a bigger concern. Even today. We really
> > need qemu to tell the max #cpu to KVM so we can allocate less of them.
> 
> Ah yes. We would also need to reduce the number of available priorities
> per CPU to have more EQ descriptors available, if I recall correctly.
> 
> > As for nesting, I suggest for the foreseeable future we stick to XICS
> > emulation in nested guests.
> 
> ok. so no kernel_irqchip at all. hmm.  

That would certainly be step 0, making sure the capability advertises
this correctly.  I think we do want to make XICS-on-XIVE emulation
work in a KVM L1 (so we'd need to have it make XIVE hcalls to the L0
instead of OPAL calls).

XIVE-on-XIVE for L1 would be nice too, which would mean implementing
the XIVE hcalls from the L2 in terms of XIVE hcalls to the L0.  I
think it's ok to delay this indefinitely as long as the caps advertise
correctly so that qemu will use userspace emulation until it's ready.

> I was wondering how possible it was to have L2 initialize the underlying 
> OPAL structures in the L0 hypervisor. Maybe with a sort of proxy hcall
> which would perform the initialization in QEMU L1 on behalf of L2.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:26PM +0100, Cédric Le Goater wrote:
> When the VM is stopped in a migration sequence, the sources are masked
> and the XIVE IC is synced to stabilize the EQs. When done, the KVM
> ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages.
> 
> The migration can then transfer the remaining dirty pages to the
> destination and start collecting the state of the devices.

Is there a reason to make this a separate step from the SYNC
operation?
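For reference, the userspace side of this control would presumably boil
down to something like the sketch below (illustrative only: the group
and attribute names come from this series, everything else is assumed):

	/* Hypothetical userspace sketch, not from the series. */
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* 'xive_fd' is assumed to be the fd of the XIVE native KVM
	 * device; called once the sources are masked and the IC is
	 * synced, just before the final dirty-page transfer. */
	static int xive_save_eq_pages(int xive_fd)
	{
		struct kvm_device_attr attr = {
			.group = KVM_DEV_XIVE_GRP_CTRL,
			.attr  = KVM_DEV_XIVE_SAVE_EQ_PAGES,
		};

		return ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);
	}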

> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>  arch/powerpc/kvm/book3s_xive_native.c | 40 +++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index f3b859223b80..1a8740629acf 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_ESB_FD   1
>  #define   KVM_DEV_XIVE_GET_TIMA_FD   2
>  #define   KVM_DEV_XIVE_VC_BASE   3
> +#define   KVM_DEV_XIVE_SAVE_EQ_PAGES 4
>  #define KVM_DEV_XIVE_GRP_SOURCES 2   /* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC   3   /* 64-bit source attributes */
>  
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index a8052867afc1..f2de1bcf3b35 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct 
> kvmppc_xive *xive, u64 addr)
>   return put_user(ret, ubufp);
>  }
>  
> +static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu)
> +{
> + struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> + unsigned int prio;
> +
> + if (!xc)
> + return -ENOENT;
> +
> + for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
> + struct xive_q *q = &xc->queues[prio];
> +
> + if (!q->qpage)
> + continue;
> +
> + /* Mark EQ page dirty for migration */
> + mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage));
> + }
> + return 0;
> +}
> +
> +static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive)
> +{
> + struct kvm *kvm = xive->kvm;
> + struct kvm_vcpu *vcpu;
> + unsigned int i;
> +
> + pr_devel("%s\n", __func__);
> +
> + mutex_lock(&kvm->lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + kvmppc_xive_native_vcpu_save_eq_pages(vcpu);
> + }
> + mutex_unlock(&kvm->lock);
> +
> + return 0;
> +}
> +
>  static int xive_native_validate_queue_size(u32 qsize)
>  {
>   switch (qsize) {
> @@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device 
> *dev,
>   switch (attr->attr) {
>   case KVM_DEV_XIVE_VC_BASE:
>   return kvmppc_xive_native_set_vc_base(xive, attr->addr);
> + case KVM_DEV_XIVE_SAVE_EQ_PAGES:
> + return kvmppc_xive_native_save_eq_pages(xive);
>   }
>   break;
>   case KVM_DEV_XIVE_GRP_SOURCES:
> @@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device 
> *dev,
>   case KVM_DEV_XIVE_GET_ESB_FD:
>   case KVM_DEV_XIVE_GET_TIMA_FD:
>   case KVM_DEV_XIVE_VC_BASE:
> + case KVM_DEV_XIVE_SAVE_EQ_PAGES:
>   return 0;
>   }
>   break;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
> At a VCPU level, the state of the thread context interrupt management
> registers needs to be collected. These registers are cached under the
> 'xive_saved_state.w01' field of the VCPU when the VCPU context is
> pulled from the HW thread. An OPAL call retrieves the backup of the
> IPB register in the NVT structure and merges it in the KVM state.
> 
> The structures of the interface between QEMU and KVM provision some
> extra room (two u64) for further extensions if more state needs to be
> transferred back to QEMU.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/kvm_ppc.h|  5 ++
>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
>  arch/powerpc/kvm/book3s.c | 24 +
>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++
>  4 files changed, 109 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 4cc897039485..49c488af168c 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
>   u64 addr;
>   u64 length;
>   }   vpaval;
> + u64 xive_timaval[4];
>  };
>  
>  struct kvmppc_ops {
> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct 
> kvm_vcpu *vcpu);
>  extern void kvmppc_xive_native_init_module(void);
>  extern void kvmppc_xive_native_exit_module(void);
>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union 
> kvmppc_one_reg *val);
> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union 
> kvmppc_one_reg *val);
>  
>  #else
>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { 
> }
>  static inline void kvmppc_xive_native_exit_module(void) { }
>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
>   { return 0; }
> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union 
> kvmppc_one_reg *val) { return 0; }
> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union 
> kvmppc_one_reg *val) { return -ENOENT; }

IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
rest of the XIVE code, can we use it here as well?
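As an aside, the userspace access here is the usual one-reg dance,
roughly like the sketch below (only KVM_REG_PPC_VP_STATE comes from
this patch, the rest is assumed):

	/* Hypothetical sketch: read the 4 x u64 VP state of a vcpu
	 * through the one-reg API; 'vcpu_fd' is assumed to be a KVM
	 * vcpu file descriptor. */
	static int xive_get_vp_state(int vcpu_fd, __u64 state[4])
	{
		struct kvm_one_reg reg = {
			.id   = KVM_REG_PPC_VP_STATE,	/* U256-sized register */
			.addr = (__u64)(unsigned long)state,
		};

		return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	}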

>  #endif /* CONFIG_KVM_XIVE */
>  
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 95302558ce10..3c958c39a782 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT  16  /* pending irq priority */
>  #define  KVM_REG_PPC_ICP_PPRI_MASK   0xff
>  
> +#define KVM_REG_PPC_VP_STATE (KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
> +
>  /* Device control API: PPC-specific devices */
>  #define KVM_DEV_MPIC_GRP_MISC   1
>  #define   KVM_DEV_MPIC_BASE_ADDR 0   /* 64-bit */
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index de7eed191107..5ad658077a35 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
>   *val = get_reg_val(id, 
> kvmppc_xics_get_icp(vcpu));
>   break;
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> + case KVM_REG_PPC_VP_STATE:
> + if (!vcpu->arch.xive_vcpu) {
> + r = -ENXIO;
> + break;
> + }
> + if (xive_enabled())
> + r = kvmppc_xive_native_get_vp(vcpu, val);
> + else
> + r = -ENXIO;
> + break;
> +#endif /* CONFIG_KVM_XIVE */
>   case KVM_REG_PPC_FSCR:
>   *val = get_reg_val(id, vcpu->arch.fscr);
>   break;
> @@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>   r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, 
> *val));
>   break;
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> + case KVM_REG_PPC_VP_STATE:
> + if (!vcpu->arch.xive_vcpu) {
> + r = -ENXIO;
> + break;
> + }
> + if (xive_enabled())
> + r = kvmppc_xive_native_set_vp(vcpu, val);
> + else
> + r = -ENXIO;
> + break;
> +#endif /* CONFIG_KVM_XIVE */
>   case KVM_REG_PPC_FSCR:
>   vcpu->arch.fscr = set_reg_val(id, *val);
>  

Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> These are used to capture the XIVE EAS table of the KVM device, i.e.
> the configuration of the source targets.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   | 11 
>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 1a8740629acf..faf024f39858 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES 4
>  #define KVM_DEV_XIVE_GRP_SOURCES 2   /* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC   3   /* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_EAS 4   /* 64-bit eas attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
>  #define KVM_XIVE_LEVEL_ASSERTED  (1ULL << 1)
>  
> +/* Layout of 64-bit eas attribute values */
> +#define KVM_XIVE_EAS_PRIORITY_SHIFT  0
> +#define KVM_XIVE_EAS_PRIORITY_MASK   0x7
> +#define KVM_XIVE_EAS_SERVER_SHIFT    3
> +#define KVM_XIVE_EAS_SERVER_MASK     0xfffffff8ULL
> +#define KVM_XIVE_EAS_MASK_SHIFT      32
> +#define KVM_XIVE_EAS_MASK_MASK       0x100000000ULL
> +#define KVM_XIVE_EAS_EISN_SHIFT      33
> +#define KVM_XIVE_EAS_EISN_MASK       0xfffffffe00000000ULL
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index f2de1bcf3b35..0468b605baa7 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive 
> *xive, long irq, u64 addr)
>   return 0;
>  }
>  
> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> +   u64 addr)

I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
values, but rather essentially the "source config" in the terminology
of the PAPR hcalls.  Which, yes, is basically implemented by setting
the EAS, but since it's the PAPR architected state that we need to
preserve across migration, I'd prefer to stick as close as we can to
the PAPR terminology.

> +{
> + struct kvmppc_xive_src_block *sb;
> + struct kvmppc_xive_irq_state *state;
> + u64 __user *ubufp = (u64 __user *) addr;
> + u16 src;
> + u64 kvm_eas;
> + u32 server;
> + u8 priority;
> + u32 eisn;
> +
> + sb = kvmppc_xive_find_source(xive, irq, &src);
> + if (!sb)
> + return -ENOENT;
> +
> + state = &sb->irq_state[src];
> +
> + if (!state->valid)
> + return -EINVAL;
> +
> + if (get_user(kvm_eas, ubufp))
> + return -EFAULT;
> +
> + pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
> +
> + priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
> + KVM_XIVE_EAS_PRIORITY_SHIFT;
> + server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
> + KVM_XIVE_EAS_SERVER_SHIFT;
> + eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
> +
> + if (priority != xive_prio_from_guest(priority)) {
> + pr_err("invalid priority for queue %d for VCPU %d\n",
> +priority, server);
> + return -EINVAL;
> + }
> +
> + return kvmppc_xive_native_set_source_config(xive, sb, state, server,
> + priority, eisn);
> +}
> +
> +static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
> +   u64 addr)
> +{
> + struct kvmppc_xive_src_block *sb;
> + struct kvmppc_xive_irq_state *state;
> + u64 __user *ubufp = (u64 __user *) addr;
> + u16 src;
> + u64 kvm_eas;
> +
> + sb = kvmppc_xive_find_source(xive, irq, &src);
> + if (!sb)
> + return -ENOENT;
> +
> + state = &sb->irq_state[src];
> +
> + if (!state->valid)
> + return -EINVAL;
> +
> + arch_spin_lock(&sb->lock);
> +
> + if (state->act_priority == MASKED)
> + kvm_eas = KVM_XIVE_EAS_MASK_MASK;
> + else {
> + kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
> + KVM_XIVE_EAS_PRIORITY_MASK;
> + kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
> + KVM_XIVE_EAS_SERVER_MASK;
> + kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
> + KVM_XIVE_EAS_EISN_MASK;
> + }
> + arch_spin_unlock(&sb->lock);
> +
> + pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
> +
> + if (put_user(kvm_eas, ubufp))
> + return -EFAULT;
> +
> +

Re: [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:25PM +0100, Cédric Le Goater wrote:
> When migration of a VM is initiated, a first copy of the RAM is
> transferred to the destination before the VM is stopped. At that time,
> QEMU needs to perform a XIVE quiesce sequence to stop the flow of
> event notifications and stabilize the EQs. The sources are masked and
> the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC.
>

Don't you also need to make sure the guest's queue pages are marked
dirty here, in case they were already migrated?
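(For the record, my reading of the intended userspace sequence is
roughly the sketch below; the group name comes from this series,
everything else is assumed:)

	/* Hypothetical sketch: sync each valid source before the VM is
	 * stopped.  'xive_fd', 'first_irq' and 'nr_irqs' are assumed to
	 * be known to userspace. */
	for (irq = first_irq; irq < first_irq + nr_irqs; irq++) {
		struct kvm_device_attr attr = {
			.group = KVM_DEV_XIVE_GRP_SYNC,
			.attr  = irq,
		};

		/* -ENOENT here just means the source isn't valid */
		ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);
	}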

> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>  arch/powerpc/kvm/book3s_xive_native.c | 32 +++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 6fc9660c5aec..f3b859223b80 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_TIMA_FD   2
>  #define   KVM_DEV_XIVE_VC_BASE   3
>  #define KVM_DEV_XIVE_GRP_SOURCES 2   /* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_SYNC   3   /* 64-bit source attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 4ca75aade069..a8052867afc1 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct 
> kvmppc_xive *xive, long irq,
>   return 0;
>  }
>  
> +static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 
> addr)
> +{
> + struct kvmppc_xive_src_block *sb;
> + struct kvmppc_xive_irq_state *state;
> + struct xive_irq_data *xd;
> + u32 hw_num;
> + u16 src;
> +
> + pr_devel("%s irq=0x%lx\n", __func__, irq);
> +
> + sb = kvmppc_xive_find_source(xive, irq, &src);
> + if (!sb)
> + return -ENOENT;
> +
> + state = &sb->irq_state[src];
> +
> + if (!state->valid)
> + return -ENOENT;
> +
> + arch_spin_lock(&sb->lock);
> +
> + kvmppc_xive_select_irq(state, &hw_num, &xd);
> + xive_native_sync_source(hw_num);
> + xive_native_sync_queue(hw_num);
> +
> + arch_spin_unlock(&sb->lock);
> + return 0;
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  struct kvm_device_attr *attr)
>  {
> @@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device 
> *dev,
>   case KVM_DEV_XIVE_GRP_SOURCES:
>   return kvmppc_xive_native_set_source(xive, attr->attr,
>attr->addr);
> + case KVM_DEV_XIVE_GRP_SYNC:
> + return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>   }
>   return -ENXIO;
>  }
> @@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device 
> *dev,
>   }
>   break;
>   case KVM_DEV_XIVE_GRP_SOURCES:
> + case KVM_DEV_XIVE_GRP_SYNC:
>   if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>   attr->attr < KVMPPC_XIVE_NR_IRQS)
>   return 0;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:28PM +0100, Cédric Le Goater wrote:
> These are used to capture the XIVE END table of the KVM device. It
> relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit
> and index which are updated by the HW when events are enqueued in the
> guest RAM.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  21 
>  arch/powerpc/kvm/book3s_xive_native.c | 166 ++
>  2 files changed, 187 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index faf024f39858..95302558ce10 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char {
>  #define KVM_DEV_XIVE_GRP_SOURCES 2   /* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC   3   /* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_EAS 4   /* 64-bit eas attributes */
> +#define KVM_DEV_XIVE_GRP_EQ  5   /* 64-bit eq attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
> @@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char {
>  #define KVM_XIVE_EAS_EISN_SHIFT  33
>  #define KVM_XIVE_EAS_EISN_MASK   0xfffffffe00000000ULL
>  
> +/* Layout of 64-bit eq attribute */
> +#define KVM_XIVE_EQ_PRIORITY_SHIFT   0
> +#define KVM_XIVE_EQ_PRIORITY_MASK0x7
> +#define KVM_XIVE_EQ_SERVER_SHIFT 3
> +#define KVM_XIVE_EQ_SERVER_MASK  0xfff8ULL
> +
> +/* Layout of 64-bit eq attribute values */
> +struct kvm_ppc_xive_eq {
> + __u32 flags;
> + __u32 qsize;
> + __u64 qpage;
> + __u32 qtoggle;
> + __u32 qindex;

Should we pad this in case a) we discover some fields in the EQ that
we thought weren't relevant to the guest actually are, or b) future
XIVE extensions add something we need to migrate?
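Something along these lines, say (just a sketch, and the amount of
padding is a guess):

	struct kvm_ppc_xive_eq {
		__u32 flags;
		__u32 qsize;
		__u64 qpage;
		__u32 qtoggle;
		__u32 qindex;
		__u8  pad[40];	/* reserved for future EQ state */
	};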

> +};
> +
> +#define KVM_XIVE_EQ_FLAG_ENABLED 0x0001
> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY   0x0002
> +#define KVM_XIVE_EQ_FLAG_ESCALATE0x0004
> +
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 0468b605baa7..f4eb71eafc57 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct 
> kvmppc_xive *xive, long irq,
>   return 0;
>  }
>  
> +static int kvmppc_xive_native_set_queue(struct kvmppc_xive *xive, long 
> eq_idx,
> +   u64 addr)
> +{
> + struct kvm *kvm = xive->kvm;
> + struct kvm_vcpu *vcpu;
> + struct kvmppc_xive_vcpu *xc;
> + void __user *ubufp = (u64 __user *) addr;
> + u32 server;
> + u8 priority;
> + struct kvm_ppc_xive_eq kvm_eq;
> + int rc;
> + __be32 *qaddr = 0;
> + struct page *page;
> + struct xive_q *q;
> +
> + /*
> +  * Demangle priority/server tuple from the EQ index
> +  */
> + priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
> + KVM_XIVE_EQ_PRIORITY_SHIFT;
> + server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
> + KVM_XIVE_EQ_SERVER_SHIFT;
> +
> + if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq)))
> + return -EFAULT;
> +
> + vcpu = kvmppc_xive_find_server(kvm, server);
> + if (!vcpu) {
> + pr_err("Can't find server %d\n", server);
> + return -ENOENT;
> + }
> + xc = vcpu->arch.xive_vcpu;
> +
> + if (priority != xive_prio_from_guest(priority)) {
> + pr_err("Trying to restore invalid queue %d for VCPU %d\n",
> +priority, server);
> + return -EINVAL;
> + }
> + q = &xc->queues[priority];
> +
> + pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
> +  __func__, server, priority, kvm_eq.flags,
> +  kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
> +
> + rc = xive_native_validate_queue_size(kvm_eq.qsize);
> + if (rc || !kvm_eq.qsize) {
> + pr_err("invalid queue size %d\n", kvm_eq.qsize);
> + return rc;
> + }
> +
> + page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage));
> + if (is_error_page(page)) {
> + pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage);
> + return -ENOMEM;
> + }
> + qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK);
> +
> + /* Backup queue page guest address for migration */
> + q->guest_qpage = kvm_eq.qpage;
> + q->guest_qsize = kvm_eq.qsize;
> +
> + rc = xive_native_configure_queue(xc->vp_id, q, priority,
> +  (__be32 *) qaddr, kvm_eq.qsize, true);
> + if (rc) {
> + pr_err("Failed to configure queue %d for VCPU %d: %d\n",
> +priority, xc-

Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device

2019-02-03 Thread David Gibson
On Wed, Jan 23, 2019 at 05:56:26PM +0100, Cédric Le Goater wrote:
> On 1/22/19 6:14 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
> >> The ESB MMIO region controls the interrupt sources of the guest. QEMU
> >> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
> >> address for the guest to use. The guest will obtain this information
> >> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
> >> setting used by QEMU, add a VC_BASE control to the KVM XIVE device
> > 
> > This needs a little more explanation.  I *think* the only way this
> > gets used is that it gets returned to the guest by the new
> > hypercalls.  If that is indeed the case it would be useful to mention
> > that in the patch description, because otherwise taking a value that
> > userspace provides and which looks like it is an address, and not
> > doing any validation on it, looks a bit scary.
> 
> I think we have solved this problem in another email thread. 
> 
> The H_INT_GET_SOURCE_INFO hcall does not need to be implemented in KVM
> as all the source information should already be available in QEMU. In
> that case, there is no need to inform KVM of where the ESB pages are 
> mapped in the guest address space. So we don't need that extra control
> on the KVM device. This is good news.

Ah, good to hear.  I thought this looked strange.


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:24PM +0100, Cédric Le Goater wrote:
> The guest physical address of the event queue will be part of the
> state to transfer in the migration. Cache its value when the queue is
> configured; it will save us an OPAL call.

That doesn't sound like a very compelling case - migration is already
a hundreds of milliseconds type operation, I wouldn't expect a few
extra OPAL calls to be an issue.

> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/xive.h   | 2 ++
>  arch/powerpc/kvm/book3s_xive_native.c | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 7a7aa22d8258..e90c3c5d9533 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -74,6 +74,8 @@ struct xive_q {
>   u32 esc_irq;
>   atomic_tcount;
>   atomic_tpending_count;
> + u64 guest_qpage;
> + u32 guest_qsize;
>  };
>  
>  /* Global enable flags for the XIVE support */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 35d806740c3a..4ca75aade069 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu 
> *vcpu,
>   }
>   qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK);
>  
> + /* Backup queue page address and size for migration */
> + q->guest_qpage = qpage;
> + q->guest_qsize = qsize;
> +
>   rc = xive_native_configure_queue(xc->vp_id, q, priority,
>(__be32 *) qaddr, qsize, true);
>   if (rc) {

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:21PM +0100, Cédric Le Goater wrote:
> Interrupt sources are simply created at the OPAL level and then
> MASKED. KVM only needs to know about their type: LSI or MSI.

This commit message isn't very illuminating.
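To spell out the intended use: restoring a source from userspace would
look roughly like the sketch below (only the group and flag names come
from this patch, the rest is assumed):

	/* Hypothetical sketch: (re)create one source on the KVM device.
	 * 'xive_fd' is the XIVE device fd, 'irq' the guest IRQ number,
	 * 'lsi'/'asserted' the saved source state. */
	__u64 val = 0;
	struct kvm_device_attr attr = {
		.group = KVM_DEV_XIVE_GRP_SOURCES,
		.attr  = irq,
		.addr  = (__u64)(unsigned long)&val,
	};

	if (lsi) {
		val |= KVM_XIVE_LEVEL_SENSITIVE;
		if (asserted)
			val |= KVM_XIVE_LEVEL_ASSERTED;
	}

	ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr);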

> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  5 +
>  arch/powerpc/kvm/book3s_xive_native.c | 98 +++
>  .../powerpc/kvm/book3s_xive_native_template.c | 27 +
>  3 files changed, 130 insertions(+)
>  create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 8b78b12aa118..6fc9660c5aec 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_ESB_FD   1
>  #define   KVM_DEV_XIVE_GET_TIMA_FD   2
>  #define   KVM_DEV_XIVE_VC_BASE   3
> +#define KVM_DEV_XIVE_GRP_SOURCES 2   /* 64-bit source attributes */
> +
> +/* Layout of 64-bit XIVE source attribute values */
> +#define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
> +#define KVM_XIVE_LEVEL_ASSERTED  (1ULL << 1)
>  
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 29a62914de55..2518640d4a58 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -31,6 +31,24 @@
>  
>  #include "book3s_xive.h"
>  
> +/*
> + * We still instantiate them here because we use some of the
> + * generated utility functions as well in this file.

And this comment is downright cryptic.

> + */
> +#define XIVE_RUNTIME_CHECKS
> +#define X_PFX xive_vm_
> +#define X_STATIC static
> +#define X_STAT_PFX stat_vm_
> +#define __x_tima xive_tima
> +#define __x_eoi_page(xd) ((void __iomem *)((xd)->eoi_mmio))
> +#define __x_trig_page(xd)((void __iomem *)((xd)->trig_mmio))
> +#define __x_writeb   __raw_writeb
> +#define __x_readw__raw_readw
> +#define __x_readq__raw_readq
> +#define __x_writeq   __raw_writeq
> +
> +#include "book3s_xive_native_template.c"
> +
>  static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>  {
>   struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> @@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct 
> kvmppc_xive *xive, u64 addr)
>   return put_user(ret, ubufp);
>  }
>  
> +static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
> +  u64 addr)
> +{
> + struct kvmppc_xive_src_block *sb;
> + struct kvmppc_xive_irq_state *state;
> + u64 __user *ubufp = (u64 __user *) addr;
> + u64 val;
> + u16 idx;
> +
> + pr_devel("%s irq=0x%lx\n", __func__, irq);
> +
> + if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS)
> + return -ENOENT;
> +
> + sb = kvmppc_xive_find_source(xive, irq, &idx);
> + if (!sb) {
> + pr_debug("No source, creating source block...\n");

Doesn't this need to be protected by some lock?

> + sb = kvmppc_xive_create_src_block(xive, irq);
> + if (!sb) {
> + pr_err("Failed to create block...\n");
> + return -ENOMEM;
> + }
> + }
> + state = &sb->irq_state[idx];
> +
> + if (get_user(val, ubufp)) {
> + pr_err("fault getting user info !\n");
> + return -EFAULT;
> + }
> +
> + /*
> +  * If the source doesn't already have an IPI, allocate
> +  * one and get the corresponding data
> +  */
> + if (!state->ipi_number) {
> + state->ipi_number = xive_native_alloc_irq();
> + if (state->ipi_number == 0) {
> + pr_err("Failed to allocate IRQ !\n");
> + return -ENOMEM;
> + }

Am I right in thinking this is the point at which a specific guest irq
number gets bound to a specific host irq number?

> + xive_native_populate_irq_data(state->ipi_number,
> +   &state->ipi_data);
> + pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__,
> +  state->ipi_number, irq);
> + }
> +
> + arch_spin_lock(&sb->lock);
> +
> + /* Restore LSI state */
> + if (val & KVM_XIVE_LEVEL_SENSITIVE) {
> + state->lsi = true;
> + if (val & KVM_XIVE_LEVEL_ASSERTED)
> + state->asserted = true;
> + pr_devel("  LSI ! Asserted=%d\n", state->asserted);
> + }
> +
> + /* Mask IRQ to start with */
> + state->act_server = 0;
> + state->act_priority = MASKED;
> + xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
> + xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
> +
> + /* Increment the number of valid sources and

Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> This will let the guest create a memory mapping to expose the ESB MMIO
> regions used to control the interrupt sources, to trigger events, to
> EOI or to turn off the sources.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 8c876c166ef2..6bb61ba141c2 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_XICS_PRESENTED  (1ULL << 43)
>  #define  KVM_XICS_QUEUED (1ULL << 44)
>  
> +/* POWER9 XIVE Native Interrupt Controller */
> +#define KVM_DEV_XIVE_GRP_CTRL        1
> +#define   KVM_DEV_XIVE_GET_ESB_FD    1

Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
to both with an mmap() directly on the xive device fd?  Using the
offset to distinguish which one to map, obviously.
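Roughly what I have in mind (a sketch only: kvm_device_ops has no mmap
hook today, so one would need to be added, and the TIMA vmops and
offset constant here are invented):

	/* Hypothetical: one mmap handler on the device fd itself,
	 * demultiplexed by page offset.  ESB pages first (two per
	 * IRQ), the TIMA after them. */
	#define XIVE_TIMA_PGOFF		(2 * KVMPPC_XIVE_NR_IRQS)

	static int kvmppc_xive_native_mmap(struct kvm_device *dev,
					   struct vm_area_struct *vma)
	{
		vma->vm_flags |= VM_IO | VM_PFNMAP;
		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

		if (vma->vm_pgoff < XIVE_TIMA_PGOFF)
			vma->vm_ops = &xive_native_esb_vmops;
		else
			vma->vm_ops = &xive_native_tima_vmops; /* invented */
		return 0;
	}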

>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 115143e76c45..e20081f0c8d4 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device 
> *dev,
>   return rc;
>  }
>  
> +static int xive_native_esb_fault(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct kvmppc_xive *xive = vma->vm_file->private_data;
> + struct kvmppc_xive_src_block *sb;
> + struct kvmppc_xive_irq_state *state;
> + struct xive_irq_data *xd;
> + u32 hw_num;
> + u16 src;
> + u64 page;
> + unsigned long irq;
> +
> + /*
> +  * Linux/KVM uses a two pages ESB setting, one for trigger and
> +  * one for EOI
> +  */
> + irq = vmf->pgoff / 2;
> +
> + sb = kvmppc_xive_find_source(xive, irq, &src);
> + if (!sb) {
> + pr_err("%s: source %lx not found !\n", __func__, irq);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + state = &sb->irq_state[src];
> + kvmppc_xive_select_irq(state, &hw_num, &xd);
> +
> + arch_spin_lock(&sb->lock);
> +
> + /*
> +  * first/even page is for trigger
> +  * second/odd page is for EOI and management.
> +  */
> + page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
> + arch_spin_unlock(&sb->lock);
> +
> + if (!page) {
> + pr_err("%s: acessing invalid ESB page for source %lx !\n",
> +__func__, irq);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
> + return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct xive_native_esb_vmops = {
> + .fault = xive_native_esb_fault,
> +};
> +
> +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct 
> *vma)
> +{
> + /* There are two ESB pages (trigger and EOI) per IRQ */
> + if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
> + return -EINVAL;
> +
> + vma->vm_flags |= VM_IO | VM_PFNMAP;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_ops = &xive_native_esb_vmops;
> + return 0;
> +}
> +
> +static const struct file_operations xive_native_esb_fops = {
> + .mmap = xive_native_esb_mmap,
> +};
> +
> +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
> +{
> + u64 __user *ubufp = (u64 __user *) addr;
> + int ret;
> +
> + ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive,
> + O_RDWR | O_CLOEXEC);
> + if (ret < 0)
> + return ret;
> +
> + return put_user(ret, ubufp);
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  struct kvm_device_attr *attr)
>  {
> @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct 
> kvm_device *dev,
>  static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>  struct kvm_device_attr *attr)
>  {
> + struct kvmppc_xive *xive = dev->private;
> +
> + switch (attr->group) {
> + case KVM_DEV_XIVE_GRP_CTRL:
> + switch (attr->attr) {
> + case KVM_DEV_XIVE_GET_ESB_FD:
> + return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
> + }
> + break;
> + }
>   return -ENXIO;
>  }
>  
>  static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  struct kvm_device_attr *attr)
>  {
> + switch (attr->group) {
> + case KVM_DEV_XIVE_GRP_CTRL:
> + switch (attr->attr) {
> + case KVM_DEV_XIVE_GET_ESB

Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-02-03 Thread David Gibson
On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> This is the basic framework for the new KVM device supporting the XIVE
> native exploitation mode. The user interface exposes a new capability
> and a new KVM device to be used by QEMU.
> 
> Internally, the interface to the new KVM device is protected with a
> new interrupt mode: KVMPPC_IRQ_XIVE.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/kvm_host.h   |   2 +
>  arch/powerpc/include/asm/kvm_ppc.h|  21 ++
>  arch/powerpc/kvm/book3s_xive.h|   3 +
>  include/uapi/linux/kvm.h  |   3 +
>  arch/powerpc/kvm/book3s.c |   7 +-
>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++
>  arch/powerpc/kvm/powerpc.c|  30 +++
>  arch/powerpc/kvm/Makefile |   2 +-
>  8 files changed, 398 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 0f98f00da2ea..c522e8274ad9 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
>  struct kvmppc_xive;
>  struct kvmppc_xive_vcpu;
>  extern struct kvm_device_ops kvm_xive_ops;
> +extern struct kvm_device_ops kvm_xive_native_ops;
>  
>  struct kvmppc_passthru_irqmap;
>  
> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
>  #define KVMPPC_IRQ_DEFAULT   0
>  #define KVMPPC_IRQ_MPIC  1
>  #define KVMPPC_IRQ_XICS  2 /* Includes a XIVE option */
> +#define KVMPPC_IRQ_XIVE  3 /* XIVE native exploitation mode */
>  
>  #define MMIO_HPTE_CACHE_SIZE 4
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index eb0d79f0ca45..1bb313f238fe 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, 
> u64 icpval);
>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>  int level, bool line_status);
>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
> +}
> +
> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> + struct kvm_vcpu *vcpu, u32 cpu);
> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> +extern void kvmppc_xive_native_init_module(void);
> +extern void kvmppc_xive_native_exit_module(void);
> +
>  #else
>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>  u32 priority) { return -1; }
> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
> *vcpu, u64 icpval) { retur
>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, 
> u32 irq,
> int level, bool line_status) { return 
> -ENODEV; }
>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> + { return 0; }
> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +   struct kvm_vcpu *vcpu, u32 
> cpu) { return -EBUSY; }
> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
> +static inline void kvmppc_xive_native_init_module(void) { }
> +static inline void kvmppc_xive_native_exit_module(void) { }
> +
>  #endif /* CONFIG_KVM_XIVE */
>  
>  /*
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 10c4aa5cd010..5f22415520b4 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -12,6 +12,9 @@
>  #ifdef CONFIG_KVM_XICS
>  #include "book3s_xics.h"
>  
> +#define KVMPPC_XIVE_FIRST_IRQ    0
> +#define KVMPPC_XIVE_NR_IRQS  KVMPPC_XICS_NR_IRQS
> +
>  /*
>   * State for one guest irq source.
>   *
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6d4ea4b6c922..52bf74a1616e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>  #define KVM_CAP_HYPERV_CPUID 167
> +#define KVM_CAP_PPC_IRQ_XIVE 168
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_V3
>   KVM_DEV_TYPE_ARM_VGIC_ITS,
>  #define KVM_DEV_TYPE_ARM_VGIC_ITSKVM_DEV_TYPE_ARM_VGIC_ITS
> + KVM_DEV_TYPE_XIVE,
> +#define KVM_DEV_TYPE_XIVEKVM_DEV_TYPE_XIVE
>   KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/bo

Re: [PATCH] powerpc/powernv: Escalate reset when IODA reset fails

2019-02-03 Thread Oliver
On Mon, Feb 4, 2019 at 3:45 PM Alexey Kardashevskiy  wrote:
>
>
>
> On 01/02/2019 11:42, Oliver O'Halloran wrote:
> > The IODA reset is used to flush out any OS controlled state from the PHB.
> > This reset can fail if a PHB fatal error has occurred in early boot,
> > probably because of a bad device. We already do a fundamental
> > reset of the device in some cases, so this patch just adds a test to force
> > a full reset if firmware reports an error when performing the IODA reset.
> >
> > Signed-off-by: Oliver O'Halloran 
>
> I am pretty sure I already saw this :-/

Uh yeah, looks like I posted it a while ago and forgot I did.

>
> ah, anyway
>
> Reviewed-by: Alexey Kardashevskiy 
>
>
>
>
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> > b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 1d6406a..53982f8 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -3943,9 +3943,12 @@ static void __init pnv_pci_init_ioda_phb(struct 
> > device_node *np,
> >* shutdown PCI devices correctly. We already got IODA table
> >* cleaned out. So we have to issue PHB reset to stop all PCI
> >* transactions from previous kernel. The ppc_pci_reset_phbs
> > -  * kernel parameter will force this reset too.
> > +  * kernel parameter will force this reset too. Additionally,
> > +  * if the IODA reset above failed then use a bigger hammer.
> > +  * This can happen if we get a PHB fatal error in very early
> > +  * boot.
> >*/
> > - if (is_kdump_kernel() || pci_reset_phbs) {
> > + if (is_kdump_kernel() || pci_reset_phbs || rc) {
> >   pr_info("  Issue PHB reset ...\n");
> >   pnv_eeh_phb_reset(hose, EEH_RESET_FUNDAMENTAL);
> >   pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
> >
>
> --
> Alexey


Re: [PATCH] powerpc/powernv: Escalate reset when IODA reset fails

2019-02-03 Thread Alexey Kardashevskiy



On 01/02/2019 11:42, Oliver O'Halloran wrote:
> The IODA reset is used to flush out any OS controlled state from the PHB.
> This reset can fail if a PHB fatal error has occurred in early boot,
> probably because of a bad device. We already do a fundamental
> reset of the device in some cases, so this patch just adds a test to force
> a full reset if firmware reports an error when performing the IODA reset.
> 
> Signed-off-by: Oliver O'Halloran 

I am pretty sure I already saw this :-/

ah, anyway

Reviewed-by: Alexey Kardashevskiy 




> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 1d6406a..53982f8 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3943,9 +3943,12 @@ static void __init pnv_pci_init_ioda_phb(struct 
> device_node *np,
>* shutdown PCI devices correctly. We already got IODA table
>* cleaned out. So we have to issue PHB reset to stop all PCI
>* transactions from previous kernel. The ppc_pci_reset_phbs
> -  * kernel parameter will force this reset too.
> +  * kernel parameter will force this reset too. Additionally,
> +  * if the IODA reset above failed then use a bigger hammer.
> +  * This can happen if we get a PHB fatal error in very early
> +  * boot.
>*/
> - if (is_kdump_kernel() || pci_reset_phbs) {
> + if (is_kdump_kernel() || pci_reset_phbs || rc) {
>   pr_info("  Issue PHB reset ...\n");
>   pnv_eeh_phb_reset(hose, EEH_RESET_FUNDAMENTAL);
>   pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
> 

-- 
Alexey


[RFC PATCH 5/5] powerpc: sstep: Add selftests for addc[.] instruction

2019-02-03 Thread Sandipan Das
This adds test cases for the addc[.] instruction.

Signed-off-by: Sandipan Das 
---
 arch/powerpc/include/asm/ppc-opcode.h |   1 +
 arch/powerpc/lib/sstep_tests.c| 212 ++
 2 files changed, 213 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 07bdb404571c..c0fe90173977 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -326,6 +326,7 @@
 #define PPC_INST_ADDI  0x38000000
 #define PPC_INST_ADDIS 0x3c000000
 #define PPC_INST_ADD   0x7c000214
+#define PPC_INST_ADDC  0x7c000014
 #define PPC_INST_SUB   0x7c000050
 #define PPC_INST_BLR   0x4e800020
 #define PPC_INST_BLRL  0x4e800021
diff --git a/arch/powerpc/lib/sstep_tests.c b/arch/powerpc/lib/sstep_tests.c
index fe6201a2add7..d2f4bb66f66f 100644
--- a/arch/powerpc/lib/sstep_tests.c
+++ b/arch/powerpc/lib/sstep_tests.c
@@ -243,6 +243,218 @@ static struct sstep_test tests[] = {
}
}
},
+   {
+   .mnemonic = "addc",
+   .subtests =
+   {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = LONG_MAX, RB = LONG_MAX",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MAX,
+   .gpr[22] = LONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = ULONG_MAX, RB = ULONG_MAX",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = ULONG_MAX,
+   .gpr[22] = ULONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = ULONG_MAX, RB = 0x1",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = ULONG_MAX,
+   .gpr[22] = 0x1,
+   }
+   },
+   {
+   .descr = "RA = INT_MIN, RB = INT_MIN",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MIN,
+   .gpr[22] = INT_MIN,
+   }
+   },
+   {
+   .descr = "RA = INT_MIN, RB = INT_MAX",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MIN,
+   .gpr[22] = INT_MAX,
+   }
+   },
+   {
+   .descr = "RA = INT_MAX, RB = INT_MAX",
+   .instr = PPC_INST_ADDC | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MAX,
+   .gpr[22] = INT_MAX,
+   }
+   },
+   {
+   .descr = "RA = UINT_MAX, RB = UINT_

[RFC PATCH 4/5] powerpc: sstep: Add selftests for add[.] instruction

2019-02-03 Thread Sandipan Das
This adds test cases for the add[.] instruction.

Signed-off-by: Sandipan Das 
---
 arch/powerpc/lib/sstep_tests.c | 194 +
 1 file changed, 194 insertions(+)

diff --git a/arch/powerpc/lib/sstep_tests.c b/arch/powerpc/lib/sstep_tests.c
index a610c778044d..fe6201a2add7 100644
--- a/arch/powerpc/lib/sstep_tests.c
+++ b/arch/powerpc/lib/sstep_tests.c
@@ -49,6 +49,200 @@ static struct sstep_test tests[] = {
}
}
},
+   {
+   .mnemonic = "add",
+   .subtests =
+   {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = LONG_MAX, RB = LONG_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = LONG_MAX,
+   .gpr[22] = LONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = ULONG_MAX, RB = ULONG_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = ULONG_MAX,
+   .gpr[22] = ULONG_MAX,
+   }
+   },
+   {
+   .descr = "RA = ULONG_MAX, RB = 0x1",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = ULONG_MAX,
+   .gpr[22] = 0x1,
+   }
+   },
+   {
+   .descr = "RA = INT_MIN, RB = INT_MIN",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MIN,
+   .gpr[22] = INT_MIN,
+   }
+   },
+   {
+   .descr = "RA = INT_MIN, RB = INT_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MIN,
+   .gpr[22] = INT_MAX,
+   }
+   },
+   {
+   .descr = "RA = INT_MAX, RB = INT_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = INT_MAX,
+   .gpr[22] = INT_MAX,
+   }
+   },
+   {
+   .descr = "RA = UINT_MAX, RB = UINT_MAX",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+   .gpr[21] = UINT_MAX,
+   .gpr[22] = UINT_MAX,
+   }
+   },
+   {
+   .descr = "RA = UINT_MAX, RB = 0x1",
+   .instr = PPC_INST_ADD | ___PPC_RT(20) | 
___PPC_RA(21) | ___PPC_RB(22),
+   .regs =
+   {
+ 

[RFC PATCH 3/5] powerpc: sstep: Add instruction emulation selftests

2019-02-03 Thread Sandipan Das
This adds a selftest framework for the in-kernel instruction
emulation infrastructure. This currently does not support the
load/store and branch instructions and is limited to integer
ALU instructions. Support for SPRs is also limited to LR, CR
and XER for now.

Tests run at boot time if CONFIG_DEBUG_KERNEL, CONFIG_PPC64
and CONFIG_EMULATE_STEP_SELFTEST are set when the kernel is
built.

When writing the tests, one must not use any instructions
that might overwrite the Stack Pointer (GPR1) or the Thread
Pointer (GPR13).

Signed-off-by: Sandipan Das 
---
 arch/powerpc/Kconfig.debug |   5 +
 arch/powerpc/lib/Makefile  |   1 +
 arch/powerpc/lib/exec_test_instr.S | 150 +++
 arch/powerpc/lib/sstep_tests.c | 158 +
 4 files changed, 314 insertions(+)
 create mode 100644 arch/powerpc/lib/exec_test_instr.S
 create mode 100644 arch/powerpc/lib/sstep_tests.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index f4961fbcb48d..d75c165538e9 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -56,6 +56,11 @@ config CODE_PATCHING_SELFTEST
bool "Run self-tests of the code-patching code"
depends on DEBUG_KERNEL
 
+config EMULATE_STEP_SELFTEST
+   bool "Run self-tests of the instruction emulation code"
+   depends on PPC64 && DEBUG_KERNEL
+   default n
+
 config JUMP_LABEL_FEATURE_CHECKS
bool "Enable use of jump label for cpu/mmu_has_feature()"
depends on JUMP_LABEL
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 3bf9fc6fd36c..c4485e004838 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -31,6 +31,7 @@ obj64-y   += copypage_64.o copyuser_64.o mem_64.o 
hweight_64.o \
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
 obj64-$(CONFIG_KPROBES_SANITY_TEST) += test_emulate_step.o
+obj64-$(CONFIG_EMULATE_STEP_SELFTEST)  += exec_test_instr.o sstep_tests.o
 
 obj-y  += checksum_$(BITS).o checksum_wrappers.o \
   string_$(BITS).o memcmp_$(BITS).o
diff --git a/arch/powerpc/lib/exec_test_instr.S 
b/arch/powerpc/lib/exec_test_instr.S
new file mode 100644
index ..217e83415eaf
--- /dev/null
+++ b/arch/powerpc/lib/exec_test_instr.S
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Non-emulated single-stepping support limited to basic integer ops for
+ * validating the instruction emulation infrastructure.
+ *
+ * Copyright (C) 2019 IBM Corporation
+ */
+
+#include 
+#include 
+#include 
+
+/* int exec_instr(struct pt_regs *regs) */
+_GLOBAL(exec_instr)
+
+   /*
+* Stack frame layout (INT_FRAME_SIZE bytes)
+*   In-memory pt_regs  (SP + STACK_FRAME_OVERHEAD)
+*   Scratch space  (SP + 8)
+*   Back chain (SP + 0)
+*/
+
+   /*
+* Allocate a new stack frame with enough space to hold the register
+* states in an in-memory pt_regs and also create the back chain to
+* the caller's stack frame.
+*/
   stdu    r1, -INT_FRAME_SIZE(r1)
+
+   /*
+* Save non-volatile GPRs on stack. This includes TOC pointer (GPR2)
+* and local variables (GPR14 to GPR31). The register for the pt_regs
+* parameter (GPR3) is saved additionally to ensure that the resulting
+* register state can still be saved even if GPR3 gets overwritten
+* when loading the initial register state for the test instruction.
+* The stack pointer (GPR1) and the thread pointer (GPR13) are not
+* saved as these should not be modified anyway.
+*/
+   SAVE_2GPRS(2, r1)
+   SAVE_NVGPRS(r1)
+
+   /*
+* Save LR on stack to ensure that the return address is available
+* even if it gets overwritten by the test instruction.
+*/
   mflr    r0
+   std r0, _LINK(r1)
+
+   /*
+* Save CR on stack. For simplicity, the entire register is saved
+* even though only fields 2 to 4 are non-volatile.
+*/
   mfcr    r0
+   std r0, _CCR(r1)
+
+   /*
+* Load register state for the test instruction without touching the
+* critical non-volatile registers. The register state is passed as a
+* pointer to a pt_regs instance.
+*/
   subi    r31, r3, GPR0
+
+   /* Load LR from pt_regs */
+   ld  r0, _LINK(r31)
   mtlr    r0
+
+   /* Load CR from pt_regs */
+   ld  r0, _CCR(r31)
   mtcr    r0
+
+   /* Load XER from pt_regs */
+   ld  r0, _XER(r31)
+   mtxer   r0
+
+   /* Load GPRs from pt_regs */
+   REST_GPR(0, r31)
+   REST_10GPRS(2, r31)
+   REST_GPR(12, r31)
+   REST_NVGPRS(r31)
+
+   .global exec_instr_execute
+exec_instr_execute:
+   /* Placeholder for the test instruction */
+1: nop
+
+   /*
+* Since GPR3 is overwritten, temporari

[RFC PATCH 1/5] powerpc: Add bitmasks for D-form instruction fields

2019-02-03 Thread Sandipan Das
This adds the bitmask definitions of D, SI and UI fields
found in D-form instructions.

Signed-off-by: Sandipan Das 
---
 arch/powerpc/include/asm/ppc-opcode.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 19a8834e0398..9bc7dd6116a7 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -403,6 +403,9 @@
 #define __PPC_CT(t)    (((t) & 0x0f) << 21)
 #define __PPC_SPR(r)   ((((r) & 0x1f) << 16) | ((((r) >> 5) & 0x1f) << 11))
 #define __PPC_RC21     (0x1 << 10)
+#define __PPC_D(d)     ((d) & 0xffff)
+#define __PPC_SI(i)    __PPC_D(i)
+#define __PPC_UI(i)    __PPC_D(i)
 
 /*
  * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
-- 
2.19.2



[RFC PATCH 2/5] powerpc: Add bitmask for Rc instruction field

2019-02-03 Thread Sandipan Das
This adds the bitmask definition for the Record bit that
is available at the end (bit 31) of some instructions.

Signed-off-by: Sandipan Das 
---
 arch/powerpc/include/asm/ppc-opcode.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 9bc7dd6116a7..07bdb404571c 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -403,6 +403,7 @@
 #define __PPC_CT(t)    (((t) & 0x0f) << 21)
 #define __PPC_SPR(r)   ((((r) & 0x1f) << 16) | ((((r) >> 5) & 0x1f) << 11))
 #define __PPC_RC21 (0x1 << 10)
+#define __PPC_RC31 (0x1)
 #define __PPC_D(d)     ((d) & 0xffff)
 #define __PPC_SI(i)    __PPC_D(i)
 #define __PPC_UI(i)    __PPC_D(i)
-- 
2.19.2
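To see where the Record bit sits, an illustrative encoding (again not
from the patch): add is primary opcode 31 with extended opcode 266,
i.e. (31 << 26) | (266 << 1) == 0x7c000214, and OR-ing in __PPC_RC31
yields the dot form, which also updates CR0.

#include <asm/ppc-opcode.h>

u32 add_insn = 0x7c000214 | ___PPC_RT(3) | ___PPC_RA(4) | ___PPC_RB(5);
u32 add_dot  = add_insn | __PPC_RC31;	/* add. r3, r4, r5 == 0x7c642a15 */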



[RFC PATCH 0/5] powerpc: sstep: Emulation test infrastructure

2019-02-03 Thread Sandipan Das
This aims to add a test infrastructure for the in-kernel instruction
emulation code. This is currently limited to testing only the basic
integer operations and supports verification of the GPRs, LR, XER and
CR.

There can be multiple test cases for each instruction. Each test case
has to be provided with the initial register state (in the form of a
struct pt_regs) and the 32-bit instruction to test.

Apart from verifying the end result, this makes it possible to identify
problems with the behaviour of certain instructions, such as setting
particular bits in CR or XER (which can also be processor dependent).
For example, the newly introduced CA32 bit in XER, exclusive to P9
CPUs as of now, was not being set when expected for some of the
arithmetic and shift instructions. With this infrastructure, it will
be easier to identify such problems and rectify them. The test cases
for the addc[.] instruction demonstrate this for different scenarios
where the CA and CA32 bits of XER should be set.
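To make the shape of a test case concrete, a hypothetical sketch follows;
the struct layout and PPC_INST_ADDC are illustrative rather than quotes
from sstep_tests.c (___PPC_RT/RA/RB are existing ppc-opcode.h macros and
__PPC_RC31 comes from patch 2/5):

#define PPC_INST_ADDC	0x7c000014	/* assumed: opcode 31, XO 10 */

struct sstep_test {
	const char *mnemonic;	/* e.g. "addc." */
	unsigned int instr;	/* 32-bit instruction word under test */
	struct pt_regs regs;	/* initial register state */
};

/* addc. with 0x7fffffffffffffff + 1: no 64-bit carry, so CA stays
 * clear, but the low words carry out of bit 31, so CA32 must be set
 * on POWER9; Rc = 1 additionally updates CR0. */
static struct sstep_test tests[] = {
	{
		.mnemonic = "addc.",
		.instr = PPC_INST_ADDC | ___PPC_RT(20) | ___PPC_RA(21) |
			 ___PPC_RB(22) | __PPC_RC31,
		.regs = {
			.gpr[21] = 0x7fffffffffffffffUL,
			.gpr[22] = 0x1UL,
		},
	},
};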

Sandipan Das (5):
  powerpc: Add bitmasks for D-form instruction fields
  powerpc: Add bitmask for Rc instruction field
  powerpc: sstep: Add instruction emulation selftests
  powerpc: sstep: Add selftests for add[.] instruction
  powerpc: sstep: Add selftests for addc[.] instruction

 arch/powerpc/Kconfig.debug|   5 +
 arch/powerpc/include/asm/ppc-opcode.h |   5 +
 arch/powerpc/lib/Makefile |   1 +
 arch/powerpc/lib/exec_test_instr.S| 150 +++
 arch/powerpc/lib/sstep_tests.c| 564 ++
 5 files changed, 725 insertions(+)
 create mode 100644 arch/powerpc/lib/exec_test_instr.S
 create mode 100644 arch/powerpc/lib/sstep_tests.c

-- 
2.19.2



Re: BUG: memcmp(): Accessing invalid memory location

2019-02-03 Thread Chandan Rajendra
On Friday, February 1, 2019 4:43:52 PM IST Michael Ellerman wrote:
> Michael Ellerman  writes:
> 
> > Adding Simon who wrote the code.
> >
> > Chandan Rajendra  writes:
> >> When executing fstests' generic/026 test, I hit the following call trace,
> >>
> >> [  417.061038] BUG: Unable to handle kernel data access at 
> >> 0xc0062ac4
> >> [  417.062172] Faulting instruction address: 0xc0092240
> >> [  417.062242] Oops: Kernel access of bad area, sig: 11 [#1]
> >> [  417.062299] LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
> >> [  417.062366] Modules linked in:
> >> [  417.062401] CPU: 0 PID: 27828 Comm: chacl Not tainted 
> >> 5.0.0-rc2-next-20190115-1-g6de6dba64dda #1
> >> [  417.062495] NIP:  c0092240 LR: c066a55c CTR: 
> >> 
> >> [  417.062567] REGS: c0062c0c3430 TRAP: 0300   Not tainted  
> >> (5.0.0-rc2-next-20190115-1-g6de6dba64dda)
> >> [  417.062660] MSR:  82009033   CR: 
> >> 44000842  XER: 2000
> >> [  417.062750] CFAR: 7fff7f3108ac DAR: c0062ac4 DSISR: 
> >> 4000 IRQMASK: 0
> >>GPR00:  c0062c0c36c0 c17f4c00 
> >> c121a660
> >>GPR04: c0062ac3fff9 0004 0020 
> >> 275b19c4
> >>GPR08: 000c 46494c45 5347495f41434c5f 
> >> c26073a0
> >>GPR12:  c27a  
> >> 
> >>GPR16:    
> >> 
> >>GPR20: c0062ea70020 c0062c0c38d0 0002 
> >> 0002
> >>GPR24: c0062ac3ffe8 275b19c4 0001 
> >> c0062ac3
> >>GPR28: c0062c0c38d0 c0062ac30050 c0062ac30058 
> >> 
> >> [  417.063563] NIP [c0092240] memcmp+0x120/0x690
> >> [  417.063635] LR [c066a55c] xfs_attr3_leaf_lookup_int+0x53c/0x5b0
> >> [  417.063709] Call Trace:
> >> [  417.063744] [c0062c0c36c0] [c066a098] 
> >> xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
> >> [  417.063851] [c0062c0c3760] [c0693f8c] 
> >> xfs_da3_node_lookup_int+0x32c/0x5a0
> >> [  417.063944] [c0062c0c3820] [c06634a0] 
> >> xfs_attr_node_addname+0x170/0x6b0
> >> [  417.064034] [c0062c0c38b0] [c0664ffc] 
> >> xfs_attr_set+0x2ac/0x340
> >> [  417.064118] [c0062c0c39a0] [c0758d40] 
> >> __xfs_set_acl+0xf0/0x230
> >> [  417.064190] [c0062c0c3a00] [c0758f50] xfs_set_acl+0xd0/0x160
> >> [  417.064268] [c0062c0c3aa0] [c04b69b0] 
> >> set_posix_acl+0xc0/0x130
> >> [  417.064339] [c0062c0c3ae0] [c04b6a88] 
> >> posix_acl_xattr_set+0x68/0x110
> >> [  417.064412] [c0062c0c3b20] [c04532d4] 
> >> __vfs_setxattr+0xa4/0x110
> >> [  417.064485] [c0062c0c3b80] [c0454c2c] 
> >> __vfs_setxattr_noperm+0xac/0x240
> >> [  417.064566] [c0062c0c3bd0] [c0454ee8] 
> >> vfs_setxattr+0x128/0x130
> >> [  417.064638] [c0062c0c3c30] [c0455138] setxattr+0x248/0x600
> >> [  417.064710] [c0062c0c3d90] [c0455738] 
> >> path_setxattr+0x108/0x120
> >> [  417.064785] [c0062c0c3e00] [c0455778] sys_setxattr+0x28/0x40
> >> [  417.064858] [c0062c0c3e20] [c000bae4] system_call+0x5c/0x70
> >> [  417.064930] Instruction dump:
> >> [  417.064964] 7d201c28 7d402428 7c295040 38630008 38840008 408201f0 
> >> 4200ffe8 2c05
> >> [  417.065051] 4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 
> >> 7d4a3436 7c295040
> >> [  417.065150] ---[ end trace 0d060411b5e3741b ]---
> >>
> >>
> >> Both the memory locations passed to memcmp() had "SGI_ACL_FILE" and len
> >> argument of memcmp() was set to 12. s1 argument of memcmp() had the value
> >> 0xf4af0485, while s2 argument had the value 0xce9e316f.
> >>
> >> The following is the code path within memcmp() that gets executed for the
> >> above mentioned values,
> >>
> >> - Since len (i.e. 12) is greater than 7, we branch to .Lno_short.
> >> - We then prefetch the contents of r3 & r4 and branch to
> >>   .Ldiffoffset_8bytes_make_align_start.
> >> - Under .Ldiffoffset_novmx_cmp, Since r3 is unaligned we end up comparing
> >>   "SGI" part of the string. r3's value is then aligned. r4's value is
> >>   incremented by 3. For comparing the remaining 9 bytes, we jump to
> >>   .Lcmp_lt32bytes.
> >> - Here, 8 bytes of the remaining 9 bytes are compared and execution moves 
> >> to
> >>   .Lcmp_rest_lt8bytes.
> >> - Here we execute "LD rB,0,r4". In the case of this bug, r4 has an 
> >> unaligned
> >>   value and hence ends up accessing the "next" double word. The "next" 
> >> double
> >>   word happens to occur after the last page mapped into the kernel's 
> >> address
> >>   space and hence this leads to the previously listed oops.
> >
> > Thanks for the analysis.
> >
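The failure lends itself to a small reproducer in the style of the
powerpc stringloops selftests, which link the assembly memcmp into a
userspace binary. With the regular libc memcmp the program below runs
harmlessly, so treat it as a sketch of the buffer placement rather than
a ready-made test:

/* Place s2 in the last 12 bytes of a mapped page followed by a
 * PROT_NONE page: the unaligned 8-byte tail load in the buggy asm
 * memcmp then reads up to 7 bytes past the boundary and faults. */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 2 * pgsz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(buf != MAP_FAILED);
	/* Guard page, standing in for the end of the mapped kernel
	 * address space in the oops above. */
	assert(mprotect(buf + pgsz, pgsz, PROT_NONE) == 0);

	char *s2 = buf + pgsz - 12;	/* ends exactly at the boundary */
	memcpy(s2, "SGI_ACL_FILE", 12);

	static char s1buf[24] __attribute__((aligned(8)));
	char *s1 = s1buf + 5;		/* 3 bytes shy of alignment, as in
					 * the analysis above */
	memcpy(s1, "SGI_ACL_FILE", 12);

	return memcmp(s1, s2, 12);
}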

Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type

2019-02-03 Thread David Gibson
On Wed, Jan 23, 2019 at 05:24:13PM +0100, Cédric Le Goater wrote:
> On 1/22/19 5:56 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
> >> We will have different KVM devices for interrupts, one for the
> >> XICS-over-XIVE mode and one for the XIVE native exploitation
> >> mode. Let's add some checks to make sure we are not mixing the
> >> interfaces in KVM.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>  arch/powerpc/kvm/book3s_xive.c | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_xive.c 
> >> b/arch/powerpc/kvm/book3s_xive.c
> >> index f78d002f0fe0..8a4fa45f07f8 100644
> >> --- a/arch/powerpc/kvm/book3s_xive.c
> >> +++ b/arch/powerpc/kvm/book3s_xive.c
> >> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
> >>  {
> >>struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >>  
> >> +  if (!kvmppc_xics_enabled(vcpu))
> >> +  return -EPERM;
> >> +
> >>if (!xc)
> >>return 0;
> >>  
> >> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
> >> icpval)
> >>u8 cppr, mfrr;
> >>u32 xisr;
> >>  
> >> +  if (!kvmppc_xics_enabled(vcpu))
> >> +  return -EPERM;
> >> +
> >>if (!xc || !xive)
> >>return -ENOENT;
> > 
> > I can't see how these new checks could ever trigger in the code as it
> > stands.  Is there a way at present? 
> 
> It would require some custom QEMU doing silly things: create the XICS 
> KVM device, and then call kvm_get_one_reg(KVM_REG_PPC_ICP_STATE) or 
> kvm_set_one_reg(icp->cs, KVM_REG_PPC_ICP_STATE) without connecting the
> vCPU to its presenter. 
> 
> Today, you get an ENOENT.

TBH, ENOENT seems fine to me.

> > Do following patches ever add a path where the new checks could trigger, 
> > or is this just an excess of caution? 
> 
> With the following patches, QEMU could do something even more silly,
> which is to mix the interrupt mode interfaces: create a KVM XICS device
> and call KVM CPU ioctls of the KVM XIVE device, or the opposite.

AFAICT, like above, that won't really differ from calling the XIVE CPU
ioctl()s when no irqchip is set up at all, and should be covered by
just a !xive check.

> 
> > (Your patch description should ideally have answered these questions
> > for me.)
> 
> Yes. I also think that I introduced this patch too early in the series.
> It makes more sense when the XICS and the XIVE KVM devices are available.
> 
> Thanks,
> 
> C.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-03 Thread Alex Ghiti

On 1/17/19 1:39 PM, Alexandre Ghiti wrote:

From: Alexandre Ghiti 

On systems without CMA or (MEMORY_ISOLATION && COMPACTION) activated but
that support gigantic pages, boottime reserved gigantic pages cannot be
freed at all. This patch simply enables the possibility to hand those
pages back to the memory allocator.

This commit then renames gigantic_page_supported and
ARCH_HAS_GIGANTIC_PAGE to make them more accurate. Indeed, these values
being false does not mean that the system cannot use gigantic pages: it
just means that runtime allocation of gigantic pages is not supported;
one can still allocate boottime gigantic pages if the architecture
supports it.

Signed-off-by: Alexandre Ghiti 
---

- Compiled on all architectures
- Tested on riscv architecture

  arch/arm64/Kconfig   |  2 +-
  arch/arm64/include/asm/hugetlb.h |  7 +++--
  arch/powerpc/include/asm/book3s/64/hugetlb.h |  4 +--
  arch/powerpc/platforms/Kconfig.cputype   |  2 +-
  arch/s390/Kconfig|  2 +-
  arch/s390/include/asm/hugetlb.h  |  7 +++--
  arch/x86/Kconfig |  2 +-
  arch/x86/include/asm/hugetlb.h   |  7 +++--
  fs/Kconfig   |  2 +-
  include/linux/gfp.h  |  2 +-
  mm/hugetlb.c | 43 +++-
  mm/page_alloc.c  |  4 +--
  12 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..18239cbd7fcd 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -18,7 +18,7 @@ config ARM64
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
+   select ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PTE_SPECIAL
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index fb6609875455..797fc77eabcd 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -65,8 +65,11 @@ extern void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
  
  #include 
  
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static inline bool gigantic_page_supported(void) { return true; }
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION
+static inline bool gigantic_page_runtime_allocation_supported(void)
+{
+   return true;
+}
  #endif
  
  #endif /* __ASM_HUGETLB_H */

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 5b0177733994..7711f0e2c7e5 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -32,8 +32,8 @@ static inline int hstate_get_psize(struct hstate *hstate)
}
  }
  
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static inline bool gigantic_page_supported(void)
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION
+static inline bool gigantic_page_runtime_allocation_supported(void)
  {
return true;
  }
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 8c7464c3f27f..779e06bac697 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -319,7 +319,7 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
  config PPC_RADIX_MMU
bool "Radix MMU Support"
depends on PPC_BOOK3S_64
-   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
+   select ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION if (MEMORY_ISOLATION && COMPACTION) || CMA
default y
help
  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..6776eef6a9ae 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -69,7 +69,7 @@ config S390
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
+   select ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 2d1afa58a4b6..57c952f5388e 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -116,7 +116,10 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
  }
  
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static inline bool gigantic_page_supported(void) { return true; }
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE_RUNTIME_ALLOCATION
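To make the intended behaviour concrete, a minimal standalone model
follows; only gigantic_page_runtime_allocation_supported() is a name
from the patch, the pool handling is illustrative:

#include <stdbool.h>
#include <stdio.h>

/* Per the commit message: false now only means "no runtime
 * allocation", not "no gigantic pages at all". */
static bool gigantic_page_runtime_allocation_supported(void)
{
	return false;
}

static unsigned long pool = 4;	/* boottime-reserved gigantic pages */

/* Illustrative pool resize: shrinking (freeing pages back to the
 * allocator) is always allowed, growing needs runtime allocation. */
static int set_gigantic_pool(unsigned long count)
{
	if (count > pool && !gigantic_page_runtime_allocation_supported())
		return -1;
	pool = count;
	return 0;
}

int main(void)
{
	printf("shrink to 0: %d\n", set_gigantic_pool(0));	/* now allowed */
	printf("grow to 8: %d\n", set_gigantic_pool(8));	/* still refused */
	return 0;
}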

Re: use generic DMA mapping code in powerpc V4

2019-02-03 Thread Christian Zigotzky
OK, next step: b50f42f0fe12965ead395c76bcb6a14f00cdf65b (powerpc/dma: 
use the dma_direct mapping routines)


git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a

git checkout b50f42f0fe12965ead395c76bcb6a14f00cdf65b

Results: The X1000 and X5000 boot but unfortunately the P.A. Semi 
Ethernet doesn't work.


-- Christian


On 01 February 2019 at 5:54PM, Christian Zigotzky wrote:

Hi Christoph,

I will try it at the weekend.

Thanks,
Christian

Sent from my iPhone


On 1. Feb 2019, at 09:04, Christoph Hellwig  wrote:


On Thu, Jan 31, 2019 at 01:48:26PM +0100, Christian Zigotzky wrote:
Hi Christoph,

I compiled kernels for the X5000 and X1000 from your branch 'powerpc-dma.6'
today.

Gitweb:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6

git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a

The X1000 and X5000 boot but unfortunately the P.A. Semi Ethernet doesn't
work.

Oh.  Can you try with just the next patch, and then both patches, applied
over the working setup?  That is, first:

http://git.infradead.org/users/hch/misc.git/commitdiff/b50f42f0fe12965ead395c76bcb6a14f00cdf65b

then also with:

http://git.infradead.org/users/hch/misc.git/commitdiff/21fe52470a483afbb1726741118abef8602dde4d





Re: [PATCH v2 10/21] memblock: refactor internal allocation functions

2019-02-03 Thread Mike Rapoport
(dropped most of 'CC')

On Sun, Feb 03, 2019 at 08:39:20PM +1100, Michael Ellerman wrote:
> Mike Rapoport  writes:
> 
> > Currently, memblock has several internal functions with overlapping
> > functionality. They all call memblock_find_in_range_node() to find free
> > memory and then reserve the allocated range and mark it with kmemleak.
> > However, there is a difference in the allocation constraints and in fallback
> > strategies.
> >
> > The allocations returning physical address first attempt to find free
> > memory on the specified node within mirrored memory regions, then retry on
> > the same node without the requirement for memory mirroring and finally fall
> > back to all available memory.
> >
> > The allocations returning virtual address start with clamping the allowed
> > range to memblock.current_limit, attempt to allocate from the specified
> > node from regions with mirroring and with user defined minimal address. If
> > such allocation fails, next attempt is done with node restriction lifted.
> > Next, the allocation is retried with minimal address reset to zero and at
> > last without the requirement for mirrored regions.
> >
> > Let's consolidate various fallbacks handling and make them more consistent
> > for physical and virtual variants. Most of the fallback handling is moved
> > to memblock_alloc_range_nid() and it now handles node and mirror fallbacks.
> >
> > The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
> > physical address of the allocated range and converts it to virtual address.
> >
> > The fallback for allocation below the specified minimal address remains in
> > memblock_alloc_internal() because memblock_alloc_range_nid() is used by CMA
> > with exact requirement for lower bounds.
> 
> This is causing problems on some of my machines.
> 
> I see NODE_DATA allocations falling back to node 0 when they shouldn't,
> or didn't previously.
> 
> eg, before:
> 
> 57990190: (116011251): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
> 58152042: (116373087): numa:   NODE_DATA [mem 0x8fff90980-0x8fff97fff]
> 
> after:
> 
> 16356872061562: (6296877055): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
> 16356872079279: (6296894772): numa:   NODE_DATA [mem 0xfffcd300-0xfffd497f]
> 16356872096376: (6296911869): numa: NODE_DATA(1) on node 0
> 
> 
> On some of my other systems it does that, and then panics because it
> can't allocate anything at all:
> 
> [0.00] numa:   NODE_DATA [mem 0x7ffcaee80-0x7ffcb3fff]
> [0.00] numa:   NODE_DATA [mem 0x7ffc99d00-0x7ffc9ee7f]
> [0.00] numa: NODE_DATA(1) on node 0
> [0.00] Kernel panic - not syncing: Cannot allocate 20864 bytes for 
> node 16 data
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.0.0-rc4-gccN-next-20190201-gdc4c899 #1
> [0.00] Call Trace:
> [0.00] [c11cfca0] [c0c11044] dump_stack+0xe8/0x164 
> (unreliable)
> [0.00] [c11cfcf0] [c00fdd6c] panic+0x17c/0x3e0
> [0.00] [c11cfd90] [c0f61bc8] initmem_init+0x128/0x260
> [0.00] [c11cfe60] [c0f57940] setup_arch+0x398/0x418
> [0.00] [c11cfee0] [c0f50a94] start_kernel+0xa0/0x684
> [0.00] [c11cff90] [c000af70] 
> start_here_common+0x1c/0x52c
> [0.00] Rebooting in 180 seconds..
> 
> 
> So there's something going wrong there, I haven't had time to dig into
> it though (Sunday night here).

Yeah, I've misplaced 'nid' and 'MEMBLOCK_ALLOC_ACCESSIBLE' in
memblock_phys_alloc_try_nid() :(

Can you please check if the below patch fixes the issue on your systems?
 
> cheers
> 

From 5875b7440e985ce551e6da3cb28aa8e9af697e10 Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Sun, 3 Feb 2019 13:35:42 +0200
Subject: [PATCH] memblock: fix parameter order in
 memblock_phys_alloc_try_nid()

The refactoring of the internal memblock allocation functions used the
wrong order of parameters in the memblock_alloc_range_nid() call from
memblock_phys_alloc_try_nid().
Fix it.

Signed-off-by: Mike Rapoport 
---
 mm/memblock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index e047933..0151a5b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1402,8 +1402,8 @@ phys_addr_t __init memblock_phys_alloc_range(phys_addr_t size,
 
 phys_addr_t __init memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid)
 {
-   return memblock_alloc_range_nid(size, align, 0, nid,
-   MEMBLOCK_ALLOC_ACCESSIBLE);
+   return memblock_alloc_range_nid(size, align, 0,
+   MEMBLOCK_ALLOC_ACCESSIBLE, nid);
 }
 
 /**
-- 
2.7.4
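Worth noting why the swap compiled silently: in the refactored signature,
memblock_alloc_range_nid(size, align, start, end, nid), end is a
phys_addr_t while nid is a plain int, so exchanging the two is just an
implicit integer conversion. A standalone model (the values are
illustrative; MEMBLOCK_ALLOC_ACCESSIBLE really is 0 in memblock.h):

#include <stdio.h>

typedef unsigned long long phys_addr_t;
#define MEMBLOCK_ALLOC_ACCESSIBLE 0	/* kernel value: "no upper limit" */

/* Same parameter order as the refactored memblock_alloc_range_nid(). */
static phys_addr_t alloc_range_nid(phys_addr_t size, phys_addr_t align,
				   phys_addr_t start, phys_addr_t end,
				   int nid)
{
	printf("size=%llu start=%llu end=%llu nid=%d\n",
	       size, start, end, nid);
	return 0;
}

int main(void)
{
	int nid = 16;

	/* Buggy order: no compiler warning, but the node id becomes the
	 * upper address bound and the node hint becomes 0. */
	alloc_range_nid(20864, 64, 0, nid, MEMBLOCK_ALLOC_ACCESSIBLE);

	/* Fixed order, as in the patch above. */
	alloc_range_nid(20864, 64, 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
	return 0;
}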


-- 
Sincerely yours,
Mike.



Re: [PATCH v2 10/21] memblock: refactor internal allocation functions

2019-02-03 Thread Mike Rapoport
On Sun, Feb 03, 2019 at 08:39:20PM +1100, Michael Ellerman wrote:
> Mike Rapoport  writes:
> 
> > Currently, memblock has several internal functions with overlapping
> > functionality. They all call memblock_find_in_range_node() to find free
> > memory and then reserve the allocated range and mark it with kmemleak.
> > However, there is a difference in the allocation constraints and in fallback
> > strategies.
> >
> > The allocations returning physical address first attempt to find free
> > memory on the specified node within mirrored memory regions, then retry on
> > the same node without the requirement for memory mirroring and finally fall
> > back to all available memory.
> >
> > The allocations returning virtual address start with clamping the allowed
> > range to memblock.current_limit, attempt to allocate from the specified
> > node from regions with mirroring and with user defined minimal address. If
> > such allocation fails, next attempt is done with node restriction lifted.
> > Next, the allocation is retried with minimal address reset to zero and at
> > last without the requirement for mirrored regions.
> >
> > Let's consolidate various fallbacks handling and make them more consistent
> > for physical and virtual variants. Most of the fallback handling is moved
> > to memblock_alloc_range_nid() and it now handles node and mirror fallbacks.
> >
> > The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
> > physical address of the allocated range and converts it to virtual address.
> >
> > The fallback for allocation below the specified minimal address remains in
> > memblock_alloc_internal() because memblock_alloc_range_nid() is used by CMA
> > with exact requirement for lower bounds.
> 
> This is causing problems on some of my machines.
> 
> I see NODE_DATA allocations falling back to node 0 when they shouldn't,
> or didn't previously.
> 
> eg, before:
> 
> 57990190: (116011251): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
> 58152042: (116373087): numa:   NODE_DATA [mem 0x8fff90980-0x8fff97fff]
> 
> after:
> 
> 16356872061562: (6296877055): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
> 16356872079279: (6296894772): numa:   NODE_DATA [mem 0xfffcd300-0xfffd497f]
> 16356872096376: (6296911869): numa: NODE_DATA(1) on node 0
> 
> 
> On some of my other systems it does that, and then panics because it
> can't allocate anything at all:
> 
> [0.00] numa:   NODE_DATA [mem 0x7ffcaee80-0x7ffcb3fff]
> [0.00] numa:   NODE_DATA [mem 0x7ffc99d00-0x7ffc9ee7f]
> [0.00] numa: NODE_DATA(1) on node 0
> [0.00] Kernel panic - not syncing: Cannot allocate 20864 bytes for 
> node 16 data
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.0.0-rc4-gccN-next-20190201-gdc4c899 #1
> [0.00] Call Trace:
> [0.00] [c11cfca0] [c0c11044] dump_stack+0xe8/0x164 
> (unreliable)
> [0.00] [c11cfcf0] [c00fdd6c] panic+0x17c/0x3e0
> [0.00] [c11cfd90] [c0f61bc8] initmem_init+0x128/0x260
> [0.00] [c11cfe60] [c0f57940] setup_arch+0x398/0x418
> [0.00] [c11cfee0] [c0f50a94] start_kernel+0xa0/0x684
> [0.00] [c11cff90] [c000af70] 
> start_here_common+0x1c/0x52c
> [0.00] Rebooting in 180 seconds..
> 
> 
> So there's something going wrong there, I haven't had time to dig into
> it though (Sunday night here).

I'll try to see if I can reproduce it with qemu.
 
> cheers
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v2 10/21] memblock: refactor internal allocation functions

2019-02-03 Thread Michael Ellerman
Mike Rapoport  writes:

> Currently, memblock has several internal functions with overlapping
> functionality. They all call memblock_find_in_range_node() to find free
> memory and then reserve the allocated range and mark it with kmemleak.
> However, there is a difference in the allocation constraints and in fallback
> strategies.
>
> The allocations returning physical address first attempt to find free
> memory on the specified node within mirrored memory regions, then retry on
> the same node without the requirement for memory mirroring and finally fall
> back to all available memory.
>
> The allocations returning virtual address start with clamping the allowed
> range to memblock.current_limit, attempt to allocate from the specified
> node from regions with mirroring and with user defined minimal address. If
> such allocation fails, next attempt is done with node restriction lifted.
> Next, the allocation is retried with minimal address reset to zero and at
> last without the requirement for mirrored regions.
>
> Let's consolidate various fallbacks handling and make them more consistent
> for physical and virtual variants. Most of the fallback handling is moved
> to memblock_alloc_range_nid() and it now handles node and mirror fallbacks.
>
> The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
> physical address of the allocated range and converts it to virtual address.
>
> The fallback for allocation below the specified minimal address remains in
> memblock_alloc_internal() because memblock_alloc_range_nid() is used by CMA
> with exact requirement for lower bounds.

This is causing problems on some of my machines.

I see NODE_DATA allocations falling back to node 0 when they shouldn't,
or didn't previously.

eg, before:

57990190: (116011251): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
58152042: (116373087): numa:   NODE_DATA [mem 0x8fff90980-0x8fff97fff]

after:

16356872061562: (6296877055): numa:   NODE_DATA [mem 0xfffe4980-0xfffebfff]
16356872079279: (6296894772): numa:   NODE_DATA [mem 0xfffcd300-0xfffd497f]
16356872096376: (6296911869): numa: NODE_DATA(1) on node 0


On some of my other systems it does that, and then panics because it
can't allocate anything at all:

[0.00] numa:   NODE_DATA [mem 0x7ffcaee80-0x7ffcb3fff]
[0.00] numa:   NODE_DATA [mem 0x7ffc99d00-0x7ffc9ee7f]
[0.00] numa: NODE_DATA(1) on node 0
[0.00] Kernel panic - not syncing: Cannot allocate 20864 bytes for node 
16 data
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.0.0-rc4-gccN-next-20190201-gdc4c899 #1
[0.00] Call Trace:
[0.00] [c11cfca0] [c0c11044] dump_stack+0xe8/0x164 
(unreliable)
[0.00] [c11cfcf0] [c00fdd6c] panic+0x17c/0x3e0
[0.00] [c11cfd90] [c0f61bc8] initmem_init+0x128/0x260
[0.00] [c11cfe60] [c0f57940] setup_arch+0x398/0x418
[0.00] [c11cfee0] [c0f50a94] start_kernel+0xa0/0x684
[0.00] [c11cff90] [c000af70] 
start_here_common+0x1c/0x52c
[0.00] Rebooting in 180 seconds..


So there's something going wrong there, I haven't had time to dig into
it though (Sunday night here).

cheers