Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Christoph Hellwig
On Tue, Jan 29, 2019 at 09:36:08PM -0500, Michael S. Tsirkin wrote:
> This has been discussed ad nauseam. virtio is all about compatibility.
> Losing a couple of lines of code isn't worth breaking working setups.
> People that want "just use DMA API no tricks" now have the option.
> Setting a flag in a feature bit map is literally a single line
> of code in the hypervisor. So stop pushing for breaking working
> legacy setups and just fix it in the right place.

I agree with the legacy aspect.  What I am missing is an extremely
strong wording that says you SHOULD always set this flag for new
hosts, including an explanation why.
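
For context, the hypervisor-side "single line" is already exposed as a
per-device knob in practice; with QEMU, for instance, it looks like this
(shown for virtio-net; the property still carries the pre-rename
IOMMU_PLATFORM spelling):

    -device virtio-net-pci,iommu_platform=on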


Does SMP work at all on 40x ?

2019-01-29 Thread Christophe Leroy

In transfer_to_handler() (entry_32.S), we have:

#if defined(CONFIG_40x) || defined(CONFIG_BOOKE)
...
#ifdef CONFIG_SMP
CURRENT_THREAD_INFO(r9, r1)
lwz r9,TI_CPU(r9)
slwi	r9,r9,3
add r11,r11,r9
#endif
#endif

When running this piece of code, MMU translation is off. But r9 contains 
the virtual addr of current_thread_info, so unless I'm missing something, 
this cannot work on the 40x, can it ?


On CONFIG_BOOKE it works because phys addr = virt addr
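
If 40x SMP were ever wanted, one way out might be to translate the pointer
before the MMU-off access, along these lines (an untested sketch; it assumes
tophys() is valid at this point on 40x):

    #ifdef CONFIG_SMP
    	CURRENT_THREAD_INFO(r9, r1)
    	tophys(r9, r9)		/* sketch: virt-to-phys before the MMU-off load */
    	lwz	r9,TI_CPU(r9)
    	slwi	r9,r9,3
    	add	r11,r11,r9
    #endif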

Christophe


Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-01-29 Thread Cédric Le Goater
On 1/30/19 5:29 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
>> On 1/22/19 6:05 AM, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
 This is the basic framework for the new KVM device supporting the XIVE
 native exploitation mode. The user interface exposes a new capability
 and a new KVM device to be used by QEMU.
>>>
>>> [snip]
 @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
  #ifdef CONFIG_KVM_XIVE
if (xive_enabled()) {
kvmppc_xive_init_module();
 +  kvmppc_xive_native_init_module();
kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
 +  kvm_register_device_ops(&kvm_xive_native_ops,
 +  KVM_DEV_TYPE_XIVE);
>>>
>>> I think we want tighter conditions on initializing the xive_native
>>> stuff and creating the xive device class.  We could have
>>> xive_enabled() returning true in a guest, and this code will get
>>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
>>> are running bare metal).
>>
>> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
>> (L1) obviously crashes trying to call OPAL. I have tightened the test with: 
>>
>>  if (xive_enabled() && !kvmhv_on_pseries()) {
>>
>> for now.
>>
>> As this is a problem today in 5.0.x, I will send a patch for it if you think
> 
> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> with kernel_irqchip=on in a nested guest and it works just fine.  What
> exactly did you test?

L0: Linux 5.0.0-rc3 (+ KVM HV)
L1: QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
L2:  QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3

L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
it does an OPAL call and it's running under SLOF. See below.

I don't understand how L2 can work with kernel_irqchip=on. Could you
please explain ? 

>> it is correct. I don't think we should bother taking care of the PR case
>> on P9. Should we ? 
> 
> We do need to take care of PR KVM on P9, since it is the only form of
> nested KVM that works inside a host in HPT mode.

ok. That is the test case. There are quite a few combinations now.

Thanks,

C.

[   49.547056] Oops: Exception in kernel mode, sig: 4 [#1]
[   49.555101] LE SMP NR_CPUS=2048 NUMA pSeries
[   49.555132] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover failover virtio_scsi
[   49.555335] CPU: 9 PID: 2162 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.0.0-rc3+ #53
[   49.555378] NIP:  c00a7548 LR: c00a4044 CTR: c00a24b0
[   49.555421] REGS: c003ad71f8a0 TRAP: 0700   Not tainted  (5.0.0-rc3+)
[   49.555456] MSR:  80041033   CR: 44222822  XER: 2004
[   49.01] CFAR: c00a2508 IRQMASK: 0 
[   49.01] GPR00: 0087 c003ad71fb30 c175f700 000b 
[   49.01] GPR04:   c003f88d4000 000b 
[   49.01] GPR08: 0003fd80 000b 0800 0031 
[   49.01] GPR12: 80001002 c7ff3280   
[   49.01] GPR16: 78d2bd60  02c9896d7800 78d2b970 
[   49.01] GPR20: 02c95c876f90 02c95c876fa0 02c95c876f80 02c95c876f70 
[   49.01] GPR24: 02c95cf4f648  c003ab3e4058 006000c0 
[   49.01] GPR28: 000b c003ab3e  c003f88d 
[   49.555883] NIP [c00a7548] opal_xive_alloc_vp_block+0x50/0x68
[   49.555919] LR [c00a4044] opal_return+0x0/0x48
[   49.555947] Call Trace:
[   49.555964] [c003ad71fb30] [c00a250c] xive_native_alloc_vp_block+0x5c/0x1c0 (unreliable)
[   49.556019] [c003ad71fbc0] [c0080430c0c0] kvmppc_xive_create+0x98/0x168 [kvm]
[   49.556065] [c003ad71fc00] [c008042f9fcc] kvm_vm_ioctl+0x474/0xa00 [kvm]
[   49.556113] [c003ad71fd10] [c0423a64] do_vfs_ioctl+0xd4/0x8e0
[   49.556153] [c003ad71fdb0] [c0424334] ksys_ioctl+0xc4/0x110
[   49.556190] [c003ad71fe00] [c04243a8] sys_ioctl+0x28/0x80
[   49.556230] [c003ad71fe20] [c000b288] system_call+0x5c/0x70
[   49.556265] Instruction dump:
[   49.556288] 6000 7d600026 91610008 3960 616b8000 f98d0980 7d8c5878 7d810164 
[   49.556332] e9628098 7d6803a6 39600031 7d8c5878 <7d9b4ba6> e96280b0 e98b0008 e84b 
[   49.556378] ---[ end trace ac7420a6784de93b ]---


Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-01-29 Thread Paul Mackerras
On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >> This is the basic framework for the new KVM device supporting the XIVE
> >> native exploitation mode. The user interface exposes a new capability
> >> and a new KVM device to be used by QEMU.
> > 
> > [snip]
> >> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>  #ifdef CONFIG_KVM_XIVE
> >>if (xive_enabled()) {
> >>kvmppc_xive_init_module();
> >> +  kvmppc_xive_native_init_module();
> >>	kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >> +  kvm_register_device_ops(&kvm_xive_native_ops,
> >> +  KVM_DEV_TYPE_XIVE);
> > 
> > I think we want tighter conditions on initializing the xive_native
> > stuff and creating the xive device class.  We could have
> > xive_enabled() returning true in a guest, and this code will get
> > called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> > are running bare metal).
> 
> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
> (L1) obviously crashes trying to call OPAL. I have tightened the test with: 
> 
>   if (xive_enabled() && !kvmhv_on_pseries()) {
> 
> for now.
> 
> As this is a problem today in 5.0.x, I will send a patch for it if you think

How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
with kernel_irqchip=on in a nested guest and it works just fine.  What
exactly did you test?

> it is correct. I don't think we should bother taking care of the PR case
> on P9. Should we ? 

We do need to take care of PR KVM on P9, since it is the only form of
nested KVM that works inside a host in HPT mode.

Paul.


Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-01-29 Thread Paul Mackerras
On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
> >>> Another general comment is that you seem to have written all this
> >>> code assuming we are using HV KVM in a host running bare-metal.
> >>
> >> Yes. I didn't look at the other configurations. I thought that we could
> >> use the kernel_irqchip=off option to begin with. A couple of checks
> >> are indeed missing.
> > 
> > Using kernel_irqchip=off would mean that we would not be able to use
> > the in-kernel XICS emulation, which would have a performance impact.
> 
> yes. But it is not supported today. Correct ? 

Not correct, it has been working for years, and works in v5.0-rc1 (I
just tested it), at both L0 and L1.

> > We need an explicit capability for XIVE exploitation that can be
> > enabled or disabled on the qemu command line, so that we can enforce a
> > uniform set of capabilities across all the hosts in a migration
> > domain.  And it's no good to say we have the capability when all
> > attempts to use it will fail.  Therefore the kernel needs to say that
> > it doesn't have the capability in a PR KVM guest or in a nested HV
> > guest.
> 
> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
> for future use.

That's not what I meant.  Why do we need that?  I meant that querying
the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
guest.  It should only return 1 if we are running bare-metal on a P9.
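
In code terms, that amounts to something like this in
kvm_vm_ioctl_check_extension() (a sketch only; the exact condition and
helper names are to be worked out):

    case KVM_CAP_PPC_IRQ_XIVE:
            /* sketch: only advertise XIVE exploitation on bare-metal P9 */
            r = xive_enabled() && cpu_has_feature(CPU_FTR_ARCH_300) &&
                !kvmhv_on_pseries();
            break;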

> >>> However, we could be using PR KVM (either in a bare-metal host or in a
> >>> guest), or we could be doing nested HV KVM where we are using the
> >>> kvm_hv module inside a KVM guest and using special hypercalls for
> >>> controlling our guests.
> >>
> >> Yes. 
> >>
> >> It would be good to talk a little about the nested support (offline 
> >> may be) to make sure that we are not missing some major interface that 
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> >> guests. That said, I haven't looked much at how nested is designed.  
> > 
> > The current design of nested HV is that the entire non-volatile state
> > of all the nested guests is encapsulated within the state and
> > resources of the L1 hypervisor.  That means that if the L1 hypervisor
> > gets migrated, all of its guests go across inside it and there is no
> > extra state that L0 needs to be aware of.  That would imply that the
> > VP number space for the nested guests would need to come from within
> > the VP number space for L1; but the amount of VP space we allocate to
> > each guest doesn't seem to be large enough for that to be practical.
> 
> If the KVM XIVE device had some information on the max number of CPUs 
> provisioned for the guest, we could optimize the VP allocation.

The problem is that we might have 1000 guests running under L0, or we
might have 1 guest running under L0 and 1000 guests running under it,
and we have no way to know which situation to optimize for at the
point where an L1 guest starts.  If we had an enormous VP space then
we could just give each L1 guest a large amount of VP space and solve
it that way; but we don't.

Paul.


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-29 Thread Paul Mackerras
On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>> Would we end up with too many VMAs if we just used mmap() to
> >>> change the mappings from the software-generated pages to the
> >>> hardware-generated interrupt pages?  
> >> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> > 
> > Confused.  You say the number space has 32768 entries but then imply
> > there are only 8K entries.  Do you mean that the API allows for 15-bit
> > IRQ numbers but we are only making using of 8192 of them?
> 
> Ouch. My bad. Let's do it again. 
> 
> The sPAPR IRQ number space is 0x2000 wide :
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> 
> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> devices (which can be extended if needed).
> 
> So that's 8192 x 2 ESB pages.
> 
> >> extend the last range if needed as these are for MSIs. Dynamic 
> >> extensions under KVM should work also.
> >>
> >> This is to say that we have 8K x 2 (trigger+EOI) pages. This is a
> >> lot of mmap(), too much. Also the KVM model needs to be compatible
> > 
> > I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> > space would be covered by a single mmap, overlaid by subsequent mmaps
> > where we need to map real device interrupts.
> 
> ok. The same fault handler could be used to populate the VMA with the 
> ESB pages. 
> 
> But it would mean extra work on the QEMU side, which is not needed 
> with this patch. 

Maybe, but just storing a single vma pointer in our private data is
not a feasible approach.  First, you have no control on the lifetime
of the vma and thus this is a use-after-free waiting to happen, and
secondly, there could be multiple vmas that you need to worry about.
Userspace could do multiple mmaps, or could do mprotect or similar on
part of an existing mmap, which would split the vma for the mmap into
multiple vmas.  You don't get notified about munmap either as far as I
can tell, so the vma is liable to go away.  And it doesn't matter if
QEMU would never do such things; if userspace can provoke a
use-after-free in the kernel using KVM then someone will write a
specially crafted user program to do that.

So we either solve it in userspace, or we have to write and maintain
complex kernel code with deep links into the MM subsystem.  I'd much
rather solve it in userspace.
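
For illustration, the userspace flow being suggested would look roughly like
this (a sketch only; the fd used and the per-irq offset layout are
assumptions, not an existing ABI):

    /* overlay one hwirq's trigger+EOI ESB pages inside the big ESB window */
    void *p = mmap(esb_base + hwirq * 2 * pgsz, 2 * pgsz,
                   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                   xive_fd, hwirq * 2 * pgsz);
    ...
    munmap(esb_base + hwirq * 2 * pgsz, 2 * pgsz);   /* on hot-unplug */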

> >> with the QEMU emulated one and it was simpler to have one overall
> >> memory region for the IPI ESBs, one for the END ESBs (if we support
> >> that one day) and one for the TIMA.
> >>
> >>> Are the necessary pages for a PCI
> >>> passthrough device contiguous in both host real space 
> >>
> >> They should be, as they are the PHB4 ESBs.
> >>
> >>> and guest real space ? 
> >>
> >> also. That's how we organized the mapping. 
> > 
> > "How we organized the mapping" is a significant design decision that I
> > haven't seen documented anywhere, and is really needed for
> > understanding what's going on.
> 
> OK. I will add comments on that. See below for some description.
> 
> There is nothing fancy, it's simply indexed with the interrupt number,
> like for HW, or for the QEMU XIVE emulated model.
> 
> >>> If so we'd only need one mmap() for all the device's interrupt
> >>> pages.
> >>
> >> Ah. So we would want to make a special case for the passthrough 
> >> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>
> >> Wouldn't that require hot-plugging a memory region under the guest ? 
> > 
> > No; the way that a memory region works is that userspace can do
> > whatever disparate mappings it likes within the region on the user
> > process side, and the corresponding region of guest real address space
> > follows that automatically.
> 
> yes. I suppose this should work also for 'ram device' memory mappings.
> 
> So when the passthrough device is added to the guest, we would add a 
> new 'ram device' memory region for the device interrupt ESB pages 
> that would overlap with the initial guest ESB pages.  

Not knowing the QEMU internals all that well, I don't at all
understand why a new ram device is necessary.  I would see it as a
single virtual area mapping the ESB pages of guest hwirqs that are in
use, and we manage those mappings with mmap and munmap.

> This is really a different approach.
> 
> >> which means that we need to provision an address space/container 
> >> region for these regions. What are the benefits ? 
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Explicitly unmapping parts of the VMA seems like the wrong way to do
> > it.  If you have a device mmap where the device wants to change the
> > physical page underlying parts of the 

Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-29 Thread Paul Mackerras
On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
> On 1/29/19 5:12 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Actually, now that I come to think of it, there could be any number of
> > VMAs (well, up to almost 64k of them), since once you have a file
> > descriptor you can call mmap on it multiple times.
> > 
> > The more I think about it, the more I think that getting userspace to
> > manage the mappings with mmap() and munmap() is the right idea if it
> > can be made to work.
> 
> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
> call "ibm,change-msi".  I think that's the right spot for it. 

I was thinking that the ESB pages should be mmapped for device
interrupts at VM startup or when a device is hot-plugged in, and
munmapped when the device is hot-removed.  Maybe the mmap could be
done in conjunction with the KVM_IRQFD call?

What is the reasoning behind doing it in the ibm,change-msi call?

Paul.


[RFC PATCH v3 4/4] kvmppc: Handle memory plug/unplug to secure VM

2019-01-29 Thread Bharata B Rao
Register the new memslot with UV during plug and unregister
the memslot during unplug.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/ucall-api.h|  7 +++
 arch/powerpc/include/uapi/asm/uapi_uvcall.h |  1 +
 arch/powerpc/kvm/book3s_hv.c| 19 +++
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/include/asm/ucall-api.h b/arch/powerpc/include/asm/ucall-api.h
index d266670229cb..cbb8bb38eb8b 100644
--- a/arch/powerpc/include/asm/ucall-api.h
+++ b/arch/powerpc/include/asm/ucall-api.h
@@ -82,5 +82,12 @@ static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
return plpar_ucall(UV_REGISTER_MEM_SLOT, retbuf, lpid, start_gpa,
   size, flags, slotid);
 }
+
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+   unsigned long retbuf[PLPAR_UCALL_BUFSIZE];
+
+   return plpar_ucall(UV_UNREGISTER_MEM_SLOT, retbuf, lpid, slotid);
+}
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_UCALL_API_H */
diff --git a/arch/powerpc/include/uapi/asm/uapi_uvcall.h b/arch/powerpc/include/uapi/asm/uapi_uvcall.h
index 79a11a6ee436..60e44c7b58c4 100644
--- a/arch/powerpc/include/uapi/asm/uapi_uvcall.h
+++ b/arch/powerpc/include/uapi/asm/uapi_uvcall.h
@@ -13,6 +13,7 @@
 #define UV_RESTRICTED_SPR_READ 0xf10C
 #define UV_RETURN  0xf11C
 #define UV_REGISTER_MEM_SLOT	0xF120
+#define UV_UNREGISTER_MEM_SLOT	0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT	0xF12C
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1dfb42ac9626..61e36c4516d5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -76,6 +76,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
@@ -4433,6 +4434,24 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
kvmppc_radix_flush_memslot(kvm, old);
+   /*
+* If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+*/
+   if (!kvm->arch.secure_guest)
+   return;
+
+   /*
+* TODO: Handle KVM_MR_MOVE
+*/
+   if (change == KVM_MR_CREATE) {
+   uv_register_mem_slot(kvm->arch.lpid,
+  new->base_gfn << PAGE_SHIFT,
+  new->npages * PAGE_SIZE,
+  0,
+  new->id);
+   } else if (change == KVM_MR_DELETE) {
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   }
 }
 
 /*
-- 
2.17.1



[RFC PATCH v3 3/4] kvmppc: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls

2019-01-29 Thread Bharata B Rao
H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

During early guest init, these hcalls will be issued by UV.
As part of these hcalls, [un]register memslots with UV.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |  2 ++
 arch/powerpc/include/asm/kvm_book3s_hmm.h   | 12 
 arch/powerpc/include/asm/ucall-api.h|  9 ++
 arch/powerpc/include/uapi/asm/uapi_uvcall.h |  1 +
 arch/powerpc/kvm/book3s_hv.c|  7 +
 arch/powerpc/kvm/book3s_hv_hmm.c| 33 +
 6 files changed, 64 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 05b8536f6653..fa7695928e30 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -343,6 +343,8 @@
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
+#define H_SVM_INIT_START   0xEF08
+#define H_SVM_INIT_DONE0xEF0C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s_hmm.h b/arch/powerpc/include/asm/kvm_book3s_hmm.h
index e61519c17485..af093f8b86cf 100644
--- a/arch/powerpc/include/asm/kvm_book3s_hmm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_hmm.h
@@ -13,6 +13,8 @@ extern unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
  unsigned long gra,
  unsigned long flags,
  unsigned long page_shift);
+extern unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+extern unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned int lpid,
@@ -29,5 +31,15 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned int lpid,
 {
return H_UNSUPPORTED;
 }
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
 #endif /* CONFIG_PPC_KVM_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/ucall-api.h b/arch/powerpc/include/asm/ucall-api.h
index 6c3bddc97b55..d266670229cb 100644
--- a/arch/powerpc/include/asm/ucall-api.h
+++ b/arch/powerpc/include/asm/ucall-api.h
@@ -73,5 +73,14 @@ static inline int uv_page_in(u64 lpid, u64 dst_ra, u64 src_gpa, u64 flags,
return plpar_ucall(UV_PAGE_OUT, retbuf, lpid, dst_ra, src_gpa, flags,
   page_shift);
 }
+
+static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
+  u64 flags, u64 slotid)
+{
+   unsigned long retbuf[PLPAR_UCALL_BUFSIZE];
+
+   return plpar_ucall(UV_REGISTER_MEM_SLOT, retbuf, lpid, start_gpa,
+  size, flags, slotid);
+}
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_UCALL_API_H */
diff --git a/arch/powerpc/include/uapi/asm/uapi_uvcall.h b/arch/powerpc/include/uapi/asm/uapi_uvcall.h
index 3a30820663a2..79a11a6ee436 100644
--- a/arch/powerpc/include/uapi/asm/uapi_uvcall.h
+++ b/arch/powerpc/include/uapi/asm/uapi_uvcall.h
@@ -12,6 +12,7 @@
 #define UV_RESTRICTED_SPR_WRITE 0xf108
 #define UV_RESTRICTED_SPR_READ 0xf10C
 #define UV_RETURN  0xf11C
+#define UV_REGISTER_MEM_SLOT	0xF120
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT	0xF12C
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e7edba1ec16a..1dfb42ac9626 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1017,6 +1017,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_get_gpr(vcpu, 6),
kvmppc_get_gpr(vcpu, 7));
break;
+   case H_SVM_INIT_START:
+   ret = kvmppc_h_svm_init_start(vcpu->kvm);
+   break;
+   case H_SVM_INIT_DONE:
+   ret = kvmppc_h_svm_init_done(vcpu->kvm);
+   break;
+
default:
return RESUME_HOST;
}
diff --git a/arch/powerpc/kvm/book3s_hv_hmm.c b/arch/powerpc/kvm/book3s_hv_hmm.c
index d8112092a242..b8a980172833 100644
--- a/arch/powerpc/kvm/book3s_hv_hmm.c
+++ b/arch/powerpc/kvm/book3s_hv_hmm.c
@@ -55,6 +55,39 @@ struct kvmppc_hmm_migrate_args {
unsigned long page_shift;
 };
 
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+   int ret = H_SUCCESS;
+   int srcu_idx;
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   slots = kvm_memslots(kvm);
+   kvm_for_each_memslot(memslot, slots) {
+   ret = uv_register_mem_slot(kvm->arch.lpid,
+  

[RFC PATCH v3 2/4] kvmppc: Add support for shared pages in HMM driver

2019-01-29 Thread Bharata B Rao
A secure guest will share some of its pages with the hypervisor (e.g. virtio
bounce buffers). Add support for shared pages in the HMM driver.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h |  3 ++
 arch/powerpc/kvm/book3s_hv_hmm.c  | 58 +--
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2f6b952deb0f..05b8536f6653 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -337,6 +337,9 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED	0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_hmm.c b/arch/powerpc/kvm/book3s_hv_hmm.c
index edc512acebd3..d8112092a242 100644
--- a/arch/powerpc/kvm/book3s_hv_hmm.c
+++ b/arch/powerpc/kvm/book3s_hv_hmm.c
@@ -45,6 +45,7 @@ struct kvmppc_hmm_page_pvt {
unsigned long *rmap;
unsigned int lpid;
unsigned long gpa;
+   bool skip_page_out;
 };
 
 struct kvmppc_hmm_migrate_args {
@@ -212,6 +213,45 @@ static const struct migrate_vma_ops kvmppc_hmm_migrate_ops = {
.finalize_and_map = kvmppc_hmm_migrate_finalize_and_map,
 };
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses the HMM fault handler to release the HMM page.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long *rmap, unsigned long gpa,
+ unsigned long addr, unsigned long page_shift)
+{
+
+   int ret;
+   unsigned int lpid = kvm->arch.lpid;
+   struct page *hmm_page;
+   struct kvmppc_hmm_page_pvt *pvt;
+   unsigned long pfn;
+   int srcu_idx;
+
+   if (kvmppc_is_hmm_pfn(*rmap)) {
+   hmm_page = pfn_to_page(*rmap & ~KVMPPC_PFN_HMM);
+   pvt = (struct kvmppc_hmm_page_pvt *)
+   hmm_devmem_page_get_drvdata(hmm_page);
+   pvt->skip_page_out = true;
+   }
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   pfn = gfn_to_pfn(kvm, gpa >> page_shift);
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   if (is_error_noslot_pfn(pfn))
+   return H_PARAMETER;
+
+   ret = uv_page_in(lpid, pfn << page_shift, gpa, 0, page_shift);
+   kvm_release_pfn_clean(pfn);
+
+   return (ret == U_SUCCESS) ? H_SUCCESS : H_PARAMETER;
+}
+
 /*
  * Move page from normal memory to secure memory.
  */
@@ -242,9 +282,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 
end = addr + (1UL << page_shift);
 
-   if (flags)
+   if (flags & ~H_PAGE_IN_SHARED)
return H_P2;
 
+   if (flags & H_PAGE_IN_SHARED)
+   return kvmppc_share_page(kvm, rmap, gpa, addr, page_shift);
+
args.rmap = rmap;
args.lpid = kvm->arch.lpid;
args.gpa = gpa;
@@ -291,8 +334,17 @@ kvmppc_hmm_fault_migrate_alloc_and_copy(struct vm_area_struct *vma,
   hmm_devmem_page_get_drvdata(spage);
 
pfn = page_to_pfn(dpage);
-   ret = uv_page_out(pvt->lpid, pfn << PAGE_SHIFT,
- pvt->gpa, 0, PAGE_SHIFT);
+
+   /*
+* This same alloc_and_copy() callback is used in two cases:
+* - When HV touches a secure page, for which we do page-out
+* - When a secure page is converted to shared page, we touch
+*   the page to essentially discard the HMM page. In this case we
+*   skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(pvt->lpid, pfn << PAGE_SHIFT,
+ pvt->gpa, 0, PAGE_SHIFT);
if (ret == U_SUCCESS)
*dst_pfn = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
 }
-- 
2.17.1



[RFC PATCH v3 1/4] kvmppc: HMM backend driver to manage pages of secure guest

2019-01-29 Thread Bharata B Rao
HMM driver for KVM PPC to manage page transitions of
secure guest via H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   4 +
 arch/powerpc/include/asm/kvm_book3s_hmm.h   |  33 ++
 arch/powerpc/include/asm/kvm_host.h |  14 +
 arch/powerpc/include/asm/ucall-api.h|  19 +
 arch/powerpc/include/uapi/asm/uapi_uvcall.h |   3 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  22 +
 arch/powerpc/kvm/book3s_hv_hmm.c| 474 
 8 files changed, 572 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_hmm.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_hmm.c

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 463c63a9fcf1..2f6b952deb0f 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -337,6 +337,10 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_hmm.h b/arch/powerpc/include/asm/kvm_book3s_hmm.h
new file mode 100644
index ..e61519c17485
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_hmm.h
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_KVM_UV
+extern unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+ unsigned int lpid,
+ unsigned long gra,
+ unsigned long flags,
+ unsigned long page_shift);
+extern unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+ unsigned int lpid,
+ unsigned long gra,
+ unsigned long flags,
+ unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned int lpid,
+unsigned long gra, unsigned long flags,
+unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned int lpid,
+ unsigned long gra, unsigned long flags,
+ unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_KVM_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 162005ae50e2..15ea03852bf1 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -846,4 +846,18 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#ifdef CONFIG_PPC_KVM_UV
+extern int kvmppc_hmm_init(void);
+extern void kvmppc_hmm_free(void);
+extern void kvmppc_hmm_release_pfns(struct kvm_memory_slot *free);
+#else
+static inline int kvmppc_hmm_init(void)
+{
+   return 0;
+}
+
+static inline void kvmppc_hmm_free(void) {}
+static inline void kvmppc_hmm_release_pfns(struct kvm_memory_slot *free) {}
+#endif /* CONFIG_PPC_KVM_UV */
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ucall-api.h b/arch/powerpc/include/asm/ucall-api.h
index 82d8edd1e409..6c3bddc97b55 100644
--- a/arch/powerpc/include/asm/ucall-api.h
+++ b/arch/powerpc/include/asm/ucall-api.h
@@ -9,6 +9,8 @@
 #include 
 #include 
 
+#define U_SUCCESS 0
+
 extern unsigned int smf_state;
 static inline bool smf_enabled(void)
 {
@@ -54,5 +56,22 @@ static inline int uv_restricted_spr_read(u64 reg, u64 *val)
return rc;
 }
 
+static inline int uv_page_in(u64 lpid, u64 src_ra, u64 dst_gpa, u64 flags,
+u64 page_shift)
+{
+   unsigned long retbuf[PLPAR_UCALL_BUFSIZE];
+
+   return plpar_ucall(UV_PAGE_IN, retbuf, lpid, src_ra, dst_gpa, flags,
+  page_shift);
+}
+
+static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 src_gpa, u64 flags,
+ u64 page_shift)
+{
+   unsigned long retbuf[PLPAR_UCALL_BUFSIZE];
+
+   return plpar_ucall(UV_PAGE_OUT, retbuf, lpid, dst_ra, src_gpa, flags,
+  page_shift);
+}
 #endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_UCALL_API_H */

[RFC PATCH v3 0/4] kvmppc: HMM backend driver to manage pages of secure guest

2019-01-29 Thread Bharata B Rao
Hi,

A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by
the hypervisor (HV) and secure memory managed by the Ultravisor (UV).

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created
via an HMM device. The movement of pages between normal and secure
memory is done by ->alloc_and_copy() callback routine of migrate_vma().
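
A sketch of the call shape this implies, pre-5.3 migrate_vma() API (the ops
names follow patch 1 of this series; the pfn arrays, locking, and error
handling are elided, and the alloc_and_copy name is assumed):

    static const struct migrate_vma_ops kvmppc_hmm_migrate_ops = {
            .alloc_and_copy   = kvmppc_hmm_migrate_alloc_and_copy,
            .finalize_and_map = kvmppc_hmm_migrate_finalize_and_map,
    };

    ret = migrate_vma(&kvmppc_hmm_migrate_ops, vma, addr, end,
                      &src_pfn, &dst_pfn, &args);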

The page-in or page-out requests from UV will come to HV as hcalls and
HV will call back into UV via uvcalls to satisfy these page requests.

These patches apply and work on the base Ultravisor patches posted by
Ram Pai at https://www.spinics.net/lists/kvm-ppc/msg14981.html

Changes in v3
=
- Rebased to latest kernel
- Rebased on top of Ram's base Ultravisor patches, so that all the
  dependencies are met.
- Get secure memory size from device tree.
- Fix a mm struct leak in page-in and page-out hcalls, thereby
  allowing LPID recycling (Thanks to Sukadev Bhattiprolu for pointing
  this out)

v2: https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-November/181669.html

Bharata B Rao (4):
  kvmppc: HMM backend driver to manage pages of secure guest
  kvmppc: Add support for shared pages in HMM driver
  kvmppc: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
  kvmppc: Handle memory plug/unplug to secure VM

 arch/powerpc/include/asm/hvcall.h   |   9 +
 arch/powerpc/include/asm/kvm_book3s_hmm.h   |  45 ++
 arch/powerpc/include/asm/kvm_host.h |  14 +
 arch/powerpc/include/asm/ucall-api.h|  35 ++
 arch/powerpc/include/uapi/asm/uapi_uvcall.h |   5 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  48 ++
 arch/powerpc/kvm/book3s_hv_hmm.c| 559 
 8 files changed, 718 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_hmm.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_hmm.c

-- 
2.17.1



Re: use generic DMA mapping code in powerpc V4

2019-01-29 Thread Christian Zigotzky
Hi Christoph,

Thanks a lot for the updates. I will test the full branch tomorrow.

Cheers,
Christian

Sent from my iPhone

> On 29. Jan 2019, at 17:34, Christoph Hellwig  wrote:
> 
>> On Tue, Jan 29, 2019 at 05:14:11PM +0100, Christoph Hellwig wrote:
>>> On Tue, Jan 29, 2019 at 04:03:32PM +0100, Christian Zigotzky wrote:
>>> Hi Christoph,
>>> 
>>> I compiled kernels for the X5000 and X1000 from your new branch 
>>> 'powerpc-dma.6-debug.2' today. The kernels boot and the P.A. Semi Ethernet 
>>> works!
>> 
>> Thanks for testing!  I'll prepare a new series that adds the other
>> patches on top of this one.
> 
> And that was easier than I thought - we just had a few patches left
> in powerpc-dma.6, so I've rebased that branch on top of
> powerpc-dma.6-debug.2:
> 
>git://git.infradead.org/users/hch/misc.git powerpc-dma.6
> 
> Gitweb:
> 
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6
> 
> I hope the other patches are simple enough, so just testing the full
> branch checkout should be fine for now.
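
For anyone else testing, grabbing the branch is just:

    git clone git://git.infradead.org/users/hch/misc.git
    cd misc && git checkout powerpc-dma.6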


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Michael S. Tsirkin
On Wed, Jan 30, 2019 at 11:05:42AM +0800, Jason Wang wrote:
> 
> > On 2019/1/30 10:36 AM, Michael S. Tsirkin wrote:
> > On Wed, Jan 30, 2019 at 10:24:01AM +0800, Jason Wang wrote:
> > > > On 2019/1/30 3:02 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
> > > > > Fixing address of powerpc mailing list.
> > > > > 
> > > > > Thiago Jung Bauermann  writes:
> > > > > 
> > > > > > Hello,
> > > > > > 
> > > > > > With Christoph's rework of the DMA API that recently landed, the 
> > > > > > patch
> > > > > > below is the only change needed in virtio to make it work in a POWER
> > > > > > secure guest under the ultravisor.
> > > > > > 
> > > > > > The other change we need (making sure the device's dma_map_ops is 
> > > > > > NULL
> > > > > > so that the dma-direct/swiotlb code is used) can be made in
> > > > > > powerpc-specific code.
> > > > > > 
> > > > > > Of course, I also have patches (soon to be posted as RFC) which 
> > > > > > hook up
> > > > > >  to the powerpc secure guest support code.
> > > > > > 
> > > > > > What do you think?
> > > > > > 
> > > > > >   From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
> > > > > > From: Thiago Jung Bauermann 
> > > > > > Date: Thu, 24 Jan 2019 22:08:02 -0200
> > > > > > Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
> > > > > > 
> > > > > > The host can't access the guest memory when it's encrypted, so using
> > > > > > regular memory pages for the ring isn't an option. Go through the 
> > > > > > DMA API.
> > > > > > 
> > > > > > Signed-off-by: Thiago Jung Bauermann 
> > > > Well I think this will come back to bite us (witness xen which is now
> > > > reworking precisely this path - but at least they aren't to blame, xen
> > > > came before ACCESS_PLATFORM).
> > > > 
> > > > I also still think the right thing would have been to set
> > > > ACCESS_PLATFORM for all systems where device can't access all memory.
> > > > 
> > > > But I also think I don't have the energy to argue about power secure
> > > > guest anymore.  So be it for power secure guest since the involved
> > > > engineers disagree with me.  Hey I've been wrong in the past ;).
> > > > 
> > > > But the name "sev_active" makes me scared because at least AMD guys who
> > > > were doing the sensible thing and setting ACCESS_PLATFORM (unless I'm
> > > > wrong? I remember distinctly that's so) will likely be affected too.
> > > > We don't want that.
> > > > 
> > > > So let's find a way to make sure it's just power secure guest for now
> > > > pls.
> > > > 
> > > > I also think we should add a dma_api near features under virtio_device
> > > > such that these hacks can move off data path.
> > > 
> > > Anyway the current Xen code is in conflict with the spec, which says:
> > > 
> > > "If this feature bit is set to 0, then the device has same access to 
> > > memory
> > > addresses supplied to it as the driver has. In particular, the device will
> > > always use physical addresses matching addresses used by the driver
> > > (typically meaning physical addresses used by the CPU) and not translated
> > > further, and can access any address supplied to it by the driver. When
> > > clear, this overrides any platform-specific description of whether device
> > > access is limited or translated in any way, e.g. whether an IOMMU may be
> > > present. "
> > > 
> > > I wonder how much value the above description can give us. It's kind 
> > > of
> > > odd that the behavior of "when the feature is not negotiated" is described
> > > in the spec.
> > Hmm what's odd about it? We need to describe the behaviour in all cases.
> 
> 
> Well, trying to limit the behavior of a 'legacy' driver is tricky or even
> impossible. Xen is an exact example of this.

So don't. Xen got grandfathered in because when that came
along we thought it was a one-off thing. It was easier to just
add that as a special case. But really >99% of people
have a hypervisor device with direct guest memory access.
All else is esoterica.

> 
> > 
> > > Personally I think we can remove the above and then we can
> > > switch to using the DMA API unconditionally in the guest driver. It may
> > > cause a single-digit regression, but we can try to overcome it.
> > > 
> > > Thanks
> > This has been discussed ad nauseam. virtio is all about compatibility.
> > Losing a couple of lines of code isn't worth breaking working setups.
> > People that want "just use DMA API no tricks" now have the option.
> > Setting a flag in a feature bit map is literally a single line
> > of code in the hypervisor. So stop pushing for breaking working
> > legacy setups and just fix it in the right place.
> 
> 
> I may be missing something: which kind of legacy setup is broken? Do you mean
> using virtio without IOMMU_PLATFORM on a platform with an IOMMU? We actually
> unbreak this setup.
> 
> Thanks


Legacy setups by definition are working setups.  The rules are pretty
simple. By default virtio 

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Jason Wang



On 2019/1/30 10:36 AM, Michael S. Tsirkin wrote:

On Wed, Jan 30, 2019 at 10:24:01AM +0800, Jason Wang wrote:

On 2019/1/30 3:02 AM, Michael S. Tsirkin wrote:

On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:

Fixing address of powerpc mailing list.

Thiago Jung Bauermann  writes:


Hello,

With Christoph's rework of the DMA API that recently landed, the patch
below is the only change needed in virtio to make it work in a POWER
secure guest under the ultravisor.

The other change we need (making sure the device's dma_map_ops is NULL
so that the dma-direct/swiotlb code is used) can be made in
powerpc-specific code.

Of course, I also have patches (soon to be posted as RFC) which hook up
 to the powerpc secure guest support code.

What do you think?

  From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann 
Date: Thu, 24 Jan 2019 22:08:02 -0200
Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

The host can't access the guest memory when it's encrypted, so using
regular memory pages for the ring isn't an option. Go through the DMA API.

Signed-off-by: Thiago Jung Bauermann 

Well I think this will come back to bite us (witness xen which is now
reworking precisely this path - but at least they aren't to blame, xen
came before ACCESS_PLATFORM).

I also still think the right thing would have been to set
ACCESS_PLATFORM for all systems where device can't access all memory.

But I also think I don't have the energy to argue about power secure
guest anymore.  So be it for power secure guest since the involved
engineers disagree with me.  Hey I've been wrong in the past ;).

But the name "sev_active" makes me scared because at least AMD guys who
were doing the sensible thing and setting ACCESS_PLATFORM (unless I'm
wrong? I remember distinctly that's so) will likely be affected too.
We don't want that.

So let's find a way to make sure it's just power secure guest for now
pls.

I also think we should add a dma_api near features under virtio_device
such that these hacks can move off data path.


Anyway the current Xen code is in conflict with the spec, which says:

"If this feature bit is set to 0, then the device has same access to memory
addresses supplied to it as the driver has. In particular, the device will
always use physical addresses matching addresses used by the driver
(typically meaning physical addresses used by the CPU) and not translated
further, and can access any address supplied to it by the driver. When
clear, this overrides any platform-specific description of whether device
access is limited or translated in any way, e.g. whether an IOMMU may be
present. "

I wonder how much value the above description can give us. It's kind of
odd that the behavior of "when the feature is not negotiated" is described
in the spec.

Hmm what's odd about it? We need to describe the behaviour in all cases.



Well, trying to limit the behavior of a 'legacy' driver is tricky or even 
impossible. Xen is an exact example of this.






Personally I think we can remove the above and then we can
switch to using the DMA API unconditionally in the guest driver. It may
cause a single-digit regression, but we can try to overcome it.

Thanks

This has been discussed ad nauseam. virtio is all about compatibility.
Losing a couple of lines of code isn't worth breaking working setups.
People that want "just use DMA API no tricks" now have the option.
Setting a flag in a feature bit map is literally a single line
of code in the hypervisor. So stop pushing for breaking working
legacy setups and just fix it in the right place.



I may be missing something: which kind of legacy setup is broken? Do you mean 
using virtio without IOMMU_PLATFORM on a platform with an IOMMU? We actually 
unbreak this setup.


Thanks





By the way could you please respond about virtio-iommu and
why there's no support for ACCESS_PLATFORM on power?

I have Cc'd you on these discussions.


Thanks!



---
   drivers/virtio/virtio_ring.c | 5 -
   1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..321a27075380 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
 * not work without an even larger kludge.  Instead, enable
 * the DMA API if we're a Xen guest, which at least allows
 * all of the sensible Xen configurations to work correctly.
+*
+* Also, if guest memory is encrypted the host can't access
+* it directly. In this case, we'll need to use the DMA API.
 */
-   if (xen_domain())
+   if (xen_domain() || sev_active())
return true;

return false;

--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH] powerpc/cell: Remove duplicate header

2019-01-29 Thread Sabyasachi Gupta
On Thu, Jan 17, 2019 at 11:13 PM Souptick Joarder  wrote:
>
> On Thu, Jan 17, 2019 at 9:49 PM Sabyasachi Gupta
>  wrote:
> >
> > Remove linux/syscalls.h which is included more than once
> >
> > Signed-off-by: Sabyasachi Gupta 
>
> Acked-by: Souptick Joarder 
>
If there are no further comments, can we get this patch in the queue for 5.1?

> > ---
> >  arch/powerpc/platforms/cell/spu_syscalls.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/arch/powerpc/platforms/cell/spu_syscalls.c 
> > b/arch/powerpc/platforms/cell/spu_syscalls.c
> > index 263413a..b95d6af 100644
> > --- a/arch/powerpc/platforms/cell/spu_syscalls.c
> > +++ b/arch/powerpc/platforms/cell/spu_syscalls.c
> > @@ -26,7 +26,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >
> >  #include 
> >
> > --
> > 2.7.4
> >


Re: [PATCH] powerpc/powernv: Remove duplicate header

2019-01-29 Thread Sabyasachi Gupta
On Fri, Jan 18, 2019 at 3:34 PM Souptick Joarder  wrote:
>
> On Thu, Jan 17, 2019 at 9:40 PM Sabyasachi Gupta
>  wrote:
> >
> > Remove linux/printk.h which is included more than once.
> >
> > Signed-off-by: Sabyasachi Gupta 
>
> Acked-by: Souptick Joarder 
>
If there are no further comments, can we get this patch in the queue for 5.1?
> > ---
> >  arch/powerpc/platforms/powernv/opal.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/opal.c 
> > b/arch/powerpc/platforms/powernv/opal.c
> > index beed86f..802de0d 100644
> > --- a/arch/powerpc/platforms/powernv/opal.c
> > +++ b/arch/powerpc/platforms/powernv/opal.c
> > @@ -26,7 +26,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  #include 
> >  #include 
> >  #include 
> > --
> > 2.7.4
> >


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Michael S. Tsirkin
On Wed, Jan 30, 2019 at 10:24:01AM +0800, Jason Wang wrote:
> 
> On 2019/1/30 3:02 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
> > > Fixing address of powerpc mailing list.
> > > 
> > > Thiago Jung Bauermann  writes:
> > > 
> > > > Hello,
> > > > 
> > > > With Christoph's rework of the DMA API that recently landed, the patch
> > > > below is the only change needed in virtio to make it work in a POWER
> > > > secure guest under the ultravisor.
> > > > 
> > > > The other change we need (making sure the device's dma_map_ops is NULL
> > > > so that the dma-direct/swiotlb code is used) can be made in
> > > > powerpc-specific code.
> > > > 
> > > > Of course, I also have patches (soon to be posted as RFC) which hook up
> > > >  to the powerpc secure guest support code.
> > > > 
> > > > What do you think?
> > > > 
> > > >  From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
> > > > From: Thiago Jung Bauermann 
> > > > Date: Thu, 24 Jan 2019 22:08:02 -0200
> > > > Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
> > > > 
> > > > The host can't access the guest memory when it's encrypted, so using
> > > > regular memory pages for the ring isn't an option. Go through the DMA 
> > > > API.
> > > > 
> > > > Signed-off-by: Thiago Jung Bauermann 
> > Well I think this will come back to bite us (witness xen which is now
> > reworking precisely this path - but at least they aren't to blame, xen
> > came before ACCESS_PLATFORM).
> > 
> > I also still think the right thing would have been to set
> > ACCESS_PLATFORM for all systems where device can't access all memory.
> > 
> > But I also think I don't have the energy to argue about power secure
> > guest anymore.  So be it for power secure guest since the involved
> > engineers disagree with me.  Hey I've been wrong in the past ;).
> > 
> > But the name "sev_active" makes me scared because at least AMD guys who
> > were doing the sensible thing and setting ACCESS_PLATFORM (unless I'm
> > wrong? I remember distinctly that's so) will likely be affected too.
> > We don't want that.
> > 
> > So let's find a way to make sure it's just power secure guest for now
> > pls.
> > 
> > I also think we should add a dma_api near features under virtio_device
> > such that these hacks can move off data path.
> 
> 
> Anyway the current Xen code is in conflict with the spec, which says:
> 
> "If this feature bit is set to 0, then the device has same access to memory
> addresses supplied to it as the driver has. In particular, the device will
> always use physical addresses matching addresses used by the driver
> (typically meaning physical addresses used by the CPU) and not translated
> further, and can access any address supplied to it by the driver. When
> clear, this overrides any platform-specific description of whether device
> access is limited or translated in any way, e.g. whether an IOMMU may be
> present. "
> 
> I wonder how much value the above description can give us. It's kind of
> odd that the behavior of "when the feature is not negotiated" is described
> in the spec.

Hmm what's odd about it? We need to describe the behaviour in all cases.

> Personally I think we can remove the above and then we can
> switch to using the DMA API unconditionally in the guest driver. It may
> cause a single-digit regression, but we can try to overcome it.
> 
> Thanks

This has been discussed ad nauseam. virtio is all about compatibility.
Losing a couple of lines of code isn't worth breaking working setups.
People that want "just use DMA API no tricks" now have the option.
Setting a flag in a feature bit map is literally a single line
of code in the hypervisor. So stop pushing for breaking working
legacy setups and just fix it in the right place.

> 
> > 
> > By the way could you please respond about virtio-iommu and
> > why there's no support for ACCESS_PLATFORM on power?
> > 
> > I have Cc'd you on these discussions.
> > 
> > 
> > Thanks!
> > 
> > 
> > > > ---
> > > >   drivers/virtio/virtio_ring.c | 5 -
> > > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index cd7e755484e3..321a27075380 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
> > > >  * not work without an even larger kludge.  Instead, enable
> > > >  * the DMA API if we're a Xen guest, which at least allows
> > > >  * all of the sensible Xen configurations to work correctly.
> > > > +*
> > > > +* Also, if guest memory is encrypted the host can't access
> > > > +* it directly. In this case, we'll need to use the DMA API.
> > > >  */
> > > > -   if (xen_domain())
> > > > +   if (xen_domain() || sev_active())
> > > >   

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Jason Wang



On 2019/1/30 3:02 AM, Michael S. Tsirkin wrote:

On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:

Fixing address of powerpc mailing list.

Thiago Jung Bauermann  writes:


Hello,

With Christoph's rework of the DMA API that recently landed, the patch
below is the only change needed in virtio to make it work in a POWER
secure guest under the ultravisor.

The other change we need (making sure the device's dma_map_ops is NULL
so that the dma-direct/swiotlb code is used) can be made in
powerpc-specific code.

Of course, I also have patches (soon to be posted as RFC) which hook up
 to the powerpc secure guest support code.

What do you think?

 From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann 
Date: Thu, 24 Jan 2019 22:08:02 -0200
Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

The host can't access the guest memory when it's encrypted, so using
regular memory pages for the ring isn't an option. Go through the DMA API.

Signed-off-by: Thiago Jung Bauermann 

Well I think this will come back to bite us (witness xen which is now
reworking precisely this path - but at least they aren't to blame, xen
came before ACCESS_PLATFORM).

I also still think the right thing would have been to set
ACCESS_PLATFORM for all systems where device can't access all memory.

But I also think I don't have the energy to argue about power secure
guest anymore.  So be it for power secure guest since the involved
engineers disagree with me.  Hey I've been wrong in the past ;).

But the name "sev_active" makes me scared because at least AMD guys who
were doing the sensible thing and setting ACCESS_PLATFORM (unless I'm
wrong? I remember distinctly that's so) will likely be affected too.
We don't want that.

So let's find a way to make sure it's just power secure guest for now
pls.

I also think we should add a dma_api near features under virtio_device
such that these hacks can move off data path.



Anyway the current Xen code is in conflict with the spec, which says:

"If this feature bit is set to 0, then the device has same access to 
memory addresses supplied to it as the driver has. In particular, the 
device will always use physical addresses matching addresses used by the 
driver (typically meaning physical addresses used by the CPU) and not 
translated further, and can access any address supplied to it by the 
driver. When clear, this overrides any platform-specific description of 
whether device access is limited or translated in any way, e.g. whether 
an IOMMU may be present. "


I wonder how much value the above description can give us. It's 
kind of odd that the behavior of "when the feature is not negotiated" is 
described in the spec. Personally I think we can remove the above and 
then we can switch to using the DMA API unconditionally in the guest 
driver. It may cause a single-digit regression, but we can try to overcome it.


Thanks




By the way could you please respond about virtio-iommu and
why there's no support for ACCESS_PLATFORM on power?

I have Cc'd you on these discussions.


Thanks!



---
  drivers/virtio/virtio_ring.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..321a27075380 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
 * not work without an even larger kludge.  Instead, enable
 * the DMA API if we're a Xen guest, which at least allows
 * all of the sensible Xen configurations to work correctly.
+*
+* Also, if guest memory is encrypted the host can't access
+* it directly. In this case, we'll need to use the DMA API.
 */
-   if (xen_domain())
+   if (xen_domain() || sev_active())
return true;

return false;


--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH 0/4] powerpc/livepatch: reliable stack unwinder fixes

2019-01-29 Thread Jiri Kosina
On Tue, 22 Jan 2019, Joe Lawrence wrote:

> This patchset fixes a false negative report (i.e., unreliable) from the
> ppc64 reliable stack unwinder, discussed here [1], when it may
> inadvertently trip over a stale exception marker left on the stack.
> 
> The first two patches fix this bug.  Nicolai's change clears the marker
> from the stack when an exception is finished.  The next patch modifies
> the unwinder to only look for such on-stack elements when the ABI
> guarantees that they will actually be initialized.
> 
> The final two patches consist of code cleanups that Nicolai and I
> spotted during the development of the fixes.
> 
> Testing included re-running the original test scenario (loading a
> livepatch module on ppc64le) on a 5.0.0-rc2 kernel as well as a RHEL-7
> backport.  I ran internal tests on the RHEL-7 backport and no new test
> failures were introduced.  I believe that Nicolai has done the same
> with respect to the first patch.
> 
> [1] 
> https://lore.kernel.org/lkml/7f468285-b149-37e2-e782-c9e538b99...@redhat.com/
> 
> Joe Lawrence (3):
>   powerpc/livepatch: relax reliable stack tracer checks for first-frame
>   powerpc/livepatch: small cleanups in save_stack_trace_tsk_reliable()
>   powerpc/livepatch: return -ERRNO values in
> save_stack_trace_tsk_reliable()
> 
> Nicolai Stange (1):
>   powerpc/64s: Clear on-stack exception marker upon exception return

Michael, are you fine with this going through LP tree, or do you plan to 
take it through yours?

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH] ucc_geth: Reset BQL queue when stopping device

2019-01-29 Thread David Miller
From: Mathias Thore 
Date: Tue, 29 Jan 2019 08:07:54 +

> Is there a scenario where we are clearing the TX ring but don't want to reset 
> the BQL TX queue?
> 
> I think it makes sense to keep it in ucc_geth_free_tx since the
> reason it is needed isn't the timeout per se, but rather the
> clearing of the TX ring. This way, it will be performed no matter
> why the driver ends up calling this function.

I absolutely think the BQL reset should remain in ucc_geth_free_tx().

That way if another callsite for ucc_geth_free_tx() emerges that does
need the BQL queue reset, it won't be "forgotten".
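
For readers following along, here is a minimal sketch of the BQL bookkeeping
contract behind this thread. The netdev_sent_queue()/netdev_completed_queue()/
netdev_reset_queue() helpers are the real linux/netdevice.h API; the foo_*
driver functions are illustrative only:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/* TX path: account the bytes handed to the hardware ring. */
	static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
					  struct net_device *dev)
	{
		/* ... queue skb on the hardware TX ring ... */
		netdev_sent_queue(dev, skb->len);
		return NETDEV_TX_OK;
	}

	/* TX completion: account the bytes the hardware finished sending. */
	static void foo_tx_complete(struct net_device *dev,
				    unsigned int pkts, unsigned int bytes)
	{
		netdev_completed_queue(dev, pkts, bytes);
	}

	/*
	 * Ring teardown: descriptors are dropped without ever completing,
	 * so the sent/completed byte counters no longer balance.  Without
	 * this reset, BQL keeps the queue throttled indefinitely -- the
	 * stall described in this thread.
	 */
	static void foo_free_tx_ring(struct net_device *dev)
	{
		netdev_reset_queue(dev);
	}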


Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

2019-01-29 Thread Andrew Morton
On Mon, 14 Jan 2019 15:24:32 +0530 "Aneesh Kumar K.V" 
 wrote:

> ppc64 uses the CMA area for the allocation of the guest page table (hash
> page table). We won't be able to start a guest if we fail to allocate the
> hash page table. We have observed hash table allocation failures because
> we failed to migrate pages out of the CMA region because they were pinned.
> This happens when we are using VFIO. VFIO on ppc64 pins the entire guest
> RAM. If the guest RAM pages get allocated out of the CMA region, we won't
> be able to migrate those pages. The pages are also pinned for the lifetime
> of the guest.
> 
> Currently we support migration of non-compound pages. With THP and with
> the addition of hugetlb migration we can end up allocating compound pages
> from the CMA region. This patch series adds support for migrating compound
> pages.

Very little review activity is in evidence.  Please identify some
appropriate reviewers and ask them to take a look?



Re: [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc

2019-01-29 Thread Andrew Morton
On Mon, 14 Jan 2019 15:24:33 +0530 "Aneesh Kumar K.V" 
 wrote:

> This patch adds PF_MEMALLOC_NOCMA, which makes sure any allocation in that
> context is marked non-movable and hence cannot be satisfied from the CMA
> region.
> 
> This is useful with get_user_pages_longterm, where we want to take a page
> pin by migrating pages out of the CMA region. Marking the section with
> PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later.
> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
>  #define PF_RANDOMIZE 0x0040  /* Randomize virtual address 
> space */
>  #define PF_SWAPWRITE 0x0080  /* Allowed to write to swap */
>  #define PF_MEMSTALL  0x0100  /* Stalled due to lack of 
> memory */
> +#define PF_MEMALLOC_NOCMA0x0200  /* All allocation request will 
> have _GFP_MOVABLE cleared */
>  #define PF_NO_SETAFFINITY0x0400  /* Userland is not allowed to 
> meddle with cpus_allowed */
>  #define PF_MCE_EARLY 0x0800  /* Early kill for mce process 
> policy */
>  #define PF_MUTEX_TESTER  0x2000  /* Thread belongs to 
> the rt mutex tester */

This flag has been taken by PF_UMH so I moved it to 0x1000.

And we have now run out of PF_ flags.


Re: [PATCH v1] PCI/AER: use match_string() helper to simplify the code

2019-01-29 Thread Bjorn Helgaas
On Mon, Jan 28, 2019 at 01:57:28PM +0200, Andy Shevchenko wrote:
> match_string() returns the index of the array entry that matches a given
> string, and can be used instead of an open-coded implementation.
> 
> Signed-off-by: Andy Shevchenko 

Applied to pci/aer for v5.1, thanks!

> ---
>  drivers/pci/pcie/aer.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fed29de783e0..f8fc2114ad39 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -117,7 +117,7 @@ bool pci_aer_available(void)
>  
>  static int ecrc_policy = ECRC_POLICY_DEFAULT;
>  
> -static const char *ecrc_policy_str[] = {
> +static const char * const ecrc_policy_str[] = {
>   [ECRC_POLICY_DEFAULT] = "bios",
>   [ECRC_POLICY_OFF] = "off",
>   [ECRC_POLICY_ON] = "on"
> @@ -203,11 +203,8 @@ void pcie_ecrc_get_policy(char *str)
>  {
>   int i;
>  
> - for (i = 0; i < ARRAY_SIZE(ecrc_policy_str); i++)
> - if (!strncmp(str, ecrc_policy_str[i],
> -  strlen(ecrc_policy_str[i])))
> - break;
> - if (i >= ARRAY_SIZE(ecrc_policy_str))
> + i = match_string(ecrc_policy_str, ARRAY_SIZE(ecrc_policy_str), str);
> + if (i < 0)
>   return;
>  
>   ecrc_policy = i;
> -- 
> 2.20.1
> 
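
As background, a minimal sketch of the match_string() contract the patch
relies on; match_string() is the real helper from linux/string.h, while the
table and function here are illustrative only:

	#include <linux/errno.h>
	#include <linux/kernel.h>
	#include <linux/string.h>

	static const char * const modes[] = { "bios", "off", "on" };

	static int parse_mode(const char *str)
	{
		/* Returns the index of the first exact match, or -EINVAL. */
		int i = match_string(modes, ARRAY_SIZE(modes), str);

		if (i < 0)
			return -EINVAL;	/* unknown policy string */
		return i;		/* index doubles as the policy value */
	}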


Re: [PATCH 0/4] powerpc/livepatch: reliable stack unwinder fixes

2019-01-29 Thread Josh Poimboeuf
On Tue, Jan 22, 2019 at 10:57:20AM -0500, Joe Lawrence wrote:
> This patchset fixes a false negative report (ie, unreliable) from the
> ppc64 reliable stack unwinder, discussed here [1] when it may
> inadvertently trip over a stale exception marker left on the stack.
> 
> The first two patches fix this bug.  Nicolai's change clears the marker
> from the stack when an exception is finished.  The next patch modifies
> the unwinder to only look for such on-stack elements when the ABI
> guarantees that they will actually be initialized.
> 
> The final two patches consist of code cleanups that Nicolai and I
> spotted during the development of the fixes.
> 
> Testing included re-running the original test scenario (loading a
> livepatch module on ppc64le) on a 5.0.0-rc2 kernel as well as a RHEL-7
> backport.  I ran internal tests on the RHEL-7 backport and no new test
> failures were introduced.  I believe that Nicolai has done the same
> with respect to the first patch.
> 
> [1] 
> https://lore.kernel.org/lkml/7f468285-b149-37e2-e782-c9e538b99...@redhat.com/
> 
> Joe Lawrence (3):
>   powerpc/livepatch: relax reliable stack tracer checks for first-frame
>   powerpc/livepatch: small cleanups in save_stack_trace_tsk_reliable()
>   powerpc/livepatch: return -ERRNO values in
> save_stack_trace_tsk_reliable()
> 
> Nicolai Stange (1):
>   powerpc/64s: Clear on-stack exception marker upon exception return
> 
>  arch/powerpc/Kconfig |  2 +-
>  arch/powerpc/kernel/entry_64.S   |  7 +++
>  arch/powerpc/kernel/stacktrace.c | 74 +---
>  3 files changed, 47 insertions(+), 36 deletions(-)

Reviewed-by: Josh Poimboeuf 

-- 
Josh


Re: [PATCH] ucc_geth: Reset BQL queue when stopping device

2019-01-29 Thread Li Yang
On Tue, Jan 29, 2019 at 2:09 AM Mathias Thore
 wrote:
>
> Is there a scenario where we are clearing the TX ring but don't want to reset 
> the BQL TX queue?

Right now the function is also used on interface open/close, driver
removal and driver resumption, besides the timeout situation. I think
resetting the BQL queue is either not necessary or already done
explicitly in those other scenarios.

>
> I think it makes sense to keep it in ucc_geth_free_tx since the reason it is 
> needed isn't the timeout per se, but rather the clearing of the TX ring. This 
> way, it will be performed no matter why the driver ends up calling this 
> function.

I don't see a consensus on when netdev_reset_queue() should be called
among existing drivers.  Doing it on buffer ring cleanup is probably
good future-proofing.  But if we want to use this approach, we can
remove the redundant netdev_reset_queue() calls in the open/close
functions.

Regards,
Leo

>
> -Original Message-
> From: Li Yang [mailto:leoyang...@nxp.com]
> Sent: Monday, 28 January 2019 22:37
> To: Mathias Thore 
> Cc: Christophe Leroy ; net...@vger.kernel.org; 
> linuxppc-dev@lists.ozlabs.org; David Gounaris ; 
> Joakim Tjernlund 
> Subject: Re: [PATCH] ucc_geth: Reset BQL queue when stopping device
>
> On Mon, Jan 28, 2019 at 8:37 AM Mathias Thore  
> wrote:
> >
> > Hi,
> >
> >
> > This is what we observed: there was a storm on the medium so that our
> > controller could not do its TX, resulting in a timeout. When a timeout
> > occurs, the driver clears all descriptors from the TX queue. The function
> > called in this patch is used to reflect this clearing in the BQL layer as
> > well. Without it, the controller would get stuck, unable to perform TX,
> > even several minutes after the storm had ended. Bringing the device down
> > and then up again would solve the problem, but this patch also solves it
> > automatically.
>
> The explanation makes sense.  So this should only be required in the timeout
> scenario, not in other cleanup scenarios like device shutdown?  If so, it
> would probably be better to do it in ucc_geth_timeout_work()?
>
> >
> >
> > Some other drivers do the same, for example e1000e driver calls 
> > netdev_reset_queue in its e1000_clean_tx_ring function. It is possible that 
> > other drivers should do the same; I have no way of verifying this.
> >
> >
> > Regards,
> >
> > Mathias
> >
> > --
> >
> >
> > From: Christophe Leroy 
> > Sent: Monday, January 28, 2019 10:48 AM
> > To: Mathias Thore; leoyang...@nxp.com; net...@vger.kernel.org;
> > linuxppc-dev@lists.ozlabs.org; David Gounaris; Joakim Tjernlund
> > Subject: Re: [PATCH] ucc_geth: Reset BQL queue when stopping device
> >
> >
> > Hi,
> >
> > Le 28/01/2019 à 10:07, Mathias Thore a écrit :
> > > After a timeout event caused by for example a broadcast storm, when
> > > the MAC and PHY are reset, the BQL TX queue needs to be reset as
> > > well. Otherwise, the device will exhibit severe performance issues
> > > even after the storm has ended.
> >
> > What are the symptoms?
> >
> > Is this reset needed in any network driver in that case, or is it
> > something particular to the ucc_geth?
> > For instance, the freescale fs_enet doesn't have that reset. Should it
> > have it too?
> >
> > Christophe
> >
> > >
> > > Co-authored-by: David Gounaris 
> > > Signed-off-by: Mathias Thore 
> > > ---
> > >   drivers/net/ethernet/freescale/ucc_geth.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c
> > > b/drivers/net/ethernet/freescale/ucc_geth.c
> > > index c3d539e209ed..eb3e65e8868f 100644
> > > --- a/drivers/net/ethernet/freescale/ucc_geth.c
> > > +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> > > @@ -1879,6 +1879,8 @@ static void ucc_geth_free_tx(struct 
> > > ucc_geth_private *ugeth)
> > >   u16 i, j;
> > >   u8 __iomem *bd;
> > >
> > > + netdev_reset_queue(ugeth->ndev);
> > > +
> > >   ug_info = ugeth->ug_info;
> > >   uf_info = &ug_info->uf_info;
> > >
> > >
> >


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Michael S. Tsirkin
On Tue, Jan 29, 2019 at 03:42:44PM -0200, Thiago Jung Bauermann wrote:
> 
> Fixing address of powerpc mailing list.
> 
> Thiago Jung Bauermann  writes:
> 
> > Hello,
> >
> > With Christoph's rework of the DMA API that recently landed, the patch
> > below is the only change needed in virtio to make it work in a POWER
> > secure guest under the ultravisor.
> >
> > The other change we need (making sure the device's dma_map_ops is NULL
> > so that the dma-direct/swiotlb code is used) can be made in
> > powerpc-specific code.
> >
> > Of course, I also have patches (soon to be posted as RFC) which hook up
> >  to the powerpc secure guest support code.
> >
> > What do you think?
> >
> > From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
> > From: Thiago Jung Bauermann 
> > Date: Thu, 24 Jan 2019 22:08:02 -0200
> > Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
> >
> > The host can't access the guest memory when it's encrypted, so using
> > regular memory pages for the ring isn't an option. Go through the DMA API.
> >
> > Signed-off-by: Thiago Jung Bauermann 

Well I think this will come back to bite us (witness xen which is now
reworking precisely this path - but at least they aren't to blame, xen
came before ACCESS_PLATFORM).

I also still think the right thing would have been to set
ACCESS_PLATFORM for all systems where the device can't access all memory.

But I also think I don't have the energy to argue about power secure
guest anymore.  So be it for power secure guest since the involved
engineers disagree with me.  Hey I've been wrong in the past ;).

But the name "sev_active" makes me scared because at least AMD guys who
were doing the sensible thing and setting ACCESS_PLATFORM (unless I'm
wrong? I remember distinctly that's so) will likely be affected too.
We don't want that.

So let's find a way to make sure it's just power secure guest for now
pls.

I also think we should add a dma_api flag near features under virtio_device
such that these hacks can move off the data path.

By the way could you please respond about virtio-iommu and
why there's no support for ACCESS_PLATFORM on power?

I have Cc'd you on these discussions.


Thanks!


> > ---
> >  drivers/virtio/virtio_ring.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index cd7e755484e3..321a27075380 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device 
> > *vdev)
> >  * not work without an even larger kludge.  Instead, enable
> >  * the DMA API if we're a Xen guest, which at least allows
> >  * all of the sensible Xen configurations to work correctly.
> > +*
> > +* Also, if guest memory is encrypted the host can't access
> > +* it directly. In this case, we'll need to use the DMA API.
> >  */
> > -   if (xen_domain())
> > +   if (xen_domain() || sev_active())
> > return true;
> >
> > return false;
> 
> 
> -- 
> Thiago Jung Bauermann
> IBM Linux Technology Center


Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-01-29 Thread Michael S. Tsirkin
On Mon, Jan 21, 2019 at 11:29:05AM +, Jean-Philippe Brucker wrote:
> Hi,
> 
> On 18/01/2019 15:51, Michael S. Tsirkin wrote:
> > 
> > On Tue, Jan 15, 2019 at 12:19:52PM +, Jean-Philippe Brucker wrote:
> >> Implement the virtio-iommu driver, following specification v0.9 [1].
> >>
> >> This is a simple rebase onto Linux v5.0-rc2. We now use the
> >> dev_iommu_fwspec_get() helper introduced in v5.0 instead of accessing
> >> dev->iommu_fwspec, but there aren't any functional change from v6 [2].
> >>
> >> Our current goal for virtio-iommu is to get a paravirtual IOMMU working
> >> on Arm, and enable device assignment to guest userspace. In this
> >> use-case the mappings are static, and don't require optimal performance,
> >> so this series tries to keep things simple. However there is plenty more
> >> to do for features and optimizations, and having this base in v5.1 would
> >> be good. Given that most of the changes are to drivers/iommu, I believe
> >> the driver and future changes should go via the IOMMU tree.
> >>
> >> You can find Linux driver and kvmtool device on v0.9.2 branches [3],
> >> module and x86 support on virtio-iommu/devel. Also tested with Eric's
> >> QEMU device [4]. Please note that the series depends on Robin's
> >> probe-deferral fix [5], which will hopefully land in v5.0.
> >>
> >> [1] Virtio-iommu specification v0.9, sources and pdf
> >> git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
> >> http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
> >>
> >> [2] [PATCH v6 0/7] Add virtio-iommu driver
> >> 
> >> https://lists.linuxfoundation.org/pipermail/iommu/2018-December/032127.html
> >>
> >> [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.2
> >> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9.2
> >>
> >> [4] [RFC v9 00/17] VIRTIO-IOMMU device
> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg575578.html
> >>
> >> [5] [PATCH] iommu/of: Fix probe-deferral
> >> https://www.spinics.net/lists/arm-kernel/msg698371.html
> > 
> > Thanks for the work!
> > So really my only issue with this is that there's no
> > way for the IOMMU to describe the devices that it
> > covers.
> > 
> > As a result that is then done in a platform-specific way.
> > 
> > And this means that, for example, it does not solve the problem some
> > power people have: their platform simply does not have a way to specify
> > which devices are covered by the IOMMU.
> 
> Isn't power using device tree? I haven't looked much at power because I
> was told a while ago that they already paravirtualize their IOMMU and
> don't need virtio-iommu, except perhaps for some legacy platforms. Or
> something along those lines. But I would certainly be interested in
> enabling the IOMMU for more architectures.

I have CC'd the relevant ppc developers, let's see what do they think.


> As for the enumeration problem, I still don't think we can get much
> better than DT and ACPI as solutions (and IMO they are necessary to make
> this device portable). But I believe that getting DT and ACPI support is
> just a one-off inconvenience. That is, once the required bindings are
> accepted, any future extension can then be done at the virtio level with
> feature bits and probe requests, without having to update ACPI or DT.
> 
> Thanks,
> Jean
> 
> > Solving that problem would make me much more excited about
> > this device.
> > 
> > On the other hand I can see that while there have been some
> > developments most of the code has been stable for quite a while now.
> > 
> > So what I am trying to do right about now is make a small module that
> > loads early and pokes at the IOMMU sufficiently to get the data about
> > which devices use the IOMMU out of it, using standard virtio config
> > space.  IIUC it's claimed to be impossible without messy changes to the
> > boot sequence.
> > 
> > If I succeed at least on some platforms I'll ask that this design is
> > worked into this device, minimizing info that goes through DT/ACPI.  If
> > I see I can't make it in time to meet the next merge window, I plan
> > merging the existing patches using DT (barring surprises).
> > 
> > As I only have a very small amount of time to spend on this attempt, If
> > someone else wants to try doing that in parallel, that would be great!
> > 
> > 
> >> Jean-Philippe Brucker (7):
> >>   dt-bindings: virtio-mmio: Add IOMMU description
> >>   dt-bindings: virtio: Add virtio-pci-iommu node
> >>   of: Allow the iommu-map property to omit untranslated devices
> >>   PCI: OF: Initialize dev->fwnode appropriately
> >>   iommu: Add virtio-iommu driver
> >>   iommu/virtio: Add probe request
> >>   iommu/virtio: Add event queue
> >>
> >>  .../devicetree/bindings/virtio/iommu.txt  |   66 +
> >>  .../devicetree/bindings/virtio/mmio.txt   |   30 +
> >>  MAINTAINERS   |7 +
> >>  drivers/iommu/Kconfig |   11 +
> >>  drivers/iommu/Makefile   

Re: [PATCH V5 0/5] NestMMU pte upgrade workaround for mprotect

2019-01-29 Thread Andrew Morton
On Wed, 16 Jan 2019 14:20:30 +0530 "Aneesh Kumar K.V" 
 wrote:

> We can upgrade pte access (R -> RW transition) via mprotect. We need
> to make sure we follow the recommended pte update sequence as outlined in
> commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle 
> nest MMU hang")
> for such updates. This patch series do that.

This series is at v5 and has no recorded review activity.  Perhaps you
could hunt down some people to help out here?



Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-29 Thread Cédric Le Goater
On 1/29/19 5:12 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Actually, now that I come to think of it, there could be any number of
> VMAs (well, up to almost 64k of them), since once you have a file
> descriptor you can call mmap on it multiple times.
> 
> The more I think about it, the more I think that getting userspace to
> manage the mappings with mmap() and munmap() is the right idea if it
> can be made to work.

We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
call "ibm,change-msi".  I think that's the right spot for it. 

Thanks,

C.


Re: [PATCH] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-29 Thread Gustavo A. R. Silva
Hi Breno,

On 1/29/19 10:38 AM, Breno Leitao wrote:
> Hi Gustavo,
> 
> On 1/24/19 3:25 PM, Gustavo A. R. Silva wrote:
>>
>>
>> On 1/24/19 8:01 AM, Breno Leitao wrote:
>>> 'regno' is directly controlled by user space, hence leading to a potential
>>> exploitation of the Spectre variant 1 vulnerability.
>>>
>>> On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
>>> register number that would be read or written. This register number is
>>> called 'regno' which is part of the 'addr' syscall parameter.
>>>
>>> This 'regno' value is checked against the maximum pt_regs structure size,
>>> and then used to dereference it, which matches the initial part of a
>>> Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
>>> is returned to userspace in the GETREGS case.
>>>
>>
>> Was this reported by any tool?
> 
> It was not. I was writing a userspace tool, which required me to read the
> ptrace subsystem carefully, and that is how I found this case.
> 

Great. Good catch.

>> If so, it might be worth mentioning it.
>>
>>> This patch sanitizes 'regno' before using it to dereference pt_reg.
>>>
>>> Notice that given that speculation windows are large, the policy is
>>> to kill the speculation on the first load and not worry if it can be
>>> completed with a dependent load/store [1].
>>>
>>> [1] https://marc.info/?l=linux-kernel=152449131114778=2
>>>
>>> Signed-off-by: Breno Leitao 
>>> ---
>>>  arch/powerpc/kernel/ptrace.c | 5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
>>> index cdd5d1d3ae41..3eac38a29863 100644
>>> --- a/arch/powerpc/kernel/ptrace.c
>>> +++ b/arch/powerpc/kernel/ptrace.c
>>> @@ -33,6 +33,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -298,6 +299,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
>>> unsigned long *data)
>>>  #endif
>>>  
>>> if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
>>
>> I would use a variable to store sizeof(struct user_pt_regs) / 
>> sizeof(unsigned long).
> 
> Right.
> 
>>
>>> +   regno = array_index_nospec(regno,
>>> +   (sizeof(struct user_pt_regs) /
>>> +sizeof(unsigned long)));
>>
>> See the rest of my comments below.
>>
>>> *data = ((unsigned long *)task->thread.regs)[regno];
>>> return 0;
>>> }
>>> @@ -321,6 +325,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
>>> unsigned long data)
>>> return set_user_dscr(task, data);
>>>  
>>> if (regno <= PT_MAX_PUT_REG) {
>>> +   regno = array_index_nospec(regno, PT_MAX_PUT_REG);
>>
>> This is wrong.  array_index_nospec() will return PT_MAX_PUT_REG - 1 in case 
>> regno is equal to
>> PT_MAX_PUT_REG, and this is not what you want.
> 
> Right, this is really wrong. It would only be correct if the comparison were
> regno < PT_MAX_PUT_REG. Since PT_MAX_PUT_REG itself is a valid array index,
> I need to take care of that case as well, doing something like:
> 
>   regno = array_index_nospec(regno, PT_MAX_PUT_REG + 1);
> 

This is the right way.

> Other than that, I think that for the regno = PT_MAX_PUT_REG case,
> array_index_nospec() will redefine regno to 0, not PT_MAX_PUT_REG - 1.
> 
>> Similar reasoning applies to the case above.
> 
> I understand that the GETREGS case above does not have the same problem:
> its bounds check is strict (<), so the bound passed to array_index_nospec()
> is already one past the last valid index. Does that make sense?
> 

Yeah. I was wrong about that.  If regno is out of bounds, array_index_nospec()
will return 0.

> Anyway, this is how I am thinking the v2 of the patch:
> 

V2 looks good.

Thanks
--
Gustavo

> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index cdd5d1d3ae41..7535f89e08cd 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -274,6 +275,8 @@ static int set_user_trap(struct task_struct *task,
> unsigned long trap)
>   */
>  int ptrace_get_reg(struct task_struct *task, int regno, unsigned long *data)
>  {
> + unsigned int regs_max;
> +
>   if ((task->thread.regs == NULL) || !data)
>   return -EIO;
> 
> @@ -297,7 +300,9 @@ int ptrace_get_reg(struct task_struct *task, int regno,
> unsigned long *data)
>   }
>  #endif
> 
> - if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
> + regs_max = sizeof(struct user_pt_regs) / sizeof(unsigned long);
> + if (regno < regs_max) {
> + regno = array_index_nospec(regno, regs_max);
>   *data = ((unsigned long *)task->thread.regs)[regno];
>   return 0;
>   }
> @@ -321,6 +326,7 @@ int ptrace_put_reg(struct task_struct *task, int regno,
> unsigned long data)
>   return set_user_dscr(task, data);
> 
>   if (regno <= 

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-01-29 Thread Thiago Jung Bauermann


Fixing address of powerpc mailing list.

Thiago Jung Bauermann  writes:

> Hello,
>
> With Christoph's rework of the DMA API that recently landed, the patch
> below is the only change needed in virtio to make it work in a POWER
> secure guest under the ultravisor.
>
> The other change we need (making sure the device's dma_map_ops is NULL
> so that the dma-direct/swiotlb code is used) can be made in
> powerpc-specific code.
>
> Of course, I also have patches (soon to be posted as RFC) which hook up
>  to the powerpc secure guest support code.
>
> What do you think?
>
> From d0629a36a75c678b4a72b853f8f7f8c17eedd6b3 Mon Sep 17 00:00:00 2001
> From: Thiago Jung Bauermann 
> Date: Thu, 24 Jan 2019 22:08:02 -0200
> Subject: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted
>
> The host can't access the guest memory when it's encrypted, so using
> regular memory pages for the ring isn't an option. Go through the DMA API.
>
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  drivers/virtio/virtio_ring.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index cd7e755484e3..321a27075380 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -259,8 +259,11 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
>* not work without an even larger kludge.  Instead, enable
>* the DMA API if we're a Xen guest, which at least allows
>* all of the sensible Xen configurations to work correctly.
> +  *
> +  * Also, if guest memory is encrypted the host can't access
> +  * it directly. In this case, we'll need to use the DMA API.
>*/
> - if (xen_domain())
> + if (xen_domain() || sev_active())
>   return true;
>
>   return false;


-- 
Thiago Jung Bauermann
IBM Linux Technology Center
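
For context, this is roughly how the whole helper reads with the patch
applied, reconstructed against the v5.0-era virtio_ring.c (the
virtio_has_iommu_quirk() check comes from the unquoted part of the function,
so treat this as an approximation rather than the exact upstream file):

	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		/* VIRTIO_F_IOMMU_PLATFORM negotiated: always use the DMA API. */
		if (!virtio_has_iommu_quirk(vdev))
			return true;

		/*
		 * Otherwise we are left to guess.  Xen grant mappings, and
		 * now encrypted guest memory, both require the DMA API even
		 * when the feature bit is absent.
		 */
		if (xen_domain() || sev_active())
			return true;

		return false;
	}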



[REPOST PATCH v07 5/5] migration/memory: Support 'ibm,dynamic-memory-v2'

2019-01-29 Thread Michael Bringmann
migration/memory: This patch adds recognition for changes to the
associativity of memory blocks described by 'ibm,dynamic-memory-v2'.
If the associativity of an LMB has changed, it should be readded to
the system in order to update local and general kernel data structures.
This patch builds upon previous enhancements that scan the device-tree
"ibm,dynamic-memory" properties using the base LMB array, and a copy
derived from the updated properties.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index f7a40f4..23f5655 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -1184,7 +1184,8 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
err = pseries_remove_mem_node(rd->dn);
break;
case OF_RECONFIG_UPDATE_PROPERTY:
-   if (!strcmp(rd->prop->name, "ibm,dynamic-memory")) {
+   if (!strcmp(rd->prop->name, "ibm,dynamic-memory") ||
+   !strcmp(rd->prop->name, "ibm,dynamic-memory-v2")) {
struct drmem_lmb_info *dinfo =
drmem_lmbs_init(rd->prop);
if (!dinfo)



[REPOST PATCH v07 4/5] migration/memory: Evaluate LMB assoc changes

2019-01-29 Thread Michael Bringmann
migration/memory: This patch adds code that recognizes changes to
the associativity of memory blocks described by the device-tree
properties in order to drive equivalent 'hotplug' operations to
update local and general kernel data structures to reflect those
changes.  These differences may include:

* Evaluate 'ibm,dynamic-memory' properties when processing the
  updated device-tree properties of the system during Post Migration
  events (migration_store).  The new functionality looks for changes
  to the aa_index values for each drc_index/LMB to identify any memory
  blocks that should be readded.

* In an LPAR migration scenario, the "ibm,associativity-lookup-arrays"
  property may change.  In the event that a row of the array differs,
  locate all assigned memory blocks with that 'aa_index' and 're-add'
  them to the system memory block data structures.  In the process of
  the 're-add', the system routines will update the corresponding entry
  for the memory in the LMB structures and any other relevant kernel
  data structures.

A number of previous extensions made to the DRMEM code for scanning
device-tree properties and creating LMB arrays are used here to
ensure that the resulting code is simpler and more usable:

* Use new paired list iterator for the DRMEM LMB info arrays to find
  differences in old and new versions of properties.
* Use new iterator for copies of the DRMEM info arrays to evaluate
  completely new structures.
* Combine common code for parsing and evaluating memory description
  properties based on the DRMEM LMB array model to greatly simplify
  extension from the older property 'ibm,dynamic-memory' to the new
  property model of 'ibm,dynamic-memory-v2'.

For support, add a new pseries hotplug action for DLPAR operations,
PSERIES_HP_ELOG_ACTION_READD_MULTIPLE.  It is a variant of the READD
operation which performs the action upon multiple instances of the
resource at one time.  The operation is to be triggered by device-tree
analysis of updates by RTAS events analyzed by 'migration_store' during
post-migration processing.  It will be used for memory updates,
initially.

Signed-off-by: Michael Bringmann 
---
Changes in v07:
  -- Ensure return value from dlpar_memory_readd_multiple
Changes in v06:
  -- Rebase to powerpc next branch to account for recent code changes.
  -- Fix prototype problem when CONFIG_MEMORY_HOTPLUG not defined.
Changes in v05:
  -- Move common structure from numa.c + hotplug-memory.c to header file.
  -- Clarify some comments.
  -- Use walk_drmem_lmbs_pairs and callback instead of local loop
Changes in v04:
  -- Move dlpar_memory_readd_multiple() function definition and use
 into previous patch along with action constant definition.
  -- Correct spacing in patch
Changes in v03:
  -- Modify the code that parses the memory affinity attributes to
 mark relevant DRMEM LMB array entries using the internal_flags
 mechanism instead of generate unique hotplug actions for each
 memory block to be readded.  The change is intended to both
 simplify the code, and to require fewer resources on systems
 with huge amounts of memory.
  -- Save up notice about all affected LMB entries until the end of the
 'migration_store' operation at which point a single action is
 queued to scan the entire DRMEM array.
  -- Add READD_MULTIPLE function for memory that scans the DRMEM
 array to identify multiple entries that were marked previously.
 The corresponding memory blocks are to be readded to the system
 to update relevant data structures outside of the powerpc-
 specific code.
  -- Change dlpar_memory_pmt_changes_action to directly queue worker
 to pseries work queue.
---
 arch/powerpc/include/asm/topology.h |7 +
 arch/powerpc/mm/numa.c  |6 -
 arch/powerpc/platforms/pseries/hotplug-memory.c |  209 +++
 arch/powerpc/platforms/pseries/mobility.c   |3 
 arch/powerpc/platforms/pseries/pseries.h|8 +
 5 files changed, 187 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index a4a718d..fbe03df 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -135,5 +135,12 @@ static inline void shared_proc_topology_init(void) {}
 #endif
 #endif
 
+
+struct assoc_arrays {
+   u32 n_arrays;
+   u32 array_sz;
+   const __be32 *arrays;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 693ae1c..f1e7287 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -368,12 +368,6 @@ static unsigned long read_n_cells(int n, const __be32 
**buf)
return result;
 }
 
-struct assoc_arrays {
-   u32 n_arrays;
-   u32 array_sz;
-   const __be32 *arrays;
-};
-
 /*
  * Retrieve and validate the list of associativity arrays for drconf
  * memory from 

[REPOST PATCH v07 3/5] migration/memory: Add hotplug READD_MULTIPLE

2019-01-29 Thread Michael Bringmann
migration/memory: This patch adds a new pseries hotplug action
for CPU and memory operations, PSERIES_HP_ELOG_ACTION_READD_MULTIPLE.
This is a variant of the READD operation which performs the action
upon multiple instances of the resource at one time.  The operation
is to be triggered by device-tree analysis of updates by RTAS events
analyzed by 'migration_store' during post-migration processing.  It
will be used for memory updates, initially.

Signed-off-by: Michael Bringmann 
---
Changes in v05:
  -- Provide dlpar_memory_readd_helper routine to compress some common code
Changes in v04:
  -- Move init of 'lmb->internal_flags' in init_drmem_v2_lmbs to
 previous patch.
  -- Pull in implementation of dlpar_memory_readd_multiple() to go
 with operation flag.
---
 arch/powerpc/include/asm/rtas.h |1 +
 arch/powerpc/platforms/pseries/hotplug-memory.c |   44 ---
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 0183e95..cc00451 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -333,6 +333,7 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ACTION_ADD 1
 #define PSERIES_HP_ELOG_ACTION_REMOVE  2
 #define PSERIES_HP_ELOG_ACTION_READD   3
+#define PSERIES_HP_ELOG_ACTION_READD_MULTIPLE  4
 
 #define PSERIES_HP_ELOG_ID_DRC_NAME1
 #define PSERIES_HP_ELOG_ID_DRC_INDEX   2
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 2b796da..9c76345 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -507,6 +507,19 @@ static int dlpar_memory_remove_by_index(u32 drc_index)
return rc;
 }
 
+static int dlpar_memory_readd_helper(struct drmem_lmb *lmb)
+{
+   int rc;
+
+   rc = dlpar_remove_lmb(lmb);
+   if (!rc) {
+   rc = dlpar_add_lmb(lmb);
+   if (rc)
+   dlpar_release_drc(lmb->drc_index);
+   }
+   return rc;
+}
+
 static int dlpar_memory_readd_by_index(u32 drc_index)
 {
struct drmem_lmb *lmb;
@@ -519,12 +532,7 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
for_each_drmem_lmb(lmb) {
if (lmb->drc_index == drc_index) {
lmb_found = 1;
-   rc = dlpar_remove_lmb(lmb);
-   if (!rc) {
-   rc = dlpar_add_lmb(lmb);
-   if (rc)
-   dlpar_release_drc(lmb->drc_index);
-   }
+   rc = dlpar_memory_readd_helper(lmb);
break;
}
}
@@ -541,6 +549,23 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
return rc;
 }
 
+static int dlpar_memory_readd_multiple(void)
+{
+   struct drmem_lmb *lmb;
+   int rc = 0;
+
+   pr_info("Attempting to update multiple LMBs\n");
+
+   for_each_drmem_lmb(lmb) {
+   if (drmem_lmb_update(lmb)) {
+   rc = dlpar_memory_readd_helper(lmb);
+   drmem_remove_lmb_update(lmb);
+   }
+   }
+
+   return rc;
+}
+
 static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 {
struct drmem_lmb *lmb, *start_lmb, *end_lmb;
@@ -641,6 +666,10 @@ static int dlpar_memory_readd_by_index(u32 drc_index)
 {
return -EOPNOTSUPP;
 }
+static int dlpar_memory_readd_multiple(void)
+{
+   return -EOPNOTSUPP;
+}
 
 static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 {
@@ -918,6 +947,9 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
drc_index = hp_elog->_drc_u.drc_index;
rc = dlpar_memory_readd_by_index(drc_index);
break;
+   case PSERIES_HP_ELOG_ACTION_READD_MULTIPLE:
+   rc = dlpar_memory_readd_multiple();
+   break;
default:
pr_err("Invalid action (%d) specified\n", hp_elog->action);
rc = -EINVAL;
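
For illustration only (appended here, not part of the patch): a hypothetical
caller showing how post-migration code could request the new action through
the existing dlpar_memory() entry point, assuming the usual
pseries_hp_errorlog layout and the PSERIES_HP_ELOG_RESOURCE_MEM constant:

	static int request_readd_multiple(void)
	{
		struct pseries_hp_errorlog hp_elog = {
			.resource = PSERIES_HP_ELOG_RESOURCE_MEM,
			.action   = PSERIES_HP_ELOG_ACTION_READD_MULTIPLE,
		};

		/*
		 * No DRC index is needed: the action scans the whole DRMEM
		 * array for LMBs marked with DRMEM_LMBINT_UPDATE.
		 */
		return dlpar_memory(&hp_elog);
	}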



Re: [PATCH] powerpc/mm: Add _PAGE_SAO to _PAGE_CACHE_CTL mask

2019-01-29 Thread Reza Arbab

On Tue, Jan 29, 2019 at 08:37:28PM +0530, Aneesh Kumar K.V wrote:

> Not sure what the fix is about. We set the related hash pte flags via:
>
> 	if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_TOLERANT)
> 		rflags |= HPTE_R_I;
> 	else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_NON_IDEMPOTENT)
> 		rflags |= (HPTE_R_I | HPTE_R_G);
> 	else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_SAO)
> 		rflags |= (HPTE_R_W | HPTE_R_I | HPTE_R_M);


Again, nothing broken here, just a code readability thing. As Alexey 
(and Charlie) noted, given the above it is a little confusing to define 
_PAGE_CACHE_CTL this way:


 #define _PAGE_CACHE_CTL  (_PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)

I like Alexey's idea, maybe just use a literal?

 #define _PAGE_CACHE_CTL 0x30
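
For reference, a sketch of the surrounding definitions, quoted from memory
against the v5.0-era arch/powerpc/include/asm/book3s/64/pgtable.h (treat the
exact values as an assumption to double-check); it shows why the literal and
the OR-of-two-names are the same mask, and why _PAGE_SAO falls inside it:

	#define _PAGE_SAO		0x00010	/* strong access order */
	#define _PAGE_NON_IDEMPOTENT	0x00020	/* non-idempotent memory */
	#define _PAGE_TOLERANT		0x00030	/* tolerant, cache inhibited */

	/*
	 * 0x20 | 0x30 == 0x30, so the existing mask already covers the
	 * _PAGE_SAO encoding (0x10); only the spelling obscures that.
	 */
	#define _PAGE_CACHE_CTL		0x30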

--
Reza Arbab



[REPOST PATCH v07 2/5] powerpc/drmem: Add internal_flags feature

2019-01-29 Thread Michael Bringmann
powerpc/drmem: Add internal_flags field to each LMB to allow
marking of kernel software-specific operations that need not
be exported to other users.  For instance, if information about
selected LMBs needs to be maintained for subsequent passes
through the system, it can be encoded into the LMB array itself
without requiring the allocation and maintenance of additional
data structures.

Signed-off-by: Michael Bringmann 
---
Changes in v04:
  -- Add another initialization of 'lmb->internal_flags' to
 init_drmem_v2_lmbs.
---
 arch/powerpc/include/asm/drmem.h |   18 ++
 arch/powerpc/mm/drmem.c  |3 +++
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index cfe8598..dbb3e6c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -17,6 +17,7 @@ struct drmem_lmb {
u32 drc_index;
u32 aa_index;
u32 flags;
+   u32 internal_flags;
 };
 
 struct drmem_lmb_info {
@@ -94,6 +95,23 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
return lmb->flags & DRMEM_LMB_RESERVED;
 }
 
+#define DRMEM_LMBINT_UPDATE0x0001
+
+static inline void drmem_mark_lmb_update(struct drmem_lmb *lmb)
+{
+   lmb->internal_flags |= DRMEM_LMBINT_UPDATE;
+}
+
+static inline void drmem_remove_lmb_update(struct drmem_lmb *lmb)
+{
+   lmb->internal_flags &= ~DRMEM_LMBINT_UPDATE;
+}
+
+static inline bool drmem_lmb_update(struct drmem_lmb *lmb)
+{
+   return lmb->internal_flags & DRMEM_LMBINT_UPDATE;
+}
+
 u64 drmem_lmb_memory_max(void);
 void __init walk_drmem_lmbs(struct device_node *dn,
void (*func)(struct drmem_lmb *, const __be32 **));
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index ded9dbf..f199fe5 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -207,6 +207,7 @@ static void read_drconf_v1_cell(struct drmem_lmb *lmb,
 
lmb->aa_index = of_read_number(p++, 1);
lmb->flags = of_read_number(p++, 1);
+   lmb->internal_flags = 0;
 
*prop = p;
 }
@@ -265,6 +266,7 @@ static void __walk_drmem_v2_lmbs(const __be32 *prop, const 
__be32 *usm,
 
lmb.aa_index = dr_cell.aa_index;
lmb.flags = dr_cell.flags;
+   lmb.internal_flags = 0;
 
func(&lmb, &usm);
}
@@ -441,6 +443,7 @@ static void init_drmem_v2_lmbs(const __be32 *prop,
 
lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
+   lmb->internal_flags = 0;
}
}
 }



[REPOST PATCH v07 1/5] powerpc/drmem: Export 'dynamic-memory' loader

2019-01-29 Thread Michael Bringmann
powerpc/drmem: Export many of the functions of DRMEM to parse
"ibm,dynamic-memory" and "ibm,dynamic-memory-v2" during hotplug
operations and for Post Migration events.

Also modify the DRMEM initialization code to allow it to,

* Be called after system initialization
* Provide a separate user copy of the LMB array that is produced
* Free the user copy upon request

In addition, a couple of changes were made to make the creation
of additional copies of the LMB array more useful including,

* Add iterator function to work through a pair of drmem_info arrays
  with a callback function to apply specific tests.
* Modify DRMEM code to replace usages of dt_root_addr_cells, and
  dt_mem_next_cell, as these are only available at first boot.

Signed-off-by: Michael Bringmann 
---
Changes in v05:
  -- Add walk_drmem_lmbs_pairs to replace macro for_each_pair_lmb
---
 arch/powerpc/include/asm/drmem.h |   13 +
 arch/powerpc/mm/drmem.c  |   96 ++
 2 files changed, 89 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 7c1d8e7..cfe8598 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -35,6 +35,11 @@ struct drmem_lmb_info {
 		&drmem_info->lmbs[0],			\
 		&drmem_info->lmbs[drmem_info->n_lmbs - 1])

+#define for_each_dinfo_lmb(dinfo, lmb)	\
+	for_each_drmem_lmb_in_range((lmb),	\
+		&dinfo->lmbs[0],		\
+		&dinfo->lmbs[dinfo->n_lmbs - 1])
+
 /*
  * The of_drconf_cell_v1 struct defines the layout of the LMB data
  * specified in the ibm,dynamic-memory device tree property.
@@ -94,6 +99,14 @@ void __init walk_drmem_lmbs(struct device_node *dn,
void (*func)(struct drmem_lmb *, const __be32 **));
 int drmem_update_dt(void);
 
+struct drmem_lmb_info *drmem_lmbs_init(struct property *prop);
+void drmem_lmbs_free(struct drmem_lmb_info *dinfo);
+int walk_drmem_lmbs_pairs(struct drmem_lmb_info *dinfo_oth,
+ int (*func)(struct drmem_lmb *cnt,
+   struct drmem_lmb *oth,
+   void *data),
+ void *data);
+
 #ifdef CONFIG_PPC_PSERIES
 void __init walk_drmem_lmbs_early(unsigned long node,
void (*func)(struct drmem_lmb *, const __be32 **));
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 3f18036..ded9dbf 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -20,6 +20,7 @@
 
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
+static int n_root_addr_cells;
 
 u64 drmem_lmb_memory_max(void)
 {
@@ -193,12 +194,13 @@ int drmem_update_dt(void)
return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
-   lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, );
+   lmb->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
lmb->drc_index = of_read_number(p++, 1);
 
p++; /* skip reserved field */
@@ -209,7 +211,7 @@ static void __init read_drconf_v1_cell(struct drmem_lmb 
*lmb,
*prop = p;
 }
 
-static void __init __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *usm,
+static void __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *usm,
void (*func)(struct drmem_lmb *, const __be32 **))
 {
struct drmem_lmb lmb;
@@ -225,13 +227,14 @@ static void __init __walk_drmem_v1_lmbs(const __be32 
*prop, const __be32 *usm,
}
 }
 
-static void __init read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
+static void read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
dr_cell->seq_lmbs = of_read_number(p++, 1);
-   dr_cell->base_addr = dt_mem_next_cell(dt_root_addr_cells, );
+   dr_cell->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
dr_cell->drc_index = of_read_number(p++, 1);
dr_cell->aa_index = of_read_number(p++, 1);
dr_cell->flags = of_read_number(p++, 1);
@@ -239,7 +242,7 @@ static void __init read_drconf_v2_cell(struct 
of_drconf_cell_v2 *dr_cell,
*prop = p;
 }
 
-static void __init __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *usm,
+static void __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *usm,
void (*func)(struct drmem_lmb *, const __be32 **))
 {
struct of_drconf_cell_v2 dr_cell;
@@ -275,6 +278,9 @@ void __init walk_drmem_lmbs_early(unsigned long node,
const __be32 *prop, *usm;
int len;
 
+   
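
The diff is truncated in the archive. As an illustration, here is a minimal
sketch of how the exported pairs iterator might be used, based only on the
signatures in the header hunk above; the callback and its marking policy are
hypothetical:

	#include <linux/of.h>

	/* Callback: invoked once per (current, other) LMB pair. */
	static int note_assoc_change(struct drmem_lmb *cnt,
				     struct drmem_lmb *oth, void *data)
	{
		/* Same LMB, different associativity index: mark for re-add. */
		if (cnt->aa_index != oth->aa_index)
			drmem_mark_lmb_update(cnt);
		return 0;
	}

	static void compare_against(struct property *new_prop)
	{
		struct drmem_lmb_info *dinfo = drmem_lmbs_init(new_prop);

		if (!dinfo)
			return;
		walk_drmem_lmbs_pairs(dinfo, note_assoc_change, NULL);
		drmem_lmbs_free(dinfo);
	}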

[REPOST PATCH v07 0/5] powerpc/migration: Affinity fix for memory

2019-01-29 Thread Michael Bringmann
The migration of LPARs across Power systems affects many attributes
including that of the associativity of memory blocks.  The patches
in this set execute when a system is coming up fresh upon a migration
target.  They are intended to,

* Recognize changes to the associativity of memory recorded in
  internal data structures when compared to the latest copies in
  the device tree (e.g. ibm,dynamic-memory, ibm,dynamic-memory-v2).
* Recognize changes to the associativity mapping (e.g. ibm,
  associativity-lookup-arrays), locate all assigned memory blocks
  corresponding to each changed row, and readd all such blocks.
* Generate calls to other code layers to reset the data structures
  related to associativity of memory.
* Re-register the 'changed' entities into the target system.
  Re-registration of memory blocks mostly entails acting as if they
  have been newly hot-added into the target system.

This code builds upon features introduced in a previous patch set
that updates CPUs for affinity changes that may occur during LPM.

Signed-off-by: Michael Bringmann 

Michael Bringmann (5):
  powerpc/drmem: Export 'dynamic-memory' loader
  powerpc/drmem: Add internal_flags feature
  migration/memory: Add hotplug flags READD_MULTIPLE
  migration/memory: Evaluate LMB assoc changes
  migration/memory: Support 'ibm,dynamic-memory-v2'
---
Changes in v07:
  -- Provide more useful return value from dlpar_memory_readd_multiple
Changes in v06:
  -- Rebase to powerpc next branch to account for recent code changes.
  -- Fix prototype problem when CONFIG_MEMORY_HOTPLUG not defined.
Changes in v05:
  -- Add walk_drmem_lmbs_pairs to replace macro for_each_pair_lmb
  -- Use walk_drmem_lmbs_pairs and callback instead of local loop
  -- Provide dlpar_memory_readd_helper routine to compress some common code
  -- Move common structure from numa.c + hotplug-memory.c to header file.
  -- Clarify some comments.
Changes in v04:
  -- Move dlpar_memory_readd_multiple() to patch with new ACTION
 constant.
  -- Move init of 'lmb->internal_flags' in init_drmem_v2_lmbs to
 patch with other references to flag.
  -- Correct spacing in one of the patches
Changes in v03:
  -- Change operation to tag changed LMBs in DRMEM array instead of
 queuing a potentially huge number of structures.
  -- Added another hotplug queue event for CPU/memory operations
  -- Added internal_flags feature to DRMEM
  -- Improve the patch description language for the patch set.
  -- Revise patch set to queue worker for memory association
 updates directly to pseries worker queue.



Re: [PATCH] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-29 Thread Breno Leitao
Hi Gustavo,

On 1/24/19 3:25 PM, Gustavo A. R. Silva wrote:
> 
> 
> On 1/24/19 8:01 AM, Breno Leitao wrote:
>> 'regno' is directly controlled by user space, hence leading to a potential
>> exploitation of the Spectre variant 1 vulnerability.
>>
>> On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
>> register number that would be read or written. This register number is
>> called 'regno' which is part of the 'addr' syscall parameter.
>>
>> This 'regno' value is checked against the maximum pt_regs structure size,
>> and then used to dereference it, which matches the initial part of a
>> Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
>> is returned to userspace in the GETREGS case.
>>
> 
> Was this reported by any tool?

It was not. I was writing a userspace tool, which required me to read the
ptrace subsystem carefully, and that is how I found this case.

> If so, it might be worth mentioning it.
> 
>> This patch sanitizes 'regno' before using it to dereference pt_reg.
>>
>> Notice that given that speculation windows are large, the policy is
>> to kill the speculation on the first load and not worry if it can be
>> completed with a dependent load/store [1].
>>
>> [1] https://marc.info/?l=linux-kernel=152449131114778=2
>>
>> Signed-off-by: Breno Leitao 
>> ---
>>  arch/powerpc/kernel/ptrace.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
>> index cdd5d1d3ae41..3eac38a29863 100644
>> --- a/arch/powerpc/kernel/ptrace.c
>> +++ b/arch/powerpc/kernel/ptrace.c
>> @@ -33,6 +33,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -298,6 +299,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
>> unsigned long *data)
>>  #endif
>>  
>>  if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
> 
> I would use a variable to store sizeof(struct user_pt_regs) / sizeof(unsigned 
> long).

Right.

> 
>> +regno = array_index_nospec(regno,
>> +(sizeof(struct user_pt_regs) /
>> + sizeof(unsigned long)));
> 
> See the rest of my comments below.
> 
>>  *data = ((unsigned long *)task->thread.regs)[regno];
>>  return 0;
>>  }
>> @@ -321,6 +325,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
>> unsigned long data)
>>  return set_user_dscr(task, data);
>>  
>>  if (regno <= PT_MAX_PUT_REG) {
>> +regno = array_index_nospec(regno, PT_MAX_PUT_REG);
> 
> This is wrong.  array_index_nospec() will return PT_MAX_PUT_REG - 1 in case 
> regno is equal to
> PT_MAX_PUT_REG, and this is not what you want.

Right, this is really wrong. It would only be correct if the comparison were
regno < PT_MAX_PUT_REG. Since PT_MAX_PUT_REG itself is a valid array index,
I need to take care of that case as well, doing something like:

regno = array_index_nospec(regno, PT_MAX_PUT_REG + 1);

Other than that, I think that for the regno = PT_MAX_PUT_REG case,
array_index_nospec() will redefine regno to 0, not PT_MAX_PUT_REG - 1.

> Similar reasoning applies to the case above.

I understand that the GETREGS case above does not have the same problem:
its bounds check is strict (<), so the bound passed to array_index_nospec()
is already one past the last valid index. Does that make sense?

Anyway, this is how I am thinking the v2 of the patch:

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cdd5d1d3ae41..7535f89e08cd 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -274,6 +275,8 @@ static int set_user_trap(struct task_struct *task,
unsigned long trap)
  */
 int ptrace_get_reg(struct task_struct *task, int regno, unsigned long *data)
 {
+   unsigned int regs_max;
+
if ((task->thread.regs == NULL) || !data)
return -EIO;

@@ -297,7 +300,9 @@ int ptrace_get_reg(struct task_struct *task, int regno,
unsigned long *data)
}
 #endif

-   if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
+   regs_max = sizeof(struct user_pt_regs) / sizeof(unsigned long);
+   if (regno < regs_max) {
+   regno = array_index_nospec(regno, regs_max);
*data = ((unsigned long *)task->thread.regs)[regno];
return 0;
}
@@ -321,6 +326,7 @@ int ptrace_put_reg(struct task_struct *task, int regno,
unsigned long data)
return set_user_dscr(task, data);

if (regno <= PT_MAX_PUT_REG) {
+   regno = array_index_nospec(regno, PT_MAX_PUT_REG + 1);
((unsigned long *)task->thread.regs)[regno] = data;
return 0;
}
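
As a standalone sketch of the pattern v2 settles on, assuming linux/nospec.h
semantics (array_index_nospec(index, size) returns index when index < size
and 0 otherwise, which is why size must be one past the last valid index);
the function below is illustrative, not kernel code:

	#include <linux/errno.h>
	#include <linux/nospec.h>

	static long bounded_read(unsigned long *regs, unsigned int regno,
				 unsigned int nregs)
	{
		if (regno >= nregs)
			return -EIO;
		/* Clamp the index even under branch misspeculation. */
		regno = array_index_nospec(regno, nregs);
		return regs[regno];
	}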


[PATCH v2] powerpc/traps: fix the message printed when stack overflows

2019-01-29 Thread Christophe Leroy
Today's message is useless:

[   42.253267] Kernel stack overflow in process (ptrval), r1=c65500b0

This patch fixes it:

[   66.905235] Kernel stack overflow in process sh[356], r1=c65560b0

Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
Cc: 
Signed-off-by: Christophe Leroy 
---
 v2: make checkpatch happy :)

 arch/powerpc/kernel/traps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 5e917a84f949..a3dc6872d5aa 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1535,8 +1535,8 @@ void alignment_exception(struct pt_regs *regs)
 
 void StackOverflow(struct pt_regs *regs)
 {
-   printk(KERN_CRIT "Kernel stack overflow in process %p, r1=%lx\n",
-  current, regs->gpr[1]);
+   pr_crit("Kernel stack overflow in process %s[%d], r1=%lx\n",
+   current->comm, current->pid, regs->gpr[1]);
debugger(regs);
show_regs(regs);
panic("kernel stack overflow");
-- 
2.13.3



Re: use generic DMA mapping code in powerpc V4

2019-01-29 Thread Christoph Hellwig
On Tue, Jan 29, 2019 at 05:14:11PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 29, 2019 at 04:03:32PM +0100, Christian Zigotzky wrote:
> > Hi Christoph,
> >
> > I compiled kernels for the X5000 and X1000 from your new branch 
> > 'powerpc-dma.6-debug.2' today. The kernels boot and the P.A. Semi Ethernet 
> > works!
> 
> Thanks for testing!  I'll prepare a new series that adds the other
> patches on top of this one.

And that was easier than I thought - we just had a few patches left
in powerpc-dma.6, so I've rebased that branch on top of
powerpc-dma.6-debug.2:

git://git.infradead.org/users/hch/misc.git powerpc-dma.6

Gitweb:


http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6

I hope the other patches are simple enough, so just testing the full
branch checkout should be fine for now.


[PATCH] powerpc/traps: fix the message printed when stack overflows

2019-01-29 Thread Christophe Leroy
Today's message is useless:

[   42.253267] Kernel stack overflow in process (ptrval), r1=c65500b0

This patch fixes it:

[   66.905235] Kernel stack overflow in process sh[356], r1=c65560b0

Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
Cc: 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/traps.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 5e917a84f949..ca8f86ea0345 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1535,8 +1535,8 @@ void alignment_exception(struct pt_regs *regs)
 
 void StackOverflow(struct pt_regs *regs)
 {
-   printk(KERN_CRIT "Kernel stack overflow in process %p, r1=%lx\n",
-  current, regs->gpr[1]);
+   printk(KERN_CRIT "Kernel stack overflow in process %s[%d], r1=%lx\n",
+  current->comm, current->pid, regs->gpr[1]);
debugger(regs);
show_regs(regs);
panic("kernel stack overflow");
-- 
2.13.3



Re: [RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2019-01-29 Thread Michael Bringmann
On 1/29/19 3:31 AM, Michael Ellerman wrote:
> Tyrel Datwyler  writes:
>> On 12/14/2018 12:50 PM, Michael Bringmann wrote:
>>> Define interface to acquire arch-specific drc info to match against
>>> hotpluggable devices.  The current implementation exposes several
>>> pseries-specific dynamic memory properties in generic kernel code.
>>> This patch set provides an interface to pull that code out of the
>>> generic kernel.
>>>
>>> Signed-off-by: Michael Bringmann 
>>> ---
>>>  include/linux/topology.h |9 +
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>>> index cb0775e..df97f5f 100644
>>> --- a/include/linux/topology.h
>>> +++ b/include/linux/topology.h
>>> @@ -44,6 +44,15 @@
>>
>> As far as I know pseries is the only platform that uses DR connectors, and I
>> highly doubt that any other powerpc platform or arch ever will. So, I'm not 
>> sure
>> that this is really generic enough to belong in topology.h.
> 
> Right. This does not belong in include/linux.
> 
>> If anything I would
>> suggest putting this in an include in arch/powerpc/include/ named something 
>> like
>> drcinfo.h or pseries-drc.h. That will make it visible to modules like rpaphp
>> that want/need to use this functionality.
> 
> Yeah that would make more sense.

If you see no objection to referencing a powerpc-specific function from
the code ...

> 
> Using "arch" in the name is wrong, it's pseries specific so
> pseries_find_drc_match() would be more appropriate.
> 
>>> +int arch_find_drc_match(struct device_node *dn,
>>> +   bool (*usercb)(struct device_node *dn,
>>> +   u32 drc_index, char *drc_name,
>>> +   char *drc_type, u32 drc_power_domain,
>>> +   void *data),
>>> +   char *opt_drc_type, char *opt_drc_name,
>>> +   bool match_drc_index, bool ck_php_type,
>>> +   void *data);
> 
> This function signature is kind of insane.
> 
> You end with calls like:
> 
> + return arch_find_drc_match(dn, rpaphp_add_slot_cb,
> + NULL, NULL, false, true, NULL);
> 
> Which is impossible to parse.
> 
> I feel like maybe this isn't the right level of abstraction.

...
I had already been considering simplifying the interface for these
calls to something like the following:

int rpaphp_check_drc_props(struct device_node *dn, char *drc_name,
char *drc_type)
{
return pseries_find_drc_match(dn, drc_type, drc_name);
}
...
int rpaphp_add_slot(struct device_node *dn)
{
   if (!dn->name || strcmp(dn->name, "pci"))
   return 0;

   return pseries_add_drc_slot(dn, rpaphp_add_slot_cb);
}
...

Further details would be hidden within the pseries code.


> 
> cheers

Regards

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: use generic DMA mapping code in powerpc V4

2019-01-29 Thread Christoph Hellwig
On Tue, Jan 29, 2019 at 04:03:32PM +0100, Christian Zigotzky wrote:
> Hi Christoph,
>
> I compiled kernels for the X5000 and X1000 from your new branch 
> 'powerpc-dma.6-debug.2' today. The kernels boot and the P.A. Semi Ethernet 
> works!

Thanks for testing!  I'll prepare a new series that adds the other
patches on top of this one.


Re: [PATCH] powerpc/pseries: Perform full re-add of CPU for topology update post-migration

2019-01-29 Thread Michael Bringmann



On 1/29/19 3:37 AM, Michael Ellerman wrote:
> Michael Bringmann  writes:
> 
>> On 10/29/18 1:43 PM, Nathan Fontenot wrote:
>>> On pseries systems, performing a partition migration can result in
>>> altering the nodes a CPU is assigned to on the destination system. For
>>> example, pre-migration on the source system CPUs are in node 1 and 3,
>>> post-migration on the destination system CPUs are in nodes 2 and 3.
>>>
>>> Handling the node change for a CPU can cause corruption in the slab
>>> cache if we hit a timing where a CPU's node is changed while cache_reap()
>>> is invoked. The corruption occurs because the slab cache code appears
>>> to rely on the CPU and slab cache pages being on the same node.
>>>
>>> The current dynamic updating of a CPU's node done in arch/powerpc/mm/numa.c
>>> does not prevent us from hitting this scenario.
>>>
>>> Changing the device tree property update notification handler that
>>> recognizes an affinity change for a CPU to do a full DLPAR remove and
>>> add of the CPU instead of dynamically changing its node resolves this
>>> issue.
>>>
>>> Signed-off-by: Nathan Fontenot 
>>> Signed-off-by: Michael W. Bringmann 

Tested-by: Michael W. Bringmann 

> 
> Are you sure that's what you meant? ie. you wrote some of the patch?
> 
> What I'd like is to get a Tested-by from you.
> 
> cheers
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH] powerpc/mm: Add _PAGE_SAO to _PAGE_CACHE_CTL mask

2019-01-29 Thread Aneesh Kumar K.V

On 1/28/19 11:01 PM, Reza Arbab wrote:

In htab_convert_pte_flags(), _PAGE_CACHE_CTL is used to check for the
_PAGE_SAO flag:

   else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_SAO)
   rflags |= (HPTE_R_W | HPTE_R_I | HPTE_R_M);

But, it isn't defined to include that flag:

   #define _PAGE_CACHE_CTL (_PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)

This happens to work, but only because of the flag values:

   #define _PAGE_SAO            0x00010 /* Strong access order */
   #define _PAGE_NON_IDEMPOTENT 0x00020 /* non idempotent memory */
   #define _PAGE_TOLERANT       0x00030 /* tolerant memory, cache inhibited */

To prevent any issues if these particulars ever change, add _PAGE_SAO to
the mask.



Not sure what the fix is about. We set the related hash pte flags via

if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_TOLERANT)
rflags |= HPTE_R_I;
else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_NON_IDEMPOTENT)
rflags |= (HPTE_R_I | HPTE_R_G);
else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_SAO)
rflags |= (HPTE_R_W | HPTE_R_I | HPTE_R_M);
else
/*
 * Add memory coherence if cache inhibited is not set
 */
rflags |= HPTE_R_M;
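
For reference, a quick standalone check (a demonstration in plain C using
the flag values quoted above, not kernel code) of why the existing mask
happens to cover _PAGE_SAO: 0x20 | 0x30 is 0x30, which already includes
the 0x10 SAO bit.

#include <stdio.h>

/* Flag values as quoted above from book3s/64/pgtable.h */
#define _PAGE_SAO            0x00010UL /* Strong access order */
#define _PAGE_NON_IDEMPOTENT 0x00020UL
#define _PAGE_TOLERANT       0x00030UL

int main(void)
{
	unsigned long mask = _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT;

	printf("mask = 0x%lx\n", mask);          /* prints 0x30 */
	printf("SAO covered: %d\n",
	       (_PAGE_SAO & mask) == _PAGE_SAO); /* prints 1 */
	return 0;
}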





Suggested-by: Charles Johns 
Signed-off-by: Reza Arbab 
---
  arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 2e6ada2..1d97a28 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -811,7 +811,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
return hash__set_pte_at(mm, addr, ptep, pte, percpu);
  }
  
-#define _PAGE_CACHE_CTL	(_PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)
+#define _PAGE_CACHE_CTL	(_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)
  
  #define pgprot_noncached pgprot_noncached

  static inline pgprot_t pgprot_noncached(pgprot_t prot)





Re: use generic DMA mapping code in powerpc V4

2019-01-29 Thread Christian Zigotzky

Hi Christoph,

I compiled kernels for the X5000 and X1000 from your new branch 
'powerpc-dma.6-debug.2' today. The kernels boot and the P.A. Semi 
Ethernet works!


Cheers,
Christian


On 28 January 2019 at 5:52PM, Christian Zigotzky wrote:

Thanks a lot! I will test it tomorrow.

— Christian

Sent from my iPhone


On 28. Jan 2019, at 17:22, Christoph Hellwig  wrote:


On Mon, Jan 28, 2019 at 08:04:22AM +0100, Christoph Hellwig wrote:

On Sun, Jan 27, 2019 at 02:13:09PM +0100, Christian Zigotzky wrote:
Christoph,

What shall I do next?

I'll need to figure out what went wrong with the new zone selection
on powerpc and give you another branch to test.

Can you try the new powerpc-dma.6-debug.2 branch:

git://git.infradead.org/users/hch/misc.git powerpc-dma.6-debug.2

Gitweb:


http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6-debug.2






Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-01-29 Thread Cédric Le Goater
>>> Another general comment is that you seem to have written all this
>>> code assuming we are using HV KVM in a host running bare-metal.
>>
>> Yes. I didn't look at the other configurations. I thought that we could
>> use the kernel_irqchip=off option to begin with. A couple of checks
>> are indeed missing.
> 
> Using kernel_irqchip=off would mean that we would not be able to use
> the in-kernel XICS emulation, which would have a performance impact.

yes. But it is not supported today. Correct ? 

> We need an explicit capability for XIVE exploitation that can be
> enabled or disabled on the qemu command line, so that we can enforce a
> uniform set of capabilities across all the hosts in a migration
> domain.  And it's no good to say we have the capability when all
> attempts to use it will fail.  Therefore the kernel needs to say that
> it doesn't have the capability in a PR KVM guest or in a nested HV
> guest.

OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
for future use.

>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>> guest), or we could be doing nested HV KVM where we are using the
>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>> controlling our guests.
>>
>> Yes. 
>>
>> It would be good to talk a little about the nested support (offline
>> maybe) to make sure that we are not missing some major interface that 
>> would require a lot of change. If we need to prepare ground, I think
>> the timing is good.
>>
>> The size of the IRQ number space might be a problem. It seems we 
>> would need to increase it considerably to support multiple nested 
>> guests. That said, I haven't looked much at how nested is designed.
> 
> The current design of nested HV is that the entire non-volatile state
> of all the nested guests is encapsulated within the state and
> resources of the L1 hypervisor.  That means that if the L1 hypervisor
> gets migrated, all of its guests go across inside it and there is no
> extra state that L0 needs to be aware of.  That would imply that the
> VP number space for the nested guests would need to come from within
> the VP number space for L1; but the amount of VP space we allocate to
> each guest doesn't seem to be large enough for that to be practical.

If the KVM XIVE device had some information on the max number of CPUs 
provisioned for the guest, we could optimize the VP allocation.

That might be a larger KVM topic though. There are some static limits 
on the number of CPUs in QEMU and in KVM, which have no relation AFAICT. 

Thanks,

C.


Re: [PATCH v2] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Arnaldo Carvalho de Melo
Em Tue, Jan 29, 2019 at 02:42:36PM +0100, Jiri Olsa escreveu:
> On Tue, Jan 29, 2019 at 06:54:12PM +0530, Ravi Bangoria wrote:
> > Powerpc hw does not have inbuilt latency filter (--ldlat) for mem-load
> > event and perf_mem_events by default includes ldlat=30, which is
> > causing failure on powerpc. Refactor code to support perf mem/c2c on
> > powerpc.
> > 
> > This patch depends on kernel side changes done by Madhavan:
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-December/182596.html
> > 
> > Signed-off-by: Ravi Bangoria 
> > ---
> 
> Acked-by: Jiri Olsa 

Applied to perf/urgent, as soon as the kernel bits are there tooling
will be ready.

- Arnaldo
 
> thanks,
> jirka
> 
> >  tools/perf/Documentation/perf-c2c.txt | 16 
> >  tools/perf/Documentation/perf-mem.txt |  2 +-
> >  tools/perf/arch/powerpc/util/Build|  1 +
> >  tools/perf/arch/powerpc/util/mem-events.c | 11 +++
> >  tools/perf/util/mem-events.c  |  2 +-
> >  5 files changed, 26 insertions(+), 6 deletions(-)
> >  create mode 100644 tools/perf/arch/powerpc/util/mem-events.c
> > 
> > diff --git a/tools/perf/Documentation/perf-c2c.txt 
> > b/tools/perf/Documentation/perf-c2c.txt
> > index 095aebd..e6150f2 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -19,8 +19,11 @@ C2C stands for Cache To Cache.
> >  The perf c2c tool provides means for Shared Data C2C/HITM analysis. It 
> > allows
> >  you to track down the cacheline contentions.
> >  
> > -The tool is based on x86's load latency and precise store facility events
> > -provided by Intel CPUs. These events provide:
> > +On x86, the tool is based on load latency and precise store facility events
> > +provided by Intel CPUs. On PowerPC, the tool uses random instruction 
> > sampling
> > +with thresholding feature.
> > +
> > +These events provide:
> >- memory address of the access
> >- type of the access (load and store details)
> >- latency (in cycles) of the load access
> > @@ -46,7 +49,7 @@ RECORD OPTIONS
> >  
> >  -l::
> >  --ldlat::
> > -   Configure mem-loads latency.
> > +   Configure mem-loads latency. (x86 only)
> >  
> >  -k::
> >  --all-kernel::
> > @@ -119,11 +122,16 @@ Following perf record options are configured by 
> > default:
> >-W,-d,--phys-data,--sample-cpu
> >  
> >  Unless specified otherwise with '-e' option, following events are 
> > monitored by
> > -default:
> > +default on x86:
> >  
> >cpu/mem-loads,ldlat=30/P
> >cpu/mem-stores/P
> >  
> > +and following on PowerPC:
> > +
> > +  cpu/mem-loads/
> > +  cpu/mem-stores/
> > +
> >  User can pass any 'perf record' option behind '--' mark, like (to enable
> >  callchains and system wide monitoring):
> >  
> > diff --git a/tools/perf/Documentation/perf-mem.txt 
> > b/tools/perf/Documentation/perf-mem.txt
> > index f8d2167..199ea0f 100644
> > --- a/tools/perf/Documentation/perf-mem.txt
> > +++ b/tools/perf/Documentation/perf-mem.txt
> > @@ -82,7 +82,7 @@ RECORD OPTIONS
> > Be more verbose (show counter open errors, etc)
> >  
> >  --ldlat ::
> > -   Specify desired latency for loads event.
> > +   Specify desired latency for loads event. (x86 only)
> >  
> >  In addition, for report all perf report options are valid, and for record
> >  all perf record options.
> > diff --git a/tools/perf/arch/powerpc/util/Build 
> > b/tools/perf/arch/powerpc/util/Build
> > index 2e659531..ba98bd0 100644
> > --- a/tools/perf/arch/powerpc/util/Build
> > +++ b/tools/perf/arch/powerpc/util/Build
> > @@ -2,6 +2,7 @@ libperf-y += header.o
> >  libperf-y += sym-handling.o
> >  libperf-y += kvm-stat.o
> >  libperf-y += perf_regs.o
> > +libperf-y += mem-events.o
> >  
> >  libperf-$(CONFIG_DWARF) += dwarf-regs.o
> >  libperf-$(CONFIG_DWARF) += skip-callchain-idx.o
> > diff --git a/tools/perf/arch/powerpc/util/mem-events.c 
> > b/tools/perf/arch/powerpc/util/mem-events.c
> > new file mode 100644
> > index 000..f1194fc
> > --- /dev/null
> > +++ b/tools/perf/arch/powerpc/util/mem-events.c
> > @@ -0,0 +1,11 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "mem-events.h"
> > +
> > +/* PowerPC does not support 'ldlat' parameter. */
> > +char *perf_mem_events__name(int i)
> > +{
> > +   if (i == PERF_MEM_EVENTS__LOAD)
> > +   return (char *) "cpu/mem-loads/";
> > +
> > +   return (char *) "cpu/mem-stores/";
> > +}
> > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > index 93f74d8..42c3e5a 100644
> > --- a/tools/perf/util/mem-events.c
> > +++ b/tools/perf/util/mem-events.c
> > @@ -28,7 +28,7 @@ struct perf_mem_event 
> > perf_mem_events[PERF_MEM_EVENTS__MAX] = {
> >  static char mem_loads_name[100];
> >  static bool mem_loads_name__init;
> >  
> > -char *perf_mem_events__name(int i)
> > +char * __weak perf_mem_events__name(int i)
> >  {
> > if (i == PERF_MEM_EVENTS__LOAD) {
> > if (!mem_loads_name__init) {
> > -- 
> > 1.8.3.1
> > 

-- 

- Arnaldo


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-29 Thread Cédric Le Goater
On 1/29/19 3:45 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>> Would we end up with too many VMAs if we just used mmap() to
>>> change the mappings from the software-generated pages to the
>>> hardware-generated interrupt pages?  
>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> 
> Confused.  You say the number space has 32768 entries but then imply
> there are only 8K entries.  Do you mean that the API allows for 15-bit
> IRQ numbers but we are only making using of 8192 of them?

Ouch. My bad. Let's do it again. 

The sPAPR IRQ number space is 0x2000 wide :

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396

The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
devices (which can be extended if needed).

So that's 8192 x 2 ESB pages.
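
As a back-of-the-envelope check of the cost (the 64K ESB page size below
is an assumption for illustration, not something stated above), mapping
each ESB page with its own mmap() would mean 16384 separate mappings:

#include <stdio.h>

int main(void)
{
	unsigned long nr_irqs   = 0x2000;      /* sPAPR guest IRQ space */
	unsigned long esb_pages = nr_irqs * 2; /* trigger + EOI per IRQ */
	unsigned long page_size = 0x10000;     /* assuming 64K ESB pages */

	/* prints: 16384 ESB pages, 1024 MB of address space */
	printf("%lu ESB pages, %lu MB of address space\n",
	       esb_pages, esb_pages * page_size >> 20);
	return 0;
}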

>> extend the last range if needed as these are for MSIs. Dynamic 
>> extensions under KVM should work also.
>>
>> This is to say that we have 8K x 2 (trigger+EOI) pages. That is a
>> lot of mmap() calls, too many. Also the KVM model needs to be compatible
> 
> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> space would be covered by a single mmap, overlaid by subsequent mmaps
> where we need to map real device interrupts.

ok. The same fault handler could be used to populate the VMA with the 
ESB pages. 

But it would mean extra work on the QEMU side, which is not needed 
with this patch. 

>> with the QEMU emulated one and it was simpler to have one overall
>> memory region for the IPI ESBs, one for the END ESBs (if we support
>> that one day) and one for the TIMA.
>>
>>> Are the necessary pages for a PCI
>>> passthrough device contiguous in both host real space 
>>
>> They should be, as they are the PHB4 ESBs.
>>
>>> and guest real space ? 
>>
>> also. That's how we organized the mapping. 
> 
> "How we organized the mapping" is a significant design decision that I
> haven't seen documented anywhere, and is really needed for
> understanding what's going on.

OK. I will add comments on that. See below for some description.

There is nothing fancy, it's simply indexed with the interrupt number,
like for HW, or for the QEMU XIVE emulated model.

>>> If so we'd only need one mmap() for all the device's interrupt
>>> pages.
>>
>> Ah. So we would want to make a special case for the passthrough 
>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>
>> Wouldn't that require to hot plug a memory region under the guest ? 
> 
> No; the way that a memory region works is that userspace can do
> whatever disparate mappings it likes within the region on the user
> process side, and the corresponding region of guest real address space
> follows that automatically.

yes. I suppose this should work also for 'ram device' memory mappings.

So when the passthrough device is added to the guest, we would add a 
new 'ram device' memory region for the device interrupt ESB pages 
that would overlap with the initial guest ESB pages.  

This is really a different approach.
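
For what it's worth, a rough sketch of what that could look like on the
QEMU side. memory_region_init_ram_device_ptr() and
memory_region_add_subregion_overlap() are the real QEMU memory API; the
variable names, offsets and priority are assumptions, not code from this
series:

/* Illustrative only: overlay the passthrough device's ESB pages on
 * top of the initial guest ESB region when the device is plugged. */
MemoryRegion dev_esb_mr;

memory_region_init_ram_device_ptr(&dev_esb_mr, OBJECT(dev), "dev-esb",
                                  esb_len, host_esb_mmap);
memory_region_add_subregion_overlap(&esb_mmio, irq_base_offset,
                                    &dev_esb_mr, 1 /* higher prio wins */);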

>> which means that we need to provision an address space/container 
>> region for these regions. What are the benefits? 
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Explicitly unmapping parts of the VMA seems like the wrong way to do
> it.  If you have a device mmap where the device wants to change the
> physical page underlying parts of the mapping, there should be a way
> for it to do that explicitly (but I don't know off the top of my head
> what the interface to do that is).
> 
> However, I still haven't seen a clear and concise explanation of what
> is being changed, when, and why we need to do that.

Yes. I agree on that. The problem is not very different from what we 
have today with the XICS-over-XIVE glue in KVM. Let me try to explain.


The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
implement an IRQ space for the guest using the generic IPI interrupts 
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and 
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information 
on the interrupts, among which the xive_irq_data struct containing the 
addresses of the IPI ESB pages and an extra one in case of passthrough. 
The later contains the addresses of the ESB pages of the underlying HW 
controller interrupts, PHB4 in all cases for now.

A guest when running in the XICS legacy interrupt mode lets the KVM 
XICS-over-XIVE device "handle" interrupt management, that is to perform  
the loads and stores on the addresses of the ESB pages of the guest 

Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-29 Thread Cédric Le Goater
On 1/28/19 5:43 AM, Paul Mackerras wrote:
> On Thu, Jan 24, 2019 at 08:25:15AM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
 Afaik bcs we change the mapping to point to the real HW irq ESB page
 instead of the "IPI" that was there at VM init time.
>>>
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> The main problem is understanding XIVE itself. It's not realistic to
>> ask Cedric to write a proper documentation for XIVE as part of the
>> patch series, but sadly IBM doesn't have a good one to provide either.
> 
> There are: (a) the XIVE hardware, (b) the definition of the XIVE
> hypercalls that guests use, and (c) the design decisions around how to
> implement that hypercall interface.  We need to get (b) published
> somehow, but it is mostly (c) that I would expect the patch
> descriptions to explain.
> 
> It sounds like there will be a mapping to userspace where the pages
> can sometimes point to an IPI page and sometimes point to a real HW
> irq ESB page.  

Just to be clear. In both cases, these pages are real HW ESB pages. 
They are just attached to a different controller : the XIVE IC for 
the IPIs and the PHB4 for the others. 

> That is, the same guest "hardware" irq number sometimes
> refers to a software-generated interrupt (what you called an "IPI"
> above) and sometimes to a hardware-generated interrupt.  That fact, the
> reason why it is so and the consequences all need to be explained
> somewhere.  They are really not obvious and I don't believe they are
> part of either the XIVE hardware spec or the XIVE hypercall spec.

I tried to put the reasons behind the current approach in another 
thread, not saying this is the correct one.

Thanks,

C.



Re: [PATCH v2] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Jiri Olsa
On Tue, Jan 29, 2019 at 06:54:12PM +0530, Ravi Bangoria wrote:
> Powerpc hw does not have inbuilt latency filter (--ldlat) for mem-load
> event and perf_mem_events by default includes ldlat=30, which is
> causing failure on powerpc. Refactor code to support perf mem/c2c on
> powerpc.
> 
> This patch depends on kernel side changes done by Madhavan:
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-December/182596.html
> 
> Signed-off-by: Ravi Bangoria 
> ---

Acked-by: Jiri Olsa 

thanks,
jirka

>  tools/perf/Documentation/perf-c2c.txt | 16 
>  tools/perf/Documentation/perf-mem.txt |  2 +-
>  tools/perf/arch/powerpc/util/Build|  1 +
>  tools/perf/arch/powerpc/util/mem-events.c | 11 +++
>  tools/perf/util/mem-events.c  |  2 +-
>  5 files changed, 26 insertions(+), 6 deletions(-)
>  create mode 100644 tools/perf/arch/powerpc/util/mem-events.c
> 
> diff --git a/tools/perf/Documentation/perf-c2c.txt 
> b/tools/perf/Documentation/perf-c2c.txt
> index 095aebd..e6150f2 100644
> --- a/tools/perf/Documentation/perf-c2c.txt
> +++ b/tools/perf/Documentation/perf-c2c.txt
> @@ -19,8 +19,11 @@ C2C stands for Cache To Cache.
>  The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
>  you to track down the cacheline contentions.
>  
> -The tool is based on x86's load latency and precise store facility events
> -provided by Intel CPUs. These events provide:
> +On x86, the tool is based on load latency and precise store facility events
> +provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
> +with thresholding feature.
> +
> +These events provide:
>- memory address of the access
>- type of the access (load and store details)
>- latency (in cycles) of the load access
> @@ -46,7 +49,7 @@ RECORD OPTIONS
>  
>  -l::
>  --ldlat::
> - Configure mem-loads latency.
> + Configure mem-loads latency. (x86 only)
>  
>  -k::
>  --all-kernel::
> @@ -119,11 +122,16 @@ Following perf record options are configured by default:
>-W,-d,--phys-data,--sample-cpu
>  
>  Unless specified otherwise with '-e' option, following events are monitored 
> by
> -default:
> +default on x86:
>  
>cpu/mem-loads,ldlat=30/P
>cpu/mem-stores/P
>  
> +and following on PowerPC:
> +
> +  cpu/mem-loads/
> +  cpu/mem-stores/
> +
>  User can pass any 'perf record' option behind '--' mark, like (to enable
>  callchains and system wide monitoring):
>  
> diff --git a/tools/perf/Documentation/perf-mem.txt 
> b/tools/perf/Documentation/perf-mem.txt
> index f8d2167..199ea0f 100644
> --- a/tools/perf/Documentation/perf-mem.txt
> +++ b/tools/perf/Documentation/perf-mem.txt
> @@ -82,7 +82,7 @@ RECORD OPTIONS
>   Be more verbose (show counter open errors, etc)
>  
>  --ldlat ::
> - Specify desired latency for loads event.
> + Specify desired latency for loads event. (x86 only)
>  
>  In addition, for report all perf report options are valid, and for record
>  all perf record options.
> diff --git a/tools/perf/arch/powerpc/util/Build 
> b/tools/perf/arch/powerpc/util/Build
> index 2e659531..ba98bd0 100644
> --- a/tools/perf/arch/powerpc/util/Build
> +++ b/tools/perf/arch/powerpc/util/Build
> @@ -2,6 +2,7 @@ libperf-y += header.o
>  libperf-y += sym-handling.o
>  libperf-y += kvm-stat.o
>  libperf-y += perf_regs.o
> +libperf-y += mem-events.o
>  
>  libperf-$(CONFIG_DWARF) += dwarf-regs.o
>  libperf-$(CONFIG_DWARF) += skip-callchain-idx.o
> diff --git a/tools/perf/arch/powerpc/util/mem-events.c 
> b/tools/perf/arch/powerpc/util/mem-events.c
> new file mode 100644
> index 000..f1194fc
> --- /dev/null
> +++ b/tools/perf/arch/powerpc/util/mem-events.c
> @@ -0,0 +1,11 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "mem-events.h"
> +
> +/* PowerPC does not support 'ldlat' parameter. */
> +char *perf_mem_events__name(int i)
> +{
> + if (i == PERF_MEM_EVENTS__LOAD)
> + return (char *) "cpu/mem-loads/";
> +
> + return (char *) "cpu/mem-stores/";
> +}
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index 93f74d8..42c3e5a 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -28,7 +28,7 @@ struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] 
> = {
>  static char mem_loads_name[100];
>  static bool mem_loads_name__init;
>  
> -char *perf_mem_events__name(int i)
> +char * __weak perf_mem_events__name(int i)
>  {
>   if (i == PERF_MEM_EVENTS__LOAD) {
>   if (!mem_loads_name__init) {
> -- 
> 1.8.3.1
> 


[PATCH v2] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Ravi Bangoria
Powerpc hw does not have inbuilt latency filter (--ldlat) for mem-load
event and perf_mem_events by default includes ldlat=30, which is
causing failure on powerpc. Refactor code to support perf mem/c2c on
powerpc.

This patch depends on kernel side changes done by Madhavan:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-December/182596.html

Signed-off-by: Ravi Bangoria 
---
 tools/perf/Documentation/perf-c2c.txt | 16 
 tools/perf/Documentation/perf-mem.txt |  2 +-
 tools/perf/arch/powerpc/util/Build|  1 +
 tools/perf/arch/powerpc/util/mem-events.c | 11 +++
 tools/perf/util/mem-events.c  |  2 +-
 5 files changed, 26 insertions(+), 6 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/mem-events.c

diff --git a/tools/perf/Documentation/perf-c2c.txt 
b/tools/perf/Documentation/perf-c2c.txt
index 095aebd..e6150f2 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -19,8 +19,11 @@ C2C stands for Cache To Cache.
 The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
 you to track down the cacheline contentions.
 
-The tool is based on x86's load latency and precise store facility events
-provided by Intel CPUs. These events provide:
+On x86, the tool is based on load latency and precise store facility events
+provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
+with thresholding feature.
+
+These events provide:
   - memory address of the access
   - type of the access (load and store details)
   - latency (in cycles) of the load access
@@ -46,7 +49,7 @@ RECORD OPTIONS
 
 -l::
 --ldlat::
-   Configure mem-loads latency.
+   Configure mem-loads latency. (x86 only)
 
 -k::
 --all-kernel::
@@ -119,11 +122,16 @@ Following perf record options are configured by default:
   -W,-d,--phys-data,--sample-cpu
 
 Unless specified otherwise with '-e' option, following events are monitored by
-default:
+default on x86:
 
   cpu/mem-loads,ldlat=30/P
   cpu/mem-stores/P
 
+and following on PowerPC:
+
+  cpu/mem-loads/
+  cpu/mem-stores/
+
 User can pass any 'perf record' option behind '--' mark, like (to enable
 callchains and system wide monitoring):
 
diff --git a/tools/perf/Documentation/perf-mem.txt 
b/tools/perf/Documentation/perf-mem.txt
index f8d2167..199ea0f 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -82,7 +82,7 @@ RECORD OPTIONS
Be more verbose (show counter open errors, etc)
 
 --ldlat ::
-   Specify desired latency for loads event.
+   Specify desired latency for loads event. (x86 only)
 
 In addition, for report all perf report options are valid, and for record
 all perf record options.
diff --git a/tools/perf/arch/powerpc/util/Build 
b/tools/perf/arch/powerpc/util/Build
index 2e659531..ba98bd0 100644
--- a/tools/perf/arch/powerpc/util/Build
+++ b/tools/perf/arch/powerpc/util/Build
@@ -2,6 +2,7 @@ libperf-y += header.o
 libperf-y += sym-handling.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += mem-events.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_DWARF) += skip-callchain-idx.o
diff --git a/tools/perf/arch/powerpc/util/mem-events.c 
b/tools/perf/arch/powerpc/util/mem-events.c
new file mode 100644
index 000..f1194fc
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/mem-events.c
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "mem-events.h"
+
+/* PowerPC does not support 'ldlat' parameter. */
+char *perf_mem_events__name(int i)
+{
+   if (i == PERF_MEM_EVENTS__LOAD)
+   return (char *) "cpu/mem-loads/";
+
+   return (char *) "cpu/mem-stores/";
+}
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 93f74d8..42c3e5a 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -28,7 +28,7 @@ struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = 
{
 static char mem_loads_name[100];
 static bool mem_loads_name__init;
 
-char *perf_mem_events__name(int i)
+char * __weak perf_mem_events__name(int i)
 {
if (i == PERF_MEM_EVENTS__LOAD) {
if (!mem_loads_name__init) {
-- 
1.8.3.1
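
The refactoring relies on weak-symbol linkage: the generic
perf_mem_events__name() above is marked __weak, so the strong definition
in the powerpc object file replaces it at link time. A toy illustration
of the mechanism (a made-up example, not perf code):

/* generic.c - default, overridable definition */
#include <stdio.h>

char * __attribute__((weak)) event_name(void)
{
	return (char *) "cpu/mem-loads,ldlat=30/P"; /* generic default */
}

int main(void)
{
	puts(event_name()); /* which definition ran? */
	return 0;
}

/* arch.c - strong definition; if linked in, it wins:
 *
 *     char *event_name(void) { return (char *) "cpu/mem-loads/"; }
 *
 * cc generic.c        -> prints the generic ldlat default
 * cc generic.c arch.c -> prints "cpu/mem-loads/"
 */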



Re: [PATCHv5 1/4] dt-bindings: add DT binding for the layerscape PCIe controller with EP mode

2019-01-29 Thread Lorenzo Pieralisi
Rob,

Is it OK for you if I pull this series into the pci tree ?

Please let me know, thanks.

Lorenzo

On Mon, Jan 21, 2019 at 05:44:57PM +0800, Xiaowei Bao wrote:
> Add the documentation for the Device Tree binding for the layerscape PCIe
> controller with EP mode.
> 
> Signed-off-by: Xiaowei Bao 
> Reviewed-by: Minghuan Lian 
> Reviewed-by: Zhiqiang Hou 
> ---
> v2:
>  - Add the SoC specific compatibles.
> v3:
>  - modify the commit message.
> v4:
>  - no change.
> v5:
>  - no change.
> 
>  .../devicetree/bindings/pci/layerscape-pci.txt |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt 
> b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> index 9b2b8d6..e20ceaa 100644
> --- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> +++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
> @@ -13,6 +13,7 @@ information.
>  
>  Required properties:
>  - compatible: should contain the platform identifier such as:
> +  RC mode:
>  "fsl,ls1021a-pcie"
>  "fsl,ls2080a-pcie", "fsl,ls2085a-pcie"
>  "fsl,ls2088a-pcie"
> @@ -20,6 +21,8 @@ Required properties:
>  "fsl,ls1046a-pcie"
>  "fsl,ls1043a-pcie"
>  "fsl,ls1012a-pcie"
> +  EP mode:
> + "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"
>  - reg: base addresses and lengths of the PCIe controller register blocks.
>  - interrupts: A list of interrupt outputs of the controller. Must contain an
>entry for each entry in the interrupt-names property.
> -- 
> 1.7.1
> 


Re: [PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-29 Thread Vaibhav Jain


Andrew Donnellan  writes:

> On 29/1/19 5:15 pm, Vaibhav Jain wrote:
>> Within the cxl module, iteration over the array 'adapter->slices' may be racy
>
> adapter->slices isn't an array, adapter->afu is the array.
Thanks for catching this. Have fixed the patch description in the resent patch.

>
> Does this need to go to stable? (I'm guessing we've been hitting actual 
> Oopses?)
Re-sent the patch marked for stable.

> Acked-by: Andrew Donnellan 
Thanks !!

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



[RESEND PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-29 Thread Vaibhav Jain
Within the cxl module, iteration over the array 'adapter->afu' may be racy
at a few points as it might be simultaneously read during an EEH and its
contents being set to NULL while driver is being unloaded or unbound
from the adapter. This might result in a NULL pointer to 'struct afu'
being de-referenced during an EEH thereby causing a kernel oops.

This patch fixes this by making sure that all access to the array
'adapter->afu' is wrapped within the context of spin-lock
'adapter->afu_list_lock'.

Cc: sta...@vger.kernel.org
Fixes: 9e8df8a2196 ("cxl: EEH support")
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Acked-by: Christophe Lombard 
Signed-off-by: Vaibhav Jain 
---
Changelog:

Resend:
* Fixed the reference to 'adapter->afu' in patch description. [Andrew]
* Added the 'Fixes' tag and marked the patch to stable

v3:
* Updated a slice loop in cxl_pci_error_detected() to ignore NULL
  slices [Fred]
* Added a NULL AFU check in cxl_pci_slot_reset() [Fred]

v2:
* Fixed a wrong comparison of non-null pointer [Fred]
* Moved a call to cxl_vphb_error_detected() within a branch that
  checks for not null AFU pointer in 'adapter->slices' [Fred]
* Removed a misleading comment in code.
---
 drivers/misc/cxl/guest.c |  2 ++
 drivers/misc/cxl/pci.c   | 39 ++-
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 5d28d9e454f5..08f4a512afad 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -267,6 +267,7 @@ static int guest_reset(struct cxl *adapter)
int i, rc;
 
pr_devel("Adapter reset request\n");
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
if ((afu = adapter->afu[i])) {
pci_error_handlers(afu, CXL_ERROR_DETECTED_EVENT,
@@ -283,6 +284,7 @@ static int guest_reset(struct cxl *adapter)
pci_error_handlers(afu, CXL_RESUME_EVENT, 0);
}
}
+   spin_unlock(&adapter->afu_list_lock);
return rc;
 }
 
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index c79ba1c699ad..300531d6136f 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1805,7 +1805,7 @@ static pci_ers_result_t cxl_vphb_error_detected(struct 
cxl_afu *afu,
/* There should only be one entry, but go through the list
 * anyway
 */
-   if (afu->phb == NULL)
+   if (afu == NULL || afu->phb == NULL)
return result;
 
	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
@@ -1832,7 +1832,8 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 {
struct cxl *adapter = pci_get_drvdata(pdev);
struct cxl_afu *afu;
-   pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET, afu_result;
+   pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET;
+   pci_ers_result_t afu_result = PCI_ERS_RESULT_NEED_RESET;
int i;
 
/* At this point, we could still have an interrupt pending.
@@ -1843,6 +1844,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 
/* If we're permanently dead, give up. */
if (state == pci_channel_io_perm_failure) {
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
/*
@@ -1851,6 +1853,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 */
cxl_vphb_error_detected(afu, state);
}
+   spin_unlock(&adapter->afu_list_lock);
return PCI_ERS_RESULT_DISCONNECT;
}
 
@@ -1932,11 +1935,17 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 * * In slot_reset, free the old resources and allocate new ones.
 * * In resume, clear the flag to allow things to start.
 */
+
+   /* Make sure no one else changes the afu list */
+   spin_lock(&adapter->afu_list_lock);
+
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
 
-   afu_result = cxl_vphb_error_detected(afu, state);
+   if (afu == NULL)
+   continue;
 
+   afu_result = cxl_vphb_error_detected(afu, state);
cxl_context_detach_all(afu);
cxl_ops->afu_deactivate_mode(afu, afu->current_mode);
pci_deconfigure_afu(afu);
@@ -1948,6 +1957,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 (result == PCI_ERS_RESULT_NEED_RESET))
result = PCI_ERS_RESULT_NONE;
}
+   spin_unlock(&adapter->afu_list_lock);
 
/* should take the context lock here */
if (cxl_adapter_context_lock(adapter) != 0)
@@ -1980,14 +1990,18 @@ static pci_ers_result_t cxl_pci_slot_reset(struct 
pci_dev *pdev)
 */

Re: [PATCH V5 0/5] NestMMU pte upgrade workaround for mprotect

2019-01-29 Thread Aneesh Kumar K.V


Andrew,

How do you want to merge this? Michael Ellerman suggests this should go
via -mm tree.

-aneesh


"Aneesh Kumar K.V"  writes:

> We can upgrade pte access (R -> RW transition) via mprotect. We need
> to make sure we follow the recommended pte update sequence as outlined in
> commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle 
> nest MMU hang")
> for such updates. This patch series does that.
>
> Changes from V4:
> * Drop EXPORT_SYMBOL 
>
> Changes from V3:
> * Build fix for x86
>
> Changes from V2:
> * Update commit message for patch 4
> * use radix tlb flush routines directly.
>
> Changes from V1:
> * Restrict ths only for R->RW upgrade. We don't need to do this for Autonuma
> * Restrict this only for radix translation mode.
>
>
> Aneesh Kumar K.V (5):
>   mm: Update ptep_modify_prot_start/commit to take vm_area_struct as arg
>   mm: update ptep_modify_prot_commit to take old pte value as arg
>   arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.
>   mm/hugetlb: Add prot_modify_start/commit sequence for hugetlb update
>   arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW
> upgrade
>
>  arch/powerpc/include/asm/book3s/64/hugetlb.h | 12 ++
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 18 ++
>  arch/powerpc/include/asm/book3s/64/radix.h   |  4 
>  arch/powerpc/mm/hugetlbpage-hash64.c | 25 
>  arch/powerpc/mm/hugetlbpage-radix.c  | 17 +
>  arch/powerpc/mm/pgtable-book3s64.c   | 25 
>  arch/powerpc/mm/pgtable-radix.c  | 18 ++
>  arch/s390/include/asm/pgtable.h  |  5 ++--
>  arch/s390/mm/pgtable.c   |  8 ---
>  arch/x86/include/asm/paravirt.h  | 13 +-
>  arch/x86/include/asm/paravirt_types.h|  5 ++--
>  arch/x86/xen/mmu.h   |  4 ++--
>  arch/x86/xen/mmu_pv.c|  8 +++
>  fs/proc/task_mmu.c   |  8 ---
>  include/asm-generic/pgtable.h| 18 +++---
>  include/linux/hugetlb.h  | 20 
>  mm/hugetlb.c |  8 ---
>  mm/memory.c  |  8 +++
>  mm/mprotect.c|  6 ++---
>  19 files changed, 189 insertions(+), 41 deletions(-)
>
> -- 
> 2.20.1
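
For reference, the sequence the series implements for the R->RW upgrade
is roughly the following sketch. The two helpers carry the post-series
signatures from patches 1-2, the flush placement follows commit
bd5050e38aec, and locking and error handling are elided:

	pte_t old_pte, new_pte;

	/* invalidate the PTE first */
	old_pte = ptep_modify_prot_start(vma, addr, ptep);
	/* flush so the nest MMU cannot keep using the stale entry */
	radix__flush_tlb_page(vma, addr);
	/* only then install the writable version */
	new_pte = pte_mkwrite(old_pte);
	ptep_modify_prot_commit(vma, addr, ptep, old_pte, new_pte);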



Re: [PATCH] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Ravi Bangoria


On 1/29/19 3:23 PM, Arnaldo Carvalho de Melo wrote:
> I think its just a tooling side, I haven't processed it because I'm
> waiting for Ravi to address Jiri's comment, after that I'm happy to put
> it in my perf/urgent branch that I'm brewing to push to Ingo today or
> tomorrow.

Ah.. Will try to send v2 today.

Thanks.



Re: [PATCH v2 09/21] memblock: drop memblock_alloc_base()

2019-01-29 Thread Michael Ellerman
Mike Rapoport  writes:

> The memblock_alloc_base() function tries to allocate memory up to the
> limit specified by its max_addr parameter and panics if the allocation
> fails. Replace its usage with memblock_phys_alloc_range() and make the
> callers check the return value and panic in case of error.
>
> Signed-off-by: Mike Rapoport 
> ---
>  arch/powerpc/kernel/rtas.c  |  6 +-
>  arch/powerpc/mm/hash_utils_64.c |  8 ++--
>  arch/s390/kernel/smp.c  |  6 +-
>  drivers/macintosh/smu.c |  2 +-
>  include/linux/memblock.h|  2 --
>  mm/memblock.c   | 14 --
>  6 files changed, 17 insertions(+), 21 deletions(-)
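
The resulting caller pattern is, schematically (a sketch; each call site
keeps its own panic message):

	phys_addr_t pa;

	/* memblock_alloc_base() used to panic internally on failure;
	 * now the caller checks the result and panics itself. */
	pa = memblock_phys_alloc_range(size, align, 0, max_addr);
	if (!pa)
		panic("cannot allocate %zu bytes below %pa\n",
		      size, &max_addr);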

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-29 Thread christophe lombard

On 29/01/2019 07:15, Vaibhav Jain wrote:

Within the cxl module, iteration over the array 'adapter->slices' may be racy
at a few points as it might be simultaneously read during an EEH and its
contents being set to NULL while driver is being unloaded or unbound
from the adapter. This might result in a NULL pointer to 'struct afu'
being de-referenced during an EEH thereby causing a kernel oops.

This patch fixes this by making sure that all access to the array
'adapter->slices' is wrapped within the context of spin-lock
'adapter->afu_list_lock'.

Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Updated a slice loop in cxl_pci_error_detected() to ignore NULL
   slices [Fred]
* Added a NULL AFU check in cxl_pci_slot_reset() [Fred]

v2:
* Fixed a wrong comparison of non-null pointer [Fred]
* Moved a call to cxl_vphb_error_detected() within a branch that
   checks for not null AFU pointer in 'adapter->slices' [Fred]
* Removed a misleading comment in code.
---


Thanks

Acked-by: Christophe Lombard



Re: [PATCH v2 06/21] memblock: memblock_phys_alloc_try_nid(): don't panic

2019-01-29 Thread Michael Ellerman
Michael Ellerman  writes:

> Mike Rapoport  writes:
>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index ae34e3a..2c61ea4 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -237,6 +237,10 @@ static void __init setup_node_data(int nid, u64 
>> start_pfn, u64 end_pfn)
>>  pr_info("Initmem setup node %d []\n", nid);
>>  
>>  nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>> +if (!nd_pa)
>> +panic("Cannot allocate %zu bytes for node %d data\n",
>> +  nd_size, nid);
>> +
>>  nd = __va(nd_pa);

Wrong hunk, O_o

> Acked-by: Michael Ellerman  (powerpc)

You know what I mean though :)

cheers


Re: [PATCH v2 06/21] memblock: memblock_phys_alloc_try_nid(): don't panic

2019-01-29 Thread Michael Ellerman
Mike Rapoport  writes:

> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index ae34e3a..2c61ea4 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -237,6 +237,10 @@ static void __init setup_node_data(int nid, u64 
> start_pfn, u64 end_pfn)
>   pr_info("Initmem setup node %d []\n", nid);
>  
>   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> + if (!nd_pa)
> + panic("Cannot allocate %zu bytes for node %d data\n",
> +   nd_size, nid);
> +
>   nd = __va(nd_pa);

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Arnaldo Carvalho de Melo
Em Tue, Jan 29, 2019 at 08:45:44PM +1100, Michael Ellerman escreveu:
> Ravi Bangoria  writes:
> 
> > On 1/14/19 9:44 AM, Ravi Bangoria wrote:
> >> Powerpc hw does not have inbuilt latency filter (--ldlat) for mem-load
> >> event and perf_mem_events by default includes ldlat=30, which is
> >> causing failure on powerpc. Refactor code to support perf mem/c2c on
> >> powerpc.
> >> 
> >> This patch depends on kernel side changes done by Madhavan:
> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-December/182596.html
> >> 
> >> Signed-off-by: Ravi Bangoria 
> >
> >
> > Arnaldo / Michael, Any thoughts?
> 
> I haven't merged the kernel patch, I think because Maddy told me not to
> because it would break the userspace tooling :)
> 
> What is the actual dependency between them? ie. should we merge the
> kernel fix first or second or what?

I think its just a tooling side, I haven't processed it because I'm
waiting for Ravi to address Jiri's comment, after that I'm happy to put
it in my perf/urgent branch that I'm brewing to push to Ingo today or
tomorrow.

- Arnaldo


Re: [PATCH v2 02/21] powerpc: use memblock functions returning virtual address

2019-01-29 Thread Michael Ellerman
Mike Rapoport  writes:

> From: Christophe Leroy 
>
> Since only the virtual address of allocated blocks is used,
> lets use functions returning directly virtual address.
>
> Those functions have the advantage of also zeroing the block.
>
> [ MR:
>  - updated error message in alloc_stack() to be more verbose
>  - convereted several additional call sites ]
>
> Signed-off-by: Christophe Leroy 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/powerpc/kernel/dt_cpu_ftrs.c |  3 +--
>  arch/powerpc/kernel/irq.c |  5 -
>  arch/powerpc/kernel/paca.c|  6 +-
>  arch/powerpc/kernel/prom.c|  5 -
>  arch/powerpc/kernel/setup_32.c| 26 --
>  5 files changed, 26 insertions(+), 19 deletions(-)

LGTM.

Acked-by: Michael Ellerman 

cheers


Re: [PATCH] perf mem/c2c: Fix perf_mem_events to support powerpc

2019-01-29 Thread Michael Ellerman
Ravi Bangoria  writes:

> On 1/14/19 9:44 AM, Ravi Bangoria wrote:
>> Powerpc hw does not have inbuilt latency filter (--ldlat) for mem-load
>> event and perf_mem_events by default includes ldlat=30, which is
>> causing failure on powerpc. Refactor code to support perf mem/c2c on
>> powerpc.
>> 
>> This patch depends on kernel side changes done by Madhavan:
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-December/182596.html
>> 
>> Signed-off-by: Ravi Bangoria 
>
>
> Arnaldo / Michael, Any thoughts?

I haven't merged the kernel patch, I think because Maddy told me not to
because it would break the userspace tooling :)

What is the actual dependency between them? ie. should we merge the
kernel fix first or second or what?

cheers


Re: [PATCH] powerpc/pseries: Perform full re-add of CPU for topology update post-migration

2019-01-29 Thread Michael Ellerman
Michael Bringmann  writes:

> On 10/29/18 1:43 PM, Nathan Fontenot wrote:
>> On pseries systems, performing a partition migration can result in
>> altering the nodes a CPU is assigned to on the destination system. For
>> example, pre-migration on the source system CPUs are in node 1 and 3,
>> post-migration on the destination system CPUs are in nodes 2 and 3.
>> 
>> Handling the node change for a CPU can cause corruption in the slab
>> cache if we hit a timing where a CPU's node is changed while cache_reap()
>> is invoked. The corruption occurs because the slab cache code appears
>> to rely on the CPU and slab cache pages being on the same node.
>> 
>> The current dynamic updating of a CPU's node done in arch/powerpc/mm/numa.c
>> does not prevent us from hitting this scenario.
>> 
>> Changing the device tree property update notification handler that
>> recognizes an affinity change for a CPU to do a full DLPAR remove and
>> add of the CPU instead of dynamically changing its node resolves this
>> issue.
>> 
>> Signed-off-by: Nathan Fontenot 
>> Signed-off-by: Michael W. Bringmann 

Are you sure that's what you meant? ie. you wrote some of the patch?

What I'd like is to get a Tested-by from you.

cheers


Re: [RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2019-01-29 Thread Michael Ellerman
Tyrel Datwyler  writes:
> On 12/14/2018 12:50 PM, Michael Bringmann wrote:
>> Define interface to acquire arch-specific drc info to match against
>> hotpluggable devices.  The current implementation exposes several
>> pseries-specific dynamic memory properties in generic kernel code.
>> This patch set provides an interface to pull that code out of the
>> generic kernel.
>> 
>> Signed-off-by: Michael Bringmann 
>> ---
>>  include/linux/topology.h |9 +
>>  1 file changed, 9 insertions(+)
>> 
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index cb0775e..df97f5f 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -44,6 +44,15 @@
>
> As far as I know pseries is the only platform that uses DR connectors, and I
> highly doubt that any other powerpc platform or arch ever will. So, I'm not 
> sure
> that this is really generic enough to belong in topology.h.

Right. This does not belong in include/linux.

> If anything I would
> suggest putting this in an include in arch/powerpc/include/ named something 
> like
> drcinfo.h or pseries-drc.h. That will make it visible to modules like rpaphp
> that want/need to use this functionality.

Yeah that would make more sense.

Using "arch" in the name is wrong, it's pseries specific so
pseries_find_drc_match() would be more appropriate.

>> +int arch_find_drc_match(struct device_node *dn,
>> +bool (*usercb)(struct device_node *dn,
>> +u32 drc_index, char *drc_name,
>> +char *drc_type, u32 drc_power_domain,
>> +void *data),
>> +char *opt_drc_type, char *opt_drc_name,
>> +bool match_drc_index, bool ck_php_type,
>> +void *data);

This function signature is kind of insane.

You end with calls like:

+   return arch_find_drc_match(dn, rpaphp_add_slot_cb,
+   NULL, NULL, false, true, NULL);

Which is impossible to parse.

I feel like maybe this isn't the right level of abstraction.

cheers


Re: [RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2019-01-29 Thread Michael Ellerman
Michael Bringmann  writes:
> On 1/25/19 10:09 AM, Michael Bringmann wrote:
>> Adding Nathan Lynch
>> 
>> On 1/24/19 6:04 PM, Tyrel Datwyler wrote:
>>> On 12/14/2018 12:50 PM, Michael Bringmann wrote:
 Define interface to acquire arch-specific drc info to match against
 hotpluggable devices.  The current implementation exposes several
 pseries-specific dynamic memory properties in generic kernel code.
 This patch set provides an interface to pull that code out of the
 generic kernel.

 Signed-off-by: Michael Bringmann 
 ---
  include/linux/topology.h |9 +
  1 file changed, 9 insertions(+)

 diff --git a/include/linux/topology.h b/include/linux/topology.h
 index cb0775e..df97f5f 100644
 --- a/include/linux/topology.h
 +++ b/include/linux/topology.h
 @@ -44,6 +44,15 @@
>>>
>>> As far as I know pseries is the only platform that uses DR connectors, and I
>>> highly doubt that any other powerpc platform or arch ever will. So, I'm not 
>>> sure
>>> that this is really generic enough to belong in topology.h. If anything I 
>>> would
>>> suggest putting this in an include in arch/powerpc/include/ named something 
>>> like
>>> drcinfo.h or pseries-drc.h. That will make it visible to modules like rpaphp
>>> that want/need to use this functionality.
>
> It looks like the 'rpaphp' and 'rpadlpar_io' modules are also dependent upon 
> the
> powerpc platform.

Yes that's right.

> Shouldn't the relevant source files be moved completely to the
> powerpc-specific directories out of drivers/pci/hotplug as well?

I don't think so. They are PCI hotplug drivers, so they should sit with
the other PCI hotplug drivers.

It's true that PCI hotplug drivers are more platform specific than other
types of drivers, but still they have some things in common with other
PCI hotplug drivers.

cheers



Re: [PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-29 Thread Frederic Barrat




On 29/01/2019 at 07:15, Vaibhav Jain wrote:

Within the cxl module, iteration over the array 'adapter->slices' may be racy
at a few points as it might be simultaneously read during an EEH and its
contents being set to NULL while driver is being unloaded or unbound
from the adapter. This might result in a NULL pointer to 'struct afu'
being de-referenced during an EEH thereby causing a kernel oops.

This patch fixes this by making sure that all access to the array
'adapter->slices' is wrapped within the context of spin-lock
'adapter->afu_list_lock'.

Signed-off-by: Vaibhav Jain 
---


Thanks for fixing this!

Acked-by: Frederic Barrat 




Changelog:

v3:
* Updated a slice loop in cxl_pci_error_detected() to ignore NULL
   slices [Fred]
* Added a NULL AFU check in cxl_pci_slot_reset() [Fred]

v2:
* Fixed a wrong comparison of non-null pointer [Fred]
* Moved a call to cxl_vphb_error_detected() within a branch that
   checks for not null AFU pointer in 'adapter->slices' [Fred]
* Removed a misleading comment in code.
---
  drivers/misc/cxl/guest.c |  2 ++
  drivers/misc/cxl/pci.c   | 39 ++-
  2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 5d28d9e454f5..08f4a512afad 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -267,6 +267,7 @@ static int guest_reset(struct cxl *adapter)
int i, rc;
  
  	pr_devel("Adapter reset request\n");

+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
if ((afu = adapter->afu[i])) {
pci_error_handlers(afu, CXL_ERROR_DETECTED_EVENT,
@@ -283,6 +284,7 @@ static int guest_reset(struct cxl *adapter)
pci_error_handlers(afu, CXL_RESUME_EVENT, 0);
}
}
+   spin_unlock(&adapter->afu_list_lock);
return rc;
  }
  
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c

index c79ba1c699ad..300531d6136f 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1805,7 +1805,7 @@ static pci_ers_result_t cxl_vphb_error_detected(struct 
cxl_afu *afu,
/* There should only be one entry, but go through the list
 * anyway
 */
-   if (afu->phb == NULL)
+   if (afu == NULL || afu->phb == NULL)
return result;
  
	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {

@@ -1832,7 +1832,8 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
  {
struct cxl *adapter = pci_get_drvdata(pdev);
struct cxl_afu *afu;
-   pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET, afu_result;
+   pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET;
+   pci_ers_result_t afu_result = PCI_ERS_RESULT_NEED_RESET;
int i;
  
  	/* At this point, we could still have an interrupt pending.

@@ -1843,6 +1844,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
  
  	/* If we're permanently dead, give up. */

if (state == pci_channel_io_perm_failure) {
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
/*
@@ -1851,6 +1853,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 */
cxl_vphb_error_detected(afu, state);
}
+   spin_unlock(&adapter->afu_list_lock);
return PCI_ERS_RESULT_DISCONNECT;
}
  
@@ -1932,11 +1935,17 @@ static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,

 * * In slot_reset, free the old resources and allocate new ones.
 * * In resume, clear the flag to allow things to start.
 */
+
+   /* Make sure no one else changes the afu list */
+   spin_lock(&adapter->afu_list_lock);
+
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
  
-		afu_result = cxl_vphb_error_detected(afu, state);

+   if (afu == NULL)
+   continue;
  
+		afu_result = cxl_vphb_error_detected(afu, state);

cxl_context_detach_all(afu);
cxl_ops->afu_deactivate_mode(afu, afu->current_mode);
pci_deconfigure_afu(afu);
@@ -1948,6 +1957,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 (result == PCI_ERS_RESULT_NEED_RESET))
result = PCI_ERS_RESULT_NONE;
}
+   spin_unlock(&adapter->afu_list_lock);
  
  	/* should take the context lock here */

if (cxl_adapter_context_lock(adapter) != 0)
@@ -1980,14 +1990,18 @@ static pci_ers_result_t cxl_pci_slot_reset(struct 
pci_dev *pdev)
 */
cxl_adapter_context_unlock(adapter);
  
+	spin_lock(&adapter->afu_list_lock);

for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
  
+		if (afu == NULL)

+

RE: [PATCH] ucc_geth: Reset BQL queue when stopping device

2019-01-29 Thread Mathias Thore
Is there a scenario where we are clearing the TX ring but don't want to reset 
the BQL TX queue?

I think it makes sense to keep it in ucc_geth_free_tx since the reason it is 
needed isn't the timeout per se, but rather the clearing of the TX ring. This 
way, it will be performed no matter why the driver ends up calling this 
function.
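
For context, the three BQL calls below are the real API from
linux/netdevice.h; the driver skeleton around them is illustrative only.
BQL tracks bytes in flight per queue, so if descriptors are dropped
without completions the counter never drains and the stack keeps the
queue throttled. That is why flushing the ring must be paired with
netdev_reset_queue():

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* ... place skb on the TX ring ... */
	netdev_sent_queue(dev, skb->len);  /* bytes now "in flight" */
	return NETDEV_TX_OK;
}

static void foo_tx_complete(struct net_device *dev, unsigned int pkts,
			    unsigned int bytes)
{
	/* ... reclaim completed descriptors ... */
	netdev_completed_queue(dev, pkts, bytes);
}

static void foo_free_tx_ring(struct net_device *dev)
{
	/* descriptors are dropped without completions, so zero the
	 * in-flight count or BQL keeps the queue throttled */
	netdev_reset_queue(dev);
}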

-Original Message-
From: Li Yang [mailto:leoyang...@nxp.com] 
Sent: Monday, 28 January 2019 22:37
To: Mathias Thore 
Cc: Christophe Leroy ; net...@vger.kernel.org; 
linuxppc-dev@lists.ozlabs.org; David Gounaris ; 
Joakim Tjernlund 
Subject: Re: [PATCH] ucc_geth: Reset BQL queue when stopping device

On Mon, Jan 28, 2019 at 8:37 AM Mathias Thore  
wrote:
>
> Hi,
>
>
> This is what we observed: there was a storm on the medium so that our 
> controller could not do its TX, resulting in timeout. When timeout occurs, 
> the driver clears all descriptors from the TX queue. The function called in 
> this patch is used to reflect this clearing also in the BQL layer. Without 
> it, the controller would get stuck, unable to perform TX, even several 
> minutes after the storm had ended. Bringing the device down and then up again 
> would solve the problem, but this patch also solves it automatically.

The explanation makes sense.  So this should only be required in the timeout 
scenario instead of other cleanup scenarios like device shutdown?  If so, it 
would probably be better done in ucc_geth_timeout_work()?

>
>
> Some other drivers do the same, for example e1000e driver calls 
> netdev_reset_queue in its e1000_clean_tx_ring function. It is possible that 
> other drivers should do the same; I have no way of verifying this.
>
>
> Regards,
>
> Mathias
>
> --
>
>
> From: Christophe Leroy 
> Sent: Monday, January 28, 2019 10:48 AM
> To: Mathias Thore; leoyang...@nxp.com; net...@vger.kernel.org; 
> linuxppc-dev@lists.ozlabs.org; David Gounaris; Joakim Tjernlund
> Subject: Re: [PATCH] ucc_geth: Reset BQL queue when stopping device
>
>
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you recognize the sender and know the 
> content is safe.
>
>
> Hi,
>
> On 28/01/2019 at 10:07, Mathias Thore wrote:
> > After a timeout event caused by for example a broadcast storm, when 
> > the MAC and PHY are reset, the BQL TX queue needs to be reset as 
> > well. Otherwise, the device will exhibit severe performance issues 
> > even after the storm has ended.
>
> What are the symptoms?
>
> Is this reset needed on any network driver in that case, or is it 
> something particular for the ucc_geth ?
> For instance, the freescale fs_enet doesn't have that reset. Should it 
> have it too ?
>
> Christophe
>
> >
> > Co-authored-by: David Gounaris 
> > Signed-off-by: Mathias Thore 
> > ---
> >   drivers/net/ethernet/freescale/ucc_geth.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
> > b/drivers/net/ethernet/freescale/ucc_geth.c
> > index c3d539e209ed..eb3e65e8868f 100644
> > --- a/drivers/net/ethernet/freescale/ucc_geth.c
> > +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> > @@ -1879,6 +1879,8 @@ static void ucc_geth_free_tx(struct ucc_geth_private 
> > *ugeth)
> >   u16 i, j;
> >   u8 __iomem *bd;
> >
> > + netdev_reset_queue(ugeth->ndev);
> > +
> >   ug_info = ugeth->ug_info;
> >   uf_info = _info->uf_info;
> >
> >
>