Re: [PATCH 1/2] powerpc: Batch up loads/stores on saving and restoring Altivec
On Thu, 10 Mar 2016 16:37:47 +1100 Cyril Bur wrote: > On Thu, 10 Mar 2016 10:01:07 +1100 > Michael Neuling wrote: > > > On Tue, 2016-03-01 at 16:55 +1100, Cyril Bur wrote: > > > > > Currently the assembly to save and restore Altivec registers boils down to > > > a load immediate of the offset of the specific Altivec register in memory > > > followed by the load/store which repeats in sequence for each Altivec > > > register. > > > > > > This patch attempts to do better by loading up four registers with > > > immediates so that the loads and stores can be batched up and better > > > pipelined by the processor. > > > > > > This patch results in four load/stores in sequence and one add between > > > groups of four. Also, by using a pair of base registers it means that the > > > result of the add is not needed by the following instruction. > > > > What the performance improvement? > > So I have some numbers. This is the same context switch benchmark that was > used > for my other series, a modified version of: > http://www.ozlabs.org/~anton/junkcode/context_switch2.c > > We pingpong across a pipe and touch FP/VMX/VSX or some combinations of both. > The numbers are the number of pingpongs per second, the tests run for 52 > seconds with the first two seconds of results being ignored to allow for the > test to stabilise. > > Run 1 > Touched Facility Average Stddev Speedup %Speedup Speedup/Stddev > None 1845984 8261.28 15017.32 100.8201 1.817793 > FPU 1639296 3966.94 54770.04 103.4565 13.80660 > FPU + VEC 1555836 3708.59 34533.72 102.2700 9.311813 > FPU + VSX 1523202 78984.7 -362.64 99.97619 -0.00459 > VEC 1631529 23665.5 -11818.0 99.28085 -0.49937 > VEC + VSX 1543007 24614.0 32330.52 102.1401 1.313500 > VSX 1554007 28723.0 40296.56 102.6621 1.402932 > FPU + VEC + VSX 1546072 17201.1 41687.28 102.7710 2.423519 > > Run 2 > Touched Facility Average Stddev Speedup %Speedup Speedup/Stddev > None 1837869 30263.4 -7780.6 99.57843 -0.25709 > FPU 1587883 70260.6 -23927 98.51546 -0.34055 > FPU + VEC 1552703 13563.6 37243.4 102.4575 2.745831 > FPU + VSX 1558519 13706.7 32365.9 102.1207 2.361308 > VEC 1651599 1388.83 13474.3 100.8225 9.701918 > VEC + VSX 1552319 1752.77 42443.2 102.8110 24.21487 > VSX 1559806 7891.66 55542.5 103.6923 7.038124 > FPU + VEC + VSX 1549330 22148.1 29010.8 101.9082 1.309849 > > > I can't help but notice these are noisy. These were run on KVM on a fairly > busy box. I wonder if the numbers smooth out on an otherwise idle machine. It > doesn't look super consistent across two runs. Did four runs on a OpenPower system, booted into a buildroot ramfs with basically nothing running except ssh daemon and busybox. Run 1 Touched Average Stddev % of AverageSpeedup %SpeedupSpeedup/Stddev None1772303.48 1029.674394 0.05809808564 -2514.2 99.85834038 -2.441742764 FPU 1564827.36 623.4840148 0.03984362945 21815.16 101.4138035 34.98912479 FPU + VEC 1485572.76 865.0766997 0.05823186336 34317.64 102.3646869 39.6700547 FPU + VSX 1485859.72 606.5295067 0.04082010559 47010.68 103.267242 77.50765541 VEC 1571237.88 927.2565997 0.05901439958459.76 100.5413283 9.123429268 VEC + VSX 1480972.92 652.0581181 0.04402903722 47769.32 103.3330449 73.25929802 VSX 1468496.12 140955.9898 9.598662731 10461.72 100.7175222 0.0742197619 All 1480732.36 777.801684 0.05252817491 50908.56408 103.5604782 65.45185634 Run 2 Touched Average Stddev % of Average Speedup %SpeedupSpeedup/Stddev None1773836.72 868.7485496 0.0489756774-2374.68 99.86630645 -2.733449168 FPU 1564397.8 688.6031113 0.04401713626 21650.48 101.4033717 31.44115913 FPU + VEC 1484855.48 905.2225623 0.06096368128 33020.52 102.274399 36.47779162 FPU + VSX 1486762.64 933.4386234 0.06278329831 48576.48 103.3776212 52.04035786 VEC 1571551.44 785.1688538 0.04996138426 7980.44 100.5103983 10.16397933 VEC + VSX 1480887.72 574.725 0.03876053969 49535.16 103.4607239 86.29817725 VSX 1489902.32 546.0581355 0.03665059972 35301.64 102.4268956 64.64813489 All 1480020.88 733.4377717 0.04955590705 54243.28 103.8044699 73.95757636 Run 3 Touched Average Stddev % of AverageSpeedup
Re: [PATCH 1/2] ppc64le live patch: clear out storage location(s) in mini stack frame
On 10/03/16 04:28, Torsten Duwe wrote: > This can be applied on top of Petr Mladek's v4 rework of the ppc64le > live patching. Inspired by Balbir Singh's v5, information about the > callee's r2 is stored in a "reserved" 32 bit location in the caller's > stack frame, instead of 64 bits in the newly created mini frame 24(r1). > > It only needs to work for a local call, when caller's TOC == callee's > TOC, and along with the return address (LR) it's all within a 4GiB > range (+-31 bits). If the original call already was global, we are > allowed to restore any nonsense into r2, because the global caller > will restore its TOC anyway from the ABI compliant location 24(r1) > right after return. > > Hi, Torsten Sorry, I've had no time to test this. Caught up with something else for the moment. Hopefully I'll get a chance over the weekend. Have you tested this against Petr's sample changes to patch printk? Balbir Singh. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 2/9] mm/hugetlb: Add follow_huge_pgd function
On 03/09/2016 05:40 PM, Anshuman Khandual wrote: > This just adds 'follow_huge_pgd' function which is will be used > later in this series to make 'follow_page_mask' function aware > of PGD based huge page implementation. Hugh/Mel/Naoya/Andrew, Thoughts/inputs/suggestions ? Does this change looks okay ? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 3/9] mm/gup: Make follow_page_mask function PGD implementation aware
On 03/09/2016 05:40 PM, Anshuman Khandual wrote: > Currently the function 'follow_page_mask' does not take into account > PGD based huge page implementation. This change achieves that and > makes it complete. Hugh/Mel/Naoya/Andrew, Thoughts/inputs/suggestions ? Does this change look okay ? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 1/9] mm/hugetlb: Make GENERAL_HUGETLB functions PGD implementation aware
On 03/09/2016 05:40 PM, Anshuman Khandual wrote: > Currently both the ARCH_WANT_GENERAL_HUGETLB functions 'huge_pte_alloc' > and 'huge_pte_offset' dont take into account huge page implementation > at the PGD level. With addition of PGD awareness into these functions, > more architectures like POWER which also implements huge pages at PGD > level (along with PMD level), can use ARCH_WANT_GENERAL_HUGETLB option. Hugh/Mel/Naoya/Andrew, Thoughts/inputs/suggestions ? Does this change looks okay ? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 8/9] KVM: PPC: Add in-kernel handling for VFIO
On 03/10/2016 04:18 PM, David Gibson wrote: On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote: On 03/08/2016 10:08 PM, David Gibson wrote: On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO without passing them to user space which saves time on switching to user space and back. Both real and virtual modes are supported. The kernel tries to handle a TCE request in the real mode, if fails it passes the request to the virtual mode to complete the operation. If it a virtual mode handler fails, the request is passed to user space; this is not expected to happen ever though. Well... not expect to happen with a qemu which uses this. Presumably it will fall back to userspace routinely if you have an old qemu that doesn't add the liobn mappings. Ah. Ok, thanks, I'll add this to the commit log. Ok. The first user of this is VFIO on POWER. Trampolines to the VFIO external user API functions are required for this patch. I'm not sure what you mean by "trampoline" here. For example, look at kvm_vfio_group_get_external_user. It calls symbol_get(vfio_group_get_external_user) and then calls a function via the returned pointer. Is there a better word for this? Uh.. probably although I don't immediately know what. "Trampoline" usually refers to code on the stack used for bouncing places, which isn't what this resembles. "Dynamic wrapper"? This uses a VFIO KVM device to associate a logical bus number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap requests. Group fd? Or container fd? The group fd wouldn't make a lot of sense. Group. KVM has no idea about containers. That's not going to fly. Having a liobn registered against just one group in a container makes no sense at all. Conceptually, if not physically, the container shares a single set of TCE tables. If handling that means teaching KVM the concept of containers, then so be it. Btw, I'm not sure yet if extending the existing vfio kvm device to make the vfio<->kvm linkages makes sense. I think the reason some x86 machines need that is quite different from how we're using it for Power. I haven't got a clear enough picture yet to be sure either way. The other option that would seem likely to me would be a "bind VFIO container" ioctl() on the fd associated with a kernel accelerated TCE table. Oh, I just noticed this response. I need to digest it. Looks like this is going to take other 2 years to upstream... To make use of the feature, the user space has to create a guest view of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and then associate a LIOBN with this table via VFIO KVM device, a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in the next patch). Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Is that with or without DDW (i.e. with or without a 64-bit DMA window)? Without DDW, I should have mentioned this. The patch is from the times when there was no DDW :( Ok. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_64_vio.c| 184 +++ arch/powerpc/kvm/book3s_64_vio_hv.c | 186 2 files changed, 370 insertions(+) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 7965fc7..9417d12 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -317,11 +318,161 @@ fail: return ret; } +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, + unsigned long entry) +{ + struct mm_iommu_table_group_mem_t *mem = NULL; + const unsigned long pgsize = 1ULL << tbl->it_page_shift; + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + + if (!pua) + return H_HARDWARE; + + mem = mm_iommu_lookup(*pua, pgsize); + if (!mem) + return H_HARDWARE; + + mm_iommu_mapped_dec(mem); + + *pua = 0; + + return H_SUCCESS; +} + +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, + unsigned long entry) +{ + enum dma_data_direction dir = DMA_NONE; + unsigned long hpa = 0; + + if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) + return H_HARDWARE; + + if (dir == DMA_NONE) + return H_SUCCESS; + + return kvmppc_tce_iommu_mapped_dec(tbl, entry); +} + +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, + unsigned long entry, unsigned long gpa, + enum dma_data_direction dir) +{ + long ret; + unsigned long hpa, ua, *pua = IOMMU_TABL
Re: [PATCH v3 1/2] cxl: Add mechanism for delivering AFU driver specific events
On 10/03/16 12:18, Ian Munsie wrote: On a related matter, we should send a patch to remove some of the leftover config options that were added to smooth the merging of cxlflash in the first place (CXL_KERNEL_API, CXL_EEH). I'm happy to do that after this series is merged. -- Andrew Donnellan Software Engineer, OzLabs andrew.donnel...@au1.ibm.com Australia Development Lab, Canberra +61 2 6201 8874 (work)IBM Australia Limited ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/3] powerpc/powernv: Cpuidle related cleanup
On Thu, 2016-03-10 at 16:15 +0530, Shreyas B Prabhu wrote: > Hi, > Any thoughts on this? Haven't had time to give it a proper review yet. Will try and get to it. cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] T104xRDB: add tdm riser card node to device tree
Signed-off-by: Zhao Qiang --- arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi index 830ea48..be7bab6 100644 --- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi +++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi @@ -107,6 +107,11 @@ reg = <0>; spi-max-frequency = <1000>; /* input clock */ }; + slic@3 { + compatible = "maxim,ds26522"; + reg = <3>; + spi-max-frequency = <200>; /* input clock */ + }; }; i2c@118000 { -- 2.1.0.27.g96db324 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] ppc64le live patch: get rid of mini stack frame
On Thu, 2016-03-10 at 14:04 +0100, Torsten Duwe wrote: > On Thu, Mar 10, 2016 at 01:51:16PM +0100, Petr Mladek wrote: > > On Thu 2016-03-10 13:25:08, Petr Mladek wrote: > > > On Wed 2016-03-09 18:30:17, Torsten Duwe wrote: > > > > After the mini stack frame is no longer required for TOC storage, it can > > > > be eliminated iff the functionality of klp_return_helper, which required > > > > a stack frame for the extra return address previously, is carried out > > > > by the replacement function now. This requires _every_ live patch > > > > replacement > > > > function to execute the following (or similar) sequence of machine > > > > instructions > > > > just before every return to the original caller: > > > > > > I have thought about it and it is a nono from my point of view. > > > It is too error prone, especially that there are functions that > > > call return on several locations. > > Yes, that's what I think as well when I look at it. > > BTW: How is this solved in kretprobes? Or is it easier there? > > Without any look at the code I assume it uses solution 3. Once > you have a probing framework in place, you can remember the real > return addresses in a data structure. As I wrote, the function > graph tracer does it this way so it would be abvious. Yeah it has a linked list of struct kretprobe_instance's, each of which stores the real return address. I have some ideas for how to fix livepatch, but this week is a bit busy with merge window prep. cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [2/2] Fix misleading comment for hard_irq_disable() in pmao_restore_workaround
On Fri, 2016-04-03 at 05:01:49 UTC, Madhavan Srinivasan wrote: > Current comment added in pmao_restore_workaround() for > hard_irq_disable is misleading. Comment should say to hard > "disable" interrupts instead of "enable" it. Patch to fix the typo. > > Signed-off-by: Madhavan Srinivasan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/58bffb5bbb238d56e8818acb46 cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [1/2] powerpc/perf: Remove PME_ prefix for power7 events
On Mon, 2016-11-01 at 22:55:25 UTC, Sukadev Bhattiprolu wrote: > We used the PME_ prefix earlier to avoid some macro/variable name > collisions. We have since changed the way we define/use the event > macros so we no longer need the prefix. > > By dropping the prefix, we keep the the event macros consistent with > their official names. > > Reported-by: Michael Ellerman > Signed-off-by: Sukadev Bhattiprolu Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d4969e2459c6e852a6862256cf cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [1/2] powerpc/perf/hv-24x7: Fix usage with chip events
On Sat, 2016-30-01 at 03:07:10 UTC, Sukadev Bhattiprolu wrote: > >From 9b5848ce1834a4d82fc251022035d36d9e26b500 Mon Sep 17 00:00:00 2001 > From: Sukadev Bhattiprolu > Date: Sat, 23 Jan 2016 03:58:12 -0500 > Subject: [PATCH 1/2] powerpc/perf/hv-24x7: Fix usage with chip events. > > 24x7 counters can belong to different domains (core, chip, virtual CPU > etc). For events in the 'chip' domain, sysfs entry currently looks like: > > $ cd /sys/bus/event_source/devices/hv_24x7/events > $ cat PM_XLINK_CYCLES__PHYS_CHIP > domain=0x1,offset=0x230,core=?,lpar=0x0 ... > > Signed-off-by: Sukadev Bhattiprolu Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/e5a5886d7ae32b7afebfffecca cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [1/2] powerpc/perf/hv-24x7: Display domain indices in sysfs
On Tue, 2016-16-02 at 23:24:27 UTC, Sukadev Bhattiprolu wrote: > >From aff5a822e873522b9a3f355f816547394b452a64 Mon Sep 17 00:00:00 2001 > From: Sukadev Bhattiprolu > Date: Tue, 16 Feb 2016 20:07:51 -0500 > Subject: [PATCH 1/2] powerpc/perf/hv-24x7: Display domain indices in sysfs > > To help users determine domains, display the domain indices used by the > kernel in sysfs. > > $ cat /sys/bus/event_source/devices/hv_24x7/interface/domains > 1: Physical Chip > 2: Physical Core > 3: VCPU Home Core > 4: VCPU Home Chip > 5: VCPU Home Node > 6: VCPU Remote Node > > Signed-off-by: Sukadev Bhattiprolu Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d34171e88aeed4a99c00c7f2af cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
On Thu, Mar 10, 2016 at 07:33:05PM +1100, Paul Mackerras wrote: > On Mon, Mar 07, 2016 at 05:00:14PM +1100, David Gibson wrote: > > On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote: > > > VFIO on sPAPR already implements guest memory pre-registration > > > when the entire guest RAM gets pinned. This can be used to translate > > > the physical address of a guest page containing the TCE list > > > from H_PUT_TCE_INDIRECT. > > > > > > This makes use of the pre-registrered memory API to access TCE list > > > pages in order to avoid unnecessary locking on the KVM memory > > > reverse map. > > > > > > Signed-off-by: Alexey Kardashevskiy > > > > Ok.. so, what's the benefit of not having to lock the rmap? > > It's not primarily about locking or not locking the rmap. The point > is that when memory is pre-registered, we know that all of guest > memory is pinned and we have a flat array mapping GPA to HPA. It's > simpler and quicker to index into that array (even with looking up the > kernel page tables in vmalloc_to_phys) than it is to find the memslot, > lock the rmap entry, look up the user page tables, and unlock the rmap > entry. We were only locking the rmap entry to stop the page being > unmapped and reallocated to something else, but if memory is > pre-registered, it's all pinned, so it can't be reallocated. Ok, that makes sense. Alexey, can you fold some of that rationale into the commit message? -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/pci: Reuse PHB number on pci_controller add if available
On Fri, Mar 11, 2016 at 08:34:58AM +1100, Benjamin Herrenschmidt wrote: >On Thu, 2016-03-10 at 15:11 -0300, Guilherme G. Piccoli wrote: >> The domain/PHB field of PCI addresses has its value obtained from a >> global variable, incremented each time a new domain (represented by >> struct pci_controller) is added on the system. The domain addition >> process happens during boot or due to PCI device hotplug. > >I'd rather we use a fixed numbering for domains, for example, under >OPAL, use the opal-id, under phyp, there is also a number in the DT we >can use (forgot what it's called) etc... > guest on phyp has something like below: ibm,fw-phb-id = <0x800 0x2200>; However, PowerKVM guest doesn't have it. I guess it's a problem, right? Thanks, Gavin >> As recent kernels are using predictable naming for network >> interfaces, >> the network stack is more tied to PCI naming. This can be a problem >> in >> hotplug scenarios, because PCI addresses will change if a device is >> removed and then re-added. This situation seems unusual, but it can >> happen if a user wants to replace a NIC without rebooting the >> machine, >> for example. >> >> This patch changes the way PCI domain values are generated: the >> global >> variable is no longer used; instead, a list is introduced, containing >> a flag that indicates whenever a domain is in use or was released, >> for >> example by a PCI hotplug remove. If the new PHB being added finds a >> free domain, it will use its number; otherwise a new number is >> generated incrementing the higher domain value present on the system. >> No functional changes were introduced. >> >> Signed-off-by: Guilherme G. Piccoli >> --- >> arch/powerpc/kernel/pci-common.c | 47 >> +--- >> 1 file changed, 44 insertions(+), 3 deletions(-) >> >> diff --git a/arch/powerpc/kernel/pci-common.c >> b/arch/powerpc/kernel/pci-common.c >> index 0f7a60f..6eac0dd 100644 >> --- a/arch/powerpc/kernel/pci-common.c >> +++ b/arch/powerpc/kernel/pci-common.c >> @@ -44,8 +44,12 @@ >> static DEFINE_SPINLOCK(hose_spinlock); >> LIST_HEAD(hose_list); >> >> -/* XXX kill that some day ... */ >> -static int global_phb_number; /* Global phb counter >> */ >> +/* Global PHB counter list */ >> +struct global_phb_number { >> +bool used; >> +struct list_head node; >> +}; >> +LIST_HEAD(global_phb_number_list); >> >> /* ISA Memory physical address */ >> resource_size_t isa_mem_base; >> @@ -64,6 +68,42 @@ struct dma_map_ops *get_pci_dma_ops(void) >> } >> EXPORT_SYMBOL(get_pci_dma_ops); >> >> +static int get_phb_number(void) >> +{ >> +struct global_phb_number *ptr; >> +int latest_number = 0; >> + >> +list_for_each_entry(ptr, &global_phb_number_list, node) { >> +if (!ptr->used) { >> +ptr->used = true; >> +return latest_number; >> +} >> +latest_number++; >> +} >> + >> +ptr = zalloc_maybe_bootmem(sizeof(struct global_phb_number), >> GFP_KERNEL); >> +BUG_ON(ptr == NULL); >> + >> +ptr->used = true; >> +list_add_tail(&ptr->node, &global_phb_number_list); >> + >> +return latest_number; >> +} >> + >> +static void release_phb_number(int phb_number) >> +{ >> +struct global_phb_number *ptr; >> +int current_number = 0; >> + >> +list_for_each_entry(ptr, &global_phb_number_list, node) { >> +if (current_number == phb_number) { >> +ptr->used = false; >> +return; >> +} >> +current_number++; >> +} >> +} >> + >> struct pci_controller *pcibios_alloc_controller(struct device_node >> *dev) >> { >> struct pci_controller *phb; >> @@ -72,7 +112,7 @@ struct pci_controller >> *pcibios_alloc_controller(struct device_node *dev) >> if (phb == NULL) >> return NULL; >> spin_lock(&hose_spinlock); >> -phb->global_number = global_phb_number++; >> +phb->global_number = get_phb_number(); >> list_add_tail(&phb->list_node, &hose_list); >> spin_unlock(&hose_spinlock); >> phb->dn = dev; >> @@ -94,6 +134,7 @@ EXPORT_SYMBOL_GPL(pcibios_alloc_controller); >> void pcibios_free_controller(struct pci_controller *phb) >> { >> spin_lock(&hose_spinlock); >> +release_phb_number(phb->global_number); >> list_del(&phb->list_node); >> spin_unlock(&hose_spinlock); >> > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE
On 03/10/2016 04:21 PM, David Gibson wrote: On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote: On 03/09/2016 04:45 PM, David Gibson wrote: On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote: sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap via hypercalls which take a logical bus id (LIOBN) as a target IOMMU identifier. LIOBNs are made up, advertised to guest systems and linked to IOMMU groups by the user space. In order to enable acceleration for IOMMU operations in KVM, we need to tell KVM the information about the LIOBN-to-group mapping. For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter is added which accepts: - a VFIO group fd and IO base address to find the actual hardware TCE table; - a LIOBN to assign to the found table. Before notifying KVM about new link, this check the group for being registered with KVM device in order to release them at unexpected KVM finish. This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user space. While we are here, this also fixes VFIO KVM device compiling to let it link to a KVM module. Signed-off-by: Alexey Kardashevskiy --- Documentation/virtual/kvm/devices/vfio.txt | 21 +- arch/powerpc/kvm/Kconfig | 1 + arch/powerpc/kvm/Makefile | 5 +- arch/powerpc/kvm/powerpc.c | 1 + include/uapi/linux/kvm.h | 9 +++ virt/kvm/vfio.c| 106 + 6 files changed, 140 insertions(+), 3 deletions(-) diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt index ef51740..c0d3eb7 100644 --- a/Documentation/virtual/kvm/devices/vfio.txt +++ b/Documentation/virtual/kvm/devices/vfio.txt @@ -16,7 +16,24 @@ Groups: KVM_DEV_VFIO_GROUP attributes: KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking + kvm_device_attr.addr points to an int32_t file descriptor + for the VFIO group. AFAICT these changes are accurate for VFIO as it is already, in which case it might be clearer to put them in a separate patch. KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking + kvm_device_attr.addr points to an int32_t file descriptor + for the VFIO group. -For each, kvm_device_attr.addr points to an int32_t file descriptor -for the VFIO group. + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group + kvm_device_attr.addr points to a struct: + struct kvm_vfio_spapr_tce_liobn { + __u32 argsz; + __s32 fd; + __u32 liobn; + __u8pad[4]; + __u64 start_addr; + }; + where + @argsz is the size of kvm_vfio_spapr_tce_liobn; + @fd is a file descriptor for a VFIO group; + @liobn is a logical bus id to be associated with the group; + @start_addr is a DMA window offset on the IO (PCI) bus For the cause of DDW and multiple windows, I'm assuming you can call this multiple times with different LIOBNs and the same IOMMU group? Yes. It is called twice per each group (when DDW is activated) - for 32bit and 64bit windows, this is why @start_addr is there. diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 1059846..dfa3488 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -65,6 +65,7 @@ config KVM_BOOK3S_64 select KVM select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE select SPAPR_TCE_IOMMU if IOMMU_SUPPORT + select KVM_VFIO if VFIO ---help--- Support running unmodified book3s_64 and book3s_32 guest kernels in virtual machines on book3s_64 host processors. diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 7f7b6d8..71f577c 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm KVM := ../../../virt/kvm common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \ - $(KVM)/eventfd.o $(KVM)/vfio.o + $(KVM)/eventfd.o Please don't disable the VFIO device for the non-book3s case. I added it (even though it didn't do anything until now) so that libvirt wouldn't choke when it finds it's not available. Obviously the new ioctl needs to be only for the right IOMMU setup, but the device itself should be available always. Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module. CFLAGS_e500_mmu.o := -I. CFLAGS_e500_mmu_host.o := -I. @@ -87,6 +87,9 @@ endif kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ book3s_xics.o +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ + $(KVM)/vfio.o \ + kvm-book3s_64-module-objs += \ $(KVM)/kvm_main.o \ $(KVM)/eventfd.o \ diff --git a
Re: [PATCH] powerpc/pci: Reuse PHB number on pci_controller add if available
On Thu, 2016-03-10 at 15:11 -0300, Guilherme G. Piccoli wrote: > The domain/PHB field of PCI addresses has its value obtained from a > global variable, incremented each time a new domain (represented by > struct pci_controller) is added on the system. The domain addition > process happens during boot or due to PCI device hotplug. I'd rather we use a fixed numbering for domains, for example, under OPAL, use the opal-id, under phyp, there is also a number in the DT we can use (forgot what it's called) etc... > As recent kernels are using predictable naming for network > interfaces, > the network stack is more tied to PCI naming. This can be a problem > in > hotplug scenarios, because PCI addresses will change if a device is > removed and then re-added. This situation seems unusual, but it can > happen if a user wants to replace a NIC without rebooting the > machine, > for example. > > This patch changes the way PCI domain values are generated: the > global > variable is no longer used; instead, a list is introduced, containing > a flag that indicates whenever a domain is in use or was released, > for > example by a PCI hotplug remove. If the new PHB being added finds a > free domain, it will use its number; otherwise a new number is > generated incrementing the higher domain value present on the system. > No functional changes were introduced. > > Signed-off-by: Guilherme G. Piccoli > --- > arch/powerpc/kernel/pci-common.c | 47 > +--- > 1 file changed, 44 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kernel/pci-common.c > b/arch/powerpc/kernel/pci-common.c > index 0f7a60f..6eac0dd 100644 > --- a/arch/powerpc/kernel/pci-common.c > +++ b/arch/powerpc/kernel/pci-common.c > @@ -44,8 +44,12 @@ > static DEFINE_SPINLOCK(hose_spinlock); > LIST_HEAD(hose_list); > > -/* XXX kill that some day ... */ > -static int global_phb_number;/* Global phb counter > */ > +/* Global PHB counter list */ > +struct global_phb_number { > + bool used; > + struct list_head node; > +}; > +LIST_HEAD(global_phb_number_list); > > /* ISA Memory physical address */ > resource_size_t isa_mem_base; > @@ -64,6 +68,42 @@ struct dma_map_ops *get_pci_dma_ops(void) > } > EXPORT_SYMBOL(get_pci_dma_ops); > > +static int get_phb_number(void) > +{ > + struct global_phb_number *ptr; > + int latest_number = 0; > + > + list_for_each_entry(ptr, &global_phb_number_list, node) { > + if (!ptr->used) { > + ptr->used = true; > + return latest_number; > + } > + latest_number++; > + } > + > + ptr = zalloc_maybe_bootmem(sizeof(struct global_phb_number), > GFP_KERNEL); > + BUG_ON(ptr == NULL); > + > + ptr->used = true; > + list_add_tail(&ptr->node, &global_phb_number_list); > + > + return latest_number; > +} > + > +static void release_phb_number(int phb_number) > +{ > + struct global_phb_number *ptr; > + int current_number = 0; > + > + list_for_each_entry(ptr, &global_phb_number_list, node) { > + if (current_number == phb_number) { > + ptr->used = false; > + return; > + } > + current_number++; > + } > +} > + > struct pci_controller *pcibios_alloc_controller(struct device_node > *dev) > { > struct pci_controller *phb; > @@ -72,7 +112,7 @@ struct pci_controller > *pcibios_alloc_controller(struct device_node *dev) > if (phb == NULL) > return NULL; > spin_lock(&hose_spinlock); > - phb->global_number = global_phb_number++; > + phb->global_number = get_phb_number(); > list_add_tail(&phb->list_node, &hose_list); > spin_unlock(&hose_spinlock); > phb->dn = dev; > @@ -94,6 +134,7 @@ EXPORT_SYMBOL_GPL(pcibios_alloc_controller); > void pcibios_free_controller(struct pci_controller *phb) > { > spin_lock(&hose_spinlock); > + release_phb_number(phb->global_number); > list_del(&phb->list_node); > spin_unlock(&hose_spinlock); > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] livepatch: Add some basic LivePatch documentation
On Wed, Mar 09, 2016 at 03:01:46PM +0100, Petr Mladek wrote: > LivePatch framework deserves some documentation, definitely. > This is an attempt to provide some basic info. I hope that > it will be useful for both LivePatch producers and also > potential developers of the framework itself. > > Signed-off-by: Petr Mladek Hi Petr, This is awesome. Thank you for starting this! Some minor nits below. > This patch was motivated by the LivePatch port for PPC. The guys > might want to document some PPC-specific limitations on top of it. > > I am sure that it is far from perfect. But I hope that it is > an acceptable start that can be improved later. I hope that > I did not write that many factual mistakes. > > I wrote only some generic info about the consistency model. > I am not sure if we have agreed on some specification yet. > > I am sorry for grammar mistakes. I hope that some hairs will > stay on your head if you are sensitive. > > Documentation/livepatch/livepatch.txt | 277 > ++ > MAINTAINERS | 1 + > 2 files changed, 278 insertions(+) > create mode 100644 Documentation/livepatch/livepatch.txt > > diff --git a/Documentation/livepatch/livepatch.txt > b/Documentation/livepatch/livepatch.txt > new file mode 100644 > index ..28e8047abb61 > --- /dev/null > +++ b/Documentation/livepatch/livepatch.txt > @@ -0,0 +1,277 @@ > += > +LivePatch > += At the risk of sounding like a broken record, I have to say that I really don't like the CamelCase in LivePatch. I don't have a good reason other than it gives me flashbacks to being exposed to EnterpriseApplicationObjectOrientedProgramming. And there's no existing precedent... # git grep LivePatch # ...so can we please not create one now? :-) > + > +This document outlines basic information about kernel LivePatching. > + > +Table of Contents: > + > +1. Motivation > +2. Kprobes, Ftrace, LivePatching > +3. Consistency model > +4. LivePatch life-cycle > + 4.1. Registration > + 4.2. Enabling > + 4.3. Disabling > + 4.4. Unregistration > +5. Livepatch module > + 5.1. New functions > + 5.2. Metadata > + 5.3. Module handling > +6. Sysfs > +7. Limitations > + > + > +1. Motivation > += > + > +There are situations when people are really reluctant to reboot a system. > +It might be because the computer is in the middle of a complex scientific > +computation. Or the system is busy handling customer requests in the high > +season. > + > +On the other hand, people also want to keep the system stable and secure. > +This is where LivePatch infrastructure comes handy. It allows to redirect > +selected function calls to a fixed implementation without rebooting > +the system. > + > + > +2. Kprobes, Ftrace, LivePatching > + > + > +Linux kernel has more ways how to redirect an existing code into a new one. > +It happens with kernel probes, function tracing, and LivePatching: > + > + + The kernel probes are the most generic way. The code can be redirected > +by putting an interrupt instruction instead of any instruction. > + > + + The function tracer calls the code from a predefined location that is > +close the function entry. The location is generated by the compiler, > +see -pg gcc option. > + > + + LivePatching typically needs to redirect the code at the very beginning > +of the function entry before the function parameters or the stack > +are anyhow muffled. > + > +All three approaches need to modify the existing code at runtime. Therefore > +they need to be aware of each other and do not step over othres' toes. Most > +of these problems are solved by using the dynamic ftrace framework as a base. > +A Kprobe is registered as a ftrace handler when the function entry is probed, > +see CONFIG_KPROBES_ON_FTRACE. Also an alternative function from a live patch > +is called from a custom ftrace handler. But there are some limitations, > +see below. > + > + > +3. Consistency model > + > + > +Functions are there for a reason. They take some input parameters, get or > +release locks, read, process, and even write some data in a defined way, > +have return values. By other words, each function has a defined semantic. > + > +Many fixes do not change the semantic of the modified functions. For example, > +they add a NULL pointer or a boundary check, fix a race by adding a missing > +memory barrier, or add some locking about a critical section. Most of these > +changes are self contained and the function present itself the same way > +to the rest of the system. In this case, the functions might be updated > +independently one by one. > + > +But there are more complex fixes. For example, a patch might change > +ordering of locking in more functions at the same time. Or a patch > +might exchange meaning of some temporary structures and update > +all the relevant functions. In this case, the aff
[PATCH] Reuse PHB/domain number on PCI adresses when available
This patch changes the way PCI domain numbers are generated on powerpc. No functional changes were introduced. The reason for this modification is better explained on patch's commit message, but in short we currently increment a global variable at each new PHB discovered, and use this value as domain number. The problem is that in some cases, like PCI device hotplug remove and re-add, the address is changed - as modern kernels are using predictable network naming for example, we can end up having some issues tracking network interfaces after hotplug operations. I CC'ed both cxl folks, Bjorn and PCI list, so we can be sure this modification, if accepted, won't impact any other related area. Thanks in advance, Guilherme Guilherme G. Piccoli (1): powerpc/pci: Reuse PHB number on pci_controller add if available arch/powerpc/kernel/pci-common.c | 47 +--- 1 file changed, 44 insertions(+), 3 deletions(-) -- 2.1.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/pci: Reuse PHB number on pci_controller add if available
The domain/PHB field of PCI addresses has its value obtained from a global variable, incremented each time a new domain (represented by struct pci_controller) is added on the system. The domain addition process happens during boot or due to PCI device hotplug. As recent kernels are using predictable naming for network interfaces, the network stack is more tied to PCI naming. This can be a problem in hotplug scenarios, because PCI addresses will change if a device is removed and then re-added. This situation seems unusual, but it can happen if a user wants to replace a NIC without rebooting the machine, for example. This patch changes the way PCI domain values are generated: the global variable is no longer used; instead, a list is introduced, containing a flag that indicates whenever a domain is in use or was released, for example by a PCI hotplug remove. If the new PHB being added finds a free domain, it will use its number; otherwise a new number is generated incrementing the higher domain value present on the system. No functional changes were introduced. Signed-off-by: Guilherme G. Piccoli --- arch/powerpc/kernel/pci-common.c | 47 +--- 1 file changed, 44 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 0f7a60f..6eac0dd 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -44,8 +44,12 @@ static DEFINE_SPINLOCK(hose_spinlock); LIST_HEAD(hose_list); -/* XXX kill that some day ... */ -static int global_phb_number; /* Global phb counter */ +/* Global PHB counter list */ +struct global_phb_number { + bool used; + struct list_head node; +}; +LIST_HEAD(global_phb_number_list); /* ISA Memory physical address */ resource_size_t isa_mem_base; @@ -64,6 +68,42 @@ struct dma_map_ops *get_pci_dma_ops(void) } EXPORT_SYMBOL(get_pci_dma_ops); +static int get_phb_number(void) +{ + struct global_phb_number *ptr; + int latest_number = 0; + + list_for_each_entry(ptr, &global_phb_number_list, node) { + if (!ptr->used) { + ptr->used = true; + return latest_number; + } + latest_number++; + } + + ptr = zalloc_maybe_bootmem(sizeof(struct global_phb_number), GFP_KERNEL); + BUG_ON(ptr == NULL); + + ptr->used = true; + list_add_tail(&ptr->node, &global_phb_number_list); + + return latest_number; +} + +static void release_phb_number(int phb_number) +{ + struct global_phb_number *ptr; + int current_number = 0; + + list_for_each_entry(ptr, &global_phb_number_list, node) { + if (current_number == phb_number) { + ptr->used = false; + return; + } + current_number++; + } +} + struct pci_controller *pcibios_alloc_controller(struct device_node *dev) { struct pci_controller *phb; @@ -72,7 +112,7 @@ struct pci_controller *pcibios_alloc_controller(struct device_node *dev) if (phb == NULL) return NULL; spin_lock(&hose_spinlock); - phb->global_number = global_phb_number++; + phb->global_number = get_phb_number(); list_add_tail(&phb->list_node, &hose_list); spin_unlock(&hose_spinlock); phb->dn = dev; @@ -94,6 +134,7 @@ EXPORT_SYMBOL_GPL(pcibios_alloc_controller); void pcibios_free_controller(struct pci_controller *phb) { spin_lock(&hose_spinlock); + release_phb_number(phb->global_number); list_del(&phb->list_node); spin_unlock(&hose_spinlock); -- 2.1.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/2] cxl: Add mechanism for delivering AFU driver specific events
Ian Munsie writes: > No, the kconfig option is there so that cxlflash can add support for > this and not have to worry about breaking any builds if their code is > merged into the scsi tree that doesn't have our code yet. > > There is nothing optional about this within our driver, which is why > this is a select and has no configuration choice of it's own. > > On a related matter, we should send a patch to remove some of the > leftover config options that were added to smooth the merging of > cxlflash in the first place (CXL_KERNEL_API, CXL_EEH). Agreed. >> > diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c >> > index 783337d..d1cc297 100644 >> > --- a/drivers/misc/cxl/file.c >> > +++ b/drivers/misc/cxl/file.c >> > @@ -295,6 +295,17 @@ int afu_mmap(struct file *file, struct vm_area_struct >> > *vm) >> > return cxl_context_iomap(ctx, vm); >> > } >> > >> > +static inline bool ctx_event_pending(struct cxl_context *ctx) >> > +{ >> > +if (ctx->pending_irq || ctx->pending_fault || ctx->pending_afu_err) >> > +return true; >> > + >> > +if (ctx->afu_driver_ops) >> > +return ctx->afu_driver_ops->event_pending(ctx); >> Should also check if ctx->afu_driver_ops->event_pending is NULL before >> calling it. > > The v1 patch did exactly that and mpe rejected it as it made this code > too ugly - we now check that event_pending field is valid when it is > registered and WARN if it is not. The driver_ops struct pointer being passed is still owned by the external module and its free change it even after calling this function. We can mitigate this to some extent by accepting a const pointer to the struct or by copying this struct to the context object. > I'm very reluctant to make this kind of change - while nice on paper, > poll() and read() are already very easy calls to screw up, and we have > seen that happen countless times in the past from different drivers that > e.g. and end up in a situation where poll says there is an event but > then read blocks, or poll blocks even though there is an event already > pending. > > The API at the moment fits into the poll() / read() model and has > appropriate locking and the correct waiting semantics to avoid those > kind of issues (provided that the afu driver doesn't do something that > violates these semantics like sleep in one of these calls, but the > kernel has debug features to detect that), but any deviation from this > is too risky in my view. Ian I have responded to Fred with an example implementation of these calls. Requesting you to please take a look. Cheers, ~ Vaibhav ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/2] cxl: Add mechanism for delivering AFU driver specific events
Michael Neuling writes: > These are here to enable the feature in other drivers. So the cxlflash > (or whoever) can put their code in via the linux-scsi tree but that new > piece is only enabled when CXL_AFU_DRIVER_OPS is present (ie. when > merged upstream). But if it's not, their code can still compile. > > Hence their code compiles in linux-scsi and our code compiles in linux > -ppc, but only once they're together do they actually enable the full > feature. We don't have a nasty dependency of linux-scsi having to pull > in linux-ppc or visa versa before the merge window. Everyone works > independently and it all gets fixed in linus tree. > > Eventually, when everyone has the all the code in merged upstream, we > can remove these config options. We should be able to remove > CXL_KERNEL_API and CXL_EEH now actually! > > So no, we shouldn't wrap the actual code. Mikey & Ian, Agree on the point made. Thanks for detailed explaination. ~ Vaibhav ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/2] cxl: Add mechanism for delivering AFU driver specific events
Frederic Barrat writes: > Hi Vaibhav, > > Le 09/03/2016 15:37, Vaibhav Jain a écrit : > >> I would propose these two apis. >> >> /* >> * fetches an event from the driver event queue. NULL means that queue >> * is empty. Can sleep if needed. The memory for cxl_event is allocated >> * by module being called. Hence it can be potentially be larger then >> * sizeof(struct cxl_event). Multiple calls to this should return same >> * pointer untill ack_event is called. >> */ >> struct cxl_event * fetch_event(struct cxl_context * ctx); >> >> /* >> * Returns and acknowledge the struct cxl_event * back to the driver >> * which can then free it or maybe put it back in a kmem_cache. This >> * should be called once we have completely returned the current >> * struct cxl_event from the readcall >> */ >> void ack_event(struct cxl_context * ctx, struct cxl_event *); > > > How would you implement polling on those APIs? Hi Fred. I am looking at an implementation similar to this: static inline bool ctx_event_pending(struct cxl_context *ctx) { typeof (ctx->afu_driver_ops->fetch_event) fn_events = (ctx->afu_driver_ops != NULL) ? ctx->afu_driver_ops->fetch_event : NULL; if (ctx->pending_irq || ctx->pending_fault || ctx->pending_afu_err) return true; /* * if fn_event returns a not null then its gauranteed to return * the same pointer on next call */ if (fn_events) return fn_events(ctx) != NULL; return false; } unsigned int afu_poll(struct file *file, struct poll_table_struct *poll) { struct cxl_context *ctx = file->private_data; int mask = 0; unsigned long flags; poll_wait(file, &ctx->wq, poll); pr_devel("afu_poll wait done pe: %i\n", ctx->pe); spin_lock_irqsave(&ctx->lock, flags); if (ctx_event_pending(ctx)) mask |= POLLIN | POLLRDNORM; else if (ctx->status == CLOSED) /* Only error on closed when there are no futher events pending */ mask |= POLLERR; spin_unlock_irqrestore(&ctx->lock, flags); pr_devel("afu_poll pe: %i returning %#x\n", ctx->pe, mask); return mask; } > How would you implement afu_read? There are several sources of events. Looking at an implementation similar to this: ssize_t afu_read(struct file *file, char __user *buf, size_t count, loff_t *off) { unsigned long flags; ssize_t rc = 0; struct cxl_context *ctx = file->private_data; struct cxl_event *ptr_event, event = { .header.process_element = ctx->pe, .header.size = sizeof(struct cxl_event_header) }; typeof (ctx->afu_driver_ops->fetch_event) fn_fetch_event = (ctx->afu_driver_ops != NULL) ? ctx->afu_driver_ops->fetch_event : NULL; typeof (ctx->afu_driver_ops->ack_event) fn_ack_event = (ctx->afu_driver_ops != NULL) ? ctx->afu_driver_ops->ack_event : NULL; if (count < CXL_READ_MIN_SIZE) return -EINVAL; if (!cxl_adapter_link_ok(ctx->afu->adapter) || ctx->status == CLOSED) return -EIO; if (signal_pending(current)) return -ERESTARTSYS; /* if no events then wait */ if (!ctx_event_pending(ctx)) { if ((file->f_flags & O_NONBLOCK)) return -EAGAIN; pr_devel("afu_read going to sleep...\n"); rc = wait_event_interruptible(ctx->wq, (ctx->status == CLOSED) || cxl_adapter_link_ok(ctx->afu->adapter) || ctx_event_pending(ctx)); pr_devel("afu_read woken up\n"); } /* did we get interrupted during wait sleep */ if (rc) return rc; /* get driver events if any */ ptr_event = fn_fetch_event ? fn_fetch_event(ctx) : NULL; /* In case of error feching driver specific event */ if (IS_ERR(ptr_event)) { pr_warn("Error fetching driver specific event %ld", PTR_ERR(ptr_event)); ptr_event = NULL; } /* code below manipulates ctx so take a spin lock */ spin_lock_irqsave(&ctx->lock, flags); /* give driver events first priority */ if (ptr_event) { pr_devel("afu_read delivering AFU driver specific event\n"); /* populate the header type and pe in the event struct */ ptr_event->header.type = CXL_EVENT_AFU_DRIVER; ptr_event->header.process_element = ctx->pe; WARN_ON(event.header.size > count); } else if (ctx->pending_irq) { pr_devel("afu_read delivering AFU interrupt\n"); event.header.size += si
Re: [PATCH 2/2] ppc64le live patch: get rid of mini stack frame
On Thu, Mar 10, 2016 at 01:51:16PM +0100, Petr Mladek wrote: > On Thu 2016-03-10 13:25:08, Petr Mladek wrote: > > On Wed 2016-03-09 18:30:17, Torsten Duwe wrote: > > > After the mini stack frame is no longer required for TOC storage, it can > > > be eliminated iff the functionality of klp_return_helper, which required > > > a stack frame for the extra return address previously, is carried out > > > by the replacement function now. This requires _every_ live patch > > > replacement > > > function to execute the following (or similar) sequence of machine > > > instructions > > > just before every return to the original caller: > > > > I have thought about it and it is a nono from my point of view. > > It is too error prone, especially that there are functions that > > call return on several locations. Yes, that's what I think as well when I look at it. > BTW: How is this solved in kretprobes? Or is it easier there? Without any look at the code I assume it uses solution 3. Once you have a probing framework in place, you can remember the real return addresses in a data structure. As I wrote, the function graph tracer does it this way so it would be abvious. Torsten ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] ppc64le live patch: get rid of mini stack frame
On Thu, 10 Mar 2016, Petr Mladek wrote: > BTW: How is this solved in kretprobes? Or is it easier there? Is that really a problem? With kretprobes you're never performing the module->core->module transition in "one" redirection, right? The 'kretprobe_trampoline' is a global symbol, and hence everybody is generating 'global call' for it, and that's pretty much it. -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] ppc64le live patch: get rid of mini stack frame
On Thu 2016-03-10 13:25:08, Petr Mladek wrote: > On Wed 2016-03-09 18:30:17, Torsten Duwe wrote: > > After the mini stack frame is no longer required for TOC storage, it can > > be eliminated iff the functionality of klp_return_helper, which required > > a stack frame for the extra return address previously, is carried out > > by the replacement function now. This requires _every_ live patch > > replacement > > function to execute the following (or similar) sequence of machine > > instructions > > just before every return to the original caller: > > I have thought about it and it is a nono from my point of view. > It is too error prone, especially that there are functions that > call return on several locations. BTW: How is this solved in kretprobes? Or is it easier there? Best Regards, Petr ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] ppc64le live patch: get rid of mini stack frame
On Wed 2016-03-09 18:30:17, Torsten Duwe wrote: > After the mini stack frame is no longer required for TOC storage, it can > be eliminated iff the functionality of klp_return_helper, which required > a stack frame for the extra return address previously, is carried out > by the replacement function now. This requires _every_ live patch replacement > function to execute the following (or similar) sequence of machine > instructions > just before every return to the original caller: I have thought about it and it is a nono from my point of view. It is too error prone, especially that there are functions that call return on several locations. Best Regards, Petr ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: Freescale P5020 cpu will be kvm-pr?
Hi Scott, sorry for late reply in this last few days i become another time father . In any way for better explain the issue i attached my dmesg log probably will make understand better than my poor english Thanks for your time Luigi Burdo AmigaOne X5000 beta tester From: intermedi...@hotmail.com To: o...@buserror.net; linuxppc-dev@lists.ozlabs.org Subject: RE: Freescale P5020 cpu will be kvm-pr? Date: Wed, 9 Mar 2016 01:27:00 +0100 CC: matt...@a-eon.com; cont...@a-eon.com; chzigot...@xenosoft.de Hi Scott, thanks for the reply,about: I don't know what you mean by "working only on emb side".thru qemu 2.5 i can use only ppcemb as kvm not ppc64 or ppc Dont have the Kvm-pr is a really unhappy news, it means we will not have the opportunity to run for example Mol-kvm on this machine. about:KVM for e5500 uses hardware hypervisor featureslook like hypervisor is disabled because the main Hardware is too young and need a better knowledge. Thanks for your time and infos Luigi > From: o...@buserror.net > To: intermedi...@hotmail.com; linuxppc-dev@lists.ozlabs.org > Date: Tue, 8 Mar 2016 18:10:05 -0600 > Subject: Re: Freescale P5020 cpu will be kvm-pr? > CC: matt...@a-eon.com; cont...@a-eon.com; chzigot...@xenosoft.de > > On Mon, 2016-02-29 at 20:03 +0100, luigi burdo wrote: > > Hi all, > > sorry for this email but i need to ask to Alexander Graf and Michael > > Ellerman if will be possible have kvm and kvm-pr for the Freescale/Nix P5020 > > cpu . > > KVM for e5500 uses hardware hypervisor features. PR-mode KVM is not > supported. Nested KVM is not supported. > > > We are testing a new desktop machine with this powerpc cpu classes > > and right now we see kvm under qemu is working only on emb side, this > > because the kernel > > set the e5500 cpu as a emb and there is no an option for build the kernel > > with book3s. > > p5020 uses an e5500 core which is book3e. You cannot use a book3s kernel with > it. Why would you want to? > > I don't know what you mean by "working only on emb side". > > -Scott > > > ___ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev dmsg Description: Binary data ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/3] powerpc/powernv: Cpuidle related cleanup
Hi, Any thoughts on this? On 02/29/2016 05:52 PM, Shreyas B. Prabhu wrote: > These patches are purely code movement and cleanup. > There is no functionality change. > > Note, there are multiple style error reported for patch 1 and 2. > I think this is because checkpatch script mistakes the patches > as C code because of the presence of semi-colon in each line. I > for now have largely ignored these errors. > > Shreyas B. Prabhu (3): > powerpc/powernv: Move CHECK_HMI_INTERRUPT to exception-64s header > powerpc/powernv: Encapsulate idle preparation steps in a macro > powerpc/powernv: make hypervisor state restore a function > > arch/powerpc/include/asm/cpuidle.h | 68 ++ > arch/powerpc/include/asm/exception-64s.h | 19 > arch/powerpc/kernel/exceptions-64s.S | 29 +- > arch/powerpc/kernel/idle_power7.S| 148 > --- > 4 files changed, 129 insertions(+), 135 deletions(-) > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [1/1] powerpc/embedded6xx: Make reboot works on MVME5100
On Wed, Mar 09, 2016 at 03:26:21PM -0600, Scott Wood wrote: > On Wed, 2016-03-09 at 11:28 +0100, Gabriel Paubert wrote: > > On Wed, Mar 09, 2016 at 12:38:18AM -0600, Scott Wood wrote: > > > On Tue, Mar 08, 2016 at 08:59:12AM +0100, Alessio Igor Bogani wrote: > > > > The mtmsr() function hangs during restart. Make reboot works on > > > > MVME5100 removing that function call. > > > > --- > > > > arch/powerpc/platforms/embedded6xx/mvme5100.c | 2 -- > > > > 1 file changed, 2 deletions(-) > > > > > > Missing signoff > > > > > > Do you know why MSR_IP was there to begin with? > > > > Huh? The patch sets MSR_IP for reset but it is cleared while running Linux. > > I don't have any MVME5100, however I do have MVME2400/2700 with the same > > bridge (Hawk), so I can say that the address space layout is quite standard: > > memory at 0, ROM at the high end of the 32-bit address space. However the > > reset method is quite different (no external register set on the Hawk). > > I meant why the line of code was there, not why the bit was ever set. It > seems odd to me that the pre-reset value of MSR would matter at all -- > shouldn't it get reset along with the rest of the CPU, as you later suggest? I don't know, it may create a soft reset. I have the board documentation and it is not clear (at least not 100% clear to me) whether this reset drives the HRESET of SRESET pin, although HRESET is more likely. > And if it does matter, then why are Alessio and whoever wrote the code > (assuming it wasn't an untested copy-and-paste from somewhere else) seeing > different results regarding whether IP needs to be set or unset? The different results are perfectly explained by taking a DSI exception on accessing *restart (the only access to this address in the whole code, so the hash table has to be filled at this time). With MSR_IP set, DSI exception goes to 0xfff00300, and then probably hangs. > > > > Does this board have a switch that determines whether boot vectors > > > are high or low (I remember some 83xx boards that did), in which > > > case is this fixing one config by breaking another? > > > > For the switch, no AFAICT. And the code is MVME5100 specific so I > > suspect that it is very unlikely to break other boards. > > > > Very likely the source of the problem is that the restart address is > > remapped > > (ioremap) but never accessed while the kernel is running (the only access to > > *restart is in the reboot routine) so we take a DSI exception to fill the > > hash > > table when attempting to reboot. > > > > It would be enough to move the setting of MSR_IP until after triggering the > > restart, but this performs a hard reset of the CPU, which will set MSR_IP > > anyway (granted that the CPU will probably set MSR_IP way before the reset > > signal comes in). > > How can you perform anything after the reset is triggered? There might be > some lag before it takes effect, but depending on that seems hackish and > unreliable, and why would it make a difference? It would make a difference if this reset is connected to SRESET, which does not affect MSR_IP. > > > One way to check this hypothesis would be to introduce a write of 0 to > > the restart address before setting MSR_IP. > > I'm not familiar with the register interface for this board logic, but what > would that demonstrate? Writing a 0 would not trigger the reset, but ensure that the hash table is filled, then you can change MSR_IP and perform the real reset, and be sure that you'll end up executing the reset in FLASH, not in RAM. The alternative solution it to put code at 0x100 that sets MSR_IP and branches to 0xfff00100. > > > This said restart is declared as u_char *, so the cast in the out_8 > > register access is useless and ugly. > > Yes, and restart should be __iomem. Indeed. Gabriel ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] livepatch: Add some basic LivePatch documentation
On Wed, 9 Mar 2016, Petr Mladek wrote: > LivePatch framework deserves some documentation, definitely. > This is an attempt to provide some basic info. I hope that > it will be useful for both LivePatch producers and also > potential developers of the framework itself. Thanks for starting the efforts; this is really needed if we want the infrastructure to be used also by someone else than its developers :) [ ... snip ... ] > +7. Limitations > +== Miroslav Benes, who is working on creating the actual patches we are shipping for our kernels, should already have a decent cheat-sheet regarding things that the patch author should be extra careful when creating a patch (inlining and other gcc optimizations such as isra, functions that use switch_to(), percpu, ...). I am not really sure whether these should go to limitations, or maybe they'd better be placed into a separate chapter, but I'd really like to see it included. Thanks, -- Jiri Kosina SUSE Labs ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote: > In real mode, TCE tables are invalidated using different > cache-inhibited store instructions which is different from > the virtual mode. I suggest "In real mode, TCE tables are invalidated using special cache-inhibited store instructions which are not available in virtual mode". Also, the subject could make people think it's about the kernel xchg() function defined in . Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 4/9] powerpc/powernv/iommu: Add real mode version of xchg()
On Mon, Mar 07, 2016 at 02:41:12PM +1100, Alexey Kardashevskiy wrote: > In real mode, TCE tables are invalidated using different > cache-inhibited store instructions which is different from > the virtual mode. > > This defines and implements exchange_rm() callback. This does not > define set_rm/clear_rm/flush_rm callbacks as there is no user for those - > exchange/exchange_rm are only to be used by KVM for VFIO. > > The exchange_rm callback is defined for IODA1/IODA2 powernv platforms. > > This replaces list_for_each_entry_rcu with its lockless version as > from now on pnv_pci_ioda2_tce_invalidate() can be called in > the real mode too. [snip] > @@ -1062,6 +1062,21 @@ void iommu_release_ownership(struct iommu_table *tbl) > } > EXPORT_SYMBOL_GPL(iommu_release_ownership); > > +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry, > + unsigned long *hpa, enum dma_data_direction *direction) > +{ > + long ret; > + > + ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction); > + > + if (!ret && ((*direction == DMA_FROM_DEVICE) || > + (*direction == DMA_BIDIRECTIONAL))) > + SetPageDirty(realmode_pfn_to_page(*hpa >> PAGE_SHIFT)); realmode_pfn_to_page can fail and return NULL, can't it? You need to handle that situation somehow. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH kernel 3/9] KVM: PPC: Use preregistered memory API to access TCE list
On Mon, Mar 07, 2016 at 05:00:14PM +1100, David Gibson wrote: > On Mon, Mar 07, 2016 at 02:41:11PM +1100, Alexey Kardashevskiy wrote: > > VFIO on sPAPR already implements guest memory pre-registration > > when the entire guest RAM gets pinned. This can be used to translate > > the physical address of a guest page containing the TCE list > > from H_PUT_TCE_INDIRECT. > > > > This makes use of the pre-registrered memory API to access TCE list > > pages in order to avoid unnecessary locking on the KVM memory > > reverse map. > > > > Signed-off-by: Alexey Kardashevskiy > > Ok.. so, what's the benefit of not having to lock the rmap? It's not primarily about locking or not locking the rmap. The point is that when memory is pre-registered, we know that all of guest memory is pinned and we have a flat array mapping GPA to HPA. It's simpler and quicker to index into that array (even with looking up the kernel page tables in vmalloc_to_phys) than it is to find the memslot, lock the rmap entry, look up the user page tables, and unlock the rmap entry. We were only locking the rmap entry to stop the page being unmapped and reallocated to something else, but if memory is pre-registered, it's all pinned, so it can't be reallocated. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev