[PATCH] powerpc/powernv: Fixes for hypervisor doorbell handling
Since we can now use hypervisor doorbells for host IPIs, this makes
sure we clear the host IPI flag when taking a doorbell interrupt, and
clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we
already do for IPIs sent via the XICS interrupt controller).  Otherwise
if there did happen to be a leftover pending doorbell interrupt for
an offline CPU thread for any reason, it would prevent that thread from
going into a power-saving mode; it would instead keep waking up because
of the interrupt.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/include/asm/reg.h       |  3 +++
 arch/powerpc/kernel/dbell.c          |  2 ++
 arch/powerpc/platforms/powernv/smp.c | 13 +++++++++++--
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1c874fb..af56b5c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -608,13 +608,16 @@
 #define   SRR1_ISI_N_OR_G	0x10000000 /* ISI: Access is no-exec or G */
 #define   SRR1_ISI_PROT		0x08000000 /* ISI: Other protection fault */
 #define   SRR1_WAKEMASK		0x00380000 /* reason for wakeup */
+#define   SRR1_WAKEMASK_P8	0x003c0000 /* reason for wakeup on POWER8 */
 #define   SRR1_WAKESYSERR	0x00300000 /* System error */
 #define   SRR1_WAKEEE		0x00200000 /* External interrupt */
 #define   SRR1_WAKEMT		0x00280000 /* mtctrl */
 #define	  SRR1_WAKEHMI		0x00280000 /* Hypervisor maintenance */
 #define   SRR1_WAKEDEC		0x00180000 /* Decrementer interrupt */
+#define   SRR1_WAKEDBELL	0x00140000 /* Privileged doorbell on P8 */
 #define   SRR1_WAKETHERM	0x00100000 /* Thermal management interrupt */
 #define	  SRR1_WAKERESET	0x00100000 /* System reset */
+#define   SRR1_WAKEHDBELL	0x000c0000 /* Hypervisor doorbell on P8 */
 #define	  SRR1_WAKESTATE	0x00030000 /* Powersave exit mask [46:47] */
 #define	  SRR1_WS_DEEPEST	0x00030000 /* Some resources not maintained,
 					  * may not be recoverable */
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f421781..2128f3a 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -17,6 +17,7 @@
 
 #include <asm/dbell.h>
 #include <asm/irq_regs.h>
+#include <asm/kvm_ppc.h>
 
 #ifdef CONFIG_SMP
 void doorbell_setup_this_cpu(void)
@@ -41,6 +42,7 @@ void doorbell_exception(struct pt_regs *regs)
 
 	may_hard_irq_enable();
 
+	kvmppc_set_host_ipi(smp_processor_id(), 0);
 	__this_cpu_inc(irq_stat.doorbell_irqs);
 
 	smp_ipi_demux();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index fc34025..7259a24 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -33,6 +33,7 @@
 #include <asm/runlatch.h>
 #include <asm/code-patching.h>
 #include <asm/dbell.h>
+#include <asm/kvm_ppc.h>
 
 #include "powernv.h"
 
@@ -149,7 +150,7 @@ static int pnv_smp_cpu_disable(void)
 static void pnv_smp_cpu_kill_self(void)
 {
 	unsigned int cpu;
-	unsigned long srr1;
+	unsigned long srr1, wmask;
 	u32 idle_states;
 
 	/* Standard hot unplug procedure */
@@ -161,6 +162,10 @@ static void pnv_smp_cpu_kill_self(void)
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
 
+	wmask = SRR1_WAKEMASK;
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		wmask = SRR1_WAKEMASK_P8;
+
 	idle_states = pnv_get_supported_cpuidle_states();
 	/* We don't want to take decrementer interrupts while we are offline,
 	 * so clear LPCR:PECE1. We keep PECE2 enabled.
@@ -191,10 +196,14 @@ static void pnv_smp_cpu_kill_self(void)
 		 * having finished executing in a KVM guest, then srr1
 		 * contains 0.
 		 */
-		if ((srr1 & SRR1_WAKEMASK) == SRR1_WAKEEE) {
+		if ((srr1 & wmask) == SRR1_WAKEEE) {
 			icp_native_flush_interrupt();
 			local_paca->irq_happened &= PACA_IRQ_HARD_DIS;
 			smp_mb();
+		} else if ((srr1 & wmask) == SRR1_WAKEHDBELL) {
+			unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
+			asm volatile(PPC_MSGCLR(%0) : : "r" (msg));
+			kvmppc_set_host_ipi(cpu, 0);
 		}
 
 		if (cpu_core_split_required())
-- 
2.1.4

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
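The wake-reason test in the smp.c hunk can be tried out in user space with a small model. This is only a sketch: `classify_wakeup()` and `enum wake_action` are hypothetical names, and the constants assume the 32-bit values of the SRR1_WAKE* definitions in reg.h. It shows why POWER8 needs the wider mask: with the 3-bit SRR1_WAKEMASK, the hypervisor doorbell code is not recognised at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the SRR1_WAKE* definitions in asm/reg.h. */
#define SRR1_WAKEMASK     0x00380000u  /* 3-bit wake reason (pre-POWER8) */
#define SRR1_WAKEMASK_P8  0x003c0000u  /* 4-bit wake reason on POWER8 */
#define SRR1_WAKEEE       0x00200000u  /* external interrupt */
#define SRR1_WAKEHDBELL   0x000c0000u  /* hypervisor doorbell (POWER8) */

/* Hypothetical summary of what the offline loop does for each wake reason. */
enum wake_action { WAKE_NONE, WAKE_FLUSH_XICS, WAKE_CLEAR_DOORBELL };

static enum wake_action classify_wakeup(uint64_t srr1, bool is_power8)
{
	/* Pick the mask the same way the patch does for CPU_FTR_ARCH_207S. */
	uint64_t wmask = is_power8 ? SRR1_WAKEMASK_P8 : SRR1_WAKEMASK;
	uint64_t reason = srr1 & wmask;

	if (reason == SRR1_WAKEEE)
		return WAKE_FLUSH_XICS;      /* icp_native_flush_interrupt() path */
	if (reason == SRR1_WAKEHDBELL)
		return WAKE_CLEAR_DOORBELL;  /* msgclr + kvmppc_set_host_ipi() path */
	return WAKE_NONE;
}
```

With the narrow mask, 0x000c0000 is reduced to 0x00080000 and matches neither case, so a pending doorbell would be left in place.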
Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'
On 19/03/15 08:45, Michael Ellerman wrote:
> On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
>> The 'arg' argument to copy_thread() is only ever used when forking a new
>> kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency
>> with do_fork() and other arch-specific implementations of copy_thread()).
> I don't understand the bit about consistency with do_fork() ?

This series of patches includes one patch which renames the arg for
do_fork(), and others which rename the same arg for each arch-specific
implementation of copy_thread(). So if all of them are accepted and
merged, then all will be consistent.

If only some of the patches are accepted, I will rewrite the commit
message so it doesn't mention consistency.

Thanks!
AD

> Otherwise it looks fine.
>
> cheers
Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'
On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
> The 'arg' argument to copy_thread() is only ever used when forking a new
> kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency
> with do_fork() and other arch-specific implementations of copy_thread()).

I don't understand the bit about consistency with do_fork() ?

Otherwise it looks fine.

cheers
[PATCH] powerpc/powernv: Fix return value from power7_nap() et al.
The power7_nap(), power7_sleep() and power7_winkle() functions are
called from pnv_smp_cpu_kill_self(), which expects them to return the
SRR1 value set by the hardware on wakeup, or 0 if no nap/sleep/winkle
occurred.  However, when an interrupt needs to be replayed, no
power-saving mode is entered, and the logic in power7_powersave_common
(the common code for power7_nap et al.) doesn't set r3 to 0.  Instead
what we get as the return value is the selector for the type of
power-saving mode requested (1, 2 or 3).

In fact this should not affect the operation of pnv_smp_cpu_kill_self(),
but it is better to get this correct, so this adds an instruction to
set r3 to 0 in this case.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/kernel/idle_power7.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 05adc8b..eeaa0d5 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -94,6 +94,7 @@ _GLOBAL(power7_powersave_common)
 	beq	1f
 	addi	r1,r1,INT_FRAME_SIZE
 	ld	r0,16(r1)
+	li	r3,0			/* Return 0 (no nap) */
 	mtlr	r0
 	blr
-- 
2.1.4
Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
2015-03-20 1:54 GMT+08:00 Bjorn Helgaas <bhelg...@google.com>:
> On Thu, Mar 19, 2015 at 11:18 AM, Wei Yang <weiyang.ker...@gmail.com> wrote:
>> Oh, I thought you are not comfortable with the Patch v12 10/21
>> PCI: Consider additional PF's IOV BAR alignment
>> ...
>> V14 is ready to send which is based on v4.0-rc1.
>
> Unless I missed something, the last email in that thread [1] is from
> you, so I think we're ready for the next iteration.
>
> [1] http://lkml.kernel.org/r/20150224083406.32124.65957.st...@bhelgaas-glaptop2.roam.corp.google.com

Great~~~ I thought you didn't get a chance to read it.
Will send out a v14 ASAP.

-- 
Richard Yang
Help You, Help Me
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
>>
>> So I think there's something I'm missing. For non-shared mappings, I
>> still have the idea that pte_dirty should be the same as pte_write.
>> And yet, your testing of 3.19 shows that it's a big difference.
>> There's clearly something I'm completely missing.
>
> Ahh. The normal page table scanning and page fault handling both clear
> and set the dirty bit together with the writable one. But fork()
> will clear the writable bit without clearing dirty. For some reason I
> thought it moved the dirty bit into the struct page like the VM
> scanning does, but that was just me having a brainfart. So yeah,
> pte_dirty doesn't have to match pte_write even under perfectly normal
> circumstances. Maybe there are other cases.
>
> Not that I see a lot of forking in the xfs repair case either, so..
>
> Dave, mind re-running the plain 3.19 numbers to really verify that the
> pte_dirty/pte_write change really made that big of a difference. Maybe
> your recollection of ~55,000 migrate_pages events was faulty. If the
> pte_write ->pte_dirty change is the *only* difference, it's still very
> odd how that one difference would make migrate_rate go from ~55k to
> 471k. That's an order of magnitude difference, for what really
> shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email.
That said, the original measurement might have been faulty. I ran
the numbers again on the 3.19 kernel I saved away from the original
testing. That came up at 235k, which is pretty much the same as
yesterday's test. The runtime, however, is unchanged from my original
measurements of 4m54s (pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when
I started the measurement - all the numbers since I did the bisect
have come from starting it at roughly 130AGs into phase 3 where the
memory footprint stabilises and the tlb flush overhead kicks in.

However, if I start the measurement at the same time as the repair
test, I get something much closer to the 55k number. I also note
that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the
original numbers weren't representative of the steady state
behaviour and so can be ignored...

> Maybe a system update has changed libraries and memory allocation
> patterns, and there is something bigger than that one-liner
> pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but
otherwise the system libraries are unchanged.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
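The point made in this thread about fork() (write permission is dropped while the dirty bit stays set) can be illustrated with a toy model. This is a sketch only: the two-bit `model_pte_t` and every `model_*` helper are hypothetical and do not correspond to any real architecture's PTE layout or to the kernel's pte_*() helpers.

```c
#include <stdbool.h>

/* Hypothetical two-bit PTE; real layouts are architecture-specific. */
#define MODEL_PTE_WRITE 0x1u
#define MODEL_PTE_DIRTY 0x2u

typedef unsigned int model_pte_t;

static bool model_pte_write(model_pte_t pte) { return pte & MODEL_PTE_WRITE; }
static bool model_pte_dirty(model_pte_t pte) { return pte & MODEL_PTE_DIRTY; }

/* A write fault on a private page sets writable and dirty together. */
static model_pte_t model_write_fault(model_pte_t pte)
{
	return pte | MODEL_PTE_WRITE | MODEL_PTE_DIRTY;
}

/* fork() write-protects for copy-on-write but leaves the dirty bit alone. */
static model_pte_t model_fork_wrprotect(model_pte_t pte)
{
	return pte & ~MODEL_PTE_WRITE;
}
```

After a write fault followed by a fork-style write-protect, the modelled PTE is dirty but not writable, which is the case where a pte_dirty-based check and a pte_write-based check disagree.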
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner <da...@fromorbit.com> wrote:
>>
>> My recollection wasn't faulty - I pulled it from an earlier email.
>> That said, the original measurement might have been faulty. I ran
>> the numbers again on the 3.19 kernel I saved away from the original
>> testing. That came up at 235k, which is pretty much the same as
>> yesterday's test. The runtime, however, is unchanged from my original
>> measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the more than an order of magnitude difference was
> really about measurement differences, not quite as real. Looks like
> more a factor of two than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the 1.4% from
> tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of
memory it allocates). It's not until the cache reaches full and it
starts recycling old buffers rather than allocating new ones that
the tlb flush problem dominates the profiles.

> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.
> It must be *very* unstable for it to make that big of a difference.
> But I feel at least a *bit* better about unstable algorithm changes a
> small variation into a factor-of-two vs that crazy factor-of-20.
>
> Can you try Mel's change to make it use
>
>         if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. Might take an hour or two before I get to it, though...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> Can you try Mel's change to make it use
>
>         if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happer with checking the vma
> details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           266,750      migrate:mm_migrate_pages      ( +-  7.43% )

      10.002032292 seconds time elapsed               ( +-  0.00% )

Bit more variance there than the pte checking, but runtime
difference is in the noise - 5m4s vs 4m54s - and profiles are
identical to the pte checking version.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH] powerpc: Export __spin_yield
On Wed, Feb 25, 2015 at 05:23:53PM -0600, Suresh E. Warrier wrote:
> Export __spin_yield so that the arch_spin_unlock() function can
> be invoked from a module. This will be required for modules where
> we want to take a lock that is also acquired in hypervisor
> real mode. Because we want to avoid running any lockdep code
> (which may not be safe in real mode), this lock needs to be
> an arch_spinlock_t instead of a normal spinlock.
>
> Signed-off-by: Suresh Warrier <warr...@linux.vnet.ibm.com>

Acked-by: Paul Mackerras <pau...@samba.org>
[PATCH v2] powerpc/powernv: Fixes for hypervisor doorbell handling
Since we can now use hypervisor doorbells for host IPIs, this makes
sure we clear the host IPI flag when taking a doorbell interrupt, and
clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we
already do for IPIs sent via the XICS interrupt controller).  Otherwise
if there did happen to be a leftover pending doorbell interrupt for
an offline CPU thread for any reason, it would prevent that thread from
going into a power-saving mode; it would instead keep waking up because
of the interrupt.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
This one actually compiles... (blush)

 arch/powerpc/include/asm/ppc-opcode.h |  3 +++
 arch/powerpc/include/asm/reg.h        |  3 +++
 arch/powerpc/kernel/dbell.c           |  2 ++
 arch/powerpc/platforms/powernv/smp.c  | 14 ++++++++++++--
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 03cd858..4cbe23a 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -153,6 +153,7 @@
 #define PPC_INST_MFSPR_PVR_MASK		0xfc1fffff
 #define PPC_INST_MFTMR			0x7c0002dc
 #define PPC_INST_MSGSND			0x7c00019c
+#define PPC_INST_MSGCLR			0x7c0001dc
 #define PPC_INST_MSGSNDP		0x7c00011c
 #define PPC_INST_MTTMR			0x7c0003dc
 #define PPC_INST_NOP			0x60000000
@@ -309,6 +310,8 @@
 					___PPC_RB(b) | __PPC_EH(eh))
 #define PPC_MSGSND(b)		stringify_in_c(.long PPC_INST_MSGSND | \
 					___PPC_RB(b))
+#define PPC_MSGCLR(b)		stringify_in_c(.long PPC_INST_MSGCLR | \
+					___PPC_RB(b))
 #define PPC_MSGSNDP(b)		stringify_in_c(.long PPC_INST_MSGSNDP | \
 					___PPC_RB(b))
 #define PPC_POPCNTB(a, s)	stringify_in_c(.long PPC_INST_POPCNTB | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1c874fb..af56b5c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -608,13 +608,16 @@
 #define   SRR1_ISI_N_OR_G	0x10000000 /* ISI: Access is no-exec or G */
 #define   SRR1_ISI_PROT		0x08000000 /* ISI: Other protection fault */
 #define   SRR1_WAKEMASK		0x00380000 /* reason for wakeup */
+#define   SRR1_WAKEMASK_P8	0x003c0000 /* reason for wakeup on POWER8 */
 #define   SRR1_WAKESYSERR	0x00300000 /* System error */
 #define   SRR1_WAKEEE		0x00200000 /* External interrupt */
 #define   SRR1_WAKEMT		0x00280000 /* mtctrl */
 #define	  SRR1_WAKEHMI		0x00280000 /* Hypervisor maintenance */
 #define   SRR1_WAKEDEC		0x00180000 /* Decrementer interrupt */
+#define   SRR1_WAKEDBELL	0x00140000 /* Privileged doorbell on P8 */
 #define   SRR1_WAKETHERM	0x00100000 /* Thermal management interrupt */
 #define	  SRR1_WAKERESET	0x00100000 /* System reset */
+#define   SRR1_WAKEHDBELL	0x000c0000 /* Hypervisor doorbell on P8 */
 #define	  SRR1_WAKESTATE	0x00030000 /* Powersave exit mask [46:47] */
 #define	  SRR1_WS_DEEPEST	0x00030000 /* Some resources not maintained,
 					  * may not be recoverable */
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f421781..2128f3a 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -17,6 +17,7 @@
 
 #include <asm/dbell.h>
 #include <asm/irq_regs.h>
+#include <asm/kvm_ppc.h>
 
 #ifdef CONFIG_SMP
 void doorbell_setup_this_cpu(void)
@@ -41,6 +42,7 @@ void doorbell_exception(struct pt_regs *regs)
 
 	may_hard_irq_enable();
 
+	kvmppc_set_host_ipi(smp_processor_id(), 0);
 	__this_cpu_inc(irq_stat.doorbell_irqs);
 
 	smp_ipi_demux();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index fc34025..38a4508 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -33,6 +33,8 @@
 #include <asm/runlatch.h>
 #include <asm/code-patching.h>
 #include <asm/dbell.h>
+#include <asm/kvm_ppc.h>
+#include <asm/ppc-opcode.h>
 
 #include "powernv.h"
 
@@ -149,7 +151,7 @@ static int pnv_smp_cpu_disable(void)
 static void pnv_smp_cpu_kill_self(void)
 {
 	unsigned int cpu;
-	unsigned long srr1;
+	unsigned long srr1, wmask;
 	u32 idle_states;
 
 	/* Standard hot unplug procedure */
@@ -161,6 +163,10 @@ static void pnv_smp_cpu_kill_self(void)
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
 
+	wmask = SRR1_WAKEMASK;
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		wmask = SRR1_WAKEMASK_P8;
+
 	idle_states = pnv_get_supported_cpuidle_states();
 	/* We don't want to take decrementer interrupts while we are offline,
 	 * so clear LPCR:PECE1. We keep
Re: [alsa-devel] [PATCH 1/7 linux-next] ALSA: aoa: constify of_device_id array
At Wed, 18 Mar 2015 17:48:56 +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)
>
> Signed-off-by: Fabian Frederick <f...@skynet.be>

Thanks, applied this one.
The rest ASoC patches are left to Mark.

Takashi

> ---
>  sound/aoa/soundbus/i2sbus/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sound/aoa/soundbus/i2sbus/core.c b/sound/aoa/soundbus/i2sbus/core.c
> index b9737fa..1cbf210 100644
> --- a/sound/aoa/soundbus/i2sbus/core.c
> +++ b/sound/aoa/soundbus/i2sbus/core.c
> @@ -31,7 +31,7 @@ module_param(force, int, 0444);
>  MODULE_PARM_DESC(force, "Force loading i2sbus even when"
>  			" no layout-id property is present");
>  
> -static struct of_device_id i2sbus_match[] = {
> +static const struct of_device_id i2sbus_match[] = {
>  	{ .name = "i2s" },
>  	{ }
>  };
> -- 
> 1.9.1
>
_______________________________________________
Alsa-devel mailing list
alsa-de...@alsa-project.org
http://mailman.alsa-project.org/mailman/listinfo/alsa-devel
Re: [PATCH 2/2] leds/powernv: Add driver for PowerNV platform
Vasant Hegde <hegdevas...@linux.vnet.ibm.com> writes:
> From: Anshuman Khandual <khand...@linux.vnet.ibm.com>
>
> This patch implements LED driver for PowerNV platform using the existing
> generic LED class framework. It registers classdev structures for all
> individual LEDs detected on the system through LED specific device tree
> nodes. Device tree nodes specify what all kind of LEDs present on the
> same location code. It registers LED classdev structure for each of them.
>
> The platform level implementation of LED get and set state has been
> achieved through OPAL calls. These calls are made available for the
> driver by exporting from architecture specific codes.
>
> As per the LED class framework, the 'brightness_set' function should not
> sleep. Hence these functions have been implemented through global work
> queue tasks which might sleep on OPAL async call completion.
>
> All the system LEDs can be found in the same regular path /sys/class/leds/.
> There are two different kind of LEDs present for the same location code,
> one being the identify indicator and other one being the fault indicator.
> We don't use LED colors. Hence our LEDs have names in this format.
>
>         location_code:IDENTIFY|FAULT
>
> Any positive brightness value would turn on the LED and a zero value
> would turn off the LED. The driver will return LED_FULL (255) for any
> turned on LED and LED_OFF for any turned off LED.
>
> Signed-off-by: Anshuman Khandual <khand...@linux.vnet.ibm.com>
> Signed-off-by: Vasant Hegde <hegdevas...@linux.vnet.ibm.com>

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>
Tested-by: Stewart Smith <stew...@linux.vnet.ibm.com>

(well, it boots, interacts with firmware. I didn't go and look at the
LEDs themselves).
[PATCH V14 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically
Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe and
was embedded in it. The pnv_ioda_pe was assigned to a PE on the bootup stage.
Since PEs are based on the hardware layout which is static in the system, they
will never get released. This means the iommu_table in the pnv_ioda_pe will
never get released either.

This no longer works for VF PE. VF PEs are created and released dynamically
when VFs are created and released. So we need to assign pnv_ioda_pe to VF PEs
respectively when VFs are enabled and clean up those resources for VF PE when
VFs are disabled. And iommu_table is one of the resources we need to handle
dynamically.

Current iommu_table is a static field in pnv_ioda_pe, which will face a
problem when freeing it. During the disabling of a VF,
pnv_pci_ioda2_release_dma_pe will call iommu_free_table to release the
iommu_table for this PE. A static iommu_table will fail in
iommu_free_table.

Given these requirements, this patch allocates iommu_table dynamically.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |    3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |    2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa370..5574eeb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
 	struct iommu_group *it_group;
 #endif
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+	void           *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index df4a295..1b37066 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -916,6 +916,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -1005,7 +1009,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1032,7 +1036,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1069,9 +1073,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1161,8 +1165,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
 				 __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1228,7 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1266,8 +1269,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	uint16_t window_id =
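Why the patch swaps container_of() for a back-pointer can be seen in a small user-space model. This is a sketch under stated assumptions: `struct model_table`, `struct model_pe` and the `model_*` helpers are hypothetical stand-ins for struct iommu_table and struct pnv_ioda_pe, calloc()/free() stand in for kzalloc_node()/iommu_free_table(), and the NULL check after allocation is an addition of the sketch rather than something taken from the hunk above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_table. */
struct model_table {
	void *data;                 /* back-pointer to the owning PE */
};

/* Hypothetical stand-in for struct pnv_ioda_pe. */
struct model_pe {
	int pe_number;
	struct model_table *table;  /* separately allocated, no longer embedded */
};

/* Allocate the table on its own and record its owner; 0 on success. */
static int model_pe_attach_table(struct model_pe *pe)
{
	pe->table = calloc(1, sizeof(*pe->table));
	if (!pe->table)
		return -1;
	pe->table->data = pe;
	return 0;
}

/*
 * Recover the owner from a table pointer. container_of() only works when
 * the table is a member stored inside the PE; once the PE merely points
 * at the table, the owner has to be stored explicitly.
 */
static struct model_pe *model_table_to_pe(struct model_table *tbl)
{
	return tbl->data;
}

/* Free the table independently of the PE, as a VF teardown would need. */
static void model_pe_release_table(struct model_pe *pe)
{
	free(pe->table);
	pe->table = NULL;
}
```

The trade-off is one extra pointer per table in exchange for a table whose lifetime is no longer tied to its PE.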
[PATCH V14 13/21] powerpc/pci: Don't unset PCI resources for VFs
Flag PCI_REASSIGN_ALL_RSRC is used to ignore resources information setup by
firmware, so that kernel would re-assign all resources of pci devices.

On powerpc arch, this happens in a header fixup function
pcibios_fixup_resources(), which will clean up the resources if this flag is
set. This works fine for PFs, since after clean up, kernel will re-assign the
resources in pcibios_resource_survey().

Below is a simple call flow on how it works:

    pcibios_init
      pcibios_scan_phb
        pci_scan_child_bus
          ...
            pci_device_add
              pci_fixup_device(pci_fixup_header)
                pcibios_fixup_resources           # header fixup
                  for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
                    dev->resource[i].start = 0
      pcibios_resource_survey                     # re-assign
        pcibios_allocate_resources

However, the VF resources won't be re-assigned, since the VF resources are
completely determined by the PF resources, and the PF resources have already
been reassigned. This means we need to leave VF's resources un-cleared in
pcibios_fixup_resources().

In this patch, we skip the resource unset process in
pcibios_fixup_resources(), if the pci_dev is a VF.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 2a525c9..8203101 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 		       pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;
-- 
1.7.9.5
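The effect of the early return can be modelled outside the kernel. The following is a minimal sketch, not the kernel function: `struct model_pci_dev`, `MODEL_RESOURCE_COUNT` and `model_fixup_resources()` are hypothetical, the `reassign_all` argument stands in for the PCI_REASSIGN_ALL_RSRC flag, and the real pcibios_fixup_resources() performs further checks that are omitted here.

```c
#include <stdbool.h>

#define MODEL_RESOURCE_COUNT 6   /* hypothetical stand-in for DEVICE_COUNT_RESOURCE */

struct model_resource {
	unsigned long start;
	unsigned long end;
};

struct model_pci_dev {
	bool is_virtfn;
	struct model_resource resource[MODEL_RESOURCE_COUNT];
};

/* Returns true if the resources were cleared for later re-assignment. */
static bool model_fixup_resources(struct model_pci_dev *dev, bool reassign_all)
{
	if (!reassign_all)
		return false;

	/* VF resources are derived from the PF's IOV BARs: leave them alone. */
	if (dev->is_virtfn)
		return false;

	for (int i = 0; i < MODEL_RESOURCE_COUNT; i++) {
		/* Keep the size, drop the firmware-assigned address. */
		dev->resource[i].end -= dev->resource[i].start;
		dev->resource[i].start = 0;
	}
	return true;
}
```

A PF passed through this model loses its addresses and waits for re-assignment, while a VF keeps whatever was computed from its PF.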
Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup
On Fri, 2015-03-20 at 11:55 +0700, Arseny Solokha wrote:
> And by the way, while revisiting the series I've noticed that though the patch
> 4/4 basically reverts [1], it leaves
>
> 	#define MPIC_GREG_GLOBAL_CONF_1		0x00030
>
> in arch/powerpc/include/asm/mpic.h untouched. That define also loses its uses
> after applying the patch. Compare the following hunk in today's patch w/ the
> one you committed:
>
> @@ -33,11 +33,6 @@
>  #define	MPIC_GREG_GCONF_NO_BIAS			0x10000000
>  #define	MPIC_GREG_GCONF_BASE_MASK		0x000fffff
>  #define	MPIC_GREG_GCONF_MCK			0x08000000
> -#define MPIC_GREG_GLOBAL_CONF_1		0x00030
> -#define	MPIC_GREG_GLOBAL_CONF_1_SIE		0x08000000
> -#define	MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK	0x70000000
> -#define	MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)	\
> -		(((r) << 28) & MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK)
>  #define MPIC_GREG_VENDOR_0		0x00040
>  #define MPIC_GREG_VENDOR_1		0x00050
>  #define MPIC_GREG_VENDOR_2		0x00060
>
> So the question is, should #define MPIC_GREG_GLOBAL_CONF_1 have been also
> removed, or could be left as is?
>
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html

OK, thanks for the thoroughness.

With #defines like that it's never clear if they should be removed or not. On
the one hand it's not used, so it should be removed. But, it can be useful to
keep the #defines there as documentation.

So I'm 50/50 on it. If you send me a patch to remove it I'll merge it, unless
someone else objects.

cheers
Re: [PATCH V14 00/21] Enable SRIOV on Power8
On Fri, Mar 20, 2015 at 11:06:16AM +0800, Wei Yang wrote:

[snip]

> ---
> v14:
>    * call ppc_md.pcibios_fixup_sriov() in pcibios_add_device
>    * add more explanation in change log
>    * Following patches have been reordered to the beginning.
>      EEH refactor to use pci_dn:
>      8ec20d6 powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
>      a3460fc powerpc/pci: Refactor pci_dn
>      These two patches will be modified to merge with other patches which
>      are under discussion/review in ppc mail list. Some changes may also be
>      made in other patches, which I didn't include them in this series, so
>      that the auto build robot could work on this.

The comment here isn't precise enough and not the things I suggested before.
Those 2 patches have been split into 3 patches (A/B/C). Some other EEH
cleanup/refactor patches depends on A/B and those patches would be merged
before your SRIOV patches to PowerPC tree. C, which I already sent to you,
need to be integrated to your patchset right after the following one:

  powerpc/pci: Don't unset PCI resources for VFs

I guess you can move the patches around after checking if Bjorn has further
concerns/comments.

Thanks,
Gavin

>      There may have several changes in powerpc arch, which not effect the
>      pci core. So after this patch set pass the review in pci community, I
>      would rebase this series on ppc branch and send out for comment.
>    * use add_res->min_align as the alignment in reassign_resources_sorted()
>    * some cleanup in Document
> v13:
>    * fix error in pcibios_iov_resource_alignment(), use pdev instead of dev
>    * rename vf_num to num_vfs in pcibios_sriov_enable(),
>      pnv_pci_vf_resource_shift(), pnv_pci_sriov_disable(),
>      pnv_pci_sriov_enable(), pnv_pci_ioda2_setup_dma_pe()
>    * add more explanation in commit powerpc/pci: Don't unset PCI resources
>      for VFs
>    * fix IOV BAR in hotplug path as well, and don't fixup an already added
>      device
>    * use roundup_pow_of_two() instead of __roundup_pow_of_two()
>    * this is based on v4.0-rc1
> v12:
>    * remove align parameter from pcibios_iov_resource_alignment()
>      default version returns pci_iov_resource_size() instead of the align
>      parameter
>    * in powerpc pcibios_iov_resource_alignment(), return
>      pci_iov_resource_size() if there's no ppc_md function pointer
>    * in pci_sriov_resource_alignment(), don't re-read base, since we saved
>      the required alignment when reading it the first time
>    * remove vf_num parameter from add_dev_pci_info() and
>      remove_dev_pci_info(); use pci_sriov_get_totalvfs() instead
>    * use dev_warn() instead of pr_warn() when possible
>    * check to be sure IOV BAR is still in range after shifting, change
>      pnv_pci_vf_resource_shift() from void to int
>    * improve sriov_enable() error message
>    * improve SR-IOV BAR sizing message
>    * index IOV resources in conventional style
>    * include preamble patches (refresh offset/stride when updating numVFs,
>      calculate max buses required)
>    * restructure pci_iov_max_bus_range() to return value instead of updating
>      internally, rename to virtfn_max_buses()
>    * fix typos & formatting
>    * expand documentation
> v11:
>    * fix some compile warning
> v10:
>    * remove weak function pcibios_iov_resource_size()
>      the VF BAR size is stored in pci_sriov structure and retrieved from
>      pci_iov_resource_size()
>    * Use Reserve additional instead of Expand to be more accurate in the
>      change log
>    * add log message to show the PF's IOV BAR final size
>    * add pcibios_sriov_enable/disable() weak function in
>      sriov_enable/disable() for arch setup before enable VFs. Like the arch
>      could fix up the BDF for VFs, since the change of NumVFs would affect
>      the BDF of VFs.
>    * Add some explanation of PE on Power arch in the documentation
> v9:
>    * make the change log consistent in the terminology
>      PF's IOV BAR -> the SRIOV BAR in PF
>      VF's BAR -> the normal BAR in VF's view
>    * rename all newly introduced function from _sriov_ to _iov_
>    * rename the document to
>      Documentation/powerpc/pci_iov_resource_on_powernv.txt
>    * add the vendor id and device id of the tested devices
>    * change return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
>      pci_iov_virtfn_devfn() when it is called on PF or SRIOV is not
>      configured
>    * rebase on 3.18-rc2 and tested
> v8:
>    * use weak function pcibios_sriov_resource_size() instead of some flag to
>      retrieve the IOV BAR size.
>    * add a document Documentation/powerpc/pci_resource.txt to explain the
>      design.
>    * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
>    * extract a function res_to_dev_res(), so that it is more general to get
>      additional size and alignment
>    * fix one contention which is introduced in powerpc/pci: Refactor pci_dn.
>      the root cause is pci_get_slot() takes pci_bus_sem and leads to dead
>      lock.
> v7:
>    * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
>    * when IOV BAR has IORESOURCE_ARCH flag,
Re: [PATCH V14 10/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
On Fri, Mar 20, 2015 at 11:06:26AM +0800, Wei Yang wrote:
>VFs are dynamically created when a driver enables them.  On some platforms,
>like PowerNV, special resources are necessary to enable VFs.  Add platform
>hooks for enabling and disabling VFs.
>
>Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
>---
> drivers/pci/iov.c |   19 +++
> 1 file changed, 19 insertions(+)
>
>diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>index 5643a10..64c4692 100644
>--- a/drivers/pci/iov.c
>+++ b/drivers/pci/iov.c
>@@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
> 	pci_dev_put(dev);
> }
>
>+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>+{
>+	return 0;
>+}
>+
> static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> {
> 	int rc;
>@@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> 	struct pci_sriov *iov = dev->sriov;
> 	int bars = 0;
> 	int bus;
>+	int retval;
>
> 	if (!nr_virtfn)
> 		return 0;
>@@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> 	if (nr_virtfn < initial)
> 		initial = nr_virtfn;
>
>+	if ((retval = pcibios_sriov_enable(dev, initial))) {
>+		dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
>+			retval);
>+		return retval;
>+	}
>+
> 	for (i = 0; i < initial; i++) {
> 		rc = virtfn_add(dev, i, 0);
> 		if (rc)
>@@ -335,6 +347,11 @@ failed:
> 	return rc;
> }
>
>+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
>+{
>+	return 0;
>+}
>+

Since you will have to post a v15 anyway, I would suggest dropping the
return value of this function. There doesn't seem to be a reason to have
an int return value here.

Thanks,
Gavin

> static void sriov_disable(struct pci_dev *dev)
> {
> 	int i;
>@@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
> 	for (i = 0; i < iov->num_VFs; i++)
> 		virtfn_remove(dev, i, 0);
>
>+	pcibios_sriov_disable(dev);
>+
> 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
> 	pci_cfg_access_lock(dev);
> 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
>-- 
>1.7.9.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
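The review comment above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming a GCC/Clang toolchain where `__attribute__((weak))` is available (the kernel's `__weak` is built on it). The `struct pci_dev` stand-in and the `sriov_enable_sketch()` / `sriov_disable_sketch()` wrappers are hypothetical names used only to mirror the shape of the patch, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
	int num_vfs;
};

/* Default enable hook: a platform may override it, and it can fail. */
__attribute__((weak)) int pcibios_sriov_enable(struct pci_dev *pdev,
					       unsigned short num_vfs)
{
	(void)pdev;
	(void)num_vfs;
	return 0;
}

/*
 * Default disable hook with the return value dropped, as suggested in
 * the review: the caller on the teardown path never checks it.
 */
__attribute__((weak)) void pcibios_sriov_disable(struct pci_dev *pdev)
{
	(void)pdev;
}

/* Enable path: the platform hook is checked before VFs are recorded. */
int sriov_enable_sketch(struct pci_dev *pdev, unsigned short num_vfs)
{
	int retval;

	assert(pdev != NULL);
	retval = pcibios_sriov_enable(pdev, num_vfs);
	if (retval)
		return retval;
	pdev->num_vfs = num_vfs;
	return 0;
}

/* Disable path: VFs go away first, then the platform hook runs. */
void sriov_disable_sketch(struct pci_dev *pdev)
{
	assert(pdev != NULL);
	pdev->num_vfs = 0;
	pcibios_sriov_disable(pdev);
}
```

A platform that needs extra setup would provide a non-weak definition of either hook in another object file, and the linker would pick it over the default.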
[PATCH V14 09/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn()
On PowerNV, some resource reservation is needed for SR-IOV VFs that don't
exist at the bootup stage.  To match resources with VFs, the code needs to
get the VF's BDF in advance.

Rename virtfn_bus() and virtfn_devfn() to pci_iov_virtfn_bus() and
pci_iov_virtfn_devfn() and export them.

[bhelgaas: changelog, make "busnr" int]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   28 
 include/linux/pci.h |   11 +++
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 2ae921f..5643a10 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,16 +19,20 @@
 
 #define VIRTFN_ID_LEN	16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
-				    dev->sriov->stride * id) >> 8);
+				    dev->sriov->stride * vf_id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return (dev->devfn + dev->sriov->offset +
-		dev->sriov->stride * id) & 0xff;
+		dev->sriov->stride * vf_id) & 0xff;
 }
 
 /*
@@ -58,11 +62,11 @@ static inline u8 virtfn_max_buses(struct pci_dev *dev)
 	struct pci_sriov *iov = dev->sriov;
 	int nr_virtfn;
 	u8 max = 0;
-	u8 busnr;
+	int busnr;
 
 	for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
 		pci_iov_set_numvfs(dev, nr_virtfn);
-		busnr = virtfn_bus(dev, nr_virtfn - 1);
+		busnr = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
 		if (busnr > max)
 			max = busnr;
 	}
@@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	struct pci_bus *bus;
 
 	mutex_lock(&iov->dev->sriov->lock);
-	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
 	if (!bus)
 		goto failed;
 
@@ -124,7 +128,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	if (!virtfn)
 		goto failed0;
 
-	virtfn->devfn = virtfn_devfn(dev, id);
+	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
 	virtfn->vendor = dev->vendor;
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
 	pci_setup_device(virtfn);
@@ -186,8 +190,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	struct pci_sriov *iov = dev->sriov;
 
 	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-					     virtfn_bus(dev, id),
-					     virtfn_devfn(dev, id));
+					     pci_iov_virtfn_bus(dev, id),
+					     pci_iov_virtfn_devfn(dev, id));
 	if (!virtfn)
 		return;
 
@@ -226,7 +230,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
-	u8 bus;
+	int bus;
 
 	if (!nr_virtfn)
 		return 0;
@@ -263,7 +267,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	bus = virtfn_bus(dev, nr_virtfn - 1);
+	bus = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
 	if (bus > dev->bus->busn_res.end) {
 		dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
 			nr_virtfn, bus, &dev->bus->busn_res);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1559658..99ea948 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1669,6 +1669,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1677,6 +1680,14 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
-- 
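The bus/devfn arithmetic in this patch can be tried in isolation. Below is a small userspace sketch, assuming a simplified `struct pf_model` (a hypothetical type, not the kernel's `struct pci_dev`) that carries only the fields the two helpers read. It also shows why the return type moves from `u8` to `int`: a negative errno can then be told apart from a valid bus or devfn value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified model of a PF; not the kernel's struct pci_dev. */
struct pf_model {
	int is_physfn;          /* non-zero if this function is a PF */
	unsigned char busnr;    /* bus number the PF sits on */
	unsigned char devfn;    /* PF device/function number */
	unsigned short offset;  /* First VF Offset */
	unsigned short stride;  /* VF Stride */
};

/* Bus number of VF vf_id: PF bus plus the carry out of the 8-bit devfn. */
int vf_bus(const struct pf_model *pf, int vf_id)
{
	if (!pf->is_physfn)
		return -EINVAL;
	return pf->busnr + ((pf->devfn + pf->offset + pf->stride * vf_id) >> 8);
}

/* devfn of VF vf_id: the low 8 bits of the same sum. */
int vf_devfn(const struct pf_model *pf, int vf_id)
{
	if (!pf->is_physfn)
		return -EINVAL;
	return (pf->devfn + pf->offset + pf->stride * vf_id) & 0xff;
}
```

For a PF on bus 6 at devfn 0 with offset 0x80 and stride 2, VF 0 stays on bus 6 at devfn 0x80, while VF 64 wraps past devfn 0xff and lands on bus 7 at devfn 0.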
[PATCH V14 10/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()
VFs are dynamically created when a driver enables them.  On some platforms,
like PowerNV, special resources are necessary to enable VFs.  Add platform
hooks for enabling and disabling VFs.

Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5643a10..64c4692 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
+{
+	return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
 	int rc;
@@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
 	int bus;
+	int retval;
 
 	if (!nr_virtfn)
 		return 0;
@@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (nr_virtfn < initial)
 		initial = nr_virtfn;
 
+	if ((retval = pcibios_sriov_enable(dev, initial))) {
+		dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
+			retval);
+		return retval;
+	}
+
 	for (i = 0; i < initial; i++) {
 		rc = virtfn_add(dev, i, 0);
 		if (rc)
@@ -335,6 +347,11 @@ failed:
 	return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
+{
+	return 0;
+}
+
 static void sriov_disable(struct pci_dev *dev)
 {
 	int i;
@@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
 	for (i = 0; i < iov->num_VFs; i++)
 		virtfn_remove(dev, i, 0);
 
+	pcibios_sriov_disable(dev);
+
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-- 
1.7.9.5
[PATCH V14 21/21] powerpc/pci: Add PCI resource alignment documentation
In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to
be adjusted:

    1. size expanded
    2. aligned to M64BT size

This patch documents the reason for this change and how it is done.

[bhelgaas: reformat, clarify, expand]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt|  301 
 1 file changed, 301 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..b55c5cd
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,301 @@
+Wei Yang <weiy...@linux.vnet.ibm.com>
+Benjamin Herrenschmidt <b...@au1.ibm.com>
+Bjorn Helgaas <bhelg...@google.com>
+26 Aug 2014
+
+This document describes the requirement from hardware for PCI MMIO resource
+sizing and assignment on PowerKVM and how generic PCI code handles this
+requirement. The first two sections describe the concepts of Partitionable
+Endpoints and the implementation on P8 (IODA2). The next two sections talk
+about considerations on enabling SRIOV on IODA2.
+
+1. Introduction to Partitionable Endpoints
+
+A Partitionable Endpoint (PE) is a way to group the various resources
+associated with a device or a set of devices to provide isolation between
+partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
+to freeze a device that is causing errors in order to limit the possibility
+of propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of "frozen"
+state bits (one for MMIO and one for DMA, they get set together but can be
+cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's value. MSIs are also blocked. There's a bit more state that
+captures things like the details of the error that caused the freeze etc., but
+that's not critical.
+
+The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
+are matched to their corresponding PEs.
+
+The following section provides a rough description of what we have on P8
+(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
+is a completely separate HW entity that replicates the entire logic, so has
+its own set of PEs, etc.
+
+2. Implementation of Partitionable Endpoints on P8 (IODA2)
+
+P8 supports up to 256 Partitionable Endpoints per PHB.
+
+  * Inbound
+
+    For DMA, MSIs and inbound PCIe error messages, we have a table (in
+    memory but accessed in HW by the chip) that provides a direct
+    correspondence between a PCIe RID (bus/dev/fn) with a PE number.
+    We call this the RTT.
+
+    - For DMA we then provide an entire address space for each PE that can
+      contain two "windows", depending on the value of PCI address bit 59.
+      Each window can be configured to be remapped via a "TCE table" (IOMMU
+      translation table), which has various configurable characteristics
+      not described here.
+
+    - For MSIs, we have two windows in the address space (one at the top of
+      the 32-bit space and one much higher) which, via a combination of the
+      address and MSI value, will result in one of the 2048 interrupts per
+      bridge being triggered.  There's a PE# in the interrupt controller
+      descriptor table as well which is compared with the PE# obtained from
+      the RTT to "authorize" the device to emit that specific interrupt.
+
+    - Error messages just use the RTT.
+
+  * Outbound.  That's where the tricky part is.
+
+    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
+    from the CPU address space to the PCI address space.  There is one M32
+    window and sixteen M64 windows.  They have different characteristics.
+    First what they have in common: they forward a configurable portion of
+    the CPU address space to the PCIe bus and must be naturally aligned
+    power of two in size.  The rest is different:
+
+    - The M32 window:
+
+      * Is limited to 4GB in size.
+
+      * Drops the top bits of the address (above the size) and replaces
+        them with a configurable value.  This is typically used to generate
+        32-bit PCIe accesses.  We configure that window at boot from FW and
+        don't touch it from Linux; it's usually set to forward a 2GB
+        portion of address space from the CPU to PCIe
+        0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
+        reserved for MSIs but this is not a problem at this point; we just
+        need to ensure Linux doesn't assign anything there, the M32 logic
+        ignores that however and will forward in that space if we try).
+
+      * It is divided into 256 segments of equal size.  A table in the chip
+        maps each segment to a PE#.  That allows portions of the MMIO space
+        to be assigned to PEs on a segment
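The M32 segment lookup described in the documentation above can be sketched in a few lines. This is an illustrative userspace model only: `struct m32_window`, its field names, and the window base used in the usage note are assumptions made for the example, and the real table lives in the PHB hardware rather than in a C array.

```c
#include <assert.h>
#include <stdint.h>

#define M32_SEGMENTS 256u

/* Hypothetical model of an M32 window and its segment-to-PE table. */
struct m32_window {
	uint64_t base;                    /* CPU address where the window starts */
	uint64_t size;                    /* window size, a power of two */
	uint8_t seg_to_pe[M32_SEGMENTS];  /* PE# for each segment */
};

/* Size of one segment: the window is split into 256 equal parts. */
uint64_t m32_segment_size(const struct m32_window *w)
{
	assert(w->size >= M32_SEGMENTS);
	return w->size / M32_SEGMENTS;
}

/* Return the PE# that owns addr, or -1 if addr is outside the window. */
int m32_addr_to_pe(const struct m32_window *w, uint64_t addr)
{
	if (addr < w->base || addr - w->base >= w->size)
		return -1;
	return w->seg_to_pe[(addr - w->base) / m32_segment_size(w)];
}
```

With a 2GB window, each segment is 2GB / 256 = 8MB, so an address 8MB above the window base falls in segment 1 and resolves to whatever PE# the table holds for that segment.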
[PATCH V14 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
M64 aperture size is limited on PHB3.  When the IOV BAR is too big, it
exceeds the limit and fails to be assigned.

Introduce a different mechanism based on the IOV BAR size:

  - if IOV BAR size is smaller than 64MB, expand to total_pe
  - if IOV BAR size is bigger than 64MB, roundup power2

[bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 3c95097..d6942c9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -179,6 +179,8 @@ struct pci_dn {
 	u16     vfs_expanded;		/* number of VFs IOV BAR expanded */
 	u16     num_vfs;		/* number of VFs enabled*/
 	int     offset;			/* PE# for the first VF PE */
+#define M64_PER_IOV 4
+	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
 	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index bd1b678..89bbcc4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2246,6 +2246,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	int i;
 	resource_size_t size;
 	struct pci_dn *pdn;
+	int mul, total_vfs;
 
 	if (!pdev->is_physfn || pdev->is_added)
 		return;
@@ -2256,6 +2257,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	pdn = pci_get_pdn(pdev);
 	pdn->vfs_expanded = 0;
 
+	total_vfs = pci_sriov_get_totalvfs(pdev);
+	pdn->m64_per_iov = 1;
+	mul = phb->ioda.total_pe;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &pdev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, "non M64 VF BAR%d: %pR\n",
+				 i, res);
+			continue;
+		}
+
+		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
+
+		/* bigger than 64M */
+		if (size > (1 << 26)) {
+			dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
+				 i, res);
+			pdn->m64_per_iov = M64_PER_IOV;
+			mul = roundup_pow_of_two(total_vfs);
+			break;
+		}
+	}
+
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &pdev->resource[i + PCI_IOV_RESOURCES];
 		if (!res->flags || res->parent)
@@ -2268,12 +2295,12 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, "Fixing VF BAR%d: %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "%pR\n", res);
 		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
-				i, res, phb->ioda.total_pe);
+			 i, res, mul);
 	}
-	pdn->vfs_expanded = phb->ioda.total_pe;
+	pdn->vfs_expanded = mul;
 }
 #endif /* CONFIG_PCI_IOV */
 
-- 
1.7.9.5
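The sizing rule in this patch reduces to one decision: pick a multiplier for the IOV BAR depending on whether any VF BAR is larger than 64MB. A minimal userspace sketch of that decision follows, assuming the garbled comparison in the diff reads `size > (1 << 26)`. The `struct iov_plan` type, `plan_iov_expansion()`, and the local `roundup_pow2()` helper are hypothetical names for illustration; a zero size here stands in for an unused BAR.

```c
#include <assert.h>
#include <stdint.h>

#define M64_PER_IOV 4

/* Userspace stand-in for the kernel's roundup_pow_of_two(), for n >= 1. */
static uint64_t roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical result type: M64 windows per IOV BAR and the multiplier. */
struct iov_plan {
	int m64_per_iov;
	uint64_t mul;
};

/*
 * Default: expand each IOV BAR to total_pe VF-sized slots.  If any VF BAR
 * is bigger than 64MB, fall back to roundup_pow2(total_vfs) slots and use
 * M64_PER_IOV windows per BAR.
 */
struct iov_plan plan_iov_expansion(const uint64_t *vf_bar_sizes, int nr_bars,
				   unsigned int total_pe,
				   unsigned int total_vfs)
{
	struct iov_plan plan = { 1, total_pe };
	int i;

	assert(total_vfs >= 1);
	for (i = 0; i < nr_bars; i++) {
		if (vf_bar_sizes[i] == 0)	/* unused BAR */
			continue;
		if (vf_bar_sizes[i] > (1ull << 26)) {
			plan.m64_per_iov = M64_PER_IOV;
			plan.mul = roundup_pow2(total_vfs);
			break;
		}
	}
	return plan;
}
```

The expanded BAR then spans `size * plan.mul` bytes, matching `res->end = res->start + size * mul - 1` in the patch.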
Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup
On Fri, 2015-03-20 at 10:56 +0700, Arseny Solokha wrote:
> This series removes unused functions from powerpc tree that I've been able
> to discover.
>
> Two machines at hands, e300 and e500 based, boot and run without regressions
> on my workload with this series applied. The removed code seems also been
> rarely touched, so it seems the series is safe at least in general. But
> I can't obviously express any strong point in support of the series, so it's
> completely OK to leave things as is.
>
> v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from
> arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao
> ("powerpc/85xx: workaround for chips with MSI hardware errata") makes
> use of it.

Sorry, too late.

https://git.kernel.org/cgit/linux/kernel/git/mpe/linux.git/commit/?h=next&id=5e86bfde9cd93f272844c3ff6ac5f93d3666b3e7

The patch that needs it can just add it back.

cheers
[PATCH V14 00/21] Enable SRIOV on Power8
This patchset enables SRIOV on POWER8.

The general idea is to put each VF into its own PE and allocate the required
resources, such as MMIO/DMA/MSI. The major difficulty comes from the MMIO
allocation and the adjustment of the PF's IOV BAR.

On P8, we use M64BT to cover a PF's IOV BAR, which lets an individual VF sit
in its own PE. This gives more flexibility, but at the same time it brings
some restrictions on the PF's IOV BAR size and alignment.

To achieve this effect, we need to adjust the PCI device's resources:
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. IOV BAR alignment is calculated by an arch-dependent function instead of
   an individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit "PCI: Take additional IOV BAR alignment in
   sizing and assigning".

Test Environment:
   The SRIOV devices tested are Emulex Lancer (10df:e220) and
   Mellanox ConnectX-3 (15b3:1003) on POWER8.

Examples of passing a VF through to a guest via vfio:
   1. unbind the original driver and bind to the vfio-pci driver
      echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
      echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
      Note: this should be done for each device in the same iommu_group
   2. Start qemu and pass the device through vfio
      /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
          -M pseries -m 2048 -enable-kvm -nographic \
          -drive file=/home/ywywyang/kvm/fc19.img \
          -monitor telnet:localhost:5435,server,nowait -boot cd \
          -device spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6

Verify this is the exact VF response:
   1. ping from a machine in the same subnet (the broadcast domain)
   2. run arp -n on this machine
      9.115.251.20 ether 00:00:c9:df:ed:bf C eth0
   3. ifconfig in the guest
      # ifconfig eth1
      eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
           inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
           inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
           ether 00:00:c9:df:ed:bf  txqueuelen 1000  (Ethernet)
           RX packets 175  bytes 13278 (12.9 KiB)
           RX errors 0  dropped 0  overruns 0  frame 0
           TX packets 58  bytes 9276 (9.0 KiB)
           TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
   4. They have the same MAC address

   Note: make sure you shut down other network interfaces in the guest.

---
v14:
   * call ppc_md.pcibios_fixup_sriov() in pcibios_add_device
   * add more explanation in change log
   * The following patches have been reordered to the beginning
     (EEH refactor to use pci_dn):
       8ec20d6 powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
       a3460fc powerpc/pci: Refactor pci_dn
     These two patches will be modified to merge with other patches which
     are under discussion/review on the ppc mailing list. Some changes may
     also be made in other patches, which I didn't include in this series,
     so that the auto build robot could work on this.
     There may be several changes in the powerpc arch which do not affect
     the pci core. So after this patch set passes review in the pci
     community, I would rebase this series on the ppc branch and send it
     out for comment.
   * use add_res->min_align as the alignment in reassign_resources_sorted()
   * some cleanup in Document
v13:
   * fix error in pcibios_iov_resource_alignment(), use pdev instead of dev
   * rename vf_num to num_vfs in pcibios_sriov_enable(),
     pnv_pci_vf_resource_shift(), pnv_pci_sriov_disable(),
     pnv_pci_sriov_enable(), pnv_pci_ioda2_setup_dma_pe()
   * add more explanation in commit "powerpc/pci: Don't unset PCI resources
     for VFs"
   * fix IOV BAR in hotplug path as well, and don't fixup an already added
     device
   * use roundup_pow_of_two() instead of __roundup_pow_of_two()
   * this is based on v4.0-rc1
v12:
   * remove "align" parameter from pcibios_iov_resource_alignment();
     default version returns pci_iov_resource_size() instead of the
     "align" parameter
   * in powerpc pcibios_iov_resource_alignment(), return
     pci_iov_resource_size() if there's no ppc_md function pointer
   * in pci_sriov_resource_alignment(), don't re-read base, since we saved
     the required alignment when reading it the first time
   * remove "vf_num" parameter from add_dev_pci_info() and
     remove_dev_pci_info(); use pci_sriov_get_totalvfs() instead
   * use dev_warn() instead of pr_warn() when possible
   * check to be sure IOV BAR is still in range after shifting, change
     pnv_pci_vf_resource_shift() from void to int
[PATCH V14 08/21] PCI: Calculate maximum number of buses required for VFs
An SR-IOV device can change its First VF Offset and VF Stride based on the
values of ARI Capable Hierarchy and NumVFs.  The number of buses required
for all VFs is determined by NumVFs, First VF Offset, and VF Stride (see
SR-IOV spec r1.1, sec 2.1.2).  Previously pci_iov_bus_range() computed how
many buses would be required by TotalVFs, but this was based on a single
NumVFs value and may not have been the maximum for all NumVFs
configurations.

Iterate over all valid NumVFs and calculate the maximum number of bus
numbers that could ever be required for VFs of this device.

[bhelgaas: changelog, compute busnr of NumVFs, not TotalVFs, remove
kernel-doc comment marker]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |   31 +++
 drivers/pci/pci.h |    1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index a8752c2..2ae921f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -46,6 +46,30 @@ static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
 }
 
+/*
+ * The PF consumes one bus number.  NumVFs, First VF Offset, and VF Stride
+ * determine how many additional bus numbers will be consumed by VFs.
+ *
+ * Iterate over all valid NumVFs and calculate the maximum number of bus
+ * numbers that could ever be required.
+ */
+static inline u8 virtfn_max_buses(struct pci_dev *dev)
+{
+	struct pci_sriov *iov = dev->sriov;
+	int nr_virtfn;
+	u8 max = 0;
+	u8 busnr;
+
+	for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
+		pci_iov_set_numvfs(dev, nr_virtfn);
+		busnr = virtfn_bus(dev, nr_virtfn - 1);
+		if (busnr > max)
+			max = busnr;
+	}
+
+	return max;
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
 	struct pci_bus *child;
@@ -427,6 +451,7 @@ found:
 
 	dev->sriov = iov;
 	dev->is_physfn = 1;
+	iov->max_VF_buses = virtfn_max_buses(dev);
 
 	return 0;
 
@@ -556,15 +581,13 @@ void pci_restore_iov_state(struct pci_dev *dev)
 int pci_iov_bus_range(struct pci_bus *bus)
 {
 	int max = 0;
-	u8 busnr;
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (!dev->is_physfn)
 			continue;
-		busnr = virtfn_bus(dev, dev->sriov->total_VFs - 1);
-		if (busnr > max)
-			max = busnr;
+		if (dev->sriov->max_VF_buses > max)
+			max = dev->sriov->max_VF_buses;
 	}
 
 	return max ? max - bus->number : 0;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5732964..bae593c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -243,6 +243,7 @@ struct pci_sriov {
 	u16 stride;		/* following VF stride */
 	u32 pgsz;		/* page size for BAR alignment */
 	u8 link;		/* Function Dependency Link */
+	u8 max_VF_buses;	/* max buses consumed by VFs */
 	u16 driver_max_VFs;	/* max num VFs driver supports */
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
-- 
1.7.9.5
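Why the loop has to visit every NumVFs value, rather than only TotalVFs, is easier to see with a toy device. The sketch below is a userspace model under stated assumptions: `struct vf_layout`, the `layout_fn` callback, and the two example layouts are hypothetical and stand in for a device whose First VF Offset and VF Stride change when NumVFs is written.

```c
#include <assert.h>

/* Hypothetical device model: offset and stride as a function of NumVFs. */
struct vf_layout {
	unsigned int offset;
	unsigned int stride;
};

typedef struct vf_layout (*layout_fn)(int num_vfs);

/* A device whose offset and stride never change. */
struct vf_layout fixed_layout(int num_vfs)
{
	struct vf_layout l = { 1, 1 };

	(void)num_vfs;
	return l;
}

/* A device that spreads VFs one per bus when four or fewer are enabled. */
struct vf_layout sparse_layout(int num_vfs)
{
	struct vf_layout l = { 1, num_vfs <= 4 ? 0x100u : 1u };

	return l;
}

/*
 * Highest bus number any VF can land on, taken over every valid NumVFs.
 * For each NumVFs value only the last VF needs checking, since it has
 * the largest routing ID.
 */
int max_vf_bus(int pf_bus, int pf_devfn, int total_vfs, layout_fn layout)
{
	int max = 0;
	int n;

	assert(layout);
	for (n = 1; n <= total_vfs; n++) {
		struct vf_layout l = layout(n);
		int bus = pf_bus +
			  ((pf_devfn + l.offset + l.stride * (n - 1)) >> 8);

		if (bus > max)
			max = bus;
	}
	return max;
}
```

With a PF on bus 3 and 16 total VFs, the fixed layout never leaves bus 3, while the sparse layout reaches bus 6 when NumVFs is 4. Looking only at NumVFs = TotalVFs would report bus 3 for both and under-reserve bus numbers for the second device.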
[PATCH V14 07/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs
The First VF Offset and VF Stride fields depend on the NumVFs setting, so
refresh the cached fields in struct pci_sriov when updating NumVFs.  See
the SR-IOV spec r1.1, sec 3.3.9 and 3.3.10.

[bhelgaas: changelog, remove kernel-doc comment marker]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |   23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 27b98c3..a8752c2 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,21 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
 		dev->sriov->stride * id) & 0xff;
 }
 
+/*
+ * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
+ * change when NumVFs changes.
+ *
+ * Update iov->offset and iov->stride when NumVFs is written.
+ */
+static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
+{
+	struct pci_sriov *iov = dev->sriov;
+
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &iov->offset);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
 	struct pci_bus *child;
@@ -253,7 +268,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 			return rc;
 	}
 
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+	pci_iov_set_numvfs(dev, nr_virtfn);
 	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
@@ -282,7 +297,7 @@ failed:
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+	pci_iov_set_numvfs(dev, 0);
 	ssleep(1);
 	pci_cfg_access_unlock(dev);
 
@@ -313,7 +328,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
 	iov->num_VFs = 0;
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+	pci_iov_set_numvfs(dev, 0);
 }
 
 static int sriov_init(struct pci_dev *dev, int pos)
@@ -452,7 +467,7 @@ static void sriov_restore_state(struct pci_dev *dev)
 		pci_update_resource(dev, i);
 
 	pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
-	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->num_VFs);
+	pci_iov_set_numvfs(dev, iov->num_VFs);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
 	if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
 		msleep(100);
-- 
1.7.9.5
[PATCH V14 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
When the IOV BAR is big, each one is covered by 4 M64 windows.  This leads
to several VF PEs sitting in one PE in terms of M64.

Group VF PEs according to the M64 allocation.

[bhelgaas: use dev_printk() when possible]
Signed-off-by: Wei Yang <weiy...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  197 ++---
 2 files changed, 154 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d6942c9..ec83b51 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -182,7 +182,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 89bbcc4..8e8399f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1156,26 +1156,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1183,17 +1184,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (num_vfs <= M64_PER_IOV) ? num_vfs: M64_PER_IOV;
+		vf_per_group = (num_vfs <= M64_PER_IOV)? 1:
+			roundup_pow_of_two(num_vfs) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &pdev->resource[i + PCI_IOV_RESOURCES];
@@ -1203,35 +1220,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+
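The grouping arithmetic in pnv_pci_vf_assign_m64() can be checked on its own. The following is a userspace sketch, reading the comparison in the diff as `num_vfs <= M64_PER_IOV`; `struct vf_grouping`, `group_vfs()`, and the local `roundup_pow2()` helper are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define M64_PER_IOV 4

/* Userspace stand-in for the kernel's roundup_pow_of_two(), for n >= 1. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical result type: number of groups and VFs per group. */
struct vf_grouping {
	unsigned int vf_groups;
	unsigned int vf_per_group;
};

/*
 * With M64_PER_IOV windows per IOV BAR, up to four VFs each get their own
 * window; beyond that the VFs are split evenly across the four windows.
 * With a single window per IOV BAR there is one group.
 */
struct vf_grouping group_vfs(int m64_per_iov, unsigned int num_vfs)
{
	struct vf_grouping g = { 1, 1 };

	assert(num_vfs >= 1);
	if (m64_per_iov == M64_PER_IOV) {
		if (num_vfs <= M64_PER_IOV) {
			g.vf_groups = num_vfs;
			g.vf_per_group = 1;
		} else {
			g.vf_groups = M64_PER_IOV;
			g.vf_per_group = roundup_pow2(num_vfs) / m64_per_iov;
		}
	}
	return g;
}
```

For example, 12 VFs with four windows per BAR round up to 16 slots, giving four groups of four VFs, so several VF PEs end up sharing one M64 window as the changelog describes.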
[PATCH V14 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field
In struct pci_dn, the pcidev field is assigned but not used, so remove it.

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
Acked-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pci-bridge.h     |    1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index ec83b51..680ae56 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -168,7 +168,6 @@ struct pci_dn {
 
 	int	pci_ext_config_space;	/* for pci devices */
 
-	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev;		/* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8e8399f..3a79dfa 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1028,7 +1028,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 				pci_name(dev));
 			continue;
 		}
-		pdn->pcidev = dev;
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-- 
1.7.9.5
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
>
> Bit more variance there than the pte checking, but runtime
> difference is in the noise - 5m4s vs 4m54s - and profiles are
> identical to the pte checking version.

Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well as
the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything
right, we now have:

 - plain 3.19 (pte_write): 4m54s
 - 3.19 with vm_flags & VM_WRITE: 5m4s
 - 3.19 with pte_dirty: 5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
_much_ worse, but I'm wondering whether that VM_WRITE test will at
least shrink the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things
introduced. It would be good to see if the profiles then pop something
*else* up as the performance difference (which I'm sure will remain,
since the 7m50s was so far off).

                      Linus
[PATCH V14 01/21] powerpc/pci: Refactor pci_dn
From: Gavin Shan gws...@linux.vnet.ibm.com pci_dn is the extension of PCI device node and is created from device node. Unfortunately, VFs are enabled dynamically by PF's driver and they don't have corresponding device nodes, and pci_dn. Refactor pci_dn to support VFs: * pci_dn is organized as a hierarchy tree. VF's pci_dn is put to the child list of pci_dn of PF's bridge. pci_dn of other device put to the child list of pci_dn of its upstream bridge. * VF's pci_dn is expected to be created dynamically when PF enabling VFs. VF's pci_dn will be destroyed when PF disabling VFs. pci_dn of other device is still created from device node as before. * For one particular PCI device (VF or not), its pci_dn can be found from pdev-dev.archdata.firmware_data, PCI_DN(devnode), or parent's list. The fast path (fetching pci_dn through PCI device instance) is populated during early fixup time. [bhelgaas: add ifdef around add_one_dev_pci_info(), use dev_printk()] Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/include/asm/device.h |3 + arch/powerpc/include/asm/pci-bridge.h | 14 +- arch/powerpc/kernel/pci_dn.c | 245 - arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++ 4 files changed, 272 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h index 38faede..29992cd 100644 --- a/arch/powerpc/include/asm/device.h +++ b/arch/powerpc/include/asm/device.h @@ -34,6 +34,9 @@ struct dev_archdata { #ifdef CONFIG_SWIOTLB dma_addr_t max_direct_dma_addr; #endif +#ifdef CONFIG_PPC64 + void*firmware_data; +#endif #ifdef CONFIG_EEH struct eeh_dev *edev; #endif diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 546d036..513f8f2 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -89,6 +89,7 @@ struct pci_controller { #ifdef CONFIG_PPC64 unsigned long buid; + void *firmware_data; #endif /* CONFIG_PPC64 */ void *private_data; @@ -154,9 
+155,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address) struct iommu_table; struct pci_dn { + int flags; +#define PCI_DN_FLAG_IOV_VF 0x01 + int busno; /* pci bus number */ int devfn; /* pci device and function number */ + struct pci_dn *parent; struct pci_controller *phb;/* for pci devices */ struct iommu_table *iommu_table; /* for phb's or bridges */ struct device_node *node; /* back-pointer to the device_node */ @@ -171,14 +176,19 @@ struct pci_dn { #ifdef CONFIG_PPC_POWERNV int pe_number; #endif + struct list_head child_list; + struct list_head list; }; /* Get the pointer to a device_node's pci_dn */ #define PCI_DN(dn) ((struct pci_dn *) (dn)-data) +extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus, + int devfn); extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev); - -extern void * update_dn_pci_info(struct device_node *dn, void *data); +extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev); +extern void remove_dev_pci_info(struct pci_dev *pdev); +extern void *update_dn_pci_info(struct device_node *dn, void *data); static inline int pci_device_from_OF_node(struct device_node *np, u8 *bus, u8 *devfn) diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index 83df307..f3a1a81 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ b/arch/powerpc/kernel/pci_dn.c @@ -32,12 +32,223 @@ #include asm/ppc-pci.h #include asm/firmware.h +/* + * The function is used to find the firmware data of one + * specific PCI device, which is attached to the indicated + * PCI bus. For VFs, their firmware data is linked to that + * one of PF's bridge. For other devices, their firmware + * data is linked to that of their bridge. + */ +static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus) +{ + struct pci_bus *pbus; + struct device_node *dn; + struct pci_dn *pdn; + + /* +* We probably have virtual bus which doesn't +* have associated bridge. 
+*/ + pbus = bus; + while (pbus) { + if (pci_is_root_bus(pbus) || pbus-self) + break; + + pbus = pbus-parent; + } + + /* +* Except virtual bus, all PCI buses should +* have device nodes. +*/ + dn = pci_bus_to_OF_node(pbus); + pdn = dn ? PCI_DN(dn) : NULL; + + return pdn; +} + +struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus, + int devfn) +{ + struct device_node *dn = NULL; + struct pci_dn *parent,
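The lookup through the parent's child list that the changelog above describes can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel implementation: struct pdn_sketch, pdn_add_child() and pdn_find_child() are hypothetical names, and plain pointers stand in for the struct list_head members (child_list and list) that the patch adds to struct pci_dn.

```c
#include <stddef.h>

/* Simplified stand-in for struct pci_dn; only the fields needed here. */
struct pdn_sketch {
	int busno;                        /* pci bus number */
	int devfn;                        /* pci device and function number */
	struct pdn_sketch *parent;        /* upstream bridge's node */
	struct pdn_sketch *first_child;   /* stand-in for child_list */
	struct pdn_sketch *next_sibling;  /* stand-in for list */
};

/* Link a node (for example a dynamically created VF) under its parent. */
static void pdn_add_child(struct pdn_sketch *parent, struct pdn_sketch *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/* Slow path: walk the parent's children looking for a bus/devfn match. */
static struct pdn_sketch *pdn_find_child(struct pdn_sketch *parent,
					 int busno, int devfn)
{
	struct pdn_sketch *p;

	if (!parent)
		return NULL;

	for (p = parent->first_child; p; p = p->next_sibling)
		if (p->busno == busno && p->devfn == devfn)
			return p;

	return NULL;
}
```

The changelog's fast path (a pointer cached in pdev->dev.archdata.firmware_data) avoids this walk once the early fixup has populated it.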
[PATCH V14 02/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
The PCI config accessors previously relied on device_node. Unfortunately, VFs don't have a corresponding device_node, so change the accessors to use pci_dn instead. [bhelgaas: changelog] Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/platforms/powernv/eeh-powernv.c | 14 +- arch/powerpc/platforms/powernv/pci.c | 69 ++ arch/powerpc/platforms/powernv/pci.h |4 +- 3 files changed, 40 insertions(+), 47 deletions(-) diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index e261869..7a5021b 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -430,21 +430,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn) static int powernv_eeh_read_config(struct device_node *dn, int where, int size, u32 *val) { + struct pci_dn *pdn = PCI_DN(dn); + + if (!pdn) + return PCIBIOS_DEVICE_NOT_FOUND; + if (powernv_eeh_cfg_blocked(dn)) { *val = 0x; return PCIBIOS_SET_FAILED; } - return pnv_pci_cfg_read(dn, where, size, val); + return pnv_pci_cfg_read(pdn, where, size, val); } static int powernv_eeh_write_config(struct device_node *dn, int where, int size, u32 val) { + struct pci_dn *pdn = PCI_DN(dn); + + if (!pdn) + return PCIBIOS_DEVICE_NOT_FOUND; + if (powernv_eeh_cfg_blocked(dn)) return PCIBIOS_SET_FAILED; - return pnv_pci_cfg_write(dn, where, size, val); + return pnv_pci_cfg_write(pdn, where, size, val); } /** diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index e69142f..6c20d6e 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) spin_unlock_irqrestore(phb-lock, flags); } -static void pnv_pci_config_check_eeh(struct pnv_phb *phb, -struct device_node *dn) +static void pnv_pci_config_check_eeh(struct pci_dn *pdn) { + struct pnv_phb *phb = pdn-phb-private_data; u8 fstate; 
__be16 pcierr; int pe_no; @@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb, * setup that yet. So all ER errors should be mapped to * reserved PE. */ - pe_no = PCI_DN(dn)-pe_number; + pe_no = pdn-pe_number; if (pe_no == IODA_INVALID_PE) { if (phb-type == PNV_PHB_P5IOC2) pe_no = 0; @@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb, } cfg_dbg( - EEH check, bdfn=%04x PE#%d fstate=%x\n, - (PCI_DN(dn)-busno 8) | (PCI_DN(dn)-devfn), - pe_no, fstate); + (pdn-busno 8) | (pdn-devfn), pe_no, fstate); /* Clear the frozen state if applicable */ if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE || @@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb, } } -int pnv_pci_cfg_read(struct device_node *dn, +int pnv_pci_cfg_read(struct pci_dn *pdn, int where, int size, u32 *val) { - struct pci_dn *pdn = PCI_DN(dn); struct pnv_phb *phb = pdn-phb-private_data; u32 bdfn = (pdn-busno 8) | pdn-devfn; s64 rc; @@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn, return PCIBIOS_SUCCESSFUL; } -int pnv_pci_cfg_write(struct device_node *dn, +int pnv_pci_cfg_write(struct pci_dn *pdn, int where, int size, u32 val) { - struct pci_dn *pdn = PCI_DN(dn); struct pnv_phb *phb = pdn-phb-private_data; u32 bdfn = (pdn-busno 8) | pdn-devfn; @@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn, } #if CONFIG_EEH -static bool pnv_pci_cfg_check(struct pci_controller *hose, - struct device_node *dn) +static bool pnv_pci_cfg_check(struct pci_dn *pdn) { struct eeh_dev *edev = NULL; - struct pnv_phb *phb = hose-private_data; + struct pnv_phb *phb = pdn-phb-private_data; /* EEH not enabled ? */ if (!(phb-flags PNV_PHB_FLAG_EEH)) return true; /* PE reset or device removed ? 
*/ - edev = of_node_to_eeh_dev(dn); + edev = pdn-edev; if (edev) { if (edev-pe (edev-pe-state EEH_PE_CFG_BLOCKED)) @@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose, return true; } #else -static inline pnv_pci_cfg_check(struct pci_controller *hose, - struct device_node *dn) +static inline pnv_pci_cfg_check(struct pci_dn *pdn) {
[PATCH V14 17/21] powerpc/powernv: Shift VF resource with an offset
On PowerNV platform, resource position in M64 BAR implies the PE# the resource belongs to. In some cases, adjustment of a resource is necessary to locate it to a correct position in M64 BAR . This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address according to an offset. Note: After doing so, there would be a hole in the /proc/iomem when offset is a positive value. It looks like the device return some mmio back to the system, which actually no one could use it. [bhelgaas: rework loops, rework overlap check, index resource[] conventionally, remove pci_regs.h include, squashed with next patch] Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com --- arch/powerpc/include/asm/pci-bridge.h |4 + arch/powerpc/kernel/pci_dn.c | 13 + arch/powerpc/platforms/powernv/pci-ioda.c | 524 - arch/powerpc/platforms/powernv/pci.c | 18 + arch/powerpc/platforms/powernv/pci.h |7 + 5 files changed, 549 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index d0d1718..3c95097 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -177,6 +177,10 @@ struct pci_dn { int pe_number; #ifdef CONFIG_PCI_IOV u16 vfs_expanded; /* number of VFs IOV BAR expanded */ + u16 num_vfs;/* number of VFs enabled*/ + int offset; /* PE# for the first VF PE */ +#define IODA_INVALID_M64(-1) + int m64_wins[PCI_SRIOV_NUM_BARS]; #endif /* CONFIG_PCI_IOV */ #endif struct list_head child_list; diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index f3a1a81..93ed7b3 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ b/arch/powerpc/kernel/pci_dn.c @@ -217,6 +217,19 @@ void remove_dev_pci_info(struct pci_dev *pdev) struct pci_dn *pdn, *tmp; int i; + /* +* VF and VF PE are created/released dynamically, so we need to +* bind/unbind them. Otherwise the VF and VF PE would be mismatched +* when re-enabling SR-IOV. 
+*/ + if (pdev-is_virtfn) { + pdn = pci_get_pdn(pdev); +#ifdef CONFIG_PPC_POWERNV + pdn-pe_number = IODA_INVALID_PE; +#endif + return; + } + /* Only support IOV PF for now */ if (!pdev-is_physfn) return; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 93ec16c..bd1b678 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -44,6 +44,9 @@ #include powernv.h #include pci.h +/* 256M DMA window, 4K TCE pages, 8 bytes TCE */ +#define TCE32_TABLE_SIZE ((0x1000 / 0x1000) * 8) + static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, const char *fmt, ...) { @@ -56,11 +59,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, vaf.fmt = fmt; vaf.va = args; - if (pe-pdev) + if (pe-flags PNV_IODA_PE_DEV) strlcpy(pfix, dev_name(pe-pdev-dev), sizeof(pfix)); - else + else if (pe-flags (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) sprintf(pfix, %04x:%02x , pci_domain_nr(pe-pbus), pe-pbus-number); +#ifdef CONFIG_PCI_IOV + else if (pe-flags PNV_IODA_PE_VF) + sprintf(pfix, %04x:%02x:%2x.%d, + pci_domain_nr(pe-parent_dev-bus), + (pe-rid 0xff00) 8, + PCI_SLOT(pe-rid), PCI_FUNC(pe-rid)); +#endif /* CONFIG_PCI_IOV*/ printk(%spci %s: [PE# %.3d] %pV, level, pfix, pe-pe_number, vaf); @@ -591,7 +601,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb, bool is_add) { struct pnv_ioda_pe *slave; - struct pci_dev *pdev; + struct pci_dev *pdev = NULL; int ret; /* @@ -630,8 +640,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb, if (pe-flags (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS)) pdev = pe-pbus-self; - else + else if (pe-flags PNV_IODA_PE_DEV) pdev = pe-pdev-bus-self; +#ifdef CONFIG_PCI_IOV + else if (pe-flags PNV_IODA_PE_VF) + pdev = pe-parent_dev-bus-self; +#endif /* CONFIG_PCI_IOV */ while (pdev) { struct pci_dn *pdn = pci_get_pdn(pdev); struct pnv_ioda_pe *parent; @@ -649,6 +663,87 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb, 
return 0; } +#ifdef CONFIG_PCI_IOV +static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) +{ + struct pci_dev *parent; + uint8_t bcomp, dcomp, fcomp; + int64_t rc; + long rid_end, rid; + + /*
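The shift that the changelog above describes can be illustrated with a small standalone sketch. The excerpt is cut off before pnv_pci_vf_resource_shift() appears, so everything here is an assumption made for illustration: struct res_sketch and iov_bar_shift() are hypothetical names, the offset is assumed to be counted in units of one VF BAR size, and only the start of the range is assumed to move.

```c
#include <stdint.h>

/* Inclusive address range, in the spirit of struct resource. */
struct res_sketch {
	uint64_t start;
	uint64_t end;
};

/*
 * Move the start of an IOV BAR by `offset` VF-sized slots (assumed unit).
 * A positive offset leaves [old start, new start) uncovered, which is
 * the hole in /proc/iomem that the changelog mentions.
 * Returns 0 on success, -1 if the shifted start would leave the range.
 */
static int iov_bar_shift(struct res_sketch *res, uint64_t vf_bar_size,
			 int offset)
{
	int64_t delta = (int64_t)vf_bar_size * offset;
	uint64_t new_start = res->start + (uint64_t)delta;

	/* Also catches wrap-around when a negative shift underflows. */
	if (new_start > res->end)
		return -1;

	res->start = new_start;
	return 0;
}
```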
[PATCH 2/4] kvm/ppc/mpic: drop unused IRQ_testbit
Drop unused static procedure which doesn't have callers within its
translation unit. It had been already removed independently in QEMU[1]
from the OpenPIC implementation borrowed from the kernel.

[1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.html

Signed-off-by: Arseny Solokha asolo...@kb.kras.ru
Cc: Alexander Graf ag...@suse.de
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
---
 arch/powerpc/kvm/mpic.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 39b3a8f..a480d99 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -289,11 +289,6 @@ static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
-{
-	return test_bit(n_IRQ, q->queue);
-}
-
 static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
-- 
2.3.3
[PATCH 1/4] powerpc/boot: drop planetcore_set_serial_speed
Drop planetcore_set_serial_speed() which had no users since its inception in commit fec6047047fd ([POWERPC] bootwrapper: Add PlanetCore firmware support) in 2007. Signed-off-by: Arseny Solokha asolo...@kb.kras.ru --- arch/powerpc/boot/planetcore.c | 33 - arch/powerpc/boot/planetcore.h | 3 --- 2 files changed, 36 deletions(-) diff --git a/arch/powerpc/boot/planetcore.c b/arch/powerpc/boot/planetcore.c index 0d8558a..75117e6 100644 --- a/arch/powerpc/boot/planetcore.c +++ b/arch/powerpc/boot/planetcore.c @@ -131,36 +131,3 @@ void planetcore_set_stdout_path(const char *table) setprop_str(chosen, linux,stdout-path, path); } - -void planetcore_set_serial_speed(const char *table) -{ - void *chosen, *stdout; - u64 baud; - u32 baud32; - int len; - - chosen = finddevice(/chosen); - if (!chosen) - return; - - len = getprop(chosen, linux,stdout-path, prop_buf, MAX_PROP_LEN); - if (len = 0) - return; - - stdout = finddevice(prop_buf); - if (!stdout) { - printf(planetcore_set_serial_speed: - Bad /chosen/linux,stdout-path.\r\n); - - return; - } - - if (!planetcore_get_decimal(table, PLANETCORE_KEY_SERIAL_BAUD, - baud)) { - printf(planetcore_set_serial_speed: No SB tag.\r\n); - return; - } - - baud32 = baud; - setprop(stdout, current-speed, baud32, 4); -} diff --git a/arch/powerpc/boot/planetcore.h b/arch/powerpc/boot/planetcore.h index 0d4094f..d53c733 100644 --- a/arch/powerpc/boot/planetcore.h +++ b/arch/powerpc/boot/planetcore.h @@ -43,7 +43,4 @@ void planetcore_set_mac_addrs(const char *table); */ void planetcore_set_stdout_path(const char *table); -/* Sets the current-speed property in the serial node. */ -void planetcore_set_serial_speed(const char *table); - #endif -- 2.3.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 4/4] powerpc/mpic: remove unused functions
Drop unused mpic_set_clk_ratio() and mpic_set_serial_int(). Both functions are almost nine years old[1] but still have no chance to be called even from out-of-tree modules because they both are __init and of course aren't exported. [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html Signed-off-by: Arseny Solokha asolo...@kb.kras.ru Cc: Jia Hongtao hongtao@freescale.com --- arch/powerpc/include/asm/mpic.h | 11 --- arch/powerpc/sysdev/mpic.c | 25 - 2 files changed, 36 deletions(-) diff --git a/arch/powerpc/include/asm/mpic.h b/arch/powerpc/include/asm/mpic.h index 754f93d..3b39c28 100644 --- a/arch/powerpc/include/asm/mpic.h +++ b/arch/powerpc/include/asm/mpic.h @@ -33,11 +33,6 @@ #defineMPIC_GREG_GCONF_NO_BIAS 0x1000 #defineMPIC_GREG_GCONF_BASE_MASK 0x000f #defineMPIC_GREG_GCONF_MCK 0x0800 -#define MPIC_GREG_GLOBAL_CONF_10x00030 -#defineMPIC_GREG_GLOBAL_CONF_1_SIE 0x0800 -#defineMPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK 0x7000 -#defineMPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)\ - (((r) 28) MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK) #define MPIC_GREG_VENDOR_0 0x00040 #define MPIC_GREG_VENDOR_1 0x00050 #define MPIC_GREG_VENDOR_2 0x00060 @@ -496,11 +491,5 @@ extern unsigned int mpic_get_coreint_irq(void); /* Fetch Machine Check interrupt from primary mpic */ extern unsigned int mpic_get_mcirq(void); -/* Set the EPIC clock ratio */ -void mpic_set_clk_ratio(struct mpic *mpic, u32 clock_ratio); - -/* Enable/Disable EPIC serial interrupt mode */ -void mpic_set_serial_int(struct mpic *mpic, int enable); - #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_MPIC_H */ diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c index bbfbbf2..2c817a7 100644 --- a/arch/powerpc/sysdev/mpic.c +++ b/arch/powerpc/sysdev/mpic.c @@ -1676,31 +1676,6 @@ void __init mpic_init(struct mpic *mpic) mpic_err_int_init(mpic, MPIC_FSL_ERR_INT); } -void __init mpic_set_clk_ratio(struct mpic *mpic, u32 clock_ratio) -{ - u32 v; - - v = mpic_read(mpic-gregs, MPIC_GREG_GLOBAL_CONF_1); 
- v = ~MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK; - v |= MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(clock_ratio); - mpic_write(mpic-gregs, MPIC_GREG_GLOBAL_CONF_1, v); -} - -void __init mpic_set_serial_int(struct mpic *mpic, int enable) -{ - unsigned long flags; - u32 v; - - raw_spin_lock_irqsave(mpic_lock, flags); - v = mpic_read(mpic-gregs, MPIC_GREG_GLOBAL_CONF_1); - if (enable) - v |= MPIC_GREG_GLOBAL_CONF_1_SIE; - else - v = ~MPIC_GREG_GLOBAL_CONF_1_SIE; - mpic_write(mpic-gregs, MPIC_GREG_GLOBAL_CONF_1, v); - raw_spin_unlock_irqrestore(mpic_lock, flags); -} - void mpic_irq_set_priority(unsigned int irq, unsigned int pri) { struct mpic *mpic = mpic_find(irq); -- 2.3.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/4] powrepc/qe: drop unused ucc_slow_poll_transmitter_now
Drop ucc_slow_poll_transmitter_now() which has no users since its inception in 2007 in commit 986585385131 ([POWERPC] Add QUICC Engine (QE) infrastructure). Signed-off-by: Arseny Solokha asolo...@kb.kras.ru --- arch/powerpc/include/asm/ucc_slow.h | 13 - arch/powerpc/sysdev/qe_lib/ucc_slow.c | 5 - 2 files changed, 18 deletions(-) diff --git a/arch/powerpc/include/asm/ucc_slow.h b/arch/powerpc/include/asm/ucc_slow.h index c44131e..233ef5f 100644 --- a/arch/powerpc/include/asm/ucc_slow.h +++ b/arch/powerpc/include/asm/ucc_slow.h @@ -251,19 +251,6 @@ void ucc_slow_enable(struct ucc_slow_private * uccs, enum comm_dir mode); */ void ucc_slow_disable(struct ucc_slow_private * uccs, enum comm_dir mode); -/* ucc_slow_poll_transmitter_now - * Immediately forces a poll of the transmitter for data to be sent. - * Typically, the hardware performs a periodic poll for data that the - * transmit routine has set up to be transmitted. In cases where - * this polling cycle is not soon enough, this optional routine can - * be invoked to force a poll right away, instead. Proper use for - * each transmission for which this functionality is desired is to - * call the transmit routine and then this routine right after. - * - * uccs - (In) pointer to the slow UCC structure. - */ -void ucc_slow_poll_transmitter_now(struct ucc_slow_private * uccs); - /* ucc_slow_graceful_stop_tx * Smoothly stops transmission on a specified slow UCC. 
* diff --git a/arch/powerpc/sysdev/qe_lib/ucc_slow.c b/arch/powerpc/sysdev/qe_lib/ucc_slow.c index befaf11..5f91628 100644 --- a/arch/powerpc/sysdev/qe_lib/ucc_slow.c +++ b/arch/powerpc/sysdev/qe_lib/ucc_slow.c @@ -43,11 +43,6 @@ u32 ucc_slow_get_qe_cr_subblock(int uccs_num) } EXPORT_SYMBOL(ucc_slow_get_qe_cr_subblock); -void ucc_slow_poll_transmitter_now(struct ucc_slow_private * uccs) -{ - out_be16(uccs-us_regs-utodr, UCC_SLOW_TOD); -} - void ucc_slow_graceful_stop_tx(struct ucc_slow_private * uccs) { struct ucc_slow_info *us_info = uccs-us_info; -- 2.3.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
> >
> > Bit more variance there than the pte checking, but runtime
> > difference is in the noise - 5m4s vs 4m54s - and profiles are
> > identical to the pte checking version.
>
> Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well as
> the original !pte_write() test.
>
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
>
>  - plain 3.19 (pte_write): 4m54s
>  - 3.19 with vm_flags & VM_WRITE: 5m4s
>  - 3.19 with pte_dirty: 5m20s

*nod*

> so the pte_dirty version seems to have been a bad choice indeed.
>
> For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
> _much_ worse, but I'm wondering whether that VM_WRITE test will at
> least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOWs, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

	558,632      migrate:mm_migrate_pages    ( +- 6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
[PATCH V14 06/21] PCI: Index IOV resources in the conventional style
From: Bjorn Helgaas bhelg...@google.com

Most of PCI uses "res = &dev->resource[i]", not "res = dev->resource + i".
Use that style in iov.c also.

No functional change.

Signed-off-by: Bjorn Helgaas bhelg...@google.com
---
 drivers/pci/iov.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5bca0e1..27b98c3 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -95,7 +95,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	virtfn->multifunction = 0;
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		if (!res->parent)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
@@ -212,7 +212,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		bars |= (1 << (i + PCI_IOV_RESOURCES));
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		if (res->parent)
 			nres++;
 	}
@@ -373,7 +373,7 @@ found:
 
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		bar64 = __pci_read_base(dev, pci_bar_unknown, res,
 				pos + PCI_SRIOV_BAR + i * 4);
 		if (!res->flags)
@@ -417,7 +417,7 @@ found:
 
 failed:
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		res = dev->resource + PCI_IOV_RESOURCES + i;
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
 		res->flags = 0;
 	}
 
-- 
1.7.9.5
[PATCH V14 05/21] PCI: Keep individual VF BAR size in struct pci_sriov
Currently we don't store the individual VF BAR size. We calculate it when needed by dividing the PF's IOV resource size (which contains space for *all* the VFs) by total_VFs or by reading the BAR in the SR-IOV capability again. Keep the individual VF BAR size in struct pci_sriov.barsz[], add pci_iov_resource_size() to retrieve it, and use that instead of doing the division or reading the SR-IOV capability BAR. [bhelgaas: rename to barsz[], simplify barsz[] index computation, remove SR-IOV capability BAR sizing] Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com --- drivers/pci/iov.c | 39 --- drivers/pci/pci.h |1 + include/linux/pci.h |3 +++ 3 files changed, 24 insertions(+), 19 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 05f9d97..5bca0e1 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -57,6 +57,14 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus) pci_remove_bus(virtbus); } +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno) +{ + if (!dev-is_physfn) + return 0; + + return dev-sriov-barsz[resno - PCI_IOV_RESOURCES]; +} + static int virtfn_add(struct pci_dev *dev, int id, int reset) { int i; @@ -92,8 +100,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset) continue; virtfn-resource[i].name = pci_name(virtfn); virtfn-resource[i].flags = res-flags; - size = resource_size(res); - do_div(size, iov-total_VFs); + size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES); virtfn-resource[i].start = res-start + size * id; virtfn-resource[i].end = virtfn-resource[i].start + size - 1; rc = request_resource(res, virtfn-resource[i]); @@ -311,7 +318,7 @@ static void sriov_disable(struct pci_dev *dev) static int sriov_init(struct pci_dev *dev, int pos) { - int i; + int i, bar64; int rc; int nres; u32 pgsz; @@ -360,29 +367,29 @@ found: pgsz = ~(pgsz - 1); pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz); + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) + return 
-ENOMEM; + nres = 0; for (i = 0; i PCI_SRIOV_NUM_BARS; i++) { res = dev-resource + PCI_IOV_RESOURCES + i; - i += __pci_read_base(dev, pci_bar_unknown, res, -pos + PCI_SRIOV_BAR + i * 4); + bar64 = __pci_read_base(dev, pci_bar_unknown, res, + pos + PCI_SRIOV_BAR + i * 4); if (!res-flags) continue; if (resource_size(res) (PAGE_SIZE - 1)) { rc = -EIO; goto failed; } + iov-barsz[i] = resource_size(res); res-end = res-start + resource_size(res) * total - 1; dev_info(dev-dev, VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n, i, res, i, total); + i += bar64; nres++; } - iov = kzalloc(sizeof(*iov), GFP_KERNEL); - if (!iov) { - rc = -ENOMEM; - goto failed; - } - iov-pos = pos; iov-nres = nres; iov-ctrl = ctrl; @@ -414,6 +421,7 @@ failed: res-flags = 0; } + kfree(iov); return rc; } @@ -510,14 +518,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno) */ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno) { - struct resource tmp; - int reg = pci_iov_resource_bar(dev, resno); - - if (!reg) - return 0; - -__pci_read_base(dev, pci_bar_unknown, tmp, reg); - return resource_alignment(tmp); + return pci_iov_resource_size(dev, resno); } /** diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 4091f82..5732964 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -247,6 +247,7 @@ struct pci_sriov { struct pci_dev *dev;/* lowest numbered PF */ struct pci_dev *self; /* this PF */ struct mutex lock; /* lock for VF bus */ + resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ }; #ifdef CONFIG_PCI_ATS diff --git a/include/linux/pci.h b/include/linux/pci.h index 211e9da..1559658 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1675,6 +1675,7 @@ int pci_num_vf(struct pci_dev *dev); int pci_vfs_assigned(struct pci_dev *dev); int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs); int pci_sriov_get_totalvfs(struct pci_dev *dev); +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); #else static inline 
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn) {
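The arithmetic behind barsz[] in the patch above can be sketched outside the kernel. This is a minimal sketch under stated assumptions: struct range, iov_bar_expand() and vf_bar_range() are hypothetical names, with struct range standing in for struct resource and its inclusive end address. The formulas follow the lines visible in the diff (the PF's IOV resource grows to hold one BAR per VF, and VF number `id` gets the slice starting at start + size * id).

```c
#include <stdint.h>

/* Inclusive address range, in the spirit of struct resource. */
struct range {
	uint64_t start;
	uint64_t end;
};

static uint64_t range_size(const struct range *r)
{
	return r->end - r->start + 1;
}

/*
 * Grow a range that initially describes one VF BAR so that it holds
 * `total_vfs` of them. Returns the per-VF size, which is the value the
 * patch keeps in barsz[] instead of recomputing it by division later.
 */
static uint64_t iov_bar_expand(struct range *iov, unsigned int total_vfs)
{
	uint64_t barsz = range_size(iov);

	iov->end = iov->start + barsz * total_vfs - 1;
	return barsz;
}

/* Slice of the expanded IOV BAR that belongs to VF number `id`. */
static struct range vf_bar_range(const struct range *iov, uint64_t barsz,
				 unsigned int id)
{
	struct range vf;

	vf.start = iov->start + barsz * id;
	vf.end = vf.start + barsz - 1;
	return vf;
}
```

For a 4 KiB VF BAR at 0x1000 and 4 VFs, the IOV BAR spans 0x1000-0x4fff and VF 2 gets 0x3000-0x3fff.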
[PATCH V14 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
Implement pcibios_iov_resource_alignment() on the powernv platform.

On the PowerNV platform, there are three cases for the IOV BAR:

 1. initial state: the IOV BAR size is a multiple of the VF BAR size
 2. after expansion: the IOV BAR size is expanded to meet the M64 segment size
 3. sizing stage: the IOV BAR is truncated to 0

pnv_pci_iov_resource_alignment() handles these three cases respectively.

[bhelgaas: adjust to drop "align" parameter, return pci_iov_resource_size() if no ppc_md machdep_call version]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/machdep.h        |    1 +
 arch/powerpc/kernel/pci-common.c          |   10 ++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 1d72fda..37e451f 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -252,6 +252,7 @@ struct machdep_calls {

 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_dev *pdev);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *, int resno);
 #endif /* CONFIG_PCI_IOV */

 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 375bf70..9a306ff 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -130,6 +130,16 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }

+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int resno)
+{
+	if (ppc_md.pcibios_iov_resource_alignment)
+		return ppc_md.pcibios_iov_resource_alignment(pdev, resno);
+
+	return pci_iov_resource_size(pdev, resno);
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cadd3fb..93ec16c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1965,6 +1965,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }

+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+						      int resno)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t align, iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	align = pci_iov_resource_size(pdev, resno);
+	if (pdn->vfs_expanded)
+		return pdn->vfs_expanded * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2167,6 +2186,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_iov_resources;
+	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);

-- 
1.7.9.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
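The three IOV BAR cases handled by pnv_pci_iov_resource_alignment() can be modelled outside the kernel. What follows is only a sketch under stated assumptions: struct iov_bar_state and powernv_iov_alignment() are hypothetical names invented for illustration, standing in for the pci_dev resource, pci_iov_resource_size() and pdn->vfs_expanded that the patch consults.

```c
#include <stdint.h>

typedef uint64_t resource_size_t;

/* Hypothetical stand-in for the state the patch reads from pci_dev/pci_dn. */
struct iov_bar_state {
	resource_size_t iov_bar_size; /* current IOV BAR size, 0 during sizing */
	resource_size_t vf_bar_size;  /* size of a single VF BAR */
	unsigned int vfs_expanded;    /* VF count after expansion, 0 if none */
};

/* Same decision order as the patch: non-empty BAR first, then expansion. */
static resource_size_t powernv_iov_alignment(const struct iov_bar_state *s)
{
	/* Initial or expanded state: align to the whole IOV BAR. */
	if (s->iov_bar_size)
		return s->iov_bar_size;

	/* Sizing stage with expansion recorded: align to the expanded total. */
	if (s->vfs_expanded)
		return (resource_size_t)s->vfs_expanded * s->vf_bar_size;

	/* Sizing stage without expansion: plain per-VF BAR size. */
	return s->vf_bar_size;
}
```

The order of the checks matters: a truncated (zero-sized) BAR is the only case in which the per-VF size and the expansion factor are consulted.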
[PATCH v3 0/4] powerpc: trivial unused functions cleanup
This series removes unused functions from the powerpc tree that I've been able to discover. The two machines I have at hand, e300 and e500 based, boot and run without regressions on my workload with this series applied. The removed code also seems to have been rarely touched, so the series appears to be safe, at least in general. But I obviously can't make any strong argument in support of the series, so it's completely OK to leave things as they are.

v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao ("powerpc/85xx: workaround for chips with MSI hardware errata") makes use of it.

v2: Added a brief explanation to each patch description of why removed functions are unused, as suggested by Michael Ellerman.

Arseny Solokha (4):
  powerpc/boot: drop planetcore_set_serial_speed
  kvm/ppc/mpic: drop unused IRQ_testbit
  powrepc/qe: drop unused ucc_slow_poll_transmitter_now
  powerpc/mpic: remove unused functions

 arch/powerpc/boot/planetcore.c        | 33 ---------------------------------
 arch/powerpc/boot/planetcore.h        |  3 ---
 arch/powerpc/include/asm/mpic.h       | 11 -----------
 arch/powerpc/include/asm/ucc_slow.h   | 13 -------------
 arch/powerpc/kvm/mpic.c               |  5 -----
 arch/powerpc/sysdev/mpic.c            | 25 -------------------------
 arch/powerpc/sysdev/qe_lib/ucc_slow.c |  5 -----
 7 files changed, 95 deletions(-)

-- 
2.3.3
Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup
> On Fri, 2015-03-20 at 10:56 +0700, Arseny Solokha wrote:
>> This series removes unused functions from powerpc tree that I've been able to discover. Two machines at hands, e300 and e500 based, boot and run without regressions on my workload with this series applied. The removed code seems also been rarely touched, so it seems the series is safe at least in general. But I can't obviously express any strong point in support of the series, so it's completely OK to leave things as is.
>>
>> v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao (powerpc/85xx: workaround for chips with MSI hardware errata) makes use of it.
>
> Sorry, too late.
>
> https://git.kernel.org/cgit/linux/kernel/git/mpe/linux.git/commit/?h=next&id=5e86bfde9cd93f272844c3ff6ac5f93d3666b3e7
>
> The patch that needs it can just add it back.

I failed to notice that the series had finally been committed, so I resent it. Of course Hongtao can add the removed function back if he needs to.

And by the way, while revisiting the series I noticed that although patch 4/4 basically reverts [1], it leaves "#define MPIC_GREG_GLOBAL_CONF_1 0x00030" in arch/powerpc/include/asm/mpic.h untouched. That define also loses its users after applying the patch. Compare the following hunk in today's patch w/ the one you committed:

@@ -33,11 +33,6 @@
 #define MPIC_GREG_GCONF_NO_BIAS 0x1000
 #define MPIC_GREG_GCONF_BASE_MASK 0x000f
 #define MPIC_GREG_GCONF_MCK 0x0800
-#define MPIC_GREG_GLOBAL_CONF_1 0x00030
-#define MPIC_GREG_GLOBAL_CONF_1_SIE 0x0800
-#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK 0x7000
-#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)	\
-		(((r) << 28) & MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK)
 #define MPIC_GREG_VENDOR_0 0x00040
 #define MPIC_GREG_VENDOR_1 0x00050
 #define MPIC_GREG_VENDOR_2 0x00060

So the question is: should "#define MPIC_GREG_GLOBAL_CONF_1" also have been removed, or can it be left as is?

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html

Arsény

> cheers
Re: [PATCH 3/3] powerpc/dma: Remove ZONE_DMA completely
Benjamin Herrenschmidt b...@kernel.crashing.org writes:

> On Wed, 2015-03-18 at 19:49 +0530, Aneesh Kumar K.V wrote:
>
>> Why not do it also for DMA32 as below. We call limit_zone_pfn() only for few platforms. They can select ZONE_DMA32. Everywhere else ZONE_DMA32 is same as ZONE_NORMAL ? So instead of having confusing output for show_mem(), we now have
>
> There may be cases where we want ZONE_DMA32, what's confusing ?

To see output like below

Node 0 DMA32: 6*64kB (UM) 566*128kB (UEM) 367*256kB (UM) 185*512kB (UM) 59*1024kB (UEM) 11*2048kB (UM) 4*4096kB (UM) 3*8192kB (UEM) 4016*16384kB
Node 1 DMA32: 42*64kB (UEM) 14*128kB (UEM) 10*256kB (UM) 2*512kB (UM) 6*1024kB (UM) 7*2048kB (UEM) 5*4096kB (UM) 2*8192kB (U) 3644*16384kB (MR) =

That is, to find DMA32 covering more than a 4GB range.

-aneesh
[PATCH V14 04/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space
When we size VF BAR0, VF BAR1, etc., from the SR-IOV Capability of a PF, we learn the alignment requirement and amount of space consumed by a single VF. But when VFs are enabled, *each* of the NumVFs consumes that amount of space, so the total size of the PF resource is "VF BAR size * NumVFs".

Add a printk of the total space consumed by the VFs corresponding to what we already do for normal non-IOV BARs.

No functional change; new message only.

[bhelgaas: split out into its own patch]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c4c33ea..05f9d97 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -372,6 +372,8 @@ found:
 			goto failed;
 		}
 		res->end = res->start + resource_size(res) * total - 1;
+		dev_info(&dev->dev, "VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n",
+			 i, res, i, total);
 		nres++;
 	}

-- 
1.7.9.5
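The "VF BAR size * NumVFs" arithmetic behind the new message can be illustrated with a small standalone sketch. The struct and helper names below are simplified stand-ins assumed for illustration (the kernel's struct resource carries more fields), and grow_to_total_vfs() is a hypothetical name for the one-line end computation shown in the hunk.

```c
#include <stdint.h>

typedef uint64_t resource_size_t;

/* Simplified stand-in for the kernel's struct resource (inclusive end). */
struct resource_sketch {
	resource_size_t start;
	resource_size_t end;
};

/* Size of an inclusive [start, end] range. */
static resource_size_t resource_size_sketch(const struct resource_sketch *res)
{
	return res->end - res->start + 1;
}

/*
 * On entry res covers one VF BAR; on exit it covers the space needed for
 * `total` VFs, as in: res->end = res->start + resource_size(res) * total - 1
 */
static void grow_to_total_vfs(struct resource_sketch *res, unsigned int total)
{
	res->end = res->start + resource_size_sketch(res) * total - 1;
}
```

For example, a 4 KiB VF BAR starting at 0x1000 and 4 VFs gives a PF resource of 0x1000-0x4fff, i.e. 16 KiB.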
[PATCH V14 03/21] PCI: Print more info in sriov_enable() error message
From: Bjorn Helgaas bhelg...@google.com

If we don't have space for all the bus numbers required to enable VFs, print the largest bus number required and the range available.

No functional change; improved error message only.

Signed-off-by: Bjorn Helgaas bhelg...@google.com
---
 drivers/pci/iov.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4b3a4ea..c4c33ea 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -180,6 +180,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
+	u8 bus;

 	if (!nr_virtfn)
 		return 0;
@@ -216,8 +217,10 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;

-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
-		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
+	bus = virtfn_bus(dev, nr_virtfn - 1);
+	if (bus > dev->bus->busn_res.end) {
+		dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
+			nr_virtfn, bus, &dev->bus->busn_res);
 		return -ENOMEM;
 	}

-- 
1.7.9.5
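To see when the "out of range" message fires, here is a rough userspace sketch of the bus-number check. It assumes the usual SR-IOV routing-ID rule (VF n sits at PF devfn + First VF Offset + n * VF Stride, with the overflow past 8 bits carried into the bus number), which is how virtfn_bus() is understood to work; struct pf_info and both function names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical summary of the PF fields that the check depends on. */
struct pf_info {
	uint8_t bus;      /* bus number of the PF */
	uint8_t devfn;    /* device/function number of the PF */
	uint16_t offset;  /* First VF Offset */
	uint16_t stride;  /* VF Stride */
	uint8_t bus_end;  /* last bus number available (busn_res.end) */
};

/* Bus number on which VF `id` lands: carry out of the 8-bit devfn space. */
static unsigned int vf_bus(const struct pf_info *pf, unsigned int id)
{
	return pf->bus + ((pf->devfn + pf->offset + pf->stride * id) >> 8);
}

/* Only the last VF needs checking, since bus numbers grow with the VF id. */
static int check_vf_bus_range(const struct pf_info *pf, int nr_virtfn)
{
	if (nr_virtfn <= 0)
		return 0;
	if (vf_bus(pf, nr_virtfn - 1) > pf->bus_end)
		return -ENOMEM;
	return 0;
}
```

With offset 0x80 and stride 2 on bus 1, VF 63 still lands on bus 1, while VF 64 would need bus 2 and is rejected when the available range ends at bus 1.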
[PATCH V14 12/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning
When sizing and assigning resources, we divide the resources into two lists: the requested list and the additional list. We don't consider the alignment of additional VF(n) BAR space. This is because the alignment required for the VF(n) BAR space is the size of an individual VF BAR, not the size of the space for *all* VFs. But we want additional alignment to support partitioning on PowerNV.

Consider the additional IOV BAR alignment when sizing and assigning resources. When there is not enough system MMIO space to accommodate both the requested list and the additional list, the PF's IOV BAR alignment will not contribute to the bridge. When there is enough system MMIO space for both lists, the additional alignment will contribute to the bridge.

The additional alignment is stored in the min_align of pci_dev_resource, which is stored in the additional list by add_to_list() at the end of pbus_size_mem(). The additional alignment is calculated in pci_resource_alignment(). For an IOV BAR, an arch-dependent function supplies the alignment for each architecture.

[bhelgaas: changelog, printk cast]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/setup-bus.c |   95 +++
 1 file changed, 79 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e3e17f3..6603d40 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
 	}
 }

-static resource_size_t get_res_add_size(struct list_head *head,
-					struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+					       struct resource *res)
 {
 	struct pci_dev_resource *dev_res;

@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];

 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);

-			return dev_res->add_size;
+			return dev_res;
 		}
 	}

-	return 0;
+	return NULL;
 }

+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+					 struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
+}
+
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -215,7 +235,7 @@ static void reassign_resources_sorted(struct list_head *realloc_head,
 	struct resource *res;
 	struct pci_dev_resource *add_res, *tmp;
 	struct pci_dev_resource *dev_res;
-	resource_size_t add_size;
+	resource_size_t add_size, align;
 	int idx;

 	list_for_each_entry_safe(add_res, tmp, realloc_head, list) {
@@ -238,13 +258,13 @@ static void reassign_resources_sorted(struct list_head *realloc_head,

 		idx = res - &add_res->dev->resource[0];
 		add_size = add_res->add_size;
+		align = add_res->min_align;
 		if (!resource_size(res)) {
-			res->start = add_res->start;
+			res->start = align;
 			res->end = res->start + add_size - 1;
 			if (pci_assign_resource(add_res->dev, idx))
 				reset_resource(res);
 		} else {
-			resource_size_t align = add_res->min_align;
 			res->flags |= add_res->flags &
 				 (IORESOURCE_STARTALIGN|IORESOURCE_SIZEALIGN);
 			if (pci_reassign_resource(add_res->dev, idx,
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
 	LIST_HEAD(save_head);
 	LIST_HEAD(local_fail_head);
 	struct pci_dev_resource *save_res;
-	struct pci_dev_resource *dev_res, *tmp_res;
+	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
 	unsigned long fail_type;
+	resource_size_t add_align, align;

 	/* Check if optional add_size is there */
[PATCH V14 11/21] PCI: Add pcibios_iov_resource_alignment() interface
Per the SR-IOV spec r1.1, sec 3.3.14, the required alignment of a PF's IOV BAR is the size of an individual VF BAR, and the size consumed is the individual VF BAR size times NumVFs.

The PowerNV platform has additional alignment requirements to help support its Partitionable Endpoint device isolation feature (see Documentation/powerpc/pci_iov_resource_on_powernv.txt).

Add a pcibios_iov_resource_alignment() interface to allow platforms to request additional alignment.

[bhelgaas: changelog, adapt to reworked pci_sriov_resource_alignment(), drop "align" parameter]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c   |    8 +++++++-
 include/linux/pci.h |    1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 64c4692..ee0ebff 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -569,6 +569,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
 		4 * (resno - PCI_IOV_RESOURCES);
 }

+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+						      int resno)
+{
+	return pci_iov_resource_size(dev, resno);
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -581,7 +587,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
  */
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
-	return pci_iov_resource_size(dev, resno);
+	return pcibios_iov_resource_alignment(dev, resno);
 }

 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 99ea948..4e1f17d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1174,6 +1174,7 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);

 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)

-- 
1.7.9.5
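The __weak default used by this interface can be tried out in plain C with GCC or Clang. The sketch below is an illustration under assumptions, not the kernel code: vf_bar_size_sketch(), arch_iov_alignment() and generic_iov_alignment() are hypothetical names, and the BAR sizes are made up. A platform file linked into the same program could supply a strong arch_iov_alignment() and the linker would pick it over the weak one.

```c
#include <stdint.h>

typedef uint64_t resource_size_t;

/* Hypothetical per-VF BAR sizes; plays the role of pci_iov_resource_size(). */
static resource_size_t vf_bar_size_sketch(int resno)
{
	return (resource_size_t)0x10000 << resno;
}

/*
 * Weak default: used only when no other object file defines a strong
 * arch_iov_alignment(). The kernel spells the attribute as __weak.
 */
__attribute__((weak)) resource_size_t arch_iov_alignment(int resno)
{
	return vf_bar_size_sketch(resno);
}

/* Generic caller: always goes through the overridable hook. */
resource_size_t generic_iov_alignment(int resno)
{
	return arch_iov_alignment(resno);
}
```

Without an override, the generic path simply reports the per-VF BAR size, which matches the spec-mandated alignment described in the changelog.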
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
>
> My recollection wasn't faulty - I pulled it from an earlier email. That said, the original measurement might have been faulty. I ran the numbers again on the 3.19 kernel I saved away from the original testing. That came up at 235k, which is pretty much the same as yesterday's test. The runtime, however, is unchanged from my original measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the "more than an order of magnitude difference" was really about measurement differences, not quite as real. Looks like more a factor of two than a factor of 20.

Did you do the profiles the same way? Because that would explain the differences in the TLB flush percentages too (the "1.4% from tlb_invalidate_range()" vs "pretty much everything from migration").

The runtime variation does show that there's some *big* subtle difference for the numa balancing in the exact TNF_NO_GROUP details. It must be *very* unstable for it to make that big of a difference. But I feel at least a *bit* better about "unstable algorithm changes a small variation into a factor-of-two" vs that crazy factor-of-20.

Can you try Mel's change to make it use

	if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so that we have a baseline. I'd be *so* much happier with checking the vma details over per-pte details, especially ones that change over the lifetime of the pte entry, and the NUMA code explicitly mucks with.

                Linus
Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'
On Thu, 2015-03-19 at 09:22 +0200, Alex Dowad wrote:
> On 19/03/15 08:45, Michael Ellerman wrote:
>> On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
>>> The 'arg' argument to copy_thread() is only ever used when forking a new kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency with do_fork() and other arch-specific implementations of copy_thread()).
>>
>> I don't understand the bit about consistency with do_fork() ?
>
> This series of patches includes one patch which renames the arg for do_fork(), and others which rename the same arg for each arch-specific implementation of copy_thread(). So if all of them are accepted and merged, then all will be consistent. If only some of the patches are accepted, I will rewrite the commit message so it doesn't mention consistency.

Ah OK, I only got patch 23, so I missed the context of the whole series.

I'll apply this one to the powerpc tree.

cheers
[PATCH v2 4/5] hwmon: (ibmpowernv) change create_hwmon_attr_name() prototype
It simplifies the creation of the hwmon attributes and will help when support for a new device tree layout is added. The patch also changes the name of the routine to parse_opal_node_name().

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---

Changes since v1:

 - changed returned value of parse_opal_node_name()
 - used *_PTR macros to check for errors

 drivers/hwmon/ibmpowernv.c |   37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===================================================================
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -152,29 +152,22 @@ static const char *convert_opal_attr_nam
  * which need to be mapped as fan2_input, temp1_max respectively before
  * populating them inside hwmon device class.
  */
-static int create_hwmon_attr_name(struct device *dev, enum sensors type,
-				  const char *node_name,
-				  char *hwmon_attr_name)
+static const char *parse_opal_node_name(const char *node_name,
+					enum sensors type, u32 *index)
 {
 	char attr_suffix[MAX_ATTR_LEN];
 	const char *attr_name;
-	u32 index;
 	int err;

-	err = get_sensor_index_attr(node_name, &index, attr_suffix);
-	if (err) {
-		dev_err(dev, "Sensor device node name '%s' is invalid\n",
-			node_name);
-		return err;
-	}
+	err = get_sensor_index_attr(node_name, index, attr_suffix);
+	if (err)
+		return ERR_PTR(err);

 	attr_name = convert_opal_attr_name(type, attr_suffix);
 	if (!attr_name)
-		return -ENOENT;
+		return ERR_PTR(-ENOENT);

-	snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
-		 sensor_groups[type].name, index, attr_name);
-	return 0;
+	return attr_name;
 }

 static int get_sensor_type(struct device_node *np)
@@ -249,6 +242,9 @@ static int create_device_attrs(struct pl
 	}

 	for_each_child_of_node(opal, np) {
+		const char *attr_name;
+		u32 opal_index;
+
 		if (np->name == NULL)
 			continue;

@@ -265,10 +261,17 @@ static int create_device_attrs(struct pl

 		sdata[count].id = sensor_id;
 		sdata[count].type = type;
-		err = create_hwmon_attr_name(&pdev->dev, type, np->name,
-					     sdata[count].name);
-		if (err)
+
+		attr_name = parse_opal_node_name(np->name, type, &opal_index);
+		if (IS_ERR(attr_name)) {
+			dev_err(&pdev->dev, "Sensor device node name '%s' is invalid\n",
+				np->name);
+			err = PTR_ERR(attr_name);
 			goto exit_put_node;
+		}
+
+		snprintf(sdata[count].name, MAX_ATTR_LEN, "%s%d_%s",
+			 sensor_groups[type].name, opal_index, attr_name);

 		sysfs_attr_init(&sdata[count].dev_attr.attr);
 		sdata[count].dev_attr.attr.name = sdata[count].name;
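The "*_PTR macros" mentioned in the changelog encode a negative errno in a pointer return value. A minimal userspace re-creation is sketched below, assuming the conventional definitions (the top 4095 addresses are reserved for error codes); lookup_attr() is a hypothetical helper invented for the example, and only the "data" and "faulted" suffixes from the node names are used.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel helpers, for illustration only. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: returns a string on success, an encoded errno on failure. */
static const char *lookup_attr(const char *suffix)
{
	if (!suffix)
		return ERR_PTR(-EINVAL);
	if (!strcmp(suffix, "data"))
		return "input";
	if (!strcmp(suffix, "faulted"))
		return "fault";
	return ERR_PTR(-ENOENT);
}
```

A caller tests the result with IS_ERR() and then recovers the errno with PTR_ERR(); IS_ERR() itself only answers yes or no, so it cannot be used to obtain the error code.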
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
>> - something completely different that I am entirely missing
>
> So I think there's something I'm missing. For non-shared mappings, I still have the idea that pte_dirty should be the same as pte_write. And yet, your testing of 3.19 shows that it's a big difference. There's clearly something I'm completely missing.

Minimally, there is still the window where we clear the PTE to set the protections. During that window, a fault can occur. In the old code which was inherently racy and unsafe, the fault might still go ahead deferring a potential migration for a short period. In the current code, it'll stall on the lock, notice the PTE is changed and refault so the overhead is very different but functionally correct.

In the old code, pte_write had complex interactions with background cleaning and sync in the case of file mappings (not applicable to Dave's case but still it's unpredictable behaviour). pte_dirty is close but there are interactions with the application as the timing of writes vs the PTE scanner matter.

Even if we restored the original behaviour, it would still be very difficult to understand all the interactions between userspace and kernel. The patch below should be tested because it's clearer what the intent is. Using the VMA flags is coarse but it's not vulnerable to timing artifacts that behave differently depending on the machine. My preliminary testing shows it helps but not by much. It does not restore performance to where it was but it's easier to understand which is important if there are changes in the scheduler later.

In combination, I also think that slowing PTE scanning when migration fails is the correct action even if it is unrelated to the patch Dave bisected to. It's stupid to increase scanning rates and incur more faults when migrations are failing so I'll be testing that next.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		flags |= TNF_FAULT_LOCAL;
 	}

-	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
-	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
-	 */
-	if (!pmd_dirty(pmd))
+	/* See similar comment in do_numa_page for explanation */
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;

 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}

 	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
+	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+	 * much anyway since they can be in shared cache state. This misses
+	 * the case where a mapping is writable but the process never writes
+	 * to it but pte_write gets cleared during protection updates and
+	 * pte_dirty has unpredictable behaviour between PTE scan updates,
+	 * background writeback, dirty balancing and application behaviour.
 	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
+	 * TODO: Note that the ideal here would be to avoid a situation where a
+	 * NUMA fault is taken immediately followed by a write fault in
+	 * some cases which would have lower overhead overall but would be
+	 * invasive as the fault paths would need to be unified.
 	 */
-	if (!pte_dirty(pte))
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;

 	/*
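The difference between the per-PTE test and the per-VMA test can be shown with a toy model. Everything here is a simplified stand-in assumed for illustration: struct vma_sketch is not the kernel's vm_area_struct, and the flag values are the ones believed to be used by the kernel headers of that era, so treat them as illustrative.

```c
/* Illustrative values; check include/linux/mm.h and sched.h before relying on them. */
#define VM_WRITE     0x00000002UL
#define TNF_NO_GROUP 0x02

/* Hypothetical stand-in for the one VMA field the check reads. */
struct vma_sketch {
	unsigned long vm_flags;
};

/*
 * Decide the NUMA fault flags from the mapping's permissions rather than
 * from dirty/write bits of an individual PTE: the answer is the same for
 * every page in the mapping and does not change between scanner passes.
 */
static int numa_fault_flags(const struct vma_sketch *vma, int flags)
{
	if (!(vma->vm_flags & VM_WRITE))
		flags |= TNF_NO_GROUP;
	return flags;
}
```

The check is coarse, as the message says: a writable mapping that is never written still counts as writable, but the result no longer depends on timing.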
[PATCH v2 0/5] hwmon: (ibmpowernv) remove dependency on OPAL index
Hello !

The current implementation of the driver uses an index for the hwmon attribute which is extracted from the device node name. This index is calculated by the OPAL firmware and its usage creates a dependency with the driver which makes changes a little more complex in OPAL.

This patchset changes the ibmpowernv code to use its own index. It starts with a few cleanups, mostly code shuffling around the creation of the hwmon sysfs attributes, and completes by removing the dependency.

It also prepares the ground for future OPAL changes :

  https://lists.ozlabs.org/pipermail/skiboot/2015-March/000639.html

which will be addressed in another small patchset.

The patches are based on Linux 4.0.0-rc4 and were tested on IBM Power and Open Power systems running Trusty.

Cheers,

C.

Changes since v1:

 - fixed alignment
 - killed a couple of useless return NULL
 - changed returned value of parse_opal_node_name()
 - used *_PTR macros to check for errors

Cédric Le Goater (5):
  hwmon: (ibmpowernv) replace AMBIENT_TEMP by TEMP
  hwmon: (ibmpowernv) add a get_sensor_type() routine
  hwmon: (ibmpowernv) add a convert_opal_attr_name() routine
  hwmon: (ibmpowernv) change create_hwmon_attr_name() prototype
  hwmon: (ibmpowernv) do not use the OPAL index for hwmon attribute names

 drivers/hwmon/ibmpowernv.c |  122 +---
 1 file changed, 81 insertions(+), 41 deletions(-)

-- 
1.7.10.4
[PATCH v2 5/5] hwmon: (ibmpowernv) do not use the OPAL index for hwmon attribute names
The current OPAL firmware exposes the different sensors of an IBM Power system using node names such as :

	sensors/amb-temp#1-data
	sensors/amb-temp#1-thrs
	cooling-fan#1-data
	cooling-fan#1-faulted
	cooling-fan#1-thrs
	cooling-fan#2-data
	...

The ibmpowernv driver, when loaded, parses these names to extract the sensor index and the sensor attribute name. Unfortunately, this scheme makes it difficult to add sensors with a different layout (especially of the same type, like temperature) as the sensor index calculated in OPAL is directly used in the hwmon sysfs interface.

What this patch does is add an independent hwmon index for each sensor. The increment of the hwmon index (temp, fan, power, etc.) is kept per sensor type in the sensor_group table. The sensor_data table is used to store the association of the hwmon and OPAL indexes, as we need to have the same hwmon index for different attributes of the same sensor.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===================================================================
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -55,6 +55,7 @@ static struct sensor_group {
 	const char *compatible;
 	struct attribute_group group;
 	u32 attr_count;
+	u32 hwmon_index;
 } sensor_groups[] = {
 	{"fan", "ibm,opal-sensor-cooling-fan"},
 	{"temp", "ibm,opal-sensor-amb-temp"},
@@ -64,6 +65,8 @@ static struct sensor_group {

 struct sensor_data {
 	u32 id; /* An opaque id of the firmware for each sensor */
+	u32 hwmon_index;
+	u32 opal_index;
 	enum sensors type;
 	char name[MAX_ATTR_LEN];
 	struct device_attribute dev_attr;
@@ -181,6 +184,19 @@ static int get_sensor_type(struct device
 	return MAX_SENSOR_TYPE;
 }

+static u32 get_sensor_hwmon_index(struct sensor_data *sdata,
+				  struct sensor_data *sdata_table, int count)
+{
+	int i;
+
+	for (i = 0; i < count; i++)
+		if (sdata_table[i].opal_index == sdata->opal_index &&
+		    sdata_table[i].type == sdata->type)
+			return sdata_table[i].hwmon_index;
+
+	return ++sensor_groups[sdata->type].hwmon_index;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
 	struct platform_data *pdata = platform_get_drvdata(pdev);
@@ -270,8 +286,13 @@ static int create_device_attrs(struct pl
 			goto exit_put_node;
 		}

+		sdata[count].opal_index = opal_index;
+		sdata[count].hwmon_index =
+			get_sensor_hwmon_index(&sdata[count], sdata, count);
+
 		snprintf(sdata[count].name, MAX_ATTR_LEN, "%s%d_%s",
-			 sensor_groups[type].name, opal_index, attr_name);
+			 sensor_groups[type].name, sdata[count].hwmon_index,
+			 attr_name);

 		sysfs_attr_init(&sdata[count].dev_attr.attr);
 		sdata[count].dev_attr.attr.name = sdata[count].name;
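The index allocation described above (reuse the hwmon index of an already-seen sensor, otherwise take the next one for that type) can be exercised on its own. This is a sketch with simplified, hypothetical types: the per-type counter lives in a plain array named next_index instead of the sensor_groups table, and assign_hwmon_index() is an invented name.

```c
enum sensors { FAN, TEMP, MAX_SENSOR_TYPE };

/* Reduced sensor record: only the fields the allocation looks at. */
struct sensor_sketch {
	unsigned int opal_index;  /* index parsed from the OPAL node name */
	unsigned int hwmon_index; /* index exposed through hwmon */
	enum sensors type;
};

/* Last hwmon index handed out per sensor type (0 means none yet). */
static unsigned int next_index[MAX_SENSOR_TYPE];

/*
 * Assign a hwmon index to table[count], looking only at the entries
 * before it: attributes of the same sensor (same type and OPAL index)
 * share one hwmon index, a new sensor gets the next free one.
 */
static unsigned int assign_hwmon_index(struct sensor_sketch *table, int count)
{
	struct sensor_sketch *cur = &table[count];
	int i;

	for (i = 0; i < count; i++) {
		if (table[i].opal_index == cur->opal_index &&
		    table[i].type == cur->type) {
			cur->hwmon_index = table[i].hwmon_index;
			return cur->hwmon_index;
		}
	}

	cur->hwmon_index = ++next_index[cur->type];
	return cur->hwmon_index;
}
```

With OPAL indexes 7, 7 and 9 for temperature nodes, the first two entries share temp1 and the third becomes temp2, regardless of the numbers chosen by the firmware.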
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 7:10 AM, Mel Gorman mgor...@suse.de wrote:
> -	if (!pmd_dirty(pmd))
> +	/* See similar comment in do_numa_page for explanation */
> +	if (!(vma->vm_flags & VM_WRITE))

Yeah, that would certainly be a whole lot more obvious than all the "if this particular pte/pmd looks like X" tests.

So that, together with scanning rate improvements (this *does* seem to be somewhat chaotic, so it's quite possible that the current scanning rate thing is just fairly unstable) is likely the right thing. I'd just like to _understand_ why that write/dirty bit makes such a difference. I thought I understood what was going on, and was happy, and then Dave came with his crazy numbers.

Damn you Dave, and damn your numbers and "facts" and stuff. Sometimes I much prefer ignorant bliss.

                Linus
[PATCH v2 1/5] hwmon: (ibmpowernv) replace AMBIENT_TEMP by TEMP
Ambient is too restrictive as there can be other temperature channels : core, memory, etc.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===================================================================
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -44,7 +44,7 @@
  */
 enum sensors {
 	FAN,
-	AMBIENT_TEMP,
+	TEMP,
 	POWER_SUPPLY,
 	POWER_INPUT,
 	MAX_SENSOR_TYPE,
@@ -87,7 +87,7 @@ static ssize_t show_sensor(struct device
 		return ret;

 	/* Convert temperature to milli-degrees */
-	if (sdata->type == AMBIENT_TEMP)
+	if (sdata->type == TEMP)
 		x *= 1000;
 	/* Convert power to micro-watts */
 	else if (sdata->type == POWER_INPUT)
@@ -154,7 +154,7 @@ static int create_hwmon_attr_name(struct
 	} else if (!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX)) {
 		attr_name = "input";
 	} else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
-		if (type == AMBIENT_TEMP)
+		if (type == TEMP)
 			attr_name = "max";
 		else if (type == FAN)
 			attr_name = "min";
[PATCH v2 3/5] hwmon: (ibmpowernv) add a convert_opal_attr_name() routine
It simplifies the create_hwmon_attr_name() routine and it clearly isolates the conversion done between the OPAL node names and hwmon attributes names.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---

Changes since v1:

 - fixed alignment
 - killed a couple of useless return NULL

 drivers/hwmon/ibmpowernv.c |   36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===================================================================
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -127,6 +127,25 @@ static int get_sensor_index_attr(const c
 	return 0;
 }

+static const char *convert_opal_attr_name(enum sensors type,
+					  const char *opal_attr)
+{
+	const char *attr_name = NULL;
+
+	if (!strcmp(opal_attr, DT_FAULT_ATTR_SUFFIX)) {
+		attr_name = "fault";
+	} else if (!strcmp(opal_attr, DT_DATA_ATTR_SUFFIX)) {
+		attr_name = "input";
+	} else if (!strcmp(opal_attr, DT_THRESHOLD_ATTR_SUFFIX)) {
+		if (type == TEMP)
+			attr_name = "max";
+		else if (type == FAN)
+			attr_name = "min";
+	}
+
+	return attr_name;
+}
+
 /*
  * This function translates the DT node name into the 'hwmon' attribute name.
  * IBMPOWERNV device node appear like cooling-fan#2-data, amb-temp#1-thrs etc.
@@ -138,7 +157,7 @@ static int create_hwmon_attr_name(struct
 				  char *hwmon_attr_name)
 {
 	char attr_suffix[MAX_ATTR_LEN];
-	char *attr_name;
+	const char *attr_name;
 	u32 index;
 	int err;

@@ -149,20 +168,9 @@ static int create_hwmon_attr_name(struct
 		return err;
 	}

-	if (!strcmp(attr_suffix, DT_FAULT_ATTR_SUFFIX)) {
-		attr_name = "fault";
-	} else if (!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX)) {
-		attr_name = "input";
-	} else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
-		if (type == TEMP)
-			attr_name = "max";
-		else if (type == FAN)
-			attr_name = "min";
-		else
-			return -ENOENT;
-	} else {
+	attr_name = convert_opal_attr_name(type, attr_suffix);
+	if (!attr_name)
 		return -ENOENT;
-	}

 	snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
 		 sensor_groups[type].name, index, attr_name);
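The translation from a node name such as cooling-fan#2-data to a hwmon attribute such as fan2_input involves two steps: splitting off the index and suffix, then mapping the suffix. The sketch below is a userspace approximation; split_node_name() and to_hwmon_attr() are hypothetical names, and the suffix strings "faulted", "data" and "thrs" are assumed from the node names quoted in this series rather than taken from the driver's DT_*_ATTR_SUFFIX definitions.

```c
#include <stdlib.h>
#include <string.h>

enum sensors { FAN, TEMP };

/*
 * Split "<name>#<index>-<suffix>" into its index and suffix.
 * Returns 0 on success, -1 if the name does not have that shape.
 */
static int split_node_name(const char *name, unsigned int *index,
			   char *suffix, size_t len)
{
	const char *hash = strchr(name, '#');
	char *end;
	unsigned long v;

	if (!hash)
		return -1;

	v = strtoul(hash + 1, &end, 10);
	if (end == hash + 1 || *end != '-')
		return -1;

	/* end points at '-'; the suffix follows it. */
	if (end[1] == '\0' || strlen(end + 1) >= len)
		return -1;

	strcpy(suffix, end + 1);
	*index = (unsigned int)v;
	return 0;
}

/* Map an OPAL suffix to a hwmon attribute name, NULL if unsupported. */
static const char *to_hwmon_attr(enum sensors type, const char *suffix)
{
	if (!strcmp(suffix, "faulted"))
		return "fault";
	if (!strcmp(suffix, "data"))
		return "input";
	if (!strcmp(suffix, "thrs")) {
		if (type == TEMP)
			return "max";
		if (type == FAN)
			return "min";
	}
	return NULL;
}
```

Keeping the mapping in its own function, as the patch does, means a single NULL check covers both the unknown-suffix case and the threshold-on-unsupported-type case.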
Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
On Thu, Mar 19, 2015 at 11:18 AM, Wei Yang weiyang.ker...@gmail.com wrote:
> Oh, I thought you are not comfortable with the Patch v12 10/21 PCI: Consider
> additional PF's IOV BAR alignment ...
>
> V14 is ready to send which is based on v4.0-rc1.

Unless I missed something, the last email in that thread [1] is from you, so
I think we're ready for the next iteration.

[1] http://lkml.kernel.org/r/20150224083406.32124.65957.st...@bhelgaas-glaptop2.roam.corp.google.com
[PATCH v2 2/5] hwmon: (ibmpowernv) add a get_sensor_type() routine
It will help in adding different compatible properties, coming from a new
device tree layout for example.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -169,6 +169,17 @@ static int create_hwmon_attr_name(struct
 	return 0;
 }
 
+static int get_sensor_type(struct device_node *np)
+{
+	enum sensors type;
+
+	for (type = 0; type < MAX_SENSOR_TYPE; type++) {
+		if (of_device_is_compatible(np, sensor_groups[type].compatible))
+			return type;
+	}
+	return MAX_SENSOR_TYPE;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
 	struct platform_data *pdata = platform_get_drvdata(pdev);
@@ -181,12 +192,9 @@ static int populate_attr_groups(struct p
 		if (np->name == NULL)
 			continue;
 
-		for (type = 0; type < MAX_SENSOR_TYPE; type++)
-			if (of_device_is_compatible(np,
-					sensor_groups[type].compatible)) {
-				sensor_groups[type].attr_count++;
-				break;
-			}
+		type = get_sensor_type(np);
+		if (type != MAX_SENSOR_TYPE)
+			sensor_groups[type].attr_count++;
 	}
 
 	of_node_put(opal);
@@ -236,11 +244,7 @@ static int create_device_attrs(struct pl
 		if (np->name == NULL)
 			continue;
 
-		for (type = 0; type < MAX_SENSOR_TYPE; type++)
-			if (of_device_is_compatible(np,
-					sensor_groups[type].compatible))
-				break;
-
+		type = get_sensor_type(np);
 		if (type == MAX_SENSOR_TYPE)
 			continue;
 
Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
On Thu, Mar 12, 2015 at 09:15:17AM +0800, Wei Yang wrote:
> On Wed, Mar 11, 2015 at 08:55:07AM -0500, Bjorn Helgaas wrote:
>> On Wed, Mar 04, 2015 at 01:19:07PM +0800, Wei Yang wrote:
>>> On PHB3, PF IOV BAR will be covered by M64 window to have better PE
>>> isolation. The total_pe number is usually different from total_VFs, which
>>> can lead to a conflict between MMIO space and the PE number.
>>>
>>> For example, if total_VFs is 128 and total_pe is 256, the second half of
>>> M64 window will be part of other PCI device, which may already belong
>>> to other PEs.
>>>
>>> Prevent the conflict by reserving additional space for the PF IOV BAR,
>>> which is total_pe number of VF's BAR size.
>>>
>>> [bhelgaas: make dev_printk() output more consistent, index resource[]
>>> conventionally]
>>> Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
>>> ---
>>>  arch/powerpc/include/asm/machdep.h        |    4 ++
>>>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>>>  arch/powerpc/kernel/pci-common.c          |    5 +++
>>>  arch/powerpc/kernel/pci-hotplug.c         |    4 ++
>>>  arch/powerpc/platforms/powernv/pci-ioda.c |   61 +
>>>  5 files changed, 77 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>>> index c8175a3..965547c 100644
>>> --- a/arch/powerpc/include/asm/machdep.h
>>> +++ b/arch/powerpc/include/asm/machdep.h
>>> @@ -250,6 +250,10 @@ struct machdep_calls {
>>>  	/* Reset the secondary bus of bridge */
>>>  	void (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
>>>  
>>> +#ifdef CONFIG_PCI_IOV
>>> +	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
>>> +#endif /* CONFIG_PCI_IOV */
>>> +
>>>  	/* Called to shutdown machine specific hardware not already controlled
>>>  	 * by other drivers.
>>>  	 */
>>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>> index 513f8f2..de11de7 100644
>>> --- a/arch/powerpc/include/asm/pci-bridge.h
>>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>>> @@ -175,6 +175,9 @@ struct pci_dn {
>>>  #define IODA_INVALID_PE	(-1)
>>>  #ifdef CONFIG_PPC_POWERNV
>>>  	int	pe_number;
>>> +#ifdef CONFIG_PCI_IOV
>>> +	u16	max_vfs;	/* number of VFs IOV BAR expended */
>>> +#endif /* CONFIG_PCI_IOV */
>>>  #endif
>>>  	struct list_head child_list;
>>>  	struct list_head list;
>>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>>> index 8203101..022e9fe 100644
>>> --- a/arch/powerpc/kernel/pci-common.c
>>> +++ b/arch/powerpc/kernel/pci-common.c
>>> @@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
>>>  	if (ppc_md.pcibios_fixup_phb)
>>>  		ppc_md.pcibios_fixup_phb(hose);
>>>  
>>> +#ifdef CONFIG_PCI_IOV
>>> +	if (ppc_md.pcibios_fixup_sriov)
>>> +		ppc_md.pcibios_fixup_sriov(bus);
>>> +#endif /* CONFIG_PCI_IOV */
>>
>> Here, and ...
>>
>>> +
>>>  	/* Configure PCI Express settings */
>>>  	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
>>>  		struct pci_bus *child;
>>> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>> index 5b78917..7d238ae 100644
>>> --- a/arch/powerpc/kernel/pci-hotplug.c
>>> +++ b/arch/powerpc/kernel/pci-hotplug.c
>>> @@ -94,6 +94,10 @@ void pcibios_add_pci_devices(struct pci_bus * bus)
>>>  		 */
>>>  		slotno = PCI_SLOT(PCI_DN(dn->child)->devfn);
>>>  		pci_scan_slot(bus, PCI_DEVFN(slotno, 0));
>>> +#ifdef CONFIG_PCI_IOV
>>> +		if (ppc_md.pcibios_fixup_sriov)
>>> +			ppc_md.pcibios_fixup_sriov(bus);
>>> +#endif /* CONFIG_PCI_IOV */
>>
>> here, you have the same code. It's good that we now do it for hot-added
>> devices as well as those present at boot. But it's bad that it happens in
>> two different paths. Isn't there some way we can unify this so the same
>> path is used for the initial pcibios_scan_phb() and also the hot-add case?
>> Maybe call pcibios_fixup_sriov() from pcibios_add_device()?
>
> This is a very good suggestion. I have changed this and works fine.

I was expecting a v14 series with this change. Is it coming, or are you
waiting for something else from me?
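To put numbers on the commit message quoted above, here is a small sketch of the sizing arithmetic, assuming (as the message states) that the PF IOV BAR is expanded to total_pe times the per-VF BAR size. The function names and the 1 MiB VF BAR size used in the example are hypothetical, not taken from the patch.

```c
#include <stdint.h>

/* Size of the PF IOV BAR once it is expanded to cover every PE. */
uint64_t iov_bar_reserved_size(uint64_t vf_bar_size, unsigned int total_pe)
{
	return vf_bar_size * total_pe;
}

/* Extra space reserved beyond the total_VFs VF BARs that SR-IOV itself
 * asks for; zero if there are at least as many VFs as PEs. */
uint64_t iov_bar_extra_size(uint64_t vf_bar_size, unsigned int total_vfs,
			    unsigned int total_pe)
{
	if (total_pe <= total_vfs)
		return 0;
	return vf_bar_size * (total_pe - total_vfs);
}
```

With total_VFs = 128, total_pe = 256 and a 1 MiB VF BAR, the reservation grows from 128 MiB to 256 MiB, so the second half of the M64 window no longer overlaps another device.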
Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling
On Thu, 19 Mar 2015 17:56:57 +0200 Horia Geantă horia.gea...@freescale.com wrote:
On 3/18/2015 12:03 AM, Kim Phillips wrote:
On Tue, 17 Mar 2015 19:58:55 +0200 Horia Geantă horia.gea...@freescale.com wrote:
On 3/17/2015 2:19 AM, Kim Phillips wrote:
On Mon, 16 Mar 2015 12:02:51 +0200 Horia Geantă horia.gea...@freescale.com wrote:
On 3/4/2015 2:23 AM, Kim Phillips wrote:

Only potential problem is getting the crypto API to set the GFP_DMA flag in
the allocation request, but presumably a CRYPTO_TFM_REQ_DMA crt_flag can be
made to handle that.

Seems there are quite a few places that do not use the
{aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
Among them, IPsec and dm-crypt. I've looked at the code and I don't think it
can be converted to use crypto API.

why not?

It would imply having 2 memory allocations, one for crypto request and the
other for the rest of the data bundled with the request (for IPsec that
would be ESN + space for IV + sg entries for authenticated-only data and
sk_buff extension, if needed).

Trying to have a single allocation by making ESN, IV etc. part of the
request private context requires modifying tfm.reqsize on the fly. This
won't work without adding some kind of locking for the tfm.

can't a common minimum tfm.reqsize be co-established up front, at least for
the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() - tfm.reqsize
could be increased to account for what is known for a given flow: ESN, IV
and asg (S/G entries for authenticated-only data). The layout would be:

aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg -- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?

An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.

Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
places. Some of the maintainers do not agree, as you've seen.

would modifying the crypto API to either have a different *_request_alloc()
API, and/or adding calls to negotiate the GFP mask between crypto users and
drivers, e.g., get/set_gfp_mask, work?

I think what DaveM asked for was the change to be transparent. Besides
converting to *_request_alloc(), seems that all other options require some
extra awareness from the user. Could you elaborate on the idea above?

was merely suggesting communicating GFP flags anonymously across the API,
i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto request -
i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

An alternative would be for talitos to use the page allocator to get 1 / 2
pages at probe time (4 channels x 32 entries/channel x 64B/descriptor =
8 kB), dma_map_page the area and manage it internally for talitos_desc hw
descriptors. What do you think?

There's a comment in esp_alloc_tmp(): Use spare space in skb for this where
possible, which is ideally where we'd want to be (esp.

Ok, I'll check that. But note the where possible - finding room in the skb
to avoid the allocation won't always be the case, and then we're back to
square one.

So the skb cb is out of the question, being too small (48B). Any idea what
was the intention of the TODO - maybe to use the tailroom in the skb data
area?

because that memory could already be DMA-able). Your above suggestion would
be in the opposite direction of that.

The proposal:
-removes dma (un)mapping on the fast path

sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

-avoids requesting dma mappable memory for more than it's actually needed
(CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not only its
private context)

compared to the payload? Plus, we have plenty of DMA space these days.

-for caam it has the added benefit of speeding the below search for the
offending descriptor in the SW ring from O(n) to O(1):

for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
	sw_idx = (tail + i) & (JOBR_DEPTH - 1);

	if (jrp->outring[hw_idx].desc ==
	    jrp->entinfo[sw_idx].desc_addr_dma)
		break; /* found */
}
(drivers/crypto/caam/jr.c - caam_dequeue)

how? The job ring h/w will still be spitting things out out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job descriptor,
3 words can be used for
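The O(1) lookup proposed above can be sketched in plain C. This is only an illustration under stated assumptions: descriptors are taken to sit back to back in one contiguous DMA-mapped area starting at desc_base, JD_SIZE is the 64 bytes mentioned in the message, JOBR_DEPTH of 32 is made up, and desc_addr_to_sw_idx() is a hypothetical name. The offset is computed as descriptor address minus base, which is the order that yields a non-negative index.

```c
#include <stdint.h>

#define JD_SIZE		64u	/* 16 words of 4 bytes, as in the message */
#define JOBR_DEPTH	32u	/* hypothetical ring depth */

/* Translate a completed descriptor's bus address into its software ring
 * index without scanning the ring. Returns -1 for an address that does
 * not belong to the preallocated descriptor area. */
int desc_addr_to_sw_idx(uint64_t desc_base, uint64_t desc_addr)
{
	uint64_t off;

	if (desc_addr < desc_base)
		return -1;

	off = desc_addr - desc_base;
	if (off % JD_SIZE != 0 || off / JD_SIZE >= JOBR_DEPTH)
		return -1;

	return (int)(off / JD_SIZE);
}
```

The job ring can still complete descriptors out of order; the point is only that each completion is located with one subtraction and one division instead of a walk over the software ring.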
Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
Oh, I thought you are not comfortable with the Patch v12 10/21 PCI: Consider
additional PF's IOV BAR alignment ...

V14 is ready to send which is based on v4.0-rc1.

2015-03-19 23:08 GMT+08:00 Bjorn Helgaas bhelg...@google.com:
> I was expecting a v14 series with this change. Is it coming, or are you
> waiting for something else from me?

--
Richard Yang
Help You, Help Me
[PATCH v0 2/4] ppc64le: dynamic ftrace configuration options
Switch on -mprofile-kernel, and remove it again from directories involved in
exception handling. This needs to be done more fine grained, of course.

diff --git a/Makefile b/Makefile
index 1a60bdd..72644e6 100644
--- a/Makefile
+++ b/Makefile
@@ -732,7 +732,10 @@ ifdef CONFIG_FUNCTION_TRACER
 ifdef CONFIG_HAVE_FENTRY
 CC_USING_FENTRY	:= $(call cc-option, -mfentry -DCC_USING_FENTRY)
 endif
-KBUILD_CFLAGS	+= -pg $(CC_USING_FENTRY)
+ifdef CONFIG_HAVE_MPROFILE_KERNEL
+CC_USING_MPROFILE_KERNEL := $(call cc-option, -mprofile-kernel -DCC_USING_MPROFILE_KERNEL)
+endif
+KBUILD_CFLAGS	+= -pg $(CC_USING_FENTRY) $(CC_USING_MPROFILE_KERNEL)
 KBUILD_AFLAGS	+= $(CC_USING_FENTRY)
 ifdef CONFIG_DYNAMIC_FTRACE
 ifdef CONFIG_HAVE_C_RECORDMCOUNT
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4bc7b62..d82d7c8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -93,8 +93,10 @@ config PPC
 	select OF_RESERVED_MEM
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
+	select HAVE_DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_GRAPH_TRACER
+	select HAVE_MPROFILE_KERNEL
 	select SYSCTL_EXCEPTION_TRACE
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select VIRT_TO_BUS if !PPC64
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 670c312..688e6f9 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -17,14 +17,14 @@ endif
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace early boot code
-CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog -mprofile-kernel
 # do not trace tracer code
-CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog -mprofile-kernel
 # timers used by tracing
-CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog -mprofile-kernel
 endif
 
 obj-y				:= cputable.o ptrace.o syscalls.o \
diff --git a/kernel/Makefile b/kernel/Makefile
index 8af7403..3c8821d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -13,8 +13,9 @@ obj-y = fork.o exec_domain.o panic.o \
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace debug files and internal ftrace files
-CFLAGS_REMOVE_cgroup-debug.o = -pg
-CFLAGS_REMOVE_irq_work.o = -pg
+CFLAGS_REMOVE_cgroup-debug.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_irq_work.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_extable.o = -pg -mprofile-kernel
 endif
 
 # cond_syscall is currently not LTO compatible
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 8541bfd..1cc57c8 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,10 +2,10 @@ obj-y += mutex.o semaphore.o rwsem.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_lockdep.o = -pg
-CFLAGS_REMOVE_lockdep_proc.o = -pg
-CFLAGS_REMOVE_mutex-debug.o = -pg
-CFLAGS_REMOVE_rtmutex-debug.o = -pg
+CFLAGS_REMOVE_lockdep.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_lockdep_proc.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_mutex-debug.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_rtmutex-debug.o = -pg -mprofile-kernel
 endif
 
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c..dd53f3d 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -52,6 +52,11 @@ config HAVE_FENTRY
 	help
 	  Arch supports the gcc options -pg with -mfentry
 
+config HAVE_MPROFILE_KERNEL
+	bool
+	help
+	  Arch supports the gcc options -pg with -mprofile-kernel
+
 config HAVE_C_RECORDMCOUNT
 	bool
 	help
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 59fa2de..b2f5029 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -6,8 +6,8 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 ccflags-$(CONFIG_PPC64)	:= $(NO_MINIMAL_TOC)
 
-CFLAGS_REMOVE_code-patching.o = -pg
-CFLAGS_REMOVE_feature-fixups.o = -pg
+CFLAGS_REMOVE_code-patching.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_feature-fixups.o = -pg -mprofile-kernel
 
 obj-y := string.o alloc.o \
 	 crtsavres.o
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index d0130ff..22633af 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -6,6 +6,11 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 ccflags-$(CONFIG_PPC64)	:= $(NO_MINIMAL_TOC)
 
+# needed for do_page_fault in fault.c :
+KBUILD_CFLAGS := $(filter-out -mprofile-kernel, $(KBUILD_CFLAGS))
+KBUILD_CFLAGS := $(filter-out -pg,
Re: new decimal conversion - seeking testers
On Fri, Mar 13 2015, Nishanth Aravamudan n...@linux.vnet.ibm.com wrote:
> On 13.03.2015 [00:09:19 +0100], Rasmus Villemoes wrote:
>> Since the new code plays a little endianness game I would really
>> appreciate it if someone here would run the test and verification code
>> on ppc.
>
> On a ppc64le box:
> [...]
> On a ppc64 box:
> [...]

Thanks!

Cheers,
Rasmus
[PATCH v0 0/4] ppc64le: dynamic ftrace and kgraft support
Here's an initial version of dynamic ftrace for ABIv2 (ppc64le), the code
maturity is somewhere between proof of concept and pre-alpha. I have split
it into 4 parts, for ftrace and kgraft, a configuration enablement and the
actual code, respectively.

Please have a look and tell me whether this is the way to go.

	Torsten
Re: [PATCH v0 1/4] ppc64le: dynamic ftrace
I'm pretty sure not everything is ifdef'd properly, and the FIXME needs to
be solved in order to disable ftracing again. Built upon some original code
by Vojtech Pavlik.

diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
index e366187..a69d47e 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -46,6 +46,8 @@
 extern void _mcount(void);
 
 #ifdef CONFIG_DYNAMIC_FTRACE
+# define FTRACE_ADDR ((unsigned long)ftrace_caller+8)
+# define FTRACE_REGS_ADDR FTRACE_ADDR
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
 	/* reloction of mcount call site is the same as the address */
@@ -57,6 +58,9 @@ struct dyn_arch_ftrace {
 #endif /* CONFIG_DYNAMIC_FTRACE */
 #endif /* __ASSEMBLY__ */
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+#define ARCH_SUPPORTS_FTRACE_OPS 1
+#endif
 #endif
 
 #if defined(CONFIG_FTRACE_SYSCALLS) && defined(CONFIG_PPC64) && !defined(__ASSEMBLY__)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 5bbd1bc..9caf9af 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1159,32 +1170,107 @@ _GLOBAL(enter_prom)
 
 #ifdef CONFIG_FUNCTION_TRACER
 #ifdef CONFIG_DYNAMIC_FTRACE
-_GLOBAL(mcount)
+
+#define TOCSAVE 24
+
 _GLOBAL(_mcount)
-	blr
+	nop // REQUIRED for ftrace, to calculate local/global entry diff
+.localentry _mcount,.-_mcount
+	mflr	r0
+	mtctr	r0
+
+	LOAD_REG_ADDR_PIC(r12,ftrace_trace_function)
+	ld	r12,0(r12)
+	LOAD_REG_ADDR_PIC(r0,ftrace_stub)
+	cmpd	r0,r12
+	ld	r0,LRSAVE(r1)
+	bne-	2f
+
+	mtlr	r0
+	bctr
+
+2:	/* here we have (*ftrace_trace_function)() in r12,
+	   selfpc in CTR
+	   and frompc in r0 */
+
+	mtlr	r0
+	bctr
+
+_GLOBAL(ftrace_caller)
+	mr	r0,r2	// global (module) call: save module TOC
+	b	1f
+.localentry ftrace_caller,.-ftrace_caller
+	mr	r0,r2	// local call: callee's TOC == our TOC
+	b	2f
+
+1:	addis	r2,r12,(.TOC.-0b)@ha
+	addi	r2,r2,(.TOC.-0b)@l
+
+2:	// Here we have our proper TOC ptr in R2,
+	// and the one we need to restore on return in r0.
+
+	ld	r12, 16(r1)	// get caller's adress
+
+	stdu	r1,-SWITCH_FRAME_SIZE(r1)
+
+	std	r12, _LINK(r1)
+	SAVE_8GPRS(0,r1)
+	std	r0,TOCSAVE(r1)
+	SAVE_8GPRS(8,r1)
+	SAVE_8GPRS(16,r1)
+	SAVE_8GPRS(24,r1)
+
+
+	LOAD_REG_IMMEDIATE(r3,function_trace_op)
+	ld	r5,0(r3)
+
+	mflr	r3
+	std	r3, _NIP(r1)
+	std	r3, 16(r1)
+	subi	r3, r3, MCOUNT_INSN_SIZE
+	mfmsr	r4
+	std	r4, _MSR(r1)
+	mfctr	r4
+	std	r4, _CTR(r1)
+	mfxer	r4
+	std	r4, _XER(r1)
+	mr	r4, r12
+	addi	r6, r1 ,STACK_FRAME_OVERHEAD
 
-_GLOBAL_TOC(ftrace_caller)
-	/* Taken from output of objdump from lib64/glibc */
-	mflr	r3
-	ld	r11, 0(r1)
-	stdu	r1, -112(r1)
-	std	r3, 128(r1)
-	ld	r4, 16(r11)
-	subi	r3, r3, MCOUNT_INSN_SIZE
 .globl ftrace_call
 ftrace_call:
 	bl	ftrace_stub
 	nop
+
+	ld	r3, _NIP(r1)
+	mtlr	r3
+
+	REST_8GPRS(0,r1)
+	REST_8GPRS(8,r1)
+	REST_8GPRS(16,r1)
+	REST_8GPRS(24,r1)
+
+	addi	r1, r1, SWITCH_FRAME_SIZE
+
+	ld	r12, 16(r1)	// get caller's adress
+	mr	r2,r0		// restore callee's TOC
+	mflr	r0		// move this LR to CTR
+	mtctr	r0
+	mr	r0,r12		// restore callee's lr at _mcount site
+	mtlr	r0
+	bctr			// jump after _mcount site
+
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 .globl ftrace_graph_call
 ftrace_graph_call:
 	b	ftrace_graph_stub
 _GLOBAL(ftrace_graph_stub)
 #endif
-	ld	r0, 128(r1)
-	mtlr	r0
-	addi	r1, r1, 112
+
 _GLOBAL(ftrace_stub)
+	nop
+	nop
+.localentry ftrace_stub,.-ftrace_stub
 	blr
 #else
 _GLOBAL_TOC(_mcount)
@@ -1218,20 +1304,17 @@ _GLOBAL(ftrace_stub)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 _GLOBAL(ftrace_graph_caller)
 	/* load r4 with local address */
-	ld	r4, 128(r1)
+	ld	r4, LRSAVE+SWITCH_FRAME_SIZE(r1)
 	subi	r4, r4, MCOUNT_INSN_SIZE
 
 	/* get the parent address */
-	ld	r11, 112(r1)
-	addi	r3, r11, 16
+	ld	r11, SWITCH_FRAME_SIZE(r1)
+	addi	r3, r11, LRSAVE
 
 	bl	prepare_ftrace_return
 	nop
 
-	ld	r0, 128(r1)
-	mtlr	r0
-	addi	r1, r1, 112
-	blr
+	b	ftrace_graph_stub
 
 _GLOBAL(return_to_handler)
 	/* need to save return values */
diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c
index 390311c..4fe16fb 100644
--- a/arch/powerpc/kernel/ftrace.c
+++
[PATCH v0 3/4] ppc64le: kgraft support
The kgraft hooks for ppc64. Just massaged a bit to get them to compile and
not interfere. Feel free to test them if you're daring ;)

diff --git a/arch/powerpc/include/asm/kgraft.h b/arch/powerpc/include/asm/kgraft.h
new file mode 100644
index 000..7f8600d
--- /dev/null
+++ b/arch/powerpc/include/asm/kgraft.h
@@ -0,0 +1,33 @@
+/*
+ * kGraft Online Kernel Patching
+ *
+ *  Copyright (c) 2013-2014 SUSE
+ *   Authors: Jiri Kosina
+ *	      Vojtech Pavlik
+ *	      Jiri Slaby
+ */
+
+/*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#ifndef ASM_KGR_H
+#define ASM_KGR_H
+
+#include <asm/ptrace.h>
+#include <linux/stacktrace.h>
+
+static inline void kgr_set_regs_ip(struct pt_regs *regs, unsigned long ip)
+{
+	regs->link = ip;
+}
+
+static inline bool kgr_needs_lazy_migration(struct task_struct *p)
+{
+	return true;
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index b034ecd..aa6a084 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
 					   TIF_NEED_RESCHED */
 #define TIF_32BIT		4	/* 32 bit binary */
 #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
+#define TIF_KGR_IN_PROGRESS	6	/* kGraft patching in progress */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SINGLESTEP		8	/* singlestepping active */
 #define TIF_NOHZ		9	/* in adaptive nohz mode */
@@ -115,8 +117,10 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_32BIT		(1<<TIF_32BIT)
 #define _TIF_RESTORE_TM		(1<<TIF_RESTORE_TM)
+#define _TIF_KGR_IN_PROGRESS	(1<<TIF_KGR_IN_PROGRESS)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
+#define _TIF_NOHZ		(1<<TIF_NOHZ)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
 #define _TIF_RESTOREALL		(1<<TIF_RESTOREALL)
 #define _TIF_NOERROR		(1<<TIF_NOERROR)
@@ -124,7 +128,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_UPROBE		(1<<TIF_UPROBE)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_EMULATE_STACK_STORE	(1<<TIF_EMULATE_STACK_STORE)
-#define _TIF_NOHZ		(1<<TIF_NOHZ)
+
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
 				 _TIF_NOHZ)
@@ -132,7 +136,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
 				 _TIF_RESTORE_TM)
-#define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
+
+#define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR|_TIF_KGR_IN_PROGRESS)
 
 /* Bits in local_flags */
 /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 5bbd1bc..569acd4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -151,8 +151,8 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 #endif
 	CURRENT_THREAD_INFO(r11, r1)
 	ld	r10,TI_FLAGS(r11)
-	andi.	r11,r10,_TIF_SYSCALL_T_OR_A
-	bne	syscall_dotrace
+	andi.	r10,r10,(_TIF_SYSCALL_T_OR_A|_TIF_KGR_IN_PROGRESS)
+	bne-	syscall_precall
 .Lsyscall_dotrace_cont:
 	cmpldi	0,r0,NR_syscalls
 	bge-	syscall_enosys
@@ -245,6 +245,17 @@ syscall_error:
 	neg	r3,r3
 	std	r5,_CCR(r1)
 	b	.Lsyscall_error_cont
+
+syscall_precall:
+	andi.	r10,r10,(_TIF_KGR_IN_PROGRESS)
+	beq+	syscall_dotrace
+
+	addi	r11,r11,TI_FLAGS
+1:	ldarx	r12,0,r11
+	andc	r12,r12,r10
+	stdcx.	r12,0,r11
+	bne-	1b
+	subi	r11,r11,TI_FLAGS
 
 /* Traced system call support */
 syscall_dotrace:
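The syscall_precall path above clears the in-progress bit with an ldarx/stdcx. retry loop. For readers less used to PowerPC assembly, the same idea can be sketched in portable C11 as a compare-and-swap loop. This is an illustration only, not the kernel's mechanism: KGR_IN_PROGRESS_MASK and kgr_clear_in_progress() are made-up names, and the bit position is copied from the TIF_KGR_IN_PROGRESS define in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit 6, mirroring TIF_KGR_IN_PROGRESS in the patch; the macro name
 * itself is hypothetical. */
#define KGR_IN_PROGRESS_MASK	(1UL << 6)

/* Atomically clear the in-progress bit and leave every other flag alone.
 * Returns true if the bit was set before the call. */
bool kgr_clear_in_progress(_Atomic unsigned long *flags)
{
	unsigned long old = atomic_load(flags);

	/* On failure the CAS reloads 'old', so the loop simply retries with
	 * the fresh value, like the bne- 1b after stdcx. */
	while (!atomic_compare_exchange_weak(flags, &old,
					     old & ~KGR_IN_PROGRESS_MASK))
		;

	return (old & KGR_IN_PROGRESS_MASK) != 0;
}
```

The retry matters because other flag bits in the same word can change between the load and the store.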
Re: Generic IOMMU pooled allocator
On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymmetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

yes, the only constraint is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w transaction
should happen first, and arena-unmap should happen after.

Ben> The free path however, in your case, takes the lock and calls back into
Ben> demap (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection: somewhere between V1 and V2, I
realized that, at least for sun4v, it's not necessary to hold the pool
lock when doing the unmap (V1 had originally passed this as the ->demap).
Revisiting the LDC change, I think that even that has no pool-specific
info that needs to be passed in, so possibly the ->demap is not required
at all?

I can remove that, and re-verify the LDC code (though I might not be
able to get to this till early next week, as I'm travelling at the
moment).

About the cookie_to_index, that came out of observation of the LDC code
(ldc_cookie_to_index in patchset 3). In the LDC case, the workflow is
approximately

    base = alloc_npages(..); /* calls iommu_tbl_range_alloc */
    /* set up cookie_state using base */
    /* populate cookies calling fill_cookies() -> make_cookie() */

The make_cookie() is the inverse operation of cookie_to_index() (afaict,
the code is not very well commented, I'm afraid), but I need that
indirection to figure out which bitmap to clear.

I don't know if there's a better way to do this, or if the
->cookie_to_index can get more complex for other IOMMU users.

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can be made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of ops, unless you see a
Ben> reason why that wouldn't work for you ?

I can't think of why that won't work, though it would help to see the
patch itself..

Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> DMAble memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben>
Ben> Alexey, any other comment ?
Ben>
Ben> Cheers,
Ben> Ben.

Ben> Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:

Alexey> Agree about missing symmetry. In general, I would call it zoned
Alexey> pool-locked memory allocator-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey>
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
->cookie_to_index.. don't know if there's a way to move the code
to avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might have
more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was no intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey>
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
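The ordering constraint discussed in this message (arena allocation before the h/w map, h/w demap before the arena release, with neither h/w transaction under the pool lock) can be sketched as below. This is only an illustrative toy, not the code from the patch set: struct arena, the hw_map()/hw_demap() mocks, dma_map()/dma_unmap(), ARENA_ENTRIES and COOKIE_SHIFT are all hypothetical names and values chosen for the sketch, and a plain flag stands in for the spinlock. make_cookie() and cookie_to_index() are written as simple shifts purely to show that one is the inverse of the other; the real LDC versions are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARENA_ENTRIES 64  /* hypothetical pool size */
#define COOKIE_SHIFT  13  /* hypothetical: one entry per 8 KiB page */

struct arena {
	bool used[ARENA_ENTRIES];    /* stands in for the allocation bitmap */
	bool hw_live[ARENA_ENTRIES]; /* stands in for live h/w translations */
	bool locked;                 /* stands in for the pool spinlock */
};

static void arena_lock(struct arena *a)   { assert(!a->locked); a->locked = true; }
static void arena_unlock(struct arena *a) { assert(a->locked);  a->locked = false; }

/* Reserve npages contiguous entries; only this step runs under the lock. */
static long arena_alloc(struct arena *a, size_t npages)
{
	long found = -1;

	arena_lock(a);
	for (size_t start = 0; start + npages <= ARENA_ENTRIES && found < 0; start++) {
		size_t n = 0;

		while (n < npages && !a->used[start + n])
			n++;
		if (n == npages) {
			for (n = 0; n < npages; n++)
				a->used[start + n] = true;
			found = (long)start;
		}
	}
	arena_unlock(a);
	return found;
}

/* Release entries back to the arena, again only touching the bitmap. */
static void arena_free(struct arena *a, size_t idx, size_t npages)
{
	arena_lock(a);
	for (size_t n = 0; n < npages; n++)
		a->used[idx + n] = false;
	arena_unlock(a);
}

/* Mock h/w transactions: both assert that the pool lock is NOT held. */
static void hw_map(struct arena *a, size_t idx, size_t npages)
{
	assert(!a->locked);
	for (size_t n = 0; n < npages; n++) {
		assert(a->used[idx + n]); /* entry must already be reserved */
		a->hw_live[idx + n] = true;
	}
}

static void hw_demap(struct arena *a, size_t idx, size_t npages)
{
	assert(!a->locked);
	for (size_t n = 0; n < npages; n++)
		a->hw_live[idx + n] = false;
}

/* Toy cookie encoding; cookie_to_index() undoes make_cookie(). */
static unsigned long make_cookie(size_t idx)
{
	return (unsigned long)idx << COOKIE_SHIFT;
}

static size_t cookie_to_index(unsigned long cookie)
{
	return (size_t)(cookie >> COOKIE_SHIFT);
}

/* Map: arena allocation first, h/w transaction second. 0 on success. */
static int dma_map(struct arena *a, size_t npages, unsigned long *cookie)
{
	long idx = arena_alloc(a, npages);

	if (idx < 0)
		return -1;
	hw_map(a, (size_t)idx, npages);
	*cookie = make_cookie((size_t)idx);
	return 0;
}

/* Unmap: h/w transaction first, arena release second. */
static void dma_unmap(struct arena *a, unsigned long cookie, size_t npages)
{
	size_t idx = cookie_to_index(cookie);

	hw_demap(a, idx, npages);
	arena_free(a, idx, npages);
}
```

In this toy, releasing the arena entry only after the demap means a concurrent allocator can never be handed an index whose old translation is still live. A caller would zero-initialise a struct arena, call dma_map() to obtain a cookie, and pass the same cookie and page count to dma_unmap().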
[PATCH v0 4/4] ppc64le: kgraft config options
Enable kgraft on ppc, fairly trivial.

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4bc7b62..d82d7c8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -102,6 +104,7 @@ config PPC
 	select HAVE_IOREMAP_PROT
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS if !CPU_LITTLE_ENDIAN
 	select HAVE_KPROBES
+	select HAVE_KGRAFT
 	select HAVE_ARCH_KGDB
 	select HAVE_KRETPROBES
 	select HAVE_ARCH_TRACEHOOK
@@ -291,6 +294,8 @@ source "init/Kconfig"
 
 source "kernel/Kconfig.freezer"
 
+source "kernel/Kconfig.kgraft"
+
 source "arch/powerpc/sysdev/Kconfig"
 
 source "arch/powerpc/platforms/Kconfig"
Re: [PATCH 1/2] powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states
Vasant Hegde hegdevas...@linux.vnet.ibm.com writes:
> From: Anshuman Khandual khand...@linux.vnet.ibm.com
>
> This patch registers the following two new OPAL interface calls
> for the platform LED subsystem. With the help of these new OPAL calls,
> the kernel will be able to get or set the state of various individual
> LEDs on the system at any given location code which is passed through
> the LED specific device tree nodes.
>
>	(1) OPAL_LEDS_GET_INDICATOR	opal_leds_get_ind
>	(2) OPAL_LEDS_SET_INDICATOR	opal_leds_set_ind
>
> Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com
> Signed-off-by: Vasant Hegde hegdevas...@linux.vnet.ibm.com

I also just merged the skiboot side of these calls.

Acked-by: Stewart Smith stew...@linux.vnet.ibm.com
Tested-by: Stewart Smith stew...@linux.vnet.ibm.com

(well, it boots, interacts with firmware. I didn't go and look at the
LEDs themselves).
Re: [PATCH 2/2 v5] cpufreq: qoriq: rename the driver
On Friday, March 13, 2015 12:39:02 PM yuantian.t...@freescale.com wrote:
> From: Tang Yuantian yuantian.t...@freescale.com
>
> This driver works on all QorIQ platforms which include
> ARM-based cores and PPC-based cores.
> Rename it in order to represent better.
>
> Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
> Acked-by: Viresh Kumar viresh.ku...@linaro.org

Both queued up for 4.1, thanks!

> ---
> v5:
>   - rebased to 4.0-rc3
>   - added Kconfig and Makefile entry
> v3, v4
>   - none
> v2:
>   - use -C -M options when format-patch
>
>  drivers/cpufreq/Kconfig                                     | 8 ++++++++
>  drivers/cpufreq/Kconfig.powerpc                             | 9 ---------
>  drivers/cpufreq/Makefile                                    | 2 +-
>  drivers/cpufreq/{ppc-corenet-cpufreq.c => qoriq-cpufreq.c}  | 0
>  4 files changed, 9 insertions(+), 10 deletions(-)
>  rename drivers/cpufreq/{ppc-corenet-cpufreq.c => qoriq-cpufreq.c} (100%)
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index a171fef..659879a 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -293,5 +293,13 @@ config SH_CPU_FREQ
>  	  If unsure, say N.
>  endif
>  
> +config QORIQ_CPUFREQ
> +	tristate "CPU frequency scaling driver for Freescale QorIQ SoCs"
> +	depends on OF && COMMON_CLK && (PPC_E500MC || ARM)
> +	select CLK_QORIQ
> +	help
> +	  This adds the CPUFreq driver support for Freescale QorIQ SoCs
> +	  which are capable of changing the CPU's frequency dynamically.
> +
>  endif
>  endmenu
> diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
> index 7ea2441..3a0595b 100644
> --- a/drivers/cpufreq/Kconfig.powerpc
> +++ b/drivers/cpufreq/Kconfig.powerpc
> @@ -23,15 +23,6 @@ config CPU_FREQ_MAPLE
>  	  This adds support for frequency switching on Maple 970FX
>  	  Evaluation Board and compatible boards (IBM JS2x blades).
>  
> -config PPC_CORENET_CPUFREQ
> -	tristate "CPU frequency scaling driver for Freescale E500MC SoCs"
> -	depends on PPC_E500MC && OF && COMMON_CLK
> -	select CLK_QORIQ
> -	help
> -	  This adds the CPUFreq driver support for Freescale e500mc,
> -	  e5500 and e6500 series SoCs which are capable of changing
> -	  the CPU's frequency dynamically.
> -
>  config CPU_FREQ_PMAC
>  	bool "Support for Apple PowerBooks"
>  	depends on ADB_PMU && PPC32
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 82a1821..26df0ad 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -85,7 +85,7 @@ obj-$(CONFIG_CPU_FREQ_CBE)		+= ppc-cbe-cpufreq.o
>  ppc-cbe-cpufreq-y			+= ppc_cbe_cpufreq_pervasive.o ppc_cbe_cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_CBE_PMI)		+= ppc_cbe_cpufreq_pmi.o
>  obj-$(CONFIG_CPU_FREQ_MAPLE)		+= maple-cpufreq.o
> -obj-$(CONFIG_PPC_CORENET_CPUFREQ)	+= ppc-corenet-cpufreq.o
> +obj-$(CONFIG_QORIQ_CPUFREQ)		+= qoriq-cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_PMAC)		+= pmac32-cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_PMAC64)		+= pmac64-cpufreq.o
>  obj-$(CONFIG_PPC_PASEMI_CPUFREQ)	+= pasemi-cpufreq.o
> diff --git a/drivers/cpufreq/ppc-corenet-cpufreq.c b/drivers/cpufreq/qoriq-cpufreq.c
> similarity index 100%
> rename from drivers/cpufreq/ppc-corenet-cpufreq.c
> rename to drivers/cpufreq/qoriq-cpufreq.c

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds torva...@linux-foundation.org wrote:
>
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

Ahh. The normal page table scanning and page fault handling both clear
and set the dirty bit together with the writable one. But fork()
will clear the writable bit without clearing dirty. For some reason I
thought it moved the dirty bit into the struct page like the VM
scanning does, but that was just me having a brainfart. So yeah,
pte_dirty doesn't have to match pte_write even under perfectly normal
circumstances. Maybe there are other cases.

Not that I see a lot of forking in the xfs repair case either, so..

Dave, mind re-running the plain 3.19 numbers to really verify that the
pte_dirty/pte_write change really made that big of a difference. Maybe
your recollection of ~55,000 migrate_pages events was faulty. If the
pte_write -> pte_dirty change is the *only* difference, it's still very
odd how that one difference would make migrate_rate go from ~55k to
471k. That's an order of magnitude difference, for what really
shouldn't be a big change.

I'm running a kernel right now with a hacky update_mmu_cache() that
warns if pte_dirty is ever different from pte_write().

+void update_mmu_cache(struct vm_area_struct *vma,
+	unsigned long addr, pte_t *ptep)
+{
+	if (!(vma->vm_flags & VM_SHARED)) {
+		pte_t now = READ_ONCE(*ptep);
+		if (!pte_write(now) != !pte_dirty(now)) {
+			static int count = 20;
+			static unsigned int prev = 0;
+			unsigned int val = pte_val(now) & 0xfff;
+			if (prev != val && count) {
+				prev = val;
+				count--;
+				WARN(1, "pte value %x", val);
+			}
+		}
+	}
+}

I haven't seen a single warning so far (and there I wrote all that
code to limit repeated warnings), although admittedly
update_mmu_cache() isn't called for all cases where we change a pte
(not for the fork case, for example). But it *is* called for the page
faulting cases.

Maybe a system update has changed libraries and memory allocation
patterns, and there is something bigger than that one-liner
pte_dirty/write change going on?

                        Linus
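The point about fork() can be illustrated with a toy model of the two bits. This is only a sketch under loud assumptions: toy_pte_t, the TOY_PTE_* bit positions and every toy_* helper are hypothetical and do not correspond to any architecture's real PTE layout or to kernel functions. It merely shows why a path that clears "writable" while leaving "dirty" alone makes the check `!pte_write(now) != !pte_dirty(now)` fire, while paths that set or clear both bits together never do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy PTE; real layouts are per-architecture. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

typedef uint32_t toy_pte_t;

static bool toy_pte_write(toy_pte_t p) { return (p & TOY_PTE_WRITE) != 0; }
static bool toy_pte_dirty(toy_pte_t p) { return (p & TOY_PTE_DIRTY) != 0; }

/* Write fault on a private mapping: writable and dirty are set together. */
static toy_pte_t toy_write_fault(toy_pte_t p)
{
	return p | TOY_PTE_WRITE | TOY_PTE_DIRTY;
}

/* Scanning/cleaning path: both bits are cleared together. */
static toy_pte_t toy_clean(toy_pte_t p)
{
	return p & ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
}

/* fork()-style write protection: writable is cleared, dirty is left alone. */
static toy_pte_t toy_fork_wrprotect(toy_pte_t p)
{
	return p & ~TOY_PTE_WRITE;
}

/* Same comparison as the debug hook in the message: do the bits disagree? */
static bool toy_bits_disagree(toy_pte_t p)
{
	return !toy_pte_write(p) != !toy_pte_dirty(p);
}
```

Starting from an empty entry, toy_write_fault() followed by toy_clean() keeps the two bits in agreement at every step, whereas toy_write_fault() followed by toy_fork_wrprotect() leaves an entry that is dirty but not writable.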
Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling
On 3/18/2015 12:03 AM, Kim Phillips wrote:
> On Tue, 17 Mar 2015 19:58:55 +0200
> Horia Geantă horia.gea...@freescale.com wrote:
>
>> On 3/17/2015 2:19 AM, Kim Phillips wrote:
>>> On Mon, 16 Mar 2015 12:02:51 +0200
>>> Horia Geantă horia.gea...@freescale.com wrote:
>>>
>>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>>>> flag in the allocation request, but presumably a
>>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>>>
>>>> Seems there are quite a few places that do not use the
>>>> {aead,ablkcipher,ahash}_request_alloc() API to allocate crypto requests.
>>>> Among them, IPsec and dm-crypt.
>>>> I've looked at the code and I don't think it can be converted to use
>>>> crypto API.
>>>
>>> why not?
>>
>> It would imply having 2 memory allocations, one for crypto request and
>> the other for the rest of the data bundled with the request (for IPsec
>> that would be ESN + space for IV + sg entries for authenticated-only
>> data and sk_buff extension, if needed).
>>
>> Trying to have a single allocation by making ESN, IV etc. part of the
>> request private context requires modifying tfm.reqsize on the fly.
>> This won't work without adding some kind of locking for the tfm.
>
> can't a common minimum tfm.reqsize be co-established up front, at
> least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg -- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

>>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>>>> places. Some of the maintainers do not agree, as you've seen.
>>>
>>> would modifying the crypto API to either have a different
>>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
>>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
>>
>> I think what DaveM asked for was the change to be transparent.
>>
>> Besides converting to *_request_alloc(), seems that all other options
>> require some extra awareness from the user.
>> Could you elaborate on the idea above?
>
> was merely suggesting communicating GFP flags anonymously across the
> API, i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

>>>> An alternative would be for talitos to use the page allocator to get 1 /
>>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>>>> hw descriptors.
>>>> What do you think?
>>>
>>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
>>> this where possible," which is ideally where we'd want to be (esp.
>>
>> Ok, I'll check that. But note the "where possible" - finding room in the
>> skb to avoid the allocation won't always be the case, and then we're
>> back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the TODO - maybe to use the
tailroom in the skb data area?

>>> because that memory could already be DMA-able). Your above
>>> suggestion would be in the opposite direction of that.
>>
>> The proposal:
>> -removes dma (un)mapping on the fast path
>
> sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

>> -avoids requesting dma mappable memory for more than it's actually
>> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
>> only its private context)
>
> compared to the payload? Plus, we have plenty of DMA space these
> days.
>
>> -for caam it has the added benefit of speeding the below search for the
>> offending descriptor in the SW ring from O(n) to O(1):
>> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
>> 	sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>>
>> 	if (jrp->outring[hw_idx].desc ==
>> 	    jrp->entinfo[sw_idx].desc_addr_dma)
>> 		break; /* found */
>> }
>> (drivers/crypto/caam/jr.c - caam_dequeue)
>
> how? The job ring h/w will still be spitting things out
> out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words can be used for smth. else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.

> Plus, like I said, it's taking the problem in the wrong direction:
> we need to strive to merge the allocation