[PATCH] powerpc/powernv: Fixes for hypervisor doorbell handling

2015-03-19 Thread Paul Mackerras
Since we can now use hypervisor doorbells for host IPIs, this makes
sure we clear the host IPI flag when taking a doorbell interrupt, and
clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we
already do for IPIs sent via the XICS interrupt controller).  Otherwise
if there did happen to be a leftover pending doorbell interrupt for
an offline CPU thread for any reason, it would prevent that thread from
going into a power-saving mode; it would instead keep waking up because
of the interrupt.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/reg.h   |  3 +++
 arch/powerpc/kernel/dbell.c  |  2 ++
 arch/powerpc/platforms/powernv/smp.c | 13 +++--
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1c874fb..af56b5c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -608,13 +608,16 @@
 #define   SRR1_ISI_N_OR_G  0x1000 /* ISI: Access is no-exec or G */
 #define   SRR1_ISI_PROT    0x0800 /* ISI: Other protection fault */
 #define   SRR1_WAKEMASK    0x0038 /* reason for wakeup */
+#define   SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 */
 #define   SRR1_WAKESYSERR  0x0030 /* System error */
 #define   SRR1_WAKEEE      0x0020 /* External interrupt */
 #define   SRR1_WAKEMT      0x0028 /* mtctrl */
 #define   SRR1_WAKEHMI     0x0028 /* Hypervisor maintenance */
 #define   SRR1_WAKEDEC     0x0018 /* Decrementer interrupt */
+#define   SRR1_WAKEDBELL   0x0014 /* Privileged doorbell on P8 */
 #define   SRR1_WAKETHERM   0x0010 /* Thermal management interrupt */
 #define   SRR1_WAKERESET   0x0010 /* System reset */
+#define   SRR1_WAKEHDBELL  0x000c /* Hypervisor doorbell on P8 */
 #define   SRR1_WAKESTATE   0x0003 /* Powersave exit mask [46:47] */
 #define   SRR1_WS_DEEPEST  0x0003 /* Some resources not maintained,
                                    * may not be recoverable */
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f421781..2128f3a 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -17,6 +17,7 @@
 
 #include <asm/dbell.h>
 #include <asm/irq_regs.h>
+#include <asm/kvm_ppc.h>
 
 #ifdef CONFIG_SMP
 void doorbell_setup_this_cpu(void)
@@ -41,6 +42,7 @@ void doorbell_exception(struct pt_regs *regs)
 
 	may_hard_irq_enable();
 
+	kvmppc_set_host_ipi(smp_processor_id(), 0);
 	__this_cpu_inc(irq_stat.doorbell_irqs);
 
 	smp_ipi_demux();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index fc34025..7259a24 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -33,6 +33,7 @@
 #include <asm/runlatch.h>
 #include <asm/code-patching.h>
 #include <asm/dbell.h>
+#include <asm/kvm_ppc.h>
 
 #include "powernv.h"
 
@@ -149,7 +150,7 @@ static int pnv_smp_cpu_disable(void)
 static void pnv_smp_cpu_kill_self(void)
 {
 	unsigned int cpu;
-	unsigned long srr1;
+	unsigned long srr1, wmask;
 	u32 idle_states;
 
 	/* Standard hot unplug procedure */
@@ -161,6 +162,10 @@ static void pnv_smp_cpu_kill_self(void)
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
 
+	wmask = SRR1_WAKEMASK;
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		wmask = SRR1_WAKEMASK_P8;
+
 	idle_states = pnv_get_supported_cpuidle_states();
 	/* We don't want to take decrementer interrupts while we are offline,
 	 * so clear LPCR:PECE1. We keep PECE2 enabled.
@@ -191,10 +196,14 @@ static void pnv_smp_cpu_kill_self(void)
 		 * having finished executing in a KVM guest, then srr1
 		 * contains 0.
 		 */
-		if ((srr1 & SRR1_WAKEMASK) == SRR1_WAKEEE) {
+		if ((srr1 & wmask) == SRR1_WAKEEE) {
 			icp_native_flush_interrupt();
 			local_paca->irq_happened &= PACA_IRQ_HARD_DIS;
 			smp_mb();
+		} else if ((srr1 & wmask) == SRR1_WAKEHDBELL) {
+			unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
+			asm volatile(PPC_MSGCLR(%0) : : "r" (msg));
+			kvmppc_set_host_ipi(cpu, 0);
 		}
 
 		if (cpu_core_split_required())
-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
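
Editorial sketch (not part of the posted patch): the wider POWER8 mask matters because the doorbell wake reasons use a bit that the older mask does not cover. The user-space C fragment below illustrates this with the constants exactly as they appear in the reg.h hunk above; the archived copy may have dropped digits, so treat the absolute values as illustrative. wake_reason() and its has_arch_207s parameter are hypothetical stand-ins for the cpu_has_feature(CPU_FTR_ARCH_207S) check in the patch.

```c
/* Constants as shown in the posted reg.h hunk; illustrative values only. */
#define SRR1_WAKEMASK     0x0038UL  /* reason for wakeup */
#define SRR1_WAKEMASK_P8  0x003cUL  /* reason for wakeup on POWER8 */
#define SRR1_WAKEEE       0x0020UL  /* External interrupt */
#define SRR1_WAKEDBELL    0x0014UL  /* Privileged doorbell on P8 */
#define SRR1_WAKEHDBELL   0x000cUL  /* Hypervisor doorbell on P8 */

/*
 * Hypothetical helper: extract the wake-reason field from an SRR1 value,
 * choosing the mask the same way the patch chooses wmask.
 */
static unsigned long wake_reason(unsigned long srr1, int has_arch_207s)
{
	unsigned long wmask = has_arch_207s ? SRR1_WAKEMASK_P8 : SRR1_WAKEMASK;

	return srr1 & wmask;
}
```

With these values, wake_reason(SRR1_WAKEHDBELL, 0) yields 0x0008 under the narrow mask, which matches none of the defined reasons, while wake_reason(SRR1_WAKEHDBELL, 1) yields SRR1_WAKEHDBELL itself, so the new else-if branch can only match when the wide mask is in use.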

Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'

2015-03-19 Thread Alex Dowad


On 19/03/15 08:45, Michael Ellerman wrote:
> On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
>> The 'arg' argument to copy_thread() is only ever used when forking a new
>> kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency
>> with do_fork() and other arch-specific implementations of copy_thread()).
>
> I don't understand the bit about consistency with do_fork() ?

This series of patches includes one patch which renames the arg for
do_fork(), and others which rename the same arg for each arch-specific
implementation of copy_thread(). So if all of them are accepted and
merged, then all will be consistent. If only some of the patches are
accepted, I will rewrite the commit message so it doesn't mention
consistency.

Thanks! AD

> Otherwise it looks fine.
>
> cheers



Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'

2015-03-19 Thread Michael Ellerman
On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
 The 'arg' argument to copy_thread() is only ever used when forking a new
 kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency
 with do_fork() and other arch-specific implementations of copy_thread()).
 
I don't understand the bit about consistency with do_fork() ?

Otherwise it looks fine.

cheers

[PATCH] powerpc/powernv: Fix return value from power7_nap() et al.

2015-03-19 Thread Paul Mackerras
The power7_nap(), power7_sleep() and power7_winkle() functions are
called from pnv_smp_cpu_kill_self(), which expects them to return the
SRR1 value set by the hardware on wakeup, or 0 if no nap/sleep/winkle
occurred.  However, in the case where an interrupt needs to be
replayed, the logic in power7_powersave_common (the common code for
power7_nap et al.) doesn't set r3 to 0.  Instead what we
get as the return value is the selector for the type of power-saving
mode requested (1, 2 or 3).  In fact this should not affect the
operation of pnv_smp_cpu_kill_self(), but it is better to get this
correct, so this adds an instruction to set r3 to 0 in this case.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kernel/idle_power7.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 05adc8b..eeaa0d5 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -94,6 +94,7 @@ _GLOBAL(power7_powersave_common)
 	beq	1f
 	addi	r1,r1,INT_FRAME_SIZE
 	ld	r0,16(r1)
+	li	r3,0			/* Return 0 (no nap) */
 	mtlr	r0
 	blr
 
-- 
2.1.4
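
Editorial sketch (not part of the posted patch): one possible reading of why a stale selector in r3 "should not affect" the caller is that the selector values 1, 2 and 3 have no bits inside the wake-reason mask, so the masked value is 0 and matches no wake reason. The commit message does not spell this out, so treat it as an assumption. The mask values below are copied as they appear in the reg.h hunk posted earlier in this archive, and looks_like_wakeup() is a hypothetical helper.

```c
/* Mask values as shown in the posted reg.h hunk; illustrative only. */
#define SRR1_WAKEMASK     0x0038UL
#define SRR1_WAKEMASK_P8  0x003cUL

/*
 * Hypothetical helper: returns 1 if a value returned by power7_nap() and
 * friends carries any wake-reason bits under the given mask, 0 otherwise.
 */
static int looks_like_wakeup(unsigned long ret, unsigned long wmask)
{
	return (ret & wmask) != 0;
}
```

Under either mask, looks_like_wakeup() returns 0 for the selectors 1, 2 and 3, the same result as for a return value of 0.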


Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe

2015-03-19 Thread Wei Yang
2015-03-20 1:54 GMT+08:00 Bjorn Helgaas bhelg...@google.com:
 On Thu, Mar 19, 2015 at 11:18 AM, Wei Yang weiyang.ker...@gmail.com wrote:
 Oh, I thought you are not comfortable with the Patch v12 10/21 PCI:
 Consider additional PF's IOV BAR alignment ...

 V14 is ready to send which is based on v4.0-rc1.

 Unless I missed something, the last email in that thread [1] is from
 you, so I think we're ready for the next iteration.

 [1] 
 http://lkml.kernel.org/r/20150224083406.32124.65957.st...@bhelgaas-glaptop2.roam.corp.google.com


Great~~~ I thought you didn't get a chance to read it.

Will send out a v14 ASAP.


-- 
Richard Yang
Help You, Help Me

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
 On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  So I think there's something I'm missing. For non-shared mappings, I
  still have the idea that pte_dirty should be the same as pte_write.
  And yet, your testing of 3.19 shows that it's a big difference.
  There's clearly something I'm completely missing.
 
 Ahh. The normal page table scanning and page fault handling both clear
 and set the dirty bit together with the writable one. But fork()
 will clear the writable bit without clearing dirty. For some reason I
 thought it moved the dirty bit into the struct page like the VM
 scanning does, but that was just me having a brainfart. So yeah,
 pte_dirty doesn't have to match pte_write even under perfectly normal
 circumstances. Maybe there are other cases.
 
 Not that I see a lot of forking in the xfs repair case either, so..
 
 Dave, mind re-running the plain 3.19 numbers to really verify that the
 pte_dirty/pte_write change really made that big of a difference. Maybe
 your recollection of ~55,000 migrate_pages events was faulty. If the
 pte_write -> pte_dirty change is the *only* difference, it's still very
 odd how that one difference would make migrate_rate go from ~55k to
 471k. That's an order of magnitude difference, for what really
 shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email.
That said, the original measurement might have been faulty. I ran
the numbers again on the 3.19 kernel I saved away from the original
testing. That came up at 235k, which is pretty much the same as
yesterday's test. The runtime, however, is unchanged from my original
measurements of 4m54s (pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when
I started the measurement - all the numbers since I did the bisect
have come from starting it at roughly 130AGs into phase 3 where the
memory footprint stabilises and the tlb flush overhead kicks in.

However, if I start the measurement at the same time as the repair
test, I get something much closer to the 55k number. I also note
that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the
original numbers weren't representative of the steady state
behaviour and so can be ignored...

 Maybe a system update has changed libraries and memory allocation
 patterns, and there is something bigger than that one-liner
 pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but
otherwise the system libraries are unchanged.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
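
Editorial sketch (not from the thread): the point quoted above, that fork() clears the writable bit without clearing dirty, can be pictured with a toy page-table entry. The flag values and the toy_* helpers are hypothetical and do not reflect any real architecture's PTE layout or the kernel's pte_* API.

```c
#include <stdbool.h>

/* Hypothetical flag bits for a toy PTE; not a real PTE layout. */
#define TOY_PTE_WRITE  0x1u
#define TOY_PTE_DIRTY  0x2u

typedef unsigned int toy_pte_t;

static bool toy_pte_write(toy_pte_t pte) { return (pte & TOY_PTE_WRITE) != 0; }
static bool toy_pte_dirty(toy_pte_t pte) { return (pte & TOY_PTE_DIRTY) != 0; }

/* A store through a writable mapping marks the entry dirty. */
static toy_pte_t toy_store(toy_pte_t pte)
{
	return toy_pte_write(pte) ? (pte | TOY_PTE_DIRTY) : pte;
}

/* fork()-style write protection: drops WRITE but leaves DIRTY untouched. */
static toy_pte_t toy_fork_wrprotect(toy_pte_t pte)
{
	return pte & ~TOY_PTE_WRITE;
}
```

After toy_store() followed by toy_fork_wrprotect(), the entry is dirty but not writable, which is the state in which a dirty test and a write test disagree.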

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
 
  My recollection wasn't faulty - I pulled it from an earlier email.
  That said, the original measurement might have been faulty. I ran
  the numbers again on the 3.19 kernel I saved away from the original
  testing. That came up at 235k, which is pretty much the same as
  yesterday's test. The runtime, however, is unchanged from my original
  measurements of 4m54s (pte_hack came in at 5m20s).
 
 Ok. Good. So the more than an order of magnitude difference was
 really about measurement differences, not quite as real. Looks like
 more a factor of two than a factor of 20.
 
 Did you do the profiles the same way? Because that would explain the
 differences in the TLB flush percentages too (the 1.4% from
 tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of
memory it allocates). It's not until the cache reaches full and it
starts recycling old buffers rather than allocating new ones that
the tlb flush problem dominates the profiles.

 The runtime variation does show that there's some *big* subtle
 difference for the numa balancing in the exact TNF_NO_GROUP details.
 It must be *very* unstable for it to make that big of a difference.
 But I feel at least a *bit* better about unstable algorithm changes a
 small variation into a factor-of-two vs that crazy factor-of-20.
 
 Can you try Mel's change to make it use
 
 if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happier with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. might take an hour or two before I get to it, though...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 Can you try Mel's change to make it use
 
 if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happier with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

266,750  migrate:mm_migrate_pages ( +-  7.43% )

  10.002032292 seconds time elapsed ( +-  0.00% )

Bit more variance there than the pte checking, but runtime
difference is in the noise - 5m4s vs 4m54s - and profiles are
identical to the pte checking version.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH] powerpc: Export __spin_yield

2015-03-19 Thread Paul Mackerras
On Wed, Feb 25, 2015 at 05:23:53PM -0600, Suresh E. Warrier wrote:
 Export __spin_yield so that the arch_spin_unlock() function can
 be invoked from a module. This will be required for modules where
 we want to take a lock that is also acquired in hypervisor
 real mode. Because we want to avoid running any lockdep code
 (which may not be safe in real mode), this lock needs to be 
 an arch_spinlock_t instead of a normal spinlock.
 
 Signed-off-by: Suresh Warrier warr...@linux.vnet.ibm.com

Acked-by: Paul Mackerras pau...@samba.org

[PATCH v2] powerpc/powernv: Fixes for hypervisor doorbell handling

2015-03-19 Thread Paul Mackerras
Since we can now use hypervisor doorbells for host IPIs, this makes
sure we clear the host IPI flag when taking a doorbell interrupt, and
clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we
already do for IPIs sent via the XICS interrupt controller).  Otherwise
if there did happen to be a leftover pending doorbell interrupt for
an offline CPU thread for any reason, it would prevent that thread from
going into a power-saving mode; it would instead keep waking up because
of the interrupt.

Signed-off-by: Paul Mackerras pau...@samba.org
---
This one actually compiles... (blush)

 arch/powerpc/include/asm/ppc-opcode.h |  3 +++
 arch/powerpc/include/asm/reg.h|  3 +++
 arch/powerpc/kernel/dbell.c   |  2 ++
 arch/powerpc/platforms/powernv/smp.c  | 14 --
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 03cd858..4cbe23a 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -153,6 +153,7 @@
 #define PPC_INST_MFSPR_PVR_MASK	0xfc1f
 #define PPC_INST_MFTMR			0x7c0002dc
 #define PPC_INST_MSGSND			0x7c00019c
+#define PPC_INST_MSGCLR			0x7c0001dc
 #define PPC_INST_MSGSNDP		0x7c00011c
 #define PPC_INST_MTTMR			0x7c0003dc
 #define PPC_INST_NOP			0x6000
@@ -309,6 +310,8 @@
 					___PPC_RB(b) | __PPC_EH(eh))
 #define PPC_MSGSND(b)		stringify_in_c(.long PPC_INST_MSGSND | \
 					___PPC_RB(b))
+#define PPC_MSGCLR(b)		stringify_in_c(.long PPC_INST_MSGCLR | \
+					___PPC_RB(b))
 #define PPC_MSGSNDP(b)		stringify_in_c(.long PPC_INST_MSGSNDP | \
 					___PPC_RB(b))
 #define PPC_POPCNTB(a, s)	stringify_in_c(.long PPC_INST_POPCNTB | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1c874fb..af56b5c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -608,13 +608,16 @@
 #define   SRR1_ISI_N_OR_G  0x1000 /* ISI: Access is no-exec or G */
 #define   SRR1_ISI_PROT    0x0800 /* ISI: Other protection fault */
 #define   SRR1_WAKEMASK    0x0038 /* reason for wakeup */
+#define   SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 */
 #define   SRR1_WAKESYSERR  0x0030 /* System error */
 #define   SRR1_WAKEEE      0x0020 /* External interrupt */
 #define   SRR1_WAKEMT      0x0028 /* mtctrl */
 #define   SRR1_WAKEHMI     0x0028 /* Hypervisor maintenance */
 #define   SRR1_WAKEDEC     0x0018 /* Decrementer interrupt */
+#define   SRR1_WAKEDBELL   0x0014 /* Privileged doorbell on P8 */
 #define   SRR1_WAKETHERM   0x0010 /* Thermal management interrupt */
 #define   SRR1_WAKERESET   0x0010 /* System reset */
+#define   SRR1_WAKEHDBELL  0x000c /* Hypervisor doorbell on P8 */
 #define   SRR1_WAKESTATE   0x0003 /* Powersave exit mask [46:47] */
 #define   SRR1_WS_DEEPEST  0x0003 /* Some resources not maintained,
                                    * may not be recoverable */
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index f421781..2128f3a 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -17,6 +17,7 @@
 
 #include <asm/dbell.h>
 #include <asm/irq_regs.h>
+#include <asm/kvm_ppc.h>
 
 #ifdef CONFIG_SMP
 void doorbell_setup_this_cpu(void)
@@ -41,6 +42,7 @@ void doorbell_exception(struct pt_regs *regs)
 
 	may_hard_irq_enable();
 
+	kvmppc_set_host_ipi(smp_processor_id(), 0);
 	__this_cpu_inc(irq_stat.doorbell_irqs);
 
 	smp_ipi_demux();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index fc34025..38a4508 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -33,6 +33,8 @@
 #include <asm/runlatch.h>
 #include <asm/code-patching.h>
 #include <asm/dbell.h>
+#include <asm/kvm_ppc.h>
+#include <asm/ppc-opcode.h>
 
 #include "powernv.h"
 
@@ -149,7 +151,7 @@ static int pnv_smp_cpu_disable(void)
 static void pnv_smp_cpu_kill_self(void)
 {
 	unsigned int cpu;
-	unsigned long srr1;
+	unsigned long srr1, wmask;
 	u32 idle_states;
 
 	/* Standard hot unplug procedure */
@@ -161,6 +163,10 @@ static void pnv_smp_cpu_kill_self(void)
 	generic_set_cpu_dead(cpu);
 	smp_wmb();
 
+	wmask = SRR1_WAKEMASK;
+	if (cpu_has_feature(CPU_FTR_ARCH_207S))
+		wmask = SRR1_WAKEMASK_P8;
+
 	idle_states = pnv_get_supported_cpuidle_states();
 	/* We don't want to take decrementer interrupts while we are offline,
 	 * so clear LPCR:PECE1. We keep 

Re: [alsa-devel] [PATCH 1/7 linux-next] ALSA: aoa: constify of_device_id array

2015-03-19 Thread Takashi Iwai
At Wed, 18 Mar 2015 17:48:56 +0100,
Fabian Frederick wrote:
 
 of_device_id is always used as const.
 (See driver.of_match_table and open firmware functions)
 
 Signed-off-by: Fabian Frederick f...@skynet.be

Thanks, applied this one.
The rest ASoC patches are left to Mark.


Takashi

 ---
  sound/aoa/soundbus/i2sbus/core.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/sound/aoa/soundbus/i2sbus/core.c 
 b/sound/aoa/soundbus/i2sbus/core.c
 index b9737fa..1cbf210 100644
 --- a/sound/aoa/soundbus/i2sbus/core.c
 +++ b/sound/aoa/soundbus/i2sbus/core.c
 @@ -31,7 +31,7 @@ module_param(force, int, 0444);
  MODULE_PARM_DESC(force, "Force loading i2sbus even when"
                          " no layout-id property is present");
  
 -static struct of_device_id i2sbus_match[] = {
 +static const struct of_device_id i2sbus_match[] = {
    { .name = "i2s" },
    { }
  };
 -- 
 1.9.1
 
 ___
 Alsa-devel mailing list
 alsa-de...@alsa-project.org
 http://mailman.alsa-project.org/mailman/listinfo/alsa-devel
 

Re: [PATCH 2/2] leds/powernv: Add driver for PowerNV platform

2015-03-19 Thread Stewart Smith
Vasant Hegde hegdevas...@linux.vnet.ibm.com writes:

 From: Anshuman Khandual khand...@linux.vnet.ibm.com

 This patch implements LED driver for PowerNV platform using the existing
 generic LED class framework. It registers classdev structures for all
 individual LEDs detected on the system through LED specific device tree
 nodes. Device tree nodes specify which kinds of LEDs are present at a
 given location code. It registers an LED classdev structure for each of them.

 The platform level implementation of LED get and set state has been
 achieved through OPAL calls. These calls are made available for the
 driver by exporting from architecture specific codes.

 As per the LED class framework, the 'brightness_set' function should not
 sleep. Hence these functions have been implemented through global work
 queue tasks which might sleep on OPAL async call completion.

 All the system LEDs can be found in the same regular path /sys/class/leds/.
 There are two different kinds of LEDs present for the same location code,
 one being the identify indicator and other one being the fault indicator.
 We don't use LED colors. Hence our LEDs have names in this format.

 location_code:IDENTIFY|FAULT

 Any positive brightness value would turn on the LED and a zero value
 would turn off the LED. The driver will return LED_FULL (255) for any
 turned on LED and LED_OFF for any turned off LED.

 Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com
 Signed-off-by: Vasant Hegde hegdevas...@linux.vnet.ibm.com

Acked-by: Stewart Smith stew...@linux.vnet.ibm.com
Tested-by: Stewart Smith stew...@linux.vnet.ibm.com


(well, it boots, interacts with firmware. I didn't go and look at the
LEDs themselves).
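
Editorial sketch (not from the patch under review): the naming and brightness rules in the quoted commit message can be summarised in a few lines of user-space C. LED_FULL (255) is the value the commit message gives; LED_OFF is taken to be 0 because a zero brightness turns the LED off. The toy_* helpers and the "LOC-1" location code used in the description below are hypothetical and are not the driver's actual functions or data.

```c
#include <stdio.h>

#define LED_OFF   0    /* assumed: zero brightness means off */
#define LED_FULL  255  /* value stated in the commit message */

/* Hypothetical helper: build "location_code:IDENTIFY" or "location_code:FAULT". */
static int toy_led_name(char *buf, size_t len, const char *loc_code, int is_fault)
{
	return snprintf(buf, len, "%s:%s", loc_code,
			is_fault ? "FAULT" : "IDENTIFY");
}

/* Any positive brightness request turns the LED on; zero turns it off. */
static int toy_led_is_on(int requested_brightness)
{
	return requested_brightness > 0;
}

/* The driver is described as reporting only LED_FULL or LED_OFF. */
static int toy_led_reported_brightness(int is_on)
{
	return is_on ? LED_FULL : LED_OFF;
}
```

For example, a request of 17 on the fault indicator of a made-up location "LOC-1" would be named "LOC-1:FAULT" and read back as 255.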


[PATCH V14 14/21] powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically

2015-03-19 Thread Wei Yang
Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe
and was embedded in it. The pnv_ioda_pe was assigned to a PE at boot.
Since PEs are based on the hardware layout, which is static in the
system, they will never get released. This means the iommu_table in the
pnv_ioda_pe will never get released either.

This no longer works for VF PEs. VF PEs are created and released dynamically
when VFs are created and released. So we need to assign a pnv_ioda_pe to each
VF PE when VFs are enabled and clean up those resources for the VF
PE when VFs are disabled. And iommu_table is one of the resources we need
to handle dynamically.

Currently iommu_table is a static field in pnv_ioda_pe, which causes a
problem when freeing it. During the disabling of a VF,
pnv_pci_ioda2_release_dma_pe will call iommu_free_table to release the
iommu_table for this PE. A static iommu_table will fail in
iommu_free_table.

To meet these requirements, this patch allocates iommu_table
dynamically.

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/iommu.h  |3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++
 arch/powerpc/platforms/powernv/pci.h  |2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa370..5574eeb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
struct iommu_group *it_group;
 #endif
void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+   void   *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index df4a295..1b37066 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -916,6 +916,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -1005,7 +1009,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1032,7 +1036,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1069,9 +1073,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1161,8 +1165,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1228,7 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1266,8 +1269,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
uint16_t window_id = 
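
Editorial sketch (not part of the posted patch): the change described above swaps an embedded table, whose owner is found with container_of(), for a separately allocated table that carries a back-pointer to its owner. The user-space C below models both layouts. All toy_* names are hypothetical, calloc()/free() stand in for the kernel allocator, and the allocation check is ordinary user-space hygiene rather than a statement about the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified user-space equivalent of the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_table {
	void *data;			/* back-pointer to the owning PE */
};

/* Old layout: the table lives inside the PE and cannot be freed alone. */
struct toy_pe_embedded {
	int pe_number;
	struct toy_table table;
};

/* New layout: the PE only points at a table with its own lifetime. */
struct toy_pe_dynamic {
	int pe_number;
	struct toy_table *table;
};

static struct toy_pe_embedded *toy_pe_from_embedded(struct toy_table *tbl)
{
	return toy_container_of(tbl, struct toy_pe_embedded, table);
}

static int toy_pe_attach_table(struct toy_pe_dynamic *pe)
{
	pe->table = calloc(1, sizeof(*pe->table));
	if (!pe->table)
		return -1;
	pe->table->data = pe;		/* replaces the container_of() lookup */
	return 0;
}

static struct toy_pe_dynamic *toy_pe_from_dynamic(struct toy_table *tbl)
{
	return tbl->data;
}

static void toy_pe_release_table(struct toy_pe_dynamic *pe)
{
	free(pe->table);		/* valid only because it was allocated */
	pe->table = NULL;
}
```

The embedded form needs no extra pointer, but the table shares the PE's storage; the dynamic form costs one allocation and one pointer, and in exchange the table can be released when a VF goes away.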

[PATCH V14 13/21] powerpc/pci: Don't unset PCI resources for VFs

2015-03-19 Thread Wei Yang
The PCI_REASSIGN_ALL_RSRC flag tells the kernel to ignore the resource
information set up by firmware, so that it re-assigns all resources of PCI
devices.

On the powerpc arch, this happens in a header fixup function,
pcibios_fixup_resources(), which cleans up the resources if this flag
is set. This works fine for PFs, since after the clean-up the kernel
re-assigns the resources in pcibios_resource_survey().

Below is a simple call flow on how it works:

    pcibios_init
      pcibios_scan_phb
        pci_scan_child_bus
          ...
            pci_device_add
              pci_fixup_device(pci_fixup_header)
                pcibios_fixup_resources # header fixup
                  for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
                    dev->resource[i].start = 0
      pcibios_resource_survey           # re-assign
        pcibios_allocate_resources

However, the VF resources won't be re-assigned, since the VF resources are
completely determined by the PF resources, and the PF resources have
already been reassigned. This means we need to leave VF's resources
un-cleared in pcibios_fixup_resources().

In this patch, we skip the resource unset process in
pcibios_fixup_resources(), if the pci_dev is a VF.

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/pci-common.c |4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 2a525c9..8203101 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 			 pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;
-- 
1.7.9.5
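
Editorial sketch (not part of the posted patch): the effect of the early return can be modelled in user space. The toy_* types, the four-entry resource array (the kernel uses DEVICE_COUNT_RESOURCE), and the choice to preserve each resource's size while zeroing its start are all assumptions made for illustration.

```c
#define TOY_RESOURCE_COUNT 4	/* hypothetical; stands in for DEVICE_COUNT_RESOURCE */

struct toy_resource {
	unsigned long start;
	unsigned long end;
};

struct toy_pci_dev {
	int is_virtfn;		/* non-zero for a virtual function */
	struct toy_resource resource[TOY_RESOURCE_COUNT];
};

/*
 * Model of the header fixup: unset firmware-assigned addresses so they can
 * be re-assigned later, except for VFs, whose resources are derived from
 * the PF and would never be re-assigned.
 */
static void toy_fixup_resources(struct toy_pci_dev *dev)
{
	int i;

	if (dev->is_virtfn)
		return;

	for (i = 0; i < TOY_RESOURCE_COUNT; i++) {
		dev->resource[i].end -= dev->resource[i].start; /* keep the size */
		dev->resource[i].start = 0;
	}
}
```

A PF passed through toy_fixup_resources() ends up with every start at 0, while a VF keeps the values it had before the call.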


Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup

2015-03-19 Thread Michael Ellerman
On Fri, 2015-03-20 at 11:55 +0700, Arseny Solokha wrote:
 
 And by the way, while revisiting the series I've noticed that though the patch
 4/4 basically reverts [1], it leaves
 
   #define MPIC_GREG_GLOBAL_CONF_1 0x00030
 
 in arch/powerpc/include/asm/mpic.h untouched. That define also loses its uses
 after applying the patch. Compare the following hunk in today's patch w/ the
 one you committed:
 
   @@ -33,11 +33,6 @@
    #define MPIC_GREG_GCONF_NO_BIAS                 0x1000
    #define MPIC_GREG_GCONF_BASE_MASK               0x000f
    #define MPIC_GREG_GCONF_MCK                     0x0800
   -#define MPIC_GREG_GLOBAL_CONF_1                 0x00030
   -#define MPIC_GREG_GLOBAL_CONF_1_SIE             0x0800
   -#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK  0x7000
   -#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)    \
   -       (((r) << 28) & MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK)
    #define MPIC_GREG_VENDOR_0                      0x00040
    #define MPIC_GREG_VENDOR_1                      0x00050
    #define MPIC_GREG_VENDOR_2                      0x00060
 
 So the question is, should #define MPIC_GREG_GLOBAL_CONF_1 have been also
 removed, or could be left as is?
 
 [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html


OK, thanks for the thoroughness.

With #defines like that it's never clear if they should be removed or not. On
the one hand it's not used, so it should be removed. But, it can be useful to
keep the #defines there as documentation.

So I'm 50/50 on it. If you send me a patch to remove it I'll merge it, unless
someone else objects.

cheers



Re: [PATCH V14 00/21] Enable SRIOV on Power8

2015-03-19 Thread Gavin Shan
On Fri, Mar 20, 2015 at 11:06:16AM +0800, Wei Yang wrote:

[snip]

---
v14:
   * call ppc_md.pcibios_fixup_sriov() in pcibios_add_device
   * add more explanation in change log
   * Following patches have been reordered to the beginning.
 EEH refactor to use pci_dn:
 8ec20d6 powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
 a3460fc powerpc/pci: Refactor pci_dn
 These two patches will be modified to merge with other patches which are
 under discussion/review in ppc mail list. Some changes may also be made in
 other patches, which I didn't include them in this series, so that the
 auto build robot could work on this.

The comment here isn't precise enough and not the things I suggested before.
Those 2 patches have been split into 3 patches (A/B/C). Some other EEH
cleanup/refactor patches depends on A/B and those patches would be merged
before your SRIOV patches to PowerPC tree. C, which I already sent to you,
need to be integrated to your patchset right after the following one:

powerpc/pci: Don't unset PCI resources for VFs

I guess you can move the patches around after checking if Bjorn has further
concerns/comments.

Thanks,
Gavin

 There may be several changes in the powerpc arch that do not affect the pci
 core. So after this patch set passes review in the pci community, I will
 rebase this series on the ppc branch and send it out for comment.
   * use add_res-min_align as the alignment in reassign_resources_sorted()
   * some cleanup in Document
v13:
   * fix error in pcibios_iov_resource_alignment(), use pdev instead of dev
   * rename vf_num to num_vfs in pcibios_sriov_enable(),
 pnv_pci_vf_resource_shift(), pnv_pci_sriov_disable(),
 pnv_pci_sriov_enable(), pnv_pci_ioda2_setup_dma_pe()
   * add more explanation in commit powerpc/pci: Don't unset PCI resources
 for VFs
   * fix IOV BAR in hotplug path as well, and don't fixup an already added
 device
   * use roundup_pow_of_two() instead of __roundup_pow_of_two()
   * this is based on v4.0-rc1
v12:
   * remove align parameter from pcibios_iov_resource_alignment()
 default version returns pci_iov_resource_size() instead of the
 align parameter
   * in powerpc pcibios_iov_resource_alignment(), return
 pci_iov_resource_size() if there's no ppc_md function pointer
   * in pci_sriov_resource_alignment(), don't re-read base, since we
 saved the required alignment when reading it the first time
   * remove vf_num parameter from add_dev_pci_info() and
 remove_dev_pci_info(); use pci_sriov_get_totalvfs() instead
   * use dev_warn() instead of pr_warn() when possible
   * check to be sure IOV BAR is still in range after shifting, change
 pnv_pci_vf_resource_shift() from void to int
   * improve sriov_enable() error message
   * improve SR-IOV BAR sizing message
   * index IOV resources in conventional style
   * include preamble patches (refresh offset/stride when updating numVFs,
 calculate max buses required)
   * restructure pci_iov_max_bus_range() to return value instead of updating
 internally, rename to virtfn_max_buses()
   * fix typos & formatting
   * expand documentation
v11:
   * fix some compile warning
v10:
   * remove weak function pcibios_iov_resource_size()
 the VF BAR size is stored in pci_sriov structure and retrieved from
 pci_iov_resource_size()
   * Use "Reserve additional" instead of "Expand" to be more accurate in the
 change log
   * add log message to show the PF's IOV BAR final size
   * add pcibios_sriov_enable/disable() weak function in sriov_enable/disable()
 for arch setup before enabling VFs. For example, the arch could fix up the
 BDF of VFs, since a change of NumVFs affects the BDF of VFs.
   * Add some explanation of PE on Power arch in the documentation
v9:
   * make the change log consistent in the terminology
 PF's IOV BAR - the SRIOV BAR in PF
 VF's BAR - the normal BAR in VF's view
   * rename all newly introduced function from _sriov_ to _iov_
   * rename the document to 
 Documentation/powerpc/pci_iov_resource_on_powernv.txt
   * add the vendor id and device id of the tested devices
   * change return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
 pci_iov_virtfn_devfn() when it is called on PF or SRIOV is not configured
   * rebase on 3.18-rc2 and tested
v8:
   * use weak function pcibios_sriov_resource_size() instead of some flag to
 retrieve the IOV BAR size.
   * add a document Documentation/powerpc/pci_resource.txt to explain the
 design.
   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
   * extract a function res_to_dev_res(), so that it is more general to get
 additional size and alignment
   * fix one contention which is introduced in powerpc/pci: Refactor pci_dn.
 The root cause is that pci_get_slot() takes pci_bus_sem, which leads to
 deadlock.
v7:
   * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
   * when IOV BAR has IORESOURCE_ARCH flag, 

Re: [PATCH V14 10/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()

2015-03-19 Thread Gavin Shan
On Fri, Mar 20, 2015 at 11:06:26AM +0800, Wei Yang wrote:
VFs are dynamically created when a driver enables them.  On some platforms,
like PowerNV, special resources are necessary to enable VFs.

Add platform hooks for enabling and disabling VFs.

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5643a10..64c4692 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
   pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
+{
+   return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
   int rc;
@@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
   struct pci_sriov *iov = dev->sriov;
   int bars = 0;
   int bus;
+  int retval;
 
   if (!nr_virtfn)
       return 0;
@@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
   if (nr_virtfn < initial)
       initial = nr_virtfn;
 
+  if ((retval = pcibios_sriov_enable(dev, initial))) {
+      dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
+          retval);
+      return retval;
+  }
+
   for (i = 0; i < initial; i++) {
   rc = virtfn_add(dev, i, 0);
   if (rc)
@@ -335,6 +347,11 @@ failed:
   return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
+{
+   return 0;
+}
+

Since you will have to send a v15 anyway, I would suggest dropping the return
value of this function. There doesn't seem to be a reason for it to return
int here.

Thanks,
Gavin

 static void sriov_disable(struct pci_dev *dev)
 {
   int i;
@@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
   for (i = 0; i < iov->num_VFs; i++)
       virtfn_remove(dev, i, 0);
 
+  pcibios_sriov_disable(dev);
+
   iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
   pci_cfg_access_lock(dev);
   pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-- 
1.7.9.5

[PATCH V14 09/21] PCI: Export pci_iov_virtfn_bus() and pci_iov_virtfn_devfn()

2015-03-19 Thread Wei Yang
On PowerNV, some resources must be reserved for SR-IOV VFs that don't exist
yet at boot time.  To match those resources to VFs, the code needs to know
each VF's BDF in advance.

Rename virtfn_bus() and virtfn_devfn() to pci_iov_virtfn_bus() and
pci_iov_virtfn_devfn() and export them.
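
The arithmetic behind the two helpers can be sketched in standalone C. This
is only an illustration of the computation in the diff below: struct
vf_layout and the vf_bus()/vf_devfn() names are hypothetical stand-ins, not
the kernel's struct pci_dev or the exported functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pci_dev/pci_sriov fields the helpers read. */
struct vf_layout {
	uint8_t  pf_bus;    /* bus number of the PF */
	uint8_t  pf_devfn;  /* devfn of the PF */
	uint16_t offset;    /* First VF Offset */
	uint16_t stride;    /* VF Stride */
};

/* Bus of VF vf_id: PF bus plus the carry out of the 8-bit devfn field. */
static int vf_bus(const struct vf_layout *l, int vf_id)
{
	assert(vf_id >= 0);
	return l->pf_bus + ((l->pf_devfn + l->offset + l->stride * vf_id) >> 8);
}

/* devfn of VF vf_id: the low 8 bits of the same sum. */
static int vf_devfn(const struct vf_layout *l, int vf_id)
{
	assert(vf_id >= 0);
	return (l->pf_devfn + l->offset + l->stride * vf_id) & 0xff;
}
```

With a PF at bus 6, devfn 0, offset 0x80 and stride 2, VF 63 still sits on
bus 6 (devfn 0xfe), while VF 64 spills over to bus 7 (devfn 0x00).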

[bhelgaas: changelog, make busnr int]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c   |   28 
 include/linux/pci.h |   11 +++
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 2ae921f..5643a10 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,16 +19,20 @@
 
 #define VIRTFN_ID_LEN  16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int vf_id)
 {
+   if (!dev->is_physfn)
+       return -EINVAL;
    return dev->bus->number + ((dev->devfn + dev->sriov->offset +
-                   dev->sriov->stride * id) >> 8);
+                   dev->sriov->stride * vf_id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 {
+   if (!dev->is_physfn)
+       return -EINVAL;
    return (dev->devfn + dev->sriov->offset +
-       dev->sriov->stride * id) & 0xff;
+       dev->sriov->stride * vf_id) & 0xff;
 }
 
 /*
@@ -58,11 +62,11 @@ static inline u8 virtfn_max_buses(struct pci_dev *dev)
    struct pci_sriov *iov = dev->sriov;
    int nr_virtfn;
    u8 max = 0;
-   u8 busnr;
+   int busnr;
 
    for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
        pci_iov_set_numvfs(dev, nr_virtfn);
-       busnr = virtfn_bus(dev, nr_virtfn - 1);
+       busnr = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
        if (busnr > max)
            max = busnr;
    }
@@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
    struct pci_bus *bus;
 
    mutex_lock(&iov->dev->sriov->lock);
-   bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+   bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
    if (!bus)
        goto failed;
 
@@ -124,7 +128,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
    if (!virtfn)
        goto failed0;
 
-   virtfn->devfn = virtfn_devfn(dev, id);
+   virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
    virtfn->vendor = dev->vendor;
    pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
    pci_setup_device(virtfn);
@@ -186,8 +190,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
    struct pci_sriov *iov = dev->sriov;
 
    virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-                    virtfn_bus(dev, id),
-                    virtfn_devfn(dev, id));
+                    pci_iov_virtfn_bus(dev, id),
+                    pci_iov_virtfn_devfn(dev, id));
    if (!virtfn)
        return;
 
@@ -226,7 +230,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
    struct pci_dev *pdev;
    struct pci_sriov *iov = dev->sriov;
    int bars = 0;
-   u8 bus;
+   int bus;
 
    if (!nr_virtfn)
        return 0;
@@ -263,7 +267,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
    iov->offset = offset;
    iov->stride = stride;
 
-   bus = virtfn_bus(dev, nr_virtfn - 1);
+   bus = pci_iov_virtfn_bus(dev, nr_virtfn - 1);
    if (bus > dev->bus->busn_res.end) {
        dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
            nr_virtfn, bus, &dev->bus->busn_res);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1559658..99ea948 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1669,6 +1669,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1677,6 +1680,14 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+   return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+   return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
-- 

[PATCH V14 10/21] PCI: Add pcibios_sriov_enable() and pcibios_sriov_disable()

2015-03-19 Thread Wei Yang
VFs are dynamically created when a driver enables them.  On some platforms,
like PowerNV, special resources are necessary to enable VFs.

Add platform hooks for enabling and disabling VFs.
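
A minimal userspace sketch of the weak-hook pattern follows. It assumes a
GCC/Clang toolchain that supports __attribute__((weak)); struct fake_pf,
platform_sriov_enable() and enable_vfs() are hypothetical names used only
for illustration, not the functions added by this patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the PF device state. */
struct fake_pf {
	int num_vfs_enabled;
};

/*
 * Default hook: does nothing and reports success.  A platform that needs
 * extra setup can provide a strong definition with the same signature,
 * which the linker will prefer over this weak one.
 */
__attribute__((weak)) int platform_sriov_enable(struct fake_pf *pf,
						 unsigned short num_vfs)
{
	(void)pf;
	(void)num_vfs;
	return 0;
}

/* Generic path: call the hook first and bail out if the platform refuses. */
static int enable_vfs(struct fake_pf *pf, unsigned short num_vfs)
{
	int rc;

	assert(pf);
	rc = platform_sriov_enable(pf, num_vfs);
	if (rc)
		return rc;

	pf->num_vfs_enabled = num_vfs;
	return 0;
}
```

Because the default returns 0, platforms that don't override the hook see no
change in behaviour; only a platform with its own definition can veto the
enable.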

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5643a10..64c4692 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -220,6 +220,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
+{
+   return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -231,6 +236,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
    struct pci_sriov *iov = dev->sriov;
    int bars = 0;
    int bus;
+   int retval;
 
    if (!nr_virtfn)
        return 0;
@@ -307,6 +313,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
    if (nr_virtfn < initial)
        initial = nr_virtfn;
 
+   if ((retval = pcibios_sriov_enable(dev, initial))) {
+       dev_err(&dev->dev, "failure %d from pcibios_sriov_enable()\n",
+           retval);
+       return retval;
+   }
+
    for (i = 0; i < initial; i++) {
rc = virtfn_add(dev, i, 0);
if (rc)
@@ -335,6 +347,11 @@ failed:
return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
+{
+   return 0;
+}
+
 static void sriov_disable(struct pci_dev *dev)
 {
int i;
@@ -346,6 +363,8 @@ static void sriov_disable(struct pci_dev *dev)
    for (i = 0; i < iov->num_VFs; i++)
        virtfn_remove(dev, i, 0);
 
+   pcibios_sriov_disable(dev);
+
    iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
    pci_cfg_access_lock(dev);
    pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-- 
1.7.9.5

[PATCH V14 21/21] powerpc/pci: Add PCI resource alignment documentation

2015-03-19 Thread Wei Yang
In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to
be adjusted:

1. size expanded
2. aligned to M64BT size

This patch documents why the change is needed and how it is done.

[bhelgaas: reformat, clarify, expand]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 .../powerpc/pci_iov_resource_on_powernv.txt|  301 
 1 file changed, 301 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..b55c5cd
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,301 @@
+Wei Yang weiy...@linux.vnet.ibm.com
+Benjamin Herrenschmidt b...@au1.ibm.com
+Bjorn Helgaas bhelg...@google.com
+26 Aug 2014
+
+This document describes the requirement from hardware for PCI MMIO resource
+sizing and assignment on PowerKVM and how generic PCI code handles this
+requirement. The first two sections describe the concepts of Partitionable
+Endpoints and the implementation on P8 (IODA2). The next two sections talk
+about considerations on enabling SRIOV on IODA2.
+
+1. Introduction to Partitionable Endpoints
+
+A Partitionable Endpoint (PE) is a way to group the various resources
+associated with a device or a set of devices to provide isolation between
+partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
+to freeze a device that is causing errors in order to limit the possibility
+of propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of frozen
+state bits (one for MMIO and one for DMA, they get set together but can be
+cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's value. MSIs are also blocked. There's a bit more state that
+captures things like the details of the error that caused the freeze etc., but
+that's not critical.
+
+The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
+are matched to their corresponding PEs.
+
+The following section provides a rough description of what we have on P8
+(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
+is a completely separate HW entity that replicates the entire logic, so has
+its own set of PEs, etc.
+
+2. Implementation of Partitionable Endpoints on P8 (IODA2)
+
+P8 supports up to 256 Partitionable Endpoints per PHB.
+
+  * Inbound
+
+For DMA, MSIs and inbound PCIe error messages, we have a table (in
+memory but accessed in HW by the chip) that provides a direct
+correspondence between a PCIe RID (bus/dev/fn) with a PE number.
+We call this the RTT.
+
+- For DMA we then provide an entire address space for each PE that can
+  contain two windows, depending on the value of PCI address bit 59.
+  Each window can be configured to be remapped via a TCE table (IOMMU
+  translation table), which has various configurable characteristics
+  not described here.
+
+- For MSIs, we have two windows in the address space (one at the top of
+  the 32-bit space and one much higher) which, via a combination of the
+  address and MSI value, will result in one of the 2048 interrupts per
+  bridge being triggered.  There's a PE# in the interrupt controller
+  descriptor table as well which is compared with the PE# obtained from
+  the RTT to authorize the device to emit that specific interrupt.
+
+- Error messages just use the RTT.
+
+  * Outbound.  That's where the tricky part is.
+
+Like other PCI host bridges, the Power8 IODA2 PHB supports windows
+from the CPU address space to the PCI address space.  There is one M32
+window and sixteen M64 windows.  They have different characteristics.
+First what they have in common: they forward a configurable portion of
+the CPU address space to the PCIe bus and must be naturally aligned
+power of two in size.  The rest is different:
+
+- The M32 window:
+
+  * Is limited to 4GB in size.
+
+  * Drops the top bits of the address (above the size) and replaces
+   them with a configurable value.  This is typically used to generate
+   32-bit PCIe accesses.  We configure that window at boot from FW and
+   don't touch it from Linux; it's usually set to forward a 2GB
+   portion of address space from the CPU to PCIe
+   0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
+   reserved for MSIs but this is not a problem at this point; we just
+   need to ensure Linux doesn't assign anything there, the M32 logic
+   ignores that however and will forward in that space if we try).
+
+  * It is divided into 256 segments of equal size.  A table in the chip
+   maps each segment to a PE#.  That allows portions of the MMIO space
+   to be assigned to PEs on a segment 

[PATCH V14 18/21] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported

2015-03-19 Thread Wei Yang
M64 aperture size is limited on PHB3.  When the IOV BAR is too big, it
exceeds this limit and fails to be assigned.

Introduce a different mechanism based on the IOV BAR size:

  - if IOV BAR size is smaller than 64MB, expand to total_pe
  - if IOV BAR size is bigger than 64MB, round up to a power of 2
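
The size-based choice can be sketched in standalone C as follows. This is a
simplified per-BAR illustration under stated assumptions: in the diff below,
one oversized BAR switches the multiplier for every IOV BAR of the device,
and roundup_pow2(), iov_bar_multiplier() and expanded_end() are hypothetical
helper names rather than kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define IOV_BIG_BAR_THRESHOLD (UINT64_C(1) << 26)  /* 64MB */

/* Smallest power of two >= n; n must be non-zero. */
static uint64_t roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n > 0 && n <= (UINT64_C(1) << 63));
	while (p < n)
		p <<= 1;
	return p;
}

/* How many VF-BAR-sized slots the IOV BAR is expanded to cover. */
static uint64_t iov_bar_multiplier(uint64_t vf_bar_size, unsigned int total_pe,
				   unsigned int total_vfs)
{
	if (vf_bar_size > IOV_BIG_BAR_THRESHOLD)
		return roundup_pow2(total_vfs);
	return total_pe;
}

/* Inclusive end address of the expanded IOV BAR. */
static uint64_t expanded_end(uint64_t start, uint64_t vf_bar_size, uint64_t mul)
{
	return start + vf_bar_size * mul - 1;
}
```

For example, with 256 PEs and 60 VFs, a 1MB VF BAR is expanded 256 times,
while a 128MB VF BAR is expanded only 64 times (60 rounded up to a power of
two).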

[bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pci-bridge.h |2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 3c95097..d6942c9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -179,6 +179,8 @@ struct pci_dn {
u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
u16 num_vfs;/* number of VFs enabled*/
int offset; /* PE# for the first VF PE */
+#define M64_PER_IOV 4
+   int m64_per_iov;
 #define IODA_INVALID_M64        (-1)
int m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index bd1b678..89bbcc4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2246,6 +2246,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
    int i;
    resource_size_t size;
    struct pci_dn *pdn;
+   int mul, total_vfs;
 
    if (!pdev->is_physfn || pdev->is_added)
        return;
@@ -2256,6 +2257,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
    pdn = pci_get_pdn(pdev);
    pdn->vfs_expanded = 0;
 
+   total_vfs = pci_sriov_get_totalvfs(pdev);
+   pdn->m64_per_iov = 1;
+   mul = phb->ioda.total_pe;
+
+   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+       res = &pdev->resource[i + PCI_IOV_RESOURCES];
+       if (!res->flags || res->parent)
+           continue;
+       if (!pnv_pci_is_mem_pref_64(res->flags)) {
+           dev_warn(&pdev->dev, " non M64 VF BAR%d: %pR\n",
+                i, res);
+           continue;
+       }
+
+       size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
+
+       /* bigger than 64M */
+       if (size > (1 << 26)) {
+           dev_info(&pdev->dev, "PowerNV: VF BAR%d: %pR IOV size is bigger than 64M, roundup power2\n",
+                i, res);
+           pdn->m64_per_iov = M64_PER_IOV;
+           mul = roundup_pow_of_two(total_vfs);
+           break;
+       }
+   }
+
    for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
        res = &pdev->resource[i + PCI_IOV_RESOURCES];
        if (!res->flags || res->parent)
@@ -2268,12 +2295,12 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
        dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
        size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
-       res->end = res->start + size * phb->ioda.total_pe - 1;
+       res->end = res->start + size * mul - 1;
        dev_dbg(&pdev->dev, " %pR\n", res);
        dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
-               i, res, phb->ioda.total_pe);
+            i, res, mul);
    }
-   pdn->vfs_expanded = phb->ioda.total_pe;
+   pdn->vfs_expanded = mul;
 }
 #endif /* CONFIG_PCI_IOV */
 
-- 
1.7.9.5

Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup

2015-03-19 Thread Michael Ellerman
On Fri, 2015-03-20 at 10:56 +0700, Arseny Solokha wrote:
 This series removes unused functions from powerpc tree that I've been able
 to discover.
 
 Two machines at hands, e300 and e500 based, boot and run without regressions
 on my workload with this series applied. The removed code seems also been
 rarely touched, so it seems the series is safe at least in general. But I
 can't obviously express any strong point in support of the series, so it's
 completely OK to leave things as is.
 
 v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from
 arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao
 (powerpc/85xx: workaround for chips with MSI hardware errata) makes
 use of it.

Sorry, too late.

https://git.kernel.org/cgit/linux/kernel/git/mpe/linux.git/commit/?h=next&id=5e86bfde9cd93f272844c3ff6ac5f93d3666b3e7


The patch that needs it can just add it back.

cheers


[PATCH V14 00/21] Enable SRIOV on Power8

2015-03-19 Thread Wei Yang
This patchset enables the SRIOV on POWER8.

The general idea is to put each VF into its own PE and allocate the required
resources like MMIO/DMA/MSI. The major difficulty comes from the MMIO
allocation and adjustment for the PF's IOV BAR.

On P8, we use M64BT to cover a PF's IOV BAR, which lets each VF sit in its
own PE. This gives more flexibility, but at the same time it imposes some
restrictions on the PF's IOV BAR size and alignment.

To achieve this effect, we need to do some hacks on the PCI device's resources.
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. IOV BAR alignment is calculated by arch dependent function instead of an
   individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit: PCI: Take additional IOV BAR alignment in
   sizing and assigning

Test Environment:
   The SRIOV device tested is Emulex Lancer(10df:e220) and
   Mellanox ConnectX-3(15b3:1003) on POWER8.

Example of passing a VF through to a guest via vfio:
1. unbind the original driver and bind to vfio-pci driver
   echo :06:0d.0 > /sys/bus/pci/devices/:06:0d.0/driver/unbind
   echo  1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
   Note: this should be done for each device in the same iommu_group
2. Start qemu and pass device through vfio
   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
   -M pseries -m 2048 -enable-kvm -nographic \
   -drive file=/home/ywywyang/kvm/fc19.img \
   -monitor telnet:localhost:5435,server,nowait -boot cd \
   -device spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6

Verify this is the exact VF response:
1. ping from a machine in the same subnet(the broadcast domain)
2. run arp -n on this machine
   9.115.251.20 ether   00:00:c9:df:ed:bf   C eth0
3. ifconfig in the guest
   # ifconfig eth1
   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
        inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
        RX packets 175  bytes 13278 (12.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 58  bytes 9276 (9.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
4. They have the same MAC address

Note: make sure you shut down the other network interfaces in the guest.

---
v14:
   * call ppc_md.pcibios_fixup_sriov() in pcibios_add_device
   * add more explanation in change log
   * Following patches have been reordered to the beginning.
 EEH refactor to use pci_dn:
 8ec20d6 powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor
 a3460fc powerpc/pci: Refactor pci_dn
 These two patches will be modified to merge with other patches which are
 under discussion/review on the ppc mailing list. Some changes may also be
 made in other patches, which I didn't include in this series, so that the
 auto build robot could work on this.
 There may be several changes in the powerpc arch code that do not affect
 the PCI core. So after this patch set passes review in the PCI community,
 I will rebase this series on the ppc branch and send it out for comment.
   * use add_res->min_align as the alignment in reassign_resources_sorted()
   * some cleanup in the documentation
v13:
   * fix error in pcibios_iov_resource_alignment(), use pdev instead of dev
   * rename vf_num to num_vfs in pcibios_sriov_enable(),
 pnv_pci_vf_resource_shift(), pnv_pci_sriov_disable(),
 pnv_pci_sriov_enable(), pnv_pci_ioda2_setup_dma_pe()
   * add more explanation in commit powerpc/pci: Don't unset PCI resources
 for VFs
   * fix IOV BAR in hotplug path as well, and don't fixup an already added
 device
   * use roundup_pow_of_two() instead of __roundup_pow_of_two()
   * this is based on v4.0-rc1
v12:
   * remove align parameter from pcibios_iov_resource_alignment()
 default version returns pci_iov_resource_size() instead of the
 align parameter
   * in powerpc pcibios_iov_resource_alignment(), return
 pci_iov_resource_size() if there's no ppc_md function pointer
   * in pci_sriov_resource_alignment(), don't re-read base, since we
 saved the required alignment when reading it the first time
   * remove vf_num parameter from add_dev_pci_info() and
 remove_dev_pci_info(); use pci_sriov_get_totalvfs() instead
   * use dev_warn() instead of pr_warn() when possible
   * check to be sure IOV BAR is still in range after shifting, change
 pnv_pci_vf_resource_shift() from void to int
   

[PATCH V14 08/21] PCI: Calculate maximum number of buses required for VFs

2015-03-19 Thread Wei Yang
An SR-IOV device can change its First VF Offset and VF Stride based on the
values of ARI Capable Hierarchy and NumVFs.  The number of buses required
for all VFs is determined by NumVFs, First VF Offset, and VF Stride (see
SR-IOV spec r1.1, sec 2.1.2).

Previously pci_iov_bus_range() computed how many buses would be required by
TotalVFs, but this was based on a single NumVFs value and may not have been
the maximum for all NumVFs configurations.

Iterate over all valid NumVFs and calculate the maximum number of bus
numbers that could ever be required for VFs of this device.
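
Why a single NumVFs value is not enough can be sketched in standalone C. The
geometry callback below is a hypothetical model of "write NumVFs, then
re-read First VF Offset and VF Stride"; the example device and its numbers
are made up for illustration and do not describe real hardware.

```c
#include <assert.h>
#include <stdint.h>

struct vf_geometry {
	uint16_t offset;  /* First VF Offset for a given NumVFs */
	uint16_t stride;  /* VF Stride for a given NumVFs */
};

/* Hypothetical model of the device: geometry as a function of NumVFs. */
typedef struct vf_geometry (*geometry_fn)(int num_vfs);

/* Highest bus number any VF can land on, over all valid NumVFs settings. */
static int max_vf_bus(uint8_t pf_bus, uint8_t pf_devfn, int total_vfs,
		      geometry_fn geometry)
{
	int max = 0;

	assert(total_vfs >= 0);
	for (int n = 1; n <= total_vfs; n++) {
		struct vf_geometry g = geometry(n);
		/* Routing ID of the last VF when n VFs are enabled. */
		int last = pf_devfn + g.offset + g.stride * (n - 1);
		int busnr = pf_bus + (last >> 8);

		if (busnr > max)
			max = busnr;
	}
	return max;
}

/* Made-up device: wide stride for up to 8 VFs, packed stride above that. */
static struct vf_geometry example_geometry(int num_vfs)
{
	struct vf_geometry g = { 0x10, (uint16_t)(num_vfs <= 8 ? 0x40 : 0x01) };
	return g;
}
```

For this made-up device with the PF on bus 1 and TotalVFs = 16, enabling all
16 VFs keeps every VF on bus 1, but enabling only 8 VFs places the last one
on bus 2, so the maximum over all settings is 2.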

[bhelgaas: changelog, compute busnr of NumVFs, not TotalVFs, remove
kernel-doc comment marker]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |   31 +++
 drivers/pci/pci.h |1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index a8752c2..2ae921f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -46,6 +46,30 @@ static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
    pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
 }
 
+/*
+ * The PF consumes one bus number.  NumVFs, First VF Offset, and VF Stride
+ * determine how many additional bus numbers will be consumed by VFs.
+ *
+ * Iterate over all valid NumVFs and calculate the maximum number of bus
+ * numbers that could ever be required.
+ */
+static inline u8 virtfn_max_buses(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev->sriov;
+   int nr_virtfn;
+   u8 max = 0;
+   u8 busnr;
+
+   for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
+       pci_iov_set_numvfs(dev, nr_virtfn);
+       busnr = virtfn_bus(dev, nr_virtfn - 1);
+       if (busnr > max)
+           max = busnr;
+   }
+
+   return max;
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
struct pci_bus *child;
@@ -427,6 +451,7 @@ found:
 
    dev->sriov = iov;
    dev->is_physfn = 1;
+   iov->max_VF_buses = virtfn_max_buses(dev);
 
return 0;
 
@@ -556,15 +581,13 @@ void pci_restore_iov_state(struct pci_dev *dev)
 int pci_iov_bus_range(struct pci_bus *bus)
 {
int max = 0;
-   u8 busnr;
struct pci_dev *dev;
 
    list_for_each_entry(dev, &bus->devices, bus_list) {
        if (!dev->is_physfn)
            continue;
-       busnr = virtfn_bus(dev, dev->sriov->total_VFs - 1);
-       if (busnr > max)
-           max = busnr;
+       if (dev->sriov->max_VF_buses > max)
+           max = dev->sriov->max_VF_buses;
    }
 
    return max ? max - bus->number : 0;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5732964..bae593c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -243,6 +243,7 @@ struct pci_sriov {
u16 stride; /* following VF stride */
u32 pgsz;   /* page size for BAR alignment */
u8 link;/* Function Dependency Link */
+   u8 max_VF_buses;/* max buses consumed by VFs */
u16 driver_max_VFs; /* max num VFs driver supports */
struct pci_dev *dev;/* lowest numbered PF */
struct pci_dev *self;   /* this PF */
-- 
1.7.9.5

[PATCH V14 07/21] PCI: Refresh First VF Offset and VF Stride when updating NumVFs

2015-03-19 Thread Wei Yang
The First VF Offset and VF Stride fields depend on the NumVFs setting, so
refresh the cached fields in struct pci_sriov when updating NumVFs.  See
the SR-IOV spec r1.1, sec 3.3.9 and 3.3.10.

[bhelgaas: changelog, remove kernel-doc comment marker]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |   23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 27b98c3..a8752c2 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,21 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
        dev->sriov->stride * id) & 0xff;
 }
 
+/*
+ * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
+ * change when NumVFs changes.
+ *
+ * Update iov->offset and iov->stride when NumVFs is written.
+ */
+static inline void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
+{
+   struct pci_sriov *iov = dev->sriov;
+
+   pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &iov->offset);
+   pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
+}
+
 static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
 {
struct pci_bus *child;
@@ -253,7 +268,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
        return rc;
    }
 
-   pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+   pci_iov_set_numvfs(dev, nr_virtfn);
    iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
    pci_cfg_access_lock(dev);
    pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
@@ -282,7 +297,7 @@ failed:
    iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
    pci_cfg_access_lock(dev);
    pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-   pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+   pci_iov_set_numvfs(dev, 0);
    ssleep(1);
    pci_cfg_access_unlock(dev);
 
@@ -313,7 +328,7 @@ static void sriov_disable(struct pci_dev *dev)
    sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
    iov->num_VFs = 0;
-   pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
+   pci_iov_set_numvfs(dev, 0);
 }
 
 static int sriov_init(struct pci_dev *dev, int pos)
@@ -452,7 +467,7 @@ static void sriov_restore_state(struct pci_dev *dev)
        pci_update_resource(dev, i);
 
    pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
-   pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->num_VFs);
+   pci_iov_set_numvfs(dev, iov->num_VFs);
    pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
    if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
        msleep(100);
-- 
1.7.9.5

[PATCH V14 19/21] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3

2015-03-19 Thread Wei Yang
When the IOV BAR is big, each one is covered by 4 M64 windows.  As a result,
several VF PEs sit in one PE in terms of M64.

Group VF PEs according to the M64 allocation.

[bhelgaas: use dev_printk() when possible]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pci-bridge.h |2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  197 ++---
 2 files changed, 154 insertions(+), 45 deletions(-)
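
Note (illustration only, not part of the patch): a minimal sketch of the
grouping arithmetic used in pnv_pci_vf_assign_m64() below.  It assumes the
comparisons in that hunk read "num_vfs <= M64_PER_IOV"; group_vfs(),
struct vf_grouping and roundup_pow2() are hypothetical names, with
roundup_pow2() standing in for the kernel's roundup_pow_of_two().

```c
#define M64_PER_IOV 4

/* Hypothetical result type: how many VF groups, and how many VFs in each. */
struct vf_grouping {
	unsigned int vf_groups;
	unsigned int vf_per_group;
};

/* Round n (n >= 1) up to the next power of two. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirror of the grouping decision: when each IOV BAR is split over
 * M64_PER_IOV windows, small VF counts get one group per VF, larger
 * counts are spread evenly over the M64_PER_IOV windows.  Otherwise
 * a single group is used.
 */
static struct vf_grouping group_vfs(unsigned int num_vfs, int m64_per_iov)
{
	struct vf_grouping g = { 1, 1 };

	if (m64_per_iov != M64_PER_IOV)
		return g;

	if (num_vfs <= M64_PER_IOV) {
		g.vf_groups = num_vfs;
		g.vf_per_group = 1;
	} else {
		g.vf_groups = M64_PER_IOV;
		g.vf_per_group = roundup_pow2(num_vfs) / M64_PER_IOV;
	}
	return g;
}
```

Under these assumptions, 3 VFs give 3 groups of 1 VF each, while 10 VFs give
4 groups of 4 VFs each (16 / 4), i.e. one group per M64 window.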

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d6942c9..ec83b51 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -182,7 +182,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
int m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-   int m64_wins[PCI_SRIOV_NUM_BARS];
+   int m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 89bbcc4..8e8399f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1156,26 +1156,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
struct pci_controller *hose;
struct pnv_phb *phb;
struct pci_dn *pdn;
-   int i;
+   int i, j;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
phb = hose->private_data;
pdn = pci_get_pdn(pdev);
 
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   if (pdn->m64_wins[i] == IODA_INVALID_M64)
-   continue;
-   opal_pci_phb_mmio_enable(phb->opal_id,
-   OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-   clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-   pdn->m64_wins[i] = IODA_INVALID_M64;
-   }
+   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+   for (j = 0; j < M64_PER_IOV; j++) {
+   if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+   continue;
+   opal_pci_phb_mmio_enable(phb->opal_id,
+   OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+   clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+   pdn->m64_wins[i][j] = IODA_INVALID_M64;
+   }
 
return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
struct pci_bus *bus;
struct pci_controller *hose;
@@ -1183,17 +1184,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
struct pci_dn *pdn;
unsigned int   win;
struct resource   *res;
-   int i;
+   int i, j;
int64_t rc;
+   int total_vfs;
+   resource_size_t size, start;
+   int pe_num;
+   int vf_groups;
+   int vf_per_group;
 
bus = pdev->bus;
hose = pci_bus_to_host(bus);
phb = hose->private_data;
pdn = pci_get_pdn(pdev);
+   total_vfs = pci_sriov_get_totalvfs(pdev);
 
/* Initialize the m64_wins to IODA_INVALID_M64 */
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-   pdn->m64_wins[i] = IODA_INVALID_M64;
+   for (j = 0; j < M64_PER_IOV; j++)
+   pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+   if (pdn->m64_per_iov == M64_PER_IOV) {
+   vf_groups = (num_vfs <= M64_PER_IOV) ? num_vfs: M64_PER_IOV;
+   vf_per_group = (num_vfs <= M64_PER_IOV)? 1:
+   roundup_pow_of_two(num_vfs) / pdn->m64_per_iov;
+   } else {
+   vf_groups = 1;
+   vf_per_group = 1;
+   }
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
@@ -1203,35 +1220,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
if (!pnv_pci_is_mem_pref_64(res->flags))
continue;
 
-   do {
-   win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-   phb->ioda.m64_bar_idx + 1, 0);
-
-   if (win >= phb->ioda.m64_bar_idx + 1)
-   goto m64_failed;
-   } while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+   for (j = 0; j < vf_groups; j++) {
+   do {
+   win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+   phb->ioda.m64_bar_idx + 1, 0);
+
+   if (win >= phb->ioda.m64_bar_idx + 1)
+   

[PATCH V14 20/21] powerpc/pci: Remove unused struct pci_dn.pcidev field

2015-03-19 Thread Wei Yang
In struct pci_dn, the pcidev field is assigned but not used, so remove it.

Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
Acked-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pci-bridge.h |1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index ec83b51..680ae56 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -168,7 +168,6 @@ struct pci_dn {
 
int pci_ext_config_space;   /* for pci devices */
 
-   struct  pci_dev *pcidev;/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
struct eeh_dev *edev;   /* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8e8399f..3a79dfa 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1028,7 +1028,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
pci_name(dev));
continue;
}
-   pdn->pcidev = dev;
pdn->pe_number = pe->pe_number;
pe->dma_weight += pnv_ioda_dma_weight(dev);
if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-- 
1.7.9.5


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:

> Bit more variance there than the pte checking, but runtime
> difference is in the noise - 5m4s vs 4m54s - and profiles are
> identical to the pte checking version.

Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
as the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything
right, we now have:

 - plain 3.19 (pte_write): 4m54s
 - 3.19 with vm_flags & VM_WRITE: 5m4s
 - 3.19 with pte_dirty: 5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
_much_ worse, but I'm wondering whether that VM_WRITE test will at
least shrink the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things
introduced. It would be good to see if the profiles then pop something
*else* up as the performance difference (which I'm sure will remain,
since the 7m50s was so far off).

Linus

[PATCH V14 01/21] powerpc/pci: Refactor pci_dn

2015-03-19 Thread Wei Yang
From: Gavin Shan gws...@linux.vnet.ibm.com

pci_dn is the extension of a PCI device node and is created from the device
node.  Unfortunately, VFs are enabled dynamically by the PF's driver, so they
have neither device nodes nor pci_dn.  Refactor pci_dn to support VFs:

   * pci_dn is organized as a hierarchy tree.  A VF's pci_dn is put
     on the child list of the pci_dn of the PF's bridge.  The pci_dn of
     any other device is put on the child list of the pci_dn of its
     upstream bridge.

   * A VF's pci_dn is created dynamically when the PF enables VFs and
     destroyed when the PF disables VFs.  The pci_dn of any other device
     is still created from its device node as before.

   * For any PCI device (VF or not), its pci_dn can be found from
     pdev->dev.archdata.firmware_data, PCI_DN(devnode), or the parent's
     list.  The fast path (fetching pci_dn through the PCI device
     instance) is populated during early fixup time.

[bhelgaas: add ifdef around add_one_dev_pci_info(), use dev_printk()]
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/device.h |3 +
 arch/powerpc/include/asm/pci-bridge.h |   14 +-
 arch/powerpc/kernel/pci_dn.c  |  245 -
 arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
 4 files changed, 272 insertions(+), 6 deletions(-)
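
Note (illustration only, not part of the patch): a minimal sketch of the
parent/child organization described in the changelog.  struct toy_pdn and
its helpers are hypothetical stand-ins; the real struct pci_dn uses
struct list_head for child_list/list, while this sketch uses plain
pointers to stay self-contained.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct pci_dn. */
struct toy_pdn {
	int busno;
	int devfn;
	struct toy_pdn *parent;        /* pci_dn of the upstream bridge */
	struct toy_pdn *first_child;   /* head of this node's child list */
	struct toy_pdn *next_sibling;  /* link in the parent's child list */
};

/* Hook a child (e.g. a freshly enabled VF) under its parent bridge. */
static void toy_pdn_add_child(struct toy_pdn *parent, struct toy_pdn *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/* Slow path: walk the parent's child list looking for bus/devfn. */
static struct toy_pdn *toy_pdn_find(struct toy_pdn *parent,
				    int busno, int devfn)
{
	struct toy_pdn *p;

	for (p = parent->first_child; p; p = p->next_sibling)
		if (p->busno == busno && p->devfn == devfn)
			return p;
	return NULL;
}
```

The fast path mentioned in the changelog avoids this walk by caching the
result in the device itself (archdata.firmware_data in the patch).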

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 38faede..29992cd 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -34,6 +34,9 @@ struct dev_archdata {
 #ifdef CONFIG_SWIOTLB
dma_addr_t  max_direct_dma_addr;
 #endif
+#ifdef CONFIG_PPC64
+   void *firmware_data;
+#endif
 #ifdef CONFIG_EEH
struct eeh_dev  *edev;
 #endif
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 546d036..513f8f2 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -89,6 +89,7 @@ struct pci_controller {
 
 #ifdef CONFIG_PPC64
unsigned long buid;
+   void *firmware_data;
 #endif /* CONFIG_PPC64 */
 
void *private_data;
@@ -154,9 +155,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address)
 struct iommu_table;
 
 struct pci_dn {
+   int flags;
+#define PCI_DN_FLAG_IOV_VF 0x01
+
int busno;  /* pci bus number */
int devfn;  /* pci device and function number */
 
+   struct  pci_dn *parent;
struct  pci_controller *phb;/* for pci devices */
struct  iommu_table *iommu_table;   /* for phb's or bridges */
struct  device_node *node;  /* back-pointer to the device_node */
@@ -171,14 +176,19 @@ struct pci_dn {
 #ifdef CONFIG_PPC_POWERNV
int pe_number;
 #endif
+   struct list_head child_list;
+   struct list_head list;
 };
 
 /* Get the pointer to a device_node's pci_dn */
 #define PCI_DN(dn) ((struct pci_dn *) (dn)->data)
 
+extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+  int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-
-extern void * update_dn_pci_info(struct device_node *dn, void *data);
+extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev);
+extern void remove_dev_pci_info(struct pci_dev *pdev);
+extern void *update_dn_pci_info(struct device_node *dn, void *data);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 83df307..f3a1a81 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -32,12 +32,223 @@
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
 
+/*
+ * The function is used to find the firmware data of one
+ * specific PCI device, which is attached to the indicated
+ * PCI bus. For VFs, their firmware data is linked to that
+ * one of PF's bridge. For other devices, their firmware
+ * data is linked to that of their bridge.
+ */
+static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
+{
+   struct pci_bus *pbus;
+   struct device_node *dn;
+   struct pci_dn *pdn;
+
+   /*
+* We probably have virtual bus which doesn't
+* have associated bridge.
+*/
+   pbus = bus;
+   while (pbus) {
+   if (pci_is_root_bus(pbus) || pbus->self)
+   break;
+
+   pbus = pbus->parent;
+   }
+
+   /*
+* Except virtual bus, all PCI buses should
+* have device nodes.
+*/
+   dn = pci_bus_to_OF_node(pbus);
+   pdn = dn ? PCI_DN(dn) : NULL;
+
+   return pdn;
+}
+
+struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+   int devfn)
+{
+   struct device_node *dn = NULL;
+   struct pci_dn *parent, 

[PATCH V14 02/21] powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor

2015-03-19 Thread Wei Yang
The PCI config accessors previously relied on device_node.  Unfortunately,
VFs don't have a corresponding device_node, so change the accessors to use
pci_dn instead.

[bhelgaas: changelog]
Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +-
 arch/powerpc/platforms/powernv/pci.c |   69 ++
 arch/powerpc/platforms/powernv/pci.h |4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)
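
Note (illustration only, not part of the patch): the accessors below
address a device by a "bdfn" built from pci_dn's busno and devfn rather
than from a device_node.  A minimal sketch of that encoding follows;
TOY_DEVFN() and the toy_* helpers are hypothetical names, and the
slot/function packing is the conventional 5-bit/3-bit PCI layout.

```c
#include <stdint.h>

/* Pack slot (5 bits) and function (3 bits) into a devfn byte. */
#define TOY_DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/* Combine bus number and devfn the way the accessors form bdfn. */
static uint32_t toy_bdfn(uint8_t busno, uint8_t devfn)
{
	return ((uint32_t)busno << 8) | devfn;
}

/* Decode helpers, handy when reading bdfn=%04x debug output. */
static unsigned int toy_bdfn_bus(uint32_t bdfn)  { return (bdfn >> 8) & 0xff; }
static unsigned int toy_bdfn_slot(uint32_t bdfn) { return (bdfn >> 3) & 0x1f; }
static unsigned int toy_bdfn_func(uint32_t bdfn) { return bdfn & 0x07; }
```

Because a VF has a busno and devfn but no device_node, this value can be
computed for it as soon as it has a pci_dn.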

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index e261869..7a5021b 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -430,21 +430,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
   int where, int size, u32 *val)
 {
+   struct pci_dn *pdn = PCI_DN(dn);
+
+   if (!pdn)
+   return PCIBIOS_DEVICE_NOT_FOUND;
+
if (powernv_eeh_cfg_blocked(dn)) {
*val = 0xFFFFFFFF;
return PCIBIOS_SET_FAILED;
}
 
-   return pnv_pci_cfg_read(dn, where, size, val);
+   return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
int where, int size, u32 val)
 {
+   struct pci_dn *pdn = PCI_DN(dn);
+
+   if (!pdn)
+   return PCIBIOS_DEVICE_NOT_FOUND;
+
if (powernv_eeh_cfg_blocked(dn))
return PCIBIOS_SET_FAILED;
 
-   return pnv_pci_cfg_write(dn, where, size, val);
+   return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e69142f..6c20d6e 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+   struct pnv_phb *phb = pdn->phb->private_data;
u8  fstate;
__be16  pcierr;
int pe_no;
@@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 * setup that yet. So all ER errors should be mapped to
 * reserved PE.
 */
-   pe_no = PCI_DN(dn)->pe_number;
+   pe_no = pdn->pe_number;
if (pe_no == IODA_INVALID_PE) {
if (phb->type == PNV_PHB_P5IOC2)
pe_no = 0;
@@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
}
 
cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-   (PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-   pe_no, fstate);
+   (pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
/* Clear the frozen state if applicable */
if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 int where, int size, u32 *val)
 {
-   struct pci_dn *pdn = PCI_DN(dn);
struct pnv_phb *phb = pdn->phb->private_data;
u32 bdfn = (pdn->busno << 8) | pdn->devfn;
s64 rc;
@@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
  int where, int size, u32 val)
 {
-   struct pci_dn *pdn = PCI_DN(dn);
struct pnv_phb *phb = pdn->phb->private_data;
u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
- struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
struct eeh_dev *edev = NULL;
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pdn->phb->private_data;
 
/* EEH not enabled ? */
if (!(phb->flags & PNV_PHB_FLAG_EEH))
return true;
 
/* PE reset or device removed ? */
-   edev = of_node_to_eeh_dev(dn);
+   edev = pdn->edev;
if (edev) {
if (edev->pe &&
(edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-   struct device_node *dn)
+static inline pnv_pci_cfg_check(struct pci_dn *pdn)
 {

[PATCH V14 17/21] powerpc/powernv: Shift VF resource with an offset

2015-03-19 Thread Wei Yang
On the PowerNV platform, a resource's position in the M64 BAR implies the PE#
it belongs to.  In some cases, a resource must be adjusted so that it lands
at the correct position in the M64 BAR.

This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR
address according to an offset.

Note:

After doing so, there is a hole in /proc/iomem when the offset is a
positive value.  It looks as if the device returned some MMIO space to
the system, but nobody can actually use that space.

[bhelgaas: rework loops, rework overlap check, index resource[]
conventionally, remove pci_regs.h include, squashed with next patch]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pci-bridge.h |4 +
 arch/powerpc/kernel/pci_dn.c  |   13 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  524 -
 arch/powerpc/platforms/powernv/pci.c  |   18 +
 arch/powerpc/platforms/powernv/pci.h  |7 +
 5 files changed, 549 insertions(+), 17 deletions(-)
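
Note (illustration only, not part of the patch): a minimal sketch of what
shifting an IOV BAR by an offset means.  The body of
pnv_pci_vf_resource_shift() is not reproduced here; this sketch assumes
the shift moves the start of the window by (per-VF size * offset) and
leaves the end in place, which is consistent with the hole in /proc/iomem
described above.  struct toy_res and toy_res_shift() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct resource; 'end' is inclusive. */
struct toy_res {
	uint64_t start;
	uint64_t end;
};

/*
 * Move the start of an IOV BAR window by 'offset' VF-sized slots so
 * that VF0 lands in the slot matching its PE#.  A negative offset
 * undoes an earlier shift.  Returns 0 on success, -1 if the shifted
 * start would fall outside the window (nothing is changed then).
 */
static int toy_res_shift(struct toy_res *res, uint64_t vf_size, int offset)
{
	uint64_t delta = (uint64_t)((int64_t)vf_size * offset);
	uint64_t new_start = res->start + delta;

	if (new_start > res->end)
		return -1;

	res->start = new_start;
	return 0;
}
```

With a 16M window, 1M per VF and offset 4, the first 4M of the original
window is no longer covered by the resource: that is the hole.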

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d0d1718..3c95097 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -177,6 +177,10 @@ struct pci_dn {
int pe_number;
 #ifdef CONFIG_PCI_IOV
u16 vfs_expanded;   /* number of VFs IOV BAR expanded */
+   u16 num_vfs;        /* number of VFs enabled */
+   int offset;         /* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+   int m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index f3a1a81..93ed7b3 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -217,6 +217,19 @@ void remove_dev_pci_info(struct pci_dev *pdev)
struct pci_dn *pdn, *tmp;
int i;
 
+   /*
+* VF and VF PE are created/released dynamically, so we need to
+* bind/unbind them.  Otherwise the VF and VF PE would be mismatched
+* when re-enabling SR-IOV.
+*/
+   if (pdev->is_virtfn) {
+   pdn = pci_get_pdn(pdev);
+#ifdef CONFIG_PPC_POWERNV
+   pdn->pe_number = IODA_INVALID_PE;
+#endif
+   return;
+   }
+
/* Only support IOV PF for now */
if (!pdev->is_physfn)
return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 93ec16c..bd1b678 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -44,6 +44,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE   ((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
 {
@@ -56,11 +59,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
vaf.fmt = fmt;
vaf.va = &args;
 
-   if (pe->pdev)
+   if (pe->flags & PNV_IODA_PE_DEV)
strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-   else
+   else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
sprintf(pfix, "%04x:%02x     ",
pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+   else if (pe->flags & PNV_IODA_PE_VF)
+   sprintf(pfix, "%04x:%02x:%2x.%d",
+   pci_domain_nr(pe->parent_dev->bus),
+   (pe->rid & 0xff00) >> 8,
+   PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV */
 
printk("%spci %s: [PE# %.3d] %pV",
   level, pfix, pe->pe_number, &vaf);
@@ -591,7 +601,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
  bool is_add)
 {
struct pnv_ioda_pe *slave;
-   struct pci_dev *pdev;
+   struct pci_dev *pdev = NULL;
int ret;
 
/*
@@ -630,8 +640,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
pdev = pe->pbus->self;
-   else
+   else if (pe->flags & PNV_IODA_PE_DEV)
pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+   else if (pe->flags & PNV_IODA_PE_VF)
+   pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
while (pdev) {
struct pci_dn *pdn = pci_get_pdn(pdev);
struct pnv_ioda_pe *parent;
@@ -649,6 +663,87 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+   struct pci_dev *parent;
+   uint8_t bcomp, dcomp, fcomp;
+   int64_t rc;
+   long rid_end, rid;
+
+   /* 

[PATCH 2/4] kvm/ppc/mpic: drop unused IRQ_testbit

2015-03-19 Thread Arseny Solokha
Drop an unused static procedure that has no callers within its
translation unit. It has already been removed independently in QEMU[1]
from the OpenPIC implementation borrowed from the kernel.

[1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.html

Signed-off-by: Arseny Solokha asolo...@kb.kras.ru
Cc: Alexander Graf ag...@suse.de
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
---
 arch/powerpc/kvm/mpic.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 39b3a8f..a480d99 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -289,11 +289,6 @@ static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
-{
-   return test_bit(n_IRQ, q->queue);
-}
-
 static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
int irq = -1;
-- 
2.3.3


[PATCH 1/4] powerpc/boot: drop planetcore_set_serial_speed

2015-03-19 Thread Arseny Solokha
Drop planetcore_set_serial_speed() which had no users since its
inception in commit fec6047047fd ("[POWERPC] bootwrapper: Add PlanetCore
firmware support") in 2007.

Signed-off-by: Arseny Solokha asolo...@kb.kras.ru
---
 arch/powerpc/boot/planetcore.c | 33 -
 arch/powerpc/boot/planetcore.h |  3 ---
 2 files changed, 36 deletions(-)

diff --git a/arch/powerpc/boot/planetcore.c b/arch/powerpc/boot/planetcore.c
index 0d8558a..75117e6 100644
--- a/arch/powerpc/boot/planetcore.c
+++ b/arch/powerpc/boot/planetcore.c
@@ -131,36 +131,3 @@ void planetcore_set_stdout_path(const char *table)
 
setprop_str(chosen, "linux,stdout-path", path);
 }
-
-void planetcore_set_serial_speed(const char *table)
-{
-   void *chosen, *stdout;
-   u64 baud;
-   u32 baud32;
-   int len;
-
-   chosen = finddevice("/chosen");
-   if (!chosen)
-   return;
-
-   len = getprop(chosen, "linux,stdout-path", prop_buf, MAX_PROP_LEN);
-   if (len <= 0)
-   return;
-
-   stdout = finddevice(prop_buf);
-   if (!stdout) {
-   printf("planetcore_set_serial_speed: "
-  "Bad /chosen/linux,stdout-path.\r\n");
-
-   return;
-   }
-
-   if (!planetcore_get_decimal(table, PLANETCORE_KEY_SERIAL_BAUD,
-   &baud)) {
-   printf("planetcore_set_serial_speed: No SB tag.\r\n");
-   return;
-   }
-
-   baud32 = baud;
-   setprop(stdout, "current-speed", &baud32, 4);
-}
diff --git a/arch/powerpc/boot/planetcore.h b/arch/powerpc/boot/planetcore.h
index 0d4094f..d53c733 100644
--- a/arch/powerpc/boot/planetcore.h
+++ b/arch/powerpc/boot/planetcore.h
@@ -43,7 +43,4 @@ void planetcore_set_mac_addrs(const char *table);
  */
 void planetcore_set_stdout_path(const char *table);
 
-/* Sets the current-speed property in the serial node. */
-void planetcore_set_serial_speed(const char *table);
-
 #endif
-- 
2.3.3


[PATCH 4/4] powerpc/mpic: remove unused functions

2015-03-19 Thread Arseny Solokha
Drop unused mpic_set_clk_ratio() and mpic_set_serial_int().

Both functions are almost nine years old[1] but still have no chance
to be called even from out-of-tree modules because they both are __init
and of course aren't exported.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html

Signed-off-by: Arseny Solokha asolo...@kb.kras.ru
Cc: Jia Hongtao hongtao@freescale.com
---
 arch/powerpc/include/asm/mpic.h | 11 ---
 arch/powerpc/sysdev/mpic.c  | 25 -
 2 files changed, 36 deletions(-)

diff --git a/arch/powerpc/include/asm/mpic.h b/arch/powerpc/include/asm/mpic.h
index 754f93d..3b39c28 100644
--- a/arch/powerpc/include/asm/mpic.h
+++ b/arch/powerpc/include/asm/mpic.h
@@ -33,11 +33,6 @@
 #define MPIC_GREG_GCONF_NO_BIAS 0x10000000
 #define MPIC_GREG_GCONF_BASE_MASK   0x000fffff
 #define MPIC_GREG_GCONF_MCK 0x08000000
-#define MPIC_GREG_GLOBAL_CONF_1 0x00030
-#define MPIC_GREG_GLOBAL_CONF_1_SIE 0x08000000
-#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK  0x70000000
-#define MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)    \
-   (((r) << 28) & MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK)
 #define MPIC_GREG_VENDOR_0 0x00040
 #define MPIC_GREG_VENDOR_1 0x00050
 #define MPIC_GREG_VENDOR_2 0x00060
@@ -496,11 +491,5 @@ extern unsigned int mpic_get_coreint_irq(void);
 /* Fetch Machine Check interrupt from primary mpic */
 extern unsigned int mpic_get_mcirq(void);
 
-/* Set the EPIC clock ratio */
-void mpic_set_clk_ratio(struct mpic *mpic, u32 clock_ratio);
-
-/* Enable/Disable EPIC serial interrupt mode */
-void mpic_set_serial_int(struct mpic *mpic, int enable);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_MPIC_H */
diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index bbfbbf2..2c817a7 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -1676,31 +1676,6 @@ void __init mpic_init(struct mpic *mpic)
mpic_err_int_init(mpic, MPIC_FSL_ERR_INT);
 }
 
-void __init mpic_set_clk_ratio(struct mpic *mpic, u32 clock_ratio)
-{
-   u32 v;
-
-   v = mpic_read(mpic->gregs, MPIC_GREG_GLOBAL_CONF_1);
-   v &= ~MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK;
-   v |= MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(clock_ratio);
-   mpic_write(mpic->gregs, MPIC_GREG_GLOBAL_CONF_1, v);
-}
-
-void __init mpic_set_serial_int(struct mpic *mpic, int enable)
-{
-   unsigned long flags;
-   u32 v;
-
-   raw_spin_lock_irqsave(&mpic_lock, flags);
-   v = mpic_read(mpic->gregs, MPIC_GREG_GLOBAL_CONF_1);
-   if (enable)
-   v |= MPIC_GREG_GLOBAL_CONF_1_SIE;
-   else
-   v &= ~MPIC_GREG_GLOBAL_CONF_1_SIE;
-   mpic_write(mpic->gregs, MPIC_GREG_GLOBAL_CONF_1, v);
-   raw_spin_unlock_irqrestore(&mpic_lock, flags);
-}
-
 void mpic_irq_set_priority(unsigned int irq, unsigned int pri)
 {
struct mpic *mpic = mpic_find(irq);
-- 
2.3.3


[PATCH 3/4] powerpc/qe: drop unused ucc_slow_poll_transmitter_now

2015-03-19 Thread Arseny Solokha
Drop ucc_slow_poll_transmitter_now(), which has had no users since its
inception in 2007 in commit 986585385131 ("[POWERPC] Add QUICC
Engine (QE) infrastructure").

Signed-off-by: Arseny Solokha asolo...@kb.kras.ru
---
 arch/powerpc/include/asm/ucc_slow.h   | 13 -
 arch/powerpc/sysdev/qe_lib/ucc_slow.c |  5 -
 2 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/ucc_slow.h b/arch/powerpc/include/asm/ucc_slow.h
index c44131e..233ef5f 100644
--- a/arch/powerpc/include/asm/ucc_slow.h
+++ b/arch/powerpc/include/asm/ucc_slow.h
@@ -251,19 +251,6 @@ void ucc_slow_enable(struct ucc_slow_private * uccs, enum comm_dir mode);
  */
 void ucc_slow_disable(struct ucc_slow_private * uccs, enum comm_dir mode);
 
-/* ucc_slow_poll_transmitter_now
- * Immediately forces a poll of the transmitter for data to be sent.
- * Typically, the hardware performs a periodic poll for data that the
- * transmit routine has set up to be transmitted. In cases where
- * this polling cycle is not soon enough, this optional routine can
- * be invoked to force a poll right away, instead. Proper use for
- * each transmission for which this functionality is desired is to
- * call the transmit routine and then this routine right after.
- *
- * uccs - (In) pointer to the slow UCC structure.
- */
-void ucc_slow_poll_transmitter_now(struct ucc_slow_private * uccs);
-
 /* ucc_slow_graceful_stop_tx
  * Smoothly stops transmission on a specified slow UCC.
  *
diff --git a/arch/powerpc/sysdev/qe_lib/ucc_slow.c b/arch/powerpc/sysdev/qe_lib/ucc_slow.c
index befaf11..5f91628 100644
--- a/arch/powerpc/sysdev/qe_lib/ucc_slow.c
+++ b/arch/powerpc/sysdev/qe_lib/ucc_slow.c
@@ -43,11 +43,6 @@ u32 ucc_slow_get_qe_cr_subblock(int uccs_num)
 }
 EXPORT_SYMBOL(ucc_slow_get_qe_cr_subblock);
 
-void ucc_slow_poll_transmitter_now(struct ucc_slow_private * uccs)
-{
-   out_be16(&uccs->us_regs->utodr, UCC_SLOW_TOD);
-}
-
 void ucc_slow_graceful_stop_tx(struct ucc_slow_private * uccs)
 {
struct ucc_slow_info *us_info = uccs->us_info;
-- 
2.3.3


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
>
> > Bit more variance there than the pte checking, but runtime
> > difference is in the noise - 5m4s vs 4m54s - and profiles are
> > identical to the pte checking version.
>
> Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
> as the original !pte_write() test.
>
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
>
>  - plain 3.19 (pte_write): 4m54s
>  - 3.19 with vm_flags & VM_WRITE: 5m4s
>  - 3.19 with pte_dirty: 5m20s

*nod*

> so the pte_dirty version seems to have been a bad choice indeed.
>
> For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
> _much_ worse, but I'm wondering whether that VM_WRITE test will at
> least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOW, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

   558,632  migrate:mm_migrate_pages ( +-  6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

[PATCH V14 06/21] PCI: Index IOV resources in the conventional style

2015-03-19 Thread Wei Yang
From: Bjorn Helgaas bhelg...@google.com

Most of PCI uses res = &dev->resource[i], not res = dev->resource + i.
Use that style in iov.c also.

No functional change.

Signed-off-by: Bjorn Helgaas bhelg...@google.com
---
 drivers/pci/iov.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)
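
Note (illustration only, not part of the patch): a minimal sketch of why
this is "no functional change".  Both spellings yield the same pointer;
the toy_* types and the TOY_* constants are hypothetical stand-ins for
struct pci_dev, struct resource and PCI_IOV_RESOURCES.

```c
/* Hypothetical stand-ins; sizes are arbitrary for the illustration. */
#define TOY_IOV_RESOURCES 7
#define TOY_NUM_RESOURCES 13

struct toy_resource {
	unsigned long start, end, flags;
};

struct toy_dev {
	struct toy_resource resource[TOY_NUM_RESOURCES];
};

/* Conventional style: index the array, then take the address. */
static struct toy_resource *res_by_index(struct toy_dev *dev, int i)
{
	return &dev->resource[i + TOY_IOV_RESOURCES];
}

/* Old iov.c style: pointer arithmetic on the decayed array. */
static struct toy_resource *res_by_pointer(struct toy_dev *dev, int i)
{
	return dev->resource + TOY_IOV_RESOURCES + i;
}
```

The indexed form simply matches what the rest of the PCI core already
uses, which makes grepping for resource accesses more uniform.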

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5bca0e1..27b98c3 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -95,7 +95,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
virtfn->multifunction = 0;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   res = dev->resource + PCI_IOV_RESOURCES + i;
+   res = &dev->resource[i + PCI_IOV_RESOURCES];
if (!res->parent)
continue;
virtfn->resource[i].name = pci_name(virtfn);
@@ -212,7 +212,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
nres = 0;
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
bars |= (1 << (i + PCI_IOV_RESOURCES));
-   res = dev->resource + PCI_IOV_RESOURCES + i;
+   res = &dev->resource[i + PCI_IOV_RESOURCES];
if (res->parent)
nres++;
}
@@ -373,7 +373,7 @@ found:
 
nres = 0;
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   res = dev->resource + PCI_IOV_RESOURCES + i;
+   res = &dev->resource[i + PCI_IOV_RESOURCES];
bar64 = __pci_read_base(dev, pci_bar_unknown, res,
pos + PCI_SRIOV_BAR + i * 4);
if (!res->flags)
@@ -417,7 +417,7 @@ found:
 
 failed:
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   res = dev->resource + PCI_IOV_RESOURCES + i;
+   res = &dev->resource[i + PCI_IOV_RESOURCES];
res->flags = 0;
}
 
-- 
1.7.9.5


[PATCH V14 05/21] PCI: Keep individual VF BAR size in struct pci_sriov

2015-03-19 Thread Wei Yang
Currently we don't store the individual VF BAR size.  We calculate it when
needed by dividing the PF's IOV resource size (which contains space for
*all* the VFs) by total_VFs or by reading the BAR in the SR-IOV capability
again.

Keep the individual VF BAR size in struct pci_sriov.barsz[], add
pci_iov_resource_size() to retrieve it, and use that instead of doing the
division or reading the SR-IOV capability BAR.
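
As a rough sketch of the idea (user-space C, with `demo_` types and an illustrative resource index that are assumptions, not the kernel structures), caching the per-VF BAR size turns both the size lookup and the placement of VF number `id` into simple array and multiply operations:

```c
#include <stdint.h>

#define DEMO_IOV_RESOURCES  7	/* index of the first IOV resource; illustrative */
#define DEMO_SRIOV_NUM_BARS 6

typedef uint64_t resource_size_t;

struct demo_sriov {
	resource_size_t barsz[DEMO_SRIOV_NUM_BARS];	/* size of one VF BAR */
};

struct demo_pf {
	int is_physfn;
	struct demo_sriov *sriov;
};

/* Same shape as pci_iov_resource_size(): non-PFs report 0. */
static resource_size_t demo_iov_resource_size(const struct demo_pf *pf, int resno)
{
	if (!pf->is_physfn)
		return 0;

	return pf->sriov->barsz[resno - DEMO_IOV_RESOURCES];
}

/* Start of VF 'id' inside the PF's IOV resource, as computed in virtfn_add(). */
static resource_size_t demo_vf_bar_start(const struct demo_pf *pf, int resno,
					 resource_size_t iov_start, int id)
{
	return iov_start + demo_iov_resource_size(pf, resno) * id;
}
```

With a 4 KiB VF BAR and an IOV resource starting at 0x80000000, VF 3 would start at 0x80003000 in this model.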

[bhelgaas: rename to barsz[], simplify barsz[] index computation, remove
SR-IOV capability BAR sizing]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c   |   39 ---
 drivers/pci/pci.h   |1 +
 include/linux/pci.h |3 +++
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 05f9d97..5bca0e1 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -57,6 +57,14 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	if (!dev->is_physfn)
+		return 0;
+
+	return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -92,8 +100,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
@@ -311,7 +318,7 @@ static void sriov_disable(struct pci_dev *dev)
 
 static int sriov_init(struct pci_dev *dev, int pos)
 {
-	int i;
+	int i, bar64;
 	int rc;
 	int nres;
 	u32 pgsz;
@@ -360,29 +367,29 @@ found:
 	pgsz &= ~(pgsz - 1);
 	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
 
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov)
+		return -ENOMEM;
+
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = dev->resource + PCI_IOV_RESOURCES + i;
-		i += __pci_read_base(dev, pci_bar_unknown, res,
-				     pos + PCI_SRIOV_BAR + i * 4);
+		bar64 = __pci_read_base(dev, pci_bar_unknown, res,
+					pos + PCI_SRIOV_BAR + i * 4);
 		if (!res->flags)
 			continue;
 		if (resource_size(res) & (PAGE_SIZE - 1)) {
 			rc = -EIO;
 			goto failed;
 		}
+		iov->barsz[i] = resource_size(res);
 		res->end = res->start + resource_size(res) * total - 1;
 		dev_info(&dev->dev, "VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n",
 			 i, res, i, total);
+		i += bar64;
 		nres++;
 	}
 
-	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
-	if (!iov) {
-		rc = -ENOMEM;
-		goto failed;
-	}
-
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
@@ -414,6 +421,7 @@ failed:
 		res->flags = 0;
 	}
 
+	kfree(iov);
 	return rc;
 }
 
@@ -510,14 +518,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
  */
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
-	struct resource tmp;
-	int reg = pci_iov_resource_bar(dev, resno);
-
-	if (!reg)
-		return 0;
-
-	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
-	return resource_alignment(&tmp);
+	return pci_iov_resource_size(dev, resno);
 }
 
 /**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4091f82..5732964 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -247,6 +247,7 @@ struct pci_sriov {
struct pci_dev *dev;/* lowest numbered PF */
struct pci_dev *self;   /* this PF */
struct mutex lock;  /* lock for VF bus */
+   resource_size_t barsz[PCI_SRIOV_NUM_BARS];  /* VF BAR size */
 };
 
 #ifdef CONFIG_PCI_ATS
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 211e9da..1559658 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1675,6 +1675,7 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { 

[PATCH V14 16/21] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv

2015-03-19 Thread Wei Yang
Implement pcibios_iov_resource_alignment() on powernv platform.

On the PowerNV platform, there are three cases for the IOV BAR:
1. initial state: the IOV BAR size is a multiple of the VF BAR size
2. after expansion: the IOV BAR size is expanded to meet the M64 segment size
3. sizing stage: the IOV BAR is truncated to 0

pnv_pci_iov_resource_alignment() handles each of these three cases.
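
The decision logic can be sketched in isolation as follows. This is a user-space model under stated assumptions: the function name and its plain-integer parameters are hypothetical, standing in for the resource size, the cached VF BAR size and the `vfs_expanded` count that the real function reads from kernel structures.

```c
#include <stdint.h>

typedef uint64_t resource_size_t;

/*
 * iov_bar_size:  current size of the PF's IOV BAR (0 during the sizing stage)
 * vf_bar_size:   size of one VF BAR
 * vfs_expanded:  number of VFs the BAR was expanded to cover, or 0
 */
static resource_size_t demo_iov_alignment(resource_size_t iov_bar_size,
					  resource_size_t vf_bar_size,
					  unsigned int vfs_expanded)
{
	/* Initial or expanded state: the BAR has a size, so align to it. */
	if (iov_bar_size)
		return iov_bar_size;

	/* Sizing stage after expansion: reconstruct the expanded size. */
	if (vfs_expanded)
		return vfs_expanded * vf_bar_size;

	/* Sizing stage without expansion: one VF BAR. */
	return vf_bar_size;
}
```

For example, a truncated BAR with 4 KiB VF BARs expanded to 256 VFs would report a 1 MiB alignment in this model.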

[bhelgaas: adjust to drop align parameter, return pci_iov_resource_size()
if no ppc_md machdep_call version]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/machdep.h|1 +
 arch/powerpc/kernel/pci-common.c  |   10 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 
 3 files changed, 31 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 1d72fda..37e451f 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -252,6 +252,7 @@ struct machdep_calls {
 
 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_dev *pdev);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *, int resno);
 #endif /* CONFIG_PCI_IOV */
 
/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 375bf70..9a306ff 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -130,6 +130,16 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
pci_reset_secondary_bus(dev);
 }
 
+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int resno)
+{
+   if (ppc_md.pcibios_iov_resource_alignment)
+   return ppc_md.pcibios_iov_resource_alignment(pdev, resno);
+
+   return pci_iov_resource_size(pdev, resno);
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cadd3fb..93ec16c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1965,6 +1965,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+						      int resno)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t align, iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	align = pci_iov_resource_size(pdev, resno);
+	if (pdn->vfs_expanded)
+		return pdn->vfs_expanded * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2167,6 +2186,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_iov_resources;
+   ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
-- 
1.7.9.5


[PATCH v3 0/4] powerpc: trivial unused functions cleanup

2015-03-19 Thread Arseny Solokha
This series removes unused functions from powerpc tree that I've been able
to discover.

Two machines at hand, e300 and e500 based, boot and run without regressions
on my workload with this series applied. The removed code also seems to have
been rarely touched, so the series seems safe at least in general. But I
obviously can't make any strong argument in support of the series, so it's
completely OK to leave things as is.

v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from
arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao
(powerpc/85xx: workaround for chips with MSI hardware errata) makes
use of it.

v2: Added a brief explanation to each patch description of why removed
functions are unused, as suggested by Michael Ellerman.

Arseny Solokha (4):
  powerpc/boot: drop planetcore_set_serial_speed
  kvm/ppc/mpic: drop unused IRQ_testbit
  powrepc/qe: drop unused ucc_slow_poll_transmitter_now
  powerpc/mpic: remove unused functions

 arch/powerpc/boot/planetcore.c| 33 -
 arch/powerpc/boot/planetcore.h|  3 ---
 arch/powerpc/include/asm/mpic.h   | 11 ---
 arch/powerpc/include/asm/ucc_slow.h   | 13 -
 arch/powerpc/kvm/mpic.c   |  5 -
 arch/powerpc/sysdev/mpic.c| 25 -
 arch/powerpc/sysdev/qe_lib/ucc_slow.c |  5 -
 7 files changed, 95 deletions(-)

-- 
2.3.3


Re: [PATCH v3 0/4] powerpc: trivial unused functions cleanup

2015-03-19 Thread Arseny Solokha
> On Fri, 2015-03-20 at 10:56 +0700, Arseny Solokha wrote:
>> This series removes unused functions from powerpc tree that I've been able
>> to discover.
>>
>> Two machines at hand, e300 and e500 based, boot and run without regressions
>> on my workload with this series applied. The removed code also seems to have
>> been rarely touched, so the series seems safe at least in general. But I
>> obviously can't make any strong argument in support of the series, so it's
>> completely OK to leave things as is.
>>
>> v3: In patch 4/4, do not remove fsl_mpic_primary_get_version() from
>> arch/powerpc/sysdev/mpic.c because the patch by Jia Hongtao
>> (powerpc/85xx: workaround for chips with MSI hardware errata) makes
>> use of it.

> Sorry, too late.

> https://git.kernel.org/cgit/linux/kernel/git/mpe/linux.git/commit/?h=next&id=5e86bfde9cd93f272844c3ff6ac5f93d3666b3e7


> The patch that needs it can just add it back.

I failed to notice that the series had finally been committed, so I resent it.
Of course Hongtao can add the removed function back if he needs to.

And by the way, while revisiting the series I've noticed that though the patch
4/4 basically reverts [1], it leaves

  #define MPIC_GREG_GLOBAL_CONF_1   0x00030

in arch/powerpc/include/asm/mpic.h untouched. That define also loses its uses
after applying the patch. Compare the following hunk in today's patch w/ the one
you committed:

  @@ -33,11 +33,6 @@
   #define  MPIC_GREG_GCONF_NO_BIAS 0x1000
   #define  MPIC_GREG_GCONF_BASE_MASK   0x000f
   #define  MPIC_GREG_GCONF_MCK 0x0800
  -#define MPIC_GREG_GLOBAL_CONF_1  0x00030
  -#define  MPIC_GREG_GLOBAL_CONF_1_SIE 0x0800
  -#define  MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK  0x7000
  -#define  MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO(r)\
  - (((r) << 28) & MPIC_GREG_GLOBAL_CONF_1_CLK_RATIO_MASK)
   #define MPIC_GREG_VENDOR_0   0x00040
   #define MPIC_GREG_VENDOR_1   0x00050
   #define MPIC_GREG_VENDOR_2   0x00060

So the question is, should #define MPIC_GREG_GLOBAL_CONF_1 also have been
removed, or can it be left as is?

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html

Arsény

> cheers

Re: [PATCH 3/3] powerpc/dma: Remove ZONE_DMA completely

2015-03-19 Thread Aneesh Kumar K.V
Benjamin Herrenschmidt b...@kernel.crashing.org writes:

> On Wed, 2015-03-18 at 19:49 +0530, Aneesh Kumar K.V wrote:
>> Why not do it also for DMA32 as below. We call limit_zone_pfn() only
>> for few platforms. They can select ZONE_DMA32. Everywhere else ZONE_DMA32 is
>> same as ZONE_NORMAL ? So instead of having confusing output for
>> show_mem(), we now have

> There may be cases where we want ZONE_DMA32, what's confusing ?

To see output like below

 Node 0 DMA32: 6*64kB (UM) 566*128kB (UEM) 367*256kB (UM) 185*512kB 
(UM)59*1024kB (UEM) 11*2048kB (UM) 4*4096kB (UM) 3*8192kB (UEM) 4016*16384kB

 Node 1 DMA32: 42*64kB (UEM) 14*128kB (UEM) 10*256kB (UM) 2*512kB (UM) 6*1024kB 
(UM) 7*2048kB (UEM) 5*4096kB (UM) 2*8192kB (U) 3644*16384kB (MR) =

 That is, finding DMA32 covering more than a 4GB range.

 -aneesh


[PATCH V14 04/21] PCI: Print PF SR-IOV resource that contains all VF(n) BAR space

2015-03-19 Thread Wei Yang
When we size VF BAR0, VF BAR1, etc., from the SR-IOV Capability of a PF, we
learn the alignment requirement and amount of space consumed by a single
VF.  But when VFs are enabled, *each* of the NumVFs consumes that amount of
space, so the total size of the PF resource is VF BAR size * NumVFs.

Add a printk of the total space consumed by the VFs corresponding to what
we already do for normal non-IOV BARs.

No functional change; new message only.

[bhelgaas: split out into its own patch]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c4c33ea..05f9d97 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -372,6 +372,8 @@ found:
 			goto failed;
 		}
 		res->end = res->start + resource_size(res) * total - 1;
+		dev_info(&dev->dev, "VF(n) BAR%d space: %pR (contains BAR%d for %d VFs)\n",
+			 i, res, i, total);
 		nres++;
 	}
 
-- 
1.7.9.5


[PATCH V14 03/21] PCI: Print more info in sriov_enable() error message

2015-03-19 Thread Wei Yang
From: Bjorn Helgaas bhelg...@google.com

If we don't have space for all the bus numbers required to enable VFs,
print the largest bus number required and the range available.
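
For readers unfamiliar with the check, the following user-space sketch shows how the bus number of the last VF might be derived and compared against the end of the available bus range. The helper names and flat parameters are hypothetical, and the routing-ID arithmetic (PF devfn plus First VF Offset plus VF Stride times the VF number, with the carry above 8 bits spilling into the bus number) is stated here as an assumption about what virtfn_bus() computes.

```c
#include <stdint.h>

/* Bus number of VF 'id', derived from the PF's position and the VF layout. */
static uint8_t demo_virtfn_bus(uint8_t pf_bus, uint8_t pf_devfn,
			       uint16_t offset, uint16_t stride, int id)
{
	return pf_bus + ((pf_devfn + offset + stride * id) >> 8);
}

/* Returns 0 if the highest-numbered VF still fits at or below bus_end, else -1. */
static int demo_check_vf_bus_range(uint8_t pf_bus, uint8_t pf_devfn,
				   uint16_t offset, uint16_t stride,
				   int nr_virtfn, uint8_t bus_end)
{
	uint8_t bus = demo_virtfn_bus(pf_bus, pf_devfn, offset, stride,
				      nr_virtfn - 1);

	return bus > bus_end ? -1 : 0;
}
```

Only the last VF needs checking because, with a non-negative stride, it has the largest routing ID and therefore the largest bus number.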

No functional change; improved error message only.

Signed-off-by: Bjorn Helgaas bhelg...@google.com
---
 drivers/pci/iov.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4b3a4ea..c4c33ea 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -180,6 +180,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
+	u8 bus;
 
 	if (!nr_virtfn)
 		return 0;
@@ -216,8 +217,10 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
-		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
+	bus = virtfn_bus(dev, nr_virtfn - 1);
+	if (bus > dev->bus->busn_res.end) {
+		dev_err(&dev->dev, "can't enable %d VFs (bus %02x out of range of %pR)\n",
+			nr_virtfn, bus, &dev->bus->busn_res);
 		return -ENOMEM;
 	}
 
-- 
1.7.9.5


[PATCH V14 12/21] PCI: Consider additional PF's IOV BAR alignment in sizing and assigning

2015-03-19 Thread Wei Yang
When sizing and assigning resources, we divide the resources into two
lists: the requested list and the additional list.  We don't consider the
alignment of additional VF(n) BAR space.

This is because the alignment required for the VF(n) BAR space is the size
of an individual VF BAR, not the size of the space for *all* VFs.  But we
want additional alignment to support partitioning on PowerNV.

Consider the additional IOV BAR alignment when sizing and assigning
resources.  When there is not enough system MMIO space to accommodate both
the requested list and the additional list, the PF's IOV BAR alignment will
not contribute to the bridge. When there is enough system MMIO space for
both lists, the additional alignment will contribute to the bridge.

The additional alignment is stored in the min_align of pci_dev_resource,
which is stored in the additional list by add_to_list() at the end of
pbus_size_mem(). The additional alignment is calculated in
pci_resource_alignment().  For an IOV BAR, an arch-dependent function
supplies the alignment, so each architecture can provide its own.
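
The lookup pattern this patch introduces (one search helper, two thin accessors that fall back to 0) can be sketched in user space as below. The `demo_` structures and the singly linked list are simplifications assumed for illustration; the kernel code uses struct pci_dev_resource on a list_head.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

struct demo_resource { resource_size_t start, end; };

/* Simplified stand-in for struct pci_dev_resource on the additional list. */
struct demo_dev_res {
	struct demo_resource *res;
	resource_size_t add_size;	/* optional extra size */
	resource_size_t min_align;	/* additional alignment */
	struct demo_dev_res *next;
};

/* Find the list entry describing 'res', or NULL if it is not listed. */
static struct demo_dev_res *demo_res_to_dev_res(struct demo_dev_res *head,
						struct demo_resource *res)
{
	for (; head; head = head->next)
		if (head->res == res)
			return head;

	return NULL;
}

static resource_size_t demo_get_res_add_size(struct demo_dev_res *head,
					     struct demo_resource *res)
{
	struct demo_dev_res *dev_res = demo_res_to_dev_res(head, res);

	return dev_res ? dev_res->add_size : 0;
}

static resource_size_t demo_get_res_add_align(struct demo_dev_res *head,
					      struct demo_resource *res)
{
	struct demo_dev_res *dev_res = demo_res_to_dev_res(head, res);

	return dev_res ? dev_res->min_align : 0;
}
```

Returning 0 for resources that are not on the list lets callers treat "no additional requirement" and "not listed" identically.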

[bhelgaas: changelog, printk cast]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/setup-bus.c |   95 +++
 1 file changed, 79 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e3e17f3..6603d40 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
}
 }
 
-static resource_size_t get_res_add_size(struct list_head *head,
-   struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+  struct resource *res)
 {
struct pci_dev_resource *dev_res;
 
@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];
 
 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);
 
-			return dev_res->add_size;
+			return dev_res;
 		}
 	}
 
-	return 0;
+	return NULL;
 }
 
+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+					 struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
+}
+
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -215,7 +235,7 @@ static void reassign_resources_sorted(struct list_head *realloc_head,
 	struct resource *res;
 	struct pci_dev_resource *add_res, *tmp;
 	struct pci_dev_resource *dev_res;
-	resource_size_t add_size;
+	resource_size_t add_size, align;
 	int idx;
 
 	list_for_each_entry_safe(add_res, tmp, realloc_head, list) {
@@ -238,13 +258,13 @@ static void reassign_resources_sorted(struct list_head *realloc_head,
 
 		idx = res - &add_res->dev->resource[0];
 		add_size = add_res->add_size;
+		align = add_res->min_align;
 		if (!resource_size(res)) {
-			res->start = add_res->start;
+			res->start = align;
 			res->end = res->start + add_size - 1;
 			if (pci_assign_resource(add_res->dev, idx))
 				reset_resource(res);
 		} else {
-			resource_size_t align = add_res->min_align;
 			res->flags |= add_res->flags &
 				 (IORESOURCE_STARTALIGN|IORESOURCE_SIZEALIGN);
 			if (pci_reassign_resource(add_res->dev, idx,
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
LIST_HEAD(save_head);
LIST_HEAD(local_fail_head);
struct pci_dev_resource *save_res;
-   struct pci_dev_resource *dev_res, *tmp_res;
+   struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
unsigned long fail_type;
+   resource_size_t add_align, align;
 
/* Check if optional add_size is there */

[PATCH V14 11/21] PCI: Add pcibios_iov_resource_alignment() interface

2015-03-19 Thread Wei Yang
Per the SR-IOV spec r1.1, sec 3.3.14, the required alignment of a PF's IOV
BAR is the size of an individual VF BAR, and the size consumed is the
individual VF BAR size times NumVFs.

The PowerNV platform has additional alignment requirements to help support
its Partitionable Endpoint device isolation feature (see
Documentation/powerpc/pci_iov_resource_on_powernv.txt).

Add a pcibios_iov_resource_alignment() interface to allow platforms to
request additional alignment.
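
The override mechanism relies on weak symbols. A minimal sketch, assuming a GCC- or Clang-compatible toolchain and using hypothetical `demo_` names rather than the kernel's `__weak` macro, looks like this:

```c
#include <stdint.h>

typedef uint64_t resource_size_t;

/*
 * Generic default: the alignment is the size of one VF BAR.  Because the
 * symbol is weak, a platform can supply a normal (strong) definition of the
 * same function in another object file and the linker will pick that one.
 */
__attribute__((weak))
resource_size_t demo_iov_resource_alignment(resource_size_t vf_bar_size)
{
	return vf_bar_size;
}

/* Core code always calls through the overridable hook. */
resource_size_t demo_sriov_resource_alignment(resource_size_t vf_bar_size)
{
	return demo_iov_resource_alignment(vf_bar_size);
}
```

When no platform override is linked in, the core helper simply returns the VF BAR size, which matches the behaviour before this patch.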

[bhelgaas: changelog, adapt to reworked pci_sriov_resource_alignment(),
drop align parameter]
Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
---
 drivers/pci/iov.c   |8 +++-
 include/linux/pci.h |1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 64c4692..ee0ebff 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -569,6 +569,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
4 * (resno - PCI_IOV_RESOURCES);
 }
 
+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+ int resno)
+{
+   return pci_iov_resource_size(dev, resno);
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -581,7 +587,7 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
  */
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
-   return pci_iov_resource_size(dev, resno);
+   return pcibios_iov_resource_alignment(dev, resno);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 99ea948..4e1f17d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1174,6 +1174,7 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
-- 
1.7.9.5


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:

> My recollection wasn't faulty - I pulled it from an earlier email.
> That said, the original measurement might have been faulty. I ran
> the numbers again on the 3.19 kernel I saved away from the original
> testing. That came up at 235k, which is pretty much the same as
> yesterday's test. The runtime, however, is unchanged from my original
> measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the more than an order of magnitude difference was
really about measurement differences, not quite as real. Looks like
more a factor of two than a factor of 20.

Did you do the profiles the same way? Because that would explain the
differences in the TLB flush percentages too (the 1.4% from
tlb_invalidate_range() vs pretty much everything from migration).

The runtime variation does show that there's some *big* subtle
difference for the numa balancing in the exact TNF_NO_GROUP details.
It must be *very* unstable for it to make that big of a difference.
But I feel at least a *bit* better about "unstable algorithm changes a
small variation into a factor-of-two" vs that crazy factor-of-20.

Can you try Mel's change to make it use

if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so
that we have a baseline. I'd be *so* much happier with checking the vma
details over per-pte details, especially ones that change over the
lifetime of the pte entry, and the NUMA code explicitly mucks with.

   Linus

Re: [23/32] powerpc: copy_thread(): rename 'arg' argument to 'kthread_arg'

2015-03-19 Thread Michael Ellerman
On Thu, 2015-03-19 at 09:22 +0200, Alex Dowad wrote:
> On 19/03/15 08:45, Michael Ellerman wrote:
> > On Fri, 2015-13-03 at 18:14:46 UTC, Alex Dowad wrote:
> >> The 'arg' argument to copy_thread() is only ever used when forking a new
> >> kernel thread. Hence, rename it to 'kthread_arg' for clarity (and consistency
> >> with do_fork() and other arch-specific implementations of copy_thread()).

> > I don't understand the bit about consistency with do_fork() ?

> This series of patches includes one patch which renames the arg for
> do_fork(), and others which rename the same arg for each arch-specific
> implementation of copy_thread(). So if all of them are accepted and
> merged, then all will be consistent. If only some of the patches are
> accepted, I will rewrite the commit message so it doesn't mention
> consistency.

Ah OK, I only got patch 23, so I missed the context of the whole series.

I'll apply this one to the powerpc tree.

cheers



[PATCH v2 4/5] hwmon: (ibmpowernv) change create_hwmon_attr_name() prototype

2015-03-19 Thread Cédric Le Goater
It simplifies the creation of the hwmon attributes and will help when
support for a new device tree layout is added. The patch also changes
the name of the routine to parse_opal_node_name().
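
For readers unfamiliar with the error-pointer convention used by the new prototype, here is a user-space sketch. The `demo_` helpers are simplified stand-ins written for this note, not the kernel's ERR_PTR/IS_ERR/PTR_ERR macros, and the suffix mapping is illustrative only. The idea is that small negative errno values are encoded in the top of the pointer range, so one return value can carry either a string or an error.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *demo_err_ptr(long err)
{
	return (void *)err;
}

/* Recover the errno from an error pointer. */
static inline long demo_ptr_err(const void *p)
{
	return (long)p;
}

/* True (1) if the pointer lies in the reserved error range, else 0. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* Hypothetical parser: returns an attribute name or an encoded errno. */
static const char *demo_parse_suffix(const char *suffix)
{
	if (!suffix)
		return demo_err_ptr(-EINVAL);
	if (!strcmp(suffix, "data"))
		return "input";
	if (!strcmp(suffix, "thrs"))
		return "max";

	return demo_err_ptr(-ENOENT);
}
```

Note that the is-error helper only yields a boolean; the actual error code has to be recovered with the pointer-to-errno helper.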

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---

Changes since v1:

 - changed returned value of parse_opal_node_name()
 - used *_PTR macros to check for errors

 drivers/hwmon/ibmpowernv.c |   37 -
 1 file changed, 20 insertions(+), 17 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -152,29 +152,22 @@ static const char *convert_opal_attr_nam
  * which need to be mapped as fan2_input, temp1_max respectively before
  * populating them inside hwmon device class.
  */
-static int create_hwmon_attr_name(struct device *dev, enum sensors type,
-					 const char *node_name,
-					 char *hwmon_attr_name)
+static const char *parse_opal_node_name(const char *node_name,
+					enum sensors type, u32 *index)
 {
 	char attr_suffix[MAX_ATTR_LEN];
 	const char *attr_name;
-	u32 index;
 	int err;
 
-	err = get_sensor_index_attr(node_name, &index, attr_suffix);
-	if (err) {
-		dev_err(dev, "Sensor device node name '%s' is invalid\n",
-			node_name);
-		return err;
-	}
+	err = get_sensor_index_attr(node_name, index, attr_suffix);
+	if (err)
+		return ERR_PTR(err);
 
 	attr_name = convert_opal_attr_name(type, attr_suffix);
 	if (!attr_name)
-		return -ENOENT;
+		return ERR_PTR(-ENOENT);
 
-	snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
-		 sensor_groups[type].name, index, attr_name);
-	return 0;
+	return attr_name;
 }
 
 static int get_sensor_type(struct device_node *np)
@@ -249,6 +242,9 @@ static int create_device_attrs(struct pl
 	}
 
 	for_each_child_of_node(opal, np) {
+		const char *attr_name;
+		u32 opal_index;
+
 		if (np->name == NULL)
 			continue;
 
@@ -265,10 +261,17 @@ static int create_device_attrs(struct pl
 
 		sdata[count].id = sensor_id;
 		sdata[count].type = type;
-		err = create_hwmon_attr_name(&pdev->dev, type, np->name,
-					 sdata[count].name);
-		if (err)
+
+		attr_name = parse_opal_node_name(np->name, type, &opal_index);
+		if (IS_ERR(attr_name)) {
+			dev_err(&pdev->dev, "Sensor device node name '%s' is invalid\n",
+				np->name);
+			err = PTR_ERR(attr_name);
 			goto exit_put_node;
+		}
+
+		snprintf(sdata[count].name, MAX_ATTR_LEN, "%s%d_%s",
+			 sensor_groups[type].name, opal_index, attr_name);
 
 		sysfs_attr_init(&sdata[count].dev_attr.attr);
 		sdata[count].dev_attr.attr.name = sdata[count].name;


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Mel Gorman
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
>   - something completely different that I am entirely missing
 
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.
 

Minimally, there is still the window where we clear the PTE to set the
protections. During that window, a fault can occur. In the old code which
was inherently racy and unsafe, the fault might still go ahead deferring
a potential migration for a short period. In the current code, it'll stall
on the lock, notice the PTE is changed and refault so the overhead is very
different but functionally correct.

In the old code, pte_write had complex interactions with background
cleaning and sync in the case of file mappings (not applicable to Dave's
case but still it's unpredictable behaviour). pte_dirty is close but there
are interactions with the application as the timing of writes vs the PTE
scanner matter.

Even if we restored the original behaviour, it would still be very difficult
to understand all the interactions between userspace and kernel.  The patch
below should be tested because it's clearer what the intent is. Using
the VMA flags is coarse but it's not vulnerable to timing artifacts that
behave differently depending on the machine. My preliminary testing shows
it helps but not by much. It does not restore performance to where it was
but it's easier to understand which is important if there are changes in
the scheduler later.
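
Reduced to its essentials, the VMA-based check amounts to the following user-space sketch. The `DEMO_` constants and the function are hypothetical stand-ins with illustrative bit values; only the shape of the decision (mapping not writable implies no grouping) is taken from the discussion.

```c
/* Illustrative bit values; the kernel's VM_WRITE and TNF_NO_GROUP may differ. */
#define DEMO_VM_READ		0x1ul
#define DEMO_VM_WRITE		0x2ul
#define DEMO_TNF_NO_GROUP	0x02

/*
 * Decide NUMA fault flags from the mapping's permissions rather than from
 * per-PTE state: a mapping that cannot be written is treated as read-only
 * data and excluded from task grouping.
 */
static int demo_numa_fault_flags(unsigned long vm_flags, int flags)
{
	if (!(vm_flags & DEMO_VM_WRITE))
		flags |= DEMO_TNF_NO_GROUP;

	return flags;
}
```

Because the VMA flags do not change while the PTE scanner, writeback and the application race with each other, the result is the same no matter when the fault is taken.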

In combination, I also think that slowing PTE scanning when migration fails
is the correct action even if it is unrelated to the patch Dave bisected
to. It's stupid to increase scanning rates and incur more faults when
migrations are failing so I'll be testing that next.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		flags |= TNF_FAULT_LOCAL;
 	}
 
-	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
-	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
-	 */
-	if (!pmd_dirty(pmd))
+	/* See similar comment in do_numa_page for explanation */
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
+	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+	 * much anyway since they can be in shared cache state. This misses
+	 * the case where a mapping is writable but the process never writes
+	 * to it but pte_write gets cleared during protection updates and
+	 * pte_dirty has unpredictable behaviour between PTE scan updates,
+	 * background writeback, dirty balancing and application behaviour.
 	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
+	 * TODO: Note that the ideal here would be to avoid a situation where a
+	 * NUMA fault is taken immediately followed by a write fault in
+	 * some cases which would have lower overhead overall but would be
+	 * invasive as the fault paths would need to be unified.
 	 */
-	if (!pte_dirty(pte))
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
/*

[PATCH v2 0/5] hwmon: (ibmpowernv) remove dependency on OPAL index

2015-03-19 Thread Cédric Le Goater
Hello !

The current implementation of the driver uses an index for the hwmon 
attribute which is extracted from the device node name. This index
is calculated by the OPAL firmware and its usage creates a dependency 
with the driver which makes changes a little more complex in OPAL.

This patchset changes the ibmpowernv code to use its own index. It 
starts with a few cleanups, mostly code shuffling around the creation 
of the hwmon sysfs attributes and completes by removing the dependency.

It also prepares ground for future OPAL changes :  

   https://lists.ozlabs.org/pipermail/skiboot/2015-March/000639.html

which will be addressed in another small patchset.


The patches are based on Linux 4.0.0-rc4 and were tested on IBM Power 
and Open Power systems running Trusty. 

Cheers,

C.


Changes since v1:

 - fixed alignment
 - killed a couple of useless return NULL
 - changed returned value of parse_opal_node_name()
 - used *_PTR macros to check for errors

Cédric Le Goater (5):
  hwmon: (ibmpowernv) replace AMBIENT_TEMP by TEMP
  hwmon: (ibmpowernv) add a get_sensor_type() routine
  hwmon: (ibmpowernv) add a convert_opal_attr_name() routine
  hwmon: (ibmpowernv) change create_hwmon_attr_name() prototype
  hwmon: (ibmpowernv) do not use the OPAL index for hwmon attribute
names

 drivers/hwmon/ibmpowernv.c |  122 +---
 1 file changed, 81 insertions(+), 41 deletions(-)

-- 
1.7.10.4


[PATCH v2 5/5] hwmon: (ibmpowernv) do not use the OPAL index for hwmon attribute names

2015-03-19 Thread Cédric Le Goater
The current OPAL firmware exposes the different sensors of an IBM Power
system using node names such as :

sensors/amb-temp#1-data
sensors/amb-temp#1-thrs
cooling-fan#1-data
cooling-fan#1-faulted
cooling-fan#1-thrs
cooling-fan#2-data
...

The ibmpowernv driver, when loaded, parses these names to extract the
sensor index and the sensor attribute name. Unfortunately, this scheme
makes it difficult to add sensors with a different layout (especially of
the same type, like temperature) as the sensor index calculated in OPAL
is directly used in the hwmon sysfs interface.

What this patch does is add an independent hwmon index for each sensor.
The increment of the hwmon index (temp, fan, power, etc.) is kept per
sensor type in the sensor_group table. The sensor_data table is used
to store the association of the hwmon and OPAL indexes, as we need to
have the same hwmon index for different attributes of the same sensor.
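
The indexing scheme described above can be sketched in plain C, outside the kernel. This is only an illustration under simplifying assumptions, not the driver code: `struct sensor`, `assign_hwmon_index()` and `group_counter` are hypothetical names, and the sensor types are reduced to two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's tables. */
enum sensor_type { FAN, TEMP, MAX_SENSOR_TYPE };

struct sensor {
	enum sensor_type type;
	uint32_t opal_index;   /* index parsed from the OPAL node name */
	uint32_t hwmon_index;  /* index exposed in the hwmon sysfs name */
};

/* One running counter per sensor type; the first index handed out is 1. */
static uint32_t group_counter[MAX_SENSOR_TYPE];

/*
 * Assign a hwmon index to table[count], given that table[0..count-1]
 * have already been processed. Attributes that share the same type and
 * OPAL index belong to one sensor and reuse its hwmon index; otherwise
 * the per-type counter is incremented and its new value is used.
 */
static uint32_t assign_hwmon_index(struct sensor *table, int count)
{
	struct sensor *s = &table[count];
	int i;

	assert(s->type < MAX_SENSOR_TYPE);

	for (i = 0; i < count; i++) {
		if (table[i].type == s->type &&
		    table[i].opal_index == s->opal_index) {
			s->hwmon_index = table[i].hwmon_index;
			return s->hwmon_index;
		}
	}

	s->hwmon_index = ++group_counter[s->type];
	return s->hwmon_index;
}
```

In this sketch, two attributes of a temperature sensor with OPAL index 7 (say, its data and threshold nodes) both get hwmon index 1, a second temperature sensor with OPAL index 9 gets 2, and the first fan starts again at 1 because the counters are kept per type.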

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c |   23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -55,6 +55,7 @@ static struct sensor_group {
const char *compatible;
struct attribute_group group;
u32 attr_count;
+   u32 hwmon_index;
 } sensor_groups[] = {
{"fan", "ibm,opal-sensor-cooling-fan"},
{"temp", "ibm,opal-sensor-amb-temp"},
@@ -64,6 +65,8 @@ static struct sensor_group {
 
 struct sensor_data {
u32 id; /* An opaque id of the firmware for each sensor */
+   u32 hwmon_index;
+   u32 opal_index;
enum sensors type;
char name[MAX_ATTR_LEN];
struct device_attribute dev_attr;
@@ -181,6 +184,19 @@ static int get_sensor_type(struct device
return MAX_SENSOR_TYPE;
 }
 
+static u32 get_sensor_hwmon_index(struct sensor_data *sdata,
+   struct sensor_data *sdata_table, int count)
+{
+   int i;
+
+   for (i = 0; i < count; i++)
+   if (sdata_table[i].opal_index == sdata->opal_index &&
+   sdata_table[i].type == sdata->type)
+   return sdata_table[i].hwmon_index;
+
+   return ++sensor_groups[sdata->type].hwmon_index;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
struct platform_data *pdata = platform_get_drvdata(pdev);
@@ -270,8 +286,13 @@ static int create_device_attrs(struct pl
goto exit_put_node;
}
 
+   sdata[count].opal_index = opal_index;
+   sdata[count].hwmon_index =
+   get_sensor_hwmon_index(&sdata[count], sdata, count);
+
snprintf(sdata[count].name, MAX_ATTR_LEN, "%s%d_%s",
-sensor_groups[type].name, opal_index, attr_name);
+sensor_groups[type].name, sdata[count].hwmon_index,
+attr_name);
 
sysfs_attr_init(&sdata[count].dev_attr.attr);
sdata[count].dev_attr.attr.name = sdata[count].name;


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 7:10 AM, Mel Gorman mgor...@suse.de wrote:
> -   if (!pmd_dirty(pmd))
> +   /* See similar comment in do_numa_page for explanation */
> +   if (!(vma->vm_flags & VM_WRITE))

Yeah, that would certainly be a whole lot more obvious than all the
"if this particular pte/pmd looks like X" tests.

So that, together with scanning rate improvements (this *does* seem to
be somewhat chaotic, so it's quite possible that the current scanning
rate thing is just fairly unstable) is likely the right thing. I'd
just like to _understand_ why that write/dirty bit makes such a
difference. I thought I understood what was going on, and was happy,
and then Dave came with his crazy numbers.

Damn you Dave, and damn your numbers and facts and stuff. Sometimes
I much prefer ignorant bliss.

   Linus

[PATCH v2 1/5] hwmon: (ibmpowernv) replace AMBIENT_TEMP by TEMP

2015-03-19 Thread Cédric Le Goater
Ambient is too restrictive as there can be other temperature channels :
core, memory, etc.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -44,7 +44,7 @@
  */
 enum sensors {
FAN,
-   AMBIENT_TEMP,
+   TEMP,
POWER_SUPPLY,
POWER_INPUT,
MAX_SENSOR_TYPE,
@@ -87,7 +87,7 @@ static ssize_t show_sensor(struct device
return ret;
 
/* Convert temperature to milli-degrees */
-   if (sdata->type == AMBIENT_TEMP)
+   if (sdata->type == TEMP)
x *= 1000;
/* Convert power to micro-watts */
else if (sdata->type == POWER_INPUT)
@@ -154,7 +154,7 @@ static int create_hwmon_attr_name(struct
} else if (!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX)) {
attr_name = "input";
} else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
-   if (type == AMBIENT_TEMP)
+   if (type == TEMP)
attr_name = "max";
else if (type == FAN)
attr_name = "min";


[PATCH v2 3/5] hwmon: (ibmpowernv) add a convert_opal_attr_name() routine

2015-03-19 Thread Cédric Le Goater
It simplifies the create_hwmon_attr_name() routine and it clearly isolates
the conversion done between the OPAL node names and hwmon attribute names.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---

Changes since v1:

 - fixed alignment
 - killed a couple of useless return NULL

 drivers/hwmon/ibmpowernv.c |   36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -127,6 +127,25 @@ static int get_sensor_index_attr(const c
return 0;
 }
 
+static const char *convert_opal_attr_name(enum sensors type,
+ const char *opal_attr)
+{
+   const char *attr_name = NULL;
+
+   if (!strcmp(opal_attr, DT_FAULT_ATTR_SUFFIX)) {
+   attr_name = "fault";
+   } else if (!strcmp(opal_attr, DT_DATA_ATTR_SUFFIX)) {
+   attr_name = "input";
+   } else if (!strcmp(opal_attr, DT_THRESHOLD_ATTR_SUFFIX)) {
+   if (type == TEMP)
+   attr_name = "max";
+   else if (type == FAN)
+   attr_name = "min";
+   }
+
+   return attr_name;
+}
+
 /*
  * This function translates the DT node name into the 'hwmon' attribute name.
  * IBMPOWERNV device node appear like cooling-fan#2-data, amb-temp#1-thrs etc.
@@ -138,7 +157,7 @@ static int create_hwmon_attr_name(struct
 char *hwmon_attr_name)
 {
char attr_suffix[MAX_ATTR_LEN];
-   char *attr_name;
+   const char *attr_name;
u32 index;
int err;
 
@@ -149,20 +168,9 @@ static int create_hwmon_attr_name(struct
return err;
}
 
-   if (!strcmp(attr_suffix, DT_FAULT_ATTR_SUFFIX)) {
-   attr_name = "fault";
-   } else if (!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX)) {
-   attr_name = "input";
-   } else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
-   if (type == TEMP)
-   attr_name = "max";
-   else if (type == FAN)
-   attr_name = "min";
-   else
-   return -ENOENT;
-   } else {
+   attr_name = convert_opal_attr_name(type, attr_suffix);
+   if (!attr_name)
return -ENOENT;
-   }
 
snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
 sensor_groups[type].name, index, attr_name);


Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe

2015-03-19 Thread Bjorn Helgaas
On Thu, Mar 19, 2015 at 11:18 AM, Wei Yang weiyang.ker...@gmail.com wrote:
> Oh, I thought you are not comfortable with the Patch v12 10/21 PCI:
> Consider additional PF's IOV BAR alignment ...
>
> V14 is ready to send which is based on v4.0-rc1.

Unless I missed something, the last email in that thread [1] is from
you, so I think we're ready for the next iteration.

[1] 
http://lkml.kernel.org/r/20150224083406.32124.65957.st...@bhelgaas-glaptop2.roam.corp.google.com

[PATCH v2 2/5] hwmon: (ibmpowernv) add a get_sensor_type() routine

2015-03-19 Thread Cédric Le Goater
It will help in adding different compatible properties, coming from a
new device tree layout for example.

Signed-off-by: Cédric Le Goater c...@fr.ibm.com
---
 drivers/hwmon/ibmpowernv.c |   26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

Index: linux.git/drivers/hwmon/ibmpowernv.c
===
--- linux.git.orig/drivers/hwmon/ibmpowernv.c
+++ linux.git/drivers/hwmon/ibmpowernv.c
@@ -169,6 +169,17 @@ static int create_hwmon_attr_name(struct
return 0;
 }
 
+static int get_sensor_type(struct device_node *np)
+{
+   enum sensors type;
+
+   for (type = 0; type < MAX_SENSOR_TYPE; type++) {
+   if (of_device_is_compatible(np, sensor_groups[type].compatible))
+   return type;
+   }
+   return MAX_SENSOR_TYPE;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
struct platform_data *pdata = platform_get_drvdata(pdev);
@@ -181,12 +192,9 @@ static int populate_attr_groups(struct p
if (np->name == NULL)
continue;
 
-   for (type = 0; type < MAX_SENSOR_TYPE; type++)
-   if (of_device_is_compatible(np,
-   sensor_groups[type].compatible)) {
-   sensor_groups[type].attr_count++;
-   break;
-   }
+   type = get_sensor_type(np);
+   if (type != MAX_SENSOR_TYPE)
+   sensor_groups[type].attr_count++;
}
 
of_node_put(opal);
@@ -236,11 +244,7 @@ static int create_device_attrs(struct pl
if (np->name == NULL)
continue;
 
-   for (type = 0; type < MAX_SENSOR_TYPE; type++)
-   if (of_device_is_compatible(np,
-   sensor_groups[type].compatible))
-   break;
-
+   type = get_sensor_type(np);
if (type == MAX_SENSOR_TYPE)
continue;
 


Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe

2015-03-19 Thread Bjorn Helgaas
On Thu, Mar 12, 2015 at 09:15:17AM +0800, Wei Yang wrote:
 On Wed, Mar 11, 2015 at 08:55:07AM -0500, Bjorn Helgaas wrote:
 On Wed, Mar 04, 2015 at 01:19:07PM +0800, Wei Yang wrote:
  On PHB3, PF IOV BAR will be covered by M64 window to have better PE
  isolation.  The total_pe number is usually different from total_VFs, which
  can lead to a conflict between MMIO space and the PE number.
  
  For example, if total_VFs is 128 and total_pe is 256, the second half of
  M64 window will be part of other PCI device, which may already belong
  to other PEs.
  
  Prevent the conflict by reserving additional space for the PF IOV BAR,
  which is total_pe number of VF's BAR size.
  
  [bhelgaas: make dev_printk() output more consistent, index resource[]
  conventionally]
  Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
  ---
   arch/powerpc/include/asm/machdep.h|4 ++
   arch/powerpc/include/asm/pci-bridge.h |3 ++
   arch/powerpc/kernel/pci-common.c  |5 +++
   arch/powerpc/kernel/pci-hotplug.c |4 ++
   arch/powerpc/platforms/powernv/pci-ioda.c |   61 
  +
   5 files changed, 77 insertions(+)
  
  diff --git a/arch/powerpc/include/asm/machdep.h 
  b/arch/powerpc/include/asm/machdep.h
  index c8175a3..965547c 100644
  --- a/arch/powerpc/include/asm/machdep.h
  +++ b/arch/powerpc/include/asm/machdep.h
  @@ -250,6 +250,10 @@ struct machdep_calls {
 /* Reset the secondary bus of bridge */
 void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
   
  +#ifdef CONFIG_PCI_IOV
  +  void (*pcibios_fixup_sriov)(struct pci_bus *bus);
  +#endif /* CONFIG_PCI_IOV */
  +
 /* Called to shutdown machine specific hardware not already controlled
  * by other drivers.
  */
  diff --git a/arch/powerpc/include/asm/pci-bridge.h 
  b/arch/powerpc/include/asm/pci-bridge.h
  index 513f8f2..de11de7 100644
  --- a/arch/powerpc/include/asm/pci-bridge.h
  +++ b/arch/powerpc/include/asm/pci-bridge.h
  @@ -175,6 +175,9 @@ struct pci_dn {
   #define IODA_INVALID_PE   (-1)
   #ifdef CONFIG_PPC_POWERNV
 int pe_number;
  +#ifdef CONFIG_PCI_IOV
  +  u16 max_vfs;/* number of VFs IOV BAR expended */
  +#endif /* CONFIG_PCI_IOV */
   #endif
 struct list_head child_list;
 struct list_head list;
  diff --git a/arch/powerpc/kernel/pci-common.c 
  b/arch/powerpc/kernel/pci-common.c
  index 8203101..022e9fe 100644
  --- a/arch/powerpc/kernel/pci-common.c
  +++ b/arch/powerpc/kernel/pci-common.c
  @@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
 if (ppc_md.pcibios_fixup_phb)
 ppc_md.pcibios_fixup_phb(hose);
   
  +#ifdef CONFIG_PCI_IOV
  +  if (ppc_md.pcibios_fixup_sriov)
  +  ppc_md.pcibios_fixup_sriov(bus);
  +#endif /* CONFIG_PCI_IOV */
 
 Here, and ...
 
  +
 /* Configure PCI Express settings */
    if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 struct pci_bus *child;
  diff --git a/arch/powerpc/kernel/pci-hotplug.c 
  b/arch/powerpc/kernel/pci-hotplug.c
  index 5b78917..7d238ae 100644
  --- a/arch/powerpc/kernel/pci-hotplug.c
  +++ b/arch/powerpc/kernel/pci-hotplug.c
  @@ -94,6 +94,10 @@ void pcibios_add_pci_devices(struct pci_bus * bus)
  */
     slotno = PCI_SLOT(PCI_DN(dn->child)->devfn);
 pci_scan_slot(bus, PCI_DEVFN(slotno, 0));
  +#ifdef CONFIG_PCI_IOV
  +  if (ppc_md.pcibios_fixup_sriov)
  +  ppc_md.pcibios_fixup_sriov(bus);
  +#endif /* CONFIG_PCI_IOV */
 
 here, you have the same code.  It's good that we now do it for hot-added
 devices as well as those present at boot.  But it's bad that it happens in
 two different paths.
 
 Isn't there some way we can unify this so the same path is used for the
 initial pcibios_scan_phb() and also the hot-add case?  Maybe call
 pcibios_fixup_sriov() from pcibios_add_device()?
 
 
 
 This is a very good suggestion. I have changed this and works fine.

I was expecting a v14 series with this change.  Is it coming, or are you
waiting for something else from me?

Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

2015-03-19 Thread Kim Phillips
On Thu, 19 Mar 2015 17:56:57 +0200
Horia Geantă horia.gea...@freescale.com wrote:

 On 3/18/2015 12:03 AM, Kim Phillips wrote:
  On Tue, 17 Mar 2015 19:58:55 +0200
  Horia Geantă horia.gea...@freescale.com wrote:
  
  On 3/17/2015 2:19 AM, Kim Phillips wrote:
  On Mon, 16 Mar 2015 12:02:51 +0200
  Horia Geantă horia.gea...@freescale.com wrote:
 
  On 3/4/2015 2:23 AM, Kim Phillips wrote:
  Only potential problem is getting the crypto API to set the GFP_DMA
  flag in the allocation request, but presumably a
  CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
 
  Seems there are quite a few places that do not use the
  {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
  Among them, IPsec and dm-crypt.
  I've looked at the code and I don't think it can be converted to use
  crypto API.
 
  why not?
 
  It would imply having 2 memory allocations, one for crypto request and
  the other for the rest of the data bundled with the request (for IPsec
  that would be ESN + space for IV + sg entries for authenticated-only
  data and sk_buff extension, if needed).
 
  Trying to have a single allocation by making ESN, IV etc. part of the
  request private context requires modifying tfm.reqsize on the fly.
  This won't work without adding some kind of locking for the tfm.
  
  can't a common minimum tfm.reqsize be co-established up front, at
  least for the fast path?
 
 Indeed, for IPsec at tfm allocation time - esp_init_state() -
 tfm.reqsize could be increased to account for what is known for a given
 flow: ESN, IV and asg (S/G entries for authenticated-only data).
 The layout would be:
 aead request (fixed part)
 private ctx of backend algorithm
 seq_no_hi (if ESN)
 IV
 asg
 sg -- S/G table for skb_to_sgvec; how many entries is the question
 
 Do you have a suggestion for how many S/G entries to preallocate for
 representing the sk_buff data to be encrypted?
 An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
 Btw, currently maximum number of fragments supported by the net stack
 (MAX_SKB_FRAGS) is 16 or more.
 
  This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
  places. Some of the maintainers do not agree, as you've seen.
 
  would modifying the crypto API to either have a different
  *_request_alloc() API, and/or adding calls to negotiate the GFP mask
  between crypto users and drivers, e.g., get/set_gfp_mask, work?
 
  I think what DaveM asked for was the change to be transparent.
 
  Besides converting to *_request_alloc(), seems that all other options
  require some extra awareness from the user.
  Could you elaborate on the idea above?
  
  was merely suggesting communicating GFP flags anonymously across the
  API, i.e., GFP_DMA wouldn't appear in user code.
 
 Meaning user would have to get_gfp_mask before allocating a crypto
 request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
 kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?
 
  An alternative would be for talitos to use the page allocator to get 1 /
  2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
  = 8 kB), dma_map_page the area and manage it internally for talitos_desc
  hw descriptors.
  What do you think?
 
  There's a comment in esp_alloc_tmp(): Use spare space in skb for
  this where possible, which is ideally where we'd want to be (esp.
 
  Ok, I'll check that. But note the where possible - finding room in the
  skb to avoid the allocation won't always be the case, and then we're
  back to square one.
 
 So the skb cb is out of the question, being too small (48B).
 Any idea what was the intention of the TODO - maybe to use the
 tailroom in the skb data area?
 
  because that memory could already be DMA-able).  Your above
  suggestion would be in the opposite direction of that.
 
  The proposal:
  -removes dma (un)mapping on the fast path
  
  sure, but at the expense of additional complexity.
 
 Right, there's no free lunch. But it's cheaper.
 
  -avoids requesting dma mappable memory for more than it's actually
  needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
  only its private context)
  
  compared to the payload?  Plus, we have plenty of DMA space these
  days.
  
  -for caam it has the added benefit of speeding the below search for the
  offending descriptor in the SW ring from O(n) to O(1):
  for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
 sw_idx = (tail + i) & (JOBR_DEPTH - 1);
 
 if (jrp->outring[hw_idx].desc ==
 jrp->entinfo[sw_idx].desc_addr_dma)
 break; /* found */
  }
  (drivers/crypto/caam/jr.c - caam_dequeue)
  
  how?  The job ring h/w will still be spitting things out
  out-of-order.
 
 jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
 O(1):
 
 dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
 [...]
 sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;
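
The address arithmetic above can be illustrated with a small standalone C sketch. It assumes that descriptors are packed contiguously from a base bus address in slots of JD_SIZE (taken here as 64) bytes, and it takes the offset as descriptor address minus base, which is an assumption about the intended direction of the subtraction. `desc_to_index()` and `POOL_SLOTS` are hypothetical names chosen for illustration and are not part of the caam driver.

```c
#include <stdint.h>

#define JD_SIZE     64u   /* bytes per descriptor slot (16 words) */
#define POOL_SLOTS  128u  /* arbitrary illustrative ring depth */

/*
 * Map a descriptor's bus address back to its slot number in O(1),
 * instead of scanning the software ring for a matching address.
 * Returns -1 if the address lies outside the pool or is not aligned
 * to a slot boundary.
 */
static long desc_to_index(uint64_t desc_base, uint64_t desc_addr)
{
	uint64_t off;

	if (desc_addr < desc_base)
		return -1;

	off = desc_addr - desc_base;
	if (off % JD_SIZE != 0 || off / JD_SIZE >= POOL_SLOTS)
		return -1;

	return (long)(off / JD_SIZE);
}
```

The point of the sketch is only that a fixed slot size turns the lookup into one subtraction and one division; the range and alignment checks are an addition for safety, not something proposed in the thread.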
 
 JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
 descriptor, 3 words can be used for 

Re: [PATCH V13 15/21] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe

2015-03-19 Thread Wei Yang
Oh, I thought you are not comfortable with the Patch v12 10/21 PCI:
Consider additional PF's IOV BAR alignment ...

V14 is ready to send which is based on v4.0-rc1.

2015-03-19 23:08 GMT+08:00 Bjorn Helgaas bhelg...@google.com:

 On Thu, Mar 12, 2015 at 09:15:17AM +0800, Wei Yang wrote:
  On Wed, Mar 11, 2015 at 08:55:07AM -0500, Bjorn Helgaas wrote:
  On Wed, Mar 04, 2015 at 01:19:07PM +0800, Wei Yang wrote:
   On PHB3, PF IOV BAR will be covered by M64 window to have better PE
   isolation.  The total_pe number is usually different from total_VFs,
 which
   can lead to a conflict between MMIO space and the PE number.
  
   For example, if total_VFs is 128 and total_pe is 256, the second half
 of
   M64 window will be part of other PCI device, which may already belong
   to other PEs.
  
   Prevent the conflict by reserving additional space for the PF IOV BAR,
   which is total_pe number of VF's BAR size.
  
   [bhelgaas: make dev_printk() output more consistent, index resource[]
   conventionally]
   Signed-off-by: Wei Yang weiy...@linux.vnet.ibm.com
   ---
arch/powerpc/include/asm/machdep.h|4 ++
arch/powerpc/include/asm/pci-bridge.h |3 ++
arch/powerpc/kernel/pci-common.c  |5 +++
arch/powerpc/kernel/pci-hotplug.c |4 ++
arch/powerpc/platforms/powernv/pci-ioda.c |   61
 +
5 files changed, 77 insertions(+)
  
   diff --git a/arch/powerpc/include/asm/machdep.h
 b/arch/powerpc/include/asm/machdep.h
   index c8175a3..965547c 100644
   --- a/arch/powerpc/include/asm/machdep.h
   +++ b/arch/powerpc/include/asm/machdep.h
   @@ -250,6 +250,10 @@ struct machdep_calls {
  /* Reset the secondary bus of bridge */
  void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
  
   +#ifdef CONFIG_PCI_IOV
   +  void (*pcibios_fixup_sriov)(struct pci_bus *bus);
   +#endif /* CONFIG_PCI_IOV */
   +
  /* Called to shutdown machine specific hardware not already
 controlled
   * by other drivers.
   */
   diff --git a/arch/powerpc/include/asm/pci-bridge.h
 b/arch/powerpc/include/asm/pci-bridge.h
   index 513f8f2..de11de7 100644
   --- a/arch/powerpc/include/asm/pci-bridge.h
   +++ b/arch/powerpc/include/asm/pci-bridge.h
   @@ -175,6 +175,9 @@ struct pci_dn {
#define IODA_INVALID_PE   (-1)
#ifdef CONFIG_PPC_POWERNV
  int pe_number;
   +#ifdef CONFIG_PCI_IOV
   +  u16 max_vfs;/* number of VFs IOV BAR expended
 */
   +#endif /* CONFIG_PCI_IOV */
#endif
  struct list_head child_list;
  struct list_head list;
   diff --git a/arch/powerpc/kernel/pci-common.c
 b/arch/powerpc/kernel/pci-common.c
   index 8203101..022e9fe 100644
   --- a/arch/powerpc/kernel/pci-common.c
   +++ b/arch/powerpc/kernel/pci-common.c
   @@ -1646,6 +1646,11 @@ void pcibios_scan_phb(struct pci_controller
 *hose)
  if (ppc_md.pcibios_fixup_phb)
  ppc_md.pcibios_fixup_phb(hose);
  
   +#ifdef CONFIG_PCI_IOV
   +  if (ppc_md.pcibios_fixup_sriov)
   +  ppc_md.pcibios_fixup_sriov(bus);
   +#endif /* CONFIG_PCI_IOV */
  
  Here, and ...
  
   +
  /* Configure PCI Express settings */
  if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
  struct pci_bus *child;
   diff --git a/arch/powerpc/kernel/pci-hotplug.c
 b/arch/powerpc/kernel/pci-hotplug.c
   index 5b78917..7d238ae 100644
   --- a/arch/powerpc/kernel/pci-hotplug.c
   +++ b/arch/powerpc/kernel/pci-hotplug.c
   @@ -94,6 +94,10 @@ void pcibios_add_pci_devices(struct pci_bus * bus)
   */
  slotno = PCI_SLOT(PCI_DN(dn->child)->devfn);
  pci_scan_slot(bus, PCI_DEVFN(slotno, 0));
   +#ifdef CONFIG_PCI_IOV
   +  if (ppc_md.pcibios_fixup_sriov)
   +  ppc_md.pcibios_fixup_sriov(bus);
   +#endif /* CONFIG_PCI_IOV */
  
  here, you have the same code.  It's good that we now do it for hot-added
  devices as well as those present at boot.  But it's bad that it happens
 in
  two different paths.
  
  Isn't there some way we can unify this so the same path is used for the
  initial pcibios_scan_phb() and also the hot-add case?  Maybe call
  pcibios_fixup_sriov() from pcibios_add_device()?
  
 
 
  This is a very good suggestion. I have changed this and works fine.

 I was expecting a v14 series with this change.  Is it coming, or are you
 waiting for something else from me?




-- 
Richard Yang
Help You, Help Me

[PATCH v0 2/4] ppc64le: dynamic ftrace configuration options

2015-03-19 Thread Torsten Duwe
Switch on -mprofile-kernel, and remove it again
from directories involved in exception handling.
This needs to be done more fine grained, of course.

diff --git a/Makefile b/Makefile
index 1a60bdd..72644e6 100644
--- a/Makefile
+++ b/Makefile
@@ -732,7 +732,10 @@ ifdef CONFIG_FUNCTION_TRACER
 ifdef CONFIG_HAVE_FENTRY
 CC_USING_FENTRY:= $(call cc-option, -mfentry -DCC_USING_FENTRY)
 endif
-KBUILD_CFLAGS  += -pg $(CC_USING_FENTRY)
+ifdef CONFIG_HAVE_MPROFILE_KERNEL
+CC_USING_MPROFILE_KERNEL   := $(call cc-option, -mprofile-kernel 
-DCC_USING_MPROFILE_KERNEL)
+endif
+KBUILD_CFLAGS  += -pg $(CC_USING_FENTRY) $(CC_USING_MPROFILE_KERNEL)
 KBUILD_AFLAGS  += $(CC_USING_FENTRY)
 ifdef CONFIG_DYNAMIC_FTRACE
ifdef CONFIG_HAVE_C_RECORDMCOUNT
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4bc7b62..d82d7c8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -93,8 +93,10 @@ config PPC
select OF_RESERVED_MEM
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_DYNAMIC_FTRACE
+   select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_GRAPH_TRACER
+   select HAVE_MPROFILE_KERNEL
select SYSCTL_EXCEPTION_TRACE
select ARCH_WANT_OPTIONAL_GPIOLIB
select VIRT_TO_BUS if !PPC64
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 670c312..688e6f9 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -17,14 +17,14 @@ endif
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace early boot code
-CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog
-CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog -mprofile-kernel
+CFLAGS_REMOVE_prom.o = -pg -mno-sched-epilog -mprofile-kernel
 # do not trace tracer code
-CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog -mprofile-kernel
 # timers used by tracing
-CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_time.o = -pg -mno-sched-epilog -mprofile-kernel
 endif
 
 obj-y  := cputable.o ptrace.o syscalls.o \
diff --git a/kernel/Makefile b/kernel/Makefile
index 8af7403..3c8821d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -13,8 +13,9 @@ obj-y = fork.o exec_domain.o panic.o \
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace debug files and internal ftrace files
-CFLAGS_REMOVE_cgroup-debug.o = -pg
-CFLAGS_REMOVE_irq_work.o = -pg
+CFLAGS_REMOVE_cgroup-debug.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_irq_work.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_extable.o = -pg -mprofile-kernel
 endif
 
 # cond_syscall is currently not LTO compatible
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 8541bfd..1cc57c8 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,10 +2,10 @@
 obj-y += mutex.o semaphore.o rwsem.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_lockdep.o = -pg
-CFLAGS_REMOVE_lockdep_proc.o = -pg
-CFLAGS_REMOVE_mutex-debug.o = -pg
-CFLAGS_REMOVE_rtmutex-debug.o = -pg
+CFLAGS_REMOVE_lockdep.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_lockdep_proc.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_mutex-debug.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_rtmutex-debug.o = -pg -mprofile-kernel
 endif
 
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c..dd53f3d 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -52,6 +52,11 @@ config HAVE_FENTRY
help
  Arch supports the gcc options -pg with -mfentry
 
+config HAVE_MPROFILE_KERNEL
+   bool
+   help
+ Arch supports the gcc options -pg with -mprofile-kernel
+
 config HAVE_C_RECORDMCOUNT
bool
help
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 59fa2de..b2f5029 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -6,8 +6,8 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 ccflags-$(CONFIG_PPC64):= $(NO_MINIMAL_TOC)
 
-CFLAGS_REMOVE_code-patching.o = -pg
-CFLAGS_REMOVE_feature-fixups.o = -pg
+CFLAGS_REMOVE_code-patching.o = -pg -mprofile-kernel
+CFLAGS_REMOVE_feature-fixups.o = -pg -mprofile-kernel
 
 obj-y  := string.o alloc.o \
   crtsavres.o
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index d0130ff..22633af 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -6,6 +6,11 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 ccflags-$(CONFIG_PPC64):= $(NO_MINIMAL_TOC)
 
+# needed for do_page_fault in fault.c :
+KBUILD_CFLAGS := $(filter-out -mprofile-kernel, $(KBUILD_CFLAGS))
+KBUILD_CFLAGS := $(filter-out -pg, 

Re: new decimal conversion - seeking testers

2015-03-19 Thread Rasmus Villemoes
On Fri, Mar 13 2015, Nishanth Aravamudan n...@linux.vnet.ibm.com wrote:

> On 13.03.2015 [00:09:19 +0100], Rasmus Villemoes wrote:
>> Since the new code plays a little endianness game I would really
>> appreciate it if someone here would run the test and verification code
>> on ppc.
>
> On a ppc64le box:
[...]
>
> On a ppc64 box:
[...]

Thanks!

Cheers,
Rasmus

[PATCH v0 0/4] ppc64le: dynamic ftrace and kgraft support

2015-03-19 Thread Torsten Duwe
Here's an initial version of dynamic ftrace for ABIv2 (ppc64le);
the code maturity is somewhere between proof of concept and pre-alpha.
I have split it into 4 parts, for ftrace and kgraft, a configuration
enablement and the actual code, respectively.
Please have a look and tell me whether this is the way to go.

Torsten


Re: [PATCH v0 1/4] ppc64le: dynamic ftrace

2015-03-19 Thread Torsten Duwe
I'm pretty sure not everything is ifdef'd properly,
and the FIXME needs to be solved in order to disable
ftracing again. Built upon some original code by
Vojtech Pavlik.

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index e366187..a69d47e 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -46,6 +46,8 @@
 extern void _mcount(void);
 
 #ifdef CONFIG_DYNAMIC_FTRACE
+# define FTRACE_ADDR ((unsigned long)ftrace_caller+8)
+# define FTRACE_REGS_ADDR FTRACE_ADDR
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
/* relocation of mcount call site is the same as the address */
@@ -57,6 +58,9 @@ struct dyn_arch_ftrace {
 #endif /*  CONFIG_DYNAMIC_FTRACE */
 #endif /* __ASSEMBLY__ */
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+#define ARCH_SUPPORTS_FTRACE_OPS 1
+#endif
 #endif
 
 #if defined(CONFIG_FTRACE_SYSCALLS) && defined(CONFIG_PPC64) &&
!defined(__ASSEMBLY__)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 5bbd1bc..9caf9af 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1159,32 +1170,107 @@ _GLOBAL(enter_prom)
 
 #ifdef CONFIG_FUNCTION_TRACER
 #ifdef CONFIG_DYNAMIC_FTRACE
-_GLOBAL(mcount)
+
+#define TOCSAVE 24
+
 _GLOBAL(_mcount)
-   blr
+   nop // REQUIRED for ftrace, to calculate local/global entry diff
+.localentry _mcount,.-_mcount
+   mflr    r0
+   mtctr   r0
+
+   LOAD_REG_ADDR_PIC(r12,ftrace_trace_function)
+   ld  r12,0(r12)
+   LOAD_REG_ADDR_PIC(r0,ftrace_stub)
+   cmpd    r0,r12
+   ld  r0,LRSAVE(r1)
+   bne-    2f
+
+   mtlr    r0
+   bctr
+
+2: /* here we have (*ftrace_trace_function)() in r12,
+  selfpc in CTR
+  and frompc in r0 */
+
+   mtlr    r0
+   bctr
+
+_GLOBAL(ftrace_caller)
+   mr  r0,r2   // global (module) call: save module TOC
+   b   1f
+.localentry ftrace_caller,.-ftrace_caller
+   mr  r0,r2   // local call: callee's TOC == our TOC
+   b   2f
+
+1: addis   r2,r12,(.TOC.-0b)@ha
+   addi    r2,r2,(.TOC.-0b)@l
+
+2: // Here we have our proper TOC ptr in R2,
+   // and the one we need to restore on return in r0.
+
+   ld  r12, 16(r1) // get caller's address
+
+   stdu    r1,-SWITCH_FRAME_SIZE(r1)
+
+   std r12, _LINK(r1)
+   SAVE_8GPRS(0,r1)
+   std r0,TOCSAVE(r1)
+   SAVE_8GPRS(8,r1)
+   SAVE_8GPRS(16,r1)
+   SAVE_8GPRS(24,r1)
+
+
+   LOAD_REG_IMMEDIATE(r3,function_trace_op)
+   ld  r5,0(r3)
+
+   mflr    r3
+   std r3, _NIP(r1)
+   std r3, 16(r1)
+   subi    r3, r3, MCOUNT_INSN_SIZE
+   mfmsr   r4
+   std r4, _MSR(r1)
+   mfctr   r4
+   std r4, _CTR(r1)
+   mfxer   r4
+   std r4, _XER(r1)
+   mr  r4, r12
+   addi    r6, r1 ,STACK_FRAME_OVERHEAD
 
-_GLOBAL_TOC(ftrace_caller)
-   /* Taken from output of objdump from lib64/glibc */
-   mflr    r3
-   ld  r11, 0(r1)
-   stdu    r1, -112(r1)
-   std r3, 128(r1)
-   ld  r4, 16(r11)
-   subi    r3, r3, MCOUNT_INSN_SIZE
 .globl ftrace_call
 ftrace_call:
bl  ftrace_stub
nop
+
+   ld  r3, _NIP(r1)
+   mtlr    r3
+
+   REST_8GPRS(0,r1)
+   REST_8GPRS(8,r1)
+   REST_8GPRS(16,r1)
+   REST_8GPRS(24,r1)
+
+   addi r1, r1, SWITCH_FRAME_SIZE
+
+   ld  r12, 16(r1) // get caller's address
+   mr  r2,r0   // restore callee's TOC
+   mflr    r0  // move this LR to CTR
+   mtctr   r0
+   mr  r0,r12  // restore callee's lr at _mcount site
+   mtlr    r0
+   bctr    // jump after _mcount site
+
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 .globl ftrace_graph_call
 ftrace_graph_call:
b   ftrace_graph_stub
 _GLOBAL(ftrace_graph_stub)
 #endif
-   ld  r0, 128(r1)
-   mtlr    r0
-   addi    r1, r1, 112
+   
 _GLOBAL(ftrace_stub)
+   nop
+   nop
+.localentry ftrace_stub,.-ftrace_stub  
blr
 #else
 _GLOBAL_TOC(_mcount)
@@ -1218,20 +1304,17 @@ _GLOBAL(ftrace_stub)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 _GLOBAL(ftrace_graph_caller)
/* load r4 with local address */
-   ld  r4, 128(r1)
+   ld  r4, LRSAVE+SWITCH_FRAME_SIZE(r1)
subi    r4, r4, MCOUNT_INSN_SIZE
 
/* get the parent address */
-   ld  r11, 112(r1)
-   addi    r3, r11, 16
+   ld  r11, SWITCH_FRAME_SIZE(r1)
+   addi    r3, r11, LRSAVE
 
bl  prepare_ftrace_return
nop
 
-   ld  r0, 128(r1)
-   mtlr    r0
-   addi    r1, r1, 112
-   blr
+   b ftrace_graph_stub
 
 _GLOBAL(return_to_handler)
/* need to save return values */
diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c
index 390311c..4fe16fb 100644
--- a/arch/powerpc/kernel/ftrace.c
+++ 

[PATCH v0 3/4] ppc64le: kgraft support

2015-03-19 Thread Torsten Duwe
The kgraft hooks for ppc64. Just massaged a bit to
get them to compile and not interfere.
Feel free to test them if you're daring ;)

diff --git a/arch/powerpc/include/asm/kgraft.h b/arch/powerpc/include/asm/kgraft.h
new file mode 100644
index 000..7f8600d
--- /dev/null
+++ b/arch/powerpc/include/asm/kgraft.h
@@ -0,0 +1,33 @@
+/*
+ * kGraft Online Kernel Patching
+ *
+ *  Copyright (c) 2013-2014 SUSE
+ *   Authors: Jiri Kosina
+ *   Vojtech Pavlik
+ *   Jiri Slaby
+ */
+
+/*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#ifndef ASM_KGR_H
+#define ASM_KGR_H
+
+#include <asm/ptrace.h>
+#include <linux/stacktrace.h>
+
+static inline void kgr_set_regs_ip(struct pt_regs *regs, unsigned long ip)
+{
+   regs->link = ip;
+}
+
+static inline bool kgr_needs_lazy_migration(struct task_struct *p)
+{
+   return true;
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index b034ecd..aa6a084 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
   TIF_NEED_RESCHED */
 #define TIF_32BIT  4   /* 32 bit binary */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
+#define TIF_KGR_IN_PROGRESS 6   /* kGraft patching in progress */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
 #define TIF_NOHZ   9   /* in adaptive nohz mode */
@@ -115,8 +117,10 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
 #define _TIF_32BIT (1<<TIF_32BIT)
 #define _TIF_RESTORE_TM (1<<TIF_RESTORE_TM)
+#define _TIF_KGR_IN_PROGRESS   (1<<TIF_KGR_IN_PROGRESS)
 #define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP)
+#define _TIF_NOHZ  (1<<TIF_NOHZ)
 #define _TIF_SECCOMP   (1<<TIF_SECCOMP)
 #define _TIF_RESTOREALL (1<<TIF_RESTOREALL)
 #define _TIF_NOERROR   (1<<TIF_NOERROR)
@@ -124,7 +128,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_UPROBE (1<<TIF_UPROBE)
 #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_EMULATE_STACK_STORE   (1<<TIF_EMULATE_STACK_STORE)
-#define _TIF_NOHZ  (1<<TIF_NOHZ)
+
 #define _TIF_SYSCALL_T_OR_A (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
 _TIF_NOHZ)
@@ -132,7 +136,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
 _TIF_RESTORE_TM)
-#define _TIF_PERSYSCALL_MASK   (_TIF_RESTOREALL|_TIF_NOERROR)
+
+#define _TIF_PERSYSCALL_MASK   (_TIF_RESTOREALL|_TIF_NOERROR|_TIF_KGR_IN_PROGRESS)
 
 /* Bits in local_flags */
 /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 5bbd1bc..569acd4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -151,8 +151,8 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 #endif
CURRENT_THREAD_INFO(r11, r1)
ld  r10,TI_FLAGS(r11)
-   andi.   r11,r10,_TIF_SYSCALL_T_OR_A
-   bne syscall_dotrace
+   andi.   r10,r10,(_TIF_SYSCALL_T_OR_A|_TIF_KGR_IN_PROGRESS)
+   bne-    syscall_precall
 .Lsyscall_dotrace_cont:
cmpldi  0,r0,NR_syscalls
bge-    syscall_enosys
@@ -245,6 +245,17 @@ syscall_error:
neg r3,r3
std r5,_CCR(r1)
b   .Lsyscall_error_cont
+
+syscall_precall:
+   andi.   r10,r10,(_TIF_KGR_IN_PROGRESS)
+   beq+    syscall_dotrace
+
+   addi    r11,r11,TI_FLAGS
+1: ldarx   r12,0,r11
+   andc    r12,r12,r10
+   stdcx.  r12,0,r11
+   bne-    1b
+   subi    r11,r11,TI_FLAGS

 /* Traced system call support */
 syscall_dotrace:

Re: Generic IOMMU pooled allocator

2015-03-19 Thread Sowmini Varadhan

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben One thing I noticed is the asymmetry in your code between the alloc
Ben and the free path. The alloc path is similar to us in that the lock
Ben covers the allocation and that's about it, there's no actual mapping to
Ben the HW done, it's done by the caller level right ?

Yes, the only constraint is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w transaction
should happen first, and arena-unmap should happen after.
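
The ordering constraint above can be sketched in plain C. This is a
minimal userspace model under stated assumptions, not code from the
patchset: arena_alloc(), arena_free(), hw_map(), hw_unmap() and the
flag-array arena are hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define ARENA_ENTRIES 64

/* Hypothetical arena: one flag per IOMMU table entry. */
struct arena {
	bool used[ARENA_ENTRIES];     /* software allocation state */
	bool hw_valid[ARENA_ENTRIES]; /* stand-in for h/w translations */
};

/* Reserve npages contiguous entries; return the first index or -1. */
static long arena_alloc(struct arena *a, size_t npages)
{
	size_t start, i;

	if (npages == 0 || npages > ARENA_ENTRIES)
		return -1;
	for (start = 0; start + npages <= ARENA_ENTRIES; start++) {
		for (i = 0; i < npages && !a->used[start + i]; i++)
			;
		if (i == npages) {
			for (i = 0; i < npages; i++)
				a->used[start + i] = true;
			return (long)start;
		}
	}
	return -1;
}

static void arena_free(struct arena *a, size_t start, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		a->used[start + i] = false;
}

/* Stand-ins for the hardware (or hypervisor) transactions. */
static void hw_map(struct arena *a, size_t start, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		a->hw_valid[start + i] = true;
}

static void hw_unmap(struct arena *a, size_t start, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		a->hw_valid[start + i] = false;
}

/* Map path: arena allocation first, h/w transaction second. */
static long dma_map_sketch(struct arena *a, size_t npages)
{
	long base = arena_alloc(a, npages);

	if (base >= 0)
		hw_map(a, (size_t)base, npages);
	return base;
}

/*
 * Unmap path: h/w transaction first, arena release second, so an
 * entry never becomes reusable while a stale translation exists.
 */
static void dma_unmap_sketch(struct arena *a, size_t start, size_t npages)
{
	hw_unmap(a, start, npages);
	arena_free(a, start, npages);
}
```

In this model only arena_alloc() and arena_free() would need the pool
lock; the h/w steps sit outside it on both paths.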

Ben The free path however, in your case, takes the lock and calls back into
Ben demap (which I assume is what removes the translation from the HW)
Ben with the lock held. There's also some mapping between cookies
Ben and index which here too isn't exposed to the alloc side but is
Ben exposed to the free side.

Regarding the ->demap indirection: somewhere between V1 and V2, I
realized that, at least for sun4v, it's not necessary to hold
the pool lock when doing the unmap (V1 had originally passed this
as the ->demap). Revisiting the LDC change, I think that even that
has no pool-specific info that needs to be passed in, so possibly the
->demap is not required at all?

I can remove that, and re-verify the LDC code (though I might
not be able to get to this till early next week, as I'm travelling
at the moment).

About the cookie_to_index, that came out of observation of the
LDC code (ldc_cookie_to_index in patchset 3). In the LDC case,
the workflow is approximately
   base = alloc_npages(..); /* calls iommu_tbl_range_alloc */
   /* set up cookie_state using base */
   /* populate cookies calling fill_cookies() -> make_cookie() */
The make_cookie() is the inverse operation of cookie_to_index()
(afaict, the code is not very well commented, I'm afraid), but
I need that indirection to figure out which bitmap to clear.
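
To illustrate the inverse relationship described above, here is a
small sketch. The cookie layout (table index in the high bits, page
offset in the low 13 bits) and the *_sketch names are assumptions made
for illustration; the real LDC encoding is not reproduced here.

```c
#include <stdint.h>

/* Assumed 8 KiB pages, for illustration only. */
#define SK_PAGE_SHIFT 13
#define SK_PAGE_MASK  ((UINT64_C(1) << SK_PAGE_SHIFT) - 1)

/* Hypothetical encoding: index in the high bits, offset in the low bits. */
static uint64_t make_cookie_sketch(uint64_t index, uint64_t page_offset)
{
	return (index << SK_PAGE_SHIFT) | (page_offset & SK_PAGE_MASK);
}

/* Inverse: recover the table index so the right bitmap bit can be cleared. */
static uint64_t cookie_to_index_sketch(uint64_t cookie)
{
	return cookie >> SK_PAGE_SHIFT;
}
```

The free path only ever sees the cookie, which is why some
cookie-to-index step is needed before the arena bitmap can be updated.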

I don't know if there's a better way to do this, or if
the ->cookie_to_index can get more complex for other IOMMU users.

Ben One thing that Alexey is doing on our side is to move some of the
Ben hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben will definitely collide with our current work so we'll have to figure
Ben out how things can be made to match. We might be able to move more than
Ben just the allocator to the generic code, and the whole implementation of
Ben map_sg/unmap_sg if we have the right set of ops, unless you see a
Ben reason why that wouldn't work for you ?

I can't think of why that won't work, though it would help to see
the patch itself.


Ben We also need to add some additional platform specific fields to certain
Ben iommu table instances to deal with some KVM related tracking of pinned
Ben DMAble memory, here too we might have to be creative and possibly
Ben enclose the generic iommu_table in a platform specific variant.
Ben 
Ben Alexey, any other comment ?
Ben 
Ben Cheers,
Ben Ben.
Ben Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey 
Alexey Agree about missing symmetry. In general, I would call it zoned
Alexey pool-locked memory allocator-ish rather than iommu_table and have
Alexey no callbacks there.
Alexey 
Alexey The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
->cookie_to_index.. don't know if there's a way to move the code
to avoid that..

Alexey what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might
have more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was no intermediating HV?

Alexey documentation would help here, and demap() does not have to be
Alexey executed under the lock (as map() is not executed under the lock).
Alexey 
Alexey btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini


[PATCH v0 4/4] ppc64le: kgraft config options

2015-03-19 Thread Torsten Duwe
Enable kgraft on ppc, fairly trivial.

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4bc7b62..d82d7c8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -102,6 +104,7 @@ config PPC
select HAVE_IOREMAP_PROT
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !CPU_LITTLE_ENDIAN
select HAVE_KPROBES
+   select HAVE_KGRAFT
select HAVE_ARCH_KGDB
select HAVE_KRETPROBES
select HAVE_ARCH_TRACEHOOK
@@ -291,6 +294,8 @@ source init/Kconfig
 
 source kernel/Kconfig.freezer
 
+source kernel/Kconfig.kgraft
+
 source arch/powerpc/sysdev/Kconfig
 source arch/powerpc/platforms/Kconfig
 

Re: [PATCH 1/2] powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states

2015-03-19 Thread Stewart Smith
Vasant Hegde hegdevas...@linux.vnet.ibm.com writes:
 From: Anshuman Khandual khand...@linux.vnet.ibm.com

 This patch registers the following two new OPAL interfaces calls
 for the platform LED subsystem. With the help of these new OPAL calls,
 the kernel will be able to get or set the state of various individual
 LEDs on the system at any given location code which is passed through
 the LED specific device tree nodes.

   (1) OPAL_LEDS_GET_INDICATOR opal_leds_get_ind
   (2) OPAL_LEDS_SET_INDICATOR opal_leds_set_ind

 Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com
 Signed-off-by: Vasant Hegde hegdevas...@linux.vnet.ibm.com

I also just merged the skiboot side of these calls.

Acked-by: Stewart Smith stew...@linux.vnet.ibm.com
Tested-by: Stewart Smith stew...@linux.vnet.ibm.com


(well, it boots, interacts with firmware. I didn't go and look at the
LEDs themselves).


Re: [PATCH 2/2 v5] cpufreq: qoriq: rename the driver

2015-03-19 Thread Rafael J. Wysocki
On Friday, March 13, 2015 12:39:02 PM yuantian.t...@freescale.com wrote:
 From: Tang Yuantian yuantian.t...@freescale.com
 
 This driver works on all QorIQ platforms which include
 ARM-based cores and PPC-based cores.
 Rename it in order to represent better.
 
 Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
 Acked-by: Viresh Kumar viresh.ku...@linaro.org

Both queued up for 4.1, thanks!

 ---
 v5:
   - rebased to 4.0-rc3
   - added Kconfig and Makefile entry
 v3, v4
   - none
 v2:
   - use -C -M options when format-patch
 
  drivers/cpufreq/Kconfig| 8 
  drivers/cpufreq/Kconfig.powerpc| 9 -
  drivers/cpufreq/Makefile   | 2 +-
  drivers/cpufreq/{ppc-corenet-cpufreq.c => qoriq-cpufreq.c} | 0
  4 files changed, 9 insertions(+), 10 deletions(-)
  rename drivers/cpufreq/{ppc-corenet-cpufreq.c => qoriq-cpufreq.c} (100%)
 
 diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
 index a171fef..659879a 100644
 --- a/drivers/cpufreq/Kconfig
 +++ b/drivers/cpufreq/Kconfig
 @@ -293,5 +293,13 @@ config SH_CPU_FREQ
 If unsure, say N.
  endif
  
 +config QORIQ_CPUFREQ
 + tristate "CPU frequency scaling driver for Freescale QorIQ SoCs"
 + depends on OF && COMMON_CLK && (PPC_E500MC || ARM)
 + select CLK_QORIQ
 + help
 +   This adds the CPUFreq driver support for Freescale QorIQ SoCs
 +   which are capable of changing the CPU's frequency dynamically.
 +
  endif
  endmenu
 diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
 index 7ea2441..3a0595b 100644
 --- a/drivers/cpufreq/Kconfig.powerpc
 +++ b/drivers/cpufreq/Kconfig.powerpc
 @@ -23,15 +23,6 @@ config CPU_FREQ_MAPLE
 This adds support for frequency switching on Maple 970FX
 Evaluation Board and compatible boards (IBM JS2x blades).
  
 -config PPC_CORENET_CPUFREQ
 - tristate "CPU frequency scaling driver for Freescale E500MC SoCs"
 - depends on PPC_E500MC && OF && COMMON_CLK
 - select CLK_QORIQ
 - help
 -   This adds the CPUFreq driver support for Freescale e500mc,
 -   e5500 and e6500 series SoCs which are capable of changing
 -   the CPU's frequency dynamically.
 -
  config CPU_FREQ_PMAC
   bool "Support for Apple PowerBooks"
   depends on ADB_PMU && PPC32
 diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
 index 82a1821..26df0ad 100644
 --- a/drivers/cpufreq/Makefile
 +++ b/drivers/cpufreq/Makefile
 @@ -85,7 +85,7 @@ obj-$(CONFIG_CPU_FREQ_CBE)  += ppc-cbe-cpufreq.o
  ppc-cbe-cpufreq-y += ppc_cbe_cpufreq_pervasive.o ppc_cbe_cpufreq.o
  obj-$(CONFIG_CPU_FREQ_CBE_PMI)   += ppc_cbe_cpufreq_pmi.o
  obj-$(CONFIG_CPU_FREQ_MAPLE) += maple-cpufreq.o
 -obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 +obj-$(CONFIG_QORIQ_CPUFREQ)  += qoriq-cpufreq.o
  obj-$(CONFIG_CPU_FREQ_PMAC)  += pmac32-cpufreq.o
  obj-$(CONFIG_CPU_FREQ_PMAC64)+= pmac64-cpufreq.o
  obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o
 diff --git a/drivers/cpufreq/ppc-corenet-cpufreq.c b/drivers/cpufreq/qoriq-cpufreq.c
 similarity index 100%
 rename from drivers/cpufreq/ppc-corenet-cpufreq.c
 rename to drivers/cpufreq/qoriq-cpufreq.c
 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 So I think there's something I'm missing. For non-shared mappings, I
 still have the idea that pte_dirty should be the same as pte_write.
 And yet, your testing of 3.19 shows that it's a big difference.
 There's clearly something I'm completely missing.

Ahh. The normal page table scanning and page fault handling both clear
and set the dirty bit together with the writable one. But fork()
will clear the writable bit without clearing dirty. For some reason I
thought it moved the dirty bit into the struct page like the VM
scanning does, but that was just me having a brainfart. So yeah,
pte_dirty doesn't have to match pte_write even under perfectly normal
circumstances. Maybe there are other cases.
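
The fork() case described above can be modelled with a toy PTE. This
is only an illustrative sketch: the bit positions and the toy_* names
are made up and do not correspond to any architecture's real page
table format or to kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_PTE_WRITE (1u << 0)
#define TOY_PTE_DIRTY (1u << 1)

typedef uint32_t toy_pte_t;

static bool toy_pte_write(toy_pte_t pte)
{
	return (pte & TOY_PTE_WRITE) != 0;
}

static bool toy_pte_dirty(toy_pte_t pte)
{
	return (pte & TOY_PTE_DIRTY) != 0;
}

/* A write fault on a private mapping sets both bits together. */
static toy_pte_t toy_write_fault(toy_pte_t pte)
{
	return pte | TOY_PTE_WRITE | TOY_PTE_DIRTY;
}

/* fork() write-protects for copy-on-write but leaves dirty untouched. */
static toy_pte_t toy_fork_wrprotect(toy_pte_t pte)
{
	return pte & ~TOY_PTE_WRITE;
}

/* The mismatch that would make "dirty" a poor proxy for "writable". */
static bool toy_bits_disagree(toy_pte_t pte)
{
	return toy_pte_write(pte) != toy_pte_dirty(pte);
}
```

After toy_write_fault() the two bits agree; after toy_fork_wrprotect()
the PTE is dirty but no longer writable, which is the divergence in
question.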

Not that I see a lot of forking in the xfs repair case either, so..

Dave, mind re-running the plain 3.19 numbers to really verify that the
pte_dirty/pte_write change really made that big of a difference? Maybe
your recollection of ~55,000 migrate_pages events was faulty. If the
pte_write -> pte_dirty change is the *only* difference, it's still very
odd how that one difference would make migrate_rate go from ~55k to
471k. That's an order of magnitude difference, for what really
shouldn't be a big change.

I'm running a kernel right now with a hacky update_mmu_cache() that
warns if pte_dirty is ever different from pte_write().

+void update_mmu_cache(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep)
+{
+   if (!(vma->vm_flags & VM_SHARED)) {
+       pte_t now = READ_ONCE(*ptep);
+       if (!pte_write(now) != !pte_dirty(now)) {
+           static int count = 20;
+           static unsigned int prev = 0;
+           unsigned int val = pte_val(now) & 0xfff;
+           if (prev != val && count) {
+               prev = val;
+               count--;
+               WARN(1, "pte value %x", val);
+           }
+       }
+   }
+}

I haven't seen a single warning so far (and there I wrote all that
code to limit repeated warnings), although admittedly
update_mmu_cache() isn't called for all cases where we change a pte
(not for the fork case, for example). But it *is* called for the page
faulting cases.

Maybe a system update has changed libraries and memory allocation
patterns, and there is something bigger than that one-liner
pte_dirty/write change going on?

 Linus

Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

2015-03-19 Thread Horia Geantă
On 3/18/2015 12:03 AM, Kim Phillips wrote:
 On Tue, 17 Mar 2015 19:58:55 +0200
 Horia Geantă horia.gea...@freescale.com wrote:
 
 On 3/17/2015 2:19 AM, Kim Phillips wrote:
 On Mon, 16 Mar 2015 12:02:51 +0200
 Horia Geantă horia.gea...@freescale.com wrote:

 On 3/4/2015 2:23 AM, Kim Phillips wrote:
 Only potential problem is getting the crypto API to set the GFP_DMA
 flag in the allocation request, but presumably a
 CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

 Seems there are quite a few places that do not use the
 {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
 Among them, IPsec and dm-crypt.
 I've looked at the code and I don't think it can be converted to use
 crypto API.

 why not?

 It would imply having 2 memory allocations, one for crypto request and
 the other for the rest of the data bundled with the request (for IPsec
 that would be ESN + space for IV + sg entries for authenticated-only
 data and sk_buff extension, if needed).

 Trying to have a single allocation by making ESN, IV etc. part of the
 request private context requires modifying tfm.reqsize on the fly.
 This won't work without adding some kind of locking for the tfm.
 
 can't a common minimum tfm.reqsize be co-established up front, at
 least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg -- S/G table for skb_to_sgvec; how many entries is the question
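
The layout above can be turned into an offset calculation. The
following is a rough sketch only: the struct, the function name, the
alignment handling and every size passed in are hypothetical, and it
is not the esp4 or crypto API code.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets of each region within one combined request allocation. */
struct req_layout {
	size_t seqhi_off; /* seq_no_hi (present only with ESN) */
	size_t iv_off;    /* IV */
	size_t asg_off;   /* S/G entries for authenticated-only data */
	size_t sg_off;    /* S/G table for the skb payload */
	size_t total;     /* value tfm.reqsize would need to cover */
};

/* Round v up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t v, size_t a)
{
	return (v + a - 1) & ~(a - 1);
}

static struct req_layout layout_request(size_t req_fixed, size_t backend_ctx,
					int esn, size_t ivsize, size_t iv_align,
					size_t sg_entry_size, size_t sg_align,
					size_t n_asg, size_t n_sg)
{
	struct req_layout l;
	size_t off = req_fixed + backend_ctx;

	l.seqhi_off = off;
	if (esn)
		off += sizeof(uint32_t); /* high 32 bits of the sequence number */

	off = align_up(off, iv_align);
	l.iv_off = off;
	off += ivsize;

	off = align_up(off, sg_align);
	l.asg_off = off;
	off += n_asg * sg_entry_size;

	l.sg_off = off;
	off += n_sg * sg_entry_size;

	l.total = off;
	return l;
}
```

With everything but n_sg known when the tfm is set up, the open
question in the text reduces to which n_sg value to pass in.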

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

 This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
 places. Some of the maintainers do not agree, as you've seen.

 would modifying the crypto API to either have a different
 *_request_alloc() API, and/or adding calls to negotiate the GFP mask
 between crypto users and drivers, e.g., get/set_gfp_mask, work?

 I think what DaveM asked for was the change to be transparent.

 Besides converting to *_request_alloc(), seems that all other options
 require some extra awareness from the user.
 Could you elaborate on the idea above?
 
 was merely suggesting communicating GFP flags anonymously across the
 API, i.e., GFP_DMA wouldn't appear in user code.

Meaning user would have to get_gfp_mask before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

 An alternative would be for talitos to use the page allocator to get 1 /
 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
 = 8 kB), dma_map_page the area and manage it internally for talitos_desc
 hw descriptors.
 What do you think?

 There's a comment in esp_alloc_tmp(): Use spare space in skb for
 this where possible, which is ideally where we'd want to be (esp.

 Ok, I'll check that. But note the where possible - finding room in the
 skb to avoid the allocation won't always be the case, and then we're
 back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the TODO - maybe to use the
tailroom in the skb data area?

 because that memory could already be DMA-able).  Your above
 suggestion would be in the opposite direction of that.

 The proposal:
 -removes dma (un)mapping on the fast path
 
 sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

 -avoids requesting dma mappable memory for more than it's actually
 needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
 only its private context)
 
 compared to the payload?  Plus, we have plenty of DMA space these
 days.
 
 -for caam it has the added benefit of speeding the below search for the
 offending descriptor in the SW ring from O(n) to O(1):
 for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
  sw_idx = (tail + i) & (JOBR_DEPTH - 1);

  if (jrp->outring[hw_idx].desc ==
  jrp->entinfo[sw_idx].desc_addr_dma)
  break; /* found */
 }
 (drivers/crypto/caam/jr.c - caam_dequeue)
 
 how?  The job ring h/w will still be spitting things out
 out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words can be used for smth. else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.
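
A self-contained sketch of that O(1) lookup follows. It assumes
descriptors are packed at JD_SIZE-aligned offsets from a single base
bus address; JD_SLOTS_SK and the function name are hypothetical, and
the offset is taken as descriptor address minus base so that it is
non-negative for valid slots.

```c
#include <stdint.h>

#define JD_SIZE     64u /* 16 words per job descriptor slot */
#define JD_SLOTS_SK 64u /* hypothetical: 64 slots fill one 4 KiB page */

/*
 * Map a descriptor bus address reported by the h/w back to its
 * software ring index in constant time. Returns -1 if the address
 * does not fall on a slot boundary inside the descriptor area.
 */
static long desc_addr_to_sw_idx(uint64_t desc_base, uint64_t desc_addr)
{
	uint64_t off;

	if (desc_addr < desc_base)
		return -1;
	off = desc_addr - desc_base;
	if (off % JD_SIZE != 0 || off / JD_SIZE >= JD_SLOTS_SK)
		return -1;
	return (long)(off / JD_SIZE);
}
```

This replaces the linear scan over the software ring with one
subtraction and one division, at the cost of managing the descriptor
area by hand.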

 Plus, like I said, it's taking the problem in the wrong direction:
 we need to strive to merge the allocation