Re: [PATCH V3 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-22 Thread Anshuman Khandual
On 04/22/2013 08:20 AM, Michael Neuling wrote:
> Michael Ellerman  wrote:
> 
>> On Mon, Apr 22, 2013 at 11:13:43AM +1000, Michael Neuling wrote:
>>> Michael Ellerman  wrote:
>>>
 On Thu, Apr 18, 2013 at 05:56:12PM +0530, Anshuman Khandual wrote:
> This patch adds new POWER8 instruction encoding for reading
> the BHRB buffer entries and also clearing it. Encoding for
> "clrbhrb" instruction is straight forward.

 Which is "clear branch history rolling buffer" ?

> But "mfbhrbe"
> encoding involves reading a certain index of BHRB buffer
> into a particular GPR register.

 And "Move from branch history rolling buffer entry" ?

> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index 8752bc8..93ae5a1 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -82,6 +82,7 @@
>  #define  __REGA0_R31 31
>  
>  /* sorted alphabetically */
> +#define PPC_INST_BHRBE   0x7c00025c

 I don't think you really need this, just use the literal value below.
>>>
>>> The rest of the defines in this file do this, so Anshuman's right. 
>>
>> I don't see the point, but sure let's be consistent. Though in that case
>> he should do the same for PPC_CLRBHRB below.
> 
> Agreed.
> 

Sure, I will define a new macro (PPC_INST_CLRBHRB) for the 0x7c00035c encoding
and use it for PPC_CLRBHRB.

> Mikey
> 
>>
> @@ -297,6 +298,12 @@
>  #define PPC_NAP  stringify_in_c(.long PPC_INST_NAP)
>  #define PPC_SLEEPstringify_in_c(.long PPC_INST_SLEEP)
>  
> +/* BHRB instructions */
> +#define PPC_CLRBHRB  stringify_in_c(.long 0x7c00035c)
> +#define PPC_MFBHRBE(r, n)stringify_in_c(.long PPC_INST_BHRBE | \
> + __PPC_RS(r) | \
> + (((n) & 0x1f) << 11))

 Why are you not using ___PPC_RB(n) here ?
>>>
>>> Actually, this is wrong.  The number field should be 10 bits (0x3ff),
>>> not 5 (0x1f)  Anshuman please fix.
>>
>> ACK.

I got it wrong; I took this as 32 entries instead of 1024. Will fix it.
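
For reference, the corrected definitions would look something like the
following (only a sketch of what was agreed above, not the final patch):

#define PPC_INST_BHRBE		0x7c00025c
#define PPC_INST_CLRBHRB	0x7c00035c

#define PPC_CLRBHRB		stringify_in_c(.long PPC_INST_CLRBHRB)
#define PPC_MFBHRBE(r, n)	stringify_in_c(.long PPC_INST_BHRBE | \
						__PPC_RS(r) | \
						(((n) & 0x3ff) << 11))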

Regards
Anshuman

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/5] of_platform_driver and OF_DEVICE removal

2013-04-22 Thread Arnd Bergmann
On Monday 22 April 2013, Rob Herring wrote:
> From: Rob Herring 
> 
> This series is a relatively straight-forward removal of the last remaining
> user of of_platform_driver (ibmebus) and removal of CONFIG_OF_DEVICE which
> is always enabled when CONFIG_OF is enabled.
> 
> Compile tested on powerpc and sparc.
> 

Acked-by: Arnd Bergmann 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 00/27] THP support for PPC64

2013-04-22 Thread Aneesh Kumar K.V
Hi,

This patchset adds transparent hugepage support for PPC64.

Some numbers:

The latency measurement code from Anton can be found at
http://ozlabs.org/~anton/junkcode/latency2001.c

64K page size (With THP support)
--
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    428.49 cycles    120.50 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    471.16 cycles    132.50 ns
[root@llmp24l02 test]# echo never > /sys/kernel/mm/transparent_hugepage/enabled
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    766.52 cycles    215.56 ns
[root@llmp24l02 test]#

4K page size (No THP support for 4K)
--
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    814.88 cycles    229.16 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    463.69 cycles    130.40 ns
[root@llmp24l02 test]#

We are close to hugetlbfs in latency and we can achieve this with zero
config/page reservation. Most of the allocations above are fault allocated.

Another test that does 5000 random accesses over a 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.

split_huge_page impact:
-
To look at the performance impact of large page invalidation, I tried the
below experiment. The test involved accessing a large contiguous region of
memory as below:

for (i = 0; i < size; i += PAGE_SIZE)
data[i] = i;

We wanted to access the data in sequential order so that we look at the worst
case for THP performance. Accessing the data in sequential order implies we
have the page table cached and the TLB miss overhead is as minimal as possible.
We also don't touch the entire page, because that can result in cache eviction.

After we have touched the full range as above, we then call mprotect on each
of those pages. An mprotect will result in a hugepage split, which should allow
us to measure the impact of a hugepage split.

for (i = 0; i < size; i += PAGE_SIZE)
 mprotect(&data[i], PAGE_SIZE, PROT_READ);
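
The full test program was not posted; a minimal standalone version built
around the two loops above might look roughly like this (an assumed
reconstruction, not the exact split-huge-page-mpro source):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 1UL << 30;	/* region size; the perf run below used 20G */
	long psz = sysconf(_SC_PAGESIZE);
	size_t i;
	char *data;

	/* anonymous mapping; THP (when enabled) backs it with huge pages */
	data = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (data == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* touch one byte per page so the whole range is fault allocated */
	for (i = 0; i < size; i += psz)
		data[i] = i;

	/* mprotect each page; with THP every huge page gets split */
	for (i = 0; i < size; i += psz)
		mprotect(&data[i], psz, PROT_READ);

	return 0;
}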

Split hugepage impact: 
-
THP enabled: 2.851561705 seconds for test completion
THP disable: 3.599146098 seconds for test completion

We are 20.7% better than the non-THP case even when all the large pages have
been split.

Detailed output:

THP enabled:
---
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# /root/thp/tools/perf/perf stat -e 
page-faults,dTLB-load-misses ./split-huge-page-mpro 20G 
 
time taken to touch all the data in ns: 2763096913 

 Performance counter stats for './split-huge-page-mpro 20G':

 1,581 page-faults 
 3,159 dTLB-load-misses

   2.851561705 seconds time elapsed

[root@llmp24l02 ~]# 
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 1279
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 1279
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# 

77.05%  split-huge-page  [kernel.kallsyms]     [k] .clear_user_page
 7.10%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap_ctx
 1.51%  split-huge-page  split-huge-page-mpro  [.] 0x0a70
 0.96%  split-huge-page  [unknown]             [H] 0x0157e3bc
 0.81%  split-huge-page  [kernel.kallsyms]     [k] .up_write
 0.76%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap
 0.76%  split-huge-page  [kernel.kallsyms]     [k] .down_write
 0.74%  split-huge-page  [kernel.kallsyms]     [k] .lru_add_page_tail
 0.61%  split-huge-page  [kernel.kallsyms]     [k] .split_huge_page
 0.59%  split-huge-page  [kernel.kallsyms]     [k] .change_protection
 0.51%  split-huge-page  [kernel.kallsyms]     [k] .release_pages


 0.96%  split-huge-page  [unknown]             [H] 0x0157e3bc
|  
|--79.44%-- reloc_start
|  |  
|  |--86.54%-- .__pSeries_lpar_hugepage_invalidate
|  |  .pSeries_lpar_hugepage_invalidate
|  |  .hpte_need_hugepage_flush
|  |  .split_huge_page
|  |  .__split_huge_page_pmd
|  |  .vma_adjust
|  |  .vma_merge
|  |  .mprotect_fixup
|  |  .S

[PATCH -V6 01/27] powerpc: Use signed formatting when printing error

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

PAPR defines these errors as negative values. So print them accordingly
for easy debugging.
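
A small illustration of why the format matters (not part of the patch; the
error value is just an example):

	long lpar_rc = -4;	/* e.g. H_PARAMETER */
	pr_devel(" lpar err %lu\n", lpar_rc);	/* " lpar err 18446744073709551612" */
	pr_devel(" lpar err %ld\n", lpar_rc);	/* " lpar err -4" */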

Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 299731e..9b02ab1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -155,7 +155,7 @@ static long pSeries_lpar_hpte_insert(unsigned long 
hpte_group,
 */
if (unlikely(lpar_rc != H_SUCCESS)) {
if (!(vflags & HPTE_V_BOLTED))
-   pr_devel(" lpar err %lu\n", lpar_rc);
+   pr_devel(" lpar err %ld\n", lpar_rc);
return -2;
}
if (!(vflags & HPTE_V_BOLTED))
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 02/27] powerpc: Save DAR and DSISR in pt_regs on MCE

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We were not saving DAR and DSISR on MCE. Save them and also print the values
along with the exception details in xmon.

Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/kernel/exceptions-64s.S |9 +
 arch/powerpc/xmon/xmon.c |2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 56bd923..7da3f94 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -688,9 +688,18 @@ slb_miss_user_pseries:
.align  7
.globl machine_check_common
 machine_check_common:
+
+   mfspr   r10,SPRN_DAR
+   std r10,PACA_EXGEN+EX_DAR(r13)
+   mfspr   r10,SPRN_DSISR
+   stw r10,PACA_EXGEN+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
FINISH_NAP
DISABLE_INTS
+   ld  r3,PACA_EXGEN+EX_DAR(r13)
+   lwz r4,PACA_EXGEN+EX_DSISR(r13)
+   std r3,_DAR(r1)
+   std r4,_DSISR(r1)
bl  .save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  .machine_check_exception
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 13f85de..51e237c 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1430,7 +1430,7 @@ static void excprint(struct pt_regs *fp)
printf("sp: %lx\n", fp->gpr[1]);
printf("   msr: %lx\n", fp->msr);
 
-   if (trap == 0x300 || trap == 0x380 || trap == 0x600) {
+   if (trap == 0x300 || trap == 0x380 || trap == 0x600 || trap == 0x200) {
printf("   dar: %lx\n", fp->dar);
if (trap != 0x380)
printf(" dsisr: %lx\n", fp->dsisr);
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 03/27] powerpc: Don't hard code the size of pte page

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Use PTRS_PER_PTE to indicate the size of the pte page. To support THP,
later patches will be changing the PTRS_PER_PTE value.
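
For the current 64K-page configuration (PTE_INDEX_SIZE == 12 at this point in
the series) the macro evaluates to the constant that used to be hard coded,
and it follows the later change automatically:

	PTE_PAGE_HIDX_OFFSET = PTRS_PER_PTE * 8 = (1 << 12) * 8 = 0x8000
	after PTE_INDEX_SIZE is reduced to 8:     (1 << 8) * 8  = 0x800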

Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h |6 ++
 arch/powerpc/mm/hash_low_64.S  |4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index a9cbd3b..4b52726 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -17,6 +17,12 @@ struct mm_struct;
 #  include 
 #endif
 
+/*
+ * We save the slot number & secondary bit in the second half of the
+ * PTE page. We use the 8 bytes per each pte entry.
+ */
+#define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 7443481..abdd5e2 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -490,7 +490,7 @@ END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, 
CPU_FTR_NOEXECUTE)
beq htab_inval_old_hpte
 
ld  r6,STK_PARAM(R6)(r1)
-   ori r26,r6,0x8000   /* Load the hidx mask */
+   ori r26,r6,PTE_PAGE_HIDX_OFFSET /* Load the hidx mask. */
ld  r26,0(r26)
addir5,r25,36   /* Check actual HPTE_SUB bit, this */
rldcr.  r0,r31,r5,0 /* must match pgtable.h definition */
@@ -607,7 +607,7 @@ htab_pte_insert_ok:
sld r4,r4,r5
andcr26,r26,r4
or  r26,r26,r3
-   ori r5,r6,0x8000
+   ori r5,r6,PTE_PAGE_HIDX_OFFSET
std r26,0(r5)
lwsync
std r30,0(r6)
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 04/27] powerpc: Don't truncate pgd_index wrongly

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

With PGD_INDEX_SIZE set to 12 the existing macro doesn't work. Fix it to
use PTRS_PER_PGD.

The idea originally was to have one more bit in the result of
pgd_index() than PGD_INDEX_SIZE, so that if one had an address
corresponding to the last PGD entry, and then incremented that address
by PGD_SIZE, and took pgd_index() of that, you wouldn't end up with
zero.  The commit that introduced that dates back to 2002, and the
code that was sensitive to that edge case has long since been
refactored (several times), so there is no need for it these days.
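
A quick illustration of the breakage (using the values this series ends up
with): with PGD_INDEX_SIZE == 12,

	PTRS_PER_PGD - 1 = (1 << 12) - 1 = 0xfff   (correct mask)
	old literal mask =                 0x1ff   (only 9 bits, silently
	                                            truncating any pgd index >= 512)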

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64.h |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..e3d55f6f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -167,8 +167,7 @@
  * Find an entry in a page-table-directory.  We combine the address region
  * (the high order N bits) and the pgd portion of the address.
  */
-/* to avoid overflow in free_pgtables we don't use PTRS_PER_PGD here */
-#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x1ff)
+#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & (PTRS_PER_PGD - 1))
 
 #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
 
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 05/27] powerpc: New hugepage directory format

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Change the hugepage directory format so that we can have leaf ptes directly
at the page directory level, avoiding the allocation of a hugepage directory.

With the new table format we have 3 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table

Instead of storing the shift value in the hugepd pointer, we store the
mmu_psize_def index so that all the supported hugepage sizes fit in 4 bits.
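
Putting the encoding together (a sketch derived from the helpers in the diff
below, not code from the patch itself):

	/*
	 * A hugepd entry is the address of the hugepte page with the
	 * mmu_psize_defs[] index stored in bits 2..5 (HUGEPD_SHIFT_MASK),
	 * leaving the bottom two bits clear:
	 *
	 *	hpd.pd = (unsigned long)hugepte_page | (psize_index << 2);
	 *
	 * hugepd_page() masks the low bits off again, and
	 * hugepd_mmu_psize() recovers psize_index with ">> 2".
	 */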

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hugetlb.h|   13 +++--
 arch/powerpc/include/asm/mmu-hash64.h |   20 +++-
 arch/powerpc/include/asm/page.h   |   18 +-
 arch/powerpc/include/asm/pgalloc-64.h |5 -
 arch/powerpc/mm/hugetlbpage.c |   23 ---
 arch/powerpc/mm/init_64.c |3 +--
 6 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 62e11a3..81f7677 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -9,12 +9,21 @@ extern struct kmem_cache *hugepte_cache;
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
BUG_ON(!hugepd_ok(hpd));
-   return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | PD_HUGE);
+   /*
+* We have only four bits to encode, MMU page size
+*/
+   BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf);
+   return (pte_t *)(hpd.pd & ~HUGEPD_SHIFT_MASK);
+}
+
+static inline unsigned int hugepd_mmu_psize(hugepd_t hpd)
+{
+   return (hpd.pd & HUGEPD_SHIFT_MASK) >> 2;
 }
 
 static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
-   return hpd.pd & HUGEPD_SHIFT_MASK;
+   return mmu_psize_to_shift(hugepd_mmu_psize(hpd));
 }
 
 static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index b59e06f..05895cf 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -21,6 +21,7 @@
  * complete pgtable.h but only a portion of it.
  */
 #include 
+#include 
 
 /*
  * Segment table
@@ -159,6 +160,24 @@ struct mmu_psize_def
unsigned long   avpnm;  /* bits to mask out in AVPN in the HPTE */
unsigned long   sllp;   /* SLB L||LP (exact mask to use in slbmte) */
 };
+extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
+
+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+   int psize;
+
+   for (psize = 0; psize < MMU_PAGE_COUNT; ++psize)
+   if (mmu_psize_defs[psize].shift == shift)
+   return psize;
+   return -1;
+}
+
+static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
+{
+   if (mmu_psize_defs[mmu_psize].shift)
+   return mmu_psize_defs[mmu_psize].shift;
+   BUG();
+}
 
 #endif /* __ASSEMBLY__ */
 
@@ -193,7 +212,6 @@ static inline int segment_shift(int ssize)
 /*
  * The current system page and segment sizes
  */
-extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
 extern int mmu_linear_psize;
 extern int mmu_virtual_psize;
 extern int mmu_vmalloc_psize;
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index f072e97..b309cf4 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -250,18 +250,6 @@ extern long long virt_phys_offset;
 #endif
 
 /*
- * Use the top bit of the higher-level page table entries to indicate whether
- * the entries we point to contain hugepages.  This works because we know that
- * the page tables live in kernel space.  If we ever decide to support having
- * page tables at arbitrary addresses, this breaks and will have to change.
- */
-#ifdef CONFIG_PPC64
-#define PD_HUGE 0x8000
-#else
-#define PD_HUGE 0x8000
-#endif
-
-/*
  * Some number of bits at the level of the page table that points to
  * a hugepte are used to encode the size.  This masks those bits.
  */
@@ -356,7 +344,11 @@ typedef struct { signed long pd; } hugepd_t;
 #ifdef CONFIG_HUGETLB_PAGE
 static inline int hugepd_ok(hugepd_t hpd)
 {
-   return (hpd.pd > 0);
+   /*
+* hugepd pointer, bottom two bits == 00 and next 4 bits
+* indicate size of table
+*/
+   return (((hpd.pd & 0x3) == 0x0) && ((hpd.pd & HUGEPD_SHIFT_MASK) != 0));
 }
 
 #define is_hugepd(pdep)   (hugepd_ok(*((hugepd_t *)(pdep
diff --git a/arch/powerpc/include/asm/pgalloc-64.h 
b/arch/powerpc/include/asm/pgalloc-64.h
index 292725c..69e352a 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -35,7 +35,10 @@ struct vmemmap_backing {
 #define MAX_PGTABLE_INDEX_SIZE 0xf
 
 extern struct kmem_cache *pgtable_cache[];
-#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
+#define PGT_CACHE(shift) ({   

[PATCH -V6 06/27] powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We will be switching PMD_SHIFT to 24 bits to facilitate the THP implementation.
With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at the PGD
level. That means that with a 32-bit process we cannot allocate normal pages at
all, because we cover the entire address space with one pgd entry. Fix this by
switching to a new page table format for hugepages. With the new page table
format for 16GB and 16MB hugepages we won't allocate a hugepage directory.
Instead we encode the PTE information directly at the directory level. This
forces 16MB hugepages to the PMD level. This will also make the page table walk
much simpler later when we add THP support.

With the new table format we have 4 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(3) leaf pte for huge page, bottom two bits != 00
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
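
For illustration, the four cases can be told apart roughly like this (a
hypothetical helper, not code from the patch):

	static inline int pde_case(unsigned long pd)
	{
		if (pd == 0)
			return 1;	/* invalid */
		if (pd & 0x3)
			return 3;	/* leaf pte for a huge page */
		if (pd & HUGEPD_SHIFT_MASK)
			return 4;	/* hugepd pointer */
		return 2;		/* pointer to next table */
	}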

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page.h|2 +
 arch/powerpc/include/asm/pgtable.h |2 +
 arch/powerpc/mm/gup.c  |   18 +++-
 arch/powerpc/mm/hugetlbpage.c  |  176 ++--
 4 files changed, 168 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index b309cf4..6faa416 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -352,8 +352,10 @@ static inline int hugepd_ok(hugepd_t hpd)
 }
 
 #define is_hugepd(pdep)   (hugepd_ok(*((hugepd_t *)(pdep
+int pgd_huge(pgd_t pgd);
 #else /* CONFIG_HUGETLB_PAGE */
 #define is_hugepd(pdep)0
+#define pgd_huge(pgd)  0
 #endif /* CONFIG_HUGETLB_PAGE */
 
 struct page;
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 4b52726..7aeb955 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -218,6 +218,8 @@ extern void update_mmu_cache(struct vm_area_struct *, 
unsigned long, pte_t *);
 extern int gup_hugepd(hugepd_t *hugepd, unsigned pdshift, unsigned long addr,
  unsigned long end, int write, struct page **pages, int 
*nr);
 
+extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+  unsigned long end, int write, struct page **pages, int 
*nr);
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d7efdbf..4b921af 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -68,7 +68,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
next = pmd_addr_end(addr, end);
if (pmd_none(pmd))
return 0;
-   if (is_hugepd(pmdp)) {
+   if (pmd_huge(pmd)) {
+   if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
+write, pages, nr))
+   return 0;
+   } else if (is_hugepd(pmdp)) {
if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
addr, next, write, pages, nr))
return 0;
@@ -92,7 +96,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (is_hugepd(pudp)) {
+   if (pud_huge(pud)) {
+   if (!gup_hugepte((pte_t *)pudp, PUD_SIZE, addr, next,
+write, pages, nr))
+   return 0;
+   } else if (is_hugepd(pudp)) {
if (!gup_hugepd((hugepd_t *)pudp, PUD_SHIFT,
addr, next, write, pages, nr))
return 0;
@@ -153,7 +161,11 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
goto slow;
-   if (is_hugepd(pgdp)) {
+   if (pgd_huge(pgd)) {
+   if (!gup_hugepte((pte_t *)pgdp, PGDIR_SIZE, addr, next,
+write, pages, &nr))
+   goto slow;
+   } else if (is_hugepd(pgdp)) {
if (!gup_hugepd((hugepd_t *)pgdp, PGDIR_SHIFT,
addr, next, write, pages, &nr))
goto slow;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9108ce7..2da8fe6 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -50,11 +50,69 @@ static unsigned nr_gpages;
 
 #define hugepd_none(hpd)   ((hpd).pd == 0)
 
+#ifdef CONFIG_PPC_BOOK3S

[PATCH -V6 07/27] powerpc: Reduce the PTE_INDEX_SIZE

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

This makes one PMD cover a 16MB range, which helps in easier implementation of
THP on power. The THP core code makes use of one pmd entry to track the
hugepage, and the range mapped by a single pmd entry should be equal to the
hugepage size supported by the hardware.

This also switches the PGD to cover 16GB. That is needed so that we can
simplify the hugetlb page walking code and have the same pte format for
explicit hugepages and THP hugepages.
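
Spelled out for a 64K base page size (PAGE_SHIFT == 16, PUD folded):

	PMD_SHIFT   = PAGE_SHIFT + PTE_INDEX_SIZE = 16 + 8  = 24  -> 16MB per pmd entry
	PGDIR_SHIFT = PMD_SHIFT  + PMD_INDEX_SIZE = 24 + 10 = 34  -> 16GB per pgd entry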

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64-64k.h |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h 
b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index be4e287..45142d6 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -4,10 +4,10 @@
 #include 
 
 
-#define PTE_INDEX_SIZE  12
-#define PMD_INDEX_SIZE  12
+#define PTE_INDEX_SIZE  8
+#define PMD_INDEX_SIZE  10
 #define PUD_INDEX_SIZE 0
-#define PGD_INDEX_SIZE  6
+#define PGD_INDEX_SIZE  12
 
 #ifndef __ASSEMBLY__
 #define PTE_TABLE_SIZE (sizeof(real_pte_t) << PTE_INDEX_SIZE)
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 08/27] powerpc: Move the pte free routines from common header

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

This patch moves the common code to the 32/64-bit headers and also duplicates
the 4K_PAGES and 64K_PAGES sections. We will later change the 64-bit 64K_PAGES
version to support smaller PTE fragments. The patch doesn't introduce any
functional changes.

Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgalloc-32.h |   45 ++
 arch/powerpc/include/asm/pgalloc-64.h |  157 ++---
 arch/powerpc/include/asm/pgalloc.h|   46 +-
 3 files changed, 189 insertions(+), 59 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc-32.h 
b/arch/powerpc/include/asm/pgalloc-32.h
index 580cf73..27b2386 100644
--- a/arch/powerpc/include/asm/pgalloc-32.h
+++ b/arch/powerpc/include/asm/pgalloc-32.h
@@ -37,6 +37,17 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+   free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+   pgtable_page_dtor(ptepage);
+   __free_page(ptepage);
+}
+
 static inline void pgtable_free(void *table, unsigned index_size)
 {
BUG_ON(index_size); /* 32-bit doesn't use this */
@@ -45,4 +56,38 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 
 #define check_pgt_cache()  do { } while (0)
 
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+   void *table, int shift)
+{
+   unsigned long pgf = (unsigned long)table;
+   BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+   pgf |= shift;
+   tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+   void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+   unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+   pgtable_free(table, shift);
+}
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+   void *table, int shift)
+{
+   pgtable_free(table, shift);
+}
+#endif
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+ unsigned long address)
+{
+   struct page *page = page_address(table);
+
+   tlb_flush_pgtable(tlb, address);
+   pgtable_page_dtor(page);
+   pgtable_free_tlb(tlb, page, 0);
+}
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h 
b/arch/powerpc/include/asm/pgalloc-64.h
index 69e352a..d390123 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -75,8 +75,83 @@ static inline void pud_populate(struct mm_struct *mm, pud_t 
*pud, pmd_t *pmd)
 #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, (unsigned long)(pte))
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
+ unsigned long address)
+{
+   return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+}
+
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
+ unsigned long address)
+{
+   struct page *page;
+   pte_t *pte;
+
+   pte = pte_alloc_one_kernel(mm, address);
+   if (!pte)
+   return NULL;
+   page = virt_to_page(pte);
+   pgtable_page_ctor(page);
+   return page;
+}
+
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+   free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+   pgtable_page_dtor(ptepage);
+   __free_page(ptepage);
+}
+
+static inline void pgtable_free(void *table, unsigned index_size)
+{
+   if (!index_size)
+   free_page((unsigned long)table);
+   else {
+   BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+   kmem_cache_free(PGT_CACHE(index_size), table);
+   }
+}
+
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+   void *table, int shift)
+{
+   unsigned long pgf = (unsigned long)table;
+   BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+   pgf |= shift;
+   tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+   void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+   unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+   pgtable_free(table, shift);
+}
+#else /* !CONFIG_SMP */
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+   void *table, int shift)
+{
+   pgtable_free(table, shift);
+}
+#endif /* CONFIG_SMP */
+
+static inline v

[PATCH -V6 10/27] powerpc: Use encode avpn where we need only avpn values

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

In all these cases we are doing something similar to

HPTE_V_COMPARE(hpte_v, want_v) which ignores the HPTE_V_LARGE bit

With MPSS support we would need the actual page size to set the HPTE_V_LARGE
bit, and that won't be available in most of these cases. Since we are ignoring
the HPTE_V_LARGE bit, use the avpn value instead. There should not be any
change in behaviour after this patch.

Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_native_64.c|8 
 arch/powerpc/platforms/cell/beat_htab.c |   10 +-
 arch/powerpc/platforms/ps3/htab.c   |2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index ffc1e00..9d8983a 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -252,7 +252,7 @@ static long native_hpte_updatepp(unsigned long slot, 
unsigned long newpp,
unsigned long hpte_v, want_v;
int ret = 0;
 
-   want_v = hpte_encode_v(vpn, psize, ssize);
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
 
DBG_LOW("update(vpn=%016lx, avpnv=%016lx, group=%lx, newpp=%lx)",
vpn, want_v & HPTE_V_AVPN, slot, newpp);
@@ -288,7 +288,7 @@ static long native_hpte_find(unsigned long vpn, int psize, 
int ssize)
unsigned long want_v, hpte_v;
 
hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
-   want_v = hpte_encode_v(vpn, psize, ssize);
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
 
/* Bolted mappings are only ever in the primary group */
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
@@ -348,7 +348,7 @@ static void native_hpte_invalidate(unsigned long slot, 
unsigned long vpn,
 
DBG_LOW("invalidate(vpn=%016lx, hash: %lx)\n", vpn, slot);
 
-   want_v = hpte_encode_v(vpn, psize, ssize);
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
native_lock_hpte(hptep);
hpte_v = hptep->v;
 
@@ -520,7 +520,7 @@ static void native_flush_hash_range(unsigned long number, 
int local)
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
slot += hidx & _PTEIDX_GROUP_IX;
hptep = htab_address + slot;
-   want_v = hpte_encode_v(vpn, psize, ssize);
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
native_lock_hpte(hptep);
hpte_v = hptep->v;
if (!HPTE_V_COMPARE(hpte_v, want_v) ||
diff --git a/arch/powerpc/platforms/cell/beat_htab.c 
b/arch/powerpc/platforms/cell/beat_htab.c
index 0f6f839..472f9a7 100644
--- a/arch/powerpc/platforms/cell/beat_htab.c
+++ b/arch/powerpc/platforms/cell/beat_htab.c
@@ -191,7 +191,7 @@ static long beat_lpar_hpte_updatepp(unsigned long slot,
u64 dummy0, dummy1;
unsigned long want_v;
 
-   want_v = hpte_encode_v(vpn, psize, MMU_SEGSIZE_256M);
+   want_v = hpte_encode_avpn(vpn, psize, MMU_SEGSIZE_256M);
 
DBG_LOW("update: "
"avpnv=%016lx, slot=%016lx, psize: %d, newpp %016lx ... ",
@@ -228,7 +228,7 @@ static long beat_lpar_hpte_find(unsigned long vpn, int 
psize)
unsigned long want_v, hpte_v;
 
hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, MMU_SEGSIZE_256M);
-   want_v = hpte_encode_v(vpn, psize, MMU_SEGSIZE_256M);
+   want_v = hpte_encode_avpn(vpn, psize, MMU_SEGSIZE_256M);
 
for (j = 0; j < 2; j++) {
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
@@ -283,7 +283,7 @@ static void beat_lpar_hpte_invalidate(unsigned long slot, 
unsigned long vpn,
 
DBG_LOW("inval : slot=%lx, va=%016lx, psize: %d, local: %d\n",
slot, va, psize, local);
-   want_v = hpte_encode_v(vpn, psize, MMU_SEGSIZE_256M);
+   want_v = hpte_encode_avpn(vpn, psize, MMU_SEGSIZE_256M);
 
raw_spin_lock_irqsave(&beat_htab_lock, flags);
dummy1 = beat_lpar_hpte_getword0(slot);
@@ -372,7 +372,7 @@ static long beat_lpar_hpte_updatepp_v3(unsigned long slot,
unsigned long want_v;
unsigned long pss;
 
-   want_v = hpte_encode_v(vpn, psize, MMU_SEGSIZE_256M);
+   want_v = hpte_encode_avpn(vpn, psize, MMU_SEGSIZE_256M);
pss = (psize == MMU_PAGE_4K) ? -1UL : mmu_psize_defs[psize].penc;
 
DBG_LOW("update: "
@@ -402,7 +402,7 @@ static void beat_lpar_hpte_invalidate_v3(unsigned long 
slot, unsigned long vpn,
 
DBG_LOW("inval : slot=%lx, vpn=%016lx, psize: %d, local: %d\n",
slot, vpn, psize, local);
-   want_v = hpte_encode_v(vpn, psize, MMU_SEGSIZE_256M);
+   want_v = hpte_encode_avpn(vpn, psize, MMU_SEGSIZE_256M);
pss = (psize == MMU_PAGE_4K) ? -1UL : mmu_psize_defs[psize].penc;
 
lpar_rc = beat_invalidate_htab_entry3(0, slot, want_v, pss);
diff --git a/arch/powerpc/platforms/ps3/htab.c 
b/arch/powerpc/

[PATCH -V6 11/27] powerpc: Decode the pte-lp-encoding bits correctly.

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We look at both the segment base page size and the actual page size, and store
the pte-lp-encodings in an array per base page size.

We also update all relevant functions to take an actual page size argument so
that we can use the correct PTE LP encoding in the HPTE. This should also get
us basic Multiple Page Size per Segment (MPSS) support, which is needed to
enable THP on ppc64.
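
As an illustration of the new layout (the 16M-in-64K combination is the MPSS
case this series cares about; the penc values themselves come from the device
tree):

	/* LP encoding for an actual 16M page in a 64K base page segment */
	int penc = mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M];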

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/machdep.h  |3 +-
 arch/powerpc/include/asm/mmu-hash64.h   |   33 
 arch/powerpc/kvm/book3s_hv.c|8 +-
 arch/powerpc/mm/hash_low_64.S   |   18 +++--
 arch/powerpc/mm/hash_native_64.c|  135 ++-
 arch/powerpc/mm/hash_utils_64.c |  121 +--
 arch/powerpc/mm/hugetlbpage-hash64.c|4 +-
 arch/powerpc/platforms/cell/beat_htab.c |   16 ++--
 arch/powerpc/platforms/ps3/htab.c   |6 +-
 arch/powerpc/platforms/pseries/lpar.c   |6 +-
 10 files changed, 233 insertions(+), 117 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 3d6b410..3f3f691 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -50,7 +50,8 @@ struct machdep_calls {
   unsigned long prpn,
   unsigned long rflags,
   unsigned long vflags,
-  int psize, int ssize);
+  int psize, int apsize,
+  int ssize);
long(*hpte_remove)(unsigned long hpte_group);
void(*hpte_removebolted)(unsigned long ea,
 int psize, int ssize);
diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index de9e577..18171a8 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -155,7 +155,7 @@ extern unsigned long htab_hash_mask;
 struct mmu_psize_def
 {
unsigned intshift;  /* number of bits */
-   unsigned intpenc;   /* HPTE encoding */
+   int penc[MMU_PAGE_COUNT];   /* HPTE encoding */
unsigned inttlbiel; /* tlbiel supported for that page size */
unsigned long   avpnm;  /* bits to mask out in AVPN in the HPTE */
unsigned long   sllp;   /* SLB L||LP (exact mask to use in slbmte) */
@@ -200,6 +200,13 @@ static inline unsigned int mmu_psize_to_shift(unsigned int 
mmu_psize)
  */
 #define VPN_SHIFT  12
 
+/*
+ * HPTE Large Page (LP) details
+ */
+#define LP_SHIFT   12
+#define LP_BITS8
+#define LP_MASK(i) ((0xFF >> (i)) << LP_SHIFT)
+
 #ifndef __ASSEMBLY__
 
 static inline int segment_shift(int ssize)
@@ -255,14 +262,14 @@ static inline unsigned long hpte_encode_avpn(unsigned 
long vpn, int psize,
 
 /*
  * This function sets the AVPN and L fields of the HPTE  appropriately
- * for the page size
+ * using the base page size and actual page size.
  */
-static inline unsigned long hpte_encode_v(unsigned long vpn,
- int psize, int ssize)
+static inline unsigned long hpte_encode_v(unsigned long vpn, int base_psize,
+ int actual_psize, int ssize)
 {
unsigned long v;
-   v = hpte_encode_avpn(vpn, psize, ssize);
-   if (psize != MMU_PAGE_4K)
+   v = hpte_encode_avpn(vpn, base_psize, ssize);
+   if (actual_psize != MMU_PAGE_4K)
v |= HPTE_V_LARGE;
return v;
 }
@@ -272,19 +279,17 @@ static inline unsigned long hpte_encode_v(unsigned long 
vpn,
  * for the page size. We assume the pa is already "clean" that is properly
  * aligned for the requested page size
  */
-static inline unsigned long hpte_encode_r(unsigned long pa, int psize)
+static inline unsigned long hpte_encode_r(unsigned long pa, int base_psize,
+ int actual_psize)
 {
-   unsigned long r;
-
/* A 4K page needs no special encoding */
-   if (psize == MMU_PAGE_4K)
+   if (actual_psize == MMU_PAGE_4K)
return pa & HPTE_R_RPN;
else {
-   unsigned int penc = mmu_psize_defs[psize].penc;
-   unsigned int shift = mmu_psize_defs[psize].shift;
-   return (pa & ~((1ul << shift) - 1)) | (penc << 12);
+   unsigned int penc = 
mmu_psize_defs[base_psize].penc[actual_psize];
+   unsigned int shift = mmu_psize_defs[actual_psize].shift;
+   return (pa & ~((1ul << shift) - 1)) | (penc << LP_SHIFT);
}
-   return r;
 }
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80dcc53..c794a4c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1515,7 +1515,13 @@ static void kvmppc

[PATCH -V6 12/27] powerpc: Fix hpte_decode to use the correct decoding for page sizes

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

As per the ISA doc, we encode the base and actual page size in the LP bits of
the PTE. The number of bits used to encode the page size depends on the actual
page size. The ISA doc lists this as:

   PTE LP      actual page size
rrrr rrrz      >=8KB
rrrr rrzz      >=16KB
rrrr rzzz      >=32KB
rrrr zzzz      >=64KB
rrrz zzzz      >=128KB
rrzz zzzz      >=256KB
rzzz zzzz      >=512KB
zzzz zzzz      >=1MB

ISA doc also says
"The values of the “z” bits used to specify each size, along with all possible
values of “r” bits in the LP field, must result in LP values distinct from
other LP values for other sizes."

Based on the above, update hpte_decode to use the correct decoding for the LP bits.
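
The decode rule in short (this mirrors the __hpte_actual_psize() helper added
below):

	shift = mmu_psize_defs[actual].shift - LP_SHIFT;  /* number of z bits */
	if (shift > LP_BITS)
		shift = LP_BITS;
	mask = (1 << shift) - 1;
	/* the z bits must equal the penc for this (base, actual) pair */
	if ((lp & mask) == mmu_psize_defs[base].penc[actual])
		return actual;	/* match: this is the real page size */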

Reviewed-by: David Gibson 
Acked-by: Paul Mackerras 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_native_64.c |   53 --
 1 file changed, 22 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 14e3fe8..bb920ee 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -245,19 +245,10 @@ static long native_hpte_remove(unsigned long hpte_group)
return i;
 }
 
-static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
+static inline int __hpte_actual_psize(unsigned int lp, int psize)
 {
int i, shift;
unsigned int mask;
-   /* Look at the 8 bit LP value */
-   unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
-
-   if (!(hptep->v & HPTE_V_VALID))
-   return -1;
-
-   /* First check if it is large page */
-   if (!(hptep->v & HPTE_V_LARGE))
-   return MMU_PAGE_4K;
 
/* start from 1 ignoring MMU_PAGE_4K */
for (i = 1; i < MMU_PAGE_COUNT; i++) {
@@ -284,6 +275,21 @@ static inline int hpte_actual_psize(struct hash_pte 
*hptep, int psize)
return -1;
 }
 
+static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
+{
+   /* Look at the 8 bit LP value */
+   unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+   if (!(hptep->v & HPTE_V_VALID))
+   return -1;
+
+   /* First check if it is large page */
+   if (!(hptep->v & HPTE_V_LARGE))
+   return MMU_PAGE_4K;
+
+   return __hpte_actual_psize(lp, psize);
+}
+
 static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 unsigned long vpn, int psize, int ssize,
 int local)
@@ -425,42 +431,27 @@ static void hpte_decode(struct hash_pte *hpte, unsigned 
long slot,
int *psize, int *apsize, int *ssize, unsigned long *vpn)
 {
unsigned long avpn, pteg, vpi;
-   unsigned long hpte_r = hpte->r;
unsigned long hpte_v = hpte->v;
unsigned long vsid, seg_off;
-   int i, size, a_size, shift, penc;
+   int size, a_size, shift;
+   /* Look at the 8 bit LP value */
+   unsigned int lp = (hpte->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
 
if (!(hpte_v & HPTE_V_LARGE)) {
size   = MMU_PAGE_4K;
a_size = MMU_PAGE_4K;
} else {
-   for (i = 0; i < LP_BITS; i++) {
-   if ((hpte_r & LP_MASK(i+1)) == LP_MASK(i+1))
-   break;
-   }
-   penc = LP_MASK(i+1) >> LP_SHIFT;
for (size = 0; size < MMU_PAGE_COUNT; size++) {
 
/* valid entries have a shift value */
if (!mmu_psize_defs[size].shift)
continue;
-   for (a_size = 0; a_size < MMU_PAGE_COUNT; a_size++) {
-
-   /* 4K pages are not represented by LP */
-   if (a_size == MMU_PAGE_4K)
-   continue;
 
-   /* valid entries have a shift value */
-   if (!mmu_psize_defs[a_size].shift)
-   continue;
-
-   if (penc == mmu_psize_defs[size].penc[a_size])
-   goto out;
-   }
+   a_size = __hpte_actual_psize(lp, size);
+   if (a_size != -1)
+   break;
}
}
-
-out:
/* This works for all page sizes, and for 256M and 1T segments */
*ssize = hpte_v >> HPTE_V_SSIZE_SHIFT;
shift = mmu_psize_defs[size].shift;
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH -V6 14/27] powerpc: Print page size info during boot

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

This gives a hint about the different base and actual page size combinations
supported by the platform.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_utils_64.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 33cdc3a..d0eb6d4 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -316,7 +316,7 @@ static int __init htab_dt_scan_page_sizes(unsigned long 
node,
prop = (u32 *)of_get_flat_dt_prop(node,
  "ibm,segment-page-sizes", &size);
if (prop != NULL) {
-   DBG("Page sizes from device-tree:\n");
+   pr_info("Page sizes from device-tree:\n");
size /= 4;
cur_cpu_spec->mmu_features &= ~(MMU_FTR_16M_PAGE);
while(size > 0) {
@@ -370,10 +370,10 @@ static int __init htab_dt_scan_page_sizes(unsigned long 
node,
   "shift=%d\n", base_shift, shift);
 
def->penc[idx] = penc;
-   DBG(" %d: shift=%02x, sllp=%04lx, "
-   "avpnm=%08lx, tlbiel=%d, penc=%d\n",
-   idx, shift, def->sllp, def->avpnm,
-   def->tlbiel, def->penc[idx]);
+   pr_info("base_shift=%d: shift=%d, sllp=0x%04lx,"
+   " avpnm=0x%08lx, tlbiel=%d, penc=%d\n",
+   base_shift, shift, def->sllp,
+   def->avpnm, def->tlbiel, 
def->penc[idx]);
}
}
return 1;
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 15/27] powerpc: Update tlbie/tlbiel as per ISA doc

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Encode the actual page size correctly in tlbie/tlbiel. This makes sure we
handle multiple page sizes per segment correctly.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_native_64.c |   32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index bb920ee..6a2aead 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -61,7 +61,10 @@ static inline void __tlbie(unsigned long vpn, int psize, int 
apsize, int ssize)
 
switch (psize) {
case MMU_PAGE_4K:
+   /* clear out bits after (52) [052.63] */
+   va &= ~((1ul << (64 - 52)) - 1);
va |= ssize << 8;
+   va |= mmu_psize_defs[apsize].sllp << 6;
asm volatile(ASM_FTR_IFCLR("tlbie %0,0", PPC_TLBIE(%1,%0), %2)
 : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
 : "memory");
@@ -69,9 +72,20 @@ static inline void __tlbie(unsigned long vpn, int psize, int 
apsize, int ssize)
default:
/* We need 14 to 14 + i bits of va */
penc = mmu_psize_defs[psize].penc[apsize];
-   va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
+   va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
va |= penc << 12;
va |= ssize << 8;
+   /* Add AVAL part */
+   if (psize != apsize) {
+   /*
+* MPSS, 64K base page size and 16MB parge page size
+* We don't need all the bits, but rest of the bits
+* must be ignored by the processor.
+* vpn cover upto 65 bits of va. (0...65) and we need
+* 58..64 bits of va.
+*/
+   va |= (vpn & 0xfe);
+   }
va |= 1; /* L */
asm volatile(ASM_FTR_IFCLR("tlbie %0,1", PPC_TLBIE(%1,%0), %2)
 : : "r" (va), "r"(0), "i" (CPU_FTR_ARCH_206)
@@ -96,16 +110,30 @@ static inline void __tlbiel(unsigned long vpn, int psize, 
int apsize, int ssize)
 
switch (psize) {
case MMU_PAGE_4K:
+   /* clear out bits after(52) [052.63] */
+   va &= ~((1ul << (64 - 52)) - 1);
va |= ssize << 8;
+   va |= mmu_psize_defs[apsize].sllp << 6;
asm volatile(".long 0x7c000224 | (%0 << 11) | (0 << 21)"
 : : "r"(va) : "memory");
break;
default:
/* We need 14 to 14 + i bits of va */
penc = mmu_psize_defs[psize].penc[apsize];
-   va &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
+   va &= ~((1ul << mmu_psize_defs[apsize].shift) - 1);
va |= penc << 12;
va |= ssize << 8;
+   /* Add AVAL part */
+   if (psize != apsize) {
+   /*
+* MPSS, 64K base page size and 16MB parge page size
+* We don't need all the bits, but rest of the bits
+* must be ignored by the processor.
+* vpn cover upto 65 bits of va. (0...65) and we need
+* 58..64 bits of va.
+*/
+   va |= (vpn & 0xfe);
+   }
va |= 1; /* L */
asm volatile(".long 0x7c000224 | (%0 << 11) | (1 << 21)"
 : : "r"(va) : "memory");
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 09/27] powerpc: Reduce PTE table memory wastage

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We allocate one page for the last level of the linux page table. With THP and
a large page size of 16MB, that would mean we are wasting a large part of that
page. To map a 16MB area, we only need a PTE space of 2K with a 64K page size.
This patch reduces the space wastage by sharing the page allocated for the
last level of the linux page table among multiple pmd entries. We call these
smaller chunks PTE page fragments and the allocated page a PTE page.

In order to support systems which don't have 64K HPTE support, we also add
another 2K to the PTE page fragment. The second half of the PTE fragment is
used for storing the slot and secondary bit information of an HPTE. With this
we now have a 4K PTE fragment.

We use a simple approach to share the PTE page. On allocation, we bump the
PTE page refcount to 16 and serve the next 16 pte alloc requests from it.
This should help the node locality of the PTE page fragments, assuming that
the immediate pte alloc requests will mostly come from the same NUMA node.
We don't try to reuse the freed PTE page fragments, hence we could be wasting
some space.
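
The arithmetic behind the constants (for a 64K kernel page, after the earlier
PTE_INDEX_SIZE change):

	PTE table          : PTRS_PER_PTE * sizeof(pte_t) = 256 * 8 = 2K
	+ 2K hash-slot info                                          = 4K fragment
	fragments per page : 64K / 4K                                = 16 (PTE_FRAG_NR)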

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/mmu-book3e.h |4 ++
 arch/powerpc/include/asm/mmu-hash64.h |4 ++
 arch/powerpc/include/asm/page.h   |4 ++
 arch/powerpc/include/asm/pgalloc-64.h |   82 +++
 arch/powerpc/kernel/setup_64.c|4 +-
 arch/powerpc/mm/mmu_context_hash64.c  |   37 +++
 arch/powerpc/mm/pgtable_64.c  |  118 +
 7 files changed, 195 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-book3e.h 
b/arch/powerpc/include/asm/mmu-book3e.h
index 99d43e0..8bd560c 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -231,6 +231,10 @@ typedef struct {
u64 high_slices_psize;  /* 4 bits per slice for now */
u16 user_psize; /* page size index */
 #endif
+#ifdef CONFIG_PPC_64K_PAGES
+   /* for 4K PTE fragment support */
+   void *pte_frag;
+#endif
 } mm_context_t;
 
 /* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index 05895cf..de9e577 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -516,6 +516,10 @@ typedef struct {
unsigned long acop; /* mask of enabled coprocessor types */
unsigned int cop_pid;   /* pid value used with coprocessors */
 #endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+   /* for 4K PTE fragment support */
+   void *pte_frag;
+#endif
 } mm_context_t;
 
 
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 6faa416..c204ad1 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -372,7 +372,11 @@ void arch_free_page(struct page *page, int order);
 
 struct vm_area_struct;
 
+#ifdef CONFIG_PPC_64K_PAGES
+typedef pte_t *pgtable_t;
+#else
 typedef struct page *pgtable_t;
+#endif
 
 #include 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h 
b/arch/powerpc/include/asm/pgalloc-64.h
index d390123..91acb12 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -152,6 +152,23 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, 
pgtable_t table,
 }
 
 #else /* if CONFIG_PPC_64K_PAGES */
+/*
+ * we support 16 fragments per PTE page.
+ */
+#define PTE_FRAG_NR16
+/*
+ * We use a 2K PTE page fragment and another 2K for storing
+ * real_pte_t hash index
+ */
+#define PTE_FRAG_SIZE_SHIFT  12
+#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
+
+extern pte_t *page_table_alloc(struct mm_struct *, unsigned long, int);
+extern void page_table_free(struct mm_struct *, unsigned long *, int);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
+#ifdef CONFIG_SMP
+extern void __tlb_remove_table(void *_table);
+#endif
 
 #define pud_populate(mm, pud, pmd) pud_set(pud, (unsigned long)pmd)
 
@@ -164,90 +181,42 @@ static inline void pmd_populate_kernel(struct mm_struct 
*mm, pmd_t *pmd,
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
pgtable_t pte_page)
 {
-   pmd_populate_kernel(mm, pmd, page_address(pte_page));
+   pmd_set(pmd, (unsigned long)pte_page);
 }
 
 static inline pgtable_t pmd_pgtable(pmd_t pmd)
 {
-   return pmd_page(pmd);
+   return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+   return (pte_t *)page_table_alloc(mm, address, 1);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
- un

[PATCH -V6 13/27] powerpc: print both base and actual page size on hash failure

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/mmu-hash64.h |3 ++-
 arch/powerpc/mm/hash_utils_64.c   |   12 +++-
 arch/powerpc/mm/hugetlbpage-hash64.c  |2 +-
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index 18171a8..2accc96 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -342,7 +342,8 @@ int __hash_page_huge(unsigned long ea, unsigned long 
access, unsigned long vsid,
 unsigned int shift, unsigned int mmu_psize);
 extern void hash_failure_debug(unsigned long ea, unsigned long access,
   unsigned long vsid, unsigned long trap,
-  int ssize, int psize, unsigned long pte);
+  int ssize, int psize, int lpsize,
+  unsigned long pte);
 extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
 unsigned long pstart, unsigned long prot,
 int psize, int ssize);
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d98626a..33cdc3a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -936,14 +936,14 @@ static inline int subpage_protection(struct mm_struct 
*mm, unsigned long ea)
 
 void hash_failure_debug(unsigned long ea, unsigned long access,
unsigned long vsid, unsigned long trap,
-   int ssize, int psize, unsigned long pte)
+   int ssize, int psize, int lpsize, unsigned long pte)
 {
if (!printk_ratelimit())
return;
pr_info("mm: Hashing failure ! EA=0x%lx access=0x%lx current=%s\n",
ea, access, current->comm);
-   pr_info("trap=0x%lx vsid=0x%lx ssize=%d psize=%d pte=0x%lx\n",
-   trap, vsid, ssize, psize, pte);
+   pr_info("trap=0x%lx vsid=0x%lx ssize=%d base psize=%d psize %d 
pte=0x%lx\n",
+   trap, vsid, ssize, psize, lpsize, pte);
 }
 
 /* Result code is:
@@ -1116,7 +1116,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
 */
if (rc == -1)
hash_failure_debug(ea, access, vsid, trap, ssize, psize,
-  pte_val(*ptep));
+  psize, pte_val(*ptep));
 #ifndef CONFIG_PPC_64K_PAGES
DBG_LOW(" o-pte: %016lx\n", pte_val(*ptep));
 #else
@@ -1194,7 +1194,9 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 */
if (rc == -1)
hash_failure_debug(ea, access, vsid, trap, ssize,
-  mm->context.user_psize, pte_val(*ptep));
+  mm->context.user_psize,
+  mm->context.user_psize,
+  pte_val(*ptep));
 
local_irq_restore(flags);
 }
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
b/arch/powerpc/mm/hugetlbpage-hash64.c
index e0d52ee..06ecb55 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -129,7 +129,7 @@ repeat:
if (unlikely(slot == -2)) {
*ptep = __pte(old_pte);
hash_failure_debug(ea, access, vsid, trap, ssize,
-  mmu_psize, old_pte);
+  mmu_psize, mmu_psize, old_pte);
return -1;
}
 
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 17/27] mm/THP: Add pmd args to pgtable deposit and withdraw APIs

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

This will later be used by the powerpc THP support. On powerpc we want to use
the pgtable for storing the hash index values, so instead of adding them to the
mm_context list, we would like to store them in the second half of the pmd.

Cc: Andrea Arcangeli 

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/s390/include/asm/pgtable.h |5 +++--
 arch/s390/mm/pgtable.c  |5 +++--
 arch/sparc/include/asm/pgtable_64.h |5 +++--
 arch/sparc/mm/tlb.c |5 +++--
 include/asm-generic/pgtable.h   |5 +++--
 mm/huge_memory.c|   18 +-
 mm/pgtable-generic.c|5 +++--
 7 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3cb47cf..83da660 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1283,10 +1283,11 @@ static inline void __pmd_idte(unsigned long address, 
pmd_t *pmdp)
 #define SEGMENT_RW __pgprot(_HPAGE_TYPE_RW)
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t 
pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+  pgtable_t pgtable);
 
 #define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t 
*pmdp);
 
 static inline int pmd_trans_splitting(pmd_t pmd)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ae44d2a..9ab3224 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -920,7 +920,8 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, 
unsigned long address,
}
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+   pgtable_t pgtable)
 {
struct list_head *lh = (struct list_head *) pgtable;
 
@@ -934,7 +935,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, 
pgtable_t pgtable)
mm->pmd_huge_pte = pgtable;
 }
 
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
struct list_head *lh;
pgtable_t pgtable;
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 08fcce9..4c86de2 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -853,10 +853,11 @@ extern void update_mmu_cache_pmd(struct vm_area_struct 
*vma, unsigned long addr,
 pmd_t *pmd);
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t 
pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+  pgtable_t pgtable);
 
 #define __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t 
*pmdp);
 #endif
 
 /* Encode and de-code a swap entry */
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index ba6ae7f..0a8ac2a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -157,7 +157,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
}
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+   pgtable_t pgtable)
 {
struct list_head *lh = (struct list_head *) pgtable;
 
@@ -171,7 +172,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, 
pgtable_t pgtable)
mm->pmd_huge_pte = pgtable;
 }
 
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 {
struct list_head *lh;
pgtable_t pgtable;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bfd8768..7250645 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -163,11 +163,12 @@ extern void pmdp_splitting_flush(struct vm_area_struct 
*vma,
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t 
pgtable);
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+  pgtable_t pgtable);
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm);
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t 
*pmdp);
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78bd84f..84f3180 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -735

[PATCH -V6 20/27] powerpc/THP: Implement transparent hugepages for ppc64

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We now have pmd entries covering a 16MB range and the PMD table doubled to
twice its original size. We use the second half of the PMD table to deposit
the pgtable (PTE page). The deposited PTE page is further used to track the
HPTE information, which includes [ secondary group | 3 bit hidx | valid ].
We use one byte per HPTE entry. With a 16MB hugepage and 64K HPTEs we need
256 entries, and with 4K HPTEs we need 4096 entries. Both will fit in a 4K
PTE page. On hugepage invalidate we need to walk the PTE page and invalidate
all valid HPTEs.
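
As a rough illustration of that layout (the helper names here are assumptions,
not part of the patch), each tracking byte can be decoded along these lines,
with bit 0 as the valid bit and the remaining low bits carrying the secondary
flag plus the 3-bit group index:

/* Sketch of the per-HPTE tracking byte: 000 | secondary | hidx (3 bits) | valid */
static inline int hpte_slot_valid(unsigned char *hpte_slot_array, int index)
{
        return hpte_slot_array[index] & 0x1;
}

static inline unsigned int hpte_slot_hidx(unsigned char *hpte_slot_array, int index)
{
        /* secondary bit plus group index, as consumed by the invalidate path */
        return hpte_slot_array[index] >> 1;
}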

This patch implements the necessary arch-specific functions for THP support and
also the hugepage invalidate logic. These PMD related functions are intentionally
kept similar to their PTE counterparts.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page.h  |   11 +-
 arch/powerpc/include/asm/pgtable-ppc64-64k.h |3 +-
 arch/powerpc/include/asm/pgtable-ppc64.h |  259 -
 arch/powerpc/include/asm/pgtable.h   |5 +
 arch/powerpc/include/asm/pte-hash64-64k.h|   17 ++
 arch/powerpc/mm/pgtable_64.c |  318 ++
 arch/powerpc/platforms/Kconfig.cputype   |1 +
 7 files changed, 611 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index c204ad1..4042b66 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -37,8 +37,17 @@
 #define PAGE_SIZE  (ASM_CONST(1) << PAGE_SHIFT)
 
 #ifndef __ASSEMBLY__
-#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * With hugetlbfs enabled we allow HPAGE_SHIFT to be runtime
+ * configurable. But we enable THP only with a 16MB hugepage.
+ * With only THP configured, we force the hugepage size to 16MB.
+ * This should ensure that all subarchs that don't support
+ * THP continue to work fine with HPAGE_SHIFT usage.
+ */
+#if defined(CONFIG_HUGETLB_PAGE)
 extern unsigned int HPAGE_SHIFT;
+#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define HPAGE_SHIFT PMD_SHIFT
 #else
 #define HPAGE_SHIFT PAGE_SHIFT
 #endif
diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h 
b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index 45142d6..a56b82f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -33,7 +33,8 @@
 #define PGDIR_MASK (~(PGDIR_SIZE-1))
 
 /* Bits to mask out from a PMD to get to the PTE page */
-#define PMD_MASKED_BITS0x1ff
+/* PMDs point to PTE table fragments which are 4K aligned.  */
+#define PMD_MASKED_BITS0xfff
 /* Bits to mask out from a PGD/PUD to get to the PMD page */
 #define PUD_MASKED_BITS0x1ff
 
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index ab84332..20133c1 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -154,7 +154,7 @@
 #definepmd_present(pmd)(pmd_val(pmd) != 0)
 #definepmd_clear(pmdp) (pmd_val(*(pmdp)) = 0)
 #define pmd_page_vaddr(pmd)(pmd_val(pmd) & ~PMD_MASKED_BITS)
-#define pmd_page(pmd)  virt_to_page(pmd_page_vaddr(pmd))
+extern struct page *pmd_page(pmd_t pmd);
 
 #define pud_set(pudp, pudval)  (pud_val(*(pudp)) = (pudval))
 #define pud_none(pud)  (!pud_val(pud))
@@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 
 #endif /* __ASSEMBLY__ */
 
+#ifndef _PAGE_SPLITTING
+/*
+ * THP pages can't be special. So use the _PAGE_SPECIAL
+ */
+#define _PAGE_SPLITTING _PAGE_SPECIAL
+#endif
+
+#ifndef _PAGE_THP_HUGE
+/*
+ * We need to differentiate between explicit huge page and THP huge
+ * page, since THP huge page also need to track real subpage details
+ * We use the _PAGE_COMBO bits here as dummy for platform that doesn't
+ * support THP.
+ */
+#define _PAGE_THP_HUGE  0x1000
+#endif
+
+/*
+ * PTE flags to conserve for HPTE identification for THP page.
+ */
+#ifndef _PAGE_THP_HPTEFLAGS
+#define _PAGE_THP_HPTEFLAGS(_PAGE_BUSY | _PAGE_HASHPTE)
+#endif
+
+#define HUGE_PAGE_SIZE (ASM_CONST(1) << 24)
+#define HUGE_PAGE_MASK (~(HUGE_PAGE_SIZE - 1))
+
+/*
+ * set of bits not changed in pmd_modify.
+ */
+#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_THP_HPTEFLAGS | \
+_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
+
+#ifndef __ASSEMBLY__
+extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+pmd_t *pmdp);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
+extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+  pmd_t *pmdp, pmd_t pmd);
+extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long 
addr,

[PATCH -V6 23/27] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64.h |   24 
 arch/powerpc/kernel/io-workarounds.c |   10 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |4 +++-
 arch/powerpc/mm/hash_utils_64.c  |8 +++-
 arch/powerpc/mm/hugetlbpage.c|8 ++--
 arch/powerpc/mm/tlb_hash64.c |7 ++-
 arch/powerpc/platforms/pseries/eeh.c |7 ++-
 7 files changed, 36 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index f0effab..97fc839 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -343,30 +343,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, 
pte_t entry)
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
-
-/*
- * find_linux_pte returns the address of a linux pte for a given
- * effective address and directory.  If not found, it returns zero.
- */
-static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
-{
-   pgd_t *pg;
-   pud_t *pu;
-   pmd_t *pm;
-   pte_t *pt = NULL;
-
-   pg = pgdir + pgd_index(ea);
-   if (!pgd_none(*pg)) {
-   pu = pud_offset(pg, ea);
-   if (!pud_none(*pu)) {
-   pm = pmd_offset(pu, ea);
-   if (pmd_present(*pm))
-   pt = pte_offset_kernel(pm, ea);
-   }
-   }
-   return pt;
-}
-
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 unsigned *shift);
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/kernel/io-workarounds.c 
b/arch/powerpc/kernel/io-workarounds.c
index 50e90b7..e5263ab 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -55,6 +55,7 @@ static struct iowa_bus *iowa_pci_find(unsigned long vaddr, 
unsigned long paddr)
 
 struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 {
+   unsigned shift;
struct iowa_bus *bus;
int token;
 
@@ -70,11 +71,16 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
return NULL;
 
-   ptep = find_linux_pte(init_mm.pgd, vaddr);
+   ptep = find_linux_pte_or_hugepte(init_mm.pgd, vaddr, &shift);
if (ptep == NULL)
paddr = 0;
-   else
+   else {
+   /*
+* we don't have hugepages backing iomem
+*/
+   BUG_ON(shift);
paddr = pte_pfn(*ptep) << PAGE_SHIFT;
+   }
bus = iowa_pci_find(vaddr, paddr);
 
if (bus == NULL)
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 19c93ba..aa6a351 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -24,13 +24,15 @@
 /* Translate address of a vmalloc'd thing to a linear map address */
 static void *real_vmalloc_addr(void *x)
 {
+   unsigned shift;
unsigned long addr = (unsigned long) x;
pte_t *p;
 
-   p = find_linux_pte(swapper_pg_dir, addr);
+   p = find_linux_pte_or_hugepte(swapper_pg_dir, addr, &shift);
if (!p || !pte_present(*p))
return NULL;
/* assume we don't have huge pages in vmalloc space... */
+   BUG_ON(shift);
addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
return __va(addr);
 }
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d0eb6d4..e942ae9 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1131,6 +1131,7 @@ EXPORT_SYMBOL_GPL(hash_page);
 void hash_preload(struct mm_struct *mm, unsigned long ea,
  unsigned long access, unsigned long trap)
 {
+   int shift;
unsigned long vsid;
pgd_t *pgdir;
pte_t *ptep;
@@ -1152,10 +1153,15 @@ void hash_preload(struct mm_struct *mm, unsigned long 
ea,
pgdir = mm->pgd;
if (pgdir == NULL)
return;
-   ptep = find_linux_pte(pgdir, ea);
+   /*
+* THP pages use update_mmu_cache_pmd. We don't do
+* hash preload there. Hence can ignore THP here
+*/
+   ptep = find_linux_pte_or_hugepte(pgdir, ea, &shift);
if (!ptep)
return;
 
+   BUG_ON(shift);
 #ifdef CONFIG_PPC_64K_PAGES
/* If either _PAGE_4K_PFN or _PAGE_NO_CACHE is set (and we are on
 * a 64K kernel), then we don't preload, hash_page() will take
diff --git a/arch

[PATCH -V6 16/27] mm/THP: HPAGE_SHIFT is not a #define on some arch

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT
and other derived values like HPAGE_PMD_ORDER are not constants. So move
that check to hugepage_init().
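
In effect the old compile-time guard becomes a runtime test inside
hugepage_init(); roughly, as a sketch rather than the literal hunk:

        /*
         * HPAGE_PMD_ORDER is no longer a compile-time constant on powerpc,
         * so the old "#if HPAGE_PMD_ORDER > MAX_ORDER / #error" check has
         * to be performed at runtime instead.
         */
        if (HPAGE_PMD_ORDER > MAX_ORDER) {
                transparent_hugepage_flags = 0;
                return -EINVAL;
        }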

Cc: Andrea Arcangeli 

Reviewed-by: David Gibson 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/huge_mm.h |3 ---
 mm/huge_memory.c|9 ++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..bdc5aef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,9 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct 
*vma,
} while (0)
 extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
pmd_t *pmd);
-#if HPAGE_PMD_ORDER > MAX_ORDER
-#error "hugepages can't be allocated by the buddy allocator"
-#endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice);
 extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..78bd84f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -45,7 +45,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
(1< MAX_ORDER)) {
transparent_hugepage_flags = 0;
return -EINVAL;
}
 
+   khugepaged_pages_to_scan = HPAGE_PMD_NR*8;
+   khugepaged_max_ptes_none = HPAGE_PMD_NR-1;
+
err = hugepage_init_sysfs(&hugepage_kobj);
if (err)
return err;
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

For architectures like ppc64 we look at the deposited pgtable when
calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw
after finishing the pmdp related operations.

Cc: Andrea Arcangeli 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/huge_memory.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 84f3180..2a43782 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,9 +1363,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
-   pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+   pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
if (is_huge_zero_pmd(orig_pmd)) {
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 25/27] powerpc/THP: Add code to handle HPTE faults for large pages

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

The deposited PTE page in the second half of the PMD table is used to
track the state of the hash PTEs. After updating the HPTE, we mark the
corresponding slot in the deposited PTE page valid.
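
Marking a slot valid amounts to something like the following sketch (the
helper name is an assumption here, not quoted from the patch), matching the
one-byte-per-HPTE layout described in patch 20/27:

static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
                                        unsigned int index, unsigned int hidx)
{
        /* bit 0 = valid, bits 1-4 = 3-bit group index plus secondary flag */
        hpte_slot_array[index] = (hidx << 1) | 0x1;
}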

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/mmu-hash64.h |   13 +++
 arch/powerpc/mm/Makefile  |1 +
 arch/powerpc/mm/hash_utils_64.c   |   13 ++-
 arch/powerpc/mm/hugepage-hash64.c |  180 +
 4 files changed, 203 insertions(+), 4 deletions(-)
 create mode 100644 arch/powerpc/mm/hugepage-hash64.c

diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index 2accc96..3d6fbb0 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -340,6 +340,19 @@ extern int hash_page(unsigned long ea, unsigned long 
access, unsigned long trap)
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long 
vsid,
 pte_t *ptep, unsigned long trap, int local, int ssize,
 unsigned int shift, unsigned int mmu_psize);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __hash_page_thp(unsigned long ea, unsigned long access,
+  unsigned long vsid, pmd_t *pmdp, unsigned long trap,
+  int local, int ssize, unsigned int psize);
+#else
+static inline int __hash_page_thp(unsigned long ea, unsigned long access,
+ unsigned long vsid, pmd_t *pmdp,
+ unsigned long trap, int local,
+ int ssize, unsigned int psize)
+{
+   BUG();
+}
+#endif
 extern void hash_failure_debug(unsigned long ea, unsigned long access,
   unsigned long vsid, unsigned long trap,
   int ssize, int psize, int lpsize,
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index fde36e6..87671eb 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -33,6 +33,7 @@ ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-$(CONFIG_PPC_STD_MMU_64)   += hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)   += hugetlbpage-book3e.o
 endif
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += hugepage-hash64.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)  += highmem.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index e942ae9..cea7267 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1041,11 +1041,16 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
return 1;
}
 
+   if (hugeshift) {
+   if (pmd_trans_huge((pmd_t) *ptep))
+   return __hash_page_thp(ea, access, vsid, (pmd_t *)ptep,
+  trap, local, ssize, psize);
 #ifdef CONFIG_HUGETLB_PAGE
-   if (hugeshift)
-   return __hash_page_huge(ea, access, vsid, ptep, trap, local,
-   ssize, hugeshift, psize);
-#endif /* CONFIG_HUGETLB_PAGE */
+   else
+   return __hash_page_huge(ea, access, vsid, ptep, trap,
+   local, ssize, hugeshift, psize);
+#endif
+   }
 
 #ifndef CONFIG_PPC_64K_PAGES
DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
diff --git a/arch/powerpc/mm/hugepage-hash64.c 
b/arch/powerpc/mm/hugepage-hash64.c
new file mode 100644
index 000..4d8f232
--- /dev/null
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -0,0 +1,180 @@
+/*
+ * Copyright IBM Corporation, 2013
+ * Author Aneesh Kumar K.V 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+/*
+ * PPC64 THP Support for hash based MMUs
+ */
+#include 
+#include 
+
+/*
+ * The linux hugepage PMD now includes the pmd entries followed by the address
+ * of the stashed pgtable_t. The stashed pgtable_t contains the hpte bits:
+ * [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
+ * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
+ * 4096 entries. Both will fit in a 4K pgtable_t.
+ */
+int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+   pmd_t *pmdp, unsigned long trap, int local, int ssize,
+   unsigned int psize)
+{
+   unsigned int index, valid;
+   unsigned char *hpte_slot_array;
+   unsigned long rflags, pa, hidx;
+   unsigned long old_pmd, new_pmd;
+   int ret,

[PATCH -V6 26/27] powerpc/THP: Enable THP on PPC64

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We enable it only if we support the 16MB page size.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64.h |3 +--
 arch/powerpc/mm/pgtable_64.c |   28 
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 97fc839..d65534b 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
return pmd_val(pmd) >> PTE_RPN_SHIFT;
 }
 
-/* We will enable it in the last patch */
-#define has_transparent_hugepage() 0
+extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int pmd_young(pmd_t pmd)
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 156706b..e0c2d09 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, 
unsigned long addr,
return;
 }
 
+int has_transparent_hugepage(void)
+{
+   if (!mmu_has_feature(MMU_FTR_16M_PAGE))
+   return 0;
+   /*
+* We support THP only if HPAGE_SHIFT is 16MB.
+*/
+   if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
+   return 0;
+   /*
+* We need to make sure that we support 16MB hugepage in a segment
+* with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
+* of 64K.
+*/
+   /*
+* If we have 64K HPTE, we will be using that by default
+*/
+   if (mmu_psize_defs[MMU_PAGE_64K].shift &&
+   (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
+   return 0;
+   /*
+* Ok we only have 4K HPTE
+*/
+   if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
+   return 0;
+
+   return 1;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 pmd_t pmdp_get_and_clear(struct mm_struct *mm,
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 21/27] powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We will use this in a later patch for handling THP pages

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hugetlb.h   |8 +-
 arch/powerpc/include/asm/pgtable-ppc64.h |   11 --
 arch/powerpc/mm/Makefile |2 +-
 arch/powerpc/mm/hugetlbpage.c|  251 +++---
 4 files changed, 136 insertions(+), 136 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 81f7677..ad3fa8b 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -169,8 +169,14 @@ static inline void flush_hugetlb_page(struct 
vm_area_struct *vma,
  unsigned long vmaddr)
 {
 }
-#endif /* CONFIG_HUGETLB_PAGE */
 
+#define hugepd_shift(x) 0
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
+   unsigned pdshift)
+{
+   return 0;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
 
 /*
  * FSL Book3E platforms require special gpage handling - the gpages
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 20133c1..f0effab 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -367,19 +367,8 @@ static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned 
long ea)
return pt;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 unsigned *shift);
-#else
-static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-  unsigned *shift)
-{
-   if (shift)
-   *shift = 0;
-   return find_linux_pte(pgdir, ea);
-}
-#endif /* !CONFIG_HUGETLB_PAGE */
-
 #endif /* __ASSEMBLY__ */
 
 #ifndef _PAGE_SPLITTING
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index cf16b57..fde36e6 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -28,8 +28,8 @@ obj-$(CONFIG_44x) += 44x_mmu.o
 obj-$(CONFIG_PPC_FSL_BOOK3E)   += fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)+= slice.o
-ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-y  += hugetlbpage.o
+ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-$(CONFIG_PPC_STD_MMU_64)   += hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)   += hugetlbpage-book3e.o
 endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 2da8fe6..29d8534 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -21,6 +21,9 @@
 #include 
 #include 
 #include 
+#include 
+
+#ifdef CONFIG_HUGETLB_PAGE
 
 #define PAGE_SHIFT_64K 16
 #define PAGE_SHIFT_16M 24
@@ -100,66 +103,6 @@ int pgd_huge(pgd_t pgd)
 }
 #endif
 
-/*
- * We have 4 cases for pgds and pmds:
- * (1) invalid (all zeroes)
- * (2) pointer to next table, as normal; bottom 6 bits == 0
- * (3) leaf pte for huge page, bottom two bits != 00
- * (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of 
table
- */
-pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned 
*shift)
-{
-   pgd_t *pg;
-   pud_t *pu;
-   pmd_t *pm;
-   pte_t *ret_pte;
-   hugepd_t *hpdp = NULL;
-   unsigned pdshift = PGDIR_SHIFT;
-
-   if (shift)
-   *shift = 0;
-
-   pg = pgdir + pgd_index(ea);
-
-   if (pgd_huge(*pg)) {
-   ret_pte = (pte_t *) pg;
-   goto out;
-   } else if (is_hugepd(pg))
-   hpdp = (hugepd_t *)pg;
-   else if (!pgd_none(*pg)) {
-   pdshift = PUD_SHIFT;
-   pu = pud_offset(pg, ea);
-
-   if (pud_huge(*pu)) {
-   ret_pte = (pte_t *) pu;
-   goto out;
-   } else if (is_hugepd(pu))
-   hpdp = (hugepd_t *)pu;
-   else if (!pud_none(*pu)) {
-   pdshift = PMD_SHIFT;
-   pm = pmd_offset(pu, ea);
-
-   if (pmd_huge(*pm)) {
-   ret_pte = (pte_t *) pm;
-   goto out;
-   } else if (is_hugepd(pm))
-   hpdp = (hugepd_t *)pm;
-   else if (!pmd_none(*pm))
-   return pte_offset_kernel(pm, ea);
-   }
-   }
-   if (!hpdp)
-   return NULL;
-
-   ret_pte = hugepte_offset(hpdp, ea, pdshift);
-   pdshift = hugepd_shift(*hpdp);
-out:
-   if (shift)
-   *shift = pdshift;
-   return ret_pte;
-}
-EXPORT_SYMBOL_GPL(find_linux_pte_or_hugepte);
-
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
@@ -748,69 +691,6 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
address,
 

[PATCH -V6 22/27] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 29d8534..a4b0fa5 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -949,7 +949,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned 
long ea, unsigned *shift
pdshift = PMD_SHIFT;
pm = pmd_offset(pu, ea);
 
-   if (pmd_huge(*pm)) {
+   if (pmd_huge(*pm) || pmd_large(*pm)) {
ret_pte = (pte_t *) pm;
goto out;
} else if (is_hugepd(pm))
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 19/27] powerpc/THP: Double the PMD table size for THP

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

The THP code allocates a PTE page along with each large page request and
deposits it for later use. This is to ensure that we won't have any failures
when we split hugepages back to regular pages.

On powerpc we want to use the deposited PTE page for storing the hash pte slot
and secondary bit information for the HPTEs. We use the second half of the pmd
table to save the deposited PTE page.
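
One way to picture the doubled table: each pmd entry gets a mirror slot one
table-length away where the deposited pgtable_t pointer can be stored. A
sketch, assuming sizeof(pgtable_t) == sizeof(pmd_t) as on ppc64 (the helper
name is hypothetical):

static inline pgtable_t *deposited_pgtable_slot(pmd_t *pmdp)
{
        /*
         * With PMD_CACHE_INDEX = PMD_INDEX_SIZE + 1 the pmd allocation is
         * 2 * PMD_TABLE_SIZE, so the second half mirrors the first and can
         * hold one stashed pointer per pmd entry.
         */
        return (pgtable_t *)pmdp + PTRS_PER_PMD;
}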

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgalloc-64.h|6 +++---
 arch/powerpc/include/asm/pgtable-ppc64.h |6 +-
 arch/powerpc/mm/init_64.c|9 ++---
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc-64.h 
b/arch/powerpc/include/asm/pgalloc-64.h
index 91acb12..c756463 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -221,17 +221,17 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, 
pgtable_t table,
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
+   return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-   kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
+   kmem_cache_free(PGT_CACHE(PMD_CACHE_INDEX), pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd, addr)   \
-   pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
+   pgtable_free_tlb(tlb, pmd, PMD_CACHE_INDEX)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)   \
pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..ab84332 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -20,7 +20,11 @@
PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT)
 #define PGTABLE_RANGE (ASM_CONST(1) << PGTABLE_EADDR_SIZE)
 
-
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PMD_CACHE_INDEX(PMD_INDEX_SIZE + 1)
+#else
+#define PMD_CACHE_INDEXPMD_INDEX_SIZE
+#endif
 /*
  * Define the address range of the kernel non-linear virtual area
  */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a56de85..97f741d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -88,7 +88,11 @@ static void pgd_ctor(void *addr)
 
 static void pmd_ctor(void *addr)
 {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   memset(addr, 0, PMD_TABLE_SIZE * 2);
+#else
memset(addr, 0, PMD_TABLE_SIZE);
+#endif
 }
 
 struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
@@ -137,10 +141,9 @@ void pgtable_cache_add(unsigned shift, void (*ctor)(void 
*))
 void pgtable_cache_init(void)
 {
pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
-   pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
-   if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+   pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
+   if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
panic("Couldn't allocate pgtable caches");
-
/* In all current configs, when the PUD index exists it's the
 * same size as either the pgd or pmd index.  Verify that the
 * initialization above has also created a PUD cache.  This
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 24/27] powerpc: Update gup_pmd_range to handle transparent hugepages

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/gup.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index 4b921af..3d36fd7 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -66,9 +66,20 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
pmd_t pmd = *pmdp;
 
next = pmd_addr_end(addr, end);
-   if (pmd_none(pmd))
+   /*
+* The pmd_trans_splitting() check below explains why
+* pmdp_splitting_flush has to flush the tlb, to stop
+* this gup-fast code from running while we set the
+* splitting bit in the pmd. Returning zero will take
+* the slow path that will call wait_split_huge_page()
+* if the pmd is still in splitting state. gup-fast
+* can't because it has irq disabled and
+* wait_split_huge_page() would never return as the
+* tlb flush IPI wouldn't run.
+*/
+   if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
-   if (pmd_huge(pmd)) {
+   if (pmd_huge(pmd) || pmd_large(pmd)) {
if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
 write, pages, nr))
return 0;
-- 
1.7.10

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH -V6 27/27] powerpc: Optimize hugepage invalidate

2013-04-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Hugepage invalidate involves invalidating multiple HPTE entries.
Optimize the operation using H_BULK_REMOVE on LPAR platforms.
On native, reduce the number of TLB flushes.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/machdep.h|3 +
 arch/powerpc/mm/hash_native_64.c  |   78 
 arch/powerpc/mm/pgtable_64.c  |   13 +++-
 arch/powerpc/platforms/pseries/lpar.c |  126 +++--
 4 files changed, 210 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 3f3f691..5d1e7d2 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -56,6 +56,9 @@ struct machdep_calls {
void(*hpte_removebolted)(unsigned long ea,
 int psize, int ssize);
void(*flush_hash_range)(unsigned long number, int local);
+   void(*hugepage_invalidate)(struct mm_struct *mm,
+  unsigned char *hpte_slot_array,
+  unsigned long addr, int psize);
 
/* special for kexec, to be called in real mode, linear mapping is
 * destroyed as well */
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..8ca178d 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -455,6 +455,83 @@ static void native_hpte_invalidate(unsigned long slot, 
unsigned long vpn,
local_irq_restore(flags);
 }
 
+static void native_hugepage_invalidate(struct mm_struct *mm,
+  unsigned char *hpte_slot_array,
+  unsigned long addr, int psize)
+{
+   int ssize = 0, i;
+   int lock_tlbie;
+   struct hash_pte *hptep;
+   int actual_psize = MMU_PAGE_16M;
+   unsigned int max_hpte_count, valid;
+   unsigned long flags, s_addr = addr;
+   unsigned long hpte_v, want_v, shift;
+   unsigned long hidx, vpn = 0, vsid, hash, slot;
+
+   shift = mmu_psize_defs[psize].shift;
+   max_hpte_count = HUGE_PAGE_SIZE >> shift;
+
+   local_irq_save(flags);
+   for (i = 0; i < max_hpte_count; i++) {
+   /*
+* 8 bits per each hpte entries
+* 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
+*/
+   valid = hpte_slot_array[i] & 0x1;
+   if (!valid)
+   continue;
+   hidx =  hpte_slot_array[i]  >> 1;
+
+   /* get the vpn */
+   addr = s_addr + (i * (1ul << shift));
+   if (!is_kernel_addr(addr)) {
+   ssize = user_segment_size(addr);
+   vsid = get_vsid(mm->context.id, addr, ssize);
+   WARN_ON(vsid == 0);
+   } else {
+   vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+   ssize = mmu_kernel_ssize;
+   }
+
+   vpn = hpt_vpn(addr, vsid, ssize);
+   hash = hpt_hash(vpn, shift, ssize);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+
+   hptep = htab_address + slot;
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
+   native_lock_hpte(hptep);
+   hpte_v = hptep->v;
+
+   /* Even if we miss, we need to invalidate the TLB */
+   if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
+   native_unlock_hpte(hptep);
+   else
+   /* Invalidate the hpte. NOTE: this also unlocks it */
+   hptep->v = 0;
+   }
+   /*
+* Since this is a hugepage, we just need a single tlbie.
+* use the last vpn.
+*/
+   lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+   if (lock_tlbie)
+   raw_spin_lock(&native_tlbie_lock);
+
+   asm volatile("ptesync":::"memory");
+   __tlbie(vpn, psize, actual_psize, ssize);
+   asm volatile("eieio; tlbsync; ptesync":::"memory");
+
+   if (lock_tlbie)
+   raw_spin_unlock(&native_tlbie_lock);
+
+   local_irq_restore(flags);
+}
+
+
 static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
int *psize, int *apsize, int *ssize, unsigned long *vpn)
 {
@@ -658,4 +735,5 @@ void __init hpte_init_native(void)
ppc_md.hpte_remove  = native_hpte_remove;
ppc_md.hpte_clear_all   = native_hpte_clear;
ppc_md.flush_hash_range = native_flush_hash_range;
+   ppc_md.hugepage_invalidate   = native_hugepage_invalidate;
 }
diff --git a/arch/powerpc/mm/pgtable_64.c b/

[PATCH] powerpc/fsl-pci:fix incorrect iounmap pci hose->private_data

2013-04-22 Thread Roy Zang
The PCI hose->private_data will be used by other functions, for example
fsl_pcie_check_link(), so do not iounmap it.

This fixes the kernel crash on T4240:

Unable to handle kernel paging request for data at address
0x880080060f14
Faulting instruction address: 0xc0032554
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=24 T4240 QDS
Modules linked in:
NIP: c0032554 LR: c003254c CTR: c001e5c0
REGS: c00179143440 TRAP: 0300   Not tainted
(3.8.8-rt2-00754-g951f064-dirt)
MSR: 80029000   CR: 24adbe22  XER: 
SOFTE: 0
DEAR: 880080060f14, ESR: 
TASK = c0017913d2c0[1] 'swapper/0' THREAD: c0017914 CPU: 2
GPR00: c003254c c001791436c0 c0ae2998
0027
GPR04:  05a5 
0002
GPR08: 3030303038303038 c0a2d4d0 c0aebeb8
c0af2998
GPR12: 24adbe22 cfffa800 c0001be0

GPR16:   

GPR20:   
c09ddf70
GPR24: c09e8d40 c0af2998 c0b1529c
c00179143b40
GPR28: c001799b4000 c00179143c00 88008006
c0727ec8
NIP [c0032554] .fsl_pcie_check_link+0x104/0x150
LR [c003254c] .fsl_pcie_check_link+0xfc/0x150
Call Trace:
[c001791436c0] [c003254c] .fsl_pcie_check_link+0xfc/0x150
(unreliab)
[c00179143a30] [c00325d4]
.fsl_indirect_read_config+0x34/0xb0
[c00179143ad0] [c02c7ee8]
.pci_bus_read_config_byte+0x88/0xd0
[c00179143b90] [c09c0528] .pci_apply_final_quirks+0x9c/0x18c
[c00179143c40] [c000142c] .do_one_initcall+0x5c/0x1f0
[c00179143cf0] [c09a0bb4] .kernel_init_freeable+0x180/0x264
[c00179143db0] [c0001bfc] .kernel_init+0x1c/0x420
[c00179143e30] [c8b4] .ret_from_kernel_thread+0x64/0xb0
Instruction dump:
6000 4ba0 ebc301d0 3fe2ffc4 3c62ffe0 3bff5530 38638a78 7fe4fb78
7fc5f378 486ea77d 6000 7c0004ac <801e0f14> 0c00 4c00012c
3c62ffe0
---[ end trace f841fbc03c9d2e1b ]---

Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b

Rebooting in 180 seconds..

Signed-off-by: Yuanquan Chen 
Signed-off-by: Roy Zang 
---
based on Kumar's next branch.
tested on P3041 and T4240.

 arch/powerpc/sysdev/fsl_pci.c |   11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index f823304..c343edc 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -242,15 +242,11 @@ static void setup_pci_atmu(struct pci_controller *hose)
paddr_hi -= hose->pci_mem_offset;
paddr_lo -= hose->pci_mem_offset;
 
-   if (paddr_hi == paddr_lo) {
+   if (paddr_hi == paddr_lo)
pr_err("%s: No outbound window space\n", name);
-   goto out;
-   }
 
-   if (paddr_lo == 0) {
+   if (paddr_lo == 0)
pr_err("%s: No space for inbound window\n", name);
-   goto out;
-   }
 
/* setup PCSRBAR/PEXCSRBAR */
early_write_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_0, 0x);
@@ -395,9 +391,6 @@ static void setup_pci_atmu(struct pci_controller *hose)
pr_info("%s: DMA window size is 0x%llx\n", name,
(u64)hose->dma_window_size);
}
-
-out:
-   iounmap(pci);
 }
 
 static void __init setup_pci_cmd(struct pci_controller *hose)
-- 
1.7.9.5


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v8 3/3] of/pci: mips: convert to common of_pci_range_parser

2013-04-22 Thread Andrew Murray
This patch converts the pci_load_of_ranges function to use the new common
of_pci_range_parser.

Signed-off-by: Andrew Murray 
Signed-off-by: Liviu Dudau 
Signed-off-by: Gabor Juhos 
Reviewed-by: Rob Herring 
Reviewed-by: Grant Likely 
Tested-by: Linus Walleij 
---
 arch/mips/pci/pci.c |   51 +++
 1 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/arch/mips/pci/pci.c b/arch/mips/pci/pci.c
index 0872f12..4b09ca8 100644
--- a/arch/mips/pci/pci.c
+++ b/arch/mips/pci/pci.c
@@ -122,51 +122,38 @@ static void pcibios_scanbus(struct pci_controller *hose)
 #ifdef CONFIG_OF
 void pci_load_of_ranges(struct pci_controller *hose, struct device_node *node)
 {
-   const __be32 *ranges;
-   int rlen;
-   int pna = of_n_addr_cells(node);
-   int np = pna + 5;
+   struct of_pci_range range;
+   struct of_pci_range_parser parser;
+   u32 res_type;
 
pr_info("PCI host bridge %s ranges:\n", node->full_name);
-   ranges = of_get_property(node, "ranges", &rlen);
-   if (ranges == NULL)
-   return;
hose->of_node = node;
 
-   while ((rlen -= np * 4) >= 0) {
-   u32 pci_space;
+   if (of_pci_range_parser_init(&parser, node))
+   return;
+
+   for_each_of_pci_range(&parser, &range) {
struct resource *res = NULL;
-   u64 addr, size;
-
-   pci_space = be32_to_cpup(&ranges[0]);
-   addr = of_translate_address(node, ranges + 3);
-   size = of_read_number(ranges + pna + 3, 2);
-   ranges += np;
-   switch ((pci_space >> 24) & 0x3) {
-   case 1: /* PCI IO space */
+
+   switch (range.flags & IORESOURCE_TYPE_BITS) {
+   case IORESOURCE_IO:
pr_info("  IO 0x%016llx..0x%016llx\n",
-   addr, addr + size - 1);
+   range.cpu_addr,
+   range.cpu_addr + range.size - 1);
hose->io_map_base =
-   (unsigned long)ioremap(addr, size);
+   (unsigned long)ioremap(range.cpu_addr,
+  range.size);
res = hose->io_resource;
-   res->flags = IORESOURCE_IO;
break;
-   case 2: /* PCI Memory space */
-   case 3: /* PCI 64 bits Memory space */
+   case IORESOURCE_MEM:
pr_info(" MEM 0x%016llx..0x%016llx\n",
-   addr, addr + size - 1);
+   range.cpu_addr,
+   range.cpu_addr + range.size - 1);
res = hose->mem_resource;
-   res->flags = IORESOURCE_MEM;
break;
}
-   if (res != NULL) {
-   res->start = addr;
-   res->name = node->full_name;
-   res->end = res->start + size - 1;
-   res->parent = NULL;
-   res->sibling = NULL;
-   res->child = NULL;
-   }
+   if (res != NULL)
+   of_pci_range_to_resource(&range, node, res);
}
 }
 #endif
-- 
1.7.0.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v8 0/3] of/pci: Provide common support for PCI DT parsing

2013-04-22 Thread Andrew Murray
This patchset factors out duplicated code associated with parsing PCI
DT "ranges" properties across the architectures and introduces a
"ranges" parser. This parser "of_pci_range_parser" can be used directly
by ARM host bridge drivers enabling them to obtain ranges from device
trees.

I've included the Reviewed-by, Tested-by and Acked-by tags received from v5/v6/v7
in this patchset; earlier versions of this patchset (v3) have been tested by:

Thierry Reding 
Jingoo Han 

I've tested that this patchset builds and runs on ARM and that it builds on
PowerPC, x86_64 and MIPS.

Compared to the v7 sent by Andrew Murray, the following changes have been made
(please note that the first patch is unchanged from v7):

 * Rename of_pci_range_parser to of_pci_range_parser_init and
   of_pci_process_ranges to of_pci_range_parser_one as suggested by Grant
   Likely.

 * Reverted back to using a switch statement instead of if/else in
   pci_process_bridge_OF_ranges. Grant Likely highlighted that this change
   from the original code was unnecessary.

 * Squashed in a patch provided by Gabor Juhos which fixes build errors on
   MIPS found in the last patchset.

Compared to the v6 sent by Andrew Murray, the following changes have
been made in response to build errors/warnings:

 * Inclusion of linux/of_address.h in of_pci.c as suggested by Michal
   Simek to prevent compilation failures on Microblaze (and others) and his
   ack.

 * Use of externs and static inlines, and a typo fix, in linux/of_address.h in
   response to linker errors (multiple definition) on x86_64 as spotted by a
   kbuild test robot on (jcooper/linux.git mvebu/drivers)

 * Add EXPORT_SYMBOL_GPL to of_pci_range_parser function to be consistent
   with of_pci_process_ranges function

Compared to the v5 sent by Andrew Murray, the following changes have
been made:

 * Use of CONFIG_64BIT instead of CONFIG_[a32bitarch] as suggested by
   Rob Herring in drivers/of/of_pci.c

 * Added forward declaration of struct pci_controller in linux/of_pci.h
   to prevent compiler warning as suggested by Thomas Petazzoni

 * Improved error checking (!range check), removal of unnecessary be32_to_cpup
   call, improved formatting of struct of_pci_range_parser layout and
   replacement of macro with a static inline. All suggested by Rob Herring.

Compared to the v4 (incorrectly labelled v3) sent by Andrew Murray,
the following changes have been made:

 * Split the patch as suggested by Rob Herring

Compared to the v3 sent by Andrew Murray, the following changes have
been made:

 * Unify and move duplicate pci_process_bridge_OF_ranges functions to
   drivers/of/of_pci.c as suggested by Rob Herring

 * Fix potential build errors with Microblaze/MIPS

Compared to "[PATCH v5 01/17] of/pci: Provide support for parsing PCI DT
ranges property", the following changes have been made:

 * Correct use of IORESOURCE_* as suggested by Russell King

 * Improved interface and naming as suggested by Thierry Reding

Compared to the v2 sent by Andrew Murray, Thomas Petazzoni did:

 * Add a memset() on the struct of_pci_range_iter when starting the
   for loop in for_each_pci_range(). Otherwise, with an uninitialized
   of_pci_range_iter, of_pci_process_ranges() may crash.

 * Add parenthesis around 'res', 'np' and 'iter' in the
   for_each_of_pci_range macro definitions. Otherwise, passing
   something like &foobar as 'res' didn't work.

 * Rebased on top of 3.9-rc2, which required fixing a few conflicts in
   the Microblaze code.

v2:
  This follows on from suggestions made by Grant Likely
  (marc.info/?l=linux-kernel&m=136079602806328)

Andrew Murray (3):
  of/pci: Unify pci_process_bridge_OF_ranges from Microblaze and
PowerPC
  of/pci: Provide support for parsing PCI DT ranges property
  of/pci: mips: convert to common of_pci_range_parser

 arch/microblaze/include/asm/pci-bridge.h |5 +-
 arch/microblaze/pci/pci-common.c |  192 --
 arch/mips/pci/pci.c  |   51 +++-
 arch/powerpc/include/asm/pci-bridge.h|5 +-
 arch/powerpc/kernel/pci-common.c |  192 --
 drivers/of/address.c |   67 +++
 drivers/of/of_pci.c  |  173 +++
 include/linux/of_address.h   |   48 
 include/linux/of_pci.h   |4 +
 9 files changed, 313 insertions(+), 424 deletions(-)

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v8 2/3] of/pci: Provide support for parsing PCI DT ranges property

2013-04-22 Thread Andrew Murray
This patch factors out common implementation patterns to reduce overall kernel
code and provide a means for host bridge drivers to directly obtain struct
resources from the DT's ranges property without relying on architecture-specific
DT handling. This will make it easier to write architecture-independent host
bridge drivers and mitigate against further duplication of DT parsing code.

This patch can be used in the following way:

struct of_pci_range_parser parser;
struct of_pci_range range;

if (of_pci_range_parser_init(&parser, np))
; //no ranges property

for_each_of_pci_range(&parser, &range) {

/*
directly access properties of the address range, e.g.:
range.pci_space, range.pci_addr, range.cpu_addr,
range.size, range.flags

alternatively obtain a struct resource, e.g.:
struct resource res;
of_pci_range_to_resource(&range, np, &res);
*/
}

Additionally the implementation takes care of adjacent ranges and merges them
into a single range (as was the case with powerpc and microblaze).

Signed-off-by: Andrew Murray 
Signed-off-by: Liviu Dudau 
Signed-off-by: Thomas Petazzoni 
Reviewed-by: Rob Herring 
Tested-by: Thomas Petazzoni 
Tested-by: Linus Walleij 
Acked-by: Grant Likely 
---
 drivers/of/address.c   |   67 ++
 drivers/of/of_pci.c|  113 +---
 include/linux/of_address.h |   48 +++
 3 files changed, 158 insertions(+), 70 deletions(-)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 04da786..fdd0636 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -227,6 +227,73 @@ int of_pci_address_to_resource(struct device_node *dev, 
int bar,
return __of_address_to_resource(dev, addrp, size, flags, NULL, r);
 }
 EXPORT_SYMBOL_GPL(of_pci_address_to_resource);
+
+int of_pci_range_parser_init(struct of_pci_range_parser *parser,
+   struct device_node *node)
+{
+   const int na = 3, ns = 2;
+   int rlen;
+
+   parser->node = node;
+   parser->pna = of_n_addr_cells(node);
+   parser->np = parser->pna + na + ns;
+
+   parser->range = of_get_property(node, "ranges", &rlen);
+   if (parser->range == NULL)
+   return -ENOENT;
+
+   parser->end = parser->range + rlen / sizeof(__be32);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(of_pci_range_parser_init);
+
+struct of_pci_range *of_pci_range_parser_one(struct of_pci_range_parser 
*parser,
+   struct of_pci_range *range)
+{
+   const int na = 3, ns = 2;
+
+   if (!range)
+   return NULL;
+
+   if (!parser->range || parser->range + parser->np > parser->end)
+   return NULL;
+
+   range->pci_space = parser->range[0];
+   range->flags = of_bus_pci_get_flags(parser->range);
+   range->pci_addr = of_read_number(parser->range + 1, ns);
+   range->cpu_addr = of_translate_address(parser->node,
+   parser->range + na);
+   range->size = of_read_number(parser->range + parser->pna + na, ns);
+
+   parser->range += parser->np;
+
+   /* Now consume following elements while they are contiguous */
+   while (parser->range + parser->np <= parser->end) {
+   u32 flags, pci_space;
+   u64 pci_addr, cpu_addr, size;
+
+   pci_space = be32_to_cpup(parser->range);
+   flags = of_bus_pci_get_flags(parser->range);
+   pci_addr = of_read_number(parser->range + 1, ns);
+   cpu_addr = of_translate_address(parser->node,
+   parser->range + na);
+   size = of_read_number(parser->range + parser->pna + na, ns);
+
+   if (flags != range->flags)
+   break;
+   if (pci_addr != range->pci_addr + range->size ||
+   cpu_addr != range->cpu_addr + range->size)
+   break;
+
+   range->size += size;
+   parser->range += parser->np;
+   }
+
+   return range;
+}
+EXPORT_SYMBOL_GPL(of_pci_range_parser_one);
+
 #endif /* CONFIG_PCI */
 
 /*
diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index 1626172..3c49ab2 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #if defined(CONFIG_PPC32) || defined(CONFIG_PPC64) || 
defined(CONFIG_MICROBLAZE)
@@ -82,67 +83,42 @@ EXPORT_SYMBOL_GPL(of_pci_find_child_device);
 void pci_process_bridge_OF_ranges(struct pci_controller *hose,
  struct device_node *dev, int primary)
 {
-   const u32 *ranges;
-   int rlen;
-   int pna = of_n_addr_cells(dev);
-   int np = pna + 5;

[PATCH v8 1/3] of/pci: Unify pci_process_bridge_OF_ranges from Microblaze and PowerPC

2013-04-22 Thread Andrew Murray
The pci_process_bridge_OF_ranges function, used to parse the "ranges"
property of a PCI host device, is found in both Microblaze and PowerPC
architectures. These implementations are nearly identical. This patch
moves this common code to a common place.

Signed-off-by: Andrew Murray 
Signed-off-by: Liviu Dudau 
Reviewed-by: Rob Herring 
Tested-by: Thomas Petazzoni 
Tested-by: Linus Walleij 
Acked-by: Michal Simek 
Acked-by: Grant Likely 
---
 arch/microblaze/include/asm/pci-bridge.h |5 +-
 arch/microblaze/pci/pci-common.c |  192 
 arch/powerpc/include/asm/pci-bridge.h|5 +-
 arch/powerpc/kernel/pci-common.c |  192 
 drivers/of/of_pci.c  |  200 ++
 include/linux/of_pci.h   |4 +
 6 files changed, 206 insertions(+), 392 deletions(-)

diff --git a/arch/microblaze/include/asm/pci-bridge.h 
b/arch/microblaze/include/asm/pci-bridge.h
index cb5d397..5783cd6 100644
--- a/arch/microblaze/include/asm/pci-bridge.h
+++ b/arch/microblaze/include/asm/pci-bridge.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device_node;
 
@@ -132,10 +133,6 @@ extern void setup_indirect_pci(struct pci_controller *hose,
 extern struct pci_controller *pci_find_hose_for_OF_device(
struct device_node *node);
 
-/* Fill up host controller resources from the OF node */
-extern void pci_process_bridge_OF_ranges(struct pci_controller *hose,
-   struct device_node *dev, int primary);
-
 /* Allocate & free a PCI host bridge structure */
 extern struct pci_controller *pcibios_alloc_controller(struct device_node 
*dev);
 extern void pcibios_free_controller(struct pci_controller *phb);
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 9ea521e..2735ad9 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -622,198 +622,6 @@ void pci_resource_to_user(const struct pci_dev *dev, int 
bar,
*end = rsrc->end - offset;
 }
 
-/**
- * pci_process_bridge_OF_ranges - Parse PCI bridge resources from device tree
- * @hose: newly allocated pci_controller to be setup
- * @dev: device node of the host bridge
- * @primary: set if primary bus (32 bits only, soon to be deprecated)
- *
- * This function will parse the "ranges" property of a PCI host bridge device
- * node and setup the resource mapping of a pci controller based on its
- * content.
- *
- * Life would be boring if it wasn't for a few issues that we have to deal
- * with here:
- *
- *   - We can only cope with one IO space range and up to 3 Memory space
- * ranges. However, some machines (thanks Apple !) tend to split their
- * space into lots of small contiguous ranges. So we have to coalesce.
- *
- *   - We can only cope with all memory ranges having the same offset
- * between CPU addresses and PCI addresses. Unfortunately, some bridges
- * are setup for a large 1:1 mapping along with a small "window" which
- * maps PCI address 0 to some arbitrary high address of the CPU space in
- * order to give access to the ISA memory hole.
- * The way out of here that I've chosen for now is to always set the
- * offset based on the first resource found, then override it if we
- * have a different offset and the previous was set by an ISA hole.
- *
- *   - Some busses have IO space not starting at 0, which causes trouble with
- * the way we do our IO resource renumbering. The code somewhat deals with
- * it for 64 bits but I would expect problems on 32 bits.
- *
- *   - Some 32 bits platforms such as 4xx can have physical space larger than
- * 32 bits so we need to use 64 bits values for the parsing
- */
-void pci_process_bridge_OF_ranges(struct pci_controller *hose,
- struct device_node *dev, int primary)
-{
-   const u32 *ranges;
-   int rlen;
-   int pna = of_n_addr_cells(dev);
-   int np = pna + 5;
-   int memno = 0, isa_hole = -1;
-   u32 pci_space;
-   unsigned long long pci_addr, cpu_addr, pci_next, cpu_next, size;
-   unsigned long long isa_mb = 0;
-   struct resource *res;
-
-   pr_info("PCI host bridge %s %s ranges:\n",
-  dev->full_name, primary ? "(primary)" : "");
-
-   /* Get ranges property */
-   ranges = of_get_property(dev, "ranges", &rlen);
-   if (ranges == NULL)
-   return;
-
-   /* Parse it */
-   pr_debug("Parsing ranges property...\n");
-   while ((rlen -= np * 4) >= 0) {
-   /* Read next ranges element */
-   pci_space = ranges[0];
-   pci_addr = of_read_number(ranges + 1, 2);
-   cpu_addr = of_translate_address(dev, ranges + 3);
-   size = of_read_number(ranges + pna + 3, 2);
-
-   pr_debug("pci_space: 0x%08x pci_addr:0x%016llx ",
-   

Re: [PATCH v7 3/3] of/pci: mips: convert to common of_pci_range_parser

2013-04-22 Thread Andrew Murray
On Sun, Apr 21, 2013 at 08:27:02AM +0100, Gabor Juhos wrote:
> Hi Jason,
> 
> >> Sorry I had no time earlier, but I have tested this now on MIPS. The patch
> >> causes build errors unfortunately. Given the fact that this has been merged
> >> already, I will send a fixup patch.
> > 
> > Olof has dropped this branch from arm-soc, plase post the build error
> > and fix here so that it can be included in this series.
> 
> I have posted the patch to Olof two days ago. It has been CC'd to you as well
> but In case that it does not exists in your mailbox the patch can be found 
> here:
> 
> https://patchwork.linux-mips.org/patch/5196/
> 
> However I can re-post the patch as a reply to this thread if you prefer that.

As this branch was dropped I have updated my patchset to include Grant's recent
feedback - I've also included Gabor's fixes to this patchset (and his sign-off).

If you include this new patchset in your branch the drivers that depend on it
will need to be updated to reflect the new naming of functions as suggested by
Grant.

Thanks,

Andrew Murray
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v2 06/15] powerpc/85xx: add support to JOG feature using cpufreq interface

2013-04-22 Thread Zhao Chenhui
On Mon, Apr 22, 2013 at 08:55:35AM +0530, Viresh Kumar wrote:
> On Fri, Apr 19, 2013 at 4:17 PM, Zhao Chenhui
>  wrote:
> > diff --git a/drivers/cpufreq/mpc85xx-cpufreq.c 
> > b/drivers/cpufreq/mpc85xx-cpufreq.c
> 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> Would be better to keep them in alphabetical order, so that we don't add
> anything twice.

Good idea.

> 
> > +static int mpc85xx_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +   unsigned int i, cur_pll;
> > +   int hw_cpu = get_hard_smp_processor_id(policy->cpu);
> > +
> > +   if (!cpu_present(policy->cpu))
> 
> This can't happen and so no need to check it.
> 
> > +   return -ENODEV;
> > +
> > +   /* the latency of a transition, the unit is ns */
> > +   policy->cpuinfo.transition_latency = 2000;
> > +
> > +   cur_pll = get_pll(hw_cpu);
> > +
> > +   /* initialize frequency table */
> > +   pr_debug("core%d frequency table:\n", hw_cpu);
> > +   for (i = 0; mpc85xx_freqs[i].frequency != CPUFREQ_TABLE_END; i++) {
> > +   if (mpc85xx_freqs[i].index <= max_pll[hw_cpu]) {
> > +   /* The frequency unit is kHz. */
> > +   mpc85xx_freqs[i].frequency =
> > +   (sysfreq * mpc85xx_freqs[i].index / 2) / 
> > 1000;
> > +   } else {
> > +   mpc85xx_freqs[i].frequency = CPUFREQ_ENTRY_INVALID;
> > +   }
> > +
> > +   pr_debug("%d: %dkHz\n", i, mpc85xx_freqs[i].frequency);
> > +
> > +   if (mpc85xx_freqs[i].index == cur_pll)
> > +   policy->cur = mpc85xx_freqs[i].frequency;
> > +   }
> > +   pr_debug("current pll is at %d, and core freq is%d\n",
> > +   cur_pll, policy->cur);
> > +
> > +   cpufreq_frequency_table_get_attr(mpc85xx_freqs, policy->cpu);
> > +
> > +   /*
> > +* This ensures that policy->cpuinfo_min
> > +* and policy->cpuinfo_max are set correctly.
> > +*/
> > +   return cpufreq_frequency_table_cpuinfo(policy, mpc85xx_freqs);
> 
> Call cpufreq_frequency_table_get_attr() at the end after above call is
> successful.
> 
> > +}
> 
> > +static int mpc85xx_cpufreq_target(struct cpufreq_policy *policy,
> > + unsigned int target_freq,
> > + unsigned int relation)
> 
> merge above two lines.
> 
> > +{
> > +   struct cpufreq_freqs freqs;
> > +   unsigned int new;
> > +   int ret = 0;
> > +
> > +   if (!set_pll)
> > +   return -ENODEV;
> > +
> > +   cpufreq_frequency_table_target(policy,
> > +  mpc85xx_freqs,
> > +  target_freq,
> > +  relation,
> > +  &new);
> 
> same.. merge all above to put it in a single line.
> 
> > +   freqs.old = policy->cur;
> > +   freqs.new = mpc85xx_freqs[new].frequency;
> > +   freqs.cpu = policy->cpu;
> 
> not required now.
> 
> > +   mutex_lock(&mpc85xx_switch_mutex);
> > +   cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
> 
> ditto. Rebase over latest code from linux-next. This call has changed.
> 
> > +   ret = set_pll(policy->cpu, mpc85xx_freqs[new].index);
> > +   if (!ret) {
> > +   pr_info("cpufreq: Setting core%d frequency to %d kHz and 
> > PLL ratio to %d:2\n",
> > +policy->cpu, mpc85xx_freqs[new].frequency,
> > +mpc85xx_freqs[new].index);
> > +
> > +   ppc_proc_freq = freqs.new * 1000ul;
> > +   }
> > +   cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
> > +   mutex_unlock(&mpc85xx_switch_mutex);
> > +
> > +   return ret;
> > +}
> 
> > +static int __init mpc85xx_jog_init(void)
> > +{
> > +   struct device_node *np;
> > +   unsigned int svr;
> > +
> > +   np = of_find_matching_node(NULL, mpc85xx_jog_ids);
> > +   if (!np)
> > +   return -ENODEV;
> > +
> > +   guts = of_iomap(np, 0);
> > +   if (!guts) {
> > +   of_node_put(np);
> > +   return -ENODEV;
> > +   }
> > +
> > +   sysfreq = fsl_get_sys_freq();
> > +
> > +   if (of_device_is_compatible(np, "fsl,mpc8536-guts")) {
> > +   svr = mfspr(SPRN_SVR);
> > +   if ((svr & 0x7fff) == 0x10) {
> > +   pr_err("MPC8536 Rev 1.0 does not support 
> > cpufreq(JOG).\n");
> > +   of_node_put(np);
> 
> unmap too??
> 
> > +   return -ENODEV;
> > +   }
> > +   mpc85xx_freqs = mpc8536_freqs_table;
> > +   set_pll = mpc8536_set_pll;
> > +   max_pll[0] = get_pll(0);
> > +
> > +   } else if (of_device_is_compatible(np, "fsl,p1022-guts")) {
> > +   mpc85xx_

Re: [PATCH v7 2/3] of/pci: Provide support for parsing PCI DT ranges property

2013-04-22 Thread Andrew Murray
On Thu, Apr 18, 2013 at 02:44:01PM +0100, Grant Likely wrote:
> On Tue, 16 Apr 2013 11:18:27 +0100, Andrew Murray  
> wrote:

> 
> Acked-by: Grant Likely 
> 
> But comments below...
> 

I've updated the patchset (now v8) to reflect your feedback, after a closer
look...

> > -
> > -   pr_debug("pci_space: 0x%08x pci_addr:0x%016llx ",
> > -   pci_space, pci_addr);
> > -   pr_debug("cpu_addr:0x%016llx size:0x%016llx\n",
> > -   cpu_addr, size);
> > -
> > -   ranges += np;
> > +   pr_debug("pci_space: 0x%08x pci_addr: 0x%016llx ",
> > +   range.pci_space, range.pci_addr);
> > +   pr_debug("cpu_addr: 0x%016llx size: 0x%016llx\n",
> > +   range.cpu_addr, range.size);
> 
> Nit: the patch changed whitespace on the pr_debug() statements, so even
> though the first line of each is identical, they look different in the
> patch.
> 

Actually the first line isn't identical; the original file was inconsistent
in its use of spaces between ':' and '0x%0', so my patch ensured that there
was always a space. I guess this could have been done as a separate patch.

> >  
> > /* If we failed translation or got a zero-sized region
> >  * (some FW try to feed us with non sensical zero sized regions
> >  * such as power3 which look like some kind of attempt
> >  * at exposing the VGA memory hole)
> >  */
> > -   if (cpu_addr == OF_BAD_ADDR || size == 0)
> > +   if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
> > continue;
> 
> Can this also be rolled into the parsing iterator?
> 

I decided not to do this, mainly because ARM drivers use the parser directly
(instead of the pci_process_bridge_OF_ranges function) and it seemed perfectly
valid for the parser to return a range of size 0 if that is what is present in
the DT.
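
For illustration, here is a minimal sketch of how a host bridge driver could
consume the parser directly and do that filtering itself. This is not taken
from the series; the helper name foo_pcie_parse_ranges is made up, and the
parser/iterator names follow the v8 naming discussed in this thread, so they
may not match the final code exactly:

#include <linux/of_address.h>

/* hypothetical driver hook; np is the host bridge device_node */
static int foo_pcie_parse_ranges(struct device_node *np)
{
	struct of_pci_range_parser parser;
	struct of_pci_range range;

	if (of_pci_range_parser_init(&parser, np))
		return -ENOENT;

	for_each_of_pci_range(&parser, &range) {
		/* the parser hands ranges back verbatim, so skip the entries
		 * pci_process_bridge_OF_ranges would also ignore */
		if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
			continue;

		/* claim/map range.cpu_addr .. range.cpu_addr + range.size - 1,
		 * using range.flags to tell IO apart from MEM */
	}

	return 0;
}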

> >  
> > -   /* Now consume following elements while they are contiguous */
> > -   for (; rlen >= np * sizeof(u32);
> > -ranges += np, rlen -= np * 4) {
> > -   if (ranges[0] != pci_space)
> > -   break;
> > -   pci_next = of_read_number(ranges + 1, 2);
> > -   cpu_next = of_translate_address(dev, ranges + 3);
> > -   if (pci_next != pci_addr + size ||
> > -   cpu_next != cpu_addr + size)
> > -   break;
> > -   size += of_read_number(ranges + pna + 3, 2);
> > -   }
> > -
> > /* Act based on address space type */
> > res = NULL;
> > -   switch ((pci_space >> 24) & 0x3) {
> > -   case 1: /* PCI IO space */
> > +   res_type = range.flags & IORESOURCE_TYPE_BITS;
> > +   if (res_type == IORESOURCE_IO) {
> 
> Why the change from switch() to an if/else if sequence?

I think this was an artifact of the patch's evolution; I've reverted back to
the switch.

> 
> But those are mostly nitpicks. If this is deferred to v3.10 then I would
> suggest fixing them up and posting for another round of review.

Andrew Murray


Re: [PATCH v2 06/15] powerpc/85xx: add support to JOG feature using cpufreq interface

2013-04-22 Thread Zhao Chenhui
On Mon, Apr 22, 2013 at 01:43:29AM +0200, Rafael J. Wysocki wrote:
> On Friday, April 19, 2013 07:00:57 PM Zhao Chenhui wrote:
> > - Forwarded message from Zhao Chenhui  -
> > 
> > Date: Fri, 19 Apr 2013 18:47:39 +0800
> > From: Zhao Chenhui 
> > To: linuxppc-dev@lists.ozlabs.org
> > CC: linux-ker...@vger.kernel.org
> > Subject: [linuxppc-release] [PATCH v2 06/15] powerpc/85xx: add support to 
> > JOG feature using cpufreq interface
> > X-Mailer: git-send-email 1.7.3
> > 
> > From: chenhui zhao 
> > 
> > Some 85xx silicons like MPC8536 and P1022 have a JOG feature, which provides
> > a dynamic mechanism to lower or raise the CPU core clock at runtime.
> > 
> > This patch adds support for changing the CPU frequency using the standard
> > cpufreq interface. The CORE to CCB ratio can be 1:1 (except MPC8536), 3:2,
> > 2:1, 5:2, 3:1, 7:2 and 4:1.
> > 
> > Two CPU cores on P1022 must not be in the low power state during the frequency
> > transition. The driver uses an atomic counter to meet the requirement.
> > 
> > The jog mode frequency transition process on the MPC8536 is similar to
> > the deep sleep process. The driver needs to save the CPU state and restore
> > it after the CPU warm reset.
> > 
> > Note:
> >  * The I/O peripherals such as PCIe and eTSEC may lose packets during
> >the jog mode frequency transition.
> >  * The driver doesn't support MPC8536 Rev 1.0 due to a JOG erratum.
> >Subsequent revisions of MPC8536 have corrected the erratum.
> > 
> > Signed-off-by: Dave Liu 
> > Signed-off-by: Li Yang 
> > Signed-off-by: Jerry Huang 
> > Signed-off-by: Zhao Chenhui 
> > CC: Scott Wood 
> 
> Well, I'd like someone from the PowerPC camp to comment on this before I take 
> it.
> 
> Thanks,
> Rafael
> 

OK. Thanks.

-Chenhui



Re: [PATCH 3/3] powerpc/powernv: Patch MSI EOI handler on P8

2013-04-22 Thread Gavin Shan
On Mon, Apr 22, 2013 at 12:56:37PM +1000, Michael Ellerman wrote:
>On Mon, Apr 22, 2013 at 09:45:33AM +0800, Gavin Shan wrote:
>> On Mon, Apr 22, 2013 at 09:34:36AM +1000, Michael Ellerman wrote:
>> >On Fri, Apr 19, 2013 at 05:32:45PM +0800, Gavin Shan wrote:
>> >> The EOI handler of MSI/MSI-X interrupts for P8 (PHB3) need additional
>> >> steps to handle the P/Q bits in IVE before EOIing the corresponding
>> >> interrupt. The patch changes the EOI handler to cover that.
>> 
>> Thanks for taking the time to review it, Michael. By the way, I think I need
>> to rebase the patch since commit fb1b55d654a7038ca6337fbf55839a308c9bc1a7
>> ("Using bitmap to manage MSI") has been merged into linux-next.
>> 
>> >> diff --git a/arch/powerpc/sysdev/xics/icp-native.c 
>> >> b/arch/powerpc/sysdev/xics/icp-native.c
>> >> index 48861d3..289355e 100644
>> >> --- a/arch/powerpc/sysdev/xics/icp-native.c
>> >> +++ b/arch/powerpc/sysdev/xics/icp-native.c
>> >> @@ -27,6 +27,10 @@
>> >>  #include 
>> >>  #include 
>> >>  
>> >> +#if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_PCI_MSI)
>> >> +extern int pnv_pci_msi_eoi(unsigned int hw_irq);
>> >> +#endif
>> >
>> >You don't need to #ifdef the extern. But it should be in a header, not
>> >here.
>> >
>> 
>> Ok. I'll put it into asm/xics.h, but I want to confirm that we don't need
>> the #ifdef when moving it there?
>
>No you don't need it #ifdef'd. It's just extra noise in the file, and
>doesn't really add anything IMHO.
>

Michael, I'm a bit confused about your point. asm/xics.h is shared between
the PowerNV and pSeries platforms, and pnv_pci_msi_eoi() is only implemented
on the PowerNV platform, so the code should look like this (with the newly
introduced option CONFIG_POWERNV_MSI):

#ifdef CONFIG_POWERNV_MSI
extern int pnv_pci_msi_eoi(unsigned int hw_irq);
#endif

Thanks,
Gavin



[PATCH] macintosh: use %*ph to print small buffers

2013-04-22 Thread Andy Shevchenko
Signed-off-by: Andy Shevchenko 
---
 drivers/macintosh/smu.c | 6 +-
 drivers/macintosh/via-pmu.c | 5 +++--
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/macintosh/smu.c b/drivers/macintosh/smu.c
index 9c6b964..b3b2d36 100644
--- a/drivers/macintosh/smu.c
+++ b/drivers/macintosh/smu.c
@@ -120,11 +120,7 @@ static void smu_start_cmd(void)
 
DPRINTK("SMU: starting cmd %x, %d bytes data\n", cmd->cmd,
cmd->data_len);
-   DPRINTK("SMU: data buffer: %02x %02x %02x %02x %02x %02x %02x %02x\n",
-   ((u8 *)cmd->data_buf)[0], ((u8 *)cmd->data_buf)[1],
-   ((u8 *)cmd->data_buf)[2], ((u8 *)cmd->data_buf)[3],
-   ((u8 *)cmd->data_buf)[4], ((u8 *)cmd->data_buf)[5],
-   ((u8 *)cmd->data_buf)[6], ((u8 *)cmd->data_buf)[7]);
+   DPRINTK("SMU: data buffer: %8ph\n", cmd->data_buf);
 
/* Fill the SMU command buffer */
smu->cmd_buf->cmd = cmd->cmd;
diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index c31fbab..283e1b5 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -750,8 +750,9 @@ done_battery_state_smart(struct adb_request* req)
voltage = (req->reply[8] << 8) | req->reply[9];
break;
default:
-   printk(KERN_WARNING "pmu.c : unrecognized 
battery info, len: %d, %02x %02x %02x %02x\n",
-   req->reply_len, req->reply[0], 
req->reply[1], req->reply[2], req->reply[3]);
+   pr_warn("pmu.c: unrecognized battery info, "
+   "len: %d, %4ph\n", req->reply_len,
+  req->reply);
break;
}
}
-- 
1.8.2.rc0.22.gb3600c3



Re: [PATCH -V6 16/27] mm/THP: HPAGE_SHIFT is not a #define on some arch

2013-04-22 Thread Andrea Arcangeli
On Mon, Apr 22, 2013 at 03:30:50PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT
> and other derived values like HPAGE_PMD_ORDER are not constants. So move
> those to hugepage_init().
> 
> Cc: Andrea Arcangeli 
> 
> Reviewed-by: David Gibson 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Andrea Arcangeli  



Re: [PATCH -V6 17/27] mm/THP: Add pmd args to pgtable deposit and withdraw APIs

2013-04-22 Thread Andrea Arcangeli
On Mon, Apr 22, 2013 at 03:30:51PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> This will be later used by powerpc THP support. In powerpc we want to use
> pgtable for storing the hash index values. So instead of adding them to
> mm_context list, we would like to store them in the second half of pmd
> 
> Cc: Andrea Arcangeli 

*snip*

>  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> + pgtable_t pgtable)
>  {
>   assert_spin_locked(&mm->page_table_lock);
>  
> @@ -141,7 +142,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, 
> pgtable_t pgtable)
>  #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /* no "address" argument so destroys page coloring of some arch */
> -pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  {
>   pgtable_t pgtable;

This will add micro overhead, with more variables put in certain regs or on
the stack. The micro overhead could be optimized away by wrapping the call
with a generic and per-arch header and by adding a __ prefix to the above one
in the generic .c file. I'm neutral, but I pointed it out so others are free
to comment on it.
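
A rough sketch of the indirection described above, in case it helps the
discussion (the __-prefixed name and the wrapper are hypothetical, not
something in this series):

/* mm/pgtable-generic.c keeps the old two-argument body under a __ name */
extern void __pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable);

/* generic header: arches that don't need pmdp get a wrapper that drops it,
 * so no extra argument is passed on the generic path */
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
static inline void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
					      pgtable_t pgtable)
{
	/* the generic implementation never looks at pmdp */
	__pgtable_trans_huge_deposit(mm, pgtable);
}
#endif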

Reviewed-by: Andrea Arcangeli 


Re: [PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations

2013-04-22 Thread Andrea Arcangeli
Hi,

On Mon, Apr 22, 2013 at 03:30:52PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" 
> 
> For architectures like ppc64 we look at deposited pgtable when
> calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw
> after finishing pmdp related operations.
> 
> Cc: Andrea Arcangeli 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/huge_memory.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 84f3180..2a43782 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1363,9 +1363,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   struct page *page;
>   pgtable_t pgtable;
>   pmd_t orig_pmd;
> - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> +
>   orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
>   tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> + pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
>   if (is_huge_zero_pmd(orig_pmd)) {
>   tlb->mm->nr_ptes--;
>   spin_unlock(&tlb->mm->page_table_lock);

I think an inline comment (not only in the commit msg) is in order here.
Otherwise it's hard to expect others to be aware of this arch detail when
they read the code later, so it would be prone to breaking without a comment.

Thanks,
Andrea


Re: [PATCH] perf: Power7: Make CPI stack events available in sysfs

2013-04-22 Thread Sukadev Bhattiprolu
Michael Ellerman [mich...@ellerman.id.au] wrote:
| On Sat, Apr 06, 2013 at 09:48:03AM -0700, Sukadev Bhattiprolu wrote:
| > From bdeacf7175241f6c79b5b2be0fa6b20b0d0b7d1c Mon Sep 17 00:00:00 2001
| > From: Sukadev Bhattiprolu 
| > Date: Sat, 6 Apr 2013 08:48:26 -0700
| > Subject: [PATCH] perf: Power7: Make CPI stack events available in sysfs
| > 
| > A set of Power7 events are often used for Cycles Per Instruction (CPI) stack
| > analysis. Make these events available in sysfs (/sys/devices/cpu/events/) so
| > they can be identified using their symbolic names:
| > 
| > perf stat -e 'cpu/PM_CMPLU_STALL_DCACHE_MISS/' /bin/ls
| 
| Should we take these two via the powerpc tree? Or do you want to take
| them Arnaldo?

I think it can go through the powerpc tree since it is all arch-specific.

Sukadev



Re: [PATCH v8 0/3] of/pci: Provide common support for PCI DT parsing

2013-04-22 Thread Jason Cooper
On Mon, Apr 22, 2013 at 11:41:32AM +0100, Andrew Murray wrote:
> This patchset factors out duplicated code associated with parsing PCI
> DT "ranges" properties across the architectures and introduces a
> "ranges" parser. This parser "of_pci_range_parser" can be used directly
> by ARM host bridge drivers enabling them to obtain ranges from device
> trees.
> 
> I've included the Reviewed-by, Tested-by and Acked-by's received from v5/v6/v7
> in this patchset; earlier versions of this patchset (v3) have been tested-by:
> 
> Thierry Reding 
> Jingoo Han 
> 
> I've tested that this patchset builds and runs on ARM and that it builds on
> PowerPC, x86_64 and MIPS.

Andrew,

Unfortunately, the mvebu/drivers branch containing your series had to be
dropped from arm-soc for v3.10.  This was not due to your series, but
since arm-soc's granularity is branches, your series was caught in the
drop.

As the mvebu-pcie driver is now v3.11 material, I have taken the
opportunity to upgrade from your v7 patchset to v8.  You can find the
whole branch at mvebu-next/pcie.

mvebu-next/pcie *will* be rebased onto v3.9 once it drops.  Several
dependencies will be removed (since they will have been merged into
v3.9).

Once the rebase is done, I'll send a pull request to Arnd and Olof so we
can get as many cycles on -next as possible.

thx,

Jason.

> 
> Compared to the v7 sent by Andrew Murray, the following changes have been made
> (please note that the first patch is unchanged from v7):
> 
>  * Rename of_pci_range_parser to of_pci_range_parser_init and
>of_pci_process_ranges to of_pci_range_parser_one as suggested by Grant
>Likely.
> 
>  * Reverted back to using a switch statement instead of if/else in
>pci_process_bridge_OF_ranges. Grant Likely highlighted this change from
>the original code which was unnecessary.
> 
>  * Squashed in a patch provided by Gabor Juhos which fixes build errors on
>MIPS found in the last patchset.
> 
> Compared to the v6 sent by Andrew Murray, the following changes have
> been made in response to build errors/warnings:
> 
>  * Inclusion of linux/of_address.h in of_pci.c as suggested by Michal
>Simek to prevent compilation failures on Microblaze (and others) and his
>ack.
> 
>  * Use of externs, static inlines and a typo in linux/of_address.h in response
>to linker errors (multiple definition) on x86_64 as spotted by a kbuild 
> test
>robot on (jcooper/linux.git mvebu/drivers)
> 
>  * Add EXPORT_SYMBOL_GPL to of_pci_range_parser function to be consistent
>with of_pci_process_ranges function
> 
> Compared to the v5 sent by Andrew Murray, the following changes have
> been made:
> 
>  * Use of CONFIG_64BIT instead of CONFIG_[a32bitarch] as suggested by
>Rob Herring in drivers/of/of_pci.c
> 
>  * Added forward declaration of struct pci_controller in linux/of_pci.h
>to prevent compiler warning as suggested by Thomas Petazzoni
> 
>  * Improved error checking (!range check), removal of unnecessary be32_to_cpup
>call, improved formatting of struct of_pci_range_parser layout and
>replacement of macro with a static inline. All suggested by Rob Herring.
> 
> Compared to the v4 (incorrectly labelled v3) sent by Andrew Murray,
> the following changes have been made:
> 
>  * Split the patch as suggested by Rob Herring
> 
> Compared to the v3 sent by Andrew Murray, the following changes have
> been made:
> 
>  * Unify and move duplicate pci_process_bridge_OF_ranges functions to
>drivers/of/of_pci.c as suggested by Rob Herring
> 
>  * Fix potential build errors with Microblaze/MIPS
> 
> Compared to "[PATCH v5 01/17] of/pci: Provide support for parsing PCI DT
> ranges property", the following changes have been made:
> 
>  * Correct use of IORESOURCE_* as suggested by Russell King
> 
>  * Improved interface and naming as suggested by Thierry Reding
> 
> Compared to the v2 sent by Andrew Murray, Thomas Petazzoni did:
> 
>  * Add a memset() on the struct of_pci_range_iter when starting the
>for loop in for_each_pci_range(). Otherwise, with an uninitialized
>of_pci_range_iter, of_pci_process_ranges() may crash.
> 
>  * Add parenthesis around 'res', 'np' and 'iter' in the
>for_each_of_pci_range macro definitions. Otherwise, passing
>something like &foobar as 'res' didn't work.
> 
>  * Rebased on top of 3.9-rc2, which required fixing a few conflicts in
>the Microblaze code.
> 
> v2:
>   This follows on from suggestions made by Grant Likely
>   (marc.info/?l=linux-kernel&m=136079602806328)
> 
> Andrew Murray (3):
>   of/pci: Unify pci_process_bridge_OF_ranges from Microblaze and
> PowerPC
>   of/pci: Provide support for parsing PCI DT ranges property
>   of/pci: mips: convert to common of_pci_range_parser
> 
>  arch/microblaze/include/asm/pci-bridge.h |5 +-
>  arch/microblaze/pci/pci-common.c |  192 
> --
>  arch/mips/pci/pci.c  |   51 +++-
>  arch/powerpc/include/asm/pci-bridge.h  

[PATCH v3 0/12] NUMA CPU Reconfiguration using PRRN

2013-04-22 Thread Nathan Fontenot
Newer firmware on Power systems can transparently reassign platform resources
(CPU and Memory) in use. For instance, if a processor or memory unit is
predicted to fail, the platform may transparently move the processing to an
equivalent unused processor or the memory state to an equivalent unused
memory unit. However, reassigning resources across NUMA boundaries may alter
the performance of the partition. When such reassignment is necessary, the
Platform Resource Reassignment Notification (PRRN) option provides a
mechanism to inform the Linux kernel of changes to the NUMA affinity of
its platform resources.

PRRN Events are RTAS events sent up through the event-scan mechanism on
Power. When these events are received, the kernel needs to get the updated
device tree affinity information for the affected CPUs/memory via the
rtas update-nodes and update-properties calls. This information is then
used to update the NUMA affinity of the CPUs/Memory in the kernel.

This patch set adds the ability to recognize PRRN events, update the device
tree and kernel information for CPUs (memory will be handled in a later
patch), and add an interface to enable/disable topology updates from /proc.

Additionally, these updates solve an existing problem with the VPHN (Virtual
Processor Home Node) capability and allow us to re-enable this feature.

Nathan Fontenot

 arch/powerpc/include/asm/firmware.h   |3 
 arch/powerpc/include/asm/prom.h   |   46 ++--
 arch/powerpc/include/asm/rtas.h   |2 
 arch/powerpc/kernel/prom_init.c   |   98 ++
 arch/powerpc/kernel/rtasd.c   |   37 +++
 arch/powerpc/mm/numa.c|  214 +++---
 arch/powerpc/platforms/pseries/firmware.c |1 
 arch/powerpc/platforms/pseries/mobility.c |   24 +-
 powerpc/arch/powerpc/include/asm/firmware.h   |4 
 powerpc/arch/powerpc/include/asm/machdep.h|2 
 powerpc/arch/powerpc/include/asm/prom.h   |   73 +++
 powerpc/arch/powerpc/include/asm/rtas.h   |1 
 powerpc/arch/powerpc/include/asm/topology.h   |5 
 powerpc/arch/powerpc/kernel/prom_init.c   |2 
 powerpc/arch/powerpc/kernel/rtas.c|   10 +
 powerpc/arch/powerpc/kernel/rtasd.c   |7 
 powerpc/arch/powerpc/mm/numa.c|   62 ++
 powerpc/arch/powerpc/platforms/pseries/firmware.c |   49 -
 powerpc/arch/powerpc/platforms/pseries/mobility.c |   20 +-
 powerpc/arch/powerpc/platforms/pseries/pseries.h  |5 
 powerpc/arch/powerpc/platforms/pseries/setup.c|   40 ++--
 21 files changed, 500 insertions(+), 205 deletions(-)

Updates for v3 of the patchset:

1/12 - Updated to use a ppc_md interface to invoke device tree updates, this
corrects the build break previously seen in patch 2/12 for non-pseries
platforms.

2/12 - New patch in the series to correct the parsing of the buffer returned
from ibm,update-properties rtas call.

5/12 - The parsing of architecture vector 5 has been made more efficient.

7/12 - Correct the #define used in the call to firmware_has_feature()

8/12 - Updated calling of stop_machine() to only call it once per PRRN event.

12/12 - Added inclusion of topology.h to rtasd.c to correct a build failure
on non-pseries platforms.



[PATCH v3 1/12] Create a powerpc update_devicetree interface

2013-04-22 Thread Nathan Fontenot
Newer firmware on Power systems can transparently reassign platform resources
(CPU and Memory) in use. For instance, if a processor or memory unit is
predicted to fail, the platform may transparently move the processing to an
equivalent unused processor or the memory state to an equivalent unused
memory unit. However, reassigning resources across NUMA boundaries may alter
the performance of the partition. When such reassignment is necessary, the
Platform Resource Reassignment Notification (PRRN) option provides a
mechanism to inform the Linux kernel of changes to the NUMA affinity of
its platform resources.

When rtasd receives a PRRN event, it needs to make a series of RTAS
calls (ibm,update-nodes and ibm,update-properties) to retrieve the
updated device tree information. These calls are already handled in the
pseries_devicetree_update() routine used in partition migration.

This patch exposes a method for updating the device tree via
ppc_md.update_devicetree that takes a single 32-bit value as a parameter.
For pseries platforms this is the existing pseries_devicetree_update routine,
which is updated to take the new parameter, a scope value indicating the
reason for making the RTAS calls. This parameter is
required by the ibm,update-nodes/ibm,update-properties RTAS calls, and
the appropriate value is contained within the RTAS event for PRRN
notifications. In pseries_devicetree_update() it was previously
hard-coded to 1, the scope value for partition migration.
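
Roughly, the call pattern this enables looks like the following sketch. The
caller below is hypothetical and only mirrors what patch 3/12 of this series
does; the migration path keeps calling pseries_devicetree_update(MIGRATION_SCOPE)
directly:

/* hypothetical caller for the PRRN case */
static void example_devicetree_refresh(s32 scope_from_rtas_log)
{
	/* PRRN passes the negative of the scope carried in the RTAS event */
	if (ppc_md.update_devicetree)
		ppc_md.update_devicetree(-scope_from_rtas_log);
}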

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/machdep.h|2 ++
 arch/powerpc/include/asm/rtas.h   |1 +
 arch/powerpc/kernel/rtas.c|   10 ++
 arch/powerpc/platforms/pseries/mobility.c |   24 +++-
 4 files changed, 28 insertions(+), 9 deletions(-)

Index: powerpc/arch/powerpc/include/asm/rtas.h
===
--- powerpc.orig/arch/powerpc/include/asm/rtas.h2013-04-15 
09:18:10.0 -0500
+++ powerpc/arch/powerpc/include/asm/rtas.h 2013-04-17 12:58:33.0 
-0500
@@ -276,6 +276,7 @@
const char *uname, int depth, void *data);
 
 extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
+extern int update_devicetree(s32 scope);
 
 #ifdef CONFIG_PPC_RTAS_DAEMON
 extern void rtas_cancel_event_scan(void);
Index: powerpc/arch/powerpc/platforms/pseries/mobility.c
===
--- powerpc.orig/arch/powerpc/platforms/pseries/mobility.c  2013-04-15 
09:18:10.0 -0500
+++ powerpc/arch/powerpc/platforms/pseries/mobility.c   2013-04-17 
13:01:08.0 -0500
@@ -19,6 +19,7 @@
 #include 
 
 #include 
+#include 
 #include "pseries.h"
 
 static struct kobject *mobility_kobj;
@@ -37,14 +38,16 @@
 #define UPDATE_DT_NODE 0x0200
 #define ADD_DT_NODE0x0300
 
-static int mobility_rtas_call(int token, char *buf)
+#define MIGRATION_SCOPE(1)
+
+static int mobility_rtas_call(int token, char *buf, s32 scope)
 {
int rc;
 
spin_lock(&rtas_data_buf_lock);
 
memcpy(rtas_data_buf, buf, RTAS_DATA_BUF_SIZE);
-   rc = rtas_call(token, 2, 1, NULL, rtas_data_buf, 1);
+   rc = rtas_call(token, 2, 1, NULL, rtas_data_buf, scope);
memcpy(buf, rtas_data_buf, RTAS_DATA_BUF_SIZE);
 
spin_unlock(&rtas_data_buf_lock);
@@ -123,7 +126,7 @@
return 0;
 }
 
-static int update_dt_node(u32 phandle)
+static int update_dt_node(u32 phandle, s32 scope)
 {
struct update_props_workarea *upwa;
struct device_node *dn;
@@ -151,7 +154,8 @@
upwa->phandle = phandle;
 
do {
-   rc = mobility_rtas_call(update_properties_token, rtas_buf);
+   rc = mobility_rtas_call(update_properties_token, rtas_buf,
+   scope);
if (rc < 0)
break;
 
@@ -219,7 +223,7 @@
return rc;
 }
 
-static int pseries_devicetree_update(void)
+static int pseries_devicetree_update(s32 scope)
 {
char *rtas_buf;
u32 *data;
@@ -235,7 +239,7 @@
return -ENOMEM;
 
do {
-   rc = mobility_rtas_call(update_nodes_token, rtas_buf);
+   rc = mobility_rtas_call(update_nodes_token, rtas_buf, scope);
if (rc && rc != 1)
break;
 
@@ -256,7 +260,7 @@
delete_dt_node(phandle);
break;
case UPDATE_DT_NODE:
-   update_dt_node(phandle);
+   update_dt_node(phandle, scope);
break;
case ADD_DT_NODE:
drc_index = *data++;
@@ -276,7 +280,7 @@
int rc;
int activate_fw_token;
 
-   rc = pseries_

[PATCH v3 2/12] Correct buffer parsing in update-properties

2013-04-22 Thread Nathan Fontenot
Correct parsing of the buffer returned from ibm,update-properties. The first
element is a length and the path to the property, which is slightly different
from the list of properties in the buffer, so we need to handle it
specifically.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/platforms/pseries/mobility.c |   20 
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: powerpc/arch/powerpc/platforms/pseries/mobility.c
===
--- powerpc.orig/arch/powerpc/platforms/pseries/mobility.c  2013-04-17 
13:27:23.0 -0500
+++ powerpc/arch/powerpc/platforms/pseries/mobility.c   2013-04-17 
13:28:58.0 -0500
@@ -135,6 +135,7 @@
char *prop_data;
char *rtas_buf;
int update_properties_token;
+   u32 vd;
 
update_properties_token = rtas_token("ibm,update-properties");
if (update_properties_token == RTAS_UNKNOWN_SERVICE)
@@ -161,13 +162,24 @@
 
prop_data = rtas_buf + sizeof(*upwa);
 
-   for (i = 0; i < upwa->nprops; i++) {
+   /* The first element of the buffer is the path of the node
+* being updated in the form of a 8 byte string length
+* followed by the string. Skip past this to get to the
+* properties being updated.
+*/
+   vd = *prop_data++;
+   prop_data += vd;
+
+   /* The path we skipped over is counted as one of the elements
+* returned so start counting at one.
+*/
+   for (i = 1; i < upwa->nprops; i++) {
char *prop_name;
-   u32 vd;
 
-   prop_name = prop_data + 1;
+   prop_name = prop_data;
prop_data += strlen(prop_name) + 1;
-   vd = *prop_data++;
+   vd = *(u32 *)prop_data;
+   prop_data += sizeof(vd);
 
switch (vd) {
case 0x:



[PATCH v3 3/12] Add PRRN event handler

2013-04-22 Thread Nathan Fontenot
From: Jesse Larrew 

A PRRN event is signaled via the RTAS event-scan mechanism, which
returns a Hot Plug Event message "fixed part" indicating "Platform
Resource Reassignment". In response to the Hot Plug Event message,
we must call ibm,update-nodes to determine which resources were
reassigned and then ibm,update-properties to obtain the new affinity
information about those resources.

The PRRN event-scan RTAS message contains only the "fixed part" with
the "Type" field set to the value 160 and no Extended Event Log. The
four-byte Extended Event Log Length field is repurposed (since no
Extended Event Log message is included) to pass the "scope" parameter
that causes the ibm,update-nodes to return the nodes affected by the
specific resource reassignment.

This patch adds a handler for PRRN RTAS events. The function
pseries_devicetree_update() (from mobility.c) is used to make the
ibm,update-nodes/ibm,update-properties RTAS calls. Updating the NUMA maps
(handled by a subsequent patch) will require significant processing,
so pseries_devicetree_update() is called from an asynchronous workqueue
to allow event processing to continue. 

PRRN RTAS events on pseries systems are rare events that have to be
initiated from the HMC console for the system by an IBM tech. This allows
us to assume that these events are widely spaced. Additionally, all work
on the queue is flushed before handling any new work to ensure that only one
event is in flight at a time.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/rtas.h |2 ++
 arch/powerpc/kernel/rtasd.c |   37 -
 2 files changed, 38 insertions(+), 1 deletion(-)

Index: powerpc/arch/powerpc/include/asm/rtas.h
===
--- powerpc.orig/arch/powerpc/include/asm/rtas.h2013-04-17 
12:58:33.0 -0500
+++ powerpc/arch/powerpc/include/asm/rtas.h 2013-04-17 13:24:06.0 
-0500
@@ -143,6 +143,8 @@
 #define RTAS_TYPE_PMGM_TIME_ALARM  0x6f
 #define RTAS_TYPE_PMGM_CONFIG_CHANGE   0x70
 #define RTAS_TYPE_PMGM_SERVICE_PROC0x71
+/* Platform Resource Reassignment Notification */
+#define RTAS_TYPE_PRRN 0xA0
 
 /* RTAS check-exception vector offset */
 #define RTAS_VECTOR_EXTERNAL_INTERRUPT 0x500
Index: powerpc/arch/powerpc/kernel/rtasd.c
===
--- powerpc.orig/arch/powerpc/kernel/rtasd.c2013-04-17 12:55:11.0 
-0500
+++ powerpc/arch/powerpc/kernel/rtasd.c 2013-04-17 13:27:00.0 -0500
@@ -87,6 +87,8 @@
return "Resource Deallocation Event";
case RTAS_TYPE_DUMP:
return "Dump Notification Event";
+   case RTAS_TYPE_PRRN:
+   return "Platform Resource Reassignment Event";
}
 
return rtas_type[0];
@@ -265,7 +267,40 @@
spin_unlock_irqrestore(&rtasd_log_lock, s);
return;
}
+}
+
+static s32 update_scope;
+
+static void prrn_work_fn(struct work_struct *work)
+{
+   /*
+* For PRRN, we must pass the negative of the scope value in
+* the RTAS event.
+*/
+   if (ppc_md.update_devicetree)
+   ppc_md.update_devicetree(-update_scope);
+}
+
+static DECLARE_WORK(prrn_work, prrn_work_fn);
+
+void prrn_schedule_update(u32 scope)
+{
+   flush_work(&prrn_work);
+   update_scope = scope;
+   schedule_work(&prrn_work);
+}
+
+static void pseries_handle_event(const struct rtas_error_log *log)
+{
+   pSeries_log_error((char *)log, ERR_TYPE_RTAS_LOG, 0);
+
+   if (log->type == RTAS_TYPE_PRRN)
+   /* For PRRN Events the extended log length is used to denote
+* the scope for calling rtas update-nodes.
+*/
+   prrn_schedule_update(log->extended_log_length);
 
+   return;
 }
 
 static int rtas_log_open(struct inode * inode, struct file * file)
@@ -389,7 +424,7 @@
}
 
if (error == 0)
-   pSeries_log_error(logdata, ERR_TYPE_RTAS_LOG, 0);
+   pseries_handle_event((struct rtas_error_log *)logdata);
 
} while(error == 0);
 }



[PATCH v3 4/12] Move architecture vector definitions to prom.h

2013-04-22 Thread Nathan Fontenot
As part of handling PRRN events we will need to check the vector 5 portion
of the architecture bits reported in the device tree to ensure that PRRN
event handling is enabled. In order to do this, firmware_has_feature() is
updated (in a subsequent patch) to make this check. To avoid having to
re-define bits in the architecture vector, the bits are moved to prom.h.

This patch is the first step in updating firmware_has_feature
by simply moving the bit definitions from prom_init.c to asm/prom.h.
There are no functional changes.

Signed-off-by: Nathan Fontenot 

---
 arch/powerpc/include/asm/prom.h |   73 ++
 arch/powerpc/kernel/prom_init.c |   75 +++-
 2 files changed, 79 insertions(+), 69 deletions(-)

Index: powerpc/arch/powerpc/include/asm/prom.h
===
--- powerpc.orig/arch/powerpc/include/asm/prom.h2013-04-16 
21:25:16.0 -0500
+++ powerpc/arch/powerpc/include/asm/prom.h 2013-04-17 13:43:13.0 
-0500
@@ -74,6 +74,79 @@
 #define DRCONF_MEM_AI_INVALID  0x0040
 #define DRCONF_MEM_RESERVED0x0080
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+/*
+ * There are two methods for telling firmware what our capabilities are.
+ * Newer machines have an "ibm,client-architecture-support" method on the
+ * root node.  For older machines, we have to call the "process-elf-header"
+ * method in the /packages/elf-loader node, passing it a fake 32-bit
+ * ELF header containing a couple of PT_NOTE sections that contain
+ * structures that contain various information.
+ */
+
+/* New method - extensible architecture description vector. */
+
+/* Option vector bits - generic bits in byte 1 */
+#define OV_IGNORE  0x80/* ignore this vector */
+#define OV_CESSATION_POLICY0x40/* halt if unsupported option present*/
+
+/* Option vector 1: processor architectures supported */
+#define OV1_PPC_2_00   0x80/* set if we support PowerPC 2.00 */
+#define OV1_PPC_2_01   0x40/* set if we support PowerPC 2.01 */
+#define OV1_PPC_2_02   0x20/* set if we support PowerPC 2.02 */
+#define OV1_PPC_2_03   0x10/* set if we support PowerPC 2.03 */
+#define OV1_PPC_2_04   0x08/* set if we support PowerPC 2.04 */
+#define OV1_PPC_2_05   0x04/* set if we support PowerPC 2.05 */
+#define OV1_PPC_2_06   0x02/* set if we support PowerPC 2.06 */
+#define OV1_PPC_2_07   0x01/* set if we support PowerPC 2.07 */
+
+/* Option vector 2: Open Firmware options supported */
+#define OV2_REAL_MODE  0x20/* set if we want OF in real mode */
+
+/* Option vector 3: processor options supported */
+#define OV3_FP 0x80/* floating point */
+#define OV3_VMX0x40/* VMX/Altivec */
+#define OV3_DFP0x20/* decimal FP */
+
+/* Option vector 4: IBM PAPR implementation */
+#define OV4_MIN_ENT_CAP0x01/* minimum VP entitled capacity 
*/
+
+/* Option vector 5: PAPR/OF options supported */
+#define OV5_LPAR   0x80/* logical partitioning supported */
+#define OV5_SPLPAR 0x40/* shared-processor LPAR supported */
+/* ibm,dynamic-reconfiguration-memory property supported */
+#define OV5_DRCONF_MEMORY  0x20
+#define OV5_LARGE_PAGES0x10/* large pages supported */
+#define OV5_DONATE_DEDICATE_CPU0x02/* donate dedicated CPU support 
*/
+/* PCIe/MSI support.  Without MSI full PCIe is not supported */
+#ifdef CONFIG_PCI_MSI
+#define OV5_MSI0x01/* PCIe/MSI support */
+#else
+#define OV5_MSI0x00
+#endif /* CONFIG_PCI_MSI */
+#ifdef CONFIG_PPC_SMLPAR
+#define OV5_CMO0x80/* Cooperative Memory 
Overcommitment */
+#define OV5_XCMO   0x40/* Page Coalescing */
+#else
+#define OV5_CMO0x00
+#define OV5_XCMO   0x00
+#endif
+#define OV5_TYPE1_AFFINITY 0x80/* Type 1 NUMA affinity */
+#define OV5_PFO_HW_RNG 0x80/* PFO Random Number Generator */
+#define OV5_PFO_HW_842 0x40/* PFO Compression Accelerator */
+#define OV5_PFO_HW_ENCR0x20/* PFO Encryption Accelerator */
+#define OV5_SUB_PROCESSORS 0x01/* 1,2,or 4 Sub-Processors supported */
+
+/* Option Vector 6: IBM PAPR hints */
+#define OV6_LINUX  0x02/* Linux is our OS */
+
+/*
+ * The architecture vector has an array of PVR mask/value pairs,
+ * followed by # option vectors - 1, followed by the option vectors.
+ */
+extern unsigned char ibm_architecture_vec[];
+#endif
+
 /* These includes are put at the bottom because they may contain things
  * that are overridden by this file.  Ideally they shouldn't be included
  * by this file, but there are a bunch of .c files that curre

[PATCH v3 5/12] Update firmware_has_feature() to check architecture bits

2013-04-22 Thread Nathan Fontenot
The firmware_has_feature() function makes it easy to check for supported
features of the hypervisor. This patch extends the capability of the
firmware_has_feature() function to include checking for specified bits
in vector 5 of the architecture vector, as reported in the device tree.

As part of this, the #defines used for the architecture vector are
re-defined such that the vector 5 options have the vector
index and the feature bits encoded into them. This makes for a much
simpler design to update firmware_has_feature() to check for bits
in the architecture vector.
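
As a rough sketch of how the encoded defines are meant to be consumed (a
hypothetical helper for illustration, not the exact code in this patch), a
vector 5 check reduces to:

/* vec5/len: the ibm,architecture-vec-5 property bytes and their length */
static inline bool vec5_has_feature(const unsigned char *vec5, unsigned long len,
				    unsigned int feature)
{
	unsigned int index = OV5_INDX(feature);	/* which byte of vector 5 */

	if (index >= len)		/* vector shorter than expected */
		return false;

	return (vec5[index] & OV5_FEAT(feature)) != 0;
}

/* e.g. vec5_has_feature(vec5, len, OV5_TYPE1_AFFINITY) replaces the old
 * open-coded vec5[VEC5_AFFINITY_BYTE] & VEC5_AFFINITY test in numa.c */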

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/firmware.h   |4 +-
 arch/powerpc/include/asm/prom.h   |   45 ---
 arch/powerpc/kernel/prom_init.c   |   23 ++
 arch/powerpc/platforms/pseries/firmware.c |   49 +-
 arch/powerpc/platforms/pseries/pseries.h  |5 ++-
 arch/powerpc/platforms/pseries/setup.c|   40 
 6 files changed, 113 insertions(+), 53 deletions(-)

Index: powerpc/arch/powerpc/include/asm/prom.h
===
--- powerpc.orig/arch/powerpc/include/asm/prom.h2013-04-17 
13:43:13.0 -0500
+++ powerpc/arch/powerpc/include/asm/prom.h 2013-04-17 13:51:46.0 
-0500
@@ -111,31 +111,27 @@
 /* Option vector 4: IBM PAPR implementation */
 #define OV4_MIN_ENT_CAP0x01/* minimum VP entitled capacity 
*/
 
-/* Option vector 5: PAPR/OF options supported */
-#define OV5_LPAR   0x80/* logical partitioning supported */
-#define OV5_SPLPAR 0x40/* shared-processor LPAR supported */
+/* Option vector 5: PAPR/OF options supported
+ * These bits are also used for the platform_has_feature() call so
+ * we encode the vector index in the define and use the OV5_FEAT()
+ * and OV5_INDX() macros to extract the desired information.
+ */
+#define OV5_FEAT(x)((x) & 0xff)
+#define OV5_INDX(x)((x) >> 8)
+#define OV5_LPAR   0x0280  /* logical partitioning supported */
+#define OV5_SPLPAR 0x0240  /* shared-processor LPAR supported */
 /* ibm,dynamic-reconfiguration-memory property supported */
-#define OV5_DRCONF_MEMORY  0x20
-#define OV5_LARGE_PAGES0x10/* large pages supported */
-#define OV5_DONATE_DEDICATE_CPU0x02/* donate dedicated CPU support 
*/
-/* PCIe/MSI support.  Without MSI full PCIe is not supported */
-#ifdef CONFIG_PCI_MSI
-#define OV5_MSI0x01/* PCIe/MSI support */
-#else
-#define OV5_MSI0x00
-#endif /* CONFIG_PCI_MSI */
-#ifdef CONFIG_PPC_SMLPAR
-#define OV5_CMO0x80/* Cooperative Memory 
Overcommitment */
-#define OV5_XCMO   0x40/* Page Coalescing */
-#else
-#define OV5_CMO0x00
-#define OV5_XCMO   0x00
-#endif
-#define OV5_TYPE1_AFFINITY 0x80/* Type 1 NUMA affinity */
-#define OV5_PFO_HW_RNG 0x80/* PFO Random Number Generator */
-#define OV5_PFO_HW_842 0x40/* PFO Compression Accelerator */
-#define OV5_PFO_HW_ENCR0x20/* PFO Encryption Accelerator */
-#define OV5_SUB_PROCESSORS 0x01/* 1,2,or 4 Sub-Processors supported */
+#define OV5_DRCONF_MEMORY  0x0220
+#define OV5_LARGE_PAGES0x0210  /* large pages supported */
+#define OV5_DONATE_DEDICATE_CPU0x0202  /* donate dedicated CPU support 
*/
+#define OV5_MSI0x0201  /* PCIe/MSI support */
+#define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
+#define OV5_XCMO   0x0440  /* Page Coalescing */
+#define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_PFO_HW_RNG 0x0E80  /* PFO Random Number Generator */
+#define OV5_PFO_HW_842 0x0E40  /* PFO Compression Accelerator */
+#define OV5_PFO_HW_ENCR0x0E20  /* PFO Encryption Accelerator */
+#define OV5_SUB_PROCESSORS 0x0F01  /* 1,2,or 4 Sub-Processors supported */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02/* Linux is our OS */
@@ -145,6 +141,7 @@
  * followed by # option vectors - 1, followed by the option vectors.
  */
 extern unsigned char ibm_architecture_vec[];
+bool platform_has_feature(unsigned int);
 #endif
 
 /* These includes are put at the bottom because they may contain things
Index: powerpc/arch/powerpc/kernel/prom_init.c
===
--- powerpc.orig/arch/powerpc/kernel/prom_init.c2013-04-17 
13:43:13.0 -0500
+++ powerpc/arch/powerpc/kernel/prom_init.c 2013-04-17 13:51:46.0 
-0500
@@ -684,11 +684,21 @@
/* option vector 5: PAPR/OF options */
19 - 2, /* length */
0,  /* don't ignore, don't halt 

[PATCH v3 6/12] Update numa.c to use updated firmware_has_feature()

2013-04-22 Thread Nathan Fontenot
Update the numa code to use the updated firmware_has_feature() when checking
for type 1 affinity.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |   22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)

Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-15 09:18:07.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-15 09:54:59.0 -0500
@@ -291,9 +291,7 @@
 static int __init find_min_common_depth(void)
 {
int depth;
-   struct device_node *chosen;
struct device_node *root;
-   const char *vec5;
 
if (firmware_has_feature(FW_FEATURE_OPAL))
root = of_find_node_by_path("/ibm,opal");
@@ -325,24 +323,10 @@
 
distance_ref_points_depth /= sizeof(int);
 
-#define VEC5_AFFINITY_BYTE 5
-#define VEC5_AFFINITY  0x80
-
-   if (firmware_has_feature(FW_FEATURE_OPAL))
+   if (firmware_has_feature(FW_FEATURE_OPAL) ||
+   firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
+   dbg("Using form 1 affinity\n");
form1_affinity = 1;
-   else {
-   chosen = of_find_node_by_path("/chosen");
-   if (chosen) {
-   vec5 = of_get_property(chosen,
-  "ibm,architecture-vec-5", NULL);
-   if (vec5 && (vec5[VEC5_AFFINITY_BYTE] &
-   VEC5_AFFINITY)) {
-   dbg("Using form 1 affinity\n");
-   form1_affinity = 1;
-   }
-
-   of_node_put(chosen);
-   }
}
 
if (form1_affinity) {



[PATCH v3 7/12] Use stop machine to update cpu maps

2013-04-22 Thread Nathan Fontenot
From: Jesse Larrew 

Platform events such as partition migration or the new PRRN firmware
feature can cause the NUMA characteristics of a CPU to change, and these
changes will be reflected in the device tree nodes for the affected
CPUs.

This patch registers a handler for Open Firmware device tree updates
and reconfigures the CPU and node maps whenever the associativity
changes. Currently, this is accomplished by marking the affected CPUs in
the cpu_associativity_changes_mask and allowing
arch_update_cpu_topology() to retrieve the new associativity information
using hcall_vphn().

Protecting the NUMA cpu maps from concurrent access during an update
operation will be addressed in a subsequent patch in this series.

Signed-off-by: Nathan Fontenot 
---

 arch/powerpc/include/asm/firmware.h   |3 
 arch/powerpc/include/asm/prom.h   |1 
 arch/powerpc/mm/numa.c|   99 ++
 arch/powerpc/platforms/pseries/firmware.c |1 
 4 files changed, 79 insertions(+), 25 deletions(-)

Index: powerpc/arch/powerpc/include/asm/prom.h
===
--- powerpc.orig/arch/powerpc/include/asm/prom.h2013-04-15 
14:03:52.0 -0500
+++ powerpc/arch/powerpc/include/asm/prom.h 2013-04-15 14:04:47.0 
-0500
@@ -128,6 +128,7 @@
 #define OV5_CMO0x0480  /* Cooperative Memory 
Overcommitment */
 #define OV5_XCMO   0x0440  /* Page Coalescing */
 #define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
+#define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
 #define OV5_PFO_HW_RNG 0x0E80  /* PFO Random Number Generator */
 #define OV5_PFO_HW_842 0x0E40  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x0E20  /* PFO Encryption Accelerator */
Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-15 14:04:46.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-15 14:06:20.0 -0500
@@ -1257,7 +1257,8 @@
 static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
 static cpumask_t cpu_associativity_changes_mask;
 static int vphn_enabled;
-static void set_topology_timer(void);
+static int prrn_enabled;
+static void reset_topology_timer(void);
 
 /*
  * Store the current values of the associativity change counters in the
@@ -1293,11 +1294,9 @@
  */
 static int update_cpu_associativity_changes_mask(void)
 {
-   int cpu, nr_cpus = 0;
+   int cpu;
cpumask_t *changes = &cpu_associativity_changes_mask;
 
-   cpumask_clear(changes);
-
for_each_possible_cpu(cpu) {
int i, changed = 0;
u8 *counts = vphn_cpu_change_counts[cpu];
@@ -1311,11 +1310,10 @@
}
if (changed) {
cpumask_set_cpu(cpu, changes);
-   nr_cpus++;
}
}
 
-   return nr_cpus;
+   return cpumask_weight(changes);
 }
 
 /*
@@ -1416,7 +1414,7 @@
unsigned int associativity[VPHN_ASSOC_BUFSIZE] = {0};
struct device *dev;
 
-   for_each_cpu(cpu,&cpu_associativity_changes_mask) {
+   for_each_cpu(cpu, &cpu_associativity_changes_mask) {
vphn_get_associativity(cpu, associativity);
nid = associativity_to_nid(associativity);
 
@@ -1438,6 +1436,7 @@
dev = get_cpu_device(cpu);
if (dev)
kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+   cpumask_clear_cpu(cpu, &cpu_associativity_changes_mask);
changed = 1;
}
 
@@ -1457,37 +1456,80 @@
 
 static void topology_timer_fn(unsigned long ignored)
 {
-   if (!vphn_enabled)
-   return;
-   if (update_cpu_associativity_changes_mask() > 0)
+   if (prrn_enabled && cpumask_weight(&cpu_associativity_changes_mask))
topology_schedule_update();
-   set_topology_timer();
+   else if (vphn_enabled) {
+   if (update_cpu_associativity_changes_mask() > 0)
+   topology_schedule_update();
+   reset_topology_timer();
+   }
 }
 static struct timer_list topology_timer =
TIMER_INITIALIZER(topology_timer_fn, 0, 0);
 
-static void set_topology_timer(void)
+static void reset_topology_timer(void)
 {
topology_timer.data = 0;
topology_timer.expires = jiffies + 60 * HZ;
-   add_timer(&topology_timer);
+   mod_timer(&topology_timer, topology_timer.expires);
+}
+
+static void stage_topology_update(int core_id)
+{
+   cpumask_or(&cpu_associativity_changes_mask,
+   &cpu_associativity_changes_mask, cpu_sibling_mask(core_id));
+   reset_topology_timer();
 }
 
+static int dt_update_callback(struct notifier_block *nb,
+   unsigned long action, void *data)

[PATCH v3 8/12] Use stop machine to update cpu maps

2013-04-22 Thread Nathan Fontenot
The new PRRN firmware feature allows CPU and memory resources to be
transparently reassigned across NUMA boundaries. When this happens, the
kernel must update the node maps to reflect the new affinity information.

Although the NUMA maps can be protected by locking primitives during the
update itself, this is insufficient to prevent concurrent accesses to these
structures. Since cpumask_of_node() hands out a pointer to these
structures, they can still be modified outside of the lock. Furthermore,
tracking down each usage of these pointers and adding locks would be quite
invasive and difficult to maintain.

The approach used is to make a list of affected cpus and call stop_machine
to have the update routine run on each of the affected cpus, allowing them
to update themselves. Each cpu finds itself in the list of cpus and makes
the appropriate updates. We need each cpu to do this for itself to handle
the call to vdso_getcpu_init() that is added in a subsequent patch.

Situations like these are best handled using stop_machine(). Since the NUMA
affinity updates are exceptionally rare events, this approach has the
benefit of not adding any overhead while accessing the NUMA maps during
normal operation.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |   82 ++---
 1 file changed, 64 insertions(+), 18 deletions(-)

Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-17 14:04:12.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-18 09:10:11.0 -0500
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1254,6 +1255,13 @@
 
 /* Virtual Processor Home Node (VPHN) support */
 #ifdef CONFIG_PPC_SPLPAR
+struct topology_update_data {
+   struct topology_update_data *next;
+   unsigned int cpu;
+   int old_nid;
+   int new_nid;
+};
+
 static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
 static cpumask_t cpu_associativity_changes_mask;
 static int vphn_enabled;
@@ -1405,41 +1413,79 @@
 }
 
 /*
+ * Update the CPU maps and sysfs entries for a single CPU when its NUMA
+ * characteristics change. This function doesn't perform any locking and is
+ * only safe to call from stop_machine().
+ */
+static int update_cpu_topology(void *data)
+{
+   struct topology_update_data *update;
+   unsigned long cpu;
+
+   if (!data)
+   return -EINVAL;
+
+   cpu = get_cpu();
+
+   for (update = data; update; update = update->next) {
+   if (cpu != update->cpu)
+   continue;
+
+   unregister_cpu_under_node(update->cpu, update->old_nid);
+   unmap_cpu_from_node(update->cpu);
+   map_cpu_to_node(update->cpu, update->new_nid);
+   register_cpu_under_node(update->cpu, update->new_nid);
+   }
+
+   return 0;
+}
+
+/*
  * Update the node maps and sysfs entries for each cpu whose home node
  * has changed. Returns 1 when the topology has changed, and 0 otherwise.
  */
 int arch_update_cpu_topology(void)
 {
-   int cpu, nid, old_nid, changed = 0;
+   unsigned int cpu, changed = 0;
+   struct topology_update_data *updates, *ud;
unsigned int associativity[VPHN_ASSOC_BUFSIZE] = {0};
struct device *dev;
+   int weight, i = 0;
+
+   weight = cpumask_weight(&cpu_associativity_changes_mask);
+   if (!weight)
+   return 0;
+
+   updates = kzalloc(weight * (sizeof(*updates)), GFP_KERNEL);
+   if (!updates)
+   return 0;
 
for_each_cpu(cpu, &cpu_associativity_changes_mask) {
+   ud = &updates[i++];
+   ud->cpu = cpu;
vphn_get_associativity(cpu, associativity);
-   nid = associativity_to_nid(associativity);
+   ud->new_nid = associativity_to_nid(associativity);
 
-   if (nid < 0 || !node_online(nid))
-   nid = first_online_node;
+   if (ud->new_nid < 0 || !node_online(ud->new_nid))
+   ud->new_nid = first_online_node;
 
-   old_nid = numa_cpu_lookup_table[cpu];
+   ud->old_nid = numa_cpu_lookup_table[cpu];
 
-   /* Disable hotplug while we update the cpu
-* masks and sysfs.
-*/
-   get_online_cpus();
-   unregister_cpu_under_node(cpu, old_nid);
-   unmap_cpu_from_node(cpu);
-   map_cpu_to_node(cpu, nid);
-   register_cpu_under_node(cpu, nid);
-   put_online_cpus();
+   if (i < weight)
+   ud->next = &updates[i];
+   }
+
+   stop_machine(update_cpu_topology, &updates[0], cpu_online_mask);
 
-   dev = get_cpu_device(cpu);
+   for (ud = &updates[0]; ud; ud = ud->next) {
+   dev = get_

[PATCH v3 9/12] Update NUMA VDSO information

2013-04-22 Thread Nathan Fontenot
From: Jesse Larrew 

Commit 18ad51dd34 ("powerpc: Add VDSO version of getcpu") added
vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3.

This patch ensures that this information is also updated when the NUMA
affinity of a cpu changes.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-18 09:10:11.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-22 09:39:02.0 -0500
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int numa_enabled = 1;
 
@@ -1434,6 +1435,7 @@
unregister_cpu_under_node(update->cpu, update->old_nid);
unmap_cpu_from_node(update->cpu);
map_cpu_to_node(update->cpu, update->new_nid);
+   vdso_getcpu_init();
register_cpu_under_node(update->cpu, update->new_nid);
}
 
@@ -1449,6 +1451,7 @@
unsigned int cpu, changed = 0;
struct topology_update_data *updates, *ud;
unsigned int associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   cpumask_t updated_cpus;
struct device *dev;
int weight, i = 0;
 
@@ -1460,6 +1463,8 @@
if (!updates)
return 0;
 
+   cpumask_clear(&updated_cpus);
+
for_each_cpu(cpu, &cpu_associativity_changes_mask) {
ud = &updates[i++];
ud->cpu = cpu;
@@ -1470,12 +1475,13 @@
ud->new_nid = first_online_node;
 
ud->old_nid = numa_cpu_lookup_table[cpu];
+   cpumask_set_cpu(cpu, &updated_cpus);
 
if (i < weight)
ud->next = &updates[i];
}
 
-   stop_machine(update_cpu_topology, &updates[0], cpu_online_mask);
+   stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
 
for (ud = &updates[0]; ud; ud = ud->next) {
dev = get_cpu_device(ud->cpu);



[PATCH v3 10/12] Re-enable Virtual Processor Home Node capabilities

2013-04-22 Thread Nathan Fontenot
From: Jesse Larrew 

The new PRRN firmware feature provides a more convenient and event-driven
interface than VPHN for notifying Linux of changes to the NUMA affinity of
platform resources. However, for practical reasons, it may not be feasible
for some customers to update to the latest firmware. For these customers,
the VPHN feature supported on previous firmware versions may still be the
best option.

The VPHN feature was previously disabled due to races with the load
balancing code when accessing the NUMA cpu maps, but the new stop_machine()
approach protects the NUMA cpu maps from these concurrent accesses. It
should be safe to re-enable this feature now.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-22 09:39:02.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-22 09:46:13.0 -0500
@@ -1572,9 +1572,8 @@
vphn_enabled = 0;
rc = of_reconfig_notifier_register(&dt_update_nb);
}
-   } else if (0 && firmware_has_feature(FW_FEATURE_VPHN) &&
+   } else if (firmware_has_feature(FW_FEATURE_VPHN) &&
   get_lppaca()->shared_proc) {
-   /* Disabled until races with load balancing are fixed */
if (!vphn_enabled) {
prrn_enabled = 0;
vphn_enabled = 1;



[PATCH v3 11/12] Enable PRRN Event handling

2013-04-22 Thread Nathan Fontenot
The Linux kernel and platform firmware negotiate their mutual support
of the PRRN option via the ibm,client-architecture-support interface.
This patch simply sets the appropriate fields in the client architecture
vector to indicate Linux support and will cause the firmware to begin
sending PRRN events via the RTAS event-scan mechanism.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/kernel/prom_init.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: powerpc/arch/powerpc/kernel/prom_init.c
===
--- powerpc.orig/arch/powerpc/kernel/prom_init.c2013-04-18 
09:09:22.0 -0500
+++ powerpc/arch/powerpc/kernel/prom_init.c 2013-04-22 09:49:28.0 
-0500
@@ -698,7 +698,7 @@
 #else
0,
 #endif
-   OV5_FEAT(OV5_TYPE1_AFFINITY),
+   OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
0,
0,
0,

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v3 12/12] Add /proc interface to control topology updates

2013-04-22 Thread Nathan Fontenot
There are instances in which we do not want topology updates to occur.
In order to allow this, a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.

This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.
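
A minimal userspace check of the new file could look like this (a
sketch, assuming the interface lands as described above):

	#include <stdio.h>

	int main(void)
	{
		char state[8] = "";
		FILE *f = fopen("/proc/powerpc/topology_updates", "r");

		if (!f)
			return 1;	/* kernel without this interface */
		if (fgets(state, sizeof(state), f))
			printf("topology updates: %s", state);	/* "on" or "off" */
		fclose(f);

		/* writing "on" or "off" to the same file toggles updating */
		return 0;
	}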

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/topology.h |5 ++
 arch/powerpc/kernel/rtasd.c |7 ++--
 arch/powerpc/mm/numa.c  |   62 +++-
 3 files changed, 71 insertions(+), 3 deletions(-)

Index: powerpc/arch/powerpc/mm/numa.c
===
--- powerpc.orig/arch/powerpc/mm/numa.c 2013-04-22 09:46:13.0 -0500
+++ powerpc/arch/powerpc/mm/numa.c  2013-04-22 09:51:10.0 -0500
@@ -23,6 +23,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -1585,7 +1588,6 @@
 
return rc;
 }
-__initcall(start_topology_update);
 
 /*
  * Disable polling for VPHN associativity changes.
@@ -1604,4 +1606,62 @@
 
return rc;
 }
+
+inline int prrn_is_enabled(void)
+{
+   return prrn_enabled;
+}
+
+static int topology_read(struct seq_file *file, void *v)
+{
+   if (vphn_enabled || prrn_enabled)
+   seq_puts(file, "on\n");
+   else
+   seq_puts(file, "off\n");
+
+   return 0;
+}
+
+static int topology_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, topology_read, NULL);
+}
+
+static ssize_t topology_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *off)
+{
+   char kbuf[4]; /* "on" or "off" plus null. */
+   int read_len;
+
+   read_len = count < 3 ? count : 3;
+   if (copy_from_user(kbuf, buf, read_len))
+   return -EINVAL;
+
+   kbuf[read_len] = '\0';
+
+   if (!strncmp(kbuf, "on", 2))
+   start_topology_update();
+   else if (!strncmp(kbuf, "off", 3))
+   stop_topology_update();
+   else
+   return -EINVAL;
+
+   return count;
+}
+
+static const struct file_operations topology_ops = {
+   .read = seq_read,
+   .write = topology_write,
+   .open = topology_open,
+   .release = single_release
+};
+
+static int topology_update_init(void)
+{
+   start_topology_update();
+   proc_create("powerpc/topology_updates", 644, NULL, &topology_ops);
+
+   return 0;
+}
+device_initcall(topology_update_init);
 #endif /* CONFIG_PPC_SPLPAR */
Index: powerpc/arch/powerpc/include/asm/topology.h
===
--- powerpc.orig/arch/powerpc/include/asm/topology.h2013-04-18 
09:09:21.0 -0500
+++ powerpc/arch/powerpc/include/asm/topology.h 2013-04-22 09:51:10.0 
-0500
@@ -71,6 +71,7 @@
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int start_topology_update(void);
 extern int stop_topology_update(void);
+extern inline int prrn_is_enabled(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -80,6 +81,10 @@
 {
return 0;
 }
+static inline int prrn_is_enabled(void)
+{
+   return 0;
+}
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
Index: powerpc/arch/powerpc/kernel/rtasd.c
===
--- powerpc.orig/arch/powerpc/kernel/rtasd.c2013-04-18 09:09:21.0 
-0500
+++ powerpc/arch/powerpc/kernel/rtasd.c 2013-04-22 09:51:10.0 -0500
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
 static DEFINE_SPINLOCK(rtasd_log_lock);
@@ -294,11 +295,13 @@
 {
pSeries_log_error((char *)log, ERR_TYPE_RTAS_LOG, 0);
 
-   if (log->type == RTAS_TYPE_PRRN)
+   if (log->type == RTAS_TYPE_PRRN) {
/* For PRRN Events the extended log length is used to denote
 * the scope for calling rtas update-nodes.
 */
-   prrn_schedule_update(log->extended_log_length);
+   if (prrn_is_enabled())
+   prrn_schedule_update(log->extended_log_length);
+   }
 
return;
 }

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] [RFC] powerpc: Add VDSO version of time

2013-04-22 Thread Adhemerval Zanella
On 04/18/2013 07:38 PM, Anton Blanchard wrote:
> Since you are only reading one long you shouldn't need to check the
> update count and loop, you will always see a consistent value. The
> system call version of time() just does an unprotected load for example.

Fixed.
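
For context, the distinction Anton is drawing looks roughly like this in
C (a sketch with placeholder names, barriers omitted; the real code is
the assembly below):

	/* illustrative placeholder for the vDSO datapage fields */
	struct tstamp { unsigned long update_count, sec, nsec; };

	/* gettimeofday-style read: multi-word data needs the loop */
	static void read_full(const struct tstamp *d, unsigned long *s,
			      unsigned long *ns)
	{
		unsigned long seq;

		do {
			seq = d->update_count;
			*s  = d->sec;
			*ns = d->nsec;
		} while (seq != d->update_count);
	}

	/* time()-style read: a single aligned word is always consistent,
	 * so one unprotected load is enough */
	static unsigned long read_sec(const struct tstamp *d)
	{
		return d->sec;
	}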

> With the above change and with Michael's comments covered (decent
> changelog entry and Signed-off-by):
>
> Acked-by: Anton Blanchard 

Thanks for the review, below the updated patch:


From: Adhemerval Zanella 

This patch implements the time syscall in the vDSO. The performance speedups
are:

Baseline PPC32: 380 nsec
Baseline PPC64: 350 nsec
vdso PPC32:  20 nsec
vdso PPC64:  20 nsec

Tested on 64 bit build with both 32 bit and 64 bit userland.
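
In C terms the new __kernel_time entry behaves roughly like the sketch
below; the datapage read is a hypothetical helper here, and the real
implementation is the assembly in the patch:

	#include <time.h>

	extern time_t read_xtime_sec_from_datapage(void);	/* hypothetical */

	/* behavioural sketch of __kernel_time(), not the real code */
	time_t kernel_time_sketch(time_t *t)
	{
		time_t sec = read_xtime_sec_from_datapage();

		if (t)
			*t = sec;
		return sec;
	}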

Acked-by: Anton Blanchard 
Signed-off-by: Adhemerval Zanella 
---
 arch/powerpc/kernel/vdso.c|4 
 arch/powerpc/kernel/vdso32/gettimeofday.S |   26 ++
 arch/powerpc/kernel/vdso32/vdso32.lds.S   |1 +
 arch/powerpc/kernel/vdso64/gettimeofday.S |   26 ++
 arch/powerpc/kernel/vdso64/vdso64.lds.S   |1 +
 5 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 1b2076f..d4f463a 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -113,6 +113,10 @@ static struct vdso_patch_def vdso_patches[] = {
CPU_FTR_USE_TB, 0,
"__kernel_get_tbfreq", NULL
},
+   {
+   CPU_FTR_USE_TB, 0,
+   "__kernel_time", NULL
+   },
 };
 
 /*
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S 
b/arch/powerpc/kernel/vdso32/gettimeofday.S
index 4ee09ee..27e2f62 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -181,6 +181,32 @@ V_FUNCTION_END(__kernel_clock_getres)
 
 
 /*
+ * Exact prototype of time()
+ *
+ * time_t time(time *t);
+ *
+ */
+V_FUNCTION_BEGIN(__kernel_time)
+  .cfi_startproc
+   mflrr12
+  .cfi_register lr,r12
+
+   mr  r11,r3  /* r11 holds t */
+   bl  __get_datapage@local
+   mr  r9, r3  /* datapage ptr in r9 */
+
+   lwz r3,STAMP_XTIME+TSPEC_TV_SEC(r9)
+
+   cmplwi  r11,0   /* check if t is NULL */
+   beq 2f
+   stw r3,0(r11)   /* store result at *t */
+2: mtlrr12
+   crclr   cr0*4+so
+   blr
+  .cfi_endproc
+V_FUNCTION_END(__kernel_time)
+
+/*
  * This is the core of clock_gettime() and gettimeofday(),
  * it returns the current time in r3 (seconds) and r4.
  * On entry, r7 gives the resolution of r4, either USEC_PER_SEC
diff --git a/arch/powerpc/kernel/vdso32/vdso32.lds.S 
b/arch/powerpc/kernel/vdso32/vdso32.lds.S
index 43200ba..f223409 100644
--- a/arch/powerpc/kernel/vdso32/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso32/vdso32.lds.S
@@ -150,6 +150,7 @@ VERSION
 #ifdef CONFIG_PPC64
__kernel_getcpu;
 #endif
+   __kernel_time;
 
local: *;
};
diff --git a/arch/powerpc/kernel/vdso64/gettimeofday.S 
b/arch/powerpc/kernel/vdso64/gettimeofday.S
index e97a9a0..a76b4af 100644
--- a/arch/powerpc/kernel/vdso64/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso64/gettimeofday.S
@@ -164,6 +164,32 @@ V_FUNCTION_BEGIN(__kernel_clock_getres)
   .cfi_endproc
 V_FUNCTION_END(__kernel_clock_getres)
 
+/*
+ * Exact prototype of time()
+ *
+ * time_t time(time *t);
+ *
+ */
+V_FUNCTION_BEGIN(__kernel_time)
+  .cfi_startproc
+   mflrr12
+  .cfi_register lr,r12
+
+   mr  r11,r3  /* r11 holds t */
+   bl  V_LOCAL_FUNC(__get_datapage)
+
+   ld  r4,STAMP_XTIME+TSPC64_TV_SEC(r3)
+
+   cmpldi  r11,0   /* check if t is NULL */
+   beq 2f
+   std r4,0(r11)   /* store result at *t */
+2: mtlrr12
+   crclr   cr0*4+so
+   mr  r3,r4
+   blr
+  .cfi_endproc
+V_FUNCTION_END(__kernel_time)
+
 
 /*
  * This is the core of clock_gettime() and gettimeofday(),
diff --git a/arch/powerpc/kernel/vdso64/vdso64.lds.S 
b/arch/powerpc/kernel/vdso64/vdso64.lds.S
index e6c1758..e486381 100644
--- a/arch/powerpc/kernel/vdso64/vdso64.lds.S
+++ b/arch/powerpc/kernel/vdso64/vdso64.lds.S
@@ -147,6 +147,7 @@ VERSION
__kernel_sync_dicache_p5;
__kernel_sigtramp_rt64;
__kernel_getcpu;
+   __kernel_time;
 
local: *;
};

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v8 0/3] of/pci: Provide common support for PCI DT parsing

2013-04-22 Thread Jason Cooper
On Mon, Apr 22, 2013 at 12:53:43PM -0400, Jason Cooper wrote:
> On Mon, Apr 22, 2013 at 11:41:32AM +0100, Andrew Murray wrote:
> > This patchset factors out duplicated code associated with parsing PCI
> > DT "ranges" properties across the architectures and introduces a
> > "ranges" parser. This parser "of_pci_range_parser" can be used directly
> > by ARM host bridge drivers enabling them to obtain ranges from device
> > trees.
> > 
> > I've included the Reviewed-by, Tested-by and Acked-by's received from 
> > v5/v6/v7
> > in this patchset, earlier versions of this patchset (v3) have been 
> > tested-by:
> > 
> > Thierry Reding 
> > Jingoo Han 
> > 
> > I've tested that this patchset builds and runs on ARM and that it builds on
> > PowerPC, x86_64 and MIPS.
> 
> Andrew,
> 
> Unfortunately, the mvebu/drivers branch containing your series had to be
> dropped from arm-soc for v3.10.  This was not due to your series, but
> since arm-soc's granularity is branches, your series was caught in the
> drop.
> 
> As the mvebu-pcie driver is now v3.11 material, I have taken the
> opportunity to upgrade from your v7 patchset to v8.  You can find the
> whole branch at mvebu-next/pcie.
> 
> mvebu-next/pcie *will* be rebased onto v3.9 once it drops.  Several
> dependencies will be removed (since they will have been merged into
> v3.9).

s/v3.9/v3.10-rc1/g  :)

> Once the rebase is done, I'll send a pull request to Arnd and Olof so we
> can get as many cycles on -next as possible.

thx,

Jason.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/3] powerpc/powernv: Patch MSI EOI handler on P8

2013-04-22 Thread Michael Ellerman
On Mon, Apr 22, 2013 at 07:06:17PM +0800, Gavin Shan wrote:
> On Mon, Apr 22, 2013 at 12:56:37PM +1000, Michael Ellerman wrote:
> >On Mon, Apr 22, 2013 at 09:45:33AM +0800, Gavin Shan wrote:
> >> On Mon, Apr 22, 2013 at 09:34:36AM +1000, Michael Ellerman wrote:
> >> >On Fri, Apr 19, 2013 at 05:32:45PM +0800, Gavin Shan wrote:
> >> >> The EOI handler of MSI/MSI-X interrupts for P8 (PHB3) need additional
> >> >> steps to handle the P/Q bits in IVE before EOIing the corresponding
> >> >> interrupt. The patch changes the EOI handler to cover that.
> >> 
> >> Thanks for your time to review it, Michael. By the way, I think I need
> >> rebase the patch since the patch fb1b55d654a7038ca6337fbf55839a308c9bc1a7
> >> ("Using bitmap to manage MSI") has been merged to linux-next.
> >> 
> >> >> diff --git a/arch/powerpc/sysdev/xics/icp-native.c 
> >> >> b/arch/powerpc/sysdev/xics/icp-native.c
> >> >> index 48861d3..289355e 100644
> >> >> --- a/arch/powerpc/sysdev/xics/icp-native.c
> >> >> +++ b/arch/powerpc/sysdev/xics/icp-native.c
> >> >> @@ -27,6 +27,10 @@
> >> >>  #include 
> >> >>  #include 
> >> >>  
> >> >> +#if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_PCI_MSI)
> >> >> +extern int pnv_pci_msi_eoi(unsigned int hw_irq);
> >> >> +#endif
> >> >
> >> >You don't need to #ifdef the extern. But it should be in a header, not
> >> >here.
> >> >
> >> 
> >> Ok. I'll put it into asm/xics.h, but I want to confirm we needn't
> >> #ifdef when moving it to asm/xics.h?
> >
> >No you don't need it #ifdef'd. It's just extra noise in the file, and
> >doesn't really add anything IMHO.
> >
> 
> Michael, I'm a bit confused about your point. asm/xics.h is shared between
> PowerNV and pSeries platform, and pnv_pci_msi_eoi() is only implemented on
> PowerNV platform, so the code should look like this (with newly introduced
> option - CONFIG_POWERNV_MSI)
> 
> #ifdef CONFIG_POWERNV_MSI
> extern int pnv_pci_msi_eoi(unsigned int hw_irq);
> #endif

You can do that. But there's not much value added by adding an
#ifdef around the extern.

Assuming the body of pnv_pci_msi_eoi() is only available when
CONFIG_POWERNV_MSI is defined (which is the whole point), imagine there
is code in platforms/pseries which accidentally calls it.

If we have the extern protected by an ifdef we will get a warning that
we are calling an undeclared function, eg something like:

  pseries.c:30:2: warning: implicit declaration of function ‘pnv_pci_msi_eoi’ 
[-Wimplicit-function-declaration]

But more importantly we will not be able to link the kernel, because the
body of pnv_pci_msi_eoi() is missing (because CONFIG_POWERNV_MSI=n).

If we have the extern visible in the header, ie. not inside #ifdef, then
we will not see the warning because the compiler can see the
declaration.

But even so the kernel will still not link.

So my point is that having the #ifdef around the extern just gives you
an extra warning, which is not all that useful because you are going to
notice anyway as soon as the kernel fails to link.
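
Concretely (a sketch of the situation, with a hypothetical pseries
caller):

	/* asm/xics.h: declaration visible unconditionally */
	extern int pnv_pci_msi_eoi(unsigned int hw_irq);

	/* some pseries code that calls it by mistake */
	void example_bad_caller(void)
	{
		/*
		 * This compiles without warning because the declaration is
		 * visible, but with the PowerNV MSI code not built the
		 * definition never exists, so the final link still fails
		 * with "undefined reference to `pnv_pci_msi_eoi'".  Putting
		 * the extern inside #ifdef only adds an earlier
		 * -Wimplicit-function-declaration warning on top of that.
		 */
		pnv_pci_msi_eoi(0);
	}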

Anyway it's a minor point so don't worry about it too much :)

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/3 v13] iommu/fsl: Add additional iommu attributes required by the PAMU driver.

2013-04-22 Thread Scott Wood

On 04/22/2013 12:31:55 AM, Varun Sethi wrote:

Added the following domain attributes for the FSL PAMU driver:
1. Added new iommu stash attribute, which allows setting of the
   LIODN specific stash id parameter through IOMMU API.
2. Added an attribute for enabling/disabling DMA to a particular
   memory window.
3. Added domain attribute to check for PAMUV1 specific constraints.
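
With these attributes a caller could configure stashing through the
generic IOMMU API along the following lines (a sketch only; the cpu and
cache values are arbitrary examples):

	#include <linux/iommu.h>
	#include <linux/fsl_pamu_stash.h>

	static int example_enable_stash(struct iommu_domain *domain)
	{
		struct pamu_stash_attribute stash = {
			.cpu   = 0,			/* stash to CPU 0's cache */
			.cache = PAMU_ATTR_CACHE_L1,
		};

		return iommu_domain_set_attr(domain, DOMAIN_ATTR_PAMU_STASH,
					     &stash);
	}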

Signed-off-by: Varun Sethi 
---
v13 changes:
- created a new file include/linux/fsl_pamu_stash.h for stash
attributes.
v12 changes:
- Moved PAMU specifc stash ids and structures to PAMU header file.
- no change in v11.
- no change in v10.
 include/linux/fsl_pamu_stash.h |   39  
+++

 include/linux/iommu.h  |   16 
 2 files changed, 55 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/fsl_pamu_stash.h

diff --git a/include/linux/fsl_pamu_stash.h  
b/include/linux/fsl_pamu_stash.h

new file mode 100644
index 000..caa1b21
--- /dev/null
+++ b/include/linux/fsl_pamu_stash.h
@@ -0,0 +1,39 @@
+/*
+ * This program is free software; you can redistribute it and/or  
modify
+ * it under the terms of the GNU General Public License, version 2,  
as

+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA   
02110-1301, USA.

+ *
+ * Copyright (C) 2013 Freescale Semiconductor, Inc.
+ *
+ */
+
+#ifndef __FSL_PAMU_STASH_H
+#define __FSL_PAMU_STASH_H
+
+/* cache stash targets */
+enum pamu_stash_target {
+   PAMU_ATTR_CACHE_L1 = 1,
+   PAMU_ATTR_CACHE_L2,
+   PAMU_ATTR_CACHE_L3,
+};
+
+/*
+ * This attribute allows configuring stashig specific parameters
+ * in the PAMU hardware.
+ */
+
+struct pamu_stash_attribute {
+   u32 cpu;/* cpu number */
+   u32 cache;  /* cache to stash to: L1,L2,L3 */
+};
+
+#endif  /* __FSL_PAMU_STASH_H */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2727810..c5dc2b9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -57,10 +57,26 @@ struct iommu_domain {
 #define IOMMU_CAP_CACHE_COHERENCY  0x1
 #define IOMMU_CAP_INTR_REMAP		0x2	/* isolates device  
intrs */


+/*
+ * Following constraints are specifc to PAMUV1:


FSL_PAMUV1


+ *  -aperture must be power of 2, and naturally aligned
+ *  -number of windows must be power of 2, and address space size
+ *   of each window is determined by aperture size / # of windows
+ *  -the actual size of the mapped region of a window must be power
+ *   of 2 starting with 4KB and physical address must be naturally
+ *   aligned.
+ * DOMAIN_ATTR_FSL_PAMUV1 corresponds to the above mentioned  
contraints.
+ * The caller can invoke iommu_domain_get_attr to check if the  
underlying

+ * iommu implementation supports these constraints.
+ */
+
 enum iommu_attr {
DOMAIN_ATTR_GEOMETRY,
DOMAIN_ATTR_PAGING,
DOMAIN_ATTR_WINDOWS,
+   DOMAIN_ATTR_PAMU_STASH,
+   DOMAIN_ATTR_PAMU_ENABLE,
+   DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_MAX,


Please be consistent on whether "PAMU" gets an "FSL_" namespace prefix  
(I'd prefer that it does).


-Scott
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v3 1/12] Create a powerpc update_devicetree interface

2013-04-22 Thread Benjamin Herrenschmidt
On Mon, 2013-04-22 at 13:30 -0500, Nathan Fontenot wrote:

> This patch exposes a method for updating the device tree via
> ppc_md.update_devicetree that takes a single 32-bit value as a parameter.
> For pseries platforms this is the existing pseries_devicetree_update routine
> which is updated to take the new parameter which is a scope value
> to indicate the the reason for making the rtas calls. This parameter is
> required by the ibm,update-nodes/ibm,update-properties RTAS calls, and
> the appropriate value is contained within the RTAS event for PRRN
> notifications. In pseries_devicetree_update() it was previously
> hard-coded to 1, the scope value for partition migration.

I think that's too much abstraction (see below)

Also you add this helper:

> Index: powerpc/arch/powerpc/kernel/rtas.c
> ===
> --- powerpc.orig/arch/powerpc/kernel/rtas.c   2013-03-08 19:23:06.0 
> -0600
> +++ powerpc/arch/powerpc/kernel/rtas.c2013-04-17 13:02:29.0 
> -0500
> @@ -1085,3 +1085,13 @@
>   timebase = 0;
>   arch_spin_unlock(&timebase_lock);
>  }
> +
> +int update_devicetree(s32 scope)
> +{
> + int rc = 0;
> +
> + if (ppc_md.update_devicetree)
> + rc = ppc_md.update_devicetree(scope);
> +
> + return rc;
> +}

But then don't use it afaik (you call directly ppc_md.update_... from
prrn_work_fn().

In the end, the caller (PRRN stuff), while in rtasd, is really pseries
specific and the resulting update_device_tree() as well, so I don't
think we need the ppc_md. hook in the middle with that "oddball" scope
parameter which is not defined outside of pseries specific areas.

In this case, it might be better to make sure the PRRN related stuff in
rtasd is inside an ifdef CONFIG_PPC_PSERIES and have it call directly
into pseries_update_devicetree().
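
i.e. in the rtasd event handler, something along these lines (a sketch
only; the exact call site and the scope argument taken from the PRRN
event are assumptions):

	#ifdef CONFIG_PPC_PSERIES
		if (log->type == RTAS_TYPE_PRRN)
			pseries_update_devicetree(scope);
	#endif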

It makes the code somewhat easier to follow and I doubt anybody else
will ever use that specific hook, at least not in its current form. If
we need an abstraction later, we can add one then.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v3 4/12] Move architecture vector definitions to prom.h

2013-04-22 Thread Benjamin Herrenschmidt
On Mon, 2013-04-22 at 13:35 -0500, Nathan Fontenot wrote:
> As part of handling handling PRRN events we will need to check the
> vector 5 portion of the architecture bits reported in the device tree
> to ensure that PRRN event handling is enabled. In order to do this
> firmware_has_feature is updated (in a subsequent patch) to
> make this check.  To avoid having to re-define bits in the architecture
> vector the bits are moved to prom.h.
> 
> This patch is the first step in updating firmware_has_feature
> by simply moving the bit definitions from prom_init.c to asm/prom.h.
> There are no functional changes.
> 
> Signed-off-by: Nathan Fontenot 
> 
> ---
>  arch/powerpc/include/asm/prom.h |   73 ++
>  arch/powerpc/kernel/prom_init.c |   75 
> +++-
>  2 files changed, 79 insertions(+), 69 deletions(-)
> 
> Index: powerpc/arch/powerpc/include/asm/prom.h
> ===
> --- powerpc.orig/arch/powerpc/include/asm/prom.h  2013-04-16 
> 21:25:16.0 -0500
> +++ powerpc/arch/powerpc/include/asm/prom.h   2013-04-17 13:43:13.0 
> -0500
> @@ -74,6 +74,79 @@
>  #define DRCONF_MEM_AI_INVALID0x0040
>  #define DRCONF_MEM_RESERVED  0x0080
>  
> +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
> +/

The ifdef is unnecessary

> + * There are two methods for telling firmware what our capabilities are.
> + * Newer machines have an "ibm,client-architecture-support" method on the
> + * root node.  For older machines, we have to call the "process-elf-header"
> + * method in the /packages/elf-loader node, passing it a fake 32-bit
> + * ELF header containing a couple of PT_NOTE sections that contain
> + * structures that contain various information.
> + */
> +
> +/* New method - extensible architecture description vector. */
> +
> +/* Option vector bits - generic bits in byte 1 */
> +#define OV_IGNORE0x80/* ignore this vector */
> +#define OV_CESSATION_POLICY  0x40/* halt if unsupported option present*/
> +
> +/* Option vector 1: processor architectures supported */
> +#define OV1_PPC_2_00 0x80/* set if we support PowerPC 2.00 */
> +#define OV1_PPC_2_01 0x40/* set if we support PowerPC 2.01 */
> +#define OV1_PPC_2_02 0x20/* set if we support PowerPC 2.02 */
> +#define OV1_PPC_2_03 0x10/* set if we support PowerPC 2.03 */
> +#define OV1_PPC_2_04 0x08/* set if we support PowerPC 2.04 */
> +#define OV1_PPC_2_05 0x04/* set if we support PowerPC 2.05 */
> +#define OV1_PPC_2_06 0x02/* set if we support PowerPC 2.06 */
> +#define OV1_PPC_2_07 0x01/* set if we support PowerPC 2.07 */
> +
> +/* Option vector 2: Open Firmware options supported */
> +#define OV2_REAL_MODE0x20/* set if we want OF in real 
> mode */
> +
> +/* Option vector 3: processor options supported */
> +#define OV3_FP   0x80/* floating point */
> +#define OV3_VMX  0x40/* VMX/Altivec */
> +#define OV3_DFP  0x20/* decimal FP */
> +
> +/* Option vector 4: IBM PAPR implementation */
> +#define OV4_MIN_ENT_CAP  0x01/* minimum VP entitled capacity 
> */
> +
> +/* Option vector 5: PAPR/OF options supported */
> +#define OV5_LPAR 0x80/* logical partitioning supported */
> +#define OV5_SPLPAR   0x40/* shared-processor LPAR supported */
> +/* ibm,dynamic-reconfiguration-memory property supported */
> +#define OV5_DRCONF_MEMORY0x20
> +#define OV5_LARGE_PAGES  0x10/* large pages supported */
> +#define OV5_DONATE_DEDICATE_CPU  0x02/* donate dedicated CPU support 
> */
> +/* PCIe/MSI support.  Without MSI full PCIe is not supported */
> +#ifdef CONFIG_PCI_MSI
> +#define OV5_MSI  0x01/* PCIe/MSI support */
> +#else
> +#define OV5_MSI  0x00
> +#endif /* CONFIG_PCI_MSI */
> +#ifdef CONFIG_PPC_SMLPAR
> +#define OV5_CMO  0x80/* Cooperative Memory 
> Overcommitment */
> +#define OV5_XCMO 0x40/* Page Coalescing */
> +#else
> +#define OV5_CMO  0x00
> +#define OV5_XCMO 0x00
> +#endif
> +#define OV5_TYPE1_AFFINITY   0x80/* Type 1 NUMA affinity */
> +#define OV5_PFO_HW_RNG   0x80/* PFO Random Number Generator 
> */
> +#define OV5_PFO_HW_842   0x40/* PFO Compression Accelerator 
> */
> +#define OV5_PFO_HW_ENCR  0x20/* PFO Encryption Accelerator */
> +#define OV5_SUB_PROCESSORS   0x01/* 1,2,or 4 Sub-Processors supported */
> +
> +/* Option Vector 6: IBM PAPR hints */
> +#define OV6_LINUX0x02/* Linux is our OS */
> +
> +/*
> + * The architecture vector has an array of PVR mask/value pairs,
> + * followed by # option vectors - 1, followed by the option vectors.
> + */

Re: [PATCH v2 7/11] Use stop machine to update cpu maps

2013-04-22 Thread Benjamin Herrenschmidt
On Fri, 2013-04-05 at 13:22 -0500, Nathan Fontenot wrote:

> Agreed, having to call stop_machine() for each cpu that gets updated is
> pretty brutal. The plus side is that PRRN events should a rare occurrence 
> and not cause too much pain.

So that doesn't happen on VPHN changes ?

> The current design ties into the of notification chain so that we can do
> the affinity update when the affinity property in the device tree is updated.
> Switching to doing one stop and updating all of the cpus would require a
> design changeand
> 
> I went back and looked at the code again and there is another issue with
> way this is done. Tying into the of notification chain is great for
> being informed of when a property changes but the code (from patch 6/11)
> 
> + case OF_RECONFIG_ADD_PROPERTY:
> + case OF_RECONFIG_UPDATE_PROPERTY:
> + update = (struct of_prop_reconfig *)data;
> + if (!of_prop_cmp(update->dn->type, "cpu")) {
> + u32 core_id;
> + of_property_read_u32(update->dn, "reg", &core_id);
> + stage_topology_update(core_id);
> + rc = NOTIFY_OK;
> + }
> + break;
> 
> Does not check to see which property is being updated and just assumes
> the affinity is being updated. This code as is will do an affinity update
> every time any property of a cpu is updated or added.
> 
> Since this needs an update I will also look at possibly doing this so
> that we call stop_machine only once.

Any new patch set ?

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [Suggestion] PowerPC: kernel: memory access violation when rtas_data_buf contents are more than 1026

2013-04-22 Thread Benjamin Herrenschmidt
On Thu, 2013-04-18 at 12:45 +0800, Chen Gang wrote:
> Hello Maintainers:
> 
> 
> in arch/powerpc/kernel/lparcfg.c, parse_system_parameter_string()
> 
>   need set '\0' for 'local_buffer'.
> 
>   the reason is:
> SPLPAR_MAXLENGTH is 1026, RTAS_DATA_BUF_SIZE is 4096
> the contents of rtas_data_buf may truncated in memcpy (line 301).
> 
> if contents are truncated.
>   the splpar_strlen is more than 1026 (line 321)
>   the while loop checking will not find the end of buffer (line 326)
>   it will cause memory access violation.
> 
> 
>   I find it by reading code, so please help check.

And a signed-off-by please ?

Cheers,
Ben.

>   thanks.
> 
> gchen.
> 
> -related fix 
> patch--
> 
> diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c
> index 801a757..d92f387 100644
> --- a/arch/powerpc/kernel/lparcfg.c
> +++ b/arch/powerpc/kernel/lparcfg.c
> @@ -299,6 +299,7 @@ static void parse_system_parameter_string(struct seq_file 
> *m)
>   __pa(rtas_data_buf),
>   RTAS_DATA_BUF_SIZE);
>   memcpy(local_buffer, rtas_data_buf, SPLPAR_MAXLENGTH);
> + local_buffer[SPLPAR_MAXLENGTH - 1] = '\0';
>   spin_unlock(&rtas_data_buf_lock);
>  
>   if (call_status != 0) {
> 
> 
> 
> -related source 
> code
> 
> 
> 283 static void parse_system_parameter_string(struct seq_file *m)
> 284 {
> 285 int call_status;
> 286 
> 287 unsigned char *local_buffer = kmalloc(SPLPAR_MAXLENGTH, 
> GFP_KERNEL);
> 288 if (!local_buffer) {
> 289 printk(KERN_ERR "%s %s kmalloc failure at line %d\n",
> 290__FILE__, __func__, __LINE__);
> 291 return;
> 292 }
> 293 
> 294 spin_lock(&rtas_data_buf_lock);
> 295 memset(rtas_data_buf, 0, SPLPAR_MAXLENGTH);
> 296 call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 
> 3, 1,
> 297 NULL,
> 298 SPLPAR_CHARACTERISTICS_TOKEN,
> 299 __pa(rtas_data_buf),
> 300 RTAS_DATA_BUF_SIZE);
> 301 memcpy(local_buffer, rtas_data_buf, SPLPAR_MAXLENGTH);
> 302 spin_unlock(&rtas_data_buf_lock);
> 303 
> 304 if (call_status != 0) {
> 305 printk(KERN_INFO
> 306"%s %s Error calling get-system-parameter 
> (0x%x)\n",
> 307__FILE__, __func__, call_status);
> 308 } else {   
> 309 int splpar_strlen;
> 310 int idx, w_idx;
> 311 char *workbuffer = kzalloc(SPLPAR_MAXLENGTH, GFP_KERNEL);
> 312 if (!workbuffer) { 
> 313 printk(KERN_ERR "%s %s kmalloc failure at line 
> %d\n",
> 314__FILE__, __func__, __LINE__);
> 315 kfree(local_buffer);
> 316 return;
> 317 }   
> 318 #ifdef LPARCFG_DEBUG
> 319 printk(KERN_INFO "success calling 
> get-system-parameter\n");
> 320 #endif
> 321 splpar_strlen = local_buffer[0] * 256 + local_buffer[1];
> 322 local_buffer += 2;  /* step over strlen value */
> 323 
> 324 w_idx = 0;
> 325 idx = 0;
> 326 while ((*local_buffer) && (idx < splpar_strlen)) {
> 327 workbuffer[w_idx++] = local_buffer[idx++];
> 328 if ((local_buffer[idx] == ',')
> 329 || (local_buffer[idx] == '\0')) {
> 330 workbuffer[w_idx] = '\0';
> 331 if (w_idx) {
> 332 /* avoid the empty string */
> 333 seq_printf(m, "%s\n", workbuffer);
> 334 }
> 335 memset(workbuffer, 0, SPLPAR_MAXLENGTH);
> 336 idx++;  /* skip the comma */
> 337 w_idx = 0;
> 338 } else if (local_buffer[idx] == '=') {
> 339 /* code here to replace workbuffer 
> contents
> 340with different keyword strings */
> 341 if (0 == strcmp(workbuffer, "MaxEntCap")) 
> {
> 342 strcpy(workbuffer,
> 343
> "partition_max_entitled_capacity");
> 344 w_idx = strlen(workbuffer);
> 345 }
> 346 if (0 == strcmp(workbuffer, 
> "MaxPlatProcs")) {
> 347 

Re: [PATCH] powerpc/rtas_flash: New return code to indicate FW entitlement expiry

2013-04-22 Thread Benjamin Herrenschmidt
On Fri, 2013-04-19 at 17:14 +0530, Vasant Hegde wrote:
> Add new return code to rtas_flash to indicate firmware entitlement
> expiry. This will be used by the update_flash script to return
> appropriate message to the user.

What's the point of that patch ? It adds a definition to a private .c
file not exposed to user space and doesn't do anything with it ...

Ben.

> Signed-off-by: Ananth N Mavinakayanahalli 
> Signed-off-by: Vasant Hegde 
> ---
>  arch/powerpc/kernel/rtas_flash.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/rtas_flash.c 
> b/arch/powerpc/kernel/rtas_flash.c
> index a7020d2..0a12c16 100644
> --- a/arch/powerpc/kernel/rtas_flash.c
> +++ b/arch/powerpc/kernel/rtas_flash.c
> @@ -64,6 +64,7 @@
>  #define VALIDATE_TMP_COMMIT_DL 4 /* Validate Return Status */
>  #define VALIDATE_TMP_COMMIT5 /* Validate Return Status */
>  #define VALIDATE_TMP_UPDATE_DL 6 /* Validate Return Status */
> +#define VALIDATE_OUT_OF_WRNTY  7 /* Validate Return Status */
>  
>  /* ibm,manage-flash-image operation tokens */
>  #define RTAS_REJECT_TMP_IMG   0


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: Add HWCAP2 aux entry

2013-04-22 Thread Benjamin Herrenschmidt
On Thu, 2013-04-18 at 13:41 +1000, Michael Neuling wrote:
> akpm,
> 
> If you're happy with this, is it something you can take in your tree?

Andrew ? Or give me an ack ? :-) I'm happy to carry this, we need that
rather urgently and we have the glibc folks on board.

Cheers,
Ben.

> Mikey
> 
> Michael Neuling  wrote:
> > We are currently out of free bits in AT_HWCAP. With POWER8, we have
> > several hardware features that we need to advertise. 
> > 
> > Tested on POWER and x86.
> > 
> > Signed-off-by: Michael Neuling 
> > Signed-off-by: Nishanth Aravamudan 
> > ---
> > 
> > > Wouldn't it be safer to not emit AT_HWCAP2 unless it is defined by the 
> > > arch?
> > > 
> > > That way the change would only impact powerpc.
> > 
> > Should be addressed with this version.
> > 
> > Mikey
> > 
> > diff --git a/arch/powerpc/include/asm/cputable.h 
> > b/arch/powerpc/include/asm/cputable.h
> > index fb3245e..ccadad6 100644
> > --- a/arch/powerpc/include/asm/cputable.h
> > +++ b/arch/powerpc/include/asm/cputable.h
> > @@ -52,6 +52,7 @@ struct cpu_spec {
> > char*cpu_name;
> > unsigned long   cpu_features;   /* Kernel features */
> > unsigned intcpu_user_features;  /* Userland features */
> > +   unsigned intcpu_user_features2; /* Userland features v2 */
> > unsigned intmmu_features;   /* MMU features */
> >  
> > /* cache line sizes */
> > diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> > index ac9790f..cc0655a 100644
> > --- a/arch/powerpc/include/asm/elf.h
> > +++ b/arch/powerpc/include/asm/elf.h
> > @@ -61,6 +61,7 @@ typedef elf_vrregset_t elf_fpxregset_t;
> > instruction set this cpu supports.  This could be done in userspace,
> > but it's not easy, and we've already done it here.  */
> >  # define ELF_HWCAP (cur_cpu_spec->cpu_user_features)
> > +# define ELF_HWCAP2(cur_cpu_spec->cpu_user_features2)
> >  
> >  /* This yields a string that ld.so will use to load implementation
> > specific libraries for optimization.  This is more specific in
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 3939829..1f8b5d5 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -240,6 +240,9 @@ create_elf_tables(struct linux_binprm *bprm, struct 
> > elfhdr *exec,
> > NEW_AUX_ENT(AT_EGID, from_kgid_munged(cred->user_ns, cred->egid));
> > NEW_AUX_ENT(AT_SECURE, security_bprm_secureexec(bprm));
> > NEW_AUX_ENT(AT_RANDOM, (elf_addr_t)(unsigned long)u_rand_bytes);
> > +#ifdef ELF_HWCAP2
> > +   NEW_AUX_ENT(AT_HWCAP2, ELF_HWCAP2);
> > +#endif
> > NEW_AUX_ENT(AT_EXECFN, bprm->exec);
> > if (k_platform) {
> > NEW_AUX_ENT(AT_PLATFORM,
> > diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> > index 9c13e02..bf2381d 100644
> > --- a/fs/binfmt_elf_fdpic.c
> > +++ b/fs/binfmt_elf_fdpic.c
> > @@ -483,7 +483,6 @@ static int create_elf_fdpic_tables(struct linux_binprm 
> > *bprm,
> > size_t platform_len = 0, len;
> > char *k_platform, *k_base_platform;
> > char __user *u_platform, *u_base_platform, *p;
> > -   long hwcap;
> > int loop;
> > int nr; /* reset for each csp adjustment */
> >  
> > @@ -502,8 +501,6 @@ static int create_elf_fdpic_tables(struct linux_binprm 
> > *bprm,
> > return -EFAULT;
> >  #endif
> >  
> > -   hwcap = ELF_HWCAP;
> > -
> > /*
> >  * If this architecture has a platform capability string, copy it
> >  * to userspace.  In some cases (Sparc), this info is impossible
> > @@ -617,7 +614,10 @@ static int create_elf_fdpic_tables(struct linux_binprm 
> > *bprm,
> >  
> > nr = 0;
> > csp -= DLINFO_ITEMS * 2 * sizeof(unsigned long);
> > -   NEW_AUX_ENT(AT_HWCAP,   hwcap);
> > +   NEW_AUX_ENT(AT_HWCAP,   ELF_HWCAP);
> > +#ifdef ELF_HWCAP2
> > +   NEW_AUX_ENT(AT_HWCAP2,  ELF_HWCAP2);
> > +#endif
> > NEW_AUX_ENT(AT_PAGESZ,  PAGE_SIZE);
> > NEW_AUX_ENT(AT_CLKTCK,  CLOCKS_PER_SEC);
> > NEW_AUX_ENT(AT_PHDR,exec_params->ph_addr);
> > diff --git a/include/uapi/linux/auxvec.h b/include/uapi/linux/auxvec.h
> > index 61594d5..835c065 100644
> > --- a/include/uapi/linux/auxvec.h
> > +++ b/include/uapi/linux/auxvec.h
> > @@ -28,6 +28,7 @@
> >  #define AT_BASE_PLATFORM 24/* string identifying real platform, may
> >  * differ from AT_PLATFORM. */
> >  #define AT_RANDOM 25   /* address of 16 random bytes */
> > +#define AT_HWCAP2 26   /* extension of AT_HWCAP */
> >  
> >  #define AT_EXECFN  31  /* filename of program */
> >  
> > ___
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> ___
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.

Re: [PATCH v3 7/12] Use stop machine to update cpu maps

2013-04-22 Thread Benjamin Herrenschmidt
On Mon, 2013-04-22 at 13:41 -0500, Nathan Fontenot wrote:
> From: Jesse Larrew 
> 
> Platform events such as partition migration or the new PRRN firmware
> feature can cause the NUMA characteristics of a CPU to change, and these
> changes will be reflected in the device tree nodes for the affected
> CPUs.
> 
> This patch registers a handler for Open Firmware device tree updates
> and reconfigures the CPU and node maps whenever the associativity
> changes. Currently, this is accomplished by marking the affected CPUs in
> the cpu_associativity_changes_mask and allowing
> arch_update_cpu_topology() to retrieve the new associativity information
> using hcall_vphn().
> 
> Protecting the NUMA cpu maps from concurrent access during an update
> operation will be addressed in a subsequent patch in this series.

I see no more mention of stop_machine() ... is the patch subject stale ?

Cheers,
Ben.

> Signed-off-by: Nathan Fontenot 
> ---
> 
>  arch/powerpc/include/asm/firmware.h   |3 
>  arch/powerpc/include/asm/prom.h   |1 
>  arch/powerpc/mm/numa.c|   99 
> ++
>  arch/powerpc/platforms/pseries/firmware.c |1 
>  4 files changed, 79 insertions(+), 25 deletions(-)
> 
> Index: powerpc/arch/powerpc/include/asm/prom.h
> ===
> --- powerpc.orig/arch/powerpc/include/asm/prom.h  2013-04-15 
> 14:03:52.0 -0500
> +++ powerpc/arch/powerpc/include/asm/prom.h   2013-04-15 14:04:47.0 
> -0500
> @@ -128,6 +128,7 @@
>  #define OV5_CMO  0x0480  /* Cooperative Memory 
> Overcommitment */
>  #define OV5_XCMO 0x0440  /* Page Coalescing */
>  #define OV5_TYPE1_AFFINITY   0x0580  /* Type 1 NUMA affinity */
> +#define OV5_PRRN 0x0540  /* Platform Resource Reassignment */
>  #define OV5_PFO_HW_RNG   0x0E80  /* PFO Random Number Generator 
> */
>  #define OV5_PFO_HW_842   0x0E40  /* PFO Compression Accelerator 
> */
>  #define OV5_PFO_HW_ENCR  0x0E20  /* PFO Encryption Accelerator */
> Index: powerpc/arch/powerpc/mm/numa.c
> ===
> --- powerpc.orig/arch/powerpc/mm/numa.c   2013-04-15 14:04:46.0 
> -0500
> +++ powerpc/arch/powerpc/mm/numa.c2013-04-15 14:06:20.0 -0500
> @@ -1257,7 +1257,8 @@
>  static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
>  static cpumask_t cpu_associativity_changes_mask;
>  static int vphn_enabled;
> -static void set_topology_timer(void);
> +static int prrn_enabled;
> +static void reset_topology_timer(void);
>  
>  /*
>   * Store the current values of the associativity change counters in the
> @@ -1293,11 +1294,9 @@
>   */
>  static int update_cpu_associativity_changes_mask(void)
>  {
> - int cpu, nr_cpus = 0;
> + int cpu;
>   cpumask_t *changes = &cpu_associativity_changes_mask;
>  
> - cpumask_clear(changes);
> -
>   for_each_possible_cpu(cpu) {
>   int i, changed = 0;
>   u8 *counts = vphn_cpu_change_counts[cpu];
> @@ -1311,11 +1310,10 @@
>   }
>   if (changed) {
>   cpumask_set_cpu(cpu, changes);
> - nr_cpus++;
>   }
>   }
>  
> - return nr_cpus;
> + return cpumask_weight(changes);
>  }
>  
>  /*
> @@ -1416,7 +1414,7 @@
>   unsigned int associativity[VPHN_ASSOC_BUFSIZE] = {0};
>   struct device *dev;
>  
> - for_each_cpu(cpu,&cpu_associativity_changes_mask) {
> + for_each_cpu(cpu, &cpu_associativity_changes_mask) {
>   vphn_get_associativity(cpu, associativity);
>   nid = associativity_to_nid(associativity);
>  
> @@ -1438,6 +1436,7 @@
>   dev = get_cpu_device(cpu);
>   if (dev)
>   kobject_uevent(&dev->kobj, KOBJ_CHANGE);
> + cpumask_clear_cpu(cpu, &cpu_associativity_changes_mask);
>   changed = 1;
>   }
>  
> @@ -1457,37 +1456,80 @@
>  
>  static void topology_timer_fn(unsigned long ignored)
>  {
> - if (!vphn_enabled)
> - return;
> - if (update_cpu_associativity_changes_mask() > 0)
> + if (prrn_enabled && cpumask_weight(&cpu_associativity_changes_mask))
>   topology_schedule_update();
> - set_topology_timer();
> + else if (vphn_enabled) {
> + if (update_cpu_associativity_changes_mask() > 0)
> + topology_schedule_update();
> + reset_topology_timer();
> + }
>  }
>  static struct timer_list topology_timer =
>   TIMER_INITIALIZER(topology_timer_fn, 0, 0);
>  
> -static void set_topology_timer(void)
> +static void reset_topology_timer(void)
>  {
>   topology_timer.data = 0;
>   topology_timer.expires = jiffies + 60 * HZ;
> - add_timer(&topology_timer);
> + mod_timer(&topology_timer, topology_timer.expires);
> +}
> +
> +static void st

Re: [Suggestion] PowerPC: kernel: memory access violation when rtas_data_buf contents are more than 1026

2013-04-22 Thread Chen Gang
On 2013年04月23日 08:31, Benjamin Herrenschmidt wrote:
> On Thu, 2013-04-18 at 12:45 +0800, Chen Gang wrote:
>> Hello Maintainers:
>>
>>
>> in arch/powerpc/kernel/lparcfg.c, parse_system_parameter_string()
>>
>>   need set '\0' for 'local_buffer'.
>>
>>   the reason is:
>> SPLPAR_MAXLENGTH is 1026, RTAS_DATA_BUF_SIZE is 4096
>> the contents of rtas_data_buf may truncated in memcpy (line 301).
>>
>> if contents are truncated.
>>   the splpar_strlen is more than 1026 (line 321)
>>   the while loop checking will not find the end of buffer (line 326)
>>   it will cause memory access violation.
>>
>>
>>   I find it by reading code, so please help check.
> 
> And a signed-off-by please ?
> 

  ok, thanks, I should send the related patch.


-- 
Chen Gang

Asianux Corporation
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/12] Update firmware_has_feature() to check architecture bits

2013-04-22 Thread Stephen Rothwell
Hi Nathan,

On Mon, 22 Apr 2013 13:38:47 -0500 Nathan Fontenot  
wrote:
>
> -/* Option vector 5: PAPR/OF options supported */
> -#define OV5_LPAR 0x80/* logical partitioning supported */
> -#define OV5_SPLPAR   0x40/* shared-processor LPAR supported */
> +/* Option vector 5: PAPR/OF options supported
> + * Thses bits are also used for the platform_has_feature() call so
  ^
typo

> + * we encode the vector index in the define and use the OV5_FEAT()
> + * and OV5_INDX() macros to extract the desired information.
> + */
> +#define OV5_FEAT(x)  ((x) & 0xff)
> +#define OV5_INDX(x)  ((x) >> 8)
> +#define OV5_LPAR 0x0280  /* logical partitioning supported */
> +#define OV5_SPLPAR   0x0240  /* shared-processor LPAR supported */

Wouldn't it be clearer to say

#define OV5_LPAR	(OV5_INDX(0x2) | OV5_FEAT(0x80))

etc?
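
(Note that with the patch as posted OV5_FEAT() and OV5_INDX() are
extractors -- (x) & 0xff and (x) >> 8 -- so building the constants that
way would want a constructor form instead, e.g. a sketch with a
hypothetical OV5_ENTRY() macro:

	#define OV5_ENTRY(indx, feat)	(((indx) << 8) | (feat))
	#define OV5_LPAR		OV5_ENTRY(0x2, 0x80)	/* == 0x0280 */

or the suggestion above read with OV5_INDX() redefined to shift left.)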

> @@ -145,6 +141,7 @@
>   * followed by # option vectors - 1, followed by the option vectors.
>   */
>  extern unsigned char ibm_architecture_vec[];
> +bool platform_has_feature(unsigned int);

"extern", please (if nothing else, for consistency).

> +static __initdata struct vec5_fw_feature
> +vec5_fw_features_table[FIRMWARE_MAX_FEATURES] = {

Why make this array FIRMWARE_MAX_FEATURES (63) long?  You could just
restrict the for loop below to ARRAY_SIZE(vec5_fw_features_table).

> + {FW_FEATURE_TYPE1_AFFINITY, OV5_TYPE1_AFFINITY},
> +};
> +
> +void __init fw_vec5_feature_init(const char *vec5, unsigned long len)
> +{
> + unsigned int index, feat;
> + int i;
> +
> + pr_debug(" -> fw_vec5_feature_init()\n");
> +
> + for (i = 0; i < FIRMWARE_MAX_FEATURES; i++) {
> + if (!vec5_fw_features_table[i].feature)
> + continue;

And this test could go away.

I realise that you have just copied the existing code, but you should not
do that blindly.  Maybe you could even add an (earlier) patch that fixes
the existing code.
-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 5/12] Update firmware_has_feature() to check architecture bits

2013-04-22 Thread Stephen Rothwell
Hi Nathan,

On Mon, 22 Apr 2013 13:38:47 -0500 Nathan Fontenot  
wrote:
>
> +/* Option vector 5: PAPR/OF options supported
> + * Thses bits are also used for the platform_has_feature() call so

You talk about platform_has_feature(), but that does not exist (I assume
it existed in a previous version of the patch set).

> +bool platform_has_feature(unsigned int);

Ditto.

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 12/12] Add /proc interface to control topology updates

2013-04-22 Thread Stephen Rothwell
Hi Nathan,

On Mon, 22 Apr 2013 13:47:55 -0500 Nathan Fontenot  
wrote:
>
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int start_topology_update(void);
>  extern int stop_topology_update(void);
> +extern inline int prrn_is_enabled(void);

You really can't do "extern inline" with no body ...

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 12/12] Add /proc interface to control topology updates

2013-04-22 Thread Stephen Rothwell
Hi Nathan,

On Mon, 22 Apr 2013 13:47:55 -0500 Nathan Fontenot  
wrote:
>
> +inline int prrn_is_enabled(void)
> +{
> + return prrn_enabled;
> +}

We generally leave these "inline"s up to the compiler these days i.e.
remove the "inline".
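
That is, in numa.c simply:

	int prrn_is_enabled(void)
	{
		return prrn_enabled;
	}

with a plain "extern int prrn_is_enabled(void);" declaration in
topology.h.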

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 12/12] Add /proc interface to control topology updates

2013-04-22 Thread Michael Ellerman
On Tue, Apr 23, 2013 at 12:00:26PM +1000, Stephen Rothwell wrote:
> Hi Nathan,
> 
> On Mon, 22 Apr 2013 13:47:55 -0500 Nathan Fontenot  
> wrote:
> >
> >  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
> >  extern int start_topology_update(void);
> >  extern int stop_topology_update(void);
> > +extern inline int prrn_is_enabled(void);
> 
> You really can't do "extern inline" with no body ...

No you can't, and at least with my compiler it causes a build error.

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] PowerPC: kernel: memory access violation when rtas_data_buf contents are more than 1026

2013-04-22 Thread Chen Gang

We need to set a terminating '\0' in 'local_buffer'.

SPLPAR_MAXLENGTH is 1026 and RTAS_DATA_BUF_SIZE is 4096, so the contents
of rtas_data_buf may be truncated by the memcpy().

If the contents really are truncated, splpar_strlen can be more than
1026, the following while loop will not find the end of the buffer, and
that will cause a memory access violation.


Signed-off-by: Chen Gang 
---
 arch/powerpc/kernel/lparcfg.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/lparcfg.c b/arch/powerpc/kernel/lparcfg.c
index 801a757..d92f387 100644
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -299,6 +299,7 @@ static void parse_system_parameter_string(struct seq_file 
*m)
__pa(rtas_data_buf),
RTAS_DATA_BUF_SIZE);
memcpy(local_buffer, rtas_data_buf, SPLPAR_MAXLENGTH);
+   local_buffer[SPLPAR_MAXLENGTH - 1] = '\0';
spin_unlock(&rtas_data_buf_lock);
 
if (call_status != 0) {
-- 
1.7.7.6
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/2] powerpc: Move opcode definitions from kvm/emulate.c to asm/ppc-opcode.h

2013-04-22 Thread Jia Hongtao
Opcode and xopcode are useful definitions not just for KVM. Move these
definitions to asm/ppc-opcode.h for public use.

Signed-off-by: Jia Hongtao 
Signed-off-by: Li Yang 
---
 arch/powerpc/include/asm/ppc-opcode.h | 45 +++
 arch/powerpc/kvm/emulate.c| 44 +-
 2 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..18de83a 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -81,6 +81,51 @@
 #define__REGA0_R30 30
 #define__REGA0_R31 31
 
+/* opcode and xopcode for instructions */
+#define OP_TRAP 3
+#define OP_TRAP_64 2
+
+#define OP_31_XOP_TRAP  4
+#define OP_31_XOP_LWZX  23
+#define OP_31_XOP_LWZUX 55
+#define OP_31_XOP_TRAP_64   68
+#define OP_31_XOP_DCBF  86
+#define OP_31_XOP_LBZX  87
+#define OP_31_XOP_STWX  151
+#define OP_31_XOP_STBX  215
+#define OP_31_XOP_LBZUX 119
+#define OP_31_XOP_STBUX 247
+#define OP_31_XOP_LHZX  279
+#define OP_31_XOP_LHZUX 311
+#define OP_31_XOP_MFSPR 339
+#define OP_31_XOP_LHAX  343
+#define OP_31_XOP_STHX  407
+#define OP_31_XOP_STHUX 439
+#define OP_31_XOP_MTSPR 467
+#define OP_31_XOP_DCBI  470
+#define OP_31_XOP_LWBRX 534
+#define OP_31_XOP_TLBSYNC   566
+#define OP_31_XOP_STWBRX662
+#define OP_31_XOP_LHBRX 790
+#define OP_31_XOP_STHBRX918
+
+#define OP_LWZ  32
+#define OP_LD   58
+#define OP_LWZU 33
+#define OP_LBZ  34
+#define OP_LBZU 35
+#define OP_STW  36
+#define OP_STWU 37
+#define OP_STD  62
+#define OP_STB  38
+#define OP_STBU 39
+#define OP_LHZ  40
+#define OP_LHZU 41
+#define OP_LHA  42
+#define OP_LHAU 43
+#define OP_STH  44
+#define OP_STHU 45
+
 /* sorted alphabetically */
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
index 7a73b6f..426d3f5 100644
--- a/arch/powerpc/kvm/emulate.c
+++ b/arch/powerpc/kvm/emulate.c
@@ -30,52 +30,10 @@
 #include 
 #include 
 #include 
+#include 
 #include "timing.h"
 #include "trace.h"
 
-#define OP_TRAP 3
-#define OP_TRAP_64 2
-
-#define OP_31_XOP_TRAP  4
-#define OP_31_XOP_LWZX  23
-#define OP_31_XOP_TRAP_64   68
-#define OP_31_XOP_DCBF  86
-#define OP_31_XOP_LBZX  87
-#define OP_31_XOP_STWX  151
-#define OP_31_XOP_STBX  215
-#define OP_31_XOP_LBZUX 119
-#define OP_31_XOP_STBUX 247
-#define OP_31_XOP_LHZX  279
-#define OP_31_XOP_LHZUX 311
-#define OP_31_XOP_MFSPR 339
-#define OP_31_XOP_LHAX  343
-#define OP_31_XOP_STHX  407
-#define OP_31_XOP_STHUX 439
-#define OP_31_XOP_MTSPR 467
-#define OP_31_XOP_DCBI  470
-#define OP_31_XOP_LWBRX 534
-#define OP_31_XOP_TLBSYNC   566
-#define OP_31_XOP_STWBRX662
-#define OP_31_XOP_LHBRX 790
-#define OP_31_XOP_STHBRX918
-
-#define OP_LWZ  32
-#define OP_LD   58
-#define OP_LWZU 33
-#define OP_LBZ  34
-#define OP_LBZU 35
-#define OP_STW  36
-#define OP_STWU 37
-#define OP_STD  62
-#define OP_STB  38
-#define OP_STBU 39
-#define OP_LHZ  40
-#define OP_LHZU 41
-#define OP_LHA  42
-#define OP_LHAU 43
-#define OP_STH  44
-#define OP_STHU 45
-
 void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
 {
unsigned long dec_nsec;
-- 
1.8.0


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 2/2 V7] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx

2013-04-22 Thread Jia Hongtao
A PCIe erratum of mpc85xx may cause a core hang when a PCIe link goes
down. When the link goes down, non-posted transactions issued via the
ATMU that require completion result in an instruction stall. At the
same time a machine-check exception is generated to the core to allow
further processing by the handler. We implement a handler that skips
the instruction that caused the stall.
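
Conceptually, for a stalled load the handler does something like the
sketch below (is_load() and the 4-byte NIP advance are assumptions for
illustration; the patch itself decodes each load variant explicitly):

	static int sketch_skip_stalled_load(struct pt_regs *regs, u32 inst)
	{
		if (!is_load(inst))
			return 0;		/* not handled here */

		regs->gpr[get_rt(inst)] = ~0UL;	/* the load reads all-Fs */
		regs->nip += 4;			/* step over the instruction */
		return 1;			/* machine check handled */
	}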

This patch depends on patch:
powerpc/85xx: Add platform_device declaration to fsl_pci.h

Signed-off-by: Zhao Chenhui 
Signed-off-by: Li Yang 
Signed-off-by: Liu Shuo 
Signed-off-by: Jia Hongtao 
---
V6:
* Correct PCIe checking method (using the indirect_type member of the
  pci_controller structure).

V5:
* Move OP and XOP defines to a new header file: asm/ppc-disassemble.h
* Add X UX BRX variant of load instruction emulation
* Remove A variant of load instruction emulation

V4:
* Fill rd with all-Fs if the skipped instruction is load and emulate the
  instruction.
* Let KVM/QEMU deal with the exception if the machine check comes from KVM.

 arch/powerpc/kernel/cpu_setup_fsl_booke.S |   2 +-
 arch/powerpc/kernel/traps.c   |   3 +
 arch/powerpc/sysdev/fsl_pci.c | 140 ++
 arch/powerpc/sysdev/fsl_pci.h |   6 ++
 4 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S 
b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 0b9af01..bfb18c7 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -75,7 +75,7 @@ _GLOBAL(__setup_cpu_e500v2)
bl  __e500_icache_setup
bl  __e500_dcache_setup
bl  __setup_e500_ivors
-#ifdef CONFIG_FSL_RIO
+#if defined(CONFIG_FSL_RIO) || defined(CONFIG_FSL_PCI)
/* Ensure that RFXE is set */
mfspr   r3,SPRN_HID1
orisr3,r3,HID1_RFXE@h
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 37cc40e..d15cfb5 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -565,6 +566,8 @@ int machine_check_e500(struct pt_regs *regs)
if (reason & MCSR_BUS_RBERR) {
if (fsl_rio_mcheck_exception(regs))
return 1;
+   if (fsl_pci_mcheck_exception(regs))
+   return 1;
}
 
printk("Machine check in kernel mode.\n");
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 40ffe29..6bddf0f 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -26,11 +26,15 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -876,6 +880,142 @@ u64 fsl_pci_immrbar_base(struct pci_controller *hose)
return 0;
 }
 
+#ifdef CONFIG_E500
+static int mcheck_handle_load(struct pt_regs *regs, u32 inst)
+{
+   unsigned int rd, ra, rb, d;
+
+   rd = get_rt(inst);
+   ra = get_ra(inst);
+   rb = get_rb(inst);
+   d = get_d(inst);
+
+   switch (get_op(inst)) {
+   case 31:
+   switch (get_xop(inst)) {
+   case OP_31_XOP_LWZX:
+   case OP_31_XOP_LWBRX:
+   regs->gpr[rd] = 0x;
+   break;
+
+   case OP_31_XOP_LWZUX:
+   regs->gpr[rd] = 0x;
+   regs->gpr[ra] += regs->gpr[rb];
+   break;
+
+   case OP_31_XOP_LBZX:
+   regs->gpr[rd] = 0xff;
+   break;
+
+   case OP_31_XOP_LBZUX:
+   regs->gpr[rd] = 0xff;
+   regs->gpr[ra] += regs->gpr[rb];
+   break;
+
+   case OP_31_XOP_LHZX:
+   case OP_31_XOP_LHBRX:
+   regs->gpr[rd] = 0x;
+   break;
+
+   case OP_31_XOP_LHZUX:
+   regs->gpr[rd] = 0x;
+   regs->gpr[ra] += regs->gpr[rb];
+   break;
+
+   default:
+   return 0;
+   }
+   break;
+
+   case OP_LWZ:
+   regs->gpr[rd] = 0x;
+   break;
+
+   case OP_LWZU:
+   regs->gpr[rd] = 0x;
+   regs->gpr[ra] += (s16)d;
+   break;
+
+   case OP_LBZ:
+   regs->gpr[rd] = 0xff;
+   break;
+
+   case OP_LBZU:
+   regs->gpr[rd] = 0xff;
+   regs->gpr[ra] += (s16)d;
+   break;
+
+   case OP_LHZ:
+   regs->gpr[rd] = 0x;
+   break;
+
+   case OP_LHZU:
+   regs->gpr[rd] = 0x;
+   regs->gpr[ra] += (s16

Re: [PATCH 3/3] powerpc/powernv: Patch MSI EOI handler on P8

2013-04-22 Thread Gavin Shan
On Tue, Apr 23, 2013 at 09:34:16AM +1000, Michael Ellerman wrote:
>On Mon, Apr 22, 2013 at 07:06:17PM +0800, Gavin Shan wrote:
>> On Mon, Apr 22, 2013 at 12:56:37PM +1000, Michael Ellerman wrote:
>> >On Mon, Apr 22, 2013 at 09:45:33AM +0800, Gavin Shan wrote:
>> >> On Mon, Apr 22, 2013 at 09:34:36AM +1000, Michael Ellerman wrote:
>> >> >On Fri, Apr 19, 2013 at 05:32:45PM +0800, Gavin Shan wrote:

.../...

>> >> >> diff --git a/arch/powerpc/sysdev/xics/icp-native.c 
>> >> >> b/arch/powerpc/sysdev/xics/icp-native.c
>> >> >> index 48861d3..289355e 100644
>> >> >> --- a/arch/powerpc/sysdev/xics/icp-native.c
>> >> >> +++ b/arch/powerpc/sysdev/xics/icp-native.c
>> >> >> @@ -27,6 +27,10 @@
>> >> >>  #include 
>> >> >>  #include 
>> >> >>  
>> >> >> +#if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_PCI_MSI)
>> >> >> +extern int pnv_pci_msi_eoi(unsigned int hw_irq);
>> >> >> +#endif
>> >> >
>> >> >You don't need to #ifdef the extern. But it should be in a header, not
>> >> >here.
>> >> >
>> >> 
>> >> Ok. I'll put it into asm/xics.h, but I want to confirm we needn't
>> >> #ifdef when moving it to asm/xics.h?
>> >
>> >No you don't need it #ifdef'd. It's just extra noise in the file, and
>> >doesn't really add anything IMHO.
>> >
>> 
>> Michael, I'm a bit confused about your point. asm/xics.h is shared between
>> PowerNV and pSeries platform, and pnv_pci_msi_eoi() is only implemented on
>> PowerNV platform, so the code should look like this (with newly introduced
>> option - CONFIG_POWERNV_MSI)
>> 
>> #ifdef CONFIG_POWERNV_MSI
>> extern int pnv_pci_msi_eoi(unsigned int hw_irq);
>> #endif
>
>You can do that. But there's not much value added by adding an
>#ifdef around the extern.
>
>Assuming the body of pnv_pci_msi_eoi() is only available when
>CONFIG_POWERNV_MSI is defined (which is the whole point), imagine there
>is code in platforms/pseries which accidentally calls it.
>
>If we have the extern protected by an ifdef we will get a warning that
>we are calling an undeclared function, eg something like:
>
>  pseries.c:30:2: warning: implicit declaration of function ‘pnv_pci_msi_eoi’ 
> [-Wimplicit-function-declaration]
>
>But more importantly we will not be able to link the kernel, because the
>body of pnv_pci_msi_eoi() is missing (because CONFIG_POWERNV_MSI=n).
>
>If we have the extern visible in the header, ie. not inside #ifdef, then
>we will not see the warning because the compiler can see the
>declaration.
>
>But even so the kernel will still not link.
>
>So my point is that having the #ifdef around the extern just gives you
>an extra warning, which is not all that useful because you are going to
>notice anyway as soon as the kernel fails to link.
>
>Anyway it's a minor point so don't worry about it too much :)
>

Thanks for taking the time to explain it in detail, Michael. I will
remove that "#ifdef" ;-)

Thanks,
Gavin
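
For illustration, a minimal sketch of the arrangement settled on above, assuming
the declaration ends up in asm/xics.h and the body is only built when
CONFIG_POWERNV_MSI=y:

/* arch/powerpc/include/asm/xics.h: declaration visible unconditionally */
extern int pnv_pci_msi_eoi(unsigned int hw_irq);

/* PowerNV code, compiled only with CONFIG_POWERNV_MSI=y, provides the body */
int pnv_pci_msi_eoi(unsigned int hw_irq)
{
	/* ... perform the MSI EOI on the PHB ... */
	return 0;
}

An accidental caller on pSeries then compiles without the implicit-declaration
warning but still fails at link time, which is the behaviour described above.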


[PATCH 1/3 v2] iommu: Move swap_pci_ref function to pci.h.

2013-04-22 Thread Varun Sethi
swap_pci_ref() is used by the IOMMU API code to swap PCI device pointers
while determining the IOMMU group for a device. Currently each IOMMU driver
carries its own copy of this function. This patch moves the function to a
new file, drivers/iommu/pci.h, so that the implementation can be shared
across the various IOMMU drivers.

Signed-off-by: Varun Sethi 
---
v2 changes:
- created a new file drivers/iommu/pci.h.

 drivers/iommu/amd_iommu.c   |7 +--
 drivers/iommu/intel-iommu.c |7 +--
 drivers/iommu/pci.h |   29 +
 3 files changed, 31 insertions(+), 12 deletions(-)
 create mode 100644 drivers/iommu/pci.h

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a7f6b04..2463464 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -46,6 +46,7 @@
 #include "amd_iommu_proto.h"
 #include "amd_iommu_types.h"
 #include "irq_remapping.h"
+#include "pci.h"
 
 #define CMD_SET_TYPE(cmd, t) ((cmd)->data[1] |= ((t) << 28))
 
@@ -263,12 +264,6 @@ static bool check_device(struct device *dev)
return true;
 }
 
-static void swap_pci_ref(struct pci_dev **from, struct pci_dev *to)
-{
-   pci_dev_put(*from);
-   *from = to;
-}
-
 static struct pci_bus *find_hosted_bus(struct pci_bus *bus)
 {
while (!bus->self) {
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6e0b9ff..81ad7b8 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -47,6 +47,7 @@
 #include 
 
 #include "irq_remapping.h"
+#include "pci.h"
 
 #define ROOT_SIZE  VTD_PAGE_SIZE
 #define CONTEXT_SIZE   VTD_PAGE_SIZE
@@ -4137,12 +4138,6 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
 }
 
-static void swap_pci_ref(struct pci_dev **from, struct pci_dev *to)
-{
-   pci_dev_put(*from);
-   *from = to;
-}
-
 #define REQ_ACS_FLAGS  (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
 
 static int intel_iommu_add_device(struct device *dev)
diff --git a/drivers/iommu/pci.h b/drivers/iommu/pci.h
new file mode 100644
index 000..d460646
--- /dev/null
+++ b/drivers/iommu/pci.h
@@ -0,0 +1,29 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ * Copyright (C) 2013 Freescale Semiconductor, Inc.
+ *
+ */
+#ifndef __PCI_H
+#define __PCI_H
+
+/* Helper function for swapping pci device reference */
+static inline void swap_pci_ref(struct pci_dev **from, struct pci_dev *to)
+{
+   pci_dev_put(*from);
+   *from = to;
+}
+
+#endif  /* __PCI_H */
-- 
1.7.4.1
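
A hedged usage sketch (not part of the patch; the helper name walk_to_root is
illustrative): how an IOMMU driver typically uses the shared helper while
walking up the PCI hierarchy with the reference count kept balanced.

#include <linux/pci.h>
#include "pci.h"

/* Walk from pdev up to the host bridge; caller must pci_dev_put() the result */
static struct pci_dev *walk_to_root(struct pci_dev *pdev)
{
	struct pci_dev *dev = pci_dev_get(pdev);

	while (dev->bus->self) {
		/* drop the old reference, continue with the upstream bridge */
		swap_pci_ref(&dev, pci_dev_get(dev->bus->self));
	}
	return dev;
}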




[PATCH 2/3 v14] iommu/fsl: Add additional iommu attributes required by the PAMU driver.

2013-04-22 Thread Varun Sethi
Added the following domain attributes for the FSL PAMU driver:
1. A new IOMMU stash attribute, which allows setting the LIODN-specific
   stash id parameter through the IOMMU API.
2. An attribute for enabling/disabling DMA to a particular memory window.
3. A domain attribute to check for PAMUV1-specific constraints.

Signed-off-by: Varun Sethi 
---
v14 changes:
- Add FSL prefix to PAMU attributes.
v13 changes:
- created a new file include/linux/fsl_pamu_stash.h for stash
attributes.
v12 changes:
- Moved PAMU specifc stash ids and structures to PAMU header file.
- no change in v11.
- no change in v10.
 include/linux/fsl_pamu_stash.h |   39 +++
 include/linux/iommu.h  |   16 
 2 files changed, 55 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/fsl_pamu_stash.h

diff --git a/include/linux/fsl_pamu_stash.h b/include/linux/fsl_pamu_stash.h
new file mode 100644
index 000..caa1b21
--- /dev/null
+++ b/include/linux/fsl_pamu_stash.h
@@ -0,0 +1,39 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2013 Freescale Semiconductor, Inc.
+ *
+ */
+
+#ifndef __FSL_PAMU_STASH_H
+#define __FSL_PAMU_STASH_H
+
+/* cache stash targets */
+enum pamu_stash_target {
+   PAMU_ATTR_CACHE_L1 = 1,
+   PAMU_ATTR_CACHE_L2,
+   PAMU_ATTR_CACHE_L3,
+};
+
+/*
+ * This attribute allows configuring stashing-specific parameters
+ * in the PAMU hardware.
+ */
+
+struct pamu_stash_attribute {
+   u32 cpu;/* cpu number */
+   u32 cache;  /* cache to stash to: L1,L2,L3 */
+};
+
+#endif  /* __FSL_PAMU_STASH_H */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2727810..c5dc2b9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -57,10 +57,26 @@ struct iommu_domain {
 #define IOMMU_CAP_CACHE_COHERENCY  0x1
 #define IOMMU_CAP_INTR_REMAP   0x2 /* isolates device intrs */
 
+/*
+ * The following constraints are specific to FSL_PAMUV1:
+ *  -aperture must be power of 2, and naturally aligned
+ *  -number of windows must be power of 2, and address space size
+ *   of each window is determined by aperture size / # of windows
+ *  -the actual size of the mapped region of a window must be power
+ *   of 2 starting with 4KB and physical address must be naturally
+ *   aligned.
+ * DOMAIN_ATTR_FSL_PAMUV1 corresponds to the above-mentioned constraints.
+ * The caller can invoke iommu_domain_get_attr to check if the underlying
+ * iommu implementation supports these constraints.
+ */
+
 enum iommu_attr {
DOMAIN_ATTR_GEOMETRY,
DOMAIN_ATTR_PAGING,
DOMAIN_ATTR_WINDOWS,
+   DOMAIN_ATTR_FSL_PAMU_STASH,
+   DOMAIN_ATTR_FSL_PAMU_ENABLE,
+   DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_MAX,
 };
 
-- 
1.7.4.1
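
A hedged example (not part of the patch; the function name set_stash_to_l1 is
illustrative) of how a caller might program the new stash attribute through
the generic IOMMU attribute interface:

#include <linux/iommu.h>
#include <linux/fsl_pamu_stash.h>

static int set_stash_to_l1(struct iommu_domain *domain, u32 cpu)
{
	struct pamu_stash_attribute stash = {
		.cpu   = cpu,
		.cache = PAMU_ATTR_CACHE_L1,
	};

	/* DOMAIN_ATTR_FSL_PAMU_STASH is the attribute added by this patch */
	return iommu_domain_set_attr(domain, DOMAIN_ATTR_FSL_PAMU_STASH, &stash);
}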




RE: [PATCH 2/3 v13] iommu/fsl: Add additional iommu attributes required by the PAMU driver.

2013-04-22 Thread Sethi Varun-B16395


> -Original Message-
> From: Wood Scott-B07421
> Sent: Tuesday, April 23, 2013 5:20 AM
> To: Sethi Varun-B16395
> Cc: j...@8bytes.org; io...@lists.linux-foundation.org; linuxppc-
> d...@lists.ozlabs.org; linux-ker...@vger.kernel.org;
> ga...@kernel.crashing.org; b...@kernel.crashing.org; Yoder Stuart-B08248;
> Sethi Varun-B16395
> Subject: Re: [PATCH 2/3 v13] iommu/fsl: Add additional iommu attributes
> required by the PAMU driver.
> 
> On 04/22/2013 12:31:55 AM, Varun Sethi wrote:
> > Added the following domain attributes for the FSL PAMU driver:
> > 1. Added new iommu stash attribute, which allows setting of the
> >LIODN specific stash id parameter through IOMMU API.
> > 2. Added an attribute for enabling/disabling DMA to a particular
> >memory window.
> > 3. Added domain attribute to check for PAMUV1 specific constraints.
> >
> > Signed-off-by: Varun Sethi 
> > ---
> > v13 changes:
> > - created a new file include/linux/fsl_pamu_stash.h for stash
> > attributes.
> > v12 changes:
> > - Moved PAMU specifc stash ids and structures to PAMU header file.
> > - no change in v11.
> > - no change in v10.
> >  include/linux/fsl_pamu_stash.h |   39
> > +++
> >  include/linux/iommu.h  |   16 
> >  2 files changed, 55 insertions(+), 0 deletions(-)  create mode 100644
> > include/linux/fsl_pamu_stash.h
> >
> > diff --git a/include/linux/fsl_pamu_stash.h
> > b/include/linux/fsl_pamu_stash.h new file mode 100644 index
> > 000..caa1b21
> > --- /dev/null
> > +++ b/include/linux/fsl_pamu_stash.h
> > @@ -0,0 +1,39 @@
> > +/*
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License, version 2,
> > as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
> > 02110-1301, USA.
> > + *
> > + * Copyright (C) 2013 Freescale Semiconductor, Inc.
> > + *
> > + */
> > +
> > +#ifndef __FSL_PAMU_STASH_H
> > +#define __FSL_PAMU_STASH_H
> > +
> > +/* cache stash targets */
> > +enum pamu_stash_target {
> > +   PAMU_ATTR_CACHE_L1 = 1,
> > +   PAMU_ATTR_CACHE_L2,
> > +   PAMU_ATTR_CACHE_L3,
> > +};
> > +
> > +/*
> > + * This attribute allows configuring stashig specific parameters
> > + * in the PAMU hardware.
> > + */
> > +
> > +struct pamu_stash_attribute {
> > +   u32 cpu;/* cpu number */
> > +   u32 cache;  /* cache to stash to: L1,L2,L3 */
> > +};
> > +
> > +#endif  /* __FSL_PAMU_STASH_H */
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 2727810..c5dc2b9 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -57,10 +57,26 @@ struct iommu_domain {
> >  #define IOMMU_CAP_CACHE_COHERENCY  0x1
> >  #define IOMMU_CAP_INTR_REMAP   0x2 /* isolates device
> > intrs */
> >
> > +/*
> > + * Following constraints are specifc to PAMUV1:
> 
> FSL_PAMUV1
> 
> > + *  -aperture must be power of 2, and naturally aligned
> > + *  -number of windows must be power of 2, and address space size
> > + *   of each window is determined by aperture size / # of windows
> > + *  -the actual size of the mapped region of a window must be power
> > + *   of 2 starting with 4KB and physical address must be naturally
> > + *   aligned.
> > + * DOMAIN_ATTR_FSL_PAMUV1 corresponds to the above mentioned
> > contraints.
> > + * The caller can invoke iommu_domain_get_attr to check if the
> > underlying
> > + * iommu implementation supports these constraints.
> > + */
> > +
> >  enum iommu_attr {
> > DOMAIN_ATTR_GEOMETRY,
> > DOMAIN_ATTR_PAGING,
> > DOMAIN_ATTR_WINDOWS,
> > +   DOMAIN_ATTR_PAMU_STASH,
> > +   DOMAIN_ATTR_PAMU_ENABLE,
> > +   DOMAIN_ATTR_FSL_PAMUV1,
> > DOMAIN_ATTR_MAX,
> 
> Please be consistent on whether "PAMU" gets an "FSL_" namespace prefix
> (I'd prefer that it does).
Submitted new version(v14) of the patch with updated attribute names.

-Varun



Re: [PATCH] powerpc/rtas_flash: New return code to indicate FW entitlement expiry

2013-04-22 Thread Ananth N Mavinakayanahalli
On Tue, Apr 23, 2013 at 10:40:10AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-04-19 at 17:14 +0530, Vasant Hegde wrote:
> > Add new return code to rtas_flash to indicate firmware entitlement
> > expiry. This will be used by the update_flash script to return
> > appropriate message to the user.
> 
> What's the point of that patch ? It adds a definition to a private .c
> file not exposed to user space and doesn't do anything with it ...

Ben,

The userspace update_flash script invokes the rtas_flash module. With
upcoming System p servers, the firmware will have the entitlement dates
encoded in it and RTAS will return an error if the entitlement has
expired. All we need from this module is for it to return that new error
which will then be communicated to the user by the update_flash.

Ananth



Re: [PATCH 1/2] powerpc: Move opcode definitions from kvm/emulate.c to asm/ppc-opcode.h

2013-04-22 Thread Michael Ellerman
On Tue, Apr 23, 2013 at 10:39:35AM +0800, Jia Hongtao wrote:
> Opcode and xopcode are useful definitions not just for KVM. Move these
> definitions to asm/ppc-opcode.h for public use.

Agreed. Though nearly everything else in ppc-opcode.h uses PPC_INST_FOO,
or at least PPC_FOO, any reason not to update these to match?

cheers

> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index 8752bc8..18de83a 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -81,6 +81,51 @@
>  #define  __REGA0_R30 30
>  #define  __REGA0_R31 31
>  
> +/* opcode and xopcode for instructions */
> +#define OP_TRAP 3
> +#define OP_TRAP_64 2
> +
> +#define OP_31_XOP_TRAP  4
> +#define OP_31_XOP_LWZX  23
> +#define OP_31_XOP_LWZUX 55
> +#define OP_31_XOP_TRAP_64   68
> +#define OP_31_XOP_DCBF  86
> +#define OP_31_XOP_LBZX  87
> +#define OP_31_XOP_STWX  151
> +#define OP_31_XOP_STBX  215
> +#define OP_31_XOP_LBZUX 119
> +#define OP_31_XOP_STBUX 247
> +#define OP_31_XOP_LHZX  279
> +#define OP_31_XOP_LHZUX 311
> +#define OP_31_XOP_MFSPR 339
> +#define OP_31_XOP_LHAX  343
> +#define OP_31_XOP_STHX  407
> +#define OP_31_XOP_STHUX 439
> +#define OP_31_XOP_MTSPR 467
> +#define OP_31_XOP_DCBI  470
> +#define OP_31_XOP_LWBRX 534
> +#define OP_31_XOP_TLBSYNC   566
> +#define OP_31_XOP_STWBRX    662
> +#define OP_31_XOP_LHBRX 790
> +#define OP_31_XOP_STHBRX    918
> +
> +#define OP_LWZ  32
> +#define OP_LD   58
> +#define OP_LWZU 33
> +#define OP_LBZ  34
> +#define OP_LBZU 35
> +#define OP_STW  36
> +#define OP_STWU 37
> +#define OP_STD  62
> +#define OP_STB  38
> +#define OP_STBU 39
> +#define OP_LHZ  40
> +#define OP_LHZU 41
> +#define OP_LHA  42
> +#define OP_LHAU 43
> +#define OP_STH  44
> +#define OP_STHU 45
> +
>  /* sorted alphabetically */
>  #define PPC_INST_DCBA0x7c0005ec
>  #define PPC_INST_DCBA_MASK   0xfc0007fe
> diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> index 7a73b6f..426d3f5 100644
> --- a/arch/powerpc/kvm/emulate.c
> +++ b/arch/powerpc/kvm/emulate.c
> @@ -30,52 +30,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "timing.h"
>  #include "trace.h"
>  
> -#define OP_TRAP 3
> -#define OP_TRAP_64 2
> -
> -#define OP_31_XOP_TRAP  4
> -#define OP_31_XOP_LWZX  23
> -#define OP_31_XOP_TRAP_64   68
> -#define OP_31_XOP_DCBF  86
> -#define OP_31_XOP_LBZX  87
> -#define OP_31_XOP_STWX  151
> -#define OP_31_XOP_STBX  215
> -#define OP_31_XOP_LBZUX 119
> -#define OP_31_XOP_STBUX 247
> -#define OP_31_XOP_LHZX  279
> -#define OP_31_XOP_LHZUX 311
> -#define OP_31_XOP_MFSPR 339
> -#define OP_31_XOP_LHAX  343
> -#define OP_31_XOP_STHX  407
> -#define OP_31_XOP_STHUX 439
> -#define OP_31_XOP_MTSPR 467
> -#define OP_31_XOP_DCBI  470
> -#define OP_31_XOP_LWBRX 534
> -#define OP_31_XOP_TLBSYNC   566
> -#define OP_31_XOP_STWBRX    662
> -#define OP_31_XOP_LHBRX 790
> -#define OP_31_XOP_STHBRX    918
> -
> -#define OP_LWZ  32
> -#define OP_LD   58
> -#define OP_LWZU 33
> -#define OP_LBZ  34
> -#define OP_LBZU 35
> -#define OP_STW  36
> -#define OP_STWU 37
> -#define OP_STD  62
> -#define OP_STB  38
> -#define OP_STBU 39
> -#define OP_LHZ  40
> -#define OP_LHZU 41
> -#define OP_LHA  42
> -#define OP_LHAU 43
> -#define OP_STH  44
> -#define OP_STHU 45
> -
>  void kvmppc_emulate_dec(struct kvm_vcpu *vcpu)
>  {
>   unsigned long dec_nsec;
> -- 
> 1.8.0
> 
> 
> 


Re: [PATCH] powerpc/rtas_flash: New return code to indicate FW entitlement expiry

2013-04-22 Thread Vasant Hegde

On 04/23/2013 06:10 AM, Benjamin Herrenschmidt wrote:

On Fri, 2013-04-19 at 17:14 +0530, Vasant Hegde wrote:

Add new return code to rtas_flash to indicate firmware entitlement
expiry. This will be used by the update_flash script to return
appropriate message to the user.


What's the point of that patch ? It adds a definition to a private .c
file not exposed to user space and doesn't do anything with it ...



This is to keep our code in sync with PAPR. When we get this return code
from the "ibm,validate-flash-image" RTAS call, the user space tool
(update_flash) reads the output buffer via the /proc interface and displays
an appropriate message to the user.


-Vasant
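
A heavily hedged sketch of the userspace side being described: the proc path
and the "status integer first" read-back format are assumptions here, not
something taken from this patch.

#include <stdio.h>

#define VALIDATE_OUT_OF_WRNTY	7	/* new status added by this patch */

static void report_validate_status(void)
{
	FILE *f = fopen("/proc/powerpc/rtas/validate_flash", "r");	/* path assumed */
	int status;

	if (f && fscanf(f, "%d", &status) == 1 &&
	    status == VALIDATE_OUT_OF_WRNTY)
		printf("Firmware entitlement has expired\n");
	if (f)
		fclose(f);
}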



Ben.


Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Vasant Hegde
---
  arch/powerpc/kernel/rtas_flash.c |1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/rtas_flash.c b/arch/powerpc/kernel/rtas_flash.c
index a7020d2..0a12c16 100644
--- a/arch/powerpc/kernel/rtas_flash.c
+++ b/arch/powerpc/kernel/rtas_flash.c
@@ -64,6 +64,7 @@
  #define VALIDATE_TMP_COMMIT_DL 4 /* Validate Return Status */
   #define VALIDATE_TMP_COMMIT    5 /* Validate Return Status */
  #define VALIDATE_TMP_UPDATE_DL 6 /* Validate Return Status */
+#define VALIDATE_OUT_OF_WRNTY  7 /* Validate Return Status */

  /* ibm,manage-flash-image operation tokens */
  #define RTAS_REJECT_TMP_IMG   0







Re: [PATCH] powerpc/rtas_flash: New return code to indicate FW entitlement expiry

2013-04-22 Thread Benjamin Herrenschmidt
On Tue, 2013-04-23 at 10:35 +0530, Ananth N Mavinakayanahalli wrote:
> On Tue, Apr 23, 2013 at 10:40:10AM +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2013-04-19 at 17:14 +0530, Vasant Hegde wrote:
> > > Add new return code to rtas_flash to indicate firmware entitlement
> > > expiry. This will be used by the update_flash script to return
> > > appropriate message to the user.
> > 
> > What's the point of that patch ? It adds a definition to a private .c
> > file not exposed to user space and doesn't do anything with it ...
> 
> Ben,
> 
> The userspace update_flash script invokes the rtas_flash module. With
> upcoming System p servers, the firmware will have the entitlement dates
> encoded in it and RTAS will return an error if the entitlement has
> expired. All we need from this module is for it to return that new error
> which will then be communicated to the user by the update_flash.

That doesn't answer my question :-)

What is the point of adding a #define to a piece of code without any user
of that definition and in a file that isn't exposed to user space ?

IE. What is the point of the patch ?

Cheers,
Ben.




[PATCH V4 4/5] powerpc, perf: Define BHRB generic functions, data and flags for POWER8

2013-04-22 Thread Anshuman Khandual
This patch populates the BHRB specific data in the power_pmu structure for
POWER8. It also implements the POWER8 specific BHRB filter and configuration
functions.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/power8-pmu.c | 57 +-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 106ae0b..153408c 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -109,6 +109,16 @@
 #define EVENT_IS_MARKED                (EVENT_MARKED_MASK << EVENT_MARKED_SHIFT)
 #define EVENT_PSEL_MASK                0xff    /* PMCxSEL value */
 
+/* MMCRA IFM bits - POWER8 */
+#define POWER8_MMCRA_IFM1              0x0000000040000000UL
+#define POWER8_MMCRA_IFM2              0x0000000080000000UL
+#define POWER8_MMCRA_IFM3              0x00000000C0000000UL
+
+#define ONLY_PLM \
+       (PERF_SAMPLE_BRANCH_USER        |\
+        PERF_SAMPLE_BRANCH_KERNEL      |\
+        PERF_SAMPLE_BRANCH_HV)
+
 /*
  * Layout of constraint bits:
  *
@@ -428,6 +438,48 @@ static int power8_generic_events[] = {
[PERF_COUNT_HW_BRANCH_MISSES] = PM_BR_MPRED_CMPL,
 };
 
+static u64 power8_bhrb_filter_map(u64 branch_sample_type)
+{
+   u64 pmu_bhrb_filter = 0;
+   u64 br_privilege = branch_sample_type & ONLY_PLM;
+
+   /* BHRB and regular PMU events share the same privilege state
+    * filter configuration. BHRB is always recorded along with a
+    * regular PMU event. So the privilege state filter criteria for BHRB
+    * and the companion PMU events have to be the same. By default the
+    * "perf record" tool sets all privilege bits ON when no filter
+    * criteria is provided on the command line. So as long as all
+    * privilege bits are either ON or OFF, we are good to go.
+    */
+   if ((br_privilege != 7) && (br_privilege != 0))
+   return -1;
+
+   /* No branch filter requested */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY)
+   return pmu_bhrb_filter;
+
+   /* Invalid branch filter options - HW does not support */
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_IND_CALL)
+   return -1;
+
+   if (branch_sample_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+   pmu_bhrb_filter |= POWER8_MMCRA_IFM1;
+   return pmu_bhrb_filter;
+   }
+
+   /* Everything else is unsupported */
+   return -1;
+}
+
+static void power8_config_bhrb(u64 pmu_bhrb_filter)
+{
+   /* Enable BHRB filter in PMU */
+   mtspr(SPRN_MMCRA, (mfspr(SPRN_MMCRA) | pmu_bhrb_filter));
+}
+
 static struct power_pmu power8_pmu = {
.name   = "POWER8",
.n_counter  = 6,
@@ -435,12 +487,15 @@ static struct power_pmu power8_pmu = {
.add_fields = POWER8_ADD_FIELDS,
.test_adder = POWER8_TEST_ADDER,
.compute_mmcr   = power8_compute_mmcr,
+   .config_bhrb= power8_config_bhrb,
+   .bhrb_filter_map= power8_bhrb_filter_map,
.get_constraint = power8_get_constraint,
.disable_pmc= power8_disable_pmc,
-   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER,
+   .flags  = PPMU_HAS_SSLOT | PPMU_HAS_SIER | PPMU_BHRB,
.n_generic  = ARRAY_SIZE(power8_generic_events),
.generic_events = power8_generic_events,
.attr_groups= power8_pmu_attr_groups,
+   .bhrb_nr= 32,
 };
 
 static int __init init_power8_pmu(void)
-- 
1.7.11.7
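
For context, a hedged userspace sketch of a branch sampling request that the
filter map above accepts and translates to POWER8_MMCRA_IFM1 (all privilege
bits set plus the any-call filter):

#include <linux/perf_event.h>
#include <string.h>

static void setup_branch_sampling(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;
	attr->sample_type = PERF_SAMPLE_BRANCH_STACK;
	/* all three privilege bits ON, so br_privilege == 7 in the filter map */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_KERNEL |
				   PERF_SAMPLE_BRANCH_HV |
				   PERF_SAMPLE_BRANCH_ANY_CALL;
}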



[PATCH V4 3/5] powerpc, perf: Add new BHRB related generic functions, data and flags

2013-04-22 Thread Anshuman Khandual
This patch adds a couple of generic function pointers to the power_pmu
structure which configure the BHRB and its filters. It also adds a field
representing the number of BHRB entries present on the PMU. A new PMU flag,
PPMU_BHRB, indicates the presence of the BHRB feature.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 57b42da..3f0c15c 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -33,6 +33,8 @@ struct power_pmu {
unsigned long *valp);
int (*get_alternatives)(u64 event_id, unsigned int flags,
u64 alt[]);
+   u64 (*bhrb_filter_map)(u64 branch_sample_type);
+   void(*config_bhrb)(u64 pmu_bhrb_filter);
void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]);
int (*limited_pmc_event)(u64 event_id);
u32 flags;
@@ -42,6 +44,9 @@ struct power_pmu {
int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
+
+   /* BHRB entries in the PMU */
+   int bhrb_nr;
 };
 
 /*
@@ -54,6 +59,7 @@ struct power_pmu {
 #define PPMU_SIAR_VALID        0x00000010 /* Processor has SIAR Valid bit */
 #define PPMU_HAS_SSLOT         0x00000020 /* Has sampled slot in MMCRA */
 #define PPMU_HAS_SIER          0x00000040 /* Has SIER */
+#define PPMU_BHRB              0x00000080 /* has BHRB feature enabled */
 
 /*
  * Values for flags to get_alternatives()
-- 
1.7.11.7
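
A rough sketch, not taken from the series, of how the new flag and callbacks
are meant to be consumed by the core code (the helper name try_setup_bhrb is
illustrative; the real wiring happens in a later patch of this series):

#include <linux/errno.h>
#include <asm/perf_event_server.h>

static int try_setup_bhrb(struct power_pmu *pmu, u64 branch_sample_type)
{
	u64 filter;

	if (!(pmu->flags & PPMU_BHRB) || !pmu->bhrb_nr)
		return -EOPNOTSUPP;		/* no BHRB on this PMU */

	filter = pmu->bhrb_filter_map(branch_sample_type);
	if (filter == (u64)-1)
		return -EOPNOTSUPP;		/* unsupported filter combination */

	pmu->config_bhrb(filter);		/* program the PMU-specific filter */
	return 0;
}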



[PATCH V4 5/5] powerpc, perf: Enable branch stack sampling framework

2013-04-22 Thread Anshuman Khandual
Provides basic enablement of the perf branch stack sampling framework on
POWER8 processor based platforms. Adds new BHRB related elements to the
cpu_hw_events structure to represent the current BHRB configuration and
filter, manage BHRB context, and hold the output BHRB buffer during a PMU
interrupt before it is passed to user space. This also enables processing
of the BHRB data and converts it into the generic perf branch stack data
format.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/perf_event_server.h |   1 +
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 2 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 3f0c15c..f265049 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -73,6 +73,7 @@ extern int register_power_pmu(struct power_pmu *);
 struct pt_regs;
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long int read_bhrb(int n);
 
 /*
  * Only override the default definitions in include/linux/perf_event.h
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 4ac6e64..c627843 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -19,6 +19,11 @@
 #include 
 #include 
 
+#define BHRB_MAX_ENTRIES   32
+#define BHRB_TARGET            0x0000000000000002
+#define BHRB_PREDICTION        0x0000000000000001
+#define BHRB_EA                0xFFFFFFFFFFFFFFFC
+
 struct cpu_hw_events {
int n_events;
int n_percpu;
@@ -38,7 +43,15 @@ struct cpu_hw_events {
 
unsigned int group_flag;
int n_txn_start;
+
+   /* BHRB bits */
+   u64 bhrb_filter;/* BHRB HW branch 
filter */
+   int bhrb_users;
+   void*bhrb_context;
+   struct  perf_branch_stack   bhrb_stack;
+   struct  perf_branch_entry   bhrb_entries[BHRB_MAX_ENTRIES];
 };
+
 DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 struct power_pmu *ppmu;
@@ -858,6 +871,9 @@ static void power_pmu_enable(struct pmu *pmu)
}
 
  out:
+   if (cpuhw->bhrb_users)
+   ppmu->config_bhrb(cpuhw->bhrb_filter);
+
local_irq_restore(flags);
 }
 
@@ -888,6 +904,47 @@ static int collect_events(struct perf_event *group, int 
max_count,
return n;
 }
 
+/* Reset all possible BHRB entries */
+static void power_pmu_bhrb_reset(void)
+{
+   asm volatile(PPC_CLRBHRB);
+}
+
+void power_pmu_bhrb_enable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   /* Clear BHRB if we changed task context to avoid data leaks */
+   if (event->ctx->task && cpuhw->bhrb_context != event->ctx) {
+   power_pmu_bhrb_reset();
+   cpuhw->bhrb_context = event->ctx;
+   }
+   cpuhw->bhrb_users++;
+}
+
+void power_pmu_bhrb_disable(struct perf_event *event)
+{
+   struct cpu_hw_events *cpuhw = &__get_cpu_var(cpu_hw_events);
+
+   if (!ppmu->bhrb_nr)
+   return;
+
+   cpuhw->bhrb_users--;
+   WARN_ON_ONCE(cpuhw->bhrb_users < 0);
+
+   if (!cpuhw->disabled && !cpuhw->bhrb_users) {
+   /* BHRB cannot be turned off when other
+* events are active on the PMU.
+*/
+
+   /* avoid stale pointer */
+   cpuhw->bhrb_context = NULL;
+   }
+}
+
 /*
  * Add a event to the PMU.
  * If all events are not already frozen, then we disable and
@@ -947,6 +1004,9 @@ nocheck:
 
ret = 0;
  out:
+   if (has_branch_stack(event))
+   power_pmu_bhrb_enable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
return ret;
@@ -999,6 +1059,9 @@ static void power_pmu_del(struct perf_event *event, int 
ef_flags)
cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
}
 
+   if (has_branch_stack(event))
+   power_pmu_bhrb_disable(event);
+
perf_pmu_enable(event->pmu);
local_irq_restore(flags);
 }
@@ -1117,6 +1180,15 @@ int power_pmu_commit_txn(struct pmu *pmu)
return 0;
 }
 
+/* Called from ctxsw to prevent one process's branch entries from
+ * mingling with another process's entries during a context switch.
+ */
+void power_pmu_flush_branch_stack(void)
+{
+   if (ppmu->bhrb_nr)
+   power_pmu_bhrb_reset();
+}
+
 /*
  * Return 1 if we might be able to put event on a limited PMC,
  * or 0 if not.
@@ -1231,9 +1303,11 @@ static int power_pmu_event_init(struct perf_event *event)
if (!ppmu)
return -ENOENT;
 
-   /* does not support taken branch sampling */
-   if (has_branch_stack

[PATCH V4 1/5] powerpc, perf: Add new BHRB related instructions for POWER8

2013-04-22 Thread Anshuman Khandual
This patch adds new POWER8 instruction encodings for reading
and clearing Branch History Rolling Buffer entries. The new
instruction 'mfbhrbe' (move from branch history rolling buffer
entry) is used to read BHRB entries and the instruction
'clrbhrb' (clear branch history rolling buffer) is used to
clear the entire buffer. The encoding of 'clrbhrb' is
straightforward, while 'mfbhrbe' has the form 'mfbhrbe RT, BHRBE'
and takes two arguments: the index of the BHRB entry to read and
the general purpose register that receives the value read from
that entry.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/ppc-opcode.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 8752bc8..0c34e48 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -82,6 +82,8 @@
 #define__REGA0_R31 31
 
 /* sorted alphabetically */
+#define PPC_INST_BHRBE 0x7c00025c
+#define PPC_INST_CLRBHRB   0x7c00035c
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -297,6 +299,12 @@
 #define PPC_NAP                stringify_in_c(.long PPC_INST_NAP)
 #define PPC_SLEEP  stringify_in_c(.long PPC_INST_SLEEP)
 
+/* BHRB instructions */
+#define PPC_CLRBHRB            stringify_in_c(.long PPC_INST_CLRBHRB)
+#define PPC_MFBHRBE(r, n)  stringify_in_c(.long PPC_INST_BHRBE | \
+   __PPC_RT(r) | \
+   (((n) & 0x3ff) << 11))
+
 /* Transactional memory instructions */
 #define TRECHKPT   stringify_in_c(.long PPC_INST_TRECHKPT)
 #define TRECLAIM(r)stringify_in_c(.long PPC_INST_TRECLAIM \
-- 
1.7.11.7
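
To make the encoding concrete, a small hedged illustration (the macro names
below are local to the example): the target GPR number is shifted left by 21
via __PPC_RT() and the entry index left by 11, so PPC_MFBHRBE(R3, 5) assembles
to the word computed here.

#include <stdio.h>

#define INST_BHRBE	0x7c00025c
#define RT(r)		(((r) & 0x1f) << 21)
#define BHRBE_IDX(n)	(((n) & 0x3ff) << 11)

int main(void)
{
	/* entry 5 read into GPR r3: prints 0x7c602a5c */
	printf("0x%08x\n", (unsigned)(INST_BHRBE | RT(3) | BHRBE_IDX(5)));
	return 0;
}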



[PATCH V4 2/5] powerpc, perf: Add basic assembly code to read BHRB entries on POWER8

2013-04-22 Thread Anshuman Khandual
This patch adds the basic assembly code to read the BHRB buffer. BHRB
entries are valid only after a PMU interrupt has happened (when
MMCR0[PMAO]=1) and the BHRB has been frozen. A BHRB read should not be
attempted while the BHRB is still enabled (MMCR0[PMAE]=1) and being
updated, as this can produce non-deterministic results.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/perf/Makefile |  2 +-
 arch/powerpc/perf/bhrb.S   | 44 
 2 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 472db18..510fae1 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,7 +2,7 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   power8-pmu.o
diff --git a/arch/powerpc/perf/bhrb.S b/arch/powerpc/perf/bhrb.S
new file mode 100644
index 000..d85f9a5
--- /dev/null
+++ b/arch/powerpc/perf/bhrb.S
@@ -0,0 +1,44 @@
+/*
+ * Basic assembly code to read BHRB entries
+ *
+ * Copyright 2013 Anshuman Khandual, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include 
+#include 
+
+   .text
+
+.balign 8
+
+/* r3 = n  (where n = [0-31])
+ * The maximum number of BHRB entries supported with PPC_MFBHRBE instruction
+ * is 1024. We have a limited number of table entries here because POWER8
+ * implements only 32 BHRB entries.
+ */
+
+/* .global read_bhrb */
+_GLOBAL(read_bhrb)
+   cmpldi  r3,31
+   bgt 1f
+   ld  r4,bhrb_table@got(r2)
+   sldir3,r3,3
+   add r3,r4,r3
+   mtctr   r3
+   bctr
+1: li  r3,0
+   blr
+
+#define MFBHRB_TABLE1(n) PPC_MFBHRBE(R3,n); blr
+#define MFBHRB_TABLE2(n) MFBHRB_TABLE1(n); MFBHRB_TABLE1(n+1)
+#define MFBHRB_TABLE4(n) MFBHRB_TABLE2(n); MFBHRB_TABLE2(n+2)
+#define MFBHRB_TABLE8(n) MFBHRB_TABLE4(n); MFBHRB_TABLE4(n+4)
+#define MFBHRB_TABLE16(n) MFBHRB_TABLE8(n); MFBHRB_TABLE8(n+8)
+#define MFBHRB_TABLE32(n) MFBHRB_TABLE16(n); MFBHRB_TABLE16(n+16)
+
+bhrb_table:
+   MFBHRB_TABLE32(0)
-- 
1.7.11.7
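
A hedged sketch of how a caller might drain the buffer with read_bhrb() once
the PMU interrupt has frozen it; the "an all-zero word ends the valid entries"
convention is an assumption here, and drain_bhrb is an illustrative name:

extern unsigned long int read_bhrb(int n);

static int drain_bhrb(unsigned long *dst, int nr_entries)
{
	int i;

	for (i = 0; i < nr_entries; i++) {
		unsigned long val = read_bhrb(i);

		if (!val)		/* assumed end-of-valid-entries marker */
			break;
		dst[i] = val;
	}
	return i;			/* number of entries copied */
}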



[PATCH V4 0/5] powerpc, perf: BHRB based branch stack enablement on POWER8

2013-04-22 Thread Anshuman Khandual
Branch History Rolling Buffer (BHRB) is a new PMU feature in the IBM
POWER8 processor which records branch instructions inside the execution
pipeline. This patchset enables the basic functionality of the feature
through the generic perf branch stack sampling framework.

Sample output
-------------
$./perf record -b top
$./perf report

# Overhead  Command  Source Shared Object  Source Symbol      Target Shared Object  Target Symbol
# ........  .......  ....................  .................  ....................  .................
#
     7.82%  top      libc-2.11.2.so        [k] _IO_vfscanf    libc-2.11.2.so        [k] _IO_vfscanf
     6.17%  top      libc-2.11.2.so        [k] _IO_vfscanf    [unknown]             [k]
     2.37%  top      [unknown]             [k] 0xf7aafb30     [unknown]             [k]
     1.80%  top      [unknown]             [k] 0x0fe07978     libc-2.11.2.so        [k] _IO_vfscanf
     1.60%  top      libc-2.11.2.so        [k] _IO_vfscanf    [kernel.kallsyms]     [k] .do_task_stat
     1.20%  top      [kernel.kallsyms]     [k] .do_task_stat  [kernel.kallsyms]     [k] .do_task_stat
     1.02%  top      libc-2.11.2.so        [k] vfprintf       libc-2.11.2.so        [k] vfprintf
     0.92%  top      top                   [k] _init          [unknown]             [k] 0x0fe037f4

Changes in V2
-------------
- Added copyright messages to the newly created files
- Modified a couple of commit messages

Changes in V3
-------------
- Incorporated review comments from Segher https://lkml.org/lkml/2013/4/16/350
- Worked on a solution for the review comment from Michael Ellerman
  https://lkml.org/lkml/2013/4/17/548
- Could not move the updated cpu_hw_events structure from core-book3s.c
  into perf_event_server.h, because perf_event_server.h is pulled in inside
  linux/perf_event.h before the definition of the perf_branch_entry
  structure. That is why the perf_branch_entry definition is not available
  inside perf_event_server.h, where we define the array inside cpu_hw_events.

- Finally pulled the code from perf_event_bhrb.c into core-book3s.c

- Improved documentation for the patchset

Changes in V4
-------------
- Incorporated review comments on V3 regarding new instruction encoding

Anshuman Khandual (5):
  powerpc, perf: Add new BHRB related instructions for POWER8
  powerpc, perf: Add basic assembly code to read BHRB entries on POWER8
  powerpc, perf: Add new BHRB related generic functions, data and flags
  powerpc, perf: Define BHRB generic functions, data and flags for POWER8
  powerpc, perf: Enable branch stack sampling framework

 arch/powerpc/include/asm/perf_event_server.h |   7 ++
 arch/powerpc/include/asm/ppc-opcode.h|   8 ++
 arch/powerpc/perf/Makefile   |   2 +-
 arch/powerpc/perf/bhrb.S |  44 +++
 arch/powerpc/perf/core-book3s.c  | 167 ++-
 arch/powerpc/perf/power8-pmu.c   |  57 -
 6 files changed, 280 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/perf/bhrb.S

-- 
1.7.11.7



RE: [PATCH] powerpc/fsl-pci:fix incorrect iounmap pci hose->private_data

2013-04-22 Thread Zang Roy-R61911


> -Original Message-
> From: Zang Roy-R61911
> Sent: Tuesday, April 23, 2013 2:36 AM
> To: linuxppc-dev@lists.ozlabs.org
> Cc: ga...@kernel.crashing.org; Zang Roy-R61911; Chen Yuanquan-B41889
> Subject: [PATCH] powerpc/fsl-pci:fix incorrect iounmap pci hose-
> >private_data
> 
> pci hose->private_data will be used by other functions, for example
> fsl_pcie_check_link(), so do not iounmap it.
> 
> Fix the kernel crash on T4240:
> 
> Unable to handle kernel paging request for data at address
> 0x880080060f14
> Faulting instruction address: 0xc0032554
> Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=24 T4240 QDS
> Modules linked in:
> NIP: c0032554 LR: c003254c CTR: c001e5c0
> REGS: c00179143440 TRAP: 0300   Not tainted
> (3.8.8-rt2-00754-g951f064-dirt)
> MSR: 80029000   CR: 24adbe22  XER: 
> SOFTE: 0
> DEAR: 880080060f14, ESR:  TASK = c0017913d2c0[1]
> 'swapper/0' THREAD: c0017914 CPU: 2
> GPR00: c003254c c001791436c0 c0ae2998
> 0027
> GPR04:  05a5 
> 0002
> GPR08: 3030303038303038 c0a2d4d0 c0aebeb8
> c0af2998
> GPR12: 24adbe22 cfffa800 c0001be0
> 
> GPR16:   
> 
> GPR20:   
> c09ddf70
> GPR24: c09e8d40 c0af2998 c0b1529c
> c00179143b40
> GPR28: c001799b4000 c00179143c00 88008006
> c0727ec8
> NIP [c0032554] .fsl_pcie_check_link+0x104/0x150 LR
> [c003254c] .fsl_pcie_check_link+0xfc/0x150 Call Trace:
> [c001791436c0] [c003254c] .fsl_pcie_check_link+0xfc/0x150
> (unreliab)
> [c00179143a30] [c00325d4]
> .fsl_indirect_read_config+0x34/0xb0
> [c00179143ad0] [c02c7ee8]
> .pci_bus_read_config_byte+0x88/0xd0
> [c00179143b90] [c09c0528] .pci_apply_final_quirks+0x9c/0x18c
> [c00179143c40] [c000142c] .do_one_initcall+0x5c/0x1f0
> [c00179143cf0] [c09a0bb4] .kernel_init_freeable+0x180/0x264
> [c00179143db0] [c0001bfc] .kernel_init+0x1c/0x420
> [c00179143e30] [c8b4] .ret_from_kernel_thread+0x64/0xb0
> Instruction dump:
> 6000 4ba0 ebc301d0 3fe2ffc4 3c62ffe0 3bff5530 38638a78 7fe4fb78
> 7fc5f378 486ea77d 6000 7c0004ac <801e0f14> 0c00 4c00012c 3c62ffe0
> ---[ end trace f841fbc03c9d2e1b ]---
> 
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
> 
> Rebooting in 180 seconds..
> 
> Signed-off-by: Yuanquan Chen 
> Signed-off-by: Roy Zang 
> ---
> based on Kumar's next branch.
> tested on P3041 and T4240.
Please ignore this patch, I will send a v2 version.
Thanks.
Roy



Re: [PATCH] powerpc/rtas_flash: New return code to indicate FW entitlement expiry

2013-04-22 Thread Ananth N Mavinakayanahalli
On Tue, Apr 23, 2013 at 03:32:30PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-04-23 at 10:35 +0530, Ananth N Mavinakayanahalli wrote:
> > On Tue, Apr 23, 2013 at 10:40:10AM +1000, Benjamin Herrenschmidt wrote:
> > > On Fri, 2013-04-19 at 17:14 +0530, Vasant Hegde wrote:
> > > > Add new return code to rtas_flash to indicate firmware entitlement
> > > > expiry. This will be used by the update_flash script to return
> > > > appropriate message to the user.
> > > 
> > > What's the point of that patch ? It adds a definition to a private .c
> > > file not exposed to user space and doesn't do anything with it ...
> > 
> > Ben,
> > 
> > The userspace update_flash script invokes the rtas_flash module. With
> > upcoming System p servers, the firmware will have the entitlement dates
> > encoded in it and RTAS will return an error if the entitlement has
> > expired. All we need from this module is for it to return that new error
> > which will then be communicated to the user by the update_flash.
> 
> That doesn't answer my question :-)
> 
> What is the point of adding a #define to a piece of code without any user
> of that definition and in a file that isn't exposed to user space ?
> 
> IE. What is the point of the patch ?

Strictly, we don't need this (kernel) update...

But to keep the code in sync with PAPR, this was added. Agree that the
other return codes also don't say much about what they are for. Will
redo the patch with that info for better code readability.



RE: [PATCH 1/2] powerpc: Move opcode definitions from kvm/emulate.c to asm/ppc-opcode.h

2013-04-22 Thread Jia Hongtao-B38951


> -Original Message-
> From: Linuxppc-dev [mailto:linuxppc-dev-
> bounces+b38951=freescale@lists.ozlabs.org] On Behalf Of Michael
> Ellerman
> Sent: Tuesday, April 23, 2013 1:30 PM
> To: Jia Hongtao-B38951
> Cc: Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH 1/2] powerpc: Move opcode definitions from
> kvm/emulate.c to asm/ppc-opcode.h
> 
> On Tue, Apr 23, 2013 at 10:39:35AM +0800, Jia Hongtao wrote:
> > Opcode and xopcode are useful definitions not just for KVM. Move these
> > definitions to asm/ppc-opcode.h for public use.
> 
> Agreed. Though nearly everything else in ppc-opcode.h uses PPC_INST_FOO,
> or at least PPC_FOO, any reason not to update these to match?
> 
> cheers

These definitions were first used by KVM and are named like OP_31_XOP_TRAP.
There are two ways to extract these definitions for public use:

1. What this patch does: keep the names unchanged so that the KVM code
   using these definitions does not need to change.

2. Move these definitions to another .h file, as my last patch did:
   http://patchwork.ozlabs.org/patch/235646/
   You can see the comments there.

Thanks.
-Hongtao
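
As a non-KVM illustration of why the shared definitions are useful, a hedged
decode sketch (get_op()/get_xop() are assumed to come from asm/disassemble.h,
as in the KVM emulation code):

#include <linux/types.h>
#include <asm/disassemble.h>
#include <asm/ppc-opcode.h>

/* Return 1 if 'inst' is one of the load-word forms handled here */
static int is_word_load(u32 inst)
{
	switch (get_op(inst)) {
	case OP_LWZ:
	case OP_LWZU:
		return 1;			/* D-form loads */
	case 31:
		switch (get_xop(inst)) {
		case OP_31_XOP_LWZX:
		case OP_31_XOP_LWZUX:
		case OP_31_XOP_LWBRX:
			return 1;		/* X-form loads */
		}
	}
	return 0;
}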





