Re: [RFC v2 5/7] powerpc: Rename context.vdso_base to context.vdso

2016-11-07 Thread Michael Ellerman
Will Deacon  writes:
> On Fri, Nov 04, 2016 at 03:58:22PM +1100, Michael Ellerman wrote:
>> Christopher Covington  writes:
>> >  arch/powerpc/include/asm/book3s/32/mmu-hash.h |  2 +-
>> >  arch/powerpc/include/asm/book3s/64/mmu.h  |  2 +-
>> >  arch/powerpc/include/asm/mm-arch-hooks.h  |  6 +++---
>> >  arch/powerpc/include/asm/mmu-40x.h|  2 +-
>> >  arch/powerpc/include/asm/mmu-44x.h|  2 +-
>> >  arch/powerpc/include/asm/mmu-8xx.h|  2 +-
>> >  arch/powerpc/include/asm/mmu-book3e.h |  2 +-
>> >  arch/powerpc/include/asm/mmu_context.h|  4 ++--
>> >  arch/powerpc/include/asm/vdso.h   |  2 +-
>> >  arch/powerpc/include/uapi/asm/elf.h   |  2 +-
>> >  arch/powerpc/kernel/signal_32.c   |  8 
>> >  arch/powerpc/kernel/signal_64.c   |  4 ++--
>> >  arch/powerpc/kernel/vdso.c|  8 
>> >  arch/powerpc/perf/callchain.c | 12 ++--
>> >  14 files changed, 29 insertions(+), 29 deletions(-)
>> 
>> This is kind of annoying, but I guess it's worth doing.
>> 
>> It's going to conflict like hell though. Who were you thinking would
>> merge this series? I think it should go via Andrew Morton's tree, as
>> that way if we get bad conflicts we can pull it out and redo it.
>
> The other thing you can do is generate the patch towards the end of the
> merge window and send it as a separate pull request. The disadvantage of
> that is that it can't spend any time in -next, but that might be ok for a
> mechanical rename.

True. Though in this case it's a mechanical rename that then allows us
to use the generic code, so I'd prefer we had some -next coverage on the
latter.

The other other option would be to wrap all uses of the arch value in a
macro (or actually two probably, one a getter one a setter). That would
then allow arches to use the generic code regardless of the name and
type of their mm->context.vdso_whatever.

That would allow the basic series to go in, and then each arch could do
a series later that switches it to the "standard" name and type.

cheers


Re: [PATCH v2] ppc: cpufreq: disable preemption while checking CPU throttling state

2016-11-07 Thread Denis Kirjanov
On 11/7/16, Michael Ellerman  wrote:
> Denis Kirjanov  writes:
>
>> [   67.700897] BUG: using smp_processor_id() in preemptible []
>> code: cat/7343
>> [   67.700988] caller is .powernv_cpufreq_throttle_check+0x2c/0x710
>> [   67.700998] CPU: 13 PID: 7343 Comm: cat Not tainted 4.8.0-rc5-dirty
>> #1
>> [   67.701038] Call Trace:
>> [   67.701066] [c007d25b75b0] [c0971378]
>> .dump_stack+0xe4/0x150 (unreliable)
>> [   67.701153] [c007d25b7640] [c05162e4]
>> .check_preemption_disabled+0x134/0x150
>> [   67.701238] [c007d25b76e0] [c07b63ac]
>> .powernv_cpufreq_throttle_check+0x2c/0x710
>> [   67.701322] [c007d25b7790] [c07b6d18]
>> .powernv_cpufreq_target_index+0x288/0x360
>> [   67.701407] [c007d25b7870] [c07acee4]
>> .__cpufreq_driver_target+0x394/0x8c0
>> [   67.701491] [c007d25b7920] [c07b22ac]
>> .cpufreq_set+0x7c/0xd0
>> [   67.701565] [c007d25b79b0] [c07adf50]
>> .store_scaling_setspeed+0x80/0xc0
>> [   67.701650] [c007d25b7a40] [c07ae270]
>> .store+0xa0/0x100
>> [   67.701723] [c007d25b7ae0] [c03566e8]
>> .sysfs_kf_write+0x88/0xb0
>> [   67.701796] [c007d25b7b70] [c03553b8]
>> .kernfs_fop_write+0x178/0x260
>> [   67.701881] [c007d25b7c10] [c02ac3cc]
>> .__vfs_write+0x3c/0x1c0
>> [   67.701954] [c007d25b7cf0] [c02ad584]
>> .vfs_write+0xc4/0x230
>> [   67.702027] [c007d25b7d90] [c02aeef8]
>> .SyS_write+0x58/0x100
>> [   67.702101] [c007d25b7e30] [c000bfec]
>> system_call+0x38/0xfc
>>
>> Signed-off-by: Denis Kirjanov 
>>
>> v2:  wrap powernv_cpufreq_throttle_check()
>> as suggested by Gautham R Shenoy
>
> That should be below the "---".
>
> When did this break?

I think that problem exists since commit
09a972d1620934142d30cfda455ffe429af751c4 ("cpufreq: powernv: Report
cpu frequency throttling") but I can be wrong
>
> cheers
>


Re: Coding Style: Reverse XMAS tree declarations ? (was Re: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet)

2016-11-07 Thread Michael Ellerman
Joe Perches  writes:
> On Thu, 2016-11-03 at 15:58 -0400, David Miller wrote:
>> From: Madalin Bucur 
>> Date: Wed, 2 Nov 2016 22:17:26 +0200
>> 
>> > This introduces the Freescale Data Path Acceleration Architecture
>> > +static inline size_t bpool_buffer_raw_size(u8 index, u8 cnt)
>> > +{
>> > + u8 i;
>> > + size_t res = DPAA_BP_RAW_SIZE / 2;
>> 
>> Always order local variable declarations from longest to shortest line,
>> also know as Reverse Christmas Tree Format.
>
> I think this declaration sorting order is misguided but
> here's a possible change to checkpatch adding a test for it
> that does this test just for net/ and drivers/net/

And arch/powerpc too please.

cheers


Re: [PATCH v3] console: use first console if stdout-path device doesn't appear

2016-11-07 Thread Michael Ellerman
Paul Burton  writes:

> If a device tree specified a preferred device for kernel console output
> via the stdout-path or linux,stdout-path chosen node properties there's
> no guarantee that it will have specified a device for which we have a
> driver. It may also be the case that we do have a driver but it doesn't
> call of_console_check() to register as a preferred console (eg. offb
> driver as used on powermac systems).
>
> In these cases try to ensure that we provide some console output by
> enabling the first usable registered console, which we keep track of
> with the of_fallback_console variable. Affected systems will enable
> their console later than they did prior to commit 05fd007e4629
> ("console: don't prefer first registered if DT specifies stdout-path")
> but should otherwise produce the same output.
>
> Tested in QEMU with a PowerPC pseries_defconfig kernel.

Hi Paul,

This does "work", as in it boots and I get a console. But the delay in
getting output on the VGA is not workable. I get pretty much no output
until the machine is booted entirely to userspace, meaning any crash
prior to that will be undebuggable.

I also note Andreas reports it doesn't work at all on PowerMac.

Please send a revert and we can try again next cycle.

cheers


Re: [PATCH] vfio: Fix build break when SPAPR_TCE_IOMMU=n

2016-11-07 Thread Michael Ellerman
Paolo Bonzini  writes:
> On 04/11/2016 06:48, Michael Ellerman wrote:
>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>> index da6e2ce77495..6b51a4ebed8a 100644
>> --- a/drivers/vfio/Kconfig
>> +++ b/drivers/vfio/Kconfig
>> @@ -6,12 +6,12 @@ config VFIO_IOMMU_TYPE1
>>  config VFIO_IOMMU_SPAPR_TCE
>>  tristate
>>  depends on VFIO && SPAPR_TCE_IOMMU
>> -default n
>> +default VFIO
>
> No need to depend on VFIO since you already have it in "default".

True, I can take that out.

> (I assume you cannot use "default y" because "depends on" doesn't downgrade
> "y" to "m" when VFIO is a module.

Correct.

> A shorthand is
>
>   def_tristate VFIO && SPAPR_TCE_IOMMU

Yep. My experience though is that a lot of folks don't really know what
that means. So I prefer to spell it out with an explicit type, depends
and default.

But I'll respin it that way if Alex prefers the shorter style.

cheers


[PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-07 Thread Aneesh Kumar K.V
Architectures like ppc64 want to use page table deposit/withraw
even with huge pmd dax entries. Allow arch to override the
vma_is_anonymous check by moving that to pmd_move_must_withdraw
function

Signed-off-by: Aneesh Kumar K.V 
---
 include/asm-generic/pgtable.h | 12 
 mm/huge_memory.c  | 17 +++--
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index c4f8fd2fd384..324990273ad2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
-#ifndef pmd_move_must_withdraw
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
-spinlock_t *old_pmd_ptl)
-{
-   /*
-* With split pmd lock we also need to move preallocated
-* PTE page table if new_pmd is on different PMD page table.
-*/
-   return new_pmd_ptl != old_pmd_ptl;
-}
-#endif
-
 /*
  * This function is meant to be used by sites walking pagetables with
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..1ac1b0ca63c4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1424,6 +1424,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
return 1;
 }
 
+#ifndef pmd_move_must_withdraw
+static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
+spinlock_t *old_pmd_ptl)
+{
+   /*
+* With split pmd lock we also need to move preallocated
+* PTE page table if new_pmd is on different PMD page table.
+*
+* We also don't deposit and withdraw tables for file pages.
+*/
+   return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+}
+#endif
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, unsigned long old_end,
  pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1458,8 +1472,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned 
long old_addr,
pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));
 
-   if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
-   vma_is_anonymous(vma)) {
+   if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
-- 
2.10.2



[PATCH 2/2] mm: THP page cache support for ppc64

2016-11-07 Thread Aneesh Kumar K.V
Add arch specific callback in the generic THP page cache code that will
deposit and withdarw preallocated page table. Archs like ppc64 use
this preallocated table to store the hash pte slot information.

Testing:
kernel build of the patch series on tmpfs mounted with option huge=always

The related thp stat:
thp_fault_alloc 72939
thp_fault_fallback 60547
thp_collapse_alloc 603
thp_collapse_alloc_failed 0
thp_file_alloc 253763
thp_file_mapped 4251
thp_split_page 51518
thp_split_page_failed 1
thp_deferred_split_page 73566
thp_split_pmd 665
thp_zero_page_alloc 3
thp_zero_page_alloc_failed 0

Signed-off-by: Aneesh Kumar K.V 
---
NOTE: I decided not to hide the arch_needs_pgtable_deposit() check
in zap_deposited_table() because IMHO that hides the new code path
which makes following this special case confusing. 

 arch/powerpc/include/asm/book3s/64/pgtable.h | 10 +
 include/asm-generic/pgtable.h|  3 ++
 mm/Kconfig   |  6 +--
 mm/huge_memory.c | 17 +
 mm/khugepaged.c  | 21 ++-
 mm/memory.c  | 56 +++-
 6 files changed, 96 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9fd77f8794a0..9eed9d66d24f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1020,6 +1020,16 @@ static inline int pmd_move_must_withdraw(struct spinlock 
*new_pmd_ptl,
 */
return true;
 }
+
+
+#define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
+static inline bool arch_needs_pgtable_deposit(void)
+{
+   if (radix_enabled())
+   return false;
+   return true;
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 324990273ad2..e00e3b7cf6a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,6 +653,9 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
 /*
  * This function is meant to be used by sites walking pagetables with
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11fa0d9..0a279d399722 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -447,13 +447,9 @@ choice
  benefit.
 endchoice
 
-#
-# We don't deposit page tables on file THP mapping,
-# but Power makes use of them to address MMU quirk.
-#
 config TRANSPARENT_HUGE_PAGECACHE
def_bool y
-   depends on TRANSPARENT_HUGEPAGE && !PPC
+   depends on TRANSPARENT_HUGEPAGE
 
 #
 # UP and nommu archs use km based percpu allocator
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1ac1b0ca63c4..00d2d8b8a00d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,6 +1377,15 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, 
struct vm_area_struct *vma,
return ret;
 }
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+   pgtable_t pgtable;
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+   pte_free(mm, pgtable);
+   atomic_long_dec(&mm->nr_ptes);
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 pmd_t *pmd, unsigned long addr)
 {
@@ -1416,6 +1425,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
atomic_long_dec(&tlb->mm->nr_ptes);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
+   if (arch_needs_pgtable_deposit())
+   zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
}
spin_unlock(ptl);
@@ -1594,6 +1605,12 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
 
if (!vma_is_anonymous(vma)) {
_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+   /*
+* We are going to unmap this huge page. So
+* just go ahead and zap it
+*/
+   if (arch_needs_pgtable_deposit())
+   zap_deposited_table(mm, pmd);
if (vma_is_dax(vma))
return;
page = pmd_page(_pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..9fb7b275cb63 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1240,6 +1240,7 @@ static void retract_page_tables(struct address_space 
*mapping, pgoff_t pgoff)
struct vm_area_struct *vma;
unsigned long addr;
pmd_t *pmd, _pmd;
+   bool deposited = false;
 
i_m

[PATCH] powerpc/mm: Add _PAGE_DEVMAP pte bit for ppc64.

2016-11-07 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
NOTE: We need other patches to enable pmem on ppc64. I am waiting
for dax pmd fault handling cleanup to hit upstream before I send
the changes. Meanwhile reserve the pte bit.

 arch/powerpc/include/asm/book3s/64/pgtable.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9eed9d66d24f..af9e1a150ae7 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -32,6 +32,8 @@
 #define _PAGE_SOFT_DIRTY   0x0
 #endif
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
+#define _PAGE_DEVMAP   _RPAGE_SW1
+#define __HAVE_ARCH_PTE_DEVMAP
 
 
 #define _PAGE_PTE  (1ul << 62) /* distinguishes PTEs from 
pointers */
@@ -485,6 +487,16 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+   return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP);
+}
+
+static inline int pte_devmap(pte_t pte)
+{
+   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
/* FIXME!! check whether this need to be a conditional */
@@ -1030,6 +1042,18 @@ static inline bool arch_needs_pgtable_deposit(void)
return true;
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+   /*
+* No special ptes at pmd level.
+*/
+   return __pmd(pmd_val(pmd) | _PAGE_DEVMAP);
+}
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+   return pte_devmap(pmd_pte(pmd));
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
-- 
2.10.2



[PATCH v3] ppc: cpufreq: disable preemption while checking CPU throttling state

2016-11-07 Thread Denis Kirjanov
[   67.700897] BUG: using smp_processor_id() in preemptible [] 
code: cat/7343
[   67.700988] caller is .powernv_cpufreq_throttle_check+0x2c/0x710
[   67.700998] CPU: 13 PID: 7343 Comm: cat Not tainted 4.8.0-rc5-dirty #1
[   67.701038] Call Trace:
[   67.701066] [c007d25b75b0] [c0971378] .dump_stack+0xe4/0x150 
(unreliable)
[   67.701153] [c007d25b7640] [c05162e4] 
.check_preemption_disabled+0x134/0x150
[   67.701238] [c007d25b76e0] [c07b63ac] 
.powernv_cpufreq_throttle_check+0x2c/0x710
[   67.701322] [c007d25b7790] [c07b6d18] 
.powernv_cpufreq_target_index+0x288/0x360
[   67.701407] [c007d25b7870] [c07acee4] 
.__cpufreq_driver_target+0x394/0x8c0
[   67.701491] [c007d25b7920] [c07b22ac] .cpufreq_set+0x7c/0xd0
[   67.701565] [c007d25b79b0] [c07adf50] 
.store_scaling_setspeed+0x80/0xc0
[   67.701650] [c007d25b7a40] [c07ae270] .store+0xa0/0x100
[   67.701723] [c007d25b7ae0] [c03566e8] 
.sysfs_kf_write+0x88/0xb0
[   67.701796] [c007d25b7b70] [c03553b8] 
.kernfs_fop_write+0x178/0x260
[   67.701881] [c007d25b7c10] [c02ac3cc] .__vfs_write+0x3c/0x1c0
[   67.701954] [c007d25b7cf0] [c02ad584] .vfs_write+0xc4/0x230
[   67.702027] [c007d25b7d90] [c02aeef8] .SyS_write+0x58/0x100
[   67.702101] [c007d25b7e30] [c000bfec] system_call+0x38/0xfc

Fixes: 09a972d ("cpufreq: powernv: Report cpu frequency throttling")
Signed-off-by: Denis Kirjanov 
---
v2: v2:  wrap powernv_cpufreq_throttle_check()
as suggested by Gautham R Shenoy

v3: added Fix tag

drivers/cpufreq/powernv-cpufreq.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index d3ffde8..a84724e 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -647,8 +647,14 @@ static int powernv_cpufreq_target_index(struct 
cpufreq_policy *policy,
if (unlikely(rebooting) && new_index != get_nominal_index())
return 0;
 
-   if (!throttled)
+   if (!throttled) {
+   /* we don't want to be preempted while
+* checking if the CPU frequency has been throttled
+*/
+   preempt_disable();
powernv_cpufreq_throttle_check(NULL);
+   preempt_enable();
+   }
 
cur_msec = jiffies_to_msecs(get_jiffies_64());
 
-- 
1.8.3.1



Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-07 Thread Hillf Danton
On Monday, November 07, 2016 4:35 PM Aneesh Kumar K.V
> 
> Architectures like ppc64 want to use page table deposit/withraw
> even with huge pmd dax entries. Allow arch to override the
> vma_is_anonymous check by moving that to pmd_move_must_withdraw
> function
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  include/asm-generic/pgtable.h | 12 
>  mm/huge_memory.c  | 17 +++--
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index c4f8fd2fd384..324990273ad2 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  }
>  #endif
> 
> -#ifndef pmd_move_must_withdraw
> -static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
> -  spinlock_t *old_pmd_ptl)
> -{
> - /*
> -  * With split pmd lock we also need to move preallocated
> -  * PTE page table if new_pmd is on different PMD page table.
> -  */
> - return new_pmd_ptl != old_pmd_ptl;
> -}
> -#endif
> -
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cdcd25cb30fe..1ac1b0ca63c4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1424,6 +1424,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   return 1;
>  }
> 
> +#ifndef pmd_move_must_withdraw
> +static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
> +  spinlock_t *old_pmd_ptl)
> +{
> + /*
> +  * With split pmd lock we also need to move preallocated
> +  * PTE page table if new_pmd is on different PMD page table.
> +  *
> +  * We also don't deposit and withdraw tables for file pages.
> +  */
> + return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);

Stray git merge?

> +}
> +#endif
> +
>  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> unsigned long new_addr, unsigned long old_end,
> pmd_t *old_pmd, pmd_t *new_pmd)
> @@ -1458,8 +1472,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned 
> long old_addr,
>   pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
>   VM_BUG_ON(!pmd_none(*new_pmd));
> 
> - if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
> - vma_is_anonymous(vma)) {
> + if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
>   pgtable_t pgtable;
>   pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>   pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> --
> 2.10.2
> 



Re: [PATCH v2 1/2] cpufreq: powernv: Adding fast_switch for schedutil

2016-11-07 Thread Viresh Kumar
On 07-11-16, 13:09, Akshay Adiga wrote:
> Adding fast_switch which does light weight operation to set the desired
> pstate. Both global and local pstates are set to the same desired pstate.
> 
> Signed-off-by: Akshay Adiga 
> ---
> Changes from v1 :
> - Removed unnecessary check for index out of bound.

Provided Gautham/Shilpa find these patches fine..

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH v3] console: use first console if stdout-path device doesn't appear

2016-11-07 Thread Paul Burton
On Monday, 7 November 2016 19:27:32 GMT Michael Ellerman wrote:
> Paul Burton  writes:
> > If a device tree specified a preferred device for kernel console output
> > via the stdout-path or linux,stdout-path chosen node properties there's
> > no guarantee that it will have specified a device for which we have a
> > driver. It may also be the case that we do have a driver but it doesn't
> > call of_console_check() to register as a preferred console (eg. offb
> > driver as used on powermac systems).
> > 
> > In these cases try to ensure that we provide some console output by
> > enabling the first usable registered console, which we keep track of
> > with the of_fallback_console variable. Affected systems will enable
> > their console later than they did prior to commit 05fd007e4629
> > ("console: don't prefer first registered if DT specifies stdout-path")
> > but should otherwise produce the same output.
> > 
> > Tested in QEMU with a PowerPC pseries_defconfig kernel.
> 
> Hi Paul,
> 
> This does "work", as in it boots and I get a console. But the delay in
> getting output on the VGA is not workable. I get pretty much no output
> until the machine is booted entirely to userspace, meaning any crash
> prior to that will be undebuggable.
> 
> I also note Andreas reports it doesn't work at all on PowerMac.
> 
> Please send a revert and we can try again next cycle.
> 
> cheers

Hi Michael,

A revert was already submitted by Hans de Goede & is being discussed over 
here:

https://marc.info/?l=linux-kernel&m=147826151427455&w=2

Thanks,
Paul

signature.asc
Description: This is a digitally signed message part.


Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-07 Thread kbuild test robot
Hi Aneesh,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/mm-move-vma_is_anonymous-check-within-pmd_move_must_withdraw/20161107-164033
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x006-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   mm/huge_memory.c: In function 'pmd_move_must_withdraw':
>> mm/huge_memory.c:1441:58: error: 'vma' undeclared (first use in this 
>> function)
 return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
 ^~~
   mm/huge_memory.c:1441:58: note: each undeclared identifier is reported only 
once for each function it appears in
>> mm/huge_memory.c:1442:1: warning: control reaches end of non-void function 
>> [-Wreturn-type]
}
^

vim +/vma +1441 mm/huge_memory.c

  1435  /*
  1436   * With split pmd lock we also need to move preallocated
  1437   * PTE page table if new_pmd is on different PMD page table.
  1438   *
  1439   * We also don't deposit and withdraw tables for file pages.
  1440   */
> 1441  return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
> 1442  }
  1443  #endif
  1444  
  1445  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v2 0/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-07 Thread Geert Uytterhoeven
On Mon, Oct 31, 2016 at 12:30 PM, Geert Uytterhoeven
 wrote:
> Some Renesas SoCs may exist in different revisions, providing slightly
> different functionalities (e.g. R-Car H3 ES1.x and ES2.0), and behavior
> (errate and quirks).  This needs to be catered for by drivers and/or
> platform code.  The recently proposed soc_device_match() API seems like
> a good fit to handle this.
>
> This patch series implements the core infrastructure to provide SoC and
> revision information through the SoC bus for Renesas ARM SoCs. It
> consists of 7 patches:
>   - Patches 1-4 provide soc_device_match(), with some related fixes,
>   - Patches 5-7 implement identification of Renesas SoCs and
> registration with the SoC bus,
>
> Changes compared to v1:
>   - Add Acked-by,
>   - New patches:
>   - "[4/7] base: soc: Provide a dummy implementation of
>soc_device_match()",
>   - "[5/7] ARM: shmobile: Document DT bindings for CCCR and PRR",
>   - "[6/7] arm64: dts: r8a7795: Add device node for PRR"
> (more similar patches available, I'm not yet spamming you all
>  with them),
>   - Drop SoC families and family names; use fixed "Renesas" instead,
>   - Drop EMEV2, which doesn't have a chip ID register, and doesn't share
> devices with other SoCs,
>   - Drop RZ/A1H and R-CAR M1A, which don't have chip ID registers (for
> M1A: not accessible from the ARM core?),
>   - On arm, move "select SOC_BUS" from ARCH_RENESAS to Kconfig symbols
> for SoCs that provide a chip ID register,
>   - Build renesas-soc only if SOC_BUS is enabled,
>   - Use "renesas,prr" and "renesas,cccr" device nodes in DT if
> available, else fall back to hardcoded addresses for compatibility
> with existing DTBs,
>   - Remove verification of product IDs; just print the ID instead,
>   - Don't register the SoC bus if the chip ID register is missing,
>   - Change R-Mobile APE6 fallback to use PRR instead of CCCR (it has
> both).
>
> Merge strategy:
>   - In theory, patches 1-4 should go through Greg's driver core tree.
> But it's a hard dependency for all users.
> If people agree, I can provide an immutable branch in my
> renesas-drivers repository, to be merged by all interested parties.
> So far I'm aware of Freescale/NXP, and Renesas.

And Samsung.
Shall I create the immutable branch now?

Thanks!

>   - Patches 5-7 obviously have to go through Simon's Renesas tree (after
> merging the soc_device_match() core), and arm-soc.
>
> Tested on (machine, soc_id, optional revision):
> EMEV2 KZM9D Board, emev2
> Genmai, r7s72100
> APE6EVM, r8a73a4, ES1.0
> armadillo 800 eva, r8a7740, ES2.0
> bockw, r8a7778
> marzen, r8a7779, ES1.0
> Lager, r8a7790, ES1.0
> Koelsch, r8a7791, ES1.0
> Porter, r8a7791, ES3.0
> Blanche, r8a7792, ES1.1
> Gose, r8a7793, ES1.0
> Alt, r8a7794, ES1.0
> Renesas Salvator-X board based on r8a7795, r8a7795, ES1.0
> Renesas Salvator-X board based on r8a7795, r8a7795, ES1.1
> Renesas Salvator-X board based on r8a7796, r8a7796, ES1.0
> KZM-A9-GT, sh73a0, ES2.0
>
> For your convenience, this series (incl. more DT updates to add device
> nodes for CCCR and PRR to all other Renesas ARM SoCs) is also available
> in the topic/renesas-soc-id-v2 branch of my renesas-drivers git
> repository at
> git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git
> Its first user is support for R-Car H3 ES2.0 in branch
> topic/r8a7795-es2-v1-rebased2.
>
> Thanks for your comments!
>
> Arnd Bergmann (1):
>   base: soc: Introduce soc_device_match() interface
>
> Geert Uytterhoeven (6):
>   base: soc: Early register bus when needed
>   base: soc: Check for NULL SoC device attributes
>   base: soc: Provide a dummy implementation of soc_device_match()
>   ARM: shmobile: Document DT bindings for CCCR and PRR
>   arm64: dts: r8a7795: Add device node for PRR
>   soc: renesas: Identify SoC and register with the SoC bus
>
>  Documentation/devicetree/bindings/arm/shmobile.txt |  26 +
>  arch/arm/mach-shmobile/Kconfig |   3 +
>  arch/arm64/Kconfig.platforms   |   1 +
>  arch/arm64/boot/dts/renesas/r8a7795.dtsi   |   5 +
>  drivers/base/Kconfig   |   1 +
>  drivers/base/soc.c |  79 +
>  drivers/soc/renesas/Makefile   |   2 +
>  drivers/soc/renesas/renesas-soc.c  | 130 
> +
>  include/linux/sys_soc.h|   9 ++
>  9 files changed, 256 insertions(+)
>  create mode 100644 drivers/soc/renesas/renesas-soc.c

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
   

Re: [Patch V6 2/6] irqchip: xilinx: Clean up irqdomain argument and read/write

2016-11-07 Thread Zubair Lutfullah Kakakhel

Hi,

On 11/01/2016 11:05 AM, Zubair Lutfullah Kakakhel wrote:

Hi,

Thanks for the review.

On 10/31/2016 07:51 PM, Thomas Gleixner wrote:

On Mon, 31 Oct 2016, Zubair Lutfullah Kakakhel wrote:

The drivers read/write function handling is a bit quirky.


Can you please explain in more detail what's quirky and why it should be
done differently,


And the irqmask is passed directly to the handler.


I can't make any sense out of that sentence.  Which handler? If you talk
about the write function, then I don't see a change. So what are you
talking about?


Thanks. I'll add more detail in v7 if this patch survives.




Add a new irqchip struct to pass to the handler and
cleanup read/write handling.


I still don't see what it cleans up. You move the write function pointer
into a data structure, which is exposed by another pointer. So you create
two levels of indirection in the hotpath. The function prototype is still
the same. So all this does is making things slower unless I'm missing
something.


I wrote this patch/cleanup based on a review of driver by Marc when I moved the
driver from arch/microblaze to drivers/irqchip

"Marc Zyngier

...


 arch/microblaze/kernel/intc.c   | 196 
 drivers/irqchip/irq-axi-intc.c  | 196 


...


+/* Yeah, okay, casting the intr_mask to a void* is butt-ugly, but I'm
+ * lazy and Michal can clean it up to something nicer when he tests
+ * and commits this patch.  ~~gcl */
+root_domain = irq_domain_add_linear(intc, nr_irq, &xintc_irq_domain_ops,
+(void *)intr_mask);


Since you're now reworking this driver, how about addressing this
ugliness? You could store the intr_mask together with intc_baseaddr,
and the read/write functions in a global structure, and pass a
pointer to it? That would make the code a bit nicer...
"

https://patchwork.kernel.org/patch/9287933/




-static unsigned int (*read_fn)(void __iomem *);
-static void (*write_fn)(u32, void __iomem *);
+struct xintc_irq_chip {
+void __iomem *base;
+structirq_domain *domain;
+structirq_chip chip;


The tabs between struct and the structure name are bogus.


+u32intr_mask;
+unsigned int (*read)(void __iomem *iomem);
+void (*write)(u32 data, void __iomem *iomem);


Please structure that like a table:

void__iomem *base;
struct irq_domain*domain;
struct irq_chipchip;
u32intr_mask;
unsigned int(*read)(void __iomem *iomem);
void(*write)(u32 data, void __iomem *iomem);

Can you see how that makes parsing the struct simpler, because the data
types are clearly to identify?


That does make it look much better.




+static struct xintc_irq_chip *xintc_irqc;

 static void intc_write32(u32 val, void __iomem *addr)
 {
@@ -54,6 +60,18 @@ static unsigned int intc_read32_be(void __iomem *addr)
 return ioread32be(addr);
 }

+static inline unsigned int xintc_read(struct xintc_irq_chip *xintc_irqc,
+ int reg)
+{
+return xintc_irqc->read(xintc_irqc->base + reg);
+}
+
+static inline void xintc_write(struct xintc_irq_chip *xintc_irqc,
+ int reg, u32 data)
+{
+xintc_irqc->write(data, xintc_irqc->base + reg);
+}
+
 static void intc_enable_or_unmask(struct irq_data *d)
 {
 unsigned long mask = 1 << d->hwirq;
@@ -65,21 +83,21 @@ static void intc_enable_or_unmask(struct irq_data *d)
  * acks the irq before calling the interrupt handler
  */
 if (irqd_is_level_type(d))
-write_fn(mask, intc_baseaddr + IAR);
+xintc_write(xintc_irqc, IAR, mask);


So this whole thing makes only sense, when you want to support multiple
instances of that chip and then you need to store the xintc_irqc pointer as
irqchip data and retrieve it from there. Unless you do that, this "cleanup"
is just churn for nothing with the effect of making things less efficient.



Indeed the driver doesn't support multiple instances of the Xilinx Interrupt 
controller.
I don't have a use-case or the hardware for that.

So what would be the recommended course of action?


I'd really appreciate a consensus here.
I'm hoping that these patches can go in v4.10.

Thanks
ZubairLK



Regards,
ZubairLK


Thanks,

tglx



Re: [PATCH] powerpc/mm: Add _PAGE_DEVMAP pte bit for ppc64.

2016-11-07 Thread Denis Kirjanov
On Monday, November 7, 2016, Aneesh Kumar K.V <
aneesh.ku...@linux.vnet.ibm.com> wrote:

> Signed-off-by: Aneesh Kumar K.V  >


Hi Aneesh,

> ---
> NOTE: We need other patches to enable pmem on ppc64. I am waiting
> for dax pmd fault handling cleanup to hit upstream before I send
> the changes. Meanwhile reserve the pte bit.
>
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 24
> 
>  1 file changed, 24 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 9eed9d66d24f..af9e1a150ae7 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -32,6 +32,8 @@
>  #define _PAGE_SOFT_DIRTY   0x0
>  #endif
>  #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
> +#define _PAGE_DEVMAP   _RPAGE_SW1
> +#define __HAVE_ARCH_PTE_DEVMAP
>
>
>  #define _PAGE_PTE  (1ul << 62) /* distinguishes PTEs from
> pointers */
> @@ -485,6 +487,16 @@ static inline pte_t pte_mkhuge(pte_t pte)
> return pte;
>  }
>
> +static inline pte_t pte_mkdevmap(pte_t pte)
> +{
> +   return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP);
> +}
> +
> +static inline int pte_devmap(pte_t pte)
> +{
> +   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
> +}
>
> This could be bool, right?

 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  {
> /* FIXME!! check whether this need to be a conditional */
> @@ -1030,6 +1042,18 @@ static inline bool arch_needs_pgtable_deposit(void)
> return true;
>  }
>
> +static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> +{
> +   /*
> +* No special ptes at pmd level.
> +*/
> +   return __pmd(pmd_val(pmd) | _PAGE_DEVMAP);
> +}
> +
> +static inline int pmd_devmap(pmd_t pmd)
> +{
> +   return pte_devmap(pmd_pte(pmd));
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> --
> 2.10.2
>
>


RE: Coding Style: Reverse XMAS tree declarations ?

2016-11-07 Thread David Laight
From: Lino Sanfilippo
> Sent: 04 November 2016 20:07
...
> In this case it is IMHO rather the declaration + initialization that makes
> "bar" hard to find at one glance, not the use of RXT. You could do something 
> like
> 
>   [longish list of reverse xmas tree identifiers...]
>   struct foobarbaz *qux;
>   struct foo *bar;
> 
>   bar = longish_function(args, ...);
> 
> to increase readability.
> 
> Personally I find it more readable to always use a separate line for 
> initializations
> by means of functions (regardless of whether the RXT scheme is used or not).

I find it best to only use initialisers for 'variables' that are (mostly) 
constant.
If something need to be set to NULL in case a search fails, set it to NULL
just before the loop.
Don't put initialisation on the declaration 'because you can'.

Difficulty in spotting the type of a variable is why (IMHO) you should
but all declarations at the top of a function
(except, maybe, temporaries needed for a few lines).

David



Re: [PATCHv3 0/8] powerpc/mm: refactor vDSO mapping code

2016-11-07 Thread Dmitry Safonov
2016-10-27 20:09 GMT+03:00 Dmitry Safonov :
> Changes since v1, v2:
> - use vdso64_pages only under CONFIG_PPC64 (32-bit build fix)
> - remove arch_vma_name helper as not needed anymore,
>   simplify vdso_base pointer initializing in map_vdso()
>
> Cleanup patches for vDSO on powerpc.
> Originally, I wanted to add vDSO remapping on arm/aarch64 and
> I decided to cleanup that part on powerpc.
> I've add a hook for vm_ops for vDSO just like I did for x86,
> which makes cross-arch arch_mremap hook no more needed.
> Other changes - reduce exhaustive code duplication by
> separating the common vdso code.
>
> No visible to userspace changes expected.
> Tested on qemu with buildroot rootfs.
>
> Dmitry Safonov (8):
>   powerpc/vdso: unify return paths in setup_additional_pages
>   powerpc/vdso: remove unused params in vdso_do_func_patch{32,64}
>   powerpc/vdso: separate common code in vdso_common
>   powerpc/vdso: introduce init_vdso{32,64}_pagelist
>   powerpc/vdso: split map_vdso from arch_setup_additional_pages
>   powerpc/vdso: switch from legacy_special_mapping_vmops
>   mm: kill arch_mremap
>   powerpc/vdso: remove arch_vma_name
>
>  arch/alpha/include/asm/Kbuild|   1 -
>  arch/arc/include/asm/Kbuild  |   1 -
>  arch/arm/include/asm/Kbuild  |   1 -
>  arch/arm64/include/asm/Kbuild|   1 -
>  arch/avr32/include/asm/Kbuild|   1 -
>  arch/blackfin/include/asm/Kbuild |   1 -
>  arch/c6x/include/asm/Kbuild  |   1 -
>  arch/cris/include/asm/Kbuild |   1 -
>  arch/frv/include/asm/Kbuild  |   1 -
>  arch/h8300/include/asm/Kbuild|   1 -
>  arch/hexagon/include/asm/Kbuild  |   1 -
>  arch/ia64/include/asm/Kbuild |   1 -
>  arch/m32r/include/asm/Kbuild |   1 -
>  arch/m68k/include/asm/Kbuild |   1 -
>  arch/metag/include/asm/Kbuild|   1 -
>  arch/microblaze/include/asm/Kbuild   |   1 -
>  arch/mips/include/asm/Kbuild |   1 -
>  arch/mn10300/include/asm/Kbuild  |   1 -
>  arch/nios2/include/asm/Kbuild|   1 -
>  arch/openrisc/include/asm/Kbuild |   1 -
>  arch/parisc/include/asm/Kbuild   |   1 -
>  arch/powerpc/include/asm/mm-arch-hooks.h |  28 --
>  arch/powerpc/kernel/vdso.c   | 502 
> +--
>  arch/powerpc/kernel/vdso_common.c| 248 +++
>  arch/s390/include/asm/Kbuild |   1 -
>  arch/score/include/asm/Kbuild|   1 -
>  arch/sh/include/asm/Kbuild   |   1 -
>  arch/sparc/include/asm/Kbuild|   1 -
>  arch/tile/include/asm/Kbuild |   1 -
>  arch/um/include/asm/Kbuild   |   1 -
>  arch/unicore32/include/asm/Kbuild|   1 -
>  arch/x86/include/asm/Kbuild  |   1 -
>  arch/xtensa/include/asm/Kbuild   |   1 -
>  include/asm-generic/mm-arch-hooks.h  |  16 -
>  include/linux/mm-arch-hooks.h|  25 --
>  mm/mremap.c  |   4 -
>  36 files changed, 324 insertions(+), 529 deletions(-)
>  delete mode 100644 arch/powerpc/include/asm/mm-arch-hooks.h
>  create mode 100644 arch/powerpc/kernel/vdso_common.c
>  delete mode 100644 include/asm-generic/mm-arch-hooks.h
>  delete mode 100644 include/linux/mm-arch-hooks.h

ping?


Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-07 Thread Aneesh Kumar K.V
kbuild test robot  writes:

> Hi Aneesh,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/mm-move-vma_is_anonymous-check-within-pmd_move_must_withdraw/20161107-164033
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-randconfig-x006-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
>
> All error/warnings (new ones prefixed by >>):
>
>mm/huge_memory.c: In function 'pmd_move_must_withdraw':
>>> mm/huge_memory.c:1441:58: error: 'vma' undeclared (first use in this 
>>> function)
>  return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
>  ^~~
>mm/huge_memory.c:1441:58: note: each undeclared identifier is reported 
> only once for each function it appears in
>>> mm/huge_memory.c:1442:1: warning: control reaches end of non-void function 
>>> [-Wreturn-type]
> }
> ^
>
> vim +/vma +1441 mm/huge_memory.c
>
>   1435/*
>   1436 * With split pmd lock we also need to move preallocated
>   1437 * PTE page table if new_pmd is on different PMD page 
> table.
>   1438 *
>   1439 * We also don't deposit and withdraw tables for file 
> pages.
>   1440 */
>> 1441 return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
>> 1442 }
>   1443#endif
>   1444
>   1445bool move_huge_pmd(struct vm_area_struct *vma, unsigned long 
> old_addr,
>

My bad, I didn't test with hugepage enabled for x86. Will fixup in the
next update.

-aneesh



[PATCH V2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

2016-11-07 Thread Aneesh Kumar K.V
Architectures like ppc64 want to use page table deposit/withraw
even with huge pmd dax entries. Allow arch to override the
vma_is_anonymous check by moving that to pmd_move_must_withdraw
function

Signed-off-by: Aneesh Kumar K.V 
---
* Changes from V1:
  Fix build error with x86 config

 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 ++-
 include/asm-generic/pgtable.h| 12 
 mm/huge_memory.c | 18 --
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9fd77f8794a0..700301bc5190 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1009,7 +1009,8 @@ static inline void pmdp_huge_split_prepare(struct 
vm_area_struct *vma,
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
 static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
-struct spinlock *old_pmd_ptl)
+struct spinlock *old_pmd_ptl,
+struct vm_area_struct *vma)
 {
if (radix_enabled())
return false;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index c4f8fd2fd384..324990273ad2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
-#ifndef pmd_move_must_withdraw
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
-spinlock_t *old_pmd_ptl)
-{
-   /*
-* With split pmd lock we also need to move preallocated
-* PTE page table if new_pmd is on different PMD page table.
-*/
-   return new_pmd_ptl != old_pmd_ptl;
-}
-#endif
-
 /*
  * This function is meant to be used by sites walking pagetables with
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..54f265ec902e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1424,6 +1424,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
return 1;
 }
 
+#ifndef pmd_move_must_withdraw
+static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
+spinlock_t *old_pmd_ptl,
+struct vm_area_struct *vma)
+{
+   /*
+* With split pmd lock we also need to move preallocated
+* PTE page table if new_pmd is on different PMD page table.
+*
+* We also don't deposit and withdraw tables for file pages.
+*/
+   return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+}
+#endif
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, unsigned long old_end,
  pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1458,8 +1473,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned 
long old_addr,
pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));
 
-   if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
-   vma_is_anonymous(vma)) {
+   if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
-- 
2.10.2



Re: [PATCH v3] console: use first console if stdout-path device doesn't appear

2016-11-07 Thread Larry Finger

On 11/07/2016 03:18 AM, Paul Burton wrote:

On Monday, 7 November 2016 19:27:32 GMT Michael Ellerman wrote:

Paul Burton  writes:

If a device tree specified a preferred device for kernel console output
via the stdout-path or linux,stdout-path chosen node properties there's
no guarantee that it will have specified a device for which we have a
driver. It may also be the case that we do have a driver but it doesn't
call of_console_check() to register as a preferred console (eg. offb
driver as used on powermac systems).

In these cases try to ensure that we provide some console output by
enabling the first usable registered console, which we keep track of
with the of_fallback_console variable. Affected systems will enable
their console later than they did prior to commit 05fd007e4629
("console: don't prefer first registered if DT specifies stdout-path")
but should otherwise produce the same output.

Tested in QEMU with a PowerPC pseries_defconfig kernel.


Hi Paul,

This does "work", as in it boots and I get a console. But the delay in
getting output on the VGA is not workable. I get pretty much no output
until the machine is booted entirely to userspace, meaning any crash
prior to that will be undebuggable.

I also note Andreas reports it doesn't work at all on PowerMac.

Please send a revert and we can try again next cycle.

cheers


Hi Michael,

A revert was already submitted by Hans de Goede & is being discussed over
here:

https://marc.info/?l=linux-kernel&m=147826151427455&w=2


I am a little surprised that I was not CCd on that thread. To reiterate, my 
PowerBook G4 with a PPC32 processor CRASHES on boot. That is a lot more serious 
than the console output disappearing.


As it seems unlikely that this regression will be fixed in the current cycle, I 
recommend that the reversion of commit 05fd007e4629 until a proper fix is available.


Larry




RE: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread Madalin-Cristian Bucur
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, November 03, 2016 9:58 PM
> 
> From: Madalin Bucur 
> Date: Wed, 2 Nov 2016 22:17:26 +0200
> 
> > This introduces the Freescale Data Path Acceleration Architecture
> > +static inline size_t bpool_buffer_raw_size(u8 index, u8 cnt)
> > +{
> > +   u8 i;
> > +   size_t res = DPAA_BP_RAW_SIZE / 2;
> 
> Always order local variable declarations from longest to shortest line,
> also know as Reverse Christmas Tree Format.
> 
> Please audit your entire submission for this problem, it occurs
> everywhere.

Thank you, I'll resolve this.

> > +   /* we do not want shared skbs on TX */
> > +   net_dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> 
> Why?  By clearing this, you disallow an important fundamental way to do
> performane testing, via pktgen.

The Tx path in DPAA requires one to insert a back-pointer to the skb into
the Tx buffer. On the Tx confirmation path the back-pointer in the buffer
is used to release the skb. If Tx buffer is shared we'd alter the back-pointer
and leak/double free skbs. See also 

static int dpaa_start_xmit(struct sk_buff *skb, struct net_device 
*net_dev)
{
...
  if (!nonlinear) {
/* We're going to store the skb backpointer at the 
beginning
 * of the data buffer, so we need a privately owned skb
   *
 * We've made sure skb is not shared in dev->priv_flags,
 * we need to verify the skb head is not cloned
   */
if (skb_cow_head(skb, priv->tx_headroom))
goto enomem;

  WARN_ON(skb_is_nonlinear(skb));
}
...

> > +   int numstats = sizeof(struct rtnl_link_stats64) / sizeof(u64);
>  ...
> > +   cpustats = (u64 *)&percpu_priv->stats;
> > +
> > +   for (j = 0; j < numstats; j++)
> > +   netstats[j] += cpustats[j];
> 
> This is a memcpy() on well-typed datastructures which requires no
> casting or special handling whatsoever, so use memcpy instead of
> needlessly open coding the operation.

Will fix.

> > +static int dpaa_change_mtu(struct net_device *net_dev, int new_mtu)
> > +{
> > +   const int max_mtu = dpaa_get_max_mtu();
> > +
> > +   /* Make sure we don't exceed the Ethernet controller's MAXFRM */
> > +   if (new_mtu < 68 || new_mtu > max_mtu) {
> > +   netdev_err(net_dev, "Invalid L3 mtu %d (must be between %d and
> %d).\n",
> > +  new_mtu, 68, max_mtu);
> > +   return -EINVAL;
> > +   }
> > +   net_dev->mtu = new_mtu;
> > +
> > +   return 0;
> > +}
> 
> MTU restrictions are handled in the net-next tree via net_dev->min_mtu and
> net_dev->max_mtu.  Use that and do not define this NDO operation as you do
> not need it.

OK
 
> > +static int dpaa_set_features(struct net_device *dev, netdev_features_t
> features)
> > +{
> > +   /* Not much to do here for now */
> > +   dev->features = features;
> > +   return 0;
> > +}
> 
> Do not define unnecessary NDO operations, let the defaults do their job.
> 
> > +static netdev_features_t dpaa_fix_features(struct net_device *dev,
> > +  netdev_features_t features)
> > +{
> > +   netdev_features_t unsupported_features = 0;
> > +
> > +   /* In theory we should never be requested to enable features that
> > +* we didn't set in netdev->features and netdev->hw_features at
> probe
> > +* time, but double check just to be on the safe side.
> > +*/
> > +   unsupported_features |= NETIF_F_RXCSUM;
> > +
> > +   features &= ~unsupported_features;
> > +
> > +   return features;
> > +}
> 
> Unless you can show that your need this, do not "guess" by implement this
> NDO operation.  You don't need it.

Will remove it.
 
> > +#ifdef CONFIG_FSL_DPAA_ETH_FRIENDLY_IF_NAME
> > +static int dpaa_mac_hw_index_get(struct platform_device *pdev)
> > +{
> > +   struct device *dpaa_dev;
> > +   struct dpaa_eth_data *eth_data;
> > +
> > +   dpaa_dev = &pdev->dev;
> > +   eth_data = dpaa_dev->platform_data;
> > +
> > +   return eth_data->mac_hw_id;
> > +}
> > +
> > +static int dpaa_mac_fman_index_get(struct platform_device *pdev)
> > +{
> > +   struct device *dpaa_dev;
> > +   struct dpaa_eth_data *eth_data;
> > +
> > +   dpaa_dev = &pdev->dev;
> > +   eth_data = dpaa_dev->platform_data;
> > +
> > +   return eth_data->fman_hw_id;
> > +}
> > +#endif
> 
> Do not play network device naming games like this, use the standard name
> assignment done by the kernel and have userspace entities do geographic or
> device type specific naming.
> 
> I want to see this code completely removed.

I'll remove the option, udev rules like these can achieve the same effect:

SUBSYSTEM=="net", DRIVERS=="fsl_dpa*", ATTR{device_addr}=="ffe4e", 
NAME="fm1-mac1"
SUBSYSTEM=="net", DRIVERS=="fsl_dpa*", ATTR{device_addr}=="ffe4e2000", 
NAME="fm1-mac2"
 
> > +static int dpaa_set_mac_address(s

Re: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread David Miller
From: Madalin-Cristian Bucur 
Date: Mon, 7 Nov 2016 15:43:26 +

>> From: David Miller [mailto:da...@davemloft.net]
>> Sent: Thursday, November 03, 2016 9:58 PM
>> 
>> Why?  By clearing this, you disallow an important fundamental way to do
>> performane testing, via pktgen.
> 
> The Tx path in DPAA requires one to insert a back-pointer to the skb into
> the Tx buffer. On the Tx confirmation path the back-pointer in the buffer
> is used to release the skb. If Tx buffer is shared we'd alter the back-pointer
> and leak/double free skbs. See also 

Then have your software state store an array of SKB pointers, one for each
TX ring entry, just like every other driver does.


Re: [PATCH] vfio: Fix build break when SPAPR_TCE_IOMMU=n

2016-11-07 Thread Alex Williamson
On Mon, 07 Nov 2016 19:34:42 +1100
Michael Ellerman  wrote:

> Paolo Bonzini  writes:
> > On 04/11/2016 06:48, Michael Ellerman wrote:  
> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >> index da6e2ce77495..6b51a4ebed8a 100644
> >> --- a/drivers/vfio/Kconfig
> >> +++ b/drivers/vfio/Kconfig
> >> @@ -6,12 +6,12 @@ config VFIO_IOMMU_TYPE1
> >>  config VFIO_IOMMU_SPAPR_TCE
> >>tristate
> >>depends on VFIO && SPAPR_TCE_IOMMU
> >> -  default n
> >> +  default VFIO  
> >
> > No need to depend on VFIO since you already have it in "default".  

depends and defaults are different beasts though, if VFIO is not
enabled and we're not on a powerpc system with SPAPR,
VFIO_IOMMU_SPAPR_TCE should not be selectable, not just default to 'n'.

> 
> True, I can take that out.
> 
> > (I assume you cannot use "default y" because "depends on" doesn't downgrade
> > "y" to "m" when VFIO is a module.  
> 
> Correct.
> 
> > A shorthand is
> >
> > def_tristate VFIO && SPAPR_TCE_IOMMU  
> 
> Yep. My experience though is that a lot of folks don't really know what
> that means. So I prefer to spell it out with an explicit type, depends
> and default.
> 
> But I'll respin it that way if Alex prefers the shorter style.

Perhaps I'm one of those people.  Non-powerpc archs should not have an
option to select this, which is why the depends is there, AIUI.  So
long as we don't start exposing options that aren't relevant to a
platform, I'm flexible on what shorthands we use, but you may need to
teach me about them first.  Thanks,

Alex


Re: [PATCH] vfio: Fix build break when SPAPR_TCE_IOMMU=n

2016-11-07 Thread Paolo Bonzini


On 07/11/2016 17:25, Alex Williamson wrote:
> On Mon, 07 Nov 2016 19:34:42 +1100
> Michael Ellerman  wrote:
> 
>> Paolo Bonzini  writes:
>>> On 04/11/2016 06:48, Michael Ellerman wrote:  
 diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
 index da6e2ce77495..6b51a4ebed8a 100644
 --- a/drivers/vfio/Kconfig
 +++ b/drivers/vfio/Kconfig
 @@ -6,12 +6,12 @@ config VFIO_IOMMU_TYPE1
  config VFIO_IOMMU_SPAPR_TCE
tristate
depends on VFIO && SPAPR_TCE_IOMMU
 -  default n
 +  default VFIO  
>>>
>>> No need to depend on VFIO since you already have it in "default".  
> 
> depends and defaults are different beasts though, if VFIO is not
> enabled and we're not on a powerpc system with SPAPR,
> VFIO_IOMMU_SPAPR_TCE should not be selectable, not just default to 'n'.

AFAIU without a prompt nothing is selectable anyway (hence my preference
for a shorthand).

Paolo

>>
>> True, I can take that out.
>>
>>> (I assume you cannot use "default y" because "depends on" doesn't downgrade
>>> "y" to "m" when VFIO is a module.  
>>
>> Correct.
>>
>>> A shorthand is
>>>
>>> def_tristate VFIO && SPAPR_TCE_IOMMU  
>>
>> Yep. My experience though is that a lot of folks don't really know what
>> that means. So I prefer to spell it out with an explicit type, depends
>> and default.
>>
>> But I'll respin it that way if Alex prefers the shorter style.
> 
> Perhaps I'm one of those people.  Non-powerpc archs should not have an
> option to select this, which is why the depends is there, AIUI.  So
> long as we don't start exposing options that aren't relevant to a
> platform, I'm flexible on what shorthands we use, but you may need to
> teach me about them first.  Thanks,
> 
> Alex
> 


RE: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread Madalin-Cristian Bucur
> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Monday, November 07, 2016 5:55 PM
> 
> From: Madalin-Cristian Bucur 
> Date: Mon, 7 Nov 2016 15:43:26 +
> 
> >> From: David Miller [mailto:da...@davemloft.net]
> >> Sent: Thursday, November 03, 2016 9:58 PM
> >>
> >> Why?  By clearing this, you disallow an important fundamental way to do
> >> performane testing, via pktgen.
> >
> > The Tx path in DPAA requires one to insert a back-pointer to the skb
> into
> > the Tx buffer. On the Tx confirmation path the back-pointer in the
> buffer
> > is used to release the skb. If Tx buffer is shared we'd alter the back-
> pointer
> > and leak/double free skbs. See also
> 
> Then have your software state store an array of SKB pointers, one for each
> TX ring entry, just like every other driver does.

There is no Tx ring in DPAA. Frames are send out on QMan HW queues towards
the FMan for Tx and then received back on Tx confirmation queues for cleanup.
Array traversal would for sure cost more than using the back-pointer. Also,
we can now process confirmations on a different core than the one doing Tx,
we'd have to keep the arrays percpu and force the Tx conf on the same core.
Or add locks.

Madalin


Re: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread Joakim Tjernlund
On Wed, 2016-11-02 at 22:17 +0200, Madalin Bucur wrote:
> This introduces the Freescale Data Path Acceleration Architecture
> (DPAA) Ethernet driver (dpaa_eth) that builds upon the DPAA QMan,
> BMan, PAMU and FMan drivers to deliver Ethernet connectivity on
> the Freescale DPAA QorIQ platforms.

Nice to see DPAA support soon entering the kernel(not a day too early:)
I would like to see BQL supported from day one though, if possible.

 Regards
          Joakim Tjernlund

Re: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread David Miller
From: Madalin-Cristian Bucur 
Date: Mon, 7 Nov 2016 16:32:16 +

>> -Original Message-
>> From: David Miller [mailto:da...@davemloft.net]
>> Sent: Monday, November 07, 2016 5:55 PM
>> 
>> From: Madalin-Cristian Bucur 
>> Date: Mon, 7 Nov 2016 15:43:26 +
>> 
>> >> From: David Miller [mailto:da...@davemloft.net]
>> >> Sent: Thursday, November 03, 2016 9:58 PM
>> >>
>> >> Why?  By clearing this, you disallow an important fundamental way to do
>> >> performane testing, via pktgen.
>> >
>> > The Tx path in DPAA requires one to insert a back-pointer to the skb
>> into
>> > the Tx buffer. On the Tx confirmation path the back-pointer in the
>> buffer
>> > is used to release the skb. If Tx buffer is shared we'd alter the back-
>> pointer
>> > and leak/double free skbs. See also
>> 
>> Then have your software state store an array of SKB pointers, one for each
>> TX ring entry, just like every other driver does.
> 
> There is no Tx ring in DPAA. Frames are send out on QMan HW queues towards
> the FMan for Tx and then received back on Tx confirmation queues for cleanup.
> Array traversal would for sure cost more than using the back-pointer. Also,
> we can now process confirmations on a different core than the one doing Tx,
> we'd have to keep the arrays percpu and force the Tx conf on the same core.
> Or add locks.

Report back an integer index, like every scsi driver out there which
completes tagged queued block I/O operations asynchronously.  You can
associate the array with a specific TX confirmation queue.



RE: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet

2016-11-07 Thread Madalin-Cristian Bucur
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Monday, November 07, 2016 6:40 PM
> 
> From: Madalin-Cristian Bucur 
> Date: Mon, 7 Nov 2016 16:32:16 +
> 
> >> From: David Miller [mailto:da...@davemloft.net]
> >> Sent: Monday, November 07, 2016 5:55 PM
> >>
> >> From: Madalin-Cristian Bucur 
> >> Date: Mon, 7 Nov 2016 15:43:26 +
> >>
> >> >> From: David Miller [mailto:da...@davemloft.net]
> >> >> Sent: Thursday, November 03, 2016 9:58 PM
> >> >>
> >> >> Why?  By clearing this, you disallow an important fundamental way to
> >> >> do performane testing, via pktgen.
> >> >
> >> > The Tx path in DPAA requires one to insert a back-pointer to the skb
> >> > into
> >> > the Tx buffer. On the Tx confirmation path the back-pointer in the
> >> > buffer
> >> > is used to release the skb. If Tx buffer is shared we'd alter the
> >> > back-pointer
> >> > and leak/double free skbs. See also
> >>
> >> Then have your software state store an array of SKB pointers, one for
> each
> >> TX ring entry, just like every other driver does.
> >
> > There is no Tx ring in DPAA. Frames are send out on QMan HW queues
> > towards the FMan for Tx and then received back on Tx confirmation queues
> > for cleanup.
> > Array traversal would for sure cost more than using the back-pointer.
> > Also, we can now process confirmations on a different core than the one
> > doing Tx,
> > we'd have to keep the arrays percpu and force the Tx conf on the same
> > core. Or add locks.
> 
> Report back an integer index, like every scsi driver out there which
> completes tagged queued block I/O operations asynchronously.  You can
> associate the array with a specific TX confirmation queue.

>From HW? It only gives you back the buffer start address (plus length, etc).
"buff_2_skb()" needs to be solved in SW, expensively using array (lists? As
the number of frames in flight can be large/variable) or cheaply with the back
pointer. The back-pointer approach has its tradeoffs: no shared skbs, imposed
non-zero needed_headroom.



[PATCH] cxl: Fix coredump generation when cxl_get_fd() is used

2016-11-07 Thread Frederic Barrat
If a process dumps core while owning a cxl file descriptor obtained
from an AFU driver (e.g. cxlflash) through the cxl_get_fd() API, the
following error occurs:

[  868.027591] Unable to handle kernel paging request for data at address ...
[  868.027778] Faulting instruction address: 0xc035edb0
cpu 0x8c: Vector: 300 (Data Access) at [c03c688275e0]
pc: c035edb0: elf_core_dump+0xd60/0x1300
lr: c035ed80: elf_core_dump+0xd30/0x1300
sp: c03c68827860
   msr: 90019033
   dar: c
dsisr: 4000
 current = 0xc03c6878
 paca= 0xc1b73200   softe: 0irq_happened: 0x01
pid   = 46725, comm = hxesurelock
enter ? for help
[c03c68827a60] c036948c do_coredump+0xcec/0x11e0
[c03c68827c20] c00ce9e0 get_signal+0x540/0x7b0
[c03c68827d10] c0017354 do_signal+0x54/0x2b0
[c03c68827e00] c001777c do_notify_resume+0xbc/0xd0
[c03c68827e30] c0009838 ret_from_except_lite+0x64/0x68
--- Exception: 300 (Data Access) at 3fff98ad2918

The root cause is that the address_space structure for the file
doesn't define a 'host' member.

When cxl allocates a file descriptor, it's using the anonymous inode
to back the file, but allocates a private address_space for each
context. The private address_space allows to track memory allocation
for each context. cxl doesn't define the 'host' member of the address
space, i.e. the inode. We don't want to define it as the anonymous
inode, since there's no longer a 1-to-1 relation between address_space
and inode.

To fix it, instead of using the anonymous inode, we introduce a simple
pseudo filesystem so that cxl can allocate its own inodes. So we now
have one inode for each file and address_space. The pseudo filesystem
is only mounted on the first allocation of a file descriptor by
cxl_get_fd().

Tested with cxlflash.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/api.c | 138 +
 drivers/misc/cxl/context.c |  17 --
 drivers/misc/cxl/cxl.h |   6 +-
 drivers/misc/cxl/file.c|   5 +-
 4 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 2e5233b..7640339 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -9,18 +9,119 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
+#include 
 
 #include "cxl.h"
 
+/*
+ * Since we want to track memory mappings to be able to force-unmap
+ * when the AFU is no longer reachable, we need an inode. For devices
+ * opened through the cxl user API, this is not a problem, but a
+ * userland process can also get a cxl fd through the cxl_get_fd()
+ * API, which is used by the cxlflash driver.
+ *
+ * Therefore we implement our own simple pseudo-filesystem and inode
+ * allocator. We don't use the anonymous inode, as we need the
+ * meta-data associated with it (address_space) and it is shared by
+ * other drivers/processes, so it could lead to cxl unmapping VMAs
+ * from random processes.
+ */
+
+#define CXL_PSEUDO_FS_MAGIC0x1697697f
+
+static int cxl_fs_cnt;
+static struct vfsmount *cxl_vfs_mount;
+
+static const struct dentry_operations cxl_fs_dops = {
+   .d_dname= simple_dname,
+};
+
+static struct dentry *cxl_fs_mount(struct file_system_type *fs_type, int flags,
+   const char *dev_name, void *data)
+{
+   return mount_pseudo(fs_type, "cxl:", NULL, &cxl_fs_dops,
+   CXL_PSEUDO_FS_MAGIC);
+}
+
+static struct file_system_type cxl_fs_type = {
+   .name   = "cxl",
+   .owner  = THIS_MODULE,
+   .mount  = cxl_fs_mount,
+   .kill_sb= kill_anon_super,
+};
+
+
+void cxl_release_mapping(struct cxl_context *ctx)
+{
+   if (ctx->kernelapi && ctx->mapping)
+   simple_release_fs(&cxl_vfs_mount, &cxl_fs_cnt);
+}
+
+static struct file *cxl_getfile(const char *name,
+   const struct file_operations *fops,
+   void *priv, int flags)
+{
+   struct qstr this;
+   struct path path;
+   struct file *file;
+   struct inode *inode = NULL;
+   int rc;
+
+   /* strongly inspired by anon_inode_getfile() */
+
+   if (fops->owner && !try_module_get(fops->owner))
+   return ERR_PTR(-ENOENT);
+
+   rc = simple_pin_fs(&cxl_fs_type, &cxl_vfs_mount, &cxl_fs_cnt);
+   if (rc < 0) {
+   pr_err("Cannot mount cxl pseudo filesystem: %d\n", rc);
+   file = ERR_PTR(rc);
+   goto err_module;
+   }
+
+   inode = alloc_anon_inode(cxl_vfs_mount->mnt_sb);
+   if (IS_ERR(inode)) {
+   file = ERR_CAST(inode);
+   goto err_fs;
+   }
+
+   file = ERR_PTR(-ENOMEM);
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   path.dentry = d_alloc_pseudo(

Re: [PATCH v3] console: use first console if stdout-path device doesn't appear

2016-11-07 Thread Paul Burton
Hi Larry,

On Monday, 7 November 2016 09:26:37 GMT Larry Finger wrote:
> > A revert was already submitted by Hans de Goede & is being discussed over
> > here:
> > 
> > https://marc.info/?l=linux-kernel&m=147826151427455&w=2
> 
> I am a little surprised that I was not CCd on that thread.

Hans started that thread without copying you just as you started your thread 
without copying Andreas, who reported issues first.

> To reiterate, my
> PowerBook G4 with a PPC32 processor CRASHES on boot. That is a lot more
> serious than the console output disappearing.

Crashes as in init dies due to lack of a console, right?

> As it seems unlikely that this regression will be fixed in the current
> cycle, I recommend that the reversion of commit 05fd007e4629 until a proper
> fix is available.

I agree that reverting is probably the best option for now, and have replied 
with that in the other thread too.

Thanks,
Paul

signature.asc
Description: This is a digitally signed message part.


Re: [PATCH v3] console: use first console if stdout-path device doesn't appear

2016-11-07 Thread Larry Finger

On 11/07/2016 11:21 AM, Paul Burton wrote:

Hi Larry,

On Monday, 7 November 2016 09:26:37 GMT Larry Finger wrote:

A revert was already submitted by Hans de Goede & is being discussed over
here:

https://marc.info/?l=linux-kernel&m=147826151427455&w=2


I am a little surprised that I was not CCd on that thread.


Hans started that thread without copying you just as you started your thread
without copying Andreas, who reported issues first.


My searching had failed to find his report.


To reiterate, my
PowerBook G4 with a PPC32 processor CRASHES on boot. That is a lot more
serious than the console output disappearing.


Crashes as in init dies due to lack of a console, right?


It gets a kernel panic because of an attempt to kill process 1 (init). It then 
waits 120 seconds and tries to reboot.



As it seems unlikely that this regression will be fixed in the current
cycle, I recommend that the reversion of commit 05fd007e4629 until a proper
fix is available.


I agree that reverting is probably the best option for now, and have replied
with that in the other thread too.


Good.

Larry



Re: [RFC v2 6/7] mm/powerpc: Use generic VDSO remap and unmap functions

2016-11-07 Thread Laurent Dufour
On 04/11/2016 05:59, Michael Ellerman wrote:
> Christopher Covington  writes:
> 
>> The PowerPC VDSO remap and unmap code was copied to a generic location,
>> only modifying the variable name expected in mm->context (vdso instead of
>> vdso_base) to match most other architectures. Having adopted this generic
>> naming, drop the code in arch/powerpc and use the generic version.
>>
>> Signed-off-by: Christopher Covington 
>> ---
>>  arch/powerpc/Kconfig |  1 +
>>  arch/powerpc/include/asm/Kbuild  |  1 +
>>  arch/powerpc/include/asm/mm-arch-hooks.h | 28 -
>>  arch/powerpc/include/asm/mmu_context.h   | 35 
>> +---
>>  4 files changed, 3 insertions(+), 62 deletions(-)
>>  delete mode 100644 arch/powerpc/include/asm/mm-arch-hooks.h
> 
> This looks OK.
> 
> Have you tested it on powerpc? I could but I don't know how to actually
> trigger these paths, I assume I need a CRIU setup?

FWIW, tested on ppc64le using a sample test process moving its VDSO and
then catching a signal on 4.9-rc4 and using CRIU on top of 4.8 with
sightly changes to due minor upstream changes.

Reviewed-by: Laurent Dufour 
Tested-by: Laurent Dufour 



Re: [PATCH v2 0/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-07 Thread Krzysztof Kozlowski
On Mon, Nov 07, 2016 at 10:35:31AM +0100, Geert Uytterhoeven wrote:
> On Mon, Oct 31, 2016 at 12:30 PM, Geert Uytterhoeven
>  wrote:
> > Some Renesas SoCs may exist in different revisions, providing slightly
> > different functionalities (e.g. R-Car H3 ES1.x and ES2.0), and behavior
> > (errate and quirks).  This needs to be catered for by drivers and/or
> > platform code.  The recently proposed soc_device_match() API seems like
> > a good fit to handle this.
> >
> > This patch series implements the core infrastructure to provide SoC and
> > revision information through the SoC bus for Renesas ARM SoCs. It
> > consists of 7 patches:
> >   - Patches 1-4 provide soc_device_match(), with some related fixes,
> >   - Patches 5-7 implement identification of Renesas SoCs and
> > registration with the SoC bus,
> >
> > Changes compared to v1:
> >   - Add Acked-by,
> >   - New patches:
> >   - "[4/7] base: soc: Provide a dummy implementation of
> >soc_device_match()",
> >   - "[5/7] ARM: shmobile: Document DT bindings for CCCR and PRR",
> >   - "[6/7] arm64: dts: r8a7795: Add device node for PRR"
> > (more similar patches available, I'm not yet spamming you all
> >  with them),
> >   - Drop SoC families and family names; use fixed "Renesas" instead,
> >   - Drop EMEV2, which doesn't have a chip ID register, and doesn't share
> > devices with other SoCs,
> >   - Drop RZ/A1H and R-CAR M1A, which don't have chip ID registers (for
> > M1A: not accessible from the ARM core?),
> >   - On arm, move "select SOC_BUS" from ARCH_RENESAS to Kconfig symbols
> > for SoCs that provide a chip ID register,
> >   - Build renesas-soc only if SOC_BUS is enabled,
> >   - Use "renesas,prr" and "renesas,cccr" device nodes in DT if
> > available, else fall back to hardcoded addresses for compatibility
> > with existing DTBs,
> >   - Remove verification of product IDs; just print the ID instead,
> >   - Don't register the SoC bus if the chip ID register is missing,
> >   - Change R-Mobile APE6 fallback to use PRR instead of CCCR (it has
> > both).
> >
> > Merge strategy:
> >   - In theory, patches 1-4 should go through Greg's driver core tree.
> > But it's a hard dependency for all users.
> > If people agree, I can provide an immutable branch in my
> > renesas-drivers repository, to be merged by all interested parties.
> > So far I'm aware of Freescale/NXP, and Renesas.
> 
> And Samsung.

Yes, I would need it as well.

> Shall I create the immutable branch now?

...or the applying person could provide one.

Best regards,
Krzysztof


Re: [v15, 3/7] powerpc/fsl: move mpc85xx.h to include/linux/fsl

2016-11-07 Thread Arnd Bergmann
On Monday, October 31, 2016 9:35:33 AM CET Y.B. Lu wrote:
> > 
> > I don't see any of the contents of this header referenced by the soc
> > driver any more. I think you can just drop this patch.
> > 
> 
> [Lu Yangbo-B47093] This header file was included by guts.c.
> The guts driver used macro SVR_MAJ/SVR_MIN for calculation.
> 
> This header file was for powerpc arch before. And this patch is to made it as
> common header file for both ARM and PPC.
> Sooner or later this is needed.

Let's discuss it once we actually need the header then, ok?

Arnd


[PATCH v6 4/4] of/fdt: mark hotpluggable memory

2016-11-07 Thread Reza Arbab
When movable nodes are enabled, any node containing only hotpluggable
memory is made movable at boot time.

On x86, hotpluggable memory is discovered by parsing the ACPI SRAT,
making corresponding calls to memblock_mark_hotplug().

If we introduce a dt property to describe memory as hotpluggable,
configs supporting early fdt may then also do this marking and use
movable nodes.

Signed-off-by: Reza Arbab 
---
 drivers/of/fdt.c | 6 ++
 mm/Kconfig   | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index c89d5d2..2cf1d66 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1015,6 +1015,7 @@ int __init early_init_dt_scan_memory(unsigned long node, 
const char *uname,
const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
const __be32 *reg, *endp;
int l;
+   bool hotpluggable;
 
/* We are scanning "memory" nodes only */
if (type == NULL) {
@@ -1034,6 +1035,7 @@ int __init early_init_dt_scan_memory(unsigned long node, 
const char *uname,
return 0;
 
endp = reg + (l / sizeof(__be32));
+   hotpluggable = of_get_flat_dt_prop(node, "linux,hotpluggable", NULL);
 
pr_debug("memory scan node %s, reg size %d,\n", uname, l);
 
@@ -1049,6 +1051,10 @@ int __init early_init_dt_scan_memory(unsigned long node, 
const char *uname,
(unsigned long long)size);
 
early_init_dt_add_memory_arch(base, size);
+
+   if (hotpluggable && memblock_mark_hotplug(base, size))
+   pr_warn("failed to mark hotplug range 0x%llx - 
0x%llx\n",
+   base, base + size);
}
 
return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 061b46b..33a9b06 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,7 +153,7 @@ config MOVABLE_NODE
bool "Enable to assign a node which has only movable memory"
depends on HAVE_MEMBLOCK
depends on NO_BOOTMEM
-   depends on X86_64 || MEMORY_HOTPLUG
+   depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
depends on NUMA
default n
help
-- 
1.8.3.1



[PATCH v6 3/4] mm: enable CONFIG_MOVABLE_NODE on non-x86 arches

2016-11-07 Thread Reza Arbab
To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
the following must be true:

1. This config has the capability to identify movable nodes at boot.
   Right now, only x86 can do this.

2. Our config supports memory hotplug, which means that a movable node
   can be created by hotplugging all of its memory into ZONE_MOVABLE.

Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
recognizes (1), but not (2).

Signed-off-by: Reza Arbab 
Reviewed-by: Aneesh Kumar K.V 
Acked-by: Balbir Singh 
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 86e3e0e..061b46b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,7 +153,7 @@ config MOVABLE_NODE
bool "Enable to assign a node which has only movable memory"
depends on HAVE_MEMBLOCK
depends on NO_BOOTMEM
-   depends on X86_64
+   depends on X86_64 || MEMORY_HOTPLUG
depends on NUMA
default n
help
-- 
1.8.3.1



[PATCH v6 0/4] enable movable nodes on non-x86 configs

2016-11-07 Thread Reza Arbab
This patchset allows more configs to make use of movable nodes. When
CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such
nodes into the system:

1. Discover movable nodes at boot. Currently this is only possible on
   x86, but we will enable configs supporting fdt to do the same.

2. Hotplug and online all of a node's memory using online_movable. This
   is already possible on any config supporting memory hotplug, not 
   just x86, but the Kconfig doesn't say so. We will fix that.

We'll also remove some cruft on power which would prevent (2).

/* changelog */

v6:
* Add a patch enabling the fdt to describe hotpluggable memory.

v5:
* 
http://lkml.kernel.org/r/1477339089-5455-1-git-send-email-ar...@linux.vnet.ibm.com

* Drop the patches which recognize the "status" property of dt memory
  nodes. Firmware can set the size of "linux,usable-memory" to zero instead.

v4:
* 
http://lkml.kernel.org/r/1475778995-1420-1-git-send-email-ar...@linux.vnet.ibm.com

* Rename of_fdt_is_available() to of_fdt_device_is_available().
  Rename of_flat_dt_is_available() to of_flat_dt_device_is_available().

* Instead of restoring top-down allocation, ensure it never goes
  bottom-up in the first place, by making movable_node arch-specific.

* Use MEMORY_HOTPLUG instead of PPC64 in the mm/Kconfig patch.

v3:
* 
http://lkml.kernel.org/r/1474828616-16608-1-git-send-email-ar...@linux.vnet.ibm.com

* Use Rob Herring's suggestions to improve the node availability check.

* More verbose commit log in the patch enabling CONFIG_MOVABLE_NODE.

* Add a patch to restore top-down allocation the way x86 does.

v2:
* 
http://lkml.kernel.org/r/1473883618-14998-1-git-send-email-ar...@linux.vnet.ibm.com

* Use the "status" property of standard dt memory nodes instead of
  introducing a new "ibm,hotplug-aperture" compatible id.

* Remove the patch which explicitly creates a memoryless node. This set
  no longer has any bearing on whether the pgdat is created at boot or
  at the time of memory addition.

v1:
* 
http://lkml.kernel.org/r/1470680843-28702-1-git-send-email-ar...@linux.vnet.ibm.com

Reza Arbab (4):
  powerpc/mm: allow memory hotplug into a memoryless node
  mm: remove x86-only restriction of movable_node
  mm: enable CONFIG_MOVABLE_NODE on non-x86 arches
  of/fdt: mark hotpluggable memory

 Documentation/kernel-parameters.txt |  2 +-
 arch/powerpc/mm/numa.c  | 13 +
 arch/x86/kernel/setup.c | 24 
 drivers/of/fdt.c|  6 ++
 mm/Kconfig  |  2 +-
 mm/memory_hotplug.c | 20 
 6 files changed, 33 insertions(+), 34 deletions(-)

-- 
1.8.3.1



[PATCH v6 2/4] mm: remove x86-only restriction of movable_node

2016-11-07 Thread Reza Arbab
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:

1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().

Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.

This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced. (1) will happen for them, but (2) will not.

This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable. Since other arches do
this marking either immediately or not at all, they do not require the
bottom-up toggle.

So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.

Signed-off-by: Reza Arbab 
---
 Documentation/kernel-parameters.txt |  2 +-
 arch/x86/kernel/setup.c | 24 
 mm/memory_hotplug.c | 20 
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 37babf9..adcccd5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2401,7 +2401,7 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
-   movable_node[KNL,X86] Boot-time switch to enable the effects
+   movable_node[KNL] Boot-time switch to enable the effects
of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
 
MTD_Partition=  [MTD]
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9c337b0..4cfba94 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -985,6 +985,30 @@ void __init setup_arch(char **cmdline_p)
 
parse_early_param();
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+   /*
+* Memory used by the kernel cannot be hot-removed because Linux
+* cannot migrate the kernel pages. When memory hotplug is
+* enabled, we should prevent memblock from allocating memory
+* for the kernel.
+*
+* ACPI SRAT records all hotpluggable memory ranges. But before
+* SRAT is parsed, we don't know about it.
+*
+* The kernel image is loaded into memory at very early time. We
+* cannot prevent this anyway. So on NUMA system, we set any
+* node the kernel resides in as un-hotpluggable.
+*
+* Since on modern servers, one node could have double-digit
+* gigabytes memory, we can assume the memory around the kernel
+* image is also un-hotpluggable. So before SRAT is parsed, just
+* allocate memory near the kernel image to try the best to keep
+* the kernel away from hotpluggable memory.
+*/
+   if (movable_node_is_enabled())
+   memblock_set_bottom_up(true);
+#endif
+
x86_report_nx();
 
/* after early param, so could get panic from serial */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index cad4b91..e43142c1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1727,26 +1727,6 @@ static bool can_offline_normal(struct zone *zone, 
unsigned long nr_pages)
 static int __init cmdline_parse_movable_node(char *p)
 {
 #ifdef CONFIG_MOVABLE_NODE
-   /*
-* Memory used by the kernel cannot be hot-removed because Linux
-* cannot migrate the kernel pages. When memory hotplug is
-* enabled, we should prevent memblock from allocating memory
-* for the kernel.
-*
-* ACPI SRAT records all hotpluggable memory ranges. But before
-* SRAT is parsed, we don't know about it.
-*
-* The kernel image is loaded into memory at very early time. We
-* cannot prevent this anyway. So on NUMA system, we set any
-* node the kernel resides in as un-hotpluggable.
-*
-* Since on modern servers, one node could have double-digit
-* gigabytes memory, we can assume the memory around the kernel
-* image is also un-hotpluggable. So before SRAT is parsed, just
-* allocate memory near the kernel image to try the best to keep
-* the kernel away from hotpluggable memory.
-*/
-   memblock_set_bottom_up(true);
movable_node_enabled = true;
 #else
pr_warn("movable_node option not supported\n");
-- 
1.8.3.1



[PATCH v6 1/4] powerpc/mm: allow memory hotplug into a memoryless node

2016-11-07 Thread Reza Arbab
Remove the check which prevents us from hotplugging into an empty node.

The original commit b226e4621245 ("[PATCH] powerpc: don't add memory to
empty node/zone"), states that this was intended to be a temporary measure.
It is a workaround for an oops which no longer occurs.

Signed-off-by: Reza Arbab 
Reviewed-by: Aneesh Kumar K.V 
Acked-by: Balbir Singh 
Cc: Nathan Fontenot 
Cc: Bharata B Rao 
---
 arch/powerpc/mm/numa.c | 13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a51c188..0cb6bd8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1085,7 +1085,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 int hot_add_scn_to_nid(unsigned long scn_addr)
 {
struct device_node *memory = NULL;
-   int nid, found = 0;
+   int nid;
 
if (!numa_enabled || (min_common_depth < 0))
return first_online_node;
@@ -1101,17 +1101,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
if (nid < 0 || !node_online(nid))
nid = first_online_node;
 
-   if (NODE_DATA(nid)->node_spanned_pages)
-   return nid;
-
-   for_each_online_node(nid) {
-   if (NODE_DATA(nid)->node_spanned_pages) {
-   found = 1;
-   break;
-   }
-   }
-
-   BUG_ON(!found);
return nid;
 }
 
-- 
1.8.3.1



Re: [PATCH] vfio: Fix build break when SPAPR_TCE_IOMMU=n

2016-11-07 Thread Michael Ellerman
Alex Williamson  writes:
> On Mon, 07 Nov 2016 19:34:42 +1100
> Michael Ellerman  wrote:
>> Paolo Bonzini  writes:
>> > On 04/11/2016 06:48, Michael Ellerman wrote:  
>> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>> >> index da6e2ce77495..6b51a4ebed8a 100644
>> >> --- a/drivers/vfio/Kconfig
>> >> +++ b/drivers/vfio/Kconfig
>> >> @@ -6,12 +6,12 @@ config VFIO_IOMMU_TYPE1
>> >>  config VFIO_IOMMU_SPAPR_TCE
>> >>   tristate
>> >>   depends on VFIO && SPAPR_TCE_IOMMU
>> >> - default n
>> >> + default VFIO  
>> >
>> > No need to depend on VFIO since you already have it in "default".  
>
> depends and defaults are different beasts though, if VFIO is not
> enabled and we're not on a powerpc system with SPAPR,
> VFIO_IOMMU_SPAPR_TCE should not be selectable, not just default to 'n'.

Right. But dropping VFIO from the depends won't change that. In fact it
has no effect because the entire directory is not built if VFIO=n.

See drivers/Makefile:

obj-$(CONFIG_VFIO)  += vfio/


So we don't need the depends on VFIO there.

>> > A shorthand is
>> >
>> >def_tristate VFIO && SPAPR_TCE_IOMMU  
>> 
>> Yep. My experience though is that a lot of folks don't really know what
>> that means. So I prefer to spell it out with an explicit type, depends
>> and default.
>> 
>> But I'll respin it that way if Alex prefers the shorter style.
>
> Perhaps I'm one of those people.  Non-powerpc archs should not have an
> option to select this, which is why the depends is there, AIUI.  So
> long as we don't start exposing options that aren't relevant to a
> platform, I'm flexible on what shorthands we use, but you may need to
> teach me about them first.  Thanks,

Using the def_tristate trick won't expose the option to users, because
it has no description it's not user selectable. But it does make the
symbol exist on an x86 build, and it appears in the .config.

eg, using def_tristate you get:

# CONFIG_VFIO_IOMMU_SPAPR_TCE is not set
# CONFIG_VFIO is not set

Whereas using depends all you get is:

# CONFIG_VFIO is not set


So using def_tristate in this case is not entirely equivalent.

cheers


Re: [RFC v2 6/7] mm/powerpc: Use generic VDSO remap and unmap functions

2016-11-07 Thread Michael Ellerman
Laurent Dufour  writes:

> On 04/11/2016 05:59, Michael Ellerman wrote:
>> Christopher Covington  writes:
>> 
>>> The PowerPC VDSO remap and unmap code was copied to a generic location,
>>> only modifying the variable name expected in mm->context (vdso instead of
>>> vdso_base) to match most other architectures. Having adopted this generic
>>> naming, drop the code in arch/powerpc and use the generic version.
>>>
>>> Signed-off-by: Christopher Covington 
>>> ---
>>>  arch/powerpc/Kconfig |  1 +
>>>  arch/powerpc/include/asm/Kbuild  |  1 +
>>>  arch/powerpc/include/asm/mm-arch-hooks.h | 28 -
>>>  arch/powerpc/include/asm/mmu_context.h   | 35 
>>> +---
>>>  4 files changed, 3 insertions(+), 62 deletions(-)
>>>  delete mode 100644 arch/powerpc/include/asm/mm-arch-hooks.h
>> 
>> This looks OK.
>> 
>> Have you tested it on powerpc? I could but I don't know how to actually
>> trigger these paths, I assume I need a CRIU setup?
>
> FWIW, tested on ppc64le using a sample test process moving its VDSO and
> then catching a signal on 4.9-rc4 and using CRIU on top of 4.8 with
> sightly changes to due minor upstream changes.
>
> Reviewed-by: Laurent Dufour 
> Tested-by: Laurent Dufour 

Thanks, in that case:

Acked-by: Michael Ellerman 

cheers


Re: [PATCHv3 0/8] powerpc/mm: refactor vDSO mapping code

2016-11-07 Thread Michael Ellerman
Dmitry Safonov <0x7f454...@gmail.com> writes:

> 2016-10-27 20:09 GMT+03:00 Dmitry Safonov :
>
> ping?

There's another series doing some similar changes:

http://www.spinics.net/lists/linux-mm/msg115860.html


And I don't like all the macro games in 3/8, eg:

+#ifndef BITS
+#define BITS 32
+#endif
+
+#undef Elf_Ehdr
+#undef Elf_Sym
+#undef Elf_Shdr
+
+#define _CONCAT3(a, b, c)  a ## b ## c
+#define CONCAT3(a, b, c)   _CONCAT3(a, b, c)
+#define Elf_Ehdr   CONCAT3(Elf,  BITS, _Ehdr)
+#define Elf_SymCONCAT3(Elf,  BITS, _Sym)
+#define Elf_Shdr   CONCAT3(Elf,  BITS, _Shdr)
+#define VDSO_LBASE CONCAT3(VDSO, BITS, _LBASE)
+#define vdso_kbase CONCAT3(vdso, BITS, _kbase)
+#define vdso_pages CONCAT3(vdso, BITS, _pages)
+
+#undef pr_fmt
+#define pr_fmt(fmt)"vDSO" __stringify(BITS) ": " fmt
+
+#define lib_elfinfo CONCAT3(lib, BITS, _elfinfo)
+
+#define find_section CONCAT3(find_section, BITS,)
+static void * __init find_section(Elf_Ehdr *ehdr, const char *secname,
+   unsigned long *size)


I'd rather we kept the duplication of code than the obfuscation those
macros add.

If we can come up with a way to share more of the code without having to
do all those tricks then I'd be interested.

cheers


Re: [PATCH] vfio: Fix build break when SPAPR_TCE_IOMMU=n

2016-11-07 Thread Alex Williamson
On Tue, 08 Nov 2016 10:49:35 +1100
Michael Ellerman  wrote:

> Alex Williamson  writes:
> > On Mon, 07 Nov 2016 19:34:42 +1100
> > Michael Ellerman  wrote:  
> >> Paolo Bonzini  writes:  
> >> > On 04/11/2016 06:48, Michael Ellerman wrote:
> >> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >> >> index da6e2ce77495..6b51a4ebed8a 100644
> >> >> --- a/drivers/vfio/Kconfig
> >> >> +++ b/drivers/vfio/Kconfig
> >> >> @@ -6,12 +6,12 @@ config VFIO_IOMMU_TYPE1
> >> >>  config VFIO_IOMMU_SPAPR_TCE
> >> >> tristate
> >> >> depends on VFIO && SPAPR_TCE_IOMMU
> >> >> -   default n
> >> >> +   default VFIO
> >> >
> >> > No need to depend on VFIO since you already have it in "default".
> >
> > depends and defaults are different beasts though, if VFIO is not
> > enabled and we're not on a powerpc system with SPAPR,
> > VFIO_IOMMU_SPAPR_TCE should not be selectable, not just default to 'n'.  
> 
> Right. But dropping VFIO from the depends won't change that. In fact it
> has no effect because the entire directory is not built if VFIO=n.
> 
> See drivers/Makefile:
> 
> obj-$(CONFIG_VFIO)+= vfio/
> 
> 
> So we don't need the depends on VFIO there.
> 
> >> > A shorthand is
> >> >
> >> >  def_tristate VFIO && SPAPR_TCE_IOMMU
> >> 
> >> Yep. My experience though is that a lot of folks don't really know what
> >> that means. So I prefer to spell it out with an explicit type, depends
> >> and default.
> >> 
> >> But I'll respin it that way if Alex prefers the shorter style.  
> >
> > Perhaps I'm one of those people.  Non-powerpc archs should not have an
> > option to select this, which is why the depends is there, AIUI.  So
> > long as we don't start exposing options that aren't relevant to a
> > platform, I'm flexible on what shorthands we use, but you may need to
> > teach me about them first.  Thanks,  
> 
> Using the def_tristate trick won't expose the option to users, because
> it has no description it's not user selectable. But it does make the
> symbol exist on an x86 build, and it appears in the .config.
> 
> eg, using def_tristate you get:
> 
> # CONFIG_VFIO_IOMMU_SPAPR_TCE is not set
> # CONFIG_VFIO is not set
> 
> Whereas using depends all you get is:
> 
> # CONFIG_VFIO is not set
> 
> 
> So using def_tristate in this case is not entirely equivalent.

I get a surprising number of users trying to manually hack on
their .config file, so I'd prefer it not appear in the .config unless
CONFIG_VFIO is y/m _and_ the user is actually building to a config
where SPAPR is relevant.  I'm sure it will save me at least a few
questions in the future.  Thanks,

Alex


Re: [PATCHv3 1/8] powerpc/vdso: unify return paths in setup_additional_pages

2016-11-07 Thread Michael Ellerman
Hi Dmitry,

Thanks for the patches.

Dmitry Safonov  writes:
> Impact: cleanup

I'm not a fan of these "Impact" lines, especially when they're not
correct, ie. this is not a cleanup, a cleanup doesn't change logic.

> Rename `rc' variable which doesn't seems to mean anything into
> kernel-known `ret'.

'rc' means "Return Code", it's fairly common. I see at least ~8500
"int rc" declarations in the kernel.

Please don't rename variables and change logic in one patch.

> Combine two function returns into one as it's
> also easier to read.
>
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Andy Lutomirski 
> Cc: Oleg Nesterov 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Signed-off-by: Dmitry Safonov 
> ---
>  arch/powerpc/kernel/vdso.c | 19 +++
>  1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
> index 4111d30badfa..4ffb82a2d9e9 100644
> --- a/arch/powerpc/kernel/vdso.c
> +++ b/arch/powerpc/kernel/vdso.c
> @@ -154,7 +154,7 @@ int arch_setup_additional_pages(struct linux_binprm 
> *bprm, int uses_interp)
>   struct page **vdso_pagelist;
>   unsigned long vdso_pages;
>   unsigned long vdso_base;
> - int rc;
> + int ret = 0;
  
Please don't initialise return codes in the declaration, it prevents the
compiler from warning you if you forget to initialise it in a
particular path.

AFAICS you never even use the default value either.

>   if (!vdso_ready)
>   return 0;
> @@ -203,8 +203,8 @@ int arch_setup_additional_pages(struct linux_binprm 
> *bprm, int uses_interp)
> ((VDSO_ALIGNMENT - 1) & PAGE_MASK),
> 0, 0);
>   if (IS_ERR_VALUE(vdso_base)) {
> - rc = vdso_base;
> - goto fail_mmapsem;
> + ret = vdso_base;
> + goto out_up_mmap_sem;
>   }
>  
>   /* Add required alignment. */
> @@ -227,21 +227,16 @@ int arch_setup_additional_pages(struct linux_binprm 
> *bprm, int uses_interp)
>* It's fine to use that for setting breakpoints in the vDSO code
>* pages though.
>*/
> - rc = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
> + ret = install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT,
>VM_READ|VM_EXEC|
>VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
>vdso_pagelist);
> - if (rc) {
> + if (ret)
>   current->mm->context.vdso_base = 0;
> - goto fail_mmapsem;
> - }
> -
> - up_write(&mm->mmap_sem);
> - return 0;
>  
> - fail_mmapsem:
> +out_up_mmap_sem:
>   up_write(&mm->mmap_sem);
> - return rc;
> + return ret;
>  }


If you strip out the variable renames then I think that change would be
OK.

cheers


Re: [PATCH v2] ppc: cpufreq: disable preemption while checking CPU throttling state

2016-11-07 Thread Michael Ellerman
Denis Kirjanov  writes:
> On 11/7/16, Michael Ellerman  wrote:
>> Denis Kirjanov  writes:
>>> [   67.700897] BUG: using smp_processor_id() in preemptible [] 
>>> code: cat/7343
>>> [   67.700988] caller is .powernv_cpufreq_throttle_check+0x2c/0x710
>>
>> When did this break?
>
> I think that problem exists since commit
> 09a972d1620934142d30cfda455ffe429af751c4 ("cpufreq: powernv: Report
> cpu frequency throttling") but I can be wrong

Yep that looks right to me, thanks.

Gautham, how bad is the bug here, do we want to send this to stable?

cheers


Re: [PATCH v6 4/4] of/fdt: mark hotpluggable memory

2016-11-07 Thread kbuild test robot
Hi Reza,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.9-rc4 next-20161028]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Reza-Arbab/enable-movable-nodes-on-non-x86-configs/20161108-081711
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile 

All errors (new ones prefixed by >>):

   drivers/of/fdt.c: In function 'early_init_dt_scan_memory':
>> drivers/of/fdt.c:1064:3: error: implicit declaration of function 
>> 'memblock_mark_hotplug'
   cc1: some warnings being treated as errors

vim +/memblock_mark_hotplug +1064 drivers/of/fdt.c

  1058  continue;
  1059  pr_debug(" - %llx ,  %llx\n", (unsigned long long)base,
  1060  (unsigned long long)size);
  1061  
  1062  early_init_dt_add_memory_arch(base, size);
  1063  
> 1064  if (hotpluggable && memblock_mark_hotplug(base, size))
  1065  pr_warn("failed to mark hotplug range 0x%llx - 
0x%llx\n",
  1066  base, base + size);
  1067  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v2] ppc: cpufreq: disable preemption while checking CPU throttling state

2016-11-07 Thread Gautham R Shenoy
Hi Michael,

On Tue, Nov 08, 2016 at 11:21:23AM +1100, Michael Ellerman wrote:
> Denis Kirjanov  writes:
> > On 11/7/16, Michael Ellerman  wrote:
> >> Denis Kirjanov  writes:
> >>> [   67.700897] BUG: using smp_processor_id() in preemptible 
> >>> [] code: cat/7343
> >>> [   67.700988] caller is .powernv_cpufreq_throttle_check+0x2c/0x710
> >>
> >> When did this break?
> >
> > I think that problem exists since commit
> > 09a972d1620934142d30cfda455ffe429af751c4 ("cpufreq: powernv: Report
> > cpu frequency throttling") but I can be wrong
> 
> Yep that looks right to me, thanks.
> 
> Gautham, how bad is the bug here, do we want to send this to stable?

Not so bad that it needs to be sent to stable immediately.

On CONFIG_PREEMPT enabled systems, if task executing
powernv_cpufreq_throttle_check can be switched to a CPU on a different
chip, then we will end up incorrectly attributing the throttle
statistics of the new chip to the chip where it previously ran. These
throttle statistics aren't used inside the kernel for any computation,
but are reported to the user via /sysfs for them to account for any
variance during a benchmark run.

It would be good to include this in the 4.9 fixes though. 

> 
> cheers
> 



Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container

2016-11-07 Thread David Gibson
On Mon, Oct 24, 2016 at 05:53:09PM +1100, Alexey Kardashevskiy wrote:
> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better have tce_container take a reference to mm_struct and
> use it later when the process is gone (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from making changes to a memory
> context of other processes; in order to add this check,
> VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> quite special anyway - it is the only ioctl() called when neither
> container nor container->mm is initialized.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
> Changes:
> v4:
> * added check for container->mm!=current->mm in tce_iommu_ioctl()
> for all ioctls and removed other redundand checks
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 131 
> ++--
>  1 file changed, 66 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d0c38b2..81ab93f 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -31,49 +31,46 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>   struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(long npages)
> +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  {
>   long ret = 0, locked, lock_limit;
>  
> - if (!current || !current->mm)
> - return -ESRCH; /* process exited */
> -
>   if (!npages)
>   return 0;
>  
> - down_write(¤t->mm->mmap_sem);
> - locked = current->mm->locked_vm + npages;
> + down_write(&mm->mmap_sem);
> + locked = mm->locked_vm + npages;
>   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>   ret = -ENOMEM;
>   else
> - current->mm->locked_vm += npages;
> + mm->locked_vm += npages;
>  
>   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>   npages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> + mm->locked_vm << PAGE_SHIFT,
>   rlimit(RLIMIT_MEMLOCK),
>   ret ? " - exceeded" : "");
>  
> - up_write(¤t->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
>  
>   return ret;
>  }
>  
> -static void decrement_locked_vm(long npages)
> +static void decrement_locked_vm(struct mm_struct *mm, long npages)
>  {
> - if (!current || !current->mm || !npages)
> - return; /* process exited */
> + if (!npages)
> + return;
>  
> - down_write(¤t->mm->mmap_sem);
> - if (WARN_ON_ONCE(npages > current->mm->locked_vm))
> - npages = current->mm->locked_vm;
> - current->mm->locked_vm -= npages;
> + down_write(&mm->mmap_sem);
> + if (WARN_ON_ONCE(npages > mm->locked_vm))
> + npages = mm->locked_vm;
> + mm->locked_vm -= npages;
>   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
>   npages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> + mm->locked_vm << PAGE_SHIFT,
>   rlimit(RLIMIT_MEMLOCK));
> - up_write(¤t->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
>  }
>  
>  /*
> @@ -98,6 +95,7 @@ struct tce_container {
>   bool enabled;
>   bool v2;
>   unsigned long locked_pages;
> + struct mm_struct *mm;
>   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
>  };
> @@ -107,17 +105,14 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>  {
>   struct mm_iommu_table_group_mem_t *mem;
>  
> - if (!current || !current->mm)
> - return -ESRCH; /* process exited */
> -
>   if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>   return -EINVAL;
>  
> - mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
> + mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>   if (!mem)
>   return -ENOENT;
>  
> - return mm_iommu_put(current->mm, mem);
> + return mm_iommu_put(container->mm, mem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -127,14 +122,11 @@ static long tce_iommu_register_pages(struct 
> tce_container *container,
>   struct mm_iommu_table_group_mem_t *mem = NULL;
>   un

Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-11-07 Thread David Gibson
On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> At the moment the userspace tool is expected to request pinning of
> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
> When the userspace process finishes, all the pinned pages need to
> be put; this is done as a part of the userspace memory context (MM)
> destruction which happens on the very last mmdrop().
> 
> This approach has a problem that a MM of the userspace process
> may live longer than the userspace process itself as kernel threads
> use userspace process MMs which was runnning on a CPU where
> the kernel thread was scheduled to. If this happened, the MM remains
> referenced until this exact kernel thread wakes up again
> and releases the very last reference to the MM, on an idle system this
> can take even hours.
> 
> This moves preregistered regions tracking from MM to VFIO; insteads of
> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
> added so each container releases regions which it has pre-registered.
> 
> This changes the userspace interface to return EBUSY if a memory
> region is already registered in a container. However it should not
> have any practical effect as the only userspace tool available now
> does register memory region once per container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Nicholas Piggin 

Reviewed-by: David Gibson 

Sorry for the delay.  I got hung up on the fact that the regions
reserved by one container can cause strange behaviour on unrelated
containers (one container can't reserve a region overlapping with a
region reserved by another container).

That limitation is awful, and I wish I'd spotted it when we first
implemented this.  But, it's not new with this patch series, so it's
not a reason not to go ahead with this series which does fix a real
bug.



> ---
> Changes:
> v4:
> * changed tce_iommu_register_pages() to call mm_iommu_find() first and
> avoid calling mm_iommu_put() if memory is preregistered already
> 
> v3:
> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
> 
> v2:
> * updated commit log
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>  arch/powerpc/mm/mmu_context_iommu.c| 11 ---
>  drivers/vfio/vfio_iommu_spapr_tce.c| 58 
> +-
>  3 files changed, 57 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
> b/arch/powerpc/mm/mmu_context_book3s64.c
> index ad82735..1a07969 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct 
> mm_struct *mm)
>  
>  void destroy_context(struct mm_struct *mm)
>  {
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_cleanup(mm);
> -#endif
> -
>  #ifdef CONFIG_PPC_ICSWX
>   drop_cop(mm->context.acop, mm);
>   kfree(mm->context.cop_lockp);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index 4c6db09..104bad0 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
>  {
>   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
> -
> -void mm_iommu_cleanup(struct mm_struct *mm)
> -{
> - struct mm_iommu_table_group_mem_t *mem, *tmp;
> -
> - list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> - next) {
> - list_del_rcu(&mem->next);
> - mm_iommu_do_free(mem);
> - }
> -}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 81ab93f..001a488 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -86,6 +86,15 @@ struct tce_iommu_group {
>  };
>  
>  /*
> + * A container needs to remember which preregistered region  it has
> + * referenced to do proper cleanup at the userspace process exit.
> + */
> +struct tce_iommu_prereg {
> + struct list_head next;
> + struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +/*
>   * The container descriptor supports only a single group per container.
>   * Required by the API as the container is not supplied with the IOMMU group
>   * at the moment of initialization.
> @@ -98,12 +107,27 @@ struct tce_container {
>   struct mm_struct *mm;
>   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
> + struct list_head prereg_list;
>  };
>  
> +static long tce_iommu_prereg_free(struct tce_container *container,
> + struct tce_iommu_prereg *tcemem)
> +{
> + long ret;
> +
> + list_del(&tcemem->next);
> + ret = mm_iommu_put(container->mm, tcemem->mem);
> + kfree(tcemem);
> +
> + return ret;
> +}
> 

Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-11-07 Thread David Gibson
On Thu, Nov 03, 2016 at 12:02:28PM +1100, Paul Mackerras wrote:
> On Wed, Nov 02, 2016 at 01:44:03PM +1100, Alexey Kardashevskiy wrote:
> > On 31/10/16 15:23, David Gibson wrote:
> [...]
> > > 
> > > Um.. yeah.. that's not really ok.  Prohibiting overlapping
> > > registrations on the same container is reasonable enough.  Having a
> > > container not be able to register memory because some completely
> > > different container has registered something overlapping is getting
> > > very ugly.
> > 
> > I am lost here. Does this mean the patches cannot go upstream?
> > 
> > Also how would I implement overlapping if we are not teaching KVM about
> > VFIO containers? The mm list has a counter of how many times each memory
> > region was mapped via TCE (and this prevents unregistration), and if we
> > want overlapping regions - a "mapped" counter of which one would I update
> > in real mode (where I only have a user address and a LIOBN)?
> 
> The patches fix a real bug, where we run out of memory to run VMs.
> 
> The patches don't change the interface, and don't introduce the
> constraint that is being discussed here (that the regions being
> registered may not overlap unless they are identical to a previously
> registered region).  That constraint is already present in the
> upstream code.

Ah, good point.  I hadn't thought that through and realized that
limitation was already through.

> They do change the behaviour when you use a container fd from a
> different process from the one which opened the fd, but that is not
> something that worked in any meaningful way before anyway.
> 
> So David, do you still see any reason why the patches should not be
> accepted?

Given the (still hideously ugly) limitation is not new, then no, not
any more.  I'll send an R-b.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v2 2/2] cpufreq: powernv: Use PMCR to verify global and local pstate

2016-11-07 Thread Gautham R Shenoy

On Mon, Nov 07, 2016 at 01:09:09PM +0530, Akshay Adiga wrote:
> As fast_switch() may get called with interrupt disable mode, we cannot
> hold a mutex to update the global_pstate_info. So currently, fast_switch()
> does not update the global_pstate_info and it will end up with stale data
> whenever pstate is updated through fast_switch().
> 
> As the gpstate_timer can fire after fast_switch() has updated the pstates,
> the timer handler cannot rely on the cached values of local and global
> pstate and needs to read it from the PMCR.
> 
> Only gpstate_timer_handler() is affected by the stale cached pstate data
> beacause either fast_switch() or target_index() routines will be called
> for a given govenor, but gpstate_timer can fire after the governor has
> changed to schedutil.
> 
> 
> Signed-off-by: Akshay Adiga 
> ---
> 
> Changes from v1 :
> - Corrected Commit message
> - Type cast pstate values read from PMCR to type s8
> - Added Macros to get local and global pstates from PMCR

Thanks for this. Could you also send a (separate patch) to set the
local and global pstates to PMCR in set_pstate?

> 
> 
>  drivers/cpufreq/powernv-cpufreq.c | 34 --
>  1 file changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 4a4380d..bf4bc585 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -42,6 +42,8 @@
>  #define PMSR_PSAFE_ENABLE(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE  (1UL << 31)
>  #define PMSR_MAX(x)  ((x >> 32) & 0xFF)
> +#define PMCR_LPSTATE(x)  (((x) >> 48) & 0xFF)
> +#define PMCR_GPSTATE(x)  (((x) >> 56) & 0xFF)

You define:
#define LPSTATE_SHIFT48
#define GPSTATE_SHIFT56

since we can use this in the set_variants.

Moreover, the LPSTATE, GPSTATE retreival is applicable to both PMCR and PMSR. So
could you rename these functions to GET_LPSTATE, GET_GPSTATE.

Similarly, we might want to have a SET_LPSTATE, SET_GPSTATE and fix
the hard coded values that we have in set_pstate.


> 
>  #define MAX_RAMP_DOWN_TIME   5120
>  /*
> @@ -592,7 +594,8 @@ void gpstate_timer_handler(unsigned long data)
>  {
>   struct cpufreq_policy *policy = (struct cpufreq_policy *)data;
>   struct global_pstate_info *gpstates = policy->driver_data;
> - int gpstate_idx;
> + int gpstate_idx, lpstate_idx;
> + unsigned long val;
>   unsigned int time_diff = jiffies_to_msecs(jiffies)
>   - gpstates->last_sampled_time;
>   struct powernv_smp_call_data freq_data;
> @@ -600,21 +603,36 @@ void gpstate_timer_handler(unsigned long data)
>   if (!spin_trylock(&gpstates->gpstate_lock))
>   return;
> 
> + /*
> +  * If PMCR was last updated was using fast_swtich then
> +  * We may have wrong in gpstate->last_lpstate_idx
> +  * value. Hence, read from PMCR to get correct data.
> +  */
> + val = get_pmspr(SPRN_PMCR);
> + freq_data.gpstate_id = (s8)PMCR_GPSTATE(val);
> + freq_data.pstate_id = (s8)PMCR_LPSTATE(val);
> + if (freq_data.gpstate_id  == freq_data.pstate_id) {
> + reset_gpstates(policy);
> + spin_unlock(&gpstates->gpstate_lock);
> + return;
> + }
> +
>   gpstates->last_sampled_time += time_diff;
>   gpstates->elapsed_time += time_diff;
> - freq_data.pstate_id = idx_to_pstate(gpstates->last_lpstate_idx);
> 
> - if ((gpstates->last_gpstate_idx == gpstates->last_lpstate_idx) ||
> - (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
> + if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
>   gpstate_idx = pstate_to_idx(freq_data.pstate_id);
>   reset_gpstates(policy);
>   gpstates->highest_lpstate_idx = gpstate_idx;
>   } else {
> + lpstate_idx = pstate_to_idx(freq_data.pstate_id);
>   gpstate_idx = calc_global_pstate(gpstates->elapsed_time,
>gpstates->highest_lpstate_idx,
> -  gpstates->last_lpstate_idx);
> +  lpstate_idx);
>   }
> -
> + freq_data.gpstate_id = idx_to_pstate(gpstate_idx);
> + gpstates->last_gpstate_idx = gpstate_idx;
> + gpstates->last_lpstate_idx = lpstate_idx;
>   /*
>* If local pstate is equal to global pstate, rampdown is over
>* So timer is not required to be queued.
> @@ -622,10 +640,6 @@ void gpstate_timer_handler(unsigned long data)
>   if (gpstate_idx != gpstates->last_lpstate_idx)
>   queue_gpstate_timer(gpstates);
> 
> - freq_data.gpstate_id = idx_to_pstate(gpstate_idx);
> - gpstates->last_gpstate_idx = pstate_to_idx(freq_data.gpstate_id);
> - gpstates->last_lpstate_idx = pstate_to_idx(freq_data.pstate_id);
> -
>   spin_unlock(&gpstates->gpstate_lock)

Re: 4.9-rc G5 build broken with a 2.23 binutils

2016-11-07 Thread Michael Ellerman
Hugh Dickins  writes:

> Hi Nick,
>
> I've not been able to build 4.9-rc kernels for the PowerMac G5, getting
>
> arch/powerpc/kernel/exceptions-64s.S:
> Assembler messages:
> arch/powerpc/kernel/exceptions-64s.S:770: Error: operand out of range
> (0x8100 is not between 0x and 0x)
> and 20 more similar before
> arch/powerpc/kernel/exceptions-64s.S:1024: Error: operand out of range
> (0x9180 is not between 0x and 0x)
> make[1]: *** [arch/powerpc/kernel/head_64.o] Error 1
>
> That's with the binutils-2.23.2-8.1.4.ppc64 from openSUSE 13.1.
>
> But yesterday your comment on @l in
> a24553dd02dc ("powerpc/pseries: Remove unnecessary syscall trampoline")
> helped me to the patch below (removing @l from LOAD_HANDLER, but leaving
> @l in __LOAD_HANDLER), which gets the build working again (and produces
> a good kernel).  For me.  But to judge by your comment, this would break
> the build for someone else?  I don't mind keeping this as a local patch
> if I'm the only one affected, so haven't signed off on it; but thought
> I'd better inform you.

Thanks for the report Hugh.

We added the @l recently to keep binutils 2.22 happy, in commit a24553dd02dc:

https://git.kernel.org/torvalds/c/a24553dd02dc


We needed it because we were using LOAD_HANDLER() for systemcall_common,
which is in a separate .S. But since then we've split that out into a
separate macro, meaning we no longer need the @l in LOAD_HANDLER().

Why binutils 2.23 can't cope with the @l I'm not sure, and I don't
really care. Dropping the @l is more correct, ie. 

So I'll merge this as a fix, do you want to send me a SOB?

cheers


[PATCH] powerpc/pseries: add definitions for new H_SIGNAL_SYS_RESET hcall

2016-11-07 Thread Nicholas Piggin
This has not made its way to a PAPR release yet, but we have an hcall
number assigned.

  H_SIGNAL_SYS_RESET = 0x380

  Syntax:
  hcall(uint64 H_SIGNAL_SYS_RESET, /* generate a system reset NMI on   */
   /* the threads indicated by target  */
int64 target); /* thread target selection */
/* -1 = target all online threads including the caller */
/* -2 = target all online threads except for the caller*/
/* All other negative values: reserved */
/* Positive values: The thread to be targeted, */
/* obtained from the value of the "ibm,ppc-interrupt-server#s" */
/* property of the CPU in the OF device tree   */

  Semantics:
  - If the NMI handlers are not registered, terminate the partition.
  - Invalid target: return H_Parameter.
  - Otherwise: Generate a system reset NMI on target thread(s),
return H_Success.

This will be used by crash/debug code to get stuck CPUs into a known state.

Signed-off-by: Nicholas Piggin 
---

I'll get the NMI stuff resubmitted soon, but just wanted to send
out this hcall addition now, for reference.

Thanks,
Nick

 arch/powerpc/include/asm/hvcall.h | 8 +++-
 arch/powerpc/include/asm/plpar_wrappers.h | 5 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 708edeb..d52ef28 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -275,7 +275,8 @@
 #define H_COP  0x304
 #define H_GET_MPP_X0x314
 #define H_SET_MODE 0x31C
-#define MAX_HCALL_OPCODE   H_SET_MODE
+#define H_SIGNAL_SYS_RESET 0x380
+#define MAX_HCALL_OPCODE   H_SIGNAL_SYS_RESET
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
@@ -306,6 +307,11 @@
 #define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3
 #define H_SET_MODE_RESOURCE_LE 4
 
+/* Values for argument to H_SIGNAL_SYS_RESET */
+#define H_SIGNAL_SYS_RESET_ALL -1
+#define H_SIGNAL_SYS_RESET_ALLBUTSELF  -2
+/* >= 0 values are CPU number */
+
 #ifndef __ASSEMBLY__
 
 /**
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 1b39424..9642ae8 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -340,4 +340,9 @@ static inline long plapr_set_watchpoint0(unsigned long 
dawr0, unsigned long dawr
return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0);
 }
 
+static inline long plapr_signal_sys_reset(long cpu)
+{
+   return plpar_hcall_norets(H_SIGNAL_SYS_RESET, cpu);
+}
+
 #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
-- 
2.10.2



RE: [v15, 3/7] powerpc/fsl: move mpc85xx.h to include/linux/fsl

2016-11-07 Thread Y.B. Lu
Hi Arnd,


> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Tuesday, November 08, 2016 5:20 AM
> To: Y.B. Lu
> Cc: linuxppc-dev@lists.ozlabs.org; linux-...@vger.kernel.org;
> ulf.hans...@linaro.org; Scott Wood; Mark Rutland; Greg Kroah-Hartman; X.B.
> Xie; M.H. Lian; linux-...@vger.kernel.org; linux-...@vger.kernel.org;
> Qiang Zhao; Russell King; Bhupesh Sharma; Joerg Roedel; Claudiu Manoil;
> devicet...@vger.kernel.org; Rob Herring; Santosh Shilimkar; linux-arm-
> ker...@lists.infradead.org; net...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Leo Li; io...@lists.linux-foundation.org; Kumar
> Gala
> Subject: Re: [v15, 3/7] powerpc/fsl: move mpc85xx.h to include/linux/fsl
> 
> On Monday, October 31, 2016 9:35:33 AM CET Y.B. Lu wrote:
> > >
> > > I don't see any of the contents of this header referenced by the soc
> > > driver any more. I think you can just drop this patch.
> > >
> >
> > [Lu Yangbo-B47093] This header file was included by guts.c.
> > The guts driver used macro SVR_MAJ/SVR_MIN for calculation.
> >
> > This header file was for powerpc arch before. And this patch is to
> > made it as common header file for both ARM and PPC.
> > Sooner or later this is needed.
> 
> Let's discuss it once we actually need the header then, ok?

[Lu Yangbo-B47093] As I said, this header file was included by guts.c in patch 
4.
The guts driver used macro SVR_MAJ/SVR_MIN for calculation which were defined 
in this header file.
Did you suggest we dropped this patch and just calculated them in driver?

Thanks :)


> 
>   Arnd


Re: [PATCH 1/2] powerpc/64: Fix checksum folding in csum_tcpudp_nofold and ip_fast_csum_nofold

2016-11-07 Thread Michael Ellerman
Paul Mackerras  writes:

> These functions compute an IP checksum by computing a 64-bit sum and
> folding it to 32 bits (the "nofold" in their names refers to folding
> down to 16 bits).  However, doing (u32) (s + (s >> 32)) is not
> sufficient to fold a 64-bit sum to 32 bits correctly.  The addition
> can produce a carry out from bit 31, which needs to be added in to
> the sum to produce the correct result.
>
> To fix this, we copy the from64to32() function from lib/checksum.c
> and use that.

This seems to have been broken since ~forever. Do we just not hit that
case very often, or do we just incorrectly report checksum failures?

Should it go to stable?

cheers


Re: compiling master.

2016-11-07 Thread Michael Ellerman
Can you guys let me know if Hugh's fix works for you?

https://patchwork.ozlabs.org/patch/691753/

cheers


Re: [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown

2016-11-07 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> These patches are to fix a bug when pages stay pinned hours
> after QEMU which requested pinning exited.
>
> Please comment. Thanks.
>
> Alexey Kardashevskiy (4):
>   powerpc/iommu: Pass mm_struct to init/cleanup helpers
>   powerpc/iommu: Stop using @current in mm_iommu_xxx
>   vfio/spapr: Reference mm in tce_container
>   powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
>
>  arch/powerpc/include/asm/mmu_context.h |  20 ++--
>  arch/powerpc/kernel/setup-common.c |   2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
>  arch/powerpc/mm/mmu_context_iommu.c|  60 ---
>  drivers/vfio/vfio_iommu_spapr_tce.c| 181 
> ++---
>  5 files changed, 154 insertions(+), 115 deletions(-)


Alex, given the diffstat how do you want to merge this? (for 4.10)

Should I merge it, you merge it, or I can put it in a topic branch?

cheers