Re: [RFC PATCH 14/23] watchdog/hardlockup: Decouple the hardlockup detector from perf

2018-06-14 Thread Ricardo Neri
On Thu, Jun 14, 2018 at 11:41:44AM +1000, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:19:01 -0700
> Ricardo Neri  wrote:
> 
> > On Wed, Jun 13, 2018 at 10:43:24AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 12, 2018 at 05:57:34PM -0700, Ricardo Neri wrote:  
> > > > The current default implementation of the hardlockup detector assumes 
> > > > that
> > > > it is implemented using perf events.  
> > > 
> > > The sparc and powerpc things are very much not using perf.  
> > 
> > Isn't it true that the current hardlockup detector
> > (under kernel/watchdog_hld.c) is based on perf?
> 
> arch/powerpc/kernel/watchdog.c is a powerpc implementation that uses
> the kernel/watchdog_hld.c framework.
> 
> > As far as I understand,
> > this hardlockup detector is constructed using perf events for architectures
> > that don't provide an NMI watchdog. Perhaps I can be more specific and say
> > that this synthetized detector is based on perf.
> 
> The perf detector is like that, but we want NMI watchdogs to share
> the watchdog_hld code as much as possible even for arch specific NMI
> watchdogs, so that kernel and user interfaces and behaviour are
> consistent.
> 
> Other arch watchdogs like sparc are a little older so they are not
> using HLD. You don't have to change those for your series, but it
> would be good to bring them into the fold if possible at some time.
> IIRC sparc was slightly non-trivial because it has some differences
> in sysctl or cmdline APIs that we don't want to break.
> 
> But powerpc at least needs to be updated if you change hld apis.

I will look into updating at least the powerpc implementation as part
of these changes.

Thanks and BR,
Ricardo


Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations

2018-06-14 Thread Ricardo Neri
On Thu, Jun 14, 2018 at 12:32:50PM +1000, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:31:17 -0700
> Ricardo Neri  wrote:
> 
> > On Wed, Jun 13, 2018 at 09:52:25PM +1000, Nicholas Piggin wrote:
> > > On Wed, 13 Jun 2018 11:26:49 +0200 (CEST)
> > > Thomas Gleixner  wrote:
> > >   
> > > > On Wed, 13 Jun 2018, Peter Zijlstra wrote:  
> > > > > On Wed, Jun 13, 2018 at 05:41:41PM +1000, Nicholas Piggin wrote:
> > > > > > On Tue, 12 Jun 2018 17:57:32 -0700
> > > > > > Ricardo Neri  wrote:
> > > > > > 
> > > > > > > Instead of exposing individual functions for the operations of 
> > > > > > > the NMI
> > > > > > > watchdog, define a common interface that can be used across 
> > > > > > > multiple
> > > > > > > implementations.
> > > > > > > 
> > > > > > > The struct nmi_watchdog_ops is defined for such operations. These 
> > > > > > > initial
> > > > > > > definitions include the enable, disable, start, stop, and cleanup
> > > > > > > operations.
> > > > > > > 
> > > > > > > Only a single NMI watchdog can be used in the system. The 
> > > > > > > operations of
> > > > > > > this NMI watchdog are accessed via the new variable nmi_wd_ops. 
> > > > > > > This
> > > > > > > variable is set to point the operations of the first NMI watchdog 
> > > > > > > that
> > > > > > > initializes successfully. Even though at this moment, the only 
> > > > > > > available
> > > > > > > NMI watchdog is the perf-based hardlockup detector. More 
> > > > > > > implementations
> > > > > > > can be added in the future.
> > > > > > 
> > > > > > Cool, this looks pretty nice at a quick glance. sparc and powerpc at
> > > > > > least have their own NMI watchdogs, it would be good to have those
> > > > > > converted as well.
> > > > > 
> > > > > Yeah, agreed, this looks like half a patch.
> > > > 
> > > > Though I'm not seeing the advantage of it. That kind of NMI watchdogs 
> > > > are
> > > > low level architecture details so having yet another 'ops' data 
> > > > structure
> > > > with a gazillion of callbacks, checks and indirections does not provide
> > > > value over the currently available weak stubs.  
> > > 
> > > The other way to go of course is librify the perf watchdog and make an
> > > x86 watchdog that selects between perf and hpet... I also probably
> > > prefer that for code such as this, but I wouldn't strongly object to
> > > ops struct if I'm not writing the code. It's not that bad is it?  
> > 
> > My motivation to add the ops was that the hpet and perf watchdog share
> > significant portions of code.
> 
> Right, a good motivation.
> 
> > I could look into creating the library for
> > common code and relocate the hpet watchdog into arch/x86 for the hpet-
> > specific parts.
> 
> If you can investigate that approach, that would be appreciated. I hope
> I did not misunderstand you there, Thomas.
> 
> Basically you would have perf infrastructure and hpet infrastructure,
> and then the x86 watchdog driver will use one or the other of those. The
> generic watchdog driver will be just a simple shim that uses the perf
> infrastructure. Then hopefully the powerpc driver would require almost
> no change.

Sure, I will try to structure code to minimize the changes to the powerpc
watchdog... without breaking the sparc one.

Thanks and BR,
Ricardo


Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs

2018-06-14 Thread Ricardo Neri
On Wed, Jun 13, 2018 at 11:48:09AM +0200, Thomas Gleixner wrote:
> On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > +   /* There are no CPUs to monitor. */
> > +   if (!cpumask_weight(>monitored_mask))
> > +   return NMI_HANDLED;
> > +
> > inspect_for_hardlockups(regs);
> >  
> > +   /*
> > +* Target a new CPU. Keep trying until we find a monitored CPU. CPUs
> > +* are addded and removed to this mask at cpu_up() and cpu_down(),
> > +* respectively. Thus, the interrupt should be able to be moved to
> > +* the next monitored CPU.
> > +*/
> > +   spin_lock(_data->lock);
> 
> Yuck. Taking a spinlock from NMI ...

I am sorry. I will look into other options for locking. Do you think rcu_lock
would help in this case? I need this locking because the CPUs being monitored
changes as CPUs come online and offline.

> 
> > +   for_each_cpu_wrap(cpu, >monitored_mask, smp_processor_id() + 1) {
> > +   if (!irq_set_affinity(hld_data->irq, cpumask_of(cpu)))
> > +   break;
> 
> ... and then calling into generic interrupt code which will take even more
> locks is completely broken.


I will into reworking how the destination of the interrupt is set.

Thanks and BR,

Ricardo


Re: [RFC PATCH 03/23] genirq: Introduce IRQF_DELIVER_AS_NMI

2018-06-14 Thread Ricardo Neri
On Wed, Jun 13, 2018 at 11:06:25AM +0100, Marc Zyngier wrote:
> On 13/06/18 10:20, Thomas Gleixner wrote:
> > On Wed, 13 Jun 2018, Julien Thierry wrote:
> >> On 13/06/18 09:34, Peter Zijlstra wrote:
> >>> On Tue, Jun 12, 2018 at 05:57:23PM -0700, Ricardo Neri wrote:
>  diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>  index 5426627..dbc5e02 100644
>  --- a/include/linux/interrupt.h
>  +++ b/include/linux/interrupt.h
>  @@ -61,6 +61,8 @@
> *interrupt handler after suspending interrupts. For
>  system
> *wakeup devices users need to implement wakeup
>  detection in
> *their interrupt handlers.
>  + * IRQF_DELIVER_AS_NMI - Configure interrupt to be delivered as
>  non-maskable, if
>  + *supported by the chip.
> */
> >>>
> >>> NAK on the first 6 patches. You really _REALLY_ don't want to expose
> >>> NMIs to this level.
> >>>
> >>
> >> I've been working on something similar on arm64 side, and effectively the 
> >> one
> >> thing that might be common to arm64 and intel is the interface to set an
> >> interrupt as NMI. So I guess it would be nice to agree on the right 
> >> approach
> >> for this.
> >>
> >> The way I did it was by introducing a new irq_state and let the irqchip 
> >> driver
> >> handle most of the work (if it supports that state):
> >>
> >> https://lkml.org/lkml/2018/5/25/181
> >>
> >> This has not been ACKed nor NAKed. So I am just asking whether this is a 
> >> more
> >> suitable approach, and if not, is there any suggestions on how to do this?
> > 
> > I really didn't pay attention to that as it's burried in the GIC/ARM series
> > which is usually Marc's playground.
> 
> I'm working my way through it ATM now that I have some brain cycles back.
> 
> > Adding NMI delivery support at low level architecture irq chip level is
> > perfectly fine, but the exposure of that needs to be restricted very
> > much. Adding it to the generic interrupt control interfaces is not going to
> > happen. That's doomed to begin with and a complete abuse of the interface
> > as the handler can not ever be used for that.
> 
> I can only agree with that. Allowing random driver to use request_irq()
> to make anything an NMI ultimately turns it into a complete mess ("hey,
> NMI is *faster*, let's use that"), and a potential source of horrible
> deadlocks.
> 
> What I'd find more palatable is a way for an irqchip to be able to
> prioritize some interrupts based on a set of architecturally-defined
> requirements, and a separate NMI requesting/handling framework that is
> separate from the IRQ API, as the overall requirements are likely to
> completely different.
> 
> It shouldn't have to be nearly as complex as the IRQ API, and require
> much stricter requirements in terms of what you can do there (flow
> handling should definitely be different).

Marc, Julien, do you plan to actively work on this? Would you mind keeping
me in the loop? I also need this work for this watchdog. In the meantime,
I will go through Julien's patches and try to adapt it to my work.

Thanks and BR,
Ricardo


Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI

2018-06-14 Thread Ricardo Neri
On Wed, Jun 13, 2018 at 11:07:20AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 05:57:37PM -0700, Ricardo Neri wrote:
> 
> +static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
> +{
> +   unsigned long this_isr;
> +   unsigned int lvl_trig;
> +
> +   this_isr = hpet_readl(HPET_STATUS) & BIT(hdata->num);
> +
> +   lvl_trig = hpet_readl(HPET_Tn_CFG(hdata->num)) & HPET_TN_LEVEL;
> +
> +   if (lvl_trig && this_isr)
> +   return true;
> +
> +   return false;
> +}
> 
> > +static int hardlockup_detector_nmi_handler(unsigned int val,
> > +  struct pt_regs *regs)
> > +{
> > +   struct hpet_hld_data *hdata = hld_data;
> > +   unsigned int use_fsb;
> > +
> > +   /*
> > +* If FSB delivery mode is used, the timer interrupt is programmed as
> > +* edge-triggered and there is no need to check the ISR register.
> > +*/
> > +   use_fsb = hdata->flags & HPET_DEV_FSB_CAP;
> 
> Please do explain.. That FSB thing basically means MSI. But there's only
> a single NMI vector. How do we know this NMI came from the HPET?

Indeed, I see now that this is wrong. There is no way to know. The only way
is to use an IO APIC interrupt and read the HPET status register.

> 
> > +
> > +   if (!use_fsb && !is_hpet_wdt_interrupt(hdata))
> 
> So you add _2_ HPET reads for every single NMI that gets triggered...
> and IIRC HPET reads are _slloww_.


Since the trigger mode of the HPET timer is not expected to change, 
perhaps is_hpet_wdt_interrupt() can only need the interrupt status
register. This would reduce the reads to one. Furthermore, the hardlockup
detector can skip an X number of NMIs and reduce further the frequency
of reads. Does this make sense?

> 
> > +   return NMI_DONE;
> > +
> > +   inspect_for_hardlockups(regs);
> > +
> > +   if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > +   kick_timer(hdata);
> > +
> > +   /* Acknowledge interrupt if in level-triggered mode */
> > +   if (!use_fsb)
> > +   hpet_writel(BIT(hdata->num), HPET_STATUS);
> > +
> > +   return NMI_HANDLED;
> 
> So if I read this right, when in FSB/MSI mode, we'll basically _always_
> claim every single NMI as handled?
> 
> That's broken.

Yes, this is not correct. I will drop the functionality to use
FSB/MSI mode.

Thanks and BR,
Ricardo


Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI

2018-06-14 Thread Ricardo Neri
On Wed, Jun 13, 2018 at 11:40:00AM +0200, Thomas Gleixner wrote:
> On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > @@ -183,6 +184,8 @@ static irqreturn_t hardlockup_detector_irq_handler(int 
> > irq, void *data)
> > if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > kick_timer(hdata);
> >  
> > +   pr_err("This interrupt should not have happened. Ensure delivery mode 
> > is NMI.\n");
> 
> Eeew.

If you don't mind me asking. What is the problem with this error message?
> 
> >  /**
> > + * hardlockup_detector_nmi_handler() - NMI Interrupt handler
> > + * @val:   Attribute associated with the NMI. Not used.
> > + * @regs:  Register values as seen when the NMI was asserted
> > + *
> > + * When an NMI is issued, look for hardlockups. If the timer is not 
> > periodic,
> > + * kick it. The interrupt is always handled when if delivered via the
> > + * Front-Side Bus.
> > + *
> > + * Returns:
> > + *
> > + * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
> > + * otherwise.
> > + */
> > +static int hardlockup_detector_nmi_handler(unsigned int val,
> > +  struct pt_regs *regs)
> > +{
> > +   struct hpet_hld_data *hdata = hld_data;
> > +   unsigned int use_fsb;
> > +
> > +   /*
> > +* If FSB delivery mode is used, the timer interrupt is programmed as
> > +* edge-triggered and there is no need to check the ISR register.
> > +*/
> > +   use_fsb = hdata->flags & HPET_DEV_FSB_CAP;
> > +
> > +   if (!use_fsb && !is_hpet_wdt_interrupt(hdata))
> > +   return NMI_DONE;
> 
> So for 'use_fsb == True' every single NMI will fall through into the
> watchdog code below.
> 
> > +   inspect_for_hardlockups(regs);
> > +
> > +   if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > +   kick_timer(hdata);
> 
> And in case that the HPET does not support periodic mode this reprogramms
> the timer on every NMI which means that while perf is running the watchdog
> will never ever detect anything.

Yes. I see that this is wrong. With MSI interrupts, as far as I can
see, there is not a way to make sure that the HPET timer caused the NMI
perhaps the only option is to use an IO APIC interrupt and read the
interrupt status register.

> 
> Aside of that, reading TWO HPET registers for every NMI is insane. HPET
> access is horribly slow, so any high frequency perf monitoring will take a
> massive performance hit.

If an IO APIC interrupt is used, only HPET register (the status register)
would need to be read for every NMI. Would that be more acceptable? Otherwise,
there is no way to determine if the HPET cause the NMI.

Alternatively, there could be a counter that skips reading the HPET status
register (and the detection of hardlockups) for every X NMIs. This would
reduce the overall frequency of HPET register reads.

Is that more acceptable?

Thanks and BR,
Ricardo


[PATCH v3] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP

2018-06-14 Thread Nicholas Piggin
The patch 99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
problem") added a force flush mode to the mmu_gather flush, which
unconditionally flushes the entire address range being invalidated
(even if actual ptes only covered a smaller range), to solve a problem
with concurrent threads invalidating the same PTEs causing them to
miss TLBs that need flushing.

This does not work with powerpc that invalidates mmu_gather batches
according to page size. Have powerpc flush all possible page sizes in
the range if it encounters this concurrency condition.

Patch 4647706ebe ("mm: always flush VMA ranges affected by
zap_page_range") does add a TLB flush for all page sizes on powerpc for
the zap_page_range case, but that is to be removed and replaced with
the mmu_gather flush to avoid redundant flushing. It is also thought to
not cover other obscure race conditions:

https://lkml.kernel.org/r/bd3a0ebe-ecf4-41d4-87fa-c755ea9ab...@gmail.com

Hash does not have a problem because it invalidates TLBs inside the
page table locks.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Nicholas Piggin 
---
Since v2:
- Fix compile breakage.

 arch/powerpc/mm/tlb-radix.c | 96 +
 1 file changed, 75 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 67a6e86d3e7e..a734e486664d 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -689,22 +689,17 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = 
POWER9_TLB_SETS_RADIX * 2;
 
-void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
-unsigned long end)
+static inline void __radix__flush_tlb_range(struct mm_struct *mm,
+   unsigned long start, unsigned long end,
+   bool flush_all_sizes)
 
 {
-   struct mm_struct *mm = vma->vm_mm;
unsigned long pid;
unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
unsigned long page_size = 1UL << page_shift;
unsigned long nr_pages = (end - start) >> page_shift;
bool local, full;
 
-#ifdef CONFIG_HUGETLB_PAGE
-   if (is_vm_hugetlb_page(vma))
-   return radix__flush_hugetlb_tlb_range(vma, start, end);
-#endif
-
pid = mm->context.id;
if (unlikely(pid == MMU_NO_CONTEXT))
return;
@@ -738,37 +733,64 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, 
unsigned long start,
_tlbie_pid(pid, RIC_FLUSH_TLB);
}
} else {
-   bool hflush = false;
+   bool hflush = flush_all_sizes;
+   bool gflush = flush_all_sizes;
unsigned long hstart, hend;
+   unsigned long gstart, gend;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   hstart = (start + HPAGE_PMD_SIZE - 1) >> HPAGE_PMD_SHIFT;
-   hend = end >> HPAGE_PMD_SHIFT;
-   if (hstart < hend) {
-   hstart <<= HPAGE_PMD_SHIFT;
-   hend <<= HPAGE_PMD_SHIFT;
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
hflush = true;
+
+   if (hflush) {
+   hstart = (start + PMD_SIZE - 1) & PMD_MASK;
+   hend = end & PMD_MASK;
+   if (hstart == hend)
+   hflush = false;
+   }
+
+   if (gflush) {
+   gstart = (start + PUD_SIZE - 1) & PUD_MASK;
+   gend = end & PUD_MASK;
+   if (gstart == gend)
+   gflush = false;
}
-#endif
 
asm volatile("ptesync": : :"memory");
if (local) {
__tlbiel_va_range(start, end, pid, page_size, 
mmu_virtual_psize);
if (hflush)
__tlbiel_va_range(hstart, hend, pid,
-   HPAGE_PMD_SIZE, MMU_PAGE_2M);
+   PMD_SIZE, MMU_PAGE_2M);
+   if (gflush)
+   __tlbiel_va_range(gstart, gend, pid,
+   PUD_SIZE, MMU_PAGE_1G);
asm volatile("ptesync": : :"memory");
} else {
__tlbie_va_range(start, end, pid, page_size, 
mmu_virtual_psize);
if (hflush)
__tlbie_va_range(hstart, hend, pid,
-   HPAGE_PMD_SIZE, MMU_PAGE_2M);
+   PMD_SIZE, MMU_PAGE_2M);
+   if (gflush)
+   

Re: [PATCH v13 00/24] selftests, powerpc, x86 : Memory Protection Keys

2018-06-14 Thread Ram Pai
On Thu, Jun 14, 2018 at 10:19:11PM +0200, Florian Weimer wrote:
> On 06/14/2018 02:44 AM, Ram Pai wrote:
> >Test
> >
> >Verified for correctness on powerpc. Need help verifying on x86.
> >Compiles on x86.
> 
> It breaks make in tools/testing/selftests/x86:
> 
> make: *** No rule to make target `protection_keys.c', needed by
> `/home/linux/tools/testing/selftests/x86/protection_keys_64'.  Stop.

Ah.. it has to be taken out from the Makefile of
/home/linux/tools/testing/selftests/x86/

The sources have been moved to /home/linux/tools/testing/selftests/mm/

> 
> The generic implementation no longer builds 32-bit binaries.  Is
> this the intent?

No. But building it 32-bit after moving it to a the new directory 
needs some special code in the Makefile. 

> 
> It's possible to build 32-bit binaries with “make CC='gcc -m32'”, so
> perhaps this is good enough?

Dave Hansen did mention it, but he did not complain too much. So I kept
quite.

> 
> But with that, I get a warning:
> 
> protection_keys.c: In function ‘dump_mem’:
> protection_keys.c:172:3: warning: format ‘%lx’ expects argument of
> type ‘long unsigned int’, but argument 4 has type ‘uint64_t’
> [-Wformat=]
>dprintf1("dump[%03d][@%p]: %016lx\n", i, ptr, *ptr);
>^
> 
> I suppose you could use %016llx and add a cast to unsigned long long
> to fix this.

yes.

> 
> Anyway, both the 32-bit and 64-bit tests fail here:
> 
> assert() at protection_keys.c::943 test_nr: 12 iteration: 1
> running abort_hooks()...
> 
> I've yet checked what causes this.  It's with the kernel headers
> from 4.17, but with other userspace headers based on glibc 2.17.  I
> hope to look into this some more before the weekend, but I
> eventually have to return the test machine to the pool.

I wish I could get a x86 machine which could do memory keys. Had a AWS
instance, but struggled to boot my kernel. Can't get to the console...
gave up.  If someone can give me a ready-made machine with support for
memkeys, I can quickly fix all the outstanding x86 issues.  But if
someone can just fix it for me,   ;)

RP



Re: [PATCH v6 3/4] powerpc/lib: implement strlen() in assembly

2018-06-14 Thread Segher Boessenkool
On Tue, Jun 12, 2018 at 07:01:59PM +0200, Christophe LEROY wrote:
> 
> 
> Le 12/06/2018 à 16:53, Segher Boessenkool a écrit :
> >On Tue, Jun 12, 2018 at 09:14:53AM +, Christophe Leroy wrote:
> >>---
> >>Not tested on PPC64.
> >
> >It won't be acceptable until that happens.  It also is likely quite bad
> >performance on all 64-bit CPUs from the last fifteen years or so.  Or you
> >did nothing to prove otherwise, at least.
> 
> Will it be as bad as the generic implementation which does it byte per 
> byte ?

Probably not.  But how is it for short inputs, etc.?

The main point is that it needs actual testing _for correctness_.

Btw, GCC 7 and later can expand many memcmp as builtins on PowerPC (just
like memset and memcpy etc.), creating better code, without function call.


Segher


Re: [PATCH] mm: convert return type of handle_mm_fault() caller to vm_fault_t

2018-06-14 Thread kbuild test robot
Hi Souptick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.17]
[cannot apply to linus/master powerpc/next sparc-next/master next-20180614]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Souptick-Joarder/mm-convert-return-type-of-handle_mm_fault-caller-to-vm_fault_t/20180615-030636
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

>> arch/powerpc/mm/copro_fault.c:36:5: error: conflicting types for 
>> 'copro_handle_mm_fault'
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^
   In file included from arch/powerpc/mm/copro_fault.c:27:0:
   arch/powerpc/include/asm/copro.h:18:5: note: previous declaration of 
'copro_handle_mm_fault' was here
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^
   In file included from include/linux/linkage.h:7:0,
from include/linux/kernel.h:7,
from include/asm-generic/bug.h:18,
from arch/powerpc/include/asm/bug.h:128,
from include/linux/bug.h:5,
from arch/powerpc/include/asm/mmu.h:126,
from arch/powerpc/include/asm/lppaca.h:36,
from arch/powerpc/include/asm/paca.h:21,
from arch/powerpc/include/asm/current.h:16,
from include/linux/sched.h:12,
from arch/powerpc/mm/copro_fault.c:23:
   arch/powerpc/mm/copro_fault.c:101:19: error: conflicting types for 
'copro_handle_mm_fault'
EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
  ^
   include/linux/export.h:65:21: note: in definition of macro '___EXPORT_SYMBOL'
 extern typeof(sym) sym;  \
^~~
   arch/powerpc/mm/copro_fault.c:101:1: note: in expansion of macro 
'EXPORT_SYMBOL_GPL'
EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
^
   In file included from arch/powerpc/mm/copro_fault.c:27:0:
   arch/powerpc/include/asm/copro.h:18:5: note: previous declaration of 
'copro_handle_mm_fault' was here
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^

vim +/copro_handle_mm_fault +36 arch/powerpc/mm/copro_fault.c

7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  30  
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  31  /*
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  32   * This ought to be kept in sync with the powerpc specific do_page_fault
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  33   * function. Currently, there are a few corner cases that we haven't had
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  34   * to handle fortunately.
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  35   */
e83d01697 arch/powerpc/mm/copro_fault.c   Ian Munsie 2014-10-08 
 @36  int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
7c71e54a4 arch/powerpc/mm/copro_fault.c   Souptick Joarder   2018-06-15 
  37unsigned long dsisr, vm_fault_t *flt)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  38  {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  39struct vm_area_struct *vma;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  40unsigned long is_write;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  41int ret;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  42  
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2009-02-17 
  43if (mm == NULL)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  44return -EFAULT;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2009-02-17 
  45  
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2009-02-17 
  46if (mm->pgd == NULL)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  47return -EFAULT;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  48  
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr2007-12-20 
  49down_read(>mmap_sem);
60ee03194 arch/powerpc/platforms

[PATCH] powerpc/e500mc: Set assembler machine type to e500mc

2018-06-14 Thread Michael Jeanson
In binutils 2.26 a new opcode for the "wait" instruction was added for the
POWER9 and has precedence over the one specific to the e500mc. Commit
ebf714ff3756 ("powerpc/e500mc: Add support for the wait instruction in
e500_idle") uses this instruction specifically on the e500mc to work around
an erratum.

This results in an invalid instruction in idle_e500 when we build for the
e500mc on bintutils >= 2.26 with the default assembler machine type.

Since multiplatform between e500 and non-e500 is not supported, set the
assembler machine type globaly when CONFIG_PPC_E500MC=y.

Signed-off-by: Michael Jeanson 
Reviewed-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Kumar Gala 
CC: Vakul Garg 
CC: Scott Wood 
CC: Mathieu Desnoyers 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-ker...@vger.kernel.org
CC: sta...@vger.kernel.org
---
 arch/powerpc/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 95813df90801..bb2523b4bd8f 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -251,6 +251,7 @@ cpu-as-$(CONFIG_4xx)+= -Wa,-m405
 cpu-as-$(CONFIG_ALTIVEC)   += $(call as-option,-Wa$(comma)-maltivec)
 cpu-as-$(CONFIG_E200)  += -Wa,-me200
 cpu-as-$(CONFIG_PPC_BOOK3S_64) += -Wa,-mpower4
+cpu-as-$(CONFIG_PPC_E500MC)+= $(call as-option,-Wa$(comma)-me500mc)
 
 KBUILD_AFLAGS += $(cpu-as-y)
 KBUILD_CFLAGS += $(cpu-as-y)
-- 
2.17.1



Re: [PATCH v13 00/24] selftests, powerpc, x86 : Memory Protection Keys

2018-06-14 Thread Florian Weimer

On 06/14/2018 02:44 AM, Ram Pai wrote:

Test

Verified for correctness on powerpc. Need help verifying on x86.
Compiles on x86.


It breaks make in tools/testing/selftests/x86:

make: *** No rule to make target `protection_keys.c', needed by 
`/home/linux/tools/testing/selftests/x86/protection_keys_64'.  Stop.


The generic implementation no longer builds 32-bit binaries.  Is this 
the intent?


It's possible to build 32-bit binaries with “make CC='gcc -m32'”, so 
perhaps this is good enough?


But with that, I get a warning:

protection_keys.c: In function ‘dump_mem’:
protection_keys.c:172:3: warning: format ‘%lx’ expects argument of type 
‘long unsigned int’, but argument 4 has type ‘uint64_t’ [-Wformat=]

   dprintf1("dump[%03d][@%p]: %016lx\n", i, ptr, *ptr);
   ^

I suppose you could use %016llx and add a cast to unsigned long long to 
fix this.


Anyway, both the 32-bit and 64-bit tests fail here:

assert() at protection_keys.c::943 test_nr: 12 iteration: 1
running abort_hooks()...

I've yet checked what causes this.  It's with the kernel headers from 
4.17, but with other userspace headers based on glibc 2.17.  I hope to 
look into this some more before the weekend, but I eventually have to 
return the test machine to the pool.


Thanks,
Florian


[PATCH] mm: convert return type of handle_mm_fault() caller to vm_fault_t

2018-06-14 Thread Souptick Joarder
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

In this patch all the caller of handle_mm_fault()
are changed to return vm_fault_t type.

Signed-off-by: Souptick Joarder 
---
 arch/alpha/mm/fault.c |  3 ++-
 arch/arc/mm/fault.c   |  4 +++-
 arch/arm/mm/fault.c   |  7 ---
 arch/arm64/mm/fault.c |  6 +++---
 arch/hexagon/mm/vm_fault.c|  2 +-
 arch/ia64/mm/fault.c  |  2 +-
 arch/m68k/mm/fault.c  |  4 ++--
 arch/microblaze/mm/fault.c|  2 +-
 arch/mips/mm/fault.c  |  2 +-
 arch/nds32/mm/fault.c |  2 +-
 arch/nios2/mm/fault.c |  2 +-
 arch/openrisc/mm/fault.c  |  2 +-
 arch/parisc/mm/fault.c|  2 +-
 arch/powerpc/mm/copro_fault.c |  2 +-
 arch/powerpc/mm/fault.c   |  7 ---
 arch/riscv/mm/fault.c |  3 ++-
 arch/s390/mm/fault.c  | 13 -
 arch/sh/mm/fault.c|  4 ++--
 arch/sparc/mm/fault_32.c  |  3 ++-
 arch/sparc/mm/fault_64.c  |  3 ++-
 arch/um/kernel/trap.c |  2 +-
 arch/unicore32/mm/fault.c |  9 +
 arch/x86/mm/fault.c   |  5 +++--
 arch/xtensa/mm/fault.c|  2 +-
 drivers/iommu/amd_iommu_v2.c  |  2 +-
 drivers/iommu/intel-svm.c |  4 +++-
 mm/hmm.c  |  8 
 mm/ksm.c  |  2 +-
 28 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index cd3c572..2a979ee 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -87,7 +87,8 @@
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
const struct exception_table_entry *fixup;
-   int fault, si_code = SEGV_MAPERR;
+   int si_code = SEGV_MAPERR;
+   vm_fault_t fault;
siginfo_t info;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index a0b7bd6..3a18d33 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -66,7 +67,8 @@ void do_page_fault(unsigned long address, struct pt_regs 
*regs)
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
siginfo_t info;
-   int fault, ret;
+   int ret;
+   vm_fault_t fault;
int write = regs->ecr_cause & ECR_C_PROTV_STORE;  /* ST/EX */
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index b75eada..758abcb 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -219,12 +219,12 @@ static inline bool access_error(unsigned int fsr, struct 
vm_area_struct *vma)
return vma->vm_flags & mask ? false : true;
 }
 
-static int __kprobes
+static vm_fault_t __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
unsigned int flags, struct task_struct *tsk)
 {
struct vm_area_struct *vma;
-   int fault;
+   vm_fault_t fault;
 
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
@@ -259,7 +259,8 @@ static inline bool access_error(unsigned int fsr, struct 
vm_area_struct *vma)
 {
struct task_struct *tsk;
struct mm_struct *mm;
-   int fault, sig, code;
+   int sig, code;
+   vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
if (notify_page_fault(regs, fsr))
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2af3dd8..8da263b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -371,12 +371,12 @@ static void do_bad_area(unsigned long addr, unsigned int 
esr, struct pt_regs *re
 #define VM_FAULT_BADMAP0x01
 #define VM_FAULT_BADACCESS 0x02
 
-static int __do_page_fault(struct mm_struct *mm, unsigned long addr,
+static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
   unsigned int mm_flags, unsigned long vm_flags,
   struct task_struct *tsk)
 {
struct vm_area_struct *vma;
-   int fault;
+   vm_fault_t fault;
 
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
@@ -419,7 +419,7 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
struct task_struct *tsk;
struct mm_struct *mm;
struct siginfo si;
-   int fault, major = 0;
+   vm_fault_t fault, major = 0;
unsigned long vm_flags = VM_READ | VM_WRITE;
unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 3eec33c..5d1de6c 100644
--- 

Re: powerpc: use time64_t in read_persistent_clock

2018-06-14 Thread Mathieu Malaterre
On Thu, Jun 14, 2018 at 1:46 PM Arnd Bergmann  wrote:
>
> On Wed, Jun 13, 2018 at 10:24 PM, Mathieu Malaterre  wrote:
> > Arnd,
> >
> > In 5bfd643583b2e I can see that you changed:
> >
> > $ git show 5bfd643583b2e -- arch/powerpc/platforms/powermac/time.c
> > [...]
> >  #ifdef CONFIG_ADB_PMU
> > -static unsigned long pmu_get_time(void)
> > +static time64_t pmu_get_time(void)
> >  {
> > struct adb_request req;
> > -   unsigned int now;
> > +   time64_t now;
> >
> > if (pmu_request(, NULL, 1, PMU_READ_RTC) < 0)
> > return 0;
> > @@ -160,10 +151,10 @@ static unsigned long pmu_get_time(void)
> >req.reply_len);
> > now = (req.reply[0] << 24) + (req.reply[1] << 16)
> > + (req.reply[2] << 8) + req.reply[3];
> > -   return ((unsigned long)now) - RTC_OFFSET;
> > +   return now - RTC_OFFSET;
> >  }
> > [...]
> >
> > As far as I can tell the old function would never return a negative
> > value and rely on unsigned long roll over. Now I am seeing negative
> > value being returned and it seems to break later on in the rtc
> > library.
> >
> > Could you comment why you removed the cast to (unsigned long) in your 
> > commit ?
>
> I'm sorry my patch caused a regression. Even with your bug report
> it took me a while to see what exactly went wrong, and how I got
> mixed up the integer conversion rules.
>
> I mistakenly assume that fiven
>
>long long now;
>unsigned char reply[4];
>now = reply[0] << 24;
>
> we would always end up with a positive number, but since reply[0] is
> promoted to a signed integer, the right-hand side of the assignment
> ends up as a negative number for the usual contents of the RTC.

Ah right. My diagnosis was invalid.

> Can you confirm that this patch addresses your problem?

Yes !

Before:
[5.986710] rtc-generic rtc-generic: hctosys: unable to read the
hardware clock

After:
[5.579611] rtc-generic rtc-generic: setting system clock to
2018-06-14 18:57:00 UTC (1529002620)

So for the above:

Tested-by: Mathieu Malaterre 

> diff --git a/arch/powerpc/platforms/powermac/time.c
> b/arch/powerpc/platforms/powermac/time.c
> index 7c968e46736f..a7fdcd3051f9 100644
> --- a/arch/powerpc/platforms/powermac/time.c
> +++ b/arch/powerpc/platforms/powermac/time.c
> @@ -97,7 +97,7 @@ static time64_t cuda_get_time(void)
> if (req.reply_len != 7)
> printk(KERN_ERR "cuda_get_time: got %d byte reply\n",
>req.reply_len);
> -   now = (req.reply[3] << 24) + (req.reply[4] << 16)
> +   now = (u32)(req.reply[3] << 24) + (req.reply[4] << 16)
> + (req.reply[5] << 8) + req.reply[6];
> return now - RTC_OFFSET;
>  }
> @@ -140,7 +140,7 @@ static time64_t pmu_get_time(void)
> if (req.reply_len != 4)
> printk(KERN_ERR "pmu_get_time: got %d byte reply from PMU\n",
>req.reply_len);
> -   now = (req.reply[0] << 24) + (req.reply[1] << 16)
> +   now = (u32)(req.reply[0] << 24) + (req.reply[1] << 16)
> + (req.reply[2] << 8) + req.reply[3];
> return now - RTC_OFFSET;
>  }
>
> On a related note, we do have some freedom in how we want to
> interpret values beyond year 2038/2040 when the RTC does wrap
> around.
>
> - With the original code, we converted the time into a signed
>   32-bit integer based on the 1970 Unix epoch, converting RTC
>   years 2038 through 2040 into Linux years 1902 through 1904,
>   which doesn't seem very helpful.
>
> - With my fix above, we would interpret the numbers as documented,
>   with 2040 wrapping around to 1904.
>
> - If we want, we could reinterpret the 1904-1970 range as later
>   times between 2040 and 2106, like this:
>
> --- a/arch/powerpc/platforms/powermac/time.c
> +++ b/arch/powerpc/platforms/powermac/time.c
> @@ -88,7 +88,7 @@ long __init pmac_time_init(void)
>  static time64_t cuda_get_time(void)
>  {
> struct adb_request req;
> -   time64_t now;
> +   u32 now;
>
> if (cuda_request(, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
> return 0;
> @@ -132,7 +132,7 @@ static int cuda_set_rtc_time(struct rtc_time *tm)
>  static time64_t pmu_get_time(void)
>  {
> struct adb_request req;
> -   time64_t now;
> +   u32 now;
>
> if (pmu_request(, NULL, 1, PMU_READ_RTC) < 0)
> return 0;
>
> On the upside, this would possibly extend the life of some machines
> by another 66 years, on the downside it might cause issues when
> an empty RTC battery causes the initial system time to be
> above the 32-bit TIME_T_MAX (instead of going back to 1904).

I would submit the first patch and add the above as comment. I doubt
anyone would still have a running system by then.

Thanks


Re: [PATCH] misc: ocxl: Change return type for fault handler

2018-06-14 Thread Frederic Barrat




Le 11/06/2018 à 22:29, Souptick Joarder a écrit :

Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

There is an existing bug when vm_insert_pfn() can return
ENOMEM which was ignored and VM_FAULT_NOPAGE returned as
default. The new inline vmf_insert_pfn() has removed
this inefficiency by returning correct vm_fault_ type.

Signed-off-by: Souptick Joarder 
---


Thanks!

Tested and
Acked-by: Frederic Barrat 



  drivers/misc/ocxl/context.c | 22 +++---
  drivers/misc/ocxl/sysfs.c   |  5 ++---
  2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index 909e880..98daf91 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -83,7 +83,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
return rc;
  }

-static int map_afu_irq(struct vm_area_struct *vma, unsigned long address,
+static vm_fault_t map_afu_irq(struct vm_area_struct *vma, unsigned long 
address,
u64 offset, struct ocxl_context *ctx)
  {
u64 trigger_addr;
@@ -92,15 +92,15 @@ static int map_afu_irq(struct vm_area_struct *vma, unsigned 
long address,
if (!trigger_addr)
return VM_FAULT_SIGBUS;

-   vm_insert_pfn(vma, address, trigger_addr >> PAGE_SHIFT);
-   return VM_FAULT_NOPAGE;
+   return vmf_insert_pfn(vma, address, trigger_addr >> PAGE_SHIFT);
  }

-static int map_pp_mmio(struct vm_area_struct *vma, unsigned long address,
+static vm_fault_t map_pp_mmio(struct vm_area_struct *vma, unsigned long 
address,
u64 offset, struct ocxl_context *ctx)
  {
u64 pp_mmio_addr;
int pasid_off;
+   vm_fault_t ret;

if (offset >= ctx->afu->config.pp_mmio_stride)
return VM_FAULT_SIGBUS;
@@ -118,27 +118,27 @@ static int map_pp_mmio(struct vm_area_struct *vma, 
unsigned long address,
pasid_off * ctx->afu->config.pp_mmio_stride +
offset;

-   vm_insert_pfn(vma, address, pp_mmio_addr >> PAGE_SHIFT);
+   ret = vmf_insert_pfn(vma, address, pp_mmio_addr >> PAGE_SHIFT);
mutex_unlock(>status_mutex);
-   return VM_FAULT_NOPAGE;
+   return ret;
  }

-static int ocxl_mmap_fault(struct vm_fault *vmf)
+static vm_fault_t ocxl_mmap_fault(struct vm_fault *vmf)
  {
struct vm_area_struct *vma = vmf->vma;
struct ocxl_context *ctx = vma->vm_file->private_data;
u64 offset;
-   int rc;
+   vm_fault_t ret;

offset = vmf->pgoff << PAGE_SHIFT;
pr_debug("%s: pasid %d address 0x%lx offset 0x%llx\n", __func__,
ctx->pasid, vmf->address, offset);

if (offset < ctx->afu->irq_base_offset)
-   rc = map_pp_mmio(vma, vmf->address, offset, ctx);
+   ret = map_pp_mmio(vma, vmf->address, offset, ctx);
else
-   rc = map_afu_irq(vma, vmf->address, offset, ctx);
-   return rc;
+   ret = map_afu_irq(vma, vmf->address, offset, ctx);
+   return ret;
  }

  static const struct vm_operations_struct ocxl_vmops = {
diff --git a/drivers/misc/ocxl/sysfs.c b/drivers/misc/ocxl/sysfs.c
index d9753a1..0ab1fd1 100644
--- a/drivers/misc/ocxl/sysfs.c
+++ b/drivers/misc/ocxl/sysfs.c
@@ -64,7 +64,7 @@ static ssize_t global_mmio_read(struct file *filp, struct 
kobject *kobj,
return count;
  }

-static int global_mmio_fault(struct vm_fault *vmf)
+static vm_fault_t global_mmio_fault(struct vm_fault *vmf)
  {
struct vm_area_struct *vma = vmf->vma;
struct ocxl_afu *afu = vma->vm_private_data;
@@ -75,8 +75,7 @@ static int global_mmio_fault(struct vm_fault *vmf)

offset = vmf->pgoff;
offset += (afu->global_mmio_start >> PAGE_SHIFT);
-   vm_insert_pfn(vma, vmf->address, offset);
-   return VM_FAULT_NOPAGE;
+   return vmf_insert_pfn(vma, vmf->address, offset);
  }

  static const struct vm_operations_struct global_mmio_vmops = {





Re: [RFC PATCH 2/5] powerpc: Flush checkpointed gpr state for 32-bit processes in ptrace

2018-06-14 Thread Pedro Franco de Carvalho
Michael Ellerman  writes:

> Can you add a helper that does it and use that helper in these two
> functions. Then if you can send me another patch that converts all the
> other uses to use the new helper.

Yes, I'll do this.

Thanks!

--
Pedro



Re: [RFC PATCH 1/5] powerpc: Fix inverted active predicate for setting the EBB regset

2018-06-14 Thread Pedro Franco de Carvalho
Michael Ellerman  writes:

> Hi Pedro,
>
> Thanks for looking into this, how did you detect this? Do you have a
> test case?

Hello,

I'm working on allowing these registers to be accessed through GDB,
which is where I saw this happen. Then I used a small program that
traces another, then reads and tries to write to the regset, but not in
linux selftest format.

> I don't think Anshuman wrote it this way on purpose, but added him to Cc
> in case he remembers.
>
> But I don't think this fix is necessarily right. If we are setting the
> EBB regs via ptrace then it doesn't matter if they were previously in
> use or not, we should just set them. What *does* matter is that at the
> end of the function we set used_ebb to true, because otherwise the
> values we have set will not actually be used when the process is
> rescheduled.

Now I'm not sure why the ptrace calls for the ebb regset are guarded
with used_ebb in the first place. The save/restore_sprs functions in
kernel/process.c seem to handle the ebb registers regardless of this
flag, and it seems to be possible for user programs to read and write to
these registers even without having first created a perf event.

The flag only appears to be used in perf/core_book3s.c in the
ebb_event_add function, and the pmu registers (mmcr0, etc) only seem to
be saved to and restored from the thread_struct through
ebb_switch_in/out. So maybe the original intent was to guard the
pmu_get/set functions with used_ebb instead?

I'm not sure about this, because I don't know if it possible for a
ptrace call to happen between ebb_event_add and the first ebb_switch_in
for a thread.

Thanks!

--
Pedro



Re: [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand

2018-06-14 Thread David Gibson
On Thu, Jun 14, 2018 at 04:35:18PM +1000, Alexey Kardashevskiy wrote:
> On 12/6/18 2:17 pm, David Gibson wrote:
> > On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
> >> At the moment we allocate the entire TCE table, twice (hardware part and
> >> userspace translation cache). This normally works as we normally have
> >> contigous memory and the guest will map entire RAM for 64bit DMA.
> >>
> >> However if we have sparse RAM (one example is a memory device), then
> >> we will allocate TCEs which will never be used as the guest only maps
> >> actual memory for DMA. If it is a single level TCE table, there is nothing
> >> we can really do but if it a multilevel table, we can skip allocating
> >> TCEs we know we won't need.
> >>
> >> This adds ability to allocate only first level, saving memory.
> >>
> >> This changes iommu_table::free() to avoid allocating of an extra level;
> >> iommu_table::set() will do this when needed.
> >>
> >> This adds @alloc parameter to iommu_table::exchange() to tell the callback
> >> if it can allocate an extra level; the flag is set to "false" for
> >> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
> >> H_TOO_HARD.
> >>
> >> This still requires the entire table to be counted in mm::locked_vm.
> >>
> >> To be conservative, this only does on-demand allocation when
> >> the usespace cache table is requested which is the case of VFIO.
> >>
> >> The example math for a system replicating a powernv setup with NVLink2
> >> in a guest:
> >> 16GB RAM mapped at 0x0
> >> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x2440
> >>
> >> the table to cover that all with 64K pages takes:
> >> (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
> >>
> >> If we allocate only necessary TCE levels, we will only need:
> >> (((0x4 + 0x4) >> 16)*8)>>20 = 4MB (plus some for indirect
> >> levels).
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>  arch/powerpc/include/asm/iommu.h  |  7 ++-
> >>  arch/powerpc/platforms/powernv/pci.h  |  6 ++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c   |  4 +-
> >>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 
> >> ---
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 ++--
> >>  drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
> >>  6 files changed, 69 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/iommu.h 
> >> b/arch/powerpc/include/asm/iommu.h
> >> index 4bdcf22..daa3ee5 100644
> >> --- a/arch/powerpc/include/asm/iommu.h
> >> +++ b/arch/powerpc/include/asm/iommu.h
> >> @@ -70,7 +70,7 @@ struct iommu_table_ops {
> >>unsigned long *hpa,
> >>enum dma_data_direction *direction);
> >>  
> >> -  __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
> >> +  __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
> >>  #endif
> >>void (*clear)(struct iommu_table *tbl,
> >>long index, long npages);
> >> @@ -122,10 +122,13 @@ struct iommu_table {
> >>__be64 *it_userspace; /* userspace view of the table */
> >>struct iommu_table_ops *it_ops;
> >>struct krefit_kref;
> >> +  int it_nid;
> >>  };
> >>  
> >> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
> >> +  ((tbl)->it_ops->useraddrptr((tbl), (entry), false))
> > 
> > Is real mode really the only case where you want to inhibit new
> > allocations?  I would have thought some paths would be read-only and
> > you wouldn't want to allocate, even in virtual mode.
> 
> There are paths when I do not want allocation but I can figure out that
> from dma direction flag, for example, I am cleaning up the table and I do
> not want any extra  allocation to happen there and they do happen because I
> made a mistake so I'll repost. Other than that, this @alloc flag is for
> real mode only.

Hm, ok.

Something just occurred to me: IIRC, going from the vmalloc() to the
explicit table structure was to avoid allocating pages for memory
regions that aren't there due to sparseness.  But.. won't you get that
with vmalloc() anyway, if the pages for the gaps never get
instantiated?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v2 0/6] powerpc/pkeys: fixes to pkeys

2018-06-14 Thread Florian Weimer

On 06/14/2018 02:28 AM, Ram Pai wrote:

Assortment of fixes to pkey.

Patch 1  makes pkey consumable in multithreaded applications.

Patch 2  fixes fork behavior to inherit the key attributes.

Patch 3  A off-by-one bug made one key unusable. Fixes it.

Patch 4  Execute-only key is preallocated.

Patch 5  Makes pkey-0 less special.

Patch 6  Deny by default permissions on all unallocated keys.

Passes all selftests on powerpc. Also behavior verified to be correct
by Florian.

Changelog:

v2: . fixed merge conflict with upstream code.
. Add patch 6. Makes the behavior consistent
  with that on x86.


(Except signal handling, but I agree with Ram that the POWER behavior is 
the correct one.)



Ram Pai (6):
   powerpc/pkeys: Enable all user-allocatable pkeys at init.
   powerpc/pkeys: Save the pkey registers before fork
   powerpc/pkeys: fix calculation of total pkeys.
   powerpc/pkeys: Preallocate execute-only key
   powerpc/pkeys: make protection key 0 less special
   powerpc/pkeys: Deny read/write/execute by default


I tested the whole series with the new selftests, with the printamr.c 
program I posted earlier, and the glibc test for pkey_alloc   The 
latter required some test fixes, but now passes as well.  As far as I 
can tell, everything looks good now.


Tested-By: Florian Weimer 

Thanks,
Florian


Re: stacktrace.c:(.text+0x1b0): undefined reference to `.smp_send_safe_nmi_ipi'

2018-06-14 Thread Nicholas Piggin
On Thu, 14 Jun 2018 12:51:28 +0200
Christophe LEROY  wrote:

> Le 14/06/2018 à 12:25, Nicholas Piggin a écrit :
> > On Thu, 14 Jun 2018 12:06:51 +0200
> > Christophe LEROY  wrote:
> >   
> >> See http://kisskb.ellerman.id.au/kisskb/buildresult/13395345/ for details
> >>
> >> Seems like that function only exists when CONFIG_NMI_IPI is defined.
> >>
> >> Problem introduced by commit 5cc05910f26e6fd6da15f052f86f6150e4b91664
> >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/kernel/stacktrace.c?h=next-20180614=5cc05910f26e6fd6da15f052f86f6150e4b91664
> >>
> >>
> >> What do you recommend to fix that ?  
> > 
> > We could just do smp_call_function in the case we have no NMI_IPI?  
> 
> ???
> 
> Can you detail what you have in mind ? Do smp_call_function() to call 
> what ? Instead of the call to smp_send_safe_nmi_ipi() ?

Yes I was thinking that but now I take a better look there's places that
call it with irqs off so that might be a bit risky. I guess just making
it depend on NMI_IPI for now is the minimum fix.

Thanks,
Nick


Re: powerpc: use time64_t in read_persistent_clock

2018-06-14 Thread Arnd Bergmann
On Wed, Jun 13, 2018 at 10:24 PM, Mathieu Malaterre  wrote:
> Arnd,
>
> In 5bfd643583b2e I can see that you changed:
>
> $ git show 5bfd643583b2e -- arch/powerpc/platforms/powermac/time.c
> [...]
>  #ifdef CONFIG_ADB_PMU
> -static unsigned long pmu_get_time(void)
> +static time64_t pmu_get_time(void)
>  {
> struct adb_request req;
> -   unsigned int now;
> +   time64_t now;
>
> if (pmu_request(, NULL, 1, PMU_READ_RTC) < 0)
> return 0;
> @@ -160,10 +151,10 @@ static unsigned long pmu_get_time(void)
>req.reply_len);
> now = (req.reply[0] << 24) + (req.reply[1] << 16)
> + (req.reply[2] << 8) + req.reply[3];
> -   return ((unsigned long)now) - RTC_OFFSET;
> +   return now - RTC_OFFSET;
>  }
> [...]
>
> As far as I can tell the old function would never return a negative
> value and rely on unsigned long roll over. Now I am seeing negative
> value being returned and it seems to break later on in the rtc
> library.
>
> Could you comment why you removed the cast to (unsigned long) in your commit ?

I'm sorry my patch caused a regression. Even with your bug report
it took me a while to see what exactly went wrong, and how I got
mixed up the integer conversion rules.

I mistakenly assume that fiven

   long long now;
   unsigned char reply[4];
   now = reply[0] << 24;

we would always end up with a positive number, but since reply[0] is
promoted to a signed integer, the right-hand side of the assignment
ends up as a negative number for the usual contents of the RTC.

Can you confirm that this patch addresses your problem?

diff --git a/arch/powerpc/platforms/powermac/time.c
b/arch/powerpc/platforms/powermac/time.c
index 7c968e46736f..a7fdcd3051f9 100644
--- a/arch/powerpc/platforms/powermac/time.c
+++ b/arch/powerpc/platforms/powermac/time.c
@@ -97,7 +97,7 @@ static time64_t cuda_get_time(void)
if (req.reply_len != 7)
printk(KERN_ERR "cuda_get_time: got %d byte reply\n",
   req.reply_len);
-   now = (req.reply[3] << 24) + (req.reply[4] << 16)
+   now = (u32)(req.reply[3] << 24) + (req.reply[4] << 16)
+ (req.reply[5] << 8) + req.reply[6];
return now - RTC_OFFSET;
 }
@@ -140,7 +140,7 @@ static time64_t pmu_get_time(void)
if (req.reply_len != 4)
printk(KERN_ERR "pmu_get_time: got %d byte reply from PMU\n",
   req.reply_len);
-   now = (req.reply[0] << 24) + (req.reply[1] << 16)
+   now = (u32)(req.reply[0] << 24) + (req.reply[1] << 16)
+ (req.reply[2] << 8) + req.reply[3];
return now - RTC_OFFSET;
 }

On a related note, we do have some freedom in how we want to
interpret values beyond year 2038/2040 when the RTC does wrap
around.

- With the original code, we converted the time into a signed
  32-bit integer based on the 1970 Unix epoch, converting RTC
  years 2038 through 2040 into Linux years 1902 through 1904,
  which doesn't seem very helpful.

- With my fix above, we would interpret the numbers as documented,
  with 2040 wrapping around to 1904.

- If we want, we could reinterpret the 1904-1970 range as later
  times between 2040 and 2106, like this:

--- a/arch/powerpc/platforms/powermac/time.c
+++ b/arch/powerpc/platforms/powermac/time.c
@@ -88,7 +88,7 @@ long __init pmac_time_init(void)
 static time64_t cuda_get_time(void)
 {
struct adb_request req;
-   time64_t now;
+   u32 now;

if (cuda_request(, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
return 0;
@@ -132,7 +132,7 @@ static int cuda_set_rtc_time(struct rtc_time *tm)
 static time64_t pmu_get_time(void)
 {
struct adb_request req;
-   time64_t now;
+   u32 now;

if (pmu_request(, NULL, 1, PMU_READ_RTC) < 0)
return 0;

On the upside, this would possibly extend the life of some machines
by another 66 years, on the downside it might cause issues when
an empty RTC battery causes the initial system time to be
above the 32-bit TIME_T_MAX (instead of going back to 1904).

  Arnd


Re: stacktrace.c:(.text+0x1b0): undefined reference to `.smp_send_safe_nmi_ipi'

2018-06-14 Thread Christophe LEROY




Le 14/06/2018 à 12:25, Nicholas Piggin a écrit :

On Thu, 14 Jun 2018 12:06:51 +0200
Christophe LEROY  wrote:


See http://kisskb.ellerman.id.au/kisskb/buildresult/13395345/ for details

Seems like that function only exists when CONFIG_NMI_IPI is defined.

Problem introduced by commit 5cc05910f26e6fd6da15f052f86f6150e4b91664
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/kernel/stacktrace.c?h=next-20180614=5cc05910f26e6fd6da15f052f86f6150e4b91664


What do you recommend to fix that ?


We could just do smp_call_function in the case we have no NMI_IPI?


???

Can you detail what you have in mind ? Do smp_call_function() to call 
what ? Instead of the call to smp_send_safe_nmi_ipi() ?


Thanks
Christophe



Thanks,
Nick



[PATCH] powerpc/mm/hash/4k: Free hugetlb page table caches correctly.

2018-06-14 Thread Aneesh Kumar K.V
With 4k page size for hugetlb we allocate hugepage directories from its on slab
cache. With patch 0c4d26802 ("powerpc/book3s64/mm: Simplify the rcu callback 
for page table free")
we missed to free these allocated hugepd tables.

Update pgtable_free to handle hugetlb hugepd directory table.

Fixes:  0c4d26802 ("powerpc/book3s64/mm: Simplify the rcu callback for page 
table free")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/32/pgalloc.h  |  1 +
 .../include/asm/book3s/64/pgtable-4k.h| 21 +++
 .../include/asm/book3s/64/pgtable-64k.h   |  9 
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  5 +
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  1 +
 arch/powerpc/include/asm/nohash/64/pgalloc.h  |  1 +
 arch/powerpc/mm/hugetlbpage.c |  3 ++-
 arch/powerpc/mm/pgtable-book3s64.c| 12 +++
 8 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 6a6673907e45..e4633803fe43 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -108,6 +108,7 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 }
 
 #define check_pgt_cache()  do { } while (0)
+#define get_hugepd_cache_index(x)  (x)
 
 #ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h 
b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
index af5f2baac80f..a069dfcac9a9 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
@@ -49,6 +49,27 @@ static inline int hugepd_ok(hugepd_t hpd)
 }
 #define is_hugepd(hpd) (hugepd_ok(hpd))
 
+/*
+ * 16M and 16G huge page directory tables are allocated from slab cache
+ *
+ */
+#define H_16M_CACHE_INDEX (PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE - 
24)
+#define H_16G_CACHE_INDEX  
\
+   (PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + H_PUD_INDEX_SIZE - 
34)
+
+static inline int get_hugepd_cache_index(int index)
+{
+   switch (index) {
+   case H_16M_CACHE_INDEX:
+   return HTLB_16M_INDEX;
+   case H_16G_CACHE_INDEX:
+   return HTLB_16G_INDEX;
+   default:
+   BUG();
+   }
+   /* should not reach */
+}
+
 #else /* !CONFIG_HUGETLB_PAGE */
 static inline int pmd_huge(pmd_t pmd) { return 0; }
 static inline int pud_huge(pud_t pud) { return 0; }
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h 
b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
index fb4b3ba52339..d7ee249d6890 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
@@ -45,8 +45,17 @@ static inline int hugepd_ok(hugepd_t hpd)
 {
return 0;
 }
+
 #define is_hugepd(pdep)0
 
+/*
+ * This should never get called
+ */
+static inline int get_hugepd_cache_index(int index)
+{
+   BUG();
+}
+
 #else /* !CONFIG_HUGETLB_PAGE */
 static inline int pmd_huge(pmd_t pmd) { return 0; }
 static inline int pud_huge(pud_t pud) { return 0; }
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 42fe7c2ff2df..1022f622397f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -287,6 +287,11 @@ enum pgtable_index {
PMD_INDEX,
PUD_INDEX,
PGD_INDEX,
+   /*
+* Below are used with 4k page size and hugetlb
+*/
+   HTLB_16M_INDEX,
+   HTLB_16G_INDEX,
 };
 
 extern unsigned long __vmalloc_start;
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1707781d2f20..9de40eb614da 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -109,6 +109,7 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 }
 
 #define check_pgt_cache()  do { } while (0)
+#define get_hugepd_cache_index(x)  (x)
 
 #ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h 
b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index 0e693f322cb2..e2d62d033708 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -141,6 +141,7 @@ static inline void pgtable_free(void *table, int shift)
}
 }
 
+#define get_hugepd_cache_index(x)  (x)
 #ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int 
shift)
 {
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7c5f479c5c00..8a9a49c13865 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ 

Re: stacktrace.c:(.text+0x1b0): undefined reference to `.smp_send_safe_nmi_ipi'

2018-06-14 Thread Nicholas Piggin
On Thu, 14 Jun 2018 12:06:51 +0200
Christophe LEROY  wrote:

> See http://kisskb.ellerman.id.au/kisskb/buildresult/13395345/ for details
> 
> Seems like that function only exists when CONFIG_NMI_IPI is defined.
> 
> Problem introduced by commit 5cc05910f26e6fd6da15f052f86f6150e4b91664 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/kernel/stacktrace.c?h=next-20180614=5cc05910f26e6fd6da15f052f86f6150e4b91664
> 
> 
> What do you recommend to fix that ?

We could just do smp_call_function in the case we have no NMI_IPI?

Thanks,
Nick


stacktrace.c:(.text+0x1b0): undefined reference to `.smp_send_safe_nmi_ipi'

2018-06-14 Thread Christophe LEROY

See http://kisskb.ellerman.id.au/kisskb/buildresult/13395345/ for details

Seems like that function only exists when CONFIG_NMI_IPI is defined.

Problem introduced by commit 5cc05910f26e6fd6da15f052f86f6150e4b91664 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/kernel/stacktrace.c?h=next-20180614=5cc05910f26e6fd6da15f052f86f6150e4b91664



What do you recommend to fix that ?


Christophe


Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations

2018-06-14 Thread Thomas Gleixner
On Thu, 14 Jun 2018, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:31:17 -0700
> > I could look into creating the library for
> > common code and relocate the hpet watchdog into arch/x86 for the hpet-
> > specific parts.
> 
> If you can investigate that approach, that would be appreciated. I hope
> I did not misunderstand you there, Thomas.

I'm not against cleanups and consolidation, quite the contrary.

But this stuff just adds new infrastructure w/o showing that it's actually
a cleanup and consolidation.

Thanks,

tglx


Re: [PATCH v2] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP

2018-06-14 Thread Nicholas Piggin
On Thu, 14 Jun 2018 13:51:40 +0800
kbuild test robot  wrote:

> Hi Nicholas,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on next-20180613]
> [cannot apply to v4.17]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-radix-Fix-MADV_-FREE-DONTNEED-TLB-flush-miss-problem-with-THP/20180614-114728
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: microblaze-nommu_defconfig (attached as .config)
> compiler: microblaze-linux-gcc (GCC) 8.1.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=8.1.0 make.cross ARCH=microblaze 

Ooops, absent mindedly edited a header without thinking it's not
powerpc specific. Will fix it somehow.


Re: [RFC PATCH 3/3] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather

2018-06-14 Thread Nicholas Piggin
On Thu, 14 Jun 2018 15:15:47 +0900
Linus Torvalds  wrote:

> On Thu, Jun 14, 2018 at 11:49 AM Nicholas Piggin  wrote:
> >
> > +#ifndef pte_free_tlb
> >  #define pte_free_tlb(tlb, ptep, address)   \
> > do {\
> > __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> > __pte_free_tlb(tlb, ptep, address); \
> > } while (0)
> > +#endif  
> 
> Do you really want to / need to take over the whole pte_free_tlb macro?
> 
> I was hoping that you'd just replace the __tlv_adjust_range() instead.
> 
> Something like
> 
>  - replace the
> 
> __tlb_adjust_range(tlb, address, PAGE_SIZE);
> 
>with a "page directory" version:
> 
> __tlb_free_directory(tlb, address, size);
> 
>  - have the default implementation for that be the old code:
> 
> #ifndef __tlb_free_directory
>   #define __tlb_free_directory(tlb,addr,size)
> __tlb_adjust_range(tlb, addr, PAGE_SIZE)
> #endif
> 
> and that way architectures can now just hook into that
> "__tlb_free_directory()" thing.
> 
> Hmm?

Isn't it just easier and less indirection for the arch to just take
over the pte_free_tlb instead? 

I don't see what the __tlb_free_directory gets you except having to
follow another macro -- if the arch has something special they want
to do there, just do it in their __pte_free_tlb and call it
pte_free_tlb instead.

Thanks,
Nick

> 
>  Linus



Re: [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand

2018-06-14 Thread Alexey Kardashevskiy
On 12/6/18 2:17 pm, David Gibson wrote:
> On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
>> At the moment we allocate the entire TCE table, twice (hardware part and
>> userspace translation cache). This normally works as we normally have
>> contigous memory and the guest will map entire RAM for 64bit DMA.
>>
>> However if we have sparse RAM (one example is a memory device), then
>> we will allocate TCEs which will never be used as the guest only maps
>> actual memory for DMA. If it is a single level TCE table, there is nothing
>> we can really do but if it a multilevel table, we can skip allocating
>> TCEs we know we won't need.
>>
>> This adds ability to allocate only first level, saving memory.
>>
>> This changes iommu_table::free() to avoid allocating of an extra level;
>> iommu_table::set() will do this when needed.
>>
>> This adds @alloc parameter to iommu_table::exchange() to tell the callback
>> if it can allocate an extra level; the flag is set to "false" for
>> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
>> H_TOO_HARD.
>>
>> This still requires the entire table to be counted in mm::locked_vm.
>>
>> To be conservative, this only does on-demand allocation when
>> the usespace cache table is requested which is the case of VFIO.
>>
>> The example math for a system replicating a powernv setup with NVLink2
>> in a guest:
>> 16GB RAM mapped at 0x0
>> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x2440
>>
>> the table to cover that all with 64K pages takes:
>> (((0x2440 + 0x20) >> 16)*8)>>20 = 4556MB
>>
>> If we allocate only necessary TCE levels, we will only need:
>> (((0x4 + 0x4) >> 16)*8)>>20 = 4MB (plus some for indirect
>> levels).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/include/asm/iommu.h  |  7 ++-
>>  arch/powerpc/platforms/powernv/pci.h  |  6 ++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c   |  4 +-
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 ++--
>>  drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
>>  6 files changed, 69 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h 
>> b/arch/powerpc/include/asm/iommu.h
>> index 4bdcf22..daa3ee5 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -70,7 +70,7 @@ struct iommu_table_ops {
>>  unsigned long *hpa,
>>  enum dma_data_direction *direction);
>>  
>> -__be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
>> +__be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
>>  #endif
>>  void (*clear)(struct iommu_table *tbl,
>>  long index, long npages);
>> @@ -122,10 +122,13 @@ struct iommu_table {
>>  __be64 *it_userspace; /* userspace view of the table */
>>  struct iommu_table_ops *it_ops;
>>  struct krefit_kref;
>> +int it_nid;
>>  };
>>  
>> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
>> +((tbl)->it_ops->useraddrptr((tbl), (entry), false))
> 
> Is real mode really the only case where you want to inhibit new
> allocations?  I would have thought some paths would be read-only and
> you wouldn't want to allocate, even in virtual mode.


There are paths when I do not want allocation but I can figure out that
from dma direction flag, for example, I am cleaning up the table and I do
not want any extra  allocation to happen there and they do happen because I
made a mistake so I'll repost. Other than that, this @alloc flag is for
real mode only.


> 
>>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
>> -((tbl)->it_ops->useraddrptr((tbl), (entry)))
>> +((tbl)->it_ops->useraddrptr((tbl), (entry), true))
>>  
>>  /* Pure 2^n version of get_order */
>>  static inline __attribute_const__
>> diff --git a/arch/powerpc/platforms/powernv/pci.h 
>> b/arch/powerpc/platforms/powernv/pci.h
>> index 5e02408..1fa5590 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long 
>> index, long npages,
>>  unsigned long attrs);
>>  extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
>>  extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
>> -unsigned long *hpa, enum dma_data_direction *direction);
>> -extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
>> +unsigned long *hpa, enum dma_data_direction *direction,
>> +bool alloc);
>> +extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index,
>> +bool alloc);
>>  extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
>>  
>>  extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 

Re: [RFC PATCH 3/3] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather

2018-06-14 Thread Linus Torvalds
On Thu, Jun 14, 2018 at 11:49 AM Nicholas Piggin  wrote:
>
> +#ifndef pte_free_tlb
>  #define pte_free_tlb(tlb, ptep, address)   \
> do {\
> __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> __pte_free_tlb(tlb, ptep, address); \
> } while (0)
> +#endif

Do you really want to / need to take over the whole pte_free_tlb macro?

I was hoping that you'd just replace the __tlv_adjust_range() instead.

Something like

 - replace the

__tlb_adjust_range(tlb, address, PAGE_SIZE);

   with a "page directory" version:

__tlb_free_directory(tlb, address, size);

 - have the default implementation for that be the old code:

#ifndef __tlb_free_directory
  #define __tlb_free_directory(tlb,addr,size)
__tlb_adjust_range(tlb, addr, PAGE_SIZE)
#endif

and that way architectures can now just hook into that
"__tlb_free_directory()" thing.

Hmm?

 Linus