Re: [PATCH 0/4] Powerpc: Better preemption for shared processor

2020-10-29 Thread Srikar Dronamraju
* Waiman Long  [2020-10-28 20:01:30]:

> > Srikar Dronamraju (4):
> >powerpc: Refactor is_kvm_guest declaration to new header
> >powerpc: Rename is_kvm_guest to check_kvm_guest
> >powerpc: Reintroduce is_kvm_guest
> >powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
> > 
> >   arch/powerpc/include/asm/firmware.h  |  6 --
> >   arch/powerpc/include/asm/kvm_guest.h | 25 +
> >   arch/powerpc/include/asm/kvm_para.h  |  2 +-
> >   arch/powerpc/include/asm/paravirt.h  | 18 ++
> >   arch/powerpc/kernel/firmware.c   |  5 -
> >   arch/powerpc/platforms/pseries/smp.c |  3 ++-
> >   6 files changed, 50 insertions(+), 9 deletions(-)
> >   create mode 100644 arch/powerpc/include/asm/kvm_guest.h
> > 
> This patch series looks good to me and the performance is nice too.
> 
> Acked-by: Waiman Long 

Thank you.

> 
> Just curious, is the performance mainly from the use of static_branch
> (patches 1 - 3) or from reducing calls to yield_count_of()?

Mainly because of the reduced calls to yield_count_of().
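
To make that concrete, here is a minimal sketch of the idea (not the
exact patch; same_core_as_self() is a hypothetical helper):

	/*
	 * Sketch only - the real code is in arch/powerpc/include/asm/paravirt.h.
	 * The win comes from returning early, without reading the vCPU's yield
	 * count from hypervisor-shared memory, whenever the answer can be
	 * decided cheaply.
	 */
	static inline bool vcpu_is_preempted(int cpu)
	{
		if (!is_shared_processor())
			return false;

		/* is_kvm_guest() is a static branch after patches 1-3 */
		if (!is_kvm_guest() && same_core_as_self(cpu))
			return false;

		/* An odd yield count means the vCPU is preempted right now. */
		return yield_count_of(cpu) & 1;
	}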

> 
> Cheers,
> Longman
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 2/4] PM: hibernate: improve robustness of mapping pages in the direct map

2020-10-29 Thread Mike Rapoport
On Wed, Oct 28, 2020 at 09:15:38PM +, Edgecombe, Rick P wrote:
> On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> > +   if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> > +   unsigned long addr = (unsigned long)page_address(page);
> > +   int ret;
> > +
> > +   if (enable)
> > +   ret = set_direct_map_default_noflush(page);
> > +   else
> > +   ret = set_direct_map_invalid_noflush(page);
> > +
> > +   if (WARN_ON(ret))
> > +   return;
> > +
> > +   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +   } else {
> > +   debug_pagealloc_map_pages(page, 1, enable);
> > +   }
> 
> Looking at the arm side again, I think this might actually introduce a
> regression for the arm/hibernate/DEBUG_PAGEALLOC combo.
> 
> Unlike __kernel_map_pages(), it looks like arm's cpa will always bail
> in the set_direct_map_() functions if rodata_full is false.
>
> So if rodata_full was disabled but debug page alloc is on, then this
> would now skip remapping the pages. I guess the significance depends
> on whether hibernate could actually try to save any DEBUG_PAGEALLOC
> unmapped pages. Looks like it to me though.
 
__kernel_map_pages() on arm64 will also bail out if rodata_full is
false:

void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (!debug_pagealloc_enabled() && !rodata_full)
return;

set_memory_valid((unsigned long)page_address(page), numpages, enable);
}

So using set_direct_map() to map back pages removed from the direct map
with __kernel_map_pages() seems safe to me.

-- 
Sincerely yours,
Mike.


Re: [PATCH 5/9] kprobes/ftrace: Add recursion protection to the ftrace callback

2020-10-29 Thread Masami Hiramatsu
Hi Steve,

On Wed, 28 Oct 2020 07:52:49 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (VMware)" 
> 
> If a ftrace callback does not supply its own recursion protection and
> does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
> make a helper trampoline to do so before calling the callback instead of
> just calling the callback directly.

So in that case the handlers will be called without preempt disabled?


> The default for ftrace_ops is going to assume recursion protection unless
> otherwise specified.

This seems to skip the entire handler if ftrace finds recursion.
I would like to increment the missed counter even in that case.

[...]
e.g.

> diff --git a/arch/csky/kernel/probes/ftrace.c 
> b/arch/csky/kernel/probes/ftrace.c
> index 5264763d05be..5eb2604fdf71 100644
> --- a/arch/csky/kernel/probes/ftrace.c
> +++ b/arch/csky/kernel/probes/ftrace.c
> @@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
>  void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *ops, struct pt_regs *regs)
>  {
> + int bit;
>   bool lr_saver = false;
>   struct kprobe *p;
>   struct kprobe_ctlblk *kcb;
>  
> - /* Preempt is disabled by ftrace */
> + bit = ftrace_test_recursion_trylock();

> +
> + preempt_disable_notrace();
>   p = get_kprobe((kprobe_opcode_t *)ip);
>   if (!p) {
>   p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
>   if (unlikely(!p) || kprobe_disabled(p))
> - return;
> + goto out;
>   lr_saver = true;
>   }

if (bit < 0) {
kprobes_inc_nmissed_count(p);
goto out;
}

>  
> @@ -56,6 +61,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
> parent_ip,
>*/
>   __this_cpu_write(current_kprobe, NULL);
>   }
> +out:
> + preempt_enable_notrace();

if (bit >= 0)
ftrace_test_recursion_unlock(bit);

>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  

Or, we can also introduce a support function,

static inline void kprobes_inc_nmissed_ip(unsigned long ip)
{
struct kprobe *p;

preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (p)
kprobes_inc_nmissed_count(p);
preempt_enable_notrace();
}

> diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
> index 4bab21c71055..5f7742b225a5 100644
> --- a/arch/parisc/kernel/ftrace.c
> +++ b/arch/parisc/kernel/ftrace.c
> @@ -208,13 +208,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> long parent_ip,
>  {
>   struct kprobe_ctlblk *kcb;
>   struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);

(BTW, here is a bug... get_kprobe() must be called with preempt disabled.)

> + int bit;
>  
> - if (unlikely(!p) || kprobe_disabled(p))
> + bit = ftrace_test_recursion_trylock();

if (bit < 0) {
kprobes_inc_nmissed_ip(ip);
>   return;
}

This may be easier for you?

Thank you,

>  
> + preempt_disable_notrace();
> + if (unlikely(!p) || kprobe_disabled(p))
> + goto out;
> +
>   if (kprobe_running()) {
>   kprobes_inc_nmissed_count(p);
> - return;
> + goto out;
>   }
>  
>   __this_cpu_write(current_kprobe, p);
> @@ -235,6 +241,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
> long parent_ip,
>   }
>   }
>   __this_cpu_write(current_kprobe, NULL);
> +out:
> + preempt_enable_notrace();
> + ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
> b/arch/powerpc/kernel/kprobes-ftrace.c
> index 972cb28174b2..5df8d50c65ae 100644
> --- a/arch/powerpc/kernel/kprobes-ftrace.c
> +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> @@ -18,10 +18,16 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned 
> long parent_nip,
>  {
>   struct kprobe *p;
>   struct kprobe_ctlblk *kcb;
> + int bit;
>  
> + bit = ftrace_test_recursion_trylock();
> + if (bit < 0)
> + return;
> +
> + preempt_disable_notrace();
>   p = get_kprobe((kprobe_opcode_t *)nip);
>   if (unlikely(!p) || kprobe_disabled(p))
> - return;
> + goto out;
>  
>   kcb = get_kprobe_ctlblk();
>   if (kprobe_running()) {
> @@ -52,6 +58,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
> parent_nip,
>*/
>   __this_cpu_write(current_kprobe, NULL);
>   }
> +out:
> + preempt_enable_notrace();
> + ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index b388e87a08bf..88466d7fb6b2 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/

Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output

2020-10-29 Thread Mauro Carvalho Chehab
Hi Richard,

Em Wed, 28 Oct 2020 10:44:27 -0700
Richard Cochran  escreveu:

> On Wed, Oct 28, 2020 at 03:23:18PM +0100, Mauro Carvalho Chehab wrote:
> 
> > diff --git a/Documentation/ABI/testing/sysfs-uevent 
> > b/Documentation/ABI/testing/sysfs-uevent
> > index aa39f8d7bcdf..d0893dad3f38 100644
> > --- a/Documentation/ABI/testing/sysfs-uevent
> > +++ b/Documentation/ABI/testing/sysfs-uevent
> > @@ -19,7 +19,8 @@ Description:
> >  a transaction identifier so it's possible to use the same 
> > UUID
> >  value for one or more synthetic uevents in which case we
> >  logically group these uevents together for any userspace
> > -listeners. The UUID value appears in uevent as
> > +listeners. The UUID value appears in uevent as:  
> 
> I know almost nothing about Sphinx, but why have one colon here ^^^ and ...

Good point. After re-reading the text, this ":" doesn't belong here.

> 
> > +
> >  "SYNTH_UUID=----" 
> > environment
> >  variable.
> >  
> > @@ -30,18 +31,19 @@ Description:
> >  It's possible to define zero or more pairs - each pair is 
> > then
> >  delimited by a space character ' '. Each pair appears in
> >  synthetic uevent as "SYNTH_ARG_KEY=VALUE". That means the 
> > KEY
> > -name gains "SYNTH_ARG_" prefix to avoid possible collisions
> > +name gains `SYNTH_ARG_` prefix to avoid possible collisions
> >  with existing variables.
> >  
> > -Example of valid sequence written to the uevent file:
> > +Example of valid sequence written to the uevent file::  
> 
> ... two here?

The main issue that this patch wants to solve is here:

This generates synthetic uevent including these variables::

ACTION=add
SYNTH_ARG_A=1
SYNTH_ARG_B=abc
SYNTH_UUID=fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed

On Sphinx, consecutive lines with the same indent belongs to the same
paragraph. So, without "::", the above will be displayed on a single line,
which is undesired.

using "::" tells Sphinx to display as-is. It will also place it into a a 
box (colored for html output) and using a monospaced font.

The change at the "uevent file:" line was done just for coherency
purposes.

Yet, after re-reading the text, there are other things that are not
coherent. So, I guess the enclosed patch will work better for sys-uevent.

Thanks,
Mauro

docs: ABI: sysfs-uevent: make it compatible with ReST output

- Replace " by ``, in order to use monospaced fonts;
- mark literal blocks as such.

Signed-off-by: Mauro Carvalho Chehab 

diff --git a/Documentation/ABI/testing/sysfs-uevent 
b/Documentation/ABI/testing/sysfs-uevent
index aa39f8d7bcdf..0b6227706b35 100644
--- a/Documentation/ABI/testing/sysfs-uevent
+++ b/Documentation/ABI/testing/sysfs-uevent
@@ -6,42 +6,46 @@ Description:
 Enable passing additional variables for synthetic uevents that
 are generated by writing /sys/.../uevent file.
 
-Recognized extended format is ACTION [UUID [KEY=VALUE ...].
+Recognized extended format is::
 
-The ACTION is compulsory - it is the name of the uevent action
-("add", "change", "remove"). There is no change compared to
-previous functionality here. The rest of the extended format
-is optional.
+   ACTION [UUID [KEY=VALUE ...]
+
+The ACTION is compulsory - it is the name of the uevent
+action (``add``, ``change``, ``remove``). There is no change
+compared to previous functionality here. The rest of the
+extended format is optional.
 
 You need to pass UUID first before any KEY=VALUE pairs.
-The UUID must be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
+The UUID must be in ``xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx``
 format where 'x' is a hex digit. The UUID is considered to be
 a transaction identifier so it's possible to use the same UUID
 value for one or more synthetic uevents in which case we
 logically group these uevents together for any userspace
 listeners. The UUID value appears in uevent as
-"SYNTH_UUID=----" environment
+``SYNTH_UUID=----`` environment
 variable.
 
 If UUID is not passed in, the generated synthetic uevent gains
-"SYNTH_UUID=0" environment variable automatically.
+``SYNTH_UUID=0`` environment variable automatically.
 
 The KEY=VALUE pairs can contain alphanumeric char

[PATCH 02/25] ASoC: fsl: fsl_ssi: remove unnecessary CONFIG_PM_SLEEP

2020-10-29 Thread Coiby Xu
SET_SYSTEM_SLEEP_PM_OPS has already taken good care of CONFIG_PM_SLEEP.
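
For reference, the guard is built into the macro itself (simplified from
include/linux/pm.h):

	/* The callbacks are only wired up when CONFIG_PM_SLEEP is set,
	 * so an #ifdef at the use site is redundant. */
	#ifdef CONFIG_PM_SLEEP
	#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.suspend = suspend_fn, \
		.resume = resume_fn, \
		.freeze = suspend_fn, \
		.thaw = resume_fn, \
		.poweroff = suspend_fn, \
		.restore = resume_fn,
	#else
	#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
	#endif

Note that with the #ifdef removed the suspend/resume functions become
unreferenced when CONFIG_PM_SLEEP=n, so they would typically be marked
__maybe_unused to avoid defined-but-unused warnings.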

Signed-off-by: Coiby Xu 
---
 sound/soc/fsl/fsl_ssi.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
index 404be27c15fe..065500a4cbc1 100644
--- a/sound/soc/fsl/fsl_ssi.c
+++ b/sound/soc/fsl/fsl_ssi.c
@@ -1669,7 +1669,6 @@ static int fsl_ssi_remove(struct platform_device *pdev)
return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int fsl_ssi_suspend(struct device *dev)
 {
struct fsl_ssi *ssi = dev_get_drvdata(dev);
@@ -1699,7 +1698,6 @@ static int fsl_ssi_resume(struct device *dev)
 
return regcache_sync(regs);
 }
-#endif /* CONFIG_PM_SLEEP */
 
 static const struct dev_pm_ops fsl_ssi_pm = {
SET_SYSTEM_SLEEP_PM_OPS(fsl_ssi_suspend, fsl_ssi_resume)
-- 
2.28.0



[PATCH 03/25] ASoC: fsl: remove unnecessary CONFIG_PM_SLEEP

2020-10-29 Thread Coiby Xu
SET_SYSTEM_SLEEP_PM_OPS has already taken good care of CONFIG_PM_SLEEP.

Signed-off-by: Coiby Xu 
---
 sound/soc/fsl/imx-audmux.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/sound/soc/fsl/imx-audmux.c b/sound/soc/fsl/imx-audmux.c
index 25c18b9e348f..6d77188a4eab 100644
--- a/sound/soc/fsl/imx-audmux.c
+++ b/sound/soc/fsl/imx-audmux.c
@@ -349,7 +349,6 @@ static int imx_audmux_remove(struct platform_device *pdev)
return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int imx_audmux_suspend(struct device *dev)
 {
int i;
@@ -377,7 +376,6 @@ static int imx_audmux_resume(struct device *dev)
 
return 0;
 }
-#endif /* CONFIG_PM_SLEEP */
 
 static const struct dev_pm_ops imx_audmux_pm = {
SET_SYSTEM_SLEEP_PM_OPS(imx_audmux_suspend, imx_audmux_resume)
-- 
2.28.0



[PATCH 25/25] ALSA: aoa: remove unnecessary CONFIG_PM_SLEEP

2020-10-29 Thread Coiby Xu
SIMPLE_DEV_PM_OPS has already taken good care of CONFIG_PM_SLEEP.

Signed-off-by: Coiby Xu 
---
 sound/aoa/fabrics/layout.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/sound/aoa/fabrics/layout.c b/sound/aoa/fabrics/layout.c
index d2e85b83f7ed..197d13f23141 100644
--- a/sound/aoa/fabrics/layout.c
+++ b/sound/aoa/fabrics/layout.c
@@ -1126,7 +1126,6 @@ static int aoa_fabric_layout_remove(struct soundbus_dev 
*sdev)
return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int aoa_fabric_layout_suspend(struct device *dev)
 {
struct layout_dev *ldev = dev_get_drvdata(dev);
@@ -1150,7 +1149,6 @@ static int aoa_fabric_layout_resume(struct device *dev)
 static SIMPLE_DEV_PM_OPS(aoa_fabric_layout_pm_ops,
aoa_fabric_layout_suspend, aoa_fabric_layout_resume);
 
-#endif
 
 static struct soundbus_driver aoa_soundbus_driver = {
.name = "snd_aoa_soundbus_drv",
@@ -1159,9 +1157,7 @@ static struct soundbus_driver aoa_soundbus_driver = {
.remove = aoa_fabric_layout_remove,
.driver = {
.owner = THIS_MODULE,
-#ifdef CONFIG_PM_SLEEP
.pm = &aoa_fabric_layout_pm_ops,
-#endif
}
 };
 
-- 
2.28.0



Re: [PATCH 0/4] arch, mm: improve robustness of direct map manipulation

2020-10-29 Thread Mike Rapoport
On Wed, Oct 28, 2020 at 09:03:31PM +, Edgecombe, Rick P wrote:

> > On Wed, Oct 28, 2020 at 11:20:12AM +, Will Deacon wrote:
> > > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote:
> > > > 
> > > > This is a theoretical bug, but it is still not nice :)  
> > > > 
> > > 
> > > Just to clarify: this patch series fixes this problem, right?
> > 
> > Yes.
> > 
> 
> Well, now I'm confused again.
> 
> As David pointed out, __vunmap() should not be executing simultaneously
> with the hibernate operation because hibernate can't snapshot while
> data it needs to save is still updating. If a thread was paused when a
> page was in an "invalid" state, it should be remapped by hibernate
> before the copy.
> 
> To level set, before reading this mail, my takeaways from the
> discussions on potential hibernate/debug page alloc problems were:
> 
> Potential RISC-V issue:
> Doesn't have hibernate support
> 
> Potential ARM issue:
> The logic around when its cpa determines pages might be unmapped looks
> correct for current callers.
> 
> Potential x86 page break issue:
> Seems to be ok for now, but a new set_memory_np() caller could violate
> assumptions in hibernate.
> 
> Non-obvious thorny logic: 
> General agreement it would be good to separate dependencies.
> 
> Behavior of V1 of this patchset:
> No functional change other than addition of a warn in hibernate.

There is a change that adds explicit use of set_direct_map() to
hibernate. Currently, in the case of arm64 with DEBUG_PAGEALLOC=n, if a
thread was paused while a page was in an "invalid" state, hibernate will
access unmapped data because __kernel_map_pages() will bail out.
After the change set_direct_map_default_noflush() would be used and the
page will get mapped before copy.

> So "does this fix the problem", "yes" leaves me a bit confused... Not
> saying there couldn't be any problems, especially due to the thorniness
> and cross arch stride, but what is it exactly and how does this series
> fix it?

This series' goal was primarily to separate dependencies and make it
clearer what DEBUG_PAGEALLOC and SET_DIRECT_MAP each are. As it turned
out, there is also some lack of consistency between architectures that
implement either of these, so I tried to improve that as well.

Honestly, I don't know if a thread can be paused at the time __vunmap()
has left invalid pages, but if it can, there is an issue on arm64 with
DEBUG_PAGEALLOC=n and this set fixes it.

__vunmap()
vm_remove_mappings()
set_direct_map_invalid()
/* thread is frozen */
safe_copy_page()
__kernel_map_pages()
if (!debug_pagealloc())
return
do_copy_page() -> fault

-- 
Sincerely yours,
Mike.


Re: [PATCH 0/4] arch, mm: improve robustness of direct map manipulation

2020-10-29 Thread David Hildenbrand

On 25.10.20 11:15, Mike Rapoport wrote:

From: Mike Rapoport 

Hi,

During recent discussion about KVM protected memory, David raised a concern
about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's lack of consistency of __kernel_map_pages() semantics
across architectures as some guard this function with
#ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
allocation debugging is disabled at run time and some allow modifying the
direct map regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly.



So, I was primarily wondering if we really have to touch direct mappings 
in hibernation code, or if we can avoid doing that. I was wondering if 
we cannot simply do something like kmap() when trying to access a 
!mapped page. Similar to reading old-os memory after kexec when in 
kdump. Just a thought.
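
A rough sketch of what that alternative could look like (hypothetical,
not a patch; do_copy_page() is the existing helper in
kernel/power/snapshot.c):

	/* Copy a page that is absent from the direct map through a
	 * temporary mapping instead of flipping the direct map entry. */
	static void copy_unmapped_page(void *dst, struct page *s_page)
	{
		void *src = vmap(&s_page, 1, VM_MAP, PAGE_KERNEL_RO);

		if (!src)
			return;	/* error handling elided in this sketch */

		do_copy_page(dst, src);
		vunmap(src);
	}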


--
Thanks,

David / dhildenb



Re: [PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-29 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> On 29/10/2020 11:40, Michael Ellerman wrote:
>> Alexey Kardashevskiy  writes:
>>> @@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>>   
>>> mutex_lock(&direct_window_init_mutex);
>>>   
>>> -   dma_addr = find_existing_ddw(pdn);
>>> +   dma_addr = find_existing_ddw(pdn, &len);
>> 
>> I don't see len used anywhere?
>> 
>>> if (dma_addr != 0)
>>> goto out_unlock;
>>>   
>>> @@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>> }
>>> /* verify the window * number of ptes will map the partition */
>>> /* check largest block * page size > max memory hotplug addr */
>>> -   max_addr = ddw_memory_hotplug_max();
>>> -   if (query.largest_available_block < (max_addr >> page_shift)) {
>>> -   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>>> - "%llu-sized pages\n", max_addr,  
>>> query.largest_available_block,
>>> - 1ULL << page_shift);
>>> +   /*
>>> +* The "ibm,pmemory" can appear anywhere in the address space.
>>> +* Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
>>> +* for the upper limit and fallback to max RAM otherwise but this
>>> +* disables device::dma_ops_bypass.
>>> +*/
>>> +   len = max_ram_len;
>> 
>> Here you override whatever find_existing_ddw() wrote to len?
>
> Not always, there is a bunch of gotos before this line to the end of the 
> function and one (which returns the existing window) is legit. Thanks,

Ah yep I see it.

Gotos considered confusing ;)

cheers


Re: [PATCH kernel v4 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-29 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> So far we have been using huge DMA windows to map all the RAM available.
> The RAM is normally mapped to the VM address space contiguously, and
> there is always a reasonable upper limit for possible future hot plugged
> RAM which makes it easy to map all RAM via IOMMU.
>
> Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
> normal RAM) can be mapped anywhere in the VM space beyond the maximum RAM size
> and since it can be used for DMA, it requires extending the huge window
> up to MAX_PHYSMEM_BITS which requires hypervisor support for:
> 1. huge TCE tables;
> 2. multilevel TCE tables;
> 3. huge IOMMU pages.
>
> Certain hypervisors cannot do any of these, so the only option left is
> restricting the huge DMA window to include only RAM and falling back to
> the default DMA window for persistent memory.
>
> This defines arch_dma_map_direct/etc to allow the generic DMA code to
> perform additional checks on whether direct DMA is still possible.
>
> This checks if the system has persistent memory. If it does not,
> the DMA bypass mode is selected, i.e.
> * dev->bus_dma_limit = 0
> * dev->dma_ops_bypass = true <- this avoids calling dma_ops for mapping.
>
> If there is such memory, this creates identity mapping only for RAM and
> sets the dev->bus_dma_limit to let the generic code decide whether to
> call into the direct DMA or the indirect DMA ops.
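
A condensed sketch of that decision (hypothetical helper; see kernel/dma
for the real logic):

	/* With persistent memory present, the huge direct window covers
	 * only RAM, so a mapping may go direct only if it ends below
	 * dev->bus_dma_limit; anything above falls back to dma_ops. */
	static bool dma_goes_direct(struct device *dev, dma_addr_t end)
	{
		if (dev->dma_ops_bypass)
			return true;	/* no pmem: never call into dma_ops */
		return dev->bus_dma_limit && end <= dev->bus_dma_limit;
	}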
>
> This should not change the existing behaviour when there is no persistent
> memory, as dev->dma_ops_bypass is expected to be set.
>
> Signed-off-by: Alexey Kardashevskiy 

Acked-by: Michael Ellerman 

cheers


[PATCH kernel v2] irq: Add reference counting to IRQ mappings

2020-10-29 Thread Alexey Kardashevskiy
PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is first allocated;
this adds kobject_get() in the places where an already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;
platforms tend to dispose only if setup failed, and the rest seem to
call one dispose per mapping. The exception is (at least) PPC/pseries,
which needs https://lkml.org/lkml/2020/10/27/259
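
To illustrate the lifetime this gives shared INTx (sketch; both calls are
the existing irqdomain API):

	/* Two devices behind one bridge map the same hwirq. */
	unsigned int a = irq_create_mapping(domain, hwirq); /* desc created, ref = 1 */
	unsigned int b = irq_create_mapping(domain, hwirq); /* same virq; with this
							     * patch, takes a ref */

	irq_dispose_mapping(a);	/* drops a ref, mapping stays for device B */
	irq_dispose_mapping(b);	/* last ref: release callback tears it down */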

Signed-off-by: Alexey Kardashevskiy 
---

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?


---
Changes:
v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
 kernel/irq/irqdesc.c   | 36 ---
 kernel/irq/irqdomain.c | 49 +-
 2 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..bc8f62157ffa 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -419,20 +419,40 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
 }
 
+static void delayed_free_desc(struct rcu_head *rhp);
 static void irq_kobj_release(struct kobject *kobj)
 {
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+#ifdef CONFIG_IRQ_DOMAIN
+   struct irq_domain *domain;
+   unsigned int virq = desc->irq_data.irq;
 
-   free_masks(desc);
-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
+   domain = desc->irq_data.domain;
+   if (domain) {
+   if (irq_domain_is_hierarchy(domain)) {
+   irq_domain_free_irqs(virq, 1);
+   } else {
+   irq_domain_disassociate(domain, virq);
+   irq_free_desc(virq);
+   }
+   }
+#endif
+   /*
+* We free the descriptor, masks and stat fields via RCU. That
+* allows demultiplex interrupts to do rcu based management of
+* the child interrupts.
+* This also allows us to use rcu in kstat_irqs_usr().
+*/
+   call_rcu(&desc->rcu, delayed_free_desc);
 }
 
 static void delayed_free_desc(struct rcu_head *rhp)
 {
struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
 
-   kobject_put(&desc->kobj);
+   free_masks(desc);
+   free_percpu(desc->kstat_irqs);
+   kfree(desc);
 }
 
 static void free_desc(unsigned int irq)
@@ -453,14 +473,6 @@ static void free_desc(unsigned int irq)
 */
irq_sysfs_del(desc);
delete_irq_desc(irq);
-
-   /*
-* We free the descriptor, masks and stat fields via RCU. That
-* allows demultiplex interrupts to do rcu based management of
-* the child interrupts.
-* This also allows us to use rcu in kstat_irqs_usr().
-*/
-   call_rcu(&desc->rcu, delayed_free_desc);
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..5fb060e077e3 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -487,6 +487,7 @@ static void irq_domain_set_mapping(struct irq_domain 
*domain,
 
 void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
 {
+   struct irq_desc *desc = irq_to_desc(irq);
struct irq_data *irq_data = irq_get_irq_data(irq);
irq_hw_number_t hwirq;
 
@@ -514,11 +515,14 @@ void irq_domain_disassociate(struct irq_domain *domain, 
unsigned int irq)
 
/* Clear reverse map for this hwirq */
irq_domain_clear_mapping(domain, hwirq);
+
+   kobject_put(&desc->kobj);
 }
 
 int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
 irq_hw_number_t hwirq)
 {
+   struct irq_desc *desc = irq_to_desc(virq);
struct irq_data *irq_data = irq_get_irq_data(virq);
int ret;
 
@@ -530,6 +534,8 @@ int irq_domain_associate(struct irq_domain *domain, 
unsigned int virq,
if (WARN(irq_data->domain, "error: virq%i is already associated", virq))
return -EINVAL;
 
+   kobject_get(&desc->kobj);
+
mutex_lock(&irq_domain_mutex);
irq_data->hwirq = hwirq;
irq_data->domain = domain;
@@ -548,6 +554,7 @@ int irq_domain_associate(str

Re: [PATCH] powerpc: avoid broken GCC __attribute__((optimize))

2020-10-29 Thread Michael Ellerman
Ard Biesheuvel  writes:
> On Wed, 28 Oct 2020 at 09:04, Ard Biesheuvel  wrote:
>>
>> Commit 7053f80d9696 ("powerpc/64: Prevent stack protection in early boot")
>> introduced a couple of uses of __attribute__((optimize)) with function
>> scope, to disable the stack protector in some early boot code.
>>
>> Unfortunately, and this is documented in the GCC man pages [0], overriding
>> function attributes for optimization is broken, and is only supported for
>> debug scenarios, not for production: the problem appears to be that
>> setting GCC -f flags using this method will cause it to forget about some
>> or all other optimization settings that have been applied.
>>
>> So the only safe way to disable the stack protector is to disable it for
>> the entire source file.
>>
>> [0] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
>>
>> Cc: Michael Ellerman 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Nick Desaulniers 
>> Cc: Arvind Sankar 
>> Cc: Randy Dunlap 
>> Cc: Josh Poimboeuf 
>> Cc: Thomas Gleixner 
>> Cc: Alexei Starovoitov 
>> Cc: Daniel Borkmann 
>> Cc: Peter Zijlstra (Intel) 
>> Cc: Geert Uytterhoeven 
>> Cc: Kees Cook 
>> Fixes: 7053f80d9696 ("powerpc/64: Prevent stack protection in early boot")
>> Signed-off-by: Ard Biesheuvel 
>> ---
>> Related discussion here:
>> https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=gfhpzoj_hc...@mail.gmail.com/
>>
>> TL;DR using __attribute__((optimize("-fno-gcse"))) in the BPF interpreter
>> causes the compiler to forget about -fno-asynchronous-unwind-tables passed
>> on the command line, resulting in unexpected .eh_frame sections in vmlinux.
>>
>>  arch/powerpc/kernel/Makefile   | 3 +++
>>  arch/powerpc/kernel/paca.c | 2 +-
>>  arch/powerpc/kernel/setup.h| 6 --
>>  arch/powerpc/kernel/setup_64.c | 2 +-
>>  4 files changed, 5 insertions(+), 8 deletions(-)

Thanks for the patch.

> FYI i was notified by one of the robots that I missed one occurrence
> of __nostackprotector in arch/powerpc/kernel/paca.c
>
> Let me know if I need to resend.

That's fine I'll fix it up when applying.

With the existing code, with STACKPROTECTOR_STRONG=y, I see two
functions in setup_64.c that are triggering stack protection. One is
__init, and the other takes no parameters and is not easily reachable
from userspace, so I don't think losing the stack canary on either of
those is a concern.

I don't see anything in paca.c triggering stack protection.

I don't think there's any evidence this is causing a bug for us, so I'll
plan to put this in next for v5.11.

cheers


Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Thu, 2020-10-29 at 11:09 +1100, Michael Ellerman wrote:
> Qian Cai  writes:
> > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > in the CPU-hotplug onlining process, which results in lockdep splats as
> > follows:
> 
> Since when?

For me, it is since the commit in the link, which now looks merged into
v5.10-rc1. It also needs CONFIG_PROVE_RCU_LIST=y.

> What kernel version?
> 
> I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> v5.10-rc1. Am I missing a CONFIG?
> 
> cheers
> 
> 
> >  WARNING: suspicious RCU usage
> >  -
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > 
> >  other info that might help us debug this:
> > 
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> > 
> >  Call Trace:
> >  dump_stack+0xec/0x144 (unreliable)
> >  lockdep_rcu_suspicious+0x128/0x14c
> >  __lock_acquire+0x1060/0x1c60
> >  lock_acquire+0x140/0x5f0
> >  _raw_spin_lock_irqsave+0x64/0xb0
> >  clockevents_register_device+0x74/0x270
> >  register_decrementer_clockevent+0x94/0x110
> >  start_secondary+0x134/0x800
> >  start_secondary_prolog+0x10/0x14
> > 
> > This is avoided by moving the call to rcu_cpu_starting() up near the
> > beginning of the start_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> > 
> > Link: 
> > https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai 
> > ---
> >  arch/powerpc/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 3c6b9822f978..8c2857cbd960 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
> >  /* Activate a secondary processor. */
> >  void start_secondary(void *unused)
> >  {
> > -   unsigned int cpu = smp_processor_id();
> > +   unsigned int cpu = raw_smp_processor_id();
> >  
> > mmgrab(&init_mm);
> > current->active_mm = &init_mm;
> >  
> > smp_store_cpu_info(cpu);
> > set_dec(tb_ticks_per_jiffy);
> > +   rcu_cpu_starting(cpu);
> > preempt_disable();
> > cpu_callin_map[cpu] = 1;
> >  
> > -- 
> > 2.28.0



Re: [PATCH 01/13] PCI: dwc/imx6: Drop setting PCI_MSI_FLAGS_ENABLE

2020-10-29 Thread Rob Herring
On Wed, Oct 28, 2020 at 7:21 PM Michael Ellerman  wrote:
>
> Rob Herring  writes:
> > No other host driver sets the PCI_MSI_FLAGS_ENABLE bit, so it must not
> > be necessary. If it is, a comment is needed.
>
> Yeah, but git blame directly points to:
>
>   75cb8d20c112 ("PCI: imx: Enable MSI from downstream components")

I think I did read this at some point and then forgot about it when I
made the change later...

> Which has a pretty long explanation. The relevant bit probably being:
>
>   ... on i.MX6, the MSI Enable bit controls delivery of MSI interrupts
>   from components below the Root Port.

The thing is that this all seems to be DWC specific rather than i.MX6
specific, given that MSI handling is contained within the DWC block.
So I don't see how this could be an integration difference.

So maybe everyone else is still just setting CONFIG_PCIEPORTBUS
typically and hasn't noticed? Is it correct for the host driver to
set MSI enable?

Rob


Re: [PATCH 5/9] kprobes/ftrace: Add recursion protection to the ftrace callback

2020-10-29 Thread Steven Rostedt
On Thu, 29 Oct 2020 16:58:03 +0900
Masami Hiramatsu  wrote:

> Hi Steve,
> 
> On Wed, 28 Oct 2020 07:52:49 -0400
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (VMware)" 
> > 
> > If a ftrace callback does not supply its own recursion protection and
> > does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
> > make a helper trampoline to do so before calling the callback instead of
> > just calling the callback directly.  
> 
> So in that case the handlers will be called without preempt disabled?
> 
> 
> > The default for ftrace_ops is going to assume recursion protection unless
> > otherwise specified.  
> 
> This seems to skip the entire handler if ftrace finds recursion.
> I would like to increment the missed counter even in that case.

Note, this code does not change the functionality at this point, because
without having the FL_RECURSION flag set (which kprobes does not even in
this patch), it always gets called from the helper function that does this:

bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
if (bit < 0)
return;

preempt_disable_notrace();

op->func(ip, parent_ip, op, regs);

preempt_enable_notrace();
trace_clear_recursion(bit);

Where this function gets called by op->func().

In other words, you don't get that count anyway, and I don't think you want
it. Because it means you traced something that your callback calls.

That bit check is basically a nop, because the last patch in this series
will make the default that everything has recursion protection, but at this
patch the test does this:

/* A previous recursion check was made */
if ((val & TRACE_CONTEXT_MASK) > max)
return 0;

Which would always return true, because this function is called via the
helper that already did the trace_test_and_set_recursion() which, if it
made it this far, the val would always be greater than max.

> 
> [...]
> e.g.
> 
> > diff --git a/arch/csky/kernel/probes/ftrace.c 
> > b/arch/csky/kernel/probes/ftrace.c
> > index 5264763d05be..5eb2604fdf71 100644
> > --- a/arch/csky/kernel/probes/ftrace.c
> > +++ b/arch/csky/kernel/probes/ftrace.c
> > @@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
> >  void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
> >struct ftrace_ops *ops, struct pt_regs *regs)
> >  {
> > +   int bit;
> > bool lr_saver = false;
> > struct kprobe *p;
> > struct kprobe_ctlblk *kcb;
> >  
> > -   /* Preempt is disabled by ftrace */
> > +   bit = ftrace_test_recursion_trylock();  
> 
> > +
> > +   preempt_disable_notrace();
> > p = get_kprobe((kprobe_opcode_t *)ip);
> > if (!p) {
> > p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
> > if (unlikely(!p) || kprobe_disabled(p))
> > -   return;
> > +   goto out;
> > lr_saver = true;
> > }  
> 
>   if (bit < 0) {
>   kprobes_inc_nmissed_count(p);
>   goto out;
>   }

If anything called in get_kprobe() or kprobes_inc_nmissed_count() gets
traced here, you have zero recursion protection, and this will crash the
machine with a likely reboot (triple fault).

Note, the recursion handles interrupts and won't stop them. bit < 0 only
happens if you recurse because this function called something that ends up
calling itself. Really, why would you care about missing a kprobe on the
same kprobe?

-- Steve


Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Wed, 2020-10-28 at 17:31 -0700, Paul E. McKenney wrote:
> On Thu, Oct 29, 2020 at 11:09:07AM +1100, Michael Ellerman wrote:
> > Qian Cai  writes:
> > > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > > in the CPU-hotplug onlining process, which results in lockdep splats as
> > > follows:
> > 
> > Since when?
> > What kernel version?
> > 
> > I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> > v5.10-rc1. Am I missing a CONFIG?
> 
> My guess would be that adding CONFIG_PROVE_RAW_LOCK_NESTING=y will
> get you some splats.

Well, I don't have that set, so it should be CONFIG_PROVE_RCU_LIST=y. Anyway,
this is .config to reproduce on Power9 NV:

https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config



[PATCH v2 0/4] arch, mm: improve robustness of direct map manipulation

2020-10-29 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

During recent discussion about KVM protected memory, David raised a concern
about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's lack of consistency of __kernel_map_pages() semantics
across architectures as some guard this function with
#ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
allocation debugging is disabled at run time and some allow modifying the
direct map regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly. 

Since currently the only user of __kernel_map_pages() outside
DEBUG_PAGEALLOC is the hibernation code, it is updated to make direct
map accesses there more explicit.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a403482...@redhat.com

v2 changes:
* Rephrase patch 2 changelog to better describe the change intentions and
implications
* Move removal of kernel_map_pages() from patch 1 to patch 2, per David

v1:
https://lore.kernel.org/lkml/20201025101555.3057-1-r...@kernel.org

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: make direct map manipulations more explicit
  arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig|  3 +++
 arch/arm64/Kconfig  |  4 +---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c|  6 +++--
 arch/powerpc/Kconfig|  5 +
 arch/riscv/Kconfig  |  4 +---
 arch/riscv/include/asm/pgtable.h|  2 --
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c| 31 +
 arch/s390/Kconfig   |  4 +---
 arch/sparc/Kconfig  |  4 +---
 arch/x86/Kconfig|  4 +---
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c|  4 ++--
 include/linux/mm.h  | 35 +
 include/linux/set_memory.h  |  5 +
 kernel/power/snapshot.c | 30 +++--
 mm/memory_hotplug.c |  3 +--
 mm/page_alloc.c |  6 ++---
 mm/slab.c   |  8 +++
 20 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.28.0



[PATCH v2 1/4] mm: introduce debug_pagealloc_map_pages() helper

2020-10-29 Thread Mike Rapoport
From: Mike Rapoport 

When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages then need to be mapped back
before they can be used. These mapping operations use
__kernel_map_pages() guarded with debug_pagealloc_enabled().

The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.
Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is
not enabled but set_direct_map_invalid_noflush() may render some pages not
present in the direct map and hibernation code won't be able to save such
pages.

To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made
more explicit.

Start with combining the guard condition and the call to
__kernel_map_pages() into a single debug_pagealloc_map_pages() function to
emphasize that __kernel_map_pages() should not be called without
DEBUG_PAGEALLOC and use this new function to map/unmap pages when page
allocation debug is enabled.

Signed-off-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 include/linux/mm.h  | 10 ++
 mm/memory_hotplug.c |  3 +--
 mm/page_alloc.c |  6 ++
 mm/slab.c   |  8 +++-
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..1fc0609056dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2936,12 +2936,22 @@ kernel_map_pages(struct page *page, int numpages, int 
enable)
 {
__kernel_map_pages(page, numpages, enable);
 }
+
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable)
+{
+   if (debug_pagealloc_enabled_static())
+   __kernel_map_pages(page, numpages, enable);
+}
+
 #ifdef CONFIG_HIBERNATION
 extern bool kernel_page_present(struct page *page);
 #endif /* CONFIG_HIBERNATION */
 #else  /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable) {}
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
 static inline bool kernel_page_present(struct page *page) { return true; }
 #endif /* CONFIG_HIBERNATION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b44d4c7ba73b..e2b6043a4428 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -614,8 +614,7 @@ void generic_online_page(struct page *page, unsigned int 
order)
 * so we should map it first. This is better than introducing a special
 * case in page freeing fast path.
 */
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
__free_pages_core(page, order);
totalram_pages_add(1UL << order);
 #ifdef CONFIG_HIGHMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..9a66a1ff9193 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1272,8 +1272,7 @@ static __always_inline bool free_pages_prepare(struct 
page *page,
 */
arch_free_page(page, order);
 
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 0);
+   debug_pagealloc_map_pages(page, 1 << order, 0);
 
kasan_free_nondeferred_pages(page, order);
 
@@ -2270,8 +2269,7 @@ inline void post_alloc_hook(struct page *page, unsigned 
int order,
set_page_refcounted(page);
 
arch_alloc_page(page, order);
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
kernel_poison_pages(page, 1 << order, 1);
set_page_owner(page, order, gfp_flags);
diff --git a/mm/slab.c b/mm/slab.c
index b1113561b98b..340db0ce74c4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1431,10 +1431,8 @@ static bool is_debug_pagealloc_cache(struct kmem_cache 
*cachep)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map)
 {
-   if (!is_debug_pagealloc_cache(cachep))
-   return;
-
-   kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map);
+   debug_pagealloc_map_pages(virt_to_page(objp),
+ cachep->size / PAGE_SIZE, map);
 }
 
 #else
@@ -2062,7 +2060,7 @@ int __kmem_cache_create(struct kmem_cache *cachep, 
slab_flags_t flags)
 
 #if DEBUG
/*
-* If we're going to use the generic kernel_map_pages()
+* If we're going to use the generic debug_pagealloc_map_pages()
 * poisoning, then it's going to smash the c

[PATCH v2 2/4] PM: hibernate: make direct map manipulations more explicit

2020-10-29 Thread Mike Rapoport
From: Mike Rapoport 

When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled a page may be
not present in the direct map and has to be explicitly mapped before it
could be copied.

On arm64 it is possible that a page would be removed from the direct map
using set_direct_map_invalid_noflush() but __kernel_map_pages() will refuse
to map this page back if DEBUG_PAGEALLOC is disabled.

Introduce hibernate_map_page() that will explicitly use
set_direct_map_{default,invalid}_noflush() for ARCH_HAS_SET_DIRECT_MAP case
and debug_pagealloc_map_pages() for DEBUG_PAGEALLOC case.

The remapping of the pages in safe_copy_page() presumes that it only
changes protection bits in an existing PTE and so it is safe to ignore
the return value of set_direct_map_{default,invalid}_noflush().

Still, add a WARN_ON() so that future changes in set_memory APIs will not
silently break hibernation.

Signed-off-by: Mike Rapoport 
---
 include/linux/mm.h  | 12 
 kernel/power/snapshot.c | 30 --
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fc0609056dc..14e397f3752c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2927,16 +2927,6 @@ static inline bool debug_pagealloc_enabled_static(void)
 #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
-/*
- * When called in DEBUG_PAGEALLOC context, the call should most likely be
- * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
- */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable)
-{
-   __kernel_map_pages(page, numpages, enable);
-}
-
 static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable)
 {
@@ -2948,8 +2938,6 @@ static inline void debug_pagealloc_map_pages(struct page 
*page,
 extern bool kernel_page_present(struct page *page);
 #endif /* CONFIG_HIBERNATION */
 #else  /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable) {}
 static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 46b1804c1ddf..054c8cce4236 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void 
*page_address) {}
 static inline void hibernate_restore_unprotect_page(void *page_address) {}
 #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
 
+static inline void hibernate_map_page(struct page *page, int enable)
+{
+   if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
+   unsigned long addr = (unsigned long)page_address(page);
+   int ret;
+
+   /*
+* This should not fail because remapping a page here means
+* that we only update protection bits in an existing PTE.
+* It is still worth to have WARN_ON() here if something
+* changes and this will no longer be the case.
+*/
+   if (enable)
+   ret = set_direct_map_default_noflush(page);
+   else
+   ret = set_direct_map_invalid_noflush(page);
+
+   if (WARN_ON(ret))
+   return;
+
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+   } else {
+   debug_pagealloc_map_pages(page, 1, enable);
+   }
+}
+
 static int swsusp_page_is_free(struct page *);
 static void swsusp_set_page_forbidden(struct page *);
 static void swsusp_unset_page_forbidden(struct page *);
@@ -1355,9 +1381,9 @@ static void safe_copy_page(void *dst, struct page *s_page)
if (kernel_page_present(s_page)) {
do_copy_page(dst, page_address(s_page));
} else {
-   kernel_map_pages(s_page, 1, 1);
+   hibernate_map_page(s_page, 1);
do_copy_page(dst, page_address(s_page));
-   kernel_map_pages(s_page, 1, 0);
+   hibernate_map_page(s_page, 0);
}
 }
 
-- 
2.28.0



[PATCH] powerpc: add support for TIF_NOTIFY_SIGNAL

2020-10-29 Thread Jens Axboe
Wire up TIF_NOTIFY_SIGNAL handling for powerpc.

Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Jens Axboe 
---

5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting
for details:

https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/

As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs,
as that will enable a set of cleanups once all of them support it. I'm
happy carrying this patch if need be, or it can be funelled through the
arch tree. Let me know.
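
The substance of the wiring is small; sketched from the generic series
(the exact powerpc hunk below is truncated by the archive):

	/* The arch's signal-exit path treats TIF_NOTIFY_SIGNAL like
	 * TIF_SIGPENDING so queued task_work gets run on return to user. */
	void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
	{
		if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
			do_signal(current);

		if (thread_info_flags & _TIF_NOTIFY_RESUME)
			tracehook_notify_resume(regs);
	}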

 arch/powerpc/include/asm/thread_info.h | 5 -
 arch/powerpc/kernel/signal.c   | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 46a210b03d2b..53115ae61495 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -90,6 +90,7 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACE  0   /* syscall trace active */
 #define TIF_SIGPENDING 1   /* signal pending */
 #define TIF_NEED_RESCHED   2   /* rescheduling necessary */
+#define TIF_NOTIFY_SIGNAL  3   /* signal notifications exist */
 #define TIF_SYSCALL_EMU4   /* syscall emulation active */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
 #define TIF_PATCH_PENDING  6   /* pending live patching update */
@@ -115,6 +116,7 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACE (1

[PATCH v2 3/4] arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC

2020-10-29 Thread Mike Rapoport
From: Mike Rapoport 

The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must never
fail. With this assumption it wouldn't be safe to allow general usage of
this function.

Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.

As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.

Signed-off-by: Mike Rapoport 
---
 arch/Kconfig |  3 +++
 arch/arm64/Kconfig   |  4 +---
 arch/arm64/mm/pageattr.c |  6 --
 arch/powerpc/Kconfig |  5 +
 arch/riscv/Kconfig   |  4 +---
 arch/riscv/include/asm/pgtable.h |  2 --
 arch/riscv/mm/pageattr.c |  2 ++
 arch/s390/Kconfig|  4 +---
 arch/sparc/Kconfig   |  4 +---
 arch/x86/Kconfig |  4 +---
 arch/x86/mm/pat/set_memory.c |  2 ++
 include/linux/mm.h   | 10 +++---
 12 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..56d4752b6db6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1028,6 +1028,9 @@ config HAVE_STATIC_CALL_INLINE
bool
depends on HAVE_STATIC_CALL
 
+config ARCH_SUPPORTS_DEBUG_PAGEALLOC
+   bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f858c352f72a..5a01dfb77b93 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -71,6 +71,7 @@ config ARM64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_MEMORY_FAILURE
select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_ATOMIC_RMW
@@ -1005,9 +1006,6 @@ config HOLES_IN_ZONE
 
 source "kernel/Kconfig.hz"
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-   def_bool y
-
 config ARCH_SPARSEMEM_ENABLE
def_bool y
select SPARSEMEM_VMEMMAP_ENABLE
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 1b94f5b82654..18613d8834db 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -178,13 +178,15 @@ int set_direct_map_default_noflush(struct page *page)
   PAGE_SIZE, change_page_range, &data);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-   if (!debug_pagealloc_enabled() && !rodata_full)
+   if (!rodata_full)
return;
 
set_memory_valid((unsigned long)page_address(page), numpages, enable);
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 /*
  * This function is used to determine if a linear map page has been marked as
@@ -204,7 +206,7 @@ bool kernel_page_present(struct page *page)
pte_t *ptep;
unsigned long addr = (unsigned long)page_address(page);
 
-   if (!debug_pagealloc_enabled() && !rodata_full)
+   if (!rodata_full)
return true;
 
pgdp = pgd_offset_k(addr);
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9f13fe08492..ad8a83f3ddca 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -146,6 +146,7 @@ config PPC
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_SUPPORTS_ATOMIC_RMW
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOCif PPC32 || PPC_BOOK3S_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
@@ -355,10 +356,6 @@ config PPC_OF_PLATFORM_PCI
depends on PCI
depends on PPC64 # not supported on 32 bits yet
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-   depends on PPC32 || PPC_BOOK3S_64
-   def_bool y
-
 config ARCH_SUPPORTS_UPROBES
def_bool y
 
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..9283c6f9ae2a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -14,6 +14,7 @@ config RISCV
def_bool y
select ARCH_CLOCKSOURCE_INIT
select ARCH_SUPPORTS_ATOMIC_RMW
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOC if MMU
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEBUG_VIRTUAL if MMU
@@ -153,9 +154,6 @@ config ARCH_SELECT_MEMORY_MODEL
 config ARCH_WANT_GENERAL_HUGETLB
def_bool y
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-   def_bool y
-
 config SYS_SUPPORTS_HUGETLBFS
depends on MMU
def_bool y
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 183f1f4b2ae6..41a72861987c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/i

[PATCH v2 4/4] arch, mm: make kernel_page_present() always available

2020-10-29 Thread Mike Rapoport
From: Mike Rapoport 

For architectures that enable ARCH_HAS_SET_MEMORY, having the ability to
verify that a page is mapped in the kernel direct map can be useful
regardless of hibernation.

Add a RISC-V implementation of kernel_page_present(), update its forward
declarations and stubs to be a part of the set_memory API, and remove ugly
ifdefery in include/linux/mm.h around the current declarations of
kernel_page_present().

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c| 29 +
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c|  2 --
 include/linux/mm.h  |  7 ---
 include/linux/set_memory.h  |  5 +
 7 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h 
b/arch/arm64/include/asm/cacheflush.h
index 9384fd8fc13c..45217f21f1fe 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -140,6 +140,7 @@ int set_memory_valid(unsigned long addr, int numpages, int 
enable);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #include 
 
diff --git a/arch/riscv/include/asm/set_memory.h 
b/arch/riscv/include/asm/set_memory.h
index 4c5bae7ca01c..d690b08dff2a 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -24,6 +24,7 @@ static inline int set_memory_nx(unsigned long addr, int 
numpages) { return 0; }
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 321b09d2e2ea..87ba5a68bbb8 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -198,3 +198,32 @@ void __kernel_map_pages(struct page *page, int numpages, 
int enable)
 __pgprot(0), __pgprot(_PAGE_PRESENT));
 }
 #endif
+
+bool kernel_page_present(struct page *page)
+{
+   unsigned long addr = (unsigned long)page_address(page);
+   pgd_t *pgd;
+   pud_t *pud;
+   p4d_t *p4d;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset_k(addr);
+   if (!pgd_present(*pgd))
+   return false;
+
+   p4d = p4d_offset(pgd, addr);
+   if (!p4d_present(*p4d))
+   return false;
+
+   pud = pud_offset(p4d, addr);
+   if (!pud_present(*pud))
+   return false;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   return false;
+
+   pte = pte_offset_kernel(pmd, addr);
+   return pte_present(*pte);
+}
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 5948218f35c5..4352f08bfbb5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -82,6 +82,7 @@ int set_pages_rw(struct page *page, int numpages);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7f248fc45317..16f878c26667 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2228,7 +2228,6 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#ifdef CONFIG_HIBERNATION
 bool kernel_page_present(struct page *page)
 {
unsigned int level;
@@ -2240,7 +2239,6 @@ bool kernel_page_present(struct page *page)
pte = lookup_address((unsigned long)page_address(page), &level);
return (pte_val(*pte) & _PAGE_PRESENT);
 }
-#endif /* CONFIG_HIBERNATION */
 
 int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
   unsigned numpages, unsigned long page_flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab0ef6bd351d..44b82f22e76a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2937,16 +2937,9 @@ static inline void debug_pagealloc_map_pages(struct page *page,
if (debug_pagealloc_enabled_static())
__kernel_map_pages(page, numpages, enable);
 }
-
-#ifdef CONFIG_HIBERNATION
-extern bool kernel_page_present(struct page *page);
-#endif /* CONFIG_HIBERNATION */
 #else  /* CONFIG_DEBUG_PAGEALLOC */
 static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable) {}
-#ifdef CONFIG_HIBERNATION
-static inline bool kernel_page_present(struct page *page) { return true; }
-#endif /* CONFIG_HIBERNATION */
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
 #ifdef __HAVE_ARCH_GATE_AREA
diff --git a/include/l

[PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

2020-10-29 Thread David Hildenbrand
powernv/memtrace is the only in-kernel user that rips out random memory
it never added (doesn't own) in order to allocate memory without a
linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
that - use alloc_contig_pages() for allocating memory and remove the
linear mapping manually.

The original idea was discussed in:
 https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915...@redhat.com

I only tested allocations briefly via QEMU TCG - see patch #4 for more
details.
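
For reference, the core of the new allocation path looks roughly like the
sketch below (simplified from patch #4; error handling and the PG_offline
marking are left out):

        /* Sketch: allocate contiguous memory, then rip out its linear mapping. */
        struct page *page;
        u64 start_pfn;

        page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE,
                                  nid, NULL);
        if (!page)
                return 0;

        start_pfn = page_to_pfn(page);
        arch_remove_linear_mapping(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages));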

David Hildenbrand (4):
  powerpc/mm: factor out creating/removing linear mapping
  powerpc/mm: print warning in arch_remove_linear_mapping()
  powerpc/mm: remove linear mapping if __add_pages() fails in
arch_add_memory()
  powernv/memtrace: don't abuse memory hot(un)plug infrastructure for
memory allocations

 arch/powerpc/mm/mem.c |  48 +---
 arch/powerpc/platforms/powernv/Kconfig|   8 +-
 arch/powerpc/platforms/powernv/memtrace.c | 134 --
 include/linux/memory_hotplug.h|   3 +
 4 files changed, 86 insertions(+), 107 deletions(-)

-- 
2.26.2



[PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping

2020-10-29 Thread David Hildenbrand
We want to stop abusing memory hotplug infrastructure in memtrace code
to perform allocations and remove the linear mapping. Instead we will use
alloc_contig_pages() and remove the linear mapping manually.

Let's factor out creating/removing the linear mapping into
arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
future, we might be able to have whole arch_add_memory() /
arch_remove_memory() be implemented in common code.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rashmica Gupta 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/mem.c  | 41 +++---
 include/linux/memory_hotplug.h |  3 +++
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 01ec2a252f09..8a86d81f8df0 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -120,34 +120,26 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
}
 }
 
-int __ref arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_params *params)
+int __ref arch_create_linear_mapping(int nid, u64 start, u64 size,
+struct mhp_params *params)
 {
-   unsigned long start_pfn = start >> PAGE_SHIFT;
-   unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
 
start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size, nid,
params->pgprot);
if (rc) {
-   pr_warn("Unable to create mapping for hot added memory 
0x%llx..0x%llx: %d\n",
+   pr_warn("Unable to create linear mapping for 0x%llx..0x%llx: 
%d\n",
start, start + size, rc);
return -EFAULT;
}
-
-   return __add_pages(nid, start_pfn, nr_pages, params);
+   return 0;
 }
 
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
-struct vmem_altmap *altmap)
+void __ref arch_remove_linear_mapping(u64 start, u64 size)
 {
-   unsigned long start_pfn = start >> PAGE_SHIFT;
-   unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   __remove_pages(start_pfn, nr_pages, altmap);
-
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
@@ -160,6 +152,29 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 */
vm_unmap_aliases();
 }
+
+int __ref arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_params *params)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   int rc;
+
+   rc = arch_create_linear_mapping(nid, start, size, params);
+   if (rc)
+   return rc;
+   return __add_pages(nid, start_pfn, nr_pages, params);
+}
+
+void __ref arch_remove_memory(int nid, u64 start, u64 size,
+ struct vmem_altmap *altmap)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+
+   __remove_pages(start_pfn, nr_pages, altmap);
+   arch_remove_linear_mapping(start, size);
+}
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d65c6fdc5cfc..00b9e9bd3850 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,6 +375,9 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);
 extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
unsigned long nr_pages);
+extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
+ struct mhp_params *params);
+void arch_remove_linear_mapping(u64 start, u64 size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
-- 
2.26.2



[PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping()

2020-10-29 Thread David Hildenbrand
Let's print a warning similar to the one in arch_create_linear_mapping()
instead of WARN_ON_ONCE(), which may eventually crash the kernel.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rashmica Gupta 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/mem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8a86d81f8df0..685028451dd2 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -145,7 +145,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
 
ret = remove_section_mapping(start, start + size);
-   WARN_ON_ONCE(ret);
+   if (ret)
+   pr_warn("Unable to remove linear mapping for 0x%llx..0x%llx: 
%d\n",
+   start, start + size, ret);
 
/* Ensure all vmalloc mappings are flushed in case they also
 * hit that section of memory
-- 
2.26.2



[PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

2020-10-29 Thread David Hildenbrand
Let's revert what we did in case something goes wrong and we return an
error.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rashmica Gupta 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/mem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 685028451dd2..69b3e8072261 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -165,7 +165,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
rc = arch_create_linear_mapping(nid, start, size, params);
if (rc)
return rc;
-   return __add_pages(nid, start_pfn, nr_pages, params);
+   rc = __add_pages(nid, start_pfn, nr_pages, params);
+   if (rc)
+   arch_remove_linear_mapping(start, size);
+   return rc;
 }
 
 void __ref arch_remove_memory(int nid, u64 start, u64 size,
-- 
2.26.2



[PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

2020-10-29 Thread David Hildenbrand
Let's use alloc_contig_pages() for allocating memory and remove the
linear mapping manually via arch_remove_linear_mapping(). Mark all pages
PG_offline, such that they will definitely not get touched - e.g.,
when hibernating. When freeing memory, try to revert what we did.

The original idea was discussed in:
 https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915...@redhat.com

This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
architectures, whereby only single pages are unmapped from the linear
mapping. Let's mimic what memory hot(un)plug would do with the linear
mapping.
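
A minimal sketch of the PG_offline marking mentioned above (assuming
__SetPageOffline() on each allocated page, following other PG_offline
users; start_pfn/nr_pages are the locals used in the patch):

        /* Sketch: mark pages offline so generic code leaves them alone. */
        unsigned long pfn;

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
                __SetPageOffline(pfn_to_page(pfn));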

We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.

Simple test under QEMU TCG (10GB RAM, single NUMA node):

sh-5.0# mount -t debugfs none /sys/kernel/debug/
sh-5.0# cat /sys/devices/system/memory/block_size_bytes
4000
sh-5.0# echo 0x4000 > /sys/kernel/debug/powerpc/memtrace/enable
[   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 0x8000
sh-5.0# echo 0x8000 > /sys/kernel/debug/powerpc/memtrace/enable
[   75.424302][  T356] radix-mmu: Mapped 0x8000-0xc000 with 64.0 KiB pages
[   75.430549][  T356] memtrace: Freed trace memory back on node 0
[   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 0x8000
sh-5.0# echo 0x1 > /sys/kernel/debug/powerpc/memtrace/enable
[   80.418835][  T356] radix-mmu: Mapped 0x8000-0x0001 with 64.0 KiB pages
[   80.430493][  T356] memtrace: Freed trace memory back on node 0
[   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
sh-5.0# echo 0x4000 > /sys/kernel/debug/powerpc/memtrace/enable
[   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 0x8000

Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
pages are not movable. However, as we don't run with any memory
hot(un)plug mechanism around, we could make an exception to
increase the chance of allocations succeeding.

Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
along PG_reserved in hibernation code will always return "true"
on powerpc, resulting in the pages getting touched. It's too
generic - e.g., indicates boot allocations.

Note 3: For now, we keep using memory_block_size_bytes() as minimum
granularity. I'm not able to come up with a better guess (most
probably, doing it on a section basis could be possible).

Suggested-by: Michal Hocko 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rashmica Gupta 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/Kconfig|   8 +-
 arch/powerpc/platforms/powernv/memtrace.c | 134 --
 2 files changed, 49 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..619b093a0657 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,11 +27,11 @@ config OPAL_PRD
  recovery diagnostics on OpenPower machines
 
 config PPC_MEMTRACE
-   bool "Enable removal of RAM from kernel mappings for tracing"
-   depends on PPC_POWERNV && MEMORY_HOTREMOVE
+   bool "Enable runtime allocation of RAM for tracing"
+   depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC
help
- Enabling this option allows for the removal of memory (RAM)
- from the kernel mappings to be used for hardware tracing.
+ Enabling this option allows for runtime allocation of memory (RAM)
+ for hardware tracing.
 
 config PPC_VAS
bool "IBM Virtual Accelerator Switchboard (VAS)"
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 6828108486f8..8f47797a78c2 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -50,83 +50,50 @@ static const struct file_operations memtrace_fops = {
.open   = simple_open,
 };
 
-static int check_memblock_online(struct memory_block *mem, void *arg)
-{
-   if (mem->state != MEM_ONLINE)
-   return -1;
-
-   return 0;
-}
-
-static int change_memblock_state(struct memory_block *mem, void *arg)
-{
-   unsigned long state = (unsigned long)arg;
-
-   mem->state = state;
-
-   return 0;
-}
-
-/* called with device_hotplug_lock held */
-static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+static u64 memtrace_alloc_node(u32 nid, u64 size)
 {
-   const unsigned long start = PFN_PHYS(start_pfn);
-   const unsigned long size = PFN_PHYS(nr_pages);
+   const unsigned long nr_pages = PHYS_PFN(size);
+   unsigned long pfn, start_pfn;
+   struct page *page;
 
-   if (walk_memory_blocks(start, size

Re: [PATCH v2 2/4] PM: hibernate: make direct map manipulations more explicit

2020-10-29 Thread Rafael J. Wysocki
On Thu, Oct 29, 2020 at 5:19 PM Mike Rapoport  wrote:
>
> From: Mike Rapoport 
>
> When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled a page may be
> not present in the direct map and has to be explicitly mapped before it
> could be copied.
>
> On arm64 it is possible that a page would be removed from the direct map
> using set_direct_map_invalid_noflush() but __kernel_map_pages() will refuse
> to map this page back if DEBUG_PAGEALLOC is disabled.
>
> Introduce hibernate_map_page() that will explicitly use
> set_direct_map_{default,invalid}_noflush() for ARCH_HAS_SET_DIRECT_MAP case
> and debug_pagealloc_map_pages() for DEBUG_PAGEALLOC case.
>
> The remapping of the pages in safe_copy_page() presumes that it only
> changes protection bits in an existing PTE and so it is safe to ignore
> return value of set_direct_map_{default,invalid}_noflush().
>
> Still, add a WARN_ON() so that future changes in set_memory APIs will not
> silently break hibernation.
>
> Signed-off-by: Mike Rapoport 

>From the hibernation support perspective:

Acked-by: Rafael J. Wysocki 

> ---
>  include/linux/mm.h  | 12 
>  kernel/power/snapshot.c | 30 --
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1fc0609056dc..14e397f3752c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2927,16 +2927,6 @@ static inline bool debug_pagealloc_enabled_static(void)
>  #if defined(CONFIG_DEBUG_PAGEALLOC) || 
> defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
>  extern void __kernel_map_pages(struct page *page, int numpages, int enable);
>
> -/*
> - * When called in DEBUG_PAGEALLOC context, the call should most likely be
> - * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
> - */
> -static inline void
> -kernel_map_pages(struct page *page, int numpages, int enable)
> -{
> -   __kernel_map_pages(page, numpages, enable);
> -}
> -
>  static inline void debug_pagealloc_map_pages(struct page *page,
>  int numpages, int enable)
>  {
> @@ -2948,8 +2938,6 @@ static inline void debug_pagealloc_map_pages(struct 
> page *page,
>  extern bool kernel_page_present(struct page *page);
>  #endif /* CONFIG_HIBERNATION */
>  #else  /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
> -static inline void
> -kernel_map_pages(struct page *page, int numpages, int enable) {}
>  static inline void debug_pagealloc_map_pages(struct page *page,
>  int numpages, int enable) {}
>  #ifdef CONFIG_HIBERNATION
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 46b1804c1ddf..054c8cce4236 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void 
> *page_address) {}
>  static inline void hibernate_restore_unprotect_page(void *page_address) {}
>  #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
>
> +static inline void hibernate_map_page(struct page *page, int enable)
> +{
> +   if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> +   unsigned long addr = (unsigned long)page_address(page);
> +   int ret;
> +
> +   /*
> +* This should not fail because remapping a page here means
> +* that we only update protection bits in an existing PTE.
> +* It is still worth to have WARN_ON() here if something
> +* changes and this will no longer be the case.
> +*/
> +   if (enable)
> +   ret = set_direct_map_default_noflush(page);
> +   else
> +   ret = set_direct_map_invalid_noflush(page);
> +
> +   if (WARN_ON(ret))
> +   return;
> +
> +   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +   } else {
> +   debug_pagealloc_map_pages(page, 1, enable);
> +   }
> +}
> +
>  static int swsusp_page_is_free(struct page *);
>  static void swsusp_set_page_forbidden(struct page *);
>  static void swsusp_unset_page_forbidden(struct page *);
> @@ -1355,9 +1381,9 @@ static void safe_copy_page(void *dst, struct page 
> *s_page)
> if (kernel_page_present(s_page)) {
> do_copy_page(dst, page_address(s_page));
> } else {
> -   kernel_map_pages(s_page, 1, 1);
> +   hibernate_map_page(s_page, 1);
> do_copy_page(dst, page_address(s_page));
> -   kernel_map_pages(s_page, 1, 0);
> +   hibernate_map_page(s_page, 0);
> }
>  }
>
> --
> 2.28.0
>


Re: [PATCH 1/3] powerpc/uaccess: Switch __put_user_size_allowed() to __put_user_asm_goto()

2020-10-29 Thread Andreas Schwab
#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 5.10.0-rc1 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc-4.9 (SUSE Linux) 4.9.3"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=40903
CONFIG_LD_VERSION=23501
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_WATCH_QUEUE=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
# end of CPU/Task time and stats accounting

# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=15
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13

#
# Scheduler features
#
CONFIG_UCLAMP_TASK=y
CONFIG_UCLAMP_BUCKETS_COUNT=5
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_CC_HAS_INT128=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_UCLAMP_TASK_GROUP is not set
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
CONFIG_BOOT_CONFIG=y
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
# CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not set
CONFIG_SYSCTL=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_BPF_SYSCALL=y
CONFIG_USERMODE_DRIVER=y
# CONFIG_BPF_PRELOAD is not set
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# end of Kernel Performance Events And Counters

CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_MEMCG_SYSFS_ON=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAU

Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output

2020-10-29 Thread Jonathan Cameron
On Wed, 28 Oct 2020 15:23:18 +0100
Mauro Carvalho Chehab  wrote:

> From: Mauro Carvalho Chehab 
> 
> Some of the files there won't parse well with Sphinx.
> 
> Fix them.
> 
> Signed-off-by: Mauro Carvalho Chehab 
> Signed-off-by: Mauro Carvalho Chehab 

Query below...  I'm going to guess a rebase issue?

Other than that
Acked-by: Jonathan Cameron  # for IIO


> diff --git a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> index b7259234ad70..a10a4de3e5fe 100644
> --- a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> +++ b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
> @@ -3,67 +3,85 @@ KernelVersion:  4.11
>  Contact: benjamin.gaign...@st.com
>  Description:
>   Reading returns the list possible master modes which are:
> - - "reset" : The UG bit from the TIMx_EGR register is
> +
> +
> + - "reset"
> + The UG bit from the TIMx_EGR register is
>   used as trigger output (TRGO).
> - - "enable": The Counter Enable signal CNT_EN is used
> + - "enable"
> + The Counter Enable signal CNT_EN is used
>   as trigger output.
> - - "update": The update event is selected as trigger output.
> + - "update"
> + The update event is selected as trigger output.
>   For instance a master timer can then be used
>   as a prescaler for a slave timer.
> - - "compare_pulse" : The trigger output send a positive pulse
> - when the CC1IF flag is to be set.
> - - "OC1REF": OC1REF signal is used as trigger output.
> - - "OC2REF": OC2REF signal is used as trigger output.
> - - "OC3REF": OC3REF signal is used as trigger output.
> - - "OC4REF": OC4REF signal is used as trigger output.
> + - "compare_pulse"
> + The trigger output send a positive pulse
> + when the CC1IF flag is to be set.
> + - "OC1REF"
> + OC1REF signal is used as trigger output.
> + - "OC2REF"
> + OC2REF signal is used as trigger output.
> + - "OC3REF"
> + OC3REF signal is used as trigger output.
> + - "OC4REF"
> + OC4REF signal is used as trigger output.
> +
>   Additional modes (on TRGO2 only):
> - - "OC5REF": OC5REF signal is used as trigger output.
> - - "OC6REF": OC6REF signal is used as trigger output.
> +
> + - "OC5REF"
> + OC5REF signal is used as trigger output.
> + - "OC6REF"
> + OC6REF signal is used as trigger output.
>   - "compare_pulse_OC4REF":
> -   OC4REF rising or falling edges generate pulses.
> + OC4REF rising or falling edges generate pulses.
>   - "compare_pulse_OC6REF":
> -   OC6REF rising or falling edges generate pulses.
> + OC6REF rising or falling edges generate pulses.
>   - "compare_pulse_OC4REF_r_or_OC6REF_r":
> -   OC4REF or OC6REF rising edges generate pulses.
> + OC4REF or OC6REF rising edges generate pulses.
>   - "compare_pulse_OC4REF_r_or_OC6REF_f":
> -   OC4REF rising or OC6REF falling edges generate pulses.
> + OC4REF rising or OC6REF falling edges generate
> + pulses.
>   - "compare_pulse_OC5REF_r_or_OC6REF_r":
> -   OC5REF or OC6REF rising edges generate pulses.
> + OC5REF or OC6REF rising edges generate pulses.
>   - "compare_pulse_OC5REF_r_or_OC6REF_f":
> -   OC5REF rising or OC6REF falling edges generate pulses.
> + OC5REF rising or OC6REF falling edges generate
> + pulses.
>  
> - +---+   +-++-+
> - | Prescaler +-> | Counter |+-> | Master  | TRGO(2)
> - +---+   +--++-+|-> | Control +-->
> -||  ||  +-+
> - +--v+-+ OCxREF ||  +-+
> - | Chx compare +--> | Output  | ChX
> - +---+-+ |  | Control +-->
> -   . |   |  +-+
> -   . |   |.
> - +---

Re: [PATCH] powerpc/32s: Setup the early hash table at all time.

2020-10-29 Thread Andreas Schwab
On Okt 01 2020, Christophe Leroy wrote:

> At the time being, an early hash table is set up when
> CONFIG_KASAN is selected.
>
> There is nothing wrong with setting such an early hash table
> all the time, even if it is not used. This is a statically
> allocated 256 kB table which lies in the init data section.
>
> This makes the code simpler and may in the future allow to
> setup early IO mappings with fixmap instead of hard coding BATs.
>
> Put create_hpte() and flush_hash_pages() in the .ref.text section
> in order to avoid warning for the reference to early_hash[]. This
> reference is removed by MMU_init_hw_patch() before init memory is
> freed.

This breaks booting on the iBook G4.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH 03/13] PCI: dwc: Move "dbi", "dbi2", and "addr_space" resource setup into common code

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:46 PM, Rob Herring wrote:
> 
> Most DWC drivers use the common register resource names "dbi", "dbi2", and
> "addr_space", so let's move their setup into the DWC common code.
>
> This means 'dbi_base' in particular is set up later, but it looks like no
> drivers touch DBI registers before dw_pcie_host_init or dw_pcie_ep_init.
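
As a rough sketch (not necessarily the exact code in this patch), the
common lookup this enables would be along the lines of:

        /* Sketch: resolve the "dbi" register space in DWC core code. */
        pci->dbi_base = devm_platform_ioremap_resource_byname(pdev, "dbi");
        if (IS_ERR(pci->dbi_base))
                return PTR_ERR(pci->dbi_base);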
>
> Cc: Kishon Vijay Abraham I 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Murali Karicheri 
> Cc: Minghuan Lian 
> Cc: Mingkai Hu 
> Cc: Roy Zang 
> Cc: Jonathan Chocron 
> Cc: Jesper Nilsson 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Gustavo Pimentel 
> Cc: Xiaowei Song 
> Cc: Binghui Wang 
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: Stanimir Varbanov 
> Cc: Pratyush Anand 
> Cc: Thierry Reding 
> Cc: Jonathan Hunter 
> Cc: Kunihiko Hayashi 
> Cc: Masahiro Yamada 
> Cc: linux-o...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-arm-ker...@axis.com
> Cc: linux-arm-...@vger.kernel.org
> Cc: linux-te...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-dra7xx.c   |  8 
>  drivers/pci/controller/dwc/pci-keystone.c | 29 +---
>  .../pci/controller/dwc/pci-layerscape-ep.c| 37 +--
>  drivers/pci/controller/dwc/pcie-al.c  |  9 +---
>  drivers/pci/controller/dwc/pcie-artpec6.c | 43 ++
>  .../pci/controller/dwc/pcie-designware-ep.c   | 29 ++--
>  .../pci/controller/dwc/pcie-designware-host.c |  7 +++
>  .../pci/controller/dwc/pcie-designware-plat.c | 45 +--
>  drivers/pci/controller/dwc/pcie-intel-gw.c|  4 --
>  drivers/pci/controller/dwc/pcie-kirin.c   |  5 ---
>  drivers/pci/controller/dwc/pcie-qcom.c|  8 
>  drivers/pci/controller/dwc/pcie-spear13xx.c   | 11 +
>  drivers/pci/controller/dwc/pcie-tegra194.c| 22 -
>  drivers/pci/controller/dwc/pcie-uniphier-ep.c | 38 +---
>  drivers/pci/controller/dwc/pcie-uniphier.c|  6 ---
>  15 files changed, 47 insertions(+), 254 deletions(-)

[...]


Re: [PATCH 05/13] PCI: dwc: Ensure all outbound ATU windows are reset

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> The Layerscape driver clears the ATU registers which may have been
> configured by the bootloader. Any driver could have the same issue
> and doing it for all drivers doesn't hurt, so let's move it into the
> common DWC code.
>
> Cc: Minghuan Lian 
> Cc: Mingkai Hu 
> Cc: Roy Zang 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Gustavo Pimentel 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-layerscape.c   | 14 --
>  drivers/pci/controller/dwc/pcie-designware-host.c |  5 +
>  2 files changed, 5 insertions(+), 14 deletions(-)
>
>  diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
>  index f24f79a70d9a..53e56d54c482 100644
>  --- a/drivers/pci/controller/dwc/pci-layerscape.c
>  +++ b/drivers/pci/controller/dwc/pci-layerscape.c
>  @@ -83,14 +83,6 @@ static void ls_pcie_drop_msg_tlp(struct ls_pcie *pcie)
>   iowrite32(val, pci->dbi_base + PCIE_STRFMR1);
>   }
>  
>  -static void ls_pcie_disable_outbound_atus(struct ls_pcie *pcie)
>  -{
>  -int i;
>  -
>  -for (i = 0; i < PCIE_IATU_NUM; i++)
>  -dw_pcie_disable_atu(pcie->pci, i, DW_PCIE_REGION_OUTBOUND);
>  -}
>  -
>   static int ls1021_pcie_link_up(struct dw_pcie *pci)
>   {
>   u32 state;
>  @@ -136,12 +128,6 @@ static int ls_pcie_host_init(struct pcie_port *pp)
>   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
>   struct ls_pcie *pcie = to_ls_pcie(pci);
>  
>  -/*
>  - * Disable outbound windows configured by the bootloader to avoid
>  - * one transaction hitting multiple outbound windows.
>  - * dw_pcie_setup_rc() will reconfigure the outbound windows.
>  - */
>  -ls_pcie_disable_outbound_atus(pcie);
>   ls_pcie_fix_error_response(pcie);
>  
>   dw_pcie_dbi_ro_wr_en(pci);
>  diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
>  index cde45b2076ee..265a48f1a0ae 100644
>  --- a/drivers/pci/controller/dwc/pcie-designware-host.c
>  +++ b/drivers/pci/controller/dwc/pcie-designware-host.c
>  @@ -534,6 +534,7 @@ static struct pci_ops dw_pcie_ops = {
>  
>   void dw_pcie_setup_rc(struct pcie_port *pp)
>   {
>  +int i;
>   u32 val, ctrl, num_ctrls;
>   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
>  
>  @@ -583,6 +584,10 @@ void dw_pcie_setup_rc(struct pcie_port *pp)
>   PCI_COMMAND_MASTER | PCI_COMMAND_SERR;
>   dw_pcie_writel_dbi(pci, PCI_COMMAND, val);
>  
>  +/* Ensure all outbound windows are disabled so there are not multiple matches */
>  +for (i = 0; i < pci->num_viewport; i++)
>  +dw_pcie_disable_atu(pci, i, DW_PCIE_REGION_OUTBOUND);
>  +
>   /*
>* If the platform provides its own child bus config accesses, it means
>* the platform uses its own address translation component rather than
>  -- 
>  2.25.1



Re: [PATCH 07/13] PCI: dwc: Drop the .set_num_vectors() host op

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> There's no reason for the .set_num_vectors() host op. Drivers needing a
> non-default value can just initialize pcie_port.num_vectors directly.
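
In other words, instead of a hook, a driver now just sets the field before
dw_pcie_host_init(), along the lines of this sketch:

        /* Sketch: request the maximum number of MSI vectors directly. */
        pp->num_vectors = MAX_MSI_IRQS;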
>
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Gustavo Pimentel 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Thierry Reding 
> Cc: Jonathan Hunter 
> Cc: linux-te...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  .../pci/controller/dwc/pcie-designware-host.c | 19 ---
>  .../pci/controller/dwc/pcie-designware-plat.c |  7 +--
>  drivers/pci/controller/dwc/pcie-designware.h  |  1 -
>  drivers/pci/controller/dwc/pcie-tegra194.c|  7 +--
>  4 files changed, 6 insertions(+), 28 deletions(-)

[...]


Re: [PATCH 08/13] PCI: dwc: Move MSI interrupt setup into DWC common code

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> Platforms using the built-in DWC MSI controller all have a dedicated
> interrupt named "msi" or at index 0, so let's move setting up the
> interrupt to the common DWC code.
>
> spear13xx and dra7xx are the two oddballs with muxed interrupts, so
> we need to prevent configuring the MSI interrupt by setting msi_irq
> to a negative value.
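
A rough sketch of the common lookup this implies (assuming the generic
platform_get_irq*() helpers; not necessarily the exact code in the patch):

        /* Sketch: find the "msi" interrupt, falling back to index 0. */
        if (!pp->msi_irq) {
                int irq = platform_get_irq_byname_optional(pdev, "msi");

                if (irq < 0)
                        irq = platform_get_irq(pdev, 0);
                if (irq < 0)
                        return irq;
                pp->msi_irq = irq;
        }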
>
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Richard Zhu 
> Cc: Lucas Stach 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Pengutronix Kernel Team 
> Cc: Fabio Estevam 
> Cc: NXP Linux Team 
> Cc: Yue Wang 
> Cc: Kevin Hilman 
> Cc: Neil Armstrong 
> Cc: Jerome Brunet 
> Cc: Martin Blumenstingl 
> Cc: Jesper Nilsson 
> Cc: Gustavo Pimentel 
> Cc: Xiaowei Song 
> Cc: Binghui Wang 
> Cc: Stanimir Varbanov 
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: Pratyush Anand 
> Cc: Thierry Reding 
> Cc: Jonathan Hunter 
> Cc: Kunihiko Hayashi 
> Cc: Masahiro Yamada 
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linux-amlo...@lists.infradead.org
> Cc: linux-arm-ker...@axis.com
> Cc: linux-arm-...@vger.kernel.org
> Cc: linux-te...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-dra7xx.c   |  3 +++
>  drivers/pci/controller/dwc/pci-exynos.c   |  6 -
>  drivers/pci/controller/dwc/pci-imx6.c |  6 -
>  drivers/pci/controller/dwc/pci-meson.c|  6 -
>  drivers/pci/controller/dwc/pcie-artpec6.c |  6 -
> .../pci/controller/dwc/pcie-designware-host.c | 11 +-
>  .../pci/controller/dwc/pcie-designware-plat.c |  6 -
>  drivers/pci/controller/dwc/pcie-histb.c   |  6 -
>  drivers/pci/controller/dwc/pcie-kirin.c   | 22 ---
>  drivers/pci/controller/dwc/pcie-qcom.c|  8 ---
>  drivers/pci/controller/dwc/pcie-spear13xx.c   |  1 +
>  drivers/pci/controller/dwc/pcie-tegra194.c|  8 ---
>  drivers/pci/controller/dwc/pcie-uniphier.c|  6 -
>  13 files changed, 14 insertions(+), 81 deletions(-)

[...]


Re: [PATCH 09/13] PCI: dwc: Rework MSI initialization

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> There are 3 possible MSI implementations for the DWC host. The first is
> using the built-in DWC MSI controller. The 2nd is a custom MSI
> controller as part of the PCI host (keystone only). The 3rd is an
> external MSI controller (typically GICv3 ITS). Currently, the last 2
> are distinguished with a .msi_host_init() hook, with the 3rd option using
> an empty function. However, we can detect the 3rd case from the presence
> of 'msi-parent' or 'msi-map' properties, so let's do that instead and
> remove the empty functions.
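
A minimal sketch of that detection (of_property_read_bool() on the host
bridge node; 'has_ext_msi' is a hypothetical local):

        /* Sketch: an external MSI controller is described in DT. */
        struct device_node *np = pci->dev->of_node;
        bool has_ext_msi;

        has_ext_msi = of_property_read_bool(np, "msi-parent") ||
                      of_property_read_bool(np, "msi-map");
        if (has_ext_msi)
                return 0;       /* skip the built-in MSI controller setup */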
>
> Cc: Murali Karicheri 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Minghuan Lian 
> Cc: Mingkai Hu 
> Cc: Roy Zang 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Gustavo Pimentel 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-keystone.c |  9 ---
>  drivers/pci/controller/dwc/pci-layerscape.c   | 25 ---
>  .../pci/controller/dwc/pcie-designware-host.c | 20 +--
>  drivers/pci/controller/dwc/pcie-designware.h  |  1 +
>  drivers/pci/controller/dwc/pcie-intel-gw.c|  9 ---
>  5 files changed, 13 insertions(+), 51 deletions(-)

[...]


Re: [PATCH 10/13] PCI: dwc: Move link handling into common code

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> All the DWC drivers do link setup and checks at roughly the same time.
> Let's use the existing .start_link() hook (currently only used in EP
> mode) and move the link handling to the core code.
>
> The behavior for a link down was inconsistent as some drivers would fail
> probe in that case while others succeed. Let's standardize this to
> succeed, as there are use cases where devices (and the link) appear later
> even without hotplug. For example, a reconfigured FPGA device.
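
A sketch of the standardized bring-up (dw_pcie_wait_for_link() is the
existing DWC helper; the exact placement may differ from the patch):

        /* Sketch: start the link and tolerate it being down for now. */
        if (pci->ops && pci->ops->start_link) {
                ret = pci->ops->start_link(pci);
                if (ret)
                        return ret;
        }

        /* Ignore a link-down; the device may appear later (e.g. an FPGA). */
        if (dw_pcie_wait_for_link(pci))
                dev_info(pci->dev, "link is down, continuing anyway\n");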
>
> Cc: Kishon Vijay Abraham I 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Richard Zhu 
> Cc: Lucas Stach 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Pengutronix Kernel Team 
> Cc: Fabio Estevam 
> Cc: NXP Linux Team 
> Cc: Murali Karicheri 
> Cc: Yue Wang 
> Cc: Kevin Hilman 
> Cc: Neil Armstrong 
> Cc: Jerome Brunet 
> Cc: Martin Blumenstingl 
> Cc: Thomas Petazzoni 
> Cc: Jesper Nilsson 
> Cc: Gustavo Pimentel 
> Cc: Xiaowei Song 
> Cc: Binghui Wang 
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: Stanimir Varbanov 
> Cc: Pratyush Anand 
> Cc: Thierry Reding 
> Cc: Jonathan Hunter 
> Cc: Kunihiko Hayashi 
> Cc: Masahiro Yamada 
> Cc: linux-o...@vger.kernel.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linux-amlo...@lists.infradead.org
> Cc: linux-arm-ker...@axis.com
> Cc: linux-arm-...@vger.kernel.org
> Cc: linux-te...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-dra7xx.c   |  2 -
>  drivers/pci/controller/dwc/pci-exynos.c   | 41 +++--
>  drivers/pci/controller/dwc/pci-imx6.c |  9 ++--
>  drivers/pci/controller/dwc/pci-keystone.c |  9 
>  drivers/pci/controller/dwc/pci-meson.c| 24 --
>  drivers/pci/controller/dwc/pcie-armada8k.c| 39 +++-
>  drivers/pci/controller/dwc/pcie-artpec6.c |  2 -
>  .../pci/controller/dwc/pcie-designware-host.c |  9 
>  .../pci/controller/dwc/pcie-designware-plat.c |  3 --
>  drivers/pci/controller/dwc/pcie-histb.c   | 34 +++---
>  drivers/pci/controller/dwc/pcie-kirin.c   | 23 ++
>  drivers/pci/controller/dwc/pcie-qcom.c| 19 ++--
>  drivers/pci/controller/dwc/pcie-spear13xx.c   | 46 ---
>  drivers/pci/controller/dwc/pcie-tegra194.c|  1 -
>  drivers/pci/controller/dwc/pcie-uniphier.c| 13 ++
>  15 files changed, 103 insertions(+), 171 deletions(-)

[...]


Re: [PATCH 11/13] PCI: dwc: Move dw_pcie_msi_init() into core

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> The host drivers which call dw_pcie_msi_init() are all the ones using
> the built-in MSI controller, so let's move it into the common DWC code.
>
> Cc: Kishon Vijay Abraham I 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Richard Zhu 
> Cc: Lucas Stach 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Pengutronix Kernel Team 
> Cc: Fabio Estevam 
> Cc: NXP Linux Team 
> Cc: Yue Wang 
> Cc: Kevin Hilman 
> Cc: Neil Armstrong 
> Cc: Jerome Brunet 
> Cc: Martin Blumenstingl 
> Cc: Jesper Nilsson 
> Cc: Gustavo Pimentel 
> Cc: Xiaowei Song 
> Cc: Binghui Wang 
> Cc: Stanimir Varbanov 
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: Pratyush Anand 
> Cc: Thierry Reding 
> Cc: Jonathan Hunter 
> Cc: Kunihiko Hayashi 
> Cc: Masahiro Yamada 
> Cc: linux-o...@vger.kernel.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linux-amlo...@lists.infradead.org
> Cc: linux-arm-ker...@axis.com
> Cc: linux-arm-...@vger.kernel.org
> Cc: linux-te...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-dra7xx.c   |  2 --
>  drivers/pci/controller/dwc/pci-exynos.c   |  4 
>  drivers/pci/controller/dwc/pci-imx6.c |  1 -
>  drivers/pci/controller/dwc/pci-meson.c|  1 -
>  drivers/pci/controller/dwc/pcie-artpec6.c |  1 -
>  drivers/pci/controller/dwc/pcie-designware-host.c |  8 +---
>  drivers/pci/controller/dwc/pcie-designware-plat.c |  1 -
>  drivers/pci/controller/dwc/pcie-designware.h  | 10 --
>  drivers/pci/controller/dwc/pcie-histb.c   |  2 --
>  drivers/pci/controller/dwc/pcie-kirin.c   |  1 -
>  drivers/pci/controller/dwc/pcie-qcom.c|  2 --
>  drivers/pci/controller/dwc/pcie-spear13xx.c   |  6 +-
>  drivers/pci/controller/dwc/pcie-tegra194.c|  2 --
>  drivers/pci/controller/dwc/pcie-uniphier.c|  1 -
>  14 files changed, 6 insertions(+), 36 deletions(-)

[...]


[patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends

2020-10-29 Thread Thomas Gleixner
Following up to the discussion in:

  https://lore.kernel.org/r/20200914204209.256266...@linutronix.de

and the initial version of this:

  https://lore.kernel.org/r/20200919091751.06...@linutronix.de

this series provides a preemptible variant of kmap_atomic & related
interfaces.

Now that the scheduler folks have wrapped their heads around the migration
disable scheduler woes, there is no real reason anymore to confine
migration disabling to RT.

As expressed in the earlier discussion by graphics and crypto folks, there
is interest in getting rid of their kmap_atomic* usage because they need only
a temporary stable map and not all the bells and whistles of kmap_atomic*.

This series provides kmap_local.* / iomap_local variants which only disable
migration to keep the virtual mapping address stable across preemption,
but disable neither pagefaults nor preemption. The new functions can be
used in any context, but if used in atomic context the caller has to take
care of disabling pagefaults where needed.
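
Assuming kmap_local_page()/kunmap_local() as the wrapper names for the new
family, usage would look like this sketch:

        /* Sketch: preemptible short-term mapping. */
        void *addr = kmap_local_page(page);

        memcpy(buf, addr, len); /* may be preempted; the mapping stays valid */
        kunmap_local(addr);

        /* In atomic context the caller handles pagefaults itself: */
        pagefault_disable();
        addr = kmap_local_page(page);
        memcpy(buf, addr, len);
        kunmap_local(addr);
        pagefault_enable();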

This is achieved by:

 - Removing the RT dependency from migrate_disable/enable()

 - Consolidating all kmap atomic implementations in generic code

 - Switching from per CPU storage of the kmap index to per task storage

 - Adding a pteval array to the per task storage which contains the ptevals
   of the currently active temporary kmaps

 - Adding context switch code which checks whether the outgoing or the
   incoming task has active temporary kmaps. If so, the outgoing task's
   kmaps are removed and the incoming task's kmaps are restored.

 - Adding new interfaces k[un]map_local*() which do not disable
   preemption and can be called from any context (except NMI).

   Contrary to kmap(), which provides preemptible and "persistent" mappings,
   these interfaces are meant to replace the temporary mappings provided by
   kmap_atomic*() today.

This allows getting rid of conditional mapping choices and allows having
preemptible short term mappings on 64bit, which today are forced to be
non-preemptible due to the highmem constraints. It clearly puts overhead on
the highmem users, but highmem is slow anyway.

This is not a wholesale conversion which makes kmap_atomic magically
preemptible, because there might be usage sites which rely on the implicit
preemption disable. So this needs to be done on a case by case basis, with
the call sites converted to kmap_local.

Note that this is only lightly tested on X86 and completely untested on
all other architectures.

There is also a still-to-be-investigated question from Linus on the initial
posting about the per cpu / per task mapping stack depth, which might need
to be made larger due to the ability to take page faults within a mapping
region.

Though I wanted to share the current state of affairs before investigating
that further. If there is consensus on going forward with this, I'll have a
deeper look into this issue.

The lot is available from

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem

It is based on Peter Zijlstras migrate disable branch which is close to be
merged into the tip tree, but still not finalized:

   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/migrate-disable

Changes vs. V1:

  - Make it truly functional by depending on migrate disable/enable (Brown paperbag)
  - Rename to kmap_local.* (Linus)
  - Fix the sched in/out issue Linus pointed out
  - Fix a few style issues (Christoph)
  - Split a few things out into separate patches to make review simpler
  - Pick up acked/reviewed tags as appropriate

Thanks,

tglx
---
 a/arch/arm/mm/highmem.c   |  121 --
 a/arch/microblaze/mm/highmem.c|   78 
 a/arch/nds32/mm/highmem.c |   48 ---
 a/arch/powerpc/mm/highmem.c   |   67 --
 a/arch/sparc/mm/highmem.c |  115 -
 arch/arc/Kconfig  |1 
 arch/arc/include/asm/highmem.h|8 +
 arch/arc/mm/highmem.c |   44 --
 arch/arm/Kconfig  |1 
 arch/arm/include/asm/highmem.h|   31 +++-
 arch/arm/mm/Makefile  |1 
 arch/csky/Kconfig |1 
 arch/csky/include/asm/highmem.h   |4 
 arch/csky/mm/highmem.c|   75 ---
 arch/microblaze/Kconfig   |1 
 arch/microblaze/include/asm/highmem.h |6 
 arch/microblaze/mm/Makefile   |1 
 arch/microblaze/mm/init.c |6 
 arch/mips/Kconfig |1 
 arch/mips/include/asm/highmem.h   |4 
 arch/mips/mm/highmem.c|   77 
 arch/mips/mm/init.c   |3 
 arch/nds32/Kconfig.cpu|1 
 arch/nds32/include/asm/highmem.h  |   21 ++-
 arch/nds32/mm/Makefile|1 
 arch/powerpc/Kconfig  |1 
 arch/powerpc/include/asm/highmem.h|6 
 arch/powe

[patch V2 04/18] x86/mm/highmem: Use generic kmap atomic implementation

2020-10-29 Thread Thomas Gleixner
Convert X86 to the generic kmap atomic implementation and make the
iomap_atomic() naming convention consistent while at it.

Signed-off-by: Thomas Gleixner 
Cc: x...@kernel.org
---
 arch/x86/Kconfig   |3 +-
 arch/x86/include/asm/fixmap.h  |1 
 arch/x86/include/asm/highmem.h |   12 ++--
 arch/x86/include/asm/iomap.h   |   18 ++--
 arch/x86/mm/highmem_32.c   |   59 -
 arch/x86/mm/init_32.c  |   15 --
 arch/x86/mm/iomap_32.c |   59 +++--
 include/linux/io-mapping.h |2 -
 8 files changed, 27 insertions(+), 142 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -14,10 +14,11 @@ config X86_32
select ARCH_WANT_IPC_PARSE_VERSION
select CLKSRC_I8253
select CLONE_BACKWARDS
+   select GENERIC_VDSO_32
select HAVE_DEBUG_STACKOVERFLOW
+   select KMAP_LOCAL
select MODULES_USE_ELF_REL
select OLD_SIGACTION
-   select GENERIC_VDSO_32
 
 config X86_64
def_bool y
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -151,7 +151,6 @@ extern void reserve_top_address(unsigned
 
 extern int fixmaps_set;
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
--- a/arch/x86/include/asm/highmem.h
+++ b/arch/x86/include/asm/highmem.h
@@ -58,11 +58,17 @@ extern unsigned long highstart_pfn, high
 #define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))
 
-void *kmap_atomic_pfn(unsigned long pfn);
-void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot);
-
 #define flush_cache_kmaps()do { } while (0)
 
+#definearch_kmap_local_post_map(vaddr, pteval) \
+   arch_flush_lazy_mmu_mode()
+
+#definearch_kmap_local_post_unmap(vaddr)   \
+   do {\
+   flush_tlb_one_kernel((vaddr));  \
+   arch_flush_lazy_mmu_mode(); \
+   } while (0)
+
 extern void add_highpages_with_active_regions(int nid, unsigned long start_pfn,
unsigned long end_pfn);
 
--- a/arch/x86/include/asm/iomap.h
+++ b/arch/x86/include/asm/iomap.h
@@ -9,19 +9,21 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-void __iomem *
-iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot);
+void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot);
 
-void
-iounmap_atomic(void __iomem *kvaddr);
+static inline void iounmap_atomic(void __iomem *vaddr)
+{
+   kunmap_local_indexed((void __force *)vaddr);
+   pagefault_enable();
+   preempt_enable();
+}
 
-int
-iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
+int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
 
-void
-iomap_free(resource_size_t base, unsigned long size);
+void iomap_free(resource_size_t base, unsigned long size);
 
 #endif /* _ASM_X86_IOMAP_H */
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -4,65 +4,6 @@
 #include  /* for totalram_pages */
 #include 
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   arch_flush_lazy_mmu_mode();
-
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-/*
- * This is the same as kmap_atomic() but can map memory that doesn't
- * have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn)
-{
-   return kmap_atomic_prot_pfn(pfn, kmap_prot);
-}
-EXPORT_SYMBOL_GPL(kmap_atomic_pfn);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-
-   if (vaddr >= __fix_to_virt(FIX_KMAP_END) &&
-   vaddr <= __fix_to_virt(FIX_KMAP_BEGIN)) {
-   int idx, type;
-
-   type = kmap_atomic_idx();
-   idx = type + KM_TYPE_NR * smp_processor_id();
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   WARN_ON_ONCE(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-#endif
-   /*
-* Force other mappings to Oops if they'll try to access this
-* pte without first remap it.  Keeping stale mappings around
-* is a bad idea also, in case the page changes cacheability
-* attributes or becomes a protected page in a hypervisor.
-*/
-   kpte_clear_flush(kmap_pte-idx, vaddr);
-   kmap_atomic_idx_pop();
-   arch_flush_lazy_mmu_mode();
-   }
-#ifdef CONFIG_DEBUG_HIGHMEM
-   else {

[patch V2 01/18] sched: Make migrate_disable/enable() independent of RT

2020-10-29 Thread Thomas Gleixner
Now that the scheduler can deal with migrate disable properly, there is no
real compelling reason to make it only available for RT.

There are quite a few code paths which needlessly disable preemption in
order to prevent migration, and some constructs like kmap_atomic() enforce
it implicitly.

Making it available independent of RT allows providing a preemptible
variant of kmap_atomic() and makes the code more consistent in general.
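
A sketch of the intended usage pattern ('my_state' is a hypothetical
per-CPU variable with a mutex in it):

        /* Sketch: pin to this CPU without disabling preemption. */
        migrate_disable();
        p = this_cpu_ptr(&my_state);
        /* may sleep or be preempted; p still points at this CPU's data */
        mutex_lock(&p->lock);
        /* update *p */
        mutex_unlock(&p->lock);
        migrate_enable();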

FIXME: Rework the comment in preempt.h

Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
---
 include/linux/preempt.h |   38 +++---
 include/linux/sched.h   |2 +-
 kernel/sched/core.c |   12 ++--
 kernel/sched/sched.h|2 +-
 lib/smp_processor_id.c  |2 +-
 5 files changed, 8 insertions(+), 48 deletions(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -322,7 +322,7 @@ static inline void preempt_notifier_init
 
 #endif
 
-#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
+#ifdef CONFIG_SMP
 
 /*
  * Migrate-Disable and why it is undesired.
@@ -382,43 +382,11 @@ static inline void preempt_notifier_init
 extern void migrate_disable(void);
 extern void migrate_enable(void);
 
-#elif defined(CONFIG_PREEMPT_RT)
+#else
 
 static inline void migrate_disable(void) { }
 static inline void migrate_enable(void) { }
 
-#else /* !CONFIG_PREEMPT_RT */
-
-/**
- * migrate_disable - Prevent migration of the current task
- *
- * Maps to preempt_disable() which also disables preemption. Use
- * migrate_disable() to annotate that the intent is to prevent migration,
- * but not necessarily preemption.
- *
- * Can be invoked nested like preempt_disable() and needs the corresponding
- * number of migrate_enable() invocations.
- */
-static __always_inline void migrate_disable(void)
-{
-   preempt_disable();
-}
-
-/**
- * migrate_enable - Allow migration of the current task
- *
- * Counterpart to migrate_disable().
- *
- * As migrate_disable() can be invoked nested, only the outermost invocation
- * reenables migration.
- *
- * Currently mapped to preempt_enable().
- */
-static __always_inline void migrate_enable(void)
-{
-   preempt_enable();
-}
-
-#endif /* CONFIG_SMP && CONFIG_PREEMPT_RT */
+#endif /* CONFIG_SMP */
 
 #endif /* __LINUX_PREEMPT_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -715,7 +715,7 @@ struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t   cpus_mask;
void*migration_pending;
-#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
+#ifdef CONFIG_SMP
unsigned short  migration_disabled;
 #endif
unsigned short  migration_flags;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1696,8 +1696,6 @@ void check_preempt_curr(struct rq *rq, s
 
 #ifdef CONFIG_SMP
 
-#ifdef CONFIG_PREEMPT_RT
-
 static void
 __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags);
 
@@ -1772,8 +1770,6 @@ static inline bool rq_has_pinned_tasks(s
return rq->nr_pinned;
 }
 
-#endif
-
 /*
  * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
@@ -2841,7 +2837,7 @@ void sched_set_stop_task(int cpu, struct
}
 }
 
-#else
+#else /* CONFIG_SMP */
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
 const struct cpumask *new_mask,
@@ -2850,10 +2846,6 @@ static inline int __set_cpus_allowed_ptr
return set_cpus_allowed_ptr(p, new_mask);
 }
 
-#endif /* CONFIG_SMP */
-
-#if !defined(CONFIG_SMP) || !defined(CONFIG_PREEMPT_RT)
-
 static inline void migrate_disable_switch(struct rq *rq, struct task_struct *p) { }
 
 static inline bool rq_has_pinned_tasks(struct rq *rq)
@@ -2861,7 +2853,7 @@ static inline bool rq_has_pinned_tasks(s
return false;
 }
 
-#endif
+#endif /* !CONFIG_SMP */
 
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1056,7 +1056,7 @@ struct rq {
struct cpuidle_state*idle_state;
 #endif
 
-#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+#ifdef CONFIG_SMP
unsigned intnr_pinned;
 #endif
unsigned intpush_busy;
--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -26,7 +26,7 @@ unsigned int check_preemption_disabled(c
if (current->nr_cpus_allowed == 1)
goto out;
 
-#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT)
+#ifdef CONFIG_SMP
if (current->migration_disabled)
goto out;
 #endif



[patch V2 02/18] mm/highmem: Un-EXPORT __kmap_atomic_idx()

2020-10-29 Thread Thomas Gleixner
Nothing in modules can use that.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
 mm/highmem.c |2 --
 1 file changed, 2 deletions(-)

--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -108,8 +108,6 @@ static inline wait_queue_head_t *get_pkm
 atomic_long_t _totalhigh_pages __read_mostly;
 EXPORT_SYMBOL(_totalhigh_pages);
 
-EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx);
-
 unsigned int nr_free_highpages (void)
 {
struct zone *zone;



[patch V2 03/18] highmem: Provide generic variant of kmap_atomic*

2020-10-29 Thread Thomas Gleixner
The kmap_atomic* interfaces in all architectures are pretty much the same
except for post-map operations (flush) and pre- and post-unmap operations.

Provide a generic variant for that.
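
Architectures then only fill in the fixup hooks, as the x86 patch later in
this series does with arch_kmap_local_post_map()/_post_unmap(). A
generic-looking sketch (flush_tlb_kernel_range() stands in for whatever
the architecture actually needs):

        /* Sketch: per-arch hooks around the generic map/unmap. */
        #define arch_kmap_local_post_map(vaddr, pteval) \
                do { } while (0)        /* e.g. lazy MMU flush on x86 */

        #define arch_kmap_local_post_unmap(vaddr)       \
                flush_tlb_kernel_range((vaddr), (vaddr) + PAGE_SIZE)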

Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: linux...@kvack.org
---
V2: Address review comments from Christoph (style and EXPORT variant)
---
 include/linux/highmem.h |   79 ++--
 mm/Kconfig  |3 +
 mm/highmem.c|  118 +++-
 3 files changed, 183 insertions(+), 17 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -31,9 +31,16 @@ static inline void invalidate_kernel_vma
 
 #include 
 
+/*
+ * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft.
+ */
+#ifdef CONFIG_KMAP_LOCAL
+void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
+void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
+void kunmap_local_indexed(void *vaddr);
+#endif
+
 #ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
-extern void kunmap_atomic_high(void *kvaddr);
 #include 
 
 #ifndef ARCH_HAS_KMAP_FLUSH_TLB
@@ -81,6 +88,11 @@ static inline void kunmap(struct page *p
  * be used in IRQ contexts, so in some (very limited) cases we need
  * it.
  */
+
+#ifndef CONFIG_KMAP_LOCAL
+void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
+void kunmap_atomic_high(void *kvaddr);
+
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
preempt_disable();
@@ -89,7 +101,38 @@ static inline void *kmap_atomic_prot(str
return page_address(page);
return kmap_atomic_high_prot(page, prot);
 }
-#define kmap_atomic(page)  kmap_atomic_prot(page, kmap_prot)
+
+static inline void __kunmap_atomic(void *vaddr)
+{
+   kunmap_atomic_high(vaddr);
+}
+#else /* !CONFIG_KMAP_LOCAL */
+
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_page_prot(page, prot);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   preempt_disable();
+   pagefault_disable();
+   return __kmap_local_pfn_prot(pfn, kmap_prot);
+}
+
+static inline void __kunmap_atomic(void *addr)
+{
+   kunmap_local_indexed(addr);
+}
+
+#endif /* CONFIG_KMAP_LOCAL */
+
+static inline void *kmap_atomic(struct page *page)
+{
+   return kmap_atomic_prot(page, kmap_prot);
+}
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
@@ -157,21 +200,28 @@ static inline void *kmap_atomic(struct p
pagefault_disable();
return page_address(page);
 }
-#define kmap_atomic_prot(page, prot)   kmap_atomic(page)
 
-static inline void kunmap_atomic_high(void *addr)
+static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+{
+   return kmap_atomic(page);
+}
+
+static inline void *kmap_atomic_pfn(unsigned long pfn)
+{
+   return kmap_atomic(pfn_to_page(pfn));
+}
+
+static inline void __kunmap_atomic(void *addr)
 {
/*
 * Mostly nothing to do in the CONFIG_HIGHMEM=n case as kunmap_atomic()
-* handles re-enabling faults + preemption
+* handles re-enabling faults and preemption
 */
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
kunmap_flush_on_unmap(addr);
 #endif
 }
 
-#define kmap_atomic_pfn(pfn)   kmap_atomic(pfn_to_page(pfn))
-
 #define kmap_flush_unused()    do {} while(0)
 
 #endif /* CONFIG_HIGHMEM */
@@ -213,15 +263,14 @@ static inline void kmap_atomic_idx_pop(v
  * Prevent people trying to call kunmap_atomic() as if it were kunmap()
  * kunmap_atomic() should get the return value of kmap_atomic, not the page.
  */
-#define kunmap_atomic(addr) \
-do {\
-   BUILD_BUG_ON(__same_type((addr), struct page *));   \
-   kunmap_atomic_high(addr);  \
-   pagefault_enable(); \
-   preempt_enable();   \
+#define kunmap_atomic(__addr)  \
+do {   \
+   BUILD_BUG_ON(__same_type((__addr), struct page *)); \
+   __kunmap_atomic(__addr);\
+   pagefault_enable(); \
+   preempt_enable();   \
 } while (0)
 
-
 /* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
 bool
 
+config KMAP_LOCAL
+   bool
+
 endmenu
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -30,6 +30,7 @@
 #include 
 #include 
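
As an aside (an illustrative sketch, not part of the patch): the generic
implementation keeps the stack-based contract of the per-architecture
variants, so nested atomic kmaps must be released in strict reverse order:

/* Copy between two highmem pages; unmap order is last-in/first-out. */
static void copy_highpage_sketch(struct page *dst, struct page *src)
{
        void *vfrom = kmap_atomic(src);
        void *vto = kmap_atomic(dst);

        copy_page(vto, vfrom);
        kunmap_atomic(vto);     /* mapped last, unmapped first */
        kunmap_atomic(vfrom);
}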
 

[patch V2 05/18] arc/mm/highmem: Use generic kmap atomic implementation

2020-10-29 Thread Thomas Gleixner
Adopt the map ordering to match the other architectures and the generic
code.

Signed-off-by: Thomas Gleixner 
Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
---
 arch/arc/Kconfig   |1 
 arch/arc/include/asm/highmem.h |8 ++-
 arch/arc/mm/highmem.c  |   44 -
 3 files changed, 9 insertions(+), 44 deletions(-)

--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -507,6 +507,7 @@ config LINUX_RAM_BASE
 config HIGHMEM
bool "High Memory Support"
select ARCH_DISCONTIGMEM_ENABLE
+   select KMAP_LOCAL
help
  With ARC 2G:2G address split, only upper 2G is directly addressable by
  kernel. Enable this to potentially allow access to rest of 2G and PAE
--- a/arch/arc/include/asm/highmem.h
+++ b/arch/arc/include/asm/highmem.h
@@ -15,7 +15,10 @@
 #define FIXMAP_BASE(PAGE_OFFSET - FIXMAP_SIZE - PKMAP_SIZE)
 #define FIXMAP_SIZEPGDIR_SIZE  /* only 1 PGD worth */
 #define KM_TYPE_NR ((FIXMAP_SIZE >> PAGE_SHIFT)/NR_CPUS)
-#define FIXMAP_ADDR(nr)(FIXMAP_BASE + ((nr) << PAGE_SHIFT))
+
+#define FIX_KMAP_BEGIN (0)
+#define FIX_KMAP_END   ((FIXMAP_SIZE >> PAGE_SHIFT) - 1)
+#define FIXADDR_TOP(FIXMAP_BASE + FIXMAP_SIZE - PAGE_SIZE)
 
 /* start after fixmap area */
 #define PKMAP_BASE (FIXMAP_BASE + FIXMAP_SIZE)
@@ -29,6 +32,9 @@
 
 extern void kmap_init(void);
 
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE)
+
 static inline void flush_cache_kmaps(void)
 {
flush_cache_all();
--- a/arch/arc/mm/highmem.c
+++ b/arch/arc/mm/highmem.c
@@ -47,48 +47,6 @@
  */
 
 extern pte_t * pkmap_page_table;
-static pte_t * fixmap_page_table;
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   int idx, cpu_idx;
-   unsigned long vaddr;
-
-   cpu_idx = kmap_atomic_idx_push();
-   idx = cpu_idx + KM_TYPE_NR * smp_processor_id();
-   vaddr = FIXMAP_ADDR(idx);
-
-   set_pte_at(&init_mm, vaddr, fixmap_page_table + idx,
-  mk_pte(page, prot));
-
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kv)
-{
-   unsigned long kvaddr = (unsigned long)kv;
-
-   if (kvaddr >= FIXMAP_BASE && kvaddr < (FIXMAP_BASE + FIXMAP_SIZE)) {
-
-   /*
-* Because preemption is disabled, this vaddr can be associated
-* with the current allocated index.
-* But in case of multiple live kmap_atomic(), it still relies on
-* callers to unmap in right order.
-*/
-   int cpu_idx = kmap_atomic_idx();
-   int idx = cpu_idx + KM_TYPE_NR * smp_processor_id();
-
-   WARN_ON(kvaddr != FIXMAP_ADDR(idx));
-
-   pte_clear(&init_mm, kvaddr, fixmap_page_table + idx);
-   local_flush_tlb_kernel_range(kvaddr, kvaddr + PAGE_SIZE);
-
-   kmap_atomic_idx_pop();
-   }
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
 
 static noinline pte_t * __init alloc_kmap_pgtable(unsigned long kvaddr)
 {
@@ -113,5 +71,5 @@ void __init kmap_init(void)
pkmap_page_table = alloc_kmap_pgtable(PKMAP_BASE);
 
BUILD_BUG_ON(LAST_PKMAP > PTRS_PER_PTE);
-   fixmap_page_table = alloc_kmap_pgtable(FIXMAP_BASE);
+   alloc_kmap_pgtable(FIXMAP_BASE);
 }



[patch V2 07/18] csky/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Acked-by: Guo Ren 
Cc: linux-c...@vger.kernel.org

---
 arch/csky/Kconfig   |1 
 arch/csky/include/asm/highmem.h |4 +-
 arch/csky/mm/highmem.c  |   75 
 3 files changed, 5 insertions(+), 75 deletions(-)

--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -286,6 +286,7 @@ config NR_CPUS
 config HIGHMEM
bool "High Memory Support"
depends on !CPU_CK610
+   select KMAP_LOCAL
default y
 
 config FORCE_MAX_ZONEORDER
--- a/arch/csky/include/asm/highmem.h
+++ b/arch/csky/include/asm/highmem.h
@@ -32,10 +32,12 @@ extern pte_t *pkmap_page_table;
 
 #define ARCH_HAS_KMAP_FLUSH_TLB
 extern void kmap_flush_tlb(unsigned long addr);
-extern void *kmap_atomic_pfn(unsigned long pfn);
 
 #define flush_cache_kmaps() do {} while (0)
 
+#define arch_kmap_local_post_map(vaddr, pteval)    kmap_flush_tlb(vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  kmap_flush_tlb(vaddr)
+
 extern void kmap_init(void);
 
 #endif /* __KERNEL__ */
--- a/arch/csky/mm/highmem.c
+++ b/arch/csky/mm/highmem.c
@@ -9,8 +9,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 unsigned long highstart_pfn, highend_pfn;
 
 void kmap_flush_tlb(unsigned long addr)
@@ -19,67 +17,7 @@ void kmap_flush_tlb(unsigned long addr)
 }
 EXPORT_SYMBOL(kmap_flush_tlb);
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte - idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   flush_tlb_one((unsigned long)vaddr);
-
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int idx;
-
-   if (vaddr < FIXADDR_START)
-   return;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   idx = KM_TYPE_NR*smp_processor_id() + kmap_atomic_idx();
-
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   pte_clear(&init_mm, vaddr, kmap_pte - idx);
-   flush_tlb_one(vaddr);
-#else
-   (void) idx; /* to kill a warning */
-#endif
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
-
-/*
- * This is the same as kmap_atomic() but can map memory that doesn't
- * have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   pagefault_disable();
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL));
-   flush_tlb_one(vaddr);
-
-   return (void *) vaddr;
-}
-
-static void __init kmap_pages_init(void)
+void __init kmap_init(void)
 {
unsigned long vaddr;
pgd_t *pgd;
@@ -96,14 +34,3 @@ static void __init kmap_pages_init(void)
pte = pte_offset_kernel(pmd, vaddr);
pkmap_page_table = pte;
 }
-
-void __init kmap_init(void)
-{
-   unsigned long vaddr;
-
-   kmap_pages_init();
-
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN);
-
-   kmap_pte = pte_offset_kernel((pmd_t *)pgd_offset_k(vaddr), vaddr);
-}



[patch V2 06/18] ARM: highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Russell King 
Cc: Arnd Bergmann 
Cc: linux-arm-ker...@lists.infradead.org

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..410235e350cc 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1499,6 +1499,7 @@ config HAVE_ARCH_PFN_VALID
 config HIGHMEM
bool "High Memory Support"
depends on MMU
+   select KMAP_LOCAL
help
  The address space of ARM processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
diff --git a/arch/arm/include/asm/highmem.h b/arch/arm/include/asm/highmem.h
index 31811be38d78..99a99862c474 100644
--- a/arch/arm/include/asm/highmem.h
+++ b/arch/arm/include/asm/highmem.h
@@ -46,19 +46,32 @@ extern pte_t *pkmap_page_table;
 
 #ifdef ARCH_NEEDS_KMAP_HIGH_GET
 extern void *kmap_high_get(struct page *page);
-#else
+
+static inline void *arch_kmap_local_high_get(struct page *page)
+{
+   if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !cache_is_vivt())
+   return NULL;
+   return kmap_high_get(page);
+}
+#define arch_kmap_local_high_get arch_kmap_local_high_get
+
+#else /* ARCH_NEEDS_KMAP_HIGH_GET */
 static inline void *kmap_high_get(struct page *page)
 {
return NULL;
 }
-#endif
+#endif /* !ARCH_NEEDS_KMAP_HIGH_GET */
 
-/*
- * The following functions are already defined by 
- * when CONFIG_HIGHMEM is not set.
- */
-#ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_pfn(unsigned long pfn);
-#endif
+#define arch_kmap_local_post_map(vaddr, pteval)    \
+   local_flush_tlb_kernel_page(vaddr)
+
+#define arch_kmap_local_pre_unmap(vaddr)   \
+do {   \
+   if (cache_is_vivt())\
+   __cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE); \
+} while (0)
+
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_page(vaddr)
 
 #endif
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 7cb1699fbfc4..c4ce477c5261 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -19,7 +19,6 @@ obj-$(CONFIG_MODULES) += proc-syms.o
 obj-$(CONFIG_DEBUG_VIRTUAL)+= physaddr.o
 
 obj-$(CONFIG_ALIGNMENT_TRAP)   += alignment.o
-obj-$(CONFIG_HIGHMEM)  += highmem.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_ARM_PV_FIXUP) += pv-fixup-asm.o
 
diff --git a/arch/arm/mm/highmem.c b/arch/arm/mm/highmem.c
deleted file mode 100644
index 187fab227b50..
--- a/arch/arm/mm/highmem.c
+++ /dev/null
@@ -1,121 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * arch/arm/mm/highmem.c -- ARM highmem support
- *
- * Author: Nicolas Pitre
- * Created:september 8, 2008
- * Copyright:  Marvell Semiconductors Inc.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "mm.h"
-
-static inline void set_fixmap_pte(int idx, pte_t pte)
-{
-   unsigned long vaddr = __fix_to_virt(idx);
-   pte_t *ptep = virt_to_kpte(vaddr);
-
-   set_pte_ext(ptep, pte, 0);
-   local_flush_tlb_kernel_page(vaddr);
-}
-
-static inline pte_t get_fixmap_pte(unsigned long vaddr)
-{
-   pte_t *ptep = virt_to_kpte(vaddr);
-
-   return *ptep;
-}
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned int idx;
-   unsigned long vaddr;
-   void *kmap;
-   int type;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   /*
-* There is no cache coherency issue when non VIVT, so force the
-* dedicated kmap usage for better debugging purposes in that case.
-*/
-   if (!cache_is_vivt())
-   kmap = NULL;
-   else
-#endif
-   kmap = kmap_high_get(page);
-   if (kmap)
-   return kmap;
-
-   type = kmap_atomic_idx_push();
-
-   idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();
-   vaddr = __fix_to_virt(idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   /*
-* With debugging enabled, kunmap_atomic forces that entry to 0.
-* Make sure it was indeed properly unmapped.
-*/
-   BUG_ON(!pte_none(get_fixmap_pte(vaddr)));
-#endif
-   /*
-* When debugging is off, kunmap_atomic leaves the previous mapping
-* in place, so the contained TLB flush ensures the TLB is updated
-* with the new mapping.
-*/
-   set_fixmap_pte(idx, mk_pte(page, prot));
-
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int idx, type;
-
-   if (kvaddr >= (void *)FIXADDR_START) {
-   type = kmap_atomic_idx();
-   idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();
-
-   if (cache_i

[patch V2 08/18] microblaze/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Michal Simek 

diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..186a0526564c 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -170,6 +170,7 @@ config XILINX_UNCACHED_SHADOW
 config HIGHMEM
bool "High memory support"
depends on MMU
+   select KMAP_LOCAL
help
  The address space of Microblaze processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
diff --git a/arch/microblaze/include/asm/highmem.h b/arch/microblaze/include/asm/highmem.h
index 284ca8fb54c1..4418633fb163 100644
--- a/arch/microblaze/include/asm/highmem.h
+++ b/arch/microblaze/include/asm/highmem.h
@@ -25,7 +25,6 @@
 #include 
 #include 
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 /*
@@ -52,6 +51,11 @@ extern pte_t *pkmap_page_table;
 
 #define flush_cache_kmaps()    { flush_icache(); flush_dcache(); }
 
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   local_flush_tlb_page(NULL, vaddr);
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_page(NULL, vaddr);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
diff --git a/arch/microblaze/mm/Makefile b/arch/microblaze/mm/Makefile
index 1b16875cea70..8ced71100047 100644
--- a/arch/microblaze/mm/Makefile
+++ b/arch/microblaze/mm/Makefile
@@ -6,4 +6,3 @@
 obj-y := consistent.o init.o
 
 obj-$(CONFIG_MMU) += pgtable.o mmu_context.o fault.o
-obj-$(CONFIG_HIGHMEM) += highmem.o
diff --git a/arch/microblaze/mm/highmem.c b/arch/microblaze/mm/highmem.c
deleted file mode 100644
index 92e0890416c9..
--- a/arch/microblaze/mm/highmem.c
+++ /dev/null
@@ -1,78 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * highmem.c: virtual kernel memory mappings for high memory
- *
- * PowerPC version, stolen from the i386 version.
- *
- * Used in CONFIG_HIGHMEM systems for memory pages which
- * are not addressable by direct kernel virtual addresses.
- *
- * Copyright (C) 1999 Gerhard Wichert, Siemens AG
- *   gerhard.wich...@pdb.siemens.de
- *
- *
- * Redesigned the x86 32-bit VM architecture to deal with
- * up to 16 Terrabyte physical memory. With current x86 CPUs
- * we now support up to 64 Gigabytes physical RAM.
- *
- * Copyright (C) 1999 Ingo Molnar 
- *
- * Reworked for PowerPC by various contributors. Moved from
- * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp.
- */
-
-#include 
-#include 
-
-/*
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need
- * it.
- */
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-#endif
-   set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot));
-   local_flush_tlb_page(NULL, vaddr);
-
-   return (void *) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type;
-   unsigned int idx;
-
-   if (vaddr < __fix_to_virt(FIX_KMAP_END))
-   return;
-
-   type = kmap_atomic_idx();
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-#endif
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   local_flush_tlb_page(NULL, vaddr);
-
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 3344d4a1fe89..3f4e41787a4e 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -49,17 +49,11 @@ unsigned long lowmem_size;
 EXPORT_SYMBOL(min_low_pfn);
 EXPORT_SYMBOL(max_low_pfn);
 
-#ifdef CONFIG_HIGHMEM
-pte_t *kmap_pte;
-EXPORT_SYMBOL(kmap_pte);
-
 static void __init highmem_init(void)
 {
pr_debug("%x\n", (u32)PKMAP_BASE);
map_page(PKMAP_BASE, 0, 0); /* XXX gross */
pkmap_page_table = virt_to_kpte(PKMAP_BASE);
-
-   kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN));
 }
 
 static void highmem_setup(void)



[patch V2 09/18] mips/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Thomas Bogendoerfer 
Cc: linux-m...@vger.kernel.org

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8f328298f8cc..ed6b3de944a8 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2654,6 +2654,7 @@ config MIPS_CRC_SUPPORT
 config HIGHMEM
bool "High Memory Support"
	depends on 32BIT && CPU_SUPPORTS_HIGHMEM && SYS_SUPPORTS_HIGHMEM && !CPU_MIPS32_3_5_EVA
+   select KMAP_LOCAL
 
 config CPU_SUPPORTS_HIGHMEM
bool
diff --git a/arch/mips/include/asm/highmem.h b/arch/mips/include/asm/highmem.h
index f1f788b57166..cb2e0fb8483b 100644
--- a/arch/mips/include/asm/highmem.h
+++ b/arch/mips/include/asm/highmem.h
@@ -48,11 +48,11 @@ extern pte_t *pkmap_page_table;
 
 #define ARCH_HAS_KMAP_FLUSH_TLB
 extern void kmap_flush_tlb(unsigned long addr);
-extern void *kmap_atomic_pfn(unsigned long pfn);
 
 #define flush_cache_kmaps()    BUG_ON(cpu_has_dc_aliases)
 
-extern void kmap_init(void);
+#define arch_kmap_local_post_map(vaddr, pteval)    local_flush_tlb_one(vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  local_flush_tlb_one(vaddr)
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/mips/mm/highmem.c b/arch/mips/mm/highmem.c
index 5fec7f45d79a..57e2f08f00d0 100644
--- a/arch/mips/mm/highmem.c
+++ b/arch/mips/mm/highmem.c
@@ -8,8 +8,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 unsigned long highstart_pfn, highend_pfn;
 
 void kmap_flush_tlb(unsigned long addr)
@@ -17,78 +15,3 @@ void kmap_flush_tlb(unsigned long addr)
flush_tlb_one(addr);
 }
 EXPORT_SYMBOL(kmap_flush_tlb);
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte - idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-   local_flush_tlb_one((unsigned long)vaddr);
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type __maybe_unused;
-
-   if (vaddr < FIXADDR_START)
-   return;
-
-   type = kmap_atomic_idx();
-#ifdef CONFIG_DEBUG_HIGHMEM
-   {
-   int idx = type + KM_TYPE_NR * smp_processor_id();
-
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   local_flush_tlb_one(vaddr);
-   }
-#endif
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
-
-/*
- * This is the same as kmap_atomic() but can map memory that doesn't
- * have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   preempt_disable();
-   pagefault_disable();
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL));
-   flush_tlb_one(vaddr);
-
-   return (void*) vaddr;
-}
-
-void __init kmap_init(void)
-{
-   unsigned long kmap_vstart;
-
-   /* cache the first kmap pte */
-   kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
-   kmap_pte = virt_to_kpte(kmap_vstart);
-}
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 6c7bbfe35ba3..e5de8e9c2ede 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -402,9 +402,6 @@ void __init paging_init(void)
 
pagetable_init();
 
-#ifdef CONFIG_HIGHMEM
-   kmap_init();
-#endif
 #ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
 #endif



[patch V2 10/18] nds32/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
The mapping code is odd and looks broken. See FIXME in the comment.

Signed-off-by: Thomas Gleixner 
Cc: Nick Hu 
Cc: Greentime Hu 
Cc: Vincent Chen 

diff --git a/arch/nds32/Kconfig.cpu b/arch/nds32/Kconfig.cpu
index f88a12fdf0f3..c7add11ea36e 100644
--- a/arch/nds32/Kconfig.cpu
+++ b/arch/nds32/Kconfig.cpu
@@ -157,6 +157,7 @@ config HW_SUPPORT_UNALIGNMENT_ACCESS
 config HIGHMEM
bool "High Memory Support"
depends on MMU && !CPU_CACHE_ALIASING
+   select KMAP_LOCAL
help
  The address space of Andes processors is only 4 Gigabytes large
  and it has to accommodate user address space, kernel address
diff --git a/arch/nds32/include/asm/highmem.h b/arch/nds32/include/asm/highmem.h
index fe986d0e6e3f..d844c282c090 100644
--- a/arch/nds32/include/asm/highmem.h
+++ b/arch/nds32/include/asm/highmem.h
@@ -45,11 +45,22 @@ extern pte_t *pkmap_page_table;
 extern void kmap_init(void);
 
 /*
- * The following functions are already defined by 
- * when CONFIG_HIGHMEM is not set.
+ * FIXME: The below looks broken vs. a kmap_atomic() in task context which
+ * is interrupted and another kmap_atomic() happens in interrupt context.
+ * But what do I know about nds32. -- tglx
  */
-#ifdef CONFIG_HIGHMEM
-extern void *kmap_atomic_pfn(unsigned long pfn);
-#endif
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   do {\
+   __nds32__tlbop_inv(vaddr);  \
+   __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN); \
+   __nds32__tlbop_rwr(pteval); \
+   __nds32__isb(); \
+   } while (0)
+
+#define arch_kmap_local_pre_unmap(vaddr, pte)  \
+   do {\
+   __nds32__tlbop_inv(vaddr);  \
+   __nds32__isb(); \
+   } while (0)
 
 #endif
diff --git a/arch/nds32/mm/Makefile b/arch/nds32/mm/Makefile
index 897ecaf5cf54..14fb2e8eb036 100644
--- a/arch/nds32/mm/Makefile
+++ b/arch/nds32/mm/Makefile
@@ -3,7 +3,6 @@ obj-y := extable.o tlb.o fault.o init.o mmap.o \
   mm-nds32.o cacheflush.o proc.o
 
 obj-$(CONFIG_ALIGNMENT_TRAP)   += alignment.o
-obj-$(CONFIG_HIGHMEM)   += highmem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_proc.o = $(CC_FLAGS_FTRACE)
diff --git a/arch/nds32/mm/highmem.c b/arch/nds32/mm/highmem.c
deleted file mode 100644
index 4284cd59e21a..
--- a/arch/nds32/mm/highmem.c
+++ /dev/null
@@ -1,48 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (C) 2005-2017 Andes Technology Corporation
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned int idx;
-   unsigned long vaddr, pte;
-   int type;
-   pte_t *ptep;
-
-   type = kmap_atomic_idx_push();
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   pte = (page_to_pfn(page) << PAGE_SHIFT) | prot;
-   ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr);
-   set_pte(ptep, pte);
-
-   __nds32__tlbop_inv(vaddr);
-   __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN);
-   __nds32__tlbop_rwr(pte);
-   __nds32__isb();
-   return (void *)vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   if (kvaddr >= (void *)FIXADDR_START) {
-   unsigned long vaddr = (unsigned long)kvaddr;
-   pte_t *ptep;
-   kmap_atomic_idx_pop();
-   __nds32__tlbop_inv(vaddr);
-   __nds32__isb();
-   ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr);
-   set_pte(ptep, 0);
-   }
-}
-EXPORT_SYMBOL(kunmap_atomic_high);



[patch V2 11/18] powerpc/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig   |1 
 arch/powerpc/include/asm/highmem.h |6 ++-
 arch/powerpc/mm/Makefile   |1 
 arch/powerpc/mm/highmem.c  |   67 -
 arch/powerpc/mm/mem.c  |7 ---
 5 files changed, 6 insertions(+), 76 deletions(-)

--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -409,6 +409,7 @@ menu "Kernel options"
 config HIGHMEM
bool "High memory support"
depends on PPC32
+   select KMAP_LOCAL
 
 source "kernel/Kconfig.hz"
 
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -29,7 +29,6 @@
 #include 
 #include 
 
-extern pte_t *kmap_pte;
 extern pte_t *pkmap_page_table;
 
 /*
@@ -60,6 +59,11 @@ extern pte_t *pkmap_page_table;
 
 #define flush_cache_kmaps()    flush_cache_all()
 
+#define arch_kmap_local_post_map(vaddr, pteval)\
+   local_flush_tlb_page(NULL, vaddr)
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_page(NULL, vaddr)
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -16,7 +16,6 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += num
 obj-$(CONFIG_PPC_MM_SLICES)+= slice.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
-obj-$(CONFIG_HIGHMEM)  += highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE)   += copro_fault.o
 obj-$(CONFIG_PPC_PTDUMP)   += ptdump/
 obj-$(CONFIG_KASAN)+= kasan/
--- a/arch/powerpc/mm/highmem.c
+++ /dev/null
@@ -1,67 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * highmem.c: virtual kernel memory mappings for high memory
- *
- * PowerPC version, stolen from the i386 version.
- *
- * Used in CONFIG_HIGHMEM systems for memory pages which
- * are not addressable by direct kernel virtual addresses.
- *
- * Copyright (C) 1999 Gerhard Wichert, Siemens AG
- *   gerhard.wich...@pdb.siemens.de
- *
- *
- * Redesigned the x86 32-bit VM architecture to deal with
- * up to 16 Terrabyte physical memory. With current x86 CPUs
- * we now support up to 64 Gigabytes physical RAM.
- *
- * Copyright (C) 1999 Ingo Molnar 
- *
- * Reworked for PowerPC by various contributors. Moved from
- * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp.
- */
-
-#include 
-#include 
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   int idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   WARN_ON(IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !pte_none(*(kmap_pte - idx)));
-   __set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot), 1);
-   local_flush_tlb_page(NULL, vaddr);
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-
-   if (vaddr < __fix_to_virt(FIX_KMAP_END))
-   return;
-
-   if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) {
-   int type = kmap_atomic_idx();
-   unsigned int idx;
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-   WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
-
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   local_flush_tlb_page(NULL, vaddr);
-   }
-
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -61,11 +61,6 @@
 unsigned long long memory_limit;
 bool init_mem_is_free;
 
-#ifdef CONFIG_HIGHMEM
-pte_t *kmap_pte;
-EXPORT_SYMBOL(kmap_pte);
-#endif
-
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
@@ -235,8 +230,6 @@ void __init paging_init(void)
 
map_kernel_page(PKMAP_BASE, 0, __pgprot(0));/* XXX gross */
pkmap_page_table = virt_to_kpte(PKMAP_BASE);
-
-   kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN));
 #endif /* CONFIG_HIGHMEM */
 
printk(KERN_DEBUG "Top of RAM: 0x%llx, Total RAM: 0x%llx\n",



[patch V2 12/18] sparc/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org

---
 arch/sparc/Kconfig   |1 
 arch/sparc/include/asm/highmem.h |7 +-
 arch/sparc/mm/Makefile   |3 -
 arch/sparc/mm/highmem.c  |  115 ---
 arch/sparc/mm/srmmu.c|2 
 5 files changed, 6 insertions(+), 122 deletions(-)

--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -139,6 +139,7 @@ config MMU
 config HIGHMEM
bool
default y if SPARC32
+   select KMAP_LOCAL
 
 config ZONE_DMA
bool
--- a/arch/sparc/include/asm/highmem.h
+++ b/arch/sparc/include/asm/highmem.h
@@ -33,8 +33,6 @@ extern unsigned long highstart_pfn, high
 #define kmap_prot __pgprot(SRMMU_ET_PTE | SRMMU_PRIV | SRMMU_CACHE)
 extern pte_t *pkmap_page_table;
 
-void kmap_init(void) __init;
-
 /*
  * Right now we initialize only a single pte table. It can be extended
  * easily, subsequent pte tables have to be allocated in one physical
@@ -53,6 +51,11 @@ void kmap_init(void) __init;
 
 #define flush_cache_kmaps()    flush_cache_all()
 
+/* FIXME: Use __flush_tlb_one(vaddr) instead of flush_cache_all() -- Anton */
+#define arch_kmap_local_post_map(vaddr, pteval)    flush_cache_all()
+#define arch_kmap_local_post_unmap(vaddr)  flush_cache_all()
+
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_HIGHMEM_H */
--- a/arch/sparc/mm/Makefile
+++ b/arch/sparc/mm/Makefile
@@ -15,6 +15,3 @@ obj-$(CONFIG_SPARC32)   += leon_mm.o
 
 # Only used by sparc64
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-
-# Only used by sparc32
-obj-$(CONFIG_HIGHMEM)   += highmem.o
--- a/arch/sparc/mm/highmem.c
+++ /dev/null
@@ -1,115 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- *  highmem.c: virtual kernel memory mappings for high memory
- *
- *  Provides kernel-static versions of atomic kmap functions originally
- *  found as inlines in include/asm-sparc/highmem.h.  These became
- *  needed as kmap_atomic() and kunmap_atomic() started getting
- *  called from within modules.
- *  -- Tomas Szepe , September 2002
- *
- *  But kmap_atomic() and kunmap_atomic() cannot be inlined in
- *  modules because they are loaded with btfixup-ped functions.
- */
-
-/*
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need it.
- *
- * XXX This is an old text. Actually, it's good to use atomic kmaps,
- * provided you remember that they are atomic and not try to sleep
- * with a kmap taken, much like a spinlock. Non-atomic kmaps are
- * shared by CPUs, and so precious, and establishing them requires IPI.
- * Atomic kmaps are lightweight and we may have NCPUS more of them.
- */
-#include 
-#include 
-#include 
-
-#include 
-#include 
-#include 
-
-static pte_t *kmap_pte;
-
-void __init kmap_init(void)
-{
-   unsigned long address = __fix_to_virt(FIX_KMAP_BEGIN);
-
-/* cache the first kmap pte */
-kmap_pte = virt_to_kpte(address);
-}
-
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
-{
-   unsigned long vaddr;
-   long idx, type;
-
-   type = kmap_atomic_idx_push();
-   idx = type + KM_TYPE_NR*smp_processor_id();
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-
-/* XXX Fix - Anton */
-#if 0
-   __flush_cache_one(vaddr);
-#else
-   flush_cache_all();
-#endif
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte-idx)));
-#endif
-   set_pte(kmap_pte-idx, mk_pte(page, prot));
-/* XXX Fix - Anton */
-#if 0
-   __flush_tlb_one(vaddr);
-#else
-   flush_tlb_all();
-#endif
-
-   return (void*) vaddr;
-}
-EXPORT_SYMBOL(kmap_atomic_high_prot);
-
-void kunmap_atomic_high(void *kvaddr)
-{
-   unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
-   int type;
-
-   if (vaddr < FIXADDR_START)
-   return;
-
-   type = kmap_atomic_idx();
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   {
-   unsigned long idx;
-
-   idx = type + KM_TYPE_NR * smp_processor_id();
-   BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx));
-
-   /* XXX Fix - Anton */
-#if 0
-   __flush_cache_one(vaddr);
-#else
-   flush_cache_all();
-#endif
-
-   /*
-* force other mappings to Oops if they'll try to access
-* this pte without first remap it
-*/
-   pte_clear(&init_mm, vaddr, kmap_pte-idx);
-   /* XXX Fix - Anton */
-#if 0
-   __flush_tlb_one(vaddr);
-#else
-   flush_tlb_all();
-#endif
-   }
-#endif
-
-   kmap_atomic_idx_pop();
-}
-EXPORT_SYMBOL(kunmap_atomic_high);
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -971,8 +971,6 @@ void __init srmmu_paging_init(void)
 
sparc_context_init(num_contexts);
 
-  

[patch V2 13/18] xtensa/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread Thomas Gleixner
No reason to have the same code in every architecture.

Signed-off-by: Thomas Gleixner 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: linux-xte...@linux-xtensa.org

---
 arch/xtensa/Kconfig   |1 
 arch/xtensa/include/asm/highmem.h |9 +++
 arch/xtensa/mm/highmem.c  |   44 +++---
 3 files changed, 14 insertions(+), 40 deletions(-)

--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -666,6 +666,7 @@ endchoice
 config HIGHMEM
bool "High Memory Support"
depends on MMU
+   select KMAP_LOCAL
help
  Linux can use the full amount of RAM in the system by
  default. However, the default MMUv2 setup only maps the
--- a/arch/xtensa/include/asm/highmem.h
+++ b/arch/xtensa/include/asm/highmem.h
@@ -68,6 +68,15 @@ static inline void flush_cache_kmaps(voi
flush_cache_all();
 }
 
+enum fixed_addresses kmap_local_map_idx(int type, unsigned long pfn);
+#define arch_kmap_local_map_idx    kmap_local_map_idx
+
+enum fixed_addresses kmap_local_unmap_idx(int type, unsigned long addr);
+#define arch_kmap_local_unmap_idx  kmap_local_unmap_idx
+
+#define arch_kmap_local_post_unmap(vaddr)  \
+   local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE)
+
 void kmap_init(void);
 
 #endif
--- a/arch/xtensa/mm/highmem.c
+++ b/arch/xtensa/mm/highmem.c
@@ -12,8 +12,6 @@
 #include 
 #include 
 
-static pte_t *kmap_pte;
-
 #if DCACHE_WAY_SIZE > PAGE_SIZE
 unsigned int last_pkmap_nr_arr[DCACHE_N_COLORS];
 wait_queue_head_t pkmap_map_wait_arr[DCACHE_N_COLORS];
@@ -37,55 +35,21 @@ static inline enum fixed_addresses kmap_
color;
 }
 
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot)
+enum fixed_addresses kmap_local_map_idx(int type, unsigned long pfn)
 {
-   enum fixed_addresses idx;
-   unsigned long vaddr;
-
-   idx = kmap_idx(kmap_atomic_idx_push(),
-  DCACHE_ALIAS(page_to_phys(page)));
-   vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
-   BUG_ON(!pte_none(*(kmap_pte + idx)));
-#endif
-   set_pte(kmap_pte + idx, mk_pte(page, prot));
-
-   return (void *)vaddr;
+   return kmap_idx(type, DCACHE_ALIAS(pfn << PAGE_SHIFT));
 }
-EXPORT_SYMBOL(kmap_atomic_high_prot);
 
-void kunmap_atomic_high(void *kvaddr)
+enum fixed_addresses kmap_local_unmap_idx(int type, unsigned long addr)
 {
-   if (kvaddr >= (void *)FIXADDR_START &&
-   kvaddr < (void *)FIXADDR_TOP) {
-   int idx = kmap_idx(kmap_atomic_idx(),
-  DCACHE_ALIAS((unsigned long)kvaddr));
-
-   /*
-* Force other mappings to Oops if they'll try to access this
-* pte without first remap it.  Keeping stale mappings around
-* is a bad idea also, in case the page changes cacheability
-* attributes or becomes a protected page in a hypervisor.
-*/
-   pte_clear(&init_mm, kvaddr, kmap_pte + idx);
-   local_flush_tlb_kernel_range((unsigned long)kvaddr,
-(unsigned long)kvaddr + PAGE_SIZE);
-
-   kmap_atomic_idx_pop();
-   }
+   return kmap_idx(type, DCACHE_ALIAS(addr));
 }
-EXPORT_SYMBOL(kunmap_atomic_high);
 
 void __init kmap_init(void)
 {
-   unsigned long kmap_vstart;
-
/* Check if this memory layout is broken because PKMAP overlaps
 * page table.
 */
BUILD_BUG_ON(PKMAP_BASE < TLBTEMP_BASE_1 + TLBTEMP_SIZE);
-   /* cache the first kmap pte */
-   kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
-   kmap_pte = virt_to_kpte(kmap_vstart);
kmap_waitqueues_init();
 }



[patch V2 16/18] sched: highmem: Store local kmaps in task struct

2020-10-29 Thread Thomas Gleixner
Instead of storing the map per CPU, provide and use per-task storage. That
prepares for local kmaps which are preemptible.

The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. It will be made usable in the
next step.

The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified, so any nesting kmap in an interrupt will use unused indices,
and on return the counter is the same as before.
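
A rough sketch of that scheme (hypothetical and simplified, not the actual
patch body; clear_kmap_slot() and restore_kmap_slot() are made-up helpers):

/* kmap_ctrl.idx is deliberately left untouched, so an interrupt
 * nesting in here pushes onto indices that are currently unused. */
static void sched_out_sketch(struct task_struct *tsk)
{
        int i;

        for (i = 0; i < tsk->kmap_ctrl.idx; i++)
                clear_kmap_slot(i);     /* made-up helper */
}

static void sched_in_sketch(struct task_struct *tsk)
{
        int i;

        for (i = 0; i < tsk->kmap_ctrl.idx; i++)
                restore_kmap_slot(i, tsk->kmap_ctrl.pteval[i]);
}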

Also add an assert into the return to user space code. Going back to user
space with an active local kmap is a no-no.

Signed-off-by: Thomas Gleixner 
---
 include/linux/highmem.h |   10 +
 include/linux/sched.h   |9 
 kernel/entry/common.c   |2 +
 kernel/fork.c   |1 
 kernel/sched/core.c |   18 +
 mm/highmem.c|   96 +---
 6 files changed, 123 insertions(+), 13 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -38,6 +38,16 @@ static inline void invalidate_kernel_vma
 void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
 void *__kmap_local_page_prot(struct page *page, pgprot_t prot);
 void kunmap_local_indexed(void *vaddr);
+void kmap_local_fork(struct task_struct *tsk);
+void __kmap_local_sched_out(void);
+void __kmap_local_sched_in(void);
+static inline void kmap_assert_nomap(void)
+{
+   DEBUG_LOCKS_WARN_ON(current->kmap_ctrl.idx);
+}
+#else
+static inline void kmap_local_fork(struct task_struct *tsk) { }
+static inline void kmap_assert_nomap(void) { }
 #endif
 
 #ifdef CONFIG_HIGHMEM
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -629,6 +630,13 @@ struct wake_q_node {
struct wake_q_node *next;
 };
 
+struct kmap_ctrl {
+#ifdef CONFIG_KMAP_LOCAL
+   int idx;
+   pte_t   pteval[KM_TYPE_NR];
+#endif
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1294,6 +1302,7 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+   struct kmap_ctrlkmap_ctrl;
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long   task_state_change;
 #endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -194,6 +195,7 @@ static void exit_to_user_mode_prepare(st
 
/* Ensure that the address limit is intact and no locks are held */
addr_limit_user_check();
+   kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
 }
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -930,6 +930,7 @@ static struct task_struct *dup_task_stru
account_kernel_stack(tsk, 1);
 
kcov_task_init(tsk);
+   kmap_local_fork(tsk);
 
 #ifdef CONFIG_FAULT_INJECTION
tsk->fail_nth = 0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4053,6 +4053,22 @@ static inline void finish_lock_switch(st
 # define finish_arch_post_lock_switch()do { } while (0)
 #endif
 
+static inline void kmap_local_sched_out(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_out();
+#endif
+}
+
+static inline void kmap_local_sched_in(void)
+{
+#ifdef CONFIG_KMAP_LOCAL
+   if (unlikely(current->kmap_ctrl.idx))
+   __kmap_local_sched_in();
+#endif
+}
+
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -4075,6 +4091,7 @@ prepare_task_switch(struct rq *rq, struc
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
fire_sched_out_preempt_notifiers(prev, next);
+   kmap_local_sched_out();
prepare_task(next);
prepare_arch_switch(next);
 }
@@ -4141,6 +4158,7 @@ static struct rq *finish_task_switch(str
finish_lock_switch(rq);
finish_arch_post_lock_switch();
kcov_finish_switch(current);
+   kmap_local_sched_in();
 
fire_sched_in_preempt_notifiers(current);
/*
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -367,27 +367,24 @@ EXPORT_SYMBOL(kunmap_high);
 
 #ifdef CONFIG_KMAP_LOCAL
 
-static DEFINE_PER_CPU(int, __kmap_atomic_idx);
-
-static inline int kmap_atomic_idx_push(void)
+static inline int kmap_local_idx_push(void)
 {
-   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
+   int idx = current->kmap_ctrl.idx++;
 
WARN_ON_ONCE(in_irq() && !irqs_disabled());
BUG_ON(idx >= KM_TYPE_NR);
return idx;
 }
 
-static inline int kmap_atomic_idx(void)
+static inline int kmap_local_idx(void)
 {
-   return __this_cpu_read(__kmap_atomic_idx) - 1

[patch V2 14/18] mm/highmem: Remove the old kmap_atomic cruft

2020-10-29 Thread Thomas Gleixner
All users gone.

Signed-off-by: Thomas Gleixner 

---
 include/linux/highmem.h |   61 ++--
 mm/highmem.c|   28 ++
 2 files changed, 27 insertions(+), 62 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -88,31 +88,16 @@ static inline void kunmap(struct page *p
  * be used in IRQ contexts, so in some (very limited) cases we need
  * it.
  */
-
-#ifndef CONFIG_KMAP_LOCAL
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);
-void kunmap_atomic_high(void *kvaddr);
-
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
preempt_disable();
pagefault_disable();
-   if (!PageHighMem(page))
-   return page_address(page);
-   return kmap_atomic_high_prot(page, prot);
-}
-
-static inline void __kunmap_atomic(void *vaddr)
-{
-   kunmap_atomic_high(vaddr);
+   return __kmap_local_page_prot(page, prot);
 }
-#else /* !CONFIG_KMAP_LOCAL */
 
-static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
+static inline void *kmap_atomic(struct page *page)
 {
-   preempt_disable();
-   pagefault_disable();
-   return __kmap_local_page_prot(page, prot);
+   return kmap_atomic_prot(page, kmap_prot);
 }
 
 static inline void *kmap_atomic_pfn(unsigned long pfn)
@@ -127,13 +112,6 @@ static inline void __kunmap_atomic(void
kunmap_local_indexed(addr);
 }
 
-#endif /* CONFIG_KMAP_LOCAL */
-
-static inline void *kmap_atomic(struct page *page)
-{
-   return kmap_atomic_prot(page, kmap_prot);
-}
-
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
 extern atomic_long_t _totalhigh_pages;
@@ -226,39 +204,6 @@ static inline void __kunmap_atomic(void
 
 #endif /* CONFIG_HIGHMEM */
 
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
-
-DECLARE_PER_CPU(int, __kmap_atomic_idx);
-
-static inline int kmap_atomic_idx_push(void)
-{
-   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
-
-#ifdef CONFIG_DEBUG_HIGHMEM
-   WARN_ON_ONCE(in_irq() && !irqs_disabled());
-   BUG_ON(idx >= KM_TYPE_NR);
-#endif
-   return idx;
-}
-
-static inline int kmap_atomic_idx(void)
-{
-   return __this_cpu_read(__kmap_atomic_idx) - 1;
-}
-
-static inline void kmap_atomic_idx_pop(void)
-{
-#ifdef CONFIG_DEBUG_HIGHMEM
-   int idx = __this_cpu_dec_return(__kmap_atomic_idx);
-
-   BUG_ON(idx < 0);
-#else
-   __this_cpu_dec(__kmap_atomic_idx);
-#endif
-}
-
-#endif
-
 /*
  * Prevent people trying to call kunmap_atomic() as if it were kunmap()
  * kunmap_atomic() should get the return value of kmap_atomic, not the page.
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -32,10 +32,6 @@
 #include 
 #include 
 
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
-DEFINE_PER_CPU(int, __kmap_atomic_idx);
-#endif
-
 /*
  * Virtual_count is not a pure "count".
  *  0 means that it is not mapped, and has not been mapped
@@ -370,6 +366,30 @@ EXPORT_SYMBOL(kunmap_high);
 #endif /* CONFIG_HIGHMEM */
 
 #ifdef CONFIG_KMAP_LOCAL
+
+static DEFINE_PER_CPU(int, __kmap_atomic_idx);
+
+static inline int kmap_atomic_idx_push(void)
+{
+   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
+
+   WARN_ON_ONCE(in_irq() && !irqs_disabled());
+   BUG_ON(idx >= KM_TYPE_NR);
+   return idx;
+}
+
+static inline int kmap_atomic_idx(void)
+{
+   return __this_cpu_read(__kmap_atomic_idx) - 1;
+}
+
+static inline void kmap_atomic_idx_pop(void)
+{
+   int idx = __this_cpu_dec_return(__kmap_atomic_idx);
+
+   BUG_ON(idx < 0);
+}
+
 #ifndef arch_kmap_local_post_map
 # define arch_kmap_local_post_map(vaddr, pteval)   do { } while (0)
 #endif
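
For orientation, a hypothetical condensation of how the generic map path
invokes these hooks (a sketch under the assumptions above, not the actual
implementation; kmap_pte_idx() is a made-up helper for the PTE lookup):

static void *kmap_local_sketch(struct page *page, pgprot_t prot)
{
        int idx = kmap_atomic_idx_push();
        unsigned long vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
        pte_t pteval = mk_pte(page, prot);

        set_pte(kmap_pte_idx(idx), pteval);     /* made-up helper */
        arch_kmap_local_post_map(vaddr, pteval);
        return (void *)vaddr;
}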



[patch V2 15/18] io-mapping: Cleanup atomic iomap

2020-10-29 Thread Thomas Gleixner
Switch the atomic iomap implementation over to kmap_local and stick the
preempt/pagefault mechanics into the generic code similar to the
kmap_atomic variants.

Rename the x86 map function in preparation for a non-atomic variant.

Signed-off-by: Thomas Gleixner 
---
V2: New patch to make review easier
---
 arch/x86/include/asm/iomap.h |9 +
 arch/x86/mm/iomap_32.c   |6 ++
 include/linux/io-mapping.h   |8 ++--
 3 files changed, 9 insertions(+), 14 deletions(-)

--- a/arch/x86/include/asm/iomap.h
+++ b/arch/x86/include/asm/iomap.h
@@ -13,14 +13,7 @@
 #include 
 #include 
 
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot);
-
-static inline void iounmap_atomic(void __iomem *vaddr)
-{
-   kunmap_local_indexed((void __force *)vaddr);
-   pagefault_enable();
-   preempt_enable();
-}
+void __iomem *__iomap_local_pfn_prot(unsigned long pfn, pgprot_t prot);
 
 int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
 
--- a/arch/x86/mm/iomap_32.c
+++ b/arch/x86/mm/iomap_32.c
@@ -44,7 +44,7 @@ void iomap_free(resource_size_t base, un
 }
 EXPORT_SYMBOL_GPL(iomap_free);
 
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot)
+void __iomem *__iomap_local_pfn_prot(unsigned long pfn, pgprot_t prot)
 {
/*
 * For non-PAT systems, translate non-WB request to UC- just in
@@ -60,8 +60,6 @@ void __iomem *iomap_atomic_pfn_prot(unsi
/* Filter out unsupported __PAGE_KERNEL* bits: */
pgprot_val(prot) &= __default_kernel_pte_mask;
 
-   preempt_disable();
-   pagefault_disable();
return (void __force __iomem *)__kmap_local_pfn_prot(pfn, prot);
 }
-EXPORT_SYMBOL_GPL(iomap_atomic_pfn_prot);
+EXPORT_SYMBOL_GPL(__iomap_local_pfn_prot);
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -69,13 +69,17 @@ io_mapping_map_atomic_wc(struct io_mappi
 
BUG_ON(offset >= mapping->size);
phys_addr = mapping->base + offset;
-   return iomap_atomic_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
+   preempt_disable();
+   pagefault_disable();
+   return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
 }
 
 static inline void
 io_mapping_unmap_atomic(void __iomem *vaddr)
 {
-   iounmap_atomic(vaddr);
+   kunmap_local_indexed((void __force *)vaddr);
+   pagefault_enable();
+   preempt_enable();
 }
 
 static inline void __iomem *



[patch V2 17/18] mm/highmem: Provide kmap_local*

2020-10-29 Thread Thomas Gleixner
Now that the kmap atomic index is stored in the task struct, provide a
preemptible variant. On context switch the maps of an outgoing task are
removed and the maps of the incoming task are restored. That's obviously
slow, but highmem is slow anyway.

The kmap_local.*() functions can be invoked from both preemptible and
atomic context. kmap local sections disable migration to keep the resulting
virtual mapping address correct, but disable neither pagefaults nor
preemption.

A wholesale conversion of kmap_atomic to be fully preemptible is not
possible because some of the usage sites might rely on the preemption
disable for serialization or on the implicit pagefault disable. This needs
to be done on a case-by-case basis.
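
Purely as an illustration of the intended conversion pattern (a sketch; it
assumes a call site that relies on neither preemption nor pagefault
disabling, and that kunmap_local() is the unmap side paired with
kmap_local_page()):

static void zero_highpage_sketch(struct page *page)
{
        void *kaddr = kmap_local_page(page);    /* only migration disabled */

        clear_page(kaddr);      /* may be preempted; pagefaults stay enabled */
        kunmap_local(kaddr);
}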

Signed-off-by: Thomas Gleixner 
---
V2: Make it more consistent and add commentry
---
 include/linux/highmem.h |  115 +---
 1 file changed, 100 insertions(+), 15 deletions(-)

--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -86,17 +86,56 @@ static inline void kunmap(struct page *p
 }
 
 /*
- * kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because
- * no global lock is needed and because the kmap code must perform a global TLB
- * invalidation when the kmap pool wraps.
- *
- * However when holding an atomic kmap it is not legal to sleep, so atomic
- * kmaps are appropriate for short, tight code paths only.
- *
- * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap
- * gives a more generic (and caching) interface. But kmap_atomic can
- * be used in IRQ contexts, so in some (very limited) cases we need
- * it.
+ * For highmem systems it is required to temporarily map pages
+ * which reside in the portion of memory which is not covered
+ * by the permanent kernel mapping.
+ *
+ * This comes in three flavors:
+ *
+ * 1) kmap/kunmap:
+ *
+ *An interface to acquire longer term mappings with no restrictions
+ *on preemption and migration. This comes with an overhead as the
+ *mapping space is restricted and protected by a global lock. It
+ *also requires global TLB invalidation when the kmap pool wraps.
+ *
+ *kmap() might block when the mapping space is fully utilized until a
+ *slot becomes available. Only callable from preemptible thread
+ *context.
+ *
+ * 2) kmap_local.*()/kunmap_local.*()
+ *
+ *An interface to acquire short term mappings. Can be invoked from any
+ *context including interrupts. The mapping is per thread, CPU local
+ *and not globally visible. It can only be used in the context which
+ *acquired the mapping. Nesting kmap_local.*() and kmap_atomic.*()
+ *mappings is allowed to a certain extent (up to KM_TYPE_NR).
+ *
+ *Nested kmap_local.*() and kunmap_local.*() invocations have to be
+ *strictly ordered because the map implementation is stack based.
+ *
+ *kmap_local.*() disables migration, but keeps preemption enabled. It's
+ *valid to take pagefaults in a kmap_local region unless the context in
+ *which the local kmap is acquired does not allow it for other reasons.
+ *
+ *If a task holding local kmaps is preempted, the maps are removed on
+ *context switch and restored when the task comes back on the CPU. As
+ *the maps are strictly CPU local it is guaranteed that the task stays
+ *on the CPU and the CPU cannot be unplugged until the local kmaps are
+ *released.
+ *
+ * 3) kmap_atomic.*()/kunmap_atomic.*()
+ *
+ *Based on the same mechanism as kmap local. Atomic kmap disables
+ *preemption and pagefaults. Only use if absolutely required, use
+ *the corresponding kmap_local variant if possible.
+ *
+ * Local and atomic kmaps are faster than kmap/kunmap, but impose
+ * restrictions. Only use them when required.
+ *
+ * For !HIGHMEM enabled systems the kmap flavours are not doing any mapping
+ * operation and kmap() won't sleep, but the kmap local and atomic variants
+ * still disable migration and pagefaults/preemption, respectively.
  */
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
@@ -122,6 +161,28 @@ static inline void __kunmap_atomic(void
kunmap_local_indexed(addr);
 }
 
+static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot)
+{
+   migrate_disable();
+   return __kmap_local_page_prot(page, prot);
+}
+
+static inline void *kmap_local_page(struct page *page)
+{
+   return kmap_local_page_prot(page, kmap_prot);
+}
+
+static inline void *kmap_local_pfn(unsigned long pfn)
+{
+   migrate_disable();
+   return __kmap_local_pfn_prot(pfn, kmap_prot);
+}
+
+static inline void __kunmap_local(void *vaddr)
+{
+   kunmap_local_indexed(vaddr);
+}
+
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
 extern atomic_long_t _totalhigh_pages;
@@ -201,10 +262,27 @@ static inline void *kmap_atomic_pfn(unsi
 
 static inline void __kunmap_atomic(void *addr)
 {
-   /*
-* Mostly nothing to do in the 

[patch V2 18/18] io-mapping: Provide iomap_local variant

2020-10-29 Thread Thomas Gleixner
Similar to kmap local, provide an iomap local variant which only disables
migration and disables neither pagefaults nor preemption.

Signed-off-by: Thomas Gleixner 
---
V2: Split out from the large combo patch and add the !IOMAP_ATOMIC variants
---
 include/linux/io-mapping.h |   34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -83,6 +83,23 @@ io_mapping_unmap_atomic(void __iomem *va
 }
 
 static inline void __iomem *
+io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
+{
+   resource_size_t phys_addr;
+
+   BUG_ON(offset >= mapping->size);
+   phys_addr = mapping->base + offset;
+   migrate_disable();
+   return __iomap_local_pfn_prot(PHYS_PFN(phys_addr), mapping->prot);
+}
+
+static inline void io_mapping_unmap_local(void __iomem *vaddr)
+{
+   kunmap_local_indexed((void __force *)vaddr);
+   migrate_enable();
+}
+
+static inline void __iomem *
 io_mapping_map_wc(struct io_mapping *mapping,
  unsigned long offset,
  unsigned long size)
@@ -101,7 +118,7 @@ io_mapping_unmap(void __iomem *vaddr)
iounmap(vaddr);
 }
 
-#else
+#else  /* HAVE_ATOMIC_IOMAP */
 
 #include 
 
@@ -166,7 +183,20 @@ io_mapping_unmap_atomic(void __iomem *va
preempt_enable();
 }
 
-#endif /* HAVE_ATOMIC_IOMAP */
+static inline void __iomem *
+io_mapping_map_local_wc(struct io_mapping *mapping, unsigned long offset)
+{
+   migrate_disable();
+   return io_mapping_map_wc(mapping, offset, PAGE_SIZE);
+}
+
+static inline void io_mapping_unmap_local(void __iomem *vaddr)
+{
+   io_mapping_unmap(vaddr);
+   migrate_enable();
+}
+
+#endif /* !HAVE_ATOMIC_IOMAP */
 
 static inline struct io_mapping *
 io_mapping_create_wc(resource_size_t base,
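
A usage sketch of the local variant (illustrative only; write_tile() and
its parameters are hypothetical):

/* The mapped window can be accessed with preemption and pagefaults
 * enabled; the task merely stays pinned to the current CPU. */
static void write_tile(struct io_mapping *mapping, unsigned long offset,
                       const u32 *buf, size_t words)
{
        void __iomem *vaddr = io_mapping_map_local_wc(mapping, offset);
        size_t i;

        for (i = 0; i < words; i++)
                writel(buf[i], vaddr + i * sizeof(u32));

        io_mapping_unmap_local(vaddr);
}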



Re: [PATCH 12/13] PCI: dwc: Move dw_pcie_setup_rc() to DWC common code

2020-10-29 Thread Jingoo Han
On 10/28/20, 4:47 PM, Rob Herring wrote:
> 
> All RC complex drivers must call dw_pcie_setup_rc(). The ordering of the
> call shouldn't be too important other than being after any RC resets.
>
> There's a few calls of dw_pcie_setup_rc() left as drivers implementing
> suspend/resume need it.
>
> Cc: Kishon Vijay Abraham I 
> Cc: Lorenzo Pieralisi 
> Cc: Bjorn Helgaas 
> Cc: Jingoo Han 

Acked-by: Jingoo Han 

Best regards,
Jingoo Han

> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Richard Zhu 
> Cc: Lucas Stach 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Pengutronix Kernel Team 
> Cc: Fabio Estevam 
> Cc: NXP Linux Team 
> Cc: Murali Karicheri 
> Cc: Minghuan Lian 
> Cc: Mingkai Hu 
> Cc: Roy Zang 
> Cc: Yue Wang 
> Cc: Kevin Hilman 
> Cc: Neil Armstrong 
> Cc: Jerome Brunet 
> Cc: Martin Blumenstingl 
> Cc: Thomas Petazzoni 
> Cc: Jesper Nilsson 
> Cc: Gustavo Pimentel 
> Cc: Xiaowei Song 
> Cc: Binghui Wang 
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: Stanimir Varbanov 
> Cc: Pratyush Anand 
> Cc: Kunihiko Hayashi 
> Cc: Masahiro Yamada 
> Cc: linux-o...@vger.kernel.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-amlo...@lists.infradead.org
> Cc: linux-arm-ker...@axis.com
> Cc: linux-arm-...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/pci/controller/dwc/pci-dra7xx.c   | 1 -
>  drivers/pci/controller/dwc/pci-exynos.c   | 1 -
>  drivers/pci/controller/dwc/pci-imx6.c | 1 -
>  drivers/pci/controller/dwc/pci-keystone.c | 2 --
>  drivers/pci/controller/dwc/pci-layerscape.c   | 2 --
>  drivers/pci/controller/dwc/pci-meson.c| 2 --
>  drivers/pci/controller/dwc/pcie-armada8k.c| 2 --
>  drivers/pci/controller/dwc/pcie-artpec6.c | 1 -
>  drivers/pci/controller/dwc/pcie-designware-host.c | 1 +
>  drivers/pci/controller/dwc/pcie-designware-plat.c | 8 
>  drivers/pci/controller/dwc/pcie-histb.c   | 3 ---
>  drivers/pci/controller/dwc/pcie-kirin.c   | 2 --
>  drivers/pci/controller/dwc/pcie-qcom.c| 1 -
>  drivers/pci/controller/dwc/pcie-spear13xx.c   | 2 --
>  drivers/pci/controller/dwc/pcie-uniphier.c| 2 --
>  15 files changed, 1 insertion(+), 30 deletions(-)

[...]


Test Results: RE: [V2,18/18] io-mapping: Provide iomap_local variant

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,17/18] mm/highmem: Provide kmap_local*

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends

2020-10-29 Thread Linus Torvalds
On Thu, Oct 29, 2020 at 3:32 PM Thomas Gleixner  wrote:
>
>
> Though I wanted to share the current state of affairs before investigating
> that further. If there is consensus in going forward with this, I'll have a
> deeper look into this issue.

Me likee. I think this looks like the right thing to do.

I didn't actually apply the patches, but just from reading them it
_looks_ to me like you do the migrate_disable() unconditionally, even
if it's not a highmem page..

That sounds like it might be a good thing for debugging, but not
necessarily great in general.

Or am I misreading things?

Linus


Re: [PATCH 2/4] PM: hibernate: improve robustness of mapping pages in the direct map

2020-10-29 Thread Edgecombe, Rick P
On Thu, 2020-10-29 at 09:54 +0200, Mike Rapoport wrote:
> __kernel_map_pages() on arm64 will also bail out if rodata_full is
> false:
> void __kernel_map_pages(struct page *page, int numpages, int enable)
> {
> if (!debug_pagealloc_enabled() && !rodata_full)
> return;
> 
> set_memory_valid((unsigned long)page_address(page), numpages,
> enable);
> }
> 
> So using set_direct_map() to map back pages removed from the direct
> map
> with __kernel_map_pages() seems safe to me.

Heh, one of us must have some simple boolean error in our head. I hope
it's not me! :) I'll try one more time.

__kernel_map_pages() will bail out if rodata_full is false **AND**
debug page alloc is off. So it will only bail under conditions where
there could be nothing unmapped on the direct map.

Equivalent logic would be:
if (!(debug_pagealloc_enabled() || rodata_full))
return;

Or:
if (debug_pagealloc_enabled() || rodata_full)
set_memory_valid(blah)

So if either is on, the existing code will try to re-map. But the
set_direct_map_()'s will only work if rodata_full is on. So switching
hibernate to set_direct_map() will cause the remap to be missed for the
debug page alloc case, with !rodata_full.

It also breaks normal debug page alloc usage with !rodata_full for
similar reasons after patch 3. The pages would never get unmapped.
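
[To make the comparison concrete, here is a condensed side-by-side of
the two arm64 checks under discussion; simplified from the sources
quoted above, not the literal kernel code:]

/* __kernel_map_pages(): bails only when BOTH features are off */
if (!debug_pagealloc_enabled() && !rodata_full)
	return;

/* set_direct_map_{default,invalid}_noflush(): bails whenever
 * rodata_full is off, regardless of DEBUG_PAGEALLOC */
if (!rodata_full)
	return 0;

/* So with DEBUG_PAGEALLOC=y and rodata_full=false, the existing path
 * remaps pages while the set_direct_map path silently does nothing. */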




Re: [PATCH 0/4] arch, mm: improve robustness of direct map manipulation

2020-10-29 Thread Edgecombe, Rick P
On Thu, 2020-10-29 at 10:12 +0200, Mike Rapoport wrote:
> This series' goal was primarily to separate dependencies and make it
> clearer what DEBUG_PAGEALLOC and SET_DIRECT_MAP are. As it turned
> out, there is also some lack of consistency between architectures
> that implement either of these, so I tried to improve this as well.
> 
> Honestly, I don't know if a thread can be paused at the time
> __vunmap() left invalid pages, but it could; there is an issue on
> arm64 with DEBUG_PAGEALLOC=n and this set fixes it.

Ah, ok. So from this and the other thread, this is about the logic in
arm's cpa for when it will try the un/map operations. I think the logic
actually works currently. And this series introduces a problem on ARM
similar to the one you are saying preexists. I put the details in the
other thread.



Test Results: RE: [V2,15/18] io-mapping: Cleanup atomic iomap

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,14/18] mm/highmem: Remove the old kmap_atomic cruft

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,16/18] sched: highmem: Store local kmaps in task struct

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,13/18] xtensa/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,12/18] sparc/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,11/18] powerpc/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,10/18] nds32/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,09/18] mips/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,08/18] microblaze/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,06/18] ARM: highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends

2020-10-29 Thread Thomas Gleixner
On Thu, Oct 29 2020 at 16:11, Linus Torvalds wrote:
> On Thu, Oct 29, 2020 at 3:32 PM Thomas Gleixner  wrote:
>>
>> Though I wanted to share the current state of affairs before investigating
>> that further. If there is consensus in going forward with this, I'll have a
>> deeper look into this issue.
>
> Me likee. I think this looks like the right thing to do.
>
> I didn't actually apply the patches, but just from reading them it
> _looks_ to me like you do the migrate_disable() unconditionally, even
> if it's not a highmem page..
>
> That sounds like it might be a good thing for debugging, but not
> necessarily great in general.
>
> Or am I misreading things?

No, you're not misreading it, but doing it conditionally would be a
complete semantic disaster. kmap_atomic*() also disables preemption
and pagefaults unconditionally. If that weren't the case then every
caller would have to have conditionals like 'if (CONFIG_HIGHMEM)' or
worse 'if (PageHighMem(page)'.

Let's not go there.

Migrate disable is a less horrible plague than preempt and pagefault
disable even if the scheduler people disagree due to the lack of theory
backing that up :)

The charm of the new interface is that users still can rely on per
cpuness independent of being on a highmem plagued system. For non
highmem systems the extra migrate disable/enable is really a minor
nuissance.

Thanks,

tglx
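
[A usage sketch of the interfaces added earlier in the series; the copy
helper itself is hypothetical:]

/* Copy out of a possibly-highmem page with the new local interface.
 * The mapping is per-CPU (migration disabled), but the mapped section
 * may sleep and take page faults, unlike kmap_atomic(). */
static void copy_from_page(void *dst, struct page *page,
			   unsigned int offset, size_t len)
{
	char *vaddr = kmap_local_page(page);

	memcpy(dst, vaddr + offset, len);
	__kunmap_local(vaddr);
}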


Test Results: RE: [V2,07/18] csky/mm/highmem: Switch to generic kmap atomic

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,05/18] arc/mm/highmem: Use generic kmap atomic implementation

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Test Results: RE: [V2,03/18] highmem: Provide generic variant of kmap_atomic*

2020-10-29 Thread snowpatch
Thanks for your contribution, unfortunately we've found some issues.

Your patch failed to apply to any branch.




Re: [PATCH] powerpc: add support for TIF_NOTIFY_SIGNAL

2020-10-29 Thread Michael Ellerman
Jens Axboe  writes:
> Wire up TIF_NOTIFY_SIGNAL handling for powerpc.
>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Jens Axboe 
> ---
>
> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting
> for details:
>
> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/
>
> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs,
> as that will enable a set of cleanups once all of them support it. I'm
> happy carrying this patch if need be, or it can be funelled through the
> arch tree. Let me know.

Happy for you to take it along with the rest of the series.

Acked-by: Michael Ellerman 


cheers

> diff --git a/arch/powerpc/include/asm/thread_info.h 
> b/arch/powerpc/include/asm/thread_info.h
> index 46a210b03d2b..53115ae61495 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -90,6 +90,7 @@ void arch_setup_new_exec(void);
>  #define TIF_SYSCALL_TRACE0   /* syscall trace active */
>  #define TIF_SIGPENDING   1   /* signal pending */
>  #define TIF_NEED_RESCHED 2   /* rescheduling necessary */
> +#define TIF_NOTIFY_SIGNAL3   /* signal notifications exist */
>  #define TIF_SYSCALL_EMU  4   /* syscall emulation active */
>  #define TIF_RESTORE_TM   5   /* need to restore TM FP/VEC/VSX */
>  #define TIF_PATCH_PENDING6   /* pending live patching update */
> @@ -115,6 +116,7 @@ void arch_setup_new_exec(void);
>  #define _TIF_SYSCALL_TRACE   (1<<TIF_SYSCALL_TRACE)
>  #define _TIF_SIGPENDING  (1<<TIF_SIGPENDING)
>  #define _TIF_NEED_RESCHED(1<<TIF_NEED_RESCHED)
> +#define _TIF_NOTIFY_SIGNAL   (1<<TIF_NOTIFY_SIGNAL)
>  #define _TIF_POLLING_NRFLAG  (1<<TIF_POLLING_NRFLAG)
>  #define _TIF_32BIT   (1<<TIF_32BIT)
>  #define _TIF_RESTORE_TM  (1<<TIF_RESTORE_TM)
> @@ -136,7 +138,8 @@ void arch_setup_new_exec(void);
>  
>  #define _TIF_USER_WORK_MASK  (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
>_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> -  _TIF_RESTORE_TM | _TIF_PATCH_PENDING)
> +  _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
> +  _TIF_NOTIFY_SIGNAL)
>  #define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)
>  
>  /* Bits in local_flags */
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index d2c356f37077..a8bb0aca1d02 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -318,7 +318,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
> thread_info_flags)
>   if (thread_info_flags & _TIF_PATCH_PENDING)
>   klp_update_patch_state(current);
>  
> - if (thread_info_flags & _TIF_SIGPENDING) {
> + if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
>   BUG_ON(regs != current->thread.regs);
>   do_signal(current);
>   }
> -- 
> 2.29.0
>
> -- 
> Jens Axboe


[PATCH 01/29] powerpc/rtas: move rtas_call_reentrant() out of pseries guards

2020-10-29 Thread Nathan Lynch
rtas_call_reentrant() isn't platform-dependent; move it out of
CONFIG_PPC_PSERIES-guarded code.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 954f41676f69..b40fc892138b 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -897,6 +897,12 @@ int rtas_ibm_suspend_me(u64 handle)
 
return atomic_read(&data.error);
 }
+#else /* CONFIG_PPC_PSERIES */
+int rtas_ibm_suspend_me(u64 handle)
+{
+   return -ENOSYS;
+}
+#endif
 
 /**
  * rtas_call_reentrant() - Used for reentrant rtas calls
@@ -948,13 +954,6 @@ int rtas_call_reentrant(int token, int nargs, int nret, 
int *outputs, ...)
return ret;
 }
 
-#else /* CONFIG_PPC_PSERIES */
-int rtas_ibm_suspend_me(u64 handle)
-{
-   return -ENOSYS;
-}
-#endif
-
 /**
  * Find a specific pseries error log in an RTAS extended event log.
  * @log: RTAS error/event log
-- 
2.25.4



[PATCH 02/29] powerpc/rtas: prevent suspend-related sys_rtas use on LE

2020-10-29 Thread Nathan Lynch
While drmgr has had work in some areas to make its RTAS syscall
interactions endian-neutral, its code for performing partition
migration via the syscall has never worked on LE. While it is able to
complete ibm,suspend-me successfully, it crashes when attempting the
subsequent ibm,update-nodes call.

drmgr is the only known (or plausible) user of ibm,suspend-me,
ibm,update-nodes, and ibm,update-properties, so allow them only in
big-endian configurations.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index b40fc892138b..132b2ae39009 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1049,9 +1049,11 @@ static struct rtas_filter rtas_filters[] __ro_after_init 
= {
{ "set-time-for-power-on", -1, -1, -1, -1, -1 },
{ "ibm,set-system-parameter", -1, 1, -1, -1, -1 },
{ "set-time-of-day", -1, -1, -1, -1, -1 },
+#ifdef CONFIG_CPU_BIG_ENDIAN
{ "ibm,suspend-me", -1, -1, -1, -1, -1 },
{ "ibm,update-nodes", -1, 0, -1, -1, -1, 4096 },
{ "ibm,update-properties", -1, 0, -1, -1, -1, 4096 },
+#endif
{ "ibm,physical-attestation", -1, 0, 1, -1, -1 },
 };
 
-- 
2.25.4



[PATCH 03/29] powerpc/rtas: complete ibm,suspend-me status codes

2020-10-29 Thread Nathan Lynch
We don't completely account for the possible return codes for
ibm,suspend-me. Add definitions for these.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 55f9a154c95d..f060181a0d32 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -23,11 +23,16 @@
 #define RTAS_RMOBUF_MAX (64 * 1024)
 
 /* RTAS return status codes */
-#define RTAS_NOT_SUSPENDABLE   -9004
 #define RTAS_BUSY  -2/* RTAS Busy */
 #define RTAS_EXTENDED_DELAY_MIN9900
 #define RTAS_EXTENDED_DELAY_MAX9905
 
+/* statuses specific to ibm,suspend-me */
+#define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */
+#define RTAS_NOT_SUSPENDABLE-9004 /* Partition not suspendable */
+#define RTAS_THREADS_ACTIVE -9005 /* Multiple processor threads active */
+#define RTAS_OUTSTANDING_COPROC -9006 /* Outstanding coprocessor operations */
+
 /*
  * In general to call RTAS use rtas_token("string") to lookup
  * an RTAS token for the given string (e.g. "event-scan").
-- 
2.25.4



[PATCH 06/29] powerpc/rtas: add rtas_activate_firmware()

2020-10-29 Thread Nathan Lynch
Provide a documented wrapper function for the ibm,activate-firmware
service, which must be called after a partition migration or
hibernation.

If the function is absent or the call fails, the OS will continue to
run normally with the current firmware, so there is no need to perform
any recovery. Just log it and continue.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h |  1 +
 arch/powerpc/kernel/rtas.c  | 30 ++
 2 files changed, 31 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index b43165fc6c2a..fdefe6a974eb 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -247,6 +247,7 @@ extern void __noreturn rtas_restart(char *cmd);
 extern void rtas_power_off(void);
 extern void __noreturn rtas_halt(void);
 extern void rtas_os_term(char *str);
+void rtas_activate_firmware(void);
 extern int rtas_get_sensor(int sensor, int index, int *state);
 extern int rtas_get_sensor_fast(int sensor, int index, int *state);
 extern int rtas_get_power_level(int powerdomain, int *level);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 70c570269d7b..58bbd69a233f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -961,6 +961,36 @@ int rtas_ibm_suspend_me_unsafe(u64 handle)
 }
 #endif
 
+/**
+ * rtas_activate_firmware() - Activate a new version of firmware.
+ *
+ * Activate a new version of partition firmware. The OS must call this
+ * after resuming from a partition hibernation or migration in order
+ * to maintain the ability to perform live firmware updates. It's not
+ * catastrophic for this method to be absent or to fail; just log the
+ * condition in that case.
+ *
+ * Context: This function may sleep.
+ */
+void rtas_activate_firmware(void)
+{
+   int token;
+   int fwrc;
+
+   token = rtas_token("ibm,activate-firmware");
+   if (token == RTAS_UNKNOWN_SERVICE) {
+   pr_notice("ibm,activate-firmware method unavailable\n");
+   return;
+   }
+
+   do {
+   fwrc = rtas_call(token, 0, 1, NULL);
+   } while (rtas_busy_delay(fwrc));
+
+   if (fwrc)
+   pr_err("ibm,activate-firmware failed (%i)\n", fwrc);
+}
+
 /**
  * rtas_call_reentrant() - Used for reentrant rtas calls
  * @token: Token for desired reentrant RTAS call
-- 
2.25.4



[PATCH 04/29] powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe

2020-10-29 Thread Nathan Lynch
The pseries partition suspend sequence requires that all active CPUs
call H_JOIN, which suspends all but one of them with interrupts
disabled. The "chosen" CPU is then to call ibm,suspend-me to complete
the suspend. Upon returning from ibm,suspend-me, the chosen CPU is to
use H_PROD to wake the joined CPUs.

Using on_each_cpu() for this, as rtas_ibm_suspend_me() does to
implement partition migration, is susceptible to deadlock with other
users of on_each_cpu() and with users of stop_machine APIs. The
callback passed to on_each_cpu() is not allowed to synchronize with
other CPUs in the way it is used here.

Complicating the fix is the fact that rtas_ibm_suspend_me() also
occupies the function name that should be used to provide a more
conventional wrapper for ibm,suspend-me. Rename rtas_ibm_suspend_me()
to rtas_ibm_suspend_me_unsafe() to free up the name and indicate that
it should not gain users.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h   | 2 +-
 arch/powerpc/kernel/rtas.c| 6 +++---
 arch/powerpc/platforms/pseries/mobility.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index f060181a0d32..8436ed01567b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -257,7 +257,7 @@ extern int rtas_set_indicator_fast(int indicator, int 
index, int new_value);
 extern void rtas_progress(char *s, unsigned short hex);
 extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data);
 extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data);
-extern int rtas_ibm_suspend_me(u64 handle);
+int rtas_ibm_suspend_me_unsafe(u64 handle);
 
 struct rtc_time;
 extern time64_t rtas_get_boot_time(void);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 132b2ae39009..33adefa84a42 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -843,7 +843,7 @@ static void rtas_percpu_suspend_me(void *info)
__rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1);
 }
 
-int rtas_ibm_suspend_me(u64 handle)
+int rtas_ibm_suspend_me_unsafe(u64 handle)
 {
long state;
long rc;
@@ -898,7 +898,7 @@ int rtas_ibm_suspend_me(u64 handle)
return atomic_read(&data.error);
 }
 #else /* CONFIG_PPC_PSERIES */
-int rtas_ibm_suspend_me(u64 handle)
+int rtas_ibm_suspend_me_unsafe(u64 handle)
 {
return -ENOSYS;
 }
@@ -1184,7 +1184,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
int rc = 0;
u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32)
  | be32_to_cpu(args.args[1]);
-   rc = rtas_ibm_suspend_me(handle);
+   rc = rtas_ibm_suspend_me_unsafe(handle);
if (rc == -EAGAIN)
args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE);
else if (rc == -EIO)
diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index d6f4162478a5..b6de65cbfcd9 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -370,7 +370,7 @@ static ssize_t migration_store(struct class *class,
return rc;
 
do {
-   rc = rtas_ibm_suspend_me(streamid);
+   rc = rtas_ibm_suspend_me_unsafe(streamid);
if (rc == -EAGAIN)
ssleep(1);
} while (rc == -EAGAIN);
-- 
2.25.4



[PATCH 00/29] partition suspend updates

2020-10-29 Thread Nathan Lynch
This series aims to improve the pseries-specific partition migration
and hibernation implementation, part of which has been living in
kernel/rtas.c. Most of that code is eliminated or moved to
platforms/pseries, and the following major functional changes are
made:

- Use stop_machine() instead of on_each_cpu() to avoid deadlock in the
  join/suspend sequence.

- Retry the join/suspend sequence on errors that are likely to be
  transient. This is a mitigation for the fact that drivers currently
  have no way to prepare for an impending partition suspension,
  sometimes resulting in a virtual adapter being in a state which
  causes the platform to fail the suspend call.

- Request cancellation of the migration via H_VASI_SIGNAL if Linux is
  going to error out of the suspend attempt. This allows the
  management console and other entities to promptly clean up their
  operations instead of relying on long timeouts to fail the
  migration.

- Little-endian users of ibm,suspend-me, ibm,update-nodes and
  ibm,update-properties via sys_rtas are blocked when
  CONFIG_PPC_RTAS_FILTER is enabled.

- Legacy user space code (drmgr) historically has driven the migration
  process by using sys_rtas to separately call ibm,suspend-me,
  ibm,activate-firmware, and ibm,update-nodes/properties, in that
  order. With these changes, when sys_rtas() dispatches
  ibm,suspend-me, the kernel performs the device tree update and
  firmware activation before returning. This is more reliable, and
  drmgr does not seem bothered by it.

- If the H_VASI_STATE hcall is absent, the implementation proceeds
  with the suspend instead of erroring out. This allows us to exercise
  these code paths in QEMU.

Nathan Lynch (29):
  powerpc/rtas: move rtas_call_reentrant() out of pseries guards
  powerpc/rtas: prevent suspend-related sys_rtas use on LE
  powerpc/rtas: complete ibm,suspend-me status codes
  powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe
  powerpc/rtas: add rtas_ibm_suspend_me()
  powerpc/rtas: add rtas_activate_firmware()
  powerpc/hvcall: add token and codes for H_VASI_SIGNAL
  powerpc/pseries/mobility: don't error on absence of ibm,update-nodes
  powerpc/pseries/mobility: add missing break to default case
  powerpc/pseries/mobility: error message improvements
  powerpc/pseries/mobility: use rtas_activate_firmware() on resume
  powerpc/pseries/mobility: extract VASI session polling logic
  powerpc/pseries/mobility: use stop_machine for join/suspend
  powerpc/pseries/mobility: signal suspend cancellation to platform
  powerpc/pseries/mobility: retry partition suspend after error
  powerpc/rtas: dispatch partition migration requests to pseries
  powerpc/rtas: remove rtas_ibm_suspend_me_unsafe()
  powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
  powerpc/pseries/hibernation: pass stream id via function arguments
  powerpc/pseries/hibernation: remove pseries_suspend_cpu()
  powerpc/machdep: remove suspend_disable_cpu()
  powerpc/rtas: remove rtas_suspend_cpu()
  powerpc/pseries/hibernation: switch to rtas_ibm_suspend_me()
  powerpc/rtas: remove unused rtas_suspend_last_cpu()
  powerpc/pseries/hibernation: remove redundant cacheinfo update
  powerpc/pseries/hibernation: perform post-suspend fixups later
  powerpc/pseries/hibernation: remove prepare_late() callback
  powerpc/rtas: remove unused rtas_suspend_me_data
  powerpc/pseries/mobility: refactor node lookup during DT update

 arch/powerpc/include/asm/hvcall.h |   9 +
 arch/powerpc/include/asm/machdep.h|   1 -
 arch/powerpc/include/asm/rtas-types.h |   8 -
 arch/powerpc/include/asm/rtas.h   |  13 +-
 arch/powerpc/kernel/rtas.c| 245 ++-
 arch/powerpc/platforms/pseries/mobility.c | 361 ++
 arch/powerpc/platforms/pseries/suspend.c  |  79 +
 7 files changed, 420 insertions(+), 296 deletions(-)

-- 
2.25.4



[PATCH 05/29] powerpc/rtas: add rtas_ibm_suspend_me()

2020-10-29 Thread Nathan Lynch
Now that the name is available, provide a simple wrapper for
ibm,suspend-me which returns both a Linux errno and optionally the
actual RTAS status to the caller.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h |  1 +
 arch/powerpc/kernel/rtas.c  | 57 +
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 8436ed01567b..b43165fc6c2a 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -258,6 +258,7 @@ extern void rtas_progress(char *s, unsigned short hex);
 extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data);
 extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data);
 int rtas_ibm_suspend_me_unsafe(u64 handle);
+int rtas_ibm_suspend_me(int *fw_status);
 
 struct rtc_time;
 extern time64_t rtas_get_boot_time(void);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 33adefa84a42..70c570269d7b 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -684,6 +684,63 @@ int rtas_set_indicator_fast(int indicator, int index, int 
new_value)
return rc;
 }
 
+/**
+ * rtas_ibm_suspend_me() - Call ibm,suspend-me to suspend the LPAR.
+ *
+ * @fw_status: RTAS call status will be placed here if not NULL.
+ *
+ * rtas_ibm_suspend_me() should be called only on a CPU which has
+ * received H_CONTINUE from the H_JOIN hcall. All other active CPUs
+ * should be waiting to return from H_JOIN.
+ *
+ * rtas_ibm_suspend_me() may suspend execution of the OS
+ * indefinitely. Callers should take appropriate measures upon return, such as
+ * resetting watchdog facilities.
+ *
+ * Callers may choose to retry this call if @fw_status is
+ * %RTAS_THREADS_ACTIVE.
+ *
+ * Return:
+ * 0  - The partition has resumed from suspend, possibly after
+ *  migration to a different host.
+ * -ECANCELED - The operation was aborted.
+ * -EAGAIN- There were other CPUs not in H_JOIN at the time of the call.
+ * -EBUSY - Some other condition prevented the suspend from succeeding.
+ * -EIO   - Hardware/platform error.
+ */
+int rtas_ibm_suspend_me(int *fw_status)
+{
+   int fwrc;
+   int ret;
+
+   fwrc = rtas_call(rtas_token("ibm,suspend-me"), 0, 1, NULL);
+
+   switch (fwrc) {
+   case 0:
+   ret = 0;
+   break;
+   case RTAS_SUSPEND_ABORTED:
+   ret = -ECANCELED;
+   break;
+   case RTAS_THREADS_ACTIVE:
+   ret = -EAGAIN;
+   break;
+   case RTAS_NOT_SUSPENDABLE:
+   case RTAS_OUTSTANDING_COPROC:
+   ret = -EBUSY;
+   break;
+   case -1:
+   default:
+   ret = -EIO;
+   break;
+   }
+
+   if (fw_status)
+   *fw_status = fwrc;
+
+   return ret;
+}
+
 void __noreturn rtas_restart(char *cmd)
 {
if (rtas_flash_term_hook)
-- 
2.25.4
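
[A sketch of how a caller might consume the errno/fw_status pair from
rtas_ibm_suspend_me(). This helper is hypothetical; the real consumer
in this series is do_suspend() in patch 13:]

/* Per the kernel-doc above, a retry is reasonable only when firmware
 * reports that other processor threads were still active. */
static int example_suspend_once(void)
{
	int fw_status;
	int ret;

	ret = rtas_ibm_suspend_me(&fw_status);
	if (ret == -EAGAIN && fw_status == RTAS_THREADS_ACTIVE)
		pr_notice("threads not yet in H_JOIN; caller may retry\n");
	else if (ret)
		pr_err("ibm,suspend-me failed: %d (fw_status %d)\n",
		       ret, fw_status);

	return ret;
}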



[PATCH 07/29] powerpc/hvcall: add token and codes for H_VASI_SIGNAL

2020-10-29 Thread Nathan Lynch
H_VASI_SIGNAL can be used by a partition to request cancellation of
its migration. To be used in future changes.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/hvcall.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index c1fbccb04390..c98f5141e3fc 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -155,6 +155,14 @@
 #define H_VASI_RESUMED  5
 #define H_VASI_COMPLETED6
 
+/* VASI signal codes. Only the Cancel code is valid for H_VASI_SIGNAL. */
+#define H_VASI_SIGNAL_CANCEL1
+#define H_VASI_SIGNAL_ABORT 2
+#define H_VASI_SIGNAL_SUSPEND   3
+#define H_VASI_SIGNAL_COMPLETE  4
+#define H_VASI_SIGNAL_ENABLE5
+#define H_VASI_SIGNAL_FAILOVER  6
+
 /* Each control block has to be on a 4K boundary */
 #define H_CB_ALIGNMENT  4096
 
@@ -261,6 +269,7 @@
 #define H_ADD_CONN 0x284
 #define H_DEL_CONN 0x288
 #define H_JOIN 0x298
+#define H_VASI_SIGNAL   0x2A0
 #define H_VASI_STATE0x2A4
 #define H_VIOCTL   0x2A8
 #define H_ENABLE_CRQ   0x2B0
-- 
2.25.4



[PATCH 08/29] powerpc/pseries/mobility: don't error on absence of ibm,update-nodes

2020-10-29 Thread Nathan Lynch
Treat the absence of the ibm,update-nodes function as benign instead
of reporting an error. If the platform does not provide that facility,
it's not a problem for Linux.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index b6de65cbfcd9..5bcb6e5cc0f2 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -261,7 +261,7 @@ int pseries_devicetree_update(s32 scope)
 
update_nodes_token = rtas_token("ibm,update-nodes");
if (update_nodes_token == RTAS_UNKNOWN_SERVICE)
-   return -EINVAL;
+   return 0;
 
rtas_buf = kzalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL);
if (!rtas_buf)
-- 
2.25.4



[PATCH 09/29] powerpc/pseries/mobility: add missing break to default case

2020-10-29 Thread Nathan Lynch
update_dt_node() has a switch statement where the default case lacks a
break statement.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index 5bcb6e5cc0f2..d85799b5464a 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -213,6 +213,7 @@ static int update_dt_node(__be32 phandle, s32 scope)
}
 
prop_data += vd;
+   break;
}
 
cond_resched();
-- 
2.25.4



[PATCH 10/29] powerpc/pseries/mobility: error message improvements

2020-10-29 Thread Nathan Lynch
- Convert printk(KERN_ERR) to pr_err().
- Include errno in property update failure message.
- Remove reference to "Post-mobility" from device tree update message:
  with pr_err() it will have a "mobility:" prefix.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index d85799b5464a..ef8f5641e700 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -208,8 +208,8 @@ static int update_dt_node(__be32 phandle, s32 scope)
rc = update_dt_property(dn, &prop, prop_name,
vd, prop_data);
if (rc) {
-   printk(KERN_ERR "Could not update %s"
-  " property\n", prop_name);
+   pr_err("updating %s property failed: 
%d\n",
+  prop_name, rc);
}
 
prop_data += vd;
@@ -343,8 +343,7 @@ void post_mobility_fixup(void)
 
rc = pseries_devicetree_update(MIGRATION_SCOPE);
if (rc)
-   printk(KERN_ERR "Post-mobility device tree update "
-   "failed: %d\n", rc);
+   pr_err("device tree update failed: %d\n", rc);
 
cacheinfo_rebuild();
 
-- 
2.25.4



[PATCH 11/29] powerpc/pseries/mobility: use rtas_activate_firmware() on resume

2020-10-29 Thread Nathan Lynch
It's incorrect to abort post-suspend processing if
ibm,activate-firmware isn't available. Use rtas_activate_firmware(),
which logs this condition appropriately and allows us to proceed.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index ef8f5641e700..dc6abf164db7 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -312,21 +312,8 @@ int pseries_devicetree_update(s32 scope)
 void post_mobility_fixup(void)
 {
int rc;
-   int activate_fw_token;
 
-   activate_fw_token = rtas_token("ibm,activate-firmware");
-   if (activate_fw_token == RTAS_UNKNOWN_SERVICE) {
-   printk(KERN_ERR "Could not make post-mobility "
-  "activate-fw call.\n");
-   return;
-   }
-
-   do {
-   rc = rtas_call(activate_fw_token, 0, 1, NULL);
-   } while (rtas_busy_delay(rc));
-
-   if (rc)
-   printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc);
+   rtas_activate_firmware();
 
/*
 * We don't want CPUs to go online/offline while the device
-- 
2.25.4



[PATCH 12/29] powerpc/pseries/mobility: extract VASI session polling logic

2020-10-29 Thread Nathan Lynch
The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to
the caller until the specified VASI suspend session state makes the
transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest
of separating concerns to prepare for a new implementation of the
join/suspend sequence, extract VASI session polling logic into a
couple of local functions. Waiting for the session state to reach
H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures
that we will never get an EAGAIN result necessitating a retry. No
user-visible change in behavior is intended.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 76 +--
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index dc6abf164db7..1b8ae221b98a 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -345,6 +345,73 @@ void post_mobility_fixup(void)
return;
 }
 
+static int poll_vasi_state(u64 handle, unsigned long *res)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+   long hvrc;
+   int ret;
+
+   hvrc = plpar_hcall(H_VASI_STATE, retbuf, handle);
+   switch (hvrc) {
+   case H_SUCCESS:
+   ret = 0;
+   *res = retbuf[0];
+   break;
+   case H_PARAMETER:
+   ret = -EINVAL;
+   break;
+   case H_FUNCTION:
+   ret = -EOPNOTSUPP;
+   break;
+   case H_HARDWARE:
+   default:
+   pr_err("unexpected H_VASI_STATE result %ld\n", hvrc);
+   ret = -EIO;
+   break;
+   }
+   return ret;
+}
+
+static int wait_for_vasi_session_suspending(u64 handle)
+{
+   unsigned long state;
+   bool keep_polling;
+   int ret;
+
+   /*
+* Wait for transition from H_VASI_ENABLED to
+* H_VASI_SUSPENDING. Treat anything else as an error.
+*/
+   do {
+   keep_polling = false;
+   ret = poll_vasi_state(handle, &state);
+   if (ret != 0)
+   break;
+
+   switch (state) {
+   case H_VASI_SUSPENDING:
+   break;
+   case H_VASI_ENABLED:
+   keep_polling = true;
+   ssleep(1);
+   break;
+   default:
+   pr_err("unexpected H_VASI_STATE result %lu\n", state);
+   ret = -EIO;
+   break;
+   }
+   } while (keep_polling);
+
+   /*
+* Proceed even if H_VASI_STATE is unavailable. If H_JOIN or
+* ibm,suspend-me are also unimplemented, we'll recover then.
+*/
+   if (ret == -EOPNOTSUPP)
+   ret = 0;
+
+   return ret;
+}
+
 static ssize_t migration_store(struct class *class,
   struct class_attribute *attr, const char *buf,
   size_t count)
@@ -356,12 +423,11 @@ static ssize_t migration_store(struct class *class,
if (rc)
return rc;
 
-   do {
-   rc = rtas_ibm_suspend_me_unsafe(streamid);
-   if (rc == -EAGAIN)
-   ssleep(1);
-   } while (rc == -EAGAIN);
+   rc = wait_for_vasi_session_suspending(streamid);
+   if (rc)
+   return rc;
 
+   rc = rtas_ibm_suspend_me_unsafe(streamid);
if (rc)
return rc;
 
-- 
2.25.4



[PATCH 13/29] powerpc/pseries/mobility: use stop_machine for join/suspend

2020-10-29 Thread Nathan Lynch
The partition suspend sequence as specified in the platform
architecture requires that all active processor threads call
H_JOIN, which:

- suspends the calling thread until it is the target of
  an H_PROD; or
- immediately returns H_CONTINUE, if the calling thread is the last to
  call H_JOIN. This thread is expected to call ibm,suspend-me to
  completely suspend the partition.

Upon returning from ibm,suspend-me the calling thread must wake all
others using H_PROD.

rtas_ibm_suspend_me_unsafe() uses on_each_cpu() to implement this
protocol, but because of its synchronizing nature this is susceptible
to deadlock versus users of stop_machine() or other callers of
on_each_cpu().

Not only is stop_machine() intended for use cases like this, it
handles error propagation and allows us to keep the data shared
between CPUs minimal: a single atomic counter which ensures exactly
one CPU will wake the others from their joined states.

Switch the migration code to use stop_machine() and a less complex
local implementation of the H_JOIN/ibm,suspend-me logic, which
carries additional benefits:

- more informative error reporting, appropriately ratelimited
- resets the lockup detector / watchdog on resume to prevent lockup
  warnings when the OS has been suspended for a time exceeding the
  threshold.

Fixes: 91dc182ca6e2 ("[PATCH] powerpc: special-case ibm,suspend-me RTAS call")
Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 132 --
 1 file changed, 125 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index 1b8ae221b98a..44ca7d4e143d 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -12,9 +12,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -412,6 +414,128 @@ static int wait_for_vasi_session_suspending(u64 handle)
return ret;
 }
 
+static void prod_single(unsigned int target_cpu)
+{
+   long hvrc;
+   int hwid;
+
+   hwid = get_hard_smp_processor_id(target_cpu);
+   hvrc = plpar_hcall_norets(H_PROD, hwid);
+   if (hvrc == H_SUCCESS)
+   return;
+   pr_err_ratelimited("H_PROD of CPU %u (hwid %d) error: %ld\n",
+  target_cpu, hwid, hvrc);
+}
+
+static void prod_others(void)
+{
+   unsigned int cpu;
+
+   for_each_online_cpu(cpu) {
+   if (cpu != smp_processor_id())
+   prod_single(cpu);
+   }
+}
+
+static u16 clamp_slb_size(void)
+{
+   u16 prev = mmu_slb_size;
+
+   slb_set_size(SLB_MIN_SIZE);
+
+   return prev;
+}
+
+static int do_suspend(void)
+{
+   u16 saved_slb_size;
+   int status;
+   int ret;
+
+   pr_info("calling ibm,suspend-me on CPU %i\n", smp_processor_id());
+
+   /*
+* The destination processor model may have fewer SLB entries
+* than the source. We reduce mmu_slb_size to a safe minimum
+* before suspending in order to minimize the possibility of
+* programming non-existent entries on the destination. If
+* suspend fails, we restore it before returning. On success
+* the OF reconfig path will update it from the new device
+* tree after resuming on the destination.
+*/
+   saved_slb_size = clamp_slb_size();
+
+   ret = rtas_ibm_suspend_me(&status);
+   if (ret != 0) {
+   pr_err("ibm,suspend-me error: %d\n", status);
+   slb_set_size(saved_slb_size);
+   }
+
+   return ret;
+}
+
+static int do_join(void *arg)
+{
+   atomic_t *counter = arg;
+   long hvrc;
+   int ret;
+
+   /* Must ensure MSR.EE off for H_JOIN. */
+   hard_irq_disable();
+   hvrc = plpar_hcall_norets(H_JOIN);
+
+   switch (hvrc) {
+   case H_CONTINUE:
+   /*
+* All other CPUs are offline or in H_JOIN. This CPU
+* attempts the suspend.
+*/
+   ret = do_suspend();
+   break;
+   case H_SUCCESS:
+   /*
+* The suspend is complete and this cpu has received a
+* prod.
+*/
+   ret = 0;
+   break;
+   case H_BAD_MODE:
+   case H_HARDWARE:
+   default:
+   ret = -EIO;
+   pr_err_ratelimited("H_JOIN error %ld on CPU %i\n",
+  hvrc, smp_processor_id());
+   break;
+   }
+
+   if (atomic_inc_return(counter) == 1) {
+   pr_info("CPU %u waking all threads\n", smp_processor_id());
+   prod_others();
+   }
+   /*
+* Execution may have been suspended for several seconds, so
+* reset the watchdog.
+*/
+   touch_nmi_watchdog();
+   return ret;
+}
+
+static int pseries_migrate_partition(u64 handle)

[PATCH 14/29] powerpc/pseries/mobility: signal suspend cancellation to platform

2020-10-29 Thread Nathan Lynch
If we're returning an error to user space, use H_VASI_SIGNAL to send a
cancellation request to the platform. This isn't strictly required but
it communicates that Linux will not attempt to complete the suspend,
which allows the various entities involved to promptly end the
operation in progress.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index 44ca7d4e143d..0f592246f345 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -520,6 +520,35 @@ static int do_join(void *arg)
return ret;
 }
 
+/*
+ * Abort reason code byte 0. We use only the 'Migrating partition' value.
+ */
+enum vasi_aborting_entity {
+   ORCHESTRATOR= 1,
+   VSP_SOURCE  = 2,
+   PARTITION_FIRMWARE  = 3,
+   PLATFORM_FIRMWARE   = 4,
+   VSP_TARGET  = 5,
+   MIGRATING_PARTITION = 6,
+};
+
+static void pseries_cancel_migration(u64 handle, int err)
+{
+   u32 reason_code;
+   u32 detail;
+   u8 entity;
+   long hvrc;
+
+   entity = MIGRATING_PARTITION;
+   detail = abs(err) & 0xff;
+   reason_code = (entity << 24) | detail;
+
+   hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle,
+ H_VASI_SIGNAL_CANCEL, reason_code);
+   if (hvrc)
+   pr_err("H_VASI_SIGNAL error: %ld\n", hvrc);
+}
+
 static int pseries_migrate_partition(u64 handle)
 {
atomic_t counter = ATOMIC_INIT(0);
@@ -532,6 +561,8 @@ static int pseries_migrate_partition(u64 handle)
ret = stop_machine(do_join, &counter, cpu_online_mask);
if (ret == 0)
post_mobility_fixup();
+   else
+   pseries_cancel_migration(handle, ret);
 out:
return ret;
 }
-- 
2.25.4



[PATCH 16/29] powerpc/rtas: dispatch partition migration requests to pseries

2020-10-29 Thread Nathan Lynch
sys_rtas() cannot call ibm,suspend-me directly in the same way it
handles other inputs. Instead it must dispatch the request to code
that can first perform the H_JOIN sequence before any call to
ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair
amount of platform-specific code to implement this.

Since a different, more robust implementation of the suspend sequence
is now in the pseries platform code, we want to dispatch the request
there while minimizing additional dependence on pseries.

Use a weak function that only pseries overrides. Note that invoking
ibm,suspend-me via the RTAS syscall is all but deprecated; this change
preserves ABI compatibility for old programs while providing to them
the benefit of the new partition suspend implementation. This is a
behavior change in that the kernel performs the device tree update and
firmware activation before returning, but experimentation indicates
this is tolerated fine by legacy user space.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h   | 1 +
 arch/powerpc/kernel/rtas.c| 8 +++-
 arch/powerpc/platforms/pseries/mobility.c | 5 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index fdefe6a974eb..be0fc2536673 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -260,6 +260,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data 
*data);
 extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data);
 int rtas_ibm_suspend_me_unsafe(u64 handle);
 int rtas_ibm_suspend_me(int *fw_status);
+int rtas_syscall_dispatch_ibm_suspend_me(u64 handle);
 
 struct rtc_time;
 extern time64_t rtas_get_boot_time(void);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 58bbd69a233f..52fb394f15d6 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1221,6 +1221,12 @@ static bool block_rtas_call(int token, int nargs,
 
 #endif /* CONFIG_PPC_RTAS_FILTER */
 
+/* Only pseries should need to override this. */
+int __weak rtas_syscall_dispatch_ibm_suspend_me(u64 handle)
+{
+   return -EINVAL;
+}
+
 /* We assume to be passed big endian arguments */
 SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 {
@@ -1271,7 +1277,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
int rc = 0;
u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32)
  | be32_to_cpu(args.args[1]);
-   rc = rtas_ibm_suspend_me_unsafe(handle);
+   rc = rtas_syscall_dispatch_ibm_suspend_me(handle);
if (rc == -EAGAIN)
args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE);
else if (rc == -EIO)
diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index e459cdb8286f..d6417c9db201 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -622,6 +622,11 @@ static int pseries_migrate_partition(u64 handle)
return ret;
 }
 
+int rtas_syscall_dispatch_ibm_suspend_me(u64 handle)
+{
+   return pseries_migrate_partition(handle);
+}
+
 static ssize_t migration_store(struct class *class,
   struct class_attribute *attr, const char *buf,
   size_t count)
-- 
2.25.4



[PATCH 19/29] powerpc/pseries/hibernation: pass stream id via function arguments

2020-10-29 Thread Nathan Lynch
There is no need for the stream id to be a file-global variable; pass
it from hibernate_store() to pseries_suspend_begin() for the
H_VASI_STATE call.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/suspend.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/suspend.c 
b/arch/powerpc/platforms/pseries/suspend.c
index 3eaa9d59dc7a..232621f33510 100644
--- a/arch/powerpc/platforms/pseries/suspend.c
+++ b/arch/powerpc/platforms/pseries/suspend.c
@@ -15,7 +15,6 @@
 #include 
 #include "../../kernel/cacheinfo.h"
 
-static u64 stream_id;
 static struct device suspend_dev;
 static DECLARE_COMPLETION(suspend_work);
 static struct rtas_suspend_me_data suspend_data;
@@ -29,7 +28,7 @@ static atomic_t suspending;
  * Return value:
  * 0 on success / other on failure
  **/
-static int pseries_suspend_begin(suspend_state_t state)
+static int pseries_suspend_begin(u64 stream_id)
 {
long vasi_state, rc;
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
@@ -132,6 +131,7 @@ static ssize_t store_hibernate(struct device *dev,
   struct device_attribute *attr,
   const char *buf, size_t count)
 {
+   u64 stream_id;
int rc;
 
if (!capable(CAP_SYS_ADMIN))
@@ -140,7 +140,7 @@ static ssize_t store_hibernate(struct device *dev,
stream_id = simple_strtoul(buf, NULL, 16);
 
do {
-   rc = pseries_suspend_begin(PM_SUSPEND_MEM);
+   rc = pseries_suspend_begin(stream_id);
if (rc == -EAGAIN)
ssleep(1);
} while (rc == -EAGAIN);
@@ -148,8 +148,6 @@ static ssize_t store_hibernate(struct device *dev,
if (!rc)
rc = pm_suspend(PM_SUSPEND_MEM);
 
-   stream_id = 0;
-
if (!rc)
rc = count;
 
-- 
2.25.4



[PATCH 15/29] powerpc/pseries/mobility: retry partition suspend after error

2020-10-29 Thread Nathan Lynch
This is a mitigation for the relatively rare occurrence where a
virtual IOA can be in a transient state that prevents the
suspend/migration from succeeding, resulting in an error from
ibm,suspend-me.

If the join/suspend sequence returns an error, it is acceptable to
retry as long as the VASI suspend session state is still
"Suspending" (i.e. the platform is still waiting for the OS to
suspend).

Retry a few times on suspend failure while this condition holds,
progressively increasing the delay between attempts. We don't want to
retry indefinitely because firmware emits an error log event on each
unsuccessful attempt.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/mobility.c | 59 ++-
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index 0f592246f345..e459cdb8286f 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -549,16 +549,71 @@ static void pseries_cancel_migration(u64 handle, int err)
pr_err("H_VASI_SIGNAL error: %ld\n", hvrc);
 }
 
+static int pseries_suspend(u64 handle)
+{
+   const unsigned int max_attempts = 5;
+   unsigned int retry_interval_ms = 1;
+   unsigned int attempt = 1;
+   int ret;
+
+   while (true) {
+   atomic_t counter = ATOMIC_INIT(0);
+   unsigned long vasi_state;
+   int vasi_err;
+
+   ret = stop_machine(do_join, &counter, cpu_online_mask);
+   if (ret == 0)
+   break;
+   /*
+* Encountered an error. If the VASI stream is still
+* in Suspending state, it's likely a transient
+* condition related to some device in the partition
+* and we can retry in the hope that the cause has
+* cleared after some delay.
+*
+* A better design would allow drivers etc to prepare
+* for the suspend and avoid conditions which prevent
+* the suspend from succeeding. For now, we have this
+* mitigation.
+*/
+   pr_notice("Partition suspend attempt %u of %u error: %d\n",
+ attempt, max_attempts, ret);
+
+   if (attempt == max_attempts)
+   break;
+
+   vasi_err = poll_vasi_state(handle, &vasi_state);
+   if (vasi_err == 0) {
+   if (vasi_state != H_VASI_SUSPENDING) {
+   pr_notice("VASI state %lu after failed 
suspend\n",
+ vasi_state);
+   break;
+   }
+   } else if (vasi_err != -EOPNOTSUPP) {
+   pr_err("VASI state poll error: %d", vasi_err);
+   break;
+   }
+
+   pr_notice("Will retry partition suspend after %u ms\n",
+ retry_interval_ms);
+
+   msleep(retry_interval_ms);
+   retry_interval_ms *= 10;
+   attempt++;
+   }
+
+   return ret;
+}
+
 static int pseries_migrate_partition(u64 handle)
 {
-   atomic_t counter = ATOMIC_INIT(0);
int ret;
 
ret = wait_for_vasi_session_suspending(handle);
if (ret)
goto out;
 
-   ret = stop_machine(do_join, &counter, cpu_online_mask);
+   ret = pseries_suspend(handle);
if (ret == 0)
post_mobility_fixup();
else
-- 
2.25.4



[PATCH 17/29] powerpc/rtas: remove rtas_ibm_suspend_me_unsafe()

2020-10-29 Thread Nathan Lynch
rtas_ibm_suspend_me_unsafe() is now unused; remove it and
rtas_percpu_suspend_me() which becomes unused as a result.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas.h |  1 -
 arch/powerpc/kernel/rtas.c  | 64 -
 2 files changed, 65 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index be0fc2536673..1e695e553a36 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -258,7 +258,6 @@ extern int rtas_set_indicator_fast(int indicator, int 
index, int new_value);
 extern void rtas_progress(char *s, unsigned short hex);
 extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data);
 extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data);
-int rtas_ibm_suspend_me_unsafe(u64 handle);
 int rtas_ibm_suspend_me(int *fw_status);
 int rtas_syscall_dispatch_ibm_suspend_me(u64 handle);
 
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 52fb394f15d6..6fde38b488f7 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -895,70 +895,6 @@ int rtas_suspend_cpu(struct rtas_suspend_me_data *data)
return __rtas_suspend_cpu(data, 0);
 }
 
-static void rtas_percpu_suspend_me(void *info)
-{
-   __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1);
-}
-
-int rtas_ibm_suspend_me_unsafe(u64 handle)
-{
-   long state;
-   long rc;
-   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
-   struct rtas_suspend_me_data data;
-   DECLARE_COMPLETION_ONSTACK(done);
-
-   if (!rtas_service_present("ibm,suspend-me"))
-   return -ENOSYS;
-
-   /* Make sure the state is valid */
-   rc = plpar_hcall(H_VASI_STATE, retbuf, handle);
-
-   state = retbuf[0];
-
-   if (rc) {
-   printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned 
%ld\n",rc);
-   return rc;
-   } else if (state == H_VASI_ENABLED) {
-   return -EAGAIN;
-   } else if (state != H_VASI_SUSPENDING) {
-   printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned state 
%ld\n",
-  state);
-   return -EIO;
-   }
-
-   atomic_set(&data.working, 0);
-   atomic_set(&data.done, 0);
-   atomic_set(&data.error, 0);
-   data.token = rtas_token("ibm,suspend-me");
-   data.complete = &done;
-
-   lock_device_hotplug();
-
-   cpu_hotplug_disable();
-
-   /* Call function on all CPUs.  One of us will make the
-* rtas call
-*/
-   on_each_cpu(rtas_percpu_suspend_me, &data, 0);
-
-   wait_for_completion(&done);
-
-   if (atomic_read(&data.error) != 0)
-   printk(KERN_ERR "Error doing global join\n");
-
-
-   cpu_hotplug_enable();
-
-   unlock_device_hotplug();
-
-   return atomic_read(&data.error);
-}
-#else /* CONFIG_PPC_PSERIES */
-int rtas_ibm_suspend_me_unsafe(u64 handle)
-{
-   return -ENOSYS;
-}
 #endif
 
 /**
-- 
2.25.4



[PATCH 20/29] powerpc/pseries/hibernation: remove pseries_suspend_cpu()

2020-10-29 Thread Nathan Lynch
Since commit 48f6e7f6d948 ("powerpc/pseries: remove cede offline state
for CPUs"), ppc_md.suspend_disable_cpu() is no longer used and all
CPUs (save one) are placed into true offline state as opposed to
H_JOIN. So pseries_suspend_cpu() is effectively unused; remove it.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/suspend.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/suspend.c 
b/arch/powerpc/platforms/pseries/suspend.c
index 232621f33510..3315d698d5ab 100644
--- a/arch/powerpc/platforms/pseries/suspend.c
+++ b/arch/powerpc/platforms/pseries/suspend.c
@@ -48,20 +48,6 @@ static int pseries_suspend_begin(u64 stream_id)
   vasi_state);
return -EIO;
}
-
-   return 0;
-}
-
-/**
- * pseries_suspend_cpu - Suspend a single CPU
- *
- * Makes the H_JOIN call to suspend the CPU
- *
- **/
-static int pseries_suspend_cpu(void)
-{
-   if (atomic_read(&suspending))
-   return rtas_suspend_cpu(&suspend_data);
return 0;
 }
 
@@ -235,7 +221,6 @@ static int __init pseries_suspend_init(void)
if ((rc = pseries_suspend_sysfs_register(&suspend_dev)))
return rc;
 
-   ppc_md.suspend_disable_cpu = pseries_suspend_cpu;
ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs;
suspend_set_ops(&pseries_suspend_ops);
return 0;
-- 
2.25.4



[PATCH 21/29] powerpc/machdep: remove suspend_disable_cpu()

2020-10-29 Thread Nathan Lynch
There are no users left of the suspend_disable_cpu() callback, remove
it.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/machdep.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 475687f24f4a..cf6ebbc16cb4 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -207,7 +207,6 @@ struct machdep_calls {
void (*suspend_disable_irqs)(void);
void (*suspend_enable_irqs)(void);
 #endif
-   int (*suspend_disable_cpu)(void);
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
ssize_t (*cpu_probe)(const char *, size_t);
-- 
2.25.4


