date:20190716

Re: [EXTERNAL] Re: [PATCH v2 3/6] powerpc/eeh: Improve debug messages around device addition

2019-07-16 Thread Oliver O'Halloran

On Tue, 2019-07-16 at 16:48 +1000, Sam Bobroff wrote:
> On Thu, Jun 20, 2019 at 01:45:24PM +1000, Oliver O'Halloran wrote:
> > On Thu, Jun 20, 2019 at 12:40 PM Alexey Kardashevskiy  
> > wrote:
> > > On 19/06/2019 14:27, Sam Bobroff wrote:
> > > > On Tue, Jun 11, 2019 at 03:47:58PM +1000, Alexey Kardashevskiy wrote:
> > > > > On 07/05/2019 14:30, Sam Bobroff wrote:
> > > > > > Also remove useless comment.
> > > > > > 
> > > > > > Signed-off-by: Sam Bobroff 
> > > > > > Reviewed-by: Alexey Kardashevskiy 
> > > > > > ---
> > > *snip*
> > > > I can see that edev will be non-NULL here, but that pr_debug() pattern
> > > > (using the PDN information to form the PCI address) is quite common
> > > > across the EEH code, so I think rather than changing a couple of
> > > > specific cases, I should do a separate cleanup patch and introduce
> > > > something like pdn_debug(pdn, ""). What do you think?
> > > 
> > > I'd switch them all to already existing dev_dbg/pci_debug rather than
> > > adding pdn_debug as imho it should not have been used in the first place
> > > really...
> > > 
> > > > (I don't know exactly when edev->pdev can be NULL.)
> > > 
> > > ... and if you switch to dev_dbg/pci_debug, I think quite soon you'll
> > > know if it can or cannot be NULL :)
> > 
> > As far as I can tell edev->pdev is NULL in two cases:
> > 
> > 1. Before eeh_device_add_late() has been called on the pdev. The late
> > part of the add maps the pdev to an edev and sets the pdev's edev
> > > pointer and vice versa.
> > > 2. While recovering EEH-unaware devices. Unaware devices are
> > destroyed and rescanned and the edev->pdev pointer is cleared by
> > pcibios_device_release()
> > 
> > In most of these cases it should be safe to use the pci_*() functions
> > rather than making a new one up for printing pdns. In the cases where
> > we might not have a PCI dev i'd make a new set of prints that take an
> > EEH dev rather than a pci_dn since i'd like pci_dn to die sooner
> > rather than later.
> > 
> > Oliver
> 
> I'll change the calls in {pnv,pseries}_pcibios_bus_add_device() and
> eeh_add_device_late() to use dev_dbg() and post a new version.
> 
> For {pnv,pseries}_eeh_probe() I'm not sure what we can do; there's no
> pci_dev available yet and while it would be nice to use the eeh_dev
> rather than the pdn, it doesn't seem to have the bus/device/fn
> information we need. Am I missing something there?  (The code in the
> probe functions seems to get it from the pci_dn.)

We do have a pci_dev in the powernv case since pnv_eeh_probe() isn't
called until the late probe happens (which is after the pci_dev has
been created). I've got some patches to rework the probe path to make
this a bit clearer, but they need a bit more work.

> 
> If there isn't an easy way around this, would it therefore be reasonable
> to just leave them open-coded as they are?

I've had this patch floating around a while that should do the trick.
The PCI_BUSNO macro is probably unnecessary since I'm sure there is
something that does it in generic code, but I couldn't find it.
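
For readers following along, here is a minimal user-space sketch of how a
bdfn value breaks down, assuming the usual PCI layout (bus in bits 15:8,
slot in bits 7:3, function in bits 2:0). The BDFN_* macros and
format_bdfn() are hypothetical names used only for illustration; they are
not part of the patch below, and the "domain" parameter stands in for
whatever the controller pointer would provide.

```c
#include <stdio.h>
#include <stddef.h>

/* Parenthesise the argument so expressions such as (a | b) decode correctly. */
#define BDFN_BUS(bdfn)  (((bdfn) >> 8) & 0xff)
#define BDFN_SLOT(bdfn) (((bdfn) >> 3) & 0x1f)
#define BDFN_FUNC(bdfn) ((bdfn) & 0x07)

/*
 * Format a PCI address as domain:bus:slot.func.
 * "domain" is a hypothetical parameter for this sketch only.
 * Returns the snprintf() result.
 */
static int format_bdfn(char *buf, size_t len,
		       unsigned int domain, unsigned int bdfn)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			domain, BDFN_BUS(bdfn), BDFN_SLOT(bdfn),
			BDFN_FUNC(bdfn));
}
```

A debug print that takes an eeh_dev could build its prefix this way
without needing either a pci_dev or a pci_dn.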


>From 61ff8c23c4d13ff640fb2d069dcedacdf2ee22dd Mon Sep 17 00:00:00 2001
From: Oliver O'Halloran 
Date: Thu, 18 Apr 2019 18:25:13 +1000
Subject: [PATCH] powerpc/eeh: Add bdfn field to eeh_dev

Preparation for removing pci_dn from the powernv EEH code. The only thing we
really use pci_dn for is to get the bdfn of the device for config space
accesses, so adding that information to eeh_dev reduces the need to carry
around the pci_dn.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/eeh.h | 2 ++
 arch/powerpc/include/asm/ppc-pci.h | 2 ++
 arch/powerpc/kernel/eeh_dev.c  | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7fd476d..a208e02 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -131,6 +131,8 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
 struct eeh_dev {
int mode;   /* EEH mode */
int class_code; /* Class code of the device */
+   int bdfn;   /* bdfn of device (for cfg ops) */
+   struct pci_controller *controller;
int pe_config_addr; /* PE config address*/
u32 config_space[16];   /* Saved PCI config space   */
int pcix_cap;   /* Saved PCIx capability*/
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index cec2d64..72860de 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -74,6 +74,8 @@ static inline const char *eeh_driver_name(struct pci_dev *pdev)
 
 #endif /* CONFIG_EEH */
 
+#define PCI_BUSNO(bdfn) ((bdfn >> 8) & 0xff)
+
 #else /* CONFIG_PCI */
 static inline void init_pci_config_tokens(void) { }
 #endif /* !CONFIG_PCI */
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kerne

Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-16 Thread Masahiro Yamada

On Tue, Jul 16, 2019 at 3:16 AM Segher Boessenkool
 wrote:
>
> On Mon, Jul 15, 2019 at 09:03:46PM +0900, Masahiro Yamada wrote:
> > On Mon, Jul 15, 2019 at 4:30 PM Segher Boessenkool
> >  wrote:
> > >
> > > On Mon, Jul 15, 2019 at 05:05:34PM +1000, Michael Ellerman wrote:
> > > > Segher Boessenkool  writes:
> > > > > Yes, that is why I used the environment variable, all binutils work
> > > > > with that.  There was no --target option in GNU ar before 2.22.
> >
> > I use binutils 2.30
> > It does not understand --target option.
> >
> > $ powerpc-linux-ar --version
> > GNU ar (GNU Binutils) 2.30
> > Copyright (C) 2018 Free Software Foundation, Inc.
> > This program is free software; you may redistribute it under the terms of
> > the GNU General Public License version 3 or (at your option) any later 
> > version.
> > This program has absolutely no warranty.
> >
> > If I give --target=elf$(BITS)-$(GNUTARGET) option, I see this:
> > powerpc-linux-ar: -t: No such file or directory
>
> You need to provide a valid command line, like
>
> $ powerpc-linux-ar tv smth.a --target=elf32-powerpc
>
> ar is a bit weird.


Ah, I see!

I had missed the space being required.

Since I cannot test old binutils,
I will leave this to ppc people.






--
Best Regards
Masahiro Yamada

Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags

2019-07-16 Thread Aleksa Sarai

On 2019-07-14, Al Viro  wrote:
> On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote:
> > On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
> > > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
> > > 
> > > > if (flags & LOOKUP_BENEATH) {
> > > > nd->root = nd->path;
> > > > if (!(flags & LOOKUP_RCU))
> > > > path_get(&nd->root);
> > > > else
> > > > nd->root_seq = nd->seq;
> > > 
> > > BTW, this assignment is needed for LOOKUP_RCU case.  Without it
> > > you are pretty much guaranteed that lazy pathwalk will fail,
> > > when it comes to complete_walk().
> > > 
> > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH
> > > combination would someday get passed?
> > 
> > I don't understand what's going on with ->r_seq in there - your
> > call of path_is_under() is after having (re-)sampled rename_lock,
> > but if that was the only .. in there, who's going to recheck
> > the value?  For that matter, what's to guarantee that the thing
> > won't get moved just as you are returning from handle_dots()?
> > 
> > IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
> 
> Sigh...  Usual effects of trying to document things:
> 
> 1) LOOKUP_NO_EVAL looks bogus.  It had been introduced by commit 57d4657716ac
> (audit: ignore fcaps on umount) and AFAICS it's crap.  It is set in
> ksys_umount() and nowhere else.  It's ignored by everything except
> filename_mountpoint().  The thing is, call graph for filename_mountpoint()
> is
>   filename_mountpoint()
>   <- user_path_mountpoint_at()
>   <- ksys_umount()
>   <- kern_path_mountpoint()
>   <- autofs_dev_ioctl_ismountpoint()
>   <- find_autofs_mount()
>   <- autofs_dev_ioctl_open_mountpoint()
>   <- autofs_dev_ioctl_requester()
>   <- autofs_dev_ioctl_ismountpoint()
> In other words, that flag is basically "was filename_mountpoint()
> been called by umount(2) or has it come from an autofs ioctl?".
> And looking at the rationale in that commit, autofs ioctls need
> it just as much as umount(2) does.  Why is it not set for those
> as well?  And why is it conditional at all?

In addition, LOOKUP_NO_EVAL == LOOKUP_OPEN (0x100). Is that meant to be
the case? Also I just saw you have a patch in work.namei that fixes this
up -- do you want me to rebase on top of that?
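
To illustrate why the shared value matters, here is a small stand-alone
sketch. The two constants use the value quoted above (0x100); the helper
functions are hypothetical and exist only to show that code testing one
flag cannot tell it apart from the other.

```c
#include <stdbool.h>

/* Values as quoted in the message above: both flags use bit 0x100. */
#define LOOKUP_OPEN    0x0100u
#define LOOKUP_NO_EVAL 0x0100u

/* Hypothetical helpers that test each flag independently. */
static bool wants_open(unsigned int flags)
{
	return (flags & LOOKUP_OPEN) != 0;
}

static bool wants_no_eval(unsigned int flags)
{
	return (flags & LOOKUP_NO_EVAL) != 0;
}
```

With these definitions, a caller that sets only LOOKUP_NO_EVAL also
satisfies wants_open(), and the reverse.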

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH




Re: [PATCH v6] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-16 Thread wen.yang99

> > The cpu variable is still being used in the of_get_property() call
> > after the of_node_put() call, which may result in use-after-free.
> >
> > Fixes: a9acc26b75f6 ("cpufreq/pasemi: fix possible object reference leak")
> > Signed-off-by: Wen Yang 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Viresh Kumar 
> > Cc: Michael Ellerman 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux...@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> > ---
> > v6: keep the blank line and fix warning: label 'out_unmap_sdcpwr' defined 
> > but not used.
> > v5: put together the code to get, use, and release cpu device_node.
> > v4: restore the blank line.
> > v3: fix a leaked reference.
> > v2: clean up the code according to the advice of viresh.
> >
> >  drivers/cpufreq/pasemi-cpufreq.c | 26 ++
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> > b/drivers/cpufreq/pasemi-cpufreq.c
> > index 6b1e4ab..7d557f9 100644
> > --- a/drivers/cpufreq/pasemi-cpufreq.c
> > +++ b/drivers/cpufreq/pasemi-cpufreq.c
> > @@ -131,10 +131,18 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
> > *policy)
> >  int err = -ENODEV;
> >
> >  cpu = of_get_cpu_node(policy->cpu, NULL);
> > +if (!cpu)
> > +goto out;
> >
> > +max_freqp = of_get_property(cpu, "clock-frequency", NULL);
> >  of_node_put(cpu);
> > -if (!cpu)
> > +if (!max_freqp) {
> > +err = -EINVAL;
> >  goto out;
> > +}
> > +
> > +/* we need the freq in kHz */
> > +max_freq = *max_freqp / 1000;
> >
> >  dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
> >  if (!dn)
> > @@ -171,16 +179,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
> > *policy)
> >  }
> >
> >  pr_debug("init cpufreq on CPU %d\n", policy->cpu);
> > -
> > -max_freqp = of_get_property(cpu, "clock-frequency", NULL);
> > -if (!max_freqp) {
> > -err = -EINVAL;
> > -goto out_unmap_sdcpwr;
> > -}
> > -
> > -/* we need the freq in kHz */
> > -max_freq = *max_freqp / 1000;
> > -
> >  pr_debug("max clock-frequency is at %u kHz\n", max_freq);
> >  pr_debug("initializing frequency table\n");
> >
> > @@ -196,7 +194,11 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
> > *policy)
> >  policy->cur = pas_freqs[cur_astate].frequency;
> >  ppc_proc_freq = policy->cur * 1000ul;
> >
> > -return cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
> > +err = cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
> 
> So you are trying to fix an earlier issue here with this. Should have
> been a separate patch. Over that I have just sent a patch now to make
> this routine return void.
> 
> https://lore.kernel.org/lkml/ee8cf5fb4b4a01fdf9199037ff6d835b935cfd13.1562902877.git.viresh.ku...@linaro.org/
> 
> So all you need to do is to remove the label out_unmap_sdcpwr instead.

Okay, thank you.
This patch
(https://lore.kernel.org/lkml/ee8cf5fb4b4a01fdf9199037ff6d835b935cfd13.1562902877.git.viresh.ku...@linaro.org/)
does not seem to have been merged into linux-next yet.

To avoid code conflicts, we will wait until it is merged and then send v7.

--
Thanks and regards,
Wen

> > +if (err)
> > +goto out_unmap_sdcpwr;
> > +
> > +return 0;
> >
> >  out_unmap_sdcpwr:
> >  iounmap(sdcpwr_mapbase);
> > --
> > 2.9.5
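
The ordering problem discussed in this thread (using a node after its
reference has been dropped) can be shown with a toy user-space model.
Everything here is hypothetical: struct toy_node, toy_get() and toy_put()
only stand in for a reference-counted device node and are not the kernel
API.

```c
#include <stddef.h>

/* Toy stand-in for a reference-counted device node; not the kernel type. */
struct toy_node {
	int refcount;
	unsigned int clock_hz;	/* models the "clock-frequency" property */
};

static struct toy_node *toy_get(struct toy_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void toy_put(struct toy_node *n)
{
	if (n)
		n->refcount--;
}

/*
 * Take the reference, read what is needed while it is held, copy the
 * value out, and only then drop the reference.
 * Returns 0 on success, -1 if the node is missing.
 */
static int read_max_freq_khz(struct toy_node *n, unsigned int *khz)
{
	struct toy_node *cpu = toy_get(n);

	if (!cpu)
		return -1;

	*khz = cpu->clock_hz / 1000;	/* use while the reference is held */
	toy_put(cpu);			/* release only after the last use */
	return 0;
}
```

Keeping get, use and put together in one place, as the v5 changelog entry
describes, makes it easy to see that nothing touches the node after the
put.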

Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-07-16 Thread Oscar Salvador

On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote:
> On 01.07.19 12:27, Michal Hocko wrote:
> > On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
> >> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> >>> Yeah, we do not allow to offline multi zone (node) ranges so the current
> >>> code seems to be over engineered.
> >>>
> >>> Anyway, I am wondering why do we have to strictly check for already
> >>> removed nodes links. Is the sysfs code going to complain if we try to
> >>> remove again?
> >>
> >> No, sysfs will silently "fail" if the symlink has already been removed.
> >> At least that is what I saw last time I played with it.
> >>
> >> I guess the question is what if sysfs handling changes in the future
> >> and starts dropping warnings when trying to remove a symlink that is not there.
> >> Maybe that is unlikely to happen?
> > 
> > And maybe we handle it then rather than have a static allocation that
> > everybody with hotremove configured has to pay for.
> > 
> 
> So what's the suggestion? Dropping the nodemask_t completely and calling
> sysfs_remove_link() on already potentially removed links?
> 
> Of course, we can also just use mem_blk->nid and rest assured that it
> will never be called for memory blocks belonging to multiple nodes.

Hi David,

While it is easy to construct a scenario where a memblock belongs to multiple
nodes, I have to confess that I have not yet seen that in a real-world scenario.

That said, I think the least risky way is to just drop the nodemask_t
and not care about calling sysfs_remove_link() for already removed links.
As I said, sysfs_remove_link() will silently fail when it cannot find the
symlink, so I do not think it is a big deal.
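
As a rough illustration of the idea, here is a toy user-space model. It
assumes, as described above, that removing a link which is already gone
is a silent no-op; struct toy_links and the toy_* functions are
hypothetical and do not reflect the real sysfs implementation.

```c
#include <stdbool.h>

#define TOY_MAX_NODES 4

/* Toy model: one flag per node saying whether the symlink exists. */
struct toy_links {
	bool present[TOY_MAX_NODES];
};

/* Removing a link that is already gone is a silent no-op. */
static void toy_remove_link(struct toy_links *l, int nid)
{
	if (nid < 0 || nid >= TOY_MAX_NODES)
		return;
	l->present[nid] = false;
}

/*
 * Remove the link for the node of every section in a block, without
 * remembering which nodes were already handled.
 */
static void toy_unregister(struct toy_links *l, const int *nids, int count)
{
	for (int i = 0; i < count; i++)
		toy_remove_link(l, nids[i]);
}

/* Count the links still present; used to check the end state. */
static int toy_count_links(const struct toy_links *l)
{
	int n = 0;

	for (int i = 0; i < TOY_MAX_NODES; i++)
		n += l->present[i] ? 1 : 0;
	return n;
}
```

Because repeated removals are harmless in this model, no per-call mask of
already visited nodes is needed.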


-- 
Oscar Salvador
SUSE L3

[Bug 203647] Locking API testsuite fails "mixed read-lock/lock-write ABBA" rlock on kernels >=4.14.x

2019-07-16 Thread bugzilla-daemon

https://bugzilla.kernel.org/show_bug.cgi?id=203647

Anatoly Pugachev (mator...@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mator...@gmail.com

--- Comment #7 from Anatoly Pugachev (mator...@gmail.com) ---
it's the same for sparc64:

`
[0.20] PROMLIB: Sun IEEE Boot Prom 'OBP 4.38.12 2018/03/28 14:54'
[0.32] PROMLIB: Root node compatible: sun4v
[0.80] Linux version 5.2.0-10808-g9637d517347e (mator@ttip) (gcc version 8.3.0 (Debian 8.3.0-7)) #1080 SMP Tue Jul 16 10:46:19 MSK 2019
[0.000386] printk: bootconsole [earlyprom0] enabled
[0.000441] ARCH: SUN4V
...
[0.451068]  memory used by lock dependency info: 3855 kB
[0.451104]  per task-struct memory footprint: 1920 bytes
[0.451140] 
[0.451167] | Locking API testsuite:
[0.451194]

[0.451244]  | spin |wlock |rlock |mutex | wsem | rsem |
[0.451294]  --
[0.451350]  A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.454281]  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.457443]  A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.460744]  A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.464032]  A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.467545]  A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.471009]  A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.474475] double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.477425]   initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[0.480301]  --
[0.480352]   recursive read-lock: |  ok  |    |  ok  |
[0.481247] recursive read-lock #2: |  ok  |    |  ok  |
[0.482120] mixed read-write-lock: |  ok  |    |  ok  |
[0.482998] mixed write-read-lock: |  ok  |    |  ok  |
[0.483878]   mixed read-lock/lock-write ABBA: |FAILED|    |  ok  |
[0.484755] mixed read-lock/lock-read ABBA: |  ok  |    |  ok  |
[0.485676]  mixed write-lock/lock-write ABBA: |  ok  |    |  ok  |
[0.486597]  --
`

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v6] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-16 Thread Viresh Kumar

On 16-07-19, 16:26, wen.yan...@zte.com.cn wrote:
> Okay, thank you.
> This patch
> (https://lore.kernel.org/lkml/ee8cf5fb4b4a01fdf9199037ff6d835b935cfd13.1562902877.git.viresh.ku...@linaro.org/)
> does not seem to have been merged into linux-next yet.
> 
> To avoid code conflicts, we will wait until it is merged and then send v7.

Please rebase on PM tree's linux-next branch instead and resend your
patch.

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git

-- 
viresh

[PATCH v2] powerpc/nvdimm: Pick the nearby online node if the device node is not online

2019-07-16 Thread Aneesh Kumar K.V

This is similar to what ACPI does. The nvdimm layer doesn't bring the SCM
device's NUMA node online, so we need to make sure we always use an online node
as ndr_desc.numa_node. Otherwise this results in kernel crashes. The target
node is used by dax/kmem, which will bring the NUMA node online correctly.
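
The selection rule can be sketched in user space as follows. This is only
a model under stated assumptions: struct toy_topology, TOY_NO_NODE and
toy_nearest_online_node() are hypothetical stand-ins for the kernel's node
masks, NUMA_NO_NODE and node_distance(), and the distance table is made up
for the example.

```c
#include <limits.h>
#include <stdbool.h>

#define TOY_NR_NODES 3
#define TOY_NO_NODE  (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical topology: which nodes are online and their distances. */
struct toy_topology {
	bool online[TOY_NR_NODES];
	int distance[TOY_NR_NODES][TOY_NR_NODES];
};

/*
 * Return "node" itself if it is unset or already online; otherwise
 * return the online node with the smallest distance to it.
 */
static int toy_nearest_online_node(const struct toy_topology *t, int node)
{
	int min_dist = INT_MAX;
	int min_node = TOY_NO_NODE;

	if (node == TOY_NO_NODE)
		return node;
	if (node < 0 || node >= TOY_NR_NODES)
		return TOY_NO_NODE;	/* out of range in this toy model */
	if (t->online[node])
		return node;

	for (int nid = 0; nid < TOY_NR_NODES; nid++) {
		if (!t->online[nid])
			continue;
		if (t->distance[node][nid] < min_dist) {
			min_dist = t->distance[node][nid];
			min_node = nid;
		}
	}
	return min_node;
}
```

With nodes 0 and 1 online and node 2 offline, a device attached to node 2
would be given whichever of nodes 0 and 1 is closer, while node 2 is kept
as the target node.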

Without this patch, we hit a kernel crash as below because we try to access
uninitialized NODE_DATA in different code paths.

cpu 0x0: Vector: 300 (Data Access) at [c000fac53170]
pc: c04bbc50: ___slab_alloc+0x120/0xca0
lr: c04bc834: __slab_alloc+0x64/0xc0
sp: c000fac53400
   msr: 82009033
   dar: 73e8
 dsisr: 8
  current = 0xc000fabb6d80
  paca= 0xc387   irqmask: 0x03   irq_happened: 0x01
pid   = 7, comm = kworker/u16:0
Linux version 5.2.0-06234-g76bd729b2644 (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #135 SMP Thu Jul 11 05:36:30 CDT 2019
enter ? for help
[link register   ] c04bc834 __slab_alloc+0x64/0xc0
[c000fac53400] c000fac53480 (unreliable)
[c000fac53500] c04bc818 __slab_alloc+0x48/0xc0
[c000fac53560] c04c30a0 __kmalloc_node_track_caller+0x3c0/0x6b0
[c000fac535d0] c0cfafe4 devm_kmalloc+0x74/0xc0
[c000fac53600] c0d69434 nd_region_activate+0x144/0x560
[c000fac536d0] c0d6b19c nd_region_probe+0x17c/0x370
[c000fac537b0] c0d6349c nvdimm_bus_probe+0x10c/0x230
[c000fac53840] c0cf3cc4 really_probe+0x254/0x4e0
[c000fac538d0] c0cf429c driver_probe_device+0x16c/0x1e0
[c000fac53950] c0cf0b44 bus_for_each_drv+0x94/0x130
[c000fac539b0] c0cf392c __device_attach+0xdc/0x200
[c000fac53a50] c0cf231c bus_probe_device+0x4c/0xf0
[c000fac53a90] c0ced268 device_add+0x528/0x810
[c000fac53b60] c0d62a58 nd_async_device_register+0x28/0xa0
[c000fac53bd0] c01ccb8c async_run_entry_fn+0xcc/0x1f0
[c000fac53c50] c01bcd9c process_one_work+0x46c/0x860
[c000fac53d20] c01bd4f4 worker_thread+0x364/0x5f0
[c000fac53db0] c01c7260 kthread+0x1b0/0x1c0
[c000fac53e20] c000b954 ret_from_kernel_thread+0x5c/0x68

With the patch we get

 # numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 130865 MB
node 1 free: 129130 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
 # cat /sys/bus/nd/devices/region0/numa_node
0
 # dmesg | grep papr_scm
[   91.332305] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Region registered with target node 2 and online node 0

Signed-off-by: Aneesh Kumar K.V 
---
changes from V1:
* handle NUMA_NO_NODE

 arch/powerpc/platforms/pseries/papr_scm.c | 30 +--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index c8ec670ee924..b813bc92f35f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -255,12 +255,32 @@ static const struct attribute_group *papr_scm_dimm_groups[] = {
NULL,
 };
 
+static inline int papr_scm_node(int node)
+{
+   int min_dist = INT_MAX, dist;
+   int nid, min_node;
+
+   if ((node == NUMA_NO_NODE) || node_online(node))
+   return node;
+
+   min_node = first_online_node;
+   for_each_online_node(nid) {
+   dist = node_distance(node, nid);
+   if (dist < min_dist) {
+   min_dist = dist;
+   min_node = nid;
+   }
+   }
+   return min_node;
+}
+
 static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 {
struct device *dev = &p->pdev->dev;
struct nd_mapping_desc mapping;
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
+   int target_nid, online_nid;
 
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -299,8 +319,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 
memset(&ndr_desc, 0, sizeof(ndr_desc));
ndr_desc.attr_groups = region_attr_groups;
-   ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
-   ndr_desc.target_node = ndr_desc.numa_node;
+   target_nid = dev_to_node(&p->pdev->dev);
+   online_nid = papr_scm_node(target_nid);
+   set_dev_node(&p->pdev->dev, online_nid);
+   ndr_desc.numa_node = online_nid;
+   ndr_desc.target_node = target_nid;
ndr_desc.res = &p->res;
ndr_desc.of_node = p->dn;
ndr_desc.provider_data = p;
@@ -318,6 +341,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.res, p->dn);
goto err;
}
+   if (target_nid != online_nid)
+   dev_info(dev, "Regi

Re: [PATCH kernel v2] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Alexey Kardashevskiy





On 16/07/2019 18:59, Cédric Le Goater wrote:
> On 15/07/2019 09:11, Alexey Kardashevskiy wrote:
>> There is a race between releasing an irq on one cpu and fetching it
>> from XIVE on another cpu as there does not seem to be any locking between
>> these, probably because xive_irq_chip::irq_shutdown() is supposed to
>> remove the irq from all queues in the system which it does not do.
>>
>> As a result, when such released irq appears in a queue, we take it
>> from the queue but we do not change the current priority on that cpu and
>> since there is no handler for the irq, EOI is never called and the cpu
>> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
>> is assigned to the same cpu, then that device stops working until irq
>> is moved to another cpu or the device is reset.
>>
>> This adds a new ppc_md.orphan_irq callback which is called if no irq
>> descriptor is found. The XIVE implementation drops the current priority
>> to 0xff which effectively unmasks interrupts in a current CPU.
>
>
> The test on generic_handle_irq() catches interrupt events that
> were served on a target CPU while the source interrupt was being
> shutdown on another CPU.
>
> The orphan_irq() handler restores the CPPR in such cases.
>
> This looks OK to me. I would have added some more comments in the
> code.


Which and where? Thanks,


> Reviewed-by: Cédric Le Goater 
>
>
> And adding to the list of future cleanups : a 'set_cppr' helper.
>
> Thanks,
>
> C.
>
>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v2:
>> * added ppc_md.orphan_irq
>>
>> ---
>>
>> Found it on P9 system with:
>> - a host with 8 cpus online
>> - a boot disk on ahci with its msix on cpu#0
>> - a guest with 2xGPUs + 6xNVLink + 4 cpus
>> - GPU#0 from the guest is bound to the same cpu#0.
>>
>> Killing a guest killed ahci and therefore the host because of the race.
>> Note that VFIO masks interrupts first and only then resets the device.
>> ---
>>   arch/powerpc/include/asm/machdep.h |  3 +++
>>   arch/powerpc/kernel/irq.c  |  9 ++---
>>   arch/powerpc/sysdev/xive/common.c  | 10 ++
>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/machdep.h 
>> b/arch/powerpc/include/asm/machdep.h
>> index c43d6eca9edd..6cc14e28e89a 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -59,6 +59,9 @@ struct machdep_calls {
>>   /* Return an irq, or 0 to indicate there are none pending. */
>>   unsigned int(*get_irq)(void);
>>
>> +    /* Drops irq if it does not have a valid descriptor */
>> +    void(*orphan_irq)(unsigned int irq);
>> +
>>   /* PCI stuff */
>>   /* Called after allocating resources */
>>   void(*pcibios_fixup)(void);
>> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
>> index bc68c53af67c..b4e06d05bdba 100644
>> --- a/arch/powerpc/kernel/irq.c
>> +++ b/arch/powerpc/kernel/irq.c
>> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>>   may_hard_irq_enable();
>>
>>   /* And finally process it */
>> -    if (unlikely(!irq))
>> +    if (unlikely(!irq)) {
>>   __this_cpu_inc(irq_stat.spurious_irqs);
>> -    else
>> -    generic_handle_irq(irq);
>> +    } else if (generic_handle_irq(irq)) {
>> +    if (ppc_md.orphan_irq)
>> +    ppc_md.orphan_irq(irq);
>> +    __this_cpu_inc(irq_stat.spurious_irqs);
>> +    }
>>
>>   trace_irq_exit(regs);
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 082c7e1c20f0..b4054091999a 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -283,6 +283,15 @@ static unsigned int xive_get_irq(void)
>>   return irq;
>>   }
>>
>> +static void xive_orphan_irq(unsigned int irq)
>> +{
>> +    struct xive_cpu *xc = __this_cpu_read(xive_cpu);
>> +
>> +    xc->cppr = 0xff;
>> +    out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
>> +    DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
>> +}
>> +
>>   /*
>>    * After EOI'ing an interrupt, we need to re-check the queue
>>    * to see if another interrupt is pending since multiple
>> @@ -1419,6 +1428,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
>> void __iomem *area, u32 o
>>   xive_irq_priority = max_prio;
>>
>>   ppc_md.get_irq = xive_get_irq;
>> +    ppc_md.orphan_irq = xive_orphan_irq;
>>   __xive_enabled = true;
>>
>>   pr_devel("Initializing host..\n");






--
Alexey

Re: [PATCH kernel v2] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Cédric Le Goater

On 16/07/2019 11:10, Alexey Kardashevskiy wrote:
> 
> 
> On 16/07/2019 18:59, Cédric Le Goater wrote:
>> On 15/07/2019 09:11, Alexey Kardashevskiy wrote:
>>> There is a race between releasing an irq on one cpu and fetching it
>>> from XIVE on another cpu as there does not seem to be any locking between
>>> these, probably because xive_irq_chip::irq_shutdown() is supposed to
>>> remove the irq from all queues in the system which it does not do.
>>>
>>> As a result, when such released irq appears in a queue, we take it
>>> from the queue but we do not change the current priority on that cpu and
>>> since there is no handler for the irq, EOI is never called and the cpu
>>> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
>>> is assigned to the same cpu, then that device stops working until irq
>>> is moved to another cpu or the device is reset.
>>>
>>> This adds a new ppc_md.orphan_irq callback which is called if no irq
>>> descriptor is found. The XIVE implementation drops the current priority
>>> to 0xff which effectively unmasks interrupts in a current CPU.
>>
>>
>> The test on generic_handle_irq() catches interrupt events that
>> were served on a target CPU while the source interrupt was being
>> shutdown on another CPU.
>>
>> The orphan_irq() handler restores the CPPR in such cases.
>>
>> This looks OK to me. I would have added some more comments in the
>> code.
> 
> 
> Which and where? Thanks,

Above xive_orphan_irq() explaining the complete problem that we are 
addressing. XIVE is not super obvious when looking at the code ...


C.
 
> 
>> Reviewed-by: Cédric Le Goater 
>>
>>
>> And adding to the list of future cleanups : a 'set_cppr' helper.
>>
>> Thanks,
>>
>> C.
>>
>>
>>> Signed-off-by: Alexey Kardashevskiy 
>>> ---
>>> Changes:
>>> v2:
>>> * added ppc_md.orphan_irq
>>>
>>> ---
>>>
>>> Found it on P9 system with:
>>> - a host with 8 cpus online
>>> - a boot disk on ahci with its msix on cpu#0
>>> - a guest with 2xGPUs + 6xNVLink + 4 cpus
>>> - GPU#0 from the guest is bound to the same cpu#0.
>>>
>>> Killing a guest killed ahci and therefore the host because of the race.
>>> Note that VFIO masks interrupts first and only then resets the device.
>>> ---
>>>   arch/powerpc/include/asm/machdep.h |  3 +++
>>>   arch/powerpc/kernel/irq.c  |  9 ++---
>>>   arch/powerpc/sysdev/xive/common.c  | 10 ++
>>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/machdep.h 
>>> b/arch/powerpc/include/asm/machdep.h
>>> index c43d6eca9edd..6cc14e28e89a 100644
>>> --- a/arch/powerpc/include/asm/machdep.h
>>> +++ b/arch/powerpc/include/asm/machdep.h
>>> @@ -59,6 +59,9 @@ struct machdep_calls {
>>>   /* Return an irq, or 0 to indicate there are none pending. */
>>>   unsigned int    (*get_irq)(void);
>>>   +    /* Drops irq if it does not have a valid descriptor */
>>> +    void    (*orphan_irq)(unsigned int irq);
>>> +
>>>   /* PCI stuff */
>>>   /* Called after allocating resources */
>>>   void    (*pcibios_fixup)(void);
>>> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
>>> index bc68c53af67c..b4e06d05bdba 100644
>>> --- a/arch/powerpc/kernel/irq.c
>>> +++ b/arch/powerpc/kernel/irq.c
>>> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>>>   may_hard_irq_enable();
>>>     /* And finally process it */
>>> -    if (unlikely(!irq))
>>> +    if (unlikely(!irq)) {
>>>   __this_cpu_inc(irq_stat.spurious_irqs);
>>> -    else
>>> -    generic_handle_irq(irq);
>>> +    } else if (generic_handle_irq(irq)) {
>>> +    if (ppc_md.orphan_irq)
>>> +    ppc_md.orphan_irq(irq);
>>> +    __this_cpu_inc(irq_stat.spurious_irqs);
>>> +    }
>>>     trace_irq_exit(regs);
>>>   diff --git a/arch/powerpc/sysdev/xive/common.c 
>>> b/arch/powerpc/sysdev/xive/common.c
>>> index 082c7e1c20f0..b4054091999a 100644
>>> --- a/arch/powerpc/sysdev/xive/common.c
>>> +++ b/arch/powerpc/sysdev/xive/common.c
>>> @@ -283,6 +283,15 @@ static unsigned int xive_get_irq(void)
>>>   return irq;
>>>   }
>>>   +static void xive_orphan_irq(unsigned int irq)
>>> +{
>>> +    struct xive_cpu *xc = __this_cpu_read(xive_cpu);
>>> +
>>> +    xc->cppr = 0xff;
>>> +    out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
>>> +    DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
>>> +}
>>> +
>>>   /*
>>>    * After EOI'ing an interrupt, we need to re-check the queue
>>>    * to see if another interrupt is pending since multiple
>>> @@ -1419,6 +1428,7 @@ bool __init xive_core_init(const struct xive_ops 
>>> *ops, void __iomem *area, u32 o
>>>   xive_irq_priority = max_prio;
>>>     ppc_md.get_irq = xive_get_irq;
>>> +    ppc_md.orphan_irq = xive_orphan_irq;
>>>   __xive_enabled = true;
>>>     pr_devel("Initializing host..\n");
>>>
>>
>

Re: [PATCH kernel v2] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Cédric Le Goater

On 15/07/2019 09:11, Alexey Kardashevskiy wrote:
> There is a race between releasing an irq on one cpu and fetching it
> from XIVE on another cpu as there does not seem to be any locking between
> these, probably because xive_irq_chip::irq_shutdown() is supposed to
> remove the irq from all queues in the system which it does not do.
> 
> As a result, when such released irq appears in a queue, we take it
> from the queue but we do not change the current priority on that cpu and
> since there is no handler for the irq, EOI is never called and the cpu
> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
> is assigned to the same cpu, then that device stops working until irq
> is moved to another cpu or the device is reset.
> 
> This adds a new ppc_md.orphan_irq callback which is called if no irq
> descriptor is found. The XIVE implementation drops the current priority
> to 0xff which effectively unmasks interrupts in a current CPU.


The test on generic_handle_irq() catches interrupt events that
were served on a target CPU while the source interrupt was being
shutdown on another CPU.

The orphan_irq() handler restores the CPPR in such cases. 
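
To make the failure mode concrete, here is a toy user-space model of the
behaviour described above. It rests on simplifying assumptions that are
not taken from the patch: fetching an interrupt sets the CPU's CPPR to
that interrupt's priority, a successful handler path restores it (the
EOI), and an interrupt is only presented when its priority is numerically
lower than the CPPR. All toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CPPR_UNMASKED 0xff
#define TOY_NR_IRQS 8

struct toy_cpu {
	unsigned char cppr;		/* current processor priority */
	bool has_desc[TOY_NR_IRQS];	/* is a descriptor registered? */
	unsigned int spurious;
};

/* Assumption: only priorities more favoured than the CPPR get through. */
static bool toy_deliverable(const struct toy_cpu *cpu, unsigned char prio)
{
	return prio < cpu->cppr;
}

/* Assumption: fetching an irq raises the CPPR to that irq's priority. */
static void toy_fetch(struct toy_cpu *cpu, unsigned char prio)
{
	cpu->cppr = prio;
}

/* Returns 0 if handled (EOI restores the CPPR), -1 if no descriptor. */
static int toy_handle_irq(struct toy_cpu *cpu, unsigned int irq)
{
	if (irq >= TOY_NR_IRQS || !cpu->has_desc[irq])
		return -1;
	cpu->cppr = TOY_CPPR_UNMASKED;	/* models the EOI */
	return 0;
}

/* Orphan hook: nobody will EOI, so restore the CPPR here. */
static void toy_orphan_irq(struct toy_cpu *cpu)
{
	cpu->cppr = TOY_CPPR_UNMASKED;
}

/* Dispatch in the shape of the patched __do_irq(): call the hook on failure. */
static void toy_do_irq(struct toy_cpu *cpu, unsigned int irq,
		       void (*orphan)(struct toy_cpu *))
{
	if (toy_handle_irq(cpu, irq)) {
		if (orphan)
			orphan(cpu);
		cpu->spurious++;
	}
}
```

In this model, dispatching an interrupt with no descriptor and no hook
leaves the CPPR at 7, so further priority-7 interrupts on that CPU are
never presented; with the hook installed the CPPR returns to 0xff.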

This looks OK to me. I would have added some more comments in the 
code. 

Reviewed-by: Cédric Le Goater 


And adding to the list of future cleanups : a 'set_cppr' helper.

Thanks,

C.


> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * added ppc_md.orphan_irq
> 
> ---
> 
> Found it on P9 system with:
> - a host with 8 cpus online
> - a boot disk on ahci with its msix on cpu#0
> - a guest with 2xGPUs + 6xNVLink + 4 cpus
> - GPU#0 from the guest is bound to the same cpu#0.
> 
> Killing a guest killed ahci and therefore the host because of the race.
> Note that VFIO masks interrupts first and only then resets the device.
> ---
>  arch/powerpc/include/asm/machdep.h |  3 +++
>  arch/powerpc/kernel/irq.c  |  9 ++---
>  arch/powerpc/sysdev/xive/common.c  | 10 ++
>  3 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index c43d6eca9edd..6cc14e28e89a 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -59,6 +59,9 @@ struct machdep_calls {
>   /* Return an irq, or 0 to indicate there are none pending. */
>   unsigned int(*get_irq)(void);
>  
> + /* Drops irq if it does not have a valid descriptor */
> + void(*orphan_irq)(unsigned int irq);
> +
>   /* PCI stuff */
>   /* Called after allocating resources */
>   void(*pcibios_fixup)(void);
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bc68c53af67c..b4e06d05bdba 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>   may_hard_irq_enable();
>  
>   /* And finally process it */
> - if (unlikely(!irq))
> + if (unlikely(!irq)) {
>   __this_cpu_inc(irq_stat.spurious_irqs);
> - else
> - generic_handle_irq(irq);
> + } else if (generic_handle_irq(irq)) {
> + if (ppc_md.orphan_irq)
> + ppc_md.orphan_irq(irq);
> + __this_cpu_inc(irq_stat.spurious_irqs);
> + }
>  
>   trace_irq_exit(regs);
>  
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..b4054091999a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,6 +283,15 @@ static unsigned int xive_get_irq(void)
>   return irq;
>  }
>  
> +static void xive_orphan_irq(unsigned int irq)
> +{
> + struct xive_cpu *xc = __this_cpu_read(xive_cpu);
> +
> + xc->cppr = 0xff;
> + out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
> + DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
> +}
> +
>  /*
>   * After EOI'ing an interrupt, we need to re-check the queue
>   * to see if another interrupt is pending since multiple
> @@ -1419,6 +1428,7 @@ bool __init xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 o
>   xive_irq_priority = max_prio;
>  
>   ppc_md.get_irq = xive_get_irq;
> + ppc_md.orphan_irq = xive_orphan_irq;
>   __xive_enabled = true;
>  
>   pr_devel("Initializing host..\n");
>

[PATCH 00/10] cpufreq: Migrate users of policy notifiers to QoS requests

2019-07-16 Thread Viresh Kumar

Hello,

Now that the cpufreq core supports taking QoS requests for min/max cpu
frequencies, let's migrate the rest of the users to using them instead of
the policy notifiers.

The CPUFREQ_NOTIFY and CPUFREQ_ADJUST events of the policy notifiers are
removed as a result, but we have to add CPUFREQ_CREATE_POLICY and
CPUFREQ_REMOVE_POLICY events to it for the acpi stuff specifically. So
the policy notifiers aren't completely removed.

Boot tested on my x86 PC and ARM hikey board. Nothing looked broken :)

This has already been through the build bot for a few days.

--
viresh

Viresh Kumar (10):
  cpufreq: Add policy create/remove notifiers
  video: sa1100fb: Remove cpufreq policy notifier
  video: pxafb: Remove cpufreq policy notifier
  arch_topology: Use CPUFREQ_CREATE_POLICY instead of CPUFREQ_NOTIFY
  thermal: cpu_cooling: Switch to QoS requests instead of cpufreq notifier
  powerpc: macintosh: Switch to QoS requests instead of cpufreq notifier
  cpufreq: powerpc_cbe: Switch to QoS requests instead of cpufreq notifier
  ACPI: cpufreq: Switch to QoS requests instead of cpufreq notifier
  cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
  Documentation: cpufreq: Update policy notifier documentation

 Documentation/cpu-freq/core.txt|  16 +--
 drivers/acpi/processor_driver.c|  44 -
 drivers/acpi/processor_perflib.c   | 106 +---
 drivers/acpi/processor_thermal.c   |  81 ---
 drivers/base/arch_topology.c   |   2 +-
 drivers/cpufreq/cpufreq.c  |  51 --
 drivers/cpufreq/ppc_cbe_cpufreq.c  |  19 +++-
 drivers/cpufreq/ppc_cbe_cpufreq.h  |   8 ++
 drivers/cpufreq/ppc_cbe_cpufreq_pmi.c  |  96 +++---
 drivers/macintosh/windfarm_cpufreq_clamp.c |  77 ++-
 drivers/thermal/cpu_cooling.c  | 110 +
 drivers/video/fbdev/pxafb.c|  21 
 drivers/video/fbdev/pxafb.h|   1 -
 drivers/video/fbdev/sa1100fb.c |  27 -
 drivers/video/fbdev/sa1100fb.h |   1 -
 include/acpi/processor.h   |  22 +++--
 include/linux/cpufreq.h|   4 +-
 17 files changed, 327 insertions(+), 359 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

[PATCH 06/10] powerpc: macintosh: Switch to QoS requests instead of cpufreq notifier

2019-07-16 Thread Viresh Kumar

The cpufreq core now takes the min/max frequency constraints via QoS
requests and the CPUFREQ_ADJUST notifier shall get removed later on.

Switch over to using the QoS request for maximum frequency constraint
for windfarm_cpufreq_clamp driver.
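
As a rough illustration of the intended behaviour, the sketch below models the clamp in plain userspace C. The names (struct freq_limits, clamp_request(), effective_max()) are hypothetical and not part of the kernel API. It also assumes that several maximum-frequency requests combine by taking the smallest one, which is how this model treats them.

```c
/* Userspace model only; names are hypothetical, not the kernel API. */
#include <stddef.h>

struct freq_limits {
	unsigned int min_khz;	/* stands in for policy->cpuinfo.min_freq */
	unsigned int max_khz;	/* stands in for policy->cpuinfo.max_freq */
};

/* Value the clamp would request: the lowest frequency when clamped. */
static unsigned int clamp_request(const struct freq_limits *l, int clamped)
{
	return clamped ? l->min_khz : l->max_khz;
}

/* Effective cap: the smallest of all outstanding maximum requests. */
static unsigned int effective_max(const unsigned int *reqs, size_t n,
				  unsigned int hw_max)
{
	unsigned int cap = hw_max;
	size_t i;

	for (i = 0; i < n; i++)
		if (reqs[i] < cap)
			cap = reqs[i];
	return cap;
}
```

In this model the clamp no longer edits the policy from a notifier. It only updates its own request, and the aggregation step decides the final limit together with any other requesters.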

Signed-off-by: Viresh Kumar 
---
 drivers/macintosh/windfarm_cpufreq_clamp.c | 77 ++
 1 file changed, 50 insertions(+), 27 deletions(-)

diff --git a/drivers/macintosh/windfarm_cpufreq_clamp.c b/drivers/macintosh/windfarm_cpufreq_clamp.c
index 52fd5fca89a0..705c6200814b 100644
--- a/drivers/macintosh/windfarm_cpufreq_clamp.c
+++ b/drivers/macintosh/windfarm_cpufreq_clamp.c
@@ -3,9 +3,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -16,36 +18,24 @@
 
 static int clamped;
 static struct wf_control *clamp_control;
-
-static int clamp_notifier_call(struct notifier_block *self,
-  unsigned long event, void *data)
-{
-   struct cpufreq_policy *p = data;
-   unsigned long max_freq;
-
-   if (event != CPUFREQ_ADJUST)
-   return 0;
-
-   max_freq = clamped ? (p->cpuinfo.min_freq) : (p->cpuinfo.max_freq);
-   cpufreq_verify_within_limits(p, 0, max_freq);
-
-   return 0;
-}
-
-static struct notifier_block clamp_notifier = {
-   .notifier_call = clamp_notifier_call,
-};
+static struct dev_pm_qos_request qos_req;
+static unsigned int min_freq, max_freq;
 
 static int clamp_set(struct wf_control *ct, s32 value)
 {
-   if (value)
+   unsigned int freq;
+
+   if (value) {
+   freq = min_freq;
printk(KERN_INFO "windfarm: Clamping CPU frequency to "
   "minimum !\n");
-   else
+   } else {
+   freq = max_freq;
printk(KERN_INFO "windfarm: CPU frequency unclamped !\n");
+   }
clamped = value;
-   cpufreq_update_policy(0);
-   return 0;
+
+   return dev_pm_qos_update_request(&qos_req, freq);
 }
 
 static int clamp_get(struct wf_control *ct, s32 *value)
@@ -74,27 +64,60 @@ static const struct wf_control_ops clamp_ops = {
 
 static int __init wf_cpufreq_clamp_init(void)
 {
+   struct cpufreq_policy *policy;
struct wf_control *clamp;
+   struct device *dev;
+   int ret;
+
+   policy = cpufreq_cpu_get(0);
+   if (!policy) {
+   pr_warn("%s: cpufreq policy not found cpu0\n", __func__);
+   return -EPROBE_DEFER;
+   }
+
+   min_freq = policy->cpuinfo.min_freq;
+   max_freq = policy->cpuinfo.max_freq;
+   cpufreq_cpu_put(policy);
+
+   dev = get_cpu_device(0);
+   if (unlikely(!dev)) {
+   pr_warn("%s: No cpu device for cpu0\n", __func__);
+   return -ENODEV;
+   }
 
clamp = kmalloc(sizeof(struct wf_control), GFP_KERNEL);
if (clamp == NULL)
return -ENOMEM;
-   cpufreq_register_notifier(&clamp_notifier, CPUFREQ_POLICY_NOTIFIER);
+
+   ret = dev_pm_qos_add_request(dev, &qos_req, DEV_PM_QOS_MAX_FREQUENCY,
+max_freq);
+   if (ret < 0) {
+   pr_err("%s: Failed to add freq constraint (%d)\n", __func__,
+  ret);
+   goto free;
+   }
+
clamp->ops = &clamp_ops;
clamp->name = "cpufreq-clamp";
-   if (wf_register_control(clamp))
+   ret = wf_register_control(clamp);
+   if (ret)
goto fail;
clamp_control = clamp;
return 0;
  fail:
+   dev_pm_qos_remove_request(&qos_req);
+
+ free:
kfree(clamp);
-   return -ENODEV;
+   return ret;
 }
 
 static void __exit wf_cpufreq_clamp_exit(void)
 {
-   if (clamp_control)
+   if (clamp_control) {
wf_unregister_control(clamp_control);
+   dev_pm_qos_remove_request(&qos_req);
+   }
 }
 
 
-- 
2.21.0.rc0.269.g1a574e7a288b

Re: [PATCH 00/10] cpufreq: Migrate users of policy notifiers to QoS requests

2019-07-16 Thread Rafael J. Wysocki

On Tue, Jul 16, 2019 at 11:49 AM Viresh Kumar  wrote:
>
> Hello,
>
> Now that cpufreq core supports taking QoS requests for min/max cpu
> frequencies, lets migrate rest of the users to using them instead of the
> policy notifiers.

Technically, this still is linux-next only. :-)

> The CPUFREQ_NOTIFY and CPUFREQ_ADJUST events of the policy notifiers are
> removed as a result, but we have to add CPUFREQ_CREATE_POLICY and
> CPUFREQ_REMOVE_POLICY events to it for the acpi stuff specifically. So
> the policy notifiers aren't completely removed.

That's not entirely accurate, because arch_topology is going to use
CPUFREQ_CREATE_POLICY now too.

> Boot tested on my x86 PC and ARM hikey board. Nothing looked broken :)
>
> This has already gone through build bot for a few days now.

So I'd prefer patches [5-8] to go right after the first one and then
do the cleanups on top of that, as somebody may want to backport the
essential changes without the cleanups.

Re: [PATCH 00/10] cpufreq: Migrate users of policy notifiers to QoS requests

2019-07-16 Thread Viresh Kumar

On 16-07-19, 12:06, Rafael J. Wysocki wrote:
> On Tue, Jul 16, 2019 at 11:49 AM Viresh Kumar  wrote:
> >
> > Hello,
> >
> > Now that cpufreq core supports taking QoS requests for min/max cpu
> > frequencies, lets migrate rest of the users to using them instead of the
> > policy notifiers.
> 
> Technically, this still is linux-next only. :-)

True :)

> > The CPUFREQ_NOTIFY and CPUFREQ_ADJUST events of the policy notifiers are
> > removed as a result, but we have to add CPUFREQ_CREATE_POLICY and
> > CPUFREQ_REMOVE_POLICY events to it for the acpi stuff specifically. So
> > the policy notifiers aren't completely removed.
> 
> That's not entirely accurate, because arch_topology is going to use
> CPUFREQ_CREATE_POLICY now too.

Yeah, I thought about that while writing this patchset and the
cover letter. But had it not been required for ACPI, I would have done
it differently for the arch-topology code, maybe by calling the
arch-topology routine directly from the cpufreq core. I wanted to get
rid of the policy notifiers completely but I couldn't find a better way
of doing it for the ACPI stuff.

> > Boot tested on my x86 PC and ARM hikey board. Nothing looked broken :)
> >
> > This has already gone through build bot for a few days now.
> 
> So I'd prefer patches [5-8] to go right after the first one and then
> do the cleanups on top of that, as somebody may want to backport the
> essential changes without the cleanups.

In the exceptional case where nobody finds anything wrong with the
patches (highly unlikely), do you want me to resend with reordering, or
can you reorder them while applying? There are no dependencies between
those patches anyway.

--
viresh

Re: [PATCH 00/10] cpufreq: Migrate users of policy notifiers to QoS requests

2019-07-16 Thread Rafael J. Wysocki

On Tue, Jul 16, 2019 at 12:14 PM Viresh Kumar  wrote:
>
> On 16-07-19, 12:06, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2019 at 11:49 AM Viresh Kumar wrote:
> > >
> > > Hello,
> > >
> > > Now that cpufreq core supports taking QoS requests for min/max cpu
> > > frequencies, lets migrate rest of the users to using them instead of the
> > > policy notifiers.
> >
> > Technically, this still is linux-next only. :-)
>
> True :)
>
> > > The CPUFREQ_NOTIFY and CPUFREQ_ADJUST events of the policy notifiers are
> > > removed as a result, but we have to add CPUFREQ_CREATE_POLICY and
> > > CPUFREQ_REMOVE_POLICY events to it for the acpi stuff specifically. So
> > > the policy notifiers aren't completely removed.
> >
> > That's not entirely accurate, because arch_topology is going to use
> > CPUFREQ_CREATE_POLICY now too.
>
> Yeah, I thought about that while writing this patchset and
> coverletter. But had it not been required for ACPI, I would have done
> it differently for the arch-topology code. Maybe direct calling of
> arch-topology routine from cpufreq core. I wanted to get rid of the
> policy notifiers completely but I couldn't find a better way of doing
> it for ACPI stuff.
>
> > > Boot tested on my x86 PC and ARM hikey board. Nothing looked broken :)
> > >
> > > This has already gone through build bot for a few days now.
> >
> > So I'd prefer patches [5-8] to go right after the first one and then
> > do the cleanups on top of that, as somebody may want to backport the
> > essential changes without the cleanups.
>
> In the exceptional case where nobody finds anything wrong with the
> patches (highly unlikely), do you want me to resend with reordering or
> you can reorder them while applying? There are no dependencies between
> those patches anyway.

Please resend the reordered set when the merge window closes.

Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-07-16 Thread David Hildenbrand

On 16.07.19 10:46, Oscar Salvador wrote:
> On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote:
>> On 01.07.19 12:27, Michal Hocko wrote:
>>> On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
>>>> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
>>>>> Yeah, we do not allow to offline multi zone (node) ranges so the current
>>>>> code seems to be over engineered.
>>>>>
>>>>> Anyway, I am wondering why do we have to strictly check for already
>>>>> removed nodes links. Is the sysfs code going to complain if we try to
>>>>> remove again?
>>>>
>>>> No, sysfs will silently "fail" if the symlink has already been removed.
>>>> At least that is what I saw last time I played with it.
>>>>
>>>> I guess the question is what if sysfs handling changes in the future
>>>> and starts dropping warnings when trying to remove a symlink that is not there.
>>>> Maybe that is unlikely to happen?
>>>
>>> And maybe we handle it then rather than have a static allocation that
>>> everybody with hotremove configured has to pay for.
>>>
>>
>> So what's the suggestion? Dropping the nodemask_t completely and calling
>> sysfs_remove_link() on already potentially removed links?
>>
>> Of course, we can also just use mem_blk->nid and rest assured that it
>> will never be called for memory blocks belonging to multiple nodes.
> 
> Hi David,
> 
> While it is easy to construct a scenario where a memblock belongs to multiple
> nodes, I have to confess that I have not yet seen that in a real-world
> scenario.
> 
> That said, I think the less risky way is to just drop the nodemask_t
> and not care about calling sysfs_remove_link() for already removed links.
> As I said, sysfs_remove_link() will silently fail when it fails to find the
> symlink, so I do not think it is a big deal.
> 
> 

As far as I can tell we

a) don't allow offlining of memory that belongs to multiple nodes
already (as pointed out by Michal recently)

b) users cannot add memory blocks that belong to multiple nodes via
add_memory()

So I don't see a way how remove_memory() (and even offline_pages())
could ever succeed on such memory blocks.

I think it should be fine to limit it to one node here. (if not, I guess
we would have a different BUG that would actually allow to remove such
memory blocks)

-- 

Thanks,

David / dhildenb

[PATCH v4 00/25] Add FADump support on PowerNV platform

2019-07-16 Thread Hari Bathini

Firmware-Assisted Dump (FADump) is currently supported only on the pSeries
platform. This patch series adds support for the PowerNV platform too.

The first few patches refactor the FADump code so that common code can be
shared across multiple platforms. Basic FADump support is then added for
the PowerNV platform, followed by patches to honour the reserved-ranges DT
node while reserving/releasing memory used by FADump. The subsequent patch
processes CPU state data provided by firmware to create and append core
notes to the ELF core file, and the next patch adds support to preserve
crash data for subsequent boots (useful in cases like petitboot). The
remaining patches add support to export opalcore, which makes debugging
of failures in OPAL code easier. The Firmware-Assisted Dump documentation
is also updated appropriately.

The patch series is tested with the latest firmware plus the below skiboot
changes for MPIPL support:

https://patchwork.ozlabs.org/project/skiboot/list/?series=119169
("MPIPL support")


Changes in v4:
  * Split the patches.
  * Rebased to latest upstream kernel version.
  * Updated according to latest OPAL changes.

---

Hari Bathini (25):
  powerpc/fadump: move internal macros/definitions to a new header
  powerpc/fadump: move internal code to a new file
  powerpc/fadump: Improve fadump documentation
  pseries/fadump: move rtas specific definitions to platform code
  pseries/fadump: introduce callbacks for platform specific operations
  pseries/fadump: define register/un-register callback functions
  pseries/fadump: move out platform specific support from generic code
  powerpc/fadump: use FADump instead of fadump for how it is pronounced
  opal: add MPIPL interface definitions
  powernv/fadump: add fadump support on powernv
  powernv/fadump: register kernel metadata address with opal
  powernv/fadump: define register/un-register callback functions
  powernv/fadump: support copying multiple kernel memory regions
  powernv/fadump: process the crashdump by exporting it as /proc/vmcore
  powerpc/fadump: Update documentation about OPAL platform support
  powerpc/fadump: consider reserved ranges while reserving memory
  powerpc/fadump: consider reserved ranges while releasing memory
  powernv/fadump: process architected register state data provided by firmware
  powernv/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powernv/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powernv/fadump: Warn before processing partial crashdump
  powernv/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powernv/fadump: consider f/w load area
  powernv/fadump: update documentation about option to release opalcore


 Documentation/powerpc/firmware-assisted-dump.txt |  224 +++-
 arch/powerpc/Kconfig |   23 
 arch/powerpc/include/asm/fadump.h|  190 
 arch/powerpc/include/asm/opal-api.h  |   50 +
 arch/powerpc/include/asm/opal.h  |6 
 arch/powerpc/kernel/Makefile |6 
 arch/powerpc/kernel/fadump-common.c  |  153 +++
 arch/powerpc/kernel/fadump-common.h  |  203 
 arch/powerpc/kernel/fadump.c | 1181 --
 arch/powerpc/kernel/prom.c   |4 
 arch/powerpc/platforms/powernv/Makefile  |3 
 arch/powerpc/platforms/powernv/opal-call.c   |3 
 arch/powerpc/platforms/powernv/opal-core.c   |  637 
 arch/powerpc/platforms/powernv/opal-fadump.c |  671 
 arch/powerpc/platforms/powernv/opal-fadump.h |  154 +++
 arch/powerpc/platforms/pseries/Makefile  |1 
 arch/powerpc/platforms/pseries/rtas-fadump.c |  595 +++
 arch/powerpc/platforms/pseries/rtas-fadump.h |  123 ++
 18 files changed, 3231 insertions(+), 996 deletions(-)
 create mode 100644 arch/powerpc/kernel/fadump-common.c
 create mode 100644 arch/powerpc/kernel/fadump-common.h
 create mode 100644 arch/powerpc/platforms/powernv/opal-core.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.c
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.h

[PATCH v4 01/25] powerpc/fadump: move internal macros/definitions to a new header

2019-07-16 Thread Hari Bathini

Though asm/fadump.h is meant to be used by other components dealing
with FADump, it also has macros/definitions internal to FADump code.
Move them to a new header file used within FADump code. This also
makes way for refactoring platform specific FADump code.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/fadump.h   |   71 ---
 arch/powerpc/kernel/fadump-common.h |   93 +++
 arch/powerpc/kernel/fadump.c|2 +
 3 files changed, 95 insertions(+), 71 deletions(-)
 create mode 100644 arch/powerpc/kernel/fadump-common.h

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 17d9b6a..75179497 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -11,34 +11,6 @@
 
 #ifdef CONFIG_FA_DUMP
 
-/*
- * The RMA region will be saved for later dumping when kernel crashes.
- * RMA is Real Mode Area, the first block of logical memory address owned
- * by logical partition, containing the storage that may be accessed with
- * translate off.
- */
-#define RMA_START  0x0
-#define RMA_END(ppc64_rma_size)
-
-/*
- * On some Power systems where RMO is 128MB, it still requires minimum of
- * 256MB for kernel to boot successfully. When kdump infrastructure is
- * configured to save vmcore over network, we run into OOM issue while
- * loading modules related to network setup. Hence we need aditional 64M
- * of memory to avoid OOM issue.
- */
-#define MIN_BOOT_MEM   (((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
-   + (0x1UL << 26))
-
-/* The upper limit percentage for user specified boot memory size (25%) */
-#define MAX_BOOT_MEM_RATIO 4
-
-#define memblock_num_regions(memblock_type)(memblock.memblock_type.cnt)
-
-/* Alignement per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT   (PAGE_SIZE <<   \
-   max_t(unsigned long, MAX_ORDER - 1, pageblock_order))
-
 /* Firmware provided dump sections */
 #define FADUMP_CPU_STATE_DATA  0x0001
 #define FADUMP_HPTE_REGION 0x0002
@@ -47,11 +19,6 @@
 /* Dump request flag */
 #define FADUMP_REQUEST_FLAG0x0001
 
-/* FAD commands */
-#define FADUMP_REGISTER1
-#define FADUMP_UNREGISTER  2
-#define FADUMP_INVALIDATE  3
-
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG  0x2000
 
@@ -112,29 +79,6 @@ struct fadump_mem_struct {
struct fadump_section   rmr_region;
 };
 
-/* Firmware-assisted dump configuration details. */
-struct fw_dump {
-   unsigned long   cpu_state_data_size;
-   unsigned long   hpte_region_size;
-   unsigned long   boot_memory_size;
-   unsigned long   reserve_dump_area_start;
-   unsigned long   reserve_dump_area_size;
-   /* cmd line option during boot */
-   unsigned long   reserve_bootvar;
-
-   unsigned long   fadumphdr_addr;
-   unsigned long   cpu_notes_buf;
-   unsigned long   cpu_notes_buf_size;
-
-   int ibm_configure_kernel_dump;
-
-   unsigned long   fadump_enabled:1;
-   unsigned long   fadump_supported:1;
-   unsigned long   dump_active:1;
-   unsigned long   dump_registered:1;
-   unsigned long   nocma:1;
-};
-
 /*
  * Copy the ascii values for first 8 characters from a string into u64
  * variable at their respective indexes.
@@ -153,7 +97,6 @@ static inline u64 str_to_u64(const char *str)
 #define STR_TO_HEX(x)  str_to_u64(x)
 #define REG_ID(x)  str_to_u64(x)
 
-#define FADUMP_CRASH_INFO_MAGICSTR_TO_HEX("FADMPINF")
 #define REGSAVE_AREA_MAGIC STR_TO_HEX("REGSAVE")
 
 /* The firmware-assisted dump format.
@@ -178,20 +121,6 @@ struct fadump_reg_entry {
__be64  reg_value;
 };
 
-/* fadump crash info structure */
-struct fadump_crash_info_header {
-   u64 magic_number;
-   u64 elfcorehdr_addr;
-   u32 crashing_cpu;
-   struct pt_regs  regs;
-   struct cpumask  online_mask;
-};
-
-struct fad_crash_memory_ranges {
-   unsigned long long  base;
-   unsigned long long  size;
-};
-
 extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data);
diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
new file mode 100644
index 000..ba65e69
--- /dev/null
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -0,0 +1,93 @@
+/*
+ * Firmware-Assisted Dump internal code.
+ *
+ * Copyright 2011, IBM Corporation
+ * Author: Mahesh Salgaonkar 
+ *
+ * Copyright 2019, IBM Corp.
+ * Author: Hari Bathini 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your opti

[PATCH v4 02/25] powerpc/fadump: move internal code to a new file

2019-07-16 Thread Hari Bathini

Make way for refactoring platform specific FADump code by moving code
that could be referenced from multiple places to fadump-common.c file.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/Makefile|2 
 arch/powerpc/kernel/fadump-common.c |  144 +++
 arch/powerpc/kernel/fadump-common.h |8 ++
 arch/powerpc/kernel/fadump.c|  146 ++-
 4 files changed, 162 insertions(+), 138 deletions(-)
 create mode 100644 arch/powerpc/kernel/fadump-common.c

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 56dfa7a..439d548 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -78,7 +78,7 @@ obj-$(CONFIG_EEH)  += eeh.o eeh_pe.o eeh_dev.o eeh_cache.o \
  eeh_driver.o eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_GENERIC_TBSYNC)   += smp-tbsync.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
-obj-$(CONFIG_FA_DUMP)  += fadump.o
+obj-$(CONFIG_FA_DUMP)  += fadump.o fadump-common.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
 endif
diff --git a/arch/powerpc/kernel/fadump-common.c b/arch/powerpc/kernel/fadump-common.c
new file mode 100644
index 000..76c1233
--- /dev/null
+++ b/arch/powerpc/kernel/fadump-common.c
@@ -0,0 +1,144 @@
+/*
+ * Firmware-Assisted Dump internal code.
+ *
+ * Copyright 2011, IBM Corporation
+ * Author: Mahesh Salgaonkar 
+ *
+ * Copyright 2019, IBM Corp.
+ * Author: Hari Bathini 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+#define pr_fmt(fmt) "fadump: " fmt
+
+#include 
+#include 
+#include 
+#include 
+
+#include "fadump-common.h"
+
+void *fadump_cpu_notes_buf_alloc(unsigned long size)
+{
+   void *vaddr;
+   struct page *page;
+   unsigned long order, count, i;
+
+   order = get_order(size);
+   vaddr = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
+   if (!vaddr)
+   return NULL;
+
+   count = 1 << order;
+   page = virt_to_page(vaddr);
+   for (i = 0; i < count; i++)
+   SetPageReserved(page + i);
+   return vaddr;
+}
+
+void fadump_cpu_notes_buf_free(unsigned long vaddr, unsigned long size)
+{
+   struct page *page;
+   unsigned long order, count, i;
+
+   order = get_order(size);
+   count = 1 << order;
+   page = virt_to_page(vaddr);
+   for (i = 0; i < count; i++)
+   ClearPageReserved(page + i);
+   __free_pages(page, order);
+}
+
+u32 *fadump_regs_to_elf_notes(u32 *buf, struct pt_regs *regs)
+{
+   struct elf_prstatus prstatus;
+
+   memset(&prstatus, 0, sizeof(prstatus));
+   /*
+* FIXME: How do i get PID? Do I really need it?
+* prstatus.pr_pid = 
+*/
+   elf_core_copy_kernel_regs(&prstatus.pr_reg, regs);
+   buf = append_elf_note(buf, CRASH_CORE_NOTE_NAME, NT_PRSTATUS,
+ &prstatus, sizeof(prstatus));
+   return buf;
+}
+
+void fadump_update_elfcore_header(struct fw_dump *fadump_conf, char *bufp)
+{
+   struct elfhdr *elf;
+   struct elf_phdr *phdr;
+
+   elf = (struct elfhdr *)bufp;
+   bufp += sizeof(struct elfhdr);
+
+   /* First note is a place holder for cpu notes info. */
+   phdr = (struct elf_phdr *)bufp;
+
+   if (phdr->p_type == PT_NOTE) {
+   phdr->p_paddr  = fadump_conf->cpu_notes_buf;
+   phdr->p_offset = phdr->p_paddr;
+   phdr->p_memsz  = fadump_conf->cpu_notes_buf_size;
+   phdr->p_filesz = phdr->p_memsz;
+   }
+}
+
+/*
+ * Returns 1, if there are no holes in memory area between d_start to d_end,
+ * 0 otherwise.
+ */
+static int is_fadump_memory_area_contiguous(unsigned long d_start,
+   unsigned long d_end)
+{
+   struct memblock_region *reg;
+   unsigned long start, end;
+   int ret = 0;
+
+   for_each_memblock(memory, reg) {
+   start = max_t(unsigned long, d_start, reg->base);
+   end = min_t(unsigned long, d_end, (reg->base + reg->size));
+   if (d_start < end) {
+   /* Memory hole from d_start to start */
+   if (start > d_start)
+   break;
+
+   if (end == d_end) {
+   ret = 1;
+   break;
+   }
+
+   d_start = end + 1;
+   }
+   }
+
+   return ret;
+}
+
+/*
+ * Returns 1, if there are no holes in boot memory area,
+ * 0 otherwise.
+ */
+int is_fadump_boot_mem_contiguous(struct fw_dump *fadump_conf)
+{
+   unsigned long d_start = RMA_START;
+   u

[PATCH v4 03/25] powerpc/fadump: Improve fadump documentation

2019-07-16 Thread Hari Bathini

The figures depicting FADump's (Firmware-Assisted Dump) memory layout
are missing some finer details like different memory regions and what
they represent. Improve the documentation by updating those details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   65 --
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index 0c41d6d..e9b4e3c 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -74,8 +74,9 @@ as follows:
there is crash data available from a previous boot. During
the early boot OS will reserve rest of the memory above
boot memory size effectively booting with restricted memory
-   size. This will make sure that the second kernel will not
-   touch any of the dump memory area.
+   size. This will make sure that this kernel (also, referred
+   to as second kernel or capture kernel) will not touch any
+   of the dump memory area.
 
 -- User-space tools will read /proc/vmcore to obtain the contents
of memory, which holds the previous crashed kernel dump in ELF
@@ -125,48 +126,52 @@ space memory except the user pages that were present in CMA region.
 
   o Memory Reservation during first kernel
 
-  Low memory Top of memory
-  0  boot memory size   |
-  |   ||<--Reserved dump area -->|  |
-  V   V|   Permanent Reservation |  V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
-|   ^
-|   |
-\   /
- ---
-  Boot memory content gets transferred to
-  reserved area by firmware at the time of
-  crash
+  Low memoryTop of memory
+  0  boot memory size  |<--Reserved dump area --->|  |
+  |   ||   Permanent Reservation  |  |
+  V   V|   (Preserve area)|  V
+  +---+--/ /---+---+++---++--+
+  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
+  +---+--/ /---+---+++---++--+
+|   ^  ^
+|   |  |
+\   /  |
+ --- FADump Header
+  Boot memory content gets transferred   (meta area)
+  to reserved area by firmware at the
+  time of crash
+
Fig. 1
 
+
   o Memory Reservation during second kernel after crash
 
-  Low memoryTop of memory
-  0  boot memory size   |
-  |   |<- Reserved dump area --- -->|
-  V   V V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
+  Low memoryTop of memory
+  0  boot memory size|
+  |   |<- Reserved dump area --->|
+  V   V|< Preserve area ->|  V
+  +---+--/ /---+---+++---++--+
+  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
+  +---+--/ /---+---+++---++--+
 |  |
 V  V
Used by second/proc/vmcore
kernel to boot
Fig. 2
 
-Currently the dump will be copied from /proc/vmcore to a
-a new file upon user intervention. The dump data available through
-/proc/vmcore will be in ELF format. Hence the existing kdump
-infrastructure (kdump scripts) to save the dump works fine with
-minor modifications.
+Currently the dump will be copied from /proc/vmcore to a new file upon
+user intervention. The dump data available through /proc/vmcore will be
+in ELF format. Hence the existing kdump infrastructure (kdump scripts)
+to save the dump works fine with minor modifications. KDump scripts on
+major Distro releases have already been modified to work seamlessly (no
+user intervention in saving the dump) when FADump is used, ins
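
The left-to-right order in Fig. 1 (CPU, HPTE, DUMP, then the FADump header) can be expressed as simple address arithmetic. This is a sketch under assumptions: struct dump_layout and layout_dump_area() are hypothetical names, the page size is passed in rather than taken from the kernel, and the start address is rounded down to a page boundary in the same way the pSeries code in this series masks it with PAGE_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of where each part of the reserved dump area starts. */
struct dump_layout {
	uint64_t cpu_state;	/* CPU state data ("CPU" in the figure) */
	uint64_t hpte;		/* HPTE region */
	uint64_t boot_mem;	/* copy of boot memory ("DUMP" in the figure) */
	uint64_t hdr;		/* first byte after DUMP, where "HDR" begins */
};

static struct dump_layout layout_dump_area(uint64_t area_start,
					   uint64_t page_size,
					   uint64_t cpu_size,
					   uint64_t hpte_size,
					   uint64_t boot_mem_size)
{
	struct dump_layout l;
	uint64_t addr;

	/* page_size must be a non-zero power of two for the mask to work */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	addr = area_start & ~(page_size - 1);	/* round down to a page */

	l.cpu_state = addr;
	addr += cpu_size;

	l.hpte = addr;
	addr += hpte_size;

	l.boot_mem = addr;
	addr += boot_mem_size;

	l.hdr = addr;
	return l;
}
```

Each section simply starts where the previous one ends, so only the first address needs aligning.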

[PATCH v4 04/25] pseries/fadump: move rtas specific definitions to platform code

2019-07-16 Thread Hari Bathini

Currently, FADump is supported only on pSeries, but that will change
soon as FADump support is added for the PowerNV platform. So, move the
RTAS-specific definitions to platform code so that FADump can support
multiple platforms.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/fadump.h|  112 --
 arch/powerpc/kernel/fadump-common.h  |   20 -
 arch/powerpc/kernel/fadump.c |   90 +++--
 arch/powerpc/platforms/pseries/rtas-fadump.h |  107 +
 4 files changed, 174 insertions(+), 155 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.h

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 75179497..e608d34 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -11,116 +11,8 @@
 
 #ifdef CONFIG_FA_DUMP
 
-/* Firmware provided dump sections */
-#define FADUMP_CPU_STATE_DATA  0x0001
-#define FADUMP_HPTE_REGION 0x0002
-#define FADUMP_REAL_MODE_REGION 0x0011
-
-/* Dump request flag */
-#define FADUMP_REQUEST_FLAG 0x0001
-
-/* Dump status flag */
-#define FADUMP_ERROR_FLAG  0x2000
-
-#define FADUMP_CPU_ID_MASK ((1UL << 32) - 1)
-
-#define CPU_UNKNOWN (~((u32)0))
-
-/* Utility macros */
-#define SKIP_TO_NEXT_CPU(reg_entry)\
-({ \
-   while (be64_to_cpu(reg_entry->reg_id) != REG_ID("CPUEND"))  \
-   reg_entry++;\
-   reg_entry++;\
-})
-
 extern int crashing_cpu;
 
-/* Kernel Dump section info */
-struct fadump_section {
-   __be32  request_flag;
-   __be16  source_data_type;
-   __be16  error_flags;
-   __be64  source_address;
-   __be64  source_len;
-   __be64  bytes_dumped;
-   __be64  destination_address;
-};
-
-/* ibm,configure-kernel-dump header. */
-struct fadump_section_header {
-   __be32  dump_format_version;
-   __be16  dump_num_sections;
-   __be16  dump_status_flag;
-   __be32  offset_first_dump_section;
-
-   /* Fields for disk dump option. */
-   __be32  dd_block_size;
-   __be64  dd_block_offset;
-   __be64  dd_num_blocks;
-   __be32  dd_offset_disk_path;
-
-   /* Maximum time allowed to prevent an automatic dump-reboot. */
-   __be32  max_time_auto;
-};
-
-/*
- * Firmware Assisted dump memory structure. This structure is required for
- * registering future kernel dump with power firmware through rtas call.
- *
- * No disk dump option. Hence disk dump path string section is not included.
- */
-struct fadump_mem_struct {
-   struct fadump_section_header header;
-
-   /* Kernel dump sections */
-   struct fadump_section   cpu_state_data;
-   struct fadump_section   hpte_region;
-   struct fadump_section   rmr_region;
-};
-
-/*
- * Copy the ascii values for first 8 characters from a string into u64
- * variable at their respective indexes.
- * e.g.
- *  The string "FADMPINF" will be converted into 0x4641444d50494e46
- */
-static inline u64 str_to_u64(const char *str)
-{
-   u64 val = 0;
-   int i;
-
-   for (i = 0; i < sizeof(val); i++)
-   val = (*str) ? (val << 8) | *str++ : val << 8;
-   return val;
-}
-#define STR_TO_HEX(x)  str_to_u64(x)
-#define REG_ID(x)  str_to_u64(x)
-
-#define REGSAVE_AREA_MAGIC STR_TO_HEX("REGSAVE")
-
-/* The firmware-assisted dump format.
- *
- * The register save area is an area in the partition's memory used to preserve
- * the register contents (CPU state data) for the active CPUs during a firmware
- * assisted dump. The dump format contains register save area header followed
- * by register entries. Each list of registers for a CPU starts with
- * "CPUSTRT" and ends with "CPUEND".
- */
-
-/* Register save area header. */
-struct fadump_reg_save_area_header {
-   __be64  magic_number;
-   __be32  version;
-   __be32  num_cpu_offset;
-};
-
-/* Register entry. */
-struct fadump_reg_entry {
-   __be64  reg_id;
-   __be64  reg_value;
-};
-
 extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data);
@@ -136,5 +28,5 @@ static inline int is_fadump_active(void) { return 0; }
 static inline int should_fadump_crash(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
 static inline void fadump_cleanup(void) { }
-#endif
-#endif
+#endif /* !CONFIG_FA_DUMP */
+#endif /* __PPC64_FA_DUMP_H__ */
diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 6c1e310..09d6161 100644
--- a/arch/powerpc/kernel/fadump-comm
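
The str_to_u64() helper moved by this patch packs up to eight ASCII characters into a u64, most significant byte first, and pads short strings with zero bytes. Below is a user-space sketch of the same loop, using uint64_t in place of the kernel's u64 and adding a cast to unsigned char (an addition made here so that bytes above 0x7f cannot sign-extend).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack the first 8 characters of str into a 64-bit value, first
 * character in the most significant byte. Strings shorter than 8
 * characters are padded with zero bytes on the right.
 */
static uint64_t str_to_u64(const char *str)
{
	uint64_t val = 0;
	size_t i;

	assert(str != NULL);

	for (i = 0; i < sizeof(val); i++) {
		val <<= 8;
		if (*str)
			val |= (unsigned char)*str++;
	}

	return val;
}
```

With this, "FADMPINF" gives 0x4641444d50494e46 as stated in the comment, and the seven-character "REGSAVE" gives 0x5245475341564500.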

[PATCH v4 05/25] pseries/fadump: introduce callbacks for platform specific operations

2019-07-16 Thread Hari Bathini

Introduce callback functions for platform-specific operations such as
register, unregister and invalidate. Also, define placeholders for
these callbacks on the pSeries platform.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |   33 ++
 arch/powerpc/kernel/fadump.c |   47 +
 arch/powerpc/platforms/pseries/Makefile  |1 
 arch/powerpc/platforms/pseries/rtas-fadump.c |  134 ++
 4 files changed, 171 insertions(+), 44 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.c

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 09d6161..020d582 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -50,6 +50,12 @@
 #define FADUMP_UNREGISTER  2
 #define FADUMP_INVALIDATE  3
 
+/* Firmware-Assisted Dump platforms */
+enum fadump_platform_type {
+   FADUMP_PLATFORM_UNKNOWN = 0,
+   FADUMP_PLATFORM_PSERIES,
+};
+
 /*
  * Copy the ascii values for first 8 characters from a string into u64
  * variable at their respective indexes.
@@ -84,6 +90,9 @@ struct fad_crash_memory_ranges {
unsigned long long  size;
 };
 
+/* Platform specific callback functions */
+struct fadump_ops;
+
 /* Firmware-assisted dump configuration details. */
 struct fw_dump {
unsigned long   reserve_dump_area_start;
@@ -106,6 +115,21 @@ struct fw_dump {
unsigned long   dump_active:1;
unsigned long   dump_registered:1;
unsigned long   nocma:1;
+
+   enum fadump_platform_type   fadump_platform;
+   struct fadump_ops   *ops;
+};
+
+struct fadump_ops {
+   ulong   (*init_fadump_mem_struct)(struct fw_dump *fadump_config);
+   int (*register_fadump)(struct fw_dump *fadump_config);
+   int (*unregister_fadump)(struct fw_dump *fadump_config);
+   int (*invalidate_fadump)(struct fw_dump *fadump_config);
+   int (*process_fadump)(struct fw_dump *fadump_config);
+   void (*fadump_region_show)(struct fw_dump *fadump_config,
+ struct seq_file *m);
+   void (*fadump_trigger)(struct fadump_crash_info_header *fdh,
+ const char *msg);
 };
 
 /* Helper functions */
@@ -116,4 +140,13 @@ void fadump_update_elfcore_header(struct fw_dump *fadump_config, char *bufp);
 int is_fadump_boot_mem_contiguous(struct fw_dump *fadump_conf);
 int is_fadump_reserved_mem_contiguous(struct fw_dump *fadump_conf);
 
+#ifdef CONFIG_PPC_PSERIES
+extern int rtas_fadump_dt_scan(struct fw_dump *fadump_config, ulong node);
+#else
+static inline int rtas_fadump_dt_scan(struct fw_dump *fadump_config, ulong node)
+{
+   return 1;
+}
+#endif
+
 #endif /* __PPC64_FA_DUMP_INTERNAL_H__ */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index f571cb3..a901ca1 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -112,24 +112,12 @@ static int __init fadump_cma_init(void) { return 1; }
 int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
  int depth, void *data)
 {
-   const __be32 *sections;
-   int i, num_sections;
-   int size;
-   const __be32 *token;
+   int ret;
 
if (depth != 1 || strcmp(uname, "rtas") != 0)
return 0;
 
-   /*
-* Check if Firmware Assisted dump is supported. if yes, check
-* if dump has been initiated on last reboot.
-*/
-   token = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump", NULL);
-   if (!token)
-   return 1;
-
-   fw_dump.fadump_supported = 1;
-   fw_dump.ibm_configure_kernel_dump = be32_to_cpu(*token);
+   ret = rtas_fadump_dt_scan(&fw_dump, node);
 
/*
 * The 'ibm,kernel-dump' rtas node is present only if there is
@@ -139,36 +127,7 @@ int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
if (fdm_active)
fw_dump.dump_active = 1;
 
-   /* Get the sizes required to store dump data for the firmware provided
-* dump sections.
-* For each dump section type supported, a 32bit cell which defines
-* the ID of a supported section followed by two 32 bit cells which
-* gives teh size of the section in bytes.
-*/
-   sections = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump-sizes",
-   &size);
-
-   if (!sections)
-   return 1;
-
-   num_sections = size / (3 * sizeof(u32));
-
-   for (i = 0; i < num_sections; i++, sections += 3) {
-   u32 type = (u32)of_read_number(sections, 1);
-
-   switch (type) {
-   case RTAS_FADUMP_CPU_STATE_DATA:
-   fw_dump.cpu_state_data_size =
-   of_read_ulong(&sections[1],
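
The callback table introduced here follows the usual kernel ops-struct pattern: generic code holds a pointer to a table of function pointers and each platform fills in its own. A reduced user-space sketch follows. Every name ending in _sketch is hypothetical, only two of the callbacks are modelled, and the -EINVAL/-EEXIST policy in fadump_register_sketch() is an assumption made for the example rather than something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fw_dump_sketch;

/* Reduced stand-in for struct fadump_ops: two of the callbacks only. */
struct fadump_ops_sketch {
	int (*register_fadump)(struct fw_dump_sketch *conf);
	int (*unregister_fadump)(struct fw_dump_sketch *conf);
};

/* Reduced stand-in for struct fw_dump. */
struct fw_dump_sketch {
	unsigned int dump_registered;
	const struct fadump_ops_sketch *ops;
};

/* Placeholder platform callbacks, in the spirit of the pSeries stubs. */
static int stub_register(struct fw_dump_sketch *conf)
{
	conf->dump_registered = 1;
	return 0;
}

static int stub_unregister(struct fw_dump_sketch *conf)
{
	conf->dump_registered = 0;
	return 0;
}

static const struct fadump_ops_sketch stub_ops = {
	.register_fadump	= stub_register,
	.unregister_fadump	= stub_unregister,
};

/* Generic code never names a platform; it only goes through conf->ops. */
static int fadump_register_sketch(struct fw_dump_sketch *conf)
{
	if (!conf->ops || !conf->ops->register_fadump)
		return -EINVAL;		/* no platform support hooked up */

	if (conf->dump_registered)
		return -EEXIST;		/* already registered */

	return conf->ops->register_fadump(conf);
}
```

Adding another platform then means supplying another table, without touching the generic caller.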

[PATCH v4 06/25] pseries/fadump: define register/un-register callback functions

2019-07-16 Thread Hari Bathini

Make RTAS calls to register and un-register for FADump. Also, update
how fadump_region contents are displayed to provide more information.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |2 
 arch/powerpc/kernel/fadump.c |  164 ++
 arch/powerpc/platforms/pseries/rtas-fadump.c |  163 +-
 3 files changed, 176 insertions(+), 153 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 020d582..273247d 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -108,6 +108,8 @@ struct fw_dump {
unsigned long   cpu_notes_buf;
unsigned long   cpu_notes_buf_size;
 
+   unsigned long   boot_mem_dest_addr;
+
int ibm_configure_kernel_dump;
 
unsigned long   fadump_enabled:1;
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index a901ca1..650ebf8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -36,7 +36,6 @@
 #include "../platforms/pseries/rtas-fadump.h"
 
 static struct fw_dump fw_dump;
-static struct rtas_fadump_mem_struct fdm;
 static const struct rtas_fadump_mem_struct *fdm_active;
 
 static DEFINE_MUTEX(fadump_mutex);
@@ -179,61 +178,6 @@ static void fadump_show_config(void)
pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
 }
 
-static unsigned long init_fadump_mem_struct(struct rtas_fadump_mem_struct *fdm,
-   unsigned long addr)
-{
-   if (!fdm)
-   return 0;
-
-   memset(fdm, 0, sizeof(struct rtas_fadump_mem_struct));
-   addr = addr & PAGE_MASK;
-
-   fdm->header.dump_format_version = cpu_to_be32(0x0001);
-   fdm->header.dump_num_sections = cpu_to_be16(3);
-   fdm->header.dump_status_flag = 0;
-   fdm->header.offset_first_dump_section =
-   cpu_to_be32((u32)offsetof(struct rtas_fadump_mem_struct, cpu_state_data));
-
-   /*
-* Fields for disk dump option.
-* We are not using disk dump option, hence set these fields to 0.
-*/
-   fdm->header.dd_block_size = 0;
-   fdm->header.dd_block_offset = 0;
-   fdm->header.dd_num_blocks = 0;
-   fdm->header.dd_offset_disk_path = 0;
-
-   /* set 0 to disable an automatic dump-reboot. */
-   fdm->header.max_time_auto = 0;
-
-   /* Kernel dump sections */
-   /* cpu state data section. */
-   fdm->cpu_state_data.request_flag = cpu_to_be32(RTAS_FADUMP_REQUEST_FLAG);
-   fdm->cpu_state_data.source_data_type = cpu_to_be16(RTAS_FADUMP_CPU_STATE_DATA);
-   fdm->cpu_state_data.source_address = 0;
-   fdm->cpu_state_data.source_len = cpu_to_be64(fw_dump.cpu_state_data_size);
-   fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
-   addr += fw_dump.cpu_state_data_size;
-
-   /* hpte region section */
-   fdm->hpte_region.request_flag = cpu_to_be32(RTAS_FADUMP_REQUEST_FLAG);
-   fdm->hpte_region.source_data_type = cpu_to_be16(RTAS_FADUMP_HPTE_REGION);
-   fdm->hpte_region.source_address = 0;
-   fdm->hpte_region.source_len = cpu_to_be64(fw_dump.hpte_region_size);
-   fdm->hpte_region.destination_address = cpu_to_be64(addr);
-   addr += fw_dump.hpte_region_size;
-
-   /* RMA region section */
-   fdm->rmr_region.request_flag = cpu_to_be32(RTAS_FADUMP_REQUEST_FLAG);
-   fdm->rmr_region.source_data_type = cpu_to_be16(RTAS_FADUMP_REAL_MODE_REGION);
-   fdm->rmr_region.source_address = cpu_to_be64(RMA_START);
-   fdm->rmr_region.source_len = cpu_to_be64(fw_dump.boot_memory_size);
-   fdm->rmr_region.destination_address = cpu_to_be64(addr);
-   addr += fw_dump.boot_memory_size;
-
-   return addr;
-}
-
 /**
  * fadump_calculate_reserve_size(): reserve variable boot area 5% of System RAM
  *
@@ -480,61 +424,6 @@ static int __init early_fadump_reserve_mem(char *p)
 }
 early_param("fadump_reserve_mem", early_fadump_reserve_mem);
 
-static int register_fw_dump(struct rtas_fadump_mem_struct *fdm)
-{
-   int rc, err;
-   unsigned int wait_time;
-
-   pr_debug("Registering for firmware-assisted kernel dump...\n");
-
-   /* TODO: Add upper time limit for the delay */
-   do {
-   rc = rtas_call(fw_dump.ibm_configure_kernel_dump, 3, 1, NULL,
-   FADUMP_REGISTER, fdm,
-   sizeof(struct rtas_fadump_mem_struct));
-
-   wait_time = rtas_busy_delay_time(rc);
-   if (wait_time)
-   mdelay(wait_time);
-
-   } while (wait_time);
-
-   err = -EIO;
-   switch (rc) {
-   default:
-   pr_err("Failed to register. Unknown Error(%d).\n", rc);
-   break;
-   case -1:
-   printk(KERN_ERR "Failed to register firmware-assisted kernel"
-   " dump. Hardware Error(%d).\n", rc);
-
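
The register_fw_dump() loop above retries the firmware call for as long as the status asks for a delay, and carries a TODO about adding an upper time limit. Here is a user-space sketch of one way to bound such a loop. All names are hypothetical. The status values (-2 for busy, 9900 to 9905 for an extended delay of 10^(status - 9900) ms) are assumed to mirror the kernel's RTAS conventions, and the scripted_* helpers exist only to exercise the loop.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_BUSY		(-2)	/* busy: retry after about 1 ms */
#define SKETCH_EXT_DELAY_MIN	9900	/* wait 10^(status - 9900) ms */
#define SKETCH_EXT_DELAY_MAX	9905

/* Milliseconds to wait before retrying, or 0 if the status is final. */
static unsigned int busy_delay_ms(int status)
{
	unsigned int ms = 0;
	int order;

	if (status == SKETCH_BUSY) {
		ms = 1;
	} else if (status >= SKETCH_EXT_DELAY_MIN &&
		   status <= SKETCH_EXT_DELAY_MAX) {
		ms = 1;
		for (order = status - SKETCH_EXT_DELAY_MIN; order > 0; order--)
			ms *= 10;
	}

	return ms;
}

typedef int (*fw_call_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int ms);

/*
 * Repeat call() while it reports a busy status, but give up with
 * -ETIMEDOUT once the requested delays would exceed budget_ms.
 */
static int call_with_retry(fw_call_fn call, void *ctx, delay_fn delay,
			   unsigned int budget_ms)
{
	unsigned int waited = 0;	/* invariant: waited <= budget_ms */

	for (;;) {
		int rc = call(ctx);
		unsigned int ms = busy_delay_ms(rc);

		if (!ms)
			return rc;	/* final status, success or error */

		if (ms > budget_ms - waited)
			return -ETIMEDOUT;

		delay(ms);
		waited += ms;
	}
}

/* Scripted fake firmware: replays a fixed list of status codes. */
struct scripted_fw {
	const int *replies;
	int pos;
};

static int scripted_call(void *ctx)
{
	struct scripted_fw *fw = ctx;

	return fw->replies[fw->pos++];
}

static void no_delay(unsigned int ms)
{
	(void)ms;	/* a real caller would sleep or spin here */
}
```

Passing the delay function in keeps the loop testable without real sleeps; kernel code would call mdelay() or similar directly.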

[PATCH v4 08/25] powerpc/fadump: use FADump instead of fadump for how it is pronounced

2019-07-16 Thread Hari Bathini

FADump is pronounced f-a-dump. Update the documentation to use that
spelling. Also, update the description of what fadump_region contents
look like after the recent changes.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   71 --
 1 file changed, 39 insertions(+), 32 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index e9b4e3c..0c6a28c 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -8,18 +8,18 @@ a crashed system, and to do so from a fully-reset system, and
 to minimize the total elapsed time until the system is back
 in production use.
 
-- Firmware assisted dump (fadump) infrastructure is intended to replace
+- Firmware-Assisted Dump (FADump) infrastructure is intended to replace
   the existing phyp assisted dump.
 - Fadump uses the same firmware interfaces and memory reservation model
   as phyp assisted dump.
-- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
+- Unlike phyp dump, FADump exports the memory dump through /proc/vmcore
   in the ELF format in the same way as kdump. This helps us reuse the
   kdump infrastructure for dump capture and filtering.
 - Unlike phyp dump, userspace tool does not need to refer any sysfs
   interface while reading /proc/vmcore.
-- Unlike phyp dump, fadump allows user to release all the memory reserved
+- Unlike phyp dump, FADump allows user to release all the memory reserved
   for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
-- Once enabled through kernel boot parameter, fadump can be
+- Once enabled through kernel boot parameter, FADump can be
   started/stopped through /sys/kernel/fadump_registered interface (see
   sysfs files section below) and can be easily integrated with kdump
   service start/stop init scripts.
@@ -33,7 +33,7 @@ dump offers several strong, practical advantages:
in a clean, consistent state.
 -- Once the dump is copied out, the memory that held the dump
is immediately available to the running kernel. And therefore,
-   unlike kdump, fadump doesn't need a 2nd reboot to get back
+   unlike kdump, FADump doesn't need a 2nd reboot to get back
the system to the production configuration.
 
 The above can only be accomplished by coordination with,
@@ -61,7 +61,7 @@ as follows:
  boot successfully. For syntax of crashkernel= parameter,
  refer to Documentation/kdump/kdump.rst. If any offset is
  provided in crashkernel= parameter, it will be ignored
- as fadump uses a predefined offset to reserve memory
+ as FADump uses a predefined offset to reserve memory
  for boot memory dump preservation in case of a crash.
 
 -- After the low memory (boot memory) area has been saved, the
@@ -120,7 +120,7 @@ blocking this significant chunk of memory from production kernel.
 Hence, the implementation uses the Linux kernel's Contiguous Memory
 Allocator (CMA) for memory reservation if CMA is configured for kernel.
 With CMA reservation this memory will be available for applications to
-use it, while kernel is prevented from using it. With this fadump will
+use it, while kernel is prevented from using it. With this FADump will
 still be able to capture all of the kernel memory and most of the user
 space memory except the user pages that were present in CMA region.
 
@@ -170,14 +170,14 @@ KDump, as dump mechanism.
 The tools to examine the dump will be same as the ones
 used for kdump.
 
-How to enable firmware-assisted dump (fadump):
+How to enable firmware-assisted dump (FADump):
 -
 
 1. Set config option CONFIG_FA_DUMP=y and build kernel.
-2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
-   By default, fadump reserved memory will be initialized as CMA area.
-   Alternatively, user can boot linux kernel with 'fadump=nocma' to
-   prevent fadump to use CMA.
+2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
+   By default, FADump reserved memory will be initialized as CMA area.
+   Alternatively, user can boot linux kernel with 'fadump=nocma' to
+   prevent FADump from using CMA.
 3. Optionally, user can also set 'crashkernel=' kernel cmdline
to specify size of the memory to reserve for boot memory dump
preservation.
@@ -190,7 +190,7 @@ NOTE: 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead
  option is set at kernel cmdline.
   3. if user wants to capture all of user space memory and ok with
  reserved memory not available to production system, then
- 'fadump=nocma' kernel parameter can be used to fallback to
+ 'fadump=nocma' kernel parameter can be used to fallback to
  old behaviour.
 
 Sysfs/debugfs files:
@@ -203,29 +203,29 @@ Here is the list of files under kernel sysfs:
 
  /sys/kernel/fadump_enabled
 
-This is used to display the f

[PATCH v4 09/25] opal: add MPIPL interface definitions

2019-07-16 Thread Hari Bathini

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/opal-api.h|   50 +++-
 arch/powerpc/include/asm/opal.h|6 +++
 arch/powerpc/platforms/powernv/opal-call.c |3 ++
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 383242e..c8a5665 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,7 +208,10 @@
 #define OPAL_HANDLE_HMI2   166
 #define OPAL_NX_COPROC_INIT 167
 #define OPAL_XIVE_GET_VP_STATE 170
-#define OPAL_LAST  170
+#define OPAL_MPIPL_UPDATE  173
+#define OPAL_MPIPL_REGISTER_TAG 174
+#define OPAL_MPIPL_QUERY_TAG   175
+#define OPAL_LAST  175
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
@@ -980,6 +983,50 @@ struct opal_sg_list {
 };
 
 /*
+ * Firmware-Assisted Dump (FADump) using MPIPL
+ */
+
+/* MPIPL update operations */
+enum opal_mpipl_ops {
+   OPAL_MPIPL_ADD_RANGE= 0,
+   OPAL_MPIPL_REMOVE_RANGE = 1,
+   OPAL_MPIPL_REMOVE_ALL   = 2,
+   OPAL_MPIPL_FREE_PRESERVED_MEMORY= 3,
+};
+
+/*
+ * Each tag maps to a metadata type. Use these tags to register/query
+ * corresponding metadata address with/from OPAL.
+ */
+enum opal_mpipl_tags {
+   OPAL_MPIPL_TAG_CPU  = 0,
+   OPAL_MPIPL_TAG_OPAL = 1,
+   OPAL_MPIPL_TAG_KERNEL   = 2,
+   OPAL_MPIPL_TAG_BOOT_MEM = 3,
+};
+
+/* Preserved memory details */
+struct opal_mpipl_region {
+   __be64  src;
+   __be64  dest;
+   __be64  size;
+};
+
+/* FADump structure format version */
+#define MPIPL_FADUMP_VERSION   0x01
+
+/* Metadata provided by OPAL. */
+struct opal_mpipl_fadump {
+   u8  version;
+   u8  reserved[7];
+   __be32  crashing_pir;
+   __be32  cpu_data_version;
+   __be32  cpu_data_size;
+   __be32  region_cnt;
+   struct opal_mpipl_region region[];
+} __attribute__((packed));
+
+/*
  * Dump region ID range usable by the OS
  */
 #define OPAL_DUMP_REGION_HOST_START 0x80
@@ -1059,6 +1106,7 @@ enum {
OPAL_REBOOT_NORMAL  = 0,
OPAL_REBOOT_PLATFORM_ERROR  = 1,
OPAL_REBOOT_FULL_IPL= 2,
+   OPAL_REBOOT_MPIPL   = 3,
 };
 
 /* Argument to OPAL_PCI_TCE_KILL */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 57bd029..878110a 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -39,6 +39,12 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t bdfn,
uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
uint64_t rate_phys, uint32_t size);
+
+int64_t opal_mpipl_update(enum opal_mpipl_ops op, u64 src,
+ u64 dest, u64 size);
+int64_t opal_mpipl_register_tag(enum opal_mpipl_tags tag, uint64_t addr);
+int64_t opal_mpipl_query_tag(enum opal_mpipl_tags tag, uint64_t *addr);
+
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
 int64_t opal_console_read(int64_t term_number, __be64 *length,
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index 29ca523..fc8cc7c 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -287,3 +287,6 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable,OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init, OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_mpipl_update,   OPAL_MPIPL_UPDATE);
+OPAL_CALL(opal_mpipl_register_tag, OPAL_MPIPL_REGISTER_TAG);
+OPAL_CALL(opal_mpipl_query_tag,OPAL_MPIPL_QUERY_TAG);
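
struct opal_mpipl_fadump above is a packed header followed by a flexible array of regions, with its multi-byte fields stored big-endian. The sketch below shows how a consumer might size such a blob and decode its fields. The *_sketch types are hypothetical mirrors that use plain uint32_t/uint64_t in place of __be32/__be64, and the decode helpers are written out by hand here rather than using the kernel's be32_to_cpu()/be64_to_cpu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct opal_mpipl_region; fields are big-endian in memory. */
struct mpipl_region_sketch {
	uint64_t src;
	uint64_t dest;
	uint64_t size;
};

/* Mirror of struct opal_mpipl_fadump (GCC/Clang packed attribute). */
struct mpipl_fadump_sketch {
	uint8_t  version;
	uint8_t  reserved[7];
	uint32_t crashing_pir;
	uint32_t cpu_data_version;
	uint32_t cpu_data_size;
	uint32_t region_cnt;
	struct mpipl_region_sketch region[];
} __attribute__((packed));

/* Decode a big-endian 32-bit value from raw bytes, on any host. */
static uint32_t be32_decode(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode a big-endian 64-bit value from raw bytes, on any host. */
static uint64_t be64_decode(const uint8_t *p)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}

/* Total bytes occupied by the header plus region_cnt region entries. */
static size_t mpipl_fadump_bytes(uint32_t region_cnt)
{
	return sizeof(struct mpipl_fadump_sketch) +
	       (size_t)region_cnt * sizeof(struct mpipl_region_sketch);
}
```

The fixed header is 24 bytes and each region adds another 24, so a dump with two preserved regions occupies 72 bytes.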

[PATCH v4 10/25] powernv/fadump: add fadump support on powernv

2019-07-16 Thread Hari Bathini

Add basic callback functions for FADump on PowerNV platform.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/Kconfig |5 +
 arch/powerpc/kernel/fadump-common.h  |   10 +++
 arch/powerpc/kernel/fadump.c |3 +
 arch/powerpc/platforms/powernv/Makefile  |1 
 arch/powerpc/platforms/powernv/opal-fadump.c |  102 ++
 5 files changed, 119 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f516796..0ce0a80 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -566,7 +566,7 @@ config CRASH_DUMP
 
 config FA_DUMP
bool "Firmware-assisted dump"
-   depends on PPC64 && PPC_RTAS
+   depends on PPC64 && (PPC_RTAS || PPC_POWERNV)
select CRASH_CORE
select CRASH_DUMP
help
@@ -577,7 +577,8 @@ config FA_DUMP
  is meant to be a kdump replacement offering robustness and
  speed not possible without system firmware assistance.
 
- If unsure, say "N"
+ If unsure, say "Y". Only special kernels like petitboot may
+ need to say "N" here.
 
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 0231a0b..928d364 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -54,6 +54,7 @@
 enum fadump_platform_type {
FADUMP_PLATFORM_UNKNOWN = 0,
FADUMP_PLATFORM_PSERIES,
+   FADUMP_PLATFORM_POWERNV,
 };
 
 /*
@@ -157,4 +158,13 @@ static inline int rtas_fadump_dt_scan(struct fw_dump *fadump_config, ulong node)
 }
 #endif
 
+#ifdef CONFIG_PPC_POWERNV
+extern int opal_fadump_dt_scan(struct fw_dump *fadump_config, ulong node);
+#else
+static inline int opal_fadump_dt_scan(struct fw_dump *fadump_config, ulong node)
+{
+   return 1;
+}
+#endif
+
 #endif /* __PPC64_FA_DUMP_INTERNAL_H__ */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index e995db1..517a40b 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -114,6 +114,9 @@ int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
if (strcmp(uname, "rtas") == 0)
return rtas_fadump_dt_scan(&fw_dump, node);
 
+   if (strcmp(uname, "ibm,opal") == 0)
+   return opal_fadump_dt_scan(&fw_dump, node);
+
return 0;
 }
 
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index da2e99e..43a6e1c 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,6 +6,7 @@ obj-y   += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
+obj-$(CONFIG_FA_DUMP)  += opal-fadump.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
new file mode 100644
index 000..d8ee836
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -0,0 +1,102 @@
+/*
+ * Firmware-Assisted Dump support on POWER platform (OPAL).
+ *
+ * Copyright 2019, IBM Corp.
+ * Author: Hari Bathini 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+#define pr_fmt(fmt) "opal fadump: " fmt
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "../../kernel/fadump-common.h"
+
+static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
+{
+   return fadump_conf->reserve_dump_area_start;
+}
+
+static int opal_fadump_register_fadump(struct fw_dump *fadump_conf)
+{
+   return -EIO;
+}
+
+static int opal_fadump_unregister_fadump(struct fw_dump *fadump_conf)
+{
+   return -EIO;
+}
+
+static int opal_fadump_invalidate_fadump(struct fw_dump *fadump_conf)
+{
+   return -EIO;
+}
+
+static int __init opal_fadump_process_fadump(struct fw_dump *fadump_conf)
+{
+   return -EINVAL;
+}
+
+static void opal_fadump_region_show(struct fw_dump *fadump_conf,
+   struct seq_file *m)
+{
+}
+
+static void opal_fadump_trigger(struct fadump_crash_info_header *fdh,
+   const char *msg)
+{
+   int rc;
+
+   rc = opal_cec_reboot2(OPAL_REBOOT_MPIPL, msg);
+   if (rc == OPAL_UNSUPPORTED) {
+   pr_emerg("Reboot type %d not supported.\n",
+OPAL_REBOOT_MPIPL);
+   } else if (
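
With this patch the early device-tree scan picks the platform by node name: "rtas" leads to the pSeries scanner and "ibm,opal" to the PowerNV one. A user-space sketch of that dispatch is shown below. The enum, both functions and the node list are hypothetical, and the depth check is carried over from the earlier version of early_init_dt_scan_fw_dump() in this series. The callback convention assumed here is that a non-zero return stops the walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum fadump_platform_sketch {
	SKETCH_PLATFORM_UNKNOWN = 0,
	SKETCH_PLATFORM_PSERIES,
	SKETCH_PLATFORM_POWERNV,
};

/*
 * Per-node callback: returns non-zero to stop the walk once a firmware
 * node that can provide FADump has been seen, 0 to keep scanning.
 */
static int scan_fw_dump_node(const char *uname, int depth,
			     enum fadump_platform_sketch *found)
{
	if (depth != 1)
		return 0;	/* only direct children of the root matter */

	if (strcmp(uname, "rtas") == 0) {
		*found = SKETCH_PLATFORM_PSERIES;
		return 1;
	}

	if (strcmp(uname, "ibm,opal") == 0) {
		*found = SKETCH_PLATFORM_POWERNV;
		return 1;
	}

	return 0;
}

/* Walk a flat list of depth-1 node names until the callback says stop. */
static enum fadump_platform_sketch scan_nodes(const char *const *names,
					      size_t nr)
{
	enum fadump_platform_sketch found = SKETCH_PLATFORM_UNKNOWN;
	size_t i;

	for (i = 0; i < nr; i++)
		if (scan_fw_dump_node(names[i], 1, &found))
			break;

	return found;
}
```

In the patch itself the matching branches call rtas_fadump_dt_scan() and opal_fadump_dt_scan(), which fall back to inline stubs returning 1 when the platform is not configured.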

[PATCH v4 11/25] powernv/fadump: register kernel metadata address with opal

2019-07-16 Thread Hari Bathini

OPAL allows the first kernel to register an address with it and to
retrieve that address after MPIPL. Set up kernel metadata and register
its address with OPAL so that it can be used while processing the crash
dump.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |4 +
 arch/powerpc/kernel/fadump.c |   65 ++-
 arch/powerpc/platforms/powernv/opal-fadump.c |   73 ++
 arch/powerpc/platforms/powernv/opal-fadump.h |   37 +
 arch/powerpc/platforms/pseries/rtas-fadump.c |   32 +--
 5 files changed, 177 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 928d364..89b8916 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -117,6 +117,8 @@ struct fw_dump {
 
unsigned long   boot_mem_dest_addr;
 
+   u64 kernel_metadata;
+
int ibm_configure_kernel_dump;
 
unsigned long   fadump_enabled:1;
@@ -131,6 +133,8 @@ struct fw_dump {
 
 struct fadump_ops {
ulong   (*init_fadump_mem_struct)(struct fw_dump *fadump_config);
+   ulong   (*get_kernel_metadata_size)(void);
+   int (*setup_kernel_metadata)(struct fw_dump *fadump_config);
int (*register_fadump)(struct fw_dump *fadump_config);
int (*unregister_fadump)(struct fw_dump *fadump_config);
int (*invalidate_fadump)(struct fw_dump *fadump_config);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 517a40b..4dd8037 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -258,6 +258,9 @@ static unsigned long get_fadump_area_size(void)
size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
 
size = PAGE_ALIGN(size);
+
+   /* This is to hold kernel metadata on platforms that support it */
+   size += fw_dump.ops->get_kernel_metadata_size();
return size;
 }
 
@@ -283,17 +286,17 @@ static void __init fadump_reserve_crash_area(unsigned long base,
 
 int __init fadump_reserve_mem(void)
 {
+   int ret = 1;
unsigned long base, size, memory_boundary;
 
if (!fw_dump.fadump_enabled)
return 0;
 
if (!fw_dump.fadump_supported) {
-   printk(KERN_INFO "Firmware-assisted dump is not supported on"
-   " this hardware\n");
-   fw_dump.fadump_enabled = 0;
-   return 0;
+   pr_info("Firmware-Assisted Dump is not supported on this hardware\n");
+   goto error_out;
}
+
/*
 * Initialize boot memory size
 * If dump is active then we have already calculated the size during
@@ -310,11 +313,13 @@ int __init fadump_reserve_mem(void)
}
 
size = get_fadump_area_size();
+   fw_dump.reserve_dump_area_size = size;
if (memory_limit)
memory_boundary = memory_limit;
else
memory_boundary = memblock_end_of_DRAM();
 
+   base = fw_dump.boot_memory_size;
if (fw_dump.dump_active) {
pr_info("Firmware-assisted dump is active.\n");
 
@@ -332,13 +337,11 @@ int __init fadump_reserve_mem(void)
 * dump is written to disk by userspace tool. This memory
 * will be released for general use once the dump is saved.
 */
-   base = fw_dump.boot_memory_size;
size = memory_boundary - base;
fadump_reserve_crash_area(base, size);
 
pr_debug("fadumphdr_addr = %#016lx\n", fw_dump.fadumphdr_addr);
fw_dump.reserve_dump_area_start = base;
-   fw_dump.reserve_dump_area_size = size;
} else {
/*
 * Reserve memory at an offset closer to bottom of the RAM to
@@ -346,30 +349,42 @@ int __init fadump_reserve_mem(void)
 * use memblock_find_in_range() here since it doesn't allocate
 * from bottom to top.
 */
-   for (base = fw_dump.boot_memory_size;
-base <= (memory_boundary - size);
-base += size) {
+   while (base <= (memory_boundary - size)) {
if (memblock_is_region_memory(base, size) &&
!memblock_is_region_reserved(base, size))
break;
+
+   base += size;
}
-   if ((base > (memory_boundary - size)) ||
-   memblock_reserve(base, size)) {
+
+   if (base > (memory_boundary - size)) {
+   pr_err("Failed to find memory chunk for reservation\n");
+   goto error_out;
+   }
+   fw_dump.reserve_dump_area_start = base;
+
+

[PATCH v4 12/25] powernv/fadump: define register/un-register callback functions

2019-07-16 Thread Hari Bathini

Make OPAL calls to register and un-register with firmware for MPIPL.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/platforms/powernv/opal-fadump.c |   71 +-
 1 file changed, 69 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
index 4b8504e..2179126 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -27,6 +27,20 @@
 
 static struct opal_fadump_mem_struct *opal_fdm;
 
+static void opal_fadump_update_config(struct fw_dump *fadump_conf,
+ const struct opal_fadump_mem_struct *fdm)
+{
+   /*
+* The destination address of the first boot memory region is the
+* destination address of boot memory regions.
+*/
+   fadump_conf->boot_mem_dest_addr = fdm->rgn[0].dest;
+   pr_debug("Destination address of boot memory regions: %#016lx\n",
+fadump_conf->boot_mem_dest_addr);
+
+   fadump_conf->fadumphdr_addr = fdm->fadumphdr_addr;
+}
+
 static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
 {
ulong addr = fadump_conf->reserve_dump_area_start;
@@ -47,6 +61,8 @@ static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
opal_fdm->fadumphdr_addr = (opal_fdm->rgn[0].dest +
fadump_conf->boot_memory_size);
 
+   opal_fadump_update_config(fadump_conf, opal_fdm);
+
return addr;
 }
 
@@ -88,12 +104,63 @@ static int opal_fadump_setup_kernel_metadata(struct fw_dump *fadump_conf)
 
 static int opal_fadump_register_fadump(struct fw_dump *fadump_conf)
 {
-   return -EIO;
+   int i, err = -EIO;
+   s64 rc;
+
+   for (i = 0; i < opal_fdm->region_cnt; i++) {
+   rc = opal_mpipl_update(OPAL_MPIPL_ADD_RANGE,
+  opal_fdm->rgn[i].src,
+  opal_fdm->rgn[i].dest,
+  opal_fdm->rgn[i].size);
+   if (rc != OPAL_SUCCESS)
+   break;
+
+   opal_fdm->registered_regions++;
+   }
+
+   switch (rc) {
+   case OPAL_SUCCESS:
+   pr_info("Registration is successful!\n");
+   fadump_conf->dump_registered = 1;
+   err = 0;
+   break;
+   case OPAL_UNSUPPORTED:
+   pr_err("Support not available.\n");
+   fadump_conf->fadump_supported = 0;
+   fadump_conf->fadump_enabled = 0;
+   break;
+   case OPAL_INTERNAL_ERROR:
+   pr_err("Failed to register. Hardware Error(%lld).\n", rc);
+   break;
+   case OPAL_PARAMETER:
+   pr_err("Failed to register. Parameter Error(%lld).\n", rc);
+   break;
+   case OPAL_PERMISSION:
+   pr_err("Already registered!\n");
+   fadump_conf->dump_registered = 1;
+   err = -EEXIST;
+   break;
+   default:
+   pr_err("Failed to register. Unknown Error(%lld).\n", rc);
+   break;
+   }
+
+   return err;
 }
 
 static int opal_fadump_unregister_fadump(struct fw_dump *fadump_conf)
 {
-   return -EIO;
+   s64 rc;
+
+   rc = opal_mpipl_update(OPAL_MPIPL_REMOVE_ALL, 0, 0, 0);
+   if (rc) {
+   pr_err("Failed to un-register - unexpected Error(%lld).\n", rc);
+   return -EIO;
+   }
+
+   opal_fdm->registered_regions = 0;
+   fadump_conf->dump_registered = 0;
+   return 0;
 }
 
 static int opal_fadump_invalidate_fadump(struct fw_dump *fadump_conf)

[PATCH v4 13/25] powernv/fadump: support copying multiple kernel memory regions

2019-07-16 Thread Hari Bathini

Firmware uses a 32-bit field for region size while copying/backing-up
memory during MPIPL. So, the maximum copy size for a region would
be a page less than 4GB (aligned to page size), but the FADump capture
kernel usually needs more memory than that preserved to avoid
running into out-of-memory errors.

So, request firmware to copy multiple kernel memory regions instead
of just one (which worked fine for pseries, as a 64-bit field was used
for size there). With support to copy multiple kernel memory regions,
also handle holes in the memory area to be preserved. Support as many
as 128 kernel memory regions. This allows having an adequate FADump
capture kernel size for different scenarios.

Signed-off-by: Hari Bathini 
---
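Editorial note, not part of the posted patch: the splitting described
above can be sketched in plain C as below. This is a minimal sketch under
stated assumptions. The names copy_region and split_into_regions are
hypothetical and do not appear in the patch, and the region limit is passed
in rather than taken from FADUMP_MAX_MEM_REGS.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not the structure used by the patch. */
struct copy_region {
	uint64_t src;
	uint64_t size;
};

/*
 * Split [base, base + size) into chunks of at most max_copy bytes.
 * Returns the number of chunks written to out, or -1 if max_copy is
 * zero or more than max_regions chunks would be needed.
 */
static int split_into_regions(uint64_t base, uint64_t size, uint64_t max_copy,
			      struct copy_region *out, size_t max_regions)
{
	size_t n = 0;

	if (max_copy == 0)
		return -1;

	while (size > 0) {
		uint64_t chunk = size < max_copy ? size : max_copy;

		if (n == max_regions)
			return -1;

		out[n].src = base;
		out[n].size = chunk;
		n++;

		base += chunk;
		size -= chunk;
	}

	return (int)n;
}
```

With a limit of 4GB minus one 64KB page, an 8GB area needs three
entries: two full-size chunks and a 128KB remainder.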
 arch/powerpc/kernel/fadump-common.c  |   15 ++
 arch/powerpc/kernel/fadump-common.h  |   16 ++
 arch/powerpc/kernel/fadump.c |  173 ++
 arch/powerpc/platforms/powernv/opal-fadump.c |   25 +++-
 arch/powerpc/platforms/powernv/opal-fadump.h |5 -
 arch/powerpc/platforms/pseries/rtas-fadump.c |   12 ++
 arch/powerpc/platforms/pseries/rtas-fadump.h |5 +
 7 files changed, 211 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.c b/arch/powerpc/kernel/fadump-common.c
index 76c1233..731b929 100644
--- a/arch/powerpc/kernel/fadump-common.c
+++ b/arch/powerpc/kernel/fadump-common.c
@@ -125,10 +125,19 @@ static int is_fadump_memory_area_contiguous(unsigned long d_start,
  */
 int is_fadump_boot_mem_contiguous(struct fw_dump *fadump_conf)
 {
-   unsigned long d_start = RMA_START;
-   unsigned long d_end   = RMA_START + fadump_conf->boot_memory_size;
+   int i, ret = 0;
+   unsigned long d_start, d_end;
 
-   return is_fadump_memory_area_contiguous(d_start, d_end);
+   for (i = 0; i < fadump_conf->boot_mem_regs_cnt; i++) {
+   d_start = fadump_conf->boot_mem_addr[i];
+   d_end   = d_start + fadump_conf->boot_mem_size[i];
+
+   ret = is_fadump_memory_area_contiguous(d_start, d_end);
+   if (!ret)
+   break;
+   }
+
+   return ret;
 }
 
 /*
diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 89b8916..06d9ecf 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -94,6 +94,9 @@ struct fad_crash_memory_ranges {
 /* Platform specific callback functions */
 struct fadump_ops;
 
+/* Maximum number of memory regions kernel supports */
+#define FADUMP_MAX_MEM_REGS128
+
 /* Firmware-assisted dump configuration details. */
 struct fw_dump {
unsigned long   reserve_dump_area_start;
@@ -109,14 +112,23 @@ struct fw_dump {
 
unsigned long   cpu_state_data_size;
unsigned long   hpte_region_size;
+
unsigned long   boot_memory_size;
+   unsigned long   boot_mem_dest_addr;
+   unsigned long   boot_mem_regs_cnt;
+   unsigned long   boot_mem_addr[FADUMP_MAX_MEM_REGS];
+   unsigned long   boot_mem_size[FADUMP_MAX_MEM_REGS];
+   unsigned long   boot_mem_top;
 
unsigned long   fadumphdr_addr;
unsigned long   cpu_notes_buf;
unsigned long   cpu_notes_buf_size;
 
-   unsigned long   boot_mem_dest_addr;
-
+   /*
+* Maximum size supported by firmware to copy from source to
+* destination address per entry.
+*/
+   unsigned long   max_copy_size;
u64 kernel_metadata;
 
int ibm_configure_kernel_dump;
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 4dd8037..abf4f334 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -128,6 +128,7 @@ int is_fadump_memory_area(u64 addr, ulong size)
 {
u64 d_start = fw_dump.reserve_dump_area_start;
u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+   u64 b_end = fw_dump.boot_mem_top;
 
if (!fw_dump.dump_registered)
return 0;
@@ -135,7 +136,7 @@ int is_fadump_memory_area(u64 addr, ulong size)
if (((addr + size) > d_start) && (addr <= d_end))
return 1;
 
-   return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
+   return (((addr + size) > RMA_START) && (addr <= b_end));
 }
 
 int should_fadump_crash(void)
@@ -153,6 +154,8 @@ int is_fadump_active(void)
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
+   int i;
+
pr_debug("Support for firmware-assisted dump (fadump): %s\n",
(fw_dump.fadump_supported ? "present" : "no support"));
 
@@ -166,7 +169,13 @@ static void fadump_show_config(void)
pr_debug("Dump section sizes:\n");
pr_debug("CPU state data size: %lx\n", fw_dump.cpu_state_data_size);
pr_debug("HPTE region size   : %lx\n", fw_dump.hpte_region_size);
-   pr_debug("Boot memory siz

[PATCH v4 14/25] powernv/fadump: process the crashdump by exporting it as /proc/vmcore

2019-07-16 Thread Hari Bathini

Add support in the kernel to process the crashed kernel's memory
preserved during MPIPL and export it as the /proc/vmcore file for
userland scripts to filter and analyze later.

Signed-off-by: Hari Bathini 
---
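Editorial note, not part of the posted patch: in the capture kernel,
opal_fadump_get_config() in the diff below rebuilds the boot memory
layout from the region list, counting holes between regions towards the
top of boot memory. A minimal standalone sketch of that accounting
follows. The names ex_region, ex_summary and ex_summarize are
hypothetical, and the regions are assumed to be sorted by base address
and non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the structures in the patch. */
struct ex_region {
	uint64_t base;
	uint64_t size;
};

struct ex_summary {
	uint64_t total_size;	/* sum of region sizes */
	uint64_t top;		/* total size plus holes below and between regions */
};

/* Assumes rgn[] is sorted by base and the regions do not overlap. */
static struct ex_summary ex_summarize(const struct ex_region *rgn, size_t cnt)
{
	struct ex_summary s = { 0, 0 };
	uint64_t last_end = 0, hole_size = 0;
	size_t i;

	for (i = 0; i < cnt; i++) {
		s.total_size += rgn[i].size;
		/* Gap between the end of the previous region and this one. */
		hole_size += rgn[i].base - last_end;
		last_end = rgn[i].base + rgn[i].size;
	}

	s.top = s.total_size + hole_size;
	return s;
}
```

Under these assumptions the computed top equals the end address of the
last region, which is what lets is_fadump_memory_area() compare against
a single upper bound.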
 arch/powerpc/platforms/powernv/opal-fadump.c |  190 ++
 1 file changed, 187 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
index 9c68c83..dffc0e7 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -25,6 +26,7 @@
 #include "../../kernel/fadump-common.h"
 #include "opal-fadump.h"
 
+static const struct opal_fadump_mem_struct *opal_fdm_active;
 static struct opal_fadump_mem_struct *opal_fdm;
 
 static void opal_fadump_update_config(struct fw_dump *fadump_conf,
@@ -41,6 +43,50 @@ static void opal_fadump_update_config(struct fw_dump *fadump_conf,
 fadump_conf->boot_mem_dest_addr);
 
fadump_conf->fadumphdr_addr = fdm->fadumphdr_addr;
+
+   /* Start address of preserve area (permanent reservation) */
+   fadump_conf->preserv_area_start = fadump_conf->boot_mem_dest_addr;
+   pr_debug("Preserve area start address: 0x%lx\n",
+fadump_conf->preserv_area_start);
+}
+
+/*
+ * This function is called in the capture kernel to get configuration details
+ * from metadata setup by the first kernel.
+ */
+static void opal_fadump_get_config(struct fw_dump *fadump_conf,
+  const struct opal_fadump_mem_struct *fdm)
+{
+   unsigned long base, size, last_end, hole_size;
+   int i;
+
+   if (!fadump_conf->dump_active)
+   return;
+
+   last_end = 0;
+   hole_size = 0;
+   fadump_conf->boot_memory_size = 0;
+
+   if (fdm->region_cnt)
+   pr_debug("Boot memory regions:\n");
+
+   for (i = 0; i < fdm->region_cnt; i++) {
+   base = fdm->rgn[i].src;
+   size = fdm->rgn[i].size;
+   pr_debug("\t%d. base: 0x%lx, size: 0x%lx\n",
+(i + 1), base, size);
+
+   fadump_conf->boot_mem_addr[i] = base;
+   fadump_conf->boot_mem_size[i] = size;
+   fadump_conf->boot_memory_size += size;
+   hole_size += (base - last_end);
+
+   last_end = base + size;
+   }
+
+   fadump_conf->boot_mem_top = (fadump_conf->boot_memory_size + hole_size);
+   fadump_conf->boot_mem_regs_cnt = fdm->region_cnt;
+   opal_fadump_update_config(fadump_conf, fdm);
 }
 
 static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
@@ -174,27 +220,127 @@ static int opal_fadump_unregister_fadump(struct fw_dump *fadump_conf)
 
 static int opal_fadump_invalidate_fadump(struct fw_dump *fadump_conf)
 {
-   return -EIO;
+   s64 rc;
+
+   rc = opal_mpipl_update(OPAL_MPIPL_FREE_PRESERVED_MEMORY, 0, 0, 0);
+   if (rc) {
+   pr_err("Failed to invalidate - unexpected Error(%lld).\n", rc);
+   return -EIO;
+   }
+
+   fadump_conf->dump_active = 0;
+   opal_fdm_active = NULL;
+   return 0;
+}
+
+/*
+ * Convert CPU state data saved at the time of crash into ELF notes.
+ */
+static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
+{
+   u32 num_cpus, *note_buf;
+   struct fadump_crash_info_header *fdh = NULL;
+
+   num_cpus = 1;
+   /* Allocate buffer to hold cpu crash notes. */
+   fadump_conf->cpu_notes_buf_size = num_cpus * sizeof(note_buf_t);
+   fadump_conf->cpu_notes_buf_size =
+   PAGE_ALIGN(fadump_conf->cpu_notes_buf_size);
+   note_buf = fadump_cpu_notes_buf_alloc(fadump_conf->cpu_notes_buf_size);
+   if (!note_buf) {
+   pr_err("Failed to allocate 0x%lx bytes for cpu notes buffer\n",
+  fadump_conf->cpu_notes_buf_size);
+   return -ENOMEM;
+   }
+   fadump_conf->cpu_notes_buf = __pa(note_buf);
+
+   pr_debug("Allocated buffer for cpu notes of size %ld at %p\n",
+(num_cpus * sizeof(note_buf_t)), note_buf);
+
+   if (fadump_conf->fadumphdr_addr)
+   fdh = __va(fadump_conf->fadumphdr_addr);
+
+   if (fdh && (fdh->crashing_cpu != FADUMP_CPU_UNKNOWN)) {
+   note_buf = fadump_regs_to_elf_notes(note_buf, &(fdh->regs));
+   final_note(note_buf);
+
+   pr_debug("Updating elfcore header (%llx) with cpu notes\n",
+fdh->elfcorehdr_addr);
+   fadump_update_elfcore_header(fadump_conf,
+__va(fdh->elfcorehdr_addr));
+   }
+
+   return 0;
 }
 
 static int __init opal_fadump_process_fadump(struct fw_dump *fadump_conf)
 {
-   return -EINVAL;
+   struct fadump_crash_info_header *fdh;
+   int rc = 0;
+
+   if (

[PATCH v4 15/25] powerpc/fadump: Update documentation about OPAL platform support

2019-07-16 Thread Hari Bathini

With FADump support now available on both pseries and OPAL platforms,
update FADump documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |  104 +-
 1 file changed, 63 insertions(+), 41 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index 0c6a28c..cd48776 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -70,7 +70,8 @@ as follows:
normal.
 
 -- The freshly booted kernel will notice that there is a new
-   node (ibm,dump-kernel) in the device tree, indicating that
+   node (ibm,dump-kernel on PSeries or ibm,opal/dump/result-table
+   on OPAL platform) in the device tree, indicating that
there is crash data available from a previous boot. During
the early boot OS will reserve rest of the memory above
boot memory size effectively booting with restricted memory
@@ -93,7 +94,9 @@ as follows:
 
 Please note that the firmware-assisted dump feature
 is only available on Power6 and above systems with recent
-firmware versions.
+firmware versions on PSeries (PowerVM) platform and Power9
+and above systems with recent firmware versions on PowerNV
+(OPAL) platform.
 
 Implementation details:
 --
@@ -108,57 +111,76 @@ that are run. If there is dump data, then the
 /sys/kernel/fadump_release_mem file is created, and the reserved
 memory is held.
 
-If there is no waiting dump data, then only the memory required
-to hold CPU state, HPTE region, boot memory dump and elfcore
-header, is usually reserved at an offset greater than boot memory
-size (see Fig. 1). This area is *not* released: this region will
-be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state
-and HPTE region, in the case a crash does occur. Since this reserved
-memory area is used only after the system crash, there is no point in
-blocking this significant chunk of memory from production kernel.
-Hence, the implementation uses the Linux kernel's Contiguous Memory
-Allocator (CMA) for memory reservation if CMA is configured for kernel.
-With CMA reservation this memory will be available for applications to
-use it, while kernel is prevented from using it. With this FADump will
-still be able to capture all of the kernel memory and most of the user
-space memory except the user pages that were present in CMA region.
+If there is no waiting dump data, then only the memory required to
+hold CPU state, HPTE region, boot memory dump, FADump header and
+elfcore header, is usually reserved at an offset greater than boot
+memory size (see Fig. 1). This area is *not* released: this region
+will be kept permanently reserved, so that it can act as a receptacle
+for a copy of the boot memory content in addition to CPU state and
+HPTE region, in the case a crash does occur.
+
+Since this reserved memory area is used only after the system crash,
+there is no point in blocking this significant chunk of memory from
+production kernel. Hence, the implementation uses the Linux kernel's
+Contiguous Memory Allocator (CMA) for memory reservation if CMA is
+configured for kernel. With CMA reservation this memory will be
+available for applications to use it, while kernel is prevented from
+using it. With this FADump will still be able to capture all of the
+kernel memory and most of the user space memory except the user pages
+that were present in CMA region.
 
   o Memory Reservation during first kernel
 
-  Low memoryTop of memory
-  0  boot memory size  |<--Reserved dump area --->|  |
-  |   ||   Permanent Reservation  |  |
-  V   V|   (Preserve area)|  V
-  +---+--/ /---+---+++---++--+
-  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
-  +---+--/ /---+---+++---++--+
-|   ^  ^
-|   |  |
-\   /  |
- --- FADump Header
-  Boot memory content gets transferred   (meta area)
-  to reserved area by firmware at the
-  time of crash
+  Low memory Top of memory
+  0boot memory size   |<--- Reserved dump area --->|   |
+  |   |   |Permanent Reservation   |   |
+  V   V   |   (Preserve area)  |   V
+  +---+-/ /---+---++---+-+-++--+
+  |   |   |///||  DUMP | HDR | ELF ||  |
+  +---+-/ /---+---++---+-+-++--+
+|   ^^ ^

[PATCH v4 16/25] powerpc/fadump: consider reserved ranges while reserving memory

2019-07-16 Thread Hari Bathini

Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse the reserved-ranges DT
node and reserve kernel memory falling in these ranges for F/W
purposes. Ensure that memory reserved for FADump does not overlap
with memory in these ranges.

Also, when an attempt to reserve memory for FADump fails due to
memory holes and/or reserved ranges, skip ahead by a smaller offset,
instead of by the size of the memory to be reserved, before making
another attempt. This reduces the likelihood of memory reservation
failure.

Signed-off-by: Hari Bathini 
---
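Editorial note, not part of the posted patch: the effect of retrying at
a smaller offset can be sketched as below. This is a simplified
userspace sketch under stated assumptions. The names ex_range,
ex_overlaps_reserved and ex_find_free_base are hypothetical, the step is
a parameter rather than FADUMP_OFFSET_SIZE, and only an explicit list of
reserved ranges is checked, whereas the kernel code also consults
memblock.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct fadump_memory_range. */
struct ex_range {
	uint64_t base;
	uint64_t size;
};

/* Return 1 if [base, base + size) intersects any reserved range. */
static int ex_overlaps_reserved(uint64_t base, uint64_t size,
				const struct ex_range *res, size_t cnt)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		uint64_t r_end = res[i].base + res[i].size;

		if (base < r_end && res[i].base < base + size)
			return 1;
	}
	return 0;
}

/*
 * Try candidate bases start, start + step, ... up to limit - size and
 * pick the first one that avoids every reserved range. Returns 1 and
 * stores the base in *found on success, 0 otherwise.
 */
static int ex_find_free_base(uint64_t start, uint64_t limit, uint64_t size,
			     uint64_t step, const struct ex_range *res,
			     size_t cnt, uint64_t *found)
{
	uint64_t base;

	if (step == 0 || size == 0 || size > limit || start > limit - size)
		return 0;

	for (base = start; base <= limit - size; base += step) {
		if (!ex_overlaps_reserved(base, size, res, cnt)) {
			*found = base;
			return 1;
		}
		/* Stop before the increment could wrap around. */
		if (base > UINT64_MAX - step)
			break;
	}
	return 0;
}
```

With reserved ranges at 0x1000-0x2000 and 0x7000-0x8000, a 0x4000-byte
request below 0x8000 fails when stepping by the request size but
succeeds at 0x2000 when stepping by 0x1000.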
 arch/powerpc/kernel/fadump-common.h |   13 +++
 arch/powerpc/kernel/fadump.c|  143 ++-
 2 files changed, 149 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 06d9ecf..968745a 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -86,7 +86,7 @@ struct fadump_crash_info_header {
struct cpumask  online_mask;
 };
 
-struct fad_crash_memory_ranges {
+struct fadump_memory_range {
unsigned long long  base;
unsigned long long  size;
 };
@@ -94,6 +94,17 @@ struct fad_crash_memory_ranges {
 /* Platform specific callback functions */
 struct fadump_ops;
 
+/*
+ * Amount of memory (1024MB) to skip before making another attempt at
+ * reserving memory (after the previous attempt to reserve memory for
+ * FADump failed due to memory holes and/or reserved ranges) to reduce
+ * the likelihood of memory reservation failure.
+ */
+#define FADUMP_OFFSET_SIZE 0x4000U
+
+/* Maximum no. of reserved ranges supported for processing. */
+#define FADUMP_MAX_RESERVED_RANGES 128
+
 /* Maximum number of memory regions kernel supports */
 #define FADUMP_MAX_MEM_REGS128
 
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index abf4f334..bface37 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -36,11 +36,14 @@
 static struct fw_dump fw_dump;
 
 static DEFINE_MUTEX(fadump_mutex);
-struct fad_crash_memory_ranges *crash_memory_ranges;
+struct fadump_memory_range *crash_memory_ranges;
 int crash_memory_ranges_size;
 int crash_mem_ranges;
 int max_crash_mem_ranges;
 
+struct fadump_memory_range reserved_ranges[FADUMP_MAX_RESERVED_RANGES];
+int reserved_ranges_cnt;
+
 #ifdef CONFIG_CMA
 static struct cma *fadump_cma;
 
@@ -104,12 +107,116 @@ int __init fadump_cma_init(void)
 static int __init fadump_cma_init(void) { return 1; }
 #endif /* CONFIG_CMA */
 
+/*
+ * Sort the reserved ranges in-place and merge adjacent ranges
+ * to minimize the reserved ranges count.
+ */
+static void __init sort_and_merge_reserved_ranges(void)
+{
+   unsigned long long base, size;
+   struct fadump_memory_range tmp_range;
+   int i, j, idx;
+
+   if (!reserved_ranges_cnt)
+   return;
+
+   /* Sort the reserved ranges */
+   for (i = 0; i < reserved_ranges_cnt; i++) {
+   idx = i;
+   for (j = i + 1; j < reserved_ranges_cnt; j++) {
+   if (reserved_ranges[idx].base > reserved_ranges[j].base)
+   idx = j;
+   }
+   if (idx != i) {
+   tmp_range = reserved_ranges[idx];
+   reserved_ranges[idx] = reserved_ranges[i];
+   reserved_ranges[i] = tmp_range;
+   }
+   }
+
+   /* Merge adjacent reserved ranges */
+   idx = 0;
+   for (i = 1; i < reserved_ranges_cnt; i++) {
+   base = reserved_ranges[i-1].base;
+   size = reserved_ranges[i-1].size;
+   if (reserved_ranges[i].base == (base + size))
+   reserved_ranges[idx].size += reserved_ranges[i].size;
+   else {
+   idx++;
+   if (i == idx)
+   continue;
+
+   reserved_ranges[idx] = reserved_ranges[i];
+   }
+   }
+   reserved_ranges_cnt = idx + 1;
+}
+
+static int __init add_reserved_range(unsigned long base,
+unsigned long size)
+{
+   int i;
+
+   if (reserved_ranges_cnt == FADUMP_MAX_RESERVED_RANGES) {
+   /* Compact reserved ranges and try again. */
+   sort_and_merge_reserved_ranges();
+   if (reserved_ranges_cnt == FADUMP_MAX_RESERVED_RANGES)
+   return 0;
+   }
+
+   i = reserved_ranges_cnt++;
+   reserved_ranges[i].base = base;
+   reserved_ranges[i].size = size;
+   return 1;
+}
+
+/*
+ * Scan reserved-ranges to consider them while reserving/releasing
+ * memory for FADump.
+ */
+static void __init early_init_dt_scan_reserved_ranges(unsigned long node)
+{
+   int len, ret;
+   unsi

[PATCH v4 07/25] pseries/fadump: move out platform specific support from generic code

2019-07-16 Thread Hari Bathini

Move code that supports processing the crashed kernel's memory
preserved by firmware to platform-specific callback functions.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |6 
 arch/powerpc/kernel/fadump.c |  340 +-
 arch/powerpc/platforms/pseries/rtas-fadump.c |  278 +
 3 files changed, 299 insertions(+), 325 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 273247d..0231a0b 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -100,6 +100,12 @@ struct fw_dump {
/* cmd line option during boot */
unsigned long   reserve_bootvar;
 
+   /*
+* Start address of preserve area. This memory is reserved
+* permanently (production or capture kernel) for FADump.
+*/
+   unsigned long   preserv_area_start;
+
unsigned long   cpu_state_data_size;
unsigned long   hpte_region_size;
unsigned long   boot_memory_size;
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 650ebf8..e995db1 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -28,15 +28,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
 #include "fadump-common.h"
-#include "../platforms/pseries/rtas-fadump.h"
 
 static struct fw_dump fw_dump;
-static const struct rtas_fadump_mem_struct *fdm_active;
 
 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges *crash_memory_ranges;
@@ -111,22 +108,13 @@ static int __init fadump_cma_init(void) { return 1; }
 int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
  int depth, void *data)
 {
-   int ret;
-
-   if (depth != 1 || strcmp(uname, "rtas") != 0)
+   if (depth != 1)
return 0;
 
-   ret = rtas_fadump_dt_scan(&fw_dump, node);
+   if (strcmp(uname, "rtas") == 0)
+   return rtas_fadump_dt_scan(&fw_dump, node);
 
-   /*
-* The 'ibm,kernel-dump' rtas node is present only if there is
-* dump data waiting for us.
-*/
-   fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
-   fw_dump.dump_active = 1;
-
-   return ret;
+   return 0;
 }
 
 /*
@@ -308,9 +296,7 @@ int __init fadump_reserve_mem(void)
 * If dump is active then we have already calculated the size during
 * first kernel.
 */
-   if (fdm_active)
-   fw_dump.boot_memory_size = be64_to_cpu(fdm_active->rmr_region.source_len);
-   else {
+   if (!fw_dump.dump_active) {
fw_dump.boot_memory_size = fadump_calculate_reserve_size();
 #ifdef CONFIG_CMA
if (!fw_dump.nocma)
@@ -320,6 +306,7 @@ int __init fadump_reserve_mem(void)
 #endif
}
 
+   size = get_fadump_area_size();
if (memory_limit)
memory_boundary = memory_limit;
else
@@ -346,15 +333,10 @@ int __init fadump_reserve_mem(void)
size = memory_boundary - base;
fadump_reserve_crash_area(base, size);
 
-   fw_dump.fadumphdr_addr =
-   be64_to_cpu(fdm_active->rmr_region.destination_address) +
-   be64_to_cpu(fdm_active->rmr_region.source_len);
-   pr_debug("fadumphdr_addr = %pa\n", &fw_dump.fadumphdr_addr);
+   pr_debug("fadumphdr_addr = %#016lx\n", fw_dump.fadumphdr_addr);
fw_dump.reserve_dump_area_start = base;
fw_dump.reserve_dump_area_size = size;
} else {
-   size = get_fadump_area_size();
-
/*
 * Reserve memory at an offset closer to bottom of the RAM to
 * minimize the impact of memory hot-remove operation. We can't
@@ -469,218 +451,6 @@ void crash_fadump(struct pt_regs *regs, const char *str)
fw_dump.ops->fadump_trigger(fdh, str);
 }
 
-#define GPR_MASK   0xff00
-static inline int fadump_gpr_index(u64 id)
-{
-   int i = -1;
-   char str[3];
-
-   if ((id & GPR_MASK) == fadump_str_to_u64("GPR")) {
-   /* get the digits at the end */
-   id &= ~GPR_MASK;
-   id >>= 24;
-   str[2] = '\0';
-   str[1] = id & 0xff;
-   str[0] = (id >> 8) & 0xff;
-   sscanf(str, "%d", &i);
-   if (i > 31)
-   i = -1;
-   }
-   return i;
-}
-
-static inline void fadump_set_regval(struct pt_regs *regs, u64 reg_id,
-   u64 reg_val)
-{
-   int i;
-
-   i = fadump_gpr_index(reg_id);
-   if (i >= 0)
-   regs->gpr[i] = (unsigned long)reg_val;
-   else if (reg_id == fadump_str_to_u64("NIA"))
-   regs->

[PATCH v4 17/25] powerpc/fadump: consider reserved ranges while releasing memory

2019-07-16 Thread Hari Bathini

Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse the 'reserved-ranges' DT
node to reserve kernel memory falling in these ranges for firmware
purposes. Along with the preserved area memory, also ensure that memory
in reserved ranges does not overlap the memory released by the capture
kernel after saving vmcore. Also, fix the off-by-one error in the
fadump_release_reserved_area() function while releasing memory.

Signed-off-by: Hari Bathini 
---
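Editorial note, not part of the posted patch: the release loop added to
fadump_release_memory() below walks the sorted reserved ranges and
frees only the gaps between them. A minimal standalone sketch of that
walk follows. The names ex_range, ex_emit and ex_releasable_gaps are
hypothetical. Unlike the patch, this sketch collects the gaps into an
array instead of releasing pages, and it clamps each gap to the end of
the requested window.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct fadump_memory_range. */
struct ex_range {
	uint64_t base;
	uint64_t size;
};

/* Record the gap [start, end) in out[] if there is room; always count it. */
static size_t ex_emit(struct ex_range *out, size_t max_out, size_t n,
		      uint64_t start, uint64_t end)
{
	if (n < max_out) {
		out[n].base = start;
		out[n].size = end - start;
	}
	return n + 1;
}

/*
 * Compute the parts of [begin, end) that are not covered by res[].
 * Assumes res[] is sorted by base with adjacent ranges already merged.
 * Returns the number of gaps found; at most max_out are stored.
 */
static size_t ex_releasable_gaps(uint64_t begin, uint64_t end,
				 const struct ex_range *res, size_t cnt,
				 struct ex_range *out, size_t max_out)
{
	uint64_t tstart = begin;
	size_t i, n = 0;

	for (i = 0; i < cnt && tstart < end; i++) {
		uint64_t ra_start = res[i].base;
		uint64_t ra_end = ra_start + res[i].size;

		/* This reserved range lies entirely below the cursor. */
		if (tstart >= ra_end)
			continue;

		if (tstart < ra_start)
			n = ex_emit(out, max_out, n, tstart,
				    ra_start < end ? ra_start : end);

		/* Resume after the reserved range. */
		tstart = ra_end;
	}

	if (tstart < end)
		n = ex_emit(out, max_out, n, tstart, end);

	return n;
}
```

For a window of 0x0-0x10000 with reserved ranges at 0x2000-0x3000 and
0x8000-0xa000, the sketch yields three gaps: 0x0-0x2000, 0x3000-0x8000
and 0xa000-0x10000.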
 arch/powerpc/kernel/fadump.c |   61 +-
 1 file changed, 42 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index bface37..608eb1d 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -111,7 +111,7 @@ static int __init fadump_cma_init(void) { return 1; }
  * Sort the reserved ranges in-place and merge adjacent ranges
  * to minimize the reserved ranges count.
  */
-static void __init sort_and_merge_reserved_ranges(void)
+static void sort_and_merge_reserved_ranges(void)
 {
unsigned long long base, size;
struct fadump_memory_range tmp_range;
@@ -152,8 +152,7 @@ static void __init sort_and_merge_reserved_ranges(void)
reserved_ranges_cnt = idx + 1;
 }
 
-static int __init add_reserved_range(unsigned long base,
-unsigned long size)
+static int add_reserved_range(unsigned long base, unsigned long size)
 {
int i;
 
@@ -876,7 +875,7 @@ static int fadump_setup_crash_memory_ranges(void)
continue;
}
 
-   /* add this range excluding the reserved dump area. */
+   /* add this range excluding the preserve area. */
ret = fadump_exclude_reserved_area(start, end);
if (ret)
return ret;
@@ -1106,33 +1105,57 @@ static void fadump_release_reserved_area(unsigned long start, unsigned long end)
if (tend == end_pfn)
break;
 
-   start_pfn = tend + 1;
+   start_pfn = tend;
}
}
 }
 
 /*
- * Release the memory that was reserved in early boot to preserve the memory
- * contents. The released memory will be available for general use.
+ * Release the memory that was reserved during early boot to preserve the
+ * crash'ed kernel's memory contents except preserve area (permanent
+ * reservation) and reserved ranges used by F/W. The released memory will
+ * be available for general use.
  */
 static void fadump_release_memory(unsigned long begin, unsigned long end)
 {
+   int i;
unsigned long ra_start, ra_end;
-
-   ra_start = fw_dump.reserve_dump_area_start;
-   ra_end = ra_start + fw_dump.reserve_dump_area_size;
+   unsigned long tstart;
 
/*
-* exclude the dump reserve area. Will reuse it for next
-* fadump registration.
+* Add memory to permanently preserve to reserved ranges list
+* and exclude all these ranges while releasing memory.
 */
-   if (begin < ra_end && end > ra_start) {
-   if (begin < ra_start)
-   fadump_release_reserved_area(begin, ra_start);
-   if (end > ra_end)
-   fadump_release_reserved_area(ra_end, end);
-   } else
-   fadump_release_reserved_area(begin, end);
+   i = add_reserved_range(fw_dump.reserve_dump_area_start,
+  fw_dump.reserve_dump_area_size);
+   if (i == 0) {
+   /*
+* Reached the MAX reserved ranges count. To ensure reserved
+* dump area is excluded (as it will be reused for next
+* FADump registration), ignore the last reserved range and
+* add reserved dump area instead.
+*/
+   reserved_ranges_cnt--;
+   add_reserved_range(fw_dump.reserve_dump_area_start,
+  fw_dump.reserve_dump_area_size);
+   }
+   sort_and_merge_reserved_ranges();
+
+   tstart = begin;
+   for (i = 0; i < reserved_ranges_cnt; i++) {
+   ra_start = reserved_ranges[i].base;
+   ra_end = ra_start + reserved_ranges[i].size;
+
+   if (tstart >= ra_end)
+   continue;
+
+   if (tstart < ra_start)
+   fadump_release_reserved_area(tstart, ra_start);
+   tstart = ra_end;
+   }
+
+   if (tstart < end)
+   fadump_release_reserved_area(tstart, end);
 }
 
 static void fadump_invalidate_release_mem(void)

[PATCH v4 18/25] powernv/fadump: process architected register state data provided by firmware

2019-07-16 Thread Hari Bathini

From: Hari Bathini 

Firmware provides architected register state data at the time of crash.
Process this data and build CPU notes to append to ELF core.

Signed-off-by: Hari Bathini 
Signed-off-by: Vasant Hegde 
---
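Editorial note, not part of the posted patch: opal_fadump_read_regs()
below walks a buffer of big-endian register entries and fills in
pt_regs. A minimal userspace sketch of the GPR part of that decoding
follows. The layout assumed here (32-bit type, 32-bit number, 64-bit
value, in that order) is inferred from the field accesses in the diff,
since the structure definition is not shown. The value of
EX_REG_TYPE_GPR and all ex_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_REG_TYPE_GPR	1u	/* hypothetical value, for illustration only */
#define EX_NUM_GPRS	32u
#define EX_ENTRY_MIN	16u	/* assumed: be32 type + be32 num + be64 val */

struct ex_regs {
	uint64_t gpr[EX_NUM_GPRS];
};

/* Read a big-endian 32-bit value from an unaligned byte pointer. */
static uint32_t ex_load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from an unaligned byte pointer. */
static uint64_t ex_load_be64(const unsigned char *p)
{
	return ((uint64_t)ex_load_be32(p) << 32) | ex_load_be32(p + 4);
}

/*
 * Decode cnt entries of entry_size bytes each and store GPR values.
 * Entries of other types, or with a GPR number of 32 or more, are
 * skipped. The output is zeroed first, as in the patch.
 */
static void ex_read_gprs(const unsigned char *buf, size_t cnt,
			 size_t entry_size, struct ex_regs *regs)
{
	size_t i;

	memset(regs, 0, sizeof(*regs));

	if (entry_size < EX_ENTRY_MIN)
		return;

	for (i = 0; i < cnt; i++, buf += entry_size) {
		uint32_t type = ex_load_be32(buf);
		uint32_t num = ex_load_be32(buf + 4);
		uint64_t val = ex_load_be64(buf + 8);

		if (type == EX_REG_TYPE_GPR && num < EX_NUM_GPRS)
			regs->gpr[num] = val;
	}
}
```

Stepping by the firmware-reported entry size, rather than by the size of
a fixed structure, keeps such a loop correct if the entries carry
trailing fields that the decoder does not know about.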
 arch/powerpc/kernel/fadump-common.h  |4 +
 arch/powerpc/platforms/powernv/opal-fadump.c |  197 --
 arch/powerpc/platforms/powernv/opal-fadump.h |   39 +
 3 files changed, 228 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
index 968745a..2dd0d9d 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -121,7 +121,11 @@ struct fw_dump {
 */
unsigned long   preserv_area_start;
 
+   unsigned long   cpu_state_destination_addr;
+   unsigned long   cpu_state_data_version;
+   unsigned long   cpu_state_entry_size;
unsigned long   cpu_state_data_size;
+
unsigned long   hpte_region_size;
 
unsigned long   boot_memory_size;
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index dffc0e7..479967c 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -27,6 +27,7 @@
 #include "opal-fadump.h"
 
 static const struct opal_fadump_mem_struct *opal_fdm_active;
+static const struct opal_mpipl_fadump *opal_cpu_metadata;
 static struct opal_fadump_mem_struct *opal_fdm;
 
 static void opal_fadump_update_config(struct fw_dump *fadump_conf,
@@ -233,15 +234,115 @@ static int opal_fadump_invalidate_fadump(struct fw_dump 
*fadump_conf)
return 0;
 }
 
+static inline void opal_fadump_set_regval_regnum(struct pt_regs *regs,
+u32 reg_type, u32 reg_num,
+u64 reg_val)
+{
+   if (reg_type == HDAT_FADUMP_REG_TYPE_GPR) {
+   if (reg_num < 32)
+   regs->gpr[reg_num] = reg_val;
+   return;
+   }
+
+   switch (reg_num) {
+   case SPRN_CTR:
+   regs->ctr = reg_val;
+   break;
+   case SPRN_LR:
+   regs->link = reg_val;
+   break;
+   case SPRN_XER:
+   regs->xer = reg_val;
+   break;
+   case SPRN_DAR:
+   regs->dar = reg_val;
+   break;
+   case SPRN_DSISR:
+   regs->dsisr = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_NIP:
+   regs->nip = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_MSR:
+   regs->msr = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_CCR:
+   regs->ccr = reg_val;
+   break;
+   }
+}
+
+static inline void opal_fadump_read_regs(char *bufp, unsigned int regs_cnt,
+unsigned int reg_entry_size,
+struct pt_regs *regs)
+{
+   int i;
+   struct hdat_fadump_reg_entry *reg_entry;
+
+   memset(regs, 0, sizeof(struct pt_regs));
+
+   for (i = 0; i < regs_cnt; i++, bufp += reg_entry_size) {
+   reg_entry = (struct hdat_fadump_reg_entry *)bufp;
+   opal_fadump_set_regval_regnum(regs,
+ be32_to_cpu(reg_entry->reg_type),
+ be32_to_cpu(reg_entry->reg_num),
+ be64_to_cpu(reg_entry->reg_val));
+   }
+}
+
+static inline bool __init is_thread_core_inactive(u8 core_state)
+{
+   bool is_inactive = false;
+
+   if (core_state == HDAT_FADUMP_CORE_INACTIVE)
+   is_inactive = true;
+
+   return is_inactive;
+}
+
 /*
  * Convert CPU state data saved at the time of crash into ELF notes.
+ *
+ * Each register entry is 16 bytes: a numerical identifier along with
+ * a GPR/SPR flag in the first 8 bytes and the register value in the next
+ * 8 bytes. For more details refer to F/W documentation.
  */
 static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
 {
u32 num_cpus, *note_buf;
struct fadump_crash_info_header *fdh = NULL;
+   struct hdat_fadump_thread_hdr *thdr;
+   unsigned long addr;
+   u32 thread_pir;
+   char *bufp;
+   struct pt_regs regs;
+   unsigned int size_of_each_thread;
+   unsigned int regs_offset, regs_cnt, reg_esize;
+   int i;
+
+   if ((fadump_conf->cpu_state_destination_addr == 0) ||
+   (fadump_conf->cpu_state_entry_size == 0)) {
+   pr_err("CPU state data not available for processing!\n");
+   return -ENODEV;
+   }
+
+   size_of_each_thread = fadump_conf->cpu_state_entry_size;
+   num_cpus = (fadump_conf->cpu_state_data_size / size_of_each_thread);
+
+   addr = fadump_conf->cpu_state_destination_addr;
+   bufp = __va(addr);
+
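
As a rough user-space illustration of the 16-byte register entry layout
described in the comment above (a 32-bit type, a 32-bit register number and a
64-bit value, all big-endian, as suggested by the be32_to_cpu()/be64_to_cpu()
calls in the patch), the sketch below decodes entries by hand. REG_TYPE_GPR,
load_be32(), load_be64() and read_gprs() are hypothetical names, and the value
chosen for the GPR type is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_ENTRY_MIN_SIZE 16	/* 4 + 4 + 8 bytes, per the patch comment */
#define REG_TYPE_GPR 1		/* hypothetical value, for illustration only */

/* Read a big-endian 32-bit value from an unaligned buffer. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from an unaligned buffer. */
static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Walk cnt entries spaced entry_size bytes apart and copy GPR values
 * into gpr[]. Returns the number of GPR entries stored.
 */
static size_t read_gprs(const unsigned char *buf, size_t cnt,
			size_t entry_size, uint64_t gpr[32])
{
	size_t i, stored = 0;

	assert(entry_size >= REG_ENTRY_MIN_SIZE);
	for (i = 0; i < cnt; i++, buf += entry_size) {
		uint32_t type = load_be32(buf);
		uint32_t num = load_be32(buf + 4);
		uint64_t val = load_be64(buf + 8);

		/* Ignore non-GPR entries and out-of-range register numbers. */
		if (type == REG_TYPE_GPR && num < 32) {
			gpr[num] = val;
			stored++;
		}
	}
	return stored;
}
```

Stepping by a firmware-provided entry size rather than a fixed 16 keeps the
walk valid if a later firmware version pads the entries, which appears to be
why the patch carries cpu_state_entry_size around.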

[PATCH v4 19/25] powernv/fadump: add support to preserve crash data on FADUMP disabled kernel

2019-07-16 Thread Hari Bathini

Add a new kernel config option, CONFIG_PRESERVE_FA_DUMP, that ensures
that crash data from a previously crashed kernel is preserved. This
helps in cases where FADump is not enabled but the subsequent memory
preserving kernel boot is likely to process this crash data. One
typical use case for this config option is the petitboot kernel.

As OPAL allows registering an address with it in the first kernel and
retrieving it after MPIPL, use it to store the top of boot memory.
A kernel that intends to preserve crash data retrieves it and avoids
using memory beyond this address.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/Kconfig |9 ++
 arch/powerpc/include/asm/fadump.h|9 +-
 arch/powerpc/kernel/Makefile |6 +
 arch/powerpc/kernel/fadump-common.h  |   13 ++-
 arch/powerpc/kernel/fadump.c |  128 --
 arch/powerpc/kernel/prom.c   |4 -
 arch/powerpc/platforms/powernv/Makefile  |1 
 arch/powerpc/platforms/powernv/opal-fadump.c |   59 
 arch/powerpc/platforms/powernv/opal-fadump.h |3 +
 9 files changed, 176 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0ce0a80..7c44a8b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -580,6 +580,15 @@ config FA_DUMP
  If unsure, say "y". Only special kernels like petitboot may
  need to say "N" here.
 
+config PRESERVE_FA_DUMP
+   bool "Preserve Firmware-assisted dump"
+   depends on PPC64 && PPC_POWERNV && !FA_DUMP
+   help
+ On a kernel with FA_DUMP disabled, this option helps to preserve
+ crash data from a previously crash'ed kernel. Useful when the next
+ memory preserving kernel boot would process this crash data.
+ Petitboot kernel is the typical usecase for this option.
+
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
depends on SMP
diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index e608d34..fd990d8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -14,9 +14,6 @@
 extern int crashing_cpu;
 
 extern int is_fadump_memory_area(u64 addr, ulong size);
-extern int early_init_dt_scan_fw_dump(unsigned long node,
-   const char *uname, int depth, void *data);
-extern int fadump_reserve_mem(void);
 extern int setup_fadump(void);
 extern int is_fadump_active(void);
 extern int should_fadump_crash(void);
@@ -29,4 +26,10 @@ static inline int should_fadump_crash(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
 static inline void fadump_cleanup(void) { }
 #endif /* !CONFIG_FA_DUMP */
+
+#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
+extern int early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
+ int depth, void *data);
+extern int fadump_reserve_mem(void);
+#endif
 #endif /* __PPC64_FA_DUMP_H__ */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 439d548..6abaead 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -78,7 +78,11 @@ obj-$(CONFIG_EEH)  += eeh.o eeh_pe.o eeh_dev.o 
eeh_cache.o \
  eeh_driver.o eeh_event.o eeh_sysfs.o
 obj-$(CONFIG_GENERIC_TBSYNC)   += smp-tbsync.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
-obj-$(CONFIG_FA_DUMP)  += fadump.o fadump-common.o
+ifeq ($(CONFIG_FA_DUMP),y)
+obj-y  += fadump.o fadump-common.o
+else
+obj-$(CONFIG_PRESERVE_FA_DUMP) += fadump.o
+endif
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
 endif
diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 2dd0d9d..5dbcefc 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -16,6 +16,7 @@
 #ifndef __PPC64_FA_DUMP_INTERNAL_H__
 #define __PPC64_FA_DUMP_INTERNAL_H__
 
+#ifndef CONFIG_PRESERVE_FA_DUMP
 /*
  * The RMA region will be saved for later dumping when kernel crashes.
  * RMA is Real Mode Area, the first block of logical memory address owned
@@ -180,7 +181,17 @@ void fadump_update_elfcore_header(struct fw_dump 
*fadump_config, char *bufp);
 int is_fadump_boot_mem_contiguous(struct fw_dump *fadump_conf);
 int is_fadump_reserved_mem_contiguous(struct fw_dump *fadump_conf);
 
-#ifdef CONFIG_PPC_PSERIES
+#else /* !CONFIG_PRESERVE_FA_DUMP */
+
+/* Firmware-assisted dump configuration details. */
+struct fw_dump {
+   unsigned long   boot_mem_top;
+   unsigned long   dump_active;
+};
+
+#endif /* CONFIG_PRESERVE_FA_DUMP */
+
+#if !defined(CONFIG_PRESERVE_FA_DUMP) && defined(CONFIG_PPC_PSERIES)
 extern int rtas_fadump_dt_scan(struct fw_dump *fadump_config, ulong node);
 #else
 static inline int rtas_fadump_dt_scan(struct fw_dump *fadump_config, ulong

[PATCH v4 20/25] powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP

2019-07-16 Thread Hari Bathini

Kernel config option CONFIG_PRESERVE_FA_DUMP is introduced to ensure
crash data from a previously crashed kernel is preserved. Update
documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index cd48776..373a9fb 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -98,6 +98,15 @@ firmware versions on PSeries (PowerVM) platform and Power9
 and above systems with recent firmware versions on PowerNV
 (OPAL) platform.
 
+On OPAL based machines, the system first boots into an intermediate
+kernel (referred to as petitboot kernel) before booting into the
+capture kernel. This kernel would have minimal kernel and/or
+userspace support to process crash data. Such a kernel needs to
+preserve the previously crashed kernel's memory for the subsequent
+capture kernel boot to process this crash data. Kernel config
+option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel
+to ensure that crash data is preserved to process later.
+
 Implementation details:
 --

[PATCH v4 21/25] powernv/opalcore: export /sys/firmware/opal/core for analysing opal crashes

2019-07-16 Thread Hari Bathini

From: Hari Bathini 

Export /sys/firmware/opal/core file to analyze opal crashes. Since OPAL
core can be generated independent of CONFIG_FA_DUMP support in kernel,
add this support under a new kernel config option CONFIG_OPAL_CORE.
Also, avoid code duplication by moving common code used while exporting
/proc/vmcore and/or /sys/firmware/opal/core file(s).

Signed-off-by: Hari Bathini 
---
 arch/powerpc/Kconfig |9 
 arch/powerpc/platforms/powernv/Makefile  |1 
 arch/powerpc/platforms/powernv/opal-core.c   |  599 ++
 arch/powerpc/platforms/powernv/opal-fadump.c |   84 +---
 arch/powerpc/platforms/powernv/opal-fadump.h |   71 +++
 5 files changed, 697 insertions(+), 67 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-core.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7c44a8b..0afe0db 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -589,6 +589,15 @@ config PRESERVE_FA_DUMP
  memory preserving kernel boot would process this crash data.
  Petitboot kernel is the typical usecase for this option.
 
+config OPAL_CORE
+   bool "Export OPAL memory as /sys/firmware/opal/core"
+   depends on PPC64 && PPC_POWERNV
+   help
+ This option uses the MPIPL support in firmware to provide an
+ ELF core of OPAL memory after a crash. The ELF core is exported
+ as /sys/firmware/opal/core file which is helpful in debugging
+ OPAL crashes using GDB.
+
 config IRQ_ALL_CPUS
bool "Distribute interrupts on all CPUs by default"
depends on SMP
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b4a8022..e659afd 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -8,6 +8,7 @@ obj-y   += opal-kmsg.o opal-powercap.o 
opal-psr.o opal-sensor-groups.o
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_FA_DUMP)  += opal-fadump.o
 obj-$(CONFIG_PRESERVE_FA_DUMP) += opal-fadump.o
+obj-$(CONFIG_OPAL_CORE)+= opal-core.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
diff --git a/arch/powerpc/platforms/powernv/opal-core.c 
b/arch/powerpc/platforms/powernv/opal-core.c
new file mode 100644
index 000..55bea53
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -0,0 +1,599 @@
+/*
+ * Interface for exporting the OPAL ELF core.
+ * Heavily inspired from fs/proc/vmcore.c
+ *
+ * Copyright 2019, IBM Corp.
+ * Author: Hari Bathini 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+#define pr_fmt(fmt) "opalcore: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "../../kernel/fadump-common.h"
+#include "opal-fadump.h"
+
+#define MAX_PT_LOAD_CNT8
+
+/* NT_AUXV note related info */
+#define AUXV_CNT   1
+#define AUXV_DESC_SZ   (((2 * AUXV_CNT) + 1) * sizeof(Elf64_Off))
+
+struct opalcore_config {
+   unsigned intnum_cpus;
+   /* PIR value of crashing CPU */
+   unsigned intcrashing_cpu;
+
+   /* CPU state data info from F/W */
+   unsigned long   cpu_state_destination_addr;
+   unsigned long   cpu_state_data_size;
+   unsigned long   cpu_state_entry_size;
+
+   /* OPAL memory to be exported as PT_LOAD segments */
+   unsigned long   ptload_addr[MAX_PT_LOAD_CNT];
+   unsigned long   ptload_size[MAX_PT_LOAD_CNT];
+   unsigned long   ptload_cnt;
+
+   /* Pointer to the first PT_LOAD in the ELF core file */
+   Elf64_Phdr  *ptload_phdr;
+
+   /* Total size of opalcore file. */
+   size_t  opalcore_size;
+
+   /* Buffer for all the ELF core headers and the PT_NOTE */
+   size_t  opalcorebuf_sz;
+   char*opalcorebuf;
+
+   /* NT_AUXV buffer */
+   charauxv_buf[AUXV_DESC_SZ];
+};
+
+struct opalcore {
+   struct list_head list;
+   unsigned long long paddr;
+   unsigned long long size;
+   loff_t offset;
+};
+
+static LIST_HEAD(opalcore_list);
+static struct opalcore_config *oc_conf;
+static const struct opal_mpipl_fadump *opalc_metadata;
+static const struct opal_mpipl_fadump *opalc_cpu_metadata;
+
+/*
+ * Set crashing CPU's signal to SIGUSR1 if the dump is triggered
+ * by the kernel, SIGTERM otherwise.
+ */
+bool kernel_initiated;
+
+static struct opalcore * __init get_new_element(void)
+{
+   return kzalloc(sizeof(struct opalcore)

[PATCH v4 22/25] powernv/fadump: Warn before processing partial crashdump

2019-07-16 Thread Hari Bathini

If not all kernel boot memory regions are registered for MPIPL before
system crashes, try processing the partial crashdump but warn the user
before proceeding.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/platforms/powernv/opal-fadump.c |   21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index b55f25c..3ef212d 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -136,6 +136,27 @@ static void opal_fadump_get_config(struct fw_dump 
*fadump_conf,
last_end = base + size;
}
 
+   /*
+* Rarely, the system may crash before all boot memory regions
+* are registered for MPIPL. In such cases, warn that the vmcore
+* may not be accurate and proceed anyway, as that is the best bet
+* considering free pages, cache pages, user pages, etc. are
+* usually filtered out.
+*
+* Hope the memory that could not be preserved only has pages
+* that are usually filtered out while saving the vmcore.
+*/
+   if (fdm->registered_regions < fdm->region_cnt) {
+   pr_warn("The crashdump may not be accurate as the below boot memory regions could not be preserved:\n");
+   i = fdm->registered_regions;
+   while (i < fdm->region_cnt) {
+   pr_warn("\t%d. base: 0x%llx, size: 0x%llx\n",
+   (i + 1), fdm->rgn[i].src,
+   fdm->rgn[i].size);
+   i++;
+   }
+   }
+
fadump_conf->boot_mem_top = (fadump_conf->boot_memory_size + hole_size);
fadump_conf->boot_mem_regs_cnt = fdm->region_cnt;
opal_fadump_update_config(fadump_conf, fdm);
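
A small user-space sketch of the warning logic, for illustration only: it
assumes registered_regions counts how many of the region_cnt boot memory
regions were registered before the crash, so the unpreserved ones are the tail
of the array. struct mem_region and warn_unpreserved() are hypothetical names
that only loosely mirror the fields used in the hunk above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct mem_region {
	uint64_t src;	/* base address of the region */
	uint64_t size;	/* length of the region in bytes */
};

/*
 * Print every region that was not registered before the crash.
 * Returns the number of regions that could not be preserved.
 */
static size_t warn_unpreserved(const struct mem_region *rgn,
			       size_t region_cnt, size_t registered)
{
	size_t i;

	/* Nothing to report when every region was registered. */
	if (registered >= region_cnt)
		return 0;

	fprintf(stderr, "The crashdump may not be accurate as the below boot memory regions could not be preserved:\n");
	for (i = registered; i < region_cnt; i++)
		fprintf(stderr, "\t%zu. base: 0x%llx, size: 0x%llx\n", i + 1,
			(unsigned long long)rgn[i].src,
			(unsigned long long)rgn[i].size);

	return region_cnt - registered;
}
```

The guard and the loop bounds have to agree: the loop runs from the first
unregistered index up to region_cnt, so the warning is only meaningful when
the registered count is the smaller of the two.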

[PATCH v4 23/25] powernv/opalcore: provide an option to invalidate /sys/firmware/opal/core file

2019-07-16 Thread Hari Bathini

Writing '1' to /sys/kernel/fadump_release_opalcore releases the
memory held by the kernel for exporting the /sys/firmware/opal/core file.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/platforms/powernv/opal-core.c |   38 
 1 file changed, 38 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-core.c 
b/arch/powerpc/platforms/powernv/opal-core.c
index 55bea53..9663d70 100644
--- a/arch/powerpc/platforms/powernv/opal-core.c
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -562,6 +564,36 @@ static void opalcore_cleanup(void)
 }
 __exitcall(opalcore_cleanup);
 
+static ssize_t fadump_release_opalcore_store(struct kobject *kobj,
+struct kobj_attribute *attr,
+const char *buf, size_t count)
+{
+   int input = -1;
+
+   if (kstrtoint(buf, 0, &input))
+   return -EINVAL;
+
+   if (input == 1) {
+   if (oc_conf == NULL) {
+   pr_err("'/sys/firmware/opal/core' file not accessible!\n");
+   return -EPERM;
+   }
+
+   /*
+* Take away '/sys/firmware/opal/core' and release all memory
+* used for exporting this file.
+*/
+   opalcore_cleanup();
+   } else
+   return -EINVAL;
+
+   return count;
+}
+
+static struct kobj_attribute opalcore_rel_attr = __ATTR(fadump_release_opalcore,
+   0200, NULL,
+   fadump_release_opalcore_store);
+
 /* Init function for opalcore module. */
 static int __init opalcore_init(void)
 {
@@ -594,6 +626,12 @@ static int __init opalcore_init(void)
return rc;
}
 
+   rc = sysfs_create_file(kernel_kobj, &opalcore_rel_attr.attr);
+   if (rc) {
+   pr_warn("unable to create sysfs file fadump_release_opalcore (%d)\n",
+   rc);
+   }
+
return 0;
 }
 fs_initcall(opalcore_init);
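
The input handling of the store callback can be approximated in user space as
below. This is a sketch under assumptions: strtol() with base 0 stands in for
kstrtoint(), which is stricter about leading whitespace, and release_store()
with its released flag is a hypothetical replacement for the call to
opalcore_cleanup().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Accept only the value 1 (optionally followed by a newline).
 * Returns count on success and -EINVAL for any other input.
 */
static long release_store(const char *buf, size_t count, int *released)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;

	/* Tolerate the trailing newline that "echo" appends. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val != 1)
		return -EINVAL;

	*released = 1;	/* the patch calls opalcore_cleanup() at this point */
	return (long)count;
}
```

Returning count on success tells the writer that the whole buffer was
consumed, which matches what the sysfs store callback in the patch does.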

[PATCH v4 24/25] powernv/fadump: consider f/w load area

2019-07-16 Thread Hari Bathini

OPAL loads kernel & initrd at 512MB offset (256MB size), also exported
as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is
less than 768MB, kernel memory to be exported as '/proc/vmcore' would
be overwritten by f/w while loading kernel & initrd. To avoid such a
scenario, enforce a minimum boot memory size of 768MB on OPAL platform.

Also, skip using FADump if a newer F/W version loads kernel & initrd
above 768MB.
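
The 768MB figure follows from the load area described above: a 256MB area
starting at the 512MB offset ends at 768MB. A minimal sketch of both checks,
using hypothetical helper names (MB(), boot_mem_size_ok(),
fw_load_area_supported()) and taking the sizes from this commit message rather
than from the device tree:

```c
#include <assert.h>
#include <stdbool.h>

#define MB(x)			((unsigned long long)(x) << 20)
#define FW_LOAD_AREA_BASE	MB(512)	/* offset at which OPAL loads kernel & initrd */
#define FW_LOAD_AREA_SIZE	MB(256)
#define MIN_BOOT_MEM		(FW_LOAD_AREA_BASE + FW_LOAD_AREA_SIZE)

/* Boot memory must cover the whole f/w load area to avoid being overwritten. */
static bool boot_mem_size_ok(unsigned long long boot_mem_size)
{
	return boot_mem_size >= MIN_BOOT_MEM;
}

/* A load area that ends above the enforced minimum is not handled. */
static bool fw_load_area_supported(unsigned long long base,
				   unsigned long long size)
{
	return base + size <= MIN_BOOT_MEM;
}
```
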

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/fadump-common.h  |   11 +-
 arch/powerpc/kernel/fadump.c |   11 +-
 arch/powerpc/platforms/powernv/opal-fadump.c |   29 ++
 arch/powerpc/platforms/powernv/opal-fadump.h |7 ++
 arch/powerpc/platforms/pseries/rtas-fadump.c |6 +
 arch/powerpc/platforms/pseries/rtas-fadump.h |   11 ++
 6 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 5dbcefc..e758eb6 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -26,16 +26,6 @@
 #define RMA_START  0x0
 #define RMA_END(ppc64_rma_size)
 
-/*
- * On some Power systems where RMO is 128MB, it still requires minimum of
- * 256MB for kernel to boot successfully. When kdump infrastructure is
- * configured to save vmcore over network, we run into OOM issue while
- * loading modules related to network setup. Hence we need additional 64M
- * of memory to avoid OOM issue.
- */
-#define MIN_BOOT_MEM   (((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
-   + (0x1UL << 26))
-
 /* The upper limit percentage for user specified boot memory size (25%) */
 #define MAX_BOOT_MEM_RATIO 4
 
@@ -163,6 +153,7 @@ struct fadump_ops {
ulong   (*init_fadump_mem_struct)(struct fw_dump *fadump_config);
ulong   (*get_kernel_metadata_size)(void);
int (*setup_kernel_metadata)(struct fw_dump *fadump_config);
+   ulong   (*get_bootmem_min)(void);
int (*register_fadump)(struct fw_dump *fadump_config);
int (*unregister_fadump)(struct fw_dump *fadump_config);
int (*invalidate_fadump)(struct fw_dump *fadump_config);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index bb6a63c..ffc9e3f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -335,7 +335,8 @@ static inline unsigned long 
fadump_calculate_reserve_size(void)
if (memory_limit && size > memory_limit)
size = memory_limit;
 
-   return (size > MIN_BOOT_MEM ? size : MIN_BOOT_MEM);
+   return (size > fw_dump.ops->get_bootmem_min() ? size :
+   fw_dump.ops->get_bootmem_min());
 }
 
 /*
@@ -493,6 +494,14 @@ int __init fadump_reserve_mem(void)
ALIGN(fw_dump.boot_memory_size,
FADUMP_CMA_ALIGNMENT);
 #endif
+
+   if (fw_dump.boot_memory_size < fw_dump.ops->get_bootmem_min()) {
+   pr_err("Can't enable fadump with boot memory size (0x%lx) less than 0x%lx\n",
+  fw_dump.boot_memory_size,
+  fw_dump.ops->get_bootmem_min());
+   goto error_out;
+   }
+
if (!fadump_get_boot_mem_regions()) {
 pr_err("Too many holes in boot memory area to enable fadump\n");
goto error_out;
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index 3ef212d..618186e 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -241,6 +242,11 @@ static int opal_fadump_setup_kernel_metadata(struct 
fw_dump *fadump_conf)
return err;
 }
 
+static ulong opal_fadump_get_bootmem_min(void)
+{
+   return OPAL_FADUMP_MIN_BOOT_MEM;
+}
+
 static int opal_fadump_register_fadump(struct fw_dump *fadump_conf)
 {
int i, err = -EIO;
@@ -535,6 +541,7 @@ static struct fadump_ops opal_fadump_ops = {
.init_fadump_mem_struct = opal_fadump_init_mem_struct,
.get_kernel_metadata_size   = opal_fadump_get_kernel_metadata_size,
.setup_kernel_metadata  = opal_fadump_setup_kernel_metadata,
+   .get_bootmem_min= opal_fadump_get_bootmem_min,
.register_fadump= opal_fadump_register_fadump,
.unregister_fadump  = opal_fadump_unregister_fadump,
.invalidate_fadump  = opal_fadump_invalidate_fadump,
@@ -547,6 +554,7 @@ int __init opal_fadump_dt_scan(struct fw_dump *fadump_conf, 
ulong node)
 {
unsigned long dn;
const __be32 *prop;
+   int i, len;
 
/*
 * Check if Fir

[PATCH v4 25/25] powernv/fadump: update documentation about option to release opalcore

2019-07-16 Thread Hari Bathini

With /sys/firmware/opal/core support available on OPAL based machines
and an option to release the memory used by the kernel in exporting this
core file, update FADump documentation with these details.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 373a9fb..9933fa6 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -107,6 +107,16 @@ capture kernel boot to process this crash data. Kernel 
config
 option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel
 to ensure that crash data is preserved to process later.
 
+-- On OPAL based machines (PowerNV), if the kernel is built with
+   CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
+   exported as /sys/firmware/opal/core file. This sysfs file is
+   helpful in debugging OPAL crashes with GDB. The kernel memory
+   used for exporting this sysfs file can be released by echo'ing
+   '1' to /sys/kernel/fadump_release_opalcore node.
+
+   e.g.
+ # echo 1 > /sys/kernel/fadump_release_opalcore
+
 Implementation details:
 --
 
@@ -270,6 +280,15 @@ Here is the list of files under kernel sysfs:
 enhanced to use this interface to release the memory reserved for
 dump and continue without 2nd reboot.
 
+ /sys/kernel/fadump_release_opalcore
+
+This file is available only on OPAL based machines when FADump is
+active during capture kernel. This is used to release the memory
+used by the kernel to export /sys/firmware/opal/core file. To
+release this memory, echo '1' to it:
+
+echo 1  > /sys/kernel/fadump_release_opalcore
+
 Here is the list of files under powerpc debugfs:
 (Assuming debugfs is mounted on /sys/kernel/debug directory.)

Re: [PATCH kernel v2] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Michael Ellerman

Cédric Le Goater  writes:
> On 16/07/2019 11:10, Alexey Kardashevskiy wrote:
>> On 16/07/2019 18:59, Cédric Le Goater wrote:
>>> On 15/07/2019 09:11, Alexey Kardashevskiy wrote:
>>>> There is a race between releasing an irq on one cpu and fetching it
>>>> from XIVE on another cpu as there does not seem to be any locking between
>>>> these, probably because xive_irq_chip::irq_shutdown() is supposed to
>>>> remove the irq from all queues in the system which it does not do.
>>>>
>>>> As a result, when such released irq appears in a queue, we take it
>>>> from the queue but we do not change the current priority on that cpu and
>>>> since there is no handler for the irq, EOI is never called and the cpu
>>>> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
>>>> is assigned to the same cpu, then that device stops working until irq
>>>> is moved to another cpu or the device is reset.
>>>>
>>>> This adds a new ppc_md.orphan_irq callback which is called if no irq
>>>> descriptor is found. The XIVE implementation drops the current priority
>>>> to 0xff which effectively unmasks interrupts in a current CPU.
>>>
>>>
>>> The test on generic_handle_irq() catches interrupt events that
>>> were served on a target CPU while the source interrupt was being
>>> shutdown on another CPU.
>>>
>>> The orphan_irq() handler restores the CPPR in such cases.
>>>
>>> This looks OK to me. I would have added some more comments in the
>>> code.
>> 
>> Which and where? Thanks,
>
> Above xive_orphan_irq() explaining the complete problem that we are 
> addressing. XIVE is not super obvious when looking at the code ...

Yes adding a comment would be good, thanks.

This will also need a Fixes: tag.

cheers

[PATCH 00/14] pending doc patches for 5.3-rc

2019-07-16 Thread Mauro Carvalho Chehab

These are the pending documentation patches after my pull request
for this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media.git 
tags/docs/v5.3-1

Patches 1 to 13 were already submitted, but got rebased. Patch 14
is a new fixup one.

Patches 1 and 2 weren't submitted before due to merge conflicts
that are now solved upstream;

Patch 3 fixes a series of random Documentation/* references that
are pointing to the wrong places.

Patch 4 fixes a longstanding issue: every time a new book is added,
conf.py needs changes in order to allow generating a PDF file.
After the patch, conf.py will automatically recognize new books,
saving the trouble of repeatedly adding documents to it.

Patches 5 to 11 deal with font support when building translations.pdf.
The main focus is to add xeCJK support. While doing it, I discovered
some bugs in the sphinx-pre-install script after running it with 7
different distributions.

Patch 12 improves support for partial doc building. Currently, each
subdir needs to have its own conf.py in order to support partial
doc build. After it, any Documentation subdir can be used to
produce html/pdf docs with:

make SPHINXDIRS="foo bar" htmldocs
(or pdfdocs, latexdocs, epubdocs, ...)

Patch 13 is a cleanup patch: it simply gets rid of all those extra
conf.py files that aren't needed anymore. The only extra config
file after it is this one:

Documentation/media/conf_nitpick.py

Which enables some extra optional Sphinx features.

Patch 14 adds Documentation/virtual to the main index.rst file
and adds a new *.rst file that was orphaned there.

-

After this series, there's just one more patch meant to be applied
for 5.3, which is still waiting for some patches to be merged from
linux-next:


https://git.linuxtv.org/mchehab/experimental.git/commit/?id=b1b5dc7d7bbfbbfdace2a248c6458301c6e34100


Mauro Carvalho Chehab (14):
  docs: powerpc: convert docs to ReST and rename to *.rst
  docs: power: add it to to the main documentation index
  docs: fix broken doc references due to renames
  docs: pdf: add all Documentation/*/index.rst to PDF output
  docs: conf.py: add CJK package needed by translations
  docs: conf.py: only use CJK if the font is available
  scripts/sphinx-pre-install: fix script for RHEL/CentOS
  scripts/sphinx-pre-install: don't use LaTeX with CentOS 7
  scripts/sphinx-pre-install: fix latexmk dependencies
  scripts/sphinx-pre-install: cleanup Gentoo checks
  scripts/sphinx-pre-install: seek for Noto CJK fonts for pdf output
  docs: load_config.py: avoid needing a conf.py just due to LaTeX docs
  docs: remove extra conf.py files
  docs: virtual: add it to the documentation body

 Documentation/PCI/pci-error-recovery.rst  |   5 +-
 Documentation/RCU/rculist_nulls.txt   |   2 +-
 Documentation/admin-guide/conf.py |  10 --
 Documentation/conf.py |  30 +++-
 Documentation/core-api/conf.py|  10 --
 Documentation/crypto/conf.py  |  10 --
 Documentation/dev-tools/conf.py   |  10 --
 .../devicetree/bindings/arm/idle-states.txt   |   2 +-
 Documentation/doc-guide/conf.py   |  10 --
 Documentation/driver-api/80211/conf.py|  10 --
 Documentation/driver-api/conf.py  |  10 --
 Documentation/driver-api/pm/conf.py   |  10 --
 Documentation/filesystems/conf.py |  10 --
 Documentation/gpu/conf.py |  10 --
 Documentation/index.rst   |   3 +
 Documentation/input/conf.py   |  10 --
 Documentation/kernel-hacking/conf.py  |  10 --
 Documentation/locking/spinlocks.rst   |   4 +-
 Documentation/maintainer/conf.py  |  10 --
 Documentation/media/conf.py   |  12 --
 Documentation/memory-barriers.txt |   2 +-
 Documentation/networking/conf.py  |  10 --
 Documentation/power/index.rst |   2 +-
 .../{bootwrapper.txt => bootwrapper.rst}  |  28 +++-
 .../{cpu_families.txt => cpu_families.rst}|  23 +--
 .../{cpu_features.txt => cpu_features.rst}|   6 +-
 Documentation/powerpc/{cxl.txt => cxl.rst}|  46 --
 .../powerpc/{cxlflash.txt => cxlflash.rst}|  10 +-
 .../{DAWR-POWER9.txt => dawr-power9.rst}  |  15 +-
 Documentation/powerpc/{dscr.txt => dscr.rst}  |  18 +-
 ...ecovery.txt => eeh-pci-error-recovery.rst} | 108 ++--
 ...ed-dump.txt => firmware-assisted-dump.rst} | 117 +++--
 Documentation/powerpc/{hvcs.txt => hvcs.rst}  | 108 ++--
 Documentation/powerpc/index.rst   |  34 
 Documentation/powerpc/isa-versions.rst|  15 +-
 .../powerpc/{mpc52xx.txt => mpc52xx.rst}  |  12 +-
 ...nv.txt => pci_iov_resource_on_powernv.rst} |  15 +-
 .../powerpc/{pmu-ebb.txt => pmu-ebb.rst}  |   1 +
 Documentation/powerpc/ptrace.rst  | 156 ++
 Documentation/powerpc/ptrace.txt  | 151

[PATCH 01/14] docs: powerpc: convert docs to ReST and rename to *.rst

2019-07-16 Thread Mauro Carvalho Chehab

Convert docs to ReST and add them to the arch-specific
book.

The conversion here was trivial, as almost every file there
was already using an elegant format close to ReST standard.

The changes were mostly to mark literal blocks and add a few
missing section title identifiers.

One note with regards to "--": on Sphinx, this can't be used
to identify a list, as it will format it badly. This can be
used, however, to identify a long hyphen - and "---" is an
even longer one.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab 
Acked-by: Andrew Donnellan  # cxl
---
 Documentation/PCI/pci-error-recovery.rst  |   5 +-
 Documentation/index.rst   |   1 +
 .../{bootwrapper.txt => bootwrapper.rst}  |  28 +++-
 .../{cpu_families.txt => cpu_families.rst}|  23 +--
 .../{cpu_features.txt => cpu_features.rst}|   6 +-
 Documentation/powerpc/{cxl.txt => cxl.rst}|  46 --
 .../powerpc/{cxlflash.txt => cxlflash.rst}|  10 +-
 .../{DAWR-POWER9.txt => dawr-power9.rst}  |  15 +-
 Documentation/powerpc/{dscr.txt => dscr.rst}  |  18 +-
 ...ecovery.txt => eeh-pci-error-recovery.rst} | 108 ++--
 ...ed-dump.txt => firmware-assisted-dump.rst} | 117 +++--
 Documentation/powerpc/{hvcs.txt => hvcs.rst}  | 108 ++--
 Documentation/powerpc/index.rst   |  34 
 Documentation/powerpc/isa-versions.rst|  15 +-
 .../powerpc/{mpc52xx.txt => mpc52xx.rst}  |  12 +-
 ...nv.txt => pci_iov_resource_on_powernv.rst} |  15 +-
 .../powerpc/{pmu-ebb.txt => pmu-ebb.rst}  |   1 +
 Documentation/powerpc/ptrace.rst  | 156 ++
 Documentation/powerpc/ptrace.txt  | 151 -
 .../{qe_firmware.txt => qe_firmware.rst}  |  37 +++--
 .../{syscall64-abi.txt => syscall64-abi.rst}  |  29 ++--
 ...al_memory.txt => transactional_memory.rst} |  45 ++---
 MAINTAINERS   |   6 +-
 arch/powerpc/kernel/exceptions-64s.S  |   2 +-
 drivers/soc/fsl/qe/qe.c   |   2 +-
 drivers/tty/hvc/hvcs.c|   2 +-
 include/soc/fsl/qe/qe.h   |   2 +-
 27 files changed, 567 insertions(+), 427 deletions(-)
 rename Documentation/powerpc/{bootwrapper.txt => bootwrapper.rst} (93%)
 rename Documentation/powerpc/{cpu_families.txt => cpu_families.rst} (95%)
 rename Documentation/powerpc/{cpu_features.txt => cpu_features.rst} (97%)
 rename Documentation/powerpc/{cxl.txt => cxl.rst} (95%)
 rename Documentation/powerpc/{cxlflash.txt => cxlflash.rst} (98%)
 rename Documentation/powerpc/{DAWR-POWER9.txt => dawr-power9.rst} (95%)
 rename Documentation/powerpc/{dscr.txt => dscr.rst} (91%)
 rename Documentation/powerpc/{eeh-pci-error-recovery.txt => 
eeh-pci-error-recovery.rst} (82%)
 rename Documentation/powerpc/{firmware-assisted-dump.txt => 
firmware-assisted-dump.rst} (80%)
 rename Documentation/powerpc/{hvcs.txt => hvcs.rst} (91%)
 create mode 100644 Documentation/powerpc/index.rst
 rename Documentation/powerpc/{mpc52xx.txt => mpc52xx.rst} (91%)
 rename Documentation/powerpc/{pci_iov_resource_on_powernv.txt => 
pci_iov_resource_on_powernv.rst} (97%)
 rename Documentation/powerpc/{pmu-ebb.txt => pmu-ebb.rst} (99%)
 create mode 100644 Documentation/powerpc/ptrace.rst
 delete mode 100644 Documentation/powerpc/ptrace.txt
 rename Documentation/powerpc/{qe_firmware.txt => qe_firmware.rst} (95%)
 rename Documentation/powerpc/{syscall64-abi.txt => syscall64-abi.rst} (82%)
 rename Documentation/powerpc/{transactional_memory.txt => 
transactional_memory.rst} (93%)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 83db42092935..e5d450df06b4 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -403,7 +403,7 @@ That is, the recovery API only requires that:
 .. note::
 
Implementation details for the powerpc platform are discussed in
-   the file Documentation/powerpc/eeh-pci-error-recovery.txt
+   the file Documentation/powerpc/eeh-pci-error-recovery.rst
 
As of this writing, there is a growing list of device drivers with
patches implementing error recovery. Not all of these patches are in
@@ -422,3 +422,6 @@ That is, the recovery API only requires that:
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge
+
+The End
+---
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 70ae148ec980..3fe6170aa41d 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -143,6 +143,7 @@ implementation.
arm64/index
ia64/index
m68k/index
+   powerpc/index
riscv/index
s390/index
sh/index
diff --git a/Documentation/powerpc/bootwrapper.txt 
b/Documentation/powerpc/bootwrapper.rst
similarity index 93%
rename from Documentation/powerpc/bootwrapper.txt
rename to Documentation/p

Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-16 Thread Michael Ellerman

Segher Boessenkool  writes:
> On Mon, Jul 15, 2019 at 05:05:34PM +1000, Michael Ellerman wrote:
>> Segher Boessenkool  writes:
>> > Yes, that is why I used the environment variable, all binutils work
>> > with that.  There was no --target option in GNU ar before 2.22.
>> 
>> Yeah, we're not very good at testing with really old binutils, so I
>> guess we broke that.
>> 
>> I'm inclined to merge this, it doesn't seem to break anything, and it
>> fixes using --target on old binutils that don't have it.
>
> But we don't set the target any other way either.  I don't think this
> will work with a 32-bit toolchain (default target 32 bit) and a 64-bit
> kernel, or the other way around.

I think it does, but maybe I'm misunderstanding.

My test setup is:

  ~/linux$ export PATH=/home/toolchains/ppc/gcc-8-branch/powerpc-linux/bin/:$PATH
  ~/linux$ echo "int test(void) { return 2; }" > test.c
  ~/linux$ powerpc-linux-gcc -c test.c
  ~/linux$ file test.o
  test.o: ELF 32-bit MSB relocatable, PowerPC or cisco 4500, version 1 (SYSV), not stripped
  ~/linux$ make CROSS_COMPILE=powerpc-linux- -s ppc64le_defconfig
  ~/linux$ make CROSS_COMPILE=powerpc-linux- -s -j 320
  ~/linux$ echo $?
  0

And it's definitely calling ar with no flags, eg:

  rm -f init/built-in.a; powerpc-linux-ar rcSTPD init/built-in.a init/main.o init/version.o init/do_mounts.o init/do_mounts_rd.o init/do_mounts_initrd.o init/do_mounts_md.o init/initramfs.o init/init_task.o

So presumably at some point ar learnt to cope with objects that don't
match its default? (how do I ask it what its default is?)

> Then again, does that work at *all* nowadays?  Do we even consider that
> important, *should* it work?

Yes and yes. There were a lot of bugs in the kernel makefiles after we
added LE support which prevented a biarch/biendian compiler from working.
But now it does work and we want it to keep working because it means you
can have a single compiler for building 32-bit, 64-bit BE & 64-bit LE.

cheers

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Christian Brauner

On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> I think Vasily already has a clone3 patch for s390x with 435. 

A quick follow-up on this. Helge and Michael have asked whether there
are any tests for clone3. Yes, there will be and I try to have them
ready by the end of this or next week for review. In the meantime I
hope the following minimalistic test program that just verifies very
very basic functionality (It's not pretty.) will help you test:

#define _GNU_SOURCE
/* Header names were stripped by the archive; these are the ones the code below needs. */
#include <errno.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_* flags */
#include <linux/types.h>	/* __u64 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x1000
#endif

#ifndef __NR_clone3
#define __NR_clone3 -1
#endif

static pid_t sys_clone3(struct clone_args *args)
{
	return syscall(__NR_clone3, args, sizeof(struct clone_args));
}

/* Reap @pid, retrying on EINTR; return 0 only if it exited with status 0. */
static int wait_for_pid(pid_t pid)
{
	int status, ret;

again:
	ret = waitpid(pid, &status, 0);
	if (ret == -1) {
		if (errno == EINTR)
			goto again;

		return -1;
	}

	if (ret != pid)
		goto again;

	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return 0;
}

#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))

int main(int argc, char *argv[])
{
	int pidfd = -1;
	pid_t parent_tid = -1, pid = -1;
	struct clone_args args = {0};

	args.parent_tid = ptr_to_u64(&parent_tid); /* CLONE_PARENT_SETTID */
	args.pidfd = ptr_to_u64(&pidfd); /* CLONE_PIDFD */
	args.flags = CLONE_PIDFD | CLONE_PARENT_SETTID;
	args.exit_signal = SIGCHLD;

	pid = sys_clone3(&args);
	if (pid < 0) {
		fprintf(stderr, "%s - Failed to create new process\n",
			strerror(errno));
		exit(EXIT_FAILURE);
	}

	if (pid == 0) {
		printf("Child process with pid %d\n", getpid());
		exit(EXIT_SUCCESS);
	}

	printf("Parent process received child's pid %d as return value\n", pid);
	printf("Parent process received child's pidfd %d\n",
	       *(int *)args.pidfd);
	printf("Parent process received child's pid %d as return argument\n",
	       *(pid_t *)args.parent_tid);

	if (wait_for_pid(pid))
		exit(EXIT_FAILURE);

	if (pid != *(pid_t *)args.parent_tid)
		exit(EXIT_FAILURE);

	close(pidfd);

	return 0;
}

Re: [PATCH 02/12] Documentation/arm: repointer docs to Documentation/arch/arm

2019-07-16 Thread Krzysztof Kozlowski

On Fri, 12 Jul 2019 at 04:20, Alex Shi  wrote:
>
> Since we move 'arm/arm64' docs to Documentation/arch/{arm,arm64} dir,
> redirect the doc pointer to them.
>
> Signed-off-by: Alex Shi 
> Cc: Jonathan Corbet 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: linux-...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linux-cry...@vger.kernel.org
> Cc: linux-in...@vger.kernel.org
> Cc: linux-ser...@vger.kernel.org
> ---
>  Documentation/arch/arm/Samsung-S3C24XX/GPIO.txt|  2 +-
>  .../arch/arm/Samsung-S3C24XX/Overview.txt  |  6 +++---
>  Documentation/arch/arm/Samsung/GPIO.txt|  2 +-
>  Documentation/arch/arm/Samsung/Overview.txt|  4 ++--
>  Documentation/devicetree/bindings/arm/xen.txt  |  2 +-
>  Documentation/devicetree/booting-without-of.txt|  4 ++--
>  Documentation/translations/zh_CN/arm/Booting   |  4 ++--
>  .../translations/zh_CN/arm/kernel_user_helpers.txt |  4 ++--
>  MAINTAINERS|  6 +++---

I assume it will go through doc tree, so for Samsung:
Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof

Non deterministic kernel crashes after minimal devicetree changes.

2019-07-16 Thread Maik Nassauer

Dear everyone,

we are currently developing a kernel upgrade for older hardware. The
system shall be upgraded from kernel 2.6.24 to the current stable
vanilla kernel (4.19).

With our new kernel we are facing strange, non-deterministic kernel
crashes which occur more or less randomly when modifying our devicetree
(even small changes may lead to crashes).

The setup:
- CPU Platform: MPC5121 on a custom board, somewhat similar to ADS5121
eval board

- Bootloader: u-boot 1.3.2
CPU:   MPC5121e rev. 2.0, Core e300c4 at 400 MHz, CSB at 200 MHz
Board: CCS5121
DRAM:  256 MB
FLASH: 32 MB
In:serial
Out:   serial
Err:   serial
I2C:   PMC KEY
ETH:   (eeprom) 00:30:d6:00:00:00
Net:   FEC ETHERNET

- Vanilla Kernel: 4.19 (based on git commit 84df9525) with custom
modifications, mostly devicetree and some drivers and board setup.

Kernel command line: root=/dev/nfs rw
nfsroot=192.168.2.85:/srv/nfs_rootfs,v3,tcp video=fslfb:800x480-32@68
ip=192.168.2.230:192.168.2.85:192.168.2.254:255.255.255.0:dhcp28.kc.loc
:eth0:off panic=1 console=ttyPSC0,115200 no_console_suspend
video=fslfb:800x480-32@68


We are currently building the kernel and devicetree using a PowerPC
cross toolchain:

powerpc-linux-gnu-gcc 9.1.0-1
https://aur.archlinux.org/packages/powerpc-linux-gnu-gcc/



In the u-boot code, we changed CFG_BOOTMAPSZ from 8 to 64 MB, because
the 4.19 kernel is way bigger than the old (2.6.x) one that we
previously booted on our system.

However the CFG_BOOTMAPSZ setting does not seem to have any influence
on the problem itself.

We are also padding the devicetree with --space 131072, so that the
actual size of the (padded) device tree binary may not have any impact
on the kernel crashes.

It looks like these crashes may be caused by an alignment error or
something similar: when we e.g. add `a;` to node `usb@4000` it will
boot, but when we add an additional line like `b;` the kernel crashes.
It also matters where we put these lines. We can't put the a/b lines
at the top of the device tree, because that also causes a crash,
even with just a single `a;`. The crashes also change, or may even
disappear, if I add more lines.
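
For reference, a minimal sketch of how much a single added property grows the flattened blob. It assumes the standard flattened devicetree layout: each property in the structure block is an FDT_PROP token, a length word and a name-offset word, followed by the value padded to 4 bytes, and a property name that is not yet known is appended to the strings block with its terminating NUL. The helper names are made up for illustration and are not part of dtc or libfdt.

```c
#include <stddef.h>
#include <string.h>

/* Round n up to the 4-byte alignment used inside the structure block. */
static size_t fdt_align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/*
 * Bytes one property adds to the structure block:
 * FDT_PROP token (4) + len (4) + nameoff (4) + value padded to 4 bytes.
 */
static size_t fdt_prop_struct_bytes(size_t value_len)
{
	return 4 + 4 + 4 + fdt_align4(value_len);
}

/*
 * Bytes a new property name adds to the strings block, assuming the
 * name is not already present there (names are stored once).
 */
static size_t fdt_prop_string_bytes(const char *name)
{
	return strlen(name) + 1;
}
```

Under these assumptions an empty property such as `a;` costs 12 bytes in the structure block and 2 bytes in the strings block, so everything after the insertion point moves even though the padded blob keeps its overall size.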



Here is an example of what we changed:

Original:

/* USB0 using internal UTMI PHY */
usb@4000 {
dr_mode = "otg";
fsl,invert-drvvbus;
fsl,invert-pwr-fault;
ccs5121-front-and-back-port;
ccs5121-otg-switch;
};

Modified, crashes:

/* USB0 using internal UTMI PHY */
usb@4000 {
a;	// "nonsense" properties, but these lines cause the crash.
b;
dr_mode = "otg";
fsl,invert-drvvbus;
fsl,invert-pwr-fault;
ccs5121-front-and-back-port;
ccs5121-otg-switch;
};

The actual node where we apply these changes does not matter, and
a and b are just examples: adding anything, even "real" properties,
may lead to crashes.

It is also not certain that exactly two lines will cause the crash.
Sometimes a single line with a longer property name, or several
added lines, leads to a crash. Removing nodes or properties may
also lead to crashes.

In other words: modifying the devicetree in any way may lead to
crashes.

If we boot multiple times, we may even get different crash reports...

I hope this, in conjunction with the attached logs, is detailed enough
to illustrate the problem. Does anyone of you have any idea what
exactly might cause this or how to debug this further?

A full bootlog of a _working_ boot is attached at the end of this mail.


Thanks and best regards,

Maik Nassauer





Attachments:

Some crashes:
=


Faulting instruction address: 0x
Oops: Kernel access of bad area, sig: 11 [#1]
BE MPC5121 CCS 0
Modules linked in:
CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.19.0-00023-g1077d91e4c12-
dirty #2
NIP:   LR:  CTR: c005cf30
REGS: cf837e50 TRAP: 0400   Not tainted  (4.19.0-00023-g1077d91e4c12-
dirty)
MSR:  20009032   CR: 22000844  XER: 

GPR00: c00248d8 cf837f00 cf822aa0 c0040430 0002 0005 
 
GPR08: c005cf30   1032 42000842  0004
0100 
GPR16: cf836000 c07b55c4 c07b55c0 0001 0002 0004 04208040
 
GPR24: c07a c060e4cc c06b5334 000a fffb7619 c076cfa0 c0770d10
c06b4f60 
NIP []   (null)
LR []   (null)
Call Trace:
Instruction dump:
      
 
      
 
---[ end trace 9de0a50b44704278 ]---

Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..


-


Unrecoverable FP Unavailable Exception 801 at c005a6e8
Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
BE MPC5121 CCS 0
Modules linked in:
CPU: 0 PID: 430 Comm: kworker/u2:5 Not tainted 4.19.0-00023-
g1077d91e4c12-dirty #2
Workqueue: rpciod rpc

Re: [PATCH v3] tpm: tpm_ibm_vtpm: Fix unallocated banks

2019-07-16 Thread Michal Suchánek

On Fri, 12 Jul 2019 00:13:57 +0300
Jarkko Sakkinen  wrote:

> On Thu, Jul 11, 2019 at 11:28:24PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Jul 11, 2019 at 12:13:35PM -0400, Nayna Jain wrote:  
> > > The nr_allocated_banks and allocated banks are initialized as part of
> > > tpm_chip_register. Currently, this is done as part of auto startup
> > > function. However, some drivers, like the ibm vtpm driver, do not run
> > > auto startup during initialization. This results in uninitialized memory
> > > issue and causes a kernel panic during boot.
> > > 
> > > This patch moves the pcr allocation outside the auto startup function
> > > into tpm_chip_register. This ensures that allocated banks are initialized
> > > in any case.
> > > 
> > > Fixes: 879b589210a9 ("tpm: retrieve digest size of unknown algorithms with
> > > PCR read")
> > > Reported-by: Michal Suchanek 
> > > Signed-off-by: Nayna Jain 
> > > Reviewed-by: Mimi Zohar 
> > > Tested-by: Sachin Sant 
> > > Tested-by: Michal Suchánek   
> > 
> > Reviewed-by: Jarkko Sakkinen   
> 
> Thanks a lot! It is applied now.

Fixes the issue for me.

Thanks

Michal

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Sven Schnelle

Hi,

[Adding Helge to CC list]

On Tue, Jul 16, 2019 at 03:06:33PM +0200, Christian Brauner wrote:
> On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> > I think Vasily already has a clone3 patch for s390x with 435. 
> 
> A quick follow-up on this. Helge and Michael have asked whether there
> are any tests for clone3. Yes, there will be and I try to have them
> ready by the end of this or next week for review. In the meantime I
> hope the following minimalistic test program that just verifies very
> very basic functionality (It's not pretty.) will help you test:
> [..]

On PA-RISC this seems to work fine with Helge's patch to wire up the
clone3 syscall.

root@c3750:/# clonetest
Parent process received child's pid 84 as return value
Parent process received child's pidfd 3
Parent process received child's pid 84 as return argument
Child process with pid 84
root@c3750:/# echo $?
0

Regards
Sven

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Christian Brauner

On Tue, Jul 16, 2019 at 08:53:10PM +0200, Sven Schnelle wrote:
> Hi,
> 
> [Adding Helge to CC list]
> 
> On Tue, Jul 16, 2019 at 03:06:33PM +0200, Christian Brauner wrote:
> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> > > I think Vasily already has a clone3 patch for s390x with 435. 
> > 
> > A quick follow-up on this. Helge and Michael have asked whether there
> > are any tests for clone3. Yes, there will be and I try to have them
> > ready by the end of this or next week for review. In the meantime I
> > hope the following minimalistic test program that just verifies very
> > very basic functionality (It's not pretty.) will help you test:
> > [..]
> 
> On PA-RISC this seems to work fine with Helge's patch to wire up the
> clone3 syscall.

I think I already responded to Helge before and yes, I think that parisc
doesn't do anything special for fork, vfork, clone, and by extension
also probably doesn't need to for clone3.
It should only be a problem for arches that require mucking explicitly
with arguments of clone-like syscalls.
In any case, I saw Helge's patch and I think I might've missed to add an
Acked-by but feel free to add it.

Thanks for testing it and sorry that I couldn't test!
Christian

[PATCH] powerpc/64: mark __boot_from_prom and start_here_common as __ref

2019-07-16 Thread Desnes A. Nunes do Rosario

Functions `__boot_from_prom` and `start_here_common` are "init code" in
the sense that they are only executed at boot time, nevertheless they
should not be tagged as __init since this will carry them to a different
section located at the very end of kernel text. If the TOC is not set up,
the kernel may not be able to tolerate a branch trampoline to reach the
init function.

Thus, these functions should be marked as `__ref` and the assembler must
be reminded to insert the code that follows into the last active section
by the use of the `.previous` directive. This will allow the powerpc
kernel to be built with CONFIG_SECTION_MISMATCH_WARN_ONLY disabled and
quieten the following modpost warnings during compilation:

WARNING: vmlinux.o(.text+0x2ad4): Section mismatch in reference from the 
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2cd0): Section mismatch in reference from the 
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

Credits: code is based on commit <9c4e4c90ec24> ("powerpc/64: mark
start_here_multiplatform as __ref") and message is based on 2016 patch by
Nicholas Piggin: 
https://lore.kernel.org/linuxppc-dev/20161222131419.18062-1-npig...@gmail.com/

Signed-off-by: Desnes A. Nunes do Rosario 
---
 arch/powerpc/kernel/head_64.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 259be7f6d551..04b34397b656 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -540,6 +540,7 @@ __start_initialization_multiplatform:
b   __after_prom_start
 #endif /* CONFIG_PPC_BOOK3E */
 
+__REF
 __boot_from_prom:
 #ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE
/* Save parameters */
@@ -577,6 +578,7 @@ __boot_from_prom:
/* We never return. We also hit that trap if trying to boot
 * from OF while CONFIG_PPC_OF_BOOT_TRAMPOLINE isn't selected */
trap
+   .previous
 
 __after_prom_start:
 #ifdef CONFIG_RELOCATABLE
@@ -983,6 +985,7 @@ start_here_multiplatform:
.previous
/* This is where all platforms converge execution */
 
+__REF
 start_here_common:
/* relocation is on at this point */
std r1,PACAKSAVE(r13)
@@ -1003,6 +1006,7 @@ start_here_common:
 
/* Not reached */
BUG_OPCODE
+   .previous
 
 /*
  * We put a few things here that have to be page-aligned.
-- 
2.18.1

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Helge Deller


On 16.07.19 20:55, Christian Brauner wrote:
> On Tue, Jul 16, 2019 at 08:53:10PM +0200, Sven Schnelle wrote:
> > Hi,
> >
> > [Adding Helge to CC list]
> >
> > On Tue, Jul 16, 2019 at 03:06:33PM +0200, Christian Brauner wrote:
> > > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> > > > I think Vasily already has a clone3 patch for s390x with 435.
> > >
> > > A quick follow-up on this. Helge and Michael have asked whether there
> > > are any tests for clone3. Yes, there will be and I try to have them
> > > ready by the end of this or next week for review. In the meantime I
> > > hope the following minimalistic test program that just verifies very
> > > very basic functionality (It's not pretty.) will help you test:
> > > [..]
> >
> > On PA-RISC this seems to work fine with Helge's patch to wire up the
> > clone3 syscall.
>
> [...]
> In any case, I saw Helge's patch and I think I might've missed to add an
> Acked-by but feel free to add it.


Thanks!
I've added the patch to the parisc-linux for-next tree.

Helge

[PATCH v2 2/4] Add fchmodat4(), a new syscall

2019-07-16 Thread Palmer Dabbelt

man 3p says that fchmodat() takes a flags argument, but the Linux
syscall does not.  There doesn't appear to be a good userspace
workaround for this issue but the implementation in the kernel is pretty
straightforward.  The specific use case where the missing flags came up
was a fuse filesystem implementation, but the functionality is pretty
generic so I'm assuming there would be other use cases.

Signed-off-by: Palmer Dabbelt 
---
 fs/open.c| 20 
 include/linux/syscalls.h |  7 +--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index b5b80469b93d..2f72b4d6a2c1 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -569,11 +569,17 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
return ksys_fchmod(fd, mode);
 }
 
-int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
+int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int flags)
 {
struct path path;
int error;
-   unsigned int lookup_flags = LOOKUP_FOLLOW;
+   unsigned int lookup_flags;
+
+   if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
+   return -EINVAL;
+
+   lookup_flags = flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW;
+
 retry:
error = user_path_at(dfd, filename, lookup_flags, &path);
if (!error) {
@@ -587,15 +593,21 @@ int do_fchmodat(int dfd, const char __user *filename, 
umode_t mode)
return error;
 }
 
+SYSCALL_DEFINE4(fchmodat4, int, dfd, const char __user *, filename,
+   umode_t, mode, int, flags)
+{
+   return do_fchmodat4(dfd, filename, mode, flags);
+}
+
 SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
umode_t, mode)
 {
-   return do_fchmodat(dfd, filename, mode);
+   return do_fchmodat4(dfd, filename, mode, 0);
 }
 
 SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
 {
-   return do_fchmodat(AT_FDCWD, filename, mode);
+   return do_fchmodat4(AT_FDCWD, filename, mode, 0);
 }
 
 static int chown_common(const struct path *path, uid_t user, gid_t group)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e1c20f1d0525..a4bde25ad264 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -433,6 +433,8 @@ asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
 asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 umode_t mode);
+asmlinkage long sys_fchmodat4(int dfd, const char __user *filename,
+umode_t mode, int flags);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
 gid_t group, int flag);
 asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
@@ -1320,11 +1322,12 @@ static inline long ksys_link(const char __user *oldname,
return do_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
-extern int do_fchmodat(int dfd, const char __user *filename, umode_t mode);
+extern int do_fchmodat4(int dfd, const char __user *filename, umode_t mode,
+   int flags);
 
 static inline int ksys_chmod(const char __user *filename, umode_t mode)
 {
-   return do_fchmodat(AT_FDCWD, filename, mode);
+   return do_fchmodat4(AT_FDCWD, filename, mode, 0);
 }
 
 extern long do_faccessat(int dfd, const char __user *filename, int mode);
-- 
2.21.0

[PATCH v2 1/4] Non-functional cleanup of a "__user * filename"

2019-07-16 Thread Palmer Dabbelt

The next patch defines a very similar interface, which I copied from
this definition.  Since I'm touching it anyway I don't see any reason
not to just go fix this one up.

Signed-off-by: Palmer Dabbelt 
---
 include/linux/syscalls.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2bcef4c70183..e1c20f1d0525 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -431,7 +431,7 @@ asmlinkage long sys_chdir(const char __user *filename);
 asmlinkage long sys_fchdir(unsigned int fd);
 asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
-asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
+asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 umode_t mode);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
 gid_t group, int flag);
-- 
2.21.0

Add a new fchmodat4() syscall, v2

2019-07-16 Thread Palmer Dabbelt

This patch set adds fchmodat4(), a new syscall. The actual
implementation is super simple: essentially it's just the same as
fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
I've attempted to make this match "man 2 fchmodat" as closely as
possible, which says EINVAL is returned for invalid flags (as opposed to
ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
I have a sketch of a glibc patch that I haven't even compiled yet, but
seems fairly straight-forward:

diff --git a/sysdeps/unix/sysv/linux/fchmodat.c 
b/sysdeps/unix/sysv/linux/fchmodat.c
index 6d9cbc1ce9e0..b1beab76d56c 100644
--- a/sysdeps/unix/sysv/linux/fchmodat.c
+++ b/sysdeps/unix/sysv/linux/fchmodat.c
@@ -29,12 +29,36 @@
 int
 fchmodat (int fd, const char *file, mode_t mode, int flag)
 {
-  if (flag & ~AT_SYMLINK_NOFOLLOW)
-return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
-#ifndef __NR_lchmod/* Linux so far has no lchmod syscall.  
*/
+  /* There are four paths through this code:
+  - The flags are zero.  In this case it's fine to call fchmodat.
+  - The flags are non-zero and glibc doesn't have access to
+   __NR_fchmodat4.  In this case all we can do is emulate the error codes
+   defined by the glibc interface from userspace.
+  - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel 
has
+   fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
+   matches glibc's library interface so it can be called directly.
+  - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel 
does
+   not.  In this case we must respect the error codes defined by the glibc
+   interface instead of returning ENOSYS.
+The intent here is to ensure that the kernel is called at most once per
+library call, and that the error types defined by glibc are always
+respected.  */
+
+#ifdef __NR_fchmodat4
+  long result;
+#endif
+
+  if (flag == 0)
+return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+
+#ifdef __NR_fchmodat4
+  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
+  if (result == 0 || errno != ENOSYS)
+return result;
+#endif
+
   if (flag & AT_SYMLINK_NOFOLLOW)
 return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
-#endif

-  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+  return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
 }
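
The four paths listed in the comment above can be sketched as a small, pure decision function. This is only an illustration, not glibc code: `plan_fchmodat`, the enum and the struct are hypothetical names, `libc_has_nr` stands in for the compile-time `#ifdef __NR_fchmodat4`, and `kernel_has_syscall` stands in for what the real code learns at run time from an ENOSYS return.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

#ifndef AT_SYMLINK_NOFOLLOW
#define AT_SYMLINK_NOFOLLOW 0x100	/* Linux value, in case <fcntl.h> hides it */
#endif

/* Hypothetical names, for illustration only. */
enum fchmodat_action {
	DO_FCHMODAT,		/* plain three-argument syscall */
	DO_FCHMODAT4,		/* new syscall handles the flags itself */
	FAIL_WITH_ERRNO,	/* emulate glibc's error code in userspace */
};

struct fchmodat_plan {
	enum fchmodat_action action;
	int err;		/* only meaningful for FAIL_WITH_ERRNO */
};

static struct fchmodat_plan plan_fchmodat(int flag, bool libc_has_nr,
					   bool kernel_has_syscall)
{
	struct fchmodat_plan plan = { DO_FCHMODAT, 0 };

	/* Path 1: no flags, the old syscall is sufficient. */
	if (flag == 0)
		return plan;

	/* Path 3: both libc and kernel know fchmodat4. */
	if (libc_has_nr && kernel_has_syscall) {
		plan.action = DO_FCHMODAT4;
		return plan;
	}

	/* Paths 2 and 4: keep the error codes glibc already documents. */
	plan.action = FAIL_WITH_ERRNO;
	plan.err = (flag & AT_SYMLINK_NOFOLLOW) ? ENOTSUP : EINVAL;
	return plan;
}
```

In the sketched patch the "kernel lacks the syscall" case is discovered by actually issuing fchmodat4 once and seeing ENOSYS, which is why the kernel is still entered at most once per library call.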

I've never added a new syscall before so I'm not really sure what the
proper procedure to follow is.  Based on the feedback from my v1 patch
set it seems this is somewhat uncontroversial.  At this point I don't
think there's anything I'm missing, though note that I haven't gotten
around to testing it this time because the diff from v1 is trivial for
any platform I could reasonably test on.  The v1 patches suggest a
simple test case, but I didn't re-run it because I don't want to reboot
my laptop.

$ touch test-file
$ ln -s test-file test-link
$ cat > test.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long out;

	out = syscall(434, AT_FDCWD, "test-file", 0x888, AT_SYMLINK_NOFOLLOW);
	printf("fchmodat4(AT_FDCWD, \"test-file\", 0x888, AT_SYMLINK_NOFOLLOW): %ld\n", out);

	out = syscall(434, AT_FDCWD, "test-file", 0x888, 0);
	printf("fchmodat4(AT_FDCWD, \"test-file\", 0x888, 0): %ld\n", out);

	out = syscall(268, AT_FDCWD, "test-file", 0x888);
	printf("fchmodat(AT_FDCWD, \"test-file\", 0x888): %ld\n", out);

	out = syscall(434, AT_FDCWD, "test-link", 0x888, AT_SYMLINK_NOFOLLOW);
	printf("fchmodat4(AT_FDCWD, \"test-link\", 0x888, AT_SYMLINK_NOFOLLOW): %ld\n", out);

	out = syscall(434, AT_FDCWD, "test-link", 0x888, 0);
	printf("fchmodat4(AT_FDCWD, \"test-link\", 0x888, 0): %ld\n", out);

	out = syscall(268, AT_FDCWD, "test-link", 0x888);
	printf("fchmodat(AT_FDCWD, \"test-link\", 0x888): %ld\n", out);

	return 0;
}
$ gcc test.c -o test
$ ./test
fchmodat4(AT_FDCWD, "test-file", 0x888, AT_SYMLINK_NOFOLLOW): 0
fchmodat4(AT_FDCWD, "test-file", 0x888, 0): 0
fchmodat(AT_FDCWD, "test-file", 0x888): 0
fchmodat4(AT_FDCWD, "test-link", 0x888, AT_SYMLINK_NOFOLLOW): -1
fchmodat4(AT_FDCWD, "test-link", 0x888, 0): 0
fchmodat(AT_FDCWD, "test-link", 0x888): 0

I've only built this on 64-bit x86.

Changes since v1 [20190531191204.4044-1-pal...@sifive.com]:

* All architectures are now supported, with support squashed into a
  single patch.
* The do_fchmodat() helper function has been removed, in favor of directly
  calling do_fchmodat4().
* The patches are based on 5.2 instead of 5.1.
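
For readers skimming the series, the flag handling that do_fchmodat4() gains can be mirrored in user space roughly as follows. This is a sketch under assumptions: `sketch_fchmodat4_lookup_flags` and `SKETCH_LOOKUP_FOLLOW` are illustrative names, and the numeric value chosen for the kernel-internal LOOKUP_FOLLOW is a placeholder rather than something userspace can rely on.

```c
#include <errno.h>
#include <fcntl.h>

#ifndef AT_SYMLINK_NOFOLLOW
#define AT_SYMLINK_NOFOLLOW 0x100	/* Linux value, in case <fcntl.h> hides it */
#endif

/* Stand-in for the kernel-internal LOOKUP_FOLLOW; the value is an assumption. */
#define SKETCH_LOOKUP_FOLLOW 0x0001u

/*
 * Mirror of the check added to do_fchmodat4(): any flag other than
 * AT_SYMLINK_NOFOLLOW is rejected with -EINVAL, and symlinks are
 * followed unless AT_SYMLINK_NOFOLLOW is set.
 */
static int sketch_fchmodat4_lookup_flags(int flags, unsigned int *lookup_flags)
{
	if (flags & ~AT_SYMLINK_NOFOLLOW)
		return -EINVAL;

	*lookup_flags = (flags & AT_SYMLINK_NOFOLLOW) ? 0 : SKETCH_LOOKUP_FOLLOW;
	return 0;
}
```

Passing 0 keeps the behaviour of the existing fchmodat() and chmod() entry points, which is why both now call do_fchmodat4() with a flags argument of 0.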

[PATCH v2 3/4] arch: Register fchmodat4, usually as syscall 434

2019-07-16 Thread Palmer Dabbelt

This registers the new fchmodat4 syscall in most places as number 434,
with alpha being the exception where it's 544.  I found all these sites
by grepping for fspick, which I assume has found me everything.

Signed-off-by: Palmer Dabbelt 
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd32.h   | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 include/uapi/asm-generic/unistd.h   | 5 -
 17 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..6c4ef43c8b52 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541common  fsconfigsys_fsconfig
 542common  fsmount sys_fsmount
 543common  fspick  sys_fspick
+544common  fchmodat4   sys_fchmodat4
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..c008b76fbf92 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431common  fsconfigsys_fsconfig
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
+434common  fchmodat4   sys_fchmodat4
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index aa995920bd34..049471b468c1 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -875,6 +875,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_fchmodat4 434
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..d16e9801fe82 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431     common  fsconfig        sys_fsconfig
 432     common  fsmount         sys_fsmount
 433     common  fspick          sys_fspick
+434     common  fchmodat4       sys_fchmodat4
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..1bbff1a9153c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431     common  fsconfig        sys_fsconfig
 432     common  fsmount         sys_fsmount
 433     common  fspick          sys_fspick
+434     common  fchmodat4       sys_fchmodat4
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl 
b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..3ed878cb10a3 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 431     common  fsconfig        sys_fsconfig
 432     common  fsmount         sys_fsmount
 433     common  fspick          sys_fspick
+434     common  fchmodat4       sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..916cdb808e62 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
 431     n32     fsconfig        sys_fsconfig
 432     n32     fsmount         sys_fsmount
 433     n32     fspick          sys_fspick
+434     n32     fchmodat4       sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl 
b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..48b4badb1914 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
 431     n64     fsconfig        sys_fsconfig
 432     n64     fsmount         sys_fsmount
 433     n64     fspick          sys_fspick
+43

[PATCH v2 4/4] tools: Add fchmodat4

2019-07-16 Thread Palmer Dabbelt

I'm not sure why it's necessary to add this explicitly to tools/ as well
as arch/, but there were a few instances of fspick in here so I blindly
added fchmodat4 in the same fashion.

Signed-off-by: Palmer Dabbelt 
---
 tools/include/uapi/asm-generic/unistd.h   | 4 +++-
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h 
b/tools/include/uapi/asm-generic/unistd.h
index a87904daf103..36232ea94956 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_fchmodat4 434
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl 
b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..b92d5b195e66 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431     common  fsconfig        __x64_sys_fsconfig
 432     common  fsmount         __x64_sys_fsmount
 433     common  fspick          __x64_sys_fspick
+434     common  fchmodat4       __x64_sys_fchmodat4
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.21.0

Re: [PATCH v2 2/4] Add fchmodat4(), a new syscall

2019-07-16 Thread Al Viro

On Tue, Jul 16, 2019 at 06:27:17PM -0700, Palmer Dabbelt wrote:

> -int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> +int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int 
> flags)
>  {
>   struct path path;
>   int error;
> - unsigned int lookup_flags = LOOKUP_FOLLOW;
> + unsigned int lookup_flags;
> +
> + if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
> + return -EINVAL;
> +
> + lookup_flags = flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW;
> +

Why not do that in sys_fchmodat4() itself, passing lookup_flags to
do_fchmodat() and updating old callers to pass it 0 as extra argument?

Re: [PATCH v2 2/4] Add fchmodat4(), a new syscall

2019-07-16 Thread Palmer Dabbelt


On Tue, 16 Jul 2019 18:48:02 PDT (-0700), v...@zeniv.linux.org.uk wrote:
> On Tue, Jul 16, 2019 at 06:27:17PM -0700, Palmer Dabbelt wrote:
>
> > -int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> > +int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int flags)
> >  {
> >          struct path path;
> >          int error;
> > -        unsigned int lookup_flags = LOOKUP_FOLLOW;
> > +        unsigned int lookup_flags;
> > +
> > +        if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
> > +                return -EINVAL;
> > +
> > +        lookup_flags = flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW;
> > +
>
> Why not do that in sys_fchmodat4() itself, passing lookup_flags to
> do_fchmodat() and updating old callers to pass it 0 as extra argument?

Ya, that seems better -- passing LOOKUP_FOLLOW instead of 0, to keep the
behavior the same.  That way I could avoid the overhead of these checks for the
old syscalls, as we know they're not necessary.

I'll replace this patch with the following for a v3

   diff --git a/fs/open.c b/fs/open.c
   index b5b80469b93d..a5f99408af11 100644
   --- a/fs/open.c
   +++ b/fs/open.c
   @@ -569,11 +569,12 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
            return ksys_fchmod(fd, mode);
    }
    
   -int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
   +int do_fchmodat4(int dfd, const char __user *filename, umode_t mode,
   +                 int lookup_flags)
    {
            struct path path;
            int error;
   -        unsigned int lookup_flags = LOOKUP_FOLLOW;
   +
    retry:
            error = user_path_at(dfd, filename, lookup_flags, &path);
            if (!error) {
   @@ -587,15 +588,28 @@ int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
            return error;
    }
    
   +SYSCALL_DEFINE4(fchmodat4, int, dfd, const char __user *, filename,
   +                umode_t, mode, int, flags)
   +{
   +        unsigned int lookup_flags;
   +
   +        if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
   +                return -EINVAL;
   +
   +        lookup_flags = flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW;
   +
   +        return do_fchmodat4(dfd, filename, mode, lookup_flags);
   +}
   +
    SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
                    umode_t, mode)
    {
   -        return do_fchmodat(dfd, filename, mode);
   +        return do_fchmodat4(dfd, filename, mode, LOOKUP_FOLLOW);
    }
    
    SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
    {
   -        return do_fchmodat(AT_FDCWD, filename, mode);
   +        return do_fchmodat4(AT_FDCWD, filename, mode, LOOKUP_FOLLOW);
    }
    
    static int chown_common(const struct path *path, uid_t user, gid_t group)

   diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
   index e1c20f1d0525..6676b1cc5485 100644
   --- a/include/linux/syscalls.h
   +++ b/include/linux/syscalls.h
   @@ -81,6 +81,7 @@ struct io_uring_params;
    #include 
    #include 
    #include 
   +#include 
    #include 
    
    #ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
   @@ -433,6 +434,8 @@ asmlinkage long sys_chroot(const char __user *filename);
    asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
    asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
                                 umode_t mode);
   +asmlinkage long sys_fchmodat4(int dfd, const char __user *filename,
   +                              umode_t mode, int flags);
    asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
                                 gid_t group, int flag);
    asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
   @@ -1320,11 +1323,12 @@ static inline long ksys_link(const char __user *oldname,
            return do_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
    }
    
   -extern int do_fchmodat(int dfd, const char __user *filename, umode_t mode);
   +extern int do_fchmodat4(int dfd, const char __user *filename, umode_t mode,
   +                        int flags);
    
    static inline int ksys_chmod(const char __user *filename, umode_t mode)
    {
   -        return do_fchmodat(AT_FDCWD, filename, mode);
   +        return do_fchmodat4(AT_FDCWD, filename, mode, LOOKUP_FOLLOW);
    }
    
    extern long do_faccessat(int dfd, const char __user *filename, int mode);

Re: [PATCH v2 2/4] Add fchmodat4(), a new syscall

2019-07-16 Thread Rich Felker

On Tue, Jul 16, 2019 at 06:27:17PM -0700, Palmer Dabbelt wrote:
> man 3p says that fchmodat() takes a flags argument, but the Linux
> syscall does not.  There doesn't appear to be a good userspace
> workaround for this issue but the implementation in the kernel is pretty
> straight-forward.  The specific use case where the missing flags came up
> was WRT a fuse filesystem implementation, but the functionality is pretty
> generic so I'm assuming there would be other use cases.

Note that we do have a workaround in musl libc with O_PATH and
/proc/self/fd, but a syscall that allows a proper fix with the ugly
workaround only in the fallback path for old kernels will be much
appreciated!

What about also doing a new SYS_faccessat4 with working AT_EACCESS
flag? The workaround we have to do for it is far worse.

Rich
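
The kind of workaround mentioned above can be approximated as follows. This is a rough sketch of the general O_PATH plus /proc/self/fd technique, not musl's actual code. `sketch_fchmodat_nofollow()` is a hypothetical name, and the sketch assumes Linux with /proc mounted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for O_PATH and AT_EMPTY_PATH */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: change the mode of dirfd/path without following a
 * trailing symlink. Returns 0 on success, -1 with errno set on failure.
 */
int sketch_fchmodat_nofollow(int dirfd, const char *path, mode_t mode)
{
    char procpath[64];
    struct stat st;
    int fd, ret, saved_errno;

    /* O_PATH | O_NOFOLLOW gives a handle on the final component itself. */
    fd = openat(dirfd, path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Refuse symlinks: their mode cannot be changed on Linux. */
    ret = fstatat(fd, "", &st, AT_EMPTY_PATH);
    if (ret == 0 && S_ISLNK(st.st_mode)) {
        errno = EOPNOTSUPP;
        ret = -1;
    }

    if (ret == 0) {
        /* chmod() resolves the /proc magic link to the object behind fd. */
        snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
        ret = chmod(procpath, mode);
    }

    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

The dependence on /proc is the main weakness of this approach, which is why it is better kept as a fallback path for kernels that lack a flags-aware syscall.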

Re: [PATCH v2 2/4] Add fchmodat4(), a new syscall

2019-07-16 Thread Al Viro

On Tue, Jul 16, 2019 at 10:40:46PM -0400, Rich Felker wrote:
> On Tue, Jul 16, 2019 at 06:27:17PM -0700, Palmer Dabbelt wrote:
> > man 3p says that fchmodat() takes a flags argument, but the Linux
> > syscall does not.  There doesn't appear to be a good userspace
> > workaround for this issue but the implementation in the kernel is pretty
> > straight-forward.  The specific use case where the missing flags came up
> > was WRT a fuse filesystem implementation, but the functionality is pretty
> > generic so I'm assuming there would be other use cases.
> 
> Note that we do have a workaround in musl libc with O_PATH and
> /proc/self/fd, but a syscall that allows a proper fix with the ugly
> workaround only in the fallback path for old kernels will be much
> appreciated!
> 
> What about also doing a new SYS_faccessat4 with working AT_EACCESS
> flag? The workaround we have to do for it is far worse.

Umm...  That's doable, but getting into the "don't switch creds unless
needed" territory.  I'll need to play with that a bit and see what
gives a tolerable variant...

What of this part wrt AT_EACCESS?
        if (!issecure(SECURE_NO_SETUID_FIXUP)) {
                /* Clear the capabilities if we switch to a non-root user */
                kuid_t root_uid = make_kuid(override_cred->user_ns, 0);
                if (!uid_eq(override_cred->uid, root_uid))
                        cap_clear(override_cred->cap_effective);
                else
                        override_cred->cap_effective =
                                override_cred->cap_permitted;
        }

[PATCH v7] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-16 Thread Wen Yang

The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f6 ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
v7: adapt to commit ("cpufreq: Make cpufreq_generic_init() return void")
v6: keep the blank line and fix warning: label 'out_unmap_sdcpwr' defined but 
not used.
v5: put together the code to get, use, and release cpu device_node.
v4: restore the blank line.
v3: fix a leaked reference.
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 93f39a1..c66f566 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -131,10 +131,18 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
         int err = -ENODEV;
 
         cpu = of_get_cpu_node(policy->cpu, NULL);
+        if (!cpu)
+                goto out;
 
+        max_freqp = of_get_property(cpu, "clock-frequency", NULL);
         of_node_put(cpu);
-        if (!cpu)
+        if (!max_freqp) {
+                err = -EINVAL;
                 goto out;
+        }
+
+        /* we need the freq in kHz */
+        max_freq = *max_freqp / 1000;
 
         dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
         if (!dn)
@@ -171,16 +179,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
         }
 
         pr_debug("init cpufreq on CPU %d\n", policy->cpu);
-
-        max_freqp = of_get_property(cpu, "clock-frequency", NULL);
-        if (!max_freqp) {
-                err = -EINVAL;
-                goto out_unmap_sdcpwr;
-        }
-
-        /* we need the freq in kHz */
-        max_freq = *max_freqp / 1000;
-
         pr_debug("max clock-frequency is at %u kHz\n", max_freq);
         pr_debug("initializing frequency table\n");
 
@@ -199,9 +197,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
         cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
         return 0;
 
-out_unmap_sdcpwr:
-        iounmap(sdcpwr_mapbase);
-
 out_unmap_sdcasr:
         iounmap(sdcasr_mapbase);
 out:
-- 
2.9.5

Re: [PATCH v7] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-16 Thread Viresh Kumar

On 17-07-19, 11:55, Wen Yang wrote:
> The cpu variable is still being used in the of_get_property() call
> after the of_node_put() call, which may result in use-after-free.
> 
> Fixes: a9acc26b75f6 ("cpufreq/pasemi: fix possible object reference leak")
> Signed-off-by: Wen Yang 
> Cc: "Rafael J. Wysocki" 
> Cc: Viresh Kumar 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
> v7: adapt to commit ("cpufreq: Make cpufreq_generic_init() return void")
> v6: keep the blank line and fix warning: label 'out_unmap_sdcpwr' defined but 
> not used.
> v5: put together the code to get, use, and release cpu device_node.
> v4: restore the blank line.
> v3: fix a leaked reference.
> v2: clean up the code according to the advice of viresh.
> 
>  drivers/cpufreq/pasemi-cpufreq.c | 23 +++++++++--------------
>  1 file changed, 9 insertions(+), 14 deletions(-)

Acked-by: Viresh Kumar 

-- 
viresh

Re: [PATCH kernel v3] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Benjamin Herrenschmidt

On Wed, 2019-07-17 at 15:00 +1000, Alexey Kardashevskiy wrote:
> There is a race between releasing an irq on one cpu and fetching it
> from XIVE on another cpu as there does not seem to be any locking between
> these, probably because xive_irq_chip::irq_shutdown() is supposed to
> remove the irq from all queues in the system which it does not do.
> 
> As a result, when such released irq appears in a queue, we take it
> from the queue but we do not change the current priority on that cpu and
> since there is no handler for the irq, EOI is never called and the cpu
> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
> is assigned to the same cpu, then that device stops working until irq
> is moved to another cpu or the device is reset.
> 
> This adds a new ppc_md.orphan_irq callback which is called if no irq
> descriptor is found. The XIVE implementation drops the current priority
> to 0xff which effectively unmasks interrupts in a current CPU.

Better.

Now, you should probably add orphan_irq as a separate patch, and it
wouldn't hurt to make other PICs like XICS also provide it :-) They are
less likely to hit due to the absence of queuing but I suppose the
theoretical race exists.

Cheers,
Ben.

> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Cédric Le Goater 
> ---
> Changes:
> v3:
> * added a comment above xive_orphan_irq()
> 
> v2:
> * added ppc_md.orphan_irq
> ---
>  arch/powerpc/include/asm/machdep.h |  3 +++
>  arch/powerpc/kernel/irq.c          |  9 ++++++---
>  arch/powerpc/sysdev/xive/common.c  | 18 ++++++++++++++++++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h 
> b/arch/powerpc/include/asm/machdep.h
> index c43d6eca9edd..6cc14e28e89a 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -59,6 +59,9 @@ struct machdep_calls {
>   /* Return an irq, or 0 to indicate there are none pending. */
>   unsigned int(*get_irq)(void);
>  
> + /* Drops irq if it does not have a valid descriptor */
> + void(*orphan_irq)(unsigned int irq);
> +
>   /* PCI stuff */
>   /* Called after allocating resources */
>   void(*pcibios_fixup)(void);
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bc68c53af67c..b4e06d05bdba 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>   may_hard_irq_enable();
>  
>   /* And finally process it */
> - if (unlikely(!irq))
> + if (unlikely(!irq)) {
>   __this_cpu_inc(irq_stat.spurious_irqs);
> - else
> - generic_handle_irq(irq);
> + } else if (generic_handle_irq(irq)) {
> + if (ppc_md.orphan_irq)
> + ppc_md.orphan_irq(irq);
> + __this_cpu_inc(irq_stat.spurious_irqs);
> + }
>  
>   trace_irq_exit(regs);
>  
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..17e696b2d71b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,6 +283,23 @@ static unsigned int xive_get_irq(void)
>   return irq;
>  }
>  
> +/*
> + * Handles the case when a target CPU catches an interrupt which is being 
> shut
> + * down on another CPU. generic_handle_irq() returns an error in such case
> + * and then the orphan_irq() handler restores the CPPR to reenable 
> interrupts.
> + *
> + * Without orphan_irq() and valid irq_desc, there is no other way to restore
> + * the CPPR. This executes on a CPU which caught the interrupt.
> + */
> +static void xive_orphan_irq(unsigned int irq)
> +{
> + struct xive_cpu *xc = __this_cpu_read(xive_cpu);
> +
> + xc->cppr = 0xff;
> + out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
> + DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
> +}
> +
>  /*
>   * After EOI'ing an interrupt, we need to re-check the queue
>   * to see if another interrupt is pending since multiple
> @@ -1419,6 +1436,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
> void __iomem *area, u32 o
>   xive_irq_priority = max_prio;
>  
>   ppc_md.get_irq = xive_get_irq;
> + ppc_md.orphan_irq = xive_orphan_irq;
>   __xive_enabled = true;
>  
>   pr_devel("Initializing host..\n");

[PATCH kernel v3] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Alexey Kardashevskiy

There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu and
since there is no handler for the irq, EOI is never called and the cpu
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, then that device stops working until irq
is moved to another cpu or the device is reset.

This adds a new ppc_md.orphan_irq callback which is called if no irq
descriptor is found. The XIVE implementation drops the current priority
to 0xff which effectively unmasks interrupts in a current CPU.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Cédric Le Goater 
---
Changes:
v3:
* added a comment above xive_orphan_irq()

v2:
* added ppc_md.orphan_irq
---
 arch/powerpc/include/asm/machdep.h |  3 +++
 arch/powerpc/kernel/irq.c          |  9 ++++++---
 arch/powerpc/sysdev/xive/common.c  | 18 ++++++++++++++++++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index c43d6eca9edd..6cc14e28e89a 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -59,6 +59,9 @@ struct machdep_calls {
/* Return an irq, or 0 to indicate there are none pending. */
unsigned int(*get_irq)(void);
 
+   /* Drops irq if it does not have a valid descriptor */
+   void(*orphan_irq)(unsigned int irq);
+
/* PCI stuff */
/* Called after allocating resources */
void(*pcibios_fixup)(void);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bc68c53af67c..b4e06d05bdba 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
may_hard_irq_enable();
 
/* And finally process it */
-   if (unlikely(!irq))
+   if (unlikely(!irq)) {
__this_cpu_inc(irq_stat.spurious_irqs);
-   else
-   generic_handle_irq(irq);
+   } else if (generic_handle_irq(irq)) {
+   if (ppc_md.orphan_irq)
+   ppc_md.orphan_irq(irq);
+   __this_cpu_inc(irq_stat.spurious_irqs);
+   }
 
trace_irq_exit(regs);
 
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..17e696b2d71b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -283,6 +283,23 @@ static unsigned int xive_get_irq(void)
return irq;
 }
 
+/*
+ * Handles the case when a target CPU catches an interrupt which is being shut
+ * down on another CPU. generic_handle_irq() returns an error in such case
+ * and then the orphan_irq() handler restores the CPPR to reenable interrupts.
+ *
+ * Without orphan_irq() and valid irq_desc, there is no other way to restore
+ * the CPPR. This executes on a CPU which caught the interrupt.
+ */
+static void xive_orphan_irq(unsigned int irq)
+{
+   struct xive_cpu *xc = __this_cpu_read(xive_cpu);
+
+   xc->cppr = 0xff;
+   out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
+   DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
+}
+
 /*
  * After EOI'ing an interrupt, we need to re-check the queue
  * to see if another interrupt is pending since multiple
@@ -1419,6 +1436,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
void __iomem *area, u32 o
xive_irq_priority = max_prio;
 
ppc_md.get_irq = xive_get_irq;
+   ppc_md.orphan_irq = xive_orphan_irq;
__xive_enabled = true;
 
pr_devel("Initializing host..\n");
-- 
2.17.1
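
The control flow this patch adds to __do_irq() can be modelled in a few lines of userspace C. This is a toy model under stated assumptions, not the powerpc code: all `toy_*` names are hypothetical, the "fetch" step that raises the current priority is folded into `toy_do_irq()`, and a registered handler is assumed to restore the priority as an EOI would.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NR_IRQS       8
#define TOY_CPPR_UNMASKED 0xff

struct toy_cpu {
    unsigned char cppr;          /* current priority; 0xff means unmasked */
    unsigned int spurious_irqs;
};

typedef void (*toy_handler_t)(struct toy_cpu *cpu);

/* A NULL slot models an irq whose descriptor has already been released. */
toy_handler_t toy_handlers[TOY_NR_IRQS];

/* Optional platform hook, modelled on the ppc_md.orphan_irq callback. */
void (*toy_orphan_irq)(struct toy_cpu *cpu, unsigned int irq);

/* Returns an error when there is no handler, like a missing irq descriptor. */
int toy_generic_handle_irq(struct toy_cpu *cpu, unsigned int irq)
{
    if (irq >= TOY_NR_IRQS || !toy_handlers[irq])
        return -EINVAL;
    toy_handlers[irq](cpu);
    return 0;
}

/* Models a handler whose EOI restores the priority. */
void toy_eoi_handler(struct toy_cpu *cpu)
{
    cpu->cppr = TOY_CPPR_UNMASKED;
}

/* Models the XIVE hook: drop the priority back to "unmasked". */
void toy_xive_orphan_irq(struct toy_cpu *cpu, unsigned int irq)
{
    (void)irq;
    cpu->cppr = TOY_CPPR_UNMASKED;
}

void toy_do_irq(struct toy_cpu *cpu, unsigned int irq, unsigned char prio)
{
    if (!irq) {
        cpu->spurious_irqs++;
        return;
    }

    cpu->cppr = prio;            /* fetching the irq raises the priority */

    if (toy_generic_handle_irq(cpu, irq)) {
        /* Nobody will EOI this irq, so let the platform clean up. */
        if (toy_orphan_irq)
            toy_orphan_irq(cpu, irq);
        cpu->spurious_irqs++;
    }
}
```

Without the hook installed, an orphaned irq leaves `cppr` stuck at the raised priority, which corresponds to the "device stops working" symptom in the commit message.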

Re: [PATCH kernel v3] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Alexey Kardashevskiy





On 17/07/2019 15:00, Alexey Kardashevskiy wrote:
> There is a race between releasing an irq on one cpu and fetching it
> from XIVE on another cpu as there does not seem to be any locking between
> these, probably because xive_irq_chip::irq_shutdown() is supposed to
> remove the irq from all queues in the system which it does not do.
> 
> As a result, when such released irq appears in a queue, we take it
> from the queue but we do not change the current priority on that cpu and
> since there is no handler for the irq, EOI is never called and the cpu
> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
> is assigned to the same cpu, then that device stops working until irq
> is moved to another cpu or the device is reset.
> 
> This adds a new ppc_md.orphan_irq callback which is called if no irq
> descriptor is found. The XIVE implementation drops the current priority
> to 0xff which effectively unmasks interrupts in a current CPU.
> 
> Signed-off-by: Alexey Kardashevskiy
> Reviewed-by: Cédric Le Goater

Of course I missed this:

Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")

> ---
> Changes:
> v3:
> * added a comment above xive_orphan_irq()
> 
> v2:
> * added ppc_md.orphan_irq
> ---
>  arch/powerpc/include/asm/machdep.h |  3 +++
>  arch/powerpc/kernel/irq.c          |  9 ++++++---
>  arch/powerpc/sysdev/xive/common.c  | 18 ++++++++++++++++++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index c43d6eca9edd..6cc14e28e89a 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -59,6 +59,9 @@ struct machdep_calls {
>   /* Return an irq, or 0 to indicate there are none pending. */
>   unsigned int(*get_irq)(void);
>  
> + /* Drops irq if it does not have a valid descriptor */
> + void(*orphan_irq)(unsigned int irq);
> +
>   /* PCI stuff */
>   /* Called after allocating resources */
>   void(*pcibios_fixup)(void);
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bc68c53af67c..b4e06d05bdba 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>   may_hard_irq_enable();
>  
>   /* And finally process it */
> - if (unlikely(!irq))
> + if (unlikely(!irq)) {
>   __this_cpu_inc(irq_stat.spurious_irqs);
> - else
> - generic_handle_irq(irq);
> + } else if (generic_handle_irq(irq)) {
> + if (ppc_md.orphan_irq)
> + ppc_md.orphan_irq(irq);
> + __this_cpu_inc(irq_stat.spurious_irqs);
> + }
>  
>   trace_irq_exit(regs);
>  
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..17e696b2d71b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,6 +283,23 @@ static unsigned int xive_get_irq(void)
>   return irq;
>  }
>  
> +/*
> + * Handles the case when a target CPU catches an interrupt which is being shut
> + * down on another CPU. generic_handle_irq() returns an error in such case
> + * and then the orphan_irq() handler restores the CPPR to reenable interrupts.
> + *
> + * Without orphan_irq() and valid irq_desc, there is no other way to restore
> + * the CPPR. This executes on a CPU which caught the interrupt.
> + */
> +static void xive_orphan_irq(unsigned int irq)
> +{
> + struct xive_cpu *xc = __this_cpu_read(xive_cpu);
> +
> + xc->cppr = 0xff;
> + out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
> + DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
> +}
> +
>  /*
>   * After EOI'ing an interrupt, we need to re-check the queue
>   * to see if another interrupt is pending since multiple
> @@ -1419,6 +1436,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
> void __iomem *area, u32 o
>   xive_irq_priority = max_prio;
>  
>   ppc_md.get_irq = xive_get_irq;
> + ppc_md.orphan_irq = xive_orphan_irq;
>   __xive_enabled = true;
>  
>   pr_devel("Initializing host..\n");




--
Alexey
