Re: [PATCH v6 00/27] ptrace: add PTRACE_GET_SYSCALL_INFO request

2018-12-14 Thread Paul Moore
On Thu, Dec 13, 2018 at 12:18 PM Dmitry V. Levin  wrote:
> PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets the ptracer
> obtain details of the syscall the tracee is blocked in.
>
> There are two reasons for a special syscall-related ptrace request.
>
> Firstly, with the current ptrace API there are cases when ptracer cannot
> retrieve necessary information about syscalls.  Some examples include:
> * The notorious int-0x80-from-64-bit-task issue.  See [1] for details.
> In short, if a 64-bit task performs a syscall through int 0x80, its tracer
> has no reliable means to find out that the syscall was, in fact,
> a compat syscall, and misidentifies it.
> * Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
> Common practice is to keep track of the sequence of ptrace-stops in order
> not to mix the two syscall-stops up.  But it is not as simple as it looks;
> for example, strace had a (just recently fixed) long-standing bug where
> attaching strace to a tracee that is performing the execve system call
> led to the tracer identifying the following syscall-exit-stop as
> syscall-enter-stop, which messed up all the state tracking.
> * Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
> ("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
> and process_vm_readv become unavailable when the process dumpable flag
> is cleared.  On such architectures as ia64 this results in all syscall
> arguments being unavailable for the tracer.
>
> Secondly, ptracers also have to support a lot of arch-specific code for
> obtaining information about the tracee.  For some architectures, this
> requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
> argument and return value.
>
> PTRACE_GET_SYSCALL_INFO returns the following structure:
>
> struct ptrace_syscall_info {
> 	__u8 op;	/* PTRACE_SYSCALL_INFO_* */
> 	__u32 arch __attribute__((__aligned__(sizeof(__u32))));
> 	__u64 instruction_pointer;
> 	__u64 stack_pointer;
> 	union {
> 		struct {
> 			__u64 nr;
> 			__u64 args[6];
> 		} entry;
> 		struct {
> 			__s64 rval;
> 			__u8 is_error;
> 		} exit;
> 		struct {
> 			__u64 nr;
> 			__u64 args[6];
> 			__u32 ret_data;
> 		} seccomp;
> 	};
> };
>
> The structure was chosen according to [2], except for the following
> changes:
> * seccomp substructure was added as a superset of entry substructure;
> * the type of nr field was changed from int to __u64 because syscall
> numbers are, as a practical matter, 64 bits;
> * stack_pointer field was added along with instruction_pointer field
> since it is readily available and can save the tracer from extra
> PTRACE_GETREGS/PTRACE_GETREGSET calls;
> * arch is always initialized to aid with tracing system calls
>   such as execve();
> * instruction_pointer and stack_pointer are always initialized
> so they could be easily obtained for non-syscall stops;
> * a boolean is_error field was added along with rval field, this way
> the tracer can more reliably distinguish a return value
> from an error value.
>
> strace has been ported to PTRACE_GET_SYSCALL_INFO, you can find it
> in [3] and [4].
>
> [1] 
> https://lore.kernel.org/lkml/ca+55afzcsvmddj9lh_gdbz1ozhyem6zrgpbdajnywm2lf_e...@mail.gmail.com/
> [2] 
> https://lore.kernel.org/lkml/caobl_7gm0n80n7j_dfw_eqyflyzq+sf4y2avsccv88tb3aw...@mail.gmail.com/
> [3] https://github.com/strace/strace/commits/ldv/PTRACE_GET_SYSCALL_INFO
> [4] https://gitlab.com/strace/strace/commits/ldv/PTRACE_GET_SYSCALL_INFO
>
> ---
>
> Notes:
> v6:
> * Add syscall_get_arguments and syscall_set_arguments wrappers
>   to asm-generic/syscall.h, requested by Geert.
> * Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing
>   padding into account, use the end of the last field of the structure
>   being written.
> * Change struct ptrace_syscall_info:
>   * remove .frame_pointer field, it is not needed and not portable;
>   * make .arch field explicitly aligned, remove no longer needed
> padding before .arch field;
>   * remove trailing pads, they are no longer needed.
>
> v5:
> * Merge separate series and patches into the single series.
> * Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
> * Change struct ptrace_syscall_info: generalize instruction_pointer,
>   stack_pointer, and frame_pointer fields by moving them from
>   ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
>   and initializing them for all stops.
> * Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
>   so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
>   instruction_pointer 

Re: [PATCH v1 9/9] mm: better document PG_reserved

2018-12-14 Thread Randy Dunlap
On 12/14/18 3:10 AM, David Hildenbrand wrote:
> The usage of PG_reserved and how PG_reserved pages are to be treated is
> buried deep down in different parts of the kernel. Let's shine some light
> onto these details by documenting current users and expected
> behavior.
> 
> In particular, clarify the "Some of them might not even exist" case.
> These are physical memory gaps that will never be dumped as they
> are not marked as IORESOURCE_SYSRAM. PG_reserved does in general not
> hinder anybody from dumping or swapping. In some cases, these pages
> will not be stored in the hibernation image.

Hi,
Thanks for the doc update.
Comments below.

> Cc: Andrew Morton 
> Cc: Stephen Rothwell 
> Cc: Pavel Tatashin 
> Cc: Michal Hocko 
> Cc: Alexander Duyck 
> Cc: Matthew Wilcox 
> Cc: Anthony Yznaga 
> Cc: Miles Chen 
> Cc: yi.z.zh...@linux.intel.com
> Cc: Dan Williams 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/page-flags.h | 33 +++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 808b4183e30d..9de2e941cbd5 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -17,8 +17,37 @@
>  /*
>   * Various page->flags bits:
>   *
> - * PG_reserved is set for special pages, which can never be swapped out. Some
> - * of them might not even exist...
> + * PG_reserved is set for special pages. The "struct page" of such a page
> + * should in general not be touched (e.g. set dirty) except by their owner.

   by its owner.

> + * Pages marked as PG_reserved include:
> + * - Pages part of the kernel image (including vDSO) and similar (e.g. BIOS,
> + *   initrd, HW tables)
> + * - Pages reserved or allocated early during boot (before the page allocator
> + *   was initialized). This includes (depending on the architecture) the
> + *   initial vmmap, initial page tables, crashkernel, elfcorehdr, and much

VM map,

> + *   much more. Once (if ever) freed, PG_reserved is cleared and they will
> + *   be given to the page allocator.
> + * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
> + *   to read/write these pages might end badly. Don't touch!
> + * - The zero page(s)
> + * - Pages not added to the page allocator when onlining a section because
> + *   they were excluded via the online_page_callback() or because they are
> + *   PG_hwpoison.
> + * - Pages allocated in the context of kexec/kdump (loaded kernel image,
> + *   control pages, vmcoreinfo)
> + * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
> + *   not marked PG_reserved (as they might be in use by somebody else who
> + *   does not respect the caching strategy).
> + * - Pages part of an offline section (struct pages of offline sections
> + *   should not be trusted as they will be initialized when first onlined).
> + * - MCA pages on ia64
> + * - Pages holding CPU notes for POWER Firmware Assisted Dump
> + * - Device memory (e.g. PMEM, DAX, HMM)
> + * Some PG_reserved pages will be excluded from the hibernation image.
> + * PG_reserved does in general not hinder anybody from dumping or swapping
> + * and is no longer required for remap_pfn_range(). ioremap might require it.
> + * Consequently, PG_reserved for a page mapped into user space can indicate
> + * the zero page, the vDSO, MMIO pages or device memory.
>   *
>   * The PG_private bitflag is set on pagecache pages if they contain
>   * filesystem specific data (which is normally at page->private). It can be used by
> 

cheers.
-- 
~Randy


Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-14 Thread Frank Rowand
On 12/14/18 2:47 PM, Frank Rowand wrote:
> On 12/14/18 9:15 AM, Rob Herring wrote:
>> On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>>>
>>> From: Frank Rowand 
>>>
>>> The phandle cache contains struct device_node pointers.  The refcount
>>> of the pointers was not incremented while in the cache, allowing use
>>> after free error after kfree() of the node.  Add the proper increment
>>> and decrement of the use count.
>>
>> Since we pre-populate the cache at boot, all the nodes will have a ref
>> count and will never be freed unless we happen to repopulate the whole
>> cache. That doesn't seem ideal. The node pointer is not "in use" just
>> because it is in the cache.

I forgot to reply to this sentence.

The node pointers are "in use" because of_find_node_by_phandle() will
use the pointers to access the phandle field.  This is a use after
free bug if the node has been kfree()'ed.


>>
>> Rob
>>
> 
> This patch also adds of_node_put() so that the refcount will go to zero
> when the node is removed as part of an overlay remove, if the node was
> added by an overlay.
> 
> Patch 2/2 adds the free cache entry call to __of_detach_node(), so the
> refcount will go to zero when the node is removed for dynamic use cases
> other than overlays.  (For overlays, all nodes are instead removed from
> the cache before __of_detach_node() is called.)
> 
> -Frank
> 



Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-14 Thread Frank Rowand
On 12/14/18 9:15 AM, Rob Herring wrote:
> On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>>
>> From: Frank Rowand 
>>
>> The phandle cache contains struct device_node pointers.  The refcount
>> of the pointers was not incremented while in the cache, allowing use
>> after free error after kfree() of the node.  Add the proper increment
>> and decrement of the use count.
> 
> Since we pre-populate the cache at boot, all the nodes will have a ref
> count and will never be freed unless we happen to repopulate the whole
> cache. That doesn't seem ideal. The node pointer is not "in use" just
> because it is in the cache.
> 
> Rob
> 

This patch also adds of_node_put() so that the refcount will go to zero
when the node is removed as part of an overlay remove, if the node was
added by an overlay.

Patch 2/2 adds the free cache entry call to __of_detach_node(), so the
refcount will go to zero when the node is removed for dynamic use cases
other than overlays.  (For overlays, all nodes are instead removed from
the cache before __of_detach_node() is called.)

-Frank


Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-14 Thread Frank Rowand
On 12/14/18 1:56 PM, Michael Bringmann wrote:
> On 12/14/2018 11:20 AM, Rob Herring wrote:
>> On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>>>
>>> From: Frank Rowand 
>>>
>>> Non-overlay dynamic devicetree node removal may leave the node in
>>> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
>>> will incorrectly find the stale entry.  Remove the node from the
>>> cache.
>>>
>>> Add paranoia checks in of_find_node_by_phandle() as a second level
>>> of defense (do not return cached node if detached, do not add node
>>> to cache if detached).
>>>
>>> Reported-by: Michael Bringmann 
>>> Signed-off-by: Frank Rowand 
>>> ---
>>>  drivers/of/base.c   | 29 -
>>>  drivers/of/dynamic.c|  3 +++
>>>  drivers/of/of_private.h |  4 
>>>  3 files changed, 35 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>>> index d599367cb92a..34a5125713c8 100644
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
>>>  late_initcall_sync(of_free_phandle_cache);
>>>  #endif
>>>
>>> +/*
>>> + * Caller must hold devtree_lock.
>>> + */
>>> +void __of_free_phandle_cache_entry(phandle handle)
>>> +{
>>> +   phandle masked_handle;
>>> +
>>> +   if (!handle)
>>> +   return;
>>> +
>>> +   masked_handle = handle & phandle_cache_mask;
>>> +
>>> +   if (phandle_cache) {
>>> +   if (phandle_cache[masked_handle] &&
>>> +   handle == phandle_cache[masked_handle]->phandle) {
>>> +   of_node_put(phandle_cache[masked_handle]);
>>> +   phandle_cache[masked_handle] = NULL;
>>> +   }
>>> +   }
>>> +}
>>> +
>>>  void of_populate_phandle_cache(void)
>>>  {
>>> unsigned long flags;
>>> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>> if (phandle_cache[masked_handle] &&
>>> handle == phandle_cache[masked_handle]->phandle)
>>> np = phandle_cache[masked_handle];
>>> +   if (np && of_node_check_flag(np, OF_DETACHED)) {
>>> +   of_node_put(np);
>>> +   phandle_cache[masked_handle] = NULL;
>>
>> This should never happen, right? Any time we set OF_DETACHED, the
>> entry should get removed from the cache. I think we want a WARN here
>> in case we're in an unexpected state.

Correct, this should never happen.  I will add the WARN.


> We don't actually remove the pointer from the phandle cache when we set
> OF_DETACHED in drivers/of/dynamic.c:__of_detach_node.  The phandle cache
> is currently static within drivers/of/base.c.  There are a couple of
> calls to of_populate_phandle_cache / of_free_phandle_cache within
> drivers/of/overlay.c, but these are not involved in the device tree
> updates that occur during LPAR migration.  A WARN here would only make
> sense if we also arrange to clear the handle.

Rob's reply did not include the full patch 2/2.  The full patch 2/2 also
adds a call to __of_free_phandle_cache_entry() in __of_detach_node().

-Frank

> 
>>
>> Rob
> 
> Michael
> 
>>
>>
> 



Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem

2018-12-14 Thread Michael Bringmann
On 12/12/2018 08:57 PM, Michael Ellerman wrote:
> Frank Rowand  writes:
>> On 12/11/18 8:07 AM, Rob Herring wrote:
>>> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman  wrote:
> ...
 diff --git a/drivers/of/base.c b/drivers/of/base.c
 index 09692c9b32a7..d8e4534c0686 100644
 --- a/drivers/of/base.c
 +++ b/drivers/of/base.c
 @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
 if (phandle_cache[masked_handle] &&
 handle == phandle_cache[masked_handle]->phandle)
 np = phandle_cache[masked_handle];
 +
 +   /* If we find a detached node, remove it */
 +   if (of_node_check_flag(np, OF_DETACHED))
 +   np = phandle_cache[masked_handle] = NULL;
>>
>> The bug you found exposes a couple of different issues, a little bit
>> deeper than the proposed fix.  I'll work on a fuller fix tonight or
>> tomorrow.
> 
> OK thanks.
> 
>>> I'm wondering if we should explicitly remove the node from the cache
>>> when we set OF_DETACHED. Otherwise, it could be possible that the node
>>> pointer has been freed already. Or maybe we need both?
>>
>> Yes, it should be explicitly removed.  I may also add in a paranoia check in
>> of_find_node_by_phandle().
> 
> That seems best to me.

I agree that we should do both.

> 
> cheers

Michael

-- 
Michael W. Bringmann
Linux I/O, Networking and Security Development
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-14 Thread Michael Bringmann
On 12/14/2018 11:20 AM, Rob Herring wrote:
> On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>>
>> From: Frank Rowand 
>>
>> Non-overlay dynamic devicetree node removal may leave the node in
>> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
>> will incorrectly find the stale entry.  Remove the node from the
>> cache.
>>
>> Add paranoia checks in of_find_node_by_phandle() as a second level
>> of defense (do not return cached node if detached, do not add node
>> to cache if detached).
>>
>> Reported-by: Michael Bringmann 
>> Signed-off-by: Frank Rowand 
>> ---
>>  drivers/of/base.c   | 29 -
>>  drivers/of/dynamic.c|  3 +++
>>  drivers/of/of_private.h |  4 
>>  3 files changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index d599367cb92a..34a5125713c8 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
>>  late_initcall_sync(of_free_phandle_cache);
>>  #endif
>>
>> +/*
>> + * Caller must hold devtree_lock.
>> + */
>> +void __of_free_phandle_cache_entry(phandle handle)
>> +{
>> +   phandle masked_handle;
>> +
>> +   if (!handle)
>> +   return;
>> +
>> +   masked_handle = handle & phandle_cache_mask;
>> +
>> +   if (phandle_cache) {
>> +   if (phandle_cache[masked_handle] &&
>> +   handle == phandle_cache[masked_handle]->phandle) {
>> +   of_node_put(phandle_cache[masked_handle]);
>> +   phandle_cache[masked_handle] = NULL;
>> +   }
>> +   }
>> +}
>> +
>>  void of_populate_phandle_cache(void)
>>  {
>> unsigned long flags;
>> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>> if (phandle_cache[masked_handle] &&
>> handle == phandle_cache[masked_handle]->phandle)
>> np = phandle_cache[masked_handle];
>> +   if (np && of_node_check_flag(np, OF_DETACHED)) {
>> +   of_node_put(np);
>> +   phandle_cache[masked_handle] = NULL;
> 
> This should never happen, right? Any time we set OF_DETACHED, the
> entry should get removed from the cache. I think we want a WARN here
> in case we're in an unexpected state.

We don't actually remove the pointer from the phandle cache when we set
OF_DETACHED in drivers/of/dynamic.c:__of_detach_node.  The phandle cache
is currently static within drivers/of/base.c.  There are a couple of
calls to of_populate_phandle_cache / of_free_phandle_cache within
drivers/of/overlay.c, but these are not involved in the device tree
updates that occur during LPAR migration.  A WARN here would only make
sense if we also arrange to clear the handle.

> 
> Rob

Michael

> 
> 

-- 
Michael W. Bringmann
Linux I/O, Networking and Security Development
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH] net/ibmvnic: Remove tests of member address

2018-12-14 Thread David Miller
From: Wen Yang 
Date: Tue, 11 Dec 2018 12:20:46 +0800

> The driver was checking for non-NULL address.
> - adapter->napi[i]
> 
> This is pointless as these will always be non-NULL, since
> 'adapter->napi' is allocated in init_napi().
> It is safe to get rid of useless checks for addresses to fix the
> coccinelle warning:
>>>drivers/net/ethernet/ibm/ibmvnic.c: test of a variable/field address
> Since such statements always return true, they are redundant.
> 
> Signed-off-by: Wen Yang 

Applied.


[RFC 6/6] powerpc: Enable support for ibm,drc-info devtree property

2018-12-14 Thread Michael Bringmann
Enable support for the new DRC device tree property "ibm,drc-info"
in the initial handshake between the Linux kernel and the front end
processor.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/kernel/prom_init.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f33ff41..5d20a4d 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -910,7 +910,7 @@ struct ibm_arch_vec {
.reserved2 = 0,
.reserved3 = 0,
.subprocessors = 1,
-   .byte22 = OV5_FEAT(OV5_DRMEM_V2),
+   .byte22 = OV5_FEAT(OV5_DRMEM_V2 | OV5_DRC_INFO),
.intarch = 0,
.mmu = 0,
.hash_ext = 0,



[RFC 5/6] powerpc/pci/hotplug: Use common drcinfo parsing

2018-12-14 Thread Michael Bringmann
The implementation of the pseries-specific drc info properties
is currently implemented in pseries-specific and non-pseries-specific
files.  This patch set uses a new implementation of the device-tree
parsing code for the properties.

This patch refactors parsing of the pseries-specific drc-info properties
out of rpaphp_core.c to use the common parser.  In the case where an
architecture does not use these properties, an __weak copy of the
function is provided with dummy return values.  Changes include creating
appropriate callback functions and passing callback-specific data
blocks into arch_find_drc_match.  All functions that were used just
to support the previous parsing implementation have been moved.

Signed-off-by: Michael Bringmann 
---
 drivers/pci/hotplug/rpaphp_core.c |  232 -
 1 file changed, 28 insertions(+), 204 deletions(-)

diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c
index bcd5d35..9ad7384 100644
--- a/drivers/pci/hotplug/rpaphp_core.c
+++ b/drivers/pci/hotplug/rpaphp_core.c
@@ -154,182 +154,18 @@ static enum pci_bus_speed get_max_bus_speed(struct slot *slot)
return speed;
 }
 
-static int get_children_props(struct device_node *dn, const int **drc_indexes,
-   const int **drc_names, const int **drc_types,
-   const int **drc_power_domains)
-{
-   const int *indexes, *names, *types, *domains;
-
-   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-   names = of_get_property(dn, "ibm,drc-names", NULL);
-   types = of_get_property(dn, "ibm,drc-types", NULL);
-   domains = of_get_property(dn, "ibm,drc-power-domains", NULL);
-
-   if (!indexes || !names || !types || !domains) {
-   /* Slot does not have dynamically-removable children */
-   return -EINVAL;
-   }
-   if (drc_indexes)
-   *drc_indexes = indexes;
-   if (drc_names)
-   /* &drc_names[1] contains NULL terminated slot names */
-   *drc_names = names;
-   if (drc_types)
-   /* &drc_types[1] contains NULL terminated slot types */
-   *drc_types = types;
-   if (drc_power_domains)
-   *drc_power_domains = domains;
-
-   return 0;
-}
-
-
 /* Verify the existence of 'drc_name' and/or 'drc_type' within the
- * current node.  First obtain it's my-drc-index property.  Next,
- * obtain the DRC info from it's parent.  Use the my-drc-index for
- * correlation, and obtain/validate the requested properties.
+ * current node.
  */
 
-static int rpaphp_check_drc_props_v1(struct device_node *dn, char *drc_name,
-   char *drc_type, unsigned int my_index)
-{
-   char *name_tmp, *type_tmp;
-   const int *indexes, *names;
-   const int *types, *domains;
-   int i, rc;
-
-   rc = get_children_props(dn->parent, &indexes, &names, &types, &domains);
-   if (rc < 0) {
-   return -EINVAL;
-   }
-
-   name_tmp = (char *) &names[1];
-   type_tmp = (char *) &types[1];
-
-   /* Iterate through parent properties, looking for my-drc-index */
-   for (i = 0; i < be32_to_cpu(indexes[0]); i++) {
-   if ((unsigned int) indexes[i + 1] == my_index)
-   break;
-
-   name_tmp += (strlen(name_tmp) + 1);
-   type_tmp += (strlen(type_tmp) + 1);
-   }
-
-   if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, name_tmp))) &&
-   ((drc_type == NULL) || (drc_type && !strcmp(drc_type, type_tmp))))
-   return 0;
-
-   return -EINVAL;
-}
-
-static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name,
-   char *drc_type, unsigned int my_index)
-{
-   struct property *info;
-   unsigned int entries;
-   struct of_drc_info drc;
-   const __be32 *value;
-   char cell_drc_name[MAX_DRC_NAME_LEN];
-   int j, fndit;
-
-   info = of_find_property(dn->parent, "ibm,drc-info", NULL);
-   if (info == NULL)
-   return -EINVAL;
-
-   value = of_prop_next_u32(info, NULL, &entries);
-   if (!value)
-   return -EINVAL;
-
-   for (j = 0; j < entries; j++) {
-   of_read_drc_info_cell(&info, &value, &drc);
-
-   /* Should now know end of current entry */
-
-   if (my_index > drc.last_drc_index)
-   continue;
-
-   fndit = 1;
-   break;
-   }
-   /* Found it */
-
-   if (fndit)
-   sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, my_index);
-
-   if (((drc_name == NULL) ||
-(drc_name && !strcmp(drc_name, cell_drc_name))) &&
-   ((drc_type == NULL) ||
-(drc_type && !strcmp(drc_type, drc.drc_type))))
-   return 0;
-
-   return -EINVAL;
-}
-
 int rpaphp_check_drc_props(struct device_node *dn, char *drc_name,
char *drc_type)
 {
-   

[RFC 4/6] powerpc/pseries: Use common drcinfo parsing

2018-12-14 Thread Michael Bringmann
The implementation of the pseries-specific drc info properties
is currently implemented in pseries-specific and non-pseries-specific
files.  This patch set uses a new implementation of the device-tree
parsing code for the properties.

This patch refactors parsing of the drc-info properties out of
pseries_energy.c and hotplug-cpu.c to use the common parser.
Changes include creating appropriate callback functions and
passing callback-specific data blocks into arch_find_drc_match.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c|   83 +++-
 arch/powerpc/platforms/pseries/pseries_energy.c |  157 ---
 2 files changed, 107 insertions(+), 133 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 2f8e621..ee3028c 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -411,23 +411,29 @@ static bool dlpar_cpu_exists(struct device_node *parent, u32 drc_index)
return found;
 }
 
-static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
+struct cpu_drc_index_struct {
+   u32 drc_index;
+};
+
+bool cpu_drc_index_cb(struct device_node *dn,
+   u32 drc_index, char *drc_name,
+   char *drc_type, u32 drc_power_domain,
+   void *data)
 {
-   bool found = false;
-   int rc, index;
+   struct cpu_drc_index_struct *cdata = data;
 
-   index = 0;
-   while (!found) {
-   u32 drc;
+   if (drc_index == cdata->drc_index)
+   return true;
+   return false;
+}
 
-   rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
-   index++, &drc);
-   if (rc)
-   break;
+static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
+{
+   struct cpu_drc_index_struct cdata = { drc_index };
+   bool found = false;
 
-   if (drc == drc_index)
-   found = true;
-   }
+   found = arch_find_drc_match(parent, cpu_drc_index_cb,
+   "CPU", NULL, false, false, &cdata);
 
return found;
 }
@@ -721,11 +727,34 @@ static int dlpar_cpu_remove_by_count(u32 cpus_to_remove)
return rc;
 }
 
+struct cpus_to_add_cb_struct {
+   struct device_node *parent;
+   u32 *cpu_drcs;
+   u32 cpus_to_add;
+   u32 cpus_found;
+};
+
+static bool cpus_to_add_cb(struct device_node *dn,
+   u32 drc_index, char *drc_name,
+   char *drc_type, u32 drc_power_domain,
+   void *data)
+{
+   struct cpus_to_add_cb_struct *cdata = data;
+
+   if (cdata->cpus_found < cdata->cpus_to_add) {
+   if (!dlpar_cpu_exists(cdata->parent, drc_index))
+   cdata->cpu_drcs[cdata->cpus_found++] = drc_index;
+   }
+
+   return !(cdata->cpus_found < cdata->cpus_to_add);
+}
+
 static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add)
 {
struct device_node *parent;
-   int cpus_found = 0;
-   int index, rc;
+   struct cpus_to_add_cb_struct cdata = {
+   NULL, cpu_drcs, cpus_to_add, 0 };
+   int cpus_found;
 
parent = of_find_node_by_path("/cpus");
if (!parent) {
@@ -734,25 +763,13 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add)
return -1;
}
 
-   /* Search the ibm,drc-indexes array for possible CPU drcs to
-* add. Note that the format of the ibm,drc-indexes array is
-* the number of entries in the array followed by the array
-* of drc values so we start looking at index = 1.
+   /* Search the appropriate property for possible CPU drcs
+* to add.
 */
-   index = 1;
-   while (cpus_found < cpus_to_add) {
-   u32 drc;
-
-   rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
-   index++, &drc);
-   if (rc)
-   break;
-
-   if (dlpar_cpu_exists(parent, drc))
-   continue;
-
-   cpu_drcs[cpus_found++] = drc;
-   }
+   cdata.parent = parent;
+   arch_find_drc_match(parent, cpus_to_add_cb, "CPU",
+   NULL, false, false, &cdata);
+   cpus_found = cdata.cpus_found;
 
of_node_put(parent);
return cpus_found;
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c
index 6ed2212..f7b9d86 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
 #define MODULE_VERS "1.0"
@@ -36,60 +37,43 @@
 
 /* Helper Routines to convert between drc_index to cpu numbers */
 

[RFC 3/6] pseries/drcinfo: Pseries impl of arch_find_drc_info

2018-12-14 Thread Michael Bringmann
This patch provides a common interface to parse ibm,drc-indexes,
ibm,drc-names, ibm,drc-types, ibm,drc-power-domains, or ibm,drc-info.
The generic interface arch_find_drc_match is provided which accepts
callback functions that may be applied to examine the data for each
entry.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/include/asm/prom.h |3 
 arch/powerpc/platforms/pseries/of_helpers.c |  299 +++
 include/linux/topology.h|2 
 3 files changed, 298 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index b04c5ce..910d1dc 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -91,9 +91,6 @@ struct of_drc_info {
u32 last_drc_index;
 };
 
-extern int of_read_drc_info_cell(struct property **prop,
-   const __be32 **curval, struct of_drc_info *data);
-
 
 /*
  * There are two methods for telling firmware what our capabilities are.
diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c
index 0185e50..11c90cd 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define pr_fmt(fmt) "drc: " fmt
+
 #include 
 #include 
 #include 
@@ -11,6 +13,12 @@
 
 #define MAX_DRC_NAME_LEN 64
 
+static int drc_debug;
+#define dbg(args...) if (drc_debug) { printk(KERN_DEBUG args); }
+#define err(args...) printk(KERN_ERR args);
+#define info(args...) printk(KERN_INFO args);
+#define warn(args...) printk(KERN_WARNING args);
+
 /**
  * pseries_of_derive_parent - basically like dirname(1)
  * @path:  the full_name of a node to be added to the tree
@@ -46,7 +54,8 @@ struct device_node *pseries_of_derive_parent(const char *path)
 
 /* Helper Routines to convert between drc_index to cpu numbers */
 
-int of_read_drc_info_cell(struct property **prop, const __be32 **curval,
+static int of_read_drc_info_cell(struct property **prop,
+   const __be32 **curval,
struct of_drc_info *data)
 {
const char *p;
@@ -90,4 +99,290 @@ int of_read_drc_info_cell(struct property **prop, const __be32 **curval,
 
return 0;
 }
-EXPORT_SYMBOL(of_read_drc_info_cell);
+
+static int walk_drc_info(struct device_node *dn,
+   bool (*usercb)(struct of_drc_info *drc,
+   void *data,
+   int *ret_code),
+   char *opt_drc_type,
+   void *data)
+{
+   struct property *info;
+   unsigned int entries;
+   struct of_drc_info drc;
+   const __be32 *value;
+   int j, ret_code = -EINVAL;
+   bool done = false;
+
+   info = of_find_property(dn, "ibm,drc-info", NULL);
+   if (info == NULL)
+   return -EINVAL;
+
+   value = info->value;
+   entries = of_read_number(value++, 1);
+
+   for (j = 0, done = 0; (j < entries) && (!done); j++) {
+   of_read_drc_info_cell(&info, &value, &drc);
+
+   if (opt_drc_type && strcmp(opt_drc_type, drc.drc_type))
+   continue;
+
+   done = usercb(&drc, data, &ret_code);
+   }
+
+   return ret_code;
+}
+
+static int get_children_props(struct device_node *dn, const int **drc_indexes,
+   const int **drc_names, const int **drc_types,
+   const int **drc_power_domains)
+{
+   const int *indexes, *names, *types, *domains;
+
+   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
+   names = of_get_property(dn, "ibm,drc-names", NULL);
+   types = of_get_property(dn, "ibm,drc-types", NULL);
+   domains = of_get_property(dn, "ibm,drc-power-domains", NULL);
+
+   if (!indexes || !names || !types || !domains) {
+   /* Slot does not have dynamically-removable children */
+   return -EINVAL;
+   }
+   if (drc_indexes)
+   *drc_indexes = indexes;
+   if (drc_names)
+   /* &names[1] contains NULL terminated slot names */
+   *drc_names = names;
+   if (drc_types)
+   /* &types[1] contains NULL terminated slot types */
+   *drc_types = types;
+   if (drc_power_domains)
+   *drc_power_domains = domains;
+
+   return 0;
+}
+
+static int is_php_type(char *drc_type)
+{
+   unsigned long value;
+   char *endptr;
+
+   /* PCI Hotplug nodes have an integer for drc_type */
+   value = simple_strtoul(drc_type, &endptr, 10);
+   if (endptr == drc_type)
+   return 0;
+
+   return 1;
+}
+
+/**
+ * is_php_dn() - return 1 if this is a hotpluggable pci slot, else 0
+ * @dn: target &device_node
+ * @indexes: passed to get_children_props()
+ * @names: passed to get_children_props()
+ * @types: returned from get_children_props()
+ * @power_domains:
+ *
+ * This routine will return true only if the 

[RFC 2/6] pseries/drcinfo: Fix bug parsing ibm,drc-info

2018-12-14 Thread Michael Bringmann
Replace the use of of_prop_next_u32() when parsing the 'ibm,drc-info'
structure to simplify and reduce the parsing code.

Signed-off-by: Michael Bringmann 
Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info firmware feature" -- end of patch series applied to powerpc next)
---
 arch/powerpc/platforms/pseries/of_helpers.c |   24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c
index 6df192f..0185e50 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+
 #include 
 #include 
 #include 
@@ -6,6 +7,9 @@
 #include 
 
 #include "of_helpers.h"
+#include "pseries.h"
+
#define MAX_DRC_NAME_LEN 64
 
 /**
  * pseries_of_derive_parent - basically like dirname(1)
@@ -65,29 +69,19 @@ int of_read_drc_info_cell(struct property **prop, const __be32 **curval,
 
/* Get drc-index-start:encode-int */
p2 = (const __be32 *)p;
-   p2 = of_prop_next_u32(*prop, p2, &data->drc_index_start);
-   if (!p2)
-   return -EINVAL;
+   data->drc_index_start = of_read_number(p2++, 1);
 
/* Get drc-name-suffix-start:encode-int */
-   p2 = of_prop_next_u32(*prop, p2, &data->drc_name_suffix_start);
-   if (!p2)
-   return -EINVAL;
+   data->drc_name_suffix_start = of_read_number(p2++, 1);
 
/* Get number-sequential-elements:encode-int */
-   p2 = of_prop_next_u32(*prop, p2, &data->num_sequential_elems);
-   if (!p2)
-   return -EINVAL;
+   data->num_sequential_elems = of_read_number(p2++, 1);
 
/* Get sequential-increment:encode-int */
-   p2 = of_prop_next_u32(*prop, p2, &data->sequential_inc);
-   if (!p2)
-   return -EINVAL;
+   data->sequential_inc = of_read_number(p2++, 1);
 
/* Get drc-power-domain:encode-int */
-   p2 = of_prop_next_u32(*prop, p2, &data->drc_power_domain);
-   if (!p2)
-   return -EINVAL;
+   data->drc_power_domain = of_read_number(p2++, 1);
 
/* Should now know end of current entry */
(*curval) = (void *)p2;



[RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2018-12-14 Thread Michael Bringmann
Define an interface to acquire arch-specific drc info to match against
hotpluggable devices.  The current implementation exposes several
pseries-specific dynamic memory properties in generic kernel code.
This patch set provides an interface to pull that code out of the
generic kernel.

Signed-off-by: Michael Bringmann 
---
 include/linux/topology.h |9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..df97f5f 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -44,6 +44,15 @@
 
 int arch_update_cpu_topology(void);
 
+int arch_find_drc_match(struct device_node *dn,
+   bool (*usercb)(struct device_node *dn,
+   u32 drc_index, char *drc_name,
+   char *drc_type, u32 drc_power_domain,
+   void *data),
+   char *opt_drc_type, char *opt_drc_name,
+   bool match_drc_index, bool ck_php_type,
+   void *data);
+
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE 10
#define REMOTE_DISTANCE 20



[RFC 0/6] powerpc/pseries: Refactor code to centralize drcinfo parsing

2018-12-14 Thread Michael Bringmann
The pseries-specific drc-info feature is currently implemented both
within and outside of the powerpc pseries
code.  This patch set moves the parsing code for the pseries-specific
device-tree properties ibm,drc-indexes, ibm,drc-names, ibm,drc-types,
ibm,drc-power-domains, and the compressed ibm,drc-info into the
platform-specific code for the Pseries features.

Signed-off-by: Michael W. Bringmann 

Michael Bringmann (6):
  powerpc:/drc Define interface to acquire arch-specific drc info
  pseries/drcinfo: Fix bug parsing ibm,drc-info
  pseries/drcinfo: Pseries impl of arch_find_drc_info
  powerpc/pseries: Use common drcinfo parsing
  powerpc/pci/hotplug: Use common drcinfo parsing
  powerpc: Enable support for ibm,drc-info devtree property



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-4.20-4 tag

2018-12-14 Thread pr-tracker-bot
The pull request you sent on Sat, 15 Dec 2018 01:26:19 +1100:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.20-4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4645453cefcebdff9db26f02cf325607357295c4

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [PATCH] PCI/AER: only insert one element into kfifo

2018-12-14 Thread Bjorn Helgaas
On Wed, Dec 12, 2018 at 04:32:30PM +0800, Yanjiang Jin wrote:
> Commit ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to
> insert locked elements") replaced kfifo_put() with kfifo_in_spinlocked().
> 
> But as "kfifo_in(fifo, buf, n)" describes:
> " * @n: number of elements to be added".
> 
> We want to insert only one element into kfifo, not "sizeof(entry) = 16".
> Without this patch, we would get 15 uninitialized elements.
> 
> Signed-off-by: Yanjiang Jin 

Since this fixes ecae65e133f2, which was applied for v4.20, I applied
this with Keith's reviewed-by to for-linus so we can get it into
v4.20.

For some reason the patch didn't apply, but I can't see why, so I
just applied it by hand.

Thanks!

> ---
>  drivers/pci/pcie/aer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a90a919..fed29de 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> .regs   = aer_regs,
> };
> 
> -   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry),
> +   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1,
>  &aer_recover_ring_lock))
> schedule_work(_recover_work);
> else
> --
> 1.8.3.1
> 
> 


Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-14 Thread Rob Herring
On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>
> From: Frank Rowand 
>
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry.  Remove the node from the
> cache.
>
> Add paranoia checks in of_find_node_by_phandle() as a second level
> of defense (do not return cached node if detached, do not add node
> to cache if detached).
>
> Reported-by: Michael Bringmann 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/base.c   | 29 -
>  drivers/of/dynamic.c|  3 +++
>  drivers/of/of_private.h |  4 
>  3 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d599367cb92a..34a5125713c8 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
>  late_initcall_sync(of_free_phandle_cache);
>  #endif
>
> +/*
> + * Caller must hold devtree_lock.
> + */
> +void __of_free_phandle_cache_entry(phandle handle)
> +{
> +   phandle masked_handle;
> +
> +   if (!handle)
> +   return;
> +
> +   masked_handle = handle & phandle_cache_mask;
> +
> +   if (phandle_cache) {
> +   if (phandle_cache[masked_handle] &&
> +   handle == phandle_cache[masked_handle]->phandle) {
> +   of_node_put(phandle_cache[masked_handle]);
> +   phandle_cache[masked_handle] = NULL;
> +   }
> +   }
> +}
> +
>  void of_populate_phandle_cache(void)
>  {
> unsigned long flags;
> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle handle)
> if (phandle_cache[masked_handle] &&
> handle == phandle_cache[masked_handle]->phandle)
> np = phandle_cache[masked_handle];
> +   if (np && of_node_check_flag(np, OF_DETACHED)) {
> +   of_node_put(np);
> +   phandle_cache[masked_handle] = NULL;

This should never happen, right? Any time we set OF_DETACHED, the
entry should get removed from the cache. I think we want a WARN here
in case we're in an unexpected state.

Rob


Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-14 Thread Rob Herring
On Fri, Dec 14, 2018 at 12:43 AM  wrote:
>
> From: Frank Rowand 
>
> The phandle cache contains struct device_node pointers.  The refcount
> of the pointers was not incremented while in the cache, allowing use
> after free error after kfree() of the node.  Add the proper increment
> and decrement of the use count.

Since we pre-populate the cache at boot, all the nodes will have a ref
count and will never be freed unless we happen to repopulate the whole
cache. That doesn't seem ideal. The node pointer is not "in use" just
because it is in the cache.

Rob


Re: use generic DMA mapping code in powerpc V4

2018-12-14 Thread Christoph Hellwig
On Fri, Dec 14, 2018 at 01:00:26PM +0100, Christian Zigotzky wrote:
> On 12 December 2018 at 3:15PM, Christoph Hellwig wrote:
> > Thanks for bisecting.  I've spent some time going over the conversion
> > but can't really pinpoint it.  I have three little patches that switch
> > parts of the code to the generic version.  This is on top of the
> > last good commmit (977706f9755d2d697aa6f45b4f9f0e07516efeda).
> >
> Can you check with which one things stop working?
>
> Hello Christoph,
>
> Great news! All your patches work!

Not so great, because that means I still have no idea what broke..

> I tested all your patches (including the patch '0004-alloc-free.patch' 
> today) and the PASEMI onboard ethernet works and the P5020 board boots 
> without any problems. Thank you for your work!
> I have a few days off. That means, I will work less and only for the A-EON 
> first level Linux support. I can test again on Thursday next week.

Enjoy your days off!

I think I'll need to prepare something new that is better bisectable,
and use that opportunity to reason about the changes a little more.


Re: [PATCH 12/34] powerpc/cell: move dma direct window setup out of dma_configure

2018-12-14 Thread Christoph Hellwig
On Sat, Dec 15, 2018 at 12:29:11AM +1100, Michael Ellerman wrote:
> I think the problem is that we don't want to set iommu_bypass_supported
> unless cell_iommu_fixed_mapping_init() succeeds.
> 
> Yep. This makes it work for me on cell on top of your v5.

Thanks, this looks good.  I've folded it with the slight change of moving
the iommu_bypass_supported setup into cell_iommu_fixed_mapping_init.


[PATCH] lkdtm: Add a tests for NULL pointer dereference

2018-12-14 Thread Christophe Leroy
Introduce lkdtm tests for NULL pointer dereference: check
access to, or execution at, the NULL address.

Signed-off-by: Christophe Leroy 
---
 drivers/misc/lkdtm/core.c  |  2 ++
 drivers/misc/lkdtm/lkdtm.h |  2 ++
 drivers/misc/lkdtm/perms.c | 18 ++
 3 files changed, 22 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index bc76756b7eda..36910e1d5c09 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -157,7 +157,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(EXEC_VMALLOC),
CRASHTYPE(EXEC_RODATA),
CRASHTYPE(EXEC_USERSPACE),
+   CRASHTYPE(EXEC_NULL),
CRASHTYPE(ACCESS_USERSPACE),
+   CRASHTYPE(ACCESS_NULL),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
CRASHTYPE(WRITE_KERN),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..b69ee004a3f7 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -45,7 +45,9 @@ void lkdtm_EXEC_KMALLOC(void);
 void lkdtm_EXEC_VMALLOC(void);
 void lkdtm_EXEC_RODATA(void);
 void lkdtm_EXEC_USERSPACE(void);
+void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
+void lkdtm_ACCESS_NULL(void);
 
 /* lkdtm_refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index fa54add6375a..62f76d506f04 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -164,6 +164,11 @@ void lkdtm_EXEC_USERSPACE(void)
vm_munmap(user_addr, PAGE_SIZE);
 }
 
+void lkdtm_EXEC_NULL(void)
+{
+   execute_location(NULL, CODE_AS_IS);
+}
+
 void lkdtm_ACCESS_USERSPACE(void)
 {
unsigned long user_addr, tmp = 0;
@@ -195,6 +200,19 @@ void lkdtm_ACCESS_USERSPACE(void)
vm_munmap(user_addr, PAGE_SIZE);
 }
 
+void lkdtm_ACCESS_NULL(void)
+{
+   unsigned long tmp;
+   unsigned long *ptr = (unsigned long *)NULL;
+
+   pr_info("attempting bad read at %px\n", ptr);
+   tmp = *ptr;
+   tmp += 0xc0dec0de;
+
+   pr_info("attempting bad write at %px\n", ptr);
+   *ptr = tmp;
+}
+
 void __init lkdtm_perms_init(void)
 {
/* Make sure we can write to __ro_after_init values during __init */
-- 
2.13.3



[PATCH v2] powerpc/mm: make NULL pointer deferences explicit on bad page faults.

2018-12-14 Thread Christophe Leroy
As on several other arches, including x86, this patch makes it explicit
that a bad page fault is a NULL pointer dereference when the fault
address is lower than PAGE_SIZE.

In the meantime, this patch also makes all bad_page_fault() messages
shorter so that they remain on a single line, and it prefixes them with
"BUG: " so that they are easily grepped.

Signed-off-by: Christophe Leroy 
---
 v2: Taking into account comments from Michael

 arch/powerpc/mm/fault.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 01b9bcc7fa85..3398291f4785 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -636,21 +636,24 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
switch (TRAP(regs)) {
case 0x300:
case 0x380:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "data at address 0x%08lx\n", regs->dar);
+   if (regs->dar < PAGE_SIZE)
+   pr_alert("BUG: Kernel NULL pointer dereference");
+   else
+   pr_alert("BUG: Unable to handle kernel data access");
+   pr_cont(" at 0x%08lx\n", regs->dar);
break;
case 0x400:
case 0x480:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "instruction fetch\n");
+   pr_alert("BUG: Unable to handle kernel instruction fetch%s",
+regs->nip < PAGE_SIZE ? " (NULL pointer ?)\n" : "\n");
break;
case 0x600:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "unaligned access at address 0x%08lx\n", regs->dar);
+   pr_alert("BUG: Unable to handle kernel unaligned access at 0x%08lx\n",
+regs->dar);
break;
default:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "unknown fault\n");
+   pr_alert("BUG: Unable to handle unknown paging fault at 0x%08lx\n",
+regs->dar);
break;
}
printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n",
-- 
2.13.3



[GIT PULL] Please pull powerpc/linux.git powerpc-4.20-4 tag

2018-12-14 Thread Michael Ellerman
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Linus,

Please pull some more powerpc fixes for 4.20:

The following changes since commit b2fed34a628df6118b5d4e13f49a33e15f704fa9:

  selftests/powerpc: Adjust wild_bctr to build with old binutils (2018-11-15 23:05:17 +1100)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.20-4

for you to fetch changes up to a225f1567405558fb5410e9b2b90805819df1c67:

  powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call (2018-12-10 15:19:58 +1100)

- --
powerpc fixes for 4.20 #4

One notable fix for our change to split pt_regs between user/kernel, we forgot
to update BPF to use the user-visible type which was an ABI break for BPF
programs.

A slightly ugly but minimal fix to do_syscall_trace_enter() so that we use
tracehook_report_syscall_entry() properly. We'll rework the code in next to
avoid the empty if body.

Seven commits fixing bugs in the new papr_scm (Storage Class Memory) driver.
The driver was finally able to be tested on the other hypervisor which exposed
several bugs. The fixes are all fairly minimal at least.

Fix a crash in our MSI code if an MSI-capable device is plugged into a non-MSI
capable PHB, only seen on older hardware (MPC8378).

Fix our legacy serial code to look for "stdout-path" since the device trees were
updated to use that instead of "linux,stdout-path".

A change to the COFF zImage code to fix booting old powermacs.

A couple of minor build fixes.

Thanks to:
  Benjamin Herrenschmidt, Daniel Axtens, Dmitry V. Levin, Elvira Khabirova,
  Oliver O'Halloran, Paul Mackerras, Radu Rendec, Rob Herring, Sandipan Das.

- --
Benjamin Herrenschmidt (1):
  powerpc: Look for "stdout-path" when setting up legacy consoles

Elvira Khabirova (1):
  powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

Michael Ellerman (2):
  powerpc/mm: Fix linux page tables build with some configs
  powerpc/boot: Fix build failures with -j 1

Oliver O'Halloran (7):
  powerpc/papr_scm: Use depend instead of select
  powerpc/papr_scm: Fix resource end address
  powerpc/papr_scm: Update DT properties
  powerpc/papr_scm: Remove endian conversions
  powerpc/papr_scm: Fix DIMM device registration race
  powerpc/papr_scm: Use ibm,unit-guid as the iset cookie
  powerpc/mm: Fallback to RAM if the altmap is unusable

Paul Mackerras (1):
  powerpc: Fix COFF zImage booting on old powermacs

Radu Rendec (1):
  powerpc/msi: Fix NULL pointer access in teardown code

Sandipan Das (1):
  powerpc/bpf: Fix broken uapi for BPF_PROG_TYPE_PERF_EVENT


 arch/powerpc/boot/Makefile |  2 +-
 arch/powerpc/boot/crt0.S   |  4 ++-
 arch/powerpc/include/asm/perf_event.h  |  2 ++
 arch/powerpc/include/uapi/asm/Kbuild   |  1 -
 arch/powerpc/include/uapi/asm/bpf_perf_event.h |  9 ++
 arch/powerpc/kernel/legacy_serial.c|  6 +++-
 arch/powerpc/kernel/msi.c  |  7 -
 arch/powerpc/kernel/ptrace.c   |  7 -
 arch/powerpc/mm/dump_linuxpagetables.c |  1 +
 arch/powerpc/mm/init_64.c  | 19 +++--
 arch/powerpc/platforms/pseries/Kconfig |  3 +-
 arch/powerpc/platforms/pseries/papr_scm.c  | 39 --
 12 files changed, 80 insertions(+), 20 deletions(-)
 create mode 100644 arch/powerpc/include/uapi/asm/bpf_perf_event.h
-BEGIN PGP SIGNATURE-

iQIcBAEBAgAGBQJcE71eAAoJEFHr6jzI4aWAQUkP/0Z1YlNpl2Zf9GG1J1QsBCtj
H2KS/a01GciH6TO16ULovjFhJppcwPGOCYKU/elnge4cqC+c7sZDDLeXrB6byzfP
9jYbuC3QZ0ct1d34x7vzg/kplutMu1HClxUBU5g7QF9l2b70rB6UWf1WSJnCLgCg
MEwQscEPymJJpLO5r0B1/pCijLN2Rpy0vMbZbOoQW4/G/6qQ3d4ygDdWrh4lJNFf
LwWYLXi5SZ/1jFIb6/wYuomT4WIfD+JCaGsH3THAm7sfC7qx+nUOCGtA81vHGUaK
nW14snebk0viL8STJ4MO1IFJ9x3MsQAylpyyfcdzQ3Q3+rRkaPWwHrvOcnNavwfw
+PKDyqV+Ot9+2Es79ZDGeF68q7L7Q4qtZIIL9ARh5pokvMiycsYPOcMQW3IIbkZQ
aFS3SAQalGaSAlNp5NDnNoZ4rhb+tlR3k6fCbHkuPehj3IAvVzzc6XY3w2rr0wFy
ZAgnv+54aI7CNcasRiu+KStLYflr2xorjrRHWigli9KtyB+uZbx6IzDDGgQpPHAd
rUoJujp7G/GUclFJ42LF6MBDKTGU2XJP0rogEkNhmCY8RwbDPsvw52CYosEJmrMo
rMgVbO35eRmCmF+cbaRvRzsVXYVuPFZfTt8wF2SYwDzSx49b2tW696Xr5+mkUGw6
vu2XQ1BG1C/0mbIjgjKm
=ABj8
-END PGP SIGNATURE-


[PATCH] powerpc/configs: Don't enable PPC_EARLY_DEBUG in defconfigs

2018-12-14 Thread Michael Ellerman
This reverts the remains of commit b9ef7d6b11c1 ("powerpc: Update
default configurations").

That commit was preceded by a commit which added a config option to
control use of BOOTX for early debug, i.e. PPC_EARLY_DEBUG_BOOTX, and
the subsequent update of the defconfigs was intended to preserve
behaviour by enabling the new config option.

However, enabling PPC_EARLY_DEBUG had other consequences, notably
causing us to register the udbg console at the end of udbg_early_init().

This means on a system which doesn't have anything that BOOTX can
use (most systems), we register the udbg console very early but the
bootx code just throws everything away, meaning early boot messages
are never printed to the console.

What we want to happen is for the udbg console to only be registered
later (from setup_arch()) once we've setup udbg_putc, and then all
early boot messages will be replayed.

Fixes: b9ef7d6b11c1 ("powerpc: Update default configurations")
Reported-by: Torsten Duwe 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/configs/g5_defconfig | 1 -
 arch/powerpc/configs/maple_defconfig  | 1 -
 arch/powerpc/configs/pmac32_defconfig | 1 -
 arch/powerpc/configs/ppc64_defconfig  | 1 -
 arch/powerpc/configs/ppc6xx_defconfig | 1 -
 5 files changed, 5 deletions(-)

diff --git a/arch/powerpc/configs/g5_defconfig b/arch/powerpc/configs/g5_defconfig
index f686cc1eac0b..ceb3c770786f 100644
--- a/arch/powerpc/configs/g5_defconfig
+++ b/arch/powerpc/configs/g5_defconfig
@@ -246,7 +246,6 @@ CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_LATENCYTOP=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
diff --git a/arch/powerpc/configs/maple_defconfig b/arch/powerpc/configs/maple_defconfig
index f71eddafb02f..c5f2005005d3 100644
--- a/arch/powerpc/configs/maple_defconfig
+++ b/arch/powerpc/configs/maple_defconfig
@@ -108,7 +108,6 @@ CONFIG_LATENCYTOP=y
 CONFIG_XMON=y
 CONFIG_XMON_DEFAULT=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_ECB=m
 CONFIG_CRYPTO_PCBC=m
 # CONFIG_CRYPTO_HW is not set
diff --git a/arch/powerpc/configs/pmac32_defconfig b/arch/powerpc/configs/pmac32_defconfig
index 62948d198d7f..50b610b48914 100644
--- a/arch/powerpc/configs/pmac32_defconfig
+++ b/arch/powerpc/configs/pmac32_defconfig
@@ -297,7 +297,6 @@ CONFIG_LATENCYTOP=y
 CONFIG_XMON=y
 CONFIG_XMON_DEFAULT=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_MD4=m
 CONFIG_CRYPTO_SHA512=m
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 294a0b9d6038..91fdb619b484 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -376,4 +376,3 @@ CONFIG_FTR_FIXUP_SELFTEST=y
 CONFIG_MSI_BITMAP_SELFTEST=y
 CONFIG_XMON=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
diff --git a/arch/powerpc/configs/ppc6xx_defconfig b/arch/powerpc/configs/ppc6xx_defconfig
index 7ee736f20774..53687c3a70c4 100644
--- a/arch/powerpc/configs/ppc6xx_defconfig
+++ b/arch/powerpc/configs/ppc6xx_defconfig
@@ -1155,7 +1155,6 @@ CONFIG_STACK_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_XMON=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_SECURITY=y
 CONFIG_SECURITY_NETWORK=y
 CONFIG_SECURITY_NETWORK_XFRM=y
-- 
2.17.2



Re: [PATCH 12/34] powerpc/cell: move dma direct window setup out of dma_configure

2018-12-14 Thread Michael Ellerman
Christoph Hellwig  writes:
> On Sun, Dec 09, 2018 at 09:23:39PM +1100, Michael Ellerman wrote:
>> Christoph Hellwig  writes:
>> 
>> > Configure the dma settings at device setup time, and stop playing games
>> > with get_pci_dma_ops.  This prepares for using the common dma_configure
>> > code later on.
>> >
>> > Signed-off-by: Christoph Hellwig 
>> > ---
>> >  arch/powerpc/platforms/cell/iommu.c | 20 +++-
>> >  1 file changed, 11 insertions(+), 9 deletions(-)
>> 
>> This one's crashing, haven't dug into why yet:
>
> Can you provide a gdb assembly of the exact crash site?  This looks
> like for some odd reason the DT structures aren't fully setup by the
> time we are probing the device, which seems odd.

It's dev->of_node which is NULL.

Because we were passed a platform_device which doesn't have an of_node.

It's the cbe-mic device created in cell_publish_devices().

I can fix that by simply checking for a NULL node, then the system boots
but then I have no network devices, due to:

  tg3 :00:01.0: enabling device (0140 -> 0142)
  tg3 :00:01.0: DMA engine test failed, aborting
  tg3: probe of :00:01.0 failed with error -12
  tg3 :00:01.1: enabling device (0140 -> 0142)
  tg3 :00:01.1: DMA engine test failed, aborting
  tg3: probe of :00:01.1 failed with error -12


I think the problem is that we don't want to set iommu_bypass_supported
unless cell_iommu_fixed_mapping_init() succeeds.

Yep. This makes it work for me on cell on top of your v5.

cheers


diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 348a815779c1..8329fda17cc8 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -813,6 +813,10 @@ static u64 cell_iommu_get_fixed_address(struct device *dev)
int i, len, best, naddr, nsize, pna, range_size;
 
np = of_node_get(dev->of_node);
+   if (!np)
+   /* We can be called for platform devices that have no of_node */
+   goto out;
+
while (1) {
naddr = of_n_addr_cells(np);
nsize = of_n_size_cells(np);
@@ -1065,8 +1069,11 @@ static int __init cell_iommu_init(void)
/* Setup various callbacks */
cell_pci_controller_ops.dma_dev_setup = cell_pci_dma_dev_setup;
 
-   if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
+   if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0) {
+   cell_pci_controller_ops.iommu_bypass_supported =
+   cell_pci_iommu_bypass_supported;
goto done;
+   }
 
/* Create an iommu for each /axon node.  */
for_each_node_by_name(np, "axon") {
@@ -1085,10 +1092,6 @@ static int __init cell_iommu_init(void)
}
  done:
/* Setup default PCI iommu ops */
-   if (!iommu_fixed_disabled) {
-   cell_pci_controller_ops.iommu_bypass_supported =
-   cell_pci_iommu_bypass_supported;
-   }
	set_pci_dma_ops(&dma_iommu_ops);
cell_iommu_enabled = true;
  bail:


Re: [PATCH 2/2] s390/pci: handle function enumeration after sriov enablement

2018-12-14 Thread Christoph Hellwig
On Fri, Dec 14, 2018 at 05:12:45AM -0800, Christoph Hellwig wrote:
> On Thu, Dec 13, 2018 at 06:54:28PM +0100, Sebastian Ott wrote:
> > Implement pcibios_sriov_{add|del}_vfs as empty functions. VF
> > creation will be triggered by the hotplug code.
> 
> And instead of having the arch suplply a no-op arch override I
> think it would be better to have the config option just stub it
> out in common code.

Or in fact maybe even a runtime flag in struct pci_dev.  Who knows
if all future s390 PCIe busses will have exactly the same behavior
or if we eventually get the standards-compliant behavior back?


Re: [PATCH 2/2] s390/pci: handle function enumeration after sriov enablement

2018-12-14 Thread Christoph Hellwig
On Thu, Dec 13, 2018 at 06:54:28PM +0100, Sebastian Ott wrote:
> Implement pcibios_sriov_{add|del}_vfs as empty functions. VF
> creation will be triggered by the hotplug code.

And instead of having the arch supply a no-op arch override I
think it would be better to have the config option just stub it
out in common code.


Re: [PATCH 1/2] PCI: provide pcibios_sriov_add_vfs

2018-12-14 Thread Christoph Hellwig
On Thu, Dec 13, 2018 at 06:54:27PM +0100, Sebastian Ott wrote:
> Move VF detection and device creation code to weak functions
> such that architectures can provide a different implementation.

No magic weak functions please.  If you really have a reason to
override the implementation, add a proper Kconfig symbol and
an explicit override.


Re: use generic DMA mapping code in powerpc V4

2018-12-14 Thread Christian Zigotzky

On 12 December 2018 at 3:15PM, Christoph Hellwig wrote:
> Thanks for bisecting.  I've spent some time going over the conversion
> but can't really pinpoint it.  I have three little patches that switch
> parts of the code to the generic version.  This is on top of the
> last good commmit (977706f9755d2d697aa6f45b4f9f0e07516efeda).
>
> Can you check with which one things stop working?

Hello Christoph,

Great news! All your patches work!

I tested all your patches (including the patch '0004-alloc-free.patch' 
today) and the PASEMI onboard ethernet works and the P5020 board boots 
without any problems. Thank you for your work!
I have a few days off. That means, I will work less and only for the 
A-EON first level Linux support. I can test again on Thursday next week.


Have a nice weekend!

Cheers,
Christian


[PATCH v1 9/9] mm: better document PG_reserved

2018-12-14 Thread David Hildenbrand
The usage of PG_reserved and how PG_reserved pages are to be treated is
buried deep down in different parts of the kernel. Let's shine some light
onto these details by documenting current users and expected
behavior.

In particular, clarify the "Some of them might not even exist" case.
These are physical memory gaps that will never be dumped as they
are not marked as IORESOURCE_SYSRAM. PG_reserved does in general not
hinder anybody from dumping or swapping. In some cases, these pages
will not be stored in the hibernation image.

Cc: Andrew Morton 
Cc: Stephen Rothwell 
Cc: Pavel Tatashin 
Cc: Michal Hocko 
Cc: Alexander Duyck 
Cc: Matthew Wilcox 
Cc: Anthony Yznaga 
Cc: Miles Chen 
Cc: yi.z.zh...@linux.intel.com
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 include/linux/page-flags.h | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 808b4183e30d..9de2e941cbd5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -17,8 +17,37 @@
 /*
  * Various page->flags bits:
  *
- * PG_reserved is set for special pages, which can never be swapped out. Some
- * of them might not even exist...
+ * PG_reserved is set for special pages. The "struct page" of such a page
+ * should in general not be touched (e.g. set dirty) except by their owner.
+ * Pages marked as PG_reserved include:
+ * - Pages part of the kernel image (including vDSO) and similar (e.g. BIOS,
+ *   initrd, HW tables)
+ * - Pages reserved or allocated early during boot (before the page allocator
+ *   was initialized). This includes (depending on the architecture) the
+ *   initial vmmap, initial page tables, crashkernel, elfcorehdr, and much
+ *   much more. Once (if ever) freed, PG_reserved is cleared and they will
+ *   be given to the page allocator.
+ * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
+ *   to read/write these pages might end badly. Don't touch!
+ * - The zero page(s)
+ * - Pages not added to the page allocator when onlining a section because
+ *   they were excluded via the online_page_callback() or because they are
+ *   PG_hwpoison.
+ * - Pages allocated in the context of kexec/kdump (loaded kernel image,
+ *   control pages, vmcoreinfo)
+ * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
+ *   not marked PG_reserved (as they might be in use by somebody else who does
+ *   not respect the caching strategy).
+ * - Pages part of an offline section (struct pages of offline sections should
+ *   not be trusted as they will be initialized when first onlined).
+ * - MCA pages on ia64
+ * - Pages holding CPU notes for POWER Firmware Assisted Dump
+ * - Device memory (e.g. PMEM, DAX, HMM)
+ * Some PG_reserved pages will be excluded from the hibernation image.
+ * PG_reserved does in general not hinder anybody from dumping or swapping
+ * and is no longer required for remap_pfn_range(). ioremap might require it.
+ * Consequently, PG_reserved for a page mapped into user space can indicate
+ * the zero page, the vDSO, MMIO pages or device memory.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by
-- 
2.17.2



[PATCH v1 8/9] ia64: perfmon: Don't mark buffer pages as PG_reserved

2018-12-14 Thread David Hildenbrand
In the old days, remap_pfn_range() required pages to be marked as
PG_reserved, so they would e.g. never get swapped out. This was required
for special mappings. Nowadays, this is fully handled via the VMA
(VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP inside remap_pfn_range()
to be precise). PG_reserved is no longer required; it is only a relic
from the past.

So only architecture-specific MM handling might still require it (e.g. to
detect such pages as MMIO pages). As there are no architecture-specific
checks for PageReserved() apart from MCA handling in ia64 code, this can
go. Use simple vzalloc()/vfree() instead.

Note that before calling vzalloc(), size has already been aligned to
PAGE_SIZE, no need to align again.

Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Oleg Nesterov 
Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 arch/ia64/kernel/perfmon.c | 59 +++---
 1 file changed, 4 insertions(+), 55 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index a9d4dc6c0427..e1b9287dc455 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -583,17 +583,6 @@ pfm_put_task(struct task_struct *task)
if (task != current) put_task_struct(task);
 }
 
-static inline void
-pfm_reserve_page(unsigned long a)
-{
-   SetPageReserved(vmalloc_to_page((void *)a));
-}
-static inline void
-pfm_unreserve_page(unsigned long a)
-{
-   ClearPageReserved(vmalloc_to_page((void*)a));
-}
-
 static inline unsigned long
 pfm_protect_ctx_ctxsw(pfm_context_t *x)
 {
@@ -817,44 +806,6 @@ pfm_reset_msgq(pfm_context_t *ctx)
DPRINT(("ctx=%p msgq reset\n", ctx));
 }
 
-static void *
-pfm_rvmalloc(unsigned long size)
-{
-   void *mem;
-   unsigned long addr;
-
-   size = PAGE_ALIGN(size);
-   mem  = vzalloc(size);
-   if (mem) {
-   //printk("perfmon: CPU%d pfm_rvmalloc(%ld)=%p\n", smp_processor_id(), size, mem);
-   addr = (unsigned long)mem;
-   while (size > 0) {
-   pfm_reserve_page(addr);
-   addr+=PAGE_SIZE;
-   size-=PAGE_SIZE;
-   }
-   }
-   return mem;
-}
-
-static void
-pfm_rvfree(void *mem, unsigned long size)
-{
-   unsigned long addr;
-
-   if (mem) {
-   DPRINT(("freeing physical buffer @%p size=%lu\n", mem, size));
-   addr = (unsigned long) mem;
-   while ((long) size > 0) {
-   pfm_unreserve_page(addr);
-   addr+=PAGE_SIZE;
-   size-=PAGE_SIZE;
-   }
-   vfree(mem);
-   }
-   return;
-}
-
 static pfm_context_t *
 pfm_context_alloc(int ctx_flags)
 {
@@ -1499,7 +1450,7 @@ pfm_free_smpl_buffer(pfm_context_t *ctx)
/*
 * free the buffer
 */
-   pfm_rvfree(ctx->ctx_smpl_hdr, ctx->ctx_smpl_size);
+   vfree(ctx->ctx_smpl_hdr);
 
ctx->ctx_smpl_hdr  = NULL;
ctx->ctx_smpl_size = 0UL;
@@ -2138,7 +2089,7 @@ pfm_close(struct inode *inode, struct file *filp)
 * All memory free operations (especially for vmalloc'ed memory)
 * MUST be done with interrupts ENABLED.
 */
-   if (smpl_buf_addr)  pfm_rvfree(smpl_buf_addr, smpl_buf_size);
+   vfree(smpl_buf_addr);
 
/*
 * return the memory used by the context
@@ -2267,10 +2218,8 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 
/*
 * We do the easy to undo allocations first.
-*
-* pfm_rvmalloc(), clears the buffer, so there is no leak
 */
-   smpl_buf = pfm_rvmalloc(size);
+   smpl_buf = vzalloc(size);
if (smpl_buf == NULL) {
DPRINT(("Can't allocate sampling buffer\n"));
return -ENOMEM;
@@ -2347,7 +2296,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 error:
vm_area_free(vma);
 error_kmem:
-   pfm_rvfree(smpl_buf, size);
+   vfree(smpl_buf);
 
return -ENOMEM;
 }
-- 
2.17.2



[PATCH v1 7/9] arm64: kdump: No need to mark crashkernel pages manually PG_reserved

2018-12-14 Thread David Hildenbrand
The crashkernel is reserved via memblock_reserve(). memblock_free_all()
will call free_low_memory_core_early(), which will go over all reserved
memblocks, marking the pages as PG_reserved.

So manually marking pages as PG_reserved is not necessary; they are
already in the desired state (otherwise they would have been handed over
to the buddy allocator as free pages and bad things would happen).

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: James Morse 
Cc: Bhupesh Sharma 
Cc: David Hildenbrand 
Cc: Mark Rutland 
Cc: Dave Kleikamp 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Florian Fainelli 
Cc: Stefan Agner 
Cc: Laura Abbott 
Cc: Greg Hackmann 
Cc: Johannes Weiner 
Cc: Kristina Martsenko 
Cc: CHANDAN VN 
Cc: AKASHI Takahiro 
Cc: Logan Gunthorpe 
Signed-off-by: David Hildenbrand 
---
 arch/arm64/kernel/machine_kexec.c |  2 +-
 arch/arm64/mm/init.c  | 27 ---
 2 files changed, 1 insertion(+), 28 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 6f0587b5e941..66b5d697d943 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -321,7 +321,7 @@ void crash_post_resume(void)
  * but does not hold any data of loaded kernel image.
  *
  * Note that all the pages in crash dump kernel memory have been initially
- * marked as Reserved in kexec_reserve_crashkres_pages().
+ * marked as Reserved as memory was allocated via memblock_reserve().
  *
  * In hibernation, the pages which are Reserved and yet "nosave" are excluded
  * from the hibernation iamge. crash_is_nosave() does thich check for crash
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index a8f2e4792ef9..9dcfa809b7ab 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -118,35 +118,10 @@ static void __init reserve_crashkernel(void)
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
 }
-
-static void __init kexec_reserve_crashkres_pages(void)
-{
-#ifdef CONFIG_HIBERNATION
-   phys_addr_t addr;
-   struct page *page;
-
-   if (!crashk_res.end)
-   return;
-
-   /*
-* To reduce the size of hibernation image, all the pages are
-* marked as Reserved initially.
-*/
-   for (addr = crashk_res.start; addr < (crashk_res.end + 1);
-   addr += PAGE_SIZE) {
-   page = phys_to_page(addr);
-   SetPageReserved(page);
-   }
-#endif
-}
 #else
 static void __init reserve_crashkernel(void)
 {
 }
-
-static void __init kexec_reserve_crashkres_pages(void)
-{
-}
 #endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
@@ -586,8 +561,6 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
memblock_free_all();
 
-   kexec_reserve_crashkres_pages();
-
mem_init_print_info(NULL);
 
/*
-- 
2.17.2



[PATCH v1 6/9] arm64: kexec: no need to ClearPageReserved()

2018-12-14 Thread David Hildenbrand
This will be done by free_reserved_page().

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Bhupesh Sharma 
Cc: James Morse 
Cc: Marc Zyngier 
Cc: Dave Kleikamp 
Cc: Mark Rutland 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Acked-by: James Morse 
Signed-off-by: David Hildenbrand 
---
 arch/arm64/kernel/machine_kexec.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index aa9c94113700..6f0587b5e941 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -361,7 +361,6 @@ void crash_free_reserved_phys_range(unsigned long begin, 
unsigned long end)
 
for (addr = begin; addr < end; addr += PAGE_SIZE) {
page = phys_to_page(addr);
-   ClearPageReserved(page);
free_reserved_page(page);
}
 }
-- 
2.17.2



[PATCH v1 5/9] m68k/mm: use __ClearPageReserved()

2018-12-14 Thread David Hildenbrand
Here, PG_reserved is cleared from memory that is part of the kernel
image (and was therefore marked as PG_reserved at boot). Avoid
open-coding the flag manipulation; use __ClearPageReserved() instead.

Cc: Geert Uytterhoeven 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Signed-off-by: David Hildenbrand 
---
 arch/m68k/mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/m68k/mm/memory.c b/arch/m68k/mm/memory.c
index b86a2e21693b..227c04fe60d2 100644
--- a/arch/m68k/mm/memory.c
+++ b/arch/m68k/mm/memory.c
@@ -51,7 +51,7 @@ void __init init_pointer_table(unsigned long ptable)
pr_debug("init_pointer_table: %lx, %x\n", ptable, PD_MARKBITS(dp));
 
/* unreserve the page so it's possible to free that page */
-   PD_PAGE(dp)->flags &= ~(1 << PG_reserved);
+   __ClearPageReserved(PD_PAGE(dp));
init_page_count(PD_PAGE(dp));
 
return;
-- 
2.17.2



[PATCH v1 4/9] riscv/vdso: don't clear PG_reserved

2018-12-14 Thread David Hildenbrand
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Tobias Klauser 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Acked-by: Palmer Dabbelt 
Signed-off-by: David Hildenbrand 
---
 arch/riscv/kernel/vdso.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index 582cb153eb24..0cd044122234 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -54,7 +54,6 @@ static int __init vdso_init(void)
struct page *pg;
 
pg = virt_to_page(vdso_start + (i << PAGE_SHIFT));
-   ClearPageReserved(pg);
vdso_pagelist[i] = pg;
}
vdso_pagelist[i] = virt_to_page(vdso_data);
-- 
2.17.2



[PATCH v1 3/9] powerpc/vdso: don't clear PG_reserved

2018-12-14 Thread David Hildenbrand
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Kees Cook 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/kernel/vdso.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 65b3bdb99f0b..d59dc2e9a695 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -795,7 +795,6 @@ static int __init vdso_init(void)
BUG_ON(vdso32_pagelist == NULL);
for (i = 0; i < vdso32_pages; i++) {
struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
-   ClearPageReserved(pg);
get_page(pg);
vdso32_pagelist[i] = pg;
}
@@ -809,7 +808,6 @@ static int __init vdso_init(void)
BUG_ON(vdso64_pagelist == NULL);
for (i = 0; i < vdso64_pages; i++) {
struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
-   ClearPageReserved(pg);
get_page(pg);
vdso64_pagelist[i] = pg;
}
-- 
2.17.2



[PATCH v1 2/9] s390/vdso: don't clear PG_reserved

2018-12-14 Thread David Hildenbrand
The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Suggested-by: Martin Schwidefsky 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Matthew Wilcox 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Vasily Gorbik 
Cc: Kees Cook 
Cc: Souptick Joarder 
Cc: Andrew Morton 
Cc: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 arch/s390/kernel/vdso.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index ebe748a9f472..9e24d23c26c0 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -292,7 +292,6 @@ static int __init vdso_init(void)
BUG_ON(vdso32_pagelist == NULL);
for (i = 0; i < vdso32_pages - 1; i++) {
struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
-   ClearPageReserved(pg);
get_page(pg);
vdso32_pagelist[i] = pg;
}
@@ -310,7 +309,6 @@ static int __init vdso_init(void)
BUG_ON(vdso64_pagelist == NULL);
for (i = 0; i < vdso64_pages - 1; i++) {
struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
-   ClearPageReserved(pg);
get_page(pg);
vdso64_pagelist[i] = pg;
}
-- 
2.17.2



[PATCH v1 1/9] agp: efficeon: no need to set PG_reserved on GATT tables

2018-12-14 Thread David Hildenbrand
The l1 GATT page table is kept in a special on-chip page with 64 entries.
We allocate the l2 page table pages via get_zeroed_page() and enter them
into the table. These l2 pages are modified accordingly when
inserting/removing memory via efficeon_insert_memory and
efficeon_remove_memory.

Apart from that, these pages are not exposed or ioremap'ed. We can stop
setting them reserved (the marking was probably copied from generic code).

Cc: David Airlie 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Signed-off-by: David Hildenbrand 
---
 drivers/char/agp/efficeon-agp.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/char/agp/efficeon-agp.c b/drivers/char/agp/efficeon-agp.c
index 7f88490b5479..c53f0f9ef5b0 100644
--- a/drivers/char/agp/efficeon-agp.c
+++ b/drivers/char/agp/efficeon-agp.c
@@ -163,7 +163,6 @@ static int efficeon_free_gatt_table(struct agp_bridge_data *bridge)
unsigned long page = efficeon_private.l1_table[index];
if (page) {
efficeon_private.l1_table[index] = 0;
-   ClearPageReserved(virt_to_page((char *)page));
free_page(page);
freed++;
}
@@ -219,7 +218,6 @@ static int efficeon_create_gatt_table(struct agp_bridge_data *bridge)
efficeon_free_gatt_table(agp_bridge);
return -ENOMEM;
}
-   SetPageReserved(virt_to_page((char *)page));
 
for (offset = 0; offset < PAGE_SIZE; offset += clflush_chunk)
clflush((char *)page+offset);
-- 
2.17.2



[PATCH v1 0/9] mm: PG_reserved cleanups and documentation

2018-12-14 Thread David Hildenbrand
I was recently going over all users of PG_reserved. Short story: it is
difficult and sometimes not really clear whether setting/checking for
PG_reserved is only a relic from the past. It is easy to break things. I
guess I now have a pretty good idea why things are like that
nowadays and how they evolved.

I had way more cleanups in this series initially,
but some architectures take PG_reserved as a way to apply a different
caching strategy (for MMIO pages). So I decided to only include the most
obvious changes (that are less likely to break something). So the big
chunk of remaining manual SetPageReserved users are MMIO/DMA related things
on device buffers.

Most notably, for device memory we will hopefully soon stop setting
PG_reserved. Then the documentation has to be updated.

RFC -> V1:
- Add more details to "mm: better document PG_reserved"
- Add "arm64: kdump: No need to mark crashkernel pages manually
   PG_reserved"
- Add "ia64: perfmon: Don't mark buffer pages as PG_reserved"
- Added ACKs

David Hildenbrand (9):
  agp: efficeon: no need to set PG_reserved on GATT tables
  s390/vdso: don't clear PG_reserved
  powerpc/vdso: don't clear PG_reserved
  riscv/vdso: don't clear PG_reserved
  m68k/mm: use __ClearPageReserved()
  arm64: kexec: no need to ClearPageReserved()
  arm64: kdump: No need to mark crashkernel pages manually PG_reserved
  ia64: perfmon: Don't mark buffer pages as PG_reserved
  mm: better document PG_reserved

 arch/arm64/kernel/machine_kexec.c |  3 +-
 arch/arm64/mm/init.c  | 27 --
 arch/ia64/kernel/perfmon.c| 59 +++
 arch/m68k/mm/memory.c |  2 +-
 arch/powerpc/kernel/vdso.c|  2 --
 arch/riscv/kernel/vdso.c  |  1 -
 arch/s390/kernel/vdso.c   |  2 --
 drivers/char/agp/efficeon-agp.c   |  2 --
 include/linux/page-flags.h| 33 +++--
 9 files changed, 37 insertions(+), 94 deletions(-)

-- 
2.17.2



Re: [PATCH] powerpc/prom: fix early DEBUG messages

2018-12-14 Thread Christophe Leroy




On 14/12/2018 at 07:22, Michael Ellerman wrote:

Christophe Leroy  writes:


diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index fe758cedb93f..d8e56e03c9c6 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -749,7 +749,11 @@ void __init early_init_devtree(void *params)
memblock_allow_resize();
memblock_dump_all();
  
+#ifdef CONFIG_PHYS_64BIT

DBG("Phys. mem: %llx\n", memblock_phys_mem_size());
+#else
+   DBG("Phys. mem: %x\n", memblock_phys_mem_size());
+#endif


Can we just do:

DBG("Phys. mem: %llx\n", (unsigned long long)memblock_phys_mem_size());

?



Yes, that's obviously better, especially as we don't fix the length of
the displayed value.

Christophe


[PATCH v2] powerpc/prom: fix early DEBUG messages

2018-12-14 Thread Christophe Leroy
This patch fixes early DEBUG messages in prom.c:
- Use %px instead of %p to see the addresses
- Cast memblock_phys_mem_size() with (unsigned long long) to
avoid build failure when phys_addr_t is not 64 bits.

Signed-off-by: Christophe Leroy 
---
 v2: cast instead of #ifdef %llx/%x

 arch/powerpc/kernel/prom.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index fe758cedb93f..87a68e2dc531 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -129,7 +129,7 @@ static void __init move_device_tree(void)
p = __va(memblock_phys_alloc(size, PAGE_SIZE));
memcpy(p, initial_boot_params, size);
initial_boot_params = p;
-   DBG("Moved device tree to 0x%p\n", p);
+   DBG("Moved device tree to 0x%px\n", p);
}
 
DBG("<- move_device_tree\n");
@@ -689,7 +689,7 @@ void __init early_init_devtree(void *params)
 {
phys_addr_t limit;
 
-   DBG(" -> early_init_devtree(%p)\n", params);
+   DBG(" -> early_init_devtree(%px)\n", params);
 
/* Too early to BUG_ON(), do it by hand */
if (!early_init_dt_verify(params))
@@ -749,7 +749,7 @@ void __init early_init_devtree(void *params)
memblock_allow_resize();
memblock_dump_all();
 
-   DBG("Phys. mem: %llx\n", memblock_phys_mem_size());
+   DBG("Phys. mem: %llx\n", (unsigned long long)memblock_phys_mem_size());
 
/* We may need to relocate the flat tree, do it now.
 * FIXME .. and the initrd too? */
-- 
2.13.3



[PATCH v2 2/5] powerpc/perf: Rearrange setting of ldbar for thread-imc

2018-12-14 Thread Anju T Sudhakar
LDBAR holds the memory address allocated for each cpu. For thread-imc,
the mode bit (i.e. bit 1) of LDBAR is set to accumulation.
Currently, LDBAR is loaded with the per cpu memory address and the mode
set to accumulation at boot time.

To enable trace-imc, the mode bit of LDBAR should be set to 'trace'. So,
to accommodate the trace-mode of IMC, reposition the setting of LDBAR for
thread-imc to thread_imc_event_add(). Also, reset LDBAR at
thread_imc_event_del().

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index f292a3f284f1..3bef46f8417d 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -806,8 +806,11 @@ static int core_imc_event_init(struct perf_event *event)
 }
 
 /*
- * Allocates a page of memory for each of the online cpus, and write the
- * physical base address of that page to the LDBAR for that cpu.
+ * Allocates a page of memory for each of the online cpus, and load
+ * LDBAR with 0.
+ * The physical base address of the page allocated for a cpu will be
+ * written to the LDBAR for that cpu, when the thread-imc event
+ * is added.
  *
  * LDBAR Register Layout:
  *
@@ -825,7 +828,7 @@ static int core_imc_event_init(struct perf_event *event)
  */
 static int thread_imc_mem_alloc(int cpu_id, int size)
 {
-   u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, cpu_id);
+   u64 *local_mem = per_cpu(thread_imc_mem, cpu_id);
int nid = cpu_to_node(cpu_id);
 
if (!local_mem) {
@@ -842,9 +845,7 @@ static int thread_imc_mem_alloc(int cpu_id, int size)
per_cpu(thread_imc_mem, cpu_id) = local_mem;
}
 
-   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | THREAD_IMC_ENABLE;
-
-   mtspr(SPRN_LDBAR, ldbar_value);
+   mtspr(SPRN_LDBAR, 0);
return 0;
 }
 
@@ -995,6 +996,7 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
 {
int core_id;
struct imc_pmu_ref *ref;
+   u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, smp_processor_id());
 
if (flags & PERF_EF_START)
imc_event_start(event, flags);
@@ -1003,6 +1005,9 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
return -EINVAL;
 
core_id = smp_processor_id() / threads_per_core;
+   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | THREAD_IMC_ENABLE;
+   mtspr(SPRN_LDBAR, ldbar_value);
+
/*
 * imc pmus are enabled only when it is used.
 * See if this is triggered for the first time.
@@ -1034,11 +1039,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
int core_id;
struct imc_pmu_ref *ref;
 
-   /*
-* Take a snapshot and calculate the delta and update
-* the event counter values.
-*/
-   imc_event_update(event);
+   mtspr(SPRN_LDBAR, 0);
 
core_id = smp_processor_id() / threads_per_core;
ref = &core_imc_refc[core_id];
@@ -1057,6 +1058,11 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
ref->refc = 0;
}
mutex_unlock(&ref->lock);
+   /*
+* Take a snapshot and calculate the delta and update
+* the event counter values.
+*/
+   imc_event_update(event);
 }
 
 /* update_pmu_ops : Populate the appropriate operations for "pmu" */
-- 
2.17.1



[PATCH v2 5/5] powerpc/perf: Trace imc PMU functions

2018-12-14 Thread Anju T Sudhakar
Add PMU functions to support trace-imc.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 175 
 1 file changed, 175 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 1f09265c8fb0..32ff0e449fca 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1120,6 +1120,173 @@ static int trace_imc_cpu_init(void)
  ppc_trace_imc_cpu_offline);
 }
 
+static u64 get_trace_imc_event_base_addr(void)
+{
+   return (u64)per_cpu(trace_imc_mem, smp_processor_id());
+}
+
+/*
+ * Function to parse trace-imc data obtained
+ * and to prepare the perf sample.
+ */
+static int trace_imc_prepare_sample(struct trace_imc_data *mem,
+   struct perf_sample_data *data,
+   u64 *prev_tb,
+   struct perf_event_header *header,
+   struct perf_event *event)
+{
+   /* Sanity checks for a valid record */
+   if (be64_to_cpu(READ_ONCE(mem->tb1)) > *prev_tb)
+   *prev_tb = be64_to_cpu(READ_ONCE(mem->tb1));
+   else
+   return -EINVAL;
+
+   if ((be64_to_cpu(READ_ONCE(mem->tb1)) & IMC_TRACE_RECORD_TB1_MASK) !=
+be64_to_cpu(READ_ONCE(mem->tb2)))
+   return -EINVAL;
+
+   /* Prepare perf sample */
+   data->ip =  be64_to_cpu(READ_ONCE(mem->ip));
+   data->period = event->hw.last_period;
+
+   header->type = PERF_RECORD_SAMPLE;
+   header->size = sizeof(*header) + event->header_size;
+   header->misc = 0;
+
+   if (is_kernel_addr(data->ip))
+   header->misc |= PERF_RECORD_MISC_KERNEL;
+   else
+   header->misc |= PERF_RECORD_MISC_USER;
+
+   perf_event_header__init_id(header, data, event);
+
+   return 0;
+}
+
+static void dump_trace_imc_data(struct perf_event *event)
+{
+   struct trace_imc_data *mem;
+   int i, ret;
+   u64 prev_tb = 0;
+
+   mem = (struct trace_imc_data *)get_trace_imc_event_base_addr();
+   for (i = 0; i < (trace_imc_mem_size / sizeof(struct trace_imc_data));
+   i++, mem++) {
+   struct perf_sample_data data;
+   struct perf_event_header header;
+
+   ret = trace_imc_prepare_sample(mem, &data, &prev_tb, &header, event);
+   if (ret) /* Exit, if not a valid record */
+   break;
+   else {
+   /* If this is a valid record, create the sample */
+   struct perf_output_handle handle;
+
+   if (perf_output_begin(&handle, event, header.size))
+   return;
+
+   perf_output_sample(&handle, &header, &data, event);
+   perf_output_end(&handle);
+   }
+   }
+}
+
+static int trace_imc_event_add(struct perf_event *event, int flags)
+{
+   /* Enable the sched_task to start the engine */
+   perf_sched_cb_inc(event->ctx->pmu);
+   return 0;
+}
+
+static void trace_imc_event_read(struct perf_event *event)
+{
+   dump_trace_imc_data(event);
+}
+
+static void trace_imc_event_stop(struct perf_event *event, int flags)
+{
+   trace_imc_event_read(event);
+}
+
+static void trace_imc_event_start(struct perf_event *event, int flags)
+{
+   return;
+}
+
+static void trace_imc_event_del(struct perf_event *event, int flags)
+{
+   perf_sched_cb_dec(event->ctx->pmu);
+}
+
+void trace_imc_pmu_sched_task(struct perf_event_context *ctx,
+   bool sched_in)
+{
+   int core_id = smp_processor_id() / threads_per_core;
+   struct imc_pmu_ref *ref;
+   u64 local_mem, ldbar_value;
+
+   /* Set trace-imc bit in ldbar and load ldbar with per-thread memory address */
+   local_mem = get_trace_imc_event_base_addr();
+   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | TRACE_IMC_ENABLE;
+
+   ref = &trace_imc_refc[core_id];
+   if (!ref)
+   return;
+
+   if (sched_in) {
+   mtspr(SPRN_LDBAR, ldbar_value);
+   mutex_lock(&ref->lock);
+   if (ref->refc == 0) {
+   if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
+   get_hard_smp_processor_id(smp_processor_id()))) {
+   mutex_unlock(&ref->lock);
+   pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
+   mtspr(SPRN_LDBAR, 0);
+   return;
+   }
+   }
+   ++ref->refc;
+   mutex_unlock(&ref->lock);
+   } else {
+   mtspr(SPRN_LDBAR, 0);
+   mutex_lock(&ref->lock);
+   ref->refc--;
+   if (ref->refc == 0) {
+   if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE,
+ 

[PATCH v2 3/5] powerpc/perf: Add privileged access check for thread_imc

2018-12-14 Thread Anju T Sudhakar
From: Madhavan Srinivasan 

Add code to restrict user access to thread_imc pmu since
some event report privilege level information.

Fixes: f74c89bd80fb3 ('powerpc/perf: Add thread IMC PMU support')
Signed-off-by: Madhavan Srinivasan 
Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 3bef46f8417d..5ca80545a849 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -877,6 +877,9 @@ static int thread_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;
 
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
/* Sampling not supported */
if (event->hw.sample_period)
return -EINVAL;
-- 
2.17.1



[PATCH v2 4/5] powerpc/perf: Trace imc events detection and cpuhotplug

2018-12-14 Thread Anju T Sudhakar
This patch detects trace-imc events, does memory initializations for each
online cpu, and registers the cpuhotplug call-backs.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c   | 91 +++
 arch/powerpc/platforms/powernv/opal-imc.c |  3 +
 include/linux/cpuhotplug.h|  1 +
 3 files changed, 95 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 5ca80545a849..1f09265c8fb0 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -43,6 +43,10 @@ static DEFINE_PER_CPU(u64 *, thread_imc_mem);
 static struct imc_pmu *thread_imc_pmu;
 static int thread_imc_mem_size;
 
+/* Trace IMC data structures */
+static DEFINE_PER_CPU(u64 *, trace_imc_mem);
+static int trace_imc_mem_size;
+
 static struct imc_pmu *imc_event_to_pmu(struct perf_event *event)
 {
return container_of(event->pmu, struct imc_pmu, pmu);
@@ -1068,6 +1072,54 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
imc_event_update(event);
 }
 
+/*
+ * Allocate a page of memory for each cpu, and load LDBAR with 0.
+ */
+static int trace_imc_mem_alloc(int cpu_id, int size)
+{
+   u64 *local_mem = per_cpu(trace_imc_mem, cpu_id);
+   int phys_id = cpu_to_node(cpu_id), rc = 0;
+
+   if (!local_mem) {
+   local_mem = page_address(alloc_pages_node(phys_id,
+   GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE |
+   __GFP_NOWARN, get_order(size)));
+   if (!local_mem)
+   return -ENOMEM;
+   per_cpu(trace_imc_mem, cpu_id) = local_mem;
+
+   /* Initialise the counters for trace mode */
+   rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_TRACE, __pa((void *)local_mem),
+   get_hard_smp_processor_id(cpu_id));
+   if (rc) {
+   pr_info("IMC:opal init failed for trace imc\n");
+   return rc;
+   }
+   }
+
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+}
+
+static int ppc_trace_imc_cpu_online(unsigned int cpu)
+{
+   return trace_imc_mem_alloc(cpu, trace_imc_mem_size);
+}
+
+static int ppc_trace_imc_cpu_offline(unsigned int cpu)
+{
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+}
+
+static int trace_imc_cpu_init(void)
+{
+   return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
+ "perf/powerpc/imc_trace:online",
+ ppc_trace_imc_cpu_online,
+ ppc_trace_imc_cpu_offline);
+}
+
 /* update_pmu_ops : Populate the appropriate operations for "pmu" */
 static int update_pmu_ops(struct imc_pmu *pmu)
 {
@@ -1189,6 +1241,17 @@ static void cleanup_all_thread_imc_memory(void)
}
 }
 
+static void cleanup_all_trace_imc_memory(void)
+{
+   int i, order = get_order(trace_imc_mem_size);
+
+   for_each_online_cpu(i) {
+   if (per_cpu(trace_imc_mem, i))
+   free_pages((u64)per_cpu(trace_imc_mem, i), order);
+
+   }
+}
+
 /* Function to free the attr_groups which are dynamically allocated */
 static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
 {
@@ -1230,6 +1293,11 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE);
cleanup_all_thread_imc_memory();
}
+
+   if (pmu_ptr->domain == IMC_DOMAIN_TRACE) {
+   cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE);
+   cleanup_all_trace_imc_memory();
+   }
 }
 
 /*
@@ -1312,6 +1380,21 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct device_node *parent,
 
thread_imc_pmu = pmu_ptr;
break;
+   case IMC_DOMAIN_TRACE:
+   /* Update the pmu name */
+   pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s", s, "_imc");
+   if (!pmu_ptr->pmu.name)
+   return -ENOMEM;
+
+   trace_imc_mem_size = pmu_ptr->counter_mem_size;
+   for_each_online_cpu(cpu) {
+   res = trace_imc_mem_alloc(cpu, trace_imc_mem_size);
+   if (res) {
+   cleanup_all_trace_imc_memory();
+   goto err;
+   }
+   }
+   break;
default:
return -EINVAL;
}
@@ -1384,6 +1467,14 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
goto err_free_mem;
}
 
+   break;
+   case IMC_DOMAIN_TRACE:
+   ret = trace_imc_cpu_init();
+   if (ret) {
+   cleanup_all_trace_imc_memory();
+   goto err_free_mem;
+   }
+
break;
default:
 

[PATCH v2 1/5] powerpc/include: Add data structures and macros for IMC trace mode

2018-12-14 Thread Anju T Sudhakar
Add the macros needed for IMC (In-Memory Collection Counters) trace-mode
and a data structure to hold the trace-imc record data.
Also, add the new type "OPAL_IMC_COUNTERS_TRACE" in 'opal-api.h', since
there is a new switch case added in the opal-calls for IMC.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/include/asm/imc-pmu.h  | 39 +
 arch/powerpc/include/asm/opal-api.h |  1 +
 2 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 69f516ecb2fd..7c2ef0e42661 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -33,6 +33,7 @@
  */
 #define THREAD_IMC_LDBAR_MASK   0x0003ffffffffe000ULL
 #define THREAD_IMC_ENABLE   0x8000000000000000ULL
+#define TRACE_IMC_ENABLE   0x4000000000000000ULL
 
 /*
  * For debugfs interface for imc-mode and imc-command
@@ -59,6 +60,34 @@ struct imc_events {
char *scale;
 };
 
+/*
+ * Trace IMC hardware updates a 64-byte record on
+ * Core Performance Monitoring Counter (CPMC)
+ * overflow. Here is the layout for the trace imc record
+ *
+ * DW 0 : Timebase
+ * DW 1 : Program Counter
+ * DW 2 : PIDR information
+ * DW 3 : CPMC1
+ * DW 4 : CPMC2
+ * DW 5 : CPMC3
+ * Dw 6 : CPMC4
+ * DW 7 : Timebase
+ * .
+ *
+ * The following is the data structure to hold trace imc data.
+ */
+struct trace_imc_data {
+   u64 tb1;
+   u64 ip;
+   u64 val;
+   u64 cpmc1;
+   u64 cpmc2;
+   u64 cpmc3;
+   u64 cpmc4;
+   u64 tb2;
+};
+
 /* Event attribute array index */
 #define IMC_FORMAT_ATTR0
 #define IMC_EVENT_ATTR 1
@@ -68,6 +97,13 @@ struct imc_events {
 /* PMU Format attribute macros */
 #define IMC_EVENT_OFFSET_MASK  0xffffffffULL
 
+/*
+ * Macro to mask bits 0:21 of first double word(which is the timebase) to
+ * compare with 8th double word (timebase) of trace imc record data.
+ */
+#define IMC_TRACE_RECORD_TB1_MASK  0x3ffffffffffULL
+
+
 /*
  * Device tree parser code detects IMC pmu support and
  * registers new IMC pmus. This structure will hold the
@@ -113,6 +149,7 @@ struct imc_pmu_ref {
 
 enum {
IMC_TYPE_THREAD = 0x1,
+   IMC_TYPE_TRACE  = 0x2,
IMC_TYPE_CORE   = 0x4,
IMC_TYPE_CHIP   = 0x10,
 };
@@ -123,6 +160,8 @@ enum {
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_CORE2
 #define IMC_DOMAIN_THREAD  3
+/* For trace-imc the domain is still thread but it operates in trace-mode */
+#define IMC_DOMAIN_TRACE   4
 
 extern int init_imc_pmu(struct device_node *parent,
struct imc_pmu *pmu_ptr, int pmu_id);
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b239ea..a4130b21b159 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1118,6 +1118,7 @@ enum {
 enum {
OPAL_IMC_COUNTERS_NEST = 1,
OPAL_IMC_COUNTERS_CORE = 2,
+   OPAL_IMC_COUNTERS_TRACE = 3,
 };
 
 
-- 
2.17.1



[PATCH v2 0/5] powerpc/perf: IMC trace-mode support

2018-12-14 Thread Anju T Sudhakar
IMC (In-Memory Collection Counters) is a hardware monitoring facility
that collects a large number of hardware performance events.
POWER9 supports two modes for IMC: Accumulation mode and Trace mode.
In Accumulation mode, event counts are accumulated in system memory,
and the hypervisor reads the posted counts periodically or when requested.
In Trace mode, the event counted is fixed to cycles; on each overflow the
hardware snapshots the program counter along with other details and writes
them into the memory pointed to by LDBAR (a ring buffer which the hardware
wraps around). LDBAR has a bit which indicates IMC trace-mode.


Trace-IMC Implementation:  
-- 
To enable trace-imc, we need to:

* Add a trace node in the DTS file for POWER9, so that the new trace node
can be discovered by the kernel.

The information included in the DTS file is as follows (a snippet from
the ima-catalog):

TRACE_IMC: trace-events {  
 #address-cells = <0x1>;
 #size-cells = <0x1>;   
 event@1020 {   
 event-name = "cycles" ;
 reg = <0x1020 0x8>;
 desc = "Reference cycles" ;
 }; 
 }; 
 trace@0 {  
 compatible = "ibm,imc-counters";   
 events-prefix = "trace_";  
 reg = <0x0 0x8>;   
 events = < &TRACE_IMC >;   
 type = <0x2>;  
 size = <0x4>;  
 }; 

The OP-BUILD changes needed to include the "trace node" are already pulled
into the ima-catalog repo:

https://github.com/open-power/op-build/commit/d3e75dc26d1283d7d5eb444bff1ec9e40d5dfc07

* Enhance the opal_imc_counters_* calls to support this new trace mode
in IMC. Add support to initialize the trace-mode SCOM.

TRACE_IMC_SCOM bit representation: 

0:1 : SAMPSEL  
2:33: CPMC_LOAD
34:40   : CPMC1SEL 
41:47   : CPMC2SEL 
48:50   : BUFFERSIZE   
51:63   : RESERVED 

CPMC_LOAD contains the sampling duration. SAMPSEL and CPMC*SEL determine
the event to count. BUFFERSIZE indicates the memory range. On each overflow,
the hardware snapshots the program counter along with other details, updates
the memory, and reloads the CPMC_LOAD value for the next sampling duration.
IMC hardware does not support exceptions, so it quietly wraps around when
the memory buffer reaches the end.

Links to the skiboot patches that enhance the opal_imc_counters_* calls:
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012878.html
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012879.html
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012882.html
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012880.html
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012881.html
https://lists.ozlabs.org/pipermail/skiboot/2018-December/012883.html

* Set the LDBAR SPR to enable IMC trace-mode.


Re: [PATCH] powerpc/mm: make NULL pointer dereferences explicit on bad page faults.

2018-12-14 Thread Christophe Leroy

Hi Michael,

On 14/12/2018 at 01:57, Michael Ellerman wrote:

Hi Christophe,

You know it's the trivial patches that are going to get lots of review
comments :)


I'm so happy to get comments.



Christophe Leroy  writes:

As on several other arches, including x86, this patch makes it explicit
that a bad page fault is a NULL pointer dereference when the fault
address is lower than PAGE_SIZE.


I'm being pedantic, but it's not necessarily a NULL pointer dereference.
It might just be a direct access to a low address, eg:

  char *p = 0x100;
  *p = 0;

That's not a NULL pointer dereference.

But other arches do print this so I guess it's OK to add, and in most
cases it will be an actual NULL pointer dereference.

I wonder though if we should use 4096 rather than PAGE_SIZE, given
that's the actual value other arches are using. We support 256K pages on
some systems, which is getting quite large.


Those invalid accesses are caught because the first page is marked
non-present or non-accessible in the page table, so I think using
PAGE_SIZE here is valid regardless of the page size.


Looks like the arches have PAGE_SHIFT ranging from 12 to 16 mainly.




diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index d51cf5f4e45e..501a1eadb3e9 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -631,13 +631,16 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
switch (TRAP(regs)) {
case 0x300:
case 0x380:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "data at address 0x%08lx\n", regs->dar);
+   pr_alert("Unable to handle kernel %s for data at address 0x%08lx\n",
+regs->dar < PAGE_SIZE ? "NULL pointer dereference" :
+"paging request",
+regs->dar);


This is now too long I think, with printk time you get:

[ 1096.450711] Unable to handle kernel NULL pointer dereference for data at address 0x

Which is 93 columns. It's true on many systems it doesn't really matter
any more, but it would still be good if it was shorter.

I like that on x86 they prefix it with "BUG:", just to avoid any confusion.

What if we had for the NULL pointer case:

   BUG: Kernel NULL pointer dereference at 0x

And for the normal case:

   BUG: Unable to handle kernel data access at 0x

Note on the very next line we print:
   Faulting instruction address: 0xc0795cc8

So there should be no confusion about whether "at" refers to the data
address or the instruction address.


Agreed




case 0x400:
case 0x480:
-   printk(KERN_ALERT "Unable to handle kernel paging request for "
-   "instruction fetch\n");
+   pr_alert("Unable to handle kernel %s for instruction fetch\n",
+regs->nip < PAGE_SIZE ? "NULL pointer dereference" :
+"paging request");


I don't really like using "NULL pointer dereference" here, that
terminology makes me think of a load/store, I think it confuses things
rather than making it clearer.

What about:

   BUG: Unable to handle kernel instruction fetch at 0x


I think we still need to make it explicit that we jumped there due to a
NULL function pointer, although I don't have a good text for it yet.






break;
case 0x600:
printk(KERN_ALERT "Unable to handle kernel paging request for "


It would be good to clean up these other cases as well. They seem to be
trying to use the "page request for" terminology which leads to them
being very wordy. I assume that was done to help people grepping kernel
logs for errors, but I think we should not worry about that if we have
the "BUG:" prefix.

So we have:
printk(KERN_ALERT "Unable to handle kernel paging request for "
"unaligned access at address 0x%08lx\n", regs->dar);

What about:

   BUG: Unable to handle kernel unaligned access at 0x

And:
printk(KERN_ALERT "Unable to handle kernel paging request for "
"unknown fault\n");

What about:

   BUG: Unable to handle unknown paging fault at 0x


Thoughts?


Those look like good ideas; I'll carry on.

Christophe