Re: [PATCH v6 00/27] ptrace: add PTRACE_GET_SYSCALL_INFO request
On Thu, Dec 13, 2018 at 12:18 PM Dmitry V. Levin wrote: > PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain > details of the syscall the tracee is blocked in. > > There are two reasons for a special syscall-related ptrace request. > > Firstly, with the current ptrace API there are cases when ptracer cannot > retrieve necessary information about syscalls. Some examples include: > * The notorious int-0x80-from-64-bit-task issue. See [1] for details. > In short, if a 64-bit task performs a syscall through int 0x80, its tracer > has no reliable means to find out that the syscall was, in fact, > a compat syscall, and misidentifies it. > * Syscall-enter-stop and syscall-exit-stop look the same for the tracer. > Common practice is to keep track of the sequence of ptrace-stops in order > not to mix the two syscall-stops up. But it is not as simple as it looks; > for example, strace had a (just recently fixed) long-standing bug where > attaching strace to a tracee that is performing the execve system call > led to the tracer identifying the following syscall-exit-stop as > syscall-enter-stop, which messed up all the state tracking. > * Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 > ("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA > and process_vm_readv become unavailable when the process dumpable flag > is cleared. On such architectures as ia64 this results in all syscall > arguments being unavailable for the tracer. > > Secondly, ptracers also have to support a lot of arch-specific code for > obtaining information about the tracee. For some architectures, this > requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall > argument and return value. 
> > PTRACE_GET_SYSCALL_INFO returns the following structure: > > struct ptrace_syscall_info { > __u8 op; /* PTRACE_SYSCALL_INFO_* */ > __u32 arch __attribute__((__aligned__(sizeof(__u32)))); > __u64 instruction_pointer; > __u64 stack_pointer; > union { > struct { > __u64 nr; > __u64 args[6]; > } entry; > struct { > __s64 rval; > __u8 is_error; > } exit; > struct { > __u64 nr; > __u64 args[6]; > __u32 ret_data; > } seccomp; > }; > }; > > The structure was chosen according to [2], except for the following > changes: > * seccomp substructure was added as a superset of entry substructure; > * the type of nr field was changed from int to __u64 because syscall > numbers are, as a practical matter, 64 bits; > * stack_pointer field was added along with instruction_pointer field > since it is readily available and can save the tracer from extra > PTRACE_GETREGS/PTRACE_GETREGSET calls; > * arch is always initialized to aid with tracing system calls > such as execve(); > * instruction_pointer and stack_pointer are always initialized > so they could be easily obtained for non-syscall stops; > * a boolean is_error field was added along with rval field, this way > the tracer can more reliably distinguish a return value > from an error value. > > strace has been ported to PTRACE_GET_SYSCALL_INFO, you can find it > in [3] and [4]. > > [1] > https://lore.kernel.org/lkml/ca+55afzcsvmddj9lh_gdbz1ozhyem6zrgpbdajnywm2lf_e...@mail.gmail.com/ > [2] > https://lore.kernel.org/lkml/caobl_7gm0n80n7j_dfw_eqyflyzq+sf4y2avsccv88tb3aw...@mail.gmail.com/ > [3] https://github.com/strace/strace/commits/ldv/PTRACE_GET_SYSCALL_INFO > [4] https://gitlab.com/strace/strace/commits/ldv/PTRACE_GET_SYSCALL_INFO > > --- > > Notes: > v6: > * Add syscall_get_arguments and syscall_set_arguments wrappers > to asm-generic/syscall.h, requested by Geert. > * Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing > padding > into account, use the end of the last field of the structure being > written.
> * Change struct ptrace_syscall_info: > * remove .frame_pointer field, it is not needed and not portable; > * make .arch field explicitly aligned, remove no longer needed > padding before .arch field; > * remove trailing pads, they are no longer needed. > > v5: > * Merge separate series and patches into the single series. > * Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg. > * Change struct ptrace_syscall_info: generalize instruction_pointer, > stack_pointer, and frame_pointer fields by moving them from > ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info > and initializing them for all stops. > * Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop, > so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain > instruction_pointer
Re: [PATCH v1 9/9] mm: better document PG_reserved
On 12/14/18 3:10 AM, David Hildenbrand wrote: > The usage of PG_reserved and how PG_reserved pages are to be treated is > buried deep down in different parts of the kernel. Let's shine some light > onto these details by documenting current users and expected > behavior. > > Especially, clarify on the "Some of them might not even exist" case. > These are physical memory gaps that will never be dumped as they > are not marked as IORESOURCE_SYSRAM. PG_reserved does in general not > hinder anybody from dumping or swapping. In some cases, these pages > will not be stored in the hibernation image. Hi, Thanks for the doc update. Comments below. > Cc: Andrew Morton > Cc: Stephen Rothwell > Cc: Pavel Tatashin > Cc: Michal Hocko > Cc: Alexander Duyck > Cc: Matthew Wilcox > Cc: Anthony Yznaga > Cc: Miles Chen > Cc: yi.z.zh...@linux.intel.com > Cc: Dan Williams > Signed-off-by: David Hildenbrand > --- > include/linux/page-flags.h | 33 +++-- > 1 file changed, 31 insertions(+), 2 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 808b4183e30d..9de2e941cbd5 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -17,8 +17,37 @@ > /* > * Various page->flags bits: > * > - * PG_reserved is set for special pages, which can never be swapped out. Some > - * of them might not even exist... > + * PG_reserved is set for special pages. The "struct page" of such a page > + * should in general not be touched (e.g. set dirty) except by their owner. by its owner. > + * Pages marked as PG_reserved include: > + * - Pages part of the kernel image (including vDSO) and similar (e.g. BIOS, > + * initrd, HW tables) > + * - Pages reserved or allocated early during boot (before the page allocator > + * was initialized). This includes (depending on the architecture) the > + * initial vmmap, initial page tables, crashkernel, elfcorehdr, and much VM map, > + * much more. 
Once (if ever) freed, PG_reserved is cleared and they will > + * be given to the page allocator. > + * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying > + * to read/write these pages might end badly. Don't touch! > + * - The zero page(s) > + * - Pages not added to the page allocator when onlining a section because > + * they were excluded via the online_page_callback() or because they are > + * PG_hwpoison. > + * - Pages allocated in the context of kexec/kdump (loaded kernel image, > + * control pages, vmcoreinfo) > + * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are > + * not marked PG_reserved (as they might be in use by somebody else who > does > + * not respect the caching strategy). > + * - Pages part of an offline section (struct pages of offline sections > should > + * not be trusted as they will be initialized when first onlined). > + * - MCA pages on ia64 > + * - Pages holding CPU notes for POWER Firmware Assisted Dump > + * - Device memory (e.g. PMEM, DAX, HMM) > + * Some PG_reserved pages will be excluded from the hibernation image. > + * PG_reserved does in general not hinder anybody from dumping or swapping > + * and is no longer required for remap_pfn_range(). ioremap might require it. > + * Consequently, PG_reserved for a page mapped into user space can indicate > + * the zero page, the vDSO, MMIO pages or device memory. > * > * The PG_private bitflag is set on pagecache pages if they contain > filesystem > * specific data (which is normally at page->private). It can be used by > cheers. -- ~Randy
Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache
On 12/14/18 2:47 PM, Frank Rowand wrote: > On 12/14/18 9:15 AM, Rob Herring wrote: >> On Fri, Dec 14, 2018 at 12:43 AM wrote: >>> >>> From: Frank Rowand >>> >>> The phandle cache contains struct device_node pointers. The refcount >>> of the pointers was not incremented while in the cache, allowing use >>> after free error after kfree() of the node. Add the proper increment >>> and decrement of the use count. >> >> Since we pre-populate the cache at boot, all the nodes will have a ref >> count and will never be freed unless we happen to repopulate the whole >> cache. That doesn't seem ideal. The node pointer is not "in use" just >> because it is in the cache. I forgot to reply to this sentence. The node pointers are "in use" because of_find_node_by_phandle() will use the pointers to access the phandle field. This is a use after free bug if the node has been kfree()'ed. >> >> Rob >> > > This patch also adds of_node_put() so that the refcount will go to zero > when the node is removed as part of an overlay remove, if the node was > added by an overlay. > > Patch 2/2 adds the free cache entry call to __of_detach_node(), so the > refcount will go to zero when the node is removed for dynamic use cases > other than overlays. (For overlays, all nodes are instead removed from > the cache before __of_detach_node() is called.) > > -Frank >
Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache
On 12/14/18 9:15 AM, Rob Herring wrote: > On Fri, Dec 14, 2018 at 12:43 AM wrote: >> >> From: Frank Rowand >> >> The phandle cache contains struct device_node pointers. The refcount >> of the pointers was not incremented while in the cache, allowing use >> after free error after kfree() of the node. Add the proper increment >> and decrement of the use count. > > Since we pre-populate the cache at boot, all the nodes will have a ref > count and will never be freed unless we happen to repopulate the whole > cache. That doesn't seem ideal. The node pointer is not "in use" just > because it is in the cache. > > Rob > This patch also adds of_node_put() so that the refcount will go to zero when the node is removed as part of an overlay remove, if the node was added by an overlay. Patch 2/2 adds the free cache entry call to __of_detach_node(), so the refcount will go to zero when the node is removed for dynamic use cases other than overlays. (For overlays, all nodes are instead removed from the cache before __of_detach_node() is called.) -Frank
Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache
On 12/14/18 1:56 PM, Michael Bringmann wrote: > On 12/14/2018 11:20 AM, Rob Herring wrote: >> On Fri, Dec 14, 2018 at 12:43 AM wrote: >>> >>> From: Frank Rowand >>> >>> Non-overlay dynamic devicetree node removal may leave the node in >>> the phandle cache. Subsequent calls to of_find_node_by_phandle() >>> will incorrectly find the stale entry. Remove the node from the >>> cache. >>> >>> Add paranoia checks in of_find_node_by_phandle() as a second level >>> of defense (do not return cached node if detached, do not add node >>> to cache if detached). >>> >>> Reported-by: Michael Bringmann >>> Signed-off-by: Frank Rowand >>> --- >>> drivers/of/base.c | 29 - >>> drivers/of/dynamic.c| 3 +++ >>> drivers/of/of_private.h | 4 >>> 3 files changed, 35 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/of/base.c b/drivers/of/base.c >>> index d599367cb92a..34a5125713c8 100644 >>> --- a/drivers/of/base.c >>> +++ b/drivers/of/base.c >>> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void) >>> late_initcall_sync(of_free_phandle_cache); >>> #endif >>> >>> +/* >>> + * Caller must hold devtree_lock. 
>>> + */ >>> +void __of_free_phandle_cache_entry(phandle handle) >>> +{ >>> + phandle masked_handle; >>> + >>> + if (!handle) >>> + return; >>> + >>> + masked_handle = handle & phandle_cache_mask; >>> + >>> + if (phandle_cache) { >>> + if (phandle_cache[masked_handle] && >>> + handle == phandle_cache[masked_handle]->phandle) { >>> + of_node_put(phandle_cache[masked_handle]); >>> + phandle_cache[masked_handle] = NULL; >>> + } >>> + } >>> +} >>> + >>> void of_populate_phandle_cache(void) >>> { >>> unsigned long flags; >>> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle >>> handle) >>> if (phandle_cache[masked_handle] && >>> handle == phandle_cache[masked_handle]->phandle) >>> np = phandle_cache[masked_handle]; >>> + if (np && of_node_check_flag(np, OF_DETACHED)) { >>> + of_node_put(np); >>> + phandle_cache[masked_handle] = NULL; >> >> This should never happen, right? Any time we set OF_DETACHED, the >> entry should get removed from the cache. I think we want a WARN here >> in case we're in an unexpected state. Correct, this should never happen. I will add the WARN. > We don't actually remove the pointer from the phandle cache when we set > OF_DETACHED in drivers/of/dynamic.c:__of_detach_node. The phandle cache > is currently static within drivers/of/base.c. There are a couple of > calls to of_populate_phandle_cache / of_free_phandle_cache within > drivers/of/overlay.c, but these are not involved in the device tree > updates that occur during LPAR migration. A WARN here would only make > sense, if we also arrange to clear the handle. Rob's reply did not include the full patch 2/2. The full patch 2/2 also adds a call to __of_free_phandle_cache_entry() in __of_detach_node(). -Frank > >> >> Rob > > Michael > >> >> >
Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
On 12/12/2018 08:57 PM, Michael Ellerman wrote: > Frank Rowand writes: >> On 12/11/18 8:07 AM, Rob Herring wrote: >>> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman >>> wrote: > ... diff --git a/drivers/of/base.c b/drivers/of/base.c index 09692c9b32a7..d8e4534c0686 100644 --- a/drivers/of/base.c +++ b/drivers/of/base.c @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle) if (phandle_cache[masked_handle] && handle == phandle_cache[masked_handle]->phandle) np = phandle_cache[masked_handle]; + + /* If we find a detached node, remove it */ + if (of_node_check_flag(np, OF_DETACHED)) + np = phandle_cache[masked_handle] = NULL; >> >> The bug you found exposes a couple of different issues, a little bit >> deeper than the proposed fix. I'll work on a fuller fix tonight or >> tomorrow. > > OK thanks. > >>> I'm wondering if we should explicitly remove the node from the cache >>> when we set OF_DETACHED. Otherwise, it could be possible that the node >>> pointer has been freed already. Or maybe we need both? >> >> Yes, it should be explicitly removed. I may also add in a paranoia check in >> of_find_node_by_phandle(). > > That seems best to me. I agree that we should do both. > > cheers Michael -- Michael W. Bringmann Linux I/O, Networking and Security Development IBM Corporation Tie-Line 363-5196 External: (512) 286-5196 Cell: (512) 466-0650 m...@linux.vnet.ibm.com
Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache
On 12/14/2018 11:20 AM, Rob Herring wrote: > On Fri, Dec 14, 2018 at 12:43 AM wrote: >> >> From: Frank Rowand >> >> Non-overlay dynamic devicetree node removal may leave the node in >> the phandle cache. Subsequent calls to of_find_node_by_phandle() >> will incorrectly find the stale entry. Remove the node from the >> cache. >> >> Add paranoia checks in of_find_node_by_phandle() as a second level >> of defense (do not return cached node if detached, do not add node >> to cache if detached). >> >> Reported-by: Michael Bringmann >> Signed-off-by: Frank Rowand >> --- >> drivers/of/base.c | 29 - >> drivers/of/dynamic.c| 3 +++ >> drivers/of/of_private.h | 4 >> 3 files changed, 35 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/of/base.c b/drivers/of/base.c >> index d599367cb92a..34a5125713c8 100644 >> --- a/drivers/of/base.c >> +++ b/drivers/of/base.c >> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void) >> late_initcall_sync(of_free_phandle_cache); >> #endif >> >> +/* >> + * Caller must hold devtree_lock. >> + */ >> +void __of_free_phandle_cache_entry(phandle handle) >> +{ >> + phandle masked_handle; >> + >> + if (!handle) >> + return; >> + >> + masked_handle = handle & phandle_cache_mask; >> + >> + if (phandle_cache) { >> + if (phandle_cache[masked_handle] && >> + handle == phandle_cache[masked_handle]->phandle) { >> + of_node_put(phandle_cache[masked_handle]); >> + phandle_cache[masked_handle] = NULL; >> + } >> + } >> +} >> + >> void of_populate_phandle_cache(void) >> { >> unsigned long flags; >> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle >> handle) >> if (phandle_cache[masked_handle] && >> handle == phandle_cache[masked_handle]->phandle) >> np = phandle_cache[masked_handle]; >> + if (np && of_node_check_flag(np, OF_DETACHED)) { >> + of_node_put(np); >> + phandle_cache[masked_handle] = NULL; > > This should never happen, right? Any time we set OF_DETACHED, the > entry should get removed from the cache. 
I think we want a WARN here > in case we're in an unexpected state. We don't actually remove the pointer from the phandle cache when we set OF_DETACHED in drivers/of/dynamic.c:__of_detach_node. The phandle cache is currently static within drivers/of/base.c. There are a couple of calls to of_populate_phandle_cache / of_free_phandle_cache within drivers/of/overlay.c, but these are not involved in the device tree updates that occur during LPAR migration. A WARN here would only make sense, if we also arrange to clear the handle. > > Rob Michael > > -- Michael W. Bringmann Linux I/O, Networking and Security Development IBM Corporation Tie-Line 363-5196 External: (512) 286-5196 Cell: (512) 466-0650 m...@linux.vnet.ibm.com
Re: [PATCH] net/ibmvnic: Remove tests of member address
From: Wen Yang Date: Tue, 11 Dec 2018 12:20:46 +0800 > The driver was checking for non-NULL address. > - adapter->napi[i] > > This is pointless as these will be always non-NULL, since the > 'adapter->napi' is allocated in init_napi(). > It is safe to get rid of useless checks for addresses to fix the > coccinelle warning: >>>drivers/net/ethernet/ibm/ibmvnic.c: test of a variable/field address > Since such statements always return true, they are redundant. > > Signed-off-by: Wen Yang Applied.
[RFC 6/6] powerpc: Enable support for ibm,drc-info devtree property
Enable support for new DRC device tree property "ibm,drc-info" in initial handshake between the Linux kernel and the front end processor. Signed-off-by: Michael Bringmann --- arch/powerpc/kernel/prom_init.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index f33ff41..5d20a4d 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -910,7 +910,7 @@ struct ibm_arch_vec { .reserved2 = 0, .reserved3 = 0, .subprocessors = 1, - .byte22 = OV5_FEAT(OV5_DRMEM_V2), + .byte22 = OV5_FEAT(OV5_DRMEM_V2 | OV5_DRC_INFO), .intarch = 0, .mmu = 0, .hash_ext = 0,
[RFC 5/6] powerpc/pci/hotplug: Use common drcinfo parsing
The implementation of the pseries-specific drc info properties is currently implemented in pseries-specific and non-pseries-specific files. This patch set uses a new implementation of the device-tree parsing code for the properties. This patch refactors parsing of the pseries-specific drc-info properties out of rpaphp_core.c to use the common parser. In the case where an architecture does not use these properties, a __weak copy of the function is provided with dummy return values. Changes include creating appropriate callback functions and passing callback-specific data blocks into arch_find_drc_match. All functions that were used just to support the previous parsing implementation have been moved. Signed-off-by: Michael Bringmann --- drivers/pci/hotplug/rpaphp_core.c | 232 - 1 file changed, 28 insertions(+), 204 deletions(-) diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c index bcd5d35..9ad7384 100644 --- a/drivers/pci/hotplug/rpaphp_core.c +++ b/drivers/pci/hotplug/rpaphp_core.c @@ -154,182 +154,18 @@ static enum pci_bus_speed get_max_bus_speed(struct slot *slot) return speed; } -static int get_children_props(struct device_node *dn, const int **drc_indexes, - const int **drc_names, const int **drc_types, - const int **drc_power_domains) -{ - const int *indexes, *names, *types, *domains; - - indexes = of_get_property(dn, "ibm,drc-indexes", NULL); - names = of_get_property(dn, "ibm,drc-names", NULL); - types = of_get_property(dn, "ibm,drc-types", NULL); - domains = of_get_property(dn, "ibm,drc-power-domains", NULL); - - if (!indexes || !names || !types || !domains) { - /* Slot does not have dynamically-removable children */ - return -EINVAL; - } - if (drc_indexes) - *drc_indexes = indexes; - if (drc_names) - /* &drc_names[1] contains NULL terminated slot names */ - *drc_names = names; - if (drc_types) - /* &drc_types[1] contains NULL terminated slot types */ - *drc_types = types; - if (drc_power_domains) - *drc_power_domains = domains; - - return 0; -} - - /* Verify the existence of 'drc_name' and/or 'drc_type' within the - * current node. First obtain it's my-drc-index property. Next, - * obtain the DRC info from it's parent. Use the my-drc-index for - * correlation, and obtain/validate the requested properties. + * current node. */ -static int rpaphp_check_drc_props_v1(struct device_node *dn, char *drc_name, - char *drc_type, unsigned int my_index) -{ - char *name_tmp, *type_tmp; - const int *indexes, *names; - const int *types, *domains; - int i, rc; - - rc = get_children_props(dn->parent, &indexes, &names, &types, &domains); - if (rc < 0) { - return -EINVAL; - } - - name_tmp = (char *) &names[1]; - type_tmp = (char *) &types[1]; - - /* Iterate through parent properties, looking for my-drc-index */ - for (i = 0; i < be32_to_cpu(indexes[0]); i++) { - if ((unsigned int) indexes[i + 1] == my_index) - break; - - name_tmp += (strlen(name_tmp) + 1); - type_tmp += (strlen(type_tmp) + 1); - } - - if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, name_tmp))) && - ((drc_type == NULL) || (drc_type && !strcmp(drc_type, type_tmp)))) - return 0; - - return -EINVAL; -} - -static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, - char *drc_type, unsigned int my_index) -{ - struct property *info; - unsigned int entries; - struct of_drc_info drc; - const __be32 *value; - char cell_drc_name[MAX_DRC_NAME_LEN]; - int j, fndit; - - info = of_find_property(dn->parent, "ibm,drc-info", NULL); - if (info == NULL) - return -EINVAL; - - value = of_prop_next_u32(info, NULL, &entries); - if (!value) - return -EINVAL; - - for (j = 0; j < entries; j++) { - of_read_drc_info_cell(&info, &value, &drc); - - /* Should now know end of current entry */ - - if (my_index > drc.last_drc_index) - continue; - - fndit = 1; - break; - } - /* Found it */ - - if (fndit) - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, - my_index); - - if (((drc_name == NULL) || -(drc_name && !strcmp(drc_name, cell_drc_name))) && - ((drc_type == NULL) || -(drc_type && !strcmp(drc_type, drc.drc_type)))) - return 0; - - return -EINVAL; -} - int rpaphp_check_drc_props(struct device_node *dn, char *drc_name, char *drc_type) { -
[RFC 4/6] powerpc/pseries: Use common drcinfo parsing
The implementation of the pseries-specific drc info properties is currently implemented in pseries-specific and non-pseries-specific files. This patch set uses a new implementation of the device-tree parsing code for the properties. This patch refactors parsing of the drc-info properties out of pseries_energy.c and hotplug-cpu.c to use the common parser. Changes include creating appropriate callback functions and passing callback-specific data blocks into arch_find_drc_match. Signed-off-by: Michael Bringmann --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 83 +++- arch/powerpc/platforms/pseries/pseries_energy.c | 157 --- 2 files changed, 107 insertions(+), 133 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 2f8e621..ee3028c 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -411,23 +411,29 @@ static bool dlpar_cpu_exists(struct device_node *parent, u32 drc_index) return found; } -static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index) +struct cpu_drc_index_struct { + u32 drc_index; +}; + +bool cpu_drc_index_cb(struct device_node *dn, + u32 drc_index, char *drc_name, + char *drc_type, u32 drc_power_domain, + void *data) { - bool found = false; - int rc, index; + struct cpu_drc_index_struct *cdata = data; - index = 0; - while (!found) { - u32 drc; + if (drc_index == cdata->drc_index) + return true; + return false; +} - rc = of_property_read_u32_index(parent, "ibm,drc-indexes", - index++, &drc); - if (rc) - break; +static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index) +{ + struct cpu_drc_index_struct cdata = { drc_index }; + bool found = false; - if (drc == drc_index) - found = true; - } + found = arch_find_drc_match(parent, cpu_drc_index_cb, + "CPU", NULL, false, false, &cdata); return found; } @@ -721,11 +727,34 @@ static int dlpar_cpu_remove_by_count(u32 cpus_to_remove) return rc; } +struct 
cpus_to_add_cb_struct { + struct device_node *parent; + u32 *cpu_drcs; + u32 cpus_to_add; + u32 cpus_found; +}; + +static bool cpus_to_add_cb(struct device_node *dn, + u32 drc_index, char *drc_name, + char *drc_type, u32 drc_power_domain, + void *data) +{ + struct cpus_to_add_cb_struct *cdata = data; + + if (cdata->cpus_found < cdata->cpus_to_add) { + if (!dlpar_cpu_exists(cdata->parent, drc_index)) + cdata->cpu_drcs[cdata->cpus_found++] = drc_index; + } + + return !(cdata->cpus_found < cdata->cpus_to_add); +} + static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add) { struct device_node *parent; - int cpus_found = 0; - int index, rc; + struct cpus_to_add_cb_struct cdata = { + NULL, cpu_drcs, cpus_to_add, 0 }; + int cpus_found; parent = of_find_node_by_path("/cpus"); if (!parent) { @@ -734,25 +763,13 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add) return -1; } - /* Search the ibm,drc-indexes array for possible CPU drcs to -* add. Note that the format of the ibm,drc-indexes array is -* the number of entries in the array followed by the array -* of drc values so we start looking at index = 1. + /* Search the appropriate property for possible CPU drcs +* to add. 
*/ - index = 1; - while (cpus_found < cpus_to_add) { - u32 drc; - - rc = of_property_read_u32_index(parent, "ibm,drc-indexes", - index++, &drc); - if (rc) - break; - - if (dlpar_cpu_exists(parent, drc)) - continue; - - cpu_drcs[cpus_found++] = drc; - } + cdata.parent = parent; + arch_find_drc_match(parent, cpus_to_add_cb, "CPU", + NULL, false, false, &cdata); + cpus_found = cdata.cpus_found; of_node_put(parent); return cpus_found; diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c index 6ed2212..f7b9d86 100644 --- a/arch/powerpc/platforms/pseries/pseries_energy.c +++ b/arch/powerpc/platforms/pseries/pseries_energy.c @@ -23,6 +23,7 @@ #include #include #include +#include #define MODULE_VERS "1.0" @@ -36,60 +37,43 @@ /* Helper Routines to convert between drc_index to cpu numbers */
[RFC 3/6] pseries/drcinfo: Pseries impl of arch_find_drc_info
This patch provides a common interface to parse ibm,drc-indexes, ibm,drc-names, ibm,drc-types, ibm,drc-power-domains, or ibm,drc-info. The generic interface arch_find_drc_match is provided which accepts callback functions that may be applied to examine the data for each entry. Signed-off-by: Michael Bringmann --- arch/powerpc/include/asm/prom.h |3 arch/powerpc/platforms/pseries/of_helpers.c | 299 +++ include/linux/topology.h|2 3 files changed, 298 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h index b04c5ce..910d1dc 100644 --- a/arch/powerpc/include/asm/prom.h +++ b/arch/powerpc/include/asm/prom.h @@ -91,9 +91,6 @@ struct of_drc_info { u32 last_drc_index; }; -extern int of_read_drc_info_cell(struct property **prop, - const __be32 **curval, struct of_drc_info *data); - /* * There are two methods for telling firmware what our capabilities are. diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c index 0185e50..11c90cd 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.c +++ b/arch/powerpc/platforms/pseries/of_helpers.c @@ -1,5 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) "drc: " fmt + #include #include #include @@ -11,6 +13,12 @@ #defineMAX_DRC_NAME_LEN 64 +static int drc_debug; +#define dbg(args...) if (drc_debug) { printk(KERN_DEBUG args); } +#define err(arg...) printk(KERN_ERR args); +#define info(arg...) printk(KERN_INFO args); +#define warn(arg...) 
printk(KERN_WARNING args); + /** * pseries_of_derive_parent - basically like dirname(1) * @path: the full_name of a node to be added to the tree @@ -46,7 +54,8 @@ struct device_node *pseries_of_derive_parent(const char *path) /* Helper Routines to convert between drc_index to cpu numbers */ -int of_read_drc_info_cell(struct property **prop, const __be32 **curval, +static int of_read_drc_info_cell(struct property **prop, + const __be32 **curval, struct of_drc_info *data) { const char *p; @@ -90,4 +99,290 @@ int of_read_drc_info_cell(struct property **prop, const __be32 **curval, return 0; } -EXPORT_SYMBOL(of_read_drc_info_cell); + +static int walk_drc_info(struct device_node *dn, + bool (*usercb)(struct of_drc_info *drc, + void *data, + int *ret_code), + char *opt_drc_type, + void *data) +{ + struct property *info; + unsigned int entries; + struct of_drc_info drc; + const __be32 *value; + int j, ret_code = -EINVAL; + bool done = false; + + info = of_find_property(dn, "ibm,drc-info", NULL); + if (info == NULL) + return -EINVAL; + + value = info->value; + entries = of_read_number(value++, 1); + + for (j = 0, done = 0; (j < entries) && (!done); j++) { + of_read_drc_info_cell(&info, &value, &drc); + + if (opt_drc_type && strcmp(opt_drc_type, drc.drc_type)) + continue; + + done = usercb(&drc, data, &ret_code); + } + + return ret_code; +} + +static int get_children_props(struct device_node *dn, const int **drc_indexes, + const int **drc_names, const int **drc_types, + const int **drc_power_domains) +{ + const int *indexes, *names, *types, *domains; + + indexes = of_get_property(dn, "ibm,drc-indexes", NULL); + names = of_get_property(dn, "ibm,drc-names", NULL); + types = of_get_property(dn, "ibm,drc-types", NULL); + domains = of_get_property(dn, "ibm,drc-power-domains", NULL); + + if (!indexes || !names || !types || !domains) { + /* Slot does not have dynamically-removable children */ + return -EINVAL; + } + if (drc_indexes) + *drc_indexes = indexes; + if (drc_names) + /* &drc_names[1] contains NULL terminated slot names */ + *drc_names = names; + if (drc_types) + /* &drc_types[1] contains NULL terminated slot types */ + *drc_types = types; + if (drc_power_domains) + *drc_power_domains = domains; + + return 0; +} + +static int is_php_type(char *drc_type) +{ + unsigned long value; + char *endptr; + + /* PCI Hotplug nodes have an integer for drc_type */ + value = simple_strtoul(drc_type, &endptr, 10); + if (endptr == drc_type) + return 0; + + return 1; +} + +/** + * is_php_dn() - return 1 if this is a hotpluggable pci slot, else 0 + * @dn: target &device_node + * @indexes: passed to get_children_props() + * @names: passed to get_children_props() + * @types: returned from get_children_props() + * @power_domains: + * + * This routine will return true only if the
[RFC 2/6] pseries/drcinfo: Fix bug parsing ibm,drc-info
Replace use of of_prop_next_u32() when parsing the 'ibm,drc-info' structure to simplify and reduce parsing code. Signed-off-by: Michael Bringmann Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info firmware feature" -- end of patch series applied to powerpc next) --- arch/powerpc/platforms/pseries/of_helpers.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c index 6df192f..0185e50 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.c +++ b/arch/powerpc/platforms/pseries/of_helpers.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 + #include #include #include @@ -6,6 +7,9 @@ #include #include "of_helpers.h" +#include "pseries.h" +#define MAX_DRC_NAME_LEN 64 /** * pseries_of_derive_parent - basically like dirname(1) @@ -65,29 +69,19 @@ int of_read_drc_info_cell(struct property **prop, const __be32 **curval, /* Get drc-index-start:encode-int */ p2 = (const __be32 *)p; - p2 = of_prop_next_u32(*prop, p2, &data->drc_index_start); - if (!p2) - return -EINVAL; + data->drc_index_start = of_read_number(p2++, 1); /* Get drc-name-suffix-start:encode-int */ - p2 = of_prop_next_u32(*prop, p2, &data->drc_name_suffix_start); - if (!p2) - return -EINVAL; + data->drc_name_suffix_start = of_read_number(p2++, 1); /* Get number-sequential-elements:encode-int */ - p2 = of_prop_next_u32(*prop, p2, &data->num_sequential_elems); - if (!p2) - return -EINVAL; + data->num_sequential_elems = of_read_number(p2++, 1); /* Get sequential-increment:encode-int */ - p2 = of_prop_next_u32(*prop, p2, &data->sequential_inc); - if (!p2) - return -EINVAL; + data->sequential_inc = of_read_number(p2++, 1); /* Get drc-power-domain:encode-int */ - p2 = of_prop_next_u32(*prop, p2, &data->drc_power_domain); - if (!p2) - return -EINVAL; + data->drc_power_domain = of_read_number(p2++, 1); /* Should now know end of current entry */ (*curval) = (void *)p2;
[RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info
Define interface to acquire arch-specific drc info to match against hotpluggable devices. The current implementation exposes several pseries-specific dynamic memory properties in generic kernel code. This patch set provides an interface to pull that code out of the generic kernel.

Signed-off-by: Michael Bringmann
---
 include/linux/topology.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..df97f5f 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -44,6 +44,15 @@
 int arch_update_cpu_topology(void);

+int arch_find_drc_match(struct device_node *dn,
+			bool (*usercb)(struct device_node *dn,
+				u32 drc_index, char *drc_name,
+				char *drc_type, u32 drc_power_domain,
+				void *data),
+			char *opt_drc_type, char *opt_drc_name,
+			bool match_drc_index, bool ck_php_type,
+			void *data);
+
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE	10
 #define REMOTE_DISTANCE	20
[RFC 0/6] powerpc/pseries: Refactor code to centralize drcinfo parsing
The pseries-specific drc-info feature is currently implemented both within and outside of the powerpc pseries code. This patch set moves the parsing code for the pseries-specific device-tree properties ibm,drc-indexes, ibm,drc-names, ibm,drc-types, ibm,drc-power-domains, and the compressed ibm,drc-info into the platform-specific code for the pseries features.

Signed-off-by: Michael W. Bringmann

Michael Bringmann (6):
  powerpc:/drc Define interface to acquire arch-specific drc info
  pseries/drcinfo: Fix bug parsing ibm,drc-info
  pseries/drcinfo: Pseries impl of arch_find_drc_info
  powerpc/pseries: Use common drcinfo parsing
  powerpc/pci/hotplug: Use common drcinfo parsing
  powerpc: Enable support for ibm,drc-info devtree property
Re: [GIT PULL] Please pull powerpc/linux.git powerpc-4.20-4 tag
The pull request you sent on Sat, 15 Dec 2018 01:26:19 +1100: > https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git > tags/powerpc-4.20-4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/4645453cefcebdff9db26f02cf325607357295c4 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH] PCI/AER: only insert one element into kfifo
On Wed, Dec 12, 2018 at 04:32:30PM +0800, Yanjiang Jin wrote:
> 'commit ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to
> insert locked elements")' replaced kfifo_put() with kfifo_in_spinlocked().
>
> But as "kfifo_in(fifo, buf, n)" describes:
> " * @n: number of elements to be added".
>
> We want to insert only one element into the kfifo, not "sizeof(entry) = 16".
> Without this patch, we would get 15 uninitialized elements.
>
> Signed-off-by: Yanjiang Jin

Since this fixes ecae65e133f2, which was applied for v4.20, I applied this with Keith's reviewed-by to for-linus so we can get it into v4.20.

For some reason the patch didn't apply, but I can't see why, so I just applied it by hand. Thanks!

> ---
>  drivers/pci/pcie/aer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a90a919..fed29de 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> 		.regs = aer_regs,
> 	};
>
> -	if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry),
> +	if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1,
> 				 &aer_recover_ring_lock))
> 		schedule_work(&aer_recover_work);
> 	else
> --
> 1.8.3.1
Re: [PATCH 2/2] of: __of_detach_node() - remove node from phandle cache
On Fri, Dec 14, 2018 at 12:43 AM wrote:
>
> From: Frank Rowand
>
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache. Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry. Remove the node from the
> cache.
>
> Add paranoia checks in of_find_node_by_phandle() as a second level
> of defense (do not return cached node if detached, do not add node
> to cache if detached).
>
> Reported-by: Michael Bringmann
> Signed-off-by: Frank Rowand
> ---
>  drivers/of/base.c       | 29 -
>  drivers/of/dynamic.c    |  3 +++
>  drivers/of/of_private.h |  4
>  3 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d599367cb92a..34a5125713c8 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
>  late_initcall_sync(of_free_phandle_cache);
>  #endif
>
> +/*
> + * Caller must hold devtree_lock.
> + */
> +void __of_free_phandle_cache_entry(phandle handle)
> +{
> +	phandle masked_handle;
> +
> +	if (!handle)
> +		return;
> +
> +	masked_handle = handle & phandle_cache_mask;
> +
> +	if (phandle_cache) {
> +		if (phandle_cache[masked_handle] &&
> +		    handle == phandle_cache[masked_handle]->phandle) {
> +			of_node_put(phandle_cache[masked_handle]);
> +			phandle_cache[masked_handle] = NULL;
> +		}
> +	}
> +}
> +
>  void of_populate_phandle_cache(void)
>  {
>  	unsigned long flags;
> @@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  		if (phandle_cache[masked_handle] &&
>  		    handle == phandle_cache[masked_handle]->phandle)
>  			np = phandle_cache[masked_handle];
> +		if (np && of_node_check_flag(np, OF_DETACHED)) {
> +			of_node_put(np);
> +			phandle_cache[masked_handle] = NULL;

This should never happen, right? Any time we set OF_DETACHED, the entry should get removed from the cache. I think we want a WARN here in case we're in an unexpected state.

Rob
Re: [PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache
On Fri, Dec 14, 2018 at 12:43 AM wrote: > > From: Frank Rowand > > The phandle cache contains struct device_node pointers. The refcount > of the pointers was not incremented while in the cache, allowing use > after free error after kfree() of the node. Add the proper increment > and decrement of the use count. Since we pre-populate the cache at boot, all the nodes will have a ref count and will never be freed unless we happen to repopulate the whole cache. That doesn't seem ideal. The node pointer is not "in use" just because it is in the cache. Rob
Re: use generic DMA mapping code in powerpc V4
On Fri, Dec 14, 2018 at 01:00:26PM +0100, Christian Zigotzky wrote:
> On 12 December 2018 at 3:15PM, Christoph Hellwig wrote:
> > Thanks for bisecting. I've spent some time going over the conversion
> > but can't really pinpoint it. I have three little patches that switch
> > parts of the code to the generic version. This is on top of the
> > last good commit (977706f9755d2d697aa6f45b4f9f0e07516efeda).
> >
> > Can you check with which one things stop working?
>
> Hello Christoph,
>
> Great news! All your patches work!

Not so great, because that means I still have no idea what broke..

> I tested all your patches (including the patch '0004-alloc-free.patch'
> today) and the PASEMI onboard ethernet works and the P5020 board boots
> without any problems. Thank you for your work!

> I have a few days off. That means, I will work less and only for the A-EON
> first level Linux support. I can test again on Thursday next week.

Enjoy your days off! I think I'll need to prepare something new that is better bisectable, and use that opportunity to reason about the changes a little more.
Re: [PATCH 12/34] powerpc/cell: move dma direct window setup out of dma_configure
On Sat, Dec 15, 2018 at 12:29:11AM +1100, Michael Ellerman wrote: > I think the problem is that we don't want to set iommu_bypass_supported > unless cell_iommu_fixed_mapping_init() succeeds. > > Yep. This makes it work for me on cell on top of your v5. Thanks, this looks good. I've folded it with the slight change of moving the iommu_bypass_supported setup into cell_iommu_fixed_mapping_init.
[PATCH] lkdtm: Add tests for NULL pointer dereference
Introduce lkdtm tests for NULL pointer dereference: check access or exec at NULL address.

Signed-off-by: Christophe Leroy
---
 drivers/misc/lkdtm/core.c  |  2 ++
 drivers/misc/lkdtm/lkdtm.h |  2 ++
 drivers/misc/lkdtm/perms.c | 18 ++
 3 files changed, 22 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index bc76756b7eda..36910e1d5c09 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -157,7 +157,9 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(EXEC_VMALLOC),
 	CRASHTYPE(EXEC_RODATA),
 	CRASHTYPE(EXEC_USERSPACE),
+	CRASHTYPE(EXEC_NULL),
 	CRASHTYPE(ACCESS_USERSPACE),
+	CRASHTYPE(ACCESS_NULL),
 	CRASHTYPE(WRITE_RO),
 	CRASHTYPE(WRITE_RO_AFTER_INIT),
 	CRASHTYPE(WRITE_KERN),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..b69ee004a3f7 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -45,7 +45,9 @@ void lkdtm_EXEC_KMALLOC(void);
 void lkdtm_EXEC_VMALLOC(void);
 void lkdtm_EXEC_RODATA(void);
 void lkdtm_EXEC_USERSPACE(void);
+void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
+void lkdtm_ACCESS_NULL(void);

 /* lkdtm_refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index fa54add6375a..62f76d506f04 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -164,6 +164,11 @@ void lkdtm_EXEC_USERSPACE(void)
 	vm_munmap(user_addr, PAGE_SIZE);
 }

+void lkdtm_EXEC_NULL(void)
+{
+	execute_location(NULL, CODE_AS_IS);
+}
+
 void lkdtm_ACCESS_USERSPACE(void)
 {
 	unsigned long user_addr, tmp = 0;
@@ -195,6 +200,19 @@ void lkdtm_ACCESS_USERSPACE(void)
 	vm_munmap(user_addr, PAGE_SIZE);
 }

+void lkdtm_ACCESS_NULL(void)
+{
+	unsigned long tmp;
+	unsigned long *ptr = (unsigned long *)NULL;
+
+	pr_info("attempting bad read at %px\n", ptr);
+	tmp = *ptr;
+	tmp += 0xc0dec0de;
+
+	pr_info("attempting bad write at %px\n", ptr);
+	*ptr = tmp;
+}
+
 void __init lkdtm_perms_init(void)
 {
 	/* Make sure we can write to __ro_after_init values during __init */
--
2.13.3
[PATCH v2] powerpc/mm: make NULL pointer dereferences explicit on bad page faults
As on several other arches, including x86, this patch makes it explicit that a bad page fault is a NULL pointer dereference when the fault address is lower than PAGE_SIZE.

In the meantime, this patch makes all bad_page_fault() messages shorter so that they remain on one single line. And it prefixes them with "BUG: " so that they can easily be grepped.

Signed-off-by: Christophe Leroy
---
v2: Taking into account comments from Michael

 arch/powerpc/mm/fault.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 01b9bcc7fa85..3398291f4785 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -636,21 +636,24 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
 	switch (TRAP(regs)) {
 	case 0x300:
 	case 0x380:
-		printk(KERN_ALERT "Unable to handle kernel paging request for "
-			"data at address 0x%08lx\n", regs->dar);
+		if (regs->dar < PAGE_SIZE)
+			pr_alert("BUG: Kernel NULL pointer dereference");
+		else
+			pr_alert("BUG: Unable to handle kernel data access");
+		pr_cont(" at 0x%08lx\n", regs->dar);
 		break;
 	case 0x400:
 	case 0x480:
-		printk(KERN_ALERT "Unable to handle kernel paging request for "
-			"instruction fetch\n");
+		pr_alert("BUG: Unable to handle kernel instruction fetch%s",
+			 regs->nip < PAGE_SIZE ? " (NULL pointer ?)\n" : "\n");
 		break;
 	case 0x600:
-		printk(KERN_ALERT "Unable to handle kernel paging request for "
-			"unaligned access at address 0x%08lx\n", regs->dar);
+		pr_alert("BUG: Unable to handle kernel unaligned access at 0x%08lx\n",
+			 regs->dar);
 		break;
 	default:
-		printk(KERN_ALERT "Unable to handle kernel paging request for "
-			"unknown fault\n");
+		pr_alert("BUG: Unable to handle unknown paging fault at 0x%08lx\n",
+			 regs->dar);
 		break;
 	}
 	printk(KERN_ALERT "Faulting instruction address: 0x%08lx\n",
--
2.13.3
[GIT PULL] Please pull powerpc/linux.git powerpc-4.20-4 tag
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Linus, Please pull some more powerpc fixes for 4.20: The following changes since commit b2fed34a628df6118b5d4e13f49a33e15f704fa9: selftests/powerpc: Adjust wild_bctr to build with old binutils (2018-11-15 23:05:17 +1100) are available in the git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.20-4 for you to fetch changes up to a225f1567405558fb5410e9b2b90805819df1c67: powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call (2018-12-10 15:19:58 +1100) - -- powerpc fixes for 4.20 #4 One notable fix for our change to split pt_regs between user/kernel, we forgot to update BPF to use the user-visible type which was an ABI break for BPF programs. A slightly ugly but minimal fix to do_syscall_trace_enter() so that we use tracehook_report_syscall_entry() properly. We'll rework the code in next to avoid the empty if body. Seven commits fixing bugs in the new papr_scm (Storage Class Memory) driver. The driver was finally able to be tested on the other hypervisor which exposed several bugs. The fixes are all fairly minimal at least. Fix a crash in our MSI code if an MSI-capable device is plugged into a non-MSI capable PHB, only seen on older hardware (MPC8378). Fix our legacy serial code to look for "stdout-path" since the device trees were updated to use that instead of "linux,stdout-path". A change to the COFF zImage code to fix booting old powermacs. A couple of minor build fixes. Thanks to: Benjamin Herrenschmidt, Daniel Axtens, Dmitry V. Levin, Elvira Khabirova, Oliver O'Halloran, Paul Mackerras, Radu Rendec, Rob Herring, Sandipan Das. 
- -- Benjamin Herrenschmidt (1): powerpc: Look for "stdout-path" when setting up legacy consoles Elvira Khabirova (1): powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call Michael Ellerman (2): powerpc/mm: Fix linux page tables build with some configs powerpc/boot: Fix build failures with -j 1 Oliver O'Halloran (7): powerpc/papr_scm: Use depend instead of select powerpc/papr_scm: Fix resource end address powerpc/papr_scm: Update DT properties powerpc/papr_scm: Remove endian conversions powerpc/papr_scm: Fix DIMM device registration race powerpc/papr_scm: Use ibm,unit-guid as the iset cookie powerpc/mm: Fallback to RAM if the altmap is unusable Paul Mackerras (1): powerpc: Fix COFF zImage booting on old powermacs Radu Rendec (1): powerpc/msi: Fix NULL pointer access in teardown code Sandipan Das (1): powerpc/bpf: Fix broken uapi for BPF_PROG_TYPE_PERF_EVENT arch/powerpc/boot/Makefile | 2 +- arch/powerpc/boot/crt0.S | 4 ++- arch/powerpc/include/asm/perf_event.h | 2 ++ arch/powerpc/include/uapi/asm/Kbuild | 1 - arch/powerpc/include/uapi/asm/bpf_perf_event.h | 9 ++ arch/powerpc/kernel/legacy_serial.c| 6 +++- arch/powerpc/kernel/msi.c | 7 - arch/powerpc/kernel/ptrace.c | 7 - arch/powerpc/mm/dump_linuxpagetables.c | 1 + arch/powerpc/mm/init_64.c | 19 +++-- arch/powerpc/platforms/pseries/Kconfig | 3 +- arch/powerpc/platforms/pseries/papr_scm.c | 39 -- 12 files changed, 80 insertions(+), 20 deletions(-) create mode 100644 arch/powerpc/include/uapi/asm/bpf_perf_event.h -BEGIN PGP SIGNATURE- iQIcBAEBAgAGBQJcE71eAAoJEFHr6jzI4aWAQUkP/0Z1YlNpl2Zf9GG1J1QsBCtj H2KS/a01GciH6TO16ULovjFhJppcwPGOCYKU/elnge4cqC+c7sZDDLeXrB6byzfP 9jYbuC3QZ0ct1d34x7vzg/kplutMu1HClxUBU5g7QF9l2b70rB6UWf1WSJnCLgCg MEwQscEPymJJpLO5r0B1/pCijLN2Rpy0vMbZbOoQW4/G/6qQ3d4ygDdWrh4lJNFf LwWYLXi5SZ/1jFIb6/wYuomT4WIfD+JCaGsH3THAm7sfC7qx+nUOCGtA81vHGUaK nW14snebk0viL8STJ4MO1IFJ9x3MsQAylpyyfcdzQ3Q3+rRkaPWwHrvOcnNavwfw +PKDyqV+Ot9+2Es79ZDGeF68q7L7Q4qtZIIL9ARh5pokvMiycsYPOcMQW3IIbkZQ 
aFS3SAQalGaSAlNp5NDnNoZ4rhb+tlR3k6fCbHkuPehj3IAvVzzc6XY3w2rr0wFy ZAgnv+54aI7CNcasRiu+KStLYflr2xorjrRHWigli9KtyB+uZbx6IzDDGgQpPHAd rUoJujp7G/GUclFJ42LF6MBDKTGU2XJP0rogEkNhmCY8RwbDPsvw52CYosEJmrMo rMgVbO35eRmCmF+cbaRvRzsVXYVuPFZfTt8wF2SYwDzSx49b2tW696Xr5+mkUGw6 vu2XQ1BG1C/0mbIjgjKm =ABj8 -END PGP SIGNATURE-
[PATCH] powerpc/configs: Don't enable PPC_EARLY_DEBUG in defconfigs
This reverts the remains of commit b9ef7d6b11c1 ("powerpc: Update default configurations").

That commit was preceded by a commit which added a config option to control use of BOOTX for early debug, ie. PPC_EARLY_DEBUG_BOOTX, and then the update of the defconfigs was intended to not change behaviour by then enabling the new config option.

However enabling PPC_EARLY_DEBUG had other consequences, notably causing us to register the udbg console at the end of udbg_early_init(). This means on a system which doesn't have anything that BOOTX can use (most systems), we register the udbg console very early but the bootx code just throws everything away, meaning early boot messages are never printed to the console.

What we want to happen is for the udbg console to only be registered later (from setup_arch()) once we've set up udbg_putc, and then all early boot messages will be replayed.

Fixes: b9ef7d6b11c1 ("powerpc: Update default configurations")
Reported-by: Torsten Duwe
Signed-off-by: Michael Ellerman
---
 arch/powerpc/configs/g5_defconfig     | 1 -
 arch/powerpc/configs/maple_defconfig  | 1 -
 arch/powerpc/configs/pmac32_defconfig | 1 -
 arch/powerpc/configs/ppc64_defconfig  | 1 -
 arch/powerpc/configs/ppc6xx_defconfig | 1 -
 5 files changed, 5 deletions(-)

diff --git a/arch/powerpc/configs/g5_defconfig b/arch/powerpc/configs/g5_defconfig
index f686cc1eac0b..ceb3c770786f 100644
--- a/arch/powerpc/configs/g5_defconfig
+++ b/arch/powerpc/configs/g5_defconfig
@@ -246,7 +246,6 @@ CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_LATENCYTOP=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
diff --git a/arch/powerpc/configs/maple_defconfig b/arch/powerpc/configs/maple_defconfig
index f71eddafb02f..c5f2005005d3 100644
--- a/arch/powerpc/configs/maple_defconfig
+++ b/arch/powerpc/configs/maple_defconfig
@@ -108,7 +108,6 @@ CONFIG_LATENCYTOP=y
 CONFIG_XMON=y
 CONFIG_XMON_DEFAULT=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_ECB=m
 CONFIG_CRYPTO_PCBC=m
 # CONFIG_CRYPTO_HW is not set
diff --git a/arch/powerpc/configs/pmac32_defconfig b/arch/powerpc/configs/pmac32_defconfig
index 62948d198d7f..50b610b48914 100644
--- a/arch/powerpc/configs/pmac32_defconfig
+++ b/arch/powerpc/configs/pmac32_defconfig
@@ -297,7 +297,6 @@ CONFIG_LATENCYTOP=y
 CONFIG_XMON=y
 CONFIG_XMON_DEFAULT=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_MD4=m
 CONFIG_CRYPTO_SHA512=m
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 294a0b9d6038..91fdb619b484 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -376,4 +376,3 @@ CONFIG_FTR_FIXUP_SELFTEST=y
 CONFIG_MSI_BITMAP_SELFTEST=y
 CONFIG_XMON=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
diff --git a/arch/powerpc/configs/ppc6xx_defconfig b/arch/powerpc/configs/ppc6xx_defconfig
index 7ee736f20774..53687c3a70c4 100644
--- a/arch/powerpc/configs/ppc6xx_defconfig
+++ b/arch/powerpc/configs/ppc6xx_defconfig
@@ -1155,7 +1155,6 @@ CONFIG_STACK_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_XMON=y
 CONFIG_BOOTX_TEXT=y
-CONFIG_PPC_EARLY_DEBUG=y
 CONFIG_SECURITY=y
 CONFIG_SECURITY_NETWORK=y
 CONFIG_SECURITY_NETWORK_XFRM=y
--
2.17.2
Re: [PATCH 12/34] powerpc/cell: move dma direct window setup out of dma_configure
Christoph Hellwig writes:
> On Sun, Dec 09, 2018 at 09:23:39PM +1100, Michael Ellerman wrote:
>> Christoph Hellwig writes:
>>> Configure the dma settings at device setup time, and stop playing games
>>> with get_pci_dma_ops. This prepares for using the common dma_configure
>>> code later on.
>>>
>>> Signed-off-by: Christoph Hellwig
>>> ---
>>>  arch/powerpc/platforms/cell/iommu.c | 20 +++-
>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> This one's crashing, haven't dug into why yet:
>
> Can you provide a gdb assembly of the exact crash site? This looks
> like for some odd reason the DT structures aren't fully setup by the
> time we are probing the device, which seems odd.

It's dev->of_node which is NULL. Because we were passed a platform_device which doesn't have an of_node. It's the cbe-mic device created in cell_publish_devices().

I can fix that by simply checking for a NULL node, then the system boots but then I have no network devices, due to:

  tg3 0000:00:01.0: enabling device (0140 -> 0142)
  tg3 0000:00:01.0: DMA engine test failed, aborting
  tg3: probe of 0000:00:01.0 failed with error -12
  tg3 0000:00:01.1: enabling device (0140 -> 0142)
  tg3 0000:00:01.1: DMA engine test failed, aborting
  tg3: probe of 0000:00:01.1 failed with error -12

I think the problem is that we don't want to set iommu_bypass_supported unless cell_iommu_fixed_mapping_init() succeeds.

Yep. This makes it work for me on cell on top of your v5.
cheers

diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 348a815779c1..8329fda17cc8 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -813,6 +813,10 @@ static u64 cell_iommu_get_fixed_address(struct device *dev)
 	int i, len, best, naddr, nsize, pna, range_size;

 	np = of_node_get(dev->of_node);
+	if (!np)
+		/* We can be called for platform devices that have no of_node */
+		goto out;
+
 	while (1) {
 		naddr = of_n_addr_cells(np);
 		nsize = of_n_size_cells(np);
@@ -1065,8 +1069,11 @@ static int __init cell_iommu_init(void)
 	/* Setup various callbacks */
 	cell_pci_controller_ops.dma_dev_setup = cell_pci_dma_dev_setup;

-	if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
+	if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0) {
+		cell_pci_controller_ops.iommu_bypass_supported =
+			cell_pci_iommu_bypass_supported;
 		goto done;
+	}

 	/* Create an iommu for each /axon node. */
 	for_each_node_by_name(np, "axon") {
@@ -1085,10 +1092,6 @@ static int __init cell_iommu_init(void)
 	}
 done:
 	/* Setup default PCI iommu ops */
-	if (!iommu_fixed_disabled) {
-		cell_pci_controller_ops.iommu_bypass_supported =
-			cell_pci_iommu_bypass_supported;
-	}
 	set_pci_dma_ops(&dma_iommu_ops);
 	cell_iommu_enabled = true;
 bail:
Re: [PATCH 2/2] s390/pci: handle function enumeration after sriov enablement
On Fri, Dec 14, 2018 at 05:12:45AM -0800, Christoph Hellwig wrote:
> On Thu, Dec 13, 2018 at 06:54:28PM +0100, Sebastian Ott wrote:
> > Implement pcibios_sriov_{add|del}_vfs as empty functions. VF
> > creation will be triggered by the hotplug code.
>
> And instead of having the arch supply a no-op arch override I
> think it would be better to have the config option just stub it
> out in common code.

Or in fact maybe even a runtime flag in struct pci_dev. Who knows if all future s390 PCIe busses will have exactly the same behavior, or if we eventually get the standards-compliant behavior back?
Re: [PATCH 2/2] s390/pci: handle function enumeration after sriov enablement
On Thu, Dec 13, 2018 at 06:54:28PM +0100, Sebastian Ott wrote:
> Implement pcibios_sriov_{add|del}_vfs as empty functions. VF
> creation will be triggered by the hotplug code.

And instead of having the arch supply a no-op arch override I think it would be better to have the config option just stub it out in common code.
Re: [PATCH 1/2] PCI: provide pcibios_sriov_add_vfs
On Thu, Dec 13, 2018 at 06:54:27PM +0100, Sebastian Ott wrote: > Move VF detection and device creation code to weak functions > such that architectures can provide a different implementation. No magic weak functions please. If you really have a reason to override the implementation add a proper Kconfig symbol and an explicit override.
Re: use generic DMA mapping code in powerpc V4
On 12 December 2018 at 3:15PM, Christoph Hellwig wrote:
> Thanks for bisecting. I've spent some time going over the conversion
> but can't really pinpoint it. I have three little patches that switch
> parts of the code to the generic version. This is on top of the
> last good commit (977706f9755d2d697aa6f45b4f9f0e07516efeda).
>
> Can you check with which one things stop working?

Hello Christoph,

Great news! All your patches work!

I tested all your patches (including the patch '0004-alloc-free.patch' today) and the PASEMI onboard ethernet works and the P5020 board boots without any problems. Thank you for your work!

I have a few days off. That means, I will work less and only for the A-EON first level Linux support. I can test again on Thursday next week.

Have a nice weekend!

Cheers,
Christian
[PATCH v1 9/9] mm: better document PG_reserved
The usage of PG_reserved and how PG_reserved pages are to be treated is buried deep down in different parts of the kernel. Let's shine some light onto these details by documenting current users and expected behavior.

Especially, clarify on the "Some of them might not even exist" case. These are physical memory gaps that will never be dumped as they are not marked as IORESOURCE_SYSRAM. PG_reserved does in general not hinder anybody from dumping or swapping. In some cases, these pages will not be stored in the hibernation image.

Cc: Andrew Morton
Cc: Stephen Rothwell
Cc: Pavel Tatashin
Cc: Michal Hocko
Cc: Alexander Duyck
Cc: Matthew Wilcox
Cc: Anthony Yznaga
Cc: Miles Chen
Cc: yi.z.zh...@linux.intel.com
Cc: Dan Williams
Signed-off-by: David Hildenbrand
---
 include/linux/page-flags.h | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 808b4183e30d..9de2e941cbd5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -17,8 +17,37 @@
 /*
  * Various page->flags bits:
  *
- * PG_reserved is set for special pages, which can never be swapped out. Some
- * of them might not even exist...
+ * PG_reserved is set for special pages. The "struct page" of such a page
+ * should in general not be touched (e.g. set dirty) except by their owner.
+ * Pages marked as PG_reserved include:
+ * - Pages part of the kernel image (including vDSO) and similar (e.g. BIOS,
+ *   initrd, HW tables)
+ * - Pages reserved or allocated early during boot (before the page allocator
+ *   was initialized). This includes (depending on the architecture) the
+ *   initial vmmap, initial page tables, crashkernel, elfcorehdr, and much
+ *   much more. Once (if ever) freed, PG_reserved is cleared and they will
+ *   be given to the page allocator.
+ * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
+ *   to read/write these pages might end badly. Don't touch!
+ * - The zero page(s)
+ * - Pages not added to the page allocator when onlining a section because
+ *   they were excluded via the online_page_callback() or because they are
+ *   PG_hwpoison.
+ * - Pages allocated in the context of kexec/kdump (loaded kernel image,
+ *   control pages, vmcoreinfo)
+ * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
+ *   not marked PG_reserved (as they might be in use by somebody else who does
+ *   not respect the caching strategy).
+ * - Pages part of an offline section (struct pages of offline sections should
+ *   not be trusted as they will be initialized when first onlined).
+ * - MCA pages on ia64
+ * - Pages holding CPU notes for POWER Firmware Assisted Dump
+ * - Device memory (e.g. PMEM, DAX, HMM)
+ * Some PG_reserved pages will be excluded from the hibernation image.
+ * PG_reserved does in general not hinder anybody from dumping or swapping
+ * and is no longer required for remap_pfn_range(). ioremap might require it.
+ * Consequently, PG_reserved for a page mapped into user space can indicate
+ * the zero page, the vDSO, MMIO pages or device memory.
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
--
2.17.2
[PATCH v1 8/9] ia64: perfmon: Don't mark buffer pages as PG_reserved
In the old days, remap_pfn_range() required pages to be marked as PG_reserved, so they would e.g. never get swapped out. This was required for special mappings. Nowadays, this is fully handled via the VMA (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP inside remap_pfn_range() to be precise).

PG_reserved is no longer required but only a relict from the past. So only architecture-specific MM handling might require it (e.g. to detect them as MMIO pages). As there are no architecture-specific checks for PageReserved() apart from MCA handling in ia64 code, this can go. Use simple vzalloc()/vfree() instead.

Note that before calling vzalloc(), size has already been aligned to PAGE_SIZE, no need to align again.

Cc: Tony Luck
Cc: Fenghua Yu
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: David Hildenbrand
Cc: David Howells
Cc: Mike Rapoport
Cc: Michal Hocko
Signed-off-by: David Hildenbrand
---
 arch/ia64/kernel/perfmon.c | 59 +++---
 1 file changed, 4 insertions(+), 55 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index a9d4dc6c0427..e1b9287dc455 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -583,17 +583,6 @@ pfm_put_task(struct task_struct *task)
 	if (task != current)
 		put_task_struct(task);
 }
-static inline void
-pfm_reserve_page(unsigned long a)
-{
-	SetPageReserved(vmalloc_to_page((void *)a));
-}
-static inline void
-pfm_unreserve_page(unsigned long a)
-{
-	ClearPageReserved(vmalloc_to_page((void *)a));
-}
-
 static inline unsigned long
 pfm_protect_ctx_ctxsw(pfm_context_t *x)
 {
@@ -817,44 +806,6 @@ pfm_reset_msgq(pfm_context_t *ctx)
 	DPRINT(("ctx=%p msgq reset\n", ctx));
 }
-static void *
-pfm_rvmalloc(unsigned long size)
-{
-	void *mem;
-	unsigned long addr;
-
-	size = PAGE_ALIGN(size);
-	mem = vzalloc(size);
-	if (mem) {
-		//printk("perfmon: CPU%d pfm_rvmalloc(%ld)=%p\n", smp_processor_id(), size, mem);
-		addr = (unsigned long)mem;
-		while (size > 0) {
-			pfm_reserve_page(addr);
-			addr += PAGE_SIZE;
-			size -= PAGE_SIZE;
-		}
-	}
-	return mem;
-}
-
-static void
-pfm_rvfree(void *mem, unsigned long size)
-{
-	unsigned long addr;
-
-	if (mem) {
-		DPRINT(("freeing physical buffer @%p size=%lu\n", mem, size));
-		addr = (unsigned long) mem;
-		while ((long) size > 0) {
-			pfm_unreserve_page(addr);
-			addr += PAGE_SIZE;
-			size -= PAGE_SIZE;
-		}
-		vfree(mem);
-	}
-	return;
-}
-
 static pfm_context_t *
 pfm_context_alloc(int ctx_flags)
 {
@@ -1499,7 +1450,7 @@ pfm_free_smpl_buffer(pfm_context_t *ctx)
 	/*
 	 * free the buffer
 	 */
-	pfm_rvfree(ctx->ctx_smpl_hdr, ctx->ctx_smpl_size);
+	vfree(ctx->ctx_smpl_hdr);

 	ctx->ctx_smpl_hdr = NULL;
 	ctx->ctx_smpl_size = 0UL;
@@ -2138,7 +2089,7 @@ pfm_close(struct inode *inode, struct file *filp)
 	 * All memory free operations (especially for vmalloc'ed memory)
 	 * MUST be done with interrupts ENABLED.
 	 */
-	if (smpl_buf_addr) pfm_rvfree(smpl_buf_addr, smpl_buf_size);
+	vfree(smpl_buf_addr);

 	/*
 	 * return the memory used by the context
@@ -2267,10 +2218,8 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	/*
 	 * We do the easy to undo allocations first.
-	 *
-	 * pfm_rvmalloc(), clears the buffer, so there is no leak
 	 */
-	smpl_buf = pfm_rvmalloc(size);
+	smpl_buf = vzalloc(size);
 	if (smpl_buf == NULL) {
 		DPRINT(("Can't allocate sampling buffer\n"));
 		return -ENOMEM;
 	}
@@ -2347,7 +2296,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 error:
 	vm_area_free(vma);
 error_kmem:
-	pfm_rvfree(smpl_buf, size);
+	vfree(smpl_buf);

 	return -ENOMEM;
 }
--
2.17.2
[PATCH v1 7/9] arm64: kdump: No need to mark crashkernel pages manually PG_reserved
The crashkernel is reserved via memblock_reserve(). memblock_free_all() will call free_low_memory_core_early(), which will go over all reserved memblocks, marking the pages as PG_reserved. So manually marking pages as PG_reserved is not necessary, they are already in the desired state (otherwise they would have been handed over to the buddy as free pages and bad things would happen). Cc: Catalin Marinas Cc: Will Deacon Cc: James Morse Cc: Bhupesh Sharma Cc: David Hildenbrand Cc: Mark Rutland Cc: Dave Kleikamp Cc: Andrew Morton Cc: Mike Rapoport Cc: Michal Hocko Cc: Florian Fainelli Cc: Stefan Agner Cc: Laura Abbott Cc: Greg Hackmann Cc: Johannes Weiner Cc: Kristina Martsenko Cc: CHANDAN VN Cc: AKASHI Takahiro Cc: Logan Gunthorpe Signed-off-by: David Hildenbrand --- arch/arm64/kernel/machine_kexec.c | 2 +- arch/arm64/mm/init.c | 27 --- 2 files changed, 1 insertion(+), 28 deletions(-) diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c index 6f0587b5e941..66b5d697d943 100644 --- a/arch/arm64/kernel/machine_kexec.c +++ b/arch/arm64/kernel/machine_kexec.c @@ -321,7 +321,7 @@ void crash_post_resume(void) * but does not hold any data of loaded kernel image. * * Note that all the pages in crash dump kernel memory have been initially - * marked as Reserved in kexec_reserve_crashkres_pages(). + * marked as Reserved as memory was allocated via memblock_reserve(). * * In hibernation, the pages which are Reserved and yet "nosave" are excluded * from the hibernation iamge. 
crash_is_nosave() does this check for crash dump kernel pages. diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index a8f2e4792ef9..9dcfa809b7ab 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -118,35 +118,10 @@ static void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; } - -static void __init kexec_reserve_crashkres_pages(void) -{ -#ifdef CONFIG_HIBERNATION - phys_addr_t addr; - struct page *page; - - if (!crashk_res.end) - return; - - /* -* To reduce the size of hibernation image, all the pages are -* marked as Reserved initially. -*/ - for (addr = crashk_res.start; addr < (crashk_res.end + 1); - addr += PAGE_SIZE) { - page = phys_to_page(addr); - SetPageReserved(page); - } -#endif -} #else static void __init reserve_crashkernel(void) { } - -static void __init kexec_reserve_crashkres_pages(void) -{ -} #endif /* CONFIG_KEXEC_CORE */ #ifdef CONFIG_CRASH_DUMP @@ -586,8 +561,6 @@ void __init mem_init(void) /* this will put all unused low memory onto the freelists */ memblock_free_all(); - kexec_reserve_crashkres_pages(); - mem_init_print_info(NULL); /* -- 2.17.2
[PATCH v1 6/9] arm64: kexec: no need to ClearPageReserved()
This will be done by free_reserved_page(). Cc: Catalin Marinas Cc: Will Deacon Cc: Bhupesh Sharma Cc: James Morse Cc: Marc Zyngier Cc: Dave Kleikamp Cc: Mark Rutland Cc: Andrew Morton Cc: Michal Hocko Cc: Matthew Wilcox Acked-by: James Morse Signed-off-by: David Hildenbrand --- arch/arm64/kernel/machine_kexec.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c index aa9c94113700..6f0587b5e941 100644 --- a/arch/arm64/kernel/machine_kexec.c +++ b/arch/arm64/kernel/machine_kexec.c @@ -361,7 +361,6 @@ void crash_free_reserved_phys_range(unsigned long begin, unsigned long end) for (addr = begin; addr < end; addr += PAGE_SIZE) { page = phys_to_page(addr); - ClearPageReserved(page); free_reserved_page(page); } } -- 2.17.2
[PATCH v1 5/9] m68k/mm: use __ClearPageReserved()
The PG_reserved flag is cleared from memory that is part of the kernel image (and therefore marked as PG_reserved). Avoid using PG_reserved directly. Cc: Geert Uytterhoeven Cc: Andrew Morton Cc: Michal Hocko Cc: Matthew Wilcox Signed-off-by: David Hildenbrand --- arch/m68k/mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/m68k/mm/memory.c b/arch/m68k/mm/memory.c index b86a2e21693b..227c04fe60d2 100644 --- a/arch/m68k/mm/memory.c +++ b/arch/m68k/mm/memory.c @@ -51,7 +51,7 @@ void __init init_pointer_table(unsigned long ptable) pr_debug("init_pointer_table: %lx, %x\n", ptable, PD_MARKBITS(dp)); /* unreserve the page so it's possible to free that page */ - PD_PAGE(dp)->flags &= ~(1 << PG_reserved); + __ClearPageReserved(PD_PAGE(dp)); init_page_count(PD_PAGE(dp)); return; -- 2.17.2
[PATCH v1 4/9] riscv/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are marked as reserved during boot. As we install a special mapping, the actual struct pages will never be exposed to MM via the page tables. We can therefore leave the pages marked as reserved. Cc: Palmer Dabbelt Cc: Albert Ou Cc: Tobias Klauser Cc: Andrew Morton Cc: Michal Hocko Cc: Matthew Wilcox Acked-by: Palmer Dabbelt Signed-off-by: David Hildenbrand --- arch/riscv/kernel/vdso.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c index 582cb153eb24..0cd044122234 100644 --- a/arch/riscv/kernel/vdso.c +++ b/arch/riscv/kernel/vdso.c @@ -54,7 +54,6 @@ static int __init vdso_init(void) struct page *pg; pg = virt_to_page(vdso_start + (i << PAGE_SHIFT)); - ClearPageReserved(pg); vdso_pagelist[i] = pg; } vdso_pagelist[i] = virt_to_page(vdso_data); -- 2.17.2
[PATCH v1 3/9] powerpc/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are marked as reserved during boot. As we install a special mapping, the actual struct pages will never be exposed to MM via the page tables. We can therefore leave the pages marked as reserved. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Christophe Leroy Cc: Kees Cook Cc: Andrew Morton Cc: Michal Hocko Cc: Matthew Wilcox Signed-off-by: David Hildenbrand --- arch/powerpc/kernel/vdso.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index 65b3bdb99f0b..d59dc2e9a695 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -795,7 +795,6 @@ static int __init vdso_init(void) BUG_ON(vdso32_pagelist == NULL); for (i = 0; i < vdso32_pages; i++) { struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE); - ClearPageReserved(pg); get_page(pg); vdso32_pagelist[i] = pg; } @@ -809,7 +808,6 @@ static int __init vdso_init(void) BUG_ON(vdso64_pagelist == NULL); for (i = 0; i < vdso64_pages; i++) { struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE); - ClearPageReserved(pg); get_page(pg); vdso64_pagelist[i] = pg; } -- 2.17.2
[PATCH v1 2/9] s390/vdso: don't clear PG_reserved
The VDSO is part of the kernel image and therefore the struct pages are marked as reserved during boot. As we install a special mapping, the actual struct pages will never be exposed to MM via the page tables. We can therefore leave the pages marked as reserved. Suggested-by: Martin Schwidefsky Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Michal Hocko Cc: Vasily Gorbik Cc: Kees Cook Cc: Souptick Joarder Cc: Andrew Morton Cc: Michal Hocko Signed-off-by: David Hildenbrand --- arch/s390/kernel/vdso.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c index ebe748a9f472..9e24d23c26c0 100644 --- a/arch/s390/kernel/vdso.c +++ b/arch/s390/kernel/vdso.c @@ -292,7 +292,6 @@ static int __init vdso_init(void) BUG_ON(vdso32_pagelist == NULL); for (i = 0; i < vdso32_pages - 1; i++) { struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE); - ClearPageReserved(pg); get_page(pg); vdso32_pagelist[i] = pg; } @@ -310,7 +309,6 @@ static int __init vdso_init(void) BUG_ON(vdso64_pagelist == NULL); for (i = 0; i < vdso64_pages - 1; i++) { struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE); - ClearPageReserved(pg); get_page(pg); vdso64_pagelist[i] = pg; } -- 2.17.2
[PATCH v1 1/9] agp: efficeon: no need to set PG_reserved on GATT tables
The l1 GATT page table is kept in a special on-chip page with 64 entries. We allocate the l2 page table pages via get_zeroed_page() and enter them into the table. These l2 pages are modified accordingly when inserting/removing memory via efficeon_insert_memory and efficeon_remove_memory. Apart from that, these pages are not exposed or ioremap'ed. We can stop setting them reserved (probably copied from generic code). Cc: David Airlie Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Cc: Andrew Morton Cc: Michal Hocko Cc: Matthew Wilcox Signed-off-by: David Hildenbrand --- drivers/char/agp/efficeon-agp.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/char/agp/efficeon-agp.c b/drivers/char/agp/efficeon-agp.c index 7f88490b5479..c53f0f9ef5b0 100644 --- a/drivers/char/agp/efficeon-agp.c +++ b/drivers/char/agp/efficeon-agp.c @@ -163,7 +163,6 @@ static int efficeon_free_gatt_table(struct agp_bridge_data *bridge) unsigned long page = efficeon_private.l1_table[index]; if (page) { efficeon_private.l1_table[index] = 0; - ClearPageReserved(virt_to_page((char *)page)); free_page(page); freed++; } @@ -219,7 +218,6 @@ static int efficeon_create_gatt_table(struct agp_bridge_data *bridge) efficeon_free_gatt_table(agp_bridge); return -ENOMEM; } - SetPageReserved(virt_to_page((char *)page)); for (offset = 0; offset < PAGE_SIZE; offset += clflush_chunk) clflush((char *)page+offset); -- 2.17.2
[PATCH v1 0/9] mm: PG_reserved cleanups and documentation
I was recently going over all users of PG_reserved. Short story: it is difficult and sometimes not really clear if setting/checking for PG_reserved is only a relic from the past. Easy to break things. I guess I now have a pretty good idea why things are like that nowadays and how they evolved. I had way more cleanups in this series initially, but some architectures take PG_reserved as a way to apply a different caching strategy (for MMIO pages). So I decided to only include the most obvious changes (that are less likely to break something). So the big chunk of manual SetPageReserved users are MMIO/DMA related things on device buffers. Most notably, for device memory we will hopefully soon stop setting PG_reserved. Then the documentation has to be updated. RFC -> V1: - Add more details to "mm: better document PG_reserved" - Add "arm64: kdump: No need to mark crashkernel pages manually PG_reserved" - Add "ia64: perfmon: Don't mark buffer pages as PG_reserved" - Added ACKs David Hildenbrand (9): agp: efficeon: no need to set PG_reserved on GATT tables s390/vdso: don't clear PG_reserved powerpc/vdso: don't clear PG_reserved riscv/vdso: don't clear PG_reserved m68k/mm: use __ClearPageReserved() arm64: kexec: no need to ClearPageReserved() arm64: kdump: No need to mark crashkernel pages manually PG_reserved ia64: perfmon: Don't mark buffer pages as PG_reserved mm: better document PG_reserved arch/arm64/kernel/machine_kexec.c | 3 +- arch/arm64/mm/init.c | 27 -- arch/ia64/kernel/perfmon.c| 59 +++ arch/m68k/mm/memory.c | 2 +- arch/powerpc/kernel/vdso.c| 2 -- arch/riscv/kernel/vdso.c | 1 - arch/s390/kernel/vdso.c | 2 -- drivers/char/agp/efficeon-agp.c | 2 -- include/linux/page-flags.h| 33 +++-- 9 files changed, 37 insertions(+), 94 deletions(-) -- 2.17.2
Re: [PATCH] powerpc/prom: fix early DEBUG messages
On 14/12/2018 at 07:22, Michael Ellerman wrote: Christophe Leroy writes: diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index fe758cedb93f..d8e56e03c9c6 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -749,7 +749,11 @@ void __init early_init_devtree(void *params) memblock_allow_resize(); memblock_dump_all(); +#ifdef CONFIG_PHYS_64BIT DBG("Phys. mem: %llx\n", memblock_phys_mem_size()); +#else + DBG("Phys. mem: %x\n", memblock_phys_mem_size()); +#endif Can we just do: DBG("Phys. mem: %llx\n", (unsigned long long)memblock_phys_mem_size()); ? Yes that's obviously better, especially as we don't fix the length of the displayed value. Christophe
[PATCH v2] powerpc/prom: fix early DEBUG messages
This patch fixes early DEBUG messages in prom.c: - Use %px instead of %p to see the addresses - Cast memblock_phys_mem_size() with (unsigned long long) to avoid build failure when phys_addr_t is not 64 bits. Signed-off-by: Christophe Leroy --- v2: cast instead of #ifdef %llx/%x arch/powerpc/kernel/prom.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index fe758cedb93f..87a68e2dc531 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -129,7 +129,7 @@ static void __init move_device_tree(void) p = __va(memblock_phys_alloc(size, PAGE_SIZE)); memcpy(p, initial_boot_params, size); initial_boot_params = p; - DBG("Moved device tree to 0x%p\n", p); + DBG("Moved device tree to 0x%px\n", p); } DBG("<- move_device_tree\n"); @@ -689,7 +689,7 @@ void __init early_init_devtree(void *params) { phys_addr_t limit; - DBG(" -> early_init_devtree(%p)\n", params); + DBG(" -> early_init_devtree(%px)\n", params); /* Too early to BUG_ON(), do it by hand */ if (!early_init_dt_verify(params)) @@ -749,7 +749,7 @@ void __init early_init_devtree(void *params) memblock_allow_resize(); memblock_dump_all(); - DBG("Phys. mem: %llx\n", memblock_phys_mem_size()); + DBG("Phys. mem: %llx\n", (unsigned long long)memblock_phys_mem_size()); /* We may need to relocate the flat tree, do it now. * FIXME .. and the initrd too? */ -- 2.13.3
[PATCH v2 2/5] powerpc/perf: Rearrange setting of ldbar for thread-imc
LDBAR holds the memory address allocated for each cpu. For thread-imc the mode bit (i.e bit 1) of LDBAR is set to accumulation. Currently, ldbar is loaded with per cpu memory address and mode set to accumulation at boot time. To enable trace-imc, the mode bit of ldbar should be set to 'trace'. So to accommodate trace-mode of IMC, reposition setting of ldbar for thread-imc to thread_imc_event_add(). Also reset ldbar at thread_imc_event_del(). Signed-off-by: Anju T Sudhakar --- arch/powerpc/perf/imc-pmu.c | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index f292a3f284f1..3bef46f8417d 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -806,8 +806,11 @@ static int core_imc_event_init(struct perf_event *event) } /* - * Allocates a page of memory for each of the online cpus, and write the - * physical base address of that page to the LDBAR for that cpu. + * Allocates a page of memory for each of the online cpus, and load + * LDBAR with 0. + * The physical base address of the page allocated for a cpu will be + * written to the LDBAR for that cpu, when the thread-imc event + * is added. 
* * LDBAR Register Layout: * @@ -825,7 +828,7 @@ static int core_imc_event_init(struct perf_event *event) */ static int thread_imc_mem_alloc(int cpu_id, int size) { - u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, cpu_id); + u64 *local_mem = per_cpu(thread_imc_mem, cpu_id); int nid = cpu_to_node(cpu_id); if (!local_mem) { @@ -842,9 +845,7 @@ static int thread_imc_mem_alloc(int cpu_id, int size) per_cpu(thread_imc_mem, cpu_id) = local_mem; } - ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | THREAD_IMC_ENABLE; - - mtspr(SPRN_LDBAR, ldbar_value); + mtspr(SPRN_LDBAR, 0); return 0; } @@ -995,6 +996,7 @@ static int thread_imc_event_add(struct perf_event *event, int flags) { int core_id; struct imc_pmu_ref *ref; + u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, smp_processor_id()); if (flags & PERF_EF_START) imc_event_start(event, flags); @@ -1003,6 +1005,9 @@ static int thread_imc_event_add(struct perf_event *event, int flags) return -EINVAL; core_id = smp_processor_id() / threads_per_core; + ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | THREAD_IMC_ENABLE; + mtspr(SPRN_LDBAR, ldbar_value); + /* * imc pmus are enabled only when it is used. * See if this is triggered for the first time. @@ -1034,11 +1039,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags) int core_id; struct imc_pmu_ref *ref; - /* -* Take a snapshot and calculate the delta and update -* the event counter values. -*/ - imc_event_update(event); + mtspr(SPRN_LDBAR, 0); core_id = smp_processor_id() / threads_per_core; ref = &core_imc_refc[core_id]; @@ -1057,6 +1058,11 @@ static void thread_imc_event_del(struct perf_event *event, int flags) ref->refc = 0; } mutex_unlock(&ref->lock); + /* +* Take a snapshot and calculate the delta and update +* the event counter values. +*/ + imc_event_update(event); } /* update_pmu_ops : Populate the appropriate operations for "pmu" */ -- 2.17.1
[PATCH v2 5/5] powerpc/perf: Trace imc PMU functions
Add PMU functions to support trace-imc. Signed-off-by: Anju T Sudhakar --- arch/powerpc/perf/imc-pmu.c | 175 1 file changed, 175 insertions(+) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 1f09265c8fb0..32ff0e449fca 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -1120,6 +1120,173 @@ static int trace_imc_cpu_init(void) ppc_trace_imc_cpu_offline); } +static u64 get_trace_imc_event_base_addr(void) +{ + return (u64)per_cpu(trace_imc_mem, smp_processor_id()); +} + +/* + * Function to parse trace-imc data obtained + * and to prepare the perf sample. + */ +static int trace_imc_prepare_sample(struct trace_imc_data *mem, + struct perf_sample_data *data, + u64 *prev_tb, + struct perf_event_header *header, + struct perf_event *event) +{ + /* Sanity checks for a valid record */ + if (be64_to_cpu(READ_ONCE(mem->tb1)) > *prev_tb) + *prev_tb = be64_to_cpu(READ_ONCE(mem->tb1)); + else + return -EINVAL; + + if ((be64_to_cpu(READ_ONCE(mem->tb1)) & IMC_TRACE_RECORD_TB1_MASK) != +be64_to_cpu(READ_ONCE(mem->tb2))) + return -EINVAL; + + /* Prepare perf sample */ + data->ip = be64_to_cpu(READ_ONCE(mem->ip)); + data->period = event->hw.last_period; + + header->type = PERF_RECORD_SAMPLE; + header->size = sizeof(*header) + event->header_size; + header->misc = 0; + + if (is_kernel_addr(data->ip)) + header->misc |= PERF_RECORD_MISC_KERNEL; + else + header->misc |= PERF_RECORD_MISC_USER; + + perf_event_header__init_id(header, data, event); + + return 0; +} + +static void dump_trace_imc_data(struct perf_event *event) +{ + struct trace_imc_data *mem; + int i, ret; + u64 prev_tb = 0; + + mem = (struct trace_imc_data *)get_trace_imc_event_base_addr(); + for (i = 0; i < (trace_imc_mem_size / sizeof(struct trace_imc_data)); + i++, mem++) { + struct perf_sample_data data; + struct perf_event_header header; + + ret = trace_imc_prepare_sample(mem, &data, &prev_tb, &header, event); + if (ret) /* Exit, if not a valid record */ + break; + else { + /* If this is a valid record, create the sample */ + struct perf_output_handle handle; + + if (perf_output_begin(&handle, event, header.size)) + return; + + perf_output_sample(&handle, &header, &data, event); + perf_output_end(&handle); + } + } +} + +static int trace_imc_event_add(struct perf_event *event, int flags) +{ + /* Enable the sched_task to start the engine */ + perf_sched_cb_inc(event->ctx->pmu); + return 0; +} + +static void trace_imc_event_read(struct perf_event *event) +{ + dump_trace_imc_data(event); +} + +static void trace_imc_event_stop(struct perf_event *event, int flags) +{ + trace_imc_event_read(event); +} + +static void trace_imc_event_start(struct perf_event *event, int flags) +{ + return; +} + +static void trace_imc_event_del(struct perf_event *event, int flags) +{ + perf_sched_cb_dec(event->ctx->pmu); +} + +void trace_imc_pmu_sched_task(struct perf_event_context *ctx, + bool sched_in) +{ + int core_id = smp_processor_id() / threads_per_core; + struct imc_pmu_ref *ref; + u64 local_mem, ldbar_value; + + /* Set trace-imc bit in ldbar and load ldbar with per-thread memory address */ + local_mem = get_trace_imc_event_base_addr(); + ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | TRACE_IMC_ENABLE; + + ref = &core_imc_refc[core_id]; + if (!ref) + return; + + if (sched_in) { + mtspr(SPRN_LDBAR, ldbar_value); + mutex_lock(&ref->lock); + if (ref->refc == 0) { + if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE, + get_hard_smp_processor_id(smp_processor_id()))) { + mutex_unlock(&ref->lock); + pr_err("trace-imc: Unable to start the counters for core %d\n", core_id); + mtspr(SPRN_LDBAR, 0); + return; + } + } + ++ref->refc; + mutex_unlock(&ref->lock); + } else { + mtspr(SPRN_LDBAR, 0); + mutex_lock(&ref->lock); + ref->refc--; + if (ref->refc == 0) { + if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE, +
[PATCH v2 3/5] powerpc/perf: Add privileged access check for thread_imc
From: Madhavan Srinivasan Add code to restrict user access to thread_imc pmu since some events report privilege level information. Fixes: f74c89bd80fb3 ('powerpc/perf: Add thread IMC PMU support') Signed-off-by: Madhavan Srinivasan Signed-off-by: Anju T Sudhakar --- arch/powerpc/perf/imc-pmu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 3bef46f8417d..5ca80545a849 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -877,6 +877,9 @@ static int thread_imc_event_init(struct perf_event *event) if (event->attr.type != event->pmu->type) return -ENOENT; + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + /* Sampling not supported */ if (event->hw.sample_period) return -EINVAL; -- 2.17.1
[PATCH v2 4/5] powerpc/perf: Trace imc events detection and cpuhotplug
This patch detects trace-imc events, does memory initializations for each online cpu, and registers cpuhotplug call-backs. Signed-off-by: Anju T Sudhakar --- arch/powerpc/perf/imc-pmu.c | 91 +++ arch/powerpc/platforms/powernv/opal-imc.c | 3 + include/linux/cpuhotplug.h| 1 + 3 files changed, 95 insertions(+) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 5ca80545a849..1f09265c8fb0 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -43,6 +43,10 @@ static DEFINE_PER_CPU(u64 *, thread_imc_mem); static struct imc_pmu *thread_imc_pmu; static int thread_imc_mem_size; +/* Trace IMC data structures */ +static DEFINE_PER_CPU(u64 *, trace_imc_mem); +static int trace_imc_mem_size; + static struct imc_pmu *imc_event_to_pmu(struct perf_event *event) { return container_of(event->pmu, struct imc_pmu, pmu); @@ -1068,6 +1072,54 @@ static void thread_imc_event_del(struct perf_event *event, int flags) imc_event_update(event); } +/* + * Allocate a page of memory for each cpu, and load LDBAR with 0. 
+ */ +static int trace_imc_mem_alloc(int cpu_id, int size) +{ + u64 *local_mem = per_cpu(trace_imc_mem, cpu_id); + int phys_id = cpu_to_node(cpu_id), rc = 0; + + if (!local_mem) { + local_mem = page_address(alloc_pages_node(phys_id, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size))); + if (!local_mem) + return -ENOMEM; + per_cpu(trace_imc_mem, cpu_id) = local_mem; + + /* Initialise the counters for trace mode */ + rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_TRACE, __pa((void *)local_mem), + get_hard_smp_processor_id(cpu_id)); + if (rc) { + pr_info("IMC:opal init failed for trace imc\n"); + return rc; + } + } + + mtspr(SPRN_LDBAR, 0); + return 0; +} + +static int ppc_trace_imc_cpu_online(unsigned int cpu) +{ + return trace_imc_mem_alloc(cpu, trace_imc_mem_size); +} + +static int ppc_trace_imc_cpu_offline(unsigned int cpu) +{ + mtspr(SPRN_LDBAR, 0); + return 0; +} + +static int trace_imc_cpu_init(void) +{ + return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE, + "perf/powerpc/imc_trace:online", + ppc_trace_imc_cpu_online, + ppc_trace_imc_cpu_offline); +} + /* update_pmu_ops : Populate the appropriate operations for "pmu" */ static int update_pmu_ops(struct imc_pmu *pmu) { @@ -1189,6 +1241,17 @@ static void cleanup_all_thread_imc_memory(void) } } +static void cleanup_all_trace_imc_memory(void) +{ + int i, order = get_order(trace_imc_mem_size); + + for_each_online_cpu(i) { + if (per_cpu(trace_imc_mem, i)) + free_pages((u64)per_cpu(trace_imc_mem, i), order); + + } +} + /* Function to free the attr_groups which are dynamically allocated */ static void imc_common_mem_free(struct imc_pmu *pmu_ptr) { @@ -1230,6 +1293,11 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr) cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE); cleanup_all_thread_imc_memory(); } + + if (pmu_ptr->domain == IMC_DOMAIN_TRACE) { + cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE); + cleanup_all_trace_imc_memory(); + } } /* @@ 
-1312,6 +1380,21 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct device_node *parent, thread_imc_pmu = pmu_ptr; break; + case IMC_DOMAIN_TRACE: + /* Update the pmu name */ + pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s", s, "_imc"); + if (!pmu_ptr->pmu.name) + return -ENOMEM; + + trace_imc_mem_size = pmu_ptr->counter_mem_size; + for_each_online_cpu(cpu) { + res = trace_imc_mem_alloc(cpu, trace_imc_mem_size); + if (res) { + cleanup_all_trace_imc_memory(); + goto err; + } + } + break; default: return -EINVAL; } @@ -1384,6 +1467,14 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id goto err_free_mem; } + break; + case IMC_DOMAIN_TRACE: + ret = trace_imc_cpu_init(); + if (ret) { + cleanup_all_trace_imc_memory(); + goto err_free_mem; + } + break; default:
[PATCH v2 1/5] powerpc/include: Add data structures and macros for IMC trace mode
Add the macros needed for IMC (In-Memory Collection Counters) trace-mode and data structure to hold the trace-imc record data. Also, add the new type "OPAL_IMC_COUNTERS_TRACE" in 'opal-api.h', since there is a new switch case added in the opal-calls for IMC. Signed-off-by: Anju T Sudhakar --- arch/powerpc/include/asm/imc-pmu.h | 39 + arch/powerpc/include/asm/opal-api.h | 1 + 2 files changed, 40 insertions(+) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 69f516ecb2fd..7c2ef0e42661 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -33,6 +33,7 @@ */ #define THREAD_IMC_LDBAR_MASK 0x0003ffffffffe000ULL #define THREAD_IMC_ENABLE 0x8000000000000000ULL +#define TRACE_IMC_ENABLE 0x4000000000000000ULL /* * For debugfs interface for imc-mode and imc-command @@ -59,6 +60,34 @@ struct imc_events { char *scale; }; +/* + * Trace IMC hardware updates a 64bytes record on + * Core Performance Monitoring Counter (CPMC) + * overflow. Here is the layout for the trace imc record + * + * DW 0 : Timebase + * DW 1 : Program Counter + * DW 2 : PIDR information + * DW 3 : CPMC1 + * DW 4 : CPMC2 + * DW 5 : CPMC3 + * DW 6 : CPMC4 + * DW 7 : Timebase + * . + * + * The following is the data structure to hold trace imc data. + */ +struct trace_imc_data { + u64 tb1; + u64 ip; + u64 val; + u64 cpmc1; + u64 cpmc2; + u64 cpmc3; + u64 cpmc4; + u64 tb2; +}; + /* Event attribute array index */ #define IMC_FORMAT_ATTR0 #define IMC_EVENT_ATTR 1 @@ -68,6 +97,13 @@ struct imc_events { /* PMU Format attribute macros */ #define IMC_EVENT_OFFSET_MASK 0xffffffffULL +/* + * Macro to mask bits 0:21 of first double word(which is the timebase) to + * compare with 8th double word (timebase) of trace imc record data. + */ +#define IMC_TRACE_RECORD_TB1_MASK 0x3ffffffffffULL + + /* * Device tree parser code detects IMC pmu support and * registers new IMC pmus. 
This structure will hold the @@ -113,6 +149,7 @@ struct imc_pmu_ref { enum { IMC_TYPE_THREAD = 0x1, + IMC_TYPE_TRACE = 0x2, IMC_TYPE_CORE = 0x4, IMC_TYPE_CHIP = 0x10, }; @@ -123,6 +160,8 @@ enum { #define IMC_DOMAIN_NEST1 #define IMC_DOMAIN_CORE2 #define IMC_DOMAIN_THREAD 3 +/* For trace-imc the domain is still thread but it operates in trace-mode */ +#define IMC_DOMAIN_TRACE 4 extern int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id); diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 870fb7b239ea..a4130b21b159 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -1118,6 +1118,7 @@ enum { enum { OPAL_IMC_COUNTERS_NEST = 1, OPAL_IMC_COUNTERS_CORE = 2, + OPAL_IMC_COUNTERS_TRACE = 3, }; -- 2.17.1
[PATCH v2 0/5] powerpc/perf: IMC trace-mode support
IMC (In-Memory Collection counters) is a hardware monitoring facility that collects a large number of hardware performance events. POWER9 supports two modes for IMC: Accumulation mode and Trace mode. In Accumulation mode, event counts are accumulated in system memory. The hypervisor then reads the posted counts periodically or when requested. In IMC Trace mode, the event counted is fixed to cycles, and on each overflow the hardware snapshots the program counter along with other details and writes into memory pointed to by LDBAR (ring buffer memory; hardware wraps around). LDBAR has a bit which indicates the IMC trace-mode. Trace-IMC Implementation: -- To enable trace-imc, we need to * Add a trace node in the DTS file for power9, so that the new trace node can be discovered by the kernel. Information included in the DTS file is as follows (a snippet from the ima-catalog): TRACE_IMC: trace-events { #address-cells = <0x1>; #size-cells = <0x1>; event@1020 { event-name = "cycles" ; reg = <0x1020 0x8>; desc = "Reference cycles" ; }; }; trace@0 { compatible = "ibm,imc-counters"; events-prefix = "trace_"; reg = <0x0 0x8>; events = < &TRACE_IMC >; type = <0x2>; size = <0x4>; }; OP-BUILD changes needed to include the "trace node" are already pulled into the ima-catalog repo. https://github.com/open-power/op-build/commit/d3e75dc26d1283d7d5eb444bff1ec9e40d5dfc07 * Enhance the opal_imc_counters_* calls to support this new trace mode in imc. Add support to initialize the trace-mode scom. TRACE_IMC_SCOM bit representation: 0:1 : SAMPSEL 2:33: CPMC_LOAD 34:40 : CPMC1SEL 41:47 : CPMC2SEL 48:50 : BUFFERSIZE 51:63 : RESERVED CPMC_LOAD contains the sampling duration. SAMPSEL and CPMC*SEL determine the event to count. BUFFERSIZE indicates the memory range. On each overflow, hardware snapshots the program counter along with other details, updates the memory, and reloads the CPMC_LOAD value for the next sampling duration. 
IMC hardware does not support exceptions, so it quietly wraps around if the memory buffer reaches the end. Link to the skiboot patches to enhance the opal_imc_counters_* calls: https://lists.ozlabs.org/pipermail/skiboot/2018-December/012878.html https://lists.ozlabs.org/pipermail/skiboot/2018-December/012879.html https://lists.ozlabs.org/pipermail/skiboot/2018-December/012882.html https://lists.ozlabs.org/pipermail/skiboot/2018-December/012880.html https://lists.ozlabs.org/pipermail/skiboot/2018-December/012881.html https://lists.ozlabs.org/pipermail/skiboot/2018-December/012883.html * Set LDBAR spr to enable imc-trace mode.
Re: [PATCH] powerpc/mm: make NULL pointer dereferences explicit on bad page faults.
Hi Michael, On 14/12/2018 at 01:57, Michael Ellerman wrote: Hi Christophe, You know it's the trivial patches that are going to get lots of review comments :) I'm so happy to get comments. Christophe Leroy writes: Like several other arches, including x86, this patch makes it explicit that a bad page fault is a NULL pointer dereference when the fault address is lower than PAGE_SIZE. I'm being pedantic, but it's not necessarily a NULL pointer dereference. It might just be a direct access to a low address, eg: char *p = 0x100; *p = 0; That's not a NULL pointer dereference. But other arches do print this so I guess it's OK to add, and in most cases it will be an actual NULL pointer dereference. I wonder though if we should use 4096 rather than PAGE_SIZE, given that's the actual value other arches are using. We support 256K pages on some systems, which is getting quite large. Those invalid accesses are caught because the first page is marked non-present or non-accessible in the page table, so I think using PAGE_SIZE here is valid regardless of the page size. Looks like the arches have PAGE_SHIFT ranging from 12 to 16 mainly. diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index d51cf5f4e45e..501a1eadb3e9 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -631,13 +631,16 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig) switch (TRAP(regs)) { case 0x300: case 0x380: - printk(KERN_ALERT "Unable to handle kernel paging request for " - "data at address 0x%08lx\n", regs->dar); + pr_alert("Unable to handle kernel %s for data at address 0x%08lx\n", +regs->dar < PAGE_SIZE ? "NULL pointer dereference" : +"paging request", +regs->dar); This is now too long I think, with printk time you get: [ 1096.450711] Unable to handle kernel NULL pointer dereference for data at address 0x Which is 93 columns. It's true on many systems it doesn't really matter any more, but it would still be good if it was shorter. 
I like that on x86 they prefix it with "BUG:", just to avoid any confusion. What if we had for the NULL pointer case: BUG: Kernel NULL pointer dereference at 0x And for the normal case: BUG: Unable to handle kernel data access at 0x Note on the very next line we print: Faulting instruction address: 0xc0795cc8 So there should be no confusion about whether "at" refers to the data address or the instruction address. Agreed case 0x400: case 0x480: - printk(KERN_ALERT "Unable to handle kernel paging request for " - "instruction fetch\n"); + pr_alert("Unable to handle kernel %s for instruction fetch\n", +regs->nip < PAGE_SIZE ? "NULL pointer dereference" : +"paging request"); I don't really like using "NULL pointer dereference" here, that terminology makes me think of a load/store, I think it confuses things rather than making it clearer. What about: BUG: Unable to handle kernel instruction fetch at 0x I think we still need to make it explicit that we jumped there due to a NULL function pointer, although I don't have a good text idea yet for this. break; case 0x600: printk(KERN_ALERT "Unable to handle kernel paging request for " It would be good to clean up these other cases as well. They seem to be trying to use the "page request for" terminology which leads to them being very wordy. I assume that was done to help people grepping kernel logs for errors, but I think we should not worry about that if we have the "BUG:" prefix. So we have: printk(KERN_ALERT "Unable to handle kernel paging request for " "unaligned access at address 0x%08lx\n", regs->dar); What about: BUG: Unable to handle kernel unaligned access at 0x And: printk(KERN_ALERT "Unable to handle kernel paging request for " "unknown fault\n"); What about: BUG: Unable to handle unknown paging fault at 0x Thoughts? Looks like good ideas, I'll carry on. Christophe