Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
On 3/13/20 4:08 AM, Kim Phillips wrote:
On 3/11/20 11:00 AM, Ravi Bangoria wrote:
Hi Kim,

Hi Ravi,

On 3/6/20 3:36 AM, Kim Phillips wrote:
On 3/3/20 3:55 AM, Kim Phillips wrote:
On 3/2/20 2:21 PM, Stephane Eranian wrote:
On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra wrote:
On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:

Modern processors export such hazard data in Performance Monitoring Unit
(PMU) registers. E.g., the 'Sampled Instruction Event Register' on IBM
PowerPC[1][2] and 'Instruction-Based Sampling' on AMD[3] provide similar
information.

Implementation detail: A new sample_type called PERF_SAMPLE_PIPELINE_HAZ
is introduced. If it's set, the kernel converts arch-specific hazard
information into a generic format:

  struct perf_pipeline_haz_data {
	/* Instruction/Opcode type: Load, Store, Branch */
	__u8	itype;
	/* Instruction Cache source */
	__u8	icache;
	/* Instruction suffered hazard in pipeline stage */
	__u8	hazard_stage;
	/* Hazard reason */
	__u8	hazard_reason;
	/* Instruction suffered stall in pipeline stage */
	__u8	stall_stage;
	/* Stall reason */
	__u8	stall_reason;
	__u16	pad;
  };

Kim, does this format indeed work for AMD IBS?

It's not really 1:1; we don't have these separations of stages and
reasons. We have, for example, "missed in L2 cache". So IBS output is
flatter, with more cycle latency figures than IBM's, AFAICT.

AMD IBS captures pipeline latency data in case of Fetch sampling, such as
the fetch latency, tag-to-retire latency, completion-to-retire latency,
and so on. Yes, Op sampling does provide more load/store-centric data,
but it also captures more detailed data for branch instructions. And we
also looked at ARM SPE, which likewise captures more detailed pipeline
and latency information.

Personally, I don't like the term hazard. This is too IBM Power specific.
We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to only count stalled or otherwise bad
events. IBS' PPR description has one occurrence of the word 'stall', and
none of 'penalty'. The way I read IBS is that it's just reporting more
sample data than the precise IP: things like hits, misses, cycle
latencies, addresses, types, etc., so words like 'extended', or the
'auxiliary' already used today, are more appropriate for IBS, although
I'm the last person to bikeshed.

We are thinking of using the word "pipeline" instead of hazard.

Hm, the word 'pipeline' occurs 0 times in the IBS documentation.

No problem. We thought 'pipeline' is a generic hardware term, which is
why we proposed it. We are open to any term that is generic enough.

I realize there are a couple of core pipeline-specific pieces of
information coming out of it, but the vast majority are addresses,
latencies of various components in the memory hierarchy, and various
component hit/miss bits.

Yes, we should capture core-pipeline-specific details. For example, IBS
generates branch unit information (IbsOpData1) and icache-related data
(IbsFetchCtl), which is something that shouldn't be extended as part of
perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it should
populate perf_mem_data_src fields, just like POWER9 can:

  union perf_mem_data_src {
	...
	__u64	mem_rsvd:24,
		mem_snoopx:2,	/* snoop mode, ext */
		mem_remote:1,	/* remote */
		mem_lvl_num:4,	/* memory hierarchy level number */
		mem_dtlb:7,	/* tlb access */
		mem_lock:2,	/* lock instr */
		mem_snoop:5,	/* snoop mode */
		mem_lvl:14,	/* memory hierarchy level */
		mem_op:5;	/* type of opcode */
  };

E.g., SIER[LDST] and SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op' and 'mem_lock',
and the Reload Bus Source Encoding bits can be used to populate
mem_snoop, right?

Hi Kim,

Yes. We do expose these data as part of perf-mem for POWER.

For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be used for
the ld/st target addresses, too. What's needed here is the
vendor-specific extended sample information that all these technologies
gather, of which things like, e.g., 'L1 TLB cycle latency' we all should
have in common.

Yes. We will include fields to capture the latency cycles (like issue
latency, instruction completion latency, etc.) along with other pipeline
details in the proposed structure.

Latency figures are just an example, and from what I can tell, struct
perf_sample_data already has a 'weight' member, used with
PERF_SAMPLE_WEIGHT, that is used by intel-pt to transfer memory access
latency figures. Granted, that's a bad name given all other vendors don't
call latency 'weight'.
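To make the mapping discussion concrete, here is a rough sketch of how
op-side IBS bits could be folded into perf_mem_data_src. The
ibs_op_sample fields below are hypothetical placeholders (not real IBS
register layouts), and the level mapping is deliberately simplified;
only the perf_mem_data_src side uses the real uapi names:

#include <linux/perf_event.h>

/*
 * Hypothetical, pre-decoded IBS op sample. These fields are
 * illustrative only; real IBS bit layouts differ.
 */
struct ibs_op_sample {
	unsigned int is_load;
	unsigned int is_store;
	unsigned int dc_miss;		/* data cache miss */
	unsigned int l2_miss;		/* L2 cache miss */
	unsigned int dc_locked;		/* locked operation */
};

static u64 ibs_to_data_src(const struct ibs_op_sample *op)
{
	union perf_mem_data_src src = { .val = 0 };

	if (op->is_load)
		src.mem_op = PERF_MEM_OP_LOAD;
	else if (op->is_store)
		src.mem_op = PERF_MEM_OP_STORE;
	else
		src.mem_op = PERF_MEM_OP_NA;

	/* Simplified hit level: L1 if no miss, else L2, else beyond. */
	if (!op->dc_miss)
		src.mem_lvl = PERF_MEM_LVL_L1 | PERF_MEM_LVL_HIT;
	else if (!op->l2_miss)
		src.mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
	else
		src.mem_lvl = PERF_MEM_LVL_L3 | PERF_MEM_LVL_MISS;

	src.mem_lock = op->dc_locked ? PERF_MEM_LOCK_LOCKED :
				       PERF_MEM_LOCK_NA;
	return src.val;
}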
[PATCH v5 11/11] perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric events
The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
facility to continuously collect a large number of hardware performance
metrics efficiently and accurately. This patch adds an hv_24x7 metric
file for different socket/chip resources.

Result on a power9 platform:

command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0 -I 1000
     1.96188        0.9   0.3
     2.000285720    0.5   0.1
     3.000424990    0.4   0.1

command:# ./perf stat --metric-only -M PowerBUS_Frequency -C 0 -I 1000
     1.97981        2.3   2.3
     2.000291713    2.3   2.3
     3.000421719    2.3   2.3
     4.000550912    2.3   2.3

Signed-off-by: Kajol Jain
---
 .../arch/powerpc/power9/nest_metrics.json | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
new file mode 100644
index ..ac38f5540ac6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
@@ -0,0 +1,19 @@
+[
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_RD_DISP_PORT01\\,chip\\=?@ + hv_24x7@PM_MCS01_128B_RD_DISP_PORT23\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_RD_DISP_PORT01\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_RD_DISP_PORT23\\,chip\\=?@)",
+"MetricName": "Memory_RD_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_WR_DISP_PORT01\\,chip\\=?@ + hv_24x7@PM_MCS01_128B_WR_DISP_PORT23\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_WR_DISP_PORT01\\,chip\\=?@ + hv_24x7@PM_MCS23_128B_WR_DISP_PORT23\\,chip\\=?@ )",
+"MetricName": "Memory_WR_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_PB_CYC\\,chip\\=?@ )",
+"MetricName": "PowerBUS_Frequency",
+"ScaleUnit": "2.5e-7GHz"
+}
+]
--
2.18.1
[PATCH v5 10/11] tools/perf: Enable Hz/hz printing for --metric-only option
Commit 54b5091606c18 ("perf stat: Implement --metric-only mode") added
the function 'valid_only_metric()', which drops "Hz" or "hz" events if
they are part of "ScaleUnit". This patch enables them, since hv_24x7
supports a couple of frequency events.

Signed-off-by: Kajol Jain
---
 tools/perf/util/stat-display.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index 16efdba1973a..ecdebfcdd379 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit)
 	if (!unit)
 		return false;
 	if (strstr(unit, "/sec") ||
-	    strstr(unit, "hz") ||
-	    strstr(unit, "Hz") ||
 	    strstr(unit, "CPUs utilized"))
 		return false;
 	return true;
--
2.18.1
[PATCH v5 09/11] perf/tools: Enhance JSON/metric infrastructure to handle "?"
This patch enhances the current metric infrastructure to handle "?" in a
metric expression. The "?" can be used for parameters whose value is not
known while creating metric events and which can be replaced later, at
runtime, with the proper value. It also adds the flexibility to create
multiple events out of a single metric event added in a JSON file.

The patch adds the function 'arch_get_runtimeparam', an arch-specific
function that returns the number of metric events that need to be
created. By default it returns 1. A loop is added to the function
'metricgroup__add_metric_runtime_param', which creates multiple events at
runtime depending on the return value of 'arch_get_runtimeparam' and
merges those events into 'group_list'.

This infrastructure is needed for hv_24x7 socket/chip level events.
"hv_24x7" chip level events need the specific chip-id for which the data
is requested. 'arch_get_runtimeparam' is implemented in header.c and
extracts the number of sockets from the sysfs file "sockets" under
"/sys/devices/hv_24x7/interface/".

Signed-off-by: Kajol Jain
---
 tools/perf/arch/powerpc/util/header.c | 10 +++++
 tools/perf/tests/expr.c               |  8 ++--
 tools/perf/util/expr.c                | 11 +++--
 tools/perf/util/expr.h                |  5 +-
 tools/perf/util/expr.l                | 27 +++++++---
 tools/perf/util/metricgroup.c         | 66 +++++++++++++++++++++-
 tools/perf/util/metricgroup.h         |  1 +
 tools/perf/util/stat-shadow.c         | 12 ++++-
 8 files changed, 119 insertions(+), 21 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/header.c b/tools/perf/arch/powerpc/util/header.c
index 3b4cdfc5efd6..dcc3c6ab2e67 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -7,6 +7,8 @@
 #include
 #include
 #include "header.h"
+#include "metricgroup.h"
+#include <api/fs/fs.h>

 #define mfspr(rn)   ({unsigned long rval; \
			 asm volatile("mfspr %0," __stringify(rn) \
@@ -16,6 +18,8 @@
 #define PVR_VER(pvr)	(((pvr) >>  16) & 0xFFFF) /* Version field */
 #define PVR_REV(pvr)	(((pvr) >>   0) & 0xFFFF) /* Revision field */

+#define SOCKETS_INFO_FILE_PATH "/devices/hv_24x7/interface/sockets"
+
 int
 get_cpuid(char *buffer, size_t sz)
 {
@@ -44,3 +48,9 @@ get_cpuid_str(struct perf_pmu *pmu __maybe_unused)

 	return bufp;
 }
+
+int arch_get_runtimeparam(void)
+{
+	int count;
+
+	return sysfs__read_int(SOCKETS_INFO_FILE_PATH, &count) < 0 ? 1 : count;
+}

diff --git a/tools/perf/tests/expr.c b/tools/perf/tests/expr.c
index ea10fc4412c4..516504cf0ea5 100644
--- a/tools/perf/tests/expr.c
+++ b/tools/perf/tests/expr.c
@@ -10,7 +10,7 @@ static int test(struct expr_parse_ctx *ctx, const char *e, double val2)
 {
 	double val;

-	if (expr__parse(&val, ctx, e))
+	if (expr__parse(&val, ctx, e, 1))
 		TEST_ASSERT_VAL("parse test failed", 0);
 	TEST_ASSERT_VAL("unexpected value", val == val2);
 	return 0;
@@ -44,15 +44,15 @@ int test__expr(struct test *t __maybe_unused, int subtest __maybe_unused)
 		return ret;

 	p = "FOO/0";
-	ret = expr__parse(&val, &ctx, p);
+	ret = expr__parse(&val, &ctx, p, 1);
 	TEST_ASSERT_VAL("division by zero", ret == -1);

 	p = "BAR/";
-	ret = expr__parse(&val, &ctx, p);
+	ret = expr__parse(&val, &ctx, p, 1);
 	TEST_ASSERT_VAL("missing operand", ret == -1);

 	TEST_ASSERT_VAL("find other",
-			expr__find_other("FOO + BAR + BAZ + BOZO", "FOO", &other, &num_other) == 0);
+			expr__find_other("FOO + BAR + BAZ + BOZO", "FOO", &other, &num_other, 1) == 0);
 	TEST_ASSERT_VAL("find other", num_other == 3);
 	TEST_ASSERT_VAL("find other", !strcmp(other[0], "BAR"));
 	TEST_ASSERT_VAL("find other", !strcmp(other[1], "BAZ"));

diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c
index c3382d58cf40..b228b737a5b0 100644
--- a/tools/perf/util/expr.c
+++ b/tools/perf/util/expr.c
@@ -27,10 +27,11 @@ void expr__ctx_init(struct expr_parse_ctx *ctx)

 static int
 __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
-	      int start)
+	      int start, int param)
 {
 	struct expr_scanner_ctx scanner_ctx = {
 		.start_token = start,
+		.expr__runtimeparam = param,
 	};
 	YY_BUFFER_STATE buffer;
 	void *scanner;
@@ -54,9 +55,9 @@ __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
 	return ret;
 }

-int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char *expr)
+int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char *expr, int param)
 {
-	return __expr__parse(final_val, ctx, expr, EXPR_PARSE) ? -1 : 0;
+	return __expr__parse(final_val, ctx, expr, EXPR_PARSE, param) ? -1 : 0;
 }

 static bool
@@ -74,13 +75,13 @@ already_seen(const char *val, const char *one, const char **other,
 }

 int expr__find_other(const char *exp
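The remainder of the diff is truncated above, but the runtime-parameter
plumbing it adds is easy to sketch. A rough illustration only, not the
exact patch: the helper name and the extra runtime-index parameter on
metricgroup__add_metric_param are simplifying assumptions here:

/*
 * Sketch: duplicate the metric once per runtime index i (e.g. per
 * socket); the expression parser later substitutes i for "?".
 */
static int add_metric_for_each_param(struct strbuf *events,
				     struct list_head *group_list,
				     struct pmu_event *pe)
{
	int i, count = arch_get_runtimeparam();

	/* One event group per runtime parameter value. */
	for (i = 0; i < count; i++) {
		int ret = metricgroup__add_metric_param(events, group_list,
							pe, i);
		if (ret)
			return ret;
	}
	return 0;
}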
[PATCH v5 08/11] perf/tools: Refactor metricgroup__add_metric function
This patch refactors the metricgroup__add_metric function, moving part of
it into a new function, metricgroup__add_metric_param. No logic change.

Signed-off-by: Kajol Jain
---
 tools/perf/util/metricgroup.c | 63 +++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index c3a8c701609a..b4919bcfbd8b 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -474,6 +474,41 @@ static bool metricgroup__has_constraint(struct pmu_event *pe)
 	return false;
 }

+static int metricgroup__add_metric_param(struct strbuf *events,
+			struct list_head *group_list, struct pmu_event *pe)
+{
+	const char **ids;
+	int idnum;
+	struct egroup *eg;
+	int ret = -EINVAL;
+
+	if (expr__find_other(pe->metric_expr, NULL, &ids, &idnum, 1) < 0)
+		return ret;
+
+	if (events->len > 0)
+		strbuf_addf(events, ",");
+
+	if (metricgroup__has_constraint(pe))
+		metricgroup__add_metric_non_group(events, ids, idnum);
+	else
+		metricgroup__add_metric_weak_group(events, ids, idnum);
+
+	eg = malloc(sizeof(*eg));
+	if (!eg)
+		return -ENOMEM;
+
+	eg->ids = ids;
+	eg->idnum = idnum;
+	eg->metric_name = pe->metric_name;
+	eg->metric_expr = pe->metric_expr;
+	eg->metric_unit = pe->unit;
+	list_add_tail(&eg->nd, group_list);
+	ret = 0;
+
+	return ret;
+}
+
 static int metricgroup__add_metric(const char *metric, struct strbuf *events,
 				   struct list_head *group_list)
 {
@@ -493,35 +528,13 @@ static int metricgroup__add_metric(const char *metric, struct strbuf *events,
 			continue;
 		if (match_metric(pe->metric_group, metric) ||
 		    match_metric(pe->metric_name, metric)) {
-			const char **ids;
-			int idnum;
-			struct egroup *eg;

 			pr_debug("metric expr %s for %s\n", pe->metric_expr, pe->metric_name);

-			if (expr__find_other(pe->metric_expr,
-					     NULL, &ids, &idnum) < 0)
+			ret = metricgroup__add_metric_param(events,
+							    group_list, pe);
+			if (ret == -EINVAL)
 				continue;
-			if (events->len > 0)
-				strbuf_addf(events, ",");
-
-			if (metricgroup__has_constraint(pe))
-				metricgroup__add_metric_non_group(events, ids, idnum);
-			else
-				metricgroup__add_metric_weak_group(events, ids, idnum);
-
-			eg = malloc(sizeof(struct egroup));
-			if (!eg) {
-				ret = -ENOMEM;
-				break;
-			}
-			eg->ids = ids;
-			eg->idnum = idnum;
-			eg->metric_name = pe->metric_name;
-			eg->metric_expr = pe->metric_expr;
-			eg->metric_unit = pe->unit;
-			list_add_tail(&eg->nd, group_list);
-			ret = 0;
 		}
 	}
 	return ret;
--
2.18.1
[PATCH v5 07/11] powerpc/hv-24x7: Update post_mobility_fixup() to handle migration
The function 'read_sys_info_pseries()' is added to get system parameter
values like the number of sockets and chips per socket. It gets these
details via rtas_call with the token "PROCESSOR_MODULE_INFO".

In case an LPAR migrates from one system to another, system parameter
details like chips per socket or number of sockets might change. So they
need to be re-initialized; otherwise, these values correspond to the
previous system. This patch adds a call to 'read_sys_info_pseries()' from
'post_mobility_fixup()' to re-init the physsockets and physchips values.

Signed-off-by: Kajol Jain
---
 arch/powerpc/platforms/pseries/mobility.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index b571285f6c14..226accd6218b 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -371,6 +371,18 @@ void post_mobility_fixup(void)
 	/* Possibly switch to a new RFI flush type */
 	pseries_setup_rfi_flush();

+	/*
+	 * In case an LPAR migrates from one system to another, system
+	 * parameter details like chips per socket and number of sockets
+	 * might change. So they need to be re-initialized, otherwise
+	 * these values correspond to the previous system.
+	 * Here, add a call to read_sys_info_pseries() declared in
+	 * platforms/pseries/pseries.h to re-init the physsockets and
+	 * physchips values.
+	 */
+	if (IS_ENABLED(CONFIG_HV_PERF_CTRS) && IS_ENABLED(CONFIG_PPC_RTAS))
+		read_sys_info_pseries();
+
 	return;
 }
--
2.18.1
[PATCH v5 06/11] Documentation/ABI: Add ABI documentation for chips and sockets
Add documentation for the following sysfs files:
/sys/devices/hv_24x7/interface/chips,
/sys/devices/hv_24x7/interface/sockets

Signed-off-by: Kajol Jain
---
 .../testing/sysfs-bus-event_source-devices-hv_24x7 | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index ec27c6c9e737..e17e5b444a1c 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -22,6 +22,20 @@ Description:
 		Exposes the "version" field of the 24x7 catalog. This is also
 		extractable from the provided binary "catalog" sysfs entry.

+What:		/sys/devices/hv_24x7/interface/sockets
+Date:		March 2020
+Contact:	Linux on PowerPC Developer List
+Description:	read only
+		This sysfs interface exposes the number of sockets
+		present in the system.
+
+What:		/sys/devices/hv_24x7/interface/chips
+Date:		March 2020
+Contact:	Linux on PowerPC Developer List
+Description:	read only
+		This sysfs interface exposes the number of chips per
+		socket present in the system.
+
 What:		/sys/bus/event_source/devices/hv_24x7/event_descs/
 Date:		February 2014
 Contact:	Linux on PowerPC Developer List
--
2.18.1
[PATCH v5 05/11] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show processor details
To expose system-dependent parameters like the total number of sockets
and the number of chips per socket, this patch adds two sysfs files,
"sockets" and "chips", under /sys/devices/hv_24x7/interface/ of the
"hv_24x7" pmu.

Signed-off-by: Kajol Jain
---
 arch/powerpc/perf/hv-24x7.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 9ae00f29bd21..a31bd5b88f7a 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -454,6 +454,20 @@ static ssize_t device_show_string(struct device *dev,
 	return sprintf(buf, "%s\n", (char *)d->var);
 }

+#ifdef CONFIG_PPC_RTAS
+static ssize_t sockets_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", physsockets);
+}
+
+static ssize_t chips_show(struct device *dev, struct device_attribute *attr,
+			  char *buf)
+{
+	return sprintf(buf, "%d\n", physchips);
+}
+#endif
+
 static struct attribute *device_str_attr_create_(char *name, char *str)
 {
 	struct dev_ext_attribute *attr = kzalloc(sizeof(*attr), GFP_KERNEL);
@@ -1100,6 +1114,10 @@ PAGE_0_ATTR(catalog_len, "%lld\n",
 	    (unsigned long long)be32_to_cpu(page_0->length) * 4096);
 static BIN_ATTR_RO(catalog, 0/* real length varies */);
 static DEVICE_ATTR_RO(domains);
+#ifdef CONFIG_PPC_RTAS
+static DEVICE_ATTR_RO(sockets);
+static DEVICE_ATTR_RO(chips);
+#endif

 static struct bin_attribute *if_bin_attrs[] = {
 	&bin_attr_catalog,
@@ -1110,6 +1128,10 @@ static struct attribute *if_attrs[] = {
 	&dev_attr_catalog_len.attr,
 	&dev_attr_catalog_version.attr,
 	&dev_attr_domains.attr,
+#ifdef CONFIG_PPC_RTAS
+	&dev_attr_sockets.attr,
+	&dev_attr_chips.attr,
+#endif
 	NULL,
 };
--
2.18.1
[PATCH v5 04/11] powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor details
For hv_24x7 socket/chip level events, the specific chip-id for which the
data is requested should be added as part of the pmu events. But the
number of chips per socket and sockets in the system are not exposed.

This patch implements read_sys_info_pseries() to get system parameter
values like the number of sockets and chips per socket. An rtas_call with
the token "PROCESSOR_MODULE_INFO" is used to get these values. A
subsequent patch exports these values via sysfs.

The patch also makes these parameters default to 1.

Signed-off-by: Kajol Jain
---
 arch/powerpc/perf/hv-24x7.c              | 72 ++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h |  3 +
 2 files changed, 75 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 48e8f4b17b91..9ae00f29bd21 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -20,6 +20,11 @@
 #include
 #include

+#ifdef CONFIG_PPC_RTAS
+#include <asm/rtas.h>
+#include <../../platforms/pseries/pseries.h>
+#endif
+
 #include "hv-24x7.h"
 #include "hv-24x7-catalog.h"
 #include "hv-common.h"
@@ -57,6 +62,69 @@ static bool is_physical_domain(unsigned domain)
 	}
 }

+#ifdef CONFIG_PPC_RTAS
+#define PROCESSOR_MODULE_INFO	43
+#define PROCESSOR_MAX_LENGTH	(8 * 1024)
+
+static int strbe16toh(const char *buf, int offset)
+{
+	return (buf[offset] << 8) + buf[offset + 1];
+}
+
+static u32 physsockets;	/* Physical sockets */
+static u32 physchips;	/* Physical chips */
+
+/*
+ * Function read_sys_info_pseries() makes an rtas_call which requires
+ * a data buffer of size 8K. As the standard 'rtas_data_buf' is of size
+ * 4K, we add a new local buffer 'rtas_local_data_buf'.
+ */
+char rtas_local_data_buf[PROCESSOR_MAX_LENGTH] __cacheline_aligned;
+
+/*
+ * read_sys_info_pseries()
+ * Retrieve the number of sockets and chips per socket details
+ * through the get-system-parameter rtas call.
+ */
+void read_sys_info_pseries(void)
+{
+	int call_status, len, ntypes;
+
+	/*
+	 * Make the system parameters chips and sockets default to 1.
+	 */
+	physsockets = 1;
+	physchips = 1;
+	memset(rtas_local_data_buf, 0, PROCESSOR_MAX_LENGTH);
+	spin_lock(&rtas_data_buf_lock);
+
+	call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+				NULL,
+				PROCESSOR_MODULE_INFO,
+				__pa(rtas_local_data_buf),
+				PROCESSOR_MAX_LENGTH);
+
+	spin_unlock(&rtas_data_buf_lock);
+
+	if (call_status != 0) {
+		pr_info("%s %s Error calling get-system-parameter (0x%x)\n",
+			__FILE__, __func__, call_status);
+	} else {
+		rtas_local_data_buf[PROCESSOR_MAX_LENGTH - 1] = '\0';
+		len = strbe16toh(rtas_local_data_buf, 0);
+		if (len < 6)
+			return;
+
+		ntypes = strbe16toh(rtas_local_data_buf, 2);
+
+		if (!ntypes)
+			return;
+
+		physsockets = strbe16toh(rtas_local_data_buf, 4);
+		physchips = strbe16toh(rtas_local_data_buf, 6);
+	}
+}
+#endif /* CONFIG_PPC_RTAS */
+
 /*
  * Domains for which more than one result element are returned for
  * each event.
  */
 static bool domain_needs_aggregation(unsigned int domain)
 {
@@ -1605,6 +1673,10 @@ static int hv_24x7_init(void)
 	if (r)
 		return r;

+#ifdef CONFIG_PPC_RTAS
+	read_sys_info_pseries();
+#endif
+
 	return 0;
 }

diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 13fa370a87e4..1727559ce304 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -19,6 +19,9 @@ extern void request_event_sources_irqs(struct device_node *np,
 struct pt_regs;
 extern int pSeries_system_reset_exception(struct pt_regs *regs);
+#ifdef CONFIG_PPC_RTAS
+extern void read_sys_info_pseries(void);
+#endif
 extern int pSeries_machine_check_exception(struct pt_regs *regs);
 extern long pseries_machine_check_realmode(struct pt_regs *regs);
--
2.18.1
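To make the PROCESSOR_MODULE_INFO buffer layout concrete: the first
big-endian 16-bit word is the length, followed by the number of module
types, then the socket and chip counts. A minimal userspace-style sketch
of the same parsing (the payload bytes are an assumed example, not real
firmware output):

#include <stdio.h>

/* Big-endian 16-bit read at a byte offset, as strbe16toh() does. */
static int be16_at(const unsigned char *buf, int off)
{
	return (buf[off] << 8) | buf[off + 1];
}

int main(void)
{
	/* Assumed example payload: len=6, ntypes=1, sockets=2, chips=2 */
	const unsigned char buf[] = { 0x00, 0x06, 0x00, 0x01,
				      0x00, 0x02, 0x00, 0x02 };
	int len = be16_at(buf, 0);
	int ntypes = be16_at(buf, 2);

	if (len < 6 || !ntypes)
		return 1;	/* keep the defaults of 1 in that case */

	printf("sockets=%d chips=%d\n", be16_at(buf, 4), be16_at(buf, 6));
	return 0;
}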
[PATCH v5 03/11] powerpc/perf/hv-24x7: Fix inconsistent output values in case multiple hv-24x7 events run
Commit 2b206ee6b0df ("powerpc/perf/hv-24x7: Display change in counter
values") was added to print the _change_ in the counter value rather than
the raw value for 24x7 counters. In the case of transactions, the event
count is set to 0 at the beginning of the transaction. It also sets the
event's prev_count to the raw value at the time of initialization.
Because of setting the event count to 0, we see some weird behaviour
whenever we run multiple 24x7 events at a time.

For example:

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
	  hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}" -C 0 -I 1000 sleep 100

     1.000121704                        120 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     1.000121704                          5 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     2.000357733                          8 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     2.000357733                         10 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     4.000641884                         56 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     4.000641884 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     5.000791887 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

We get these large values when we use -I. Because event_count is set to
0, in the interval case the overall event_count does not grow
monotonically: we may get a new delta smaller than the previous count.
When intervals are printed, that yields a negative value, which shows up
as these large wrapped numbers.

This patch removes the part where we set event_count to 0 in the function
'h_24x7_event_read'. There won't be much impact, as we do set
event->hw.prev_count to the raw value at the time of initialization in
order to print the change in value.

With this patch, on a power9 platform:

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
	  hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}" -C 0 -I 1000 sleep 100

     1.000117685         93 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     1.000117685          1 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     2.000349331         98 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     2.000349331          2 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     3.000495900        131 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     3.000495900          4 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     4.000645920        204 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
     4.000645920         61 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
     4.284169997         22 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

Signed-off-by: Kajol Jain
Suggested-by: Sukadev Bhattiprolu
---
 arch/powerpc/perf/hv-24x7.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 573e0b309c0c..48e8f4b17b91 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1400,16 +1400,6 @@ static void h_24x7_event_read(struct perf_event *event)
 			h24x7hw = &get_cpu_var(hv_24x7_hw);
 			h24x7hw->events[i] = event;
 			put_cpu_var(h24x7hw);
-			/*
-			 * Clear the event count so we can compute the _change_
-			 * in the 24x7 raw counter value at the end of the txn.
-			 *
-			 * Note that we could alternatively read the 24x7 value
-			 * now and save its value in event->hw.prev_count. But
-			 * that would require issuing a hcall, which would then
-			 * defeat the purpose of using the txn interface.
-			 */
-			local64_set(&event->count, 0);
 		}
 		put_cpu_var(hv_24x7_reqb);
--
2.18.1
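The arithmetic behind the fix is simple. A toy model of the delta
accounting (plain C, not the kernel code; the raw counter values are
made up) shows why zeroing the accumulated count breaks interval
printing while keeping prev_count intact does not:

#include <stdio.h>

/* Toy model: perf -I prints count(now) - count(previous interval). */
int main(void)
{
	unsigned long long raw[] = { 1000, 1120, 1128 }; /* raw 24x7 counter */
	unsigned long long prev_count = raw[0];	/* set once at event init */
	unsigned long long count = 0;		/* accumulated event count */
	unsigned long long last_printed = 0;

	for (int i = 1; i < 3; i++) {
		/* event read: add the change since the last raw read */
		count += raw[i] - prev_count;
		prev_count = raw[i];

		/* interval print: difference from the last interval */
		printf("interval %d: %llu\n", i, count - last_printed);
		last_printed = count;
		/*
		 * The removed local64_set(&event->count, 0) would reset
		 * 'count' here; 'count - last_printed' then underflows,
		 * producing the 18,446,744,073,709,551,616 artifacts.
		 */
	}
	return 0;
}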
[PATCH v5 02/11] perf expr: Add expr_scanner_ctx object
From: Jiri Olsa

Adding an expr_scanner_ctx object to hold user data for the expr scanner.
Currently it holds only start_token; Kajol Jain will use it to hold the
24x7 runtime param.

Signed-off-by: Jiri Olsa
---
 tools/perf/util/expr.c |  6 ++++--
 tools/perf/util/expr.h |  4 ++++
 tools/perf/util/expr.l | 10 +++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c
index c8ccc548a585..c3382d58cf40 100644
--- a/tools/perf/util/expr.c
+++ b/tools/perf/util/expr.c
@@ -3,7 +3,6 @@
 #include
 #include "expr.h"
 #include "expr-bison.h"
-#define YY_EXTRA_TYPE int
 #include "expr-flex.h"

 #ifdef PARSER_DEBUG
@@ -30,11 +29,14 @@ static int
 __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
 	      int start)
 {
+	struct expr_scanner_ctx scanner_ctx = {
+		.start_token = start,
+	};
 	YY_BUFFER_STATE buffer;
 	void *scanner;
 	int ret;

-	ret = expr_lex_init_extra(start, &scanner);
+	ret = expr_lex_init_extra(&scanner_ctx, &scanner);
 	if (ret)
 		return ret;

diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
index b9e53f2b5844..0938ad166ece 100644
--- a/tools/perf/util/expr.h
+++ b/tools/perf/util/expr.h
@@ -15,6 +15,10 @@ struct expr_parse_ctx {
 	struct expr_parse_id ids[MAX_PARSE_ID];
 };

+struct expr_scanner_ctx {
+	int start_token;
+};
+
 void expr__ctx_init(struct expr_parse_ctx *ctx);
 void expr__add_id(struct expr_parse_ctx *ctx, const char *id, double val);
 int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char *expr);

diff --git a/tools/perf/util/expr.l b/tools/perf/util/expr.l
index eaad29243c23..2582c2464938 100644
--- a/tools/perf/util/expr.l
+++ b/tools/perf/util/expr.l
@@ -76,13 +76,13 @@ sym		[0-9a-zA-Z_\.:@]+
 symbol		{spec}*{sym}*{spec}*{sym}*

 %%
-	{
-		int start_token;
+	struct expr_scanner_ctx *sctx = expr_get_extra(yyscanner);

-		start_token = expr_get_extra(yyscanner);
+	{
+		int start_token = sctx->start_token;

-		if (start_token) {
-			expr_set_extra(NULL, yyscanner);
+		if (sctx->start_token) {
+			sctx->start_token = 0;
 			return start_token;
 		}
 	}
--
2.18.1
[PATCH v5 01/11] perf expr: Add expr_ prefix for parse_ctx and parse_id
From: Jiri Olsa

Adding expr_ prefix for parse_ctx and parse_id, to straighten out the
expr* namespace. There's no functional change.

Signed-off-by: Jiri Olsa
---
 tools/perf/tests/expr.c       |  4 ++--
 tools/perf/util/expr.c        | 10 +++++-----
 tools/perf/util/expr.h        | 12 ++++++------
 tools/perf/util/expr.y        |  6 +++---
 tools/perf/util/stat-shadow.c |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/tools/perf/tests/expr.c b/tools/perf/tests/expr.c
index 28313e59d6f6..ea10fc4412c4 100644
--- a/tools/perf/tests/expr.c
+++ b/tools/perf/tests/expr.c
@@ -6,7 +6,7 @@
 #include
 #include

-static int test(struct parse_ctx *ctx, const char *e, double val2)
+static int test(struct expr_parse_ctx *ctx, const char *e, double val2)
 {
 	double val;

@@ -22,7 +22,7 @@ int test__expr(struct test *t __maybe_unused, int subtest __maybe_unused)
 	const char **other;
 	double val;
 	int i, ret;
-	struct parse_ctx ctx;
+	struct expr_parse_ctx ctx;
 	int num_other;

 	expr__ctx_init(&ctx);

diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c
index fd192ddf93c1..c8ccc548a585 100644
--- a/tools/perf/util/expr.c
+++ b/tools/perf/util/expr.c
@@ -11,7 +11,7 @@ extern int expr_debug;
 #endif

 /* Caller must make sure id is allocated */
-void expr__add_id(struct parse_ctx *ctx, const char *name, double val)
+void expr__add_id(struct expr_parse_ctx *ctx, const char *name, double val)
 {
 	int idx;

@@ -21,13 +21,13 @@ void expr__add_id(struct parse_ctx *ctx, const char *name, double val)
 	ctx->ids[idx].val = val;
 }

-void expr__ctx_init(struct parse_ctx *ctx)
+void expr__ctx_init(struct expr_parse_ctx *ctx)
 {
 	ctx->num_ids = 0;
 }

 static int
-__expr__parse(double *val, struct parse_ctx *ctx, const char *expr,
+__expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
 	      int start)
 {
 	YY_BUFFER_STATE buffer;
@@ -52,7 +52,7 @@ __expr__parse(double *val, struct parse_ctx *ctx, const char *expr,
 	return ret;
 }

-int expr__parse(double *final_val, struct parse_ctx *ctx, const char *expr)
+int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char *expr)
 {
 	return __expr__parse(final_val, ctx, expr, EXPR_PARSE) ? -1 : 0;
 }
@@ -75,7 +75,7 @@ int expr__find_other(const char *expr, const char *one, const char ***other,
 		     int *num_other)
 {
 	int err, i = 0, j = 0;
-	struct parse_ctx ctx;
+	struct expr_parse_ctx ctx;

 	expr__ctx_init(&ctx);
 	err = __expr__parse(NULL, &ctx, expr, EXPR_OTHER);

diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
index 9377538f4097..b9e53f2b5844 100644
--- a/tools/perf/util/expr.h
+++ b/tools/perf/util/expr.h
@@ -5,19 +5,19 @@
 #define EXPR_MAX_OTHER 20
 #define MAX_PARSE_ID EXPR_MAX_OTHER

-struct parse_id {
+struct expr_parse_id {
 	const char *name;
 	double val;
 };

-struct parse_ctx {
+struct expr_parse_ctx {
 	int num_ids;
-	struct parse_id ids[MAX_PARSE_ID];
+	struct expr_parse_id ids[MAX_PARSE_ID];
 };

-void expr__ctx_init(struct parse_ctx *ctx);
-void expr__add_id(struct parse_ctx *ctx, const char *id, double val);
-int expr__parse(double *final_val, struct parse_ctx *ctx, const char *expr);
+void expr__ctx_init(struct expr_parse_ctx *ctx);
+void expr__add_id(struct expr_parse_ctx *ctx, const char *id, double val);
+int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char *expr);
 int expr__find_other(const char *expr, const char *one, const char ***other,
 		     int *num_other);

diff --git a/tools/perf/util/expr.y b/tools/perf/util/expr.y
index 4720cbe79357..cd17486c1c5d 100644
--- a/tools/perf/util/expr.y
+++ b/tools/perf/util/expr.y
@@ -15,7 +15,7 @@
 %define api.pure full
 %parse-param { double *final_val }
-%parse-param { struct parse_ctx *ctx }
+%parse-param { struct expr_parse_ctx *ctx }
 %parse-param {void *scanner}
 %lex-param {void* scanner}
@@ -39,14 +39,14 @@
 %{
 static void expr_error(double *final_val __maybe_unused,
-		       struct parse_ctx *ctx __maybe_unused,
+		       struct expr_parse_ctx *ctx __maybe_unused,
 		       void *scanner,
 		       const char *s)
 {
 	pr_debug("%s\n", s);
 }

-static int lookup_id(struct parse_ctx *ctx, char *id, double *val)
+static int lookup_id(struct expr_parse_ctx *ctx, char *id, double *val)
 {
 	int i;

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 0fd713d3674f..402af3e8d287 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -729,7 +729,7 @@ static void generic_metric(struct perf_stat_config *config,
 			   struct runtime_stat *st)
 {
 	print_metric_t print_metric = out->print_metric;
-	struct parse_ctx pctx;
+	struct expr_parse_ctx pctx;
[PATCH v5 00/11] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events
This patchset fixes the inconsistent results we get when we run multiple
24x7 events, and adds json file metric support for the hv_24x7
socket/chip level events.

"hv_24x7" pmu interface events need system-dependent parameters like
socket/chip/core. For example, hv_24x7 chip level events need the
specific chip-id for which the data is requested to be added as part of
the pmu events. So, to enable JSON file support for the "hv_24x7"
interface, the patchset exposes the total number of sockets and the
number of chips per socket in sysfs files (sockets, chips) under
"/sys/devices/hv_24x7/interface/". To get the sockets and the number of
chips per socket, the patchset adds an rtas call with the token
"PROCESSOR_MODULE_INFO". The patchset also handles the partition
migration case, re-initializing these system-dependent parameters via
calls added in post_mobility_fixup() (mobility.c).

The second patch of the patchset adds an expr_scanner_ctx object to hold
user data for the expr scanner, which can be used to hold the runtime
parameter. Patches 9 and 11 handle the perf tool plumbing needed to
replace the "?" character in a metric expression with the proper value,
and add the hv_24x7 json metric file for the different socket/chip
resources. The patchset also enables Hz/hz printing for the
--metric-only option, to print metric data for bus frequency.

Applied and tested all these patches cleanly on top of Jiri's flex
changes, together with Kan Liang's "Support metric group constraint"
patchset, and made the required changes.

Changelog:

v4 -> v5
- Use sysfs__read_int instead of sysfs__read_ull while reading the
  parameter value in the powerpc/util/header.c file.
- Use asprintf rather than malloc and sprintf.
  Suggested by Arnaldo Carvalho de Melo
- Break patch 6 from the previous version into two patches: one to
  refactor the current "metricgroup__add_metric" function, and another
  where the actual "?" handling infra is added.
- Add expr__runtimeparam as part of the 'expr_scanner_ctx' struct rather
  than making it a global variable. Thanks to Jiri for adding this
  structure to hold user data for the expr scanner.
- Add the runtime param as an argument to the functions
  'expr__find_other' and 'expr__parse' and change all references
  accordingly.

v3 -> v4
- Apply these patches on top of Kan Liang's changes, as suggested by
  Jiri.

v2 -> v3
- Remove the setting of event_count to 0 in function 'h_24x7_event_read'
  with a comment, rather than adding 0 to the event_count value.
  Suggested-by: Sukadev Bhattiprolu
- Apply the tool-side changes required to replace "?" on Jiri's flex
  patch series, and make all the changes required to be compatible with
  the added flex change.

v1 -> v2
- Rename the hv-24x7 metric json file to nest_metrics.json

Jiri Olsa (2):
  perf expr: Add expr_ prefix for parse_ctx and parse_id
  perf expr: Add expr_scanner_ctx object

Kajol Jain (9):
  powerpc/perf/hv-24x7: Fix inconsistent output values in case multiple
    hv-24x7 events run
  powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor
    details
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show
    processor details
  Documentation/ABI: Add ABI documentation for chips and sockets
  powerpc/hv-24x7: Update post_mobility_fixup() to handle migration
  perf/tools: Refactor metricgroup__add_metric function
  perf/tools: Enhance JSON/metric infrastructure to handle "?"
  tools/perf: Enable Hz/hz printing for --metric-only option
  perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric
    events

 .../sysfs-bus-event_source-devices-hv_24x7    |  14 ++
 arch/powerpc/perf/hv-24x7.c                   | 104 ++++++++--
 arch/powerpc/platforms/pseries/mobility.c     |  12 ++
 arch/powerpc/platforms/pseries/pseries.h      |   3 +
 tools/perf/arch/powerpc/util/header.c         |  10 ++
 .../arch/powerpc/power9/nest_metrics.json     |  19 +++
 tools/perf/tests/expr.c                       |  12 +-
 tools/perf/util/expr.c                        |  25 ++-
 tools/perf/util/expr.h                        |  19 ++-
 tools/perf/util/expr.l                        |  37 +++--
 tools/perf/util/expr.y                        |   6 +-
 tools/perf/util/metricgroup.c                 | 127 ++++++++++---
 tools/perf/util/metricgroup.h                 |   1 +
 tools/perf/util/stat-display.c                |   2 -
 tools/perf/util/stat-shadow.c                 |  14 +-
 15 files changed, 326 insertions(+), 79 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

--
2.18.1
Re: [PATCH 00/15] powerpc/watchpoint: Preparation for more than one watchpoint
Le 16/03/2020 à 19:43, Segher Boessenkool a écrit :
On Mon, Mar 16, 2020 at 04:05:01PM +0100, Christophe Leroy wrote:

Some book3s (e300 family for instance, I think G2 as well) already have
a DABR2 in addition to DABR.

The original "G2" (meaning 603 and 604) do not have DABR2. The newer
"G2" (meaning e300) does have it. e500 and e600 do not have it either.

Hope I got that right ;-)

The G2 core reference manual says:

  Features specific to the G2 core not present on the original MPC603e
  (PID6-603e) processors follow:
  ...
  Enhanced debug features
  — Addition of three breakpoint registers—IABR2, DABR, and DABR2
  — Two new breakpoint control registers—DBCR and IBCR

e500 has DAC1 and DAC2 instead for breakpoints, in accordance with the
e500 core reference manual.

Christophe
Re: [PATCH V7 09/14] powerpc/vas: Update CSB and notify process for fault CRBs
Haren Myneni writes:
> For each fault CRB, update fault address in CRB (fault_storage_addr)
> and translation error status in CSB so that user space can touch the
> fault address and resend the request. If the user space passed an
> invalid CSB address, send a signal to the process with SIGSEGV.
>
> Signed-off-by: Sukadev Bhattiprolu
> Signed-off-by: Haren Myneni
> ---
>  arch/powerpc/platforms/powernv/vas-fault.c | 114 +++++++++++++++++++++
>  1 file changed, 114 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> index 1c6d5cc..751ce48 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>
> @@ -26,6 +27,118 @@
>  #define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
>
>  /*
> + * Update the CSB to indicate a translation error.
> + *
> + * If we are unable to update the CSB, it means copy_to_user failed
> + * due to an invalid csb_addr; send a signal to the process.
> + *
> + * Remaining settings in the CSB are based on wait_for_csb() of
> + * NX-GZIP.
> + */
> +static void update_csb(struct vas_window *window,
> +			struct coprocessor_request_block *crb)
> +{
> +	int rc;
> +	struct pid *pid;
> +	void __user *csb_addr;
> +	struct task_struct *tsk;
> +	struct kernel_siginfo info;
> +	struct coprocessor_status_block csb;

csb is on the stack, and later copied to user, which is a risk for
creating an infoleak.

Also please use reverse Christmas tree layout for your variables.

> +
> +	/*
> +	 * NX user space windows can not be opened for task->mm=NULL
> +	 * and faults will not be generated for kernel requests.
> +	 */
> +	if (!window->mm || !window->user_win)
> +		return;

If that's a should-never-happen condition then should it do a
WARN_ON_ONCE() rather than silently returning?

> +	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> +
> +	csb.cc = CSB_CC_TRANSLATION;
> +	csb.ce = CSB_CE_TERMINATION;
> +	csb.cs = 0;
> +	csb.count = 0;
> +
> +	/*
> +	 * NX operates and returns in BE format as defined CRB struct.
> +	 * So return fault_storage_addr in BE as NX pastes in FIFO and
> +	 * expects user space to convert to CPU format.
> +	 */
> +	csb.address = crb->stamp.nx.fault_storage_addr;
> +	csb.flags = 0;

I'm pretty sure this has initialised all the fields of csb. But, I'd
still be much happier if you zeroed the whole struct to begin with, that
way we know for sure we can't leak any uninitialised bytes to userspace.

It's only 16 bytes so it shouldn't add any noticeable overhead.

> +
> +	pid = window->pid;
> +	tsk = get_pid_task(pid, PIDTYPE_PID);
> +	/*
> +	 * Send window will be closed after processing all NX requests
> +	 * and process exits after closing all windows. In multi-thread
> +	 * applications, a thread may not exist, but does not close the
> +	 * FD (means send window) upon exit. Parent thread (tgid) can
> +	 * use and close the window later.
> +	 * pid and mm references are taken when window is opened by
> +	 * process (pid). So tgid is used only when child thread opens
> +	 * a window and exits without closing it in multithread tasks.
> +	 */
> +	if (!tsk) {
> +		pid = window->tgid;
> +		tsk = get_pid_task(pid, PIDTYPE_PID);
> +		/*
> +		 * Parent thread will be closing window during its exit.
> +		 * So should not get here.
> +		 */
> +		if (!tsk)
> +			return;

Similar question on WARN_ON_ONCE()

> +	}
> +
> +	/* Return if the task is exiting. */

Why? Just because it's no use? It's racy isn't it, so it can't be for
correctness?

> +	if (tsk->flags & PF_EXITING) {
> +		put_task_struct(tsk);
> +		return;
> +	}
> +
> +	use_mm(window->mm);

There's no check that csb_addr is actually pointing into userspace, but
copy_to_user() does it for you.

> +	rc = copy_to_user(csb_addr, &csb, sizeof(csb));
> +	/*
> +	 * User space polls on csb.flags (first byte). So add barrier
> +	 * then copy first byte with csb flags update.
> +	 */
> +	smp_mb();

You only need to order the stores above vs the store below to csb.flags.
So you should only need an smp_wmb() here.

> +	if (!rc) {
> +		csb.flags = CSB_V;
> +		rc = copy_to_user(csb_addr, &csb, sizeof(u8));
> +	}
> +	unuse_mm(window->mm);
> +	put_task_struct(tsk);
> +
> +	/* Success */
> +	if (!rc)
> +		return;
> +
> +	pr_debug("Invalid CSB address 0x%p signalling pid(%d)\n",
> +			csb_addr, pid_vnr(pid));
> +
> +	clear_siginfo(&info);
> +	info.si_signo = SIGSEGV;
> +	info.si_errno = EFAULT;
> +	info.si_code = SEGV_M
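To make the two suggestions concrete, a minimal sketch (not the patch
itself) that zeroes the struct up front and uses the weaker write
barrier could look like this; it assumes the CSB type and flag
definitions from the patch are in scope:

#include <linux/uaccess.h>
#include <linux/string.h>
#include <asm/barrier.h>

static int publish_csb(void __user *csb_addr,
		       struct coprocessor_request_block *crb)
{
	struct coprocessor_status_block csb;

	/*
	 * Zero everything first so no uninitialized stack bytes can
	 * reach userspace, even if fields are added later.
	 */
	memset(&csb, 0, sizeof(csb));
	csb.cc = CSB_CC_TRANSLATION;
	csb.ce = CSB_CE_TERMINATION;
	csb.address = crb->stamp.nx.fault_storage_addr;

	if (copy_to_user(csb_addr, &csb, sizeof(csb)))
		return -EFAULT;

	/*
	 * Userspace polls on csb.flags, so only store ordering is
	 * needed before publishing the valid bit: smp_wmb(), not a
	 * full smp_mb().
	 */
	smp_wmb();
	csb.flags = CSB_V;
	return copy_to_user(csb_addr, &csb, sizeof(u8)) ? -EFAULT : 0;
}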
Re: [PATCH V7 08/14] powerpc/vas: Take reference to PID and mm for user space windows
Haren Myneni writes:
> A process closes windows after its requests are completed. In
> multi-thread applications, a child can open a window, but release of
> the FD will not be called upon its exit. The parent thread will be
> closing it later upon its exit.

What if the parent exits first?

> The parent can also send NX requests with this window and NX can
> generate page faults. After the kernel handles the page fault, it
> sends a signal to the process by using the PID if the CSB address is
> invalid. The parent thread will not receive the signal since its PID
> is different from the one saved in vas_window. So use tgid in case the
> task for the pid saved in the window is not running, and send the
> signal to its parent.
>
> To prevent reusing the pid until the window is closed, take a
> reference to the pid and the task mm.

That text is all very dense. Can you please flesh it out and reword it
to clearly spell out what's going on in much more detail.

> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index a45d81d..7587258 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -1266,8 +1300,17 @@ int vas_win_close(struct vas_window *window)
>  	poll_window_castout(window);
>
>  	/* if send window, drop reference to matching receive window */
> -	if (window->tx_win)
> +	if (window->tx_win) {
> +		if (window->user_win) {
> +			/* Drop references to pid and mm */
> +			put_pid(window->pid);
> +			if (window->mm) {
> +				mmdrop(window->mm);
> +				mm_context_remove_copro(window->mm);

That seems backward. Once you drop the reference the mm can be freed
can't it?

> +			}
> +		}
>  		put_rx_win(window->rxwin);
> +	}
>
>  	vas_window_free(window);

cheers
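That is, something like the following ordering. A sketch only, wrapped
in a hypothetical helper for clarity and using the fields from the
patch; it assumes mm_context_remove_copro() must run while the mm
reference is still held:

#include <linux/pid.h>
#include <linux/sched/mm.h>
#include <asm/mmu_context.h>

static void vas_drop_user_refs(struct vas_window *window)
{
	put_pid(window->pid);
	if (window->mm) {
		/*
		 * Detach the coprocessor context first, while the mm
		 * reference is still held; once mmdrop() runs, the mm
		 * may be freed and must not be touched again.
		 */
		mm_context_remove_copro(window->mm);
		mmdrop(window->mm);
	}
}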
Re: [PATCH v1 11/46] powerpc/ptdump: Display size of BATs
Hi Christophe,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20200316]
[cannot apply to powerpc/next v5.6-rc6 v5.6-rc5 v5.6-rc4]
[if your patch is applied to the wrong git tree, please drop us a note
to help improve the system. BTW, we also suggest to use '--base' option
to specify the base tree in git format-patch, please see
https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Christophe-Leroy/Use-hugepages-to-map-kernel-mem-on-8xx/20200317-065610
base: 8548fd2f20ed19b0e8c0585b71fdfde1ae00ae3c
config: powerpc-rhel-kconfig (attached as .config)
compiler: powerpc64le-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot

All warnings (new ones prefixed by >>):

   In file included from arch/powerpc/mm/ptdump/book3s64.c:10:
>> arch/powerpc/mm/ptdump/ptdump.h:21:26: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
      21 | void pt_dump_size(struct seq_file *m, unsigned long delta);
         |                          ^~~~

vim +21 arch/powerpc/mm/ptdump/ptdump.h

    20
  > 21	void pt_dump_size(struct seq_file *m, unsigned long delta);

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH v4 3/3] powerpc/powernv: Parse device tree, population of SPR support
Pratik Rajesh Sampat writes:
> Parse the device tree for the nodes self-save and self-restore, and
> populate support for the preferred SPRs based on what was advertised
> by the device tree.

These should be documented in:
Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt

> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 97aeb45e897b..27dfadf609e8 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -1436,6 +1436,85 @@ static void __init pnv_probe_idle_states(void)
>  		supported_cpuidle_states |= pnv_idle_states[i].flags;
>  }
>
> +/*
> + * Extracts and populates the self save or restore capabilities
> + * passed from the device tree node
> + */
> +static int extract_save_restore_state_dt(struct device_node *np, int type)
> +{
> +	int nr_sprns = 0, i, bitmask_index;
> +	int rc = 0;
> +	u64 *temp_u64;
> +	u64 bit_pos;
> +
> +	nr_sprns = of_property_count_u64_elems(np, "sprn-bitmask");
> +	if (nr_sprns <= 0)
> +		return rc;

Using <= 0 means zero SPRs is treated as success by the caller, is that
intended? If so a comment would be appropriate.

> +	temp_u64 = kcalloc(nr_sprns, sizeof(u64), GFP_KERNEL);
> +	if (of_property_read_u64_array(np, "sprn-bitmask",
> +				       temp_u64, nr_sprns)) {
> +		pr_warn("cpuidle-powernv: failed to find registers in DT\n");
> +		kfree(temp_u64);
> +		return -EINVAL;
> +	}
> +	/*
> +	 * Populate acknowledgment of support for the sprs in the global vector
> +	 * gotten by the registers supplied by the firmware.
> +	 * The registers are in a bitmask, bit index within
> +	 * that specifies the SPR
> +	 */
> +	for (i = 0; i < nr_preferred_sprs; i++) {
> +		bitmask_index = preferred_sprs[i].spr / 64;
> +		bit_pos = preferred_sprs[i].spr % 64;

This is basically a hand coded bitmap, see eg. BIT_WORD(), BIT_MASK()
etc. I don't think there's an easy way to convert temp_u64 into a proper
bitmap, so it's probably not worth doing that. But at least use the
macros.

> +		if ((temp_u64[bitmask_index] & (1UL << bit_pos)) == 0) {
> +			if (type == SELF_RESTORE_TYPE)
> +				preferred_sprs[i].supported_mode &=
> +					~SELF_RESTORE_STRICT;
> +			else
> +				preferred_sprs[i].supported_mode &=
> +					~SELF_SAVE_STRICT;
> +			continue;
> +		}
> +		if (type == SELF_RESTORE_TYPE) {
> +			preferred_sprs[i].supported_mode |=
> +				SELF_RESTORE_STRICT;
> +		} else {
> +			preferred_sprs[i].supported_mode |=
> +				SELF_SAVE_STRICT;
> +		}
> +	}
> +
> +	kfree(temp_u64);
> +	return rc;
> +}
>
> +static int pnv_parse_deepstate_dt(void)
> +{
> +	struct device_node *sr_np, *ss_np;

You never use these concurrently AFAICS, so you could just have a single
*np.

> +	int rc = 0, i;
> +
> +	/* Self restore register population */
> +	sr_np = of_find_node_by_path("/ibm,opal/power-mgt/self-restore");

I know the existing idle code uses of_find_node_by_path(), but that's
because it's old and crufty. Please don't add new searches by path. You
should be searching by compatible.

> +	if (!sr_np) {
> +		pr_warn("opal: self restore Node not found");

This warning and the others below will fire on all existing firmware
versions, which is not OK.

> +	} else {
> +		rc = extract_save_restore_state_dt(sr_np, SELF_RESTORE_TYPE);
> +		if (rc != 0)
> +			return rc;
> +	}
> +	/* Self save register population */
> +	ss_np = of_find_node_by_path("/ibm,opal/power-mgt/self-save");
> +	if (!ss_np) {
> +		pr_warn("opal: self save Node not found");
> +		pr_warn("Legacy firmware. Assuming default self-restore support");
> +		for (i = 0; i < nr_preferred_sprs; i++)
> +			preferred_sprs[i].supported_mode &= ~SELF_SAVE_STRICT;
> +	} else {
> +		rc = extract_save_restore_state_dt(ss_np, SELF_SAVE_TYPE);
> +	}
> +	return rc;

You're leaking references on all the device_nodes in here, you need
of_node_put() before exiting.

> +}

cheers
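For illustration, the bitmask test with the macros mpe mentions would
look roughly like this. A sketch only, assuming a 64-bit build where
BITS_PER_LONG matches the u64 array elements:

#include <linux/bits.h>
#include <linux/types.h>

/*
 * Test whether SPR number 'spr' is set in the firmware-provided u64
 * array, using BIT_WORD()/BIT_MASK() instead of open-coded /64 and %64.
 */
static bool spr_supported(const u64 *bitmask, unsigned int spr)
{
	return bitmask[BIT_WORD(spr)] & BIT_MASK(spr);
}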
Re: [PATCH v1 39/46] powerpc/8xx: Add a function to early map kernel via huge pages
Hi Christophe,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20200316]
[cannot apply to powerpc/next v5.6-rc6 v5.6-rc5 v5.6-rc4]
[if your patch is applied to the wrong git tree, please drop us a note
to help improve the system. BTW, we also suggest to use '--base' option
to specify the base tree in git format-patch, please see
https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Christophe-Leroy/Use-hugepages-to-map-kernel-mem-on-8xx/20200317-065610
base: 8548fd2f20ed19b0e8c0585b71fdfde1ae00ae3c
config: powerpc-tqm8xx_defconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot

All errors (new ones prefixed by >>):

   In file included from arch/powerpc/mm/fault.c:33:
   include/linux/hugetlb.h: In function 'hstate_inode':
>> include/linux/hugetlb.h:522:9: error: implicit declaration of function 'HUGETLBFS_SB'; did you mean 'HUGETLBFS_MAGIC'? [-Werror=implicit-function-declaration]
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |         ^~~~
         |         HUGETLBFS_MAGIC
>> include/linux/hugetlb.h:522:30: error: invalid type argument of '->' (have 'int')
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |                              ^~
   cc1: all warnings being treated as errors
--
   In file included from arch/powerpc/mm/mem.c:30:
   include/linux/hugetlb.h: In function 'hstate_inode':
>> include/linux/hugetlb.h:522:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |         ^~~~
>> include/linux/hugetlb.h:522:30: error: invalid type argument of '->' (have 'int')
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |                              ^~
   cc1: all warnings being treated as errors
--
   In file included from arch/powerpc/mm/nohash/8xx.c:12:
   include/linux/hugetlb.h: In function 'hstate_inode':
>> include/linux/hugetlb.h:522:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |         ^~~~
>> include/linux/hugetlb.h:522:30: error: invalid type argument of '->' (have 'int')
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |                              ^~
   At top level:
   arch/powerpc/mm/nohash/8xx.c:73:18: error: '__early_map_kernel_hugepage' defined but not used [-Werror=unused-function]
      73 | static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
         |                  ^~~
   cc1: all warnings being treated as errors
--
   In file included from arch/powerpc//mm/nohash/8xx.c:12:
   include/linux/hugetlb.h: In function 'hstate_inode':
>> include/linux/hugetlb.h:522:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |         ^~~~
>> include/linux/hugetlb.h:522:30: error: invalid type argument of '->' (have 'int')
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |                              ^~
   At top level:
   arch/powerpc//mm/nohash/8xx.c:73:18: error: '__early_map_kernel_hugepage' defined but not used [-Werror=unused-function]
      73 | static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
         |                  ^~~
   cc1: all warnings being treated as errors
--
   In file included from include/linux/migrate.h:8,
                    from kernel///sched/sched.h:53,
                    from kernel///sched/loadavg.c:9:
   include/linux/hugetlb.h: In function 'hstate_inode':
>> include/linux/hugetlb.h:522:9: error: implicit declaration of function 'HUGETLBFS_SB'; did you mean 'HUGETLBFS_MAGIC'? [-Werror=implicit-function-declaration]
     522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
         |         ^~~~
         |         HUGETLBFS_MAGIC
>> include/linux/hugetlb.h:522:30: error: invalid type argument of '->' (have 'int')
     522 |  ret
Re: [PATCH] powerpc/32: Fix missing NULL pmd check in virt_to_kpte()
Nick Desaulniers writes:
> Hello ppc friends, did this get picked up into -next yet?

Not yet. It's in my next-test, but it got stuck there because some
subsequent patches caused some CI errors that I had to debug.

I'll push it to next today.

cheers

> On Thu, Mar 12, 2020 at 8:35 PM Nathan Chancellor wrote:
>>
>> On Sat, Mar 07, 2020, Christophe Leroy wrote:
>> > Commit 2efc7c085f05 ("powerpc/32: drop get_pteptr()") replaced
>> > get_pteptr() by virt_to_kpte(). But virt_to_kpte() lacks a NULL pmd
>> > check and returns an invalid non-NULL pointer when there is no page
>> > table.
>> >
>> > Reported-by: Nick Desaulniers
>> > Fixes: 2efc7c085f05 ("powerpc/32: drop get_pteptr()")
>> > Signed-off-by: Christophe Leroy
>> > ---
>> >  arch/powerpc/include/asm/pgtable.h | 4 +++-
>> >  1 file changed, 3 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
>> > index b80bfd41828d..b1f1d5339735 100644
>> > --- a/arch/powerpc/include/asm/pgtable.h
>> > +++ b/arch/powerpc/include/asm/pgtable.h
>> > @@ -54,7 +54,9 @@ static inline pmd_t *pmd_ptr_k(unsigned long va)
>> >
>> >  static inline pte_t *virt_to_kpte(unsigned long vaddr)
>> >  {
>> > -	return pte_offset_kernel(pmd_ptr_k(vaddr), vaddr);
>> > +	pmd_t *pmd = pmd_ptr_k(vaddr);
>> > +
>> > +	return pmd_none(*pmd) ? NULL : pte_offset_kernel(pmd, vaddr);
>> >  }
>> >  #endif
>> >
>> > --
>> > 2.25.0
>> >
>>
>> With QEMU 4.2.0, I can confirm this fixes the panic:
>>
>> Tested-by: Nathan Chancellor

--
Thanks,
~Nick Desaulniers
Re: [PATCH v3 0/9] crypto/nx: Enable GZIP engine and provide userpace API
On Tue, 2020-03-17 at 00:04 +1100, Daniel Axtens wrote: > Hi Haren, > > If I understand correctly, to test these, I need to apply both this > series and your VAS userspace page fault handling series - is that > right? Daniel, Yes, This patch series enables GZIP engine and provides user space API. Whereas VAS fault handling series process faults if NX sees fault on request buffer. selftest - https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/206035.html or https://github.com/abalib/power-gzip/tree/develop/selftest More tests are available - https://github.com/abalib/power-gzip libnxz - https://github.com/libnxz/power-gzip Thanks Haren > > Kind regards, > Daniel > > > Power9 processor supports Virtual Accelerator Switchboard (VAS) which > > allows kernel and userspace to send compression requests to Nest > > Accelerator (NX) directly. The NX unit comprises of 2 842 compression > > engines and 1 GZIP engine. Linux kernel already has 842 compression > > support on kernel. This patch series adds GZIP compression support > > from user space. The GZIP Compression engine implements the ZLIB and > > GZIP compression algorithms. No plans of adding NX-GZIP compression > > support in kernel right now. > > > > Applications can send requests to NX directly with COPY/PASTE > > instructions. But kernel has to establish channel / window on NX-GZIP > > device for the userspace. So userspace access to the GZIP engine is > > provided through /dev/crypto/nx-gzip device with several operations. > > > > An application must open the this device to obtain a file descriptor (fd). > > Using the fd, application should issue the VAS_TX_WIN_OPEN ioctl to > > establish a connection to the engine. Once window is opened, should use > > mmap() system call to map the hardware address of engine's request queue > > into the application's virtual address space. Then user space forms the > > request as co-processor Request Block (CRB) and paste this CRB on the > > mapped HW address using COPY/PASTE instructions. Application can poll > > on status flags (part of CRB) with timeout for request completion. > > > > For VAS_TX_WIN_OPEN ioctl, if user space passes vas_id = -1 (struct > > vas_tx_win_open_attr), kernel determines the VAS instance on the > > corresponding chip based on the CPU on which the process is executing. > > Otherwise, the specified VAS instance is used if application passes the > > proper VAS instance (vas_id listed in /proc/device-tree/vas@*/ibm,vas_id). > > > > Process can open multiple windows with different FDs or can send several > > requests to NX on the same window at the same time. > > > > A userspace library libnxz is available: > > https://github.com/abalib/power-gzip > > > > Applications that use inflate/deflate calls can link with libNXz and use > > NX GZIP compression without any modification. > > > > Tested the available 842 compression on power8 and power9 system to make > > sure no regression and tested GZIP compression on power9 with tests > > available in the above link. > > > > Thanks to Bulent Abali for nxz library and tests development. > > > > Changelog: > > V2: > > - Move user space API code to powerpc as suggested. Also this API > > can be extended to any other coprocessor type that VAS can support > > in future. 
> >   Example: Fast thread wakeup feature from VAS
> > - Rebased to 5.6-rc3
> >
> > V3:
> > - Fix sparse warnings (patches 3 & 6)
> >
> > Haren Myneni (9):
> >   powerpc/vas: Initialize window attributes for GZIP coprocessor type
> >   powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API
> >   powerpc/vas: Add VAS user space API
> >   crypto/nx: Initialize coproc entry with kzalloc
> >   crypto/nx: Rename nx-842-powernv file name to nx-common-powernv
> >   crypto/NX: Make enable code generic to add new GZIP compression type
> >   crypto/nx: Enable and setup GZIP compression type
> >   crypto/nx: Remove 'pid' in vas_tx_win_attr struct
> >   Documentation/powerpc: VAS API
> >
> >  Documentation/powerpc/index.rst                    |    1 +
> >  Documentation/powerpc/vas-api.rst                  |  246 +
> >  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >  arch/powerpc/include/asm/vas.h                     |   12 +-
> >  arch/powerpc/include/uapi/asm/vas-api.h            |   22 +
> >  arch/powerpc/platforms/powernv/Makefile            |    2 +-
> >  arch/powerpc/platforms/powernv/vas-api.c           |  290 +
> >  arch/powerpc/platforms/powernv/vas-window.c        |   23 +-
> >  arch/powerpc/platforms/powernv/vas.h               |    2 +
> >  drivers/crypto/nx/Makefile                         |    2 +-
> >  drivers/crypto/nx/nx-842-powernv.c                 | 1062 --
> >  drivers/crypto/nx/nx-common-powernv.c              | 1133 
> >  12 files changed, 1723 insertions(+), 1073 deletions(-)
> >  create mode 100644 Documentation/powerpc/vas-api.rst
> >  create mode 100644 arch/powerpc/include/uapi/asm/
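The open/VAS_TX_WIN_OPEN/mmap sequence from the cover letter is short enough to sketch in full. This is a hedged illustration, not code from the series: the struct layout and ioctl request number below are assumptions, so a real program should include the series' uapi header (asm/vas-api.h) instead of redefining them:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Assumed layout; the series' uapi header is authoritative. */
    struct vas_tx_win_open_attr {
    	uint32_t version;
    	int16_t  vas_id;	/* -1: kernel picks the VAS instance */
    	uint16_t reserved1;
    	uint64_t flags;
    	uint64_t reserved2[6];
    };
    #define VAS_MAGIC	'v'
    #define VAS_TX_WIN_OPEN	_IOW(VAS_MAGIC, 0x20, struct vas_tx_win_open_attr)

    int main(void)
    {
    	struct vas_tx_win_open_attr attr = { .version = 1, .vas_id = -1 };
    	void *paste_addr;
    	int fd;

    	fd = open("/dev/crypto/nx-gzip", O_RDWR);
    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}
    	/* Establish a send window on the GZIP engine. */
    	if (ioctl(fd, VAS_TX_WIN_OPEN, &attr) < 0) {
    		perror("VAS_TX_WIN_OPEN");
    		return 1;
    	}
    	/* Map the engine's request queue (the paste address). */
    	paste_addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
    			  MAP_SHARED, fd, 0);
    	if (paste_addr == MAP_FAILED) {
    		perror("mmap");
    		return 1;
    	}
    	/* A CRB would now be built and submitted with COPY/PASTE. */
    	munmap(paste_addr, 4096);
    	close(fd);
    	return 0;
    }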
Re: [PATCH 0/5] selftests/powerpc: Add NX-GZIP engine testcase
On Mon, 2020-03-16 at 15:07 -0300, Raphael Moreira Zinsly wrote: > This patch series is intended to test the power8 and power9 Nest > Accelerator (NX) GZIP engine that is being introduced by > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/205659.html > More information about how to access the NX can be found in that patch; also a > complete userspace library and more documentation can be found at: > https://github.com/libnxz/power-gzip > Raphael, Please delete the power8 reference. The NX-GZIP engine and user space support (with VAS) are introduced in P9. > > Thanks, > Raphael > >
Re: [PATCH 0/6] Fix sparse warnings for common qe library code
On 12/03/2020 23.28, Li Yang wrote: > The QE code was previously only supported on big-endian PowerPC systems > that use the same endian as the QE device. The endian transfer code is > not really exercised. Recent updates extended the QE drivers to > little-endian ARM/ARM64 systems which makes the endian transfer really > meaningful and hence triggered more sparse warnings for the endian > mismatch. Some of these endian issues are real issues that need to be > fixed. > > While at it, fixed some direct de-references of IO memory space and > suppressed other __iomem address space mismatch issues by adding correct > address space attributes. > > Li Yang (6): > soc: fsl: qe: fix sparse warnings for qe.c > soc: fsl: qe: fix sparse warning for qe_common.c > soc: fsl: qe: fix sparse warnings for ucc.c > soc: fsl: qe: fix sparse warnings for qe_ic.c > soc: fsl: qe: fix sparse warnings for ucc_fast.c > soc: fsl: qe: fix sparse warnings for ucc_slow.c Patches 2-5 should not change the generated code, whether LE or BE host, as they merely add sparse annotations (please double-check with objdump that that is indeed the case), so for those you may add Reviewed-by: Rasmus Villemoes I think patch 1 is also correct, but I don't have hardware to test it on ATM. I'd like to see patch 6 split into smaller pieces, most of it seems obviously correct. Rasmus
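To make concrete what an annotation-only sparse fix looks like (a generic sketch; the function names here are made up, not from the series):

    #include <linux/io.h>
    #include <linux/types.h>

    /* Before: sparse warns, e.g. "incorrect type in argument 1
     * (different address spaces)", because the pointer lacks __iomem. */
    static u32 qe_reg_read(void *base, int off)
    {
    	return ioread32be(base + off);
    }

    /* After: identical object code, but the address space is now
     * documented for sparse - hence the objdump comparison suggested
     * above is a valid sanity check. */
    static u32 qe_reg_read_annotated(void __iomem *base, int off)
    {
    	return ioread32be(base + off);
    }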
Re: [PATCH 6/6] soc: fsl: qe: fix sparse warnings for ucc_slow.c
On 12/03/2020 23.28, Li Yang wrote: > Fixes the following sparse warnings: > [snip] > > Also removed the unnecessary clearing for kzalloc'ed structure. Please don't mix that in the same patch, do it in a preparatory patch. That makes reviewing much easier. > > /* Get PRAM base */ > uccs->us_pram_offset = > @@ -231,24 +224,24 @@ int ucc_slow_init(struct ucc_slow_info * us_info, > struct ucc_slow_private ** ucc > /* clear bd buffer */ > qe_iowrite32be(0, &bd->buf); > /* set bd status and length */ > - qe_iowrite32be(0, (u32 *)bd); > + qe_iowrite32be(0, (u32 __iomem *)bd); It's cleaner to do two qe_iowrite16be to &bd->status and &bd->length, that avoids the casting altogether. > bd++; > } > /* for last BD set Wrap bit */ > qe_iowrite32be(0, &bd->buf); > - qe_iowrite32be(cpu_to_be32(T_W), (u32 *)bd); > + qe_iowrite32be(T_W, (u32 __iomem *)bd); Yeah, and this is why. Who can actually keep track of where that bit ends up being set with that casting going on? Please use qe_iowrite16be() with an appropriately modified constant to the appropriate field instead of these games. And if the hardware doesn't support 16 bit writes, the definition of struct qe_bd is wrong and should have a single __be32 status_length field, with appropriate accessors defined. > /* Init Rx bds */ > bd = uccs->rx_bd = qe_muram_addr(uccs->rx_base_offset); > for (i = 0; i < us_info->rx_bd_ring_len - 1; i++) { > /* set bd status and length */ > - qe_iowrite32be(0, (u32 *)bd); > + qe_iowrite32be(0, (u32 __iomem *)bd); Same. > /* clear bd buffer */ > qe_iowrite32be(0, &bd->buf); > bd++; > } > /* for last BD set Wrap bit */ > - qe_iowrite32be(cpu_to_be32(R_W), (u32 *)bd); > + qe_iowrite32be(R_W, (u32 __iomem *)bd); Same. > qe_iowrite32be(0, &bd->buf); > > /* Set GUMR (For more details see the hardware spec.). */ > @@ -273,8 +266,8 @@ int ucc_slow_init(struct ucc_slow_info * us_info, struct > ucc_slow_private ** ucc > qe_iowrite32be(gumr, &us_regs->gumr_h); > > /* gumr_l */ > - gumr = us_info->tdcr | us_info->rdcr | us_info->tenc | us_info->renc | > - us_info->diag | us_info->mode; > + gumr = (u32)us_info->tdcr | (u32)us_info->rdcr | (u32)us_info->tenc | > +(u32)us_info->renc | (u32)us_info->diag | (u32)us_info->mode; Are the tdcr, rdcr, tenc, renc fields actually set anywhere (the same for the diag and mode, but word-grepping for those gives way too many false positives)? They seem to be a somewhat pointless split out of the bitfields of gumr_l, and not populated anywhere? That's not directly related to this patch, of course, but getting rid of them first (if they are indeed completely unused) might make the sparse cleanup a little simpler. Rasmus
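To spell out the suggested alternative (a sketch, under the assumptions that struct qe_bd has separate __be16 status and __be16 length fields, and that T_W carries the Wrap bit in the upper 16 bits of the combined 32-bit word the old code wrote):

    /* set bd status and length, no (u32 __iomem *) cast needed */
    qe_iowrite16be(0, &bd->status);
    qe_iowrite16be(0, &bd->length);
    ...
    /* for the last BD, set the Wrap bit in the status halfword */
    qe_iowrite16be(T_W >> 16, &bd->status);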
[PATCHv2 00/50] Add log level to show_stack()
Changes to v2:
- Removed excessive pr_cont("\n") (nits by Senozhatsky)
- Leave backtrace debugging messages with pr_debug() (noted by Russell King and Will Deacon)
- Correct microblaze_unwind_inner() declaration (thanks to Michal Simek and the kbuild test robot)
- Fix copy'n'paste typo in show_stack_loglvl() for sparc (kbuild robot)
- Fix backtrace output on xtensa (thanks Max Filippov)
- Add loglevel to show_stack() on s390 (kbuild robot)
- Collected all Reviewed-by and Acked-by tags (thanks!)

Add a log level argument to show_stack(). Done in three stages:
1. Introduce show_stack_loglvl() for every architecture
2. Migrate old users with an explicit log level
3. Rename show_stack_loglvl() into show_stack()

Justification:
o It's a design mistake to move a business-logic decision into a platform realization detail.
o I currently have two patch sets that would benefit from this work: removing console_loglevel jumps in the sysrq driver [1], and a hung-task warning before panic [2] - suggested by Tetsuo (but he probably didn't realise what it would involve).
o While doing (1) and (2), the backtraces were adjusted to match the headers and other messages for each situation - so there won't be a situation where the backtrace is printed but the headers are missing because they have a lesser log level (or the reverse).

Least important for upstream, but maybe still worth noting: every company I've worked in so far had an off-list patch to print a backtrace with the needed log level (but only for the architecture they cared about). If you have other ideas how you would benefit from show_stack() with a log level - please reply to this cover letter.

See also the discussion on v1:
https://lore.kernel.org/linux-riscv/20191106083538.z5nlpuf64cigx...@pathway.suse.cz/

Cc: Andrew Morton
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Jiri Slaby
Cc: Petr Mladek
Cc: Sergey Senozhatsky
Cc: Steven Rostedt
Cc: Tetsuo Handa

Thanks,
Dmitry

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-d...@arista.com/T/#u
[2]: https://lkml.kernel.org/r/41fd7652-df1f-26f6-aba0-b87ebae07...@i-love.sakura.ne.jp

Dmitry Safonov (50):
  kallsyms/printk: Add loglvl to print_ip_sym()
  alpha: Add show_stack_loglvl()
  arc: Add show_stack_loglvl()
  arm/asm: Add loglvl to c_backtrace()
  arm: Add loglvl to unwind_backtrace()
  arm: Add loglvl to dump_backtrace()
  arm: Wire up dump_backtrace_{entry,stm}
  arm: Add show_stack_loglvl()
  arm64: Add loglvl to dump_backtrace()
  arm64: Add show_stack_loglvl()
  c6x: Add show_stack_loglvl()
  csky: Add show_stack_loglvl()
  h8300: Add show_stack_loglvl()
  hexagon: Add show_stack_loglvl()
  ia64: Pass log level as arg into ia64_do_show_stack()
  ia64: Add show_stack_loglvl()
  m68k: Add show_stack_loglvl()
  microblaze: Add loglvl to microblaze_unwind_inner()
  microblaze: Add loglvl to microblaze_unwind()
  microblaze: Add show_stack_loglvl()
  mips: Add show_stack_loglvl()
  nds32: Add show_stack_loglvl()
  nios2: Add show_stack_loglvl()
  openrisc: Add show_stack_loglvl()
  parisc: Add show_stack_loglvl()
  powerpc: Add show_stack_loglvl()
  riscv: Add show_stack_loglvl()
  s390: Add show_stack_loglvl()
  sh: Add loglvl to dump_mem()
  sh: Remove needless printk()
  sh: Add loglvl to printk_address()
  sh: Add loglvl to show_trace()
  sh: Add show_stack_loglvl()
  sparc: Add show_stack_loglvl()
  um/sysrq: Remove needless variable sp
  um: Add show_stack_loglvl()
  unicore32: Remove unused pmode argument in c_backtrace()
  unicore32: Add loglvl to c_backtrace()
  unicore32: Add show_stack_loglvl()
  x86: Add missing const qualifiers for log_lvl
  x86: Add show_stack_loglvl()
  xtensa: Add loglvl to show_trace()
  xtensa: Add show_stack_loglvl()
  sysrq: Use show_stack_loglvl()
  x86/amd_gart: Print stacktrace for a leak with KERN_ERR
  power: Use show_stack_loglvl()
  kdb: Don't play with console_loglevel
  sched: Print stack trace with KERN_INFO
  kernel: Use show_stack_loglvl()
  kernel: Rename show_stack_loglvl() => show_stack()

 arch/alpha/kernel/traps.c           | 22 +++
 arch/arc/include/asm/bug.h          |  3 ++-
 arch/arc/kernel/stacktrace.c        | 17 +++-
 arch/arc/kernel/troubleshoot.c      |  2 +-
 arch/arm/include/asm/bug.h          |  3 ++-
 arch/arm/include/asm/traps.h        |  3 ++-
 arch/arm/include/asm/unwind.h       |  3 ++-
 arch/arm/kernel/traps.c             | 40
 arch/arm/kernel/unwind.c            |  7 ++---
 arch/arm/lib/backtrace-clang.S      |  9 +--
 arch/arm/lib/backtrace.S            | 14 +++---
 arch/arm64/include/asm/stacktrace.h |  3 ++-
 arch/arm64/kernel/process.c         |  2 +-
 arch/arm64/kernel/traps.c           | 19 ++---
 arch/c6x/kernel/traps.c             | 18 +++--
 arch/csky/kernel/dumpstack.c        |  9 ---
 arch/csky/kernel/ptrace.c           |  4 +--
 arch/h8300/kernel/traps.c           | 12 -
 arch/hexagon/kernel/traps.c         | 25 +
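For orientation, the end state after stage 3 is that every caller passes the level explicitly. The signature below matches the cover letter's plan; the two call sites are illustrative, not quoted from the series:

    void show_stack(struct task_struct *task, unsigned long *sp,
    		const char *loglvl);

    show_stack(tsk, NULL, KERN_EMERG);	/* sysrq: must never be suppressed */
    show_stack(tsk, NULL, KERN_INFO);	/* sched debug: informational only */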
[PATCH 5/5] selftests/powerpc: Add README for GZIP engine tests
Include a README file with the instructions to use the testcases at selftests/powerpc/nx-gzip. Signed-off-by: Bulent Abali Signed-off-by: Raphael Moreira Zinsly --- .../powerpc/nx-gzip/99-nx-gzip.rules | 1 + .../testing/selftests/powerpc/nx-gzip/README | 44 +++ 2 files changed, 45 insertions(+) create mode 100644 tools/testing/selftests/powerpc/nx-gzip/99-nx-gzip.rules create mode 100644 tools/testing/selftests/powerpc/nx-gzip/README diff --git a/tools/testing/selftests/powerpc/nx-gzip/99-nx-gzip.rules b/tools/testing/selftests/powerpc/nx-gzip/99-nx-gzip.rules new file mode 100644 index ..5a7118495cb3 --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/99-nx-gzip.rules @@ -0,0 +1 @@ +SUBSYSTEM=="nxgzip", KERNEL=="nx-gzip", MODE="0666" diff --git a/tools/testing/selftests/powerpc/nx-gzip/README b/tools/testing/selftests/powerpc/nx-gzip/README new file mode 100644 index ..ff0c817a65c5 --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/README @@ -0,0 +1,44 @@ +Test the nx-gzip function: += + +Verify that following device exists: + /dev/crypto/nx-gzip +If you get a permission error run as sudo or set the device permissions: + sudo chmod go+rw /dev/crypto/nx-gzip +However, chmod may not survive across boots. You may create a udev file such +as: + /etc/udev/rules.d/99-nx-gzip.rules + + +Then make and run: +$ make +gcc -O3 -I./inc -o gzfht_test gzfht_test.c gzip_vas.c +gcc -O3 -I./inc -o gunz_test gunz_test.c gzip_vas.c + + +Compress any file using Fixed Huffman mode. Output will have a .nx.gz suffix: +$ ./gzfht_test gzip_vas.c +file gzip_vas.c read, 5276 bytes +compressed 5276 to 2564 bytes total, crc32 checksum = b937a37d + + +Uncompress the previous output. Output will have a .nx.gunzip suffix: +$ ./gunz_test gzip_vas.c.nx.gz +gzHeader FLG 0 +00 00 00 00 04 03 +gzHeader MTIME, XFL, OS ignored +computed checksum b937a37d isize 149c +stored checksum b937a37d isize 149c +decomp is complete: fclose + + +Compare two files: +$ sha1sum gzip_vas.c.nx.gz.nx.gunzip gzip_vas.c +f041cd8581e8d920f79f6ce7f65411be5d026c2a gzip_vas.c.nx.gz.nx.gunzip +f041cd8581e8d920f79f6ce7f65411be5d026c2a gzip_vas.c + + +Note that the code here are intended for testing the nx-gzip hardware function. +They are not intended for demonstrating performance or compression ratio. +For more information and source code consider using: +https://github.com/libnxz/power-gzip -- 2.21.0
[PATCH 4/5] selftests/powerpc: Add NX-GZIP engine decompress testcase
Include a decompression testcase for the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali Signed-off-by: Raphael Moreira Zinsly --- .../selftests/powerpc/nx-gzip/Makefile|7 +- .../selftests/powerpc/nx-gzip/gunz_test.c | 1058 + 2 files changed, 1062 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/powerpc/nx-gzip/gunz_test.c diff --git a/tools/testing/selftests/powerpc/nx-gzip/Makefile b/tools/testing/selftests/powerpc/nx-gzip/Makefile index ab903f63bbbd..82abc19a49a0 100644 --- a/tools/testing/selftests/powerpc/nx-gzip/Makefile +++ b/tools/testing/selftests/powerpc/nx-gzip/Makefile @@ -1,9 +1,9 @@ CC = gcc CFLAGS = -O3 INC = ./inc -SRC = gzfht_test.c +SRC = gzfht_test.c gunz_test.c OBJ = $(SRC:.c=.o) -TESTS = gzfht_test +TESTS = gzfht_test gunz_test EXTRA_SOURCES = gzip_vas.c all: $(TESTS) @@ -16,6 +16,7 @@ $(TESTS): $(OBJ) run_tests: $(TESTS) ./gzfht_test gzip_vas.c + ./gunz_test gzip_vas.c.nx.gz clean: - rm -f $(TESTS) *.o *~ *.gz + rm -f $(TESTS) *.o *~ *.gz *.gunzip diff --git a/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c b/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c new file mode 100644 index ..653de92698cc --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c @@ -0,0 +1,1058 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * + * P9 gunzip sample code for demonstrating the P9 NX hardware + * interface. Not intended for productive uses or for performance or + * compression ratio measurements. Note also that /dev/crypto/gzip, + * VAS and skiboot support are required + * + * Copyright 2020 IBM Corp. + * + * Author: Bulent Abali + * + * https://github.com/libnxz/power-gzip for zlib api and other utils + * Definitions of acronyms used here. See + * P9 NX Gzip Accelerator User's Manual for details + * + * adler/crc: 32 bit checksums appended to stream tail + * ce: completion extension + * cpb: coprocessor parameter block (metadata) + * crb: coprocessor request block (command) + * csb: coprocessor status block (status) + * dht: dynamic huffman table + * dde: data descriptor element (address, length) + * ddl: list of ddes + * dh/fh:dynamic and fixed huffman types + * fc: coprocessor function code + * histlen: history/dictionary length + * history: sliding window of up to 32KB of data + * lzcount: Deflate LZ symbol counts + * rembytecnt: remaining byte count + * sfbt: source final block type; last block's type during decomp + * spbc: source processed byte count + * subc: source unprocessed bit count + * tebc: target ending bit count; valid bits in the last byte + * tpbc: target processed byte count + * vas: virtual accelerator switch; the user mode interface + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "nxu.h" +#include "nx.h" + +int nx_dbg = 0; +FILE *nx_gzip_log = NULL; + +#define NX_MIN(X, Y) (((X) < (Y))?(X):(Y)) +#define NX_MAX(X, Y) (((X) > (Y))?(X):(Y)) + +#define mb() asm volatile("sync" ::: "memory") +#define rmb()asm volatile("lwsync" ::: "memory") +#define wmb()rmb() + +const int fifo_in_len = 1<<24; +const int fifo_out_len = 1<<24; +const int page_sz = 1<<16; +const int line_sz = 1<<7; +const int window_max = 1<<15; +const int retry_max = 50; + +extern void *nx_fault_storage_address; +extern void *nx_function_begin(int function, int pri); +extern int nx_function_end(void *handle); + +/* + * Fault in pages prior to NX job submission. wr=1 may be required to + * touch writeable pages. 
System zero pages do not fault-in the page as + * intended. Typically set wr=1 for NX target pages and set wr=0 for + * NX source pages. + */ +static int nx_touch_pages(void *buf, long buf_len, long page_len, int wr) +{ + char *begin = buf; + char *end = (char *) buf + buf_len - 1; + volatile char t; + + assert(buf_len >= 0 && !!buf); + + NXPRT(fprintf(stderr, "touch %p %p len 0x%lx wr=%d\n", buf, + buf + buf_len, buf_len, wr)); + + if (buf_len <= 0 || buf == NULL) + return -1; + + do { + t = *begin; + if (wr) + *begin = t; + begin = begin + page_len; + } while (begin < end); + + /* When buf_sz is small or buf tail is in another page. */ + t = *end; + if (wr) + *end = t; + + return 0; +} + +void sigsegv_handler(int sig, siginfo_t *info, void *ctx) +{ + fprintf(stderr, "%d: Got signal %d si_code %d, si_addr %p\n", getpid(), + sig, info->si_code, info->si_addr); + + nx_fault_storage_address = info->si_addr; +} + +/* + * Adds an (address, len) pair to the list of ddes (ddl) and updates + * the base dde. ddl[0] is the only
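Given the comment on nx_touch_pages() above, a call site would fault in the source read-only and the target writable before submitting the NX job (illustrative; the buffer names are placeholders):

    /* source pages only need to be resident (wr=0) ... */
    nx_touch_pages(src_buf, src_len, page_sz, 0);
    /* ... target pages must be made writable (wr=1) */
    nx_touch_pages(dst_buf, dst_len, page_sz, 1);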
[PATCH 3/5] selftests/powerpc: Add NX-GZIP engine compress testcase
Add a compression testcase for the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali Signed-off-by: Raphael Moreira Zinsly --- .../selftests/powerpc/nx-gzip/Makefile| 21 + .../selftests/powerpc/nx-gzip/gzfht_test.c| 475 ++ .../selftests/powerpc/nx-gzip/gzip_vas.c | 257 ++ 3 files changed, 753 insertions(+) create mode 100644 tools/testing/selftests/powerpc/nx-gzip/Makefile create mode 100644 tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c create mode 100644 tools/testing/selftests/powerpc/nx-gzip/gzip_vas.c diff --git a/tools/testing/selftests/powerpc/nx-gzip/Makefile b/tools/testing/selftests/powerpc/nx-gzip/Makefile new file mode 100644 index ..ab903f63bbbd --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/Makefile @@ -0,0 +1,21 @@ +CC = gcc +CFLAGS = -O3 +INC = ./inc +SRC = gzfht_test.c +OBJ = $(SRC:.c=.o) +TESTS = gzfht_test +EXTRA_SOURCES = gzip_vas.c + +all: $(TESTS) + +$(OBJ): %.o: %.c + $(CC) $(CFLAGS) -I$(INC) -c $< + +$(TESTS): $(OBJ) + $(CC) $(CFLAGS) -I$(INC) -o $@ $@.o $(EXTRA_SOURCES) + +run_tests: $(TESTS) + ./gzfht_test gzip_vas.c + +clean: + rm -f $(TESTS) *.o *~ *.gz diff --git a/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c b/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c new file mode 100644 index ..29d83fe2694f --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c @@ -0,0 +1,475 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * + * P9 gzip sample code for demonstrating the P9 NX hardware interface. + * Not intended for productive uses or for performance or compression + * ratio measurements. For simplicity of demonstration, this sample + * code compresses in to fixed Huffman blocks only (Deflate btype=1) + * and has very simple memory management. Dynamic Huffman blocks + * (Deflate btype=2) are more involved as detailed in the user guide. + * Note also that /dev/crypto/gzip, VAS and skiboot support are + * required. + * + * Copyright 2020 IBM Corp. + * + * https://github.com/libnxz/power-gzip for zlib api and other utils + * + * Author: Bulent Abali + * + * Definitions of acronyms used here. See + * P9 NX Gzip Accelerator User's Manual for details + * + * adler/crc: 32 bit checksums appended to stream tail + * ce: completion extension + * cpb: coprocessor parameter block (metadata) + * crb: coprocessor request block (command) + * csb: coprocessor status block (status) + * dht: dynamic huffman table + * dde: data descriptor element (address, length) + * ddl: list of ddes + * dh/fh:dynamic and fixed huffman types + * fc: coprocessor function code + * histlen: history/dictionary length + * history: sliding window of up to 32KB of data + * lzcount: Deflate LZ symbol counts + * rembytecnt: remaining byte count + * sfbt: source final block type; last block's type during decomp + * spbc: source processed byte count + * subc: source unprocessed bit count + * tebc: target ending bit count; valid bits in the last byte + * tpbc: target processed byte count + * vas: virtual accelerator switch; the user mode interface + */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "nxu.h" +#include "nx.h" + +int nx_dbg = 0; +FILE *nx_gzip_log = NULL; + +extern void *nx_fault_storage_address; +extern void *nx_function_begin(int function, int pri); +extern int nx_function_end(void *handle); + +#define NX_MIN(X, Y) (((X) < (Y)) ? (X) : (Y)) + +/* + * LZ counts returned in the user supplied nx_gzip_crb_cpb_t structure. 
+ */ +static int compress_fht_sample(char *src, uint32_t srclen, char *dst, + uint32_t dstlen, int with_count, + nx_gzip_crb_cpb_t *cmdp, void *handle) +{ + int cc; + uint32_t fc; + + assert(!!cmdp); + + put32(cmdp->crb, gzip_fc, 0); /* clear */ + fc = (with_count) ? GZIP_FC_COMPRESS_RESUME_FHT_COUNT : + GZIP_FC_COMPRESS_RESUME_FHT; + putnn(cmdp->crb, gzip_fc, fc); + putnn(cmdp->cpb, in_histlen, 0); /* resuming with no history */ + memset((void *) &cmdp->crb.csb, 0, sizeof(cmdp->crb.csb)); + + /* Section 6.6 programming notes; spbc may be in two different +* places depending on FC. +*/ + if (!with_count) + put32(cmdp->cpb, out_spbc_comp, 0); + else + put32(cmdp->cpb, out_spbc_comp_with_count, 0); + + /* Figure 6-3 6-4; CSB location */ + put64(cmdp->crb, csb_address, 0); + put64(cmdp->crb, csb_address, + (uint64_t) &cmdp->crb.csb & csb_address_mask); + + /* Source direct dde (scatter-gather list) */ + clear_dde(cmdp->crb.source_dde); + putnn(cmdp->crb.source_dde, dde_count, 0); + put3
[PATCH 2/5] selftests/powerpc: Add header files for NX compression/decompression
Add files to be able to compress and decompress files using the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali Signed-off-by: Raphael Moreira Zinsly --- .../powerpc/nx-gzip/inc/copy-paste.h | 54 ++ .../selftests/powerpc/nx-gzip/inc/nx_dbg.h| 95 +++ .../selftests/powerpc/nx-gzip/inc/nxu.h | 644 ++ 3 files changed, 793 insertions(+) create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/copy-paste.h create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/nx_dbg.h create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/nxu.h diff --git a/tools/testing/selftests/powerpc/nx-gzip/inc/copy-paste.h b/tools/testing/selftests/powerpc/nx-gzip/inc/copy-paste.h new file mode 100644 index ..107139b6c7df --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/inc/copy-paste.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#include "nx-helpers.h" + +/* + * Macros taken from arch/powerpc/include/asm/ppc-opcode.h and other + * header files. + */ +#define ___PPC_RA(a)(((a) & 0x1f) << 16) +#define ___PPC_RB(b)(((b) & 0x1f) << 11) + +#define PPC_INST_COPY 0x7c20060c +#define PPC_INST_PASTE 0x7c20070d + +#define PPC_COPY(a, b) stringify_in_c(.long PPC_INST_COPY | \ + ___PPC_RA(a) | ___PPC_RB(b)) +#define PPC_PASTE(a, b) stringify_in_c(.long PPC_INST_PASTE | \ + ___PPC_RA(a) | ___PPC_RB(b)) +#define CR0_SHIFT 28 +#define CR0_MASK 0xF +/* + * Copy/paste instructions: + * + * copy RA,RB + * Copy contents of address (RA) + effective_address(RB) + * to internal copy-buffer. + * + * paste RA,RB + * Paste contents of internal copy-buffer to the address + * (RA) + effective_address(RB) + */ +static inline int vas_copy(void *crb, int offset) +{ + asm volatile(PPC_COPY(%0, %1)";" + : + : "b" (offset), "b" (crb) + : "memory"); + + return 0; +} + +static inline int vas_paste(void *paste_address, int offset) +{ + u32 cr; + + cr = 0; + asm volatile(PPC_PASTE(%1, %2)";" + "mfocrf %0, 0x80;" + : "=r" (cr) + : "b" (offset), "b" (paste_address) + : "memory", "cr0"); + + return (cr >> CR0_SHIFT) & CR0_MASK; +} diff --git a/tools/testing/selftests/powerpc/nx-gzip/inc/nx_dbg.h b/tools/testing/selftests/powerpc/nx-gzip/inc/nx_dbg.h new file mode 100644 index ..f2c0eee2317e --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/inc/nx_dbg.h @@ -0,0 +1,95 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * + * Copyright 2020 IBM Corporation + * + */ + +#ifndef _NXU_DBG_H_ +#define _NXU_DBG_H_ + +#include +#include +#include +#include +#include + +extern FILE * nx_gzip_log; +extern int nx_gzip_trace; +extern unsigned int nx_gzip_inflate_impl; +extern unsigned int nx_gzip_deflate_impl; +extern unsigned int nx_gzip_inflate_flags; +extern unsigned int nx_gzip_deflate_flags; + +extern int nx_dbg; +pthread_mutex_t mutex_log; + +#define nx_gzip_trace_enabled() (nx_gzip_trace & 0x1) +#define nx_gzip_hw_trace_enabled()(nx_gzip_trace & 0x2) +#define nx_gzip_sw_trace_enabled()(nx_gzip_trace & 0x4) +#define nx_gzip_gather_statistics() (nx_gzip_trace & 0x8) +#define nx_gzip_per_stream_stat() (nx_gzip_trace & 0x10) + +#define prt(fmt, ...) 
do { \ + pthread_mutex_lock(&mutex_log); \ + flock(nx_gzip_log->_fileno, LOCK_EX); \ + time_t t; struct tm *m; time(&t); m = localtime(&t);\ + fprintf(nx_gzip_log, "[%04d/%02d/%02d %02d:%02d:%02d] " \ + "pid %d: " fmt, \ + (int)m->tm_year + 1900, (int)m->tm_mon+1, (int)m->tm_mday, \ + (int)m->tm_hour, (int)m->tm_min, (int)m->tm_sec,\ + (int)getpid(), ## __VA_ARGS__); \ + fflush(nx_gzip_log);\ + flock(nx_gzip_log->_fileno, LOCK_UN); \ + pthread_mutex_unlock(&mutex_log); \ +} while (0) + +/* Use in case of an error */ +#define prt_err(fmt, ...) do { if (nx_dbg >= 0) { \ + prt("%s:%u: Error: "fmt,\ + __FILE__, __LINE__, ## __VA_ARGS__);\ +}} while (0) + +/* Use in case of an warning */ +#define prt_warn(fmt, ...) do {if (nx_dbg >= 1) { \ + prt("%s:%u: Warning: "fmt, \ + __FILE__, __LINE__, ## __VA_ARGS__);\ +}} while (0) + +/* Informational printouts */ +#define prt_info(fmt, ...) do {if (nx_dbg >= 2) {
[PATCH 0/5] selftests/powerpc: Add NX-GZIP engine testcase
This patch series is intended to test the power8 and power9 Nest Accelerator (NX) GZIP engine that is being introduced by https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/205659.html
More information about how to access the NX can be found in that patch; also, a complete userspace library and more documentation can be found at: https://github.com/libnxz/power-gzip Thanks, Raphael
[PATCH 1/5] selftests/powerpc: Add header files for GZIP engine test
Add files to access the powerpc NX-GZIP engine in user space. Signed-off-by: Bulent Abali Signed-off-by: Raphael Moreira Zinsly --- .../selftests/powerpc/nx-gzip/inc/crb.h | 170 ++ .../selftests/powerpc/nx-gzip/inc/nx-gzip.h | 27 +++ .../powerpc/nx-gzip/inc/nx-helpers.h | 53 ++ .../selftests/powerpc/nx-gzip/inc/nx.h| 30 4 files changed, 280 insertions(+) create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/crb.h create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/nx-gzip.h create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/nx-helpers.h create mode 100644 tools/testing/selftests/powerpc/nx-gzip/inc/nx.h diff --git a/tools/testing/selftests/powerpc/nx-gzip/inc/crb.h b/tools/testing/selftests/powerpc/nx-gzip/inc/crb.h new file mode 100644 index ..6af25fb8461a --- /dev/null +++ b/tools/testing/selftests/powerpc/nx-gzip/inc/crb.h @@ -0,0 +1,170 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __CRB_H +#define __CRB_H +#include + +typedef unsigned char u8; +typedef unsigned int u32; +typedef unsigned long long u64; + +/* From nx-842.h */ + +/* CCW 842 CI/FC masks + * NX P8 workbook, section 4.3.1, figure 4-6 + * "CI/FC Boundary by NX CT type" + */ +#define CCW_CI_842 (0x3ff8) +#define CCW_FC_842 (0x0007) + +/* end - nx-842.h */ + +#ifndef __aligned +#define __aligned(x)__attribute__((aligned(x))) +#endif + +#ifndef __packed +#define __packed__attribute__((packed)) +#endif + +/* Chapter 6.5.8 Coprocessor-Completion Block (CCB) */ + +#define CCB_VALUE (0x3fff) +#define CCB_ADDRESS(0xfff8) +#define CCB_CM (0x0007) +#define CCB_CM0(0x0004) +#define CCB_CM12 (0x0003) + +#define CCB_CM0_ALL_COMPLETIONS(0x0) +#define CCB_CM0_LAST_IN_CHAIN (0x4) +#define CCB_CM12_STORE (0x0) +#define CCB_CM12_INTERRUPT (0x1) + +#define CCB_SIZE (0x10) +#define CCB_ALIGN CCB_SIZE + +struct coprocessor_completion_block { + __be64 value; + __be64 address; +} __packed __aligned(CCB_ALIGN); + + +/* Chapter 6.5.7 Coprocessor-Status Block (CSB) */ + +#define CSB_V (0x80) +#define CSB_F (0x04) +#define CSB_CH (0x03) +#define CSB_CE_INCOMPLETE (0x80) +#define CSB_CE_TERMINATION (0x40) +#define CSB_CE_TPBC(0x20) + +#define CSB_CC_SUCCESS (0) +#define CSB_CC_INVALID_ALIGN (1) +#define CSB_CC_OPERAND_OVERLAP (2) +#define CSB_CC_DATA_LENGTH (3) +#define CSB_CC_TRANSLATION (5) +#define CSB_CC_PROTECTION (6) +#define CSB_CC_RD_EXTERNAL (7) +#define CSB_CC_INVALID_OPERAND (8) +#define CSB_CC_PRIVILEGE (9) +#define CSB_CC_INTERNAL(10) +#define CSB_CC_WR_EXTERNAL (12) +#define CSB_CC_NOSPC (13) +#define CSB_CC_EXCESSIVE_DDE (14) +#define CSB_CC_WR_TRANSLATION (15) +#define CSB_CC_WR_PROTECTION (16) +#define CSB_CC_UNKNOWN_CODE(17) +#define CSB_CC_ABORT (18) +#define CSB_CC_TRANSPORT (20) +#define CSB_CC_SEGMENTED_DDL (31) +#define CSB_CC_PROGRESS_POINT (32) +#define CSB_CC_DDE_OVERFLOW(33) +#define CSB_CC_SESSION (34) +#define CSB_CC_PROVISION (36) +#define CSB_CC_CHAIN (37) +#define CSB_CC_SEQUENCE(38) +#define CSB_CC_HW (39) + +#define CSB_SIZE (0x10) +#define CSB_ALIGN CSB_SIZE + +struct coprocessor_status_block { + u8 flags; + u8 cs; + u8 cc; + u8 ce; + __be32 count; + __be64 address; +} __packed __aligned(CSB_ALIGN); + + +/* Chapter 6.5.10 Data-Descriptor List (DDL) + * each list contains one or more Data-Descriptor Entries (DDE) + */ + +#define DDE_P (0x8000) + +#define DDE_SIZE (0x10) +#define DDE_ALIGN DDE_SIZE + +struct data_descriptor_entry { + __be16 flags; + u8 count; + u8 index; + __be32 length; + __be64 address; +} __packed __aligned(DDE_ALIGN); + + +/* Chapter 6.5.2 
Coprocessor-Request Block (CRB) */ + +#define CRB_SIZE (0x80) +#define CRB_ALIGN (0x100) /* Errata: requires 256 alignment */ + + +/* Coprocessor Status Block field + * ADDRESS address of CSB + * C CCB is valid + * AT0 = addrs are virtual, 1 = addrs are phys + * M enable perf monitor + */ +#define CRB_CSB_ADDRESS(0xfff0) +#define CRB_CSB_C (0x0008) +#define CRB_CSB_AT (0x0002) +#define CRB_CSB_M (0x0001) + +struct coprocessor_request_block { + __be32 ccw; + __be32 flags; + __be64 csb_addr; + + struct data_descriptor_entry source; + struct data_descriptor_entry
Re: [PATCH] powerpc/pseries: Fix MCE handling on pseries
On 3/14/20 9:18 AM, Nicholas Piggin wrote:
Ganesh Goudar's on March 14, 2020 12:04 am:

MCE handling on the pSeries platform fails as the recent rework to use common code for pSeries and PowerNV machine check error handling tries to access per-cpu variables in real mode. The per-cpu variables may be outside the RMO region on the pSeries platform and need translation to be enabled for access. Just moving these per-cpu variables into the RMO region didn't help, because we queue some work to workqueues in real mode, which again tries to touch per-cpu variables.

Which queues are these? We should not be using Linux workqueues, but the powerpc mce code which uses irq_work.

Yes, irq work queues access memory outside the RMO:
irq_work_queue()->__irq_work_queue_local()->[this_cpu_ptr(&lazy_list) | this_cpu_ptr(&raised_list)]

Also fwnmi_release_errinfo() cannot be called when translation is not enabled.

Why not?

It crashes when we try to get the RTAS token for the "ibm,nmi-interlock" device tree node. But yes, we can avoid that by storing the rtas_token somewhere; I haven't tried it. Here is the backtrace I got when fwnmi_release_errinfo() was called from the realmode handler:

[ 70.856908] BUG: Unable to handle kernel data access on read at 0xc001a8f8
[ 70.856918] Faulting instruction address: 0xc0853920
[ 70.856927] Oops: Kernel access of bad area, sig: 11 [#1]
[ 70.856935] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 70.856943] Modules linked in: mcetest_slb(OE+) bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg pseries_rng ip_tables xfs libcrc32c sd_mod t10_pi ibmvscsi ibmveth scsi_transport_srp
[ 70.856975] CPU: 13 PID: 6480 Comm: insmod Kdump: loaded Tainted: G OE 5.6.0-rc2-ganesh+ #6
[ 70.856985] NIP: c0853920 LR: c0853a14 CTR: c00376b0
[ 70.856994] REGS: c7e4b870 TRAP: 0300 Tainted: G OE (5.6.0-rc2-ganesh+)
[ 70.857003] MSR: 80001003 CR: 88000422 XER: 0009
[ 70.857015] CFAR: c0853a10 DAR: c001a8f8 DSISR: 4000 IRQMASK: 1
[ 70.857015] GPR00: c0853a14 c7e4bb00 c1372b00 c001a8c8
[ 70.857015] GPR04: c0cf8728 0002 c00800420810
[ 70.857015] GPR08: 0001 0001
[ 70.857015] GPR12: c7f92000 c001f8113d70 c0080059070d
[ 70.857015] GPR16: 04f8 c00800421080 fff1 c00800421038
[ 70.857015] GPR20: c125eb20 c0d1d1c8 c0080059
[ 70.857015] GPR24: 4510 c0080800 c12355d8 c00800420940
[ 70.857015] GPR28: c0080811 c0cf8728 c169a098
[ 70.857097] NIP [c0853920] __of_find_property+0x30/0xd0
[ 70.857106] LR [c0853a14] of_find_property+0x54/0x90
[ 70.857113] Call Trace:
[ 70.857117] Instruction dump:
[ 70.857124] 3c4c00b2 3842f210 2c23 418200bc 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78
[ 70.857136] fbe1fff8 7c9e2378 f8010010 f821ffc1 2fbf 409e0014 4864
[ 70.857152] ---[ end trace 13755f7502f3150b ]---
[ 70.864199]
[ 70.864226] Sending IPI to other CPUs
[ 82.011761] ERROR: 15 cpu(s) not responding

This patch fixes this by enabling translation in the exception handler when all required real mode handling is done. This change only affects the pSeries platform.

Not supposed to do this, because we might not be in a state where the MMU is ready to be turned on at this point. I'd like to understand better which accesses are a problem, and whether we can fix them all to be in the RMO.

I faced three such access problems:
* accessing per-cpu data (like mce_event, mce_event_queue and mce_ue_event_queue); we can move these inside the RMO.
* calling fwnmi_release_errinfo().
* queuing work to irq_work_queue; not sure how to fix this.

Thanks,
Nick
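For readers following along, the failing chain Ganesh names lands in the per-cpu lists inside __irq_work_queue_local(); abbreviated from kernel/irq_work.c around v5.6 (a simplified sketch, not the exact code):

    static void __irq_work_queue_local(struct irq_work *work)
    {
    	/* lazy_list/raised_list are per-cpu variables at normal kernel
    	 * addresses; the llist_add() on this_cpu_ptr(&...) touches
    	 * memory that may sit outside the RMO, faulting in real mode. */
    	if (atomic_read(&work->flags) & IRQ_WORK_LAZY)
    		llist_add(&work->llnode, this_cpu_ptr(&lazy_list));
    	else
    		llist_add(&work->llnode, this_cpu_ptr(&raised_list));
    }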
Re: [PATCH 00/15] powerpc/watchpoint: Preparation for more than one watchpoint
On Mon, Mar 16, 2020 at 04:05:01PM +0100, Christophe Leroy wrote: > Some book3s (e300 family for instance, I think G2 as well) already have > a DABR2 in addition to DABR. The original "G2" (meaning 603 and 604) do not have DABR2. The newer "G2" (meaning e300) does have it. e500 and e600 do not have it either. Hope I got that right ;-) Segher
[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load
https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #12 from John Paul Adrian Glaubitz (glaub...@physik.fu-berlin.de) ---
Another crash:

watson login:
[17667512263.751484] BUG: Unable to handle kernel data access at 0xc00ff06e4838
[17667512263.751507] Faulting instruction address: 0xc017a778
[17667512263.751513] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751517] Faulting instruction address: 0xc02659a0
[17667512263.751521] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751525] Faulting instruction address: 0xc02659a0
[17667512263.751529] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751533] Faulting instruction address: 0xc02659a0
[17667512263.751537] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751541] Faulting instruction address: 0xc02659a0
[17667512263.751545] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751548] Faulting instruction address: 0xc02659a0
[17667512263.751552] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751556] Faulting instruction address: 0xc02659a0
[17667512263.751560] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751564] Faulting instruction address: 0xc02659a0
[17667512263.751569] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751574] Faulting instruction address: 0xc02659a0
[17667512263.751578] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751583] Faulting instruction address: 0xc02659a0
[17667512263.751587] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751591] Faulting instruction address: 0xc02659a0
[17667512263.751596] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751600] Faulting instruction address: 0xc02659a0
[17667512263.751604] Thread overran stack, or stack corrupted
[17667512263.751608] BUG: Unable to handle kernel data access at 0xc007f9070c08
[17667512263.751612] Faulting instruction address: 0xc02659a0
[17667512263.751615] Thread overran stack, or stack corrupted
[17667512263.751618] BUG: Unable to handle kernel data access at 0xc007f9070c08
[ 1835.743178] BUG: Unable to handle unknown paging fault at 0xc0c4b363
[ 1835.743180] Faulting instruction address: 0x
[17667512263.751633] Faulting instruction address: 0xc02659a0
[ 1835.743195] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1835.743198] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[ 1835.743203] Modules linked in:
[17667512263.751652] Thread overran stack, or stack corrupted
[ 1835.743205]
--
You are receiving this mail because: You are watching the assignee of the bug.
Re: [PATCH v1 5/5] mm/memory_hotplug: allow to specify a default online_type
On 16.03.20 16:31, Michal Hocko wrote: > On Wed 11-03-20 13:30:26, David Hildenbrand wrote: >> For now, distributions implement advanced udev rules to essentially >> - Don't online any hotplugged memory (s390x) >> - Online all memory to ZONE_NORMAL (e.g., most virt environments like >> hyperv) >> - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken >> care of (e.g., bare metal, special virt environments) >> >> In summary: All memory is usually onlined the same way, however, the >> kernel always has to ask userspace to come up with the same answer. >> E.g., HyperV always waits for a memory block to get onlined before >> continuing, otherwise it might end up adding memory faster than >> hotplugging it, which can result in strange OOM situations. >> >> Let's allow specifying a default online_type, not just "online" and >> "offline". This allows distributions to configure the default online_type >> when booting up and be done with it. >> >> We can now specify "offline", "online", "online_movable" and >> "online_kernel" via >> - "memhp_default_state=" on the kernel cmdline >> - /sys/devices/system/memory/auto_online_blocks >> just like we are able to specify for a single memory block via >> /sys/devices/system/memory/memoryX/state > > I still strongly believe that the whole interface is wrong. This is just > adding more lipstick on the pig. On the other hand I recognize that the > event based onlining is a PITA as well. The proper interface would > somehow communicate the type of the memory via the event or other sysfs > attribute and then the FW/HV could tell that this is an offline memory, > hotpluggable memory or just an additional memory that doesn't need to > support hotremove by the consumer. The userspace or the kernel could > handle the hotadd request much more easily that way. Yeah, and I proposed patches like that which were not well received [1] [2]. But then, user space usually wants to online all memory the same way right now. Also, HyperV and virtio-mem don't want to wait for onlining to happen in user space, because it slows down the whole "add a whole bunch of memory" process. > >> Cc: Greg Kroah-Hartman >> Cc: Andrew Morton >> Cc: Michal Hocko >> Cc: Oscar Salvador >> Cc: "Rafael J. Wysocki" >> Cc: Baoquan He >> Cc: Wei Yang >> Signed-off-by: David Hildenbrand > > That being said, I will not object to this patch. I simply gave up > fighting this interface. So if it works for consumers and it doesn't > break the existing userspace (which it shouldn't AFAICS) then go ahead. As it solves a real problem and makes the auto-online interface usable, I don't think anything speaks against it. Thanks! [1] https://spinics.net/lists/linux-driver-devel/msg118337.html [2] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg32420.html -- Thanks, David / dhildenb
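Concretely, with this patch a distribution that today ships a udev rule could instead set the default once (illustrative usage of the two interfaces named in the patch description):

    # at boot, on the kernel command line:
    memhp_default_state=online_movable

    # or at runtime, via sysfs:
    echo online_movable > /sys/devices/system/memory/auto_online_blocks
    cat /sys/devices/system/memory/auto_online_blocks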
[PATCHv2 26/50] powerpc: Add show_stack_loglvl()
Currently, the log-level of show_stack() depends on a platform realization. It creates situations where the headers are printed with lower log level or higher than the stacktrace (depending on a platform or user). Furthermore, it forces the logic decision from user to an architecture side. In result, some users as sysrq/kdb/etc are doing tricks with temporary rising console_loglevel while printing their messages. And in result it not only may print unwanted messages from other CPUs, but also omit printing at all in the unlucky case where the printk() was deferred. Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier approach than introducing more printk buffers. Also, it will consolidate printings with headers. Introduce show_stack_loglvl(), that eventually will substitute show_stack(). Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org [1]: https://lore.kernel.org/lkml/20190528002412.1625-1-d...@arista.com/T/#u Acked-by: Michael Ellerman (powerpc) Signed-off-by: Dmitry Safonov --- arch/powerpc/kernel/process.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index fad50db9dcf2..c1ab7f613da4 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -2034,7 +2034,8 @@ unsigned long get_wchan(struct task_struct *p) static int kstack_depth_to_print = CONFIG_PRINT_STACK_DEPTH; -void show_stack(struct task_struct *tsk, unsigned long *stack) +void show_stack_loglvl(struct task_struct *tsk, unsigned long *stack, + const char *loglvl) { unsigned long sp, ip, lr, newsp; int count = 0; @@ -2059,7 +2060,7 @@ void show_stack(struct task_struct *tsk, unsigned long *stack) } lr = 0; - printk("Call Trace:\n"); + printk("%sCall Trace:\n", loglvl); do { if (!validate_sp(sp, tsk, STACK_FRAME_OVERHEAD)) break; @@ -2068,7 +2069,8 @@ void show_stack(struct task_struct *tsk, unsigned long *stack) newsp = stack[0]; ip = stack[STACK_FRAME_LR_SAVE]; if (!firstframe || ip != lr) { - printk("["REG"] ["REG"] %pS", sp, ip, (void *)ip); + printk("%s["REG"] ["REG"] %pS", + loglvl, sp, ip, (void *)ip); #ifdef CONFIG_FUNCTION_GRAPH_TRACER ret_addr = ftrace_graph_ret_addr(current, &ftrace_idx, ip, stack); @@ -2090,8 +2092,9 @@ void show_stack(struct task_struct *tsk, unsigned long *stack) struct pt_regs *regs = (struct pt_regs *) (sp + STACK_FRAME_OVERHEAD); lr = regs->link; - printk("--- interrupt: %lx at %pS\nLR = %pS\n", - regs->trap, (void *)regs->nip, (void *)lr); + printk("%s--- interrupt: %lx at %pS\nLR = %pS\n", + loglvl, regs->trap, + (void *)regs->nip, (void *)lr); firstframe = 1; } @@ -2101,6 +2104,11 @@ void show_stack(struct task_struct *tsk, unsigned long *stack) put_task_stack(tsk); } +void show_stack(struct task_struct *tsk, unsigned long *stack) +{ + show_stack_loglvl(tsk, stack, KERN_DEFAULT); +} + #ifdef CONFIG_PPC64 /* Called with hard IRQs off */ void notrace __ppc64_runlatch_on(void) -- 2.25.1
Re: [PATCH v1 4/5] mm/memory_hotplug: convert memhp_auto_online to store an online_type
On Mon 16-03-20 16:34:06, David Hildenbrand wrote: [...] > Best I can do here is to also always online all memory. Yes that sounds like a cleaner solution than having a condition that doesn't make much sense at first glance. -- Michal Hocko SUSE Labs
Re: [PATCH v1 4/5] mm/memory_hotplug: convert memhp_auto_online to store an online_type
On 16.03.20 16:24, Michal Hocko wrote: > On Wed 11-03-20 13:30:25, David Hildenbrand wrote: > [...] >> diff --git a/arch/powerpc/platforms/powernv/memtrace.c >> b/arch/powerpc/platforms/powernv/memtrace.c >> index d6d64f8718e6..e15a600cfa4d 100644 >> --- a/arch/powerpc/platforms/powernv/memtrace.c >> +++ b/arch/powerpc/platforms/powernv/memtrace.c >> @@ -235,7 +235,7 @@ static int memtrace_online(void) >> * If kernel isn't compiled with the auto online option >> * we need to online the memory ourselves. >> */ >> -if (!memhp_auto_online) { >> +if (memhp_default_online_type == MMOP_OFFLINE) { >> lock_device_hotplug(); >> walk_memory_blocks(ent->start, ent->size, NULL, >> online_mem_block); > > Whut? This stinks, doesn't it. For your defense, the original code is > fishy already but this just makes it even more ugly. PPC64 onlines all memory directly from the kernel, and not triggered by user space (I think that's ugly and not desired, but it is what it is and I am not going to touch that). See arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_change_lmb_state(). Best I can do here is to also always online all memory. What are your suggestions? -- Thanks, David / dhildenb
Re: [PATCH v1 5/5] mm/memory_hotplug: allow to specify a default online_type
On Wed 11-03-20 13:30:26, David Hildenbrand wrote: > For now, distributions implement advanced udev rules to essentially > - Don't online any hotplugged memory (s390x) > - Online all memory to ZONE_NORMAL (e.g., most virt environments like > hyperv) > - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken > care of (e.g., bare metal, special virt environments) > > In summary: All memory is usually onlined the same way, however, the > kernel always has to ask userspace to come up with the same answer. > E.g., HyperV always waits for a memory block to get onlined before > continuing, otherwise it might end up adding memory faster than > hotplugging it, which can result in strange OOM situations. > > Let's allow specifying a default online_type, not just "online" and > "offline". This allows distributions to configure the default online_type > when booting up and be done with it. > > We can now specify "offline", "online", "online_movable" and > "online_kernel" via > - "memhp_default_state=" on the kernel cmdline > - /sys/devices/system/memory/auto_online_blocks > just like we are able to specify for a single memory block via > /sys/devices/system/memory/memoryX/state I still strongly believe that the whole interface is wrong. This is just adding more lipstick on the pig. On the other hand I recognize that the event based onlining is a PITA as well. The proper interface would somehow communicate the type of the memory via the event or other sysfs attribute and then the FW/HV could tell that this is an offline memory, hotpluggable memory or just an additional memory that doesn't need to support hotremove by the consumer. The userspace or the kernel could handle the hotadd request much more easily that way. > Cc: Greg Kroah-Hartman > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: "Rafael J. Wysocki" > Cc: Baoquan He > Cc: Wei Yang > Signed-off-by: David Hildenbrand That being said, I will not object to this patch. I simply gave up fighting this interface. So if it works for consumers and it doesn't break the existing userspace (which it shouldn't AFAICS) then go ahead.
> --- > drivers/base/memory.c | 11 +-- > include/linux/memory_hotplug.h | 2 ++ > mm/memory_hotplug.c| 8 > 3 files changed, 11 insertions(+), 10 deletions(-) > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index 8d3e16dab69f..2b09b68b9f78 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -35,7 +35,7 @@ static const char *const online_type_to_str[] = { > [MMOP_ONLINE_MOVABLE] = "online_movable", > }; > > -static int memhp_online_type_from_str(const char *str) > +int memhp_online_type_from_str(const char *str) > { > int i; > > @@ -394,13 +394,12 @@ static ssize_t auto_online_blocks_store(struct device > *dev, > struct device_attribute *attr, > const char *buf, size_t count) > { > - if (sysfs_streq(buf, "online")) > - memhp_default_online_type = MMOP_ONLINE; > - else if (sysfs_streq(buf, "offline")) > - memhp_default_online_type = MMOP_OFFLINE; > - else > + const int online_type = memhp_online_type_from_str(buf); > + > + if (online_type < 0) > return -EINVAL; > > + memhp_default_online_type = online_type; > return count; > } > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index c6e090b34c4b..ef55115320fb 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -117,6 +117,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size, > struct mhp_restrictions *restrictions); > extern u64 max_mem_size; > > +extern int memhp_online_type_from_str(const char *str); > + > /* Default online_type (MMOP_*) when new memory blocks are added. */ > extern int memhp_default_online_type; > /* If movable_node boot option specified */ > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 01443c70aa27..4a96273eafa7 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -75,10 +75,10 @@ EXPORT_SYMBOL_GPL(memhp_default_online_type); > > static int __init setup_memhp_default_state(char *str) > { > - if (!strcmp(str, "online")) > - memhp_default_online_type = MMOP_ONLINE; > - else if (!strcmp(str, "offline")) > - memhp_default_online_type = MMOP_OFFLINE; > + const int online_type = memhp_online_type_from_str(str); > + > + if (online_type >= 0) > + memhp_default_online_type = online_type; > > return 1; > } > -- > 2.24.1 -- Michal Hocko SUSE Labs
Re: [PATCH v1 4/5] mm/memory_hotplug: convert memhp_auto_online to store an online_type
On Wed 11-03-20 13:30:25, David Hildenbrand wrote: [...] > diff --git a/arch/powerpc/platforms/powernv/memtrace.c > b/arch/powerpc/platforms/powernv/memtrace.c > index d6d64f8718e6..e15a600cfa4d 100644 > --- a/arch/powerpc/platforms/powernv/memtrace.c > +++ b/arch/powerpc/platforms/powernv/memtrace.c > @@ -235,7 +235,7 @@ static int memtrace_online(void) >* If kernel isn't compiled with the auto online option >* we need to online the memory ourselves. >*/ > - if (!memhp_auto_online) { > + if (memhp_default_online_type == MMOP_OFFLINE) { > lock_device_hotplug(); > walk_memory_blocks(ent->start, ent->size, NULL, > online_mem_block); Whut? This stinks, doesn't it. For your defense, the original code is fishy already but this just makes it even more ugly. -- Michal Hocko SUSE Labs
Re: [PATCH v1 1/5] drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE
On 16.03.20 16:12, Michal Hocko wrote: > On Wed 11-03-20 13:30:22, David Hildenbrand wrote: >> The name is misleading. Let's just name it like the online_type name we >> expose to user space ("online"). > > I would disagree the name is misleading. It just says that you want to > online and keep the zone type. Nothing I would insist on though. "online and keep the zone type" - that's not what's happening. -- Thanks, David / dhildenb
Re: [PATCH v1 3/5] drivers/base/memory: store mapping between MMOP_* and string in an array
On Wed 11-03-20 13:30:24, David Hildenbrand wrote: > Let's use a simple array which we can reuse soon. While at it, move the > string->mmop conversion out of the device hotplug lock. > > Cc: Greg Kroah-Hartman > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: "Rafael J. Wysocki" > Cc: Baoquan He > Cc: Wei Yang > Signed-off-by: David Hildenbrand Acked-by: Michal Hocko > --- > drivers/base/memory.c | 38 +++--- > 1 file changed, 23 insertions(+), 15 deletions(-) > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index e7e77cafef80..8a7f29c0bf97 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -28,6 +28,24 @@ > > #define MEMORY_CLASS_NAME"memory" > > +static const char *const online_type_to_str[] = { > + [MMOP_OFFLINE] = "offline", > + [MMOP_ONLINE] = "online", > + [MMOP_ONLINE_KERNEL] = "online_kernel", > + [MMOP_ONLINE_MOVABLE] = "online_movable", > +}; > + > +static int memhp_online_type_from_str(const char *str) > +{ > + int i; > + > + for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) { > + if (sysfs_streq(str, online_type_to_str[i])) > + return i; > + } > + return -EINVAL; > +} > + > #define to_memory_block(dev) container_of(dev, struct memory_block, dev) > > static int sections_per_block; > @@ -236,26 +254,17 @@ static int memory_subsys_offline(struct device *dev) > static ssize_t state_store(struct device *dev, struct device_attribute *attr, > const char *buf, size_t count) > { > + const int online_type = memhp_online_type_from_str(buf); > struct memory_block *mem = to_memory_block(dev); > - int ret, online_type; > + int ret; > + > + if (online_type < 0) > + return -EINVAL; > > ret = lock_device_hotplug_sysfs(); > if (ret) > return ret; > > - if (sysfs_streq(buf, "online_kernel")) > - online_type = MMOP_ONLINE_KERNEL; > - else if (sysfs_streq(buf, "online_movable")) > - online_type = MMOP_ONLINE_MOVABLE; > - else if (sysfs_streq(buf, "online")) > - online_type = MMOP_ONLINE; > - else if (sysfs_streq(buf, "offline")) > - online_type = MMOP_OFFLINE; > - else { > - ret = -EINVAL; > - goto err; > - } > - > switch (online_type) { > case MMOP_ONLINE_KERNEL: > case MMOP_ONLINE_MOVABLE: > @@ -271,7 +280,6 @@ static ssize_t state_store(struct device *dev, struct > device_attribute *attr, > ret = -EINVAL; /* should never happen */ > } > > -err: > unlock_device_hotplug(); > > if (ret < 0) > -- > 2.24.1 -- Michal Hocko SUSE Labs
Re: [PATCH v1 2/5] drivers/base/memory: map MMOP_OFFLINE to 0
On Wed 11-03-20 13:30:23, David Hildenbrand wrote: > I have no idea why we have to start at -1. Because this is how the offline state offline was represented originally. > Just treat 0 as the special > case. Clarify a comment (which was wrong, when we come via > device_online() the first time, the online_type would have been 0 / > MEM_ONLINE). The default is now always MMOP_OFFLINE. git grep says that you have covered the only remaining place which hasn't used the enum value. > This is a preparation to use the online_type as an array index. > > Cc: Greg Kroah-Hartman > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: "Rafael J. Wysocki" > Cc: Baoquan He > Cc: Wei Yang > Signed-off-by: David Hildenbrand Acked-by: Michal Hocko > --- > drivers/base/memory.c | 11 --- > include/linux/memory_hotplug.h | 2 +- > 2 files changed, 5 insertions(+), 8 deletions(-) > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index 8c5ce42c0fc3..e7e77cafef80 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -211,17 +211,14 @@ static int memory_subsys_online(struct device *dev) > return 0; > > /* > - * If we are called from state_store(), online_type will be > - * set >= 0 Otherwise we were called from the device online > - * attribute and need to set the online_type. > + * When called via device_online() without configuring the online_type, > + * we want to default to MMOP_ONLINE. >*/ > - if (mem->online_type < 0) > + if (mem->online_type == MMOP_OFFLINE) > mem->online_type = MMOP_ONLINE; > > ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); > - > - /* clear online_type */ > - mem->online_type = -1; > + mem->online_type = MMOP_OFFLINE; > > return ret; > } > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 261dbf010d5d..c2e06ed5e0e9 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -48,7 +48,7 @@ enum { > /* Types for control the zone type of onlined and offlined memory */ > enum { > /* Offline the memory. */ > - MMOP_OFFLINE = -1, > + MMOP_OFFLINE = 0, > /* Online the memory. Zone depends, see default_zone_for_pfn(). */ > MMOP_ONLINE, > /* Online the memory to ZONE_NORMAL. */ > -- > 2.24.1 -- Michal Hocko SUSE Labs
Re: [PATCH v1 1/5] drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE
On Wed 11-03-20 13:30:22, David Hildenbrand wrote: > The name is misleading. Let's just name it like the online_type name we > expose to user space ("online"). I would disagree the name is misleading. It just says that you want to online and keep the zone type. Nothing I would insist on though. > Add some documentation to the types. > > Cc: Greg Kroah-Hartman > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: "Rafael J. Wysocki" > Cc: Baoquan He > Cc: Wei Yang > Signed-off-by: David Hildenbrand > --- > drivers/base/memory.c | 9 + > include/linux/memory_hotplug.h | 6 +- > 2 files changed, 10 insertions(+), 5 deletions(-) > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index 6448c9ece2cb..8c5ce42c0fc3 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -216,7 +216,7 @@ static int memory_subsys_online(struct device *dev) >* attribute and need to set the online_type. >*/ > if (mem->online_type < 0) > - mem->online_type = MMOP_ONLINE_KEEP; > + mem->online_type = MMOP_ONLINE; > > ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); > > @@ -251,7 +251,7 @@ static ssize_t state_store(struct device *dev, struct > device_attribute *attr, > else if (sysfs_streq(buf, "online_movable")) > online_type = MMOP_ONLINE_MOVABLE; > else if (sysfs_streq(buf, "online")) > - online_type = MMOP_ONLINE_KEEP; > + online_type = MMOP_ONLINE; > else if (sysfs_streq(buf, "offline")) > online_type = MMOP_OFFLINE; > else { > @@ -262,7 +262,7 @@ static ssize_t state_store(struct device *dev, struct > device_attribute *attr, > switch (online_type) { > case MMOP_ONLINE_KERNEL: > case MMOP_ONLINE_MOVABLE: > - case MMOP_ONLINE_KEEP: > + case MMOP_ONLINE: > /* mem->online_type is protected by device_hotplug_lock */ > mem->online_type = online_type; > ret = device_online(&mem->dev); > @@ -342,7 +342,8 @@ static ssize_t valid_zones_show(struct device *dev, > } > > nid = mem->nid; > - default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, > nr_pages); > + default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn, > + nr_pages); > strcat(buf, default_zone->name); > > print_allowed_zone(buf, nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL, > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index f4d59155f3d4..261dbf010d5d 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -47,9 +47,13 @@ enum { > > /* Types for control the zone type of onlined and offlined memory */ > enum { > + /* Offline the memory. */ > MMOP_OFFLINE = -1, > - MMOP_ONLINE_KEEP, > + /* Online the memory. Zone depends, see default_zone_for_pfn(). */ > + MMOP_ONLINE, > + /* Online the memory to ZONE_NORMAL. */ > MMOP_ONLINE_KERNEL, > + /* Online the memory to ZONE_MOVABLE. */ > MMOP_ONLINE_MOVABLE, > }; > > -- > 2.24.1 -- Michal Hocko SUSE Labs
Re: [PATCH 00/15] powerpc/watchpoint: Preparation for more than one watchpoint
On 09/03/2020 at 09:57, Ravi Bangoria wrote: So far, powerpc Book3S code has been written with an assumption of only one watchpoint. But a future Power architecture introduces a second watchpoint register (DAWR). Even though this patchset does not enable the 2nd DAWR, it makes the infrastructure ready so that enabling the 2nd DAWR should just be a matter of changing the count. Some book3s (the e300 family for instance, I think G2 as well) already have a DABR2 in addition to DABR. Will this series allow it to be used as well? Christophe
[PATCH] cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn
The patch avoids allocating cpufreq_policy on the stack, hence fixing the frame-size overflow in 'powernv_cpufreq_work_fn'. Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling") Signed-off-by: Pratik Rajesh Sampat --- drivers/cpufreq/powernv-cpufreq.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 56f4bc0d209e..20ee0661555a 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = { void powernv_cpufreq_work_fn(struct work_struct *work) { struct chip *chip = container_of(work, struct chip, throttle); + struct cpufreq_policy *policy; unsigned int cpu; cpumask_t mask; @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work) chip->restore = false; for_each_cpu(cpu, &mask) { int index; - struct cpufreq_policy policy; - cpufreq_get_policy(&policy, cpu); - index = cpufreq_table_find_index_c(&policy, policy.cur); - powernv_cpufreq_target_index(&policy, index); - cpumask_andnot(&mask, &mask, policy.cpus); + policy = cpufreq_cpu_get(cpu); + if (!policy) + continue; + index = cpufreq_table_find_index_c(policy, policy->cur); + powernv_cpufreq_target_index(policy, index); + cpumask_andnot(&mask, &mask, policy->cpus); + cpufreq_cpu_put(policy); } out: put_online_cpus(); -- 2.24.1
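The bug class behind this fix reproduces in plain C: a large object in a local variable inflates the stack frame (kernel builds warn via -Wframe-larger-than=), while taking a counted reference to an existing object keeps the frame small. A hedged userspace sketch, with made-up names and sizes standing in for struct cpufreq_policy and cpufreq_cpu_get()/cpufreq_cpu_put():

#include <stdlib.h>

struct big_policy { char data[2048]; };	/* stand-in for struct cpufreq_policy */

/* stand-in for cpufreq_cpu_get(): returns a counted reference, or NULL */
static struct big_policy *lookup_policy(int cpu)
{
	(void)cpu;
	return calloc(1, sizeof(struct big_policy));
}

/* stand-in for cpufreq_cpu_put(): drops the reference */
static void put_policy(struct big_policy *p)
{
	free(p);
}

static void work_fn(int cpu)
{
	/* a "struct big_policy p;" here would add ~2KB to the stack frame */
	struct big_policy *p = lookup_policy(cpu);

	if (!p)
		return;
	/* ... use p->data ... */
	put_policy(p);	/* always pair the get with a put */
}

int main(void)
{
	work_fn(0);
	return 0;
}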
Re: [PATCH] Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling")
Hi Daniel, Sure thing I'll re-send them. Rookie mistake, my bad. Thanks for pointing it out! Regards, Pratik On 16/03/20 6:35 pm, Daniel Axtens wrote: Hi Pratik, Please could you resend this with a more meaningful subject line and move the Fixes: line to immediately above your signed-off-by? Thanks! Regards, Daniel The patch avoids allocating cpufreq_policy on stack hence fixing frame size overflow in 'powernv_cpufreq_work_fn' Signed-off-by: Pratik Rajesh Sampat --- drivers/cpufreq/powernv-cpufreq.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 56f4bc0d209e..20ee0661555a 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = { void powernv_cpufreq_work_fn(struct work_struct *work) { struct chip *chip = container_of(work, struct chip, throttle); + struct cpufreq_policy *policy; unsigned int cpu; cpumask_t mask; @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work) chip->restore = false; for_each_cpu(cpu, &mask) { int index; - struct cpufreq_policy policy; - cpufreq_get_policy(&policy, cpu); - index = cpufreq_table_find_index_c(&policy, policy.cur); - powernv_cpufreq_target_index(&policy, index); - cpumask_andnot(&mask, &mask, policy.cpus); + policy = cpufreq_cpu_get(cpu); + if (!policy) + continue; + index = cpufreq_table_find_index_c(policy, policy->cur); + powernv_cpufreq_target_index(policy, index); + cpumask_andnot(&mask, &mask, policy->cpus); + cpufreq_cpu_put(policy); } out: put_online_cpus(); -- 2.17.1
Re: [PATCH] Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling")
Hi Pratik, Please could you resend this with a more meaningful subject line and move the Fixes: line to immediately above your signed-off-by? Thanks! Regards, Daniel > The patch avoids allocating cpufreq_policy on stack hence fixing frame > size overflow in 'powernv_cpufreq_work_fn' > > Signed-off-by: Pratik Rajesh Sampat > --- > drivers/cpufreq/powernv-cpufreq.c | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 56f4bc0d209e..20ee0661555a 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = { > void powernv_cpufreq_work_fn(struct work_struct *work) > { > struct chip *chip = container_of(work, struct chip, throttle); > + struct cpufreq_policy *policy; > unsigned int cpu; > cpumask_t mask; > > @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work) > chip->restore = false; > for_each_cpu(cpu, &mask) { > int index; > - struct cpufreq_policy policy; > > - cpufreq_get_policy(&policy, cpu); > - index = cpufreq_table_find_index_c(&policy, policy.cur); > - powernv_cpufreq_target_index(&policy, index); > - cpumask_andnot(&mask, &mask, policy.cpus); > + policy = cpufreq_cpu_get(cpu); > + if (!policy) > + continue; > + index = cpufreq_table_find_index_c(policy, policy->cur); > + powernv_cpufreq_target_index(policy, index); > + cpumask_andnot(&mask, &mask, policy->cpus); > + cpufreq_cpu_put(policy); > } > out: > put_online_cpus(); > -- > 2.17.1
Re: [PATCH v3 0/9] crypto/nx: Enable GZIP engine and provide userpace API
Hi Haren, If I understand correctly, to test these, I need to apply both this series and your VAS userspace page fault handling series - is that right? Kind regards, Daniel > The Power9 processor supports the Virtual Accelerator Switchboard (VAS), which > allows kernel and userspace to send compression requests to the Nest > Accelerator (NX) directly. The NX unit comprises two 842 compression > engines and one GZIP engine. The Linux kernel already has in-kernel 842 > compression support. This patch series adds GZIP compression support > from user space. The GZIP compression engine implements the ZLIB and > GZIP compression algorithms. There are no plans to add NX-GZIP compression > support in the kernel right now. > > Applications can send requests to NX directly with COPY/PASTE > instructions. But the kernel has to establish a channel / window on the NX-GZIP > device for userspace. So userspace access to the GZIP engine is > provided through the /dev/crypto/nx-gzip device with several operations. > > An application must open this device to obtain a file descriptor (fd). > Using the fd, the application should issue the VAS_TX_WIN_OPEN ioctl to > establish a connection to the engine. Once the window is opened, it should use the > mmap() system call to map the hardware address of the engine's request queue > into the application's virtual address space. Then user space forms the > request as a coprocessor Request Block (CRB) and pastes this CRB to the > mapped HW address using COPY/PASTE instructions. The application can poll > on the status flags (part of the CRB) with a timeout for request completion. > > For the VAS_TX_WIN_OPEN ioctl, if user space passes vas_id = -1 (struct > vas_tx_win_open_attr), the kernel determines the VAS instance on the > corresponding chip based on the CPU on which the process is executing. > Otherwise, the specified VAS instance is used if the application passes the > proper VAS instance (vas_id listed in /proc/device-tree/vas@*/ibm,vas_id). > > A process can open multiple windows with different FDs, or can send several > requests to NX on the same window at the same time. > > A userspace library libnxz is available: > https://github.com/abalib/power-gzip > > Applications that use inflate/deflate calls can link with libnxz and use > NX GZIP compression without any modification. > > Tested the available 842 compression on power8 and power9 systems to make > sure there is no regression, and tested GZIP compression on power9 with tests > available in the above link. > > Thanks to Bulent Abali for the nxz library and tests development. > > Changelog: > V2: > - Move user space API code to powerpc as suggested. Also this API > can be extended to any other coprocessor type that VAS can support > in future. 
Example: Fast thread wakeup feature from VAS > - Rebased to 5.6-rc3 > > V3: > - Fix sparse warnings (patches 3&6) > > Haren Myneni (9): > powerpc/vas: Initialize window attributes for GZIP coprocessor type > powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API > powerpc/vas: Add VAS user space API > crypto/nx: Initialize coproc entry with kzalloc > crypto/nx: Rename nx-842-powernv file name to nx-common-powernv > crypto/NX: Make enable code generic to add new GZIP compression type > crypto/nx: Enable and setup GZIP compresstion type > crypto/nx: Remove 'pid' in vas_tx_win_attr struct > Documentation/powerpc: VAS API > > Documentation/powerpc/index.rst|1 + > Documentation/powerpc/vas-api.rst | 246 + > Documentation/userspace-api/ioctl/ioctl-number.rst |1 + > arch/powerpc/include/asm/vas.h | 12 +- > arch/powerpc/include/uapi/asm/vas-api.h| 22 + > arch/powerpc/platforms/powernv/Makefile|2 +- > arch/powerpc/platforms/powernv/vas-api.c | 290 + > arch/powerpc/platforms/powernv/vas-window.c| 23 +- > arch/powerpc/platforms/powernv/vas.h |2 + > drivers/crypto/nx/Makefile |2 +- > drivers/crypto/nx/nx-842-powernv.c | 1062 -- > drivers/crypto/nx/nx-common-powernv.c | 1133 > > 12 files changed, 1723 insertions(+), 1073 deletions(-) > create mode 100644 Documentation/powerpc/vas-api.rst > create mode 100644 arch/powerpc/include/uapi/asm/vas-api.h > create mode 100644 arch/powerpc/platforms/powernv/vas-api.c > delete mode 100644 drivers/crypto/nx/nx-842-powernv.c > create mode 100644 drivers/crypto/nx/nx-common-powernv.c > > -- > 1.8.3.1
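To make the flow in this cover letter concrete, here is a hedged userspace sketch of opening a send window. VAS_TX_WIN_OPEN and struct vas_tx_win_open_attr come from the new asm/vas-api.h per the diffstat, but the version field value, the 4K mmap length and the minimal error handling are assumptions here; libnxz (linked above) is the reference user of this API.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <asm/vas-api.h>	/* VAS_TX_WIN_OPEN, struct vas_tx_win_open_attr */

static void *nx_gzip_open_window(int *fdp)
{
	struct vas_tx_win_open_attr attr;
	void *paste_addr;
	int fd;

	fd = open("/dev/crypto/nx-gzip", O_RDWR);
	if (fd < 0)
		return NULL;

	memset(&attr, 0, sizeof(attr));
	attr.version = 1;	/* assumed ABI version */
	attr.vas_id = -1;	/* let the kernel pick the VAS instance */
	if (ioctl(fd, VAS_TX_WIN_OPEN, &attr) < 0) {
		close(fd);
		return NULL;
	}

	/* Map the engine's paste address; the 4K length is an assumption. */
	paste_addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (paste_addr == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	*fdp = fd;
	/* format a CRB, COPY/PASTE it here, then poll its status flags */
	return paste_addr;
}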
[PATCH v5 2/2] powerpc/64: Prevent stack protection in early boot
The previous commit reduced the amount of code that is run before we set up a paca. However there are still a few remaining functions that run with no paca, or worse, with an arbitrary value in r13 that will be used as a paca pointer. In particular the stack protector canary is stored in the paca, so if stack protector is activated for any of these functions we will read the stack canary from wherever r13 points. If r13 happens to point outside of memory we will get a machine check / checkstop. For example if we modify initialise_paca() to trigger stack protection, and then boot in the mambo simulator with r13 poisoned in skiboot before calling the kernel: DEBUG: 19952232: (19952232): INSTRUCTION: PC=0xC000191FC1E8: [0x3C4C006D]: addis r2,r12,0x6D [fetch] DEBUG: 19952236: (19952236): INSTRUCTION: PC=0xC0001807EAD8: [0x7D8802A6]: mflr r12 [fetch] FATAL ERROR: 19952276: (19952276): Check Stop for 0:0: Machine Check with ME bit of MSR off DEBUG: 19952276: (19952276): INSTRUCTION: PC=0xC000191FCA7C: [0xE90D0CF8]: ld r8,0xCF8(r13) [Instruction Failed] INFO: 19952276: (19952277): ** Execution stopped: Mambo Error, Machine Check Stop, ** systemsim % bt pc: 0xC000191FCA7C initialise_paca+0x54 lr: 0xC000191FC22C early_setup+0x44 stack:0x198CBED0 0x0 +0x0 stack:0x198CBF00 0xC000191FC22C early_setup+0x44 stack:0x198CBF90 0x1801C968 +0x1801C968 So annotate the relevant functions to ensure stack protection is never enabled for them. Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support") Cc: sta...@vger.kernel.org # v4.20+ Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/paca.c | 4 ++-- arch/powerpc/kernel/setup.h | 2 ++ arch/powerpc/kernel/setup_64.c | 2 +- 3 files changed, 5 insertions(+), 3 deletions(-) v5: New. diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 0ee6308541b1..3f91ccaa9c74 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -176,7 +176,7 @@ static struct slb_shadow * __init new_slb_shadow(int cpu, unsigned long limit) struct paca_struct **paca_ptrs __read_mostly; EXPORT_SYMBOL(paca_ptrs); -void __init initialise_paca(struct paca_struct *new_paca, int cpu) +void __init __nostackprotector initialise_paca(struct paca_struct *new_paca, int cpu) { #ifdef CONFIG_PPC_PSERIES new_paca->lppaca_ptr = NULL; @@ -205,7 +205,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu) } /* Put the paca pointer into r13 and SPRG_PACA */ -void setup_paca(struct paca_struct *new_paca) +void __nostackprotector setup_paca(struct paca_struct *new_paca) { /* Setup r13 */ local_paca = new_paca; diff --git a/arch/powerpc/kernel/setup.h b/arch/powerpc/kernel/setup.h index 2dd0d9cb5a20..d210671026e9 100644 --- a/arch/powerpc/kernel/setup.h +++ b/arch/powerpc/kernel/setup.h @@ -8,6 +8,8 @@ #ifndef __ARCH_POWERPC_KERNEL_SETUP_H #define __ARCH_POWERPC_KERNEL_SETUP_H +#define __nostackprotector __attribute__((__optimize__("no-stack-protector"))) + void initialize_cache_info(void); void irqstack_early_init(void); diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 17886d147dd0..438a9befce41 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -279,7 +279,7 @@ void __init record_spr_defaults(void) * device-tree is not accessible via normal means at this point. */ -void __init early_setup(unsigned long dt_ptr) +void __init __nostackprotector early_setup(unsigned long dt_ptr) { static __initdata struct paca_struct boot_paca; -- 2.24.1
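The __nostackprotector trick is easy to observe in isolation. A userspace illustration, using the same GCC optimize attribute the kernel macro above defines (build with gcc -fstack-protector-all to make the contrast visible):

#include <stdio.h>

#define __nostackprotector __attribute__((__optimize__("no-stack-protector")))

/* No canary load/check is emitted for this function, so it stays safe
 * even when the canary's home (r13->paca on ppc64) is not set up yet. */
static __nostackprotector void early_helper(void)
{
	char buf[64] = "early";	/* local arrays normally force a canary check */
	puts(buf);
}

int main(void)
{
	early_helper();
	return 0;
}

Comparing the disassembly of early_helper() with and without the annotation shows the canary access disappear, which is exactly the property the early boot path needs.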
[PATCH v5 1/2] powerpc/64: Setup a paca before parsing device tree etc.
From: Daniel Axtens Currently we set up the paca after parsing the device tree for CPU features. Prior to that, r13 contains random data, which means there is random data in r13 while we're running the generic dt parsing code. This random data varies depending on whether we boot through a vmlinux or a zImage: for the vmlinux case it's usually around zero, but for zImages we see random values like 912a72603d420015. This is poor practice, and can also lead to difficult-to-debug crashes. For example, when kcov is enabled, the kcov instrumentation attempts to read preempt_count out of the current task, which goes via the paca. This then crashes in the zImage case. Similarly stack protector can cause crashes if r13 is bogus, by reading from the stack canary in the paca. To resolve this: - move the paca setup to before the CPU feature parsing. - because we no longer have access to CPU feature flags in paca setup, change the HV feature test in the paca setup path to consider the actual value of the MSR rather than the CPU feature. Translations get switched on once we leave early_setup, so I think we'd already catch any other cases where the paca or task aren't set up. Boot tested on a P9 guest and host. Fixes: fb0b0a73b223 ("powerpc: Enable kcov") Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support") Cc: sta...@vger.kernel.org # v4.20+ Reviewed-by: Andrew Donnellan Suggested-by: Michael Ellerman Signed-off-by: Daniel Axtens [mpe: Reword comments & change log a bit to mention stack protector] Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/dt_cpu_ftrs.c | 1 - arch/powerpc/kernel/paca.c| 10 +++--- arch/powerpc/kernel/setup_64.c| 30 -- 3 files changed, 31 insertions(+), 10 deletions(-) v5: mpe: Reword comments & change log a bit to mention stack protector] dja: Regarding moving the comment about printk()-safety: I am about 75% sure that the thing that makes printk() safe is the PACA, not the CPU features. That's what commit 24d9649574fb ("[POWERPC] Document when printk is useable") seems to indicate, but as someone wise recently told me, "bootstrapping is hard", so I may be totally wrong. v4: Update commit message and clarify that the mfmsr() approach is not for general use. Thanks Nick Piggin. v3: Update comment, thanks Christophe Leroy. Remove a comment in dt_cpu_ftrs.c that is no longer accurate - thanks Andrew. I think we want to retain all the code still, but I'm open to being told otherwise. diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c index 182b4047c1ef..36bc0d5c4f3a 100644 --- a/arch/powerpc/kernel/dt_cpu_ftrs.c +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c @@ -139,7 +139,6 @@ static void __init cpufeatures_setup_cpu(void) /* Initialize the base environment -- clear FSCR/HFSCR. */ hv_mode = !!(mfmsr() & MSR_HV); if (hv_mode) { - /* CPU_FTR_HVMODE is used early in PACA setup */ cur_cpu_spec->cpu_features |= CPU_FTR_HVMODE; mtspr(SPRN_HFSCR, 0); } diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 949eceb254d8..0ee6308541b1 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -214,11 +214,15 @@ void setup_paca(struct paca_struct *new_paca) /* On Book3E, initialize the TLB miss exception frames */ mtspr(SPRN_SPRG_TLB_EXFRAME, local_paca->extlb); #else - /* In HV mode, we setup both HPACA and PACA to avoid problems + /* +* In HV mode, we setup both HPACA and PACA to avoid problems * if we do a GET_PACA() before the feature fixups have been -* applied +* applied. 
+* +* Normally you should test against CPU_FTR_HVMODE, but CPU features +* are not yet set up when we first reach here. */ - if (early_cpu_has_feature(CPU_FTR_HVMODE)) + if (mfmsr() & MSR_HV) mtspr(SPRN_SPRG_HPACA, local_paca); #endif mtspr(SPRN_SPRG_PACA, local_paca); diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index e05e6dd67ae6..17886d147dd0 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -285,18 +285,36 @@ void __init early_setup(unsigned long dt_ptr) /* printk is _NOT_ safe to use here ! --- */ - /* Try new device tree based feature discovery ... */ - if (!dt_cpu_ftrs_init(__va(dt_ptr))) - /* Otherwise use the old style CPU table */ - identify_cpu(0, mfspr(SPRN_PVR)); - - /* Assume we're on cpu 0 for now. Don't write to the paca yet! */ + /* +* Assume we're on cpu 0 for now. +* +* We need to load a PACA very early for a few reasons. +* +* The stack protector canary is stored in the paca, so as soon as we +* call any stack protected code we need r13 pointing somew
[PATCH v2] ocxl: control via sysfs whether the FPGA is reloaded on a link reset
Some opencapi FPGA images make it possible to control whether the FPGA should be reloaded on the next adapter reset. If it is supported, the image specifies it through a Vendor Specific DVSEC in the config space of function 0. Signed-off-by: Philippe Bergheaud --- Changelog: v2: - refine ResetReload debug message - do not call get_function_0() if pci_dev is for function 0 Documentation/ABI/testing/sysfs-class-ocxl | 10 drivers/misc/ocxl/config.c | 64 +- drivers/misc/ocxl/ocxl_internal.h | 6 ++ drivers/misc/ocxl/sysfs.c | 35 include/misc/ocxl-config.h | 1 + 5 files changed, 115 insertions(+), 1 deletion(-) diff --git a/Documentation/ABI/testing/sysfs-class-ocxl b/Documentation/ABI/testing/sysfs-class-ocxl index b5b1fa197592..b9ea671d5805 100644 --- a/Documentation/ABI/testing/sysfs-class-ocxl +++ b/Documentation/ABI/testing/sysfs-class-ocxl @@ -33,3 +33,13 @@ Date: January 2018 Contact: linuxppc-dev@lists.ozlabs.org Description: read/write Give access the global mmio area for the AFU + +What: /sys/class/ocxl//reload_on_reset +Date: February 2020 +Contact: linuxppc-dev@lists.ozlabs.org +Description: read/write + Control whether the FPGA is reloaded on a link reset + 0 Do not reload FPGA image from flash + 1 Reload FPGA image from flash + unavailable + The device does not support this capability diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index c8e19bfb5ef9..05500fdece7e 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -71,6 +71,20 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx) return 0; } +/** + * get_function_0() - Find a related PCI device (function 0) + * @device: PCI device to match + * + * Returns a pointer to the related device, or null if not found + */ +static struct pci_dev *get_function_0(struct pci_dev *dev) +{ + unsigned int devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0); + + return pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus), + dev->bus->number, devfn); +} + static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn) { u16 val; @@ -159,7 +173,7 @@ static int read_dvsec_afu_info(struct pci_dev *dev, struct ocxl_fn_config *fn) static int read_dvsec_vendor(struct pci_dev *dev) { int pos; - u32 cfg, tlx, dlx; + u32 cfg, tlx, dlx, reset_reload; /* * vendor specific DVSEC is optional @@ -183,6 +197,54 @@ static int read_dvsec_vendor(struct pci_dev *dev) dev_dbg(&dev->dev, " CFG version = 0x%x\n", cfg); dev_dbg(&dev->dev, " TLX version = 0x%x\n", tlx); dev_dbg(&dev->dev, " DLX version = 0x%x\n", dlx); + if (ocxl_config_get_reset_reload(dev, &reset_reload) != 0) + dev_dbg(&dev->dev, " ResetReload is not available\n"); + else + dev_dbg(&dev->dev, " ResetReload = 0x%x\n", reset_reload); + return 0; +} + +int ocxl_config_get_reset_reload(struct pci_dev *dev, int *val) +{ + int reset_reload = -1; + int pos = 0; + struct pci_dev *dev0 = dev; + + if (PCI_FUNC(dev->devfn) != 0) + dev0 = get_function_0(dev); + + if (dev0) + pos = find_dvsec(dev0, OCXL_DVSEC_VENDOR_ID); + + if (pos) + pci_read_config_dword(dev0, + pos + OCXL_DVSEC_VENDOR_RESET_RELOAD, + &reset_reload); + if (reset_reload == -1) + return reset_reload; + + *val = reset_reload & BIT(0); + return 0; +} + +int ocxl_config_set_reset_reload(struct pci_dev *dev, int val) +{ + int reset_reload = -1; + int pos = 0; + struct pci_dev *dev0 = get_function_0(dev); + + if (dev0) + pos = find_dvsec(dev0, OCXL_DVSEC_VENDOR_ID); + + if (pos) + pci_read_config_dword(dev0, + pos + OCXL_DVSEC_VENDOR_RESET_RELOAD, + &reset_reload); + if (reset_reload == -1) + return reset_reload; + 
+ val &= BIT(0); + pci_write_config_dword(dev0, pos + OCXL_DVSEC_VENDOR_RESET_RELOAD, val); return 0; } diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h index 345bf843a38e..af9a84aeee6f 100644 --- a/drivers/misc/ocxl/ocxl_internal.h +++ b/drivers/misc/ocxl/ocxl_internal.h @@ -112,6 +112,12 @@ void ocxl_actag_afu_free(struct ocxl_fn *fn, u32 start, u32 size); */ int ocxl_config_get_pasid_info(struct pci_dev *dev, int *count); +/* + * Control whether the FPGA is reloaded on a link reset + */ +int ocxl_config_get_reset_reload(struct pci_dev *dev, int
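Driving the new attribute from userspace is straightforward; a hedged example (the device name "afu0.0" is a placeholder, and per the ABI document above a read may also return "unavailable" when the image lacks the vendor DVSEC):

#include <stdio.h>

static int set_reload_on_reset(const char *afu, int val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/ocxl/%s/reload_on_reset", afu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* 0: keep the current image, 1: reload the FPGA image from flash */
	fprintf(f, "%d\n", !!val);
	return fclose(f);
}

int main(void)
{
	return set_reload_on_reset("afu0.0", 1);
}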
[PATCH v1 44/46] powerpc/8xx: Implement dedicated kasan_init_region()
Implement a kasan_init_region() dedicated to 8xx that allocates KASAN regions using huge pages. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/8xx.c| 74 ++ arch/powerpc/mm/kasan/Makefile | 1 + 2 files changed, 75 insertions(+) create mode 100644 arch/powerpc/mm/kasan/8xx.c diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c new file mode 100644 index ..db4ef44af22f --- /dev/null +++ b/arch/powerpc/mm/kasan/8xx.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define DISABLE_BRANCH_PROFILING + +#include +#include +#include +#include + +static int __init +kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block) +{ + pmd_t *pmd = pmd_ptr_k(k_start); + unsigned long k_cur, k_next; + + for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { + pte_basic_t *new; + + k_next = pgd_addr_end(k_cur, k_end); + k_next = pgd_addr_end(k_next, k_end); + if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) + continue; + + new = memblock_alloc(sizeof(pte_basic_t), SZ_4K); + if (!new) + return -ENOMEM; + + *new = pte_val(pte_mkhuge(pfn_pte(PHYS_PFN(__pa(block)), PAGE_KERNEL))); + + hugepd_populate_kernel((hugepd_t *)pmd, (pte_t *)new, PAGE_SHIFT_8M); + hugepd_populate_kernel((hugepd_t *)pmd + 1, (pte_t *)new, PAGE_SHIFT_8M); + } + return 0; +} + +int __init kasan_init_region(void *start, size_t size) +{ + unsigned long k_start = (unsigned long)kasan_mem_to_shadow(start); + unsigned long k_end = (unsigned long)kasan_mem_to_shadow(start + size); + unsigned long k_cur; + int ret; + void *block; + + block = memblock_alloc(k_end - k_start, SZ_8M); + if (!block) + return -ENOMEM; + + if (IS_ALIGNED(k_start, SZ_8M)) { + kasan_init_shadow_8M(k_start, ALIGN_DOWN(k_end, SZ_8M), block); + k_cur = ALIGN_DOWN(k_end, SZ_8M); + if (k_cur == k_end) + goto finish; + } else { + k_cur = k_start; + } + + ret = kasan_init_shadow_page_tables(k_start, k_end); + if (ret) + return ret; + + for (; k_cur < k_end; k_cur += PAGE_SIZE) { + pmd_t *pmd = pmd_ptr_k(k_cur); + void *va = block + k_cur - k_start; + pte_t pte = pfn_pte(PHYS_PFN(__pa(va)), PAGE_KERNEL); + + if (k_cur < ALIGN_DOWN(k_end, SZ_512K)) + pte = pte_mkhuge(pte); + + __set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0); + } +finish: + flush_tlb_kernel_range(k_start, k_end); + return 0; +} diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile index 6577897673dd..440038ea79f1 100644 --- a/arch/powerpc/mm/kasan/Makefile +++ b/arch/powerpc/mm/kasan/Makefile @@ -3,3 +3,4 @@ KASAN_SANITIZE := n obj-$(CONFIG_PPC32) += kasan_init_32.o +obj-$(CONFIG_PPC_8xx) += 8xx.o -- 2.25.0
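The sizing in this patch follows from the generic KASAN shadow arithmetic: one shadow byte covers 8 bytes of memory (KASAN_SHADOW_SCALE_SHIFT is 3), so one 8M huge page of shadow covers 64M of linear memory. A back-of-the-envelope sketch of kasan_mem_to_shadow() (the offset and PAGE_OFFSET values below are placeholders, not the real ppc32 ones):

#include <stdio.h>

#define KASAN_SHADOW_SCALE_SHIFT	3
#define KASAN_SHADOW_OFFSET		0xf8000000UL	/* placeholder */

static unsigned long mem_to_shadow(unsigned long addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

int main(void)
{
	unsigned long start = 0xc0000000UL;	/* assumed PAGE_OFFSET */
	unsigned long size = 512UL << 20;	/* 512M of lowmem */

	printf("shadow: %#lx-%#lx (%lu MB)\n",
	       mem_to_shadow(start), mem_to_shadow(start + size),
	       (size >> KASAN_SHADOW_SCALE_SHIFT) >> 20);
	return 0;
}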
[PATCH v1 46/46] powerpc/32s: Implement dedicated kasan_init_region()
Implement a kasan_init_region() dedicated to book3s/32 that allocates KASAN regions using BATs. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/kasan.h | 1 + arch/powerpc/mm/kasan/Makefile| 1 + arch/powerpc/mm/kasan/book3s_32.c | 57 +++ arch/powerpc/mm/kasan/kasan_init_32.c | 2 +- 4 files changed, 60 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/mm/kasan/book3s_32.c diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index 107a24c3f7b3..be85c7005fb1 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -34,6 +34,7 @@ static inline void kasan_init(void) { } static inline void kasan_late_init(void) { } #endif +void kasan_update_early_region(unsigned long k_start, unsigned long k_end, pte_t pte); int kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end); int kasan_init_region(void *start, size_t size); diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile index 440038ea79f1..bb1a5408b86b 100644 --- a/arch/powerpc/mm/kasan/Makefile +++ b/arch/powerpc/mm/kasan/Makefile @@ -4,3 +4,4 @@ KASAN_SANITIZE := n obj-$(CONFIG_PPC32) += kasan_init_32.o obj-$(CONFIG_PPC_8xx) += 8xx.o +obj-$(CONFIG_PPC_BOOK3S_32)+= book3s_32.o diff --git a/arch/powerpc/mm/kasan/book3s_32.c b/arch/powerpc/mm/kasan/book3s_32.c new file mode 100644 index ..4bc491a4a1fd --- /dev/null +++ b/arch/powerpc/mm/kasan/book3s_32.c @@ -0,0 +1,57 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define DISABLE_BRANCH_PROFILING + +#include +#include +#include +#include + +int __init kasan_init_region(void *start, size_t size) +{ + unsigned long k_start = (unsigned long)kasan_mem_to_shadow(start); + unsigned long k_end = (unsigned long)kasan_mem_to_shadow(start + size); + unsigned long k_cur = k_start; + int k_size = k_end - k_start; + int k_size_base = 1 << (ffs(k_size) - 1); + int ret; + void *block; + + block = memblock_alloc(k_size, k_size_base); + + if (block && k_size_base >= SZ_128K && k_start == ALIGN(k_start, k_size_base)) { + int k_size_more = 1 << (ffs(k_size - k_size_base) - 1); + + setbat(-1, k_start, __pa(block), k_size_base, PAGE_KERNEL); + if (k_size_more >= SZ_128K) + setbat(-1, k_start + k_size_base, __pa(block) + k_size_base, + k_size_more, PAGE_KERNEL); + if (v_block_mapped(k_start)) + k_cur = k_start + k_size_base; + if (v_block_mapped(k_start + k_size_base)) + k_cur = k_start + k_size_base + k_size_more; + + update_bats(); + } + + if (!block) + block = memblock_alloc(k_size, PAGE_SIZE); + if (!block) + return -ENOMEM; + + ret = kasan_init_shadow_page_tables(k_start, k_end); + if (ret) + return ret; + + kasan_update_early_region(k_start, k_cur, __pte(0)); + + for (; k_cur < k_end; k_cur += PAGE_SIZE) { + pmd_t *pmd = pmd_ptr_k(k_cur); + void *va = block + k_cur - k_start; + pte_t pte = pfn_pte(PHYS_PFN(__pa(va)), PAGE_KERNEL); + + __set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0); + } + flush_tlb_kernel_range(k_start, k_end); + return 0; +} diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index 03d30ec7a858..e9fda8451718 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -79,7 +79,7 @@ int __init __weak kasan_init_region(void *start, size_t size) return 0; } -static void __init +void __init kasan_update_early_region(unsigned long k_start, unsigned long k_end, pte_t pte) { unsigned long k_cur; -- 2.25.0
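The ffs() arithmetic in this patch splits the shadow region into up to two power-of-two blocks suitable for BATs, smallest binary component first. A userspace rerun of that arithmetic with an arbitrary region size, purely illustrative:

#include <stdio.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	int k_size = 0x180000;	/* say, 1.5M of shadow to map */
	int k_size_base = 1 << (ffs(k_size) - 1);	/* 512K */
	int k_size_more = 1 << (ffs(k_size - k_size_base) - 1);	/* 1M */

	/* both must be >= SZ_128K (and suitably aligned) to be usable as BATs */
	printf("BAT 1: %#x, BAT 2: %#x\n", k_size_base, k_size_more);
	return 0;
}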
[PATCH v1 45/46] powerpc/32s: Allow mapping with BATs with DEBUG_PAGEALLOC
DEBUG_PAGEALLOC only manages RW data. Text and RO data can still be mapped with BATs. In order to map with BATs, also enforce data alignment. Set by default to 256M which is a good compromise for keeping enough BATs for also KASAN and IMMR. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s32/mmu.c | 6 ++ arch/powerpc/mm/init_32.c | 5 ++--- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c086992573d3..44c490e05954 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -793,6 +793,7 @@ config DATA_SHIFT range 17 28 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC) && PPC_BOOK3S_32 range 19 23 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC) && PPC_8xx default 22 if STRICT_KERNEL_RWX && PPC_BOOK3S_32 + default 18 if DEBUG_PAGEALLOC && PPC_BOOK3S_32 default 23 if STRICT_KERNEL_RWX && PPC_8xx default 23 if DEBUG_PAGEALLOC && PPC_8xx && PIN_TLB_DATA default 19 if DEBUG_PAGEALLOC && PPC_8xx diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c index a9b2cbc74797..a6dcc708eee3 100644 --- a/arch/powerpc/mm/book3s32/mmu.c +++ b/arch/powerpc/mm/book3s32/mmu.c @@ -170,6 +170,12 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) pr_debug("RAM mapped without BATs\n"); return base; } + if (debug_pagealloc_enabled()) { + if (base >= border) + return base; + if (top >= border) + top = border; + } if (!strict_kernel_rwx_enabled() || base >= border || top <= border) return __mmu_mapin_ram(base, top); diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c index 8977a7c2543d..36c39bd37256 100644 --- a/arch/powerpc/mm/init_32.c +++ b/arch/powerpc/mm/init_32.c @@ -99,10 +99,9 @@ static void __init MMU_setup(void) if (IS_ENABLED(CONFIG_PPC_8xx)) return; - if (debug_pagealloc_enabled()) { - __map_without_bats = 1; + if (debug_pagealloc_enabled()) __map_without_ltlbs = 1; - } + if (strict_kernel_rwx_enabled()) __map_without_ltlbs = 1; } -- 2.25.0
[PATCH v1 43/46] powerpc/8xx: Allow large TLBs with DEBUG_PAGEALLOC
DEBUG_PAGEALLOC only manages RW data. Text and RO data can still be mapped with hugepages and pinned TLB. In order to map with hugepages, also enforce a 512kB data alignment minimum. That's a trade-off between size and speed, taking into account that DEBUG_PAGEALLOC is a debug option. Anyway the alignment is still tunable. We also allow tuning of alignment for book3s to limit the complexity of the test in Kconfig that will anyway disappear in the following patches once DEBUG_PAGEALLOC is handled together with BATs. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 11 +++ arch/powerpc/mm/init_32.c | 5 - arch/powerpc/mm/nohash/8xx.c | 11 --- arch/powerpc/platforms/8xx/Kconfig | 2 +- 4 files changed, 20 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 305c7b6a9229..c086992573d3 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -777,8 +777,9 @@ config THREAD_SHIFT config DATA_SHIFT_BOOL bool "Set custom data alignment" depends on ADVANCED_OPTIONS - depends on STRICT_KERNEL_RWX - depends on PPC_BOOK3S_32 || (PPC_8xx && !PIN_TLB_DATA && !PIN_TLB_TEXT) + depends on STRICT_KERNEL_RWX || DEBUG_PAGEALLOC + depends on PPC_BOOK3S_32 || (PPC_8xx && !PIN_TLB_DATA && \ + (!PIN_TLB_TEXT || !STRICT_KERNEL_RWX)) help This option allows you to set the kernel data alignment. When RAM is mapped by blocks, the alignment needs to fit the size and @@ -789,10 +790,12 @@ config DATA_SHIFT_BOOL config DATA_SHIFT int "Data shift" if DATA_SHIFT_BOOL default 24 if STRICT_KERNEL_RWX && PPC64 - range 17 28 if STRICT_KERNEL_RWX && PPC_BOOK3S_32 - range 19 23 if STRICT_KERNEL_RWX && PPC_8xx + range 17 28 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC) && PPC_BOOK3S_32 + range 19 23 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC) && PPC_8xx default 22 if STRICT_KERNEL_RWX && PPC_BOOK3S_32 default 23 if STRICT_KERNEL_RWX && PPC_8xx + default 23 if DEBUG_PAGEALLOC && PPC_8xx && PIN_TLB_DATA + default 19 if DEBUG_PAGEALLOC && PPC_8xx default PPC_PAGE_SHIFT help On Book3S 32 (603+), DBATs are used to map kernel text and rodata RO. diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c index a6991ef8727d..8977a7c2543d 100644 --- a/arch/powerpc/mm/init_32.c +++ b/arch/powerpc/mm/init_32.c @@ -96,11 +96,14 @@ static void __init MMU_setup(void) if (strstr(boot_command_line, "noltlbs")) { __map_without_ltlbs = 1; } + if (IS_ENABLED(CONFIG_PPC_8xx)) + return; + if (debug_pagealloc_enabled()) { __map_without_bats = 1; __map_without_ltlbs = 1; } - if (strict_kernel_rwx_enabled() && !IS_ENABLED(CONFIG_PPC_8xx)) + if (strict_kernel_rwx_enabled()) __map_without_ltlbs = 1; } diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 40815eba96f2..a122ec4806d0 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -149,7 +149,8 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) { unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M); unsigned long sinittext = __pa(_sinittext); - unsigned long boundary = strict_kernel_rwx_enabled() ? sinittext : etext8; + bool strict_boundary = strict_kernel_rwx_enabled() || debug_pagealloc_enabled(); + unsigned long boundary = strict_boundary ? 
sinittext : etext8; unsigned long einittext8 = ALIGN(__pa(_einittext), SZ_8M); WARN_ON(top < einittext8); @@ -160,8 +161,12 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) return 0; mmu_mapin_ram_chunk(0, boundary, PAGE_KERNEL_TEXT, true); - mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL_TEXT, true); - mmu_mapin_ram_chunk(einittext8, top, PAGE_KERNEL, true); + if (debug_pagealloc_enabled()) { + top = boundary; + } else { + mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL_TEXT, true); + mmu_mapin_ram_chunk(einittext8, top, PAGE_KERNEL, true); + } if (top > SZ_32M) memblock_set_current_limit(top); diff --git a/arch/powerpc/platforms/8xx/Kconfig b/arch/powerpc/platforms/8xx/Kconfig index 05669f2fadce..abb2b45b2789 100644 --- a/arch/powerpc/platforms/8xx/Kconfig +++ b/arch/powerpc/platforms/8xx/Kconfig @@ -167,7 +167,7 @@ menu "8xx advanced setup" config PIN_TLB bool "Pinned Kernel TLBs" - depends on ADVANCED_OPTIONS && !DEBUG_PAGEALLOC + depends on ADVANCED_OPTIONS help On the 8xx, we have 32 instruction TLBs and 32 data TLBs. In each table 4 TLBs can be pinned. -- 2.25.0
[PATCH v1 42/46] powerpc/8xx: Allow STRICT_KERNEL_RwX with pinned TLB
Pinned TLBs are 8M. Now that there is no strict boundary anymore between text and RO data, it is possible to use 8M pinned executable TLBs that cover both text and RO data. When PIN_TLB_DATA or PIN_TLB_TEXT is selected, enforce 8M RW data alignment and allow STRICT_KERNEL_RWX. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 8 +--- arch/powerpc/mm/nohash/8xx.c | 32 ++ arch/powerpc/platforms/8xx/Kconfig | 2 +- 3 files changed, 38 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 66d02667b43d..305c7b6a9229 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -775,9 +775,10 @@ config THREAD_SHIFT want. Only change this if you know what you are doing. config DATA_SHIFT_BOOL - bool "Set custom data alignment" if STRICT_KERNEL_RWX && \ - (PPC_BOOK3S_32 || PPC_8xx) + bool "Set custom data alignment" depends on ADVANCED_OPTIONS + depends on STRICT_KERNEL_RWX + depends on PPC_BOOK3S_32 || (PPC_8xx && !PIN_TLB_DATA && !PIN_TLB_TEXT) help This option allows you to set the kernel data alignment. When RAM is mapped by blocks, the alignment needs to fit the size and @@ -799,7 +800,8 @@ config DATA_SHIFT On 8xx, large pages (512kb or 8M) are used to map kernel linear memory. Aligning to 8M reduces TLB misses as only 8M pages are used - in that case. + in that case. If PIN_TLB is selected, it must be aligned to 8M as + 8M pages will be pinned. config FORCE_MAX_ZONEORDER int "Maximum zone order" diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index d670f3f82091..40815eba96f2 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -171,6 +171,20 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) return top; } +static void mmu_pin_text(unsigned long boundary) +{ + unsigned long addr = PAGE_OFFSET; + unsigned long twc = MD_SVALID | MD_PS8MEG; + unsigned long rpn = __pa(addr) | 0xf0 | _PAGE_RO | + _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT; + int i; + + for (i = 28; i < 32 && __pa(addr) < boundary; i++, addr += SZ_8M, rpn += SZ_8M) + mpc8xx_update_tlb(0, i, addr | MI_EVALID, twc, rpn); + for (; i < 32; i++) + mpc8xx_update_tlb(0, i, 0, 0, 0); +} + void mmu_mark_initmem_nx(void) { unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M); @@ -180,14 +194,32 @@ void mmu_mark_initmem_nx(void) mmu_mapin_ram_chunk(0, boundary, PAGE_KERNEL_TEXT, false); mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL, false); + + if (IS_ENABLED(CONFIG_PIN_TLB_TEXT)) + mmu_pin_text(boundary); } #ifdef CONFIG_STRICT_KERNEL_RWX +static void mmu_pin_data(unsigned long sinittext) +{ + unsigned long addr = PAGE_OFFSET; + unsigned long twc = MD_SVALID | MD_PS8MEG; + unsigned long rpn = __pa(addr) | 0xf0 | _PAGE_RO | + _PAGE_SPS | _PAGE_SH | _PAGE_PRESENT; + int i; + int max = IS_ENABLED(CONFIG_PIN_TLB_IMMR) ? 
30 : 31; + + for (i = 28; i <= max && __pa(addr) < sinittext; i++, addr += SZ_8M, rpn += SZ_8M) + mpc8xx_update_tlb(0, i, addr | MI_EVALID, twc, rpn); +} + void mmu_mark_rodata_ro(void) { unsigned long sinittext = __pa(_sinittext); mmu_mapin_ram_chunk(0, sinittext, PAGE_KERNEL_ROX, false); + if (IS_ENABLED(CONFIG_PIN_TLB_DATA)) + mmu_pin_data(sinittext); } #endif diff --git a/arch/powerpc/platforms/8xx/Kconfig b/arch/powerpc/platforms/8xx/Kconfig index 04ea1a8a0bdc..05669f2fadce 100644 --- a/arch/powerpc/platforms/8xx/Kconfig +++ b/arch/powerpc/platforms/8xx/Kconfig @@ -167,7 +167,7 @@ menu "8xx advanced setup" config PIN_TLB bool "Pinned Kernel TLBs" - depends on ADVANCED_OPTIONS && !DEBUG_PAGEALLOC && !STRICT_KERNEL_RWX + depends on ADVANCED_OPTIONS && !DEBUG_PAGEALLOC help On the 8xx, we have 32 instruction TLBs and 32 data TLBs. In each table 4 TLBs can be pinned. -- 2.25.0
[PATCH v1 41/46] powerpc/8xx: Map linear memory with huge pages
Map linear memory space with 512k and 8M pages whenever possible. Three mappings are performed: - One for kernel text - One for RO data - One for the rest Separating the mappings is done to be able to update the protection later when using STRICT_KERNEL_RWX. The ITLB miss handler now needs to also handle huge TLBs unless kernel text is pinned. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 4 ++-- arch/powerpc/mm/nohash/8xx.c | 50 +- 2 files changed, 51 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 410cb48ab36f..49b9eee7dd57 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -223,7 +223,7 @@ InstructionTLBMiss: 3: mtcr r11 #endif -#ifdef CONFIG_HUGETLBFS +#if defined(CONFIG_HUGETLBFS) || !defined(CONFIG_PIN_TLB_TEXT) lwz r11, (swapper_pg_dir-PAGE_OFFSET)@l(r10) /* Get level 1 entry */ mtspr SPRN_MD_TWC, r11 #else @@ -233,7 +233,7 @@ InstructionTLBMiss: #endif mfspr r10, SPRN_MD_TWC lwz r10, 0(r10) /* Get the pte */ -#ifdef CONFIG_HUGETLBFS +#if defined(CONFIG_HUGETLBFS) || !defined(CONFIG_PIN_TLB_TEXT) rlwimi r11, r10, 32 - 9, _PMD_PAGE_512K mtspr SPRN_MI_TWC, r11 #endif diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 57e0c7496a6a..d670f3f82091 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -126,20 +126,68 @@ void __init mmu_mapin_immr(void) PAGE_KERNEL_NCG, MMU_PAGE_512K, true); } +static void __init mmu_mapin_ram_chunk(unsigned long offset, unsigned long top, + pgprot_t prot, bool new) +{ + unsigned long v = PAGE_OFFSET + offset; + unsigned long p = offset; + + WARN_ON(!IS_ALIGNED(offset, SZ_512K) || !IS_ALIGNED(top, SZ_512K)); + + for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K) + __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); + for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M) + __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new); + for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += SZ_512K) + __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new); + + if (!new) + flush_tlb_kernel_range(PAGE_OFFSET + v, PAGE_OFFSET + top); +} + unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) { + unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M); + unsigned long sinittext = __pa(_sinittext); + unsigned long boundary = strict_kernel_rwx_enabled() ? sinittext : etext8; + unsigned long einittext8 = ALIGN(__pa(_einittext), SZ_8M); + + WARN_ON(top < einittext8); + mmu_mapin_immr(); - return 0; + if (__map_without_ltlbs) + return 0; + + mmu_mapin_ram_chunk(0, boundary, PAGE_KERNEL_TEXT, true); + mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL_TEXT, true); + mmu_mapin_ram_chunk(einittext8, top, PAGE_KERNEL, true); + + if (top > SZ_32M) + memblock_set_current_limit(top); + + block_mapped_ram = top; + + return top; } void mmu_mark_initmem_nx(void) { + unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M); + unsigned long sinittext = __pa(_sinittext); + unsigned long boundary = strict_kernel_rwx_enabled() ? sinittext : etext8; + unsigned long einittext8 = ALIGN(__pa(_einittext), SZ_8M); + + mmu_mapin_ram_chunk(0, boundary, PAGE_KERNEL_TEXT, false); + mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL, false); } #ifdef CONFIG_STRICT_KERNEL_RWX void mmu_mark_rodata_ro(void) { + unsigned long sinittext = __pa(_sinittext); + + mmu_mapin_ram_chunk(0, sinittext, PAGE_KERNEL_ROX, false); } #endif -- 2.25.0
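The chunking logic of mmu_mapin_ram_chunk() can be rehearsed in userspace: 512k pages up to the first 8M boundary, 8M pages through the aligned middle, then 512k pages for the tail. The constants mirror the patch; the addresses are arbitrary and the output is purely illustrative:

#include <stdio.h>

#define SZ_512K	0x00080000UL
#define SZ_8M	0x00800000UL
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

static void map_chunks(unsigned long p, unsigned long top)
{
	/* 512k steps until p reaches an 8M boundary (or top) */
	for (; p < ALIGN_UP(p, SZ_8M) && p < top; p += SZ_512K)
		printf("512k page at %#010lx\n", p);
	/* 8M steps through the fully aligned middle */
	for (; p < ALIGN_DOWN(top, SZ_8M); p += SZ_8M)
		printf("8M   page at %#010lx\n", p);
	/* 512k steps for the unaligned tail */
	for (; p < ALIGN_DOWN(top, SZ_512K); p += SZ_512K)
		printf("512k page at %#010lx\n", p);
}

int main(void)
{
	map_chunks(0x00080000UL, 0x01b00000UL);	/* 512k .. 27M, arbitrary */
	return 0;
}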
[PATCH v1 40/46] powerpc/8xx: Map IMMR with a huge page
Map the IMMR area with a single 512k huge page. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/nohash/8xx.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 81ddcd9554e1..57e0c7496a6a 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -117,17 +117,13 @@ static bool immr_is_mapped __initdata; void __init mmu_mapin_immr(void) { - unsigned long p = PHYS_IMMR_BASE; - unsigned long v = VIRT_IMMR_BASE; - int offset; - if (immr_is_mapped) return; immr_is_mapped = true; - for (offset = 0; offset < IMMR_SIZE; offset += PAGE_SIZE) - map_kernel_page(v + offset, p + offset, PAGE_KERNEL_NCG); + __early_map_kernel_hugepage(VIRT_IMMR_BASE, PHYS_IMMR_BASE, + PAGE_KERNEL_NCG, MMU_PAGE_512K, true); } unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) -- 2.25.0
[PATCH v1 39/46] powerpc/8xx: Add a function to early map kernel via huge pages
Add a function to early map kernel memory using huge pages. For 512k pages, just use standard page table and map in using 512k pages. For 8M pages, create a hugepd table and populate the two PGD entries with it. This function can only be used to create page tables at startup. Once the regular SLAB allocation functions replace memblock functions, this function cannot allocate new pages anymore. However it can still update existing mappings with new protections. hugepd_none() macro is moved into asm/hugetlb.h to be usable outside of mm/hugetlbpage.c early_pte_alloc_kernel() is made visible. _PAGE_HUGE flag is now displayed by ptdump. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/hugetlb.h| 2 + .../include/asm/nohash/32/hugetlb-8xx.h | 5 ++ arch/powerpc/include/asm/pgtable.h| 2 + arch/powerpc/mm/hugetlbpage.c | 2 - arch/powerpc/mm/nohash/8xx.c | 52 +++ arch/powerpc/mm/pgtable_32.c | 2 +- arch/powerpc/mm/ptdump/8xx.c | 5 ++ arch/powerpc/platforms/Kconfig.cputype| 1 + 8 files changed, 68 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index e4276af034e9..0572c5cd12f2 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -13,6 +13,8 @@ #include #endif /* CONFIG_PPC_BOOK3S_64 */ +#define hugepd_none(hpd) (hpd_val(hpd) == 0) + extern bool hugetlb_disabled; void hugetlbpage_init_default(void); diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index 1c7d4693a78e..e752a5807a59 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -35,6 +35,11 @@ static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshi *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M); } +static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned int pshift) +{ + *hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M); +} + static inline int check_and_get_huge_psize(int shift) { return shift_to_mmu_psize(shift); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index b80bfd41828d..ffddb052068c 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -105,6 +105,8 @@ unsigned long vmalloc_to_phys(void *vmalloc_addr); void pgtable_cache_add(unsigned int shift); +pte_t *early_pte_alloc_kernel(pmd_t *pmdp, unsigned long va); + #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_PPC32) void mark_initmem_nx(void); #else diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 243e90db400c..30d2d05d681d 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -28,8 +28,6 @@ bool hugetlb_disabled = false; -#define hugepd_none(hpd) (hpd_val(hpd) == 0) - #define PTE_T_ORDER(__builtin_ffs(sizeof(pte_basic_t)) - \ __builtin_ffs(sizeof(void *))) diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index d9f205d9a654..81ddcd9554e1 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -9,8 +9,10 @@ #include #include +#include #include #include +#include #include @@ -54,6 +56,56 @@ unsigned long p_block_mapped(phys_addr_t pa) return 0; } +static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va) +{ + if (hugepd_none(*pmdp)) { + pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K); + + if (!ptep) + return NULL; + + hugepd_populate_kernel((hugepd_t 
*)pmdp, ptep, PAGE_SHIFT_8M); + hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M); + } + return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT); +} + +static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa, +pgprot_t prot, int psize, bool new) +{ + pmd_t *pmdp = pmd_ptr_k(va); + pte_t *ptep; + + if (WARN_ON(psize != MMU_PAGE_512K && psize != MMU_PAGE_8M)) + return -EINVAL; + + if (new) { + if (WARN_ON(slab_is_available())) + return -EINVAL; + + if (psize == MMU_PAGE_512K) + ptep = early_pte_alloc_kernel(pmdp, va); + else + ptep = early_hugepd_alloc_kernel((hugepd_t *)pmdp, va); + } else { + if (psize == MMU_PAGE_512K) + ptep = pte_offset_kernel(pmdp, va); + else + ptep = hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT); + } + + if (WARN_ON(!
[PATCH v1 38/46] powerpc/8xx: Refactor kernel address boundary comparison
Now that linear and IMMR dedicated TLB handling is gone, kernel boundary address comparison is similar in ITLB miss handler and in DTLB miss handler. Create a macro named compare_to_kernel_boundary. When TASK_SIZE is strictly below 0x80000000 and PAGE_OFFSET is above 0x80000000, it is enough to compare to 0x80000000, and this can be done with a single instruction. Using the not. instruction, we get to use the 'blt' conditional branch as when doing a regular comparison: 0x00000000 <= addr <= 0x7fffffff ==> 0xffffffff >= NOT(addr) >= 0x80000000 The above test corresponds to a 'blt'. Otherwise, do a regular comparison using two instructions. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index d12f5846d527..410cb48ab36f 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -31,10 +31,15 @@ #include "head_32.h" +.macro compare_to_kernel_boundary scratch, addr #if CONFIG_TASK_SIZE <= 0x80000000 && CONFIG_PAGE_OFFSET >= 0x80000000 /* By simply checking Address >= 0x80000000, we know if its a kernel address */ -#define SIMPLE_KERNEL_ADDRESS 1 + not. \scratch, \addr +#else + rlwinm \scratch, \addr, 16, 0xfff8 + cmpli cr0, \scratch, PAGE_OFFSET@h #endif +.endm /* * We need an ITLB miss handler for kernel addresses if: @@ -208,20 +213,11 @@ InstructionTLBMiss: mtspr SPRN_MD_EPN, r10 #ifdef ITLB_MISS_KERNEL mfcr r11 -#if defined(SIMPLE_KERNEL_ADDRESS) - cmpi cr0, r10, 0 /* Address >= 0x80000000 */ -#else - rlwinm r10, r10, 16, 0xfff8 - cmpli cr0, r10, PAGE_OFFSET@h -#endif + compare_to_kernel_boundary r10, r10 #endif mfspr r10, SPRN_M_TWB /* Get level 1 table */ #ifdef ITLB_MISS_KERNEL -#if defined(SIMPLE_KERNEL_ADDRESS) - bge+ 3f -#else blt+ 3f -#endif rlwinm r10, r10, 0, 20, 31 oris r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha 3: @@ -287,9 +283,7 @@ DataStoreTLBMiss: * kernel page tables. */ mfspr r10, SPRN_MD_EPN - rlwinm r10, r10, 16, 0xfff8 - cmpli cr0, r10, PAGE_OFFSET@h - + compare_to_kernel_boundary r10, r10 mfspr r10, SPRN_M_TWB /* Get level 1 table */ blt+ 3f rlwinm r10, r10, 0, 20, 31 -- 2.25.0
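A quick userspace check of the identity used here: for 32-bit addresses, addr <= 0x7fffffff is equivalent to ~addr >= 0x80000000, i.e. ~addr is negative as a signed value, which is exactly the CR0 condition that not. records and that blt then branches on for user addresses:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* sample the boundary plus a few arbitrary points */
	uint32_t samples[] = { 0x0, 0x7fffffff, 0x80000000, 0xc0000000, 0xffffffff };

	for (int i = 0; i < 5; i++) {
		uint32_t addr = samples[i];
		int is_user = addr <= 0x7fffffffU;	/* two-instruction test */
		int not_test = ~addr >= 0x80000000U;	/* single "not." test */

		assert(is_user == not_test);
		printf("%#010x -> %s\n", addr, is_user ? "user" : "kernel");
	}
	return 0;
}

The design point is that the record form (not.) folds the comparison into the complement itself, saving one instruction in a hot TLB miss path.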
[PATCH v1 37/46] powerpc/mm: Don't be too strict with _etext alignment on PPC32
Similar to PPC64, accept mapping RO data as ROX as a trade-off between security and memory usage. Having RO data executable is not a high risk as RO data can't be modified to forge an exploit. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 26 -- arch/powerpc/kernel/vmlinux.lds.S | 3 +-- 2 files changed, 1 insertion(+), 28 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index f3ea52bcbaf8..66d02667b43d 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -774,32 +774,6 @@ config THREAD_SHIFT Used to define the stack size. The default is almost always what you want. Only change this if you know what you are doing. -config ETEXT_SHIFT_BOOL - bool "Set custom etext alignment" if STRICT_KERNEL_RWX && \ -(PPC_BOOK3S_32 || PPC_8xx) - depends on ADVANCED_OPTIONS - help - This option allows you to set the kernel end of text alignment. When - RAM is mapped by blocks, the alignment needs to fit the size and - number of possible blocks. The default should be OK for most configs. - - Say N here unless you know what you are doing. - -config ETEXT_SHIFT - int "_etext shift" if ETEXT_SHIFT_BOOL - range 17 28 if STRICT_KERNEL_RWX && PPC_BOOK3S_32 - range 19 23 if STRICT_KERNEL_RWX && PPC_8xx - default 17 if STRICT_KERNEL_RWX && PPC_BOOK3S_32 - default 19 if STRICT_KERNEL_RWX && PPC_8xx - default PPC_PAGE_SHIFT - help - On Book3S 32 (603+), IBATs are used to map kernel text. - Smaller is the alignment, greater is the number of necessary IBATs. - - On 8xx, large pages (512kb or 8M) are used to map kernel linear - memory. Aligning to 8M reduces TLB misses as only 8M pages are used - in that case. - config DATA_SHIFT_BOOL bool "Set custom data alignment" if STRICT_KERNEL_RWX && \ (PPC_BOOK3S_32 || PPC_8xx) diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index a32d478a7f41..41bc64880ca1 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -15,7 +15,6 @@ #include #define STRICT_ALIGN_SIZE (1 << CONFIG_DATA_SHIFT) -#define ETEXT_ALIGN_SIZE (1 << CONFIG_ETEXT_SHIFT) ENTRY(_stext) @@ -116,7 +115,7 @@ SECTIONS } :text - . = ALIGN(ETEXT_ALIGN_SIZE); + . = ALIGN(PAGE_SIZE); _etext = .; PROVIDE32 (etext = .); -- 2.25.0
[PATCH v1 33/46] powerpc/8xx: Always pin TLBs at startup.
At startup, map 32 Mbytes of memory through 4 pages of 8M, and pin them unconditionally. They need to be pinned because KASAN is using page tables early and the TLBs might be dynamically replaced otherwise. Remove the RSV4I flag after installing mappings unless CONFIG_PIN_TLB_ is selected. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 31 +-- arch/powerpc/mm/nohash/8xx.c | 19 +-- 2 files changed, 18 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 6a1a74e9b011..7ed866e83545 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -761,6 +761,14 @@ start_here: ori r0, r0, 0xf0 | _PAGE_DIRTY | _PAGE_SPS | _PAGE_SH | \ _PAGE_NO_CACHE | _PAGE_PRESENT mtspr SPRN_MD_RPN, r0 +#endif +#ifndef CONFIG_PIN_TLB_TEXT + li r0, 0 + mtspr SPRN_MI_CTR, r0 +#endif +#if !defined(CONFIG_PIN_TLB_DATA) && !defined(CONFIG_PIN_TLB_IMMR) + lis r0, MD_TWAM@h + mtspr SPRN_MD_CTR, r0 #endif tlbia /* Clear all TLB entries */ sync /* wait for tlbia/tlbie to finish */ @@ -798,10 +806,6 @@ initial_mmu: mtspr SPRN_MD_CTR, r10 /* remove PINNED DTLB entries */ tlbia /* Invalidate all TLB entries */ -#ifdef CONFIG_PIN_TLB_DATA - oris r10, r10, MD_RSV4I@h - mtspr SPRN_MD_CTR, r10 /* Set data TLB control */ -#endif lis r8, MI_APG_INIT@h /* Set protection modes */ ori r8, r8, MI_APG_INIT@l @@ -810,33 +814,32 @@ initial_mmu: ori r8, r8, MD_APG_INIT@l mtspr SPRN_MD_AP, r8 - /* Now map the lower RAM (up to 32 Mbytes) into the ITLB. */ -#ifdef CONFIG_PIN_TLB_TEXT + /* Map the lower RAM (up to 32 Mbytes) into the ITLB and DTLB */ lis r8, MI_RSV4I@h ori r8, r8, 0x1c00 -#endif + oris r12, r10, MD_RSV4I@h + ori r12, r12, 0x1c00 li r9, 4 /* up to 4 pages of 8M */ mtctr r9 lis r9, KERNELBASE@h /* Create vaddr for TLB */ li r10, MI_PS8MEG | MI_SVALID /* Set 8M byte page */ li r11, MI_BOOTINIT /* Create RPN for address 0 */ - lis r12, _einittext@h - ori r12, r12, _einittext@l 1: -#ifdef CONFIG_PIN_TLB_TEXT mtspr SPRN_MI_CTR, r8 /* Set instruction MMU control */ addi r8, r8, 0x100 -#endif - ori r0, r9, MI_EVALID /* Mark it valid */ mtspr SPRN_MI_EPN, r0 mtspr SPRN_MI_TWC, r10 mtspr SPRN_MI_RPN, r11 /* Store TLB entry */ + mtspr SPRN_MD_CTR, r12 + addi r12, r12, 0x100 + mtspr SPRN_MD_EPN, r0 + mtspr SPRN_MD_TWC, r10 + mtspr SPRN_MD_RPN, r11 addis r9, r9, 0x80 addis r11, r11, 0x80 - cmpl cr0, r9, r12 - bdnzf gt, 1b + bdnz 1b /* Since the cache is enabled according to the information we * just loaded into the TLB, invalidate and enable the caches here. 
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 2a55904986c6..0956bc92b19c 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -61,23 +61,6 @@ unsigned long p_block_mapped(phys_addr_t pa) */ void __init MMU_init_hw(void) { - /* PIN up to the 3 first 8Mb after IMMR in DTLB table */ - if (IS_ENABLED(CONFIG_PIN_TLB_DATA)) { - unsigned long ctr = mfspr(SPRN_MD_CTR) & 0xfe000000; - unsigned long flags = 0xf0 | MD_SPS16K | _PAGE_SH | _PAGE_DIRTY; - int i = 28; - unsigned long addr = 0; - unsigned long mem = total_lowmem; - - for (; i < 32 && mem >= LARGE_PAGE_SIZE_8M; i++) { - mtspr(SPRN_MD_CTR, ctr | (i << 8)); - mtspr(SPRN_MD_EPN, (unsigned long)__va(addr) | MD_EVALID); - mtspr(SPRN_MD_TWC, MD_PS8MEG | MD_SVALID); - mtspr(SPRN_MD_RPN, addr | flags | _PAGE_PRESENT); - addr += LARGE_PAGE_SIZE_8M; - mem -= LARGE_PAGE_SIZE_8M; - } - } } static bool immr_is_mapped __initdata; @@ -222,7 +205,7 @@ void __init setup_initial_memory_limit(phys_addr_t first_memblock_base, BUG_ON(first_memblock_base != 0); /* 8xx can only access 32MB at the moment */ - memblock_set_current_limit(min_t(u64, first_memblock_size, 0x02000000)); + memblock_set_current_limit(min_t(u64, first_memblock_size, SZ_32M)); } /* -- 2.25.0
[PATCH v1 36/46] powerpc/8xx: Move DTLB perf handling closer.
Now that space has been freed next to the DTLB miss handler, its associated DTLB perf handling can be brought back in the same place. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 7fd7f7af1ac6..d12f5846d527 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -343,6 +343,17 @@ DataStoreTLBMiss: rfi patch_site 0b, patch__dtlbmiss_exit_1 +#ifdef CONFIG_PERF_EVENTS + patch_site 0f, patch__dtlbmiss_perf +0: lwz r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0) + addi r10, r10, 1 + stw r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0) + mfspr r10, SPRN_DAR + mtspr SPRN_DAR, r11 /* Tag DAR */ + mfspr r11, SPRN_M_TW + rfi +#endif + /* This is an instruction TLB error on the MPC8xx. This could be due * to many reasons, such as executing guarded memory or illegal instruction * addresses. There is nothing to do but handle a big time error fault. @@ -389,18 +400,6 @@ DARFixed:/* Return from dcbx instruction bug workaround */ /* 0x300 is DataAccess exception, needed by bad_page_fault() */ EXC_XFER_LITE(0x300, handle_page_fault) -/* Called from DataStoreTLBMiss when perf TLB misses events are activated */ -#ifdef CONFIG_PERF_EVENTS - patch_site 0f, patch__dtlbmiss_perf -0: lwz r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0) - addi r10, r10, 1 - stw r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0) - mfspr r10, SPRN_DAR - mtspr SPRN_DAR, r11 /* Tag DAR */ - mfspr r11, SPRN_M_TW - rfi -#endif - stack_overflow: vmap_stack_overflow_exception -- 2.25.0
[PATCH v1 34/46] powerpc/8xx: Drop special handling of Linear and IMMR mappings in I/D TLB handlers
Up to now, linear and IMMR mappings are managed via huge TLB entries through specific code directly in TLB miss handlers. This implies some patching of the TLB miss handlers at startup, and a lot of dedicated code. Remove all this specific dedicated code. For now we are back to normal handling via standard 4k pages. In the next patches, linear memory mapping and IMMR mapping will be managed through huge pages.

Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 29 +- arch/powerpc/mm/nohash/8xx.c | 103 + 2 files changed, 3 insertions(+), 129 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 7ed866e83545..df2874a0fd13 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -206,31 +206,21 @@ InstructionTLBMiss: mfspr r10, SPRN_SRR0 /* Get effective address of fault */ INVALIDATE_ADJACENT_PAGES_CPU15(r10) mtspr SPRN_MD_EPN, r10 - /* Only modules will cause ITLB Misses as we always -* pin the first 8MB of kernel memory */ #ifdef ITLB_MISS_KERNEL mfcr r11 -#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) +#if defined(SIMPLE_KERNEL_ADDRESS) cmpi cr0, r10, 0 /* Address >= 0x80000000 */ #else rlwinm r10, r10, 16, 0xfff80000 cmpli cr0, r10, PAGE_OFFSET@h -#ifndef CONFIG_PIN_TLB_TEXT - /* It is assumed that kernel code fits into the first 32M */ -0: cmpli cr7, r10, (PAGE_OFFSET + 0x2000000)@h - patch_site 0b, patch__itlbmiss_linmem_top -#endif #endif #endif mfspr r10, SPRN_M_TWB /* Get level 1 table */ #ifdef ITLB_MISS_KERNEL -#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) +#if defined(SIMPLE_KERNEL_ADDRESS) bge+ 3f #else blt+ 3f -#endif -#ifndef CONFIG_PIN_TLB_TEXT - blt cr7, ITLBMissLinear #endif rlwinm r10, r10, 0, 20, 31 oris r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha @@ -326,19 +316,9 @@ DataStoreTLBMiss: mfspr r10, SPRN_MD_EPN rlwinm r10, r10, 16, 0xfff80000 cmpli cr0, r10, PAGE_OFFSET@h -#ifndef CONFIG_PIN_TLB_IMMR - cmpli cr6, r10, VIRT_IMMR_BASE@h -#endif -0: cmpli cr7, r10, (PAGE_OFFSET + 0x2000000)@h - patch_site 0b, patch__dtlbmiss_linmem_top mfspr r10, SPRN_M_TWB /* Get level 1 table */ blt+ 3f -#ifndef CONFIG_PIN_TLB_IMMR -0: beq- cr6, DTLBMissIMMR - patch_site 0b, patch__dtlbmiss_immr_jmp -#endif - blt cr7, DTLBMissLinear rlwinm r10, r10, 0, 20, 31 oris r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha 3: @@ -570,14 +550,9 @@ FixupDAR: /* Entry point for dcbx workaround. */ cmpli cr1, r11, PAGE_OFFSET@h mfspr r11, SPRN_M_TWB /* Get level 1 table */ blt+ cr1, 3f - rlwinm r11, r10, 16, 0xfff80000 - -0: cmpli cr7, r11, (PAGE_OFFSET + 0x1800000)@h - patch_site 0b, patch__fixupdar_linmem_top /* create physical page address from effective address */ tophys(r11, r10) - blt- cr7, 201f mfspr r11, SPRN_M_TWB /* Get level 1 table */ rlwinm r11, r11, 0, 20, 31 oris r11, r11, (swapper_pg_dir - PAGE_OFFSET)@ha diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 0956bc92b19c..d9f205d9a654 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -54,8 +54,6 @@ unsigned long p_block_mapped(phys_addr_t pa) return 0; } -#define LARGE_PAGE_SIZE_8M (1<<23) - /* * MMU_init_hw does the chip-specific initialization of the MMU hardware.
*/ @@ -80,119 +78,20 @@ void __init mmu_mapin_immr(void) map_kernel_page(v + offset, p + offset, PAGE_KERNEL_NCG); } -static void mmu_patch_cmp_limit(s32 *site, unsigned long mapped) -{ - modify_instruction_site(site, 0xffff, (unsigned long)__va(mapped) >> 16); -} - -static void mmu_patch_addis(s32 *site, long simm) -{ - unsigned int instr = *(unsigned int *)patch_site_addr(site); - - instr &= 0xffff0000; - instr |= ((unsigned long)simm) >> 16; - patch_instruction_site(site, instr); -} - -static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top, pgprot_t prot) -{ - unsigned long s = offset; - unsigned long v = PAGE_OFFSET + s; - phys_addr_t p = memstart_addr + s; - - for (; s < top; s += PAGE_SIZE) { - map_kernel_page(v, p, prot); - v += PAGE_SIZE; - p += PAGE_SIZE; - } -} - unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) { - unsigned long mapped; - mmu_mapin_immr(); - if (__map_without_ltlbs) { - mapped = 0; - if (!IS_ENABLED(CONFIG_PIN_TLB_IMMR)) - patch_instruction_site(&patch__dtlbmiss_immr_jmp, PPC_INST_NOP); -
[PATCH v1 35/46] powerpc/8xx: Remove now unused TLB miss functions
The code to set up the linear and IMMR mappings via huge TLB entries is not called anymore. Remove it. Also remove the handling of the removed code's exits in the perf driver.

Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 8 +- arch/powerpc/kernel/head_8xx.S | 83 arch/powerpc/perf/8xx-pmu.c | 10 --- 3 files changed, 1 insertion(+), 100 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 794bce83c5b0..6ab85e20559f 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -241,13 +241,7 @@ static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize) } /* patch sites */ -extern s32 patch__itlbmiss_linmem_top, patch__itlbmiss_linmem_top8; -extern s32 patch__dtlbmiss_linmem_top, patch__dtlbmiss_immr_jmp; -extern s32 patch__fixupdar_linmem_top; -extern s32 patch__dtlbmiss_romem_top, patch__dtlbmiss_romem_top8; - -extern s32 patch__itlbmiss_exit_1, patch__itlbmiss_exit_2; -extern s32 patch__dtlbmiss_exit_1, patch__dtlbmiss_exit_2, patch__dtlbmiss_exit_3; +extern s32 patch__itlbmiss_exit_1, patch__dtlbmiss_exit_1; extern s32 patch__itlbmiss_perf, patch__dtlbmiss_perf; #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index df2874a0fd13..7fd7f7af1ac6 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -277,33 +277,6 @@ InstructionTLBMiss: rfi #endif -#ifndef CONFIG_PIN_TLB_TEXT -ITLBMissLinear: - mtcr r11 -#if defined(CONFIG_STRICT_KERNEL_RWX) && CONFIG_ETEXT_SHIFT < 23 - patch_site 0f, patch__itlbmiss_linmem_top8 - - mfspr r10, SPRN_SRR0 -0: subis r11, r10, (PAGE_OFFSET - 0x80000000)@ha - rlwinm r11, r11, 4, MI_PS8MEG ^ MI_PS512K - ori r11, r11, MI_PS512K | MI_SVALID - rlwinm r10, r10, 0, 0x0ff80000 /* 8xx supports max 256Mb RAM */ -#else - /* Set 8M byte page and mark it valid */ - li r11, MI_PS8MEG | MI_SVALID - rlwinm r10, r10, 20, 0x0f800000 /* 8xx supports max 256Mb RAM */ -#endif - mtspr SPRN_MI_TWC, r11 - ori r10, r10, 0xf0 | MI_SPS16K | _PAGE_SH | _PAGE_DIRTY | \ - _PAGE_PRESENT - mtspr SPRN_MI_RPN, r10 /* Update TLB entry */ - -0: mfspr r10, SPRN_SPRG_SCRATCH0 - mfspr r11, SPRN_SPRG_SCRATCH1 - rfi - patch_site 0b, patch__itlbmiss_exit_2 -#endif - . = 0x1200 DataStoreTLBMiss: mtspr SPRN_DAR, r10 @@ -370,62 +343,6 @@ DataStoreTLBMiss: rfi patch_site 0b, patch__dtlbmiss_exit_1 -DTLBMissIMMR: - mtcr r11 - /* Set 512k byte guarded page and mark it valid */ - li r10, MD_PS512K | MD_GUARDED | MD_SVALID - mtspr SPRN_MD_TWC, r10 - mfspr r10, SPRN_IMMR /* Get current IMMR */ - rlwinm r10, r10, 0, 0xfff80000 /* Get 512 kbytes boundary */ - ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SH | _PAGE_DIRTY | \ - _PAGE_PRESENT | _PAGE_NO_CACHE - mtspr SPRN_MD_RPN, r10 /* Update TLB entry */ - - li r11, RPN_PATTERN - -0: mfspr r10, SPRN_DAR - mtspr SPRN_DAR, r11 /* Tag DAR */ - mfspr r11, SPRN_M_TW - rfi - patch_site 0b, patch__dtlbmiss_exit_2 - -DTLBMissLinear: - mtcr r11 - rlwinm r10, r10, 20, 0x0f800000 /* 8xx supports max 256Mb RAM */ -#if defined(CONFIG_STRICT_KERNEL_RWX) && CONFIG_DATA_SHIFT < 23 - patch_site 0f, patch__dtlbmiss_romem_top8 - -0: subis r11, r10, (PAGE_OFFSET - 0x80000000)@ha - rlwinm r11, r11, 0, 0xff800000 - neg r10, r11 - or r11, r11, r10 - rlwinm r11, r11, 4, MI_PS8MEG ^ MI_PS512K - ori r11, r11, MI_PS512K | MI_SVALID - mfspr r10, SPRN_MD_EPN - rlwinm r10, r10, 0, 0x0ff80000 /* 8xx supports max 256Mb RAM */ -#else - /* Set 8M byte page and mark it valid */ - li r11, MD_PS8MEG | MD_SVALID -#endif - mtspr SPRN_MD_TWC, r11 -#ifdef CONFIG_STRICT_KERNEL_RWX - patch_site 0f, patch__dtlbmiss_romem_top - -0: subis r11, r10, 0 - rlwimi r10, r11, 11, _PAGE_RO -#endif - ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SH | _PAGE_DIRTY | \ - _PAGE_PRESENT - mtspr SPRN_MD_RPN, r10 /* Update TLB entry */ - - li r11, RPN_PATTERN - -0: mfspr r10, SPRN_DAR - mtspr SPRN_DAR, r11 /* Tag DAR */ - mfspr r11, SPRN_M_TW - rfi - patch_site 0b, patch__dtlbmiss_exit_3 - /* This is an instruction TLB error on the MPC8xx. This could be due * to many reasons, such as executing guarded memory or illegal instruction * addresses. There is nothing to do but handle a big time error fault.
[PATCH v1 31/46] powerpc/8xx: Add function to update pinned TLBs
Pinned TLBs are not easy to modify when the MMU is enabled. Create a small function to update a pinned TLB entry with MMU off.

Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 3 ++ arch/powerpc/kernel/head_8xx.S | 44 2 files changed, 47 insertions(+) diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index a092e6434bda..794bce83c5b0 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -193,6 +193,9 @@ #include +void mpc8xx_update_tlb(int data, int idx, unsigned long epn, + unsigned long twc, unsigned long rpn); + typedef struct { unsigned int id; unsigned int active; diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 423465b10c82..84b3c7692b37 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -866,6 +866,50 @@ initial_mmu: mtspr SPRN_DER, r8 blr +/* + * void mpc8xx_update_tlb(int data, int idx, unsigned long epn, + * unsigned long twc, unsigned long rpn); + */ +_GLOBAL(mpc8xx_update_tlb) + lis r9, (1f - PAGE_OFFSET)@h + ori r9, r9, (1f - PAGE_OFFSET)@l + mfmsr r10 + mflr r11 + li r12, MSR_KERNEL & ~(MSR_IR | MSR_DR) + rlwinm r0, r10, 0, ~MSR_RI + rlwinm r0, r0, 0, ~MSR_EE + mtmsr r0 + isync + .align 4 + mtspr SPRN_SRR0, r9 + mtspr SPRN_SRR1, r12 + rfi + +1: cmpwi r3, 0 + beq 2f + + mfspr r0, SPRN_MD_CTR + rlwimi r0, r4, 8, 0x1f00 + mtspr SPRN_MD_CTR, r0 + + mtspr SPRN_MD_EPN, r5 + mtspr SPRN_MD_TWC, r6 + mtspr SPRN_MD_RPN, r7 + b 3f + +2: mfspr r0, SPRN_MI_CTR + rlwimi r0, r4, 8, 0x1f00 + mtspr SPRN_MI_CTR, r0 + + mtspr SPRN_MI_EPN, r5 + mtspr SPRN_MI_TWC, r6 + mtspr SPRN_MI_RPN, r7 + +3: li r12, MSR_KERNEL & ~(MSR_IR | MSR_DR | MSR_RI) + mtmsr r12 + mtspr SPRN_SRR1, r10 + mtspr SPRN_SRR0, r11 + rfi /* * We put a few things here that have to be page-aligned. -- 2.25.0
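To illustrate the calling convention of the new helper, here is a hedged sketch: a userspace stub with the same signature that only logs its arguments, whereas the real routine drops MSR[IR]/MSR[DR] and rewrites the M[ID]_CTR/EPN/TWC/RPN SPRs. The slot number and addresses below are hypothetical.

#include <stdio.h>

static void mpc8xx_update_tlb(int data, int idx, unsigned long epn,
                              unsigned long twc, unsigned long rpn)
{
        printf("%cTLB slot %d: EPN=%#lx TWC=%#lx RPN=%#lx\n",
               data ? 'D' : 'I', idx, epn, twc, rpn);
}

int main(void)
{
        unsigned long epn = 0xc0000000; /* hypothetical effective address */
        unsigned long twc = 0;          /* would carry e.g. MD_PS8MEG | MD_SVALID */
        unsigned long rpn = 0;          /* would carry the RPN plus access flags */

        mpc8xx_update_tlb(1 /* DTLB */, 28 /* hypothetical pinned slot */,
                          epn, twc, rpn);
        return 0;
}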
[PATCH v1 32/46] powerpc/8xx: Don't set IMMR map anymore at boot
Only early debug requires IMMR to be mapped early. No need to set it up and pin it in assembly. Map it through page tables at udbg init when necessary. If CONFIG_PIN_TLB_IMMR is selected, pin it once we don't need the 32 Mb pinned RAM anymore.

Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 36 -- arch/powerpc/mm/mmu_decl.h | 4 arch/powerpc/mm/nohash/8xx.c | 15 + arch/powerpc/platforms/8xx/Kconfig | 2 +- arch/powerpc/sysdev/cpm_common.c | 2 ++ 5 files changed, 32 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 84b3c7692b37..6a1a74e9b011 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -748,6 +748,20 @@ start_here: rfi /* Load up the kernel context */ 2: +#ifdef CONFIG_PIN_TLB_IMMR + mfspr r0, SPRN_MD_CTR + ori r0, r0, 0x1f00 + mtspr SPRN_MD_CTR, r0 + LOAD_REG_IMMEDIATE(r0, VIRT_IMMR_BASE | MD_EVALID) + mtspr SPRN_MD_EPN, r0 + LOAD_REG_IMMEDIATE(r0, MD_SVALID | MD_PS512K | MD_GUARDED) + mtspr SPRN_MD_TWC, r0 + mfspr r0, SPRN_IMMR + rlwinm r0, r0, 0, 0xfff80000 + ori r0, r0, 0xf0 | _PAGE_DIRTY | _PAGE_SPS | _PAGE_SH | \ + _PAGE_NO_CACHE | _PAGE_PRESENT + mtspr SPRN_MD_RPN, r0 +#endif tlbia /* Clear all TLB entries */ sync /* wait for tlbia/tlbie to finish */ @@ -796,28 +810,6 @@ initial_mmu: ori r8, r8, MD_APG_INIT@l mtspr SPRN_MD_AP, r8 - /* Map a 512k page for the IMMR to get the processor -* internal registers (among other things). -*/ -#ifdef CONFIG_PIN_TLB_IMMR - oris r10, r10, MD_RSV4I@h - ori r10, r10, 0x1c00 - mtspr SPRN_MD_CTR, r10 - - mfspr r9, 638 /* Get current IMMR */ - andis. r9, r9, 0xfff8 /* Get 512 kbytes boundary */ - - lis r8, VIRT_IMMR_BASE@h /* Create vaddr for TLB */ - ori r8, r8, MD_EVALID /* Mark it valid */ - mtspr SPRN_MD_EPN, r8 - li r8, MD_PS512K | MD_GUARDED /* Set 512k byte page */ - ori r8, r8, MD_SVALID /* Make it valid */ - mtspr SPRN_MD_TWC, r8 - mr r8, r9 /* Create paddr for TLB */ - ori r8, r8, MI_BOOTINIT|0x2 /* Inhibit cache -- Cort */ - mtspr SPRN_MD_RPN, r8 -#endif - /* Now map the lower RAM (up to 32 Mbytes) into the ITLB. */ #ifdef CONFIG_PIN_TLB_TEXT lis r8, MI_RSV4I@h diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h index 7097e07a209a..1b6d39e9baed 100644 --- a/arch/powerpc/mm/mmu_decl.h +++ b/arch/powerpc/mm/mmu_decl.h @@ -182,6 +182,10 @@ static inline void mmu_mark_initmem_nx(void) { } static inline void mmu_mark_rodata_ro(void) { } #endif +#ifdef CONFIG_PPC_8xx +void __init mmu_mapin_immr(void); +#endif + #ifdef CONFIG_PPC_DEBUG_WX void ptdump_check_wx(void); #else diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 3189308dece4..2a55904986c6 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -65,7 +65,7 @@ void __init MMU_init_hw(void) if (IS_ENABLED(CONFIG_PIN_TLB_DATA)) { unsigned long ctr = mfspr(SPRN_MD_CTR) & 0xfe000000; unsigned long flags = 0xf0 | MD_SPS16K | _PAGE_SH | _PAGE_DIRTY; - int i = IS_ENABLED(CONFIG_PIN_TLB_IMMR) ? 29 : 28; + int i = 28; unsigned long addr = 0; unsigned long mem = total_lowmem; @@ -80,12 +80,19 @@ void __init MMU_init_hw(void) } } -static void __init mmu_mapin_immr(void) +static bool immr_is_mapped __initdata; + +void __init mmu_mapin_immr(void) { unsigned long p = PHYS_IMMR_BASE; unsigned long v = VIRT_IMMR_BASE; int offset; + if (immr_is_mapped) + return; + + immr_is_mapped = true; + for (offset = 0; offset < IMMR_SIZE; offset += PAGE_SIZE) map_kernel_page(v + offset, p + offset, PAGE_KERNEL_NCG); } @@ -121,9 +128,10 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) { unsigned long mapped; + mmu_mapin_immr(); + if (__map_without_ltlbs) { mapped = 0; - mmu_mapin_immr(); if (!IS_ENABLED(CONFIG_PIN_TLB_IMMR)) patch_instruction_site(&patch__dtlbmiss_immr_jmp, PPC_INST_NOP); if (!IS_ENABLED(CONFIG_PIN_TLB_TEXT)) @@ -142,7 +150,6 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) */ mmu_mapin_ram_chunk(0, einittext8, PAGE_KERNEL_X); mmu_mapin_ram_chunk(einittext8, mapped, PAGE_KERNEL); - mmu_mapin_immr(); }
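The key enabler here is the guard added to mmu_mapin_immr(): the function becomes idempotent, so both the early udbg/CPM init path and mmu_mapin_ram() may call it and only the first caller does the work. A standalone C model of that pattern:

#include <stdbool.h>
#include <stdio.h>

static bool immr_is_mapped;

static void mmu_mapin_immr(void)
{
        if (immr_is_mapped)
                return;
        immr_is_mapped = true;
        puts("mapping IMMR through page tables");
}

int main(void)
{
        mmu_mapin_immr(); /* e.g. from early udbg init, when needed */
        mmu_mapin_immr(); /* later from mmu_mapin_ram(): a no-op */
        return 0;
}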
[PATCH v1 30/46] powerpc/8xx: Move PPC_PIN_TLB options into 8xx Kconfig
The PPC_PIN_TLB options are dedicated to the 8xx; move them into the 8xx Kconfig. While we are at it, add some help text to explain what they do.

Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 20 --- arch/powerpc/platforms/8xx/Kconfig | 41 ++ 2 files changed, 41 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 497b7d0b2d7e..f3ea52bcbaf8 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -1222,26 +1222,6 @@ config TASK_SIZE hex "Size of user task space" if TASK_SIZE_BOOL default "0x80000000" if PPC_8xx default "0xc0000000" - -config PIN_TLB - bool "Pinned Kernel TLBs (860 ONLY)" - depends on ADVANCED_OPTIONS && PPC_8xx && \ - !DEBUG_PAGEALLOC && !STRICT_KERNEL_RWX - -config PIN_TLB_DATA - bool "Pinned TLB for DATA" - depends on PIN_TLB - default y - -config PIN_TLB_IMMR - bool "Pinned TLB for IMMR" - depends on PIN_TLB || PPC_EARLY_DEBUG_CPM - default y - -config PIN_TLB_TEXT - bool "Pinned TLB for TEXT" - depends on PIN_TLB - default y endmenu if PPC64 diff --git a/arch/powerpc/platforms/8xx/Kconfig b/arch/powerpc/platforms/8xx/Kconfig index b37de62d7e7f..0d036cd868ef 100644 --- a/arch/powerpc/platforms/8xx/Kconfig +++ b/arch/powerpc/platforms/8xx/Kconfig @@ -162,4 +162,45 @@ config UCODE_PATCH default y depends on !NO_UCODE_PATCH +menu "8xx advanced setup" + depends on PPC_8xx + +config PIN_TLB + bool "Pinned Kernel TLBs" + depends on ADVANCED_OPTIONS && !DEBUG_PAGEALLOC && !STRICT_KERNEL_RWX + help + On the 8xx, we have 32 instruction TLBs and 32 data TLBs. In each + table 4 TLBs can be pinned. + + It reduces the number of usable TLBs to 28 (i.e. by 12%). That's the + reason why we make it selectable. + + This option does nothing by itself, it just activates the selection + of what to pin. + +config PIN_TLB_DATA + bool "Pinned TLB for DATA" + depends on PIN_TLB + default y + help + This pins the first 32 Mbytes of memory with 8M pages. + +config PIN_TLB_IMMR + bool "Pinned TLB for IMMR" + depends on PIN_TLB || PPC_EARLY_DEBUG_CPM + default y + help + This pins the IMMR area with a 512k page. In case + CONFIG_PIN_TLB_DATA is also selected, it will reduce + CONFIG_PIN_TLB_DATA to 24 Mbytes. + +config PIN_TLB_TEXT + bool "Pinned TLB for TEXT" + depends on PIN_TLB + default y + help + This pins kernel text with 8M pages. + +endmenu + endmenu -- 2.25.0
[PATCH v1 29/46] powerpc/8xx: MM_SLICE is not needed anymore
As the 8xx now manages 512k pages in standard page tables, it doesn't need CONFIG_PPC_MM_SLICES anymore. Don't select it anymore and remove all related code. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 64 arch/powerpc/include/asm/nohash/32/slice.h | 20 -- arch/powerpc/include/asm/slice.h | 2 - arch/powerpc/platforms/Kconfig.cputype | 1 - 4 files changed, 87 deletions(-) delete mode 100644 arch/powerpc/include/asm/nohash/32/slice.h diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 26b7cee34dfe..a092e6434bda 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -176,12 +176,6 @@ */ #define SPRN_M_TW 799 -#ifdef CONFIG_PPC_MM_SLICES -#include -#define SLICE_ARRAY_SIZE (1 << (32 - SLICE_LOW_SHIFT - 1)) -#define LOW_SLICE_ARRAY_SZ SLICE_ARRAY_SIZE -#endif - #if defined(CONFIG_PPC_4K_PAGES) #define mmu_virtual_psize MMU_PAGE_4K #elif defined(CONFIG_PPC_16K_PAGES) @@ -199,71 +193,13 @@ #include -struct slice_mask { - u64 low_slices; - DECLARE_BITMAP(high_slices, 0); -}; - typedef struct { unsigned int id; unsigned int active; unsigned long vdso_base; -#ifdef CONFIG_PPC_MM_SLICES - u16 user_psize; /* page size index */ - unsigned char low_slices_psize[SLICE_ARRAY_SIZE]; - unsigned char high_slices_psize[0]; - unsigned long slb_addr_limit; - struct slice_mask mask_base_psize; /* 4k or 16k */ - struct slice_mask mask_512k; - struct slice_mask mask_8m; -#endif void *pte_frag; } mm_context_t; -#ifdef CONFIG_PPC_MM_SLICES -static inline u16 mm_ctx_user_psize(mm_context_t *ctx) -{ - return ctx->user_psize; -} - -static inline void mm_ctx_set_user_psize(mm_context_t *ctx, u16 user_psize) -{ - ctx->user_psize = user_psize; -} - -static inline unsigned char *mm_ctx_low_slices(mm_context_t *ctx) -{ - return ctx->low_slices_psize; -} - -static inline unsigned char *mm_ctx_high_slices(mm_context_t *ctx) -{ - return ctx->high_slices_psize; -} - -static inline unsigned long mm_ctx_slb_addr_limit(mm_context_t *ctx) -{ - return ctx->slb_addr_limit; -} - -static inline void mm_ctx_set_slb_addr_limit(mm_context_t *ctx, unsigned long limit) -{ - ctx->slb_addr_limit = limit; -} - -static inline struct slice_mask *slice_mask_for_size(mm_context_t *ctx, int psize) -{ - if (psize == MMU_PAGE_512K) - return &ctx->mask_512k; - if (psize == MMU_PAGE_8M) - return &ctx->mask_8m; - - BUG_ON(psize != mmu_virtual_psize); - - return &ctx->mask_base_psize; -} -#endif /* CONFIG_PPC_MM_SLICE */ - #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff8) #define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE)) diff --git a/arch/powerpc/include/asm/nohash/32/slice.h b/arch/powerpc/include/asm/nohash/32/slice.h deleted file mode 100644 index 39eb0154ae2d.. 
--- a/arch/powerpc/include/asm/nohash/32/slice.h +++ /dev/null @@ -1,20 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ASM_POWERPC_NOHASH_32_SLICE_H -#define _ASM_POWERPC_NOHASH_32_SLICE_H - -#ifdef CONFIG_PPC_MM_SLICES - -#define SLICE_LOW_SHIFT26 /* 64 slices */ -#define SLICE_LOW_TOP (0x1ull) -#define SLICE_NUM_LOW (SLICE_LOW_TOP >> SLICE_LOW_SHIFT) -#define GET_LOW_SLICE_INDEX(addr) ((addr) >> SLICE_LOW_SHIFT) - -#define SLICE_HIGH_SHIFT 0 -#define SLICE_NUM_HIGH 0ul -#define GET_HIGH_SLICE_INDEX(addr) (addr & 0) - -#define SLB_ADDR_LIMIT_DEFAULT DEFAULT_MAP_WINDOW - -#endif /* CONFIG_PPC_MM_SLICES */ - -#endif /* _ASM_POWERPC_NOHASH_32_SLICE_H */ diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h index c6f466f4c241..0bdd9c62eca0 100644 --- a/arch/powerpc/include/asm/slice.h +++ b/arch/powerpc/include/asm/slice.h @@ -4,8 +4,6 @@ #ifdef CONFIG_PPC_BOOK3S_64 #include -#elif defined(CONFIG_PPC_MMU_NOHASH_32) -#include #endif #ifndef __ASSEMBLY__ diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 4208724e9f28..6a50392df7b5 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -55,7 +55,6 @@ config PPC_8xx select SYS_SUPPORTS_HUGETLBFS select PPC_HAVE_KUEP select PPC_HAVE_KUAP - select PPC_MM_SLICES if HUGETLB_PAGE select HAVE_ARCH_VMAP_STACK config 40x -- 2.25.0
[PATCH v1 27/46] powerpc/8xx: Manage 512k huge pages as standard pages.
At the time being, 512k huge pages are handled through hugepd page tables. The PMD entry is flagged as a hugepd pointer, which means that only 512k hugepages can be managed in that 4M block. However, the hugepd table has the same size as a normal page table, and 512k entries can therefore be nested with normal pages. On the 8xx, TLB loading is performed by software and although the page tables are organised to match the L1 and L2 levels defined by the HW, all TLB entries have both L1 and L2 independent entries. It means that even if two TLB entries are associated with the same PMD entry, they can be loaded with different values in the L1 part. The L1 entry contains the page size (PS field): - 00 for 4k and 16k pages - 01 for 512k pages - 11 for 8M pages By adding a flag for hugepages in the PTE (_PAGE_HUGE) and copying it into the lower bit of PS, we can then manage 512k pages with normal page tables: - PMD entry has PS=11 for 8M pages - PMD entry has PS=00 for other pages. As a PMD entry covers 4M areas, a PMD will either point to a hugepd table having a single entry to an 8M page, or the PMD will point to a standard page table which will have entries to 4k, 16k or 512k pages. For 512k pages, as the L1 entry will not know it is a 512k page before the PTE is read, there will be 128 entries in the PTE table, as if it were 4k pages. But when loading the TLB, it will be flagged as a 512k page. Note that we can't use pmd_ptr() in asm/nohash/32/pgtable.h because it is not defined yet. In the ITLB miss handler, we keep the possibility to opt it out, as when kernel text is pinned and no user hugepages are used we can save several instructions by not using r11. In the DTLB miss handler, that's just one instruction so it's not worth bothering with it.

Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 10 ++--- arch/powerpc/include/asm/nohash/32/pte-8xx.h | 4 +++- arch/powerpc/include/asm/nohash/pgtable.h | 2 +- arch/powerpc/kernel/head_8xx.S | 12 +-- arch/powerpc/mm/hugetlbpage.c | 22 +--- arch/powerpc/mm/pgtable.c | 10 - 6 files changed, 44 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 1a86d20b58f3..1504af38a9a8 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -229,8 +229,9 @@ static inline void pmd_clear(pmd_t *pmdp) * those implementations. * * On the 8xx, the page tables are a bit special. For 16k pages, we have - * 4 identical entries. For other page sizes, we have a single entry in the - * table. + * 4 identical entries. For 512k pages, we have 128 entries as if it was + * 4k pages, but they are flagged as 512k pages for the hardware. + * For other page sizes, we have a single entry in the table. */ #ifdef CONFIG_PPC_8xx static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, @@ -240,13 +241,16 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p pte_basic_t old = pte_val(*p); pte_basic_t new = (old & ~(pte_basic_t)clr) | set; int num, i; + pmd_t *pmd = pmd_offset(pud_offset(pgd_offset(mm, addr), addr), addr); if (!huge) num = PAGE_SIZE / SZ_4K; + else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) != _PMD_PAGE_8M) + num = SZ_512K / SZ_4K; else num = 1; - for (i = 0; i < num; i++, entry++) + for (i = 0; i < num; i++, entry++, new += SZ_4K) *entry = new; return old; diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h index c9e4b2d90f65..66f403a7da44 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h @@ -46,6 +46,8 @@ #define _PAGE_NA 0x0200 /* Supervisor NA, User no access */ #define _PAGE_RO 0x0600 /* Supervisor RO, User no access */ +#define _PAGE_HUGE 0x0800 /* Copied to L1 PS bit 29 */ + /* cache related flags non existing on 8xx */ #define _PAGE_COHERENT 0 #define _PAGE_WRITETHRU 0 @@ -128,7 +130,7 @@ static inline pte_t pte_mkuser(pte_t pte) static inline pte_t pte_mkhuge(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_SPS); + return __pte(pte_val(pte) | _PAGE_SPS | _PAGE_HUGE); } #define pte_mkhuge pte_mkhuge diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index 7fed9dc0f147..f27c967d9269 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -267,7 +267,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, static inline int hugepd_ok(hugepd_t hpd) { #ifdef CONFIG_PPC_8xx - return ((hpd_val(hpd) & 0x4) != 0); + return ((hpd_val(hpd)
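A quick standalone check of the numbers in the commit message (the SZ_* values match the kernel's include/linux/sizes.h): a 512k page is entered as 128 consecutive PTEs, each one stepping the RPN by 4k, and the PS bits tell the hardware the real size.

#include <stdio.h>

#define SZ_4K 0x1000u
#define SZ_512K 0x80000u

int main(void)
{
        unsigned int num = SZ_512K / SZ_4K;

        printf("PTEs per 512k page: %u\n", num); /* 128 */
        printf("PS encodings: 4k/16k=00, 512k=01, 8M=11\n");
        return 0;
}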
[PATCH v1 28/46] powerpc/8xx: Only 8M pages are hugepte pages now
512k pages are now standard pages, so only 8M pages are hugepte. No more handling of normal page tables through hugepd allocation and freeing, and hugepte helpers can also be simplified. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 7 +++ arch/powerpc/mm/hugetlbpage.c| 16 +++- 2 files changed, 6 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index 785437323576..1c7d4693a78e 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -13,13 +13,13 @@ static inline pte_t *hugepd_page(hugepd_t hpd) static inline unsigned int hugepd_shift(hugepd_t hpd) { - return ((hpd_val(hpd) & _PMD_PAGE_MASK) >> 1) + 17; + return PAGE_SHIFT_8M; } static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, unsigned int pdshift) { - unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> PAGE_SHIFT; + unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT; return hugepd_page(hpd) + idx; } @@ -32,8 +32,7 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma, static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) { - *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | -(pshift == PAGE_SHIFT_8M ? _PMD_PAGE_8M : _PMD_PAGE_512K)); + *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M); } static inline int check_and_get_huge_psize(int shift) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 35eb29584b54..243e90db400c 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -54,24 +54,17 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, if (pshift >= pdshift) { cachep = PGT_CACHE(PTE_T_ORDER); num_hugepd = 1 << (pshift - pdshift); - new = NULL; - } else if (IS_ENABLED(CONFIG_PPC_8xx)) { - cachep = NULL; - num_hugepd = 1; - new = pte_alloc_one(mm); } else { cachep = PGT_CACHE(pdshift - pshift); num_hugepd = 1; - new = NULL; } - if (!cachep && !new) { + if (!cachep) { WARN_ONCE(1, "No page table cache created for hugetlb tables"); return -ENOMEM; } - if (cachep) - new = kmem_cache_alloc(cachep, pgtable_gfp_flags(mm, GFP_KERNEL)); + new = kmem_cache_alloc(cachep, pgtable_gfp_flags(mm, GFP_KERNEL)); BUG_ON(pshift > HUGEPD_SHIFT_MASK); BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK); @@ -102,10 +95,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, if (i < num_hugepd) { for (i = i - 1 ; i >= 0; i--, hpdp--) *hpdp = __hugepd(0); - if (cachep) - kmem_cache_free(cachep, new); - else - pte_free(mm, new); + kmem_cache_free(cachep, new); } else { kmemleak_ignore(new); } -- 2.25.0
[PATCH v1 26/46] powerpc/8xx: Prepare handlers for _PAGE_HUGE for 512k pages.
Prepare the ITLB handler to handle _PAGE_HUGE when CONFIG_HUGETLBFS is enabled. This means that the L1 entry has to be kept in r11 until the L2 entry is read, in order to insert _PAGE_HUGE into it. Also move the pgd_offset helpers before pte_update() as they will be needed there in the next patch.

Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 13 ++--- arch/powerpc/kernel/head_8xx.S | 15 +-- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index dd5835354e33..1a86d20b58f3 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -206,6 +206,12 @@ static inline void pmd_clear(pmd_t *pmdp) } +/* to find an entry in a kernel page-table-directory */ +#define pgd_offset_k(address) pgd_offset(&init_mm, address) + +/* to find an entry in a page-table-directory */ +#define pgd_index(address) ((address) >> PGDIR_SHIFT) +#define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) /* * PTE updates. This function is called whenever an existing @@ -348,13 +354,6 @@ static inline int pte_young(pte_t pte) pfn_to_page((__pa(pmd_val(pmd)) >> PAGE_SHIFT)) #endif -/* to find an entry in a kernel page-table-directory */ -#define pgd_offset_k(address) pgd_offset(&init_mm, address) - -/* to find an entry in a page-table-directory */ -#define pgd_index(address) ((address) >> PGDIR_SHIFT) -#define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) - /* Find an entry in the third-level page table.. */ #define pte_index(address) \ (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 905205c79a25..adad8baadcf5 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -196,7 +196,7 @@ SystemCall: InstructionTLBMiss: mtspr SPRN_SPRG_SCRATCH0, r10 -#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS) mtspr SPRN_SPRG_SCRATCH1, r11 #endif @@ -235,16 +235,19 @@ InstructionTLBMiss: rlwinm r10, r10, 0, 20, 31 oris r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha 3: + mtcr r11 #endif +#ifdef CONFIG_HUGETLBFS + lwz r11, (swapper_pg_dir-PAGE_OFFSET)@l(r10) /* Get level 1 entry */ + mtspr SPRN_MI_TWC, r11 /* Set segment attributes */ + mtspr SPRN_MD_TWC, r11 +#else lwz r10, (swapper_pg_dir-PAGE_OFFSET)@l(r10) /* Get level 1 entry */ mtspr SPRN_MI_TWC, r10 /* Set segment attributes */ - mtspr SPRN_MD_TWC, r10 +#endif mfspr r10, SPRN_MD_TWC lwz r10, 0(r10) /* Get the pte */ -#ifdef ITLB_MISS_KERNEL - mtcr r11 -#endif #ifdef CONFIG_SWAP rlwinm r11, r10, 32-5, _PAGE_PRESENT and r11, r11, r10 @@ -263,7 +266,7 @@ InstructionTLBMiss: /* Restore registers */ 0: mfspr r10, SPRN_SPRG_SCRATCH0 -#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_SWAP) || defined(CONFIG_HUGETLBFS) mfspr r11, SPRN_SPRG_SCRATCH1 #endif rfi -- 2.25.0
[PATCH v1 22/46] powerpc/mm: Standardise pte_update() prototype between PPC32 and PPC64
PPC64 takes 3 additional parameters compared to PPC32: - mm - address - huge These 3 parameters will be needed in order to perform different action depending on the page size on the 8xx. Make pte_update() prototype identical for PPC32 and PPC64. This allows dropping an #ifdef in huge_ptep_get_and_clear(). Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/pgtable.h | 15 --- arch/powerpc/include/asm/hugetlb.h | 4 arch/powerpc/include/asm/nohash/32/pgtable.h | 13 +++-- 3 files changed, 15 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 8122f0b55d21..f5eab98c4e41 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -218,7 +218,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot); */ #define pte_clear(mm, addr, ptep) \ - do { pte_update(ptep, ~_PAGE_HASHPTE, 0); } while (0) + do { pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, 0, 0); } while (0) #define pmd_none(pmd) (!pmd_val(pmd)) #definepmd_bad(pmd)(pmd_val(pmd) & _PMD_BAD) @@ -254,7 +254,8 @@ extern void flush_hash_entry(struct mm_struct *mm, pte_t *ptep, * when using atomic updates, only the low part of the PTE is * accessed atomically. */ -static inline pte_basic_t pte_update(pte_t *p, unsigned long clr, unsigned long set) +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, +unsigned long clr, unsigned long set, int huge) { pte_basic_t old; unsigned long tmp; @@ -292,7 +293,7 @@ static inline int __ptep_test_and_clear_young(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { unsigned long old; - old = pte_update(ptep, _PAGE_ACCESSED, 0); + old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0); if (old & _PAGE_HASHPTE) { unsigned long ptephys = __pa(ptep) & PAGE_MASK; flush_hash_pages(mm->context.id, addr, ptephys, 1); @@ -306,14 +307,14 @@ static inline int __ptep_test_and_clear_young(struct mm_struct *mm, static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { - return __pte(pte_update(ptep, ~_PAGE_HASHPTE, 0)); + return __pte(pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, 0, 0)); } #define __HAVE_ARCH_PTEP_SET_WRPROTECT static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { - pte_update(ptep, _PAGE_RW, 0); + pte_update(mm, addr, ptep, _PAGE_RW, 0, 0); } static inline void __ptep_set_access_flags(struct vm_area_struct *vma, @@ -324,7 +325,7 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC); - pte_update(ptep, 0, set); + pte_update(vma->vm_mm, address, ptep, 0, set, 0); flush_tlb_page(vma, address); } @@ -522,7 +523,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, *ptep = __pte((pte_val(*ptep) & _PAGE_HASHPTE) | (pte_val(pte) & ~_PAGE_HASHPTE)); else - pte_update(ptep, ~_PAGE_HASHPTE, pte_val(pte)); + pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, pte_val(pte), 0); #elif defined(CONFIG_PTE_64BIT) /* Second case is 32-bit with 64-bit PTE. 
In this case, we diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index bd6504c28c2f..e4276af034e9 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -40,11 +40,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { -#ifdef CONFIG_PPC64 return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1)); -#else - return __pte(pte_update(ptep, ~0UL, 0)); -#endif } #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index ddf681ceb860..75880eb1cb91 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -166,7 +166,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot); #ifndef __ASSEMBLY__ #define pte_clear(mm, addr, ptep) \ - do { pte_update(ptep, ~0, 0); } while (0) + do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0) #ifndef pte_mkwrite static inline pte_t pte_mkwrite(pte_t pte) @@ -222,7 +222,8 @@ static inline void p
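A standalone model of the now-common prototype (a sketch, not the kernel implementation: the real versions also do atomic updates and hash/TLB maintenance) shows the call shape shared by PPC32 and PPC64; implementations that don't need mm, addr or huge simply ignore them.

#include <stdio.h>

typedef unsigned long pte_basic_t;
typedef struct { pte_basic_t pte; } pte_t;
struct mm_struct { int id; };

static pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p,
                              unsigned long clr, unsigned long set, int huge)
{
        pte_basic_t old = p->pte;

        p->pte = (old & ~(pte_basic_t)clr) | set;
        (void)mm; (void)addr; (void)huge; /* unused in this toy version */
        return old;
}

int main(void)
{
        struct mm_struct mm = { 0 };
        pte_t pte = { 0x1f1 };

        /* huge_ptep_get_and_clear() now reads the same on both: clear all bits. */
        pte_basic_t old = pte_update(&mm, 0x10000000, &pte, ~0UL, 0, 1);

        printf("old %#lx, new %#lx\n", old, pte.pte);
        return 0;
}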
[PATCH v1 23/46] powerpc/mm: Create a dedicated pte_update() for 8xx
pte_update() is a bit special for the 8xx. At the time being, that's an #ifdef inside the nohash/32 pte_update(). As we are going to make it even more special in the coming patches, create a dedicated version for pte_update() for 8xx. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 29 +--- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 75880eb1cb91..272963a05ab2 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -221,7 +221,31 @@ static inline void pmd_clear(pmd_t *pmdp) * that an executable user mapping was modified, which is needed * to properly flush the virtually tagged instruction cache of * those implementations. + * + * On the 8xx, the page tables are a bit special. For 16k pages, we have + * 4 identical entries. For other page sizes, we have a single entry in the + * table. */ +#ifdef CONFIG_PPC_8xx +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, +unsigned long clr, unsigned long set, int huge) +{ + pte_basic_t *entry = &p->pte; + pte_basic_t old = pte_val(*p); + pte_basic_t new = (old & ~(pte_basic_t)clr) | set; + int num, i; + + if (!huge) + num = PAGE_SIZE / SZ_4K; + else + num = 1; + + for (i = 0; i < num; i++, entry++) + *entry = new; + + return old; +} +#else static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, unsigned long clr, unsigned long set, int huge) { @@ -242,11 +266,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p pte_basic_t old = pte_val(*p); pte_basic_t new = (old & ~(pte_basic_t)clr) | set; -#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES) - p->pte = p->pte1 = p->pte2 = p->pte3 = new; -#else *p = __pte(new); -#endif #endif /* !PTE_ATOMIC_UPDATES */ #ifdef CONFIG_44x @@ -255,6 +275,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p #endif return old; } +#endif #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG static inline int __ptep_test_and_clear_young(struct mm_struct *mm, -- 2.25.0
[PATCH v1 25/46] powerpc/8xx: Drop CONFIG_8xx_COPYBACK option
CONFIG_8xx_COPYBACK was there to help disabling copyback cache mode when debugging hardware. But nobody will design new boards with the 8xx now. All 8xx platforms select it, so make it the default and remove the option. Also remove the Mx_RESETVAL values, which are pretty useless and hide the real value while reading the code.

Signed-off-by: Christophe Leroy --- arch/powerpc/configs/adder875_defconfig | 1 - arch/powerpc/configs/ep88xc_defconfig | 1 - arch/powerpc/configs/mpc866_ads_defconfig | 1 - arch/powerpc/configs/mpc885_ads_defconfig | 1 - arch/powerpc/configs/tqm8xx_defconfig | 1 - arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 2 -- arch/powerpc/kernel/head_8xx.S | 15 +-- arch/powerpc/platforms/8xx/Kconfig | 9 - 8 files changed, 1 insertion(+), 30 deletions(-) diff --git a/arch/powerpc/configs/adder875_defconfig b/arch/powerpc/configs/adder875_defconfig index f55e23cb176c..5326bc739279 100644 --- a/arch/powerpc/configs/adder875_defconfig +++ b/arch/powerpc/configs/adder875_defconfig @@ -10,7 +10,6 @@ CONFIG_EXPERT=y # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_PPC_ADDER875=y -CONFIG_8xx_COPYBACK=y CONFIG_GEN_RTC=y CONFIG_HZ_1000=y # CONFIG_SECCOMP is not set diff --git a/arch/powerpc/configs/ep88xc_defconfig b/arch/powerpc/configs/ep88xc_defconfig index 0e2e5e81a359..f5c3e72da719 100644 --- a/arch/powerpc/configs/ep88xc_defconfig +++ b/arch/powerpc/configs/ep88xc_defconfig @@ -12,7 +12,6 @@ CONFIG_EXPERT=y # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_PPC_EP88XC=y -CONFIG_8xx_COPYBACK=y CONFIG_GEN_RTC=y CONFIG_HZ_100=y # CONFIG_SECCOMP is not set diff --git a/arch/powerpc/configs/mpc866_ads_defconfig b/arch/powerpc/configs/mpc866_ads_defconfig index 5320735395e7..5c56d36cdfc5 100644 --- a/arch/powerpc/configs/mpc866_ads_defconfig +++ b/arch/powerpc/configs/mpc866_ads_defconfig @@ -12,7 +12,6 @@ CONFIG_EXPERT=y # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_MPC86XADS=y -CONFIG_8xx_COPYBACK=y CONFIG_GEN_RTC=y CONFIG_HZ_1000=y CONFIG_MATH_EMULATION=y diff --git a/arch/powerpc/configs/mpc885_ads_defconfig b/arch/powerpc/configs/mpc885_ads_defconfig index 82a008c04eae..949ff9ccda5e 100644 --- a/arch/powerpc/configs/mpc885_ads_defconfig +++ b/arch/powerpc/configs/mpc885_ads_defconfig @@ -11,7 +11,6 @@ CONFIG_EXPERT=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y -CONFIG_8xx_COPYBACK=y CONFIG_GEN_RTC=y CONFIG_HZ_100=y # CONFIG_SECCOMP is not set diff --git a/arch/powerpc/configs/tqm8xx_defconfig b/arch/powerpc/configs/tqm8xx_defconfig index eda8bfb2d0a3..77857d513022 100644 --- a/arch/powerpc/configs/tqm8xx_defconfig +++ b/arch/powerpc/configs/tqm8xx_defconfig @@ -15,7 +15,6 @@ CONFIG_MODULE_SRCVERSION_ALL=y # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_TQM8XX=y -CONFIG_8xx_COPYBACK=y # CONFIG_8xx_CPU15 is not set CONFIG_GEN_RTC=y CONFIG_HZ_100=y diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 76af5b0cb16e..26b7cee34dfe 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -19,7 +19,6 @@ #define MI_RSV4I 0x08000000 /* Reserve 4 TLB entries */ #define MI_PPCS 0x02000000 /* Use MI_RPN prob/priv state */ #define MI_IDXMASK 0x00001f00 /* TLB index to be loaded */ -#define MI_RESETVAL 0x00000000 /* Value of register at reset */ /* These are the Ks and Kp from the PowerPC books. For proper operation, * Ks = 0, Kp = 1.
@@ -95,7 +94,6 @@ #define MD_TWAM 0x04000000 /* Use 4K page hardware assist */ #define MD_PPCS 0x02000000 /* Use MI_RPN prob/priv state */ #define MD_IDXMASK 0x00001f00 /* TLB index to be loaded */ -#define MD_RESETVAL 0x04000000 /* Value of register at reset */ #define SPRN_M_CASID 793 /* Address space ID (context) to match */ #define MC_ASIDMASK 0x0000000f /* Bits used for ASID value */ diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 073a651787df..905205c79a25 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -779,10 +779,7 @@ start_here: initial_mmu: li r8, 0 mtspr SPRN_MI_CTR, r8 /* remove PINNED ITLB entries */ - lis r10, MD_RESETVAL@h -#ifndef CONFIG_8xx_COPYBACK - oris r10, r10, MD_WTDEF@h -#endif + lis r10, MD_TWAM@h mtspr SPRN_MD_CTR, r10 /* remove PINNED DTLB entries */ tlbia /* Invalidate all TLB entries */ @@ -857,17 +854,7 @@ initial_mmu: mtspr SPRN_DC_CST, r8 lis r8, IDC_ENABLE@h mtspr SPRN_IC_CST, r8 -#ifdef CONFIG_8xx_COPYBACK - mtspr SPRN_DC_CST, r8 -#else - /* For a
[PATCH v1 24/46] powerpc/mm: Reduce hugepd size for 8M hugepages on 8xx
Commit 55c8fc3f4930 ("powerpc/8xx: reintroduce 16K pages with HW assistance") redefined pte_t as a struct of 4 pte_basic_t, because in 16K pages mode there are four identical entries in the page table. But hugepd entries for 8M pages require only one entry of size pte_basic_t. So there is no point in creating a cache for 4-entry page tables. Calculate PTE_T_ORDER using the size of pte_basic_t instead of pte_t. Define specific huge_pte helpers (set_huge_pte_at(), huge_pte_clear(), huge_ptep_set_wrprotect()) to write the pte in a single entry instead of using set_pte_at(), which writes 4 identical entries in 16k pages mode. Also make sure that __ptep_set_access_flags() properly handles the huge_pte case. Define set_pte_filter() inline, otherwise GCC doesn't inline it anymore because it is now used twice, and that gives pretty suboptimal code because of pte_t being a struct of 4 entries. Those functions are also used for 512k pages, which only require one entry as well, although replicating it four times was harmless as 512k page entries are spread every 128 bytes in the table.

Signed-off-by: Christophe Leroy --- .../include/asm/nohash/32/hugetlb-8xx.h | 20 ++ arch/powerpc/include/asm/nohash/32/pgtable.h | 3 ++- arch/powerpc/mm/hugetlbpage.c | 3 ++- arch/powerpc/mm/pgtable.c | 26 --- 4 files changed, 46 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index a46616937d20..785437323576 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -41,4 +41,24 @@ static inline int check_and_get_huge_psize(int shift) return shift_to_mmu_psize(shift); } +#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); + +#define __HAVE_ARCH_HUGE_PTE_CLEAR +static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, unsigned long sz) +{ + pte_update(mm, addr, ptep, ~0UL, 0, 1); +} + +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT +static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) +{ + unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0))); + unsigned long set = pte_val(pte_wrprotect(__pte(0))); + + pte_update(mm, addr, ptep, clr, set, 1); +} + #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */ diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 272963a05ab2..dd5835354e33 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -314,8 +314,9 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t pte_clr = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(~0))))); unsigned long set = pte_val(entry) & pte_val(pte_set); unsigned long clr = ~pte_val(entry) & ~pte_val(pte_clr); + int huge = psize > mmu_virtual_psize ? 1 : 0; - pte_update(vma->vm_mm, address, ptep, clr, set, 0); + pte_update(vma->vm_mm, address, ptep, clr, set, huge); flush_tlb_page(vma, address); } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 33b3461d91e8..edf511c2a30a 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -30,7 +30,8 @@ bool hugetlb_disabled = false; #define hugepd_none(hpd) (hpd_val(hpd) == 0) -#define PTE_T_ORDER (__builtin_ffs(sizeof(pte_t)) - __builtin_ffs(sizeof(void *))) +#define PTE_T_ORDER (__builtin_ffs(sizeof(pte_basic_t)) - \ + __builtin_ffs(sizeof(void *))) pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz) { diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index e3759b69f81b..214a5f4beb6c 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -100,7 +100,7 @@ static pte_t set_pte_filter_hash(pte_t pte) { return pte; } * as we don't have two bits to spare for _PAGE_EXEC and _PAGE_HWEXEC so * instead we "filter out" the exec permission for non clean pages. */ -static pte_t set_pte_filter(pte_t pte) +static inline pte_t set_pte_filter(pte_t pte) { struct page *pg; @@ -249,16 +249,34 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma, #else /* -* Not used on non book3s64 platforms. But 8xx -* can possibly use tsize derived from hstate. +* Not used on non book3s64 platforms. +* 8xx compares it with mmu_virtual_psize to +* know if it is a huge page or not. */ - psize = 0; + psize = MMU_PAGE_COUNT; #endif
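The PTE_T_ORDER change can be checked with a few lines of standalone C (GCC/Clang __builtin_ffs; the sizes below emulate a 32-bit build, where pte_basic_t is a 4-byte word, pte_t is 4 of them in 16k pages mode, and pointers are 4 bytes):

#include <stdio.h>

typedef unsigned int pte_basic_t;                            /* 4 bytes, as on ppc32 */
typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t; /* 16 bytes */

/* sizeof(unsigned int) stands in for sizeof(void *) on a 32-bit build. */
#define ORDER(type) (__builtin_ffs(sizeof(type)) - __builtin_ffs(sizeof(unsigned int)))

int main(void)
{
        /* log2(sizeof(pte_t)/sizeof(void *)): ffs(16) - ffs(4) = 2 with the
         * struct, 0 with pte_basic_t, i.e. a 4x smaller cache allocation. */
        printf("PTE_T_ORDER from pte_t: %d\n", ORDER(pte_t));
        printf("PTE_T_ORDER from pte_basic_t: %d\n", ORDER(pte_basic_t));
        return 0;
}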
[PATCH v1 21/46] powerpc/mm: Standardise __ptep_test_and_clear_young() params between PPC32 and PPC64
On PPC32, __ptep_test_and_clear_young() takes the mm->context.id. In preparation of standardising the pte_update() params between PPC32 and PPC64, __ptep_test_and_clear_young() needs mm instead of mm->context.id. Replace the context param by mm.

Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/pgtable.h | 7 --- arch/powerpc/include/asm/nohash/32/pgtable.h | 5 +++-- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index d1108d25e2e5..8122f0b55d21 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -288,18 +288,19 @@ static inline pte_basic_t pte_update(pte_t *p, unsigned long clr, unsigned long * for our hash-based implementation, we fix that up here. */ #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG -static inline int __ptep_test_and_clear_young(unsigned int context, unsigned long addr, pte_t *ptep) +static inline int __ptep_test_and_clear_young(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) { unsigned long old; old = pte_update(ptep, _PAGE_ACCESSED, 0); if (old & _PAGE_HASHPTE) { unsigned long ptephys = __pa(ptep) & PAGE_MASK; - flush_hash_pages(context, addr, ptephys, 1); + flush_hash_pages(mm->context.id, addr, ptephys, 1); } return (old & _PAGE_ACCESSED) != 0; } #define ptep_test_and_clear_young(__vma, __addr, __ptep) \ - __ptep_test_and_clear_young((__vma)->vm_mm->context.id, __addr, __ptep) + __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep) #define __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 9eaf386a747b..ddf681ceb860 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -256,14 +256,15 @@ static inline pte_basic_t pte_update(pte_t *p, unsigned long clr, unsigned long } #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG -static inline int __ptep_test_and_clear_young(unsigned int context, unsigned long addr, pte_t *ptep) +static inline int __ptep_test_and_clear_young(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) { unsigned long old; old = pte_update(ptep, _PAGE_ACCESSED, 0); return (old & _PAGE_ACCESSED) != 0; } #define ptep_test_and_clear_young(__vma, __addr, __ptep) \ - __ptep_test_and_clear_young((__vma)->vm_mm->context.id, __addr, __ptep) + __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep) #define __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, -- 2.25.0
[PATCH v1 20/46] powerpc/mm: Refactor pte_update() on book3s/32
When CONFIG_PTE_64BIT is set, pte_update() operates on 'unsigned long long' When CONFIG_PTE_64BIT is not set, pte_update() operates on 'unsigned long' In asm/page.h, we have pte_basic_t which is 'unsigned long long' when CONFIG_PTE_64BIT is set and 'unsigned long' otherwise. Refactor pte_update() using pte_basic_t. While we are at it, drop the comment on 44x which is not applicable to book3s version of pte_update(). Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/pgtable.h | 58 +++- 1 file changed, 20 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 7549393c4c43..d1108d25e2e5 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -253,53 +253,35 @@ extern void flush_hash_entry(struct mm_struct *mm, pte_t *ptep, * and the PTE may be either 32 or 64 bit wide. In the later case, * when using atomic updates, only the low part of the PTE is * accessed atomically. - * - * In addition, on 44x, we also maintain a global flag indicating - * that an executable user mapping was modified, which is needed - * to properly flush the virtually tagged instruction cache of - * those implementations. */ -#ifndef CONFIG_PTE_64BIT -static inline unsigned long pte_update(pte_t *p, - unsigned long clr, - unsigned long set) +static inline pte_basic_t pte_update(pte_t *p, unsigned long clr, unsigned long set) { - unsigned long old, tmp; - - __asm__ __volatile__("\ -1: lwarx %0,0,%3\n\ - andc%1,%0,%4\n\ - or %1,%1,%5\n" -" stwcx. %1,0,%3\n\ - bne-1b" - : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) - : "cc" ); - - return old; -} -#else /* CONFIG_PTE_64BIT */ -static inline unsigned long long pte_update(pte_t *p, - unsigned long clr, - unsigned long set) -{ - unsigned long long old; + pte_basic_t old; unsigned long tmp; - __asm__ __volatile__("\ -1: lwarx %L0,0,%4\n\ - lwzx%0,0,%3\n\ - andc%1,%L0,%5\n\ - or %1,%1,%6\n" -" stwcx. %1,0,%4\n\ - bne-1b" + __asm__ __volatile__( +#ifndef CONFIG_PTE_64BIT +"1:lwarx %0, 0, %3\n" +" andc%1, %0, %4\n" +#else +"1:lwarx %L0, 0, %3\n" +" lwz %0, -4(%3)\n" +" andc%1, %L0, %4\n" +#endif +" or %1, %1, %5\n" +" stwcx. %1, 0, %3\n" +" bne-1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" ((unsigned long)(p) + 4), "r" (clr), "r" (set), "m" (*p) +#ifndef CONFIG_PTE_64BIT + : "r" (p), +#else + : "b" ((unsigned long)(p) + 4), +#endif + "r" (clr), "r" (set), "m" (*p) : "cc" ); return old; } -#endif /* CONFIG_PTE_64BIT */ /* * 2.6 calls this without flushing the TLB entry; this is wrong -- 2.25.0
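For readers not fluent in lwarx/stwcx.: it is a load-reserve/store-conditional pair, and the loop above is the PowerPC spelling of a compare-and-swap retry loop. A portable standalone C sketch of the same read-modify-write (GCC/Clang __atomic builtins; the clr/set values are made up):

#include <stdio.h>

typedef unsigned long pte_basic_t;

static pte_basic_t pte_update_model(pte_basic_t *p, unsigned long clr,
                                    unsigned long set)
{
        pte_basic_t old, new;

        do {
                old = __atomic_load_n(p, __ATOMIC_RELAXED);
                new = (old & ~(pte_basic_t)clr) | set; /* the andc + or */
        } while (!__atomic_compare_exchange_n(p, &old, new, 0,
                                              __ATOMIC_RELAXED, __ATOMIC_RELAXED));
        return old;
}

int main(void)
{
        pte_basic_t pte = 0x80000471;
        pte_basic_t old = pte_update_model(&pte, 0x400, 0); /* clear one flag */

        printf("old %#lx, new %#lx\n", old, pte);
        return 0;
}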
[PATCH v1 19/46] powerpc/mm: Refactor pte_update() on nohash/32
When CONFIG_PTE_64BIT is set, pte_update() operates on 'unsigned long long' When CONFIG_PTE_64BIT is not set, pte_update() operates on 'unsigned long' In asm/page.h, we have pte_basic_t which is 'unsigned long long' when CONFIG_PTE_64BIT is set and 'unsigned long' otherwise. Refactor pte_update() using pte_basic_t. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 26 +++- 1 file changed, 4 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 523c4c3876c5..9eaf386a747b 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -222,12 +222,9 @@ static inline void pmd_clear(pmd_t *pmdp) * to properly flush the virtually tagged instruction cache of * those implementations. */ -#ifndef CONFIG_PTE_64BIT -static inline unsigned long pte_update(pte_t *p, - unsigned long clr, - unsigned long set) +static inline pte_basic_t pte_update(pte_t *p, unsigned long clr, unsigned long set) { -#ifdef PTE_ATOMIC_UPDATES +#if defined(PTE_ATOMIC_UPDATES) && !defined(CONFIG_PTE_64BIT) unsigned long old, tmp; __asm__ __volatile__("\ @@ -241,8 +238,8 @@ static inline unsigned long pte_update(pte_t *p, : "r" (p), "r" (clr), "r" (set), "m" (*p) : "cc" ); #else /* PTE_ATOMIC_UPDATES */ - unsigned long old = pte_val(*p); - unsigned long new = (old & ~clr) | set; + pte_basic_t old = pte_val(*p); + pte_basic_t new = (old & ~(pte_basic_t)clr) | set; #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES) p->pte = p->pte1 = p->pte2 = p->pte3 = new; @@ -257,21 +254,6 @@ static inline unsigned long pte_update(pte_t *p, #endif return old; } -#else /* CONFIG_PTE_64BIT */ -static inline unsigned long long pte_update(pte_t *p, - unsigned long clr, - unsigned long set) -{ - unsigned long long old = pte_val(*p); - *p = __pte((old & ~(unsigned long long)clr) | set); - -#ifdef CONFIG_44x - if ((old & _PAGE_USER) && (old & _PAGE_EXEC)) - icache_44x_need_flush = 1; -#endif - return old; -} -#endif /* CONFIG_PTE_64BIT */ #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG static inline int __ptep_test_and_clear_young(unsigned int context, unsigned long addr, pte_t *ptep) -- 2.25.0
[PATCH v1 14/46] powerpc/ptdump: Handle hugepd at PGD level
The 8xx is about to map kernel linear space and IMMR using huge pages. In order to display those pages properly, ptdump needs to handle hugepd tables at PGD level. For the time being, do it only at PGD level. Further patches may add handling of hugepd tables at lower levels for other platforms when needed in the future.

Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/ptdump.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 64434b66f240..1adaa7e794f3 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -23,6 +23,7 @@ #include #include #include +#include #include @@ -270,6 +271,26 @@ static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start) } } +static void walk_hugepd(struct pg_state *st, hugepd_t *phpd, unsigned long start, + int pdshift, int level) +{ +#ifdef CONFIG_ARCH_HAS_HUGEPD + unsigned int i; + int shift = hugepd_shift(*phpd); + int ptrs_per_hpd = pdshift - shift > 0 ? 1 << (pdshift - shift) : 1; + + if (start & ((1 << shift) - 1)) + return; + + for (i = 0; i < ptrs_per_hpd; i++) { + unsigned long addr = start + (i << shift); + pte_t *pte = hugepte_offset(*phpd, addr, pdshift); + + note_page(st, addr, level + 1, pte_val(*pte), shift); + } +#endif +} + static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start) { pmd_t *pmd = pmd_offset(pud, 0); @@ -313,11 +334,13 @@ static void walk_pagetables(struct pg_state *st) * the hash pagetable. */ for (i = pgd_index(addr); i < PTRS_PER_PGD; i++, pgd++, addr += PGDIR_SIZE) { - if (!pgd_none(*pgd) && !pgd_is_leaf(*pgd)) + if (pgd_none(*pgd) || pgd_is_leaf(*pgd)) + note_page(st, addr, 1, pgd_val(*pgd), PUD_SHIFT); + else if (is_hugepd(__hugepd(pgd_val(*pgd)))) + walk_hugepd(st, (hugepd_t *)pgd, addr, PGDIR_SHIFT, 1); + else /* pgd exists */ walk_pud(st, pgd, addr); - else - note_page(st, addr, 1, pgd_val(*pgd), PUD_SHIFT); } } -- 2.25.0
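The walk_hugepd() arithmetic is worth a worked example. With the 8xx values assumed below (PGDIR_SHIFT = 22, so one PGD slot covers 4M, and hugepd_shift() = 23 for an 8M page), one pointer is walked per slot and the alignment guard skips the second 4M slot of the same 8M page:

#include <stdio.h>

int main(void)
{
        int pdshift = 22; /* assumed PGDIR_SHIFT on an 8xx 4k-page build */
        int shift = 23;   /* hugepd_shift() for an 8M page */

        /* Same expression as in walk_hugepd() above. */
        int ptrs_per_hpd = pdshift - shift > 0 ? 1 << (pdshift - shift) : 1;

        printf("ptrs_per_hpd = %d\n", ptrs_per_hpd); /* 1 */

        unsigned long starts[] = { 0xc0000000, 0xc0400000 };
        for (int i = 0; i < 2; i++)
                printf("slot at %#lx: %s\n", starts[i],
                       (starts[i] & ((1UL << shift) - 1)) ? "skipped" : "walked");
        return 0;
}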
[PATCH v1 18/46] powerpc/mm: PTE_ATOMIC_UPDATES is only for 40x
Only 40x still uses PTE_ATOMIC_UPDATES. 40x cannot select CONFIG_PTE_64BIT.
Drop handling of PTE_ATOMIC_UPDATES:
- In nohash/64
- In nohash/32 for CONFIG_PTE_64BIT
Keep PTE_ATOMIC_UPDATES only for nohash/32 for !CONFIG_PTE_64BIT
Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 17 ----------------- arch/powerpc/include/asm/nohash/64/pgtable.h | 28 +--------------------------- 2 files changed, 1 insertion(+), 44 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index b04ba257fddb..523c4c3876c5 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -262,25 +262,8 @@ static inline unsigned long long pte_update(pte_t *p, unsigned long clr, unsigned long set) { -#ifdef PTE_ATOMIC_UPDATES - unsigned long long old; - unsigned long tmp; - - __asm__ __volatile__("\ -1: lwarx %L0,0,%4\n\ - lwzx %0,0,%3\n\ - andc %1,%L0,%5\n\ - or %1,%1,%6\n" - PPC405_ERR77(0,%3) -" stwcx. %1,0,%4\n\ - bne- 1b" - : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" ((unsigned long)(p) + 4), "r" (clr), "r" (set), "m" (*p) - : "cc" ); -#else /* PTE_ATOMIC_UPDATES */ unsigned long long old = pte_val(*p); *p = __pte((old & ~(unsigned long long)clr) | set); -#endif /* !PTE_ATOMIC_UPDATES */ #ifdef CONFIG_44x if ((old & _PAGE_USER) && (old & _PAGE_EXEC)) diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h index 9a33b8bd842d..9c703b140d64 100644 --- a/arch/powerpc/include/asm/nohash/64/pgtable.h +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h @@ -211,22 +211,9 @@ static inline unsigned long pte_update(struct mm_struct *mm, unsigned long set, int huge) { -#ifdef PTE_ATOMIC_UPDATES - unsigned long old, tmp; - - __asm__ __volatile__( - "1: ldarx %0,0,%3 # pte_update\n\ - andc %1,%0,%4 \n\ - or %1,%1,%6\n\ - stdcx. %1,0,%3 \n\ - bne- 1b" - : "=&r" (old), "=&r" (tmp), "=m" (*ptep) - : "r" (ptep), "r" (clr), "m" (*ptep), "r" (set) - : "cc" ); -#else unsigned long old = pte_val(*ptep); *ptep = __pte((old & ~clr) | set); -#endif + /* huge pages use the old page table lock */ if (!huge) assert_pte_locked(mm, addr); @@ -310,21 +297,8 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, unsigned long bits = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC); -#ifdef PTE_ATOMIC_UPDATES - unsigned long old, tmp; - - __asm__ __volatile__( - "1: ldarx %0,0,%4\n\ - or %0,%3,%0\n\ - stdcx. %0,0,%4\n\ - bne- 1b" - :"=&r" (old), "=&r" (tmp), "=m" (*ptep) - :"r" (bits), "r" (ptep), "m" (*ptep) - :"cc"); -#else unsigned long old = pte_val(*ptep); *ptep = __pte(old | bits); -#endif flush_tlb_page(vma, address); } -- 2.25.0
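For readers who want the semantics of the dropped asm in portable form, here is a C11 analogue of what PTE_ATOMIC_UPDATES provided (a sketch: a compare-and-swap loop standing in for the lwarx/stwcx. sequence, outside any kernel context):

#include <stdatomic.h>

unsigned long pte_update_atomic(_Atomic unsigned long *p,
				unsigned long clr, unsigned long set)
{
	unsigned long old = atomic_load(p);

	/* retry until the clear/set is applied atomically; a failed CAS
	 * refreshes 'old' with the current value */
	while (!atomic_compare_exchange_weak(p, &old, (old & ~clr) | set))
		;
	return old;
}

The non-atomic variant kept by this patch is the plain read-modify-write, which is sufficient where updates are otherwise serialised.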
[PATCH v1 17/46] powerpc/mm: Fix conditions to perform MMU specific management by blocks on PPC32.
Setting init mem to NX shall depend on sinittext being mapped by block, not on stext being mapped by block. Setting text and rodata to RO shall depend on stext being mapped by block, not on sinittext being mapped by block. Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX") Cc: sta...@vger.kernel.org Signed-off-by: Christophe Leroy --- arch/powerpc/mm/pgtable_32.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index 9934659cb871..bd0cb6e3573e 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -185,7 +185,7 @@ void mark_initmem_nx(void) unsigned long numpages = PFN_UP((unsigned long)_einittext) - PFN_DOWN((unsigned long)_sinittext); - if (v_block_mapped((unsigned long)_stext + 1)) + if (v_block_mapped((unsigned long)_sinittext)) mmu_mark_initmem_nx(); else change_page_attr(page, numpages, PAGE_KERNEL); @@ -197,7 +197,7 @@ void mark_rodata_ro(void) struct page *page; unsigned long numpages; - if (v_block_mapped((unsigned long)_sinittext)) { + if (v_block_mapped((unsigned long)_stext + 1)) { mmu_mark_rodata_ro(); ptdump_check_wx(); return; -- 2.25.0
[PATCH v1 16/46] powerpc/mm: Allocate static page tables for fixmap
Allocate static page tables for the fixmap area. This allows setting mappings through page tables before memblock is ready. That's needed to use early_ioremap() early and to use standard page mappings with fixmap. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/fixmap.h | 4 ++++ arch/powerpc/kernel/setup_32.c | 2 +- arch/powerpc/mm/pgtable_32.c | 16 ++++++++++++++++ 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/fixmap.h b/arch/powerpc/include/asm/fixmap.h index 2ef155a3c821..ccbe2e83c950 100644 --- a/arch/powerpc/include/asm/fixmap.h +++ b/arch/powerpc/include/asm/fixmap.h @@ -86,6 +86,10 @@ enum fixed_addresses { #define __FIXADDR_SIZE (__end_of_fixed_addresses << PAGE_SHIFT) #define FIXADDR_START (FIXADDR_TOP - __FIXADDR_SIZE) +#define FIXMAP_ALIGNED_SIZE (ALIGN(FIXADDR_TOP, PGDIR_SIZE) - \ +ALIGN_DOWN(FIXADDR_START, PGDIR_SIZE)) +#define FIXMAP_PTE_SIZE (FIXMAP_ALIGNED_SIZE / PGDIR_SIZE * PTE_TABLE_SIZE) + #define FIXMAP_PAGE_NOCACHE PAGE_KERNEL_NCG #define FIXMAP_PAGE_IO PAGE_KERNEL_NCG diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c index 5b49b26eb154..3f1e1c0b328a 100644 --- a/arch/powerpc/kernel/setup_32.c +++ b/arch/powerpc/kernel/setup_32.c @@ -81,7 +81,7 @@ notrace void __init machine_init(u64 dt_ptr) /* Configure static keys first, now that we're relocated. */ setup_feature_keys(); - early_ioremap_setup(); + early_ioremap_init(); /* Enable early debugging if any specified (see udbg.h) */ udbg_early_init(); diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index f62de06e3d07..9934659cb871 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -29,11 +29,27 @@ #include #include #include +#include #include extern char etext[], _stext[], _sinittext[], _einittext[]; +static u8 early_fixmap_pagetable[FIXMAP_PTE_SIZE] __page_aligned_data; + +notrace void __init early_ioremap_init(void) +{ + unsigned long addr = ALIGN_DOWN(FIXADDR_START, PGDIR_SIZE); + pte_t *ptep = (pte_t *)early_fixmap_pagetable; + pmd_t *pmdp = pmd_ptr_k(addr); + + for (; (s32)(FIXADDR_TOP - addr) > 0; +addr += PGDIR_SIZE, ptep += PTRS_PER_PTE, pmdp++) + pmd_populate_kernel(&init_mm, pmdp, ptep); + + early_ioremap_setup(); +} + static void __init *early_alloc_pgtable(unsigned long size) { void *ptr = memblock_alloc(size, size); -- 2.25.0
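A worked example of the FIXMAP_PTE_SIZE arithmetic, with made-up addresses (a fixmap straddling two 4M PGD slots needs two static page tables):

#include <stdio.h>

#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long pgdir_size = 1UL << 22;		/* 4M per PGD entry (assumed) */
	unsigned long pte_table_size = 4096;		/* one page of PTEs (assumed) */
	unsigned long fixaddr_top = 0xfffff000UL;	/* assumed */
	unsigned long fixaddr_start = 0xff9ff000UL;	/* assumed */
	unsigned long aligned_size = ALIGN_UP(fixaddr_top, pgdir_size) -
				     ALIGN_DOWN(fixaddr_start, pgdir_size);

	/* 2 PGD slots -> 2 static page tables, 8192 bytes */
	printf("%lu page table(s), %lu bytes\n",
	       aligned_size / pgdir_size,
	       aligned_size / pgdir_size * pte_table_size);
	return 0;
}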
[PATCH v1 15/46] powerpc/32s: Don't warn when mapping RO data ROX.
Mapping RO data as ROX is not an issue since that data cannot be modified to introduce an exploit. PPC64 accepts to have RO data mapped ROX, as a trade off between kernel size and strictness of protection. On PPC32, kernel size is even more critical as amount of memory is usually small. Depending on the number of available IBATs, the last IBATs might overflow the end of text. Only warn if it crosses the end of RO data. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/book3s32/mmu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c index 39ba53ca5bb5..a9b2cbc74797 100644 --- a/arch/powerpc/mm/book3s32/mmu.c +++ b/arch/powerpc/mm/book3s32/mmu.c @@ -187,6 +187,7 @@ void mmu_mark_initmem_nx(void) int i; unsigned long base = (unsigned long)_stext - PAGE_OFFSET; unsigned long top = (unsigned long)_etext - PAGE_OFFSET; + unsigned long border = (unsigned long)__init_begin - PAGE_OFFSET; unsigned long size; if (IS_ENABLED(CONFIG_PPC_BOOK3S_601)) @@ -201,9 +202,10 @@ void mmu_mark_initmem_nx(void) size = block_size(base, top); size = max(size, 128UL << 10); if ((top - base) > size) { - if (strict_kernel_rwx_enabled()) - pr_warn("Kernel _etext not properly aligned\n"); size <<= 1; + if (strict_kernel_rwx_enabled() && base + size > border) + pr_warn("Some RW data is getting mapped X. " + "Adjust CONFIG_DATA_SHIFT to avoid that.\n"); } setibat(i++, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT); base += size; -- 2.25.0
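A toy rendering of the sizing logic above, with made-up addresses, showing where the new warning fires (only when the doubled IBAT block crosses the end of RO data):

#include <stdio.h>

int main(void)
{
	unsigned long base = 0;			/* _stext - PAGE_OFFSET (assumed) */
	unsigned long top = 0x00500000;		/* _etext: 5M of text (assumed) */
	unsigned long border = 0x00800000;	/* __init_begin: text+rodata end at 8M (assumed) */
	unsigned long size = 0x00400000;	/* best natural IBAT block: 4M (assumed) */

	if (top - base > size) {
		size <<= 1;			/* text doesn't fit: double the block */
		if (base + size > border)
			printf("Some RW data is getting mapped X\n");
		else
			printf("block covers RO data only: no warning\n");
	}
	return 0;
}

With these numbers the 8M block ends exactly at __init_begin, so only RO data is mapped X and no warning is emitted, which is the intended new behaviour.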
[PATCH v1 13/46] powerpc/ptdump: Properly handle non standard page size
In order to properly display information regardless of the page size, it is necessary to take into account the real page size. Signed-off-by: Christophe Leroy Fixes: cabe8138b23c ("powerpc: dump as a single line areas mapping a single physical page.") Cc: sta...@vger.kernel.org --- arch/powerpc/mm/ptdump/ptdump.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 1f97668853e3..64434b66f240 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -60,6 +60,7 @@ struct pg_state { unsigned long start_address; unsigned long start_pa; unsigned long last_pa; + unsigned long page_size; unsigned int level; u64 current_flags; bool check_wx; @@ -168,9 +169,9 @@ static void dump_addr(struct pg_state *st, unsigned long addr) #endif pt_dump_seq_printf(st->seq, REG "-" REG " ", st->start_address, addr - 1); - if (st->start_pa == st->last_pa && st->start_address + PAGE_SIZE != addr) { + if (st->start_pa == st->last_pa && st->start_address + st->page_size != addr) { pt_dump_seq_printf(st->seq, "[" REG "]", st->start_pa); - delta = PAGE_SIZE >> 10; + delta = st->page_size >> 10; } else { pt_dump_seq_printf(st->seq, " " REG " ", st->start_pa); delta = (addr - st->start_address) >> 10; @@ -195,10 +196,11 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr) } static void note_page(struct pg_state *st, unsigned long addr, - unsigned int level, u64 val) + unsigned int level, u64 val, int shift) { u64 flag = val & pg_level[level].mask; u64 pa = val & PTE_RPN_MASK; + unsigned long page_size = 1 << shift; /* At first no level is set */ if (!st->level) { st->start_address = addr; st->start_pa = pa; st->last_pa = pa; + st->page_size = page_size; pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); /* * Dump the section of virtual memory when: */ } else if (flag != st->current_flags || level != st->level || addr >= st->marker[1].start_address || - (pa != st->last_pa + PAGE_SIZE && + (pa != st->last_pa + st->page_size && (pa != st->start_pa || st->start_pa != st->last_pa))) { /* Check the PTE flags */ st->start_address = addr; st->start_pa = pa; st->last_pa = pa; + st->page_size = page_size; st->current_flags = flag; st->level = level; } else { @@ -261,7 +265,7 @@ static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start) for (i = 0; i < PTRS_PER_PTE; i++, pte++) { addr = start + i * PAGE_SIZE; - note_page(st, addr, 4, pte_val(*pte)); + note_page(st, addr, 4, pte_val(*pte), PAGE_SHIFT); } } @@ -278,7 +282,7 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start) /* pmd exists */ walk_pte(st, pmd, addr); else - note_page(st, addr, 3, pmd_val(*pmd)); + note_page(st, addr, 3, pmd_val(*pmd), PTE_SHIFT); } } @@ -294,7 +298,7 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start) /* pud exists */ walk_pmd(st, pud, addr); else - note_page(st, addr, 2, pud_val(*pud)); + note_page(st, addr, 2, pud_val(*pud), PMD_SHIFT); } } @@ -313,7 +317,7 @@ static void walk_pagetables(struct pg_state *st) /* pgd exists */ walk_pud(st, pgd, addr); else - note_page(st, addr, 1, pgd_val(*pgd)); + note_page(st, addr, 1, pgd_val(*pgd), PUD_SHIFT); } } @@ -368,7 +372,7 @@ static int ptdump_show(struct seq_file *m, void *v) /* Traverse kernel page tables */ walk_pagetables(&st); - note_page(&st, 0, 0, 0); + note_page(&st, 0, 0, 0, 0); return 0; } -- 2.25.0
[PATCH v1 05/46] powerpc/kasan: Remove unnecessary page table locking
Commit 45ff3c559585 ("powerpc/kasan: Fix parallel loading of modules.") added spinlocks to manage parallel module loading. Since then, commit 47febbeeec44 ("powerpc/32: Force KASAN_VMALLOC for modules") converted the module loading to KASAN_VMALLOC. The spinlocking has then become unneeded and can be removed to simplify kasan_init_shadow_page_tables().
Also remove the inclusion of linux/moduleloader.h and linux/vmalloc.h which are not needed anymore since the removal of module management.
Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index c41e700153da..c9d053078c37 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -5,9 +5,7 @@ #include #include #include -#include #include -#include #include #include #include @@ -34,31 +32,22 @@ static int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned { pmd_t *pmd; unsigned long k_cur, k_next; - pte_t *new = NULL; pmd = pmd_ptr_k(k_start); for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { + pte_t *new; + k_next = pgd_addr_end(k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; - if (!new) - new = memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE); + new = memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE); if (!new) return -ENOMEM; kasan_populate_pte(new, PAGE_KERNEL); - - smp_wmb(); /* See comment in __pte_alloc */ - - spin_lock(&init_mm.page_table_lock); - /* Has another populated it ? */ - if (likely((void *)pmd_page_vaddr(*pmd) == kasan_early_shadow_pte)) { - pmd_populate_kernel(&init_mm, pmd, new); - new = NULL; - } - spin_unlock(&init_mm.page_table_lock); + pmd_populate_kernel(&init_mm, pmd, new); } return 0; } -- 2.25.0
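The idiom being removed, in generic userspace form (a sketch of the double-checked population pattern, not the kernel code): allocate outside the lock, then re-check under the lock whether a racing thread populated the slot first.

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void populate_once(void **slot)
{
	void *new = malloc(64);		/* allocate outside the lock */

	if (!new)
		return;
	pthread_mutex_lock(&lock);
	if (*slot == NULL) {		/* has another thread populated it? */
		*slot = new;
		new = NULL;		/* ownership transferred */
	}
	pthread_mutex_unlock(&lock);
	free(new);			/* lost the race: discard (free(NULL) is a no-op) */
}

With module shadows now going through KASAN_VMALLOC, kasan_init_shadow_page_tables() runs single-threaded at boot, so none of this machinery is needed.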
[PATCH v1 12/46] powerpc/ptdump: Standardise display of BAT flags
Display BAT flags the same way as page flags: rwx and wimg Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/bats.c | 37 ++- 1 file changed, 15 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/mm/ptdump/bats.c b/arch/powerpc/mm/ptdump/bats.c index d6c660f63d71..cebb58c7e289 100644 --- a/arch/powerpc/mm/ptdump/bats.c +++ b/arch/powerpc/mm/ptdump/bats.c @@ -15,12 +15,12 @@ static char *pp_601(int k, int pp) { if (pp == 0) - return k ? "NA" : "RWX"; + return k ? " " : "rwx"; if (pp == 1) - return k ? "ROX" : "RWX"; + return k ? "r x" : "rwx"; if (pp == 2) - return k ? "RWX" : "RWX"; - return k ? "ROX" : "ROX"; + return "rwx"; + return "r x"; } static void bat_show_601(struct seq_file *m, int idx, u32 lower, u32 upper) @@ -48,12 +48,9 @@ static void bat_show_601(struct seq_file *m, int idx, u32 lower, u32 upper) seq_printf(m, "Kernel %s User %s", pp_601(k & 2, pp), pp_601(k & 1, pp)); - if (lower & _PAGE_WRITETHRU) - seq_puts(m, "write through "); - if (lower & _PAGE_NO_CACHE) - seq_puts(m, "no cache "); - if (lower & _PAGE_COHERENT) - seq_puts(m, "coherent "); + seq_puts(m, lower & _PAGE_WRITETHRU ? "w " : " "); + seq_puts(m, lower & _PAGE_NO_CACHE ? "i " : " "); + seq_puts(m, lower & _PAGE_COHERENT ? "m " : " "); seq_puts(m, "\n"); } @@ -101,20 +98,16 @@ static void bat_show_603(struct seq_file *m, int idx, u32 lower, u32 upper, bool seq_puts(m, "Kernel/User "); if (lower & BPP_RX) - seq_puts(m, is_d ? "RO " : "EXEC "); + seq_puts(m, is_d ? "r " : " x "); else if (lower & BPP_RW) - seq_puts(m, is_d ? "RW " : "EXEC "); + seq_puts(m, is_d ? "rw " : " x "); else - seq_puts(m, is_d ? "NA " : "NX "); - - if (lower & _PAGE_WRITETHRU) - seq_puts(m, "write through "); - if (lower & _PAGE_NO_CACHE) - seq_puts(m, "no cache "); - if (lower & _PAGE_COHERENT) - seq_puts(m, "coherent "); - if (lower & _PAGE_GUARDED) - seq_puts(m, "guarded "); + seq_puts(m, is_d ? "" : ""); + + seq_puts(m, lower & _PAGE_WRITETHRU ? "w " : " "); + seq_puts(m, lower & _PAGE_NO_CACHE ? "i " : " "); + seq_puts(m, lower & _PAGE_COHERENT ? "m " : " "); + seq_puts(m, lower & _PAGE_GUARDED ? "g " : " "); seq_puts(m, "\n"); } -- 2.25.0
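The adopted formatting idea in miniature (illustrative bit values, not the real _PAGE_* definitions): one character per flag, a space when clear, so columns stay aligned across lines.

#include <stdio.h>

static void show_wimg(unsigned int flags)
{
	printf("%s", flags & 1 ? "w " : "  ");	/* write-through */
	printf("%s", flags & 2 ? "i " : "  ");	/* cache-inhibited */
	printf("%s", flags & 4 ? "m " : "  ");	/* memory coherent */
	printf("%s", flags & 8 ? "g " : "  ");	/* guarded */
	printf("\n");
}

int main(void)
{
	show_wimg(2 | 4);	/* prints "  i m   " */
	show_wimg(0);		/* prints only spaces: columns still line up */
	return 0;
}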
[PATCH v1 08/46] powerpc/ptdump: Limit size of flags text to 1/2 chars on PPC32
In order to have all flags fit on a 80 chars wide screen, reduce the flags to 1 char (2 where ambiguous).
No cache is 'i'.
User is 'ur' (supervisor would be 'sr').
Shared (for 8xx) becomes 'sh' (it was displayed as 'user' when not shared, but that was ambiguous because it is not entirely accurate).
Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/8xx.c | 33 --- arch/powerpc/mm/ptdump/shared.c | 35 + 2 files changed, 35 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c index 9e2d8e847d6e..ca9ce94672f5 100644 --- a/arch/powerpc/mm/ptdump/8xx.c +++ b/arch/powerpc/mm/ptdump/8xx.c @@ -12,9 +12,9 @@ static const struct flag_info flag_array[] = { { .mask = _PAGE_SH, - .val= 0, - .set= "user", - .clear = "", + .val= _PAGE_SH, + .set= "sh", + .clear = " ", }, { .mask = _PAGE_RO | _PAGE_NA, .val= 0, @@ -30,37 +30,38 @@ static const struct flag_info flag_array[] = { }, { .mask = _PAGE_EXEC, .val= _PAGE_EXEC, - .set= " X ", - .clear = " ", + .set= "x", + .clear = " ", }, { .mask = _PAGE_PRESENT, .val= _PAGE_PRESENT, - .set= "present", - .clear = " ", + .set= "p", + .clear = " ", }, { .mask = _PAGE_GUARDED, .val= _PAGE_GUARDED, - .set= "guarded", - .clear = " ", + .set= "g", + .clear = " ", }, { .mask = _PAGE_DIRTY, .val= _PAGE_DIRTY, - .set= "dirty", - .clear = " ", + .set= "d", + .clear = " ", }, { .mask = _PAGE_ACCESSED, .val= _PAGE_ACCESSED, - .set= "accessed", - .clear = "", + .set= "a", + .clear = " ", }, { .mask = _PAGE_NO_CACHE, .val= _PAGE_NO_CACHE, - .set= "no cache", - .clear = "", + .set= "i", + .clear = " ", }, { .mask = _PAGE_SPECIAL, .val= _PAGE_SPECIAL, - .set= "special", + .set= "s", + .clear = " ", } }; diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c index f7ed2f187cb0..44a8a64a664f 100644 --- a/arch/powerpc/mm/ptdump/shared.c +++ b/arch/powerpc/mm/ptdump/shared.c @@ -13,8 +13,8 @@ static const struct flag_info flag_array[] = { { .mask = _PAGE_USER, .val= _PAGE_USER, - .set= "user", - .clear = "", + .set= "ur", + .clear = " ", }, { .mask = _PAGE_RW, .val= _PAGE_RW, @@ -23,42 +23,43 @@ static const struct flag_info flag_array[] = { }, { .mask = _PAGE_EXEC, .val= _PAGE_EXEC, - .set= " X ", - .clear = " ", + .set= "x", + .clear = " ", }, { .mask = _PAGE_PRESENT, .val= _PAGE_PRESENT, - .set= "present", - .clear = " ", + .set= "p", + .clear = " ", }, { .mask = _PAGE_GUARDED, .val= _PAGE_GUARDED, - .set= "guarded", - .clear = " ", + .set= "g", + .clear = " ", }, { .mask = _PAGE_DIRTY, .val= _PAGE_DIRTY, - .set= "dirty", - .clear = " ", + .set= "d", + .clear = " ", }, { .mask = _PAGE_ACCESSED, .val= _PAGE_ACCESSED, - .set= "accessed", - .clear = "", + .set= "a", + .clear = " ", }, { .mask = _PAGE_WRITETHRU, .val= _PAGE_WRITETHRU, - .set= "write through", - .clear = " ", + .set= "w", + .clear = " ", }, { .mask = _PAGE_NO_CACHE, .val= _PAGE_NO_CACHE, - .set= "no cache", - .clear = "", + .set= "i", + .clear = " ",
[PATCH v1 09/46] powerpc/ptdump: Reorder flags
Reorder flags in a more logical way:
- Page size (huge) first
- User
- RWX
- Present
- WIMG
- Special
- Dirty and Accessed
Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/8xx.c | 30 +++++++++++++++--------------- arch/powerpc/mm/ptdump/shared.c | 30 +++++++++++++++--------------- 2 files changed, 30 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c index ca9ce94672f5..a3169677dced 100644 --- a/arch/powerpc/mm/ptdump/8xx.c +++ b/arch/powerpc/mm/ptdump/8xx.c @@ -11,11 +11,6 @@ static const struct flag_info flag_array[] = { { - .mask = _PAGE_SH, - .val= _PAGE_SH, - .set= "sh", - .clear = " ", - }, { .mask = _PAGE_RO | _PAGE_NA, .val= 0, .set= "rw", @@ -37,11 +32,26 @@ static const struct flag_info flag_array[] = { .val= _PAGE_PRESENT, .set= "p", .clear = " ", + }, { + .mask = _PAGE_NO_CACHE, + .val= _PAGE_NO_CACHE, + .set= "i", + .clear = " ", }, { .mask = _PAGE_GUARDED, .val= _PAGE_GUARDED, .set= "g", .clear = " ", + }, { + .mask = _PAGE_SH, + .val= _PAGE_SH, + .set= "sh", + .clear = " ", + }, { + .mask = _PAGE_SPECIAL, + .val= _PAGE_SPECIAL, + .set= "s", + .clear = " ", }, { .mask = _PAGE_DIRTY, .val= _PAGE_DIRTY, @@ -52,16 +62,6 @@ static const struct flag_info flag_array[] = { .val= _PAGE_ACCESSED, .set= "a", .clear = " ", - }, { - .mask = _PAGE_NO_CACHE, - .val= _PAGE_NO_CACHE, - .set= "i", - .clear = " ", - }, { - .mask = _PAGE_SPECIAL, - .val= _PAGE_SPECIAL, - .set= "s", - .clear = " ", } }; diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c index 44a8a64a664f..dab5d8028a9b 100644 --- a/arch/powerpc/mm/ptdump/shared.c +++ b/arch/powerpc/mm/ptdump/shared.c @@ -30,21 +30,6 @@ static const struct flag_info flag_array[] = { .val= _PAGE_PRESENT, .set= "p", .clear = " ", - }, { - .mask = _PAGE_GUARDED, - .val= _PAGE_GUARDED, - .set= "g", - .clear = " ", - }, { - .mask = _PAGE_DIRTY, - .val= _PAGE_DIRTY, - .set= "d", - .clear = " ", - }, { - .mask = _PAGE_ACCESSED, - .val= _PAGE_ACCESSED, - .set= "a", - .clear = " ", }, { .mask = _PAGE_WRITETHRU, .val= _PAGE_WRITETHRU, @@ -55,11 +40,26 @@ static const struct flag_info flag_array[] = { .val= _PAGE_NO_CACHE, .set= "i", .clear = " ", + }, { + .mask = _PAGE_GUARDED, + .val= _PAGE_GUARDED, + .set= "g", + .clear = " ", }, { .mask = _PAGE_SPECIAL, .val= _PAGE_SPECIAL, .set= "s", .clear = " ", + }, { + .mask = _PAGE_DIRTY, + .val= _PAGE_DIRTY, + .set= "d", + .clear = " ", + }, { + .mask = _PAGE_ACCESSED, + .val= _PAGE_ACCESSED, + .set= "a", + .clear = " ", } }; -- 2.25.0
[PATCH v1 11/46] powerpc/ptdump: Display size of BATs
Display the size of areas mapped with BATs. For that, the size display for pages is refactored. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/bats.c | 4 ++++ arch/powerpc/mm/ptdump/ptdump.c | 23 ++++++++++++++--------- arch/powerpc/mm/ptdump/ptdump.h | 2 ++ 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/ptdump/bats.c b/arch/powerpc/mm/ptdump/bats.c index d3a5d6b318d1..d6c660f63d71 100644 --- a/arch/powerpc/mm/ptdump/bats.c +++ b/arch/powerpc/mm/ptdump/bats.c @@ -10,6 +10,8 @@ #include #include +#include "ptdump.h" + static char *pp_601(int k, int pp) { if (pp == 0) @@ -42,6 +44,7 @@ static void bat_show_601(struct seq_file *m, int idx, u32 lower, u32 upper) #else seq_printf(m, "0x%08x ", pbn); #endif + pt_dump_size(m, size); seq_printf(m, "Kernel %s User %s", pp_601(k & 2, pp), pp_601(k & 1, pp)); @@ -88,6 +91,7 @@ static void bat_show_603(struct seq_file *m, int idx, u32 lower, u32 upper, bool #else seq_printf(m, "0x%08x ", brpn); #endif + pt_dump_size(m, size); if (k == 1) seq_puts(m, "User "); diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index d92bb8ea229c..1f97668853e3 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -112,6 +112,19 @@ static struct addr_marker address_markers[] = { seq_putc(m, c); \ }) +void pt_dump_size(struct seq_file *m, unsigned long size) +{ + static const char units[] = "KMGTPE"; + const char *unit = units; + + /* Work out what appropriate unit to use */ + while (!(size & 1023) && unit[1]) { + size >>= 10; + unit++; + } + pt_dump_seq_printf(m, "%9lu%c ", size, *unit); +} + static void dump_flag_info(struct pg_state *st, const struct flag_info *flag, u64 pte, int num) { @@ -146,8 +159,6 @@ static void dump_flag_info(struct pg_state *st, const struct flag_info static void dump_addr(struct pg_state *st, unsigned long addr) { - static const char units[] = "KMGTPE"; - const char *unit = units; unsigned long delta; #ifdef CONFIG_PPC64 @@ -164,13 +175,7 @@ static void dump_addr(struct pg_state *st, unsigned long addr) pt_dump_seq_printf(st->seq, " " REG " ", st->start_pa); delta = (addr - st->start_address) >> 10; } - /* Work out what appropriate unit to use */ - while (!(delta & 1023) && unit[1]) { - delta >>= 10; - unit++; - } - pt_dump_seq_printf(st->seq, "%9lu%c", delta, *unit); - + pt_dump_size(st->seq, delta); } static void note_prot_wx(struct pg_state *st, unsigned long addr) diff --git a/arch/powerpc/mm/ptdump/ptdump.h b/arch/powerpc/mm/ptdump/ptdump.h index 5d513636de73..b91f65f162d6 100644 --- a/arch/powerpc/mm/ptdump/ptdump.h +++ b/arch/powerpc/mm/ptdump/ptdump.h @@ -17,3 +17,5 @@ struct pgtable_level { }; extern struct pgtable_level pg_level[5]; + +void pt_dump_size(struct seq_file *m, unsigned long delta); -- 2.25.0
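A standalone rendering of the factored-out unit loop (same logic as pt_dump_size() above; the input is pre-scaled to kilobytes, as dump_addr() passes it):

#include <stdio.h>

static void pt_dump_size(unsigned long size)
{
	static const char units[] = "KMGTPE";
	const char *unit = units;

	/* scale up while evenly divisible by 1024 and a bigger unit exists */
	while (!(size & 1023) && unit[1]) {
		size >>= 10;
		unit++;
	}
	printf("%9lu%c\n", size, *unit);
}

int main(void)
{
	pt_dump_size(256);		/* prints "      256K" */
	pt_dump_size(8UL << 10);	/* prints "        8M" */
	pt_dump_size(1UL << 20);	/* prints "        1G" */
	return 0;
}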
[PATCH v1 07/46] powerpc/kasan: Declare kasan_init_region() weak
In order to allow sub-arches to allocate KASAN regions using optimised methods (huge pages on 8xx, BATs on BOOK3S, ...), declare kasan_init_region() weak.
Also make kasan_init_shadow_page_tables() accessible from outside, so that it can be called from the specific kasan_init_region() functions if needed.
And populate the remaining KASAN address space only once the region mapping is done, to allow the 8xx to allocate hugepd instead of standard page tables for mapping via 8M hugepages.
Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/kasan.h | 3 +++ arch/powerpc/mm/kasan/kasan_init_32.c | 21 +++++++++++---------- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index 4769bbf7173a..107a24c3f7b3 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -34,5 +34,8 @@ static inline void kasan_init(void) { } static inline void kasan_late_init(void) { } #endif +int kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end); +int kasan_init_region(void *start, size_t size); + #endif /* __ASSEMBLY */ #endif diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index 65fd8b891f8e..03d30ec7a858 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -28,7 +28,7 @@ static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot) __set_pte_at(&init_mm, va, ptep, pfn_pte(PHYS_PFN(pa), prot), 0); } -static int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end) +int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end) { pmd_t *pmd; unsigned long k_cur, k_next; @@ -52,7 +52,7 @@ static int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned return 0; } -static int __init kasan_init_region(void *start, size_t size) +int __init __weak kasan_init_region(void *start, size_t size) { unsigned long k_start = (unsigned long)kasan_mem_to_shadow(start); unsigned long k_end = (unsigned long)kasan_mem_to_shadow(start + size); @@ -122,14 +122,6 @@ static void __init kasan_mmu_init(void) int ret; struct memblock_region *reg; - if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE) || - IS_ENABLED(CONFIG_KASAN_VMALLOC)) { - ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, KASAN_SHADOW_END); - - if (ret) - panic("kasan: kasan_init_shadow_page_tables() failed"); - } - for_each_memblock(memory, reg) { phys_addr_t base = reg->base; phys_addr_t top = min(base + reg->size, total_lowmem); @@ -141,6 +133,15 @@ static void __init kasan_mmu_init(void) if (ret) panic("kasan: kasan_init_region() failed"); } + + if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE) || + IS_ENABLED(CONFIG_KASAN_VMALLOC)) { + ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, KASAN_SHADOW_END); + + if (ret) + panic("kasan: kasan_init_shadow_page_tables() failed"); + } + } void __init kasan_init(void) -- 2.25.0
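For reference, the __weak linkage pattern in generic form (GCC/Clang attribute; names here are illustrative, not the kernel's):

#include <stdio.h>

/* default in common code; any strong definition elsewhere wins at link time */
__attribute__((weak)) int init_region(void *start, unsigned long size)
{
	printf("generic mapping of %lu bytes with standard page tables\n", size);
	return 0;
}

/* A platform file would simply provide its own strong definition:
 *	int init_region(void *start, unsigned long size)
 *	{ ... map with 8M huge pages or BATs instead ... }
 */

int main(void)
{
	char region[64];

	return init_region(region, sizeof(region));
}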
[PATCH v1 06/46] powerpc/kasan: Refactor update of early shadow mappings
kasan_remap_early_shadow_ro() and kasan_unmap_early_shadow_vmalloc() are both updating the early shadow mapping: the first one sets the mapping read-only while the other clears the mapping. Refactor and create kasan_update_early_region() Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 39 +-- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index c9d053078c37..65fd8b891f8e 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -79,45 +79,42 @@ static int __init kasan_init_region(void *start, size_t size) return 0; } -static void __init kasan_remap_early_shadow_ro(void) +static void __init +kasan_update_early_region(unsigned long k_start, unsigned long k_end, pte_t pte) { - pgprot_t prot = kasan_prot_ro(); - unsigned long k_start = KASAN_SHADOW_START; - unsigned long k_end = KASAN_SHADOW_END; unsigned long k_cur; phys_addr_t pa = __pa(kasan_early_shadow_page); - kasan_populate_pte(kasan_early_shadow_pte, prot); - - for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { + for (k_cur = k_start; k_cur < k_end; k_cur += PAGE_SIZE) { pmd_t *pmd = pmd_ptr_k(k_cur); pte_t *ptep = pte_offset_kernel(pmd, k_cur); if ((pte_val(*ptep) & PTE_RPN_MASK) != pa) continue; - __set_pte_at(&init_mm, k_cur, ptep, pfn_pte(PHYS_PFN(pa), prot), 0); + __set_pte_at(&init_mm, k_cur, ptep, pte, 0); } - flush_tlb_kernel_range(KASAN_SHADOW_START, KASAN_SHADOW_END); + + flush_tlb_kernel_range(k_start, k_end); } -static void __init kasan_unmap_early_shadow_vmalloc(void) +static void __init kasan_remap_early_shadow_ro(void) { - unsigned long k_start = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_START); - unsigned long k_end = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_END); - unsigned long k_cur; + pgprot_t prot = kasan_prot_ro(); phys_addr_t pa = __pa(kasan_early_shadow_page); - for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { - pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), k_cur); - pte_t *ptep = pte_offset_kernel(pmd, k_cur); + kasan_populate_pte(kasan_early_shadow_pte, prot); - if ((pte_val(*ptep) & PTE_RPN_MASK) != pa) - continue; + kasan_update_early_region(KASAN_SHADOW_START, KASAN_SHADOW_END, + pfn_pte(PHYS_PFN(pa), prot)); +} - __set_pte_at(&init_mm, k_cur, ptep, __pte(0), 0); - } - flush_tlb_kernel_range(k_start, k_end); +static void __init kasan_unmap_early_shadow_vmalloc(void) +{ + unsigned long k_start = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_START); + unsigned long k_end = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_END); + + kasan_update_early_region(k_start, k_end, __pte(0)); } static void __init kasan_mmu_init(void) -- 2.25.0
[PATCH v1 10/46] powerpc/ptdump: Add _PAGE_COHERENT flag
For platforms using shared.c (4xx, Book3e, Book3s/32), also handle the _PAGE_COHERENT flag which corresponds to the M bit of the WIMG flags. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/ptdump/shared.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c index dab5d8028a9b..634b83aa3487 100644 --- a/arch/powerpc/mm/ptdump/shared.c +++ b/arch/powerpc/mm/ptdump/shared.c @@ -40,6 +40,11 @@ static const struct flag_info flag_array[] = { .val= _PAGE_NO_CACHE, .set= "i", .clear = " ", + }, { + .mask = _PAGE_COHERENT, + .val= _PAGE_COHERENT, + .set= "m", + .clear = " ", }, { .mask = _PAGE_GUARDED, .val= _PAGE_GUARDED, -- 2.25.0
[PATCH v1 03/46] powerpc/kasan: Fix issues by lowering KASAN_SHADOW_END
At the time being, KASAN_SHADOW_END is 0x100000000, which is 0 in 32 bits representation. This leads to a couple of issues:
- kasan_remap_early_shadow_ro() does nothing because the comparison k_cur < k_end is always false.
- In ptdump, address comparison for markers display fails and the marker's name is printed at the start of the KASAN area instead of being printed at the end.
However, there is no need to shadow the KASAN shadow area itself, so the KASAN shadow area can stop shadowing memory at the start of itself. With a PAGE_OFFSET set to 0xc0000000, the KASAN shadow area then goes from 0xf8000000 to 0xff000000.
Signed-off-by: Christophe Leroy Fixes: cbd18991e24f ("powerpc/mm: Fix an Oops in kasan_mmu_init()") Cc: sta...@vger.kernel.org --- arch/powerpc/include/asm/kasan.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index fbff9ff9032e..fc900937f653 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -23,9 +23,7 @@ #define KASAN_SHADOW_OFFSET ASM_CONST(CONFIG_KASAN_SHADOW_OFFSET) -#define KASAN_SHADOW_END 0UL - -#define KASAN_SHADOW_SIZE (KASAN_SHADOW_END - KASAN_SHADOW_START) +#define KASAN_SHADOW_END (-(-KASAN_SHADOW_START >> KASAN_SHADOW_SCALE_SHIFT)) #ifdef CONFIG_KASAN void kasan_early_init(void); -- 2.25.0
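The new formula checks out with the values in the commit message; a 32-bit verification (PAGE_OFFSET 0xc0000000 and KASAN_SHADOW_SCALE_SHIFT 3 assumed, as above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t shadow_start = 0xf8000000u;
	/* -0xf8000000 = 0x08000000; >> 3 = 0x01000000; negated = 0xff000000 */
	uint32_t shadow_end = -(-shadow_start >> 3);

	printf("KASAN_SHADOW_END = 0x%08x\n", shadow_end);	/* 0xff000000 */
	return 0;
}

The point of the expression is that the result stays representable in 32 bits, unlike the previous 0x100000000.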
[PATCH v1 04/46] powerpc/kasan: Fix shadow pages allocation failure
Doing KASAN page allocation in MMU_init is too early: the kernel doesn't yet have access to the entire memory space, so memblock_alloc() fails when the kernel is a bit big.
Do it from kasan_init() instead.
Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: sta...@vger.kernel.org Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/kasan.h | 2 -- arch/powerpc/mm/init_32.c | 2 -- arch/powerpc/mm/kasan/kasan_init_32.c | 4 +++- 3 files changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index fc900937f653..4769bbf7173a 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -27,12 +27,10 @@ #ifdef CONFIG_KASAN void kasan_early_init(void); -void kasan_mmu_init(void); void kasan_init(void); void kasan_late_init(void); #else static inline void kasan_init(void) { } -static inline void kasan_mmu_init(void) { } static inline void kasan_late_init(void) { } #endif diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c index 872df48ae41b..a6991ef8727d 100644 --- a/arch/powerpc/mm/init_32.c +++ b/arch/powerpc/mm/init_32.c @@ -170,8 +170,6 @@ void __init MMU_init(void) btext_unmap(); #endif - kasan_mmu_init(); - setup_kup(); /* Shortly after that, the entire linear mapping will be available */ diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index 60c2acdf73a7..c41e700153da 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -131,7 +131,7 @@ static void __init kasan_unmap_early_shadow_vmalloc(void) flush_tlb_kernel_range(k_start, k_end); } -void __init kasan_mmu_init(void) +static void __init kasan_mmu_init(void) { int ret; struct memblock_region *reg; @@ -159,6 +159,8 @@ void __init kasan_mmu_init(void) void __init kasan_init(void) { + kasan_mmu_init(); + kasan_remap_early_shadow_ro(); clear_page(kasan_early_shadow_page); -- 2.25.0
[PATCH v1 00/46] Use hugepages to map kernel mem on 8xx
The main purpose of this big series is to:
- reorganise huge page handling to avoid using mm_slices.
- use huge pages to map kernel memory on the 8xx.

The 8xx supports 4 page sizes: 4k, 16k, 512k and 8M. It uses 2-level page tables, the PGD having 1024 entries, each entry covering a 4M address space. Each page table then has 1024 entries. At the time being, page sizes are managed in PGD entries, implying the use of mm_slices as one page table cannot mix several page sizes.

The first purpose of this series is to reorganise things so that standard page tables can also handle 512k pages. This is done by adding a new _PAGE_HUGE flag which will be copied into the Level 1 entry in the TLB miss handler. That done, we have 2 types of pages:
- PGD entries to regular page tables handling 4k/16k and 512k pages
- PGD entries to hugepd tables handling 8M pages.

There is no need to mix 8M pages with other sizes, because an 8M page will use more than what a single PGD entry covers (see the arithmetic sketch after the patch list below).

Then comes the second purpose of this series. At the time being, the 8xx has implemented special handling in the TLB miss handlers in order to transparently map kernel linear address space and the IMMR using huge pages, by building the TLB entries in assembly at the time of the exception. As mm_slices is only for user space pages, and also because it would anyway not be convenient to slice kernel address space, it was not possible to use huge pages for kernel address space. But after step one of the series, it is now more flexible to use huge pages.

This series drops all assembly 'just in time' handling of huge pages and uses huge pages in page tables instead.

Once the above is done, then comes the cherry on the cake:
- Use huge pages for KASAN shadow mapping
- Allow pinned TLBs with strict kernel rwx
- Allow pinned TLBs with debug pagealloc

Then, last but not least, those modifications for the 8xx allow the following improvements on book3s/32:
- Mapping KASAN shadow with BATs
- Allowing BATs with debug pagealloc

All this allows to considerably simplify the TLB miss handlers and the associated initialisation. The overhead of reading page tables is negligible compared to the reduction of the miss handlers.

While we were at touching pte_update(), some cleanup was done there too.

Tested widely on 8xx and 832x. Boot tested on QEMU MAC99.

Christophe Leroy (46):
powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC
powerpc/kasan: Fix error detection on memory allocation
powerpc/kasan: Fix issues by lowering KASAN_SHADOW_END
powerpc/kasan: Fix shadow pages allocation failure
powerpc/kasan: Remove unnecessary page table locking
powerpc/kasan: Refactor update of early shadow mappings
powerpc/kasan: Declare kasan_init_region() weak
powerpc/ptdump: Limit size of flags text to 1/2 chars on PPC32
powerpc/ptdump: Reorder flags
powerpc/ptdump: Add _PAGE_COHERENT flag
powerpc/ptdump: Display size of BATs
powerpc/ptdump: Standardise display of BAT flags
powerpc/ptdump: Properly handle non standard page size
powerpc/ptdump: Handle hugepd at PGD level
powerpc/32s: Don't warn when mapping RO data ROX.
powerpc/mm: Allocate static page tables for fixmap
powerpc/mm: Fix conditions to perform MMU specific management by blocks on PPC32.
powerpc/mm: PTE_ATOMIC_UPDATES is only for 40x
powerpc/mm: Refactor pte_update() on nohash/32
powerpc/mm: Refactor pte_update() on book3s/32
powerpc/mm: Standardise __ptep_test_and_clear_young() params between PPC32 and PPC64
powerpc/mm: Standardise pte_update() prototype between PPC32 and PPC64
powerpc/mm: Create a dedicated pte_update() for 8xx
powerpc/mm: Reduce hugepd size for 8M hugepages on 8xx
powerpc/8xx: Drop CONFIG_8xx_COPYBACK option
powerpc/8xx: Prepare handlers for _PAGE_HUGE for 512k pages.
powerpc/8xx: Manage 512k huge pages as standard pages.
powerpc/8xx: Only 8M pages are hugepte pages now
powerpc/8xx: MM_SLICE is not needed anymore
powerpc/8xx: Move PPC_PIN_TLB options into 8xx Kconfig
powerpc/8xx: Add function to update pinned TLBs
powerpc/8xx: Don't set IMMR map anymore at boot
powerpc/8xx: Always pin TLBs at startup.
powerpc/8xx: Drop special handling of Linear and IMMR mappings in I/D TLB handlers
powerpc/8xx: Remove now unused TLB miss functions
powerpc/8xx: Move DTLB perf handling closer.
powerpc/mm: Don't be too strict with _etext alignment on PPC32
powerpc/8xx: Refactor kernel address boundary comparison
powerpc/8xx: Add a function to early map kernel via huge pages
powerpc/8xx: Map IMMR with a huge page
powerpc/8xx: Map linear memory with huge pages
powerpc/8xx: Allow STRICT_KERNEL_RWX with pinned TLB
powerpc/8xx: Allow large TLBs with DEBUG_PAGEALLOC
powerpc/8xx: Implement dedicated kasan_init_region()
powerpc/32s: Allow mapping with BATs with DEBUG_PAGEALLOC
powerpc/32s: Implement dedicated kasan_init_region()
arch/powerpc/Kconfig | 62 +--- arch/powerpc/confi
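As a footnote to the cover letter, the arithmetic behind "an 8M page will use more than what a single PGD entry covers", using the figures given above (1024 PGD entries, 4M each):

#include <stdio.h>

int main(void)
{
	unsigned long pgd_entries = 1024;
	unsigned long per_entry = 4UL << 20;	/* 4M per PGD entry */

	printf("PGD covers %lu GB\n",
	       pgd_entries * per_entry >> 30);		/* 4: the whole 32-bit space */
	printf("an 8M page spans %lu PGD entries\n",
	       (8UL << 20) / per_entry);		/* 2: hence hugepd at PGD level */
	return 0;
}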
[PATCH v1 01/46] powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC
With CONFIG_KASAN_VMALLOC, new page tables are created at the time shadow memory for the vmalloc area is unmapped. If some parts of the page table still have entries pointing to the zero page shadow memory, those entries are wrongly marked RW.
With CONFIG_KASAN_VMALLOC, almost the entire kernel address space is managed by KASAN. To make it simple, just create KASAN page tables for the entire kernel space at kasan_init(). That doesn't use much more space, and that's anyway already done for hash platforms.
Fixes: 3d4247fcc938 ("powerpc/32: Add support of KASAN_VMALLOC") Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index f19526e7d3dc..750a927839d7 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -120,12 +120,6 @@ static void __init kasan_unmap_early_shadow_vmalloc(void) unsigned long k_cur; phys_addr_t pa = __pa(kasan_early_shadow_page); - if (!early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) { - int ret = kasan_init_shadow_page_tables(k_start, k_end); - - if (ret) - panic("kasan: kasan_init_shadow_page_tables() failed"); - } for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), k_cur); pte_t *ptep = pte_offset_kernel(pmd, k_cur); @@ -143,7 +137,8 @@ void __init kasan_mmu_init(void) int ret; struct memblock_region *reg; - if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) { + if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE) || + IS_ENABLED(CONFIG_KASAN_VMALLOC)) { ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, KASAN_SHADOW_END); if (ret) -- 2.25.0
[PATCH v1 02/46] powerpc/kasan: Fix error detection on memory allocation
In case (k_start & PAGE_MASK) doesn't equal (k_start), 'va' will never be NULL although 'block' is NULL.
Check the return of memblock_alloc() directly instead of the resulting address in the loop.
Fixes: 509cd3f2b473 ("powerpc/32: Simplify KASAN init") Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index 750a927839d7..60c2acdf73a7 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -76,15 +76,14 @@ static int __init kasan_init_region(void *start, size_t size) return ret; block = memblock_alloc(k_end - k_start, PAGE_SIZE); + if (!block) + return -ENOMEM; for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { pmd_t *pmd = pmd_ptr_k(k_cur); void *va = block + k_cur - k_start; pte_t pte = pfn_pte(PHYS_PFN(__pa(va)), PAGE_KERNEL); - if (!va) - return -ENOMEM; - __set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0); } flush_tlb_kernel_range(k_start, k_end); -- 2.25.0
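A standalone demo of the bug being fixed (hypothetical addresses): with 'block' NULL, the derived 'va = block + k_cur - k_start' is non-zero whenever k_cur != k_start, so testing 'va' misses the allocation failure.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uintptr_t block = 0;				/* memblock_alloc() failed */
	uintptr_t k_start = 0xf8000123u;		/* not page aligned */
	uintptr_t k_cur = k_start & ~(uintptr_t)0xfff;	/* k_start & PAGE_MASK */
	uintptr_t va = block + k_cur - k_start;

	printf("va = %#lx: NULL check %s\n", (unsigned long)va,
	       va ? "passes, failure missed" : "catches the failure");
	return 0;
}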
Re: [PATCH v4 0/6] implement KASLR for powerpc/fsl_booke/64
ping...

On 2020/3/6 14:40, Jason Yan wrote:
This is a try to implement KASLR for Freescale BookE64 which is based on my earlier implementation for Freescale BookE32:
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131718&state=*

The implementation for Freescale BookE64 is similar to BookE32. One difference is that Freescale BookE64 sets up a TLB mapping of 1G during booting. Another difference is that ppc64 needs the kernel to be 64K-aligned. So we can randomize the kernel in this 1G mapping and make it 64K-aligned. This can save some code that would otherwise create another TLB map at early boot. The disadvantage is that we only have about 1G/64K = 16384 slots to put the kernel in.

    KERNELBASE

            64K                     |--> kernel <--|
     |      |                       |              |
    +--+--+--+....+--+--+--+--+--+--+--+--+--+....+--+--+
    |  |  |  |    |  |  |  |  |  |  |  |  |  |    |  |  |
    +--+--+--+....+--+--+--+--+--+--+--+--+--+....+--+--+
    |                               |                  1G
    |----->         offset    <-----|

                            kernstart_virt_addr

I'm not sure if the number of slots is enough or if the design has any defects. If you have some better ideas, I would be happy to hear them. Thank you all.

v3->v4:
Do not define __kaslr_offset as a fixed symbol. Reference __run_at_load and __kaslr_offset by symbol instead of magic offsets.
Use IS_ENABLED(CONFIG_PPC32) instead of #ifdef CONFIG_PPC32.
Change kaslr-booke32 to kaslr-booke in index.rst
Switch some instructions to 64-bit.

v2->v3: Fix build error when KASLR is disabled.

v1->v2: Add __kaslr_offset for the secondary cpu boot up.

Jason Yan (6):
powerpc/fsl_booke/kaslr: refactor kaslr_legal_offset() and kaslr_early_init()
powerpc/fsl_booke/64: introduce reloc_kernel_entry() helper
powerpc/fsl_booke/64: implement KASLR for fsl_booke64
powerpc/fsl_booke/64: do not clear the BSS for the second pass
powerpc/fsl_booke/64: clear the original kernel if randomized
powerpc/fsl_booke/kaslr: rename kaslr-booke32.rst to kaslr-booke.rst and add 64bit part

Documentation/powerpc/index.rst | 2 +-
.../{kaslr-booke32.rst => kaslr-booke.rst} | 35 +++-
arch/powerpc/Kconfig | 2 +-
arch/powerpc/kernel/exceptions-64e.S | 23 +
arch/powerpc/kernel/head_64.S | 13 +++
arch/powerpc/kernel/setup_64.c | 3 +
arch/powerpc/mm/mmu_decl.h | 23 ++---
arch/powerpc/mm/nohash/kaslr_booke.c | 88 +--
8 files changed, 144 insertions(+), 45 deletions(-)
rename Documentation/powerpc/{kaslr-booke32.rst => kaslr-booke.rst} (59%)
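The slot arithmetic from the cover letter, spelled out:

#include <stdio.h>

int main(void)
{
	unsigned long region = 1UL << 30;	/* the 1G boot TLB mapping */
	unsigned long align = 64UL << 10;	/* ppc64 needs 64K alignment */

	printf("slots = %lu\n", region / align);	/* 16384 */
	return 0;
}

Whether 16384 positions provides enough entropy is exactly the open question raised above.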