Re: [PATCH 1/6] dump_stack: Support adding to the dump stack arch description
On Tue, 2015-05-05 at 14:16 -0700, Andrew Morton wrote:
> On Tue, 5 May 2015 21:12:12 +1000 Michael Ellerman wrote:
>
> > Arch code can set a "dump stack arch description string" which is
> > displayed with oops output to describe the hardware platform.
> > +
> > +	len = strnlen(dump_stack_arch_desc_str,
> > +		      sizeof(dump_stack_arch_desc_str));
> > +	pos = len;
> > +
> > +	if (len)
> > +		pos++;
> > +
> > +	if (pos >= sizeof(dump_stack_arch_desc_str))
> > +		return; /* Ran out of space */
> > +
> > +	p = &dump_stack_arch_desc_str[pos];
> > +
> > +	va_start(args, fmt);
> > +	vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args);
> > +	va_end(args);
>
> This code is almost race-free. A (documented) smp_wmb() in here would
> make that 100%?
>
> > +	if (len)
> > +		dump_stack_arch_desc_str[len] = ' ';
> > +}

On second thoughts I don't think it would.

It would order the stores in vsnprintf() vs the store of the space. The idea
being you never see a partially printed string.

But for that to actually work you need a barrier on the read side, and where do
you put it? The cpu printing the buffer could speculate the load of the tail of
the buffer, seeing something half printed from vsnprintf(), and then load the
head of the buffer and see the space, unless you order those loads.

So I don't think we can prevent a crashing cpu seeing a semi-printed buffer
without a lock, and we don't want to add a lock.

The other issue would be that a reader could miss the trailing NULL from the
vsnprintf() but see the space, meaning it would wander off the end of the
buffer. But the buffer's in BSS to start with, and we're careful not to print
off the end of it, so it should always be NULL terminated.

cheers

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
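The append logic quoted in this message can be tried outside the kernel. What follows is a minimal userspace sketch, not the patch itself: arch_desc and arch_desc_append() are hypothetical names standing in for dump_stack_arch_desc_str and the patched function, and the 128-byte buffer size is an assumption. It only shows the order of the stores being discussed (new text first, separating space last). It is single-threaded, so it neither demonstrates nor solves the read-side ordering problem raised in the reply.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */

#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for dump_stack_arch_desc_str; zeroed at start, like BSS. */
char arch_desc[128];

/* Append a formatted fragment, separated from any existing text by one space. */
void arch_desc_append(const char *fmt, ...)
{
	size_t len, pos;
	va_list args;

	assert(fmt != NULL);

	len = strnlen(arch_desc, sizeof(arch_desc));
	pos = len;

	if (len)
		pos++;		/* skip the slot reserved for the space */

	if (pos >= sizeof(arch_desc))
		return;		/* ran out of space */

	/* Write the new text after the old terminator; it is not visible yet. */
	va_start(args, fmt);
	vsnprintf(&arch_desc[pos], sizeof(arch_desc) - pos, fmt, args);
	va_end(args);

	/* Last store: replace the old terminator so the new text joins the string. */
	if (len)
		arch_desc[len] = ' ';
}
```

Until the final store, a reader in the same thread still sees the old string, because the old terminator is left in place while vsnprintf() fills the tail. Whether another CPU observes the stores in that order is exactly the open question in the thread.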
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
On 05/08/2015 09:58 AM, Ingo Molnar wrote:
> * Hemant Kumar wrote:
>
>> # perf kvm stat report -p 60515
>>
>> Analyze events for pid(s) 60515, all VCPUs:
>>
>>          VM-EXIT   Samples  Samples%   Time%  Min Time     Max Time    Avg time
>>
>>   H_DATA_STORAGE      5006    35.30%   0.13%    1.94us      49.46us     12.37us ( +- 0.52% )
>>   HV_DECREMENTER      4457    31.43%   0.02%    0.72us      16.14us      1.91us ( +- 0.96% )
>>          SYSCALL      2690    18.97%   0.10%    2.84us     528.24us     18.29us ( +- 3.75% )
>>   RETURN_TO_HOST      1789    12.61%  99.76%    1.58us  672791.91us  27470.23us ( +- 3.00% )
>>          EXTERNAL       240     1.69%   0.00%    0.69us      10.67us      1.33us ( +- 5.34% )
> Where is the last line misaligned? Copy & paste error or does perf kvm
> produce it in such a way?

It's a copy-paste error. Thanks for pointing this out.

Shall I resend the patches with the correct alignment of the o/p?

> Thanks,
>
> 	Ingo

-- 
Thanks,
Hemant Kumar
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
* Hemant Kumar wrote:

> On 05/08/2015 09:58 AM, Ingo Molnar wrote:
> >* Hemant Kumar wrote:
> >
> >> # perf kvm stat report -p 60515
> >>
> >> Analyze events for pid(s) 60515, all VCPUs:
> >>
> >>          VM-EXIT   Samples  Samples%   Time%  Min Time     Max Time    Avg time
> >>
> >>   H_DATA_STORAGE      5006    35.30%   0.13%    1.94us      49.46us     12.37us ( +- 0.52% )
> >>   HV_DECREMENTER      4457    31.43%   0.02%    0.72us      16.14us      1.91us ( +- 0.96% )
> >>          SYSCALL      2690    18.97%   0.10%    2.84us     528.24us     18.29us ( +- 3.75% )
> >>   RETURN_TO_HOST      1789    12.61%  99.76%    1.58us  672791.91us  27470.23us ( +- 3.00% )
> >>          EXTERNAL       240     1.69%   0.00%    0.69us      10.67us      1.33us ( +- 5.34% )
> >Where is the last line misaligned? Copy & paste error or does perf kvm
> >produce it in such a way?
>
> It's a copy-paste error. Thanks for pointing this out.
>
> Shall I resend the patches with the correct alignment of the o/p?

I don't think that's necessary, as long as the code is fine.

Thanks,

	Ingo
Re: [PATCH v3 0/6] powernv: cpufreq: Report frequency throttle by OCC
On 4 May 2015 at 14:24, Shilpasri G Bhat wrote:
> This patchset intends to add frequency throttle reporting mechanism
> to powernv-cpufreq driver when OCC throttles the frequency. OCC is an
> On-Chip-Controller which takes care of the power and thermal safety of
> the chip. The CPU frequency can be throttled during an OCC reset or
> when OCC tries to limit the max allowed frequency. The patchset will
> report such conditions so as to keep the user informed about reason
> for the drop in performance of workloads when frequency is throttled.
>
> Changes from v2:
> - Split into multiple patches
> - Semantic fixes
>
> Shilpasri G Bhat (6):
>   cpufreq: powernv: Handle throttling due to Pmax capping at chip level
>   powerpc/powernv: Add definition of OPAL_MSG_OCC message type
>   cpufreq: powernv: Register for OCC related opal_message notification
>   cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
>   cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is
>     set
>   cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
>
>  arch/powerpc/include/asm/opal-api.h |   8 ++
>  drivers/cpufreq/powernv-cpufreq.c   | 199 +---
>  2 files changed, 192 insertions(+), 15 deletions(-)

Acked-by: Viresh Kumar
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
* Hemant Kumar wrote:

> # perf kvm stat report -p 60515
>
> Analyze events for pid(s) 60515, all VCPUs:
>
>          VM-EXIT   Samples  Samples%   Time%  Min Time     Max Time    Avg time
>
>   H_DATA_STORAGE      5006    35.30%   0.13%    1.94us      49.46us     12.37us ( +- 0.52% )
>   HV_DECREMENTER      4457    31.43%   0.02%    0.72us      16.14us      1.91us ( +- 0.96% )
>          SYSCALL      2690    18.97%   0.10%    2.84us     528.24us     18.29us ( +- 3.75% )
>   RETURN_TO_HOST      1789    12.61%  99.76%    1.58us  672791.91us  27470.23us ( +- 3.00% )
>          EXTERNAL       240     1.69%   0.00%    0.69us      10.67us      1.33us ( +- 5.34% )

Where is the last line misaligned? Copy & paste error or does perf kvm
produce it in such a way?

Thanks,

	Ingo
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
>> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
>>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>>>> Hi Preeti,
>>>>
>>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>>>> Hi Shilpa,
>>>>>
>>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>>>> notification by executing *throttle_check() on any one of the cpu on
>>>>>> the chip. This is a sanity check to verify if we were indeed
>>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>>>
>>>>>> We cannot call *throttle_check() directly from the notification
>>>>>> handler because we could be handling chip1's notification in chip2. So
>>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>>>> in the notification handler, so use a worker thread to smp_call
>>>>>> throttle_check() on any of the cpu in the chipmask.
>>>>>
>>>>> I see that the first patch takes care of reporting *per-chip* throttling
>>>>> for pmax capping condition. But where are we taking care of reporting
>>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>>>
>>>>
>>>> IMO let us not have "psafe" and "freq control disabled" states managed
>>>> per-chip. Because when the above two conditions occur it is likely to
>>>> happen across all chips during an OCC reset cycle. So I am setting
>>>> 'throttled' to false on OCC_ACTIVE and re-verifying if it actually is
>>>> the case by invoking *throttle_check().
>>>
>>> Alright like I pointed in the previous reply, a comment to indicate that
>>> psafe and freq control disabled conditions will fail when occ is
>>> inactive and that all chips face the consequence of this will help.
>>
>> From your explanation on the thread of the first patch of this series,
>> this will not be required.
>>
>> So,
>> Reviewed-by: Preeti U Murthy
>
> OK, so is the whole series reviewed now?

Yes the whole series has been reviewed.

Regards
Preeti U Murthy
Re: [PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg
On Thu, May 07, 2015 at 11:06:06AM -0400, Dan Streetman wrote:
>
> The crypto 842-nx has (significant) code in it to handle any alignment
> and length input buffers, to match them to what the driver requires.
> Would it be better to move that into the crypto code, so that any
> crypto compression hw driver can request buffers be specifically
> aligned/sized? I did have to use a header on each compressed buffer
> that needed re-alignment or re-sizing, so maybe it's not appropriate
> for common crypto compression code.

Yes we could certainly move this logic into the crypto layer,
as we do for ciphers and hashes. But as you say we could make
the next guy who writes a comp driver do this :)

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH v2 1/2] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
Hello Vipin, On Thu, May 7, 2015 at 7:00 PM, Vipin K Parashar wrote: > This patch adds support for FSP EPOW (Early Power Off Warning) and > DPO (Delayed Power Off) events support for PowerNV platform. I reviewed this patch for the changes it made to the existing poweroff code, you still need someone to look at the EPOW code itself. > Signed-off-by: Vipin K Parashar > --- > arch/powerpc/include/asm/opal-api.h| 30 ++ > arch/powerpc/include/asm/opal.h| 3 +- > arch/powerpc/platforms/powernv/opal-power.c| 379 > +++-- > arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + > 4 files changed, 391 insertions(+), 22 deletions(-) > > /* Internal functions */ > extern int early_init_dt_scan_opal(unsigned long node, const char *uname, > diff --git a/arch/powerpc/platforms/powernv/opal-power.c > b/arch/powerpc/platforms/powernv/opal-power.c > index ac46c2c..7c1b2f8 100644 > --- a/arch/powerpc/platforms/powernv/opal-power.c > +++ b/arch/powerpc/platforms/powernv/opal-power.c > @@ -1,5 +1,5 @@ > /* > - * PowerNV OPAL power control for graceful shutdown handling > + * PowerNV poweroff events support > * > * Copyright 2015 IBM Corp. > * > @@ -9,58 +9,395 @@ > * 2 of the License, or (at your option) any later version. > */ > > +#define pr_fmt(fmt)"POWEROFF_EVENT: "fmt OPAL_POWER? > + > #include > +#include > +#include > #include > -#include > - > +#include > #include > #include > > -#define SOFT_OFF 0x00 > -#define SOFT_REBOOT 0x01 > +/* Power control event types */ > +#define SOFT_OFF 0x00 > +#define SOFT_REBOOT0x01 While you're touching this code, I think these should be moved to opal-api.h > + > +/* Max time for graceful system shutdown including guests. 
*/ > +#define MAX_POWEROFF_SYS_TIME 600 > + > +/* IPMI power-control events notifier */ > static int opal_power_control_event(struct notifier_block *nb, > - unsigned long msg_type, void *msg) > + unsigned long msg_type, void *msg) > { > - struct opal_msg *power_msg = msg; > uint64_t type; > + struct opal_msg *power_msg = msg; > > type = be64_to_cpu(power_msg->params[0]); > > switch (type) { > case SOFT_REBOOT: > - pr_info("OPAL: reboot requested\n"); > + pr_info("Reboot requested\n"); I prefer the OPAL prefix. > orderly_reboot(); > break; > case SOFT_OFF: > - pr_info("OPAL: poweroff requested\n"); > + pr_info("Poweroff requested\n"); Ditto. > orderly_poweroff(true); > break; > default: > - pr_err("OPAL: power control type unexpected %016llx\n", type); > + pr_err("Unknown event %llu\n", type); Ditto. > } > > return 0; > } > > +/* OPAL EPOW event notifier block */ > +static struct notifier_block opal_epow_nb = { > + .notifier_call = opal_epow_event, > + .next = NULL, > + .priority = 0, > +}; > + > +/* OPAL DPO event notifier block */ > +static struct notifier_block opal_dpo_nb = { > + .notifier_call = opal_dpo_event, > + .next = NULL, > + .priority = 0, > +}; > + > +/* OPAL Power control events */ > static struct notifier_block opal_power_control_nb = { > - .notifier_call = opal_power_control_event, > - .next = NULL, > - .priority = 0, > + .notifier_call = opal_power_control_event, > + .next = NULL, > + .priority = 0, > }; Looks like you changed the whitespace? > > -static int __init opal_power_control_init(void) > +/* Poweroff events init */ > +static int __init opal_poweroff_events_init(void) This comment does not add any value. Renaming the function doesn't add much either. 
> { > int ret; > + struct device_node *node_epow; > > - ret = opal_message_notifier_register(OPAL_MSG_SHUTDOWN, > -&opal_power_control_nb); > - if (ret) { > - pr_err("%s: Can't register OPAL event notifier (%d)\n", > - __func__, ret); > - return ret; > + /* > + * Determine EPOW, DPO support in hardware. > + */ > + node_epow = of_find_node_by_path("/ibm,opal/epow"); > + if (node_epow) { > + if (of_device_is_compatible(node_epow, "ibm,opal-epow")) { > + epow_supported = true; > + dpo_supported = true; Why are these separate flags? Do we have any systems that will support EPOW but not DPO, or DPO without EPOW? I suggest merging them into the one flag. > + pr_info("OPAL EPOW, DPO support detected.\n"); > + } > + of_node_put(node_epow); > + } > + > + /* Prepare to handle EPOW e
[PATCH v3 2/2] perf/kvm: Support HCALL events
powerpc provides hcall events that also provide insights into guest
behaviour. Enhance perf kvm to record and analyze hcall events.

 - To trace hcall events :
  perf kvm stat record

 - To show the results :
  perf kvm stat report --event=hcall

The result shows the number of hypervisor calls from the guest grouped
by their respective reasons displayed with the frequency.

This patch makes use of two additional tracepoints
"kvm_hv:kvm_hcall_enter" and "kvm_hv:kvm_hcall_exit". It uses the
pSeries hypervisor codes exported through uapi to classify the hcalls
into their respective reasons.

Note : This patch has a dependency on "kvm/powerpc: Export HCALL reason
codes" which exports HCALL reasons through uapi.

A sample output :

 # pgrep qemu
 19378
 60515

2 VMs running.

 # perf kvm stat record -a
 ^C[ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]

 # perf kvm stat report -p 60515 --event=hcall

Analyze events for pid(s) 60515, all VCPUs:

     HCALL-EVENT   Samples  Samples%   Time%  Min Time  Max Time  Avg time

    H_VIO_SIGNAL      1034    38.44%  15.77%    0.36us    1.59us    0.44us ( +- 0.66% )
      H_SEND_CRQ       652    24.24%  10.97%    0.39us    1.84us    0.49us ( +- 1.20% )
           H_IPI       523    19.44%  62.05%    1.35us   19.70us    3.44us ( +- 2.88% )
 H_PUT_TERM_CHAR       411    15.28%   8.03%    0.38us    3.77us    0.57us ( +- 1.61% )
 H_GET_TERM_CHAR        50     1.86%   0.99%    0.40us    0.98us    0.57us ( +- 3.37% )
           H_EOI        20     0.74%   2.19%    2.22us    4.72us    3.17us ( +- 5.96% )

Total Samples:2690, Total events handled time:2896.94us.

Signed-off-by: Hemant Kumar
---
Patch has a dependency on https://patchwork.ozlabs.org/patch/469841/ which
exports the HCALL reason codes to perf.
arch/powerpc/include/uapi/asm/kvm_perf.h | 4 +++ tools/perf/arch/powerpc/util/kvm-stat.c | 61 2 files changed, 65 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm_perf.h b/arch/powerpc/include/uapi/asm/kvm_perf.h index 30fa670..440902e 100644 --- a/arch/powerpc/include/uapi/asm/kvm_perf.h +++ b/arch/powerpc/include/uapi/asm/kvm_perf.h @@ -3,6 +3,7 @@ #include #include +#include #define DECODE_STR_LEN 20 @@ -11,5 +12,8 @@ #define KVM_ENTRY_TRACE "kvm_hv:kvm_guest_enter" #define KVM_EXIT_TRACE "kvm_hv:kvm_guest_exit" #define KVM_EXIT_REASON "trap" +#define KVM_HCALL_ENTRY_TRACE "kvm_hv:kvm_hcall_enter" +#define KVM_HCALL_EXIT_TRACE "kvm_hv:kvm_hcall_exit" +#define KVM_HCALL_REASON "req" #endif /* _ASM_POWERPC_KVM_PERF_H */ diff --git a/tools/perf/arch/powerpc/util/kvm-stat.c b/tools/perf/arch/powerpc/util/kvm-stat.c index 62cdcc1..685201c 100644 --- a/tools/perf/arch/powerpc/util/kvm-stat.c +++ b/tools/perf/arch/powerpc/util/kvm-stat.c @@ -1,7 +1,9 @@ #include "../../util/kvm-stat.h" #include +#include "../../util/debug.h" define_exit_reasons_table(hv_exit_reasons, kvm_trace_symbol_exit); +define_exit_reasons_table(hcall_reasons, kvm_trace_symbol_hcall); static struct kvm_events_ops exit_events = { .is_begin_event = exit_event_begin, @@ -10,14 +12,73 @@ static struct kvm_events_ops exit_events = { .name = "VM-EXIT" }; +static void hcall_event_get_key(struct perf_evsel *evsel, + struct perf_sample *sample, + struct event_key *key) +{ + key->info = 0; + key->key = perf_evsel__intval(evsel, sample, KVM_HCALL_REASON); +} + +static const char *get_exit_reason(u64 exit_code) +{ + struct exit_reasons_table *tbl = hcall_reasons; + + while (tbl->reason != NULL) { + if (tbl->exit_code == exit_code) + return tbl->reason; + tbl++; + } + + pr_err("Unknown kvm hcall exit code: %lld\n", + (unsigned long long)exit_code); + return "UNKNOWN"; +} + +static bool hcall_event_end(struct perf_evsel *evsel, + struct perf_sample *sample __maybe_unused, + struct event_key *key 
__maybe_unused) +{ + return (!strcmp(evsel->name, KVM_HCALL_EXIT_TRACE)); +} + +static bool hcall_event_begin(struct perf_evsel *evsel, + struct perf_sample *sample, struct event_key *key) +{ + if (!strcmp(evsel->name, KVM_HCALL_ENTRY_TRACE)) { + hcall_event_get_key(evsel, sample, key); + return true; + } + +return false; +} +static void hcall_event_decode_key(struct perf_kvm_stat *kvm __maybe_unused, + struct event_key *key, + char *decode) +{ + const char *hcall_reason = get_exit_reason(key->key); + + scnprintf(decode, DECODE_STR_LEN, "%s", hcall_reason); +} + +static struct kv
[PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
From: Srikar Dronamraju

perf kvm can be used to analyze guest exit reasons. This support already
exists in x86. Hence, porting it to powerpc.

 - To trace KVM events :
  perf kvm stat record
  If many guests are running, we can track for a specific guest by using
  --pid as in : perf kvm stat record --pid

 - To see the results :
  perf kvm stat report

The result shows the number of exits (from the guest context to
host/hypervisor context) grouped by their respective exit reasons with
their frequency.

This patch makes use of the guest exit reasons available in
"trace_book3s.h". It records on two already available tracepoints :
"kvm_hv:kvm_guest_exit" and "kvm_hv:kvm_guest_enter".

Note : This patch has a dependency on the patch "kvm/powerpc: Export kvm
exit reasons" which exports the KVM exit reasons through the uapi.

Here is a sample o/p:

 # pgrep qemu
 19378
 60515

2 Guests are running on the host.

 # perf kvm stat record -a
 ^C[ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]

 # perf kvm stat report -p 60515

Analyze events for pid(s) 60515, all VCPUs:

         VM-EXIT   Samples  Samples%   Time%  Min Time     Max Time    Avg time

  H_DATA_STORAGE      5006    35.30%   0.13%    1.94us      49.46us     12.37us ( +- 0.52% )
  HV_DECREMENTER      4457    31.43%   0.02%    0.72us      16.14us      1.91us ( +- 0.96% )
         SYSCALL      2690    18.97%   0.10%    2.84us     528.24us     18.29us ( +- 3.75% )
  RETURN_TO_HOST      1789    12.61%  99.76%    1.58us  672791.91us  27470.23us ( +- 3.00% )
         EXTERNAL       240     1.69%   0.00%    0.69us      10.67us      1.33us ( +- 5.34% )

Total Samples:14182, Total events handled time:49264158.30us.

Signed-off-by: Srikar Dronamraju
Signed-off-by: Hemant Kumar
---
Patch has a dependency on : https://patchwork.ozlabs.org/patch/469839/
which exports the exit reasons to perf through uapi.

Changes:
- Original series split into two patchsets now : perf and powerpc side changes.
arch/powerpc/include/uapi/asm/kvm_perf.h | 15 +++ tools/perf/arch/powerpc/Makefile | 1 + tools/perf/arch/powerpc/util/Build | 1 + tools/perf/arch/powerpc/util/kvm-stat.c | 33 4 files changed, 50 insertions(+) create mode 100644 arch/powerpc/include/uapi/asm/kvm_perf.h create mode 100644 tools/perf/arch/powerpc/util/kvm-stat.c diff --git a/arch/powerpc/include/uapi/asm/kvm_perf.h b/arch/powerpc/include/uapi/asm/kvm_perf.h new file mode 100644 index 000..30fa670 --- /dev/null +++ b/arch/powerpc/include/uapi/asm/kvm_perf.h @@ -0,0 +1,15 @@ +#ifndef _ASM_POWERPC_KVM_PERF_H +#define _ASM_POWERPC_KVM_PERF_H + +#include +#include + +#define DECODE_STR_LEN 20 + +#define VCPU_ID "vcpu_id" + +#define KVM_ENTRY_TRACE "kvm_hv:kvm_guest_enter" +#define KVM_EXIT_TRACE "kvm_hv:kvm_guest_exit" +#define KVM_EXIT_REASON "trap" + +#endif /* _ASM_POWERPC_KVM_PERF_H */ diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile index 7fbca17..21322e0 100644 --- a/tools/perf/arch/powerpc/Makefile +++ b/tools/perf/arch/powerpc/Makefile @@ -1,3 +1,4 @@ ifndef NO_DWARF PERF_HAVE_DWARF_REGS := 1 endif +HAVE_KVM_STAT_SUPPORT := 1 diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build index 0af6e9b..dd47b5e 100644 --- a/tools/perf/arch/powerpc/util/Build +++ b/tools/perf/arch/powerpc/util/Build @@ -1,4 +1,5 @@ libperf-y += header.o +libperf-y += kvm-stat.o libperf-$(CONFIG_DWARF) += dwarf-regs.o libperf-$(CONFIG_DWARF) += skip-callchain-idx.o diff --git a/tools/perf/arch/powerpc/util/kvm-stat.c b/tools/perf/arch/powerpc/util/kvm-stat.c new file mode 100644 index 000..62cdcc1 --- /dev/null +++ b/tools/perf/arch/powerpc/util/kvm-stat.c @@ -0,0 +1,33 @@ +#include "../../util/kvm-stat.h" +#include + +define_exit_reasons_table(hv_exit_reasons, kvm_trace_symbol_exit); + +static struct kvm_events_ops exit_events = { + .is_begin_event = exit_event_begin, + .is_end_event = exit_event_end, + .decode_key = exit_event_decode_key, + .name = "VM-EXIT" +}; 
+
+const char *const kvm_events_tp[] = {
+	"kvm_hv:kvm_guest_exit",
+	"kvm_hv:kvm_guest_enter",
+	NULL,
+};
+
+struct kvm_reg_events_ops kvm_reg_events_ops[] = {
+	{ .name = "vmexit", .ops = &exit_events },
+	{ NULL, NULL },
+};
+
+const char * const kvm_skip_events[] = {
+	NULL,
+};
+
+int cpu_isa_init(struct perf_kvm_stat *kvm, const char *cpuid __maybe_unused)
+{
+	kvm->exit_reasons = hv_exit_reasons;
+	kvm->exit_reasons_isa = "HV";
+	return 0;
+}
-- 
1.9.3
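The Samples% column in the sample report of this patch appears to be each row's sample count divided by the "Total Samples" figure. That reading is an assumption, though it matches the numbers shown (5006 of 14182 gives 35.30%). The sketch below only illustrates that arithmetic. sample_percent() and format_percent() are hypothetical helper names, not functions from perf.

```c
#include <assert.h>
#include <stdio.h>

/* Share of one exit reason, as a percentage of all samples. */
double sample_percent(unsigned long samples, unsigned long total)
{
	if (total == 0)
		return 0.0;	/* empty trace: avoid dividing by zero */
	return 100.0 * (double)samples / (double)total;
}

/* Format the share with two decimals and a percent sign, e.g. "35.30%". */
int format_percent(char *buf, size_t size, unsigned long samples,
		   unsigned long total)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "%.2f%%", sample_percent(samples, total));
}
```

With the figures from the report, format_percent(buf, sizeof(buf), 240, 14182) fills buf with "1.69%", the value shown on the EXTERNAL row.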
[PATCH v3 3/3] kvm/powerpc: Export HCALL reason codes
For perf to analyze the KVM events like hcalls, we need the hypervisor calls and their codes to be exported through uapi. This patch moves most of the pSeries hcall codes from arch/powerpc/include/asm/hvcall.h to arch/powerpc/include/uapi/asm/hcall_codes.h. It also moves the mapping from arch/powerpc/kvm/trace_hv.h to arch/powerpc/include/uapi/asm/trace_hcall.h. Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/hvcall.h | 120 +-- arch/powerpc/include/uapi/asm/hcall_codes.h | 123 arch/powerpc/include/uapi/asm/trace_hcall.h | 122 +++ arch/powerpc/kvm/trace_hv.h | 117 +- 4 files changed, 248 insertions(+), 234 deletions(-) create mode 100644 arch/powerpc/include/uapi/asm/hcall_codes.h create mode 100644 arch/powerpc/include/uapi/asm/trace_hcall.h diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 85bc8c0..799677d 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -155,124 +155,8 @@ /* Each control block has to be on a 4K boundary */ #define H_CB_ALIGNMENT 4096 -/* pSeries hypervisor opcodes */ -#define H_REMOVE 0x04 -#define H_ENTER0x08 -#define H_READ 0x0c -#define H_CLEAR_MOD0x10 -#define H_CLEAR_REF0x14 -#define H_PROTECT 0x18 -#define H_GET_TCE 0x1c -#define H_PUT_TCE 0x20 -#define H_SET_SPRG00x24 -#define H_SET_DABR 0x28 -#define H_PAGE_INIT0x2c -#define H_SET_ASR 0x30 -#define H_ASR_ON 0x34 -#define H_ASR_OFF 0x38 -#define H_LOGICAL_CI_LOAD 0x3c -#define H_LOGICAL_CI_STORE 0x40 -#define H_LOGICAL_CACHE_LOAD 0x44 -#define H_LOGICAL_CACHE_STORE 0x48 -#define H_LOGICAL_ICBI 0x4c -#define H_LOGICAL_DCBF 0x50 -#define H_GET_TERM_CHAR0x54 -#define H_PUT_TERM_CHAR0x58 -#define H_REAL_TO_LOGICAL 0x5c -#define H_HYPERVISOR_DATA 0x60 -#define H_EOI 0x64 -#define H_CPPR 0x68 -#define H_IPI 0x6c -#define H_IPOLL0x70 -#define H_XIRR 0x74 -#define H_PERFMON 0x7c -#define H_MIGRATE_DMA 0x78 -#define H_REGISTER_VPA 0xDC -#define H_CEDE 0xE0 -#define H_CONFER 0xE4 -#define H_PROD 0xE8 -#define 
H_GET_PPP 0xEC -#define H_SET_PPP 0xF0 -#define H_PURR 0xF4 -#define H_PIC 0xF8 -#define H_REG_CRQ 0xFC -#define H_FREE_CRQ 0x100 -#define H_VIO_SIGNAL 0x104 -#define H_SEND_CRQ 0x108 -#define H_COPY_RDMA0x110 -#define H_REGISTER_LOGICAL_LAN 0x114 -#define H_FREE_LOGICAL_LAN 0x118 -#define H_ADD_LOGICAL_LAN_BUFFER 0x11C -#define H_SEND_LOGICAL_LAN 0x120 -#define H_BULK_REMOVE 0x124 -#define H_MULTICAST_CTRL 0x130 -#define H_SET_XDABR0x134 -#define H_STUFF_TCE0x138 -#define H_PUT_TCE_INDIRECT 0x13C -#define H_CHANGE_LOGICAL_LAN_MAC 0x14C -#define H_VTERM_PARTNER_INFO 0x150 -#define H_REGISTER_VTERM 0x154 -#define H_FREE_VTERM 0x158 -#define H_RESET_EVENTS 0x15C -#define H_ALLOC_RESOURCE0x160 -#define H_FREE_RESOURCE 0x164 -#define H_MODIFY_QP 0x168 -#define H_QUERY_QP 0x16C -#define H_REREGISTER_PMR0x170 -#define H_REGISTER_SMR 0x174 -#define H_QUERY_MR 0x178 -#define H_QUERY_MW 0x17C -#define H_QUERY_HCA 0x180 -#define H_QUERY_PORT0x184 -#define H_MODIFY_PORT 0x188 -#define H_DEFINE_AQP1 0x18C -#define H_GET_TRACE_BUFFER 0x190 -#define H_DEFINE_AQP0 0x194 -#define H_RESIZE_MR 0x198 -#define H_ATTACH_MCQP 0x19C -#define H_DETACH_MCQP 0x1A0 -#define H_CREATE_RPT0x1A4 -#define H_REMOVE_RPT0x1A8 -#define H_REGISTER_RPAGES 0x1AC -#define H_DISABLE_AND_GETC 0x1B0 -#define H_ERROR_DATA0x1B4 -#define H_GET_HCA_INFO 0x1B8 -#define H_GET_PERF_COUNT0x1BC -#define H_MANAGE_TRACE 0x1C0 -#define H_FREE_LOGICAL_LAN_BUFFER 0x1D4 -#define H_QUERY_INT_STATE 0x1E4 -#define H_POLL_PENDING 0x1D8 -#define H_ILLAN_ATTRIBUTES 0x244 -#define H_MODIFY_HEA_QP0x250 -#define H_QUERY_HEA_QP 0x254 -#define H_QUERY_HEA0x258 -#define H_QUERY_HEA_PORT 0x25C -#define H_MODIFY_HEA_PORT 0x260 -#define H_REG_BCMC 0x264 -#define H_DEREG_BCMC 0x268 -#define H_REGISTER_HEA_RPAGES 0x26C -#define H_DISABLE_AND_GET_HEA 0x270 -#define H_GET_HEA_INFO 0x274 -#define H_ALLOC_HEA_RESOURCE 0x27
[PATCH v3 2/3] kvm/powerpc: Add exit reason for return code 0x0
This patch adds an exit reason "RETURN_TO_HOST" for the return code 0x0.

Signed-off-by: Hemant Kumar
---
 arch/powerpc/include/uapi/asm/trace_book3s.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/uapi/asm/trace_book3s.h b/arch/powerpc/include/uapi/asm/trace_book3s.h
index f647ce0..8635005 100644
--- a/arch/powerpc/include/uapi/asm/trace_book3s.h
+++ b/arch/powerpc/include/uapi/asm/trace_book3s.h
@@ -6,6 +6,7 @@
  */
 
 #define kvm_trace_symbol_exit \
+	{0x0, "RETURN_TO_HOST"}, \
 	{0x100, "SYSTEM_RESET"}, \
 	{0x200, "MACHINE_CHECK"}, \
 	{0x300, "DATA_STORAGE"}, \
-- 
1.9.3
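To see why the 0x0 entry matters to a consumer of the table, here is a small standalone sketch of a code-to-name lookup over a subset of the entries. The struct, the table name sketch_exit_reasons and the function exit_reason_name() are hypothetical and written for illustration. They are modelled on the NULL-terminated table walk used elsewhere in this series and are not taken from perf or KVM. Without the 0x0 row, a trap value of 0 would fall through to "UNKNOWN".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pairing of an exit code with its printable name. */
struct exit_reason {
	unsigned long code;
	const char *name;
};

/* A few entries copied from kvm_trace_symbol_exit, including the new 0x0 row. */
const struct exit_reason sketch_exit_reasons[] = {
	{0x0,   "RETURN_TO_HOST"},
	{0x500, "EXTERNAL"},
	{0x980, "HV_DECREMENTER"},
	{0xc00, "SYSCALL"},
	{0xe00, "H_DATA_STORAGE"},
	{0,     NULL},		/* terminator: name == NULL */
};

/* Walk the table until the NULL name; fall back to "UNKNOWN". */
const char *exit_reason_name(const struct exit_reason *tbl, unsigned long code)
{
	assert(tbl != NULL);

	for (; tbl->name != NULL; tbl++) {
		if (tbl->code == code)
			return tbl->name;
	}
	return "UNKNOWN";
}
```

The terminator is recognised by its NULL name rather than by its code, which is what allows 0x0 to be a valid, matchable exit code.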
[PATCH v3 1/3] kvm/powerpc: Export kvm exit reasons
To analyze the kvm exits with perf, we will need to map the exit codes
to the exit reasons. Such a mapping exists today in trace_book3s.h.
Currently it is not exported to perf.

This patch moves these kvm exit reasons and their mapping from
"arch/powerpc/kvm/trace_book3s.h" to
"arch/powerpc/include/uapi/asm/trace_book3s.h".
Accordingly change the include files in "trace_hv.h" and "trace_pr.h".

Signed-off-by: Hemant Kumar
---
Changes :
- Original patchset split into 2 patchsets now: for perf and powerpc side changes.

 arch/powerpc/include/uapi/asm/trace_book3s.h | 32
 arch/powerpc/kvm/trace_book3s.h              | 32
 arch/powerpc/kvm/trace_hv.h                  |  2 +-
 arch/powerpc/kvm/trace_pr.h                  |  2 +-
 4 files changed, 34 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/uapi/asm/trace_book3s.h
 delete mode 100644 arch/powerpc/kvm/trace_book3s.h

diff --git a/arch/powerpc/include/uapi/asm/trace_book3s.h b/arch/powerpc/include/uapi/asm/trace_book3s.h
new file mode 100644
index 000..f647ce0
--- /dev/null
+++ b/arch/powerpc/include/uapi/asm/trace_book3s.h
@@ -0,0 +1,32 @@
+#if !defined(_TRACE_KVM_BOOK3S_H)
+#define _TRACE_KVM_BOOK3S_H
+
+/*
+ * Common defines used by the trace macros in trace_pr.h and trace_hv.h
+ */
+
+#define kvm_trace_symbol_exit \
+	{0x100, "SYSTEM_RESET"}, \
+	{0x200, "MACHINE_CHECK"}, \
+	{0x300, "DATA_STORAGE"}, \
+	{0x380, "DATA_SEGMENT"}, \
+	{0x400, "INST_STORAGE"}, \
+	{0x480, "INST_SEGMENT"}, \
+	{0x500, "EXTERNAL"}, \
+	{0x501, "EXTERNAL_LEVEL"}, \
+	{0x502, "EXTERNAL_HV"}, \
+	{0x600, "ALIGNMENT"}, \
+	{0x700, "PROGRAM"}, \
+	{0x800, "FP_UNAVAIL"}, \
+	{0x900, "DECREMENTER"}, \
+	{0x980, "HV_DECREMENTER"}, \
+	{0xc00, "SYSCALL"}, \
+	{0xd00, "TRACE"}, \
+	{0xe00, "H_DATA_STORAGE"}, \
+	{0xe20, "H_INST_STORAGE"}, \
+	{0xe40, "H_EMUL_ASSIST"}, \
+	{0xf00, "PERFMON"}, \
+	{0xf20, "ALTIVEC"}, \
+	{0xf40, "VSX"}
+
+#endif
diff --git a/arch/powerpc/kvm/trace_book3s.h b/arch/powerpc/kvm/trace_book3s.h
deleted file mode 100644
index f647ce0..000
--- a/arch/powerpc/kvm/trace_book3s.h
+++ /dev/null
@@ -1,32 +0,0 @@
-#if !defined(_TRACE_KVM_BOOK3S_H)
-#define _TRACE_KVM_BOOK3S_H
-
-/*
- * Common defines used by the trace macros in trace_pr.h and trace_hv.h
- */
-
-#define kvm_trace_symbol_exit \
-	{0x100, "SYSTEM_RESET"}, \
-	{0x200, "MACHINE_CHECK"}, \
-	{0x300, "DATA_STORAGE"}, \
-	{0x380, "DATA_SEGMENT"}, \
-	{0x400, "INST_STORAGE"}, \
-	{0x480, "INST_SEGMENT"}, \
-	{0x500, "EXTERNAL"}, \
-	{0x501, "EXTERNAL_LEVEL"}, \
-	{0x502, "EXTERNAL_HV"}, \
-	{0x600, "ALIGNMENT"}, \
-	{0x700, "PROGRAM"}, \
-	{0x800, "FP_UNAVAIL"}, \
-	{0x900, "DECREMENTER"}, \
-	{0x980, "HV_DECREMENTER"}, \
-	{0xc00, "SYSCALL"}, \
-	{0xd00, "TRACE"}, \
-	{0xe00, "H_DATA_STORAGE"}, \
-	{0xe20, "H_INST_STORAGE"}, \
-	{0xe40, "H_EMUL_ASSIST"}, \
-	{0xf00, "PERFMON"}, \
-	{0xf20, "ALTIVEC"}, \
-	{0xf40, "VSX"}
-
-#endif
diff --git a/arch/powerpc/kvm/trace_hv.h b/arch/powerpc/kvm/trace_hv.h
index 33d9daf..02d0a07 100644
--- a/arch/powerpc/kvm/trace_hv.h
+++ b/arch/powerpc/kvm/trace_hv.h
@@ -2,7 +2,7 @@
 #define _TRACE_KVM_HV_H
 
 #include
-#include "trace_book3s.h"
+#include
 #include
 #include
 
diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
index 810507c..a9850c6 100644
--- a/arch/powerpc/kvm/trace_pr.h
+++ b/arch/powerpc/kvm/trace_pr.h
@@ -3,7 +3,7 @@
 #define _TRACE_KVM_PR_H
 
 #include
-#include "trace_book3s.h"
+#include
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm_pr
-- 
1.9.3
Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page
On Thu, May 07, 2015 at 12:46:21PM +0530, Aneesh Kumar K.V wrote:
> We need to check whether pte is present in follow_huge_addr and
> properly return NULL if mapping is not present. Also use READ_ONCE
> when dereferencing pte_t address.
>
> Signed-off-by: Aneesh Kumar K.V

Reviewed-by: David Gibson

Looks sane. It's a long time since I worked with this so I don't
really remember, but I have a suspicion that at the time hugepage PTEs
could never exist but be non-present.

> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0ce968b00b7c..f5688423bc69 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
> -        pte_t *ptep;
> -        struct page *page;
> +        pte_t *ptep, pte;
>          unsigned shift;
>          unsigned long mask, flags;
> +        struct page *page = ERR_PTR(-EINVAL);
> +
> +        local_irq_save(flags);
> +        ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> +        if (!ptep)
> +                goto no_page;
> +        pte = READ_ONCE(*ptep);
>          /*
> +         * Verify it is a huge page else bail.
>           * Transparent hugepages are handled by generic code. We can skip them
>           * here.
>           */
> -        local_irq_save(flags);
> -        ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> +        if (!shift || pmd_trans_huge((pmd_t)pte))
> +                goto no_page;
>
> -        /* Verify it is a huge page else bail. */
> -        if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
> -                local_irq_restore(flags);
> -                return ERR_PTR(-EINVAL);
> +        if (!pte_present(pte)) {
> +                page = NULL;
> +                goto no_page;
>          }
>          mask = (1UL << shift) - 1;
> -        page = pte_page(*ptep);
> +        page = pte_page(pte);
>          if (page)
>                  page += (address & mask) / PAGE_SIZE;
>
> +no_page:
>          local_irq_restore(flags);
>          return page;
>  }

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
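[Editorial sketch, not part of the thread.] The control flow the patch introduces can be modelled in plain userspace C: take one snapshot of the entry, then distinguish "invalid lookup" from "valid entry that is not present". Everything named model_* or MODEL_* below is hypothetical and stands in for kernel types; the bit layout is an assumption chosen for illustration, not the powerpc PTE format, and the transparent-hugepage check is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for pte_t and its flag bits (not the real layout). */
typedef uint64_t model_pte_t;
#define MODEL_PTE_PRESENT 0x1ULL
#define MODEL_PAGE_SHIFT  12

/* Mirrors the three outcomes of the patched function:
 * a page, NULL (entry not present), or ERR_PTR(-EINVAL). */
enum model_lookup {
    MODEL_LOOKUP_OK,
    MODEL_LOOKUP_NOT_PRESENT,
    MODEL_LOOKUP_INVALID
};

/* Read the entry exactly once, in the spirit of READ_ONCE(): every later
 * check then works on the same snapshot, even if the entry changes. */
static model_pte_t model_read_once(const model_pte_t *ptep)
{
    return *(const volatile model_pte_t *)ptep;
}

static enum model_lookup model_follow_huge_addr(const model_pte_t *ptep,
                                                unsigned int shift,
                                                uint64_t address,
                                                uint64_t *pfn_out)
{
    model_pte_t pte;
    uint64_t mask;

    assert(pfn_out != NULL);

    if (!ptep)
        return MODEL_LOOKUP_INVALID;

    pte = model_read_once(ptep);

    /* A zero shift means this is not a huge mapping. */
    if (!shift)
        return MODEL_LOOKUP_INVALID;

    /* The entry exists but the mapping is not present. */
    if (!(pte & MODEL_PTE_PRESENT))
        return MODEL_LOOKUP_NOT_PRESENT;

    /* Base frame of the huge page plus the small-page offset inside it. */
    mask = (1ULL << shift) - 1;
    *pfn_out = (pte >> MODEL_PAGE_SHIFT) +
               ((address & mask) >> MODEL_PAGE_SHIFT);
    return MODEL_LOOKUP_OK;
}
```

The point of the single read is that the presence check and the frame computation cannot disagree, which the pre-patch code could not guarantee because it dereferenced the pointer twice.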
Re: [PATCH v4 01/21] pci: Add pcibios_setup_bridge()
Hi Gavin,

[Please run "git log --oneline drivers/pci/setup-bus.c" and observe
the capitalization convention.]

On Fri, May 01, 2015 at 04:02:48PM +1000, Gavin Shan wrote:
> Currently, PowerPC PowerNV platform utilizes ppc_md.pcibios_fixup(),
> which is called for once after PCI probing and resource assignment
> are completed, to allocate platform required resources for PCI devices:
> PE#, IO and MMIO mapping, DMA address translation (TCE) table etc.
> Obviously, it's not hotplug friendly.
>
> The patch adds weak function pcibios_setup_bridge(), which is called
> by pci_setup_bridge(). PowerPC PowerNV platform will reuse the function
> to assign above platform required resources to newly added PCI devices,
> in order to support PCI hotplug on PowerPC PowerNV platform.
>
> Signed-off-by: Gavin Shan
> ---
>  drivers/pci/setup-bus.c | 12 +---
>  include/linux/pci.h     |  1 +
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 4fd0cac..a7d0c3c 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -674,7 +674,8 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
>          pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
>  }
>
> -static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
> +
> +void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type)
>  {
>          struct pci_dev *bridge = bus->self;
>
> @@ -693,12 +694,17 @@ static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
>          pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
>  }
>
> +void __weak pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
> +{
> +        pci_setup_bridge_resources(bus, type);
> +}

I'm not opposed to adding a pcibios_setup_bridge(), but I would rather
do the architected updates in the generic PCI core code instead of
down in the pcibios code.  In other words, I would rather have this:

    void pci_setup_bridge(struct pci_bus *bus)
    {
        pcibios_setup_bridge(bus, type);
        pci_setup_bridge_resources(bus, type);
    }

That way the default pcibios hook is empty, showing that by default
there's no arch-specific code in this path, and we only have to look
at the generic core code to verify that we actually do program the
bridge windows.

Bjorn
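[Editorial sketch, not part of the thread.] The layering Bjorn describes, an empty overridable hook followed by unconditional generic programming, can be tried out in userspace with the GCC/Clang weak attribute. All model_* and MODEL_* names are hypothetical. The inline sketch in the mail uses `type` without declaring it; the mask built here is an assumption made only so the example compiles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource-type bits, standing in for the kernel's flags. */
#define MODEL_RES_IO       0x1UL
#define MODEL_RES_MEM      0x2UL
#define MODEL_RES_PREFETCH 0x4UL

/* Hypothetical bus object: records which windows were programmed. */
struct model_bus {
    unsigned long programmed;
};

/* Default arch hook: deliberately empty. A platform may provide a strong
 * definition of the same symbol in another file to override it. */
__attribute__((weak))
void model_pcibios_setup_bridge(struct model_bus *bus, unsigned long type)
{
    (void)bus;
    (void)type;
}

/* Generic work that must happen on every platform. */
static void model_setup_bridge_resources(struct model_bus *bus,
                                         unsigned long type)
{
    bus->programmed |= type;
}

/* Core entry point: call the arch hook first, then always do the
 * generic bridge-window programming. */
void model_pci_setup_bridge(struct model_bus *bus)
{
    unsigned long type = MODEL_RES_IO | MODEL_RES_MEM | MODEL_RES_PREFETCH;

    assert(bus != NULL);
    model_pcibios_setup_bridge(bus, type);
    model_setup_bridge_resources(bus, type);
}
```

With this shape, reading the core function alone is enough to see that the windows are programmed; an arch override can add work but cannot skip the generic step.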
Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020
On Thu, 2015-05-07 at 15:29 -0400, Oleksandr G Zhadan wrote:
> On 05/07/2015 02:18 PM, Scott Wood wrote:
> > On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote:
> >>>> diff --git a/arch/powerpc/configs/ucp1020_defconfig
> >>>> b/arch/powerpc/configs/ucp1020_defconfig
> >>>> new file mode 100644
> >>>> index 000..62f99aa
> >>>> --- /dev/null
> >>>> +++ b/arch/powerpc/configs/ucp1020_defconfig
> >>>
> >>> Please explain why your board needs its own defconfig.
> >>>
> >>
> >> Because, it's our own board and it has some specific to board
> >> definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product
> >> definitions.
> >>
> >> If I can do it in some other way could you please give me some example
> >> if it's possible.
> >
> > I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream.
> > Could you list what you need to be set that mpc85xx_smp_defconfig
> > doesn't set?
>
> I make diff "mpc85xx_smp_defconfig" vs "ucp1020_defconfig after make
> savedefconfig" and it's some differences like:
>
> - mpc85xx_smp_defconfig has:
>     CONFIG_PHYS_64BIT=y
>     CONFIG_NR_CPUS=8

These won't prevent your board from working.  If you want
CONFIG_PHYS_64BIT disabled for performance, I could see a fragment being
used for that as per the recent defconfig discussions.  I wouldn't
expect NR_CPUS being 8 instead of 2 to be noticeable.

> - it enabled almost all boards to build. What for ?

Because that's what the common defconfigs are for.  We don't want a
defconfig for each board (most of the board-specific configs that are
currently there were added long ago).  If you want a config that
contains nothing your board doesn't need, you can maintain that locally.

> - it has MTD related differences (doesn't enabled spi flashes support we
> need):
>     -CONFIG_MTD_M25P80=y
>     -CONFIG_MTD_SST25L=y

So add them to the existing defconfig.

> - It includes some PHY support, but not phy we are using

This should not harm your board.

> and we need include intel wifi support:
>     -CONFIG_MICREL_PHY=y
>     -CONFIG_IWLWIFI=y

So add them to the existing defconfig.

> - It doesn't enable EXT4 fs support.

I think this would be a reasonable thing to add.

> Etc...
>
> You can see it yourself below:

That doesn't show me the set of changes that you *need*, only the set of
changes that you have.

-Scott
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> > On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> >> Hi Preeti,
> >>
> >> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
> >>> Hi Shilpa,
> >>>
> >>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> >>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
> >>>> notification by executing *throttle_check() on any one of the cpu on
> >>>> the chip. This is a sanity check to verify if we were indeed
> >>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
> >>>>
> >>>> We cannot call *throttle_check() directly from the notification
> >>>> handler because we could be handling chip1's notification in chip2. So
> >>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> >>>> in the notification handler, so use a worker thread to smp_call
> >>>> throttle_check() on any of the cpu in the chipmask.
> >>>
> >>> I see that the first patch takes care of reporting *per-chip* throttling
> >>> for pmax capping condition. But where are we taking care of reporting
> >>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
> >>>
> >>
> >> IMO let us not have "psafe" and "freq control disabled" states managed
> >> per-chip.
> >> Because when the above two conditions occur it is likely to happen across
> >> all chips during an OCC reset cycle. So I am setting 'throttled' to false on
> >> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> >> *throttle_check().
> >
> > Alright like I pointed in the previous reply, a comment to indicate that
> > psafe and freq control disabled conditions will fail when occ is
> > inactive and that all chips face the consequence of this will help.
>
> From your explanation on the thread of the first patch of this series,
> this will not be required.
>
> So,
> Reviewed-by: Preeti U Murthy

OK, so is the whole series reviewed now?

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Re: [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
On Thursday, May 07, 2015 11:17:21 PM Preeti U Murthy wrote:
> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU. This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
>
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state where in ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
>
> Signed-off-by: Preeti U Murthy
> ---
> Changes from V1: https://lkml.org/lkml/2015/5/7/24
> Rebased on the latest linux-pm/bleeding-edge
>
>  drivers/cpuidle/cpuidle.c          | 21 +
>  drivers/cpuidle/governors/ladder.c | 13 ++---
>  drivers/cpuidle/governors/menu.c   |  6 +-
>  include/linux/cpuidle.h            |  6 +++---
>  include/linux/sched.h              | 16
>  kernel/sched/core.c                | 17 +
>  kernel/sched/fair.c                |  2 +-
>  kernel/sched/idle.c                |  8 +---
>  kernel/sched/sched.h               | 24
>  9 files changed, 70 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 8c24f95..b7e86f4 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -21,6 +21,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "cpuidle.h"
> @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>           * CPU as a broadcast timer, this call may fail if it is not available.
>           */
>          if (broadcast && tick_broadcast_enter()) {
> -                default_idle_call();
> -                return -EBUSY;
> +                index = cpuidle_select(drv, dev, !broadcast);

No, you can't do that.

This code path may be used by suspend-to-idle and that should not call
cpuidle_select().

What's needed here seems to be a fallback mechanism like "choose the
deepest state shallower than X and such that it won't stop the tick".
You don't really need to run a full governor for that.

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
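[Editorial sketch, not part of the thread.] One way to read the fallback Rafael describes is a linear scan downward from the state that was originally picked, skipping anything that stops the tick. The code below is a userspace model under stated assumptions: states are ordered from shallowest (index 0) to deepest, and the struct, flag, and function names are all hypothetical rather than the cpuidle API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: entering this state stops the local tick. */
#define MODEL_FLAG_TIMER_STOP 0x1u

/* Hypothetical idle-state descriptor, ordered shallow -> deep in an array. */
struct model_idle_state {
    unsigned int flags;
    int disabled;
};

/* Return the index of the deepest enabled state that is shallower than
 * `limit` and does not stop the tick, or -1 if no such state exists.
 * No governor heuristics are consulted; it is a plain scan. */
static int model_find_fallback_state(const struct model_idle_state *states,
                                     int limit)
{
    int i;

    assert(states != NULL);

    for (i = limit - 1; i >= 0; i--) {
        if (states[i].disabled)
            continue;
        if (states[i].flags & MODEL_FLAG_TIMER_STOP)
            continue;
        return i;
    }
    return -1;
}
```

A caller would pass the index of the state whose tick_broadcast_enter() attempt failed as `limit`, and fall back to a default idle routine when the result is -1. Because nothing here depends on governor state, the same helper would be usable from a path that must not run a governor.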
Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020
On 05/07/2015 02:18 PM, Scott Wood wrote: On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote: Hi Scott, Thanks for fast response, please see inline. On 05/06/2015 11:22 PM, Scott Wood wrote: On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote: +- + +P1020 SPI controller + +Properties: +- compatible: "spansion,s25fl008k", "winbond,w25q80bl" + +Example: + spi@7000 { + flash@0 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "spansion,s25fl008k", "winbond,w25q80bl"; + reg = <0>; + spi-max-frequency = <4000>; /* input clock */ + ... + }; This isn't describing the controller, but rather a SPI chip attached to the controller. This also doesn't seem like the right place for random SPI chips. If all you're specifying is the compatible, maybe create a spi/trivial-devices.txt similar to i2c/trivial-devices.txt? Or something specific to SPI flash chips to describe the partition specification, though I generally recommend against describing partitions in the device tree -- especially if this is a developer board rather than something fixed-purpose where the partitioning is not going to change based on user requirements. Mostly in all Documentation/devicetree/bindings/ I tried to satisfy checkpatch script as simple as possible. And for me as well it looks reasonable to create spi/trivial-devices.txt file and I will. Checkpatch is a tool, not a dictator. Sometimes it gets things wrong. Also, please CC devicet...@vger.kernel.org when adding bindings or modifying dts files. OK, got it. +- + +Chipselect/Local Bus + +Properties: +- #address-cells: <2>. +- #size-cells: <1>. +- compatible: "fsl,p1020-elbc", "fsl,elbc", "simple-bus","fsl,p1020-immr" +- interrupts: interrupts to report localbus events. 
+ +Example: + +&lbc { + #address-cells = <2>; + #size-cells = <1>; + compatible = "fsl,p1020-elbc", "fsl,elbc", "simple-bus"; + interrupts = <19 2 0 0>; +}; There's already a binding for elbc -- and the elbc node certainly should not claim compatibility with "fsl,p1020-immr". to satisfy checkpatch script. Even if that were necessary, why do it by copy-and-paste, and why put the immr compatible in the binding for a different node? Will fix. diff --git a/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi new file mode 100644 index 000..930a6e3 --- /dev/null +++ b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi Why can't you use p1020si-post.dtsi? The "si" means "silicon" -- it's meant to be included by all p1020 boards. Yes, silicon is the same, but p1020 boards using all 3 etsec ethernet controllers. Our board using only 2: etsec1 and etsec3. So have your board write status = "disabled" into the etsec2 node after including the post file. While ago I create ucp1020som-post.dtsi to make it work as we need. I did try some things, but I really don't remember what exactly and result always was eth0,eth1,eth2 or eth0,eth2 and newer eth0,eth1. I'll retry status="disabled". diff --git a/arch/powerpc/configs/ucp1020_defconfig b/arch/powerpc/configs/ucp1020_defconfig new file mode 100644 index 000..62f99aa --- /dev/null +++ b/arch/powerpc/configs/ucp1020_defconfig Please explain why your board needs its own defconfig. Because, it's our own board and it has some specific to board definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product definitions. If I can do it in some other way could you please give me some example if it's possible. I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream. Could you list what you need to be set that mpc85xx_smp_defconfig doesn't set? 
I make diff "mpc85xx_smp_defconfig" vs "ucp1020_defconfig after make savedefconfig" and it's some differences like: - mpc85xx_smp_defconfig has: CONFIG_PHYS_64BIT=y CONFIG_NR_CPUS=8 - it enabled almost all boards to build. What for ? - it has MTD related differences (doesn't enabled spi flashes support we need): -CONFIG_MTD_M25P80=y -CONFIG_MTD_SST25L=y - It includes some PHY support, but not phy we are using and we need include intel wifi support: -CONFIG_MICREL_PHY=y -CONFIG_IWLWIFI=y - It doesn't enable EXT4 fs support. Etc... You can see it yourself below: --- defconfig 2015-05-07 14:48:12.0 -0400 +++ mpc85xx_smp_defconfig 2015-05-01 18:45:03.0 -0400 @@ -1,51 +1,58 @@ CONFIG_PPC_85xx=y +CONFIG_PHYS_64BIT=y CONFIG_SMP=y -CONFIG_NR_CPUS=2 -CONFIG_CROSS_COMPILE="powerpc-linux-" -# CONFIG_LOCALVERSION_AUTO is not set -CONFIG_DEFAULT_HOSTNAME="uCP1020" -# CONFIG_SWAP is not set +CONFIG_NR_CPUS=8 CONFIG_S
Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020
On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote:
> Hi Scott,
>
> Thanks for fast response, please see inline.
>
> On 05/06/2015 11:22 PM, Scott Wood wrote:
> > On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote:
> >> +-
> >> +
> >> +P1020 SPI controller
> >> +
> >> +Properties:
> >> +- compatible: "spansion,s25fl008k", "winbond,w25q80bl"
> >> +
> >> +Example:
> >> +        spi@7000 {
> >> +                flash@0 {
> >> +                        #address-cells = <1>;
> >> +                        #size-cells = <1>;
> >> +                        compatible = "spansion,s25fl008k", "winbond,w25q80bl";
> >> +                        reg = <0>;
> >> +                        spi-max-frequency = <4000>; /* input clock */
> >> +                        ...
> >> +        };
> >
> > This isn't describing the controller, but rather a SPI chip attached to
> > the controller.  This also doesn't seem like the right place for random
> > SPI chips.
> >
> > If all you're specifying is the compatible, maybe create a
> > spi/trivial-devices.txt similar to i2c/trivial-devices.txt?  Or
> > something specific to SPI flash chips to describe the partition
> > specification, though I generally recommend against describing
> > partitions in the device tree -- especially if this is a developer board
> > rather than something fixed-purpose where the partitioning is not going
> > to change based on user requirements.
> >
> >
>
> Mostly in all Documentation/devicetree/bindings/ I tried to satisfy
> checkpatch script as simple as possible. And for me as well it looks
> reasonable to create spi/trivial-devices.txt file and I will.

Checkpatch is a tool, not a dictator.  Sometimes it gets things wrong.

Also, please CC devicet...@vger.kernel.org when adding bindings or
modifying dts files.

> >> +-
> >> +
> >> +Chipselect/Local Bus
> >> +
> >> +Properties:
> >> +- #address-cells: <2>.
> >> +- #size-cells:<1>.
> >> +- compatible: "fsl,p1020-elbc", "fsl,elbc", "simple-bus","fsl,p1020-immr"
> >> +- interrupts: interrupts to report localbus events.
> >> +
> >> +Example:
> >> +
> >> +&lbc {
> >> +        #address-cells = <2>;
> >> +        #size-cells = <1>;
> >> +        compatible = "fsl,p1020-elbc", "fsl,elbc", "simple-bus";
> >> +        interrupts = <19 2 0 0>;
> >> +};
> >
> > There's already a binding for elbc -- and the elbc node certainly should
> > not claim compatibility with "fsl,p1020-immr".
> >
> >
>
> to satisfy checkpatch script.

Even if that were necessary, why do it by copy-and-paste, and why put
the immr compatible in the binding for a different node?

> >> diff --git a/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
> >> b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
> >> new file mode 100644
> >> index 000..930a6e3
> >> --- /dev/null
> >> +++ b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
> >
> > Why can't you use p1020si-post.dtsi?  The "si" means "silicon" -- it's
> > meant to be included by all p1020 boards.
> >
>
> Yes, silicon is the same, but p1020 boards using all 3 etsec ethernet
> controllers. Our board using only 2: etsec1 and etsec3.

So have your board write status = "disabled" into the etsec2 node after
including the post file.

> >> diff --git a/arch/powerpc/configs/ucp1020_defconfig
> >> b/arch/powerpc/configs/ucp1020_defconfig
> >> new file mode 100644
> >> index 000..62f99aa
> >> --- /dev/null
> >> +++ b/arch/powerpc/configs/ucp1020_defconfig
> >
> > Please explain why your board needs its own defconfig.
> >
>
> Because, it's our own board and it has some specific to board
> definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product
> definitions.
>
> If I can do it in some other way could you please give me some example
> if it's possible.

I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream.
Could you list what you need to be set that mpc85xx_smp_defconfig
doesn't set?

-Scott
Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/07/2015 11:09 PM, Preeti U Murthy wrote: > When a CPU has to enter an idle state where tick stops, it makes a call > to tick_broadcast_enter(). The call will fail if this CPU is the > broadcast CPU. Today, under such a circumstance, the arch cpuidle code > handles this CPU. This is not convincing because not only are we not > aware what the arch cpuidle code does, but we also do not account for > the idle state residency time and usage of such a CPU. > > This scenario can be handled better by simply asking the cpuidle > governor to choose an idle state where in ticks do not stop. To > accommodate this change move the setting of runqueue idle state from the > core to the cpuidle driver, else the rq->idle_state will be set wrong. > > Signed-off-by: Preeti U Murthy > --- > Rebased on the latest linux-pm/bleeding-edge Kindly ignore this. I have sent the rebase as V2. [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully The below patch is not updated. Regards Preeti U Murthy > > drivers/cpuidle/cpuidle.c | 21 + > drivers/cpuidle/governors/ladder.c | 13 ++--- > drivers/cpuidle/governors/menu.c |6 +- > include/linux/cpuidle.h|6 +++--- > include/linux/sched.h | 16 > kernel/sched/core.c| 17 + > kernel/sched/fair.c|2 +- > kernel/sched/idle.c|8 +--- > kernel/sched/sched.h | 24 > 9 files changed, 70 insertions(+), 43 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > index 61c417b..8f5657e 100644 > --- a/drivers/cpuidle/cpuidle.c > +++ b/drivers/cpuidle/cpuidle.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > > #include "cpuidle.h" > @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, >* local timer will be shut down. If a local timer is used from another >* CPU as a broadcast timer, this call may fail if it is not available. 
>*/ > - if (broadcast && tick_broadcast_enter()) > - return -EBUSY; > + if (broadcast && tick_broadcast_enter()) { > + index = cpuidle_select(drv, dev, !broadcast); > + if (index < 0) > + return -EBUSY; > + target_state = &drv->states[index]; > + } > + > + /* Take note of the planned idle state. */ > + idle_set_state(smp_processor_id(), target_state); > > trace_cpu_idle_rcuidle(index, dev->cpu); > time_start = ktime_get(); > @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > time_end = ktime_get(); > trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); > > + /* The cpu is no longer idle or about to enter idle. */ > + idle_set_state(smp_processor_id(), NULL); > + > if (broadcast) { > if (WARN_ON_ONCE(!irqs_disabled())) > local_irq_disable(); > @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > * > * @drv: the cpuidle driver > * @dev: the cpuidle device > + * @timer_stop_valid: allow selection of idle state where tick stops > * > * Returns the index of the idle state. 
> */ > -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > +int cpuidle_select(struct cpuidle_driver *drv, > + struct cpuidle_device *dev, int timer_stop_valid) > { > - return cpuidle_curr_governor->select(drv, dev); > + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); > } > > /** > diff --git a/drivers/cpuidle/governors/ladder.c > b/drivers/cpuidle/governors/ladder.c > index 401c010..c437322 100644 > --- a/drivers/cpuidle/governors/ladder.c > +++ b/drivers/cpuidle/governors/ladder.c > @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct > ladder_device *ldev, > * ladder_select_state - selects the next state to enter > * @drv: cpuidle driver > * @dev: the CPU > + * @timer_stop_valid: allow selection of idle state where tick stops > */ > static int ladder_select_state(struct cpuidle_driver *drv, > - struct cpuidle_device *dev) > + struct cpuidle_device *dev, int > timer_stop_valid) > { > struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); > struct ladder_device_state *last_state; > @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, > !drv->states[last_idx + 1].disabled && > !dev->states_usage[last_idx + 1].disable && > last_residency > last_state->threshold.promotion_time && > + !(!timer_stop_valid && (drv->states[last_idx + 1].flags & > CPUIDLE_FLAG_TIMER_STOP)) && > drv->states[last_idx + 1].exit_latency
[PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg
Add crypto compression alg for 842 hardware compression and decompression, using the alg name "842" and driver_name "842-nx". This uses only the PowerPC coprocessor hardware for 842 compression. It also uses the hardware for decompression, but if the hardware fails it will fall back to the 842 software decompression library, so that decompression never fails (for valid 842 compressed buffers). A header must be used in most cases, due to the hardware's restrictions on the buffers being specifically aligned and sized. Due to the header this driver adds, compressed buffers it creates cannot be directly passed to the 842 software library for decompression. However, compressed buffers created by the software 842 library can be passed to this driver for hardware 842 decompression (with the exception of buffers containing the "short data" template, as lib/842/842.h explains). Signed-off-by: Dan Streetman --- drivers/crypto/nx/Kconfig | 10 + drivers/crypto/nx/Makefile| 2 + drivers/crypto/nx/nx-842-crypto.c | 585 ++ 3 files changed, 597 insertions(+) create mode 100644 drivers/crypto/nx/nx-842-crypto.c diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig index ee9e259..3e621ad 100644 --- a/drivers/crypto/nx/Kconfig +++ b/drivers/crypto/nx/Kconfig @@ -50,4 +50,14 @@ config CRYPTO_DEV_NX_COMPRESS_POWERNV algorithm. This supports NX hardware on the PowerNV platform. If you choose 'M' here, this module will be called nx_compress_powernv. +config CRYPTO_DEV_NX_COMPRESS_CRYPTO + tristate "Compression acceleration cryptographic interface" + select CRYPTO_ALGAPI + select 842_DECOMPRESS + default y + help + Support for PowerPC Nest (NX) accelerators using the cryptographic + API. If you choose 'M' here, this module will be called + nx_compress_crypto. 
+ endif diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile index 6619787..868b5e6 100644 --- a/drivers/crypto/nx/Makefile +++ b/drivers/crypto/nx/Makefile @@ -13,6 +13,8 @@ nx-crypto-objs := nx.o \ obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o +obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_CRYPTO) += nx-compress-crypto.o nx-compress-objs := nx-842.o nx-compress-pseries-objs := nx-842-pseries.o nx-compress-powernv-objs := nx-842-powernv.o +nx-compress-crypto-objs := nx-842-crypto.o diff --git a/drivers/crypto/nx/nx-842-crypto.c b/drivers/crypto/nx/nx-842-crypto.c new file mode 100644 index 000..cb177c3 --- /dev/null +++ b/drivers/crypto/nx/nx-842-crypto.c @@ -0,0 +1,585 @@ +/* + * Cryptographic API for the NX-842 hardware compression. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Copyright (C) IBM Corporation, 2011-2015 + * + * Original Authors: Robert Jennings + * Seth Jennings + * + * Rewrite: Dan Streetman + * + * This is an interface to the NX-842 compression hardware in PowerPC + * processors. Most of the complexity of this drvier is due to the fact that + * the NX-842 compression hardware requires the input and output data buffers + * to be specifically aligned, to be a specific multiple in length, and within + * specific minimum and maximum lengths. 
Those restrictions, provided by the + * nx-842 driver via nx842_constraints, mean this driver must use bounce + * buffers and headers to correct misaligned in or out buffers, and to split + * input buffers that are too large. + * + * This driver will fall back to software decompression if the hardware + * decompression fails, so this driver's decompression should never fail as + * long as the provided compressed buffer is valid. Any compressed buffer + * created by this driver will have a header (except ones where the input + * perfectly matches the constraints); so users of this driver cannot simply + * pass a compressed buffer created by this driver over to the 842 software + * decompression library. Instead, users must use this driver to decompress; + * if the hardware fails or is unavailable, the compressed buffer will be + * parsed and the header removed, and the raw 842 buffer(s) passed to the 842 + * software decompression library. + * + * This does not fall back to software compression, however, since the caller + * of this function is specifically requesting hardware compression; if the + * hardware com
[PATCH 09/10] drivers/crypto/nx: simplify pSeries nx842 driver
Simplify the pSeries NX-842 driver: do not expect incoming buffers to be exactly page-sized; do not break up input buffers to compress smaller blocks; do not use any internal headers in the compressed data blocks; remove the software decompression implementation; implement the pSeries nx842_constraints. This changes the pSeries NX-842 driver to perform constraints-based compression so that it only needs to compress one entire input block at a time. This removes the need for it to split input data blocks into multiple compressed data sections in the output buffer, and removes the need for any extra header info in the compressed data; all that is moved (in a later patch) into the main crypto 842 driver. Additionally, the 842 software decompression implementation is no longer needed here, as the crypto 842 driver will use the generic software 842 decompression function as a fallback if any hardware 842 driver fails. Signed-off-by: Dan Streetman --- drivers/crypto/nx/nx-842-pseries.c | 779 - 1 file changed, 153 insertions(+), 626 deletions(-) diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c index 6db9992..85837e9 100644 --- a/drivers/crypto/nx/nx-842-pseries.c +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -21,7 +21,6 @@ * Seth Jennings */ -#include #include #include "nx-842.h" @@ -32,11 +31,6 @@ MODULE_LICENSE("GPL"); MODULE_AUTHOR("Robert Jennings "); MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors"); -#define SHIFT_4K 12 -#define SHIFT_64K 16 -#define SIZE_4K (1UL << SHIFT_4K) -#define SIZE_64K (1UL << SHIFT_64K) - /* IO buffer must be 128 byte aligned */ #define IO_BUFFER_ALIGN 128 @@ -47,18 +41,52 @@ static struct nx842_constraints nx842_pseries_constraints = { .maximum = PAGE_SIZE, /* dynamic, max_sync_size */ }; -struct nx842_header { - int blocks_nr; /* number of compressed blocks */ - int offset; /* offset of the first block (from beginning of header) */ - int sizes[0]; /* size of compressed blocks */ 
-}; - -static inline int nx842_header_size(const struct nx842_header *hdr) +static int check_constraints(unsigned long buf, unsigned int *len, bool in) { - return sizeof(struct nx842_header) + - hdr->blocks_nr * sizeof(hdr->sizes[0]); + if (!IS_ALIGNED(buf, nx842_pseries_constraints.alignment)) { + pr_debug("%s buffer 0x%lx not aligned to 0x%x\n", +in ? "input" : "output", buf, +nx842_pseries_constraints.alignment); + return -EINVAL; + } + if (*len % nx842_pseries_constraints.multiple) { + pr_debug("%s buffer len 0x%x not multiple of 0x%x\n", +in ? "input" : "output", *len, +nx842_pseries_constraints.multiple); + if (in) + return -EINVAL; + *len = round_down(*len, nx842_pseries_constraints.multiple); + } + if (*len < nx842_pseries_constraints.minimum) { + pr_debug("%s buffer len 0x%x under minimum 0x%x\n", +in ? "input" : "output", *len, +nx842_pseries_constraints.minimum); + return -EINVAL; + } + if (*len > nx842_pseries_constraints.maximum) { + pr_debug("%s buffer len 0x%x over maximum 0x%x\n", +in ? "input" : "output", *len, +nx842_pseries_constraints.maximum); + if (in) + return -EINVAL; + *len = nx842_pseries_constraints.maximum; + } + return 0; } +/* I assume we need to align the CSB? 
*/ +#define WORKMEM_ALIGN (256) + +struct nx842_workmem { + /* scatterlist */ + char slin[4096]; + char slout[4096]; + /* coprocessor status/parameter block */ + struct nx_csbcpb csbcpb; + + char padding[WORKMEM_ALIGN]; +} __aligned(WORKMEM_ALIGN); + /* Macros for fields within nx_csbcpb */ /* Check the valid bit within the csbcpb valid field */ #define NX842_CSBCBP_VALID_CHK(x) (x & BIT_MASK(7)) @@ -72,8 +100,7 @@ static inline int nx842_header_size(const struct nx842_header *hdr) #define NX842_CSBCPB_CE2(x)(x & BIT_MASK(5)) /* The NX unit accepts data only on 4K page boundaries */ -#define NX842_HW_PAGE_SHIFTSHIFT_4K -#define NX842_HW_PAGE_SIZE (ASM_CONST(1) << NX842_HW_PAGE_SHIFT) +#define NX842_HW_PAGE_SIZE (4096) #define NX842_HW_PAGE_MASK (~(NX842_HW_PAGE_SIZE-1)) enum nx842_status { @@ -194,41 +221,6 @@ static int nx842_build_scatterlist(unsigned long buf, int len, return 0; } -/* - * Working memory for software decompression - */ -struct sw842_fifo { - union { - char f8[256][8]; - char f4[512][4]; - }; - char f2[256][2]; - unsigned char f84_full; - unsigned
[PATCH 08/10] drivers/crypto/nx: add PowerNV platform NX-842 driver
Add driver for NX-842 hardware on the PowerNV platform. This allows the use of the 842 compression hardware coprocessor on the PowerNV platform. Signed-off-by: Dan Streetman --- drivers/crypto/nx/Kconfig | 10 + drivers/crypto/nx/Makefile | 2 + drivers/crypto/nx/nx-842-powernv.c | 625 + drivers/crypto/nx/nx-842-pseries.c | 9 - drivers/crypto/nx/nx-842.c | 4 +- drivers/crypto/nx/nx-842.h | 97 ++ include/linux/nx842.h | 6 +- 7 files changed, 741 insertions(+), 12 deletions(-) create mode 100644 drivers/crypto/nx/nx-842-powernv.c diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig index 34013f7..ee9e259 100644 --- a/drivers/crypto/nx/Kconfig +++ b/drivers/crypto/nx/Kconfig @@ -40,4 +40,14 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES algorithm. This supports NX hardware on the pSeries platform. If you choose 'M' here, this module will be called nx_compress_pseries. +config CRYPTO_DEV_NX_COMPRESS_POWERNV + tristate "Compression acceleration support on PowerNV platform" + depends on PPC_POWERNV + default y + help + Support for PowerPC Nest (NX) compression acceleration. This + module supports acceleration for compressing memory with the 842 + algorithm. This supports NX hardware on the PowerNV platform. + If you choose 'M' here, this module will be called nx_compress_powernv. 
+ endif diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile index 5d9f4bc..6619787 100644 --- a/drivers/crypto/nx/Makefile +++ b/drivers/crypto/nx/Makefile @@ -12,5 +12,7 @@ nx-crypto-objs := nx.o \ obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o +obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o nx-compress-objs := nx-842.o nx-compress-pseries-objs := nx-842-pseries.o +nx-compress-powernv-objs := nx-842-powernv.o diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c new file mode 100644 index 000..6a9fb8b --- /dev/null +++ b/drivers/crypto/nx/nx-842-powernv.c @@ -0,0 +1,625 @@ +/* + * Driver for IBM PowerNV 842 compression accelerator + * + * Copyright (C) 2015 Dan Streetman, IBM Corp + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include "nx-842.h" + +#include + +#include +#include + +#define MODULE_NAME NX842_POWERNV_MODULE_NAME +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Dan Streetman "); +MODULE_DESCRIPTION("842 H/W Compression driver for IBM PowerNV processors"); + +#define WORKMEM_ALIGN (CRB_ALIGN) +#define CSB_WAIT_MAX (5000) /* ms */ + +struct nx842_workmem { + /* Below fields must be properly aligned */ + struct coprocessor_request_block crb; /* CRB_ALIGN align */ + struct data_descriptor_entry ddl_in[DDL_LEN_MAX]; /* DDE_ALIGN align */ + struct data_descriptor_entry ddl_out[DDL_LEN_MAX]; /* DDE_ALIGN align */ + /* Above fields must be properly aligned */ + + ktime_t start; + + char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */ +} __packed __aligned(WORKMEM_ALIGN); + +struct nx842_coproc { + unsigned int chip_id; + unsigned int ct; + unsigned int ci; + struct list_head list; +}; + +/* no cpu hotplug on powernv, so this list never changes after init */ +static LIST_HEAD(nx842_coprocs); +static unsigned int nx842_ct; + +/** + * setup_indirect_dde - Setup an indirect DDE + * + * The DDE is setup with the the DDE count, byte count, and address of + * first direct DDE in the list. + */ +static void setup_indirect_dde(struct data_descriptor_entry *dde, + struct data_descriptor_entry *ddl, + unsigned int dde_count, unsigned int byte_count) +{ + dde->flags = 0; + dde->count = dde_count; + dde->index = 0; + dde->length = cpu_to_be32(byte_count); + dde->address = cpu_to_be64(nx842_get_pa(ddl)); +} + +/** + * setup_direct_dde - Setup single DDE from buffer + * + * The DDE is setup with the buffer and length. The buffer must be properly + * aligned. The used length is returned. + * Returns: + * NSuccessfully set up DDE with N bytes + */ +static unsigned int setup_direct_dde(struct data_descriptor_entry *dde, +unsigned long pa, unsigned int len) +{ + unsigned int l =
[PATCH 07/10] drivers/crypto/nx: add nx842 constraints
Add "constraints" for the NX-842 driver. The constraints are used to
indicate what the current NX-842 platform driver is capable of. The
constraints tell the NX-842 user what alignment, min and max length, and
length multiple each provided buffer should conform to. These are
required because the 842 hardware requires buffers to meet specific
constraints that vary based on platform - for example, the pSeries
max length is much lower than the PowerNV max length.

Signed-off-by: Dan Streetman
---
 drivers/crypto/nx/nx-842-pseries.c | 10 ++
 drivers/crypto/nx/nx-842.c | 38 ++
 drivers/crypto/nx/nx-842.h | 2 ++
 include/linux/nx842.h | 9 +
 4 files changed, 59 insertions(+)

diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c
index 9b83c9e..cb481d8 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -40,6 +40,13 @@ MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors");
 /* IO buffer must be 128 byte aligned */
 #define IO_BUFFER_ALIGN 128
 
+static struct nx842_constraints nx842_pseries_constraints = {
+	.alignment =	IO_BUFFER_ALIGN,
+	.multiple =	DDE_BUFFER_LAST_MULT,
+	.minimum =	IO_BUFFER_ALIGN,
+	.maximum =	PAGE_SIZE, /* dynamic, max_sync_size */
+};
+
 struct nx842_header {
 	int blocks_nr; /* number of compressed blocks */
 	int offset; /* offset of the first block (from beginning of header) */
@@ -842,6 +849,8 @@ static int nx842_OF_upd_maxsyncop(struct nx842_devdata *devdata,
 		goto out;
 	}
 
+	nx842_pseries_constraints.maximum = devdata->max_sync_size;
+
 	devdata->max_sync_sg = (unsigned int)min(maxsynccop->comp_sg_limit,
 					maxsynccop->decomp_sg_limit);
 	if (devdata->max_sync_sg < 1) {
@@ -1115,6 +1124,7 @@ static struct attribute_group nx842_attribute_group = {
 
 static struct nx842_driver nx842_pseries_driver = {
 	.owner =	THIS_MODULE,
+	.constraints =	&nx842_pseries_constraints,
 	.compress =	nx842_pseries_compress,
 	.decompress =	nx842_pseries_decompress,
 };
diff --git a/drivers/crypto/nx/nx-842.c
b/drivers/crypto/nx/nx-842.c index f1f378e..160fe2d 100644 --- a/drivers/crypto/nx/nx-842.c +++ b/drivers/crypto/nx/nx-842.c @@ -86,6 +86,44 @@ static void put_driver(struct nx842_driver *driver) module_put(driver->owner); } +/** + * nx842_constraints + * + * This provides the driver's constraints. Different nx842 implementations + * may have varying requirements. The constraints are: + * @alignment: All buffers should be aligned to this + * @multiple:All buffer lengths should be a multiple of this + * @minimum: Buffer lengths must not be less than this amount + * @maximum: Buffer lengths must not be more than this amount + * + * The constraints apply to all buffers and lengths, both input and output, + * for both compression and decompression, except for the minimum which + * only applies to compression input and decompression output; the + * compressed data can be less than the minimum constraint. It can be + * assumed that compressed data will always adhere to the multiple + * constraint. + * + * The driver may succeed even if these constraints are violated; + * however the driver can return failure or suffer reduced performance + * if any constraint is not met. 
+ */ +int nx842_constraints(struct nx842_constraints *c) +{ + struct nx842_driver *driver = get_driver(); + int ret = 0; + + if (!driver) + return -ENODEV; + + BUG_ON(!c); + memcpy(c, driver->constraints, sizeof(*c)); + + put_driver(driver); + + return ret; +} +EXPORT_SYMBOL_GPL(nx842_constraints); + int nx842_compress(const unsigned char *in, unsigned int in_len, unsigned char *out, unsigned int *out_len, void *wrkmem) diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h index 2a5d4e1..c6ceb0f 100644 --- a/drivers/crypto/nx/nx-842.h +++ b/drivers/crypto/nx/nx-842.h @@ -12,6 +12,8 @@ struct nx842_driver { struct module *owner; + struct nx842_constraints *constraints; + int (*compress)(const unsigned char *in, unsigned int in_len, unsigned char *out, unsigned int *out_len, void *wrkmem); diff --git a/include/linux/nx842.h b/include/linux/nx842.h index d919c22..aa1a97e9 100644 --- a/include/linux/nx842.h +++ b/include/linux/nx842.h @@ -5,6 +5,15 @@ #define NX842_MEM_COMPRESS __NX842_PSERIES_MEM_COMPRESS +struct nx842_constraints { + int alignment; + int multiple; + int minimum; + int maximum; +}; + +int nx842_constraints(struct nx842_constraints *constraints); + int nx842_compress(const unsigned char *in, unsigned
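As a rough illustration of how a caller might use the values returned by nx842_constraints(), the sketch below checks one buffer against the four fields described in the kernel-doc comment above. It is a user-space approximation, not code from the patch: struct constraints_sketch merely mirrors the fields of struct nx842_constraints, buffer_meets_constraints() is a hypothetical helper, and the apply_minimum flag is an assumed way to model the rule that the minimum does not apply to compressed data.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative mirror of the fields of struct nx842_constraints. */
struct constraints_sketch {
	int alignment;
	int multiple;
	int minimum;
	int maximum;
};

/*
 * Hypothetical helper: returns 0 if the buffer satisfies every
 * constraint, -EINVAL otherwise. Pass apply_minimum = false for
 * compressed data, which may be shorter than the minimum.
 * Assumes alignment and multiple are positive.
 */
int buffer_meets_constraints(const struct constraints_sketch *c,
			     uintptr_t addr, unsigned int len,
			     bool apply_minimum)
{
	if (addr % (uintptr_t)c->alignment)
		return -EINVAL;	/* start address not aligned */
	if (len % (unsigned int)c->multiple)
		return -EINVAL;	/* length not a multiple */
	if (apply_minimum && len < (unsigned int)c->minimum)
		return -EINVAL;	/* too short */
	if (len > (unsigned int)c->maximum)
		return -EINVAL;	/* too long */
	return 0;
}
```

A caller that gets -EINVAL here could copy the data into a bounce buffer that does meet the constraints before handing it to the driver; whether that is worthwhile depends on the platform, since the comment above notes the driver may still succeed when a constraint is violated.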
[PATCH 06/10] drivers/crypto/nx: add NX-842 platform frontend driver
Add NX-842 frontend that allows using either the pSeries platform or PowerNV platform driver (to be added by later patch) for the NX-842 hardware. Update the MAINTAINERS file to include the new filenames. Update Kconfig files to clarify titles and descriptions, and correct dependencies. Signed-off-by: Dan Streetman --- MAINTAINERS| 2 +- drivers/crypto/Kconfig | 10 +-- drivers/crypto/nx/Kconfig | 35 ++--- drivers/crypto/nx/Makefile | 4 +- drivers/crypto/nx/nx-842-pseries.c | 57 +++ drivers/crypto/nx/nx-842.c | 144 + drivers/crypto/nx/nx-842.h | 32 + include/linux/nx842.h | 10 +-- 8 files changed, 245 insertions(+), 49 deletions(-) create mode 100644 drivers/crypto/nx/nx-842.c create mode 100644 drivers/crypto/nx/nx-842.h diff --git a/MAINTAINERS b/MAINTAINERS index 5a5c1dc..e71855f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4870,7 +4870,7 @@ F:drivers/crypto/nx/ IBM Power 842 compression accelerator M: Dan Streetman S: Supported -F: drivers/crypto/nx/nx-842.c +F: drivers/crypto/nx/nx-842* F: include/linux/nx842.h F: include/linux/sw842.h F: crypto/842.c diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig index 033c0c8..872de26 100644 --- a/drivers/crypto/Kconfig +++ b/drivers/crypto/Kconfig @@ -312,11 +312,13 @@ config CRYPTO_DEV_S5P algorithms execution. config CRYPTO_DEV_NX - bool "Support for IBM Power7+ in-Nest cryptographic acceleration" - depends on PPC64 && IBMVIO && !CPU_LITTLE_ENDIAN - default n + bool "Support for IBM PowerPC Nest (NX) cryptographic acceleration" + depends on PPC64 help - Support for Power7+ in-Nest cryptographic acceleration. + This enables support for the NX hardware cryptographic accelerator + coprocessor that is in IBM PowerPC P7+ or later processors. This + does not actually enable any drivers, it only allows you to select + which acceleration type (encryption and/or compression) to enable. 
if CRYPTO_DEV_NX source "drivers/crypto/nx/Kconfig" diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig index f826166..34013f7 100644 --- a/drivers/crypto/nx/Kconfig +++ b/drivers/crypto/nx/Kconfig @@ -1,7 +1,9 @@ + config CRYPTO_DEV_NX_ENCRYPT - tristate "Encryption acceleration support" - depends on PPC64 && IBMVIO + tristate "Encryption acceleration support on pSeries platform" + depends on PPC_PSERIES && IBMVIO && !CPU_LITTLE_ENDIAN default y + select CRYPTO_ALGAPI select CRYPTO_AES select CRYPTO_CBC select CRYPTO_ECB @@ -12,15 +14,30 @@ config CRYPTO_DEV_NX_ENCRYPT select CRYPTO_SHA256 select CRYPTO_SHA512 help - Support for Power7+ in-Nest encryption acceleration. This - module supports acceleration for AES and SHA2 algorithms. If you - choose 'M' here, this module will be called nx_crypto. + Support for PowerPC Nest (NX) encryption acceleration. This + module supports acceleration for AES and SHA2 algorithms on + the pSeries platform. If you choose 'M' here, this module + will be called nx_crypto. config CRYPTO_DEV_NX_COMPRESS tristate "Compression acceleration support" - depends on PPC64 && IBMVIO default y help - Support for Power7+ in-Nest compression acceleration. This - module supports acceleration for AES and SHA2 algorithms. If you - choose 'M' here, this module will be called nx_compress. + Support for PowerPC Nest (NX) compression acceleration. This + module supports acceleration for compressing memory with the 842 + algorithm. One of the platform drivers must be selected also. + If you choose 'M' here, this module will be called nx_compress. + +if CRYPTO_DEV_NX_COMPRESS + +config CRYPTO_DEV_NX_COMPRESS_PSERIES + tristate "Compression acceleration support on pSeries platform" + depends on PPC_PSERIES && IBMVIO && !CPU_LITTLE_ENDIAN + default y + help + Support for PowerPC Nest (NX) compression acceleration. This + module supports acceleration for compressing memory with the 842 + algorithm. 
This supports NX hardware on the pSeries platform. + If you choose 'M' here, this module will be called nx_compress_pseries. + +endif diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile index 8669ffa..5d9f4bc 100644 --- a/drivers/crypto/nx/Makefile +++ b/drivers/crypto/nx/Makefile @@ -11,4 +11,6 @@ nx-crypto-objs := nx.o \ nx-sha512.o obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o -nx-compress-objs := nx-842-pseries.o +obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o +nx-compress-objs := nx-842.o +nx-co
[PATCH 05/10] drivers/crypto/nx: rename nx-842.c to nx-842-pseries.c
Move the entire NX-842 driver for the pSeries platform from the file nx-842.c to nx-842-pseries.c. This is required by later patches that add NX-842 support for the PowerNV platform. This patch does not alter the content of the pSeries NX-842 driver at all, it only changes the filename. Signed-off-by: Dan Streetman --- drivers/crypto/nx/Makefile |2 +- drivers/crypto/nx/nx-842-pseries.c | 1603 drivers/crypto/nx/nx-842.c | 1603 3 files changed, 1604 insertions(+), 1604 deletions(-) create mode 100644 drivers/crypto/nx/nx-842-pseries.c delete mode 100644 drivers/crypto/nx/nx-842.c diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile index bb770ea..8669ffa 100644 --- a/drivers/crypto/nx/Makefile +++ b/drivers/crypto/nx/Makefile @@ -11,4 +11,4 @@ nx-crypto-objs := nx.o \ nx-sha512.o obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o -nx-compress-objs := nx-842.o +nx-compress-objs := nx-842-pseries.o diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c new file mode 100644 index 000..887196e --- /dev/null +++ b/drivers/crypto/nx/nx-842-pseries.c @@ -0,0 +1,1603 @@ +/* + * Driver for IBM Power 842 compression accelerator + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. 
+ * + * Copyright (C) IBM Corporation, 2012 + * + * Authors: Robert Jennings + * Seth Jennings + */ + +#include +#include +#include +#include +#include + +#include +#include + +#include "nx_csbcpb.h" /* struct nx_csbcpb */ + +#define MODULE_NAME "nx-compress" +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Robert Jennings "); +MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors"); + +#define SHIFT_4K 12 +#define SHIFT_64K 16 +#define SIZE_4K (1UL << SHIFT_4K) +#define SIZE_64K (1UL << SHIFT_64K) + +/* IO buffer must be 128 byte aligned */ +#define IO_BUFFER_ALIGN 128 + +struct nx842_header { + int blocks_nr; /* number of compressed blocks */ + int offset; /* offset of the first block (from beginning of header) */ + int sizes[0]; /* size of compressed blocks */ +}; + +static inline int nx842_header_size(const struct nx842_header *hdr) +{ + return sizeof(struct nx842_header) + + hdr->blocks_nr * sizeof(hdr->sizes[0]); +} + +/* Macros for fields within nx_csbcpb */ +/* Check the valid bit within the csbcpb valid field */ +#define NX842_CSBCBP_VALID_CHK(x) (x & BIT_MASK(7)) + +/* CE macros operate on the completion_extension field bits in the csbcpb. 
+ * CE0 0=full completion, 1=partial completion + * CE1 0=CE0 indicates completion, 1=termination (output may be modified) + * CE2 0=processed_bytes is source bytes, 1=processed_bytes is target bytes */ +#define NX842_CSBCPB_CE0(x)(x & BIT_MASK(7)) +#define NX842_CSBCPB_CE1(x)(x & BIT_MASK(6)) +#define NX842_CSBCPB_CE2(x)(x & BIT_MASK(5)) + +/* The NX unit accepts data only on 4K page boundaries */ +#define NX842_HW_PAGE_SHIFTSHIFT_4K +#define NX842_HW_PAGE_SIZE (ASM_CONST(1) << NX842_HW_PAGE_SHIFT) +#define NX842_HW_PAGE_MASK (~(NX842_HW_PAGE_SIZE-1)) + +enum nx842_status { + UNAVAILABLE, + AVAILABLE +}; + +struct ibm_nx842_counters { + atomic64_t comp_complete; + atomic64_t comp_failed; + atomic64_t decomp_complete; + atomic64_t decomp_failed; + atomic64_t swdecomp; + atomic64_t comp_times[32]; + atomic64_t decomp_times[32]; +}; + +static struct nx842_devdata { + struct vio_dev *vdev; + struct device *dev; + struct ibm_nx842_counters *counters; + unsigned int max_sg_len; + unsigned int max_sync_size; + unsigned int max_sync_sg; + enum nx842_status status; +} __rcu *devdata; +static DEFINE_SPINLOCK(devdata_mutex); + +#define NX842_COUNTER_INC(_x) \ +static inline void nx842_inc_##_x( \ + const struct nx842_devdata *dev) { \ + if (dev) \ + atomic64_inc(&dev->counters->_x); \ +} +NX842_COUNTER_INC(comp_complete); +NX842_COUNTER_INC(comp_failed); +NX842_COUNTER_INC(decomp_complete); +NX842_COUNTER_INC(decomp_failed); +NX842_COUNTER_INC(swdecomp); + +#define NX842_HIST_SLOTS 16 + +static void ibm_nx842_incr_hist(atomic64_t *times, unsigned int t
[PATCH 04/10] crypto: change 842 alg to use software
Change the crypto 842 compression alg to use the software 842 compression
and decompression library. Add the crypto driver_name as "842-generic".
Remove the fallback to LZO compression.

Previously, this crypto compression alg attempted 842 compression using
PowerPC hardware, and fell back to LZO compression and decompression if
the 842 PowerPC hardware was unavailable or failed. This should not
fall back to any other compression method, however; users of this crypto
compression alg can fall back if desired, and transparent fallback tricks
callers into thinking they are getting 842 compression when they actually
get LZO compression - the failure of the 842 hardware should not be
transparent to the caller.

The crypto compression alg for a hardware device also should not be located
in crypto/ so this is now a software-only implementation that uses the 842
software compression/decompression library.

Signed-off-by: Dan Streetman
---
 MAINTAINERS | 1 +
 crypto/842.c | 174 -
 crypto/Kconfig | 7 +--
 3 files changed, 41 insertions(+), 141 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 116af01..5a5c1dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4873,6 +4873,7 @@ S:	Supported
 F:	drivers/crypto/nx/nx-842.c
 F:	include/linux/nx842.h
 F:	include/linux/sw842.h
+F:	crypto/842.c
 F:	lib/842/
 
 IBM Power Linux RAID adapter
diff --git a/crypto/842.c b/crypto/842.c
index b48f4f1..98e387e 100644
--- a/crypto/842.c
+++ b/crypto/842.c
@@ -1,5 +1,5 @@
 /*
- * Cryptographic API for the 842 compression algorithm.
+ * Cryptographic API for the 842 software compression algorithm.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -11,173 +11,73 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  * GNU General Public License for more details.
* - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + * Copyright (C) IBM Corporation, 2011-2015 * - * Copyright (C) IBM Corporation, 2011 + * Original Authors: Robert Jennings + * Seth Jennings * - * Authors: Robert Jennings - * Seth Jennings + * Rewrite: Dan Streetman + * + * This is the software implementation of compression and decompression using + * the 842 format. This uses the software 842 library at lib/842/ which is + * only a reference implementation, and is very, very slow as compared to other + * software compressors. You probably do not want to use this software + * compression. If you have access to the PowerPC 842 compression hardware, you + * want to use the 842 hardware compression interface, which is at: + * drivers/crypto/nx/nx-842-crypto.c */ #include #include #include -#include -#include -#include -#include - -static int nx842_uselzo; - -struct nx842_ctx { - void *nx842_wmem; /* working memory for 842/lzo */ -}; +#include -enum nx842_crypto_type { - NX842_CRYPTO_TYPE_842, - NX842_CRYPTO_TYPE_LZO +struct crypto842_ctx { + char wmem[SW842_MEM_COMPRESS]; /* working memory for compress */ }; -#define NX842_SENTINEL 0xdeadbeef - -struct nx842_crypto_header { - unsigned int sentinel; /* debug */ - enum nx842_crypto_type type; -}; - -static int nx842_init(struct crypto_tfm *tfm) -{ - struct nx842_ctx *ctx = crypto_tfm_ctx(tfm); - int wmemsize; - - wmemsize = max_t(int, nx842_get_workmem_size(), LZO1X_MEM_COMPRESS); - ctx->nx842_wmem = kmalloc(wmemsize, GFP_NOFS); - if (!ctx->nx842_wmem) - return -ENOMEM; - - return 0; -} - -static void nx842_exit(struct crypto_tfm *tfm) -{ - struct nx842_ctx *ctx = crypto_tfm_ctx(tfm); - - kfree(ctx->nx842_wmem); -} - -static void nx842_reset_uselzo(unsigned long data) +static int crypto842_compress(struct crypto_tfm *tfm, + const u8 *src, unsigned int slen, 
+ u8 *dst, unsigned int *dlen) { - nx842_uselzo = 0; -} - -static DEFINE_TIMER(failover_timer, nx842_reset_uselzo, 0, 0); - -static int nx842_crypto_compress(struct crypto_tfm *tfm, const u8 *src, - unsigned int slen, u8 *dst, unsigned int *dlen) -{ - struct nx842_ctx *ctx = crypto_tfm_ctx(tfm); - struct nx842_crypto_header *hdr; - unsigned int tmp_len = *dlen; - size_t lzodlen; /* needed for lzo */ - int err; - - *dlen = 0; - hdr = (struct nx842_crypto_header *)dst; - hdr->sentinel = NX842_SENTINEL; /* debug */ - dst += sizeof(struct nx842_crypto_header); - tmp_len -= sizeof(struct nx842_crypto_header); - lzodlen = tmp_len; - - if (likely(!nx842_uselzo)) { - err = nx842_com
[PATCH 03/10] lib: add software 842 compression/decompression
Add 842-format software compression and decompression functions. Update the MAINTAINERS 842 section to include the new files. The 842 compression function can compress any input data into the 842 compression format. The 842 decompression function can decompress any standard-format 842 compressed data - specifically, either a compressed data buffer created by the 842 software compression function, or a compressed data buffer created by the 842 hardware compressor (located in PowerPC coprocessors). The 842 compressed data format is explained in the header comments. This is used in a later patch to provide a full software 842 compression and decompression crypto interface. Signed-off-by: Dan Streetman --- MAINTAINERS | 2 + include/linux/sw842.h| 12 + lib/842/842.h| 127 ++ lib/842/842_compress.c | 626 +++ lib/842/842_debugfs.h| 52 lib/842/842_decompress.c | 405 ++ lib/842/Makefile | 2 + lib/Kconfig | 6 + lib/Makefile | 2 + 9 files changed, 1234 insertions(+) create mode 100644 include/linux/sw842.h create mode 100644 lib/842/842.h create mode 100644 lib/842/842_compress.c create mode 100644 lib/842/842_debugfs.h create mode 100644 lib/842/842_decompress.c create mode 100644 lib/842/Makefile diff --git a/MAINTAINERS b/MAINTAINERS index 781e099..116af01 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4872,6 +4872,8 @@ M:Dan Streetman S: Supported F: drivers/crypto/nx/nx-842.c F: include/linux/nx842.h +F: include/linux/sw842.h +F: lib/842/ IBM Power Linux RAID adapter M: Brian King diff --git a/include/linux/sw842.h b/include/linux/sw842.h new file mode 100644 index 000..109ba04 --- /dev/null +++ b/include/linux/sw842.h @@ -0,0 +1,12 @@ +#ifndef __SW842_H__ +#define __SW842_H__ + +#define SW842_MEM_COMPRESS (0xf000) + +int sw842_compress(const u8 *src, unsigned int srclen, + u8 *dst, unsigned int *destlen, void *wmem); + +int sw842_decompress(const u8 *src, unsigned int srclen, +u8 *dst, unsigned int *destlen); + +#endif diff --git a/lib/842/842.h b/lib/842/842.h new file 
mode 100644 index 000..7c20003 --- /dev/null +++ b/lib/842/842.h @@ -0,0 +1,127 @@ + +#ifndef __842_H__ +#define __842_H__ + +/* The 842 compressed format is made up of multiple blocks, each of + * which have the format: + * + * [arg1][arg2][arg3][arg4] + * + * where there are between 0 and 4 template args, depending on the specific + * template operation. For normal operations, each arg is either a specific + * number of data bytes to add to the output buffer, or an index pointing + * to a previously-written number of data bytes to copy to the output buffer. + * + * The template code is a 5-bit value. This code indicates what to do with + * the following data. Template codes from 0 to 0x19 should use the template + * table, the static "decomp_ops" table used in decompress. For each template + * (table row), there are between 1 and 4 actions; each action corresponds to + * an arg following the template code bits. Each action is either a "data" + * type action, or a "index" type action, and each action results in 2, 4, or 8 + * bytes being written to the output buffer. Each template (i.e. all actions + * in the table row) will add up to 8 bytes being written to the output buffer. + * Any row with less than 4 actions is padded with noop actions, indicated by + * N0 (for which there is no corresponding arg in the compressed data buffer). + * + * "Data" actions, indicated in the table by D2, D4, and D8, mean that the + * corresponding arg is 2, 4, or 8 bytes, respectively, in the compressed data + * buffer should be copied directly to the output buffer. + * + * "Index" actions, indicated in the table by I2, I4, and I8, mean the + * corresponding arg is an index parameter that points to, respectively, a 2, + * 4, or 8 byte value already in the output buffer, that should be copied to + * the end of the output buffer. Essentially, the index points to a position + * in a ring buffer that contains the last N bytes of output buffer data. 
+ * The number of bits for each index's arg are: 8 bits for I2, 9 bits for I4, + * and 8 bits for I8. Since each index points to a 2, 4, or 8 byte section, + * this means that I2 can reference 512 bytes ((2^8 bits = 256) * 2 bytes), I4 + * can reference 2048 bytes ((2^9 = 512) * 4 bytes), and I8 can reference 2048 + * bytes ((2^8 = 256) * 8 bytes). Think of it as a kind-of ring buffer for + * each of I2, I4, and I8 that are updated for each byte written to the output + * buffer. In this implementation, the output buffer is directly used for each + * index; there is no additional memory required. Note that the index is into + * a ring buffer, not a sliding window; for example, if there have been 260 + * bytes written to the output buffer, an
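The index-reach arithmetic in the header comment above (512 bytes for I2, 2048 for I4, 2048 for I8) follows from the number of index bits and the entry size. A minimal sketch of that calculation, where index_reach_bytes() is a hypothetical helper written only for this illustration and is not part of lib/842:

```c
/*
 * Bytes of previous output reachable through an index arg:
 * (2^index_bits addressable entries) * (bytes per entry).
 * Hypothetical helper for illustration only.
 */
unsigned int index_reach_bytes(unsigned int index_bits,
			       unsigned int entry_bytes)
{
	return (1u << index_bits) * entry_bytes;
}
```

With the bit widths given in the comment, index_reach_bytes(8, 2), index_reach_bytes(9, 4) and index_reach_bytes(8, 8) correspond to I2, I4 and I8 respectively.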
[PATCH 02/10] powerpc: Add ICSWX instruction
Add the asm ICSWX and ICSWEPX opcodes. Add definitions for the Coprocessor Request structures needed to use the icswx calls to coprocessors. Add icswx() function to perform the ICSWX asm using the provided Coprocessor Command Word value and Coprocessor Request Block structure. This is required for communication with the NX-842 coprocessor on a PowerNV system. Signed-off-by: Dan Streetman --- arch/powerpc/include/asm/icswx.h | 184 ++ arch/powerpc/include/asm/ppc-opcode.h | 13 +++ 2 files changed, 197 insertions(+) create mode 100644 arch/powerpc/include/asm/icswx.h diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h new file mode 100644 index 000..9f8402b --- /dev/null +++ b/arch/powerpc/include/asm/icswx.h @@ -0,0 +1,184 @@ +/* + * ICSWX api + * + * Copyright (C) 2015 IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * This provides the Initiate Coprocessor Store Word Indexed (ICSWX) + * instruction. This instruction is used to communicate with PowerPC + * coprocessors. This also provides definitions of the structures used + * to communicate with the coprocessor. + * + * The RFC02130: Coprocessor Architecture document is the reference for + * everything in this file unless otherwise noted. 
+ */ +#ifndef _ARCH_POWERPC_INCLUDE_ASM_ICSWX_H_ +#define _ARCH_POWERPC_INCLUDE_ASM_ICSWX_H_ + +#include /* for PPC_ICSWX */ + +/* Chapter 6.5.8 Coprocessor-Completion Block (CCB) */ + +#define CCB_VALUE (0x3fff) +#define CCB_ADDRESS(0xfff8) +#define CCB_CM (0x0007) +#define CCB_CM0(0x0004) +#define CCB_CM12 (0x0003) + +#define CCB_CM0_ALL_COMPLETIONS(0x0) +#define CCB_CM0_LAST_IN_CHAIN (0x4) +#define CCB_CM12_STORE (0x0) +#define CCB_CM12_INTERRUPT (0x1) + +#define CCB_SIZE (0x10) +#define CCB_ALIGN CCB_SIZE + +struct coprocessor_completion_block { + __be64 value; + __be64 address; +} __packed __aligned(CCB_ALIGN); + + +/* Chapter 6.5.7 Coprocessor-Status Block (CSB) */ + +#define CSB_V (0x80) +#define CSB_F (0x04) +#define CSB_CH (0x03) +#define CSB_CE_INCOMPLETE (0x80) +#define CSB_CE_TERMINATION (0x40) +#define CSB_CE_TPBC(0x20) + +#define CSB_CC_SUCCESS (0) +#define CSB_CC_INVALID_ALIGN (1) +#define CSB_CC_OPERAND_OVERLAP (2) +#define CSB_CC_DATA_LENGTH (3) +#define CSB_CC_TRANSLATION (5) +#define CSB_CC_PROTECTION (6) +#define CSB_CC_RD_EXTERNAL (7) +#define CSB_CC_INVALID_OPERAND (8) +#define CSB_CC_PRIVILEGE (9) +#define CSB_CC_INTERNAL(10) +#define CSB_CC_WR_EXTERNAL (12) +#define CSB_CC_NOSPC (13) +#define CSB_CC_EXCESSIVE_DDE (14) +#define CSB_CC_WR_TRANSLATION (15) +#define CSB_CC_WR_PROTECTION (16) +#define CSB_CC_UNKNOWN_CODE(17) +#define CSB_CC_ABORT (18) +#define CSB_CC_TRANSPORT (20) +#define CSB_CC_SEGMENTED_DDL (31) +#define CSB_CC_PROGRESS_POINT (32) +#define CSB_CC_DDE_OVERFLOW(33) +#define CSB_CC_SESSION (34) +#define CSB_CC_PROVISION (36) +#define CSB_CC_CHAIN (37) +#define CSB_CC_SEQUENCE(38) +#define CSB_CC_HW (39) + +#define CSB_SIZE (0x10) +#define CSB_ALIGN CSB_SIZE + +struct coprocessor_status_block { + u8 flags; + u8 cs; + u8 cc; + u8 ce; + __be32 count; + __be64 address; +} __packed __aligned(CSB_ALIGN); + + +/* Chapter 6.5.10 Data-Descriptor List (DDL) + * each list contains one or more Data-Descriptor Entries (DDE) + */ + +#define 
DDE_P (0x8000) + +#define DDE_SIZE (0x10) +#define DDE_ALIGN DDE_SIZE + +struct data_descriptor_entry { + __be16 flags; + u8 count; + u8 index; + __be32 length; + __be64 address; +} __packed __aligned(DDE_ALIGN); + + +/* Chapter 6.5.2 Coprocessor-Request Block (CRB) */ + +#define CRB_SIZE (0x80) +#define CRB_ALIGN (0x100) /* Errata: requires 256 alignment */ + +/* Coprocessor Status Block field + * ADDRESS address of CSB + * C CCB is valid + * AT0 = addrs are virtual, 1 = addrs are phys + * M enable perf monitor + */ +#define CRB_CSB_ADDRESS(0xfff0) +#define CRB_CSB_C (0x0008) +#define CRB_CSB_AT (0x0002) +#define CRB_CSB_M (0x0001) + +struct coprocessor_request_block { + __be32 ccw; + __be32 flags; + __be64 csb_addr; + +
[PATCH 01/10] powerpc: export of_get_ibm_chip_id function
Export the of_get_ibm_chip_id() function. This will be used by the PowerNV
NX-842 driver.

Signed-off-by: Dan Streetman
---
 arch/powerpc/kernel/prom.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 308c5e1..ea2cea7 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -800,6 +800,7 @@ int of_get_ibm_chip_id(struct device_node *np)
 	}
 	return -1;
 }
+EXPORT_SYMBOL(of_get_ibm_chip_id);
 
 /**
  * cpu_to_chip_id - Return the cpus chip-id
-- 
2.1.0
[PATCHv3 00/10] add 842 hw compression for PowerNV platform
IBM PowerPC processors starting at version P7+ contain an NX coprocessor that
provides various hw-accelerated functions, one of which is memory compression
to the IBM "842" compression format. This NX-842 coprocessor is already
supported on the pSeries platform, by the nx-842.c driver and the crypto
compression interface at crypto/842.c.

This patch set adds support for NX-842 on the PowerNV (Non-Virtualized)
platform, as well as adding a full software 842 compression/decompression
implementation.

Quick summary of changes: the current 842 crypto compression interface uses
only the 842 hardware on pSeries platforms, and can handle only page-sized and
page-aligned uncompressed buffers. These patches add a full software 842
implementation, change the crypto/ directory 842 interface to a software only
implementation, add an 842 hardware crypto compression interface that can
handle any size and alignment buffers, add a driver for 842 hardware on
PowerNV platforms, and create a common interface for both 842 hardware
platform drivers.

The existing pSeries platform NX-842 driver could not be re-used for the
PowerNV platform driver, as there are fundamentally different interfaces; on
pSeries the system hypervisor (pHyp) provides the interface and manages
communication with the coprocessor, while on PowerNV the kernel talks directly
to the coprocessor using the ICSWX instruction. The data structures used to
describe each compression or decompression request to the coprocessor are also
different between pHyp's interface and direct communication with ICSWX. So,
different drivers for pSeries and PowerNV are required.

Adding the new PowerNV driver but keeping the interface to the drivers the
same required adding a new common frontend interface, to which only one of the
platform drivers will connect (based on what platform the kernel is currently
running on), and moving some functionality out of the existing pSeries driver
into a more common location.
The existing crypto/842.c interface is in the wrong place, since crypto/
should only contain software implementations; so lib/842/ is added containing
a reference (i.e. rather slow) implementation in software of both 842
compression and 842 decompression. The crypto/842.c interface is changed to
use only that software implementation.

The hardware 842 crypto compression interface is moved to
drivers/crypto/nx/nx-842-crypto.c. It is also modified to be able to handle
any alignment/length input or output buffer; currently it is only able to
handle page-size and page-aligned (uncompressed) buffers, due to restrictions
in the pSeries 842 hardware driver.

v3 changes the sw and hw crypto drivers to use the same alg name "842", and
different driver names, "842-generic" and "842-nx".

Dan Streetman (10):
  powerpc: export of_get_ibm_chip_id function
  powerpc: Add ICSWX instruction
  lib: add software 842 compression/decompression
  crypto: change 842 alg to use software
  drivers/crypto/nx: rename nx-842.c to nx-842-pseries.c
  drivers/crypto/nx: add NX-842 platform frontend driver
  drivers/crypto/nx: add nx842 constraints
  drivers/crypto/nx: add PowerNV platform NX-842 driver
  drivers/crypto/nx: simplify pSeries nx842 driver
  drivers/crypto/nx: add hardware 842 crypto comp alg

 MAINTAINERS                           |    5 +-
 arch/powerpc/include/asm/icswx.h      |  184
 arch/powerpc/include/asm/ppc-opcode.h |   13 +
 arch/powerpc/kernel/prom.c            |    1 +
 crypto/842.c                          |  174 +---
 crypto/Kconfig                        |    7 +-
 drivers/crypto/Kconfig                |   10 +-
 drivers/crypto/nx/Kconfig             |   55 +-
 drivers/crypto/nx/Makefile            |    6 +
 drivers/crypto/nx/nx-842-crypto.c     |  585
 drivers/crypto/nx/nx-842-powernv.c    |  625 +
 drivers/crypto/nx/nx-842-pseries.c    | 1128 +++
 drivers/crypto/nx/nx-842.c            | 1623 +++--
 drivers/crypto/nx/nx-842.h            |  131 +++
 include/linux/nx842.h                 |   21 +-
 include/linux/sw842.h                 |   12 +
 lib/842/842.h                         |  127 +++
 lib/842/842_compress.c                |  626 +
 lib/842/842_debugfs.h                 |   52 ++
 lib/842/842_decompress.c              |  405
 lib/842/Makefile                      |    2 +
 lib/Kconfig                           |    6 +
 lib/Makefile                          |    2 +
 23 files changed, 4120 insertions(+), 1680 deletions(-)
 create mode 100644 arch/powerpc/include/asm/icswx.h
 create mode 100644 drivers/crypto/nx/nx-842-crypto.c
 create mode 100644 drivers/crypto/nx/nx-842-powernv.c
 create mode 100644 drivers/crypto/nx/nx-842-pseries.c
 create mode 100644 drivers/crypto/nx/nx-842.h
 create mode 100644 include/linux/sw842.h
 create mode 100644 lib/842/842.h
 create mode 100644 lib/842/842_compress.c
 create mode 100644 lib/842/842_debugfs.h
 create mode 100644 lib/84
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c |6 +- include/linux/cpuidle.h|6 +++--- include/linux/sched.h | 16 kernel/sched/core.c| 17 + kernel/sched/fair.c|2 +- kernel/sched/idle.c|8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..b7e86f4 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) { + default_idle_call(); + return -EBUSY; + } + target_state = &drv->states[index]; } + /* Take note of the planned idle state. 
*/ + idle_set_state(smp_processor_id(), target_state); + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -180,6 +188,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -215,12 +226,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /** diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && 
(drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRIVER_STATE_START && (drv->states[last_idx].disabled || dev->states_usage[last_idx].disable
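The new timer_stop_valid argument boils down to a filter on candidate states: a state flagged CPUIDLE_FLAG_TIMER_STOP may only be chosen when timer_stop_valid is non-zero. Below is a minimal userspace sketch of that filter, assuming a simplified state table. The struct sketch_state type, the SKETCH_FLAG_TIMER_STOP value and pick_state() are hypothetical names used for illustration only; this is not the ladder or menu governor's selection logic, just the predicate the patch adds to it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value; stands in for the kernel's CPUIDLE_FLAG_TIMER_STOP. */
#define SKETCH_FLAG_TIMER_STOP 0x04u

/* Hypothetical, cut-down version of an idle state description. */
struct sketch_state {
	unsigned int flags;        /* SKETCH_FLAG_* bits */
	unsigned int exit_latency; /* wakeup latency, arbitrary units */
};

/*
 * Same condition as the double negation in the patch:
 *   !(!timer_stop_valid && (flags & TIMER_STOP))
 * i.e. a state is acceptable if timer-stopping states are allowed,
 * or if this state does not stop the local timer.
 */
static bool state_allowed(const struct sketch_state *s, int timer_stop_valid)
{
	return timer_stop_valid || !(s->flags & SKETCH_FLAG_TIMER_STOP);
}

/*
 * Return the index of the deepest (last) state that passes the filter
 * and meets the latency requirement, or -1 if none does.
 * Assumes states[] is ordered from shallowest to deepest.
 */
static int pick_state(const struct sketch_state *states, size_t n,
		      int timer_stop_valid, unsigned int latency_req)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!state_allowed(&states[i], timer_stop_valid))
			continue;
		if (states[i].exit_latency > latency_req)
			continue;
		best = (int)i;
	}
	return best;
}
```

In the patch, cpuidle_enter_state() calls cpuidle_select(drv, dev, !broadcast) only after tick_broadcast_enter() has failed, so broadcast is set and the governor sees timer_stop_valid = 0. In the sketch that corresponds to pick_state() falling back to the deepest state that keeps the local timer running.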
Re: [PATCH v3 2/8] ipmi/powernv: Convert to irq event interface
On 05/06/2015 10:16 PM, Alistair Popple wrote: > Convert the opal ipmi driver to use the new irq interface for events. > > Signed-off-by: Alistair Popple > Cc: Corey Minyard > Cc: openipmi-develo...@lists.sourceforge.net > --- > > Corey, > > If this looks ok can you please ack it? Michael Ellerman will then take > the whole series via the powerpc tree. Thanks. This looks fine; I don't really understand much of this, but I don't see any issues. The only thing I would suggest is passing the irq level (IRQ_TYPE_LEVEL_HIGH) as part of the openfirmware data instead of hard-coding it. Acked-by: Corey Minyard > > drivers/char/ipmi/ipmi_powernv.c | 39 ++- > 1 file changed, 22 insertions(+), 17 deletions(-) > > diff --git a/drivers/char/ipmi/ipmi_powernv.c > b/drivers/char/ipmi/ipmi_powernv.c > index 8753b0f..9b409c0 100644 > --- a/drivers/char/ipmi/ipmi_powernv.c > +++ b/drivers/char/ipmi/ipmi_powernv.c > @@ -15,6 +15,8 @@ > #include > #include > #include > +#include > +#include > > #include > > @@ -23,8 +25,7 @@ struct ipmi_smi_powernv { > u64 interface_id; > struct ipmi_device_id ipmi_id; > ipmi_smi_t intf; > - u64 event; > - struct notifier_block event_nb; > + unsigned intirq; > > /** >* We assume that there can only be one outstanding request, so > @@ -197,15 +198,12 @@ static struct ipmi_smi_handlers > ipmi_powernv_smi_handlers = { > .poll = ipmi_powernv_poll, > }; > > -static int ipmi_opal_event(struct notifier_block *nb, > - unsigned long events, void *change) > +static irqreturn_t ipmi_opal_event(int irq, void *data) > { > - struct ipmi_smi_powernv *smi = container_of(nb, > - struct ipmi_smi_powernv, event_nb); > + struct ipmi_smi_powernv *smi = data; > > - if (events & smi->event) > - ipmi_powernv_recv(smi); > - return 0; > + ipmi_powernv_recv(smi); > + return IRQ_HANDLED; > } > > static int ipmi_powernv_probe(struct platform_device *pdev) > @@ -240,13 +238,16 @@ static int ipmi_powernv_probe(struct platform_device > *pdev) > goto err_free; > } > > - ipmi->event = 
1ull << prop; > - ipmi->event_nb.notifier_call = ipmi_opal_event; > + ipmi->irq = irq_of_parse_and_map(dev->of_node, 0); > + if (!ipmi->irq) { > + dev_info(dev, "Unable to map irq from device tree\n"); > + ipmi->irq = opal_event_request(prop); > + } > > - rc = opal_notifier_register(&ipmi->event_nb); > - if (rc) { > - dev_warn(dev, "OPAL notifier registration failed (%d)\n", rc); > - goto err_free; > + if (request_irq(ipmi->irq, ipmi_opal_event, IRQ_TYPE_LEVEL_HIGH, > + "opal-ipmi", ipmi)) { > + dev_warn(dev, "Unable to request irq\n"); > + goto err_dispose; > } > > ipmi->opal_msg = devm_kmalloc(dev, > @@ -271,7 +272,9 @@ static int ipmi_powernv_probe(struct platform_device > *pdev) > err_free_msg: > devm_kfree(dev, ipmi->opal_msg); > err_unregister: > - opal_notifier_unregister(&ipmi->event_nb); > + free_irq(ipmi->irq, ipmi); > +err_dispose: > + irq_dispose_mapping(ipmi->irq); > err_free: > devm_kfree(dev, ipmi); > return rc; > @@ -282,7 +285,9 @@ static int ipmi_powernv_remove(struct platform_device > *pdev) > struct ipmi_smi_powernv *smi = dev_get_drvdata(&pdev->dev); > > ipmi_unregister_smi(smi->intf); > - opal_notifier_unregister(&smi->event_nb); > + free_irq(smi->irq, smi); > + irq_dispose_mapping(smi->irq); > + > return 0; > } > > -- > 1.8.3.2 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Rebased on the latest linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c |6 +- include/linux/cpuidle.h|6 +++--- include/linux/sched.h | 16 kernel/sched/core.c| 17 + kernel/sched/fair.c|2 +- kernel/sched/idle.c|8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 61c417b..8f5657e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * local timer will be shut down. If a local timer is used from another * CPU as a broadcast timer, this call may fail if it is not available. */ - if (broadcast && tick_broadcast_enter()) - return -EBUSY; + if (broadcast && tick_broadcast_enter()) { + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) + return -EBUSY; + target_state = &drv->states[index]; + } + + /* Take note of the planned idle state. 
*/ + idle_set_state(smp_processor_id(), target_state); trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /** diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && 
(drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRIVER_STATE_START && (drv->states[last_idx].disabled || dev->states_usage[last_idx].disable || + (!timer_sto
Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020
Hi Scott, Thanks for fast response, please see inline. On 05/06/2015 11:22 PM, Scott Wood wrote: On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote: New QorIQ p1020 based board support from Arcturus Networks Inc. http://www.arcturusnetworks.com/products/ucp1020/ Signed-off-by: Michael Durrant Signed-off-by: Oleksandr G Zhadan --- Documentation/devicetree/bindings/pci/fsl,pci.txt |2 +- .../devicetree/bindings/powerpc/arcturus/board.txt | 149 ++ .../devicetree/bindings/powerpc/arcturus/ecm.txt | 64 + Documentation/devicetree/bindings/usb/fsl-usb.txt |2 +- .../devicetree/bindings/vendor-prefixes.txt|1 + arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi | 179 ++ arch/powerpc/boot/dts/fsl/ucp1020som-pre.dtsi | 70 + arch/powerpc/boot/dts/ucp1020_32b.dts | 88 + arch/powerpc/boot/dts/ucp1020_32b.dtsi | 174 ++ arch/powerpc/configs/ucp1020_defconfig | 2731 arch/powerpc/platforms/85xx/Kconfig|7 + arch/powerpc/platforms/85xx/Makefile |1 + arch/powerpc/platforms/85xx/ucp1020_som.c | 100 + 13 files changed, 3566 insertions(+), 2 deletions(-) create mode 100644 Documentation/devicetree/bindings/powerpc/arcturus/board.txt create mode 100644 Documentation/devicetree/bindings/powerpc/arcturus/ecm.txt create mode 100644 arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi create mode 100644 arch/powerpc/boot/dts/fsl/ucp1020som-pre.dtsi create mode 100644 arch/powerpc/boot/dts/ucp1020_32b.dts create mode 100644 arch/powerpc/boot/dts/ucp1020_32b.dtsi create mode 100644 arch/powerpc/configs/ucp1020_defconfig create mode 100644 arch/powerpc/platforms/85xx/ucp1020_som.c diff --git a/Documentation/devicetree/bindings/pci/fsl,pci.txt b/Documentation/devicetree/bindings/pci/fsl,pci.txt index d8ac4a7..298a5e6 100644 --- a/Documentation/devicetree/bindings/pci/fsl,pci.txt +++ b/Documentation/devicetree/bindings/pci/fsl,pci.txt @@ -20,7 +20,7 @@ Example: #interrupt-cells = <1>; #size-cells = <2>; #address-cells = <3>; - compatible = "fsl,mpc8540-pcix", "fsl,mpc8540-pci"; + compatible = 
"fsl,mpc8540-pcix", "fsl,mpc8540-pci", "fsl,mpc8548-pcie"; device_type = "pci"; ... ... diff --git a/Documentation/devicetree/bindings/powerpc/arcturus/board.txt b/Documentation/devicetree/bindings/powerpc/arcturus/board.txt new file mode 100644 index 000..54e9765 --- /dev/null +++ b/Documentation/devicetree/bindings/powerpc/arcturus/board.txt @@ -0,0 +1,149 @@ +UCP1020 module Tree Bindings + + +Copyright 2013-2015 Arcturus Networks, Inc. + +QorIQ p1020 based board +http://www.arcturusnetworks.com/products/ucp1020/ +- + +Root Module + +Properties: +- model: "arcturus,uCP1020" +- compatible: "arcturus,uCP1020" +- SN: "1234567890-1234" + +/ { + model = "arcturus,uCP1020"; + compatible = "arcturus,uCP1020", "fsl,P1020"; + SN = "1234567890-1234"; + ... + } Drop the "fsl,P1020" compatible. Top-level compatible strings describe the whole board. SN is a bad property name. Call it something like "arcturus,serial#", and define what it actually means rather than just giving an example. OK, will fix. +- + +P1020 SPI controller + +Properties: +- compatible: "spansion,s25fl008k", "winbond,w25q80bl" + +Example: + spi@7000 { + flash@0 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "spansion,s25fl008k", "winbond,w25q80bl"; + reg = <0>; + spi-max-frequency = <4000>; /* input clock */ + ... + }; This isn't describing the controller, but rather a SPI chip attached to the controller. This also doesn't seem like the right place for random SPI chips. If all you're specifying is the compatible, maybe create a spi/trivial-devices.txt similar to i2c/trivial-devices.txt? Or something specific to SPI flash chips to describe the partition specification, though I generally recommend against describing partitions in the device tree -- especially if this is a developer board rather than something fixed-purpose where the partitioning is not going to change based on user requirements. 
Mostly in all Documentation/devicetree/bindings/ I tried to satisfy checkpatch script as simple as possible. And for me as well it looks reasonable to create spi/trivial-devices.txt file and I will. +- + +Chipselect/Local Bus + +Properties: +- #address-cells: <2>. +- #size-cells: <1>. +- compatible: "fsl,p1020-e
Re: [PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg
On Wed, May 6, 2015 at 11:12 PM, Herbert Xu wrote:
> On Wed, May 06, 2015 at 12:51:06PM -0400, Dan Streetman wrote:
>> Add crypto compression alg for 842 hardware compression and decompression.
>>
>> This crypto compression alg is named "nx842" to indicate it uses hardware
>> to perform the compression and decompression, while the software 842
>> compression alg is named "sw842". However, since before this split there
>> was only one 842 compression alg named "842" which only used hardware,
>> this is also aliased "842" for backwards compatibility.
>
> This should still be called 842. You can set the driver name to
> nx842 or 842-nx.

ah, ok, will do.

So, I'm wondering about the common NX 842 frontend driver, for the
pSeries and PowerNV platform drivers. The current setup is:

[ crypto "842-nx" driver ]
            v
[ nx-842 main driver ]
            v
[ nx-842 pSeries driver | nx-842 PowerNV driver ]

The main reason for that is that the HW has specific constraints,
specifically each input and output buffer passed to it for comp or
decomp has to:
-be located at a specific alignment
-have a length of a specific multiple
-have a length between a specific minimum and maximum

The crypto 842-nx has (significant) code in it to handle any alignment
and length input buffers, to match them to what the driver requires.
Would it be better to move that into the crypto code, so that any
crypto compression hw driver can request buffers be specifically
aligned/sized?

I did have to use a header on each compressed buffer that needed
re-alignment or re-sizing, so maybe it's not appropriate for common
crypto compression code. Since there doesn't seem to be any other hw
compression drivers (yet), maybe it should stay in the 842-nx code, at
least for now. Hopefully any future compression hw won't have
alignment or length multiple restrictions...

>
> Cheers,
> --
> Email: Herbert Xu
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
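The three hardware buffer constraints listed in the message above (alignment, length multiple, minimum/maximum length) can be sketched as a small check. This is only an illustration under stated assumptions: struct buf_constraints, buf_ok() and round_up_len() are hypothetical names, not the actual nx-842 driver interface, and the alignment and multiple are assumed to be powers of two and non-zero respectively.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constraint descriptor; field names are illustrative only. */
struct buf_constraints {
	size_t alignment; /* required address alignment, assumed power of two */
	size_t multiple;  /* length must be a multiple of this, assumed != 0 */
	size_t minimum;   /* smallest accepted length */
	size_t maximum;   /* largest accepted length */
};

/* True if a buffer at addr with length len satisfies all three constraints. */
static bool buf_ok(const struct buf_constraints *c, uintptr_t addr, size_t len)
{
	if (addr & (c->alignment - 1))
		return false;	/* not located at the required alignment */
	if (len % c->multiple)
		return false;	/* length is not the required multiple */
	return len >= c->minimum && len <= c->maximum;
}

/*
 * Round len up to the next multiple. A caller that cannot pass its buffer
 * directly could use this to size a bounce buffer.
 */
static size_t round_up_len(const struct buf_constraints *c, size_t len)
{
	size_t rem = len % c->multiple;

	return rem ? len + (c->multiple - rem) : len;
}
```

A frontend that accepts arbitrary buffers would call something like buf_ok() first and only fall back to copying into a suitably aligned and sized bounce buffer when the check fails, which is presumably where the per-buffer header mentioned in the message comes in.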
[RFC PATCH] powerpc/mm: Use a non-idle variant of kick_all_cpus_sync
What we really need is the ability to wait for other cpus to finish their
local_irq_save/local_irq_restore region. We don't need to send an IPI to idle
cpus in that case. Add a variant of kick_all_cpus_sync to do that. If
idle_cpu_mask changes during the call, we should be ok because:

1) New cpus got added: in this case, when they enter the critical path they
   will have seen the new values I modified before smp_wmb().
2) Cpus got removed: in this case we are ok, because we only send a stray IPI
   to them.

Signed-off-by: Aneesh Kumar K.V
---

NOTE:
This needs closer review, because I am new to the area of cpu mask.

 arch/powerpc/mm/pgtable_64.c |  6 +++---
 include/linux/smp.h          |  9 +++++++++
 kernel/sched/fair.c          | 19 +++++++++++++++++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 049d961802aa..e54b111f8737 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -590,7 +590,7 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	 * by sending an IPI to all the cpus and executing a dummy
 	 * function there.
 	 */
-	kick_all_cpus_sync();
+	poke_nonidle_cpus_sync();
 	/*
 	 * Now invalidate the hpte entries in the range
 	 * covered by pmd. This make sure we take a
@@ -670,7 +670,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
 	 * This ensures that generic code that rely on IRQ disabling
 	 * to prevent a parallel THP split work as expected.
 	 */
-	kick_all_cpus_sync();
+	poke_nonidle_cpus_sync();
 }
 
 /*
@@ -855,7 +855,7 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 	 * different code paths. So make sure we wait for the parallel
 	 * find_linux_pte_or_hugepage to finish.
 	 */
-	kick_all_cpus_sync();
+	poke_nonidle_cpus_sync();
 	return old_pmd;
 }
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index c4414074bd88..16d539b94c31 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -101,6 +101,14 @@ int smp_call_function_any(const struct cpumask *mask,
 
 void kick_all_cpus_sync(void);
 void wake_up_all_idle_cpus(void);
+#ifdef CONFIG_NO_HZ_COMMON
+void poke_nonidle_cpus_sync(void);
+#else
+static inline void poke_nonidle_cpus_sync(void)
+{
+	return kick_all_cpus_sync();
+}
+#endif
 
 /*
  * Generic and arch helpers
@@ -150,6 +158,7 @@ smp_call_function_any(const struct cpumask *mask, smp_call_func_t func,
 
 static inline void kick_all_cpus_sync(void) {  }
 static inline void wake_up_all_idle_cpus(void) {  }
+static inline void poke_nonidle_cpus_sync(void) {  }
 
 #ifdef CONFIG_UP_LATE_INIT
 extern void __init up_late_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..00abc6ae077b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7513,6 +7513,25 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
 		return NOTIFY_DONE;
 	}
 }
+
+static void do_nothing(void *unused)
+{
+}
+
+void poke_nonidle_cpus_sync(void)
+{
+	struct cpumask mask;
+
+	/*
+	 * Make sure the change is visible before we poke the cpus
+	 */
+	smp_mb();
+	preempt_disable();
+	cpumask_andnot(&mask, cpu_online_mask, nohz.idle_cpus_mask);
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(poke_nonidle_cpus_sync);
 #endif
 
 static DEFINE_SPINLOCK(balancing);
-- 
2.1.4
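The set of CPUs that poke_nonidle_cpus_sync() targets is simply the online CPUs minus the ones recorded as idle. A minimal userspace sketch of that set arithmetic follows, assuming at most 64 CPUs so that a mask fits in one uint64_t. The names cpu_bits, nonidle_targets() and count_cpus() are hypothetical stand-ins for struct cpumask and cpumask_andnot(); they are not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit n set means CPU n is in the set. */
typedef uint64_t cpu_bits;

/*
 * Online CPUs that are not marked idle, i.e. the result that
 * cpumask_andnot(&mask, online, idle) would compute in the patch.
 */
static cpu_bits nonidle_targets(cpu_bits online, cpu_bits idle)
{
	return online & ~idle;
}

/* Count the CPUs in a mask by clearing the lowest set bit each iteration. */
static unsigned int count_cpus(cpu_bits mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;
		n++;
	}
	return n;
}
```

For example, with eight CPUs online (0xff) and CPUs 0-3 idle (0x0f), only CPUs 4-7 (0xf0) would be sent the IPI. An idle bit for a CPU that is not online has no effect on the result.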
[PATCH V2] powerpc/mm: Return NULL for not present hugetlb page
We need to check whether pte is present in follow_huge_addr and properly
return NULL if mapping is not present. Also use READ_ONCE when
dereferencing pte_t address.

Signed-off-by: Aneesh Kumar K.V
---
Changes from V1:
* Fix build failures with some platform configs. This involves using
  pmd_trans_huge(__pmd(pte_val(pte)))

 arch/powerpc/mm/hugetlbpage.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0ce968b00b7c..3385e3d0506e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
-	pte_t *ptep;
-	struct page *page;
+	pte_t *ptep, pte;
 	unsigned shift;
 	unsigned long mask, flags;
+	struct page *page = ERR_PTR(-EINVAL);
+
+	local_irq_save(flags);
+	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+	if (!ptep)
+		goto no_page;
+	pte = READ_ONCE(*ptep);
 	/*
+	 * Verify it is a huge page else bail.
 	 * Transparent hugepages are handled by generic code. We can skip them
 	 * here.
 	 */
-	local_irq_save(flags);
-	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+	if (!shift || pmd_trans_huge(__pmd(pte_val(pte))))
+		goto no_page;
 
-	/* Verify it is a huge page else bail. */
-	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
-		local_irq_restore(flags);
-		return ERR_PTR(-EINVAL);
+	if (!pte_present(pte)) {
+		page = NULL;
+		goto no_page;
 	}
 	mask = (1UL << shift) - 1;
-	page = pte_page(*ptep);
+	page = pte_page(pte);
 	if (page)
 		page += (address & mask) / PAGE_SIZE;
 
+no_page:
 	local_irq_restore(flags);
 	return page;
 }
-- 
2.1.4
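The last step of follow_huge_addr() in the patch above selects the base-page-sized subpage inside the huge page from the low bits of the address. A small worked sketch of that arithmetic follows, assuming 4 KiB base pages; SKETCH_PAGE_SIZE stands in for the kernel's PAGE_SIZE (which may be 64 KiB on some powerpc configs) and subpage_index() is a hypothetical helper name.

```c
/* Assumption for this sketch: 4 KiB base pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Index of the base page within a huge page of size (1UL << shift) that
 * contains address. Mirrors:
 *   mask = (1UL << shift) - 1;
 *   page += (address & mask) / PAGE_SIZE;
 */
static unsigned long subpage_index(unsigned long address, unsigned int shift)
{
	unsigned long mask = (1UL << shift) - 1; /* offset bits inside the huge page */

	return (address & mask) / SKETCH_PAGE_SIZE;
}
```

With a 16 MiB huge page (shift = 24), address 0x10003000 has offset 0x3000 inside the huge page, which is the fourth 4 KiB subpage (index 3).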
[PATCH v2 2/2] powerpc/mpc85xx: Fix EDAC address capture
From: York Sun

Extend err_addr to cover 64 bits for DDR errors.

Signed-off-by: York Sun
Signed-off-by: songwenbin
---
 drivers/edac/mpc85xx_edac.c | 10 +++++++---
 drivers/edac/mpc85xx_edac.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 68bf234..23ef8e9 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -811,6 +811,8 @@ static void sbe_ecc_decode(u32 cap_high, u32 cap_low, u32 cap_ecc,
 	}
 }
 
+#define make64(high, low) (((u64)(high) << 32) | (low))
+
 static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 {
 	struct mpc85xx_mc_pdata *pdata = mci->pvt_info;
@@ -818,7 +820,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	u32 bus_width;
 	u32 err_detect;
 	u32 syndrome;
-	u32 err_addr;
+	u64 err_addr;
 	u32 pfn;
 	int row_index;
 	u32 cap_high;
@@ -849,7 +851,9 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	else
 		syndrome &= 0x;
 
-	err_addr = in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS);
+	err_addr = make64(
+		in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_EXT_ADDRESS),
+		in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS));
 	pfn = err_addr >> PAGE_SHIFT;
 
 	for (row_index = 0; row_index < mci->nr_csrows; row_index++) {
@@ -886,7 +890,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	mpc85xx_mc_printk(mci, KERN_ERR,
 			"Captured Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
 			cap_high, cap_low, syndrome);
-	mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8x\n", err_addr);
+	mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8llx\n", err_addr);
 	mpc85xx_mc_printk(mci, KERN_ERR, "PFN: %#8.8x\n", pfn);
 
 	/* we are out of range */
diff --git a/drivers/edac/mpc85xx_edac.h b/drivers/edac/mpc85xx_edac.h
index 4498baf..9352e88 100644
--- a/drivers/edac/mpc85xx_edac.h
+++ b/drivers/edac/mpc85xx_edac.h
@@ -43,6 +43,7 @@
 #define MPC85XX_MC_ERR_INT_EN		0x0e48
 #define MPC85XX_MC_CAPTURE_ATRIBUTES	0x0e4c
 #define MPC85XX_MC_CAPTURE_ADDRESS	0x0e50
+#define MPC85XX_MC_CAPTURE_EXT_ADDRESS	0x0e54
 #define MPC85XX_MC_ERR_SBE		0x0e58
 
 #define DSC_MEM_EN	0x8000
-- 
2.1.0.27.g96db324
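The effect of the make64() macro in the patch above can be checked with a small userspace sketch that combines an extended-address word and a low address word into one 64-bit value. The macro body is the one from the patch with u64 spelled as uint64_t; capture_addr(), addr_to_pfn() and SKETCH_PAGE_SHIFT (4 KiB pages) are assumptions made for this illustration, and the register reads are replaced by plain function arguments.

```c
#include <stdint.h>

/* Same shape as the patch's macro, using the userspace 64-bit type. */
#define make64(high, low) (((uint64_t)(high) << 32) | (low))

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Combine the two 32-bit capture register values into one address.
 * ext holds address bits 63..32, low holds bits 31..0.
 */
static uint64_t capture_addr(uint32_t ext, uint32_t low)
{
	return make64(ext, low);
}

/* Page frame number of a 64-bit physical address. */
static uint64_t addr_to_pfn(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}
```

With ext = 0x1 and low = 0x80000000 the combined address is 0x180000000. Keeping only the low word, as the old u32 err_addr did, would report 0x80000000 instead. Under the 4 KiB assumption the resulting pfn is 0x180000, which still fits the u32 pfn variable that the patch leaves unchanged.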
[PATCH v2 1/2] powerpc/mpc8xxx: Change EDAC for FSL SoC
From: York Sun

Remove mpc83xx and mpc85xx as dependency.

Signed-off-by: York Sun
Signed-off-by: songwenbin
---
 drivers/edac/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index cb59619..ad07d4f 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -262,10 +262,10 @@ config EDAC_SBRIDGE
 
 config EDAC_MPC85XX
 	tristate "Freescale MPC83xx / MPC85xx"
-	depends on EDAC_MM_EDAC && FSL_SOC && (PPC_83xx || PPC_85xx)
+	depends on EDAC_MM_EDAC && FSL_SOC
 	help
 	  Support for error detection and correction on the Freescale
-	  MPC8349, MPC8560, MPC8540, MPC8548
+	  MPC8349, MPC8560, MPC8540, MPC8548, T4240
 
 config EDAC_MV64X60
 	tristate "Marvell MV64x60"
-- 
2.1.0.27.g96db324
[PATCH 1/2] powerpc/mpc8xxx: Change EDAC for FSL SoC
From: York Sun

Remove mpc83xx and mpc85xx as dependency.

Signed-off-by: York Sun
Change-Id: I92ff2ecf38b00e48a713baf2443495f8a1468beb
Reviewed-on: http://git.am.freescale.net:8181/554
Reviewed-by: Schmitt Richard-B43082
Tested-by: Schmitt Richard-B43082
Reviewed-by: Fleming Andrew-AFLEMING
Tested-by: Fleming Andrew-AFLEMING
Signed-off-by: songwenbin
---
 drivers/edac/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index cb59619..ad07d4f 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -262,10 +262,10 @@ config EDAC_SBRIDGE
 
 config EDAC_MPC85XX
 	tristate "Freescale MPC83xx / MPC85xx"
-	depends on EDAC_MM_EDAC && FSL_SOC && (PPC_83xx || PPC_85xx)
+	depends on EDAC_MM_EDAC && FSL_SOC
 	help
 	  Support for error detection and correction on the Freescale
-	  MPC8349, MPC8560, MPC8540, MPC8548
+	  MPC8349, MPC8560, MPC8540, MPC8548, T4240
 
 config EDAC_MV64X60
 	tristate "Marvell MV64x60"
-- 
2.1.0.27.g96db324
[PATCH] powerpc/mpc85xx: Fix EDAC address capture
From: York Sun

Extend err_addr to cover 64 bits for DDR errors.

Signed-off-by: York Sun
Change-Id: Idb112c4a106416a9cad9933c415e6f62de5cf07b
Reviewed-on: http://git.am.freescale.net:8181/553
Tested-by: Schmitt Richard-B43082
Reviewed-by: Fleming Andrew-AFLEMING
Tested-by: Fleming Andrew-AFLEMING
Signed-off-by: songwenbin
---
 drivers/edac/mpc85xx_edac.c | 10 +++---
 drivers/edac/mpc85xx_edac.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 68bf234..23ef8e9 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -811,6 +811,8 @@ static void sbe_ecc_decode(u32 cap_high, u32 cap_low, u32 cap_ecc,
 	}
 }
 
+#define make64(high, low) (((u64)(high) << 32) | (low))
+
 static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 {
 	struct mpc85xx_mc_pdata *pdata = mci->pvt_info;
@@ -818,7 +820,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	u32 bus_width;
 	u32 err_detect;
 	u32 syndrome;
-	u32 err_addr;
+	u64 err_addr;
 	u32 pfn;
 	int row_index;
 	u32 cap_high;
@@ -849,7 +851,9 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	else
 		syndrome &= 0x;
 
-	err_addr = in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS);
+	err_addr = make64(
+		in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_EXT_ADDRESS),
+		in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS));
 	pfn = err_addr >> PAGE_SHIFT;
 
 	for (row_index = 0; row_index < mci->nr_csrows; row_index++) {
@@ -886,7 +890,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 	mpc85xx_mc_printk(mci, KERN_ERR,
 		"Captured Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
 		cap_high, cap_low, syndrome);
-	mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8x\n", err_addr);
+	mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8llx\n", err_addr);
 	mpc85xx_mc_printk(mci, KERN_ERR, "PFN: %#8.8x\n", pfn);
 
 	/* we are out of range */
diff --git a/drivers/edac/mpc85xx_edac.h b/drivers/edac/mpc85xx_edac.h
index 4498baf..9352e88 100644
--- a/drivers/edac/mpc85xx_edac.h
+++ b/drivers/edac/mpc85xx_edac.h
@@ -43,6 +43,7 @@
 #define MPC85XX_MC_ERR_INT_EN		0x0e48
 #define MPC85XX_MC_CAPTURE_ATRIBUTES	0x0e4c
 #define MPC85XX_MC_CAPTURE_ADDRESS	0x0e50
+#define MPC85XX_MC_CAPTURE_EXT_ADDRESS	0x0e54
 #define MPC85XX_MC_ERR_SBE		0x0e58
 
 #define DSC_MEM_EN	0x8000
-- 
2.1.0.27.g96db324
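For readers who want to try the make64() idea from the patch above outside the kernel, here is a minimal userspace sketch. It assumes C99 fixed-width types in place of the kernel's u32/u64; the name combine_hi_lo is hypothetical and is not part of the EDAC driver.

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the patch's make64() macro.
 * The high word is widened to 64 bits before the shift, because
 * shifting a 32-bit value by 32 is undefined behaviour in C.
 */
static uint64_t combine_hi_lo(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}
```

With an extended-address word of 0x1 and a low word of 0x80000000, the combined value is 0x180000000, which no longer fits in the old 32-bit err_addr.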
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>> notification by executing *throttle_check() on any one of the cpu on
>>>> the chip. This is a sanity check to verify if we were indeed
>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>
>>>> We cannot call *throttle_check() directly from the notification
>>>> handler because we could be handling chip1's notification in chip2. So
>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>> in the notification handler, so use a worker thread to smp_call
>>>> throttle_check() on any of the cpu in the chipmask.
>>>
>>> I see that the first patch takes care of reporting *per-chip* throttling
>>> for pmax capping condition. But where are we taking care of reporting
>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>
>>
>> IMO let us not have "psafe" and "freq control disabled" states managed
>> per-chip.
>> Because when the above two conditions occur it is likely to happen across all
>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>> *throttle_check().
>
> Alright like I pointed in the previous reply, a comment to indicate that
> psafe and freq control disabled conditions will fail when occ is
> inactive and that all chips face the consequence of this will help.

From your explanation on the thread of the first patch of this series,
this will not be required.

So,
Reviewed-by: Preeti U Murthy

Regards
Preeti U Murthy
>
>>
>>>> Signed-off-by: Shilpasri G Bhat
>>>> ---
>>>>  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
>>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>> index 9268424..9618813 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>>  static struct chip {
>>>>  	unsigned int id;
>>>>  	bool throttled;
>>>> +	cpumask_t mask;
>>>> +	struct work_struct throttle;
>>>>  } *chips;
>>>>
>>>>  static int nr_chips;
>>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>>>  	return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>>  }
>>>>
>>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>>  {
>>>> +	unsigned int cpu = smp_processor_id();
>>>>  	unsigned long pmsr;
>>>>  	int pmsr_pmax, pmsr_lp, i;
>>>>
>>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>>>>  		return 0;
>>>>
>>>>  	if (!throttled)
>>>> -		powernv_cpufreq_throttle_check(smp_processor_id());
>>>> +		powernv_cpufreq_throttle_check(NULL);
>>>>
>>>>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>>
>>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>>>>  	.notifier_call = powernv_cpufreq_reboot_notifier,
>>>>  };
>>>>
>>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>>> +{
>>>> +	struct chip *chip = container_of(work, struct chip, throttle);
>>>> +
>>>> +	smp_call_function_any(&chip->mask,
>>>> +			      powernv_cpufreq_throttle_check, NULL, 0);
>>>> +}
>>>> +
>>>>  static char throttle_reason[][30] = {
>>>>  	"No throttling",
>>>>  	"Power Cap",
>>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  	struct opal_msg *occ_msg = msg;
>>>>  	uint64_t token;
>>>>  	uint64_t chip_id, reason;
>>>> +	int i;
>>>>
>>>>  	if (msg_type != OPAL_MSG_OCC)
>>>>  		return 0;
>>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  			occ_reset = false;
>>>>  			throttled = false;
>>>>  			pr_info("OCC: Active\n");
>>>> +
>>>> +			for (i = 0; i < nr_chips; i++)
>>>> +				schedule_work(&chips[i].throttle);
>>>> +
>>>>  			return 0;
>>>>  		}
>>>>
>>>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  		else if (!reason)
>>>>  			pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>>>>  				throttle_reason[reason]);
>>>> +		else
>>>> +			return 0;
>>>
>>> Why the else section ? The code can never reach here, can it ?
>>
>> When reason > 5, we don't want to handle it.
>
> Of course! My bad!
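The work function in the quoted patch relies on container_of() to get from the embedded work item back to its owning chip. A minimal userspace sketch of that pattern follows; struct work_item is a hypothetical stand-in for the kernel's struct work_struct, the cpumask member is left out, and the macro is a simplified version of the kernel's.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct work_item {
	void (*fn)(struct work_item *work);
};

/* Same shape as the patch's struct chip, minus the cpumask. */
struct chip {
	unsigned int id;
	int throttled;
	struct work_item throttle;
};

/* Recover the owning chip from a pointer to its embedded work item. */
static struct chip *chip_from_work(struct work_item *work)
{
	return container_of(work, struct chip, throttle);
}
```

Because the work item lives inside struct chip, the handler needs no lookup table: subtracting the member offset from the work pointer yields the chip whose cpumask should be used.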
Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote:
>
>
> On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
>> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>>> Hi Preeti,
>>>
>>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>>> Hi Shilpa,
>>>>
>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>>> temperature limits. As Pmax capping is a chip level condition report
>>>>> this throttling behavior at chip level and also do not set the global
>>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>>
>>>>> This patch adds a structure to store chip id and throttled state of
>>>>> the chip.
>>>>>
>>>>> Signed-off-by: Shilpasri G Bhat
>>>>> ---
>>>>>  drivers/cpufreq/powernv-cpufreq.c | 59 ---
>>>>>  1 file changed, 55 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>>> index ebef0d8..d0c18c9 100644
>>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>>> @@ -27,6 +27,7 @@
>>>>>  #include
>>>>>  #include
>>>>>  #include
>>>>> +#include
>>>>>
>>>>>  #include
>>>>>  #include
>>>>> @@ -42,6 +43,13 @@
>>>>>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>>>  static bool rebooting, throttled;
>>>>>
>>>>> +static struct chip {
>>>>> +	unsigned int id;
>>>>> +	bool throttled;
>>>>> +} *chips;
>>>>> +
>>>>> +static int nr_chips;
>>>>> +
>>>>>  /*
>>>>>   * Note: The set of pstates consists of contiguous integers, the
>>>>>   * smallest of which is indicated by powernv_pstate_info.min, the
>>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>>>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>>>  {
>>>>>  	unsigned long pmsr;
>>>>> -	int pmsr_pmax, pmsr_lp;
>>>>> +	int pmsr_pmax, pmsr_lp, i;
>>>>>
>>>>>  	pmsr = get_pmspr(SPRN_PMSR);
>>>>>
>>>>> +	for (i = 0; i < nr_chips; i++)
>>>>> +		if (chips[i].id == cpu_to_chip_id(cpu))
>>>>> +			break;
>>>>> +
>>>>>  	/* Check for Pmax Capping */
>>>>>  	pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>>>  	if (pmsr_pmax != powernv_pstate_info.max) {
>>>>> -		throttled = true;
>>>>> -		pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>>> -		pr_info("Max allowed Pstate is capped\n");
>>>>> +		if (chips[i].throttled)
>>>>> +			goto next;
>>>>> +		chips[i].throttled = true;
>>>>> +		pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>>> +			chips[i].id, pmsr_pmax);
>>>>> +	} else if (chips[i].throttled) {
>>>>> +		chips[i].throttled = false;
>>>>
>>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>>
>>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>>> sufficient to decide throttling/unthrottling when OCC is active for that
>>> chip.
>>
>> Ok then we can perhaps exit after detecting unthrottling here.
>
> This won't work for older firmwares which do not clear "Frequency control
> enabled bit" on OCC reset cycle. So let us check for remaining two conditions
> on unthrottling as well.

ok.

>
>>>
>>>>> +		pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>>> +			chips[i].id, pmsr_pmax);
>>>>>  	}
>>>>>
>>>>>  	/*
>>>>>  	 * Check for Psafe by reading LocalPstate
>>>>>  	 * or check if Psafe_mode_active is set in PMSR.
>>>>>  	 */
>>>>> +next:
>>>>>  	pmsr_lp = (s8)PMSR_LP(pmsr);
>>>>>  	if ((pmsr_lp < powernv_pstate_info.min) ||
>>>>>  				(pmsr & PMSR_PSAFE_ENABLE)) {
>>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>>>  	.attr = powernv_cpu_freq_attr,
>>>>
>>>> What about the situation where although occ is active, this particular
>>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>>> scenarios as well just like you have for pmsr_pmax ?
>>>
>>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>>> because of global variable 'throttled', which is set to true on passing any
>>> of these two conditions.
>>>
>>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>>> control disabled" state. These two conditions are most likely to happen
>>> during an OCC reset cycle which will occur across all chips.
>>
>> Let us then add a comment to indicate that Psafe and frequency control
>> disabled conditions will fail *only if OCC is inactive* and not
>> otherwise and that this is a s
Re: [PATCH 3/3] kvm/powerpc: report guest steal time in host
On 06.05.2015 at 18:42, Naveen N. Rao wrote:
> On 2015/05/06 02:46PM, Christian Borntraeger wrote:
>> On 06.05.2015 at 13:56, Naveen N. Rao wrote:
>>> On powerpc, kvm tracks both the guest steal time as well as the time
>>> when guest was idle and this gets sent in to the guest through DTL. The
>>> guest accounts these entries as either steal time or idle time based on
>>> the last running task. Since the true guest idle status is not visible
>>> to the host, we can't accurately expose the guest steal time in the
>>> host.
>>>
>>> However, tracking the guest vcpu cede status can get us a reasonable
>>> (within 5% variation) vcpu steal time since guest vcpus cede the
>>> processor on entering the idle task. To do this, we introduce a new
>>> field ceded_st in kvm_vcpu_arch structure to accurately track the guest
>>> vcpu cede status (this is needed since the existing ceded field is
>>> modified before we can use it). During DTL entry creation, we check this
>>> flag and account the time as stolen if the guest vcpu had not ceded.
>>
>> I think this is more or less a question about the semantic:
>>
>> What would happen if you use current->sched_info.run_delay like x86 also
>> on power? How far are the numbers away?
>
> The numbers were quite off and didn't quite make sense.

Strange. I would expect to match at least the wall clock time between
runnable and running. Maybe it's just a bug?

>
>> My feeling is, that the semantics
>> of "steal time" inside the guest is somewhat different on each platform.
>>
>> This brings me to a 2nd question:
>> Do you need to match the host view of guest steal time with the guest view
>> or do we want to have a host view that translates as "this is the time that
>> the guest was runnable but we were too busy to schedule him"?
>
> Very good point. This is probably good enough for our purpose and I'd
> like to think my current patchset does something similar for powerpc. We
> don't report the exact steal time as seen from within the guest, but a
> close approximation of it. We count all time that a vcpu was not-idle as
> steal. This includes time we were doing something in the host on behalf
> of the vcpu as well as time when we were just doing something else. I
> don't know if we can separate these two or if that would be desirable.
> The scheduler statistics don't seem to accurately reflect this on ppc.
>
>> For the former x86 has the best solution, as the host tells the guest its
>> understanding of steal - so both match. For the latter we actually try to
>> give guest steal a meaning in the host context - the overload.
>> Would /proc/<pid>/schedstat value 2 (time spent waiting on a runqueue)
>> meet your requirements from the cover-letter?
>
> This looks to be the same as sched_info.run_delay, which doesn't seem to
> reflect the wait on the runqueue. I will recheck this on ppc tomorrow.
>
> As an aside, do you happen to know if /proc/<pid>/schedstat accurately
> reports the "overload" on s390?

Things are usually even more complicated as we always have the LPAR
hypervisor below the KVM or z/VM hypervisor (KVM or z/VM guests are
always nested so to speak). Depending on the overcommit on LPAR level
the wall clock times might indicate a problem in a "wrong" place.

Now the steal time in a kvm guest is actually precise as the hardware
will step the guest cpu timer only when both LPAR and KVM have this CPU
scheduled. This will also cause "steal" when KVM emulates an instruction
for the guest - unless we correct the guest view - which we don't right
now. The Linux in LPAR also sees the steal time it got stolen by LPAR.

I really have not looked closely at run_delay. My assumption is that it
boils down to "wall clock time between runnable and running". If the
admin does overcommit in KVM and LPAR is just slightly overcommitted
this is probably good enough. If the overcommit happens at LPAR then the
value might be confusing. I would assume that people overcommit at the
z/VM or KVM level and the LPAR is managed with less overcommit - but
that's not a given.

Christian
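For anyone who wants to look at the schedstat value discussed above from userspace, here is a minimal sketch. It assumes the three-field layout of /proc/<pid>/schedstat (time on the CPU, time waiting on a runqueue, number of timeslices) and that the two times are reported in nanoseconds. The struct and function names are hypothetical, and nothing here says whether the numbers are meaningful on ppc or s390, which is the open question in this thread.

```c
#include <stdio.h>

/* Fields of /proc/<pid>/schedstat, in the order they are printed. */
struct schedstat {
	unsigned long long run_ns;     /* time spent on the CPU */
	unsigned long long wait_ns;    /* time spent waiting on a runqueue */
	unsigned long long timeslices; /* number of timeslices run */
};

/* Parse one schedstat line; returns 0 on success, -1 on malformed input. */
static int parse_schedstat(const char *line, struct schedstat *out)
{
	if (sscanf(line, "%llu %llu %llu",
		   &out->run_ns, &out->wait_ns, &out->timeslices) != 3)
		return -1;
	return 0;
}

/* Read schedstat for a pid given as a string; "self" also works. */
static int read_schedstat(const char *pid, struct schedstat *out)
{
	char path[64], line[128];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path), "/proc/%s/schedstat", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = parse_schedstat(line, out);
	fclose(f);
	return ret;
}
```

Sampling wait_ns for a vcpu thread twice and taking the difference gives the "runnable but not running" time over that interval, which is the host-side view Christian describes.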
Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions
"Kirill A. Shutemov" writes: > On Thu, May 07, 2015 at 12:53:27PM +0530, Aneesh Kumar K.V wrote: >> After this patch pmdp_* functions operate only on hugepage pte, >> and not on regular pmd_t values pointing to page table. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ >> arch/powerpc/mm/pgtable_64.c | 76 >> +--- >> include/asm-generic/pgtable.h| 19 >> mm/huge_memory.c | 2 +- >> 4 files changed, 65 insertions(+), 36 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h >> b/arch/powerpc/include/asm/pgtable-ppc64.h >> index 43e6ad424c7f..50830c9a2116 100644 >> --- a/arch/powerpc/include/asm/pgtable-ppc64.h >> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h >> @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct >> *mm, unsigned long addr, >> extern void pmdp_splitting_flush(struct vm_area_struct *vma, >> unsigned long address, pmd_t *pmdp); >> >> +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH >> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, >> + unsigned long address, pmd_t *pmdp); >> + >> #define __HAVE_ARCH_PGTABLE_DEPOSIT >> extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, >> pgtable_t pgtable); >> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c >> index 59daa5eeec25..9171c1a37290 100644 >> --- a/arch/powerpc/mm/pgtable_64.c >> +++ b/arch/powerpc/mm/pgtable_64.c >> @@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, >> unsigned long address, >> pmd_t pmd; >> >> VM_BUG_ON(address & ~HPAGE_PMD_MASK); >> -if (pmd_trans_huge(*pmdp)) { >> -pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp); >> -} else { >> -/* >> - * khugepaged calls this for normal pmd >> - */ >> -pmd = *pmdp; >> -pmd_clear(pmdp); >> -/* >> - * Wait for all pending hash_page to finish. This is needed >> - * in case of subpage collapse. 
When we collapse normal pages >> - * to hugepage, we first clear the pmd, then invalidate all >> - * the PTE entries. The assumption here is that any low level >> - * page fault will see a none pmd and take the slow path that >> - * will wait on mmap_sem. But we could very well be in a >> - * hash_page with local ptep pointer value. Such a hash page >> - * can result in adding new HPTE entries for normal subpages. >> - * That means we could be modifying the page content as we >> - * copy them to a huge page. So wait for parallel hash_page >> - * to finish before invalidating HPTE entries. We can do this >> - * by sending an IPI to all the cpus and executing a dummy >> - * function there. >> - */ >> -kick_all_cpus_sync(); >> -/* >> - * Now invalidate the hpte entries in the range >> - * covered by pmd. This make sure we take a >> - * fault and will find the pmd as none, which will >> - * result in a major fault which takes mmap_sem and >> - * hence wait for collapse to complete. Without this >> - * the __collapse_huge_page_copy can result in copying >> - * the old content. >> - */ >> -flush_tlb_pmd_range(vma->vm_mm, &pmd, address); >> -} >> +VM_BUG_ON(!pmd_trans_huge(*pmdp)); >> +pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp); >> +return pmd; > > The patches are in reverse order: you need to change pmdp_get_and_clear > first otherwise you break bisectability. > Or better merge patches together. The first patch is really a cleanup and should not result in code changes. It just make sure that we use pmdp_* functions only on hugepage ptes and not on regular pmd_t pointers to pgtable. It avoid the not so nice if (pmd_trans_huge()) check in the code and allows us to do the VM_BUG_ON(!pmd_trans_huge(*pmdp)) there. That is really important on archs like ppc64 where regular pmd format is different from hugepage pte format. 
> >> +} >> + >> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, >> + pmd_t *pmdp) >> +{ >> +pmd_t pmd; >> + >> +VM_BUG_ON(address & ~HPAGE_PMD_MASK); >> +VM_BUG_ON(pmd_trans_huge(*pmdp)); >> + >> +pmd = *pmdp; >> +pmd_clear(pmdp); >> +/* >> + * Wait for all pending hash_page to finish. This is needed >> + * in case of subpage collapse. When we collapse normal pages >> + * to hugepage, we first clear the pmd, then invalidate all >> + * the PTE entries. The assumption here is that any low level >> + * page fault will see a n
Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page
Benjamin Herrenschmidt writes:

> On Thu, 2015-05-07 at 12:46 +0530, Aneesh Kumar K.V wrote:
>> We need to check whether pte is present in follow_huge_addr and
>> properly return NULL if mapping is not present. Also use READ_ONCE
>> when dereferencing pte_t address.
>
> Does that need to go to stable as well ?

Yes. I would like David to take a look at this and give his feedback.
W.r.t. the patch itself, I hit a build failure on mpc85xx_smp_defconfig.
I will resend after the test build finishes on all configs.

-aneesh
Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>> temperature limits. As Pmax capping is a chip level condition report
>>>> this throttling behavior at chip level and also do not set the global
>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>
>>>> This patch adds a structure to store chip id and throttled state of
>>>> the chip.
>>>>
>>>> Signed-off-by: Shilpasri G Bhat
>>>> ---
>>>>  drivers/cpufreq/powernv-cpufreq.c | 59 ---
>>>>  1 file changed, 55 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>> index ebef0d8..d0c18c9 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -27,6 +27,7 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> +#include
>>>>
>>>>  #include
>>>>  #include
>>>> @@ -42,6 +43,13 @@
>>>>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>>  static bool rebooting, throttled;
>>>>
>>>> +static struct chip {
>>>> +	unsigned int id;
>>>> +	bool throttled;
>>>> +} *chips;
>>>> +
>>>> +static int nr_chips;
>>>> +
>>>>  /*
>>>>   * Note: The set of pstates consists of contiguous integers, the
>>>>   * smallest of which is indicated by powernv_pstate_info.min, the
>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>>  {
>>>>  	unsigned long pmsr;
>>>> -	int pmsr_pmax, pmsr_lp;
>>>> +	int pmsr_pmax, pmsr_lp, i;
>>>>
>>>>  	pmsr = get_pmspr(SPRN_PMSR);
>>>>
>>>> +	for (i = 0; i < nr_chips; i++)
>>>> +		if (chips[i].id == cpu_to_chip_id(cpu))
>>>> +			break;
>>>> +
>>>>  	/* Check for Pmax Capping */
>>>>  	pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>>  	if (pmsr_pmax != powernv_pstate_info.max) {
>>>> -		throttled = true;
>>>> -		pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>> -		pr_info("Max allowed Pstate is capped\n");
>>>> +		if (chips[i].throttled)
>>>> +			goto next;
>>>> +		chips[i].throttled = true;
>>>> +		pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>> +			chips[i].id, pmsr_pmax);
>>>> +	} else if (chips[i].throttled) {
>>>> +		chips[i].throttled = false;
>>>
>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>
>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>> sufficient to decide throttling/unthrottling when OCC is active for that
>> chip.
>
> Ok then we can perhaps exit after detecting unthrottling here.

This won't work for older firmwares which do not clear "Frequency control
enabled bit" on OCC reset cycle. So let us check for remaining two
conditions on unthrottling as well.

>>
>>>> +		pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>> +			chips[i].id, pmsr_pmax);
>>>>  	}
>>>>
>>>>  	/*
>>>>  	 * Check for Psafe by reading LocalPstate
>>>>  	 * or check if Psafe_mode_active is set in PMSR.
>>>>  	 */
>>>> +next:
>>>>  	pmsr_lp = (s8)PMSR_LP(pmsr);
>>>>  	if ((pmsr_lp < powernv_pstate_info.min) ||
>>>>  				(pmsr & PMSR_PSAFE_ENABLE)) {
>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>>  	.attr = powernv_cpu_freq_attr,
>>>
>>> What about the situation where although occ is active, this particular
>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>> scenarios as well just like you have for pmsr_pmax ?
>>
>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>> because of global variable 'throttled', which is set to true on passing any
>> of these two conditions.
>>
>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>> control disabled" state. These two conditions are most likely to happen
>> during an OCC reset cycle which will occur across all chips.
>
> Let us then add a comment to indicate that Psafe and frequency control
> disabled conditions will fail *only if OCC is inactive* and not
> otherwise and that this is a system wide phenomenon.
>

I agree that adding a comment here will clear global vs local throttling
scenarios, but this will contradict the architectural design
Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
Hi Preeti,

On 07/05/15 06:26, Preeti U Murthy wrote:
> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU. This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
>
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state where in ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
>
> Signed-off-by: Preeti U Murthy
> ---
> Based on linux-pm/bleeding-edge

I am unable to apply this patch cleanly on linux-pm/bleeding-edge.
I think it conflicts with a few patches that Rafael posted recently,
which are in the branch now.

Regards,
Sudeep
[PATCH v2 2/2] powerpc/powernv: Extract EPOW events timeout values from OPAL device tree
OPAL exports platform timeout values for various EPOW events under the
EPOW device tree node. The EPOW node contains sub nodes for each EPOW
class. Under each class, platform timeout property files are located for
the EPOW events of that class. Each file contains the platform timeout
value for the corresponding EPOW event in seconds. This patch adds
support for extracting EPOW event timeout values from the OPAL device
tree. The property files below are parsed to extract EPOW event timeout
values.

Power EPOW
===
ups-timeout
ups-low-timeout

Temp EPOW
==
high-ambient-temp-timeout
crit-ambient-temp-timeout
high-internal-temp-timeout
crit-internal-temp-timeout

Signed-off-by: Vipin K Parashar
---
 arch/powerpc/platforms/powernv/opal-power.c | 79 +
 1 file changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c
index 7c1b2f8..5b015f3 100644
--- a/arch/powerpc/platforms/powernv/opal-power.c
+++ b/arch/powerpc/platforms/powernv/opal-power.c
@@ -60,15 +60,7 @@ static const char * const epow_events_map[] = {
 };
 
 /* Poweroff EPOW events timeout values in seconds */
-static const int epow_timeout[] = {
-	[EPOW_POWER_UPS]	= 900,
-	[EPOW_POWER_UPS_LOW]	= 20,
-	[EPOW_TEMP_HIGH_AMB]	= 900,
-	[EPOW_TEMP_CRIT_AMB]	= 20,
-	[EPOW_TEMP_HIGH_INT]	= 900,
-	[EPOW_TEMP_CRIT_INT]	= 20,
-	[EPOW_UNKNOWN]		= 0,
-};
+static int epow_timeout[MAX_EPOW_EVENTS];
 
 /* System poweroff function. */
 static void epow_poweroff(unsigned long event)
@@ -125,6 +117,72 @@ static void stop_epow_timer(void)
 	pr_info("Poweroff timer deactivated\n");
 }
 
+/* Extract timeout value from device tree property */
+static int get_timeout_value(struct device_node *node, const char *prop)
+{
+	const __be32 *pval;
+	int timeout = 0;
+
+	pval = of_get_property(node, prop, NULL);
+	if (pval)
+		timeout = be32_to_cpup(pval);
+	else
+		pr_err("Didn't find %s dt property\n", prop);
+
+	return timeout;
+}
+
+/* Get EPOW events timeout values from OPAL device tree */
+static void get_epow_timeouts(void)
+{
+	struct device_node *epow_power, *epow_temp;
+
+	/* EPOW power class event timeouts */
+	epow_power = of_find_node_by_path("/ibm,opal/epow/power");
+	if (epow_power) {
+		epow_timeout[EPOW_POWER_UPS] =
+			get_timeout_value(epow_power, "ups-timeout");
+		pr_info("Power EPOW ups-timeout = %d seconds\n",
+				epow_timeout[EPOW_POWER_UPS]);
+
+		epow_timeout[EPOW_POWER_UPS_LOW] =
+			get_timeout_value(epow_power, "ups-low-timeout");
+		pr_info("Power EPOW ups-low-timeout = %d seconds\n",
+				epow_timeout[EPOW_POWER_UPS_LOW]);
+
+		of_node_put(epow_power);
+	} else
+		pr_info("Power EPOW class not supported in OPAL\n");
+
+	/* EPOW temp class event timeouts */
+	epow_temp = of_find_node_by_path("/ibm,opal/epow/temp");
+	if (epow_temp) {
+		epow_timeout[EPOW_TEMP_HIGH_AMB] =
+			get_timeout_value(epow_temp, "high-ambient-temp-timeout");
+		pr_info("Temp EPOW high-ambient-temp-timeout = %d seconds\n",
+				epow_timeout[EPOW_TEMP_HIGH_AMB]);
+
+		epow_timeout[EPOW_TEMP_CRIT_AMB] =
+			get_timeout_value(epow_temp, "crit-ambient-temp-timeout");
+		pr_info("Temp EPOW crit-ambient-temp-timeout = %d seconds\n",
+				epow_timeout[EPOW_TEMP_CRIT_AMB]);
+
+		epow_timeout[EPOW_TEMP_HIGH_INT] =
+			get_timeout_value(epow_temp, "high-internal-temp-timeout");
+		pr_info("Temp EPOW high-internal-temp-timeout = %d seconds\n",
+				epow_timeout[EPOW_TEMP_HIGH_INT]);
+
+		epow_timeout[EPOW_TEMP_CRIT_INT] =
+			get_timeout_value(epow_temp, "crit-internal-temp-timeout");
+		pr_info("Temp EPOW crit-internal-temp-timeout = %d seconds\n",
+				epow_timeout[EPOW_TEMP_CRIT_INT]);
+
+		of_node_put(epow_temp);
+	} else
+		pr_info("Temp EPOW class not supported in OPAL\n");
+
+}
+
 /* Get DPO status */
 static bool get_dpo_status(int32_t *dpo_timeout)
 {
@@ -366,6 +424,9 @@ static int __init opal_poweroff_events_init(void)
 	init_timer(&epow_timer);
 	epow_timer.function = epow_poweroff;
 
+	/* Get EPOW events timeout value */
+	get_epow_timeouts();
+
 	/* Register EPOW event notifier */
 	ret = opal_message_notifier_register(OPAL_MSG_EPOW,
 					&opal_epow_nb);
-- 
1.9.3
[PATCH v2 1/2] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
This patch adds support for FSP EPOW (Early Power Off Warning) and DPO (Delayed Power Off) events support for PowerNV platform. EPOW events are generated by SPCN/FSP due to various critical system conditions that need system shutdown. Few examples of these conditions are high ambient temperature or system running on UPS power with low UPS battery. DPO event is generated in response to admin initiated system shutdown request. This patch enables host kernel on PowerNV platform to handle OPAL notifications for these events and initiate system poweroff. Since EPOW notifications are sent in advance of impending shutdown event and thus this patch also adds functionality to wait for EPOW condition to return to normal. Host allows MAX_POWEROFF_SYS_TIME (600 seconds) as system poweroff time (time for host + guests shutdown) and waits for remaining time for EPOW condition to return to normal. If EPOW condition doesn't return to normal in calculated time it proceeds with graceful system shutdown. For EPOW events with smaller timeouts values than MAX_POWEROFF_SYS_TIME it proceeds with system shutdown without any wait for EPOW condition to return to normal. System admin can also add systemd service shutdown scripts to perform any specific actions like graceful guest shutdown upon system poweroff. libvirt-guests is systemd service available on recent distros for management of guests at system stat/shutdown time. 
Signed-off-by: Vipin K Parashar --- arch/powerpc/include/asm/opal-api.h| 30 ++ arch/powerpc/include/asm/opal.h| 3 +- arch/powerpc/platforms/powernv/opal-power.c| 379 +++-- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + 4 files changed, 391 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 0321a90..03b3cef 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -730,6 +730,36 @@ struct opal_i2c_request { __be64 buffer_ra; /* Buffer real address */ }; +/* + * EPOW status sharing (OPAL and the host) + * + * The host will pass on OPAL, a buffer of length OPAL_EPOW_MAX_CLASSES + * to fetch system wide EPOW status. Each element in the returned buffer + * will contain bitwise EPOW status for each EPOW sub class. + */ + +/* EPOW types */ +enum OpalEpow { + OPAL_EPOW_POWER = 0,/* Power EPOW */ + OPAL_EPOW_TEMP = 1,/* Temperature EPOW */ + OPAL_EPOW_COOLING = 2,/* Cooling EPOW */ + OPAL_MAX_EPOW_CLASSES = 3,/* Max EPOW categories */ +}; + +/* Power EPOW events */ +enum OpalEpowPower { + OPAL_EPOW_POWER_UPS = 0x1, /* System on UPS power */ + OPAL_EPOW_POWER_UPS_LOW = 0x2, /* System on UPS power with low battery*/ +}; + +/* Temperature EPOW events */ +enum OpalEpowTemp { + OPAL_EPOW_TEMP_HIGH_AMB = 0x1, /* High ambient temperature */ + OPAL_EPOW_TEMP_CRIT_AMB = 0x2, /* Critical ambient temperature */ + OPAL_EPOW_TEMP_HIGH_INT = 0x4, /* High internal temperature */ + OPAL_EPOW_TEMP_CRIT_INT = 0x8, /* Critical internal temperature */ +}; + #endif /* __ASSEMBLY__ */ #endif /* __OPAL_API_H */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 042af1a..0777864 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -141,7 +141,6 @@ int64_t opal_pci_fence_phb(uint64_t phb_id); int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data); int64_t opal_pci_mask_pe_error(uint64_t phb_id, 
uint16_t pe_number, uint8_t error_type, uint8_t mask_action); int64_t opal_set_slot_led_status(uint64_t phb_id, uint64_t slot_id, uint8_t led_type, uint8_t led_action); -int64_t opal_get_epow_status(__be64 *status); int64_t opal_set_system_attention_led(uint8_t led_action); int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe, __be16 *pci_error_type, __be16 *severity); @@ -200,6 +199,8 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf, uint64_t size, uint64_t token); int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size, uint64_t token); +int32_t opal_get_epow_status(__be32 *status, __be32 *num_classes); +int32_t opal_get_dpo_status(__be32 *timeout); /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c index ac46c2c..7c1b2f8 100644 --- a/arch/powerpc/platforms/powernv/opal-power.c +++ b/arch/powerpc/platforms/powernv/opal-power.c @@ -1,5 +1,5 @@ /* - * PowerNV OPAL power control for graceful shutdown handling + * PowerNV poweroff events support * * Copyright 2015 IBM Corp. * @@ -9,58 +9,395 @@ * 2 of the License, or (at your option) any later version. */ +#
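The opal-api.h comment above says each element of the returned buffer holds a bitwise status for one EPOW class. A minimal sketch of how such a buffer could be decoded follows. The enum values are copied from the hunk above (the sketch uses the enum spelling OPAL_MAX_EPOW_CLASSES, not the OPAL_EPOW_MAX_CLASSES spelling in the comment). The helper names, the 32-bit element width, and the assumption that the values have already been converted from big-endian to host order are illustrative only and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the opal-api.h hunk in the patch. */
enum OpalEpow {
	OPAL_EPOW_POWER		= 0,	/* Power EPOW */
	OPAL_EPOW_TEMP		= 1,	/* Temperature EPOW */
	OPAL_EPOW_COOLING	= 2,	/* Cooling EPOW */
	OPAL_MAX_EPOW_CLASSES	= 3,	/* Max EPOW categories */
};

enum OpalEpowPower {
	OPAL_EPOW_POWER_UPS	= 0x1,	/* System on UPS power */
	OPAL_EPOW_POWER_UPS_LOW	= 0x2,	/* System on UPS power, low battery */
};

enum OpalEpowTemp {
	OPAL_EPOW_TEMP_HIGH_AMB	= 0x1,	/* High ambient temperature */
	OPAL_EPOW_TEMP_CRIT_AMB	= 0x2,	/* Critical ambient temperature */
	OPAL_EPOW_TEMP_HIGH_INT	= 0x4,	/* High internal temperature */
	OPAL_EPOW_TEMP_CRIT_INT	= 0x8,	/* Critical internal temperature */
};

/*
 * Hypothetical helper: true if any EPOW class reports at least one event.
 * Assumes status[] is already in host byte order.
 */
static inline bool epow_event_pending(const uint32_t *status)
{
	for (int i = 0; i < OPAL_MAX_EPOW_CLASSES; i++)
		if (status[i])
			return true;
	return false;
}

/* Hypothetical helper: true if the power class reports a low UPS battery. */
static inline bool epow_ups_battery_low(const uint32_t *status)
{
	return (status[OPAL_EPOW_POWER] & OPAL_EPOW_POWER_UPS_LOW) != 0;
}

/* Hypothetical helper: true if either critical temperature bit is set. */
static inline bool epow_temp_critical(const uint32_t *status)
{
	return (status[OPAL_EPOW_TEMP] &
		(OPAL_EPOW_TEMP_CRIT_AMB | OPAL_EPOW_TEMP_CRIT_INT)) != 0;
}
```

Keeping one word per class means new events within a class only need a new bit, while a new class only needs a new index below OPAL_MAX_EPOW_CLASSES.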
[PATCH v2 0/2] Poweroff (EPOW, DPO) events support for PowerNV platform
This patchset adds support for FSP EPOW (Early Power Off Warning) and DPO (Delayed Power Off) events on the PowerNV platform. EPOW events are generated by SPCN/FSP in response to critical system conditions that require a system shutdown. Examples of such conditions are high ambient temperature, or the system running on UPS power with a low UPS battery. A DPO event is generated in response to an admin-initiated system shutdown request. This patchset enables the host kernel on the PowerNV platform to handle OPAL notifications for these events and initiate system poweroff. Since EPOW notifications are sent in advance of the impending shutdown, functionality is also added to wait for the EPOW condition to return to normal. EPOW event timeout values are available via the OPAL-exported device tree, under the EPOW node. The host kernel allows MAX_POWEROFF_SYS_TIME (600 seconds) as system poweroff time (time for host + guests shutdown) and waits for the remaining time for the EPOW condition to return to normal. If the EPOW condition doesn't return to normal in the calculated time, it proceeds with graceful system shutdown. For EPOW events with timeout values smaller than MAX_POWEROFF_SYS_TIME, it proceeds with system shutdown without waiting for the EPOW condition to return to normal. The system admin can also add systemd service shutdown scripts to perform specific actions, such as graceful guest shutdown, upon system poweroff. libvirt-guests is a systemd service available on recent distros for management of guests at system start/shutdown time.

Changes in v2:
- Improved the code as per previous review comments.
- Added a patch to obtain EPOW event timeout values from the OPAL device tree.
Vipin K Parashar (2): powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform powerpc/powernv: Extract EPOW events timeout values from OPAL device tree arch/powerpc/include/asm/opal-api.h| 30 ++ arch/powerpc/include/asm/opal.h| 3 +- arch/powerpc/platforms/powernv/opal-power.c| 440 +++-- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + 4 files changed, 452 insertions(+), 22 deletions(-) -- 1.9.3
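The wait-time rule described in the cover letter can be sketched as follows. This is only an illustration of the arithmetic, assuming the event timeout is given in seconds. The helper name epow_wait_seconds is hypothetical and does not come from the patch; MAX_POWEROFF_SYS_TIME and its value of 600 seconds are taken from the description above.

```c
#include <stdint.h>

/* Time reserved for host + guests shutdown, per the cover letter. */
#define MAX_POWEROFF_SYS_TIME	600	/* seconds */

/*
 * Hypothetical helper: how long the host may wait for the EPOW condition
 * to return to normal before it has to start the shutdown.
 *
 * Returns 0 when the event timeout leaves no spare time, in which case
 * shutdown starts without waiting.
 */
static inline uint32_t epow_wait_seconds(uint32_t event_timeout)
{
	if (event_timeout <= MAX_POWEROFF_SYS_TIME)
		return 0;

	return event_timeout - MAX_POWEROFF_SYS_TIME;
}
```

For example, an event with a 900 second timeout would leave 300 seconds of waiting, while an event with a 120 second timeout would start the shutdown at once.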
Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions
On Thu, May 07, 2015 at 12:53:27PM +0530, Aneesh Kumar K.V wrote:
> After this patch pmdp_* functions operate only on hugepage pte,
> and not on regular pmd_t values pointing to page table.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
>  arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
>  arch/powerpc/mm/pgtable_64.c             | 76
> +---
>  include/asm-generic/pgtable.h            | 19
>  mm/huge_memory.c                         |  2 +-
>  4 files changed, 65 insertions(+), 36 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h
> b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 43e6ad424c7f..50830c9a2116 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct
> *mm, unsigned long addr,
>  extern void pmdp_splitting_flush(struct vm_area_struct *vma,
>  				 unsigned long address, pmd_t *pmdp);
>
> +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
> +
>  #define __HAVE_ARCH_PGTABLE_DEPOSIT
>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>  				       pgtable_t pgtable);
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index 59daa5eeec25..9171c1a37290 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma,
> unsigned long address,
>  	pmd_t pmd;
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	if (pmd_trans_huge(*pmdp)) {
> -		pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
> -	} else {
> -		/*
> -		 * khugepaged calls this for normal pmd
> -		 */
> -		pmd = *pmdp;
> -		pmd_clear(pmdp);
> -		/*
> -		 * Wait for all pending hash_page to finish. This is needed
> -		 * in case of subpage collapse. When we collapse normal pages
> -		 * to hugepage, we first clear the pmd, then invalidate all
> -		 * the PTE entries. The assumption here is that any low level
> -		 * page fault will see a none pmd and take the slow path that
> -		 * will wait on mmap_sem. But we could very well be in a
> -		 * hash_page with local ptep pointer value. Such a hash page
> -		 * can result in adding new HPTE entries for normal subpages.
> -		 * That means we could be modifying the page content as we
> -		 * copy them to a huge page. So wait for parallel hash_page
> -		 * to finish before invalidating HPTE entries. We can do this
> -		 * by sending an IPI to all the cpus and executing a dummy
> -		 * function there.
> -		 */
> -		kick_all_cpus_sync();
> -		/*
> -		 * Now invalidate the hpte entries in the range
> -		 * covered by pmd. This make sure we take a
> -		 * fault and will find the pmd as none, which will
> -		 * result in a major fault which takes mmap_sem and
> -		 * hence wait for collapse to complete. Without this
> -		 * the __collapse_huge_page_copy can result in copying
> -		 * the old content.
> -		 */
> -		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
> -	}
> +	VM_BUG_ON(!pmd_trans_huge(*pmdp));
> +	pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
> +	return pmd;

The patches are in reverse order: you need to change pmdp_get_and_clear first otherwise you break bisectability. Or better merge patches together.

> +}
> +
> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> +			  pmd_t *pmdp)
> +{
> +	pmd_t pmd;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON(pmd_trans_huge(*pmdp));
> +
> +	pmd = *pmdp;
> +	pmd_clear(pmdp);
> +	/*
> +	 * Wait for all pending hash_page to finish. This is needed
> +	 * in case of subpage collapse. When we collapse normal pages
> +	 * to hugepage, we first clear the pmd, then invalidate all
> +	 * the PTE entries. The assumption here is that any low level
> +	 * page fault will see a none pmd and take the slow path that
> +	 * will wait on mmap_sem. But we could very well be in a
> +	 * hash_page with local ptep pointer value. Such a hash page
> +	 * can result in adding new HPTE entries for normal subpages.
> +	 * That means we could be modifying the page content as we
> +	 * copy them to a huge page. So wait for parallel hash_page
> +	 * to finish before invalidating HPTE entries. We can do this
> +	 * by sending an IPI to all the cpus and executing a dummy
Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page
On Thu, 2015-05-07 at 12:46 +0530, Aneesh Kumar K.V wrote:
> We need to check whether pte is present in follow_huge_addr and
> properly return NULL if mapping is not present. Also use READ_ONCE
> when dereferencing pte_t address.

Does that need to go to stable as well?

> Signed-off-by: Aneesh Kumar K.V
> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0ce968b00b7c..f5688423bc69 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
> -	pte_t *ptep;
> -	struct page *page;
> +	pte_t *ptep, pte;
>  	unsigned shift;
>  	unsigned long mask, flags;
> +	struct page *page = ERR_PTR(-EINVAL);
> +
> +	local_irq_save(flags);
> +	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> +	if (!ptep)
> +		goto no_page;
> +	pte = READ_ONCE(*ptep);
>  	/*
> +	 * Verify it is a huge page else bail.
>  	 * Transparent hugepages are handled by generic code. We can skip them
>  	 * here.
>  	 */
> -	local_irq_save(flags);
> -	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> +	if (!shift || pmd_trans_huge((pmd_t)pte))
> +		goto no_page;
>
> -	/* Verify it is a huge page else bail. */
> -	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
> -		local_irq_restore(flags);
> -		return ERR_PTR(-EINVAL);
> +	if (!pte_present(pte)) {
> +		page = NULL;
> +		goto no_page;
>  	}
>  	mask = (1UL << shift) - 1;
> -	page = pte_page(*ptep);
> +	page = pte_page(pte);
>  	if (page)
>  		page += (address & mask) / PAGE_SIZE;
>
> +no_page:
>  	local_irq_restore(flags);
>  	return page;
>  }
[PATCH V2 2/2] powerpc/thp: Serialize pmd clear against a linux page table walk.
Serialize against find_linux_pte_or_hugepte, which does a lock-less lookup in page tables with local interrupts disabled. For huge pages it casts pmd_t to pte_t. Since the format of pte_t is different from pmd_t, we want to prevent a transition from a pmd pointing to a page table to a pmd pointing to a huge page (and back) while interrupts are disabled. We clear the pmd to possibly replace it with a page table pointer in different code paths. So make sure we wait for any parallel find_linux_pte_or_hugepte to finish.

Reported-by: Kirill A. Shutemov
Signed-off-by: Aneesh Kumar K.V
---
Changes from v1:
* Move kick_all_cpus_sync to pmdp_get_and_clear so that it handles the zap_huge_pmd case also.

 arch/powerpc/mm/pgtable_64.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 9171c1a37290..049d961802aa 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -845,6 +845,17 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 	 * hash fault look at them.
 	 */
 	memset(pgtable, 0, PTE_FRAG_SIZE);
+	/*
+	 * Serialize against find_linux_pte_or_hugepte which does lock-less
+	 * lookup in page tables with local interrupts disabled. For huge pages
+	 * it casts pmd_t to pte_t. Since format of pte_t is different from
+	 * pmd_t we want to prevent transit from pmd pointing to page table
+	 * to pmd pointing to huge page (and back) while interrupts are disabled.
+	 * We clear pmd to possibly replace it with page table pointer in
+	 * different code paths. So make sure we wait for the parallel
+	 * find_linux_pte_or_hugepte to finish.
+	 */
+	kick_all_cpus_sync();
 	return old_pmd;
 }
--
2.1.4
[PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions
After this patch pmdp_* functions operate only on hugepage pte, and not on regular pmd_t values pointing to page table.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
 arch/powerpc/mm/pgtable_64.c             | 76 +---
 include/asm-generic/pgtable.h            | 19
 mm/huge_memory.c                         |  2 +-
 4 files changed, 65 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 43e6ad424c7f..50830c9a2116 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 extern void pmdp_splitting_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);

+#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp);
+
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 59daa5eeec25..9171c1a37290 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	pmd_t pmd;

 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	if (pmd_trans_huge(*pmdp)) {
-		pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
-	} else {
-		/*
-		 * khugepaged calls this for normal pmd
-		 */
-		pmd = *pmdp;
-		pmd_clear(pmdp);
-		/*
-		 * Wait for all pending hash_page to finish. This is needed
-		 * in case of subpage collapse. When we collapse normal pages
-		 * to hugepage, we first clear the pmd, then invalidate all
-		 * the PTE entries. The assumption here is that any low level
-		 * page fault will see a none pmd and take the slow path that
-		 * will wait on mmap_sem. But we could very well be in a
-		 * hash_page with local ptep pointer value. Such a hash page
-		 * can result in adding new HPTE entries for normal subpages.
-		 * That means we could be modifying the page content as we
-		 * copy them to a huge page. So wait for parallel hash_page
-		 * to finish before invalidating HPTE entries. We can do this
-		 * by sending an IPI to all the cpus and executing a dummy
-		 * function there.
-		 */
-		kick_all_cpus_sync();
-		/*
-		 * Now invalidate the hpte entries in the range
-		 * covered by pmd. This make sure we take a
-		 * fault and will find the pmd as none, which will
-		 * result in a major fault which takes mmap_sem and
-		 * hence wait for collapse to complete. Without this
-		 * the __collapse_huge_page_copy can result in copying
-		 * the old content.
-		 */
-		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
-	}
+	VM_BUG_ON(!pmd_trans_huge(*pmdp));
+	pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
+	return pmd;
+}
+
+pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(pmd_trans_huge(*pmdp));
+
+	pmd = *pmdp;
+	pmd_clear(pmdp);
+	/*
+	 * Wait for all pending hash_page to finish. This is needed
+	 * in case of subpage collapse. When we collapse normal pages
+	 * to hugepage, we first clear the pmd, then invalidate all
+	 * the PTE entries. The assumption here is that any low level
+	 * page fault will see a none pmd and take the slow path that
+	 * will wait on mmap_sem. But we could very well be in a
+	 * hash_page with local ptep pointer value. Such a hash page
+	 * can result in adding new HPTE entries for normal subpages.
+	 * That means we could be modifying the page content as we
+	 * copy them to a huge page. So wait for parallel hash_page
+	 * to finish before invalidating HPTE entries. We can do this
+	 * by sending an IPI to all the cpus and executing a dummy
+	 * function there.
+	 */
+	kick_all_cpus_sync();
+	/*
+	 * Now invalidate the hpte entries in the range
+	 * covered by pmd. This make sure we take a
+	 * fault and will find the pmd as none, which will
+	 * result in a major fault which takes
[RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page
We need to check whether the pte is present in follow_huge_addr and return NULL if the mapping is not present. Also use READ_ONCE when dereferencing the pte_t pointer.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/hugetlbpage.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0ce968b00b7c..f5688423bc69 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
-	pte_t *ptep;
-	struct page *page;
+	pte_t *ptep, pte;
 	unsigned shift;
 	unsigned long mask, flags;
+	struct page *page = ERR_PTR(-EINVAL);
+
+	local_irq_save(flags);
+	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+	if (!ptep)
+		goto no_page;
+	pte = READ_ONCE(*ptep);
 	/*
+	 * Verify it is a huge page else bail.
 	 * Transparent hugepages are handled by generic code. We can skip them
 	 * here.
 	 */
-	local_irq_save(flags);
-	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+	if (!shift || pmd_trans_huge((pmd_t)pte))
+		goto no_page;

-	/* Verify it is a huge page else bail. */
-	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
-		local_irq_restore(flags);
-		return ERR_PTR(-EINVAL);
+	if (!pte_present(pte)) {
+		page = NULL;
+		goto no_page;
 	}
 	mask = (1UL << shift) - 1;
-	page = pte_page(*ptep);
+	page = pte_page(pte);
 	if (page)
 		page += (address & mask) / PAGE_SIZE;

+no_page:
 	local_irq_restore(flags);
 	return page;
 }
--
2.1.4
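The last step of follow_huge_addr in the patch above, picking the base page inside the huge page that covers the address, can be tried out in user space. The sketch below reproduces only the mask and divide arithmetic from the patch. The function name hugepage_subpage_index and the SKETCH_PAGE_* constants are hypothetical, and the 64 KiB base page size is an assumption chosen for illustration; the kernel uses its own PAGE_SIZE.

```c
#include <assert.h>
#include <limits.h>

/* Assumption for this sketch only: 64 KiB base pages. */
#define SKETCH_PAGE_SHIFT	16U
#define SKETCH_PAGE_SIZE	(1UL << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical helper mirroring the arithmetic in follow_huge_addr():
 *
 *	mask = (1UL << shift) - 1;
 *	page += (address & mask) / PAGE_SIZE;
 *
 * 'shift' is log2 of the huge page size. The return value is the index
 * of the base page within the huge page that contains 'address'.
 */
static inline unsigned long hugepage_subpage_index(unsigned long address,
						   unsigned int shift)
{
	unsigned long mask;

	/* A huge page must be at least one base page and fit in a long. */
	assert(shift >= SKETCH_PAGE_SHIFT);
	assert(shift < sizeof(unsigned long) * CHAR_BIT);

	mask = (1UL << shift) - 1;		/* offset bits inside the huge page */
	return (address & mask) / SKETCH_PAGE_SIZE;
}
```

With a 16 MiB huge page (shift of 24) and the assumed 64 KiB base pages, the index ranges from 0 to 255. In the patched function this offset is applied only after the pte has been read once with READ_ONCE and has passed the pte_present check; the earlier exits return ERR_PTR(-EINVAL) or NULL instead.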