Re: [PATCH 1/6] dump_stack: Support adding to the dump stack arch description

2015-05-07 Thread Michael Ellerman
On Tue, 2015-05-05 at 14:16 -0700, Andrew Morton wrote:
> On Tue,  5 May 2015 21:12:12 +1000 Michael Ellerman wrote:
> 
> > Arch code can set a "dump stack arch description string" which is
> > displayed with oops output to describe the hardware platform.
> > +
> > +   len = strnlen(dump_stack_arch_desc_str, sizeof(dump_stack_arch_desc_str));
> > +   pos = len;
> > +
> > +   if (len)
> > +           pos++;
> > +
> > +   if (pos >= sizeof(dump_stack_arch_desc_str))
> > +           return; /* Ran out of space */
> > +
> > +   p = &dump_stack_arch_desc_str[pos];
> > +
> > +   va_start(args, fmt);
> > +   vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args);
> > +   va_end(args);
> 
> This code is almost race-free.  A (documented) smp_wmb() in here would
> make that 100%?
> 
> > +   if (len)
> > +           dump_stack_arch_desc_str[len] = ' ';
> > +}

On second thoughts I don't think it would.

It would order the stores in vsnprintf() vs the store of the space. The idea
being you never see a partially printed string. But for that to actually work
you need a barrier on the read side, and where do you put it?

The cpu printing the buffer could speculate the load of the tail of the buffer,
seeing something half printed from vsnprintf(), and then load the head of the
buffer and see the space, unless you order those loads.

So I don't think we can prevent a crashing cpu seeing a semi-printed buffer
without a lock, and we don't want to add a lock.
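To make the pairing concrete, here is a minimal user-space sketch in C11 atomics, not kernel code. The names demo_msg, demo_ready, demo_publish() and demo_consume() are made up for illustration, and release/acquire stand in for the write-side and read-side barriers. The point is that the writer's ordering only helps a reader that orders its own loads, which is exactly what a crashing cpu printing the string does not do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shared state: a message and a flag that publishes it. */
char demo_msg[32];
atomic_bool demo_ready;

/* Writer: fill the buffer, then publish it. */
void demo_publish(const char *text)
{
	assert(strlen(text) < sizeof(demo_msg));
	strcpy(demo_msg, text);
	/* Release: the stores to demo_msg are ordered before the flag store. */
	atomic_store_explicit(&demo_ready, true, memory_order_release);
}

/* Reader: only touch the buffer after observing the flag. */
bool demo_consume(char *out, size_t len)
{
	assert(out != NULL && len > 0);
	/* Acquire: the loads from demo_msg below cannot be satisfied early. */
	if (!atomic_load_explicit(&demo_ready, memory_order_acquire))
		return false;
	snprintf(out, len, "%s", demo_msg);
	return true;
}
```

Drop the acquire on the reader side and the release on the writer side no longer guarantees anything about what the reader sees in demo_msg.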

The other issue would be that a reader could miss the trailing NUL from the
vsnprintf() but see the space, meaning it would wander off the end of the
buffer. But the buffer's in BSS to start with, and we're careful not to print
off the end of it, so it should always be NUL terminated.
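A user-space sketch of the append logic from the quoted hunk shows why the termination argument holds. The names arch_desc and arch_desc_append() are stand-ins and the 128-byte size is an assumption. The buffer starts zeroed, vsnprintf() never writes past the size it is given, and the separator is stored last.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Zero-initialised at start-up, like a kernel buffer living in BSS. */
char arch_desc[128];

/* Hypothetical stand-in for the append helper in the patch. */
void arch_desc_append(const char *fmt, ...)
{
	va_list args;
	const char *end;
	size_t len, pos;

	assert(fmt != NULL);

	/* Bounded length, equivalent to strnlen(arch_desc, sizeof(arch_desc)). */
	end = memchr(arch_desc, '\0', sizeof(arch_desc));
	len = end ? (size_t)(end - arch_desc) : sizeof(arch_desc);
	pos = len;

	/* Skip one byte for the separator; it stays a NUL for now. */
	if (len)
		pos++;

	if (pos >= sizeof(arch_desc))
		return;	/* ran out of space */

	/* vsnprintf() always terminates within the size it is handed. */
	va_start(args, fmt);
	vsnprintf(arch_desc + pos, sizeof(arch_desc) - pos, fmt, args);
	va_end(args);

	/* Join the new text onto the old one as the very last store. */
	if (len)
		arch_desc[len] = ' ';
}
```

Calling arch_desc_append("%s", "pSeries") followed by arch_desc_append("model:%d", 42) leaves "pSeries model:42" in the buffer; at every intermediate step the last byte of the array is still NUL.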

cheers


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-07 Thread Hemant Kumar


On 05/08/2015 09:58 AM, Ingo Molnar wrote:
>* Hemant Kumar  wrote:
>
>>  # perf kvm stat report -p 60515
>>Analyze events for pid(s) 60515, all VCPUs:
>>
>>VM-EXITSamples  Samples% Time%Min Time Max
>>Time Avg time
>>
>>H_DATA_STORAGE   500635.30% 0.13%  1.94us 49.46us
>>12.37us ( +-   0.52% )
>>HV_DECREMENTER   445731.43% 0.02%  0.72us 16.14us
>>1.91us ( +-   0.96% )
>>SYSCALL   269018.97% 0.10%  2.84us528.24us
>>18.29us ( +-   3.75% )
>>RETURN_TO_HOST   178912.61%99.76%  1.58us 672791.91us
>>27470.23us ( +-   3.00% )
>>   EXTERNAL240 1.69% 0.00%0.69us 10.67us
>>1.33us ( +-   5.34% )
>
>Where is the last line misaligned? Copy & paste error or does perf kvm
>produce it in such a way?

It's a copy-paste error. Thanks for pointing this out.

Shall I resend the patches with the correct alignment of the o/p?


>Thanks,
>
>Ingo



--
Thanks,
Hemant Kumar


Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-07 Thread Ingo Molnar

* Hemant Kumar  wrote:

> 
> On 05/08/2015 09:58 AM, Ingo Molnar wrote:
> >* Hemant Kumar  wrote:
> >
> >>  # perf kvm stat report -p 60515
> >>Analyze events for pid(s) 60515, all VCPUs:
> >>
> >>VM-EXITSamples  Samples% Time%Min Time Max  
> >>   Time Avg time
> >>
> >>H_DATA_STORAGE   500635.30% 0.13%  1.94us 49.46us 
> >>12.37us ( +-   0.52% )
> >>HV_DECREMENTER   445731.43% 0.02%  0.72us 16.14us  
> >>1.91us ( +-   0.96% )
> >>SYSCALL   269018.97% 0.10%  2.84us528.24us 
> >> 18.29us ( +-   3.75% )
> >>RETURN_TO_HOST   178912.61%99.76%  1.58us 672791.91us  
> >>27470.23us ( +-   3.00% )
> >>   EXTERNAL240 1.69% 0.00%0.69us 10.67us
> >>   1.33us ( +-   5.34% )
> >Where is the last line misaligned? Copy & paste error or does perf kvm
> >produce it in such a way?
> 
> It's a copy-paste error. Thanks for pointing this out.
> 
> Shall I resend the patches with the correct alignment of the o/p?

I don't think that's necessary, as long as the code is fine.

Thanks,

Ingo

Re: [PATCH v3 0/6] powernv: cpufreq: Report frequency throttle by OCC

2015-05-07 Thread Viresh Kumar
On 4 May 2015 at 14:24, Shilpasri G Bhat  wrote:
> This patchset intends to add frequency throttle reporting mechanism
> to powernv-cpufreq driver when OCC throttles the frequency. OCC is an
> On-Chip-Controller which takes care of the power and thermal safety of
> the chip. The CPU frequency can be throttled during an OCC reset or
> when OCC tries to limit the max allowed frequency. The patchset will
> report such conditions so as to keep the user informed about reason
> for the drop in performance of workloads when frequency is throttled.
>
> Changes from v2:
> - Split into multiple patches
> - Semantic fixes
>
> Shilpasri G Bhat (6):
>   cpufreq: powernv: Handle throttling due to Pmax capping at chip level
>   powerpc/powernv: Add definition of OPAL_MSG_OCC message type
>   cpufreq: powernv: Register for OCC related opal_message notification
>   cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
>   cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is
> set
>   cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
>
>  arch/powerpc/include/asm/opal-api.h |   8 ++
>  drivers/cpufreq/powernv-cpufreq.c   | 199 
> +---
>  2 files changed, 192 insertions(+), 15 deletions(-)

Acked-by: Viresh Kumar 

Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-07 Thread Ingo Molnar

* Hemant Kumar  wrote:

>  # perf kvm stat report -p 60515
> Analyze events for pid(s) 60515, all VCPUs:
> 
>VM-EXITSamples  Samples% Time%Min Time Max
> Time Avg time
> 
> H_DATA_STORAGE   500635.30% 0.13%  1.94us 49.46us 
> 12.37us ( +-   0.52% )
> HV_DECREMENTER   445731.43% 0.02%  0.72us 16.14us  
> 1.91us ( +-   0.96% )
>SYSCALL   269018.97% 0.10%  2.84us528.24us 
> 18.29us ( +-   3.75% )
> RETURN_TO_HOST   178912.61%99.76%  1.58us 672791.91us  
> 27470.23us ( +-   3.00% )
>   EXTERNAL240 1.69% 0.00%0.69us 10.67us  
> 1.33us ( +-   5.34% )

Where is the last line misaligned? Copy & paste error or does perf kvm 
produce it in such a way?

Thanks,

Ingo

Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
>> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
>>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>>>> Hi Preeti,
>>>>
>>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>>>> Hi Shilpa,
>>>>>
>>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>>>> notification by executing *throttle_check() on any one of the cpu on
>>>>>> the chip. This is a sanity check to verify if we were indeed
>>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>>>
>>>>>> We cannot call *throttle_check() directly from the notification
>>>>>> handler because we could be handling chip1's notification in chip2. So
>>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>>>> in the notification handler, so use a worker thread to smp_call
>>>>>> throttle_check() on any of the cpu in the chipmask.
>>>>>
>>>>> I see that the first patch takes care of reporting *per-chip* throttling
>>>>> for pmax capping condition. But where are we taking care of reporting
>>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>>>
>>>>
>>>> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
>>>> Because when the above two conditions occur it is likely to happen across all
>>>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>>>> *throttle_check().
>>>
>>> Alright like I pointed in the previous reply, a comment to indicate that
>>> psafe and freq control disabled conditions will fail when occ is
>>> inactive and that all chips face the consequence of this will help.
>>
>> From your explanation on the thread of the first patch of this series,
>> this will not be required.
>>
>> So,
>> Reviewed-by: Preeti U Murthy 
> 
> OK, so is the whole series reviewed now?

Yes the whole series has been reviewed.

Regards
Preeti U Murthy


> 
> 


Re: [PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg

2015-05-07 Thread Herbert Xu
On Thu, May 07, 2015 at 11:06:06AM -0400, Dan Streetman wrote:
> 
> The crypto 842-nx has (significant) code in it to handle any alignment
> and length input buffers, to match them to what the driver requires.
> Would it be better to move that into the crypto code, so that any
> crypto compression hw driver can request buffers be specifically
> aligned/sized?  I did have to use a header on each compressed buffer
> that needed re-alignment or re-sizing, so maybe it's not appropriate
> for common crypto compression code.

Yes we could certainly move this logic into the crypto layer, as
we do for ciphers and hashes.

But as you say we could make the next guy who writes a comp driver
do this :)

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 1/2] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform

2015-05-07 Thread Joel Stanley
Hello Vipin,

On Thu, May 7, 2015 at 7:00 PM, Vipin K Parashar wrote:
> This patch adds support for FSP EPOW (Early Power Off Warning) and
> DPO (Delayed Power Off) events support for PowerNV platform.

I reviewed this patch for the changes it made to the existing poweroff
code, you still need someone to look at the EPOW code itself.

> Signed-off-by: Vipin K Parashar 
> ---
>  arch/powerpc/include/asm/opal-api.h|  30 ++
>  arch/powerpc/include/asm/opal.h|   3 +-
>  arch/powerpc/platforms/powernv/opal-power.c| 379 
> +++--
>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>  4 files changed, 391 insertions(+), 22 deletions(-)
>

>  /* Internal functions */
>  extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
> diff --git a/arch/powerpc/platforms/powernv/opal-power.c 
> b/arch/powerpc/platforms/powernv/opal-power.c
> index ac46c2c..7c1b2f8 100644
> --- a/arch/powerpc/platforms/powernv/opal-power.c
> +++ b/arch/powerpc/platforms/powernv/opal-power.c
> @@ -1,5 +1,5 @@
>  /*
> - * PowerNV OPAL power control for graceful shutdown handling
> + * PowerNV poweroff events support
>   *
>   * Copyright 2015 IBM Corp.
>   *
> @@ -9,58 +9,395 @@
>   * 2 of the License, or (at your option) any later version.
>   */
>
> +#define pr_fmt(fmt)"POWEROFF_EVENT: "fmt

OPAL_POWER?
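For readers unfamiliar with pr_fmt(), the mechanism is plain string-literal concatenation, which a user-space sketch can show. Here snprintf() stands in for pr_err(), and demo_format_unknown() and demo_has_prefix() are made-up helpers. The "OPAL_POWER: " spelling is only the suggestion above, not what the patch uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix under discussion; the patch itself uses "POWEROFF_EVENT: ". */
#define pr_fmt(fmt) "OPAL_POWER: " fmt

/* Stand-in for pr_err(): formats into buf instead of the kernel log. */
int demo_format_unknown(char *buf, size_t len, unsigned long long type)
{
	assert(buf != NULL && len > 0);
	/* Adjacent string literals are merged at compile time. */
	return snprintf(buf, len, pr_fmt("Unknown event %llu\n"), type);
}

/* True if a formatted line carries the common prefix. */
int demo_has_prefix(const char *line)
{
	return strncmp(line, pr_fmt(""), strlen(pr_fmt(""))) == 0;
}
```

With a prefix defined once at the top of the file, each message string no longer needs its own "OPAL: " text.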

> +
>  #include 
> +#include 
> +#include 
>  #include 
> -#include 
> -
> +#include 
>  #include 
>  #include 
>
> -#define SOFT_OFF 0x00
> -#define SOFT_REBOOT 0x01
> +/* Power control event types */
> +#define SOFT_OFF   0x00
> +#define SOFT_REBOOT0x01

While you're touching this code, I think these should be moved to opal-api.h

> +
> +/* Max time for graceful system shutdown including guests. */
> +#define MAX_POWEROFF_SYS_TIME  600
> +

> +/* IPMI power-control events notifier */
>  static int opal_power_control_event(struct notifier_block *nb,
> -   unsigned long msg_type, void *msg)
> +   unsigned long msg_type, void *msg)
>  {
> -   struct opal_msg *power_msg = msg;
> uint64_t type;
> +   struct opal_msg *power_msg = msg;
>
> type = be64_to_cpu(power_msg->params[0]);
>
> switch (type) {
> case SOFT_REBOOT:
> -   pr_info("OPAL: reboot requested\n");
> +   pr_info("Reboot requested\n");

I prefer the OPAL prefix.

> orderly_reboot();
> break;
> case SOFT_OFF:
> -   pr_info("OPAL: poweroff requested\n");
> +   pr_info("Poweroff requested\n");

Ditto.

> orderly_poweroff(true);
> break;
> default:
> -   pr_err("OPAL: power control type unexpected %016llx\n", type);
> +   pr_err("Unknown event %llu\n", type);

Ditto.

> }
>
> return 0;
>  }
>
> +/* OPAL EPOW event notifier block */
> +static struct notifier_block opal_epow_nb = {
> +   .notifier_call  = opal_epow_event,
> +   .next   = NULL,
> +   .priority   = 0,
> +};
> +
> +/* OPAL DPO event notifier block */
> +static struct notifier_block opal_dpo_nb = {
> +   .notifier_call  = opal_dpo_event,
> +   .next   = NULL,
> +   .priority   = 0,
> +};
> +
> +/* OPAL Power control events */
>  static struct notifier_block opal_power_control_nb = {
> -   .notifier_call  = opal_power_control_event,
> -   .next   = NULL,
> -   .priority   = 0,
> +   .notifier_call  = opal_power_control_event,
> +   .next   = NULL,
> +   .priority   = 0,
>  };

Looks like you changed the whitespace?

>
> -static int __init opal_power_control_init(void)
> +/* Poweroff events init */
> +static int __init opal_poweroff_events_init(void)

This comment does not add any value.

Renaming the function doesn't add much either.

>  {
> int ret;
> +   struct device_node *node_epow;
>
> -   ret = opal_message_notifier_register(OPAL_MSG_SHUTDOWN,
> -&opal_power_control_nb);
> -   if (ret) {
> -   pr_err("%s: Can't register OPAL event notifier (%d)\n",
> -   __func__, ret);
> -   return ret;
> +   /*
> +   * Determine EPOW, DPO support in hardware.
> +   */
> +   node_epow = of_find_node_by_path("/ibm,opal/epow");
> +   if (node_epow) {
> +   if (of_device_is_compatible(node_epow, "ibm,opal-epow")) {
> +   epow_supported = true;
> +   dpo_supported = true;

Why are these separate flags? Do we have any systems that will support
EPOW but not DPO, or DPO without EPOW?

I suggest merging them into the one flag.

> +   pr_info("OPAL EPOW, DPO support detected.\n");
> +   }
> +   of_node_put(node_epow);
> +   }
> +
> +   /* Prepare to handle EPOW e

[PATCH v3 2/2] perf/kvm: Support HCALL events

2015-05-07 Thread Hemant Kumar
powerpc provides hcall events that also provide insights into guest
behaviour. Enhance perf kvm to record and analyze hcall events.

 - To trace hcall events :
  perf kvm stat record

 - To show the results :
  perf kvm stat report --event=hcall

The result shows the number of hypervisor calls from the guest grouped
by their respective reasons displayed with the frequency.

This patch makes use of two additional tracepoints "kvm_hv:kvm_hcall_enter"
and "kvm_hv:kvm_hcall_exit". It uses the pSeries hypervisor codes
exported through uapi to classify the hcalls into their respective reasons.

Note : This patch has a dependency on "kvm/powerpc: Export HCALL reason
codes" which exports HCALL reasons through uapi.
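As a rough illustration of how an enter/exit tracepoint pair turns into the per-hcall sample counts and times in the report, here is a self-contained sketch. It is not the perf implementation: every demo_* name, the fixed-size table, and the microsecond timestamps are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_KEYS 8

/* Accumulated statistics for one hcall request code. */
struct demo_stat {
	unsigned long key;
	unsigned long samples;
	double total_us;
	double min_us;
	double max_us;
};

/* Per-vcpu state carried from the enter event to the exit event. */
struct demo_vcpu {
	int in_hcall;
	unsigned long key;
	double start_us;
};

struct demo_stat demo_stats[DEMO_MAX_KEYS];
size_t demo_nr_stats;

/* Find the slot for a key, creating it on first use. */
struct demo_stat *demo_stat_get(unsigned long key)
{
	size_t i;

	for (i = 0; i < demo_nr_stats; i++)
		if (demo_stats[i].key == key)
			return &demo_stats[i];

	assert(demo_nr_stats < DEMO_MAX_KEYS);
	demo_stats[demo_nr_stats].key = key;
	return &demo_stats[demo_nr_stats++];
}

/* Begin event: remember which hcall started and when. */
void demo_hcall_enter(struct demo_vcpu *vcpu, unsigned long req, double now_us)
{
	vcpu->in_hcall = 1;
	vcpu->key = req;
	vcpu->start_us = now_us;
}

/* End event: charge the elapsed time to the remembered key. */
void demo_hcall_exit(struct demo_vcpu *vcpu, double now_us)
{
	struct demo_stat *st;
	double delta;

	if (!vcpu->in_hcall)
		return;	/* exit without a matching enter: ignore */

	delta = now_us - vcpu->start_us;
	st = demo_stat_get(vcpu->key);
	if (st->samples == 0 || delta < st->min_us)
		st->min_us = delta;
	if (st->samples == 0 || delta > st->max_us)
		st->max_us = delta;
	st->samples++;
	st->total_us += delta;
	vcpu->in_hcall = 0;
}
```

The average time per hcall is then total_us / samples for each slot.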

A sample output :
 # pgrep qemu
19378
60515

2 VMs running.

 # perf kvm stat record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]

 # perf kvm stat report -p 60515 --event=hcall
Analyze events for pid(s) 60515, all VCPUs:

     HCALL-EVENT    Samples  Samples%     Time%    Min Time    Max Time    Avg time

    H_VIO_SIGNAL       1034    38.44%    15.77%      0.36us      1.59us      0.44us ( +-   0.66% )
      H_SEND_CRQ        652    24.24%    10.97%      0.39us      1.84us      0.49us ( +-   1.20% )
           H_IPI        523    19.44%    62.05%      1.35us     19.70us      3.44us ( +-   2.88% )
 H_PUT_TERM_CHAR        411    15.28%     8.03%      0.38us      3.77us      0.57us ( +-   1.61% )
 H_GET_TERM_CHAR         50     1.86%     0.99%      0.40us      0.98us      0.57us ( +-   3.37% )
           H_EOI         20     0.74%     2.19%      2.22us      4.72us      3.17us ( +-   5.96% )

Total Samples:2690, Total events handled time:2896.94us.

Signed-off-by: Hemant Kumar 
---
Patch has a dependency on https://patchwork.ozlabs.org/patch/469841/
which exports the HCALL reason codes to perf.

 arch/powerpc/include/uapi/asm/kvm_perf.h |  4 +++
 tools/perf/arch/powerpc/util/kvm-stat.c  | 61 
 2 files changed, 65 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm_perf.h 
b/arch/powerpc/include/uapi/asm/kvm_perf.h
index 30fa670..440902e 100644
--- a/arch/powerpc/include/uapi/asm/kvm_perf.h
+++ b/arch/powerpc/include/uapi/asm/kvm_perf.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 #define DECODE_STR_LEN 20
 
@@ -11,5 +12,8 @@
 #define KVM_ENTRY_TRACE "kvm_hv:kvm_guest_enter"
 #define KVM_EXIT_TRACE "kvm_hv:kvm_guest_exit"
 #define KVM_EXIT_REASON "trap"
+#define KVM_HCALL_ENTRY_TRACE "kvm_hv:kvm_hcall_enter"
+#define KVM_HCALL_EXIT_TRACE "kvm_hv:kvm_hcall_exit"
+#define KVM_HCALL_REASON "req"
 
 #endif /* _ASM_POWERPC_KVM_PERF_H */
diff --git a/tools/perf/arch/powerpc/util/kvm-stat.c 
b/tools/perf/arch/powerpc/util/kvm-stat.c
index 62cdcc1..685201c 100644
--- a/tools/perf/arch/powerpc/util/kvm-stat.c
+++ b/tools/perf/arch/powerpc/util/kvm-stat.c
@@ -1,7 +1,9 @@
 #include "../../util/kvm-stat.h"
 #include 
+#include "../../util/debug.h"
 
 define_exit_reasons_table(hv_exit_reasons, kvm_trace_symbol_exit);
+define_exit_reasons_table(hcall_reasons, kvm_trace_symbol_hcall);
 
 static struct kvm_events_ops exit_events = {
.is_begin_event = exit_event_begin,
@@ -10,14 +12,73 @@ static struct kvm_events_ops exit_events = {
.name = "VM-EXIT"
 };
 
+static void hcall_event_get_key(struct perf_evsel *evsel,
+   struct perf_sample *sample,
+   struct event_key *key)
+{
+   key->info = 0;
+   key->key = perf_evsel__intval(evsel, sample, KVM_HCALL_REASON);
+}
+
+static const char *get_exit_reason(u64 exit_code)
+{
+   struct exit_reasons_table *tbl = hcall_reasons;
+
+   while (tbl->reason != NULL) {
+   if (tbl->exit_code == exit_code)
+   return tbl->reason;
+   tbl++;
+   }
+
+   pr_err("Unknown kvm hcall exit code: %lld\n",
+  (unsigned long long)exit_code);
+   return "UNKNOWN";
+}
+
+static bool hcall_event_end(struct perf_evsel *evsel,
+   struct perf_sample *sample __maybe_unused,
+   struct event_key *key __maybe_unused)
+{
+   return (!strcmp(evsel->name, KVM_HCALL_EXIT_TRACE));
+}
+
+static bool hcall_event_begin(struct perf_evsel *evsel,
+ struct perf_sample *sample, struct event_key *key)
+{
+   if (!strcmp(evsel->name, KVM_HCALL_ENTRY_TRACE)) {
+   hcall_event_get_key(evsel, sample, key);
+   return true;
+   }
+
+return false;
+}
+static void hcall_event_decode_key(struct perf_kvm_stat *kvm __maybe_unused,
+  struct event_key *key,
+  char *decode)
+{
+   const char *hcall_reason = get_exit_reason(key->key);
+
+   scnprintf(decode, DECODE_STR_LEN, "%s", hcall_reason);
+}
+
+static struct kv

[PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-07 Thread Hemant Kumar
From: Srikar Dronamraju 

perf kvm can be used to analyze guest exit reasons. This support already
exists in x86. Hence, porting it to powerpc.

 - To trace KVM events :
  perf kvm stat record
  If many guests are running, we can track for a specific guest by using
  --pid as in : perf kvm stat record --pid <pid>

 - To see the results :
  perf kvm stat report

The result shows the number of exits (from the guest context to
host/hypervisor context) grouped by their respective exit reasons with
their frequency.

This patch makes use of the guest exit reasons available in
"trace_book3s.h". It records on two already available tracepoints :
"kvm_hv:kvm_guest_exit" and "kvm_hv:kvm_guest_enter".

Note : This patch has a dependency on the patch "kvm/powerpc: Export
kvm exit reasons" which exports the KVM exit reasons through the uapi.

Here is a sample o/p:
 # pgrep qemu
19378
60515

2 Guests are running on the host.

 # perf kvm stat record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]

 # perf kvm stat report -p 60515
Analyze events for pid(s) 60515, all VCPUs:

   VM-EXITSamples  Samples% Time%Min Time Max
Time Avg time

H_DATA_STORAGE   500635.30% 0.13%  1.94us 49.46us 
12.37us ( +-   0.52% )
HV_DECREMENTER   445731.43% 0.02%  0.72us 16.14us  
1.91us ( +-   0.96% )
   SYSCALL   269018.97% 0.10%  2.84us528.24us 
18.29us ( +-   3.75% )
RETURN_TO_HOST   178912.61%99.76%  1.58us 672791.91us  
27470.23us ( +-   3.00% )
  EXTERNAL240 1.69% 0.00%0.69us 10.67us  
1.33us ( +-   5.34% )

Total Samples:14182, Total events handled time:49264158.30us.
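The Samples% column can be cross-checked against the total with one line of arithmetic. In this small sketch, share_percent() is a made-up helper and the numbers come from the report above.

```c
#include <assert.h>

/* Share of a part in a total, expressed in percent. */
double share_percent(double part, double total)
{
	assert(total > 0.0);
	return 100.0 * part / total;
}
```

For example, share_percent(5006, 14182) is about 35.30, which agrees with the H_DATA_STORAGE row, and share_percent(1789, 14182) is about 12.61, which agrees with RETURN_TO_HOST.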

Signed-off-by: Srikar Dronamraju 
Signed-off-by: Hemant Kumar 
---
Patch has a dependency on : https://patchwork.ozlabs.org/patch/469839/
which exports the exit reasons to perf through uapi.

Changes:
- Original series split into two patchsets now : perf and powerpc
  side changes.

 arch/powerpc/include/uapi/asm/kvm_perf.h | 15 +++
 tools/perf/arch/powerpc/Makefile |  1 +
 tools/perf/arch/powerpc/util/Build   |  1 +
 tools/perf/arch/powerpc/util/kvm-stat.c  | 33 
 4 files changed, 50 insertions(+)
 create mode 100644 arch/powerpc/include/uapi/asm/kvm_perf.h
 create mode 100644 tools/perf/arch/powerpc/util/kvm-stat.c

diff --git a/arch/powerpc/include/uapi/asm/kvm_perf.h 
b/arch/powerpc/include/uapi/asm/kvm_perf.h
new file mode 100644
index 000..30fa670
--- /dev/null
+++ b/arch/powerpc/include/uapi/asm/kvm_perf.h
@@ -0,0 +1,15 @@
+#ifndef _ASM_POWERPC_KVM_PERF_H
+#define _ASM_POWERPC_KVM_PERF_H
+
+#include 
+#include 
+
+#define DECODE_STR_LEN 20
+
+#define VCPU_ID "vcpu_id"
+
+#define KVM_ENTRY_TRACE "kvm_hv:kvm_guest_enter"
+#define KVM_EXIT_TRACE "kvm_hv:kvm_guest_exit"
+#define KVM_EXIT_REASON "trap"
+
+#endif /* _ASM_POWERPC_KVM_PERF_H */
diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
index 7fbca17..21322e0 100644
--- a/tools/perf/arch/powerpc/Makefile
+++ b/tools/perf/arch/powerpc/Makefile
@@ -1,3 +1,4 @@
 ifndef NO_DWARF
 PERF_HAVE_DWARF_REGS := 1
 endif
+HAVE_KVM_STAT_SUPPORT := 1
diff --git a/tools/perf/arch/powerpc/util/Build 
b/tools/perf/arch/powerpc/util/Build
index 0af6e9b..dd47b5e 100644
--- a/tools/perf/arch/powerpc/util/Build
+++ b/tools/perf/arch/powerpc/util/Build
@@ -1,4 +1,5 @@
 libperf-y += header.o
+libperf-y += kvm-stat.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_DWARF) += skip-callchain-idx.o
diff --git a/tools/perf/arch/powerpc/util/kvm-stat.c 
b/tools/perf/arch/powerpc/util/kvm-stat.c
new file mode 100644
index 000..62cdcc1
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/kvm-stat.c
@@ -0,0 +1,33 @@
+#include "../../util/kvm-stat.h"
+#include 
+
+define_exit_reasons_table(hv_exit_reasons, kvm_trace_symbol_exit);
+
+static struct kvm_events_ops exit_events = {
+   .is_begin_event = exit_event_begin,
+   .is_end_event = exit_event_end,
+   .decode_key = exit_event_decode_key,
+   .name = "VM-EXIT"
+};
+
+const char *const kvm_events_tp[] = {
+   "kvm_hv:kvm_guest_exit",
+   "kvm_hv:kvm_guest_enter",
+   NULL,
+};
+
+struct kvm_reg_events_ops kvm_reg_events_ops[] = {
+   { .name = "vmexit", .ops = &exit_events },
+   { NULL, NULL },
+};
+
+const char * const kvm_skip_events[] = {
+   NULL,
+};
+
+int cpu_isa_init(struct perf_kvm_stat *kvm, const char *cpuid __maybe_unused)
+{
+   kvm->exit_reasons = hv_exit_reasons;
+   kvm->exit_reasons_isa = "HV";
+   return 0;
+}
-- 
1.9.3


[PATCH v3 3/3] kvm/powerpc: Export HCALL reason codes

2015-05-07 Thread Hemant Kumar
For perf to analyze the KVM events like hcalls, we need the
hypervisor calls and their codes to be exported through uapi.

This patch moves most of the pSeries hcall codes from
arch/powerpc/include/asm/hvcall.h to
arch/powerpc/include/uapi/asm/hcall_codes.h.
It also moves the mapping  from
arch/powerpc/kvm/trace_hv.h to
arch/powerpc/include/uapi/asm/trace_hcall.h.

Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/hvcall.h   | 120 +--
 arch/powerpc/include/uapi/asm/hcall_codes.h | 123 
 arch/powerpc/include/uapi/asm/trace_hcall.h | 122 +++
 arch/powerpc/kvm/trace_hv.h | 117 +-
 4 files changed, 248 insertions(+), 234 deletions(-)
 create mode 100644 arch/powerpc/include/uapi/asm/hcall_codes.h
 create mode 100644 arch/powerpc/include/uapi/asm/trace_hcall.h

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 85bc8c0..799677d 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -155,124 +155,8 @@
 /* Each control block has to be on a 4K boundary */
 #define H_CB_ALIGNMENT  4096
 
-/* pSeries hypervisor opcodes */
-#define H_REMOVE   0x04
-#define H_ENTER0x08
-#define H_READ 0x0c
-#define H_CLEAR_MOD0x10
-#define H_CLEAR_REF0x14
-#define H_PROTECT  0x18
-#define H_GET_TCE  0x1c
-#define H_PUT_TCE  0x20
-#define H_SET_SPRG00x24
-#define H_SET_DABR 0x28
-#define H_PAGE_INIT0x2c
-#define H_SET_ASR  0x30
-#define H_ASR_ON   0x34
-#define H_ASR_OFF  0x38
-#define H_LOGICAL_CI_LOAD  0x3c
-#define H_LOGICAL_CI_STORE 0x40
-#define H_LOGICAL_CACHE_LOAD   0x44
-#define H_LOGICAL_CACHE_STORE  0x48
-#define H_LOGICAL_ICBI 0x4c
-#define H_LOGICAL_DCBF 0x50
-#define H_GET_TERM_CHAR0x54
-#define H_PUT_TERM_CHAR0x58
-#define H_REAL_TO_LOGICAL  0x5c
-#define H_HYPERVISOR_DATA  0x60
-#define H_EOI  0x64
-#define H_CPPR 0x68
-#define H_IPI  0x6c
-#define H_IPOLL0x70
-#define H_XIRR 0x74
-#define H_PERFMON  0x7c
-#define H_MIGRATE_DMA  0x78
-#define H_REGISTER_VPA 0xDC
-#define H_CEDE 0xE0
-#define H_CONFER   0xE4
-#define H_PROD 0xE8
-#define H_GET_PPP  0xEC
-#define H_SET_PPP  0xF0
-#define H_PURR 0xF4
-#define H_PIC  0xF8
-#define H_REG_CRQ  0xFC
-#define H_FREE_CRQ 0x100
-#define H_VIO_SIGNAL   0x104
-#define H_SEND_CRQ 0x108
-#define H_COPY_RDMA0x110
-#define H_REGISTER_LOGICAL_LAN 0x114
-#define H_FREE_LOGICAL_LAN 0x118
-#define H_ADD_LOGICAL_LAN_BUFFER 0x11C
-#define H_SEND_LOGICAL_LAN 0x120
-#define H_BULK_REMOVE  0x124
-#define H_MULTICAST_CTRL   0x130
-#define H_SET_XDABR0x134
-#define H_STUFF_TCE0x138
-#define H_PUT_TCE_INDIRECT 0x13C
-#define H_CHANGE_LOGICAL_LAN_MAC 0x14C
-#define H_VTERM_PARTNER_INFO   0x150
-#define H_REGISTER_VTERM   0x154
-#define H_FREE_VTERM   0x158
-#define H_RESET_EVENTS  0x15C
-#define H_ALLOC_RESOURCE0x160
-#define H_FREE_RESOURCE 0x164
-#define H_MODIFY_QP 0x168
-#define H_QUERY_QP  0x16C
-#define H_REREGISTER_PMR0x170
-#define H_REGISTER_SMR  0x174
-#define H_QUERY_MR  0x178
-#define H_QUERY_MW  0x17C
-#define H_QUERY_HCA 0x180
-#define H_QUERY_PORT0x184
-#define H_MODIFY_PORT   0x188
-#define H_DEFINE_AQP1   0x18C
-#define H_GET_TRACE_BUFFER  0x190
-#define H_DEFINE_AQP0   0x194
-#define H_RESIZE_MR 0x198
-#define H_ATTACH_MCQP   0x19C
-#define H_DETACH_MCQP   0x1A0
-#define H_CREATE_RPT0x1A4
-#define H_REMOVE_RPT0x1A8
-#define H_REGISTER_RPAGES   0x1AC
-#define H_DISABLE_AND_GETC  0x1B0
-#define H_ERROR_DATA0x1B4
-#define H_GET_HCA_INFO  0x1B8
-#define H_GET_PERF_COUNT0x1BC
-#define H_MANAGE_TRACE  0x1C0
-#define H_FREE_LOGICAL_LAN_BUFFER 0x1D4
-#define H_QUERY_INT_STATE   0x1E4
-#define H_POLL_PENDING 0x1D8
-#define H_ILLAN_ATTRIBUTES 0x244
-#define H_MODIFY_HEA_QP0x250
-#define H_QUERY_HEA_QP 0x254
-#define H_QUERY_HEA0x258
-#define H_QUERY_HEA_PORT   0x25C
-#define H_MODIFY_HEA_PORT  0x260
-#define H_REG_BCMC 0x264
-#define H_DEREG_BCMC   0x268
-#define H_REGISTER_HEA_RPAGES  0x26C
-#define H_DISABLE_AND_GET_HEA  0x270
-#define H_GET_HEA_INFO 0x274
-#define H_ALLOC_HEA_RESOURCE   0x27

[PATCH v3 2/3] kvm/powerpc: Add exit reason for return code 0x0

2015-05-07 Thread Hemant Kumar
This patch adds an exit reason "RETURN_TO_HOST" for the return code
0x0.

Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/uapi/asm/trace_book3s.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/uapi/asm/trace_book3s.h 
b/arch/powerpc/include/uapi/asm/trace_book3s.h
index f647ce0..8635005 100644
--- a/arch/powerpc/include/uapi/asm/trace_book3s.h
+++ b/arch/powerpc/include/uapi/asm/trace_book3s.h
@@ -6,6 +6,7 @@
  */
 
 #define kvm_trace_symbol_exit \
+   {0x0,   "RETURN_TO_HOST"}, \
{0x100, "SYSTEM_RESET"}, \
{0x200, "MACHINE_CHECK"}, \
{0x300, "DATA_STORAGE"}, \
-- 
1.9.3


[PATCH v3 1/3] kvm/powerpc: Export kvm exit reasons

2015-05-07 Thread Hemant Kumar
To analyze the kvm exits with perf, we will need to map the exit codes
with the exit reasons. Such a mapping exists today in trace_book3s.h.
Currently it's not exported to perf.

This patch moves these kvm exit reasons and their mapping from
"arch/powerpc/kvm/trace_book3s.h" to
"arch/powerpc/include/uapi/asm/trace_book3s.h".

Accordingly change the include files in "trace_hv.h" and "trace_pr.h".
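For illustration only, this is roughly how a user-space consumer could turn such a brace-list macro into a lookup table. All demo_* names are hypothetical; only the four code/name pairs are taken from kvm_trace_symbol_exit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A subset of the pairs in kvm_trace_symbol_exit, same list shape. */
#define demo_trace_symbol_exit \
	{0x500, "EXTERNAL"}, \
	{0x980, "HV_DECREMENTER"}, \
	{0xc00, "SYSCALL"}, \
	{0xe00, "H_DATA_STORAGE"}

struct demo_exit_reason {
	unsigned long code;
	const char *reason;
};

/* The macro expands into the initialiser; a NULL name ends the table. */
const struct demo_exit_reason demo_exit_reasons[] = {
	demo_trace_symbol_exit,
	{ 0, NULL }
};

/* Map an exit code to its name, or "UNKNOWN" if it is not listed. */
const char *demo_exit_reason_name(unsigned long code)
{
	const struct demo_exit_reason *r;

	for (r = demo_exit_reasons; r->reason != NULL; r++)
		if (r->code == code)
			return r->reason;
	return "UNKNOWN";
}

/* Map a name back to its exit code, or -1 if it is not listed. */
long demo_exit_reason_code(const char *reason)
{
	const struct demo_exit_reason *r;

	assert(reason != NULL);
	for (r = demo_exit_reasons; r->reason != NULL; r++)
		if (strcmp(r->reason, reason) == 0)
			return (long)r->code;
	return -1;
}
```

Because the table ends on a NULL name and not on a code value, a code of 0x0 could be a valid entry.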

Signed-off-by: Hemant Kumar 
---
Changes :
- Original patchset split into 2 patchsets now: for perf and powerpc
  side changes.

 arch/powerpc/include/uapi/asm/trace_book3s.h | 32 
 arch/powerpc/kvm/trace_book3s.h  | 32 
 arch/powerpc/kvm/trace_hv.h  |  2 +-
 arch/powerpc/kvm/trace_pr.h  |  2 +-
 4 files changed, 34 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/uapi/asm/trace_book3s.h
 delete mode 100644 arch/powerpc/kvm/trace_book3s.h

diff --git a/arch/powerpc/include/uapi/asm/trace_book3s.h 
b/arch/powerpc/include/uapi/asm/trace_book3s.h
new file mode 100644
index 000..f647ce0
--- /dev/null
+++ b/arch/powerpc/include/uapi/asm/trace_book3s.h
@@ -0,0 +1,32 @@
+#if !defined(_TRACE_KVM_BOOK3S_H)
+#define _TRACE_KVM_BOOK3S_H
+
+/*
+ * Common defines used by the trace macros in trace_pr.h and trace_hv.h
+ */
+
+#define kvm_trace_symbol_exit \
+   {0x100, "SYSTEM_RESET"}, \
+   {0x200, "MACHINE_CHECK"}, \
+   {0x300, "DATA_STORAGE"}, \
+   {0x380, "DATA_SEGMENT"}, \
+   {0x400, "INST_STORAGE"}, \
+   {0x480, "INST_SEGMENT"}, \
+   {0x500, "EXTERNAL"}, \
+   {0x501, "EXTERNAL_LEVEL"}, \
+   {0x502, "EXTERNAL_HV"}, \
+   {0x600, "ALIGNMENT"}, \
+   {0x700, "PROGRAM"}, \
+   {0x800, "FP_UNAVAIL"}, \
+   {0x900, "DECREMENTER"}, \
+   {0x980, "HV_DECREMENTER"}, \
+   {0xc00, "SYSCALL"}, \
+   {0xd00, "TRACE"}, \
+   {0xe00, "H_DATA_STORAGE"}, \
+   {0xe20, "H_INST_STORAGE"}, \
+   {0xe40, "H_EMUL_ASSIST"}, \
+   {0xf00, "PERFMON"}, \
+   {0xf20, "ALTIVEC"}, \
+   {0xf40, "VSX"}
+
+#endif
diff --git a/arch/powerpc/kvm/trace_book3s.h b/arch/powerpc/kvm/trace_book3s.h
deleted file mode 100644
index f647ce0..000
--- a/arch/powerpc/kvm/trace_book3s.h
+++ /dev/null
@@ -1,32 +0,0 @@
-#if !defined(_TRACE_KVM_BOOK3S_H)
-#define _TRACE_KVM_BOOK3S_H
-
-/*
- * Common defines used by the trace macros in trace_pr.h and trace_hv.h
- */
-
-#define kvm_trace_symbol_exit \
-   {0x100, "SYSTEM_RESET"}, \
-   {0x200, "MACHINE_CHECK"}, \
-   {0x300, "DATA_STORAGE"}, \
-   {0x380, "DATA_SEGMENT"}, \
-   {0x400, "INST_STORAGE"}, \
-   {0x480, "INST_SEGMENT"}, \
-   {0x500, "EXTERNAL"}, \
-   {0x501, "EXTERNAL_LEVEL"}, \
-   {0x502, "EXTERNAL_HV"}, \
-   {0x600, "ALIGNMENT"}, \
-   {0x700, "PROGRAM"}, \
-   {0x800, "FP_UNAVAIL"}, \
-   {0x900, "DECREMENTER"}, \
-   {0x980, "HV_DECREMENTER"}, \
-   {0xc00, "SYSCALL"}, \
-   {0xd00, "TRACE"}, \
-   {0xe00, "H_DATA_STORAGE"}, \
-   {0xe20, "H_INST_STORAGE"}, \
-   {0xe40, "H_EMUL_ASSIST"}, \
-   {0xf00, "PERFMON"}, \
-   {0xf20, "ALTIVEC"}, \
-   {0xf40, "VSX"}
-
-#endif
diff --git a/arch/powerpc/kvm/trace_hv.h b/arch/powerpc/kvm/trace_hv.h
index 33d9daf..02d0a07 100644
--- a/arch/powerpc/kvm/trace_hv.h
+++ b/arch/powerpc/kvm/trace_hv.h
@@ -2,7 +2,7 @@
 #define _TRACE_KVM_HV_H
 
 #include 
-#include "trace_book3s.h"
+#include 
 #include 
 #include 
 
diff --git a/arch/powerpc/kvm/trace_pr.h b/arch/powerpc/kvm/trace_pr.h
index 810507c..a9850c6 100644
--- a/arch/powerpc/kvm/trace_pr.h
+++ b/arch/powerpc/kvm/trace_pr.h
@@ -3,7 +3,7 @@
 #define _TRACE_KVM_PR_H
 
 #include 
-#include "trace_book3s.h"
+#include 
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvm_pr
-- 
1.9.3


Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page

2015-05-07 Thread David Gibson
On Thu, May 07, 2015 at 12:46:21PM +0530, Aneesh Kumar K.V wrote:
> We need to check whether pte is present in follow_huge_addr and
> properly return NULL if mapping is not present. Also use READ_ONCE
> when dereferencing pte_t address.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: David Gibson 

Looks sane.  It's a long time since I worked with this so I don't
really remember, but I have a suspicion that at the time hugepage PTEs
could never exist but be non-present.

> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0ce968b00b7c..f5688423bc69 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
> - pte_t *ptep;
> - struct page *page;
> + pte_t *ptep, pte;
>   unsigned shift;
>   unsigned long mask, flags;
> + struct page *page = ERR_PTR(-EINVAL);
> +
> + local_irq_save(flags);
> + ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> + if (!ptep)
> + goto no_page;
> + pte = READ_ONCE(*ptep);
>   /*
> +  * Verify it is a huge page else bail.
>* Transparent hugepages are handled by generic code. We can skip them
>* here.
>*/
> - local_irq_save(flags);
> - ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> + if (!shift || pmd_trans_huge((pmd_t)pte))
> + goto no_page;
>  
> - /* Verify it is a huge page else bail. */
> - if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
> - local_irq_restore(flags);
> - return ERR_PTR(-EINVAL);
> + if (!pte_present(pte)) {
> + page = NULL;
> + goto no_page;
>   }
>   mask = (1UL << shift) - 1;
> - page = pte_page(*ptep);
> + page = pte_page(pte);
>   if (page)
>   page += (address & mask) / PAGE_SIZE;
>  
> +no_page:
>   local_irq_restore(flags);
>   return page;
>  }
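
The control flow of the patched function can be sketched in userspace as
below. This is only an illustration: toy_pte_t, TOY_PTE_PRESENT,
TOY_PAGE_SHIFT and toy_follow_huge() are made-up stand-ins rather than
kernel definitions, and the transparent hugepage test and the IRQ
save/restore are left out. It shows the three outcomes (error, not present,
found) and that every test after the single volatile load uses the local
copy of the PTE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are the kernel's definitions. */
typedef uint64_t toy_pte_t;
#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PAGE_SHIFT  12

enum toy_result { TOY_EINVAL, TOY_NOT_PRESENT, TOY_FOUND };

/*
 * Classify a lookup in the same order as the patch and, on success, store
 * the index of the base page inside the huge page in *subpage.
 */
static enum toy_result toy_follow_huge(const toy_pte_t *ptep, unsigned shift,
                                       unsigned long address,
                                       unsigned long *subpage)
{
    toy_pte_t pte;
    unsigned long mask;

    if (!ptep)
        return TOY_EINVAL;              /* no page table entry found */

    /* One volatile load, similar in spirit to READ_ONCE(*ptep). */
    pte = *(const volatile toy_pte_t *)ptep;

    if (!shift)
        return TOY_EINVAL;              /* not a huge mapping */

    if (!(pte & TOY_PTE_PRESENT))
        return TOY_NOT_PRESENT;         /* the "page = NULL" case */

    mask = (1UL << shift) - 1;
    *subpage = (address & mask) >> TOY_PAGE_SHIFT;
    return TOY_FOUND;
}
```

With a 16MB huge page (shift = 24) and 4K base pages, an address whose low
24 bits are 0x003000 selects base page 3 within the huge page.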

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH v4 01/21] pci: Add pcibios_setup_bridge()

2015-05-07 Thread Bjorn Helgaas
Hi Gavin,

[Please run "git log --oneline drivers/pci/setup-bus.c" and observe the
capitalization convention.]

On Fri, May 01, 2015 at 04:02:48PM +1000, Gavin Shan wrote:
> Currently, PowerPC PowerNV platform utilizes ppc_md.pcibios_fixup(),
> which is called once after PCI probing and resource assignment
> are completed, to allocate platform required resources for PCI devices:
> PE#, IO and MMIO mapping, DMA address translation (TCE) table etc.
> Obviously, it's not hotplug friendly.
> 
> The patch adds weak function pcibios_setup_bridge(), which is called
> by pci_setup_bridge(). PowerPC PowerNV platform will reuse the function
> to assign above platform required resources to newly added PCI devices,
> in order to support PCI hotplug on PowerPC PowerNV platform.
> 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/pci/setup-bus.c | 12 +---
>  include/linux/pci.h |  1 +
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 4fd0cac..a7d0c3c 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -674,7 +674,8 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev 
> *bridge)
>   pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
>  }
>  
> -static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
> +
> +void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type)
>  {
>   struct pci_dev *bridge = bus->self;
>  
> @@ -693,12 +694,17 @@ static void __pci_setup_bridge(struct pci_bus *bus, 
> unsigned long type)
>   pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
>  }
>  
> +void __weak pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
> +{
> + pci_setup_bridge_resources(bus, type);
> +}

I'm not opposed to adding a pcibios_setup_bridge(), but I would rather do
the architected updates in the generic PCI core code instead of down in the
pcibios code.  In other words, I would rather have this:

  void pci_setup_bridge(struct pci_bus *bus)
  {
          pcibios_setup_bridge(bus, type);
          pci_setup_bridge_resources(bus, type);
  }

That way the default pcibios hook is empty, showing that by default there's
no arch-specific code in this path, and we only have to look at the generic
core code to verify that we actually do program the bridge windows.
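
A standalone sketch of that shape, for illustration only. The toy_* names
and struct toy_bus are hypothetical and are not the PCI core's types; the
weak attribute is the GCC/Clang mechanism behind the kernel's __weak. The
default hook is empty, and the generic function always does the architected
work after calling it.

```c
/* Hypothetical toy types and names; not the PCI core's real ones. */
struct toy_bus {
    int windows_programmed;   /* set by the generic code */
    int arch_setup_done;      /* would be set only by an arch override */
};

/* Default hook: empty, so by default there is no arch code in this path. */
__attribute__((weak)) void toy_pcibios_setup_bridge(struct toy_bus *bus,
                                                    unsigned long type)
{
    (void)bus;
    (void)type;
}

/* The architected work stays in the core and cannot be overridden. */
static void toy_setup_bridge_resources(struct toy_bus *bus, unsigned long type)
{
    (void)type;
    bus->windows_programmed = 1;
}

void toy_setup_bridge(struct toy_bus *bus, unsigned long type)
{
    toy_pcibios_setup_bridge(bus, type);    /* optional arch hook */
    toy_setup_bridge_resources(bus, type);  /* always runs */
}
```

An architecture that needs extra setup would provide a non-weak
toy_pcibios_setup_bridge() in its own file, and the linker would pick that
one; the bridge windows are programmed either way.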

Bjorn

Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020

2015-05-07 Thread Scott Wood
On Thu, 2015-05-07 at 15:29 -0400, Oleksandr G Zhadan wrote:
> On 05/07/2015 02:18 PM, Scott Wood wrote:
> > On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote:
> >>>> diff --git a/arch/powerpc/configs/ucp1020_defconfig 
> >>>> b/arch/powerpc/configs/ucp1020_defconfig
> >>>> new file mode 100644
> >>>> index 000..62f99aa
> >>>> --- /dev/null
> >>>> +++ b/arch/powerpc/configs/ucp1020_defconfig
> >>>
> >>> Please explain why your board needs its own defconfig.
> >>>
> >>
> >> Because, it's our own board and it has some specific to board
> >> definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product
> >> definitions.
> >>
> >> If I can do it in some other way could you please give me some example
> >> if it's possible.
> >
> > I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream.
> > Could you list what you need to be set that mpc85xx_smp_defconfig
> > doesn't set?
> 
> I diffed "mpc85xx_smp_defconfig" against "ucp1020_defconfig" (after make 
> savedefconfig), and there are some differences:
> 
> - mpc85xx_smp_defconfig has:
> CONFIG_PHYS_64BIT=y
> CONFIG_NR_CPUS=8

These won't prevent your board from working.  If you want
CONFIG_PHYS_64BIT disabled for performance, I could see a fragment being
used for that as per the recent defconfig discussions.  I wouldn't
expect NR_CPUS being 8 instead of 2 to be noticeable.

> - it enables almost all boards to build. What for?

Because that's what the common defconfigs are for.  We don't want a
defconfig for each board (most of the board-specific configs that are
currently there were added long ago).  If you want a config that
contains nothing your board doesn't need, you can maintain that locally.

> - it has MTD-related differences (it doesn't enable the SPI flash support
> we need):
> -CONFIG_MTD_M25P80=y
> -CONFIG_MTD_SST25L=y

So add them to the existing defconfig.

> - It includes some PHY support, but not the PHY we are using

This should not harm your board.

>  and we need to include Intel wifi support:
> -CONFIG_MICREL_PHY=y
> -CONFIG_IWLWIFI=y

So add them to the existing defconfig.

> - It doesn't enable EXT4 fs support.

I think this would be a reasonable thing to add.


> Etc...
> 
> You can see it yourself below:

That doesn't show me the set of changes that you *need*, only the set of
changes that you have.

-Scott



Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Rafael J. Wysocki
On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> > On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> >> Hi Preeti,
> >>
> >> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
> >>> Hi Shilpa,
> >>>
> >>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> >>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
> >>>> notification by executing *throttle_check() on any one of the cpu on
> >>>> the chip. This is a sanity check to verify if we were indeed
> >>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
> >>>>
> >>>> We cannot call *throttle_check() directly from the notification
> >>>> handler because we could be handling chip1's notification in chip2. So
> >>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> >>>> in the notification handler, so use a worker thread to smp_call
> >>>> throttle_check() on any of the cpu in the chipmask.
> >>>
> >>> I see that the first patch takes care of reporting *per-chip* throttling
> >>> for pmax capping condition. But where are we taking care of reporting
> >>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
> >>>
> >>
> >> IMO let us not have "psafe" and "freq control disabled" states managed 
> >> per-chip.
> >> Because when the above two conditions occur it is likely to happen across 
> >> all
> >> chips during an OCC reset cycle. So I am setting 'throttled' to false on
> >> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> >> *throttle_check().
> > 
> > Alright like I pointed in the previous reply, a comment to indicate that
> > psafe and freq control disabled conditions will fail when occ is
> > inactive and that all chips face the consequence of this will help.
> 
> From your explanation on the thread of the first patch of this series,
> this will not be required.
> 
> So,
> Reviewed-by: Preeti U Murthy 

OK, so is the whole series reviewed now?


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Re: [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Rafael J. Wysocki
On Thursday, May 07, 2015 11:17:21 PM Preeti U Murthy wrote:
> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU.  This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
> 
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state in which ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
> 
> Signed-off-by: Preeti U Murthy 
> ---
> Changes from V1: https://lkml.org/lkml/2015/5/7/24
> Rebased on the latest linux-pm/bleeding-edge
> 
>  drivers/cpuidle/cpuidle.c  |   21 +
>  drivers/cpuidle/governors/ladder.c |   13 ++---
>  drivers/cpuidle/governors/menu.c   |6 +-
>  include/linux/cpuidle.h|6 +++---
>  include/linux/sched.h  |   16 
>  kernel/sched/core.c|   17 +
>  kernel/sched/fair.c|2 +-
>  kernel/sched/idle.c|8 +---
>  kernel/sched/sched.h   |   24 
>  9 files changed, 70 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 8c24f95..b7e86f4 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "cpuidle.h"
> @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>* CPU as a broadcast timer, this call may fail if it is not available.
>*/
>   if (broadcast && tick_broadcast_enter()) {
> - default_idle_call();
> - return -EBUSY;
> + index = cpuidle_select(drv, dev, !broadcast);

No, you can't do that.

This code path may be used by suspend-to-idle and that should not call
cpuidle_select().

What's needed here seems to be a fallback mechanism like "choose the
deepest state shallower than X and such that it won't stop the tick".
You don't really need to run a full governor for that.
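
Something along these lines, as a rough standalone sketch. The toy_state
structure, TOY_FLAG_TIMER_STOP and toy_find_tick_safe_state() are invented
for illustration and are not existing cpuidle interfaces; states are assumed
to be ordered from shallowest to deepest, and "X" is passed as the limit
index.

```c
#include <stdbool.h>

/* Hypothetical flag, playing the role of CPUIDLE_FLAG_TIMER_STOP. */
#define TOY_FLAG_TIMER_STOP 0x1u

/* Hypothetical, cut-down idle state description. */
struct toy_state {
    unsigned int flags;
    bool disabled;
};

/*
 * Return the index of the deepest enabled state shallower than 'limit'
 * that keeps the local tick running, or -1 if there is none.
 */
static int toy_find_tick_safe_state(const struct toy_state *states, int limit)
{
    int i;

    for (i = limit - 1; i >= 0; i--) {
        if (states[i].disabled)
            continue;
        if (states[i].flags & TOY_FLAG_TIMER_STOP)
            continue;
        return i;
    }
    return -1;
}
```

This is a plain downward scan with no governor state involved, which is why
it would also be usable from a path such as suspend-to-idle.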


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020

2015-05-07 Thread Oleksandr G Zhadan

On 05/07/2015 02:18 PM, Scott Wood wrote:
> On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote:
>> Hi Scott,
>>
>> Thanks for fast response, please see inline.
>>
>> On 05/06/2015 11:22 PM, Scott Wood wrote:
>>> On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote:
>>>> +-
>>>> +
>>>> +P1020 SPI controller
>>>> +
>>>> +Properties:
>>>> +- compatible:  "spansion,s25fl008k", "winbond,w25q80bl"
>>>> +
>>>> +Example:
>>>> +   spi@7000 {
>>>> +   flash@0 {
>>>> +   #address-cells = <1>;
>>>> +   #size-cells = <1>;
>>>> +   compatible = "spansion,s25fl008k", "winbond,w25q80bl";
>>>> +   reg = <0>;
>>>> +   spi-max-frequency = <4000>; /* input clock */
>>>> +   ...
>>>> +   };
>>>
>>> This isn't describing the controller, but rather a SPI chip attached to
>>> the controller.  This also doesn't seem like the right place for random
>>> SPI chips.
>>>
>>> If all you're specifying is the compatible, maybe create a
>>> spi/trivial-devices.txt similar to i2c/trivial-devices.txt?  Or
>>> something specific to SPI flash chips to describe the partition
>>> specification, though I generally recommend against describing
>>> partitions in the device tree -- especially if this is a developer board
>>> rather than something fixed-purpose where the partitioning is not going
>>> to change based on user requirements.
>>
>> Mostly in all Documentation/devicetree/bindings/ I tried to satisfy
>> checkpatch script as simple as possible. And for me as well it looks
>> reasonable to create spi/trivial-devices.txt file and I will.
>
> Checkpatch is a tool, not a dictator.  Sometimes it gets things wrong.
>
> Also, please CC devicet...@vger.kernel.org when adding bindings or
> modifying dts files.

OK, got it.


>>>> +-
>>>> +
>>>> +Chipselect/Local Bus
>>>> +
>>>> +Properties:
>>>> +- #address-cells:  <2>.
>>>> +- #size-cells: <1>.
>>>> +- compatible:  "fsl,p1020-elbc", "fsl,elbc", 
>>>> "simple-bus","fsl,p1020-immr"
>>>> +- interrupts:  interrupts to report localbus events.
>>>> +
>>>> +Example:
>>>> +
>>>> +&lbc {
>>>> +   #address-cells = <2>;
>>>> +   #size-cells = <1>;
>>>> +   compatible = "fsl,p1020-elbc", "fsl,elbc", "simple-bus";
>>>> +   interrupts = <19 2 0 0>;
>>>> +};
>>>
>>> There's already a binding for elbc -- and the elbc node certainly should
>>> not claim compatibility with "fsl,p1020-immr".
>>
>> to satisfy checkpatch script.
>
> Even if that were necessary, why do it by copy-and-paste, and why put
> the immr compatible in the binding for a different node?

Will fix.




>>>> diff --git a/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi 
>>>> b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
>>>> new file mode 100644
>>>> index 000..930a6e3
>>>> --- /dev/null
>>>> +++ b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
>>>
>>> Why can't you use p1020si-post.dtsi?  The "si" means "silicon" -- it's
>>> meant to be included by all p1020 boards.
>>
>> Yes, silicon is the same, but p1020 boards using all 3 etsec ethernet
>> controllers. Our board using only 2: etsec1 and etsec3.
>
> So have your board write status = "disabled" into the etsec2 node after
> including the post file.

A while ago I created ucp1020som-post.dtsi to make it work as we need. I
tried a few things, but I don't remember exactly what; the result was
always eth0,eth1,eth2 or eth0,eth2 and never eth0,eth1.

I'll retry status="disabled".




>>>> diff --git a/arch/powerpc/configs/ucp1020_defconfig 
>>>> b/arch/powerpc/configs/ucp1020_defconfig
>>>> new file mode 100644
>>>> index 000..62f99aa
>>>> --- /dev/null
>>>> +++ b/arch/powerpc/configs/ucp1020_defconfig
>>>
>>> Please explain why your board needs its own defconfig.
>>
>> Because, it's our own board and it has some specific to board
>> definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product
>> definitions.
>>
>> If I can do it in some other way could you please give me some example
>> if it's possible.
>
> I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream.
> Could you list what you need to be set that mpc85xx_smp_defconfig
> doesn't set?

I diffed "mpc85xx_smp_defconfig" against "ucp1020_defconfig" (after make
savedefconfig), and there are some differences:

- mpc85xx_smp_defconfig has:
CONFIG_PHYS_64BIT=y
CONFIG_NR_CPUS=8

- it enables almost all boards to build. What for?

- it has MTD-related differences (it doesn't enable the SPI flash support
we need):

-CONFIG_MTD_M25P80=y
-CONFIG_MTD_SST25L=y

- It includes some PHY support, but not the PHY we are using, and we need
to include Intel wifi support:

-CONFIG_MICREL_PHY=y
-CONFIG_IWLWIFI=y

- It doesn't enable EXT4 fs support.
Etc...

You can see it yourself below:

--- defconfig   2015-05-07 14:48:12.0 -0400
+++ mpc85xx_smp_defconfig   2015-05-01 18:45:03.0 -0400
@@ -1,51 +1,58 @@
 CONFIG_PPC_85xx=y
+CONFIG_PHYS_64BIT=y
 CONFIG_SMP=y
-CONFIG_NR_CPUS=2
-CONFIG_CROSS_COMPILE="powerpc-linux-"
-# CONFIG_LOCALVERSION_AUTO is not set
-CONFIG_DEFAULT_HOSTNAME="uCP1020"
-# CONFIG_SWAP is not set
+CONFIG_NR_CPUS=8
 CONFIG_S

Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020

2015-05-07 Thread Scott Wood
On Thu, 2015-05-07 at 12:31 -0400, Oleksandr G Zhadan wrote:
> Hi Scott,
> 
> Thanks for fast response, please see inline.
> 
> On 05/06/2015 11:22 PM, Scott Wood wrote:
> > On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote:
> >> +-
> >> +
> >> +P1020 SPI controller
> >> +
> >> +Properties:
> >> +- compatible: "spansion,s25fl008k", "winbond,w25q80bl"
> >> +
> >> +Example:
> >> +  spi@7000 {
> >> +  flash@0 {
> >> +  #address-cells = <1>;
> >> +  #size-cells = <1>;
> >> +  compatible = "spansion,s25fl008k", "winbond,w25q80bl";
> >> +  reg = <0>;
> >> +  spi-max-frequency = <4000>; /* input clock */
> >> +  ...
> >> +  };
> >
> > This isn't describing the controller, but rather a SPI chip attached to
> > the controller.  This also doesn't seem like the right place for random
> > SPI chips.
> >
> > If all you're specifying is the compatible, maybe create a
> > spi/trivial-devices.txt similar to i2c/trivial-devices.txt?  Or
> > something specific to SPI flash chips to describe the partition
> > specification, though I generally recommend against describing
> > partitions in the device tree -- especially if this is a developer board
> > rather than something fixed-purpose where the partitioning is not going
> > to change based on user requirements.
> >
> >
> 
> Mostly in all Documentation/devicetree/bindings/ I tried to satisfy 
> checkpatch script as simple as possible. And for me as well it looks 
> reasonable to create spi/trivial-devices.txt file and I will.

Checkpatch is a tool, not a dictator.  Sometimes it gets things wrong.

Also, please CC devicet...@vger.kernel.org when adding bindings or
modifying dts files.

> >> +-
> >> +
> >> +Chipselect/Local Bus
> >> +
> >> +Properties:
> >> +- #address-cells: <2>.
> >> +- #size-cells:<1>.
> >> +- compatible: "fsl,p1020-elbc", "fsl,elbc", 
> >> "simple-bus","fsl,p1020-immr"
> >> +- interrupts: interrupts to report localbus events.
> >> +
> >> +Example:
> >> +
> >> +&lbc {
> >> +  #address-cells = <2>;
> >> +  #size-cells = <1>;
> >> +  compatible = "fsl,p1020-elbc", "fsl,elbc", "simple-bus";
> >> +  interrupts = <19 2 0 0>;
> >> +};
> >
> > There's already a binding for elbc -- and the elbc node certainly should
> > not claim compatibility with "fsl,p1020-immr".
> >
> >
> 
> to satisfy checkpatch script.

Even if that were necessary, why do it by copy-and-paste, and why put
the immr compatible in the binding for a different node?

> >> diff --git a/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi 
> >> b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
> >> new file mode 100644
> >> index 000..930a6e3
> >> --- /dev/null
> >> +++ b/arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
> >
> > Why can't you use p1020si-post.dtsi?  The "si" means "silicon" -- it's
> > meant to be included by all p1020 boards.
> >
> 
> Yes, silicon is the same, but p1020 boards using all 3 etsec ethernet 
> controllers. Our board using only 2: etsec1 and etsec3.

So have your board write status = "disabled" into the etsec2 node after
including the post file.

> >> diff --git a/arch/powerpc/configs/ucp1020_defconfig 
> >> b/arch/powerpc/configs/ucp1020_defconfig
> >> new file mode 100644
> >> index 000..62f99aa
> >> --- /dev/null
> >> +++ b/arch/powerpc/configs/ucp1020_defconfig
> >
> > Please explain why your board needs its own defconfig.
> >
> 
> Because, it's our own board and it has some specific to board 
> definitions like CONFIG_DEFAULT_HOSTNAME and some specific to product 
> definitions.
> 
> If I can do it in some other way could you please give me some example 
> if it's possible.

I don't think stuff like CONFIG_DEFAULT_HOSTNAME belongs upstream.
Could you list what you need to be set that mpc85xx_smp_defconfig
doesn't set?

-Scott



Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 11:09 PM, Preeti U Murthy wrote:
> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU.  This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
> 
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state in which ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
> 
> Signed-off-by: Preeti U Murthy 
> ---
> Rebased on the latest linux-pm/bleeding-edge

Kindly ignore this. I have sent the rebase as V2.
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

The below patch is not updated.

Regards
Preeti U Murthy
> 
>  drivers/cpuidle/cpuidle.c  |   21 +
>  drivers/cpuidle/governors/ladder.c |   13 ++---
>  drivers/cpuidle/governors/menu.c   |6 +-
>  include/linux/cpuidle.h|6 +++---
>  include/linux/sched.h  |   16 
>  kernel/sched/core.c|   17 +
>  kernel/sched/fair.c|2 +-
>  kernel/sched/idle.c|8 +---
>  kernel/sched/sched.h   |   24 
>  9 files changed, 70 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 61c417b..8f5657e 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "cpuidle.h"
> @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>* local timer will be shut down.  If a local timer is used from another
>* CPU as a broadcast timer, this call may fail if it is not available.
>*/
> - if (broadcast && tick_broadcast_enter())
> - return -EBUSY;
> + if (broadcast && tick_broadcast_enter()) {
> + index = cpuidle_select(drv, dev, !broadcast);
> + if (index < 0)
> + return -EBUSY;
> + target_state = &drv->states[index];
> + }
> +
> + /* Take note of the planned idle state. */
> + idle_set_state(smp_processor_id(), target_state);
> 
>   trace_cpu_idle_rcuidle(index, dev->cpu);
>   time_start = ktime_get();
> @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>   time_end = ktime_get();
>   trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> 
> + /* The cpu is no longer idle or about to enter idle. */
> + idle_set_state(smp_processor_id(), NULL);
> +
>   if (broadcast) {
>   if (WARN_ON_ONCE(!irqs_disabled()))
>   local_irq_disable();
> @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>   *
>   * @drv: the cpuidle driver
>   * @dev: the cpuidle device
> + * @timer_stop_valid: allow selection of idle state where tick stops
>   *
>   * Returns the index of the idle state.
>   */
> -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> +int cpuidle_select(struct cpuidle_driver *drv,
> + struct cpuidle_device *dev, int timer_stop_valid)
>  {
> - return cpuidle_curr_governor->select(drv, dev);
> + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
>  }
> 
>  /**
> diff --git a/drivers/cpuidle/governors/ladder.c 
> b/drivers/cpuidle/governors/ladder.c
> index 401c010..c437322 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct 
> ladder_device *ldev,
>   * ladder_select_state - selects the next state to enter
>   * @drv: cpuidle driver
>   * @dev: the CPU
> + * @timer_stop_valid: allow selection of idle state where tick stops
>   */
>  static int ladder_select_state(struct cpuidle_driver *drv,
> - struct cpuidle_device *dev)
> + struct cpuidle_device *dev, int 
> timer_stop_valid)
>  {
>   struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
>   struct ladder_device_state *last_state;
> @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
>   !drv->states[last_idx + 1].disabled &&
>   !dev->states_usage[last_idx + 1].disable &&
>   last_residency > last_state->threshold.promotion_time &&
> + !(!timer_stop_valid && (drv->states[last_idx + 1].flags & 
> CPUIDLE_FLAG_TIMER_STOP)) &&
>   drv->states[last_idx + 1].exit_latency 

[PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg

2015-05-07 Thread Dan Streetman
Add crypto compression alg for 842 hardware compression and decompression,
using the alg name "842" and driver_name "842-nx".

This uses only the PowerPC coprocessor hardware for 842 compression.  It
also uses the hardware for decompression, but if the hardware fails it will
fall back to the 842 software decompression library, so that decompression
never fails (for valid 842 compressed buffers).  A header must be used in
most cases, due to the hardware's restrictions on the buffers being
specifically aligned and sized.

Due to the header this driver adds, compressed buffers it creates cannot be
directly passed to the 842 software library for decompression.  However,
compressed buffers created by the software 842 library can be passed to
this driver for hardware 842 decompression (with the exception of buffers
containing the "short data" template, as lib/842/842.h explains).

Signed-off-by: Dan Streetman 
---
 drivers/crypto/nx/Kconfig |  10 +
 drivers/crypto/nx/Makefile|   2 +
 drivers/crypto/nx/nx-842-crypto.c | 585 ++
 3 files changed, 597 insertions(+)
 create mode 100644 drivers/crypto/nx/nx-842-crypto.c

diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index ee9e259..3e621ad 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -50,4 +50,14 @@ config CRYPTO_DEV_NX_COMPRESS_POWERNV
  algorithm.  This supports NX hardware on the PowerNV platform.
  If you choose 'M' here, this module will be called 
nx_compress_powernv.
 
+config CRYPTO_DEV_NX_COMPRESS_CRYPTO
+   tristate "Compression acceleration cryptographic interface"
+   select CRYPTO_ALGAPI
+   select 842_DECOMPRESS
+   default y
+   help
+ Support for PowerPC Nest (NX) accelerators using the cryptographic
+ API.  If you choose 'M' here, this module will be called
+ nx_compress_crypto.
+
 endif
diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index 6619787..868b5e6 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -13,6 +13,8 @@ nx-crypto-objs := nx.o \
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o
+obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_CRYPTO) += nx-compress-crypto.o
 nx-compress-objs := nx-842.o
 nx-compress-pseries-objs := nx-842-pseries.o
 nx-compress-powernv-objs := nx-842-powernv.o
+nx-compress-crypto-objs := nx-842-crypto.o
diff --git a/drivers/crypto/nx/nx-842-crypto.c 
b/drivers/crypto/nx/nx-842-crypto.c
new file mode 100644
index 000..cb177c3
--- /dev/null
+++ b/drivers/crypto/nx/nx-842-crypto.c
@@ -0,0 +1,585 @@
+/*
+ * Cryptographic API for the NX-842 hardware compression.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) IBM Corporation, 2011-2015
+ *
+ * Original Authors: Robert Jennings 
+ *   Seth Jennings 
+ *
+ * Rewrite: Dan Streetman 
+ *
+ * This is an interface to the NX-842 compression hardware in PowerPC
+ * processors.  Most of the complexity of this driver is due to the fact that
+ * the NX-842 compression hardware requires the input and output data buffers
+ * to be specifically aligned, to be a specific multiple in length, and within
+ * specific minimum and maximum lengths.  Those restrictions, provided by the
+ * nx-842 driver via nx842_constraints, mean this driver must use bounce
+ * buffers and headers to correct misaligned in or out buffers, and to split
+ * input buffers that are too large.
+ *
+ * This driver will fall back to software decompression if the hardware
+ * decompression fails, so this driver's decompression should never fail as
+ * long as the provided compressed buffer is valid.  Any compressed buffer
+ * created by this driver will have a header (except ones where the input
+ * perfectly matches the constraints); so users of this driver cannot simply
+ * pass a compressed buffer created by this driver over to the 842 software
+ * decompression library.  Instead, users must use this driver to decompress;
+ * if the hardware fails or is unavailable, the compressed buffer will be
+ * parsed and the header removed, and the raw 842 buffer(s) passed to the 842
+ * software decompression library.
+ *
+ * This does not fall back to software compression, however, since the caller
+ * of this function is specifically requesting hardware compression; if the
+ * hardware com

[PATCH 09/10] drivers/crypto/nx: simplify pSeries nx842 driver

2015-05-07 Thread Dan Streetman
Simplify the pSeries NX-842 driver: do not expect incoming buffers to be
exactly page-sized; do not break up input buffers to compress smaller
blocks; do not use any internal headers in the compressed data blocks;
remove the software decompression implementation; implement the pSeries
nx842_constraints.

This changes the pSeries NX-842 driver to perform constraints-based
compression so that it only needs to compress one entire input block at a
time.  This removes the need for it to split input data blocks into
multiple compressed data sections in the output buffer, and removes the
need for any extra header info in the compressed data; all that is moved
(in a later patch) into the main crypto 842 driver.  Additionally, the
842 software decompression implementation is no longer needed here, as
the crypto 842 driver will use the generic software 842 decompression
function as a fallback if any hardware 842 driver fails.
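
As a rough userspace illustration of what constraints-based checking means
here (it mirrors the shape of check_constraints() in the diff below, but
toy_constraints and toy_check() are made-up names, -1 stands in for -EINVAL,
and alignment and multiple are assumed to be powers of two): an input buffer
must satisfy every constraint exactly, while an output buffer's length is
simply trimmed, because using less of the output buffer is harmless.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's constraints structure. */
struct toy_constraints {
    unsigned int alignment;  /* required buffer alignment, power of two */
    unsigned int multiple;   /* length must be a multiple of this */
    unsigned int minimum;    /* smallest usable length */
    unsigned int maximum;    /* largest usable length */
};

/*
 * Return 0 if the buffer can be handed to the hardware, -1 otherwise.
 * For output buffers (in == false), *len may be reduced to fit.
 */
static int toy_check(const struct toy_constraints *c, uintptr_t buf,
                     unsigned int *len, bool in)
{
    if (buf & (c->alignment - 1))
        return -1;                      /* misaligned: never fixable here */

    if (*len % c->multiple) {
        if (in)
            return -1;                  /* cannot drop input bytes */
        *len -= *len % c->multiple;     /* round output length down */
    }

    if (*len < c->minimum)
        return -1;

    if (*len > c->maximum) {
        if (in)
            return -1;                  /* caller must split the input */
        *len = c->maximum;              /* cap output length */
    }
    return 0;
}
```

For example, with alignment 128, multiple 8, minimum 8 and maximum 4096
(values chosen only for this example), a 100-byte output buffer is accepted
with its usable length reduced to 96, while a 100-byte input buffer is
rejected.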

Signed-off-by: Dan Streetman 
---
 drivers/crypto/nx/nx-842-pseries.c | 779 -
 1 file changed, 153 insertions(+), 626 deletions(-)

diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-842-pseries.c
index 6db9992..85837e9 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -21,7 +21,6 @@
  *  Seth Jennings 
  */
 
-#include 
 #include 
 
 #include "nx-842.h"
@@ -32,11 +31,6 @@ MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Robert Jennings ");
 MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors");
 
-#define SHIFT_4K 12
-#define SHIFT_64K 16
-#define SIZE_4K (1UL << SHIFT_4K)
-#define SIZE_64K (1UL << SHIFT_64K)
-
 /* IO buffer must be 128 byte aligned */
 #define IO_BUFFER_ALIGN 128
 
@@ -47,18 +41,52 @@ static struct nx842_constraints nx842_pseries_constraints = {
.maximum =  PAGE_SIZE, /* dynamic, max_sync_size */
 };
 
-struct nx842_header {
-   int blocks_nr; /* number of compressed blocks */
-   int offset; /* offset of the first block (from beginning of header) */
-   int sizes[0]; /* size of compressed blocks */
-};
-
-static inline int nx842_header_size(const struct nx842_header *hdr)
+static int check_constraints(unsigned long buf, unsigned int *len, bool in)
 {
-   return sizeof(struct nx842_header) +
-   hdr->blocks_nr * sizeof(hdr->sizes[0]);
+   if (!IS_ALIGNED(buf, nx842_pseries_constraints.alignment)) {
+   pr_debug("%s buffer 0x%lx not aligned to 0x%x\n",
+in ? "input" : "output", buf,
+nx842_pseries_constraints.alignment);
+   return -EINVAL;
+   }
+   if (*len % nx842_pseries_constraints.multiple) {
+   pr_debug("%s buffer len 0x%x not multiple of 0x%x\n",
+in ? "input" : "output", *len,
+nx842_pseries_constraints.multiple);
+   if (in)
+   return -EINVAL;
+   *len = round_down(*len, nx842_pseries_constraints.multiple);
+   }
+   if (*len < nx842_pseries_constraints.minimum) {
+   pr_debug("%s buffer len 0x%x under minimum 0x%x\n",
+in ? "input" : "output", *len,
+nx842_pseries_constraints.minimum);
+   return -EINVAL;
+   }
+   if (*len > nx842_pseries_constraints.maximum) {
+   pr_debug("%s buffer len 0x%x over maximum 0x%x\n",
+in ? "input" : "output", *len,
+nx842_pseries_constraints.maximum);
+   if (in)
+   return -EINVAL;
+   *len = nx842_pseries_constraints.maximum;
+   }
+   return 0;
 }
 
+/* I assume we need to align the CSB? */
+#define WORKMEM_ALIGN  (256)
+
+struct nx842_workmem {
+   /* scatterlist */
+   char slin[4096];
+   char slout[4096];
+   /* coprocessor status/parameter block */
+   struct nx_csbcpb csbcpb;
+
+   char padding[WORKMEM_ALIGN];
+} __aligned(WORKMEM_ALIGN);
+
 /* Macros for fields within nx_csbcpb */
 /* Check the valid bit within the csbcpb valid field */
 #define NX842_CSBCBP_VALID_CHK(x) (x & BIT_MASK(7))
@@ -72,8 +100,7 @@ static inline int nx842_header_size(const struct nx842_header *hdr)
 #define NX842_CSBCPB_CE2(x)(x & BIT_MASK(5))
 
 /* The NX unit accepts data only on 4K page boundaries */
-#define NX842_HW_PAGE_SHIFTSHIFT_4K
-#define NX842_HW_PAGE_SIZE (ASM_CONST(1) << NX842_HW_PAGE_SHIFT)
+#define NX842_HW_PAGE_SIZE (4096)
 #define NX842_HW_PAGE_MASK (~(NX842_HW_PAGE_SIZE-1))
 
 enum nx842_status {
@@ -194,41 +221,6 @@ static int nx842_build_scatterlist(unsigned long buf, int len,
return 0;
 }
 
-/*
- * Working memory for software decompression
- */
-struct sw842_fifo {
-   union {
-   char f8[256][8];
-   char f4[512][4];
-   };
-   char f2[256][2];
-   unsigned char f84_full;
-   unsigned

[PATCH 08/10] drivers/crypto/nx: add PowerNV platform NX-842 driver

2015-05-07 Thread Dan Streetman
Add driver for NX-842 hardware on the PowerNV platform.

This allows the use of the 842 compression hardware coprocessor on
the PowerNV platform.
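
The struct nx842_workmem added below ends with a padding[WORKMEM_ALIGN]
member marked "unused, to allow alignment".  A minimal sketch of the idea,
assuming the caller's buffer may start at any address: round the pointer up
to the alignment and let the padding absorb the slack.  SKETCH_ALIGN,
struct workmem_sketch and workmem_align() are hypothetical names, and 128
is an arbitrary power of two, not the value of CRB_ALIGN.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALIGN 128u	/* hypothetical; stands in for WORKMEM_ALIGN */

struct workmem_sketch {
	unsigned char payload[256];		/* fields that need alignment */
	unsigned char padding[SKETCH_ALIGN];	/* slack consumed by rounding up */
};

/* Round an arbitrary pointer up to the next SKETCH_ALIGN boundary. */
static struct workmem_sketch *workmem_align(void *raw)
{
	uintptr_t addr = (uintptr_t)raw;

	addr = (addr + SKETCH_ALIGN - 1) & ~(uintptr_t)(SKETCH_ALIGN - 1);
	return (struct workmem_sketch *)addr;
}
```

Rounding up moves the start by at most SKETCH_ALIGN - 1 bytes, so the
aligned payload still fits inside a buffer of sizeof(struct workmem_sketch).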

Signed-off-by: Dan Streetman 
---
 drivers/crypto/nx/Kconfig  |  10 +
 drivers/crypto/nx/Makefile |   2 +
 drivers/crypto/nx/nx-842-powernv.c | 625 +
 drivers/crypto/nx/nx-842-pseries.c |   9 -
 drivers/crypto/nx/nx-842.c |   4 +-
 drivers/crypto/nx/nx-842.h |  97 ++
 include/linux/nx842.h  |   6 +-
 7 files changed, 741 insertions(+), 12 deletions(-)
 create mode 100644 drivers/crypto/nx/nx-842-powernv.c

diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index 34013f7..ee9e259 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -40,4 +40,14 @@ config CRYPTO_DEV_NX_COMPRESS_PSERIES
  algorithm.  This supports NX hardware on the pSeries platform.
  If you choose 'M' here, this module will be called nx_compress_pseries.
 
+config CRYPTO_DEV_NX_COMPRESS_POWERNV
+   tristate "Compression acceleration support on PowerNV platform"
+   depends on PPC_POWERNV
+   default y
+   help
+ Support for PowerPC Nest (NX) compression acceleration. This
+ module supports acceleration for compressing memory with the 842
+ algorithm.  This supports NX hardware on the PowerNV platform.
+ If you choose 'M' here, this module will be called nx_compress_powernv.
+
 endif
diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index 5d9f4bc..6619787 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -12,5 +12,7 @@ nx-crypto-objs := nx.o \
 
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o
+obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o
 nx-compress-objs := nx-842.o
 nx-compress-pseries-objs := nx-842-pseries.o
+nx-compress-powernv-objs := nx-842-powernv.o
diff --git a/drivers/crypto/nx/nx-842-powernv.c b/drivers/crypto/nx/nx-842-powernv.c
new file mode 100644
index 000..6a9fb8b
--- /dev/null
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -0,0 +1,625 @@
+/*
+ * Driver for IBM PowerNV 842 compression accelerator
+ *
+ * Copyright (C) 2015 Dan Streetman, IBM Corp
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include "nx-842.h"
+
+#include 
+
+#include 
+#include 
+
+#define MODULE_NAME NX842_POWERNV_MODULE_NAME
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Dan Streetman ");
+MODULE_DESCRIPTION("842 H/W Compression driver for IBM PowerNV processors");
+
+#define WORKMEM_ALIGN  (CRB_ALIGN)
+#define CSB_WAIT_MAX   (5000) /* ms */
+
+struct nx842_workmem {
+   /* Below fields must be properly aligned */
+   struct coprocessor_request_block crb; /* CRB_ALIGN align */
+   struct data_descriptor_entry ddl_in[DDL_LEN_MAX]; /* DDE_ALIGN align */
+   struct data_descriptor_entry ddl_out[DDL_LEN_MAX]; /* DDE_ALIGN align */
+   /* Above fields must be properly aligned */
+
+   ktime_t start;
+
+   char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
+} __packed __aligned(WORKMEM_ALIGN);
+
+struct nx842_coproc {
+   unsigned int chip_id;
+   unsigned int ct;
+   unsigned int ci;
+   struct list_head list;
+};
+
+/* no cpu hotplug on powernv, so this list never changes after init */
+static LIST_HEAD(nx842_coprocs);
+static unsigned int nx842_ct;
+
+/**
+ * setup_indirect_dde - Setup an indirect DDE
+ *
+ * The DDE is set up with the DDE count, byte count, and address of
+ * first direct DDE in the list.
+ */
+static void setup_indirect_dde(struct data_descriptor_entry *dde,
+  struct data_descriptor_entry *ddl,
+  unsigned int dde_count, unsigned int byte_count)
+{
+   dde->flags = 0;
+   dde->count = dde_count;
+   dde->index = 0;
+   dde->length = cpu_to_be32(byte_count);
+   dde->address = cpu_to_be64(nx842_get_pa(ddl));
+}
+
+/**
+ * setup_direct_dde - Setup single DDE from buffer
+ *
+ * The DDE is setup with the buffer and length.  The buffer must be properly
+ * aligned.  The used length is returned.
+ * Returns:
+ *   N    Successfully set up DDE with N bytes
+ */
+static unsigned int setup_direct_dde(struct data_descriptor_entry *dde,
+unsigned long pa, unsigned int len)
+{
+   unsigned int l = 

[PATCH 07/10] drivers/crypto/nx: add nx842 constraints

2015-05-07 Thread Dan Streetman
Add "constraints" for the NX-842 driver.  The constraints are used to
indicate what the current NX-842 platform driver is capable of.  The
constraints tell the NX-842 user what alignment, min and max length, and
length multiple each provided buffer should conform to.  These are
required because the 842 hardware requires buffers to meet specific
constraints that vary based on platform - for example, the pSeries
max length is much lower than the PowerNV max length.
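
A minimal sketch of how a caller might test a buffer against these
constraints, assuming the alignment is a power of two and the multiple is
non-zero.  The struct mirrors the four fields added to include/linux/nx842.h
in this patch; the names constraints_sketch and buffer_meets_constraints()
are hypothetical and not part of the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the four fields of struct nx842_constraints from this patch. */
struct constraints_sketch {
	int alignment;
	int multiple;
	int minimum;
	int maximum;
};

/*
 * Hypothetical helper: true when buf/len satisfy every constraint.
 * Assumes alignment is a power of two and multiple is non-zero.
 */
static bool buffer_meets_constraints(const struct constraints_sketch *c,
				     const void *buf, unsigned int len)
{
	if ((uintptr_t)buf & (uintptr_t)(c->alignment - 1))
		return false;	/* start address not aligned */
	if (len % (unsigned int)c->multiple)
		return false;	/* length not a multiple */
	if (len < (unsigned int)c->minimum)
		return false;	/* too short */
	if (len > (unsigned int)c->maximum)
		return false;	/* too long */
	return true;
}
```

A caller that gets false here could copy into a bounce buffer that does
meet the constraints, or simply skip the hardware path.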

Signed-off-by: Dan Streetman 
---
 drivers/crypto/nx/nx-842-pseries.c | 10 ++
 drivers/crypto/nx/nx-842.c | 38 ++
 drivers/crypto/nx/nx-842.h |  2 ++
 include/linux/nx842.h  |  9 +
 4 files changed, 59 insertions(+)

diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c
index 9b83c9e..cb481d8 100644
--- a/drivers/crypto/nx/nx-842-pseries.c
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -40,6 +40,13 @@ MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors");
 /* IO buffer must be 128 byte aligned */
 #define IO_BUFFER_ALIGN 128
 
+static struct nx842_constraints nx842_pseries_constraints = {
+   .alignment =IO_BUFFER_ALIGN,
+   .multiple = DDE_BUFFER_LAST_MULT,
+   .minimum =  IO_BUFFER_ALIGN,
+   .maximum =  PAGE_SIZE, /* dynamic, max_sync_size */
+};
+
 struct nx842_header {
int blocks_nr; /* number of compressed blocks */
int offset; /* offset of the first block (from beginning of header) */
@@ -842,6 +849,8 @@ static int nx842_OF_upd_maxsyncop(struct nx842_devdata *devdata,
goto out;
}
 
+   nx842_pseries_constraints.maximum = devdata->max_sync_size;
+
devdata->max_sync_sg = (unsigned int)min(maxsynccop->comp_sg_limit,
maxsynccop->decomp_sg_limit);
if (devdata->max_sync_sg < 1) {
@@ -1115,6 +1124,7 @@ static struct attribute_group nx842_attribute_group = {
 
 static struct nx842_driver nx842_pseries_driver = {
.owner =THIS_MODULE,
+   .constraints =  &nx842_pseries_constraints,
.compress = nx842_pseries_compress,
.decompress =   nx842_pseries_decompress,
 };
diff --git a/drivers/crypto/nx/nx-842.c b/drivers/crypto/nx/nx-842.c
index f1f378e..160fe2d 100644
--- a/drivers/crypto/nx/nx-842.c
+++ b/drivers/crypto/nx/nx-842.c
@@ -86,6 +86,44 @@ static void put_driver(struct nx842_driver *driver)
module_put(driver->owner);
 }
 
+/**
+ * nx842_constraints
+ *
+ * This provides the driver's constraints.  Different nx842 implementations
+ * may have varying requirements.  The constraints are:
+ *   @alignment:   All buffers should be aligned to this
+ *   @multiple:All buffer lengths should be a multiple of this
+ *   @minimum: Buffer lengths must not be less than this amount
+ *   @maximum: Buffer lengths must not be more than this amount
+ *
+ * The constraints apply to all buffers and lengths, both input and output,
+ * for both compression and decompression, except for the minimum which
+ * only applies to compression input and decompression output; the
+ * compressed data can be less than the minimum constraint.  It can be
+ * assumed that compressed data will always adhere to the multiple
+ * constraint.
+ *
+ * The driver may succeed even if these constraints are violated;
+ * however the driver can return failure or suffer reduced performance
+ * if any constraint is not met.
+ */
+int nx842_constraints(struct nx842_constraints *c)
+{
+   struct nx842_driver *driver = get_driver();
+   int ret = 0;
+
+   if (!driver)
+   return -ENODEV;
+
+   BUG_ON(!c);
+   memcpy(c, driver->constraints, sizeof(*c));
+
+   put_driver(driver);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(nx842_constraints);
+
 int nx842_compress(const unsigned char *in, unsigned int in_len,
   unsigned char *out, unsigned int *out_len,
   void *wrkmem)
diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h
index 2a5d4e1..c6ceb0f 100644
--- a/drivers/crypto/nx/nx-842.h
+++ b/drivers/crypto/nx/nx-842.h
@@ -12,6 +12,8 @@
 struct nx842_driver {
struct module *owner;
 
+   struct nx842_constraints *constraints;
+
int (*compress)(const unsigned char *in, unsigned int in_len,
unsigned char *out, unsigned int *out_len,
void *wrkmem);
diff --git a/include/linux/nx842.h b/include/linux/nx842.h
index d919c22..aa1a97e9 100644
--- a/include/linux/nx842.h
+++ b/include/linux/nx842.h
@@ -5,6 +5,15 @@
 
 #define NX842_MEM_COMPRESS __NX842_PSERIES_MEM_COMPRESS
 
+struct nx842_constraints {
+   int alignment;
+   int multiple;
+   int minimum;
+   int maximum;
+};
+
+int nx842_constraints(struct nx842_constraints *constraints);
+
 int nx842_compress(const unsigned char *in, unsigned 

[PATCH 06/10] drivers/crypto/nx: add NX-842 platform frontend driver

2015-05-07 Thread Dan Streetman
Add NX-842 frontend that allows using either the pSeries platform or
PowerNV platform driver (to be added by later patch) for the NX-842
hardware.  Update the MAINTAINERS file to include the new filenames.
Update Kconfig files to clarify titles and descriptions, and correct
dependencies.

Signed-off-by: Dan Streetman 
---
 MAINTAINERS|   2 +-
 drivers/crypto/Kconfig |  10 +--
 drivers/crypto/nx/Kconfig  |  35 ++---
 drivers/crypto/nx/Makefile |   4 +-
 drivers/crypto/nx/nx-842-pseries.c |  57 +++
 drivers/crypto/nx/nx-842.c | 144 +
 drivers/crypto/nx/nx-842.h |  32 +
 include/linux/nx842.h  |  10 +--
 8 files changed, 245 insertions(+), 49 deletions(-)
 create mode 100644 drivers/crypto/nx/nx-842.c
 create mode 100644 drivers/crypto/nx/nx-842.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5a5c1dc..e71855f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4870,7 +4870,7 @@ F:drivers/crypto/nx/
 IBM Power 842 compression accelerator
 M: Dan Streetman 
 S: Supported
-F: drivers/crypto/nx/nx-842.c
+F: drivers/crypto/nx/nx-842*
 F: include/linux/nx842.h
 F: include/linux/sw842.h
 F: crypto/842.c
diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
index 033c0c8..872de26 100644
--- a/drivers/crypto/Kconfig
+++ b/drivers/crypto/Kconfig
@@ -312,11 +312,13 @@ config CRYPTO_DEV_S5P
  algorithms execution.
 
 config CRYPTO_DEV_NX
-   bool "Support for IBM Power7+ in-Nest cryptographic acceleration"
-   depends on PPC64 && IBMVIO && !CPU_LITTLE_ENDIAN
-   default n
+   bool "Support for IBM PowerPC Nest (NX) cryptographic acceleration"
+   depends on PPC64
help
- Support for Power7+ in-Nest cryptographic acceleration.
+ This enables support for the NX hardware cryptographic accelerator
+ coprocessor that is in IBM PowerPC P7+ or later processors.  This
+ does not actually enable any drivers, it only allows you to select
+ which acceleration type (encryption and/or compression) to enable.
 
 if CRYPTO_DEV_NX
source "drivers/crypto/nx/Kconfig"
diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index f826166..34013f7 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -1,7 +1,9 @@
+
 config CRYPTO_DEV_NX_ENCRYPT
-   tristate "Encryption acceleration support"
-   depends on PPC64 && IBMVIO
+   tristate "Encryption acceleration support on pSeries platform"
+   depends on PPC_PSERIES && IBMVIO && !CPU_LITTLE_ENDIAN
default y
+   select CRYPTO_ALGAPI
select CRYPTO_AES
select CRYPTO_CBC
select CRYPTO_ECB
@@ -12,15 +14,30 @@ config CRYPTO_DEV_NX_ENCRYPT
select CRYPTO_SHA256
select CRYPTO_SHA512
help
- Support for Power7+ in-Nest encryption acceleration. This
- module supports acceleration for AES and SHA2 algorithms. If you
- choose 'M' here, this module will be called nx_crypto.
+ Support for PowerPC Nest (NX) encryption acceleration. This
+ module supports acceleration for AES and SHA2 algorithms on
+ the pSeries platform.  If you choose 'M' here, this module
+ will be called nx_crypto.
 
 config CRYPTO_DEV_NX_COMPRESS
tristate "Compression acceleration support"
-   depends on PPC64 && IBMVIO
default y
help
- Support for Power7+ in-Nest compression acceleration. This
- module supports acceleration for AES and SHA2 algorithms. If you
- choose 'M' here, this module will be called nx_compress.
+ Support for PowerPC Nest (NX) compression acceleration. This
+ module supports acceleration for compressing memory with the 842
+ algorithm.  One of the platform drivers must be selected also.
+ If you choose 'M' here, this module will be called nx_compress.
+
+if CRYPTO_DEV_NX_COMPRESS
+
+config CRYPTO_DEV_NX_COMPRESS_PSERIES
+   tristate "Compression acceleration support on pSeries platform"
+   depends on PPC_PSERIES && IBMVIO && !CPU_LITTLE_ENDIAN
+   default y
+   help
+ Support for PowerPC Nest (NX) compression acceleration. This
+ module supports acceleration for compressing memory with the 842
+ algorithm.  This supports NX hardware on the pSeries platform.
+ If you choose 'M' here, this module will be called nx_compress_pseries.
+
+endif
diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index 8669ffa..5d9f4bc 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -11,4 +11,6 @@ nx-crypto-objs := nx.o \
  nx-sha512.o
 
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o
-nx-compress-objs := nx-842-pseries.o
+obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o
+nx-compress-objs := nx-842.o
+nx-co

[PATCH 05/10] drivers/crypto/nx: rename nx-842.c to nx-842-pseries.c

2015-05-07 Thread Dan Streetman
Move the entire NX-842 driver for the pSeries platform from the file
nx-842.c to nx-842-pseries.c.  This is required by later patches that
add NX-842 support for the PowerNV platform.

This patch does not alter the content of the pSeries NX-842 driver at
all, it only changes the filename.

Signed-off-by: Dan Streetman 
---
 drivers/crypto/nx/Makefile |2 +-
 drivers/crypto/nx/nx-842-pseries.c | 1603 
 drivers/crypto/nx/nx-842.c | 1603 
 3 files changed, 1604 insertions(+), 1604 deletions(-)
 create mode 100644 drivers/crypto/nx/nx-842-pseries.c
 delete mode 100644 drivers/crypto/nx/nx-842.c

diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index bb770ea..8669ffa 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -11,4 +11,4 @@ nx-crypto-objs := nx.o \
  nx-sha512.o
 
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS) += nx-compress.o
-nx-compress-objs := nx-842.o
+nx-compress-objs := nx-842-pseries.o
diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-842-pseries.c
new file mode 100644
index 000..887196e
--- /dev/null
+++ b/drivers/crypto/nx/nx-842-pseries.c
@@ -0,0 +1,1603 @@
+/*
+ * Driver for IBM Power 842 compression accelerator
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) IBM Corporation, 2012
+ *
+ * Authors: Robert Jennings 
+ *  Seth Jennings 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "nx_csbcpb.h" /* struct nx_csbcpb */
+
+#define MODULE_NAME "nx-compress"
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Robert Jennings ");
+MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors");
+
+#define SHIFT_4K 12
+#define SHIFT_64K 16
+#define SIZE_4K (1UL << SHIFT_4K)
+#define SIZE_64K (1UL << SHIFT_64K)
+
+/* IO buffer must be 128 byte aligned */
+#define IO_BUFFER_ALIGN 128
+
+struct nx842_header {
+   int blocks_nr; /* number of compressed blocks */
+   int offset; /* offset of the first block (from beginning of header) */
+   int sizes[0]; /* size of compressed blocks */
+};
+
+static inline int nx842_header_size(const struct nx842_header *hdr)
+{
+   return sizeof(struct nx842_header) +
+   hdr->blocks_nr * sizeof(hdr->sizes[0]);
+}
+
+/* Macros for fields within nx_csbcpb */
+/* Check the valid bit within the csbcpb valid field */
+#define NX842_CSBCBP_VALID_CHK(x) (x & BIT_MASK(7))
+
+/* CE macros operate on the completion_extension field bits in the csbcpb.
+ * CE0 0=full completion, 1=partial completion
+ * CE1 0=CE0 indicates completion, 1=termination (output may be modified)
+ * CE2 0=processed_bytes is source bytes, 1=processed_bytes is target bytes */
+#define NX842_CSBCPB_CE0(x)(x & BIT_MASK(7))
+#define NX842_CSBCPB_CE1(x)(x & BIT_MASK(6))
+#define NX842_CSBCPB_CE2(x)(x & BIT_MASK(5))
+
+/* The NX unit accepts data only on 4K page boundaries */
+#define NX842_HW_PAGE_SHIFTSHIFT_4K
+#define NX842_HW_PAGE_SIZE (ASM_CONST(1) << NX842_HW_PAGE_SHIFT)
+#define NX842_HW_PAGE_MASK (~(NX842_HW_PAGE_SIZE-1))
+
+enum nx842_status {
+   UNAVAILABLE,
+   AVAILABLE
+};
+
+struct ibm_nx842_counters {
+   atomic64_t comp_complete;
+   atomic64_t comp_failed;
+   atomic64_t decomp_complete;
+   atomic64_t decomp_failed;
+   atomic64_t swdecomp;
+   atomic64_t comp_times[32];
+   atomic64_t decomp_times[32];
+};
+
+static struct nx842_devdata {
+   struct vio_dev *vdev;
+   struct device *dev;
+   struct ibm_nx842_counters *counters;
+   unsigned int max_sg_len;
+   unsigned int max_sync_size;
+   unsigned int max_sync_sg;
+   enum nx842_status status;
+} __rcu *devdata;
+static DEFINE_SPINLOCK(devdata_mutex);
+
+#define NX842_COUNTER_INC(_x) \
+static inline void nx842_inc_##_x( \
+   const struct nx842_devdata *dev) { \
+   if (dev) \
+   atomic64_inc(&dev->counters->_x); \
+}
+NX842_COUNTER_INC(comp_complete);
+NX842_COUNTER_INC(comp_failed);
+NX842_COUNTER_INC(decomp_complete);
+NX842_COUNTER_INC(decomp_failed);
+NX842_COUNTER_INC(swdecomp);
+
+#define NX842_HIST_SLOTS 16
+
+static void ibm_nx842_incr_hist(atomic64_t *times, unsigned int t

[PATCH 04/10] crypto: change 842 alg to use software

2015-05-07 Thread Dan Streetman
Change the crypto 842 compression alg to use the software 842 compression
and decompression library.  Add the crypto driver_name as "842-generic".
Remove the fallback to LZO compression.

Previously, this crypto compression alg attempted 842 compression using
PowerPC hardware, and fell back to LZO compression and decompression if
the 842 PowerPC hardware was unavailable or failed.  This should not
fall back to any other compression method, however; users of this crypto
compression alg can fall back if desired, and transparent fallback tricks
callers into thinking they are getting 842 compression when they actually
get LZO compression - the failure of the 842 hardware should not be
transparent to the caller.

The crypto compression alg for a hardware device also should not be located
in crypto/ so this is now a software-only implementation that uses the 842
software compression/decompression library.

Signed-off-by: Dan Streetman 
---
 MAINTAINERS|   1 +
 crypto/842.c   | 174 -
 crypto/Kconfig |   7 +--
 3 files changed, 41 insertions(+), 141 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 116af01..5a5c1dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4873,6 +4873,7 @@ S:Supported
 F: drivers/crypto/nx/nx-842.c
 F: include/linux/nx842.h
 F: include/linux/sw842.h
+F: crypto/842.c
 F: lib/842/
 
 IBM Power Linux RAID adapter
diff --git a/crypto/842.c b/crypto/842.c
index b48f4f1..98e387e 100644
--- a/crypto/842.c
+++ b/crypto/842.c
@@ -1,5 +1,5 @@
 /*
- * Cryptographic API for the 842 compression algorithm.
+ * Cryptographic API for the 842 software compression algorithm.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -11,173 +11,73 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
  *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ * Copyright (C) IBM Corporation, 2011-2015
  *
- * Copyright (C) IBM Corporation, 2011
+ * Original Authors: Robert Jennings 
+ *   Seth Jennings 
  *
- * Authors: Robert Jennings 
- *  Seth Jennings 
+ * Rewrite: Dan Streetman 
+ *
+ * This is the software implementation of compression and decompression using
+ * the 842 format.  This uses the software 842 library at lib/842/ which is
+ * only a reference implementation, and is very, very slow as compared to other
+ * software compressors.  You probably do not want to use this software
+ * compression.  If you have access to the PowerPC 842 compression hardware, you
+ * want to use the 842 hardware compression interface, which is at:
+ * drivers/crypto/nx/nx-842-crypto.c
  */
 
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
-
-static int nx842_uselzo;
-
-struct nx842_ctx {
-   void *nx842_wmem; /* working memory for 842/lzo */
-};
+#include 
 
-enum nx842_crypto_type {
-   NX842_CRYPTO_TYPE_842,
-   NX842_CRYPTO_TYPE_LZO
+struct crypto842_ctx {
+   char wmem[SW842_MEM_COMPRESS];  /* working memory for compress */
 };
 
-#define NX842_SENTINEL 0xdeadbeef
-
-struct nx842_crypto_header {
-   unsigned int sentinel; /* debug */
-   enum nx842_crypto_type type;
-};
-
-static int nx842_init(struct crypto_tfm *tfm)
-{
-   struct nx842_ctx *ctx = crypto_tfm_ctx(tfm);
-   int wmemsize;
-
-   wmemsize = max_t(int, nx842_get_workmem_size(), LZO1X_MEM_COMPRESS);
-   ctx->nx842_wmem = kmalloc(wmemsize, GFP_NOFS);
-   if (!ctx->nx842_wmem)
-   return -ENOMEM;
-
-   return 0;
-}
-
-static void nx842_exit(struct crypto_tfm *tfm)
-{
-   struct nx842_ctx *ctx = crypto_tfm_ctx(tfm);
-
-   kfree(ctx->nx842_wmem);
-}
-
-static void nx842_reset_uselzo(unsigned long data)
+static int crypto842_compress(struct crypto_tfm *tfm,
+ const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen)
 {
-   nx842_uselzo = 0;
-}
-
-static DEFINE_TIMER(failover_timer, nx842_reset_uselzo, 0, 0);
-
-static int nx842_crypto_compress(struct crypto_tfm *tfm, const u8 *src,
-   unsigned int slen, u8 *dst, unsigned int *dlen)
-{
-   struct nx842_ctx *ctx = crypto_tfm_ctx(tfm);
-   struct nx842_crypto_header *hdr;
-   unsigned int tmp_len = *dlen;
-   size_t lzodlen; /* needed for lzo */
-   int err;
-
-   *dlen = 0;
-   hdr = (struct nx842_crypto_header *)dst;
-   hdr->sentinel = NX842_SENTINEL; /* debug */
-   dst += sizeof(struct nx842_crypto_header);
-   tmp_len -= sizeof(struct nx842_crypto_header);
-   lzodlen = tmp_len;
-
-   if (likely(!nx842_uselzo)) {
-   err = nx842_com

[PATCH 03/10] lib: add software 842 compression/decompression

2015-05-07 Thread Dan Streetman
Add 842-format software compression and decompression functions.
Update the MAINTAINERS 842 section to include the new files.

The 842 compression function can compress any input data into the 842
compression format.  The 842 decompression function can decompress any
standard-format 842 compressed data - specifically, either a compressed
data buffer created by the 842 software compression function, or a
compressed data buffer created by the 842 hardware compressor (located
in PowerPC coprocessors).

The 842 compressed data format is explained in the header comments.

This is used in a later patch to provide a full software 842 compression
and decompression crypto interface.
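
For reference, the reach figures quoted in the lib/842/842.h header comment
(512 bytes for I2, 2048 for I4, 2048 for I8) follow directly from the index
widths.  A small sketch that recomputes them; index_reach() is a
hypothetical helper, not a function in this patch.

```c
/*
 * Bytes addressable by an index action: an index of `bits` bits selects
 * one of 2^bits entries, and each entry is `width` bytes wide.
 */
static unsigned int index_reach(unsigned int bits, unsigned int width)
{
	return (1u << bits) * width;
}
```

With the widths from the header comment, index_reach(8, 2) covers I2,
index_reach(9, 4) covers I4 and index_reach(8, 8) covers I8.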

Signed-off-by: Dan Streetman 
---
 MAINTAINERS  |   2 +
 include/linux/sw842.h|  12 +
 lib/842/842.h| 127 ++
 lib/842/842_compress.c   | 626 +++
 lib/842/842_debugfs.h|  52 
 lib/842/842_decompress.c | 405 ++
 lib/842/Makefile |   2 +
 lib/Kconfig  |   6 +
 lib/Makefile |   2 +
 9 files changed, 1234 insertions(+)
 create mode 100644 include/linux/sw842.h
 create mode 100644 lib/842/842.h
 create mode 100644 lib/842/842_compress.c
 create mode 100644 lib/842/842_debugfs.h
 create mode 100644 lib/842/842_decompress.c
 create mode 100644 lib/842/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index 781e099..116af01 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4872,6 +4872,8 @@ M:Dan Streetman 
 S: Supported
 F: drivers/crypto/nx/nx-842.c
 F: include/linux/nx842.h
+F: include/linux/sw842.h
+F: lib/842/
 
 IBM Power Linux RAID adapter
 M: Brian King 
diff --git a/include/linux/sw842.h b/include/linux/sw842.h
new file mode 100644
index 000..109ba04
--- /dev/null
+++ b/include/linux/sw842.h
@@ -0,0 +1,12 @@
+#ifndef __SW842_H__
+#define __SW842_H__
+
+#define SW842_MEM_COMPRESS (0xf000)
+
+int sw842_compress(const u8 *src, unsigned int srclen,
+  u8 *dst, unsigned int *destlen, void *wmem);
+
+int sw842_decompress(const u8 *src, unsigned int srclen,
+u8 *dst, unsigned int *destlen);
+
+#endif
diff --git a/lib/842/842.h b/lib/842/842.h
new file mode 100644
index 000..7c20003
--- /dev/null
+++ b/lib/842/842.h
@@ -0,0 +1,127 @@
+
+#ifndef __842_H__
+#define __842_H__
+
+/* The 842 compressed format is made up of multiple blocks, each of
+ * which has the format:
+ *
+ * <template>[arg1][arg2][arg3][arg4]
+ *
+ * where there are between 0 and 4 template args, depending on the specific
+ * template operation.  For normal operations, each arg is either a specific
+ * number of data bytes to add to the output buffer, or an index pointing
+ * to a previously-written number of data bytes to copy to the output buffer.
+ *
+ * The template code is a 5-bit value.  This code indicates what to do with
+ * the following data.  Template codes from 0 to 0x19 should use the template
+ * table, the static "decomp_ops" table used in decompress.  For each template
+ * (table row), there are between 1 and 4 actions; each action corresponds to
+ * an arg following the template code bits.  Each action is either a "data"
+ * type action, or an "index" type action, and each action results in 2, 4, or 8
+ * bytes being written to the output buffer.  Each template (i.e. all actions
+ * in the table row) will add up to 8 bytes being written to the output buffer.
+ * Any row with less than 4 actions is padded with noop actions, indicated by
+ * N0 (for which there is no corresponding arg in the compressed data buffer).
+ *
+ * "Data" actions, indicated in the table by D2, D4, and D8, mean that the
+ * corresponding arg is 2, 4, or 8 bytes, respectively, in the compressed data
+ * buffer that should be copied directly to the output buffer.
+ *
+ * "Index" actions, indicated in the table by I2, I4, and I8, mean the
+ * corresponding arg is an index parameter that points to, respectively, a 2,
+ * 4, or 8 byte value already in the output buffer, that should be copied to
+ * the end of the output buffer.  Essentially, the index points to a position
+ * in a ring buffer that contains the last N bytes of output buffer data.
+ * The number of bits for each index's arg are: 8 bits for I2, 9 bits for I4,
+ * and 8 bits for I8.  Since each index points to a 2, 4, or 8 byte section,
+ * this means that I2 can reference 512 bytes ((2^8 bits = 256) * 2 bytes), I4
+ * can reference 2048 bytes ((2^9 = 512) * 4 bytes), and I8 can reference 2048
+ * bytes ((2^8 = 256) * 8 bytes).  Think of it as a kind-of ring buffer for
+ * each of I2, I4, and I8 that are updated for each byte written to the output
+ * buffer.  In this implementation, the output buffer is directly used for each
+ * index; there is no additional memory required.  Note that the index is into
+ * a ring buffer, not a sliding window; for example, if there have been 260
+ * bytes written to the output buffer, an 

[PATCH 02/10] powerpc: Add ICSWX instruction

2015-05-07 Thread Dan Streetman
Add the asm ICSWX and ICSWEPX opcodes.  Add definitions for the
Coprocessor Request structures needed to use the icswx calls to
coprocessors.  Add icswx() function to perform the ICSWX asm
using the provided Coprocessor Command Word value and
Coprocessor Request Block structure.

This is required for communication with the NX-842 coprocessor on
a PowerNV system.
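
As a hedged illustration of how the CSB definitions in this patch might be
consumed, the sketch below tests the valid bit and names the completion
codes.  The constant values are copied from the header added below;
csb_is_valid() and csb_cc_name() are hypothetical helpers, and passing the
flags byte as a bare argument is an assumption about how a caller would
read it.

```c
/* Values copied from the CSB definitions in this patch. */
#define CSB_V			(0x80)
#define CSB_CC_SUCCESS		(0)
#define CSB_CC_INVALID_ALIGN	(1)
#define CSB_CC_OPERAND_OVERLAP	(2)
#define CSB_CC_DATA_LENGTH	(3)
#define CSB_CC_TRANSLATION	(5)
#define CSB_CC_PROTECTION	(6)
#define CSB_CC_RD_EXTERNAL	(7)
#define CSB_CC_INVALID_OPERAND	(8)

/* Non-zero once the coprocessor has marked the status block valid. */
static int csb_is_valid(unsigned char flags)
{
	return (flags & CSB_V) != 0;
}

/* Hypothetical helper: human-readable name for a completion code. */
static const char *csb_cc_name(unsigned int cc)
{
	switch (cc) {
	case CSB_CC_SUCCESS:		return "success";
	case CSB_CC_INVALID_ALIGN:	return "invalid alignment";
	case CSB_CC_OPERAND_OVERLAP:	return "operand overlap";
	case CSB_CC_DATA_LENGTH:	return "data length";
	case CSB_CC_TRANSLATION:	return "translation";
	case CSB_CC_PROTECTION:		return "protection";
	case CSB_CC_RD_EXTERNAL:	return "external read";
	case CSB_CC_INVALID_OPERAND:	return "invalid operand";
	default:			return "unknown";
	}
}
```

A caller would normally check csb_is_valid() before trusting the
completion code at all.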

Signed-off-by: Dan Streetman 
---
 arch/powerpc/include/asm/icswx.h  | 184 ++
 arch/powerpc/include/asm/ppc-opcode.h |  13 +++
 2 files changed, 197 insertions(+)
 create mode 100644 arch/powerpc/include/asm/icswx.h

diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
new file mode 100644
index 0000000..9f8402b
--- /dev/null
+++ b/arch/powerpc/include/asm/icswx.h
@@ -0,0 +1,184 @@
+/*
+ * ICSWX api
+ *
+ * Copyright (C) 2015 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * This provides the Initiate Coprocessor Store Word Indexed (ICSWX)
+ * instruction.  This instruction is used to communicate with PowerPC
+ * coprocessors.  This also provides definitions of the structures used
+ * to communicate with the coprocessor.
+ *
+ * The RFC02130: Coprocessor Architecture document is the reference for
+ * everything in this file unless otherwise noted.
+ */
+#ifndef _ARCH_POWERPC_INCLUDE_ASM_ICSWX_H_
+#define _ARCH_POWERPC_INCLUDE_ASM_ICSWX_H_
+
+#include <asm/ppc-opcode.h> /* for PPC_ICSWX */
+
+/* Chapter 6.5.8 Coprocessor-Completion Block (CCB) */
+
+#define CCB_VALUE  (0x3fff)
+#define CCB_ADDRESS (0xfff8)
+#define CCB_CM (0x0007)
+#define CCB_CM0 (0x0004)
+#define CCB_CM12   (0x0003)
+
+#define CCB_CM0_ALL_COMPLETIONS (0x0)
+#define CCB_CM0_LAST_IN_CHAIN  (0x4)
+#define CCB_CM12_STORE (0x0)
+#define CCB_CM12_INTERRUPT (0x1)
+
+#define CCB_SIZE   (0x10)
+#define CCB_ALIGN  CCB_SIZE
+
+struct coprocessor_completion_block {
+   __be64 value;
+   __be64 address;
+} __packed __aligned(CCB_ALIGN);
+
+
+/* Chapter 6.5.7 Coprocessor-Status Block (CSB) */
+
+#define CSB_V  (0x80)
+#define CSB_F  (0x04)
+#define CSB_CH (0x03)
+#define CSB_CE_INCOMPLETE  (0x80)
+#define CSB_CE_TERMINATION (0x40)
+#define CSB_CE_TPBC (0x20)
+
+#define CSB_CC_SUCCESS (0)
+#define CSB_CC_INVALID_ALIGN   (1)
+#define CSB_CC_OPERAND_OVERLAP (2)
+#define CSB_CC_DATA_LENGTH (3)
+#define CSB_CC_TRANSLATION (5)
+#define CSB_CC_PROTECTION  (6)
+#define CSB_CC_RD_EXTERNAL (7)
+#define CSB_CC_INVALID_OPERAND (8)
+#define CSB_CC_PRIVILEGE   (9)
+#define CSB_CC_INTERNAL (10)
+#define CSB_CC_WR_EXTERNAL (12)
+#define CSB_CC_NOSPC   (13)
+#define CSB_CC_EXCESSIVE_DDE   (14)
+#define CSB_CC_WR_TRANSLATION  (15)
+#define CSB_CC_WR_PROTECTION   (16)
+#define CSB_CC_UNKNOWN_CODE (17)
+#define CSB_CC_ABORT   (18)
+#define CSB_CC_TRANSPORT   (20)
+#define CSB_CC_SEGMENTED_DDL   (31)
+#define CSB_CC_PROGRESS_POINT  (32)
+#define CSB_CC_DDE_OVERFLOW (33)
+#define CSB_CC_SESSION (34)
+#define CSB_CC_PROVISION   (36)
+#define CSB_CC_CHAIN   (37)
+#define CSB_CC_SEQUENCE (38)
+#define CSB_CC_HW  (39)
+
+#define CSB_SIZE   (0x10)
+#define CSB_ALIGN  CSB_SIZE
+
+struct coprocessor_status_block {
+   u8 flags;
+   u8 cs;
+   u8 cc;
+   u8 ce;
+   __be32 count;
+   __be64 address;
+} __packed __aligned(CSB_ALIGN);
+
+
+/* Chapter 6.5.10 Data-Descriptor List (DDL)
+ * each list contains one or more Data-Descriptor Entries (DDE)
+ */
+
+#define DDE_P  (0x8000)
+
+#define DDE_SIZE   (0x10)
+#define DDE_ALIGN  DDE_SIZE
+
+struct data_descriptor_entry {
+   __be16 flags;
+   u8 count;
+   u8 index;
+   __be32 length;
+   __be64 address;
+} __packed __aligned(DDE_ALIGN);
+
+
+/* Chapter 6.5.2 Coprocessor-Request Block (CRB) */
+
+#define CRB_SIZE   (0x80)
+#define CRB_ALIGN  (0x100) /* Errata: requires 256 alignment */
+
+/* Coprocessor Status Block field
+ *   ADDRESS   address of CSB
+ *   C         CCB is valid
+ *   AT        0 = addrs are virtual, 1 = addrs are phys
+ *   M         enable perf monitor
+ */
+#define CRB_CSB_ADDRESS (0xfff0)
+#define CRB_CSB_C  (0x0008)
+#define CRB_CSB_AT (0x0002)
+#define CRB_CSB_M  (0x0001)
+
+struct coprocessor_request_block {
+   __be32 ccw;
+   __be32 flags;
+   __be64 csb_addr;
+
+   
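
The CCB, CSB and DDE structures above are each declared with a 0x10 size and matching alignment. A minimal userspace sketch of the same layouts, assuming GCC or Clang attributes and C11 _Static_assert, shows how those sizes can be verified at compile time. The *_sketch names and the be16_t/be32_t/be64_t typedefs are stand-ins for illustration, not kernel types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's __be16/__be32/__be64 (assumption). */
typedef uint16_t be16_t;
typedef uint32_t be32_t;
typedef uint64_t be64_t;

struct ccb_sketch {
	be64_t value;
	be64_t address;
} __attribute__((packed, aligned(16)));

struct csb_sketch {
	uint8_t flags;
	uint8_t cs;
	uint8_t cc;
	uint8_t ce;
	be32_t count;
	be64_t address;
} __attribute__((packed, aligned(16)));

struct dde_sketch {
	be16_t flags;
	uint8_t count;
	uint8_t index;
	be32_t length;
	be64_t address;
} __attribute__((packed, aligned(16)));

/* Each block must be exactly 0x10 bytes, matching the *_SIZE defines. */
_Static_assert(sizeof(struct ccb_sketch) == 0x10, "CCB must be 16 bytes");
_Static_assert(sizeof(struct csb_sketch) == 0x10, "CSB must be 16 bytes");
_Static_assert(sizeof(struct dde_sketch) == 0x10, "DDE must be 16 bytes");
```

A check of this kind fails the build if a field is added or resized without updating the size define.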

[PATCH 01/10] powerpc: export of_get_ibm_chip_id function

2015-05-07 Thread Dan Streetman
Export the of_get_ibm_chip_id() function.  This will be used by the
PowerNV NX-842 driver.

Signed-off-by: Dan Streetman 
---
 arch/powerpc/kernel/prom.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 308c5e1..ea2cea7 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -800,6 +800,7 @@ int of_get_ibm_chip_id(struct device_node *np)
}
return -1;
 }
+EXPORT_SYMBOL(of_get_ibm_chip_id);
 
 /**
  * cpu_to_chip_id - Return the cpus chip-id
-- 
2.1.0


[PATCHv3 00/10] add 842 hw compression for PowerNV platform

2015-05-07 Thread Dan Streetman
IBM PowerPC processors starting at version P7+ contain an NX coprocessor
that provides various hw-accelerated functions, one of which is memory
compression to the IBM "842" compression format.  This NX-842 coprocessor
is already supported on the pSeries platform, by the nx-842.c driver and
the crypto compression interface at crypto/842.c.  This patch set adds
support for NX-842 on the PowerNV (Non-Virtualized) platform, as well as
adding a full software 842 compression/decompression implementation.

Quick summary of changes: the current 842 crypto compression interface uses
only the 842 hardware on pSeries platforms, and can handle only page-sized
and page-aligned uncompressed buffers.  These patches add a full software
842 implementation, change the crypto/ directory 842 interface to a
software-only implementation, add an 842 hardware crypto compression
interface that can handle buffers of any size and alignment, add a
driver for 842 hardware on PowerNV platforms, and create a common
interface for both 842 hardware platform drivers.

The existing pSeries platform NX-842 driver could not be re-used for the
PowerNV platform driver, as there are fundamentally different interfaces;
on pSeries the system hypervisor (pHyp) provides the interface and manages
communication with the coprocessor, while on PowerNV the kernel talks directly
to the coprocessor using the ICSWX instruction.  The data structures used to
describe each compression or decompression request to the coprocessor are
also different between pHyp's interface and direct communication with ICSWX.
So, different drivers for pSeries and PowerNV are required.  Adding the new
PowerNV driver but keeping the interface to the drivers the same required
adding a new common frontend interface, to which only one of the platform
drivers will connect (based on what platform the kernel is currently running
on), and moving some functionality out of the existing pSeries driver into a
more common location.
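
The "common frontend, one attached platform driver" arrangement described above can be sketched in userspace as an operations table plus a single registration slot. Every name here (platform_ops_sketch, frontend_register(), fake_powernv_ops and so on) is hypothetical and chosen only for illustration; the real interface is defined by the patches in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table that a platform driver fills in. */
struct platform_ops_sketch {
	const char *name;
	int (*compress)(const unsigned char *in, size_t inlen,
			unsigned char *out, size_t *outlen);
};

/* The frontend holds at most one attached platform driver. */
static const struct platform_ops_sketch *active_ops;

static int frontend_register(const struct platform_ops_sketch *ops)
{
	if (!ops || !ops->compress || active_ops)
		return -1;	/* invalid table, or a driver is already attached */
	active_ops = ops;
	return 0;
}

static void frontend_unregister(const struct platform_ops_sketch *ops)
{
	if (active_ops == ops)
		active_ops = NULL;
}

/* Callers use the frontend only; it forwards to whichever driver attached. */
static int frontend_compress(const unsigned char *in, size_t inlen,
			     unsigned char *out, size_t *outlen)
{
	if (!active_ops)
		return -1;
	return active_ops->compress(in, inlen, out, outlen);
}

/* Placeholder backend: copies the input unchanged instead of compressing. */
static int fake_compress(const unsigned char *in, size_t inlen,
			 unsigned char *out, size_t *outlen)
{
	if (*outlen < inlen)
		return -1;
	memcpy(out, in, inlen);
	*outlen = inlen;
	return 0;
}

static const struct platform_ops_sketch fake_powernv_ops = {
	.name = "fake-powernv",
	.compress = fake_compress,
};
```

Keeping a single slot mirrors the statement that only one platform driver connects, depending on the platform the kernel is running on.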

The existing crypto/842.c interface is in the wrong place, since crypto/
should only contain software implementations; so lib/842/ is added
containing a reference (i.e. rather slow) implementation in software
of both 842 compression and 842 decompression.  The crypto/842.c interface
is changed to use only that software implementation.

The hardware 842 crypto compression interface is moved to
drivers/crypto/nx/nx-842-crypto.c.  It is also modified to be able to
handle any alignment/length input or output buffer; currently it is only
able to handle page-size and page-aligned (uncompressed) buffers, due to
restrictions in the pSeries 842 hardware driver.

v3 changes the sw and hw crypto drivers to use the same alg name "842",
and different driver names, "842-generic" and "842-nx"


Dan Streetman (10):
  powerpc: export of_get_ibm_chip_id function
  powerpc: Add ICSWX instruction
  lib: add software 842 compression/decompression
  crypto: change 842 alg to use software
  drivers/crypto/nx: rename nx-842.c to nx-842-pseries.c
  drivers/crypto/nx: add NX-842 platform frontend driver
  drivers/crypto/nx: add nx842 constraints
  drivers/crypto/nx: add PowerNV platform NX-842 driver
  drivers/crypto/nx: simplify pSeries nx842 driver
  drivers/crypto/nx: add hardware 842 crypto comp alg

 MAINTAINERS   |5 +-
 arch/powerpc/include/asm/icswx.h  |  184 
 arch/powerpc/include/asm/ppc-opcode.h |   13 +
 arch/powerpc/kernel/prom.c|1 +
 crypto/842.c  |  174 +---
 crypto/Kconfig|7 +-
 drivers/crypto/Kconfig|   10 +-
 drivers/crypto/nx/Kconfig |   55 +-
 drivers/crypto/nx/Makefile|6 +
 drivers/crypto/nx/nx-842-crypto.c |  585 
 drivers/crypto/nx/nx-842-powernv.c|  625 +
 drivers/crypto/nx/nx-842-pseries.c| 1128 +++
 drivers/crypto/nx/nx-842.c| 1623 +++--
 drivers/crypto/nx/nx-842.h|  131 +++
 include/linux/nx842.h |   21 +-
 include/linux/sw842.h |   12 +
 lib/842/842.h |  127 +++
 lib/842/842_compress.c|  626 +
 lib/842/842_debugfs.h |   52 ++
 lib/842/842_decompress.c  |  405 
 lib/842/Makefile  |2 +
 lib/Kconfig   |6 +
 lib/Makefile  |2 +
 23 files changed, 4120 insertions(+), 1680 deletions(-)
 create mode 100644 arch/powerpc/include/asm/icswx.h
 create mode 100644 drivers/crypto/nx/nx-842-crypto.c
 create mode 100644 drivers/crypto/nx/nx-842-powernv.c
 create mode 100644 drivers/crypto/nx/nx-842-pseries.c
 create mode 100644 drivers/crypto/nx/nx-842.h
 create mode 100644 include/linux/sw842.h
 create mode 100644 lib/842/842.h
 create mode 100644 lib/842/842_compress.c
 create mode 100644 lib/842/842_debugfs.h
 create mode 100644 lib/84

[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state in which ticks do not stop. To
accommodate this change, move the setting of the runqueue idle state from
the core to the cpuidle driver; otherwise rq->idle_state will be set wrongly.
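
As a rough illustration of "choose an idle state in which ticks do not stop", the userspace sketch below picks the deepest state that meets a latency limit and, when timer_stop_valid is 0, skips states flagged as stopping the tick. The struct, the flag value and the sample table are assumptions made for the example; the actual governors apply their own heuristics.

```c
#include <assert.h>

/* Stand-in for CPUIDLE_FLAG_TIMER_STOP; the value is illustrative only. */
#define SKETCH_FLAG_TIMER_STOP 0x1u

struct idle_state_sketch {
	unsigned int flags;
	unsigned int exit_latency_us;
};

/*
 * Return the highest index whose exit latency fits latency_req_us and,
 * when timer_stop_valid is 0, whose tick keeps running.  Returns -1 if
 * no state qualifies, so the caller can fall back to a default idle.
 */
static int select_state_sketch(const struct idle_state_sketch *states,
			       int count, unsigned int latency_req_us,
			       int timer_stop_valid)
{
	int i, best = -1;

	for (i = 0; i < count; i++) {
		if (states[i].exit_latency_us > latency_req_us)
			continue;
		if (!timer_stop_valid &&
		    (states[i].flags & SKETCH_FLAG_TIMER_STOP))
			continue;
		best = i;
	}
	return best;
}

/* Made-up table: two shallow states and one deep state that stops the tick. */
static const struct idle_state_sketch sample_states[3] = {
	{ 0, 1 },
	{ 0, 10 },
	{ SKETCH_FLAG_TIMER_STOP, 100 },
};
```

With this table, a broadcast CPU (timer_stop_valid == 0) ends up in state 1 instead of state 2, and a negative result corresponds to the -EBUSY path in the patch.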

Signed-off-by: Preeti U Murthy 
---
Changes from V1: https://lkml.org/lkml/2015/5/7/24
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8c24f95..b7e86f4 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cpuidle.h"
@@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
if (broadcast && tick_broadcast_enter()) {
-   default_idle_call();
-   return -EBUSY;
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index < 0) {
+   default_idle_call();
+   return -EBUSY;
+   }
+   target_state = &drv->states[index];
}
 
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
+
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
 
@@ -180,6 +188,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -215,12 +226,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor->select(drv, dev);
+   return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c 
b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device 
*ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int 
timer_stop_valid)
 {
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv->states[last_idx + 1].disabled &&
!dev->states_usage[last_idx + 1].disable &&
last_residency > last_state->threshold.promotion_time &&
+   !(!timer_stop_valid && (drv->states[last_idx + 1].flags & 
CPUIDLE_FLAG_TIMER_STOP)) &&
drv->states[last_idx + 1].exit_latency <= latency_req) {
last_state->stats.promotion_count++;
last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx > CPUIDLE_DRIVER_STATE_START &&
(drv->states[last_idx].disabled ||
dev->states_usage[last_idx].disable 

Re: [PATCH v3 2/8] ipmi/powernv: Convert to irq event interface

2015-05-07 Thread Corey Minyard
On 05/06/2015 10:16 PM, Alistair Popple wrote:
> Convert the opal ipmi driver to use the new irq interface for events.
>
> Signed-off-by: Alistair Popple 
> Cc: Corey Minyard 
> Cc: openipmi-develo...@lists.sourceforge.net
> ---
>
> Corey,
>
> If this looks ok can you please ack it? Michael Ellerman will then take
> the whole series via the powerpc tree. Thanks.

This looks fine; I don't really understand much of this, but I don't see
any issues.

The only thing I would suggest is passing the irq level
(IRQ_TYPE_LEVEL_HIGH) as part of the openfirmware data instead of
hard-coding it.

Acked-by: Corey Minyard 

>
>  drivers/char/ipmi/ipmi_powernv.c | 39 ++-
>  1 file changed, 22 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/char/ipmi/ipmi_powernv.c 
> b/drivers/char/ipmi/ipmi_powernv.c
> index 8753b0f..9b409c0 100644
> --- a/drivers/char/ipmi/ipmi_powernv.c
> +++ b/drivers/char/ipmi/ipmi_powernv.c
> @@ -15,6 +15,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>
> @@ -23,8 +25,7 @@ struct ipmi_smi_powernv {
>   u64 interface_id;
>   struct ipmi_device_id   ipmi_id;
>   ipmi_smi_t  intf;
> - u64 event;
> - struct notifier_block   event_nb;
> + unsigned int irq;
>
>   /**
>* We assume that there can only be one outstanding request, so
> @@ -197,15 +198,12 @@ static struct ipmi_smi_handlers 
> ipmi_powernv_smi_handlers = {
>   .poll   = ipmi_powernv_poll,
>  };
>
> -static int ipmi_opal_event(struct notifier_block *nb,
> -   unsigned long events, void *change)
> +static irqreturn_t ipmi_opal_event(int irq, void *data)
>  {
> - struct ipmi_smi_powernv *smi = container_of(nb,
> - struct ipmi_smi_powernv, event_nb);
> + struct ipmi_smi_powernv *smi = data;
>
> - if (events & smi->event)
> - ipmi_powernv_recv(smi);
> - return 0;
> + ipmi_powernv_recv(smi);
> + return IRQ_HANDLED;
>  }
>
>  static int ipmi_powernv_probe(struct platform_device *pdev)
> @@ -240,13 +238,16 @@ static int ipmi_powernv_probe(struct platform_device 
> *pdev)
>   goto err_free;
>   }
>
> - ipmi->event = 1ull << prop;
> - ipmi->event_nb.notifier_call = ipmi_opal_event;
> + ipmi->irq = irq_of_parse_and_map(dev->of_node, 0);
> + if (!ipmi->irq) {
> + dev_info(dev, "Unable to map irq from device tree\n");
> + ipmi->irq = opal_event_request(prop);
> + }
>
> - rc = opal_notifier_register(&ipmi->event_nb);
> - if (rc) {
> - dev_warn(dev, "OPAL notifier registration failed (%d)\n", rc);
> - goto err_free;
> + if (request_irq(ipmi->irq, ipmi_opal_event, IRQ_TYPE_LEVEL_HIGH,
> + "opal-ipmi", ipmi)) {
> + dev_warn(dev, "Unable to request irq\n");
> + goto err_dispose;
>   }
>
>   ipmi->opal_msg = devm_kmalloc(dev,
> @@ -271,7 +272,9 @@ static int ipmi_powernv_probe(struct platform_device 
> *pdev)
>  err_free_msg:
>   devm_kfree(dev, ipmi->opal_msg);
>  err_unregister:
> - opal_notifier_unregister(&ipmi->event_nb);
> + free_irq(ipmi->irq, ipmi);
> +err_dispose:
> + irq_dispose_mapping(ipmi->irq);
>  err_free:
>   devm_kfree(dev, ipmi);
>   return rc;
> @@ -282,7 +285,9 @@ static int ipmi_powernv_remove(struct platform_device 
> *pdev)
>   struct ipmi_smi_powernv *smi = dev_get_drvdata(&pdev->dev);
>
>   ipmi_unregister_smi(smi->intf);
> - opal_notifier_unregister(&smi->event_nb);
> + free_irq(smi->irq, smi);
> + irq_dispose_mapping(smi->irq);
> +
>   return 0;
>  }
>
> --
> 1.8.3.2
>


[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state in which ticks do not stop. To
accommodate this change, move the setting of the runqueue idle state from
the core to the cpuidle driver; otherwise rq->idle_state will be set wrongly.
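
The clause this patch adds to the ladder governor's promotion check, !(!timer_stop_valid && (flags & CPUIDLE_FLAG_TIMER_STOP)), is a double negation that is easy to misread. The sketch below, with a stand-in flag value, places it next to the equivalent form obtained via De Morgan's law. Both function names are invented for the example.

```c
#include <assert.h>

/* Stand-in for CPUIDLE_FLAG_TIMER_STOP; the value is illustrative only. */
#define SKETCH_TIMER_STOP 0x1u

/* The clause in the form used by the ladder governor hunk. */
static int promotion_allowed(int timer_stop_valid, unsigned int flags)
{
	return !(!timer_stop_valid && (flags & SKETCH_TIMER_STOP));
}

/* Same predicate after De Morgan: allowed unless a forbidden tick stop. */
static int promotion_allowed_alt(int timer_stop_valid, unsigned int flags)
{
	return timer_stop_valid || !(flags & SKETCH_TIMER_STOP);
}
```

Only one of the four input combinations is rejected: tick-stopping states when timer_stop_valid is 0.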

Signed-off-by: Preeti U Murthy 
---
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 61c417b..8f5657e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cpuidle.h"
@@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
 * local timer will be shut down.  If a local timer is used from another
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
-   if (broadcast && tick_broadcast_enter())
-   return -EBUSY;
+   if (broadcast && tick_broadcast_enter()) {
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index < 0)
+   return -EBUSY;
+   target_state = &drv->states[index];
+   }
+
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
 
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
@@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor->select(drv, dev);
+   return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c 
b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device 
*ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int 
timer_stop_valid)
 {
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv->states[last_idx + 1].disabled &&
!dev->states_usage[last_idx + 1].disable &&
last_residency > last_state->threshold.promotion_time &&
+   !(!timer_stop_valid && (drv->states[last_idx + 1].flags & 
CPUIDLE_FLAG_TIMER_STOP)) &&
drv->states[last_idx + 1].exit_latency <= latency_req) {
last_state->stats.promotion_count++;
last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx > CPUIDLE_DRIVER_STATE_START &&
(drv->states[last_idx].disabled ||
dev->states_usage[last_idx].disable ||
+   (!timer_sto

Re: [PATCH 1/1] powerpc: mpc85xx: Add board support for ucp1020

2015-05-07 Thread Oleksandr G Zhadan

Hi Scott,

Thanks for the fast response; please see inline.

On 05/06/2015 11:22 PM, Scott Wood wrote:

On Tue, 2015-05-05 at 11:52 -0400, Oleksandr G Zhadan wrote:

New QorIQ p1020 based board support from Arcturus Networks Inc.
http://www.arcturusnetworks.com/products/ucp1020/

Signed-off-by: Michael Durrant 
Signed-off-by: Oleksandr G Zhadan 
---
  Documentation/devicetree/bindings/pci/fsl,pci.txt  |2 +-
  .../devicetree/bindings/powerpc/arcturus/board.txt |  149 ++
  .../devicetree/bindings/powerpc/arcturus/ecm.txt   |   64 +
  Documentation/devicetree/bindings/usb/fsl-usb.txt  |2 +-
  .../devicetree/bindings/vendor-prefixes.txt|1 +
  arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi |  179 ++
  arch/powerpc/boot/dts/fsl/ucp1020som-pre.dtsi  |   70 +
  arch/powerpc/boot/dts/ucp1020_32b.dts  |   88 +
  arch/powerpc/boot/dts/ucp1020_32b.dtsi |  174 ++
  arch/powerpc/configs/ucp1020_defconfig | 2731 
  arch/powerpc/platforms/85xx/Kconfig|7 +
  arch/powerpc/platforms/85xx/Makefile   |1 +
  arch/powerpc/platforms/85xx/ucp1020_som.c  |  100 +
  13 files changed, 3566 insertions(+), 2 deletions(-)
  create mode 100644 
Documentation/devicetree/bindings/powerpc/arcturus/board.txt
  create mode 100644 Documentation/devicetree/bindings/powerpc/arcturus/ecm.txt
  create mode 100644 arch/powerpc/boot/dts/fsl/ucp1020som-post.dtsi
  create mode 100644 arch/powerpc/boot/dts/fsl/ucp1020som-pre.dtsi
  create mode 100644 arch/powerpc/boot/dts/ucp1020_32b.dts
  create mode 100644 arch/powerpc/boot/dts/ucp1020_32b.dtsi
  create mode 100644 arch/powerpc/configs/ucp1020_defconfig
  create mode 100644 arch/powerpc/platforms/85xx/ucp1020_som.c

diff --git a/Documentation/devicetree/bindings/pci/fsl,pci.txt 
b/Documentation/devicetree/bindings/pci/fsl,pci.txt
index d8ac4a7..298a5e6 100644
--- a/Documentation/devicetree/bindings/pci/fsl,pci.txt
+++ b/Documentation/devicetree/bindings/pci/fsl,pci.txt
@@ -20,7 +20,7 @@ Example:
#interrupt-cells = <1>;
#size-cells = <2>;
#address-cells = <3>;
-   compatible = "fsl,mpc8540-pcix", "fsl,mpc8540-pci";
+   compatible = "fsl,mpc8540-pcix", "fsl,mpc8540-pci", 
"fsl,mpc8548-pcie";
device_type = "pci";
...
...
diff --git a/Documentation/devicetree/bindings/powerpc/arcturus/board.txt 
b/Documentation/devicetree/bindings/powerpc/arcturus/board.txt
new file mode 100644
index 0000000..54e9765
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/arcturus/board.txt
@@ -0,0 +1,149 @@
+UCP1020 module Tree Bindings
+
+
+Copyright 2013-2015 Arcturus Networks, Inc.
+
+QorIQ p1020 based board
+http://www.arcturusnetworks.com/products/ucp1020/
+-
+
+Root Module
+
+Properties:
+- model:   "arcturus,uCP1020"
+- compatible:  "arcturus,uCP1020"
+- SN:  "1234567890-1234"
+
+/ {
+   model = "arcturus,uCP1020";
+   compatible = "arcturus,uCP1020", "fsl,P1020";
+   SN = "1234567890-1234";
+   ...
+  }


Drop the "fsl,P1020" compatible.  Top-level compatible strings describe
the whole board.

SN is a bad property name.  Call it something like "arcturus,serial#",
and define what it actually means rather than just giving an example.



OK, will fix.


+-
+
+P1020 SPI controller
+
+Properties:
+- compatible:  "spansion,s25fl008k", "winbond,w25q80bl"
+
+Example:
+   spi@7000 {
+   flash@0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "spansion,s25fl008k", "winbond,w25q80bl";
+   reg = <0>;
+   spi-max-frequency = <4000>; /* input clock */
+   ...
+   };


This isn't describing the controller, but rather a SPI chip attached to
the controller.  This also doesn't seem like the right place for random
SPI chips.

If all you're specifying is the compatible, maybe create a
spi/trivial-devices.txt similar to i2c/trivial-devices.txt?  Or
something specific to SPI flash chips to describe the partition
specification, though I generally recommend against describing
partitions in the device tree -- especially if this is a developer board
rather than something fixed-purpose where the partitioning is not going
to change based on user requirements.




For most of Documentation/devicetree/bindings/ I tried to satisfy the
checkpatch script as simply as possible. Creating a
spi/trivial-devices.txt file looks reasonable to me as well, and I will do so.



+-
+
+Chipselect/Local Bus
+
+Properties:
+- #address-cells:  <2>.
+- #size-cells: <1>.
+- compatible:  "fsl,p1020-e

Re: [PATCH 10/10] drivers/crypto/nx: add hardware 842 crypto comp alg

2015-05-07 Thread Dan Streetman
On Wed, May 6, 2015 at 11:12 PM, Herbert Xu  wrote:
> On Wed, May 06, 2015 at 12:51:06PM -0400, Dan Streetman wrote:
>> Add crypto compression alg for 842 hardware compression and decompression.
>>
>> This crypto compression alg is named "nx842" to indicate it uses hardware
>> to perform the compression and decompression, while the software 842
>> compression alg is named "sw842".  However, since before this split there
>> was only one 842 compression alg named "842" which only used hardware,
>> this is also aliased "842" for backwards compatibility.
>
> This should still be called 842.  You can set the driver name to
> nx842 or 842-nx.

ah, ok, will do.

So, I'm wondering about the common NX 842 frontend driver, for the
pSeries and PowerNV platform drivers.  The current setup is:

[ crypto "842-nx" driver ]
   v
[ nx-842 main driver ]
  v
[ nx-842 pSeries driver | nx-842 PowerNV driver ]

The main reason for that is that the HW has specific constraints,
specifically each input and output buffer passed to it for comp or
decomp has to:
-be located at a specific alignment
-have a length of a specific multiple
-have a length between a specific minimum and maximum
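
The three constraints listed above can be written as a small check, sketched here in userspace C. The struct name, field names and the numbers in sample_constraints are made up for illustration and are not the values of any real NX-842 hardware. The alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of what the hardware accepts. */
struct buf_constraints_sketch {
	size_t alignment;	/* required address alignment (power of two) */
	size_t multiple;	/* length must be a multiple of this */
	size_t minimum;		/* smallest accepted length */
	size_t maximum;		/* largest accepted length */
};

/* Returns 1 if the buffer can be handed to the hardware unchanged. */
static int buffer_fits(const struct buf_constraints_sketch *c,
		       uintptr_t addr, size_t len)
{
	if (addr & (c->alignment - 1))
		return 0;
	if (len % c->multiple)
		return 0;
	if (len < c->minimum || len > c->maximum)
		return 0;
	return 1;
}

/* Largest length <= len that satisfies the multiple/min/max rules, or 0. */
static size_t usable_length(const struct buf_constraints_sketch *c,
			    size_t len)
{
	if (len > c->maximum)
		len = c->maximum;
	len -= len % c->multiple;
	return len < c->minimum ? 0 : len;
}

/* Made-up numbers, only to exercise the helpers. */
static const struct buf_constraints_sketch sample_constraints = {
	.alignment = 16,
	.multiple = 8,
	.minimum = 8,
	.maximum = 4096,
};
```

A buffer that fails buffer_fits() is the case where the 842-nx crypto code has to copy or split the data to meet the driver's requirements.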

The crypto 842-nx has (significant) code in it to handle any alignment
and length input buffers, to match them to what the driver requires.
Would it be better to move that into the crypto code, so that any
crypto compression hw driver can request buffers be specifically
aligned/sized?  I did have to use a header on each compressed buffer
that needed re-alignment or re-sizing, so maybe it's not appropriate
for common crypto compression code.

Since there don't seem to be any other hw compression drivers (yet),
maybe it should stay in the 842-nx code, at least for now.  Hopefully
any future compression hw won't have alignment or length multiple
restrictions...

>
> Cheers,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[RFC PATCH] powerpc/mm: Use a non-idle variant of kick_all_cpus_sync

2015-05-07 Thread Aneesh Kumar K.V
What we really need is the ability to wait for other cpus to
finish a local_irq_save/local_irq_restore region. We don't need to send
an IPI to idle cpus in that case. Add a variant of kick_all_cpus_sync
to do that. If idle_cpu_mask changes during the call, we should be
ok because:
1) new cpus got added. In this case, when they enter the critical path
   they will have seen the new values I modified before smp_wmb();

2) cpus got removed: In this case we are ok, because we send a stray IPI
   to them
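
The target set for the new helper is "online cpus that are not in the nohz idle mask". A userspace sketch with a 64-bit word standing in for struct cpumask makes the mask arithmetic concrete. The function name and the 64-cpu limit are assumptions of the example. The calling cpu is also dropped here, since smp_call_function_many() runs the function on other cpus only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One bit per cpu; a 64-bit word stands in for struct cpumask.
 * Returns the cpus that would receive the IPI.
 */
static uint64_t nonidle_targets(uint64_t online, uint64_t idle,
				unsigned int self)
{
	uint64_t targets = online & ~idle;	/* cpumask_andnot() equivalent */

	assert(self < 64);
	return targets & ~((uint64_t)1 << self);	/* skip the calling cpu */
}
```

For example, with cpus 0-3 online, cpus 1 and 2 idle, and cpu 0 calling, only cpu 3 is poked.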

Signed-off-by: Aneesh Kumar K.V 
---
NOTE: 
This needs closer review, because I am new to the cpumask area.

 arch/powerpc/mm/pgtable_64.c |  6 +++---
 include/linux/smp.h  |  9 +
 kernel/sched/fair.c  | 19 +++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 049d961802aa..e54b111f8737 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -590,7 +590,7 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
 * by sending an IPI to all the cpus and executing a dummy
 * function there.
 */
-   kick_all_cpus_sync();
+   poke_nonidle_cpus_sync();
/*
 * Now invalidate the hpte entries in the range
 * covered by pmd. This make sure we take a
@@ -670,7 +670,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
 * This ensures that generic code that rely on IRQ disabling
 * to prevent a parallel THP split work as expected.
 */
-   kick_all_cpus_sync();
+   poke_nonidle_cpus_sync();
 }
 
 /*
@@ -855,7 +855,7 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 * different code paths. So make sure we wait for the parallel
 * find_linux_pte_or_hugepage to finish.
 */
-   kick_all_cpus_sync();
+   poke_nonidle_cpus_sync();
return old_pmd;
 }
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index c4414074bd88..16d539b94c31 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -101,6 +101,14 @@ int smp_call_function_any(const struct cpumask *mask,
 
 void kick_all_cpus_sync(void);
 void wake_up_all_idle_cpus(void);
+#ifdef CONFIG_NO_HZ_COMMON
+void poke_nonidle_cpus_sync(void);
+#else
+static inline void poke_nonidle_cpus_sync(void)
+{
+   return kick_all_cpus_sync();
+}
+#endif
 
 /*
  * Generic and arch helpers
@@ -150,6 +158,7 @@ smp_call_function_any(const struct cpumask *mask, 
smp_call_func_t func,
 
 static inline void kick_all_cpus_sync(void) {  }
 static inline void wake_up_all_idle_cpus(void) {  }
+static inline void poke_nonidle_cpus_sync(void) {  }
 
 #ifdef CONFIG_UP_LATE_INIT
 extern void __init up_late_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..00abc6ae077b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7513,6 +7513,25 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
return NOTIFY_DONE;
}
 }
+
+static void do_nothing(void *unused)
+{
+}
+
+void poke_nonidle_cpus_sync(void)
+{
+   struct cpumask mask;
+
+   /*
+* Make sure the change is visible before we poke the cpus
+*/
+   smp_mb();
+   preempt_disable();
+   cpumask_andnot(&mask, cpu_online_mask, nohz.idle_cpus_mask);
+   smp_call_function_many(&mask, do_nothing, NULL, 1);
+   preempt_enable();
+}
+EXPORT_SYMBOL_GPL(poke_nonidle_cpus_sync);
 #endif
 
 static DEFINE_SPINLOCK(balancing);
-- 
2.1.4


[PATCH V2] powerpc/mm: Return NULL for not present hugetlb page

2015-05-07 Thread Aneesh Kumar K.V
We need to check whether the pte is present in follow_huge_addr and
properly return NULL if the mapping is not present. Also use READ_ONCE
when dereferencing the pte_t address.
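
The three outcomes of the reworked lookup (invalid, not present, page found) and the sub-page offset arithmetic can be sketched in userspace as follows. The pte_sketch struct, the status enum and the 4 KiB SKETCH_PAGE_SHIFT are assumptions for illustration. The plain struct copy only plays the role of the single READ_ONCE() snapshot and has no concurrency guarantees of its own.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Hypothetical stand-in for a huge pte: a present bit and the first pfn. */
struct pte_sketch {
	int present;
	unsigned long first_pfn;
};

enum follow_status { FOLLOW_EINVAL, FOLLOW_NOT_PRESENT, FOLLOW_OK };

static enum follow_status follow_huge_sketch(const struct pte_sketch *ptep,
					     unsigned int shift,
					     unsigned long address,
					     unsigned long *pfn)
{
	struct pte_sketch pte;
	unsigned long mask;

	if (!ptep || !shift)
		return FOLLOW_EINVAL;		/* ERR_PTR(-EINVAL) in the patch */

	pte = *ptep;				/* one snapshot, then use the copy */
	if (!pte.present)
		return FOLLOW_NOT_PRESENT;	/* NULL in the patch */

	mask = (1UL << shift) - 1;
	*pfn = pte.first_pfn + ((address & mask) >> SKETCH_PAGE_SHIFT);
	return FOLLOW_OK;
}

/* Made-up sample entries for exercising the helper. */
static const struct pte_sketch sample_present = { 1, 0x1000 };
static const struct pte_sketch sample_absent = { 0, 0 };
```

With a 16 MiB huge page (shift 24), address 0x1003000 falls 0x3000 bytes into the page, which is the fourth base page of the mapping.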

Signed-off-by: Aneesh Kumar K.V 
---
Changes from V1:
--
* Fix build failures with some platform configs.
  involves pmd_trans_huge(__pmd(pte_val(pte)))

 arch/powerpc/mm/hugetlbpage.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0ce968b00b7c..3385e3d0506e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
-   pte_t *ptep;
-   struct page *page;
+   pte_t *ptep, pte;
unsigned shift;
unsigned long mask, flags;
+   struct page *page = ERR_PTR(-EINVAL);
+
+   local_irq_save(flags);
+   ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+   if (!ptep)
+   goto no_page;
+   pte = READ_ONCE(*ptep);
/*
+* Verify it is a huge page else bail.
 * Transparent hugepages are handled by generic code. We can skip them
 * here.
 */
-   local_irq_save(flags);
-   ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+   if (!shift || pmd_trans_huge(__pmd(pte_val(pte))))
+   goto no_page;
 
-   /* Verify it is a huge page else bail. */
-   if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
-   local_irq_restore(flags);
-   return ERR_PTR(-EINVAL);
+   if (!pte_present(pte)) {
+   page = NULL;
+   goto no_page;
}
mask = (1UL << shift) - 1;
-   page = pte_page(*ptep);
+   page = pte_page(pte);
if (page)
page += (address & mask) / PAGE_SIZE;
 
+no_page:
local_irq_restore(flags);
return page;
 }
-- 
2.1.4
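
The patch above distinguishes three outcomes: not a hugepage mapping
(ERR_PTR(-EINVAL)), a hugepage entry that is not present (NULL), and a
present entry. A rough userspace sketch of that decision order follows.
The types, the present bit, and the enum are hypothetical stand-ins, not
the kernel's pte_t or page structures, and the transparent hugepage check
is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical present bit for the sketch; the real layout is arch-specific. */
#define SKETCH_PTE_PRESENT 0x1u

enum sketch_lookup { SKETCH_INVALID, SKETCH_NOT_PRESENT, SKETCH_FOUND };

/*
 * Classify an entry the way the patched follow_huge_addr() orders its
 * checks. The entry is read exactly once through a volatile pointer
 * (the role READ_ONCE plays in the patch), and every later test uses
 * that snapshot, so a concurrent update cannot be seen halfway through.
 */
static enum sketch_lookup classify_entry(const volatile uint64_t *ptep,
					 unsigned int shift,
					 uint64_t *snapshot)
{
	uint64_t pte;

	if (!ptep)
		return SKETCH_INVALID;		/* no entry at all */

	pte = *ptep;				/* single read of the entry */

	if (!shift)
		return SKETCH_INVALID;		/* not a hugepage mapping */

	if (!(pte & SKETCH_PTE_PRESENT))
		return SKETCH_NOT_PRESENT;	/* caller would get NULL */

	*snapshot = pte;
	return SKETCH_FOUND;
}
```

The point of the ordering is that "not present" is only reported once the
entry is known to be a hugepage entry; everything else stays an error.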


[PATCH v2 2/2] powerpc/mpc85xx: Fix EDAC address capture

2015-05-07 Thread songwenbin
From: York Sun 

Extend err_addr to cover 64 bits for DDR errors.

Signed-off-by: York Sun 
Signed-off-by: songwenbin 
---
 drivers/edac/mpc85xx_edac.c | 10 +++---
 drivers/edac/mpc85xx_edac.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 68bf234..23ef8e9 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -811,6 +811,8 @@ static void sbe_ecc_decode(u32 cap_high, u32 cap_low, u32 
cap_ecc,
}
 }
 
+#define make64(high, low) (((u64)(high) << 32) | (low))
+
 static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 {
struct mpc85xx_mc_pdata *pdata = mci->pvt_info;
@@ -818,7 +820,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
u32 bus_width;
u32 err_detect;
u32 syndrome;
-   u32 err_addr;
+   u64 err_addr;
u32 pfn;
int row_index;
u32 cap_high;
@@ -849,7 +851,9 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
else
syndrome &= 0x;
 
-   err_addr = in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS);
+   err_addr = make64(
+   in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_EXT_ADDRESS),
+   in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS));
pfn = err_addr >> PAGE_SHIFT;
 
for (row_index = 0; row_index < mci->nr_csrows; row_index++) {
@@ -886,7 +890,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
mpc85xx_mc_printk(mci, KERN_ERR,
"Captured Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
cap_high, cap_low, syndrome);
-   mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8x\n", err_addr);
+   mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8llx\n", err_addr);
mpc85xx_mc_printk(mci, KERN_ERR, "PFN: %#8.8x\n", pfn);
 
/* we are out of range */
diff --git a/drivers/edac/mpc85xx_edac.h b/drivers/edac/mpc85xx_edac.h
index 4498baf..9352e88 100644
--- a/drivers/edac/mpc85xx_edac.h
+++ b/drivers/edac/mpc85xx_edac.h
@@ -43,6 +43,7 @@
 #define MPC85XX_MC_ERR_INT_EN  0x0e48
 #define MPC85XX_MC_CAPTURE_ATRIBUTES   0x0e4c
 #define MPC85XX_MC_CAPTURE_ADDRESS 0x0e50
+#define MPC85XX_MC_CAPTURE_EXT_ADDRESS 0x0e54
 #define MPC85XX_MC_ERR_SBE 0x0e58
 
 #define DSC_MEM_EN 0x8000
-- 
2.1.0.27.g96db324
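
As an illustration of what make64() in the patch above does with the two
capture registers, here is a small standalone sketch. It uses uint32_t and
uint64_t from stdint.h in place of the kernel's u32/u64, and
capture_address() is a hypothetical helper, not a function from the driver.

```c
#include <stdint.h>

/* Same shape as the macro in the patch, with stdint types. */
#define make64(high, low) (((uint64_t)(high) << 32) | (low))

/*
 * Combine the extended (upper) and regular (lower) capture address
 * registers into one 64-bit address. Both inputs are unsigned 32-bit
 * values, so the low half cannot sign-extend into the upper half.
 */
static uint64_t capture_address(uint32_t ext_addr, uint32_t addr)
{
	return make64(ext_addr, addr);
}
```

Note that the diff keeps pfn as u32, so err_addr >> PAGE_SHIFT is narrowed
back to 32 bits on assignment. Whether that matters depends on how large a
captured address can get on the target parts, which the patch does not say.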


[PATCH v2 1/2] powerpc/mpc8xxx: Change EDAC for FSL SoC

2015-05-07 Thread songwenbin
From: York Sun 

Remove mpc83xx and mpc85xx as dependency.

Signed-off-by: York Sun 
Signed-off-by: songwenbin 
---
 drivers/edac/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index cb59619..ad07d4f 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -262,10 +262,10 @@ config EDAC_SBRIDGE
 
 config EDAC_MPC85XX
tristate "Freescale MPC83xx / MPC85xx"
-   depends on EDAC_MM_EDAC && FSL_SOC && (PPC_83xx || PPC_85xx)
+   depends on EDAC_MM_EDAC && FSL_SOC
help
  Support for error detection and correction on the Freescale
- MPC8349, MPC8560, MPC8540, MPC8548
+ MPC8349, MPC8560, MPC8540, MPC8548, T4240
 
 config EDAC_MV64X60
tristate "Marvell MV64x60"
-- 
2.1.0.27.g96db324


[PATCH 1/2] powerpc/mpc8xxx: Change EDAC for FSL SoC

2015-05-07 Thread songwenbin
From: York Sun 

Remove mpc83xx and mpc85xx as dependency.

Signed-off-by: York Sun 
Change-Id: I92ff2ecf38b00e48a713baf2443495f8a1468beb
Reviewed-on: http://git.am.freescale.net:8181/554
Reviewed-by: Schmitt Richard-B43082 
Tested-by: Schmitt Richard-B43082 
Reviewed-by: Fleming Andrew-AFLEMING 
Tested-by: Fleming Andrew-AFLEMING 
Signed-off-by: songwenbin 
---
 drivers/edac/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index cb59619..ad07d4f 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -262,10 +262,10 @@ config EDAC_SBRIDGE
 
 config EDAC_MPC85XX
tristate "Freescale MPC83xx / MPC85xx"
-   depends on EDAC_MM_EDAC && FSL_SOC && (PPC_83xx || PPC_85xx)
+   depends on EDAC_MM_EDAC && FSL_SOC
help
  Support for error detection and correction on the Freescale
- MPC8349, MPC8560, MPC8540, MPC8548
+ MPC8349, MPC8560, MPC8540, MPC8548, T4240
 
 config EDAC_MV64X60
tristate "Marvell MV64x60"
-- 
2.1.0.27.g96db324


[PATCH] powerpc/mpc85xx: Fix EDAC address capture

2015-05-07 Thread songwenbin
From: York Sun 

Extend err_addr to cover 64 bits for DDR errors.

Signed-off-by: York Sun 
Change-Id: Idb112c4a106416a9cad9933c415e6f62de5cf07b
Reviewed-on: http://git.am.freescale.net:8181/553
Tested-by: Schmitt Richard-B43082 
Reviewed-by: Fleming Andrew-AFLEMING 
Tested-by: Fleming Andrew-AFLEMING 
Signed-off-by: songwenbin 
---
 drivers/edac/mpc85xx_edac.c | 10 +++---
 drivers/edac/mpc85xx_edac.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 68bf234..23ef8e9 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -811,6 +811,8 @@ static void sbe_ecc_decode(u32 cap_high, u32 cap_low, u32 
cap_ecc,
}
 }
 
+#define make64(high, low) (((u64)(high) << 32) | (low))
+
 static void mpc85xx_mc_check(struct mem_ctl_info *mci)
 {
struct mpc85xx_mc_pdata *pdata = mci->pvt_info;
@@ -818,7 +820,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
u32 bus_width;
u32 err_detect;
u32 syndrome;
-   u32 err_addr;
+   u64 err_addr;
u32 pfn;
int row_index;
u32 cap_high;
@@ -849,7 +851,9 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
else
syndrome &= 0x;
 
-   err_addr = in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS);
+   err_addr = make64(
+   in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_EXT_ADDRESS),
+   in_be32(pdata->mc_vbase + MPC85XX_MC_CAPTURE_ADDRESS));
pfn = err_addr >> PAGE_SHIFT;
 
for (row_index = 0; row_index < mci->nr_csrows; row_index++) {
@@ -886,7 +890,7 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
mpc85xx_mc_printk(mci, KERN_ERR,
"Captured Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
cap_high, cap_low, syndrome);
-   mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8x\n", err_addr);
+   mpc85xx_mc_printk(mci, KERN_ERR, "Err addr: %#8.8llx\n", err_addr);
mpc85xx_mc_printk(mci, KERN_ERR, "PFN: %#8.8x\n", pfn);
 
/* we are out of range */
diff --git a/drivers/edac/mpc85xx_edac.h b/drivers/edac/mpc85xx_edac.h
index 4498baf..9352e88 100644
--- a/drivers/edac/mpc85xx_edac.h
+++ b/drivers/edac/mpc85xx_edac.h
@@ -43,6 +43,7 @@
 #define MPC85XX_MC_ERR_INT_EN  0x0e48
 #define MPC85XX_MC_CAPTURE_ATRIBUTES   0x0e4c
 #define MPC85XX_MC_CAPTURE_ADDRESS 0x0e50
+#define MPC85XX_MC_CAPTURE_EXT_ADDRESS 0x0e54
 #define MPC85XX_MC_ERR_SBE 0x0e58
 
 #define DSC_MEM_EN 0x8000
-- 
2.1.0.27.g96db324


Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
 Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
 notification by executing *throttle_check() on any one of the cpu on
 the chip. This is a sanity check to verify if we were indeed
 throttled/unthrottled after receiving OCC_THROTTLE notification.

 We cannot call *throttle_check() directly from the notification
 handler because we could be handling chip1's notification in chip2. So
 initiate an smp_call to execute *throttle_check(). We are irq-disabled
 in the notification handler, so use a worker thread to smp_call
 throttle_check() on any of the cpu in the chipmask.
>>>
>>> I see that the first patch takes care of reporting *per-chip* throttling
>>> for pmax capping condition. But where are we taking care of reporting
>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>
>>
>> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
>> Because when the above two conditions occur it is likely to happen across all
>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>> *throttle_check().
> 
> Alright like I pointed in the previous reply, a comment to indicate that
> psafe and freq control disabled conditions will fail when occ is
> inactive and that all chips face the consequence of this will help.

From your explanation on the thread of the first patch of this series,
this will not be required.

So,
Reviewed-by: Preeti U Murthy 

Regards
Preeti U Murthy
> 
>>

 Signed-off-by: Shilpasri G Bhat 
 ---
  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
  1 file changed, 26 insertions(+), 2 deletions(-)

 diff --git a/drivers/cpufreq/powernv-cpufreq.c 
 b/drivers/cpufreq/powernv-cpufreq.c
 index 9268424..9618813 100644
 --- a/drivers/cpufreq/powernv-cpufreq.c
 +++ b/drivers/cpufreq/powernv-cpufreq.c
 @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
  static struct chip {
unsigned int id;
bool throttled;
 +  cpumask_t mask;
 +  struct work_struct throttle;
  } *chips;

  static int nr_chips;
 @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
return powernv_pstate_info.max - powernv_pstate_info.nominal;
  }

 -static void powernv_cpufreq_throttle_check(unsigned int cpu)
 +static void powernv_cpufreq_throttle_check(void *data)
  {
 +  unsigned int cpu = smp_processor_id();
unsigned long pmsr;
int pmsr_pmax, pmsr_lp, i;

 @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct 
 cpufreq_policy *policy,
return 0;

if (!throttled)
 -  powernv_cpufreq_throttle_check(smp_processor_id());
 +  powernv_cpufreq_throttle_check(NULL);

freq_data.pstate_id = powernv_freqs[new_index].driver_data;

 @@ -418,6 +421,14 @@ static struct notifier_block 
 powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
  };

 +void powernv_cpufreq_work_fn(struct work_struct *work)
 +{
 +  struct chip *chip = container_of(work, struct chip, throttle);
 +
 +  smp_call_function_any(&chip->mask,
 +powernv_cpufreq_throttle_check, NULL, 0);
 +}
 +
  static char throttle_reason[][30] = {
"No throttling",
"Power Cap",
 @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct 
 notifier_block *nb,
struct opal_msg *occ_msg = msg;
uint64_t token;
uint64_t chip_id, reason;
 +  int i;

if (msg_type != OPAL_MSG_OCC)
return 0;
 @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct 
 notifier_block *nb,
occ_reset = false;
throttled = false;
pr_info("OCC: Active\n");
 +
 +  for (i = 0; i < nr_chips; i++)
 +  schedule_work(&chips[i].throttle);
 +
return 0;
}

 @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct 
 notifier_block *nb,
else if (!reason)
pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
throttle_reason[reason]);
 +  else
 +  return 0;
>>>
>>> Why the else section ? The code can never reach here, can it ?
>>
>> When reason > 5 , we dont want to handle it.
> 
> Of course! My bad!
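
The work function in the quoted patch recovers its struct chip from the
embedded work_struct with container_of(). A minimal userspace sketch of
that pattern is below. The sketch_* names and the simplified macro are
assumptions made for illustration; the kernel's container_of() also does
type checking that is omitted here.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct sketch_work {
	int pending;
};

/* Shaped like the patch's struct chip, with the work item embedded. */
struct sketch_chip {
	unsigned int id;
	int throttled;
	struct sketch_work throttle;
};

/* Given only the work item, find the chip it belongs to. */
static struct sketch_chip *chip_from_work(struct sketch_work *work)
{
	return sketch_container_of(work, struct sketch_chip, throttle);
}
```

Embedding the work item this way means no separate lookup table from work
items to chips is needed; the pointer arithmetic is enough.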

Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote:
> 
> 
> On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
>> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>>> Hi Preeti,
>>>
>>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
 Hi Shilpa,

 On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
> max allowed frequency for that chip if the chip exceeds its power or
> temperature limits. As Pmax capping is a chip level condition report
> this throttling behavior at chip level and also do not set the global
> 'throttled' on Pmax capping instead set the per-chip throttled
> variable. Report unthrottling if Pmax is restored after throttling.
>
> This patch adds a structure to store chip id and throttled state of
> the chip.
>
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 59 
> ---
>  1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index ebef0d8..d0c18c9 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -42,6 +43,13 @@
>  static struct cpufreq_frequency_table 
> powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static bool rebooting, throttled;
>
> +static struct chip {
> + unsigned int id;
> + bool throttled;
> +} *chips;
> +
> +static int nr_chips;
> +
>  /*
>   * Note: The set of pstates consists of contiguous integers, the
>   * smallest of which is indicated by powernv_pstate_info.min, the
> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>  {
>   unsigned long pmsr;
> - int pmsr_pmax, pmsr_lp;
> + int pmsr_pmax, pmsr_lp, i;
>
>   pmsr = get_pmspr(SPRN_PMSR);
>
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == cpu_to_chip_id(cpu))
> + break;
> +
>   /* Check for Pmax Capping */
>   pmsr_pmax = (s8)PMSR_MAX(pmsr);
>   if (pmsr_pmax != powernv_pstate_info.max) {
> - throttled = true;
> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
> - pr_info("Max allowed Pstate is capped\n");
> + if (chips[i].throttled)
> + goto next;
> + chips[i].throttled = true;
> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
> + } else if (chips[i].throttled) {
> + chips[i].throttled = false;

 Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>>
>>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>>> sufficient to decide throttling/unthrottling when OCC is active for that chip.
>>
>> Ok then we can perhaps exit after detecting unthrottling here.
> 
> This won't work for older firmwares which do not clear "Frequency control
> enabled bit" on OCC reset cycle. So let us check for remaining two conditions on
> unthrottling as well.

ok.

> 
>>>

> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
>   }
>
>   /*
>* Check for Psafe by reading LocalPstate
>* or check if Psafe_mode_active is set in PMSR.
>*/
> +next:
>   pmsr_lp = (s8)PMSR_LP(pmsr);
>   if ((pmsr_lp < powernv_pstate_info.min) ||
>   (pmsr & PMSR_PSAFE_ENABLE)) {
> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver 
> = {
>   .attr   = powernv_cpu_freq_attr,

 What about the situation where although occ is active, this particular
 chip has been throttled and we end up repeatedly reporting "pstate set
 to safe" and "frequency control disabled from OS" ? Should we not have a
 check on (chips[i].throttled) before reporting an anomaly for these two
 scenarios as well just like you have for pmsr_pmax ?
>>>
>>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>>> because of global variable 'throttled', which is set to true on passing any of
>>> these two conditions.
>>>
>>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>>> control disabled" state. These two conditions are most likely to happen during
>>> an OCC reset cycle which will occur across all chips.
>>
>> Let us then add a comment to indicate that Psafe and frequency control
>> disabled conditions will fail *only if OCC is inactive* and not
>> otherwise and that this is a s

Re: [PATCH 3/3] kvm/powerpc: report guest steal time in host

2015-05-07 Thread Christian Borntraeger
Am 06.05.2015 um 18:42 schrieb Naveen N. Rao:
> On 2015/05/06 02:46PM, Christian Borntraeger wrote:
>> Am 06.05.2015 um 13:56 schrieb Naveen N. Rao:
>>> On powerpc, kvm tracks both the guest steal time as well as the time
>>> when guest was idle and this gets sent in to the guest through DTL. The
>>> guest accounts these entries as either steal time or idle time based on
>>> the last running task. Since the true guest idle status is not visible
>>> to the host, we can't accurately expose the guest steal time in the
>>> host.
>>>
>>> However, tracking the guest vcpu cede status can get us a reasonable
>>> (within 5% variation) vcpu steal time since guest vcpus cede the
>>> processor on entering the idle task. To do this, we introduce a new
>>> field ceded_st in kvm_vcpu_arch structure to accurately track the guest
>>> vcpu cede status (this is needed since the existing ceded field is
>>> modified before we can use it). During DTL entry creation, we check this
>>> flag and account the time as stolen if the guest vcpu had not ceded.
>>
>> I think this is more or less a question about the semantic:
>>
>> What would happen if you use  current->sched_info.run_delay like x86 also
>> on power? How far are the numbers away?
> 
> The numbers were quite off and didn't quite make sense.

Strange. I would expect it to match at least the wall clock time between
runnable and running. Maybe it's just a bug?


> 
>> My feeling is, that the semantics
>> of "steal time" inside the guest is somewhat different on each platform. 
>>
>> This brings me to a 2nd question:
>> Do you need to match the host view of guest steal time with the guest view
>> or do we want to have a host view that translates as "this is the time that
>> the guest was runnable but we were too busy to schedule him"?
> 
> Very good point. This is probably good enough for our purpose and I'd 
> like to think my current patchset does something similar for powerpc. We 
> don't report the exact steal time as seen from within the guest, but a 
> close approximation of it. We count all time that a vcpu was not-idle as 
> steal. This includes time we were doing something in the host on behalf 
> of the vcpu as well as time when we were just doing something else. I 
> don't know if we can separate these two or if that would be desirable.  
> The scheduler statistics don't seem to accurately reflect this on ppc.
> 
>> For the former x86 has the best solution, as the host tells the guest its
>> understanding of steal - so both match. For the latter we actually try to
>> give guest steal a meaning in the host context  - the overload.
>> Would /proc//schedstat value 2 (time spent waiting on a runqueue)
>> meet your requirements from the cover-letter?
> 
> This looks to be the same as sched_info.run_delay, which doesn't seem to 
> reflect the wait on the runqueue. I will recheck this on ppc tomorrow.
> 
> As an aside, do you happen to know if /proc//schedstat accurately 
> reports the "overload" on s390?

Things are usually even more complicated as we always have the LPAR hypervisor
below the KVM or z/VM hypervisor (KVM or z/VM guests are always nested so to
speak). Depending on the overcommit on LPAR level the wall clock times might 
indicate a problem in a "wrong" place. 

Now the steal time in a kvm guest is actually precise as the hardware will
step the guest cpu timer only when both LPAR and KVM have this CPU scheduled.
This will also cause "steal" when KVM emulates an instruction for the guest -
unless we correct the guest view - which we don't right now.
Linux running in an LPAR also sees the time that was stolen from it by LPAR.

I really have not looked closely at run_delay. My assumption is that
it boils down to "wall clock time between runnable and running". If the
admin does overcommit in KVM and LPAR is just slightly overcommitted, this
is probably good enough. If the overcommit happens at LPAR then the value
might be confusing. I would assume that people overcommit at the z/VM or KVM
level and the LPAR is managed with less overcommit - but that's not a given.

Christian
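
For anyone who wants to look at the schedstat value discussed above, here
is a small sketch of parsing one schedstat line. It assumes the three-field
per-task layout (time on cpu, time waiting on a runqueue, number of
timeslices), with the second field being the runqueue wait time mentioned
in the thread. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical holder for the three per-task schedstat fields. */
struct sketch_schedstat {
	unsigned long long run_ns;	/* time spent on the cpu */
	unsigned long long wait_ns;	/* time spent waiting on a runqueue */
	unsigned long long timeslices;	/* number of timeslices run */
};

/* Parse one schedstat line; returns 0 on success, -1 if malformed. */
static int parse_schedstat(const char *line, struct sketch_schedstat *out)
{
	int n = sscanf(line, "%llu %llu %llu",
		       &out->run_ns, &out->wait_ns, &out->timeslices);

	return n == 3 ? 0 : -1;
}
```

Sampling wait_ns twice and taking the difference would give the runqueue
wait over an interval, which is the quantity being compared against guest
steal time in this discussion.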


Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions

2015-05-07 Thread Aneesh Kumar K.V
"Kirill A. Shutemov"  writes:

> On Thu, May 07, 2015 at 12:53:27PM +0530, Aneesh Kumar K.V wrote:
>> After this patch pmdp_* functions operate only on hugepage pte,
>> and not on regular pmd_t values pointing to page table.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
>>  arch/powerpc/mm/pgtable_64.c | 76 
>> +---
>>  include/asm-generic/pgtable.h| 19 
>>  mm/huge_memory.c |  2 +-
>>  4 files changed, 65 insertions(+), 36 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
>> b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index 43e6ad424c7f..50830c9a2116 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct 
>> *mm, unsigned long addr,
>>  extern void pmdp_splitting_flush(struct vm_area_struct *vma,
>>   unsigned long address, pmd_t *pmdp);
>>  
>> +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
>> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp);
>> +
>>  #define __HAVE_ARCH_PGTABLE_DEPOSIT
>>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>> pgtable_t pgtable);
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 59daa5eeec25..9171c1a37290 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, 
>> unsigned long address,
>>  pmd_t pmd;
>>  
>>  VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -if (pmd_trans_huge(*pmdp)) {
>> -pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
>> -} else {
>> -/*
>> - * khugepaged calls this for normal pmd
>> - */
>> -pmd = *pmdp;
>> -pmd_clear(pmdp);
>> -/*
>> - * Wait for all pending hash_page to finish. This is needed
>> - * in case of subpage collapse. When we collapse normal pages
>> - * to hugepage, we first clear the pmd, then invalidate all
>> - * the PTE entries. The assumption here is that any low level
>> - * page fault will see a none pmd and take the slow path that
>> - * will wait on mmap_sem. But we could very well be in a
>> - * hash_page with local ptep pointer value. Such a hash page
>> - * can result in adding new HPTE entries for normal subpages.
>> - * That means we could be modifying the page content as we
>> - * copy them to a huge page. So wait for parallel hash_page
>> - * to finish before invalidating HPTE entries. We can do this
>> - * by sending an IPI to all the cpus and executing a dummy
>> - * function there.
>> - */
>> -kick_all_cpus_sync();
>> -/*
>> - * Now invalidate the hpte entries in the range
>> - * covered by pmd. This make sure we take a
>> - * fault and will find the pmd as none, which will
>> - * result in a major fault which takes mmap_sem and
>> - * hence wait for collapse to complete. Without this
>> - * the __collapse_huge_page_copy can result in copying
>> - * the old content.
>> - */
>> -flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
>> -}
>> +VM_BUG_ON(!pmd_trans_huge(*pmdp));
>> +pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
>> +return pmd;
>
> The patches are in reverse order: you need to change pmdp_get_and_clear
> first otherwise you break bisectability.
> Or better merge patches together.

The first patch is really a cleanup and should not result in code
changes. It just makes sure that we use pmdp_* functions only on hugepage
ptes and not on regular pmd_t pointers to pgtable. It avoids the not so
nice if (pmd_trans_huge()) check in the code and allows us to do the
VM_BUG_ON(!pmd_trans_huge(*pmdp)) there. That is really important on
archs like ppc64 where regular pmd format is different from hugepage pte
format.


>
>> +}
>> +
>> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>> +  pmd_t *pmdp)
>> +{
>> +pmd_t pmd;
>> +
>> +VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +VM_BUG_ON(pmd_trans_huge(*pmdp));
>> +
>> +pmd = *pmdp;
>> +pmd_clear(pmdp);
>> +/*
>> + * Wait for all pending hash_page to finish. This is needed
>> + * in case of subpage collapse. When we collapse normal pages
>> + * to hugepage, we first clear the pmd, then invalidate all
>> + * the PTE entries. The assumption here is that any low level
>> + * page fault will see a n

Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page

2015-05-07 Thread Aneesh Kumar K.V
Benjamin Herrenschmidt  writes:

> On Thu, 2015-05-07 at 12:46 +0530, Aneesh Kumar K.V wrote:
>> We need to check whether pte is present in follow_huge_addr and
>> properly return NULL if mapping is not present. Also use READ_ONCE
>> when dereferencing pte_t address.
>
> Does that need to go to stable as well?

Yes. I would like David to take a look at this and give his feedback.
W.r.t. the patch itself, I hit a build failure on mpc85xx_smp_defconfig. I will
resend after the test build finishes on all configs.

-aneesh


Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-07 Thread Shilpasri G Bhat


On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
 The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
 max allowed frequency for that chip if the chip exceeds its power or
 temperature limits. As Pmax capping is a chip level condition report
 this throttling behavior at chip level and also do not set the global
 'throttled' on Pmax capping instead set the per-chip throttled
 variable. Report unthrottling if Pmax is restored after throttling.

 This patch adds a structure to store chip id and throttled state of
 the chip.

 Signed-off-by: Shilpasri G Bhat 
 ---
  drivers/cpufreq/powernv-cpufreq.c | 59 
 ---
  1 file changed, 55 insertions(+), 4 deletions(-)

 diff --git a/drivers/cpufreq/powernv-cpufreq.c 
 b/drivers/cpufreq/powernv-cpufreq.c
 index ebef0d8..d0c18c9 100644
 --- a/drivers/cpufreq/powernv-cpufreq.c
 +++ b/drivers/cpufreq/powernv-cpufreq.c
 @@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
 +#include 

  #include 
  #include 
 @@ -42,6 +43,13 @@
  static struct cpufreq_frequency_table 
 powernv_freqs[POWERNV_MAX_PSTATES+1];
  static bool rebooting, throttled;

 +static struct chip {
 +  unsigned int id;
 +  bool throttled;
 +} *chips;
 +
 +static int nr_chips;
 +
  /*
   * Note: The set of pstates consists of contiguous integers, the
   * smallest of which is indicated by powernv_pstate_info.min, the
 @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
  static void powernv_cpufreq_throttle_check(unsigned int cpu)
  {
unsigned long pmsr;
 -  int pmsr_pmax, pmsr_lp;
 +  int pmsr_pmax, pmsr_lp, i;

pmsr = get_pmspr(SPRN_PMSR);

 +  for (i = 0; i < nr_chips; i++)
 +  if (chips[i].id == cpu_to_chip_id(cpu))
 +  break;
 +
/* Check for Pmax Capping */
pmsr_pmax = (s8)PMSR_MAX(pmsr);
if (pmsr_pmax != powernv_pstate_info.max) {
 -  throttled = true;
 -  pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
 -  pr_info("Max allowed Pstate is capped\n");
 +  if (chips[i].throttled)
 +  goto next;
 +  chips[i].throttled = true;
 +  pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
 +  chips[i].id, pmsr_pmax);
 +  } else if (chips[i].throttled) {
 +  chips[i].throttled = false;
>>>
>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>
>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>> sufficient to decide throttling/unthrottling when OCC is active for that chip.
> 
> Ok then we can perhaps exit after detecting unthrottling here.

This won't work for older firmwares which do not clear "Frequency control
enabled bit" on OCC reset cycle. So let us check for remaining two conditions on
unthrottling as well.
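
To make the per-chip reporting in the quoted hunk easier to follow, here is
a userspace sketch of just the Pmax part. It reports only on a change of
state and, as discussed above, does not return early, so a caller would
still go on to the remaining two checks. All names are hypothetical.

```c
enum sketch_pmax_event {
	SKETCH_PMAX_NO_CHANGE,
	SKETCH_PMAX_REDUCED,
	SKETCH_PMAX_RESTORED
};

/* Hypothetical per-chip state, shaped like the patch's struct chip. */
struct sketch_chip_state {
	unsigned int id;
	int throttled;
};

/*
 * Update the per-chip throttled flag from the current Pmax and say
 * whether anything should be logged. A chip that is already marked
 * throttled produces no new "reduced" event, which is what keeps the
 * message from repeating on every check.
 */
static enum sketch_pmax_event pmax_update(struct sketch_chip_state *chip,
					  int pmsr_pmax, int pstate_max)
{
	if (pmsr_pmax != pstate_max) {
		if (chip->throttled)
			return SKETCH_PMAX_NO_CHANGE;
		chip->throttled = 1;
		return SKETCH_PMAX_REDUCED;
	}

	if (chip->throttled) {
		chip->throttled = 0;
		return SKETCH_PMAX_RESTORED;
	}

	return SKETCH_PMAX_NO_CHANGE;
}
```
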

>>
>>>
 +  pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
 +  chips[i].id, pmsr_pmax);
}

/*
 * Check for Psafe by reading LocalPstate
 * or check if Psafe_mode_active is set in PMSR.
 */
 +next:
pmsr_lp = (s8)PMSR_LP(pmsr);
if ((pmsr_lp < powernv_pstate_info.min) ||
(pmsr & PMSR_PSAFE_ENABLE)) {
 @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = 
 {
.attr   = powernv_cpu_freq_attr,
>>>
>>> What about the situation where although occ is active, this particular
>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>> scenarios as well just like you have for pmsr_pmax ?
>>
>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>> because of global variable 'throttled', which is set to true on passing any of
>> these two conditions.
>>
>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>> control disabled" state. These two conditions are most likely to happen during
>> an OCC reset cycle which will occur across all chips.
> 
> Let us then add a comment to indicate that Psafe and frequency control
> disabled conditions will fail *only if OCC is inactive* and not
> otherwise and that this is a system wide phenomenon.
> 

I agree that adding a comment here will clear global vs local throttling
scenarios, but this will contradict the architectural design

Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Sudeep Holla

Hi Preeti,

On 07/05/15 06:26, Preeti U Murthy wrote:

> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU.  This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
>
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state where in ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
>
> Signed-off-by: Preeti U Murthy
> ---
> Based on linux-pm/bleeding-edge


I am unable to apply this patch cleanly on linux-pm/bleeding-edge.
I think it conflicts with a few patches that Rafael posted recently
which are in the branch now.

Regards,
Sudeep

[PATCH v2 2/2] powerpc/powernv: Extract EPOW events timeout values from OPAL device tree

2015-05-07 Thread Vipin K Parashar
OPAL exports platform timeout values for various EPOW events under the
EPOW device tree node. The EPOW node contains a sub node for each EPOW
class. Under each class, a platform timeout property file is located
for each EPOW event of that class. Each file contains the platform
timeout value, in seconds, for the corresponding EPOW event.
This patch adds support for extracting EPOW event timeout values from
the OPAL device tree. The property files below are parsed to extract
EPOW event timeout values.

 Power EPOW
 ===
 ups-timeout
 ups-low-timeout

 Temp EPOW
 ==
 high-ambient-temp-timeout
 crit-ambient-temp-timeout
 high-internal-temp-timeout
 crit-internal-temp-timeout

Signed-off-by: Vipin K Parashar 
---
 arch/powerpc/platforms/powernv/opal-power.c | 79 +
 1 file changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c
index 7c1b2f8..5b015f3 100644
--- a/arch/powerpc/platforms/powernv/opal-power.c
+++ b/arch/powerpc/platforms/powernv/opal-power.c
@@ -60,15 +60,7 @@ static const char * const epow_events_map[] = {
 };
 
 /* Poweroff EPOW events timeout values in seconds */
-static const int epow_timeout[] = {
-   [EPOW_POWER_UPS]= 900,
-   [EPOW_POWER_UPS_LOW]= 20,
-   [EPOW_TEMP_HIGH_AMB]= 900,
-   [EPOW_TEMP_CRIT_AMB]= 20,
-   [EPOW_TEMP_HIGH_INT]= 900,
-   [EPOW_TEMP_CRIT_INT]= 20,
-   [EPOW_UNKNOWN]  = 0,
-};
+static int epow_timeout[MAX_EPOW_EVENTS];
 
 /* System poweroff function. */
 static void epow_poweroff(unsigned long event)
@@ -125,6 +117,72 @@ static void stop_epow_timer(void)
pr_info("Poweroff timer deactivated\n");
 }
 
+/* Extract timeout value from device tree property */
+static int get_timeout_value(struct device_node *node, const char *prop)
+{
+   const __be32 *pval;
+   int timeout = 0;
+
+   pval = of_get_property(node, prop, NULL);
+   if (pval)
+   timeout = be32_to_cpup(pval);
+   else
+   pr_err("Didn't find %s dt property\n", prop);
+
+   return timeout;
+}
+
+/* Get EPOW events timeout values from OPAL device tree */
+static void get_epow_timeouts(void)
+{
+   struct device_node *epow_power, *epow_temp;
+
+   /* EPOW power class event timeouts */
+   epow_power = of_find_node_by_path("/ibm,opal/epow/power");
+   if (epow_power) {
+   epow_timeout[EPOW_POWER_UPS] =
+   get_timeout_value(epow_power, "ups-timeout");
+   pr_info("Power EPOW ups-timeout = %d seconds\n",
+   epow_timeout[EPOW_POWER_UPS]);
+
+   epow_timeout[EPOW_POWER_UPS_LOW] =
+   get_timeout_value(epow_power, "ups-low-timeout");
+   pr_info("Power EPOW ups-low-timeout = %d seconds\n",
+   epow_timeout[EPOW_POWER_UPS_LOW]);
+
+   of_node_put(epow_power);
+   } else
+   pr_info("Power EPOW class not supported in OPAL\n");
+
+   /* EPOW temp class event timeouts */
+   epow_temp = of_find_node_by_path("/ibm,opal/epow/temp");
+   if (epow_temp) {
+   epow_timeout[EPOW_TEMP_HIGH_AMB] =
+   get_timeout_value(epow_temp, "high-ambient-temp-timeout");
+   pr_info("Temp EPOW high-ambient-temp-timeout = %d seconds\n",
+   epow_timeout[EPOW_TEMP_HIGH_AMB]);
+
+   epow_timeout[EPOW_TEMP_CRIT_AMB] =
+   get_timeout_value(epow_temp, "crit-ambient-temp-timeout");
+   pr_info("Temp EPOW crit-ambient-temp-timeout = %d seconds\n",
+   epow_timeout[EPOW_TEMP_CRIT_AMB]);
+
+   epow_timeout[EPOW_TEMP_HIGH_INT] =
+   get_timeout_value(epow_temp, "high-internal-temp-timeout");
+   pr_info("Temp EPOW high-internal-temp-timeout = %d seconds\n",
+   epow_timeout[EPOW_TEMP_HIGH_INT]);
+
+   epow_timeout[EPOW_TEMP_CRIT_INT] =
+   get_timeout_value(epow_temp, "crit-internal-temp-timeout");
+   pr_info("Temp EPOW crit-internal-temp-timeout = %d seconds\n",
+   epow_timeout[EPOW_TEMP_CRIT_INT]);
+
+   of_node_put(epow_temp);
+   } else
+   pr_info("Temp EPOW class not supported in OPAL\n");
+
+}
+
 /* Get DPO status */
 static bool get_dpo_status(int32_t *dpo_timeout)
 {
@@ -366,6 +424,9 @@ static int __init opal_poweroff_events_init(void)
init_timer(&epow_timer);
epow_timer.function = epow_poweroff;
 
+   /* Get EPOW events timeout value */
+   get_epow_timeouts();
+
/* Register EPOW event notifier */
ret = opal_message_notifier_register(OPAL_MSG_EPOW,
&opal_epow_nb);
-- 
1.9.3
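
For readers following get_timeout_value() above: device tree cells are stored big-endian, which is why the patch runs the property through be32_to_cpup(). A minimal userspace sketch of that decode step is below. The names be32_decode() and timeout_from_prop() are hypothetical stand-ins, not kernel APIs, and the NULL-means-zero fallback simply mirrors the patch's "property not found" path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one 32-bit big-endian cell into host byte order.
 * Hypothetical userspace equivalent of what be32_to_cpup() does.
 */
static uint32_t be32_decode(const uint8_t *cell)
{
	assert(cell != NULL);
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8)  |  (uint32_t)cell[3];
}

/*
 * Hypothetical stand-in for get_timeout_value(): a missing property
 * (NULL) yields a timeout of 0, as in the patch.
 */
static int timeout_from_prop(const uint8_t *prop)
{
	return prop ? (int)be32_decode(prop) : 0;
}
```

With this sketch, a property holding the bytes 00 00 03 84 decodes to 900 seconds regardless of host endianness.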


[PATCH v2 1/2] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform

2015-05-07 Thread Vipin K Parashar
This patch adds support for FSP EPOW (Early Power Off Warning) and
DPO (Delayed Power Off) events on the PowerNV platform.  EPOW events
are generated by SPCN/FSP due to various critical system conditions that
need system shutdown.  A few examples of these conditions are high ambient
temperature or the system running on UPS power with low UPS battery. A DPO
event is generated in response to an admin-initiated system shutdown request.
This patch enables the host kernel on the PowerNV platform to handle OPAL
notifications for these events and initiate system poweroff. Since EPOW
notifications are sent in advance of the impending shutdown event, this
patch also adds functionality to wait for the EPOW condition to return to
normal. The host allows MAX_POWEROFF_SYS_TIME (600 seconds) as system
poweroff time (time for host + guests shutdown) and waits for the remaining
time for the EPOW condition to return to normal. If the EPOW condition
doesn't return to normal in the calculated time, it proceeds with graceful
system shutdown. For EPOW events with timeout values smaller than
MAX_POWEROFF_SYS_TIME, it proceeds with system shutdown without waiting
for the EPOW condition to return to normal.
The system admin can also add systemd service shutdown scripts to
perform any specific actions like graceful guest shutdown upon system
poweroff. libvirt-guests is a systemd service available on recent distros
for management of guests at system start/shutdown time.

Signed-off-by: Vipin K Parashar 
---
 arch/powerpc/include/asm/opal-api.h|  30 ++
 arch/powerpc/include/asm/opal.h|   3 +-
 arch/powerpc/platforms/powernv/opal-power.c| 379 +++--
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 4 files changed, 391 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0321a90..03b3cef 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -730,6 +730,36 @@ struct opal_i2c_request {
__be64 buffer_ra;   /* Buffer real address */
 };
 
+/*
+ * EPOW status sharing (OPAL and the host)
+ *
+ * The host passes OPAL a buffer of length OPAL_MAX_EPOW_CLASSES
+ * to fetch system wide EPOW status. Each element in the returned buffer
+ * will contain bitwise EPOW status for each EPOW sub class.
+ */
+
+/* EPOW types */
+enum OpalEpow {
+   OPAL_EPOW_POWER = 0,/* Power EPOW */
+   OPAL_EPOW_TEMP  = 1,/* Temperature EPOW */
+   OPAL_EPOW_COOLING   = 2,/* Cooling EPOW */
+   OPAL_MAX_EPOW_CLASSES   = 3,/* Max EPOW categories */
+};
+
+/* Power EPOW events */
+enum OpalEpowPower {
+   OPAL_EPOW_POWER_UPS = 0x1, /* System on UPS power */
+   OPAL_EPOW_POWER_UPS_LOW = 0x2, /* System on UPS power with low battery*/
+};
+
+/* Temperature EPOW events */
+enum OpalEpowTemp {
+   OPAL_EPOW_TEMP_HIGH_AMB = 0x1, /* High ambient temperature */
+   OPAL_EPOW_TEMP_CRIT_AMB = 0x2, /* Critical ambient temperature */
+   OPAL_EPOW_TEMP_HIGH_INT = 0x4, /* High internal temperature */
+   OPAL_EPOW_TEMP_CRIT_INT = 0x8, /* Critical internal temperature */
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 042af1a..0777864 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -141,7 +141,6 @@ int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action);
 int64_t opal_set_slot_led_status(uint64_t phb_id, uint64_t slot_id, uint8_t led_type, uint8_t led_action);
-int64_t opal_get_epow_status(__be64 *status);
 int64_t opal_set_system_attention_led(uint8_t led_action);
 int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
__be16 *pci_error_type, __be16 *severity);
@@ -200,6 +199,8 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf,
uint64_t size, uint64_t token);
 int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size,
uint64_t token);
+int32_t opal_get_epow_status(__be32 *status, __be32 *num_classes);
+int32_t opal_get_dpo_status(__be32 *timeout);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c
index ac46c2c..7c1b2f8 100644
--- a/arch/powerpc/platforms/powernv/opal-power.c
+++ b/arch/powerpc/platforms/powernv/opal-power.c
@@ -1,5 +1,5 @@
 /*
- * PowerNV OPAL power control for graceful shutdown handling
+ * PowerNV poweroff events support
  *
  * Copyright 2015 IBM Corp.
  *
@@ -9,58 +9,395 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#

[PATCH v2 0/2] Poweroff (EPOW, DPO) events support for PowerNV platform

2015-05-07 Thread Vipin K Parashar
This patchset adds support for FSP EPOW (Early Power Off Warning) and
DPO (Delayed Power Off) events on the PowerNV platform.  EPOW events
are generated by SPCN/FSP due to various critical system conditions that
need system shutdown. A few examples of these conditions are high ambient
temperature or the system running on UPS power with low UPS battery. A DPO
event is generated in response to an admin-initiated system shutdown request.
This patchset enables the host kernel on the PowerNV platform to handle OPAL
notifications for these events and initiate system poweroff. Since EPOW
notifications are sent in advance of the impending shutdown event,
functionality is also added to wait for the EPOW condition to return to
normal. EPOW event timeout values are available via OPAL exported device
tree values under the EPOW node.
The host kernel allows MAX_POWEROFF_SYS_TIME (600 seconds) as system
poweroff time (time for host + guests shutdown) and waits for the remaining
time for the EPOW condition to return to normal. If the EPOW condition
doesn't return to normal in the calculated time, it proceeds with graceful
system shutdown. For EPOW events with timeout values smaller than
MAX_POWEROFF_SYS_TIME, it proceeds with system shutdown without waiting
for the EPOW condition to return to normal.
The system admin can also add systemd service shutdown scripts to
perform any specific actions like graceful guest shutdown upon system
poweroff. libvirt-guests is a systemd service available on recent distros
for management of guests at system start/shutdown time.

Changes in v2:
 - Made code changes to improve code as per previous review comments.
 - Added patch to obtain EPOW event timeout values from OPAL device-tree.

Vipin K Parashar (2):
  powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV
platform
  powerpc/powernv: Extract EPOW events timeout values from OPAL device
tree

 arch/powerpc/include/asm/opal-api.h|  30 ++
 arch/powerpc/include/asm/opal.h|   3 +-
 arch/powerpc/platforms/powernv/opal-power.c| 440 +++--
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 4 files changed, 452 insertions(+), 22 deletions(-)

--
1.9.3
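
The wait-time rule described in the cover letter can be sketched in a few lines of C. This is only an illustration of the arithmetic as the text states it, not the code from the patch: the helper name epow_wait_seconds() is hypothetical, and treating a timeout exactly equal to MAX_POWEROFF_SYS_TIME as "no wait" is an assumption (the remaining time is zero in that case either way).

```c
#include <assert.h>

#define MAX_POWEROFF_SYS_TIME 600 /* seconds, value quoted in the cover letter */

/*
 * Hypothetical helper: seconds the host may wait for the EPOW condition
 * to return to normal before it must start a graceful shutdown.
 */
static int epow_wait_seconds(int event_timeout)
{
	assert(event_timeout >= 0);

	/* Not enough time to both wait and shut down: start shutdown now. */
	if (event_timeout <= MAX_POWEROFF_SYS_TIME)
		return 0;

	/* Reserve MAX_POWEROFF_SYS_TIME for host + guest shutdown. */
	return event_timeout - MAX_POWEROFF_SYS_TIME;
}
```

For example, a 900 second platform timeout leaves 300 seconds of waiting, while a 20 second timeout leaves none.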


Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions

2015-05-07 Thread Kirill A. Shutemov
On Thu, May 07, 2015 at 12:53:27PM +0530, Aneesh Kumar K.V wrote:
> After this patch pmdp_* functions operate only on hugepage pte,
> and not on regular pmd_t values pointing to page table.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
>  arch/powerpc/mm/pgtable_64.c | 76 
> +---
>  include/asm-generic/pgtable.h| 19 
>  mm/huge_memory.c |  2 +-
>  4 files changed, 65 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
> b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 43e6ad424c7f..50830c9a2116 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct 
> *mm, unsigned long addr,
>  extern void pmdp_splitting_flush(struct vm_area_struct *vma,
>unsigned long address, pmd_t *pmdp);
>  
> +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
> +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> +  unsigned long address, pmd_t *pmdp);
> +
>  #define __HAVE_ARCH_PGTABLE_DEPOSIT
>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>  pgtable_t pgtable);
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index 59daa5eeec25..9171c1a37290 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t pmd;
>  
>   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> - if (pmd_trans_huge(*pmdp)) {
> - pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
> - } else {
> - /*
> -  * khugepaged calls this for normal pmd
> -  */
> - pmd = *pmdp;
> - pmd_clear(pmdp);
> - /*
> -  * Wait for all pending hash_page to finish. This is needed
> -  * in case of subpage collapse. When we collapse normal pages
> -  * to hugepage, we first clear the pmd, then invalidate all
> -  * the PTE entries. The assumption here is that any low level
> -  * page fault will see a none pmd and take the slow path that
> -  * will wait on mmap_sem. But we could very well be in a
> -  * hash_page with local ptep pointer value. Such a hash page
> -  * can result in adding new HPTE entries for normal subpages.
> -  * That means we could be modifying the page content as we
> -  * copy them to a huge page. So wait for parallel hash_page
> -  * to finish before invalidating HPTE entries. We can do this
> -  * by sending an IPI to all the cpus and executing a dummy
> -  * function there.
> -  */
> - kick_all_cpus_sync();
> - /*
> -  * Now invalidate the hpte entries in the range
> -  * covered by pmd. This make sure we take a
> -  * fault and will find the pmd as none, which will
> -  * result in a major fault which takes mmap_sem and
> -  * hence wait for collapse to complete. Without this
> -  * the __collapse_huge_page_copy can result in copying
> -  * the old content.
> -  */
> - flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
> - }
> + VM_BUG_ON(!pmd_trans_huge(*pmdp));
> + pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
> + return pmd;

The patches are in reverse order: you need to change pmdp_get_and_clear
first, otherwise you break bisectability.
Or better, merge the patches together.

> +}
> +
> +pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> +   pmd_t *pmdp)
> +{
> + pmd_t pmd;
> +
> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> + VM_BUG_ON(pmd_trans_huge(*pmdp));
> +
> + pmd = *pmdp;
> + pmd_clear(pmdp);
> + /*
> +  * Wait for all pending hash_page to finish. This is needed
> +  * in case of subpage collapse. When we collapse normal pages
> +  * to hugepage, we first clear the pmd, then invalidate all
> +  * the PTE entries. The assumption here is that any low level
> +  * page fault will see a none pmd and take the slow path that
> +  * will wait on mmap_sem. But we could very well be in a
> +  * hash_page with local ptep pointer value. Such a hash page
> +  * can result in adding new HPTE entries for normal subpages.
> +  * That means we could be modifying the page content as we
> +  * copy them to a huge page. So wait for parallel hash_page
> +  * to finish before invalidating HPTE entries. We can do this
> +  * by sending an IPI to all the cpus and executing a dummy

Re: [RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page

2015-05-07 Thread Benjamin Herrenschmidt
On Thu, 2015-05-07 at 12:46 +0530, Aneesh Kumar K.V wrote:
> We need to check whether pte is present in follow_huge_addr and
> properly return NULL if mapping is not present. Also use READ_ONCE
> when dereferencing pte_t address.

Does that need to go to stable as well?

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0ce968b00b7c..f5688423bc69 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
> - pte_t *ptep;
> - struct page *page;
> + pte_t *ptep, pte;
>   unsigned shift;
>   unsigned long mask, flags;
> + struct page *page = ERR_PTR(-EINVAL);
> +
> + local_irq_save(flags);
> + ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> + if (!ptep)
> + goto no_page;
> + pte = READ_ONCE(*ptep);
>   /*
> +  * Verify it is a huge page else bail.
>* Transparent hugepages are handled by generic code. We can skip them
>* here.
>*/
> - local_irq_save(flags);
> - ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> + if (!shift || pmd_trans_huge((pmd_t)pte))
> + goto no_page;
>  
> - /* Verify it is a huge page else bail. */
> - if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
> - local_irq_restore(flags);
> - return ERR_PTR(-EINVAL);
> + if (!pte_present(pte)) {
> + page = NULL;
> + goto no_page;
>   }
>   mask = (1UL << shift) - 1;
> - page = pte_page(*ptep);
> + page = pte_page(pte);
>   if (page)
>   page += (address & mask) / PAGE_SIZE;
>  
> +no_page:
>   local_irq_restore(flags);
>   return page;
>  }



[PATCH V2 2/2] powerpc/thp: Serialize pmd clear against a linux page table walk.

2015-05-07 Thread Aneesh Kumar K.V
Serialize against find_linux_pte_or_hugepte, which does a lock-less
lookup in page tables with local interrupts disabled. For huge pages
it casts pmd_t to pte_t. Since the format of pte_t is different from
pmd_t, we want to prevent a transition from a pmd pointing to a page
table to a pmd pointing to a huge page (and back) while interrupts are
disabled. We clear the pmd to possibly replace it with a page table
pointer in different code paths. So make sure we wait for the parallel
find_linux_pte_or_hugepte to finish.

Reported-by: Kirill A. Shutemov 
Signed-off-by: Aneesh Kumar K.V 
---
Changes from v1:
* Move kick_all_cpus_sync to pmdp_get_and_clear so that it handles the
  zap_huge_pmd case also.

 arch/powerpc/mm/pgtable_64.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 9171c1a37290..049d961802aa 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -845,6 +845,17 @@ pmd_t pmdp_get_and_clear(struct mm_struct *mm,
 * hash fault look at them.
 */
memset(pgtable, 0, PTE_FRAG_SIZE);
+   /*
+* Serialize against find_linux_pte_or_hugepte which does lock-less
+* lookup in page tables with local interrupts disabled. For huge pages
+* it casts pmd_t to pte_t. Since format of pte_t is different from
+* pmd_t we want to prevent transit from pmd pointing to page table
+* to pmd pointing to huge page (and back) while interrupts are disabled.
+* We clear pmd to possibly replace it with page table pointer in
+* different code paths. So make sure we wait for the parallel
+* find_linux_pte_or_hugepte to finish.
+*/
+   kick_all_cpus_sync();
return old_pmd;
 }
 
-- 
2.1.4


[PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions

2015-05-07 Thread Aneesh Kumar K.V
After this patch pmdp_* functions operate only on hugepage pte,
and not on regular pmd_t values pointing to page table.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
 arch/powerpc/mm/pgtable_64.c | 76 +---
 include/asm-generic/pgtable.h| 19 
 mm/huge_memory.c |  2 +-
 4 files changed, 65 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 43e6ad424c7f..50830c9a2116 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 extern void pmdp_splitting_flush(struct vm_area_struct *vma,
 unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+unsigned long address, pmd_t *pmdp);
+
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
   pgtable_t pgtable);
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 59daa5eeec25..9171c1a37290 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -560,41 +560,47 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t pmd;
 
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-   if (pmd_trans_huge(*pmdp)) {
-   pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
-   } else {
-   /*
-* khugepaged calls this for normal pmd
-*/
-   pmd = *pmdp;
-   pmd_clear(pmdp);
-   /*
-* Wait for all pending hash_page to finish. This is needed
-* in case of subpage collapse. When we collapse normal pages
-* to hugepage, we first clear the pmd, then invalidate all
-* the PTE entries. The assumption here is that any low level
-* page fault will see a none pmd and take the slow path that
-* will wait on mmap_sem. But we could very well be in a
-* hash_page with local ptep pointer value. Such a hash page
-* can result in adding new HPTE entries for normal subpages.
-* That means we could be modifying the page content as we
-* copy them to a huge page. So wait for parallel hash_page
-* to finish before invalidating HPTE entries. We can do this
-* by sending an IPI to all the cpus and executing a dummy
-* function there.
-*/
-   kick_all_cpus_sync();
-   /*
-* Now invalidate the hpte entries in the range
-* covered by pmd. This make sure we take a
-* fault and will find the pmd as none, which will
-* result in a major fault which takes mmap_sem and
-* hence wait for collapse to complete. Without this
-* the __collapse_huge_page_copy can result in copying
-* the old content.
-*/
-   flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
-   }
+   VM_BUG_ON(!pmd_trans_huge(*pmdp));
+   pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
+   return pmd;
+}
+
+pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+   pmd_t pmd;
+
+   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+   VM_BUG_ON(pmd_trans_huge(*pmdp));
+
+   pmd = *pmdp;
+   pmd_clear(pmdp);
+   /*
+* Wait for all pending hash_page to finish. This is needed
+* in case of subpage collapse. When we collapse normal pages
+* to hugepage, we first clear the pmd, then invalidate all
+* the PTE entries. The assumption here is that any low level
+* page fault will see a none pmd and take the slow path that
+* will wait on mmap_sem. But we could very well be in a
+* hash_page with local ptep pointer value. Such a hash page
+* can result in adding new HPTE entries for normal subpages.
+* That means we could be modifying the page content as we
+* copy them to a huge page. So wait for parallel hash_page
+* to finish before invalidating HPTE entries. We can do this
+* by sending an IPI to all the cpus and executing a dummy
+* function there.
+*/
+   kick_all_cpus_sync();
+   /*
+* Now invalidate the hpte entries in the range
+* covered by pmd. This make sure we take a
+* fault and will find the pmd as none, which will
+* result in a major fault which takes 

[RFC PATCH] powerpc/mm: Return NULL for not present hugetlb page

2015-05-07 Thread Aneesh Kumar K.V
We need to check whether pte is present in follow_huge_addr and
properly return NULL if mapping is not present. Also use READ_ONCE
when dereferencing pte_t address.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0ce968b00b7c..f5688423bc69 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -689,27 +689,34 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
-   pte_t *ptep;
-   struct page *page;
+   pte_t *ptep, pte;
unsigned shift;
unsigned long mask, flags;
+   struct page *page = ERR_PTR(-EINVAL);
+
+   local_irq_save(flags);
+   ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+   if (!ptep)
+   goto no_page;
+   pte = READ_ONCE(*ptep);
/*
+* Verify it is a huge page else bail.
 * Transparent hugepages are handled by generic code. We can skip them
 * here.
 */
-   local_irq_save(flags);
-   ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
+   if (!shift || pmd_trans_huge((pmd_t)pte))
+   goto no_page;
 
-   /* Verify it is a huge page else bail. */
-   if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep)) {
-   local_irq_restore(flags);
-   return ERR_PTR(-EINVAL);
+   if (!pte_present(pte)) {
+   page = NULL;
+   goto no_page;
}
mask = (1UL << shift) - 1;
-   page = pte_page(*ptep);
+   page = pte_page(pte);
if (page)
page += (address & mask) / PAGE_SIZE;
 
+no_page:
local_irq_restore(flags);
return page;
 }
-- 
2.1.4
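
The last step of follow_huge_addr() above, picking the base page inside the huge page, is plain mask arithmetic on the shift returned by find_linux_pte_or_hugepte(). A minimal userspace sketch follows. The helper name hugepage_subpage_index() is hypothetical, and PAGE_SHIFT is assumed to be 12 (4 KiB base pages) purely for illustration; the kernel uses the configured page size.

```c
#include <assert.h>

#define PAGE_SHIFT 12                   /* assumption: 4 KiB base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/*
 * Hypothetical helper: index of the base page, within a huge page mapped
 * with the given shift, that contains 'address'. Mirrors
 *     mask = (1UL << shift) - 1;
 *     page += (address & mask) / PAGE_SIZE;
 * from the patch.
 */
static unsigned long hugepage_subpage_index(unsigned long address,
					    unsigned int shift)
{
	unsigned long mask;

	assert(shift >= PAGE_SHIFT && shift < 8 * sizeof(unsigned long));

	mask = (1UL << shift) - 1;          /* offset bits inside the huge page */
	return (address & mask) / PAGE_SIZE;
}
```

With a 16 MiB huge page (shift 24), an address 0x3000 bytes into the mapping selects base page 3.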
