Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-07 Thread Hari Bathini

Hi Michael,


On Friday 07 April 2017 07:24 AM, Michael Ellerman wrote:

Hari Bathini  writes:


In case of fadump, the capture (fadump) kernel boots like a normal kernel.
While this has its advantages, the capture kernel initializes all the
components a normal kernel does, which may not all be needed in a typical
dump capture kernel. So, the fadump capture kernel ends up needing more
memory than a typical (read: kdump) capture kernel to boot.

This can be overcome by introducing parameters like fadump_nr_cpus=1,
similar to the nr_cpus=1 parameter, applicable only when fadump is active.
But this approach requires introducing such a special parameter, applicable
only when fadump is active (capture kernel), for every parameter that
reduces memory/resource consumption.

A better approach would be to pass extra parameters to the fadump capture
kernel. As firmware leaves the memory contents intact from the time of
crash till the new kernel is booted up, the parameters to append can be
saved in a real memory region and retrieved later, while the capture
kernel is in its early boot process, for appending to its command line.

This patch introduces a new node, /sys/kernel/fadump_cmdline_append, to
specify the parameters to pass to the fadump capture kernel, saves them
in a real memory region, and appends them to the capture kernel's command
line early in its boot process.

As we discussed on IRC I don't really like this.

It's clever, (ab)using the fact that the first kernel's memory is left
intact. But it's also a bit gross :)


No doubt. It is an ugly trick :)


It also has a few real problems, like hard coding 128MB as the handover
location. You may not have memory there, or it may be reserved.



Yeah, there is a chance that appending parameters is not possible,
as in the scenarios you mentioned above. My intention behind this
hack was to build on this handover area later, to perhaps pass a
special initrd that brings down the dump capture time and memory
consumption further. But to put it in your words, that would be abusing
it even more :P. So, I will take it as a road not worth taking..


My preference would be that the fadump kernel "just works". If it's
using too much memory then the fadump kernel should do whatever it needs
to use less memory, eg. shrinking nr_cpu_ids etc.



Do we actually know *why* the fadump kernel is running out of memory?
Obviously large numbers of CPUs is one of the main drivers (lots of
stacks required). But other than that what is causing the memory
pressure? I would like some data on that before we proceed.


It needs almost the same amount of memory as is required to boot the
production kernel, but much of that is unwarranted for the fadump
(dump capture) kernel. Let's say the production kernel is configured for
memory cgroups or hugepages, which are not required in a dump capture
kernel; with no option to say so, we are wasting that much more memory on
fadump and eventually depriving the production kernel of that memory.

So, it would be beneficial if parameters like cgroup_disable=memory,
transparent_hugepages=never, numa=off, nr_cpus=1, etc. are passed to the
fadump (dump capture) kernel. Not to mention any future additions to the
kernel that increase the footprint of a production kernel..


If we *must* have a way to pass command line arguments to the fadump
kernel then I think we should just use a command line argument that
specifies them.

eg: fadump_append=nr_cpus=1,use_less_memory,some_other_obscure_parameter=100




Hmmm.. this sounds like a better interface. But I would like to know your
preference on how to process the fadump_append parameter:

1. Modify cmdline early in the fadump kernel boot process (before
   parameters are parsed), replacing fadump_append="nr_cpus=1
   cgroup_disable=memory" in cmdline with the bare nr_cpus=1
   cgroup_disable=memory, so that fadump doesn't have to bother about
   processing these parameters later.
2. A parse function in fadump for the fadump_append parameters - a
   function similar to parse_early_param(), meant for the fadump_append
   parameter alone (see the sketch below)..
3. fadump code processes fadump_append for each parameter passed in it.

The third one sounds like a nightmare to me, as we would need to make
fadump code aware of every new parameter we want to enforce on fadump..
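
For what it's worth, a minimal sketch of what option 2 could look like
(hypothetical code, not from any posted patch; note do_early_param() is
static in init/main.c, so a real implementation would need its own
equivalent callback, and all names here are purely illustrative):

	/* Hypothetical sketch of option 2: a parse_early_param()-style
	 * helper dedicated to the fadump_append parameter.
	 */
	static char fadump_append_args[COMMAND_LINE_SIZE] __initdata;

	static int __init fadump_append_handler(char *param, char *val,
						const char *unused, void *arg)
	{
		/* Hand each appended parameter to its early handler */
		return do_early_param(param, val, unused, arg);
	}

	static void __init fadump_parse_append_params(const char *append)
	{
		strlcpy(fadump_append_args, append,
			sizeof(fadump_append_args));
		parse_args("fadump append", fadump_append_args, NULL,
			   0, 0, 0, NULL, fadump_append_handler);
	}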

Thanks
Hari



kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-07 Thread Sachin Sant
I have run into a few instances where the lost_exception_test from
the powerpc kselftests fails with SIGABRT. The following output is
against 4.11.0-rc5. The failure is intermittent.

When the test fails it is killed due to SIGABRT.

# ./lost_exception_test 
test: lost_exception
tags: git_version:unknown
Binding to cpu 8
main test running as pid 9208
EBB Handler is at 0x10003dcc
!! killing lost_exception
ebb_state:
  ebb_count= 191529
  spurious = 0
  negative = 0
  no_overflow  = 0
  pmc[1] count = 0x0
  pmc[2] count = 0x0
  pmc[3] count = 0x0
  pmc[4] count = 0x4c1b707
  pmc[5] count = 0x0
  pmc[6] count = 0x0
HW state:
MMCR0 0x8080 FC PMAO 
MMCR2 0x
EBBHR 0x10003dcc
BESCR 0x8001 GE PMAE 
PMC1  0x
PMC2  0x
PMC3  0x
PMC4  0x8000
PMC5  0x88d4f0c8
PMC6  0x1e49da22
SIAR  0x3fffad60a608
!! child died by signal 6
failure: lost_exception
#

Thanks
-Sachin




Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-07 Thread Hari Bathini



On Friday 07 April 2017 12:54 PM, Hari Bathini wrote:

[... full quote of the parent message trimmed; see the previous message
in this thread ...]

I prefer option 2, as it is simple and cleaner..

Thanks
Hari



Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-07 Thread Peter Zijlstra
On Thu, Apr 06, 2017 at 10:31:46AM -0700, Linus Torvalds wrote:
> But maybe "monitor" is really cheap. I suspect it's microcoded,
> though, which implies "no".

On my IVB-EP (will also try on something newer):

MONITOR ~332 cycles
MWAIT   ~224 cycles (C0, explicitly invalidated MONITOR)

So yes, expensive.


[PATCH] ibmveth: Support to enable LSO/CSO for Trunk VEA.

2017-04-07 Thread Sivakumar Krishnasamy
Enable largesend and checksum offload for ibmveth configured in trunk mode.
Added support for SKBs with a frag_list in the TX path by linearizing them.

Signed-off-by: Sivakumar Krishnasamy 
---
 drivers/net/ethernet/ibm/ibmveth.c | 102 ++---
 drivers/net/ethernet/ibm/ibmveth.h |   1 +
 2 files changed, 85 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 72ab7b6..e1e238d 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -46,6 +46,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "ibmveth.h"
 
@@ -808,8 +810,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
 
ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
 
-   if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
-   !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
+   if (ret == H_SUCCESS &&
(ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
 set_attr, &ret_attr);
@@ -1040,6 +1041,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
dma_addr_t dma_addr;
unsigned long mss = 0;
 
+   /* veth doesn't handle frag_list, so linearize the skb.
+* When GRO is enabled SKB's can have frag_list.
+*/
+   if (adapter->is_active_trunk &&
+   skb_has_frag_list(skb) && __skb_linearize(skb)) {
+   netdev->stats.tx_dropped++;
+   goto out;
+   }
+
/*
 * veth handles a maximum of 6 segments including the header, so
 * we have to linearize the skb if there are more than this.
@@ -1064,9 +1074,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 
desc_flags = IBMVETH_BUF_VALID;
 
-   if (skb_is_gso(skb) && adapter->fw_large_send_support)
-   desc_flags |= IBMVETH_BUF_LRG_SND;
-
if (skb->ip_summed == CHECKSUM_PARTIAL) {
unsigned char *buf = skb_transport_header(skb) +
skb->csum_offset;
@@ -1076,6 +1083,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
/* Need to zero out the checksum */
buf[0] = 0;
buf[1] = 0;
+
+   if (skb_is_gso(skb) && adapter->fw_large_send_support)
+   desc_flags |= IBMVETH_BUF_LRG_SND;
}
 
 retry_bounce:
@@ -1128,7 +1138,7 @@ retry_bounce:
descs[i+1].fields.address = dma_addr;
}
 
-   if (skb_is_gso(skb)) {
+   if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
if (adapter->fw_large_send_support) {
mss = (unsigned long)skb_shinfo(skb)->gso_size;
adapter->tx_large_packets++;
@@ -1232,6 +1242,66 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
}
 }
 
+static void ibmveth_rx_csum_helper(struct sk_buff *skb,
+  struct ibmveth_adapter *adapter)
+{
+   struct iphdr *iph = NULL;
+   struct ipv6hdr *iph6 = NULL;
+   __be16 skb_proto = 0;
+   u16 iphlen = 0;
+   u16 iph_proto = 0;
+   u16 tcphdrlen = 0;
+
+   skb_proto = be16_to_cpu(skb->protocol);
+
+   if (skb_proto == ETH_P_IP) {
+   iph = (struct iphdr *)skb->data;
+
+   /* If the IP checksum is not offloaded and if the packet
+*  is large send, the checksum must be rebuilt.
+*/
+   if (iph->check == 0x) {
+   iph->check = 0;
+   iph->check = ip_fast_csum((unsigned char *)iph,
+ iph->ihl);
+   }
+
+   iphlen = iph->ihl * 4;
+   iph_proto = iph->protocol;
+   } else if (skb_proto == ETH_P_IPV6) {
+   iph6 = (struct ipv6hdr *)skb->data;
+   iphlen = sizeof(struct ipv6hdr);
+   iph_proto = iph6->nexthdr;
+   }
+
+   /* In OVS environment, when a flow is not cached, specifically for a
+* new TCP connection, the first (SYN) packet information is passed up
+* the user space for finding a flow. During this process, OVS computes
+* checksum on the packet when CHECKSUM_PARTIAL flag is set.
+* Given that we zeroed out TCP checksum field in transmit path as we
+* set "no checksum bit", OVS computed checksum will be incorrect w/o
+* TCP pseudo checksum in the packet.
+* So, re-compute TCP pseudo header checksum.
+*/
+   if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
+   struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
+
+   tcphdrlen = skb->len - iphlen;
+
+   
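
(The hunk above is truncated by the archive. For reference only, and not
necessarily the literal remainder of the patch: recomputing a TCP
pseudo-header checksum with the standard kernel helpers would look
roughly like this.)

	/* Sketch: seed tcph->check with the pseudo-header checksum so
	 * a later checksum over the TCP segment comes out right.
	 */
	if (skb_proto == ETH_P_IP)
		tcph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
						 tcphdrlen, iph_proto, 0);
	else if (skb_proto == ETH_P_IPV6)
		tcph->check = ~csum_ipv6_magic(&iph6->saddr, &iph6->daddr,
					       tcphdrlen, iph_proto, 0);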

Re: [RFC PATCH 1/7] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use
> that instead of set_pte_at.
>

Though set_huge_pte_at() calls set_pte_at() on powerpc,
changing this in the generic code makes sense.



Re: [RFC PATCH 2/7] mm/follow_page_mask: Split follow_page_mask to smaller functions.

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> Makes code reading easy. No functional changes in this patch.

The description should mention how the follow function is
broken down into PGD, PUD and PMD follow functions on a
4 level page table system. It needs to be a bit more verbose.



Re: [RFC PATCH 3/7] mm/hugetlb: export hugetlb_entry_migration helper

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> We will be using this later from the ppc64 code. Change the return type to 
> bool.

How were all the other architectures able to detect hugetlb
migration entries without using this helper function before?



Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-07 Thread Nicholas Piggin
On Fri, 7 Apr 2017 11:43:49 +0200
Peter Zijlstra  wrote:

> On Thu, Apr 06, 2017 at 10:31:46AM -0700, Linus Torvalds wrote:
> > But maybe "monitor" is really cheap. I suspect it's microcoded,
> > though, which implies "no".  
> 
> On my IVB-EP (will also try on something newer):
> 
> MONITOR   ~332 cycles
> MWAIT ~224 cycles (C0, explicitly invalidated MONITOR)
> 
> So yes, expensive.

Interestingly, Intel optimization manual says:

  The latency of PAUSE instruction in prior generation microarchitecture
  is about 10 cycles, whereas on Skylake microarchitecture it has been
  extended to as many as 140 cycles.

In another part this is claimed as an efficiency improvement. Still much
cheaper than monitor+mwait on your IVB, but if Skylake's is a bit
faster it might become worth it.
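
For reference, the usage pattern the RFC proposes looks roughly like the
sketch below (primitive names as in the RFC posting; on x86,
spin_cpu_relax() would presumably map to PAUSE, which is where the
latency above matters):

	/* Sketch: wait for *flag to become non-zero using the proposed
	 * primitives instead of a bare cpu_relax() loop. spin_begin()/
	 * spin_end() bracket the whole wait, so e.g. powerpc can drop
	 * to SMT low priority once rather than once per iteration.
	 */
	static void wait_on_flag(const volatile int *flag)
	{
		spin_begin();
		while (!READ_ONCE(*flag))
			spin_cpu_relax();
		spin_end();
	}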



Re: [RFC PATCH 4/7] mm/follow_page_mask: Add support for hugepage directory entry

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> The default implementation prints a warning and returns NULL. We will add
> ppc64 support in later patches.

The description is not sufficient. The patch makes the entire follow
page mask function aware of hugepd based implementation at PGD, PUD
and PMD level. It also provides default follow_huge_pd() function
which in absence of architecture support prints warning and returns
NULL. The commit description should contain all these details. 



Re: [RFC PATCH 5/7] mm/follow_page_mask: Add support for hugetlb pgd entries.

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries to
> follow_page_mask so that ppc64 can switch to it to handle hugetlbe entries.
> 
> Signed-off-by: Aneesh Kumar K.V 

This was exactly what two of the patches I posted last year proposed,
with somewhat more descriptive commit messages. Making the follow page
mask function aware of PGD based HugeTLB can be sent separately to core
MM. I will send it out to the mm list soon for fresh consideration,
which will make this series one patch shorter.

https://patchwork.ozlabs.org/patch/595033/
https://patchwork.ozlabs.org/patch/595037/




Re: [RFC PATCH 7/7] powerpc/hugetlb: Enable hugetlb migration for ppc64

2017-04-07 Thread Anshuman Khandual
On 04/04/2017 07:34 PM, Aneesh Kumar K.V wrote:
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/platforms/Kconfig.cputype | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 382c3dd86d6d..c0ca27521679 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -350,6 +350,11 @@ config PPC_RADIX_MMU
> is only implemented by IBM Power9 CPUs, if you don't have one of them
> you can probably disable this.
> 
> +config ARCH_ENABLE_HUGEPAGE_MIGRATION
> + def_bool y
> + depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION
> +

I have tested this patch series both for anon and file mapping
on a POWER8 box with 16MB pages. Will try to test it for 16GB
pages on a PVM system.





Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-07 Thread Michael Ellerman
Sachin Sant  writes:

> I have run into few instances where the lost_exception_test from
> powerpc kselftest fails with SIGABRT. Following o/p is against
> 4.11.0-rc5. The failure is intermittent. 

What hardware are you on?

How long does it take to run when it fails? I assume ~2 minutes?

> When the test fails it is killed due to SIGABRT.

> # ./lost_exception_test 
> test: lost_exception
> tags: git_version:unknown
> Binding to cpu 8
> main test running as pid 9208
> EBB Handler is at 0x10003dcc
> !! killing lost_exception

This is the parent (the test harness) saying it's about to kill the
child, because it took too long.

It sends SIGTERM, but the child catches that, prints all this info, and
then calls abort() - so that's why you're seeing SIGABRT.

> ebb_state:
>   ebb_count= 191529

The test usually runs until it's taken 1,000,000 EBBs, so it looks like
we got stuck.

>   spurious = 0
>   negative = 0
>   no_overflow  = 0
>   pmc[1] count = 0x0
>   pmc[2] count = 0x0
>   pmc[3] count = 0x0
>   pmc[4] count = 0x4c1b707

We use a varying sample period of between 400 and 600, and from above
we've taken 191,529 EBBs.

0x4c1b707 / 191,529 ~= 416

So that looks reasonable.

>   pmc[5] count = 0x0
>   pmc[6] count = 0x0
> HW state:
> MMCR0 0x8080 FC PMAO 

But this says we're stopped with counters frozen and an event pending.

> MMCR2 0x
> EBBHR 0x10003dcc
> BESCR 0x8001 GE PMAE 

And that says we have global enable set and events enabled.


So I think there is a bug here somewhere. I don't really have time to
dig into it now, neither does Maddy I think. But we should try and get
to it at some point.

cheers


Re: [PATCH] powerpc/mm: Remove redundant initmem information from log

2017-04-07 Thread Michael Ellerman
Anshuman Khandual  writes:

> Generic core VM already prints this information in the log
> buffer, hence there is no need for a second print. This just
> removes the second print from the arch powerpc NUMA init path.
>
> Before the patch:
>
> $dmesg | grep "Initmem"
>
> numa: Initmem setup node 0 [mem 0x-0x]
> numa: Initmem setup node 1 [mem 0x1-0x1]
> numa: Initmem setup node 2 [mem 0x2-0x2]
> numa: Initmem setup node 3 [mem 0x3-0x3]
> numa: Initmem setup node 4 [mem 0x4-0x4]
> numa: Initmem setup node 5 [mem 0x5-0x5]
> numa: Initmem setup node 6 [mem 0x6-0x6]
> numa: Initmem setup node 7 [mem 0x7-0x7]
> Initmem setup node 0 [mem 0x-0x]
> Initmem setup node 1 [mem 0x0001-0x0001]
> Initmem setup node 2 [mem 0x0002-0x0002]
> Initmem setup node 3 [mem 0x0003-0x0003]
> Initmem setup node 4 [mem 0x0004-0x0004]
> Initmem setup node 5 [mem 0x0005-0x0005]
> Initmem setup node 6 [mem 0x0006-0x0006]
> Initmem setup node 7 [mem 0x0007-0x0007]
>
> After the patch:
>
> $dmesg | grep "Initmem"
>
> Initmem setup node 0 [mem 0x-0x]
> Initmem setup node 1 [mem 0x0001-0x0001]
> Initmem setup node 2 [mem 0x0002-0x0002]
> Initmem setup node 3 [mem 0x0003-0x0003]
> Initmem setup node 4 [mem 0x0004-0x0004]
> Initmem setup node 5 [mem 0x0005-0x0005]
> Initmem setup node 6 [mem 0x0006-0x0006]
> Initmem setup node 7 [mem 0x0007-0x0007]
>
> Signed-off-by: Anshuman Khandual 

Looks good.

> ---
> Generic core VM prints the information inside the free_area_init_node
> function, but only when CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled.
> So if there are other PPC platforms which don't enable the config,
> we can put the code section inside the applicable platform configs
> instead of removing it completely.

Are there other PPC platforms which don't enable it?

...

config PPC
...
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP


No.

So this should be fine for all PPC.

cheers


[PATCH 0/5] doorbell patches for POWER9

2017-04-07 Thread Nicholas Piggin
This is what I'd like to do for POWER9 doorbells, which reworks
the existing code a bit. I guess it won't work on DD1 with OPAL
until darn is fixed (only tested on POWER9 using mambo).

Nicholas Piggin (5):
  powerpc/pseries: do not use msgsndp doorbells on POWER9 guests
  powerpc: change the doorbell IPI calling convention
  powerpc: Introduce msgsnd/doorbell barrier primitives
  powerpc/64s: avoid branch for ppc_msgsnd
  powerpc/powernv: POWER9 support for msgsnd/doorbell IPI

 arch/powerpc/include/asm/dbell.h  | 45 ++--
 arch/powerpc/include/asm/feature-fixups.h | 20 +++
 arch/powerpc/include/asm/ppc-opcode.h |  6 
 arch/powerpc/include/asm/ppc_asm.h| 15 
 arch/powerpc/include/asm/smp.h|  4 +--
 arch/powerpc/include/asm/xics.h   |  2 +-
 arch/powerpc/kernel/dbell.c   | 58 +++
 arch/powerpc/kernel/smp.c | 27 +++---
 arch/powerpc/platforms/85xx/smp.c | 11 ++
 arch/powerpc/platforms/powermac/smp.c |  2 +-
 arch/powerpc/platforms/powernv/smp.c  | 33 +-
 arch/powerpc/platforms/pseries/smp.c  | 33 --
 arch/powerpc/sysdev/xics/icp-hv.c |  2 +-
 arch/powerpc/sysdev/xics/icp-native.c | 12 +--
 arch/powerpc/sysdev/xics/icp-opal.c   |  2 +-
 arch/powerpc/sysdev/xics/xics-common.c|  3 --
 16 files changed, 189 insertions(+), 86 deletions(-)

-- 
2.11.0





[PATCH 1/5] powerpc/pseries: do not use msgsndp doorbells on POWER9 guests

2017-04-07 Thread Nicholas Piggin
POWER9 hypervisors will not necessarily run guest threads together on
the same core at the same time, so msgsndp should not be used.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/smp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index f6f83aeccaaa..1fa08155206b 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -200,7 +200,12 @@ static __init void pSeries_smp_probe(void)
 {
xics_smp_probe();
 
-   if (cpu_has_feature(CPU_FTR_DBELL)) {
+   /*
+* POWER9 can not use msgsndp doorbells for IPI because thread
+* siblings do not necessarily run on physical cores at the same
+* time. This could be enabled for pHyp.
+*/
+   if (cpu_has_feature(CPU_FTR_DBELL) && !cpu_has_feature(CPU_FTR_ARCH_300)) {
xics_cause_ipi = smp_ops->cause_ipi;
smp_ops->cause_ipi = pSeries_cause_ipi_mux;
}
-- 
2.11.0



[PATCH 2/5] powerpc: change the doorbell IPI calling convention

2017-04-07 Thread Nicholas Piggin
Change the doorbell callers to know about their msgsnd addressing, rather
than have them set a per-cpu target data tag at boot that gets sent to the
cause_ipi functions. The data is only used for doorbell IPI functions, no
other IPI types, so it makes sense to keep that detail local to doorbell.

Have the platform code understand doorbell IPIs, rather than the interrupt
controller code understand them. Platform code can look at capabilities
it has available and decide which to use.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/dbell.h   |  9 ++
 arch/powerpc/include/asm/smp.h |  3 +-
 arch/powerpc/include/asm/xics.h|  2 +-
 arch/powerpc/kernel/dbell.c| 52 ++
 arch/powerpc/kernel/smp.c  | 17 ---
 arch/powerpc/platforms/85xx/smp.c  | 11 ++-
 arch/powerpc/platforms/powermac/smp.c  |  2 +-
 arch/powerpc/platforms/powernv/smp.c   | 29 +--
 arch/powerpc/platforms/pseries/smp.c   | 28 +++---
 arch/powerpc/sysdev/xics/icp-hv.c  |  2 +-
 arch/powerpc/sysdev/xics/icp-native.c  | 12 +---
 arch/powerpc/sysdev/xics/icp-opal.c|  2 +-
 arch/powerpc/sysdev/xics/xics-common.c |  3 --
 13 files changed, 94 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index 378167377065..5a7301f333a4 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -35,8 +35,6 @@ enum ppc_dbell {
 #ifdef CONFIG_PPC_BOOK3S
 
 #define PPC_DBELL_MSGTYPE  PPC_DBELL_SERVER
-#define SPRN_DOORBELL_CPUTAG   SPRN_TIR
-#define PPC_DBELL_TAG_MASK 0x7f
 
 static inline void _ppc_msgsnd(u32 msg)
 {
@@ -49,8 +47,6 @@ static inline void _ppc_msgsnd(u32 msg)
 #else /* CONFIG_PPC_BOOK3S */
 
 #define PPC_DBELL_MSGTYPE  PPC_DBELL
-#define SPRN_DOORBELL_CPUTAG   SPRN_PIR
-#define PPC_DBELL_TAG_MASK 0x3fff
 
 static inline void _ppc_msgsnd(u32 msg)
 {
@@ -59,9 +55,10 @@ static inline void _ppc_msgsnd(u32 msg)
 
 #endif /* CONFIG_PPC_BOOK3S */
 
-extern void doorbell_cause_ipi(int cpu, unsigned long data);
+extern void global_doorbell_cause_ipi(int cpu);
+extern void core_doorbell_cause_ipi(int cpu);
+extern int try_core_doorbell_cause_ipi(int cpu);
 extern void doorbell_exception(struct pt_regs *regs);
-extern void doorbell_setup_this_cpu(void);
 
 static inline void ppc_msgsnd(enum ppc_dbell type, u32 flags, u32 tag)
 {
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 32db16d2e7ad..0ada12e61fd7 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -40,7 +40,7 @@ extern int cpu_to_chip_id(int cpu);
 struct smp_ops_t {
void  (*message_pass)(int cpu, int msg);
 #ifdef CONFIG_PPC_SMP_MUXED_IPI
-   void  (*cause_ipi)(int cpu, unsigned long data);
+   void  (*cause_ipi)(int cpu);
 #endif
void  (*probe)(void);
int   (*kick_cpu)(int nr);
@@ -125,7 +125,6 @@ extern int smp_request_message_ipi(int virq, int message);
 extern const char *smp_ipi_name[];
 
 /* for irq controllers with only a single ipi */
-extern void smp_muxed_ipi_set_data(int cpu, unsigned long data);
 extern void smp_muxed_ipi_message_pass(int cpu, int msg);
 extern void smp_muxed_ipi_set_message(int cpu, int msg);
 extern irqreturn_t smp_ipi_demux(void);
diff --git a/arch/powerpc/include/asm/xics.h b/arch/powerpc/include/asm/xics.h
index e0b9e576905a..7ce2c3ac2964 100644
--- a/arch/powerpc/include/asm/xics.h
+++ b/arch/powerpc/include/asm/xics.h
@@ -57,7 +57,7 @@ struct icp_ops {
void (*teardown_cpu)(void);
void (*flush_ipi)(void);
 #ifdef CONFIG_SMP
-   void (*cause_ipi)(int cpu, unsigned long data);
+   void (*cause_ipi)(int cpu);
irq_handler_t ipi_action;
 #endif
 };
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index 2128f3a96c32..2b41f145de05 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -20,18 +20,60 @@
 #include 
 
 #ifdef CONFIG_SMP
-void doorbell_setup_this_cpu(void)
+
+/*
+ * Doorbells must only be used if CPU_FTR_DBELL is available.
+ * msgsnd is used in HV, and msgsndp is used in !HV.
+ *
+ * These should be used by platform code that is aware of restrictions.
+ * Other arch code should use ->cause_ipi.
+ *
+ * global_doorbell_cause_ipi sends a dbell to any target CPU.
+ * Must be used only by architectures that address msgsnd target
+ * by PIR/get_hard_smp_processor_id.
+ */
+void global_doorbell_cause_ipi(int cpu)
 {
-   unsigned long tag = mfspr(SPRN_DOORBELL_CPUTAG) & PPC_DBELL_TAG_MASK;
+   u32 tag = get_hard_smp_processor_id(cpu);
 
-   smp_muxed_ipi_set_data(smp_processor_id(), tag);
+   kvmppc_set_host_ipi(cpu, 1);
+   /* Order previous accesses vs. msgsnd, which is treated as a store */
+   mb();
+   ppc_msgsnd(PPC_DBELL_MSGTYPE, 0, tag);
 }
 
-void doorbell_cause_ipi(in
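
(The diff is truncated above by the archive. For reference only, the
core-local variant this patch adds would plausibly address a sibling
thread by its thread number within the core rather than by the global
hard SMP id - a sketch, not the literal patch text:)

	/* Sketch: msgsndp targets siblings by thread index, not PIR */
	void core_doorbell_cause_ipi(int cpu)
	{
		u32 tag = cpu_thread_in_core(cpu);

		kvmppc_set_host_ipi(cpu, 1);
		/* Order previous accesses vs. msgsnd (a store) */
		mb();
		ppc_msgsnd(PPC_DBELL_MSGTYPE, 0, tag);
	}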

[PATCH 3/5] powerpc: Introduce msgsnd/doorbell barrier primitives

2017-04-07 Thread Nicholas Piggin
POWER9 changes requirements and adds new instructions for
synchronization.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/dbell.h | 22 ++
 arch/powerpc/include/asm/smp.h   |  1 +
 arch/powerpc/kernel/dbell.c  |  8 +---
 arch/powerpc/kernel/smp.c| 10 --
 4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index 5a7301f333a4..4db4cfdd829c 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -44,6 +44,17 @@ static inline void _ppc_msgsnd(u32 msg)
__asm__ __volatile__ (PPC_MSGSNDP(%0) : : "r" (msg));
 }
 
+/* sync before sending message */
+static inline void ppc_msgsnd_sync(void)
+{
+   __asm__ __volatile__ ("sync" : : : "memory");
+}
+
+/* sync after taking message interrupt */
+static inline void ppc_msgsync(void)
+{
+}
+
 #else /* CONFIG_PPC_BOOK3S */
 
 #define PPC_DBELL_MSGTYPE  PPC_DBELL
@@ -53,6 +64,17 @@ static inline void _ppc_msgsnd(u32 msg)
__asm__ __volatile__ (PPC_MSGSND(%0) : : "r" (msg));
 }
 
+/* sync before sending message */
+static inline void ppc_msgsnd_sync(void)
+{
+   __asm__ __volatile__ ("sync" : : : "memory");
+}
+
+/* sync after taking message interrupt */
+static inline void ppc_msgsync(void)
+{
+}
+
 #endif /* CONFIG_PPC_BOOK3S */
 
 extern void global_doorbell_cause_ipi(int cpu);
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 0ada12e61fd7..0ee8a6cb1d87 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -128,6 +128,7 @@ extern const char *smp_ipi_name[];
 extern void smp_muxed_ipi_message_pass(int cpu, int msg);
 extern void smp_muxed_ipi_set_message(int cpu, int msg);
 extern irqreturn_t smp_ipi_demux(void);
+extern irqreturn_t smp_ipi_demux_relaxed(void);
 
 void smp_init_pSeries(void);
 void smp_init_cell(void);
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index 2b41f145de05..98de6d3d904e 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -38,7 +38,7 @@ void global_doorbell_cause_ipi(int cpu)
 
kvmppc_set_host_ipi(cpu, 1);
/* Order previous accesses vs. msgsnd, which is treated as a store */
-   mb();
+   ppc_msgsnd_sync();
ppc_msgsnd(PPC_DBELL_MSGTYPE, 0, tag);
 }
 
@@ -53,7 +53,7 @@ void core_doorbell_cause_ipi(int cpu)
 
kvmppc_set_host_ipi(cpu, 1);
/* Order previous accesses vs. msgsnd, which is treated as a store */
-   mb();
+   ppc_msgsnd_sync();
ppc_msgsnd(PPC_DBELL_MSGTYPE, 0, tag);
 }
 
@@ -82,12 +82,14 @@ void doorbell_exception(struct pt_regs *regs)
 
irq_enter();
 
+   ppc_msgsync();
+
may_hard_irq_enable();
 
kvmppc_set_host_ipi(smp_processor_id(), 0);
__this_cpu_inc(irq_stat.doorbell_irqs);
 
-   smp_ipi_demux();
+   smp_ipi_demux_relaxed(); /* already performed the barrier */
 
irq_exit();
set_irq_regs(old_regs);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f1f6e4e3906b..fd2441591b81 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -246,11 +246,17 @@ void smp_muxed_ipi_message_pass(int cpu, int msg)
 
 irqreturn_t smp_ipi_demux(void)
 {
+   mb();   /* order any irq clear */
+
+   return smp_ipi_demux_relaxed();
+}
+
+/* sync-free variant. Callers should ensure synchronization */
+irqreturn_t smp_ipi_demux_relaxed(void)
+{
struct cpu_messages *info;
unsigned long all;
 
-   mb();   /* order any irq clear */
-
info = this_cpu_ptr(&ipi_message);
do {
all = xchg(&info->messages, 0);
-- 
2.11.0



[PATCH 4/5] powerpc/64s: avoid branch for ppc_msgsnd

2017-04-07 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/dbell.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index 4db4cfdd829c..8ad66ccb7180 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -38,10 +38,8 @@ enum ppc_dbell {
 
 static inline void _ppc_msgsnd(u32 msg)
 {
-   if (cpu_has_feature(CPU_FTR_HVMODE))
-   __asm__ __volatile__ (PPC_MSGSND(%0) : : "r" (msg));
-   else
-   __asm__ __volatile__ (PPC_MSGSNDP(%0) : : "r" (msg));
+   __asm__ __volatile__ (ASM_FTR_IFSET(PPC_MSGSND(%1), PPC_MSGSNDP(%1), %0)
+   : : "i" (CPU_FTR_HVMODE), "r" (msg));
 }
 
 /* sync before sending message */
-- 
2.11.0



[PATCH 5/5] powerpc/powernv: POWER9 support for msgsnd/doorbell IPI

2017-04-07 Thread Nicholas Piggin
POWER9 requires msgsync for receiver-side synchronization,
and a DD1 workaround that uses the darn instruction.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/dbell.h  |  8 
 arch/powerpc/include/asm/feature-fixups.h | 20 
 arch/powerpc/include/asm/ppc-opcode.h |  6 ++
 arch/powerpc/include/asm/ppc_asm.h| 15 +++
 arch/powerpc/platforms/powernv/smp.c  |  8 ++--
 5 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/dbell.h b/arch/powerpc/include/asm/dbell.h
index 8ad66ccb7180..2cdad4381045 100644
--- a/arch/powerpc/include/asm/dbell.h
+++ b/arch/powerpc/include/asm/dbell.h
@@ -51,6 +51,14 @@ static inline void ppc_msgsnd_sync(void)
 /* sync after taking message interrupt */
 static inline void ppc_msgsync(void)
 {
+   /* sync is not required when taking messages from the same core */
+   if (cpu_has_feature(CPU_FTR_ARCH_300) && 
cpu_has_feature(CPU_FTR_HVMODE)) {
+   unsigned long reg;
+   __asm__ __volatile__ (ASM_FTR_IFCLR(
+   PPC_MSGSYNC " ; lwsync",
+   PPC_DARN(%0, 2) " ; lwsync",
+   %1) : "=r" (reg) : "i" (CPU_FTR_POWER9_DD1) : "memory");
+   }
 }
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h
index ddf54f5bbdd1..d6b8c9a20496 100644
--- a/arch/powerpc/include/asm/feature-fixups.h
+++ b/arch/powerpc/include/asm/feature-fixups.h
@@ -66,7 +66,14 @@ label##5:
\
 #define END_FTR_SECTION(msk, val)  \
END_FTR_SECTION_NESTED(msk, val, 97)
 
+#define END_FTR_SECTION_NESTED_IFSET(msk, label) \
+   END_FTR_SECTION_NESTED((msk), (msk), label)
+
 #define END_FTR_SECTION_IFSET(msk) END_FTR_SECTION((msk), (msk))
+
+#define END_FTR_SECTION_NESTED_IFCLR(msk, label) \
+   END_FTR_SECTION_NESTED((msk), 0, label)
+
 #define END_FTR_SECTION_IFCLR(msk) END_FTR_SECTION((msk), 0)
 
 /* CPU feature sections with alternatives, use BEGIN_FTR_SECTION to start */
@@ -153,12 +160,25 @@ label##5: 
\
section_else "; "   \
stringify_in_c(ALT_FTR_SECTION_END((msk), (val)))
 
+#define ASM_FTR_IF_NESTED(section_if, section_else, msk, val, label)   \
+   stringify_in_c(BEGIN_FTR_SECTION_NESTED(label)) \
+   section_if "; " \
+   stringify_in_c(FTR_SECTION_ELSE_NESTED(label))  \
+   section_else "; "   \
+   stringify_in_c(ALT_FTR_SECTION_END_NESTED((msk), (val), label))
+
 #define ASM_FTR_IFSET(section_if, section_else, msk)   \
ASM_FTR_IF(section_if, section_else, (msk), (msk))
 
+#define ASM_FTR_IFSET_NESTED(section_if, section_else, msk, label) \
+   ASM_FTR_IF(section_if, section_else, (msk), (msk), label)
+
 #define ASM_FTR_IFCLR(section_if, section_else, msk)   \
ASM_FTR_IF(section_if, section_else, (msk), 0)
 
+#define ASM_FTR_IFCLR_NESTED(section_if, section_else, msk, label) \
+   ASM_FTR_IF(section_if, section_else, (msk), 0, label)
+
 #define ASM_MMU_FTR_IF(section_if, section_else, msk, val) \
stringify_in_c(BEGIN_MMU_FTR_SECTION)   \
section_if "; " \
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index e7d6d86563ee..44009dfeab69 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -134,6 +134,7 @@
 #define PPC_INST_COPY  0x7c00060c
 #define PPC_INST_COPY_FIRST0x7c20060c
 #define PPC_INST_CP_ABORT  0x7c00068c
+#define PPC_INST_DARN  0x7c0005e6
 #define PPC_INST_DCBA  0x7c0005ec
 #define PPC_INST_DCBA_MASK 0xfc0007fe
 #define PPC_INST_DCBAL 0x7c2005ec
@@ -161,6 +162,7 @@
 #define PPC_INST_MFTMR 0x7c0002dc
 #define PPC_INST_MSGSND0x7c00019c
 #define PPC_INST_MSGCLR0x7c0001dc
+#define PPC_INST_MSGSYNC   0x7c0006ec
 #define PPC_INST_MSGSNDP   0x7c00011c
 #define PPC_INST_MTTMR 0x7c0003dc
 #define PPC_INST_NOP   0x6000
@@ -310,6 +312,7 @@
 #define __PPC_XS(s)   ((((s) & 0x1f) << 21) | (((s) & 0x20) >> 5))
 #define __PPC_XT(s)__PPC_XS(s)
 #define __PPC_T_TLB(t) (((t) & 0x3) << 21)
+#define __PPC_L_DARN(l)(((l) & 0x3) << 16)
 #define __PPC_WC(w)(((w) & 0x3) << 21)
 #define __PPC_WS(w)(((w) & 0x1f) << 11)
 #define __PPC_SH(s)__PPC_WS(s)
@@ -333,6 +336,8 @@
 
 /* Deal with instructions that older assemblers aren't aware of */
 #definePPC_CP_ABORTstringify_in_c(.long PPC_INST_CP_AB

Re: [PATCH V4] powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes

2017-04-07 Thread Michael Ellerman
Anshuman Khandual  writes:

> This just adds user space exported ABI definitions for the 2MB, 16MB, 1GB,
> 16GB non-default huge page sizes to be used with the mmap() system call.

I updated this for you to include all the sizes.

> diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
> index 03c06ba..ebe99c7 100644
> --- a/arch/powerpc/include/uapi/asm/mman.h
> +++ b/arch/powerpc/include/uapi/asm/mman.h
> @@ -29,4 +29,18 @@
>  #define MAP_STACK0x2 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x4 /* create a huge page mapping */
>  
> +/*
> + * These constant defines should be used for creating the
> + * 'flags' argument (26:31 bit positions) for mmap() system
> + * call should the caller decide to use non default HugeTLB
> + * page size.
> + */

And I reworded the comment the make it clearer (I think) that most users
shouldn't need to use these, and should just use the default size:

/*
 * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
 * A value of zero indicates that the default huge page size should be used.
 * To use a non-default huge page size, one of these defines can be used, or the
 * size can be encoded by hand. Note that on most systems only a subset, or
 * possibly none, of these sizes will be available.
 */
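
To make the encoding concrete, a hedged user-space example (mine, not
from the patch; mixing <sys/mman.h> and <linux/mman.h> may need care on
some libc versions): mapping anonymous memory backed by 16MB huge pages,
with log2(16MB) = 24 placed in bits [26:31] via MAP_HUGE_SHIFT:

	#include <stddef.h>
	#include <sys/mman.h>
	#include <linux/mman.h>	/* MAP_HUGE_SHIFT */

	/* Map len bytes of anonymous memory on 16MB huge pages */
	static void *map_16mb_hugepages(size_t len)
	{
		return mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			    (24 << MAP_HUGE_SHIFT),
			    -1, 0);
	}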


Also do you want to send a patch to the man page?

https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/mmap.2#n248

cheers


Re: [7/7] crypto: caam/qi - add ablkcipher and authenc algorithms

2017-04-07 Thread Michael Ellerman
Laurentiu Tudor  writes:

> On 04/05/2017 01:06 PM, Michael Ellerman wrote:
>> Laurentiu Tudor  writes:
>>
>>> Hi Michael,
>>>
>>> Just a couple of basic things to check:
>>>- was the dtb updated to the newest?
>>
>> Possibly not, it's an automated build/boot, I'll have to check what it
>> does with the dtb.
>>
>>>- is the qman node present? This should be easily visible in
>>> /proc/device-tree/soc@ffe00/qman@318000.
>>
>> No it's not there.
>>
>> That's running linux-next with:
>>
>> CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_QI=n
>>
>>
>> Does that mean I didn't update the device tree?
>
> I think so. Also, I just checked that the node is actually there by 
> compiling p5020ds.dts and then decompiling the dtb.

OK, I'll make sure I update the DTB.

It will still be good if the code was a bit more robust about the qman
being missing.

cheers


Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-07 Thread Michael Ellerman
Hari Bathini  writes:
> On Friday 07 April 2017 07:24 AM, Michael Ellerman wrote:
>> My preference would be that the fadump kernel "just works". If it's
>> using too much memory then the fadump kernel should do whatever it needs
>> to use less memory, eg. shrinking nr_cpu_ids etc.
>
>> Do we actually know *why* the fadump kernel is running out of memory?
>> Obviously large numbers of CPUs is one of the main drivers (lots of
>> stacks required). But other than that what is causing the memory
>> pressure? I would like some data on that before we proceed.
>
> It needs almost the same amount of memory as is required to boot the
> production kernel, but much of that is unwarranted for the fadump
> (dump capture) kernel.

That's not data! :)

The dump kernel is booted with *much* less memory than the production
kernel (that's the whole issue!) and so it doesn't need to create struct
pages for all that memory, which means it should need less memory.

The vfs caches are also sized based on the available memory, so they
should also shrink in the dump kernel.

I want some actual numbers on what's driving the memory usage.

I tried some of these parameters to see how much memory they would save:

> So, if parameters like
> cgroup_disable=memory,

0 bytes saved.

> transparent_hugepages=never,

0 bytes saved.

> numa=off,

64KB saved.

> nr_cpus=1,

3MB saved (vs 16 CPUs)


Now maybe on your system those do save memory for some reason, but
please prove it to me. Otherwise I'm inclined to merge:

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ff0dd4e77a7..03f1f253c372 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 * dump data waiting for us.
 */
fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
+   if (fdm_active) {
fw_dump.dump_active = 1;
+   nr_cpu_ids = 1;
+   }
 
/* Get the sizes required to store dump data for the firmware provided
 * dump sections.

cheers


RE: [7/7] crypto: caam/qi - add ablkcipher and authenc algorithms

2017-04-07 Thread Laurentiu Tudor


-----Original Message-----
From: Michael Ellerman [mailto:m...@ellerman.id.au]
Sent: Friday, April 07, 2017 4:22 PM
Subject: Re: [7/7] crypto: caam/qi - add ablkcipher and authenc algorithms

Laurentiu Tudor  writes:

> On 04/05/2017 01:06 PM, Michael Ellerman wrote:
>> Laurentiu Tudor  writes:
>>
>>> Hi Michael,
>>>
>>> Just a couple of basic things to check:
>>>- was the dtb updated to the newest?
>>
>> Possibly not, it's an automated build/boot, I'll have to check what 
>> it does with the dtb.
>>
>>>- is the qman node present? This should be easily visible in 
>>> /proc/device-tree/soc@ffe00/qman@318000.
>>
>> No it's not there.
>>
>> That's running linux-next with:
>>
>> CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_QI=n
>>
>>
>> Does that mean I didn't update the device tree?
>
> I think so. Also, I just checked that the node is actually there by 
> compiling p5020ds.dts and then decompiling the dtb.

> OK, I'll make sure I update the DTB.
> 
> It will still be good if the code was a bit more robust about the qman being 
> missing.

Totally agree. We should handle this error condition.

---
Thanks & Best Regards, Laurentiu


[PATCH V4 0/7] cxl: Add support for Coherent Accelerator Interface Architecture 2.0

2017-04-07 Thread Christophe Lombard
This series adds support for a cxl card which supports the Coherent
Accelerator Interface Architecture 2.0.

It requires an IBM Power9 system and the Power Service Layer, version 9.
The PSL provides the address translation and system memory cache for
CAIA compliant Accelerators.
The PSL attaches to the IBM Processor chip through the PCIe link, using
PSL-specific “CAPI Protocol” Transaction Layer Packets.
The PSL and CAPP communicate using PowerBus packets.
When using a PCIe link, the PCIe Host Bridge (PHB) decodes the CAPI
Protocol Packets from the PSL and forwards them as PowerBus data
packets. The PSL also has an optional DMA feature which allows the AFU
to send native PCIe reads and writes to the Processor.

CAIA 2 introduces new features:
* Two programming models with several similarities among them:
Dedicated-Process and Shared models.
* DMA support
* Nest MMU to handle address translation.
* ...

It builds on top of the existing cxl driver for the first version of
CAIA. Today only the bare-metal environment supports these new features.

Compatibility with the CAIA, version 1, allows applications and system
software to migrate from one implementation to another with minor
changes.
Most of the differences are:
* Power Service Layer registers: p1 and p2 registers. These new
registers require reworking the service layer API (in cxl.h).
* Support for Radix mode. Power9 supports multiple memory management
models, so we need to select the right translation mechanism mode.
* Dedicated-Shared Process Programming Model
* Process element entry. Structure cxl_process_element_common is
redefined.
* Translation Fault Handling. Only a page fault is now handled by the
cxl driver when a translation fault occurs.

Roughly 3/4 of the code is common between the two CAIA versions. When
the code needs to call a specific implementation, it does so
through an API. The PSL8 and PSL9 implementations each provide
their own definition. See struct cxl_service_layer_ops.

The first 3 patches are mostly cleanups and fixes, separating the
psl8-specific code from the code which will also be used for psl9.
Patch 4 restructures existing code, to easily add the psl9
implementation.
Patches 5 and 6 rename and isolate implementation-specific code.
Patch 7 introduces the core of the PSL9-specific code.

Tested on Simulation environment.

Changelog[v4]
 - Rebase to latest upstream.
 - Integrate comments from Andrew Donnellan and Frederic Barrat.
 - patch2: - Update the structure cxl_irq_info.
 Update the commit message.
 - patch3: - Update the commit message.
 Remove the prototype cxl_context_mm_users_get() in cxl.h
 The function no longer exists.
 - patch4: - Some callbacks are missing the xsl_ops structure
   - Rework the function native_irq_multiplexed()
 - patch6: - Remove code lines that will be going away in the next
 patch.
 - patch7: - Rename the function process_element_entry() to
 process_element_entry_psl9()
   - Change the setting of the PSL_SERR_An register.
   - Update cxl documentation.

Changelog[v3]
 - Rebase to latest upstream.
 - Integrate comments from Andrew Donnellan and Frederic Barrat.
 - patch2: - Rename pid and tid to "reserved" in the struct cxl_irq_info.
 - patch3: - Update commit message.
   - Reset ctx->mm to NULL.
   - Simplify slightly the function _cxl_slbia() using the mm
 associated to a context.
   - Remove cxl_context_mm_users_get().
 - patch4: - Some prototypes are not supposed to depend on CONFIG_DEBUG_FS.
 - patch6: - Regroup the sste_lock and sst alloc under the same "if"
 statement. 
 - patch7: - New functions to cover page fault and segment miss.
   - Rework the code to avoid duplication.
   - Add a new parameter for the function cxl_alloc_spa().
   - Invalidation of all ERAT entries is no longer required by
 CAIA2.
   - Keep original version of cxl_native_register_serr_irq().
   - ASB_Notify messages and Non-Blocking queues not supported
 on DD1.
   - Change the allocation of the apc machines.

Changelog[v2]
 - Rebase to latest upstream.
 - Integrate comments from Andrew Donnellan and Frederic Barrat.

Christophe Lombard (7):
  cxl: Read vsec perst load image
  cxl: Remove unused values in bare-metal environment.
  cxl: Keep track of mm struct associated with a context
  cxl: Update implementation service layer
  cxl: Rename some psl8 specific functions
  cxl: Isolate few psl8 specific calls
  cxl: Add psl9 specific code

 Documentation/powerpc/cxl.txt |  11 +-
 drivers/misc/cxl/api.c|  17 +-
 drivers/misc/cxl/context.c|  65 ++--
 drivers/misc/cxl/cxl.h| 244 +---
 drivers/misc/cxl/debugfs.c|  41 +++--
 drivers/misc/cxl/fault.c  | 136 ++--
 drivers/misc/cxl/file.c   |  15 +-
 drivers/misc/cxl/guest.c  |  10 +-
 dri

[PATCH V4 3/7] cxl: Keep track of mm struct associated with a context

2017-04-07 Thread Christophe Lombard
The mm_struct corresponding to the current task is acquired each time
an interrupt is raised. So to simplify the code, we only get the
mm_struct when attaching an AFU context to the process.
The mm_count reference is increased to ensure that the mm_struct can't
be freed. The mm_struct will be released when the context is detached.
A reference on mm_users is not kept, to avoid a circular dependency if
the process mmaps its cxl mmio and forgets to unmap before exiting.
The field glpid (pid of the group leader associated with the pid) of
the structure cxl_context is removed because it's no longer useful.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/api.c | 17 +--
 drivers/misc/cxl/context.c | 21 +++--
 drivers/misc/cxl/cxl.h | 10 --
 drivers/misc/cxl/fault.c   | 76 --
 drivers/misc/cxl/file.c| 15 +++--
 drivers/misc/cxl/main.c| 12 ++--
 6 files changed, 61 insertions(+), 90 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index bcc030e..1a138c8 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cxl.h"
 
@@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 
if (task) {
ctx->pid = get_task_pid(task, PIDTYPE_PID);
-   ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
kernel = false;
ctx->real_mode = false;
+
+   /* acquire a reference to the task's mm */
+   ctx->mm = get_task_mm(current);
+
+   /* ensure this mm_struct can't be freed */
+   cxl_context_mm_count_get(ctx);
+
+   /* decrement the use count */
+   if (ctx->mm)
+   mmput(ctx->mm);
}
 
cxl_ctx_get();
 
if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
-   put_pid(ctx->glpid);
put_pid(ctx->pid);
-   ctx->glpid = ctx->pid = NULL;
+   ctx->pid = NULL;
cxl_adapter_context_put(ctx->afu->adapter);
cxl_ctx_put();
+   if (task)
+   cxl_context_mm_count_put(ctx);
goto out;
}
 
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 062bf6c..2e935ea 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,7 +42,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
-   ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */
+   ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;
 
@@ -242,12 +243,16 @@ int __detach_context(struct cxl_context *ctx)
 
/* release the reference to the group leader and mm handling pid */
put_pid(ctx->pid);
-   put_pid(ctx->glpid);
 
cxl_ctx_put();
 
/* Decrease the attached context count on the adapter */
cxl_adapter_context_put(ctx->afu->adapter);
+
+   /* Decrease the mm count on the context */
+   cxl_context_mm_count_put(ctx);
+   ctx->mm = NULL;
+
return 0;
 }
 
@@ -325,3 +330,15 @@ void cxl_context_free(struct cxl_context *ctx)
mutex_unlock(&ctx->afu->contexts_lock);
call_rcu(&ctx->rcu, reclaim_ctx);
 }
+
+void cxl_context_mm_count_get(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   atomic_inc(&ctx->mm->mm_count);
+}
+
+void cxl_context_mm_count_put(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   mmdrop(ctx->mm);
+}
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 36bc213..4bcbf7a 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -482,8 +482,6 @@ struct cxl_context {
unsigned int sst_size, sst_lru;
 
wait_queue_head_t wq;
-   /* pid of the group leader associated with the pid */
-   struct pid *glpid;
/* use mm context associated with this pid for ds faults */
struct pid *pid;
	spinlock_t lock; /* Protects pending_irq_mask, pending_fault and fault_addr */
@@ -551,6 +549,8 @@ struct cxl_context {
 * CX4 only:
 */
struct list_head extra_irq_contexts;
+
+   struct mm_struct *mm;
 };
 
 struct cxl_service_layer_ops {
@@ -1012,4 +1012,10 @@ int cxl_adapter_context_lock(struct cxl *adapter);
 /* Unlock the contexts-lock if taken. Warn and force unlock otherwise */
 void cxl_adapter_context_unlock(struct cxl *adapter);
 
+/* Increases the reference count to "struct mm_struct" */
+void cxl_context_mm_count_get(struct cxl_context *ctx);
+
+/* Decrements the reference count to "struct mm_struct" */
+void cxl

[PATCH V4 2/7] cxl: Remove unused values in bare-metal environment.

2017-04-07 Thread Christophe Lombard
The two fields pid and tid, located in the structure cxl_irq_info,
are only used in the guest environment. To avoid confusion, they are
no longer filled in the bare-metal environment, and pid_tid is renamed
to 'reserved' to avoid undefined behavior on bare-metal. The PSL Process
and Thread Identification Register (CXL_PSL_PID_TID_An) was only used
when attaching a dedicated process on PSL8. This register goes away in
CAIA2.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h| 20 
 drivers/misc/cxl/hcalls.c |  6 +++---
 drivers/misc/cxl/native.c |  5 -
 3 files changed, 7 insertions(+), 24 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 79e60ec..36bc213 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -888,27 +888,15 @@ int __detach_context(struct cxl_context *ctx);
 /*
  * This must match the layout of the H_COLLECT_CA_INT_INFO retbuf defined
  * in PAPR.
- * A word about endianness: a pointer to this structure is passed when
- * calling the hcall. However, it is not a block of memory filled up by
- * the hypervisor. The return values are found in registers, and copied
- * one by one when returning from the hcall. See the end of the call to
- * plpar_hcall9() in hvCall.S
- * As a consequence:
- * - we don't need to do any endianness conversion
- * - the pid and tid are an exception. They are 32-bit values returned in
- *   the same 64-bit register. So we do need to worry about byte ordering.
+ * Field pid_tid is now 'reserved' because it is no longer used on bare-metal.
+ * In a guest environment, the PSL_PID_An register is returned in the upper
+ * 32 bits and the PSL_TID_An register in the lower 32 bits.
  */
 struct cxl_irq_info {
u64 dsisr;
u64 dar;
u64 dsr;
-#ifndef CONFIG_CPU_LITTLE_ENDIAN
-   u32 pid;
-   u32 tid;
-#else
-   u32 tid;
-   u32 pid;
-#endif
+   u64 reserved;
u64 afu_err;
u64 errstat;
u64 proc_handle;
diff --git a/drivers/misc/cxl/hcalls.c b/drivers/misc/cxl/hcalls.c
index d6d11f4..9b8bb0f 100644
--- a/drivers/misc/cxl/hcalls.c
+++ b/drivers/misc/cxl/hcalls.c
@@ -413,9 +413,9 @@ long cxl_h_collect_int_info(u64 unit_address, u64 process_token,
 
	switch (rc) {
	case H_SUCCESS: /* The interrupt info is returned in return registers. */
-		pr_devel("dsisr:%#llx, dar:%#llx, dsr:%#llx, pid:%u, tid:%u, afu_err:%#llx, errstat:%#llx\n",
-			info->dsisr, info->dar, info->dsr, info->pid,
-			info->tid, info->afu_err, info->errstat);
+		pr_devel("dsisr:%#llx, dar:%#llx, dsr:%#llx, pid_tid:%#llx, afu_err:%#llx, errstat:%#llx\n",
+			info->dsisr, info->dar, info->dsr, info->reserved,
+			info->afu_err, info->errstat);
return 0;
case H_PARAMETER:   /* An incorrect parameter was supplied. */
return -EINVAL;
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 7ae7105..7257e8b 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx)
 
 static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
 {
-   u64 pidtid;
-
/* If the adapter has gone away, we can't get any meaningful
 * information.
 */
@@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An);
info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An);
-   pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An);
-   info->pid = pidtid >> 32;
-	info->tid = pidtid & 0xffffffff;
info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An);
info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An);
info->proc_handle = 0;
-- 
2.7.4
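
For reference, the removed bare-metal unpacking split the single 64-bit
CXL_PSL_PID_TID_An value as follows; this is just a restatement of the
deleted lines above:

	u64 pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An);
	u32 pid = pidtid >> 32;		/* PSL_PID_An: upper 32 bits */
	u32 tid = pidtid & 0xffffffff;	/* PSL_TID_An: lower 32 bits */

The old endianness #ifdef in cxl_irq_info modelled exactly this packing
for the values returned by the hcall on the guest side.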



[PATCH V4 1/7] cxl: Read vsec perst load image

2017-04-07 Thread Christophe Lombard
This bit is used to cause a flash image load for a programmable
CAIA-compliant implementation. If this bit is set to ‘0’, a power
cycle of the adapter is required to load a programmable CAIA-compliant
implementation from flash.
This field will be used by the following patches.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index b27ea98..1f4c351 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1332,6 +1332,7 @@ static int cxl_read_vsec(struct cxl *adapter, struct pci_dev *dev)
	CXL_READ_VSEC_IMAGE_STATE(dev, vsec, &image_state);
	adapter->user_image_loaded = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
	adapter->perst_select_user = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
+	adapter->perst_loads_image = !!(image_state & CXL_VSEC_PERST_LOADS_IMAGE);
 
CXL_READ_VSEC_NAFUS(dev, vsec, &adapter->slices);
CXL_READ_VSEC_AFU_DESC_OFF(dev, vsec, &afu_desc_off);
-- 
2.7.4



[PATCH V4 5/7] cxl: Rename some psl8 specific functions

2017-04-07 Thread Christophe Lombard
Rename a few functions, changing the '_psl' suffix to '_psl8', to make
clear that the implementation is psl8 specific.
Those functions will have an equivalent implementation for the psl9 in
a later patch.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h | 26 ++--
 drivers/misc/cxl/debugfs.c |  6 ++---
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 12 +-
 drivers/misc/cxl/pci.c | 60 +++---
 6 files changed, 54 insertions(+), 54 deletions(-)
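
The renamed functions are consumed through the service-layer ops table
introduced in the previous patch, so the suffix shows at a glance which
CAIA level an implementation targets. The wiring looks roughly like this
(an illustrative subset, not a hunk from this patch):

	static const struct cxl_service_layer_ops psl8_ops = {
		.handle_interrupt	  = cxl_irq_psl8,
		.fail_irq		  = cxl_fail_irq_psl,
		.attach_afu_directed	  = cxl_attach_afu_directed_psl8,
		.update_dedicated_ivtes	  = cxl_update_dedicated_ivtes_psl8,
		.debugfs_add_adapter_regs = cxl_debugfs_add_adapter_regs_psl8,
	};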

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 626073d..a54c003 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -813,10 +813,10 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count);
 void afu_release_irqs(struct cxl_context *ctx, void *cookie);
 void afu_irq_name_free(struct cxl_context *ctx);
 
-int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
-int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
-int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr);
-void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 amr);
+void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx);
 
 #ifdef CONFIG_DEBUG_FS
 
@@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace_psl(struct cxl *cxl);
-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_stop_trace_psl8(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir);
 void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir);
 
 #else /* CONFIG_DEBUG_FS */
 
@@ -860,11 +860,11 @@ static inline void cxl_debugfs_afu_remove(struct cxl_afu *afu)
 {
 }
 
-static inline void cxl_stop_trace(struct cxl *cxl)
+static inline void cxl_stop_trace_psl8(struct cxl *cxl)
 {
 }
 
-static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter,
struct dentry *dir)
 {
 }
@@ -874,7 +874,7 @@ static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter,
 {
 }
 
-static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir)
+static inline void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir)
 {
 }
 
@@ -919,8 +919,8 @@ struct cxl_irq_info {
 };
 
 void cxl_assign_psn_space(struct cxl_context *ctx);
-int cxl_invalidate_all_psl(struct cxl *adapter);
-irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info);
+int cxl_invalidate_all_psl8(struct cxl *adapter);
+irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info);
 irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info);
 int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler,
void *cookie, irq_hw_number_t *dest_hwirq,
@@ -932,7 +932,7 @@ int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);
 
-void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx);
+void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
 void cxl_native_err_irq_dump_regs(struct cxl *adapter);
 int cxl_pci_vphb_add(struct cxl_afu *afu);
 void cxl_pci_vphb_remove(struct cxl_afu *afu);
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c
index 4848ebf..2ff10a9 100644
--- a/drivers/misc/cxl/debugfs.c
+++ b/drivers/misc/cxl/debugfs.c
@@ -15,7 +15,7 @@
 
 static struct dentry *cxl_debugfs;
 
-void cxl_stop_trace_psl(struct cxl *adapter)
+void cxl_stop_trace_psl8(struct cxl *adapter)
 {
int slice;
 
@@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, umode_t mode,
  (void __force *)value, &fops_io_x64);
 }
 
-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir)
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir)
 {
debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR1));
debugfs_create_io_x64("fir2", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR2));
@@ -92,7 +92,7 @@ void cxl_debugfs_adapter_remove(struct cxl *adapter)
debugfs_rem

[PATCH V4 7/7] cxl: Add psl9 specific code

2017-04-07 Thread Christophe Lombard
The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
If a valid mm_struct is found, the memory context id is used for each
transaction associated with the process handle. The PSL uses the
context id to find the corresponding process element.

Signed-off-by: Christophe Lombard 
---
 Documentation/powerpc/cxl.txt |  11 +-
 drivers/misc/cxl/context.c|  16 ++-
 drivers/misc/cxl/cxl.h| 137 +++
 drivers/misc/cxl/debugfs.c|  19 
 drivers/misc/cxl/fault.c  |  64 +++
 drivers/misc/cxl/guest.c  |   8 +-
 drivers/misc/cxl/irq.c|  53 +
 drivers/misc/cxl/native.c | 225 +++---
 drivers/misc/cxl/pci.c| 246 +++---
 drivers/misc/cxl/trace.h  |  43 
 10 files changed, 748 insertions(+), 74 deletions(-)
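
The central PSL9 difference is that translation is keyed by the memory
context id instead of a segment table. Filling the process element then
looks roughly like this (a sketch assuming the ctx->mm tracking added
earlier in the series; 'elem' and the exact field names are assumptions,
not quoted code):

	u32 pid = 0;

	/* PSL9: the process element carries the mm's context id */
	if (ctx->mm)
		pid = ctx->mm->context.id;
	elem->common.pid = cpu_to_be32(pid);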

diff --git a/Documentation/powerpc/cxl.txt b/Documentation/powerpc/cxl.txt
index d5506ba0..4a77462 100644
--- a/Documentation/powerpc/cxl.txt
+++ b/Documentation/powerpc/cxl.txt
@@ -21,7 +21,7 @@ Introduction
 Hardware overview
 =
 
-          POWER8               FPGA
+         POWER8/9              FPGA
        +----------+        +---------+
        |          |        |         |
        |   CPU    |        |   AFU   |
@@ -34,7 +34,7 @@ Hardware overview
        |   | CAPP |<------>|         |
        +---+------+  PCIE  +---------+
 
-The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
+The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
 unit which is part of the PCIe Host Bridge (PHB). This is managed
 by Linux by calls into OPAL. Linux doesn't directly program the
 CAPP.
@@ -59,6 +59,13 @@ Hardware overview
 the fault. The context to which this fault is serviced is based on
 who owns that acceleration function.
 
+POWER8 <-> PSL Version 8 is compliant with CAIA Version 1.0.
+POWER9 <-> PSL Version 9 is compliant with CAIA Version 2.0.
+PSL Version 9 provides new features such as:
+* Native DMA support.
+* Support for sending ASB_Notify messages for host thread wakeup.
+* Support for atomic operations.
+*
 
 AFU Modes
 =
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index ac2531e..45363be 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma)
if (ctx->afu->current_mode == CXL_MODE_DEDICATED) {
if (start + len > ctx->afu->adapter->ps_size)
return -EINVAL;
+
+   if (cxl_is_psl9(ctx->afu)) {
+   /* make sure there is a valid problem state
+* area space for this AFU
+*/
+   if (ctx->master && !ctx->afu->psa) {
+   pr_devel("AFU doesn't support mmio space\n");
+   return -EINVAL;
+   }
+
+   /* Can't mmap until the AFU is enabled */
+   if (!ctx->afu->enabled)
+   return -EBUSY;
+   }
} else {
if (start + len > ctx->psn_size)
return -EINVAL;
-   }
 
-   if (ctx->afu->current_mode != CXL_MODE_DEDICATED) {
/* make sure there is a valid per process space for this AFU */
if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) {
pr_devel("AFU doesn't support mmio space\n");
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 82335c0..df40e6e 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -63,7 +63,7 @@ typedef struct {
 /* Memory maps. Ref CXL Appendix A */
 
 /* PSL Privilege 1 Memory Map */
-/* Configuration and Control area */
+/* Configuration and Control area - CAIA 1&2 */
 static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x};
 static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008};
 static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010};
@@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
 static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
 static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
+/* PSL registers - CAIA 2 */
+static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
+static const cxl_p1_reg_t CXL_XSL9_DSNCTL   = {0x0168};
+static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300};
+static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308};
+static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310};
+static const cxl_p1_reg_t CXL_PSL9_

[PATCH V4 4/7] cxl: Update implementation service layer

2017-04-07 Thread Christophe Lombard
The service layer API (in cxl.h) lists some low-level functions whose
implementation is different on PSL8, PSL9 and XSL:
- Init implementation for the adapter and the afu.
- Invalidate TLB/SLB.
- Attach process for dedicated/directed models.
- Handle psl interrupts.
- Debug registers for the adapter and the afu.
- Traces.
Each environment implements its own functions, and the common code uses
them through function pointers, defined in cxl_service_layer_ops.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h | 40 +++--
 drivers/misc/cxl/debugfs.c | 16 +++---
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 54 ++---
 drivers/misc/cxl/pci.c | 55 +-
 6 files changed, 110 insertions(+), 59 deletions(-)
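
With the table filled in per environment, common code dispatches through
the pointers instead of calling the PSL functions directly; the pattern
is roughly (a sketch of the call site, not a verbatim hunk):

	/* native interrupt path: defer to the service layer */
	return ctx->afu->adapter->native->sl_ops->handle_interrupt(irq, ctx,
								   &irq_info);

which is what later lets the PSL9 and XSL implementations slot in
without touching the callers.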

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 4bcbf7a..626073d 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -553,13 +553,23 @@ struct cxl_context {
struct mm_struct *mm;
 };
 
+struct cxl_irq_info;
+
 struct cxl_service_layer_ops {
int (*adapter_regs_init)(struct cxl *adapter, struct pci_dev *dev);
+   int (*invalidate_all)(struct cxl *adapter);
int (*afu_regs_init)(struct cxl_afu *afu);
+   int (*sanitise_afu_regs)(struct cxl_afu *afu);
int (*register_serr_irq)(struct cxl_afu *afu);
void (*release_serr_irq)(struct cxl_afu *afu);
-   void (*debugfs_add_adapter_sl_regs)(struct cxl *adapter, struct dentry *dir);
-   void (*debugfs_add_afu_sl_regs)(struct cxl_afu *afu, struct dentry *dir);
+   irqreturn_t (*handle_interrupt)(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info);
+   irqreturn_t (*fail_irq)(struct cxl_afu *afu, struct cxl_irq_info *irq_info);
+   int (*activate_dedicated_process)(struct cxl_afu *afu);
+   int (*attach_afu_directed)(struct cxl_context *ctx, u64 wed, u64 amr);
+   int (*attach_dedicated_process)(struct cxl_context *ctx, u64 wed, u64 amr);
+   void (*update_dedicated_ivtes)(struct cxl_context *ctx);
+   void (*debugfs_add_adapter_regs)(struct cxl *adapter, struct dentry *dir);
+   void (*debugfs_add_afu_regs)(struct cxl_afu *afu, struct dentry *dir);
void (*psl_irq_dump_registers)(struct cxl_context *ctx);
void (*err_irq_dump_registers)(struct cxl *adapter);
void (*debugfs_stop_trace)(struct cxl *adapter);
@@ -803,6 +813,11 @@ int afu_register_irqs(struct cxl_context *ctx, u32 count);
 void afu_release_irqs(struct cxl_context *ctx, void *cookie);
 void afu_irq_name_free(struct cxl_context *ctx);
 
+int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr);
+void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+
 #ifdef CONFIG_DEBUG_FS
 
 int cxl_debugfs_init(void);
@@ -811,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace(struct cxl *cxl);
-void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir);
+void cxl_stop_trace_psl(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);
 
 #else /* CONFIG_DEBUG_FS */
 
@@ -849,17 +864,17 @@ static inline void cxl_stop_trace(struct cxl *cxl)
 {
 }
 
-static inline void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter,
struct dentry *dir)
 {
 }
 
-static inline void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter,
+static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter,
struct dentry *dir)
 {
 }
 
-static inline void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir)
+static inline void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir)
 {
 }
 
@@ -904,19 +919,20 @@ struct cxl_irq_info {
 };
 
 void cxl_assign_psn_space(struct cxl_context *ctx);
-irqreturn_t cxl_irq(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info);
+int cxl_invalidate_all_psl(struct cxl *adapter);
+irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info);
+irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info);

[PATCH V4 6/7] cxl: Isolate few psl8 specific calls

2017-04-07 Thread Christophe Lombard
Point out the registers that are specific to the Coherent Accelerator
Interface Architecture, level 1.
Code and functions specific to PSL8 (CAIA1) must be explicitly marked
and guarded as such.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/context.c | 28 +++-
 drivers/misc/cxl/cxl.h | 35 +++--
 drivers/misc/cxl/debugfs.c |  6 +++--
 drivers/misc/cxl/native.c  | 43 +--
 drivers/misc/cxl/pci.c | 64 +++---
 5 files changed, 120 insertions(+), 56 deletions(-)
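
The framing relies on small PSL-version predicates; assuming the CAIA
version field read from the VSEC elsewhere in the series, they would
look something like this (the field name is an assumption, not quoted
code):

	static inline bool cxl_is_psl8(struct cxl_afu *afu)
	{
		return afu->adapter->caia_major == 1;
	}

	static inline bool cxl_is_psl9(struct cxl_afu *afu)
	{
		return afu->adapter->caia_major == 2;
	}

so PSL8-only paths such as the segment-table handling below can be
guarded with a cheap inline check.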

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 2e935ea..ac2531e 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -39,23 +39,26 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master)
 {
int i;
 
-   spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;
 
-   /*
-* Allocate the segment table before we put it in the IDR so that we
-* can always access it when dereferenced from IDR. For the same
-* reason, the segment table is only destroyed after the context is
-* removed from the IDR.  Access to this in the IOCTL is protected by
-* Linux filesytem symantics (can't IOCTL until open is complete).
-*/
-   i = cxl_alloc_sst(ctx);
-   if (i)
-   return i;
+	if (cxl_is_psl8(afu)) {
+		spin_lock_init(&ctx->sste_lock);
+
+		/*
+		 * Allocate the segment table before we put it in the IDR so that we
+		 * can always access it when dereferenced from IDR. For the same
+		 * reason, the segment table is only destroyed after the context is
+		 * removed from the IDR. Access to this in the IOCTL is protected by
+		 * Linux filesystem semantics (can't IOCTL until open is complete).
+		 */
+		i = cxl_alloc_sst(ctx);
+		if (i)
+			return i;
+	}
 
INIT_WORK(&ctx->fault_work, cxl_handle_fault);
 
@@ -308,7 +311,8 @@ static void reclaim_ctx(struct rcu_head *rcu)
 {
struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu);
 
-   free_page((u64)ctx->sstp);
+   if (cxl_is_psl8(ctx->afu))
+   free_page((u64)ctx->sstp);
if (ctx->ff_page)
__free_page(ctx->ff_page);
ctx->sstp = NULL;
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index a54c003..82335c0 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -73,7 +73,7 @@ static const cxl_p1_reg_t CXL_PSL_Control = {0x0020};
 static const cxl_p1_reg_t CXL_PSL_DLCNTL  = {0x0060};
 static const cxl_p1_reg_t CXL_PSL_DLADDR  = {0x0068};
 
-/* PSL Lookaside Buffer Management Area */
+/* PSL Lookaside Buffer Management Area - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_LBISEL  = {0x0080};
 static const cxl_p1_reg_t CXL_PSL_SLBIE   = {0x0088};
 static const cxl_p1_reg_t CXL_PSL_SLBIA   = {0x0090};
@@ -82,7 +82,7 @@ static const cxl_p1_reg_t CXL_PSL_TLBIA   = {0x00A8};
 static const cxl_p1_reg_t CXL_PSL_AFUSEL  = {0x00B0};
 
 /* 0x00C0:7EFF Implementation dependent area */
-/* PSL registers */
+/* PSL registers - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_FIR1  = {0x0100};
 static const cxl_p1_reg_t CXL_PSL_FIR2  = {0x0108};
 static const cxl_p1_reg_t CXL_PSL_Timebase  = {0x0110};
@@ -109,7 +109,7 @@ static const cxl_p1n_reg_t CXL_PSL_AMBAR_An   = {0x10};
 static const cxl_p1n_reg_t CXL_PSL_SPOffset_An= {0x18};
 static const cxl_p1n_reg_t CXL_PSL_ID_An  = {0x20};
 static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28};
-/* Memory Management and Lookaside Buffer Management */
+/* Memory Management and Lookaside Buffer Management - CAIA 1 */
 static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30};
 static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38};
 /* Pointer Area */
@@ -124,6 +124,7 @@ static const cxl_p1n_reg_t CXL_PSL_IVTE_Limit_An  = {0xB8};
 /* 0xC0:FF Implementation Dependent Area */
 static const cxl_p1n_reg_t CXL_PSL_FIR_SLICE_An   = {0xC0};
 static const cxl_p1n_reg_t CXL_AFU_DEBUG_An   = {0xC8};
+/* 0xC0:FF Implementation Dependent Area - CAIA 1 */
 static const cxl_p1n_reg_t CXL_PSL_APCALLOC_A = {0xD0};
 static const cxl_p1n_reg_t CXL_PSL_COALLOC_A  = {0xD8};
 static const cxl_p1n_reg_t CXL_PSL_RXCTL_A= {0xE0};
@@ -133,12 +134,14 @@ static const cxl_p1n_reg_t CXL_PSL_SLICE_TRACE = {0xE8};
 /* Configuration and Control Area */
 static const cxl_p2n_reg_t CXL_PSL_PID_TID_An = {0x000};
 static const cxl_p2n_reg_t CXL_CSRP_An= {0x008};
+/* Configuration and Control Area - CAIA 1 */
 static const cxl_p2n_reg_t CXL_AURP0_An   = {0x010};
 static const cxl_p2n_reg_t CXL_AURP1_An   = {0x018};
 sta

Re: [PATCH V4] powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes

2017-04-07 Thread Anshuman Khandual
On 04/07/2017 06:31 PM, Michael Ellerman wrote:
> Anshuman Khandual  writes:
> 
>> This just adds user space exported ABI definitions for 2MB, 16MB, 1GB,
>> 16GB non default huge page sizes to be used with mmap() system call.
> 
> I updated this for you to include all the sizes.
> 
>> diff --git a/arch/powerpc/include/uapi/asm/mman.h 
>> b/arch/powerpc/include/uapi/asm/mman.h
>> index 03c06ba..ebe99c7 100644
>> --- a/arch/powerpc/include/uapi/asm/mman.h
>> +++ b/arch/powerpc/include/uapi/asm/mman.h
>> @@ -29,4 +29,18 @@
>>  #define MAP_STACK   0x2 /* give out an address that is best 
>> suited for process/thread stacks */
>>  #define MAP_HUGETLB 0x4 /* create a huge page mapping */
>>  
>> +/*
>> + * These constant defines should be used for creating the
>> + * 'flags' argument (26:31 bit positions) for mmap() system
>> + * call should the caller decide to use non default HugeTLB
>> + * page size.
>> + */
> 
> And I reworded the comment to make it clearer (I think) that most users
> shouldn't need to use these, and should just use the default size:
> 
> /*
>  * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
>  * A value of zero indicates that the default huge page size should be used.
>  * To use a non-default huge page size, one of these defines can be used, or 
> the
>  * size can be encoded by hand. Note that on most systems only a subset, or
>  * possibly none, of these sizes will be available.
>  */

Yeah, it's clearer, thanks.

> 
> 
> Also do you want to send a patch to the man page?
> 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/mmap.2#n248
> 

Sure, will do it.
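
For reference, with these defines a caller requesting a non-default size
would do something like the sketch below; MAP_HUGE_16MB encodes
log2(16MB) = 24 into bits [26:31] of the flags:

	#include <sys/mman.h>

	/* 16MB-backed anonymous hugetlb mapping (sketch) */
	void *p = mmap(NULL, 16UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_16MB, -1, 0);

Leaving the size bits zero falls back to the default huge page size, as
the reworded comment says.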



Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Peter Zijlstra
On Fri, Apr 07, 2017 at 06:47:43PM +0800, Jin Yao wrote:
> Perf already has support for disassembling the branch instruction
> and using the branch type for filtering. The patch just records
> the branch type in perf_branch_entry.
> 
> Before recording, the patch converts the x86 branch classification
> to the common branch classification and checks whether the branch
> crosses a 4K or 2MB boundary. The crossing check is approximate.

The changelog is completely empty of rationale. Why do we care?

Not having the binary is a very bad reason; you can't do much of
anything if that's missing.


> @@ -923,6 +933,84 @@ static int branch_type(unsigned long from, unsigned long 
> to, int abort)
>   return ret;
>  }
>  
> +static int
> +common_branch_type(int type, u64 from, u64 to)
> +{
> + int ret;
> +
> + type = type & (~(X86_BR_KERNEL | X86_BR_USER));
> +
> + switch (type) {
> + case X86_BR_CALL:
> + case X86_BR_ZERO_CALL:
> + ret = PERF_BR_CALL;
> + break;
> +
> + case X86_BR_RET:
> + ret = PERF_BR_RET;
> + break;
> +
> + case X86_BR_SYSCALL:
> + ret = PERF_BR_SYSCALL;
> + break;
> +
> + case X86_BR_SYSRET:
> + ret = PERF_BR_SYSRET;
> + break;
> +
> + case X86_BR_INT:
> + ret = PERF_BR_INT;
> + break;
> +
> + case X86_BR_IRET:
> + ret = PERF_BR_IRET;
> + break;
> +
> + case X86_BR_IRQ:
> + ret = PERF_BR_IRQ;
> + break;
> +
> + case X86_BR_ABORT:
> + ret = PERF_BR_FAR_BRANCH;
> + break;
> +
> + case X86_BR_JCC:
> + if (to > from)
> + ret = PERF_BR_JCC_FWD;
> + else
> + ret = PERF_BR_JCC_BWD;
> + break;

This seems like superfluous information; we already get to and from, so
this comparison is pointless.

The rest looks like something you could implement more simply using a
lookup table.
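
Since the X86_BR_* values are single bits once X86_BR_USER/X86_BR_KERNEL
are masked off, such a table could look roughly like this (a sketch of
the suggestion, not code from the patch):

	static const int branch_map[] = {
		[ilog2(X86_BR_CALL)]	= PERF_BR_CALL,
		[ilog2(X86_BR_RET)]	= PERF_BR_RET,
		[ilog2(X86_BR_SYSCALL)]	= PERF_BR_SYSCALL,
		/* ... one entry per remaining X86_BR_* bit ... */
	};

	type &= ~(X86_BR_KERNEL | X86_BR_USER);
	return type ? branch_map[__ffs(type)] : PERF_BR_NONE;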

> +
> + case X86_BR_JMP:
> + ret = PERF_BR_JMP;
> + break;
> +
> + case X86_BR_IND_CALL:
> + ret = PERF_BR_IND_CALL;
> + break;
> +
> + case X86_BR_IND_JMP:
> + ret = PERF_BR_IND_JMP;
> + break;
> +
> + default:
> + ret = PERF_BR_NONE;
> + }
> +
> + return ret;
> +}
> +
> +static bool
> +cross_area(u64 addr1, u64 addr2, int size)
> +{
> + u64 align1, align2;
> +
> + align1 = addr1 & ~(size - 1);
> + align2 = addr2 & ~(size - 1);
> +
> + return (align1 != align2) ? true : false;
> +}
> +
>  /*
>   * implement actual branch filter based on user demand.
>   * Hardware may not exactly satisfy that request, thus
> @@ -939,7 +1027,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>   bool compress = false;
>  
>   /* if sampling all branches, then nothing to filter */
> - if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
> + if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
> + ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
>   return;
>  
>   for (i = 0; i < cpuc->lbr_stack.nr; i++) {
> @@ -960,6 +1049,21 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>   cpuc->lbr_entries[i].from = 0;
>   compress = true;
>   }
> +
> + if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE) {
> + cpuc->lbr_entries[i].type = common_branch_type(type,
> +from,
> +to);
> + if (cross_area(from, to, AREA_2M))
> + cpuc->lbr_entries[i].cross = PERF_BR_CROSS_2M;
> + else if (cross_area(from, to, AREA_4K))
> + cpuc->lbr_entries[i].cross = PERF_BR_CROSS_4K;
> + else
> + cpuc->lbr_entries[i].cross = PERF_BR_CROSS_NONE;

This again is superfluous information; it is already fully contained in
to and from, which we have.

> + } else {
> + cpuc->lbr_entries[i].type = PERF_BR_NONE;
> + cpuc->lbr_entries[i].cross = PERF_BR_CROSS_NONE;
> + }
>   }
>  
>   if (!compress)
> -- 
> 2.7.4
> 


Re: [PATCH V4] powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes

2017-04-07 Thread Paul Clarke

nits... take 'em or leave 'em...

On 04/07/2017 08:01 AM, Michael Ellerman wrote:

Anshuman Khandual  writes:
And I reworded the comment to make it clearer (I think) that most users
shouldn't need to use these, and should just use the default size:

/*
 * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.


need a comma after "set".

also, "bits [26:31]" of what?


 * A value of zero indicates that the default huge page size should be used.
 * To use a non-default huge page size, one of these defines can be used, or the
 * size can be encoded by hand. Note that on most systems only a subset, or
 * possibly none, of these sizes will be available.
 */


PC



Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-07 Thread Will Deacon
On Fri, Apr 07, 2017 at 01:30:11AM +1000, Nicholas Piggin wrote:
> On Thu, 6 Apr 2017 15:13:53 +0100
> Will Deacon  wrote:
> > On Thu, Apr 06, 2017 at 10:59:58AM +1000, Nicholas Piggin wrote:
> > > Thanks for taking a look. The default spin primitives should just
> > > continue to do the right thing for you in that case.
> > > 
> > > Arm has a yield instruction, ia64 has a pause... No unusual
> > > requirements that I can see.  
> > 
> > Yield tends to be implemented as a NOP in practice, since it's in the
> > architecture for SMT CPUs and most ARM CPUs are single-threaded. We do have
> > the WFE instruction (wait for event) which is used in our implementation of
> > smp_cond_load_acquire, but I don't think we'd be able to use it with the
> > proposals here.
> > 
> > WFE can stop the clock for the CPU until an "event" is signalled by
> > another CPU. This could be done by an explicit SEV (send event) instruction,
> > but that tends to require heavy barriers on the signalling side. Instead,
> > the preferred way to generate an event is to clear the exclusive monitor
> > reservation for the CPU executing the WFE. That means that the waiter
> > does something like:
> > 
> > LDXR x0, [some_address] // Load exclusive from some_address
> > CMP  x0, some value // If the value matches what I want
> > B.EQ out// then we're done
> > WFE // otherwise, wait
> > 
> > at this point, the waiter will stop on the WFE until its monitor is cleared,
> > which happens if another CPU writes to some_address.
> > 
> > We've wrapped this up in the arm64 code as __cmpwait, and we use that
> > to build smp_cond_load_acquire. It would be nice to use the same machinery
> > for the conditional spinning here, unless you anticipate that we're only
> > going to be spinning for a handful of iterations anyway?
> 
> So I do want to look at adding spin loop primitives as well as the
> begin/in/end primitives to help with powerpc's SMT priorities.
> 
> So we'd have:
> 
>   spin_begin();
>   spin_do {
> if (blah) {
> spin_end();
> return;
> }
>   } spin_until(!locked);
>   spin_end();
> 
> So you could implement your monitor with that. There's a handful of core
> places. mutex, bit spinlock, seqlock, polling idle, etc. So I think if it
> is beneficial for you in smp_cond_load_acquire, it should be useful in
> those too.

Yeah, I think we should be able to implement spin_until like we do for
smp_cond_load_acquire, although it means we need to pass in the pointer as
well.

Will
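
To make the proposed API concrete, a bit-spinlock style wait would read
something like this (a sketch; spin_begin/spin_cpu_relax/spin_end are one
plausible shape of the primitives under discussion, not an existing
interface):

	static inline void wait_for_bit_clear(int bitnum, unsigned long *addr)
	{
		spin_begin();
		while (test_bit(bitnum, addr))
			spin_cpu_relax();
		spin_end();
	}

On powerpc, spin_begin/spin_end would lower and restore SMT priority
around the loop, while arm64 could fold the load and test into a
WFE-based __cmpwait as described above.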


Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Andi Kleen
On Fri, Apr 07, 2017 at 05:20:31PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 06:47:43PM +0800, Jin Yao wrote:
> > Perf already has support for disassembling the branch instruction
> > and using the branch type for filtering. The patch just records
> > the branch type in perf_branch_entry.
> > 
> > Before recording, the patch converts the x86 branch classification
> > to the common branch classification and checks whether the branch
> > crosses a 4K or 2MB boundary. The crossing check is approximate.
> 
> The changelog is completely empty of rationale. Why do we care?
> 
> Not having the binary is a very bad reason; you can't do much of
> anything if that's missing.

It's a somewhat common situation with partially JITed code, if you
don't have an agent. You can still do a lot of useful things.

We found it useful to have this extra information during workload
analysis. Forward conditionals and page crossing jumps
are indications of frontend problems.

-Andi


Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Peter Zijlstra
On Fri, Apr 07, 2017 at 09:48:34AM -0700, Andi Kleen wrote:
> On Fri, Apr 07, 2017 at 05:20:31PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 07, 2017 at 06:47:43PM +0800, Jin Yao wrote:
> > > Perf already has support for disassembling the branch instruction
> > > and using the branch type for filtering. The patch just records
> > > the branch type in perf_branch_entry.
> > > 
> > > Before recording, the patch converts the x86 branch classification
> > > to the common branch classification and checks whether the branch
> > > crosses a 4K or 2MB boundary. The crossing check is approximate.
> > 
> > The changelog is completely empty of rationale. Why do we care?
> > 
> > Not having the binary is a very bad reason; you can't do much of
> > anything if that's missing.
> 
> It's a somewhat common situation with partially JITed code, if you
> don't have an agent. You can still do a lot of useful things.

Like what? How can you say anything about code you don't have?

> We found it useful to have this extra information during workload
> analysis. Forward conditionals and page crossing jumps
> are indications of frontend problems.

But you already have the exact same information in {to,from}, why would
you need to repackage information already contained?


Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-07 Thread Hari Bathini

Hi Michael,


On Friday 07 April 2017 07:16 PM, Michael Ellerman wrote:

Hari Bathini  writes:

On Friday 07 April 2017 07:24 AM, Michael Ellerman wrote:

My preference would be that the fadump kernel "just works". If it's
using too much memory then the fadump kernel should do whatever it needs
to use less memory, eg. shrinking nr_cpu_ids etc.
Do we actually know *why* the fadump kernel is running out of memory?
Obviously large numbers of CPUs is one of the main drivers (lots of
stacks required). But other than that what is causing the memory
pressure? I would like some data on that before we proceed.

Almost the same amount of memory in comparison with the memory
required to boot the production kernel but that is unwarranted for fadump
(dump capture) kernel.

That's not data! :)


I am collating the data. Sorry! I should have mentioned it :)


The dump kernel is booted with *much* less memory than the production
kernel (that's the whole issue!) and so it doesn't need to create struct
pages for all that memory, which means it should need less memory.


What I meant was, if we were to boot the production kernel with mem=X, where
X is the smallest possible value that boots the kernel without resulting in
an OOM, fadump needed nearly the same amount to be reserved for it to capture
a dump without hitting an OOM. But this was an observation on a system with
not much memory.

Will try on a system with large memory and report back with data..



The vfs caches are also sized based on the available memory, so they
should also shrink in the dump kernel.

I want some actual numbers on what's driving the memory usage.

I tried some of these parameters to see how much memory they would save:


So, if parameters like
cgroup_disable=memory,

0 bytes saved.


Interesting.. was CONFIG_MEMCG set on the kernel?




transparent_hugepages=never,

0 bytes saved.


Not surprising unless transparent hugepages were used


numa=off,

64KB saved.


In the memory-starved dump capture environment, every byte counts, I guess :)

Also, depends on the numa config?


nr_cpus=1,

3MB saved (vs 16 CPUs)


Now maybe on your system those do save memory for some reason, but
please prove it to me. Otherwise I'm inclined to merge:

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ff0dd4e77a7..03f1f253c372 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 * dump data waiting for us.
 */
fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
+   if (fdm_active) {
fw_dump.dump_active = 1;
+   nr_cpu_ids = 1;
+   }

/* Get the sizes required to store dump data for the firmware provided
 * dump sections.


Necessary but not sufficient is the point I am trying to make. Apparently not
convincing enough. Will try and come back with relevant data :)

Thanks
Hari



Re: [PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-07 Thread Andi Kleen
> > It's a somewhat common situation with partially JITed code, if you
> > don't have an agent. You can still do a lot of useful things.
> 
> Like what? How can you say anything about code you don't have?

For example, if you combine this with the PMU topdown measurement and see
that the workload is frontend bound, and then you see it has lots of
forward conditionals, then dynamic basic block reordering will help. If you
have lots of cross-page jumps, then function reordering will help, etc.

> > We found it useful to have this extra information during workload
> > analysis. Forward conditionals and page crossing jumps
> > are indications of frontend problems.
> 
> But you already have the exact same information in {to,from}, why would
> you need to repackage information already contained?

Without this patch, we don't know if a branch is conditional or something
else. And the kernel already knows this for its filtering, so it might as
well report it.

Right, the CROSS_* and forward/backward information could be computed
later.

-Andi