Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
Thiago Jung Bauermann  writes:

> Hello Eric,
>
> On Friday, 16 September 2016 at 14:47:13, Eric W. Biederman wrote:
>> Mimi Zohar  writes:
>> > Hi Andrew,
>> > 
>> > On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>> >> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>> >> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar wrote:
>> >> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
>> >> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
>> >> > > of the running kernel must be saved and then restored on the subsequent
>> >> > > boot, possibly of a different architecture.
>> >> > > 
>> >> > > The existing securityfs binary_runtime_measurements file conveniently
>> >> > > provides a serialized format of the IMA measurement list. This patch
>> >> > > set serializes the measurement list in this format and restores it.
>> >> > > 
>> >> > > Up to now, the binary_runtime_measurements was defined as architecture
>> >> > > native format.  The assumption being that userspace could and would
>> >> > > handle any architecture conversions.  With the ability of carrying the
>> >> > > measurement list across kexec, possibly from one architecture to a
>> >> > > different one, the per boot architecture information is lost and with it
>> >> > > the ability of recalculating the template digest hash.  To resolve this
>> >> > > problem, without breaking the existing ABI, this patch set introduces
>> >> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
>> >> > > defined as little endian.
>> >> > > 
>> >> > > The need for this boot command line option will be limited to the
>> >> > > existing version 1 format of the binary_runtime_measurements.
>> >> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>> >> > > support for larger digests).
>> >> > > 
>> >> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>> >> > > hand-over for the next kernel" patch set.
>> >> > > 
>> >> > > These patches can also be found in the next-kexec-restore branch of:
>> >> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>> >> > 
>> >> > I'll merge these into -mm to get some linux-next exposure.  I don't
>> >> > know what your upstream merge plans will be?
>> >> 
>> >> Sounds good.  I'm hoping to get some review/comments on this patch set
>> >> as well.  At the moment, I'm chasing down a kernel test robot report
>> >> from this afternoon.
>> > 
>> > My concern about changing the canonical format as originally defined in
>> > patch 9/9 from big endian to little endian never materialized.  Andreas
>> > Steffan, the patch author, is happy either way.
>> > 
>> > We proposed two methods of addressing Eric Biederman's concerns of not
>> > including the IMA measurement list segment in the kexec hash as
>> > described in  https://lkml.org/lkml/2016/9/9/355.
>> > 
>> > - defer calculating and verifying the serialized IMA measurement list
>> > buffer hash to IMA
>> > - calculate the kexec hash on load, verify it on the kexec execute,
>> > before re-calculating and updating it.
>> 
>> I need to ask: How is this anticipated to interact with kexec on panic?
>> Because honestly I can't see this ever working in that case.  The
>> assumption is that the original kernel has gone crazy.  So from a
>> practical standpoint any trusted path should have been invalidated.
>
> We are not interested in carrying the measurement list in the case of kexec 
> on panic. I see that the code is adding a hand-over buffer to the crash 
> image as well, but that is a bug.
>
> The fix is to do nothing in ima_add_kexec_buffer if image->type != 
> KEXEC_TYPE_DEFAULT.
>  
>> This entire idea of updating the kexec image makes me extremely
>> extremely nervous.  It feels like sticking a screwdriver through the
>> spokes of your bicycle tires while riding down the road.
>> 
>> I can see tracking to see if the list has changed at some
>> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.
>
> Yes, that is an interesting feature that I can add using the checksum-
> verifying part of my code. I can submit a patch for that if there's 
> interest, adding a reboot notifier that verifies the checksum and causes a 
> regular reboot instead of a kexec reboot if the checksum fails.

I was thinking of an early failure instead of getting all of the way down
into a kernel and discovering the tpm/ima subsystem would not be
initialized.  But given where that falls in the reboot pathway, I don't
expect there is much value in it.

>> At least the common bootloader cases that I know of using kexec are very
>> minimal distributions that live in a ramdisk and as such it should be
>> very straight forward to measure what is needed at or before
>> sys_kexec_load.  But that was completely dismissed as unrealistic so I
>> don't have a clue what actual problem you are trying to solve.

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Thiago Jung Bauermann
Hello Eric,

On Friday, 16 September 2016 at 14:47:13, Eric W. Biederman wrote:
> Mimi Zohar  writes:
> > Hi Andrew,
> > 
> > On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
> >> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
> >> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar wrote:
> >> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
> >> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
> >> > > of the running kernel must be saved and then restored on the subsequent
> >> > > boot, possibly of a different architecture.
> >> > > 
> >> > > The existing securityfs binary_runtime_measurements file conveniently
> >> > > provides a serialized format of the IMA measurement list. This patch
> >> > > set serializes the measurement list in this format and restores it.
> >> > > 
> >> > > Up to now, the binary_runtime_measurements was defined as architecture
> >> > > native format.  The assumption being that userspace could and would
> >> > > handle any architecture conversions.  With the ability of carrying the
> >> > > measurement list across kexec, possibly from one architecture to a
> >> > > different one, the per boot architecture information is lost and with it
> >> > > the ability of recalculating the template digest hash.  To resolve this
> >> > > problem, without breaking the existing ABI, this patch set introduces
> >> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
> >> > > defined as little endian.
> >> > > 
> >> > > The need for this boot command line option will be limited to the
> >> > > existing version 1 format of the binary_runtime_measurements.
> >> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
> >> > > support for larger digests).
> >> > > 
> >> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
> >> > > hand-over for the next kernel" patch set.
> >> > > 
> >> > > These patches can also be found in the next-kexec-restore branch of:
> >> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
> >> > 
> >> > I'll merge these into -mm to get some linux-next exposure.  I don't
> >> > know what your upstream merge plans will be?
> >> 
> >> Sounds good.  I'm hoping to get some review/comments on this patch set
> >> as well.  At the moment, I'm chasing down a kernel test robot report
> >> from this afternoon.
> > 
> > My concern about changing the canonical format as originally defined in
> > patch 9/9 from big endian to little endian never materialized.  Andreas
> > Steffan, the patch author, is happy either way.
> > 
> > We proposed two methods of addressing Eric Biederman's concerns of not
> > including the IMA measurement list segment in the kexec hash as
> > described in  https://lkml.org/lkml/2016/9/9/355.
> > 
> > - defer calculating and verifying the serialized IMA measurement list
> > buffer hash to IMA
> > - calculate the kexec hash on load, verify it on the kexec execute,
> > before re-calculating and updating it.
> 
> I need to ask: How is this anticipated to interact with kexec on panic?
> Because honestly I can't see this ever working in that case.  The
> assumption is that the original kernel has gone crazy.  So from a
> practical standpoint any trusted path should have been invalidated.

We are not interested in carrying the measurement list in the case of kexec 
on panic. I see that the code is adding a hand-over buffer to the crash 
image as well, but that is a bug.

The fix is to do nothing in ima_add_kexec_buffer if image->type != 
KEXEC_TYPE_DEFAULT.
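
A minimal sketch of that fix (the function name comes from this patch set;
its exact signature in the series may differ):

#include <linux/kexec.h>

/*
 * Sketch only: carry the measurement list across a normal kexec,
 * never into the crash (panic) kernel image.
 */
void ima_add_kexec_buffer(struct kimage *image)
{
	if (image->type != KEXEC_TYPE_DEFAULT)
		return;

	/* ... serialize the measurement list and set up the hand-over buffer ... */
}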
 
> This entire idea of updating the kexec image makes me extremely
> extremely nervous.  It feels like sticking a screwdriver through the
> spokes of your bicycle tires while riding down the road.
> 
> I can see tracking to see if the list has changed at some
> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.

Yes, that is an interesting feature that I can add using the checksum-
verifying part of my code. I can submit a patch for that if there's 
interest, adding a reboot notifier that verifies the checksum and causes a 
regular reboot instead of a kexec reboot if the checksum fails.
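
For illustration, the rough shape such a notifier could take (a sketch with
hypothetical names; ima_kexec_buffer_checksum_ok() and the fallback
mechanism would come from the checksum-verifying code mentioned above):

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

/* Hypothetical helper from the checksum-verifying part of the series. */
extern bool ima_kexec_buffer_checksum_ok(void);

static int ima_kexec_reboot_notify(struct notifier_block *nb,
				   unsigned long action, void *data)
{
	/*
	 * Sketch only: reboot notifiers also run on the kexec path, so a
	 * checksum mismatch could be flagged here; actually demoting the
	 * kexec reboot to a regular reboot would need additional plumbing.
	 */
	if (!ima_kexec_buffer_checksum_ok())
		pr_warn("ima: kexec hand-over buffer checksum mismatch\n");
	return NOTIFY_DONE;
}

static struct notifier_block ima_kexec_reboot_nb = {
	.notifier_call = ima_kexec_reboot_notify,
};

/* e.g. from an initcall: register_reboot_notifier(&ima_kexec_reboot_nb); */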

> At least the common bootloader cases that I know of using kexec are very
> minimal distributions that live in a ramdisk and as such it should be
> very straight forward to measure what is needed at or before
> sys_kexec_load.  But that was completely dismissed as unrealistic so I
> don't have a clue what actual problem you are trying to solve.

We are interested in solving the problem in a general way because it will be 
useful to us in the future for the case of an arbitrary number of kexecs 
(and thus not only a bootloader but also multiple full-blown distros may be 
involved in the chain).

But you are right that for the use case for which we currently need thi

Re: [PATCH v8 11/13] powerpc: Add support for loading ELF kernels with kexec_file_load.

2016-09-16 Thread Thiago Jung Bauermann
Hello,

This patch causes a warning in GCC 4.6.3:

arch/powerpc/kernel/kexec_elf_64.c:211:6: error: 'initrd_load_addr' may be used 
uninitialized in this function [-Werror=uninitialized]
cc1: all warnings being treated as errors
make[2]: *** [arch/powerpc/kernel/kexec_elf_64.o] Error 1
make[1]: *** [arch/powerpc/kernel] Error 2
make: *** [sub-make] Error 2

It's true that setup_new_fdt may be called with an uninitialized value
for initrd_load_addr if initrd == NULL, but in that case initrd_len
will be 0 as well (because both are set at the same time by
kernel_read_file_from_fd in kimage_file_prepare_segments) and
setup_new_fdt won't try to use initrd_load_addr. Therefore the warning
is harmless, because the situation where initrd_load_addr may be used
uninitialized can't happen.
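
In other words, the consumer is guarded by initrd_len. A sketch of that
shape (illustrative, not the series' actual code; property names follow the
usual /chosen convention):

#include <libfdt.h>

static int sketch_setup_initrd_props(void *fdt, int chosen_node,
				     unsigned long initrd_load_addr,
				     unsigned long initrd_len)
{
	int ret = 0;

	if (initrd_len) {
		/* Only inside this branch is initrd_load_addr consumed. */
		ret = fdt_setprop_u64(fdt, chosen_node, "linux,initrd-start",
				      initrd_load_addr);
		if (!ret)
			ret = fdt_setprop_u64(fdt, chosen_node,
					      "linux,initrd-end",
					      initrd_load_addr + initrd_len);
	}
	return ret;
}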

The patch below has the following change:

@@ -153,7 +153,7 @@ void *elf64_load(struct kimage *image, char *kernel_buf,
int i, ret;
unsigned int fdt_size;
unsigned long kernel_load_addr, purgatory_load_addr;
-   unsigned long initrd_load_addr, fdt_load_addr, stack_top;
+   unsigned long initrd_load_addr = 0, fdt_load_addr, stack_top;
void *fdt;
const void *slave_code;
struct elfhdr ehdr;

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


Subject: [PATCH v8 11/13] powerpc: Add support for loading ELF kernels with
 kexec_file_load.

This uses all the infrastructure built up by the previous patches
in the series to load an ELF vmlinux file and an initrd. It uses the
flattened device tree at initial_boot_params as a base and adjusts memory
reservations and its /chosen node for the next kernel.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/kexec_elf_64.h |  10 ++
 arch/powerpc/kernel/Makefile|   1 +
 arch/powerpc/kernel/kexec_elf_64.c  | 282 
 arch/powerpc/kernel/machine_kexec_64.c  |   5 +-
 4 files changed, 297 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kexec_elf_64.h b/arch/powerpc/include/asm/kexec_elf_64.h
new file mode 100644
index ..30da6bc0ccf8
--- /dev/null
+++ b/arch/powerpc/include/asm/kexec_elf_64.h
@@ -0,0 +1,10 @@
+#ifndef __POWERPC_KEXEC_ELF_64_H__
+#define __POWERPC_KEXEC_ELF_64_H__
+
+#ifdef CONFIG_KEXEC_FILE
+
+extern struct kexec_file_ops kexec_elf64_ops;
+
+#endif /* CONFIG_KEXEC_FILE */
+
+#endif /* __POWERPC_KEXEC_ELF_64_H__ */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index fef0d730acc4..d12a84003283 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
 obj-$(CONFIG_PCI)   += pci_$(CONFIG_WORD_SIZE).o $(pci64-y) \
 obj-$(CONFIG_PCI_MSI)  += msi.o
 obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
   machine_kexec_$(CONFIG_WORD_SIZE).o
+obj-$(CONFIG_KEXEC_FILE)   += kexec_elf_$(CONFIG_WORD_SIZE).o
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
 
diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/arch/powerpc/kernel/kexec_elf_64.c
new file mode 100644
index ..c61243668bc3
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -0,0 +1,282 @@
+/*
+ * Load ELF vmlinux file for the kexec_file_load syscall.
+ *
+ * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
+ * Copyright (C) 2004  IBM Corp.
+ * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
+ * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
+ * Copyright (C) 2016  IBM Corporation
+ *
+ * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c.
+ * Heavily modified for the kernel by
+ * Thiago Jung Bauermann .
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt)	"kexec_elf: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+extern size_t kexec_purgatory_size;
+
+#define PURGATORY_STACK_SIZE   (16 * 1024)
+
+/**
+ * build_elf_exec_info - read ELF executable and check that we can use it
+ */
+static int build_elf_exec_info(const char *buf, size_t len, struct elfhdr *ehdr,
+			       struct elf_info *elf_info)
+{
+   int i;
+   int ret;
+
+   ret = elf_read_from_buffer(buf, len, ehdr, elf_info);
+   if (ret)
+   return ret;
+
+   /* Big endian vmlinux has type ET_DYN. */
+   if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN) {
+   pr_err("Not an ELF executable.\n");
+   goto error;
+   } else if (!elf_info->proghdrs) {
+  

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Mimi Zohar  writes:
>
>> Hi Andrew,
>>
>> On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>>> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>>> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar  
>>> > wrote:
>>> > 
>>> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
>>> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
>>> > > of the running kernel must be saved and then restored on the subsequent
>>> > > boot, possibly of a different architecture.
>>> > > 
>>> > > The existing securityfs binary_runtime_measurements file conveniently
>>> > > provides a serialized format of the IMA measurement list. This patch
>>> > > set serializes the measurement list in this format and restores it.
>>> > > 
>>> > > Up to now, the binary_runtime_measurements was defined as architecture
>>> > > native format.  The assumption being that userspace could and would
>>> > > handle any architecture conversions.  With the ability of carrying the
>>> > > measurement list across kexec, possibly from one architecture to a
>>> > > different one, the per boot architecture information is lost and with it
>>> > > the ability of recalculating the template digest hash.  To resolve this
>>> > > problem, without breaking the existing ABI, this patch set introduces
>>> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
>>> > > defined as little endian.
>>> > > 
>>> > > The need for this boot command line option will be limited to the
>>> > > existing version 1 format of the binary_runtime_measurements.
>>> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>>> > > support for larger digests).
>>> > > 
>>> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>>> > > hand-over for the next kernel" patch set. 
>>> > > 
>>> > > These patches can also be found in the next-kexec-restore branch of:
>>> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>>> > 
>>> > I'll merge these into -mm to get some linux-next exposure.  I don't
>>> > know what your upstream merge plans will be?
>>> 
>>> Sounds good.  I'm hoping to get some review/comments on this patch set
>>> as well.  At the moment, I'm chasing down a kernel test robot report
>>> from this afternoon.
>>
>> My concern about changing the canonical format as originally defined in
>> patch 9/9 from big endian to little endian never materialized.  Andreas
>> Steffan, the patch author, is happy either way.
>>
>> We proposed two methods of addressing Eric Biederman's concerns of not
>> including the IMA measurement list segment in the kexec hash as
>> described in  https://lkml.org/lkml/2016/9/9/355.
>>
>> - defer calculating and verifying the serialized IMA measurement list
>> buffer hash to IMA
>> - calculate the kexec hash on load, verify it on the kexec execute,
>> before re-calculating and updating it.
>
> I need to ask: How is this anticipated to interact with kexec on panic?
> Because honestly I can't see this ever working in that case.  The
> assumption is that the original kernel has gone crazy.  So from a
> practical standpoint any trusted path should have been invalidated.
>
> This entire idea of updating the kexec image makes me extremely
> extremely nervous.  It feels like sticking a screwdriver through the
> spokes of your bicycle tires while riding down the road.
>
> I can see tracking to see if the list has changed at some
> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.
>
> At least the common bootloader cases that I know of using kexec are very
> minimal distributions that live in a ramdisk and as such it should be
> very straight forward to measure what is needed at or before
> sys_kexec_load.  But that was completely dismissed as unrealistic so I
> don't have a clue what actual problem you are trying to solve.
>
> If there is any way we can start small and not with this big scary
> infrastructure change I would very much prefer it.

I have thought about this a little more and the entire reason for
updating things on the fly really really disturbs me.  To prove you are
trusted the new kernel is going to have to present that whole trusted
list of files to someone.  Which means in my naive understanding of the
situation that any change in that list of files is going to have to be
tracked and audited.

So the idea that we have to be super flexible in the kernel (when we are
inflexible in userspace) does not make a bit of sense to me.

So no.  I am not in favor of adding a mechanism to kexec that gives me
the screaming heebie jeebies and that appears to be completely at odds
with what that mechanism is trying to do.

AKA if you are going to trust any old thing, or any old change on the
reboot path then it doesn't make sense to track them.  If you are
tracking them it doesn't make sense to have a mechanism where anything
goes.

Eric



Re: [PATCH 2/3] corenet: Support gpio power/reset for corenet

2016-09-16 Thread Scott Wood
On 09/15/2016 03:03 AM, Andy Fleming wrote:
> I agree that halt and power off mean and have always meant different
> things to the kernel. The problem is that most desktop systems,
> having halted, pass control to the BIOS which--usually--shuts off the
> power. Am I wrong about this? I've been using shutdown -h now to turn
> off my Linux systems for nearly 2 decades now, but I admit that I
> don't do it often, and I tend to stick with whatever works.

From the shutdown man page:

 -h
   Equivalent to --poweroff, unless --halt is specified.

Again, let's talk in terms of the kernel API rather than quirky
userspace commands.

FWIW, I've always used "halt -p" and recall the system not powering off
on PCs when I use plain "halt", though it's been many years since I've
tried.

>>>I don't see any other platforms doing this.  How do the nodes get probed
>>>for them?
>>>
>>>
>>>The answer is I don't know, but this is a common issue with adding
>>>new devices to the device tree in embedded powerpc. The only other
>>>platforms which have gpio-poweroff nodes in their trees are in
>>>arch/arm, and none of those platforms call the probing
>>>function of_platform_bus_probe. I suspect they either probe every
>>>root node, or they somehow construct the match_id. As noted in the
>>>above-referenced commit, putting the nodes under the gpio bus does
>>>not cause them to get probed. This seemed like the best way under
>>>the current corenet code.
>>
>> Well, let's figure out what it is that PPC should be doing to have
>> things work the way it does on ARM.
> 
> For all of the devices? Or just these two?

All of them.  If ARM isn't maintaining these annoying lists why should
we have to? :-P

-Scott



[PATCH v14 11/17] powerpc/PCI: Add IORESOURCE_MEM_64 for 64-bit resource in OF parsing

2016-09-16 Thread Yinghai Lu
When setting the PREF bit on a device resource under a bridge's 64-bit
prefetchable window, we need to make sure PREF is only set for 64-bit
resources.

This patch sets IORESOURCE_MEM_64 for 64-bit resources during OF device
resource flags parsing.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96261
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96241
Signed-off-by: Yinghai Lu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Gavin Shan 
Cc: Yijing Wang 
Cc: Anton Blanchard 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/pci_of_scan.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
index 719f225..476b8ac5 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -44,8 +44,10 @@ static unsigned int pci_parse_of_flags(u32 addr0, int bridge)
 
	if (addr0 & 0x02000000) {
		flags = IORESOURCE_MEM | PCI_BASE_ADDRESS_SPACE_MEMORY;
-		flags |= (addr0 >> 22) & PCI_BASE_ADDRESS_MEM_TYPE_64;
		flags |= (addr0 >> 28) & PCI_BASE_ADDRESS_MEM_TYPE_1M;
+		if (addr0 & 0x01000000)
+			flags |= IORESOURCE_MEM_64
+				 | PCI_BASE_ADDRESS_MEM_TYPE_64;
		if (addr0 & 0x40000000)
flags |= IORESOURCE_PREFETCH
 | PCI_BASE_ADDRESS_MEM_PREFETCH;
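
(Editorial note, not part of the patch: the bit tests above follow the
IEEE 1275 PCI binding's phys.hi cell.)

/* phys.hi: npt000ss bbbbbbbb dddddfff rrrrrrrr
 *   ss = 01 -> I/O space, 10 -> 32-bit memory, 11 -> 64-bit memory,
 *   so (addr0 & 0x02000000) matches both memory types and
 *      (addr0 & 0x01000000) singles out 64-bit memory (ss == 11);
 *   p (bit 30, 0x40000000) marks the range prefetchable.
 */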
-- 
2.8.3



[PATCH v14 02/17] PCI: Let pci_mmap_page_range() take resource address

2016-09-16 Thread Yinghai Lu
The original pci_mmap_page_range() takes the PCI BAR value, aka the
user address.

Bjorn found out that it would be much simpler to pass the resource
address directly and avoid those extra __pci_mmap_make_offset() calls.

In this patch:
1. in the proc path (proc_bus_pci_mmap), convert back to the resource
   address before calling pci_mmap_page_range()
2. in the sysfs path, pci_mmap_resource just offsets from the resource start.
3. all pci_mmap_page_range() callers have vma->vm_pgoff within the resource
   range instead of the BAR value.
4. skip calling __pci_mmap_make_offset(), as the checking is done
   in pci_mmap_fits().

-v2: add pci_user_to_resource and remove __pci_mmap_make_offset
-v3: pass resource pointer with pci_mmap_page_range()
-v4: put __pci_mmap_make_offset() removing to following patch
 separate /sys io access alignment checking to another patch
 updated after Bjorn's pci_resource_to_user() changes.
-v5: update after fix for pci_mmap with proc path according to
 Bjorn.

Signed-off-by: Yinghai Lu 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
---
 arch/microblaze/pci/pci-common.c | 11 +---
 arch/powerpc/kernel/pci-common.c | 11 +---
 arch/sparc/kernel/pci.c  |  4 ---
 arch/xtensa/kernel/pci.c | 13 +++--
 drivers/pci/pci-sysfs.c  | 32 ++---
 drivers/pci/pci.h|  2 +-
 drivers/pci/proc.c   | 60 ++--
 7 files changed, 93 insertions(+), 40 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 81556b8..9e3bc05 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -282,12 +282,15 @@ int pci_mmap_page_range(struct pci_dev *dev, struct vm_area_struct *vma,
 {
resource_size_t offset =
((resource_size_t)vma->vm_pgoff) << PAGE_SHIFT;
-   struct resource *rp;
int ret;
 
-   rp = __pci_mmap_make_offset(dev, &offset, mmap_state);
-   if (rp == NULL)
-   return -EINVAL;
+   if (mmap_state == pci_mmap_io) {
+   struct pci_controller *hose = pci_bus_to_host(dev->bus);
+
+   /* hose should never be NULL */
+   offset += hose->io_base_phys -
+((unsigned long)hose->io_base_virt - _IO_BASE);
+   }
 
vma->vm_pgoff = offset >> PAGE_SHIFT;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index e589080..67b2e68 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -515,12 +515,15 @@ int pci_mmap_page_range(struct pci_dev *dev, struct vm_area_struct *vma,
 {
resource_size_t offset =
((resource_size_t)vma->vm_pgoff) << PAGE_SHIFT;
-   struct resource *rp;
int ret;
 
-   rp = __pci_mmap_make_offset(dev, &offset, mmap_state);
-   if (rp == NULL)
-   return -EINVAL;
+   if (mmap_state == pci_mmap_io) {
+   struct pci_controller *hose = pci_bus_to_host(dev->bus);
+
+   /* hose should never be NULL */
+   offset += hose->io_base_phys -
+ ((unsigned long)hose->io_base_virt - _IO_BASE);
+   }
 
vma->vm_pgoff = offset >> PAGE_SHIFT;
if (write_combine)
diff --git a/arch/sparc/kernel/pci.c b/arch/sparc/kernel/pci.c
index 9c1878f..5f2d78e 100644
--- a/arch/sparc/kernel/pci.c
+++ b/arch/sparc/kernel/pci.c
@@ -868,10 +868,6 @@ int pci_mmap_page_range(struct pci_dev *dev, struct vm_area_struct *vma,
 {
int ret;
 
-   ret = __pci_mmap_make_offset(dev, vma, mmap_state);
-   if (ret < 0)
-   return ret;
-
__pci_mmap_set_pgprot(dev, vma, mmap_state);
 
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
diff --git a/arch/xtensa/kernel/pci.c b/arch/xtensa/kernel/pci.c
index b848cc3..4c5f1fa 100644
--- a/arch/xtensa/kernel/pci.c
+++ b/arch/xtensa/kernel/pci.c
@@ -366,11 +366,18 @@ int pci_mmap_page_range(struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state,
int write_combine)
 {
+   unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
int ret;
 
-   ret = __pci_mmap_make_offset(dev, vma, mmap_state);
-   if (ret < 0)
-   return ret;
+   if (mmap_state == pci_mmap_io) {
+   struct pci_controller *pci_ctrl =
+(struct pci_controller *)dev->sysdata;
+
+   /* pci_ctrl should never be NULL */
+   offset += pci_ctrl->io_space.start - pci_ctrl->io_space.base;
+   }
+
+   vma->vm_pgoff = offset >> PAGE_SHIFT;
 
__pci_mmap_set_pgprot(dev, vma, mmap_state, write_combine);
 
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index e907154..d55d93d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/

[PATCH v14 10/17] powerpc/PCI: Keep resource idx order with bridge register number

2016-09-16 Thread Yinghai Lu
Same as the sparc version.

Make the resource sequence consistent with other arches and with
pci_read_bridge_bases(), even when the non-prefetchable MMIO window is
missing or the firmware reports the ranges out of order.

Just hold i = 1 for non-prefetchable MMIO, and i = 2 for prefetchable MMIO.
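
For reference, the window layout this enforces, matching
pci_read_bridge_bases() (a summary of the patch below, not part of the
original posting):

/* bus->resource[0] -> I/O window
 * bus->resource[1] -> non-prefetchable memory window
 * bus->resource[2] -> prefetchable memory window
 * any additional ranges -> index 3 and up
 */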

Signed-off-by: Yinghai Lu 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/pci_of_scan.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
index 526ac67..719f225 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -252,7 +252,7 @@ void of_scan_pci_bridge(struct pci_dev *dev)
bus->resource[i] = res;
++res;
}
-   i = 1;
+   i = 3;
for (; len >= 32; len -= 32, ranges += 8) {
flags = pci_parse_of_flags(of_read_number(ranges, 1), 1);
size = of_read_number(&ranges[6], 2);
@@ -265,6 +265,12 @@ void of_scan_pci_bridge(struct pci_dev *dev)
   " for bridge %s\n", node->full_name);
continue;
}
+   } else if ((flags & IORESOURCE_PREFETCH) &&
+  !bus->resource[2]->flags) {
+   res = bus->resource[2];
+   } else if (((flags & (IORESOURCE_MEM | IORESOURCE_PREFETCH)) ==
+   IORESOURCE_MEM) && !bus->resource[1]->flags) {
+   res = bus->resource[1];
} else {
if (i >= PCI_NUM_RESOURCES - PCI_BRIDGE_RESOURCES) {
printk(KERN_ERR "PCI: too many memory ranges"
-- 
2.8.3



[PATCH v14 03/17] PCI: Remove __pci_mmap_make_offset()

2016-09-16 Thread Yinghai Lu
After
  PCI: Let pci_mmap_page_range() take resource address
there are no users left for __pci_mmap_make_offset() in these arches.

Remove it.

Signed-off-by: Yinghai Lu 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
---
 arch/microblaze/pci/pci-common.c |  63 --
 arch/powerpc/kernel/pci-common.c |  63 --
 arch/sparc/kernel/pci.c  | 113 ---
 arch/xtensa/kernel/pci.c |  62 -
 4 files changed, 301 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 9e3bc05..e7cd0ab 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -156,69 +156,6 @@ void pcibios_set_master(struct pci_dev *dev)
  */
 
 /*
- * Adjust vm_pgoff of VMA such that it is the physical page offset
- * corresponding to the 32-bit pci bus offset for DEV requested by the user.
- *
- * Basically, the user finds the base address for his device which he wishes
- * to mmap.  They read the 32-bit value from the config space base register,
- * add whatever PAGE_SIZE multiple offset they wish, and feed this into the
- * offset parameter of mmap on /proc/bus/pci/XXX for that device.
- *
- * Returns negative error code on failure, zero on success.
- */
-static struct resource *__pci_mmap_make_offset(struct pci_dev *dev,
-  resource_size_t *offset,
-  enum pci_mmap_state mmap_state)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   unsigned long io_offset = 0;
-   int i, res_bit;
-
-   if (!hose)
-   return NULL;/* should never happen */
-
-   /* If memory, add on the PCI bridge address offset */
-   if (mmap_state == pci_mmap_mem) {
-#if 0 /* See comment in pci_resource_to_user() for why this is disabled */
-   *offset += hose->pci_mem_offset;
-#endif
-   res_bit = IORESOURCE_MEM;
-   } else {
-   io_offset = (unsigned long)hose->io_base_virt - _IO_BASE;
-   *offset += io_offset;
-   res_bit = IORESOURCE_IO;
-   }
-
-   /*
-* Check that the offset requested corresponds to one of the
-* resources of the device.
-*/
-   for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
-   struct resource *rp = &dev->resource[i];
-   int flags = rp->flags;
-
-   /* treat ROM as memory (should be already) */
-   if (i == PCI_ROM_RESOURCE)
-   flags |= IORESOURCE_MEM;
-
-   /* Active and same type? */
-   if ((flags & res_bit) == 0)
-   continue;
-
-   /* In the range of this resource? */
-   if (*offset < (rp->start & PAGE_MASK) || *offset > rp->end)
-   continue;
-
-   /* found it! construct the final physical address */
-   if (mmap_state == pci_mmap_io)
-   *offset += hose->io_base_phys - io_offset;
-   return rp;
-   }
-
-   return NULL;
-}
-
-/*
  * This one is used by /dev/mem and fbdev who have no clue about the
  * PCI device, it tries to find the PCI device first and calls the
  * above routine
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 67b2e68..07fa9d5 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -388,69 +388,6 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
  */
 
 /*
- * Adjust vm_pgoff of VMA such that it is the physical page offset
- * corresponding to the 32-bit pci bus offset for DEV requested by the user.
- *
- * Basically, the user finds the base address for his device which he wishes
- * to mmap.  They read the 32-bit value from the config space base register,
- * add whatever PAGE_SIZE multiple offset they wish, and feed this into the
- * offset parameter of mmap on /proc/bus/pci/XXX for that device.
- *
- * Returns negative error code on failure, zero on success.
- */
-static struct resource *__pci_mmap_make_offset(struct pci_dev *dev,
-  resource_size_t *offset,
-  enum pci_mmap_state mmap_state)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   unsigned long io_offset = 0;
-   int i, res_bit;
-
-   if (hose == NULL)
-   return NULL;/* should never happen */
-
-   /* If memory, add on the PCI bridge address offset */
-   if (mmap_state == pci_mmap_mem) {
-#if 0 /* See comment in pci_resource_to_user() for why this is disabled */
-   *offset += hose->pci_mem_offset;
-#endif
-   res_bit = IORESOURCE_MEM;
-   } else {
-   io_offset = (unsigned long)hose->io_base_virt - _IO_BASE;
-  

Re: [PATCH v2 0/9] ima: carry the measurement list across kexec

2016-09-16 Thread Eric W. Biederman
Mimi Zohar  writes:

> Hi Andrew,
>
> On Wed, 2016-08-31 at 18:38 -0400, Mimi Zohar wrote:
>> On Wed, 2016-08-31 at 13:50 -0700, Andrew Morton wrote:
>> > On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar  
>> > wrote:
>> > 
>> > > The TPM PCRs are only reset on a hard reboot.  In order to validate a
>> > > TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
>> > > of the running kernel must be saved and then restored on the subsequent
>> > > boot, possibly of a different architecture.
>> > > 
>> > > The existing securityfs binary_runtime_measurements file conveniently
>> > > provides a serialized format of the IMA measurement list. This patch
>> > > set serializes the measurement list in this format and restores it.
>> > > 
>> > > Up to now, the binary_runtime_measurements was defined as architecture
>> > > native format.  The assumption being that userspace could and would
>> > > handle any architecture conversions.  With the ability of carrying the
>> > > measurement list across kexec, possibly from one architecture to a
>> > > different one, the per boot architecture information is lost and with it
>> > > the ability of recalculating the template digest hash.  To resolve this
>> > > problem, without breaking the existing ABI, this patch set introduces
>> > > the boot command line option "ima_canonical_fmt", which is arbitrarily
>> > > defined as little endian.
>> > > 
>> > > The need for this boot command line option will be limited to the
>> > > existing version 1 format of the binary_runtime_measurements.
>> > > Subsequent formats will be defined as canonical format (eg. TPM 2.0
>> > > support for larger digests).
>> > > 
>> > > This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
>> > > hand-over for the next kernel" patch set. 
>> > > 
>> > > These patches can also be found in the next-kexec-restore branch of:
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
>> > 
>> > I'll merge these into -mm to get some linux-next exposure.  I don't
>> > know what your upstream merge plans will be?
>> 
>> Sounds good.  I'm hoping to get some review/comments on this patch set
>> as well.  At the moment, I'm chasing down a kernel test robot report
>> from this afternoon.
>
> My concern about changing the canonical format as originally defined in
> patch 9/9 from big endian to little endian never materialized.  Andreas
> Steffan, the patch author, is happy either way.
>
> We proposed two methods of addressing Eric Biederman's concerns of not
> including the IMA measurement list segment in the kexec hash as
> described in  https://lkml.org/lkml/2016/9/9/355.
>
> - defer calculating and verifying the serialized IMA measurement list
> buffer hash to IMA
> - calculate the kexec hash on load, verify it on the kexec execute,
> before re-calculating and updating it.

I need to ask: How is this anticipated to interact with kexec on panic?
Because honestly I can't see this ever working in that case.  The
assumption is that the original kernel has gone crazy.  So from a
practical standpoint any trusted path should have been invalidated.

This entire idea of updating the kexec image makes me extremely
extremely nervous.  It feels like sticking a screwdriver through the
spokes of your bicycle tires while riding down the road.

I can see tracking to see if the list has changed at some
point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.

At least the common bootloader cases that I know of using kexec are very
minimal distributions that live in a ramdisk and as such it should be
very straight forward to measure what is needed at or before
sys_kexec_load.  But that was completely dismissed as unrealistic so I
don't have a clue what actual problem you are trying to solve.

If there is any way we can start small and not with this big scary
infrastructure change I would very much prefer it.

Eric


Re: [PATCH 5/5] arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig

2016-09-16 Thread kbuild test robot
Hi Claudiu,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc6 next-20160916]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:
https://github.com/0day-ci/linux/commits/Claudiu-Manoil/Freescale-DPAA-1-x-QBMan-Drivers/20160916-212727
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   In file included from drivers/soc/fsl/qbman/bman_priv.h:33:0,
from drivers/soc/fsl/qbman/bman_ccsr.c:31:
>> drivers/soc/fsl/qbman/dpaa_sys.h:51:2: error: #error "Unsupported Cacheline Size"
 #error "Unsupported Cacheline Size"
  ^
--
   In file included from drivers/soc/fsl/qbman/bman_priv.h:33:0,
from drivers/soc/fsl/qbman/bman_portal.c:31:
>> drivers/soc/fsl/qbman/dpaa_sys.h:51:2: error: #error "Unsupported Cacheline Size"
 #error "Unsupported Cacheline Size"
  ^
   drivers/soc/fsl/qbman/bman_portal.c: In function 'bman_portal_probe':
>> drivers/soc/fsl/qbman/bman_portal.c:152:6: error: '_PAGE_GUARDED' undeclared (first use in this function)
  _PAGE_GUARDED | _PAGE_NO_CACHE);
  ^
    drivers/soc/fsl/qbman/bman_portal.c:152:6: note: each undeclared identifier is reported only once for each function it appears in
--
   In file included from drivers/soc/fsl/qbman/qman_priv.h:33:0,
from drivers/soc/fsl/qbman/qman_portal.c:31:
>> drivers/soc/fsl/qbman/dpaa_sys.h:51:2: error: #error "Unsupported Cacheline Size"
 #error "Unsupported Cacheline Size"
  ^
   drivers/soc/fsl/qbman/qman_portal.c: In function 'qman_portal_probe':
>> drivers/soc/fsl/qbman/qman_portal.c:289:6: error: '_PAGE_GUARDED' undeclared (first use in this function)
  _PAGE_GUARDED | _PAGE_NO_CACHE);
  ^
    drivers/soc/fsl/qbman/qman_portal.c:289:6: note: each undeclared identifier is reported only once for each function it appears in

vim +51 drivers/soc/fsl/qbman/dpaa_sys.h

4c95420d Claudiu Manoil 2016-09-16  35  #include 
4c95420d Claudiu Manoil 2016-09-16  36  #include 
4c95420d Claudiu Manoil 2016-09-16  37  #include 
4c95420d Claudiu Manoil 2016-09-16  38  #include 
4c95420d Claudiu Manoil 2016-09-16  39  #include 
4c95420d Claudiu Manoil 2016-09-16  40  #include 
4c95420d Claudiu Manoil 2016-09-16  41  #include 
4c95420d Claudiu Manoil 2016-09-16  42  #include 
4c95420d Claudiu Manoil 2016-09-16  43  #include 
4c95420d Claudiu Manoil 2016-09-16  44  #include 
4c95420d Claudiu Manoil 2016-09-16  45  
4c95420d Claudiu Manoil 2016-09-16  46  /* For 2-element tables related to cache-inhibited and cache-enabled mappings */
4c95420d Claudiu Manoil 2016-09-16  47  #define DPAA_PORTAL_CE 0
4c95420d Claudiu Manoil 2016-09-16  48  #define DPAA_PORTAL_CI 1
4c95420d Claudiu Manoil 2016-09-16  49  
4c95420d Claudiu Manoil 2016-09-16  50  #if (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64)
4c95420d Claudiu Manoil 2016-09-16 @51  #error "Unsupported Cacheline Size"
4c95420d Claudiu Manoil 2016-09-16  52  #endif
4c95420d Claudiu Manoil 2016-09-16  53  
4c95420d Claudiu Manoil 2016-09-16  54  static inline void dpaa_flush(void *p)
4c95420d Claudiu Manoil 2016-09-16  55  {
4c95420d Claudiu Manoil 2016-09-16  56  #ifdef CONFIG_PPC
4c95420d Claudiu Manoil 2016-09-16  57  	flush_dcache_range((unsigned long)p, (unsigned long)p+64);
4c95420d Claudiu Manoil 2016-09-16  58  #elif defined(CONFIG_ARM32)
4c95420d Claudiu Manoil 2016-09-16  59  __cpuc_flush_dcache_area(p, 64);

:: The code at line 51 was first introduced by commit
:: 4c95420d89be521032d30fa549bacdd9cd98c7c9 soc/fsl: Introduce DPAA 1.x BMan device driver

:: TO: Claudiu Manoil 
:: CC: 0day robot 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [V2] powerpc/Kconfig: Update config option based on page size.

2016-09-16 Thread Aneesh Kumar K.V
Balbir Singh  writes:

> On 14/09/16 20:40, santhosh wrote:
>> 
>>> Michael Ellerman  writes:
>>>
 On Fri, 2016-02-19 at 05:38:47 UTC, Rashmica Gupta wrote:
> Currently on PPC64 changing kernel pagesize from 4K to 64K leaves
> FORCE_MAX_ZONEORDER set to 13 - which produces a compile error.
>
 ...
> So, update the range of FORCE_MAX_ZONEORDER from 9-64 to 8-9 for 64K pages
> and from 13-64 to 9-13 for 4K pages.
>
> Signed-off-by: Rashmica Gupta 
> Reviewed-by: Balbir Singh 
 Applied to powerpc next, thanks.

 https://git.kernel.org/powerpc/c/a7ee539584acf4a565b7439cea

>>> HPAGE_PMD_ORDER is not something we should check w.r.t 4k linux page
>>> size. We do have the below constraint w.r.t hugetlb pages
>>>
>>> static inline bool hstate_is_gigantic(struct hstate *h)
>>> {
>>> return huge_page_order(h) >= MAX_ORDER;
>>> }
>>>
>>> That require MAX_ORDER to be greater than 12.
>>>
>
> 9 to 13 was done based on calculations you can find the commit
>
>
>
>>> Did we test hugetlbfs 4k config with this patch ? Will it work if we
>>> start marking hugepage as gigantic page ?
>>>
>>> -aneesh
>>>
>> Hello Rashmica,
>> 
>> With upstream linux kernel 4.8.0-rc1-6-gbae9cc6 compiled with linux 4k
>> page size we are not able to set hugepages. Aneesh had a look at the problem
>> and he mentioned this commit is causing the issue.
>> 
>> *Details:*
>> We are using pkvm ubuntu 16.04 guest with upstream kernel 
>> [4.8.0-rc1-6-gbae9cc6] compiled with  4k page size
>> 
>> o/p from guest:
>> HugePages_Total:   0
>> HugePages_Free:0
>> HugePages_Rsvd:0
>> HugePages_Surp:0
>> Hugepagesize:  16384 kB
>> 
>> Page sizes from device-tree: [dmesg]
>> [0.00] base_shift=12: shift=12, sllp=0x, avpnm=0x, 
>> tlbiel=1, penc=0
>> [0.00] base_shift=12: shift=24, sllp=0x, avpnm=0x, 
>> tlbiel=1, penc=56
>> [0.00] base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, 
>> tlbiel=0, penc=0
>> 
>> while trying to configure the hugepages inside the guest it throws the below 
>> error:
>> 
>> echo 100 > /proc/sys/vm/nr_hugepages
>> -bash: echo: write error: Invalid argument
>> 
>> *Note*: we do not see the problem when the linux page is 64k
>
>
> Just to reiterate, you are seeing this problem using a 4k page size and 16M
> as the hugepage size. With the FORCE_MAX_ZONEORDER range of 9 to 13 for 4k
> pages, you can do up to 32M if FORCE_MAX_ZONEORDER is 13, and the same for
> 64K with FORCE_MAX_ZONEORDER set to 9.
>
> Basically the constraint is
>
> FORCE_MAX_ZONEORDER <= 25 - PAGE_SHIFT
>
> What is your value of FORCE_MAX_ZONEORDER in the .config?

The problem is the reverse of why the original fix was done. i.e., when you
change the page size from 64K to 4K using nconfig, we don't update
FORCE_MAX_ZONEORDER and hence we end up with a value of 9. That results in
the above error with hugetlb. So from a failed build when switching from
4K to 64K we now have a broken hugetlb when switching from 64K to 4K.

As suggested in the review of the original patch, we should make the 
FORCE_MAX_ZONEORDER range such that it picks the right value that will
get 16MB hugetlb to work.

Something like the below. That range is strange, but without that it picks
a value of 11.

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 927d2ab2ce08..792cb1768c8f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -637,7 +637,7 @@ config FORCE_MAX_ZONEORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
default "9" if PPC64 && PPC_64K_PAGES
-   range 9 13 if PPC64 && !PPC_64K_PAGES
+   range 13 13 if PPC64 && !PPC_64K_PAGES
default "13" if PPC64 && !PPC_64K_PAGES
range 9 64 if PPC32 && PPC_16K_PAGES
default "9" if PPC32 && PPC_16K_PAGES


-aneesh



RE: [PATCH 04/13] powerpc: Use soft_enabled_set api to update paca->soft_enabled

2016-09-16 Thread David Laight
From: Nicholas Piggin
> Sent: 16 September 2016 12:59
> On Fri, 16 Sep 2016 11:43:13 +
> David Laight  wrote:
> 
> > From: Nicholas Piggin
> > > Sent: 16 September 2016 10:53
> > > On Thu, 15 Sep 2016 18:31:54 +0530
> > > Madhavan Srinivasan  wrote:
> > >
> > > > Force use of the soft_enabled_set() wrapper to update paca->soft_enabled
> > > > wherever possible. Also add a new wrapper function,
> > > > soft_enabled_set_return(), to force the paca->soft_enabled updates.
> > ...
> > > > diff --git a/arch/powerpc/include/asm/hw_irq.h 
> > > > b/arch/powerpc/include/asm/hw_irq.h
> > > > index 8fad8c24760b..f828b8f8df02 100644
> > > > --- a/arch/powerpc/include/asm/hw_irq.h
> > > > +++ b/arch/powerpc/include/asm/hw_irq.h
> > > > @@ -53,6 +53,20 @@ static inline notrace void soft_enabled_set(unsigned 
> > > > long enable)
> > > > : : "r" (enable), "i" (offsetof(struct paca_struct, 
> > > > soft_enabled)));
> > > >  }
> > > >
> > > > +static inline notrace unsigned long soft_enabled_set_return(unsigned 
> > > > long enable)
> > > > +{
> > > > +   unsigned long flags;
> > > > +
> > > > +   asm volatile(
> > > > +   "lbz %0,%1(13); stb %2,%1(13)"
> > > > +   : "=r" (flags)
> > > > +   : "i" (offsetof(struct paca_struct, soft_enabled)),\
> > > > + "r" (enable)
> > > > +   : "memory");
> > > > +
> > > > +   return flags;
> > > > +}
> > >
> > > Why do you have the "memory" clobber here while soft_enabled_set() does 
> > > not?
> >
> > I wondered about the missing memory clobber earlier.
> >
> > Any 'clobber' ought to be restricted to the referenced memory area.
> > If the structure is only referenced by r13 through 'asm volatile' it isn't 
> > needed.
> 
> Well a clobber (compiler barrier) at some point is needed in irq_disable and
> irq_enable paths, so we correctly open and close the critical section vs 
> interrupts.
> I just wonder about these helpers. It might be better to take the clobbers 
> out of
> there and add barrier(); in callers, which would make it more obvious.

If the memory clobber is needed to synchronise with the rest of the code
rather than just ensuring the compiler doesn't reorder accesses via r13
then I'd add an explicit barrier() somewhere - even if in these helpers.

Potentially the helper wants a memory clobber for the (r13) area
and a separate barrier() to ensure the interrupts are masked for the
right code.
Even if both are together in the same helper.
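
A hypothetical sketch of that split (clobber-free helper plus an explicit
compiler barrier at the call site; the 0-means-disabled encoding here is
illustrative only, not taken from the patch):

static inline notrace void soft_enabled_set(unsigned long enable)
{
	asm volatile("stb %0,%1(13)"
		     : : "r" (enable),
			 "i" (offsetof(struct paca_struct, soft_enabled)));
}

static inline void local_irq_disable_sketch(void)
{
	soft_enabled_set(0);	/* illustrative: 0 == disabled */
	barrier();		/* compiler barrier closes the critical section */
}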

David




[PATCH 5/5] arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig

2016-09-16 Thread Claudiu Manoil
Enable the drivers on the powerpc arch.

Signed-off-by: Roy Pledge 
Signed-off-by: Claudiu Manoil 
---
 arch/powerpc/Makefile| 4 ++--
 arch/powerpc/configs/dpaa.config | 1 +
 drivers/soc/Kconfig  | 1 +
 drivers/soc/fsl/Makefile | 1 +
 4 files changed, 5 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/configs/dpaa.config

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 1934707..f36347c 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -325,12 +325,12 @@ mpc85xx_smp_defconfig:
 PHONY += corenet32_smp_defconfig
 corenet32_smp_defconfig:
$(call merge_into_defconfig,corenet_basic_defconfig,\
-   85xx-32bit 85xx-smp 85xx-hw fsl-emb-nonhw)
+   85xx-32bit 85xx-smp 85xx-hw fsl-emb-nonhw dpaa)
 
 PHONY += corenet64_smp_defconfig
 corenet64_smp_defconfig:
$(call merge_into_defconfig,corenet_basic_defconfig,\
-   85xx-64bit 85xx-smp altivec 85xx-hw fsl-emb-nonhw)
+   85xx-64bit 85xx-smp altivec 85xx-hw fsl-emb-nonhw dpaa)
 
 PHONY += mpc86xx_defconfig
 mpc86xx_defconfig:
diff --git a/arch/powerpc/configs/dpaa.config b/arch/powerpc/configs/dpaa.config
new file mode 100644
index 000..efa99c0
--- /dev/null
+++ b/arch/powerpc/configs/dpaa.config
@@ -0,0 +1 @@
+CONFIG_FSL_DPAA=y
diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig
index fe42a2f..e6e90e8 100644
--- a/drivers/soc/Kconfig
+++ b/drivers/soc/Kconfig
@@ -1,6 +1,7 @@
 menu "SOC (System On Chip) specific Drivers"
 
 source "drivers/soc/bcm/Kconfig"
+source "drivers/soc/fsl/qbman/Kconfig"
 source "drivers/soc/fsl/qe/Kconfig"
 source "drivers/soc/mediatek/Kconfig"
 source "drivers/soc/qcom/Kconfig"
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 203307f..75e1f53 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -2,5 +2,6 @@
 # Makefile for the Linux Kernel SOC fsl specific device drivers
 #
 
+obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
-- 
1.7.11.7



[PATCH 4/5] soc/qman: Add self-test for QMan driver

2016-09-16 Thread Claudiu Manoil
Add self tests for the DPAA 1.x Queue Manager driver. The tests
ensure that the driver can properly enqueue and dequeue to/from
frame queues using the QMan portal infrastructure.

Signed-off-by: Roy Pledge 
Signed-off-by: Claudiu Manoil 
---
 drivers/soc/fsl/qbman/Kconfig   |  23 ++
 drivers/soc/fsl/qbman/Makefile  |   5 +
 drivers/soc/fsl/qbman/qman_test.c   |  62 
 drivers/soc/fsl/qbman/qman_test.h   |  36 ++
 drivers/soc/fsl/qbman/qman_test_api.c   | 252 +
 drivers/soc/fsl/qbman/qman_test_stash.c | 617 
 6 files changed, 995 insertions(+)
 create mode 100644 drivers/soc/fsl/qbman/qman_test.c
 create mode 100644 drivers/soc/fsl/qbman/qman_test.h
 create mode 100644 drivers/soc/fsl/qbman/qman_test_api.c
 create mode 100644 drivers/soc/fsl/qbman/qman_test_stash.c

diff --git a/drivers/soc/fsl/qbman/Kconfig b/drivers/soc/fsl/qbman/Kconfig
index 5e4b288..3fdf154 100644
--- a/drivers/soc/fsl/qbman/Kconfig
+++ b/drivers/soc/fsl/qbman/Kconfig
@@ -41,4 +41,27 @@ config FSL_BMAN_TEST_API
  high-level API testing with them (whichever portal(s) are affine
  to the cpu(s) the test executes on).
 
+config FSL_QMAN_TEST
+   tristate "QMan self-tests"
+   help
+ Compile self-test code for QMan.
+
+config FSL_QMAN_TEST_API
+   bool "QMan high-level self-test"
+   depends on FSL_QMAN_TEST
+   default y
+   help
+ This requires the presence of cpu-affine portals, and performs
+ high-level API testing with them (whichever portal(s) are affine to
+ the cpu(s) the test executes on).
+
+config FSL_QMAN_TEST_STASH
+   bool "QMan 'hot potato' data-stashing self-test"
+   depends on FSL_QMAN_TEST
+   default y
+   help
+ This performs a "hot potato" style test enqueuing/dequeuing a frame
+ across a series of FQs scheduled to different portals (and cpus), with
+ DQRR, data and context stashing always on.
+
 endif # FSL_DPAA
diff --git a/drivers/soc/fsl/qbman/Makefile b/drivers/soc/fsl/qbman/Makefile
index 714dd97..7ae199f 100644
--- a/drivers/soc/fsl/qbman/Makefile
+++ b/drivers/soc/fsl/qbman/Makefile
 obj-$(CONFIG_FSL_DPAA)  += bman_ccsr.o qman_ccsr.o \
 obj-$(CONFIG_FSL_BMAN_TEST) += bman-test.o
 bman-test-y  = bman_test.o
 bman-test-$(CONFIG_FSL_BMAN_TEST_API)   += bman_test_api.o
+
+obj-$(CONFIG_FSL_QMAN_TEST)+= qman-test.o
+qman-test-y = qman_test.o
+qman-test-$(CONFIG_FSL_QMAN_TEST_API)  += qman_test_api.o
+qman-test-$(CONFIG_FSL_QMAN_TEST_STASH)+= qman_test_stash.o
diff --git a/drivers/soc/fsl/qbman/qman_test.c b/drivers/soc/fsl/qbman/qman_test.c
new file mode 100644
index 000..18f7f02
--- /dev/null
+++ b/drivers/soc/fsl/qbman/qman_test.c
@@ -0,0 +1,62 @@
+/* Copyright 2008 - 2016 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *  notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *  notice, this list of conditions and the following disclaimer in the
+ *  documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *  names of its contributors may be used to endorse or promote products
+ *  derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF 
THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "qman_test.h"
+
+MODULE_AUTHOR("Geoff Thorpe");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("QMan testing");
+
+static int test_init(void)
+{
+   int loop = 1;
+   int err = 0;
+
+   

[PATCH 3/5] soc/bman: Add self-test for BMan driver

2016-09-16 Thread Claudiu Manoil
Add a self test for the DPAA 1.x Buffer Manager driver. This
test ensures that the driver can properly acquire and release
buffers using the BMan portal infrastructure.

Signed-off-by: Roy Pledge 
Signed-off-by: Claudiu Manoil 
---
 drivers/soc/fsl/qbman/Kconfig |  16 
 drivers/soc/fsl/qbman/Makefile|   4 +
 drivers/soc/fsl/qbman/bman_test.c |  53 
 drivers/soc/fsl/qbman/bman_test.h |  35 
 drivers/soc/fsl/qbman/bman_test_api.c | 151 ++
 5 files changed, 259 insertions(+)
 create mode 100644 drivers/soc/fsl/qbman/bman_test.c
 create mode 100644 drivers/soc/fsl/qbman/bman_test.h
 create mode 100644 drivers/soc/fsl/qbman/bman_test_api.c

diff --git a/drivers/soc/fsl/qbman/Kconfig b/drivers/soc/fsl/qbman/Kconfig
index 0abb9c8..5e4b288 100644
--- a/drivers/soc/fsl/qbman/Kconfig
+++ b/drivers/soc/fsl/qbman/Kconfig
@@ -25,4 +25,20 @@ config FSL_DPAA_CHECKING
  Compiles in additional checks, to sanity-check the drivers and
  any use of the exported API. Not recommended for performance.
 
+config FSL_BMAN_TEST
+   tristate "BMan self-tests"
+   help
+ Compile the BMan self-test code. These tests will
+ exercise the BMan APIs to confirm functionality
+ of both the software drivers and hardware device.
+
+config FSL_BMAN_TEST_API
+   bool "High-level API self-test"
+   depends on FSL_BMAN_TEST
+   default y
+   help
+ This requires the presence of cpu-affine portals, and performs
+ high-level API testing with them (whichever portal(s) are affine
+ to the cpu(s) the test executes on).
+
 endif # FSL_DPAA
diff --git a/drivers/soc/fsl/qbman/Makefile b/drivers/soc/fsl/qbman/Makefile
index 6e0ee30..714dd97 100644
--- a/drivers/soc/fsl/qbman/Makefile
+++ b/drivers/soc/fsl/qbman/Makefile
@@ -1,3 +1,7 @@
 obj-$(CONFIG_FSL_DPAA)  += bman_ccsr.o qman_ccsr.o \
   bman_portal.o qman_portal.o \
   bman.o qman.o
+
+obj-$(CONFIG_FSL_BMAN_TEST) += bman-test.o
+bman-test-y  = bman_test.o
+bman-test-$(CONFIG_FSL_BMAN_TEST_API)   += bman_test_api.o
diff --git a/drivers/soc/fsl/qbman/bman_test.c b/drivers/soc/fsl/qbman/bman_test.c
new file mode 100644
index 000..09b1c96
--- /dev/null
+++ b/drivers/soc/fsl/qbman/bman_test.c
@@ -0,0 +1,53 @@
+/* Copyright 2008 - 2016 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *  notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *  notice, this list of conditions and the following disclaimer in the
+ *  documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *  names of its contributors may be used to endorse or promote products
+ *  derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "bman_test.h"
+
+MODULE_AUTHOR("Geoff Thorpe");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("BMan testing");
+
+static int test_init(void)
+{
+#ifdef CONFIG_FSL_BMAN_TEST_API
+   int loop = 1;
+
+   while (loop--)
+   bman_test_api();
+#endif
+   return 0;
+}
+
+static void test_exit(void)
+{
+}
+
+module_init(test_init);
+module_exit(test_exit);
diff --git a/drivers/soc/fsl/qbman/bman_test.h b/drivers/soc/fsl/qbman/bman_test.h
new file mode 100644
index 000..037ed34
--- /dev/null
+++ b/drivers/soc/fsl/qbman/bman_test.h
@@ -0,0 +1,35 @@
+/* Copyright 2008 - 2016 Freescale Semiconductor, Inc.
+ *
+ * Redistri

[PATCH 1/5] soc/fsl: Introduce DPAA 1.x BMan device driver

2016-09-16 Thread Claudiu Manoil
This driver enables the Freescale DPAA 1.x Buffer Manager block.
BMan is a hardware accelerator that manages buffer pools.  It allows
CPUs and other accelerators connected to the SoC datapath to acquire
and release buffers during data processing.

Signed-off-by: Roy Pledge 
Signed-off-by: Claudiu Manoil 
---
 drivers/soc/fsl/qbman/Kconfig   |  24 ++
 drivers/soc/fsl/qbman/Makefile  |   2 +
 drivers/soc/fsl/qbman/bman.c| 797 
 drivers/soc/fsl/qbman/bman_ccsr.c   | 263 
 drivers/soc/fsl/qbman/bman_portal.c | 215 ++
 drivers/soc/fsl/qbman/bman_priv.h   |  80 
 drivers/soc/fsl/qbman/dpaa_sys.h| 103 +
 include/soc/fsl/bman.h  | 129 ++
 8 files changed, 1613 insertions(+)
 create mode 100644 drivers/soc/fsl/qbman/Kconfig
 create mode 100644 drivers/soc/fsl/qbman/Makefile
 create mode 100644 drivers/soc/fsl/qbman/bman.c
 create mode 100644 drivers/soc/fsl/qbman/bman_ccsr.c
 create mode 100644 drivers/soc/fsl/qbman/bman_portal.c
 create mode 100644 drivers/soc/fsl/qbman/bman_priv.h
 create mode 100644 drivers/soc/fsl/qbman/dpaa_sys.h
 create mode 100644 include/soc/fsl/bman.h

diff --git a/drivers/soc/fsl/qbman/Kconfig b/drivers/soc/fsl/qbman/Kconfig
new file mode 100644
index 000..88ce5c6
--- /dev/null
+++ b/drivers/soc/fsl/qbman/Kconfig
@@ -0,0 +1,24 @@
+menuconfig FSL_DPAA
+   bool "Freescale DPAA 1.x support"
+   depends on OF && PPC
+   select GENERIC_ALLOCATOR
+   help
+ The Freescale Data Path Acceleration Architecture (DPAA) is a set of
+ hardware components on specific QorIQ multicore processors.
+ This architecture provides the infrastructure to support simplified
+ sharing of networking interfaces and accelerators by multiple CPUs.
+ The major h/w blocks composing DPAA are BMan and QMan.
+
+ The Buffer Manager (BMan) is a hardware buffer pool management block
+ that allows software and accelerators on the datapath to acquire and
+ release buffers in order to build frames.
+
+if FSL_DPAA
+
+config FSL_DPAA_CHECKING
+   bool "Additional driver checking"
+   help
+ Compiles in additional checks, to sanity-check the drivers and
+ any use of the exported API. Not recommended for performance.
+
+endif # FSL_DPAA
diff --git a/drivers/soc/fsl/qbman/Makefile b/drivers/soc/fsl/qbman/Makefile
new file mode 100644
index 000..855c3ac
--- /dev/null
+++ b/drivers/soc/fsl/qbman/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_FSL_DPAA)  += bman_ccsr.o bman_portal.o \
+  bman.o
diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c
new file mode 100644
index 000..ffa48fd
--- /dev/null
+++ b/drivers/soc/fsl/qbman/bman.c
@@ -0,0 +1,797 @@
+/* Copyright 2008 - 2016 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *  notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *  notice, this list of conditions and the following disclaimer in the
+ *  documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *  names of its contributors may be used to endorse or promote products
+ *  derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "bman_priv.h"
+
+#define IRQNAME		"BMan portal %d"
+#define MAX_IRQNAME	16	/* big enough for "BMan portal %d" */
+
+/* Portal register assists */
+
+/* Cache-inhibited register offsets */
+#define BM_REG_RCR_PI_CINH	0x0000
+#define BM_REG_RCR_CI_CINH	0x0004
+#def

[PATCH 0/5] Freescale DPAA 1.x QBMan Drivers

2016-09-16 Thread Claudiu Manoil
Add basic support for the Data Path Acceleration Architecture v1.x
(DPAA 1.x) hardware infrastructure and accelerators found on multicore
Freescale SoCs, commonly known as the QorIQ series.

CC: Roy Pledge 

Claudiu Manoil (5):
  soc/fsl: Introduce DPAA 1.x BMan device driver
  soc/fsl: Introduce DPAA 1.x QMan device driver
  soc/bman: Add self-test for BMan driver
  soc/qman: Add self-test for QMan driver
  arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig

 arch/powerpc/Makefile   |4 +-
 arch/powerpc/configs/dpaa.config|1 +
 drivers/soc/Kconfig |1 +
 drivers/soc/fsl/Makefile|1 +
 drivers/soc/fsl/qbman/Kconfig   |   67 +
 drivers/soc/fsl/qbman/Makefile  |   12 +
 drivers/soc/fsl/qbman/bman.c|  797 +
 drivers/soc/fsl/qbman/bman_ccsr.c   |  263 +++
 drivers/soc/fsl/qbman/bman_portal.c |  215 +++
 drivers/soc/fsl/qbman/bman_priv.h   |   80 +
 drivers/soc/fsl/qbman/bman_test.c   |   53 +
 drivers/soc/fsl/qbman/bman_test.h   |   35 +
 drivers/soc/fsl/qbman/bman_test_api.c   |  151 ++
 drivers/soc/fsl/qbman/dpaa_sys.h|  103 ++
 drivers/soc/fsl/qbman/qman.c| 2881 +++
 drivers/soc/fsl/qbman/qman_ccsr.c   |  817 +
 drivers/soc/fsl/qbman/qman_portal.c |  354 
 drivers/soc/fsl/qbman/qman_priv.h   |  370 
 drivers/soc/fsl/qbman/qman_test.c   |   62 +
 drivers/soc/fsl/qbman/qman_test.h   |   36 +
 drivers/soc/fsl/qbman/qman_test_api.c   |  252 +++
 drivers/soc/fsl/qbman/qman_test_stash.c |  617 +++
 include/soc/fsl/bman.h  |  129 ++
 include/soc/fsl/qman.h  | 1076 
 24 files changed, 8375 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/configs/dpaa.config
 create mode 100644 drivers/soc/fsl/qbman/Kconfig
 create mode 100644 drivers/soc/fsl/qbman/Makefile
 create mode 100644 drivers/soc/fsl/qbman/bman.c
 create mode 100644 drivers/soc/fsl/qbman/bman_ccsr.c
 create mode 100644 drivers/soc/fsl/qbman/bman_portal.c
 create mode 100644 drivers/soc/fsl/qbman/bman_priv.h
 create mode 100644 drivers/soc/fsl/qbman/bman_test.c
 create mode 100644 drivers/soc/fsl/qbman/bman_test.h
 create mode 100644 drivers/soc/fsl/qbman/bman_test_api.c
 create mode 100644 drivers/soc/fsl/qbman/dpaa_sys.h
 create mode 100644 drivers/soc/fsl/qbman/qman.c
 create mode 100644 drivers/soc/fsl/qbman/qman_ccsr.c
 create mode 100644 drivers/soc/fsl/qbman/qman_portal.c
 create mode 100644 drivers/soc/fsl/qbman/qman_priv.h
 create mode 100644 drivers/soc/fsl/qbman/qman_test.c
 create mode 100644 drivers/soc/fsl/qbman/qman_test.h
 create mode 100644 drivers/soc/fsl/qbman/qman_test_api.c
 create mode 100644 drivers/soc/fsl/qbman/qman_test_stash.c
 create mode 100644 include/soc/fsl/bman.h
 create mode 100644 include/soc/fsl/qman.h

-- 
1.7.11.7



Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread Nicholas Piggin
On Fri, 16 Sep 2016 22:06:35 +1000
Nicholas Piggin  wrote:

> On Fri, 16 Sep 2016 11:57:37 +
> David Laight  wrote:
> 
> > From: Nicholas Piggin  
> > > Sent: 16 September 2016 12:52
> > > On Fri, 16 Sep 2016 11:30:58 +
> > > David Laight  wrote:
> > > 
> > > > From: Nicholas Piggin
> > > > > Sent: 16 September 2016 09:58
> > > > > Implementing busy wait loops with cpu_relax() in callers poses
> > > > > some difficulties for powerpc.
> > > > >
> > > > > First, we want to put our SMT thread into a low priority mode for the
> > > > > duration of the loop, but then return to normal priority after exiting
> > > > > the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> > > > > cpu_relax() does may have HMT_medium take effect before HMT_low made
> > > > > any (or much) difference.
> > > > >
> > > > > Second, it can be beneficial for some implementations to spin on the
> > > > > exit condition with a statically predicted-not-taken branch (i.e.,
> > > > > always predict the loop will exit).
> > > > >
> > > > > This is a quick RFC with a couple of users converted to see what
> > > > > people think. I don't use a C branch with hints, because we don't want
> > > > > the compiler moving the loop body out of line, which makes it a bit
> > > > > messy unfortunately. If there's a better way to do it, I'm all ears.  
> > > > >   
> > > >
> > > > I think it will still all go wrong if the conditional isn't trivial.
> > > > In particular if the condition contains || or && it is likely to
> > > > have a branch - which could invert the loop.
> > > 
> > > I don't know that it will.
> > > 
> > > Yes, if we have exit condition that requires more branches in order to
> > > be computed then we lose our nice property of never taking a branch
> > > miss on loop exit. But we still avoid *this* branch miss, and still
> > > prevent multiple iterations of the wait loop being speculatively
> > > executed concurrently when there's no work to be done.
> > > 
> > > And C doesn't know about the loop, so it can't do any transformation
> > > except to compute the final condition.
> > > 
> > > Or have I missed something?
> > 
> > Try putting the code inside a conditional or at the bottom of a loop.
> > gcc can replicate code to remove a branch.
> > 
> > So:
> > for (;;) {
> > a;
> > if (b)
> > c;
> > d;
> > }  
> 
> That's not what this patch does though. The loop is purely asm. gcc has
> no idea about it. Only thing gcc knows is to evaluate the condition and
> put it in a register.


Oh you're right of course -- can't branch to a random location. Sorry, I didn't
know what you meant at first. It does need to use asm goto, I guess.

Thanks,
Nick
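[Editorial note: for a trivial, fixed exit condition the idea can be
sketched without asm goto -- the whole wait loop lives inside a single asm
block, so the compiler can neither duplicate nor restructure it, and the
backwards branch carries a "predicted not taken" hint so the exit is always
the predicted path. A sketch only, not the RFC's actual code; HMT_low() and
HMT_medium() are the existing powerpc SMT priority macros.]

static inline void spin_until_nonzero(volatile unsigned long *ptr)
{
	unsigned long val;

	HMT_low();			/* low SMT priority while waiting */
	asm volatile(
		"1:	ld	%0,0(%1)	\n"
		"	cmpdi	0,%0,0		\n"
		"	beq-	1b		\n"	/* '-' hint: predict loop exit */
		: "=&r" (val)
		: "r" (ptr)
		: "cr0", "memory");
	HMT_medium();			/* restore normal priority */
}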


Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread Nicholas Piggin
On Fri, 16 Sep 2016 11:57:37 +
David Laight  wrote:

> From: Nicholas Piggin
> > Sent: 16 September 2016 12:52
> > On Fri, 16 Sep 2016 11:30:58 +
> > David Laight  wrote:
> >   
> > > From: Nicholas Piggin  
> > > > Sent: 16 September 2016 09:58
> > > > Implementing busy wait loops with cpu_relax() in callers poses
> > > > some difficulties for powerpc.
> > > >
> > > > First, we want to put our SMT thread into a low priority mode for the
> > > > duration of the loop, but then return to normal priority after exiting
> > > > the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> > > > cpu_relax() does may have HMT_medium take effect before HMT_low made
> > > > any (or much) difference.
> > > >
> > > > Second, it can be beneficial for some implementations to spin on the
> > > > exit condition with a statically predicted-not-taken branch (i.e.,
> > > > always predict the loop will exit).
> > > >
> > > > This is a quick RFC with a couple of users converted to see what
> > > > people think. I don't use a C branch with hints, because we don't want
> > > > the compiler moving the loop body out of line, which makes it a bit
> > > > messy unfortunately. If there's a better way to do it, I'm all ears.  
> > >
> > > I think it will still all go wrong if the conditional isn't trivial.
> > > In particular if the condition contains || or && it is likely to
> > > have a branch - which could invert the loop.  
> > 
> > I don't know that it will.
> > 
> > Yes, if we have exit condition that requires more branches in order to
> > be computed then we lose our nice property of never taking a branch
> > miss on loop exit. But we still avoid *this* branch miss, and still
> > prevent multiple iterations of the wait loop being speculatively
> > executed concurrently when there's no work to be done.
> > 
> > And C doesn't know about the loop, so it can't do any transformation
> > except to compute the final condition.
> > 
> > Or have I missed something?  
> 
> Try putting the code inside a conditional or at the bottom of a loop.
> gcc can replicate code to remove a branch.
> 
> So:
>   for (;;) {
>   a;
>   if (b)
>   c;
>   d;
>   }

That's not what this patch does though. The loop is purely asm. gcc has
no idea about it. Only thing gcc knows is to evaluate the condition and
put it in a register.

Thanks,
Nick



RE: [PATCH 04/13] powerpc: Use soft_enabled_set api to update paca->soft_enabled

2016-09-16 Thread David Laight
From: Nicholas Piggin
> Sent: 16 September 2016 10:53
> On Thu, 15 Sep 2016 18:31:54 +0530
> Madhavan Srinivasan  wrote:
> 
> > Force use of the soft_enabled_set() wrapper to update paca->soft_enabled
> > wherever possible. Also add a new wrapper function,
> > soft_enabled_set_return(), to force the paca->soft_enabled updates.
...
> > diff --git a/arch/powerpc/include/asm/hw_irq.h 
> > b/arch/powerpc/include/asm/hw_irq.h
> > index 8fad8c24760b..f828b8f8df02 100644
> > --- a/arch/powerpc/include/asm/hw_irq.h
> > +++ b/arch/powerpc/include/asm/hw_irq.h
> > @@ -53,6 +53,20 @@ static inline notrace void soft_enabled_set(unsigned 
> > long enable)
> > : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
> >  }
> >
> > +static inline notrace unsigned long soft_enabled_set_return(unsigned long 
> > enable)
> > +{
> > +   unsigned long flags;
> > +
> > +   asm volatile(
> > +   "lbz %0,%1(13); stb %2,%1(13)"
> > +   : "=r" (flags)
> > +   : "i" (offsetof(struct paca_struct, soft_enabled)),\
> > + "r" (enable)
> > +   : "memory");
> > +
> > +   return flags;
> > +}
> 
> Why do you have the "memory" clobber here while soft_enabled_set() does not?

I wondered about the missing memory clobber earlier.

Any 'clobber' ought to be restricted to the referenced memory area.
If the structure is only referenced by r13 through 'asm volatile' it isn't needed.
OTOH why not allocate a global register variable to r13 and access through that?

David
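[Editorial note: one way to act on the "restrict the clobber" point above is
to describe the touched byte as a memory operand instead of using a blanket
"memory" clobber. A sketch only, assuming the usual r13-backed local_paca
pointer; whether giving up the full compiler barrier is acceptable here is
exactly the critical-section question raised elsewhere in this thread.]

static inline notrace unsigned long soft_enabled_set_return(unsigned long enable)
{
	unsigned long flags;

	asm volatile(
		"lbz %0,%3(13); stb %2,%3(13)"
		: "=r" (flags),
		  "+m" (local_paca->soft_enabled)	/* clobber only this byte */
		: "r" (enable),
		  "i" (offsetof(struct paca_struct, soft_enabled)));

	return flags;
}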



RE: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread David Laight
From: Nicholas Piggin
> Sent: 16 September 2016 12:52
> On Fri, 16 Sep 2016 11:30:58 +
> David Laight  wrote:
> 
> > From: Nicholas Piggin
> > > Sent: 16 September 2016 09:58
> > > Implementing busy wait loops with cpu_relax() in callers poses
> > > some difficulties for powerpc.
> > >
> > > First, we want to put our SMT thread into a low priority mode for the
> > > duration of the loop, but then return to normal priority after exiting
> > > the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> > > cpu_relax() does may have HMT_medium take effect before HMT_low made
> > > any (or much) difference.
> > >
> > > Second, it can be beneficial for some implementations to spin on the
> > > exit condition with a statically predicted-not-taken branch (i.e.,
> > > always predict the loop will exit).
> > >
> > > This is a quick RFC with a couple of users converted to see what
> > > people think. I don't use a C branch with hints, because we don't want
> > > the compiler moving the loop body out of line, which makes it a bit
> > > messy unfortunately. If there's a better way to do it, I'm all ears.
> >
> > I think it will still all go wrong if the conditional isn't trivial.
> > In particular if the condition contains || or && it is likely to
> > have a branch - which could invert the loop.
> 
> I don't know that it will.
> 
> Yes, if we have exit condition that requires more branches in order to
> be computed then we lose our nice property of never taking a branch
> miss on loop exit. But we still avoid *this* branch miss, and still
> prevent multiple iterations of the wait loop being speculatively
> executed concurrently when there's no work to be done.
> 
> And C doesn't know about the loop, so it can't do any transformation
> except to compute the final condition.
> 
> Or have I missed something?

Try putting the code inside a conditional or at the bottom of a loop.
gcc can replicate code to remove a branch.

So:
for (;;) {
a;
if (b)
c;
d;
}

can become:
   x1:
a;
if (b) to x2;
d;
goto x1;

x2:
c;
d;
goto x1;

Which won't work.

David



Re: [PATCH 04/13] powerpc: Use soft_enabled_set api to update paca->soft_enabled

2016-09-16 Thread Nicholas Piggin
On Fri, 16 Sep 2016 11:43:13 +
David Laight  wrote:

> From: Nicholas Piggin
> > Sent: 16 September 2016 10:53
> > On Thu, 15 Sep 2016 18:31:54 +0530
> > Madhavan Srinivasan  wrote:
> >   
> > > Force use of the soft_enabled_set() wrapper to update paca->soft_enabled
> > > wherever possible. Also add a new wrapper function,
> > > soft_enabled_set_return(), to force the paca->soft_enabled updates.
> ...
> > > diff --git a/arch/powerpc/include/asm/hw_irq.h 
> > > b/arch/powerpc/include/asm/hw_irq.h
> > > index 8fad8c24760b..f828b8f8df02 100644
> > > --- a/arch/powerpc/include/asm/hw_irq.h
> > > +++ b/arch/powerpc/include/asm/hw_irq.h
> > > @@ -53,6 +53,20 @@ static inline notrace void soft_enabled_set(unsigned 
> > > long enable)
> > >   : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
> > >  }
> > >
> > > +static inline notrace unsigned long soft_enabled_set_return(unsigned 
> > > long enable)
> > > +{
> > > + unsigned long flags;
> > > +
> > > + asm volatile(
> > > + "lbz %0,%1(13); stb %2,%1(13)"
> > > + : "=r" (flags)
> > > + : "i" (offsetof(struct paca_struct, soft_enabled)),\
> > > +   "r" (enable)
> > > + : "memory");
> > > +
> > > + return flags;
> > > +}  
> > 
> > Why do you have the "memory" clobber here while soft_enabled_set() does 
> > not?  
> 
> I wondered about the missing memory clobber earlier.
> 
> Any 'clobber' ought to be restricted to the referenced memory area.
> If the structure is only referenced by r13 through 'asm volatile' it isn't needed.

Well a clobber (compiler barrier) at some point is needed in irq_disable and
irq_enable paths, so we correctly open and close the critical section vs
interrupts. I just wonder about these helpers. It might be better to take the
clobbers out of there and add barrier(); in callers, which would make it more
obvious.
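[Editorial note: a sketch of the "clobbers out of the helpers, barrier() in
the callers" alternative, using the accessor names from this series. The
restore/enable side would mirror it with barrier() *before* the soft_enabled
update, and the real arch_local_irq_restore() also keeps its interrupt-replay
logic, omitted here.]

static inline unsigned long arch_local_irq_disable(void)
{
	unsigned long flags;

	/* the accessor itself carries no "memory" clobber in this variant */
	flags = soft_enabled_set_return(IRQ_DISABLE_MASK_LINUX);
	barrier();	/* visibly opens the irq-disabled critical section */
	return flags;
}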



Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread Nicholas Piggin
On Fri, 16 Sep 2016 11:30:58 +
David Laight  wrote:

> From: Nicholas Piggin
> > Sent: 16 September 2016 09:58
> > Implementing busy wait loops with cpu_relax() in callers poses
> > some difficulties for powerpc.
> > 
> > First, we want to put our SMT thread into a low priority mode for the
> > duration of the loop, but then return to normal priority after exiting
> > the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> > cpu_relax() does may have HMT_medium take effect before HMT_low made
> > any (or much) difference.
> > 
> > Second, it can be beneficial for some implementations to spin on the
> > exit condition with a statically predicted-not-taken branch (i.e.,
> > always predict the loop will exit).
> > 
> > This is a quick RFC with a couple of users converted to see what
> > people think. I don't use a C branch with hints, because we don't want
> > the compiler moving the loop body out of line, which makes it a bit
> > messy unfortunately. If there's a better way to do it, I'm all ears.  
> 
> I think it will still all go wrong if the conditional isn't trivial.
> In particular if the condition contains || or && it is likely to
> have a branch - which could invert the loop.

I don't know that it will.

Yes, if we have exit condition that requires more branches in order to
be computed then we lose our nice property of never taking a branch
miss on loop exit. But we still avoid *this* branch miss, and still
prevent multiple iterations of the wait loop being speculatively
executed concurrently when there's no work to be done.

And C doesn't know about the loop, so it can't do any transformation
except to compute the final condition.

Or have I missed something?

Thanks,
Nick


RE: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread David Laight
From: Nicholas Piggin
> Sent: 16 September 2016 09:58
> Implementing busy wait loops with cpu_relax() in callers poses
> some difficulties for powerpc.
> 
> First, we want to put our SMT thread into a low priority mode for the
> duration of the loop, but then return to normal priority after exiting
> the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> cpu_relax() does may have HMT_medium take effect before HMT_low made
> any (or much) difference.
> 
> Second, it can be beneficial for some implementations to spin on the
> exit condition with a statically predicted-not-taken branch (i.e.,
> always predict the loop will exit).
> 
> This is a quick RFC with a couple of users converted to see what
> people think. I don't use a C branch with hints, because we don't want
> the compiler moving the loop body out of line, which makes it a bit
> messy unfortunately. If there's a better way to do it, I'm all ears.

I think it will still all go wrong if the conditional isn't trivial.
In particular if the condition contains || or && it is likely to
have a branch - which could invert the loop.

David


Re: [PATCH 00/13] powerpc: "paca->soft_enabled" based local atomic operation implementation

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:50 +0530
Madhavan Srinivasan  wrote:

> Local atomic operations are fast and highly reentrant per CPU counters.
> Used for percpu variable updates. Local atomic operations only guarantee
> variable modification atomicity wrt the CPU which owns the data and
> these need to be executed in a preemption-safe way.

Great patches, I hope the review helped. Other than my minor comments,
I think the patches look good, so it's a good time for other reviewers
to take a look. Would be good to get an ack from Ben, Anton, or Paul too
at some point.

These will clash spectacularly with my exception vector rework
unfortunately, but that's how it goes. I can help with rebasing if
mine ends up going in first.

Thanks,
Nick


Re: [PATCH 10/13] powerpc: Add "bitmask" parameter to MASKABLE_* macros

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:32:00 +0530
Madhavan Srinivasan  wrote:

> Make explicit the interrupt masking supported
> by a given interrupt handler. The patch correspondingly
> extends the MASKABLE_* macros with an additional parameter.
> The "bitmask" parameter is passed to the SOFTEN_TEST macro to decide
> on masking the interrupt.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/exception-64s.h | 62 
> 
>  arch/powerpc/kernel/exceptions-64s.S | 36 ---
>  2 files changed, 54 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/exception-64s.h 
> b/arch/powerpc/include/asm/exception-64s.h
> index 1eea4ab75607..41be0c2d7658 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -179,9 +179,9 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
>   * checking of the interrupt maskable level in the SOFTEN_TEST.
>   * Intended to be used in MASKABLE_EXCEPTION_* macros.
>   */
> -#define __EXCEPTION_PROLOG_1(area, extra, vec)   
> \
> +#define __EXCEPTION_PROLOG_1(area, extra, vec, bitmask)  
> \
>   __EXCEPTION_PROLOG_1_PRE(area); \
> - extra(vec); \
> + extra(vec, bitmask);\
>   __EXCEPTION_PROLOG_1_POST(area);
>  
>  /*

Is __EXCEPTION_PROLOG_1 now for maskable exceptions, and EXCEPTION_PROLOG_1
for unmaskable? Does it make sense to rename __EXCEPTION_PROLOG_1 to
MASKABLE_EXCEPTION_PROLOG_1? Reducing the mystery underscores in this file would
be nice!

This worked out nicely with mask bit being passed in by the exception handlers.
Very neat.

Reviewed-by: Nicholas Piggin 


Re: [PATCH 13/13] powerpc: rewrite local_t using soft_irq

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:32:03 +0530
Madhavan Srinivasan  wrote:

> Local atomic operations are fast and highly reentrant per CPU counters.
> Used for percpu variable updates. Local atomic operations only guarantee
> variable modification atomicity wrt the CPU which owns the data and
> these need to be executed in a preemption-safe way.
> 
> Here is the design of this patch. Since local_* operations
> only need to be atomic with respect to interrupts (IIUC), we have two
> options: either replay the "op" if interrupted, or replay the interrupt
> after the "op". The initial patchset posted was based on implementing
> local_* operations using CR5, which replays the "op". That patchset had
> issues with rewinding the address pointer from an array, which made the
> slow path really slow. Since the CR5-based implementation proposed using
> __ex_table to find the rewind address, it raised concerns about the size
> of __ex_table and vmlinux.
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-December/123115.html
> 
> This patch instead uses local_irq_pmu_save() to soft-disable
> interrupts (including PMIs). After finishing the "op", local_irq_pmu_restore()
> is called, and interrupts are correspondingly replayed if any occurred.
> 
> The patch rewrites the current local_* functions to use
> arch_local_irq_disable(). The base flow for each function is:
> 
> {
>   local_irq_pmu_save(flags)
>   load
>   ..
>   store
>   local_irq_pmu_restore(flags)
> }
> 
> The reason for the approach is that currently the l[w/d]arx/st[w/d]cx.
> instruction pair is used for local_* operations, which is heavy
> on cycle count, and it doesn't support a local variant. So to
> see whether the new implementation helps, I used a modified
> version of Rusty's benchmark code on local_t.
> 
> https://lkml.org/lkml/2008/12/16/450
> 
> Modifications to Rusty's benchmark code:
> - Executed only local_t test
> 
> Here are the values with the patch.
> 
> Time in ns per iteration
> 
> Local_t        Without Patch   With Patch
> 
> _inc           28              8
> _add           28              8
> _read          3               3
> _add_return    28              7
> 
> Currently only asm/local.h has been rewritten, and the
> entire change has been tested only on PPC64 (pseries guest)
> and a PPC64 LE host.
> 
> TODO:
>   - local_cmpxchg and local_xchg needs modification.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/local.h | 94 
> 
>  1 file changed, 66 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/local.h 
> b/arch/powerpc/include/asm/local.h
> index b8da91363864..fb5728abb4e9 100644
> --- a/arch/powerpc/include/asm/local.h
> +++ b/arch/powerpc/include/asm/local.h
> @@ -3,6 +3,9 @@
>  
>  #include 
>  #include 
> +#include 
> +
> +#include 
>  
>  typedef struct
>  {
> @@ -14,24 +17,50 @@ typedef struct
>  #define local_read(l)atomic_long_read(&(l)->a)
>  #define local_set(l,i)   atomic_long_set(&(l)->a, (i))
>  
> -#define local_add(i,l)   atomic_long_add((i),(&(l)->a))
> -#define local_sub(i,l)   atomic_long_sub((i),(&(l)->a))
> -#define local_inc(l) atomic_long_inc(&(l)->a)
> -#define local_dec(l) atomic_long_dec(&(l)->a)
> +static __inline__ void local_add(long i, local_t *l)
> +{
> + long t;
> + unsigned long flags;
> +
> + local_irq_pmu_save(flags);
> + __asm__ __volatile__(
> + PPC_LL" %0,0(%2)\n\
> + add %0,%1,%0\n"
> + PPC_STL" %0,0(%2)\n"
> + : "=&r" (t)
> + : "r" (i), "r" (&(l->a.counter)));
> + local_irq_pmu_restore(flags);
> +}
> +
> +static __inline__ void local_sub(long i, local_t *l)
> +{
> + long t;
> + unsigned long flags;
> +
> + local_irq_pmu_save(flags);
> + __asm__ __volatile__(
> + PPC_LL" %0,0(%2)\n\
> + subf%0,%1,%0\n"
> + PPC_STL" %0,0(%2)\n"
> + : "=&r" (t)
> + : "r" (i), "r" (&(l->a.counter)));
> + local_irq_pmu_restore(flags);
> +}
>  
>  static __inline__ long local_add_return(long a, local_t *l)
>  {
>   long t;
> + unsigned long flags;
>  
> + local_irq_pmu_save(flags);
>   __asm__ __volatile__(
> -"1:" PPC_LLARX(%0,0,%2,0) "  # local_add_return\n\
> + PPC_LL" %0,0(%2)\n\
>   add %0,%1,%0\n"
> - PPC405_ERR77(0,%2)
> - PPC_STLCX   "%0,0,%2 \n\
> - bne-1b"
> + PPC_STL "%0,0(%2)\n"
>   : "=&r" (t)
>   : "r" (a), "r" (&(l->a.counter))
>   : "cc", "memory");
> + local_irq_pmu_restore(flags);

Are all your clobbers correct? You might not be clobbering "cc" here
anymore, for example. Could you double check those? Otherwise, awesome
patch!
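[Editorial note: the clobber point made concrete -- a sketch of the hunk
above with the now-stale "cc" dropped, since plain load/add/store set no
condition-register bits; "memory" stays because the asm still writes to *l.]

static __inline__ long local_add_return(long a, local_t *l)
{
	long t;
	unsigned long flags;

	local_irq_pmu_save(flags);
	__asm__ __volatile__(
		PPC_LL" %0,0(%2)\n"
		"add %0,%1,%0\n"
		PPC_STL" %0,0(%2)\n"
		: "=&r" (t)
		: "r" (a), "r" (&(l->a.counter))
		: "memory");	/* "cc" no longer needed without larx/stcx. */
	local_irq_pmu_restore(flags);

	return t;
}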

Reviewed-by: Nicholas Piggin 



Re: [PATCH 12/13] powerpc: Add a Kconfig and a functions to set new soft_enabled mask

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:32:02 +0530
Madhavan Srinivasan  wrote:


> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index 9e5e9a6d4147..ae31b1e85fdb 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -209,6 +209,10 @@ notrace void arch_local_irq_restore(unsigned long en)
>   unsigned char irq_happened;
>   unsigned int replay;
>  
> +#ifdef CONFIG_IRQ_DEBUG_SUPPORT
> + WARN_ON(en & local_paca->soft_enabled & ~IRQ_DISABLE_MASK_LINUX);
> +#endif
> +
>   /* Write the new soft-enabled value */
>   soft_enabled_set(en);
>  

Oh one other quick thing I just noticed: you could put this debug
check into your soft_enabled accessors.

We did decide it's okay for your masking level to go both ways,
didn't we? I.e.,

local_irq_disable();
local_irq_pmu_save(flags);
local_irq_pmu_restore(flags);
local_irq_enable();

-> LINUX -> LINUX|PMU -> LINUX ->

This means PMU interrupts would not get replayed despite being
enabled here. In practice I think that doesn't matter/can't happen
because a PMU interrupt while masked would hard disable anyway. A
comment explaining it might be nice though.

Thanks,
Nick


Re: [PATCH 11/13] powerpc: Add support to mask perf interrupts and replay them

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:32:01 +0530
Madhavan Srinivasan  wrote:

> To support masking of PMI interrupts, a couple of new interrupt handler
> macros are added: MASKABLE_EXCEPTION_PSERIES_OOL and
> MASKABLE_RELON_EXCEPTION_PSERIES_OOL.
> 
> A couple of new IRQ #defines, "PACA_IRQ_PMI" and "SOFTEN_VALUE_0xf0*", are
> added for use in the exception code to check for PMI interrupts.
> 
> In the masked_interrupt handler, for PMIs we reset MSR[EE]
> and return. In __check_irq_replay(), the PMI interrupt is replayed
> by calling the performance_monitor_common handler.


Reviewed-by: Nicholas Piggin 


[RFC 3/3] powerpc/powernv: reset any PHBs in CAPI mode during initialisation

2016-09-16 Thread Andrew Donnellan
If we kexec into a new kernel on a machine where a PHB has been switched
into CAPI mode, we need to disable CAPI mode in the new kernel before
traffic begins to flow on the PHB and causes a machine checkstop.

During PHB initialisation, ask OPAL whether each PHB is in CAPI mode, and
if so, do a complete reset in order to disable CAPI mode.

This requires a version of skiboot that implements the
OPAL_PCI_GET_PHB_CAPI_MODE call (introduced at the same time that the
capability to disable CAPI mode during complete resets was introduced).

Signed-off-by: Andrew Donnellan 

---

Corresponding skiboot code: http://patchwork.ozlabs.org/patch/670782/
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ca5e9b5..bd76651 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3526,6 +3526,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
const __be32 *prop32;
int len;
unsigned int segno;
+   bool capi_mode = false;
u64 phb_id;
void *aux;
long rc;
@@ -3741,8 +3742,17 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 * shutdown PCI devices correctly. We already got IODA table
 * cleaned out. So we have to issue PHB reset to stop all PCI
 * transactions from previous kernel.
+*
+* Additionally, if the PHB is in CAPI mode, we also need to
+* reset it to get it out of CAPI mode.
 */
-   if (is_kdump_kernel()) {
+   if (opal_check_token(OPAL_PCI_GET_PHB_CAPI_MODE) &&
+   opal_pci_get_phb_capi_mode(phb_id) == OPAL_PHB_CAPI_MODE_CAPI) {
+   pr_info("  PHB in CAPI mode, reset required\n");
+   capi_mode = true;
+   }
+
+   if (is_kdump_kernel() || capi_mode) {
pr_info("  Issue PHB reset ...\n");
pnv_eeh_phb_reset(hose, EEH_RESET_FUNDAMENTAL);
pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
-- 
git-series 0.8.10



[RFC 2/3] powerpc/powernv: add opal_pci_get_phb_capi_mode() call

2016-09-16 Thread Andrew Donnellan
opal_pci_get_phb_capi_mode() returns OPAL_PHB_CAPI_MODE_CAPI if the PHB is
in CAPI mode, and OPAL_PHB_CAPI_MODE_PCIE if it isn't.

We're going to use this call to determine if a PHB requires a complete
reset during initialisation in order to disable CAPI mode (on sufficiently
new skiboots that support this).

Signed-off-by: Andrew Donnellan 

---

Corresponding skiboot RFC: http://patchwork.ozlabs.org/patch/670781/
---
 arch/powerpc/include/asm/opal-api.h| 3 ++-
 arch/powerpc/include/asm/opal.h| 1 +
 arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0e2e57b..078ce77 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -167,7 +167,8 @@
 #define OPAL_INT_EOI   124
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
-#define OPAL_LAST  126
+#define OPAL_PCI_GET_PHB_CAPI_MODE 128
+#define OPAL_LAST  128
 
 /* Device tree flags */
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ee05bd2..501d32a 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -228,6 +228,7 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 int64_t opal_rm_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 uint32_t pe_num, uint32_t tce_size,
 uint64_t dma_addr, uint32_t npages);
+int64_t opal_pci_get_phb_capi_mode(uint64_t phb_id);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 3d29d40..338d034 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -308,3 +308,4 @@ OPAL_CALL(opal_int_eoi,			OPAL_INT_EOI);
 OPAL_CALL(opal_int_set_mfrr,   OPAL_INT_SET_MFRR);
 OPAL_CALL(opal_pci_tce_kill,   OPAL_PCI_TCE_KILL);
 OPAL_CALL_REAL(opal_rm_pci_tce_kill,   OPAL_PCI_TCE_KILL);
+OPAL_CALL(opal_pci_get_phb_capi_mode,  OPAL_PCI_GET_PHB_CAPI_MODE);
-- 
git-series 0.8.10



[RFC 1/3] powerpc/powernv: fix comment style and spelling

2016-09-16 Thread Andrew Donnellan
Signed-off-by: Andrew Donnellan 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c16d790..ca5e9b5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3736,10 +3736,11 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
if (rc)
		pr_warning("  OPAL Error %ld performing IODA table reset !\n", rc);
 
-   /* If we're running in kdump kerenl, the previous kerenl never
+   /*
+* If we're running in kdump kernel, the previous kernel never
 * shutdown PCI devices correctly. We already got IODA table
 * cleaned out. So we have to issue PHB reset to stop all PCI
-* transactions from previous kerenl.
+* transactions from previous kernel.
 */
if (is_kdump_kernel()) {
pr_info("  Issue PHB reset ...\n");
-- 
git-series 0.8.10



[RFC 0/3] powerpc/powernv: support CAPI + kexec

2016-09-16 Thread Andrew Donnellan
Currently, if you attempt to kexec into a new kernel from a machine with a
CAPI card and the cxl driver loaded, you are going to have an exceedingly
bad time. It turns out that the hardware doesn't really cope very well with
going through the standard Linux PCI initialisation process while a PHB is
still in CAPI mode. Checkstops everywhere!

I've submitted RFC patches for skiboot[0][1] to disable CAPI mode when we
do a complete reset on a PHB. This series ensures that when we enter the
new kernel, we ask skiboot to do a complete reset on any PHBs that have
been left in CAPI mode before we initialise everything.

At this stage, I haven't thought too hard about getting to the point where
we can do this after Linux has booted for stuff like PCI hotplug...
triggering a creset after the Linux EEH handling is all set up can get
interesting.

This has only gotten the very lightest of testing - I've kexec-ed quite a
few times with no real problems, and I've run some basic CAPI tests that
don't seem to fail too much more than they normally fail. It does look like
we get spammed with a tonne of HMIs and frozen PE messages in the skiboot
log, not sure how bad this is.

Thanks to Vaibhav Jain (who made a previous attempt at this), Mikey
Neuling, Ben Herrenschmidt, Gavin Shan, Bill Daly and JT Kellington for
advice on various bits of this.

Andrew

[0] http://patchwork.ozlabs.org/patch/670781/
[1] http://patchwork.ozlabs.org/patch/670782/

Andrew Donnellan (3):
  powerpc/powernv: fix comment style and spelling
  powerpc/powernv: add opal_pci_get_phb_capi_mode() call
  powerpc/powernv: reset any PHBs in CAPI mode during initialisation

 arch/powerpc/include/asm/opal-api.h|  3 ++-
 arch/powerpc/include/asm/opal.h|  1 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 arch/powerpc/platforms/powernv/pci-ioda.c  | 17 ++---
 4 files changed, 18 insertions(+), 4 deletions(-)

base-commit: 024c7e3756d8a42fc41fe8a9488488b9b09d1dcc
-- 
git-series 0.8.10



Re: [PATCH 12/13] powerpc: Add a Kconfig and a functions to set new soft_enabled mask

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:32:02 +0530
Madhavan Srinivasan  wrote:

> A new Kconfig option, "CONFIG_IRQ_DEBUG_SUPPORT", is added with a WARN_ON
> to flag invalid transitions. The code under CONFIG_TRACE_IRQFLAGS in
> arch_local_irq_restore() has also been moved under the new Kconfig option,
> as suggested.

I can't tempt you to put the Kconfig changes into their own patch? :)


> To support disabling and enabling of IRQs together with PMIs, a set of
> new functions, powerpc_local_irq_pmu_save() and
> powerpc_local_irq_pmu_restore(), is added. powerpc_local_irq_pmu_save() is
> implemented by adding a new soft_enabled manipulation function,
> soft_enabled_or_return(). local_irq_pmu_* macros are provided to access
> these powerpc_local_irq_pmu_* functions; they include
> trace_hardirqs_on|off() to match what we have in include/linux/irqflags.h.
> 
> Signed-off-by: Madhavan Srinivasan 

> @@ -81,6 +81,20 @@ static inline notrace unsigned long 
> soft_enabled_set_return(unsigned long enable
>   return flags;
>  }
>  
> +static inline notrace unsigned long soft_enabled_or_return(unsigned long 
> enable)
> +{
> + unsigned long flags, zero;
> +
> + asm volatile(
> + "mr %1,%3; lbz %0,%2(13); or %1,%0,%1; stb %1,%2(13)"
> + : "=r" (flags), "=&r"(zero)
> + : "i" (offsetof(struct paca_struct, soft_enabled)),\
> +   "r" (enable)
> + : "memory");
> +
> + return flags;
> +}

Another candidate for builtin_constification using immediates. And do you
actually need the initial mr instruction there?
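[Editorial note: a sketch of the __builtin_constant_p variant suggested
here, which also avoids the initial mr by or-ing into a scratch register
directly. The constraints are an untested assumption, in the spirit of the
accessors above.]

static inline notrace unsigned long soft_enabled_or_return(unsigned long enable)
{
	unsigned long flags, tmp;

	if (__builtin_constant_p(enable))
		asm volatile(
			"lbz %0,%2(13); ori %1,%0,%3; stb %1,%2(13)"
			: "=r" (flags), "=&r" (tmp)
			: "i" (offsetof(struct paca_struct, soft_enabled)),
			  "i" (enable)
			: "memory");
	else
		asm volatile(
			"lbz %0,%2(13); or %1,%0,%3; stb %1,%2(13)"
			: "=r" (flags), "=&r" (tmp)
			: "i" (offsetof(struct paca_struct, soft_enabled)),
			  "r" (enable)
			: "memory");

	return flags;
}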


> @@ -105,7 +119,7 @@ static inline unsigned long arch_local_irq_save(void)
>  
>  static inline bool arch_irqs_disabled_flags(unsigned long flags)
>  {
> - return flags == IRQ_DISABLE_MASK_LINUX;
> + return (flags);
>  }
>  
>  static inline bool arch_irqs_disabled(void)

This part logically belongs in patch 11, but it also needs to be changed,
I think? Keep in mind it's the generic kernel asking whether it has "Linux
interrupts" disabled.

return flags & IRQ_DISABLE_MASK_LINUX;

> @@ -113,6 +127,59 @@ static inline bool arch_irqs_disabled(void)
>   return arch_irqs_disabled_flags(arch_local_save_flags());
>  }
>  
> +static inline void powerpc_local_irq_pmu_restore(unsigned long flags)
> +{
> + arch_local_irq_restore(flags);
> +}
> +
> +static inline unsigned long powerpc_local_irq_pmu_disable(void)
> +{
> + return soft_enabled_or_return(IRQ_DISABLE_MASK_LINUX | 
> IRQ_DISABLE_MASK_PMU);
> +}
> +
> +static inline unsigned long powerpc_local_irq_pmu_save(void)
> +{
> + return powerpc_local_irq_pmu_disable();
> +}
> +
> +#define raw_local_irq_pmu_save(flags)
> \
> + do {\
> + typecheck(unsigned long, flags);\
> + flags = powerpc_local_irq_pmu_save();   \
> + } while(0)
> +
> +#define raw_local_irq_pmu_restore(flags) \
> + do {\
> + typecheck(unsigned long, flags);\
> + powerpc_local_irq_pmu_restore(flags);   \
> + } while(0)
> +
> +#ifdef CONFIG_TRACE_IRQFLAGS
> +#define local_irq_pmu_save(flags)\
> + do {\
> + raw_local_irq_pmu_save(flags);  \
> + trace_hardirqs_off();   \
> + } while(0)
> +#define local_irq_pmu_restore(flags) \
> + do {\
> + if (raw_irqs_disabled_flags(flags)) {   \
> + raw_local_irq_pmu_restore(flags);\
> + trace_hardirqs_off();   \
> + } else {\
> + trace_hardirqs_on();\
> + raw_local_irq_pmu_restore(flags);\
> + }   \
> + } while(0)
> +#else
> +#define local_irq_pmu_save(flags)\
> + do {\
> + raw_local_irq_pmu_save(flags);  \
> + } while(0)
> +#define local_irq_pmu_restore(flags) \
> + do { raw_local_irq_pmu_restore(flags); } while (0)
> +#endif /* CONFIG_TRACE_IRQFLAGS */
> +
> +

This looks pretty good. When I suggested the powerpc_ prefix, I meant it for
these functions here, so they wouldn't clash with the local_irq_ namespace of
the generic kernel. But that was just an idea. If you prefer to do it this way,
could you just drop the powerpc_ wrappers entirely?

A comment above that says it comes from the generic Linux local_irq code
might be an idea too.

Provided at least the arch_irqs_disabled_flags comment gets addressed:

Reviewed-by: Nicholas Piggin 



Re: [PATCH 09/13] powerpc: Introduce new mask bit for soft_enabled

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:59 +0530
Madhavan Srinivasan  wrote:

> Currently soft_enabled is used as the flag to determine
> the interrupt state. This patch extends soft_enabled
> to be used as a mask instead of a flag.

This should be the title of the patch, IMO. Introducing the new
mask bit is incidental (and could be moved to patch 11). The key
here of course is switching the tests from a flag to a mask.

Very cool that you got this all worked out without adding any
new instructions.

Reviewed-by: Nicholas Piggin 

> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/exception-64s.h | 4 ++--
>  arch/powerpc/include/asm/hw_irq.h| 1 +
>  arch/powerpc/include/asm/irqflags.h  | 4 ++--
>  arch/powerpc/kernel/entry_64.S   | 4 ++--
>  arch/powerpc/kernel/exceptions-64e.S | 6 +++---
>  5 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/exception-64s.h 
> b/arch/powerpc/include/asm/exception-64s.h
> index dd3253bd0d8e..1eea4ab75607 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -430,9 +430,9 @@ label##_relon_hv: 
> \
>  
>  #define __SOFTEN_TEST(h, vec)
> \
>   lbz r10,PACASOFTIRQEN(r13); \
> - cmpwi   r10,IRQ_DISABLE_MASK_LINUX; 
> \
> + andi.   r10,r10,IRQ_DISABLE_MASK_LINUX; \
>   li  r10,SOFTEN_VALUE_##vec; \
> - beq masked_##h##interrupt
> + bne masked_##h##interrupt
>  #define _SOFTEN_TEST(h, vec) __SOFTEN_TEST(h, vec)
>  
>  #define SOFTEN_TEST_PR(vec)  \
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index fd9b421f9020..245262c02bab 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -32,6 +32,7 @@
>   */
>  #define IRQ_DISABLE_MASK_NONE0
>  #define IRQ_DISABLE_MASK_LINUX   1
> +#define IRQ_DISABLE_MASK_PMU 2
>  
>  #endif /* CONFIG_PPC64 */
>  
> diff --git a/arch/powerpc/include/asm/irqflags.h 
> b/arch/powerpc/include/asm/irqflags.h
> index d0ed2a7d7d10..9ff09747a226 100644
> --- a/arch/powerpc/include/asm/irqflags.h
> +++ b/arch/powerpc/include/asm/irqflags.h
> @@ -48,11 +48,11 @@
>  #define RECONCILE_IRQ_STATE(__rA, __rB)  \
>   lbz __rA,PACASOFTIRQEN(r13);\
>   lbz __rB,PACAIRQHAPPENED(r13);  \
> - cmpwi   cr0,__rA,IRQ_DISABLE_MASK_LINUX;\
> + andi.   __rA,__rA,IRQ_DISABLE_MASK_LINUX;\
>   li  __rA,IRQ_DISABLE_MASK_LINUX;\
>   ori __rB,__rB,PACA_IRQ_HARD_DIS;\
>   stb __rB,PACAIRQHAPPENED(r13);  \
> - beq 44f;\
> + bne 44f;\
>   stb __rA,PACASOFTIRQEN(r13);\
>   TRACE_DISABLE_INTS; \
>  44:
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 879aeb11ad29..533e363914a9 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -764,8 +764,8 @@ restore:
>*/
>   ld  r5,SOFTE(r1)
>   lbz r6,PACASOFTIRQEN(r13)
> - cmpwi   cr0,r5,IRQ_DISABLE_MASK_LINUX
> - beq restore_irq_off
> + andi.   r5,r5,IRQ_DISABLE_MASK_LINUX
> + bne restore_irq_off
>  
>   /* We are enabling, were we already enabled ? Yes, just return */
>   cmpwi   cr0,r6,IRQ_DISABLE_MASK_NONE
> diff --git a/arch/powerpc/kernel/exceptions-64e.S 
> b/arch/powerpc/kernel/exceptions-64e.S
> index 5c628b5696f6..8e40df2c2f30 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -212,8 +212,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
>   /* Interrupts had better not already be enabled... */
>   twnei   r6,IRQ_DISABLE_MASK_LINUX
>  
> - cmpwi   cr0,r5,IRQ_DISABLE_MASK_LINUX
> - beq 1f
> + andi.   r5,r5,IRQ_DISABLE_MASK_LINUX
> + bne 1f
>  
>   TRACE_ENABLE_INTS
>   stb r5,PACASOFTIRQEN(r13)
> @@ -352,7 +352,7 @@ ret_from_mc_except:
>  
>  #define PROLOG_ADDITION_MASKABLE_GEN(n)  
> \
>   lbz r10,PACASOFTIRQEN(r13); /* are irqs soft-disabled ? */  \
> - cmpwi   cr0,r10,IRQ_DISABLE_MASK_LINUX;/* yes -> go out of line */ \
> + andi.   r10,r10,IRQ_DISABLE_MASK_LINUX;/* yes -> go out of line */ \
>   beq masked_interrupt_book3e_##n
>  
>  #define PROLOG_ADDITION_2REGS_GEN(n) \



Re: [PATCH 08/13] powerpc: Add new _EXCEPTION_PROLOG_1 macro

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:58 +0530
Madhavan Srinivasan  wrote:

> To support addition of "bitmask" to MASKABLE_* macros,
> factor out the EXCEPTION_PROLOG_1 macro.
> 
> Signed-off-by: Madhavan Srinivasan 

Really minor nit, but as a matter of readability of the series,
would you consider moving this next to patch 10 where it's used,
if you submit again?

Reviewed-by: Nicholas Piggin 



Re: [PATCH 07/13] powerpc: Avoid using EXCEPTION_PROLOG_1 macro in MASKABLE_*

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:57 +0530
Madhavan Srinivasan  wrote:

> Currently we use both EXCEPTION_PROLOG_1 and __EXCEPTION_PROLOG_1
> in the MASKABLE_* macros. As a cleanup, this patch makes MASKABLE_*
> use only __EXCEPTION_PROLOG_1. There is no logic change.
> 
> Signed-off-by: Madhavan Srinivasan 

I assume this was done like this once for some macro expansion issue?
If it doesn't cause any breakage, then fine.

Reviewed-by: Nicholas Piggin 


Re: [PATCH 06/13] powerpc: reverse the soft_enable logic

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:56 +0530
Madhavan Srinivasan  wrote:

> "paca->soft_enabled" is used as a flag to mask some interrupts.
> The currently supported flag values and their details:
> 
> soft_enabled    MSR[EE]
> 
> 0               0          Disabled (PMI and HMI not masked)
> 1               1          Enabled
> 
> "paca->soft_enabled" is initialized to 1 to mark interrupts as
> enabled. arch_local_irq_disable() will toggle the value when interrupts
> need to be disabled. At this point, the interrupts are not actually
> disabled; instead, the interrupt vector has code to check for the flag
> and mask the interrupt when it occurs. By "mask it", it updates
> paca->irq_happened and returns. arch_local_irq_restore() is called to
> re-enable interrupts, which checks and replays interrupts if any
> occurred.
> 
> Now, as mentioned, the current logic does not mask "performance
> monitoring interrupts", and PMIs are implemented as NMIs. But this
> patchset depends on local_irq_* for a successful local_* update:
> mask all possible interrupts during the local_* update and replay
> them after the update.
> 
> So the idea here is to reverse the "paca->soft_enabled" logic. New
> values and details:
> 
> soft_enabled    MSR[EE]
> 
> 1               0          Disabled (PMI and HMI not masked)
> 0               1          Enabled
> 
> The reason for this change is to create the foundation for a third
> flag value, "2", for "soft_enabled", to add support for masking PMIs.
> When ->soft_enabled is set to "2", PMI interrupts are masked; when set
> to "1", PMIs are not masked.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/hw_irq.h | 4 ++--
>  arch/powerpc/kernel/entry_64.S| 5 ++---
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index dc3c248f9244..fd9b421f9020 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -30,8 +30,8 @@
>  /*
>   * flags for paca->soft_enabled
>   */
> -#define IRQ_DISABLE_MASK_NONE1
> -#define IRQ_DISABLE_MASK_LINUX   0
> +#define IRQ_DISABLE_MASK_NONE0
> +#define IRQ_DISABLE_MASK_LINUX   1
>  
>  #endif /* CONFIG_PPC64 */
>  
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index aef7b64cbbeb..879aeb11ad29 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -131,8 +131,7 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>*/
>  #if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG)
>   lbz r10,PACASOFTIRQEN(r13)
> - xorir10,r10,IRQ_DISABLE_MASK_NONE
> -1:   tdnei   r10,0
> +1:   tdnei   r10,IRQ_DISABLE_MASK_NONE
>   EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
>  #endif
>  
> @@ -1012,7 +1011,7 @@ _GLOBAL(enter_rtas)
>* check it with the asm equivalent of WARN_ON
>*/
>   lbz r0,PACASOFTIRQEN(r13)
> -1:   tdnei   r0,IRQ_DISABLE_MASK_LINUX
> +1:   tdeqi   r0,IRQ_DISABLE_MASK_NONE
>   EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
>  #endif
>   

We specifically want to ensure that _LINUX interrupts are disabled
here. Not that we allow masking of others without _LINUX now, but
current behavior is checking that LINUX ones are masked.

Otherwise it seems okay.

It might be nice after this series to do a pass and rename
soft_enabled to soft_masked.

Reviewed-by: Nicholas Piggin 


Re: powerpc: Discard ffs() function and use builtin_ffs instead

2016-09-16 Thread Christophe Leroy



On 13/05/2016 at 08:53, Christophe Leroy wrote:



On 13/05/2016 at 08:16, Michael Ellerman wrote:

On Thu, 2016-12-05 at 15:32:22 UTC, Christophe Leroy wrote:

With the ffs() function as defined in arch/powerpc/include/asm/bitops.h,
GCC will not optimise the code in the case of a constant parameter, as shown
by the small example below.

int ffs_test(void)
{
return 4 << ffs(31);
}

c0012334 :
c0012334:   39 20 00 01 li  r9,1
c0012338:   38 60 00 04 li  r3,4
c001233c:   7d 29 00 34 cntlzw  r9,r9
c0012340:   21 29 00 20 subfic  r9,r9,32
c0012344:   7c 63 48 30 slw r3,r3,r9
c0012348:   4e 80 00 20 blr

With this patch, the same function will compile as follows:

c0012334 :
c0012334:   38 60 00 08 li  r3,8
c0012338:   4e 80 00 20 blr


But what code does it generate when it's not a constant?


The generated code is the same with and without the patch when not a constant:


int ffs_test2(int x)
{
return ffs(x);
}

c001233c :
c001233c:   7d 23 00 d0 neg r9,r3
c0012340:   7d 23 18 38 and r3,r9,r3
c0012344:   7c 63 00 34 cntlzw  r3,r3
c0012348:   20 63 00 20 subfic  r3,r3,32
c001234c:   4e 80 00 20 blr



And which gcc version first added the builtin version?
Don't know, but __builtin_ffs() is already used in arch/powerpc/include/asm/page_32.h
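[Editorial note: the change under discussion essentially reduces to
delegating to the compiler builtin, roughly as below -- a sketch of the
idea, not the exact patch text.]

static __inline__ int ffs(int x)
{
	return __builtin_ffs(x);
}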




Hi Michael,

Any chance to get it into 4.9?

Christophe




Re: [PATCH 05/13] powerpc: Add soft_enabled manipulation functions

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:55 +0530
Madhavan Srinivasan  wrote:

> Add new soft_enabled_* manipulation function and implement
> arch_local_* using the soft_enabled_* wrappers.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/hw_irq.h | 32 ++--
>  1 file changed, 14 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index f828b8f8df02..dc3c248f9244 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -53,21 +53,7 @@ static inline notrace void soft_enabled_set(unsigned long 
> enable)
>   : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
>  }
>  
> -static inline notrace unsigned long soft_enabled_set_return(unsigned long 
> enable)
> -{
> - unsigned long flags;
> -
> - asm volatile(
> - "lbz %0,%1(13); stb %2,%1(13)"
> - : "=r" (flags)
> - : "i" (offsetof(struct paca_struct, soft_enabled)),\
> -   "r" (enable)
> - : "memory");
> -
> - return flags;
> -}
> -
> -static inline unsigned long arch_local_save_flags(void)
> +static inline notrace unsigned long soft_enabled_return(void)
>  {
>   unsigned long flags;
>  
> @@ -79,20 +65,30 @@ static inline unsigned long arch_local_save_flags(void)
>   return flags;
>  }
>  
> -static inline unsigned long arch_local_irq_disable(void)
> +static inline notrace unsigned long soft_enabled_set_return(unsigned long 
> enable)
>  {
>   unsigned long flags, zero;
>  
>   asm volatile(
> - "li %1,%3; lbz %0,%2(13); stb %1,%2(13)"
> + "mr %1,%3; lbz %0,%2(13); stb %1,%2(13)"
>   : "=r" (flags), "=&r" (zero)
>   : "i" (offsetof(struct paca_struct, soft_enabled)),\
> -   "i" (IRQ_DISABLE_MASK_LINUX)
> +   "r" (enable)
>   : "memory");
>  
>   return flags;
>  }

As we talked about earlier, it would be nice to add builtin_constant
variants to avoid the extra instruction. If you prefer to do that
after this series, that's fine.

Reviewed-by: Nicholas Piggin 


Re: [PATCH 04/13] powerpc: Use soft_enabled_set api to update paca->soft_enabled

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:54 +0530
Madhavan Srinivasan  wrote:

> Force use of the soft_enabled_set() wrapper to update paca->soft_enabled
> wherever possible. Also add a new wrapper function,
> soft_enabled_set_return(),
> to force the paca->soft_enabled updates through it.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/hw_irq.h  | 14 ++
>  arch/powerpc/include/asm/kvm_ppc.h |  2 +-
>  arch/powerpc/kernel/irq.c  |  2 +-
>  arch/powerpc/kernel/setup_64.c |  4 ++--
>  arch/powerpc/kernel/time.c |  6 +++---
>  5 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 8fad8c24760b..f828b8f8df02 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -53,6 +53,20 @@ static inline notrace void soft_enabled_set(unsigned long 
> enable)
>   : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
>  }
>  
> +static inline notrace unsigned long soft_enabled_set_return(unsigned long 
> enable)
> +{
> + unsigned long flags;
> +
> + asm volatile(
> + "lbz %0,%1(13); stb %2,%1(13)"
> + : "=r" (flags)
> + : "i" (offsetof(struct paca_struct, soft_enabled)),\
> +   "r" (enable)
> + : "memory");
> +
> + return flags;
> +}

Why do you have the "memory" clobber here while soft_enabled_set() does not?

Thanks,
Nick


Re: [PATCH 03/13] powerpc: move set_soft_enabled() and rename

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:53 +0530
Madhavan Srinivasan  wrote:

> Move set_soft_enabled() from powerpc/kernel/irq.c to
> asm/hw_irq.h and rename it soft_enabled_set().
> This way paca->soft_enabled updates can be forced.

Could you just tidy up the changelog a little?

You are renaming it I assume because you are going to introduce more
soft_enabled_x() functions, and that the namespace works better as a
prefix than a postfix.

You are moving it so all paca->soft_enabled updates can be done via
these access functions rather than open coded.

Did I get that right?

Reviewed-by: Nicholas Piggin 


[PATCH 2/2] powernv:idle:Implement lite variant of power_enter_stop

2016-09-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This patch adds a function named power_enter_stop_lite() that can
execute a stop instruction when ESL and EC bits are set to zero in the
PSSCR.  The function handles the wake-up from idle at the instruction
immediately after the stop instruction.

If the flag OPAL_PM_WAKEUP_AT_NEXT_INST[1] is set in the device tree
for a stop state, then use the lite variant for that particular stop
state.

[1] : The corresponding patch in skiboot that defines
  OPAL_PM_WAKEUP_AT_NEXT_INST and enables it in the device tree
  can be found here:
  https://lists.ozlabs.org/pipermail/skiboot/2016-September/004805.html

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/opal-api.h   |  1 +
 arch/powerpc/include/asm/processor.h  |  3 ++-
 arch/powerpc/kernel/idle_book3s.S | 28 +---
 arch/powerpc/platforms/powernv/idle.c | 17 ++---
 arch/powerpc/platforms/powernv/smp.c  |  2 +-
 drivers/cpuidle/cpuidle-powernv.c | 24 ++--
 6 files changed, 65 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 0e2e57b..6e5741e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -179,6 +179,7 @@
 #define OPAL_PM_TIMEBASE_STOP  0x0002
 #define OPAL_PM_LOSE_HYP_CONTEXT   0x2000
 #define OPAL_PM_LOSE_FULL_CONTEXT  0x4000
+#define OPAL_PM_WAKEUP_AT_NEXT_INST0x8000
 #define OPAL_PM_NAP_ENABLED0x0001
 #define OPAL_PM_SLEEP_ENABLED  0x0002
 #define OPAL_PM_WINKLE_ENABLED 0x0004
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 68e3bf5..e0549a0 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -460,7 +460,8 @@ extern int powersave_nap;   /* set if nap mode can be used 
in idle loop */
 extern unsigned long power7_nap(int check_irq);
 extern unsigned long power7_sleep(void);
 extern unsigned long power7_winkle(void);
-extern unsigned long power9_idle_stop(unsigned long stop_level);
+extern unsigned long power9_idle_stop(unsigned long stop_level,
+   unsigned long exec_lite);
 
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 32d666b..47ee106 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -43,6 +43,8 @@
 #define PSSCR_HV_TEMPLATE  PSSCR_ESL | PSSCR_EC | \
PSSCR_PSLL_MASK | PSSCR_TR_MASK | \
PSSCR_MTL_MASK
+#define PSSCR_HV_TEMPLATE_LITE PSSCR_PSLL_MASK | PSSCR_TR_MASK | \
+PSSCR_MTL_MASK
 
.text
 
@@ -246,6 +248,20 @@ enter_winkle:
 
IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 
+
+/*
+ * power_enter_stop_lite : This will resume the wake up from
+ * idle at the subsequent instruction.
+ *
+ * Caller should set ESL=EC=0 in PSSCR before calling
+ * this function.
+ *
+ */
+power_enter_stop_lite:
+   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+7: li  r3,0  /* Since we didn't lose state, return 0 */
+   b   pnv_wakeup_noloss
+
 /*
  * r3 - requested stop state
  */
@@ -333,13 +349,19 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);  
\
 
 /*
  * r3 - requested stop state
+ * r4 - Indicates if the lite variant with ESL=EC=0 should be executed.
  */
 _GLOBAL(power9_idle_stop)
-   LOAD_REG_IMMEDIATE(r4, PSSCR_HV_TEMPLATE)
-   or  r4,r4,r3
+   cmpdi   r4, 1
+   bne 4f
+   LOAD_REG_IMMEDIATE(r4, PSSCR_HV_TEMPLATE_LITE)
+   LOAD_REG_ADDR(r5,power_enter_stop_lite)
+   b   5f
+4: LOAD_REG_IMMEDIATE(r4, PSSCR_HV_TEMPLATE)
+   LOAD_REG_ADDR(r5,power_enter_stop)
+5: or  r4,r4,r3
mtspr   SPRN_PSSCR, r4
li  r4, 1
-   LOAD_REG_ADDR(r5,power_enter_stop)
b   pnv_powersave_common
/* No return */
 /*
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 479c256..c3d3fed 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -244,8 +244,15 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
 static void power9_idle(void)
 {
/* Requesting stop state 0 */
-   power9_idle_stop(0);
+   power9_idle_stop(0, 0);
 }
+
+static void power9_idle_lite(void)
+{
+   /* Requesting stop state 0 with ESL=EC=0 */
+   power9_idle_stop(0, 1);
+}
+
 /*
  * First deep stop state. Used to figure out when to save/restore
  * hypervisor context.
@@ -414,8 +421,12 @@ static int __init pnv_init_idle_states(void)
 
if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
ppc_md.power_save = power7_idle;
-   else if (supported_cpuidle_states & OPAL_PM_STO

[PATCH 1/2] powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro

2016-09-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently all the low-power idle states are expected to wake up
at reset vector 0x100. That is why the macro IDLE_STATE_ENTER_SEQ,
which puts the CPU into an idle state, never returns.

On ISA_300, when the ESL and EC bits in the PSSCR are zero, the
CPU is expected to wake up at the instruction following the idle
instruction.

This patch adds a new macro named IDLE_STATE_ENTER_SEQ_NORET for the
no-return variant and reuses the name IDLE_STATE_ENTER_SEQ
for a variant that allows resuming operation at the instruction
following the idle instruction.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/cpuidle.h   |  5 -
 arch/powerpc/kernel/exceptions-64s.S |  6 +++---
 arch/powerpc/kernel/idle_book3s.S| 10 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index 01b8a13..9fd23f6 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -21,7 +21,7 @@ extern u64 pnv_first_deep_stop_state;
 
 /* Idle state entry routines */
 #ifdef CONFIG_PPC_P7_NAP
-#defineIDLE_STATE_ENTER_SEQ(IDLE_INST) \
+#define IDLE_STATE_ENTER_SEQ(IDLE_INST) \
/* Magic NAP/SLEEP/WINKLE mode enter sequence */\
std r0,0(r1);   \
ptesync;\
@@ -29,6 +29,9 @@ extern u64 pnv_first_deep_stop_state;
 1: cmp cr0,r0,r0;  \
bne 1b; \
IDLE_INST;  \
+
+#defineIDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)   \
+   IDLE_STATE_ENTER_SEQ(IDLE_INST) \
b   .
 #endif /* CONFIG_PPC_P7_NAP */
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index bffec73..238307d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1301,12 +1301,12 @@ machine_check_handle_early:
lbz r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi   r3,PNV_THREAD_NAP
bgt 10f
-   IDLE_STATE_ENTER_SEQ(PPC_NAP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
 10:
cmpwi   r3,PNV_THREAD_SLEEP
bgt 2f
-   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
/* No return */
 
 2:
@@ -1320,7 +1320,7 @@ machine_check_handle_early:
 */
ori r13,r13,1
SET_PACA(r13)
-   IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
/* No return */
 4:
 #endif
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index bd739fe..32d666b 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -188,7 +188,7 @@ pnv_enter_arch207_idle_mode:
stb r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi   cr3,r3,PNV_THREAD_SLEEP
bge cr3,2f
-   IDLE_STATE_ENTER_SEQ(PPC_NAP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
 2:
/* Sleep or winkle */
@@ -222,7 +222,7 @@ pnv_fastsleep_workaround_at_entry:
 
 common_enter: /* common code for all the threads entering sleep or winkle */
bgt cr3,enter_winkle
-   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 
 fastsleep_workaround_at_entry:
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
@@ -244,7 +244,7 @@ fastsleep_workaround_at_entry:
 enter_winkle:
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 
 /*
  * r3 - requested stop state
@@ -257,7 +257,7 @@ power_enter_stop:
ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
cmpdr3,r4
bge 2f
-   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 2:
 /*
  * Entering deep idle state.
@@ -279,7 +279,7 @@ lwarx_loop_stop:
 
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
 
 _GLOBAL(power7_idle)
/* Now check if user or arch enabled NAP mode */
-- 
1.9.4



[PATCH 0/2] powernv: Implement lite variant of stop with ESL=EC=0

2016-09-16 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

The Power ISA v3.0 allows us to execute the "stop" instruction with
ESL and EC of the PSSCR set to 0. This will ensure no loss of state,
and the wakeup from the stop will happen at the instruction following
the executed stop instruction.

This patchset adds support to run stop with ESL=EC=0 based on
a flag set for the corresponding stop state in the device tree.

The first patch renames the IDLE_STATE_ENTER_SEQ macro to
IDLE_STATE_ENTER_SEQ_NORET, since the current users of this
macro expect the wakeup from stop to happen at the
System Reset vector. It reuses the name IDLE_STATE_ENTER_SEQ for a
variant where the wakeup from stop happens at the next instruction.

The second patch adds a new function (i.e., a lite variant)
that executes a stop instruction with ESL=EC=0 and handles wakeup
at the subsequent instruction. A particular stop state is wired to
this new function if the device tree entry for that stop state has
the OPAL_PM_WAKEUP_AT_NEXT_INST [1] flag set.

[1] : The corresponding patch in skiboot that defines
  OPAL_PM_WAKEUP_AT_NEXT_INST and enables it in the device tree
  can be found here:
  https://lists.ozlabs.org/pipermail/skiboot/2016-September/004805.html

Gautham R. Shenoy (2):
  powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro
  powernv:idle:Implement lite variant of power_enter_stop

 arch/powerpc/include/asm/cpuidle.h|  5 -
 arch/powerpc/include/asm/opal-api.h   |  1 +
 arch/powerpc/include/asm/processor.h  |  3 ++-
 arch/powerpc/kernel/exceptions-64s.S  |  6 +++---
 arch/powerpc/kernel/idle_book3s.S | 38 +++
 arch/powerpc/platforms/powernv/idle.c | 17 +---
 arch/powerpc/platforms/powernv/smp.c  |  2 +-
 drivers/cpuidle/cpuidle-powernv.c | 24 --
 8 files changed, 77 insertions(+), 19 deletions(-)

-- 
1.9.4



Re: [PATCH 02/13] powerpc: Cleanup to use IRQ_DISABLE_MASK_* macros for paca->soft_enabled update

2016-09-16 Thread Nicholas Piggin
On Thu, 15 Sep 2016 18:31:52 +0530
Madhavan Srinivasan  wrote:

> Replace the hardcoded values used when updating
> paca->soft_enabled with IRQ_DISABLE_MASK_* #def.
> No logic change.

This could be folded with patch 1.

Reviewed-by: Nicholas Piggin 



[PATCH][RFC] Implement arch primitives for busywait loops

2016-09-16 Thread Nicholas Piggin
Implementing busy wait loops with cpu_relax() in callers poses
some difficulties for powerpc.

First, we want to put our SMT thread into a low priority mode for the
duration of the loop, but then return to normal priority after exiting
the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
cpu_relax() does may have HMT_medium take effect before HMT_low made
any (or much) difference.

Second, it can be beneficial for some implementations to spin on the
exit condition with a statically predicted-not-taken branch (i.e.,
always predict the loop will exit).

This is a quick RFC with a couple of users converted to see what
people think. I don't use a C branch with hints, because we don't want
the compiler moving the loop body out of line, which makes it a bit
messy unfortunately. If there's a better way to do it, I'm all ears.

I would not propose to switch all callers immediately, just some
core synchronisation primitives.
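For illustration, a converted busy-wait would read roughly like this
(hypothetical caller; 'ptr' is made up):

	unsigned long val;

	spin_do {
		val = READ_ONCE(*ptr);
	} spin_until (val != 0);

This spins at low SMT priority and returns to medium priority once
*ptr becomes non-zero, with the exit branch statically predicted
not-taken.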

---
 arch/powerpc/include/asm/processor.h | 22 ++
 include/asm-generic/barrier.h|  7 ++-
 include/linux/bit_spinlock.h |  5 ++---
 include/linux/cgroup.h   |  7 ++-
 include/linux/seqlock.h  | 10 --
 5 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 68e3bf5..e10aee2 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -402,6 +402,28 @@ static inline unsigned long __pack_fe01(unsigned int 
fpmode)
 
 #ifdef CONFIG_PPC64
 #define cpu_relax()do { HMT_low(); HMT_medium(); barrier(); } while (0)
+
+#define spin_do\
+do {   \
+   HMT_low();  \
+   __asm__ __volatile__ (  "1010:");
+
+#define spin_while(cond)   \
+   barrier();  \
+   __asm__ __volatile__ (  "cmpdi  %0,0\n\t"   \
+   "beq-   1010b   \n\t"   \
+   : : "r" (cond));\
+   HMT_medium();   \
+} while (0)
+
+#define spin_until(cond)   \
+   barrier();  \
+   __asm__ __volatile__ (  "cmpdi  %0,0\n\t"   \
+   "bne-   1010b   \n\t"   \
+   : : "r" (cond));\
+   HMT_medium();   \
+} while (0)
+
 #else
 #define cpu_relax()barrier()
 #endif
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b5..4c53b3a 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -235,12 +235,9 @@ do {   
\
 #define smp_cond_load_acquire(ptr, cond_expr) ({   \
typeof(ptr) __PTR = (ptr);  \
typeof(*ptr) VAL;   \
-   for (;;) {  \
+   spin_do {   \
VAL = READ_ONCE(*__PTR);\
-   if (cond_expr)  \
-   break;  \
-   cpu_relax();\
-   }   \
+   } spin_until (cond_expr);   \
smp_acquire__after_ctrl_dep();  \
VAL;\
 })
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index 3b5bafc..695743c 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -25,9 +25,8 @@ static inline void bit_spin_lock(int bitnum, unsigned long 
*addr)
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
preempt_enable();
-   do {
-   cpu_relax();
-   } while (test_bit(bitnum, addr));
+   spin_do {
+   } spin_while (test_bit(bitnum, addr));
preempt_disable();
}
 #endif
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 984f73b..e7d395f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -450,12 +450,9 @@ task_get_css(struct task_struct *task, int subsys_id)
struct cgroup_subsys_state *css;
 
rcu_read_lock();
-   while (true) {
+   spin_do {
css = task_css(task, subsys_id);
-   if (likely(css_tryget_online(css)))
-   break;
-   cpu_relax();
-   }
+   } spin_until (likely(cs

i2c i2c-3: i2c-powermac: modalias failure on /uni-n@f8000000/i2c@f8001000/cereal@1c0

2016-09-16 Thread Mathieu Malaterre
Hi there,

Would anyone know why I am getting the following error message in
`dmesg` on my PowerMac/Mac Mini G4:

[...]
[2.090226] PowerMac i2c bus pmu 2 registered
[2.095691] PowerMac i2c bus pmu 1 registered
[2.101016] PowerMac i2c bus mac-io 0 registered
[2.106135] PowerMac i2c bus uni-n 1 registered
[2.111094] i2c i2c-3: i2c-powermac: modalias failure on
/uni-n@f8000000/i2c@f8001000/cereal@1c0
[2.116261] PowerMac i2c bus uni-n 0 registered
[...]


Looking at the code I see:

static bool i2c_powermac_get_type(struct i2c_adapter *adap,
struct device_node *node,
u32 addr, char *type, int type_size)
{
[...]
dev_err(&adap->dev, "i2c-powermac: modalias failure"
" on %s\n", node->full_name);
return false;


However, I also see that 'cereal' should be one of the supported devices:

static u32 i2c_powermac_get_addr(struct i2c_adapter *adap,
   struct pmac_i2c_bus *bus,
   struct device_node *node)
{
[...]
/* Now handle some devices with missing "reg" properties */
if (!strcmp(node->name, "cereal"))
return 0x60;


I would appreciate it if someone could confirm this case can be handled
by `i2c-powermac` before I file a bug report.

Regards.


[PATCH v2] powerpc/8xx: add dedicated machine check handler

2016-09-16 Thread Christophe Leroy
During a machine check, the 8xx provides indication of
whether the check is due to data or instruction access, so
let's display it.

Let's also move the 8xx-specific handling into the new handler.

Signed-off-by: Christophe Leroy 
---
v2: moved the part conditioned by CONFIG_8xx into the new handler

 arch/powerpc/include/asm/cputable.h |  1 +
 arch/powerpc/kernel/cputable.c  |  1 +
 arch/powerpc/kernel/traps.c | 36 +---
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index f752e6f..ab68d0e 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -43,6 +43,7 @@ extern int machine_check_e500mc(struct pt_regs *regs);
 extern int machine_check_e500(struct pt_regs *regs);
 extern int machine_check_e200(struct pt_regs *regs);
 extern int machine_check_47x(struct pt_regs *regs);
+int machine_check_8xx(struct pt_regs *regs);
 
 extern void cpu_down_flush_e500v2(void);
 extern void cpu_down_flush_e500mc(void);
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 6c4646a..6a82ef0 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -1248,6 +1248,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
.mmu_features   = MMU_FTR_TYPE_8xx,
.icache_bsize   = 16,
.dcache_bsize   = 16,
+   .machine_check  = machine_check_8xx,
.platform   = "ppc823",
},
 #endif /* CONFIG_8xx */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 8bd8e73..04254ad 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -665,6 +665,31 @@ int machine_check_e200(struct pt_regs *regs)
 
return 0;
 }
+#elif defined(CONFIG_PPC_8xx)
+int machine_check_8xx(struct pt_regs *regs)
+{
+   unsigned long reason = get_mc_reason(regs);
+
+   pr_err("Machine check in kernel mode.\n");
+   pr_err("Caused by (from SRR1=%lx): ", reason);
+   if (reason & 0x4000)
+   pr_err("Fetch error at address %lx\n", regs->nip);
+   else
+   pr_err("Data access error at address %lx\n", regs->dar);
+
+#ifdef CONFIG_PCI
+   /* the qspan pci read routines can cause machine checks -- Cort
+*
+* yuck !!! that totally needs to go away ! There are better ways
+* to deal with that than having a wart in the mcheck handler.
+* -- BenH
+*/
+   bad_page_fault(regs, regs->dar, SIGBUS);
+   return 1;
+#else
+   return 0;
+#endif
+}
 #else
 int machine_check_generic(struct pt_regs *regs)
 {
@@ -724,17 +749,6 @@ void machine_check_exception(struct pt_regs *regs)
if (recover > 0)
goto bail;
 
-#if defined(CONFIG_8xx) && defined(CONFIG_PCI)
-   /* the qspan pci read routines can cause machine checks -- Cort
-*
-* yuck !!! that totally needs to go away ! There are better ways
-* to deal with that than having a wart in the mcheck handler.
-* -- BenH
-*/
-   bad_page_fault(regs, regs->dar, SIGBUS);
-   goto bail;
-#endif
-
if (debugger_fault_handler(regs))
goto bail;
 
-- 
2.1.0



[PATCH v2 3/3] powerpc/8xx: Implement support of hugepages

2016-09-16 Thread Christophe Leroy
The 8xx uses a two-level page table and supports two different Linux
page sizes (4k and 16k). The 8xx also supports two different hugepage
sizes, 512k and 8M. In order to support them on Linux we define two
different page table layouts.

The size of pages is in the PGD entry, using PS field (bits 28-29):
00 : Small pages (4k or 16k)
01 : 512k pages
10 : reserved
11 : 8M pages

For the 512k hugepage size, a pgd entry has the below format
[0101] . The hugepte table allocated will contain 8
entries pointing to 512k huge ptes in 4k pages mode and 64 entries in
16k pages mode.

For 8M in 16k mode, a pgd entry has the below format
[1101] . The hugepte table allocated will contain 8
entries pointing to 8M huge ptes.

For 8M in 4k mode, multiple pgd entries point to the same hugepte
address and the pgd entry will have the below format
[1101]. The hugepte table allocated will only have one
entry.
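(A cross-check of the hugepd_shift() arithmetic below, assuming
_PMD_PAGE_MASK selects the PS bits, i.e. 0xc: PS=01 gives 0x4, and
(0x4 >> 1) + 17 = 19, i.e. a 2^19 = 512k page; PS=11 gives 0xc, and
(0xc >> 1) + 17 = 23, i.e. a 2^23 = 8M page.)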

For the time being, we do not support the CPU15 errata when HUGETLB is
selected.

Signed-off-by: Christophe Leroy 
---
v2: This v1 was split in two parts. This part focuses on adding the
support on 8xx. It also fixes an error in TLBmiss handlers in the
case of 8M hugepages in 16k pages mode.

 arch/powerpc/include/asm/hugetlb.h   |  19 -
 arch/powerpc/include/asm/mmu-8xx.h   |  35 
 arch/powerpc/include/asm/mmu.h   |  23 +++---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   1 +
 arch/powerpc/include/asm/nohash/pgtable.h|   4 +
 arch/powerpc/include/asm/reg_8xx.h   |   2 +-
 arch/powerpc/kernel/head_8xx.S   | 119 +--
 arch/powerpc/mm/hugetlbpage.c|  25 --
 arch/powerpc/mm/tlb_nohash.c |  21 -
 arch/powerpc/platforms/8xx/Kconfig   |   1 +
 arch/powerpc/platforms/Kconfig.cputype   |   1 +
 11 files changed, 223 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index c5517f4..3facdd4 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -51,12 +51,20 @@ static inline void __local_flush_hugetlb_page(struct 
vm_area_struct *vma,
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
BUG_ON(!hugepd_ok(hpd));
+#ifdef CONFIG_PPC_8xx
+   return (pte_t *)__va(hpd.pd & ~(_PMD_PAGE_MASK | _PMD_PRESENT_MASK));
+#else
return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | PD_HUGE);
+#endif
 }
 
 static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
+#ifdef CONFIG_PPC_8xx
+   return ((hpd.pd & _PMD_PAGE_MASK) >> 1) + 17;
+#else
return hpd.pd & HUGEPD_SHIFT_MASK;
+#endif
 }
 
 #endif /* CONFIG_PPC_BOOK3S_64 */
@@ -99,7 +107,15 @@ static inline int is_hugepage_only_range(struct mm_struct 
*mm,
 
 void book3e_hugetlb_preload(struct vm_area_struct *vma, unsigned long ea,
pte_t pte);
+#ifdef CONFIG_PPC_8xx
+static inline void flush_hugetlb_page(struct vm_area_struct *vma,
+ unsigned long vmaddr)
+{
+   flush_tlb_page(vma, vmaddr);
+}
+#else
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
+#endif
 
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
unsigned long end, unsigned long floor,
@@ -205,7 +221,8 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
  * are reserved early in the boot process by memblock instead of via
  * the .dts as on IBM platforms.
  */
-#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_PPC_FSL_BOOK3E)
+#if defined(CONFIG_HUGETLB_PAGE) && (defined(CONFIG_PPC_FSL_BOOK3E) || \
+defined(CONFIG_PPC_8xx))
 extern void __init reserve_hugetlb_gpages(void);
 #else
 static inline void reserve_hugetlb_gpages(void)
diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 3e0e492..798b5bf 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -172,6 +172,41 @@ typedef struct {
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff8)
 #define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
+
+/* Page size definitions, common between 32 and 64-bit
+ *
+ *shift : is the "PAGE_SHIFT" value for that page size
+ *penc  : is the pte encoding mask
+ *
+ */
+struct mmu_psize_def {
+   unsigned intshift;  /* number of bits */
+   unsigned intenc;/* PTE encoding */
+   unsigned intind;/* Corresponding indirect page size shift */
+   unsigned intflags;
+#define MMU_PAGE_SIZE_DIRECT   0x1 /* Supported as a direct size */
+#define MMU_PAGE_SIZE_INDIRECT 0x2 /* Supported as an indirect size */
+};
+
+extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
+
+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+   int psize;
+
+   for (psize = 0; psize < MMU_PAGE_COUNT; ++psize)
+   if (mmu_psize_defs[psize].shift == shift)
+   return psize;
+   return 

[PATCH v2 2/3] powerpc: get hugetlbpage handling more generic

2016-09-16 Thread Christophe Leroy
Today there are two implementations of hugetlbpages which are managed
by exclusive #ifdefs:
* FSL_BOOKE: several directory entries point to the same single hugepage
* BOOK3S: one upper level directory entry points to a table of hugepages

In preparation of implementation of hugepage support on the 8xx, we
need a mix of the two above solutions, because the 8xx needs both cases
depending on the size of pages:
* In 4k page size mode, each PGD entry covers a 4M bytes area. It means
that 2 PGD entries will be necessary to cover an 8M hugepage while a
single PGD entry will cover 8x 512k hugepages.
* In 16k page size mode, each PGD entry covers a 64M bytes area. It means
that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
hugepages will be covered by one PGD entry.

This patch:
* removes #ifdefs in favor of if/else based on the range sizes
* merges the two huge_pte_alloc() functions as they are pretty similar
* merges the two hugetlbpage_init() functions as they are pretty similar

Signed-off-by: Christophe Leroy 
---
v2: This part is new and results from a split of last patch of v1 serie in
two parts

 arch/powerpc/mm/hugetlbpage.c | 189 +-
 1 file changed, 77 insertions(+), 112 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8a512b1..2119f00 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -64,14 +64,16 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
 {
struct kmem_cache *cachep;
pte_t *new;
-
-#ifdef CONFIG_PPC_FSL_BOOK3E
int i;
-   int num_hugepd = 1 << (pshift - pdshift);
-   cachep = hugepte_cache;
-#else
-   cachep = PGT_CACHE(pdshift - pshift);
-#endif
+   int num_hugepd;
+
+   if (pshift >= pdshift) {
+   cachep = hugepte_cache;
+   num_hugepd = 1 << (pshift - pdshift);
+   } else {
+   cachep = PGT_CACHE(pdshift - pshift);
+   num_hugepd = 1;
+   }
 
new = kmem_cache_zalloc(cachep, GFP_KERNEL);
 
@@ -89,7 +91,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
smp_wmb();
 
spin_lock(&mm->page_table_lock);
-#ifdef CONFIG_PPC_FSL_BOOK3E
+
/*
 * We have multiple higher-level entries that point to the same
 * actual pte location.  Fill in each as we go and backtrack on error.
@@ -100,8 +102,13 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
if (unlikely(!hugepd_none(*hpdp)))
break;
else
+#ifdef CONFIG_PPC_BOOK3S_64
+   hpdp->pd = __pa(new) |
+  (shift_to_mmu_psize(pshift) << 2);
+#else
/* We use the old format for PPC_FSL_BOOK3E */
hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
+#endif
}
/* If we bailed from the for loop early, an error occurred, clean up */
if (i < num_hugepd) {
@@ -109,17 +116,6 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
hpdp->pd = 0;
kmem_cache_free(cachep, new);
}
-#else
-   if (!hugepd_none(*hpdp))
-   kmem_cache_free(cachep, new);
-   else {
-#ifdef CONFIG_PPC_BOOK3S_64
-   hpdp->pd = __pa(new) | (shift_to_mmu_psize(pshift) << 2);
-#else
-   hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
-#endif
-   }
-#endif
spin_unlock(&mm->page_table_lock);
return 0;
 }
@@ -136,7 +132,6 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
 /*
  * At this point we do the placement change only for BOOK3S 64. This would
  * possibly work on other subarchs.
@@ -153,6 +148,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long 
addr, unsigned long sz
addr &= ~(sz-1);
pg = pgd_offset(mm, addr);
 
+#ifdef CONFIG_PPC_BOOK3S_64
if (pshift == PGDIR_SHIFT)
/* 16GB huge page */
return (pte_t *) pg;
@@ -178,32 +174,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long 
addr, unsigned long sz
hpdp = (hugepd_t *)pm;
}
}
-   if (!hpdp)
-   return NULL;
-
-   BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
-
-   if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, 
pshift))
-   return NULL;
-
-   return hugepte_offset(*hpdp, addr, pdshift);
-}
-
 #else
-
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long 
sz)
-{
-   pgd_t *pg;
-   pud_t *pu;
-   pmd_t *pm;
-   hugepd_t *hpdp = NULL;
-   unsigned pshift = __ffs(sz);
-   unsigned pdshift = PGDIR_SHIFT;
-
-   addr &= ~(sz-1);
-
-   pg = pgd_offset(mm, addr);
-
if (pshift

[PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits

2016-09-16 Thread Christophe Leroy
Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
standard pages when using 4k pages and a single pgtable_cache
when using other page sizes.

In preparation of implementing huge pages on the 8xx, this patch
replaces the specific powerpc32 handling by the 64 bits approach.

This is done by:
* moving 64 bits pgtable_cache_add() and pgtable_cache_init()
in a new file called init-common.c
* modifying pgtable_cache_init() to also handle the case
without PMD
* removing the 32 bits version of pgtable_cache_add() and
pgtable_cache_init()
* copying related header contents from 64 bits into both the
book3s/32 and nohash/32 header files

On the 8xx, the following cache sizes will be used:
* 4k pages mode:
- PGT_CACHE(10) for PGD
- PGT_CACHE(3) for 512k hugepage tables
* 16k pages mode:
- PGT_CACHE(6) for PGD
- PGT_CACHE(7) for 512k hugepage tables
- PGT_CACHE(3) for 8M hugepage tables
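(As a sanity check of those indices, assuming 4-byte PGD entries on
ppc32: in 4k mode a PGD entry maps a 4M area, so the PGD needs
2^(32-22) = 2^10 entries, hence PGT_CACHE(10); in 16k mode an entry
maps 64M, giving 2^6 entries, hence PGT_CACHE(6).)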

Signed-off-by: Christophe Leroy 
---
v2: in v1, hugepte_cache was wrongly replaced by PGT_CACHE(1).
This modification has been removed from v2.

 arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 ++--
 arch/powerpc/include/asm/book3s/32/pgtable.h |  43 
 arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 ++--
 arch/powerpc/include/asm/nohash/32/pgtable.h |  45 
 arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
 arch/powerpc/include/asm/pgtable.h   |   2 +
 arch/powerpc/mm/Makefile |   3 +-
 arch/powerpc/mm/init-common.c| 147 +++
 arch/powerpc/mm/init_64.c|  77 --
 arch/powerpc/mm/pgtable_32.c |  37 ---
 11 files changed, 273 insertions(+), 174 deletions(-)
 create mode 100644 arch/powerpc/mm/init-common.c

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 8e21bb4..d310546 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -2,14 +2,42 @@
 #define _ASM_POWERPC_BOOK3S_32_PGALLOC_H
 
 #include 
+#include 
 
-/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
-#define MAX_PGTABLE_INDEX_SIZE 0
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE 0xf
 
 extern void __bad_pte(pmd_t *pmd);
 
-extern pgd_t *pgd_alloc(struct mm_struct *mm);
-extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
+extern struct kmem_cache *pgtable_cache[];
+#define PGT_CACHE(shift) ({\
+   BUG_ON(!(shift));   \
+   pgtable_cache[(shift) - 1]; \
+   })
+
+static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+   return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
+}
+
+static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
+{
+   kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
+}
 
 /*
  * We don't have any real pmd's, and this code never triggers because
@@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
 
 static inline void pgtable_free(void *table, unsigned index_size)
 {
-   BUG_ON(index_size); /* 32-bit doesn't use this */
-   free_page((unsigned long)table);
+   if (!index_size) {
+   free_page((unsigned long)table);
+   } else {
+   BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+   kmem_cache_free(PGT_CACHE(index_size), table);
+   }
 }
 
 #define check_pgt_cache()  do { } while (0)
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 6b8b2d5..f887499 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -8,6 +8,26 @@
 /* And here we include common definitions */
 #include 
 
+#define PTE_INDEX_SIZE PTE_SHIFT
+#define PMD_INDEX_SIZE 0
+#define PUD_INDEX_SIZE 0
+#define PGD_INDEX_SIZE (32 - PGDIR_SHIFT)
+
+#define PMD_CACHE_INDEXPMD_INDEX_SIZE
+
+#ifndef __ASSEMBLY__
+#define PTE_TABLE_SIZE (sizeof(pte_t) << PTE_INDEX_SIZE)
+#define PMD_TABLE_SIZE (sizeof(pmd_t) << PTE_INDEX_SIZE)
+#def

[PATCH v2 0/3] powerpc: implementation of huge pages for 8xx

2016-09-16 Thread Christophe Leroy
This is v2 of the patch series implementing support of
hugepages for the 8xx.
v1 of the series included some other fixes and
optimisations/reorganisations for the 8xx. The series has now been
split, and this part only focuses on the implementation of
hugepages.

Compared to v1, the last patch has been split in two parts.

This patch series applies on top of the patch series named
"Optimisation on 8xx prior to hugepage implementation"

Christophe Leroy (3):
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc: get hugetlbpage handling more generic
  powerpc/8xx: Implement support of hugepages

 arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h |  43 +++---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
 arch/powerpc/include/asm/hugetlb.h   |  19 ++-
 arch/powerpc/include/asm/mmu-8xx.h   |  35 +
 arch/powerpc/include/asm/mmu.h   |  23 +--
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h |  45 +++---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   1 +
 arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
 arch/powerpc/include/asm/nohash/pgtable.h|   4 +
 arch/powerpc/include/asm/pgtable.h   |   2 +
 arch/powerpc/include/asm/reg_8xx.h   |   2 +-
 arch/powerpc/kernel/head_8xx.S   | 119 ++-
 arch/powerpc/mm/Makefile |   3 +-
 arch/powerpc/mm/hugetlbpage.c| 212 ---
 arch/powerpc/mm/init-common.c| 147 +++
 arch/powerpc/mm/init_64.c|  77 --
 arch/powerpc/mm/pgtable_32.c |  37 -
 arch/powerpc/mm/tlb_nohash.c |  21 ++-
 arch/powerpc/platforms/8xx/Kconfig   |   1 +
 arch/powerpc/platforms/Kconfig.cputype   |   1 +
 22 files changed, 572 insertions(+), 313 deletions(-)
 create mode 100644 arch/powerpc/mm/init-common.c

-- 
2.1.0



[PATCH v2] Remove duplicate setting of the B field in tlbie

2016-09-16 Thread Balbir Singh

Remove the duplicate setting of the "B" field when doing a tlbie(l).
In compute_tlbie_rb(), the "B" field is set again just before
returning the rb value to be used for tlbie(l).
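(The two forms compute the same value: assuming HPTE_V_SSIZE_SHIFT = 62,
both (v >> HPTE_V_SSIZE_SHIFT) << 8 and (v >> 54) & 0x300 place HPTE
bits 62-63 into rb bits 8-9.)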

Signed-off-by: Balbir Singh 
---
Changelog - Leave the more readable version around

 arch/powerpc/include/asm/kvm_book3s_64.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 88d17b4..aa7d4fd 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -159,7 +159,6 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
/* This covers 14..54 bits of va*/
rb = (v & ~0x7fUL) << 16;   /* AVA field */
 
-   rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;   /*  B field */
/*
 * AVA in v had cleared lower 23 bits. We need to derive
 * that from pteg index
@@ -211,7 +210,7 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
break;
}
}
-   rb |= (v >> 54) & 0x300;/* B field */
+   rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;   /* B field */
return rb;
 }
 
-- 
2.5.5



Re: [PATCH] MAINTAINERS: Update cxl maintainers

2016-09-16 Thread Andrew Donnellan

On 16/09/16 14:28, Michael Neuling wrote:
> Fred has taken over the cxl maintenance I was doing.  This updates the
> MAINTAINERS file to reflect this.
>
> It also removes a duplicate entry in the files covered.
>
> Signed-off-by: Michael Neuling 

Reviewed-by: Andrew Donnellan 

>  CXL (IBM Coherent Accelerator Processor Interface CAPI) DRIVER
>  M:Ian Munsie 
> -M:Michael Neuling 
> +M:Frederic Barrat 
>  L:linuxppc-dev@lists.ozlabs.org
>  S:Supported
>  F:drivers/misc/cxl/
>  F:include/misc/cxl*
>  F:include/uapi/misc/cxl.h
>  F:Documentation/powerpc/cxl.txt
> -F:Documentation/powerpc/cxl.txt
>  F:Documentation/ABI/testing/sysfs-class-cxl

We should probably add:

F:arch/powerpc/platforms/powernv/pci-cxl.c

--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



[PATCH v2 3/3] powerpc/8xx: make user addr DTLB miss the short path

2016-09-16 Thread Christophe Leroy
User space DTLB misses represent approximately 90% of TLB misses,
so make this the shortest path.

Also remove an unnecessary double jump in FixupDAR.

Before this patch, we spend 3.3 TB ticks in the handler for each
user address miss and 3.4 TB ticks for each kernel address miss.
After this patch, we spend 3.0 TB ticks in the handler for each
user address miss and 3.9 TB ticks for each kernel address miss.
Taking into account that user misses represent 90% of the total,
this patch provides an improvement of approx. 9%.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 53 ++
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 9cc240d..bfe4907 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -384,30 +384,31 @@ InstructionTLBMiss:
 
. = 0x1200
 DataStoreTLBMiss:
+   mtspr   SPRN_SPRG_SCRATCH2, r3
EXCEPTION_PROLOG_0
-   mfcrr10
+   mfcrr3
 
/* If we are faulting a kernel address, we have to use the
 * kernel page tables.
 */
-   mfspr   r11, SPRN_MD_EPN
-   rlwinm  r11, r11, 16, 0xfff8
+   mfspr   r10, SPRN_MD_EPN
+   rlwinm  r10, r10, 16, 0xfff8
+   cmpli   cr0, r10, PAGE_OFFSET@h
+   mfspr   r11, SPRN_M_TW  /* Get level 1 table */
+   blt+3f
 #ifndef CONFIG_PIN_TLB_IMMR
-   cmpli   cr0, r11, VIRT_IMMR_BASE@h
+   cmpli   cr0, r10, VIRT_IMMR_BASE@h
 #endif
-   cmpli   cr7, r11, PAGE_OFFSET@h
+_ENTRY(DTLBMiss_cmp)
+   cmpli   cr7, r10, (PAGE_OFFSET + 0x180)@h
+   lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
 #ifndef CONFIG_PIN_TLB_IMMR
 _ENTRY(DTLBMiss_jmp)
beq-DTLBMissIMMR
 #endif
-   bge-cr7, DTLBMissLinear
-
-   mfspr   r11, SPRN_M_TW  /* Get level 1 table */
+   blt cr7, DTLBMissLinear
 3:
-   mtcrr10
-#ifdef CONFIG_8xx_CPU6
-   mtspr   SPRN_SPRG_SCRATCH2, r3
-#endif
+   mtcrr3
mfspr   r10, SPRN_MD_EPN
 
/* Insert level 1 index */
@@ -460,9 +461,7 @@ _ENTRY(DTLBMiss_jmp)
MTSPR_CPU6(SPRN_MD_RPN, r10, r3)/* Update TLB entry */
 
/* Restore registers */
-#ifdef CONFIG_8xx_CPU6
mfspr   r3, SPRN_SPRG_SCRATCH2
-#endif
mtspr   SPRN_DAR, r11   /* Tag DAR */
EXCEPTION_EPILOG_0
rfi
@@ -533,7 +532,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
  * not enough space in the DataStoreTLBMiss area.
  */
 DTLBMissIMMR:
-   mtcrr10
+   mtcrr3
/* Set 512k byte guarded page and mark it valid */
li  r10, MD_PS512K | MD_GUARDED | MD_SVALID
MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
@@ -545,27 +544,23 @@ DTLBMissIMMR:
 
li  r11, RPN_PATTERN
mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r3, SPRN_SPRG_SCRATCH2
EXCEPTION_EPILOG_0
rfi
 
 DTLBMissLinear:
-_ENTRY(DTLBMiss_cmp)
-   cmpli   cr0, r11, (PAGE_OFFSET + 0x180)@h
-   lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
-   bge-3b
-
-   mtcrr10
+   mtcrr3
/* Set 8M byte page and mark it valid */
-   li  r10, MD_PS8MEG | MD_SVALID
-   MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
-   mfspr   r10, SPRN_MD_EPN
-   rlwinm  r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */
+   li  r11, MD_PS8MEG | MD_SVALID
+   MTSPR_CPU6(SPRN_MD_TWC, r11, r3)
+   rlwinm  r10, r10, 16, 0x0f80/* 8xx supports max 256Mb RAM */
ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \
  _PAGE_PRESENT
MTSPR_CPU6(SPRN_MD_RPN, r10, r11)   /* Update TLB entry */
 
li  r11, RPN_PATTERN
mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r3, SPRN_SPRG_SCRATCH2
EXCEPTION_EPILOG_0
rfi
 
@@ -585,7 +580,9 @@ FixupDAR:/* Entry point for dcbx workaround. */
rlwinm  r11, r10, 16, 0xfff8
 _ENTRY(FixupDAR_cmp)
cmpli   cr7, r11, (PAGE_OFFSET + 0x180)@h
-   blt-cr7, 200f
+   /* create physical page address from effective address */
+   tophys(r11, r10)
+   blt-cr7, 201f
lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
/* Insert level 1 index */
 3: rlwimi  r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 
29
@@ -615,10 +612,6 @@ _ENTRY(FixupDAR_cmp)
 141:   mfspr   r10,SPRN_SPRG_SCRATCH2
b   DARFixed/* Nope, go back to normal TLB processing */
 
-   /* create physical page address from effective address */
-200:   tophys(r11, r10)
-   b   201b
-
 144:   mfspr   r10, SPRN_DSISR
rlwinm  r10, r10,0,7,5  /* Clear store bit for buggy dcbst insn */
mtspr   SPRN_DSISR, r10
-- 
2.1.0



[PATCH v2 2/3] powerpc/8xx: Move additional DTLBMiss handlers out of exception area

2016-09-16 Thread Christophe Leroy
When all options are activated, there is not enough space in the
exception area for the DTLBMiss handlers that handle the IMMR area
and linear RAM pages once we have added hugepage handling.
So let's move them after 0x2000.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 84 +-
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index fd5b53d..9cc240d 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -382,26 +382,6 @@ InstructionTLBMiss:
EXCEPTION_EPILOG_0
rfi
 
-/*
- * Bottom part of DataStoreTLBMiss handler for IMMR area
- * not enough space in the DataStoreTLBMiss area
- */
-DTLBMissIMMR:
-   mtcrr10
-   /* Set 512k byte guarded page and mark it valid */
-   li  r10, MD_PS512K | MD_GUARDED | MD_SVALID
-   MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
-   mfspr   r10, SPRN_IMMR  /* Get current IMMR */
-   rlwinm  r10, r10, 0, 0xfff8 /* Get 512 kbytes boundary */
-   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \
- _PAGE_PRESENT | _PAGE_NO_CACHE
-   MTSPR_CPU6(SPRN_MD_RPN, r10, r11)   /* Update TLB entry */
-
-   li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
-   EXCEPTION_EPILOG_0
-   rfi
-
. = 0x1200
 DataStoreTLBMiss:
EXCEPTION_PROLOG_0
@@ -420,7 +400,7 @@ DataStoreTLBMiss:
 _ENTRY(DTLBMiss_jmp)
beq-DTLBMissIMMR
 #endif
-   bge-cr7, 4f
+   bge-cr7, DTLBMissLinear
 
mfspr   r11, SPRN_M_TW  /* Get level 1 table */
 3:
@@ -487,27 +467,6 @@ _ENTRY(DTLBMiss_jmp)
EXCEPTION_EPILOG_0
rfi
 
-4:
-_ENTRY(DTLBMiss_cmp)
-   cmpli   cr0, r11, (PAGE_OFFSET + 0x180)@h
-   lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
-   bge-3b
-
-   mtcrr10
-   /* Set 8M byte page and mark it valid */
-   li  r10, MD_PS8MEG | MD_SVALID
-   MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
-   mfspr   r10, SPRN_MD_EPN
-   rlwinm  r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */
-   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \
- _PAGE_PRESENT
-   MTSPR_CPU6(SPRN_MD_RPN, r10, r11)   /* Update TLB entry */
-
-   li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
-   EXCEPTION_EPILOG_0
-   rfi
-
 
 /* This is an instruction TLB error on the MPC8xx.  This could be due
  * to many reasons, such as executing guarded memory or illegal instruction
@@ -569,6 +528,47 @@ DARFixed:/* Return from dcbx instruction bug workaround */
 
. = 0x2000
 
+/*
+ * Bottom part of DataStoreTLBMiss handlers for IMMR area and linear RAM.
+ * not enough space in the DataStoreTLBMiss area.
+ */
+DTLBMissIMMR:
+   mtcrr10
+   /* Set 512k byte guarded page and mark it valid */
+   li  r10, MD_PS512K | MD_GUARDED | MD_SVALID
+   MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
+   mfspr   r10, SPRN_IMMR  /* Get current IMMR */
+   rlwinm  r10, r10, 0, 0xfff8 /* Get 512 kbytes boundary */
+   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \
+ _PAGE_PRESENT | _PAGE_NO_CACHE
+   MTSPR_CPU6(SPRN_MD_RPN, r10, r11)   /* Update TLB entry */
+
+   li  r11, RPN_PATTERN
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   EXCEPTION_EPILOG_0
+   rfi
+
+DTLBMissLinear:
+_ENTRY(DTLBMiss_cmp)
+   cmpli   cr0, r11, (PAGE_OFFSET + 0x180)@h
+   lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
+   bge-3b
+
+   mtcrr10
+   /* Set 8M byte page and mark it valid */
+   li  r10, MD_PS8MEG | MD_SVALID
+   MTSPR_CPU6(SPRN_MD_TWC, r10, r11)
+   mfspr   r10, SPRN_MD_EPN
+   rlwinm  r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */
+   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \
+ _PAGE_PRESENT
+   MTSPR_CPU6(SPRN_MD_RPN, r10, r11)   /* Update TLB entry */
+
+   li  r11, RPN_PATTERN
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   EXCEPTION_EPILOG_0
+   rfi
+
 /* This is the procedure to calculate the data EA for buggy dcbx,dcbi 
instructions
  * by decoding the registers used by the dcbx instruction and adding them.
  * DAR is set to the calculated address.
-- 
2.1.0



[PATCH v2 1/3] powerpc/8xx: use r3 to scratch CR in ITLBmiss

2016-09-16 Thread Christophe Leroy
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 8632515..fd5b53d 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -323,7 +323,7 @@ SystemCall:
 #endif
 
 InstructionTLBMiss:
-#ifdef CONFIG_8xx_CPU6
+#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined 
(CONFIG_DEBUG_PAGEALLOC)
mtspr   SPRN_SPRG_SCRATCH2, r3
 #endif
EXCEPTION_PROLOG_0
@@ -331,23 +331,20 @@ InstructionTLBMiss:
/* If we are faulting a kernel address, we have to use the
 * kernel page tables.
 */
+   mfspr   r10, SPRN_SRR0  /* Get effective address of fault */
+   INVALIDATE_ADJACENT_PAGES_CPU15(r11, r10)
 #if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
/* Only modules will cause ITLB Misses as we always
 * pin the first 8MB of kernel memory */
-   mfspr   r11, SPRN_SRR0  /* Get effective address of fault */
-   INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
-   mfcrr10
-   IS_KERNEL(r11, r11)
+   mfcrr3
+   IS_KERNEL(r11, r10)
+#endif
mfspr   r11, SPRN_M_TW  /* Get level 1 table */
+#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
BRANCH_UNLESS_KERNEL(3f)
lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha
 3:
-   mtcrr10
-   mfspr   r10, SPRN_SRR0  /* Get effective address of fault */
-#else
-   mfspr   r10, SPRN_SRR0  /* Get effective address of fault */
-   INVALIDATE_ADJACENT_PAGES_CPU15(r11, r10)
-   mfspr   r11, SPRN_M_TW  /* Get level 1 table base address */
+   mtcrr3
 #endif
/* Insert level 1 index */
rlwimi  r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 
29
@@ -379,7 +376,7 @@ InstructionTLBMiss:
MTSPR_CPU6(SPRN_MI_RPN, r10, r3)/* Update TLB entry */
 
/* Restore registers */
-#ifdef CONFIG_8xx_CPU6
+#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined 
(CONFIG_DEBUG_PAGEALLOC)
mfspr   r3, SPRN_SPRG_SCRATCH2
 #endif
EXCEPTION_EPILOG_0
-- 
2.1.0



[PATCH RESEND] powerpc: fix usage of _PAGE_RO in hugepage

2016-09-16 Thread Christophe Leroy
On some CPUs like the 8xx, _PAGE_RW and hence _PAGE_WRITE is defined
as 0, and _PAGE_RO has to be set when a page is not writable.

_PAGE_RO is defined by default in pte-common.h; however, BOOK3S/64
doesn't include that file, so _PAGE_RO has to be defined explicitly
in book3s/64/pgtable.h.

Fixes: a7b9f671f2d14 ("powerpc32: adds handling of _PAGE_RO")
Signed-off-by: Christophe Leroy 
---
This patch was initially part of the v1 series of patches providing
hugepage support to the 8xx. As suggested by Aneesh, that series has
been split to focus only on the hugepage implementation for the 8xx.
This patch is a fix and is independent of the 8xx hugepage
implementation, although it is required to have hugepage support
working properly on the 8xx.

 arch/powerpc/include/asm/book3s/64/pgtable.h | 2 ++
 arch/powerpc/mm/hugetlbpage.c| 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8ec8be9..9fd77f8 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -6,6 +6,8 @@
  */
 #define _PAGE_BIT_SWAP_TYPE0
 
+#define _PAGE_RO   0
+
 #define _PAGE_EXEC 0x1 /* execute permission */
 #define _PAGE_WRITE0x2 /* write access allowed */
 #define _PAGE_READ 0x4 /* read access allowed */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7372ee1..8a512b1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1021,6 +1021,8 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned 
long addr,
mask = _PAGE_PRESENT | _PAGE_READ;
if (write)
mask |= _PAGE_WRITE;
+   else
+   mask |= _PAGE_RO;
 
if ((pte_val(pte) & mask) != mask)
return 0;
-- 
2.1.0



[PATCH v2 0/3] Optimisation on 8xx prior to hugepage implementation

2016-09-16 Thread Christophe Leroy
This series is a prologue to the hugepage implementation on the 8xx.
It somewhat optimises the DTLBMiss handler while at the same time
allowing the hugepage handling that will be introduced in a
subsequent patch series to be hooked in.

v1 of those patches was part of a series identified as
"powerpc/8xx: implementation of huge pages"

Christophe Leroy (3):
  powerpc/8xx: use r3 to scratch CR in ITLBmiss
  powerpc/8xx: Move additional DTLBMiss handlers out of exception area
  powerpc/8xx: make user addr DTLB miss the short path

 arch/powerpc/kernel/head_8xx.S | 134 +++--
 1 file changed, 62 insertions(+), 72 deletions(-)

-- 
2.1.0



Re: [V2] powerpc/Kconfig: Update config option based on page size.

2016-09-16 Thread Balbir Singh


On 14/09/16 20:40, santhosh wrote:
> 
>> Michael Ellerman  writes:
>>
>>> On Fri, 2016-19-02 at 05:38:47 UTC, Rashmica Gupta wrote:
 Currently on PPC64 changing kernel pagesize from 4K to 64K leaves
 FORCE_MAX_ZONEORDER set to 13 - which produces a compile error.

>>> ...
 So, update the range of FORCE_MAX_ZONEORDER from 9-64 to 8-9 for 64K pages
 and from 13-64 to 9-13 for 4K pages.

 Signed-off-by: Rashmica Gupta 
 Reviewed-by: Balbir Singh 
>>> Applied to powerpc next, thanks.
>>>
>>> https://git.kernel.org/powerpc/c/a7ee539584acf4a565b7439cea
>>>
>> HPAGE_PMD_ORDER is not something we should check w.r.t 4k linux page
>> size. We do have the below constraint w.r.t hugetlb pages
>>
>> static inline bool hstate_is_gigantic(struct hstate *h)
>> {
>> return huge_page_order(h) >= MAX_ORDER;
>> }
>>
>> That requires MAX_ORDER to be greater than 12.
>>

9 to 13 was chosen based on calculations you can find in the commit



>> Did we test the hugetlbfs 4k config with this patch? Will it work if we
>> start marking hugepages as gigantic pages?
>>
>> -aneesh
>>
> Hello Rashmica,
> 
> With the upstream Linux kernel 4.8.0-rc1-6-gbae9cc6 compiled with a Linux 4k
> page size we are not able to set hugepages. Aneesh had a look at the problem
> and mentioned that this commit is causing the issue.
> 
> *Details:*
> We are using pkvm ubuntu 16.04 guest with upstream kernel 
> [4.8.0-rc1-6-gbae9cc6] compiled with  4k page size
> 
> o/p from guest:
> HugePages_Total:   0
> HugePages_Free:0
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize:  16384 kB
> 
> Page sizes from device-tree: [dmesg]
> [0.00] base_shift=12: shift=12, sllp=0x, avpnm=0x, 
> tlbiel=1, penc=0
> [0.00] base_shift=12: shift=24, sllp=0x, avpnm=0x, 
> tlbiel=1, penc=56
> [0.00] base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, 
> tlbiel=0, penc=0
> 
> While trying to configure hugepages inside the guest, it throws the below
> error:
> 
> echo 100 > /proc/sys/vm/nr_hugepages
> -bash: echo: write error: Invalid argument
> 
> *Note*: we do not see the problem when the Linux page size is 64k


Just to reiterate: you are seeing this problem using a 4k page size and 16M
as the hugepage size.
With FORCE_MAX_ZONEORDER ranging over 9 to 13 for 4k pages, you can go up to
32M if FORCE_MAX_ZONEORDER is 13, and the same for 64k with
FORCE_MAX_ZONEORDER set to 9.

Basically the constraint is


FORCE_MAX_ZONEORDER <= 25 - PAGESHIFT
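(Worked out: with 4k pages PAGESHIFT = 12, so the maximum is
25 - 12 = 13; with 64k pages PAGESHIFT = 16, so the maximum is
25 - 16 = 9, matching the Kconfig ranges above.)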

What is your value of FORCE_MAX_ZONEORDER in the .config?

Balbir Singh.





[PATCH] MAINTAINERS: Update cxl maintainers

2016-09-16 Thread Michael Neuling
Fred has taken over the cxl maintenance I was doing.  This updates the
MAINTAINERS file to reflect this.

It also removes a duplicate entry in the files covered.

Signed-off-by: Michael Neuling 
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index a5e1270dfb..8d7a3d534b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3492,14 +3492,13 @@ F:  drivers/net/ethernet/chelsio/cxgb4vf/
 
 CXL (IBM Coherent Accelerator Processor Interface CAPI) DRIVER
 M: Ian Munsie 
-M: Michael Neuling 
+M: Frederic Barrat 
 L: linuxppc-dev@lists.ozlabs.org
 S: Supported
 F: drivers/misc/cxl/
 F: include/misc/cxl*
 F: include/uapi/misc/cxl.h
 F: Documentation/powerpc/cxl.txt
-F: Documentation/powerpc/cxl.txt
 F: Documentation/ABI/testing/sysfs-class-cxl
 
 CXLFLASH (IBM Coherent Accelerator Processor Interface CAPI Flash) SCSI DRIVER
-- 
2.7.4



Re: [PATCH] powerpc: do not use kprobe section to exempt exception handlers

2016-09-16 Thread Michael Ellerman
Nicholas Piggin  writes:

> Use the blacklist macros instead. This allows the linker to move
> exception handler functions close to callers and avoids trampolines in
> larger kernels.

Nice, that's been on my todo list for eva.

Can you do the asm ones too? See _KPROBE() in misc_32/64.S.
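(For reference, on the C side the blacklisting looks roughly like the
following sketch, using the generic NOKPROBE_SYMBOL() macro; the
handler name is made up:)

#include <linux/kprobes.h>

/* hypothetical handler, for illustration only */
static int my_exception_handler(struct pt_regs *regs)
{
	/* ... handler body ... */
	return 0;
}
/* blacklist from kprobes without forcing the function into .kprobes.text */
NOKPROBE_SYMBOL(my_exception_handler);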

cheers