Re: [RFC] fs: add userspace critical mounts event support

2016-09-02 Thread Dmitry Torokhov
On Fri, Sep 2, 2016 at 9:11 PM, Linus Torvalds
 wrote:
> On Fri, Sep 2, 2016 at 5:20 PM, Luis R. Rodriguez  wrote:
>>
>> Thoughts ?
>
> I really think this is a horrible hack.
>
> It's basically the kernel giving up, and relying on user space to give
> a single flag, and it's broken nasty crap.  Worse, it's broken nasty
> crap with a user interface, so we'll be stuck with it forever. Please
> no.

I agree that interface is bad, but I do believe we need something like this...

>
> What are the drivers that need this, and why can't those drivers just
> be fixed to not do braindead things?

Like what? Some devices do need to have firmware loaded so we know
their capabilities, so we really can't push the firmware loading into
"open". If your touch controller for some reason decided to crap over
it's nvram and comes in bootloader mode it is nice for it to slurp in
config data or firmware so use does not have to trigger update
manually. And while the controller is in bootloader mode we can't
create input device because we do not know what capabilities to
declare.

These devices we want to probe asynchronously and simply tell firmware
loader to wait for firmware to become available. The problem we do not
know when to give up, since we do not know where the firmware might
be. But userspace knows and can tel us.

Thanks.

-- 
Dmitry


Re: [RFC] fs: add userspace critical mounts event support

2016-09-02 Thread Linus Torvalds
On Fri, Sep 2, 2016 at 5:20 PM, Luis R. Rodriguez  wrote:
>
> Thoughts ?

I really think this is a horrible hack.

It's basically the kernel giving up, and relying on user space to give
a single flag, and it's broken nasty crap.  Worse, it's broken nasty
crap with a user interface, so we'll be stuck with it forever. Please
no.

What are the drivers that need this, and why can't those drivers just
be fixed to not do braindead things?

Linus


[RFC] fs: add userspace critical mounts event support

2016-09-02 Thread Luis R. Rodriguez
kernel_read_file_from_path() can try to read a file from
the system's filesystem. This is typically done for firmware
for instance, which lives in /lib/firmware. One issue with
this is that the kernel cannot know for sure when the real
final /lib/firmare/ is ready, and even if you use initramfs
drivers are currently initialized *first* prior to the initramfs
kicking off. During init we run through all init calls first
(do_initcalls()) and finally the initramfs is processed via
prepare_namespace():

do_basic_setup() {
   ...
   driver_init();
   ...
   do_initcalls();
   ...
}

kernel_init_freeable() {
   ...
   do_basic_setup();
   ...
   if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
  ramdisk_execute_command = NULL;
  prepare_namespace();
   }
}

This leaves a possible race between loading drivers and any uses
of kernel_read_file_from_path(). Because pivot_root() can be used,
this allows userspace further possibilities in terms of defining
when a kernel critical filesystem should be ready by.

We define kernel critical filesystems as filesystems which the
kernel needs for kernel_read_file_from_path(). Since only userspace
can know when kernel critical filesystems are mounted and ready,
let userspace notify the kernel of this, and enable a new kernel
configuration which lets the kernel wait for this event before
enabling reads from kernel_read_file_from_path().

A default timeout of 10s is used for now. You can override this
through the kernel-parameters using critical_mounts_timeout_ms=T
where T is in ms. cat /sys/kernel/critical_mounts_timeout_ms the
current system value.

When userspace is ready it can simply:

  echo 1 > /sys/kernel/critical_mounts_ready

Signed-off-by: Luis R. Rodriguez 
---

Note, this still leaves the puzzle of the fact that initramfs may carry
some firmware, and some drivers may be OK in using firmware from there,
the wait stuff would just get in the way. To address this I think can
perhaps instead check *one* for the file, and if its present immediately
give it back, we'd only resort to the wait in cases of failure.

Another thing -- other than firmware we have:

security/integrity/ima/ima_fs.c:rc = kernel_read_file_from_path(path, 
&data, &size, 0, READING_POLICY);
sound/oss/sound_firmware.h: err = kernel_read_file_from_path((char *)fn, 
(void **)fp, &size,

What paths are these? So we can document the current uses in the Kconfig
at least.

Thoughts ?

 Documentation/kernel-parameters.txt |  6 +++
 drivers/base/Kconfig| 48 +++
 fs/exec.c   |  3 ++
 include/linux/fs.h  |  8 
 kernel/ksysfs.c | 77 +
 5 files changed, 142 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 8ccacc44622a..1af89faa9fc9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -849,6 +849,12 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   critical_mounts_timeout_ms=T[KNL] timeout in ms
+   Format: 
+   Use this to override the kernel's default timeout for
+   waiting for critical system mount points to become
+   available.
+
cryptomgr.notests
 [KNL] Disable crypto self-tests
 
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 12b4f5551501..21576c0a4898 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -25,6 +25,54 @@ config UEVENT_HELPER_PATH
  via /proc/sys/kernel/hotplug or via /sys/kernel/uevent_helper
  later at runtime.
 
+config CRITICAL_MOUNTS_WAIT
+   bool "Enable waiting for critical-filesystems-ready notification"
+   default n
+   help
+ Kernel subsystems and device drivers often need to read files
+ from the filesystem, however in doing this races are possible at
+ bootup -- the subsystem requesting the file might look for it in /
+ early in boot, but if we haven't yet mounted the real root
+ filesystem we'll just tell the subsystem the file is not present and
+ it will fail. Furthermore what path to the filesystem is used varies
+ depending on the subsystem. To help the kernel we provide the option
+ to let the kernel wait for all critical filesystems to mounted and
+ ready before letting the kernel start trying to read files from the
+ systems' filesystem. Since pivot_root() can be used and therefore a
+ system might be configured to change its / filesystem at bootup as
+ many times as it wishes, only userspace can realy know exactly when
+ all critical filesystems are ready. Ena

Re: [PATCH v2 2/5] firmware: annotate thou shalt not request fw on init or probe

2016-09-02 Thread Luis R. Rodriguez
On Thu, Aug 25, 2016 at 09:41:33PM +0200, Luis R. Rodriguez wrote:
> On Thu, Aug 25, 2016 at 01:05:44PM +0200, Daniel Vetter wrote:
> > On Wed, Aug 24, 2016 at 10:39 PM, Luis R. Rodriguez  
> > wrote:
> > Some users want a fully running gfx stack 2s after power-on. There's
> > not even enough time to load an uefi or vga driver first. i915
> > directly initializes the gpu from power-on state on those.
> 
> I see.. thanks.
> 
> > >> I think what would work is loading the different subsystems of the
> > >> driver in parallel (we already do that largely)
> > >
> > > Init level stuff is actually pretty synchronous, and in fact both
> > > init and probe are called serially. There are a few subsystems which
> > > have been doing things a bit differently, but these are exceptions.
> > >
> > > When you say we already do this largely, can you describe a bit more
> > > precisely what you mean ?
> > 
> > Oh, this isn't subsystems as in linux device/driver model, but
> > different parts within the driver. We fire up a bunch of struct work
> > to get various bits done asynchronously.
> 
> Thanks for the clarification.

<-- snip -->

> > One of the proposals which would have worked for us died a while back:
> > 
> > https://lkml.org/lkml/2014/6/15/47
> 
> Interesting, the problem stated here, which comes from the fact (as clarified
> above) the rootfs comes up only *after* we run do_initcalls() as such there 
> are
> possible theoretical races even with request_firmware_nowait() -- as such that
> justifies even more the SmPL grammar rule to include even 
> request_firmware_nowait()
> on probe / init as this patch has.
> 
> Anyway yeah that patch has good intentions but its wrong for a slew of 
> reasons.
> The problem is the same as discussed with Bjorn recently. The solution
> discussed is that since only userspace knows when the *real* rootfs is ready
> (because of slew of reasons) the best we can do is have an event issued by
> userspace to inform us of that.

OK I have something I think we can build upon for this. Will send RFC.

  Luis


Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models

2016-09-02 Thread Benjamin Herrenschmidt
On Fri, 2016-09-02 at 13:26 -0700, Linus Torvalds wrote:
> On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann  wrote:
> > 
> > 
> > When I once looked, I thought all drivers using NO_IRQ were
> > specific
> > to powerpc or one of the less common architectures.

We deprecated NO_IRQ ages ago, it's 0 like it should be, but yes we may
have forgotten to "cleanup" the old users.
 
> powerpc definitely does seem to be the biggest case, with about half
> the instances of NO_IRQ being under arch/powerpc/ (and a few more in
> ppc-specific drivers).
> 
> Adding the powerpc maintainers to the list - because it would really
> be nice to get rid of it, or at least make it *so* rare that we don't
> have people re-introducing it again because they thought it was the
> right thing to do.

Right. Originally it was -1 for us which causes the whole problem. I
changed it to be 0 after doing the whole irq domain remapping thing.
That was a lng time ago.

> A fair amount of of it could even be done by some trivial scripting.
> Something like
> 
>   git grep -wl NO_IRQ arch/powerpc/ | while read a
>   do
>   sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new
>   sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a
>   done
> 
> does fix at least a few of the cases. It still leaves several
> assignments and "return NO_IRQ;" statements, but a few more
> sed-scripts would take care of most of it. Then remove the #define,
> and do a full build to find any straggling cases.
> 
> Michael? Ben?

It can just be replaced with "0" in all powerpc related cases yes.

Cheers,
Ben.



Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models

2016-09-02 Thread Linus Torvalds
On Fri, Sep 2, 2016 at 1:43 PM, Julia Lawall  wrote:
>
> Like this?

Sure.

> Is it always correct to replace return NO_IRQ by return 0?

On platforms that define it to 0, like powerpc, yes.

On arm, c6x, mn10300, parisc and sparc it would need more work.

> Completely untested patch below.

Looks fine, but the ppc people should test it (although it really
should generate the exact same code, unless there was some odd
conversion case).

 Linus


Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models

2016-09-02 Thread Julia Lawall


On Fri, 2 Sep 2016, Linus Torvalds wrote:

> On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann  wrote:
> >
> > When I once looked, I thought all drivers using NO_IRQ were specific
> > to powerpc or one of the less common architectures.
>
> powerpc definitely does seem to be the biggest case, with about half
> the instances of NO_IRQ being under arch/powerpc/ (and a few more in
> ppc-specific drivers).
>
> Adding the powerpc maintainers to the list - because it would really
> be nice to get rid of it, or at least make it *so* rare that we don't
> have people re-introducing it again because they thought it was the
> right thing to do.
>
> A fair amount of of it could even be done by some trivial scripting.
> Something like
>
>   git grep -wl NO_IRQ arch/powerpc/ | while read a
>   do
>   sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new
>   sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a
>   done
>
> does fix at least a few of the cases. It still leaves several
> assignments and "return NO_IRQ;" statements, but a few more
> sed-scripts would take care of most of it. Then remove the #define,
> and do a full build to find any straggling cases.

Like this?

@@
expression e;
@@

e
- != NO_IRQ

@@
expression e;
@@

+!
e
- == NO_IRQ

@@
@@

- NO_IRQ
+ 0

---

Is it always correct to replace return NO_IRQ by return 0?

Completely untested patch below.

julia

diff -u -p a/arch/powerpc/platforms/52xx/mpc52xx_pic.c 
b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
--- a/arch/powerpc/platforms/52xx/mpc52xx_pic.c
+++ b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
@@ -511,7 +511,7 @@ unsigned int mpc52xx_get_irq(void)
irq |= (MPC52xx_IRQ_L1_PERP << MPC52xx_IRQ_L1_OFFSET);
}
} else {
-   return NO_IRQ;
+   return 0;
}

return irq_linear_revmap(mpc52xx_irqhost, irq);
diff -u -p a/arch/powerpc/platforms/chrp/setup.c 
b/arch/powerpc/platforms/chrp/setup.c
--- a/arch/powerpc/platforms/chrp/setup.c
+++ b/arch/powerpc/platforms/chrp/setup.c
@@ -368,7 +368,7 @@ static void chrp_8259_cascade(struct irq
struct irq_chip *chip = irq_desc_get_chip(desc);
unsigned int cascade_irq = i8259_irq();

-   if (cascade_irq != NO_IRQ)
+   if (cascade_irq)
generic_handle_irq(cascade_irq);

chip->irq_eoi(&desc->irq_data);
@@ -514,7 +514,7 @@ static void __init chrp_find_8259(void)
}
if (chrp_mpic != NULL) {
cascade_irq = irq_of_parse_and_map(pic, 0);
-   if (cascade_irq == NO_IRQ)
+   if (!cascade_irq)
printk(KERN_ERR "i8259: failed to map cascade irq\n");
else
irq_set_chained_handler(cascade_irq,
diff -u -p a/arch/powerpc/platforms/maple/pci.c 
b/arch/powerpc/platforms/maple/pci.c
--- a/arch/powerpc/platforms/maple/pci.c
+++ b/arch/powerpc/platforms/maple/pci.c
@@ -552,7 +552,7 @@ void maple_pci_irq_fixup(struct pci_dev
pci_bus_to_host(dev->bus) == u4_pcie) {
printk(KERN_DEBUG "Fixup U4 PCIe IRQ\n");
dev->irq = irq_create_mapping(NULL, 1);
-   if (dev->irq != NO_IRQ)
+   if (dev->irq)
irq_set_irq_type(dev->irq, IRQ_TYPE_LEVEL_LOW);
}

@@ -562,7 +562,7 @@ void maple_pci_irq_fixup(struct pci_dev
if (dev->vendor == PCI_VENDOR_ID_AMD &&
dev->device == PCI_DEVICE_ID_AMD_8111_IDE &&
(dev->class & 5) != 5) {
-   dev->irq = NO_IRQ;
+   dev->irq = 0;
}

DBG(" <- maple_pci_irq_fixup\n");
@@ -648,7 +648,7 @@ int maple_pci_get_legacy_ide_irq(struct
return defirq;
}
irq = irq_of_parse_and_map(np, channel & 0x1);
-   if (irq == NO_IRQ) {
+   if (!irq) {
printk("Failed to map onboard IDE interrupt for channel %d\n",
   channel);
return defirq;
diff -u -p a/arch/powerpc/platforms/embedded6xx/flipper-pic.c 
b/arch/powerpc/platforms/embedded6xx/flipper-pic.c
--- a/arch/powerpc/platforms/embedded6xx/flipper-pic.c
+++ b/arch/powerpc/platforms/embedded6xx/flipper-pic.c
@@ -181,7 +181,7 @@ unsigned int flipper_pic_get_irq(void)
irq_status = in_be32(io_base + FLIPPER_ICR) &
 in_be32(io_base + FLIPPER_IMR);
if (irq_status == 0)
-   return NO_IRQ;  /* no more IRQs pending */
+   return 0;   /* no more IRQs pending */

irq = __ffs(irq_status);
return irq_linear_revmap(flipper_irq_host, irq);
diff -u -p a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c 
b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
--- a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
+++ b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c
@@ -114,7 +114,7 @@ static unsigned int __hlwd_pic_get_irq(s
irq_status = in_be32(io_base + HW_BROADWAY_ICR) &
 in_be32(io_base + HW_BROADWAY_IMR

[PATCH 6/6] cxlflash: Fix context reference tracking on detach

2016-09-02 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Commit 888baf069f49 ("scsi: cxlflash: Add kref to context") introduced a
kref to the context. In particular, the detach routine was updated to use
the kref services for managing the removal and destruction of a context.

As part of this change, the tracking mechanism internal to the detach
handler was refactored. This introduced a bug that can cause the tracking
state to be lost. This can lead to a situation where exclusive access to a
context is prematurely [and unknowingly] relinquished for the executing
thread.

To remedy, only update the tracking state when the kref operation indicates
the context was removed.

Fixes: 888baf069f49 ("scsi: cxlflash: Add kref to context")
Signed-off-by: Matthew R. Ochs 
---
 drivers/scsi/cxlflash/superpipe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/cxlflash/superpipe.c 
b/drivers/scsi/cxlflash/superpipe.c
index c91fe6f..9636970 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -912,7 +912,8 @@ static int _cxlflash_disk_detach(struct scsi_device *sdev,
 * Release the context reference and the sdev reference that
 * bound this LUN to the context.
 */
-   put_ctx = !kref_put(&ctxi->kref, remove_context);
+   if (kref_put(&ctxi->kref, remove_context))
+   put_ctx = false;
scsi_device_put(sdev);
 out:
if (put_ctx)
-- 
2.1.0



[PATCH 5/6] cxlflash: Refactor WWPN setup

2016-09-02 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Commit 964497b3bf3f ("cxlflash: Remove dual port online dependency")
logically removed the ability for the WWPN setup routine afu_set_wwpn()
to return a non-success value. This routine can safely be made a void
to simplify the code as there is no longer a need to report a failure.

Signed-off-by: Matthew R. Ochs 
---
 drivers/scsi/cxlflash/main.c | 40 +---
 1 file changed, 9 insertions(+), 31 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 42970a4..b301655 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1093,42 +1093,25 @@ static int wait_port_offline(__be64 __iomem *fc_regs, 
u32 delay_us, u32 nretry)
  * online. This toggling action can cause this routine to delay up to a few
  * seconds. When configured to use the internal LUN feature of the AFU, a
  * failure to come online is overridden.
- *
- * Return:
- * 0 when the WWPN is successfully written and the port comes back online
- * -1 when the port fails to go offline or come back up online
  */
-static int afu_set_wwpn(struct afu *afu, int port, __be64 __iomem *fc_regs,
-   u64 wwpn)
+static void afu_set_wwpn(struct afu *afu, int port, __be64 __iomem *fc_regs,
+u64 wwpn)
 {
-   int rc = 0;
-
set_port_offline(fc_regs);
-
if (!wait_port_offline(fc_regs, FC_PORT_STATUS_RETRY_INTERVAL_US,
   FC_PORT_STATUS_RETRY_CNT)) {
pr_debug("%s: wait on port %d to go offline timed out\n",
 __func__, port);
-   rc = -1; /* but continue on to leave the port back online */
}
 
-   if (rc == 0)
-   writeq_be(wwpn, &fc_regs[FC_PNAME / 8]);
-
-   /* Always return success after programming WWPN */
-   rc = 0;
+   writeq_be(wwpn, &fc_regs[FC_PNAME / 8]);
 
set_port_online(fc_regs);
-
if (!wait_port_online(fc_regs, FC_PORT_STATUS_RETRY_INTERVAL_US,
  FC_PORT_STATUS_RETRY_CNT)) {
-   pr_err("%s: wait on port %d to go online timed out\n",
-  __func__, port);
+   pr_debug("%s: wait on port %d to go online timed out\n",
+__func__, port);
}
-
-   pr_debug("%s: returning rc=%d\n", __func__, rc);
-
-   return rc;
 }
 
 /**
@@ -1629,15 +1612,10 @@ static int init_global(struct cxlflash_cfg *cfg)
  [FC_CRC_THRESH / 8]);
 
/* Set WWPNs. If already programmed, wwpn[i] is 0 */
-   if (wwpn[i] != 0 &&
-   afu_set_wwpn(afu, i,
-&afu->afu_map->global.fc_regs[i][0],
-wwpn[i])) {
-   dev_err(dev, "%s: failed to set WWPN on port %d\n",
-  __func__, i);
-   rc = -EIO;
-   goto out;
-   }
+   if (wwpn[i] != 0)
+   afu_set_wwpn(afu, i,
+&afu->afu_map->global.fc_regs[i][0],
+wwpn[i]);
/* Programming WWPN back to back causes additional
 * offline/online transitions and a PLOGI
 */
-- 
2.1.0



[PATCH 4/6] cxlflash: Improve EEH recovery time

2016-09-02 Thread Uma Krishnan
From: "Matthew R. Ochs" 

When an EEH occurs during device initialization, the port timeout
logic can cause excessive delays as MMIO reads will fail. Depending
on where they are experienced, these delays can lead to a prolonged
reset, causing an unnecessary triggering of other timeout logic in
the SCSI stack or user applications.

To expedite recovery, the port timeout logic is updated to decay the
timeout at a much faster rate when in the presence of a likely EEH
frozen event.

Signed-off-by: Matthew R. Ochs 
---
 drivers/scsi/cxlflash/main.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 4ef5235..42970a4 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1040,6 +1040,8 @@ static int wait_port_online(__be64 __iomem *fc_regs, u32 
delay_us, u32 nretry)
do {
msleep(delay_us / 1000);
status = readq_be(&fc_regs[FC_MTIP_STATUS / 8]);
+   if (status == U64_MAX)
+   nretry /= 2;
} while ((status & FC_MTIP_STATUS_MASK) != FC_MTIP_STATUS_ONLINE &&
 nretry--);
 
@@ -1071,6 +1073,8 @@ static int wait_port_offline(__be64 __iomem *fc_regs, u32 
delay_us, u32 nretry)
do {
msleep(delay_us / 1000);
status = readq_be(&fc_regs[FC_MTIP_STATUS / 8]);
+   if (status == U64_MAX)
+   nretry /= 2;
} while ((status & FC_MTIP_STATUS_MASK) != FC_MTIP_STATUS_OFFLINE &&
 nretry--);
 
-- 
2.1.0



[PATCH 3/6] cxlflash: Fix to avoid EEH and host reset collisions

2016-09-02 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The EEH reset handler is ignorant to the current state of the
driver when processing a frozen event and initiating a device
reset. This can be an issue if an EEH event occurs while a user
or stack initiated reset is executing. More specifically, if an
EEH occurs while the SCSI host reset handler is active, the reset
initiated by the EEH thread will likely collide with the host reset
thread. This can leave the device in an inconsistent state, or worse,
cause a system crash.

As a remedy, the EEH handler is updated to evaluate the device state
and take appropriate action (proceed, wait, or disconnect host). The
host reset handler is also updated to handle situations where an EEH
occurred during a host reset. In such situations, the host reset handler
will delay reporting back a success to give the EEH reset an opportunity
to complete.

Signed-off-by: Matthew R. Ochs 
---
 drivers/scsi/cxlflash/main.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 4c2559a..4ef5235 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2042,6 +2042,11 @@ retry:
  * cxlflash_eh_host_reset_handler() - reset the host adapter
  * @scp:   SCSI command from stack identifying host.
  *
+ * Following a reset, the state is evaluated again in case an EEH occurred
+ * during the reset. In such a scenario, the host reset will either yield
+ * until the EEH recovery is complete or return success or failure based
+ * upon the current device state.
+ *
  * Return:
  * SUCCESS as defined in scsi/scsi.h
  * FAILED as defined in scsi/scsi.h
@@ -2074,7 +2079,8 @@ static int cxlflash_eh_host_reset_handler(struct 
scsi_cmnd *scp)
} else
cfg->state = STATE_NORMAL;
wake_up_all(&cfg->reset_waitq);
-   break;
+   ssleep(1);
+   /* fall through */
case STATE_RESET:
wait_event(cfg->reset_waitq, cfg->state != STATE_RESET);
if (cfg->state == STATE_NORMAL)
@@ -2590,6 +2596,9 @@ out_remove:
  * @pdev:  PCI device struct.
  * @state: PCI channel state.
  *
+ * When an EEH occurs during an active reset, wait until the reset is
+ * complete and then take action based upon the device state.
+ *
  * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT
  */
 static pci_ers_result_t cxlflash_pci_error_detected(struct pci_dev *pdev,
@@ -2603,6 +2612,10 @@ static pci_ers_result_t 
cxlflash_pci_error_detected(struct pci_dev *pdev,
 
switch (state) {
case pci_channel_io_frozen:
+   wait_event(cfg->reset_waitq, cfg->state != STATE_RESET);
+   if (cfg->state == STATE_FAILTERM)
+   return PCI_ERS_RESULT_DISCONNECT;
+
cfg->state = STATE_RESET;
scsi_block_requests(cfg->host);
drain_ioctls(cfg);
-- 
2.1.0



[PATCH 2/6] cxlflash: Remove the device cleanly in the system shutdown path

2016-09-02 Thread Uma Krishnan
Commit 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash
cards") was recently introduced to notify the AFU when a system is going
down. Due to the position of the cxlflash driver in the device stack,
cxlflash devices are _always_ removed during a reboot/shutdown. This can
lead to a crash if the cxlflash shutdown hook is invoked _after_ the
shutdown hook for the owning virtual PHB. Furthermore, the current
implementation of shutdown/remove hooks for cxlflash are not tolerant to
being invoked when the device is not enabled. This can also lead to a
crash in situations where the remove hook is invoked after the device has
been removed via the vPHBs shutdown hook. An example of this scenario
would be an EEH reset failure while a reboot/shutdown is in progress.

To solve both problems, the shutdown hook for cxlflash is updated to
simply remove the device. This path already includes the AFU notification
and thus this solution will continue to perform the original intent. At
the same time, the remove hook is updated to protect against being
called when the device is not enabled.

Fixes: 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash
cards")
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index b063c41..4c2559a 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -823,17 +823,6 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool 
wait)
 }
 
 /**
- * cxlflash_shutdown() - shutdown handler
- * @pdev:  PCI device associated with the host.
- */
-static void cxlflash_shutdown(struct pci_dev *pdev)
-{
-   struct cxlflash_cfg *cfg = pci_get_drvdata(pdev);
-
-   notify_shutdown(cfg, false);
-}
-
-/**
  * cxlflash_remove() - PCI entry point to tear down host
  * @pdev:  PCI device associated with the host.
  *
@@ -844,6 +833,11 @@ static void cxlflash_remove(struct pci_dev *pdev)
struct cxlflash_cfg *cfg = pci_get_drvdata(pdev);
ulong lock_flags;
 
+   if (!pci_is_enabled(pdev)) {
+   pr_debug("%s: Device is disabled\n", __func__);
+   return;
+   }
+
/* If a Task Management Function is active, wait for it to complete
 * before continuing with remove.
 */
@@ -2685,7 +2679,7 @@ static struct pci_driver cxlflash_driver = {
.id_table = cxlflash_pci_table,
.probe = cxlflash_probe,
.remove = cxlflash_remove,
-   .shutdown = cxlflash_shutdown,
+   .shutdown = cxlflash_remove,
.err_handler = &cxlflash_err_handler,
 };
 
-- 
2.1.0



[PATCH 1/6] cxlflash: Scan host only after the port is ready for I/O

2016-09-02 Thread Uma Krishnan
When a port link is established, the AFU sends a 'link up' interrupt.
After the link is up, corresponding initialization steps are performed
on the card. Following that, when the card is ready for I/O, the AFU
sends 'login succeeded' interrupt. Today, cxlflash invokes
scsi_scan_host() upon receipt of both interrupts.

SCSI commands sent to the port prior to the 'login succeeded' interrupt
will fail with 'port not available' error. This is not desirable.
Moreover, when async_scan is active for the host, subsequent scan calls
are terminated with error. Due to this, the scsi_scan_host() call
performed after 'login succeeded' interrupt could portentially return
error and the devices may not be scanned properly.

To avoid this problem, scsi_scan_host() should be called only after the
'login succeeded' interrupt.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 661bb94..b063c41 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1187,7 +1187,7 @@ static const struct asyc_intr_info ainfo[] = {
{SISL_ASTATUS_FC0_LOGI_F, "login failed", 0, CLR_FC_ERROR},
{SISL_ASTATUS_FC0_LOGI_S, "login succeeded", 0, SCAN_HOST},
{SISL_ASTATUS_FC0_LINK_DN, "link down", 0, 0},
-   {SISL_ASTATUS_FC0_LINK_UP, "link up", 0, SCAN_HOST},
+   {SISL_ASTATUS_FC0_LINK_UP, "link up", 0, 0},
{SISL_ASTATUS_FC1_OTHER, "other error", 1, CLR_FC_ERROR | LINK_RESET},
{SISL_ASTATUS_FC1_LOGO, "target initiated LOGO", 1, 0},
{SISL_ASTATUS_FC1_CRC_T, "CRC threshold exceeded", 1, LINK_RESET},
@@ -1195,7 +1195,7 @@ static const struct asyc_intr_info ainfo[] = {
{SISL_ASTATUS_FC1_LOGI_F, "login failed", 1, CLR_FC_ERROR},
{SISL_ASTATUS_FC1_LOGI_S, "login succeeded", 1, SCAN_HOST},
{SISL_ASTATUS_FC1_LINK_DN, "link down", 1, 0},
-   {SISL_ASTATUS_FC1_LINK_UP, "link up", 1, SCAN_HOST},
+   {SISL_ASTATUS_FC1_LINK_UP, "link up", 1, 0},
{0x0, "", 0, 0} /* terminator */
 };
 
-- 
2.1.0



[PATCH 0/6] cxlflash: Miscellaneous fixes

2016-09-02 Thread Uma Krishnan
This patch set contains miscellaneous fixes in the device cleanup and
EEH recovery paths. One of the patches simplifies the WWPN assignment
routine.

There are a couple of fixes that resolve regressions introduced by
Commit 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash
cards") and Commit 888baf069f49 ("scsi: cxlflash: Add kref to context")

This series is based upon mkp's 4.9/scsi-queue and is bisectable.

Matthew R. Ochs (4):
  cxlflash: Fix to avoid EEH and host reset collisions
  cxlflash: Improve EEH recovery time
  cxlflash: Refactor WWPN setup
  cxlflash: Fix context reference tracking on detach

Uma Krishnan (2):
  cxlflash: Scan host only after the port is ready for I/O
  cxlflash: Remove the device cleanly in the system shutdown path

 drivers/scsi/cxlflash/main.c  | 81 +--
 drivers/scsi/cxlflash/superpipe.c |  3 +-
 2 files changed, 37 insertions(+), 47 deletions(-)

-- 
2.1.0



Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models

2016-09-02 Thread Linus Torvalds
On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann  wrote:
>
> When I once looked, I thought all drivers using NO_IRQ were specific
> to powerpc or one of the less common architectures.

powerpc definitely does seem to be the biggest case, with about half
the instances of NO_IRQ being under arch/powerpc/ (and a few more in
ppc-specific drivers).

Adding the powerpc maintainers to the list - because it would really
be nice to get rid of it, or at least make it *so* rare that we don't
have people re-introducing it again because they thought it was the
right thing to do.

A fair amount of of it could even be done by some trivial scripting.
Something like

  git grep -wl NO_IRQ arch/powerpc/ | while read a
  do
  sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new
  sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a
  done

does fix at least a few of the cases. It still leaves several
assignments and "return NO_IRQ;" statements, but a few more
sed-scripts would take care of most of it. Then remove the #define,
and do a full build to find any straggling cases.

Michael? Ben?

 Linus


Re: [PATCH v3 2/5] firmware: annotate thou shalt not request fw on init or probe

2016-09-02 Thread Luis R. Rodriguez
On Wed, Aug 24, 2016 at 10:17:38AM +0200, Gabriel Paubert wrote:
> On Tue, Aug 23, 2016 at 05:45:04PM -0700, mcg...@kernel.org wrote:
> 
> [snip]
> > ---
> >  Documentation/firmware_class/README|  20 
> >  drivers/base/Kconfig   |   2 +-
> >  .../request_firmware-avoid-init-probe-init.cocci   | 130 
> > +
> >  3 files changed, 151 insertions(+), 1 deletion(-)
> >  create mode 100644 
> > scripts/coccinelle/api/request_firmware-avoid-init-probe-init.cocci
> > 
> > diff --git a/Documentation/firmware_class/README 
> > b/Documentation/firmware_class/README
> > index cafdca8b3b15..056d1cb9d365 100644
> > --- a/Documentation/firmware_class/README
> > +++ b/Documentation/firmware_class/README
> > @@ -93,6 +93,26 @@
> > user contexts to request firmware asynchronously, but can't be called
> > in atomic contexts.
> >  
> > +Requirements:
> > +=
> > +
> > +You should avoid at all costs requesting firmware on both init and probe 
> > paths
> > +of your device driver. Reason for this is the complexity needed to ensure a
> > +firmware will be available for a driver early in boot through different
> > +build configurations. Consider built-in drivers needing firmware early, or
> > +consider a driver assuming it will only get firmware after pivot_root().
> > +
> > +Drivers that really need firmware early should use stuff the firmware in
> 
> Minor grammatical nit: s/use//

Fixed, thanks.

  Luis


Re: [PATCH] ASoC: fsl_esai: fix spelling mistake "Transmition" -> "Transmission"

2016-09-02 Thread Nicolin Chen
On Fri, Sep 02, 2016 at 03:07:23PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistakes in dev_dbg messages
> 
> Signed-off-by: Colin Ian King 

Acked-by: Nicolin Chen 

Thanks

> ---
>  sound/soc/fsl/fsl_esai.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
> index 26a90e1..e927955 100644
> --- a/sound/soc/fsl/fsl_esai.c
> +++ b/sound/soc/fsl/fsl_esai.c
> @@ -77,19 +77,19 @@ static irqreturn_t esai_isr(int irq, void *devid)
>   regmap_read(esai_priv->regmap, REG_ESAI_ESR, &esr);
>  
>   if (esr & ESAI_ESR_TINIT_MASK)
> - dev_dbg(&pdev->dev, "isr: Transmition Initialized\n");
> + dev_dbg(&pdev->dev, "isr: Transmission Initialized\n");
>  
>   if (esr & ESAI_ESR_RFF_MASK)
>   dev_warn(&pdev->dev, "isr: Receiving overrun\n");
>  
>   if (esr & ESAI_ESR_TFE_MASK)
> - dev_warn(&pdev->dev, "isr: Transmition underrun\n");
> + dev_warn(&pdev->dev, "isr: Transmission underrun\n");
>  
>   if (esr & ESAI_ESR_TLS_MASK)
>   dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n");
>  
>   if (esr & ESAI_ESR_TDE_MASK)
> - dev_dbg(&pdev->dev, "isr: Transmition data exception\n");
> + dev_dbg(&pdev->dev, "isr: Transmission data exception\n");
>  
>   if (esr & ESAI_ESR_TED_MASK)
>   dev_dbg(&pdev->dev, "isr: Transmitting even slots\n");
> -- 
> 2.9.3
> 


[PATCH] ASoC: fsl_esai: fix spelling mistake "Transmition" -> "Transmission"

2016-09-02 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistakes in dev_dbg messages

Signed-off-by: Colin Ian King 
---
 sound/soc/fsl/fsl_esai.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
index 26a90e1..e927955 100644
--- a/sound/soc/fsl/fsl_esai.c
+++ b/sound/soc/fsl/fsl_esai.c
@@ -77,19 +77,19 @@ static irqreturn_t esai_isr(int irq, void *devid)
regmap_read(esai_priv->regmap, REG_ESAI_ESR, &esr);
 
if (esr & ESAI_ESR_TINIT_MASK)
-   dev_dbg(&pdev->dev, "isr: Transmition Initialized\n");
+   dev_dbg(&pdev->dev, "isr: Transmission Initialized\n");
 
if (esr & ESAI_ESR_RFF_MASK)
dev_warn(&pdev->dev, "isr: Receiving overrun\n");
 
if (esr & ESAI_ESR_TFE_MASK)
-   dev_warn(&pdev->dev, "isr: Transmition underrun\n");
+   dev_warn(&pdev->dev, "isr: Transmission underrun\n");
 
if (esr & ESAI_ESR_TLS_MASK)
dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n");
 
if (esr & ESAI_ESR_TDE_MASK)
-   dev_dbg(&pdev->dev, "isr: Transmition data exception\n");
+   dev_dbg(&pdev->dev, "isr: Transmission data exception\n");
 
if (esr & ESAI_ESR_TED_MASK)
dev_dbg(&pdev->dev, "isr: Transmitting even slots\n");
-- 
2.9.3



Re: [PATHC v2 5/9] ima: on soft reboot, save the measurement list

2016-09-02 Thread Mimi Zohar
Hi Dave,

On Thu, 2016-09-01 at 09:57 +0800, Dave Young wrote:
> On 08/30/16 at 06:40pm, Mimi Zohar wrote:

> > + * Called during kexec_file_load so that IMA can add a segment to the kexec
> > + * image for the measurement list for the next kernel.
> > + */
> > +void ima_add_kexec_buffer(struct kimage *image)
> > +{
> > +   static int registered = 0;
> > +   struct kexec_buf kbuf = { .image = image, .buf_align = PAGE_SIZE,
> > + .buf_min = 0, .buf_max = ULONG_MAX,
> > + .top_down = true, .skip_checksum = true };
> > +   int ret;
> > +
> > +   if (!kexec_can_hand_over_buffer())
> > +   return;
> > +
> > +   kexec_segment_size = ALIGN(ima_get_binary_runtime_size() + PAGE_SIZE,
> > +  PAGE_SIZE);
> > +
> > +   if (kexec_segment_size >= (ULONG_MAX - sizeof(long))) {
> > +   pr_err("Binary measurement list too large.\n");
> > +   return;
> > +   }
> 
> Now we added a limitation that total segment size can not be larger than
> half of totalram. see kernel/kexec_core.c sanity_check_segment_list()
> 
> So can it fail early here if kexec_segment_size is over half of total
> ram?

Sure, I'll include this change in the next post.

Mimi



Re: [PATCH 1/3] powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET

2016-09-02 Thread Aneesh Kumar K.V

Hi Paul,

Really nice catch. Was this found by code analysis or do we have any
reported issue around this ?

Paul Mackerras  writes:

> In commit c60ac5693c47 ("powerpc: Update kernel VSID range", 2013-03-13)
> we lost a check on the region number (the top four bits of the effective
> address) for addresses below PAGE_OFFSET.  That commit replaced a check
> that the top 18 bits were all zero with a check that bits 46 - 59 were
> zero (performed for all addresses, not just user addresses).

To make review easy for others, here is the relevant diff from that commit.

 _GLOBAL(slb_allocate_realmode)
-   /* r3 = faulting address */
+   /*
+* check for bad kernel/user address
+* (ea & ~REGION_MASK) >= PGTABLE_RANGE
+*/
+   rldicr. r9,r3,4,(63 - 46 - 4)
+   bne-8f
 
srdir9,r3,60/* get region */

..
And because we were doing the above check, I removed
.

 BEGIN_FTR_SECTION
b   slb_finish_load
 END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT)
b   slb_finish_load_1T
 
-0: /* user address: proto-VSID = context << 15 | ESID. First check
-* if the address is within the boundaries of the user region
-*/
-   srdi.   r9,r10,USER_ESID_BITS
-   bne-8f  /* invalid ea bits set */
-
-
+0:


>
> This means that userspace can access an address like 0x1000_0xxx__
> and we will insert a valid SLB entry for it.  The VSID used will be the
> same as if the top 4 bits were 0, but the page size will be some random
> value obtained by indexing beyond the end of the mm_ctx_high_slices_psize
> array in the paca.  If that page size is the same as would be used for
> region 0, then userspace just has an alias of the region 0 space.  If the
> page size is different, then no HPTE will be found for the access, and
> the process will get a SIGSEGV (since hash_page_mm() will refuse to create
> a HPTE for the bogus address).
>
> The access beyond the end of the mm_ctx_high_slices_psize can be at most
> 5.5MB past the array, and so will be in RAM somewhere.  Since the access
> is a load performed in real mode, it won't fault or crash the kernel.
> At most this bug could perhaps leak a little bit of information about
> blocks of 32 bytes of memory located at offsets of i * 512kB past the
> paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11.


Reviewed-by: Aneesh Kumar K.V 

>
> Cc: sta...@vger.kernel.org # v3.10+
> Cc: Aneesh Kumar K.V 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/mm/slb_low.S | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
> index dfdb90c..9f19834 100644
> --- a/arch/powerpc/mm/slb_low.S
> +++ b/arch/powerpc/mm/slb_low.S
> @@ -113,7 +113,12 @@ BEGIN_FTR_SECTION
>  END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT)
>   b   slb_finish_load_1T
>
> -0:
> +0:   /*
> +  * For userspace addresses, make sure this is region 0.
> +  */
> + cmpdi   r9, 0
> + bne 8f
> +
>   /* when using slices, we extract the psize off the slice bitmaps
>* and then we need to get the sllp encoding off the mmu_psize_defs
>* array.
> -- 
> 2.7.4



[PATCH 2/3] powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address

2016-09-02 Thread Paul Mackerras
Currently, if userspace or the kernel accesses a completely bogus address,
for example with any of bits 46-59 set, we first take an SLB miss interrupt,
install a corresponding SLB entry with VSID 0, retry the instruction, then
take a DSI/ISI interrupt because there is no HPT entry mapping the address.
However, by the time of the second interrupt, the Come-From Address Register
(CFAR) has been overwritten by the rfid instruction at the end of the SLB
miss interrupt handler.  Since bogus accesses can often be caused by a
function return after the stack has been overwritten, the CFAR value would
be very useful as it could indicate which function it was whose return had
led to the bogus address.

This patch adds code to create a full exception frame in the SLB miss handler
in the case of a bogus address, rather than inserting an SLB entry with a
zero VSID field.  Then we call a new slb_miss_bad_addr() function in C code,
which delivers a signal for a user access or creates an oops for a kernel
access.  In the latter case the oops message will show the CFAR value at the
time of the access.

In the case of the radix MMU, a segment miss interrupt indicates an access
outside the ranges mapped by the page tables.  Previously this was handled
by the code for an unrecoverable SLB miss (one with MSR[RI] = 0), which is
not really correct.  With this patch, we now handle these interrupts with
slb_miss_bad_addr(), which is much more consistent.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kernel/exceptions-64s.S | 40 ++--
 arch/powerpc/kernel/traps.c  | 11 ++
 arch/powerpc/mm/slb_low.S|  8 +++-
 3 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index df6d45e..a2526b0 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -175,6 +175,7 @@ data_access_slb_pSeries:
std r3,PACA_EXSLB+EX_R3(r13)
mfspr   r3,SPRN_DAR
mfspr   r12,SPRN_SRR1
+   crset   4*cr6+eq
 #ifndef CONFIG_RELOCATABLE
b   slb_miss_realmode
 #else
@@ -201,6 +202,7 @@ instruction_access_slb_pSeries:
std r3,PACA_EXSLB+EX_R3(r13)
mfspr   r3,SPRN_SRR0/* SRR0 is faulting address */
mfspr   r12,SPRN_SRR1
+   crclr   4*cr6+eq
 #ifndef CONFIG_RELOCATABLE
b   slb_miss_realmode
 #else
@@ -767,6 +769,7 @@ data_access_slb_relon_pSeries:
std r3,PACA_EXSLB+EX_R3(r13)
mfspr   r3,SPRN_DAR
mfspr   r12,SPRN_SRR1
+   crset   4*cr6+eq
 #ifndef CONFIG_RELOCATABLE
b   slb_miss_realmode
 #else
@@ -792,6 +795,7 @@ instruction_access_slb_relon_pSeries:
std r3,PACA_EXSLB+EX_R3(r13)
mfspr   r3,SPRN_SRR0/* SRR0 is faulting address */
mfspr   r12,SPRN_SRR1
+   crclr   4*cr6+eq
 #ifndef CONFIG_RELOCATABLE
b   slb_miss_realmode
 #else
@@ -1389,6 +1393,7 @@ unrecover_mce:
  * r3 has the faulting address
  * r9 - r13 are saved in paca->exslb.
  * r3 is saved in paca->slb_r3
+ * cr6.eq is set for a D-SLB miss, clear for a I-SLB miss
  * We assume we aren't going to take any exceptions during this procedure.
  */
 slb_miss_realmode:
@@ -1399,29 +1404,31 @@ slb_miss_realmode:
 
stw r9,PACA_EXSLB+EX_CCR(r13)   /* save CR in exc. frame */
std r10,PACA_EXSLB+EX_LR(r13)   /* save LR */
+   std r3,PACA_EXSLB+EX_DAR(r13)
 
+   crset   4*cr0+eq
 #ifdef CONFIG_PPC_STD_MMU_64
 BEGIN_MMU_FTR_SECTION
bl  slb_allocate_realmode
 END_MMU_FTR_SECTION_IFCLR(MMU_FTR_TYPE_RADIX)
 #endif
-   /* All done -- return from exception. */
 
ld  r10,PACA_EXSLB+EX_LR(r13)
ld  r3,PACA_EXSLB+EX_R3(r13)
lwz r9,PACA_EXSLB+EX_CCR(r13)   /* get saved CR */
-
mtlrr10
+
+   beq 8f  /* if bad address, make full stack frame */
+
andi.   r10,r12,MSR_RI  /* check for unrecoverable exception */
-BEGIN_MMU_FTR_SECTION
beq-2f
-FTR_SECTION_ELSE
-   b   2f
-ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
+
+   /* All done -- return from exception. */
 
 .machine   push
 .machine   "power4"
mtcrf   0x80,r9
+   mtcrf   0x02,r9 /* I/D indication is in cr6 */
mtcrf   0x01,r9 /* slb_allocate uses cr0 and cr7 */
 .machine   pop
 
@@ -1451,6 +1458,27 @@ unrecov_slb:
bl  unrecoverable_exception
b   1b
 
+8: mfspr   r11,SPRN_SRR0
+   ld  r10,PACAKBASE(r13)
+   LOAD_HANDLER(r10,bad_addr_slb)
+   mtspr   SPRN_SRR0,r10
+   ld  r10,PACAKMSR(r13)
+   mtspr   SPRN_SRR1,r10
+   rfid
+   b   .
+
+bad_addr_slb:
+   EXCEPTION_PROLOG_COMMON(0x380, PACA_EXSLB)
+   RECONCILE_IRQ_STATE(r10, r11)
+   ld  r3, PACA_EXSLB+EX_DAR(r13)
+   std r3, _DAR(r1)
+   beq c

[PATCH 3/3] powerpc/mm: Speed up computation of base and actual page size for a HPTE

2016-09-02 Thread Paul Mackerras
This replaces a 2-D search through an array with a simple 8-bit table
lookup for determining the actual and/or base page size for a HPT entry.

The encoding in the second doubleword of the HPTE is designed to encode
the actual and base page sizes without using any more bits than would be
needed for a 4k page number, by using between 1 and 8 low-order bits of
the RPN (real page number) field to encode the page sizes.  A single
"large page" bit in the first doubleword indicates that these low-order
bits are to be interpreted like this.

We can determine the page sizes by using the low-order 8 bits of the RPN
to look up a 256-entry table.  For actual page sizes less than 1MB, some
of the upper bits of these 8 bits are going to be real address bits, but
we can cope with that by replicating the entries for those smaller page
sizes.

While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
functions from a KVM-specific header to a header for 64-bit HPT systems,
since this computation doesn't have anything specifically to do with KVM.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 37 
 arch/powerpc/include/asm/kvm_book3s_64.h  | 87 +++
 arch/powerpc/include/asm/mmu.h|  1 +
 arch/powerpc/mm/hash_native_64.c  | 42 +
 arch/powerpc/mm/hash_utils_64.c   | 37 
 5 files changed, 84 insertions(+), 120 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 287a656..e407af2 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -245,6 +245,43 @@ static inline int segment_shift(int ssize)
 }
 
 /*
+ * This array is indexed by the LP field of the HPTE second dword.
+ * Since this field may contain some RPN bits, some entries are
+ * replicated so that we get the same value irrespective of RPN.
+ * The top 4 bits are the page size index (MMU_PAGE_*) for the
+ * actual page size, the bottom 4 bits are the base page size.
+ */
+extern u8 hpte_page_sizes[1 << LP_BITS];
+
+static inline unsigned long __hpte_page_size(unsigned long h, unsigned long l,
+bool is_base_size)
+{
+   unsigned int i, lp;
+
+   if (!(h & HPTE_V_LARGE))
+   return 1ul << 12;
+
+   /* Look at the 8 bit LP value */
+   lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+   i = hpte_page_sizes[lp];
+   if (!i)
+   return 0;
+   if (!is_base_size)
+   i >>= 4;
+   return 1ul << mmu_psize_defs[i & 0xf].shift;
+}
+
+static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+{
+   return __hpte_page_size(h, l, 0);
+}
+
+static inline unsigned long hpte_base_page_size(unsigned long h, unsigned long 
l)
+{
+   return __hpte_page_size(h, l, 1);
+}
+
+/*
  * The current system page and segment sizes
  */
 extern int mmu_kernel_ssize;
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 88d17b4..4ffd5a1 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -20,6 +20,8 @@
 #ifndef __ASM_KVM_BOOK3S_64_H__
 #define __ASM_KVM_BOOK3S_64_H__
 
+#include 
+
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
 static inline struct kvmppc_book3s_shadow_vcpu *svcpu_get(struct kvm_vcpu 
*vcpu)
 {
@@ -97,56 +99,20 @@ static inline void __unlock_hpte(__be64 *hpte, unsigned 
long hpte_v)
hpte[0] = cpu_to_be64(hpte_v);
 }
 
-static inline int __hpte_actual_psize(unsigned int lp, int psize)
-{
-   int i, shift;
-   unsigned int mask;
-
-   /* start from 1 ignoring MMU_PAGE_4K */
-   for (i = 1; i < MMU_PAGE_COUNT; i++) {
-
-   /* invalid penc */
-   if (mmu_psize_defs[psize].penc[i] == -1)
-   continue;
-   /*
-* encoding bits per actual page size
-*PTE LP actual page size
-* rrrz >=8KB
-* rrzz >=16KB
-* rzzz >=32KB
-*  >=64KB
-* ...
-*/
-   shift = mmu_psize_defs[i].shift - LP_SHIFT;
-   if (shift > LP_BITS)
-   shift = LP_BITS;
-   mask = (1 << shift) - 1;
-   if ((lp & mask) == mmu_psize_defs[psize].penc[i])
-   return i;
-   }
-   return -1;
-}
-
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
 {
-   int b_psize = MMU_PAGE_4K, a_psize = MMU_PAGE_4K;
+   int i, b_psize = MMU_PAGE_4K, a_psize = MMU_PAGE_4K;
unsigned int penc;
unsigned long rb = 0, va_low, sllp;
unsi

[PATCH 1/3] powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET

2016-09-02 Thread Paul Mackerras
In commit c60ac5693c47 ("powerpc: Update kernel VSID range", 2013-03-13)
we lost a check on the region number (the top four bits of the effective
address) for addresses below PAGE_OFFSET.  That commit replaced a check
that the top 18 bits were all zero with a check that bits 46 - 59 were
zero (performed for all addresses, not just user addresses).

This means that userspace can access an address like 0x1000_0xxx__
and we will insert a valid SLB entry for it.  The VSID used will be the
same as if the top 4 bits were 0, but the page size will be some random
value obtained by indexing beyond the end of the mm_ctx_high_slices_psize
array in the paca.  If that page size is the same as would be used for
region 0, then userspace just has an alias of the region 0 space.  If the
page size is different, then no HPTE will be found for the access, and
the process will get a SIGSEGV (since hash_page_mm() will refuse to create
a HPTE for the bogus address).

The access beyond the end of the mm_ctx_high_slices_psize can be at most
5.5MB past the array, and so will be in RAM somewhere.  Since the access
is a load performed in real mode, it won't fault or crash the kernel.
At most this bug could perhaps leak a little bit of information about
blocks of 32 bytes of memory located at offsets of i * 512kB past the
paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11.

Cc: sta...@vger.kernel.org # v3.10+
Cc: Aneesh Kumar K.V 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/mm/slb_low.S | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index dfdb90c..9f19834 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -113,7 +113,12 @@ BEGIN_FTR_SECTION
 END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT)
b   slb_finish_load_1T
 
-0:
+0: /*
+* For userspace addresses, make sure this is region 0.
+*/
+   cmpdi   r9, 0
+   bne 8f
+
/* when using slices, we extract the psize off the slice bitmaps
 * and then we need to get the sllp encoding off the mmu_psize_defs
 * array.
-- 
2.7.4