Re: [RFC] fs: add userspace critical mounts event support
On Fri, Sep 2, 2016 at 9:11 PM, Linus Torvalds wrote: > On Fri, Sep 2, 2016 at 5:20 PM, Luis R. Rodriguez wrote: >> >> Thoughts ? > > I really think this is a horrible hack. > > It's basically the kernel giving up, and relying on user space to give > a single flag, and it's broken nasty crap. Worse, it's broken nasty > crap with a user interface, so we'll be stuck with it forever. Please > no. I agree that interface is bad, but I do believe we need something like this... > > What are the drivers that need this, and why can't those drivers just > be fixed to not do braindead things? Like what? Some devices do need to have firmware loaded so we know their capabilities, so we really can't push the firmware loading into "open". If your touch controller for some reason decided to crap over it's nvram and comes in bootloader mode it is nice for it to slurp in config data or firmware so use does not have to trigger update manually. And while the controller is in bootloader mode we can't create input device because we do not know what capabilities to declare. These devices we want to probe asynchronously and simply tell firmware loader to wait for firmware to become available. The problem we do not know when to give up, since we do not know where the firmware might be. But userspace knows and can tel us. Thanks. -- Dmitry
Re: [RFC] fs: add userspace critical mounts event support
On Fri, Sep 2, 2016 at 5:20 PM, Luis R. Rodriguez wrote: > > Thoughts ? I really think this is a horrible hack. It's basically the kernel giving up, and relying on user space to give a single flag, and it's broken nasty crap. Worse, it's broken nasty crap with a user interface, so we'll be stuck with it forever. Please no. What are the drivers that need this, and why can't those drivers just be fixed to not do braindead things? Linus
[RFC] fs: add userspace critical mounts event support
kernel_read_file_from_path() can try to read a file from the system's filesystem. This is typically done for firmware for instance, which lives in /lib/firmware. One issue with this is that the kernel cannot know for sure when the real final /lib/firmare/ is ready, and even if you use initramfs drivers are currently initialized *first* prior to the initramfs kicking off. During init we run through all init calls first (do_initcalls()) and finally the initramfs is processed via prepare_namespace(): do_basic_setup() { ... driver_init(); ... do_initcalls(); ... } kernel_init_freeable() { ... do_basic_setup(); ... if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) { ramdisk_execute_command = NULL; prepare_namespace(); } } This leaves a possible race between loading drivers and any uses of kernel_read_file_from_path(). Because pivot_root() can be used, this allows userspace further possibilities in terms of defining when a kernel critical filesystem should be ready by. We define kernel critical filesystems as filesystems which the kernel needs for kernel_read_file_from_path(). Since only userspace can know when kernel critical filesystems are mounted and ready, let userspace notify the kernel of this, and enable a new kernel configuration which lets the kernel wait for this event before enabling reads from kernel_read_file_from_path(). A default timeout of 10s is used for now. You can override this through the kernel-parameters using critical_mounts_timeout_ms=T where T is in ms. cat /sys/kernel/critical_mounts_timeout_ms the current system value. When userspace is ready it can simply: echo 1 > /sys/kernel/critical_mounts_ready Signed-off-by: Luis R. Rodriguez --- Note, this still leaves the puzzle of the fact that initramfs may carry some firmware, and some drivers may be OK in using firmware from there, the wait stuff would just get in the way. To address this I think can perhaps instead check *one* for the file, and if its present immediately give it back, we'd only resort to the wait in cases of failure. Another thing -- other than firmware we have: security/integrity/ima/ima_fs.c:rc = kernel_read_file_from_path(path, &data, &size, 0, READING_POLICY); sound/oss/sound_firmware.h: err = kernel_read_file_from_path((char *)fn, (void **)fp, &size, What paths are these? So we can document the current uses in the Kconfig at least. Thoughts ? Documentation/kernel-parameters.txt | 6 +++ drivers/base/Kconfig| 48 +++ fs/exec.c | 3 ++ include/linux/fs.h | 8 kernel/ksysfs.c | 77 + 5 files changed, 142 insertions(+) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 8ccacc44622a..1af89faa9fc9 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -849,6 +849,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + critical_mounts_timeout_ms=T[KNL] timeout in ms + Format: + Use this to override the kernel's default timeout for + waiting for critical system mount points to become + available. + cryptomgr.notests [KNL] Disable crypto self-tests diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 12b4f5551501..21576c0a4898 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -25,6 +25,54 @@ config UEVENT_HELPER_PATH via /proc/sys/kernel/hotplug or via /sys/kernel/uevent_helper later at runtime. +config CRITICAL_MOUNTS_WAIT + bool "Enable waiting for critical-filesystems-ready notification" + default n + help + Kernel subsystems and device drivers often need to read files + from the filesystem, however in doing this races are possible at + bootup -- the subsystem requesting the file might look for it in / + early in boot, but if we haven't yet mounted the real root + filesystem we'll just tell the subsystem the file is not present and + it will fail. Furthermore what path to the filesystem is used varies + depending on the subsystem. To help the kernel we provide the option + to let the kernel wait for all critical filesystems to mounted and + ready before letting the kernel start trying to read files from the + systems' filesystem. Since pivot_root() can be used and therefore a + system might be configured to change its / filesystem at bootup as + many times as it wishes, only userspace can realy know exactly when + all critical filesystems are ready. Ena
Re: [PATCH v2 2/5] firmware: annotate thou shalt not request fw on init or probe
On Thu, Aug 25, 2016 at 09:41:33PM +0200, Luis R. Rodriguez wrote: > On Thu, Aug 25, 2016 at 01:05:44PM +0200, Daniel Vetter wrote: > > On Wed, Aug 24, 2016 at 10:39 PM, Luis R. Rodriguez > > wrote: > > Some users want a fully running gfx stack 2s after power-on. There's > > not even enough time to load an uefi or vga driver first. i915 > > directly initializes the gpu from power-on state on those. > > I see.. thanks. > > > >> I think what would work is loading the different subsystems of the > > >> driver in parallel (we already do that largely) > > > > > > Init level stuff is actually pretty synchronous, and in fact both > > > init and probe are called serially. There are a few subsystems which > > > have been doing things a bit differently, but these are exceptions. > > > > > > When you say we already do this largely, can you describe a bit more > > > precisely what you mean ? > > > > Oh, this isn't subsystems as in linux device/driver model, but > > different parts within the driver. We fire up a bunch of struct work > > to get various bits done asynchronously. > > Thanks for the clarification. <-- snip --> > > One of the proposals which would have worked for us died a while back: > > > > https://lkml.org/lkml/2014/6/15/47 > > Interesting, the problem stated here, which comes from the fact (as clarified > above) the rootfs comes up only *after* we run do_initcalls() as such there > are > possible theoretical races even with request_firmware_nowait() -- as such that > justifies even more the SmPL grammar rule to include even > request_firmware_nowait() > on probe / init as this patch has. > > Anyway yeah that patch has good intentions but its wrong for a slew of > reasons. > The problem is the same as discussed with Bjorn recently. The solution > discussed is that since only userspace knows when the *real* rootfs is ready > (because of slew of reasons) the best we can do is have an event issued by > userspace to inform us of that. OK I have something I think we can build upon for this. Will send RFC. Luis
Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models
On Fri, 2016-09-02 at 13:26 -0700, Linus Torvalds wrote: > On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann wrote: > > > > > > When I once looked, I thought all drivers using NO_IRQ were > > specific > > to powerpc or one of the less common architectures. We deprecated NO_IRQ ages ago, it's 0 like it should be, but yes we may have forgotten to "cleanup" the old users. > powerpc definitely does seem to be the biggest case, with about half > the instances of NO_IRQ being under arch/powerpc/ (and a few more in > ppc-specific drivers). > > Adding the powerpc maintainers to the list - because it would really > be nice to get rid of it, or at least make it *so* rare that we don't > have people re-introducing it again because they thought it was the > right thing to do. Right. Originally it was -1 for us which causes the whole problem. I changed it to be 0 after doing the whole irq domain remapping thing. That was a lng time ago. > A fair amount of of it could even be done by some trivial scripting. > Something like > > git grep -wl NO_IRQ arch/powerpc/ | while read a > do > sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new > sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a > done > > does fix at least a few of the cases. It still leaves several > assignments and "return NO_IRQ;" statements, but a few more > sed-scripts would take care of most of it. Then remove the #define, > and do a full build to find any straggling cases. > > Michael? Ben? It can just be replaced with "0" in all powerpc related cases yes. Cheers, Ben.
Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models
On Fri, Sep 2, 2016 at 1:43 PM, Julia Lawall wrote: > > Like this? Sure. > Is it always correct to replace return NO_IRQ by return 0? On platforms that define it to 0, like powerpc, yes. On arm, c6x, mn10300, parisc and sparc it would need more work. > Completely untested patch below. Looks fine, but the ppc people should test it (although it really should generate the exact same code, unless there was some odd conversion case). Linus
Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models
On Fri, 2 Sep 2016, Linus Torvalds wrote: > On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann wrote: > > > > When I once looked, I thought all drivers using NO_IRQ were specific > > to powerpc or one of the less common architectures. > > powerpc definitely does seem to be the biggest case, with about half > the instances of NO_IRQ being under arch/powerpc/ (and a few more in > ppc-specific drivers). > > Adding the powerpc maintainers to the list - because it would really > be nice to get rid of it, or at least make it *so* rare that we don't > have people re-introducing it again because they thought it was the > right thing to do. > > A fair amount of of it could even be done by some trivial scripting. > Something like > > git grep -wl NO_IRQ arch/powerpc/ | while read a > do > sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new > sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a > done > > does fix at least a few of the cases. It still leaves several > assignments and "return NO_IRQ;" statements, but a few more > sed-scripts would take care of most of it. Then remove the #define, > and do a full build to find any straggling cases. Like this? @@ expression e; @@ e - != NO_IRQ @@ expression e; @@ +! e - == NO_IRQ @@ @@ - NO_IRQ + 0 --- Is it always correct to replace return NO_IRQ by return 0? Completely untested patch below. julia diff -u -p a/arch/powerpc/platforms/52xx/mpc52xx_pic.c b/arch/powerpc/platforms/52xx/mpc52xx_pic.c --- a/arch/powerpc/platforms/52xx/mpc52xx_pic.c +++ b/arch/powerpc/platforms/52xx/mpc52xx_pic.c @@ -511,7 +511,7 @@ unsigned int mpc52xx_get_irq(void) irq |= (MPC52xx_IRQ_L1_PERP << MPC52xx_IRQ_L1_OFFSET); } } else { - return NO_IRQ; + return 0; } return irq_linear_revmap(mpc52xx_irqhost, irq); diff -u -p a/arch/powerpc/platforms/chrp/setup.c b/arch/powerpc/platforms/chrp/setup.c --- a/arch/powerpc/platforms/chrp/setup.c +++ b/arch/powerpc/platforms/chrp/setup.c @@ -368,7 +368,7 @@ static void chrp_8259_cascade(struct irq struct irq_chip *chip = irq_desc_get_chip(desc); unsigned int cascade_irq = i8259_irq(); - if (cascade_irq != NO_IRQ) + if (cascade_irq) generic_handle_irq(cascade_irq); chip->irq_eoi(&desc->irq_data); @@ -514,7 +514,7 @@ static void __init chrp_find_8259(void) } if (chrp_mpic != NULL) { cascade_irq = irq_of_parse_and_map(pic, 0); - if (cascade_irq == NO_IRQ) + if (!cascade_irq) printk(KERN_ERR "i8259: failed to map cascade irq\n"); else irq_set_chained_handler(cascade_irq, diff -u -p a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c --- a/arch/powerpc/platforms/maple/pci.c +++ b/arch/powerpc/platforms/maple/pci.c @@ -552,7 +552,7 @@ void maple_pci_irq_fixup(struct pci_dev pci_bus_to_host(dev->bus) == u4_pcie) { printk(KERN_DEBUG "Fixup U4 PCIe IRQ\n"); dev->irq = irq_create_mapping(NULL, 1); - if (dev->irq != NO_IRQ) + if (dev->irq) irq_set_irq_type(dev->irq, IRQ_TYPE_LEVEL_LOW); } @@ -562,7 +562,7 @@ void maple_pci_irq_fixup(struct pci_dev if (dev->vendor == PCI_VENDOR_ID_AMD && dev->device == PCI_DEVICE_ID_AMD_8111_IDE && (dev->class & 5) != 5) { - dev->irq = NO_IRQ; + dev->irq = 0; } DBG(" <- maple_pci_irq_fixup\n"); @@ -648,7 +648,7 @@ int maple_pci_get_legacy_ide_irq(struct return defirq; } irq = irq_of_parse_and_map(np, channel & 0x1); - if (irq == NO_IRQ) { + if (!irq) { printk("Failed to map onboard IDE interrupt for channel %d\n", channel); return defirq; diff -u -p a/arch/powerpc/platforms/embedded6xx/flipper-pic.c b/arch/powerpc/platforms/embedded6xx/flipper-pic.c --- a/arch/powerpc/platforms/embedded6xx/flipper-pic.c +++ b/arch/powerpc/platforms/embedded6xx/flipper-pic.c @@ -181,7 +181,7 @@ unsigned int flipper_pic_get_irq(void) irq_status = in_be32(io_base + FLIPPER_ICR) & in_be32(io_base + FLIPPER_IMR); if (irq_status == 0) - return NO_IRQ; /* no more IRQs pending */ + return 0; /* no more IRQs pending */ irq = __ffs(irq_status); return irq_linear_revmap(flipper_irq_host, irq); diff -u -p a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c --- a/arch/powerpc/platforms/embedded6xx/hlwd-pic.c +++ b/arch/powerpc/platforms/embedded6xx/hlwd-pic.c @@ -114,7 +114,7 @@ static unsigned int __hlwd_pic_get_irq(s irq_status = in_be32(io_base + HW_BROADWAY_ICR) & in_be32(io_base + HW_BROADWAY_IMR
[PATCH 6/6] cxlflash: Fix context reference tracking on detach
From: "Matthew R. Ochs" Commit 888baf069f49 ("scsi: cxlflash: Add kref to context") introduced a kref to the context. In particular, the detach routine was updated to use the kref services for managing the removal and destruction of a context. As part of this change, the tracking mechanism internal to the detach handler was refactored. This introduced a bug that can cause the tracking state to be lost. This can lead to a situation where exclusive access to a context is prematurely [and unknowingly] relinquished for the executing thread. To remedy, only update the tracking state when the kref operation indicates the context was removed. Fixes: 888baf069f49 ("scsi: cxlflash: Add kref to context") Signed-off-by: Matthew R. Ochs --- drivers/scsi/cxlflash/superpipe.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c index c91fe6f..9636970 100644 --- a/drivers/scsi/cxlflash/superpipe.c +++ b/drivers/scsi/cxlflash/superpipe.c @@ -912,7 +912,8 @@ static int _cxlflash_disk_detach(struct scsi_device *sdev, * Release the context reference and the sdev reference that * bound this LUN to the context. */ - put_ctx = !kref_put(&ctxi->kref, remove_context); + if (kref_put(&ctxi->kref, remove_context)) + put_ctx = false; scsi_device_put(sdev); out: if (put_ctx) -- 2.1.0
[PATCH 5/6] cxlflash: Refactor WWPN setup
From: "Matthew R. Ochs" Commit 964497b3bf3f ("cxlflash: Remove dual port online dependency") logically removed the ability for the WWPN setup routine afu_set_wwpn() to return a non-success value. This routine can safely be made a void to simplify the code as there is no longer a need to report a failure. Signed-off-by: Matthew R. Ochs --- drivers/scsi/cxlflash/main.c | 40 +--- 1 file changed, 9 insertions(+), 31 deletions(-) diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index 42970a4..b301655 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -1093,42 +1093,25 @@ static int wait_port_offline(__be64 __iomem *fc_regs, u32 delay_us, u32 nretry) * online. This toggling action can cause this routine to delay up to a few * seconds. When configured to use the internal LUN feature of the AFU, a * failure to come online is overridden. - * - * Return: - * 0 when the WWPN is successfully written and the port comes back online - * -1 when the port fails to go offline or come back up online */ -static int afu_set_wwpn(struct afu *afu, int port, __be64 __iomem *fc_regs, - u64 wwpn) +static void afu_set_wwpn(struct afu *afu, int port, __be64 __iomem *fc_regs, +u64 wwpn) { - int rc = 0; - set_port_offline(fc_regs); - if (!wait_port_offline(fc_regs, FC_PORT_STATUS_RETRY_INTERVAL_US, FC_PORT_STATUS_RETRY_CNT)) { pr_debug("%s: wait on port %d to go offline timed out\n", __func__, port); - rc = -1; /* but continue on to leave the port back online */ } - if (rc == 0) - writeq_be(wwpn, &fc_regs[FC_PNAME / 8]); - - /* Always return success after programming WWPN */ - rc = 0; + writeq_be(wwpn, &fc_regs[FC_PNAME / 8]); set_port_online(fc_regs); - if (!wait_port_online(fc_regs, FC_PORT_STATUS_RETRY_INTERVAL_US, FC_PORT_STATUS_RETRY_CNT)) { - pr_err("%s: wait on port %d to go online timed out\n", - __func__, port); + pr_debug("%s: wait on port %d to go online timed out\n", +__func__, port); } - - pr_debug("%s: returning rc=%d\n", __func__, rc); - - return rc; } /** @@ -1629,15 +1612,10 @@ static int init_global(struct cxlflash_cfg *cfg) [FC_CRC_THRESH / 8]); /* Set WWPNs. If already programmed, wwpn[i] is 0 */ - if (wwpn[i] != 0 && - afu_set_wwpn(afu, i, -&afu->afu_map->global.fc_regs[i][0], -wwpn[i])) { - dev_err(dev, "%s: failed to set WWPN on port %d\n", - __func__, i); - rc = -EIO; - goto out; - } + if (wwpn[i] != 0) + afu_set_wwpn(afu, i, +&afu->afu_map->global.fc_regs[i][0], +wwpn[i]); /* Programming WWPN back to back causes additional * offline/online transitions and a PLOGI */ -- 2.1.0
[PATCH 4/6] cxlflash: Improve EEH recovery time
From: "Matthew R. Ochs" When an EEH occurs during device initialization, the port timeout logic can cause excessive delays as MMIO reads will fail. Depending on where they are experienced, these delays can lead to a prolonged reset, causing an unnecessary triggering of other timeout logic in the SCSI stack or user applications. To expedite recovery, the port timeout logic is updated to decay the timeout at a much faster rate when in the presence of a likely EEH frozen event. Signed-off-by: Matthew R. Ochs --- drivers/scsi/cxlflash/main.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index 4ef5235..42970a4 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -1040,6 +1040,8 @@ static int wait_port_online(__be64 __iomem *fc_regs, u32 delay_us, u32 nretry) do { msleep(delay_us / 1000); status = readq_be(&fc_regs[FC_MTIP_STATUS / 8]); + if (status == U64_MAX) + nretry /= 2; } while ((status & FC_MTIP_STATUS_MASK) != FC_MTIP_STATUS_ONLINE && nretry--); @@ -1071,6 +1073,8 @@ static int wait_port_offline(__be64 __iomem *fc_regs, u32 delay_us, u32 nretry) do { msleep(delay_us / 1000); status = readq_be(&fc_regs[FC_MTIP_STATUS / 8]); + if (status == U64_MAX) + nretry /= 2; } while ((status & FC_MTIP_STATUS_MASK) != FC_MTIP_STATUS_OFFLINE && nretry--); -- 2.1.0
[PATCH 3/6] cxlflash: Fix to avoid EEH and host reset collisions
From: "Matthew R. Ochs" The EEH reset handler is ignorant to the current state of the driver when processing a frozen event and initiating a device reset. This can be an issue if an EEH event occurs while a user or stack initiated reset is executing. More specifically, if an EEH occurs while the SCSI host reset handler is active, the reset initiated by the EEH thread will likely collide with the host reset thread. This can leave the device in an inconsistent state, or worse, cause a system crash. As a remedy, the EEH handler is updated to evaluate the device state and take appropriate action (proceed, wait, or disconnect host). The host reset handler is also updated to handle situations where an EEH occurred during a host reset. In such situations, the host reset handler will delay reporting back a success to give the EEH reset an opportunity to complete. Signed-off-by: Matthew R. Ochs --- drivers/scsi/cxlflash/main.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index 4c2559a..4ef5235 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -2042,6 +2042,11 @@ retry: * cxlflash_eh_host_reset_handler() - reset the host adapter * @scp: SCSI command from stack identifying host. * + * Following a reset, the state is evaluated again in case an EEH occurred + * during the reset. In such a scenario, the host reset will either yield + * until the EEH recovery is complete or return success or failure based + * upon the current device state. + * * Return: * SUCCESS as defined in scsi/scsi.h * FAILED as defined in scsi/scsi.h @@ -2074,7 +2079,8 @@ static int cxlflash_eh_host_reset_handler(struct scsi_cmnd *scp) } else cfg->state = STATE_NORMAL; wake_up_all(&cfg->reset_waitq); - break; + ssleep(1); + /* fall through */ case STATE_RESET: wait_event(cfg->reset_waitq, cfg->state != STATE_RESET); if (cfg->state == STATE_NORMAL) @@ -2590,6 +2596,9 @@ out_remove: * @pdev: PCI device struct. * @state: PCI channel state. * + * When an EEH occurs during an active reset, wait until the reset is + * complete and then take action based upon the device state. + * * Return: PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_DISCONNECT */ static pci_ers_result_t cxlflash_pci_error_detected(struct pci_dev *pdev, @@ -2603,6 +2612,10 @@ static pci_ers_result_t cxlflash_pci_error_detected(struct pci_dev *pdev, switch (state) { case pci_channel_io_frozen: + wait_event(cfg->reset_waitq, cfg->state != STATE_RESET); + if (cfg->state == STATE_FAILTERM) + return PCI_ERS_RESULT_DISCONNECT; + cfg->state = STATE_RESET; scsi_block_requests(cfg->host); drain_ioctls(cfg); -- 2.1.0
[PATCH 2/6] cxlflash: Remove the device cleanly in the system shutdown path
Commit 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash cards") was recently introduced to notify the AFU when a system is going down. Due to the position of the cxlflash driver in the device stack, cxlflash devices are _always_ removed during a reboot/shutdown. This can lead to a crash if the cxlflash shutdown hook is invoked _after_ the shutdown hook for the owning virtual PHB. Furthermore, the current implementation of shutdown/remove hooks for cxlflash are not tolerant to being invoked when the device is not enabled. This can also lead to a crash in situations where the remove hook is invoked after the device has been removed via the vPHBs shutdown hook. An example of this scenario would be an EEH reset failure while a reboot/shutdown is in progress. To solve both problems, the shutdown hook for cxlflash is updated to simply remove the device. This path already includes the AFU notification and thus this solution will continue to perform the original intent. At the same time, the remove hook is updated to protect against being called when the device is not enabled. Fixes: 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash cards") Signed-off-by: Uma Krishnan --- drivers/scsi/cxlflash/main.c | 18 ++ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index b063c41..4c2559a 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -823,17 +823,6 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool wait) } /** - * cxlflash_shutdown() - shutdown handler - * @pdev: PCI device associated with the host. - */ -static void cxlflash_shutdown(struct pci_dev *pdev) -{ - struct cxlflash_cfg *cfg = pci_get_drvdata(pdev); - - notify_shutdown(cfg, false); -} - -/** * cxlflash_remove() - PCI entry point to tear down host * @pdev: PCI device associated with the host. * @@ -844,6 +833,11 @@ static void cxlflash_remove(struct pci_dev *pdev) struct cxlflash_cfg *cfg = pci_get_drvdata(pdev); ulong lock_flags; + if (!pci_is_enabled(pdev)) { + pr_debug("%s: Device is disabled\n", __func__); + return; + } + /* If a Task Management Function is active, wait for it to complete * before continuing with remove. */ @@ -2685,7 +2679,7 @@ static struct pci_driver cxlflash_driver = { .id_table = cxlflash_pci_table, .probe = cxlflash_probe, .remove = cxlflash_remove, - .shutdown = cxlflash_shutdown, + .shutdown = cxlflash_remove, .err_handler = &cxlflash_err_handler, }; -- 2.1.0
[PATCH 1/6] cxlflash: Scan host only after the port is ready for I/O
When a port link is established, the AFU sends a 'link up' interrupt. After the link is up, corresponding initialization steps are performed on the card. Following that, when the card is ready for I/O, the AFU sends 'login succeeded' interrupt. Today, cxlflash invokes scsi_scan_host() upon receipt of both interrupts. SCSI commands sent to the port prior to the 'login succeeded' interrupt will fail with 'port not available' error. This is not desirable. Moreover, when async_scan is active for the host, subsequent scan calls are terminated with error. Due to this, the scsi_scan_host() call performed after 'login succeeded' interrupt could portentially return error and the devices may not be scanned properly. To avoid this problem, scsi_scan_host() should be called only after the 'login succeeded' interrupt. Signed-off-by: Uma Krishnan --- drivers/scsi/cxlflash/main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c index 661bb94..b063c41 100644 --- a/drivers/scsi/cxlflash/main.c +++ b/drivers/scsi/cxlflash/main.c @@ -1187,7 +1187,7 @@ static const struct asyc_intr_info ainfo[] = { {SISL_ASTATUS_FC0_LOGI_F, "login failed", 0, CLR_FC_ERROR}, {SISL_ASTATUS_FC0_LOGI_S, "login succeeded", 0, SCAN_HOST}, {SISL_ASTATUS_FC0_LINK_DN, "link down", 0, 0}, - {SISL_ASTATUS_FC0_LINK_UP, "link up", 0, SCAN_HOST}, + {SISL_ASTATUS_FC0_LINK_UP, "link up", 0, 0}, {SISL_ASTATUS_FC1_OTHER, "other error", 1, CLR_FC_ERROR | LINK_RESET}, {SISL_ASTATUS_FC1_LOGO, "target initiated LOGO", 1, 0}, {SISL_ASTATUS_FC1_CRC_T, "CRC threshold exceeded", 1, LINK_RESET}, @@ -1195,7 +1195,7 @@ static const struct asyc_intr_info ainfo[] = { {SISL_ASTATUS_FC1_LOGI_F, "login failed", 1, CLR_FC_ERROR}, {SISL_ASTATUS_FC1_LOGI_S, "login succeeded", 1, SCAN_HOST}, {SISL_ASTATUS_FC1_LINK_DN, "link down", 1, 0}, - {SISL_ASTATUS_FC1_LINK_UP, "link up", 1, SCAN_HOST}, + {SISL_ASTATUS_FC1_LINK_UP, "link up", 1, 0}, {0x0, "", 0, 0} /* terminator */ }; -- 2.1.0
[PATCH 0/6] cxlflash: Miscellaneous fixes
This patch set contains miscellaneous fixes in the device cleanup and EEH recovery paths. One of the patches simplifies the WWPN assignment routine. There are a couple of fixes that resolve regressions introduced by Commit 704c4b0ddc03 ("cxlflash: Shutdown notify support for CXL Flash cards") and Commit 888baf069f49 ("scsi: cxlflash: Add kref to context") This series is based upon mkp's 4.9/scsi-queue and is bisectable. Matthew R. Ochs (4): cxlflash: Fix to avoid EEH and host reset collisions cxlflash: Improve EEH recovery time cxlflash: Refactor WWPN setup cxlflash: Fix context reference tracking on detach Uma Krishnan (2): cxlflash: Scan host only after the port is ready for I/O cxlflash: Remove the device cleanly in the system shutdown path drivers/scsi/cxlflash/main.c | 81 +-- drivers/scsi/cxlflash/superpipe.c | 3 +- 2 files changed, 37 insertions(+), 47 deletions(-) -- 2.1.0
Re: [Ksummit-discuss] [CORE TOPIC] (group) maintainership models
On Fri, Sep 2, 2016 at 1:06 PM, Arnd Bergmann wrote: > > When I once looked, I thought all drivers using NO_IRQ were specific > to powerpc or one of the less common architectures. powerpc definitely does seem to be the biggest case, with about half the instances of NO_IRQ being under arch/powerpc/ (and a few more in ppc-specific drivers). Adding the powerpc maintainers to the list - because it would really be nice to get rid of it, or at least make it *so* rare that we don't have people re-introducing it again because they thought it was the right thing to do. A fair amount of of it could even be done by some trivial scripting. Something like git grep -wl NO_IRQ arch/powerpc/ | while read a do sed 's/(\([a-z_]*irq\) != NO_IRQ)/(\1)/' < $a > $a.new sed 's/(\([a-z_]*irq\) == NO_IRQ)/(!\1)/' < $a.new > $a done does fix at least a few of the cases. It still leaves several assignments and "return NO_IRQ;" statements, but a few more sed-scripts would take care of most of it. Then remove the #define, and do a full build to find any straggling cases. Michael? Ben? Linus
Re: [PATCH v3 2/5] firmware: annotate thou shalt not request fw on init or probe
On Wed, Aug 24, 2016 at 10:17:38AM +0200, Gabriel Paubert wrote: > On Tue, Aug 23, 2016 at 05:45:04PM -0700, mcg...@kernel.org wrote: > > [snip] > > --- > > Documentation/firmware_class/README| 20 > > drivers/base/Kconfig | 2 +- > > .../request_firmware-avoid-init-probe-init.cocci | 130 > > + > > 3 files changed, 151 insertions(+), 1 deletion(-) > > create mode 100644 > > scripts/coccinelle/api/request_firmware-avoid-init-probe-init.cocci > > > > diff --git a/Documentation/firmware_class/README > > b/Documentation/firmware_class/README > > index cafdca8b3b15..056d1cb9d365 100644 > > --- a/Documentation/firmware_class/README > > +++ b/Documentation/firmware_class/README > > @@ -93,6 +93,26 @@ > > user contexts to request firmware asynchronously, but can't be called > > in atomic contexts. > > > > +Requirements: > > += > > + > > +You should avoid at all costs requesting firmware on both init and probe > > paths > > +of your device driver. Reason for this is the complexity needed to ensure a > > +firmware will be available for a driver early in boot through different > > +build configurations. Consider built-in drivers needing firmware early, or > > +consider a driver assuming it will only get firmware after pivot_root(). > > + > > +Drivers that really need firmware early should use stuff the firmware in > > Minor grammatical nit: s/use// Fixed, thanks. Luis
Re: [PATCH] ASoC: fsl_esai: fix spelling mistake "Transmition" -> "Transmission"
On Fri, Sep 02, 2016 at 03:07:23PM +0100, Colin King wrote: > From: Colin Ian King > > Trivial fix to spelling mistakes in dev_dbg messages > > Signed-off-by: Colin Ian King Acked-by: Nicolin Chen Thanks > --- > sound/soc/fsl/fsl_esai.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c > index 26a90e1..e927955 100644 > --- a/sound/soc/fsl/fsl_esai.c > +++ b/sound/soc/fsl/fsl_esai.c > @@ -77,19 +77,19 @@ static irqreturn_t esai_isr(int irq, void *devid) > regmap_read(esai_priv->regmap, REG_ESAI_ESR, &esr); > > if (esr & ESAI_ESR_TINIT_MASK) > - dev_dbg(&pdev->dev, "isr: Transmition Initialized\n"); > + dev_dbg(&pdev->dev, "isr: Transmission Initialized\n"); > > if (esr & ESAI_ESR_RFF_MASK) > dev_warn(&pdev->dev, "isr: Receiving overrun\n"); > > if (esr & ESAI_ESR_TFE_MASK) > - dev_warn(&pdev->dev, "isr: Transmition underrun\n"); > + dev_warn(&pdev->dev, "isr: Transmission underrun\n"); > > if (esr & ESAI_ESR_TLS_MASK) > dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n"); > > if (esr & ESAI_ESR_TDE_MASK) > - dev_dbg(&pdev->dev, "isr: Transmition data exception\n"); > + dev_dbg(&pdev->dev, "isr: Transmission data exception\n"); > > if (esr & ESAI_ESR_TED_MASK) > dev_dbg(&pdev->dev, "isr: Transmitting even slots\n"); > -- > 2.9.3 >
[PATCH] ASoC: fsl_esai: fix spelling mistake "Transmition" -> "Transmission"
From: Colin Ian King Trivial fix to spelling mistakes in dev_dbg messages Signed-off-by: Colin Ian King --- sound/soc/fsl/fsl_esai.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c index 26a90e1..e927955 100644 --- a/sound/soc/fsl/fsl_esai.c +++ b/sound/soc/fsl/fsl_esai.c @@ -77,19 +77,19 @@ static irqreturn_t esai_isr(int irq, void *devid) regmap_read(esai_priv->regmap, REG_ESAI_ESR, &esr); if (esr & ESAI_ESR_TINIT_MASK) - dev_dbg(&pdev->dev, "isr: Transmition Initialized\n"); + dev_dbg(&pdev->dev, "isr: Transmission Initialized\n"); if (esr & ESAI_ESR_RFF_MASK) dev_warn(&pdev->dev, "isr: Receiving overrun\n"); if (esr & ESAI_ESR_TFE_MASK) - dev_warn(&pdev->dev, "isr: Transmition underrun\n"); + dev_warn(&pdev->dev, "isr: Transmission underrun\n"); if (esr & ESAI_ESR_TLS_MASK) dev_dbg(&pdev->dev, "isr: Just transmitted the last slot\n"); if (esr & ESAI_ESR_TDE_MASK) - dev_dbg(&pdev->dev, "isr: Transmition data exception\n"); + dev_dbg(&pdev->dev, "isr: Transmission data exception\n"); if (esr & ESAI_ESR_TED_MASK) dev_dbg(&pdev->dev, "isr: Transmitting even slots\n"); -- 2.9.3
Re: [PATHC v2 5/9] ima: on soft reboot, save the measurement list
Hi Dave, On Thu, 2016-09-01 at 09:57 +0800, Dave Young wrote: > On 08/30/16 at 06:40pm, Mimi Zohar wrote: > > + * Called during kexec_file_load so that IMA can add a segment to the kexec > > + * image for the measurement list for the next kernel. > > + */ > > +void ima_add_kexec_buffer(struct kimage *image) > > +{ > > + static int registered = 0; > > + struct kexec_buf kbuf = { .image = image, .buf_align = PAGE_SIZE, > > + .buf_min = 0, .buf_max = ULONG_MAX, > > + .top_down = true, .skip_checksum = true }; > > + int ret; > > + > > + if (!kexec_can_hand_over_buffer()) > > + return; > > + > > + kexec_segment_size = ALIGN(ima_get_binary_runtime_size() + PAGE_SIZE, > > + PAGE_SIZE); > > + > > + if (kexec_segment_size >= (ULONG_MAX - sizeof(long))) { > > + pr_err("Binary measurement list too large.\n"); > > + return; > > + } > > Now we added a limitation that total segment size can not be larger than > half of totalram. see kernel/kexec_core.c sanity_check_segment_list() > > So can it fail early here if kexec_segment_size is over half of total > ram? Sure, I'll include this change in the next post. Mimi
Re: [PATCH 1/3] powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
Hi Paul, Really nice catch. Was this found by code analysis or do we have any reported issue around this ? Paul Mackerras writes: > In commit c60ac5693c47 ("powerpc: Update kernel VSID range", 2013-03-13) > we lost a check on the region number (the top four bits of the effective > address) for addresses below PAGE_OFFSET. That commit replaced a check > that the top 18 bits were all zero with a check that bits 46 - 59 were > zero (performed for all addresses, not just user addresses). To make review easy for others, here is the relevant diff from that commit. _GLOBAL(slb_allocate_realmode) - /* r3 = faulting address */ + /* +* check for bad kernel/user address +* (ea & ~REGION_MASK) >= PGTABLE_RANGE +*/ + rldicr. r9,r3,4,(63 - 46 - 4) + bne-8f srdir9,r3,60/* get region */ .. And because we were doing the above check, I removed . BEGIN_FTR_SECTION b slb_finish_load END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT) b slb_finish_load_1T -0: /* user address: proto-VSID = context << 15 | ESID. First check -* if the address is within the boundaries of the user region -*/ - srdi. r9,r10,USER_ESID_BITS - bne-8f /* invalid ea bits set */ - - +0: > > This means that userspace can access an address like 0x1000_0xxx__ > and we will insert a valid SLB entry for it. The VSID used will be the > same as if the top 4 bits were 0, but the page size will be some random > value obtained by indexing beyond the end of the mm_ctx_high_slices_psize > array in the paca. If that page size is the same as would be used for > region 0, then userspace just has an alias of the region 0 space. If the > page size is different, then no HPTE will be found for the access, and > the process will get a SIGSEGV (since hash_page_mm() will refuse to create > a HPTE for the bogus address). > > The access beyond the end of the mm_ctx_high_slices_psize can be at most > 5.5MB past the array, and so will be in RAM somewhere. Since the access > is a load performed in real mode, it won't fault or crash the kernel. > At most this bug could perhaps leak a little bit of information about > blocks of 32 bytes of memory located at offsets of i * 512kB past the > paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11. Reviewed-by: Aneesh Kumar K.V > > Cc: sta...@vger.kernel.org # v3.10+ > Cc: Aneesh Kumar K.V > Signed-off-by: Paul Mackerras > --- > arch/powerpc/mm/slb_low.S | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S > index dfdb90c..9f19834 100644 > --- a/arch/powerpc/mm/slb_low.S > +++ b/arch/powerpc/mm/slb_low.S > @@ -113,7 +113,12 @@ BEGIN_FTR_SECTION > END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT) > b slb_finish_load_1T > > -0: > +0: /* > + * For userspace addresses, make sure this is region 0. > + */ > + cmpdi r9, 0 > + bne 8f > + > /* when using slices, we extract the psize off the slice bitmaps >* and then we need to get the sllp encoding off the mmu_psize_defs >* array. > -- > 2.7.4
[PATCH 2/3] powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address
Currently, if userspace or the kernel accesses a completely bogus address, for example with any of bits 46-59 set, we first take an SLB miss interrupt, install a corresponding SLB entry with VSID 0, retry the instruction, then take a DSI/ISI interrupt because there is no HPT entry mapping the address. However, by the time of the second interrupt, the Come-From Address Register (CFAR) has been overwritten by the rfid instruction at the end of the SLB miss interrupt handler. Since bogus accesses can often be caused by a function return after the stack has been overwritten, the CFAR value would be very useful as it could indicate which function it was whose return had led to the bogus address. This patch adds code to create a full exception frame in the SLB miss handler in the case of a bogus address, rather than inserting an SLB entry with a zero VSID field. Then we call a new slb_miss_bad_addr() function in C code, which delivers a signal for a user access or creates an oops for a kernel access. In the latter case the oops message will show the CFAR value at the time of the access. In the case of the radix MMU, a segment miss interrupt indicates an access outside the ranges mapped by the page tables. Previously this was handled by the code for an unrecoverable SLB miss (one with MSR[RI] = 0), which is not really correct. With this patch, we now handle these interrupts with slb_miss_bad_addr(), which is much more consistent. Signed-off-by: Paul Mackerras --- arch/powerpc/kernel/exceptions-64s.S | 40 ++-- arch/powerpc/kernel/traps.c | 11 ++ arch/powerpc/mm/slb_low.S| 8 +++- 3 files changed, 48 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index df6d45e..a2526b0 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -175,6 +175,7 @@ data_access_slb_pSeries: std r3,PACA_EXSLB+EX_R3(r13) mfspr r3,SPRN_DAR mfspr r12,SPRN_SRR1 + crset 4*cr6+eq #ifndef CONFIG_RELOCATABLE b slb_miss_realmode #else @@ -201,6 +202,7 @@ instruction_access_slb_pSeries: std r3,PACA_EXSLB+EX_R3(r13) mfspr r3,SPRN_SRR0/* SRR0 is faulting address */ mfspr r12,SPRN_SRR1 + crclr 4*cr6+eq #ifndef CONFIG_RELOCATABLE b slb_miss_realmode #else @@ -767,6 +769,7 @@ data_access_slb_relon_pSeries: std r3,PACA_EXSLB+EX_R3(r13) mfspr r3,SPRN_DAR mfspr r12,SPRN_SRR1 + crset 4*cr6+eq #ifndef CONFIG_RELOCATABLE b slb_miss_realmode #else @@ -792,6 +795,7 @@ instruction_access_slb_relon_pSeries: std r3,PACA_EXSLB+EX_R3(r13) mfspr r3,SPRN_SRR0/* SRR0 is faulting address */ mfspr r12,SPRN_SRR1 + crclr 4*cr6+eq #ifndef CONFIG_RELOCATABLE b slb_miss_realmode #else @@ -1389,6 +1393,7 @@ unrecover_mce: * r3 has the faulting address * r9 - r13 are saved in paca->exslb. * r3 is saved in paca->slb_r3 + * cr6.eq is set for a D-SLB miss, clear for a I-SLB miss * We assume we aren't going to take any exceptions during this procedure. */ slb_miss_realmode: @@ -1399,29 +1404,31 @@ slb_miss_realmode: stw r9,PACA_EXSLB+EX_CCR(r13) /* save CR in exc. frame */ std r10,PACA_EXSLB+EX_LR(r13) /* save LR */ + std r3,PACA_EXSLB+EX_DAR(r13) + crset 4*cr0+eq #ifdef CONFIG_PPC_STD_MMU_64 BEGIN_MMU_FTR_SECTION bl slb_allocate_realmode END_MMU_FTR_SECTION_IFCLR(MMU_FTR_TYPE_RADIX) #endif - /* All done -- return from exception. */ ld r10,PACA_EXSLB+EX_LR(r13) ld r3,PACA_EXSLB+EX_R3(r13) lwz r9,PACA_EXSLB+EX_CCR(r13) /* get saved CR */ - mtlrr10 + + beq 8f /* if bad address, make full stack frame */ + andi. r10,r12,MSR_RI /* check for unrecoverable exception */ -BEGIN_MMU_FTR_SECTION beq-2f -FTR_SECTION_ELSE - b 2f -ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) + + /* All done -- return from exception. */ .machine push .machine "power4" mtcrf 0x80,r9 + mtcrf 0x02,r9 /* I/D indication is in cr6 */ mtcrf 0x01,r9 /* slb_allocate uses cr0 and cr7 */ .machine pop @@ -1451,6 +1458,27 @@ unrecov_slb: bl unrecoverable_exception b 1b +8: mfspr r11,SPRN_SRR0 + ld r10,PACAKBASE(r13) + LOAD_HANDLER(r10,bad_addr_slb) + mtspr SPRN_SRR0,r10 + ld r10,PACAKMSR(r13) + mtspr SPRN_SRR1,r10 + rfid + b . + +bad_addr_slb: + EXCEPTION_PROLOG_COMMON(0x380, PACA_EXSLB) + RECONCILE_IRQ_STATE(r10, r11) + ld r3, PACA_EXSLB+EX_DAR(r13) + std r3, _DAR(r1) + beq c
[PATCH 3/3] powerpc/mm: Speed up computation of base and actual page size for a HPTE
This replaces a 2-D search through an array with a simple 8-bit table lookup for determining the actual and/or base page size for a HPT entry. The encoding in the second doubleword of the HPTE is designed to encode the actual and base page sizes without using any more bits than would be needed for a 4k page number, by using between 1 and 8 low-order bits of the RPN (real page number) field to encode the page sizes. A single "large page" bit in the first doubleword indicates that these low-order bits are to be interpreted like this. We can determine the page sizes by using the low-order 8 bits of the RPN to look up a 256-entry table. For actual page sizes less than 1MB, some of the upper bits of these 8 bits are going to be real address bits, but we can cope with that by replicating the entries for those smaller page sizes. While we're at it, let's move the hpte_page_size() and hpte_base_page_size() functions from a KVM-specific header to a header for 64-bit HPT systems, since this computation doesn't have anything specifically to do with KVM. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 37 arch/powerpc/include/asm/kvm_book3s_64.h | 87 +++ arch/powerpc/include/asm/mmu.h| 1 + arch/powerpc/mm/hash_native_64.c | 42 + arch/powerpc/mm/hash_utils_64.c | 37 5 files changed, 84 insertions(+), 120 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 287a656..e407af2 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -245,6 +245,43 @@ static inline int segment_shift(int ssize) } /* + * This array is indexed by the LP field of the HPTE second dword. + * Since this field may contain some RPN bits, some entries are + * replicated so that we get the same value irrespective of RPN. + * The top 4 bits are the page size index (MMU_PAGE_*) for the + * actual page size, the bottom 4 bits are the base page size. + */ +extern u8 hpte_page_sizes[1 << LP_BITS]; + +static inline unsigned long __hpte_page_size(unsigned long h, unsigned long l, +bool is_base_size) +{ + unsigned int i, lp; + + if (!(h & HPTE_V_LARGE)) + return 1ul << 12; + + /* Look at the 8 bit LP value */ + lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1); + i = hpte_page_sizes[lp]; + if (!i) + return 0; + if (!is_base_size) + i >>= 4; + return 1ul << mmu_psize_defs[i & 0xf].shift; +} + +static inline unsigned long hpte_page_size(unsigned long h, unsigned long l) +{ + return __hpte_page_size(h, l, 0); +} + +static inline unsigned long hpte_base_page_size(unsigned long h, unsigned long l) +{ + return __hpte_page_size(h, l, 1); +} + +/* * The current system page and segment sizes */ extern int mmu_kernel_ssize; diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 88d17b4..4ffd5a1 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -20,6 +20,8 @@ #ifndef __ASM_KVM_BOOK3S_64_H__ #define __ASM_KVM_BOOK3S_64_H__ +#include + #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE static inline struct kvmppc_book3s_shadow_vcpu *svcpu_get(struct kvm_vcpu *vcpu) { @@ -97,56 +99,20 @@ static inline void __unlock_hpte(__be64 *hpte, unsigned long hpte_v) hpte[0] = cpu_to_be64(hpte_v); } -static inline int __hpte_actual_psize(unsigned int lp, int psize) -{ - int i, shift; - unsigned int mask; - - /* start from 1 ignoring MMU_PAGE_4K */ - for (i = 1; i < MMU_PAGE_COUNT; i++) { - - /* invalid penc */ - if (mmu_psize_defs[psize].penc[i] == -1) - continue; - /* -* encoding bits per actual page size -*PTE LP actual page size -* rrrz >=8KB -* rrzz >=16KB -* rzzz >=32KB -* >=64KB -* ... -*/ - shift = mmu_psize_defs[i].shift - LP_SHIFT; - if (shift > LP_BITS) - shift = LP_BITS; - mask = (1 << shift) - 1; - if ((lp & mask) == mmu_psize_defs[psize].penc[i]) - return i; - } - return -1; -} - static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r, unsigned long pte_index) { - int b_psize = MMU_PAGE_4K, a_psize = MMU_PAGE_4K; + int i, b_psize = MMU_PAGE_4K, a_psize = MMU_PAGE_4K; unsigned int penc; unsigned long rb = 0, va_low, sllp; unsi
[PATCH 1/3] powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
In commit c60ac5693c47 ("powerpc: Update kernel VSID range", 2013-03-13) we lost a check on the region number (the top four bits of the effective address) for addresses below PAGE_OFFSET. That commit replaced a check that the top 18 bits were all zero with a check that bits 46 - 59 were zero (performed for all addresses, not just user addresses). This means that userspace can access an address like 0x1000_0xxx__ and we will insert a valid SLB entry for it. The VSID used will be the same as if the top 4 bits were 0, but the page size will be some random value obtained by indexing beyond the end of the mm_ctx_high_slices_psize array in the paca. If that page size is the same as would be used for region 0, then userspace just has an alias of the region 0 space. If the page size is different, then no HPTE will be found for the access, and the process will get a SIGSEGV (since hash_page_mm() will refuse to create a HPTE for the bogus address). The access beyond the end of the mm_ctx_high_slices_psize can be at most 5.5MB past the array, and so will be in RAM somewhere. Since the access is a load performed in real mode, it won't fault or crash the kernel. At most this bug could perhaps leak a little bit of information about blocks of 32 bytes of memory located at offsets of i * 512kB past the paca->mm_ctx_high_slices_psize array, for 1 <= i <= 11. Cc: sta...@vger.kernel.org # v3.10+ Cc: Aneesh Kumar K.V Signed-off-by: Paul Mackerras --- arch/powerpc/mm/slb_low.S | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S index dfdb90c..9f19834 100644 --- a/arch/powerpc/mm/slb_low.S +++ b/arch/powerpc/mm/slb_low.S @@ -113,7 +113,12 @@ BEGIN_FTR_SECTION END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT) b slb_finish_load_1T -0: +0: /* +* For userspace addresses, make sure this is region 0. +*/ + cmpdi r9, 0 + bne 8f + /* when using slices, we extract the psize off the slice bitmaps * and then we need to get the sllp encoding off the mmu_psize_defs * array. -- 2.7.4