[PATCH] SB600 for the Nemo board has non-zero devices on non-root bus
On 3. Dec 2017, at 00:02, Olof Johansson wrote: > >> On Sat, Dec 2, 2017 at 3:00 PM, Olof Johansson wrote: >> >> The below patch, together with Bjorn's, should do it. Christian, can you test >> and report back? >> >> I'm guessing it won't do any harm to set this on non-X1000 platforms. My >> test system is currently powered down so I can't check. >> >> >> From a3b390277627b0342c8ccfc16e58679e0d8abdde Mon Sep 17 00:00:00 2001 >> From: Olof Johansson >> Date: Sat, 2 Dec 2017 14:56:36 -0800 >> Subject: [PATCH] powerpc/pasemi: set PCI_SCAN_ALL_PCI_DEVS >> >> Needed on Amiga X1000 with SB600. >> >> Reported-by: Christian Zigotzky >> Cc: Bjorn Helgaas >> Signed-off-by: Olof Johansson >> --- >> arch/powerpc/platforms/pasemi/pci.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/arch/powerpc/platforms/pasemi/pci.c >> b/arch/powerpc/platforms/pasemi/pci.c >> index 5ff6108..ea54ed2 100644 >> --- a/arch/powerpc/platforms/pasemi/pci.c >> +++ b/arch/powerpc/platforms/pasemi/pci.c >> @@ -224,6 +224,8 @@ void __init pas_pci_init(void) >>return; >>} >> >> + pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS): > > Typo, should be ';', not ':'. I obviously didn't even try compiling this. :) > > > -Olof Hi Olof, Thanks a lot for your patch! I will test it on Wednesday. Cheers, Christian
Re: Memory corruption in powerpc guests with virtio_balloon (was Re: [PATCH v3] virtio_balloon: fix deadlock on OOM)
"Michael S. Tsirkin" writes: > On Fri, Dec 01, 2017 at 11:31:08PM +1100, Michael Ellerman wrote: >> "Michael S. Tsirkin" writes: >> >> > fill_balloon doing memory allocations under balloon_lock >> > can cause a deadlock when leak_balloon is called from >> > virtballoon_oom_notify and tries to take same lock. ... >> >> >> Somehow this commit seems to be killing powerpc guests. ... > > Thanks for the report! > A fix was just posted: > virtio_balloon: fix increment of vb->num_pfns in fill_balloon() > > Would appreciate testing. Yep that fixes it. Thanks. Tested-by: Michael Ellerman cheers
Re: [rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's
On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin wrote: > Seems like a reasonable approach. Why do we only do this for > powernv? It seems like a good idea in general to pull all > offlined CPUs out and into the same state for all platforms > and for all shutdown/restart/crash paths. > The reason is largely wake-up related, do we expect offline CPUs to wake up in the kdump kernel. Largely the infrastructure allows us to selectively decide what platforms need this support. I did not want to break the world by enabling it across platforms (pseries for example) without good reason. > Also I wonder if there is anything we should do on the other > side of the equation for the kdump kernel to pull CPUs into a > known state rather than rely on the crash kernel to do it for > us. We might have a better ability to do that with system > reset IPIs now. > Yes, but do we need to do that or quickly dump the vmcore to a file and exit? Ideally we want the original kernel to wake offline cpus as appropriate (as we do with kexec) and send them to opal_reinit_cpus and make them spin on kexec_spin_wait(). The kdump kernel boots with maxcpus=1 and leaves them spinning > We still need to support platforms without NMI IPIs, so we > still need this patch as well. > True Thanks for the review! Balbir Singh
Re: [PATCH] KVM: PPC: Book3S HV: check for XIVE device before executing the XICS hcalls
On Mon, Nov 27, 2017 at 08:30:17AM +0100, Cédric Le Goater wrote: > When QEMU is started with the option kernel_irqchip=òff, the kvm XICS > hcalls are being used even though a kvm XICS device has not been > created on the host, resulting quickly in a failure and a broken > guest. > > The test checking if there is a XIVE device in the VM before executing > the XICS hcalls is missing from the recent XICS-over-XIVE glue. > > Signed-off-by: Cédric Le Goater I think this is fixing the same bug that commit 00bb6ae50062 ("KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabled", 2017-10-26) addresses. Do you think this patch is needed in addition to 00bb6ae50062? Paul.
[PATCH 1/3] arch/powerpc/hugetlb: Use pte_access_permitted for hugetlb access check
No functional change in this patch. This update gup_hugepte to use the helper. This will help later when we add memory keys. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/hugetlbpage.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index a9b9083c5e49..c7e5afe5e118 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -855,9 +855,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, pte = READ_ONCE(*ptep); - if (!pte_present(pte) || !pte_read(pte)) - return 0; - if (write && !pte_write(pte)) + if (!pte_access_permitted(pte, write)) return 0; /* hugepages are never "special" */ -- 2.14.3
[PATCH 2/3] powerpc/mm/book3s/64: Add proper pte access check helper
pte_access_premitted get called in get_user_pages_fast path. If we have marked the pte PROT_NONE, we should not allow a read access on the address. With the current implementation we are not checking the READ and only check for WRITE. This is needed on archs like ppc64 that implement PROT_NONE using RWX access instead of _PAGE_PRESENT. Also add pte_user check just to make sure we are not accessing kernel mapping. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 41 1 file changed, 41 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 2006fec40a00..efa5bd6bc1d3 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -520,6 +520,30 @@ static inline int pte_present(pte_t pte) { return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT)); } + +#define pte_access_permitted pte_access_permitted +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + unsigned long pteval = pte_val(pte); + /* Also check for pte_user */ + unsigned long clear_pte_bits = _PAGE_PRIVILEGED; + /* +* _PAGE_READ is needed for any access and will be +* cleared for PROT_NONE +*/ + unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_READ; + + if (write) + need_pte_bits |= _PAGE_WRITE; + + if ((pteval & need_pte_bits) != need_pte_bits) + return false; + + if ((pteval & clear_pte_bits) == clear_pte_bits) + return false; + return true; +} + /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. @@ -824,6 +848,11 @@ static inline int pud_bad(pud_t pud) return hash__pud_bad(pud); } +#define pud_access_permitted pud_access_permitted +static inline bool pud_access_permitted(pud_t pud, bool write) +{ + return pte_access_permitted(pud_pte(pud), write); +} #define pgd_write(pgd) pte_write(pgd_pte(pgd)) static inline void pgd_set(pgd_t *pgdp, unsigned long val) @@ -863,6 +892,12 @@ static inline int pgd_bad(pgd_t pgd) return hash__pgd_bad(pgd); } +#define pgd_access_permitted pgd_access_permitted +static inline bool pgd_access_permitted(pgd_t pgd, bool write) +{ + return pte_access_permitted(pgd_pte(pgd), write); +} + extern struct page *pgd_page(pgd_t pgd); /* Pointers in the page table tree are physical addresses */ @@ -984,6 +1019,12 @@ static inline int pmd_protnone(pmd_t pmd) #define __pmd_write(pmd) __pte_write(pmd_pte(pmd)) #define pmd_savedwrite(pmd)pte_savedwrite(pmd_pte(pmd)) +#define pmd_access_permitted pmd_access_permitted +static inline bool pmd_access_permitted(pmd_t pmd, bool write) +{ + return pte_access_permitted(pmd_pte(pmd), write); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); -- 2.14.3
[PATCH 3/3] powerpc/mm/ Add proper pte access check helper
pte_access_premitted get called in get_user_pages_fast path. If we have marked the pte PROT_NONE, we should not allow a read access on the address. With the current implementation we are not checking the READ and only check for WRITE. This is needed on archs like ppc64 that implement PROT_NONE using _PAGE_USER access instead of _PAGE_PRESENT. Also add pte_user check just to make sure we are not accessing kernel mapping. Even though there is code duplication, keeping the low level pte accessors different for different platforms helps in code readability. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/32/pgtable.h | 23 +++ arch/powerpc/include/asm/nohash/pgtable.h| 23 +++ 2 files changed, 46 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 016579ef16d3..30a155c0a6b0 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -311,6 +311,29 @@ static inline int pte_present(pte_t pte) return pte_val(pte) & _PAGE_PRESENT; } +/* + * We only find page table entry in the last level + * Hence no need for other accessors + */ +#define pte_access_permitted pte_access_permitted +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + unsigned long pteval = pte_val(pte); + /* +* A read-only access is controlled by _PAGE_USER bit. +* We have _PAGE_READ set for WRITE and EXECUTE +*/ + unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER; + + if (write) + need_pte_bits |= _PAGE_WRITE; + + if ((pteval & need_pte_bits) != need_pte_bits) + return false; + + return true; +} + /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. * diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index 5c68f4a59f75..fc4376c8d444 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -45,6 +45,29 @@ static inline int pte_present(pte_t pte) return pte_val(pte) & _PAGE_PRESENT; } +/* + * We only find page table entry in the last level + * Hence no need for other accessors + */ +#define pte_access_permitted pte_access_permitted +static inline bool pte_access_permitted(pte_t pte, bool write) +{ + unsigned long pteval = pte_val(pte); + /* +* A read-only access is controlled by _PAGE_USER bit. +* We have _PAGE_READ set for WRITE and EXECUTE +*/ + unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER; + + if (write) + need_pte_bits |= _PAGE_WRITE; + + if ((pteval & need_pte_bits) != need_pte_bits) + return false; + + return true; +} + /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. * -- 2.14.3
[PATCH v2] powerpc/64s: ISAv3 initialize MMU registers before setting partition table
kexec can leave MMU registers set when booting into a new kernel, PIDR in particular. The boot sequence does not zero PIDR, so it only gets set when CPUs first switch to a userspace processes (until then it's running a kernel thread with effective PID = 0). This leaves a window where a process table entry and page tables are set up due to user processes running on other CPUs, that happen to match with a stale PID. The CPU with that PID may cause speculative accesses that address quadrant 0, which will result in cached translations and PWC for that process, on a CPU which is not in the mm_cpumask and so they will not get invalidated properly. The most common result is the kernel hanging in infinite page fault loops soon after kexec (usually in schedule_tail, which is usually the first non-speculative quardant 0 access to a new PID) due to a stale PWC. However being a stale translation erorr, it could result in anything up to security and data corruption errors. Fix this by zeroing out PIDR before setting PTCR. LPIDR is also not initialized, and may cause a similar issue with speculative access to quadrant 1/2. This has not been observed, but LPIDR is cleared to prevent that possibility. Signed-off-by: Nicholas Piggin --- Since v1: - Aneesh noticed hash was not clearing LPID - Improved changelog - Improved comments Please review. Thanks, Nick arch/powerpc/mm/hash_utils_64.c | 19 +-- arch/powerpc/mm/pgtable-radix.c | 10 ++ 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 655a5a9a183d..d9f3a72f3981 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1029,13 +1029,24 @@ void __init hash__early_init_mmu(void) pci_io_base = ISA_IO_BASE; #endif + /* +* kexec kernels can boot into the new kernel with PID and LPID +* registers non-zero. Zero them to prevent speculative accesses +* setting up cached translations when we turn the MMU on and +* process/partition table entries start being added. +*/ + if (cpu_has_feature(CPU_FTR_ARCH_300)) + mtspr(SPRN_PID, 0); + /* Select appropriate backend */ if (firmware_has_feature(FW_FEATURE_PS3_LV1)) ps3_early_mm_init(); else if (firmware_has_feature(FW_FEATURE_LPAR)) hpte_init_pseries(); - else if (IS_ENABLED(CONFIG_PPC_NATIVE)) + else if (IS_ENABLED(CONFIG_PPC_NATIVE)) { + mtspr(SPRN_LPID, 0); hpte_init_native(); + } if (!mmu_hash_ops.hpte_insert) panic("hash__early_init_mmu: No MMU hash ops defined!\n"); @@ -1055,11 +1066,15 @@ void __init hash__early_init_mmu(void) void hash__early_init_mmu_secondary(void) { /* Initialize hash table for that CPU */ - if (!firmware_has_feature(FW_FEATURE_LPAR)) { + if (cpu_has_feature(CPU_FTR_ARCH_300)) + mtspr(SPRN_PID, 0); + if (!firmware_has_feature(FW_FEATURE_LPAR)) { if (cpu_has_feature(CPU_FTR_POWER9_DD1)) update_hid_for_hash(); + mtspr(SPRN_LPID, 0); + if (!cpu_has_feature(CPU_FTR_ARCH_300)) mtspr(SPRN_SDR1, _SDR1); else diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c index cfbbee941a76..0c13355ddc2a 100644 --- a/arch/powerpc/mm/pgtable-radix.c +++ b/arch/powerpc/mm/pgtable-radix.c @@ -563,10 +563,18 @@ void __init radix__early_init_mmu(void) __pte_frag_nr = H_PTE_FRAG_NR; __pte_frag_size_shift = H_PTE_FRAG_SIZE_SHIFT; + /* +* kexec kernels can boot into the new kernel with PID and LPID +* registers non-zero. Zero them to prevent speculative accesses +* setting up cached translations when we turn the MMU on and +* process/partition table entries start being added. +*/ + mtspr(SPRN_PID, 0); if (!firmware_has_feature(FW_FEATURE_LPAR)) { radix_init_native(); if (cpu_has_feature(CPU_FTR_POWER9_DD1)) update_hid_for_radix(); + mtspr(SPRN_LPID, 0); lpcr = mfspr(SPRN_LPCR); mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR); radix_init_partition_table(); @@ -587,10 +595,12 @@ void radix__early_init_mmu_secondary(void) /* * update partition table control register and UPRT */ + mtspr(SPRN_PID, 0); if (!firmware_has_feature(FW_FEATURE_LPAR)) { if (cpu_has_feature(CPU_FTR_POWER9_DD1)) update_hid_for_radix(); + mtspr(SPRN_LPID, 0); lpcr = mfspr(SPRN_LPCR); mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR); -- 2.15.0
Re: [rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's
On Mon, 4 Dec 2017 11:37:01 +1100 Balbir Singh wrote: > On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin wrote: > > Seems like a reasonable approach. Why do we only do this for > > powernv? It seems like a good idea in general to pull all > > offlined CPUs out and into the same state for all platforms > > and for all shutdown/restart/crash paths. > > > > The reason is largely wake-up related, do we expect offline CPUs to wake > up in the kdump kernel. Largely the infrastructure allows us to selectively > decide what platforms need this support. I did not want to break the world > by enabling it across platforms (pseries for example) without good reason. What happens if a pseries offlined CPU gets an exception for some reason though? It seems like it would return into pseries_mach_cpu_die of the old kernel which will go wrong. Maybe the platform has stronger guarantees that it won't wake up there, like requiring a specific hcall or something? I was just thinking trying to move all platforms in general to the same scheme would be preferable, unless there is a good reason not to. Just for sharing code and behaviour. > > > Also I wonder if there is anything we should do on the other > > side of the equation for the kdump kernel to pull CPUs into a > > known state rather than rely on the crash kernel to do it for > > us. We might have a better ability to do that with system > > reset IPIs now. > > > > Yes, but do we need to do that or quickly dump the vmcore to a file > and exit? Well if the previous kernel did not shut them down properly, we need to do that. Don't we? My point is the previous kernel crashed somehow, we should be trying to fix everything up rather than hoping it crashed "nicely" for us. Yes we shouldn't disturb things as much as possible, but we've booted an entire new kernel in its own reserved memory, so I'm not sure if it's such a concern to try fixing up wayward CPUs. Thanks, Nick
Re: [PATCH] powerpc/powernv : Add support to enable sensor groups
Shilpasri G Bhat writes: > On 11/28/2017 05:07 PM, Michael Ellerman wrote: >> Shilpasri G Bhat writes: >> >>> Adds support to enable/disable a sensor group. This can be used to >>> select the sensor groups that needs to be copied to main memory by >>> OCC. Sensor groups like power, temperature, current, voltage, >>> frequency, utilization can be enabled/disabled at runtime. >>> >>> Signed-off-by: Shilpasri G Bhat >>> --- >>> The skiboot patch for the opal call is posted below: >>> https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html >> >> Can you remind me why we're doing this with a completely bespoke sysfs >> API, rather than using some generic sensors API? >> > > Disabling/Enabling sensor groups is not supported in the current generic > sensors > API. And also we dont export all type of sensors in HWMON as not all of them > are > environment sensors (like performance). Are there barriers to adding such concepts to the generic sensors API? -- Stewart Smith OPAL Architect, IBM.
Re: [PATCH v1] PCI: Make PCI_SCAN_ALL_PCIE_DEVS work for Root as well as Downstream Ports
Bjorn Helgaas writes: > On Fri, Dec 01, 2017 at 06:27:10PM -0600, Bjorn Helgaas wrote: >> From: Bjorn Helgaas >> >> PCIe Downstream Ports normally have only a Device 0 below them. To >> optimize enumeration, we don't scan for other devices *unless* the >> PCI_SCAN_ALL_PCIE_DEVS flag is set by set by quirks or the >> "pci=pcie_scan_all" kernel parameter. >> >> Previously PCI_SCAN_ALL_PCIE_DEVS only affected scanning below Switch >> Downstream Ports, not Root Ports. >> >> But the "Nemo" system, also known as the AmigaOne X1000, has a PA Semi Root >> Port whose link leads to an AMD/ATI SB600 South Bridge. The Root Port is a >> PCIe device, of course, but the SB600 contains only conventional PCI >> devices with no visible PCIe port. >> >> Simplify and restructure only_one_child() so that we scan for all possible >> devices below Root Ports as well as Switch Downstream Ports when >> PCI_SCAN_ALL_PCIE_DEVS is set. >> >> This is enough to make Nemo work with "pci=pcie_scan_all". We would also >> like to add a quirk to set PCI_SCAN_ALL_PCIE_DEVS automatically on Nemo so >> users wouldn't have to use the "pci=pcie_scan_all" parameter, but we don't >> have that yet. >> >> Link: >> https://lkml.kernel.org/r/CAErSpo55Q8Q=5p6_+uu7ahnw+53ibvdnrxxrzrv9qnur_9e...@mail.gmail.com >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=198057 >> Reported-and-Tested-by: Christian Zigotzky >> Signed-off-by: Bjorn Helgaas > > Applied to pci/enumeration for v4.16. Thanks Bjorn. cheers
Re: [PATCH] powerpc/40x: acadia: Fix the 'interrupt-parent' property
Fabio Estevam writes: > Hi Michael, > > On Tue, Oct 17, 2017 at 11:01 AM, Fabio Estevam wrote: >> 'interrupts-parent' property does not exist. Fix the typo. >> >> Fixes: 00f3ca740a9c26 ("powerpc/40x: AMCC PowerPC 405EZ Acadia DTS") >> Signed-off-by: Fabio Estevam >> --- >> arch/powerpc/boot/dts/acadia.dts | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/boot/dts/acadia.dts >> b/arch/powerpc/boot/dts/acadia.dts >> index 57291f6..8626615 100644 >> --- a/arch/powerpc/boot/dts/acadia.dts >> +++ b/arch/powerpc/boot/dts/acadia.dts >> @@ -183,7 +183,7 @@ >> usb@ef603000 { >> compatible = "ohci-be"; >> reg = <0xef603000 0x80>; >> - interrupts-parent = <&UIC0>; >> + interrupt-parent = <&UIC0>; > > Any comments? Nope :) Will pick it up for 4.16. cheers
Re: [PATCH v5 0/3] Prepartion for SR-IOV PowerVM Enablement
On Thu, 2017-11-09 at 08:00 -0600, Bryant G. Ly wrote: > v1 - Initial patch > v2 - Addressed Bjorn's comment on creating a highly platform > dependent global exported symbol. > v3 - Based patch off linux-ppc/master > v4 - Using the sriov-drivers_autoprobe mechanism per Bjorn's request > v5 - Fixed comments and commit message Looks good to me, for the whole series: Acked-by: Russell Currey > > Bryant G. Ly (3): > powerpc/kernel: Separate SR-IOV Calls > pseries: Add PSeries SR-IOV Machine dependent calls > PCI/IOV: Add pci_vf_drivers_autoprobe() interface > > arch/powerpc/include/asm/machdep.h | 7 ++ > arch/powerpc/include/asm/pci-bridge.h| 4 +--- > arch/powerpc/kernel/eeh_driver.c | 4 ++-- > arch/powerpc/kernel/pci-common.c | 23 > +++ > arch/powerpc/kernel/pci_dn.c | 6 - > arch/powerpc/platforms/powernv/eeh-powernv.c | 33 ++-- > > arch/powerpc/platforms/powernv/pci-ioda.c| 6 +++-- > arch/powerpc/platforms/pseries/eeh_pseries.c | 24 > > arch/powerpc/platforms/pseries/pci.c | 31 > ++ > drivers/pci/iov.c| 11 ++ > include/linux/pci.h | 2 ++ > 11 files changed, 118 insertions(+), 33 deletions(-) >
Re: [PATCH 2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context
Nicholas Piggin writes: > When an interrupt is returning to a soft-disabled context (which can > happen for non-maskable interrupts or synchronous interrupts), it goes > through the motions of soft-disabling again, including calling > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()). > > This is not necessary, because we must already be soft-disabled in the > interrupt context, it also may be causing crashes in the irq tracing > code to re-enter as an nmi. Replace it with a warning to ensure that > soft-interrupts are still disabled. > > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/kernel/entry_64.S | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) So this patch is the core of the bug fix I gather. Git blames says: Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync") Cc: sta...@vger.kernel.org # v3.4+ But I'm wondering how this has been broken that long without us noticing? You hit it doing some sort of perf stress test I think - so is it just that we've never pushed hard enough? Or did something change to expose this? Or we're just not sure? cheers > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > index 3320bcac7192..36878b6ee8b8 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -911,9 +911,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) > beq 1f > rlwinm r7,r7,0,~PACA_IRQ_HARD_DIS > stb r7,PACAIRQHAPPENED(r13) > -1: li r0,0 > - stb r0,PACASOFTIRQEN(r13); > - TRACE_DISABLE_INTS > +1: > +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG) > + /* The interrupt should not have soft enabled. */ > + lbz r7,PACASOFTIRQEN(r13) > +1: tdnei r7,0 > + EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING > +#endif > b .Ldo_restore > > /* > -- > 2.15.0
Re: [PATCH V3 0/9] powerpc: Support for ibm,dynamic-memory-v2
Nathan Fontenot writes: > This patch set provides a series of updates to de-couple the LMB > information provided in the device tree property from the device > tree property format. This eases the ability to support a new > format for the dynamic memory property, ibm,dynamic-memory-v2. Something in here is still blowing up for me in a KVM guest: OF stdout device is: /vdevice/vty@7100 Preparing to boot Linux version 4.14.0-rc2-gcc6x-g9e1fc7e (kerkins@alpine1-p1) (gcc version 6.4.1 20171202 (Custom 6328ca9eaa476138)) #1 SMP Sun Dec 3 21:45:32 AEDT 2017 Detected machine type: 0101 command line: Max number of cores passed to firmware: 256 (NR_CPUS = 2048) Calling ibm,client-architecture-support... done memory layout at init: memory_limit : (16 MB aligned) alloc_bottom : 015c alloc_top: 3000 alloc_top_hi : 0001 rmo_top : 3000 ram_top : 0001 instantiating rtas at 0x2fff... done prom_hold_cpus: skipped copying OF device tree... Building dt strings... Building dt structure... Device tree strings 0x017d -> 0x017d09d8 Device tree struct 0x017e -> 0x017f Quiescing Open Firmware ... Booting Linux via __start() @ 0x0040 ... [0.00] bootconsole [udbg0] enabled [0.00] Allocated 2883584 bytes for 2048 pacas at cfd4 [0.00] hash-mmu: Page sizes from device-tree: [0.00] hash-mmu: base_shift=12: shift=12, sllp=0x, avpnm=0x, tlbiel=1, penc=0 [0.00] hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x, tlbiel=1, penc=1 [0.00] -> fw_vec5_feature_init() [0.00] <- fw_vec5_feature_init() [0.00] -> fw_hypertas_feature_init() [0.00] <- fw_hypertas_feature_init() [0.00] Page orders: linear mapping = 16, virtual = 16, io = 16, vmemmap = 16 [0.00] Using 1TB segments [0.00] hash-mmu: Initializing hash mmu with SLB [0.00] Linux version 4.14.0-rc2-gcc6x-g9e1fc7e (kerkins@alpine1-p1) (gcc version 6.4.1 20171202 (Custom 6328ca9eaa476138)) #1 SMP Sun Dec 3 21:45:32 AEDT 2017 [0.00] Found initrd at 0xc15c:0xc178d70b [0.00] Machine is LPAR ! [0.00] -> pseries_init() [0.00] -> fw_cmo_feature_init() [0.00] CMO not available [0.00] <- fw_cmo_feature_init() [0.00] <- pseries_init() [0.00] Using pSeries machine description [0.00] Partition configured for 16 cpus. [0.00] CPU maps initialized for 8 threads per core [0.00] (thread shift is 3) [0.00] Freed 2818048 bytes for unused pacas [0.00] - [0.00] ppc64_pft_size= 0x19 [0.00] phys_mem_size = 0x1 [0.00] dcache_bsize = 0x80 [0.00] icache_bsize = 0x80 [0.00] cpu_features = 0x17dc7aec18500249 [0.00] possible= 0xdfdf18500649 [0.00] always = 0x18100040 [0.00] cpu_user_features = 0xdc0065c2 0xef00 [0.00] mmu_features = 0x78006001 [0.00] firmware_features = 0x0001405a440b [0.00] htab_hash_mask= 0x3 [0.00] - [0.00] numa: NODE_DATA [mem 0xfff6a300-0xfff73fff] [0.00] -> smp_init_pSeries() [0.00] <- smp_init_pSeries() [0.00] PCI host bridge /pci@8002000 ranges: [0.00] IO 0x01008000..0x01008000 -> 0x [0.00] MEM 0x0100a000..0x01101fff -> 0x8000 [0.00] PPC64 nvram contains 65536 bytes [0.00] Top of RAM: 0x1, Total RAM: 0x1 [0.00] Memory hole size: 0MB [0.00] Zone ranges: [0.00] DMA [mem 0x-0x] [0.00] DMA32empty [0.00] Normal empty [0.00] Movable zone start for each node [0.00] Early memory node ranges [0.00] node 0: [mem 0x-0x] [0.00] Initmem setup node 0 [mem 0x-0x] [0.00] On node 0 totalpages: 65536 [0.00] DMA zone: 64 pages used for memmap [0.00] DMA zone: 0 pages reserved [0.00] DMA zone: 65536 pages, LIFO batch:1 [0.00] percpu: Embedded 4 pages/cpu @c000ffb0 s167064 r0 d95080 u262144 [0.00] pcpu-alloc: s167064 r0 d95080 u262144 alloc=1*1
[PATCH] Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"
This reverts commit a3b2cb30f252b21a6f962e0dd107c8b897ca65e4. The earlier patch tried to fix problems with panic on powerpc in certain circumstances, where some output from the generic panic code was being dropped. Unfortunately, it breaks things worse in other circumstances. In particular when running a PAPR guest, it will now attempt to reboot instead of informing the hypervisor (KVM or PowerVM) that the guest has crashed. The crash notification is important to some virtualization management layers. Since the circumstances in which the original patch helped are somewhat obscure, revert it for now until we figure out how to do it properly. Signed-off-by: David Gibson --- arch/powerpc/include/asm/machdep.h | 1 + arch/powerpc/include/asm/setup.h | 1 + arch/powerpc/kernel/fadump.c | 22 -- arch/powerpc/kernel/setup-common.c | 27 +++ arch/powerpc/platforms/ps3/setup.c | 15 +++ arch/powerpc/platforms/pseries/setup.c | 1 + 6 files changed, 45 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 73b92017b6d7..cd2fc1cc1cc7 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -76,6 +76,7 @@ struct machdep_calls { void __noreturn (*restart)(char *cmd); void __noreturn (*halt)(void); + void(*panic)(char *str); void(*cpu_die)(void); long(*time_init)(void); /* Optional, may be NULL */ diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h index 257d23dbf55d..cf00ec26303a 100644 --- a/arch/powerpc/include/asm/setup.h +++ b/arch/powerpc/include/asm/setup.h @@ -24,6 +24,7 @@ extern void reloc_got2(unsigned long); void check_for_initrd(void); void initmem_init(void); +void setup_panic(void); #define ARCH_PANIC_TIMEOUT 180 #ifdef CONFIG_PPC_PSERIES diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 04ea5c04fd24..3c2c2688918f 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -1462,25 +1462,6 @@ static void fadump_init_files(void) return; } -static int fadump_panic_event(struct notifier_block *this, - unsigned long event, void *ptr) -{ - /* -* If firmware-assisted dump has been registered then trigger -* firmware-assisted dump and let firmware handle everything -* else. If this returns, then fadump was not registered, so -* go through the rest of the panic path. -*/ - crash_fadump(NULL, ptr); - - return NOTIFY_DONE; -} - -static struct notifier_block fadump_panic_block = { - .notifier_call = fadump_panic_event, - .priority = INT_MIN /* may not return; must be done last */ -}; - /* * Prepare for firmware-assisted dump. */ @@ -1513,9 +1494,6 @@ int __init setup_fadump(void) init_fadump_mem_struct(&fdm, fw_dump.reserve_dump_area_start); fadump_init_files(); - atomic_notifier_chain_register(&panic_notifier_list, - &fadump_panic_block); - return 1; } subsys_initcall(setup_fadump); diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 2075322cd225..9d213542a48b 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -704,6 +704,30 @@ int check_legacy_ioport(unsigned long base_port) } EXPORT_SYMBOL(check_legacy_ioport); +static int ppc_panic_event(struct notifier_block *this, + unsigned long event, void *ptr) +{ + /* +* If firmware-assisted dump has been registered then trigger +* firmware-assisted dump and let firmware handle everything else. +*/ + crash_fadump(NULL, ptr); + ppc_md.panic(ptr); /* May not return */ + return NOTIFY_DONE; +} + +static struct notifier_block ppc_panic_block = { + .notifier_call = ppc_panic_event, + .priority = INT_MIN /* may not return; must be done last */ +}; + +void __init setup_panic(void) +{ + if (!ppc_md.panic) + return; + atomic_notifier_chain_register(&panic_notifier_list, &ppc_panic_block); +} + #ifdef CONFIG_CHECK_CACHE_COHERENCY /* * For platforms that have configurable cache-coherency. This function @@ -848,6 +872,9 @@ void __init setup_arch(char **cmdline_p) /* Probe the machine type, establish ppc_md. */ probe_machine(); + /* Setup panic notifier if requested by the platform. */ + setup_panic(); + /* * Configure ppc_md.power_save (ppc32 only, 64-bit machines do * it from their respective probe() function. diff --git a/arch/powerpc/platforms/ps3/setup.c b/arch/powerpc/platforms/ps3/setup.c index 9dabea6e1443..6244bc849469 100644 --- a/arch/powerpc/platforms/ps3/setup.c +++ b/a
Re: a3b2cb30 broken panic reporting for qemu guests
On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote: > On Wed, 29 Nov 2017 15:06:52 +1100 > David Gibson wrote: > > > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier" > > purports to fix a problem when the kernel panics with fadump not > > registered, but it breaks something else instead. I _think_ it was > > working on the incorrect assumption that ppc_md.panic was (or should > > be) only used with fadump, but I'm not really sure. > > > > Panic works with kdump enabled, and (I think) with fadump enabled). > > However, with neither of these enabled, we always go to the generic > > panic logic. > > Yeah thanks, I can't remember what assumption I was working on tbh. > > > That's incorrect for PAPR guests - they should call ibm,os-term via > > RTAS. Under qemu this leads to a "GUEST_PANICKED" event notification > > which higher-level management pays attention to. Since a3b2cb30 we > > now reboot instead of reporting that. > > > > I believe it will also break panic for PS3 machines, but since that > > platform basically no longer exists, we probably don't care. > > I (hope) it should just go down to the normal panic path and not do > much worse than it already does -- although it won't print out that > message. Sounds plausible. > > I'm not entirely sure how to fix this. I _think_ what we want is to > > call ppc_md.panic from a late panic notifier, the way this patch does > > for fadump_panic_event() if fadump is registered. > > The problem I had there is that some of the printk and console stuff > wasn't getting flushed out, so I was getting a blank screen. This was > probably in conjunction with panicing from NMI context that we're now > starting to introduce. Ok. What was the exact bit of panic() that wasn't getting invoked that needed to be? AFAICT ppc_md.panic was already being called at the end of the panic notifiers, by using INT_MIN priority. Note that this is the same way that the pvpanic device (used on x86 for similar panic notification functionality) does it. Well, pvpanic uses priority 1, which seems less thorough than INT_MIN. > So it's a bit annoying. There's other ugliness we have for being unable > to control panic code well enough from arch code > (arch/powerpc/platforms/powernv/opal.c) > > I guess a really minimal fix is to put an #ifdef powerpc down the bottom > there (/me *cries*). That would work for the PAPR os-term thing, but wouldn't if we ever had a specific device that worked like pvpanic. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: a3b2cb30 broken panic reporting for qemu guests
On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote: > On Fri, 01 Dec 2017 22:11:50 +1100 > Michael Ellerman wrote: > > > David Gibson writes: > > > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote: > > >> On Wed, 29 Nov 2017 15:06:52 +1100 > > >> David Gibson wrote: > > >> > > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier" > > >> > purports to fix a problem when the kernel panics with fadump not > > >> > registered, but it breaks something else instead. I _think_ it was > > >> > working on the incorrect assumption that ppc_md.panic was (or should > > >> > be) only used with fadump, but I'm not really sure. > > >> > > > >> > Panic works with kdump enabled, and (I think) with fadump enabled). > > >> > However, with neither of these enabled, we always go to the generic > > >> > panic logic. > > >> > > >> Yeah thanks, I can't remember what assumption I was working on tbh. > > >> > > >> > That's incorrect for PAPR guests - they should call ibm,os-term via > > >> > RTAS. Under qemu this leads to a "GUEST_PANICKED" event notification > > >> > which higher-level management pays attention to. Since a3b2cb30 we > > >> > now reboot instead of reporting that. > > >> > > > >> > I believe it will also break panic for PS3 machines, but since that > > >> > platform basically no longer exists, we probably don't care. > > >> > > >> I (hope) it should just go down to the normal panic path and not do > > >> much worse than it already does -- although it won't print out that > > >> message. > > >> > > >> > I'm not entirely sure how to fix this. I _think_ what we want is to > > >> > call ppc_md.panic from a late panic notifier, the way this patch does > > >> > for fadump_panic_event() if fadump is registered. > > >> > > >> The problem I had there is that some of the printk and console stuff > > >> wasn't getting flushed out, so I was getting a blank screen. This was > > >> probably in conjunction with panicing from NMI context that we're now > > >> starting to introduce. > > >> > > >> So it's a bit annoying. There's other ugliness we have for being unable > > >> to control panic code well enough from arch code > > >> (arch/powerpc/platforms/powernv/opal.c) > > >> > > >> I guess a really minimal fix is to put an #ifdef powerpc down the bottom > > >> there (/me *cries*). > > > > > > Um.. right. I'm not really sure from that how to go forward from > > > here. We want to fix this for RHEL7.5, which doesn't give us a lot of > > > time. > > > > > > Adding the #ifdef at the bottom of the generic panic code is gross, > > > but there's already a bunch of that, so maybe adequate until a better > > > solution can be found? > > > > I think you mean put an #ifdef at the bottom of panic(). If so that > > won't work. Our default panic_timeout is 180 so we never get to the > > bottom of panic(), we call emergency_restart(). > > > > You *could* put an #ifdef powerpc before that, but that's even more > > gross because it's in a different place to the sparc/s390 #ifdefs. > > > > I notice we don't implement machine_emergency_restart(), it just > > becomes machine_restart(NULL). > > > > So it seems like that's the place we should be hooking, > > machine_emergency_restart(). That's what x86 does. > > > > But panic() is not the only caller of emergency_restart(), so it's not > > an entirely straight forward conversion. > > > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a > > bigger rework for 4.16. > > > > Nick that shouldn't break your original aim too badly I think? ie. worst > > case is we panic() but don't see the output if we came from NMI? > > If the fix is not pretty obvious, then I guess revert. We actually > do have a bit of a regression though, since we've started marking > system reset interrupts as NMI. Previously a system reset would have > a better chance of printing something there. > > So I wonder is an ifdef really all that much worse just because it's not > in the same exact place as the others? We do get bug reports that were > triggered by a system reset from hypervisor console. Hopefully they would > be running with crash dumps usually. I don't think an ifdef there is really correct though. Sounds like we might instead need to move some console flushes before calling of all the notifiers, so that things like platforms/pseries or pvpanic can insert non-returning notifiers. Or else we need a different mechanism from the existing panic notifiers for hooking in a "how to die" function from platform or device. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: a3b2cb30 broken panic reporting for qemu guests
On Fri, Dec 01, 2017 at 10:11:50PM +1100, Michael Ellerman wrote: > David Gibson writes: > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote: > >> On Wed, 29 Nov 2017 15:06:52 +1100 > >> David Gibson wrote: > >> > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier" > >> > purports to fix a problem when the kernel panics with fadump not > >> > registered, but it breaks something else instead. I _think_ it was > >> > working on the incorrect assumption that ppc_md.panic was (or should > >> > be) only used with fadump, but I'm not really sure. > >> > > >> > Panic works with kdump enabled, and (I think) with fadump enabled). > >> > However, with neither of these enabled, we always go to the generic > >> > panic logic. > >> > >> Yeah thanks, I can't remember what assumption I was working on tbh. > >> > >> > That's incorrect for PAPR guests - they should call ibm,os-term via > >> > RTAS. Under qemu this leads to a "GUEST_PANICKED" event notification > >> > which higher-level management pays attention to. Since a3b2cb30 we > >> > now reboot instead of reporting that. > >> > > >> > I believe it will also break panic for PS3 machines, but since that > >> > platform basically no longer exists, we probably don't care. > >> > >> I (hope) it should just go down to the normal panic path and not do > >> much worse than it already does -- although it won't print out that > >> message. > >> > >> > I'm not entirely sure how to fix this. I _think_ what we want is to > >> > call ppc_md.panic from a late panic notifier, the way this patch does > >> > for fadump_panic_event() if fadump is registered. > >> > >> The problem I had there is that some of the printk and console stuff > >> wasn't getting flushed out, so I was getting a blank screen. This was > >> probably in conjunction with panicing from NMI context that we're now > >> starting to introduce. > >> > >> So it's a bit annoying. There's other ugliness we have for being unable > >> to control panic code well enough from arch code > >> (arch/powerpc/platforms/powernv/opal.c) > >> > >> I guess a really minimal fix is to put an #ifdef powerpc down the bottom > >> there (/me *cries*). > > > > Um.. right. I'm not really sure from that how to go forward from > > here. We want to fix this for RHEL7.5, which doesn't give us a lot of > > time. > > > > Adding the #ifdef at the bottom of the generic panic code is gross, > > but there's already a bunch of that, so maybe adequate until a better > > solution can be found? > > I think you mean put an #ifdef at the bottom of panic(). If so that > won't work. Our default panic_timeout is 180 so we never get to the > bottom of panic(), we call emergency_restart(). > > You *could* put an #ifdef powerpc before that, but that's even more > gross because it's in a different place to the sparc/s390 #ifdefs. > > I notice we don't implement machine_emergency_restart(), it just > becomes machine_restart(NULL). > > So it seems like that's the place we should be hooking, > machine_emergency_restart(). That's what x86 does. Only sort of. machine_emergency_restart() is in basically the right place, but AFAICT it is supposed to restart the machine, which os-term (by design) does not. x86 uses this for a restart, when using pvpanic which provides similar functionality to os-term, that gets called before reaching emergency_reset(). > But panic() is not the only caller of emergency_restart(), so it's not > an entirely straight forward conversion. > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a > bigger rework for 4.16. I've sent a revert for now. > Nick that shouldn't break your original aim too badly I think? ie. worst > case is we panic() but don't see the output if we came from NMI? > > cheers > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
[PATCH V2] powerpc/mm: Invalidate subpage_prot() system call on radix platforms
Radix enabled platforms don't support subpage_prot() system calls. But at present the system call goes through without an error and fails later on while validating expected subpage accesses. Lets not allow the system call on powerpc radix platforms to begin with to prevent this confusion in user space. Signed-off-by: Anshuman Khandual --- Changes in V2: Now it returns -ENOENT error instead as per Michael Ellerman. arch/powerpc/mm/subpage-prot.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c index 781532d..f14a07c 100644 --- a/arch/powerpc/mm/subpage-prot.c +++ b/arch/powerpc/mm/subpage-prot.c @@ -195,6 +195,9 @@ long sys_subpage_prot(unsigned long addr, unsigned long len, u32 __user *map) unsigned long next, limit; int err; + if (radix_enabled()) + return -ENOENT; + /* Check parameters */ if ((addr & ~PAGE_MASK) || (len & ~PAGE_MASK) || addr >= mm->task_size || len >= mm->task_size || -- 1.8.3.1
Re: [PATCH 2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context
On Mon, 04 Dec 2017 16:09:57 +1100 Michael Ellerman wrote: > Nicholas Piggin writes: > > > When an interrupt is returning to a soft-disabled context (which can > > happen for non-maskable interrupts or synchronous interrupts), it goes > > through the motions of soft-disabling again, including calling > > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()). > > > > This is not necessary, because we must already be soft-disabled in the > > interrupt context, it also may be causing crashes in the irq tracing > > code to re-enter as an nmi. Replace it with a warning to ensure that > > soft-interrupts are still disabled. > > > > Signed-off-by: Nicholas Piggin > > --- > > arch/powerpc/kernel/entry_64.S | 10 +++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > So this patch is the core of the bug fix I gather. > > Git blames says: > > Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state > getting out of sync") > Cc: sta...@vger.kernel.org # v3.4+ > > But I'm wondering how this has been broken that long without us > noticing? You hit it doing some sort of perf stress test I think - so is > it just that we've never pushed hard enough? Or did something change to > expose this? Or we're just not sure? I'm not really sure. A customer hit it, during either a stress test or long running workload with lockdep irq tracing and perf running at the same time. I don't have a lot more details but we might be able to get some offline if necessary. Thanks, Nick
Re: a3b2cb30 broken panic reporting for qemu guests
On Mon, 4 Dec 2017 16:49:14 +1100 David Gibson wrote: > On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote: > > On Fri, 01 Dec 2017 22:11:50 +1100 > > Michael Ellerman wrote: > > > > > David Gibson writes: > > > > > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote: > > > >> On Wed, 29 Nov 2017 15:06:52 +1100 > > > >> David Gibson wrote: > > > >> > > > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic notifier" > > > >> > purports to fix a problem when the kernel panics with fadump not > > > >> > registered, but it breaks something else instead. I _think_ it was > > > >> > working on the incorrect assumption that ppc_md.panic was (or should > > > >> > be) only used with fadump, but I'm not really sure. > > > >> > > > > >> > Panic works with kdump enabled, and (I think) with fadump enabled). > > > >> > However, with neither of these enabled, we always go to the generic > > > >> > panic logic. > > > >> > > > >> Yeah thanks, I can't remember what assumption I was working on tbh. > > > >> > > > >> > That's incorrect for PAPR guests - they should call ibm,os-term via > > > >> > RTAS. Under qemu this leads to a "GUEST_PANICKED" event notification > > > >> > which higher-level management pays attention to. Since a3b2cb30 we > > > >> > now reboot instead of reporting that. > > > >> > > > > >> > I believe it will also break panic for PS3 machines, but since that > > > >> > platform basically no longer exists, we probably don't care. > > > >> > > > >> I (hope) it should just go down to the normal panic path and not do > > > >> much worse than it already does -- although it won't print out that > > > >> message. > > > >> > > > >> > I'm not entirely sure how to fix this. I _think_ what we want is to > > > >> > call ppc_md.panic from a late panic notifier, the way this patch does > > > >> > for fadump_panic_event() if fadump is registered. > > > >> > > > >> The problem I had there is that some of the printk and console stuff > > > >> wasn't getting flushed out, so I was getting a blank screen. This was > > > >> probably in conjunction with panicing from NMI context that we're now > > > >> starting to introduce. > > > >> > > > >> So it's a bit annoying. There's other ugliness we have for being unable > > > >> to control panic code well enough from arch code > > > >> (arch/powerpc/platforms/powernv/opal.c) > > > >> > > > >> I guess a really minimal fix is to put an #ifdef powerpc down the > > > >> bottom > > > >> there (/me *cries*). > > > > > > > > Um.. right. I'm not really sure from that how to go forward from > > > > here. We want to fix this for RHEL7.5, which doesn't give us a lot of > > > > time. > > > > > > > > Adding the #ifdef at the bottom of the generic panic code is gross, > > > > but there's already a bunch of that, so maybe adequate until a better > > > > solution can be found? > > > > > > I think you mean put an #ifdef at the bottom of panic(). If so that > > > won't work. Our default panic_timeout is 180 so we never get to the > > > bottom of panic(), we call emergency_restart(). > > > > > > You *could* put an #ifdef powerpc before that, but that's even more > > > gross because it's in a different place to the sparc/s390 #ifdefs. > > > > > > I notice we don't implement machine_emergency_restart(), it just > > > becomes machine_restart(NULL). > > > > > > So it seems like that's the place we should be hooking, > > > machine_emergency_restart(). That's what x86 does. > > > > > > But panic() is not the only caller of emergency_restart(), so it's not > > > an entirely straight forward conversion. > > > > > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a > > > bigger rework for 4.16. > > > > > > Nick that shouldn't break your original aim too badly I think? ie. worst > > > case is we panic() but don't see the output if we came from NMI? > > > > If the fix is not pretty obvious, then I guess revert. We actually > > do have a bit of a regression though, since we've started marking > > system reset interrupts as NMI. Previously a system reset would have > > a better chance of printing something there. > > > > So I wonder is an ifdef really all that much worse just because it's not > > in the same exact place as the others? We do get bug reports that were > > triggered by a system reset from hypervisor console. Hopefully they would > > be running with crash dumps usually. > > I don't think an ifdef there is really correct though. Sounds like we > might instead need to move some console flushes before calling of all > the notifiers, so that things like platforms/pseries or pvpanic can > insert non-returning notifiers. Well, I don't necessarily think that's the right thing to do either. If we have a crash dump handler, we should get the console memory and rather do that immediately than try to flush it, no? I think a helper function that would do all the necessary ge
Re: [PATCH 0/4] Fix use after free in HPT resizing code and related minor improvements
On Wed, Nov 29, 2017 at 11:38:22AM -0500, Serhii Popovych wrote: > It is possible to trigger use after free during HPT resize > causing host kernel to crash. More details and analysis of > the problem can be found in change with corresponding subject > (KVM: PPC: Book3S HV: Fix use after free in case of multiple > resize requests). > > We need some changes to prepare for the fix, especially > make ->error in HPT resize instance single point for > tracking allocation state, improve kvmppc_allocate_hpt() > and kvmppc_free_hpt() so they can be used more safely. > > See individual commit description message to get more > information on changes presented. I spoke with Paul Mackerras about these patches on IRC today. We want this as a fix, ASAP, in 4.15. However, he's uncomfortable with pushing some of extra cleanups which aren't necessary for the bug fix this late for 4.15, and was having trouble following what was the core of the fix. He was also nervous about the addition of more BUG_ON()s. To avoid the round trip to Ukraine time and back, I've made revised versions of patches 1 & 3 which should apply standalone, replaced the BUG_ON()s with WARN_ON()s and revised the commit messages to better explain the crucial part of the fix. However, I've run out of time to test them. Serhii, I'll send you my revised patches shortly. Can you please test them and repost. Then you can rebase patches 2 & 4 from this series on top of the revised patches and post those separately (as a cleanup with less urgency than the actual fix). A couple of people have also suggested CCing k...@vger.kernel.org on the next round in addition to the lists already included. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: a3b2cb30 broken panic reporting for qemu guests
On Mon, Dec 04, 2017 at 04:12:06PM +1000, Nicholas Piggin wrote: > On Mon, 4 Dec 2017 16:49:14 +1100 > David Gibson wrote: > > > On Fri, Dec 01, 2017 at 09:40:38PM +1000, Nicholas Piggin wrote: > > > On Fri, 01 Dec 2017 22:11:50 +1100 > > > Michael Ellerman wrote: > > > > > > > David Gibson writes: > > > > > > > > > On Wed, Nov 29, 2017 at 02:23:43PM +1000, Nicholas Piggin wrote: > > > > >> On Wed, 29 Nov 2017 15:06:52 +1100 > > > > >> David Gibson wrote: > > > > >> > > > > >> > a3b2cb30 "powerpc: Do not call ppc_md.panic in fadump panic > > > > >> > notifier" > > > > >> > purports to fix a problem when the kernel panics with fadump not > > > > >> > registered, but it breaks something else instead. I _think_ it was > > > > >> > working on the incorrect assumption that ppc_md.panic was (or > > > > >> > should > > > > >> > be) only used with fadump, but I'm not really sure. > > > > >> > > > > > >> > Panic works with kdump enabled, and (I think) with fadump enabled). > > > > >> > However, with neither of these enabled, we always go to the generic > > > > >> > panic logic. > > > > >> > > > > >> Yeah thanks, I can't remember what assumption I was working on tbh. > > > > >> > > > > >> > That's incorrect for PAPR guests - they should call ibm,os-term via > > > > >> > RTAS. Under qemu this leads to a "GUEST_PANICKED" event > > > > >> > notification > > > > >> > which higher-level management pays attention to. Since a3b2cb30 we > > > > >> > now reboot instead of reporting that. > > > > >> > > > > > >> > I believe it will also break panic for PS3 machines, but since that > > > > >> > platform basically no longer exists, we probably don't care. > > > > >> > > > > >> I (hope) it should just go down to the normal panic path and not do > > > > >> much worse than it already does -- although it won't print out that > > > > >> message. > > > > >> > > > > >> > I'm not entirely sure how to fix this. I _think_ what we want is > > > > >> > to > > > > >> > call ppc_md.panic from a late panic notifier, the way this patch > > > > >> > does > > > > >> > for fadump_panic_event() if fadump is registered. > > > > >> > > > > >> The problem I had there is that some of the printk and console stuff > > > > >> wasn't getting flushed out, so I was getting a blank screen. This was > > > > >> probably in conjunction with panicing from NMI context that we're now > > > > >> starting to introduce. > > > > >> > > > > >> So it's a bit annoying. There's other ugliness we have for being > > > > >> unable > > > > >> to control panic code well enough from arch code > > > > >> (arch/powerpc/platforms/powernv/opal.c) > > > > >> > > > > >> I guess a really minimal fix is to put an #ifdef powerpc down the > > > > >> bottom > > > > >> there (/me *cries*). > > > > > > > > > > Um.. right. I'm not really sure from that how to go forward from > > > > > here. We want to fix this for RHEL7.5, which doesn't give us a lot of > > > > > time. > > > > > > > > > > Adding the #ifdef at the bottom of the generic panic code is gross, > > > > > but there's already a bunch of that, so maybe adequate until a better > > > > > solution can be found? > > > > > > > > I think you mean put an #ifdef at the bottom of panic(). If so that > > > > won't work. Our default panic_timeout is 180 so we never get to the > > > > bottom of panic(), we call emergency_restart(). > > > > > > > > You *could* put an #ifdef powerpc before that, but that's even more > > > > gross because it's in a different place to the sparc/s390 #ifdefs. > > > > > > > > I notice we don't implement machine_emergency_restart(), it just > > > > becomes machine_restart(NULL). > > > > > > > > So it seems like that's the place we should be hooking, > > > > machine_emergency_restart(). That's what x86 does. > > > > > > > > But panic() is not the only caller of emergency_restart(), so it's not > > > > an entirely straight forward conversion. > > > > > > > > So I think for 4.15 and 4.14 I'm inclined to revert. Then we can do a > > > > bigger rework for 4.16. > > > > > > > > Nick that shouldn't break your original aim too badly I think? ie. worst > > > > case is we panic() but don't see the output if we came from NMI? > > > > > > If the fix is not pretty obvious, then I guess revert. We actually > > > do have a bit of a regression though, since we've started marking > > > system reset interrupts as NMI. Previously a system reset would have > > > a better chance of printing something there. > > > > > > So I wonder is an ifdef really all that much worse just because it's not > > > in the same exact place as the others? We do get bug reports that were > > > triggered by a system reset from hypervisor console. Hopefully they would > > > be running with crash dumps usually. > > > > I don't think an ifdef there is really correct though. Sounds like we > > might instead need to move some console flushes before calling of all > > the notifiers,