Re: [v3,2/2] arm64: Add software workaround for Falkor erratum 1041
22
+#define ARM64_NCAPS	23
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 65f42d2..2a752cb 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -37,6 +37,7 @@ ENTRY(__cpu_soft_restart)
 	mrs	x12, sctlr_el1
 	ldr	x13, =SCTLR_ELx_FLAGS
 	bic	x12, x12, x13
+	pre_disable_mmu_workaround
 	msr	sctlr_el1, x12
 	isb
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 0e27f86..2fd1938 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -179,6 +179,22 @@ static int cpu_enable_trap_ctr_access(void *__unused)
 			   MIDR_CPU_VAR_REV(0, 0)),
 	},
 #endif
+#ifdef CONFIG_QCOM_FALKOR_ERRATUM_E1041
+	{
+		.desc = "Qualcomm Technologies Falkor erratum 1041",
+		.capability = ARM64_WORKAROUND_QCOM_FALKOR_E1041,
+		MIDR_RANGE(MIDR_QCOM_FALKOR_V1,
+			   MIDR_CPU_VAR_REV(0, 0),
+			   MIDR_CPU_VAR_REV(0, 0)),
+	},
+	{
+		.desc = "Qualcomm Technologies Falkor erratum 1041",
+		.capability = ARM64_WORKAROUND_QCOM_FALKOR_E1041,
+		MIDR_RANGE(MIDR_QCOM_FALKOR,
+			   MIDR_CPU_VAR_REV(0, 1),
+			   MIDR_CPU_VAR_REV(0, 2)),
+	},
+#endif
 #ifdef CONFIG_ARM64_ERRATUM_858921
 	{
 	/* Cortex-A73 all versions */
diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
index 4e6ad35..dc675ba 100644
--- a/arch/arm64/kernel/efi-entry.S
+++ b/arch/arm64/kernel/efi-entry.S
@@ -96,6 +96,7 @@ ENTRY(entry)
 	mrs	x0, sctlr_el2
 	bic	x0, x0, #1 << 0	// clear SCTLR.M
 	bic	x0, x0, #1 << 2	// clear SCTLR.C
+	pre_disable_mmu_early_workaround
 	msr	sctlr_el2, x0
 	isb
 	b	2f
@@ -103,6 +104,7 @@ ENTRY(entry)
 	mrs	x0, sctlr_el1
 	bic	x0, x0, #1 << 0	// clear SCTLR.M
 	bic	x0, x0, #1 << 2	// clear SCTLR.C
+	pre_disable_mmu_early_workaround
 	msr	sctlr_el1, x0
 	isb
 2:
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 0b243ec..a807fca 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -732,6 +732,7 @@ __primary_switch:
 	 * to take into account by discarding the current kernel mapping and
 	 * creating a new one.
 	 */
+	pre_disable_mmu_early_workaround
 	msr	sctlr_el1, x20			// disable the MMU
 	isb
 	bl	__create_page_tables		// recreate kernel mapping
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index ce704a4..f407e42 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -45,6 +45,7 @@ ENTRY(arm64_relocate_new_kernel)
 	mrs	x0, sctlr_el2
 	ldr	x1, =SCTLR_ELx_FLAGS
 	bic	x0, x0, x1
+	pre_disable_mmu_workaround
 	msr	sctlr_el2, x0
 	isb
 1:
diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
index 3f96155..870828c 100644
--- a/arch/arm64/kvm/hyp-init.S
+++ b/arch/arm64/kvm/hyp-init.S
@@ -151,6 +151,7 @@ reset:
 	mrs	x5, sctlr_el2
 	ldr	x6, =SCTLR_ELx_FLAGS
 	bic	x5, x5, x6		// Clear SCTL_M and etc
+	pre_disable_mmu_workaround
 	msr	sctlr_el2, x5
 	isb

I applied v3 of the erratum 1041 patches to the Ubuntu Artful 4.13 kernel and ran the stress-ng and VM create/stop/restart tests, as I did for the previous version of this patch series. The tests ran successfully on the QDF2400 platform, and I did not observe any regressions on the Artful 4.13 kernel.

Tested-by: Manoj Iyer

--
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud

--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
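Note on the cpu_errata.c entries in the patch above: each MIDR_RANGE() entry names a CPU model plus a closed range of (variant, revision) pairs. The following is a minimal C sketch of that kind of range check, not the kernel's implementation. The helper names var_rev() and midr_in_range() are hypothetical, and the sketch assumes the architectural MIDR_EL1 layout (variant in bits [23:20], revision in bits [3:0]).

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 layout assumed here: variant in bits [23:20], revision in bits [3:0]. */
#define VARIANT_SHIFT	20
#define VARIANT_MASK	(UINT32_C(0xf) << VARIANT_SHIFT)
#define REVISION_MASK	UINT32_C(0xf)

/* Hypothetical helper: pack a (variant, revision) pair into MIDR bit positions. */
static inline uint32_t var_rev(uint32_t variant, uint32_t revision)
{
	return ((variant << VARIANT_SHIFT) & VARIANT_MASK) |
	       (revision & REVISION_MASK);
}

/*
 * Hypothetical helper: true if `midr` identifies `model` (all bits other
 * than variant and revision) and its variant/revision lies in [min, max].
 */
static inline bool midr_in_range(uint32_t midr, uint32_t model,
				 uint32_t min, uint32_t max)
{
	uint32_t model_bits = midr & ~(VARIANT_MASK | REVISION_MASK);
	uint32_t vr = midr & (VARIANT_MASK | REVISION_MASK);

	return model_bits == model && vr >= min && vr <= max;
}
```

With a range of var_rev(0, 1) to var_rev(0, 2), as in the second entry of the patch, a part reporting variant 0, revision 0 would not match, while revisions 1 and 2 would.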
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
On Fri, 10 Nov 2017, Manoj Iyer wrote:

On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

Looks like my VM test raised a false alarm. I retested stock Artful 4.13 kernel (No erratum 1041 patches applied).

James, an update on the crash (false alarm). We suspect this is a firmware crash due to a possible firmware bug. Once this is addressed I will be able to send you the test results you requested on VM start/stop with the erratum 1041 patches applied.

James/Shanker, I can report that VM start/stop/restart tests worked with the patches applied to Ubuntu 4.13 (Artful) kernel on the QDF2400 hardware.

Host: Ubuntu 4.13 with Erratum 1041 patches applied
Guest: Stock Ubuntu 4.13 kernel

- create 20 vms one at a time
10 iterations of:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Tested-by: Manoj Iyer

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

And, I am able to reproduce the system reset issue I previously reported. I think the problem I reported with VMs might have nothing to do with the erratum 1041 patches, and probably needs to be root-caused separately.

With stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login:
[ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
[ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
[ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
[ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
[ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
[ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
[ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
[ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
[ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
[ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
[ 465.283028] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

On Thu, 9 Nov 2017, Manoj Iyer wrote:

On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

(sorry for top-posting)

Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Start 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Fixing some confusion I might have introduced in my previous email.

- Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

The system resets itself after starting the last VM on the 1st loop, displaying the following:

awrep6 login:
[ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
[ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
[ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
[ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
[ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

Followed by the following messages on system reboot:

[ 6.616891] BERT: Error records from previous boot:
[ 6.621655] [Hardware Error]: event severity: fatal
[ 6.626516] [Hardware Error]: imprecise tstamp: -00-00 00:00:00
[ 6.632851] [Hardware Error]: Error 0, type: fatal
[ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 6.646045] [Hardware Error]: section length: 0x238
[ 6.651082] [Hardware Error]: : 72724502 5220726f 6f736165 6e55206e .Error Reason Un
[ 6.659761] [Hardware Error]: 0010: 776f6e6b 006e known...
[ 6.668442] [Hardware Error]: 0020:
[ 6.677122] [Hardware Error]: 0030:

On Thu, 9 Nov 2017, James Morse wrote:

Hi Manoj,

On 08/11/17 19:05, Manoj Iyer wrote:

On Thu, 2 Nov 2017, Shanker Donthineni wrote:

The ARM architecture defines the memory locations that are permitted to be accessed as the result of a speculative instruction fetch from an exception level for which all stages of translation are disabled. Specifically, the core is permitted to speculatively fetch from the 4KB region containing the current program counter and next 4KB.

When translation is changed from enabled to disabled for the running exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the Falkor core may errantly speculatively access memory locations outside of the 4KB region permitted by the architecture. The errant memory access may lead to one of the following unexpected behaviors.

I applied the 3 patches to Ubuntu 4.13.0-16-ge
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

Looks like my VM test raised a false alarm. I retested stock Artful 4.13 kernel (No erratum 1041 patches applied).

James, an update on the crash (false alarm). We suspect this is a firmware crash due to a possible firmware bug. Once this is addressed I will be able to send you the test results you requested on VM start/stop with the erratum 1041 patches applied.

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

And, I am able to reproduce the system reset issue I previously reported. I think the problem I reported with VMs might have nothing to do with the erratum 1041 patches, and probably needs to be root-caused separately.

With stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login:
[ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
[ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
[ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
[ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
[ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
[ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
[ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
[ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
[ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
[ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
[ 465.283028] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

On Thu, 9 Nov 2017, Manoj Iyer wrote:

On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

(sorry for top-posting)

Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Start 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Fixing some confusion I might have introduced in my previous email.

- Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

The system resets itself after starting the last VM on the 1st loop, displaying the following:

awrep6 login:
[ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
[ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
[ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
[ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
[ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

Followed by the following messages on system reboot:

[ 6.616891] BERT: Error records from previous boot:
[ 6.621655] [Hardware Error]: event severity: fatal
[ 6.626516] [Hardware Error]: imprecise tstamp: -00-00 00:00:00
[ 6.632851] [Hardware Error]: Error 0, type: fatal
[ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 6.646045] [Hardware Error]: section length: 0x238
[ 6.651082] [Hardware Error]: : 72724502 5220726f 6f736165 6e55206e .Error Reason Un
[ 6.659761] [Hardware Error]: 0010: 776f6e6b 006e known...
[ 6.668442] [Hardware Error]: 0020:
[ 6.677122] [Hardware Error]: 0030:

On Thu, 9 Nov 2017, James Morse wrote:

Hi Manoj,

On 08/11/17 19:05, Manoj Iyer wrote:

On Thu, 2 Nov 2017, Shanker Donthineni wrote:

The ARM architecture defines the memory locations that are permitted to be accessed as the result of a speculative instruction fetch from an exception level for which all stages of translation are disabled. Specifically, the core is permitted to speculatively fetch from the 4KB region containing the current program counter and next 4KB.

When translation is changed from enabled to disabled for the running exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the Falkor core may errantly speculatively access memory locations outside of the 4KB region permitted by the architecture. The errant memory access may lead to one of the following unexpected behaviors.

I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and ran stress-ng cpu tests on QDF2400 server

[...]

Where stress-ng would spawn N workers and test cpu offline/online, perform matrix operations, do rapid context switches, and anonymous mmaps. Although I was not able to reproduce the erratum on the stock 4.13 kernel using the same test case, the patched kernel did not seem to introduce any regressions either. I ran the stress-ng tests for over 8 hours and found the syst
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
James,

Looks like my VM test raised a false alarm. I retested stock Artful 4.13 kernel (No erratum 1041 patches applied).

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

And, I am able to reproduce the system reset issue I previously reported. I think the problem I reported with VMs might have nothing to do with the erratum 1041 patches, and probably needs to be root-caused separately.

With stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login:
[ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
[ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
[ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
[ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
[ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
[ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
[ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
[ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
[ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
[ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
[ 465.283028] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

On Thu, 9 Nov 2017, Manoj Iyer wrote:

On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

(sorry for top-posting)

Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Start 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Fixing some confusion I might have introduced in my previous email.

- Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

The system resets itself after starting the last VM on the 1st loop, displaying the following:

awrep6 login:
[ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
[ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
[ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
[ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
[ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

Followed by the following messages on system reboot:

[ 6.616891] BERT: Error records from previous boot:
[ 6.621655] [Hardware Error]: event severity: fatal
[ 6.626516] [Hardware Error]: imprecise tstamp: -00-00 00:00:00
[ 6.632851] [Hardware Error]: Error 0, type: fatal
[ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 6.646045] [Hardware Error]: section length: 0x238
[ 6.651082] [Hardware Error]: : 72724502 5220726f 6f736165 6e55206e .Error Reason Un
[ 6.659761] [Hardware Error]: 0010: 776f6e6b 006e known...
[ 6.668442] [Hardware Error]: 0020:
[ 6.677122] [Hardware Error]: 0030:

On Thu, 9 Nov 2017, James Morse wrote:

Hi Manoj,

On 08/11/17 19:05, Manoj Iyer wrote:

On Thu, 2 Nov 2017, Shanker Donthineni wrote:

The ARM architecture defines the memory locations that are permitted to be accessed as the result of a speculative instruction fetch from an exception level for which all stages of translation are disabled. Specifically, the core is permitted to speculatively fetch from the 4KB region containing the current program counter and next 4KB.

When translation is changed from enabled to disabled for the running exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the Falkor core may errantly speculatively access memory locations outside of the 4KB region permitted by the architecture. The errant memory access may lead to one of the following unexpected behaviors.

I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and ran stress-ng cpu tests on QDF2400 server

[...]

Where stress-ng would spawn N workers and test cpu offline/online, perform matrix operations, do rapid context switches, and anonymous mmaps. Although I was not able to reproduce the erratum on the stock 4.13 kernel using the same test case, the patched kernel did not seem to introduce any regressions either. I ran the stress-ng tests for over 8 hours and found the system to be stable.

Could you throw kexec and KVM into the mix?

This issue only shows up when we disable the MMU, which we almost never do. For CPU offline/online we make the PSCI 'offline' call with the MMU enabled. When the CPU comes back firmware has reset the EL2/EL1 SCTL
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
On Thu, 9 Nov 2017, Manoj Iyer wrote:

James,

(sorry for top-posting)

Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Start 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Fixing some confusion I might have introduced in my previous email.

- Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Created 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

The system resets itself after starting the last VM on the 1st loop, displaying the following:

awrep6 login:
[ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
[ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
[ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
[ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
[ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

Followed by the following messages on system reboot:

[ 6.616891] BERT: Error records from previous boot:
[ 6.621655] [Hardware Error]: event severity: fatal
[ 6.626516] [Hardware Error]: imprecise tstamp: -00-00 00:00:00
[ 6.632851] [Hardware Error]: Error 0, type: fatal
[ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 6.646045] [Hardware Error]: section length: 0x238
[ 6.651082] [Hardware Error]: : 72724502 5220726f 6f736165 6e55206e .Error Reason Un
[ 6.659761] [Hardware Error]: 0010: 776f6e6b 006e known...
[ 6.668442] [Hardware Error]: 0020:
[ 6.677122] [Hardware Error]: 0030:

On Thu, 9 Nov 2017, James Morse wrote:

Hi Manoj,

On 08/11/17 19:05, Manoj Iyer wrote:

On Thu, 2 Nov 2017, Shanker Donthineni wrote:

The ARM architecture defines the memory locations that are permitted to be accessed as the result of a speculative instruction fetch from an exception level for which all stages of translation are disabled. Specifically, the core is permitted to speculatively fetch from the 4KB region containing the current program counter and next 4KB.

When translation is changed from enabled to disabled for the running exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the Falkor core may errantly speculatively access memory locations outside of the 4KB region permitted by the architecture. The errant memory access may lead to one of the following unexpected behaviors.

I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and ran stress-ng cpu tests on QDF2400 server

[...]

Where stress-ng would spawn N workers and test cpu offline/online, perform matrix operations, do rapid context switches, and anonymous mmaps. Although I was not able to reproduce the erratum on the stock 4.13 kernel using the same test case, the patched kernel did not seem to introduce any regressions either. I ran the stress-ng tests for over 8 hours and found the system to be stable.

Could you throw kexec and KVM into the mix?

This issue only shows up when we disable the MMU, which we almost never do. For CPU offline/online we make the PSCI 'offline' call with the MMU enabled. When the CPU comes back firmware has reset the EL2/EL1 SCTLR from a higher exception level, so it won't hit this issue.

One place we do this is kexec, where we drop into purgatory with the MMU disabled. The other is KVM unloading itself to return to the hyp stub. You can stress this by starting and stopping a VM. When the number of VMs reaches 0 KVM should unload via 'kvm_arch_hardware_disable()'.

Thanks,

James

--
====
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
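Note on the BERT dump quoted in this message: the hex column is printed as 32-bit words, and the ASCII column on the right corresponds to those words unpacked least significant byte first. The C sketch below is only an illustration of that decoding. The helper name words_to_ascii() is hypothetical, and the little-endian word order is an assumption inferred from the dump itself, where 72724502 lines up with ".Err".

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: unpack `n` 32-bit words into characters, least
 * significant byte first, writing '.' for non-printable bytes.
 * `out` must have room for 4 * n + 1 bytes.
 */
static inline void words_to_ascii(const uint32_t *words, size_t n, char *out)
{
	for (size_t i = 0; i < n; i++) {
		for (int b = 0; b < 4; b++) {
			unsigned char c = (words[i] >> (8 * b)) & 0xff;

			out[4 * i + b] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
		}
	}
	out[4 * n] = '\0';
}
```

Applied to the four words of the first dump row (72724502 5220726f 6f736165 6e55206e), this yields the same ".Error Reason Un" text shown beside them, so the record appears to carry only an "Error Reason Unknown" string.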
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
James,

(sorry for top-posting)

Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
- Start 20 VMs one at a time
In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

The system resets itself after starting the last VM on the 1st loop, displaying the following:

awrep6 login:
[ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
[ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
[ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
[ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
[ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

Followed by the following messages on system reboot:

[ 6.616891] BERT: Error records from previous boot:
[ 6.621655] [Hardware Error]: event severity: fatal
[ 6.626516] [Hardware Error]: imprecise tstamp: -00-00 00:00:00
[ 6.632851] [Hardware Error]: Error 0, type: fatal
[ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 6.646045] [Hardware Error]: section length: 0x238
[ 6.651082] [Hardware Error]: : 72724502 5220726f 6f736165 6e55206e .Error Reason Un
[ 6.659761] [Hardware Error]: 0010: 776f6e6b 006e known...
[ 6.668442] [Hardware Error]: 0020:
[ 6.677122] [Hardware Error]: 0030:

On Thu, 9 Nov 2017, James Morse wrote:

Hi Manoj,

On 08/11/17 19:05, Manoj Iyer wrote:

On Thu, 2 Nov 2017, Shanker Donthineni wrote:

The ARM architecture defines the memory locations that are permitted to be accessed as the result of a speculative instruction fetch from an exception level for which all stages of translation are disabled. Specifically, the core is permitted to speculatively fetch from the 4KB region containing the current program counter and next 4KB.

When translation is changed from enabled to disabled for the running exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the Falkor core may errantly speculatively access memory locations outside of the 4KB region permitted by the architecture. The errant memory access may lead to one of the following unexpected behaviors.

I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and ran stress-ng cpu tests on QDF2400 server

[...]

Where stress-ng would spawn N workers and test cpu offline/online, perform matrix operations, do rapid context switches, and anonymous mmaps. Although I was not able to reproduce the erratum on the stock 4.13 kernel using the same test case, the patched kernel did not seem to introduce any regressions either. I ran the stress-ng tests for over 8 hours and found the system to be stable.

Could you throw kexec and KVM into the mix?

This issue only shows up when we disable the MMU, which we almost never do. For CPU offline/online we make the PSCI 'offline' call with the MMU enabled. When the CPU comes back firmware has reset the EL2/EL1 SCTLR from a higher exception level, so it won't hit this issue.

One place we do this is kexec, where we drop into purgatory with the MMU disabled. The other is KVM unloading itself to return to the hyp stub. You can stress this by starting and stopping a VM. When the number of VMs reaches 0 KVM should unload via 'kvm_arch_hardware_disable()'.

Thanks,

James

--
====
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
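Note on the erratum description quoted in this message: with translation disabled, speculative instruction fetch is permitted only from the 4KB region containing the current program counter and the next 4KB. A minimal C sketch of that address window follows. The function name fetch_permitted() is hypothetical, and the sketch assumes "4KB region" means a 4KB-aligned block; it ignores wrap-around at the top of the address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define REGION_SIZE	UINT64_C(0x1000)	/* 4KB, as stated in the erratum text */

/*
 * Hypothetical helper: true if `addr` lies in the 4KB-aligned region that
 * contains `pc`, or in the 4KB region immediately after it.
 */
static inline bool fetch_permitted(uint64_t pc, uint64_t addr)
{
	uint64_t base = pc & ~(REGION_SIZE - 1);	/* start of pc's region */

	return addr >= base && addr - base < 2 * REGION_SIZE;
}
```

For a program counter of 0x80001234 the permitted window under these assumptions would be 0x80001000 through 0x80002fff. The erratum concerns accesses that fall outside such a window when SCTLR_ELn[M] goes from 1 to 0.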
Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
(MIDR_QCOM_FALKOR_V1,
+			   MIDR_CPU_VAR_REV(0, 0),
+			   MIDR_CPU_VAR_REV(0, 0)),
+	},
+	{
+		.desc = "Qualcomm Technologies Falkor erratum 1041",
+		.capability = ARM64_WORKAROUND_QCOM_FALKOR_E1041,
+		MIDR_RANGE(MIDR_QCOM_FALKOR,
+			   MIDR_CPU_VAR_REV(0, 1),
+			   MIDR_CPU_VAR_REV(0, 2)),
+	},
+#endif
 #ifdef CONFIG_ARM64_ERRATUM_858921
 	{
 	/* Cortex-A73 all versions */
diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
index acae627..c31be1b 100644
--- a/arch/arm64/kernel/efi-entry.S
+++ b/arch/arm64/kernel/efi-entry.S
@@ -96,14 +96,14 @@ ENTRY(entry)
 	read_sctlr el2, x0
 	bic	x0, x0, #1 << 0	// clear SCTLR.M
 	bic	x0, x0, #1 << 2	// clear SCTLR.C
-	write_sctlr el2, x0
+	early_write_sctlr el2, x0
 	isb
 	b	2f
 1:
 	read_sctlr el1, x0
 	bic	x0, x0, #1 << 0	// clear SCTLR.M
 	bic	x0, x0, #1 << 2	// clear SCTLR.C
-	write_sctlr el1, x0
+	early_write_sctlr el1, x0
 	isb
 2:
 	/* Jump to kernel entry point */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index b8d5b73..9512ce7 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -511,7 +511,7 @@ install_el2_stub:
 	mov	x0, #0x0800			// Set/clear RES{1,0} bits
 CPU_BE(	movk	x0, #0x33d0, lsl #16)	// Set EE and E0E on BE systems
 CPU_LE(	movk	x0, #0x30d0, lsl #16)	// Clear EE and E0E on LE systems
-	write_sctlr el1, x0
+	early_write_sctlr el1, x0

 	/* Coprocessor traps. */
 	mov	x0, #0x33ff
@@ -732,7 +732,7 @@ __primary_switch:
 	 * to take into account by discarding the current kernel mapping and
 	 * creating a new one.
 	 */
-	write_sctlr el1, x20		// disable the MMU
+	early_write_sctlr el1, x20	// disable the MMU
 	isb
 	bl	__create_page_tables	// recreate kernel mapping

I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and ran stress-ng cpu tests on QDF2400 server as follows:

sudo ./stress-ng --pathological -v --cpu 100 --cpu-load 80 --cpu-method all --cpu-online 500 --matrix 100 --matrix-method all --matrix-size 8192 --vm 10 --vm-hang 10 --vm-method all --switch 100 --numa 100

Where stress-ng would spawn N workers and test cpu offline/online, perform matrix operations, do rapid context switches, and anonymous mmaps. Although I was not able to reproduce the erratum on the stock 4.13 kernel using the same test case, the patched kernel did not seem to introduce any regressions either. I ran the stress-ng tests for over 8 hours and found the system to be stable.

Tested-by: Manoj Iyer

Regards

--
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
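Note on the SCTLR writes touched by the patch above: the assembly clears bit 0 (SCTLR.M) and bit 2 (SCTLR.C) before writing the register back, and the erratum applies when M goes from 1 to 0. The C sketch below only models that bit arithmetic on plain integers. The names sctlr_mmu_off() and m_bit_cleared() are hypothetical and do not appear in the patch, and nothing here touches a real system register.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCTLR_M	(UINT64_C(1) << 0)	/* bit cleared by "clear SCTLR.M" in the patch */
#define SCTLR_C	(UINT64_C(1) << 2)	/* bit cleared by "clear SCTLR.C" in the patch */

/* Hypothetical helper: value left after the two bic instructions. */
static inline uint64_t sctlr_mmu_off(uint64_t sctlr)
{
	return sctlr & ~(SCTLR_M | SCTLR_C);
}

/* Hypothetical helper: true when a write changes M from 1 to 0. */
static inline bool m_bit_cleared(uint64_t old_val, uint64_t new_val)
{
	return (old_val & SCTLR_M) != 0 && (new_val & SCTLR_M) == 0;
}
```

Under this model, only a write for which m_bit_cleared() is true is the kind of transition the workaround is meant to guard. A write that leaves M unchanged is not the case the erratum text describes.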