Re: next/master bisection: baseline.login on rk3288-rock2-square
On Sat, 6 Feb 2021 at 14:10, Guillaume Tucker wrote: > > On 05/02/2021 12:05, Ard Biesheuvel wrote: > > On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel wrote: > >> > >> On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker > >> wrote: > >>> > >>> On 04/02/2021 18:23, Nick Desaulniers wrote: > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor > wrote: > > > > On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang > > Built Linux wrote: > >> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > >>> > >>> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > >>> wrote: > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > wrote: > >> > >> Essentially: > >> > >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > >> CC="ccache clang" zImage > >> > >> This command should link with BFD (and assemble with GAS; it's only > >> using clang as the compiler. > > > > I think you missed the 'LLVM=1' before CC="ccache clang". That should > > use all of the LLVM utilities minus the integrated assembler while > > wrapping clang with ccache. > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should > permit fallback to BFD. > >>> > >>> That was close, except we're cross-compiling with GCC for arm. > >>> So I've now built a plain next-20210203 (without Ard's fix) using > >>> this command line: > >>> > >>> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm > >>> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage > >>> > >>> I'm using a modified Docker image gtucker/kernelci-build-clang-11 > >>> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf > >>> packages added to be able to use the GNU linker. BTW I guess we > >>> should enable this kind of hybrid build setup on kernelci.org as > >>> well. > >>> > >>> Full build log + kernel binaries can be found here: > >>> > >>> > >>> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/ > >>> > >>> And this booted fine, which confirms it's really down to how > >>> ld.lld puts together the kernel image. Does it actually solve > >>> the debate whether this is an issue to fix in the assembly code > >>> or at link time? > >>> > >>> Full test job details for the record: > >>> > >>> https://lava.collabora.co.uk/scheduler/job/3176004 > >>> > >> > >> > >> So the issue appears to be in the way the linker generates the > >> _kernel_bss_size symbol, which obviously has an impact, given that the > >> queued fix takes it into account in the cache_clean operation. > >> > >> On GNU ld, I see > >> > >>479: 00065e14 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size > >> > >> whereas n LLVM ld.lld, I see > >> > >>433: c1c86e98 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size > >> > >> and adding this value may cause the cache clean to operate on unmapped > >> addresses, or cause the addition to wrap and not perform a cache clean > >> at all. > >> > >> AFAICT, this also breaks the appended DTB case in LLVM, so this needs > >> a separate fix in any case. > > > > I pushed a combined branch of torvalds/master, rmk/fixes (still > > containing my 9052/1 fix) and this patch to my for-kernelci branch > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/ > > > > Guillaume, > > > > It seems there is no Clang-11 coverage there, right? Mind giving this > > branch a spin? If this fixes the regressions, we can get these queued > > up. > > That's right, Clang builds are only enabled on linux-next and > mainline at the moment. We could enable it on any other branch > where it makes sense. How about just the main defconfig for arm, > arm64 and x86_64 on your ardb/for-kernelci branch? > Yes, please. > For now I've run another set of builds with clang-11 and got the > following test results with your branch on staging: > > > https://staging.kernelci.org/test/job/ardb/branch/for-kernelci/kernel/v5.11-rc6-146-g923ca344043a/plan/baseline/ > > which are all passing. > > I'll reply to the thread with your patch to confirm. > Excellent, thanks a lot.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 05/02/2021 12:05, Ard Biesheuvel wrote: > On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel wrote: >> >> On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker >> wrote: >>> >>> On 04/02/2021 18:23, Nick Desaulniers wrote: On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor wrote: > > On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang > Built Linux wrote: >> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: >>> >>> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker >>> wrote: On 04/02/2021 15:42, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > wrote: >> >> Essentially: >> >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 >> CC="ccache clang" zImage >> >> This command should link with BFD (and assemble with GAS; it's only >> using clang as the compiler. > > I think you missed the 'LLVM=1' before CC="ccache clang". That should > use all of the LLVM utilities minus the integrated assembler while > wrapping clang with ccache. You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should permit fallback to BFD. >>> >>> That was close, except we're cross-compiling with GCC for arm. >>> So I've now built a plain next-20210203 (without Ard's fix) using >>> this command line: >>> >>> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm >>> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage >>> >>> I'm using a modified Docker image gtucker/kernelci-build-clang-11 >>> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf >>> packages added to be able to use the GNU linker. BTW I guess we >>> should enable this kind of hybrid build setup on kernelci.org as >>> well. >>> >>> Full build log + kernel binaries can be found here: >>> >>> >>> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/ >>> >>> And this booted fine, which confirms it's really down to how >>> ld.lld puts together the kernel image. Does it actually solve >>> the debate whether this is an issue to fix in the assembly code >>> or at link time? >>> >>> Full test job details for the record: >>> >>> https://lava.collabora.co.uk/scheduler/job/3176004 >>> >> >> >> So the issue appears to be in the way the linker generates the >> _kernel_bss_size symbol, which obviously has an impact, given that the >> queued fix takes it into account in the cache_clean operation. >> >> On GNU ld, I see >> >>479: 00065e14 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size >> >> whereas n LLVM ld.lld, I see >> >>433: c1c86e98 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size >> >> and adding this value may cause the cache clean to operate on unmapped >> addresses, or cause the addition to wrap and not perform a cache clean >> at all. >> >> AFAICT, this also breaks the appended DTB case in LLVM, so this needs >> a separate fix in any case. > > I pushed a combined branch of torvalds/master, rmk/fixes (still > containing my 9052/1 fix) and this patch to my for-kernelci branch > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/ > > Guillaume, > > It seems there is no Clang-11 coverage there, right? Mind giving this > branch a spin? If this fixes the regressions, we can get these queued > up. That's right, Clang builds are only enabled on linux-next and mainline at the moment. We could enable it on any other branch where it makes sense. How about just the main defconfig for arm, arm64 and x86_64 on your ardb/for-kernelci branch? For now I've run another set of builds with clang-11 and got the following test results with your branch on staging: https://staging.kernelci.org/test/job/ardb/branch/for-kernelci/kernel/v5.11-rc6-146-g923ca344043a/plan/baseline/ which are all passing. I'll reply to the thread with your patch to confirm. Guillaume
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker > wrote: > > > > On 04/02/2021 18:23, Nick Desaulniers wrote: > > > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor > > > wrote: > > >> > > >> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang > > >> Built Linux wrote: > > >>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > > > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > > wrote: > > > > > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > >> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > >> wrote: > > >>> > > >>> Essentially: > > >>> > > >>> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > > >>> CC="ccache clang" zImage > > >>> > > >>> This command should link with BFD (and assemble with GAS; it's only > > >>> using clang as the compiler. > > >> > > >> I think you missed the 'LLVM=1' before CC="ccache clang". That should > > >> use all of the LLVM utilities minus the integrated assembler while > > >> wrapping clang with ccache. > > > > > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should > > > permit fallback to BFD. > > > > That was close, except we're cross-compiling with GCC for arm. > > So I've now built a plain next-20210203 (without Ard's fix) using > > this command line: > > > > make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm > > CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage > > > > I'm using a modified Docker image gtucker/kernelci-build-clang-11 > > with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf > > packages added to be able to use the GNU linker. BTW I guess we > > should enable this kind of hybrid build setup on kernelci.org as > > well. > > > > Full build log + kernel binaries can be found here: > > > > > > https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/ > > > > And this booted fine, which confirms it's really down to how > > ld.lld puts together the kernel image. Does it actually solve > > the debate whether this is an issue to fix in the assembly code > > or at link time? > > > > Full test job details for the record: > > > > https://lava.collabora.co.uk/scheduler/job/3176004 > > > > > So the issue appears to be in the way the linker generates the > _kernel_bss_size symbol, which obviously has an impact, given that the > queued fix takes it into account in the cache_clean operation. > > On GNU ld, I see > >479: 00065e14 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size > > whereas n LLVM ld.lld, I see > >433: c1c86e98 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size > > and adding this value may cause the cache clean to operate on unmapped > addresses, or cause the addition to wrap and not perform a cache clean > at all. > > AFAICT, this also breaks the appended DTB case in LLVM, so this needs > a separate fix in any case. I pushed a combined branch of torvalds/master, rmk/fixes (still containing my 9052/1 fix) and this patch to my for-kernelci branch https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/ Guillaume, It seems there is no Clang-11 coverage there, right? Mind giving this branch a spin? If this fixes the regressions, we can get these queued up. Thanks, Ard.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker wrote: > > On 04/02/2021 18:23, Nick Desaulniers wrote: > > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor wrote: > >> > >> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang > >> Built Linux wrote: > >>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > wrote: > > > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > >> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > >> wrote: > >>> > >>> Essentially: > >>> > >>> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > >>> CC="ccache clang" zImage > >>> > >>> This command should link with BFD (and assemble with GAS; it's only > >>> using clang as the compiler. > >> > >> I think you missed the 'LLVM=1' before CC="ccache clang". That should > >> use all of the LLVM utilities minus the integrated assembler while > >> wrapping clang with ccache. > > > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should > > permit fallback to BFD. > > That was close, except we're cross-compiling with GCC for arm. > So I've now built a plain next-20210203 (without Ard's fix) using > this command line: > > make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm > CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage > > I'm using a modified Docker image gtucker/kernelci-build-clang-11 > with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf > packages added to be able to use the GNU linker. BTW I guess we > should enable this kind of hybrid build setup on kernelci.org as > well. > > Full build log + kernel binaries can be found here: > > > https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/ > > And this booted fine, which confirms it's really down to how > ld.lld puts together the kernel image. Does it actually solve > the debate whether this is an issue to fix in the assembly code > or at link time? > > Full test job details for the record: > > https://lava.collabora.co.uk/scheduler/job/3176004 > So the issue appears to be in the way the linker generates the _kernel_bss_size symbol, which obviously has an impact, given that the queued fix takes it into account in the cache_clean operation. On GNU ld, I see 479: 00065e14 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size whereas n LLVM ld.lld, I see 433: c1c86e98 0 NOTYPE GLOBAL DEFAULT ABS _kernel_bss_size and adding this value may cause the cache clean to operate on unmapped addresses, or cause the addition to wrap and not perform a cache clean at all. AFAICT, this also breaks the appended DTB case in LLVM, so this needs a separate fix in any case.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 09:31:06PM +, Guillaume Tucker wrote: > On 04/02/2021 18:23, Nick Desaulniers wrote: > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should > > permit fallback to BFD. > > That was close, except we're cross-compiling with GCC for arm. > So I've now built a plain next-20210203 (without Ard's fix) using > this command line: > > make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm > CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage > > I'm using a modified Docker image gtucker/kernelci-build-clang-11 > with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf > packages added to be able to use the GNU linker. BTW I guess we > should enable this kind of hybrid build setup on kernelci.org as > well. ... > And this booted fine, which confirms it's really down to how > ld.lld puts together the kernel image. Does it actually solve > the debate whether this is an issue to fix in the assembly code > or at link time? Well... as I mentioned previously, we don't really understand what is going on between the decompressor running with the caches on, turning the caches off, jumping into the decompressed kernel, and then getting to the v7 setup code. The results from various attempts at solving the problem which lead to Ard's original patch that caused your breakage were not making a whole lot of sense (I think I wrote that all up in a previous email thread, so won't repeat it here.) So, I was slightly nervous about merging Ard's fix - and your report suggested that there is indeed more going on here that we don't understand. When I was tracking down what was going on, I had this patch applied (I've had to recreate it, so may not be exactly what I had), with the DEBUG_LL stuff appropriately enabled. It may be worth applying this patch, enabling the DEBUG_LL stuff appropriately for one of your failing boards, and try booting it. You should get two strings of identical hex numbers that look something like: 480009000401400030004820071d40008090 If they're looking like instructions, for example: ee060f37e3a00080ee020f10ee020f30ee030f10e3a00903ee050f30 Then it's likely that you are seeing a very similar problem as I was without Ard's patch. If you do get instruction-like content, then you will likely find the sequence of instructions in the decompressor code. diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S index 28c9d32fa99a..19fa93ae282c 100644 --- a/arch/arm/mm/proc-v7.S +++ b/arch/arm/mm/proc-v7.S @@ -475,7 +475,39 @@ ENDPROC(cpu_pj4b_do_resume) ldr r12, [r0] add r12, r12, r0@ the local stack stmia r12, {r1-r6, lr}@ v7_invalidate_l1 touches r0-r6 + ldr r0, [r12, #0] + bl printhex8 + ldr r0, [r12, #4] + bl printhex8 + ldr r0, [r12, #8] + bl printhex8 + ldr r0, [r12, #12] + bl printhex8 + ldr r0, [r12, #16] + bl printhex8 + ldr r0, [r12, #20] + bl printhex8 + ldr r0, [r12, #24] + bl printhex8 + mov r0, #'\n' + bl printch bl v7_invalidate_l1 + ldr r0, [r12, #0] + bl printhex8 + ldr r0, [r12, #4] + bl printhex8 + ldr r0, [r12, #8] + bl printhex8 + ldr r0, [r12, #12] + bl printhex8 + ldr r0, [r12, #16] + bl printhex8 + ldr r0, [r12, #20] + bl printhex8 + ldr r0, [r12, #24] + bl printhex8 + mov r0, #'\n' + bl printch ldmia r12, {r1-r6, lr} __v7_setup_cont: -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 04/02/2021 18:23, Nick Desaulniers wrote: > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor wrote: >> >> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built >> Linux wrote: >>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker wrote: > > On 04/02/2021 15:42, Ard Biesheuvel wrote: >> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker >> wrote: >>> >>> Essentially: >>> >>> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 >>> CC="ccache clang" zImage >>> >>> This command should link with BFD (and assemble with GAS; it's only >>> using clang as the compiler. >> >> I think you missed the 'LLVM=1' before CC="ccache clang". That should >> use all of the LLVM utilities minus the integrated assembler while >> wrapping clang with ccache. > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should > permit fallback to BFD. That was close, except we're cross-compiling with GCC for arm. So I've now built a plain next-20210203 (without Ard's fix) using this command line: make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage I'm using a modified Docker image gtucker/kernelci-build-clang-11 with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf packages added to be able to use the GNU linker. BTW I guess we should enable this kind of hybrid build setup on kernelci.org as well. Full build log + kernel binaries can be found here: https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/ And this booted fine, which confirms it's really down to how ld.lld puts together the kernel image. Does it actually solve the debate whether this is an issue to fix in the assembly code or at link time? Full test job details for the record: https://lava.collabora.co.uk/scheduler/job/3176004 Hope that helps, Guillaume
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 04/02/2021 16:01, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > wrote: >> >> On 04/02/2021 15:42, Ard Biesheuvel wrote: >>> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker >>> wrote: On 04/02/2021 10:33, Guillaume Tucker wrote: > On 04/02/2021 10:27, Ard Biesheuvel wrote: >> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin >> wrote: >>> >>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker wrote: > > Hi Ard, > > Please see the bisection report below about a boot failure on > rk3288 with next-20210203. It was also bisected on > imx6q-var-dt6customboard with next-20210202. > > Reports aren't automatically sent to the public while we're > trialing new bisection features on kernelci.org but this one > looks valid. > > The kernel is most likely crashing very early on, so there's > nothing in the logs. Please let us know if you need some help > with debugging or trying a fix on these platforms. > Thanks for the report. >>> >>> Ard, >>> >>> I want to send my fixes branch today which includes your regression >>> fix that caused this regression. >>> >>> As this is proving difficult to fix, I can only drop your fix from >>> my fixes branch - and given that this seems to be problematical, I'm >>> tempted to revert the original change at this point which should fix >>> both of these regressions - and then we have another go at getting rid >>> of the set/way instructions during the next cycle. >>> >>> Thoughts? >>> >> >> Hi Russell, >> >> If Guillaume is willing to do the experiment, and it fixes the issue, > > Yes, I'm running some tests with that fix now and should have > some results shortly. Yes it does fix the issue: https://lava.collabora.co.uk/scheduler/job/3173819 with Ard's fix applied to this test branch: https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ +clang +Nick It's worth mentioning that the issue only happens with kernels built with Clang. As you can see there are several other arm platforms failing with clang-11 builds but booting fine with gcc-8: https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/ Here's a sample build log: https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log Essentially: make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage I believe it should be using the GNU assembler as LLVM_IAS=1 is not defined, but there may be something more subtle about it. >>> >>> >>> Do you have a link for a failing zImage built from multi_v7_defconfig? >> >> Sure, this one was built from a plain next-20210203: >> >> >> http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage >> >> You can also find the dtbs, modules and other things in that same >> directory. >> >> For the record, here's the test job that used it: >> >> https://lava.collabora.co.uk/scheduler/job/3173792 >> > > Thanks. > > That zImage boots fine locally. Unfortunately, I don't have rk3288 > hardware to reproduce. > > Could you please point me to the list of all the other platforms that > failed to boot this image? This is the list of platforms from kernelci.org I've gathered which appeared to be impacted: imx6q-sabrelite imx6q-var-dt6customboard imx6dl-riotboard imx6qp-wandboard-revd1 imx7ulp-evk odroid-xu3 rk3288-rock2-square rk3288-veyron-jaq stm32mp157c-dk2 sun4i-a10-olinuxino-lime sun5i-a13-olinuxino-micro sun7i-a20-cubieboard2 sun7i-a20-olinuxino-lime2 sun8i-a33-olinuxino sun8i-a83t-bananapi-m3 sun8i-h2-plus-libretech-all-h3-cc sun8i-h2-plus-orangepi-r1 sun8i-h2-plus-orangepi-zero sun8i-h3-libretech-all-h3-cc sun8i-h3-bananapi-m2-plus sun8i-h3-orangepi-pc sun8i-r40-bananapi-m2-ultra They were all booting next-20210203 with gcc-8 but not with clang-11. I've run checks on a good share of them with your patch applied and they're now booting with clang-11, just like the rk3288 and imx6q platforms that were used for the bisections. > To be honest, I am slightly annoyed that a change that works fine with > GCC but does not work with Clang version > > 11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158 > > (where exp means experimental, I suppose) is the reason for this Well it's the standard one from the LLVM Debian package repo: deb http://apt.llvm.org/buster/
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor wrote: > > On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built > Linux wrote: > > On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > > > > > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > > > wrote: > > > > > > > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > > > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > > > > wrote: > > > > >> > > > > >> Essentially: > > > > >> > > > > >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > > > > >> CC="ccache clang" zImage > > > > This command should link with BFD (and assemble with GAS; it's only > > using clang as the compiler. > > I think you missed the 'LLVM=1' before CC="ccache clang". That should > use all of the LLVM utilities minus the integrated assembler while > wrapping clang with ccache. You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should permit fallback to BFD. -- Thanks, ~Nick Desaulniers
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built Linux wrote: > On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > > > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > > wrote: > > > > > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > > > wrote: > > > >> > > > >> Essentially: > > > >> > > > >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > > > >> CC="ccache clang" zImage > > This command should link with BFD (and assemble with GAS; it's only > using clang as the compiler. I think you missed the 'LLVM=1' before CC="ccache clang". That should use all of the LLVM utilities minus the integrated assembler while wrapping clang with ccache. Cheers, Nathan
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker > wrote: > > > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > > wrote: > > >> > > >> Essentially: > > >> > > >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 > > >> CC="ccache clang" zImage This command should link with BFD (and assemble with GAS; it's only using clang as the compiler. > > To be honest, I am slightly annoyed that a change that works fine with > GCC but does not work with Clang version > > 11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158 > > (where exp means experimental, I suppose) is the reason for this > discussion, especially because the change is in asm code. Is it > possible to build with Clang but use the GNU linker? rk3288 might be the last 32b ARM platform ChromeOS uses. "veyron" -- Thanks, ~Nick Desaulniers
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker wrote: > > On 04/02/2021 15:42, Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > > wrote: > >> > >> On 04/02/2021 10:33, Guillaume Tucker wrote: > >>> On 04/02/2021 10:27, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin > wrote: > > > > On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: > >> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker > >> wrote: > >>> > >>> Hi Ard, > >>> > >>> Please see the bisection report below about a boot failure on > >>> rk3288 with next-20210203. It was also bisected on > >>> imx6q-var-dt6customboard with next-20210202. > >>> > >>> Reports aren't automatically sent to the public while we're > >>> trialing new bisection features on kernelci.org but this one > >>> looks valid. > >>> > >>> The kernel is most likely crashing very early on, so there's > >>> nothing in the logs. Please let us know if you need some help > >>> with debugging or trying a fix on these platforms. > >>> > >> > >> Thanks for the report. > > > > Ard, > > > > I want to send my fixes branch today which includes your regression > > fix that caused this regression. > > > > As this is proving difficult to fix, I can only drop your fix from > > my fixes branch - and given that this seems to be problematical, I'm > > tempted to revert the original change at this point which should fix > > both of these regressions - and then we have another go at getting rid > > of the set/way instructions during the next cycle. > > > > Thoughts? > > > > Hi Russell, > > If Guillaume is willing to do the experiment, and it fixes the issue, > >>> > >>> Yes, I'm running some tests with that fix now and should have > >>> some results shortly. > >> > >> Yes it does fix the issue: > >> > >> https://lava.collabora.co.uk/scheduler/job/3173819 > >> > >> with Ard's fix applied to this test branch: > >> > >> > >> https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ > >> > >> > >> +clang +Nick > >> > >> It's worth mentioning that the issue only happens with kernels > >> built with Clang. As you can see there are several other arm > >> platforms failing with clang-11 builds but booting fine with > >> gcc-8: > >> > >> > >> https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/ > >> > >> Here's a sample build log: > >> > >> > >> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log > >> > >> Essentially: > >> > >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache > >> clang" zImage > >> > >> I believe it should be using the GNU assembler as LLVM_IAS=1 is > >> not defined, but there may be something more subtle about it. > >> > > > > > > Do you have a link for a failing zImage built from multi_v7_defconfig? > > Sure, this one was built from a plain next-20210203: > > > http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage > > You can also find the dtbs, modules and other things in that same > directory. > > For the record, here's the test job that used it: > > https://lava.collabora.co.uk/scheduler/job/3173792 > Thanks. That zImage boots fine locally. Unfortunately, I don't have rk3288 hardware to reproduce. Could you please point me to the list of all the other platforms that failed to boot this image? To be honest, I am slightly annoyed that a change that works fine with GCC but does not work with Clang version 11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158 (where exp means experimental, I suppose) is the reason for this discussion, especially because the change is in asm code. Is it possible to build with Clang but use the GNU linker?
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 15:36, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 03:25:20PM +0100, Ard Biesheuvel wrote: > > Pushing contents of the cache hierarchy to main memory is *not* a > > valid use of set/way ops, and so there is no point in pretending that > > set/way ops will produce the same results as by-VA ops. Only the by-VA > > ops give the architectural guarantees that we rely on for correctness. > > ... yet we /were/ doing that, and it worked fine for 13 years - from > 1st June 2007 until the by-VA merge into mainline on the 3rd April > 2020. > > You may be right that it wasn't the most correct way, but it worked > for those 13 years without issue, and it's only recently that it's > become a problem, and trying to "fix" that introduced a regression, > and fixing that regression has caused another regression... and I > what I'm wondering is how many more regression fixing cycles it's > going to take - how many regression fixes on top of other regression > fixes are we going to end up seeing here. > > The fact is, we never properly understood why your patch caused the > regression I was seeing. If we don't understand it, then we can never > say that we've fixed the problem properly. That is highlighted by the > fact that fixing the regression I was seeing has caused another > regression. > I agree with all these points. But as I pointed out, reverting the original by-VA change, which has been there for almost a year now, has some risks of its own. If we don't understand the details of how this is broken, how can we be sure we don't break something else if we revert it again? So I'm not arguing that reverting the original patch is unreasonable, just that doing so at this point in the cycle is risky, and that it would be better to queue the revert for v5.12, and only backport it after some soak time in -next. And in a sense, reinstating the cache_clean() before cache_off() already amounts to a partial revert of the queued fix.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 04/02/2021 15:42, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker > wrote: >> >> On 04/02/2021 10:33, Guillaume Tucker wrote: >>> On 04/02/2021 10:27, Ard Biesheuvel wrote: On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: >> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker >> wrote: >>> >>> Hi Ard, >>> >>> Please see the bisection report below about a boot failure on >>> rk3288 with next-20210203. It was also bisected on >>> imx6q-var-dt6customboard with next-20210202. >>> >>> Reports aren't automatically sent to the public while we're >>> trialing new bisection features on kernelci.org but this one >>> looks valid. >>> >>> The kernel is most likely crashing very early on, so there's >>> nothing in the logs. Please let us know if you need some help >>> with debugging or trying a fix on these platforms. >>> >> >> Thanks for the report. > > Ard, > > I want to send my fixes branch today which includes your regression > fix that caused this regression. > > As this is proving difficult to fix, I can only drop your fix from > my fixes branch - and given that this seems to be problematical, I'm > tempted to revert the original change at this point which should fix > both of these regressions - and then we have another go at getting rid > of the set/way instructions during the next cycle. > > Thoughts? > Hi Russell, If Guillaume is willing to do the experiment, and it fixes the issue, >>> >>> Yes, I'm running some tests with that fix now and should have >>> some results shortly. >> >> Yes it does fix the issue: >> >> https://lava.collabora.co.uk/scheduler/job/3173819 >> >> with Ard's fix applied to this test branch: >> >> https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ >> >> >> +clang +Nick >> >> It's worth mentioning that the issue only happens with kernels >> built with Clang. As you can see there are several other arm >> platforms failing with clang-11 builds but booting fine with >> gcc-8: >> >> >> https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/ >> >> Here's a sample build log: >> >> >> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log >> >> Essentially: >> >> make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache >> clang" zImage >> >> I believe it should be using the GNU assembler as LLVM_IAS=1 is >> not defined, but there may be something more subtle about it. >> > > > Do you have a link for a failing zImage built from multi_v7_defconfig? Sure, this one was built from a plain next-20210203: http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage You can also find the dtbs, modules and other things in that same directory. For the record, here's the test job that used it: https://lava.collabora.co.uk/scheduler/job/3173792 Guillaume
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker wrote: > > On 04/02/2021 10:33, Guillaume Tucker wrote: > > On 04/02/2021 10:27, Ard Biesheuvel wrote: > >> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin > >> wrote: > >>> > >>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker > wrote: > > > > Hi Ard, > > > > Please see the bisection report below about a boot failure on > > rk3288 with next-20210203. It was also bisected on > > imx6q-var-dt6customboard with next-20210202. > > > > Reports aren't automatically sent to the public while we're > > trialing new bisection features on kernelci.org but this one > > looks valid. > > > > The kernel is most likely crashing very early on, so there's > > nothing in the logs. Please let us know if you need some help > > with debugging or trying a fix on these platforms. > > > > Thanks for the report. > >>> > >>> Ard, > >>> > >>> I want to send my fixes branch today which includes your regression > >>> fix that caused this regression. > >>> > >>> As this is proving difficult to fix, I can only drop your fix from > >>> my fixes branch - and given that this seems to be problematical, I'm > >>> tempted to revert the original change at this point which should fix > >>> both of these regressions - and then we have another go at getting rid > >>> of the set/way instructions during the next cycle. > >>> > >>> Thoughts? > >>> > >> > >> Hi Russell, > >> > >> If Guillaume is willing to do the experiment, and it fixes the issue, > > > > Yes, I'm running some tests with that fix now and should have > > some results shortly. > > Yes it does fix the issue: > > https://lava.collabora.co.uk/scheduler/job/3173819 > > with Ard's fix applied to this test branch: > > https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ > > > +clang +Nick > > It's worth mentioning that the issue only happens with kernels > built with Clang. As you can see there are several other arm > platforms failing with clang-11 builds but booting fine with > gcc-8: > > > https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/ > > Here's a sample build log: > > > https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log > > Essentially: > > make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache > clang" zImage > > I believe it should be using the GNU assembler as LLVM_IAS=1 is > not defined, but there may be something more subtle about it. > Do you have a link for a failing zImage built from multi_v7_defconfig?
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 03:25:20PM +0100, Ard Biesheuvel wrote: > Pushing contents of the cache hierarchy to main memory is *not* a > valid use of set/way ops, and so there is no point in pretending that > set/way ops will produce the same results as by-VA ops. Only the by-VA > ops give the architectural guarantees that we rely on for correctness. ... yet we /were/ doing that, and it worked fine for 13 years - from 1st June 2007 until the by-VA merge into mainline on the 3rd April 2020. You may be right that it wasn't the most correct way, but it worked for those 13 years without issue, and it's only recently that it's become a problem, and trying to "fix" that introduced a regression, and fixing that regression has caused another regression... and I what I'm wondering is how many more regression fixing cycles it's going to take - how many regression fixes on top of other regression fixes are we going to end up seeing here. The fact is, we never properly understood why your patch caused the regression I was seeing. If we don't understand it, then we can never say that we've fixed the problem properly. That is highlighted by the fact that fixing the regression I was seeing has caused another regression. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 15:09, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 12:26:44PM +, Marc Zyngier wrote: > > I agree. With set/way CMOs, there is no way to reach the PoC if > > it beyond the system cache, leading to an unbootable kernel. > > This is actually pretty well documented in the architecture, > > and it did bite us for the first time on XGene-1, 7 years ago. > > That may be, however we still do set/way maintenance to invalidate > the L1 cache as that is required for ARMv7 to place the cache into > a known state, as stated by the architecture reference manual. > Getting a certain cache at a certain level into a known state is a valid use of set/way ops, and is simply unnecessary when running under virtualization, but doesn't do any harm. Pushing contents of the cache hierarchy to main memory is *not* a valid use of set/way ops, and so there is no point in pretending that set/way ops will produce the same results as by-VA ops. Only the by-VA ops give the architectural guarantees that we rely on for correctness. > Arguably, that should be done by firmware, but when starting > secondary CPUs, there are platforms out there which do not bring > the L1 cache to a defined state. So we are pretty much stuck with > doing set/way operations during CPU initialisation in the main > kernel. > Indeed. And this is unfortunate, but not the end of the world. > If ARMv8 decides that this is not supportable, then that's a matter > for ARMv8 to address without impacting the requirements of ARMv7. > I'm not sure what you mean here. The v7 architecture is crystal clear about the difference between set/way ops (managing a single cache), and by-VA ops (managing the 'cachedness' state of a memory region). The semantics are radically different, regardless of v7 vs v8 or AArch32 vs AArch64.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 12:26:44PM +, Marc Zyngier wrote: > I agree. With set/way CMOs, there is no way to reach the PoC if > it beyond the system cache, leading to an unbootable kernel. > This is actually pretty well documented in the architecture, > and it did bite us for the first time on XGene-1, 7 years ago. That may be, however we still do set/way maintenance to invalidate the L1 cache as that is required for ARMv7 to place the cache into a known state, as stated by the architecture reference manual. Arguably, that should be done by firmware, but when starting secondary CPUs, there are platforms out there which do not bring the L1 cache to a defined state. So we are pretty much stuck with doing set/way operations during CPU initialisation in the main kernel. If ARMv8 decides that this is not supportable, then that's a matter for ARMv8 to address without impacting the requirements of ARMv7. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 2021-02-04 10:55, Ard Biesheuvel wrote: (cc Marc) On Thu, 4 Feb 2021 at 11:48, Russell King - ARM Linux admin wrote: On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote: > Hi Russell, > > If Guillaume is willing to do the experiment, and it fixes the issue, > it proves that rk3288 is relying on the flush before the MMU is > disabled, and so in that case, the fix is trivial, and we can just > apply it. > > If the experiment fails (which would mean rk3288 does not tolerate the > cache maintenance being performed after cache off), it is going to be > hairy, and so it will definitely take more time. > > So in the latter case (or if Guillaume does not get back to us), I > think reverting my queued fix is the only sane option. But in that > case, may I suggest that we queue the revert of the original by-VA > change for v5.12 so it gets lots of coverage in -next, and allows us > an opportunity to come up with a proper fix in the same timeframe, and > backport the revert and the subsequent fix as a pair? Otherwise, we'll > end up in the situation where v5.10.x until today has by-va, v5.10.x-y > has set/way, and v5.10y+ has by-va again. (I don't think we care about > anything before that, given that v5.4 predates any of this) I'm suggesting dropping your fix (9052/1) and reverting "ARM: decompressor: switch to by-VA cache maintenance for v7 cores" which gets us to a point where _both_ regressions are fixed. I understand, but we don't know whether doing so might regress other platforms that were added in the mean time. I'm of the opinion that the by-VA patch was incorrect when it was merged (it caused a regression), and it's only a performance improvement. It is a correctness improvement, not a performance improvement. Without that change, the 32-bit ARM kernel cannot boot bare metal on platforms with a system cache such as 8040 or SynQuacer, and can only boot under KVM on such systems because of the special handling of set/way instructions by the host. I agree. With set/way CMOs, there is no way to reach the PoC if it beyond the system cache, leading to an unbootable kernel. This is actually pretty well documented in the architecture, and it did bite us for the first time on XGene-1, 7 years ago. In retrospect, having KVM to handle set/way CMOs in was a mistake, as it just papered over the problem for the sake of running older 32bit guests. It violated the principle of KVM/arm being strictly architectural and provided unrealistic expectations. I'll take the blame for this. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 12:45, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 11:32:05AM +, Guillaume Tucker wrote: > > Yes it does fix the issue: > > > > https://lava.collabora.co.uk/scheduler/job/3173819 > > > > with Ard's fix applied to this test branch: > > > > > > https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ > > > > > > +clang +Nick > > > > It's worth mentioning that the issue only happens with kernels > > built with Clang. As you can see there are several other arm > > platforms failing with clang-11 builds but booting fine with > > gcc-8: > > My gut feeling is that it isn't Clang specific - it's likely down to > the exact code/data placement, how things end up during decompression, > and exactly what state the cache ends up in. > > That certainly was the case with the original regression. > Agreed. So given that my queued fix turns this cache_clean cache_off into this cache_off cache_clean for v7 only, and considering that turning this into cache_clean cache_off cache_clean (as the diff tested by Guillaume does) fixes the reported issue, it seems like the safest option to me at this point. Reverting both patches, one of which has been in mainline since v5.7, seems unwise to me at this point in the cycle.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 11:32:05AM +, Guillaume Tucker wrote: > Yes it does fix the issue: > > https://lava.collabora.co.uk/scheduler/job/3173819 > > with Ard's fix applied to this test branch: > > https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ > > > +clang +Nick > > It's worth mentioning that the issue only happens with kernels > built with Clang. As you can see there are several other arm > platforms failing with clang-11 builds but booting fine with > gcc-8: My gut feeling is that it isn't Clang specific - it's likely down to the exact code/data placement, how things end up during decompression, and exactly what state the cache ends up in. That certainly was the case with the original regression. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 04/02/2021 10:33, Guillaume Tucker wrote: > On 04/02/2021 10:27, Ard Biesheuvel wrote: >> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin >> wrote: >>> >>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker wrote: > > Hi Ard, > > Please see the bisection report below about a boot failure on > rk3288 with next-20210203. It was also bisected on > imx6q-var-dt6customboard with next-20210202. > > Reports aren't automatically sent to the public while we're > trialing new bisection features on kernelci.org but this one > looks valid. > > The kernel is most likely crashing very early on, so there's > nothing in the logs. Please let us know if you need some help > with debugging or trying a fix on these platforms. > Thanks for the report. >>> >>> Ard, >>> >>> I want to send my fixes branch today which includes your regression >>> fix that caused this regression. >>> >>> As this is proving difficult to fix, I can only drop your fix from >>> my fixes branch - and given that this seems to be problematical, I'm >>> tempted to revert the original change at this point which should fix >>> both of these regressions - and then we have another go at getting rid >>> of the set/way instructions during the next cycle. >>> >>> Thoughts? >>> >> >> Hi Russell, >> >> If Guillaume is willing to do the experiment, and it fixes the issue, > > Yes, I'm running some tests with that fix now and should have > some results shortly. Yes it does fix the issue: https://lava.collabora.co.uk/scheduler/job/3173819 with Ard's fix applied to this test branch: https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/ +clang +Nick It's worth mentioning that the issue only happens with kernels built with Clang. As you can see there are several other arm platforms failing with clang-11 builds but booting fine with gcc-8: https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/ Here's a sample build log: https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log Essentially: make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage I believe it should be using the GNU assembler as LLVM_IAS=1 is not defined, but there may be something more subtle about it. Thanks, Guillaume >> it proves that rk3288 is relying on the flush before the MMU is >> disabled, and so in that case, the fix is trivial, and we can just >> apply it. >> >> If the experiment fails (which would mean rk3288 does not tolerate the >> cache maintenance being performed after cache off), it is going to be >> hairy, and so it will definitely take more time. >> >> So in the latter case (or if Guillaume does not get back to us), I >> think reverting my queued fix is the only sane option. But in that >> case, may I suggest that we queue the revert of the original by-VA >> change for v5.12 so it gets lots of coverage in -next, and allows us >> an opportunity to come up with a proper fix in the same timeframe, and >> backport the revert and the subsequent fix as a pair? Otherwise, we'll >> end up in the situation where v5.10.x until today has by-va, v5.10.x-y >> has set/way, and v5.10y+ has by-va again. (I don't think we care about >> anything before that, given that v5.4 predates any of this) >> >> But in the end, I'm happy to go along with whatever works best for you. > > Thanks, > Guillaume >
Re: next/master bisection: baseline.login on rk3288-rock2-square
(cc Marc) On Thu, 4 Feb 2021 at 11:48, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote: > > Hi Russell, > > > > If Guillaume is willing to do the experiment, and it fixes the issue, > > it proves that rk3288 is relying on the flush before the MMU is > > disabled, and so in that case, the fix is trivial, and we can just > > apply it. > > > > If the experiment fails (which would mean rk3288 does not tolerate the > > cache maintenance being performed after cache off), it is going to be > > hairy, and so it will definitely take more time. > > > > So in the latter case (or if Guillaume does not get back to us), I > > think reverting my queued fix is the only sane option. But in that > > case, may I suggest that we queue the revert of the original by-VA > > change for v5.12 so it gets lots of coverage in -next, and allows us > > an opportunity to come up with a proper fix in the same timeframe, and > > backport the revert and the subsequent fix as a pair? Otherwise, we'll > > end up in the situation where v5.10.x until today has by-va, v5.10.x-y > > has set/way, and v5.10y+ has by-va again. (I don't think we care about > > anything before that, given that v5.4 predates any of this) > > I'm suggesting dropping your fix (9052/1) and reverting > "ARM: decompressor: switch to by-VA cache maintenance for v7 cores" > which gets us to a point where _both_ regressions are fixed. > I understand, but we don't know whether doing so might regress other platforms that were added in the mean time. > I'm of the opinion that the by-VA patch was incorrect when it was > merged (it caused a regression), and it's only a performance > improvement. It is a correctness improvement, not a performance improvement. Without that change, the 32-bit ARM kernel cannot boot bare metal on platforms with a system cache such as 8040 or SynQuacer, and can only boot under KVM on such systems because of the special handling of set/way instructions by the host. The performance issue related to set/way ops under KVM was already fixed by describing data and unified caches as 1 set and 1 way when running in 32-bit mode. > Our attempts so far to fix it are just causing other > regressions. So, I think it is reasonable to revert both back to a > known good point which has worked over a decade. If doing so causes > regressions (which I think is unlikely), then that would be unfortunate > but alas is a price that's worth paying to get back to a known good > point - since then we're not stacking regression fixes on top of other > regression fixes. > This is exactly why I am proposing to queue the revert of the original patch for v5.12, and only backport it to v5.10 and v5.11 once we are sure it does not break anything else.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote: > Hi Russell, > > If Guillaume is willing to do the experiment, and it fixes the issue, > it proves that rk3288 is relying on the flush before the MMU is > disabled, and so in that case, the fix is trivial, and we can just > apply it. > > If the experiment fails (which would mean rk3288 does not tolerate the > cache maintenance being performed after cache off), it is going to be > hairy, and so it will definitely take more time. > > So in the latter case (or if Guillaume does not get back to us), I > think reverting my queued fix is the only sane option. But in that > case, may I suggest that we queue the revert of the original by-VA > change for v5.12 so it gets lots of coverage in -next, and allows us > an opportunity to come up with a proper fix in the same timeframe, and > backport the revert and the subsequent fix as a pair? Otherwise, we'll > end up in the situation where v5.10.x until today has by-va, v5.10.x-y > has set/way, and v5.10y+ has by-va again. (I don't think we care about > anything before that, given that v5.4 predates any of this) I'm suggesting dropping your fix (9052/1) and reverting "ARM: decompressor: switch to by-VA cache maintenance for v7 cores" which gets us to a point where _both_ regressions are fixed. I'm of the opinion that the by-VA patch was incorrect when it was merged (it caused a regression), and it's only a performance improvement. Our attempts so far to fix it are just causing other regressions. So, I think it is reasonable to revert both back to a known good point which has worked over a decade. If doing so causes regressions (which I think is unlikely), then that would be unfortunate but alas is a price that's worth paying to get back to a known good point - since then we're not stacking regression fixes on top of other regression fixes. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On 04/02/2021 10:27, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin > wrote: >> >> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: >>> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker >>> wrote: Hi Ard, Please see the bisection report below about a boot failure on rk3288 with next-20210203. It was also bisected on imx6q-var-dt6customboard with next-20210202. Reports aren't automatically sent to the public while we're trialing new bisection features on kernelci.org but this one looks valid. The kernel is most likely crashing very early on, so there's nothing in the logs. Please let us know if you need some help with debugging or trying a fix on these platforms. >>> >>> Thanks for the report. >> >> Ard, >> >> I want to send my fixes branch today which includes your regression >> fix that caused this regression. >> >> As this is proving difficult to fix, I can only drop your fix from >> my fixes branch - and given that this seems to be problematical, I'm >> tempted to revert the original change at this point which should fix >> both of these regressions - and then we have another go at getting rid >> of the set/way instructions during the next cycle. >> >> Thoughts? >> > > Hi Russell, > > If Guillaume is willing to do the experiment, and it fixes the issue, Yes, I'm running some tests with that fix now and should have some results shortly. > it proves that rk3288 is relying on the flush before the MMU is > disabled, and so in that case, the fix is trivial, and we can just > apply it. > > If the experiment fails (which would mean rk3288 does not tolerate the > cache maintenance being performed after cache off), it is going to be > hairy, and so it will definitely take more time. > > So in the latter case (or if Guillaume does not get back to us), I > think reverting my queued fix is the only sane option. But in that > case, may I suggest that we queue the revert of the original by-VA > change for v5.12 so it gets lots of coverage in -next, and allows us > an opportunity to come up with a proper fix in the same timeframe, and > backport the revert and the subsequent fix as a pair? Otherwise, we'll > end up in the situation where v5.10.x until today has by-va, v5.10.x-y > has set/way, and v5.10y+ has by-va again. (I don't think we care about > anything before that, given that v5.4 predates any of this) > > But in the end, I'm happy to go along with whatever works best for you. Thanks, Guillaume
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin wrote: > > On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: > > On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker > > wrote: > > > > > > Hi Ard, > > > > > > Please see the bisection report below about a boot failure on > > > rk3288 with next-20210203. It was also bisected on > > > imx6q-var-dt6customboard with next-20210202. > > > > > > Reports aren't automatically sent to the public while we're > > > trialing new bisection features on kernelci.org but this one > > > looks valid. > > > > > > The kernel is most likely crashing very early on, so there's > > > nothing in the logs. Please let us know if you need some help > > > with debugging or trying a fix on these platforms. > > > > > > > Thanks for the report. > > Ard, > > I want to send my fixes branch today which includes your regression > fix that caused this regression. > > As this is proving difficult to fix, I can only drop your fix from > my fixes branch - and given that this seems to be problematical, I'm > tempted to revert the original change at this point which should fix > both of these regressions - and then we have another go at getting rid > of the set/way instructions during the next cycle. > > Thoughts? > Hi Russell, If Guillaume is willing to do the experiment, and it fixes the issue, it proves that rk3288 is relying on the flush before the MMU is disabled, and so in that case, the fix is trivial, and we can just apply it. If the experiment fails (which would mean rk3288 does not tolerate the cache maintenance being performed after cache off), it is going to be hairy, and so it will definitely take more time. So in the latter case (or if Guillaume does not get back to us), I think reverting my queued fix is the only sane option. But in that case, may I suggest that we queue the revert of the original by-VA change for v5.12 so it gets lots of coverage in -next, and allows us an opportunity to come up with a proper fix in the same timeframe, and backport the revert and the subsequent fix as a pair? Otherwise, we'll end up in the situation where v5.10.x until today has by-va, v5.10.x-y has set/way, and v5.10y+ has by-va again. (I don't think we care about anything before that, given that v5.4 predates any of this) But in the end, I'm happy to go along with whatever works best for you.
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote: > On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker > wrote: > > > > Hi Ard, > > > > Please see the bisection report below about a boot failure on > > rk3288 with next-20210203. It was also bisected on > > imx6q-var-dt6customboard with next-20210202. > > > > Reports aren't automatically sent to the public while we're > > trialing new bisection features on kernelci.org but this one > > looks valid. > > > > The kernel is most likely crashing very early on, so there's > > nothing in the logs. Please let us know if you need some help > > with debugging or trying a fix on these platforms. > > > > Thanks for the report. Ard, I want to send my fixes branch today which includes your regression fix that caused this regression. As this is proving difficult to fix, I can only drop your fix from my fixes branch - and given that this seems to be problematical, I'm tempted to revert the original change at this point which should fix both of these regressions - and then we have another go at getting rid of the set/way instructions during the next cycle. Thoughts? -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: next/master bisection: baseline.login on rk3288-rock2-square
On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker wrote: > > Hi Ard, > > Please see the bisection report below about a boot failure on > rk3288 with next-20210203. It was also bisected on > imx6q-var-dt6customboard with next-20210202. > > Reports aren't automatically sent to the public while we're > trialing new bisection features on kernelci.org but this one > looks valid. > > The kernel is most likely crashing very early on, so there's > nothing in the logs. Please let us know if you need some help > with debugging or trying a fix on these platforms. > Thanks for the report. Mind trying the following fix? --- a/arch/arm/boot/compressed/head.S +++ b/arch/arm/boot/compressed/head.S @@ -617,8 +617,10 @@ not_relocated: mov r0, #0 @ cache_clean_flush() redundant. In other cases, the clean is @ performed by set/way and R0/R1 are ignored. @ - mov r0, #0 - mov r1, #0 + get_inflated_image_size r1, r2, r3 + + mov r0, r4 @ start of decompressed kernel + add r1, r1, r0 @ end of kernel BSS bl cache_clean_flush get_inflated_image_size r1, r2, r3 > On 04/02/2021 04:25, KernelCI bot wrote: > > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * > > * This automated bisection report was sent to you on the basis * > > * that you may be involved with the breaking commit it has * > > * found. No manual investigation has been done to verify it, * > > * and the root cause of the problem may be somewhere else. * > > * * > > * If you do send a fix, please include this trailer:* > > * Reported-by: "kernelci.org bot" * > > * * > > * Hope this helps! * > > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * > > > > next/master bisection: baseline.login on rk3288-rock2-square > > > > Summary: > > Start: 58b6c0e507b7 Add linux-next specific files for 20210203 > > Plain log: > > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.txt > > HTML log: > > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.html > > Result: 5a29552af92d ARM: 9052/1: decompressor: cover BSS in cache > > clean and reorder with MMU disable on v7 > > > > Checks: > > revert: PASS > > verify: PASS > > > > Parameters: > > Tree: next > > URL: > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git > > Branch: master > > Target: rk3288-rock2-square > > CPU arch: arm > > Lab:lab-collabora > > Compiler: clang-11 > > Config: multi_v7_defconfig > > Test case: baseline.login > > > > Breaking commit found: > > > > --- > > commit 5a29552af92dbd62c2b6fd1cddf7dad1ef7555b2 > > Author: Ard Biesheuvel > > Date: Sun Jan 24 18:03:45 2021 +0100 > > > > ARM: 9052/1: decompressor: cover BSS in cache clean and reorder with > > MMU disable on v7 > > > > Commit 401b368caaec ("ARM: decompressor: switch to by-VA cache > > maintenance > > for v7 cores") replaced the by-set/way cache maintenance in the > > decompressor > > with by-VA cache maintenance, which is more appropriate for the task at > > hand, especially under virtualization on hosts with non-architected > > system > > caches that are not affected by by-set/way maintenance at all. > > > > On such systems, that commit inadvertently removed the cache clean and > > invalidate of all of the guest's memory that is performed by KVM on > > behalf > > of the guest after its MMU is disabled (but only if any by-set/way cache > > maintenance instructions were issued first). This resulted in various > > erroneous behaviors observed by Russell, all involving the mini-stack > > used by the core kernel's v7 boot code, and which resides in BSS. It > > seems intractable to figure out exactly what goes wrong in each of these > > cases, but some small experiments did suggest that the lack of a cache > > clean and invalidate *after* disabling the MMU and caches is what > > triggers the errors, presumably because cachelines are being allocated > > or reallocated while the first cache clean and invalidate is in > > progress. > > > > To ensure that no cache lines cover any of the data that is accessed by > > the booting kernel with the MMU off, include the uncompressed kernel's > > BSS region in the cache clean operation. > > > > Also, to ensure that no cachelines are allocated
Re: next/master bisection: baseline.login on rk3288-rock2-square
Hi Ard, Please see the bisection report below about a boot failure on rk3288 with next-20210203. It was also bisected on imx6q-var-dt6customboard with next-20210202. Reports aren't automatically sent to the public while we're trialing new bisection features on kernelci.org but this one looks valid. The kernel is most likely crashing very early on, so there's nothing in the logs. Please let us know if you need some help with debugging or trying a fix on these platforms. Best wishes, Guillaume On 04/02/2021 04:25, KernelCI bot wrote: > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * > * This automated bisection report was sent to you on the basis * > * that you may be involved with the breaking commit it has * > * found. No manual investigation has been done to verify it, * > * and the root cause of the problem may be somewhere else. * > * * > * If you do send a fix, please include this trailer:* > * Reported-by: "kernelci.org bot" * > * * > * Hope this helps! * > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * > > next/master bisection: baseline.login on rk3288-rock2-square > > Summary: > Start: 58b6c0e507b7 Add linux-next specific files for 20210203 > Plain log: > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.txt > HTML log: > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.html > Result: 5a29552af92d ARM: 9052/1: decompressor: cover BSS in cache > clean and reorder with MMU disable on v7 > > Checks: > revert: PASS > verify: PASS > > Parameters: > Tree: next > URL: > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git > Branch: master > Target: rk3288-rock2-square > CPU arch: arm > Lab:lab-collabora > Compiler: clang-11 > Config: multi_v7_defconfig > Test case: baseline.login > > Breaking commit found: > > --- > commit 5a29552af92dbd62c2b6fd1cddf7dad1ef7555b2 > Author: Ard Biesheuvel > Date: Sun Jan 24 18:03:45 2021 +0100 > > ARM: 9052/1: decompressor: cover BSS in cache clean and reorder with MMU > disable on v7 > > Commit 401b368caaec ("ARM: decompressor: switch to by-VA cache maintenance > for v7 cores") replaced the by-set/way cache maintenance in the > decompressor > with by-VA cache maintenance, which is more appropriate for the task at > hand, especially under virtualization on hosts with non-architected system > caches that are not affected by by-set/way maintenance at all. > > On such systems, that commit inadvertently removed the cache clean and > invalidate of all of the guest's memory that is performed by KVM on behalf > of the guest after its MMU is disabled (but only if any by-set/way cache > maintenance instructions were issued first). This resulted in various > erroneous behaviors observed by Russell, all involving the mini-stack > used by the core kernel's v7 boot code, and which resides in BSS. It > seems intractable to figure out exactly what goes wrong in each of these > cases, but some small experiments did suggest that the lack of a cache > clean and invalidate *after* disabling the MMU and caches is what > triggers the errors, presumably because cachelines are being allocated > or reallocated while the first cache clean and invalidate is in progress. > > To ensure that no cache lines cover any of the data that is accessed by > the booting kernel with the MMU off, include the uncompressed kernel's > BSS region in the cache clean operation. > > Also, to ensure that no cachelines are allocated while the cache is being > cleaned, perform the cache clean operation *after* disabling the MMU and > caches when running on v7 or later, by making a tail call to the clean > routine from the cache_off routine. This requires passing the VA range > to cache_off(), which means some care needs to be taken to preserve > R0 and R1 across the call to cache_off(). > > Since this makes the first cache clean redundant, call it with the > range reduced to zero. This only affects v7, as all other versions > ignore R0/R1 entirely. > > Link: > https://lore.kernel.org/linux-arm-kernel/20210122152012.30075-1-a...@kernel.org > > Fixes: 401b368caaec ("ARM: decompressor: switch to by-VA cache > maintenance for v7 cores") > Reported-by: Russell King > Signed-off-by: Ard Biesheuvel > Signed-off-by: Russell King > > diff --git