Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-06 Thread Ard Biesheuvel
On Sat, 6 Feb 2021 at 14:10, Guillaume Tucker
 wrote:
>
> On 05/02/2021 12:05, Ard Biesheuvel wrote:
> > On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel  wrote:
> >>
> >> On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker
> >>  wrote:
> >>>
> >>> On 04/02/2021 18:23, Nick Desaulniers wrote:
>  On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  
>  wrote:
> >
> > On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang 
> > Built Linux wrote:
> >> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
> >>>
> >>> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
> >>>  wrote:
> 
>  On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> >  wrote:
> >>
> >> Essentially:
> >>
> >>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> >> CC="ccache clang" zImage
> >>
> >> This command should link with BFD (and assemble with GAS; it's only
> >> using clang as the compiler.
> >
> > I think you missed the 'LLVM=1' before CC="ccache clang". That should
> > use all of the LLVM utilities minus the integrated assembler while
> > wrapping clang with ccache.
> 
>  You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
>  permit fallback to BFD.
> >>>
> >>> That was close, except we're cross-compiling with GCC for arm.
> >>> So I've now built a plain next-20210203 (without Ard's fix) using
> >>> this command line:
> >>>
> >>> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
> >>> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage
> >>>
> >>> I'm using a modified Docker image gtucker/kernelci-build-clang-11
> >>> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
> >>> packages added to be able to use the GNU linker.  BTW I guess we
> >>> should enable this kind of hybrid build setup on kernelci.org as
> >>> well.
> >>>
> >>> Full build log + kernel binaries can be found here:
> >>>
> >>> 
> >>> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/
> >>>
> >>> And this booted fine, which confirms it's really down to how
> >>> ld.lld puts together the kernel image.  Does it actually solve
> >>> the debate whether this is an issue to fix in the assembly code
> >>> or at link time?
> >>>
> >>> Full test job details for the record:
> >>>
> >>> https://lava.collabora.co.uk/scheduler/job/3176004
> >>>
> >>
> >>
> >> So the issue appears to be in the way the linker generates the
> >> _kernel_bss_size symbol, which obviously has an impact, given that the
> >> queued fix takes it into account in the cache_clean operation.
> >>
> >> On GNU ld, I see
> >>
> >>479: 00065e14 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
> >>
> >> whereas n LLVM ld.lld, I see
> >>
> >>433: c1c86e98 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
> >>
> >> and adding this value may cause the cache clean to operate on unmapped
> >> addresses, or cause the addition to wrap and not perform a cache clean
> >> at all.
> >>
> >> AFAICT, this also breaks the appended DTB case in LLVM, so this needs
> >> a separate fix in any case.
> >
> > I pushed a combined branch of torvalds/master, rmk/fixes (still
> > containing my 9052/1 fix) and this patch to my for-kernelci branch
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/
> >
> > Guillaume,
> >
> > It seems there is no Clang-11 coverage there, right? Mind giving this
> > branch a spin? If this fixes the regressions, we can get these queued
> > up.
>
> That's right, Clang builds are only enabled on linux-next and
> mainline at the moment.  We could enable it on any other branch
> where it makes sense.  How about just the main defconfig for arm,
> arm64 and x86_64 on your ardb/for-kernelci branch?
>

Yes, please.

> For now I've run another set of builds with clang-11 and got the
> following test results with your branch on staging:
>
>   
> https://staging.kernelci.org/test/job/ardb/branch/for-kernelci/kernel/v5.11-rc6-146-g923ca344043a/plan/baseline/
>
> which are all passing.
>
> I'll reply to the thread with your patch to confirm.
>

Excellent, thanks a lot.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-06 Thread Guillaume Tucker
On 05/02/2021 12:05, Ard Biesheuvel wrote:
> On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel  wrote:
>>
>> On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker
>>  wrote:
>>>
>>> On 04/02/2021 18:23, Nick Desaulniers wrote:
 On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  
 wrote:
>
> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang 
> Built Linux wrote:
>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
>>>
>>> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
>>>  wrote:

 On 04/02/2021 15:42, Ard Biesheuvel wrote:
> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
>  wrote:
>>
>> Essentially:
>>
>>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
>> CC="ccache clang" zImage
>>
>> This command should link with BFD (and assemble with GAS; it's only
>> using clang as the compiler.
>
> I think you missed the 'LLVM=1' before CC="ccache clang". That should
> use all of the LLVM utilities minus the integrated assembler while
> wrapping clang with ccache.

 You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
 permit fallback to BFD.
>>>
>>> That was close, except we're cross-compiling with GCC for arm.
>>> So I've now built a plain next-20210203 (without Ard's fix) using
>>> this command line:
>>>
>>> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
>>> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage
>>>
>>> I'm using a modified Docker image gtucker/kernelci-build-clang-11
>>> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
>>> packages added to be able to use the GNU linker.  BTW I guess we
>>> should enable this kind of hybrid build setup on kernelci.org as
>>> well.
>>>
>>> Full build log + kernel binaries can be found here:
>>>
>>> 
>>> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/
>>>
>>> And this booted fine, which confirms it's really down to how
>>> ld.lld puts together the kernel image.  Does it actually solve
>>> the debate whether this is an issue to fix in the assembly code
>>> or at link time?
>>>
>>> Full test job details for the record:
>>>
>>> https://lava.collabora.co.uk/scheduler/job/3176004
>>>
>>
>>
>> So the issue appears to be in the way the linker generates the
>> _kernel_bss_size symbol, which obviously has an impact, given that the
>> queued fix takes it into account in the cache_clean operation.
>>
>> On GNU ld, I see
>>
>>479: 00065e14 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
>>
>> whereas n LLVM ld.lld, I see
>>
>>433: c1c86e98 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
>>
>> and adding this value may cause the cache clean to operate on unmapped
>> addresses, or cause the addition to wrap and not perform a cache clean
>> at all.
>>
>> AFAICT, this also breaks the appended DTB case in LLVM, so this needs
>> a separate fix in any case.
> 
> I pushed a combined branch of torvalds/master, rmk/fixes (still
> containing my 9052/1 fix) and this patch to my for-kernelci branch
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/
> 
> Guillaume,
> 
> It seems there is no Clang-11 coverage there, right? Mind giving this
> branch a spin? If this fixes the regressions, we can get these queued
> up.

That's right, Clang builds are only enabled on linux-next and
mainline at the moment.  We could enable it on any other branch
where it makes sense.  How about just the main defconfig for arm,
arm64 and x86_64 on your ardb/for-kernelci branch?

For now I've run another set of builds with clang-11 and got the
following test results with your branch on staging:

  
https://staging.kernelci.org/test/job/ardb/branch/for-kernelci/kernel/v5.11-rc6-146-g923ca344043a/plan/baseline/

which are all passing.

I'll reply to the thread with your patch to confirm.

Guillaume


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-05 Thread Ard Biesheuvel
On Fri, 5 Feb 2021 at 09:21, Ard Biesheuvel  wrote:
>
> On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker
>  wrote:
> >
> > On 04/02/2021 18:23, Nick Desaulniers wrote:
> > > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  
> > > wrote:
> > >>
> > >> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang 
> > >> Built Linux wrote:
> > >>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
> > 
> >  On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
> >   wrote:
> > >
> > > On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > >> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> > >>  wrote:
> > >>>
> > >>> Essentially:
> > >>>
> > >>>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> > >>> CC="ccache clang" zImage
> > >>>
> > >>> This command should link with BFD (and assemble with GAS; it's only
> > >>> using clang as the compiler.
> > >>
> > >> I think you missed the 'LLVM=1' before CC="ccache clang". That should
> > >> use all of the LLVM utilities minus the integrated assembler while
> > >> wrapping clang with ccache.
> > >
> > > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
> > > permit fallback to BFD.
> >
> > That was close, except we're cross-compiling with GCC for arm.
> > So I've now built a plain next-20210203 (without Ard's fix) using
> > this command line:
> >
> > make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
> > CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage
> >
> > I'm using a modified Docker image gtucker/kernelci-build-clang-11
> > with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
> > packages added to be able to use the GNU linker.  BTW I guess we
> > should enable this kind of hybrid build setup on kernelci.org as
> > well.
> >
> > Full build log + kernel binaries can be found here:
> >
> > 
> > https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/
> >
> > And this booted fine, which confirms it's really down to how
> > ld.lld puts together the kernel image.  Does it actually solve
> > the debate whether this is an issue to fix in the assembly code
> > or at link time?
> >
> > Full test job details for the record:
> >
> > https://lava.collabora.co.uk/scheduler/job/3176004
> >
>
>
> So the issue appears to be in the way the linker generates the
> _kernel_bss_size symbol, which obviously has an impact, given that the
> queued fix takes it into account in the cache_clean operation.
>
> On GNU ld, I see
>
>479: 00065e14 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
>
> whereas n LLVM ld.lld, I see
>
>433: c1c86e98 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size
>
> and adding this value may cause the cache clean to operate on unmapped
> addresses, or cause the addition to wrap and not perform a cache clean
> at all.
>
> AFAICT, this also breaks the appended DTB case in LLVM, so this needs
> a separate fix in any case.

I pushed a combined branch of torvalds/master, rmk/fixes (still
containing my 9052/1 fix) and this patch to my for-kernelci branch

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/

Guillaume,

It seems there is no Clang-11 coverage there, right? Mind giving this
branch a spin? If this fixes the regressions, we can get these queued
up.

Thanks,
Ard.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-05 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 22:31, Guillaume Tucker
 wrote:
>
> On 04/02/2021 18:23, Nick Desaulniers wrote:
> > On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  wrote:
> >>
> >> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang 
> >> Built Linux wrote:
> >>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
> 
>  On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
>   wrote:
> >
> > On 04/02/2021 15:42, Ard Biesheuvel wrote:
> >> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> >>  wrote:
> >>>
> >>> Essentially:
> >>>
> >>>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> >>> CC="ccache clang" zImage
> >>>
> >>> This command should link with BFD (and assemble with GAS; it's only
> >>> using clang as the compiler.
> >>
> >> I think you missed the 'LLVM=1' before CC="ccache clang". That should
> >> use all of the LLVM utilities minus the integrated assembler while
> >> wrapping clang with ccache.
> >
> > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
> > permit fallback to BFD.
>
> That was close, except we're cross-compiling with GCC for arm.
> So I've now built a plain next-20210203 (without Ard's fix) using
> this command line:
>
> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage
>
> I'm using a modified Docker image gtucker/kernelci-build-clang-11
> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
> packages added to be able to use the GNU linker.  BTW I guess we
> should enable this kind of hybrid build setup on kernelci.org as
> well.
>
> Full build log + kernel binaries can be found here:
>
> 
> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/
>
> And this booted fine, which confirms it's really down to how
> ld.lld puts together the kernel image.  Does it actually solve
> the debate whether this is an issue to fix in the assembly code
> or at link time?
>
> Full test job details for the record:
>
> https://lava.collabora.co.uk/scheduler/job/3176004
>


So the issue appears to be in the way the linker generates the
_kernel_bss_size symbol, which obviously has an impact, given that the
queued fix takes it into account in the cache_clean operation.

On GNU ld, I see

   479: 00065e14 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size

whereas n LLVM ld.lld, I see

   433: c1c86e98 0 NOTYPE  GLOBAL DEFAULT  ABS _kernel_bss_size

and adding this value may cause the cache clean to operate on unmapped
addresses, or cause the addition to wrap and not perform a cache clean
at all.

AFAICT, this also breaks the appended DTB case in LLVM, so this needs
a separate fix in any case.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 09:31:06PM +, Guillaume Tucker wrote:
> On 04/02/2021 18:23, Nick Desaulniers wrote:
> > You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
> > permit fallback to BFD.
> 
> That was close, except we're cross-compiling with GCC for arm.
> So I've now built a plain next-20210203 (without Ard's fix) using
> this command line:
> 
> make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
> CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage
> 
> I'm using a modified Docker image gtucker/kernelci-build-clang-11
> with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
> packages added to be able to use the GNU linker.  BTW I guess we
> should enable this kind of hybrid build setup on kernelci.org as
> well.

...

> And this booted fine, which confirms it's really down to how
> ld.lld puts together the kernel image.  Does it actually solve
> the debate whether this is an issue to fix in the assembly code
> or at link time?

Well... as I mentioned previously, we don't really understand what
is going on between the decompressor running with the caches on,
turning the caches off, jumping into the decompressed kernel, and
then getting to the v7 setup code.

The results from various attempts at solving the problem which lead
to Ard's original patch that caused your breakage were not making a
whole lot of sense (I think I wrote that all up in a previous email
thread, so won't repeat it here.)

So, I was slightly nervous about merging Ard's fix - and your report
suggested that there is indeed more going on here that we don't
understand.

When I was tracking down what was going on, I had this patch applied
(I've had to recreate it, so may not be exactly what I had), with the
DEBUG_LL stuff appropriately enabled. It may be worth applying this
patch, enabling the DEBUG_LL stuff appropriately for one of your
failing boards, and try booting it.

You should get two strings of identical hex numbers that look
something like:

480009000401400030004820071d40008090

If they're looking like instructions, for example:

ee060f37e3a00080ee020f10ee020f30ee030f10e3a00903ee050f30

Then it's likely that you are seeing a very similar problem as I was
without Ard's patch. If you do get instruction-like content, then
you will likely find the sequence of instructions in the decompressor
code.

diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index 28c9d32fa99a..19fa93ae282c 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -475,7 +475,39 @@ ENDPROC(cpu_pj4b_do_resume)
ldr r12, [r0]
add r12, r12, r0@ the local stack
stmia   r12, {r1-r6, lr}@ v7_invalidate_l1 touches r0-r6
+   ldr r0, [r12, #0]
+   bl  printhex8
+   ldr r0, [r12, #4]
+   bl  printhex8
+   ldr r0, [r12, #8]
+   bl  printhex8
+   ldr r0, [r12, #12]
+   bl  printhex8
+   ldr r0, [r12, #16]
+   bl  printhex8
+   ldr r0, [r12, #20]
+   bl  printhex8
+   ldr r0, [r12, #24]
+   bl  printhex8
+   mov r0, #'\n'
+   bl  printch
bl  v7_invalidate_l1
+   ldr r0, [r12, #0]
+   bl  printhex8
+   ldr r0, [r12, #4]
+   bl  printhex8
+   ldr r0, [r12, #8]
+   bl  printhex8
+   ldr r0, [r12, #12]
+   bl  printhex8
+   ldr r0, [r12, #16]
+   bl  printhex8
+   ldr r0, [r12, #20]
+   bl  printhex8
+   ldr r0, [r12, #24]
+   bl  printhex8
+   mov r0, #'\n'
+   bl  printch
ldmia   r12, {r1-r6, lr}
 
 __v7_setup_cont:

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
On 04/02/2021 18:23, Nick Desaulniers wrote:
> On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  wrote:
>>
>> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built 
>> Linux wrote:
>>> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:

 On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
  wrote:
>
> On 04/02/2021 15:42, Ard Biesheuvel wrote:
>> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
>>  wrote:
>>>
>>> Essentially:
>>>
>>>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
>>> CC="ccache clang" zImage
>>>
>>> This command should link with BFD (and assemble with GAS; it's only
>>> using clang as the compiler.
>>
>> I think you missed the 'LLVM=1' before CC="ccache clang". That should
>> use all of the LLVM utilities minus the integrated assembler while
>> wrapping clang with ccache.
> 
> You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
> permit fallback to BFD.

That was close, except we're cross-compiling with GCC for arm.
So I've now built a plain next-20210203 (without Ard's fix) using
this command line:

make LD=arm-linux-gnueabihf-ld.bfd -j18 ARCH=arm 
CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache clang" zImage

I'm using a modified Docker image gtucker/kernelci-build-clang-11
with the very latest LLVM 11 and gcc-8-arm-linux-gnueabihf
packages added to be able to use the GNU linker.  BTW I guess we
should enable this kind of hybrid build setup on kernelci.org as
well.

Full build log + kernel binaries can be found here:


https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-41/arm/multi_v7_defconfig/clang-11/

And this booted fine, which confirms it's really down to how
ld.lld puts together the kernel image.  Does it actually solve
the debate whether this is an issue to fix in the assembly code
or at link time?

Full test job details for the record:

https://lava.collabora.co.uk/scheduler/job/3176004

Hope that helps,
Guillaume


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
On 04/02/2021 16:01, Ard Biesheuvel wrote:
> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
>  wrote:
>>
>> On 04/02/2021 15:42, Ard Biesheuvel wrote:
>>> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
>>>  wrote:

 On 04/02/2021 10:33, Guillaume Tucker wrote:
> On 04/02/2021 10:27, Ard Biesheuvel wrote:
>> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
>>  wrote:
>>>
>>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
 On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
  wrote:
>
> Hi Ard,
>
> Please see the bisection report below about a boot failure on
> rk3288 with next-20210203.  It was also bisected on
> imx6q-var-dt6customboard with next-20210202.
>
> Reports aren't automatically sent to the public while we're
> trialing new bisection features on kernelci.org but this one
> looks valid.
>
> The kernel is most likely crashing very early on, so there's
> nothing in the logs.  Please let us know if you need some help
> with debugging or trying a fix on these platforms.
>

 Thanks for the report.
>>>
>>> Ard,
>>>
>>> I want to send my fixes branch today which includes your regression
>>> fix that caused this regression.
>>>
>>> As this is proving difficult to fix, I can only drop your fix from
>>> my fixes branch - and given that this seems to be problematical, I'm
>>> tempted to revert the original change at this point which should fix
>>> both of these regressions - and then we have another go at getting rid
>>> of the set/way instructions during the next cycle.
>>>
>>> Thoughts?
>>>
>>
>> Hi Russell,
>>
>> If Guillaume is willing to do the experiment, and it fixes the issue,
>
> Yes, I'm running some tests with that fix now and should have
> some results shortly.

 Yes it does fix the issue:

   https://lava.collabora.co.uk/scheduler/job/3173819

 with Ard's fix applied to this test branch:

   
 https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/


 +clang +Nick

 It's worth mentioning that the issue only happens with kernels
 built with Clang.  As you can see there are several other arm
 platforms failing with clang-11 builds but booting fine with
 gcc-8:

   
 https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/

 Here's a sample build log:

   
 https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log

 Essentially:

   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache 
 clang" zImage

 I believe it should be using the GNU assembler as LLVM_IAS=1 is
 not defined, but there may be something more subtle about it.

>>>
>>>
>>> Do you have a link for a failing zImage built from multi_v7_defconfig?
>>
>> Sure, this one was built from a plain next-20210203:
>>
>>   
>> http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage
>>
>> You can also find the dtbs, modules and other things in that same
>> directory.
>>
>> For the record, here's the test job that used it:
>>
>>   https://lava.collabora.co.uk/scheduler/job/3173792
>>
> 
> Thanks.
> 
> That zImage boots fine locally. Unfortunately, I don't have rk3288
> hardware to reproduce.
> 
> Could you please point me to the list of all the other platforms that
> failed to boot this image?

This is the list of platforms from kernelci.org I've gathered
which appeared to be impacted:

imx6q-sabrelite
imx6q-var-dt6customboard
imx6dl-riotboard
imx6qp-wandboard-revd1
imx7ulp-evk
odroid-xu3
rk3288-rock2-square
rk3288-veyron-jaq
stm32mp157c-dk2
sun4i-a10-olinuxino-lime
sun5i-a13-olinuxino-micro
sun7i-a20-cubieboard2
sun7i-a20-olinuxino-lime2
sun8i-a33-olinuxino
sun8i-a83t-bananapi-m3
sun8i-h2-plus-libretech-all-h3-cc
sun8i-h2-plus-orangepi-r1
sun8i-h2-plus-orangepi-zero
sun8i-h3-libretech-all-h3-cc
sun8i-h3-bananapi-m2-plus
sun8i-h3-orangepi-pc
sun8i-r40-bananapi-m2-ultra

They were all booting next-20210203 with gcc-8 but not with
clang-11.  I've run checks on a good share of them with your
patch applied and they're now booting with clang-11, just like
the rk3288 and imx6q platforms that were used for the bisections.


> To be honest, I am slightly annoyed that a change that works fine with
> GCC but does not work with Clang version
> 
> 11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158
> 
> (where exp means experimental, I suppose) is the reason for this

Well it's the standard one from the LLVM Debian package repo:

  deb http://apt.llvm.org/buster/ 

Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Nick Desaulniers
On Thu, Feb 4, 2021 at 10:12 AM Nathan Chancellor  wrote:
>
> On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built 
> Linux wrote:
> > On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
> > >
> > > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
> > >  wrote:
> > > >
> > > > On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > > > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> > > > >  wrote:
> > > > >>
> > > > >> Essentially:
> > > > >>
> > > > >>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> > > > >> CC="ccache clang" zImage
> >
> > This command should link with BFD (and assemble with GAS; it's only
> > using clang as the compiler.
>
> I think you missed the 'LLVM=1' before CC="ccache clang". That should
> use all of the LLVM utilities minus the integrated assembler while
> wrapping clang with ccache.

You're right, I missed `LLVM=1`. Adding `LD=ld.bfd` I think should
permit fallback to BFD.
-- 
Thanks,
~Nick Desaulniers


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Nathan Chancellor
On Thu, Feb 04, 2021 at 10:06:08AM -0800, 'Nick Desaulniers' via Clang Built 
Linux wrote:
> On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
> >
> > On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
> >  wrote:
> > >
> > > On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> > > >  wrote:
> > > >>
> > > >> Essentially:
> > > >>
> > > >>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> > > >> CC="ccache clang" zImage
> 
> This command should link with BFD (and assemble with GAS; it's only
> using clang as the compiler.

I think you missed the 'LLVM=1' before CC="ccache clang". That should
use all of the LLVM utilities minus the integrated assembler while
wrapping clang with ccache.

Cheers,
Nathan


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Nick Desaulniers
On Thu, Feb 4, 2021 at 8:02 AM Ard Biesheuvel  wrote:
>
> On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
>  wrote:
> >
> > On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> > >  wrote:
> > >>
> > >> Essentially:
> > >>
> > >>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 
> > >> CC="ccache clang" zImage

This command should link with BFD (and assemble with GAS; it's only
using clang as the compiler.

>
> To be honest, I am slightly annoyed that a change that works fine with
> GCC but does not work with Clang version
>
> 11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158
>
> (where exp means experimental, I suppose) is the reason for this
> discussion, especially because the change is in asm code. Is it
> possible to build with Clang but use the GNU linker?

rk3288 might be the last 32b ARM platform ChromeOS uses. "veyron"
-- 
Thanks,
~Nick Desaulniers


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 16:53, Guillaume Tucker
 wrote:
>
> On 04/02/2021 15:42, Ard Biesheuvel wrote:
> > On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
> >  wrote:
> >>
> >> On 04/02/2021 10:33, Guillaume Tucker wrote:
> >>> On 04/02/2021 10:27, Ard Biesheuvel wrote:
>  On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
>   wrote:
> >
> > On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
> >> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
> >>  wrote:
> >>>
> >>> Hi Ard,
> >>>
> >>> Please see the bisection report below about a boot failure on
> >>> rk3288 with next-20210203.  It was also bisected on
> >>> imx6q-var-dt6customboard with next-20210202.
> >>>
> >>> Reports aren't automatically sent to the public while we're
> >>> trialing new bisection features on kernelci.org but this one
> >>> looks valid.
> >>>
> >>> The kernel is most likely crashing very early on, so there's
> >>> nothing in the logs.  Please let us know if you need some help
> >>> with debugging or trying a fix on these platforms.
> >>>
> >>
> >> Thanks for the report.
> >
> > Ard,
> >
> > I want to send my fixes branch today which includes your regression
> > fix that caused this regression.
> >
> > As this is proving difficult to fix, I can only drop your fix from
> > my fixes branch - and given that this seems to be problematical, I'm
> > tempted to revert the original change at this point which should fix
> > both of these regressions - and then we have another go at getting rid
> > of the set/way instructions during the next cycle.
> >
> > Thoughts?
> >
> 
>  Hi Russell,
> 
>  If Guillaume is willing to do the experiment, and it fixes the issue,
> >>>
> >>> Yes, I'm running some tests with that fix now and should have
> >>> some results shortly.
> >>
> >> Yes it does fix the issue:
> >>
> >>   https://lava.collabora.co.uk/scheduler/job/3173819
> >>
> >> with Ard's fix applied to this test branch:
> >>
> >>   
> >> https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/
> >>
> >>
> >> +clang +Nick
> >>
> >> It's worth mentioning that the issue only happens with kernels
> >> built with Clang.  As you can see there are several other arm
> >> platforms failing with clang-11 builds but booting fine with
> >> gcc-8:
> >>
> >>   
> >> https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/
> >>
> >> Here's a sample build log:
> >>
> >>   
> >> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log
> >>
> >> Essentially:
> >>
> >>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache 
> >> clang" zImage
> >>
> >> I believe it should be using the GNU assembler as LLVM_IAS=1 is
> >> not defined, but there may be something more subtle about it.
> >>
> >
> >
> > Do you have a link for a failing zImage built from multi_v7_defconfig?
>
> Sure, this one was built from a plain next-20210203:
>
>   
> http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage
>
> You can also find the dtbs, modules and other things in that same
> directory.
>
> For the record, here's the test job that used it:
>
>   https://lava.collabora.co.uk/scheduler/job/3173792
>

Thanks.

That zImage boots fine locally. Unfortunately, I don't have rk3288
hardware to reproduce.

Could you please point me to the list of all the other platforms that
failed to boot this image?

To be honest, I am slightly annoyed that a change that works fine with
GCC but does not work with Clang version

11.1.0-++20210130110826+3a8282376b6c-1~exp1~20210130221445.158

(where exp means experimental, I suppose) is the reason for this
discussion, especially because the change is in asm code. Is it
possible to build with Clang but use the GNU linker?


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 15:36, Russell King - ARM Linux admin
 wrote:
>
> On Thu, Feb 04, 2021 at 03:25:20PM +0100, Ard Biesheuvel wrote:
> > Pushing contents of the cache hierarchy to main memory is *not* a
> > valid use of set/way ops, and so there is no point in pretending that
> > set/way ops will produce the same results as by-VA ops. Only the by-VA
> > ops give the architectural guarantees that we rely on for correctness.
>
> ... yet we /were/ doing that, and it worked fine for 13 years - from
> 1st June 2007 until the by-VA merge into mainline on the 3rd April
> 2020.
>
> You may be right that it wasn't the most correct way, but it worked
> for those 13 years without issue, and it's only recently that it's
> become a problem, and trying to "fix" that introduced a regression,
> and fixing that regression has caused another regression... and I
> what I'm wondering is how many more regression fixing cycles it's
> going to take - how many regression fixes on top of other regression
> fixes are we going to end up seeing here.
>
> The fact is, we never properly understood why your patch caused the
> regression I was seeing. If we don't understand it, then we can never
> say that we've fixed the problem properly. That is highlighted by the
> fact that fixing the regression I was seeing has caused another
> regression.
>

I agree with all these points.

But as I pointed out, reverting the original by-VA change, which has
been there for almost a year now, has some risks of its own. If we
don't understand the details of how this is broken, how can we be sure
we don't break something else if we revert it again?

So I'm not arguing that reverting the original patch is unreasonable,
just that doing so at this point in the cycle is risky, and that it
would be better to queue the revert for v5.12, and only backport it
after some soak time in -next. And in a sense, reinstating the
cache_clean() before cache_off() already amounts to a partial revert
of the queued fix.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
On 04/02/2021 15:42, Ard Biesheuvel wrote:
> On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
>  wrote:
>>
>> On 04/02/2021 10:33, Guillaume Tucker wrote:
>>> On 04/02/2021 10:27, Ard Biesheuvel wrote:
 On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
  wrote:
>
> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
>> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
>>  wrote:
>>>
>>> Hi Ard,
>>>
>>> Please see the bisection report below about a boot failure on
>>> rk3288 with next-20210203.  It was also bisected on
>>> imx6q-var-dt6customboard with next-20210202.
>>>
>>> Reports aren't automatically sent to the public while we're
>>> trialing new bisection features on kernelci.org but this one
>>> looks valid.
>>>
>>> The kernel is most likely crashing very early on, so there's
>>> nothing in the logs.  Please let us know if you need some help
>>> with debugging or trying a fix on these platforms.
>>>
>>
>> Thanks for the report.
>
> Ard,
>
> I want to send my fixes branch today which includes your regression
> fix that caused this regression.
>
> As this is proving difficult to fix, I can only drop your fix from
> my fixes branch - and given that this seems to be problematical, I'm
> tempted to revert the original change at this point which should fix
> both of these regressions - and then we have another go at getting rid
> of the set/way instructions during the next cycle.
>
> Thoughts?
>

 Hi Russell,

 If Guillaume is willing to do the experiment, and it fixes the issue,
>>>
>>> Yes, I'm running some tests with that fix now and should have
>>> some results shortly.
>>
>> Yes it does fix the issue:
>>
>>   https://lava.collabora.co.uk/scheduler/job/3173819
>>
>> with Ard's fix applied to this test branch:
>>
>>   https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/
>>
>>
>> +clang +Nick
>>
>> It's worth mentioning that the issue only happens with kernels
>> built with Clang.  As you can see there are several other arm
>> platforms failing with clang-11 builds but booting fine with
>> gcc-8:
>>
>>   
>> https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/
>>
>> Here's a sample build log:
>>
>>   
>> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log
>>
>> Essentially:
>>
>>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache 
>> clang" zImage
>>
>> I believe it should be using the GNU assembler as LLVM_IAS=1 is
>> not defined, but there may be something more subtle about it.
>>
> 
> 
> Do you have a link for a failing zImage built from multi_v7_defconfig?

Sure, this one was built from a plain next-20210203:

  
http://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/zImage

You can also find the dtbs, modules and other things in that same
directory.

For the record, here's the test job that used it:

  https://lava.collabora.co.uk/scheduler/job/3173792

Guillaume


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 12:32, Guillaume Tucker
 wrote:
>
> On 04/02/2021 10:33, Guillaume Tucker wrote:
> > On 04/02/2021 10:27, Ard Biesheuvel wrote:
> >> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
> >>  wrote:
> >>>
> >>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
>  On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
>   wrote:
> >
> > Hi Ard,
> >
> > Please see the bisection report below about a boot failure on
> > rk3288 with next-20210203.  It was also bisected on
> > imx6q-var-dt6customboard with next-20210202.
> >
> > Reports aren't automatically sent to the public while we're
> > trialing new bisection features on kernelci.org but this one
> > looks valid.
> >
> > The kernel is most likely crashing very early on, so there's
> > nothing in the logs.  Please let us know if you need some help
> > with debugging or trying a fix on these platforms.
> >
> 
>  Thanks for the report.
> >>>
> >>> Ard,
> >>>
> >>> I want to send my fixes branch today which includes your regression
> >>> fix that caused this regression.
> >>>
> >>> As this is proving difficult to fix, I can only drop your fix from
> >>> my fixes branch - and given that this seems to be problematical, I'm
> >>> tempted to revert the original change at this point which should fix
> >>> both of these regressions - and then we have another go at getting rid
> >>> of the set/way instructions during the next cycle.
> >>>
> >>> Thoughts?
> >>>
> >>
> >> Hi Russell,
> >>
> >> If Guillaume is willing to do the experiment, and it fixes the issue,
> >
> > Yes, I'm running some tests with that fix now and should have
> > some results shortly.
>
> Yes it does fix the issue:
>
>   https://lava.collabora.co.uk/scheduler/job/3173819
>
> with Ard's fix applied to this test branch:
>
>   https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/
>
>
> +clang +Nick
>
> It's worth mentioning that the issue only happens with kernels
> built with Clang.  As you can see there are several other arm
> platforms failing with clang-11 builds but booting fine with
> gcc-8:
>
>   
> https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/
>
> Here's a sample build log:
>
>   
> https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log
>
> Essentially:
>
>   make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache 
> clang" zImage
>
> I believe it should be using the GNU assembler as LLVM_IAS=1 is
> not defined, but there may be something more subtle about it.
>


Do you have a link for a failing zImage built from multi_v7_defconfig?


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 03:25:20PM +0100, Ard Biesheuvel wrote:
> Pushing contents of the cache hierarchy to main memory is *not* a
> valid use of set/way ops, and so there is no point in pretending that
> set/way ops will produce the same results as by-VA ops. Only the by-VA
> ops give the architectural guarantees that we rely on for correctness.

... yet we /were/ doing that, and it worked fine for 13 years - from
1st June 2007 until the by-VA merge into mainline on the 3rd April
2020.

You may be right that it wasn't the most correct way, but it worked
for those 13 years without issue, and it's only recently that it's
become a problem, and trying to "fix" that introduced a regression,
and fixing that regression has caused another regression... and I
what I'm wondering is how many more regression fixing cycles it's
going to take - how many regression fixes on top of other regression
fixes are we going to end up seeing here.

The fact is, we never properly understood why your patch caused the
regression I was seeing. If we don't understand it, then we can never
say that we've fixed the problem properly. That is highlighted by the
fact that fixing the regression I was seeing has caused another
regression.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 15:09, Russell King - ARM Linux admin
 wrote:
>
> On Thu, Feb 04, 2021 at 12:26:44PM +, Marc Zyngier wrote:
> > I agree. With set/way CMOs, there is no way to reach the PoC if
> > it beyond the system cache, leading to an unbootable kernel.
> > This is actually pretty well documented in the architecture,
> > and it did bite us for the first time on XGene-1, 7 years ago.
>
> That may be, however we still do set/way maintenance to invalidate
> the L1 cache as that is required for ARMv7 to place the cache into
> a known state, as stated by the architecture reference manual.
>

Getting a certain cache at a certain level into a known state is a
valid use of set/way ops, and is simply unnecessary when running under
virtualization, but doesn't do any harm.

Pushing contents of the cache hierarchy to main memory is *not* a
valid use of set/way ops, and so there is no point in pretending that
set/way ops will produce the same results as by-VA ops. Only the by-VA
ops give the architectural guarantees that we rely on for correctness.

> Arguably, that should be done by firmware, but when starting
> secondary CPUs, there are platforms out there which do not bring
> the L1 cache to a defined state. So we are pretty much stuck with
> doing set/way operations during CPU initialisation in the main
> kernel.
>

Indeed. And this is unfortunate, but not the end of the world.

> If ARMv8 decides that this is not supportable, then that's a matter
> for ARMv8 to address without impacting the requirements of ARMv7.
>

I'm not sure what you mean here. The v7 architecture is crystal clear
about the difference between set/way ops (managing a single cache),
and by-VA ops (managing the 'cachedness' state of a memory region).
The semantics are radically different, regardless of v7 vs v8 or
AArch32 vs AArch64.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 12:26:44PM +, Marc Zyngier wrote:
> I agree. With set/way CMOs, there is no way to reach the PoC if
> it beyond the system cache, leading to an unbootable kernel.
> This is actually pretty well documented in the architecture,
> and it did bite us for the first time on XGene-1, 7 years ago.

That may be, however we still do set/way maintenance to invalidate
the L1 cache as that is required for ARMv7 to place the cache into
a known state, as stated by the architecture reference manual.

Arguably, that should be done by firmware, but when starting
secondary CPUs, there are platforms out there which do not bring
the L1 cache to a defined state. So we are pretty much stuck with
doing set/way operations during CPU initialisation in the main
kernel.

If ARMv8 decides that this is not supportable, then that's a matter
for ARMv8 to address without impacting the requirements of ARMv7.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Marc Zyngier

On 2021-02-04 10:55, Ard Biesheuvel wrote:

(cc Marc)

On Thu, 4 Feb 2021 at 11:48, Russell King - ARM Linux admin
 wrote:


On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote:
> Hi Russell,
>
> If Guillaume is willing to do the experiment, and it fixes the issue,
> it proves that rk3288 is relying on the flush before the MMU is
> disabled, and so in that case, the fix is trivial, and we can just
> apply it.
>
> If the experiment fails (which would mean rk3288 does not tolerate the
> cache maintenance being performed after cache off), it is going to be
> hairy, and so it will definitely take more time.
>
> So in the latter case (or if Guillaume does not get back to us), I
> think reverting my queued fix is the only sane option. But in that
> case, may I suggest that we queue the revert of the original by-VA
> change for v5.12 so it gets lots of coverage in -next, and allows us
> an opportunity to come up with a proper fix in the same timeframe, and
> backport the revert and the subsequent fix as a pair? Otherwise, we'll
> end up in the situation where v5.10.x until today has by-va, v5.10.x-y
> has set/way, and v5.10y+ has by-va again. (I don't think we care about
> anything before that, given that v5.4 predates any of this)

I'm suggesting dropping your fix (9052/1) and reverting
"ARM: decompressor: switch to by-VA cache maintenance for v7 cores"
which gets us to a point where _both_ regressions are fixed.



I understand, but we don't know whether doing so might regress other
platforms that were added in the mean time.


I'm of the opinion that the by-VA patch was incorrect when it was
merged (it caused a regression), and it's only a performance
improvement.


It is a correctness improvement, not a performance improvement.

Without that change, the 32-bit ARM kernel cannot boot bare metal on
platforms with a system cache such as 8040 or SynQuacer, and can only
boot under KVM on such systems because of the special handling of
set/way instructions by the host.


I agree. With set/way CMOs, there is no way to reach the PoC if
it beyond the system cache, leading to an unbootable kernel.
This is actually pretty well documented in the architecture,
and it did bite us for the first time on XGene-1, 7 years ago.

In retrospect, having KVM to handle set/way CMOs in was a mistake,
as it just papered over the problem for the sake of running older
32bit guests. It violated the principle of KVM/arm being strictly
architectural and provided unrealistic expectations. I'll take the
blame for this.

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 12:45, Russell King - ARM Linux admin
 wrote:
>
> On Thu, Feb 04, 2021 at 11:32:05AM +, Guillaume Tucker wrote:
> > Yes it does fix the issue:
> >
> >   https://lava.collabora.co.uk/scheduler/job/3173819
> >
> > with Ard's fix applied to this test branch:
> >
> >   
> > https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/
> >
> >
> > +clang +Nick
> >
> > It's worth mentioning that the issue only happens with kernels
> > built with Clang.  As you can see there are several other arm
> > platforms failing with clang-11 builds but booting fine with
> > gcc-8:
>
> My gut feeling is that it isn't Clang specific - it's likely down to
> the exact code/data placement, how things end up during decompression,
> and exactly what state the cache ends up in.
>
> That certainly was the case with the original regression.
>

Agreed.

So given that my queued fix turns this

cache_clean
cache_off

into this

cache_off
cache_clean

for v7 only, and considering that turning this into

cache_clean
cache_off
cache_clean

(as the diff tested by Guillaume does) fixes the reported issue, it
seems like the safest option to me at this point.

Reverting both patches, one of which has been in mainline since v5.7,
seems unwise to me at this point in the cycle.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 11:32:05AM +, Guillaume Tucker wrote:
> Yes it does fix the issue:
> 
>   https://lava.collabora.co.uk/scheduler/job/3173819
> 
> with Ard's fix applied to this test branch:
> 
>   https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/
> 
> 
> +clang +Nick
> 
> It's worth mentioning that the issue only happens with kernels
> built with Clang.  As you can see there are several other arm
> platforms failing with clang-11 builds but booting fine with
> gcc-8:

My gut feeling is that it isn't Clang specific - it's likely down to
the exact code/data placement, how things end up during decompression,
and exactly what state the cache ends up in.

That certainly was the case with the original regression.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
On 04/02/2021 10:33, Guillaume Tucker wrote:
> On 04/02/2021 10:27, Ard Biesheuvel wrote:
>> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
>>  wrote:
>>>
>>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
 On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
  wrote:
>
> Hi Ard,
>
> Please see the bisection report below about a boot failure on
> rk3288 with next-20210203.  It was also bisected on
> imx6q-var-dt6customboard with next-20210202.
>
> Reports aren't automatically sent to the public while we're
> trialing new bisection features on kernelci.org but this one
> looks valid.
>
> The kernel is most likely crashing very early on, so there's
> nothing in the logs.  Please let us know if you need some help
> with debugging or trying a fix on these platforms.
>

 Thanks for the report.
>>>
>>> Ard,
>>>
>>> I want to send my fixes branch today which includes your regression
>>> fix that caused this regression.
>>>
>>> As this is proving difficult to fix, I can only drop your fix from
>>> my fixes branch - and given that this seems to be problematical, I'm
>>> tempted to revert the original change at this point which should fix
>>> both of these regressions - and then we have another go at getting rid
>>> of the set/way instructions during the next cycle.
>>>
>>> Thoughts?
>>>
>>
>> Hi Russell,
>>
>> If Guillaume is willing to do the experiment, and it fixes the issue,
> 
> Yes, I'm running some tests with that fix now and should have
> some results shortly.

Yes it does fix the issue:

  https://lava.collabora.co.uk/scheduler/job/3173819

with Ard's fix applied to this test branch:

  https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210203-ard-fix/


+clang +Nick

It's worth mentioning that the issue only happens with kernels
built with Clang.  As you can see there are several other arm
platforms failing with clang-11 builds but booting fine with
gcc-8:

  
https://kernelci.org/test/job/next/branch/master/kernel/next-20210203/plan/baseline/

Here's a sample build log:

  
https://storage.staging.kernelci.org/gtucker/next-20210203-ard-fix/v5.10-rc4-24722-g58b6c0e507b7-gtucker_single-staging-33/arm/multi_v7_defconfig/clang-11/build.log

Essentially:

  make -j18 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- LLVM=1 CC="ccache 
clang" zImage

I believe it should be using the GNU assembler as LLVM_IAS=1 is
not defined, but there may be something more subtle about it.

Thanks,
Guillaume


>> it proves that rk3288 is relying on the flush before the MMU is
>> disabled, and so in that case, the fix is trivial, and we can just
>> apply it.
>>
>> If the experiment fails (which would mean rk3288 does not tolerate the
>> cache maintenance being performed after cache off), it is going to be
>> hairy, and so it will definitely take more time.
>>
>> So in the latter case (or if Guillaume does not get back to us), I
>> think reverting my queued fix is the only sane option. But in that
>> case, may I suggest that we queue the revert of the original by-VA
>> change for v5.12 so it gets lots of coverage in -next, and allows us
>> an opportunity to come up with a proper fix in the same timeframe, and
>> backport the revert and the subsequent fix as a pair? Otherwise, we'll
>> end up in the situation where v5.10.x until today has by-va, v5.10.x-y
>> has set/way, and v5.10y+ has by-va again. (I don't think we care about
>> anything before that, given that v5.4 predates any of this)
>>
>> But in the end, I'm happy to go along with whatever works best for you.
> 
> Thanks,
> Guillaume
> 



Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
(cc Marc)

On Thu, 4 Feb 2021 at 11:48, Russell King - ARM Linux admin
 wrote:
>
> On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote:
> > Hi Russell,
> >
> > If Guillaume is willing to do the experiment, and it fixes the issue,
> > it proves that rk3288 is relying on the flush before the MMU is
> > disabled, and so in that case, the fix is trivial, and we can just
> > apply it.
> >
> > If the experiment fails (which would mean rk3288 does not tolerate the
> > cache maintenance being performed after cache off), it is going to be
> > hairy, and so it will definitely take more time.
> >
> > So in the latter case (or if Guillaume does not get back to us), I
> > think reverting my queued fix is the only sane option. But in that
> > case, may I suggest that we queue the revert of the original by-VA
> > change for v5.12 so it gets lots of coverage in -next, and allows us
> > an opportunity to come up with a proper fix in the same timeframe, and
> > backport the revert and the subsequent fix as a pair? Otherwise, we'll
> > end up in the situation where v5.10.x until today has by-va, v5.10.x-y
> > has set/way, and v5.10y+ has by-va again. (I don't think we care about
> > anything before that, given that v5.4 predates any of this)
>
> I'm suggesting dropping your fix (9052/1) and reverting
> "ARM: decompressor: switch to by-VA cache maintenance for v7 cores"
> which gets us to a point where _both_ regressions are fixed.
>

I understand, but we don't know whether doing so might regress other
platforms that were added in the mean time.

> I'm of the opinion that the by-VA patch was incorrect when it was
> merged (it caused a regression), and it's only a performance
> improvement.

It is a correctness improvement, not a performance improvement.

Without that change, the 32-bit ARM kernel cannot boot bare metal on
platforms with a system cache such as 8040 or SynQuacer, and can only
boot under KVM on such systems because of the special handling of
set/way instructions by the host.

The performance issue related to set/way ops under KVM was already
fixed by describing data and unified caches as 1 set and 1 way when
running in 32-bit mode.


> Our attempts so far to fix it are just causing other
> regressions. So, I think it is reasonable to revert both back to a
> known good point which has worked over a decade. If doing so causes
> regressions (which I think is unlikely), then that would be unfortunate
> but alas is a price that's worth paying to get back to a known good
> point - since then we're not stacking regression fixes on top of other
> regression fixes.
>

This is exactly why I am proposing to queue the revert of the original
patch for v5.12, and only backport it to v5.10 and v5.11 once we are
sure it does not break anything else.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 11:27:16AM +0100, Ard Biesheuvel wrote:
> Hi Russell,
> 
> If Guillaume is willing to do the experiment, and it fixes the issue,
> it proves that rk3288 is relying on the flush before the MMU is
> disabled, and so in that case, the fix is trivial, and we can just
> apply it.
> 
> If the experiment fails (which would mean rk3288 does not tolerate the
> cache maintenance being performed after cache off), it is going to be
> hairy, and so it will definitely take more time.
> 
> So in the latter case (or if Guillaume does not get back to us), I
> think reverting my queued fix is the only sane option. But in that
> case, may I suggest that we queue the revert of the original by-VA
> change for v5.12 so it gets lots of coverage in -next, and allows us
> an opportunity to come up with a proper fix in the same timeframe, and
> backport the revert and the subsequent fix as a pair? Otherwise, we'll
> end up in the situation where v5.10.x until today has by-va, v5.10.x-y
> has set/way, and v5.10y+ has by-va again. (I don't think we care about
> anything before that, given that v5.4 predates any of this)

I'm suggesting dropping your fix (9052/1) and reverting
"ARM: decompressor: switch to by-VA cache maintenance for v7 cores"
which gets us to a point where _both_ regressions are fixed.

I'm of the opinion that the by-VA patch was incorrect when it was
merged (it caused a regression), and it's only a performance
improvement. Our attempts so far to fix it are just causing other
regressions. So, I think it is reasonable to revert both back to a
known good point which has worked over a decade. If doing so causes
regressions (which I think is unlikely), then that would be unfortunate
but alas is a price that's worth paying to get back to a known good
point - since then we're not stacking regression fixes on top of other
regression fixes.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
On 04/02/2021 10:27, Ard Biesheuvel wrote:
> On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
>  wrote:
>>
>> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
>>> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
>>>  wrote:

 Hi Ard,

 Please see the bisection report below about a boot failure on
 rk3288 with next-20210203.  It was also bisected on
 imx6q-var-dt6customboard with next-20210202.

 Reports aren't automatically sent to the public while we're
 trialing new bisection features on kernelci.org but this one
 looks valid.

 The kernel is most likely crashing very early on, so there's
 nothing in the logs.  Please let us know if you need some help
 with debugging or trying a fix on these platforms.

>>>
>>> Thanks for the report.
>>
>> Ard,
>>
>> I want to send my fixes branch today which includes your regression
>> fix that caused this regression.
>>
>> As this is proving difficult to fix, I can only drop your fix from
>> my fixes branch - and given that this seems to be problematical, I'm
>> tempted to revert the original change at this point which should fix
>> both of these regressions - and then we have another go at getting rid
>> of the set/way instructions during the next cycle.
>>
>> Thoughts?
>>
> 
> Hi Russell,
> 
> If Guillaume is willing to do the experiment, and it fixes the issue,

Yes, I'm running some tests with that fix now and should have
some results shortly.

> it proves that rk3288 is relying on the flush before the MMU is
> disabled, and so in that case, the fix is trivial, and we can just
> apply it.
> 
> If the experiment fails (which would mean rk3288 does not tolerate the
> cache maintenance being performed after cache off), it is going to be
> hairy, and so it will definitely take more time.
> 
> So in the latter case (or if Guillaume does not get back to us), I
> think reverting my queued fix is the only sane option. But in that
> case, may I suggest that we queue the revert of the original by-VA
> change for v5.12 so it gets lots of coverage in -next, and allows us
> an opportunity to come up with a proper fix in the same timeframe, and
> backport the revert and the subsequent fix as a pair? Otherwise, we'll
> end up in the situation where v5.10.x until today has by-va, v5.10.x-y
> has set/way, and v5.10y+ has by-va again. (I don't think we care about
> anything before that, given that v5.4 predates any of this)
> 
> But in the end, I'm happy to go along with whatever works best for you.

Thanks,
Guillaume


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 11:06, Russell King - ARM Linux admin
 wrote:
>
> On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
> > On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
> >  wrote:
> > >
> > > Hi Ard,
> > >
> > > Please see the bisection report below about a boot failure on
> > > rk3288 with next-20210203.  It was also bisected on
> > > imx6q-var-dt6customboard with next-20210202.
> > >
> > > Reports aren't automatically sent to the public while we're
> > > trialing new bisection features on kernelci.org but this one
> > > looks valid.
> > >
> > > The kernel is most likely crashing very early on, so there's
> > > nothing in the logs.  Please let us know if you need some help
> > > with debugging or trying a fix on these platforms.
> > >
> >
> > Thanks for the report.
>
> Ard,
>
> I want to send my fixes branch today which includes your regression
> fix that caused this regression.
>
> As this is proving difficult to fix, I can only drop your fix from
> my fixes branch - and given that this seems to be problematical, I'm
> tempted to revert the original change at this point which should fix
> both of these regressions - and then we have another go at getting rid
> of the set/way instructions during the next cycle.
>
> Thoughts?
>

Hi Russell,

If Guillaume is willing to do the experiment, and it fixes the issue,
it proves that rk3288 is relying on the flush before the MMU is
disabled, and so in that case, the fix is trivial, and we can just
apply it.

If the experiment fails (which would mean rk3288 does not tolerate the
cache maintenance being performed after cache off), it is going to be
hairy, and so it will definitely take more time.

So in the latter case (or if Guillaume does not get back to us), I
think reverting my queued fix is the only sane option. But in that
case, may I suggest that we queue the revert of the original by-VA
change for v5.12 so it gets lots of coverage in -next, and allows us
an opportunity to come up with a proper fix in the same timeframe, and
backport the revert and the subsequent fix as a pair? Otherwise, we'll
end up in the situation where v5.10.x until today has by-va, v5.10.x-y
has set/way, and v5.10y+ has by-va again. (I don't think we care about
anything before that, given that v5.4 predates any of this)

But in the end, I'm happy to go along with whatever works best for you.


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Russell King - ARM Linux admin
On Thu, Feb 04, 2021 at 10:07:58AM +0100, Ard Biesheuvel wrote:
> On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
>  wrote:
> >
> > Hi Ard,
> >
> > Please see the bisection report below about a boot failure on
> > rk3288 with next-20210203.  It was also bisected on
> > imx6q-var-dt6customboard with next-20210202.
> >
> > Reports aren't automatically sent to the public while we're
> > trialing new bisection features on kernelci.org but this one
> > looks valid.
> >
> > The kernel is most likely crashing very early on, so there's
> > nothing in the logs.  Please let us know if you need some help
> > with debugging or trying a fix on these platforms.
> >
> 
> Thanks for the report.

Ard,

I want to send my fixes branch today which includes your regression
fix that caused this regression.

As this is proving difficult to fix, I can only drop your fix from
my fixes branch - and given that this seems to be problematical, I'm
tempted to revert the original change at this point which should fix
both of these regressions - and then we have another go at getting rid
of the set/way instructions during the next cycle.

Thoughts?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Ard Biesheuvel
On Thu, 4 Feb 2021 at 09:43, Guillaume Tucker
 wrote:
>
> Hi Ard,
>
> Please see the bisection report below about a boot failure on
> rk3288 with next-20210203.  It was also bisected on
> imx6q-var-dt6customboard with next-20210202.
>
> Reports aren't automatically sent to the public while we're
> trialing new bisection features on kernelci.org but this one
> looks valid.
>
> The kernel is most likely crashing very early on, so there's
> nothing in the logs.  Please let us know if you need some help
> with debugging or trying a fix on these platforms.
>

Thanks for the report.

Mind trying the following fix?

--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -617,8 +617,10 @@ not_relocated: mov r0, #0
@ cache_clean_flush() redundant. In other cases, the clean is
@ performed by set/way and R0/R1 are ignored.
@
-   mov r0, #0
-   mov r1, #0
+   get_inflated_image_size r1, r2, r3
+
+   mov r0, r4  @ start of decompressed kernel
+   add r1, r1, r0  @ end of kernel BSS
bl  cache_clean_flush

get_inflated_image_size r1, r2, r3




> On 04/02/2021 04:25, KernelCI bot wrote:
> > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> > * This automated bisection report was sent to you on the basis  *
> > * that you may be involved with the breaking commit it has  *
> > * found.  No manual investigation has been done to verify it,   *
> > * and the root cause of the problem may be somewhere else.  *
> > *   *
> > * If you do send a fix, please include this trailer:*
> > *   Reported-by: "kernelci.org bot"   *
> > *   *
> > * Hope this helps!  *
> > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> >
> > next/master bisection: baseline.login on rk3288-rock2-square
> >
> > Summary:
> >   Start:  58b6c0e507b7 Add linux-next specific files for 20210203
> >   Plain log:  
> > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.txt
> >   HTML log:   
> > https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.html
> >   Result: 5a29552af92d ARM: 9052/1: decompressor: cover BSS in cache 
> > clean and reorder with MMU disable on v7
> >
> > Checks:
> >   revert: PASS
> >   verify: PASS
> >
> > Parameters:
> >   Tree:   next
> >   URL:
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> >   Branch: master
> >   Target: rk3288-rock2-square
> >   CPU arch:   arm
> >   Lab:lab-collabora
> >   Compiler:   clang-11
> >   Config: multi_v7_defconfig
> >   Test case:  baseline.login
> >
> > Breaking commit found:
> >
> > ---
> > commit 5a29552af92dbd62c2b6fd1cddf7dad1ef7555b2
> > Author: Ard Biesheuvel 
> > Date:   Sun Jan 24 18:03:45 2021 +0100
> >
> > ARM: 9052/1: decompressor: cover BSS in cache clean and reorder with 
> > MMU disable on v7
> >
> > Commit 401b368caaec ("ARM: decompressor: switch to by-VA cache 
> > maintenance
> > for v7 cores") replaced the by-set/way cache maintenance in the 
> > decompressor
> > with by-VA cache maintenance, which is more appropriate for the task at
> > hand, especially under virtualization on hosts with non-architected 
> > system
> > caches that are not affected by by-set/way maintenance at all.
> >
> > On such systems, that commit inadvertently removed the cache clean and
> > invalidate of all of the guest's memory that is performed by KVM on 
> > behalf
> > of the guest after its MMU is disabled (but only if any by-set/way cache
> > maintenance instructions were issued first). This resulted in various
> > erroneous behaviors observed by Russell, all involving the mini-stack
> > used by the core kernel's v7 boot code, and which resides in BSS. It
> > seems intractable to figure out exactly what goes wrong in each of these
> > cases, but some small experiments did suggest that the lack of a cache
> > clean and invalidate *after* disabling the MMU and caches is what
> > triggers the errors, presumably because cachelines are being allocated
> > or reallocated while the first cache clean and invalidate is in 
> > progress.
> >
> > To ensure that no cache lines cover any of the data that is accessed by
> > the booting kernel with the MMU off, include the uncompressed kernel's
> > BSS region in the cache clean operation.
> >
> > Also, to ensure that no cachelines are allocated 

Re: next/master bisection: baseline.login on rk3288-rock2-square

2021-02-04 Thread Guillaume Tucker
Hi Ard,

Please see the bisection report below about a boot failure on
rk3288 with next-20210203.  It was also bisected on
imx6q-var-dt6customboard with next-20210202.

Reports aren't automatically sent to the public while we're
trialing new bisection features on kernelci.org but this one
looks valid.

The kernel is most likely crashing very early on, so there's
nothing in the logs.  Please let us know if you need some help
with debugging or trying a fix on these platforms.

Best wishes,
Guillaume


On 04/02/2021 04:25, KernelCI bot wrote:
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis  *
> * that you may be involved with the breaking commit it has  *
> * found.  No manual investigation has been done to verify it,   *
> * and the root cause of the problem may be somewhere else.  *
> *   *
> * If you do send a fix, please include this trailer:*
> *   Reported-by: "kernelci.org bot"   *
> *   *
> * Hope this helps!  *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> 
> next/master bisection: baseline.login on rk3288-rock2-square
> 
> Summary:
>   Start:  58b6c0e507b7 Add linux-next specific files for 20210203
>   Plain log:  
> https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.txt
>   HTML log:   
> https://storage.kernelci.org/next/master/next-20210203/arm/multi_v7_defconfig/clang-11/lab-collabora/baseline-rk3288-rock2-square.html
>   Result: 5a29552af92d ARM: 9052/1: decompressor: cover BSS in cache 
> clean and reorder with MMU disable on v7
> 
> Checks:
>   revert: PASS
>   verify: PASS
> 
> Parameters:
>   Tree:   next
>   URL:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   Branch: master
>   Target: rk3288-rock2-square
>   CPU arch:   arm
>   Lab:lab-collabora
>   Compiler:   clang-11
>   Config: multi_v7_defconfig
>   Test case:  baseline.login
> 
> Breaking commit found:
> 
> ---
> commit 5a29552af92dbd62c2b6fd1cddf7dad1ef7555b2
> Author: Ard Biesheuvel 
> Date:   Sun Jan 24 18:03:45 2021 +0100
> 
> ARM: 9052/1: decompressor: cover BSS in cache clean and reorder with MMU 
> disable on v7
> 
> Commit 401b368caaec ("ARM: decompressor: switch to by-VA cache maintenance
> for v7 cores") replaced the by-set/way cache maintenance in the 
> decompressor
> with by-VA cache maintenance, which is more appropriate for the task at
> hand, especially under virtualization on hosts with non-architected system
> caches that are not affected by by-set/way maintenance at all.
> 
> On such systems, that commit inadvertently removed the cache clean and
> invalidate of all of the guest's memory that is performed by KVM on behalf
> of the guest after its MMU is disabled (but only if any by-set/way cache
> maintenance instructions were issued first). This resulted in various
> erroneous behaviors observed by Russell, all involving the mini-stack
> used by the core kernel's v7 boot code, and which resides in BSS. It
> seems intractable to figure out exactly what goes wrong in each of these
> cases, but some small experiments did suggest that the lack of a cache
> clean and invalidate *after* disabling the MMU and caches is what
> triggers the errors, presumably because cachelines are being allocated
> or reallocated while the first cache clean and invalidate is in progress.
> 
> To ensure that no cache lines cover any of the data that is accessed by
> the booting kernel with the MMU off, include the uncompressed kernel's
> BSS region in the cache clean operation.
> 
> Also, to ensure that no cachelines are allocated while the cache is being
> cleaned, perform the cache clean operation *after* disabling the MMU and
> caches when running on v7 or later, by making a tail call to the clean
> routine from the cache_off routine. This requires passing the VA range
> to cache_off(), which means some care needs to be taken to preserve
> R0 and R1 across the call to cache_off().
> 
> Since this makes the first cache clean redundant, call it with the
> range reduced to zero. This only affects v7, as all other versions
> ignore R0/R1 entirely.
> 
> Link: 
> https://lore.kernel.org/linux-arm-kernel/20210122152012.30075-1-a...@kernel.org
> 
> Fixes: 401b368caaec ("ARM: decompressor: switch to by-VA cache 
> maintenance for v7 cores")
> Reported-by: Russell King 
> Signed-off-by: Ard Biesheuvel 
> Signed-off-by: Russell King 
> 
> diff --git