Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-12 Thread Mike Rapoport
On Tue, Jan 12, 2021 at 10:53:45AM +0000, Guillaume Tucker wrote:
> On 05/01/2021 09:13, Mike Rapoport wrote:
> > On Sun, Jan 03, 2021 at 03:09:14PM -0500, Andrea Arcangeli wrote:
> >> Hello Mike,
> >>
> >> On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
> >>> Thanks for the logs, it seems that implicitly adding reserved regions to
> >>> memblock.memory wasn't that bright an idea :)
> >>
> >> Would it be possible to somehow clean up the hack then?
> >>
> >> The only difference between the clean solution and the hack is that
> >> the hack intended to achieve the exact same, but without adding the
> >> reserved regions to memblock.memory.
> > 
> > I didn't consider adding reserved regions to memblock.memory as a clean
> > solution, this was still a hack, but I didn't think that things are that
> > fragile.
> > 
> > I still think we cannot rely on memblock.reserved to detect
> > memory/zone/node sizes and the boot failure reported here confirms this.
> >  
> >> The comment on that problematic area says the reserved area cannot be
> >> used for DMA because of some unexplained hw issue, and that doing so
> >> prevents booting, but since the area got reserved, even with the clean
> >> solution, it should never have been used for DMA?
> >>
> >> So I can only imagine that the physical memory region is way more
> >> problematic than just for DMA. It sounds like anything that
> >> touches it, including the CPU, will hang the system, not just DMA. It
> >> sounds somewhat similar to the other e820 direct mapping issue on x86?
> > 
> > My understanding is that the boot failed because when I implicitly added
> > the reserved region to memblock.memory the memory size seen by
> > free_area_init() jumped from 2G to 4G because the reserved area was close
> > to 4G. The very first allocation would get a chunk from slightly below
> > 4G and as there is no real memory there, the kernel would crash.
> >  
> >> If you want to test the hack on the arm board to check if it boots you
> >> can use the below commit:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc
> > 
> > My take is your solution would boot with this memory configuration, but I
> > still don't think that using memblock.reserved for zone/node sizing is
> > correct.
> 
> The rk3288 platform has now been failing to boot for nearly a
> month on linux-next:
> 
>   https://kernelci.org/test/case/id/5ffbed0a31ad81239bc94cdb/
> 
> Until a fix or a new version of this patch is made, would it be
> possible to drop it or revert it so the platform becomes usable
> again?

There is a new version of these patches:

https://lore.kernel.org/lkml/2021094017.22696-1-r...@kernel.org

It's going to be in linux-next as soon as Andrew pushes mmotm.
 
> Or if you want, I can make a cleaned-up version of my hack to
> ignore the problematic region if you still need your patch to be
> on linux-next, but that would probably be less than ideal.
> 
> Thanks,
> Guillaume

-- 
Sincerely yours,
Mike.


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-12 Thread Guillaume Tucker
On 12/01/2021 10:53, Guillaume Tucker wrote:
> On 05/01/2021 09:13, Mike Rapoport wrote:
>> On Sun, Jan 03, 2021 at 03:09:14PM -0500, Andrea Arcangeli wrote:
>>> Hello Mike,
>>>
>>> On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
>>>> Thanks for the logs, it seems that implicitly adding reserved regions to
>>>> memblock.memory wasn't that bright an idea :)
>>>
>>> Would it be possible to somehow clean up the hack then?
>>>
>>> The only difference between the clean solution and the hack is that
>>> the hack intended to achieve the exact same, but without adding the
>>> reserved regions to memblock.memory.
>>
>> I didn't consider adding reserved regions to memblock.memory as a clean
>> solution, this was still a hack, but I didn't think that things are that
>> fragile.
>>
>> I still think we cannot rely on memblock.reserved to detect
>> memory/zone/node sizes and the boot failure reported here confirms this.
>>  
>>> The comment on that problematic area says the reserved area cannot be
>>> used for DMA because of some unexplained hw issue, and that doing so
>>> prevents booting, but since the area got reserved, even with the clean
>>> solution, it should never have been used for DMA?
>>>
>>> So I can only imagine that the physical memory region is way more
>>> problematic than just for DMA. It sounds like anything that
>>> touches it, including the CPU, will hang the system, not just DMA. It
>>> sounds somewhat similar to the other e820 direct mapping issue on x86?
>>
>> My understanding is that the boot failed because when I implicitly added
>> the reserved region to memblock.memory the memory size seen by
>> free_area_init() jumped from 2G to 4G because the reserved area was close
>> to 4G. The very first allocation would get a chunk from slightly below
>> 4G and as there is no real memory there, the kernel would crash.
>>  
>>> If you want to test the hack on the arm board to check if it boots you
>>> can use the below commit:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc
>>
>> My take is your solution would boot with this memory configuration, but I
>> still don't think that using memblock.reserved for zone/node sizing is
>> correct.
> 
> The rk3288 platform has now been failing to boot for nearly a
> month on linux-next:
> 
>   https://kernelci.org/test/case/id/5ffbed0a31ad81239bc94cdb/
> 
> Until a fix or a new version of this patch is made, would it be
> possible to drop it or revert it so the platform becomes usable
> again?
> 
> Or if you want, I can make a cleaned-up version of my hack to
> ignore the problematic region if you still need your patch to be
> on linux-next, but that would probably be less than ideal.

By the way, another bisection found that this commit also
breaks tegra124-nyan-big, but only with both CONFIG_EFI=y and
CONFIG_ARM_LPAE=y enabled:

  https://kernelci.org/test/case/id/5ff6b1e26cf19f3b10c94cc5/

The plain multi_v7_defconfig is booting fine:

  https://kernelci.org/test/plan/id/5ff6b0a1db91b8a2b9c94cba/

I haven't looked into this one or tried to make it boot like
rk3288, but please let me know if there's anything there that can
be done to help.

Thanks,
Guillaume


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-12 Thread Guillaume Tucker
On 05/01/2021 09:13, Mike Rapoport wrote:
> On Sun, Jan 03, 2021 at 03:09:14PM -0500, Andrea Arcangeli wrote:
>> Hello Mike,
>>
>> On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
>>> Thanks for the logs, it seems that implicitly adding reserved regions to
>>> memblock.memory wasn't that bright an idea :)
>>
>> Would it be possible to somehow clean up the hack then?
>>
>> The only difference between the clean solution and the hack is that
>> the hack intended to achieve the exact same, but without adding the
>> reserved regions to memblock.memory.
> 
> I didn't consider adding reserved regions to memblock.memory as a clean
> solution, this was still a hack, but I didn't think that things are that
> fragile.
> 
> I still think we cannot rely on memblock.reserved to detect
> memory/zone/node sizes and the boot failure reported here confirms this.
>  
>> The comment on that problematic area says the reserved area cannot be
>> used for DMA because of some unexplained hw issue, and that doing so
>> prevents booting, but since the area got reserved, even with the clean
>> solution, it should never have been used for DMA?
>>
>> So I can only imagine that the physical memory region is way more
>> problematic than just for DMA. It sounds like anything that
>> touches it, including the CPU, will hang the system, not just DMA. It
>> sounds somewhat similar to the other e820 direct mapping issue on x86?
> 
> My understanding is that the boot failed because when I implicitly added
> the reserved region to memblock.memory the memory size seen by
> free_area_init() jumped from 2G to 4G because the reserved area was close
> to 4G. The very first allocation would get a chunk from slightly below
> 4G and as there is no real memory there, the kernel would crash.
>  
>> If you want to test the hack on the arm board to check if it boots you
>> can use the below commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc
> 
> My take is your solution would boot with this memory configuration, but I
> still don't think that using memblock.reserved for zone/node sizing is
> correct.

The rk3288 platform has now been failing to boot for nearly a
month on linux-next:

  https://kernelci.org/test/case/id/5ffbed0a31ad81239bc94cdb/

Until a fix or a new version of this patch is made, would it be
possible to drop it or revert it so the platform becomes usable
again?

Or if you want, I can make a cleaned-up version of my hack to
ignore the problematic region if you still need your patch to be
on linux-next, but that would probably be less than ideal.

Thanks,
Guillaume


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-05 Thread Mike Rapoport
On Sun, Jan 03, 2021 at 03:09:14PM -0500, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
> > Thanks for the logs, it seems that implicitly adding reserved regions to
> > memblock.memory wasn't that bright an idea :)
> 
> Would it be possible to somehow clean up the hack then?
> 
> The only difference between the clean solution and the hack is that
> the hack intended to achieve the exact same, but without adding the
> reserved regions to memblock.memory.

I didn't consider adding reserved regions to memblock.memory as a clean
solution, this was still a hack, but I didn't think that things are that
fragile.

I still think we cannot rely on memblock.reserved to detect
memory/zone/node sizes and the boot failure reported here confirms this.
 
> The comment on that problematic area says the reserved area cannot be
> used for DMA because of some unexplained hw issue, and that doing so
> prevents booting, but since the area got reserved, even with the clean
> solution, it should never have been used for DMA?
>
> So I can only imagine that the physical memory region is way more
> problematic than just for DMA. It sounds like anything that
> touches it, including the CPU, will hang the system, not just DMA. It
> sounds somewhat similar to the other e820 direct mapping issue on x86?

My understanding is that the boot failed because when I implicitly added
the reserved region to memblock.memory the memory size seen by
free_area_init() jumped from 2G to 4G because the reserved area was close
to 4G. The very first allocation would get a chunk from slightly below
4G and as there is no real memory there, the kernel would crash.
 
> If you want to test the hack on the arm board to check if it boots you
> can use the below commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc

My take is your solution would boot with this memory configuration, but I
still don't think that using memblock.reserved for zone/node sizing is
correct.

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.
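
The jump from 2G to 4G described in the message above can be illustrated with a small model. This is only a sketch, not kernel code: it assumes 2 GiB of RAM starting at physical address 0 and a 16 MiB reserved region at 0xfe000000, and the helper names `span_end` and `add_reserved_to_memory` are hypothetical.

```python
GIB = 1 << 30

def span_end(regions):
    """Return the highest end address covered by a list of (base, size) regions."""
    return max(base + size for base, size in regions)

def add_reserved_to_memory(memory, reserved):
    """Model of the problematic behaviour: every reserved range is also treated as memory."""
    return memory + [r for r in reserved if r not in memory]

# Assumed layout: 2 GiB of RAM at physical address 0, plus a 16 MiB
# reserved region just below 4 GiB that is not backed by real RAM.
memory = [(0x0, 2 * GIB)]
reserved = [(0xFE000000, 0x1000000)]

before = span_end(memory)
after = span_end(add_reserved_to_memory(memory, reserved))

print(f"end of memory before: {before / GIB:.2f} GiB")  # 2.00 GiB
print(f"end of memory after:  {after / GIB:.2f} GiB")   # 3.98 GiB
```

In this model the apparent end of memory moves from 2 GiB to just under 4 GiB, so anything that sizes zones from the extended list would see a range that mostly has no RAM behind it.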


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-03 Thread Andrea Arcangeli
Hello Mike,

On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
> Thanks for the logs, it seems that implicitly adding reserved regions to
> memblock.memory wasn't that bright an idea :)

Would it be possible to somehow clean up the hack then?

The only difference between the clean solution and the hack is that
the hack intended to achieve the exact same, but without adding the
reserved regions to memblock.memory.

The comment on that problematic area says the reserved area cannot be
used for DMA because of some unexplained hw issue, and that doing so
prevents booting, but since the area got reserved, even with the clean
solution, it should never have been used for DMA?

So I can only imagine that the physical memory region is way more
problematic than just for DMA. It sounds like anything that
touches it, including the CPU, will hang the system, not just DMA. It
sounds somewhat similar to the other e820 direct mapping issue on x86?

If you want to test the hack on the arm board to check if it boots you
can use the below commit:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc

Thanks,
Andrea



Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-03 Thread Mike Rapoport
On Fri, Dec 18, 2020 at 09:59:26PM +0000, Guillaume Tucker wrote:
> On 13/12/2020 08:23, Mike Rapoport wrote:
> > Hi Guillaume,
> > 
> > On Fri, Dec 11, 2020 at 09:53:46PM +0000, Guillaume Tucker wrote:
> >> Hi Mike,
> >>
> 
> OK, sorry for the delay.  I've built a kernel and booted it as
> you requested, and also found that the issue was due to this
> memory area defined in arch/arm/boot/dts/rk3288.dtsi:
> 
> reserved-memory {
> 	#address-cells = <2>;
> 	#size-cells = <2>;
> 	ranges;
> 
> 	/*
> 	 * The rk3288 cannot use the memory area above 0xfe000000
> 	 * for dma operations for some reason. While there is
> 	 * probably a better solution available somewhere, we
> 	 * haven't found it yet and while devices with 2GB of ram
> 	 * are not affected, this issue prevents 4GB from booting.
> 	 * So to make these devices at least bootable, block
> 	 * this area for the time being until the real solution
> 	 * is found.
> 	 */
> 	dma-unusable@fe000000 {
> 		reg = <0x0 0xfe000000 0x0 0x1000000>;
> 	};
> };
> 
> So I've put a hack[1] on top of 950c37691925 to skip adding a
> node in memblock_enforce_memory_reserved_overlap() if the base
> address is 0xfe000000, which got the kernel booting.  Here's the
> console log:
> 
>   https://people.collabora.com/~gtucker/tmp/2966825.txt
> 
> and the full test job details, if this helps:
> 
>   https://lava.collabora.co.uk/scheduler/job/2966825
> 
> 
> I haven't really looked much further than that, but I'll be
> available on Monday to help run other tests if needed.

Sorry for the delay, I was mostly offline for the last three weeks.

Thanks for the logs, it seems that implicitly adding reserved regions to
memblock.memory wasn't that bright an idea :)
 
> Thanks,
> Guillaume
> 
> [1] https://people.collabora.com/~gtucker/tmp/2966825.patch

-- 
Sincerely yours,
Mike.


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2020-12-18 Thread Guillaume Tucker
On 13/12/2020 08:23, Mike Rapoport wrote:
> Hi Guillaume,
> 
> On Fri, Dec 11, 2020 at 09:53:46PM +0000, Guillaume Tucker wrote:
>> Hi Mike,
>>
>> Please see the bisection report below about a boot failure on
>> rk3288 with next-20201210.
>>
>> Reports aren't automatically sent to the public while we're
>> trialing new bisection features on kernelci.org but this one
>> looks valid.
>>
>> There's nothing in the serial console log, probably because it's
>> crashing too early during boot.  This was confirmed on two rk3288
>> platforms on kernelci.org: rk3288-veyron-jaq and
>> rk3288-rock2-square.  There's no clear sign about other platforms
>> being impacted.
>>
>> If this looks like something you want to investigate but you
>> don't have a platform at hand to reproduce it, please let us know
>> if you would like the test to be re-run on kernelci.org with some
>> debug config turned on, or if you have a fix to try.
> 
> I'd appreciate it if you can build a working kernel with
> CONFIG_DEBUG_MEMORY_INIT=y and run it with 
> 
>   memblock=debug mminit_loglevel=4
> 
> in the command line.
> 
> If I understand correctly, DEBUG_LL is not an option for these platforms
> so if earlyprintk didn't display the log there is not much to do about
> it.

OK, sorry for the delay.  I've built a kernel and booted it as
you requested, and also found that the issue was due to this
memory area defined in arch/arm/boot/dts/rk3288.dtsi:

reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/*
	 * The rk3288 cannot use the memory area above 0xfe000000
	 * for dma operations for some reason. While there is
	 * probably a better solution available somewhere, we
	 * haven't found it yet and while devices with 2GB of ram
	 * are not affected, this issue prevents 4GB from booting.
	 * So to make these devices at least bootable, block
	 * this area for the time being until the real solution
	 * is found.
	 */
	dma-unusable@fe000000 {
		reg = <0x0 0xfe000000 0x0 0x1000000>;
	};
};

So I've put a hack[1] on top of 950c37691925 to skip adding a
node in memblock_enforce_memory_reserved_overlap() if the base
address is 0xfe000000, which got the kernel booting.  Here's the
console log:

  https://people.collabora.com/~gtucker/tmp/2966825.txt

and the full test job details, if this helps:

  https://lava.collabora.co.uk/scheduler/job/2966825


I haven't really looked much further than that, but I'll be
available on Monday to help run other tests if needed.

Thanks,
Guillaume

[1] https://people.collabora.com/~gtucker/tmp/2966825.patch
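
The effect of the hack described above can be modelled roughly as follows. This is only a sketch under stated assumptions: the actual patch is the one linked at [1] and is not reproduced here; `enforce_overlap`, `span_end` and the `skip_bases` parameter are hypothetical names, and the 2 GiB RAM layout and the 0xfe000000 base are assumed values.

```python
GIB = 1 << 30

def enforce_overlap(memory, reserved, skip_bases=()):
    """Return memory plus every reserved (base, size) range whose base is not skipped."""
    result = list(memory)
    for base, size in reserved:
        if base in skip_bases:
            continue  # the hack: leave this reserved range out of "memory"
        if (base, size) not in result:
            result.append((base, size))
    return result

def span_end(regions):
    """Return the highest end address covered by a list of (base, size) regions."""
    return max(base + size for base, size in regions)

# Assumed layout: 2 GiB of RAM at 0 and a 16 MiB reserved region below 4 GiB.
memory = [(0x0, 2 * GIB)]
reserved = [(0xFE000000, 0x1000000)]

without_hack = enforce_overlap(memory, reserved)
with_hack = enforce_overlap(memory, reserved, skip_bases={0xFE000000})

print(hex(span_end(without_hack)))  # 0xff000000
print(hex(span_end(with_hack)))     # 0x80000000
```

With the region skipped, the modelled end of memory stays at 2 GiB, which matches the observation that the board boots again with the hack applied.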


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2020-12-13 Thread Mike Rapoport
Hi Guillaume,

On Fri, Dec 11, 2020 at 09:53:46PM +0000, Guillaume Tucker wrote:
> Hi Mike,
> 
> Please see the bisection report below about a boot failure on
> rk3288 with next-20201210.
> 
> Reports aren't automatically sent to the public while we're
> trialing new bisection features on kernelci.org but this one
> looks valid.
> 
> There's nothing in the serial console log, probably because it's
> crashing too early during boot.  This was confirmed on two rk3288
> platforms on kernelci.org: rk3288-veyron-jaq and
> rk3288-rock2-square.  There's no clear sign about other platforms
> being impacted.
> 
> If this looks like something you want to investigate but you
> don't have a platform at hand to reproduce it, please let us know
> if you would like the test to be re-run on kernelci.org with some
> debug config turned on, or if you have a fix to try.

I'd appreciate it if you can build a working kernel with
CONFIG_DEBUG_MEMORY_INIT=y and run it with 

memblock=debug mminit_loglevel=4

in the command line.

If I understand correctly, DEBUG_LL is not an option for these platforms
so if earlyprintk didn't display the log there is not much to do about
it.

> Thanks,
> Guillaume

-- 
Sincerely yours,
Mike.


Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2020-12-11 Thread Guillaume Tucker
Hi Mike,

Please see the bisection report below about a boot failure on
rk3288 with next-20201210.

Reports aren't automatically sent to the public while we're
trialing new bisection features on kernelci.org but this one
looks valid.

There's nothing in the serial console log, probably because it's
crashing too early during boot.  This was confirmed on two rk3288
platforms on kernelci.org: rk3288-veyron-jaq and
rk3288-rock2-square.  There's no clear sign about other platforms
being impacted.

If this looks like something you want to investigate but you
don't have a platform at hand to reproduce it, please let us know
if you would like the test to be re-run on kernelci.org with some
debug config turned on, or if you have a fix to try.

Thanks,
Guillaume

On 11/12/2020 21:34, staging.kernelci.org bot wrote:
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis  *
> * that you may be involved with the breaking commit it has      *
> * found.  No manual investigation has been done to verify it,   *
> * and the root cause of the problem may be somewhere else.      *
> *                                                               *
> * If you do send a fix, please include this trailer:            *
> *   Reported-by: "kernelci.org bot"                             *
> *                                                               *
> * Hope this helps!                                              *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> 
> kernelci/staging-next bisection: sleep.login on rk3288-rock2-square 
> #2286-staging
> 
> Summary:
>   Start:      7f507faf2d85 staging-next-20201211.0

This is really next-20201210...  The revision shown here is just
an artifact of staging.kernelci.org which creates its own tags.

>   Plain log:  
> https://storage.staging.kernelci.org/kernelci/staging-next/staging-next-20201211.0/arm/multi_v7_defconfig/gcc-8/lab-collabora/sleep-rk3288-rock2-square.txt
>   HTML log:   
> https://storage.staging.kernelci.org/kernelci/staging-next/staging-next-20201211.0/arm/multi_v7_defconfig/gcc-8/lab-collabora/sleep-rk3288-rock2-square.html
>   Result: 950c37691925 mm: memblock: enforce overlap of memory.memblock 
> and memory.reserved
> 
> Checks:
>   revert: PASS
>   verify: PASS
> 
> Parameters:
>   Tree:       kernelci
>   URL:        https://github.com/kernelci/linux.git
>   Branch:     staging-next
>   Target:     rk3288-rock2-square
>   CPU arch:   arm
>   Lab:        lab-collabora
>   Compiler:   gcc-8
>   Config:     multi_v7_defconfig
>   Test case:  sleep.login
> 
> Breaking commit found:
> 
> ---
> commit 950c3769192512118a87432dd42e71c5241dbd10
> Author: Mike Rapoport 
> Date:   Thu Dec 10 15:40:51 2020 +1100
> 
> mm: memblock: enforce overlap of memory.memblock and memory.reserved
> 
> Patch series "mm: fix initialization of struct page for holes in  memory 
> layout", v2.
> 
> Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
> rather that check each PFN") exposed several issues with the memory map
> initialization and these patches fix those issues.
> 
> Initially there were crashes during compaction that Qian Cai reported back
> in April [1].  It seemed back then that the problem was fixed, but a few
> weeks ago Andrea Arcangeli hit the same bug [2] and after a long
> discussion between us [3] I think these patches are the proper fix.
> 
> [1] 
> https://lore.kernel.org/lkml/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw
> [2] 
> https://lore.kernel.org/lkml/20201121194506.13464-1-aarca...@redhat.com
> [3] 
> https://lore.kernel.org/mm-commits/20201206005401.qkuavgoxr%a...@linux-foundation.org
> 
> This patch (of 2):
> 
> memblock does not require that the reserved memory ranges will be a subset
> of memblock.memory.
> 
> As a result there may be reserved pages that are not in the range of any
> zone or node because zone and node boundaries are detected based on
> memblock.memory and pages that only present in memblock.reserved are not
> taken into account during zone/node size detection.
> 
> Make sure that all ranges in memblock.reserved are added to
> memblock.memory before calculating node and zone boundaries.
> 
> Link: https://lkml.kernel.org/r/20201209214304.6812-1-r...@kernel.org
> Link: https://lkml.kernel.org/r/20201209214304.6812-2-r...@kernel.org
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions 
> rather that check each PFN")