Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-06-08 Thread Martin Blumenstingl
Hi Liang,

On Thu, Apr 11, 2019 at 5:00 AM Liang Yang  wrote:
>
> Hi Martin,
> On 2019/4/11 1:54, Martin Blumenstingl wrote:
> > Hi Liang,
> >
> > On Wed, Apr 10, 2019 at 1:08 PM Liang Yang  wrote:
> >>
> >> Hi Martin,
> >>
> >> On 2019/4/5 12:30, Martin Blumenstingl wrote:
> >>> Hi Liang,
> >>>
> >>> On Fri, Mar 29, 2019 at 8:44 AM Liang Yang  wrote:
> 
>  Hi Martin,
> 
>  On 2019/3/29 2:03, Martin Blumenstingl wrote:
> > Hi Liang,
>  [..]
> >> I don't think it is caused by a different NAND type, but I have run
> >> the same test on my GXL platform. You can see the result in the
> >> attachment. By the way, I can't find any information about this in the
> >> Meson NFC datasheet, so I will ask our VLSI team.
> >> Martin, could you reproduce it with the new patch on your Meson8b
> >> platform? I need a clearer log that is easy to compare, like gxl.txt.
> >> Thanks.
> > your gxl.txt is great, finally I can also compare my own results with
> > something that works for you!
> > in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
> > instructions result in a different info buffer output.
> > does this make any sense to you?
> >
>  I have e-mailed our VLSI designer asking for an explanation or a
>  simulation result. Thanks.
> >>> do you have any update on this?
> >> Sorry, I haven't received a reply from the VLSI designer yet. We tried
> >> to raise the priority yesterday, but I still can't estimate when it
> >> will come. There is no document or change list describing the
> >> differences between the m8/b and gxl/axg series chips. For now it seems
> >> that we can't use the NFC_CMD_N2M command during NAND initialization on
> >> m8/b chips and should use *read byte from NFC FIFO register* instead.
> > thank you for the status update!
> >
> > I am trying to understand your suggestion not to use NFC_CMD_N2M:
> > the documentation (public S922X datasheet from Hardkernel: [0]) states
> > that P_NAND_BUF (NFC_REG_BUF in the meson_nand driver) can hold up to
> > four bytes of data. is this the "read byte from NFC FIFO register" you
> > mentioned?
> >
> You are right. Take the early Meson NFC driver V2 from the previous
> mail as a reference.
>
> > Before I spend time changing the code to use the FIFO register I would
> > like to wait for an answer from your VLSI designer.
> > Setting the "correct" info buffer length for NFC_CMD_N2M on the 32-bit
> > SoCs seems like an easier solution compared to switching to the FIFO
> > register. Keeping NFC_CMD_N2M on the 32-bit SoCs also allows us to
> > have only one code-path for 32 and 64 bit SoCs, meaning we don't have
> > to maintain two separate code-paths for basically the same
> > functionality (assuming that NFC_CMD_N2M is not completely broken on
> > the 32-bit SoCs, we just don't know how to use it yet).
> >
> All right. I am also waiting for the answer.
do you have any update on this?


Martin


Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-04-10 Thread Liang Yang

Hi Martin,
On 2019/4/11 1:54, Martin Blumenstingl wrote:

Hi Liang,

On Wed, Apr 10, 2019 at 1:08 PM Liang Yang  wrote:


Hi Martin,

On 2019/4/5 12:30, Martin Blumenstingl wrote:

Hi Liang,

On Fri, Mar 29, 2019 at 8:44 AM Liang Yang  wrote:


Hi Martin,

On 2019/3/29 2:03, Martin Blumenstingl wrote:

Hi Liang,

[..]

I don't think it is caused by a different NAND type, but I have run the
same test on my GXL platform. You can see the result in the attachment.
By the way, I can't find any information about this in the Meson NFC
datasheet, so I will ask our VLSI team.
Martin, could you reproduce it with the new patch on your Meson8b
platform? I need a clearer log that is easy to compare, like gxl.txt.
Thanks.

your gxl.txt is great, finally I can also compare my own results with
something that works for you!
in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
instructions result in a different info buffer output.
does this make any sense to you?


I have e-mailed our VLSI designer asking for an explanation or a
simulation result. Thanks.

do you have any update on this?

Sorry, I haven't received a reply from the VLSI designer yet. We tried
to raise the priority yesterday, but I still can't estimate when it will
come. There is no document or change list describing the differences
between the m8/b and gxl/axg series chips. For now it seems that we
can't use the NFC_CMD_N2M command during NAND initialization on m8/b
chips and should use *read byte from NFC FIFO register* instead.

thank you for the status update!

I am trying to understand your suggestion not to use NFC_CMD_N2M:
the documentation (public S922X datasheet from Hardkernel: [0]) states
that P_NAND_BUF (NFC_REG_BUF in the meson_nand driver) can hold up to
four bytes of data. is this the "read byte from NFC FIFO register" you
mentioned?

You are right. Take the early Meson NFC driver V2 from the previous mail
as a reference.



Before I spend time changing the code to use the FIFO register I would
like to wait for an answer from your VLSI designer.
Setting the "correct" info buffer length for NFC_CMD_N2M on the 32-bit
SoCs seems like an easier solution compared to switching to the FIFO
register. Keeping NFC_CMD_N2M on the 32-bit SoCs also allows us to
have only one code-path for 32 and 64 bit SoCs, meaning we don't have
to maintain two separate code-paths for basically the same
functionality (assuming that NFC_CMD_N2M is not completely broken on
the 32-bit SoCs, we just don't know how to use it yet).


All right. I am also waiting for the answer.


Regards
Martin


[0] https://dn.odroid.com/S922X/ODROID-N2/Datasheet/S922X_Public_Datasheet_V0.2.pdf




Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-04-10 Thread Martin Blumenstingl
Hi Liang,

On Wed, Apr 10, 2019 at 1:08 PM Liang Yang  wrote:
>
> Hi Martin,
>
> On 2019/4/5 12:30, Martin Blumenstingl wrote:
> > Hi Liang,
> >
> > On Fri, Mar 29, 2019 at 8:44 AM Liang Yang  wrote:
> >>
> >> Hi Martin,
> >>
> >> On 2019/3/29 2:03, Martin Blumenstingl wrote:
> >>> Hi Liang,
> >> [..]
>  I don't think it is caused by a different NAND type, but I have run the
>  same test on my GXL platform. You can see the result in the attachment.
>  By the way, I can't find any information about this in the Meson NFC
>  datasheet, so I will ask our VLSI team.
>  Martin, could you reproduce it with the new patch on your Meson8b
>  platform? I need a clearer log that is easy to compare, like gxl.txt.
>  Thanks.
> >>> your gxl.txt is great, finally I can also compare my own results with
> >>> something that works for you!
> >>> in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
> >>> instructions result in a different info buffer output.
> >>> does this make any sense to you?
> >>>
> >> I have e-mailed our VLSI designer asking for an explanation or a
> >> simulation result. Thanks.
> > do you have any update on this?
> Sorry, I haven't received a reply from the VLSI designer yet. We tried
> to raise the priority yesterday, but I still can't estimate when it will
> come. There is no document or change list describing the differences
> between the m8/b and gxl/axg series chips. For now it seems that we
> can't use the NFC_CMD_N2M command during NAND initialization on m8/b
> chips and should use *read byte from NFC FIFO register* instead.
thank you for the status update!

I am trying to understand your suggestion not to use NFC_CMD_N2M:
the documentation (public S922X datasheet from Hardkernel: [0]) states
that P_NAND_BUF (NFC_REG_BUF in the meson_nand driver) can hold up to
four bytes of data. is this the "read byte from NFC FIFO register" you
mentioned?

Before I spend time changing the code to use the FIFO register I would
like to wait for an answer from your VLSI designer.
Setting the "correct" info buffer length for NFC_CMD_N2M on the 32-bit
SoCs seems like an easier solution compared to switching to the FIFO
register. Keeping NFC_CMD_N2M on the 32-bit SoCs also allows us to
have only one code-path for 32 and 64 bit SoCs, meaning we don't have
to maintain two separate code-paths for basically the same
functionality (assuming that NFC_CMD_N2M is not completely broken on
the 32-bit SoCs, we just don't know how to use it yet).


Regards
Martin


[0] https://dn.odroid.com/S922X/ODROID-N2/Datasheet/S922X_Public_Datasheet_V0.2.pdf


Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-04-10 Thread Liang Yang

Hi Martin,

On 2019/4/5 12:30, Martin Blumenstingl wrote:

Hi Liang,

On Fri, Mar 29, 2019 at 8:44 AM Liang Yang  wrote:


Hi Martin,

On 2019/3/29 2:03, Martin Blumenstingl wrote:

Hi Liang,

[..]

I don't think it is caused by a different NAND type, but I have run the
same test on my GXL platform. You can see the result in the attachment.
By the way, I can't find any information about this in the Meson NFC
datasheet, so I will ask our VLSI team.
Martin, could you reproduce it with the new patch on your Meson8b
platform? I need a clearer log that is easy to compare, like gxl.txt.
Thanks.

your gxl.txt is great, finally I can also compare my own results with
something that works for you!
in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
instructions result in a different info buffer output.
does this make any sense to you?


I have e-mailed our VLSI designer asking for an explanation or a
simulation result. Thanks.

do you have any update on this?
Sorry, I haven't received a reply from the VLSI designer yet. We tried
to raise the priority yesterday, but I still can't estimate when it will
come. There is no document or change list describing the differences
between the m8/b and gxl/axg series chips. For now it seems that we
can't use the NFC_CMD_N2M command during NAND initialization on m8/b
chips and should use *read byte from NFC FIFO register* instead.


Martin




Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-04-04 Thread Martin Blumenstingl
Hi Liang,

On Fri, Mar 29, 2019 at 8:44 AM Liang Yang  wrote:
>
> Hi Martin,
>
> On 2019/3/29 2:03, Martin Blumenstingl wrote:
> > Hi Liang,
> [..]
> >> I don't think it is caused by a different NAND type, but I have run
> >> the same test on my GXL platform. You can see the result in the
> >> attachment. By the way, I can't find any information about this in the
> >> Meson NFC datasheet, so I will ask our VLSI team.
> >> Martin, could you reproduce it with the new patch on your Meson8b
> >> platform? I need a clearer log that is easy to compare, like gxl.txt.
> >> Thanks.
> > your gxl.txt is great, finally I can also compare my own results with
> > something that works for you!
> > in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
> > instructions result in a different info buffer output.
> > does this make any sense to you?
> >
> I have e-mailed our VLSI designer asking for an explanation or a
> simulation result. Thanks.
do you have any update on this?


Martin


Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-29 Thread Liang Yang

Hi Martin,

On 2019/3/29 2:03, Martin Blumenstingl wrote:
Hi Liang, 

[..]

I don't think it is caused by a different NAND type, but I have run the
same test on my GXL platform. You can see the result in the attachment.
By the way, I can't find any information about this in the Meson NFC
datasheet, so I will ask our VLSI team.
Martin, could you reproduce it with the new patch on your Meson8b
platform? I need a clearer log that is easy to compare, like gxl.txt.
Thanks.

your gxl.txt is great, finally I can also compare my own results with
something that works for you!
in my results (see attachment) the "DATA_IN  [256 B, force 8-bit]"
instructions result in a different info buffer output.
does this make any sense to you?

I have e-mailed our VLSI designer asking for an explanation or a
simulation result. Thanks.


Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-28 Thread Martin Blumenstingl
Hi Liang,

On Wed, Mar 27, 2019 at 9:52 AM Liang Yang  wrote:
>
> Hi Martin,
>
> Thanks a lot.
> On 2019/3/26 2:31, Martin Blumenstingl wrote:
> > Hi Liang,
> >
> > On Mon, Mar 25, 2019 at 11:03 AM Liang Yang  wrote:
> >>
> >> Hi Martin,
> >>
> >> On 2019/3/23 5:07, Martin Blumenstingl wrote:
> >>> Hi Matthew,
> >>>
> >>> On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox  wrote:
> 
>  On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:
> > Hello,
> >
> > I am experiencing the following crash:
> > [ cut here ]
> > kernel BUG at mm/slub.c:3950!
> 
>    if (unlikely(!PageSlab(page))) {
>    BUG_ON(!PageCompound(page));
> 
>  You called kfree() on the address of a page which wasn't allocated by slab.
> 
> > I have traced this crash to the kfree() in meson_nfc_read_buf().
> > my observation is as follows:
> > - meson_nfc_read_buf() is called 7 times without any crash, the
> > kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
> > (physical address)
> > - the eighth time meson_nfc_read_buf() is called, the kzalloc() call
> > returns 0xee39a38b (virtual address) / 0x2e39a38b (physical address)
> > and the final kfree() crashes
> > - changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
> > PAGE_SIZE works around that crash
> 
>  I suspect you're doing something which corrupts memory.  Overrunning
>  the end of your allocation or something similar.  Have you tried KASAN
>  or even the various slab debugging (eg redzones)?
> >>> KASAN is not available on 32-bit ARM. there was some progress last
> >>> year [0] but it didn't make it into mainline. I tried to make the
> >>> patches apply again and got it to compile (and my kernel is still
> >>> booting) but I have no idea if it's still working. for anyone
> >>> interested, my patches are here: [1] (I consider this a HACK because I
> >>> don't know anything about the code which is being touched in the
> >>> patches, I only made it compile)
> >>>
> >>> SLAB debugging (redzones) was a great hint, thank you very much for
> >>> that Matthew! I enabled:
> >>> CONFIG_SLUB_DEBUG=y
> >>> CONFIG_SLUB_DEBUG_ON=y
> >>> and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
> >>> overwritten" (a larger kernel log extract is attached).
> >>>
> >>> I'm starting to wonder if the NAND controller (hardware) writes more
> >>> than 8 bytes.
> >>> some context: the "info" buffer allocated in meson_nfc_read_buf is
> >>> then passed to the NAND controller IP (after using dma_map_single).
> >>>
> >>> Liang, how does the NAND controller know that it only has to send
> >>> PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
> >>> other callers of meson_nfc_dma_buffer_setup (which passes the info
> >>> buffer to the hardware) are using (nand->ecc.steps * PER_INFO_BYTE)
> >>> bytes?
> >>>
> >> NFC_CMD_N2M and CMDRWGEN are different commands. CMDRWGEN needs the
> >> ECC page size (1KB or 512B) and the number of ECC pages, *Pages*
> >> (2, 4, 8, ...), to be set, so it uses PER_INFO_BYTE (= 8) bytes for
> >> each ECC page.
> >> I have never used NFC_CMD_N2M to transfer data before, because it is
> >> very inefficient. I did an experiment with the attached patch and found
> >> no overwrite on my Meson AXG platform.
> >>
> >> Martin, I would appreciate it very much if you would try the attachment
> >> on your meson m8b platform.
> > thank you for your debug patch! on my board 2 * PER_INFO_BYTE is not enough.
> > I took the idea from your patch and adapted it so I could print a
> > buffer with 256 bytes (which seems to be "big enough" for my board).
> It should only need PER_INFO_BYTE (= 8) bytes, because NFC_CMD_N2M
> doesn't set *Pages*, unlike CMDRWGEN, which needs Pages * PER_INFO_BYTE
> (= 8) bytes when the *Pages* parameter is set. I had assumed that
> NFC_CMD_N2M only occupies PER_INFO_BYTE (= 8) bytes. I also tried not
> setting the info address, and the machine crashed.
thank you for the explanation. the command is built using:
  cmd = NFC_CMD_N2M | (len & GENMASK(5, 0));

> > see the attached, modified patch
> >
> > in the output I see that sometimes the first 32 bytes are not touched
> > by the controller, but everything beyond 32 bytes is modified in the
> > info buffer.
> >
> It does make sense that the controller sometimes fills the space beyond
> the first 8 bytes. However, I expected the controller to use only the
> first 8 bytes with NFC_CMD_N2M.
in my tests (see the attached log output) it seems that the info
buffer size has the following constraints:
- use the "len" which is passed to meson_nfc_read_buf
- if "len" is smaller than PER_INFO_BYTE then use PER_INFO_BYTE (= 8)

> > I also tried to increase the buffer size to 512, but that didn't make
> > a difference (I never saw any info buffer modification beyond 256
> > bytes).
> >
> > also I just noticed that I 

Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-27 Thread Liang Yang

Hi Martin,

Thanks a lot.
On 2019/3/26 2:31, Martin Blumenstingl wrote:

Hi Liang,

On Mon, Mar 25, 2019 at 11:03 AM Liang Yang  wrote:


Hi Martin,

On 2019/3/23 5:07, Martin Blumenstingl wrote:

Hi Matthew,

On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox  wrote:


On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:

Hello,

I am experiencing the following crash:
[ cut here ]
kernel BUG at mm/slub.c:3950!


  if (unlikely(!PageSlab(page))) {
  BUG_ON(!PageCompound(page));

You called kfree() on the address of a page which wasn't allocated by slab.


I have traced this crash to the kfree() in meson_nfc_read_buf().
my observation is as follows:
- meson_nfc_read_buf() is called 7 times without any crash, the
kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
(physical address)
- the eighth time meson_nfc_read_buf() is called, the kzalloc() call
returns 0xee39a38b (virtual address) / 0x2e39a38b (physical address)
and the final kfree() crashes
- changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
PAGE_SIZE works around that crash


I suspect you're doing something which corrupts memory.  Overrunning
the end of your allocation or something similar.  Have you tried KASAN
or even the various slab debugging (eg redzones)?

KASAN is not available on 32-bit ARM. there was some progress last
year [0] but it didn't make it into mainline. I tried to make the
patches apply again and got it to compile (and my kernel is still
booting) but I have no idea if it's still working. for anyone
interested, my patches are here: [1] (I consider this a HACK because I
don't know anything about the code which is being touched in the
patches, I only made it compile)

SLAB debugging (redzones) was a great hint, thank you very much for
that Matthew! I enabled:
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
overwritten" (a larger kernel log extract is attached).

I'm starting to wonder if the NAND controller (hardware) writes more
than 8 bytes.
some context: the "info" buffer allocated in meson_nfc_read_buf is
then passed to the NAND controller IP (after using dma_map_single).

Liang, how does the NAND controller know that it only has to send
PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
other callers of meson_nfc_dma_buffer_setup (which passes the info
buffer to the hardware) are using (nand->ecc.steps * PER_INFO_BYTE)
bytes?


NFC_CMD_N2M and CMDRWGEN are different commands. CMDRWGEN needs the ECC
page size (1KB or 512B) and the number of ECC pages, *Pages* (2, 4, 8,
...), to be set, so it uses PER_INFO_BYTE (= 8) bytes for each ECC page.
I have never used NFC_CMD_N2M to transfer data before, because it is
very inefficient. I did an experiment with the attached patch and found
no overwrite on my Meson AXG platform.

Martin, I would appreciate it very much if you would try the attachment
on your meson m8b platform.

thank you for your debug patch! on my board 2 * PER_INFO_BYTE is not enough.
I took the idea from your patch and adapted it so I could print a
buffer with 256 bytes (which seems to be "big enough" for my board).
It should only need PER_INFO_BYTE (= 8) bytes, because NFC_CMD_N2M
doesn't set *Pages*, unlike CMDRWGEN, which needs Pages * PER_INFO_BYTE
(= 8) bytes when the *Pages* parameter is set. I had assumed that
NFC_CMD_N2M only occupies PER_INFO_BYTE (= 8) bytes. I also tried not
setting the info address, and the machine crashed.

see the attached, modified patch

in the output I see that sometimes the first 32 bytes are not touched
by the controller, but everything beyond 32 bytes is modified in the
info buffer.

It does make sense that the controller sometimes fills the space beyond
the first 8 bytes. However, I expected the controller to use only the
first 8 bytes with NFC_CMD_N2M.

I also tried to increase the buffer size to 512, but that didn't make
a difference (I never saw any info buffer modification beyond 256
bytes).

also I just noticed that I didn't give you many details on my NAND chip yet.
from Amlogic vendor u-boot on Meson8m2 (all my Meson8b boards have
eMMC flash, but I believe the NAND controller on Meson8 to GXBB is
identical):
   m8m2_n200_v1#amlnf chipinfo
   flash  info
   name:B revision 20nm NAND 8GiB H27UCG8T2B, id:ad de 94 eb 74 44  0  0
   pagesize:0x4000, blocksize:0x40, oobsize:0x500, chipsize:0x2000,
 option:0x8, T_REA:16, T_RHOH:15
   hw controller info
   chip_num:1, onfi_mode:0, page_shift:14, block_shift:22, option:0xc2
   ecc_unit:1024, ecc_bytes:70, ecc_steps:16, ecc_max:40
   bch_mode:5, user_mode:2, oobavail:32, oobtail:64384

I don't think it is caused by a different NAND type, but I have run the
same test on my GXL platform. You can see the result in the attachment.
By the way, I can't find any information about this in the Meson NFC
datasheet, so I will ask our VLSI team.
Martin, could you reproduce it with 

Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-25 Thread Martin Blumenstingl
Hi Liang,

On Mon, Mar 25, 2019 at 11:03 AM Liang Yang  wrote:
>
> Hi Martin,
>
> On 2019/3/23 5:07, Martin Blumenstingl wrote:
> > Hi Matthew,
> >
> > On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox  wrote:
> >>
> >> On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:
> >>> Hello,
> >>>
> >>> I am experiencing the following crash:
> >>>[ cut here ]
> >>>kernel BUG at mm/slub.c:3950!
> >>
> >>  if (unlikely(!PageSlab(page))) {
> >>  BUG_ON(!PageCompound(page));
> >>
> >> You called kfree() on the address of a page which wasn't allocated by slab.
> >>
> >>> I have traced this crash to the kfree() in meson_nfc_read_buf().
> >>> my observation is as follows:
> >>> - meson_nfc_read_buf() is called 7 times without any crash, the
> >>> kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
> >>> (physical address)
> >>> - the eighth time meson_nfc_read_buf() is called, the kzalloc() call
> >>> returns 0xee39a38b (virtual address) / 0x2e39a38b (physical address)
> >>> and the final kfree() crashes
> >>> - changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
> >>> PAGE_SIZE works around that crash
> >>
> >> I suspect you're doing something which corrupts memory.  Overrunning
> >> the end of your allocation or something similar.  Have you tried KASAN
> >> or even the various slab debugging (eg redzones)?
> > KASAN is not available on 32-bit ARM. there was some progress last
> > year [0] but it didn't make it into mainline. I tried to make the
> > patches apply again and got it to compile (and my kernel is still
> > booting) but I have no idea if it's still working. for anyone
> > interested, my patches are here: [1] (I consider this a HACK because I
> > don't know anything about the code which is being touched in the
> > patches, I only made it compile)
> >
> > SLAB debugging (redzones) was a great hint, thank you very much for
> > that Matthew! I enabled:
> >CONFIG_SLUB_DEBUG=y
> >CONFIG_SLUB_DEBUG_ON=y
> > and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
> > overwritten" (a larger kernel log extract is attached).
> >
> > I'm starting to wonder if the NAND controller (hardware) writes more
> > than 8 bytes.
> > some context: the "info" buffer allocated in meson_nfc_read_buf is
> > then passed to the NAND controller IP (after using dma_map_single).
> >
> > Liang, how does the NAND controller know that it only has to send
> > PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
> > other callers of meson_nfc_dma_buffer_setup (which passes the info
> > buffer to the hardware) are using (nand->ecc.steps * PER_INFO_BYTE)
> > bytes?
> >
> NFC_CMD_N2M and CMDRWGEN are different commands. CMDRWGEN needs the ECC
> page size (1KB or 512B) and the number of ECC pages, *Pages* (2, 4, 8,
> ...), to be set, so it uses PER_INFO_BYTE (= 8) bytes for each ECC page.
> I have never used NFC_CMD_N2M to transfer data before, because it is
> very inefficient. I did an experiment with the attached patch and found
> no overwrite on my Meson AXG platform.
>
> Martin, I would appreciate it very much if you would try the attachment
> on your meson m8b platform.
thank you for your debug patch! on my board 2 * PER_INFO_BYTE is not enough.
I took the idea from your patch and adapted it so I could print a
buffer with 256 bytes (which seems to be "big enough" for my board).
see the attached, modified patch

in the output I see that sometimes the first 32 bytes are not touched
by the controller, but everything beyond 32 bytes is modified in the
info buffer.

I also tried to increase the buffer size to 512, but that didn't make
a difference (I never saw any info buffer modification beyond 256
bytes).

also I just noticed that I didn't give you many details on my NAND chip yet.
from Amlogic vendor u-boot on Meson8m2 (all my Meson8b boards have
eMMC flash, but I believe the NAND controller on Meson8 to GXBB is
identical):
  m8m2_n200_v1#amlnf chipinfo
  flash  info
  name:B revision 20nm NAND 8GiB H27UCG8T2B, id:ad de 94 eb 74 44  0  0
  pagesize:0x4000, blocksize:0x40, oobsize:0x500, chipsize:0x2000,
option:0x8, T_REA:16, T_RHOH:15
  hw controller info
  chip_num:1, onfi_mode:0, page_shift:14, block_shift:22, option:0xc2
  ecc_unit:1024, ecc_bytes:70, ecc_steps:16, ecc_max:40
  bch_mode:5, user_mode:2, oobavail:32, oobtail:64384


Regards

Martin
...
[2.716885] 0000: 8005 2800 2945 fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd
[2.720464] 0020: fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd
[2.729689] 0040: fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd
[2.738847] 0060: fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd
[2.748065] 0080: fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd
[2.757228] 00a0: fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd fdfd 

Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-25 Thread Liang Yang

Hi Martin,

On 2019/3/23 5:07, Martin Blumenstingl wrote:

Hi Matthew,

On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox  wrote:


On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:

Hello,

I am experiencing the following crash:
   [ cut here ]
   kernel BUG at mm/slub.c:3950!


 if (unlikely(!PageSlab(page))) {
 BUG_ON(!PageCompound(page));

You called kfree() on the address of a page which wasn't allocated by slab.


I have traced this crash to the kfree() in meson_nfc_read_buf().
my observation is as follows:
- meson_nfc_read_buf() is called 7 times without any crash, the
kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
(physical address)
- the eighth time meson_nfc_read_buf() is called, the kzalloc() call
returns 0xee39a38b (virtual address) / 0x2e39a38b (physical address)
and the final kfree() crashes
- changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
PAGE_SIZE works around that crash


I suspect you're doing something which corrupts memory.  Overrunning
the end of your allocation or something similar.  Have you tried KASAN
or even the various slab debugging (eg redzones)?

KASAN is not available on 32-bit ARM. there was some progress last
year [0] but it didn't make it into mainline. I tried to make the
patches apply again and got it to compile (and my kernel is still
booting) but I have no idea if it's still working. for anyone
interested, my patches are here: [1] (I consider this a HACK because I
don't know anything about the code which is being touched in the
patches, I only made it compile)

SLAB debugging (redzones) was a great hint, thank you very much for
that Matthew! I enabled:
   CONFIG_SLUB_DEBUG=y
   CONFIG_SLUB_DEBUG_ON=y
and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
overwritten" (a larger kernel log extract is attached).

I'm starting to wonder if the NAND controller (hardware) writes more
than 8 bytes.
some context: the "info" buffer allocated in meson_nfc_read_buf is
then passed to the NAND controller IP (after using dma_map_single).

Liang, how does the NAND controller know that it only has to send
PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
other callers of meson_nfc_dma_buffer_setup (which passes the info
buffer to the hardware) are using (nand->ecc.steps * PER_INFO_BYTE)
bytes?

NFC_CMD_N2M and CMDRWGEN are different commands. CMDRWGEN needs the ECC
page size (1KB or 512B) and the number of ECC pages, *Pages* (2, 4, 8,
...), to be set, so it uses PER_INFO_BYTE (= 8) bytes for each ECC page.
I have never used NFC_CMD_N2M to transfer data before, because it is
very inefficient. I did an experiment with the attached patch and found
no overwrite on my Meson AXG platform.


Martin, I would appreciate it very much if you would try the attachment 
on your meson m8b platform.




Regards
Martin


[0] https://lore.kernel.org/patchwork/cover/913212/
[1] https://github.com/xdarklight/linux/tree/arm-kasan-hack-v5.1-rc1

diff --git a/drivers/mtd/nand/raw/meson_nand.c b/drivers/mtd/nand/raw/meson_nand.c
old mode 100644
new mode 100755
index e858d58..905ef39
--- a/drivers/mtd/nand/raw/meson_nand.c
+++ b/drivers/mtd/nand/raw/meson_nand.c
@@ -527,11 +527,12 @@ static void meson_nfc_dma_buffer_release(struct nand_chip *nand,
 static int meson_nfc_read_buf(struct nand_chip *nand, u8 *buf, int len)
 {
 	struct meson_nfc *nfc = nand_get_controller_data(nand);
-	int ret = 0;
+	int ret = 0, i;
 	u32 cmd;
 	u8 *info;
 
-	info = kzalloc(PER_INFO_BYTE, GFP_KERNEL);
+	info = kzalloc(2 * PER_INFO_BYTE, GFP_KERNEL);
+	memset(info, 0xFD, 2 * PER_INFO_BYTE);
 	ret = meson_nfc_dma_buffer_setup(nand, buf, len, info,
 					 PER_INFO_BYTE, DMA_FROM_DEVICE);
 	if (ret)
@@ -543,6 +544,12 @@ static int meson_nfc_read_buf(struct nand_chip *nand, u8 *buf, int len)
 	meson_nfc_drain_cmd(nfc);
 	meson_nfc_wait_cmd_finish(nfc, 1000);
 	meson_nfc_dma_buffer_release(nand, len, PER_INFO_BYTE, DMA_FROM_DEVICE);
+
+	for (i = 0; i < 2 * PER_INFO_BYTE; i++){
+		printk("0x%x ", info[i]);
+	}
+	printk("\n");
+
 	kfree(info);
 
 	return ret;


Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-22 Thread Martin Blumenstingl
Hi Matthew,

On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox  wrote:
>
> On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:
> > Hello,
> >
> > I am experiencing the following crash:
> >   [ cut here ]
> >   kernel BUG at mm/slub.c:3950!
>
> if (unlikely(!PageSlab(page))) {
> BUG_ON(!PageCompound(page));
>
> You called kfree() on the address of a page which wasn't allocated by slab.
>
> > I have traced this crash to the kfree() in meson_nfc_read_buf().
> > my observation is as follows:
> > - meson_nfc_read_buf() is called 7 times without any crash, the
> > kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
> > (physical address)
> > - the eighth time meson_nfc_read_buf() is called, the kzalloc() call
> > returns 0xee39a38b (virtual address) / 0x2e39a38b (physical address)
> > and the final kfree() crashes
> > - changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
> > PAGE_SIZE works around that crash
>
> I suspect you're doing something which corrupts memory.  Overrunning
> the end of your allocation or something similar.  Have you tried KASAN
> or even the various slab debugging (eg redzones)?
KASAN is not available on 32-bit ARM. there was some progress last
year [0], but it didn't make it into mainline. I tried to make the
patches apply again and got them to compile (and my kernel still
boots), but I have no idea whether KASAN is actually working. for anyone
interested, my patches are here: [1] (I consider this a HACK because I
don't know anything about the code being touched by the patches; I only
made it compile)

SLUB debugging (redzones) was a great hint, thank you very much for
that, Matthew! I enabled:
  CONFIG_SLUB_DEBUG=y
  CONFIG_SLUB_DEBUG_ON=y
and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
overwritten" (a larger kernel log extract is attached).
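The redzone check behind this report can be modeled in a few lines. The allocator fills the bytes next to each object with a known pattern (0xcc here, the value of SLUB_RED_ACTIVE in include/linux/poison.h) and later reports the first byte that no longer matches. In mm/slub.c that reporting is done by check_bytes_and_report(), which prints the "First byte 0x.. instead of 0x.." line seen in the log. The code below is only a simplified userspace model, not SLUB's implementation. redzone_arm() and redzone_first_bad() are hypothetical names, and the object and redzone sizes are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RED_ACTIVE 0xcc /* same value as SLUB_RED_ACTIVE */

/* Fill the redzone that follows an object, as the allocator does when
 * the object is handed out. */
static void redzone_arm(unsigned char *redzone, size_t len)
{
	assert(redzone != NULL);
	memset(redzone, RED_ACTIVE, len);
}

/* On free (or a consistency check): return the offset of the first
 * overwritten redzone byte, or -1 if the redzone is intact. */
static long redzone_first_bad(const unsigned char *redzone, size_t len)
{
	assert(redzone != NULL);
	for (size_t i = 0; i < len; i++)
		if (redzone[i] != RED_ACTIVE)
			return (long)i;
	return -1;
}
```

A writer that runs past the end of its object, as a DMA engine writing more than the driver mapped would, shows up as a non-0xcc byte at the start of the redzone. That matches "First byte 0xff instead of 0xcc" in the log.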

I'm starting to wonder if the NAND controller (hardware) writes more
than 8 bytes.
some context: the "info" buffer allocated in meson_nfc_read_buf is
then passed to the NAND controller IP (after being mapped with dma_map_single).

Liang, how does the NAND controller know that it only has to write
PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
other callers of meson_nfc_dma_buffer_setup (which passes the info
buffer to the hardware) use (nand->ecc.steps * PER_INFO_BYTE) bytes.


Regards
Martin


[0] https://lore.kernel.org/patchwork/cover/913212/
[1] https://github.com/xdarklight/linux/tree/arm-kasan-hack-v5.1-rc1
[2.742070] meson_nfc_read_buf e95e7d00 0x295e7d00
[2.742155] meson_nfc_read_buf e95e7d00 0x295e7d00
[2.746056] meson_nfc_read_buf e95e62c0 0x295e62c0
[2.750947] meson_nfc_read_buf e95e7d00 0x295e7d00
[2.755530] =
[2.763673] BUG kmalloc-64 (Not tainted): Redzone overwritten
[2.769392] -
[2.769392] 
[2.779013] Disabling lock debugging due to kernel taint
[2.784303] INFO: 0x(ptrval)-0x(ptrval). First byte 0xff instead of 0xcc
[2.790982] INFO: Allocated in 0x age=4294937574 cpu=4294967295 pid=-1
[2.798171]  0x
[2.800598]  0x
[2.803024]  0x
[2.805451]  0x
[2.807879]  0x
[2.810306]  0x
[2.812733]  0x
[2.815160]  0x
[2.817587]  0x
[2.820014]  0x
[2.822441]  0x
[2.824869]  0x
[2.827296]  0x
[2.829722]  0x
[2.832150]  0x
[2.834577]  0x
[2.837006] INFO: Freed in 0x age=4294937574 cpu=4294967295 pid=-1
[2.843852]  0x
[2.846279]  0x
[2.848706]  0x
[2.851133]  0x
[2.853560]  0x
[2.855987]  0x
[2.858414]  0x
[2.860842]  0x
[2.863269]  0x
[2.865696]  0x
[2.868123]  0x
[2.870550]  0x
[2.872977]  0x
[2.875404]  0x
[2.877831]  0x
[2.880258]  0x
[2.882687] INFO: Slab 0x(ptrval) objects=25 used=4 fp=0x(ptrval) flags=0x10201
[2.889968] INFO: Object 0x(ptrval) @offset=7424 fp=0x(ptrval)
[2.889968] 
[2.897251] Redzone (ptrval): cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[2.905917] Redzone (ptrval): cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[2.914585] Redzone (ptrval): cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[2.923253] Redzone (ptrval): cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[2.931922] Object (ptrval): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[2.940503] Object (ptrval): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[2.949085] Object (ptrval): ff ff ff ff ff ff ff ff ff ff 

Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-21 Thread Matthew Wilcox
On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:
> Hello,
> 
> I am experiencing the following crash:
>   [ cut here ]
>   kernel BUG at mm/slub.c:3950!

if (unlikely(!PageSlab(page))) {
BUG_ON(!PageCompound(page));

You called kfree() on the address of a page which wasn't allocated by slab.

> I have traced this crash to the kfree() in meson_nfc_read_buf().
> my observation is as follows:
> - meson_nfc_read_buf() is called 7 times without any crash, the
> kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
> (physical address)
> - the eighth time meson_nfc_read_buf() is called, the kzalloc() call returns
> 0xee39a38b (virtual address) / 0x2e39a38b (physical address) and the
> final kfree() crashes
> - changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
> PAGE_SIZE works around that crash

I suspect you're doing something which corrupts memory.  Overrunning
the end of your allocation or something similar.  Have you tried KASAN
or even the various slab debugging (eg redzones)?



32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

2019-03-21 Thread Martin Blumenstingl
Hello,

I am experiencing the following crash:
  [ cut here ]
  kernel BUG at mm/slub.c:3950!
  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
  Modules linked in:
  CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-00080-g37b8cb064293-dirty #4252
  Hardware name: Amlogic Meson platform
  PC is at kfree+0x250/0x274
  LR is at meson_nfc_exec_op+0x3b0/0x408
  ...
my goal is to add support for the 32-bit Amlogic Meson SoCs (ARM
Cortex-A5 / Cortex-A9 cores) in the meson-nand driver.

I have traced this crash to the kfree() in meson_nfc_read_buf().
my observation is as follows:
- meson_nfc_read_buf() is called 7 times without any crash, the
kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
(physical address)
- the eighth time meson_nfc_read_buf() is called, the kzalloc() call returns
0xee39a38b (virtual address) / 0x2e39a38b (physical address) and the
final kfree() crashes
- changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
PAGE_SIZE works around that crash
- disabling the meson-nand driver makes my board boot just fine
- Liang has tested the unmodified code on a 64-bit Amlogic SoC (ARM
Cortex-A53 cores) and he doesn't see the crash there

in case the selected SLAB allocator is relevant:
  CONFIG_SLUB=y

the following printk statement is used to print the addresses returned
by the kzalloc() call in meson_nfc_read_buf():
  printk("%s 0x%px 0x%08x\n", __func__, info, virt_to_phys(info));

my questions are:
- why does kzalloc() return an unaligned address 0xee39a38b (virtual
address) / 0x2e39a38b (physical address)?
- how can I further analyze this issue?
  (I don't know where to start analyzing: mm/, arch/arm/mm or the
  meson-nand driver; the driver seems to work fine on the 64-bit SoCs,
  but that doesn't fully rule it out, ...)
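The two addresses can be compared with a simple alignment check. kmalloc() guarantees a minimum alignment, ARCH_KMALLOC_MINALIGN. On 32-bit ARM this follows ARCH_DMA_MINALIGN, i.e. the L1 cache-line size. So a small object should never start at an odd address. The sketch below is a plain userspace check. is_aligned() is a hypothetical helper, and the 8- and 64-byte alignments used with it are illustrative, since the real minimum depends on the kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if addr is a multiple of align. align must be a power of
 * two, which allows a mask test instead of a division. */
static bool is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

With either alignment, 0xe9e6c600 passes and 0xee39a38b fails (it is not even 2-byte aligned). This suggests the pointer came from corrupted allocator metadata, e.g. a freelist pointer overwritten by a write past the end of a neighbouring object, rather than from a normal allocation.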


Regards
Martin