Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Thu 21-02-19 01:23:50, Meelis Roos wrote:
> > > First, I found out that both the problematic alphas had memory
> > > compaction and page migration and bounce buffers turned on, and
> > > working alphas had them off.
> > >
> > > Next, turning off these options makes the problematic alphas work.
> >
> > OK, thanks for testing! Can you narrow down whether the problem is due to
> > CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> > completely different things so knowing where to look will help. Thanks!
>
> Tested both.
>
> Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
> Just CONFIG_BOUNCE has no effect in 5 tries.

OK, so page migration is problematic. Thanks for confirmation!

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> > First, I found out that both the problematic alphas had memory
> > compaction and page migration and bounce buffers turned on, and
> > working alphas had them off.
> >
> > Next, turning off these options makes the problematic alphas work.
>
> OK, thanks for testing! Can you narrow down whether the problem is due to
> CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
> completely different things so knowing where to look will help. Thanks!

Tested both.

Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha.
Just CONFIG_BOUNCE has no effect in 5 tries.

-- 
Meelis Roos
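When comparing the working and failing machines, it can help to check the three options mechanically rather than by eye. The following is a minimal sketch, assuming the usual kernel `.config` text format (`CONFIG_FOO=y` for enabled, `# CONFIG_FOO is not set` for disabled); the helper name `enabled_options` and the sample config text are hypothetical and not taken from the machines in this thread.

```python
import re

# The three options discussed in this thread.
OPTIONS = ("CONFIG_MIGRATION", "CONFIG_COMPACTION", "CONFIG_BOUNCE")


def enabled_options(config_text, names=OPTIONS):
    """Return {name: bool} for each option, given the text of a kernel .config.

    An option counts as enabled only if it appears as NAME=y or NAME=m.
    Lines such as '# CONFIG_BOUNCE is not set' leave it disabled.
    """
    state = {name: False for name in names}
    for line in config_text.splitlines():
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if m and m.group(1) in state:
            state[m.group(1)] = m.group(2) in ("y", "m")
    return state


# Hypothetical config fragment matching the failing combination described above.
sample = """\
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_BOUNCE is not set
"""

print(enabled_options(sample))
```

Running the same function over the `.config` of each machine gives a quick table of which combination each kernel was built with.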
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Wed 20-02-19 08:31:05, Meelis Roos wrote:
> > Could
> > https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u
> > be relevant?
>
> Tried it, still broken.

OK, I didn't put too much hope into this patch as you see filesystem
metadata corruption so icache/dcache coherency issues seemed unlikely.
Still good that you've tried so that we are sure.

> I wrote:
> >
> > But my kernel config had memory compaction (that turned on page migration)
> > and bounce buffers. I do not remember why I found them necessary but I
> > will try without them.
>
> First, I found out that both the problematic alphas had memory compaction and
> page migration and bounce buffers turned on, and working alphas had them off.
>
> Next, turning off these options makes the problematic alphas work.

OK, thanks for testing! Can you narrow down whether the problem is due to
CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two
completely different things so knowing where to look will help. Thanks!

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> Could
> https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u
> be relevant?

Tried it, still broken.

I wrote:
> But my kernel config had memory compaction (that turned on page migration)
> and bounce buffers. I do not remember why I found them necessary but I
> will try without them.

First, I found out that both the problematic alphas had memory compaction and
page migration and bounce buffers turned on, and working alphas had them off.

Next, turning off these options makes the problematic alphas work.

-- 
Meelis Roos
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Tue, Feb 19, 2019 at 02:20:26PM +0100, Jan Kara wrote:
> Thanks for information. Yeah, that makes somewhat more sense. Can you ever
> see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your
> findings still seem to indicate that there's some problem with page
> migration and Alpha (added MM list to CC).

Could
https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u
be relevant?
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> Thanks for information. Yeah, that makes somewhat more sense. Can you ever
> see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE?

HAVE_ARCH_TRANSPARENT_HUGEPAGE [=n]
Seems there is no THP on alpha.

> Because your findings still seem to indicate that there's some problem with
> page migration and Alpha (added MM list to CC).

But my kernel config had memory compaction (that turned on page migration)
and bounce buffers. I do not remember why I found them necessary but I will
try without them.

-- 
Meelis Roos
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Tue 19-02-19 14:17:09, Meelis Roos wrote:
> > > > > The result of the bisection is
> > > > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration
> > > > > stalls for blkdev pages
> > > > >
> > > > > Is that result relevant for the problem or should I continue
> > > > > bisecting between 4.20.0 and the so far first bad commit?
> > > >
> > > > Can you try reverting the commit and see if it makes the problem go
> > > > away?
> > >
> > > Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
> > > to make the kernel work - emerge --sync succeeded.
> There is more to it.
>
> After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of
> that patch) successfully for Gentoo update, I upgraded the kernel to
> 5.0.0-rc7-00011-gb5372fe5dc84-dirty (today's git + revert of this patch)
> and it broke on rsync again:
>
> RepoStorageException: command exited with status -6: rsync -a --link-dest
> /usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found
> --exclude=/packages --exclude /.tmp-unverified-download-quarantine
> /usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/
>
> Nothing in dmesg.
>
> This means the real root reason is somewhere deeper and reverting this
> commit just made it less likely to happen.

Thanks for information. Yeah, that makes somewhat more sense. Can you ever
see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your
findings still seem to indicate that there's some problem with page
migration and Alpha (added MM list to CC).

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> > > > The result of the bisection is
> > > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration
> > > > stalls for blkdev pages
> > > >
> > > > Is that result relevant for the problem or should I continue
> > > > bisecting between 4.20.0 and the so far first bad commit?
> > >
> > > Can you try reverting the commit and see if it makes the problem go
> > > away?
> >
> > Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
> > to make the kernel work - emerge --sync succeeded.

There is more to it.

After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of
that patch) successfully for Gentoo update, I upgraded the kernel to
5.0.0-rc7-00011-gb5372fe5dc84-dirty (today's git + revert of this patch)
and it broke on rsync again:

RepoStorageException: command exited with status -6: rsync -a --link-dest
/usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found
--exclude=/packages --exclude /.tmp-unverified-download-quarantine
/usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/

Nothing in dmesg.

This means the real root reason is somewhere deeper and reverting this
commit just made it less likely to happen.

-- 
Meelis Roos
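Because the failure is intermittent, a kernel that passes a few runs of emerge --sync is not necessarily good, which is how false "goods" can end up in a bisection. As a rough sketch, assuming each run fails independently with some fixed probability `p_fail` (an assumption; the real per-run failure rate was not measured in this thread), the chance that a bad kernel passes every one of `tries` runs is `(1 - p_fail) ** tries`. The function names and the example probabilities below are hypothetical.

```python
import math


def chance_all_pass(p_fail, tries):
    """Probability that a bad kernel passes `tries` independent runs in a row."""
    return (1.0 - p_fail) ** tries


def tries_needed(p_fail, max_false_good):
    """Smallest number of clean runs so that a bad kernel slips through
    with probability at most `max_false_good`."""
    if not (0.0 < p_fail < 1.0) or not (0.0 < max_false_good < 1.0):
        raise ValueError("probabilities must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))


# Hypothetical failure rates, purely for illustration.
for p in (0.5, 0.2):
    print(p, chance_all_pass(p, 5), tries_needed(p, 0.05))
```

Under these assumptions, five clean tries are fairly convincing if the bug hits about half the time, but much less so if it hits only one run in five.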
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> Hum, weird. I have hard time understanding how that change could be causing
> fs corruption on Alpha but OTOH it is not completely unthinkable. With this
> commit we may migrate some block device pages we were not able to migrate
> previously and that could be causing some unexpected issue. I'll look into
> this.

To make things more interesting, it does not happen on all alphas but only
on one subarch so far:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1889207.html
is my original bug report.

-- 
Meelis Roos
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Sun 17-02-19 00:29:40, Meelis Roos wrote:
> > > The result of the bisection is
> > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls
> > > for blkdev pages
> > >
> > > Is that result relevant for the problem or should I continue bisecting
> > > between 4.20.0 and the so far first bad commit?
> >
> > Can you try reverting the commit and see if it makes the problem go away?
>
> Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
> to make the kernel work - emerge --sync succeeded.
>
> Unfinished further bisection has also not yielded any other bad revisions
> so far.

Hum, weird. I have hard time understanding how that change could be causing
fs corruption on Alpha but OTOH it is not completely unthinkable. With this
commit we may migrate some block device pages we were not able to migrate
previously and that could be causing some unexpected issue. I'll look into
this.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> > The result of the bisection is
> > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls
> > for blkdev pages
> >
> > Is that result relevant for the problem or should I continue bisecting
> > between 4.20.0 and the so far first bad commit?
>
> Can you try reverting the commit and see if it makes the problem go away?

Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems
to make the kernel work - emerge --sync succeeded.

Unfinished further bisection has also not yielded any other bad revisions
so far.

-- 
Meelis Roos
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Fri, Feb 15, 2019 at 06:59:48PM +0200, Meelis Roos wrote:
> The result of the bisection is
> [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for
> blkdev pages
>
> Is that result relevant for the problem or should I continue bisecting
> between 4.20.0 and the so far first bad commit?

Can you try reverting the commit and see if it makes the problem go away?

						- Ted
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
> > I have noticed ext4 filesystem corruption on two of my test alphas with
> > 4.20.0-09062-gd8372ba8ce28.
>
> Retried it, still happens with 5.0.0-rc5-00358-gdf3865f8f568 - the rsync
> step of emerge --sync just fails with nothing in dmesg.

Finished second round of bisecting, first round did not get me far enough
so I may still have false "goods" in my bisection history.

The command I used for bisecting was Gentoo's emerge --sync, which sometimes
failed with error -6 or -11 from rsync. Usually the file system corruption
did not happen and nothing was in dmesg, just a file IO error from rsync.

The result of the bisection is
[88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls
for blkdev pages

Is that result relevant for the problem or should I continue bisecting
between 4.20.0 and the so far first bad commit?

> > On AlphaServer DS10:
> > [10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: inode #1853093: block 1: comm rsync: invalid block
> >
> > On AlphaServer DS10L:
> > [ 5325.064656] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> > [ 5325.069539] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> > [ 5325.077351] EXT4-fs error (device sda2): ext4_empty_dir:2718: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> >
> > Two other alphas, PC-164 and Eiger, worked fine with the same kernel
> > version (different kernel configs according to hardware).
> >
> > The details: 4.20 worked fine, with gentoo emerge package update after
> > bootup.
> >
> > Next, 4.20.0-06428-g00c569b567c7 worked fine, with gentoo emerge after
> > bootup.
> >
> > Next, 4.20.0-09062-gd8372ba8ce28 booted up fine but rsync and rm during
> > start of gentoo emerge errored out like above.
> >
> > So the corruption _might_ have happened during bootup of previous kernel
> > but it looks more likely that only the latest kernel with blk-mq
> > introduced the problems. mq-deadline is in use on all the alphas.
> >
> > DS10 has Symbios 53C896 SCSI (sym2 driver), DS10L has QLogic ISP1040, so
> > they are different. Working Eiger and PC164 have sym2 based scsi
> > controllers too.

-- 
Meelis Roos
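The "directory entry overrun" message carries enough numbers to see why the entry was rejected: an entry starting at offset 76 with rec_len 61816 would end at byte 61892, far beyond the 4096-byte directory block. The sketch below is a simplified illustration of that bounds check using only the fields printed in the log; it is not the kernel's own code, and the helper names are hypothetical.

```python
import re


def parse_fields(msg):
    """Extract the key=value integer fields from an ext4 error message tail."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", msg)}


def entry_overruns(fields):
    """True if the directory entry would extend past the end of its block."""
    return fields["offset"] + fields["rec_len"] > fields["size"]


# Field list copied from the DS10L log lines above.
msg = "offset=76, inode=417080, rec_len=61816, name_len=35, size=4096"
fields = parse_fields(msg)

print(fields["offset"] + fields["rec_len"], fields["size"], entry_overruns(fields))
```

A rec_len this far out of range suggests the directory block content itself was wrong when it was read, which is consistent with the metadata corruption discussed in the thread rather than with a borderline off-by-a-few error.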
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
02.01.19 17:52 I wrote:
> I have noticed ext4 filesystem corruption on two of my test alphas with
> 4.20.0-09062-gd8372ba8ce28.

Retried it, still happens with 5.0.0-rc5-00358-gdf3865f8f568 - the rsync
step of emerge --sync just fails with nothing in dmesg.

> On AlphaServer DS10:
> [10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: inode #1853093: block 1: comm rsync: invalid block
>
> On AlphaServer DS10L:
> [ 5325.064656] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> [ 5325.069539] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
> [ 5325.077351] EXT4-fs error (device sda2): ext4_empty_dir:2718: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
>
> Two other alphas, PC-164 and Eiger, worked fine with the same kernel
> version (different kernel configs according to hardware).
>
> The details: 4.20 worked fine, with gentoo emerge package update after
> bootup.
>
> Next, 4.20.0-06428-g00c569b567c7 worked fine, with gentoo emerge after
> bootup.
>
> Next, 4.20.0-09062-gd8372ba8ce28 booted up fine but rsync and rm during
> start of gentoo emerge errored out like above.
>
> So the corruption _might_ have happened during bootup of previous kernel
> but it looks more likely that only the latest kernel with blk-mq
> introduced the problems. mq-deadline is in use on all the alphas.
>
> DS10 has Symbios 53C896 SCSI (sym2 driver), DS10L has QLogic ISP1040, so
> they are different. Working Eiger and PC164 have sym2 based scsi
> controllers too.

-- 
Meelis Roos
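Since most failed runs left nothing in dmesg, the few EXT4-fs error lines that did appear are worth collecting in a structured form across test runs. The following is a minimal sketch of a parser for lines shaped like the ones quoted above; the regular expression is an assumption fitted to these specific samples and may not match every ext4 error format, and `parse_ext4_error` is a hypothetical helper name.

```python
import re

# Pattern fitted to the sample lines in this thread (an assumption, not a
# complete grammar of ext4 error messages).
EXT4_ERR = re.compile(
    r"EXT4-fs error \(device (?P<dev>\w+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): "
    r"(?:block (?P<block>\d+): )?"
    r"comm (?P<comm>\S+): (?P<msg>.*)"
)


def parse_ext4_error(log_line):
    """Return a dict of fields from one dmesg line, or None if it does not match."""
    m = EXT4_ERR.search(log_line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("line", "inode", "block"):
        if rec[key] is not None:
            rec[key] = int(rec[key])
    return rec


ds10 = ("[10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: "
        "inode #1853093: block 1: comm rsync: invalid block")
print(parse_ext4_error(ds10))
```

Grouping the parsed records by `func` and `inode` makes it easy to see that the three DS10L messages all refer to the same directory block.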
ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
I have noticed ext4 filesystem corruption on two of my test alphas with
4.20.0-09062-gd8372ba8ce28.

On AlphaServer DS10:
[10749.664418] EXT4-fs error (device sda2): __ext4_iget:5052: inode #1853093: block 1: comm rsync: invalid block

On AlphaServer DS10L:
[ 5325.064656] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
[ 5325.069539] EXT4-fs error (device sda2): htree_dirblock_to_tree:1007: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096
[ 5325.077351] EXT4-fs error (device sda2): ext4_empty_dir:2718: inode #1191951: block 4731728: comm rm: bad entry in directory: directory entry overrun - offset=76, inode=417080, rec_len=61816, name_len=35, size=4096

Two other alphas, PC-164 and Eiger, worked fine with the same kernel
version (different kernel configs according to hardware).

The details: 4.20 worked fine, with gentoo emerge package update after
bootup.

Next, 4.20.0-06428-g00c569b567c7 worked fine, with gentoo emerge after
bootup.

Next, 4.20.0-09062-gd8372ba8ce28 booted up fine but rsync and rm during
start of gentoo emerge errored out like above.

So the corruption _might_ have happened during bootup of previous kernel
but it looks more likely that only the latest kernel with blk-mq
introduced the problems. mq-deadline is in use on all the alphas.

DS10 has Symbios 53C896 SCSI (sym2 driver), DS10L has QLogic ISP1040, so
they are different. Working Eiger and PC164 have sym2 based scsi
controllers too.

Full dmesg of DS10:

[0.00] Linux version 4.20.0-09062-gd8372ba8ce28 (mroos@ds10) (gcc version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #92 Sun Dec 30 01:29:49 EET 2018
[0.00] Booting GENERIC on Tsunami variation Webbrick using machine vector Webbrick from SRM
[0.00] Major Options: LEGACY_START VERBOSE_MCHECK MAGIC_SYSRQ
[0.00] Command line: root=/dev/sda2 console=ttyS0
[0.00] memcluster 0, usage 1, start0, end 256
[0.00] memcluster 1, usage 0, start 256, end65443
[0.00] memcluster 2, usage 1, start65443, end65536
[0.00] 2048K Bcache detected; load hit latency 20 cycles, load miss latency 95 cycles
[0.00] On node 0 totalpages: 65443
[0.00] DMA zone: 448 pages used for memmap
[0.00] DMA zone: 0 pages reserved
[0.00] DMA zone: 65443 pages, LIFO batch:15
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists, mobility grouping on. Total pages: 64995
[0.00] Kernel command line: root=/dev/sda2 console=ttyS0
[0.00] Dentry cache hash table entries: 65536 (order: 6, 524288 bytes)
[0.00] Inode-cache hash table entries: 32768 (order: 5, 262144 bytes)
[0.00] Sorting __ex_table...
[0.00] Memory: 508584K/523544K available (5571K kernel code, 413K rwdata, 1456K rodata, 256K init, 206K bss, 14960K reserved, 0K cma-reserved)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] NR_IRQS: 128
[0.00] HWRPB cycle frequency bogus. Estimated 462413354 Hz
[0.00] clocksource: rpcc: mask: 0x max_cycles: 0x, max_idle_ns: 4133229351 ns
[0.002929] Console: colour VGA+ 80x25
[0.021484] printk: console [ttyS0] enabled
[0.022460] Calibrating delay loop... 916.72 BogoMIPS (lpj=447488)
[0.032226] pid_max: default: 32768 minimum: 301
[0.033203] Mount-cache hash table entries: 1024 (order: 0, 8192 bytes)
[0.034179] Mountpoint-cache hash table entries: 1024 (order: 0, 8192 bytes)
[0.038085] devtmpfs: initialized
[0.040039] random: get_random_u32 called from bucket_table_alloc.isra.17+0xc4/0x290 with crng_init=0
[0.041015] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1866466235866741 ns
[0.041992] futex hash table entries: 256 (order: -1, 6144 bytes)
[0.043945] NET: Registered protocol family 16
[0.045898] EISA bus registered
[0.047851] random: get_random_bytes called from kcmp_cookies_init+0x2c/0x74 with crng_init=0
[0.048828] PCI host bridge to bus :00
[0.050781] pci_bus :00: root bus resource [io 0x-0x1ff]
[0.052734] pci_bus :00: root bus resource [mem 0x-0x3fff]
[0.053710] pci_bus :00: No busn resource found for root bus, will use [bus 00-ff]
[0.054687] pci :00:01.0: [10b9:5237] type 00 class 0x0c0310
[0.054687] pci :00:01.0: reg 0x10: [mem 0x020b4000-0x020b4fff]
[0.054687] pci :00:07.0: [10b9:1533] type 00 class 0x060100
[0.055664] pci :00:09.0: [1011:0019] type 00 class 0x02
[0.055664] pci :00:09.0: reg 0x10: [io 0x1200-0x127f]
[0.055664] pci :00:09.0: reg 0x14: [mem 0x