On 19/10/2019 21:29, Peter Hjalmarsson wrote:

Hi Peter,

Thanks for the report.

> While trying to reproduce another problem I have seen with BTRFS while
> running balance and raid6 I hit an issue resulting in:
> BUG: kernel NULL pointer dereference, address: 00000000000002ce
> 
> I created a script trying to pinpoint the problem utilizing zram, and
> the run goes like this:
> 
> 1. run the scripts which sets up a "SINGEL" filesystem on two larger
> devices, fills it with an adequate amount of data, and then tries to
> convert it to raid6
> 2. the filesystem will fail to convert due to not enough space to
> convert all data to raid6 (not enough space on at least 3 devices at
> the same time), hitting an enospc-error
> * Up until this point stuff still seems to work without crashing, and
> the system seems stable, just with two different profiles fot the data
> 3. issue the umount command which will be "Killed", and a backtrace
> will be written to dmesg
> 
> This results in a filesystem that cannot be synced, unmounted, and in
> all just seems crashed.
> I have ran the script 6 times. 1 time it passed, 5 times it has
> crashed, and I have rebooted the computer, since that is the only way
> it seems to get rid of the filesysem, so the issue seems pretty
> reproducable.
> 
> The system is a laptop running Fedora 31 Beta with the latest updates,
> and the latest kernel (5.3.6-300.fc31.x86_64)
> 
> Do you have any input, or any other questions or stuff you want me to test?
> 
> The script looks as follow:
> -----
> $ cat run-btrfs-test
> modprobe -iv zram num_devices=8
> udevadm trigger
> sync
> zramctl /dev/zram0 -s 8G && \
> zramctl /dev/zram1 -s 8G && \
> zramctl /dev/zram2 -s 4G && \
> zramctl /dev/zram3 -s 4G && \
> zramctl /dev/zram4 -s 4G && \
> zramctl /dev/zram5 -s 2G && \
> zramctl /dev/zram6 -s 2G && \
> zramctl /dev/zram7 -s 4G && \
> mkfs.btrfs /dev/zram0 && \
> mkdir -p /mnt/btrfs-test && \
> mount /dev/zram0 /mnt/btrfs-test && \
> echo "FS Mounted" && \
> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
> echo "Devices added" && \
> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
> bs=32M count=1 && sync ; done
> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test && \
> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test
> =====
> 
> The output from the script, I will trim the output down to after the for-loop:
> ------
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.64GiB
> devid    1 size 8.00GiB used 8.00GiB path /dev/zram0
> devid    2 size 8.00GiB used 8.00GiB path /dev/zram1
> devid    3 size 4.00GiB used 0.00B path /dev/zram2
> devid    4 size 4.00GiB used 0.00B path /dev/zram3
> devid    5 size 4.00GiB used 0.00B path /dev/zram4
> devid    6 size 2.00GiB used 0.00B path /dev/zram5
> devid    7 size 2.00GiB used 0.00B path /dev/zram6
> devid    8 size 4.00GiB used 0.00B path /dev/zram7
> 
> Data, single: total=15.74GiB, used=15.63GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=264.00MiB, used=17.06MiB
> GlobalReserve, single: total=16.70MiB, used=0.00B
> Done, had to relocate 3 out of 20 chunks
> ERROR: error during balancing '/mnt/btrfs-test': No space left on device
> There may be more info in syslog - try dmesg | tail
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.65GiB
> devid    1 size 8.00GiB used 2.99GiB path /dev/zram0
> devid    2 size 8.00GiB used 2.00GiB path /dev/zram1
> devid    3 size 4.00GiB used 4.00GiB path /dev/zram2
> devid    4 size 4.00GiB used 4.00GiB path /dev/zram3
> devid    5 size 4.00GiB used 4.00GiB path /dev/zram4
> devid    6 size 2.00GiB used 2.00GiB path /dev/zram5
> devid    7 size 2.00GiB used 2.00GiB path /dev/zram6
> devid    8 size 4.00GiB used 4.00GiB path /dev/zram7
> 
> Data, single: total=2.00GiB, used=1.97GiB
> Data, RAID6: total=14.41GiB, used=13.66GiB
> System, RAID6: total=80.00MiB, used=16.00KiB
> Metadata, RAID6: total=512.00MiB, used=17.52MiB
> GlobalReserve, single: total=17.16MiB, used=0.00B
> ===
> 
> Issueing the umount:
> ----
> # umount /mnt/btrfs-test && modprobe -rv zram

OK, I couldn't reproduce it in my environment (5.4-rc3+ based
btrfs-devel/misc-next form David) with this script. I'll dig deeper.


> Killed
> ===
> 
> And last but not least: the output in dmesg:
> ---
> [  205.960233] BTRFS info (device zram0): 2 enospc errors during balance
> [  205.960235] BTRFS info (device zram0): balance: ended with status: -28

Here balance ended with -ENOSPC (28).

> [  235.774821] BUG: kernel NULL pointer dereference, address: 00000000000002ce

That's a NULL pointer deference with an offset of 0x2ce (718).

> [  235.774826] #PF: supervisor read access in kernel mode
> [  235.774828] #PF: error_code(0x0000) - not-present page
> [  235.774830] PGD 0 P4D 0
> [  235.774834] Oops: 0000 [#1] SMP PTI
> [  235.774838] CPU: 3 PID: 5421 Comm: umount Not tainted
> 5.3.6-300.fc31.x86_64 #1
> [  235.774840] Hardware name: LENOVO 80JV/Lenovo U41-70, BIOS BDCN71WW
> 08/03/2016
> [  235.774847] RIP: 0010:__free_pages+0x5/0x30
> [  235.774850] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
> e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
> 44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
> e9 82
> [  235.774852] RSP: 0018:ffffc3cf0ffb7db0 EFLAGS: 00010046
> [  235.774854] RAX: ffffa0ad0ffa0118 RBX: 0000000000000045 RCX: 
> 0000000000000000
> [  235.774856] RDX: ffffa0aeceaee2f0 RSI: 0000000000000000 RDI: 
> 000000000000029a
> [  235.774858] RBP: ffffa0ad0ffa0000 R08: fffffb06c23fd108 R09: 
> fffffb06c23fd108
> [  235.774860] R10: 0000000000068879 R11: fffffb06c02cf820 R12: 
> 0000000000000045
> [  235.774861] R13: ffffa0ade5c00010 R14: ffffa0ae0d19e5ac R15: 
> ffffa0ae0d19e578
> [  235.774864] FS:  00007f3b40f6cc80(0000) GS:ffffa0aeceac0000(0000)
> knlGS:0000000000000000
> [  235.774866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  235.774867] CR2: 00000000000002ce CR3: 000000004c1f0004 CR4: 
> 00000000003606e0

CR2 has it as well, so we're page faulting on an access to 0x2ce.

Let's look at __free_pages():

(gdb) l *(__free_pages+0x5)
0xffffffff81179e45 is in __free_pages (mm/page_alloc.c:4818).
4813                    __free_pages_ok(page, order);
4814    }
4815    
4816    void __free_pages(struct page *page, unsigned int order)
4817    {
4818            if (put_page_testzero(page))
4819                    free_the_page(page, order);
4820    }
4821    EXPORT_SYMBOL(__free_pages);
4822    
(gdb)

We're getting called by __free_page() so we know order is 0. So
something must have passed in a NULL page (somehow).

Looking at __free_raid_bio() one step further up the call-chain I see this:

for (i = 0; i < rbio->nr_pages; i++) {
         if (rbio->stripe_pages[i]) {
                 __free_page(rbio->stripe_pages[i]);
                 rbio->stripe_pages[i] = NULL;
         }
}

> [  235.774869] Call Trace:
> [  235.774920]  __free_raid_bio+0x72/0xb0 [btrfs]
> [  235.774961]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
> [  235.774992]  close_ctree+0x1ea/0x2f0 [btrfs]
> [  235.774998]  generic_shutdown_super+0x6c/0x100
> [  235.775001]  kill_anon_super+0x14/0x30
> [  235.775024]  btrfs_kill_super+0x12/0xa0 [btrfs]
> [  235.775029]  deactivate_locked_super+0x36/0x70
> [  235.775033]  cleanup_mnt+0x104/0x150
> [  235.775038]  task_work_run+0x87/0xa0
> [  235.775043]  exit_to_usermode_loop+0xda/0x100
> [  235.775047]  do_syscall_64+0x183/0x1a0
> [  235.775053]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  235.775056] RIP: 0033:0x7f3b411b767b
> [  235.775060] Code: 08 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
> 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 07 0c 00 f7 d8 64 89
> 01 48

This decodes to:

All code
========
   0:   08 0c 00                or     %cl,(%rax,%rax,1)
   3:   f7 d8                   neg    %eax
   5:   64 89 01                mov    %eax,%fs:(%rcx)
   8:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
   c:   c3                      retq
   d:   66 90                   xchg   %ax,%ax
   f:   f3 0f 1e fa             endbr64
  13:   31 f6                   xor    %esi,%esi
  15:   e9 05 00 00 00          jmpq   0x1f
  1a:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  1f:   f3 0f 1e fa             endbr64
  23:   b8 a6 00 00 00          mov    $0xa6,%eax
  28:   0f 05                   syscall
  2a:*  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax         <--
trapping instruction
  30:   73 01                   jae    0x33
  32:   c3                      retq
  33:   48 8b 0d dd 07 0c 00    mov    0xc07dd(%rip),%rcx        # 0xc0817
  3a:   f7 d8                   neg    %eax
  3c:   64 89 01                mov    %eax,%fs:(%rcx)
  3f:   48                      rex.W

Code starting with the faulting instruction
===========================================
   0:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
   6:   73 01                   jae    0x9
   8:   c3                      retq
   9:   48 8b 0d dd 07 0c 00    mov    0xc07dd(%rip),%rcx        # 0xc07ed
  10:   f7 d8                   neg    %eax
  12:   64 89 01                mov    %eax,%fs:(%rcx)
  15:   48                      rex.W

This doesn't look like __free_pages

(gdb) disassemble __free_pages
Dump of assembler code for function __free_pages:
   0xffffffff81179e40 <+0>:     lock decl 0x34(%rdi)
   0xffffffff81179e44 <+4>:     jne    0xffffffff81179e51 <__free_pages+17>
   0xffffffff81179e46 <+6>:     test   %esi,%esi
   0xffffffff81179e48 <+8>:     je     0xffffffff81179e4f <__free_pages+15>
   0xffffffff81179e4a <+10>:    jmpq   0xffffffff81178500 <__free_pages_ok>
   0xffffffff81179e4f <+15>:    jmp    0xffffffff81179de0 <free_unref_page>
   0xffffffff81179e51 <+17>:    repz retq
End of assembler dump.
(gdb)

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumsh...@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Reply via email to