On 14.01.19 г. 15:13 ч., Scott E. Blomquist wrote:
>
> Nikolay Borisov writes:
> >
> > On 14.01.19 г. 13:42 ч., Scott E. Blomquist wrote:
> > >
> <snip>
> > >
> > > The file system hung again; below is the sysrq output.
> > >
> > > Linux kanlabfs 4.19.13-custom #1 SMP Wed Jan 9 08:36:50 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > btrfs-progs v4.19.1
> > >
> > > # btrfs fi df /export/
> > > Data, single: total=79.61TiB, used=79.61TiB
> > > System, single: total=36.00MiB, used=8.31MiB
> > > Metadata, single: total=192.01GiB, used=190.19GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > So this btrfs is hosted on your local machine but it is exported via
> > NFS, correct?
>
> Correct and via samba also
>
> > >
> > > # btrfs fi show
> > > Label: '/export' uuid: 8f92c2e4-86fe-48cb-b2d3-bc36da765f02
> > > Total devices 3 FS bytes used 79.79TiB
> > > devid 1 size 47.30TiB used 43.58TiB path /dev/sda1
> > > devid 2 size 21.83TiB used 18.11TiB path /dev/sdb1
> > > devid 3 size 21.83TiB used 18.11TiB path /dev/sdc1
> >
> > What kind of disks are those, presumably spinning rust due to their size
> > but what model/make?
> >
>
> 3 x raid 6 on a LSI MegaRAID SAS 9271-8i
Has your controller been updated to the latest firmware? In my
experience LSI MegaRAID controllers are rubbish: in the past, in a
datacenter environment, we had a batch of bad controllers that kept
resetting, which killed all IO on tens of machines.
There is a way to query the controller's built-in log for firmware
errors. I can't remember the exact command, but googling suggests:

MegaCli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log

Can you run that the next time a hang occurs and attach the output?
>
> > > [Mon Jan 14 06:24:26 2019] sysrq: SysRq : Show Blocked State
> >
> > <snip>
> >
> > > [Mon Jan 14 06:24:26 2019] btrfs-transacti D 0 6808 2 0x80000000
> > > [Mon Jan 14 06:24:26 2019] Call Trace:
> > > [Mon Jan 14 06:24:26 2019] ? __schedule+0x2ea/0x870
> > > [Mon Jan 14 06:24:26 2019] schedule+0x32/0x80
> > > [Mon Jan 14 06:24:26 2019] btrfs_start_ordered_extent+0xca/0x100 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] ? wait_woken+0x80/0x80
> > > [Mon Jan 14 06:24:26 2019] btrfs_wait_ordered_range+0xbd/0x110 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] __btrfs_wait_cache_io+0x49/0x1a0 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] btrfs_write_dirty_block_groups+0xed/0x360 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] ? btrfs_run_delayed_refs+0x8b/0x1d0 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] commit_cowonly_roots+0x1ed/0x280 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] btrfs_commit_transaction+0x36e/0x8d0 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] ? start_transaction+0x9b/0x3f0 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] transaction_kthread+0x14d/0x180 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] kthread+0xf8/0x130
> > > [Mon Jan 14 06:24:26 2019] ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
> > > [Mon Jan 14 06:24:26 2019] ? kthread_bind+0x10/0x10
> > > [Mon Jan 14 06:24:26 2019] ret_from_fork+0x35/0x40
> >
> > So the transaction is being committed as a result of that
> > btrfs_start_ordered_extent, which flushes data to disk. Since you've
> > compiled your kernel can you run the following command from the kernel's
> > source:
> >
> > ./scripts/faddr2line vmlinux btrfs_start_ordered_extent+0xca/0x100
> >
> > 'vmlinux' should be the kernel executable with debug info that results
> > from compiling the kernel. I want to figure out which line exactly
> > btrfs_start_ordered_extent+0xca/0x100 resolves to.
>
> <snip>
>
> I'll have to rebuild the kernel with debug symbols. Do I have to be
> booted into the kernel for that command to be useful?
Well, the running kernel needs to correspond to the vmlinux, since
otherwise the offsets might not match. In any case, try rebuilding the
kernel with debug info, booting it, and running the command to see
whether it produces sane output.
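
For reference, here is a minimal sketch of the checks involved, assuming
the usual in-tree build layout. The helper names has_debug_info and
object_for_frame are made up for illustration, and fs/btrfs/btrfs.ko is
only where the module normally ends up after an in-tree build. Note the
"[btrfs]" tag in the trace: frames marked that way belong to the btrfs
module, so faddr2line may need the module object rather than vmlinux.

```shell
#!/bin/sh
# Sketch only: helper names and the module path are illustrative assumptions.

# Succeeds if the given kernel .config enables debug info.
has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y' "$1"
}

# Pick the object file to hand to faddr2line for a trace frame:
# frames tagged "[btrfs]" live in the module, everything else in vmlinux.
object_for_frame() {
    case "$1" in
        *'[btrfs]'*) echo fs/btrfs/btrfs.ko ;;
        *)           echo vmlinux ;;
    esac
}

# Example, run from the top of the kernel build tree:
#   has_debug_info .config || echo "rebuild with CONFIG_DEBUG_INFO=y"
#   obj=$(object_for_frame 'btrfs_start_ordered_extent+0xca/0x100 [btrfs]')
#   ./scripts/faddr2line "$obj" btrfs_start_ordered_extent+0xca/0x100
```

If faddr2line reports no match against vmlinux, pointing it at the
module object is the first thing to try.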
>
> Cheers and Thanks,
>
> sb. Scott Blomquist
>
>
>