Runaway SLAB usage by 'bio' during 'device replace'

2016-05-30 Thread Chris Johnson
I have a RAID6 array that had a failed HDD. The drive failed
completely and has been removed from the system. I'm running a 'device
replace' operation with a new disk. The array is ~20TB so this will
take a few days.

Yesterday the system crashed hard with OOM errors about 24 hours into
the replace. Rebooting after the crash and remounting the array
automatically resumed the replace where it left off.

Today I kept a close eye on it and have watched the memory usage creep
up slowly.

htop attributes this to user process memory (the green bar), but it
shows no user processes using anywhere near that much memory.

free says the space is being taken up almost entirely by
cached/buffered memory.

slabtop reveals a highly unusual amount of SLAB going to 'bio', which
apparently relates to block I/O. slabtop output is attached below.

'sync && echo 3 > /proc/sys/vm/drop_caches' clears the ~4GB of high
usage from dentry, but 'bio' does not release any of its ~11GB and
continues to grow slowly.
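
(For tracking that growth without a full slabtop run, sampling the bio
caches straight from /proc/slabinfo should work, something like:

# watch -n 60 'grep ^bio- /proc/slabinfo'

where the first two columns are active and total objects. I'm assuming
the stock /proc interface here; corrections welcome.)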

This is running the Rockstor distro based on CentOS. The system has 16GB of RAM.

Kernel: 4.4.5-1.el7.elrepo.x86_64
btrfs-progs: 4.4.1

Kernel messages aren't showing anything of note during the replace
until it starts throwing out OOM errors.
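
(When it does fall over, the OOM killer reports should still be in the
kernel log; something like

# dmesg -T | grep -iA10 'out of memory'

ought to capture them for the bug report, assuming the ring buffer
hasn't wrapped by the time I get to it.)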

I would like to collect enough information for a useful bug report
here, but I also can't babysit this rebuild during the work week and
reboot it once a day for OOM crashes. Should I cancel the replace
operation and use 'dev delete missing' instead? Will using 'delete
missing' cause any problem if it's done after a partially completed
and canceled replace?
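
(As I understand it the sequence would be 'btrfs replace status /mnt'
to check progress, 'btrfs replace cancel /mnt' to stop it, and then the
device add / delete missing route, but corrections welcome if that's
wrong.)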
# slabtop -o -s=a
 Active / Total Objects (% used): 33431432 / 33664160 (99.3%)
 Active / Total Slabs (% used)  : 1346736 / 1346736 (100.0%)
 Active / Total Caches (% used) : 78 / 114 (68.4%)
 Active / Total Size (% used)   : 10512136.19K / 10737701.80K (97.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.32K / 15.62K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
32493650 32492775  99%    0.31K 1299746       25  10397968K bio-1
323505 323447  99%    0.19K  15405       21     61620K dentry
176680 176680 100%    0.07K   3155       56     12620K btrfs_free_space
118208  41288  34%    0.12K   3694       32     14776K kmalloc-128
 94528  43378  45%    0.25K   2954       32     23632K kmalloc-256
 91872  41682  45%    0.50K   2871       32     45936K kmalloc-512
 83048  39031  46%    4.00K  10381        8    332192K kmalloc-4096
 69049  69049 100%    0.27K   2381       29     19048K btrfs_extent_buffer
 46872  46385  98%    0.57K   1674       28     26784K radix_tree_node
 23460  23460 100%    0.12K    690       34      2760K kernfs_node_cache
 17536  17536 100%    0.98K    548       32     17536K btrfs_inode
 16380  16007  97%    0.14K    585       28      2340K btrfs_path
 12444  11635  93%    0.08K    244       51       976K Acpi-State
 12404  12404 100%    0.55K    443       28      7088K inode_cache
 11648  10851  93%    0.06K    182       64       728K kmalloc-64
 10404   5716  54%    0.08K    204       51       816K btrfs_extent_state
  8954   8703  97%    0.18K    407       22      1628K vm_area_struct
  5888   4946  84%    0.03K     46      128       184K kmalloc-32
  5632   5632 100%    0.01K     11      512        44K kmalloc-8
  5049   4905  97%    0.08K     99       51       396K anon_vma
  4352   4352 100%    0.02K     17      256        68K kmalloc-16
  3723   3723 100%    0.05K     51       73       204K Acpi-Parse
  3230   3230 100%    0.05K     38       85       152K ftrace_event_field
  3213   2949  91%    0.19K    153       21       612K kmalloc-192
  3120   3090  99%    0.61K    120       26      1920K proc_inode_cache
  2814   2814 100%    0.09K     67       42       268K kmalloc-96
  1984   1510  76%    1.00K     62       32      1984K kmalloc-1024
  1904   1904 100%    0.07K     34       56       136K Acpi-Operand
  1472   1472 100%    0.09K     32       46       128K trace_event_file
  1224   1224 100%    0.04K     12      102        48K Acpi-Namespace
  1152   1152 100%    0.64K     48       24       768K shmem_inode_cache
   592    581  98%    2.00K     37       16      1184K kmalloc-2048
   528    457  86%    0.36K     24       22       192K blkdev_requests
   462    355  76%    0.38K     22       21       176K mnt_cache
   450    433  96%    1.06K     15       30       480K signal_cache
   429    429 100%    0.20K     11       39        88K btrfs_delayed_ref_head
   420    420 100%    2.05K     28       15       896K idr_layer_cache
   408    408 100%    0.04K      4      102

Functional difference between "replace" vs "add" then "delete missing" with a missing disk in a RAID56 array

2016-05-29 Thread Chris Johnson
Situation: A six-disk RAID5/6 array with a completely failed disk. The
failed disk is removed and an identical replacement drive is plugged
in.

Here I have two options for replacing the disk, assuming the old drive
is device 6 in the superblock and the replacement disk is /dev/sda.

'btrfs replace start 6 /dev/sda /mnt'
This starts rebuilding onto the new drive, reconstructing the data that
was on device 6 from the remaining data and parity.

'btrfs device add /dev/sda /mnt && btrfs device delete missing /mnt'
This adds the replacement disk as a new device in the array, and
'device delete missing' appears to trigger a rebalance that migrates
data onto it before removing the missing disk from the array. The end
result appears to be identical to option 1.

A few weeks back I recovered an array with a failed drive using
'delete missing' because 'replace' caused a kernel panic. I later
discovered that it was not (just) a failed drive but some other piece
of failed hardware, either the motherboard or the HBA, which I've yet
to diagnose. The drives are in a new server now and I am currently
rebuilding the array with 'replace', which I believe is the "more
correct" way to replace a bad drive in an array.

Both work, but 'replace' seems to be slower, so I'm curious what the
functional differences between the two are. I expected 'replace' to be
faster: since it only rebuilds one drive from parity rather than doing
a complete rebalance, I assumed it would need to read fewer blocks.

What are the differences between the two under the hood? The only
obvious difference I could see is that when I ran 'replace' the space
on the replacement drive was allocated instantly according to
'filesystem show', while when I used 'device delete' the drive usage
crept up slowly over the course of the rebalance.
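
(For reference, the per-device picture I'm comparing comes from
'btrfs filesystem show /mnt' and 'btrfs device usage /mnt', which as
far as I understand report chunk allocation per device rather than
actual data written, so I may be misreading what "instantly allocated"
means here.)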