Re: Runaway SLAB usage by 'bio' during 'device replace'
Hi Chris,

since you are using a recent LTS kernel on your CentOS/Rockstor system, I
guess the kernel errors might help to find some bugs here. Can you give
the devs the errors from your logs? Basic info on your RAID settings would
be nice too, but the devs should ask for specific details on demand.

Generally speaking, raid5/6 works quite OK in everyday use for less
important data, but there are major bugs when it comes to failing disks,
or in general when you try to replace hard drives.

I have a similar problem right now. I added a new drive to an array, and
while deleting an older drive the new drive failed :-(
So I ended up rescuing all data (8TB) to a new array with "btrfs restore".
This took over a week, because there is currently no switch to
automatically cancel looping while recovering. So you have to manually
apply the cancel command on every file where it starts to loop, which
might be a lot.

In general, adding a new drive and afterwards removing the old one is
safer than the replace method, at least right now (as of kernel 4.5/4.6).
But major bug fixes are in the works, and there is hope that raid5/6
becomes more reliable next year.

So good luck!

On 30.05.2016 at 22:55, Duncan wrote:
> Chris Johnson posted on Mon, 30 May 2016 11:48:02 -0700 as excerpted:
>
>> I have a RAID6 array that had a failed HDD. The drive failed completely
>> and has been removed from the system. I'm running a 'device replace'
>> operation with a new disk. The array is ~20TB so this will take a few
>> days.
> This isn't a direct answer to your issue as I'm a user and list regular,
> not a dev, and that's beyond me, but it's something you need to know, if
> you don't already...
>
> Btrfs raid56 mode remains for the time being in general
> negatively-recommended, except specifically for testing with throw-away
> data, due to two critical but not immediately data destroying bugs, one
> related to serial device replacement, the other to balance restriping.
> They may or may not be related to each other, as neither one has been
> fully traced.
>
> The serial replace bug has to do with replacing multiple devices, one at
> a time. The first replace appears to work fine by all visible measures,
> but apparently doesn't return the array to full working condition after
> all, because an attempt to replace a second device fails, and can bring
> down the filesystem. Unfortunately it doesn't always happen, and due to
> the size of devices these days, working arrays tend to be multi-TB
> monsters that take time to get to this point, so all we have at this
> point is multiple reports of the same issue, but no real way to reproduce
> it. I believe but am not sure that the problem can occur regardless of
> whether btrfs replace or device add/delete was used.
>
> The restriping bug has to do with restriping to a different width, either
> manually doing a filtered balance after adding devices, or automatically,
> as triggered by btrfs device delete. Again, multiple reports but not
> nailed down to anything specifically reproducible yet. The problem here
> is that the restripes, while apparently producing correct results, can
> inexplicably take an order of magnitude (or worse) longer than they
> should. What one might expect to take hours takes over a week, and on
> the big arrays that might be expected to take 2-3 days, months.
>
> The problem, again, isn't correctness, but the fact that over such long
> periods, the risk of device loss is increased, and if the array was
> already being reshaped/rebalanced to repair loss of one device, loss of
> another device may kill it.
>
> Neither of these bugs affects normal runtime operation, but both are
> critical enough with regard to what people normally use parity-raid for,
> so they /can/ take a device (or two with raid6) loss and repair the array
> to get back to normal operation, that raid56 remains negatively
> recommended for anything but testing with throw-away data, until after
> these bugs can be fully traced and fixed.
>
>
> Your particular issue doesn't appear to be directly related to either of
> the above. In fact, I know I've seen patches recently having to do with
> memory leaks that may well fix your problem (tho you'd have to be running
> 4.6 at least to have them at this point, and perhaps even 4.7-rc1).
>
> But given the situation, either be sure you have backups and are prepared
> to use them if the array goes south on you due to failed or impractical
> device replacement, or switch to something other than btrfs raid56 mode.
> Btrfs redundancy-raid (raid1 and raid10) are more mature and tested, and
> thus may be options if they fit your filesystem space and device layout
> needs. Alternatively, btrfs (or other filesystems) on top of dm/md-raid
> may be an option, tho you obviously lose some features of btrfs that
> way. And of course zfs is the closest btrfs-comparable that's reasonably
> mature and may be an
Re: Runaway SLAB usage by 'bio' during 'device replace'
On Tue, 31 May 2016, Filipe Manana wrote:

> On Mon, May 30, 2016 at 7:48 PM, Chris Johnson wrote:
>> I have a RAID6 array that had a failed HDD. The drive failed
>> completely and has been removed from the system. I'm running a 'device
>> replace' operation with a new disk. The array is ~20TB so this will
>> take a few days.
>>
>> Yesterday the system crashed hard with OOM errors about 24 hours into
>> the replace. Rebooting after the crash and remounting the array
>> automatically resumed the replace where it left off.
>>
>> Today I kept a close eye on it and have watched the memory usage creep
>> up slowly.
>>
>> htop says this is user process memory (green bar) but shows no user
>> processes using this much memory
>>
>> free says this is almost entirely cached/buffered memory that is
>> taking up the space.
>>
>> slabtop reveals that there is a highly unusual amount of SLAB going to
>> 'bio' which has to do with block allocation apparently. slabtop output
>> is attached.
>>
>> 'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
>> (~4GB) from dentry but 'bio' does not release any (11GB) memory and
>> continues to grow slowly.
>
> Probably you are experiencing a leak that was recently fixed and, at the
> moment, available only in the 4.7-rc1 kernel:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4673272f43ae790ab9ec04e38a7542f82bb8f020

Yes, you would almost certainly be hitting that memory leak.

>> This is running the Rockstor distro based on CentOS. The system has
>> 16GB of RAM.
>>
>> Kernel: 4.4.5-1.el7.elrepo.x86_64
>> btrfs-progs: 4.4.1
>>
>> Kernel messages aren't showing anything of note during the replace
>> until it starts throwing out OOM errors.
>>
>> I would like to collect enough information for a useful bug report
>> here, but I also can't babysit this rebuild during the work week and
>> reboot it once a day for OOM crashes. Should I cancel the replace
>> operation and use 'dev delete missing' instead? Will using 'delete
>> missing' cause any problem if it's done after a partially completed
>> and canceled replace?

If you can't get a kernel with the memory leak patched, 'dev delete
missing' doesn't suffer from the memory leak, so it's possible you could
use that. Also, in our testing we've seen 'dev delete missing' to be more
reliable than replace.

As to whether it will be problematic to cancel the replace and do a
delete missing - that I'm not sure.

Scott

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Runaway SLAB usage by 'bio' during 'device replace'
On Mon, May 30, 2016 at 7:48 PM, Chris Johnson wrote:
> I have a RAID6 array that had a failed HDD. The drive failed
> completely and has been removed from the system. I'm running a 'device
> replace' operation with a new disk. The array is ~20TB so this will
> take a few days.
>
> Yesterday the system crashed hard with OOM errors about 24 hours into
> the replace. Rebooting after the crash and remounting the array
> automatically resumed the replace where it left off.
>
> Today I kept a close eye on it and have watched the memory usage creep
> up slowly.
>
> htop says this is user process memory (green bar) but shows no user
> processes using this much memory
>
> free says this is almost entirely cached/buffered memory that is
> taking up the space.
>
> slabtop reveals that there is a highly unusual amount of SLAB going to
> 'bio' which has to do with block allocation apparently. slabtop output
> is attached.
>
> 'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
> (~4GB) from dentry but 'bio' does not release any (11GB) memory and
> continues to grow slowly.

Probably you are experiencing a leak that was recently fixed and, at the
moment, available only in the 4.7-rc1 kernel:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4673272f43ae790ab9ec04e38a7542f82bb8f020

>
> This is running the Rockstor distro based on CentOS. The system has 16GB of
> RAM.
>
> Kernel: 4.4.5-1.el7.elrepo.x86_64
> btrfs-progs: 4.4.1
>
> Kernel messages aren't showing anything of note during the replace
> until it starts throwing out OOM errors.
>
> I would like to collect enough information for a useful bug report
> here, but I also can't babysit this rebuild during the work week and
> reboot it once a day for OOM crashes. Should I cancel the replace
> operation and use 'dev delete missing' instead? Will using 'delete
> missing' cause any problem if it's done after a partially completed
> and canceled replace?
--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
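As a rough illustration of the version point in this reply, the following Python sketch compares a kernel release string against the first mainline version said above to contain the fix (4.7-rc1). The function names and the `FIXED_IN` constant are hypothetical, introduced only for this example. It looks at the mainline version number only, so it cannot tell whether a distribution or stable kernel carries a backport of the patch.

```python
import re

# Assumption: the leak fix first shipped in mainline 4.7-rc1, as stated in
# the reply above. A backport to an older kernel would not be detected here.
FIXED_IN = (4, 7)


def kernel_version(release):
    """Extract (major, minor) from a string like '4.4.5-1.el7.elrepo.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_fix(release):
    """Return True if the mainline version number is older than FIXED_IN."""
    return kernel_version(release) < FIXED_IN


# The kernel reported in the original post:
print(predates_fix("4.4.5-1.el7.elrepo.x86_64"))  # True
```

On a live system the release string could be taken from `platform.release()` or `uname -r` instead of being typed in.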
Re: Runaway SLAB usage by 'bio' during 'device replace'
Chris Johnson posted on Mon, 30 May 2016 11:48:02 -0700 as excerpted:

> I have a RAID6 array that had a failed HDD. The drive failed completely
> and has been removed from the system. I'm running a 'device replace'
> operation with a new disk. The array is ~20TB so this will take a few
> days.

This isn't a direct answer to your issue as I'm a user and list regular,
not a dev, and that's beyond me, but it's something you need to know, if
you don't already...

Btrfs raid56 mode remains for the time being in general
negatively-recommended, except specifically for testing with throw-away
data, due to two critical but not immediately data destroying bugs, one
related to serial device replacement, the other to balance restriping.
They may or may not be related to each other, as neither one has been
fully traced.

The serial replace bug has to do with replacing multiple devices, one at
a time. The first replace appears to work fine by all visible measures,
but apparently doesn't return the array to full working condition after
all, because an attempt to replace a second device fails, and can bring
down the filesystem. Unfortunately it doesn't always happen, and due to
the size of devices these days, working arrays tend to be multi-TB
monsters that take time to get to this point, so all we have at this
point is multiple reports of the same issue, but no real way to reproduce
it. I believe but am not sure that the problem can occur regardless of
whether btrfs replace or device add/delete was used.

The restriping bug has to do with restriping to a different width, either
manually doing a filtered balance after adding devices, or automatically,
as triggered by btrfs device delete. Again, multiple reports but not
nailed down to anything specifically reproducible yet. The problem here
is that the restripes, while apparently producing correct results, can
inexplicably take an order of magnitude (or worse) longer than they
should.
What one might expect to take hours takes over a week, and on
the big arrays that might be expected to take 2-3 days, months.

The problem, again, isn't correctness, but the fact that over such long
periods, the risk of device loss is increased, and if the array was
already being reshaped/rebalanced to repair loss of one device, loss of
another device may kill it.

Neither of these bugs affects normal runtime operation, but both are
critical enough with regard to what people normally use parity-raid for,
so they /can/ take a device (or two with raid6) loss and repair the array
to get back to normal operation, that raid56 remains negatively
recommended for anything but testing with throw-away data, until after
these bugs can be fully traced and fixed.


Your particular issue doesn't appear to be directly related to either of
the above. In fact, I know I've seen patches recently having to do with
memory leaks that may well fix your problem (tho you'd have to be running
4.6 at least to have them at this point, and perhaps even 4.7-rc1).

But given the situation, either be sure you have backups and are prepared
to use them if the array goes south on you due to failed or impractical
device replacement, or switch to something other than btrfs raid56 mode.
Btrfs redundancy-raid (raid1 and raid10) are more mature and tested, and
thus may be options if they fit your filesystem space and device layout
needs. Alternatively, btrfs (or other filesystems) on top of dm/md-raid
may be an option, tho you obviously lose some features of btrfs that
way. And of course zfs is the closest btrfs-comparable that's reasonably
mature and may be an option, tho there are licensing and hardware issues
(it likes lots of memory on linux due to double-caching of some elements
as its caching scheme doesn't work well with that of linux, and ecc
memory is very strongly recommended) if using it on linux.

I'd suggest giving btrfs raid56 another few kernel releases, six months
to a year, and then check back.
I'd hope the bugs can be properly traced
and fixed within a couple kernel cycles, so four months or so, but I
prefer a few cycles to stabilize with no known critical bugs, before I
recommend it (I was getting close to recommending it after the last known
critical bug was fixed in 4.1, when these came up), which puts the
projected timeframe at 8-12 months, before I could really consider raid56
mode as reasonably stable as btrfs in general, which is to say,
stabilizing, but not yet fully stable, so even then, the standard admin
backup rule that if you don't have backups you consider the data to be
worth less than the time/resources/hassle to do those backups, still
applies more strongly than it would to a fully mature filesystem.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Runaway SLAB usage by 'bio' during 'device replace'
I have a RAID6 array that had a failed HDD. The drive failed
completely and has been removed from the system. I'm running a 'device
replace' operation with a new disk. The array is ~20TB so this will
take a few days.

Yesterday the system crashed hard with OOM errors about 24 hours into
the replace. Rebooting after the crash and remounting the array
automatically resumed the replace where it left off.

Today I kept a close eye on it and have watched the memory usage creep
up slowly.

htop says this is user process memory (green bar) but shows no user
processes using this much memory.

free says this is almost entirely cached/buffered memory that is
taking up the space.

slabtop reveals that there is a highly unusual amount of SLAB going to
'bio' which has to do with block allocation apparently. slabtop output
is attached.

'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
(~4GB) from dentry but 'bio' does not release any (11GB) memory and
continues to grow slowly.

This is running the Rockstor distro based on CentOS. The system has 16GB
of RAM.

Kernel: 4.4.5-1.el7.elrepo.x86_64
btrfs-progs: 4.4.1

Kernel messages aren't showing anything of note during the replace
until it starts throwing out OOM errors.

I would like to collect enough information for a useful bug report
here, but I also can't babysit this rebuild during the work week and
reboot it once a day for OOM crashes. Should I cancel the replace
operation and use 'dev delete missing' instead? Will using 'delete
missing' cause any problem if it's done after a partially completed
and canceled replace?
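For a bug report, the slow growth described above could be recorded over time rather than watched by hand. The Python sketch below is one possible way to do that, not something from the original post: it parses text in the `/proc/slabinfo` "version 2.1" layout and reports the memory held by caches whose name starts with `bio`. The function names and the 4 KiB page size are assumptions made for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86_64 systems


def slab_cache_bytes(slabinfo_text, prefix="bio"):
    """Return {cache_name: bytes} for caches whose name starts with prefix.

    Expects /proc/slabinfo (version 2.1) data lines of the form:
      name active_objs num_objs objsize objperslab pagesperslab
      : tunables limit batchcount sharedfactor
      : slabdata active_slabs num_slabs sharedavail
    """
    sizes = {}
    for line in slabinfo_text.splitlines():
        # Skip the version banner, the column header, and anything malformed.
        if line.startswith(("slabinfo", "#")) or "slabdata" not in line:
            continue
        fields = line.split()
        name = fields[0]
        if not name.startswith(prefix):
            continue
        pages_per_slab = int(fields[5])
        num_slabs = int(fields[fields.index("slabdata") + 2])
        sizes[name] = num_slabs * pages_per_slab * PAGE_SIZE
    return sizes


def growth_per_hour(bytes_before, bytes_after, seconds_between):
    """Linear growth rate in bytes per hour between two samples."""
    return (bytes_after - bytes_before) * 3600.0 / seconds_between
```

Reading `/proc/slabinfo` normally requires root. Taking two samples some minutes apart and passing the `bio` totals to `growth_per_hour` gives a rough figure for how quickly the cache is growing, which can be set against the 16GB of RAM in the machine.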
# slabtop -o -s=a
 Active / Total Objects (% used)    : 33431432 / 33664160 (99.3%)
 Active / Total Slabs (% used)      : 1346736 / 1346736 (100.0%)
 Active / Total Caches (% used)     : 78 / 114 (68.4%)
 Active / Total Size (% used)       : 10512136.19K / 10737701.80K (97.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.32K / 15.62K

    OBJS   ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME
32493650 32492775  99%    0.31K 1299746       25  10397968K bio-1
  323505   323447  99%    0.19K   15405       21     61620K dentry
  176680   176680 100%    0.07K    3155       56     12620K btrfs_free_space
  118208    41288  34%    0.12K    3694       32     14776K kmalloc-128
   94528    43378  45%    0.25K    2954       32     23632K kmalloc-256
   91872    41682  45%    0.50K    2871       32     45936K kmalloc-512
   83048    39031  46%    4.00K   10381        8    332192K kmalloc-4096
   69049    69049 100%    0.27K    2381       29     19048K btrfs_extent_buffer
   46872    46385  98%    0.57K    1674       28     26784K radix_tree_node
   23460    23460 100%    0.12K     690       34      2760K kernfs_node_cache
   17536    17536 100%    0.98K     548       32     17536K btrfs_inode
   16380    16007  97%    0.14K     585       28      2340K btrfs_path
   12444    11635  93%    0.08K     244       51       976K Acpi-State
   12404    12404 100%    0.55K     443       28      7088K inode_cache
   11648    10851  93%    0.06K     182       64       728K kmalloc-64
   10404     5716  54%    0.08K     204       51       816K btrfs_extent_state
    8954     8703  97%    0.18K     407       22      1628K vm_area_struct
    5888     4946  84%    0.03K      46      128       184K kmalloc-32
    5632     5632 100%    0.01K      11      512        44K kmalloc-8
    5049     4905  97%    0.08K      99       51       396K anon_vma
    4352     4352 100%    0.02K      17      256        68K kmalloc-16
    3723     3723 100%    0.05K      51       73       204K Acpi-Parse
    3230     3230 100%    0.05K      38       85       152K ftrace_event_field
    3213     2949  91%    0.19K     153       21       612K kmalloc-192
    3120     3090  99%    0.61K     120       26      1920K proc_inode_cache
    2814     2814 100%    0.09K      67       42       268K kmalloc-96
    1984     1510  76%    1.00K      62       32      1984K kmalloc-1024
    1904     1904 100%    0.07K      34       56       136K Acpi-Operand
    1472     1472 100%    0.09K      32       46       128K trace_event_file
    1224     1224 100%    0.04K      12      102        48K Acpi-Namespace
    1152     1152 100%    0.64K      48       24       768K shmem_inode_cache
     592      581  98%    2.00K      37       16      1184K kmalloc-2048
     528      457  86%    0.36K      24       22       192K blkdev_requests
     462      355  76%    0.38K      22       21       176K mnt_cache
     450      433  96%    1.06K      15       30       480K signal_cache
     429      429 100%    0.20K      11       39        88K btrfs_delayed_ref_head
     420      420 100%    2.05K      28       15       896K idr_layer_cache
     408      408 100%    0.04K       4      102
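To put the bio-1 row in proportion, the short Python sketch below redoes the arithmetic on figures copied from the slabtop output above. It is only a cross-check of the numbers as posted. The variable names are made up for this example, and the "K" suffix is taken to mean KiB.

```python
# Figures copied from the slabtop output above (sizes in KiB, an assumption
# about what slabtop's "K" suffix means).
TOTAL_SLAB_KIB = 10737701.80   # total from "Active / Total Size"
BIO_CACHE_KIB = 10397968       # CACHE SIZE column for bio-1
BIO_SLABS = 1299746            # SLABS column for bio-1
BIO_OBJS = 32493650            # OBJS column for bio-1
BIO_OBJS_PER_SLAB = 25         # OBJ/SLAB column for bio-1

# Fraction of all slab memory held by the bio-1 cache.
bio_share = BIO_CACHE_KIB / TOTAL_SLAB_KIB

# Memory per slab; should be a whole number of pages if the row is consistent.
kib_per_slab = BIO_CACHE_KIB / BIO_SLABS

# Cache size converted from KiB to GiB.
bio_gib = BIO_CACHE_KIB / (1024 * 1024)

print("bio-1 share of slab memory: %.1f%%" % (100 * bio_share))
print("KiB per bio-1 slab: %.1f" % kib_per_slab)
print("bio-1 cache size: %.2f GiB" % bio_gib)
```

The row is internally consistent (slabs times objects per slab equals the object count), and the bio-1 cache accounts for nearly all of the slab memory in use, which matches the observation that dropping caches frees the dentry memory but not the bulk of the usage.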