Re: BTRFS RAID filesystem unmountable
On 2018/12/7 7:15 AM, Michael Wade wrote:
> Hi Qu,
>
> Me again! Having formatted the drives and rebuilt the RAID array I seem
> to be having the same problem as before (no power cut this time [I
> bought a UPS]).

But strangely, your super block shows it has a log tree, which means
either you hit a kernel panic/transaction abort, or an unexpected power
loss.

> The btrfs volume is broken on my ReadyNAS.
>
> I have attached the results of some of the commands you asked me to run
> last time, and I am hoping you might be able to help me out.

This time, the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376 "
and paste the output.

But I don't have any clue. My guess is some kernel problem related to new
chunk allocation, or that the chunk root node itself is already seriously
corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel, unless it's well backported with tons of
btrfs fixes.

Thanks,
Qu

> Kind regards
> Michael
>
> On Sat, 19 May 2018 at 12:43, Michael Wade wrote:
>>
>> I have let the find root command run for 14+ days; it has produced a
>> pretty huge log file (1.6 GB) but still hasn't completed. I think I
>> will start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo wrote:
>>>
>>>
>>> On 2018/05/05 00:18, Michael Wade wrote:

Hi Qu,

The tool is still running and the log file is now ~300 MB. I guess it
shouldn't normally take this long. Is there anything else worth trying?

>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do a much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and
>>> not much we can do right now.
>>>
>>> Thanks,
>>> Qu

>>> Kind regards
>>> Michael

On 2 May 2018 at 06:29, Michael Wade wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools are still running and have produced a log
> file of ~85 MB filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo wrote:
>>
>>
>> On 2018/05/01 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear, that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root will suppress csum error output, such basic
>> tree validation checks are not suppressed, thus you get such messages.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed, with the same result. Is there a CLI utility I
>>> could use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo wrote:

On 2018/04/29 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K
> skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K
> skip=266359275520

Unfortunately, both dumps are corrupted and contain mostly garbage.

I think the underlying stack (mdraid) has something wrong or failed to
recover its data.

This means your last
Re: btrfs progs always assume devid 1?
On 2018-12-05 14:50, Roman Mamedov wrote:
> Hello,
>
> To migrate my FS to a different physical disk, I have added a new empty
> device to the FS, then ran the remove operation on the original one.
> Now my FS has only devid 2:
>
> Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
>         Total devices 1 FS bytes used 36.63GiB
>         devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1
>
> And all the operations of btrfs-progs now fail to work in their default
> invocation, such as:
>
> # btrfs fi resize max .
> Resize '.' of 'max'
> ERROR: unable to resize '.': No such device
>
> [768813.414821] BTRFS info (device dm-5): resizer unable to find device 1
>
> Of course this works:
>
> # btrfs fi resize 2:max .
> Resize '.' of '2:max'
>
> But this is inconvenient and seems to be a rather simple oversight. If
> what I got is normal (the device staying as ID 2 after such an
> operation), then count that as a suggestion that btrfs-progs should use
> the first existing devid, rather than always looking for a hard-coded
> devid 1.

I've been meaning to try to write up a patch to special-case this for a
while now, but have not gotten around to it yet.

FWIW, this is one of multiple reasons that it's highly recommended to use
`btrfs replace` instead of adding a new device and deleting the old one
when replacing a device. Other benefits include:

* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice
  indication of the completion percentage.

The only disadvantages are that the new device has to be at least as
large as the old one (though you can get around this to a limited degree
by shrinking the old device), and that it needs the old and new devices
to be plugged in at the same time (add/delete doesn't, if you flip the
order of the add and delete commands).
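The two workflows contrasted above can be sketched as follows. The device
and mount-point paths are made-up examples, and every command is echoed
through a dry-run guard rather than executed, since the real commands
need root and actual block devices:

```shell
# Hypothetical paths -- substitute your own devices and mount point.
OLD=/dev/mapper/vg-p1
NEW=/dev/sdb1
MNT=/mnt/p1

run() { echo "+ $*"; }  # dry-run guard: print the command instead of executing it

# Preferred: one command, runs in the background by default, and the new
# device takes over the old device's devid, so devid 1 keeps existing.
run btrfs replace start "$OLD" "$NEW" "$MNT"
run btrfs replace status "$MNT"   # queryable completion percentage

# The add/delete alternative discussed above, which leaves the FS with
# only devid 2 afterwards:
run btrfs device add "$NEW" "$MNT"
run btrfs device remove "$OLD" "$MNT"
```

Removing the `run` prefix turns the sketch into the real thing; note that
`replace start` needs the new device to be at least as large as the old
one, as mentioned above.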
Re: BTRFS Mount Delay Time Graph
On 4.12.18 22:14, Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 20:20, Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful
>>> of drives to result in a filesystem that takes around a minute to
>>> mount. As a result of this, I suspect this will become an increasing
>>> problem for serious users of BTRFS as time goes on. I'm not
>>> complaining as I'm not a contributor so I have no room to do so --
>>> just shedding some light on a problem that may deserve attention as
>>> filesystem sizes continue to grow.
>>
>> Would it be possible to provide perf traces of the longer-running
>> mount time? Everyone seems to be fixated on reading block groups
>> (which is likely to be the culprit) but before pointing fingers I'd
>> like concrete evidence pointing at the offender.
>
> I am glad to collect such traces -- please advise with commands that
> would achieve that. If you just mean block traces, I can do that, but I
> suspect you mean something more BTRFS-specific.

A command that would be good is:

perf record --all-kernel -g mount /dev/vdc /media/scratch/

Of course, replace the device/mount path appropriately. This will result
in a perf.data file which contains stack traces of the hottest paths
executed during the invocation of mount. If you could send this file to
the mailing list, or upload it somewhere for interested people (me and
perhaps Qu) to inspect, that would be appreciated. If the file turns out
way too big, you can use perf report --stdio to create a text output and
send that as well.

>
> Best,
>
> ellis
>
Re: BTRFS Mount Delay Time Graph
On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 20:20, Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful
>> of drives to result in a filesystem that takes around a minute to
>> mount. As a result of this, I suspect this will become an increasing
>> problem for serious users of BTRFS as time goes on. I'm not
>> complaining as I'm not a contributor so I have no room to do so --
>> just shedding some light on a problem that may deserve attention as
>> filesystem sizes continue to grow.
>
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

I am glad to collect such traces -- please advise with commands that
would achieve that. If you just mean block traces, I can do that, but I
suspect you mean something more BTRFS-specific.

Best,

ellis
Re: BTRFS Mount Delay Time Graph
On 04/12/2018 03:52, Chris Murphy wrote:
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton wrote:
>> On 03/12/2018 20:56, Lionel Bouton wrote:
>>> [...]
>>> Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable
>>> improvement in mount time (I managed to reduce the mount of IO
>>> requests
>>
>> Sent too quickly: I meant to write "managed to reduce by half the
>> number of IO write requests for the same amount of data written"
>>
>>> by half on one server in production though, although more tests are
>>> needed to isolate the cause).
>
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that reduces
> the writes by half, or by how much for each? That's a major reduction
> in writes, and suggests it might be possible for further optimization,
> to help mitigate the wandering trees impact.

Note, the other major changes were:
- upgrading from 4.9 to 4.14,
- using multi-queue aware bfq instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel
Re: BTRFS Mount Delay Time Graph
On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
>
>
> On 3.12.18 20:20, Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS
>> filesystem as it grows. I finally had (made) time for this, and the
>> attached is the result of my testing. The image is a fairly
>> self-explanatory graph, and the raw data is also attached in
>> comma-delimited format for the more curious. The columns are:
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their
>> own directory. Files generated in parallel each epoch (25 at the same
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments
>> of 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount. "Real" time
>> used of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as
>> bad as I originally thought it would be (perhaps as a result of using
>> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am
>> open to comment if folks more intimately familiar with BTRFS think
>> this is due to the very large files I've used. I can redo the test
>> with much more realistic data if people have legitimate reason to
>> think it will drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful
>> of drives to result in a filesystem that takes around a minute to
>> mount. As a result of this, I suspect this will become an increasing
>> problem for serious users of BTRFS as time goes on. I'm not
>> complaining as I'm not a contributor so I have no room to do so --
>> just shedding some light on a problem that may deserve attention as
>> filesystem sizes continue to grow.
>
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

IIRC I submitted such an analysis years ago. Nowadays it may change due
to chunk <-> bg <-> dev_extents cross checking.

So yes, it would be a good idea to show such percentages.

Thanks,
Qu

>
>>
>> Best,
>>
>> ellis
>>
Re: btrfs development - question about crypto api integration
On Fri, Nov 30, 2018 at 06:27:58PM +0200, Nikolay Borisov wrote:
>
>
> On 30.11.18 17:22, Chris Mason wrote:
>> On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
>>
>>> On 29.11.18 18:43, Jean Fobe wrote:
>>>> Hi all,
>>>> I've been studying LZ4 and other compression algorithms on the
>>>> kernel, and seen other projects such as zram and ubifs using the
>>>> crypto api. Is there a technical reason for not using the crypto api
>>>> for compression (and possibly for encryption) in btrfs?
>>>> I did not find any design/technical implementation choices in
>>>> btrfs development in the developer's FAQ on the wiki. If I
>>>> completely missed it, could someone point me in the right direction?
>>>> Lastly, if there is no technical reason for this, would it be
>>>> something interesting to have implemented?
>>>
>>> I personally think it would be better if btrfs exploited the generic
>>> framework. And in fact when you look at zstd, btrfs does use the
>>> generic, low-level ZSTD routines but not the crypto library wrappers.
>>> If I were you, I'd try to convert zstd (since it's the most recently
>>> added algorithm) to using the crypto layer to see if there are any
>>> lurking problems.
>>
>> Back when I first added the zlib support, the zlib API was both easier
>> to use and a better fit for our async worker threads. That doesn't
>> mean we shouldn't switch, it's just how we got to step one ;)
>
> And what about zstd? Why is it also using the low-level api and not the
> crypto wrappers?

I think because it just copied the way things already were.
Here's fs/ubifs/compress.c as an example use of the API in a filesystem:

ubifs_compress
  crypto_comp_compress
    crypto_comp_crt      -- address of, dereference ->cot_compress
                         -- takes another address of something, indirect
                            function call with value 'crypto_compress'
    -- this does 2 pointer dereferences and an indirect call to
       coa_compress
    coa_compress = lzo_compress
    lzo_compress         -- pointer dereferences for crypto context
    lzo1x_1_compress     -- through static __lzo_compress (no overhead)

while in btrfs:

btrfs_compress_pages
  ->compress_pages
    lzo1x_1_compress

The crypto API adds several pointer and function-call indirections; I'm
not sure I want to get rid of the direct calls just yet.
Re: BTRFS Mount Delay Time Graph
On 3.12.18 20:20, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS
> filesystem as it grows. I finally had (made) time for this, and the
> attached is the result of my testing. The image is a fairly
> self-explanatory graph, and the raw data is also attached in
> comma-delimited format for the more curious. The columns are:
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their
> own directory. Files generated in parallel each epoch (25 at the same
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments
> of 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount. "Real" time used
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as
> bad as I originally thought it would be (perhaps as a result of using
> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open
> to comment if folks more intimately familiar with BTRFS think this is
> due to the very large files I've used. I can redo the test with much
> more realistic data if people have legitimate reason to think it will
> drastically change the result.
>
> With 14TB drives available today, it doesn't take more than a handful
> of drives to result in a filesystem that takes around a minute to
> mount. As a result of this, I suspect this will become an increasing
> problem for serious users of BTRFS as time goes on. I'm not complaining
> as I'm not a contributor so I have no room to do so -- just shedding
> some light on a problem that may deserve attention as filesystem sizes
> continue to grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing fingers I'd like concrete
evidence pointing at the offender.

>
> Best,
>
> ellis
>
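The quoted measurement procedure (three unmount/mount cycles per data
increment, timed with `time`) could be sketched roughly as below. The
device and mount point are the ones from the quoted setup used purely as
examples, and the mount/umount commands are echoed through a dry-run
guard rather than executed:

```shell
DEV=/dev/md0   # example device
MNT=/btrfs     # example mount point

run() { echo "+ $*"; }  # dry-run guard: print the command instead of executing it

# Three cold-cache mount timings, as in the quoted experiment; the graph
# records the wall-clock ("real") time of each mount.
for cycle in 1 2 3; do
  run umount "$MNT"
  start=$(date +%s)
  run mount "$DEV" "$MNT"
  end=$(date +%s)
  echo "cycle $cycle: $((end - start))s"
done
```

Dropping the `run` prefix (and running as root on a real filesystem)
reproduces the measurement; `/usr/bin/time` gives sub-second resolution
where the plain `date` arithmetic above only gives whole seconds.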
Re: BTRFS Mount Delay Time Graph
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton wrote:
>
> On 03/12/2018 20:56, Lionel Bouton wrote:
>> [...]
>> Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
>> tuning of the io queue (switching between classic io-schedulers and
>> blk-mq ones in the virtual machines) and BTRFS mount options
>> (space_cache=v2,ssd_spread) but there wasn't any measurable
>> improvement in mount time (I managed to reduce the mount of IO
>> requests
>
> Sent too quickly: I meant to write "managed to reduce by half the
> number of IO write requests for the same amount of data written"
>
>> by half on one server in production though, although more tests are
>> needed to isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that reduces
the writes by half, or by how much for each? That's a major reduction in
writes, and suggests it might be possible for further optimization, to
help mitigate the wandering trees impact.

--
Chris Murphy
Re: BTRFS Mount Delay Time Graph
On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS
> filesystem as it grows. I finally had (made) time for this, and the
> attached is the result of my testing. The image is a fairly
> self-explanatory graph, and the raw data is also attached in
> comma-delimited format for the more curious. The columns are:
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their
> own directory. Files generated in parallel each epoch (25 at the same
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments
> of 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount. "Real" time used
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as
> bad as I originally thought it would be (perhaps as a result of using
> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open
> to comment if folks more intimately familiar with BTRFS think this is
> due to the very large files I've used. I can redo the test with much
> more realistic data if people have legitimate reason to think it will
> drastically change the result.
>
> With 14TB drives available today, it doesn't take more than a handful
> of drives to result in a filesystem that takes around a minute to
> mount. As a result of this, I suspect this will become an increasing
> problem for serious users of BTRFS as time goes on. I'm not complaining
> as I'm not a contributor so I have no room to do so -- just shedding
> some light on a problem that may deserve attention as filesystem sizes
> continue to grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which will try to
read *ALL* block group items. And to no one's surprise, the larger the
fs grows, the more block group items need to be read from disk.

We need to delay such reads to improve this case.

Thanks,
Qu

>
> Best,
>
> ellis
>
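One rough way to see how much work btrfs_read_block_groups() faces is to
count the BLOCK_GROUP_ITEMs in the extent tree. The device path below is
an example, and the command is echoed through a dry-run guard since
dump-tree needs root and should be run against an unmounted filesystem:

```shell
DEV=/dev/md0   # example device; run dump-tree only on an unmounted FS

run() { echo "+ $*"; }  # dry-run guard: print the command instead of executing it

# Each BLOCK_GROUP_ITEM sits in the extent tree at its virtual address,
# scattered among ordinary extent items, so mount has to seek for every
# one of them; the count below grows roughly with filesystem size.
run "btrfs inspect-internal dump-tree -t extent $DEV | grep -c BLOCK_GROUP_ITEM"
```

Removing the `run` wrapper executes the real pipeline; the resulting
count is the number of items mount must fetch before it can finish.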
Re: BTRFS Mount Delay Time Graph
Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
>
> On 03/12/2018 19:20, Wilson, Ellis wrote:
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS
>> filesystem as it grows. I finally had (made) time for this, and the
>> attached is the result of my testing. The image is a fairly
>> self-explanatory graph, and the raw data is also attached in
>> comma-delimited format for the more curious. The columns are:
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their
>> own directory. Files generated in parallel each epoch (25 at the same
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments
>> of 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount. "Real" time
>> used of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as
>> bad as I originally thought it would be (perhaps as a result of using
>> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am
>> open to comment if folks more intimately familiar with BTRFS think
>> this is due to the very large files I've used.

Probably yes.

The thing that is happening is that all block group items are read from
the extent tree. And, instead of being nicely grouped together, they are
scattered all over the place, at their virtual address, in between all
normal extent items.

So, mount time depends on the cold random read iops your storage can do,
on the size of the extent tree, and on the amount of block groups.

And, your extent tree has more items in it if you have more extents. So,
yes, writing a lot of 4kiB files should, I think, have a similar effect
as a lot of 128MiB files that are still stored in 1 extent per file.

>> I can redo the test with much more realistic data if people have
>> legitimate reason to think it will drastically change the result.
>
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files
> (I didn't check if snapshots matter, and I suspect they do, but not as
> much as the number of "original" files, at least if you don't heavily
> modify existing files but mostly create new ones as we do).
> As an example, we have a filesystem with 20TB used space, with 4
> subvolumes hosting multiple millions of files/directories (probably
> 10-20 millions total; I didn't check the exact number recently, as
> simply counting files is a very long process) and 40 snapshots for each
> volume. Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
>
> If you want to study this, you could:
> - graph the delay for various individual file sizes (instead of
> 25x10GB, create 2500 x 100MB and 250000 x 1MB files between each run
> and compare to the original result)
> - graph the delay vs the number of snapshots (probably starting with a
> large number of files in the initial subvolume, to start with a
> non-trivial mount delay)
> You may want to study the impact of the differences between snapshots
> by comparing snapshotting without modifications and snapshots made at
> various stages of your subvolume growth.
>
> Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests by half on
> one server in production though, although more tests are needed to
> isolate the cause).
>
> I didn't expect much for the mount times; it seems to me that mount is
> mostly constrained by the BTRFS on-disk structures needed at mount time
> and how the filesystem reads them (for example, it doesn't benefit at
> all from large IO queue depths, which probably means that each read
> depends on previous ones, which prevents io-schedulers from optimizing
> anything).

Yes, I think that's true.

See btrfs_read_block_groups in extent-tree.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent
tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), and then based on the information in it, computes
Re: BTRFS Mount Delay Time Graph
On 03/12/2018 20:56, Lionel Bouton wrote:
> [...]
> Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

> by half on one server in production though, although more tests are
> needed to isolate the cause).
Re: BTRFS Mount Delay Time Graph
Hi,

On 03/12/2018 19:20, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS
> filesystem as it grows. I finally had (made) time for this, and the
> attached is the result of my testing. The image is a fairly
> self-explanatory graph, and the raw data is also attached in
> comma-delimited format for the more curious. The columns are:
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their
> own directory. Files generated in parallel each epoch (25 at the same
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments
> of 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount. "Real" time used
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as
> bad as I originally thought it would be (perhaps as a result of using
> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open
> to comment if folks more intimately familiar with BTRFS think this is
> due to the very large files I've used. I can redo the test with much
> more realistic data if people have legitimate reason to think it will
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files
(I didn't check if snapshots matter, and I suspect they do, but not as
much as the number of "original" files, at least if you don't heavily
modify existing files but mostly create new ones as we do).

As an example, we have a filesystem with 20TB used space, with 4
subvolumes hosting multiple millions of files/directories (probably
10-20 millions total; I didn't check the exact number recently, as
simply counting files is a very long process) and 40 snapshots for each
volume. Mount takes about 15 minutes. We have virtual machines that we
don't reboot as often as we would like because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
  create 2500 x 100MB and 250000 x 1MB files between each run and
  compare to the original result)
- graph the delay vs the number of snapshots (probably starting with a
  large number of files in the initial subvolume, to start with a
  non-trivial mount delay)

You may want to study the impact of the differences between snapshots by
comparing snapshotting without modifications and snapshots made at
various stages of your subvolume growth.

Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the mount of IO requests by half on
one server in production though, although more tests are needed to
isolate the cause).
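The suggested datasets (many small files vs. fewer large ones, same
total size per epoch) could be generated with a small helper like the
one below; the directory names are only illustrative, and the full-size
invocations are left commented out because each would write 250GB:

```shell
# mkfiles COUNT SIZE_BYTES DIR -- create COUNT files of SIZE_BYTES each in DIR
mkfiles() {
  mkdir -p "$3"
  i=1
  while [ "$i" -le "$1" ]; do
    head -c "$2" /dev/zero > "$3/f$i"
    i=$((i + 1))
  done
}

# Full-size epochs matching the suggestion above (commented out: 250GB each):
# mkfiles 2500   104857600 /btrfs/epoch-large   # 2500 x 100MB
# mkfiles 250000   1048576 /btrfs/epoch-small   # 250000 x 1MB

# Tiny demonstration run instead:
mkfiles 3 1024 /tmp/mkfiles-demo
ls /tmp/mkfiles-demo
```

Comparing mount times after a large-file epoch and after a small-file
epoch of the same total size would isolate the extent-count effect
described in this thread.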
I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on previous ones, which prevents io-schedulers from optimizing
anything).

Best regards,

Lionel
Re: btrfs development - question about crypto api integration
On 30.11.18 17:22, Chris Mason wrote:
> On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
>
>> On 29.11.18 18:43, Jean Fobe wrote:
>>> Hi all,
>>> I've been studying LZ4 and other compression algorithms on the
>>> kernel, and seen other projects such as zram and ubifs using the
>>> crypto api. Is there a technical reason for not using the crypto api
>>> for compression (and possibly for encryption) in btrfs?
>>> I did not find any design/technical implementation choices in
>>> btrfs development in the developer's FAQ on the wiki. If I completely
>>> missed it, could someone point me in the right direction?
>>> Lastly, if there is no technical reason for this, would it be
>>> something interesting to have implemented?
>>
>> I personally think it would be better if btrfs exploited the generic
>> framework. And in fact when you look at zstd, btrfs does use the
>> generic, low-level ZSTD routines but not the crypto library wrappers.
>> If I were you, I'd try to convert zstd (since it's the most recently
>> added algorithm) to using the crypto layer to see if there are any
>> lurking problems.
>
> Back when I first added the zlib support, the zlib API was both easier
> to use and a better fit for our async worker threads. That doesn't mean
> we shouldn't switch, it's just how we got to step one ;)

And what about zstd? Why is it also using the low-level api and not the
crypto wrappers?

>
> -chris
>
Re: btrfs development - question about crypto api integration
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote: > On 29.11.18 г. 18:43 ч., Jean Fobe wrote: >> Hi all, >> I've been studying LZ4 and other compression algorithms on the >> kernel, and seen other projects such as zram and ubifs using the >> crypto api. Is there a technical reason for not using the crypto api >> for compression (and possibly for encryption) in btrfs? >> I did not find any design/technical implementation choices in >> btrfs development in the developer's FAQ on the wiki. If I completely >> missed it, could someone point me in the right direction? >> Lastly, if there is no technical reason for this, would it be >> something interesting to have implemented? > > I personally think it would be better if btrfs' exploited the generic > framework. And in fact when you look at zstd, btrfs does use the > generic, low-level ZSTD routines but not the crypto library wrappers. > If > I were I'd try and convert zstd (since it's the most recently added > algorithm) to using the crypto layer to see if there are any lurking > problems. Back when I first added the zlib support, the zlib API was both easier to use and a better fit for our async worker threads. That doesn't mean we shouldn't switch, it's just how we got to step one ;) -chris
Re: btrfs development - question about crypto api integration
On 29.11.18 г. 18:43 ч., Jean Fobe wrote: > Hi all, > I've been studying LZ4 and other compression algorithms on the > kernel, and seen other projects such as zram and ubifs using the > crypto api. Is there a technical reason for not using the crypto api > for compression (and possibly for encryption) in btrfs? > I did not find any design/technical implementation choices in > btrfs development in the developer's FAQ on the wiki. If I completely > missed it, could someone point me in the right direction? > Lastly, if there is no technical reason for this, would it be > something interesting to have implemented? I personally think it would be better if btrfs exploited the generic framework. And in fact, when you look at zstd, btrfs does use the generic, low-level ZSTD routines but not the crypto library wrappers. If I were you, I'd try to convert zstd (since it's the most recently added algorithm) to using the crypto layer to see if there are any lurking problems. > > Best regards >
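As a toy userspace analogy to the design question in this thread (this is not kernel code, and the registry below is hypothetical), the difference between calling one library's API directly and going through a generic framework that looks algorithms up by name, as the crypto API does, can be sketched like this:

```python
import zlib

# Direct use of one library's own API (roughly how btrfs calls the
# low-level zlib/zstd routines today):
def compress_direct(data: bytes) -> bytes:
    return zlib.compress(data, 3)

# Generic-framework style: algorithms are registered and looked up by
# name, which is loosely how the kernel crypto API serves its users.
# The registry below is purely illustrative.
_REGISTRY = {"deflate": (zlib.compress, zlib.decompress)}

def compress_generic(algo: str, data: bytes) -> bytes:
    comp, _ = _REGISTRY[algo]
    return comp(data)

def decompress_generic(algo: str, data: bytes) -> bytes:
    _, decomp = _REGISTRY[algo]
    return decomp(data)
```

The trade-off being discussed is essentially this: the generic layer buys pluggability (adding an algorithm means registering it), while direct calls gave btrfs a simpler fit for its async worker threads.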
Re: Btrfs and fanotify filesystem watches
On Fri 23-11-18 19:53:11, Amir Goldstein wrote: > On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein wrote: > > > So open_by_handle() should work fine even if we get mount_fd of /subvol1 > > > and handle for inode on /subvol2. mount_fd in open_by_handle() is really > > > only used to get the superblock and that is the same. > > > > I don't think it will work fine. > > do_handle_to_path() will compose a path from /subvol1 mnt and dentry from > > /subvol2. This may resolve to a full path that does not really exist, > > so the application > > cannot match it to anything it can use name_to_handle_at() to identify. > > > > The whole thing just sounds too messy. At the very least we need more time > > to communicate with btrfs developers and figure this out, so I am going to > > go with -EXDEV for any attempt to set *any* mark on a group with > > FAN_REPORT_FID if fsid of fanotify_mark() path argument is different > > from fsid of path->dentry->d_sb->s_root. > > > > We can relax that later if we figure out a better way. > > > > BTW, I am also going to go with -ENODEV for zero fsid (e.g. tmpfs). > > tmpfs can be easily fixed to have non zero fsid if filesystem watch on > > tmpfs is > > important. > > > > Well, this is interesting... I have implemented the -EXDEV logic and it > works as expected. I can set a filesystem global watch on the main > btrfs mount and am not allowed to set a global watch on a subvolume. > > The interesting part is that the global watch on the main btrfs volume is > more useful than I thought it would be. The file handles reported by the > main volume global watch are resolved to correct paths in the main volume. > I guess this is because a btrfs subvolume looks like a directory tree > in the global > namespace to vfs. See below. > > So I will continue based on this working POC: > https://github.com/amir73il/linux/commits/fanotify_fid > > Note that in the POC, fsid is cached in mark connector as you suggested. 
> It is still buggy, but my prototype always decodes file handles from the first > path argument given to the program, so it just goes to show that by getting > fsid of the main btrfs volume, the application will be able to properly decode > file handles and resolve correct paths. > > The bottom line is that btrfs will have the full functionality of super block > monitoring with no ability to watch (or ignore) a single subvolume > (unless by using a mount mark). Sounds good. I'll check the new version of your series. Honza -- Jan Kara SUSE Labs, CR
Re: Btrfs and fanotify filesystem watches
On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein wrote: > > On Fri, Nov 23, 2018 at 2:52 PM Jan Kara wrote: > > > > Changed subject to better match what we discuss and added btrfs list to CC. > > > > On Thu 22-11-18 17:18:25, Amir Goldstein wrote: > > > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara wrote: > > > > > > > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote: > > > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is > > > > > > > associated with > > > > > > > the single super block struct, so all dentries in all subvolumes > > > > > > > will > > > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid. > > > > > > > > > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see: > > > > > > > > > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ > > > > > > be32_to_cpu(fsid[2]); > > > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ > > > > > > be32_to_cpu(fsid[3]); > > > > > > /* Mask in the root object ID too, to disambiguate subvols > > > > > > */ > > > > > > buf->f_fsid.val[0] ^= > > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid > > > > > > >> 32; > > > > > > buf->f_fsid.val[1] ^= > > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid; > > > > > > > > > > > > So subvolume root is xored into the FSID. Thus dentries from > > > > > > different > > > > > > subvolumes are going to return different fsids... > > > > > > > > > > > > > > > > Right... how could I have missed that :-/ > > > > > > > > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that > > > > > and I saw how many flaws you pointed to in $SUBJECT patch. > > > > > Instead, I will use: > > > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,... > > > > > > > > So what about my proposal to store fsid in the notification mark when > > > > it is > > > > created and then use it when that mark results in event being generated? 
> > > > When mark is created, we have full path available, so getting proper > > > > fsid > > > > is trivial. Furthermore if the behavior is documented like that, it is > > > > fairly easy for userspace to track fsids it should care about and > > > > translate > > > > them to proper file descriptors for open_by_handle(). > > > > > > > > This would get rid of statfs() on every event creation (which I don't > > > > like > > > > much) and also avoids these problems "how to get fsid for inode". What > > > > do > > > > you think? > > > > > > > > > > That's interesting. I like the simplicity. > > > What happens when there are 2 btrfs subvols /subvol1 /subvol2 > > > and the application obviously doesn't know about this and does: > > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1); > > > statfs("/subvol1",...); > > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2); > > > statfs("/subvol2",...); > > > > > > Now the second fanotify_mark() call just updates the existing super block > > > mark mask, but does not change the fsid on the mark, so all events will > > > have > > > fsid of subvol1 that was stored when first creating the mark. > > > > Yes. > > > > > fanotify-watch application (for example) hashes the watches (paths) under > > > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to > > > "watch" (i.e. path). > > > > I agree this can be confusing... but with btrfs fanotify-watch will be > > confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM > > on /subvol1 (with fsid A) is going to return also events on inodes from > > /subvol2 (with fsid B). So your current code will return event with fsid B > > which fanotify-watch has no way to match back and can get confused. > > > > So currently application can get events with fsid it has never seen, with > > the code as I suggest it can get "wrong" fsid. That is confusing but still > > somewhat better? 
> > It's not better IMO because the purpose of the fid in event is a unique key > that application can use to match a path it already indexed. > wrong fsid => no match. > Using open_by_handle_at() to resolve unknown fid is a limited solution > and I don't think it is going to work in this case (see below). > > > > > The core of the problem is that btrfs does not have "the superblock > > identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events > > that we could use. > > > > > And when trying to open_by_handle fid with fhandle from /subvol2 > > > using mount_fd of /subvol1, I suppose we can either get ESTALE > > > or a disconnected dentry, because object from /subvol2 cannot > > > have a path inside /subvol1 > > > > So open_by_handle() should work fine even if we get mount_fd of /subvol1 > > and handle for inode on /subvol2. mount_fd in open_by_handle() is really > > only used to get the superblock and that is the same. > > I don't think it will work fine. > do_handle_to_path() will compose a path from /subvol1 mnt and dentry
Re: Btrfs and fanotify filesystem watches
On Fri, Nov 23, 2018 at 2:52 PM Jan Kara wrote: > > Changed subject to better match what we discuss and added btrfs list to CC. > > On Thu 22-11-18 17:18:25, Amir Goldstein wrote: > > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara wrote: > > > > > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote: > > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is > > > > > > associated with > > > > > > the single super block struct, so all dentries in all subvolumes > > > > > > will > > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid. > > > > > > > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see: > > > > > > > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ > > > > > be32_to_cpu(fsid[2]); > > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ > > > > > be32_to_cpu(fsid[3]); > > > > > /* Mask in the root object ID too, to disambiguate subvols */ > > > > > buf->f_fsid.val[0] ^= > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid >> > > > > > 32; > > > > > buf->f_fsid.val[1] ^= > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid; > > > > > > > > > > So subvolume root is xored into the FSID. Thus dentries from different > > > > > subvolumes are going to return different fsids... > > > > > > > > > > > > > Right... how could I have missed that :-/ > > > > > > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that > > > > and I saw how many flaws you pointed to in $SUBJECT patch. > > > > Instead, I will use: > > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,... > > > > > > So what about my proposal to store fsid in the notification mark when it > > > is > > > created and then use it when that mark results in event being generated? > > > When mark is created, we have full path available, so getting proper fsid > > > is trivial. 
Furthermore if the behavior is documented like that, it is > > > fairly easy for userspace to track fsids it should care about and > > > translate > > > them to proper file descriptors for open_by_handle(). > > > > > > This would get rid of statfs() on every event creation (which I don't like > > > much) and also avoids these problems "how to get fsid for inode". What do > > > you think? > > > > > > > That's interesting. I like the simplicity. > > What happens when there are 2 btrfs subvols /subvol1 /subvol2 > > and the application obviously doesn't know about this and does: > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1); > > statfs("/subvol1",...); > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2); > > statfs("/subvol2",...); > > > > Now the second fanotify_mark() call just updates the existing super block > > mark mask, but does not change the fsid on the mark, so all events will have > > fsid of subvol1 that was stored when first creating the mark. > > Yes. > > > fanotify-watch application (for example) hashes the watches (paths) under > > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to > > "watch" (i.e. path). > > I agree this can be confusing... but with btrfs fanotify-watch will be > confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM > on /subvol1 (with fsid A) is going to return also events on inodes from > /subvol2 (with fsid B). So your current code will return event with fsid B > which fanotify-watch has no way to match back and can get confused. > > So currently application can get events with fsid it has never seen, with > the code as I suggest it can get "wrong" fsid. That is confusing but still > somewhat better? It's not better IMO because the purpose of the fid in event is a unique key that application can use to match a path it already indexed. wrong fsid => no match. 
Using open_by_handle_at() to resolve unknown fid is a limited solution and I don't think it is going to work in this case (see below). > > The core of the problem is that btrfs does not have "the superblock > identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events > that we could use. > > > And when trying to open_by_handle fid with fhandle from /subvol2 > > using mount_fd of /subvol1, I suppose we can either get ESTALE > > or a disconnected dentry, because object from /subvol2 cannot > > have a path inside /subvol1 > > So open_by_handle() should work fine even if we get mount_fd of /subvol1 > and handle for inode on /subvol2. mount_fd in open_by_handle() is really > only used to get the superblock and that is the same. I don't think it will work fine. do_handle_to_path() will compose a path from /subvol1 mnt and dentry from /subvol2. This may resolve to a full path that does not really exist, so the application cannot match it to anything it can use name_to_handle_at() to identify. The whole thing just sounds too messy. At the very least we need more time to communicate with btrfs developers and figure this out, so I am going to go
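To make the fsid behavior concrete, here is a userspace sketch (not btrfs code) of the btrfs_statfs() computation quoted earlier in the thread, showing that two subvolumes of the same filesystem report different f_fsid values:

```python
import struct

def btrfs_f_fsid(fsid: bytes, root_objectid: int):
    """Mirror the quoted btrfs_statfs() logic: fold the 16-byte filesystem
    UUID into two u32 words, then XOR in the subvolume root objectid."""
    w = struct.unpack(">4I", fsid)  # four big-endian u32s, as be32_to_cpu()
    val0 = (w[0] ^ w[2]) ^ ((root_objectid >> 32) & 0xFFFFFFFF)
    val1 = (w[1] ^ w[3]) ^ (root_objectid & 0xFFFFFFFF)
    return val0, val1

uuid = bytes(range(16))           # stand-in filesystem UUID
top = btrfs_f_fsid(uuid, 5)       # objectid 5 = FS_TREE_OBJECTID (top level)
subvol = btrfs_f_fsid(uuid, 257)  # some other subvolume's root objectid
assert top != subvol              # same superblock, different f_fsid
```

This is exactly the mismatch under discussion: a superblock mark's cached fsid will match only the subvolume path that was used when the mark was created.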
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski wrote: > > On 2018-11-22 21:46, Nikolay Borisov wrote: > > >> # echo w > /proc/sysrq-trigger > >> > >> # dmesg -c > >> [ 931.585611] sysrq: SysRq : Show Blocked State > >> [ 931.585715] taskPC stack pid father > >> [ 931.590168] btrfs-cleaner D0 1340 2 0x8000 > >> [ 931.590175] Call Trace: > >> [ 931.590190] __schedule+0x29e/0x840 > >> [ 931.590195] schedule+0x2c/0x80 > >> [ 931.590199] schedule_timeout+0x258/0x360 > >> [ 931.590204] io_schedule_timeout+0x1e/0x50 > >> [ 931.590208] wait_for_completion_io+0xb7/0x140 > >> [ 931.590214] ? wake_up_q+0x80/0x80 > >> [ 931.590219] submit_bio_wait+0x61/0x90 > >> [ 931.590225] blkdev_issue_discard+0x7a/0xd0 > >> [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] > >> [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] > >> [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] > >> [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] > >> [ 931.590415] ? btrfs_block_rsv_check+0x25/0x70 [btrfs] > >> [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] > >> [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] > >> [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] > >> [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] > >> [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] > >> [ 931.590612] kthread+0x120/0x140 > >> [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] > >> [ 931.590658] ? kthread_bind+0x40/0x40 > >> [ 931.590661] ret_from_fork+0x22/0x40 > >> > > > > It seems your filesystem is mounted with the DSICARD option meaning > > every delete will result in discard this is highly suboptimal for > > ssd's. > > Try remounting the fs without the discard option see if it helps. > > Generally for discard you want to submit it in big batches (what fstrim > > does) so that the ftl on the ssd could apply any optimisations it might > > have up its sleeve. > > Spot on! 
> > Removed "discard" from fstab and added "ssd", rebooted - no more > btrfs-cleaner running. > > Do you know if the issue you described ("discard this is highly > suboptimal for ssd") affects other filesystems as well to a similar > extent? I.e. if using ext4 on ssd? Quite a lot of activity on ext4 and XFS consists of overwrites, so discard isn't needed. And it might be that discard is subject to delays there. On Btrfs, it's almost immediate, to the degree that on a couple SSDs I've tested, stale trees referenced exclusively by the most recent backup tree entries in the superblock are already zeros. That functionally means no automatic recoveries at mount time if there's a problem with any of the current trees. I was using it for about a year to no ill effect, BUT not a lot of file deletions either. I wouldn't recommend it, and instead suggest enabling the fstrim.timer, which by default runs fstrim.service once a week (which in turn issues fstrim, I think on all mounted volumes). I am a bit more concerned about the read errors you had that were being corrected automatically. The corruption suggests a firmware bug related to trim. I'd check the affected SSD firmware revision and consider updating it (only after a backup; it's plausible the firmware update is not guaranteed to be data safe). Does the volume use DUP or raid1 metadata? I'm not sure how it's correcting for these problems otherwise. -- Chris Murphy
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 2018/11/22 下午10:03, Roman Mamedov wrote: > On Thu, 22 Nov 2018 22:07:25 +0900 > Tomasz Chmielewski wrote: > >> Spot on! >> >> Removed "discard" from fstab and added "ssd", rebooted - no more >> btrfs-cleaner running. > > Recently there has been a bugfix for TRIM in Btrfs: > > btrfs: Ensure btrfs_trim_fs can trim the whole fs > https://patchwork.kernel.org/patch/10579539/ > > Perhaps your upgraded kernel is the first one to contain it, and for the first > time you're seeing TRIM to actually *work*, with the actual performance impact > of it on a large fragmented FS, instead of a few contiguous unallocated areas. > That only affects btrfs_trim_fs(), and you can see it's only called from ioctl interface, so it's definitely not the case. Thanks, Qu
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, 22 Nov 2018 22:07:25 +0900 Tomasz Chmielewski wrote: > Spot on! > > Removed "discard" from fstab and added "ssd", rebooted - no more > btrfs-cleaner running. Recently there has been a bugfix for TRIM in Btrfs: btrfs: Ensure btrfs_trim_fs can trim the whole fs https://patchwork.kernel.org/patch/10579539/ Perhaps your upgraded kernel is the first one to contain it, and for the first time you're seeing TRIM to actually *work*, with the actual performance impact of it on a large fragmented FS, instead of a few contiguous unallocated areas. -- With respect, Roman
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 2018-11-22 21:46, Nikolay Borisov wrote: # echo w > /proc/sysrq-trigger # dmesg -c [ 931.585611] sysrq: SysRq : Show Blocked State [ 931.585715] task PC stack pid father [ 931.590168] btrfs-cleaner D 0 1340 2 0x8000 [ 931.590175] Call Trace: [ 931.590190] __schedule+0x29e/0x840 [ 931.590195] schedule+0x2c/0x80 [ 931.590199] schedule_timeout+0x258/0x360 [ 931.590204] io_schedule_timeout+0x1e/0x50 [ 931.590208] wait_for_completion_io+0xb7/0x140 [ 931.590214] ? wake_up_q+0x80/0x80 [ 931.590219] submit_bio_wait+0x61/0x90 [ 931.590225] blkdev_issue_discard+0x7a/0xd0 [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] [ 931.590415] ? btrfs_block_rsv_check+0x25/0x70 [btrfs] [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] [ 931.590612] kthread+0x120/0x140 [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] [ 931.590658] ? kthread_bind+0x40/0x40 [ 931.590661] ret_from_fork+0x22/0x40 It seems your filesystem is mounted with the DSICARD option meaning every delete will result in discard this is highly suboptimal for ssd's. Try remounting the fs without the discard option see if it helps. Generally for discard you want to submit it in big batches (what fstrim does) so that the ftl on the ssd could apply any optimisations it might have up its sleeve. Spot on! Removed "discard" from fstab and added "ssd", rebooted - no more btrfs-cleaner running. Do you know if the issue you described ("discard this is highly suboptimal for ssd") affects other filesystems as well to a similar extent? I.e. if using ext4 on ssd? 
Would you finally care to share the smart data + the model and make of the ssd?

2x these:

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM1T9HCJM-5
Firmware Version: GXT1103Q
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device

1x this:

Device Model:     Micron_5200_MTFDDAK1T9TDC
Firmware Version: D1MU004
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches

But - seems the issue was the unneeded discard option, so not pasting unnecessary SMART data; thanks for finding this out.

Tomasz Chmielewski
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On 22.11.18 г. 14:31 ч., Tomasz Chmielewski wrote: > Yet another system upgraded to 4.19 and showing strange issues. > > btrfs-cleaner is showing as ~90-100% busy in iotop: > > TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND > 1340 be/4 root 0.00 B/s 0.00 B/s 0.00 % 92.88 % > [btrfs-cleaner] > > > Note disk read, disk write are 0.00 B/s. > > > iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0 > writes to them: > > Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s > %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util > sda 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 0.91 0.00 0.00 0.00 90.80 > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 100.00 > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 100.00 > > > > btrfs-cleaner persists 100% busy after reboots. The system is generally > not very responsive. > > > > # echo w > /proc/sysrq-trigger > > # dmesg -c > [ 931.585611] sysrq: SysRq : Show Blocked State > [ 931.585715] task PC stack pid father > [ 931.590168] btrfs-cleaner D 0 1340 2 0x8000 > [ 931.590175] Call Trace: > [ 931.590190] __schedule+0x29e/0x840 > [ 931.590195] schedule+0x2c/0x80 > [ 931.590199] schedule_timeout+0x258/0x360 > [ 931.590204] io_schedule_timeout+0x1e/0x50 > [ 931.590208] wait_for_completion_io+0xb7/0x140 > [ 931.590214] ? wake_up_q+0x80/0x80 > [ 931.590219] submit_bio_wait+0x61/0x90 > [ 931.590225] blkdev_issue_discard+0x7a/0xd0 > [ 931.590266] btrfs_issue_discard+0x123/0x160 [btrfs] > [ 931.590299] btrfs_discard_extent+0xd8/0x160 [btrfs] > [ 931.590335] btrfs_finish_extent_commit+0xe2/0x240 [btrfs] > [ 931.590382] btrfs_commit_transaction+0x573/0x840 [btrfs] > [ 931.590415] ? 
btrfs_block_rsv_check+0x25/0x70 [btrfs] > [ 931.590456] __btrfs_end_transaction+0x2be/0x2d0 [btrfs] > [ 931.590493] btrfs_end_transaction_throttle+0x13/0x20 [btrfs] > [ 931.590530] btrfs_drop_snapshot+0x489/0x800 [btrfs] > [ 931.590567] btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs] > [ 931.590607] cleaner_kthread+0x136/0x160 [btrfs] > [ 931.590612] kthread+0x120/0x140 > [ 931.590646] ? btree_submit_bio_start+0x20/0x20 [btrfs] > [ 931.590658] ? kthread_bind+0x40/0x40 > [ 931.590661] ret_from_fork+0x22/0x40 > It seems your filesystem is mounted with the DISCARD option, meaning every delete will result in a discard; this is highly suboptimal for SSDs. Try remounting the fs without the discard option and see if it helps. Generally, for discard you want to submit it in big batches (what fstrim does), so that the FTL on the SSD can apply any optimisations it might have up its sleeve. Would you finally care to share the smart data + the model and make of the ssd? > > > > After rebooting to 4.19.3, I've started seeing read errors. There were > no errors with a previous kernel (4.17.14); disks are healthy according > to SMART; no errors reported when we read the whole surface with dd. 
> > # grep -i btrfs.*corrected /var/log/syslog|wc -l > 156 > > Things like: > > Nov 22 12:17:43 lxd05 kernel: [ 711.538836] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector > 101088384) > Nov 22 12:17:43 lxd05 kernel: [ 711.538905] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector > 101088392) > Nov 22 12:17:43 lxd05 kernel: [ 711.538958] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector > 101088400) > Nov 22 12:17:43 lxd05 kernel: [ 711.539006] BTRFS info (device sda2): > read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector > 101088408) > > > Yet - 0 errors, according to stats, not sure if it's expected or not: Since btrfs was able to fix the read failure on the fly it doesn't record the error. IMO you have problems with your storage below the filesystem level. > > # btrfs device stats /data/lxd > [/dev/sda2].write_io_errs 0 > [/dev/sda2].read_io_errs 0 > [/dev/sda2].flush_io_errs 0 > [/dev/sda2].corruption_errs 0 > [/dev/sda2].generation_errs 0 > [/dev/sdb2].write_io_errs 0 > [/dev/sdb2].read_io_errs 0 > [/dev/sdb2].flush_io_errs 0 > [/dev/sdb2].corruption_errs 0 > [/dev/sdb2].generation_errs 0 > [/dev/sdc2].write_io_errs 0 > [/dev/sdc2].read_io_errs 0 > [/dev/sdc2].flush_io_errs 0 > [/dev/sdc2].corruption_errs 0 > [/dev/sdc2].generation_errs 0 > >
Re: btrfs convert problem
On 2018/11/17 下午5:49, Serhat Sevki Dincer wrote: > Hi, > > On my second attempt to convert my 698 GiB usb HDD from ext4 to btrfs > with btrfs-progs 4.19 from Manjaro (kernel 4.14.80): > > I identified bad files with > $ find . -type f -exec cat {} > /dev/null \; > This revealed 6 corrupted files, I deleted them. Tried it again with > no error message. > > Then I checked HDD with > $ sudo fsck.ext4 -f /dev/sdb1 > e2fsck 1.44.4 (18-Aug-2018) > Pass 1: Checking inodes, blocks, and sizes > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information > HDD: 763513/45744128 files (0.3% non-contiguous), 131674747/182970301 blocks > It is mountable & usable. > > Now my attempt to convert > $ LC_ALL=en_US.utf8 sudo strace -f -s 10 -a 4 -o convert-strace.log > btrfs-convert /dev/sdb1 > create btrfs filesystem: > blocksize: 4096 > nodesize: 16384 > features: extref, skinny-metadata (default) > creating ext2 image file > Unable to find block group for 0 > Unable to find block group for 0 > Unable to find block group for 0 It's ENOSPC. Since it won't damage your original ext* fs, you could try to check whether there is enough *contiguous* free space for btrfs to use. Please keep in mind that, due to the nature of the ext* data layout, it may contain a lot of small free-space fragments, and in that case btrfs-convert may not be able to make use of them. Or you could try to disable csum so that btrfs takes less space and may have a chance to convert. 
Thanks, Qu > ctree.c:2244: split_leaf: BUG_ON `1` triggered, value 1 > btrfs-convert(+0x162d6)[0x561fbd26b2d6] > btrfs-convert(btrfs_search_slot+0xf21)[0x561fbd26c881] > btrfs-convert(btrfs_csum_file_block+0x499)[0x561fbd27e8e9] > btrfs-convert(+0xe6f5)[0x561fbd2636f5] > btrfs-convert(main+0x194f)[0x561fbd26296f] > /usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f658a93a223] > btrfs-convert(_start+0x2e)[0x561fbd26328e] > Aborted > > It crashed :) The log is 2.4G, I compressed it and put it at > https://drive.google.com/drive/folders/0B5oVWFBM47D9aVpJY2s4UTdMdUU > > What else can I do to debug further? > Thanks..
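Qu's point about contiguous versus merely free space can be illustrated with a toy free-block bitmap (purely illustrative; this is not how btrfs-convert actually scans the ext* layout):

```python
def largest_free_run(bitmap):
    """Length of the longest run of consecutive free blocks (True = free)."""
    best = cur = 0
    for free in bitmap:
        cur = cur + 1 if free else 0
        best = max(best, cur)
    return best

# 50% free either way, but only the second layout has room for a large chunk:
fragmented = [True, False] * 8         # free space scattered one block at a time
contiguous = [False] * 8 + [True] * 8  # same amount of free space, in one run
assert largest_free_run(fragmented) == 1
assert largest_free_run(contiguous) == 8
```

So an ext* filesystem can report plenty of free blocks while still having no single run large enough for the chunks btrfs-convert needs to place, which is consistent with the ENOSPC-style "Unable to find block group" failure above.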
Re: btrfs convert problem
Hi, On my second attempt to convert my 698 GiB usb HDD from ext4 to btrfs with btrfs-progs 4.19 from Manjaro (kernel 4.14.80): I identified bad files with $ find . -type f -exec cat {} > /dev/null \; This revealed 6 corrupted files, I deleted them. Tried it again with no error message. Then I checked HDD with $ sudo fsck.ext4 -f /dev/sdb1 e2fsck 1.44.4 (18-Aug-2018) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information HDD: 763513/45744128 files (0.3% non-contiguous), 131674747/182970301 blocks It is mountable & usable. Now my attempt to convert $ LC_ALL=en_US.utf8 sudo strace -f -s 10 -a 4 -o convert-strace.log btrfs-convert /dev/sdb1 create btrfs filesystem: blocksize: 4096 nodesize: 16384 features: extref, skinny-metadata (default) creating ext2 image file Unable to find block group for 0 Unable to find block group for 0 Unable to find block group for 0 ctree.c:2244: split_leaf: BUG_ON `1` triggered, value 1 btrfs-convert(+0x162d6)[0x561fbd26b2d6] btrfs-convert(btrfs_search_slot+0xf21)[0x561fbd26c881] btrfs-convert(btrfs_csum_file_block+0x499)[0x561fbd27e8e9] btrfs-convert(+0xe6f5)[0x561fbd2636f5] btrfs-convert(main+0x194f)[0x561fbd26296f] /usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f658a93a223] btrfs-convert(_start+0x2e)[0x561fbd26328e] Aborted It crashed :) The log is 2.4G, I compressed it and put it at https://drive.google.com/drive/folders/0B5oVWFBM47D9aVpJY2s4UTdMdUU What else can I do to debug further? Thanks..
Re: BTRFS on production: NVR 16+ IP Cameras
On Thu, Nov 15, 2018 at 10:39 AM Juan Alberto Cirez wrote: > > Is BTRFS mature enough to be deployed on a production system to underpin > the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)? > > Based on our limited experience with BTRFS (1+ year) under the above > scenario the answer seems to be no; but I wanted you ask the community > at large for their experience before making a final decision to hold off > on deploying BTRFS on production systems. > > Let us be clear: We think BTRFS has great potential, and as it matures > we will continue to watch its progress, so that at some future point we > can return to using it. > > The issue has been the myriad of problems we have encountered when > deploying BTRFS as the storage fs for the NVR/VMS in cases were the > camera count exceeds 10: Corrupted file systems, sudden read-only file > system, re-balance kernel panics, broken partitions, etc. Performance problems are separate from reliability problems. No matter what, there shouldn't be corruptions or failures when your process is writing through the Btrfs kernel driver. Period. So you've either got significant hardware/firmware problems as the root cause, or your use case is exposing Btrfs bugs. But you're burdened with providing sufficient details about the hardware and storage stack configuration including kernel, btrfs-progs versions and mkfs options and mount options being used. Without a way for a developer to reproduce your problem it's unlikely the source of the problem can be discovered and fixed. > So, again, the question is: is BTRFS mature enough to be used in such > use case and if so, what approach can be used to mitigate such issues. What format are the cameras writing out in? It matters if this is a continuous appending format, or if it's writing them out as individual JPEG files, one per frame, or whatever. What rate, what size, and any other concurrent operations, etc. -- Chris Murphy
Re: BTRFS on production: NVR 16+ IP Cameras
On 11/16/2018 03:51 AM, Nikolay Borisov wrote: On 15.11.18 г. 20:39 ч., Juan Alberto Cirez wrote: Is BTRFS mature enough to be deployed on a production system to underpin the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)? Based on our limited experience with BTRFS (1+ year) under the above scenario the answer seems to be no; but I wanted you ask the community at large for their experience before making a final decision to hold off on deploying BTRFS on production systems. Let us be clear: We think BTRFS has great potential, and as it matures we will continue to watch its progress, so that at some future point we can return to using it. The issue has been the myriad of problems we have encountered when deploying BTRFS as the storage fs for the NVR/VMS in cases were the camera count exceeds 10: Corrupted file systems, sudden read-only file system, re-balance kernel panics, broken partitions, etc. What version of the file system, how many of those issues (if any) have you reported to the upstream mailing list? I don't remember seeing any reports from you. One specific case saw the btrfs drive pool mounted under the /var partition so that upon installation the btrfs pool contained all files under /var; /lib/mysql as well as the video storage. Needless to say this was a catastrophe... BTRFS is not suitable for random workloads, such as databases for example. In those cases it's recommended, at the very least, to disable COW on the database files. Note, disabling CoW doesn't preclude you from creating snapshots, it just means that not every 4k write (for example) will result in a CoW operation. At the other end of the spectrum is a use case where ONLY the video storage was on the btrfs pool; but in that case, the btrfs pool became read-only suddenly and would not re-mount as rw despite all the recovery trick btrfs-tools could throw at it. This, of course prevented the NVR/VMS from recording any footage. 
> Again, you give absolutely no information about how you have
> configured the filesystem or versions of the software used.

> > So, again, the question is: is BTRFS mature enough to be used in such
> > use case and if so, what approach can be used to mitigate such issues.

Also: what blocksize does the application use when writing into btrfs, and what type of write IO is it (async, direct IO, or fsync)?

Thanks, Anand

> > Thank you all for your assistance
> >
> > -Ryan-
Re: BTRFS on production: NVR 16+ IP Cameras
On 2018-11-15 13:39, Juan Alberto Cirez wrote:
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

For NVR, I'd say no. BTRFS handles append-only workloads pretty badly, even WORM-style ones. It also does a really bad job with most relational database systems that you would likely use for indexing. If you can share your reasoning for wanting to use BTRFS, though, I can probably point you at alternatives that would work more reliably for your use case.
Re: BTRFS on production: NVR 16+ IP Cameras
On Thu, 15 Nov 2018 11:39:58 -0700 Juan Alberto Cirez wrote:
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?

What are you looking to gain from using Btrfs on an NVR system? It doesn't sound like any of its prime-time features -- such as snapshots, checksumming, compression or reflink copying -- are a must for a bulk video recording scenario. And even if you don't need or use those features, you still pay the price for having them; Btrfs can be 2x to 10x slower than its simpler competitors: https://www.phoronix.com/scan.php?page=article&item=linux418-nvme-raid&num=2

If you just meant to use its multi-device support instead of a separate RAID layer, IMO that can't be considered prime-time or depended upon in production (yes, not even RAID 1 or 10).

-- With respect, Roman
Re: BTRFS on production: NVR 16+ IP Cameras
On 15.11.18 г. 20:39 ч., Juan Alberto Cirez wrote: > Is BTRFS mature enough to be deployed on a production system to underpin > the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)? > > Based on our limited experience with BTRFS (1+ year) under the above > scenario the answer seems to be no; but I wanted you ask the community > at large for their experience before making a final decision to hold off > on deploying BTRFS on production systems. > > Let us be clear: We think BTRFS has great potential, and as it matures > we will continue to watch its progress, so that at some future point we > can return to using it. > > The issue has been the myriad of problems we have encountered when > deploying BTRFS as the storage fs for the NVR/VMS in cases were the > camera count exceeds 10: Corrupted file systems, sudden read-only file > system, re-balance kernel panics, broken partitions, etc. What version of the file system, how many of those issues (if any) have you reported to the upstream mailing list? I don't remember seeing any reports from you. > > One specific case saw the btrfs drive pool mounted under the /var > partition so that upon installation the btrfs pool contained all files > under /var; /lib/mysql as well as the video storage. Needless to say > this was a catastrophe... BTRFS is not suitable for random workloads, such as databases for example. In those cases it's recommended, at the very least, to disable COW on the database files. Note, disabling CoW doesn't preclude you from creating snapshots, it just means that not every 4k write (for example) will result in a CoW operation. > > At the other end of the spectrum is a use case where ONLY the video > storage was on the btrfs pool; but in that case, the btrfs pool became > read-only suddenly and would not re-mount as rw despite all the recovery > trick btrfs-tools could throw at it. This, of course prevented the > NVR/VMS from recording any footage. 
Again, you give absolutely no information about how you have configured the filesystem or versions of the software used. > > So, again, the question is: is BTRFS mature enough to be used in such > use case and if so, what approach can be used to mitigate such issues. > > > Thank you all for your assistance > > -Ryan-
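The "disable COW on the database files" advice above can be sketched as follows. This is a minimal setup sketch, not from the thread: the directory path is hypothetical, `chattr +C` works only on btrfs, and the flag must be set while the directory (or file) is still empty, after which newly created files inherit it.

```shell
# Sketch: create a NOCOW directory for database files before the
# database populates it; files created inside will inherit +C.
DBDIR=${DBDIR:-./mysql-data}    # hypothetical; e.g. /var/lib/mysql in practice
mkdir -p "$DBDIR"
if chattr +C "$DBDIR" 2>/dev/null; then
    lsattr -d "$DBDIR"          # the 'C' (No_COW) flag should now be listed
else
    echo "chattr +C failed; $DBDIR is probably not on btrfs" >&2
fi
```

Alternatively, mounting the whole filesystem with `-o nodatacow` disables CoW for all data, at the cost of also losing data checksumming and compression.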
Re: btrfs partition is broken, cannot restore anything
Hi Qu,

I created an issue for this feature request, so that it is not lost:
https://github.com/kdave/btrfs-progs/issues/153

Thanks for the help again! Now unsubscribing from this list.

Regards,
Attila

On Tue, Nov 6, 2018 at 9:23 AM Qu Wenruo wrote:
> On 2018/11/6 下午4:17, Attila Vangel wrote:
[snip]
> > So I propose that "btrfs restore" should always print a summary line at
> > the end to make it more user friendly (e.g. for first-time users).
> > Something like this:
> >
> > x files restored in y directories[, total size of files is z <human-readable
> > unit, e.g. GB>]. Use the -v command line option for more information.
> >
> > The part in square brackets is optional / nice to have; it might be useful
> > e.g. to estimate the required free space in the target filesystem.
>
> Nice idea. Maybe we could enhance btrfs-restore in that direction.
>
> Thanks,
> Qu
[snip]
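Until such a summary line exists in btrfs-progs, the proposed count can be scraped from the existing verbose output. A sketch, assuming (hypothetically) that each restored file appears on a line beginning with "Restoring <path>"; adjust the pattern to your btrfs-progs version:

```shell
# Summarize `btrfs restore -v` output fed on stdin: count restored
# files and the distinct directories they land in.
summarize() {
    awk '/^Restoring /{
             files++
             sub(/^Restoring /, "")       # keep only the path
             sub(/\/[^\/]*$/, "")         # strip the filename -> its directory
             dirs[$0]
         }
         END {
             n = 0; for (d in dirs) n++
             printf "%d files restored in %d directories.\n", files + 0, n
         }'
}

printf 'Restoring /mnt/out/a/1.txt\nRestoring /mnt/out/a/2.txt\nRestoring /mnt/out/b/3.txt\n' | summarize
# -> 3 files restored in 2 directories.
```

In real use it would be `btrfs restore -v -D "$DEV" "$OUT" | summarize`, which also gives the dry-run size estimate the proposal mentions.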
Re: btrfs partition is broken, cannot restore anything
On 2018/11/6 下午4:17, Attila Vangel wrote:
> Hi Qu,
>
> Thanks again for the help.
> In my case:
>
> $ sudo btrfs check /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 18811453440 found E4E3BDB6 wanted
> checksum verify failed on 18811453440 found E4E3BDB6 wanted
> bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
> ERROR: cannot open file system

As expected: since the extent tree is corrupted, btrfs check can't help much.

This is definitely something we could enhance, although it will take a long time for btrfs-progs to handle such a case, then some super hard kernel mount option for the kernel to do a pure RO mount.

> (same with --mode=lowmem).
>
> However "btrfs restore" seems to work! (did not have time for the
> actual run yet)
>
> What confused me about "btrfs restore" is that when I ran a dry run, it
> just output about a dozen errors about some files, and I thought
> that was it, no files could be restored.
> Today I tried it with other options; especially -v showed me that the
> majority of the files can be restored. I consider myself an
> intermediate GNU plus Linux user, yet given the stressful situation
> and being a first-time user (btrfs n00b) I missed experimenting with
> the -v option.
>
> So I propose that "btrfs restore" should always print a summary line at
> the end to make it more user friendly (e.g. for first-time users).
> Something like this:
>
> x files restored in y directories[, total size of files is z <human-readable
> unit, e.g. GB>]. Use the -v command line option for more information.
>
> The part in square brackets is optional / nice to have; it might be useful
> e.g. to estimate the required free space in the target filesystem.

Nice idea. Maybe we could enhance btrfs-restore in that direction.

Thanks,
Qu

> Cheers,
> Attila
>
> On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo wrote:
>>
>> On 2018/11/6 上午2:01, Attila Vangel wrote:
>>> Hi,
>>>
>>> TL;DR: I want to save data from my unmountable btrfs partition.
>>> I saw some commands in another thread "Salvage files from broken btrfs". >>> I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO, >>> btrfs-progs 4.17.1-1) to execute these commands. >>> >>> $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt >>> mount: /mnt: wrong fs type, bad option, bad superblock on >>> /dev/nvme0n1p2, missing codepage or helper program, or other error. >>> >>> Corresponding lines from dmesg: >>> >>> [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount >>> time >>> [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled >>> [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents >>> [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start, >>> want 18811453440 have 0 >>> [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: >>> -5 >>> [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed >> >> Extent tree corrupted. >> >> If it's the only problem, btrfs-restore should be able to salvage data. >> >>> >>> $ sudo btrfs-find-root /dev/nvme0n1p2 >> >> No, that's not what you need. >> >>> Superblock thinks the generation is 220524 >>> Superblock thinks the level is 1 >>> Found tree root at 25018368 gen 220524 level 1 >>> Well block 4243456(gen: 220520 level: 1) seems good, but >>> generation/level doesn't match, want gen: 220524 level: 1 >>> Well block 5259264(gen: 220519 level: 1) seems good, but >>> generation/level doesn't match, want gen: 220524 level: 1 >>> Well block 4866048(gen: 220518 level: 0) seems good, but >>> generation/level doesn't match, want gen: 220524 level: 1 >>> >>> $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2 >>> superblock: bytenr=65536, device=/dev/nvme0n1p2 >> [snip] >>> >>> If I understood correctly, somehow it is possible to use this data to >>> parametrize btrfs restore to save the files from the partition. >> >> None of the output is really helpful. 
>> >> In your case, your extent tree is corrupted, thus kernel will refuse to >> mount (even RO). >> >> You should run "btrfs check" on the fs to see if btrfs can check fs tree. >> If not, then go directly to "btrfs restore". >> >> Thanks, >> Qu >> >>> Could you please help how to do it in this case? I am not familiar >>> with these technical terms in the outputs. >>> Thanks in advance! >>> >>> Cheers, >>> Attila >>> >>> On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel >>> wrote: Hi, Somehow my btrfs partition got broken. I use Arch, so my kernel is quite new (4.18.x). I don't remember exactly the sequence of events. At some point it was accessible in read-only, but unfortunately I did not take backup immediately. dmesg log from that time: [ 62.602388] BTRFS warning (device nvme0n1p2): block group 103923318784 has wrong amount of free space [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free space cache for block group 103923318784,
Re: btrfs partition is broken, cannot restore anything
Hi Qu,

Thanks again for the help. In my case:

$ sudo btrfs check /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 18811453440 found E4E3BDB6 wanted
checksum verify failed on 18811453440 found E4E3BDB6 wanted
bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
ERROR: cannot open file system

(same with --mode=lowmem).

However "btrfs restore" seems to work! (did not have time for the actual run yet)

What confused me about "btrfs restore" is that when I ran a dry run, it just output about a dozen errors about some files, and I thought that was it, no files could be restored. Today I tried it with other options; especially -v showed me that the majority of the files can be restored. I consider myself an intermediate GNU plus Linux user, yet given the stressful situation and being a first-time user (btrfs n00b) I missed experimenting with the -v option.

So I propose that "btrfs restore" should always print a summary line at the end to make it more user friendly (e.g. for first-time users). Something like this:

x files restored in y directories[, total size of files is z <human-readable unit, e.g. GB>]. Use the -v command line option for more information.

The part in square brackets is optional / nice to have; it might be useful e.g. to estimate the required free space in the target filesystem.

Cheers,
Attila

On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo wrote:
>
> On 2018/11/6 上午2:01, Attila Vangel wrote:
> > Hi,
> >
> > TL;DR: I want to save data from my unmountable btrfs partition.
> > I saw some commands in another thread "Salvage files from broken btrfs".
> > I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
> > btrfs-progs 4.17.1-1) to execute these commands.
> >
> > $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
> > mount: /mnt: wrong fs type, bad option, bad superblock on
> > /dev/nvme0n1p2, missing codepage or helper program, or other error.
> > > > Corresponding lines from dmesg: > > > > [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount > > time > > [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled > > [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents > > [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start, > > want 18811453440 have 0 > > [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: > > -5 > > [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed > > Extent tree corrupted. > > If it's the only problem, btrfs-restore should be able to salvage data. > > > > > $ sudo btrfs-find-root /dev/nvme0n1p2 > > No, that's not what you need. > > > Superblock thinks the generation is 220524 > > Superblock thinks the level is 1 > > Found tree root at 25018368 gen 220524 level 1 > > Well block 4243456(gen: 220520 level: 1) seems good, but > > generation/level doesn't match, want gen: 220524 level: 1 > > Well block 5259264(gen: 220519 level: 1) seems good, but > > generation/level doesn't match, want gen: 220524 level: 1 > > Well block 4866048(gen: 220518 level: 0) seems good, but > > generation/level doesn't match, want gen: 220524 level: 1 > > > > $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2 > > superblock: bytenr=65536, device=/dev/nvme0n1p2 > [snip] > > > > If I understood correctly, somehow it is possible to use this data to > > parametrize btrfs restore to save the files from the partition. > > None of the output is really helpful. > > In your case, your extent tree is corrupted, thus kernel will refuse to > mount (even RO). > > You should run "btrfs check" on the fs to see if btrfs can check fs tree. > If not, then go directly to "btrfs restore". > > Thanks, > Qu > > > Could you please help how to do it in this case? I am not familiar > > with these technical terms in the outputs. > > Thanks in advance! 
> > > > Cheers, > > Attila > > > > On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel > > wrote: > >> > >> Hi, > >> > >> Somehow my btrfs partition got broken. I use Arch, so my kernel is > >> quite new (4.18.x). > >> I don't remember exactly the sequence of events. At some point it was > >> accessible in read-only, but unfortunately I did not take backup > >> immediately. dmesg log from that time: > >> > >> [ 62.602388] BTRFS warning (device nvme0n1p2): block group > >> 103923318784 has wrong amount of free space > >> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free > >> space cache for block group 103923318784, rebuilding it now > >> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 > >> 18812026880 > >> [ 108.039227] BTRFS: error (device nvme0n1p2) in > >> __btrfs_free_extent:7010: errno=-5 IO failure > >> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly > >> [ 108.039250] BTRFS: error (device nvme0n1p2) in > >> btrfs_run_delayed_refs:3076: errno=-5 IO failure > >> > >> At the next reboot it failed to mount. Problem may have been that at > >> some point I booted to another distro with older kernel
Re: btrfs partition is broken, cannot restore anything
On 2018/11/6 上午2:01, Attila Vangel wrote: > Hi, > > TL;DR: I want to save data from my unmountable btrfs partition. > I saw some commands in another thread "Salvage files from broken btrfs". > I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO, > btrfs-progs 4.17.1-1) to execute these commands. > > $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt > mount: /mnt: wrong fs type, bad option, bad superblock on > /dev/nvme0n1p2, missing codepage or helper program, or other error. > > Corresponding lines from dmesg: > > [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount > time > [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled > [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents > [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start, > want 18811453440 have 0 > [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5 > [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed Extent tree corrupted. If it's the only problem, btrfs-restore should be able to salvage data. > > $ sudo btrfs-find-root /dev/nvme0n1p2 No, that's not what you need. > Superblock thinks the generation is 220524 > Superblock thinks the level is 1 > Found tree root at 25018368 gen 220524 level 1 > Well block 4243456(gen: 220520 level: 1) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > Well block 5259264(gen: 220519 level: 1) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > Well block 4866048(gen: 220518 level: 0) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > > $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2 > superblock: bytenr=65536, device=/dev/nvme0n1p2 [snip] > > If I understood correctly, somehow it is possible to use this data to > parametrize btrfs restore to save the files from the partition. None of the output is really helpful. 
In your case, your extent tree is corrupted, thus kernel will refuse to mount (even RO). You should run "btrfs check" on the fs to see if btrfs can check fs tree. If not, then go directly to "btrfs restore". Thanks, Qu > Could you please help how to do it in this case? I am not familiar > with these technical terms in the outputs. > Thanks in advance! > > Cheers, > Attila > > On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel wrote: >> >> Hi, >> >> Somehow my btrfs partition got broken. I use Arch, so my kernel is >> quite new (4.18.x). >> I don't remember exactly the sequence of events. At some point it was >> accessible in read-only, but unfortunately I did not take backup >> immediately. dmesg log from that time: >> >> [ 62.602388] BTRFS warning (device nvme0n1p2): block group >> 103923318784 has wrong amount of free space >> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free >> space cache for block group 103923318784, rebuilding it now >> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 >> 18812026880 >> [ 108.039227] BTRFS: error (device nvme0n1p2) in >> __btrfs_free_extent:7010: errno=-5 IO failure >> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly >> [ 108.039250] BTRFS: error (device nvme0n1p2) in >> btrfs_run_delayed_refs:3076: errno=-5 IO failure >> >> At the next reboot it failed to mount. Problem may have been that at >> some point I booted to another distro with older kernel (4.15.x, >> 4.14.52) and unfortunately attempted some checks/repairs (?) e.g. from >> gparted, and at that time I did not know it could be destructive. >> >> Anyway, currently it fails to mount (even with ro and/or recovery), >> btrfs check results in "checksum verify failed" and "bad tree block" >> errors, btrfs restore resulted in "We have looped trying to restore >> files in" errors for a dozen of paths then exit. >> >> Is there some hope to save data from the filesystem, and if so, how? 
>>
>> BTW I checked some diagnostics commands regarding my SSD with the nvme
>> client and from that it seems there are no hardware problems.
>>
>> Your help is highly appreciated.
>>
>> Cheers,
>> Attila
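The sequence Qu recommends (read-only check first, then fall back to restore) can be sketched as concrete commands. This is a hedged sketch, not from the thread: the device is the one from this report, the target mount point is hypothetical and must be a separate, healthy filesystem with enough free space, and the `|| true` guards only exist so the sequence keeps going when a step reports errors.

```shell
DEV=${DEV:-/dev/nvme0n1p2}   # the broken device (example from this thread)
OUT=${OUT:-/mnt/rescue}      # hypothetical target on a separate healthy fs

if command -v btrfs >/dev/null 2>&1; then
    btrfs check "$DEV" || true                 # read-only diagnosis; do NOT jump to --repair
    btrfs restore -v -D "$DEV" "$OUT" || true  # -D: dry run, lists what would be restored
    # btrfs restore -v "$DEV" "$OUT"           # the actual salvage run
else
    echo "btrfs-progs not installed" >&2
fi
```

Running the `-D` dry run first shows how much is recoverable before committing space on the target.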
Re: BTRFS did it's job nicely (thanks!)
On Mon, Nov 5, 2018 at 6:27 AM, Austin S. Hemmelgarn wrote: > On 11/4/2018 11:44 AM, waxhead wrote: >> >> Sterling Windmill wrote: >>> >>> Out of curiosity, what led to you choosing RAID1 for data but RAID10 >>> for metadata? >>> >>> I've flip flipped between these two modes myself after finding out >>> that BTRFS RAID10 doesn't work how I would've expected. >>> >>> Wondering what made you choose your configuration. >>> >>> Thanks! >>> Sure, >> >> >> The "RAID"1 profile for data was chosen to maximize disk space utilization >> since I got a lot of mixed size devices. >> >> The "RAID"10 profile for metadata was chosen simply because it *feels* a >> bit faster for some of my (previous) workload which was reading a lot of >> small files (which I guess was embedded in the metadata). While I never >> remembered that I got any measurable performance increase the system simply >> felt smoother (which is strange since "RAID"10 should hog more disks at >> once). >> >> I would love to try "RAID"10 for both data and metadata, but I have to >> delete some files first (or add yet another drive). >> >> Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 >> does not work as you expected? >> >> As far as I know BTRFS' version of "RAID"10 means it ensure 2 copies (1 >> replica) is striped over as many disks it can (as long as there is free >> space). >> >> So if I am not terribly mistaking a "RAID"10 with 20 devices will stripe >> over (20/2) x 2 and if you run out of space on 10 of the devices it will >> continue to stripe over (5/2) x 2. So your stripe width vary with the >> available space essentially... I may be terribly wrong about this (until >> someones corrects me that is...) > > He's probably referring to the fact that instead of there being a roughly > 50% chance of it surviving the failure of at least 2 devices like classical > RAID10 is technically able to do, it's currently functionally 100% certain > it won't survive more than one device failing. 
Right. Classic RAID10 makes *two block device* copies: you have mirror1 drives and mirror2 drives, and each mirror pair becomes a single virtual block device that is then striped across. If you lose a single mirror1 drive, its mirror2 data is available and statistically unlikely to also go away.

Whereas with Btrfs raid10, it's *two block group* copies, and it is the block group that's striped. That means block group copy 1 is striped across half the available drives (at the time the bg is allocated), and block group copy 2 is striped across the other half. When a drive dies, there is no single remaining drive that contains all the missing copies; they're distributed. Which means that in a 2-drive failure you've got a very good chance of losing both copies of some metadata or data or both.

While I'm not certain it's 100% unsurvivable, the real gotcha is that it's possible, maybe even likely, that it'll mount and seem to work fine, but as soon as it runs into a block group with both copies missing, it'll face-plant.

-- Chris Murphy
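The difference can be made concrete with a little combinatorics. This sketch is an illustration of the argument above, not from the thread: classic RAID10 with N mirror pairs dies on a double failure only when both failed drives are the same pair, i.e. N fatal combinations out of C(2N, 2) possible two-drive failures.

```shell
# Two-failure survival odds for classic RAID10 with N mirror pairs.
classic_raid10_survival() {
    awk -v pairs="$1" 'BEGIN { d = 2 * pairs; print 1 - pairs / (d * (d - 1) / 2) }'
}

classic_raid10_survival 2    # 4 drives:  1 - 2/6  -> 0.666667
classic_raid10_survival 10   # 20 drives: 1 - 10/190 -> 0.947368

# Btrfs raid10, as described above, stripes each block-group copy across
# half the drives, so a second failed drive from the other half can remove
# the remaining copy of some block group: double-failure survival is
# effectively 0 regardless of drive count.
```

So classic RAID10 gets *better* double-failure odds as the array grows, while Btrfs raid10 does not.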
Re: btrfs partition is broken, cannot restore anything
Hi, Stupid gmail has put my email (or Qu's reply? ) to spam, so I just saw the reply after I sent my reply (gmail asked me whether to remove it from spam). Anyway here is the requested output. Thanks for the help! $ sudo btrfs check /dev/nvme0n1p2 Opening filesystem to check... checksum verify failed on 18811453440 found E4E3BDB6 wanted checksum verify failed on 18811453440 found E4E3BDB6 wanted bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0 ERROR: cannot open file system $ sudo btrfs check --mode=lowmem /dev/nvme0n1p2 Opening filesystem to check... checksum verify failed on 18811453440 found E4E3BDB6 wanted checksum verify failed on 18811453440 found E4E3BDB6 wanted bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0 ERROR: cannot open file system Regards, Attila On Mon, Nov 5, 2018 at 6:01 PM Attila Vangel wrote: > > Hi, > > TL;DR: I want to save data from my unmountable btrfs partition. > I saw some commands in another thread "Salvage files from broken btrfs". > I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO, > btrfs-progs 4.17.1-1) to execute these commands. > > $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt > mount: /mnt: wrong fs type, bad option, bad superblock on > /dev/nvme0n1p2, missing codepage or helper program, or other error. 
> > Corresponding lines from dmesg: > > [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount > time > [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled > [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents > [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start, > want 18811453440 have 0 > [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5 > [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed > > $ sudo btrfs-find-root /dev/nvme0n1p2 > Superblock thinks the generation is 220524 > Superblock thinks the level is 1 > Found tree root at 25018368 gen 220524 level 1 > Well block 4243456(gen: 220520 level: 1) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > Well block 5259264(gen: 220519 level: 1) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > Well block 4866048(gen: 220518 level: 0) seems good, but > generation/level doesn't match, want gen: 220524 level: 1 > > $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2 > superblock: bytenr=65536, device=/dev/nvme0n1p2 > - > csum_type0 (crc32c) > csum_size4 > csum0x7956a931 [match] > bytenr65536 > flags0x1 > ( WRITTEN ) > magic_BHRfS_M [match] > fsid014c9d24-339c-482e-8f06-9284e4a7bc40 > labelnewhome > generation220524 > root25018368 > sys_array_size97 > chunk_root_generation219209 > root_level1 > chunk_root131072 > chunk_root_level1 > log_root86818816 > log_root_transid0 > log_root_level0 > total_bytes355938074624 > bytes_used344504737792 > sectorsize4096 > nodesize16384 > leafsize (deprecated)16384 > stripesize4096 > root_dir6 > num_devices1 > compat_flags0x0 > compat_ro_flags0x0 > incompat_flags0x161 > ( MIXED_BACKREF | > BIG_METADATA | > EXTENDED_IREF | > SKINNY_METADATA ) > cache_generation220524 > uuid_tree_generation220524 > dev_item.uuid05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3 > dev_item.fsid014c9d24-339c-482e-8f06-9284e4a7bc40 [match] > dev_item.type0 > 
dev_item.total_bytes355938074624 > dev_item.bytes_used355792322560 > dev_item.io_align4096 > dev_item.io_width4096 > dev_item.sector_size4096 > dev_item.devid1 > dev_item.dev_group0 > dev_item.seek_speed0 > dev_item.bandwidth0 > dev_item.generation0 > sys_chunk_array[2048]: > item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0) > length 4194304 owner 2 stripe_len 65536 type SYSTEM > io_align 4096 io_width 4096 sector_size 4096 > num_stripes 1 sub_stripes 0 > stripe 0 devid 1 offset 0 > dev_uuid 05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3 > backup_roots[4]: > backup 0: > backup_tree_root:42598400gen: 220522level: 1 > backup_chunk_root:131072gen: 219209level: 1 > backup_extent_root:26460160gen: 220522level: 2 > backup_fs_root:51347456gen: 220523level: 2 > backup_dev_root:4472832gen: 220520level: 1 > backup_csum_root:26558464gen: 220522level: 2 > backup_total_bytes:355938074624 > backup_bytes_used:344504741888 > backup_num_devices:1 > > backup 1: > backup_tree_root:
Re: btrfs partition is broken, cannot restore anything
Hi,

TL;DR: I want to save data from my unmountable btrfs partition.
I saw some commands in another thread "Salvage files from broken btrfs".
I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO, btrfs-progs 4.17.1-1) to execute these commands.

$ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/nvme0n1p2, missing codepage or helper program, or other error.

Corresponding lines from dmesg:

[ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount time
[ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
[ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
[ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start, want 18811453440 have 0
[ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: -5
[ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed

$ sudo btrfs-find-root /dev/nvme0n1p2
Superblock thinks the generation is 220524
Superblock thinks the level is 1
Found tree root at 25018368 gen 220524 level 1
Well block 4243456(gen: 220520 level: 1) seems good, but generation/level doesn't match, want gen: 220524 level: 1
Well block 5259264(gen: 220519 level: 1) seems good, but generation/level doesn't match, want gen: 220524 level: 1
Well block 4866048(gen: 220518 level: 0) seems good, but generation/level doesn't match, want gen: 220524 level: 1

$ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
superblock: bytenr=65536, device=/dev/nvme0n1p2
---------------------------------------------------------
csum_type		0 (crc32c)
csum_size		4
csum			0x7956a931 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			014c9d24-339c-482e-8f06-9284e4a7bc40
label			newhome
generation		220524
root			25018368
sys_array_size		97
chunk_root_generation	219209
root_level		1
chunk_root		131072
chunk_root_level	1
log_root		86818816
log_root_transid	0
log_root_level		0
total_bytes		355938074624
bytes_used		344504737792
sectorsize		4096
nodesize		16384
leafsize (deprecated)	16384
stripesize		4096
root_dir		6
num_devices		1
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
cache_generation	220524
uuid_tree_generation	220524
dev_item.uuid		05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
dev_item.fsid		014c9d24-339c-482e-8f06-9284e4a7bc40 [match]
dev_item.type		0
dev_item.total_bytes	355938074624
dev_item.bytes_used	355792322560
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
		length 4194304 owner 2 stripe_len 65536 type SYSTEM
		io_align 4096 io_width 4096 sector_size 4096
		num_stripes 1 sub_stripes 0
			stripe 0 devid 1 offset 0
			dev_uuid 05fe6ce8-1f2d-41ba-a367-cbdb8f06ffd3
backup_roots[4]:
	backup 0:
		backup_tree_root:	42598400	gen: 220522	level: 1
		backup_chunk_root:	131072	gen: 219209	level: 1
		backup_extent_root:	26460160	gen: 220522	level: 2
		backup_fs_root:		51347456	gen: 220523	level: 2
		backup_dev_root:	4472832	gen: 220520	level: 1
		backup_csum_root:	26558464	gen: 220522	level: 2
		backup_total_bytes:	355938074624
		backup_bytes_used:	344504741888
		backup_num_devices:	1

	backup 1:
		backup_tree_root:	52363264	gen: 220523	level: 1
		backup_chunk_root:	131072	gen: 219209	level: 1
		backup_extent_root:	51806208	gen: 220523	level: 2
		backup_fs_root:		51347456	gen: 220523	level: 2
		backup_dev_root:	4472832	gen: 220520	level: 1
		backup_csum_root:	52461568	gen: 220523	level: 2
		backup_total_bytes:	355938074624
		backup_bytes_used:	344504729600
		backup_num_devices:	1

	backup 2:
		backup_tree_root:	25018368	gen: 220524	level: 1
		backup_chunk_root:	131072	gen: 219209	level: 1
		backup_extent_root:	21479424	gen: 220524	level: 2
		backup_fs_root:		53084160	gen: 220524	level: 2
		backup_dev_root:	4472832	gen: 220520	level: 1
		backup_csum_root:	53379072	gen: 220524	level: 2
		backup_total_bytes:	355938074624
		backup_bytes_used:	344504737792
		backup_num_devices:	1

	backup 3:
		backup_tree_root:	21921792	gen: 220521	level: 1
		backup_chunk_root:
Re: BTRFS did it's job nicely (thanks!)
On 11/4/2018 11:44 AM, waxhead wrote:
> Sterling Windmill wrote:
>> Out of curiosity, what led to you choosing RAID1 for data but RAID10 for metadata? I've flip-flopped between these two modes myself after finding out that BTRFS RAID10 doesn't work how I would've expected. Wondering what made you choose your configuration. Thanks!
>
> Sure. The "RAID"1 profile for data was chosen to maximize disk space utilization, since I have a lot of mixed-size devices. The "RAID"10 profile for metadata was chosen simply because it *feels* a bit faster for some of my (previous) workload, which was reading a lot of small files (which I guess were embedded in the metadata). While I never got any measurable performance increase, the system simply felt smoother (which is strange, since "RAID"10 should hog more disks at once). I would love to try "RAID"10 for both data and metadata, but I have to delete some files first (or add yet another drive).
>
> Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 does not work as you expected? As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1 replica) are striped over as many disks as it can (as long as there is free space). So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe over (20/2) x 2, and if you run out of space on 10 of the devices it will continue to stripe over (10/2) x 2. So your stripe width varies with the available space, essentially... I may be terribly wrong about this (until someone corrects me, that is...)

He's probably referring to the fact that instead of there being a roughly 50% chance of it surviving the failure of at least 2 devices, as classical RAID10 is technically able to do, it's currently functionally 100% certain it won't survive more than one device failing.
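The difference can be made concrete with a small counting exercise. This is a toy model of my own, assuming classical RAID10 uses fixed mirror pairs; the exact survival fraction depends on device count, but the qualitative point is that with fixed pairs most two-device failures are survivable, while btrfs raid10 re-pairs devices per chunk, so across many chunks essentially every device pair ends up co-hosting some mirror and any two-device loss hits data.

```python
from itertools import combinations

def survives(failed, mirror_pairs):
    # The array survives as long as no mirror pair loses both members.
    return not any(a in failed and b in failed for a, b in mirror_pairs)

n = 20
# Classical RAID10 with fixed mirror pairs (0,1), (2,3), ...
fixed_pairs = [(i, i + 1) for i in range(0, n, 2)]
two_failures = list(combinations(range(n), 2))
ok = sum(survives(set(f), fixed_pairs) for f in two_failures)
print(f"{ok} of {len(two_failures)} two-device failures survivable")  # 180 of 190
```

With btrfs raid10, by contrast, the mirror pairing is chosen anew for each chunk, so over a filesystem's lifetime there is effectively no pair of devices that never mirrors a chunk together.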
Re: BTRFS did it's job nicely (thanks!)
Sterling Windmill wrote:
> Out of curiosity, what led to you choosing RAID1 for data but RAID10 for metadata? I've flip-flopped between these two modes myself after finding out that BTRFS RAID10 doesn't work how I would've expected. Wondering what made you choose your configuration. Thanks!

Sure. The "RAID"1 profile for data was chosen to maximize disk space utilization, since I have a lot of mixed-size devices. The "RAID"10 profile for metadata was chosen simply because it *feels* a bit faster for some of my (previous) workload, which was reading a lot of small files (which I guess were embedded in the metadata). While I never got any measurable performance increase, the system simply felt smoother (which is strange, since "RAID"10 should hog more disks at once). I would love to try "RAID"10 for both data and metadata, but I have to delete some files first (or add yet another drive).

Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 does not work as you expected? As far as I know, BTRFS' version of "RAID"10 ensures that 2 copies (1 replica) are striped over as many disks as it can (as long as there is free space). So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe over (20/2) x 2, and if you run out of space on 10 of the devices it will continue to stripe over (10/2) x 2. So your stripe width varies with the available space, essentially... I may be terribly wrong about this (until someone corrects me, that is...)
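The allocation rule described above can be sketched in a few lines. This is a deliberately simplified model of my own (it ignores chunk sizes and btrfs's minimum-device requirements); it only illustrates how the stripe width shrinks as devices fill up.

```python
def raid10_layout(free_bytes):
    # btrfs "RAID"10 stripes 2 copies over as many devices as still
    # have free space, rounded down to an even count, so the stripe
    # width is half the number of participating devices.
    with_space = [b for b in free_bytes if b > 0]
    devices = len(with_space) - len(with_space) % 2
    return devices, devices // 2  # (devices per chunk, stripe width)

print(raid10_layout([1] * 20))             # all 20 devices: (20, 10)
print(raid10_layout([1] * 10 + [0] * 10))  # 10 devices full: (10, 5)
```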
Re: BTRFS did it's job nicely (thanks!)
Out of curiosity, what led to you choosing RAID1 for data but RAID10 for metadata? I've flip-flopped between these two modes myself after finding out that BTRFS RAID10 doesn't work how I would've expected. Wondering what made you choose your configuration. Thanks!

On Fri, Nov 2, 2018 at 3:55 PM waxhead wrote:
>
> Hi,
>
> my main computer runs on a 7x SSD BTRFS as rootfs with
> data:RAID1 and metadata:RAID10.
>
> One SSD is probably about to fail, and it seems that BTRFS fixed it
> nicely (thanks everyone!)
>
> I decided to just post the ugly details in case someone just wants to
> have a look. Note that I tend to interpret the btrfs de st / output as
> if the error was NOT fixed even if (seems clearly that) it was, so I
> think the output is a bit misleading... just saying...
>
> -- below are the details for those curious (just for fun) ---
>
> scrub status for [YOINK!]
>         scrub started at Fri Nov 2 17:49:45 2018 and finished after 00:29:26
>         total bytes scrubbed: 1.15TiB with 1 errors
>         error details: csum=1
>         corrected errors: 1, uncorrectable errors: 0, unverified errors: 0
>
> btrfs fi us -T /
> Overall:
>     Device size:         1.18TiB
>     Device allocated:    1.17TiB
>     Device unallocated:  9.69GiB
>     Device missing:      0.00B
>     Used:                1.17TiB
>     Free (estimated):    6.30GiB (min: 6.30GiB)
>     Data ratio:          2.00
>     Metadata ratio:      2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
>              Data      Metadata  System
> Id Path      RAID1     RAID10    RAID10    Unallocated
> -- --------- --------- --------- --------- -----------
>  6 /dev/sda1 236.28GiB 704.00MiB  32.00MiB   485.00MiB
>  7 /dev/sdb1 233.72GiB   1.03GiB  32.00MiB     2.69GiB
>  2 /dev/sdc1 110.56GiB 352.00MiB         -   904.00MiB
>  8 /dev/sdd1 234.96GiB   1.03GiB  32.00MiB     1.45GiB
>  1 /dev/sde1 164.90GiB   1.03GiB  32.00MiB     1.72GiB
>  9 /dev/sdf1 109.00GiB   1.03GiB  32.00MiB   744.00MiB
> 10 /dev/sdg1 107.98GiB   1.03GiB  32.00MiB     1.74GiB
> -- --------- --------- --------- --------- -----------
>    Total     598.70GiB   3.09GiB  96.00MiB     9.69GiB
>    Used      597.25GiB   1.57GiB 128.00KiB
>
> uname -a
> Linux main 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64
GNU/Linux > > btrfs --version > btrfs-progs v4.17 > > > dmesg | grep -i btrfs > [7.801817] Btrfs loaded, crc32c=crc32c-generic > [8.163288] BTRFS: device label btrfsroot devid 10 transid 669961 > /dev/sdg1 > [8.163433] BTRFS: device label btrfsroot devid 9 transid 669961 > /dev/sdf1 > [8.163591] BTRFS: device label btrfsroot devid 1 transid 669961 > /dev/sde1 > [8.163734] BTRFS: device label btrfsroot devid 8 transid 669961 > /dev/sdd1 > [8.163974] BTRFS: device label btrfsroot devid 2 transid 669961 > /dev/sdc1 > [8.164117] BTRFS: device label btrfsroot devid 7 transid 669961 > /dev/sdb1 > [8.164262] BTRFS: device label btrfsroot devid 6 transid 669961 > /dev/sda1 > [8.206174] BTRFS info (device sde1): disk space caching is enabled > [8.206236] BTRFS info (device sde1): has skinny extents > [8.348610] BTRFS info (device sde1): enabling ssd optimizations > [8.854412] BTRFS info (device sde1): enabling free space tree > [8.854471] BTRFS info (device sde1): using free space tree > [ 68.170580] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.185973] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.185991] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186003] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186015] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186028] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186041] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected 
csum 0x6514b2c2 mirror 2 > [ 68.186052] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186063] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.186075] BTRFS warning (device sde1): csum failed root 3760 ino > 3247424 off 125434560512 csum 0x2e395164 expected csum 0x6514b2c2 mirror 2 > [ 68.199237]
Re: BTRFS did it's job nicely (thanks!)
Duncan wrote:
> waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:
>> Note that I tend to interpret the btrfs de st / output as if the error was NOT fixed even if (seems clearly that) it was, so I think the output is a bit misleading... just saying...
>
> See the btrfs-device manpage, stats subcommand, -z|--reset option, and device stats section:
>
> -z|--reset
>     Print the stats and reset the values to zero afterwards.
>
> DEVICE STATS
>     The device stats keep persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub run.
>
> So stats keeps a count of historic errors and is only reset when you specifically reset it, *NOT* when the error is fixed.

Yes, I am perfectly aware of all that. The issue I have is that the manpage describes corruption errors as "A block checksum mismatched or corrupted metadata header was found". This does not tell me whether the corruption was permanent or was fixed. That is why I think the output is a bit misleading (and I should have said that more clearly). My point being that btrfs device stats /mnt would be a lot easier to read and understand if it distinguished between permanent corruption, i.e. unfixable errors, and fixed errors.

> (There's actually a recent patch, I believe in the current dev kernel 4.20/5.0, that will reset a device's stats automatically for the btrfs replace case when it's actually a different device afterward anyway. Apparently, it doesn't even do /that/ automatically yet. Keep that in mind if you replace that device.)

Oh, thanks for the heads up. I was under the impression that the device stats were tracked by btrfs devid, but apparently it is (was) not. Good to know!
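Since the counters are cumulative, one workaround (short of resetting them with -z) is to snapshot the stats before and after a scrub and diff the two: a zero delta alongside "corrected errors" in the scrub report means everything new was fixed. The helper below is a hypothetical sketch of mine; the key names are the real field names "btrfs device stats" prints.

```python
def new_errors(before, after):
    # "btrfs device stats" counters are cumulative, so the errors
    # attributable to a single run (e.g. one scrub) are the difference
    # between two snapshots taken before and after it.
    return {key: after[key] - before[key] for key in after}

before = {"write_io_errs": 0, "read_io_errs": 0, "corruption_errs": 1}
after  = {"write_io_errs": 0, "read_io_errs": 0, "corruption_errs": 1}
print(new_errors(before, after))  # all zero: nothing new since the snapshot
```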
Re: BTRFS did it's job nicely (thanks!)
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted: > Note that I tend to interpret the btrfs de st / output as if the error > was NOT fixed even if (seems clearly that) it was, so I think the output > is a bit misleading... just saying... See the btrfs-device manpage, stats subcommand, -z|--reset option, and device stats section: -z|--reset Print the stats and reset the values to zero afterwards. DEVICE STATS The device stats keep persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub run. So stats keeps a count of historic errors and is only reset when you specifically reset it, *NOT* when the error is fixed. (There's actually a recent patch, I believe in the current dev kernel 4.20/5.0, that will reset a device's stats automatically for the btrfs replace case when it's actually a different device afterward anyway. Apparently, it doesn't even do /that/ automatically yet. Keep that in mind if you replace that device.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: btrfs partition is broken, cannot restore anything
On 2018/11/2 4:40 AM, Attila Vangel wrote:
> Hi,
>
> Somehow my btrfs partition got broken. I use Arch, so my kernel is quite new (4.18.x).
> I don't remember exactly the sequence of events. At some point it was accessible read-only, but unfortunately I did not take a backup immediately. dmesg log from that time:
>
> [ 62.602388] BTRFS warning (device nvme0n1p2): block group 103923318784 has wrong amount of free space
> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free space cache for block group 103923318784, rebuilding it now
> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 18812026880
> [ 108.039227] BTRFS: error (device nvme0n1p2) in __btrfs_free_extent:7010: errno=-5 IO failure

Normally we shouldn't call __btrfs_free_extent() during mount. It looks like it's caused by log tree replay.

You could try to mount the fs with -o ro,nologreplay to see if you can still mount the fs RO without log replay. If it goes well, at least you can access the files.

> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly
> [ 108.039250] BTRFS: error (device nvme0n1p2) in btrfs_run_delayed_refs:3076: errno=-5 IO failure
>
> At the next reboot it failed to mount. The problem may have been that at some point I booted another distro with an older kernel (4.15.x, 4.14.52) and unfortunately attempted some checks/repairs (?) e.g. from gparted, and at that time I did not know it could be destructive.
>
> Anyway, currently it fails to mount (even with ro and/or recovery), and btrfs check results in "checksum verify failed" and "bad tree block" errors,

Full btrfs check result please.

> btrfs restore resulted in "We have looped trying to restore files in" errors for a dozen paths, then exited.

So at least btrfs-restore is working. This normally means the fs is not completely corrupted, and the essential trees are at least OK.

Still need that "btrfs check" output (along with "btrfs check --mode=lowmem") to determine what to do.
Thanks, Qu > > Is there some hope to save data from the filesystem, and if so, how? > > BTW I checked some diagnostics commands regarding my SSD with the nvme client and from that it seems there are no hardware problems. > > Your help is highly appreciated. > > Cheers, > Attila
Re: btrfs-qgroup-rescan using 100% CPU
On 2018/10/28 6:58 AM, Dave wrote:
> I'm using btrfs and snapper on a system with an SSD. On this system when I run `snapper -c root ls` (where `root` is the snapper config for /), the process takes a very long time and top shows the following process using 100% of the CPU:
>
> kworker/u8:6+btrfs-qgroup-rescan

Not sure about what snapper is doing, but it looks like snapper needs to use btrfs qgroups, and enabling btrfs qgroups triggers an initial qgroup scan. If you have a lot of snapshots or a lot of files, the initial rescan will take a long time. That's the designed behavior.

> I have multiple computers (also with SSDs) set up the same way with snapper and btrfs. On the other computers, `snapper -c root ls` completes almost instantly, even on systems with many more snapshots.

The size of each subvolume also counts. The time consumed by qgroup rescan depends on the number of references. Snapshots and reflinks all contribute to that number. Also, large files contribute less than small files, as one large file (128M) may contain only one reference, while 128 small files (1M) contain 128 references.

Snapshots are one of the heaviest workloads in such a case. If one extent is shared between 4 snapshots, it will cost 4 times the CPU resources to do the rescan.

> This system has 20 total snapshots on `/`.

That already sounds like a lot, especially considering the size of the fs.

> System info:
>
> 4.18.16-arch1-1-ARCH (Arch Linux)
> btrfs-progs v4.17.1
> scrub started at Sat Oct 27 18:37:21 2018 and finished after 00:04:02
> total bytes scrubbed: 75.97GiB with 0 errors
>
> Filesystem          Size  Used Avail Use% Mounted on
> /dev/mapper/cryptdv 116G   77G   38G  67% /
>
> Data, single: total=72.01GiB, used=71.38GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=3.50GiB, used=2.22GiB

From a quick glance, it indeed needs some time to do the rescan.
Also, to make sure it's not some deadlock, you could check the rescan progress with the following command:

# btrfs ins dump-tree -t quota /dev/mapper/cryptdv

And look for the following item:

	item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
		version 1 generation 7 flags ON|RESCAN scan 1024

That scan number should change with rescan progress. (The number is only updated after each transaction, so you need to wait some time to see it change.)

Thanks,
Qu

> What other info would be helpful? What troubleshooting steps should I try?
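Qu's explanation of rescan cost (one reference per extent, multiplied by every snapshot sharing it) can be turned into a back-of-the-envelope estimate. The function below is my own rough cost model, not anything btrfs computes; the numbers reproduce the 128M-file vs 128x1M-files comparison from the thread.

```python
def rescan_references(file_sizes_mib, max_extent_mib, snapshots):
    # Each file contributes ceil(size / max extent size) extent
    # references, and every snapshot sharing an extent is walked
    # separately, so sharing multiplies the rescan work.
    refs = sum(-(-size // max_extent_mib) for size in file_sizes_mib)
    return refs * snapshots

# One 128 MiB file vs 128 x 1 MiB files, each shared by 4 snapshots:
print(rescan_references([128], 128, 4))      # 4 references to walk
print(rescan_references([1] * 128, 128, 4))  # 512 references to walk
```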
Re: Btrfs resize seems to deadlock
On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana wrote: > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > wrote: > > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > resize max $MOUNT", the command runs for a few minutes and then hangs > > > forcing the system to be reset. I am not sure what the state of the > > > filesystem really is at this point. Btrfs usage does report the > > > correct size for after resizing. Details below: > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > deeper debugging, may I ask you to post "btrfs inspect-internal > > dump-tree -t 1 /dev/your_btrfs_disk"? > > I believe it's actually easy to understand from the trace alone and > it's kind of a bad luck scenario. > I made this fix a few hours ago: > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > But haven't done full testing yet and might have missed something. > Bo, can you take a look and let me know what you think? > The patch is OK to me, should fix the problem. Since load_free_space_cache() is touching root tree with path->commit_root and skip_locking, I was wondering if we could just provide a same way for free space cache inode's iget(). thanks, liubo > Thanks. > > > > > So I'd like to know what's the height of your tree "1" which refers to > > root tree in btrfs. 
> > > > thanks, > > liubo > > > > > $ sudo btrfs filesystem usage $MOUNT > > > Overall: > > > Device size: 90.96TiB > > > Device allocated: 72.62TiB > > > Device unallocated: 18.33TiB > > > Device missing: 0.00B > > > Used: 72.62TiB > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > Data ratio: 1.00 > > > Metadata ratio: 2.00 > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > $MOUNT72.46TiB > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > $MOUNT 172.00GiB > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > >$MOUNT80.00MiB > > > > > > Unallocated: > > > $MOUNT18.33TiB > > > > > > $ uname -a > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > Call Trace: > > > ? __schedule+0x253/0x860 > > > schedule+0x28/0x80 > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > transaction_kthread+0x155/0x170 [btrfs] > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > kthread+0x112/0x130 > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > ret_from_fork+0x35/0x40 > > > btrfs D0 2504 2502 0x0002 > > > Call Trace: > > > ? __schedule+0x253/0x860 > > > schedule+0x28/0x80 > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > ? finish_wait+0x80/0x80 > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > ? inode_insert5+0x119/0x190 > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_iget+0x113/0x690 [btrfs] > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > ? 
cache_block_group+0x72/0x3b0 [btrfs] > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > find_free_extent+0x799/0x1010 [btrfs] > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > ? btrfs_release_path+0x13/0x80 [btrfs] > > > ? btrfs_update_device+0x8d/0x1c0 [btrfs] > > > btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs] > > > btrfs_ioctl+0xa25/0x2cf0 [btrfs] > > > ? tty_write+0x1fc/0x330 > > > ? do_vfs_ioctl+0xa4/0x620 > > > do_vfs_ioctl+0xa4/0x620 > > > ksys_ioctl+0x60/0x90 > > > ? ksys_write+0x4f/0xb0 > > > __x64_sys_ioctl+0x16/0x20 > > > do_syscall_64+0x5b/0x160 > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > RIP: 0033:0x7fcdc0d78c57 > > > Code: Bad RIP value. > > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX:
Re: Btrfs resize seems to deadlock
On Mon, Oct 22, 2018 at 10:06 AM Andrew Nelson wrote: > > OK, an update: After unmouting and running btrfs check, the drive > reverted to reporting the old size. Not sure if this was due to > unmounting / mounting or doing btrfs check. Btrfs check should have > been running in readonly mode. It reverted to the old size because the transaction used for the resize operation never got committed due to the deadlock, not because of 'btrfs check'. > Since it looked like something was > wrong with the resize process, I patched my kernel with the posted > patch. This time the resize operation finished successfully. Great, thanks for testing! > On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana wrote: > > > > On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson > > wrote: > > > > > > Also, is the drive in a safe state to use? Is there anything I should > > > run on the drive to check consistency? > > > > It should be in a safe state. You can verify it running "btrfs check > > /dev/" (it's a readonly operation). > > > > If you are able to patch and build a kernel, you can also try the > > patch. I left it running tests overnight and haven't got any > > regressions. > > > > Thanks. > > > > > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson > > > wrote: > > > > > > > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but > > > > the output is ~55mb. Is there something in particular you are looking > > > > for in this? > > > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana > > > > wrote: > > > > > > > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > > > > > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > > > > > wrote: > > > > > > > > > > > > > > I am having an issue with btrfs resize in Fedora 28. I am > > > > > > > attempting > > > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > > > > > resize max $MOUNT", the command runs for a few minutes and then > > > > > > > hangs > > > > > > > forcing the system to be reset. 
I am not sure what the state of > > > > > > > the > > > > > > > filesystem really is at this point. Btrfs usage does report the > > > > > > > correct size for after resizing. Details below: > > > > > > > > > > > > > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > > > > > deeper debugging, may I ask you to post "btrfs inspect-internal > > > > > > dump-tree -t 1 /dev/your_btrfs_disk"? > > > > > > > > > > I believe it's actually easy to understand from the trace alone and > > > > > it's kind of a bad luck scenario. > > > > > I made this fix a few hours ago: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > > > > > > > > > But haven't done full testing yet and might have missed something. > > > > > Bo, can you take a look and let me know what you think? > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > So I'd like to know what's the height of your tree "1" which refers > > > > > > to > > > > > > root tree in btrfs. 
> > > > > > > > > > > > thanks, > > > > > > liubo > > > > > > > > > > > > > $ sudo btrfs filesystem usage $MOUNT > > > > > > > Overall: > > > > > > > Device size: 90.96TiB > > > > > > > Device allocated: 72.62TiB > > > > > > > Device unallocated: 18.33TiB > > > > > > > Device missing: 0.00B > > > > > > > Used: 72.62TiB > > > > > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > > > > > Data ratio: 1.00 > > > > > > > Metadata ratio: 2.00 > > > > > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > > > > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > > > > > $MOUNT72.46TiB > > > > > > > > > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > > > > > $MOUNT 172.00GiB > > > > > > > > > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > > > > > >$MOUNT80.00MiB > > > > > > > > > > > > > > Unallocated: > > > > > > > $MOUNT18.33TiB > > > > > > > > > > > > > > $ uname -a > > > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon > > > > > > > Oct 15 > > > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > > > > > Call Trace: > > > > > > > ? __schedule+0x253/0x860 > > > > > > > schedule+0x28/0x80 > > > > > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > > > > > ? finish_wait+0x80/0x80 > > > > > > > transaction_kthread+0x155/0x170 [btrfs] > > > > > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > > > > > kthread+0x112/0x130 > > > > > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > > > > > ret_from_fork+0x35/0x40 > > > > > >
Re: Btrfs resize seems to deadlock
OK, an update: After unmouting and running btrfs check, the drive reverted to reporting the old size. Not sure if this was due to unmounting / mounting or doing btrfs check. Btrfs check should have been running in readonly mode. Since it looked like something was wrong with the resize process, I patched my kernel with the posted patch. This time the resize operation finished successfully. On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana wrote: > > On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson > wrote: > > > > Also, is the drive in a safe state to use? Is there anything I should > > run on the drive to check consistency? > > It should be in a safe state. You can verify it running "btrfs check > /dev/" (it's a readonly operation). > > If you are able to patch and build a kernel, you can also try the > patch. I left it running tests overnight and haven't got any > regressions. > > Thanks. > > > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson > > wrote: > > > > > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but > > > the output is ~55mb. Is there something in particular you are looking > > > for in this? > > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana wrote: > > > > > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > > > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > > > > wrote: > > > > > > > > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > > > > resize max $MOUNT", the command runs for a few minutes and then > > > > > > hangs > > > > > > forcing the system to be reset. I am not sure what the state of the > > > > > > filesystem really is at this point. Btrfs usage does report the > > > > > > correct size for after resizing. 
Details below: > > > > > > > > > > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > > > > deeper debugging, may I ask you to post "btrfs inspect-internal > > > > > dump-tree -t 1 /dev/your_btrfs_disk"? > > > > > > > > I believe it's actually easy to understand from the trace alone and > > > > it's kind of a bad luck scenario. > > > > I made this fix a few hours ago: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > > > > > > > But haven't done full testing yet and might have missed something. > > > > Bo, can you take a look and let me know what you think? > > > > > > > > Thanks. > > > > > > > > > > > > > > So I'd like to know what's the height of your tree "1" which refers to > > > > > root tree in btrfs. > > > > > > > > > > thanks, > > > > > liubo > > > > > > > > > > > $ sudo btrfs filesystem usage $MOUNT > > > > > > Overall: > > > > > > Device size: 90.96TiB > > > > > > Device allocated: 72.62TiB > > > > > > Device unallocated: 18.33TiB > > > > > > Device missing: 0.00B > > > > > > Used: 72.62TiB > > > > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > > > > Data ratio: 1.00 > > > > > > Metadata ratio: 2.00 > > > > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > > > > $MOUNT72.46TiB > > > > > > > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > > > > $MOUNT 172.00GiB > > > > > > > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > > > > >$MOUNT80.00MiB > > > > > > > > > > > > Unallocated: > > > > > > $MOUNT18.33TiB > > > > > > > > > > > > $ uname -a > > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct > > > > > > 15 > > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > > > > Call Trace: > > > > > > ? 
__schedule+0x253/0x860 > > > > > > schedule+0x28/0x80 > > > > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > > > > ? finish_wait+0x80/0x80 > > > > > > transaction_kthread+0x155/0x170 [btrfs] > > > > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > > > > kthread+0x112/0x130 > > > > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > > > > ret_from_fork+0x35/0x40 > > > > > > btrfs D0 2504 2502 0x0002 > > > > > > Call Trace: > > > > > > ? __schedule+0x253/0x860 > > > > > > schedule+0x28/0x80 > > > > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > > > > ? finish_wait+0x80/0x80 > > > > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > > > > ? inode_insert5+0x119/0x190 > > > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > > > ?
Re: Btrfs resize seems to deadlock
On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson wrote: > > Also, is the drive in a safe state to use? Is there anything I should > run on the drive to check consistency? It should be in a safe state. You can verify it running "btrfs check /dev/" (it's a readonly operation). If you are able to patch and build a kernel, you can also try the patch. I left it running tests overnight and haven't got any regressions. Thanks. > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson > wrote: > > > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but > > the output is ~55mb. Is there something in particular you are looking > > for in this? > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana wrote: > > > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > > > wrote: > > > > > > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > > > resize max $MOUNT", the command runs for a few minutes and then hangs > > > > > forcing the system to be reset. I am not sure what the state of the > > > > > filesystem really is at this point. Btrfs usage does report the > > > > > correct size for after resizing. Details below: > > > > > > > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > > > deeper debugging, may I ask you to post "btrfs inspect-internal > > > > dump-tree -t 1 /dev/your_btrfs_disk"? > > > > > > I believe it's actually easy to understand from the trace alone and > > > it's kind of a bad luck scenario. > > > I made this fix a few hours ago: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > > > > > But haven't done full testing yet and might have missed something. > > > Bo, can you take a look and let me know what you think? > > > > > > Thanks. 
> > > > > > > > > > > So I'd like to know what's the height of your tree "1" which refers to > > > > root tree in btrfs. > > > > > > > > thanks, > > > > liubo > > > > > > > > > $ sudo btrfs filesystem usage $MOUNT > > > > > Overall: > > > > > Device size: 90.96TiB > > > > > Device allocated: 72.62TiB > > > > > Device unallocated: 18.33TiB > > > > > Device missing: 0.00B > > > > > Used: 72.62TiB > > > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > > > Data ratio: 1.00 > > > > > Metadata ratio: 2.00 > > > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > > > $MOUNT72.46TiB > > > > > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > > > $MOUNT 172.00GiB > > > > > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > > > >$MOUNT80.00MiB > > > > > > > > > > Unallocated: > > > > > $MOUNT18.33TiB > > > > > > > > > > $ uname -a > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > > > Call Trace: > > > > > ? __schedule+0x253/0x860 > > > > > schedule+0x28/0x80 > > > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > > > ? finish_wait+0x80/0x80 > > > > > transaction_kthread+0x155/0x170 [btrfs] > > > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > > > kthread+0x112/0x130 > > > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > > > ret_from_fork+0x35/0x40 > > > > > btrfs D0 2504 2502 0x0002 > > > > > Call Trace: > > > > > ? __schedule+0x253/0x860 > > > > > schedule+0x28/0x80 > > > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > > > ? finish_wait+0x80/0x80 > > > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > > > ? 
inode_insert5+0x119/0x190 > > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > > btrfs_iget+0x113/0x690 [btrfs] > > > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > > > ? finish_wait+0x80/0x80 > > > > > find_free_extent+0x799/0x1010 [btrfs] > > > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > > > btrfs_lookup_inode+0x3a/0xc0
Re: Btrfs resize seems to deadlock
Also, is the drive in a safe state to use? Is there anything I should run on the drive to check consistency? On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson wrote: > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but > the output is ~55mb. Is there something in particular you are looking > for in this? > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana wrote: > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > > wrote: > > > > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > > resize max $MOUNT", the command runs for a few minutes and then hangs > > > > forcing the system to be reset. I am not sure what the state of the > > > > filesystem really is at this point. Btrfs usage does report the > > > > correct size for after resizing. Details below: > > > > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > > deeper debugging, may I ask you to post "btrfs inspect-internal > > > dump-tree -t 1 /dev/your_btrfs_disk"? > > > > I believe it's actually easy to understand from the trace alone and > > it's kind of a bad luck scenario. > > I made this fix a few hours ago: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > > > But haven't done full testing yet and might have missed something. > > Bo, can you take a look and let me know what you think? > > > > Thanks. > > > > > > > > So I'd like to know what's the height of your tree "1" which refers to > > > root tree in btrfs. 
> > > > > > thanks, > > > liubo > > > > > > > $ sudo btrfs filesystem usage $MOUNT > > > > Overall: > > > > Device size: 90.96TiB > > > > Device allocated: 72.62TiB > > > > Device unallocated: 18.33TiB > > > > Device missing: 0.00B > > > > Used: 72.62TiB > > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > > Data ratio: 1.00 > > > > Metadata ratio: 2.00 > > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > > $MOUNT72.46TiB > > > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > > $MOUNT 172.00GiB > > > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > > >$MOUNT80.00MiB > > > > > > > > Unallocated: > > > > $MOUNT18.33TiB > > > > > > > > $ uname -a > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > > Call Trace: > > > > ? __schedule+0x253/0x860 > > > > schedule+0x28/0x80 > > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > > ? finish_wait+0x80/0x80 > > > > transaction_kthread+0x155/0x170 [btrfs] > > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > > kthread+0x112/0x130 > > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > > ret_from_fork+0x35/0x40 > > > > btrfs D0 2504 2502 0x0002 > > > > Call Trace: > > > > ? __schedule+0x253/0x860 > > > > schedule+0x28/0x80 > > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > > ? finish_wait+0x80/0x80 > > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > > ? inode_insert5+0x119/0x190 > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > ? 
kmem_cache_alloc+0x166/0x1d0 > > > > btrfs_iget+0x113/0x690 [btrfs] > > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > > ? finish_wait+0x80/0x80 > > > > find_free_extent+0x799/0x1010 [btrfs] > > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > > ? btrfs_release_path+0x13/0x80 [btrfs] > > > > ? btrfs_update_device+0x8d/0x1c0 [btrfs] > > > > btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs] > > > > btrfs_ioctl+0xa25/0x2cf0 [btrfs] > > > > ? tty_write+0x1fc/0x330 > > > > ? do_vfs_ioctl+0xa4/0x620 > > > >
Re: Btrfs resize seems to deadlock
I have run the "btrfs inspect-internal dump-tree -t 1" command, but the output is ~55mb. Is there something in particular you are looking for in this? On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana wrote: > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > > wrote: > > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > > resize max $MOUNT", the command runs for a few minutes and then hangs > > > forcing the system to be reset. I am not sure what the state of the > > > filesystem really is at this point. Btrfs usage does report the > > > correct size for after resizing. Details below: > > > > > > > Thanks for the report, the stack is helpful, but this needs a few > > deeper debugging, may I ask you to post "btrfs inspect-internal > > dump-tree -t 1 /dev/your_btrfs_disk"? > > I believe it's actually easy to understand from the trace alone and > it's kind of a bad luck scenario. > I made this fix a few hours ago: > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock > > But haven't done full testing yet and might have missed something. > Bo, can you take a look and let me know what you think? > > Thanks. > > > > So I'd like to know what's the height of your tree "1" which refers to > > root tree in btrfs. 
> > > > thanks, > > liubo > > > > > $ sudo btrfs filesystem usage $MOUNT > > > Overall: > > > Device size: 90.96TiB > > > Device allocated: 72.62TiB > > > Device unallocated: 18.33TiB > > > Device missing: 0.00B > > > Used: 72.62TiB > > > Free (estimated): 18.34TiB (min: 9.17TiB) > > > Data ratio: 1.00 > > > Metadata ratio: 2.00 > > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > > $MOUNT72.46TiB > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > > $MOUNT 172.00GiB > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > > >$MOUNT80.00MiB > > > > > > Unallocated: > > > $MOUNT18.33TiB > > > > > > $ uname -a > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > > > btrfs-transacti D0 2501 2 0x8000 > > > Call Trace: > > > ? __schedule+0x253/0x860 > > > schedule+0x28/0x80 > > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > ? join_transaction+0x22/0x3e0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > transaction_kthread+0x155/0x170 [btrfs] > > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > > kthread+0x112/0x130 > > > ? kthread_create_worker_on_cpu+0x70/0x70 > > > ret_from_fork+0x35/0x40 > > > btrfs D0 2504 2502 0x0002 > > > Call Trace: > > > ? __schedule+0x253/0x860 > > > schedule+0x28/0x80 > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > ? finish_wait+0x80/0x80 > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > ? inode_insert5+0x119/0x190 > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_iget+0x113/0x690 [btrfs] > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > ? 
cache_block_group+0x72/0x3b0 [btrfs] > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > find_free_extent+0x799/0x1010 [btrfs] > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > ? btrfs_release_path+0x13/0x80 [btrfs] > > > ? btrfs_update_device+0x8d/0x1c0 [btrfs] > > > btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs] > > > btrfs_ioctl+0xa25/0x2cf0 [btrfs] > > > ? tty_write+0x1fc/0x330 > > > ? do_vfs_ioctl+0xa4/0x620 > > > do_vfs_ioctl+0xa4/0x620 > > > ksys_ioctl+0x60/0x90 > > > ? ksys_write+0x4f/0xb0 > > > __x64_sys_ioctl+0x16/0x20 > > > do_syscall_64+0x5b/0x160 > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > RIP: 0033:0x7fcdc0d78c57 > > > Code: Bad RIP value. > > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010 > > > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57 > > > RDX:
Re: Btrfs resize seems to deadlock
On Sat, Oct 20, 2018 at 9:27 PM Liu Bo wrote: > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson > wrote: > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > > resize max $MOUNT", the command runs for a few minutes and then hangs > > forcing the system to be reset. I am not sure what the state of the > > filesystem really is at this point. Btrfs usage does report the > > correct size for after resizing. Details below: > > > > Thanks for the report, the stack is helpful, but this needs a few > deeper debugging, may I ask you to post "btrfs inspect-internal > dump-tree -t 1 /dev/your_btrfs_disk"? I believe it's actually easy to understand from the trace alone and it's kind of a bad luck scenario. I made this fix a few hours ago: https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock But haven't done full testing yet and might have missed something. Bo, can you take a look and let me know what you think? Thanks. > > So I'd like to know what's the height of your tree "1" which refers to > root tree in btrfs. > > thanks, > liubo > > > $ sudo btrfs filesystem usage $MOUNT > > Overall: > > Device size: 90.96TiB > > Device allocated: 72.62TiB > > Device unallocated: 18.33TiB > > Device missing: 0.00B > > Used: 72.62TiB > > Free (estimated): 18.34TiB (min: 9.17TiB) > > Data ratio: 1.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 24.11MiB) > > > > Data,single: Size:72.46TiB, Used:72.45TiB > > $MOUNT72.46TiB > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > > $MOUNT 172.00GiB > > > > System,DUP: Size:40.00MiB, Used:7.53MiB > >$MOUNT80.00MiB > > > > Unallocated: > > $MOUNT18.33TiB > > > > $ uname -a > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > btrfs-transacti D0 2501 2 0x8000 > > Call Trace: > > ? 
__schedule+0x253/0x860 > > schedule+0x28/0x80 > > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > > ? kmem_cache_alloc+0x166/0x1d0 > > ? join_transaction+0x22/0x3e0 [btrfs] > > ? finish_wait+0x80/0x80 > > transaction_kthread+0x155/0x170 [btrfs] > > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > > kthread+0x112/0x130 > > ? kthread_create_worker_on_cpu+0x70/0x70 > > ret_from_fork+0x35/0x40 > > btrfs D0 2504 2502 0x0002 > > Call Trace: > > ? __schedule+0x253/0x860 > > schedule+0x28/0x80 > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > ? finish_wait+0x80/0x80 > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > ? inode_insert5+0x119/0x190 > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > ? kmem_cache_alloc+0x166/0x1d0 > > btrfs_iget+0x113/0x690 [btrfs] > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > load_free_space_cache+0x7c/0x170 [btrfs] > > ? cache_block_group+0x72/0x3b0 [btrfs] > > cache_block_group+0x1b3/0x3b0 [btrfs] > > ? finish_wait+0x80/0x80 > > find_free_extent+0x799/0x1010 [btrfs] > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > btrfs_cow_block+0xdc/0x180 [btrfs] > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > ? kmem_cache_alloc+0x166/0x1d0 > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > cache_save_setup+0xe4/0x3a0 [btrfs] > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > ? btrfs_release_path+0x13/0x80 [btrfs] > > ? btrfs_update_device+0x8d/0x1c0 [btrfs] > > btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs] > > btrfs_ioctl+0xa25/0x2cf0 [btrfs] > > ? tty_write+0x1fc/0x330 > > ? do_vfs_ioctl+0xa4/0x620 > > do_vfs_ioctl+0xa4/0x620 > > ksys_ioctl+0x60/0x90 > > ? 
ksys_write+0x4f/0xb0 > > __x64_sys_ioctl+0x16/0x20 > > do_syscall_64+0x5b/0x160 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > RIP: 0033:0x7fcdc0d78c57 > > Code: Bad RIP value. > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57 > > RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003 > > RBP: 7ffdd1ee6da0 R08: R09: 7ffdd1ee67e0 > > R10: R11: 0246 R12: 7ffdd1ee888e > > R13: 0003 R14: R15: > > kworker/u48:1 D0 2505 2 0x8000 > > Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs] > > Call Trace: > > ? __schedule+0x253/0x860 > >
Re: Btrfs resize seems to deadlock
On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson wrote: > > I am having an issue with btrfs resize in Fedora 28. I am attempting > to enlarge my Btrfs partition. Every time I run "btrfs filesystem > resize max $MOUNT", the command runs for a few minutes and then hangs > forcing the system to be reset. I am not sure what the state of the > filesystem really is at this point. Btrfs usage does report the > correct size for after resizing. Details below: > Thanks for the report, the stack is helpful, but this needs a few deeper debugging, may I ask you to post "btrfs inspect-internal dump-tree -t 1 /dev/your_btrfs_disk"? So I'd like to know what's the height of your tree "1" which refers to root tree in btrfs. thanks, liubo > $ sudo btrfs filesystem usage $MOUNT > Overall: > Device size: 90.96TiB > Device allocated: 72.62TiB > Device unallocated: 18.33TiB > Device missing: 0.00B > Used: 72.62TiB > Free (estimated): 18.34TiB (min: 9.17TiB) > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 24.11MiB) > > Data,single: Size:72.46TiB, Used:72.45TiB > $MOUNT72.46TiB > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB > $MOUNT 172.00GiB > > System,DUP: Size:40.00MiB, Used:7.53MiB >$MOUNT80.00MiB > > Unallocated: > $MOUNT18.33TiB > > $ uname -a > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15 > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > btrfs-transacti D0 2501 2 0x8000 > Call Trace: > ? __schedule+0x253/0x860 > schedule+0x28/0x80 > btrfs_commit_transaction+0x7aa/0x8b0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > ? join_transaction+0x22/0x3e0 [btrfs] > ? finish_wait+0x80/0x80 > transaction_kthread+0x155/0x170 [btrfs] > ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > kthread+0x112/0x130 > ? kthread_create_worker_on_cpu+0x70/0x70 > ret_from_fork+0x35/0x40 > btrfs D0 2504 2502 0x0002 > Call Trace: > ? __schedule+0x253/0x860 > schedule+0x28/0x80 > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > ? 
finish_wait+0x80/0x80 > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > btrfs_search_slot+0xf6/0x9f0 [btrfs] > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > ? inode_insert5+0x119/0x190 > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_iget+0x113/0x690 [btrfs] > __lookup_free_space_inode+0xd8/0x150 [btrfs] > lookup_free_space_inode+0x5b/0xb0 [btrfs] > load_free_space_cache+0x7c/0x170 [btrfs] > ? cache_block_group+0x72/0x3b0 [btrfs] > cache_block_group+0x1b3/0x3b0 [btrfs] > ? finish_wait+0x80/0x80 > find_free_extent+0x799/0x1010 [btrfs] > btrfs_reserve_extent+0x9b/0x180 [btrfs] > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > __btrfs_cow_block+0x11d/0x500 [btrfs] > btrfs_cow_block+0xdc/0x180 [btrfs] > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_update_inode_item+0x46/0x100 [btrfs] > cache_save_setup+0xe4/0x3a0 [btrfs] > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > ? btrfs_release_path+0x13/0x80 [btrfs] > ? btrfs_update_device+0x8d/0x1c0 [btrfs] > btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs] > btrfs_ioctl+0xa25/0x2cf0 [btrfs] > ? tty_write+0x1fc/0x330 > ? do_vfs_ioctl+0xa4/0x620 > do_vfs_ioctl+0xa4/0x620 > ksys_ioctl+0x60/0x90 > ? ksys_write+0x4f/0xb0 > __x64_sys_ioctl+0x16/0x20 > do_syscall_64+0x5b/0x160 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > RIP: 0033:0x7fcdc0d78c57 > Code: Bad RIP value. > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57 > RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003 > RBP: 7ffdd1ee6da0 R08: R09: 7ffdd1ee67e0 > R10: R11: 0246 R12: 7ffdd1ee888e > R13: 0003 R14: R15: > kworker/u48:1 D0 2505 2 0x8000 > Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs] > Call Trace: > ? __schedule+0x253/0x860 > schedule+0x28/0x80 > btrfs_tree_lock+0x130/0x210 [btrfs] > ? 
finish_wait+0x80/0x80 > btrfs_search_slot+0x775/0x9f0 [btrfs] > btrfs_mark_extent_written+0xb0/0xb20 [btrfs] > ? join_transaction+0x22/0x3e0 [btrfs] > btrfs_finish_ordered_io+0x2e0/0x7e0 [btrfs] > ? syscall_return_via_sysret+0x13/0x83 > ? __switch_to_asm+0x34/0x70 > normal_work_helper+0xd3/0x2f0 [btrfs] > process_one_work+0x1a1/0x350 > worker_thread+0x30/0x380 > ? pwq_unbound_release_workfn+0xd0/0xd0 > kthread+0x112/0x130 > ? kthread_create_worker_on_cpu+0x70/0x70 > ret_from_fork+0x35/0x40
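Liu Bo's question above boils down to reading the level (height) of the root tree from the first header line of `btrfs inspect-internal dump-tree -t 1` output, rather than posting all ~55 MB of it. A small sketch of that extraction, assuming header lines shaped like `node <bytenr> level <N> ...` (leaves may omit an explicit level in some btrfs-progs versions, in which case the level is 0); the sample text is made up for illustration:

```python
import re

def root_tree_level(dump_tree_output: str) -> int:
    """Return the level of the first tree block header in dump-tree output.
    Interior nodes print 'node <bytenr> level <N> ...'; a root that is a
    leaf means the tree is only one level tall (level 0)."""
    for line in dump_tree_output.splitlines():
        m = re.match(r"(node|leaf) \d+ level (\d+)", line)
        if m:
            return int(m.group(2))
        if line.startswith("leaf "):
            return 0  # leaf header without an explicit level field
    raise ValueError("no node/leaf header found")

# Hypothetical dump-tree excerpt: a root tree whose root node is level 1.
sample = """root tree
node 30539776 level 1 items 120 free 1 generation 7 owner 1
"""
print(root_tree_level(sample))  # 1
```

With something like this you can answer "what's the height of tree 1" from the first couple of lines of the dump, without sharing the whole file.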
Re: btrfs send receive: No space left on device
On Wed, Oct 17, 2018 at 10:29 AM Libor Klepáč wrote: > > Hello, > i have new 32GB SSD in my intel nuc, installed debian9 on it, using btrfs as > a rootfs. > Then i created subvolumes /system and /home and moved system there. > > System was installed using kernel 4.9.x and filesystem created using > btrfs-progs 4.7.x > Details follow: > main filesystem > > # btrfs filesystem usage /mnt/btrfs/ssd/ > Overall: > Device size: 29.08GiB > Device allocated: 4.28GiB > Device unallocated: 24.80GiB > Device missing: 0.00B > Used: 2.54GiB > Free (estimated): 26.32GiB (min: 26.32GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 16.00MiB (used: 0.00B) > > Data,single: Size:4.00GiB, Used:2.48GiB >/dev/sda3 4.00GiB > > Metadata,single: Size:256.00MiB, Used:61.05MiB >/dev/sda3 256.00MiB > > System,single: Size:32.00MiB, Used:16.00KiB >/dev/sda3 32.00MiB > > Unallocated: >/dev/sda3 24.80GiB > > #/etc/fstab > UUID=d801da52-813d-49da-bdda-87fc6363e0ac /mnt/btrfs/ssd btrfs > noatime,space_cache=v2,compress=lzo,commit=300,subvolid=5 0 0 > UUID=d801da52-813d-49da-bdda-87fc6363e0ac / btrfs > noatime,space_cache=v2,compress=lzo,commit=300,subvol=/system 0 > 0 > UUID=d801da52-813d-49da-bdda-87fc6363e0ac /home btrfs > noatime,space_cache=v2,compress=lzo,commit=300,subvol=/home 0 > 0 > > - > Then i installed kernel from backports: > 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 > and btrfs-progs 4.17 > > For backups , i have created 16GB iscsi device on my qnap and mounted it, > created filesystem, mounted like this: > LABEL=backup/mnt/btrfs/backup btrfs > noatime,space_cache=v2,compress=lzo,subvolid=5,nofail,noauto 0 > 0 > > After send-receive operation on /home subvolume, usage looks like this: > > # btrfs filesystem usage /mnt/btrfs/backup/ > Overall: > Device size: 16.00GiB > Device allocated: 1.27GiB > Device unallocated: 14.73GiB > Device missing: 0.00B > Used:844.18MiB > Free (estimated): 14.92GiB (min: 14.92GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global 
reserve: 16.00MiB (used: 0.00B) > > Data,single: Size:1.01GiB, Used:833.36MiB >/dev/sdb1.01GiB > > Metadata,single: Size:264.00MiB, Used:10.80MiB >/dev/sdb 264.00MiB > > System,single: Size:4.00MiB, Used:16.00KiB >/dev/sdb4.00MiB > > Unallocated: >/dev/sdb 14.73GiB > > > Problem is, during send-receive of system subvolume, it runs out of space: > > # btrbk run /mnt/btrfs/ssd/system/ -v > btrbk command line client, version 0.26.1 (Wed Oct 17 09:51:20 2018) > Using configuration: /etc/btrbk/btrbk.conf > Using transaction log: /var/log/btrbk.log > Creating subvolume snapshot for: /mnt/btrfs/ssd/system > [snapshot] source: /mnt/btrfs/ssd/system > [snapshot] target: /mnt/btrfs/ssd/_snapshots/system.20181017T0951 > Checking for missing backups of subvolume "/mnt/btrfs/ssd/system" in > "/mnt/btrfs/backup/" > Creating subvolume backup (send-receive) for: > /mnt/btrfs/ssd/_snapshots/system.20181016T2034 > No common parent subvolume present, creating full backup... > [send/receive] source: /mnt/btrfs/ssd/_snapshots/system.20181016T2034 > [send/receive] target: /mnt/btrfs/backup/system.20181016T2034 > mbuffer: error: outputThread: error writing to at offset 0x4b5bd000: > Broken pipe > mbuffer: warning: error during output to : Broken pipe > WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, > receive=/mnt/btrfs/backup) At subvol > /mnt/btrfs/ssd/_snapshots/system.20181016T2034 > WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, > receive=/mnt/btrfs/backup) At subvol system.20181016T2034 > ERROR: rename o77417-5519-0 -> > lib/modules/4.18.0-0.bpo.1-amd64/kernel/drivers/watchdog/pcwd_pci.ko failed: > No space left on device > ERROR: Failed to send/receive btrfs subvolume: > /mnt/btrfs/ssd/_snapshots/system.20181016T2034 -> /mnt/btrfs/backup > [delete] options: commit-after > [delete] target: /mnt/btrfs/backup/system.20181016T2034 > WARNING: Deleted partially received (garbled) subvolume: > 
/mnt/btrfs/backup/system.20181016T2034 > ERROR: Error while resuming backups, aborting > Created 0/2 missing backups > WARNING: Skipping cleanup of snapshots for subvolume "/mnt/btrfs/ssd/system", > as at least one target aborted earlier > Completed within: 116s (Wed Oct 17 09:53:16 2018) >
Re: btrfs check: Superblock bytenr is larger than device size
On 2018/10/16 11:25 PM, Anton Shepelev wrote: > Qu Wenruo to Anton Shepelev: > >>> On all our servers with BTRFS, which are otherwise working >>> normally, `btrfs check /' complains that >>> >>> Superblock bytenr is larger than device size >>> Couldn't open file system >>> >> Please try latest btrfs-progs and see if btrfs check >> reports any error. > > It is SUSE Linux Enterprise with its native repository, so I > cannot update btrfs easily. I will, though. Then I recommend using the latest openSUSE Tumbleweed rescue ISO to do the mount and check. Thanks, Qu
Re: btrfs check: Superblock bytenr is larger than device size
Qu Wenruo to Anton Shepelev: >>On all our servers with BTRFS, which are otherwise working >>normally, `btrfs check /' complains that >> >>Superblock bytenr is larger than device size >>Couldn't open file system >> >Please try latest btrfs-progs and see if btrfs check >reports any error. It is SUSE Linux Enterprise with its native repository, so I cannot update btrfs easily. I will, though.
Re: btrfs check: Superblock bytenr is larger than device size
On 2018/10/16 10:05 PM, Anton Shepelev wrote: > Hello, all > > On all our servers with BTRFS, which are otherwise working > normally, `btrfs check /' complains that Btrfs check shouldn't continue on a mount point. The latest one would report an error like: Opening filesystem to check... ERROR: not a regular file or block device: /mnt/btrfs ERROR: cannot open file system > >Superblock bytenr is larger than device size This shouldn't be a big problem, normally related to unaligned numbers, and an older kernel/btrfs-progs. Please try the latest btrfs-progs and see if btrfs check reports any error. >Couldn't open file system > > Since I am not using any "dangerous" options and want merely > to analyse the system for errors without any modifications, > I don't think I must unmount my FS, which in my case would > mean booting from a live CD. What may be causing this > error? Normally an old mkfs or old kernel. The latest btrfs check result would definitely help to solve the problem. And the following results will also help (they can be dumped even with the fs mounted; this also needs the latest btrfs-progs): # btrfs ins dump-super -FfA # btrfs ins dump-tree -t chunk Thanks, Qu
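For context on the "Superblock bytenr is larger than device size" message: btrfs keeps superblock copies at fixed byte offsets (64 KiB, 64 MiB, 256 GiB), and a copy exists only if the device is large enough to hold it. The offsets are part of the btrfs on-disk format; the checking logic below is an illustrative sketch, not btrfs-progs code:

```python
# Fixed superblock mirror offsets in the btrfs on-disk format.
SUPERBLOCK_OFFSETS = [64 * 1024, 64 * 1024**2, 256 * 1024**3]
SUPERBLOCK_SIZE = 4096  # size of one superblock structure

def present_superblock_offsets(device_size: int) -> list[int]:
    """Return the superblock offsets that fit on a device of this size.
    A tool comparing a superblock's bytenr against a (mis)reported
    device size would complain exactly when this containment fails."""
    return [off for off in SUPERBLOCK_OFFSETS
            if off + SUPERBLOCK_SIZE <= device_size]

# A 30 GiB root disk holds only the first two copies; the 256 GiB
# mirror simply does not exist on it.
print(present_superblock_offsets(30 * 1024**3))  # [65536, 67108864]
```

So if a tool computes the device size from unaligned or stale numbers, a perfectly valid superblock can appear to lie "beyond" the device, which matches Qu's note that newer btrfs-progs handles this better.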
Re: BTRFS bad block management. Does it exist?
On 10/14/2018 07:08 PM, waxhead wrote: In case BTRFS fails to WRITE to a disk. What happens? Does the bad area get mapped out somehow? There was a proposed patch; it's not convincing, because the disk does the bad-block relocation part transparently to the host, and if the disk runs out of its reserved list then it's probably time to replace the disk. In my experience the disk would have failed for some other non-media error before it runs out of the reserved list, and in that case host-performed relocation won't help. Furthermore, being at the file-system level, you won't be able to accurately determine whether the block write failed because of a bad-media error and not because of a target circuitry fault. Does it try again until it succeed or until it "times out" or reach a threshold counter? Block IO timeout and retry are properties of the block layer, depending on the type of error. The SD module already retries 5 times (when failfast is not set), and it should be tunable. I think there was a patch for that in the ML. We had a few discussions on the retry part in the past. [1] [1] https://www.spinics.net/lists/linux-btrfs/msg70240.html https://www.spinics.net/lists/linux-btrfs/msg71779.html Does it eventually try to write to a different disk (in case of using the raid1/10 profile?) When there is a mirror copy it does not go into RO mode, and it leaves write hole(s) patchy across any transaction, as we don't fail the disk at the first failed transaction. That means if a disk is at the nth transaction per the super-block, it's not guaranteed that all previous transactions have made it to the disk successfully in mirrored configs. I consider this a bug. And there is a danger that it may read junk data, which is hard but not impossible to hit due to our unreasonable (there is a patch in the ML to address that as well) hard-coded pid-based read-mirror policy. 
I sent a patch to fail the disk when the first write fails, so that we know the last good integrity of the FS based on the transaction id. That was a long time back; I still believe it's an important patch. There weren't enough comments, I guess, for it to go into the next step. The current solution is to replace the offending disk _without_ reading from it, to have a good recovery from the failed disk. As data centers can't rely on admin-initiated manual recovery, there is also a patch to do this automatically using the auto-replace feature; patches are in the ML. Again there weren't enough comments, I guess, for it to go into the next step. Thanks, Anand
Re: BTRFS bad block management. Does it exist?
On 2018-10-14 07:08, waxhead wrote: In case BTRFS fails to WRITE to a disk. What happens? Does the bad area get mapped out somehow? Does it try again until it succeed or until it "times out" or reach a threshold counter? Does it eventually try to write to a different disk (in case of using the raid1/10 profile?) Building on Qu's answer (which is absolutely correct), BTRFS makes the perfectly reasonable assumption that you're not trying to use known bad hardware. It's not alone in this respect either; pretty much every Linux filesystem makes the exact same assumption (and almost all non-Linux ones too), because it really is a perfectly reasonable assumption. The only exception is ext[234], but they only support it statically (you can set the bad block list at mkfs time, but not afterwards, and they don't update it at runtime), and it's a holdover from earlier filesystems which originated at a time when storage was sufficiently expensive _and_ unreliable that you kept using disks until they were essentially completely dead. The reality is that with modern storage hardware, if you have persistently bad sectors the device is either defective (and should be returned under warranty), or it's beyond expected EOL (and should just be replaced). Most people know about SSDs doing block remapping to avoid bad blocks, but hard drives do it too, and they're actually rather good at it. In both cases, enough spare blocks are provided that the device can handle average rates of media errors through the entirety of its average life expectancy without running out of spare blocks. On top of all of that though, it's fully possible to work around bad blocks in the block layer if you take the time to actually do it. With a bit of reasonably simple math, you can easily set up an LVM volume that actively avoids all the bad blocks on a disk while still fully utilizing the rest of the volume. 
Similarly, with a bit of work (and a partition table that supports _lots_ of partitions) you can work around bad blocks with an MD concatenated device.
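The "reasonably simple math" mentioned above is just mapping bad sectors to LVM physical extents (PEs) and then allocating only the complementary PE ranges (LVM's `lvcreate -l ... PV:from-to` syntax accepts such ranges). A sketch of that arithmetic, assuming 512-byte sectors and the default 4 MiB extent size; the sector numbers are made up:

```python
SECTOR = 512                   # assumed logical sector size in bytes
PE_SIZE = 4 * 1024**2          # default LVM physical extent size: 4 MiB

def good_pe_ranges(bad_sectors, total_pes):
    """Return (start, end) PE ranges, inclusive, that avoid every PE
    containing a bad sector. These ranges could then be passed to
    lvcreate as PV:start-end allocation hints."""
    bad_pes = {(s * SECTOR) // PE_SIZE for s in bad_sectors}
    ranges, start = [], None
    for pe in range(total_pes):
        if pe in bad_pes:
            if start is not None:
                ranges.append((start, pe - 1))
                start = None
        elif start is None:
            start = pe
    if start is not None:
        ranges.append((start, total_pes - 1))
    return ranges

# Two bad sectors that both fall inside PE 1 of a 10-PE device leave
# two usable runs of extents:
print(good_pe_ranges([8200, 8300], 10))  # [(0, 0), (2, 9)]
```

The same extent bookkeeping is what you would do by hand for the MD-concatenation approach, just with partitions instead of PE ranges.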
Re: BTRFS bad block management. Does it exist?
On 2018/10/14 7:08 PM, waxhead wrote: > In case BTRFS fails to WRITE to a disk. What happens? Normally it should return an error when we flush the disk. And in that case, the error leads to a transaction abort and the fs goes RO to prevent further corruption. > Does the bad area get mapped out somehow? No. > Does it try again until it > succeed or until it "times out" or reach a threshold counter? Unless it's done by the block layer, btrfs doesn't try that. > Does it eventually try to write to a different disk (in case of using > the raid1/10 profile?) No. That's not what RAID is designed to do. It's only allowed to have any flush error if using the "degraded" mount option and the error is under tolerance. Thanks, Qu
Re: btrfs send receive ERROR: chown failed: No such file or directory
Hey Filipe, thanks for the feedback. I ran the command again with -vv. Below are the last commands logged by btrfs receive to stderr: mkfile o2138798-5016457-0 rename leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13 -> o2138802-5207521-0 rename o2138798-5016457-0 -> leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13 utimes leonard/mail/lists/emacs-orgmode/new ERROR: cannot open /mnt/_btrbk_backups/leonard-xps13/@home.20180902T0045/o2138798-5016457-0: No such file or directory The file mentioned above has in fact not changed between the parent and current snapshot. I further computed a hash of the file on the source volume for the parent and current snapshots, and of the target file on the target volume for the parent snapshot. I get the same hash value in all 3 cases, so the parent snapshot was apparently transferred correctly. Note that the folder in which this file is placed contains 96558 files whose sizes sum to 774M. I further observed that the snapshot that previously failed to transfer via send receive could now be sent / received without error. However, the error then occurred again for the next snapshot (which has the now successfully transferred one as parent). There may be a race condition that leads to non-deterministic failures. Please let me know if I can help with any further information. Best regards Leonard Filipe Manana writes: >> > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs >> >btrfs send ... | ssh user@machine2 "btrfs receive /path1" >> > 2) Machine 2 backups all subvolumes stored at /path1 to a second >> >independent btrfs filesystem. Let /path1/rootsnapshot be the first >> >snapshot stored at /path1 (ie. it has no Parent UUID). Let >> >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot >> >as a parent. 
Then >> >btrfs send -v /path1/rootsnapshot | btrfs receive /path2 >> >works without issues, but >> >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs >> > receive /path2 > > -v is useless. Use -vv, which will dump all commands. > >> >fails as follows: >> >ERROR: chown o257-4639416-0 failed: No such file or directory >> > >> > No error is shown in dmesg. /path1 and /path2 denote two independent >> > btrfs filesystems. >> > >> > Note that there was no issue with transferring incrementalsnapshot from >> > machine 1 to machine 2. No error is shown in dmesg.
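The re-run Filipe asks for can be sketched as follows, using the example paths from the report; capturing receive's stderr to a file makes it easy to inspect the last operations before the failure (exact verbosity behavior may differ between btrfs-progs versions):

```shell
# Incremental transfer with full command logging. send -vv dumps the
# stream commands; receive's per-command log goes to stderr.
btrfs send -vv -p /path1/rootsnapshot /path1/incrementalsnapshot \
    | btrfs receive -vv /path2 2> receive.log

# The tail of the log then shows the operations immediately preceding
# the "chown ... failed" / "cannot open" error.
tail -n 20 receive.log
```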
Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery
Filipe Manana - 05.10.18, 17:21: > On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald wrote: > > Hello! > > > > On ThinkPad T520 after battery was discharged and machine just > > blacked out. > > > > Is that some sign of regular consistency check / replay or something > > to investigate further? > > I think it's harmless; if anything were messed up with link counts or > mismatches between those and dir entries, fsck (btrfs check) should > have reported something. > I'll dig a bit further and remove the warning if it's really harmless. I just scrubbed the filesystem. I did not run btrfs check on it. > > I already scrubbed all data and there are no errors. Also btrfs > > device stats reports no errors. SMART status appears to be okay as > > well on both SSD. > > > > [4.524355] BTRFS info (device dm-4): disk space caching is > > enabled [… backtrace …] -- Martin
Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery
On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald wrote: > > Hello! > > On ThinkPad T520 after battery was discharged and machine just blacked > out. > > Is that some sign of regular consistency check / replay or something to > investigate further? I think it's harmless; if anything were messed up with link counts or mismatches between those and dir entries, fsck (btrfs check) should have reported something. I'll dig a bit further and remove the warning if it's really harmless. Thanks. > > I already scrubbed all data and there are no errors. Also btrfs device stats > reports no errors. SMART status appears to be okay as well on both SSD. > > [4.524355] BTRFS info (device dm-4): disk space caching is enabled > [4.524356] BTRFS info (device dm-4): has skinny extents > [4.563950] BTRFS info (device dm-4): enabling ssd optimizations > [5.463085] Console: switching to colour frame buffer device 240x67 > [5.492236] i915 :00:02.0: fb0: inteldrmfb frame buffer device > [5.882661] BTRFS info (device dm-3): disk space caching is enabled > [5.882664] BTRFS info (device dm-3): has skinny extents > [5.918579] SGI XFS with ACLs, security attributes, realtime, scrub, no > debug enabled > [5.927421] Adding 20971516k swap on /dev/mapper/sata-swap.
Priority:-2 > extents:1 across:20971516k SSDsc > [5.935051] XFS (sdb1): Mounting V5 Filesystem > [5.935218] XFS (sda1): Mounting V5 Filesystem > [5.961100] XFS (sda1): Ending clean mount > [5.970857] BTRFS info (device dm-3): enabling ssd optimizations > [5.972358] XFS (sdb1): Ending clean mount > [5.975955] WARNING: CPU: 1 PID: 1104 at fs/inode.c:342 inc_nlink+0x28/0x30 > [5.978271] Modules linked in: xfs msr pktcdvd intel_rapl > x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc > arc4 snd_hda_codec_conexant snd_hda_codec_generic iwldvm mac80211 iwlwifi > aesni_intel snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd cryptd > snd_hda_core glue_helper intel_cstate snd_hwdep intel_rapl_perf snd_pcm > pcspkr input_leds i915 sg cfg80211 snd_timer thinkpad_acpi nvram > drm_kms_helper snd soundcore tpm_tis tpm_tis_core drm rfkill ac tpm > i2c_algo_bit fb_sys_fops battery rng_core video syscopyarea sysfillrect > sysimgblt button evdev sbs sbshc coretemp bfq hdaps(O) tp_smapi(O) > thinkpad_ec(O) loop ecryptfs cbc sunrpc mcryptd sha256_ssse3 sha256_generic > encrypted_keys ip_tables x_tables autofs4 dm_mod > [5.990499] btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate > raid6_pq libcrc32c crc32c_generic sr_mod cdrom sd_mod hid_lenovo hid_generic > usbhid hid ahci libahci libata ehci_pci crc32c_intel psmouse i2c_i801 > sdhci_pci cqhci lpc_ich sdhci ehci_hcd e1000e scsi_mod i2c_core mfd_core > mmc_core usbcore usb_common thermal > [5.990529] CPU: 1 PID: 1104 Comm: mount Tainted: G O > 4.18.7-tp520 #63 > [5.990532] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET69WW (1.49 ) > 06/14/2018 > [6.000153] RIP: 0010:inc_nlink+0x28/0x30 > [6.000154] Code: 00 00 8b 47 48 85 c0 74 07 83 c0 01 89 47 48 c3 f6 87 a1 > 00 00 00 04 74 11 48 8b 47 28 f0 48 ff 88 98 04 00 00 8b 47 48 eb df <0f> 0b > eb eb 0f 1f 40 00 41 54 8b 0d 70 3f aa 00 48 ba eb 83 b5 80 > [6.008573] RSP: 0018:c90002283828 
EFLAGS: 00010246 > [6.008575] RAX: RBX: 8804018bed58 RCX: > 00022261 > [6.008576] RDX: 00022251 RSI: RDI: > 8804018bed58 > [6.008577] RBP: c90002283a50 R08: 0002a330 R09: > a02f3873 > [6.008578] R10: 30ff R11: 7763 R12: > 0011 > [6.008579] R13: 3d5f R14: 880403e19800 R15: > 88040a3c69a0 > [6.008580] FS: 7f071598f100() GS:88041e24() > knlGS: > [6.008581] CS: 0010 DS: ES: CR0: 80050033 > [6.008589] CR2: 7fda4fbf8218 CR3: 000403e42001 CR4: > 000606e0 > [6.008590] Call Trace: > [6.008614] replay_one_buffer+0x80e/0x890 [btrfs] > [6.008632] walk_up_log_tree+0x1dc/0x260 [btrfs] > [6.046858] walk_log_tree+0xaf/0x1e0 [btrfs] > [6.046872] btrfs_recover_log_trees+0x21c/0x410 [btrfs] > [6.046885] ? btree_read_extent_buffer_pages+0xcd/0x210 [btrfs] > [6.055941] ? fixup_inode_link_counts+0x170/0x170 [btrfs] > [6.055953] open_ctree+0x1a0d/0x1b60 [btrfs] > [6.055965] btrfs_mount_root+0x67b/0x760 [btrfs] > [6.065039] ? pcpu_alloc_area+0xdd/0x120 > [6.065040] ? pcpu_next_unpop+0x32/0x40 > [6.065052] mount_fs+0x36/0x162 > [6.065055] vfs_kern_mount.part.34+0x4f/0x120 > [6.065064] btrfs_mount+0x15f/0x890 [btrfs] > [6.065067] ? pcpu_cnt_pop_pages+0x40/0x50 > [6.065069] ? pcpu_alloc_area+0xdd/0x120 > [6.065071] ? pcpu_next_unpop+0x32/0x40 > [6.065073] ? cpumask_next+0x16/0x20 > [
Re: btrfs send receive ERROR: chown failed: No such file or directory
On Tue, Oct 2, 2018 at 7:02 AM Leonard Lausen wrote: > > Hello, > > does anyone have an idea about below issue? It is a severe issue as it > renders btrfs send / receive dysfunctional and it is not clear if there > may be a data corruption issue hiding in the current send / receive > code. > > Thank you. > > Best regards > Leonard > > Leonard Lausen writes: > > Hello! > > > > I observe the following issue with btrfs send | btrfs receive in a setup > > with 2 machines and 3 btrfs file-systems. All machines run Linux 4.18.9. > > Machine 1 runs btrfs-progs 4.17.1, machine 2 runs btrfs-progs 4.17 (via > > https://packages.debian.org/stretch-backports/btrfs-progs). > > > > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs > >btrfs send ... | ssh user@machine2 "btrfs receive /path1" > > 2) Machine 2 backups all subvolumes stored at /path1 to a second > >independent btrfs filesystem. Let /path1/rootsnapshot be the first > >snapshot stored at /path1 (ie. it has no Parent UUID). Let > >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot > >as a parent. Then > >btrfs send -v /path1/rootsnapshot | btrfs receive /path2 > >works without issues, but > >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs > > receive /path2 -v is useless. Use -vv, which will dump all commands. > >fails as follows: > >ERROR: chown o257-4639416-0 failed: No such file or directory > > > > No error is shown in dmesg. /path1 and /path2 denote two independent > > btrfs filesystems. > > > > Note that there was no issue with transferring incrementalsnapshot from > > machine 1 to machine 2. No error is shown in dmesg. > > > > Best regards > > Leonard -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: btrfs send receive ERROR: chown failed: No such file or directory
Hello, does anyone have an idea about below issue? It is a severe issue as it renders btrfs send / receive dysfunctional and it is not clear if there may be a data corruption issue hiding in the current send / receive code. Thank you. Best regards Leonard Leonard Lausen writes: > Hello! > > I observe the following issue with btrfs send | btrfs receive in a setup > with 2 machines and 3 btrfs file-systems. All machines run Linux 4.18.9. > Machine 1 runs btrfs-progs 4.17.1, machine 2 runs btrfs-progs 4.17 (via > https://packages.debian.org/stretch-backports/btrfs-progs). > > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs >btrfs send ... | ssh user@machine2 "btrfs receive /path1" > 2) Machine 2 backups all subvolumes stored at /path1 to a second >independent btrfs filesystem. Let /path1/rootsnapshot be the first >snapshot stored at /path1 (ie. it has no Parent UUID). Let >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot >as a parent. Then >btrfs send -v /path1/rootsnapshot | btrfs receive /path2 >works without issues, but >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs > receive /path2 >fails as follows: >ERROR: chown o257-4639416-0 failed: No such file or directory > > No error is shown in dmesg. /path1 and /path2 denote two independent > btrfs filesystems. > > Note that there was no issue with transferring incrementalsnapshot from > machine 1 to machine 2. No error is shown in dmesg. > > Best regards > Leonard
Re: btrfs panic problem
On 2018/09/25 16:31, Nikolay Borisov wrote: On 25.09.2018 11:20, sunny.s.zhang wrote: On 2018/09/20 02:36, Liu Bo wrote: On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote: Hi All, My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0
struct kmem_cache_cpu {
  freelist = 0x2026, <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

And I found there are two different btrfs inodes pointing to the same delayed_node. It means that the same slub object is used twice. I think this slub object is freed twice, and then the next pointer of this object points to itself, so we get the same object twice. When this object is used again, that breaks the freelist. The following code will make the delayed node be freed twice, but I haven't found which process does it:

Process A (btrfs_evict_inode)             Process B
call btrfs_remove_delayed_node            call btrfs_get_delayed_node
                                          node = ACCESS_ONCE(btrfs_inode->delayed_node);
BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);
                                          if (node) {
                                              atomic_inc(&node->refs);
                                              return node;
                                          }
..
btrfs_release_delayed_node(delayed_node);

By looking at the race, it seems the following commit has addressed it. btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_commit_-3Fid-3Dec35e48b286959991cdbb886f1bdeda4575c80b4=DwIBaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=mcYQsljqnoxPHJVaWVFtwsEEDhXdP3ULRlrPW_9etWQ=O7fQASCATWfOIp82M24gmi314geaUJDU-9erYxJ2ZEs=QtIafUNfkdy5BqfRQLhoHLY6o-Vk8-ZB0sD28mM-o_s= thanks, liubo I don't think so. That patch resolved the problem of radix_tree_lookup; I don't think it can resolve my problem, where the race occurs after ACCESS_ONCE(btrfs_inode->delayed_node).
Because, if ACCESS_ONCE(btrfs_inode->delayed_node) return the node, then the function of btrfs_get_delayed_node will return, and don't continue. Can you reproduce the problem on an upstream kernel with added delays? The original report is from some RHEL-based distro (presumably oracle unbreakable linux) so there is no indication currently that this is a genuine problem in upstream kernels. Not yet. I will reproduce later. But I don't have any clue about this race now. Thanks, Sunny Thanks, Sunny 1313 void btrfs_remove_delayed_node(struct inode *inode) 1314 { 1315 struct btrfs_delayed_node *delayed_node; 1316 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); 1318 if (!delayed_node) 1319 return; 1320 1321 BTRFS_I(inode)->delayed_node = NULL; 1322 btrfs_release_delayed_node(delayed_node); 1323 } 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode) 88 { 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 90 struct btrfs_root *root = btrfs_inode->root; 91 u64 ino = btrfs_ino(inode); 92 struct btrfs_delayed_node *node; 93 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); 95 if (node) { 96 atomic_inc(>refs); 97 return node; 98 } Thanks, Sunny PS: panic informations PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" #0 [88130404f940] machine_kexec at 8105ec10 #1 [88130404f9b0] crash_kexec at 811145b8 #2 [88130404fa80] oops_end at 8101a868 #3 [88130404fab0] no_context at 8106ea91 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 #6 [88130404fb60] __do_page_fault at 8106f328 #7 [88130404fbd0] do_page_fault at 8106f637 #8 [88130404fc10] page_fault at 816f6308 [exception RIP: kmem_cache_alloc+121] RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 RAX: RBX: RCX: 01c32b76 RDX: 01c32b75 RSI: RDI: 000224b0 RBP: 88130404fd08 R8: 887e7d7a24b0 R9: R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 R13: 2026 R14: 887e3e230a00 R15: a01abf49 ORIG_RAX: CS: 0010 SS: 0018 #9 [88130404fd10] 
btrfs_get_or_create_delayed_node at a01abf49 [btrfs] #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs] #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] #14 [88130404fe50] touch_atime at 812286d3 #15 [88130404fe90]
Re: btrfs panic problem
On 25.09.2018 11:20, sunny.s.zhang wrote: > > 在 2018年09月20日 02:36, Liu Bo 写道: >> On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang >> wrote: >>> Hi All, >>> >>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by >>> btrfs_get_or_create_delayed_node. >>> >>> I found that the freelist of the slub is wrong. >>> >>> crash> struct kmem_cache_cpu 887e7d7a24b0 >>> >>> struct kmem_cache_cpu { >>> freelist = 0x2026, <<< the value is id of one inode >>> tid = 29567861, >>> page = 0xea0132168d00, >>> partial = 0x0 >>> } >>> >>> And, I found there are two different btrfs inodes pointing >>> delayed_node. It >>> means that the same slub is used twice. >>> >>> I think this slub is freed twice, and then the next pointer of this slub >>> point itself. So we get the same slub twice. >>> >>> When use this slub again, that break the freelist. >>> >>> Folloing code will make the delayed node being freed twice. But I don't >>> found what is the process. >>> >>> Process A (btrfs_evict_inode) Process B >>> >>> call btrfs_remove_delayed_node call btrfs_get_delayed_node >>> >>> node = ACCESS_ONCE(btrfs_inode->delayed_node); >>> >>> BTRFS_I(inode)->delayed_node = NULL; >>> btrfs_release_delayed_node(delayed_node); >>> >>> if (node) { >>> atomic_inc(>refs); >>> return node; >>> } >>> >>> .. >>> >>> btrfs_release_delayed_node(delayed_node); >>> >> By looking at the race, seems the following commit has addressed it. >> >> btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes >> https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_commit_-3Fid-3Dec35e48b286959991cdbb886f1bdeda4575c80b4=DwIBaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=mcYQsljqnoxPHJVaWVFtwsEEDhXdP3ULRlrPW_9etWQ=O7fQASCATWfOIp82M24gmi314geaUJDU-9erYxJ2ZEs=QtIafUNfkdy5BqfRQLhoHLY6o-Vk8-ZB0sD28mM-o_s= >> >> >> thanks, >> liubo > > I don't think so. > this patch has resolved the problem of radix_tree_lookup. 
I don't think > this can resolve my problem that race occur after > ACCESS_ONCE(btrfs_inode->delayed_node). > Because, if ACCESS_ONCE(btrfs_inode->delayed_node) return the node, then > the function of btrfs_get_delayed_node will return, and don't continue. Can you reproduce the problem on an upstream kernel with added delays? The original report is from some RHEL-based distro (presumably oracle unbreakable linux) so there is no indication currently that this is a genuine problem in upstream kernels. > > Thanks, > Sunny > >> >>> 1313 void btrfs_remove_delayed_node(struct inode *inode) >>> 1314 { >>> 1315 struct btrfs_delayed_node *delayed_node; >>> 1316 >>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); >>> 1318 if (!delayed_node) >>> 1319 return; >>> 1320 >>> 1321 BTRFS_I(inode)->delayed_node = NULL; >>> 1322 btrfs_release_delayed_node(delayed_node); >>> 1323 } >>> >>> >>> 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct >>> inode >>> *inode) >>> 88 { >>> 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); >>> 90 struct btrfs_root *root = btrfs_inode->root; >>> 91 u64 ino = btrfs_ino(inode); >>> 92 struct btrfs_delayed_node *node; >>> 93 >>> 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); >>> 95 if (node) { >>> 96 atomic_inc(>refs); >>> 97 return node; >>> 98 } >>> >>> >>> Thanks, >>> >>> Sunny >>> >>> >>> PS: >>> >>> >>> >>> panic informations >>> >>> PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" >>> #0 [88130404f940] machine_kexec at 8105ec10 >>> #1 [88130404f9b0] crash_kexec at 811145b8 >>> #2 [88130404fa80] oops_end at 8101a868 >>> #3 [88130404fab0] no_context at 8106ea91 >>> #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d >>> #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 >>> #6 [88130404fb60] __do_page_fault at 8106f328 >>> #7 [88130404fbd0] do_page_fault at 8106f637 >>> #8 [88130404fc10] page_fault at 816f6308 >>> [exception RIP: kmem_cache_alloc+121] >>> RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 
00010286 >>> RAX: RBX: RCX: 01c32b76 >>> RDX: 01c32b75 RSI: RDI: 000224b0 >>> RBP: 88130404fd08 R8: 887e7d7a24b0 R9: >>> R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 >>> R13: 2026 R14: 887e3e230a00 R15: a01abf49 >>> ORIG_RAX: CS: 0010 SS: 0018 >>> #9 [88130404fd10] btrfs_get_or_create_delayed_node at >>> a01abf49 >>> [btrfs] >>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 >>> [btrfs]
Re: btrfs panic problem
On 2018/09/20 00:12, Nikolay Borisov wrote: On 19.09.2018 02:53, sunny.s.zhang wrote: Hi Duncan, Thank you for your advice. I understand what you mean, but I have reviewed the latest btrfs code and I think the issue still exists. At line 71, if btrfs_get_delayed_node runs past this line, then switches to another process which runs past line 1282 and releases the delayed node at the end, and then switches back to btrfs_get_delayed_node, it finds that the node is not NULL and uses it as normal. That means we use freed memory; at some point this memory will be freed again. Latest code as below:

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280     struct btrfs_delayed_node *delayed_node;
1281
1282     delayed_node = READ_ONCE(inode->delayed_node);
1283     if (!delayed_node)
1284         return;
1285
1286     inode->delayed_node = NULL;
1287     btrfs_release_delayed_node(delayed_node);
1288 }

64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
65         struct btrfs_inode *btrfs_inode)
66 {
67     struct btrfs_root *root = btrfs_inode->root;
68     u64 ino = btrfs_ino(btrfs_inode);
69     struct btrfs_delayed_node *node;
70
71     node = READ_ONCE(btrfs_inode->delayed_node);
72     if (node) {
73         refcount_inc(&node->refs);
74         return node;
75     }
76
77     spin_lock(&root->inode_lock);
78     node = radix_tree_lookup(&root->delayed_nodes_tree, ino);

Your analysis is correct; however, it's missing one crucial point - btrfs_remove_delayed_node is called only from btrfs_evict_inode. And inodes are evicted when all other references have been dropped. Check the code in evict_inodes() - inodes are added to the dispose list when their i_count is 0, at which point there should be no references to this inode. This invalidates your analysis... Thanks. Yes, I know this, and I know that another process cannot use this inode if the inode is in the I_FREEING status. But Chris has fixed a bug which is similar to this one and was found in production. It means that this can occur under some condition.
btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_commit_-3Fid-3Dec35e48b286959991cdbb886f1bdeda4575c80b4=DwIBaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=mcYQsljqnoxPHJVaWVFtwsEEDhXdP3ULRlrPW_9etWQ=O7fQASCATWfOIp82M24gmi314geaUJDU-9erYxJ2ZEs=QtIafUNfkdy5BqfRQLhoHLY6o-Vk8-ZB0sD28mM-o_s= On 2018/09/18 13:05, Duncan wrote: sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted: My OS(4.1.12) panic in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong. [Not a dev, just a btrfs list regular and user, myself. But here's a general btrfs list recommendations reply...] You appear to mean kernel 4.1.12 -- confirmed by the version reported in the posted dump: 4.1.12-112.14.13.el6uek.x86_64 OK, so from the perspective of this forward-development-focused list, kernel 4.1 is pretty ancient history, but you do have a number of options. First let's consider the general situation. Most people choose an enterprise distro for supported stability, and that's certainly a valid thing to want. However, btrfs, while now reaching early maturity for the basics (single device in single or dup mode, and multi-device in single/raid0/1/10 modes, note that raid56 mode is newer and less mature), remains under quite heavy development, and keeping reasonably current is recommended for that reason. So you chose an enterprise distro presumably to lock in supported stability for several years, but you chose a filesystem, btrfs, that's still under heavy development, with reasonably current kernels and userspace recommended as tending to have the known bugs fixed.
There's a bit of a conflict there, and the /general/ recommendation would thus be to consider whether one or the other of those choices are inappropriate for your use-case, because it's really quite likely that if you really want the stability of an enterprise distro and kernel, that btrfs isn't as stable a filesystem as you're likely to want to match with it. Alternatively, if you want something newer to match the still under heavy development btrfs, you very likely want a distro that's not focused on years-old stability just for the sake of it. One or the other is likely to be a poor match for your needs, and choosing something else that's a better match is likely to be a much better experience for you. But perhaps you do have reason to want to run the newer and not quite to traditional enterprise-distro level stability btrfs, on an otherwise older and very stable enterprise distro. That's fine, provided you know what you're getting yourself into, and are prepared to deal with it. In
Re: btrfs panic problem
On 2018/09/20 02:36, Liu Bo wrote: On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote: Hi All, My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0
struct kmem_cache_cpu {
  freelist = 0x2026, <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

And I found there are two different btrfs inodes pointing to the same delayed_node. It means that the same slub object is used twice. I think this slub object is freed twice, and then the next pointer of this object points to itself, so we get the same object twice. When this object is used again, that breaks the freelist. The following code will make the delayed node be freed twice, but I haven't found which process does it:

Process A (btrfs_evict_inode)             Process B
call btrfs_remove_delayed_node            call btrfs_get_delayed_node
                                          node = ACCESS_ONCE(btrfs_inode->delayed_node);
BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);
                                          if (node) {
                                              atomic_inc(&node->refs);
                                              return node;
                                          }
..
btrfs_release_delayed_node(delayed_node);

By looking at the race, it seems the following commit has addressed it. btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_commit_-3Fid-3Dec35e48b286959991cdbb886f1bdeda4575c80b4=DwIBaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=mcYQsljqnoxPHJVaWVFtwsEEDhXdP3ULRlrPW_9etWQ=O7fQASCATWfOIp82M24gmi314geaUJDU-9erYxJ2ZEs=QtIafUNfkdy5BqfRQLhoHLY6o-Vk8-ZB0sD28mM-o_s= thanks, liubo I don't think so. That patch resolved the problem of radix_tree_lookup; I don't think it can resolve my problem, where the race occurs after ACCESS_ONCE(btrfs_inode->delayed_node). Because, if ACCESS_ONCE(btrfs_inode->delayed_node) returns the node, then btrfs_get_delayed_node will return and not continue.
Thanks, Sunny 1313 void btrfs_remove_delayed_node(struct inode *inode) 1314 { 1315 struct btrfs_delayed_node *delayed_node; 1316 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); 1318 if (!delayed_node) 1319 return; 1320 1321 BTRFS_I(inode)->delayed_node = NULL; 1322 btrfs_release_delayed_node(delayed_node); 1323 } 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode) 88 { 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 90 struct btrfs_root *root = btrfs_inode->root; 91 u64 ino = btrfs_ino(inode); 92 struct btrfs_delayed_node *node; 93 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); 95 if (node) { 96 atomic_inc(>refs); 97 return node; 98 } Thanks, Sunny PS: panic informations PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" #0 [88130404f940] machine_kexec at 8105ec10 #1 [88130404f9b0] crash_kexec at 811145b8 #2 [88130404fa80] oops_end at 8101a868 #3 [88130404fab0] no_context at 8106ea91 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 #6 [88130404fb60] __do_page_fault at 8106f328 #7 [88130404fbd0] do_page_fault at 8106f637 #8 [88130404fc10] page_fault at 816f6308 [exception RIP: kmem_cache_alloc+121] RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 RAX: RBX: RCX: 01c32b76 RDX: 01c32b75 RSI: RDI: 000224b0 RBP: 88130404fd08 R8: 887e7d7a24b0 R9: R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 R13: 2026 R14: 887e3e230a00 R15: a01abf49 ORIG_RAX: CS: 0010 SS: 0018 #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs] #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs] #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] #14 [88130404fe50] touch_atime at 812286d3 #15 [88130404fe90] iterate_dir at 81221929 #16 [88130404fee0] sys_getdents64 at 81221a19 #17 [88130404ff50] system_call_fastpath at 816f2594 RIP: 006b68e4 
RSP: 00c866259080 RFLAGS: 0246 RAX: ffda RBX: 00c828dbbe00 RCX: 006b68e4 RDX: 1000 RSI: 00c83da14000 RDI: 0011 RBP: R8: R9: R10:
Re: btrfs problems
Adrian Bastholm posted on Thu, 20 Sep 2018 23:35:57 +0200 as excerpted: > Thanks a lot for the detailed explanation. > Aabout "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would You recommend running the sausage method > instead, with "-d single" for safety ? I'm guessing that if one of the > drives dies the data is completely lost Another variant I was > considering is running a raid1 mirror on two of the drives and maybe a > subvolume on the third, for less important stuff Agreed with CMurphy's reply, but he didn't mention... As I wrote elsewhere recently, don't remember if it was in a reply to you before you tried zfs and came back, or to someone else, so I'll repeat here, briefer this time... Keep in mind that on btrfs, it's possible (and indeed the default with multiple devices) to run data and metadata at different raid levels. IMO, as long as you're following an appropriate backup policy that backs up anything valuable enough to be worth the time/trouble/resources of doing so, so if you /do/ lose the array you still have a backup of anything you considered valuable enough to worry about (and that caveat is always the case, no matter where or how it's stored, value of data is in practice defined not by arbitrary claims but by the number of backups it's considered worth having of it)... With that backups caveat, I'm now confident /enough/ about raid56 mode to be comfortable cautiously recommending it for data, tho I'd still /not/ recommend it for metadata, which I'd recommend should remain the multi- device default raid1 level. 
That way, you're only risking a limited amount of raid5 data to the not yet as mature and well tested raid56 mode, the metadata remains protected by the more mature raid1 mode, and if something does go wrong, it's much more likely to be only a few files lost instead of the entire filesystem, as is at risk if your metadata is raid56 as well, the metadata including checksums will be intact so scrub should tell you what files are bad, and if those few files are valuable they'll be on the backup and easy enough to restore, compared to restoring the entire filesystem. But for most use-cases, metadata should be relatively small compared to data, so duplicating metadata as raid1 while doing raid5 for data should go much easier on the capacity needs than raid1 for both would. Tho I'd still recommend raid1 data as well for higher maturity and tested ability to use the good copy to rewrite the bad one if one copy goes bad (in theory, raid56 mode can use parity to rewrite as well, but that's not yet as well tested and there's still the narrow degraded-mode crash write hole to worry about), if it's not cost-prohibitive for the amount of data you need to store. But for people on a really tight budget or who are storing double-digit TB of data or more, I can understand why they prefer raid5, and I do think raid5 is stable enough for data now, as long as the metadata remains raid1, AND they're actually executing on a good backup policy. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
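Duncan's suggested mix (raid5 for data, raid1 for metadata) is expressed directly at mkfs time. A sketch with hypothetical device names (run as root):

```shell
# Data in raid5, metadata kept in the more mature raid1 profile:
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

# An existing filesystem (mounted at /mnt) can be converted to the
# same split with a balance:
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt
```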
Re: btrfs problems
On 2018-09-20 05:35 PM, Adrian Bastholm wrote: > Thanks a lot for the detailed explanation. > About "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would You recommend running the sausage method > instead, with "-d single" for safety ? I'm guessing that if one of the > drives dies the data is completely lost > Another variant I was considering is running a raid1 mirror on two of > the drives and maybe a subvolume on the third, for less important > stuff In case you were not aware, it's perfectly acceptable with BTRFS to use RAID1 over 3 devices. Even more amazing, regardless of how many devices you start with, 2, 3, 4, whatever, you can add a single drive to the array to increase capacity (at 50%, of course; i.e., adding a 4TB drive will give you 2TB usable space, assuming the other drives add up to at least 4TB to match it.)
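The three-device raid1 setup described above, plus growing it later by a single drive, looks roughly like this (device names are hypothetical; run as root):

```shell
# raid1 for both data and metadata across three devices:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
mount /dev/sdb /mnt

# Later, add a single drive to increase capacity, then rebalance so
# existing chunks are redistributed across all four devices:
btrfs device add /dev/sde /mnt
btrfs balance start /mnt
```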
Re: btrfs problems
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm wrote: > > Thanks a lot for the detailed explanation. > Aabout "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. Yep. I'm referring to the drives, their firmware, cables, logic board, its firmware, the power supply, power, etc. Btrfs is by nature intolerant of corruption. Other file systems are more tolerant because they don't know about it (although recent versions of XFS and ext4 are now defaulting to checksummed metadata and journals). >three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would You recommend running the sausage method > instead, with "-d single" for safety ? I'm guessing that if one of the > drives dies the data is completely lost > Another variant I was considering is running a raid1 mirror on two of > the drives and maybe a subvolume on the third, for less important > stuff RAID does not substantially reduce the chances of data loss. It's not anything like a backup. It's an uptime enhancer. If you have backups, and your primary storage dies, of course you can restore from backup no problem, but it takes time and while the restore is happening, you're not online - uptime is killed. If that's a negative, might want to run RAID so you can keep working during the degraded period, and instead of a restore you're doing a rebuild. But of course there is a chance of failure during the degraded period. So you have to have a backup anyway. At least with Btrfs/ZFS, there is another reason to run with some replication like raid1 or raid5 and that's so that if there's corruption or a bad sector, Btrfs doesn't just detect it, it can fix it up with the good copy. For what it's worth, make sure the drives have lower SCT ERC time than the SCSI command timer. This is the same for Btrfs as it is for md and LVM RAID. 
The command timer default is 30 seconds, and most drives ship with SCT ERC disabled and very high internal recovery times, well over 30 seconds. So either set SCT ERC to something like 70 deciseconds, or increase the command timer to something like 120 or 180 seconds (either is absurdly high, but what you want is for the drive to eventually give up and report a discrete error, which Btrfs can do something about, rather than trigger a SATA link reset, which Btrfs can't do anything about). -- Chris Murphy
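Concretely, the two knobs look like this (device name is an example; the sysfs setting does not survive a reboot, so it usually goes in a udev rule or boot script):

```shell
# Check whether the drive supports SCT ERC and what it is set to
# (values are reported in deciseconds).
smartctl -l scterc /dev/sda

# Option 1: cap the drive's internal error recovery at 7 seconds
# (70 deciseconds) for both reads and writes.
smartctl -l scterc,70,70 /dev/sda

# Option 2: if the drive can't enable SCT ERC, raise the kernel's
# SCSI command timer instead (in seconds).
echo 180 > /sys/block/sda/device/timeout
```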
Re: btrfs problems
Thanks a lot for the detailed explanation. About "stable hardware/no lying hardware": I'm not running any raid hardware, was planning on just software raid, three drives glued together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would this be a safer bet, or would you recommend running the sausage method instead, with "-d single" for safety? I'm guessing that if one of the drives dies the data is completely lost. Another variant I was considering is running a raid1 mirror on two of the drives and maybe a subvolume on the third, for less important stuff. BR Adrian On Thu, Sep 20, 2018 at 9:39 PM Chris Murphy wrote: > > On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm wrote: > > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote: > > > >> > >> Then I strongly recommend to use the latest upstream kernel and progs > >> for btrfs. (thus using Debian Testing) > >> > >> And if anything went wrong, please report asap to the mail list. > >> > >> Especially for fs corruption, that's the ghost I'm always chasing for. > >> So if any corruption happens again (although I hope it won't happen), I > >> may have a chance to catch it. > > > > You got it > >> > > >> >> Anyway, enjoy your stable fs even it's not btrfs > > > >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't > >> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess > >> > after the dust settled I realize I like the flexibility of BTRFS. > >> > > > I'm back to btrfs. > > > >> From the code aspect, the biggest difference is the chunk layout. > >> Due to the ext* block group usage, each block group header (except some > >> sparse bg) is always used, thus btrfs can't use them. > >> > >> This leads to highly fragmented chunk layout. > > > > The only thing I really understood is "highly fragmented" == not good > > . I might need to google these "chunk" thingies > > Chunks are synonyms with block groups. They're like a super extent, or > extent of extents. 
> > The block group is how Btrfs abstracts the logical address used most > everywhere in Btrfs land, and device + physical location of extents. > It's how a file is referenced only by on logical address, and doesn't > need to know either where the extent is located, or how many copies > there are. The block group allocation profile is what determines if > there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk > and where the copies are located. It's also fundamental to how device > add, remove, replace, file system resize, and balance all interrelate. > > > >> If your primary concern is to make the fs as stable as possible, then > >> keep snapshots to a minimal amount, avoid any functionality you won't > >> use, like qgroup, routinely balance, RAID5/6. > > > > So, is RAID5 stable enough ? reading the wiki there's a big fat > > warning about some parity issues, I read an article about silent > > corruption (written a while back), and chris says he can't recommend > > raid56 to mere mortals. > > Depends on how you define stable. In recent kernels it's stable on > stable hardware, i.e. no lying hardware (actually flushes when it > claims it has), no power failures, and no failed devices. Of course > it's designed to help protect against a clear loss of a device, but > there's tons of stuff here that's just not finished including ejecting > bad devices from the array like md and lvm raids will do. Btrfs will > just keep trying, through all the failures. There are some patches to > moderate this but I don't think they're merged yet. > > You'd also want to be really familiar with how to handle degraded > operation, if you're going to depend on it, and how to replace a bad > device. Last I refreshed my memory on it, it's advised to use "btrfs > device add" followed by "btrfs device remove" for raid56; whereas > "btrfs replace" is preferred for all other profiles. I'm not sure if > the "btrfs replace" issues with parity raid were fixed. 
> > Metadata as raid56 shows a lot more problem reports than metadata > raid1, so there's something goofy going on in those cases. I'm not > sure how well understood they are. But other people don't have > problems with it. > > It's worth looking through the archives about some things. Btrfs > raid56 isn't exactly perfectly COW, there is read-modify-write code > that means there can be overwrites. I vaguely recall that it's COW in > the logical layer, but the physical writes can end up being RMW or not > for sure COW. > > > > -- > Chris Murphy -- Vänliga hälsningar / Kind regards, Adrian Bastholm ``I would change the world, but they won't give me the sourcecode``
Re: btrfs problems
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm wrote: > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote: > >> >> Then I strongly recommend to use the latest upstream kernel and progs >> for btrfs. (thus using Debian Testing) >> >> And if anything went wrong, please report asap to the mail list. >> >> Especially for fs corruption, that's the ghost I'm always chasing for. >> So if any corruption happens again (although I hope it won't happen), I >> may have a chance to catch it. > > You got it >> > >> >> Anyway, enjoy your stable fs even it's not btrfs > >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't >> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess >> > after the dust settled I realize I like the flexibility of BTRFS. >> > > I'm back to btrfs. > >> From the code aspect, the biggest difference is the chunk layout. >> Due to the ext* block group usage, each block group header (except some >> sparse bg) is always used, thus btrfs can't use them. >> >> This leads to highly fragmented chunk layout. > > The only thing I really understood is "highly fragmented" == not good > . I might need to google these "chunk" thingies Chunks are synonymous with block groups. They're like a super extent, or extent of extents. The block group is how Btrfs abstracts the logical address used most everywhere in Btrfs land, and device + physical location of extents. It's how a file is referenced only by one logical address, and doesn't need to know either where the extent is located, or how many copies there are. The block group allocation profile is what determines if there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk and where the copies are located. It's also fundamental to how device add, remove, replace, file system resize, and balance all interrelate. 
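For the curious, you can see these chunks/block groups and their allocation profiles directly with btrfs-progs (mount point and device are examples):

```shell
# Per-type allocation summary: how much space Data/Metadata/System
# chunks occupy and which profile each uses (single, RAID1, RAID5, ...).
btrfs filesystem df /mnt

# More detail, including per-device allocation:
btrfs filesystem usage /mnt

# The raw chunk tree itself: the logical-address -> device/physical
# mapping that the block-group abstraction is built on.
btrfs inspect-internal dump-tree -t chunk /dev/sdb
```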
>> If your primary concern is to make the fs as stable as possible, then >> keep snapshots to a minimal amount, avoid any functionality you won't >> use, like qgroup, routinely balance, RAID5/6. > > So, is RAID5 stable enough ? reading the wiki there's a big fat > warning about some parity issues, I read an article about silent > corruption (written a while back), and chris says he can't recommend > raid56 to mere mortals. Depends on how you define stable. In recent kernels it's stable on stable hardware, i.e. no lying hardware (actually flushes when it claims it has), no power failures, and no failed devices. Of course it's designed to help protect against a clear loss of a device, but there's tons of stuff here that's just not finished including ejecting bad devices from the array like md and lvm raids will do. Btrfs will just keep trying, through all the failures. There are some patches to moderate this but I don't think they're merged yet. You'd also want to be really familiar with how to handle degraded operation, if you're going to depend on it, and how to replace a bad device. Last I refreshed my memory on it, it's advised to use "btrfs device add" followed by "btrfs device remove" for raid56; whereas "btrfs replace" is preferred for all other profiles. I'm not sure if the "btrfs replace" issues with parity raid were fixed. Metadata as raid56 shows a lot more problem reports than metadata raid1, so there's something goofy going on in those cases. I'm not sure how well understood they are. But other people don't have problems with it. It's worth looking through the archives about some things. Btrfs raid56 isn't exactly perfectly COW, there is read-modify-write code that means there can be overwrites. I vaguely recall that it's COW in the logical layer, but the physical writes can end up being RMW or not for sure COW. -- Chris Murphy
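A sketch of that degraded-operation workflow with btrfs-progs (device names and devid are examples; verify against current documentation before running on real data, since this is destructive territory):

```shell
# Mount a redundant filesystem that has lost a device.
mount -o degraded /dev/sdb /mnt

# For most profiles, replace the failed device directly; the numeric
# devid of the missing device comes from 'btrfs filesystem show'.
btrfs replace start 2 /dev/sde /mnt

# For raid56, the advice quoted above is add-then-remove instead:
btrfs device add /dev/sde /mnt
btrfs device remove missing /mnt
```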
Re: btrfs send hangs after partial transfer and blocks all IO
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann wrote: > Am 13.9.2018 14:35, schrieb Nikolay Borisov: >> >> On 13.09.2018 15:30, Jürgen Herrmann wrote: >>> >>> OK, I will install kdump later and perform a dump after the hang. >>> >>> One more noob question beforehand: does this dump contain sensitive >>> information, for example the luks encryption key for the disk etc? A >>> Google search only brings up one relevant search result which can only >>> be viewed with a redhat subscription... >> >> >> >> So a kdump will dump the kernel memory so it's possible that the LUKS >> encryption keys could be extracted from that image. Bummer, it's >> understandable why you wouldn't want to upload it :). In this case you'd >> have to install also the 'crash' utility to open the crashdump and >> extract the calltrace of the btrfs process. The rough process should be : >> >> >> crash 'path to vmlinux' 'path to vmcore file', then once inside the >> crash utility : >> >> set <pid>, you can acquire the pid by issuing 'ps' >> which will give you a ps-like output of all running processes at the >> time of crash. After the context has been set you can run 'bt' which >> will give you a backtrace of the send process. >> >> >> >>> >>> Best regards, >>> Jürgen >>> >>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov >>> : >>> On 13.09.2018 14:50, Jürgen Herrmann wrote: > > I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also > after the hang because I started the loop before the hang. The dmesg > output should show the hanging tasks from second 346 on or so. Still > not > useful? > So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. 
Furthermore I don't see any userspace processes being hung in kernel space. Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done >>> >>> > I know this is probably not the correct list to ask this question but maybe > someone of the devs can point me to the right list? > > I cannot get kdump to work. The crashkernel is loaded and everything is > setup for it afaict. I asked a question on this over at stackexchange but no > answer yet. > https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing > > So i did a little digging and added some debug printk() statements to see > whats going on and it seems that panic() is never called. maybe the second > stack trace is the reason? > Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN > > Could someone please tell me where I can report this problem and get some > help on this topic? Try kexec mailing list. They handle kdump. http://lists.infradead.org/mailman/listinfo/kexec -- Chris Murphy
Re: btrfs problems
On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote: > > Then I strongly recommend to use the latest upstream kernel and progs > for btrfs. (thus using Debian Testing) > > And if anything went wrong, please report asap to the mail list. > > Especially for fs corruption, that's the ghost I'm always chasing for. > So if any corruption happens again (although I hope it won't happen), I > may have a chance to catch it. You got it > > > >> Anyway, enjoy your stable fs even it's not btrfs > > My new stable fs is too rigid. Can't grow it, can't shrink it, can't > > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess > > after the dust settled I realize I like the flexibility of BTRFS. > > I'm back to btrfs. > From the code aspect, the biggest difference is the chunk layout. > Due to the ext* block group usage, each block group header (except some > sparse bg) is always used, thus btrfs can't use them. > > This leads to highly fragmented chunk layout. The only thing I really understood is "highly fragmented" == not good . I might need to google these "chunk" thingies > We doesn't have error report about such layout yet, but if you want > everything to be as stable as possible, I still recommend to use a newly > created fs. I guess I'll stick with ext4 on the rootfs > > Another thing is I'd like to see a "first steps after getting started > > " section in the wiki. Something like take your first snapshot, back > > up, how to think when running it - can i just set some cron jobs and > > forget about it, or does it need constant attention, and stuff like > > that. > > There are projects do such things automatically, like snapper. > > If your primary concern is to make the fs as stable as possible, then > keep snapshots to a minimal amount, avoid any functionality you won't > use, like qgroup, routinely balance, RAID5/6. So, is RAID5 stable enough ? 
reading the wiki there's a big fat warning about some parity issues, I read an article about silent corruption (written a while back), and chris says he can't recommend raid56 to mere mortals. > And keep the necessary btrfs specific operations to minimal, like > subvolume/snapshot (and don't keep too many snapshots, say over 20), > shrink, send/receive. > > Thanks, > Qu > > > > > BR Adrian > > > > > -- Vänliga hälsningar / Kind regards, Adrian Bastholm ``I would change the world, but they won't give me the sourcecode``
Re: btrfs filesystem show takes a long time
Hi, > You may try to run the show command under strace to see where it blocks. any recommendations for strace options? On Friday, September 14, 2018 1:25:30 PM CEST David Sterba wrote: > Hi, > > thanks for the report, I've forwarded it to the issue tracker > https://github.com/kdave/btrfs-progs/issues/148 > > The show command uses the information provided by blkid, that presumably > caches that. The default behaviour of 'fi show' is to skip mount checks, > so the delays are likely caused by blkid, but that's not the only > possible reason. > > You may try to run the show command under strace to see where it blocks. >
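Not speaking for David, but a common starting set of strace options for finding where a command blocks looks like this (output file name is an example):

```shell
# -f: follow child processes; -tt: absolute timestamps; -T: time spent
# in each syscall (appended as <seconds>); -y: decode fds to paths.
strace -f -tt -T -y -o show.strace btrfs filesystem show

# Then look for the syscalls with the largest <...> durations near
# the end of show.strace - those are where it sat blocked.
```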
Re: btrfs send hangs after partial transfer and blocks all IO
Am 13.9.2018 14:35, schrieb Nikolay Borisov: On 13.09.2018 15:30, Jürgen Herrmann wrote: OK, I will install kdump later and perform a dump after the hang. One more noob question beforehand: does this dump contain sensitive information, for example the luks encryption key for the disk etc? A Google search only brings up one relevant search result which can only be viewed with a redhat subscription... So a kdump will dump the kernel memory so it's possible that the LUKS encryption keys could be extracted from that image. Bummer, it's understandable why you wouldn't want to upload it :). In this case you'd have to install also the 'crash' utility to open the crashdump and extract the calltrace of the btrfs process. The rough process should be: crash 'path to vmlinux' 'path to vmcore file', then once inside the crash utility: set <pid>, you can acquire the pid by issuing 'ps' which will give you a ps-like output of all running processes at the time of crash. After the context has been set you can run 'bt' which will give you a backtrace of the send process. Best regards, Jürgen Am 13. September 2018 14:02:11 schrieb Nikolay Borisov : On 13.09.2018 14:50, Jürgen Herrmann wrote: I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also after the hang because I started the loop before the hang. The dmesg output should show the hanging tasks from second 346 on or so. Still not useful? So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. Furthermore I don't see any userspace processes being hung in kernel space. 
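For reference, Nikolay's procedure as a session sketch (paths are examples; the vmlinux must match the crashed kernel and carry debug symbols):

```shell
# Open the dump with the matching debug vmlinux.
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore

# Then, at the crash> prompt:
#   crash> ps | grep btrfs    # find the pid of the hung 'btrfs send' task
#   crash> set <pid>          # switch context to that task
#   crash> bt                 # print its kernel backtrace
```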
Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done I know this is probably not the correct list to ask this question but maybe someone of the devs can point me to the right list? I cannot get kdump to work. The crashkernel is loaded and everything is setup for it afaict. I asked a question on this over at stackexchange but no answer yet. https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing So i did a little digging and added some debug printk() statements to see whats going on and it seems that panic() is never called. maybe the second stack trace is the reason? Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN Could someone please tell me where I can report this problem and get some help on this topic? Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: btrfs send hangs after partial transfer and blocks all IO
Am 13.9.2018 18:22, schrieb Chris Murphy: (resend to all) On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov wrote: On 13.09.2018 18:30, Chris Murphy wrote: This is the 2nd or 3rd thread containing hanging btrfs send, with kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" and the other I can't find at the moment. In that case though the hang is reproducible in 4.14.x and weirdly it only happens when a snapshot contains (perhaps many) reflinks. Scrub and check lowmem find nothing wrong. I have snapshots with a few reflinks (cp --reflink and also deduplication), and I see maybe 15-30 second hangs where nothing is apparently happening (in top or iotop), but I'm also not seeing any blocked tasks or high CPU usage. Perhaps in my case it's just recovering quickly. Are there any kernel config options in "# Debug Lockups and Hangs" that might hint at what's going on? Some of these are enabled in Fedora debug kernels, which are built practically daily, e.g. right now the latest in the build system is 4.19.0-0.rc3.git2.1 - which translates to git 54eda9df17f3. If it's a lock-related problem then you need Lock Debugging => Lock debugging: prove locking correctness OK looks like that's under a different section as CONFIG_PROVE_LOCKING which is enabled on Fedora debug kernels. # Debug Lockups and Hangs CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 # Lock Debugging (spinlocks, mutexes, etc...) CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_LOCK_STAT=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y # CONFIG_DEBUG_LOCKDEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set CONFIG_LOCK_TORTURE_TEST=m Hello again! 
I have CONFIG_PROVE_LOCKING enabled in my kernel, but no change to the observed behaviour. Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: btrfs panic problem
On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote: > Hi All, > > My OS(4.1.12) panic in kmem_cache_alloc, which is called by > btrfs_get_or_create_delayed_node. > > I found that the freelist of the slub is wrong. > > crash> struct kmem_cache_cpu 887e7d7a24b0 > > struct kmem_cache_cpu { > freelist = 0x2026, <<< the value is id of one inode > tid = 29567861, > page = 0xea0132168d00, > partial = 0x0 > } > > And, I found there are two different btrfs inodes pointing delayed_node. It > means that the same slub is used twice. > > I think this slub is freed twice, and then the next pointer of this slub > point itself. So we get the same slub twice. > > When use this slub again, that break the freelist. > > Folloing code will make the delayed node being freed twice. But I don't > found what is the process. > > Process A (btrfs_evict_inode) Process B > > call btrfs_remove_delayed_node call btrfs_get_delayed_node > > node = ACCESS_ONCE(btrfs_inode->delayed_node); > > BTRFS_I(inode)->delayed_node = NULL; > btrfs_release_delayed_node(delayed_node); > > if (node) { > atomic_inc(&node->refs); > return node; > } > > .. > > btrfs_release_delayed_node(delayed_node); > By looking at the race, it seems the following commit has addressed it. 
btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4 thanks, liubo > > 1313 void btrfs_remove_delayed_node(struct inode *inode) > 1314 { > 1315 struct btrfs_delayed_node *delayed_node; > 1316 > 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); > 1318 if (!delayed_node) > 1319 return; > 1320 > 1321 BTRFS_I(inode)->delayed_node = NULL; > 1322 btrfs_release_delayed_node(delayed_node); > 1323 } > > > 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode > *inode) > 88 { > 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); > 90 struct btrfs_root *root = btrfs_inode->root; > 91 u64 ino = btrfs_ino(inode); > 92 struct btrfs_delayed_node *node; > 93 > 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); > 95 if (node) { > 96 atomic_inc(>refs); > 97 return node; > 98 } > > > Thanks, > > Sunny > > > PS: > > > > panic informations > > PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" > #0 [88130404f940] machine_kexec at 8105ec10 > #1 [88130404f9b0] crash_kexec at 811145b8 > #2 [88130404fa80] oops_end at 8101a868 > #3 [88130404fab0] no_context at 8106ea91 > #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d > #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 > #6 [88130404fb60] __do_page_fault at 8106f328 > #7 [88130404fbd0] do_page_fault at 8106f637 > #8 [88130404fc10] page_fault at 816f6308 > [exception RIP: kmem_cache_alloc+121] > RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 > RAX: RBX: RCX: 01c32b76 > RDX: 01c32b75 RSI: RDI: 000224b0 > RBP: 88130404fd08 R8: 887e7d7a24b0 R9: > R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 > R13: 2026 R14: 887e3e230a00 R15: a01abf49 > ORIG_RAX: CS: 0010 SS: 0018 > #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 > [btrfs] > #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 > [btrfs] > #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] > #12 
[88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] > #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] > #14 [88130404fe50] touch_atime at 812286d3 > #15 [88130404fe90] iterate_dir at 81221929 > #16 [88130404fee0] sys_getdents64 at 81221a19 > #17 [88130404ff50] system_call_fastpath at 816f2594 > RIP: 006b68e4 RSP: 00c866259080 RFLAGS: 0246 > RAX: ffda RBX: 00c828dbbe00 RCX: 006b68e4 > RDX: 1000 RSI: 00c83da14000 RDI: 0011 > RBP: R8: R9: > R10: R11: 0246 R12: 00c7 > R13: 02174e74 R14: 0555 R15: 0038 > ORIG_RAX: 00d9 CS: 0033 SS: 002b > > > We also find the list double add informations, including n_list and p_list: > > [8642921.110568] [ cut here ] > [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 > __list_add+0xbe/0xd0() > [8642921.263780] list_add
Re: btrfs panic problem
On 19.09.2018 02:53, sunny.s.zhang wrote: > Hi Duncan, > > Thank you for your advice. I understand what you mean. But i have > reviewed the latest btrfs code, and i think the issue is exist still. > > At 71 line, if the function of btrfs_get_delayed_node run over this > line, then switch to other process, which run over the 1282 and release > the delayed node at the end. > > And then, switch back to the btrfs_get_delayed_node. find that the node > is not null, and use it as normal. that mean we used a freed memory. > > at some time, this memory will be freed again. > > latest code as below. > > 1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode) > 1279 { > 1280 struct btrfs_delayed_node *delayed_node; > 1281 > 1282 delayed_node = READ_ONCE(inode->delayed_node); > 1283 if (!delayed_node) > 1284 return; > 1285 > 1286 inode->delayed_node = NULL; > 1287 btrfs_release_delayed_node(delayed_node); > 1288 } > > > 64 static struct btrfs_delayed_node *btrfs_get_delayed_node( > 65 struct btrfs_inode *btrfs_inode) > 66 { > 67 struct btrfs_root *root = btrfs_inode->root; > 68 u64 ino = btrfs_ino(btrfs_inode); > 69 struct btrfs_delayed_node *node; > 70 > 71 node = READ_ONCE(btrfs_inode->delayed_node); > 72 if (node) { > 73 refcount_inc(&node->refs); > 74 return node; > 75 } > 76 > 77 spin_lock(&root->inode_lock); > 78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino); > > Your analysis is correct; however, it's missing one crucial point - btrfs_remove_delayed_node is called only from btrfs_evict_inode. And inodes are evicted when all other references have been dropped. Check the code in evict_inodes() - inodes are added to the dispose list when their i_count is 0 at which point there should be no references in this inode. This invalidates your analysis... > 在 2018年09月18日 13:05, Duncan 写道: >> sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted: >> >>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by >>> btrfs_get_or_create_delayed_node. 
>>> >>> I found that the freelist of the slub is wrong. >> [Not a dev, just a btrfs list regular and user, myself. But here's a >> general btrfs list recommendations reply...] >> >> You appear to mean kernel 4.1.12 -- confirmed by the version reported in >> the posted dump: 4.1.12-112.14.13.el6uek.x86_64 >> >> OK, so from the perspective of this forward-development-focused list, >> kernel 4.1 is pretty ancient history, but you do have a number of >> options. >> >> First let's consider the general situation. Most people choose an >> enterprise distro for supported stability, and that's certainly a valid >> thing to want. However, btrfs, while now reaching early maturity for the >> basics (single device in single or dup mode, and multi-device in single/ >> raid0/1/10 modes, note that raid56 mode is newer and less mature), >> remains under quite heavy development, and keeping reasonably current is >> recommended for that reason. >> >> So you you chose an enterprise distro presumably to lock in supported >> stability for several years, but you chose a filesystem, btrfs, that's >> still under heavy development, with reasonably current kernels and >> userspace recommended as tending to have the known bugs fixed. There's a >> bit of a conflict there, and the /general/ recommendation would thus be >> to consider whether one or the other of those choices are inappropriate >> for your use-case, because it's really quite likely that if you really >> want the stability of an enterprise distro and kernel, that btrfs isn't >> as stable a filesystem as you're likely to want to match with it. >> Alternatively, if you want something newer to match the still under heavy >> development btrfs, you very likely want a distro that's not focused on >> years-old stability just for the sake of it. One or the other is likely >> to be a poor match for your needs, and choosing something else that's a >> better match is likely to be a much better experience for you. 
>> >> But perhaps you do have reason to want to run the newer and not quite to >> traditional enterprise-distro level stability btrfs, on an otherwise >> older and very stable enterprise distro. That's fine, provided you know >> what you're getting yourself into, and are prepared to deal with it. >> >> In that case, for best support from the list, we'd recommend running one >> of the latest two kernels in either the current or mainline LTS tracks. >> >> For current track, With 4.18 being the latest kernel, that'd be 4.18 or >> 4.17, as available on kernel.org (tho 4.17 is already EOL, no further >> releases, at 4.17.19). >> >> For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series >> kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18 >> and it's just not out of normal
Re: btrfs panic problem
On 2018/9/19 上午8:35, sunny.s.zhang wrote: > > 在 2018年09月19日 08:05, Qu Wenruo 写道: >> >> On 2018/9/18 上午8:28, sunny.s.zhang wrote: >>> Hi All, >>> >>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by >>> btrfs_get_or_create_delayed_node. >> Any reproducer? >> >> Anyway we need a reproducer as a testcase. > > I have had a try, but could not reproduce yet. Since it's just one hit in production environment, I'm afraid we need to inject some sleep or delay into this code and try bombing it with fsstress. Despite that I have no good idea on reproducing it. Thanks, Qu > > Any advice to reproduce it? > >> >> The code looks >> >>> I found that the freelist of the slub is wrong. >>> >>> crash> struct kmem_cache_cpu 887e7d7a24b0 >>> >>> struct kmem_cache_cpu { >>> freelist = 0x2026, <<< the value is id of one inode >>> tid = 29567861, >>> page = 0xea0132168d00, >>> partial = 0x0 >>> } >>> >>> And, I found there are two different btrfs inodes pointing delayed_node. >>> It means that the same slub is used twice. >>> >>> I think this slub is freed twice, and then the next pointer of this slub >>> point itself. So we get the same slub twice. >>> >>> When use this slub again, that break the freelist. >>> >>> Folloing code will make the delayed node being freed twice. But I don't >>> found what is the process. >>> >>> Process A (btrfs_evict_inode) Process B >>> >>> call btrfs_remove_delayed_node call btrfs_get_delayed_node >>> >>> node = ACCESS_ONCE(btrfs_inode->delayed_node); >>> >>> BTRFS_I(inode)->delayed_node = NULL; >>> btrfs_release_delayed_node(delayed_node); >>> >>> if (node) { >>> atomic_inc(&node->refs); >>> return node; >>> } >>> >>> .. 
>>> >>> btrfs_release_delayed_node(delayed_node); >>> >>> >>> 1313 void btrfs_remove_delayed_node(struct inode *inode) >>> 1314 { >>> 1315 struct btrfs_delayed_node *delayed_node; >>> 1316 >>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); >>> 1318 if (!delayed_node) >>> 1319 return; >>> 1320 >>> 1321 BTRFS_I(inode)->delayed_node = NULL; >>> 1322 btrfs_release_delayed_node(delayed_node); >>> 1323 } >>> >>> >>> 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct >>> inode *inode) >>> 88 { >>> 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); >>> 90 struct btrfs_root *root = btrfs_inode->root; >>> 91 u64 ino = btrfs_ino(inode); >>> 92 struct btrfs_delayed_node *node; >>> 93 >>> 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); >>> 95 if (node) { >>> 96 atomic_inc(&node->refs); >>> 97 return node; >>> 98 } >>> >> The analysis looks valid. >> Can be fixed by adding a spinlock. >> >> Just wondering why we didn't hit it. > It just appeared once in our production environment. 
> > Thanks, > Sunny >> >> Thanks, >> Qu >> >>> Thanks, >>> >>> Sunny >>> >>> >>> PS: >>> >>> >>> >>> panic information >>> >>> PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" >>> #0 [88130404f940] machine_kexec at 8105ec10 >>> #1 [88130404f9b0] crash_kexec at 811145b8 >>> #2 [88130404fa80] oops_end at 8101a868 >>> #3 [88130404fab0] no_context at 8106ea91 >>> #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d >>> #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 >>> #6 [88130404fb60] __do_page_fault at 8106f328 >>> #7 [88130404fbd0] do_page_fault at 8106f637 >>> #8 [88130404fc10] page_fault at 816f6308 >>> [exception RIP: kmem_cache_alloc+121] >>> RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 >>> RAX: RBX: RCX: 01c32b76 >>> RDX: 01c32b75 RSI: RDI: 000224b0 >>> RBP: 88130404fd08 R8: 887e7d7a24b0 R9: >>> R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 >>> R13: 2026 R14: 887e3e230a00 R15: a01abf49 >>> ORIG_RAX: CS: 0010 SS: 0018 >>> #9 [88130404fd10] btrfs_get_or_create_delayed_node at >>> a01abf49 [btrfs] >>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 >>> [btrfs] >>> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] >>> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] >>> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] >>> #14 [88130404fe50] touch_atime at 812286d3 >>> #15 [88130404fe90] iterate_dir at 81221929 >>> #16 [88130404fee0] sys_getdents64 at 81221a19 >>> #17 [88130404ff50] system_call_fastpath at 816f2594 >>> RIP: 006b68e4 RSP: 00c866259080 RFLAGS: 0246 >>> RAX:
Re: btrfs panic problem
On 2018-09-19 08:05, Qu Wenruo wrote: On 2018/9/18 8:28 AM, sunny.s.zhang wrote: Hi All, My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. Any reproducer? Anyway we need a reproducer as a testcase. I have had a try, but could not reproduce it yet. Any advice to reproduce it? The code looks I found that the freelist of the slub is wrong. crash> struct kmem_cache_cpu 887e7d7a24b0 struct kmem_cache_cpu { freelist = 0x2026, <<< the value is the id of one inode tid = 29567861, page = 0xea0132168d00, partial = 0x0 } And I found two different btrfs inodes pointing to the same delayed_node. It means that the same slub object is used twice. I think this slub object is freed twice, and then the next pointer of this slub object points to itself. So we get the same slub object twice. When this slub object is used again, that breaks the freelist. The following code will make the delayed node be freed twice, but I haven't found which process does it. Process A (btrfs_evict_inode) Process B call btrfs_remove_delayed_node call btrfs_get_delayed_node node = ACCESS_ONCE(btrfs_inode->delayed_node); BTRFS_I(inode)->delayed_node = NULL; btrfs_release_delayed_node(delayed_node); if (node) { atomic_inc(&node->refs); return node; } .. btrfs_release_delayed_node(delayed_node); 1313 void btrfs_remove_delayed_node(struct inode *inode) 1314 { 1315 struct btrfs_delayed_node *delayed_node; 1316 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); 1318 if (!delayed_node) 1319 return; 1320 1321 BTRFS_I(inode)->delayed_node = NULL; 1322 btrfs_release_delayed_node(delayed_node); 1323 } 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode) 88 { 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 90 struct btrfs_root *root = btrfs_inode->root; 91 u64 ino = btrfs_ino(inode); 92 struct btrfs_delayed_node *node; 93 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); 95 if (node) { 96 atomic_inc(&node->refs); 97 return node; 98 } The analysis looks valid. Can be fixed by adding a spinlock. 
Just wondering why we didn't hit it. It just appeared once in our production environment. Thanks, Sunny Thanks, Qu Thanks, Sunny PS: panic information PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" #0 [88130404f940] machine_kexec at 8105ec10 #1 [88130404f9b0] crash_kexec at 811145b8 #2 [88130404fa80] oops_end at 8101a868 #3 [88130404fab0] no_context at 8106ea91 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 #6 [88130404fb60] __do_page_fault at 8106f328 #7 [88130404fbd0] do_page_fault at 8106f637 #8 [88130404fc10] page_fault at 816f6308 [exception RIP: kmem_cache_alloc+121] RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 RAX: RBX: RCX: 01c32b76 RDX: 01c32b75 RSI: RDI: 000224b0 RBP: 88130404fd08 R8: 887e7d7a24b0 R9: R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 R13: 2026 R14: 887e3e230a00 R15: a01abf49 ORIG_RAX: CS: 0010 SS: 0018 #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs] #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs] #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] #14 [88130404fe50] touch_atime at 812286d3 #15 [88130404fe90] iterate_dir at 81221929 #16 [88130404fee0] sys_getdents64 at 81221a19 #17 [88130404ff50] system_call_fastpath at 816f2594 RIP: 006b68e4 RSP: 00c866259080 RFLAGS: 0246 RAX: ffda RBX: 00c828dbbe00 RCX: 006b68e4 RDX: 1000 RSI: 00c83da14000 RDI: 0011 RBP: R8: R9: R10: R11: 0246 R12: 00c7 R13: 02174e74 R14: 0555 R15: 0038 ORIG_RAX: 00d9 CS: 0033 SS: 002b We also find the list double add information, including n_list and p_list: [8642921.110568] [ cut here ] [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 __list_add+0xbe/0xd0() [8642921.263780] list_add corruption. prev->next should be next (887e40fa5368), but was ffff884c85a36288.
Re: btrfs panic problem
On 2018/9/18 8:28 AM, sunny.s.zhang wrote: > Hi All, > > My OS (4.1.12) panics in kmem_cache_alloc, which is called by > btrfs_get_or_create_delayed_node. Any reproducer? Anyway we need a reproducer as a testcase. The code looks > > I found that the freelist of the slub is wrong. > > crash> struct kmem_cache_cpu 887e7d7a24b0 > > struct kmem_cache_cpu { > freelist = 0x2026, <<< the value is the id of one inode > tid = 29567861, > page = 0xea0132168d00, > partial = 0x0 > } > > And I found two different btrfs inodes pointing to the same delayed_node. > It means that the same slub object is used twice. > > I think this slub object is freed twice, and then the next pointer of this slub > object points to itself. So we get the same slub object twice. > > When this slub object is used again, that breaks the freelist. > > The following code will make the delayed node be freed twice, but I haven't > found which process does it. > > Process A (btrfs_evict_inode) Process B > > call btrfs_remove_delayed_node call btrfs_get_delayed_node > > node = ACCESS_ONCE(btrfs_inode->delayed_node); > > BTRFS_I(inode)->delayed_node = NULL; > btrfs_release_delayed_node(delayed_node); > > if (node) { > atomic_inc(&node->refs); > return node; > } > > .. > > btrfs_release_delayed_node(delayed_node); > > > 1313 void btrfs_remove_delayed_node(struct inode *inode) > 1314 { > 1315 struct btrfs_delayed_node *delayed_node; > 1316 > 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); > 1318 if (!delayed_node) > 1319 return; > 1320 > 1321 BTRFS_I(inode)->delayed_node = NULL; > 1322 btrfs_release_delayed_node(delayed_node); > 1323 } > > > 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct > inode *inode) > 88 { > 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); > 90 struct btrfs_root *root = btrfs_inode->root; > 91 u64 ino = btrfs_ino(inode); > 92 struct btrfs_delayed_node *node; > 93 > 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); > 95 if (node) { > 96 atomic_inc(&node->refs); > 97 return node; > 98 } > The analysis looks valid. 
Can be fixed by adding a spinlock. Just wondering why we didn't hit it. Thanks, Qu > > Thanks, > > Sunny > > > PS: > > > > panic information > > PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" > #0 [88130404f940] machine_kexec at 8105ec10 > #1 [88130404f9b0] crash_kexec at 811145b8 > #2 [88130404fa80] oops_end at 8101a868 > #3 [88130404fab0] no_context at 8106ea91 > #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d > #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 > #6 [88130404fb60] __do_page_fault at 8106f328 > #7 [88130404fbd0] do_page_fault at 8106f637 > #8 [88130404fc10] page_fault at 816f6308 > [exception RIP: kmem_cache_alloc+121] > RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 > RAX: RBX: RCX: 01c32b76 > RDX: 01c32b75 RSI: RDI: 000224b0 > RBP: 88130404fd08 R8: 887e7d7a24b0 R9: > R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 > R13: 2026 R14: 887e3e230a00 R15: a01abf49 > ORIG_RAX: CS: 0010 SS: 0018 > #9 [88130404fd10] btrfs_get_or_create_delayed_node at > a01abf49 [btrfs] > #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 > [btrfs] > #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] > #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] > #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] > #14 [88130404fe50] touch_atime at 812286d3 > #15 [88130404fe90] iterate_dir at 81221929 > #16 [88130404fee0] sys_getdents64 at 81221a19 > #17 [88130404ff50] system_call_fastpath at 816f2594 > RIP: 006b68e4 RSP: 00c866259080 RFLAGS: 0246 > RAX: ffda RBX: 00c828dbbe00 RCX: 006b68e4 > RDX: 1000 RSI: 00c83da14000 RDI: 0011 > RBP: R8: R9: > R10: R11: 0246 R12: 00c7 > R13: 02174e74 R14: 0555 R15: 0038 > ORIG_RAX: 00d9 CS: 0033 SS: 002b > > > We also find the list double add information, including n_list and p_list: > > [8642921.110568] [ cut here ] > [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 > __list_add+0xbe/0xd0() > [8642921.263780] list_add corruption. prev->next should be next > (887e40fa5368), but
Re: btrfs panic problem
Hi Duncan, Thank you for your advice. I understand what you mean. But I have reviewed the latest btrfs code, and I think the issue still exists. At line 71, if btrfs_get_delayed_node runs past this line and we then switch to another process, which runs past line 1282 and releases the delayed node at the end, and we then switch back to btrfs_get_delayed_node, it finds that the node is not NULL and uses it as normal. That means we use freed memory; at some point, this memory will be freed again. The latest code is as below. 1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode) 1279 { 1280 struct btrfs_delayed_node *delayed_node; 1281 1282 delayed_node = READ_ONCE(inode->delayed_node); 1283 if (!delayed_node) 1284 return; 1285 1286 inode->delayed_node = NULL; 1287 btrfs_release_delayed_node(delayed_node); 1288 } 64 static struct btrfs_delayed_node *btrfs_get_delayed_node( 65 struct btrfs_inode *btrfs_inode) 66 { 67 struct btrfs_root *root = btrfs_inode->root; 68 u64 ino = btrfs_ino(btrfs_inode); 69 struct btrfs_delayed_node *node; 70 71 node = READ_ONCE(btrfs_inode->delayed_node); 72 if (node) { 73 refcount_inc(&node->refs); 74 return node; 75 } 76 77 spin_lock(&root->inode_lock); 78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino); On 2018-09-18 13:05, Duncan wrote: sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted: My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong. [Not a dev, just a btrfs list regular and user, myself. But here's a general btrfs list recommendations reply...] You appear to mean kernel 4.1.12 -- confirmed by the version reported in the posted dump: 4.1.12-112.14.13.el6uek.x86_64 OK, so from the perspective of this forward-development-focused list, kernel 4.1 is pretty ancient history, but you do have a number of options. First let's consider the general situation. 
Most people choose an enterprise distro for supported stability, and that's certainly a valid thing to want. However, btrfs, while now reaching early maturity for the basics (single device in single or dup mode, and multi-device in single/ raid0/1/10 modes, note that raid56 mode is newer and less mature), remains under quite heavy development, and keeping reasonably current is recommended for that reason. So you chose an enterprise distro presumably to lock in supported stability for several years, but you chose a filesystem, btrfs, that's still under heavy development, with reasonably current kernels and userspace recommended as tending to have the known bugs fixed. There's a bit of a conflict there, and the /general/ recommendation would thus be to consider whether one or the other of those choices is inappropriate for your use-case, because it's really quite likely that if you really want the stability of an enterprise distro and kernel, btrfs isn't as stable a filesystem as you're likely to want to match with it. Alternatively, if you want something newer to match the still under heavy development btrfs, you very likely want a distro that's not focused on years-old stability just for the sake of it. One or the other is likely to be a poor match for your needs, and choosing something else that's a better match is likely to be a much better experience for you. But perhaps you do have reason to want to run the newer and not quite to traditional enterprise-distro level stability btrfs, on an otherwise older and very stable enterprise distro. That's fine, provided you know what you're getting yourself into, and are prepared to deal with it. In that case, for best support from the list, we'd recommend running one of the latest two kernels in either the current or mainline LTS tracks. For the current track, with 4.18 being the latest kernel, that'd be 4.18 or 4.17, as available on kernel.org (tho 4.17 is already EOL, no further releases, at 4.17.19). 
For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18 and it's just not out of normal stable range yet so not yet marked LTS?), so it'll be coming up soon and 4.9 will then be dropping to third LTS series and thus out of our best recommended range. 4.4 was the previous LTS and while still in LTS support, is outside the two newest LTS series that this list recommends. And of course 4.1 is older than 4.4, so as I said, in btrfs development terms, it's quite ancient indeed... quite out of practical support range here, tho of course we'll still try, but in many cases the first question when any problem's reported is going to be whether it's reproducible on something closer to current. But... you ARE on an enterprise kernel, likely on an enterprise distro, and
Re: btrfs receive incremental stream on another uuid
On Tue, Sep 18, 2018 at 06:28:37PM +, Gervais, Francois wrote: > > No. It is already possible (by setting received UUID); it should not be > made too open to easy abuse. > > > Do you mean edit the UUID in the byte stream before btrfs receive? No, there's an ioctl to change the received UUID of a subvolume. It's used by receive, at the very end of the receive operation. Messing around in this area is basically a recipe for ending up with a half-completed send/receive full of broken data because the receiving subvolume isn't quite as identical as you thought. It enforces the rules for a reason. Now, it's possible to modify the send stream and the logic around it a bit to support a number of additional modes of operation (bidirectional send, for example), but that's queued up waiting for (a) a definitive list of send stream format changes, and (b) David's bandwidth to put them together in one patch set. If you want to see more on the underlying UUID model, and how it could be (ab)used and modified, there's a write-up here, in a thread on pretty much exactly the same proposal that you've just made: https://www.spinics.net/lists/linux-btrfs/msg44089.html Hugo. -- Hugo Mills | Great films about cricket: Monster's No-Ball hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 |
Re: btrfs receive incremental stream on another uuid
On 18.09.2018 21:28, Gervais, Francois wrote: >> No. It is already possible (by setting received UUID); it should not be > made too open to easy abuse. > > > Do you mean edit the UUID in the byte stream before btrfs receive? > No, I mean setting the received UUID on the subvolume. Unfortunately, it is possible. Fortunately, it is not trivially done.
Re: btrfs receive incremental stream on another uuid
> No. It is already possible (by setting received UUID); it should not be made too open to easy abuse. Do you mean edit the UUID in the byte stream before btrfs receive?
Re: btrfs receive incremental stream on another uuid
On 18.09.2018 20:56, Gervais, Francois wrote: > > Hi, > > I'm trying to apply a btrfs send diff (done through -p) to another subvolume > with the same content as the proper parent but with a different uuid. > > I looked through btrfs receive and I get the feeling that this is not > possible right now. > > I'm thinking of adding a -p option to btrfs receive which could override the > parent information from the stream. > > Would that make sense? > No. It is already possible (by setting received UUID); it should not be made too open to easy abuse.
Re: btrfs panic problem
Add Junxiao. On 2018-09-18 13:05, Duncan wrote: sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted: My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong. [Not a dev, just a btrfs list regular and user, myself. But here's a general btrfs list recommendations reply...] You appear to mean kernel 4.1.12 -- confirmed by the version reported in the posted dump: 4.1.12-112.14.13.el6uek.x86_64 OK, so from the perspective of this forward-development-focused list, kernel 4.1 is pretty ancient history, but you do have a number of options. First let's consider the general situation. Most people choose an enterprise distro for supported stability, and that's certainly a valid thing to want. However, btrfs, while now reaching early maturity for the basics (single device in single or dup mode, and multi-device in single/ raid0/1/10 modes, note that raid56 mode is newer and less mature), remains under quite heavy development, and keeping reasonably current is recommended for that reason. So you chose an enterprise distro presumably to lock in supported stability for several years, but you chose a filesystem, btrfs, that's still under heavy development, with reasonably current kernels and userspace recommended as tending to have the known bugs fixed. There's a bit of a conflict there, and the /general/ recommendation would thus be to consider whether one or the other of those choices is inappropriate for your use-case, because it's really quite likely that if you really want the stability of an enterprise distro and kernel, btrfs isn't as stable a filesystem as you're likely to want to match with it. Alternatively, if you want something newer to match the still under heavy development btrfs, you very likely want a distro that's not focused on years-old stability just for the sake of it. 
One or the other is likely to be a poor match for your needs, and choosing something else that's a better match is likely to be a much better experience for you. But perhaps you do have reason to want to run the newer and not quite to traditional enterprise-distro level stability btrfs, on an otherwise older and very stable enterprise distro. That's fine, provided you know what you're getting yourself into, and are prepared to deal with it. In that case, for best support from the list, we'd recommend running one of the latest two kernels in either the current or mainline LTS tracks. For the current track, with 4.18 being the latest kernel, that'd be 4.18 or 4.17, as available on kernel.org (tho 4.17 is already EOL, no further releases, at 4.17.19). For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18 and it's just not out of normal stable range yet so not yet marked LTS?), so it'll be coming up soon and 4.9 will then be dropping to third LTS series and thus out of our best recommended range. 4.4 was the previous LTS and while still in LTS support, is outside the two newest LTS series that this list recommends. And of course 4.1 is older than 4.4, so as I said, in btrfs development terms, it's quite ancient indeed... quite out of practical support range here, tho of course we'll still try, but in many cases the first question when any problem's reported is going to be whether it's reproducible on something closer to current. But... you ARE on an enterprise kernel, likely on an enterprise distro, and very possibly actually paying /them/ for support. So you're not without options if you prefer to stay with your supported enterprise kernel. If you're paying them for support, you might as well use it, and of course of the very many fixes since 4.1, they know what they've backported and what they haven't, so they're far better placed to provide that support in any case. 
Or, given what you posted, you appear to be reasonably able to do at least limited kernel-dev-level analysis yourself. Given that, you're already reasonably well placed to simply decide to stick with what you have and take the support you can get, diving into things yourself if necessary. So those are your kernel options. What about userspace btrfs-progs? Generally speaking, while the filesystem's running, it's the kernel code doing most of the work. If you have old userspace, it simply means you can't take advantage of some of the newer features as the old userspace doesn't know how to call for them. But the situation changes as soon as you have problems and can't mount, because it's userspace code that runs to try to fix that sort of problem, or failing that, it's userspace code that btrfs restore runs to try to grab what files can be grabbed off of the unmountable filesystem. So for routine operation, it's no big deal if userspace is a bit old, at least as long as it's new enough to have all the newer command formats, etc, that you need, and for
Re: btrfs panic problem
sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted: > My OS (4.1.12) panics in kmem_cache_alloc, which is called by > btrfs_get_or_create_delayed_node. > > I found that the freelist of the slub is wrong. [Not a dev, just a btrfs list regular and user, myself. But here's a general btrfs list recommendations reply...] You appear to mean kernel 4.1.12 -- confirmed by the version reported in the posted dump: 4.1.12-112.14.13.el6uek.x86_64 OK, so from the perspective of this forward-development-focused list, kernel 4.1 is pretty ancient history, but you do have a number of options. First let's consider the general situation. Most people choose an enterprise distro for supported stability, and that's certainly a valid thing to want. However, btrfs, while now reaching early maturity for the basics (single device in single or dup mode, and multi-device in single/ raid0/1/10 modes, note that raid56 mode is newer and less mature), remains under quite heavy development, and keeping reasonably current is recommended for that reason. So you chose an enterprise distro presumably to lock in supported stability for several years, but you chose a filesystem, btrfs, that's still under heavy development, with reasonably current kernels and userspace recommended as tending to have the known bugs fixed. There's a bit of a conflict there, and the /general/ recommendation would thus be to consider whether one or the other of those choices is inappropriate for your use-case, because it's really quite likely that if you really want the stability of an enterprise distro and kernel, btrfs isn't as stable a filesystem as you're likely to want to match with it. Alternatively, if you want something newer to match the still under heavy development btrfs, you very likely want a distro that's not focused on years-old stability just for the sake of it. 
One or the other is likely to be a poor match for your needs, and choosing something else that's a better match is likely to be a much better experience for you. But perhaps you do have reason to want to run the newer and not quite to traditional enterprise-distro level stability btrfs, on an otherwise older and very stable enterprise distro. That's fine, provided you know what you're getting yourself into, and are prepared to deal with it. In that case, for best support from the list, we'd recommend running one of the latest two kernels in either the current or mainline LTS tracks. For the current track, with 4.18 being the latest kernel, that'd be 4.18 or 4.17, as available on kernel.org (tho 4.17 is already EOL, no further releases, at 4.17.19). For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18 and it's just not out of normal stable range yet so not yet marked LTS?), so it'll be coming up soon and 4.9 will then be dropping to third LTS series and thus out of our best recommended range. 4.4 was the previous LTS and while still in LTS support, is outside the two newest LTS series that this list recommends. And of course 4.1 is older than 4.4, so as I said, in btrfs development terms, it's quite ancient indeed... quite out of practical support range here, tho of course we'll still try, but in many cases the first question when any problem's reported is going to be whether it's reproducible on something closer to current. But... you ARE on an enterprise kernel, likely on an enterprise distro, and very possibly actually paying /them/ for support. So you're not without options if you prefer to stay with your supported enterprise kernel. If you're paying them for support, you might as well use it, and of course of the very many fixes since 4.1, they know what they've backported and what they haven't, so they're far better placed to provide that support in any case. 
Or, given what you posted, you appear to be reasonably able to do at least limited kernel-dev-level analysis yourself. Given that, you're already reasonably well placed to simply decide to stick with what you have and take the support you can get, diving into things yourself if necessary. So those are your kernel options. What about userspace btrfs-progs? Generally speaking, while the filesystem's running, it's the kernel code doing most of the work. If you have old userspace, it simply means you can't take advantage of some of the newer features as the old userspace doesn't know how to call for them. But the situation changes as soon as you have problems and can't mount, because it's userspace code that runs to try to fix that sort of problem, or failing that, it's userspace code that btrfs restore runs to try to grab what files can be grabbed off of the unmountable filesystem. So for routine operation, it's no big deal if userspace is a bit old, at least as long as it's new enough to have all the newer command formats, etc, that you
Re: btrfs panic problem
Sorry, let me correct some errors: Process A (btrfs_evict_inode) Process B call btrfs_remove_delayed_node call btrfs_get_delayed_node node = ACCESS_ONCE(btrfs_inode->delayed_node); BTRFS_I(inode)->delayed_node = NULL; btrfs_release_delayed_node(delayed_node); if (node) { atomic_inc(&node->refs); return node; } .. btrfs_release_delayed_node(delayed_node); On 2018-09-18 08:28, sunny.s.zhang wrote: Hi All, My OS (4.1.12) panics in kmem_cache_alloc, which is called by btrfs_get_or_create_delayed_node. I found that the freelist of the slub is wrong. crash> struct kmem_cache_cpu 887e7d7a24b0 struct kmem_cache_cpu { freelist = 0x2026, <<< the value is the id of one inode tid = 29567861, page = 0xea0132168d00, partial = 0x0 } And I found two different btrfs inodes pointing to the same delayed_node. It means that the same slub object is used twice. I think this slub object is freed twice, and then the next pointer of this slub object points to itself. So we get the same slub object twice. When this slub object is used again, that breaks the freelist. The following code will make the delayed node be freed twice, but I haven't found which process does it. Process A (btrfs_evict_inode) Process B call btrfs_remove_delayed_node call btrfs_get_delayed_node node = ACCESS_ONCE(btrfs_inode->delayed_node); BTRFS_I(inode)->delayed_node = NULL; btrfs_release_delayed_node(delayed_node); if (node) { atomic_inc(&node->refs); return node; } .. 
btrfs_release_delayed_node(delayed_node); 1313 void btrfs_remove_delayed_node(struct inode *inode) 1314 { 1315 struct btrfs_delayed_node *delayed_node; 1316 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node); 1318 if (!delayed_node) 1319 return; 1320 1321 BTRFS_I(inode)->delayed_node = NULL; 1322 btrfs_release_delayed_node(delayed_node); 1323 } 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode) 88 { 89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode); 90 struct btrfs_root *root = btrfs_inode->root; 91 u64 ino = btrfs_ino(inode); 92 struct btrfs_delayed_node *node; 93 94 node = ACCESS_ONCE(btrfs_inode->delayed_node); 95 if (node) { 96 atomic_inc(&node->refs); 97 return node; 98 } Thanks, Sunny PS: panic information PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd" #0 [88130404f940] machine_kexec at 8105ec10 #1 [88130404f9b0] crash_kexec at 811145b8 #2 [88130404fa80] oops_end at 8101a868 #3 [88130404fab0] no_context at 8106ea91 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d #5 [88130404fb50] bad_area_nosemaphore at 8106eda3 #6 [88130404fb60] __do_page_fault at 8106f328 #7 [88130404fbd0] do_page_fault at 8106f637 #8 [88130404fc10] page_fault at 816f6308 [exception RIP: kmem_cache_alloc+121] RIP: 811ef019 RSP: 88130404fcc8 RFLAGS: 00010286 RAX: RBX: RCX: 01c32b76 RDX: 01c32b75 RSI: RDI: 000224b0 RBP: 88130404fd08 R8: 887e7d7a24b0 R9: R10: 8802668b6618 R11: 0002 R12: 887e3e230a00 R13: 2026 R14: 887e3e230a00 R15: a01abf49 ORIG_RAX: CS: 0010 SS: 0018 #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs] #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs] #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs] #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs] #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs] #14 [88130404fe50] touch_atime at 812286d3 #15 [88130404fe90] iterate_dir at 81221929 #16 [88130404fee0] sys_getdents64 at 81221a19 #17 [88130404ff50] system_call_fastpath 
at 816f2594 RIP: 006b68e4 RSP: 00c866259080 RFLAGS: 0246 RAX: ffda RBX: 00c828dbbe00 RCX: 006b68e4 RDX: 1000 RSI: 00c83da14000 RDI: 0011 RBP: R8: R9: R10: R11: 0246 R12: 00c7 R13: 02174e74 R14: 0555 R15: 0038 ORIG_RAX: 00d9 CS: 0033 SS: 002b We also find the list double add information, including n_list and p_list: [8642921.110568] [ cut here ] [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 __list_add+0xbe/0xd0() [8642921.263780] list_add corruption. prev->next should be next (887e40fa5368), but was ffff884c85a36288.
Re: btrfs problems
> If your primary concern is to make the fs as stable as possible, then > keep snapshots to a minimum, avoid any functionality you won't > use, like qgroups, routine balance, RAID5/6. > > And keep the necessary btrfs-specific operations to a minimum, like > subvolume/snapshot (and don't keep too many snapshots, say over 20), > shrink, send/receive. hehe, that sounds like "hey use btrfs, it's cool, but please - don't use any btrfs specific feature" ;) best Stefan
Re: btrfs problems
On 2018/9/17 7:55 PM, Adrian Bastholm wrote: >> Well, I'd say Debian is really not your first choice for btrfs. >> The kernel is really old for btrfs. >> >> My personal recommendation is to use a rolling-release distribution like >> vanilla Archlinux, whose kernel is already 4.18.7 now. > > I just upgraded to Debian Testing which has the 4.18 kernel Then I strongly recommend using the latest upstream kernel and progs for btrfs. (thus using Debian Testing) And if anything goes wrong, please report it to the mailing list asap. Especially for fs corruption, that's the ghost I'm always chasing. So if any corruption happens again (although I hope it won't happen), I may have a chance to catch it. > >> Anyway, enjoy your stable fs even if it's not btrfs anymore. > > My new stable fs is too rigid. Can't grow it, can't shrink it, can't > remove vdevs from it, so I'm planning a comeback to BTRFS. I guess > after the dust settled I realize I like the flexibility of BTRFS. > > > This time I'm considering BTRFS as rootfs as well, can I do an > in-place conversion? There's this guide > (https://www.howtoforge.com/how-to-convert-an-ext3-ext4-root-file-system-to-btrfs-on-ubuntu-12.10) > I was planning on following. Btrfs-convert is recommended mostly for short-term trials (the ability to roll back to ext* with nothing modified). From the code aspect, the biggest difference is the chunk layout. Due to the ext* block group usage, each block group header (except some sparse bg) is always used, thus btrfs can't use them. This leads to a highly fragmented chunk layout. We don't have error reports about such a layout yet, but if you want everything to be as stable as possible, I still recommend using a newly created fs. > > Another thing is I'd like to see a "first steps after getting started" section in the wiki. 
> Something like take your first snapshot, back up, how to think when
> running it - can I just set some cron jobs and forget about it, or
> does it need constant attention, and stuff like that.

There are projects that do such things automatically, like snapper.

If your primary concern is to make the fs as stable as possible, then keep snapshots to a minimal amount, and avoid any functionality you won't use, like qgroup, routine balance, RAID5/6.

And keep the necessary btrfs-specific operations to a minimum, like subvolume/snapshot (and don't keep too many snapshots, say over 20), shrink, send/receive.

Thanks,
Qu

> BR Adrian
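On the cron-and-forget question above: a read-only snapshot can indeed be driven from a scheduler, and tools like snapper do this properly. As a minimal sketch of my own (the paths and naming scheme are hypothetical, and the command itself needs root plus btrfs-progs):

```python
import datetime
import subprocess


def snapshot_name(prefix="/.snapshots/daily-", when=None):
    """Build a date-stamped snapshot path, e.g. /.snapshots/daily-2018-09-17."""
    when = when or datetime.date.today()
    return prefix + when.isoformat()


def take_snapshot(subvol="/", prefix="/.snapshots/daily-"):
    """Create a read-only snapshot of `subvol` (requires root and btrfs-progs)."""
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", "-r", subvol, snapshot_name(prefix)],
        check=True,
    )
```

Run daily from cron and prune old snapshots; per the advice above, keep the total count low (say, under 20).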
Re: btrfs problems
> Well, I'd say Debian is really not your first choice for btrfs.
> The kernel is really old for btrfs.
>
> My personal recommendation is to use a rolling release distribution
> like vanilla Arch Linux, whose kernel is already 4.18.7 now.

I just upgraded to Debian Testing which has the 4.18 kernel.

> Anyway, enjoy your stable fs even if it's not btrfs anymore.

My new stable fs is too rigid. Can't grow it, can't shrink it, can't remove vdevs from it, so I'm planning a comeback to BTRFS. I guess after the dust settled I realize I like the flexibility of BTRFS.

This time I'm considering BTRFS as rootfs as well; can I do an in-place conversion? There's this guide (https://www.howtoforge.com/how-to-convert-an-ext3-ext4-root-file-system-to-btrfs-on-ubuntu-12.10) I was planning on following.

Another thing is I'd like to see a "first steps after getting started" section in the wiki. Something like take your first snapshot, back up, how to think when running it - can I just set some cron jobs and forget about it, or does it need constant attention, and stuff like that.

BR Adrian
--
Vänliga hälsningar / Kind regards,
Adrian Bastholm
``I would change the world, but they won't give me the sourcecode``
Re: btrfs problems
On Sun, Sep 16, 2018 at 2:11 PM, Adrian Bastholm wrote:
> Thanks for answering, Qu.
>
>> At this point, your fs is already corrupted.
>> I'm not sure about the reason; it can be a failed CoW combined with
>> power loss, or a corrupted free space cache, or some old kernel bugs.
>>
>> Anyway, the metadata itself is already corrupted, and I believe it
>> happened even before you noticed.
>
> I suspected it had to be like that.
>
>>> BTRFS check --repair is not recommended, it crashes, doesn't fix all
>>> problems, and I later found out that my lost+found dir had about 39G
>>> of lost files and dirs.
>>
>> lost+found is completely created by btrfs check --repair.
>>
>>> I spent about two days trying to fix everything, removing a disk,
>>> adding it again, checking, you name it. I ended up removing one disk,
>>> reformatting it, and moving the data there.
>>
>> Well, I would recommend submitting such a problem to the mailing list
>> *BEFORE* doing any write operation to the fs (including btrfs check
>> --repair), as it would help us analyse the failure pattern to further
>> enhance btrfs.
>
> IMHO that's, how should I put it, a design flaw, the wrong way of
> looking at how people think, with all respect to all the very smart
> people that put in countless hours of hard work. Users expect an fs
> check and repair to repair, not to break stuff.
> Reading that --repair is "destructive" is contradictory even to me.

It's contradictory to everyone, including the developers. No developer set out to make --repair dangerous from the outset. It just turns out that it was a harder problem to solve, and the thought was that it would keep getting better. Newer versions are "should be safe" now, even if they can't fix everything.

The far bigger issue I think the developers are aware of is that depending on repair at all, for any Btrfs of appreciable size, is simply not scalable. Taking a day or a week to run a repair on a large file system is unworkable.
And that's why it's better to avoid inconsistencies in the first place, which is what Btrfs is supposed to do; if that's not happening, it's a bug somewhere in Btrfs, and also sometimes in the hardware.

> This problem emerged in a directory where motion (the camera software)
> was saving pictures. Either killing the process or a power loss could
> have left these jpg files (or fs metadata) in a bad state. Maybe
> that's something to go on. I was thinking that there's not much anyone
> can do without root access to my box anyway, and I'm not sure I was
> prepared to give that to anyone.

I can't recommend raid56 for people new to Btrfs. It really takes qualified hardware to make sure there's no betrayal, as everything gets a lot more complicated with raid56. The general state of faulty device handling on Btrfs makes raid56 very much a hands-on approach; you can't turn your back on it.

And when jumping into raid5, I advise raid1 for metadata. It reduces problems. That's true for raid6 also, except that raid1 metadata is less redundancy than raid6, so it's not helpful if you end up losing 2 devices. If you need production-grade parity raid you should use OpenZFS, although I can't speak to how it behaves with respect to faulty devices on Linux.

>> For any unexpected btrfs behavior, from strange ls output to an
>> aborted transaction, please consult the mailing list first.
>> (Of course, with kernel version and btrfs-progs version, which are
>> missing in your console log.)
>
> Linux jenna 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21)
> x86_64 GNU/Linux
> btrfs-progs is already the newest version (4.7.3-1).

Well, the newest versions are kernel 4.18.8 and btrfs-progs 4.17.1, so in Btrfs terms those are kinda old.
That is not inherently bad, but there are literally thousands of additions and deletions since kernel 4.9, so there's almost no way anyone on this list, except a developer familiar with backport status, can tell you whether the problem you're seeing is a bug that's been fixed in that particular version. There aren't that many developers familiar with that status who also have time to read user reports.

Since this is an upstream list, most developers will want to know if you're able to reproduce the problem with a mainline kernel, because if you can, it's very probable it's a bug that needs to be fixed upstream first before it can be backported. That's just the nature of kernel development generally, and you'll find the same thing on the ext4 and XFS lists.

The main reason people use Debian and its older kernel bases is that they're willing to accept certain bugginess in favor of stability. Transient bugs are really bad in that world. Consistent bugs they just find workarounds for (avoidance) until there's a known, highly tested backport, because they want "The Behavior" to be predictable, both good and bad. That is not a model well suited for a file system in really active development, as Btrfs is. It's better now than it was even a
Re: btrfs problems
On Sun, Sep 16, 2018 at 7:58 AM, Adrian Bastholm wrote:
> Hello all
> Actually I'm not trying to get any help any more, I gave up BTRFS on
> the desktop, but I'd like to share my efforts of trying to fix my
> problems, in hope I can help some poor noob like me.

There's almost no useful information provided for someone to even try to reproduce your results, isolate the cause, and figure out the bugs. No kernel version. No btrfs-progs version. No description of the hardware and how it's laid out, or what mkfs and mount options are being used. No one really has the time to speculate.

> BTRFS check --repair is not recommended

Right. So why did you run it anyway?

man btrfs check:

       Warning
       Do not use --repair unless you are advised to do so by a developer
       or an experienced user

It is always a legitimate complaint, despite this warning, if btrfs check --repair makes things worse, because --repair shouldn't ever make things worse. But Btrfs repairs are complicated, and that's why the warning is there. I suppose the devs could have made the flag --riskyrepair, but I doubt this would really slow users down that much. A big part of the --repair fixes weren't known to make things worse at the time, and edge cases where they made things worse kept popping up, so only in hindsight does it make sense that --repair maybe could have been called something different to catch the user's attention.

But anyway, I see this same sort of thing on the linux-raid list all the time. People run into trouble, and they press full forward making all kinds of changes, each change increasing the chance of data loss. And then they come on the list with WTF messages. And it's always a lesson in patience for the list regulars and developers... if only you'd come to us with questions sooner.

> Please have a look at the console logs.

These aren't logs. It's a record of shell commands. Logs would include kernel messages, ideally all of them. Why is device 3 missing? We have no idea.
Most of the Btrfs code is in the kernel, and problems are reported by the kernel. So we need kernel messages; user space messages aren't enough.

Anyway, good luck with OpenZFS, cool project.

--
Chris Murphy
Re: btrfs problems
On 2018/9/16 at 9:58 PM, Adrian Bastholm wrote:
> Hello all
> Actually I'm not trying to get any help any more, I gave up BTRFS on
> the desktop, but I'd like to share my efforts of trying to fix my
> problems, in hope I can help some poor noob like me.
>
> I decided to use BTRFS after reading the ArsTechnica article about the
> next-gen filesystems, and BTRFS seemed like the natural choice: open
> source, built into linux, etc. I even bought an HP microserver to have
> everything on because none of the commercial NASes supported BTRFS.
> What a mistake; I wasted weeks in total managing something that could
> have taken a day to set up, and I'd have MUCH more functionality now
> (if I wasn't hit by some ransomware, that is).
>
> I had three 1TB drives, chose to use raid, and all was good for a
> while, until I started fiddling with Motion, the image capturing
> software. When you kill that process (my take on it) a file can be
> written but it ends up with question marks instead of attributes, and
> it's impossible to remove.

At this point, your fs is already corrupted. I'm not sure about the reason; it can be a failed CoW combined with power loss, or a corrupted free space cache, or some old kernel bugs.

Anyway, the metadata itself is already corrupted, and I believe it happened even before you noticed.

> BTRFS check --repair is not recommended, it crashes, doesn't fix all
> problems, and I later found out that my lost+found dir had about 39G
> of lost files and dirs.

lost+found is completely created by btrfs check --repair.

> I spent about two days trying to fix everything, removing a disk,
> adding it again, checking, you name it. I ended up removing one disk,
> reformatting it, and moving the data there.

Well, I would recommend submitting such a problem to the mailing list *BEFORE* doing any write operation to the fs (including btrfs check --repair), as it would help us analyse the failure pattern to further enhance btrfs.
> Now I removed BTRFS entirely and replaced it with an OpenZFS mirror
> array, to which I'll add the third disk later when I've transferred
> everything over.

Understandable. It's really annoying when a fs just gets itself corrupted, and without much btrfs-specific knowledge it would just be hell to try any method of fixing it (a lot of them would just make the case worse).

> Please have a look at the console logs. I've been running linux on the
> desktop for the past 15 years, so I'm not a noob, but for running
> BTRFS you better be involved in the development of it.

I'd say, yes. Don't use btrfs check --repair unless you're a developer or a developer asked you to do so.

For any unexpected btrfs behavior, from strange ls output to an aborted transaction, please consult the mailing list first (of course, with kernel version and btrfs-progs version, which are missing in your console log).

In fact, in recent kernel releases (IIRC starting from v4.15), btrfs already does much better error detection, so it would detect such a problem early on and protect the fs from being further modified. (This further shows the importance of using the latest mainline kernel rather than some old kernel provided by a stable distribution.)

Thanks,
Qu

> In my humble opinion, it's not for us "users" just yet. Not even for
> power users.
>
> For those of you considering building a NAS without special purposes,
> don't. Buy a Synology, pop in a couple of drives, and enjoy the ride.
>
> root /home/storage/motion/2017-05-24 1 ls -al
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> total 4
> drwxrwxrwx 1 motion motion 114 Sep 14 12:48 .
> drwxrwxr-x 1 motion adyhasch 60 Sep 14 09:42 ..
> -? ? ?? ?? 36-20170524201346-02.jpg
> -? ? ?? ??
36-20170524201346-02.jpg
> -rwxr-xr-x 1 adyhasch adyhasch 62 Sep 14 12:43 remove.py
> root /home/storage/motion/2017-05-24 1 touch test.raw
> root /home/storage/motion/2017-05-24 cat /dev/random > test.raw
> ^C
> root /home/storage/motion/2017-05-24 ls -al
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> ls: cannot access '36-20170524201346-02.jpg': No such file or directory
> total 8
> drwxrwxrwx 1 motion motion 130 Sep 14 13:12 .
> drwxrwxr-x 1 motion adyhasch 60 Sep 14 09:42 ..
> -? ? ?? ?? 36-20170524201346-02.jpg
> -? ? ?? ?? 36-20170524201346-02.jpg
> -rwxr-xr-x 1 adyhasch adyhasch 62 Sep 14 12:43 remove.py
> -rwxrwxrwx 1 root root 338 Sep 14 13:12 test.raw
> root /home/storage/motion/2017-05-24 1 cp test.raw 36-20170524201346-02.jpg
> 'test.raw' -> '36-20170524201346-02.jpg'
>
> root /home/storage/motion/2017-05-24 ls -al
> total 20
> drwxrwxrwx 1 motion motion 178 Sep 14
Re: btrfs filesystem show takes a long time
Hi, thanks for the report. I've forwarded it to the issue tracker:
https://github.com/kdave/btrfs-progs/issues/148

The show command uses the information provided by blkid, which presumably caches it. The default behaviour of 'fi show' is to skip mount checks, so the delays are likely caused by blkid, but that's not the only possible reason. You may try running the show command under strace to see where it blocks.
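One way to act on that suggestion: capture a log with `strace -T -o show.log btrfs filesystem show` (the `-T` flag appends each syscall's elapsed time in angle brackets) and rank the slowest calls. The parser below is my own sketch, and the sample lines are invented, not real output:

```python
import re

# `strace -T` lines look like:  name(args) = retval <seconds>
SYSCALL_RE = re.compile(r'^(\w+)\(.*\)\s*=\s*\S+.*<([\d.]+)>\s*$')


def slowest_syscalls(lines, top=3):
    """Return the `top` slowest syscalls as (name, seconds), slowest first."""
    timed = []
    for line in lines:
        m = SYSCALL_RE.match(line)
        if m:
            timed.append((float(m.group(2)), m.group(1)))
    timed.sort(reverse=True)
    return [(name, secs) for secs, name in timed[:top]]


# Hypothetical sample; a real log would come from strace -T.
sample = [
    'openat(AT_FDCWD, "/dev/sdb1", O_RDONLY) = 5 <0.000123>',
    'read(5, "..."..., 4096) = 4096 <2.103456>',
    'close(5) = 0 <0.000009>',
]
```

Running `slowest_syscalls(open("show.log"))` on a real capture points at where the command blocks.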
Re: btrfs convert problem
On 2018/9/14 at 1:52 PM, Nikolay Borisov wrote:
>
> On 14.09.2018 02:17, Qu Wenruo wrote:
>>
>> On 2018/9/14 at 12:37 AM, Nikolay Borisov wrote:
>>>
>>> On 13.09.2018 19:15, Serhat Sevki Dincer wrote:
>>>>> -1 seems to be EPERM, is your device write-protected, readonly or
>>>>> something like that?
>>>>
>>>> I am able to use the ext4 partition read/write; I created a file and
>>>> wrote to it, un/re-mounted it, and all is ok. The drive only has a
>>>> micro-USB port and a little LED light; no rw switch or anything like
>>>> that.
>>>
>>> What might help is running btrfs convert under strace. So something like:
>>>
>>> sudo strace -f -o strace.log btrfs-convert /dev/sdb1
>>>
>>> and then send the log.
>>
>> strace would greatly help in this case.
>>
>> My guess is something wrong happened in migrate_super_block(), which
>> doesn't handle the ret > 0 case well.
>> In that case, it means a 4K pwrite doesn't finish in one call, which
>> looks pretty strange.
>
> So Qu,
>
> I'm not seeing any EPERM errors, though I see the following, so you
> might be right:
>
> openat(AT_FDCWD, "/dev/sdb1", O_RDWR) = 5
>
> followed by a lot of preads, the last one of which is:
>
> pread64(5, 0x557a0abd10b0, 4096, 2732765184) = -1 EIO (Input/output
> error)

With context, it's pretty easy to locate the problem. It's a bad block on your device.

Please try the following command to verify it:

dd if=/dev/sdb1 of=/dev/null bs=1 count=4096 skip=2732765184

And it would be nicer to check the kernel dmesg to see if there is a clue there.

The culprit code is read_disk_extent(), which resets ret to -1 when an error happens. I'll send patch(es) to fix it and enhance the error messages for btrfs-convert.

There is a workaround to let btrfs-convert continue: you could use the --no-datasum option to skip datasum generation, so btrfs-convert won't try to read such data. But that's just the ostrich algorithm. I'd recommend reading all files in the original fs to locate which file(s) are affected, and backing up the data asap.
Thanks,
Qu

>
> This is then followed by a lot of pwrites; the last 10 or so syscalls
> are:
>
> pwrite64(5, "\220\255\262\244\0\0\0\0\0\0"..., 16384, 91127808) = 16384
> pwrite64(5, "|IO\233\0\0\0\0\0\0"..., 16384, 91144192) = 16384
> pwrite64(5, "x\2501n\0\0\0\0\0\0"..., 16384, 91160576) = 16384
> pwrite64(5, "\252\254l)\0\0\0\0\0\0"..., 16384, 91176960) = 16384
> pwrite64(5, "P\256\331\373\0\0\0\0\0\0"..., 16384, 91193344) = 16384
> pwrite64(5, "\3\230\230+\0\0\0\0\0\0"..., 4096, 83951616) = 4096
> write(2, "ERROR: ", 7) = 7
> write(2, "failed to "..., 37) = 37
> write(2, "\n", 1) = 1
> close(4) = 0
> write(2, "WARNING: ", 9) = 9
> write(2, "an error o"..., 104) = 104
> write(2, "\n", 1) = 1
> exit_group(1) = ?
>
> but looking at migrate_super_block, it's not doing that many writes -
> just writing the sb at BTRFS_SUPER_INFO_OFFSET and then zeroing out
> 0..BTRFS_SUPER_INFO_OFFSET. Also the number of pwrites:
>
> grep -c pwrite ~/Downloads/btrfs-issue/convert-strace.log
> 198
>
> So the failure is not originating from migrate_super_block.
>
>> Thanks,
>> Qu