Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On 2017-05-23 14:32, Kai Krakow wrote:
> Am Tue, 23 May 2017 07:21:33 -0400 schrieb "Austin S. Hemmelgarn":
>> On 2017-05-22 22:07, Chris Murphy wrote:
>>> On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN wrote:
>>>> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>>>> [...]
>>>>
>>>> Oh, swap will work, you're sure?
>>>> I already have an SSD, if that's good enough, I can give it a shot.
>>>
>>> Yeah although I have no idea how much swap is needed for it to succeed.
>>> I'm not sure what the relationship between fs metadata chunk size and
>>> btrfs check RAM requirement is; but if it wants all of the metadata in
>>> RAM, then whatever btrfs fi us shows you for metadata may be a guide (?)
>>> for how much memory it's going to want.
>>
>> I think the in-memory storage is a bit more space efficient than the
>> on-disk storage, but I'm not certain, and I'm pretty sure it takes up
>> more space when it's actually repairing things. If I'm doing the math
>> correctly, you _may_ need up to 50% _more_ than the total metadata size
>> for the FS in virtual memory space.
>>
>>> Another possibility is zswap, which still requires a backing device,
>>> but it might be able to limit how much swap to disk is needed if the
>>> data to swap out is highly compressible. *shrug*
>>
>> zswap won't help in that respect, but it might make swapping stuff back
>> in faster. It just keeps a compressed copy in memory in parallel to
>> writing the full copy out to disk, then uses that compressed copy to
>> swap in instead of going to disk if the copy is still in memory (but it
>> will discard the compressed copies if memory gets really low). In
>> essence, it reduces the impact of swapping when memory pressure is
>> moderate (the situation for most desktops, for example), but becomes
>> almost useless when you have very high memory pressure (which is what
>> describes this usage).
>
> Is this really how zswap works?
OK, looking at the documentation, you're correct, and my assumption, based on the description of the front-end (frontswap) and how the other back-end (the Xen transcendent memory driver) appears to behave, was wrong. However, given how zswap does behave, I can't see how it would ever be useful with the default kernel settings, since without manual configuration, the kernel won't try to swap until memory pressure is pretty high, at which point zswap won't likely have much impact.

> I always thought it acts as a compressed write-back cache in front of the
> swap devices.
>
> Pages first go to zswap compressed, and later write-back kicks in and
> migrates those compressed pages to real swap, but still compressed. This
> is done by zswap putting two (or up to three in modern kernels) compressed
> pages into one page. It has the downside of uncompressing all "buddy
> pages" when only one is needed back in. But it stays compressed. This also
> tells me zswap will either achieve around a 1:2 or 1:3 effective
> compression ratio or none at all, so it cannot be compared to how
> streaming compression works.
>
> OTOH, if the page is reloaded from cache before write-back kicks in, it
> will never be written to swap but just uncompressed and discarded from the
> cache.
>
> Under high memory pressure it doesn't really work that well, due to high
> CPU overhead as pages constantly swap out, compress, write, read,
> uncompress, swap in... This usually results in very low CPU usage for
> processes but high IO, disk wait, and kernel CPU usage. But it defers
> memory pressure conditions to a little later, in exchange for a little
> more IO and CPU usage. If you have a lot of inactive memory around, it can
> make a difference, but it is counterproductive if almost all your memory
> is active and pressure is high. So, in this scenario, it probably still
> doesn't help.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
Am Tue, 23 May 2017 07:21:33 -0400 schrieb "Austin S. Hemmelgarn":
> On 2017-05-22 22:07, Chris Murphy wrote:
>> On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN wrote:
>>> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>>> [...]
>>>
>>> Oh, swap will work, you're sure?
>>> I already have an SSD, if that's good enough, I can give it a shot.
>>
>> Yeah although I have no idea how much swap is needed for it to succeed.
>> I'm not sure what the relationship between fs metadata chunk size and
>> btrfs check RAM requirement is; but if it wants all of the metadata in
>> RAM, then whatever btrfs fi us shows you for metadata may be a guide (?)
>> for how much memory it's going to want.
>
> I think the in-memory storage is a bit more space efficient than the
> on-disk storage, but I'm not certain, and I'm pretty sure it takes up
> more space when it's actually repairing things. If I'm doing the math
> correctly, you _may_ need up to 50% _more_ than the total metadata size
> for the FS in virtual memory space.
>
>> Another possibility is zswap, which still requires a backing device,
>> but it might be able to limit how much swap to disk is needed if the
>> data to swap out is highly compressible. *shrug*
>
> zswap won't help in that respect, but it might make swapping stuff back
> in faster. It just keeps a compressed copy in memory in parallel to
> writing the full copy out to disk, then uses that compressed copy to
> swap in instead of going to disk if the copy is still in memory (but it
> will discard the compressed copies if memory gets really low). In
> essence, it reduces the impact of swapping when memory pressure is
> moderate (the situation for most desktops, for example), but becomes
> almost useless when you have very high memory pressure (which is what
> describes this usage).

Is this really how zswap works? I always thought it acts as a compressed write-back cache in front of the swap devices.

Pages first go to zswap compressed, and later write-back kicks in and migrates those compressed pages to real swap, but still compressed. This is done by zswap putting two (or up to three in modern kernels) compressed pages into one page. It has the downside of uncompressing all "buddy pages" when only one is needed back in. But it stays compressed. This also tells me zswap will either achieve around a 1:2 or 1:3 effective compression ratio or none at all, so it cannot be compared to how streaming compression works.

OTOH, if the page is reloaded from cache before write-back kicks in, it will never be written to swap but just uncompressed and discarded from the cache.

Under high memory pressure it doesn't really work that well, due to high CPU overhead as pages constantly swap out, compress, write, read, uncompress, swap in... This usually results in very low CPU usage for processes but high IO, disk wait, and kernel CPU usage. But it defers memory pressure conditions to a little later, in exchange for a little more IO and CPU usage. If you have a lot of inactive memory around, it can make a difference, but it is counterproductive if almost all your memory is active and pressure is high. So, in this scenario, it probably still doesn't help.

--
Regards,
Kai

Replies to list-only preferred.
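The behavior Kai describes above can be poked at directly: zswap is controlled through module parameters under /sys/module/zswap. The commands below are an illustrative sketch (needs root); the specific values are examples only, not settings recommended anywhere in this thread.

```shell
# Sketch: enable and inspect zswap at runtime (run as root).
echo 1 > /sys/module/zswap/parameters/enabled
echo lzo > /sys/module/zswap/parameters/compressor      # or lz4, if built in
echo 20 > /sys/module/zswap/parameters/max_pool_percent # cap pool at 20% of RAM
# zbud packs at most 2 compressed pages per page, matching the ~1:2
# effective-ratio ceiling described above; z3fold (newer kernels) packs 3.
echo zbud > /sys/module/zswap/parameters/zpool
grep . /sys/module/zswap/parameters/*                   # show current settings
```

The same parameters can be set at boot via `zswap.enabled=1` etc. on the kernel command line.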
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Mon, May 22, 2017 at 09:19:34AM +0000, Duncan wrote:
> btrfs check is userspace, not kernelspace. The btrfs-transacti threads

That was my understanding, yes, but since I got it to starve my system, including the in-kernel OOM issues I pasted in my last message and just referenced in https://bugzilla.kernel.org/show_bug.cgi?id=195863, I think it's not as black and white as running a userland process that takes too much RAM and gets killed if it does.

> are indeed kernelspace, but the problem would appear to be either IO or
> memory starvation triggered by the userspace check hogging all available
> resources, not leaving enough for normal system, including kernel,
> processes.

Looks like it, but also memory.

> * Keeping the number of snapshots as low as possible is strongly
> recommended by pretty much everyone here, definitely under 300 per
> subvolume and if possible, to double-digits per subvolume.

I agree that fewer snapshots is better, but between recovery snapshots and btrfs snapshots for some number of subvolumes, things add up :)
gargamel:/mnt/btrfs_pool1# btrfs subvolume list . | wc -l
93
gargamel:/mnt/btrfs_pool2# btrfs subvolume list . | wc -l
103

> * I personally recommend disabling qgroups, unless you're actively
> working with the devs on improving them. In addition to the scaling
> issues, quotas simply aren't reliable enough on btrfs yet to rely on them
> if the use-case requires them (in which case using a mature filesystem
> where they're proven to work is recommended), and if it doesn't, there's
> simply too many remaining issues for the qgroups option to be worth it.

I had considered using them at some point to track the size of each subvolume, but good to know they're still not quite ready yet.

> * I personally recommend keeping overall filesystem size to something one
> can reasonably manage. Most people's use-cases aren't going to allow for
> an fsck taking days and tens of GiB, but /will/ allow for multi-TB
> filesystems to be split out into multiple independent filesystems of
> perhaps a TB or two each, tops, if that's the alternative to multiple-day
> fscks taking tens of GiB. (Some use-cases are of course exceptions.)

fsck ran in 6H with bcache, but the lowmem one could take a lot longer. Running over nbd to another host with more RAM could indeed take days, given the loss of bcache and the added latency/bandwidth cost of a network.

> * The low-memory-mode btrfs check is being developed, tho unfortunately
> it doesn't yet do repairs. (Another reason is that it's an alternate
> implementation that provides a very useful second opinion and the ability
> to cross-check one implementation against the other in hard problem
> cases.)

True.

>>> Sadly, I tried a scrub on the same device, and it stalled after 6TB.
>>> The scrub process went zombie and the scrub never succeeded, nor could
>>> it be stopped.
>
> Quite apart from the "... after 6TB" bit setting off my own "it's too big
> to reasonably manage" alarm, the filesystem obviously is bugged, and
> scrub as well, since it shouldn't just go zombie regardless of the
> problem -- it should fail much more gracefully.

:) In this case it's mostly big files, so it's fine metadata-wise but takes a while to scrub (<24H though). The problem I had is that I copied all of dshelf2 onto dshelf1 while I blew away ds2 and rebuilt it. That extra metadata (many smaller files) tipped the metadata size of ds1 over the edge. Once I blew away that backup, things became ok again.

> Meanwhile, FWIW, unlike check, scrub /is/ kernelspace.

Correct, just like balance.

> As explained, check is userspace, but as you found, it can still
> interfere with kernelspace, including unrelated btrfs-transaction
> threads. When the system's out of memory, it's out of memory.

Userspace should not take the entire system down without the OOM killer even firing. Also, in the logs I just sent, it showed that none of my swap space had been used. Why would that be?

> Tho there is ongoing work into better predicting memory allocation needs
> for btrfs kernel threads and reserving memory space accordingly, so this
> sort of thing doesn't happen any more.

That would be good.

> Agreed. Lowmem mode looks like about your only option, beyond simply
> blowing it away, at this point.

Too bad lowmem doesn't do repair yet, but it wasn't really an option anyway, since it can't fix even the small corruption issue I had. Thankfully, deleting enough metadata allowed check to run within my RAM, and check --repair has fixed it now.

> with a bit of luck it should at least give you and the devs some idea
> what's wrong, information that can in turn be used to fix both scrub and
> normal check mode, as well as low-mem repair mode, once it's available.

In this case, not useful information for the devs. It's a bad SAS card that corrupted my data, not a bug in the kernel code.

> Of course your "days" comment is triggering my "it's too big to maintain"
> reflex again, but obviously
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Tue, May 23, 2017 at 07:21:33AM -0400, Austin S. Hemmelgarn wrote:
>> Yeah although I have no idea how much swap is needed for it to succeed.
>> I'm not sure what the relationship between fs metadata chunk size and
>> btrfs check RAM requirement is; but if it wants all of the metadata in
>> RAM, then whatever btrfs fi us shows you for metadata may be a guide (?)
>> for how much memory it's going to want.
>
> I think the in-memory storage is a bit more space efficient than the
> on-disk storage, but I'm not certain, and I'm pretty sure it takes up
> more space when it's actually repairing things. If I'm doing the math
> correctly, you _may_ need up to 50% _more_ than the total metadata size
> for the FS in virtual memory space.

So I was able to rescue/fix my system by removing a bunch of temporary data on it, which in turn freed up enough metadata for btrfs check to work again. The things to fix were minor, so they were fixed quickly.

I seem to be the last person who edited https://btrfs.wiki.kernel.org/index.php/Btrfsck and it's therefore way out of date :)

I propose the following:
1) One dev needs to confirm that as long as you have enough swap, btrfs check should succeed, and give some guideline of metadata size to swap size. Then again, I think swap doesn't help; see below.
2) I still think there is an issue with either the OOM killer, or btrfs check actually chewing up kernel RAM. I've never seen any linux system die in the spectacular ways mine died with that btrfs check if it were only taking userspace RAM. I've filed a bug, because it looks bad: https://bugzilla.kernel.org/show_bug.cgi?id=195863

Can someone read those logs better than me? Is it userspace RAM that is missing? You said that swap would help, but in the dump below, I see:
Free swap = 15366388kB
so my swap was unused and the system crashed due to OOM anyway.
btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
btrfs-transacti cpuset=/ mems_allowed=0
Mem-Info:
active_anon:5274313 inactive_anon:378373 isolated_anon:3590
 active_file:3711 inactive_file:3809 isolated_file:0
 unevictable:1467 dirty:5068 writeback:49189 unstable:0
 slab_reclaimable:8721 slab_unreclaimable:67310
 mapped:556943 shmem:801313 pagetables:15777 bounce:0
 free:89741 free_pcp:6 free_cma:0
Node 0 active_anon:21097252kB inactive_anon:1513492kB active_file:14844kB inactive_file:15236kB unevictable:5868kB isolated(anon):14360kB isolated(file):0kB mapped:2227772kB dirty:20272kB writeback:196756kB shmem:3205252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:215184 all_unreclaimable? no
Node 0 DMA free:15880kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3201 23768 23768 23768
Node 0 DMA32 free:116720kB min:35424kB low:44280kB high:53136kB active_anon:3161376kB inactive_anon:8kB active_file:320kB inactive_file:332kB unevictable:0kB writepending:612kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:460kB slab_unreclaimable:668kB kernel_stack:16kB pagetables:7292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 20567 20567 20567
Node 0 Normal free:226664kB min:226544kB low:283180kB high:339816kB active_anon:17935552kB inactive_anon:1513564kB active_file:14524kB inactive_file:14904kB unevictable:5868kB writepending:216372kB present:21485568kB managed:21080208kB mlocked:5868kB slab_reclaimable:34412kB slab_unreclaimable:268520kB kernel_stack:12480kB pagetables:55816kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Node 0 DMA32: 768*4kB (UME) 740*8kB (UME) 685*16kB (UME) 446*32kB (UME) 427*64kB (UME) 233*128kB (UME) 79*256kB (UME) 10*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 116720kB
Node 0 Normal: 25803*4kB (UME) 11297*8kB (UME) 947*16kB (UME) 260*32kB (ME) 72*64kB (UM) 15*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 223844kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
858720 total pagecache pages
49221 pages in swap cache
Swap cache stats: add 62319, delete 13131, find 75/76
Free swap = 15366388kB
Total swap = 15616764kB
6215902 pages RAM
0 pages HighMem/MovableOnly
117753 pages reserved
4096 pages cma reserved

I'm also happy to modify the wiki to:
1) mention that there is a lowmem mode, which in turn isn't really useful for much yet, since it won't repair even a trivial thing (I've seen patches go around, but they're not in upstream yet)
2) warn that for now, check --repair of a big filesystem will crash
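As a side note on the "swap was unused" question: a couple of sysctl knobs influence how eagerly the kernel swaps before the OOM killer fires. A sketch of what to inspect (read-only; the defaults noted are stock kernel defaults, not values from this thread):

```shell
# Sketch: knobs that affect whether swap gets used before OOM.
sysctl vm.swappiness         # default 60; 0 means swap only under duress
sysctl vm.overcommit_memory  # default 0: heuristic overcommit
sysctl vm.min_free_kbytes    # reserve the allocator tries to keep free
# Also note: GFP_NOFS allocations, as in the stall trace above, cannot
# recurse into filesystem code during reclaim, which limits what the
# allocator can free at that moment.
```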
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On 2017-05-22 22:07, Chris Murphy wrote:
> On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN wrote:
>> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>>> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN wrote:
>>>> I already have 24GB of RAM in that machine, adding more for the real
>>>> fsck repair to run, is going to be difficult, and nbd would take days I
>>>> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
>>>> anyway).
>>>
>>> If you can acquire an SSD, you can give the system a bunch of swap, and
>>> at least then hopefully the check repair can complete. Yes it'll be
>>> slower than with real RAM but it's not nearly as bad as you might think
>>> it'd be, based on HDD based swap.
>>
>> Oh, swap will work, you're sure?
>> I already have an SSD, if that's good enough, I can give it a shot.
>
> Yeah although I have no idea how much swap is needed for it to succeed.
> I'm not sure what the relationship between fs metadata chunk size and
> btrfs check RAM requirement is; but if it wants all of the metadata in
> RAM, then whatever btrfs fi us shows you for metadata may be a guide (?)
> for how much memory it's going to want.

I think the in-memory storage is a bit more space efficient than the on-disk storage, but I'm not certain, and I'm pretty sure it takes up more space when it's actually repairing things. If I'm doing the math correctly, you _may_ need up to 50% _more_ than the total metadata size for the FS in virtual memory space.

> Another possibility is zswap, which still requires a backing device, but
> it might be able to limit how much swap to disk is needed if the data to
> swap out is highly compressible. *shrug*

zswap won't help in that respect, but it might make swapping stuff back in faster. It just keeps a compressed copy in memory in parallel to writing the full copy out to disk, then uses that compressed copy to swap in instead of going to disk if the copy is still in memory (but it will discard the compressed copies if memory gets really low). In essence, it reduces the impact of swapping when memory pressure is moderate (the situation for most desktops, for example), but becomes almost useless when you have very high memory pressure (which is what describes this usage).
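The "metadata size plus 50%" estimate above can be turned into a quick back-of-the-envelope helper. This is a sketch of the thread's rule of thumb, not a documented bound; the helper name is mine, and the input is the metadata figure in bytes (e.g. as reported by `btrfs fi usage -b`):

```shell
# Sketch: rough virtual-memory budget for btrfs check, using the
# "up to 50% more than total metadata" guess from this thread.
estimate_check_mem() {
  metadata_bytes=$1
  # metadata * 1.5, reported in whole GiB (integer math)
  echo "$(( metadata_bytes * 3 / 2 / 1024 / 1024 / 1024 )) GiB"
}

# e.g. for 40 GiB of metadata:
estimate_check_mem $((40 * 1024 * 1024 * 1024))   # prints "60 GiB"
```

If that figure exceeds installed RAM, the difference is a starting point for how much swap to add before attempting a repair.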
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN wrote:
> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN wrote:
>>> I already have 24GB of RAM in that machine, adding more for the real
>>> fsck repair to run, is going to be difficult, and nbd would take days I
>>> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
>>> anyway).
>>
>> If you can acquire an SSD, you can give the system a bunch of swap,
>> and at least then hopefully the check repair can complete. Yes it'll
>> be slower than with real RAM but it's not nearly as bad as you might
>> think it'd be, based on HDD based swap.
>
> Oh, swap will work, you're sure?
> I already have an SSD, if that's good enough, I can give it a shot.

Yeah although I have no idea how much swap is needed for it to succeed. I'm not sure what the relationship between fs metadata chunk size and btrfs check RAM requirement is; but if it wants all of the metadata in RAM, then whatever btrfs fi us shows you for metadata may be a guide (?) for how much memory it's going to want.

Another possibility is zswap, which still requires a backing device, but it might be able to limit how much swap to disk is needed if the data to swap out is highly compressible. *shrug*

--
Chris Murphy
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN wrote:
>> I already have 24GB of RAM in that machine, adding more for the real
>> fsck repair to run, is going to be difficult, and nbd would take days I
>> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
>> anyway).
>
> If you can acquire an SSD, you can give the system a bunch of swap,
> and at least then hopefully the check repair can complete. Yes it'll
> be slower than with real RAM but it's not nearly as bad as you might
> think it'd be, based on HDD based swap.

Oh, swap will work, you're sure?
I already have an SSD, if that's good enough, I can give it a shot.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN wrote:
> I already have 24GB of RAM in that machine, adding more for the real
> fsck repair to run, is going to be difficult, and nbd would take days I
> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
> anyway).

If you can acquire an SSD, you can give the system a bunch of swap, and at least then hopefully the check repair can complete. Yes, it'll be slower than with real RAM, but it's not nearly as bad as you might think it'd be based on HDD-backed swap.

--
Chris Murphy
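This suggestion can be tried without repartitioning by putting a swapfile on the SSD. A sketch, assuming the SSD is mounted at /mnt/ssd on a non-btrfs filesystem (swapfiles on btrfs itself were not supported at the time of this thread); the path and 64G size are examples, not recommendations:

```shell
# Sketch: add SSD-backed swap for the duration of the check (as root).
fallocate -l 64G /mnt/ssd/swapfile   # size is an example, not a sizing rule
chmod 600 /mnt/ssd/swapfile
mkswap /mnt/ssd/swapfile
swapon /mnt/ssd/swapfile
swapon --show                        # confirm it's active before btrfs check
# afterwards:
#   swapoff /mnt/ssd/swapfile && rm /mnt/ssd/swapfile
```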
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Sun, May 21, 2017 at 06:35:53PM -0700, Marc MERLIN wrote:
> On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
>> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
>>> gargamel:~# btrfs check --repair /dev/mapper/dshelf1
>>> enabling repair mode
>>> Checking filesystem on /dev/mapper/dshelf1
>>> UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
>>> checking extents
>>>
>>> This causes a bunch of these:
>>> btrfs-transacti: page allocation stalls for 23508ms, order:0,
>>> mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
>>> btrfs-transacti cpuset=/ mems_allowed=0
>>>
>>> What's the recommended way out of this and which code is at fault? I
>>> can't tell if btrfs is doing memory allocations wrong, or if it's just
>>> being undermined by the block layer dying underneath.
>>
>> I went back to 4.8.10, and similar problem.
>> It looks like btrfs check exercises the kernel and causes everything to
>> come down to a halt :(
>>
>> Sadly, I tried a scrub on the same device, and it stalled after 6TB. The
>> scrub process went zombie and the scrub never succeeded, nor could it be
>> stopped.
>
> So, putting aside the btrfs scrub stall issue, I didn't quite realize
> that btrfs check memory issues actually caused the kernel to eat all the
> memory until everything crashed/deadlocked/stalled.
> Is that actually working as intended?
> Why doesn't it fail and stop instead of taking my entire server down?
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?
>
> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
>
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
> We'll see how many days it takes.
Well, at least it's finding errors, but of course it can't fix them, since lowmem doesn't have repair yet (yes, I know it's WIP).

I already have 24GB of RAM in that machine; adding more for the real fsck repair to run is going to be difficult, and nbd would take days I guess (then again I don't have a machine with 32 or 48 or 64GB of RAM anyway).

I'm guessing my next step is to delete a lot of data from that array until its metadata use gets back below something that fits in RAM :-/
But hopefully check --repair can be fixed not to crash your machine if it needs more RAM than is available.

Checking filesystem on /dev/mapper/dshelf1
UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
checking free space cache [.]
ERROR: root 53282 EXTENT_DATA[8244 4096] interrupt
ERROR: root 53282 EXTENT_DATA[50585 4096] interrupt
ERROR: root 53282 EXTENT_DATA[51096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[182617 4096] interrupt
ERROR: root 53282 EXTENT_DATA[212972 4096] interrupt
ERROR: root 53282 EXTENT_DATA[260115 4096] interrupt
ERROR: root 53282 EXTENT_DATA[278370 4096] interrupt
ERROR: root 53282 EXTENT_DATA[323505 4096] interrupt
ERROR: root 53282 EXTENT_DATA[396923 4096] interrupt
ERROR: root 53282 EXTENT_DATA[419599 4096] interrupt
ERROR: root 53282 EXTENT_DATA[490602 4096] interrupt
ERROR: root 53282 EXTENT_DATA[41 4096] interrupt
ERROR: root 53282 EXTENT_DATA[601942 4096] interrupt
ERROR: root 53282 EXTENT_DATA[682215 4096] interrupt
ERROR: root 53282 EXTENT_DATA[721729 4096] interrupt
ERROR: root 53282 EXTENT_DATA[916271 4096] interrupt
ERROR: root 53282 EXTENT_DATA[961074 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1118062 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1127879 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1142984 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1379975 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1398275 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1446265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1459061 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1484265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1509227 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1671096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1692559 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1742832 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1808649 4096] interrupt
ERROR: root 53292 EXTENT_DATA[57240 4096] interrupt
ERROR: root 53446 EXTENT_DATA[3554 4096] interrupt
ERROR: root 53446 EXTENT_DATA[64241 4096] interrupt
(...)

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
Marc MERLIN posted on Sun, 21 May 2017 18:35:53 -0700 as excerpted: > On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote: >> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote: >> > gargamel:~# btrfs check --repair /dev/mapper/dshelf1 enabling repair >> > mode Checking filesystem on /dev/mapper/dshelf1 UUID: >> > 36f5079e-ca6c-4855-8639-ccb82695c18d checking extents >> > >> > This causes a bunch of these: >> > btrfs-transacti: page allocation stalls for 23508ms, order:0, >> > mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null) >> > btrfs-transacti cpuset=/ mems_allowed=0 >> > >> > What's the recommended way out of this and which code is at fault? I >> > can't tell if btrfs is doing memory allocations wrong, or if it's >> > just being undermined by the block layer dying underneath. >> >> I went back to 4.8.10, and similar problem. >> It looks like btrfs check exercises the kernel and causes everything to >> come down to a halt :( btrfs check is userspace, not kernelspace. The btrfs-transacti threads are indeed kernelspace, but the problem would appear to be either IO or memory starvation triggered by the userspace check hogging all available resources, not leaving enough for normal system, including kernel, processes. Check is /known/ to be memory intensive, with multi-TB filesystems often requiring tens of GiB of memory, and qgroups and snapshots are both known to dramatically intensify the scaling issues. (btrfs balance, by contrast, has the same scaling issues, but is kernelspace.) That's one reason why (not all of these may apply to your case) ... * Keeping the number of snapshots as low as possible is strongly recommended by pretty much everyone here, definitely under 300 per subvolume and if possible, to double-digits per subvolume. * I personally recommend disabling qgroups, unless you're actively working with the devs on improving them. 
In addition to the scaling issues, quotas simply aren't reliable enough on btrfs yet to rely on them if the use-case requires them (in which case using a mature filesystem where they're proven to work is recommended), and if it doesn't, there's simply too many remaining issues for the qgroups option to be worth it. * I personally recommend keeping overall filesystem size to something one can reasonably manage. Most people's use-cases aren't going to allow for an fsck taking days and tens of GiB, but /will/ allow for multi-TB filesystems to be split out into multiple independent filesystems of perhaps a TB or two each, tops, if that's the alternative to multiple-day fscks taking tens of GiB. (Some use-cases are of course exceptions.) * The low-memory-mode btrfs check is being developed, tho unfortunately it doesn't yet do repairs. (Another reason is that it's an alternate implementation that provides a very useful second opinion and the ability to cross-check one implementation against the other in hard problem cases.) (The two "I personally recommend" points above aren't recommendations shared by everyone on the list, but obviously I've found them very useful here. =:^) >> Sadly, I tried a scrub on the same device, and it stalled after 6TB. >> The scrub process went zombie and the scrub never succeeded, nor could >> it be stopped. Quite apart from the "... after 6TB" bit setting off my own "it's too big to reasonably manage" alarm, the filesystem obviously is bugged, and scrub as well, since it shouldn't just go zombie regardless of the problem -- it should fail much more gracefully. Meanwhile, FWIW, unlike check, scrub /is/ kernelspace. > So, putting the btrfs scrub that stalled issue, I didn't quite realize > that btrs check memory issues actually caused the kernel to eat all the > memory until everything crashed/deadlocked/stalled. > Is that actually working as intended? > Why doesn't it fail and stop instead of taking my entire server down? 
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?

As explained, check is userspace, but as you found, it can still
interfere with kernelspace, including unrelated btrfs-transaction
threads.  When the system's out of memory, it's out of memory.

Tho there is ongoing work into better predicting memory allocation
needs for btrfs kernel threads and reserving memory space accordingly,
so this sort of thing doesn't happen any more.

Of course it could also be some sort of (not necessarily directly
btrfs) lockdep issue, and there's ongoing kernel-wide and btrfs work
there as well.

> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
>
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
>
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
>
> We'll see how many days it takes.

Agreed.  Lowmem mode looks like about your only option, beyond simply
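As for how big "tens of GiB" actually is for a given filesystem: the heuristic that comes out of this thread is that check wants roughly the filesystem's metadata in RAM, possibly ~50% more while repairing. That can be turned into a quick sizing sketch. This is a back-of-envelope estimate only — the 1.5x headroom factor and the assumption that `btrfs filesystem df` reports metadata in GiB are taken from the discussion here, not from any documented bound, and the helper name is made up:

```shell
# Rough estimate of how much RAM+swap "btrfs check" may want, per this
# thread's heuristic: metadata used, plus ~50% headroom for repairs.
# Reads "btrfs filesystem df"-style output on stdin; assumes GiB units.
estimate_check_mem_gib() {
    awk -F'used=' '/^Metadata/ { sub(/GiB.*/, "", $2); sum += $2 }
                   END { printf "%.1f\n", sum * 1.5 }'
}

# Example with canned output; on a real system you would run:
#   btrfs filesystem df /mnt | estimate_check_mem_gib
printf 'Data, single: total=10.00TiB, used=9.20TiB\nMetadata, DUP: total=96.00GiB, used=45.21GiB\nSystem, DUP: total=32.00MiB, used=1.05MiB\n' \
    | estimate_check_mem_gib
```

Compare the result against MemTotal plus available swap before starting a repair; if it doesn't fit, lowmem mode or more swap is the fallback.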
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
>> gargamel:~# btrfs check --repair /dev/mapper/dshelf1
>> enabling repair mode
>> Checking filesystem on /dev/mapper/dshelf1
>> UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
>> checking extents
>>
>> This causes a bunch of these:
>> btrfs-transacti: page allocation stalls for 23508ms, order:0,
>> mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
>> btrfs-transacti cpuset=/ mems_allowed=0
>>
>> What's the recommended way out of this and which code is at fault? I
>> can't tell if btrfs is doing memory allocations wrong, or if it's
>> just being undermined by the block layer dying underneath.
>
> I went back to 4.8.10, and similar problem.
> It looks like btrfs check exercises the kernel and causes everything
> to come down to a halt :(
>
> Sadly, I tried a scrub on the same device, and it stalled after 6TB.
> The scrub process went zombie and the scrub never succeeded, nor could
> it be stopped.

So, putting aside the btrfs scrub stall issue, I didn't quite realize
that btrfs check memory issues actually caused the kernel to eat all
the memory until everything crashed/deadlocked/stalled.

Is that actually working as intended? Why doesn't it fail and stop
instead of taking my entire server down? Clearly there must be a rule
against a kernel subsystem taking all the memory from everything until
everything crashes/deadlocks, right?

So for now, I'm doing a lowmem check, but it's not going to be very
helpful since it cannot repair anything if it finds a problem.

At least my machine isn't crashing anymore, I suppose that's still an
improvement.

gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1

We'll see how many days it takes.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
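On the "why does it take my entire server down" question above: one containment option — a sketch only, not something proposed in this thread beyond the general "add swap on the SSD" suggestion — is to give check SSD-backed swap to spill into and cap its memory in a transient cgroup, so that exceeding the cap OOM-kills check rather than starving kernel threads like btrfs-transaction. The paths, sizes, and the ext4 mount are all assumptions; the swapfile must NOT live on the btrfs filesystem being checked (btrfs did not support swapfiles at the time):

```shell
# Sketch only -- run as root.  /mnt/ssd is an assumed ext4 mount on the
# SSD; 64G/48G are arbitrary example sizes, not tuned values.
fallocate -l 64G /mnt/ssd/check.swap   # preallocate a swapfile on the SSD
chmod 600 /mnt/ssd/check.swap
mkswap /mnt/ssd/check.swap
swapon /mnt/ssd/check.swap

# Run check in a transient cgroup with a memory cap: if it exceeds the
# cap it swaps or gets OOM-killed, instead of stalling the whole box.
# (MemoryLimit= is the systemd property of this era; newer systemd
# spells it MemoryMax=.)
systemd-run --scope -p MemoryLimit=48G \
    btrfs check --repair /dev/mapper/dshelf1

swapoff /mnt/ssd/check.swap            # clean up afterwards
rm /mnt/ssd/check.swap
```

The trade-off is that a capped check that genuinely needs more memory will die rather than finish, but the machine stays responsive enough to see why.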
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
> gargamel:~# btrfs check --repair /dev/mapper/dshelf1
> enabling repair mode
> Checking filesystem on /dev/mapper/dshelf1
> UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
> checking extents
>
> This causes a bunch of these:
> btrfs-transacti: page allocation stalls for 23508ms, order:0,
> mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
> btrfs-transacti cpuset=/ mems_allowed=0
>
> What's the recommended way out of this and which code is at fault? I
> can't tell if btrfs is doing memory allocations wrong, or if it's
> just being undermined by the block layer dying underneath.

I went back to 4.8.10, and similar problem.
It looks like btrfs check exercises the kernel and causes everything to
come down to a halt :(

Sadly, I tried a scrub on the same device, and it stalled after 6TB.
The scrub process went zombie and the scrub never succeeded, nor could
it be stopped.

What do I try next? My filesystem seems ok when I use it, except for
that BUG() crash I just reported a few days ago. I'm willing to believe
there is some problem with it somewhere, but if I can't scrub or check
it, it's kind of hard to look into it further.

[ 1090.912073] INFO: task kworker/dying:63 blocked for more than 120 seconds.
[ 1090.933850] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1090.959465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1090.983973] kworker/dying D 9a23424e3d68 063 2 0x
[ 1091.006171]  9a23424e3d68 00ff9a213ab32140 8dc0d4c0 9a23424dc100
[ 1091.029349]  9a23424e3d50 9a23424e4000 9a234098d064 9a23424dc100
[ 1091.052490]  9a234098d068 9a23424e3d80 8d6cf1a6
[ 1091.075679] Call Trace:
[ 1091.083882]  [] schedule+0x8b/0xa3
[ 1091.099532]  [] schedule_preempt_disabled+0x18/0x24
[ 1091.119518]  [] __mutex_lock_slowpath+0xce/0x16d
[ 1091.138705]  [] mutex_lock+0x17/0x27
[ 1091.154772]  [] ? mutex_lock+0x17/0x27
[ 1091.171382]  [] acct_process+0x4e/0xe0
[ 1091.187974]  [] ? rescuer_thread+0x24f/0x2d1
[ 1091.206170]  [] do_exit+0x3ba/0x97b
[ 1091.222001]  [] ? kfree+0x7a/0x99
[ 1091.237307]  [] ? worker_thread+0x2ab/0x2ba
[ 1091.255219]  [] ? rescuer_thread+0x2d1/0x2d1
[ 1091.273390]  [] kthread+0xbc/0xbc
[ 1091.288672]  [] ret_from_fork+0x1f/0x40
[ 1091.305524]  [] ? init_completion+0x24/0x24
[ 1091.323404] INFO: task kworker/u16:4:158 blocked for more than 120 seconds.
[ 1091.344956] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1091.370145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1091.394299] kworker/u16:4 D 9a233e607b58 0 158 2 0x
[ 1091.416260] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 1091.435259]  9a233e607b58 00ff8d0899ae 9a21f0c76180 9a233e6021c0
[ 1091.458328]  9a233e607b40 9a233e608000 7fff 9a233e6021c0
[ 1091.481385]  8d6d1244 9a2317491e68 9a233e607b70 8d6cf1a6
[ 1091.504472] Call Trace:
[ 1091.512751]  [] ? usleep_range+0x65/0x65
[ 1091.530093]  [] schedule+0x8b/0xa3
[ 1091.545833]  [] schedule_timeout+0x43/0x126
[ 1091.563782]  [] ? wake_up_process+0x15/0x17
[ 1091.581707]  [] do_wait_for_common+0x123/0x15f
[ 1091.600403]  [] ? do_wait_for_common+0x123/0x15f
[ 1091.619625]  [] ? wake_up_q+0x47/0x47
[ 1091.635983]  [] wait_for_common+0x3b/0x55
[ 1091.653380]  [] wait_for_completion+0x1d/0x1f
[ 1091.671811]  [] btrfs_async_run_delayed_refs+0xd3/0xed
[ 1091.692598]  [] __btrfs_end_transaction+0x2a7/0x2dd
[ 1091.712585]  [] btrfs_end_transaction+0x10/0x12
[ 1091.731529]  [] btrfs_finish_ordered_io+0x3f7/0x4db
[ 1091.751495]  [] finish_ordered_fn+0x15/0x17
[ 1091.769372]  [] btrfs_scrubparity_helper+0x10e/0x258
[ 1091.789590]  [] btrfs_endio_write_helper+0xe/0x10
[ 1091.809014]  [] process_one_work+0x186/0x29d
[ 1091.827123]  [] worker_thread+0x1ea/0x2ba
[ 1091.844438]  [] ? rescuer_thread+0x2d1/0x2d1
[ 1091.862521]  [] kthread+0xb4/0xbc
[ 1091.877718]  [] ret_from_fork+0x1f/0x40
[ 1091.894476]  [] ? init_completion+0x24/0x24
[ 1091.912276] INFO: task kworker/u16:5:159 blocked for more than 120 seconds.
[ 1091.933740] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1091.958847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1091.982906] kworker/u16:5 D 9a233e60f9c0 0 159 2 0x
[ 1092.004713] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[ 1092.023611]  9a233e60f9c0 0246 9a2342fa41c0 9a233e608200
[ 1092.046575]  9a233e60f9a8 9a233e61 9a213898bd88 9a233052f800
[ 1092.069536]  0001 0001 9a233e60f9d8 8d6cf1a6
[ 1092.092523] Call Trace:
[ 1092.100496]  [] schedule+0x8b/0xa3
[ 1092.115995]  [] btrfs_tree_lock+0xd6/0x1fb
[ 1092.133574]  []
4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
gargamel:~# btrfs check --repair /dev/mapper/dshelf1
enabling repair mode
Checking filesystem on /dev/mapper/dshelf1
UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
checking extents

This causes a bunch of these:
btrfs-transacti: page allocation stalls for 23508ms, order:0,
mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
btrfs-transacti cpuset=/ mems_allowed=0

What's the recommended way out of this and which code is at fault? I
can't tell if btrfs is doing memory allocations wrong, or if it's just
being undermined by the block layer dying underneath.

And sadly, I'm also getting workqueue stalls like these:

[ 3996.047531] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 45s!
[ 3996.073512] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 52s!
[ 3996.099466] Showing busy workqueues and worker pools:
[ 3996.116824] workqueue events: flags=0x0
[ 3996.130409]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256
[ 3996.150268]     in-flight: 9661:do_sync_work
[ 3996.165186]     pending: wait_rcu_exp_gp, cache_reap
[ 3996.182139]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[ 3996.202099]     in-flight: 9725:do_emergency_remount, 9738:do_poweroff
[ 3996.223543]     pending: drm_fb_helper_dirty_work, cache_reap, do_sync_work, vmstat_shepherd, update_writeback_rate [bcache], do_poweroff
[ 3996.263991] workqueue writeback: flags=0x4e
[ 3996.278116]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 3996.296586]     in-flight: 149:wb_workfn wb_workfn
[ 3996.312387] workqueue btrfs-endio-write: flags=0xe
[ 3996.328090]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/8
[ 3996.345794]     in-flight: 20326:btrfs_endio_write_helper, 2927:btrfs_endio_write_helper
[ 3996.371981] workqueue kcryptd: flags=0x2a
[ 3996.386019]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=8/8
[ 3996.404325]     in-flight: 9950:kcryptd_crypt [dm_crypt], 8859:kcryptd_crypt [dm_crypt], 31087:kcryptd_crypt [dm_crypt], 2929:kcryptd_crypt [dm_crypt], 20328:kcryptd_crypt [dm_crypt], 5951:kcryptd_crypt
[dm_crypt], 31084:kcryptd_crypt [dm_crypt], 7553:kcryptd_crypt [dm_crypt] [ 3996.484333] delayed: kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt] [ 3996.719697] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt] Problems started here: [ 3624.349624] php5: page allocation stalls for 15028ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null) [ 3624.382270] php5 cpuset=/ mems_allowed=0 [ 3624.395474] CPU: 1 PID: 9949 Comm: php5 Tainted: G U 
4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3624.424907] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3624.453233] Call Trace:
[ 3624.461292]  dump_stack+0x61/0x7d
[ 3624.472098]  warn_alloc+0xfc/0x18c
[ 3624.483150]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3624.497528]  ? finish_wait+0x5a/0x63
[ 3624.509018]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3624.523343]  alloc_pages_current+0x9b/0xbd
[ 3624.536346]  __page_cache_alloc+0x8e/0xa4
[ 3624.549067]  pagecache_get_page+0xc9/0x16b
[ 3624.562067]  alloc_extent_buffer+0xdf/0x305
[ 3624.575342]  read_tree_block+0x19/0x4e
[ 3624.587295]  read_block_for_search.isra.21+0x211/0x264
[ 3624.603420]  btrfs_search_slot+0x52b/0x72e
[ 3624.616387]  btrfs_lookup_csum+0x52/0xf7
[ 3624.628835]  __btrfs_lookup_bio_sums+0x23b/0x448
[ 3624.643396]  btrfs_lookup_bio_sums+0x16/0x18
[ 3624.656886]  btrfs_submit_bio_hook+0xcb/0x14a
[ 3624.670639]
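When reproducing allocation stalls like the above, it helps to log memory headroom while check runs, so you can see MemAvailable and SwapFree collapse before the stall messages start. A minimal watcher sketch (the helper name and GiB formatting are mine, not from this thread):

```shell
# Print MemAvailable and SwapFree in GiB from /proc/meminfo-style input.
mem_headroom_gib() {
    # /proc/meminfo reports values in kB; 1048576 kB per GiB
    awk '/^(MemAvailable|SwapFree):/ { printf "%s %.1fGiB\n", $1, $2 / 1048576 }'
}

# Live use on Linux, in a second terminal while btrfs check runs:
#   while sleep 30; do date; mem_headroom_gib < /proc/meminfo; done
```

Correlating the timestamps of that log with the kernel's "page allocation stalls" messages shows how much warning (if any) you get before the machine starts to wedge.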