On Sun, May 21, 2017 at 06:35:53PM -0700, Marc MERLIN wrote:
> On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
> > On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
> > > gargamel:~# btrfs check --repair /dev/mapper/dshelf1
> > > enabling repair mode
> > > Checking filesystem on /dev/mapper/dshelf1
> > > UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
> > > checking extents
> > > 
> > > This causes a bunch of these:
> > > btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
> > > btrfs-transacti cpuset=/ mems_allowed=0
> > > 
> > > What's the recommended way out of this and which code is at fault? I 
> > > can't tell if btrfs is doing memory allocations wrong, or if it's just 
> > > being undermined by the block layer dying underneath.
> > 
> > I went back to 4.8.10, and similar problem.
> > It looks like btrfs check exercises the kernel and causes everything to 
> > come down to a halt :(
> > 
> > Sadly, I tried a scrub on the same device, and it stalled after 6TB. The
> > scrub process went zombie and the scrub never succeeded, nor could it be
> > stopped.
> 
> So, putting aside the btrfs scrub stall issue, I didn't quite realize
> that the btrfs check memory issues actually caused the kernel to eat all
> the memory until everything crashed/deadlocked/stalled.
> Is that actually working as intended?
> Why doesn't it fail and stop instead of taking my entire server down?
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?
> 
> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
> 
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
> We'll see how many days it takes.

Well, at least it's finding errors, but of course it can't fix them
since lowmem doesn't have repair yet (yes, I know it's WIP)

I already have 24GB of RAM in that machine; adding more for the real
fsck repair to run is going to be difficult, and nbd would take days I
guess (then again, I don't have a machine with 32, 48 or 64GB of RAM
anyway).

I'm guessing my next step is to delete a lot of data from that array
until its metadata use gets back below something that fits in RAM :-/
But hopefully check --repair can be fixed not to crash your machine if
it needs more RAM than is available.
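
As a rough way to gauge how far off the array is, here is a minimal sketch
that compares the metadata "used" figure from `btrfs filesystem df` with the
24GB of RAM mentioned above. The sample output and its numbers are made up
for illustration (they are not from dshelf1), and the idea that check's
memory need tracks metadata size is only the guess stated above, not a
documented rule.

```python
import re

# Binary unit suffixes as printed by `btrfs filesystem df`.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

def metadata_used_bytes(df_output):
    """Sum the 'used' value of every Metadata line in the given text."""
    total = 0.0
    pattern = r"\s*Metadata, \w+: total=\S+, used=([\d.]+)(B|KiB|MiB|GiB|TiB)"
    for line in df_output.splitlines():
        m = re.match(pattern, line)
        if m:
            total += float(m.group(1)) * UNITS[m.group(2)]
    return int(total)

# Hypothetical sample output; the figures are invented for this example.
sample = """\
Data, single: total=14.00TiB, used=13.50TiB
System, DUP: total=32.00MiB, used=1.50MiB
Metadata, DUP: total=40.00GiB, used=35.50GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

ram_bytes = 24 * 2**30  # the 24GB of RAM in the machine
used = metadata_used_bytes(sample)
fits = used < ram_bytes
print(f"metadata used: {used / 2**30:.2f} GiB, fits in RAM: {fits}")
```

On a real system the text would come from running `btrfs filesystem df` on
the mounted filesystem; the comparison is only a heuristic.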

 
Checking filesystem on /dev/mapper/dshelf1
UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
checking free space cache [.]
ERROR: root 53282 EXTENT_DATA[8244 4096] interrupt
ERROR: root 53282 EXTENT_DATA[50585 4096] interrupt
ERROR: root 53282 EXTENT_DATA[51096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[182617 4096] interrupt
ERROR: root 53282 EXTENT_DATA[212972 4096] interrupt
ERROR: root 53282 EXTENT_DATA[260115 4096] interrupt
ERROR: root 53282 EXTENT_DATA[278370 4096] interrupt
ERROR: root 53282 EXTENT_DATA[323505 4096] interrupt
ERROR: root 53282 EXTENT_DATA[396923 4096] interrupt
ERROR: root 53282 EXTENT_DATA[419599 4096] interrupt
ERROR: root 53282 EXTENT_DATA[490602 4096] interrupt
ERROR: root 53282 EXTENT_DATA[555541 4096] interrupt
ERROR: root 53282 EXTENT_DATA[601942 4096] interrupt
ERROR: root 53282 EXTENT_DATA[682215 4096] interrupt
ERROR: root 53282 EXTENT_DATA[721729 4096] interrupt
ERROR: root 53282 EXTENT_DATA[916271 4096] interrupt
ERROR: root 53282 EXTENT_DATA[961074 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1118062 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1127879 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1142984 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1379975 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1398275 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1446265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1459061 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1484265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1509227 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1671096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1692559 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1742832 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1808649 4096] interrupt
ERROR: root 53292 EXTENT_DATA[57240 4096] interrupt
ERROR: root 53446 EXTENT_DATA[3554 4096] interrupt
ERROR: root 53446 EXTENT_DATA[64241 4096] interrupt
(...)
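
Since the list goes on for a while, here is a small sketch for tallying these
errors per root. It assumes the check output was saved as text and that every
error line has exactly the shape shown above; the names `first` and `second`
for the two bracketed numbers are placeholders, since the output does not say
what they mean. The sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as: ERROR: root 53282 EXTENT_DATA[8244 4096] interrupt
LINE_RE = re.compile(r"^ERROR: root (\d+) EXTENT_DATA\[(\d+) (\d+)\] interrupt$")

def summarize(lines):
    """Return (error count per root, number of distinct entries)."""
    per_root = Counter()
    seen = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers, "(...)" and anything unrecognized
        root, first, second = (int(g) for g in m.groups())
        per_root[root] += 1
        seen.add((root, first, second))
    return per_root, len(seen)

sample = [
    "checking free space cache [.]",
    "ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt",
    "ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt",
    "ERROR: root 53292 EXTENT_DATA[57240 4096] interrupt",
    "ERROR: root 53446 EXTENT_DATA[3554 4096] interrupt",
    "(...)",
]

per_root, distinct = summarize(sample)
for root, count in sorted(per_root.items()):
    print(f"root {root}: {count} error(s)")
print(f"distinct entries: {distinct}")
```

Counting distinct entries separately shows whether the same entry is
reported more than once, as happens with 1477900 above.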

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901
