Franziska Näpelt posted on Thu, 26 Jun 2014 13:47:12 +0200 as excerpted:

> What do you mean by "if it seems the space_cache rebuild is
> interfering with further activity for too long"?
> 
> The boot process has been running for five hours now. How long should
> I wait? What would you recommend?

Well, you have TiBs of capacity to work thru, and your drives will be 
doing a lot of seeking, so they won't be running at anything like full 
rated speed.  Multiple TiB at say 10 MiB/sec of progress... ~100 seconds 
per GiB, a couple thousand GiB... ~55 hours?  That's the couple-TB drive 
you mentioned, if it was near full.  I suspect it's doing something else 
too, hopefully finishing the delete, but if the I/O for that is fighting 
with the I/O for the space_cache rebuild, given the size and the two 
I/O-heavy tasks at once, it could take a while.  Tho hopefully the 
space_cache rebuild will be done in a few hours and the delete will go 
faster after that.

Anyway, when you're talking 2 TB, even at a relatively brisk 100 MB/sec 
you're looking at five or six hours, so if it /is/ actually completing 
the delete, that's about the /minimum/ I'd expect.
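
For a quick sanity check, here's that back-of-the-envelope arithmetic as 
a few lines of Python.  Both throughput figures are guesses, not 
measurements, and your real rate will wander somewhere between them:

# ETA in hours for working thru a given amount of data at a given
# sustained throughput.  2000 GiB stands in for "a couple thousand gig".
def eta_hours(data_gib, mib_per_sec):
    return data_gib * 1024 / mib_per_sec / 3600

print(eta_hours(2000, 10))   # seek-bound crawl: ~56.9 hours
print(eta_hours(2000, 100))  # brisk sequential: ~5.7 hours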

As long as you see drive activity I'd not bother it, even if it takes a 
day or two... or even more.  I'd be evaluating whether to give up at the 
week-point, tho.

Note that we've had cases reported on-list where a resumed balance or the 
like can take a week, but at some point the I/O quit and it was 
apparently CPU-bound.  At that point you gotta guess whether it's looping 
or the logic is just taking time but making (some) progress, and evaluate 
whether it's time to simply give up, restore from backup, and eat the 
loss on anything not backed up.

One of the reasons snapshot-aware-defrag was disabled was that it simply 
didn't scale well to thousands of snapshots.  As long as it didn't run 
out of memory it wasn't exactly locked up, but forward progress was 
close to zero and it could literally take over a week in some cases.  
There are similar issues with the old quota code; there are patches 
reworking it, tho I'm not sure they're actually in or ready for mainline 
yet.  So I've been recommending not using quotas on btrfs -- if you NEED 
them, use a more mature filesystem where they actually work properly.  
And if you do automated snapshots, use a good thinning script to thin 
them down so you're well under 500.  I've posted figures showing that 
even starting with per-minute snapshots, a schedule like this runs only 
250-ish, under 300 (see the sketch after this list):

* per-minute to start, thinning down to one per 10 minutes
* one per half hour within the day
* 4/day after two days
* 1/day after a week
* one a week after four weeks
* one every 13 weeks aka quarterly after say six months
* clear them all and rely on off-machine backups after a year or 
  18 months
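
If you want to check that count yourself, here's a minimal Python sketch 
of the arithmetic.  The exact tier boundaries (per-minute for the first 
hour, 10-minute until six hours, and so on) are my own fill-ins for the 
parts of the schedule that are loose, so treat them as assumptions:

# Each tier is (age cutoff in hours, snapshot interval in hours).
# Boundaries marked "assumed" are illustrative, not from the schedule.
TIERS = [
    (1,        1 / 60),   # per-minute (assumed: for the first hour)
    (6,        10 / 60),  # every 10 minutes (assumed: until 6 hours)
    (48,       0.5),      # every half hour within the day or two
    (24 * 7,   6),        # 4/day after two days
    (24 * 28,  24),       # 1/day after a week
    (24 * 182, 24 * 7),   # 1/week after four weeks, to ~6 months
    (24 * 548, 24 * 91),  # quarterly after that, cleared at ~18 months
]

def snapshot_count(tiers):
    total, prev = 0.0, 0.0
    for cutoff, interval in tiers:
        total += (cutoff - prev) / interval
        prev = cutoff
    return round(total)

print(snapshot_count(TIERS))  # ~241 -- comfortably under 300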

Etc.  But of course even if you were doing all the wrong things, it's a 
bit late to worry about it now; that can wait until you're back up and 
running.

But as I said, drive activity is a good sign.  I'd leave it alone as long 
as that's happening -- with that much data it could literally take days.  
If the drive activity stops, tho, that's when you gotta reevaluate 
whether it's worth waiting or not.

Meanwhile, if the drive activity /does/ stop, consider doing an alt-
sysrq-w to get a trace of what's blocking.  Then wait say half an hour, 
do another, and compare.  People report that you can sometimes tell from 
that whether it's making forward progress (if the blocked tasks are 
stuck in the same spot both times, it probably isn't), or at least post 
the traces for the devs to look at -- that's actually one of their most 
requested things, tho I'm not sure how easy it'll be to capture without 
being able to get at the logs.
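
If you do still have a root shell on the box, the keyboard combo isn't 
the only way in: writing to /proc/sysrq-trigger does the same thing, and 
works for root even when the keyboard sysrq mask is restricted.  A 
minimal sketch of the two-trace routine in Python (the half-hour gap is 
just the interval suggested above):

import time

def sysrq_w():
    # 'w' dumps tasks in uninterruptible (blocked) state to the
    # kernel log; read it back with dmesg or journalctl -k.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("w")

sysrq_w()              # first trace
time.sleep(30 * 60)    # wait half an hour
sysrq_w()              # second trace, for comparison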

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
