Re: Reducing impact of periodic btrfs balance
On 2016-05-26 18:12, Graham Cobb wrote:
> On 19/05/16 02:33, Qu Wenruo wrote:
>> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>>> A while ago I had a "no space" problem (despite fi df, fi show and fi
>>> usage all agreeing I had over 1TB free). But this email isn't about
>>> that.
>>>
>>> As part of fixing that problem, I tried to do a "balance -dusage=20" on
>>> the disk. I was expecting it to have system impact, but it was a major
>>> disaster. The balance didn't just run for a long time, it locked out
>>> all activity on the disk for hours. A simple "touch" command to create
>>> one file took over an hour.
>>
>> It seems that balance blocked a transaction for a long time, which makes
>> your touch operation wait for that transaction to end.
>
> I have been reading volumes.c. But I don't have a feel for which
> transactions are likely to be the things blocking for a really long
> time (hours).
>
> If this can occur, I think the warnings to users about balance need to
> be extended to include this issue. Currently the user mode code warns
> users that unfiltered balances may take a long time, but it doesn't
> warn that the disk may be unusable during that time.

Whether or not the disk is usable depends on a number of factors. I have no issues using my disks while they're being balanced (even when doing a full balance), but they also all support command queuing, and are either fast disks, or on really good storage controllers.

>>> 3) My btrfs-balance-slowly script would work better if there was a
>>> time-based limit filter for balance, not just the current count-based
>>> filter. I would like to be able to say, for example, run balance for
>>> no more than 10 minutes (completing the operation in progress, of
>>> course) then return.
>>
>> As btrfs balance is done in block group units, I'm afraid such a thing
>> would be a little tricky to implement.
>
> It would be really easy to add a jiffies-based limit into the checks in
> should_balance_chunk. Of course, this would only test the limit in
> between block groups, but that is what I was looking for -- a
> time-based version of the current limit filter.
>
> On the other hand, the time limit could just be added into the user
> mode code: after the timer expires it could issue a "balance pause".
> Would the effect be identical in terms of timing, resources required,
> etc?

This is entirely userspace policy, and thus should be done in userspace. Pretty much everything that already has a filter can't be entirely implemented in userspace, despite technically being policy, because it requires specific knowledge of the filesystem internals. Having a time-limited mode requires no such knowledge, and thus could be done in userspace. Putting it in userspace would also make it easier to debug, and less likely to cause other fallout in the rest of the balance code.

> Would it be better to do a "balance pause" or a "balance cancel"? The
> goal would be to suspend balance processing and allow the system to do
> something else for a while (say 20 minutes) and then go back to doing
> more balance later. What is the difference between resuming a paused
> balance compared to starting a new balance? Bearing in mind that this
> is a heavily used disk, so we can expect lots of transactions to have
> happened in the meantime (otherwise we wouldn't need this capability)?

The difference between resuming a paused balance and starting a balance after canceling one is pretty simple. Resuming a paused balance will not re-process chunks that were already processed; starting a new one after canceling may or may not (depending on what other filters are involved). I think having the option to do either would be a good thing: cancel makes a bit more sense if you're going long periods of time between each run and are using other limiting filters (like usage filtering), whereas pause makes more sense if doing a full balance or only pausing for a short time between each run.
Depending on how the balance ioctl reacts to being interrupted with a signal, this would in theory not be hard to implement either.
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
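The userspace time-limit approach discussed above (start a balance, then issue a "balance pause" once a timer expires) can be sketched in a few lines of shell. This is only an illustration of the idea, not tooling endorsed by the thread: the mount point, the -dusage=20 filter, and the BTRFS override hook are all assumptions for the example.

```shell
# Userspace time budget for balance: start it, pause it when time is up.
# BTRFS can be overridden (e.g. with a stub) for testing.
BTRFS=${BTRFS:-btrfs}

limit_balance() {
    mnt=$1        # filesystem to balance
    budget=$2     # seconds to let balance run before pausing

    # Run the filtered balance in the background so we can time it.
    $BTRFS balance start -dusage=20 "$mnt" &
    pid=$!

    sleep "$budget"

    # If balance is still running, ask it to pause; pausing lets the
    # block group currently being relocated finish first.
    if kill -0 "$pid" 2>/dev/null; then
        $BTRFS balance pause "$mnt"
    fi
    wait "$pid"
}
```

A later "btrfs balance resume" would then pick up where the paused balance left off, which is exactly the pause-versus-cancel distinction discussed above.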
Re: Reducing impact of periodic btrfs balance
On 19/05/16 02:33, Qu Wenruo wrote:
>
> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>> A while ago I had a "no space" problem (despite fi df, fi show and fi
>> usage all agreeing I had over 1TB free). But this email isn't about
>> that.
>>
>> As part of fixing that problem, I tried to do a "balance -dusage=20" on
>> the disk. I was expecting it to have system impact, but it was a major
>> disaster. The balance didn't just run for a long time, it locked out
>> all activity on the disk for hours. A simple "touch" command to create
>> one file took over an hour.
>
> It seems that balance blocked a transaction for a long time, which makes
> your touch operation wait for that transaction to end.

I have been reading volumes.c. But I don't have a feel for which transactions are likely to be the things blocking for a really long time (hours).

If this can occur, I think the warnings to users about balance need to be extended to include this issue. Currently the user mode code warns users that unfiltered balances may take a long time, but it doesn't warn that the disk may be unusable during that time.

>> 3) My btrfs-balance-slowly script would work better if there was a
>> time-based limit filter for balance, not just the current count-based
>> filter. I would like to be able to say, for example, run balance for
>> no more than 10 minutes (completing the operation in progress, of
>> course) then return.
>
> As btrfs balance is done in block group units, I'm afraid such a thing
> would be a little tricky to implement.

It would be really easy to add a jiffies-based limit into the checks in should_balance_chunk. Of course, this would only test the limit in between block groups, but that is what I was looking for -- a time-based version of the current limit filter.

On the other hand, the time limit could just be added into the user mode code: after the timer expires it could issue a "balance pause". Would the effect be identical in terms of timing, resources required, etc?
Would it be better to do a "balance pause" or a "balance cancel"? The goal would be to suspend balance processing and allow the system to do something else for a while (say 20 minutes) and then go back to doing more balance later. What is the difference between resuming a paused balance compared to starting a new balance? Bearing in mind that this is a heavily used disk, so we can expect lots of transactions to have happened in the meantime (otherwise we wouldn't need this capability)?

Graham
RE: [Not TLS] Re: Reducing impact of periodic btrfs balance
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Graham Cobb
> Sent: Thursday, 19 May 2016 8:11 PM
> To: linux-btrfs@vger.kernel.org
> Subject: Re: [Not TLS] Re: Reducing impact of periodic btrfs balance
>
> On 19/05/16 05:09, Duncan wrote:
> > So to Graham, are these 1.5K snapshots all of the same subvolume, or
> > split into snapshots of several subvolumes? If it's all of the same
> > subvolume or of only 2-3 subvolumes, you still have some work to do in
> > terms of getting down to recommended snapshot levels. Also, if you
> > have quotas on and don't specifically need them, try turning them off
> > and see if that alone makes it workable.
>
> I have just under 20 subvolumes but the snapshots are only taken if
> something has changed (actually I use btrbk: I am not sure if it takes the
> snapshot and then removes it if nothing changed or whether it knows not to
> even take it). The most frequently changing subvolumes have just under 400
> snapshots each. I have played with snapshot retention and think it unlikely I
> would want to reduce it further.
>
> I have quotas turned off. At least, I am not using quotas -- how can I double
> check it is really turned off?
>
> I know that very large numbers of snapshots are not recommended, and I
> expected the balance to be slow. I was quite prepared for it to take many
> days. My full backups take several days and even incrementals take several
> hours. What I did not expect, and think is a MUCH more serious problem, is
> that the balance prevented use of the disk, holding up all writes to the disk
> for (quite literally) hours each. I have not seen that effect mentioned
> anywhere!
>
> That means that for a large, busy data disk, it is impossible to do a balance
> unless the server is taken down to single-user mode for the time the balance
> takes (presumably still days).
> I assume this would also apply to doing a RAID
> rebuild (I am not using multiple disks at the moment).
>
> At the moment I am still using my previous backup strategy, alongside the
> snapshots (that is: rsync-based rsnapshots to another disk daily and with
> fairly long retentions, and separate daily full/incremental backups using dar
> to a nas in another building). I was hoping the btrfs snapshots might replace
> the daily rsync snapshots but it doesn't look like that will work out.

I do a similar thing - on my main fs I have only minimal snapshots - like less than 10. I rsync (with checksumming off and diff copy on) the fs to the backup fs, which is where all the snapshots live. That fs only gets the occasional 20% balance when it runs out of space, and weekly scrubs. Performance doesn't seem to suffer that way.

Paul.
Re: [Not TLS] Re: Reducing impact of periodic btrfs balance
On 19/05/16 05:09, Duncan wrote:
> So to Graham, are these 1.5K snapshots all of the same subvolume, or
> split into snapshots of several subvolumes? If it's all of the same
> subvolume or of only 2-3 subvolumes, you still have some work to do in
> terms of getting down to recommended snapshot levels. Also, if you have
> quotas on and don't specifically need them, try turning them off and see
> if that alone makes it workable.

I have just under 20 subvolumes but the snapshots are only taken if something has changed (actually I use btrbk: I am not sure if it takes the snapshot and then removes it if nothing changed or whether it knows not to even take it). The most frequently changing subvolumes have just under 400 snapshots each. I have played with snapshot retention and think it unlikely I would want to reduce it further.

I have quotas turned off. At least, I am not using quotas -- how can I double check it is really turned off?

I know that very large numbers of snapshots are not recommended, and I expected the balance to be slow. I was quite prepared for it to take many days. My full backups take several days and even incrementals take several hours. What I did not expect, and think is a MUCH more serious problem, is that the balance prevented use of the disk, holding up all writes to the disk for (quite literally) hours each. I have not seen that effect mentioned anywhere!

That means that for a large, busy data disk, it is impossible to do a balance unless the server is taken down to single-user mode for the time the balance takes (presumably still days). I assume this would also apply to doing a RAID rebuild (I am not using multiple disks at the moment).

At the moment I am still using my previous backup strategy, alongside the snapshots (that is: rsync-based rsnapshots to another disk daily and with fairly long retentions, and separate daily full/incremental backups using dar to a nas in another building).
I was hoping the btrfs snapshots might replace the daily rsync snapshots but it doesn't look like that will work out.

Thanks to all for the replies.

Graham
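On Graham's question of double-checking that quotas are really off: one low-tech way (an illustration, since the exact error text varies across btrfs-progs versions) is to rely on the fact that "btrfs qgroup show" fails when quotas are not enabled, and test its exit status rather than its output.

```shell
BTRFS=${BTRFS:-btrfs}   # override hook so the check can be stubbed/tested

# Succeeds (exit 0) when qgroup show works, which indicates quotas
# are enabled on the given filesystem; fails otherwise.
quota_enabled() {
    $BTRFS qgroup show "$1" >/dev/null 2>&1
}

check_quota() {
    if quota_enabled "$1"; then
        echo "quotas appear to be enabled on $1"
    else
        echo "quotas appear to be disabled on $1"
    fi
}
```

Running "btrfs quota disable <mountpoint>" is harmless if quotas are already off, so that is another belt-and-braces option.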
Re: Reducing impact of periodic btrfs balance
Qu Wenruo posted on Thu, 19 May 2016 09:33:19 +0800 as excerpted:

> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>> Hi,
>>
>> I have a 6TB btrfs filesystem I created last year (about 60% used). It
>> is my main data disk for my home server so it gets a lot of usage
>> (particularly mail). I do frequent snapshots (using btrbk) so I have a
>> lot of snapshots (about 1500 now, although it was about double that
>> until I cut back the retention times recently).
>
> Even at 1500, it's still quite large, especially when they are all
> snapshots.
>
> The biggest problem of a large amount of snapshots is, it will make any
> backref walk operation very slow. (O(n^3)~O(n^4))
> This includes: btrfs qgroup and balance, even fiemap (a recently
> submitted patch will solve the fiemap problem though)
>
> The btrfs design ensures snapshot creation is fast, but that comes with
> the cost of backref walk.
>
> So, unless some super huge rework, I would prefer to keep the number of
> snapshots to a small amount, or avoid balance/qgroup.

Qu and Graham,

As you may have seen in my previous posts, my normal snapshots recommendation is to try to keep under 250-300 per subvolume, and definitely under 3000 max, 2000 preferably, and 1000 if being conservative, per filesystem, thus allowing snapshotting of 6-8 subvolumes per filesystem before hitting the filesystem cap, due to scaling issues like the above that are directly related to the number of snapshots.

Also, recognizing that the btrfs quota code dramatically compounds the scaling issues, and because the btrfs quota functionality has still never actually worked fully correctly, I recommend turning it off unless it's definitely and specifically known to be needed. If it's actually needed, I recommend strong consideration be given to use of a more mature filesystem where quotas are known to work reliably without the scaling issues they present on btrfs.
So to Graham, are these 1.5K snapshots all of the same subvolume, or split into snapshots of several subvolumes? If it's all of the same subvolume or of only 2-3 subvolumes, you still have some work to do in terms of getting down to recommended snapshot levels. Also, if you have quotas on and don't specifically need them, try turning them off and see if that alone makes it workable.

It's worth noting that a reasonable snapshot thinning program can help quite a bit here, letting you still keep a reasonable retention, and that 250-300 snapshots per subvolume fits very well within that model. Consider: if you're starting with say hourly snapshots, a year or even three months out, are you really going to care what specific hourly snapshot you retrieve a file from, or would daily or weekly snapshots do just as well, and actually make finding an appropriate snapshot easier as there's less to go thru?

Generally speaking, most people starting with hourly snapshots can delete every other snapshot, thinning by at least half, within a day or two, and those doing snapshots even more frequently can thin down to at least hourly within hours even, since if you haven't noticed a mistaken deletion or whatever within a few hours, chances are good that recovery from hourly snapshots is more than practical, and if you haven't noticed it within a day or two, recovery from say two-hourly or six-hourly snapshots will be fine. Similarly, a week out, most people can thin to twice-daily or daily snapshots, and by 4 weeks out, perhaps to Monday/Wednesday/Friday snapshots. By 13 weeks (one quarter) out, weekly snapshots are often fine, and by six months (26 weeks) out, thinning to quarterly (13-week) snapshots may be practical. If not, it certainly should be within a year, tho well before a year is out, backups to separate media should have taken over, allowing the oldest snapshots to be dropped, finally reclaiming the space they were keeping locked up.

And primarily to Qu...
Is that 2K-snapshots overall filesystem cap recommendation still too high, even if per-subvolume snapshots are limited to 300-ish? Or is the real problem per-subvolume snapshots? That is, as long as snapshots are limited to 300-ish per subvolume, for people who have gone subvolume-mad and have say 50 separate subvolumes being snapshotted (perhaps not too unreasonable in a VM context with each VM on its own subvolume), would the resulting 15K total snapshots per filesystem still work reasonably well, so that I could drop the overall filesystem cap recommendation and simply recommend a per-subvolume snapshot cap of a few hundred?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
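Duncan's thinning schedule can be condensed into a tiny helper; the age thresholds and labels below are one illustrative reading of the schedule above, not a prescribed policy.

```shell
# Map a snapshot's age in days to the roughly suggested retention
# granularity from the schedule described above (illustrative only).
retention_for_age() {
    days=$1
    if   [ "$days" -le 2 ];  then echo "hourly"        # thin by half in a day or two
    elif [ "$days" -le 7 ];  then echo "twice-daily"   # a week out
    elif [ "$days" -le 28 ]; then echo "mon-wed-fri"   # 4 weeks out
    elif [ "$days" -le 91 ]; then echo "weekly"        # one quarter out
    else                          echo "quarterly"     # 6+ months out
    fi
}
```

A thinning tool (btrbk itself supports retention policies of this shape) would then delete any snapshot whose age calls for coarser spacing than its neighbors provide.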
Re: Reducing impact of periodic btrfs balance
Graham Cobb wrote on 2016/05/18 14:29 +0100:
> Hi,
>
> I have a 6TB btrfs filesystem I created last year (about 60% used). It
> is my main data disk for my home server so it gets a lot of usage
> (particularly mail). I do frequent snapshots (using btrbk) so I have a
> lot of snapshots (about 1500 now, although it was about double that
> until I cut back the retention times recently).

Even at 1500, it's still quite large, especially when they are all snapshots.

The biggest problem with a large number of snapshots is that it makes any backref walk operation very slow. (O(n^3)~O(n^4))
This includes: btrfs qgroup and balance, even fiemap (a recently submitted patch will solve the fiemap problem though).

The btrfs design ensures snapshot creation is fast, but that comes with the cost of backref walk.

So, unless some super huge rework happens, I would prefer to keep the number of snapshots to a small amount, or avoid balance/qgroup.

> A while ago I had a "no space" problem (despite fi df, fi show and fi
> usage all agreeing I had over 1TB free). But this email isn't about
> that.
>
> As part of fixing that problem, I tried to do a "balance -dusage=20" on
> the disk. I was expecting it to have system impact, but it was a major
> disaster. The balance didn't just run for a long time, it locked out
> all activity on the disk for hours. A simple "touch" command to create
> one file took over an hour.

It seems that balance blocked a transaction for a long time, which makes your touch operation wait for that transaction to end.

> More seriously, because of that, mail was being lost: all mail delivery
> timed out and the timeout error was interpreted as a fatal delivery
> error causing mail to be discarded, mailing lists to cancel
> subscriptions, etc. The balance never completed, of course. I
> eventually got it cancelled.
>
> I have since managed to complete the "balance -dusage=20" by running it
> repeatedly with "limit=N" (for small N). I wrote a script to automate
> that process, and rerun it every week. If anyone is interested, the
> script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly
>
> Out of that experience, I have a couple of thoughts about how to
> possibly make balance more friendly.
>
> 1) It looks like the balance process seems to (effectively) lock all
> file (extent?) creation for long periods of time. Would it be possible
> for balance to make more effort to yield locks to allow other
> processes/threads to get in to continue to create/write files while it
> is running?

Balance doesn't really lock the whole filesystem; in fact it will itself only lock (mark read-only) one block group (normally 1G in size).
But unfortunately, balance will hold one transaction for each block group, and that's at the whole-fs level, which may block unrelated write operations.

> 2) btrfs scrub has options to set ionice options. Could balance have
> something similar? Or would reducing the IO priority make things worse
> because locks would be held for longer?

IMHO the problem is not about IO.
If using iotop, you would find that the IO activity is not that high, while CPU usage would be near 100% for one core.

> 3) My btrfs-balance-slowly script would work better if there was a
> time-based limit filter for balance, not just the current count-based
> filter. I would like to be able to say, for example, run balance for
> no more than 10 minutes (completing the operation in progress, of
> course) then return.

As btrfs balance is done in block group units, I'm afraid such a thing would be a little tricky to implement.

> 4) My btrfs-balance-slowly script would be more reliable if there was a
> way to get an indication of whether there was more work to be done,
> instead of parsing the output for the number of relocations.
>
> Any thoughts about these? Or other things I could be doing to reduce
> the impact on my services?

Would you try to remove unneeded snapshots, and disable qgroup if you're using it?

If possible, it's better to remove *ALL* snapshots to minimize the backref walk pressure and then retry the balance.
Thanks,
Qu

> Graham
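The "repeatedly with limit=N" workflow Graham describes, along with the output-parsing fragility his point 4 complains about, looks roughly like this. This is a simplified sketch, not the actual btrfs-balance-slowly script; the filter values, the sleep interval, and the BTRFS override are assumptions for illustration.

```shell
BTRFS=${BTRFS:-btrfs}

balance_slowly() {
    mnt=$1

    while :; do
        # Small batch: at most 5 block groups per invocation.
        out=$($BTRFS balance start -dusage=20,limit=5 "$mnt" 2>&1)
        echo "$out"

        # The fragile part (point 4 above): scrape "had to relocate N
        # out of M chunks" to decide whether any work remains.
        relocated=$(printf '%s\n' "$out" |
            sed -n 's/.*relocate \([0-9][0-9]*\) out of.*/\1/p')
        [ "${relocated:-0}" -eq 0 ] && break

        # Let other writers at the disk between batches.
        sleep "${BALANCE_SLEEP:-600}"
    done
}
```

A kernel-side "is there more work to do" indicator, as Graham asks for, would make the sed scrape above unnecessary.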
RE: Reducing impact of periodic btrfs balance
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Graham Cobb
> Sent: Wednesday, 18 May 2016 11:30 PM
> To: linux-btrfs@vger.kernel.org
> Subject: Reducing impact of periodic btrfs balance
>
> Hi,
>
> I have a 6TB btrfs filesystem I created last year (about 60% used). It is my
> main data disk for my home server so it gets a lot of usage (particularly
> mail).
> I do frequent snapshots (using btrbk) so I have a lot of snapshots (about 1500
> now, although it was about double that until I cut back the retention times
> recently).
>
> A while ago I had a "no space" problem (despite fi df, fi show and fi usage
> all agreeing I had over 1TB free). But this email isn't about that.
>
> As part of fixing that problem, I tried to do a "balance -dusage=20" on the
> disk. I was expecting it to have system impact, but it was a major disaster.
> The balance didn't just run for a long time, it locked out all activity on
> the disk for hours. A simple "touch" command to create one file took over an
> hour.
>
> More seriously, because of that, mail was being lost: all mail delivery timed
> out and the timeout error was interpreted as a fatal delivery error causing
> mail to be discarded, mailing lists to cancel subscriptions, etc. The balance
> never completed, of course. I eventually got it cancelled.
>
> I have since managed to complete the "balance -dusage=20" by running it
> repeatedly with "limit=N" (for small N). I wrote a script to automate that
> process, and rerun it every week. If anyone is interested, the script is on
> GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly

Hi Graham,

I've experienced similar problems from time to time. It seems to be fragmentation of the metadata. In my case I have a volume with about 20 million smallish (100k) files scattered through around 20,000 directories, and originally they were created at random.
Updating the files at a data rate of around 5 MB/s drove disk utilisation to 100% on RAID1 SSD. After a few iterations I needed to delete the files and start again; this took 4 days!! I cancelled it a few times and tried defrags and balances, but they didn't help. Needless to say, the filesystem was basically unusable at the time.

Long story short, I discovered that populating each directory completely, one at a time, alleviated the speed issue. I then remembered that if you run defrag with the compress option it writes out the files again, which also fixes the problem. (Note that there is no option for no compression.)

So if you are ok with using compression, try a defrag with compression. That massively fixed my problems.

Regards,
Paul.
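For reference, the defrag-with-compression rewrite Paul describes boils down to a single command; it is wrapped in a function here only so the tool can be stubbed, and the lzo algorithm and recursion flags are illustrative choices, not the thread's prescription.

```shell
BTRFS=${BTRFS:-btrfs}

# Recursively rewrite every file in a tree via defragment.  The -c
# option always implies (re)compression -- as noted above, there is no
# "no compression" choice -- and lzo is just one available algorithm.
rewrite_with_compression() {
    $BTRFS filesystem defragment -r -v -clzo "$1"
}
```

Since this rewrites file data, expect it to unshare extents with existing snapshots, so on a heavily snapshotted filesystem it can cost significant space.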