Re: Reducing impact of periodic btrfs balance

2016-05-31 Thread Austin S. Hemmelgarn

On 2016-05-26 18:12, Graham Cobb wrote:

On 19/05/16 02:33, Qu Wenruo wrote:



Graham Cobb wrote on 2016/05/18 14:29 +0100:

A while ago I had a "no space" problem (despite fi df, fi show and fi
usage all agreeing I had over 1TB free).  But this email isn't about
that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on
the disk.  I was expecting it to have system impact, but it was a major
disaster.  The balance didn't just run for a long time, it locked out
all activity on the disk for hours.  A simple "touch" command to create
one file took over an hour.


It seems that balance blocked a transaction for a long time, which made
your touch operation wait for that transaction to end.


I have been reading volumes.c.  But I don't have a feel for which
transactions are likely to be the things blocking for a really long time
(hours).

If this can occur, I think the warnings to users about balance need to
be extended to include this issue.  Currently the user mode code warns
users that unfiltered balances may take a long time, but it doesn't warn
that the disk may be unusable during that time.
Whether or not the disk is usable depends on a number of factors.  I 
have no issues using my disks while they're being balanced (even when 
doing a full balance), but they also all support command queuing, and 
are either fast disks or on really good storage controllers.



3) My btrfs-balance-slowly script would work better if there was a
time-based limit filter for balance, not just the current count-based
filter.  I would like to be able to say, for example, run balance for no
more than 10 minutes (completing the operation in progress, of course)
then return.


As btrfs balance is done in block-group units, I'm afraid such a thing
would be a little tricky to implement.


It would be really easy to add a jiffies-based limit into the checks in
should_balance_chunk.  Of course, this would only test the limit in
between block groups but that is what I was looking for -- a time-based
version of the current limit filter.

On the other hand, the time limit could just be added into the user mode
code: after the timer expires it could issue a "balance pause".  Would
the effect be identical in terms of timing, resources required, etc?
This is entirely userspace policy, and thus should be done in userspace. 
Pretty much everything that already has a filter can't be entirely 
implemented in userspace, despite technically being policy, because it 
requires specific knowledge of the filesystem internals.  A time-limited 
mode requires no such knowledge, and thus could be done in userspace. 
Putting it in userspace would also make it easier to debug, and less 
likely to cause other fallout in the rest of the balance code.
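
A minimal sketch of what such a userspace wrapper could look like (a
hypothetical mount point of /mnt/data and a 10-minute budget; untested,
just to show the idea):

  #!/bin/bash
  # Sketch: run a filtered balance for at most 10 minutes, then pause it.
  # The pause request takes effect at a safe point, typically once the
  # chunk currently being relocated has been dealt with.
  MNT=/mnt/data          # hypothetical mount point
  btrfs balance start -dusage=20 "$MNT" &
  sleep 600
  btrfs balance pause "$MNT" || true   # harmless if the balance already finished
  wait
  # Later, continue where this left off with:  btrfs balance resume /mnt/data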


Would it be better to do a "balance pause" or a "balance cancel"?  The
goal would be to suspend balance processing and allow the system to do
something else for a while (say 20 minutes) and then go back to doing
more balance later.  What is the difference between resuming a paused
balance and starting a new balance, bearing in mind that this is a
heavily used disk so we can expect lots of transactions to have
happened in the meantime (otherwise we wouldn't need this capability)?
The difference between resuming a paused balance and starting a balance 
after canceling one is pretty simple.  Resuming a paused balance will 
not re-process chunks that were already processed; starting a new one 
after canceling may or may not (depending on what other filters are 
involved).  I think having the option to do either would be a good 
thing: cancel makes a bit more sense if you're going long periods of 
time between each run and are using other limiting filters (like usage 
filtering), whereas pause makes more sense if you're doing a full 
balance or only pausing for a short time between each run.
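
A sketch of the cancel variant, relying on the usage filter to skip
chunks an earlier run already compacted (same hypothetical mount point,
untested):

  # Bounded run, then cancel; cancel waits for the block group currently
  # being processed to complete before returning.
  btrfs balance start -dusage=20 /mnt/data &
  sleep 1200
  btrfs balance cancel /mnt/data || true
  wait
  # Days later, simply start again with the same filter; block groups that
  # were already compacted to above ~20% usage will not be selected again.
  btrfs balance start -dusage=20 /mnt/data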


Depending on how the balance ioctl reacts to being interrupted with a 
signal, this would in theory not be hard to implement either.




Re: Reducing impact of periodic btrfs balance

2016-05-26 Thread Graham Cobb
On 19/05/16 02:33, Qu Wenruo wrote:
> 
> 
> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>> A while ago I had a "no space" problem (despite fi df, fi show and fi
>> usage all agreeing I had over 1TB free).  But this email isn't about
>> that.
>>
>> As part of fixing that problem, I tried to do a "balance -dusage=20" on
>> the disk.  I was expecting it to have system impact, but it was a major
>> disaster.  The balance didn't just run for a long time, it locked out
>> all activity on the disk for hours.  A simple "touch" command to create
>> one file took over an hour.
> 
> It seems that balance blocked a transaction for a long time, which made
> your touch operation wait for that transaction to end.

I have been reading volumes.c.  But I don't have a feel for which
transactions are likely to be the things blocking for a really long time
(hours).

If this can occur, I think the warnings to users about balance need to
be extended to include this issue.  Currently the user mode code warns
users that unfiltered balances may take a long time, but it doesn't warn
that the disk may be unusable during that time.

>> 3) My btrfs-balance-slowly script would work better if there was a
>> time-based limit filter for balance, not just the current count-based
>> filter.  I would like to be able to say, for example, run balance for no
>> more than 10 minutes (completing the operation in progress, of course)
>> then return.
> 
> As btrfs balance is done in block-group units, I'm afraid such a thing
> would be a little tricky to implement.

It would be really easy to add a jiffies-based limit into the checks in
should_balance_chunk.  Of course, this would only test the limit in
between block groups but that is what I was looking for -- a time-based
version of the current limit filter.

On the other hand, the time limit could just be added into the user mode
code: after the timer expires it could issue a "balance pause".  Would
the effect be identical in terms of timing, resources required, etc?

Would it be better to do a "balance pause" or a "balance cancel"?  The
goal would be to suspend balance processing and allow the system to do
something else for a while (say 20 minutes) and then go back to doing
more balance later.  What is the difference between resuming a paused
balance and starting a new balance, bearing in mind that this is a
heavily used disk so we can expect lots of transactions to have
happened in the meantime (otherwise we wouldn't need this capability)?

Graham


RE: [Not TLS] Re: Reducing impact of periodic btrfs balance

2016-05-19 Thread Paul Jones


> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Graham Cobb
> Sent: Thursday, 19 May 2016 8:11 PM
> To: linux-btrfs@vger.kernel.org
> Subject: Re: [Not TLS] Re: Reducing impact of periodic btrfs balance
> 
> On 19/05/16 05:09, Duncan wrote:
> > So to Graham, are these 1.5K snapshots all of the same subvolume, or
> > split into snapshots of several subvolumes?  If it's all of the same
> > subvolume or of only 2-3 subvolumes, you still have some work to do in
> > terms of getting down to recommended snapshot levels.  Also, if you
> > have quotas on and don't specifically need them, try turning them off
> > and see if that alone makes it workable.
> 
> I have just under 20 subvolumes but the snapshots are only taken if
> something has changed (actually I use btrbk: I am not sure if it takes the
> snapshot and then removes it if nothing changed or whether it knows not to
> even take it).  The most frequently changing subvolumes have just under 400
> snapshots each.  I have played with snapshot retention and think it unlikely I
> would want to reduce it further.
> 
> I have quotas turned off.  At least, I am not using quotas -- how can I double
> check it is really turned off?
> 
> I know that very large numbers of snapshots are not recommended, and I
> expected the balance to be slow.  I was quite prepared for it to take many
> days.  My full backups take several days and even incrementals take several
> hours. What I did not expect, and think is a MUCH more serious problem, is
> that the balance prevented use of the disk, holding up all writes to the disk
> for (quite literally) hours each.  I have not seen that effect mentioned
> anywhere!
> 
> That means that for a large, busy data disk, it is impossible to do a balance
> unless the server is taken down to single-user mode for the time the balance
> takes (presumably still days).  I assume this would also apply to doing a RAID
> rebuild (I am not using multiple disks at the moment).
> 
> At the moment I am still using my previous backup strategy, alongside the
> snapshots (that is: rsync-based rsnapshots to another disk daily and with
> fairly long retentions, and separate daily full/incremental backups using dar
> to a nas in another building).  I was hoping the btrfs snapshots might replace
> the daily rsync snapshots but it doesn't look like that will work out.

I do a similar thing - on my main fs I have only minimal snapshots - like less 
than 10. I rsync (with checksumming off and diff copy on) the fs to the backup 
fs which is where all the snapshots live. That fs only gets the occasional 20% 
balance when it runs out of space, and weekly scrubs. Performance doesn't seem 
to suffer that way.
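
A rough sketch of that kind of arrangement (the paths and exact rsync
flags here are assumptions, not the actual setup):

  # Mirror the busy filesystem into a subvolume on the backup filesystem,
  # rewriting files in place so delta transfer is used, then snapshot only
  # the backup copy (assumes /mnt/backup/current is a subvolume).
  rsync -a --delete --inplace --no-whole-file /mnt/main/ /mnt/backup/current/
  btrfs subvolume snapshot -r /mnt/backup/current \
      "/mnt/backup/snapshots/$(date +%Y-%m-%d)"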

Paul.

Re: [Not TLS] Re: Reducing impact of periodic btrfs balance

2016-05-19 Thread Graham Cobb
On 19/05/16 05:09, Duncan wrote:
> So to Graham, are these 1.5K snapshots all of the same subvolume, or 
> split into snapshots of several subvolumes?  If it's all of the same 
> subvolume or of only 2-3 subvolumes, you still have some work to do in 
> terms of getting down to recommended snapshot levels.  Also, if you have 
> quotas on and don't specifically need them, try turning them off and see 
> if that alone makes it workable.

I have just under 20 subvolumes but the snapshots are only taken if
something has changed (actually I use btrbk: I am not sure if it takes
the snapshot and then removes it if nothing changed or whether it knows
not to even take it).  The most frequently changing subvolumes have just
under 400 snapshots each.  I have played with snapshot retention and
think it unlikely I would want to reduce it further.

I have quotas turned off.  At least, I am not using quotas -- how can I
double check it is really turned off?
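
(One way to double-check, with the caveat that the exact error text varies
between btrfs-progs versions, is that qgroup listing fails when quotas are
disabled:

  # Fails with something like "ERROR: can't list qgroups: quotas not enabled"
  # when quotas are off; if it prints a qgroup table instead, quotas are on
  # and can be switched off with "btrfs quota disable <mountpoint>".
  btrfs qgroup show /mountpoint
)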

I know that very large numbers of snapshots are not recommended, and I
expected the balance to be slow.  I was quite prepared for it to take
many days.  My full backups take several days and even incrementals take
several hours. What I did not expect, and think is a MUCH more serious
problem, is that the balance prevented use of the disk, holding up all
writes to the disk for (quite literally) hours each.  I have not seen
that effect mentioned anywhere!

That means that for a large, busy data disk, it is impossible to do a
balance unless the server is taken down to single-user mode for the time
the balance takes (presumably still days).  I assume this would also
apply to doing a RAID rebuild (I am not using multiple disks at the moment).

At the moment I am still using my previous backup strategy, alongside
the snapshots (that is: rsync-based rsnapshots to another disk daily and
with fairly long retentions, and separate daily full/incremental backups
using dar to a nas in another building).  I was hoping the btrfs
snapshots might replace the daily rsync snapshots but it doesn't look
like that will work out.

Thanks to all for the replies.

Graham


Re: Reducing impact of periodic btrfs balance

2016-05-18 Thread Duncan
Qu Wenruo posted on Thu, 19 May 2016 09:33:19 +0800 as excerpted:

> Graham Cobb wrote on 2016/05/18 14:29 +0100:
>> Hi,
>>
>> I have a 6TB btrfs filesystem I created last year (about 60% used).  It
>> is my main data disk for my home server so it gets a lot of usage
>> (particularly mail). I do frequent snapshots (using btrbk) so I have a
>> lot of snapshots (about 1500 now, although it was about double that
>> until I cut back the retention times recently).
> 
> Even at 1500, that's still quite a large number, especially when they
> are all snapshots.
> 
> The biggest problem with a large number of snapshots is that it makes any
> backref-walk operation very slow (O(n^3)~O(n^4)).
> This includes btrfs qgroup and balance, and even fiemap (though a recently
> submitted patch will solve the fiemap problem).
> 
> The btrfs design keeps snapshot creation fast, but that comes at the cost
> of slow backref walks.
> 
> 
> So, barring some very large rework, I would prefer to keep the number of
> snapshots small, or to avoid balance/qgroup.

Qu and Graham,

As you may have seen in my previous posts, my normal snapshot 
recommendation is to try to keep under 250-300 per subvolume, and 
definitely under 3000 max (2000 preferably, 1000 if being conservative) 
per filesystem, thus allowing snapshotting of 6-8 subvolumes per 
filesystem before hitting the filesystem cap, due to scaling issues like 
the above that are directly related to the number of snapshots.  

Also, because the btrfs quota code dramatically compounds the scaling 
issues, and because btrfs quota functionality has still never worked 
fully correctly, I recommend turning it off unless it's definitely and 
specifically known to be needed.  If it's actually needed, I recommend 
giving strong consideration to a more mature filesystem where quotas are 
known to work reliably, without the scaling issues they present on btrfs.


So to Graham, are these 1.5K snapshots all of the same subvolume, or 
split into snapshots of several subvolumes?  If it's all of the same 
subvolume or of only 2-3 subvolumes, you still have some work to do in 
terms of getting down to recommended snapshot levels.  Also, if you have 
quotas on and don't specifically need them, try turning them off and see 
if that alone makes it workable.

It's worth noting that a reasonable snapshot thinning program can help 
quite a bit here, letting you still keep a reasonable retention, and that 
250-300 snapshots per subvolume fits very well within that model.  
Consider, if you're starting with say hourly snapshots, a year or even 
three months out, are you really going to care what specific hourly 
snapshot you retrieve a file from, or would daily or weekly snapshots do 
just as well and actually make finding an appropriate snapshot easier as 
there's less to go thru?

Generally speaking, most people starting with hourly snapshots can delete 
every other snapshot, thinning by at least half, within a day or two, and 
those doing snapshots even more frequently can thin down to at least 
hourly within hours even, since if you haven't noticed a mistaken 
deletion or whatever within a few hours, chances are good that recovery 
from hourly snapshots is more than practical, and if you haven't noticed 
it within a day or two, recovery from say two-hourly or six-hourly 
snapshots will be fine.  Similarly, a week out, most people can thin to 
twice-daily or daily snapshots, and by 4 weeks out, perhaps to Monday/
Wednesday/Friday snapshots.  By 13 weeks (one quarter) out, weekly 
snapshots are often fine, and by six months (26 weeks) out, thinning to 
quarterly (13-week) snapshots may be practical.  If not, it certainly 
should be within a year, tho well before a year is out, backups to 
separate media should have taken over, allowing the oldest snapshots to 
be dropped, finally reclaiming the space they were keeping locked up.


And primarily to Qu...

Is that 2K-snapshots overall filesystem cap recommendation still too 
high, even if per-subvolume snapshots are limited to 300-ish?  Or is the 
real problem per-subvolume snapshots, such that for people who have gone 
subvolume-mad and have, say, 50 separate subvolumes being snapshotted 
(perhaps not too unreasonable in a VM context with each VM on its own 
subvolume), 15K total snapshots per filesystem should still work 
reasonably well as long as a 300-ish cap per subvolume is maintained, and 
I should be able to drop the overall filesystem cap recommendation and 
simply recommend a per-subvolume snapshot cap of a few hundred?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: Reducing impact of periodic btrfs balance

2016-05-18 Thread Qu Wenruo



Graham Cobb wrote on 2016/05/18 14:29 +0100:

Hi,

I have a 6TB btrfs filesystem I created last year (about 60% used).  It
is my main data disk for my home server so it gets a lot of usage
(particularly mail). I do frequent snapshots (using btrbk) so I have a
lot of snapshots (about 1500 now, although it was about double that
until I cut back the retention times recently).


Even at 1500, that's still quite a large number, especially when they 
are all snapshots.


The biggest problem with a large number of snapshots is that it makes any 
backref-walk operation very slow (O(n^3)~O(n^4)).
This includes btrfs qgroup and balance, and even fiemap (though a recently 
submitted patch will solve the fiemap problem).


The btrfs design keeps snapshot creation fast, but that comes at the cost 
of slow backref walks.



So, barring some very large rework, I would prefer to keep the number of 
snapshots small, or to avoid balance/qgroup.




A while ago I had a "no space" problem (despite fi df, fi show and fi
usage all agreeing I had over 1TB free).  But this email isn't about that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on
the disk.  I was expecting it to have system impact, but it was a major
disaster.  The balance didn't just run for a long time, it locked out
all activity on the disk for hours.  A simple "touch" command to create
one file took over an hour.


It seems that balance blocked a transaction for a long time, which made 
your touch operation wait for that transaction to end.




More seriously, because of that, mail was being lost: all mail delivery
timed out and the timeout error was interpreted as a fatal delivery
error causing mail to be discarded, mailing lists to cancel
subscriptions, etc. The balance never completed, of course.  I
eventually got it cancelled.

I have since managed to complete the "balance -dusage=20" by running it
repeatedly with "limit=N" (for small N).  I wrote a script to automate
that process, and rerun it every week.  If anyone is interested, the
script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly
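
The idea is roughly a loop along these lines (a simplified sketch, not the
actual script; /mnt/data and the sleep interval are placeholders):

  #!/bin/bash
  # Sketch: relocate at most 3 block groups per pass, sleeping between
  # passes so normal writes can commit, until a pass relocates nothing.
  MNT=/mnt/data
  while :; do
      out=$(btrfs balance start -dusage=20,limit=3 "$MNT" 2>&1)
      echo "$out"
      # btrfs-progs prints "Done, had to relocate N out of M chunks".
      echo "$out" | grep -q "had to relocate 0 out of" && break
      sleep 300
  done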

Out of that experience, I have a couple of thoughts about how to
possibly make balance more friendly.

1) It looks like the balance process seems to (effectively) lock all
file (extent?) creation for long periods of time.  Would it be possible
for balance to make more effort to yield locks to allow other
processes/threads to get in to continue to create/write files while it
is running?


Balance doesn't really lock the whole file system; in fact it only locks 
(marks read-only) one block group at a time (normally 1G in size).


But unfortunately, balance holds one transaction for each block group, 
and since transactions are filesystem-wide, that may block unrelated 
write operations.




2) btrfs scrub has options to set ionice options.  Could balance have
something similar?  Or would reducing the IO priority make things worse
because locks would be held for longer?


IMHO the problem is not about IO.
If you watch iotop, you will find that IO activity is not that high, 
while CPU usage is near 100% for one core.
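
(For reference, the scrub io-priority knobs referred to above look like
this; as noted, they wouldn't help here since the bottleneck is CPU rather
than IO. The path is a placeholder.)

  btrfs scrub start -c 3 /mnt/data   # -c sets the io-priority class (3 = idle)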




3) My btrfs-balance-slowly script would work better if there was a
time-based limit filter for balance, not just the current count-based
filter.  I would like to be able to say, for example, run balance for no
more than 10 minutes (completing the operation in progress, of course)
then return.


As btrfs balance is done in block-group units, I'm afraid such a thing 
would be a little tricky to implement.




4) My btrfs-balance-slowly script would be more reliable if there was a
way to get an indication of whether there was more work to be done,
instead of parsing the output for the number of relocations.

Any thoughts about these?  Or other things I could be doing to reduce
the impact on my services?


Could you try removing unneeded snapshots, and disabling qgroups if 
you're using them?


If possible, it's better to remove *ALL* snapshots to minimize 
backref-walk pressure and then retry the balance.
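
For example, snapshots can be listed and removed in bulk along these lines
(the paths are placeholders; btrbk users would normally tighten btrbk's
retention settings instead):

  btrfs subvolume list -s /mnt/data                    # list snapshot subvolumes only
  btrfs subvolume delete /mnt/data/.snapshots/<name>   # remove a chosen snapshot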


Thanks,
Qu



Graham


RE: Reducing impact of periodic btrfs balance

2016-05-18 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Graham Cobb
> Sent: Wednesday, 18 May 2016 11:30 PM
> To: linux-btrfs@vger.kernel.org
> Subject: Reducing impact of periodic btrfs balance
> 
> Hi,
> 
> I have a 6TB btrfs filesystem I created last year (about 60% used).  It is my
> main data disk for my home server so it gets a lot of usage (particularly 
> mail).
> I do frequent snapshots (using btrbk) so I have a lot of snapshots (about 1500
> now, although it was about double that until I cut back the retention times
> recently).
> 
> A while ago I had a "no space" problem (despite fi df, fi show and fi usage 
> all
> agreeing I had over 1TB free).  But this email isn't about that.
> 
> As part of fixing that problem, I tried to do a "balance -dusage=20" on the
> disk.  I was expecting it to have system impact, but it was a major disaster.
> The balance didn't just run for a long time, it locked out all activity on 
> the disk
> for hours.  A simple "touch" command to create one file took over an hour.
> 
> More seriously, because of that, mail was being lost: all mail delivery timed
> out and the timeout error was interpreted as a fatal delivery error causing
> mail to be discarded, mailing lists to cancel subscriptions, etc. The balance
> never completed, of course.  I eventually got it cancelled.
> 
> I have since managed to complete the "balance -dusage=20" by running it
> repeatedly with "limit=N" (for small N).  I wrote a script to automate that
> process, and rerun it every week.  If anyone is interested, the script is on
> GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly


Hi Graham,

I've experienced similar problems from time to time. It seems to be 
fragmentation of the metadata. In my case I have a volume with about 20 million 
smallish (100k) files scattered through around 20,000 directories, and 
originally they were created at random. Updating the files at a data rate of 
around 5 MB/s caused 100% disk utilisation on a RAID1 SSD. After a few iterations I 
needed to delete the files and start again; this took 4 days!! I cancelled it a 
few times and tried defrags and balances, but they didn't help. Needless to 
say, the filesystem was basically unusable at the time.
Long story short, I discovered that populating each directory completely, one 
at a time, alleviated the speed issue. I then remembered that if you run defrag 
with the compress option it writes out the files again, which also fixes the 
problem. (Note that there is no option for no compression.)
So if you are OK with using compression, try a defrag with compression. That 
massively fixed my problems.
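
For reference, that kind of rewrite-via-defrag looks roughly like this (the
path is a placeholder; pick whichever compression algorithm is acceptable):

  # Recursively rewrite file data, compressed with zlib, as part of defrag.
  btrfs filesystem defragment -r -v -czlib /mnt/data

Note that defragmenting files that are shared with snapshots breaks the 
sharing, so space usage can grow considerably.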

Regards,
Paul.