Re: Any chance to get snapshot-aware defragmentation?

2018-05-31 Thread Zygo Blaxell
On Mon, May 21, 2018 at 11:38:28AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-05-21 09:42, Timofey Titovets wrote:
> > Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn:
> > > On 2018-05-19 04:54, Niccolò Belli wrote:
> > > > On Friday 18 May 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> > > > > With a bit of work, it's possible to handle things sanely.  You can
> > > > > deduplicate data from snapshots, even if they are read-only (you need
> > > > > to pass the `-A` option to duperemove and run it as root), so it's
> > > > > perfectly reasonable to only defrag the main subvolume, and then
> > > > > deduplicate the snapshots against that (so that they end up all being
> > > > > reflinks to the main subvolume).  Of course, this won't work if you're
> > > > > short on space, but if you're dealing with snapshots, you should have
> > > > > enough space that this will work (because even without defrag, it's
> > > > > fully possible for something to cause the snapshots to suddenly take
> > > > > up a lot more space).
> > > > 
> > > > Been there, tried that. Unfortunately even if I skip the defrag, a simple
> > > > 
> > > > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> > > > 
> > > > is going to eat more space than was previously available (probably
> > > > due to autodefrag?).
> > > It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> > > ioctl).  There are two things involved here:
> > 
> > > * BTRFS has somewhat odd and inefficient handling of partial extents.
> > > When part of an extent becomes unused (because of a CLONE ioctl, or an
> > > EXTENT_SAME ioctl, or something similar), that part stays allocated
> > > until the whole extent would be unused.
> > > * You're using the default deduplication block size (128k), which is
> > > larger than your filesystem block size (which is at most 64k, most
> > > likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> > > can split extents.
> > 
> > That's the metadata node/leaf size, which != the fs block size.
> > The btrfs fs block size == machine page size currently.
> You're right, I keep forgetting about that (probably because BTRFS is pretty
> much the only modern filesystem that doesn't let you change the block size).
> > 
> > > Because of this, if a duplicate region happens to overlap the front of
> > > an already shared extent, and the end of said shared extent isn't
> > > aligned with the deduplication block size, the EXTENT_SAME call will
> > > deduplicate the first part, creating a new shared extent, but not the
> > > tail end of the existing shared region, and all of that original shared
> > > region will stick around, taking up extra space that it wasn't before.
> > 
> > > Additionally, if only part of an extent is duplicated, then that area of
> > > the extent will stay allocated, because the rest of the extent is still
> > > referenced (so you won't necessarily see any actual space savings).
> > 
> > > You can mitigate this by telling duperemove to use the same block size
> > > as your filesystem using the `-b` option.   Note that using a smaller
> > > block size will also slow down the deduplication process and greatly
> > > increase the size of the hash file.
> > 
> > duperemove's -b only controls how the data is hashed, nothing more or less,
> > and it only supports 4KiB..1MiB.
> And you can only deduplicate the data at the granularity you hashed it at.
> In particular:
> 
> * The total size of a region being deduplicated has to be an exact multiple
> of the hash block size (what you pass to `-b`).  So for the default 128k
> size, you can only deduplicate regions that are multiples of 128k long
> (128k, 256k, 384k, 512k, etc).   This is a simple limit derived from how
> blocks are matched for deduplication.
> * Because duperemove uses fixed hash blocks (as opposed to using a rolling
> hash window like many file synchronization tools do), the regions being
> deduplicated also have to be exactly aligned to the hash block size.  So,
> with the default 128k size, you can only deduplicate regions starting at 0k,
> 128k, 256k, 384k, 512k, etc, but not ones starting at, for example, 64k into
> the file.
> > 
> > And the block size used for dedup will change the efficiency of deduplication,
> > while the number of hash-block pairs will change the hash file size and time
> > complexity.
> > 
> > Let's assume that: 'A' - 1KiB of data, 'AAAA' - 4KiB with a repeated pattern.
> >
> > So, for example, you have 2 blocks of 2x4KiB:
> > 1: 'AAAABBBB'
> > 2: 'BBBBAAAA'
> >
> > With -b 8KiB the hash of the first block is not the same as the second's.
> > But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB',
> > and then those blocks will be deduped.
> This supports what I'm saying though.  Your deduplication granularity is
> bounded by your hash granularity.  If in addition to the above you have a
> file that looks like:
> 
> AABBBBAA
> 
> It would not get deduplicated against the first two at either `-b 4k` or `-b
> 8k` despite the middle 4k of the file being an exact duplicate of the final
> 4k of the first file and first 4k of the second one.

Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-21 13:43, David Sterba wrote:

On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote:

On 2018-05-18 12:36, Niccolò Belli wrote:

On Friday 18 May 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years have passed since then without
any news on this front.

What I would really like to know is why nobody resumed his work: is it
because it's impossible to implement snapshot-aware defrag without
excessive RAM usage or is it simply because nobody is interested?

I think it's because nobody who is interested has both the time and the
coding skills to tackle it.

Personally though, I think the biggest issue with what was done was not
the memory consumption, but the fact that there was no switch to turn it
on or off.  Making defrag unconditionally snapshot aware removes one of
the easiest ways to forcibly unshare data without otherwise altering the
files (which, as stupid as it sounds, is actually really useful for some
storage setups), and also forces the people who have ridiculous numbers
of snapshots to deal with the memory usage or never defrag.


Good points. The logic of the sharing-awareness is a technical detail;
what's being discussed is the use case, and I think this would be good to
clarify.

1) always -- the old (and now disabled) way, unconditionally (ie. no
option for the user), problems with memory consumption

2) more fine grained:

2.1) defragment only the non-shared extents, ie. no sharing awareness
  needed, shared extents will be silently skipped

2.2) defragment only within the given subvolume -- like 1) but by user's choice

The naive dedup that Tomasz (CCed) mentions in another mail would probably
be beyond the purpose of defrag and would make things more complicated.

I'd vote for keeping complexity of the ioctl interface and defrag
implementation low, so if it's simply saying "do forcible defrag" or
"skip shared", then it sounds ok.

If there's e.g. "keep sharing only on these subvolumes", then it
would need to read the snapshot ids from the ioctl structure, then enumerate
all extent owners and do some magic to unshare/defrag/share. That's a
quick idea, lots of details would need to be clarified.

From my perspective, I see two things to consider that are somewhat 
orthogonal to each other:


1. Whether to recurse into subvolumes or not (IIRC, we currently do not 
do so, because we see them like a mount point).
2. Whether to use the simple (not reflink-aware) defrag, the reflink-aware
one, or to base it on the extent/file type (use the old, simpler one
for unshared extents, and the new reflink-aware one for shared extents).


This second set of options is what I'd like to see the most (possibly 
without the option to base it on file or extent sharing automatically), 
though the first one would be nice to have.


Better yet, having that second set of options and making the new 
reflink-aware defrag opt-in would allow people who really want it to use 
it, and those of us who don't need it for our storage setups to not need 
to worry about it.



Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread David Sterba
On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote:
> On 2018-05-18 12:36, Niccolò Belli wrote:
> > On Friday 18 May 2018 18:20:51 CEST, David Sterba wrote:
> >> Josef started working on that in 2014 and did not finish it. The patches
> >> can be still found in his tree. The problem is in excessive memory
> >> consumption when there are many snapshots that need to be tracked during
> >> the defragmentation, so there are measures to avoid OOM. There's
> >> infrastructure ready for use (shrinkers), there are maybe some problems
> >> but fundamentally it should work.
> >>
> >> I'd like to get the snapshot-aware working again too, we'd need to find
> >> a volunteer to resume the work on the patchset.
> > 
> > Yeah I know of Josef's work, but 4 years have passed since then without
> > any news on this front.
> > 
> > What I would really like to know is why nobody resumed his work: is it 
> > because it's impossible to implement snapshot-aware defrag without
> > excessive RAM usage or is it simply because nobody is interested?
> I think it's because nobody who is interested has both the time and the 
> coding skills to tackle it.
> 
> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 
> files (which, as stupid as it sounds, is actually really useful for some 
> storage setups), and also forces the people who have ridiculous numbers 
> of snapshots to deal with the memory usage or never defrag.

Good points. The logic of the sharing-awareness is a technical detail;
what's being discussed is the use case, and I think this would be good to
clarify.

1) always -- the old (and now disabled) way, unconditionally (ie. no
   option for the user), problems with memory consumption

2) more fine grained:

2.1) defragment only the non-shared extents, ie. no sharing awareness
 needed, shared extents will be silently skipped

2.2) defragment only within the given subvolume -- like 1) but by user's choice

The naive dedup that Tomasz (CCed) mentions in another mail would probably
be beyond the purpose of defrag and would make things more complicated.

I'd vote for keeping complexity of the ioctl interface and defrag
implementation low, so if it's simply saying "do forcible defrag" or
"skip shared", then it sounds ok.

If there's e.g. "keep sharing only on these subvolumes", then it
would need to read the snapshot ids from the ioctl structure, then enumerate
all extent owners and do some magic to unshare/defrag/share. That's a
quick idea, lots of details would need to be clarified.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-21 09:42, Timofey Titovets wrote:

Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn:


On 2018-05-19 04:54, Niccolò Belli wrote:

On Friday 18 May 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:

With a bit of work, it's possible to handle things sanely.  You can
deduplicate data from snapshots, even if they are read-only (you need
to pass the `-A` option to duperemove and run it as root), so it's
perfectly reasonable to only defrag the main subvolume, and then
deduplicate the snapshots against that (so that they end up all being
reflinks to the main subvolume).  Of course, this won't work if you're
short on space, but if you're dealing with snapshots, you should have
enough space that this will work (because even without defrag, it's
fully possible for something to cause the snapshots to suddenly take
up a lot more space).


Been there, tried that. Unfortunately even if I skip the defrag, a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than was previously available (probably
due to autodefrag?).

It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
ioctl).  There are two things involved here:



* BTRFS has somewhat odd and inefficient handling of partial extents.
When part of an extent becomes unused (because of a CLONE ioctl, or an
EXTENT_SAME ioctl, or something similar), that part stays allocated
until the whole extent would be unused.
* You're using the default deduplication block size (128k), which is
larger than your filesystem block size (which is at most 64k, most
likely 16k, but might be 4k if it's an old filesystem), so deduplicating
can split extents.


That's the metadata node/leaf size, which != the fs block size.
The btrfs fs block size == machine page size currently.
You're right, I keep forgetting about that (probably because BTRFS is 
pretty much the only modern filesystem that doesn't let you change the 
block size).



Because of this, if a duplicate region happens to overlap the front of
an already shared extent, and the end of said shared extent isn't
aligned with the deduplication block size, the EXTENT_SAME call will
deduplicate the first part, creating a new shared extent, but not the
tail end of the existing shared region, and all of that original shared
region will stick around, taking up extra space that it wasn't before.



Additionally, if only part of an extent is duplicated, then that area of
the extent will stay allocated, because the rest of the extent is still
referenced (so you won't necessarily see any actual space savings).



You can mitigate this by telling duperemove to use the same block size
as your filesystem using the `-b` option.   Note that using a smaller
block size will also slow down the deduplication process and greatly
increase the size of the hash file.


duperemove's -b only controls how the data is hashed, nothing more or less,
and it only supports 4KiB..1MiB.
And you can only deduplicate the data at the granularity you hashed it 
at.  In particular:


* The total size of a region being deduplicated has to be an exact 
multiple of the hash block size (what you pass to `-b`).  So for the 
default 128k size, you can only deduplicate regions that are multiples 
of 128k long (128k, 256k, 384k, 512k, etc).   This is a simple limit 
derived from how blocks are matched for deduplication.
* Because duperemove uses fixed hash blocks (as opposed to using a 
rolling hash window like many file synchronization tools do), the 
regions being deduplicated also have to be exactly aligned to the hash 
block size.  So, with the default 128k size, you can only deduplicate 
regions starting at 0k, 128k, 256k, 384k, 512k, etc, but not ones 
starting at, for example, 64k into the file.


And the block size used for dedup will change the efficiency of deduplication,
while the number of hash-block pairs will change the hash file size and time
complexity.

Let's assume that: 'A' - 1KiB of data, 'AAAA' - 4KiB with a repeated pattern.

So, for example, you have 2 blocks of 2x4KiB:
1: 'AAAABBBB'
2: 'BBBBAAAA'

With -b 8KiB the hash of the first block is not the same as the second's.
But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB',
and then those blocks will be deduped.
This supports what I'm saying though.  Your deduplication granularity is 
bounded by your hash granularity.  If in addition to the above you have 
a file that looks like:


AABBBBAA

It would not get deduplicated against the first two at either `-b 4k` or 
`-b 8k` despite the middle 4k of the file being an exact duplicate of 
the final 4k of the first file and first 4k of the second one.


If instead you have:

AABBBBBB

And the final 6k is a single on-disk extent; that extent will get split
when you go to deduplicate against the first two files with a 4k block
size, because only the final 4k can be deduplicated, and the entire 6k
original extent will stay completely allocated.
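
One way to watch this happen on a scratch btrfs directory (purely a sketch; the file names are made up, and filefrag output will vary with the filesystem's state):

  # An 8KiB file: 2KiB of 'A' followed by 6KiB of 'B', written in one go.
  ( head -c 2048 /dev/zero | tr '\0' 'A'; head -c 6144 /dev/zero | tr '\0' 'B' ) > big
  # A 4KiB file of 'B' for the aligned tail of 'big' to be matched against.
  head -c 4096 /dev/zero | tr '\0' 'B' > small
  sync

  filefrag -v big        # note the extent layout before deduplication
  duperemove -d -b 4096 big small
  filefrag -v big        # the aligned final 4KiB should now map to a shared
                         # extent, while the original extent stays referenced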


Also, duperemove has 2 modes of deduping:
1. By extents
2. By blocks
Yes, you can force it to not 

Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Niccolò Belli

On Sunday 20 May 2018 12:59:28 CEST, Tomasz Pala wrote:

On Sat, May 19, 2018 at 10:56:32 +0200, Niccolò Belli wrote:

snapper users with hourly snapshots will not have any use for it.

Anyone with hourly snapshots is doomed anyway.


I do not agree: having hourly snapshots doesn't mean you cannot limit
snapshots to a reasonable number. In fact you can simply keep a dozen of
them, then start discarding the older ones.
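
With snapper that policy is just a matter of the timeline limits; a sketch (the config name "root" and the exact numbers are only examples, and the periodic snapper cleanup job is assumed to be enabled):

  # Keep roughly a dozen hourly snapshots and let cleanup drop the rest.
  snapper -c root set-config \
      "TIMELINE_CREATE=yes" \
      "TIMELINE_LIMIT_HOURLY=12" \
      "TIMELINE_LIMIT_DAILY=0" \
      "TIMELINE_LIMIT_WEEKLY=0" \
      "TIMELINE_LIMIT_MONTHLY=0" \
      "TIMELINE_LIMIT_YEARLY=0"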



Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Timofey Titovets
Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn:

> On 2018-05-19 04:54, Niccolò Belli wrote:
> > On Friday 18 May 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> >> With a bit of work, it's possible to handle things sanely.  You can
> >> deduplicate data from snapshots, even if they are read-only (you need
> >> to pass the `-A` option to duperemove and run it as root), so it's
> >> perfectly reasonable to only defrag the main subvolume, and then
> >> deduplicate the snapshots against that (so that they end up all being
> >> reflinks to the main subvolume).  Of course, this won't work if you're
> >> short on space, but if you're dealing with snapshots, you should have
> >> enough space that this will work (because even without defrag, it's
> >> fully possible for something to cause the snapshots to suddenly take
> >> up a lot more space).
> >
> > Been there, tried that. Unfortunately even if I skip the defrag, a simple
> >
> > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> >
> > is going to eat more space than was previously available (probably
> > due to autodefrag?).
> It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> ioctl).  There are two things involved here:

> * BTRFS has somewhat odd and inefficient handling of partial extents.
> When part of an extent becomes unused (because of a CLONE ioctl, or an
> EXTENT_SAME ioctl, or something similar), that part stays allocated
> until the whole extent would be unused.
> * You're using the default deduplication block size (128k), which is
> larger than your filesystem block size (which is at most 64k, most
> likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> can split extents.

That's the metadata node/leaf size, which != the fs block size.
The btrfs fs block size == machine page size currently.

> Because of this, if a duplicate region happens to overlap the front of
> an already shared extent, and the end of said shared extent isn't
> aligned with the deduplication block size, the EXTENT_SAME call will
> deduplicate the first part, creating a new shared extent, but not the
> tail end of the existing shared region, and all of that original shared
> region will stick around, taking up extra space that it wasn't before.

> Additionally, if only part of an extent is duplicated, then that area of
> the extent will stay allocated, because the rest of the extent is still
> referenced (so you won't necessarily see any actual space savings).

> You can mitigate this by telling duperemove to use the same block size
> as your filesystem using the `-b` option.   Note that using a smaller
> block size will also slow down the deduplication process and greatly
> increase the size of the hash file.

duperemove's -b only controls how the data is hashed, nothing more or less,
and it only supports 4KiB..1MiB.

And the block size used for dedup will change the efficiency of deduplication,
while the number of hash-block pairs will change the hash file size and time
complexity.

Let's assume that: 'A' - 1KiB of data, 'AAAA' - 4KiB with a repeated pattern.

So, for example, you have 2 blocks of 2x4KiB:
1: 'AAAABBBB'
2: 'BBBBAAAA'

With -b 8KiB the hash of the first block is not the same as the second's.
But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB',
and then those blocks will be deduped.
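
A quick way to try that out in a scratch directory on a btrfs mount, taking 'A' and 'B' as literal 1KiB runs of those bytes (a sketch only):

  # Two 8KiB files laid out as 'AAAABBBB' and 'BBBBAAAA' in 1KiB units.
  ( head -c 4096 /dev/zero | tr '\0' 'A'; head -c 4096 /dev/zero | tr '\0' 'B' ) > f1
  ( head -c 4096 /dev/zero | tr '\0' 'B'; head -c 4096 /dev/zero | tr '\0' 'A' ) > f2
  sync

  duperemove -d -b 8192 f1 f2   # the two 8KiB hashes differ, nothing matches
  duperemove -d -b 4096 f1 f2   # the 4KiB 'AAAA' and 'BBBB' blocks match and can be deduped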

Also, duperemove has 2 modes of deduping:
1. By extents
2. By blocks

Thanks.

--
Have a nice day,
Timofey.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-19 04:54, Niccolò Belli wrote:

On Friday 18 May 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need 
to pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take 
up a lot more space).


Been there, tried that. Unfortunately even if I skip the defrag, a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than was previously available (probably
due to autodefrag?).
It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME 
ioctl).  There are two things involved here:


* BTRFS has somewhat odd and inefficient handling of partial extents. 
When part of an extent becomes unused (because of a CLONE ioctl, or an 
EXTENT_SAME ioctl, or something similar), that part stays allocated 
until the whole extent would be unused.
* You're using the default deduplication block size (128k), which is 
larger than your filesystem block size (which is at most 64k, most 
likely 16k, but might be 4k if it's an old filesystem), so deduplicating 
can split extents.


Because of this, if a duplicate region happens to overlap the front of 
an already shared extent, and the end of said shared extent isn't 
aligned with the deduplication block size, the EXTENT_SAME call will 
deduplicate the first part, creating a new shared extent, but not the 
tail end of the existing shared region, and all of that original shared 
region will stick around, taking up extra space that it wasn't before.


Additionally, if only part of an extent is duplicated, then that area of 
the extent will stay allocated, because the rest of the extent is still 
referenced (so you won't necessarily see any actual space savings).


You can mitigate this by telling duperemove to use the same block size 
as your filesystem using the `-b` option.   Note that using a smaller 
block size will also slow down the deduplication process and greatly 
increase the size of the hash file.
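
For example, on a typical system with 4KiB pages (and therefore a 4KiB data block size), the command quoted above might be adjusted to something like the following; the hash file name is only an example, and the run will be slower with a noticeably larger hash file:

  duperemove -drhA -b 4096 --dedupe-options=noblock \
      --hashfile=rootfs.hash rootfs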



Re: Any chance to get snapshot-aware defragmentation?

2018-05-19 Thread Niccolò Belli

On Saturday 19 May 2018 01:55:30 CEST, Tomasz Pala wrote:

The "defrag only not-snapshotted data" mode would be enough for many
use cases and wouldn't require more RAM. One could run this before
taking a snapshot and merge _at least_ the new data.


snapper users with hourly snapshots will not have any use for it.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-19 Thread Niccolò Belli

On Friday 18 May 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
With a bit of work, it's possible to handle things sanely.  You 
can deduplicate data from snapshots, even if they are read-only 
(you need to pass the `-A` option to duperemove and run it as 
root), so it's perfectly reasonable to only defrag the main 
subvolume, and then deduplicate the snapshots against that (so 
that they end up all being reflinks to the main subvolume).  Of 
course, this won't work if you're short on space, but if you're 
dealing with snapshots, you should have enough space that this 
will work (because even without defrag, it's fully possible for 
something to cause the snapshots to suddenly take up a lot more 
space).


Been there, tried that. Unfortunately even if I skip the defrag, a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than was previously available (probably due
to autodefrag?).


Niccolò


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Tomasz Pala
On Fri, May 18, 2018 at 13:10:02 -0400, Austin S. Hemmelgarn wrote:

> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 

The "defrag only not-snapshotted data" mode would be enough for many
use cases and wouldn't require more RAM. One could run this before
taking a snapshot and merge _at least_ the new data.

And even with the current approach it should be possible to interleave
defragmentation with some kind of naive deduplication; "naive" in the
sense of comparing blocks only within the same in-subvolume paths.
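
Something in that spirit can already be approximated from userspace by pointing duperemove at matching paths only, rather than at whole subvolumes (a rough sketch; the subvolume layout is invented):

  # Dedupe each snapshot's copy of a tree only against the live copy,
  # instead of hashing everything against everything.
  for snap in /mnt/@snapshots/*; do
      duperemove -drA "/mnt/@/usr" "$snap/usr"
  done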

-- 
Tomasz Pala 


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Omar Sandoval
On Fri, May 18, 2018 at 04:26:16PM -0600, Chris Murphy wrote:
> On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn wrote:
> > On 2018-05-18 13:18, Niccolò Belli wrote:
> >>
> >> On Friday 18 May 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
> >>>
> >>> and also forces the people who have ridiculous numbers of snapshots to
> >>> deal with the memory usage or never defrag
> >>
> >>
> >> Whoever has at least one snapshot is never going to defrag anyway, unless
> >> he is willing to double the used space.
> >>
> > With a bit of work, it's possible to handle things sanely.  You can
> > deduplicate data from snapshots, even if they are read-only (you need to
> > pass the `-A` option to duperemove and run it as root), so it's perfectly
> > reasonable to only defrag the main subvolume, and then deduplicate the
> > snapshots against that (so that they end up all being reflinks to the main
> > subvolume).  Of course, this won't work if you're short on space, but if
> > you're dealing with snapshots, you should have enough space that this will
> > work (because even without defrag, it's fully possible for something to
> > cause the snapshots to suddenly take up a lot more space).
> 
> 
> Curiously, snapshot aware defragmentation is going to increase free
> space fragmentation. For busy in-use systems, it might be necessary to
> use space cache v2 to avoid performance problems.
> 
> I forget the exact reason why the free space tree is not the default,
> I think it has to do with missing repair support?

Yeah, Nikolay is working on that.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Chris Murphy
On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-18 13:18, Niccolò Belli wrote:
>>
>> On Friday 18 May 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
>>>
>>> and also forces the people who have ridiculous numbers of snapshots to
>>> deal with the memory usage or never defrag
>>
>>
>> Whoever has at least one snapshot is never going to defrag anyway, unless
>> he is willing to double the used space.
>>
> With a bit of work, it's possible to handle things sanely.  You can
> deduplicate data from snapshots, even if they are read-only (you need to
> pass the `-A` option to duperemove and run it as root), so it's perfectly
> reasonable to only defrag the main subvolume, and then deduplicate the
> snapshots against that (so that they end up all being reflinks to the main
> subvolume).  Of course, this won't work if you're short on space, but if
> you're dealing with snapshots, you should have enough space that this will
> work (because even without defrag, it's fully possible for something to
> cause the snapshots to suddenly take up a lot more space).


Curiously, snapshot aware defragmentation is going to increase free
space fragmentation. For busy in-use systems, it might be necessary to
use space cache v2 to avoid performance problems.
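
Switching an existing filesystem over is a one-time operation, roughly as below (device and mount point are placeholders; building the free space tree on that first mount can take a while on a large filesystem):

  umount /mnt
  mount -o space_cache=v2 /dev/sdX1 /mnt
  # later mounts keep using the free space tree without needing the option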

I forget the exact reason why the free space tree is not the default,
I think it has to do with missing repair support?


-- 
Chris Murphy


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 13:18, Niccolò Belli wrote:

On Friday 18 May 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
and also forces the people who have ridiculous numbers of snapshots to 
deal with the memory usage or never defrag


Whoever has at least one snapshot is never going to defrag anyway, 
unless he is willing to double the used space.


With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need to 
pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take up 
a lot more space).
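
As a concrete sketch of that workflow (the mount point, subvolume and snapshot paths are placeholders, and duperemove needs to run as root for -A to work on read-only snapshots):

  # Defragment only the writable main subvolume.
  btrfs filesystem defragment -r /mnt/@

  # Then deduplicate the snapshots against it, so their data ends up as
  # reflinks back into the main subvolume.
  duperemove -drhA --hashfile=/var/tmp/rootfs.hash /mnt/@ /mnt/@snapshots/*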



Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Niccolò Belli

On Friday 18 May 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
and also forces the people who have ridiculous numbers of 
snapshots to deal with the memory usage or never defrag


Whoever has at least one snapshot is never going to defrag anyway, unless 
he is willing to double the used space.


Niccolò


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 12:36, Niccolò Belli wrote:

On Friday 18 May 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years have passed since then without
any news on this front.


What I would really like to know is why nobody resumed his work: is it 
because it's impossible to implement snapshot-aware defrag without
excessive RAM usage or is it simply because nobody is interested?
I think it's because nobody who is interested has both the time and the 
coding skills to tackle it.


Personally though, I think the biggest issue with what was done was not 
the memory consumption, but the fact that there was no switch to turn it 
on or off.  Making defrag unconditionally snapshot aware removes one of 
the easiest ways to forcibly unshare data without otherwise altering the 
files (which, as stupid as it sounds, is actually really useful for some 
storage setups), and also forces the people who have ridiculous numbers 
of snapshots to deal with the memory usage or never defrag.




Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Niccolò Belli

On Friday 18 May 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years have passed since then without any
news on this front.


What I would really like to know is why nobody resumed his work: is it 
because it's impossible to implement snapshot-aware defrag without
excessive RAM usage or is it simply because nobody is interested?


Niccolò


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread David Sterba
On Fri, May 11, 2018 at 05:22:26PM +0200, Niccolò Belli wrote:
> I've been waiting for this feature for years, and initially it seemed like
> something which would have been worked on, sooner or later.
> A long time had passed without any progress on this, so I would like to 
> know if there is any technical limitation preventing this or if it's 
> something which could possibly land in the near future.

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.