Re: [GIT PULL] xfs: shared data extents support for 4.9-rc1

2016-10-12 Thread Darrick J. Wong
On Wed, Oct 12, 2016 at 11:18:49PM +1100, Dave Chinner wrote:
> Hi Linus,
> 
> This is the second part of the XFS updates for this merge cycle.
> This pullreq contains the new shared data extents feature for XFS,
> and can be found at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1
> 
> The full pull request output is below.
> 
> Given the complexity and size of this change I am expecting - like
> the addition of reverse mapping last cycle - that there will be some
> follow-up bug fixes and cleanups around the -rc3 stage for issues
> that I'm sure will show up once the code hits a wider userbase.
> 
> What it is:
> 
> At the most basic level we are simply adding shared data extents to
> XFS - i.e. a single extent on disk can now have multiple owners. To
> do this we have to add new on-disk features to both track the shared
> extents and the number of times they've been shared. This is done by
> the new "refcount" btree that sits in every allocation group. When
> we share or unshare an extent, this tree gets updated.
> 
> Along with this new tree, the reverse mapping tree needs to be
> updated to track each owner of a shared extent. This also needs to
> be updated on every share/unshare operation. These interactions at
> extent allocation and freeing time have complex ordering and
> recovery constraints, so there's a significant amount of new
> intent-based transaction code to ensure that operations are
> performed atomically from both the runtime and integrity/crash
> recovery perspectives.
> 
> We also need to break sharing when writes hit a shared extent - this
> is where the new copy-on-write implementation comes in. We allocate
> new storage and copy the original data along with the overwrite data
> into the new location.  We only do this for data as we don't share
> metadata at all - each inode has its own metadata that tracks the
> shared data extents, the extents undergoing CoW and its own private
> extents.
> 
> Of course, being XFS, nothing is simple - we use delayed allocation
> for CoW similar to how we use it for normal writes. ENOSPC is a
> significant issue here - we build on the reservation code added
> in 4.8-rc1 with the reverse mapping feature to ensure we don't get
> spurious ENOSPC issues part way through a CoW operation. These
> mechanisms also help minimise fragmentation due to repeated CoW
> operations.  To further reduce fragmentation overhead, we've also
> introduced a CoW extent size hint, which indicates how large a
> region we should allocate when we execute a CoW operation.
> 
> With all this functionality in place, we can hook up
> .copy_file_range, .clone_file_range and .dedupe_file_range and we
> gain all the capabilities of reflink and other VFS-provided
> functionality that enables manipulation of shared extents. We also
> added a fallocate mode that explicitly unshares a range of a file,
> which we implemented as an explicit CoW of all the shared extents in
> a file.
> 
> As such, it's a huge chunk of new functionality with new on-disk
> format features and internal infrastructure. It warns at mount time
> that it is an experimental feature and may eat data (as we do with
> all new on-disk features until they stabilise).  We have not
> released userspace support for it yet - userspace support currently
> requires downloading Darrick's xfsprogs repo and building from
> source, so access to this feature is really developer/tester
> only at this point. Initial userspace support will be released at
> the same time the kernel with this code in it is released.

Userland support is in this branch:
https://github.com/djwong/xfsprogs/tree/for-dave-for-4.9-15

There will undoubtedly be more of these since Dave will libxfs-apply
the kernel patches into for-next after the merge window closes, after
which I'll rebase the tool patches against that.

> The new code causes 5-6 new failures with xfstests - these aren't
> serious functional failures but things like the output of tests changing
> slightly due to perturbations in layouts, space usage, etc.  OTOH,
> we've added 150+ new tests to xfstests that specifically exercise
> this new functionality so it's got far better test coverage than any
> functionality we've previously added to XFS.

https://github.com/djwong/xfstests/tree/djwong-devel
has fixes for some of the tests, if you dare. :)

I'll resync with upstream the next time I see an xfstests.git update.
(Merge window is open, so I don't anticipate that until next week.)

> Darrick has done a pretty amazing job getting us to this stage, and
> special mention also needs to go to Christoph (review, testing,
> improvements and bug fixes) and Brian (caught several intricate
> bugs during review) for the effort they've also put in.

Yes, my hearty thanks to Dave, Christoph, and Brian for their support!

--D

> 
> Thanks,
> 
> -Dave.
> 
> --
> The following changes since commit 155cd433b516506df065866f3d974661f6473572

[GIT PULL] xfs: shared data extents support for 4.9-rc1

2016-10-12 Thread Dave Chinner
Hi Linus,

This is the second part of the XFS updates for this merge cycle.
This pullreq contains the new shared data extents feature for XFS,
and can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1

The full pull request output is below.

Given the complexity and size of this change I am expecting - like
the addition of reverse mapping last cycle - that there will be some
follow-up bug fixes and cleanups around the -rc3 stage for issues
that I'm sure will show up once the code hits a wider userbase.

What it is:

At the most basic level we are simply adding shared data extents to
XFS - i.e. a single extent on disk can now have multiple owners. To
do this we have to add new on-disk features to both track the shared
extents and the number of times they've been shared. This is done by
the new "refcount" btree that sits in every allocation group. When
we share or unshare an extent, this tree gets updated.
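To make the refcount idea concrete, here is a toy model in C. It is only an illustration under simplifying assumptions: it keeps one counter per block in a flat array, and struct toy_ag, toy_alloc(), toy_share() and toy_unshare() are hypothetical names, whereas the real refcount btree stores extent records in each allocation group. A count of 0 means free, 1 means privately owned, and anything higher means shared; unsharing only frees a block when its last owner goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_AG_BLOCKS 64   /* hypothetical tiny allocation group */

/* Hypothetical per-AG table: 0 = free, 1 = one owner, >1 = shared. */
struct toy_ag {
    uint32_t refcount[TOY_AG_BLOCKS];
};

/* Allocate [start, start + len) to a single owner. */
void toy_alloc(struct toy_ag *ag, size_t start, size_t len)
{
    for (size_t b = start; b < start + len; b++) {
        assert(b < TOY_AG_BLOCKS && ag->refcount[b] == 0);
        ag->refcount[b] = 1;
    }
}

/* Another owner (e.g. a reflinked file) now references the range. */
void toy_share(struct toy_ag *ag, size_t start, size_t len)
{
    for (size_t b = start; b < start + len; b++) {
        assert(b < TOY_AG_BLOCKS && ag->refcount[b] > 0);
        ag->refcount[b]++;
    }
}

/* One owner drops the range; returns how many blocks became free. */
size_t toy_unshare(struct toy_ag *ag, size_t start, size_t len)
{
    size_t freed = 0;

    for (size_t b = start; b < start + len; b++) {
        assert(b < TOY_AG_BLOCKS && ag->refcount[b] > 0);
        if (--ag->refcount[b] == 0)
            freed++;
    }
    return freed;
}
```

In this model, reflinking a 4-block extent into a second file is one toy_share() call; deleting either file is a toy_unshare() call, and only the second deletion returns the blocks to free space.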

Along with this new tree, the reverse mapping tree needs to be
updated to track each owner of a shared extent. This also needs to
be updated on every share/unshare operation. These interactions at
extent allocation and freeing time have complex ordering and
recovery constraints, so there's a significant amount of new
intent-based transaction code to ensure that operations are
performed atomically from both the runtime and integrity/crash
recovery perspectives.
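The intent-based approach can be illustrated with a simplified recovery pass. This is a toy, not XFS's log format: each deferred operation first logs an intent record, later logs a matching done record once the work has committed, and recovery replays any intent that never got its done. XFS uses intent/done pairs of this kind (for example rmap update intents, RUI/RUD, and refcount update intents, CUI/CUD), but struct journal_rec and find_unfinished() below are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log record kinds for one deferred operation. */
enum rec_type { REC_INTENT, REC_DONE };

struct journal_rec {
    enum rec_type type;
    unsigned id;            /* links a DONE to the INTENT it completes */
};

/*
 * Toy recovery pass: find every intent with no later matching done
 * record. Those operations may not have finished before the crash, so
 * recovery must replay them. Stores up to `max` ids in `out` and
 * returns the number stored.
 */
size_t find_unfinished(const struct journal_rec *journal, size_t n,
                       unsigned *out, size_t max)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (journal[i].type != REC_INTENT)
            continue;

        bool done = false;
        for (size_t j = i + 1; j < n && !done; j++)
            done = journal[j].type == REC_DONE &&
                   journal[j].id == journal[i].id;

        if (!done && count < max)
            out[count++] = journal[i].id;
    }
    return count;
}
```

Scanning forward from each intent is quadratic, which is fine for an illustration; a real recovery pass would track outstanding intents in a lookup structure as it reads the log.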

We also need to break sharing when writes hit a shared extent - this
is where the new copy-on-write implementation comes in. We allocate
new storage and copy the original data along with the overwrite data
into the new location.  We only do this for data as we don't share
metadata at all - each inode has its own metadata that tracks the
shared data extents, the extents undergoing CoW and its own private
extents.
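The break-sharing step on write can be sketched with a toy block store. This is an illustrative model under stated assumptions (tiny 8-byte blocks, a per-block refcount array, and the hypothetical names toy_disk, toy_file and toy_write()). When a write targets a block whose refcount is above 1, it allocates a private block, copies the original contents there, remaps the writing file, drops one reference on the old block, and only then applies the new bytes. XFS itself uses delayed allocation and remaps the new blocks at I/O completion; the toy collapses those steps into one call.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS         16
#define BSIZE           8    /* toy block size in bytes */
#define TOY_FILE_BLOCKS 4

/* Hypothetical toy disk: block contents plus a per-block refcount. */
struct toy_disk {
    char data[NBLOCKS][BSIZE];
    unsigned refcount[NBLOCKS];     /* 0 = free */
};

/* Hypothetical toy file: logical block -> physical block map. */
struct toy_file {
    int map[TOY_FILE_BLOCKS];
};

static int alloc_block(struct toy_disk *d)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (d->refcount[b] == 0) {
            d->refcount[b] = 1;
            return b;
        }
    }
    return -1;                      /* the toy's ENOSPC */
}

/* Write len bytes at byte offset off within logical block lblk of f. */
int toy_write(struct toy_disk *d, struct toy_file *f, int lblk, int off,
              const char *buf, int len)
{
    assert(lblk >= 0 && lblk < TOY_FILE_BLOCKS);
    assert(off >= 0 && len >= 0 && off + len <= BSIZE);

    int p = f->map[lblk];
    if (d->refcount[p] > 1) {       /* shared: break sharing first */
        int q = alloc_block(d);
        if (q < 0)
            return -1;
        memcpy(d->data[q], d->data[p], BSIZE);  /* copy original data */
        d->refcount[p]--;                       /* drop our share */
        f->map[lblk] = q;                       /* remap to private copy */
        p = q;
    }
    memcpy(d->data[p] + off, buf, (size_t)len); /* apply the overwrite */
    return 0;
}
```

The other owner keeps seeing the original bytes, and a second write to the now-private block goes straight to it without another copy.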

Of course, being XFS, nothing is simple - we use delayed allocation
for CoW similar to how we use it for normal writes. ENOSPC is a
significant issue here - we build on the reservation code added
in 4.8-rc1 with the reverse mapping feature to ensure we don't get
spurious ENOSPC issues part way through a CoW operation. These
mechanisms also help minimise fragmentation due to repeated CoW
operations.  To further reduce fragmentation overhead, we've also
introduced a CoW extent size hint, which indicates how large a
region we should allocate when we execute a CoW operation.
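A CoW extent size hint makes the allocator round a small CoW write out to a larger region, so repeated small overwrites of the same shared area tend to land in one extent rather than many fragments. The arithmetic below is an illustrative sketch, assuming both ends are aligned out to multiples of the hint; struct cow_range and cow_alloc_range() are hypothetical, and the real allocator has further constraints this ignores.

```c
#include <stdint.h>

/* Half-open range of file blocks [start, end). */
struct cow_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical helper: given a CoW write covering blocks [off, off + len),
 * return the region to allocate when a CoW extent size hint of `hint`
 * blocks is set. Both ends are aligned out to multiples of the hint;
 * hint values of 0 or 1 mean "no rounding".
 */
struct cow_range cow_alloc_range(uint64_t off, uint64_t len, uint64_t hint)
{
    struct cow_range r = { off, off + len };

    if (hint > 1) {
        r.start = off - off % hint;                    /* round start down */
        r.end = (off + len + hint - 1) / hint * hint;  /* round end up */
    }
    return r;
}
```

With a 16-block hint, a 2-block overwrite at block 5 maps to blocks [0, 16), and one at block 15 straddles a boundary and maps to [0, 32); in this model later small overwrites nearby fall inside space already reserved.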

With all this functionality in place, we can hook up
.copy_file_range, .clone_file_range and .dedupe_file_range and we
gain all the capabilities of reflink and other VFS-provided
functionality that enables manipulation of shared extents. We also
added a fallocate mode that explicitly unshares a range of a file,
which we implemented as an explicit CoW of all the shared extents in
a file.
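From userspace, the clone operation is reachable through the FICLONE ioctl (with FICLONERANGE and FIDEDUPERANGE as the range and dedupe variants). The following is a minimal sketch, assuming a Linux system whose kernel headers define FICLONE (a fallback define is included for older headers): it asks the filesystem to share the source file's extents with the destination and falls back to an ordinary byte copy when the filesystem or mount cannot do that. reflink_or_copy() and copy_bytes() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>            /* FICLONE on kernel headers >= 4.5 */

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)
#endif

/* Fallback: plain byte copy of src (from offset 0) into dst (at offset 0). */
static int copy_bytes(int src_fd, int dst_fd)
{
    char buf[65536];
    off_t off = 0;

    for (;;) {
        ssize_t n = pread(src_fd, buf, sizeof(buf), off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return 0;            /* reached EOF of src */

        ssize_t done = 0;
        while (done < n) {
            ssize_t w = pwrite(dst_fd, buf + done, (size_t)(n - done),
                               off + done);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        off += n;
    }
}

/*
 * Try to share all of src's extents with dst; if the filesystem cannot
 * reflink, copy the bytes instead. Returns 1 if extents were shared,
 * 0 if the data was copied, -1 on error (errno set).
 */
int reflink_or_copy(int src_fd, int dst_fd)
{
    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
        return 1;

    switch (errno) {
    case EOPNOTSUPP:             /* filesystem has no clone support */
    case ENOTTY:                 /* ioctl not recognised */
    case EXDEV:                  /* files on different mounts */
    case EINVAL:
    case ENOSYS:
        return copy_bytes(src_fd, dst_fd) == 0 ? 0 : -1;
    default:
        return -1;
    }
}
```

A return of 1 means the two files now share blocks, so a later write to either one goes through the CoW path described above; 0 means the data was duplicated. A shared range can later be unshared with fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, len).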

As such, it's a huge chunk of new functionality with new on-disk
format features and internal infrastructure. It warns at mount time
that it is an experimental feature and may eat data (as we do with
all new on-disk features until they stabilise).  We have not
released userspace support for it yet - userspace support currently
requires downloading Darrick's xfsprogs repo and building from
source, so access to this feature is really developer/tester
only at this point. Initial userspace support will be released at
the same time the kernel with this code in it is released.

The new code causes 5-6 new failures with xfstests - these aren't
serious functional failures but things like the output of tests changing
slightly due to perturbations in layouts, space usage, etc.  OTOH,
we've added 150+ new tests to xfstests that specifically exercise
this new functionality so it's got far better test coverage than any
functionality we've previously added to XFS.

Darrick has done a pretty amazing job getting us to this stage, and
special mention also needs to go to Christoph (review, testing,
improvements and bug fixes) and Brian (caught several intricate
bugs during review) for the effort they've also put in.

Thanks,

-Dave.

--
The following changes since commit 155cd433b516506df065866f3d974661f6473572:

  Merge branch 'xfs-4.9-log-recovery-fixes' into for-next (2016-10-03 09:56:28 +1100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1

for you to fetch changes up to feac470e3642e8956ac9b7f14224e6b301b9219d:

  xfs: convert COW blocks to real blocks before unwritten extent conversion (2016-10-11 09:03:19 +1100)


xfs: reflink update for 4.9-rc1

 __________________________________
< XFS has gained super CoW powers! >
 ----------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Included in this update:
- unshare range (FALLOC_FL_UNSHARE_RANGE) support for fallocate
- copy-on-write extent size hints (FS_XFLAG_COWEXTSI