On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation
> at Labs conf is pretty expensive. We'd need to set a flag that pauses
> new page faults, set the WP bit on affected ranges, do the snapshot,
> commit, clear the flag, and wake up the waiting threads. Neither of us
> had any concrete idea of how well that would perform and it still
> depends on finding a good way to resolve all open mmap ranges on a
> subvolume. Perhaps using the address_space->private_list anchored on
> each root would work.
This is a potentially wild idea, so take it with a grain of salt. I may
misuse the exact terminology.
So the essential problem of DAX is basically the opposite of data
deduplication. Instead of merging two duplicate data regions, you want to
mark regions as at-risk while keeping the original content intact if there
are snapshots in conflict.
So suppose you _require_ data checksums and a data mode of "dup", mirror,
or one of the other fault-tolerant layouts.
By definition, any block that gets written with content it didn't have
before will now have a stale checksum. If the inode is flagged for direct
IO, that stale checksum is itself the indication that the block has been
updated.
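To make the detection step above concrete: no explicit dirty tracking is
needed, because a DAX write that bypasses CoW leaves the previously stored
checksum stale. A minimal sketch, with CRC32 standing in for btrfs's actual
per-block data checksum:

```python
import zlib

def csum(data: bytes) -> int:
    # CRC32 stands in for btrfs's real per-block data checksum
    return zlib.crc32(data)

stored_csum = csum(b"committed contents")   # recorded at the last commit

block = bytearray(b"committed contents")
block[0:3] = b"DAX"                         # direct write lands in place, no CoW

# the checksum mismatch alone tells us the block was updated
updated = csum(bytes(block)) != stored_csum
print(updated)                              # True
```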
At this point you really just need to do the opposite of deduplication:
find/recover the original contents and assign those (or leave them
assigned) to the old/other snapshots, then compute the new checksum on the
"original block" and assign it to the active subvolume.
So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now that original storage region has only one user, so on sync or close
you fix its checksums on the cheap.
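The sync/close step above could be sketched like this. Everything here is
hypothetical scaffolding: subvolumes are modeled as plain dicts of offset
-> extent, CRC32 stands in for the real checksum, and the recovered old
contents are simply passed in (in practice they'd come from the dup/mirror
copy).

```python
import zlib

def csum(data: bytes) -> int:
    # CRC32 stands in for btrfs's per-block data checksum
    return zlib.crc32(data)

class Extent:
    """Toy extent: contents, the checksum recorded at last commit, a refcount."""
    def __init__(self, data: bytes):
        self.data = bytes(data)
        self.stored_csum = csum(self.data)
        self.refs = 1

def sync_extent(offset, active, snapshots, recovered_old_data):
    ext = active[offset]
    if ext.refs > 1 and csum(ext.data) != ext.stored_csum:
        # Slide the rock under the rug: the *old* contents (recovered from
        # the duplicate copy) get a new extent assigned to the snapshots,
        # while the active subvolume keeps the updated block in place.
        old = Extent(recovered_old_data)
        old.refs = ext.refs - 1
        for snap in snapshots:
            if snap.get(offset) is ext:
                snap[offset] = old
        ext.refs = 1
    # The extent now has a single user, so fixing its checksum is cheap.
    ext.stored_csum = csum(ext.data)

# Usage: a shared extent gets dirtied by a DAX write, then synced.
shared = Extent(b"old contents")
shared.refs = 2
active, snap = {0: shared}, {0: shared}
shared.data = b"new contents"              # DAX write, CoW bypassed
sync_extent(0, active, [snap], b"old contents")
# snap[0] now holds the old contents; active[0] keeps the new ones
```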
Instead of the new data being a small rock sitting over a large rug to
make a lump, the new data is like a rock being slid under the rug to
make a lump.
So the first write to an extent creates a burdensome copy to retain the
old contents, but second and subsequent writes to the same extent only
have the cost of an _eventual_ checksum of the original block list.
Maybe if the data isn't already duplicated, then the write mapping, the
DAX open, or the setting of the S_DUP flag could force the file into an
extent block that _is_ duplicated.
The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated to the
snapshots since data copy is idempotent.
The side note is that it only ever matters if the usage count is greater
than one, so at worst taking a snapshot, which is already a _little_ racy
anyway, could trigger a semi-lightweight copy of any S_DAX files:
if S_DAX:
    if checksum invalid:
        copy data as-is and checksum, store in snapshot
    else:
        look for duplicate checksum
        if duplicate found:
            assign that extent to the snapshot
        else:
            if file opened for writing and has any mmaps for write:
                copy extent and assign to new snapshot
            else:
                increment usage count and assign current block to snapshot
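The decision tree above can be made runnable as a sketch. The names here
(snapshot_dax_extent, the csum_index checksum-to-extent lookup table, the
writable_mmaps flag) are all hypothetical, and CRC32 again stands in for
the real checksum:

```python
import zlib

def csum(data: bytes) -> int:
    return zlib.crc32(data)       # stand-in for the real data checksum

class Extent:
    def __init__(self, data: bytes):
        self.data = bytes(data)
        self.stored_csum = csum(self.data)
        self.refs = 1

def snapshot_dax_extent(ext, csum_index, writable_mmaps):
    """Pick what the new snapshot gets for one S_DAX extent."""
    if csum(ext.data) != ext.stored_csum:
        return Extent(ext.data)       # checksum invalid: copy as-is, re-checksum
    dup = csum_index.get(ext.stored_csum)
    if dup is not None:
        dup.refs += 1
        return dup                    # duplicate found: share that extent
    if writable_mmaps:
        return Extent(ext.data)       # live writers: copy for the snapshot
    ext.refs += 1
    return ext                        # otherwise just bump the usage count
```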
Anyway, I only know enough of the internals to be dangerous.
Since the real goal of mmap is speed during actual update, this idea is
basically about amortizing the copy costs into the task of maintaining
the snapshots instead of leaving them in the immediate hands of the
time-critical updater.
A flush, munmap, or close by the user, or a system-wide sync event, are
also good points to expense the bookkeeping time.