On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation
> at Labs conf is pretty expensive. We'd need to set a flag that pauses
> new page faults, set the WP bit on affected ranges, do the snapshot,
> commit, clear the flag, and wake up the waiting threads. Neither of us
> had any concrete idea of how well that would perform and it still
> depends on finding a good way to resolve all open mmap ranges on a
> subvolume. Perhaps using the address_space->private_list anchored on
> each root would work.
This is a potentially wild idea, so take it with a grain of salt. I may
misuse the exact terminology.
So the essential problem of DAX is basically the opposite of data
deduplication. Instead of merging two duplicate data regions, you want to
mark regions as at-risk while keeping the original content intact if there
are snapshots in conflict.
So suppose you _require_ data checksums and a data mode of "dup", mirror,
or one of the other fault-tolerant layouts.
By definition, any block that gets written with content it didn't have
before will now have a stale checksum. If the inode is flagged for direct
IO, that stale checksum is itself the indication that the block has been
updated.
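To make the detection step above concrete: no explicit dirty tracking is
needed, because a DAX write that bypasses CoW leaves the previously stored
checksum stale. A minimal sketch, with CRC32 standing in for btrfs's actual
per-block data checksum:

```python
import zlib

def csum(data: bytes) -> int:
    # CRC32 stands in for btrfs's real per-block data checksum
    return zlib.crc32(data)

stored_csum = csum(b"committed contents")   # recorded at the last commit

block = bytearray(b"committed contents")
block[0:3] = b"DAX"                         # direct write lands in place, no CoW

# the checksum mismatch alone tells us the block was updated
updated = csum(bytes(block)) != stored_csum
print(updated)                              # True
```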
At this point you really just need to do the opposite of deduplication:
find/recover the original contents and assign those (or leave them
assigned) to the old/other snapshots, then compute the new checksum on the
"original block" and assign it to the active subvolume.
So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now that original storage region has only one user, so on sync or close
you fix its checksums on the cheap.
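The sync/close step above could be sketched like this. Everything here is
hypothetical scaffolding: subvolumes are modeled as plain dicts of offset
-> extent, CRC32 stands in for the real checksum, and the recovered old
contents are simply passed in (in practice they'd come from the dup/mirror
copy).

```python
import zlib

def csum(data: bytes) -> int:
    # CRC32 stands in for btrfs's per-block data checksum
    return zlib.crc32(data)

class Extent:
    """Toy extent: contents, the checksum recorded at last commit, a refcount."""
    def __init__(self, data: bytes):
        self.data = bytes(data)
        self.stored_csum = csum(self.data)
        self.refs = 1

def sync_extent(offset, active, snapshots, recovered_old_data):
    ext = active[offset]
    if ext.refs > 1 and csum(ext.data) != ext.stored_csum:
        # Slide the rock under the rug: the *old* contents (recovered from
        # the duplicate copy) get a new extent assigned to the snapshots,
        # while the active subvolume keeps the updated block in place.
        old = Extent(recovered_old_data)
        old.refs = ext.refs - 1
        for snap in snapshots:
            if snap.get(offset) is ext:
                snap[offset] = old
        ext.refs = 1
    # The extent now has a single user, so fixing its checksum is cheap.
    ext.stored_csum = csum(ext.data)

# Usage: a shared extent gets dirtied by a DAX write, then synced.
shared = Extent(b"old contents")
shared.refs = 2
active, snap = {0: shared}, {0: shared}
shared.data = b"new contents"              # DAX write, CoW bypassed
sync_extent(0, active, [snap], b"old contents")
# snap[0] now holds the old contents; active[0] keeps the new ones
```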
Instead of the new data being a small rock sitting over a large rug to
make a lump, the new data is like a rock being slid under the rug to
make a lump.
So the first write to an extent creates a burdensome copy to retain the
old contents, but second and subsequent writes to the same extent only
have the cost of an _eventual_ checksum of the original block list.
Maybe if the data isn't already duplicated, then the write mapping, the
DAX open, or the setting of the S_DUP flag could force the file into an
extent block that _is_ duplicated.
The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated to the
snapshots since data copy is idempotent.
The side note is that it only ever matters if the usage count is greater
than one, so at worst taking a snapshot, which is already a _little_ racy
anyway, could trigger a semi-lightweight copy of any S_DAX files:
if S_DAX:
    if checksum invalid:
        copy data as-is and checksum, store in snapshot
    else:
        look for duplicate checksum
        if duplicate found:
            assign that extent to the snapshot
        else:
            if file opened for writing and has any mmaps for write:
                copy extent and assign to new snapshot
            else:
                increment usage count and assign current block to snapshot
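The decision tree above can be made runnable as a sketch. The names here
(snapshot_dax_extent, the csum_index checksum-to-extent lookup table, the
writable_mmaps flag) are all hypothetical, and CRC32 again stands in for
the real checksum:

```python
import zlib

def csum(data: bytes) -> int:
    return zlib.crc32(data)       # stand-in for the real data checksum

class Extent:
    def __init__(self, data: bytes):
        self.data = bytes(data)
        self.stored_csum = csum(self.data)
        self.refs = 1

def snapshot_dax_extent(ext, csum_index, writable_mmaps):
    """Pick what the new snapshot gets for one S_DAX extent."""
    if csum(ext.data) != ext.stored_csum:
        return Extent(ext.data)       # checksum invalid: copy as-is, re-checksum
    dup = csum_index.get(ext.stored_csum)
    if dup is not None:
        dup.refs += 1
        return dup                    # duplicate found: share that extent
    if writable_mmaps:
        return Extent(ext.data)       # live writers: copy for the snapshot
    ext.refs += 1
    return ext                        # otherwise just bump the usage count
```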
Anyway, I only know enough of the internals to be dangerous.
Since the real goal of mmap is speed during actual update, this idea is
basically about amortizing the copy costs into the task of maintaining
the snapshots instead of leaving them in the immediate hands of the
time-critical updater.
A flush, munmap, or close by the user, or a system-wide sync event, are
also good points to expense the bookkeeping time.