On 10/17/2018 03:49 AM, Chris Murphy wrote:
On Tue, Oct 16, 2018 at 2:13 AM, Anand Jain <anand.j...@oracle.com> wrote:
On 10/14/2018 06:28 AM, Chris Murphy wrote:
Is it practical and desirable to make Btrfs based OS installation
images reproducible? Or is Btrfs simply too complex and
non-deterministic? [1]
The main three problems with Btrfs right now for reproducibility are:
a. many objects have uuids other than the volume uuid; and mkfs only
lets us set the volume uuid
b. atime, ctime, mtime, otime; and no way to make them all the same
c. non-deterministic allocation of file extents, compression, inode
assignment, logical and physical address allocation
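For point (b), a rough userspace sketch of what "make all times the same" could look like on a source tree before it is fed to mkfs. This is only an illustration, not an existing mkfs feature: the helper name normalize_times and the epoch value are made up here. It can only reach atime and mtime; ctime and otime are set by the kernel, which is exactly why some support in mkfs or the image tool would be needed.

```python
import os

# follow_symlinks=False is only passed where the platform supports it.
_NOFOLLOW = (
    {"follow_symlinks": False}
    if os.utime in os.supports_follow_symlinks
    else {}
)


def normalize_times(root, epoch=0):
    """Set atime and mtime of root and everything below it to `epoch`.

    Hypothetical helper: ctime and otime cannot be set from userspace,
    so this covers only part of problem (b).
    Returns the number of entries touched.
    """
    touched = 0
    # Bottom-up walk, so each directory is stamped after it has been read.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            os.utime(os.path.join(dirpath, name), (epoch, epoch), **_NOFOLLOW)
            touched += 1
    os.utime(root, (epoch, epoch))
    return touched + 1
```

Something like normalize_times("/path/to/rootdir", epoch=0) would be run on the staging directory; the path is a placeholder.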
I'm imagining reproducible image creation would be a mkfs feature that
builds on Btrfs seed and --rootdir concepts to constrain Btrfs
features to maybe make reproducible Btrfs volumes possible:
- No raid
- Either all objects needing uuids can have those uuids specified by
switch, or possibly a defined set of uuids expressly for this use
case, or possibly all of them can just be zeros (eek? not sure)
- A flag to set all times the same
- Possibly require that target block device is zero filled before
creation of the Btrfs
- Possibly disallow subvolumes and snapshots
- Require the resulting image is seed/ro and maybe also a new
compat_ro flag to enforce that such Btrfs file systems cannot be
modified after the fact.
- Enforce a consistent means of allocation and compression
The end result is creating two Btrfs volumes would yield image files
with matching hashes.
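The acceptance test implied here is simple to state in code. A minimal sketch, assuming SHA-256 as the hash (the thread does not name one) and with image_digest and images_match as made-up helper names:

```python
import hashlib


def image_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a whole image file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def images_match(image_a, image_b):
    """True if two independently built images are bit-for-bit identical."""
    return image_digest(image_a) == image_digest(image_b)
```

Any differing uuid, timestamp, or block address anywhere in the image makes images_match() return False, which is why all three problems above have to be solved together.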
If I had to guess, the biggest challenge would be allocation. But it's
also possible that such an image may have problems with "sprouts". A
non-removable sprout seems fairly straightforward and safe; but if a
"reproducible build" type of seed is removed, it seems like removal
needs to be smart enough to refresh *all* uuids found in the sprout: a
hard break from the seed.
Right. The seed fsid will be gone in a detached sprout.
I think already we get a new devid, volume uuid, and device uuid.
Yes on the sprout.
The open question is whether any other uuids need to be refreshed, such as
the chunk uuid, since that appears in every node and leaf.
There are quite a number of uuids.
Any thoughts? Useful? Difficult to implement?
Recently Nikolay sent a patch to change the fsid on a mounted btrfs. However,
reproducible builds also need neutralized uuids, times, and bytenr(s).
Furthermore, although the on-disk layout won't change without notice, the
block bytenr might.
Seems like the mkfs population method of such a seed could be made
very deterministic as to what the start logical address and physical
address are.
Can be. But it can change in future fixes as those aren't EXPORTED().
The vast majority of non-deterministic behavior comes
from the nature of kernel code having to handle so many complex inputs
and outputs, and negotiate them.
One question: why don't reproducible builds get the file data extents from the
image and stitch their hashes together to verify the result? There could also
be a vfs ioctl to import and export filesystem images, for better
supportability of use cases like reproducible builds.
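One possible reading of "stitch the hashes together", sketched in userspace. The assumptions are mine: it walks a mounted tree rather than reading extents out of the image, it uses SHA-256, and the names file_digest and content_digest are invented for the example. It covers file paths and file data only, so uuids, timestamps, and block addresses do not affect the result.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the raw SHA-256 digest of one file's data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.digest()


def content_digest(root):
    """Combine per-file digests, in sorted path order, into one hex digest."""
    rel_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Regular files only; symlinks and special files are skipped.
            if os.path.isfile(full) and not os.path.islink(full):
                rel_paths.append(os.path.relpath(full, root))

    top = hashlib.sha256()
    for rel in sorted(rel_paths):  # fixed order, independent of on-disk layout
        top.update(rel.encode("utf-8", "surrogateescape"))
        top.update(b"\0")
        top.update(file_digest(os.path.join(root, rel)))
    return top.hexdigest()
```

Two mounted images whose content_digest() values agree hold the same files with the same data, even if the raw image hashes differ.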
Perhaps. I don't know the reproducible build requirements very well:
whether all they really care about is the hash of the data extents, and
how important fs metadata really is.
That is important when it comes
to fuzzing file systems that have no metadata checksumming like
squashfs; of course you'd have to checksum the whole file system
image.
Another feature the mkfs variety of seed image would need is
deduplication. As far as I know, deduplication is kernel code only.
You'd want to be able to deduplicate, as well as compress, to have the
smallest distributed seed possible.
btrfs-image(8) already does compress.
I don't think mkfs is the right place to sanitize the uuid/fsid/time...
it should be when we generate the btrfs-image.
So a possible solution for the reproducible builds:
  Run the usual mkfs.btrfs on the device.
  Mount it and write the data.
  Unmount; create a btrfs-image with uuid/fsid/time sanitized; mark it
  as a seed (RO).
  Check/verify the hash of the image.
If the hashes match, to use this btrfs-image:
  Reset the seed (RO) flag; mount and use it;
  OR
  Mount the seed device; add a RW sprout; detach the seed;
  OR
  Don't set the RO flag at all (above) and just mount and use it.
Thanks, Anand
And mksquashfs does deduplication by default.