----- Forwarded message from Nico Williams <n...@cryptonector.com> -----
From: Nico Williams <n...@cryptonector.com>
Date: Sat, 21 May 2011 16:22:27 -0500
To: Crypto discussion list <cryptogra...@randombit.net>
Subject: Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
Reply-To: Crypto discussion list <cryptogra...@randombit.net>

On Sat, May 21, 2011 at 1:50 PM, Zooko O'Whielacronx <zo...@zooko.com> wrote:
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.

Me too. ZFS does do that, but unfortunately the internal Merkle tree
maintained this way also covers metadata tree nodes (namely, indirect
blocks), which wouldn't be a problem except that embedded in that
metadata (and hashed in) are block pointers, which contain data that is
completely non-deterministic relative to the file contents. The right
thing to have done would have been to exclude the address portion of the
block pointers, but that would have required more data movement and/or
computation (and perhaps it was not thought of at the time -- I was
never on the ZFS team, so I wouldn't know).

> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.

Right!

> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Indeed. And if non-deterministic metadata is excluded, then the only
things one would have to know in order to independently compute the same
hash would be the maximum breadth of the tree and the leaf block size,
and that the tree is intended to be kept with all but the rightmost leaf
node full. This would be a wonderful feature to have.
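To make the "canonical, independently computable" point concrete, here is a
minimal sketch of such a tree: fixed leaf block size, fixed maximum breadth,
all but the rightmost leaf full, and nothing but file data hashed in. The
parameters and function name are illustrative assumptions, not ZFS's actual
on-disk layout.

```python
import hashlib

LEAF_SIZE = 4  # bytes per leaf block (tiny, purely for demonstration)
FANOUT = 2     # maximum breadth of each interior node

def merkle_root(data: bytes) -> bytes:
    # Split into fixed-size leaves; only the rightmost one may be short.
    leaves = [data[i:i + LEAF_SIZE] for i in range(0, len(data), LEAF_SIZE)]
    level = [hashlib.sha256(leaf).digest() for leaf in leaves] or \
            [hashlib.sha256(b"").digest()]
    # Hash FANOUT children at a time until a single root remains.
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + FANOUT])).digest()
                 for i in range(0, len(level), FANOUT)]
    return level[0]

# Two parties hashing the same bytes get the same root, no matter where
# or how the data is physically stored.
assert merkle_root(b"hello world!") == merkle_root(b"hello " + b"world!")
```

Because the shape depends only on the data length and the two parameters,
any application that knows LEAF_SIZE and FANOUT can reproduce the root
without access to the filesystem's metadata.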
> This would enable applications—without needing any further
> in-filesystem code—to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

I haven't thought about this enough, but my level-0 concern would be
that data moves not aligned with the block size would require a rolling
hash in order to make synchronization efficient, thus the Merkle tree
wouldn't help unless, perhaps, it were constructed from rolling hash
functions.

The other thought I had was that now that we accept that file stores
need large amounts of computational power, not just I/O bandwidth, it's
not a stretch for the server to also compute and store (i.e.,
pre-compute) arrays of rolling CRCs, each taken over a small data chunk
contiguous with the preceding and succeeding ones. This would allow a
client to very quickly read the rsync signature of a remote file and
locally generate a patch for it, making the write process very efficient
for large files that get many small changes all over the place.

> Of course it is only more efficient because we're treating the
> maintenance of the secure-hash Merkle Tree as free. There are two
> senses in which this is legitimate and it is almost free:

And/or because we intend to use this for optimizing an otherwise slow
operation, and if we do it enough then it'd be worthwhile even if the
given FS had never before used a Merkle tree.

> [...]
> 2. a. From what I've read, nobody uses the SHA-256 configuration in
> ZFS because it is too computationally expensive, so they use an
> insecure checksum (fletcher2/4) instead.

You kinda have to use SHA-256 if you want dedup.

> 2. b. I assume the shape of the resulting checksum tree is modified by
> artifacts of the ZFS layout instead of being a simple canonical shape.
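The rolling checksum the server would pre-compute per chunk can be sketched
as follows. This is the rsync-style weak checksum (an Adler-32 variant); the
function names, window size, and sample data are illustrative assumptions.

```python
MOD = 1 << 16

def weak_checksum(block: bytes) -> int:
    """rsync-style weak checksum: s1 = sum of bytes, s2 = sum of the
    running sums, packed into 32 bits."""
    s1 = s2 = 0
    for b in block:
        s1 = (s1 + b) % MOD
        s2 = (s2 + s1) % MOD
    return (s2 << 16) | s1

def roll(checksum: int, out_byte: int, in_byte: int, length: int) -> int:
    """Slide the window one byte (drop out_byte, append in_byte) in O(1),
    which is what makes scanning at every offset affordable."""
    s1 = checksum & 0xFFFF
    s2 = checksum >> 16
    s1 = (s1 - out_byte + in_byte) % MOD
    s2 = (s2 - length * out_byte + s1) % MOD
    return (s2 << 16) | s1

data = b"the quick brown fox jumps over the lazy dog"
n = 16
c = weak_checksum(data[0:n])
# Rolling from window [0, n) to [1, n+1) matches a fresh computation.
c = roll(c, data[0], data[n], n)
assert c == weak_checksum(data[1:n + 1])
```

A server that stores these per-chunk checksums has, in effect, already
materialized the rsync signature a client would otherwise have to request.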
Yes:

a) the shape of the indirect block tree (but this is deterministic if
you know the maximum block size for the filesystem and the maximum
block size for the file), and

b) physical block address information in indirect blocks (which can and
should have been excluded -- it's good enough to compute the hash of an
indirect block over the block checksums in each block pointer).

See above. Oh, and there are the first few block pointers in the dnode,
too.

> This is a show-stopper for this use case because if the same file data
> exists on a different system, and some software on that system
> computes a Merkle Tree over the data, it might come out with different
> hashes than the ZFS checksum tree, thus eliminating all of the
> performance benefits of this approach.

Indeed, but:

> But, if ZFS could be modified to fix these problems or if a new
> filesystem would add a feature of maintaining a canonical,
> reproducible Merkle Tree, then it might be extremely useful.

I think it could be. I don't think you'll get much out of the ZFS folks
at Oracle nowadays, though. But if you talk to ZFS folks outside of
Oracle, they might confirm whether this is doable.

BTW, IIRC I filed an RFE at Sun to expose a Merkle hash tree checksum
to applications. I believe I've seen others ask for this at various
times too, and it may be that the RFE I filed (if I did) was in
response to a request on one of the OSOL discuss lists -- this must
have been back in 2005. This is a really, really obvious enhancement to
want -- it should be obvious the moment you realize that Merkle hash
trees are a godsend.
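Point (a) can be illustrated with a few lines: once the data block size and
the indirect-block fan-out are fixed, the depth of the indirect tree is a
pure function of the file length, so any two systems agreeing on those
parameters agree on the shape. The constants are assumptions in the spirit
of ZFS's layout, not its exact on-disk values.

```python
import math

BLKPTR_SIZE = 128                        # bytes per block pointer
INDIRECT_BLOCK = 16384                   # bytes per indirect block (assumed)
FANOUT = INDIRECT_BLOCK // BLKPTR_SIZE   # pointers per indirect block (128)

def tree_depth(file_size: int, data_block: int = 131072) -> int:
    """Levels of indirect blocks needed above the data blocks."""
    nblocks = max(1, math.ceil(file_size / data_block))
    depth = 0
    while nblocks > 1:
        nblocks = math.ceil(nblocks / FANOUT)
        depth += 1
    return depth

assert tree_depth(131072) == 0        # one data block, no indirection
assert tree_depth(128 * 131072) == 1  # 128 blocks fit one indirect level
assert tree_depth(129 * 131072) == 2  # one block over: a second level
```

Only point (b), the physical addresses hashed into those indirect blocks, is
non-deterministic; the shape itself is reproducible.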
Nico
--
_______________________________________________
cryptography mailing list
cryptogra...@randombit.net
http://lists.randombit.net/mailman/listinfo/cryptography

----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss