On Tue, May 11, 2010 at 04:15:24AM -0700, Bertrand Augereau wrote:
> Is there a O(nb_blocks_for_the_file) solution, then?
>
> I know O(nb_blocks_for_the_file) == O(nb_bytes_in_the_file), from Mr.
> Landau's POV, but I'm quite interested in a good constant factor.
If you were considering the hashes of each zfs block as a precomputed value, it might be tempting to think of getting all of these and hashing them together. You could thereby avoid reading the file data, and the file metadata containing the hashes is something you'd have needed to read anyway. This would seem to be appealing, eliminating seeks and cpu work.

However, there are some issues that make the approach basically infeasible, and unreliable for comparing the results of two otherwise identical files.

First, you're assuming there's an easy interface to get the stored hashes of a block, which there isn't. Even if we ignore that for a moment, the hashes zfs records depend on factors other than just the file content, including the way the file has been written over time.

The blocks of the file may not be of constant size; a file that grew slowly may have different hashes from a copy of it, or from one extracted from an archive in a fast stream. Filesystem properties, including checksum (obvious), dedup (which implies checksum), compress (which changes the written data and can make holes), blocksize and maybe others, may differ between filesystems or even change over the time a file has been written, and again change the results and defeat comparisons. These things can defeat zfs's dedup too, even though it does have access to the block-level checksums.

If you're going to do an application-level dedup, you want to utilise the advantage of being independent of these things - or even of the underlying filesystem at all (e.g. dedup between two NAS shares).

Something similar would be useful, and much more readily achievable, from ZFS for such an application, and many others. Rather than a way to compare reliably between two files for identity, I'd like a way to compare the identity of a single file between two points in time. If my application can tell quickly that the file content is unaltered since the last time I saw the file, I can avoid rehashing the content and use a stored value.
If I can achieve this result for a whole directory tree, even better.

--
Dan.
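For illustration only, the "use a stored value unless the file has changed" idea can be approximated at the application level today. The sketch below is a minimal one, and it assumes that an unchanged set of stat() fields (device, inode, size, mtime) means unchanged content. That assumption is only a heuristic: mtime can be reset and a rewrite can keep the same size, which is exactly why a guarantee from the filesystem would be better. The names `file_signature`, `sha256_of_file` and `cached_hash` are hypothetical helpers, not part of ZFS or any existing tool.

```python
import hashlib
import os


def file_signature(path):
    """Cheap change indicator built from stat() fields only (no data reads).

    Assumption: identical fields imply identical content. This is a
    heuristic, not a guarantee.
    """
    st = os.stat(path)
    return [st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns]


def sha256_of_file(path, chunk_size=1 << 20):
    """Full content hash; this is the expensive step we want to skip."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_hash(path, cache):
    """Return (hash, reused); reused is True when the stored value was used.

    `cache` is a plain dict mapping absolute path to the signature and
    hash recorded the last time the file was seen.
    """
    key = os.path.abspath(path)
    signature = file_signature(path)
    entry = cache.get(key)
    if entry is not None and entry["signature"] == signature:
        return entry["hash"], True
    content_hash = sha256_of_file(path)
    cache[key] = {"signature": signature, "hash": content_hash}
    return content_hash, False
```

Calling `cached_hash(path, cache)` for each path yielded by `os.walk()` would extend the same check to a whole directory tree, at the cost of one stat() per file rather than a full read.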
_______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss