> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bertrand Augereau
>
> is there a way to compute very quickly some hash of a file in a zfs?
> As I understand it, everything is signed in the filesystem, so I'm
> wondering if I can avoid reading whole files with md5sum just to get a
> unique hash. Seems very redundant to me :)
If I understand right: although ZFS does checksum every block, those checksums don't correlate to hashes of files, for several reasons:

- Block boundaries don't line up with file boundaries. A small file may occupy only part of a block, and a large file is spread across many blocks, so no single block checksum corresponds to "the file."
- Blocks also contain non-file information (metadata), which is checksummed too.
- If you have compression enabled, block checksums are even less relevant to file hashes, because as far as I know ZFS checksums the compressed data, not the uncompressed data.
- Building file hashes out of block hashes is even more convoluted, because in general you can't compute hash(A+B) from hash(A) and hash(B) alone, although some algorithms allow something close to it (see the first sketch at the end of this message).

My advice would be: computing hashes is not very expensive, as long as you're only hashing data you were going to read for other reasons anyway. Specifically, I benchmarked several hash algorithms a while back and found ... I forget which ... that either adler32 or crc takes almost zero time to compute; that is, the CPU was very lightly utilized while hashing blocks at full disk speed. The weakness of adler32 and crc is that they're not cryptographic hashes: if a malicious person wants to corrupt a data stream while preserving the checksum, it's not difficult to do. They're fine as long as you can safely assume no malice. md5 is noticeably slower (though not as much slower as you might expect), and it is a cryptographic hash; probably not necessary for your needs. (A rough timing sketch is at the end of this message.)

And one more thing: no matter how strong your hash is, unless the hash is as large as the file itself, collisions can happen. If you care about your data, don't assume two files are identical just because their hashes match; always byte-level verify every block or file whose hash matches some other hash (the last sketch below shows one way to do that).
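To make the hash(A+B) point concrete, here is a minimal Python sketch (the block contents are made up for illustration; this doesn't touch ZFS at all). Feeding the blocks in order through one md5 object gives the same digest as hashing the concatenation, but the two finished digests md5(A) and md5(B) by themselves are not enough to reconstruct md5(A+B). Checksums like adler32 and crc32 can at least be carried forward as a running value, and zlib's C library even has a crc32_combine() that works from the two finished CRCs plus the length of B, though Python's standard library doesn't expose it.

    import hashlib
    import zlib

    # Two made-up "blocks" standing in for consecutive pieces of one file.
    a = b"first block of the file " * 1000
    b = b"second block of the file " * 1000

    # Streaming works: feeding the blocks in order into ONE md5 object
    # gives the same digest as hashing the concatenation in one shot.
    streamed = hashlib.md5()
    streamed.update(a)
    streamed.update(b)
    assert streamed.hexdigest() == hashlib.md5(a + b).hexdigest()

    # But the finished digests md5(a) and md5(b), on their own, don't let
    # you reconstruct md5(a + b) -- the internal running state is gone.
    print(hashlib.md5(a).hexdigest(), hashlib.md5(b).hexdigest())
    print(hashlib.md5(a + b).hexdigest())

    # adler32 and crc32 can at least be carried forward as a running value
    # (you still need the raw bytes of the second block, though):
    assert zlib.adler32(b, zlib.adler32(a)) == zlib.adler32(a + b)
    assert zlib.crc32(b, zlib.crc32(a)) == zlib.crc32(a + b)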
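On the relative cost of the algorithms, here is a rough timing sketch; it is not my original benchmark (which I no longer have), and the 64 MB in-memory buffer is just an assumption to keep disk speed out of the measurement. The standard-library zlib.adler32, zlib.crc32, and hashlib.md5 stand in for whatever implementations your application would actually use.

    import hashlib
    import os
    import time
    import zlib

    data = os.urandom(64 * 1024 * 1024)   # 64 MB of random data; adjust to taste

    def timed(label, fn):
        start = time.perf_counter()
        fn(data)
        print("%8s: %.3f s" % (label, time.perf_counter() - start))

    timed("adler32", zlib.adler32)
    timed("crc32", zlib.crc32)
    timed("md5", lambda d: hashlib.md5(d).digest())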
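And to illustrate the last point, treating a matching hash as a hint rather than proof, here is a small sketch of the kind of check I mean. The function names and the chunked comparison are mine, purely for illustration: two files are only reported as duplicates if their sizes match, their md5 digests match, and a byte-for-byte comparison confirms it.

    import hashlib
    import os

    def file_md5(path, chunk_size=1 << 20):
        """Hash a file in chunks so the whole file never sits in memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def files_identical(path_a, path_b, chunk_size=1 << 20):
        """Byte-for-byte comparison; only called once the cheap checks pass."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                ca, cb = fa.read(chunk_size), fb.read(chunk_size)
                if ca != cb:
                    return False
                if not ca:  # both files ended at the same point
                    return True

    def duplicates(path_a, path_b):
        # Cheap checks first: size, then hash. A matching hash is only a
        # hint; the final word comes from the byte-level comparison.
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        if file_md5(path_a) != file_md5(path_b):
            return False
        return files_identical(path_a, path_b)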