On Thu, 2010-07-08 at 18:46 -0400, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Bertrand Augereau
> > 
> > is there a way to quickly compute some hash of a file in a zfs?
> > As I understand it, everything is signed in the filesystem, so I'm
> > wondering if I can avoid reading whole files with md5sum just to get a
> > unique hash. Seems very redundant to me :)
> 
> If I understand right:
> 
> Although zfs is calculating hashes of blocks, those block hashes don't
> translate into hashes of files, for several reasons:
> 
> Block boundaries are not well aligned with file boundaries.  A single block
> might encapsulate several small files, or a file might start in the middle
> of a block, span several more, and end in the middle of another block.
> 
> Blocks also contain non-file information.
> 
> Block hashes are even less useful for deriving file hashes if you have
> compression enabled, because I think ZFS hashes the compressed on-disk
> data, not the uncompressed data.
> 
> If you want to build file hashes out of block hashes, it's even more
> convoluted, because you can't generally compute hash(A+B) based on
> hash(A) and hash(B) (although perhaps you can for some algorithms).
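
A tiny Python sketch, just to make that concrete: feeding the pieces into
one running hash is fine, but there is no general way to merge two finished
digests into the digest of the concatenation.

    import hashlib

    part_a = b"first block of data"
    part_b = b"second block of data"

    # Digest of the concatenation...
    whole = hashlib.sha256(part_a + part_b).hexdigest()

    # ...equals feeding the pieces into one running hash object:
    running = hashlib.sha256()
    running.update(part_a)
    running.update(part_b)
    assert running.hexdigest() == whole

    # But the two standalone digests below cannot, in general, be
    # combined to reproduce `whole` without re-reading the data.
    digest_a = hashlib.sha256(part_a).hexdigest()
    digest_b = hashlib.sha256(part_b).hexdigest()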
> 
> My advice would be:
> 
> Computing hashes is not very expensive, as long as you're just computing
> hashes for data that you were going to handle for other reasons anyway.
> Specifically, I benchmarked several hash algorithms a while back and found
> that one of them (I forget which; either adler32 or crc) takes almost no
> time to compute; that is, the CPU was very lightly utilized while hashing
> blocks at maximum disk speed.
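
Anyone who wants to repeat that kind of comparison can do it with a rough
Python timing sketch along these lines; the 256 MB buffer and the exact
numbers are of course machine-dependent.

    import hashlib, time, zlib

    data = b"\0" * (256 * 1024 * 1024)   # 256 MB of throwaway input

    def bench(name, fn):
        start = time.perf_counter()
        fn(data)
        print("%-8s %6.3f s" % (name, time.perf_counter() - start))

    bench("adler32", zlib.adler32)
    bench("crc32",   zlib.crc32)
    bench("md5",     lambda d: hashlib.md5(d).digest())
    bench("sha256",  lambda d: hashlib.sha256(d).digest())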
> 
> The weakness of adler32 and crc is that they're not cryptographic hashes.
> If a malicious person wants to corrupt a data stream while preserving the
> hash, it's not difficult to do.  adler32 and crc are good as long as you can
> safely assume no malice.
> 
> md5 is slower (though surprisingly not by much), and it's a cryptographic
> hash.  Probably not necessary for your needs.
> 
> And one more thing.  No matter how strong your hash is, unless the hash is
> as big as the file itself, collisions can happen.  Don't assume two files
> are identical just because their hashes match, if you care about your data.
> Always byte-level verify every block or file whose hash matches some other
> hash.
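
In script form that last rule is cheap to follow; a sketch, assuming the two
hashes have already been computed somehow:

    import filecmp

    def really_identical(path_a, hash_a, path_b, hash_b):
        # Different hashes: definitely different contents.
        if hash_a != hash_b:
            return False
        # Matching hashes: don't trust them blindly; compare the bytes.
        return filecmp.cmp(path_a, path_b, shallow=False)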

MD5 hashing is not recommended for "cryptographically strong" hashing
anymore.  SHA256 is the current recommendation I would make (the state
of the art changes over time).
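
If you do end up hashing whole files yourself, SHA256 is close to a
one-liner in most environments; for example, in Python, reading in chunks
so a large file need not fit in memory:

    import hashlib

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()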

The caution about collisions happening is relevant, but with a suitably
strong hash, the risk is close enough to zero that normal people don't
care.

By that, I mean that the chance of any two given files colliding under a
256-bit hash is on the order of 1 in 2^256.  You're probably more likely to
spontaneously combust (by an order of magnitude) than you are to have two
files that "accidentally" (or even maliciously) reduce to the same hash.

When the probability of the Sun going supernova in the next 30 seconds
exceeds the probability of a cryptographic hash collision, I don't worry
about the collision anymore. :-)

        - Garrett

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
