On Fri, Feb 06, 2015 at 02:00:53PM +0100, Lutz Vieweg wrote:
> On 02/06/2015 06:20 AM, Qu Wenruo wrote:
> > From: Lutz Vieweg <l...@5t9.de>
> >> Use case: you have two huge files on a btrfs, you assume they contain the
> >> same bytes, but you do not know for sure.
> >>
> >> Is there a way to get a checksum of both files from btrfs with less effort
> >> than reading the whole of both files and computing a hash sum?
> > In short: no.
> >
> > In detail:
> > In the current implementation, btrfs computes a 4-byte (32-bit) crc32 for
> > each 4K sector and stores it in the csum tree.
> >
> > So for a large file, e.g. 1G (already quite small for modern storage), the
> > checksums will be 1M in size.
> > That means even if you use crc32 (the same as the kernel, and given that
> > crc32(a+b) = crc32(a) + crc32(b)), you still need to run crc32 over all
> > 1M of checksums.
> 
> And yet, having to read only 1 MB checksums instead of 1 GB data sounds
> like a good deal - is there some userspace interface allowing to read
> (only) those per-4k checksums for a file?

Here is some proof-of-concept code showing how to get the csum for a given
block (based on the SEARCH ioctl, needs root):

http://repo.or.cz/w/btrfs-progs-unstable/devel.git/commit/33a4d171552736da2977323797f53d9cea830e2f

crc32 is weak, but it can be used to detect earlier that the files are
different. A hash collision in the middle of huge files is possible, but
I guess the probability is very low.
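
A minimal sketch of the comparison idea, under loud assumptions: it does not
read the btrfs csum tree at all. block_csums() is a hypothetical stand-in that
computes plain CRC-32 with Python's zlib over in-memory data, only to produce
per-4K checksum lists of the shape discussed above. The helper names are made
up for illustration. It shows the size arithmetic (1G of data, 1M of checksums)
and the early exit at the first differing block.

```python
import zlib

SECTOR_SIZE = 4096  # bytes covered by one checksum, as stated in the thread
CSUM_SIZE = 4       # bytes per 32-bit checksum


def csum_bytes(file_size):
    """Total size in bytes of the per-sector checksums for a file."""
    sectors = -(-file_size // SECTOR_SIZE)  # ceiling division
    return sectors * CSUM_SIZE


def block_csums(data):
    """Hypothetical stand-in for checksums fetched from the csum tree.

    Uses zlib's plain CRC-32 over each 4K slice of an in-memory buffer.
    """
    return [
        zlib.crc32(data[off:off + SECTOR_SIZE])
        for off in range(0, len(data), SECTOR_SIZE)
    ]


def first_mismatch(csums_a, csums_b):
    """Return the index of the first differing block, or None if none differ."""
    for index, (a, b) in enumerate(zip(csums_a, csums_b)):
        if a != b:
            return index  # early exit: the files are known to differ here
    if len(csums_a) != len(csums_b):
        # One list is a prefix of the other, so the lengths differ.
        return min(len(csums_a), len(csums_b))
    return None
```

Note that a result of None only means that no checksum differs. Because of
possible collisions it does not prove the files are identical, so a full
comparison is still needed if certainty is required.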