On Fri, Feb 06, 2015 at 02:00:53PM +0100, Lutz Vieweg wrote:
> On 02/06/2015 06:20 AM, Qu Wenruo wrote:
> > From: Lutz Vieweg <l...@5t9.de>
> >> use case: You have two huge files on a btrfs, you assume they contain the
> >> same bytes, but you do not know for sure.
> >>
> >> Is there a way to get a checksum of both files from btrfs with less effort
> >> than reading the whole of both files and computing a hash sum?
> > For short, NO.
> >
> > For long:
> > For current implement, btrfs use calculate 4K sector into 4bytes(32bit)
> > crc32 and restore it into csum tree.
> >
> > So, for large files, e.g. 1G(already quite small for modern storage), its
> > checksum will be 1M in size.
> > Which means even using crc32 (same as kernel and crc32(a+b) = crc32(a) +
> > crc32(b)), you still needs to do crc32 on the all 1M crc32.
>
> And yet, having to read only 1 MB checksums instead of 1 GB data sounds
> like a good deal - is there some userspace interface allowing to read
> (only) those per-4k checksums for a file?
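
To put numbers on the estimate quoted above, here is a small sketch in Python. It assumes one 4-byte checksum per 4K sector, as described in the quoted text; the constant and function names are made up for illustration and are not btrfs identifiers.

```python
# Assumptions taken from the quoted description, not read from any filesystem:
SECTOR_SIZE = 4096  # bytes of file data covered by one checksum
CSUM_SIZE = 4       # bytes per 32-bit checksum


def csum_bytes(file_size: int) -> int:
    """Return how many bytes of checksums cover file_size bytes of data."""
    # Ceiling division: a partially filled last sector still needs a checksum.
    sectors = -(-file_size // SECTOR_SIZE)
    return sectors * CSUM_SIZE


if __name__ == "__main__":
    one_gib = 1 << 30
    print(csum_bytes(one_gib))             # bytes of checksums for 1 GiB of data
    print(one_gib // csum_bytes(one_gib))  # data-to-checksum size ratio
```

So the checksums are 1024 times smaller than the data they cover, which is where the "1 MB instead of 1 GB" figure comes from.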

Just PoC code showing how to get the csum for a given block (based on the
SEARCH ioctl, needs root):

http://repo.or.cz/w/btrfs-progs-unstable/devel.git/commit/33a4d171552736da2977323797f53d9cea830e2f

crc32 is weak, but it can be used to detect early(-ier) that the files
differ. A hash collision in the middle of huge files is possible, but I'd
guess the probability is very low.
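
For illustration only, a sketch in Python of the comparison step: given two sequences of per-block checksums, find the first block where they differ. It does not use the SEARCH ioctl and is not the PoC linked above. It computes the checksums itself from in-memory data, so it only shows the logic. CRC-32C (Castagnoli) is used as the checksum; the helper names (crc32c, block_csums, first_mismatch) are hypothetical, and the values are not claimed to match the on-disk csum items byte for byte.

```python
from itertools import zip_longest


def _make_table():
    """Build the byte-wise lookup table for CRC-32C (reflected poly 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C with the usual initial and final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_csums(data: bytes, block_size: int = 4096):
    """Yield one checksum per block; a short final block is hashed as-is
    (a simplification, no padding to a full sector is done here)."""
    for off in range(0, len(data), block_size):
        yield crc32c(data[off:off + block_size])


def first_mismatch(csums_a, csums_b):
    """Return the index of the first differing block, or None if all match.
    A missing block (different lengths) counts as a mismatch."""
    for index, (a, b) in enumerate(zip_longest(csums_a, csums_b)):
        if a != b:
            return index
    return None


if __name__ == "__main__":
    a = bytes(3 * 4096)
    b = bytearray(a)
    b[2 * 4096 + 5] = 1  # flip one byte in the third block
    print(first_mismatch(block_csums(a), block_csums(bytes(b))))
```

A mismatch proves the files differ. A full match only makes equality likely, because of the collision caveat above, so a byte-wise compare is still needed to be sure.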