On 9/5/2016 6:23 PM, Ivan Zhakov wrote:
> With all above the new behavior should be working better or the same
> in all cases. I agree that 50% approximation may be incorrect for some
> specific binary formats (case 6) like sqlite db.

To be fair, I'd argue that for binary file modifications the approximation is quite far off. Most binary formats (if not all) in our repository differ within the first couple of bytes if they were changed. It therefore makes a significant difference whether we read the full contents of a single file (which might be >100MB) or just the first few bytes of two files.

As Bert already suggested, it is quite a common design pattern for binary formats to keep a checksum, time stamp, counter value, file size record, etc. at the beginning of the file, and that data is likely to differ if the file has changed. If you then take the size difference between text files and binary files into account (i.e. text files usually being quite small, binary files usually being quite large), the early difference expected in the binary file comparison case has the potential to matter quite a lot.
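
To make the point concrete, here is a minimal sketch in Python (purely illustrative: `files_differ`, `demo` and the 4096-byte block size are assumptions for this example, not Subversion code) of a block-wise comparison that stops at the first differing block. For two files that differ in their first bytes, only one block per file is read, regardless of file size.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size for illustration, not taken from Subversion


def files_differ(path_a, path_b, block_size=BLOCK_SIZE):
    """Return (differ, bytes_read_per_file), stopping at the first differing block."""
    bytes_read = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            bytes_read += max(len(block_a), len(block_b))
            if block_a != block_b:
                return True, bytes_read
            if not block_a:  # both files ended at the same point
                return False, bytes_read


def demo():
    """Compare two 1 MiB files that differ only in their very first byte."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "base.bin")
        work = os.path.join(tmp, "work.bin")
        payload = bytes(1024 * 1024)
        with open(base, "wb") as f:
            f.write(payload)
        with open(work, "wb") as f:
            f.write(b"\x01" + payload[1:])
        return files_differ(base, work)
```

In this sketch `demo()` reports a difference after reading a single block from each file, whereas hashing the working file would have to read all of it.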

FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k block and another for the full file) therefore sounds like a reasonable suggestion.
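
A rough sketch of how such a two-checksum check could look, in Python for brevity. All names here (`head_and_full_sha1`, `head_sha1`, `is_modified`) and the way the recorded checksums are passed in are hypothetical; this is not how Subversion stores or compares its pristine checksums, just an illustration of the idea.

```python
import hashlib

HEAD_SIZE = 4096  # "first 4k block" from the suggestion above


def head_and_full_sha1(path, head_size=HEAD_SIZE, chunk_size=1024 * 1024):
    """Compute SHA-1 of the first head_size bytes and of the whole file in one pass."""
    head = hashlib.sha1()
    full = hashlib.sha1()
    remaining_head = head_size
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            full.update(chunk)
            if remaining_head > 0:
                part = chunk[:remaining_head]
                head.update(part)
                remaining_head -= len(part)
    return head.hexdigest(), full.hexdigest()


def head_sha1(path, head_size=HEAD_SIZE):
    """Compute SHA-1 of only the first head_size bytes (reads a single block)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(head_size)).hexdigest()


def is_modified(path, recorded_head, recorded_full):
    """Cheap check on the first block; fall back to the full checksum if it matches."""
    if head_sha1(path) != recorded_head:
        return True  # change detected after reading one block
    return head_and_full_sha1(path)[1] != recorded_full
```

With this layout, a file whose header changed is detected after reading 4 KiB, and the full read is only paid when the first block is unchanged.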

Last but not least, the throughput of calculating the SHA-1 is in practice also restricted by the I/O throughput. I'd assume it's not unlikely for working directories to still reside on an HDD (rather than an SSD or some faster cache), so it'd be limited to around 20 MB/s in practice. For large binary files this might make a significant difference in certain (not uncommon) use cases.
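
As a back-of-the-envelope calculation with the figures above (a 100 MB file, around 20 MB/s), sketched in Python. The helper `read_seconds` is made up for this example, and the model is deliberately simplistic: it ignores seek latency and OS caching and only shows the order of magnitude.

```python
def read_seconds(num_bytes, throughput_mb_per_s=20.0):
    """Estimated time to read num_bytes sequentially at the given throughput."""
    return num_bytes / (throughput_mb_per_s * 1_000_000)


# Hashing one 100 MB working file means reading all of it.
full_file = read_seconds(100 * 1_000_000)

# Comparing only the first 4k block of two files reads 8192 bytes in total.
first_blocks = read_seconds(2 * 4096)

print(f"full read: {full_file:.1f} s, first blocks: {first_blocks * 1000:.2f} ms")
```

Under these assumptions the full read takes on the order of seconds per file, while reading the two first blocks is a matter of transfer time well under a millisecond (in reality dominated by seek time, which the sketch leaves out).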

Don't get me wrong: I agree that the SHA-1 approach is superior (especially on Windows machines, since it reduces the cases where two files have to be opened; think of the impact of anti-virus scanners). I just share Bert's opinion that the approach should be improved a bit, especially in light of binary file support.

If it would be of any help, I could do some performance measurements with the two approaches on our repository to get some real world numbers to work with.

--
Regards,
Stefan Hett
