Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
On Mon, Oct 30, 2017 at 10:59:41AM -0700, Jeff King wrote: > > There's also https://www.strchr.com/hash_functions, which lists DJB2 > > as Bernstein. Both functions rank somewhere in the middle of that list. > > FWIW, I did some experiments with Murmur3 and SipHash a while back, but > I don't think I came up with anything faster than the existing code. > OTOH, moving to SipHash gives us the ability to randomize the hashes, > which can resist some DoS attacks (although as I've said before, > computing arbitrary diffs for untrusted strangers is pretty much a > DoS-in-a-box). By the way, one of the things that complicates plugging new functions into xdiff's hashing is that xdl_hash_record() simultaneously computes the hash _and_ finds the end-of-line marker. So the "siphash is only 10% slower" number I showed came with quite a few contortions to do both. Here it is compared to a more naive application of the siphash code (i.e., memchr to find end-of-line, and then feed the resulting bytes to siphash): Test originHEAD^ jk/xdl-siphash-wip --- 4000.1: log -3000 (baseline) 0.05(0.05+0.00) 0.05(0.05+0.00) +0.0% 0.05(0.05+0.00) +0.0% 4000.2: log --raw -3000 (tree-only) 0.31(0.27+0.03) 0.31(0.27+0.03) +0.0% 0.31(0.27+0.03) +0.0% 4000.3: log -p -3000 (Myers) 2.06(2.01+0.05) 2.30(2.21+0.09) +11.7% 2.96(2.91+0.04) +43.7% 4000.4: log -p -3000 --histogram 2.44(2.38+0.06) 2.67(2.60+0.07) +9.4% 3.32(3.26+0.06) +36.1% 4000.5: log -p -3000 --patience 2.57(2.47+0.09) 2.90(2.82+0.08) +12.8% 3.48(3.43+0.05) +35.4% There "origin" is the existing djb hash, "HEAD^" is the complicated "fast" siphash (which I very well may have screwed up), and the final is the more naive version, which is quite a bit slower. -Peff
Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
On Thu, Oct 26, 2017 at 07:43:19PM +0200, René Scharfe wrote: > Am 25.10.2017 um 20:49 schrieb Stefan Beller: > > The implementations in diff.c to detect moved lines needs to compare > > strings and hash strings, which is implemented in that file, as well > > as in the xdiff library. > > > > Remove the rather recent implementation in diff.c and rely on the well > > exercised code in the xdiff lib. > > > > With this change the hash used for bucketing the strings for the moved > > line detection changes from FNV32 (that is provided via the hashmaps > > memhash) to DJB2 (which is used internally in xdiff). Benchmarks found > > on the web[1] do not indicate that these hashes are different in > > performance for readable strings. > > > > [1] > > https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed > > Awesome comparison! It calls the variant used in libxdiff DJB2a (which > uses xor to mix in the new char) instead of DJB2 (which uses addition). > > There's also https://www.strchr.com/hash_functions, which lists DJB2 > as Bernstein. Both functions rank somewhere in the middle of that list. FWIW, I did some experiments with Murmur3 and SipHash a while back, but I don't think I came up with anything faster than the existing code. OTOH, moving to SipHash gives us the ability to randomize the hashes, which can resist some DoS attacks (although as I've said before, computing arbitrary diffs for untrusted strangers is pretty much a DoS-in-a-box). Anyway, I rebased them and ran p4000[1], with does show some results: Test originjk/xdl-murmur3-wip jk/xdl-siphash-wip --- 4000.1: log -3000 (baseline) 0.05(0.05+0.00) 0.05(0.05+0.00) +0.0% 0.05(0.05+0.00) +0.0% 4000.2: log --raw -3000 (tree-only) 0.30(0.28+0.02) 0.30(0.28+0.01) +0.0% 0.30(0.28+0.02) +0.0% 4000.3: log -p -3000 (Myers) 2.06(1.98+0.08) 1.90(1.85+0.05) -7.8% 2.32(2.25+0.07) +12.6% 4000.4: log -p -3000 --histogram 2.50(2.42+0.07) 2.25(2.21+0.04) -10.0% 2.70(2.62+0.08) +8.0% 4000.5: log -p -3000 --patience 2.58(2.52+0.06) 2.47(2.42+0.04) -4.3% 2.86(2.77+0.08) +10.9% So it looks like murmur3 is indeed a little faster. SipHash is slower, which is too bad (because the randomization would be nice). I _think_ back then I compared to XDL_FAST_HASH, which was a little faster even than murmur3 (but generated too many collisions, which led to bad behavior for certain cases). Anyway, those branches are at https://github.com/peff/git if anybody wants to look further. They probably need some cleanup. At the very least, we'd probably need to teach the whitespace-ignoring hash function to use the same algorithm. -Peff [1] Actually, I added "--no-merges" to each command in p4000. It seems silly as it is, since the point is to compute a bunch of diffs.
Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
Am 25.10.2017 um 20:49 schrieb Stefan Beller: > The implementations in diff.c to detect moved lines needs to compare > strings and hash strings, which is implemented in that file, as well > as in the xdiff library. > > Remove the rather recent implementation in diff.c and rely on the well > exercised code in the xdiff lib. > > With this change the hash used for bucketing the strings for the moved > line detection changes from FNV32 (that is provided via the hashmaps > memhash) to DJB2 (which is used internally in xdiff). Benchmarks found > on the web[1] do not indicate that these hashes are different in > performance for readable strings. > > [1] > https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed Awesome comparison! It calls the variant used in libxdiff DJB2a (which uses xor to mix in the new char) instead of DJB2 (which uses addition). There's also https://www.strchr.com/hash_functions, which lists DJB2 as Bernstein. Both functions rank somewhere in the middle of that list. > Signed-off-by: Stefan Beller > --- > diff.c | 82 > -- > 1 file changed, 4 insertions(+), 78 deletions(-) Very nice. René
Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
Stefan Beller writes: > The implementations in diff.c to detect moved lines needs to compare > strings and hash strings, which is implemented in that file, as well > as in the xdiff library. > > Remove the rather recent implementation in diff.c and rely on the well > exercised code in the xdiff lib. > ... > static int moved_entry_cmp(const struct diff_options *diffopt, > const struct moved_entry *a, > const struct moved_entry *b, > const void *keydata) > { > - const char *ap = a->es->line, *ae = a->es->line + a->es->len; > - const char *bp = b->es->line, *be = b->es->line + b->es->len; > - ... > + return !xdiff_compare_lines(a->es->line, a->es->len, > + b->es->line, b->es->len, > + diffopt->xdl_opts); > } OK, xdiff's xdl_recmatch() takes counted strings, and line[len] is one byte beyond the end of the line (where LF typically sits), which is the same convention as how "emitted_symbol" represents a line, so not just the implementation replaced with a known-working one, but also the code calls the helper with correct calling convention ;-) > - ret->ent.hash = get_string_hash(l, o); > + ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts); Likewise. Looks good. Will queue.