On Fri, Jun 28 2019, Ævar Arnfjörð Bjarmason wrote:
> A non-RFC since it seem people like this approach.
>
> This should fix the test failure noted by Johannes, there's two new
> patches at the start of this series. They address a bug that was there
> for a long time, but I happened to trip over since PCRE is more strict
> about UTF-8 validation than kwset (which doesn't care at all).
>
> I also added performance numbers to the relevant commit messages, took
> brian's suggestion of saying "NUL-byte" instead of "\0", and did some
> other copyediting of my own.
>
> The rest of the code changes are all just comments & rewording of
> previously added comments.
Junio. I thought I'd submit this in before your merge to "next", but I
see that happened. Are you OK with rewinding it for this (& maybe
something else) or should I submit a v3 rebased on "next"?
I'd really prefer the improved commit messages with performance numbers,
and thought I'd have time to work on those details since it was an
RFC/PATCH :)
> Ævar Arnfjörð Bjarmason (9):
> log tests: test regex backends in "--encode=<enc>" tests
> grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
> grep: inline the return value of a function call used only once
> grep tests: move "grep binary" alongside the rest
> grep tests: move binary pattern tests into their own file
> grep: make the behavior for NUL-byte in patterns sane
> grep: drop support for \0 in --fixed-strings <pattern>
> grep: remove the kwset optimization
> grep: use PCRE v2 for optimized fixed-string search
>
> Documentation/git-grep.txt | 17 +++
> grep.c | 115 +++++++---------
> grep.h | 3 +-
> revision.c | 3 +
> t/t4210-log-i18n.sh | 39 +++++-
> ...a1.sh => t7008-filter-branch-null-sha1.sh} | 0
> ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
> t/t7816-grep-binary-pattern.sh | 127 ++++++++++++++++++
> 8 files changed, 233 insertions(+), 172 deletions(-)
> rename t/{t7009-filter-branch-null-sha1.sh =>
> t7008-filter-branch-null-sha1.sh} (100%)
> rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
> create mode 100755 t/t7816-grep-binary-pattern.sh
>
> Range-diff:
> -: ---------- > 1: cfc01f49d3 log tests: test regex backends in
> "--encode=<enc>" tests
> -: ---------- > 2: 4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log
> --encoding=<non-utf8>"
> 1: ad55d3be7e = 3: cc4d3b50d5 grep: inline the return value of a function
> call used only once
> 2: 650bcc8582 = 4: d9b29bdd89 grep tests: move "grep binary" alongside
> the rest
> 3: ef10a8820d ! 5: f85614f435 grep tests: move binary pattern tests into
> their own file
> @@ -2,9 +2,10 @@
>
> grep tests: move binary pattern tests into their own file
>
> - Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
> - into their own file. I added most of these tests in 966be95549
> ("grep:
> - add tests to fix blind spots with \0 patterns", 2017-05-20).
> + Move the tests for "-f <file>" where "<file>" contains a NUL byte
> + pattern into their own file. I added most of these tests in
> + 966be95549 ("grep: add tests to fix blind spots with \0 patterns",
> + 2017-05-20).
>
> Whether a regex engine supports matching binary content is very
> different from whether it matches binary patterns. Since
> @@ -14,8 +15,8 @@
> engine can sensibly match binary patterns.
>
> Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
> - patterns containing "\0" and considering them fixed, except in cases
> - where "--ignore-case" is provided and they're non-ASCII, see
> + patterns containing NUL-byte and considering them fixed, except in
> + cases where "--ignore-case" is provided and they're non-ASCII, see
> 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
> 2016-06-25). Subsequent commits will change this behavior.
>
> 4: 03e5637efc ! 6: 90afca8707 grep: make the behavior for \0 in patterns
> sane
> @@ -1,12 +1,13 @@
> Author: Ævar Arnfjörð Bjarmason <[email protected]>
>
> - grep: make the behavior for \0 in patterns sane
> + grep: make the behavior for NUL-byte in patterns sane
>
> - The behavior of "grep" when patterns contained "\0" has always been
> - haphazard, and has served the vagaries of the implementation more
> than
> - anything else. A "\0" in a pattern can only be provided via "-f
> - <file>", and since pickaxe (log search) has no such flag "\0" in
> - patterns has only ever been supported by "grep".
> + The behavior of "grep" when patterns contained a NUL-byte has always
> + been haphazard, and has served the vagaries of the implementation
> more
> + than anything else. A pattern containing a NUL-byte can only be
> + provided via "-f <file>". Since pickaxe (log search) has no such flag
> + the NUL-byte in patterns has only ever been supported by "grep" (and
> + not "log --grep").
>
> Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns
> containing
> "\0" were considered fixed. In 966be95549 ("grep: add tests to fix
> @@ -14,9 +15,9 @@
> behavior.
>
> Change the behavior to do the obvious thing, i.e. don't silently
> - discard a regex pattern and make it implicitly fixed just because it
> - contains a \0. Instead die if e.g. --basic-regexp is combined with
> - such a pattern.
> + discard a regex pattern and make it implicitly fixed just because
> they
> + contain a NUL-byte. Instead die if the backend in question can't
> + handle them, e.g. --basic-regexp is combined with such a pattern.
>
> This is desired because from a user's point of view it's the obvious
> thing to do. Whether we support BRE/ERE/Perl syntax is different from
> 5: b9aad3ec1c ! 7: 526b925fdc grep: drop support for \0 in
> --fixed-strings <pattern>
> @@ -2,15 +2,14 @@
>
> grep: drop support for \0 in --fixed-strings <pattern>
>
> - Change "-f <file>" to not support patterns with "\0" in them under
> - --fixed-strings, we'll now only support these under --perl-regexp
> with
> - PCRE v2.
> + Change "-f <file>" to not support patterns with a NUL-byte in them
> + under --fixed-strings. We'll now only support these under
> + "--perl-regexp" with PCRE v2.
>
> - A previous change to Documentation/git-grep.txt changed the
> - description of "-f <file>" to be vague enough as to not promise that
> - this would work, and by dropping support for this we make it a whole
> - lot easier to move away from the kwset backend, which a subsequent
> - change will try to do.
> + A previous change to grep's documentation changed the description of
> + "-f <file>" to be vague enough as to not promise that this would
> work.
> + By dropping support for this we make it a whole lot easier to move
> + away from the kwset backend, which we'll do in a subsequent change.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
>
> 6: 3587be009a ! 8: 14269bb295 grep: remove the kwset optimization
> @@ -2,9 +2,99 @@
>
> grep: remove the kwset optimization
>
> - A later change will replace this optimization with a different one,
> - but as removing it and running the tests demonstrates no grep
> - semantics depend on this backend anymore.
> + A later change will replace this optimization with optimistic use of
> + PCRE v2. I'm completely removing it as an intermediate step, as
> + opposed to replacing it with PCRE v2, to demonstrate that no grep
> + semantics depend on this (or any other) optimization for the fixed
> + backend anymore.
> +
> + For now this is mostly (but not entirely) a performance regression,
> as
> + shown by this hacky one-liner:
> +
> + for opt in '' ' -i'
> + do
> + GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10
> GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3
> USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh
> + done &&
> + for opt in '' ' -i'
> + do GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10
> GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3
> USE_LIBPCRE=YesPlease' ./run origin/master HEAD --
> p4221-log-grep-engines-fixed.sh
> + done
> +
> + Which produces:
> +
> + plain grep:
> +
> + Test origin/master HEAD
> +
> -------------------------------------------------------------------------
> + 7821.1: fixed grep int 0.55(1.60+0.63)
> 0.82(3.11+0.51) +49.1%
> + 7821.2: basic grep int 0.62(1.68+0.49)
> 0.85(3.02+0.52) +37.1%
> + 7821.3: extended grep int 0.61(1.63+0.53)
> 0.91(3.09+0.44) +49.2%
> + 7821.4: perl grep int 0.55(1.60+0.57)
> 0.41(0.93+0.57) -25.5%
> + 7821.6: fixed grep uncommon 0.20(0.50+0.44)
> 0.35(1.27+0.42) +75.0%
> + 7821.7: basic grep uncommon 0.20(0.49+0.45)
> 0.35(1.29+0.41) +75.0%
> + 7821.8: extended grep uncommon 0.20(0.45+0.48)
> 0.35(1.25+0.44) +75.0%
> + 7821.9: perl grep uncommon 0.20(0.53+0.41)
> 0.16(0.24+0.49) -20.0%
> + 7821.11: fixed grep æ 0.35(1.27+0.40)
> 0.25(0.82+0.39) -28.6%
> + 7821.12: basic grep æ 0.35(1.28+0.38)
> 0.25(0.75+0.44) -28.6%
> + 7821.13: extended grep æ 0.36(1.21+0.46)
> 0.25(0.86+0.35) -30.6%
> + 7821.14: perl grep æ 0.35(1.33+0.34)
> 0.16(0.26+0.47) -54.3%
> +
> + grep with -i:
> +
> + Test origin/master HEAD
> +
> -----------------------------------------------------------------------------
> + 7821.1: fixed grep -i int 0.61(1.84+0.64)
> 1.11(4.12+0.64) +82.0%
> + 7821.2: basic grep -i int 0.72(1.86+0.57)
> 1.15(4.48+0.49) +59.7%
> + 7821.3: extended grep -i int 0.94(1.83+0.60)
> 1.53(4.12+0.58) +62.8%
> + 7821.4: perl grep -i int 0.66(1.82+0.59)
> 0.55(1.08+0.58) -16.7%
> + 7821.6: fixed grep -i uncommon 0.21(0.51+0.44)
> 0.44(1.74+0.34) +109.5%
> + 7821.7: basic grep -i uncommon 0.21(0.55+0.41)
> 0.44(1.72+0.40) +109.5%
> + 7821.8: extended grep -i uncommon 0.21(0.57+0.39)
> 0.42(1.64+0.45) +100.0%
> + 7821.9: perl grep -i uncommon 0.21(0.48+0.48)
> 0.17(0.30+0.45) -19.0%
> + 7821.11: fixed grep -i æ 0.25(0.73+0.45)
> 0.25(0.75+0.45) +0.0%
> + 7821.12: basic grep -i æ 0.25(0.71+0.49)
> 0.26(0.77+0.44) +4.0%
> + 7821.13: extended grep -i æ 0.25(0.75+0.44)
> 0.25(0.74+0.46) +0.0%
> + 7821.14: perl grep -i æ 0.17(0.26+0.48)
> 0.16(0.20+0.52) -5.9%
> +
> + plain log:
> +
> + Test origin/master HEAD
> +
> ---------------------------------------------------------------------------------
> + 4221.1: fixed log --grep='int' 7.31(7.06+0.21)
> 8.11(7.85+0.20) +10.9%
> + 4221.2: basic log --grep='int' 7.30(6.94+0.27)
> 8.16(7.89+0.19) +11.8%
> + 4221.3: extended log --grep='int' 7.34(7.05+0.21)
> 8.08(7.76+0.25) +10.1%
> + 4221.4: perl log --grep='int' 7.27(6.94+0.24)
> 7.05(6.76+0.25) -3.0%
> + 4221.6: fixed log --grep='uncommon' 6.97(6.62+0.32)
> 7.86(7.51+0.30) +12.8%
> + 4221.7: basic log --grep='uncommon' 7.05(6.69+0.29)
> 7.89(7.60+0.28) +11.9%
> + 4221.8: extended log --grep='uncommon' 6.89(6.56+0.32)
> 7.99(7.66+0.24) +16.0%
> + 4221.9: perl log --grep='uncommon' 7.02(6.66+0.33)
> 6.97(6.54+0.36) -0.7%
> + 4221.11: fixed log --grep='æ' 7.37(7.03+0.33)
> 7.67(7.30+0.31) +4.1%
> + 4221.12: basic log --grep='æ' 7.41(7.00+0.31)
> 7.60(7.28+0.26) +2.6%
> + 4221.13: extended log --grep='æ' 7.35(6.96+0.38)
> 7.73(7.31+0.34) +5.2%
> + 4221.14: perl log --grep='æ' 7.43(7.10+0.32)
> 6.95(6.61+0.27) -6.5%
> +
> + log with -i:
> +
> + Test origin/master
> HEAD
> +
> ------------------------------------------------------------------------------------
> + 4221.1: fixed log -i --grep='int' 7.40(7.05+0.23)
> 8.66(8.38+0.20) +17.0%
> + 4221.2: basic log -i --grep='int' 7.39(7.09+0.23)
> 8.67(8.39+0.20) +17.3%
> + 4221.3: extended log -i --grep='int' 7.29(6.99+0.26)
> 8.69(8.31+0.26) +19.2%
> + 4221.4: perl log -i --grep='int' 7.42(7.16+0.21)
> 7.14(6.80+0.24) -3.8%
> + 4221.6: fixed log -i --grep='uncommon' 6.94(6.58+0.35)
> 8.43(8.04+0.30) +21.5%
> + 4221.7: basic log -i --grep='uncommon' 6.95(6.62+0.31)
> 8.34(7.93+0.32) +20.0%
> + 4221.8: extended log -i --grep='uncommon' 7.06(6.75+0.25)
> 8.32(7.98+0.31) +17.8%
> + 4221.9: perl log -i --grep='uncommon' 6.96(6.69+0.26)
> 7.04(6.64+0.32) +1.1%
> + 4221.11: fixed log -i --grep='æ' 7.92(7.55+0.33)
> 7.86(7.44+0.34) -0.8%
> + 4221.12: basic log -i --grep='æ' 7.88(7.49+0.32)
> 7.84(7.46+0.34) -0.5%
> + 4221.13: extended log -i --grep='æ' 7.91(7.51+0.32)
> 7.87(7.48+0.32) -0.5%
> + 4221.14: perl log -i --grep='æ' 7.01(6.59+0.35)
> 6.99(6.64+0.28) -0.3%
> +
> + Some of those, as noted in [1] are because PCRE is faster at finding
> + fixed strings. This looks bad for some engines, but in the next
> change
> + we'll optimistically use PCRE v2 for all of these, so it'll look
> + better.
> +
> + 1. https://public-inbox.org/git/[email protected]/
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
>
> 7: 5bc25c03b8 ! 9: c0fd75d102 grep: use PCRE v2 for optimized
> fixed-string search
> @@ -7,19 +7,95 @@
> slower than PCRE v1 and v2 JIT with the kwset backend, so that
> optimization was counterproductive.
>
> - This brings back the optimization for "-F", without changing the
> - semantics of "\0" in patterns. As seen in previous commits in this
> - series we could support it now, but I'd rather just leave that
> - edge-case aside so the tests don't need to do one thing or the other
> - depending on what --fixed-strings backend we're using.
> -
> - I could also support the v1 backend here, but that would make the
> code
> - more complex, and I'd rather aim for simplicity here and in future
> + This brings back the optimization for "--fixed-strings", without
> + changing the semantics of having a NUL-byte in patterns. As seen in
> + previous commits in this series we could support it now, but I'd
> + rather just leave that edge-case aside so we don't have one behavior
> + or the other depending what "--fixed-strings" backend we're using. It
> + makes the behavior harder to understand and document, and makes tests
> + for the different backends more painful.
> +
> + I could also support the PCRE v1 backend here, but that would make
> the
> + code more complex. I'd rather aim for simplicity here and in future
> changes to the diffcore. We're not going to have someone who
> absolutely must have faster search, but for whom building PCRE v2
> isn't acceptable.
>
> - 1. https://public-inbox.org/git/[email protected]/
> + The difference between this series of commits and the current
> "master"
> + is, using the same t/perf commands shown in the last commit:
> +
> + plain grep:
> +
> + Test origin/master HEAD
> +
> -------------------------------------------------------------------------
> + 7821.1: fixed grep int 0.55(1.67+0.56)
> 0.41(0.98+0.60) -25.5%
> + 7821.2: basic grep int 0.58(1.65+0.52)
> 0.41(0.96+0.57) -29.3%
> + 7821.3: extended grep int 0.57(1.66+0.49)
> 0.42(0.93+0.60) -26.3%
> + 7821.4: perl grep int 0.54(1.67+0.50)
> 0.43(0.88+0.65) -20.4%
> + 7821.6: fixed grep uncommon 0.21(0.52+0.42)
> 0.16(0.24+0.51) -23.8%
> + 7821.7: basic grep uncommon 0.20(0.49+0.45)
> 0.17(0.28+0.47) -15.0%
> + 7821.8: extended grep uncommon 0.20(0.54+0.39)
> 0.16(0.25+0.50) -20.0%
> + 7821.9: perl grep uncommon 0.20(0.58+0.36)
> 0.16(0.23+0.50) -20.0%
> + 7821.11: fixed grep æ 0.35(1.24+0.43)
> 0.16(0.23+0.50) -54.3%
> + 7821.12: basic grep æ 0.36(1.29+0.38)
> 0.16(0.20+0.54) -55.6%
> + 7821.13: extended grep æ 0.35(1.23+0.44)
> 0.16(0.24+0.50) -54.3%
> + 7821.14: perl grep æ 0.35(1.33+0.34)
> 0.16(0.28+0.46) -54.3%
> +
> + grep with -i:
> +
> + Test origin/master HEAD
> +
> ----------------------------------------------------------------------------
> + 7821.1: fixed grep -i int 0.62(1.81+0.70)
> 0.47(1.11+0.64) -24.2%
> + 7821.2: basic grep -i int 0.67(1.90+0.53)
> 0.46(1.07+0.62) -31.3%
> + 7821.3: extended grep -i int 0.62(1.92+0.53)
> 0.53(1.12+0.58) -14.5%
> + 7821.4: perl grep -i int 0.66(1.85+0.58)
> 0.45(1.10+0.59) -31.8%
> + 7821.6: fixed grep -i uncommon 0.21(0.54+0.43)
> 0.17(0.20+0.55) -19.0%
> + 7821.7: basic grep -i uncommon 0.20(0.52+0.45)
> 0.17(0.29+0.48) -15.0%
> + 7821.8: extended grep -i uncommon 0.21(0.52+0.44)
> 0.17(0.26+0.50) -19.0%
> + 7821.9: perl grep -i uncommon 0.21(0.53+0.44)
> 0.17(0.20+0.56) -19.0%
> + 7821.11: fixed grep -i æ 0.26(0.79+0.44)
> 0.16(0.29+0.46) -38.5%
> + 7821.12: basic grep -i æ 0.26(0.79+0.42)
> 0.16(0.20+0.54) -38.5%
> + 7821.13: extended grep -i æ 0.26(0.84+0.39)
> 0.16(0.24+0.50) -38.5%
> + 7821.14: perl grep -i æ 0.16(0.24+0.49)
> 0.17(0.25+0.51) +6.3%
> +
> + plain log:
> +
> + Test origin/master HEAD
> +
> --------------------------------------------------------------------------------
> + 4221.1: fixed log --grep='int' 7.24(6.95+0.28)
> 7.20(6.95+0.18) -0.6%
> + 4221.2: basic log --grep='int' 7.31(6.97+0.22)
> 7.20(6.93+0.21) -1.5%
> + 4221.3: extended log --grep='int' 7.37(7.04+0.24)
> 7.22(6.91+0.25) -2.0%
> + 4221.4: perl log --grep='int' 7.31(7.04+0.21)
> 7.19(6.89+0.21) -1.6%
> + 4221.6: fixed log --grep='uncommon' 6.93(6.59+0.32)
> 7.04(6.66+0.37) +1.6%
> + 4221.7: basic log --grep='uncommon' 6.92(6.58+0.29)
> 7.08(6.75+0.29) +2.3%
> + 4221.8: extended log --grep='uncommon' 6.92(6.55+0.31)
> 7.00(6.68+0.31) +1.2%
> + 4221.9: perl log --grep='uncommon' 7.03(6.59+0.33)
> 7.12(6.73+0.34) +1.3%
> + 4221.11: fixed log --grep='æ' 7.41(7.08+0.28)
> 7.05(6.76+0.29) -4.9%
> + 4221.12: basic log --grep='æ' 7.39(6.99+0.33)
> 7.00(6.68+0.25) -5.3%
> + 4221.13: extended log --grep='æ' 7.34(7.00+0.25)
> 7.15(6.81+0.31) -2.6%
> + 4221.14: perl log --grep='æ' 7.43(7.13+0.26)
> 7.01(6.60+0.36) -5.7%
> +
> + log with -i:
> +
> + Test origin/master
> HEAD
> +
> ------------------------------------------------------------------------------------
> + 4221.1: fixed log -i --grep='int' 7.31(7.07+0.24)
> 7.23(7.00+0.22) -1.1%
> + 4221.2: basic log -i --grep='int' 7.40(7.08+0.28)
> 7.19(6.92+0.20) -2.8%
> + 4221.3: extended log -i --grep='int' 7.43(7.13+0.25)
> 7.27(6.99+0.21) -2.2%
> + 4221.4: perl log -i --grep='int' 7.34(7.10+0.24)
> 7.10(6.90+0.19) -3.3%
> + 4221.6: fixed log -i --grep='uncommon' 7.07(6.71+0.32)
> 7.11(6.77+0.28) +0.6%
> + 4221.7: basic log -i --grep='uncommon' 6.99(6.64+0.28)
> 7.12(6.69+0.38) +1.9%
> + 4221.8: extended log -i --grep='uncommon' 7.11(6.74+0.32)
> 7.10(6.77+0.27) -0.1%
> + 4221.9: perl log -i --grep='uncommon' 6.98(6.60+0.29)
> 7.05(6.64+0.34) +1.0%
> + 4221.11: fixed log -i --grep='æ' 7.85(7.45+0.34)
> 7.03(6.68+0.32) -10.4%
> + 4221.12: basic log -i --grep='æ' 7.87(7.49+0.29)
> 7.06(6.69+0.31) -10.3%
> + 4221.13: extended log -i --grep='æ' 7.87(7.54+0.31)
> 7.09(6.69+0.31) -9.9%
> + 4221.14: perl log -i --grep='æ' 7.06(6.77+0.28)
> 6.91(6.57+0.31) -2.1%
> +
> + So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
> + search", 2019-06-26) there's a huge improvement in performance for
> + "grep", but in "log" most of our time is spent elsewhere, so we don't
> + notice it that much.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
>
> @@ -81,15 +157,19 @@
> + } else {
> + /*
> + * E.g. t7811-grep-open.sh relies on the
> -+ * pattern being restored, and unfortunately
> -+ * there's no PCRE compile flag for "this is
> -+ * fixed", so we need to munge it to
> -+ * "\Q<pat>\E".
> ++ * pattern being restored.
> + */
> + char *old_pattern = p->pattern;
> + size_t old_patternlen = p->patternlen;
> + struct strbuf sb = STRBUF_INIT;
> +
> ++ /*
> ++ * There is the PCRE2_LITERAL flag, but it's
> ++ * only in PCRE v2 10.30 and later. Needing to
> ++ * ifdef our way around that and dealing with
> ++ * it + PCRE2_MULTILINE being an error is more
> ++ * complex than just quoting this ourselves.
> ++ */
> + strbuf_add(&sb, "\\Q", 2);
> + strbuf_add(&sb, p->pattern, p->patternlen);
> + strbuf_add(&sb, "\\E", 2);
> @@ -101,9 +181,9 @@
> + p->patternlen = old_patternlen;
> + strbuf_release(&sb);
> + }
> -+#else
> ++#else /* !USE_LIBPCRE2 */
> compile_fixed_regexp(p, opt);
> -+#endif
> ++#endif /* !USE_LIBPCRE2 */
> return;
> }