My inactivity in git recently
Hi, thought I'd send this to git@ as a public FYI. Some of you were concerned about my recent inactivity; rest assured I'm fine, I've just been busy with other things. I'm hoping to get back into it sooner rather than later. Sorry about not replying to things I've been CC'd on.
Re: [PATCH v3 2/5] repo-settings: add feature.manyCommits setting
On Tue, Jul 30 2019, Derrick Stolee via GitGitGadget wrote: > +feature.*:: > + The config settings that start with `feature.` modify the defaults of > + a group of other config settings. These groups are created by the Git > + developer community as recommended defaults and are subject to change. > + In particular, new config options may be added with different defaults. > + > +feature.manyCommits:: > + Enable config options that optimize for repos with many commits. This > + setting is recommended for repos with at least 100,000 commits. The > + new default values are: > ++ > +* `core.commitGraph=true` enables reading the commit-graph file. > ++ > +* `gc.writeCommitGraph=true` enables writing the commit-graph file during > +garbage collection. During the whole new commit graph format discussion (which has now landed) we discussed just auto toggling this: https://public-inbox.org/git/87zhobr4fl@evledraar.gmail.com/ This looks fine, but have we backed out of simply enabling this at this point? I don't see why not, regardless of commit count...
Re: [RFC PATCH v2] grep: allow for run time disabling of JIT in PCRE
On Wed, Jul 31 2019, Johannes Schindelin wrote: > Hi, > > On Mon, 29 Jul 2019, Carlo Marcelo Arenas Belón wrote: > >> $ git grep 'foo bar' >> fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' > > My immediate reaction to this error message was: That's not helpful. > What is `-48` supposed to mean? Why do we even think it sensible to > throw such an error message at the end user? Can't we do a much better > job translating that into something that makes actual sense without > knowing implementation details? > > But then, I realized that -48 must be a well-known constant in PCRE2, > and my reaction transformed into something much more hopeful: why don't > we detect the situation where the JIT'ed code was not actually > executable [*1*], and fall back to the non-JIT'ed code path ourselves, > without troubling the end user (maybe warning, but maybe better not lest > we annoy the user with something pointless)? > > Even after finding out that -48 disappointingly means > PCRE2_ERROR_NOMEMORY (as opposed to something like > PCRE2_ERROR_CANNOT_EXECUTE_JIT_CODE), I like the idea of not bothering > end users and doing the sensible fallback under the hood. > > Ciao, > Dscho > > Footnote *1*: Why anybody would think it sensible to build a PCRE2 with > JIT on an OS that does not allow executing code that was written by the > same process is beyond me. Or is there a mode in OpenBSD that *does* > allow JIT'ed code to be executed? We do detect if JIT isn't supported and fall back. That's what the pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on) code in grep.c does. This and the subsequent pcre2_pattern_info() call are how PCRE documents that you should do this. What hasn't been supported is all of that saying "yes, I support JIT" and the feature then failing at run time. I had not encountered that before. So far that seems to be because Carlo just built a completely broken PCRE v2 package, so I don't know if that's worth supporting on our side. I.e.
this isn't something I think could plausibly happen in the wild. That should *not* be confused with me thinking other stuff Carlo's raised is a non-issue, e.g. running into the JIT stack limit etc. Some of that's clearly bugs in our/my grep.c code that need fixing.
Re: [PATCH 0/4] gc docs: modernize and fix the documentation
On Wed, Jul 31 2019, Jeff King wrote: > On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > Michael Haggerty and I have (off-list) discussed variations on that, but >> > it opens up a lot of new issues. Moving something into quarantine isn't >> > atomic. So you've still corrupted the repo, but now it's recoverable by >> > reaching into the quarantine. Who notices that the repo is corrupt, and >> > how? When do we expire objects from quarantine? >> > >> > I think the heart of the issue is really the lack of atomicity in the >> > operations. You need some way to mark "I am using this now" in a way >> > that cannot race with "looks like nobody is using this, so I'll delete >> > it". >> > >> > And ideally without traversing large bits of the graph on the writing >> > side, and without requiring any stop-the-world locks during pruning. >> >> I was thinking (but realize now that I didn't articulate) that the "gc >> quarantine" would be another "alternate" implementing a copy-on-write >> "lockless delete-but-be-able-to-rollback scheme" as you put it. >> >> So "gc" would decide (racily) what's unreachable, but instead of >> unlink()-ing it would "mv" the loose object/pack into the >> "unreferenced-objects" quarantine. >> >> Then in your example #1 "wants to reference ABCD. It sees that we have >> it." would race on the "other side". I.e. maybe ABCD was *just* moved to >> the quarantine. But in that case we'd move it back, which would bump the >> mtime and thus make it ineligible for expiry. > > I think this is basically the same as the current freshening scheme, > though. In general, you can replace "move it back" with "update its > mtime". Neither is atomic with respect to other operations. > > It does seem like the twist is that "gc" is supposed to do the "move it > back" step (and it's also the thing expiring, if we assume that there's > only one gc running at a time). 
But again, how do we know somebody isn't > referencing it _right now_ while we're deciding whether to move it back? The twist is to create a "quarantine" area of the ref store you can't read any objects from without copying them to the "main" area (git-gc itself would be an exception). Hence steps #2 and #6, respectively, in your examples in https://public-inbox.org/git/20190319001829.gl29...@sigill.intra.peff.net/ would have update-ref/receive-pack fail to find "ABCD" in the "main" store due to the exact same race we have now with mtimes & gc, then fall back to the "quarantine" and (this is the important part) immediately copy it back to the "main" store. IOW yes, you'd have the exact same race you have now with the initial move to the quarantine. You'd have ref updates & gc racing and "unreachable" things would be moved to the quarantine, but really they just became reachable again. The difference is that instead of unlinking that unreachable object we move it to the quarantine, so the next "gc" (which is what would delete it) would notice it's reachable and move it to the "main" area before proceeding, *and* anything that "faults" back to reading the "quarantine" would do the same. > I think there are lots of solutions you can come up with if you have > atomicity. But fundamentally it isn't there in the way we handle updates > now. You could imagine something like a shared/unique lock where anybody > updating a ref takes the "shared" side, and multiple entities can hold > it at once. But somebody pruning takes the "unique" side and excludes > everybody else, stopping ref updates during the prune (which you'd > obviously want to do in a way that you hold the lock for as short as > possible; say, optimistically check reachability without the lock, then > take the lock and check to see if anything has changed). > > (By shared/unique I basically mean a reader/writer lock, but I didn't > want to use those terms in the paragraph since both holders are > writing). 
> > It is tricky to find out when to hold the shared lock, though. It's > _not_ just a ref write, for example. When you accept a push, you'd want > to hold the lock while you are checking that you have all of the > necessary objects to write the ref. For something like "git commit" it's > even harder, because we implicitly rely on state created by commands run > over the course of hours or days (e.g., "git add" to put a blob in the > index and maybe create the tree via cache-tree, then a commit to > reference it, and finally the ref write; each step adds state which the > next step relies on). I don't think this sort of approach would require any global locks, but it would be vulnerable to operations that take longer than the "main->quarantine->unlink()" cycle takes. E.g. a "hash-object" that takes a month before the subsequent "write-tree" etc. All of the above written with the previously stated "I may be missing something" caveat etc. :)
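To make the "main->quarantine->unlink()" cycle above a bit more concrete, here is a toy model of it. This is only a sketch of the idea as described in this mail, not anything that exists in git: the names obj_state, struct object, gc_pass() and lookup_object() are all made up for illustration, and the "reachable" flag stands in for whatever a (racy) reachability walk would conclude.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of where one object lives; not Git's real object store. */
enum obj_state { OBJ_MAIN, OBJ_QUARANTINE, OBJ_DELETED };

struct object {
	enum obj_state state;
	int reachable; /* what a (racy) reachability walk would conclude */
};

/*
 * One "gc" pass over all objects:
 *  - quarantined objects that turned out to be reachable move back to main,
 *  - quarantined objects that are still unreachable are unlinked,
 *  - unreachable objects in main are moved (not unlinked) to the quarantine.
 * Each object makes at most one transition per pass.
 */
void gc_pass(struct object *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct object *o = &objs[i];

		if (o->state == OBJ_QUARANTINE)
			o->state = o->reachable ? OBJ_MAIN : OBJ_DELETED;
		else if (o->state == OBJ_MAIN && !o->reachable)
			o->state = OBJ_QUARANTINE;
	}
}

/*
 * A reader (think update-ref or receive-pack) looking up an object.
 * If the object is only in the quarantine it is "faulted" back into
 * main immediately. Returns 1 if the object is usable, 0 if it is gone.
 */
int lookup_object(struct object *o)
{
	assert(o != NULL);
	if (o->state == OBJ_QUARANTINE)
		o->state = OBJ_MAIN;
	return o->state == OBJ_MAIN;
}
```

In this model an object only disappears after two gc passes with no lookup in between, which is exactly the window a very slow "hash-object ... write-tree" sequence could still fall into.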
Re: [PATCH] send-email: Ask if a patch should be sent twice
On Tue, Jul 30 2019, Dmitry Safonov wrote: > I was almost certain that git won't let me send the same patch twice, > but today I've managed to double-send a directory by a mistake: > git send-email --to linux-ker...@vger.kernel.org /tmp/timens/ > --cc 'Dmitry Safonov <0x7f454...@gmail.com>' /tmp/timens/` > > [I haven't noticed that I put the directory twice ^^] > > Prevent this shipwreck from happening again by asking if a patch > is sent multiple times on purpose. > > link: https://lkml.kernel.org/r/4d53ebc7-d5b2-346e-c383-606401d19...@gmail.com > Cc: Andrei Vagin > Signed-off-by: Dmitry Safonov > --- > git-send-email.perl | 23 ++- > 1 file changed, 22 insertions(+), 1 deletion(-) There are tests for send-email in t/t9001-send-email.sh. See if what you're adding can have a test added; it seems simple enough in this case. > diff --git a/git-send-email.perl b/git-send-email.perl > index 5f92c89c1c1b..0caafc104478 100755 > --- a/git-send-email.perl > +++ b/git-send-email.perl > @@ -33,6 +33,7 @@ > use Net::Domain (); > use Net::SMTP (); > use Git::LoadCPAN::Mail::Address; > +use experimental 'smartmatch'; We depend on Perl 5.8; this bumps the requirement to 5.10. Aside from that, ~~ is its own can of worms in Perl and is best avoided. > Getopt::Long::Configure qw/ pass_through /; > > @@ -658,6 +659,17 @@ sub is_format_patch_arg { > } > } > > +sub send_file_twice { > + my $f = shift; > + $_ = ask(__("Patch $f will be sent twice, continue? [y]/n "), These cases with a default should have "Y/n", not "y/n". See other examples in the file. > + default => "y", > + valid_re => qr/^(?:yes|y|no|n)/i); > + if (/^n/i) { > + cleanup_compose_files(); > + exit(0); Exit if we have just one of these? More on that later... > + } > +} > + > # Now that all the defaults are set, process the rest of the command line > # arguments and collect up the files that need to be processed. 
> my @rev_list_opts; > @@ -669,10 +681,19 @@ sub is_format_patch_arg { > opendir my $dh, $f > or die sprintf(__("Failed to opendir %s: %s"), $f, $!); > > - push @files, grep { -f $_ } map { catfile($f, $_) } > + my @new_files = grep { -f $_ } map { catfile($f, $_) } > sort readdir $dh; > + foreach my $nfile (@new_files) { > + if ($nfile ~~ @files) { > + send_file_twice($nfile); > + } One non-smartmatch idiom for this is: my %seen; for my $file (@files) { if ($seen{$file}++) { ...} } Or: my %seen; my @dupes = grep { $seen{$_}++ } @files; > + } > + push @files, @new_files; > closedir $dh; > } elsif ((-f $f or -p $f) and !is_format_patch_arg($f)) { > + if ($f ~~ @files) { > + send_file_twice($f); > + } > push @files, $f; ...but picking up the comment above, I'd expect this to be in the "if ($validate)" block below or something similar, seems like this fits right in with --validate. Then you can also ask "do you want to send this set of patches twice ?". Now the user is asked a file-at-a-time. > } else { > push @rev_list_opts, $f;
Re: [PATCH] Documentation/git-fsck.txt: include fsck.* config variables
On Mon, Jul 29 2019, SZEDER Gábor wrote: > The 'fsck.skipList' and 'fsck.' config variables might be > easier to discover when they are documented in 'git fsck's man page. > > Signed-off-by: SZEDER Gábor > --- > Documentation/git-fsck.txt | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt > index e0eae642c1..d72d15be5b 100644 > --- a/Documentation/git-fsck.txt > +++ b/Documentation/git-fsck.txt > @@ -104,6 +104,11 @@ care about this output and want to speed it up further. > progress status even if the standard error stream is not > directed to a terminal. > > +CONFIGURATION > +- > + > +include::config/fsck.txt[] Before this include let's add: The below documentation is the same as what’s found in git-config(1): As I did for a similar change in git-gc in b6a8d09f6d ("gc docs: include the "gc.*" section from "config" in "gc"", 2019-04-07). Sometimes we repeat ourselves, it helps the reader to know this isn't some slightly different prose than what's in git-config. > + > DISCUSSION > --
Re: Settings for minimizing repacking (and keeping 'rsync' happy)
On Mon, Jul 29 2019, Jeff King wrote: > On Sun, Jul 28, 2019 at 01:41:34AM +0200, ardi wrote: > >> Some of my Git repositories have mirrors, maintained with 'rsync'. I >> want to have some level of repacking, so that the repositories are >> efficient, but I also want it to minimize it, so that 'rsync' never >> has to perform a big transfer for the repositories. > > Yes, this is a common problem. The solutions I've seen/used are: > > - use a git-aware transport like git-fetch that can negotiate which > objects to send > > - use a tool that can find duplicated chunks across files. Many > de-duping backup systems (e.g., borg) use a rolling hash similar to > rsync to find moveable chunks, but then look up those chunks in a > master index (whereas rsync is always looking to match chunks in a > file of the same name). This works well in practice because Git is > not usually rewriting most of the data, but just shuffling it around > between files. > > In theory it shouldn't be that hard to tell the receiving rsync to > look for source chunks not just in the file of the same name, but > from a set of existing packfiles (say, everything already in > .git/objects/pack/ on the receiver). But I don't know offhand of an > option to rsync to do so. > >> For example, I think it would be fine if files are repacked just once >> in their lifetimes, and then that resulting pack file is never >> repacked again. I did read the gc.bigPackThreshold and >> gc.autoPackLimit settings, but I don't think they would accomplish >> that. >> >> Basically, what I'm describing is the behaviour of not packing files >> until the resulting pack would be a given size (say 10MB for example), >> and then never repack such ~10MB packs again, ever. >> >> Can this be done with some Git settings? And do you foresee any kind >> of serious drawback or potential problem with this kind of behaviour? > > You can mark a pack to be kept forever by creating a matching > "pack-1234abcd.keep" file. 
That doesn't do your automatic "I want 10MB > packs" thing, but if you did it occasionally at the right frequency, > you'd end up with a bunch of 10MB-ish packs. > > But there are downsides to having a bunch of packs: > > - object lookups are O(log n) within a single pack, but O(n) over the > number of packs. So if you get a very large number of packs, normal > operations will start to suffer. This is mitigated by the new "midx" > feature, which generates an index for multiple packs. > > - git doesn't allow delta compression across packs. So imagine you > have ten versions of a file that's 5kb, and each version changes > about 100 bytes. In a single pack, we'd store one base object, plus > 9 deltas, for a total of about 6kb (5000 + 9*100). Across two packs, > we'd store ~11kb (2*5000 + 8*100). And the worst case is ten packs > at 50kb. > > As a more real-world example, try this: > > git -c pack.packsizelimit=10M repack -ad > > In a fresh clone of git.git, the size of the pack directory jumps > from 88MB to 168MB. And in a time-based split (i.e., creating a new > 10MB pack every week), it may be even worse. The command above > ordered the objects optimally to keep deltas together and _then_ > split things. Whereas a time-based scheme would likely sprinkle > versions of a file across more packs. > > It should be possible to loosen this restriction and allow > cross-pack deltas, but it would be very risky. The assumption that > packs are independent of each other is implicit in much of Git's > repacking code, so it would be easy to introduce a bug where we > generate a circular dependency (object A in pack X is a delta > against object B in pack Y, which is a delta against object A -- > oops, we don't have a full copy anymore). The thread I started at https://public-inbox.org/git/87bmhiykvw@evledraar.gmail.com/ should also be of interest. I.e. 
we could have some knobs to create more "stable" packs. I know rsync does some in-file hashing, but I don't know if/how that works if you have 1 file split into N where some chunks in the N are in the one file. But it's possible to imagine a repacking algorithm that would keep producing entirely new packs but arrange for them to be ordered/delta'd in such a way that it optimizes for page-by-page similarity to an older pack to some degree. So e.g. in the examples you mention break the delta chain at 5, then pick it up again once it's 10 etc. So the intermediate packs where it's 6, 7, 8, 9 would have the new stuff at the end.
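For reference, the delta arithmetic in the quoted example (ten 5kb versions, ~100 bytes changed per version) can be written down as a one-liner. This is just a back-of-the-envelope sketch: packed_size() is a made-up helper, and it ignores zlib compression and per-object overhead. It only encodes the rule that deltas can't cross packs, so every pack needs one full base object.

```c
#include <assert.h>

/*
 * Rough storage estimate for `versions` revisions of one file spread over
 * `packs` packs when deltas cannot cross pack boundaries: each pack stores
 * one full base object, and every remaining revision is stored as a delta.
 */
unsigned long packed_size(unsigned long versions, unsigned long packs,
			  unsigned long base_bytes, unsigned long delta_bytes)
{
	assert(packs >= 1 && packs <= versions);
	return packs * base_bytes + (versions - packs) * delta_bytes;
}
```

With versions=10, base_bytes=5000 and delta_bytes=100 this gives 5900 bytes for one pack, 10800 for two and 50000 for ten, matching the "about 6kb", "~11kb" and "50kb" figures quoted above.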
Re: Warnings in gc.log can prevent gc --auto from running
On Mon, Jul 29 2019, Jeff King wrote: > On Thu, Jul 25, 2019 at 07:18:57PM -0700, Gregory Szorc wrote: > >> I think I've found some undesirable behavior with regards to the >> behavior of `git gc --auto`. The tl;dr is that a warning message written >> to gc.log can result in `git gc --auto` effectively disabling itself for >> gc.logExpiry. The problem is easier to trigger in 2.22 as a result of >> enabling bitmap indices for bare repositories by default and the >> behavior can easily result in performance degradation, especially on >> servers. > > Yuck, thanks for reporting this. > > As you note, this is a special case of a much larger problem. The other > common case is the "oops, you still have a lot of loose objects after > repacking" warning. There's more discussion and some patches here: > > > https://public-inbox.org/git/20180716172717.237373-1-jonathanta...@google.com/ > > though I don't think any of the work that came out of that fundamentally > solves the issue. To add to that Gregory probably finds these two old reports of mine interesting. The former is pretty much his report (but for a different root cause, the loose object issue): https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ & https://public-inbox.org/git/87fu6bmr0j@evledraar.gmail.com/ >> I don't prescribe to know the best way to solve this problem. I just >> know it is a footgun sitting in the default Git configuration. And the >> footgun became a lot easier to fire with the introduction of warning >> messages related to bitmap indices and again when bitmap indices were >> enabled by default for bare repositories in Git 2.22. > > IMHO one way to mitigate this is to simply warn less. In particular, if > we are auto-enabling bitmaps, then it doesn't necessarily make sense for > us to warn about them being disabled. 
> > In the case of .keep files, we've already got 7328482253 (repack: > disable bitmaps-by-default if .keep files exist, 2019-06-29), which > should be in the next released version of Git. But I suspect that's > racy with respect to somebody creating .keep files, and as you note > there are other config options that might prevent us from generating > bitmaps. > > Instead, it may make sense to turn the --write-bitmap-index option of > pack-objects into a tri-state: true/false/auto. Then pack-objects would > know that we are in best-effort mode, and would avoid warning in that > case. That would also let git-repack express its intentions better to > git-pack-objects, so we could replace 7328482253, and keep more of the > logic in pack-objects, which is ultimately what has to make the decision > about whether it can generate bitmaps. Sounds like pentastate to me :) (penta = 5, had to look it up). I.e. in most cases of "auto" we pick a true/false at the outset, whereas this is true/true-but-dont-care-much/false/false-but-dont-care-much with "auto" picking the "-but-dont-care-much" versions of a "soft" true/false. On this general topic a *soft* poke about replying to https://public-inbox.org/git/8736lnxlig@evledraar.gmail.com/ if you have time. I think a "loose pack" might be a way forward for the loose object proliferation, but maybe I'm wrong. More generally we're really straining the gc.log pass-along-a-message facility.
Re: [RFC PATCH] grep: allow for run time disabling of JIT in PCRE
On Mon, Jul 29 2019, Carlo Arenas wrote: > On Mon, Jul 29, 2019 at 1:55 AM Ævar Arnfjörð Bjarmason > wrote: >> >> On Mon, Jul 29 2019, Carlo Marcelo Arenas Belón wrote: >> >> > PCRE1 allowed for a compile time flag to disable JIT, but PCRE2 never >> > had one, forcing the use of JIT if -P was requested. >> >> What's that PCRE1 compile-time flag? > > NO_LIBPCRE1_JIT at GIT compile time (regardless of JIT support in the > PCRE1 library you are using) Ah of course, I was reading this as "regexp compile-time". I.e. something like (*NO_JIT). No *such* thing exists for PCRE v1 JIT AFAIK as exposed by git-grep. >> > After ed0479ce3d (Merge branch 'ab/no-kwset' into next, 2019-07-15) >> > the PCRE2 engine will be used more broadly and therefore adding this >> > knob will give users a fallback for situations like the one observed >> > in OpenBSD with a JIT enabled PCRE2, because of W^X restrictions: >> > >> > $ git grep 'foo bar' >> > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' >> > $ git grep -G 'foo bar' >> > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' >> > $ git grep -E 'foo bar' >> > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' >> > $ git grep -F 'foo bar' >> > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' >> >> Yeah that obviously sucks more with ab/no-kwset, but that seems like a >> case where -P would have been completely broken before, and therefore I >> can't imagine the package ever passed "make test". Or is W^X also >> exposed as some run-time option on OpenBSD? > > ironically, you could use PCRE1 since that is not using the JIT fast > path and therefore will fallback automatically to the interpreter ...because OpenBSD PCRE v1 was compiled with --disable-jit before, but their v2 package has --enable-jit, it just doesn't work at all? Is this your custom built git + OpenBSD packages of PCRE coming with the OS? I don't use OpenBSD, but isn't this their recipe? 
Seems they use "make test", and don't compile with PCRE at all if I'm reading it right: https://github.com/openbsd/ports/blob/master/devel/git/Makefile > there is also a convoluted way to make your binary work by moving > it into a mount point that has been specially exempted from that W^X > restriction. > >> I.e. aside from the merits of such a setting in general these examples >> seem like just working around something that should be fixed at make >> all/test time, or maybe I'm missing something. > > 1) before you could just avoid using -P and still be able to grep > 2) there is no way to tell PCRE2 to get out of the way even if you are > not using -P Right, no arguments at all about ab/no-kwset making this worse (re: your #1). I just really prefer not to expose/document config for what *should* be something purely internal if the X-Y problem is a bug being exposed that we should just fix. Particularly because I think it's a losing battle to provide run-time options for what are surely a *lot* of "make test" failures. If it really is unavoidable to detect this until runtime in some common configurations I have no problem with it, I just haven't encountered that so far. > you are right though that this is not a new problem and was reported > before with patches and the last comment saying a configuration > should be provided. patches = your recent https://public-inbox.org/git/20181209230024.43444-2-care...@gmail.com/ or something earlier? That patch seems sane without having tested it. Seems like the equivalent of what we do with v1 with PCRE2_JIT_COMPLETE. I *am* curious if there are setups where fixing the code for PCRE v1 isn't purely an academic exercise. Is there a reason why these platforms can't just move to PCRE v2 in principle (dumpster fires in "next" notwithstanding)? >> To the extent that we'd want to make this sort of thing configurable, I >> wonder if a continuation of my (*NO_JIT) patch isn't better, i.e. 
just >> adding the ability to configure some string we'd inject at the start of >> every pattern. > > looking at the number of lines of code, it would seem the configuration > approach is simpler. > >> That would allow for setting any other number of options in >> pcre2syntax(3) without us needing to carry config for each one, >> e.g. (*LIMIT_HEAP=d), (*LIMIT_DEPTH=d) etc. It does present a larger >> foot-gun surface though... > > the parameters I suspect users might need are not really accessible through > that (ex: jit stacksize). > > it is important to note that currently we are not preventing any user to use > those flags themselves in their patterns either.
Re: [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
On Fri, Jul 26 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> 1-3 here are a re-roll on "next". I figured that was easier for >> everyone with the state of the in-flight patches, it certainly was for >> me. Sorry Junio if this creates a mess for you. > > As long as I can just apply all of them on top of no-kwset and keep > it a single topic, it wouldn't be too much of a hassle. > >> 4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep: >> skip UTF8 checks explicitally" in >> https://public-inbox.org/git/20190721183115.14985-1-care...@gmail.com/ >> >> As noted the bug isn't fully fixed until 8/8, and that patch relies on >> unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good >> enough state to limp forward as noted in the rationale of those >> commits. > > Yikes. Perhaps we should kick the no-kwset thing out of 'next' and > start from scratch? It does not sound that the world is ready yet. I have some fix-for-the-fix and was going to submit a v3 of this series, but I think the more responsible thing to do at this point, especially with various patches from Carlo that need to be integrated in one way or another, is to back it out until the outstanding issues are addressed. If it's not too much trouble, would you mind reverting just the two patches at the tip of ab/no-kwset in "next"? I.e. b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) 48de2a768c ("grep: remove the kwset optimization", 2019-07-01) I believe the rest are all settled & haven't had any issues raised with them, and those tests & preparatory fixes would be very useful to have in "master" for any re-roll without needing to be distracted by those changes. > But that is just a knee-jerk reaction before reading the actual > patches. Let's see how they look ;-) > > Thanks. > >> Ævar Arnfjörð Bjarmason (8): >> grep: remove overly paranoid BUG(...) 
code >> grep: stop "using" a custom JIT stack with PCRE v2 >> grep: stop using a custom JIT stack with PCRE v1 >> grep: consistently use "p->fixed" in compile_regexp() >> grep: create a "is_fixed" member in "grep_pat" >> grep: stess test PCRE v2 on invalid UTF-8 data >> grep: do not enter PCRE2_UTF mode on fixed matching >> grep: optimistically use PCRE2_MATCH_INVALID_UTF >> >> Makefile| 1 + >> grep.c | 68 +++-- >> grep.h | 13 ++- >> t/helper/test-pcre2-config.c| 12 ++ >> t/helper/test-tool.c| 1 + >> t/helper/test-tool.h| 1 + >> t/t7812-grep-icase-non-ascii.sh | 39 +++ >> 7 files changed, 80 insertions(+), 55 deletions(-) >> create mode 100644 t/helper/test-pcre2-config.c
Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
On Mon, Jul 29 2019, Ævar Arnfjörð Bjarmason wrote: > On Mon, Jul 29 2019, Carlo Arenas wrote: > >> On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason >> wrote: >>> >>> It's less confusing to use that variable consistently that switch back >>> & forth between the two. >>> >>> Signed-off-by: Ævar Arnfjörð Bjarmason >>> --- >>> grep.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/grep.c b/grep.c >>> index 9c2b259771..b94e998680 100644 >>> --- a/grep.c >>> +++ b/grep.c >>> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct >>> grep_opt *opt) >>> die(_("given pattern contains NULL byte (via -f ). >>> This is only supported with -P under PCRE v2")); >>> >>> pat_is_fixed = is_fixed(p->pattern, p->patternlen); >>> - if (opt->fixed || pat_is_fixed) { >>> + if (p->fixed || pat_is_fixed) { >> >> at the end of this series we have: >> >> if (p->fixed || p->is_fixed) >> >> which doesn't make sense; at least with opt->fixed it was clear that >> what was meant is that grep was passed -P > > I assume you mean "was passed -F...". > >> maybe is_fixed shouldn't exist and fixed when applied to the pattern >> means we had determined it was a fixed >> pattern and overridden the user selection of engine. > > They're two flags because p->fixed is "--fixed-strings", and p->is_fixed > is "there's no metachars here". So the former case needs escaping, as > the code just below might do (the two aren't mutually exclusive). > > I don't get how you think we can always fold them into one flag, but > maybe I'm missing something... 
> >> that at least will give us a logical way to fix the pattern reported >> in [1] and that currently requires the user to know >> git's grep internals and know he can skip the "is_fixed" optimization >> by doing something like : >> >> $ git grep 'foo[ ]bar' >> >> [1] https://public-inbox.org/git/20190728235427.41425-1-care...@gmail.com/ > > As I noted in a reply there this seems like a way to fix a bug in "next" > with a config knob. Yes we should fix the bug, but we've had the kwset > code in git for years without needing this distinction, so after we work > out the bugs I don't see why we'd need this. > > The reason we ignore the user's choice here is because you might > e.g. set grep.patternType=extended in your config, and you'd still want > grepping for a fixed "foo" to be fast. ...and more generally, for any future sanity of implementation and maintenance I think we should only make the promise that we support certain syntax & semantics, not that the -F, -G, -E, -P options are guaranteed to dispatch to a given codepath. Internally we should be free to switch between those, so e.g. if a pattern is fixed and you configure "basic" regexp, but we know your C library is faster for those matches with REG_EXTENDED we should just pass that regardless of -G or -E. Of course that means we *must* expose the same semantics (to some reasonable extent), which means I have a lot of bugs in "next" to address. I'm just saying that the presence of those bugs means we should be inclined to fix them / back out certain changes, not work around them with user-serviceable knobs.
Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
On Mon, Jul 29 2019, Carlo Arenas wrote: > On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason > wrote: >> >> It's less confusing to use that variable consistently that switch back >> & forth between the two. >> >> Signed-off-by: Ævar Arnfjörð Bjarmason >> --- >> grep.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/grep.c b/grep.c >> index 9c2b259771..b94e998680 100644 >> --- a/grep.c >> +++ b/grep.c >> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct >> grep_opt *opt) >> die(_("given pattern contains NULL byte (via -f ). >> This is only supported with -P under PCRE v2")); >> >> pat_is_fixed = is_fixed(p->pattern, p->patternlen); >> - if (opt->fixed || pat_is_fixed) { >> + if (p->fixed || pat_is_fixed) { > > at the end of this series we have: > > if (p->fixed || p->is_fixed) > > which doesn't make sense; at least with opt->fixed it was clear that > what was meant is that grep was passed -P I assume you mean "was passed -F...". > maybe is_fixed shouldn't exist and fixed when applied to the pattern > means we had determined it was a fixed > pattern and overridden the user selection of engine. They're two flags because p->fixed is "--fixed-strings", and p->is_fixed is "there's no metachars here". So the former case needs escaping, as the code just below might do (the two aren't mutually exclusive). I don't get how you think we can always fold them into one flag, but maybe I'm missing something... > that at least will give us a logical way to fix the pattern reported > in [1] and that currently requires the user to know > git's grep internals and know he can skip the "is_fixed" optimization > by doing something like : > > $ git grep 'foo[ ]bar' > > [1] https://public-inbox.org/git/20190728235427.41425-1-care...@gmail.com/ As I noted in a reply there this seems like a way to fix a bug in "next" with a config knob. 
Yes we should fix the bug, but we've had the kwset code in git for years without needing this distinction, so after we work out the bugs I don't see why we'd need this. The reason we ignore the user's choice here is because you might e.g. set grep.patternType=extended in your config, and you'd still want grepping for a fixed "foo" to be fast.
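To illustrate why these are two separate flags, here is a rough sketch. Everything in it is an assumption for illustration: the metacharacter set, the sketch_* names, and the choice of PCRE-style \Q...\E quoting as the escaping step. It is not the code in grep.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed metacharacter set; git's real check may differ. */
static const char sketch_meta[] = ".*+?[]()^$|{}\\";

/* "There's no metachars here": true if no byte is in the assumed set. */
static int sketch_is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (memchr(sketch_meta, s[i], sizeof(sketch_meta) - 1))
			return 0;
	return 1;
}

/*
 * One possible way to hand a --fixed-strings pattern that does contain
 * metacharacters to a PCRE-style engine: wrap it in \Q...\E.
 * Returns a malloc'd, NUL-terminated buffer (caller frees), or NULL.
 * Caveat: this naive form breaks if the pattern itself contains "\E".
 */
static char *sketch_quote_literal(const char *s, size_t len, size_t *outlen)
{
	char *out = malloc(len + 5);

	if (!out)
		return NULL;
	memcpy(out, "\\Q", 2);
	memcpy(out + 2, s, len);
	memcpy(out + 2 + len, "\\E", 2);
	out[len + 4] = '\0';
	*outlen = len + 4;
	return out;
}

static void sketch_demo(void)
{
	size_t len;
	char *quoted;

	assert(sketch_is_fixed("foo bar", 7) == 1);   /* usable as-is */
	assert(sketch_is_fixed("foo[ ]bar", 9) == 0); /* has metachars */

	/* -F with metachars: fixed by request, but needs quoting. */
	quoted = sketch_quote_literal("a.b", 3, &len);
	assert(quoted && len == 7 && !strcmp(quoted, "\\Qa.b\\E"));
	free(quoted);
}
```

So "-F a.b" is fixed by user request but still needs quoting, while a plain "foo" is fixed by inspection and needs none, which is why one flag cannot express both.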
Re: [RFC PATCH] grep: allow for run time disabling of JIT in PCRE
On Mon, Jul 29 2019, Carlo Marcelo Arenas Belón wrote: > PCRE1 allowed for a compile time flag to disable JIT, but PCRE2 never > had one, forcing the use of JIT if -P was requested. What's that PCRE1 compile-time flag? > After ed0479ce3d (Merge branch 'ab/no-kwset' into next, 2019-07-15) > the PCRE2 engine will be used more broadly and therefore adding this > knob will give users a fallback for situations like the one observed > in OpenBSD with a JIT enabled PCRE2, because of W^X restrictions: > > $ git grep 'foo bar' > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' > $ git grep -G 'foo bar' > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' > $ git grep -E 'foo bar' > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' > $ git grep -F 'foo bar' > fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48' Yeah that obviously sucks more with ab/no-kwset, but that seems like a case where -P would have been completely broken before, and therefore I can't imagine the package ever passed "make test". Or is W^X also exposed as some run-time option on OpenBSD? I.e. aside from the merits of such a setting in general these examples seem like just working around something that should be fixed at make all/test time, or maybe I'm missing something. To the extent that we'd want to make this sort of thing configurable, I wonder if a continuation of my (*NO_JIT) patch isn't better, i.e. just adding the ability to configure some string we'd inject at the start of every pattern. That would allow for setting any other number of options in pcre2syntax(3) without us needing to carry config for each one, e.g. (*LIMIT_HEAP=d), (*LIMIT_DEPTH=d) etc. It does present a larger foot-gun surface though...
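As a rough sketch of the "inject a string at the start of every pattern" idea: the function below is hypothetical, and where the prefix would come from (presumably some config variable) is deliberately left out. It only shows the mechanical part, prepending a prefix to a counted pattern so that embedded NUL bytes survive:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd buffer holding `prefix`
 * followed by the counted pattern. The result is NUL-terminated for
 * convenience, but *outlen is the length that should be used.
 * Caller frees; returns NULL on allocation failure.
 */
static char *sketch_with_prefix(const char *prefix, const char *pat,
				size_t patlen, size_t *outlen)
{
	size_t prefixlen = strlen(prefix);
	char *out = malloc(prefixlen + patlen + 1);

	if (!out)
		return NULL;
	memcpy(out, prefix, prefixlen);
	memcpy(out + prefixlen, pat, patlen);
	out[prefixlen + patlen] = '\0';
	*outlen = prefixlen + patlen;
	return out;
}

static void sketch_demo(void)
{
	size_t len;
	char *p = sketch_with_prefix("(*NO_JIT)", "foo bar", 7, &len);

	assert(p && len == 16 && !strcmp(p, "(*NO_JIT)foo bar"));
	free(p);
}
```

The foot-gun mentioned above is visible here: nothing validates the prefix, so any string the user configures ends up in front of every pattern.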
Re: [PATCH 3/3] grep: plug leak of pcre chartables in PCRE2
On Sat, Jul 27 2019, Carlo Marcelo Arenas Belón wrote: > Just as it is done with PCRE1, make sure that the allocated chartables > get free at cleanup time. > > This assumes no global context is used (NULL passed when created the > tables), but will likely be updated in tandem if that ever changes. > > Signed-off-by: Carlo Marcelo Arenas Belón > --- > grep.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/grep.c b/grep.c > index d04635fad4..d9768c5f05 100644 > --- a/grep.c > +++ b/grep.c > @@ -604,6 +604,7 @@ static void free_pcre2_pattern(struct grep_pat *p) > pcre2_match_data_free(p->pcre2_match_data); > pcre2_jit_stack_free(p->pcre2_jit_stack); > pcre2_match_context_free(p->pcre2_match_context); > + free((void *)p->pcre_tables); Is the cast really needed? I'm rusty on the rules, removing it from the pcre_free() you might have copied this from produces a warning for me, but not for free() itself. This is on GCC 8.3.0. How about for you & what compiler(s)?
Re: [PATCH 1/3] grep: make pcre1_tables version agnostic
On Sat, Jul 27 2019, Carlo Marcelo Arenas Belón wrote: > 6d4b5747f0 ("grep: change internal *pcre* variable & function names > to be *pcre1*", 2017-05-25), renamed most variables to be PCRE1 > specific to give space to similarly named variables for PCRE2, but > in this case the change wasn't needed as the types were compatible > enough (const unsigned char* vs const uint8_t*) to be shared. Both the v1 and v2 functions return const unsigned char *. I don't know where I got the uint8_t from. This makes more sense. This series looks good to me. Thanks for the fix. Just one caveat: The point of 6d4b5747f0 was not to only split out those variables we couldn't get away with re-using. Then I would have later re-used e.g. pcre1_jit_on & pcre2_jit_on as just pcre_jit_on. We could also do that now. I think doing that & this part of your changes makes things less readable. The two code branches we compile with ifdefs are mutually exclusive, so having the variables be unique helps with eyeballing / reasoning when changing the code. > Revert that change, as 94da9193a6 ("grep: add support for PCRE v2", > 2017-06-01) failed to create an equivalent PCRE2 version. 
> > Signed-off-by: Carlo Marcelo Arenas Belón > --- > grep.c | 6 +++--- > grep.h | 2 +- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/grep.c b/grep.c > index f7c3a5803e..cc65f7a987 100644 > --- a/grep.c > +++ b/grep.c > @@ -389,14 +389,14 @@ static void compile_pcre1_regexp(struct grep_pat *p, > const struct grep_opt *opt) > > if (opt->ignore_case) { > if (has_non_ascii(p->pattern)) > - p->pcre1_tables = pcre_maketables(); > + p->pcre_tables = pcre_maketables(); > options |= PCRE_CASELESS; > } > if (is_utf8_locale() && has_non_ascii(p->pattern)) > options |= PCRE_UTF8; > > p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset, > - p->pcre1_tables); > + p->pcre_tables); > if (!p->pcre1_regexp) > compile_regexp_failed(p, error); > > @@ -462,7 +462,7 @@ static void free_pcre1_regexp(struct grep_pat *p) > { > pcre_free(p->pcre1_extra_info); > } > - pcre_free((void *)p->pcre1_tables); > + pcre_free((void *)p->pcre_tables); > } > #else /* !USE_LIBPCRE1 */ > static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt > *opt) > diff --git a/grep.h b/grep.h > index 1875880f37..d34f66b384 100644 > --- a/grep.h > +++ b/grep.h > @@ -89,7 +89,7 @@ struct grep_pat { > pcre *pcre1_regexp; > pcre_extra *pcre1_extra_info; > pcre_jit_stack *pcre1_jit_stack; > - const unsigned char *pcre1_tables; > + const unsigned char *pcre_tables; > int pcre1_jit_on; > pcre2_code *pcre2_pattern; > pcre2_match_data *pcre2_match_data;
Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
On Fri, Jul 26 2019, Ævar Arnfjörð Bjarmason wrote: > On Fri, Jul 26 2019, Junio C Hamano wrote: > >> Ævar Arnfjörð Bjarmason writes: >> >>> diff --git a/Makefile b/Makefile >>> index bd246f2989..dd38d5e527 100644 >>> --- a/Makefile >>> +++ b/Makefile >>> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o >>> TEST_BUILTINS_OBJS += test-online-cpus.o >>> TEST_BUILTINS_OBJS += test-parse-options.o >>> TEST_BUILTINS_OBJS += test-path-utils.o >>> +TEST_BUILTINS_OBJS += test-pcre2-config.o >> >> This won't even build with any released pcre version; shouldn't we >> make it at least conditionally compiled code? Specifically... >> >>> TEST_BUILTINS_OBJS += test-pkt-line.o >>> TEST_BUILTINS_OBJS += test-prio-queue.o >>> TEST_BUILTINS_OBJS += test-reach.o >>> diff --git a/grep.c b/grep.c >>> index c7c06ae08d..8b8b9efe12 100644 >>> --- a/grep.c >>> +++ b/grep.c >>> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, >>> const struct grep_opt *opt >>> } >>> if (!opt->ignore_locale && is_utf8_locale() && >>> has_non_ascii(p->pattern) && >>> !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>> - options |= PCRE2_UTF; >>> + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >>> >>> p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, >>> p->patternlen, options, &error, >>> &erroffset, >>> diff --git a/grep.h b/grep.h >>> index c0c71eb4a9..506f05b97b 100644 >>> --- a/grep.h >>> +++ b/grep.h >>> @@ -21,6 +21,9 @@ typedef int pcre_extra; >>> #ifdef USE_LIBPCRE2 >>> #define PCRE2_CODE_UNIT_WIDTH 8 >>> #include >>> +#ifndef PCRE2_MATCH_INVALID_UTF >>> +#define PCRE2_MATCH_INVALID_UTF 0 >>> +#endif >> >> ... unlike this piece of code ... 
>> >>> #else >>> typedef int pcre2_code; >>> typedef int pcre2_match_data; >>> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c >>> new file mode 100644 >>> index 00..5258fdddba >>> --- /dev/null >>> +++ b/t/helper/test-pcre2-config.c >>> @@ -0,0 +1,12 @@ >>> +#include "test-tool.h" >>> +#include "cache.h" >>> +#include "grep.h" >>> + >>> +int cmd__pcre2_config(int argc, const char **argv) >>> +{ >>> + if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) { >>> + int value = PCRE2_MATCH_INVALID_UTF; >> >> ... this part does not have any fallback definition. > > It works because we include grep.h, which'll define > PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested > this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works > & runs/skips the appropriate tests. Ah, I spoke too soon, of course that's all guarded by "are we using PCRE v2 in general?". I'll fix it...
Re: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
On Fri, Jul 26 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> diff --git a/grep.c b/grep.c >> index 6d60e2e557..5bc0f4f32a 100644 >> --- a/grep.c >> +++ b/grep.c >> @@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct >> grep_opt *opt) >> die(_("given pattern contains NULL byte (via -f ). This >> is only supported with -P under PCRE v2")); >> >> p->is_fixed = is_fixed(p->pattern, p->patternlen); >> +#ifdef USE_LIBPCRE2 >> + if (!p->fixed && !p->is_fixed) { >> + const char *no_jit = "(*NO_JIT)"; >> + const int no_jit_len = strlen(no_jit); >> + if (starts_with(p->pattern, no_jit) && >> + is_fixed(p->pattern + no_jit_len, >> +p->patternlen - no_jit_len)) >> + p->is_fixed = 1; > > It is unfortunate that is_fixed() takes a counted string. > Otherwise, using skip_prefix() to avoid "+no_jit_len" would have > made it much easier to read. i.e. > > /* an illustration that does not quite work */ > char *pattern_body; > if (skip_prefix(p->pattern, "(*NO_JIT)", &pattern_body) && > is_fixed(pattern_body)) > p->is_fixed = 1; Indeed, but then we couldn't use this for patterns that have NUL in them, which we otherwise support (and support here). So I think it's worth keeping it so it takes ptr+len. 
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 >> data' ' >> +printf "\\200\\n" >invalid-0x80 && >> +echo "ævar" >expected && >> +cat expected >>invalid-0x80 && >> +git add invalid-0x80 >> +' >> + >> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from >> invalid UTF-8 data' ' >> +git grep -h "var" invalid-0x80 >actual && >> +test_cmp expected actual && >> +git grep -h "(*NO_JIT)var" invalid-0x80 >actual && >> +test_cmp expected actual >> +' >> + >> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from >> invalid UTF-8 data' ' >> +test_might_fail git grep -h "æ" invalid-0x80 >actual && >> +test_cmp expected actual && >> +test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 && >> +test_cmp expected actual >> +' >> + >> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from >> invalid UTF-8 data with -i' ' >> +test_might_fail git grep -hi "Æ" invalid-0x80 >actual && >> +test_cmp expected actual && >> +test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 && >> +test_cmp expected actual >> +' >> + >> test_done
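To make the ptr+len argument concrete, here is a small sketch of stripping a "(*NO_JIT)" prefix from a counted pattern and then checking the remainder, so that a NUL inside the pattern is handled. The sketch_* names and the metacharacter set are assumptions for illustration, not git's own helpers:

```c
#include <assert.h>
#include <string.h>

/* Assumed metacharacter set; git's real check may differ. */
static const char sketch_meta[] = ".*+?[]()^$|{}\\";

static int sketch_is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (memchr(sketch_meta, s[i], sizeof(sketch_meta) - 1))
			return 0;
	return 1;
}

/*
 * Counted-string prefix check: on success, *out / *outlen describe the
 * part of the buffer after the prefix and 1 is returned; otherwise 0.
 */
static int sketch_skip_prefix_mem(const char *buf, size_t len,
				  const char *prefix,
				  const char **out, size_t *outlen)
{
	size_t prefixlen = strlen(prefix);

	if (len < prefixlen || memcmp(buf, prefix, prefixlen))
		return 0;
	*out = buf + prefixlen;
	*outlen = len - prefixlen;
	return 1;
}

static void sketch_demo(void)
{
	/* 9 bytes of prefix, then 'v', NUL, 'r': 12 bytes in total. */
	static const char with_nul[] = "(*NO_JIT)v\0r";
	const char *body;
	size_t bodylen;

	assert(sketch_skip_prefix_mem(with_nul, sizeof(with_nul) - 1,
				      "(*NO_JIT)", &body, &bodylen));
	assert(bodylen == 3);
	/* The byte after the NUL is still inspected. */
	assert(sketch_is_fixed(body, bodylen) == 1);

	assert(!sketch_skip_prefix_mem("var", 3, "(*NO_JIT)", &body, &bodylen));
}
```

A NUL-terminated variant would stop at the embedded NUL and never look at the trailing byte, which is the trade-off being discussed.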
Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
On Fri, Jul 26 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> diff --git a/Makefile b/Makefile >> index bd246f2989..dd38d5e527 100644 >> --- a/Makefile >> +++ b/Makefile >> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o >> TEST_BUILTINS_OBJS += test-online-cpus.o >> TEST_BUILTINS_OBJS += test-parse-options.o >> TEST_BUILTINS_OBJS += test-path-utils.o >> +TEST_BUILTINS_OBJS += test-pcre2-config.o > > This won't even build with any released pcre version; shouldn't we > make it at least conditionally compiled code? Specifically... > >> TEST_BUILTINS_OBJS += test-pkt-line.o >> TEST_BUILTINS_OBJS += test-prio-queue.o >> TEST_BUILTINS_OBJS += test-reach.o >> diff --git a/grep.c b/grep.c >> index c7c06ae08d..8b8b9efe12 100644 >> --- a/grep.c >> +++ b/grep.c >> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, >> const struct grep_opt *opt >> } >> if (!opt->ignore_locale && is_utf8_locale() && >> has_non_ascii(p->pattern) && >> !(!opt->ignore_case && (p->fixed || p->is_fixed))) >> -options |= PCRE2_UTF; >> +options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> >> p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, >> p->patternlen, options, &error, >> &erroffset, >> diff --git a/grep.h b/grep.h >> index c0c71eb4a9..506f05b97b 100644 >> --- a/grep.h >> +++ b/grep.h >> @@ -21,6 +21,9 @@ typedef int pcre_extra; >> #ifdef USE_LIBPCRE2 >> #define PCRE2_CODE_UNIT_WIDTH 8 >> #include >> +#ifndef PCRE2_MATCH_INVALID_UTF >> +#define PCRE2_MATCH_INVALID_UTF 0 >> +#endif > > ... unlike this piece of code ... 
> >> #else >> typedef int pcre2_code; >> typedef int pcre2_match_data; >> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c >> new file mode 100644 >> index 00..5258fdddba >> --- /dev/null >> +++ b/t/helper/test-pcre2-config.c >> @@ -0,0 +1,12 @@ >> +#include "test-tool.h" >> +#include "cache.h" >> +#include "grep.h" >> + >> +int cmd__pcre2_config(int argc, const char **argv) >> +{ >> +if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) { >> +int value = PCRE2_MATCH_INVALID_UTF; > > ... this part does not have any fallback definition. It works because we include grep.h, which'll define PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works & runs/skips the appropriate tests.
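The fallback-define mechanism described here can be sketched in isolation. In the snippet below no PCRE header is included on purpose, so the flag is "missing" and takes the fallback value. SKETCH_UTF and the function names are made-up stand-ins, not PCRE or git identifiers:

```c
#include <assert.h>

/* No PCRE2 header is included, so this fallback is what takes effect. */
#ifndef PCRE2_MATCH_INVALID_UTF
#define PCRE2_MATCH_INVALID_UTF 0
#endif

/* Hypothetical stand-in for the real PCRE2_UTF option bit. */
#define SKETCH_UTF 0x1u

/* With the fallback in place, OR-ing the flag in changes nothing. */
static unsigned sketch_utf_options(void)
{
	return SKETCH_UTF | PCRE2_MATCH_INVALID_UTF;
}

/* Mirrors the helper's logic: exit status 0 only if the flag is real. */
static int sketch_helper_status(void)
{
	int value = PCRE2_MATCH_INVALID_UTF;

	return !value;
}

static void sketch_demo(void)
{
	assert(sketch_utf_options() == SKETCH_UTF);
	assert(sketch_helper_status() == 1); /* "feature not available" */
}
```

This only works where the fallback is actually visible. In the grep.h hunk as posted, the #ifndef sits inside the USE_LIBPCRE2 branch, so a build without PCRE v2 would not see it.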
Re: [PATCH] grep: skip UTF8 checks explicitally
On Fri, Jul 26 2019, Carlo Arenas wrote: > On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason > wrote: >> I'm not sure what a real fix for that is. Part of it is probably 8/8 in >> the series I mention below, but more generally we'd need to be more >> encoding aware at a much higher callsite than "grep". So e.g. we'd know >> that we match "binary" data as not-UTF-8. Now we just throw arbitrary >> bytes around and hope something sticks. > > I haven't look yet at your proposed changes, but my gut feeling is that > the work to support invalid UTF in the yet unreleased PCRE version would > be needed as part of it, and therefore it might be better to keep PCRE > out of the main path until that gets released and can be relied upon. I'm hoping my 8-part series is good enough to move it forward, but as 48de2a768c ("grep: remove the kwset optimization", 2019-07-01) shows we can always just fall back on using regcomp instead. > kwset is not going away with this series anyway, regardless of the no-kwset > name on the branch. The larger context here is that this is the 1st step of a 2-step series to get rid of kwset. Whether I can pull that off successfully is another matter, but that's the plan. After it's applied we just use it in the pickaxe code, and it's relatively straightforward to convert that. See: https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/ >> > If we're already deciding to paper over things, I'd much rather prefer >> > the simpler patch, i.e. Carlo's. >> >> As I noted upthread PCRE's own docs promise undefined behavior and fire >> and brimstone if that patch is applied. Those last two not >> guaranteed. So we need another solution. 
> > in my original reply I mentioned I explicitly didn't do a test because of this > "undefined behavior", but I think it should be fair to mention that we are > already affected by that because using the JIT fast path does skip any > UTF-8 validations and is currently possible to get git into an infinite loop > or make it segfault when using PCRE. Right, this is a good point that we should take notice of. I.e. this is *not* a new bug per-se, you can do this on master and get a UTF-8 bug from git.git: git grep -P '(*NO_JIT)[æ]' > in that line, I am not sure I understand the pushback against making that > explicit since it only makes both codepaths behave the same (bugs and > risks of burning alike) Because with my kwset series we're getting a lot more users of this until-now obscure code, so we're finding old-but-new-to-us bugs. We've had this bug dating all the way back to Duy's 18547aacf5 ("grep/pcre: support utf-8", 2016-06-25). It was first released with git 2.10. So why are we getting list discussion about it *now*? Because my kwset series got merged to "next", and we apparently have a lot of users who'd use fixed-string git-grep under locales, but never used PCRE via -P explicitly before. So it's worth getting the semantics right. As noted in the E-Mail I linked to earlier my ulterior motive here is to get to a point where we'll funnel all regex matching through PCRE implicitly if it's available. We need to get these UTF-8 edge cases right. I don't know if my recent 8-part series gets us 100% there, but hopefully it at least gets us closer to it.
Re: [PATCH] grep: skip UTF8 checks explicitally
On Fri, Jul 26 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> FWIW what I meant was not that we'd run around and iconv() things, it >> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8 >> valid", which presumably would be the end result of something like that. >> >> Rather that this model of assuming that a UTF-8 pattern means we can >> consider everything in the repo UTF-8 in git-grep doesn't make sense. My >> kwset patches *revealed* that problem in a painful way, but it was there >> already. > > We already do assume that pathnames are UTF-8 (pathspecs on MacOS > are converted and then they are matched assuming that property). > Further, with the same mechanism, I think there is an assumption > that anything that comes from the command line is UTF-8 (and if I > recall correctly, doesn't the Windows port of Git force us to use > the same assumption---I recall we needed tests tweak for that). > > In the very very longer term, I do not think we would want to keep > the assumption that the text encoding of blobs is always UTF-8, and > it would be nice to extend the system, so that blob data could be > marked in some way to say "I'm in Big-5, and not in UTF-8, so please > treat me as such" and magically the needle and the haystack can be > made to agree, with iconv() either one of them. > > But I do not think the current topic to fix the immediate/imminent > breakage should not be distracted by that. Let's keep assuming that > any blob, when it is text, is UTF-8. > > And from that point of view, I think the two pieces of idea in your > earlier message does make sense. We can try to match as binary most > of the time, as UTF-8 would not let a valid UTF-8 needle match in > the haystack starting in the middle of a character. *nod* > When the user is trying to match case-insensitively, we know the > haystack in which the user is interested in finding the needle is > text, even though there may be non-text blobs as well. 
> > For example, "git grep -i 'foo' t/" may find a few png files under > the t/ directory. We do not care if they happen to contain Foo and > we do not mind if they appear or do not appear in the result. The > only two things we care about are (1) foo, Foo, FOO are found in the > text files under t/ and (2) the command does not die in the middle, > before processing all the files, only because a png file it found > were not UTF-8 valid. I think this part's a step too far, and not how e.g. GNU grep works. Peeking into binary data in a text grep is what people expect, e.g. because you might want to recursively grep mixed text/mp3s for an author. The text part of the mp3s means that metadata will be grepped for inside the binary files. Getting that right is hard around the edges though...
Re: [PATCH] grep: skip UTF8 checks explicitally
On Thu, Jul 25 2019, Johannes Schindelin wrote: > Hi Junio, > > On Thu, 25 Jul 2019, Junio C Hamano wrote: > >> Johannes Schindelin writes: >> >> >> OK, in short, barfing and stopping is a problem, but that flag is >> >> not the right knob to tweak. And the right knob ... >> >> >> >> > 1) We're oversupplying PCRE2_UTF now, and one such case is what's being >> >> > reported here. I.e. there's no reason I can think of for why a >> >> > fixed-string pattern should need PCRE2_UTF set when not combined >> >> > with --ignore-case. We can just not do that, but maybe I'm missing >> >> > something there. >> >> > >> >> > 2) We can do "try utf8, and fallback". A more advanced version of this >> >> > is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread) >> >> > does. I was thinking something closer to just carrying two compiled >> >> > patterns, and falling back on the ~PCRE2_UTF one if we get a >> >> > PCRE2_ERROR_UTF8_* error. >> >> >> >> ... lies somewhere along that line. I think that is very sensible. >> > >> > I am glad that everybody agrees with my original comment on ab/no-kwset >> > where I suggested that we should use our knowledge of the encoding of >> > the haystack and convert it to UTF-8 if we detect that the pattern is >> > UTF-8 encoded,... >> >> Please do not count me among "everybody", then. I did not think >> that Ævar meant to iconv the haystack when I wrote the message you >> are responding to, but if that was what he meant, I would not have >> said "very sensible". > > Okay, but in that case I cannot agree with your assessment that it is > very sensible. FWIW what I meant was not that we'd run around and iconv() things, it wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8 valid", which presumably would be the end result of something like that. Rather that this model of assuming that a UTF-8 pattern means we can consider everything in the repo UTF-8 in git-grep doesn't make sense. 
My kwset patches *revealed* that problem in a painful way, but it was there already. I'm not sure what a real fix for that is. Part of it is probably 8/8 in the series I mention below, but more generally we'd need to be more encoding aware at a much higher callsite than "grep". So e.g. we'd know that we match "binary" data as not-UTF-8. Now we just throw arbitrary bytes around and hope something sticks. > If we're already deciding to paper over things, I'd much rather prefer > the simpler patch, i.e. Carlo's. As I noted upthread PCRE's own docs promise undefined behavior and fire and brimstone if that patch is applied. Those last two are not guaranteed. So we need another solution. I've submitted https://public-inbox.org/git/20190726150818.6373-1-ava...@gmail.com/ just now. See what you think about it.
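The "try UTF-8, and fall back" idea (option 2 in the quoted text) can be sketched without PCRE at all. The following is only a toy model under loud assumptions: the two "matchers" are plain byte searches, the UTF-8 check is a simplified validator that does not reject every overlong or surrogate encoding, and all sketch_* and SKETCH_* names are invented here. It is meant to show the control flow, not a real implementation:

```c
#include <assert.h>
#include <string.h>

enum { SKETCH_ERR_UTF8 = -1, SKETCH_NOMATCH = 0, SKETCH_MATCH = 1 };

/* Simplified UTF-8 check: lead byte ranges and continuation bytes only. */
static int sketch_utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t n, k;

		if (c < 0x80)
			n = 0;
		else if (c >= 0xC2 && c <= 0xDF)
			n = 1;
		else if (c >= 0xE0 && c <= 0xEF)
			n = 2;
		else if (c >= 0xF0 && c <= 0xF4)
			n = 3;
		else
			return 0;
		if (len - i - 1 < n)
			return 0;
		for (k = 1; k <= n; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
		i += n + 1;
	}
	return 1;
}

/* Plain byte search standing in for an actual regex match. */
static int sketch_find(const char *hay, size_t haylen,
		       const char *needle, size_t nlen)
{
	size_t i;

	for (i = 0; i + nlen <= haylen; i++)
		if (!memcmp(hay + i, needle, nlen))
			return SKETCH_MATCH;
	return SKETCH_NOMATCH;
}

/* Stand-in for the UTF-mode pattern: refuses invalid subjects. */
static int sketch_match_utf(const char *hay, size_t haylen,
			    const char *needle, size_t nlen)
{
	if (!sketch_utf8_valid((const unsigned char *)hay, haylen))
		return SKETCH_ERR_UTF8;
	return sketch_find(hay, haylen, needle, nlen);
}

/* Try the UTF-mode matcher first, fall back to bytes on a UTF-8 error. */
static int sketch_match(const char *hay, size_t haylen,
			const char *needle, size_t nlen)
{
	int ret = sketch_match_utf(hay, haylen, needle, nlen);

	if (ret == SKETCH_ERR_UTF8)
		ret = sketch_find(hay, haylen, needle, nlen);
	return ret;
}

static void sketch_demo(void)
{
	/* A stray 0x80 byte, a newline, then "ævar" in UTF-8. */
	static const char invalid[] = "\200\n\303\246var\n";
	static const char valid[] = "\303\246var\n";

	assert(sketch_match_utf(invalid, sizeof(invalid) - 1, "var", 3) ==
	       SKETCH_ERR_UTF8);
	assert(sketch_match(invalid, sizeof(invalid) - 1, "var", 3) ==
	       SKETCH_MATCH);
	assert(sketch_match_utf(valid, sizeof(valid) - 1, "var", 3) ==
	       SKETCH_MATCH);
}
```

In a real version the two matchers would be the same pattern compiled with and without the UTF option, and the fallback would of course lose UTF-aware semantics such as case folding of non-ASCII characters.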
[PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
As discussed in the "grep: stess test PCRE v2 on invalid UTF-8 data"
commit leading up to this one there's a regression in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01) when matching UTF-8 data.

This ultimately isn't straightforward to just "fix", because the kwset
backend was so dumb about icase matching that we'd skip it entirely on
non-ASCII. See the code removed in 48de2a768c ("grep: remove the kwset
optimization", 2019-07-01).

Just going back to the C library for those isn't ideal, since it's
likely to be even dumber about these mixed-encoding cases. So let's
support this "properly" using the PCRE2_MATCH_INVALID_UTF flag.

This is new code that's not in any released PCRE v2 version, so we
might need a fix that emulates it somehow. I figure that with the
non-icase bug out of the way this case is obscure enough to tell
people "upgrade your PCRE v2 too!". It'll likely be released by the
time we release the git version this commit is part of.

We can't just use PCRE2_NO_UTF_CHECK instead for the reasons discussed
in [1].

1. https://public-inbox.org/git/87lfwn70nb@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 Makefile                        |  1 +
 grep.c                          |  2 +-
 grep.h                          |  3 +++
 t/helper/test-pcre2-config.c    | 12 ++++++++++++
 t/helper/test-tool.c            |  1 +
 t/helper/test-tool.h            |  1 +
 t/t7812-grep-icase-non-ascii.sh | 13 ++++++++++++-
 7 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

diff --git a/Makefile b/Makefile
index bd246f2989..dd38d5e527 100644
--- a/Makefile
+++ b/Makefile
@@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-reach.o
diff --git a/grep.c b/grep.c
index c7c06ae08d..8b8b9efe12 100644
--- a/grep.c
+++ b/grep.c
@@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
-		options |= PCRE2_UTF;
+		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
 					 p->patternlen, options, &error, &erroffset,
diff --git a/grep.h b/grep.h
index c0c71eb4a9..506f05b97b 100644
--- a/grep.h
+++ b/grep.h
@@ -21,6 +21,9 @@ typedef int pcre_extra;
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
 #include <pcre2.h>
+#ifndef PCRE2_MATCH_INVALID_UTF
+#define PCRE2_MATCH_INVALID_UTF 0
+#endif
 #else
 typedef int pcre2_code;
 typedef int pcre2_match_data;
diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
new file mode 100644
index 00..5258fdddba
--- /dev/null
+++ b/t/helper/test-pcre2-config.c
@@ -0,0 +1,12 @@
+#include "test-tool.h"
+#include "cache.h"
+#include "grep.h"
+
+int cmd__pcre2_config(int argc, const char **argv)
+{
+	if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
+		int value = PCRE2_MATCH_INVALID_UTF;
+		return !value;
+	}
+	return 1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index ce7e89028c..e022ce0e48 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -40,6 +40,7 @@ static struct test_cmd cmds[] = {
 	{ "online-cpus", cmd__online_cpus },
 	{ "parse-options", cmd__parse_options },
 	{ "path-utils", cmd__path_utils },
+	{ "pcre2-config", cmd__pcre2_config },
 	{ "pkt-line", cmd__pkt_line },
 	{ "prio-queue", cmd__prio_queue },
 	{ "reach", cmd__reach },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index f805bb39ae..acd8af2a9d 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -30,6 +30,7 @@ int cmd__oidmap(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
+int cmd__pcre2_config(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 531eb59d57..848d46e4f9 100755
--- a/t/t7812-grep-icase-non-as
[PATCH v2 1/8] grep: remove overly paranoid BUG(...) code
Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in f002532784 ("grep: print
the pcre2_jit_on value", 2019-07-22) incorrectly copy/pasted some of
it in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01).

Let's just remove this code. Being this paranoid about the
pcre2?_config() function itself being broken is crossing the line into
unreasonable paranoia.

Reported-by: Beat Bolli
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index 0937c5bfff..95af88cb74 100644
--- a/grep.c
+++ b/grep.c
@@ -394,14 +394,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on == 1) {
+	if (p->pcre1_jit_on) {
 		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
 		if (!p->pcre1_jit_stack)
 			die("Couldn't allocate PCRE JIT stack");
 		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	} else if (p->pcre1_jit_on != 0) {
-		BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 #endif
 }
@@ -510,7 +507,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 
 	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-	if (p->pcre2_jit_on == 1) {
+	if (p->pcre2_jit_on) {
 		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
 		if (jitret)
 			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
@@ -545,9 +542,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		if (!p->pcre2_match_context)
 			die("Couldn't allocate PCRE2 match context");
 		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
-	} else if (p->pcre2_jit_on != 0) {
-		BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-		    p->pcre2_jit_on);
 	}
 }
 
-- 
2.22.0.455.g172b71a6c5
[PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
Since my b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-07-01) we've been dying on invalid UTF-8 data when
grepping for fixed strings if the following are all true:

 * The subject string is non-ASCII (e.g. "ævar")
 * We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
 * We compiled with PCRE v2
 * That PCRE v2 did not have JIT support

The last of those is why this wasn't caught earlier, per pcre2jit(3):

    "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
    for validity. In the interests of speed, these checks do not
    happen on the JIT fast path, and if invalid data is passed, the
    result is undefined."

I.e. the subject being matched against our pattern was invalid, but we
were lucky and getting away with it on the JIT path, but the non-JIT
one is stricter.

This patch does nothing to fix that, instead we sneak in support for
fixed patterns starting with "(*NO_JIT)". This disables the PCRE v2
JIT with implicit fixed-string matching for testing, see
pcre2syntax(3) for the syntax.

This is technically a change in behavior, but it's so obscure that I
figured it was OK. We'd previously consider this an invalid regular
expression as regcomp() would die on it, now we feed it to the PCRE v2
fixed-string path. I thought this was better than introducing yet
another GIT_TEST_* environment variable.

We're also relying on a behavior of PCRE v2 that technically could
change, but I think the test coverage is worth dipping our toe into
some somewhat undefined behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c                          | 10 ++++++++++
 t/t7812-grep-icase-non-ascii.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/grep.c b/grep.c
index 6d60e2e557..5bc0f4f32a 100644
--- a/grep.c
+++ b/grep.c
@@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	p->is_fixed = is_fixed(p->pattern, p->patternlen);
+#ifdef USE_LIBPCRE2
+	if (!p->fixed && !p->is_fixed) {
+		const char *no_jit = "(*NO_JIT)";
+		const int no_jit_len = strlen(no_jit);
+		if (starts_with(p->pattern, no_jit) &&
+		    is_fixed(p->pattern + no_jit_len,
+			     p->patternlen - no_jit_len))
+			p->is_fixed = 1;
+	}
+#endif
 	if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 0c685d3598..96c3572056 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,4 +53,32 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
+	printf "\\200\\n" >invalid-0x80 &&
+	echo "ævar" >expected &&
+	cat expected >>invalid-0x80 &&
+	git add invalid-0x80
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
+	git grep -h "var" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
+	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.22.0.455.g172b71a6c5
[PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
1-3 here are a re-roll on "next". I figured that was easier for
everyone with the state of the in-flight patches, it certainly was for
me. Sorry Junio if this creates a mess for you.

4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
skip UTF8 checks explicitally" in
https://public-inbox.org/git/20190721183115.14985-1-care...@gmail.com/

As noted the bug isn't fully fixed until 8/8, and that patch relies on
unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
enough state to limp forward as noted in the rationale of those
commits.

Ævar Arnfjörð Bjarmason (8):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1
  grep: consistently use "p->fixed" in compile_regexp()
  grep: create a "is_fixed" member in "grep_pat"
  grep: stess test PCRE v2 on invalid UTF-8 data
  grep: do not enter PCRE2_UTF mode on fixed matching
  grep: optimistically use PCRE2_MATCH_INVALID_UTF

 Makefile                        |  1 +
 grep.c                          | 68 +++--
 grep.h                          | 13 ++-
 t/helper/test-pcre2-config.c    | 12 ++
 t/helper/test-tool.c            |  1 +
 t/helper/test-tool.h            |  1 +
 t/t7812-grep-icase-non-ascii.sh | 39 +++
 7 files changed, 80 insertions(+), 55 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

--
2.22.0.455.g172b71a6c5
[PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
At the start of this function we do:

    p->fixed = opt->fixed;

It's less confusing to use that variable consistently than to switch
back & forth between the two.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 9c2b259771..b94e998680 100644
--- a/grep.c
+++ b/grep.c
@@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-	if (opt->fixed || pat_is_fixed) {
+	if (p->fixed || pat_is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
 		if (pat_is_fixed) {
--
2.22.0.455.g172b71a6c5
[PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching
As discussed in the last commit, partially fix a bug introduced in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input
we'd die on e.g.:

    fatal: pcre2_match failed with error code -22: UTF-8 error:
    isolated byte with 0x80 bit set

when grepping a non-ASCII fixed string. This is a more general problem
that's hard to fix, but we can at least fix the most common case of
grepping for a fixed string without "-i". I can't think of a reason
why we'd turn on PCRE2_UTF when matching byte-for-byte like that.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c                          | 3 ++-
 t/t7812-grep-icase-non-ascii.sh | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 5bc0f4f32a..c7c06ae08d 100644
--- a/grep.c
+++ b/grep.c
@@ -472,7 +472,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= PCRE2_UTF;
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 96c3572056..531eb59d57 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -68,9 +68,9 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT
 '
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
-	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+	git grep -h "æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
-	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	git grep -h "(*NO_JIT)æ" invalid-0x80 &&
 	test_cmp expected actual
 '
 
--
2.22.0.455.g172b71a6c5
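The new condition for setting PCRE2_UTF is easier to reason about as a standalone predicate. The following is only a sketch under assumptions: struct sketch_opts and both function names are invented for illustration, the utf8_locale field stands in for the result of is_utf8_locale(), and sketch_has_non_ascii() is a guess at what has_non_ascii() checks (any byte with the high bit set).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of the grep_opt/grep_pat fields involved. */
struct sketch_opts {
	int ignore_locale; /* opt->ignore_locale */
	int utf8_locale;   /* assumed result of is_utf8_locale() */
	int ignore_case;   /* opt->ignore_case, i.e. "-i" */
	int fixed;         /* p->fixed, i.e. "-F" was given */
	int is_fixed;      /* p->is_fixed, pattern has no metacharacters */
};

/* Assumed behavior of has_non_ascii(): any byte with the 0x80 bit set. */
static int sketch_has_non_ascii(const char *s)
{
	assert(s != NULL);
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/* Returns 1 if PCRE2_UTF would be requested under the patched condition. */
static int sketch_wants_utf(const struct sketch_opts *o, const char *pattern)
{
	if (o->ignore_locale || !o->utf8_locale)
		return 0;
	if (!sketch_has_non_ascii(pattern))
		return 0;
	/* New in this patch: case-sensitive fixed strings match byte-for-byte. */
	if (!o->ignore_case && (o->fixed || o->is_fixed))
		return 0;
	return 1;
}
```

Under this reading, a fixed "æ" in a UTF-8 locale no longer enters UTF mode, but the same pattern with "-i" still does, which is why the "-i" test is left alone by this patch.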
[PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1
Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

    Test                                       HEAD~             HEAD
    --------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                 0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
    7820.7: perl grep '^how to'                0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
    7820.11: perl grep '[how] to'              0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'  0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'     0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

It will also implicitly re-enable UTF-8 validation for PCRE v1. As
noted in [1] we now have cases as a result where PCRE v1 is more eager
to error out. Subsequent patches will fix that for v2, and I think
it's fair to tell v1 users "just upgrade" and not worry about that
edge case for v1.

1. https://public-inbox.org/git/capuesphzj_uv9o1-ydpjnla_q-f7gwxz9g1gcy2pyayn8ri...@mail.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 28 +---
 grep.h |  5 -
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 4b1e917ac5..9c2b259771 100644
--- a/grep.c
+++ b/grep.c
@@ -394,12 +394,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on) {
-		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-		if (!p->pcre1_jit_stack)
-			die("Couldn't allocate PCRE JIT stack");
-		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	}
 #endif
 }
 
@@ -411,18 +405,9 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
-		ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				    eol - line, 0, flags, ovector,
-				    ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-	} else
-#endif
-	{
-		ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				eol - line, 0, flags, ovector,
-				ARRAY_SIZE(ovector));
-	}
+	ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+			eol - line, 0, flags, ovector,
+			ARRAY_SIZE(ovector));
 	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
 		die("pcre_exec failed with error code %d", ret);
 
@@ -439,14 +424,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
+	if (p->pcre1_jit_on)
 		pcre_free_study(p->pcre1_extra_info);
-		pcre_jit_stack_free(p->pcre1_jit_stack);
-	} else
+	else
 #endif
-	{
 		pcre_free(p->pcre1_extra_info);
-	}
 	pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index 4d8e300175..ce2d72571f 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -85,7 +81,6 @@ struct gre
[PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat"
This change paves the way for later using this value in the regex
compile functions themselves.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 7 +++
 grep.h | 1 +
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index b94e998680..6d60e2e557 100644
--- a/grep.c
+++ b/grep.c
@@ -606,7 +606,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
-	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -615,11 +614,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-	if (p->fixed || pat_is_fixed) {
+	p->is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
-		if (pat_is_fixed) {
+		if (p->is_fixed) {
 			compile_pcre2_pattern(p, opt);
 		} else {
 			/*
diff --git a/grep.h b/grep.h
index ce2d72571f..c0c71eb4a9 100644
--- a/grep.h
+++ b/grep.h
@@ -88,6 +88,7 @@ struct grep_pat {
 	pcre2_compile_context *pcre2_compile_context;
 	uint32_t pcre2_jit_on;
 	unsigned fixed:1;
+	unsigned is_fixed:1;
 	unsigned ignore_case:1;
 	unsigned word_regexp:1;
 };
--
2.22.0.455.g172b71a6c5
[PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2
As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2].

The reason I allocated a custom stack at all was this advice in
pcrejit(3) (same in pcre2jit(3)):

    "By default, it uses 32KiB on the machine stack. However, some
    large or complicated patterns need more than this"

Since we haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

    Test                                       HEAD~             HEAD
    --------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                 0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
    7820.7: perl grep '^how to'                0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
    7820.11: perl grep '[how] to'              0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'  0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'     0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

As noted in [3] there are known regexes that will fail with the lower
stack limit. The way GNU grep fixed it is interesting, although I
believe the implementation is overly verbose; they could make PCRE v2
handle that gradual re-allocation, that's what min/max memory is
for. So we might end up bringing this back, but I'm more inclined to
just kick such cases upstairs to PCRE maintainers as a bug, perhaps
they'll add some overall "just allocate more then" flag to make this
easier.

In any case there's no functional change here, we didn't have a custom
stack, so let's apply this first, we can always revert it later.

1. https://public-inbox.org/git/20190721194052.15440-1-care...@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.
3. https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bzxktfplgkgiu34htjklhm-cmee04f...@mail.gmail.com/

Reported-by: Carlo Marcelo Arenas Belón
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 10 --
 grep.h |  4
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index 95af88cb74..4b1e917ac5 100644
--- a/grep.c
+++ b/grep.c
@@ -534,14 +534,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 			p->pcre2_jit_on = 0;
 			return;
 		}
-
-		p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL);
-		if (!p->pcre2_jit_stack)
-			die("Couldn't allocate PCRE2 JIT stack");
-		p->pcre2_match_context = pcre2_match_context_create(NULL);
-		if (!p->pcre2_match_context)
-			die("Couldn't allocate PCRE2 match context");
-		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
 	}
 }
 
@@ -585,8 +577,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
 	pcre2_compile_context_free(p->pcre2_compile_context);
 	pcre2_code_free(p->pcre2_pattern);
 	pcre2_match_data_free(p->pcre2_match_data);
-	pcre2_jit_stack_free(p->pcre2_jit_stack);
-	pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index d35a137fcb..4d8e
Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 6:50 AM Ævar Arnfjörð Bjarmason
> wrote:
>>
>> On Fri, Jul 26 2019, Carlo Arenas wrote:
>>
>> > since this moves PCRE1 out of the JIT fast path,
>>
>> I think you're mostly replying to the wrong thread. None of the patches
>> I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
>> stack is resized, and for v2 some dead code removed.
>
> I didn't mean JIT was disabled, but that we are calling now the regular
> PCRE1 function which does UTF-8 validation (unlike the one used before)
>
>> > introduces the regression where git grep will abort if there is binary
>> > data or non UTF-8 text in the repository/log and should be IMHO hold
>> > out until a fix for that can be merged.
>>
>> You're talking about the kwset series, not this cleanup series.
>
> a combination of both (as seen in pu) and that will also happen in next if
> this series get merged there.
>
> before this cleanup series, a git compiled against PCRE1 and not using
> NO_LIBPCRE1_JIT will use the jit fast path function and therefore would
> have no problems with binary or non UTF-8 content in the repository, but
> will regress after.

I see. Yes you're right, I misread pcrejit(3) about how the "fast path
API" worked (or more accurately, misremembered).

Yes, this is now a new caveat. I have some patches on top of next I'm
about to send that hopefully make this whole thing less of a mess.
Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
On Fri, Jul 26 2019, Carlo Arenas wrote:

> since this moves PCRE1 out of the JIT fast path,

I think you're mostly replying to the wrong thread. None of the patches
I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
stack is resized, and for v2 some dead code removed.

> introduces the regression where git grep will abort if there is binary
> data or non UTF-8 text in the repository/log and should be IMHO hold
> out until a fix for that can be merged.

You're talking about the kwset series, not this cleanup series.

> this also needs additional changes to better support NO_LIBPCRE1_JIT,
> patch to follow

Looking forward to it, thanks!
Re: [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason writes:
>
>> Since we've haven't had any reports of users running into
>> PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
>> that we can just use the library defaults instead and drop this
>> code.
>
> Does everybody use pcre2 with JIT with Git these days, or only those
> who want to live near the bleeding edge?

My informal survey of various package recipes suggests that all the
big *nix distros are using it by default now, so we have a lot of users
in the wild, including in the just-released Debian stable.

So I'm confident that if there were issues with e.g. it dying on
patterns in practical use we'd have heard about them.

>> This won't change with the wider use of PCRE v2 in
>> ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
>> fixed string search is not a "large or complicated pattern".
>
> In any case, if we were not "using" the custom stack anyway for v2,
> this change does not hurt anybody, possibly other than those who
> will learn about pcre2 support by reading this message and experiments
> with larger patterns. And it should be simple to wire it back if it
> becomes necessary later.

*nod*
Re: [PATCH 0/3] grep: PCRE JIT fixes
On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason writes:
>
>> There's a couple of patches fixing mistakes in the JIT code I added
>> for PCRE in <20190722181923.21572-1-dev+...@drbeat.li> and
>> <20190721194052.15440-1-care...@gmail.com>
>>
>> This small series proposes to replace both of those. In both cases I
>> think we're better off just removing the relevant code. The commit
>> messages for the patches themselves make the case for that.
>
> I am not sure about the BUG() that practically never triggered so
> far (AFAICT, the check that guards the BUG() would trigger only if
> we later introduced a bug, calling the code to compile when we are
> not asked to do so)---wouldn't it be better to leave it in while
> there still are people who are touching the vicinity?

The BUG() in 1/3 is just checking if pcre2?_config() returns a boolean
when promised, so it amounts to black-box testing of that library.

I think code in that style is overly paranoid and verbose, it's
reasonable to just trust the library in that case.

I think the reason it ended up in the codebase in the first place was
converting some first-draft implementation I wrote where I was being
more paranoid about using the PCRE API as a black box.

> The other two I am perfectly OK with. It is easy to resurrect the
> support for v1 (which may not even be needed for long) and resurrect
> the support for v2 with Carlo's fix, if it later turns out that some
> users may need to use a more complex pattern.
>
> Thanks.
>
>> Ævar Arnfjörð Bjarmason (3):
>>   grep: remove overly paranoid BUG(...) code
>>   grep: stop "using" a custom JIT stack with PCRE v2
>>   grep: stop using a custom JIT stack with PCRE v1
>>
>>  grep.c | 46 ++
>>  grep.h |  9 -
>>  2 files changed, 6 insertions(+), 49 deletions(-)
Re: [PATCH] grep: skip UTF8 checks explicitally
On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>> wrote:
>> >
>> > So when PCRE2 complains about the top two bits not being 0x80, it fails
>> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>> > indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls
>> the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but
>> that it is not valid UTF-8, like all the png and a few txt files that
>> are not encoded as
>> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>>
>> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>> > does not happen here... But then, I don't need the `-I` option, and my
>> > output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you
>> should be able to reproduce if using a non JIT enabled PCRE2,
>> regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because
>> AFAIK there are no test cases using broken UTF-8 (the ones with binary data
>> are
>> actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

    If you know that your pattern is a valid UTF string, and you want
    to skip this check for performance reasons, you can set the
    PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing
    an invalid UTF string as a pattern is undefined. It may cause
    your program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it's based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are I'm willing to bet this ruins your day, e.g. PCRE would
"skip ahead" a 4-byte character because it sees a telltale U+1 through
U+10 start sequence, except that wasn't a character, it was some
arbitrary binary.

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
    reported here. I.e. there's no reason I can think of for why a
    fixed-string pattern should need PCRE2_UTF set when not combined
    with --ignore-case. We can just not do that, but maybe I'm missing
    something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
    does. I was thinking something closer to just carrying two compiled
    patterns, and falling back on the ~PCRE2_UTF one if we get a
    PCRE2_ERROR_UTF8_* error.

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.
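To make option 2) a bit more concrete, here's a rough sketch of the "carry two compiled patterns and fall back" control flow, with no PCRE dependency. Everything in it is hypothetical: the matcher callbacks stand in for the same pattern compiled with and without PCRE2_UTF, SKETCH_UTF_ERROR stands in for a PCRE2_ERROR_UTF8_* return, the toy matchers just search for the bytes "var", and toy_valid_utf8() is a deliberately simplified validator (it ignores overlong forms and surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SKETCH_NOMATCH = 0, SKETCH_MATCH = 1, SKETCH_UTF_ERROR = -1 };

typedef int (*sketch_match_fn)(const char *subject, size_t len);

/* Two "compiled" variants of one pattern, as hypothetical callbacks. */
struct sketch_pat {
	sketch_match_fn utf_matcher;  /* stands in for the PCRE2_UTF pattern */
	sketch_match_fn byte_matcher; /* stands in for the non-UTF pattern */
};

/* Try the UTF matcher first; retry byte-wise only on a UTF-8 error. */
static int sketch_match_with_fallback(const struct sketch_pat *p,
				      const char *subject, size_t len)
{
	int ret;

	assert(p && p->utf_matcher && p->byte_matcher);
	ret = p->utf_matcher(subject, len);
	if (ret == SKETCH_UTF_ERROR)
		ret = p->byte_matcher(subject, len);
	return ret;
}

/* Simplified UTF-8 check: lead byte shape and continuation bytes only. */
static int toy_valid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0, k, n;

	while (i < len) {
		if (s[i] < 0x80) n = 1;
		else if ((s[i] & 0xe0) == 0xc0) n = 2;
		else if ((s[i] & 0xf0) == 0xe0) n = 3;
		else if ((s[i] & 0xf8) == 0xf0) n = 4;
		else return 0; /* isolated byte with 0x80 bit set */
		if (n > len - i)
			return 0;
		for (k = 1; k < n; k++)
			if ((s[i + k] & 0xc0) != 0x80)
				return 0;
		i += n;
	}
	return 1;
}

/* Toy byte-wise matcher: fixed-string search for "var". */
static int toy_byte_matcher(const char *subject, size_t len)
{
	static const char needle[] = "var";
	const size_t n = sizeof(needle) - 1;
	size_t i;

	for (i = 0; i + n <= len; i++)
		if (!memcmp(subject + i, needle, n))
			return SKETCH_MATCH;
	return SKETCH_NOMATCH;
}

/* Toy UTF matcher: refuses invalid subjects, like the non-JIT path. */
static int toy_utf_matcher(const char *subject, size_t len)
{
	if (!toy_valid_utf8((const unsigned char *)subject, len))
		return SKETCH_UTF_ERROR;
	return toy_byte_matcher(subject, len);
}
```

The point of the design is that valid subjects never pay for the second attempt, and invalid ones degrade to byte-wise matching instead of a fatal error. Whether the byte-wise result is what the user wants under "-i" is exactly the open question discussed above.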
[PATCH 0/3] grep: PCRE JIT fixes
There's a couple of patches fixing mistakes in the JIT code I added
for PCRE in <20190722181923.21572-1-dev+...@drbeat.li> and
<20190721194052.15440-1-care...@gmail.com>

This small series proposes to replace both of those. In both cases I
think we're better off just removing the relevant code. The commit
messages for the patches themselves make the case for that.

Ævar Arnfjörð Bjarmason (3):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1

 grep.c | 46 ++
 grep.h |  9 -
 2 files changed, 6 insertions(+), 49 deletions(-)

--
2.22.0.455.g172b71a6c5
[PATCH 1/3] grep: remove overly paranoid BUG(...) code
Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in [1] incorrectly
copy/pasted some of it in 94da9193a6 ("grep: add support for PCRE v2",
2017-06-01).

Let's just remove it instead of fixing that bug. Being this paranoid
about what PCRE returns is crossing the line into unreasonable
paranoia.

1. https://public-inbox.org/git/20190722181923.21572-1-dev+...@drbeat.li/

Reported-by: Beat Bolli
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..be4282fef3 100644
--- a/grep.c
+++ b/grep.c
@@ -406,14 +406,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on == 1) {
+	if (p->pcre1_jit_on) {
 		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
 		if (!p->pcre1_jit_stack)
 			die("Couldn't allocate PCRE JIT stack");
 		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	} else if (p->pcre1_jit_on != 0) {
-		BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 #endif
 }
@@ -522,7 +519,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 
 	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-	if (p->pcre2_jit_on == 1) {
+	if (p->pcre2_jit_on) {
 		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
 		if (jitret)
 			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
@@ -557,9 +554,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		if (!p->pcre2_match_context)
 			die("Couldn't allocate PCRE2 match context");
 		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
-	} else if (p->pcre2_jit_on != 0) {
-		BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 }
 
--
2.22.0.455.g172b71a6c5
[PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

    Test                                       HEAD~             HEAD
    --------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                 0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
    7820.7: perl grep '^how to'                0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
    7820.11: perl grep '[how] to'              0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'  0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'     0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 28 +---
 grep.h |  5 -
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 20ce95270a..6b52fed53a 100644
--- a/grep.c
+++ b/grep.c
@@ -406,12 +406,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on) {
-		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-		if (!p->pcre1_jit_stack)
-			die("Couldn't allocate PCRE JIT stack");
-		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	}
 #endif
 }
 
@@ -423,18 +417,9 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
-		ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				    eol - line, 0, flags, ovector,
-				    ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-	} else
-#endif
-	{
-		ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				eol - line, 0, flags, ovector,
-				ARRAY_SIZE(ovector));
-	}
+	ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+			eol - line, 0, flags, ovector,
+			ARRAY_SIZE(ovector));
 	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
 		die("pcre_exec failed with error code %d", ret);
 
@@ -451,14 +436,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
+	if (p->pcre1_jit_on)
 		pcre_free_study(p->pcre1_extra_info);
-		pcre_jit_stack_free(p->pcre1_jit_stack);
-	} else
+	else
 #endif
-	{
 		pcre_free(p->pcre1_extra_info);
-	}
 	pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index a65f4a1ae1..a405fc870c 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -86,7 +82,6 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre1_regexp;
 	pcre_extra *pcre1_extra_info;
-	pcre_jit_stack *pcre1_jit_stack;
 	const unsigned char *pcre1_tables;
 	int pcre1_jit_on;
 	pcre2_code *pcre2_pattern;
--
2.22.0.455.g172b71a6c5
[PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2].

The reason I allocated a custom stack at all was this advice in
pcrejit(3) (same in pcre2jit(3)):

    "By default, it uses 32KiB on the machine stack. However, some
    large or complicated patterns need more than this"

Since we haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

    Test                                       HEAD~             HEAD
    --------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                 0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
    7820.7: perl grep '^how to'                0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
    7820.11: perl grep '[how] to'              0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'  0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'     0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

1. https://public-inbox.org/git/20190721194052.15440-1-care...@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.

Reported-by: Carlo Marcelo Arenas Belón
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 10 --
 grep.h |  4
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index be4282fef3..20ce95270a 100644
--- a/grep.c
+++ b/grep.c
@@ -546,14 +546,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 			p->pcre2_jit_on = 0;
 			return;
 		}
-
-		p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL);
-		if (!p->pcre2_jit_stack)
-			die("Couldn't allocate PCRE2 JIT stack");
-		p->pcre2_match_context = pcre2_match_context_create(NULL);
-		if (!p->pcre2_match_context)
-			die("Couldn't allocate PCRE2 match context");
-		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
 	}
 }
 
@@ -597,8 +589,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
 	pcre2_compile_context_free(p->pcre2_compile_context);
 	pcre2_code_free(p->pcre2_pattern);
 	pcre2_match_data_free(p->pcre2_match_data);
-	pcre2_jit_stack_free(p->pcre2_jit_stack);
-	pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 1875880f37..a65f4a1ae1 100644
--- a/grep.h
+++ b/grep.h
@@ -29,8 +29,6 @@ typedef int pcre_jit_stack;
 typedef int pcre2_code;
 typedef int pcre2_match_data;
 typedef int pcre2_compile_context;
-typedef int pcre2_match_context;
-typedef int pcre2_jit_stack;
 #endif
 #include "kwset.h"
 #include "thread-utils.h"
@@ -94,8 +92,6 @@ struct grep_pat {
 	pcre2_code *pcre2_pattern;
 	pcre2_match_data *pcre2_match_data;
 	pcre2_compile_context *pcre2_compile_context;
-	pcre2_match_context *pcre2_match_context;
-	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
 	kwset_t kws;
 	unsigned fixed:1;
--
2.22.0.455.g172b71a6c5
Re: [PATCH] grep: skip UTF8 checks explicitally
On Mon, Jul 22 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Sun, 21 Jul 2019, Carlo Marcelo Arenas Belón wrote:
>
>> Usually PCRE is compiled with JIT support, and therefore the code
>> path used includes calling pcre2_jit_match (for PCRE2), that ignores
>> invalid UTF-8 in the corpus.
>>
>> Make that option explicit so it can be also used when JIT is not
>> enabled and pcre2_match is called instead, preventing `git grep`
>> to abort when hitting the first binary blob in a fixed match
>> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
>
> Good idea.
>
> The flag has been in PCRE1 since at least March 5, 2007, when the
> pcre.h.in file was first recorded in their Subversion repository:
> https://vcs.pcre.org/pcre/code/trunk/pcre.h.in?view=log
>
> It also was part of PCRE2 from the first revision (rev 4, in fact, where
> pcre2.h.in was added):
> https://vcs.pcre.org/pcre2/code/trunk/src/pcre2.h.in?view=log

Thanks for digging, that portability indeed sounds just fine.

> So I am fine with this patch.

I'm not, though not because I dislike the approach; I just haven't made up my mind yet.

I stopped paying attention to this error-with-not-JIT discussion when I heard that some other series going into next for Windows fixed that issue[1]. But now we have it again in some form?

My ab/no-kwset has a lot of tests for encodings & locales combined with grep, don't some of those trigger this? If so we should make any such failure a test & part of this patch.

Right now we don't have the info of whether we're really using the JIT or not, but that would be easy to add to grep's --debug mode for use in a test prereq.

As noted in [2] I'd be inclined to go the other way. If we indeed have some cases where PCRE skips its own checks, does not dying actually give us anything useful? I'd think not, so just ignoring the issue seems like the wrong thing to do. Surely we're not producing useful grep results at that point, so just not dying and mysteriously returning either nothing or garbage isn't going to help much...

1. https://public-inbox.org/git/xmqq4l3wxk8j@gitster-ct.c.googlers.com/
2. https://public-inbox.org/git/87pnms7kv0@evledraar.gmail.com/

> Thanks,
> Dscho
>
>> ---
>> grep.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index fc0ed73ef3..146093f590 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p,
>> const struct grep_opt *opt)
>> static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>> regmatch_t *match, int eflags)
>> {
>> -int ovector[30], ret, flags = 0;
>> +int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
>>
>> if (eflags & REG_NOTBOL)
>> flags |= PCRE_NOTBOL;
>> @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p,
>> const struct grep_opt *opt
>> static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>> regmatch_t *match, int eflags)
>> {
>> -int ret, flags = 0;
>> +int ret, flags = PCRE2_NO_UTF_CHECK;
>> PCRE2_SIZE *ovector;
>> PCRE2_UCHAR errbuf[256];
>>
>> --
>> 2.22.0
>>
>>
Re: [PATCH v2 1/1] gettext: always use UTF-8 on native Windows
On Wed, Jul 03 2019, Karsten Blees via GitGitGadget wrote:

> From: Karsten Blees
>
> On native Windows, Git exclusively uses UTF-8 for console output (both
> with MinTTY and native Win32 Console). Gettext uses `setlocale()` to
> determine the output encoding for translated text, however, MSVCRT's
> `setlocale()` does not support UTF-8. As a result, translated text is
> encoded in system encoding (as per `GetACP()`), and non-ASCII chars are
> mangled in console output.
>
> Side note: There is actually a code page for UTF-8: 65001. In practice,
> it does not work as expected at least on Windows 7, though, so we cannot
> use it in Git. Besides, if we overrode the code page, any process
> spawned from Git would inherit that code page (as opposed to the code
> page configured for the current user), which would quite possibly break
> e.g. diff or merge helpers. So we really cannot override the code page.
>
> In `init_gettext_charset()`, Git calls gettext's
> `bind_textdomain_codeset()` with the character set obtained via
> `locale_charset()`; Let's override that latter function to force the
> encoding to UTF-8 on native Windows.
>
> In Git for Windows' SDK, there is a `libcharset.h` and therefore we
> define `HAVE_LIBCHARSET_H` in the MINGW-specific section in
> `config.mak.uname`, therefore we need to add the override before that
> conditionally-compiled code block.
>
> Rather than simply defining `locale_charset()` to return the string
> `"UTF-8"`, though, we are careful not to break `LC_ALL=C`: the
> `ab/no-kwset` patch series, for example, needs to have a way to prevent
> Git from expecting UTF-8-encoded input.

It's not just the ab/no-kwset I have cooking (but happy to have this take that into account), but also anything grep-like is usually much faster with LC_ALL=C. Isn't that also the case on Windows?

Setting locales affects a large variety of libc functions and third party libraries (e.g. PCRE via us setting "use UTF-8" under locale).

> Signed-off-by: Karsten Blees
> Signed-off-by: Johannes Schindelin
> ---
> gettext.c | 20 +++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/gettext.c b/gettext.c
> index d4021d690c..3f2aca5c3b 100644
> --- a/gettext.c
> +++ b/gettext.c
> @@ -12,7 +12,25 @@
> #ifndef NO_GETTEXT
> #include <locale.h>
> #include <libintl.h>
> -#ifdef HAVE_LIBCHARSET_H
> +#ifdef GIT_WINDOWS_NATIVE
> +
> +static const char *locale_charset(void)
> +{
> + const char *env = getenv("LC_ALL"), *dot;
> +
> + if (!env || !*env)
> + env = getenv("LC_CTYPE");
> + if (!env || !*env)
> + env = getenv("LANG");
> +
> + if (!env)
> + return "UTF-8";
> +
> + dot = strchr(env, '.');
> + return !dot ? env : dot + 1;
> +}
> +
> +#elif defined HAVE_LIBCHARSET_H
> #include <libcharset.h>
> #else
> #include <langinfo.h>

I'll take it on faith that this is what the locale_charset() should look like.

I wonder if it wouldn't be better to always compile this function, and just have init_gettext_charset() switch between the two. We've moved more towards that sort of thing (e.g. with pthreads). I.e. prefer redundant compilation to ifdefing platform-only code (which then only gets compiled there). See "HAVE_THREADS" in the code.

It looks to me that with this patch the HAVE_LIBCHARSET_H docs in "Makefile" become wrong. Shouldn't those be updated too?

We also still pass -DHAVE_LIBCHARSET_H to every file we compile, only to never use it under GIT_WINDOWS_NATIVE, but perhaps fixing that isn't possible with GIT_WINDOWS_NATIVE being a macro, and perhaps I've again gotten the "native" vs. "mingw" etc. relationship wrong in my head and the HAVE_LIBCHARSET_H docs are fine.

It just seems wrong that we have both the configure script & config.mak.uname look for / declare that we have libcharset.h, only to at this late point not use libcharset.h at all. Couldn't we just know if GIT_WINDOWS_NATIVE will be true earlier & move that check up, so it & HAVE_LIBCHARSET_H can be mutually exclusive (with accompanying #error if we have both)?
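To make the LC_ALL=C point concrete, here is a minimal sketch of the lookup logic in the quoted locale_charset(), rewritten to take the three values as arguments so it can be exercised without touching the environment. The names first_locale() and charset_from_locale() are made up for this illustration and are not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): pick the first non-empty
 * value among LC_ALL, LC_CTYPE and LANG, in that order.
 */
static const char *first_locale(const char *lc_all, const char *lc_ctype,
				const char *lang)
{
	const char *env = lc_all;

	if (!env || !*env)
		env = lc_ctype;
	if (!env || !*env)
		env = lang;
	return env;
}

/*
 * Same decision as the quoted locale_charset(), with the environment
 * values passed in explicitly.
 */
static const char *charset_from_locale(const char *lc_all,
				       const char *lc_ctype,
				       const char *lang)
{
	const char *env = first_locale(lc_all, lc_ctype, lang);
	const char *dot;

	if (!env)
		return "UTF-8";		/* nothing set: force UTF-8 */

	dot = strchr(env, '.');
	return !dot ? env : dot + 1;	/* "C" stays "C", "x.Y" gives "Y" */
}
```

With LC_ALL=C the function hands back "C" rather than "UTF-8", which is the escape hatch the commit message describes; a value such as "is_IS.ISO-8859-1" yields the part after the dot.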
Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
On Mon, Jul 01 2019, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason writes: > >> This v3 has a new patch (3/10) that I believe fixes the regression on >> MinGW Johannes noted in >> https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/ >> >> As noted in the updated commit message in 10/10 I believe just >> skipping this test & documenting this in a commit message is the least >> amount of suck for now. It's really an existing issue with us doing >> nothing sensible when the log/grep haystack encoding doesn't match the >> needle encoding supplied via the command line. > > Is that quite the case? If they do not match, not finding the match > is the right answer, because we are byte-for-byte matching/searching > IIUC. > >> We swept that under the carpet with the kwset backend, but PCRE v2 >> exposes it. > > Is it exposing, or just showing the limitation of the rewritten > implementation where it cannot do byte-for-byte matching/searching > as we used to be able to? > > Without having a way to know what encoding is used on the command > line, there is no sensible way to reencode them to match the > haystack encoding (even when it is known), so "you got to feed the > strings in the same encoding, as we are going to match/search > byte-for-byte" is the only sensible way to work, given the design > space, I would think. > > Not that it is all that useful to be able to match/search > byte-for-byte, of course, so I am OK if we punt with these tests, > but I'd prefer to see us admit we are punting when we do ;-). I'm guilty as charged in punting this larger encoding issue. As it pertains to this patch series it unearths an obscure case I think nobody cares about in practice, and I'd like to move on with the "remove kwset" optimization. 
But I strongly believe that the new behavior with the PCRE v2 optimization is the only sane thing to do, and to the extent we have anything left to do (#leftoverbits) it's that we should modify git more generally (aside from string searching) to do the same thing where appropriate.

Remember, this only happens if the user has set a UTF-8 locale and thus promised that they're going to give us UTF-8. We then take that promise and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because then "æ" won't match "Æ" under --ignore-case, and there are other cases like that with matching word boundaries & other Unicode gotchas.

The best that can be hoped for at that point is some "loose UTF-8" mode. I see both perl & GNU grep seem to support that (although I'm sure it falls apart at some point).

GNU grep will also die in the same way that we now die with --perl-regexp (since it also uses PCRE). I think that's saner: if the user thinks they're feeding us UTF-8 but they're not, I think they'd like to know rather than having the string matching library fall back.
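A small sketch of why raw byte matching cannot give "æ" == "Æ" under --ignore-case: in UTF-8 the two letters are the byte pairs C3 A6 and C3 86, and a matcher that only knows ASCII case folding sees two different bytes. The helpers below are made up for this illustration and do not correspond to any function in grep.c.

```c
#include <stddef.h>

/* ASCII-only lowercasing, the most a byte-oriented matcher can do. */
static unsigned char ascii_lower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Returns 1 if the two buffers are equal when ignoring ASCII case only. */
static int bytes_equal_icase(const char *a, const char *b, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (ascii_lower((unsigned char)a[i]) !=
		    ascii_lower((unsigned char)b[i]))
			return 0;
	return 1;
}
```

Comparing "INT" with "int" succeeds, while comparing "\xC3\xA6" (æ) with "\xC3\x86" (Æ) fails on the second byte; matching those needs an engine that decodes the bytes as UTF-8 first.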
Re: [PATCH v3 1/3] repo-settings: create core.featureAdoptionRate setting
On Tue, Jul 02 2019, Duy Nguyen wrote: > On Mon, Jul 1, 2019 at 10:32 PM Derrick Stolee via GitGitGadget > wrote: >> @@ -601,3 +602,22 @@ core.abbrev:: >> in your repository, which hopefully is enough for >> abbreviated object names to stay unique for some time. >> The minimum length is 4. >> + >> +core.featureAdoptionRate:: >> + Set an integer value on a scale from 0 to 10 describing your >> + desire to adopt new performance features. Defaults to 0. As >> + the value increases, features are enabled by changing the >> + default values of other config settings. If a config variable >> + is specified explicitly, the explicit value will override these >> + defaults: > > This is because I'd like to keep core.* from growing too big (it's > already big), hard to read, search and maintain. Perhaps this should > belong to a separate group? Something like tuning.something or > defaults.something. The main thing users look at is "man git-config" (or its web rendering) which renders it all in one page anyway. I think in general adding more things to core.* sucks less than explaining the special-case that "tuning.*" isn't a config for git-tuning(1) (although we have some of that already, e.g. with trace2.*). Documentation/config/core.txt is ~600 lines. Maybe it would be a good idea to split it up, similar to your split of Documentation/config/*.txt, but let's not conflate how we'd like to maintain stuff in git.git with a config interface we expose externally. It's going to be very confusing for users if some settings that otherwise would be in core aren't there because a file in git.git was "too big" at the time. Users (mostly) aren't going to know/care in what chronological order we added config keys.
Re: [PATCH v2 1/3] repo-settings: create core.featureAdoptionRate setting
On Wed, Jun 19 2019, Derrick Stolee via GitGitGadget wrote:

> core.commitGraph::
> If true, then git will read the commit-graph file (if it exists)
> - to parse the graph structure of commits. Defaults to false. See
> + to parse the graph structure of commits. Defaults to false, unless
> + `core.featureAdoptionRate` is at least three. See
> linkgit:git-commit-graph[1] for more information.
>
> core.useReplaceRefs::
> @@ -601,3 +602,21 @@ core.abbrev::
> in your repository, which hopefully is enough for
> abbreviated object names to stay unique for some time.
> The minimum length is 4.
> +
> +core.featureAdoptionRate::
> + Set an integer value on a scale from 0 to 10 describing your
> + desire to adopt new performance features. Defaults to 0. As
> + the value increases, features are enabled by changing the
> + default values of other config settings. If a config variable
> + is specified explicitly, the explicit value will override these
> + defaults:
> ++
> +If the value is at least 3, then the following defaults are modified.
> +These represent relatively new features that have existed for multiple
> +major releases, and present significant performance benefits. They do
> +not modify the user-facing output of porcelain commands.
> ++
> +* `core.commitGraph=true` enables reading commit-graph files.
> ++
> +* `gc.writeCommitGraph=true` enables writing commit-graph files during

I barked up a similar tree in https://public-inbox.org/git/cacbzzx5sbyo5fvptk6lw1ff96nr5591rhhc-5wdjw-fmg1r...@mail.gmail.com/

I wonder if you've seen that & what you think about that approach. I.e. have a core.version=2.28 (or core.version=+6) or whatever to opt-in to features we'd make default in 2.28. Would that be your core.featureAdoptionRate=6 (28-22 = 6)?

I admit that question is partly rhetorical, because I think it suggests how hard it would be for users to reason about this.

The "core.version" idea also sucks, but at least it's bound to our advertised version number, so it's obvious if you set it to e.g. +2 what feature track you're on, and furthermore when we'd commit to making that the default for users who don't set core.version (although we could of course always change our minds...).

It's also something that mirrors how e.g. Perl, C compilers (with --std=*) treat this sort of thing.

So I'm all for a facility to have a setting to collectively opt-in to new things early. But I think for such a thing a) we really should at least in principle commit to making those things the default eventually (if they don't suck) b) it needs to be obvious to the user how the "rate" relates to git releases.

This "core.featureAdoptionRate" value seems more like zlib compression values & unrelated to release numbers. It's also for "performance features" only but squats a more general name.

I suggested "core.version" & then "core.uiVersion" (in https://public-inbox.org/git/87pnunxz5i@evledraar.gmail.com/).

Regardless of whether we want to pin opt-in early-bird features to version numbers in some way (which I think is a good idea, but maybe others disagree), I think if it's "just performance" it's good to put that in the key name, in such a way that we can also have "early UI" features, or other non-UI non-performance ones.

Thanks for working on this!
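For readers following the semantics under discussion, here is a rough sketch of how the proposed documentation reads: an explicitly configured value always wins, and otherwise the default flips to true once the rate reaches 3. The struct, enum and function names are hypothetical and only illustrate the rule; they are not taken from the patch series.

```c
/* Tri-state for a boolean config key: unset, or explicitly false/true. */
enum cfg_bool { CFG_UNSET = -1, CFG_FALSE = 0, CFG_TRUE = 1 };

/* Hypothetical holder for the three settings named in the quoted docs. */
struct repo_cfg {
	int feature_adoption_rate;		/* 0..10, defaults to 0 */
	enum cfg_bool core_commit_graph;	/* core.commitGraph */
	enum cfg_bool gc_write_commit_graph;	/* gc.writeCommitGraph */
};

/* Explicit value wins; otherwise the default depends on the rate. */
static int effective(enum cfg_bool explicit_value, int rate, int threshold)
{
	if (explicit_value != CFG_UNSET)
		return explicit_value == CFG_TRUE;
	return rate >= threshold;
}

static int use_commit_graph(const struct repo_cfg *cfg)
{
	return effective(cfg->core_commit_graph, cfg->feature_adoption_rate, 3);
}

static int write_commit_graph(const struct repo_cfg *cfg)
{
	return effective(cfg->gc_write_commit_graph, cfg->feature_adoption_rate, 3);
}
```

Note that nothing in this rule ties the threshold to a release number, which is the usability concern raised above: a user has to look up what "3" means, whereas a version-based key would carry that meaning in its value.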
Re: ab/no-kwset, was Re: What's cooking in git.git (Jun 2019, #07; Fri, 28)
On Mon, Jul 01 2019, Johannes Schindelin wrote: > Hi Junio & Ævar, > > On Fri, 28 Jun 2019, Junio C Hamano wrote: > >> * ab/no-kwset (2019-06-28) 9 commits >> - grep: use PCRE v2 for optimized fixed-string search >> - grep: remove the kwset optimization >> - grep: drop support for \0 in --fixed-strings >> - grep: make the behavior for NUL-byte in patterns sane >> - grep tests: move binary pattern tests into their own file >> - grep tests: move "grep binary" alongside the rest >> - grep: inline the return value of a function call used only once >> - grep: don't use PCRE2?_UTF8 with "log --encoding=" >> - log tests: test regex backends in "--encode=" tests >> >> Retire use of kwset library, which is an optimization for looking >> for fixed strings, with use of pcre2 JIT. >> >> Will merge to 'next'. > > There is still a test failure that I am not sure how Ævar wants to > address: > > https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab CC'd you there, but as a note here: I believe my v3 sent just now fixes this: https://public-inbox.org/git/20190701212100.27850-1-ava...@gmail.com/
[PATCH v3 08/10] grep: drop support for \0 in --fixed-strings
Change "-f <file>" to not support patterns with a NUL-byte in them under --fixed-strings. We'll now only support these under "--perl-regexp" with PCRE v2. A previous change to grep's documentation changed the description of "-f <file>" to be vague enough as to not promise that this would work. By dropping support for this we make it a whole lot easier to move away from the kwset backend, which we'll do in a subsequent change. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 6 +-- t/t7816-grep-binary-pattern.sh | 82 +- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/grep.c b/grep.c index d6603bc950..8d0fff316c 100644 --- a/grep.c +++ b/grep.c @@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2")); + /* * Even when -F (fixed) asks us to do a non-regexp search, we * may not be able to correctly case-fold when -i @@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) return; } - if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) - die(_("given pattern contains NULL byte (via -f <file>).
This is only supported with -P under PCRE v2")); - if (opt->fixed) { /* * We come here when the pattern has the non-ascii diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 9e09bd5d6a..60bab291e4 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -60,23 +60,23 @@ test_expect_success 'setup' " " # Simple fixed-string matching that can use kwset (no -i && non-ASCII) -nul_match 1 1 1 '-F' 'yQf' -nul_match 0 0 0 '-F' 'yQx' -nul_match 1 1 1 '-Fi' 'YQf' -nul_match 0 0 0 '-Fi' 'YQx' -nul_match 1 1 1 '' 'yQf' -nul_match 0 0 0 '' 'yQx' -nul_match 1 1 1 '' 'æQð' -nul_match 1 1 1 '-F' 'eQm[*]c' -nul_match 1 1 1 '-Fi' 'EQM[*]C' +nul_match P P P '-F' 'yQf' +nul_match P P P '-F' 'yQx' +nul_match P P P '-Fi' 'YQf' +nul_match P P P '-Fi' 'YQx' +nul_match P P 1 '' 'yQf' +nul_match P P 0 '' 'yQx' +nul_match P P 1 '' 'æQð' +nul_match P P P '-F' 'eQm[*]c' +nul_match P P P '-Fi' 'EQM[*]C' # Regex patterns that would match but shouldn't with -F -nul_match 0 0 0 '-F' 'yQ[f]' -nul_match 0 0 0 '-F' '[y]Qf' -nul_match 0 0 0 '-Fi' 'YQ[F]' -nul_match 0 0 0 '-Fi' '[Y]QF' -nul_match 0 0 0 '-F' 'æQ[ð]' -nul_match 0 0 0 '-F' '[æ]Qð' +nul_match P P P '-F' 'yQ[f]' +nul_match P P P '-F' '[y]Qf' +nul_match P P P '-Fi' 'YQ[F]' +nul_match P P P '-Fi' '[Y]QF' +nul_match P P P '-F' 'æQ[ð]' +nul_match P P P '-F' '[æ]Qð' # The -F kwset codepath can't handle -i && non-ASCII... 
nul_match P 1 1 '-i' '[æ]Qð' @@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð' nul_match P 0 1 '-i' 'ÆQÐ' # \0 in regexes can only work with -P & PCRE v2 -nul_match P 1 1 '' 'yQ[f]' -nul_match P 1 1 '' '[y]Qf' -nul_match P 1 1 '-i' 'YQ[F]' -nul_match P 1 1 '-i' '[Y]Qf' -nul_match P 1 1 '' 'æQ[ð]' -nul_match P 1 1 '' '[æ]Qð' -nul_match P 0 1 '-i' 'ÆQ[Ð]' -nul_match P 1 1 '' 'eQm.*cQ' -nul_match P 1 1 '-i' 'EQM.*cQ' -nul_match P 0 0 '' 'eQm[*]c' -nul_match P 0 0 '-i' 'EQM[*]C' +nul_match P P 1 '' 'yQ[f]' +nul_match P P 1 '' '[y]Qf' +nul_match P P 1 '-i' 'YQ[F]' +nul_match P P 1 '-i' '[Y]Qf' +nul_match P P 1 '' 'æQ[ð]' +nul_match P P 1 '' '[æ]Qð' +nul_match P P 1 '-i' 'ÆQ[Ð]' +nul_match P P 1 '' 'eQm.*cQ' +nul_match P P 1 '-i' 'EQM.*cQ' +nul_match P P 0 '' 'eQm[*]c' +nul_match P P 0
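The check this patch moves to the top of compile_regexp() can be sketched in isolation as follows. The function names are invented for this example; only the memchr() test and the "allowed only with PCRE v2" rule come from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the pattern contains a NUL byte. The length has to be
 * passed explicitly: strlen() would stop at the first NUL and never
 * see the bytes after it.
 */
static int has_nul(const char *pattern, size_t len)
{
	return memchr(pattern, 0, len) != NULL;
}

/*
 * Mirrors the condition in the patch: a NUL in the pattern is only
 * acceptable when the PCRE v2 backend was requested.
 */
static int pattern_is_supported(const char *pattern, size_t len, int pcre2)
{
	return !has_nul(pattern, len) || pcre2;
}
```

A three-byte pattern such as "y\0f" is rejected without PCRE v2 and accepted with it, which corresponds to the "P" entries replacing the 0/1 expectations in the first two columns of the updated test table.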
[PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search
7.39(6.99+0.33) 7.00(6.68+0.25) -5.3% 4221.13: extended log --grep='æ' 7.34(7.00+0.25) 7.15(6.81+0.31) -2.6% 4221.14: perl log --grep='æ' 7.43(7.13+0.26) 7.01(6.60+0.36) -5.7% log with -i: Testorigin/master HEAD 4221.1: fixed log -i --grep='int' 7.31(7.07+0.24) 7.23(7.00+0.22) -1.1% 4221.2: basic log -i --grep='int' 7.40(7.08+0.28) 7.19(6.92+0.20) -2.8% 4221.3: extended log -i --grep='int'7.43(7.13+0.25) 7.27(6.99+0.21) -2.2% 4221.4: perl log -i --grep='int'7.34(7.10+0.24) 7.10(6.90+0.19) -3.3% 4221.6: fixed log -i --grep='uncommon' 7.07(6.71+0.32) 7.11(6.77+0.28) +0.6% 4221.7: basic log -i --grep='uncommon' 6.99(6.64+0.28) 7.12(6.69+0.38) +1.9% 4221.8: extended log -i --grep='uncommon' 7.11(6.74+0.32) 7.10(6.77+0.27) -0.1% 4221.9: perl log -i --grep='uncommon' 6.98(6.60+0.29) 7.05(6.64+0.34) +1.0% 4221.11: fixed log -i --grep='æ'7.85(7.45+0.34) 7.03(6.68+0.32) -10.4% 4221.12: basic log -i --grep='æ'7.87(7.49+0.29) 7.06(6.69+0.31) -10.3% 4221.13: extended log -i --grep='æ' 7.87(7.54+0.31) 7.09(6.69+0.31) -9.9% 4221.14: perl log -i --grep='æ' 7.06(6.77+0.28) 6.91(6.57+0.31) -2.1% So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string search", 2019-06-26) there's a huge improvement in performance for "grep", but in "log" most of our time is spent elsewhere, so we don't notice it that much. 
Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 51 +-- 1 file changed, 49 insertions(+), 2 deletions(-) diff --git a/grep.c b/grep.c index 4468519d5c..fc0ed73ef3 100644 --- a/grep.c +++ b/grep.c @@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } +static int is_fixed(const char *s, size_t len) +{ + size_t i; + + for (i = 0; i < len; i++) { + if (is_regex_special(s[i])) + return 0; + } + + return 1; +} + #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol, static void free_pcre2_pattern(struct grep_pat *p) { } -#endif /* !USE_LIBPCRE2 */ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) { @@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) compile_regexp_failed(p, errbuf); } } +#endif /* !USE_LIBPCRE2 */ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { int err; int regflags = REG_NEWLINE; + int pat_is_fixed; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; @@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - if (opt->fixed) { + pat_is_fixed = is_fixed(p->pattern, p->patternlen); + if (opt->fixed || pat_is_fixed) { +#ifdef USE_LIBPCRE2 + opt->pcre2 = 1; + if (pat_is_fixed) { + compile_pcre2_pattern(p, opt); + } else { + /* +* E.g. t7811-grep-open.sh relies on the +* pattern being restored. +*/ + char *old_pattern = p->pattern; + size_t old_patternlen = p->patternlen; + struct strbuf sb = STRBUF_INIT; + + /* +* There is the PCRE2_LITERAL flag, but it's +* only in PCRE v2 10.30 and later. 
Needing to +* ifdef our way around that and dealing with +* it + PCRE2_MULTILINE being an error is more +* complex than just quoting this ourselves. + */ + strbuf_add(&sb, "\\Q", 2); + strbuf_add(&sb, p->pattern, p->patternlen); + strbuf_add(&sb, "\\E", 2); + + p->pattern = sb.buf; + p->patternlen = sb.len; + compile_pcre2_pattern(p, opt); + p->pattern = old_pattern; + p->patternlen = old_patternlen; + strbuf_release(&sb); + } +#else /* !USE_LIBPCRE2 */ compile_fixed_regexp(p, opt); +#endif /* !USE_LIBPCRE2 */ return; } -- 2.22.0.455.g172b71a6c5
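As a standalone illustration of the two pieces this patch combines, the sketch below detects a pattern with no regex metacharacters and wraps an arbitrary literal in \Q...\E. The metacharacter set in is_special() is an assumption made for this example (git's own is_regex_special() is table-driven and may differ), and quote_literal() is a hypothetical stand-in for the strbuf code in the diff.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed metacharacter set, for illustration only. */
static int is_special(char c)
{
	/* strchr() would match the terminator for c == 0, so guard it. */
	return c != '\0' && strchr("$()*+.?[\\]^{|}", c) != NULL;
}

/* Returns 1 if no byte of the pattern is a regex metacharacter. */
static int is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (is_special(s[i]))
			return 0;
	return 1;
}

/*
 * Wrap a literal in \Q...\E so a PCRE-style engine treats every byte
 * as literal. Returns a malloc()ed NUL-terminated string, or NULL on
 * allocation failure. Caveat: a literal that itself contains "\E"
 * would end the quoting early; this sketch does not handle that.
 */
static char *quote_literal(const char *s, size_t len)
{
	char *out = malloc(len + 5);	/* "\Q" + s + "\E" + NUL */

	if (!out)
		return NULL;
	memcpy(out, "\\Q", 2);
	memcpy(out + 2, s, len);
	memcpy(out + 2 + len, "\\E", 2);
	out[len + 4] = '\0';
	return out;
}
```

In this sketch "how to" counts as fixed and could be handed to the regex engine unchanged, while "how.to" under -F would first be quoted so the dot stays literal.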
[PATCH v3 09/10] grep: remove the kwset optimization
0.26) +19.2% 4221.4: perl log -i --grep='int'7.42(7.16+0.21) 7.14(6.80+0.24) -3.8% 4221.6: fixed log -i --grep='uncommon' 6.94(6.58+0.35) 8.43(8.04+0.30) +21.5% 4221.7: basic log -i --grep='uncommon' 6.95(6.62+0.31) 8.34(7.93+0.32) +20.0% 4221.8: extended log -i --grep='uncommon' 7.06(6.75+0.25) 8.32(7.98+0.31) +17.8% 4221.9: perl log -i --grep='uncommon' 6.96(6.69+0.26) 7.04(6.64+0.32) +1.1% 4221.11: fixed log -i --grep='æ'7.92(7.55+0.33) 7.86(7.44+0.34) -0.8% 4221.12: basic log -i --grep='æ'7.88(7.49+0.32) 7.84(7.46+0.34) -0.5% 4221.13: extended log -i --grep='æ' 7.91(7.51+0.32) 7.87(7.48+0.32) -0.5% 4221.14: perl log -i --grep='æ' 7.01(6.59+0.35) 6.99(6.64+0.28) -0.3% Some of those, as noted in [1] are because PCRE is faster at finding fixed strings. This looks bad for some engines, but in the next change we'll optimistically use PCRE v2 for all of these, so it'll look better. 1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 63 +++--- grep.h | 2 -- 2 files changed, 3 insertions(+), 62 deletions(-) diff --git a/grep.c b/grep.c index 8d0fff316c..4468519d5c 100644 --- a/grep.c +++ b/grep.c @@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } -static int is_fixed(const char *s, size_t len) -{ - size_t i; - - for (i = 0; i < len; i++) { - if (is_regex_special(s[i])) - return 0; - } - - return 1; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + p->fixed = opt->fixed; if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); - /* -* Even when -F (fixed) asks us to do a non-regexp search, we -* may not be able to correctly case-fold when -i -* (ignore-case) is asked (in which case, we'll synthesize a -* regexp to match the pattern that matches regexp special -* characters literally, while ignoring case differences). On -* the other hand, even without -F, if the pattern does not -* have any regexp special characters and there is no need for -* case-folding search, we can internally turn it into a -* simple string match using kws. p->fixed tells us if we -* want to use kws. -*/ - if (opt->fixed || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); - - if (p->fixed) { - p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); - kwsincr(p->kws, p->pattern, p->patternlen); - kwsprep(p->kws); - return; - } - if (opt->fixed) { - /* -* We come here when the pattern has the non-ascii -* characters we cannot case-fold, and asked to -* ignore-case. -*/ compile_fixed_regexp(p, opt); return; } @@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt) case GREP_PATTERN: /* atom */ case GREP_PATTERN_HEAD: case GREP_PATTERN_BODY: - if (p->kws) - kwsfree(p->kws); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) free_pcre1_regexp(p); else if (p->pcre2_pattern) free_pcre2_pattern(p); @@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name) opt->output(opt, opt->null_following_name ? "\0" : "\n", 1); } -static int fixmatch(struct grep_pat *p, char *line, char *eol, - regmatch_t *match) -{ - struct kwsmatch kwsm; - size_t offset = kwsexec(p->kws, line, eol - line, &kwsm); - if (offset == -1) { - match->rm_so = match->rm_eo = -1; - return REG_NOMATCH; - } else { - match->rm_so = offset; - match->rm_eo = match->rm_so + kwsm.size[0]; -
[PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW
In 5212f91deb ("t4210: skip command-line encoding tests on mingw", 2014-07-17) the positive tests in this file were skipped. That left the negative tests that don't produce a match. An upcoming change to migrate the "fixed" backend of grep to PCRE v2 will cause these "log" commands to produce an error instead on MinGW. This is because the command-line on that platform implicitly has its encoding changed before being passed to git. See [1]. 1. https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/ Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t4210-log-i18n.sh | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 515bcb7ce1..6e61f57f09 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -51,7 +51,7 @@ test_expect_success !MINGW 'log --grep does not find non-reencoded values (utf8) test_must_be_empty actual ' -test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' +test_expect_success !MINGW 'log --grep does not find non-reencoded values (latin1)' ' git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual && test_must_be_empty actual ' @@ -70,7 +70,7 @@ do then force_regex=.* fi - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " cat >expect <<-\EOF && latin1 utf8 @@ -79,12 +79,12 @@ do test_cmp expect actual " - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s 
--grep=\"$force_regex$utf8_e\" >actual && test_must_be_empty actual " - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual && test_must_be_empty actual " -- 2.22.0.455.g172b71a6c5
[PATCH v3 04/10] grep: inline the return value of a function call used only once
Since e944d9d932 ("grep: rewrite an if/else condition to avoid duplicate expression", 2016-06-25) the "ascii_only" variable has only been used once in compile_regexp(), let's just inline it there. This makes the code easier to read, and might make it marginally faster depending on compiler optimizations. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/grep.c b/grep.c index 1de4ab49c0..4e8d0645a8 100644 --- a/grep.c +++ b/grep.c @@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { - int ascii_only; int err; int regflags = REG_NEWLINE; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; - ascii_only = !has_non_ascii(p->pattern); /* * Even when -F (fixed) asks us to do a non-regexp search, we @@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || ascii_only; + p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); -- 2.22.0.455.g172b71a6c5
[PATCH v3 01/10] log tests: test regex backends in "--encode=" tests
Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

In a follow-up commit we'll fix this bug; this test just demonstrates
the existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests & how liberal the kwset code is about
garbage we missed this.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. It's possible that this
test breaks the "basic" and "extended" backends on some systems whose
POSIX regex functions are stricter than glibc's about the encoding of
their input. I can't recall any such issues offhand, but PCRE is more
careful about the validation.

1. https://public-inbox.org/git/nycvar.qro.7.76.6.1906271113090...@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 t/t4210-log-i18n.sh | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh
 
 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh
 
 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')
 
+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
 	test_tick &&
 	cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 	test_must_be_empty actual
 '
 
+for engine in fixed basic extended perl
+do
+	prereq=
+	result=success
+	if test $engine = "perl"
+	then
+		result=failure
+		prereq="PCRE"
+	else
+		prereq=""
+	fi
+	force_regex=
+	if test $engine != "fixed"
+	then
+		force_regex=.*
+	fi
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		cat >expect <<-\EOF &&
+		latin1
+		utf8
+		EOF
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+		test_must_be_empty actual
+	"
+
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+		test_must_be_empty actual
+	"
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5
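
To see why the "\303\50)" needle in the test is rejected by a validating backend while "\303\251" is not, here is a rough sketch of a structural UTF-8 check. The function name looks_like_utf8() is made up for this example, and the check is an assumption-laden simplification: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong forms or surrogates as a real validator (such as the one inside PCRE) would.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified structural check, for illustration only: each lead byte
 * must be followed by the expected number of 10xxxxxx continuation
 * bytes. Overlong encodings and surrogates are not detected.
 */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i++];
		size_t need;

		if (c < 0x80)
			need = 0;		/* plain ASCII */
		else if ((c & 0xE0) == 0xC0)
			need = 1;		/* 110xxxxx */
		else if ((c & 0xF0) == 0xE0)
			need = 2;		/* 1110xxxx */
		else if ((c & 0xF8) == 0xF0)
			need = 3;		/* 11110xxx */
		else
			return 0;		/* stray continuation or bad lead byte */

		for (; need > 0; need--) {
			if (i >= len || (s[i] & 0xC0) != 0x80)
				return 0;	/* missing or malformed continuation */
			i++;
		}
	}
	return 1;
}
```

Under this check the UTF-8 "é" (\303\251) passes, the test's invalid needle (\303 followed by "(") fails because "(" is not a continuation byte, and a lone latin1 "é" (\351) fails because it announces two continuation bytes that never arrive.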
[PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding="
Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was missed due to a blindspot in our tests, as discussed in the previous commit. I then blindly copied the same bug in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when adding the PCRE v2 code. We should not tell PCRE that we're processing UTF-8 just because we're dealing with non-ASCII. In the case of e.g. "log --encoding=<...>" under is_utf8_locale() the haystack might be in ISO-8859-1, and the needle might be in a non-UTF-8 encoding. Maybe we should be more strict here and die earlier? Should we also be converting the needle to the encoding in question, and failing if it's not a string that's valid in that encoding? Maybe. But for now matching this as non-UTF8 at least has some hope of producing sensible results, since we know that our default heuristic of assuming the text to be matched is in the user locale encoding isn't true when we've explicitly encoded it to be in a different encoding. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 8 grep.h | 1 + revision.c | 3 +++ t/t4210-log-i18n.sh | 6 ++ 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/grep.c b/grep.c index f7c3a5803e..1de4ab49c0 100644 --- a/grep.c +++ b/grep.c @@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) int options = PCRE_MULTILINE; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) + if (!opt->ignore_locale && has_non_ascii(p->pattern)) p->pcre1_tables = pcre_maketables(); options |= PCRE_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE_UTF8; p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset, @@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt p->pcre2_compile_context = NULL; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) { + if 
(!opt->ignore_locale && has_non_ascii(p->pattern)) { character_tables = pcre2_maketables(NULL); p->pcre2_compile_context = pcre2_compile_context_create(NULL); pcre2_set_character_tables(p->pcre2_compile_context, character_tables); } options |= PCRE2_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, diff --git a/grep.h b/grep.h index 1875880f37..4bb8a79d93 100644 --- a/grep.h +++ b/grep.h @@ -173,6 +173,7 @@ struct grep_opt { int funcbody; int extended_regexp_option; int pattern_type_option; + int ignore_locale; char colors[NR_GREP_COLORS][COLOR_MAXLEN]; unsigned pre_context; unsigned post_context; diff --git a/revision.c b/revision.c index 621feb9df7..a842fb158a 100644 --- a/revision.c +++ b/revision.c @@ -28,6 +28,7 @@ #include "commit-graph.h" #include "prio-queue.h" #include "hashmap.h" +#include "utf8.h" volatile show_early_output_fn_t show_early_output; @@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED, &revs->grep_filter); + if (!is_encoding_utf8(get_log_output_encoding())) + revs->grep_filter.ignore_locale = 1; compile_grep_patterns(&revs->grep_filter); if (revs->reverse && revs->reflog_info) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 86d22c1d4c..515bcb7ce1 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' for engine in fixed basic extended perl do prereq= - result=success if test $engine = "perl" then - result=failure prereq="PCRE" else prereq="" @@ -72,7 +70,7 @@ do then force_regex=.* fi - test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success 
GETTEXT_LOCALE,$prereq
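
The decision this patch changes can be sketched on its own as follows. Everything named sketch_* or SKETCH_* is hypothetical and exists only for this example; the real code ORs PCRE_UTF8 or PCRE2_UTF into the compile options, and the real ignore_locale flag is set in setup_revisions() when the log output encoding is not UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag only; stands in for PCRE_UTF8 / PCRE2_UTF. */
#define SKETCH_UTF_FLAG 0x1

/* Simplified: any byte with the high bit set counts as non-ASCII. */
static int sketch_has_non_ascii(const char *s)
{
	assert(s != NULL);
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/*
 * ignore_locale is meant to be set by the caller when "log --encoding"
 * re-encodes to something other than UTF-8, so the user's locale no
 * longer describes the haystack being searched.
 */
static int sketch_compile_options(int ignore_locale, int utf8_locale,
				  const char *pattern)
{
	int options = 0;

	if (!ignore_locale && utf8_locale && sketch_has_non_ascii(pattern))
		options |= SKETCH_UTF_FLAG;
	return options;
}
```

The point of the extra operand is that a UTF-8 locale plus a non-ASCII needle is no longer enough to turn on UTF mode; the haystack must also be expected to be in the locale's encoding.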
[PATCH v3 05/10] grep tests: move "grep binary" alongside the rest
Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5
[PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane
The behavior of "grep" when patterns contained a NUL-byte has always been haphazard, and has served the vagaries of the implementation more than anything else. A pattern containing a NUL-byte can only be provided via "-f ". Since pickaxe (log search) has no such flag the NUL-byte in patterns has only ever been supported by "grep" (and not "log --grep"). Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing "\0" were considered fixed. In 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20) I added tests for this behavior. Change the behavior to do the obvious thing, i.e. don't silently discard a regex pattern and make it implicitly fixed just because they contain a NUL-byte. Instead die if the backend in question can't handle them, e.g. --basic-regexp is combined with such a pattern. This is desired because from a user's point of view it's the obvious thing to do. Whether we support BRE/ERE/Perl syntax is different from whether our implementation is limited by C-strings. These patterns are obscure enough that I think this behavior change is OK, especially since we never documented the old behavior. Doing this also makes it easier to replace the kwset backend with something else, since we'll no longer strictly need it for anything we can't easily use another fixed-string backend for. Signed-off-by: Ævar Arnfjörð Bjarmason --- Documentation/git-grep.txt | 17 grep.c | 23 ++--- t/t7816-grep-binary-pattern.sh | 159 ++--- 3 files changed, 110 insertions(+), 89 deletions(-) diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt index 2d27969057..c89fb569e3 100644 --- a/Documentation/git-grep.txt +++ b/Documentation/git-grep.txt @@ -271,6 +271,23 @@ providing this option will cause it to die. -f :: Read patterns from , one per line. ++ +Passing the pattern via allows for providing a search pattern +containing a \0. ++ +Not all pattern types support patterns containing \0. 
Git will error +out if a given pattern type can't support such a pattern. The +`--perl-regexp` pattern type when compiled against the PCRE v2 backend +has the widest support for these types of patterns. ++ +In versions of Git before 2.23.0 patterns containing \0 would be +silently considered fixed. This was never documented, there were also +odd and undocumented interactions between e.g. non-ASCII patterns +containing \0 and `--ignore-case`. ++ +In future versions we may learn to support patterns containing \0 for +more search backends, until then we'll die when the pattern type in +question doesn't support them. -e:: The next parameter is the pattern. This option has to be diff --git a/grep.c b/grep.c index 4e8d0645a8..d6603bc950 100644 --- a/grep.c +++ b/grep.c @@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len) return 1; } -static int has_null(const char *s, size_t len) -{ - /* -* regcomp cannot accept patterns with NULs so when using it -* we consider any pattern containing a NUL fixed. -*/ - if (memchr(s, 0, len)) - return 1; - - return 0; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) * simple string match using kws. p->fixed tells us if we * want to use kws. */ - if (opt->fixed || - has_null(p->pattern, p->patternlen) || - is_fixed(p->pattern, p->patternlen)) + if (opt->fixed || is_fixed(p->pattern, p->patternlen)) p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { @@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) kwsincr(p->kws, p->pattern, p->patternlen); kwsprep(p->kws); return; - } else if (opt->fixed) { + } + + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); + + if (opt->fixed) { /* * We come here when the pattern has the non-ascii * characters we cannot case-fold, and asked to diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 4060dbd679..9e09bd5d6a 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -2,113 +2,126 @@ test_description='git grep with a binary pattern files' -. ./test-lib.sh +. ./lib-gettext.sh -nul_match () { +nul_match_internal () { matches=$1 - fla
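
The new NUL-byte rule can be sketched in isolation like this. The enum and function names are hypothetical; where the patch calls die(), this sketch returns a verdict instead so it can be exercised without exiting. In the patch the check runs only after the fixed-string path has already returned, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum nul_verdict {
	PATTERN_OK,
	PATTERN_NEEDS_PCRE2	/* the patch would die() in this case */
};

/*
 * A pattern read via -f may contain embedded NUL bytes, so the length
 * is passed explicitly instead of relying on strlen().
 */
static enum nul_verdict check_nul_pattern(const char *pattern, size_t len,
					  int using_pcre2)
{
	assert(pattern != NULL);
	if (memchr(pattern, 0, len) && !using_pcre2)
		return PATTERN_NEEDS_PCRE2;
	return PATTERN_OK;
}
```

The design point is that the pattern type the user asked for is respected: a regex with a NUL byte is either handled as a regex or refused, never silently downgraded to a fixed string.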
[PATCH v3 06/10] grep tests: move binary pattern tests into their own file
Move the tests for "-f " where "" contains a NUL byte pattern into their own file. I added most of these tests in 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20). Whether a regex engine supports matching binary content is very different from whether it matches binary patterns. Since 2f8952250a ("regex: add regexec_buf() that can work on a non NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our regex engines so we can match binary content, but only the PCRE v2 engine can sensibly match binary patterns. Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting patterns containing NUL-byte and considering them fixed, except in cases where "--ignore-case" is provided and they're non-ASCII, see 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", 2016-06-25). Subsequent commits will change this behavior. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t7815-grep-binary.sh | 101 - t/t7816-grep-binary-pattern.sh | 114 + 2 files changed, 114 insertions(+), 101 deletions(-) create mode 100755 t/t7816-grep-binary-pattern.sh diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh index 2d87c49b75..90ebb64f46 100755 --- a/t/t7815-grep-binary.sh +++ b/t/t7815-grep-binary.sh @@ -4,41 +4,6 @@ test_description='git grep in binary files' . 
./test-lib.sh -nul_match () { - matches=$1 - flags=$2 - pattern=$3 - pattern_human=$(echo "$pattern" | sed 's/Q//g') - - if test "$matches" = 1 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = 0 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - elif test "$matches" = T1 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = T0 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - else - test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' - fi -} - test_expect_success 'setup' " echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && git add a && @@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' ' git grep .fi a ' -nul_match 1 '-F' 'yQf' -nul_match 0 '-F' 'yQx' -nul_match 1 '-Fi' 'YQf' -nul_match 0 '-Fi' 'YQx' -nul_match 1 '' 'yQf' -nul_match 0 '' 'yQx' -nul_match 1 '' 'æQð' -nul_match 1 '-F' 'eQm[*]c' -nul_match 1 '-Fi' 'EQM[*]C' - -# Regex patterns that would match but shouldn't with -F -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-F' '[y]Qf' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '-Fi' '[Y]QF' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-F' '[æ]Qð' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '-Fi' '[Æ]QÐ' - -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' - -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' - -# ... 
because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0 ''
[PATCH v3 00/10] grep: move from kwset to optional PCRE v2
This v3 has a new patch (3/10) that I believe fixes the regression on
MinGW Johannes noted in
https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/

As noted in the updated commit message in 10/10 I believe just
skipping this test & documenting this in a commit message is the least
amount of suck for now. It's really an existing issue with us doing
nothing sensible when the log/grep haystack encoding doesn't match the
needle encoding supplied via the command line.

We swept that under the carpet with the kwset backend, but PCRE v2
exposes it.

Ævar Arnfjörð Bjarmason (10):
  log tests: test regex backends in "--encode=" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding="
  t4210: skip more command-line encoding tests on MinGW
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 115 +++-
 grep.h                                        |   3 +-
 revision.c                                    |   3 +
 t/t4210-log-i18n.sh                           |  41 +-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
 t/t7816-grep-binary-pattern.sh                | 127 ++
 8 files changed, 234 insertions(+), 173 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
 1: cfc01f49d3 =  1: cfc01f49d3 log tests: test regex backends in "--encode=" tests
 2: 4b59eb32f0 =  2: 4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding="
 -: --         >  3: 676c76afe4 t4210: skip more command-line encoding tests on MinGW
 3: cc4d3b50d5 =  4: da9b491f70 grep: inline the return value of a function call used only once
 4: d9b29bdd89 =  5: c42d3268fa grep tests: move "grep binary" alongside the rest
 5: f85614f435 =  6: 36b9c1c541 grep tests: move binary pattern tests into their own file
 6: 90afca8707 =  7: 3c54e782e6 grep: make the behavior for NUL-byte in patterns sane
 7: 526b925fdc =  8: 8e5f418189 grep: drop support for \0 in --fixed-strings
 8: 14269bb295 =  9: d1cb8319d5 grep: remove the kwset optimization
 9: c0fd75d102 ! 10: 4de0c82314 grep: use PCRE v2 for optimized fixed-string search
    @@ -15,6 +15,15 @@
         makes the behavior harder to understand and document, and makes tests
         for the different backends more painful.
     
    +    This does change the behavior under non-C locales when "log"'s
    +    "--encoding" option is used and the haystack/needle in the
    +    content/command-line doesn't have a matching encoding. See the recent
    +    change in "t4210: skip more command-line encoding tests on MinGW" in
    +    this series. I think that's OK. We did nothing sensible before
    +    then (just compared raw bytes that had no hope of matching). At least
    +    now the user will get some idea why their grep/log never matches in
    +    that edge case.
    +
         I could also support the PCRE v1 backend here, but that would make the
         code more complex. I'd rather aim for simplicity here and in future
         changes to the diffcore. We're not going to have someone who
-- 
2.22.0.455.g172b71a6c5
Re: [PATCH 0/6] easy bulk commit creation in tests
On Fri, Jun 28 2019, Jeff King wrote:

> On Fri, Jun 28, 2019 at 02:41:03AM -0400, Jeff King wrote:
>
>> I think this would exercise it, at the cost of making the test more
>> expensive:
>>
>> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
>> index 82d7f7f6a5..8ed6982dcb 100755
>> --- a/t/t5310-pack-bitmaps.sh
>> +++ b/t/t5310-pack-bitmaps.sh
>> @@ -21,7 +21,7 @@ has_any () {
>>  }
>>
>>  test_expect_success 'setup repo with moderate-sized history' '
>> -	for i in $(test_seq 1 10)
>> +	for i in $(test_seq 1 100)
>>  	do
>>  		test_commit $i
>>  	done &&
>>
>> It would be nice if we had a "test_commits_bulk" that used fast-import
>> to create larger numbers of commits.
>
> So here's a patch to do that. Writing the bulk commit function was a fun
> exercise, and I found a couple other places to apply it, too, shaving
> off ~7.5 seconds from my test runs. Not ground-breaking, but I think
> it's nice to have a solution where we don't have to be afraid to
> generate a bunch of commits.

Nice. Just a side-note: I've wondered how much we could speed up the
tests in other places if rather than doing setup all over the place we
simply created a few "template" repository shapes, and the common case
for tests would be to simply cp(1) those over.

I.e. for things like fsck etc. we really do need some specific
repository layout, but a lot of our tests are simply re-doing setup
slightly differently just to get things like "I want a few commits on
a few branches" or "set up a repo like but with some remotes" etc.
Re: [PATCH 1/6] test-lib: introduce test_commit_bulk
On Fri, Jun 28 2019, Jeff King wrote: > Some tests need to create a string of commits. Doing this with > test_commit is very heavy-weight, as it needs at least one process per > commit (and in fact, uses several). > > For bulk creation, we can do much better by using fast-import, but it's > often a pain to generate the input. Let's provide a helper to do so. > > We'll use t5310 as a guinea pig, as it has three 10-commit loops. Here > are hyperfine results before and after: > > [before] > Benchmark #1: ./t5310-pack-bitmaps.sh --root=/var/ram/git-tests > Time (mean ± σ): 2.846 s ± 0.305 s[User: 3.042 s, System: 0.919 > s] > Range (min … max):2.250 s … 3.210 s10 runs > > [after] > Benchmark #1: ./t5310-pack-bitmaps.sh --root=/var/ram/git-tests > Time (mean ± σ): 2.210 s ± 0.174 s[User: 2.570 s, System: 0.604 > s] > Range (min … max):1.999 s … 2.590 s10 runs > > So we're over 20% faster, while making the callers slightly shorter. We > added a lot more lines in test-lib-function.sh, of course, and the > helper is way more featureful than we need here. But my hope is that it > will be flexible enough to use in more places. 
> > Signed-off-by: Jeff King > --- > t/t5310-pack-bitmaps.sh | 15 + > t/test-lib-functions.sh | 131 > 2 files changed, 134 insertions(+), 12 deletions(-) > > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh > index a26c8ba9a2..3aab7024ca 100755 > --- a/t/t5310-pack-bitmaps.sh > +++ b/t/t5310-pack-bitmaps.sh > @@ -21,15 +21,9 @@ has_any () { > } > > test_expect_success 'setup repo with moderate-sized history' ' > - for i in $(test_seq 1 10) > - do > - test_commit $i > - done && > + test_commit_bulk --id=file 10 && > git checkout -b other HEAD~5 && > - for i in $(test_seq 1 10) > - do > - test_commit side-$i > - done && > + test_commit_bulk --id=side 10 && > git checkout master && > bitmaptip=$(git rev-parse master) && > blob=$(echo tagged-blob | git hash-object -w --stdin) && > @@ -106,10 +100,7 @@ test_expect_success 'clone from bitmapped repository' ' > ' > > test_expect_success 'setup further non-bitmapped commits' ' > - for i in $(test_seq 1 10) > - do > - test_commit further-$i > - done > + test_commit_bulk --id=further 10 > ' > > rev_list_tests 'partial bitmap' > diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh > index 0367cec5fd..32a1db81a3 100644 > --- a/t/test-lib-functions.sh > +++ b/t/test-lib-functions.sh > @@ -233,6 +233,137 @@ test_merge () { > git tag "$1" > } > > +# Similar to test_commit, but efficiently create commits, each with a > +# unique number $n (from 1 to by default) in the commit message. Is it intentional not to follow test_commit's convention of creating a tag as well? If so it would be helpful to note that difference here, or rather, move this documentation to t/README where test_commit and friends are documented.
Re: [PATCH v2 0/9] grep: move from kwset to optional PCRE v2
On Fri, Jun 28 2019, Ævar Arnfjörð Bjarmason wrote: > A non-RFC since it seem people like this approach. > > This should fix the test failure noted by Johannes, there's two new > patches at the start of this series. They address a bug that was there > for a long time, but I happened to trip over since PCRE is more strict > about UTF-8 validation than kwset (which doesn't care at all). > > I also added performance numbers to the relevant commit messages, took > brian's suggestion of saying "NUL-byte" instead of "\0", and did some > other copyediting of my own. > > The rest of the code changes are all just comments & rewording of > previously added comments. Junio. I thought I'd submit this in before your merge to "next", but I see that happened. Are you OK with rewinding it for this (& maybe something else) or should I submit a v3 rebased on "next"? I'd really prefer the improved commit messages with performance numbers, and thought I'd have time to work on those details since it was an RFC/PATCH :) > Ævar Arnfjörð Bjarmason (9): > log tests: test regex backends in "--encode=" tests > grep: don't use PCRE2?_UTF8 with "log --encoding=" > grep: inline the return value of a function call used only once > grep tests: move "grep binary" alongside the rest > grep tests: move binary pattern tests into their own file > grep: make the behavior for NUL-byte in patterns sane > grep: drop support for \0 in --fixed-strings > grep: remove the kwset optimization > grep: use PCRE v2 for optimized fixed-string search > > Documentation/git-grep.txt| 17 +++ > grep.c| 115 +++- > grep.h| 3 +- > revision.c| 3 + > t/t4210-log-i18n.sh | 39 +- > ...a1.sh => t7008-filter-branch-null-sha1.sh} | 0 > ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 -- > t/t7816-grep-binary-pattern.sh| 127 ++ > 8 files changed, 233 insertions(+), 172 deletions(-) > rename t/{t7009-filter-branch-null-sha1.sh => > t7008-filter-branch-null-sha1.sh} (100%) > rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} 
(55%) > create mode 100755 t/t7816-grep-binary-pattern.sh > > Range-diff: > -: -- > 1: cfc01f49d3 log tests: test regex backends in > "--encode=" tests > -: -- > 2: 4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log > --encoding=" > 1: ad55d3be7e = 3: cc4d3b50d5 grep: inline the return value of a function > call used only once > 2: 650bcc8582 = 4: d9b29bdd89 grep tests: move "grep binary" alongside > the rest > 3: ef10a8820d ! 5: f85614f435 grep tests: move binary pattern tests into > their own file > @@ -2,9 +2,10 @@ > > grep tests: move binary pattern tests into their own file > > -Move the tests for "-f " where "" contains a "\0" pattern > -into their own file. I added most of these tests in 966be95549 > ("grep: > -add tests to fix blind spots with \0 patterns", 2017-05-20). > +Move the tests for "-f " where "" contains a NUL byte > +pattern into their own file. I added most of these tests in > +966be95549 ("grep: add tests to fix blind spots with \0 patterns", > +2017-05-20). > > Whether a regex engine supports matching binary content is very > different from whether it matches binary patterns. Since > @@ -14,8 +15,8 @@ > engine can sensibly match binary patterns. > > Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting > -patterns containing "\0" and considering them fixed, except in cases > -where "--ignore-case" is provided and they're non-ASCII, see > +patterns containing NUL-byte and considering them fixed, except in > +cases where "--ignore-case" is provided and they're non-ASCII, see > 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", > 2016-06-25). Subsequent commits will change this behavior. > > 4: 03e5637efc ! 6: 90afca8707 grep: make the behavior for \0 in patterns > sane > @@ -1,12 +1,13 @@ > Author: Ævar Arnfjörð Bjarmason > > -grep: make the behavior for \0 in patterns sane > +grep: make the behavior for NUL-byte in patterns sane > > -The behavi
Re: [PATCH] repack: disable bitmaps-by-default if .keep files exist
On Fri, Jun 28 2019, Eric Wong wrote: > Jeff King wrote: >> On Sun, Jun 23, 2019 at 06:08:25PM +, Eric Wong wrote: >> >> > > I'm not sure of the right solution. For maximal backwards-compatibility, >> > > the default for bitmaps could become "if not bare and if there are no >> > > .keep files". But that would mean bitmaps sometimes not getting >> > > generated because of the problems that ee34a2bead was trying to solve. >> > > >> > > That's probably OK, though; you can always flip the bitmap config to >> > > "true" yourself if you _must_ have bitmaps. >> > >> > What about something like this? Needs tests but I need to leave, now. >> >> Yeah, I think that's the right direction. > > OK. I have a real patch with one additional test, below. > (don't have a lot of time for hacking) > >> Though... >> >> > +static int has_pack_keep_file(void) >> > +{ >> > + DIR *dir; >> > + struct dirent *e; >> > + int found = 0; >> > + >> > + if (!(dir = opendir(packdir))) >> > + return found; >> > + >> > + while ((e = readdir(dir)) != NULL) { >> > + if (ends_with(e->d_name, ".keep")) { >> > + found = 1; >> > + break; >> > + } >> > + } >> > + closedir(dir); >> > + return found; >> > +} >> >> I think this can be replaced with just checking p->pack_keep for each >> item in the packed_git list. > > Good point, I tend to forget git C API internals as soon as I > learn them :x > >> That's racy, but then so is your code here, since it's really the child >> pack-objects which is going to deal with the .keep. I don't think we >> need to care much about the race, though. Either: > > Agreed. > > 8<--- > Subject: [PATCH] repack: disable bitmaps-by-default if .keep files exist > > Bitmaps aren't useful with multiple packs, and users with > .keep files ended up with redundant packs when bitmaps > got enabled by default in bare repos. > > So detect when .keep files exist and stop enabling bitmaps > by default in that case. 
> > Wasteful (but otherwise harmless) race conditions with .keep files > documented by Jeff King still apply and there's a chance we'd > still end up with redundant data on the FS: > > https://public-inbox.org/git/20190623224244.gb1...@sigill.intra.peff.net/ > > Fixes: 36eba0323d3288a8 ("repack: enable bitmaps by default on bare repos") > Signed-off-by: Eric Wong > Helped-by: Jeff King > Reported-by: Janos Farkas > --- > builtin/repack.c | 18 -- > t/t7700-repack.sh | 10 ++ > 2 files changed, 26 insertions(+), 2 deletions(-) > > diff --git a/builtin/repack.c b/builtin/repack.c > index caca113927..a9529d1afc 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -89,6 +89,17 @@ static void remove_pack_on_signal(int signo) > raise(signo); > } > > +static int has_pack_keep_file(void) > +{ > + struct packed_git *p; > + > + for (p = get_packed_git(the_repository); p; p = p->next) { > + if (p->pack_keep) > + return 1; > + } > + return 0; > +} > + > /* > * Adds all packs hex strings to the fname list, which do not > * have a corresponding .keep file. 
These packs are not to > @@ -343,9 +354,12 @@ int cmd_repack(int argc, const char **argv, const char > *prefix) > (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) > die(_("--keep-unreachable and -A are incompatible")); > > - if (write_bitmaps < 0) > + if (write_bitmaps < 0) { > write_bitmaps = (pack_everything & ALL_INTO_ONE) && > - is_bare_repository(); > + is_bare_repository() && > + keep_pack_list.nr == 0 && > + !has_pack_keep_file(); > + } > if (pack_kept_objects < 0) > pack_kept_objects = write_bitmaps; > > diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh > index 86d05160a3..0acde3b1f8 100755 > --- a/t/t7700-repack.sh > +++ b/t/t7700-repack.sh > @@ -239,4 +239,14 @@ test_expect_success 'bitmaps can be disabled on bare > repos' ' > test -z "$bitmap" > ' I have the feedback I posted before this patch in https://public-inbox.org/git/874l4f8h4c@evledraar.gmail.com/ In particular "b" there since "a" is clearly more work. I.e. shouldn't we at least in interactive mode on a "gc" print something about skipping what we'd otherwise do. Maybe that's tricky with the gc.log functionality, but I think we should at least document this before the next guy shows up with "sometimes my .bitmap files aren't generated...". > +test_expect_success 'no bitmaps created if .keep files present' ' > + pack=$(ls bare.git/objects/pack/*.pack) && > + test_path_is_file "$pack" && > + keep=${pack%.pack}.keep && > + >"$keep" && > + git -C bare.git repack -ad && > + bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) && > + test -z "$bitmap" Maybe more readable as:
[PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search
asic log -i --grep='int'                   7.40(7.08+0.28)   7.19(6.92+0.20)   -2.8%
4221.3: extended log -i --grep='int'       7.43(7.13+0.25)   7.27(6.99+0.21)   -2.2%
4221.4: perl log -i --grep='int'           7.34(7.10+0.24)   7.10(6.90+0.19)   -3.3%
4221.6: fixed log -i --grep='uncommon'     7.07(6.71+0.32)   7.11(6.77+0.28)   +0.6%
4221.7: basic log -i --grep='uncommon'     6.99(6.64+0.28)   7.12(6.69+0.38)   +1.9%
4221.8: extended log -i --grep='uncommon'  7.11(6.74+0.32)   7.10(6.77+0.27)   -0.1%
4221.9: perl log -i --grep='uncommon'      6.98(6.60+0.29)   7.05(6.64+0.34)   +1.0%
4221.11: fixed log -i --grep='æ'           7.85(7.45+0.34)   7.03(6.68+0.32)   -10.4%
4221.12: basic log -i --grep='æ'           7.87(7.49+0.29)   7.06(6.69+0.31)   -10.3%
4221.13: extended log -i --grep='æ'        7.87(7.54+0.31)   7.09(6.69+0.31)   -9.9%
4221.14: perl log -i --grep='æ'            7.06(6.77+0.28)   6.91(6.57+0.31)   -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }

+static int is_fixed(const char *s, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (is_regex_special(s[i]))
+			return 0;
+	}
+
+	return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */

 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */

 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
+	int pat_is_fixed;

 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2"));

-	if (opt->fixed) {
+	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+		opt->pcre2 = 1;
+		if (pat_is_fixed) {
+			compile_pcre2_pattern(p, opt);
+		} else {
+			/*
+			 * E.g. t7811-grep-open.sh relies on the
+			 * pattern being restored.
+			 */
+			char *old_pattern = p->pattern;
+			size_t old_patternlen = p->patternlen;
+			struct strbuf sb = STRBUF_INIT;
+
+			/*
+			 * There is the PCRE2_LITERAL flag, but it's
+			 * only in PCRE v2 10.30 and later. Needing to
+			 * ifdef our way around that and dealing with
+			 * it + PCRE2_MULTILINE being an error is more
+			 * complex than just quoting this ourselves.
+			 */
+			strbuf_add(&sb, "\\Q", 2);
+			strbuf_add(&sb, p->pattern, p->patternlen);
+			strbuf_add(&sb, "\\E", 2);
+
+			p->pattern = sb.buf;
+			p->patternlen = sb.len;
+			compile_pcre2_pattern(p, opt);
+			p->pattern = old_pattern;
+			p->patternlen = old_patternlen;
+			strbuf_release(&sb);
+		}
+#else /* !USE_LIBPCRE2 */
 		compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
 		return;
 	}

-- 
2.22.0.455.g172b71a6c5
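For anyone following along outside git.git, the fixed-string path in this patch can be sketched in plain C. This is only an illustration under stated assumptions: looks_fixed() and quote_literal() are hypothetical names, the metacharacter list is my guess at what is_regex_special() covers, and malloc() stands in for git's strbuf.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for is_fixed(): returns 1 when no byte of the
 * length-counted pattern is a regex metacharacter. The character set
 * below is an assumption; git's is_regex_special() is the authority.
 */
int looks_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		/* Skip NUL: strchr() would "find" it at the terminator. */
		if (s[i] && strchr("$()*+.?[\\^{|", s[i]))
			return 0;
	}
	return 1;
}

/*
 * Hypothetical stand-in for the strbuf code: wrap a length-counted
 * pattern in \Q...\E so PCRE treats it literally. Returns a malloc()'d
 * buffer the caller must free(), or NULL on allocation failure.
 */
char *quote_literal(const char *pat, size_t len, size_t *out_len)
{
	char *buf = malloc(len + 5); /* "\Q" + pattern + "\E" + NUL */

	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pat, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[len + 4] = '\0';
	*out_len = len + 4;
	return buf;
}
```

Note that \Q...\E quoting ends at the first \E, so a literal pattern that itself contains "\E" would need extra handling which this sketch does not attempt.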
[PATCH v2 8/9] grep: remove the kwset optimization
0.26) +19.2% 4221.4: perl log -i --grep='int'7.42(7.16+0.21) 7.14(6.80+0.24) -3.8% 4221.6: fixed log -i --grep='uncommon' 6.94(6.58+0.35) 8.43(8.04+0.30) +21.5% 4221.7: basic log -i --grep='uncommon' 6.95(6.62+0.31) 8.34(7.93+0.32) +20.0% 4221.8: extended log -i --grep='uncommon' 7.06(6.75+0.25) 8.32(7.98+0.31) +17.8% 4221.9: perl log -i --grep='uncommon' 6.96(6.69+0.26) 7.04(6.64+0.32) +1.1% 4221.11: fixed log -i --grep='æ'7.92(7.55+0.33) 7.86(7.44+0.34) -0.8% 4221.12: basic log -i --grep='æ'7.88(7.49+0.32) 7.84(7.46+0.34) -0.5% 4221.13: extended log -i --grep='æ' 7.91(7.51+0.32) 7.87(7.48+0.32) -0.5% 4221.14: perl log -i --grep='æ' 7.01(6.59+0.35) 6.99(6.64+0.28) -0.3% Some of those, as noted in [1] are because PCRE is faster at finding fixed strings. This looks bad for some engines, but in the next change we'll optimistically use PCRE v2 for all of these, so it'll look better. 1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 63 +++--- grep.h | 2 -- 2 files changed, 3 insertions(+), 62 deletions(-) diff --git a/grep.c b/grep.c index 8d0fff316c..4468519d5c 100644 --- a/grep.c +++ b/grep.c @@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } -static int is_fixed(const char *s, size_t len) -{ - size_t i; - - for (i = 0; i < len; i++) { - if (is_regex_special(s[i])) - return 0; - } - - return 1; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + p->fixed = opt->fixed; if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); - /* -* Even when -F (fixed) asks us to do a non-regexp search, we -* may not be able to correctly case-fold when -i -* (ignore-case) is asked (in which case, we'll synthesize a -* regexp to match the pattern that matches regexp special -* characters literally, while ignoring case differences). On -* the other hand, even without -F, if the pattern does not -* have any regexp special characters and there is no need for -* case-folding search, we can internally turn it into a -* simple string match using kws. p->fixed tells us if we -* want to use kws. -*/ - if (opt->fixed || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); - - if (p->fixed) { - p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); - kwsincr(p->kws, p->pattern, p->patternlen); - kwsprep(p->kws); - return; - } - if (opt->fixed) { - /* -* We come here when the pattern has the non-ascii -* characters we cannot case-fold, and asked to -* ignore-case. -*/ compile_fixed_regexp(p, opt); return; } @@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt) case GREP_PATTERN: /* atom */ case GREP_PATTERN_HEAD: case GREP_PATTERN_BODY: - if (p->kws) - kwsfree(p->kws); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) free_pcre1_regexp(p); else if (p->pcre2_pattern) free_pcre2_pattern(p); @@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name) opt->output(opt, opt->null_following_name ? "\0" : "\n", 1); } -static int fixmatch(struct grep_pat *p, char *line, char *eol, - regmatch_t *match) -{ - struct kwsmatch kwsm; - size_t offset = kwsexec(p->kws, line, eol - line, &kwsm); - if (offset == -1) { - match->rm_so = match->rm_eo = -1; - return REG_NOMATCH; - } else { - match->rm_so = offset; - match->rm_eo = match->rm_so + kwsm.size[0]; -
[PATCH v2 4/9] grep tests: move "grep binary" alongside the rest
Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5
[PATCH v2 7/9] grep: drop support for \0 in --fixed-strings
Change "-f " to not support patterns with a NUL-byte in them under --fixed-strings. We'll now only support these under "--perl-regexp" with PCRE v2. A previous change to grep's documentation changed the description of "-f " to be vague enough as to not promise that this would work. By dropping support for this we make it a whole lot easier to move away from the kwset backend, which we'll do in a subsequent change. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 6 +-- t/t7816-grep-binary-pattern.sh | 82 +- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/grep.c b/grep.c index d6603bc950..8d0fff316c 100644 --- a/grep.c +++ b/grep.c @@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); + /* * Even when -F (fixed) asks us to do a non-regexp search, we * may not be able to correctly case-fold when -i @@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) return; } - if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) - die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); - if (opt->fixed) { /* * We come here when the pattern has the non-ascii diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 9e09bd5d6a..60bab291e4 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -60,23 +60,23 @@ test_expect_success 'setup' " " # Simple fixed-string matching that can use kwset (no -i && non-ASCII) -nul_match 1 1 1 '-F' 'yQf' -nul_match 0 0 0 '-F' 'yQx' -nul_match 1 1 1 '-Fi' 'YQf' -nul_match 0 0 0 '-Fi' 'YQx' -nul_match 1 1 1 '' 'yQf' -nul_match 0 0 0 '' 'yQx' -nul_match 1 1 1 '' 'æQð' -nul_match 1 1 1 '-F' 'eQm[*]c' -nul_match 1 1 1 '-Fi' 'EQM[*]C' +nul_match P P P '-F' 'yQf' +nul_match P P P '-F' 'yQx' +nul_match P P P '-Fi' 'YQf' +nul_match P P P '-Fi' 'YQx' +nul_match P P 1 '' 'yQf' +nul_match P P 0 '' 'yQx' +nul_match P P 1 '' 'æQð' +nul_match P P P '-F' 'eQm[*]c' +nul_match P P P '-Fi' 'EQM[*]C' # Regex patterns that would match but shouldn't with -F -nul_match 0 0 0 '-F' 'yQ[f]' -nul_match 0 0 0 '-F' '[y]Qf' -nul_match 0 0 0 '-Fi' 'YQ[F]' -nul_match 0 0 0 '-Fi' '[Y]QF' -nul_match 0 0 0 '-F' 'æQ[ð]' -nul_match 0 0 0 '-F' '[æ]Qð' +nul_match P P P '-F' 'yQ[f]' +nul_match P P P '-F' '[y]Qf' +nul_match P P P '-Fi' 'YQ[F]' +nul_match P P P '-Fi' '[Y]QF' +nul_match P P P '-F' 'æQ[ð]' +nul_match P P P '-F' '[æ]Qð' # The -F kwset codepath can't handle -i && non-ASCII... 
nul_match P 1 1 '-i' '[æ]Qð' @@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð' nul_match P 0 1 '-i' 'ÆQÐ' # \0 in regexes can only work with -P & PCRE v2 -nul_match P 1 1 '' 'yQ[f]' -nul_match P 1 1 '' '[y]Qf' -nul_match P 1 1 '-i' 'YQ[F]' -nul_match P 1 1 '-i' '[Y]Qf' -nul_match P 1 1 '' 'æQ[ð]' -nul_match P 1 1 '' '[æ]Qð' -nul_match P 0 1 '-i' 'ÆQ[Ð]' -nul_match P 1 1 '' 'eQm.*cQ' -nul_match P 1 1 '-i' 'EQM.*cQ' -nul_match P 0 0 '' 'eQm[*]c' -nul_match P 0 0 '-i' 'EQM[*]C' +nul_match P P 1 '' 'yQ[f]' +nul_match P P 1 '' '[y]Qf' +nul_match P P 1 '-i' 'YQ[F]' +nul_match P P 1 '-i' '[Y]Qf' +nul_match P P 1 '' 'æQ[ð]' +nul_match P P 1 '' '[æ]Qð' +nul_match P P 1 '-i' 'ÆQ[Ð]' +nul_match P P 1 '' 'eQm.*cQ' +nul_match P P 1 '-i' 'EQM.*cQ' +nul_match P P 0 '' 'eQm[*]c' +nul_match P P 0
[PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane
The behavior of "grep" when patterns contained a NUL-byte has always been haphazard, and has served the vagaries of the implementation more than anything else. A pattern containing a NUL-byte can only be provided via "-f ". Since pickaxe (log search) has no such flag the NUL-byte in patterns has only ever been supported by "grep" (and not "log --grep"). Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing "\0" were considered fixed. In 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20) I added tests for this behavior. Change the behavior to do the obvious thing, i.e. don't silently discard a regex pattern and make it implicitly fixed just because they contain a NUL-byte. Instead die if the backend in question can't handle them, e.g. --basic-regexp is combined with such a pattern. This is desired because from a user's point of view it's the obvious thing to do. Whether we support BRE/ERE/Perl syntax is different from whether our implementation is limited by C-strings. These patterns are obscure enough that I think this behavior change is OK, especially since we never documented the old behavior. Doing this also makes it easier to replace the kwset backend with something else, since we'll no longer strictly need it for anything we can't easily use another fixed-string backend for. Signed-off-by: Ævar Arnfjörð Bjarmason --- Documentation/git-grep.txt | 17 grep.c | 23 ++--- t/t7816-grep-binary-pattern.sh | 159 ++--- 3 files changed, 110 insertions(+), 89 deletions(-) diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt index 2d27969057..c89fb569e3 100644 --- a/Documentation/git-grep.txt +++ b/Documentation/git-grep.txt @@ -271,6 +271,23 @@ providing this option will cause it to die. -f :: Read patterns from , one per line. ++ +Passing the pattern via allows for providing a search pattern +containing a \0. ++ +Not all pattern types support patterns containing \0. 
Git will error +out if a given pattern type can't support such a pattern. The +`--perl-regexp` pattern type when compiled against the PCRE v2 backend +has the widest support for these types of patterns. ++ +In versions of Git before 2.23.0 patterns containing \0 would be +silently considered fixed. This was never documented, there were also +odd and undocumented interactions between e.g. non-ASCII patterns +containing \0 and `--ignore-case`. ++ +In future versions we may learn to support patterns containing \0 for +more search backends, until then we'll die when the pattern type in +question doesn't support them. -e:: The next parameter is the pattern. This option has to be diff --git a/grep.c b/grep.c index 4e8d0645a8..d6603bc950 100644 --- a/grep.c +++ b/grep.c @@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len) return 1; } -static int has_null(const char *s, size_t len) -{ - /* -* regcomp cannot accept patterns with NULs so when using it -* we consider any pattern containing a NUL fixed. -*/ - if (memchr(s, 0, len)) - return 1; - - return 0; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) * simple string match using kws. p->fixed tells us if we * want to use kws. */ - if (opt->fixed || - has_null(p->pattern, p->patternlen) || - is_fixed(p->pattern, p->patternlen)) + if (opt->fixed || is_fixed(p->pattern, p->patternlen)) p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { @@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) kwsincr(p->kws, p->pattern, p->patternlen); kwsprep(p->kws); return; - } else if (opt->fixed) { + } + + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); + + if (opt->fixed) { /* * We come here when the pattern has the non-ascii * characters we cannot case-fold, and asked to diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 4060dbd679..9e09bd5d6a 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -2,113 +2,126 @@ test_description='git grep with a binary pattern files' -. ./test-lib.sh +. ./lib-gettext.sh -nul_match () { +nul_match_internal () { matches=$1 - fla
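The "limited by C-strings" point from the commit message above can be shown with a very small sketch. pattern_has_nul() is a hypothetical name; the check itself is the same memchr() call the patch uses on p->pattern and p->patternlen.

```c
#include <string.h>

/*
 * Returns 1 if a length-counted pattern holds an embedded NUL byte.
 * strlen() cannot answer this, since it stops at the first NUL, which
 * is why a backend that takes a plain C-string cannot see the rest.
 */
int pattern_has_nul(const char *pat, size_t len)
{
	return memchr(pat, 0, len) != NULL;
}
```

A pattern file containing "y", NUL, "f" has a length of 3 as read from disk, but a C-string API would only ever see "y".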
[PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding="
Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was missed due to a blindspot in our tests, as discussed in the previous commit. I then blindly copied the same bug in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when adding the PCRE v2 code. We should not tell PCRE that we're processing UTF-8 just because we're dealing with non-ASCII. In the case of e.g. "log --encoding=<...>" under is_utf8_locale() the haystack might be in ISO-8859-1, and the needle might be in a non-UTF-8 encoding. Maybe we should be more strict here and die earlier? Should we also be converting the needle to the encoding in question, and failing if it's not a string that's valid in that encoding? Maybe. But for now matching this as non-UTF8 at least has some hope of producing sensible results, since we know that our default heuristic of assuming the text to be matched is in the user locale encoding isn't true when we've explicitly encoded it to be in a different encoding. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 8 grep.h | 1 + revision.c | 3 +++ t/t4210-log-i18n.sh | 6 ++ 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/grep.c b/grep.c index f7c3a5803e..1de4ab49c0 100644 --- a/grep.c +++ b/grep.c @@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) int options = PCRE_MULTILINE; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) + if (!opt->ignore_locale && has_non_ascii(p->pattern)) p->pcre1_tables = pcre_maketables(); options |= PCRE_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE_UTF8; p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset, @@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt p->pcre2_compile_context = NULL; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) { + if 
(!opt->ignore_locale && has_non_ascii(p->pattern)) { character_tables = pcre2_maketables(NULL); p->pcre2_compile_context = pcre2_compile_context_create(NULL); pcre2_set_character_tables(p->pcre2_compile_context, character_tables); } options |= PCRE2_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, diff --git a/grep.h b/grep.h index 1875880f37..4bb8a79d93 100644 --- a/grep.h +++ b/grep.h @@ -173,6 +173,7 @@ struct grep_opt { int funcbody; int extended_regexp_option; int pattern_type_option; + int ignore_locale; char colors[NR_GREP_COLORS][COLOR_MAXLEN]; unsigned pre_context; unsigned post_context; diff --git a/revision.c b/revision.c index 621feb9df7..a842fb158a 100644 --- a/revision.c +++ b/revision.c @@ -28,6 +28,7 @@ #include "commit-graph.h" #include "prio-queue.h" #include "hashmap.h" +#include "utf8.h" volatile show_early_output_fn_t show_early_output; @@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED, &revs->grep_filter); + if (!is_encoding_utf8(get_log_output_encoding())) + revs->grep_filter.ignore_locale = 1; compile_grep_patterns(&revs->grep_filter); if (revs->reverse && revs->reflog_info) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 86d22c1d4c..515bcb7ce1 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' for engine in fixed basic extended perl do prereq= - result=success if test $engine = "perl" then - result=failure prereq="PCRE" else prereq="" @@ -72,7 +70,7 @@ do then force_regex=.* fi - test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success 
GETTEXT_LOCALE,$prereq
[PATCH v2 1/9] log tests: test regex backends in "--encode=" tests
Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

In a follow-up commit we'll fix this bug, this test just demonstrates
the existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests & how liberal the kwset code is about
garbage we missed this.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. It's possible that this
test breaks the "basic" and "extended" backends on some systems that
are more anal than glibc about the encoding of locale issues with
POSIX functions that I can remember, but PCRE is more careful about
the validation.

1. https://public-inbox.org/git/nycvar.qro.7.76.6.1906271113090...@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 t/t4210-log-i18n.sh | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh

 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh

 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')

+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
 	test_tick &&
 	cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 	test_must_be_empty actual
 '

+for engine in fixed basic extended perl
+do
+	prereq=
+	result=success
+	if test $engine = "perl"
+	then
+		result=failure
+		prereq="PCRE"
+	else
+		prereq=""
+	fi
+	force_regex=
+	if test $engine != "fixed"
+	then
+		force_regex=.*
+	fi
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		cat >expect <<-\EOF &&
+		latin1
+		utf8
+		EOF
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+		test_must_be_empty actual
+	"
+
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+		test_must_be_empty actual
+	"
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5
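Why the invalid_e needle in this test is not UTF-8 can be sketched in C. is_short_utf8() is a hypothetical helper that only understands 1- and 2-byte sequences, an assumption made to keep it short; it is not what PCRE or git actually runs.

```c
#include <stddef.h>

/*
 * Minimal sketch: accepts only 1- and 2-byte UTF-8 sequences, which is
 * enough for the "é" needles in the test. Longer (3- and 4-byte) forms
 * are rejected here even though they can be valid UTF-8.
 */
int is_short_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		if (s[i] < 0x80) {
			i++; /* plain ASCII */
		} else if (s[i] >= 0xC2 && s[i] <= 0xDF) {
			/* 2-byte lead: the next byte must look like 10xxxxxx */
			if (i + 1 >= len || (s[i + 1] & 0xC0) != 0x80)
				return 0;
			i += 2;
		} else {
			return 0; /* stray continuation byte, or a longer form */
		}
	}
	return 1;
}
```

With this, "\303\251" (utf8_e) passes, while "\351" (latin1_e) and "\303\50)" (invalid_e) do not: in the last case the byte after the 0xC3 lead is "(", whose top bits are not 10.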
[PATCH v2 0/9] grep: move from kwset to optional PCRE v2
A non-RFC since it seems people like this approach.

This should fix the test failure noted by Johannes; there are two new
patches at the start of this series. They address a bug that was there
for a long time, but which I happened to trip over since PCRE is more
strict about UTF-8 validation than kwset (which doesn't care at all).

I also added performance numbers to the relevant commit messages, took
brian's suggestion of saying "NUL-byte" instead of "\0", and did some
other copyediting of my own.

The rest of the code changes are all just comments & rewording of
previously added comments.

Ævar Arnfjörð Bjarmason (9):
  log tests: test regex backends in "--encode=" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding="
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 115 +++-
 grep.h                                        |   3 +-
 revision.c                                    |   3 +
 t/t4210-log-i18n.sh                           |  39 +-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
 t/t7816-grep-binary-pattern.sh                | 127 ++
 8 files changed, 233 insertions(+), 172 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
-:  -- > 1:  cfc01f49d3 log tests: test regex backends in "--encode=" tests
-:  -- > 2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding="
1:  ad55d3be7e = 3:  cc4d3b50d5 grep: inline the return value of a function call used only once
2:  650bcc8582 = 4:  d9b29bdd89 grep tests: move "grep binary" alongside the rest
3:  ef10a8820d !
5: f85614f435 grep tests: move binary pattern tests into their own file @@ -2,9 +2,10 @@ grep tests: move binary pattern tests into their own file -Move the tests for "-f " where "" contains a "\0" pattern -into their own file. I added most of these tests in 966be95549 ("grep: -add tests to fix blind spots with \0 patterns", 2017-05-20). +Move the tests for "-f " where "" contains a NUL byte +pattern into their own file. I added most of these tests in +966be95549 ("grep: add tests to fix blind spots with \0 patterns", +2017-05-20). Whether a regex engine supports matching binary content is very different from whether it matches binary patterns. Since @@ -14,8 +15,8 @@ engine can sensibly match binary patterns. Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting -patterns containing "\0" and considering them fixed, except in cases -where "--ignore-case" is provided and they're non-ASCII, see +patterns containing NUL-byte and considering them fixed, except in +cases where "--ignore-case" is provided and they're non-ASCII, see 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", 2016-06-25). Subsequent commits will change this behavior. 4: 03e5637efc ! 6: 90afca8707 grep: make the behavior for \0 in patterns sane @@ -1,12 +1,13 @@ Author: Ævar Arnfjörð Bjarmason -grep: make the behavior for \0 in patterns sane +grep: make the behavior for NUL-byte in patterns sane -The behavior of "grep" when patterns contained "\0" has always been -haphazard, and has served the vagaries of the implementation more than -anything else. A "\0" in a pattern can only be provided via "-f -", and since pickaxe (log search) has no such flag "\0" in -patterns has only ever been supported by "grep". +The behavior of "grep" when patterns contained a NUL-byte has always +been haphazard, and has served the vagaries of the implementation more +than anything else. A pattern containing a NUL-byte can only be +provided via "-f ". 
Since pickaxe (log search) has no such flag +the NUL-byte in patterns has only ever been supported by "grep" (and +not &
[PATCH v2 3/9] grep: inline the return value of a function call used only once
Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 1de4ab49c0..4e8d0645a8 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)

 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int ascii_only;
 	int err;
 	int regflags = REG_NEWLINE;

 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	ascii_only = !has_non_ascii(p->pattern);

 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (opt->fixed ||
 	    has_null(p->pattern, p->patternlen) ||
 	    is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || ascii_only;
+		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);

 	if (p->fixed) {
 		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5
[PATCH v2 5/9] grep tests: move binary pattern tests into their own file
Move the tests for "-f " where "" contains a NUL byte pattern into their own file. I added most of these tests in 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20). Whether a regex engine supports matching binary content is very different from whether it matches binary patterns. Since 2f8952250a ("regex: add regexec_buf() that can work on a non NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our regex engines so we can match binary content, but only the PCRE v2 engine can sensibly match binary patterns. Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting patterns containing NUL-byte and considering them fixed, except in cases where "--ignore-case" is provided and they're non-ASCII, see 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", 2016-06-25). Subsequent commits will change this behavior. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t7815-grep-binary.sh | 101 - t/t7816-grep-binary-pattern.sh | 114 + 2 files changed, 114 insertions(+), 101 deletions(-) create mode 100755 t/t7816-grep-binary-pattern.sh diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh index 2d87c49b75..90ebb64f46 100755 --- a/t/t7815-grep-binary.sh +++ b/t/t7815-grep-binary.sh @@ -4,41 +4,6 @@ test_description='git grep in binary files' . 
./test-lib.sh -nul_match () { - matches=$1 - flags=$2 - pattern=$3 - pattern_human=$(echo "$pattern" | sed 's/Q//g') - - if test "$matches" = 1 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = 0 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - elif test "$matches" = T1 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = T0 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - else - test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' - fi -} - test_expect_success 'setup' " echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && git add a && @@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' ' git grep .fi a ' -nul_match 1 '-F' 'yQf' -nul_match 0 '-F' 'yQx' -nul_match 1 '-Fi' 'YQf' -nul_match 0 '-Fi' 'YQx' -nul_match 1 '' 'yQf' -nul_match 0 '' 'yQx' -nul_match 1 '' 'æQð' -nul_match 1 '-F' 'eQm[*]c' -nul_match 1 '-Fi' 'EQM[*]C' - -# Regex patterns that would match but shouldn't with -F -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-F' '[y]Qf' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '-Fi' '[Y]QF' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-F' '[æ]Qð' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '-Fi' '[Æ]QÐ' - -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' - -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' - -# ... 
because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0 ''
Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
On Thu, Jun 27 2019, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Wed, 26 Jun 2019, Johannes Schindelin wrote:
>
>> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>>
>> > This speeds things up a lot, but as shown in the patches & tests
>> > changed modifies the behavior where we have \0 in *patterns* (only
>> > possible with 'grep -f ').
>>
>> I agree that it is not worth a lot to care about NULs in search patterns.
>>
>> So I am in favor of the goal of this patch series.
>
> There seems to be a Windows-specific test failure:
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
>
> The output is this:
>
> -- snip --
> not ok 5 - log --grep does not find non-reencoded values (latin1)
>
> expecting success:
>         git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
>         test_must_be_empty actual
>
> ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
> -- snap --
>
> Any quick ideas? (I _could_ imagine that it is yet another case of passing
> non-UTF-8-encoded stuff via command-line vs via file, which does not work
> on Windows.)

This is an existing issue that my patches just happen to uncover. I'm
working on a v2 which'll fix it.
Re: fprintf_ln() is slow
On Thu, Jun 27 2019, Duy Nguyen wrote: > On Thu, Jun 27, 2019 at 1:00 PM Jeff King wrote: >> >> On Thu, Jun 27, 2019 at 01:25:15AM -0400, Jeff King wrote: >> >> > Taylor and I noticed a slowdown in p1451 between v2.20.1 and v2.21.0. I >> > was surprised to find that it bisects to bbb15c5193 (fsck: reduce word >> > legos to help i18n, 2018-11-10). >> > >> > The important part, as it turns out, is the switch to using fprintf_ln() >> > instead of a regular fprintf() with a "\n" in it. Doing this: >> > [...] >> > on top of the current tip of master yields this result: >> > >> > Test HEAD^ HEAD >> > >> > - >> > 1451.3: fsck with 0 skipped bad commits 9.78(7.46+2.32) >> > 8.74(7.38+1.36) -10.6% >> > 1451.5: fsck with 1 skipped bad commits 9.78(7.66+2.11) >> > 8.49(7.04+1.44) -13.2% >> > 1451.7: fsck with 10 skipped bad commits 9.83(7.45+2.37) >> > 8.53(7.26+1.24) -13.2% >> > 1451.9: fsck with 100 skipped bad commits9.87(7.47+2.40) >> > 8.54(7.24+1.30) -13.5% >> > 1451.11: fsck with 1000 skipped bad commits 9.79(7.67+2.12) >> > 8.48(7.25+1.23) -13.4% >> > 1451.13: fsck with 1 skipped bad commits 9.86(7.58+2.26) >> > 8.38(7.09+1.28) -15.0% >> > 1451.15: fsck with 10 skipped bad commits9.58(7.39+2.19) >> > 8.41(7.21+1.19) -12.2% >> > 1451.17: fsck with 100 skipped bad commits 6.38(6.31+0.07) >> > 6.35(6.26+0.07) -0.5% >> >> Ah, I think I see it. >> >> See how the system times for HEAD^ (with fprintf_ln) are higher? We're >> flushing stderr more frequently (twice as much, since it's unbuffered, >> and we now have an fprintf followed by a putc). >> >> I can get similar speedups by formatting into a buffer: >> >> diff --git a/strbuf.c b/strbuf.c >> index 0e18b259ce..07ce9b9178 100644 >> --- a/strbuf.c >> +++ b/strbuf.c >> @@ -880,8 +880,22 @@ int printf_ln(const char *fmt, ...) >> >> int fprintf_ln(FILE *fp, const char *fmt, ...) >> { >> + char buf[1024]; >> int ret; >> va_list ap; >> + >> + /* Fast path: format it ourselves and dump it via fwrite. 
*/ >> + va_start(ap, fmt); >> + ret = vsnprintf(buf, sizeof(buf), fmt, ap); >> + va_end(ap); >> + if (ret < sizeof(buf)) { >> + buf[ret++] = '\n'; >> + if (fwrite(buf, 1, ret, fp) != ret) >> + return -1; >> + return ret; >> + } >> + >> + /* Slow path: a normal fprintf/putc combo */ >> va_start(ap, fmt); >> ret = vfprintf(fp, fmt, ap); >> va_end(ap); >> >> But we shouldn't have to resort to that. We can use setvbuf() to toggle >> buffering back and forth, but I'm not sure if there's a way to query the >> current buffering scheme for a stdio stream. We'd need that to be able >> to switch back correctly (and to avoid switching for things that are >> already buffered). >> >> I suppose it would be enough to check for "fp == stderr", since that is >> the only unbuffered thing we'd generally see. >> >> And it may be that the code above is really not much different anyway. >> For an unbuffered stream, I'd guess it dumps an fwrite() directly to >> write() anyway (since by definition it does not need to hold onto it, >> and nor is there anything in the buffer ahead of it). >> >> Something like: >> >> char buf[1024]; >> if (fp == stderr) >> setvbuf(fp, buf, _IOLBF, sizeof(buf)); >> >> ... do fprintf and putc ... >> >> if (fp == stderr) >> setvbuf(fp, NULL, _IONBF, 0); >> >> feels less horrible, but it's making the assumption that we were >> unbuffered coming into the function. I dunno. > > How about do all the formatting in strbuf and only fwrite last minute? > A bit more overhead with malloc(), so I don't know if it's an > improvement or not. Why shouldn't we just move back to plain fprintf() with "\n"? Your 9a0a30aa4b ("strbuf: convenience format functions with \n automatically appended", 2012-04-23) doesn't explain why this is a convenience for translators. When I'm translating things, I tend to like knowing explicitly that something ends in a newline; why do we need to hide that from translators?
They also need to deal with trailing \n in other messages, so these *_ln() functions make things inconsistent. It's also not possible for translators to do this by mistake without being caught, because msgfmt will catch this (and other common issues): po/de.po:23: 'msgid' and 'msgstr' entries do not both end with '\n'
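For readers following the buffering discussion quoted above, here is a minimal sketch of the "format into a buffer, then issue one write" idea. It is not git's actual fprintf_ln(); the name sketch_fprintf_ln and the 1024-byte buffer size are assumptions made for illustration, and the slow path simply mirrors the fprintf-plus-putc combination described in the thread.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for fprintf_ln(): format the message and the
 * trailing newline into one buffer so that an unbuffered stream such
 * as stderr sees a single write instead of two.
 */
static int sketch_fprintf_ln(FILE *fp, const char *fmt, ...)
{
	char buf[1024]; /* assumed size, as in the quoted patch */
	va_list ap;
	int ret;

	/* Fast path: the formatted text plus '\n' fits in buf. */
	va_start(ap, fmt);
	ret = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	if (ret >= 0 && (size_t)ret < sizeof(buf)) {
		buf[ret++] = '\n'; /* overwrites the terminating NUL */
		if (fwrite(buf, 1, (size_t)ret, fp) != (size_t)ret)
			return -1;
		return ret;
	}

	/* Slow path: output too long (or formatting failed). */
	va_start(ap, fmt);
	ret = vfprintf(fp, fmt, ap);
	va_end(ap);
	if (ret < 0 || putc('\n', fp) == EOF)
		return -1;
	return ret + 1;
}
```

The explicit "ret >= 0" check and the size_t casts avoid relying on signed/unsigned conversion when vsnprintf() reports an error. Whether this is preferable to simply writing the "\n" in the format string is exactly the question raised in the reply above.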
[RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
This speeds things up a lot, but as shown in the patches & tests changed modifies the behavior where we have \0 in *patterns* (only possible with 'grep -f <file>'). I'd like to go down this route because it makes dropping kwset a lot easier, and I don't think bending over backwards to support these \0 patterns is worth it. But maybe others disagree, so I wanted to send what I had before I tried tackling the pickaxe code. There I figured I'd just make -G's ERE be a PCRE if we had the PCRE v2 backend, since unlike "grep"'s default BRE the ERE syntax is mostly a subset of PCRE, but again others might think that's too aggressive and would prefer to keep the distinction, only using PCRE there in place of our current use of kwset. Ævar Arnfjörð Bjarmason (7): grep: inline the return value of a function call used only once grep tests: move "grep binary" alongside the rest grep tests: move binary pattern tests into their own file grep: make the behavior for \0 in patterns sane grep: drop support for \0 in --fixed-strings grep: remove the kwset optimization grep: use PCRE v2 for optimized fixed-string search Documentation/git-grep.txt| 17 +++ grep.c| 103 ++ grep.h| 2 - ...a1.sh => t7008-filter-branch-null-sha1.sh} | 0 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 -- t/t7816-grep-binary-pattern.sh| 127 ++ 6 files changed, 183 insertions(+), 167 deletions(-) rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%) rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%) create mode 100755 t/t7816-grep-binary-pattern.sh -- 2.22.0.455.g172b71a6c5
[RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file
Move the tests for "-f <file>" where "<file>" contains a "\0" pattern into their own file. I added most of these tests in 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20). Whether a regex engine supports matching binary content is very different from whether it matches binary patterns. Since 2f8952250a ("regex: add regexec_buf() that can work on a non NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our regex engines so we can match binary content, but only the PCRE v2 engine can sensibly match binary patterns. Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting patterns containing "\0" and considering them fixed, except in cases where "--ignore-case" is provided and they're non-ASCII, see 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", 2016-06-25). Subsequent commits will change this behavior. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t7815-grep-binary.sh | 101 - t/t7816-grep-binary-pattern.sh | 114 + 2 files changed, 114 insertions(+), 101 deletions(-) create mode 100755 t/t7816-grep-binary-pattern.sh diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh index 2d87c49b75..90ebb64f46 100755 --- a/t/t7815-grep-binary.sh +++ b/t/t7815-grep-binary.sh @@ -4,41 +4,6 @@ test_description='git grep in binary files' .
./test-lib.sh -nul_match () { - matches=$1 - flags=$2 - pattern=$3 - pattern_human=$(echo "$pattern" | sed 's/Q//g') - - if test "$matches" = 1 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = 0 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - elif test "$matches" = T1 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = T0 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - else - test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' - fi -} - test_expect_success 'setup' " echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && git add a && @@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' ' git grep .fi a ' -nul_match 1 '-F' 'yQf' -nul_match 0 '-F' 'yQx' -nul_match 1 '-Fi' 'YQf' -nul_match 0 '-Fi' 'YQx' -nul_match 1 '' 'yQf' -nul_match 0 '' 'yQx' -nul_match 1 '' 'æQð' -nul_match 1 '-F' 'eQm[*]c' -nul_match 1 '-Fi' 'EQM[*]C' - -# Regex patterns that would match but shouldn't with -F -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-F' '[y]Qf' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '-Fi' '[Y]QF' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-F' '[æ]Qð' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '-Fi' '[Æ]QÐ' - -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' - -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' - -# ... 
because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0
[RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane
The behavior of "grep" when patterns contained "\0" has always been haphazard, and has served the vagaries of the implementation more than anything else. A "\0" in a pattern can only be provided via "-f <file>", and since pickaxe (log search) has no such flag "\0" in patterns has only ever been supported by "grep". Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing "\0" were considered fixed. In 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20) I added tests for this behavior. Change the behavior to do the obvious thing, i.e. don't silently discard a regex pattern and make it implicitly fixed just because it contains a \0. Instead die if e.g. --basic-regexp is combined with such a pattern. This is desired because from a user's point of view it's the obvious thing to do. Whether we support BRE/ERE/Perl syntax is different from whether our implementation is limited by C-strings. These patterns are obscure enough that I think this behavior change is OK, especially since we never documented the old behavior. Doing this also makes it easier to replace the kwset backend with something else, since we'll no longer strictly need it for anything we can't easily use another fixed-string backend for. Signed-off-by: Ævar Arnfjörð Bjarmason --- Documentation/git-grep.txt | 17 grep.c | 23 ++--- t/t7816-grep-binary-pattern.sh | 159 ++--- 3 files changed, 110 insertions(+), 89 deletions(-) diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt index 2d27969057..c89fb569e3 100644 --- a/Documentation/git-grep.txt +++ b/Documentation/git-grep.txt @@ -271,6 +271,23 @@ providing this option will cause it to die. -f <file>:: Read patterns from <file>, one per line. ++ +Passing the pattern via <file> allows for providing a search pattern +containing a \0. ++ +Not all pattern types support patterns containing \0. Git will error +out if a given pattern type can't support such a pattern.
The +`--perl-regexp` pattern type when compiled against the PCRE v2 backend +has the widest support for these types of patterns. ++ +In versions of Git before 2.23.0 patterns containing \0 would be +silently considered fixed. This was never documented, there were also +odd and undocumented interactions between e.g. non-ASCII patterns +containing \0 and `--ignore-case`. ++ +In future versions we may learn to support patterns containing \0 for +more search backends, until then we'll die when the pattern type in +question doesn't support them. -e:: The next parameter is the pattern. This option has to be diff --git a/grep.c b/grep.c index d3e6111c46..261bd3a342 100644 --- a/grep.c +++ b/grep.c @@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len) return 1; } -static int has_null(const char *s, size_t len) -{ - /* -* regcomp cannot accept patterns with NULs so when using it -* we consider any pattern containing a NUL fixed. -*/ - if (memchr(s, 0, len)) - return 1; - - return 0; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) * simple string match using kws. p->fixed tells us if we * want to use kws. */ - if (opt->fixed || - has_null(p->pattern, p->patternlen) || - is_fixed(p->pattern, p->patternlen)) + if (opt->fixed || is_fixed(p->pattern, p->patternlen)) p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { @@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) kwsincr(p->kws, p->pattern, p->patternlen); kwsprep(p->kws); return; - } else if (opt->fixed) { + } + + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); + + if (opt->fixed) { /* * We come here when the pattern has the non-ascii * characters we cannot case-fold, and asked to diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 4060dbd679..9e09bd5d6a 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -2,113 +2,126 @@ test_description='git grep with a binary pattern files' -. ./test-lib.sh +. ./lib-gettext.sh -nul_match () { +nul_match_internal () { matches=$1 - flags=$2 - pattern=$3 + prereqs=$2 + lc_all=$3 + extra
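The distinction drawn in this commit message, between the pattern syntax and the C-string limitation of the implementation, can be sketched in isolation. The helper below is hypothetical (the name sketch_pattern_supported and its return convention are assumptions, not code from grep.c); it only mirrors the shape of the "memchr(...) && !opt->pcre2" check in the diff, returning 0 where the patch would die().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: a pattern is carried as pointer + length, so an
 * embedded NUL is found with memchr() over the full length. strlen()
 * would stop at the first NUL and silently truncate the pattern.
 *
 * Returns 1 if the pattern can be handled, 0 if it contains a NUL and
 * the (assumed) PCRE v2 backend is not in use.
 */
static int sketch_pattern_supported(const char *pattern, size_t len,
				    int use_pcre2)
{
	if (memchr(pattern, 0, len) && !use_pcre2)
		return 0;
	return 1;
}
```

Called with the three bytes 'y', NUL, 'f', this reports the pattern as unsupported unless use_pcre2 is set, whereas strlen() on the same bytes would see only "y".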
[RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings
Change "-f <file>" to not support patterns with "\0" in them under --fixed-strings, we'll now only support these under --perl-regexp with PCRE v2. A previous change to Documentation/git-grep.txt changed the description of "-f <file>" to be vague enough as to not promise that this would work, and by dropping support for this we make it a whole lot easier to move away from the kwset backend, which a subsequent change will try to do. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 6 +-- t/t7816-grep-binary-pattern.sh | 82 +- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/grep.c b/grep.c index 261bd3a342..14570c7ac1 100644 --- a/grep.c +++ b/grep.c @@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2")); + /* * Even when -F (fixed) asks us to do a non-regexp search, we * may not be able to correctly case-fold when -i @@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) return; } - if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) - die(_("given pattern contains NULL byte (via -f <file>).
This is only supported with -P under PCRE v2")); - if (opt->fixed) { /* * We come here when the pattern has the non-ascii diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 9e09bd5d6a..60bab291e4 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -60,23 +60,23 @@ test_expect_success 'setup' " " # Simple fixed-string matching that can use kwset (no -i && non-ASCII) -nul_match 1 1 1 '-F' 'yQf' -nul_match 0 0 0 '-F' 'yQx' -nul_match 1 1 1 '-Fi' 'YQf' -nul_match 0 0 0 '-Fi' 'YQx' -nul_match 1 1 1 '' 'yQf' -nul_match 0 0 0 '' 'yQx' -nul_match 1 1 1 '' 'æQð' -nul_match 1 1 1 '-F' 'eQm[*]c' -nul_match 1 1 1 '-Fi' 'EQM[*]C' +nul_match P P P '-F' 'yQf' +nul_match P P P '-F' 'yQx' +nul_match P P P '-Fi' 'YQf' +nul_match P P P '-Fi' 'YQx' +nul_match P P 1 '' 'yQf' +nul_match P P 0 '' 'yQx' +nul_match P P 1 '' 'æQð' +nul_match P P P '-F' 'eQm[*]c' +nul_match P P P '-Fi' 'EQM[*]C' # Regex patterns that would match but shouldn't with -F -nul_match 0 0 0 '-F' 'yQ[f]' -nul_match 0 0 0 '-F' '[y]Qf' -nul_match 0 0 0 '-Fi' 'YQ[F]' -nul_match 0 0 0 '-Fi' '[Y]QF' -nul_match 0 0 0 '-F' 'æQ[ð]' -nul_match 0 0 0 '-F' '[æ]Qð' +nul_match P P P '-F' 'yQ[f]' +nul_match P P P '-F' '[y]Qf' +nul_match P P P '-Fi' 'YQ[F]' +nul_match P P P '-Fi' '[Y]QF' +nul_match P P P '-F' 'æQ[ð]' +nul_match P P P '-F' '[æ]Qð' # The -F kwset codepath can't handle -i && non-ASCII... 
nul_match P 1 1 '-i' '[æ]Qð' @@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð' nul_match P 0 1 '-i' 'ÆQÐ' # \0 in regexes can only work with -P & PCRE v2 -nul_match P 1 1 '' 'yQ[f]' -nul_match P 1 1 '' '[y]Qf' -nul_match P 1 1 '-i' 'YQ[F]' -nul_match P 1 1 '-i' '[Y]Qf' -nul_match P 1 1 '' 'æQ[ð]' -nul_match P 1 1 '' '[æ]Qð' -nul_match P 0 1 '-i' 'ÆQ[Ð]' -nul_match P 1 1 '' 'eQm.*cQ' -nul_match P 1 1 '-i' 'EQM.*cQ' -nul_match P 0 0 '' 'eQm[*]c' -nul_match P 0 0 '-i' 'EQM[*]C' +nul_match P P 1 '' 'yQ[f]' +nul_match P P 1 '' '[y]Qf' +nul_match P P 1 '-i' 'YQ[F]' +nul_match P P 1 '-i' '[Y]Qf' +nul_match P P 1 '' 'æQ[ð]' +nul_match P P 1 '' '[æ]Qð' +nul_match P P 1 '-i' 'ÆQ[Ð]' +nul_match P P 1 '' 'eQm.*cQ' +nul_match P P 1 '-i' 'EQM.*cQ' +nul_match P P 0 '' 'eQm[*]c' +nul_match P P 0 '-i&
[RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
Bring back optimized fixed-string search for "grep", this time with PCRE v2 as an optional backend. As noted in [1] with kwset we were slower than PCRE v1 and v2 JIT with the kwset backend, so that optimization was counterproductive. This brings back the optimization for "-F", without changing the semantics of "\0" in patterns. As seen in previous commits in this series we could support it now, but I'd rather just leave that edge-case aside so the tests don't need to do one thing or the other depending on what --fixed-strings backend we're using. I could also support the v1 backend here, but that would make the code more complex, and I'd rather aim for simplicity here and in future changes to the diffcore. We're not going to have someone who absolutely must have faster search, but for whom building PCRE v2 isn't acceptable. 1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/grep.c b/grep.c index 4716217837..6b75d5be68 100644 --- a/grep.c +++ b/grep.c @@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } +static int is_fixed(const char *s, size_t len) +{ + size_t i; + + for (i = 0; i < len; i++) { + if (is_regex_special(s[i])) + return 0; + } + + return 1; +} + #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol, static void free_pcre2_pattern(struct grep_pat *p) { } -#endif /* !USE_LIBPCRE2 */ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) { @@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) compile_regexp_failed(p, errbuf); } } +#endif /* !USE_LIBPCRE2 */ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { int err; 
int regflags = REG_NEWLINE; + int pat_is_fixed; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; @@ -636,8 +649,38 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - if (opt->fixed) { + pat_is_fixed = is_fixed(p->pattern, p->patternlen); + if (opt->fixed || pat_is_fixed) { +#ifdef USE_LIBPCRE2 + opt->pcre2 = 1; + if (pat_is_fixed) { + compile_pcre2_pattern(p, opt); + } else { + /* +* E.g. t7811-grep-open.sh relies on the +* pattern being restored, and unfortunately +* there's no PCRE compile flag for "this is +* fixed", so we need to munge it to +* "\Q\E". +*/ + char *old_pattern = p->pattern; + size_t old_patternlen = p->patternlen; + struct strbuf sb = STRBUF_INIT; + + strbuf_add(&sb, "\\Q", 2); + strbuf_add(&sb, p->pattern, p->patternlen); + strbuf_add(&sb, "\\E", 2); + + p->pattern = sb.buf; + p->patternlen = sb.len; + compile_pcre2_pattern(p, opt); + p->pattern = old_pattern; + p->patternlen = old_patternlen; + strbuf_release(&sb); + } +#else compile_fixed_regexp(p, opt); +#endif return; } -- 2.22.0.455.g172b71a6c5
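The two steps in the patch above, checking whether a pattern has no regex-special characters and otherwise wrapping a -F pattern in \Q...\E before handing it to PCRE, can be sketched without linking against PCRE at all. Everything here is illustrative: the function names are hypothetical, and the list of special characters is an assumption standing in for git's is_regex_special().

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical equivalent of is_fixed(): 1 if no byte of the pattern
 * is in the (assumed) set of regex-special characters.
 */
static int sketch_is_fixed(const char *s, size_t len)
{
	static const char special[] = "$()*+.?[\\^{|";
	size_t i;

	for (i = 0; i < len; i++) {
		/* sizeof - 1 excludes the terminating NUL from the set */
		if (memchr(special, s[i], sizeof(special) - 1))
			return 0;
	}
	return 1;
}

/*
 * Wrap a literal pattern as "\Q<pattern>\E" so a regex engine treats
 * it as fixed text. Returns a malloc()ed, NUL-terminated buffer and
 * stores its length (without the terminator) in *out_len.
 */
static char *sketch_quote_literal(const char *pat, size_t len,
				  size_t *out_len)
{
	char *buf = malloc(len + 5); /* "\Q" + pat + "\E" + NUL */

	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pat, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[len + 4] = '\0';
	*out_len = len + 4;
	return buf;
}
```

One caveat worth keeping in mind with this approach: \Q quoting ends at the first \E, so a literal pattern that itself contains "\E" would need extra handling that this sketch does not attempt.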
[RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest
Move the "grep binary" test case added in aca20dd558 ("grep: add test script for binary file handling", 2010-05-22) so that it lives alongside the rest of the "grep" tests in t781*. This would have left a gap in the t/700* namespace, so move a "filter-branch" test down, leaving the "t7010-setup.sh" test as the next one after that. Signed-off-by: Ævar Arnfjörð Bjarmason --- ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0 t/{t7008-grep-binary.sh => t7815-grep-binary.sh} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%) rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%) diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh similarity index 100% rename from t/t7009-filter-branch-null-sha1.sh rename to t/t7008-filter-branch-null-sha1.sh diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh similarity index 100% rename from t/t7008-grep-binary.sh rename to t/t7815-grep-binary.sh -- 2.22.0.455.g172b71a6c5
[RFC/PATCH 1/7] grep: inline the return value of a function call used only once
Since e944d9d932 ("grep: rewrite an if/else condition to avoid duplicate expression", 2016-06-25) the "ascii_only" variable has only been used once in compile_regexp(), let's just inline it there. This makes the code easier to read, and might make it marginally faster depending on compiler optimizations. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/grep.c b/grep.c index f7c3a5803e..d3e6111c46 100644 --- a/grep.c +++ b/grep.c @@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { - int ascii_only; int err; int regflags = REG_NEWLINE; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; - ascii_only = !has_non_ascii(p->pattern); /* * Even when -F (fixed) asks us to do a non-regexp search, we @@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || ascii_only; + p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); -- 2.22.0.455.g172b71a6c5
[RFC/PATCH 6/7] grep: remove the kwset optimization
A later change will replace this optimization with a different one, but as removing it and running the tests demonstrates no grep semantics depend on this backend anymore. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 63 +++--- grep.h | 2 -- 2 files changed, 3 insertions(+), 62 deletions(-) diff --git a/grep.c b/grep.c index 14570c7ac1..4716217837 100644 --- a/grep.c +++ b/grep.c @@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } -static int is_fixed(const char *s, size_t len) -{ - size_t i; - - for (i = 0; i < len; i++) { - if (is_regex_special(s[i])) - return 0; - } - - return 1; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + p->fixed = opt->fixed; if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - /* -* Even when -F (fixed) asks us to do a non-regexp search, we -* may not be able to correctly case-fold when -i -* (ignore-case) is asked (in which case, we'll synthesize a -* regexp to match the pattern that matches regexp special -* characters literally, while ignoring case differences). On -* the other hand, even without -F, if the pattern does not -* have any regexp special characters and there is no need for -* case-folding search, we can internally turn it into a -* simple string match using kws. p->fixed tells us if we -* want to use kws. -*/ - if (opt->fixed || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); - - if (p->fixed) { - p->kws = kwsalloc(p->ignore_case ? 
tolower_trans_tbl : NULL); - kwsincr(p->kws, p->pattern, p->patternlen); - kwsprep(p->kws); - return; - } - if (opt->fixed) { - /* -* We come here when the pattern has the non-ascii -* characters we cannot case-fold, and asked to -* ignore-case. -*/ compile_fixed_regexp(p, opt); return; } @@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt) case GREP_PATTERN: /* atom */ case GREP_PATTERN_HEAD: case GREP_PATTERN_BODY: - if (p->kws) - kwsfree(p->kws); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) free_pcre1_regexp(p); else if (p->pcre2_pattern) free_pcre2_pattern(p); @@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name) opt->output(opt, opt->null_following_name ? "\0" : "\n", 1); } -static int fixmatch(struct grep_pat *p, char *line, char *eol, - regmatch_t *match) -{ - struct kwsmatch kwsm; - size_t offset = kwsexec(p->kws, line, eol - line, &kwsm); - if (offset == -1) { - match->rm_so = match->rm_eo = -1; - return REG_NOMATCH; - } else { - match->rm_so = offset; - match->rm_eo = match->rm_so + kwsm.size[0]; - return 0; - } -} - static int patmatch(struct grep_pat *p, char *line, char *eol, regmatch_t *match, int eflags) { int hit; - if (p->fixed) - hit = !fixmatch(p, line, eol, match); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) hit = !pcre1match(p, line, eol, match, eflags); else if (p->pcre2_pattern) hit = !pcre2match(p, line, eol, match, eflags); diff --git a/grep.h b/grep.h index 1875880f37..90ca435aad 100644 --- a/grep.h +++ b/grep.h @@ -32,7 +32,6 @@ typedef int pcre2_compile_context; typedef int pcre2_match_context; typedef int pcre2_jit_stack; #endif -#include "kwset.h" #include "thread-utils.h" #include "userdiff.h" @@ -97,7 +96,6 @@ struct grep_pat { pcre2_match_context *pcre2_match_context; pcre2_jit_stack *pcre2_jit_stack; uint32_t pcre2_jit_on; - kwset_t kws; unsigned fixed:1; unsigned ignore_case:1; unsigned word_regexp:1; -- 2.22.0.455.g172b71a6c5
Re: 2.22.0 repack -a duplicating pack contents
On Mon, Jun 24 2019, Jeff King wrote: > On Sun, Jun 23, 2019 at 06:08:25PM +, Eric Wong wrote: > >> > I'm not sure of the right solution. For maximal backwards-compatibility, >> > the default for bitmaps could become "if not bare and if there are no >> > .keep files". But that would mean bitmaps sometimes not getting >> > generated because of the problems that ee34a2bead was trying to solve. >> > >> > That's probably OK, though; you can always flip the bitmap config to >> > "true" yourself if you _must_ have bitmaps. >> >> What about something like this? Needs tests but I need to leave, now. > > Yeah, I think that's the right direction. > > Though... > >> +static int has_pack_keep_file(void) >> +{ >> +DIR *dir; >> +struct dirent *e; >> +int found = 0; >> + >> +if (!(dir = opendir(packdir))) >> +return found; >> + >> +while ((e = readdir(dir)) != NULL) { >> +if (ends_with(e->d_name, ".keep")) { >> +found = 1; >> +break; >> +} >> +} >> +closedir(dir); >> +return found; >> +} > > I think this can be replaced with just checking p->pack_keep for each > item in the packed_git list. > > That's racy, but then so is your code here, since it's really the child > pack-objects which is going to deal with the .keep. I don't think we > need to care much about the race, though. Either: > > 1. Somebody has made an old intentional .keep, which would not be > racy. We'd see it in both places. > > 2. Somebody _just_ made an intentional .keep; we'll race with that and > maybe duplicate objects from the kept pack. But this is a rare > occurrence, and there's no real ordering promise here anyway with > somebody creating .keep files alongside a running repack. > > 3. An incoming fetch/push may create a .keep file as a temporary lock, > which we see here but which goes away by the time pack-objects > runs. 
That's OK; we err on the side of not generating bitmaps, but > they're an optimization anyway (and if you really insist on having > them, you should tell Git to definitely make them instead of > relying on this default behavior). This sort of thing (#3) strikes me as a fairly pathological case we should try to avoid. Now that we've turned on bitmaps by default, people will take the sort of performance increase noted in [1] for granted. So they'll be happily running with that & then get a CPU/IO spike as the *.bitmap files they'd been implicitly relying on for years in their default config go away, only to have them re-appear when "repack" runs next. I can't think of a great solution for this case, but some thoughts: a. Perhaps we should split the *.keep flag into two things or more. We're using it for all of "I want this *.pack forever" (e.g. debugging) and "I want only this *.pack to contain the data found in it" (I/O & CPU optimization, what Janos wants) and "I'm git.git code avoiding a race with myself" (what you describe in #3). So maybe for the last of those we could also use and understand *.tmp-keep, at which point we wouldn't have this race described in #3. The 1st of those is a *.noprune and the 2nd is *.highlander (but whether it's worth splitting all that out vs. just having *.tmp-keep is another matter). b. Shouldn't we at least print some warning to STDERR in this case so e.g. gc.log will note the performance degradation of the repo in its current configuration? > 4. Like (3), but we _don't_ see the temporary .keep here but _do_ see > it during pack-objects. That's OK, because we'll have told > pack-objects to pack those objects anyway, which is the right > thing. > > -Peff 1. https://github.blog/2015-09-22-counting-objects/
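The directory scan quoted earlier in this message is easier to follow once unflattened. Below is a self-contained sketch of it, assuming POSIX opendir()/readdir(); has_suffix() is a hypothetical stand-in for git's ends_with(), and the directory is passed as a parameter rather than read from a global.

```c
#include <dirent.h>
#include <string.h>

/* Hypothetical stand-in for ends_with(): 1 if s ends in suffix. */
static int has_suffix(const char *s, const char *suffix)
{
	size_t slen = strlen(s), suflen = strlen(suffix);

	return slen >= suflen && !strcmp(s + slen - suflen, suffix);
}

/*
 * Returns 1 if any entry in packdir ends in ".keep", 0 otherwise
 * (including when the directory cannot be opened). As discussed in
 * the thread, the answer is inherently racy: a .keep file may appear
 * or vanish between this check and the later pack-objects run.
 */
static int sketch_has_pack_keep_file(const char *packdir)
{
	DIR *dir = opendir(packdir);
	struct dirent *e;
	int found = 0;

	if (!dir)
		return 0;
	while ((e = readdir(dir)) != NULL) {
		if (has_suffix(e->d_name, ".keep")) {
			found = 1;
			break;
		}
	}
	closedir(dir);
	return found;
}
```

Note that a suffix test for ".keep" would not match a file named with the proposed ".tmp-keep" extension, which is what would let a temporary lock be told apart from an intentional keep file under suggestion (a); that extension is only a proposal in this thread, not an existing git feature.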
Re: 2.22.0 repack -a duplicating pack contents
On Sun, Jun 23 2019, Janos Farkas wrote: > I'm using .keep files to... well.. keep packs to avoid some CPU time > spent on repacking huge packs and make the process somewhat more > incremental. > > Something changed with 22.2.0. Now .bitmap files are also created, > and no simple repacks re-create the pack data in a completely new > file, wasting quite some storage: > > 02d03::master> find objects/pack/pack* -type f|xargs ls -sht > 108K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.bitmap > 524K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.idx > 4.7M objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.pack > 108K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.bitmap > 524K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.idx > 4.6M objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack > 116K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.bitmap > 524K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.idx > 4.6M objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack >0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep > 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap > 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx > 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack > 02d03::master > git repack -af > Enumerating objects: 19001, done. > Counting objects: 100% (19001/19001), done. > Delta compression using up to 2 threads > Compressing objects: 100% (18952/18952), done. > Writing objects: 100% (19001/19001), done. > warning: ignoring extra bitmap file: > ./objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack > warning: ignoring extra bitmap file: > ./objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack > warning: ignoring extra bitmap file: > ./objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack > Reusing bitmaps: 104, done. > Selecting bitmap commits: 2550, done. 
> Building bitmaps: 100% (130/130), done. > Total 19001 (delta 14837), reused 4162 (delta 0) > 02d03::master > find objects/pack/pack* -type f|xargs ls -sht > 108K objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.bitmap > 524K objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.idx > 4.6M objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.pack <= > 108K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.bitmap > 524K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.idx > 4.7M objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.pack > 108K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.bitmap > 524K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.idx > 4.6M objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack > 116K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.bitmap > 524K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.idx > 4.6M objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack >0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep > 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap > 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx > 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack > > The ccbd455 pack and its metadata seem quite pointless to be > containing apparently all the data based on the size. > > If I use -ad, a new pack is still created,which, judging by the size, > is essentially everything again, (but at least the extra packs are > removed) > > 02d03::master> git repack -ad > Enumerating objects: 19001, done. > Counting objects: 100% (19001/19001), done. > Delta compression using up to 2 threads > Compressing objects: 100% (4114/4114), done. > Writing objects: 100% (19001/19001), done. > warning: ignoring extra bitmap file: > ./objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack > Reusing bitmaps: 104, done. > Selecting bitmap commits: 2550, done. 
> Building bitmaps: 100% (130/130), done. > Total 19001 (delta 14838), reused 19001 (delta 14838) > 02d03::master 9060> find objects/pack/pack* -type f|xargs ls -sht > 116K objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.bitmap > 524K objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.idx > 4.6M objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.pack <= >0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep > 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap > 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx > 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack > > Previously, the kept pack would be kept, and no additional packs would > be created if no new objects were born in the repro. > > With the .keep placeholder removed, the duplication does not happen, > but all the repro is rewritten into a new pack, which does not look > correct. Am I doing something u
Re: Deadname rewriting
On Fri, Jun 21 2019, Phil Hord wrote: > On Sat, Jun 15, 2019 at 1:19 AM Ævar Arnfjörð Bjarmason > wrote: >> On Sat, Jun 15 2019, Phil Hord wrote: >> >> > At $work we have a long time employee who has changed their name from >> > Alice to Bob. Bob doesn't want anyone to call him "Alice" anymore and >> > is prone to be offended if they do. This is called "deadnaming". > ... >> What should be done is to extend the .mailmap support to other >> cases. I.e. make tools like blame, shortlog etc. show the equivalent of >> %aN and %aE by default. > > It seems that shortlog and blame do use %aE and %aN by default. Even > log does. It is only because I didn't know about %aN 10 years ago > that my custom log format does not. > > It's a pity the format author has the option to ignore the mailmap. I > think it's a choice commonly made by mistake rather than intention. I > wonder if anyone would mind a forced-override config. Maybe a force > flag in the .mailmap file itself. > > >Other Authornick2 >Alice Doe--force Yeah I'm sure a lot of people who do %an really mean %aN, but blanket forcing it seems a recipe for breakage since "log" and friends are also used as plumbing where you really mean "what does it say in this commit object". E.g. I use %an intentionally for a company-internal tool to map an Alice to Bob for reporting purposes, which presumably you'd also want. But yeah, there'll be other uses that didn't intend it. I think probably the best way forward is to just make git use %aN by default in porcelain, and outside users presumably would get reports about such issues eventually in cases like this where someone cared. >> This topic was discussed at the last git contributor summit (brought up >> by CB Bailey) resulting in this patch, which I see didn't make it in & >> needs to be resurrected again: >> https://public-inbox.org/git/20181212171052.13415-1...@hashpling.org/ > > Thanks for the link. 
> > I didn't know about config options for mailmap.file and log.mailmap > before. These do make this option much more useful, especially when we > can insert default settings for them into /etc/gitconfig across the > company. Right, and to the extent that we don't --use-mailmap by default I think that's mainly because nobody's cared enough to advocate for it. I think it would be a sensible default.
Re: [PATCH 1/1] t0001: fix on case-insensitive filesystems
On Sun, Jun 09 2019, brian m. carlson wrote: > On 2019-06-08 at 14:43:43, Johannes Schindelin via GitGitGadget wrote: >> diff --git a/t/t0001-init.sh b/t/t0001-init.sh >> index 42a263cada..f54a69e2d9 100755 >> --- a/t/t0001-init.sh >> +++ b/t/t0001-init.sh >> @@ -307,10 +307,20 @@ test_expect_success 'init prefers command line to >> GIT_DIR' ' >> test_path_is_missing otherdir/refs >> ' >> >> +downcase_on_case_insensitive_fs () { >> +test false = "$(git config --get core.filemode)" || return 0 >> +for f > > TIL that “for f” is equivalent to “for f in "$@"”. Thanks for teaching > me something new. See also test_have_prereq in test-lib-functions.sh where this trick is combined with IFS to loop over a "param,like,this" split by ",".
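To make the trick concrete, here is a minimal sketch of the same pattern outside the test suite. The function name split_on_comma is made up for illustration; only the IFS handling and the bare "for" are meant to mirror what test_have_prereq does, and the real helper does more than this.

```shell
#!/bin/sh
# Hypothetical helper, not part of git's test suite: prints each
# comma-separated item of its first argument on its own line.
split_on_comma () {
	# Temporarily split on "," and load the pieces into "$@".
	old_ifs=$IFS
	IFS=,
	set -- $1
	IFS=$old_ifs

	# A bare "for item" iterates over "$@", one item per pass.
	for item
	do
		printf '%s\n' "$item"
	done
}

split_on_comma "param,like,this"
```

Restoring IFS right after the "set --" keeps the split from leaking into the loop body, which matters if the body itself relies on normal word splitting.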
Re: [PATCH] tests: mark two failing tests under FAIL_PREREQS
On Fri, Jun 21 2019, Johannes Schindelin wrote: > Hi Ævar, > > On Thu, 20 Jun 2019, Ævar Arnfjörð Bjarmason wrote: > >> Fix a couple of tests that would potentially fail under >> GIT_TEST_FAIL_PREREQS=true. >> >> I missed these when annotating other tests in dfe1a17df9 ("tests: add >> a special setup where prerequisites fail", 2019-05-13) because on my >> system I can only reproduce this failure when I run the tests as >> "root", since the tests happen to depend on whether we can fall back >> on GECOS info or not. I.e. they'd usually fail to look up the ident >> info anyway, but not always. > > I had to read the commit message (in particular the oneline) a couple of > times, and I have to admit that I wish it was a bit clearer... > > From the explanation, I would have assumed that those two test cases fail > often, anyway, so they shouldn't care whether `FAIL_PREREQS` is in effect. > > The only reason why they should be exempt from the `FAIL_PREREQS` mode > that I can think of is that later test cases would depend on them, but how > can they? Those test cases would also have to have the `AUTOIDENT` prereq, > and they would be skipped under `FAIL_PREREQS`, too, no? The test doesn't depend on "AUTOIDENT", but "!AUTOIDENT", i.e. the negated version. The effect of the FAIL_PREREQS mode is to set all prereqs to false, and therefore "test_have_prereq AUTOIDENT" is false, but "test_have_prereq !AUTOIDENT" is true. So this test that would otherwise get skipped gets run. I honestly didn't think much about these cases when I wrote dfe1a17df9 ("tests: add a special setup where prerequisites fail", 2019-05-13), and now I'm not quite sure whether it should be considered a bug or a feature, but in the meantime this un-breaks the test suite under this mode. > In other words, I struggle to understand why this patch is necessary. > > Could you help me understand? 
> > Ciao, > Dscho > >> >> Signed-off-by: Ævar Arnfjörð Bjarmason >> --- >> t/t0007-git-var.sh | 2 +- >> t/t7502-commit-porcelain.sh | 2 +- >> 2 files changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/t/t0007-git-var.sh b/t/t0007-git-var.sh >> index 5868a87352..1f600e2cae 100755 >> --- a/t/t0007-git-var.sh >> +++ b/t/t0007-git-var.sh >> @@ -17,7 +17,7 @@ test_expect_success 'get GIT_COMMITTER_IDENT' ' >> test_cmp expect actual >> ' >> >> -test_expect_success !AUTOIDENT 'requested identites are strict' ' >> +test_expect_success !FAIL_PREREQS,!AUTOIDENT 'requested identites are >> strict' ' >> ( >> sane_unset GIT_COMMITTER_NAME && >> sane_unset GIT_COMMITTER_EMAIL && >> diff --git a/t/t7502-commit-porcelain.sh b/t/t7502-commit-porcelain.sh >> index 5733d9cd34..14c92e4c25 100755 >> --- a/t/t7502-commit-porcelain.sh >> +++ b/t/t7502-commit-porcelain.sh >> @@ -402,7 +402,7 @@ echo editor started >"$(pwd)/.git/result" >> exit 0 >> EOF >> >> -test_expect_success !AUTOIDENT 'do not fire editor when committer is bogus' >> ' >> +test_expect_success !FAIL_PREREQS,!AUTOIDENT 'do not fire editor when >> committer is bogus' ' >> >.git/result && >> >> echo >>negative && >> -- >> 2.22.0.455.g172b71a6c5 >> >>
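The interaction between negated prerequisites and the FAIL_PREREQS mode discussed in this message can be modelled with a toy version of the prerequisite helpers. Everything prefixed with toy_ is hypothetical and far simpler than the real test-lib-functions.sh code. The sketch assumes only that the mode refuses to register any prerequisite except FAIL_PREREQS itself (the real code also exempts SYMLINKS, which is left out here).

```shell
#!/bin/sh
# Toy model of prerequisites; all names here are made up.
fail_prereqs_mode=
satisfied=" "

toy_set_prereq () {
	if test -n "$fail_prereqs_mode"
	then
		case "$1" in
		FAIL_PREREQS)
			;;	# the mode itself may still be registered
		*)
			return	# pretend every other prerequisite is missing
			;;
		esac
	fi
	satisfied="$satisfied$1 "
}

# toy_have_prereq "A,!B": succeeds only if every listed item holds.
toy_have_prereq () {
	old_ifs=$IFS
	IFS=,
	set -- $1
	IFS=$old_ifs
	for p
	do
		case "$p" in
		!*)
			# Negated: fail if the prerequisite is registered.
			case "$satisfied" in
			*" ${p#!} "*) return 1 ;;
			esac
			;;
		*)
			# Plain: fail unless the prerequisite is registered.
			case "$satisfied" in
			*" $p "*) ;;
			*) return 1 ;;
			esac
			;;
		esac
	done
	return 0
}

# toy_setup MODE: a non-empty MODE simulates GIT_TEST_FAIL_PREREQS.
toy_setup () {
	fail_prereqs_mode=$1
	satisfied=" "
	toy_set_prereq AUTOIDENT	# pretend this system supports it
	if test -n "$fail_prereqs_mode"
	then
		toy_set_prereq FAIL_PREREQS
	fi
}

toy_setup yes
toy_have_prereq '!AUTOIDENT' && echo "runs under the mode"
toy_have_prereq '!FAIL_PREREQS,!AUTOIDENT' || echo "skipped under the mode"
```

In this model a test guarded only by !AUTOIDENT starts running once the mode hides AUTOIDENT, while adding !FAIL_PREREQS to the list gets it skipped again, which is the effect the patch is after.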
[PATCH] push: make "HEAD:tags/my-tag" consistently push to a branch
When a refspec like "HEAD:tags/my-tag" is pushed where "HEAD" is a branch, we'll push a *branch* that'll be located at "refs/heads/tags/my-tag". This is part of the rather straightforward rules I documented in 2219c09e23 ("push doc: document the DWYM behavior pushing to unqualified <dst>", 2018-11-13). However, if there exists a "refs/tags/my-tag" on the remote the count_refspec_match() logic will, as a result of calling refname_match(), match the partially-qualified RHS of the refspec to "refs/tags/my-tag", because it's in a loop where it tries to match "tags/my-tag" to "refs/tags/my-tag", then "refs/tags/tags/my-tag" etc. This resulted in a case[1] where someone on LKML did: git push kvm +HEAD:tags/for-linus Which would have created a new "refs/heads/tags/for-linus" branch in their "kvm" repository. But since they happened to have an existing "refs/tags/for-linus" reference we pushed there instead, and replaced an annotated tag with a lightweight tag. We do want a RHS ref like "master" to match "refs/heads/master", but it's confusing and dangerous that the DWYM behavior for matching partial RHS refspecs acts differently when the start of the RHS happens to be a second-level namespace under "refs/", like "tags". Now we'll print out the following advice when this happens, and act differently as described therein: hint: The <dst> part of the refspec matched both of: hint: hint: 1. tags/my-tag -> refs/tags/my-tag hint: 2. tags/my-tag -> refs/heads/tags/my-tag hint: hint: Earlier versions of git would have picked (1) as the RHS starts hint: with a second-level ref prefix which could be fully-qualified by hint: adding 'refs/' in front of it. We now pick (2) which uses the prefix hint: inferred from the <src> part of the refspec. hint: hint: See the "..." rules discussed in 'git help push'. An earlier version of this patch[2] used the much more heavy-handed approach of changing this logic in refname_match(). 
As shown from the tests that patch needed to modify that results in changes that are overzealous for fixing this push-specific issue. The right place to fix this is in match_explicit(). There we can see if we have both a DWYM match and a match based on the prefix of the LHS of the refspec, in those cases the match based on the LHS's ref prefix should win. 1. https://lore.kernel.org/lkml/2d55fd2a-afbf-1b7c-ca82-8bffaa18e...@redhat.com/ 2. https://public-inbox.org/git/20190526225445.21618-1-ava...@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- Now that the 2.22.0 release is out I cleaned this up into a more sensible patch. Documentation/config/advice.txt | 7 +++ Documentation/git-push.txt | 13 + advice.c| 2 ++ advice.h| 1 + remote.c| 23 ++- t/t5505-remote.sh | 18 ++ 6 files changed, 63 insertions(+), 1 deletion(-) diff --git a/Documentation/config/advice.txt b/Documentation/config/advice.txt index ec4f6ae658..36cb3db63a 100644 --- a/Documentation/config/advice.txt +++ b/Documentation/config/advice.txt @@ -37,6 +37,13 @@ advice.*:: we can still suggest that the user push to either refs/heads/* or refs/tags/* based on the type of the source object. + pushPartialAmbigiousName:: + Shown when linkgit:git-push[1] is given a refspec + where the in earlier versions of Git would have + matched a on the remote based on its existence + over appending a prefix based on the type of the + . See the "..." documentation in + linkgit:git-push[1] for details. statusHints:: Show directions on how to proceed from the current state in the output of linkgit:git-status[1], in diff --git a/Documentation/git-push.txt b/Documentation/git-push.txt index 6a8a0d958b..5c46ef5e59 100644 --- a/Documentation/git-push.txt +++ b/Documentation/git-push.txt @@ -84,6 +84,19 @@ is ambiguous. * If resolves to a ref starting with refs/heads/ or refs/tags/, then prepend that to . ++ +Versions of Git before 2.23.0 would override this rule and match +e.g. 
`HEAD:tags/mark` to either `refs/tags/mark` or `refs/heads/tags/mark` +depending on, respectively, if `refs/tags/mark` existed or not on the +remote. ++ +We'll now consistently pick `refs/heads/tags/mark` based on this rule +and so that we're not so eager in guessing the <dst> on the remote +that we'll pick a different <dst> based on what refs exist there +already than we otherwise would have. This exception guards for cases +where the match would be different due to a subse
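The difference between the old and new choice can be sketched with a toy resolver. This is not the remote.c logic from the patch: the function name, its arguments and the three-entry prefix list are assumptions made for illustration, and the "new" rule is only my reading of the hint text in the commit message.

```shell
#!/bin/sh
# Toy model, not git's actual remote.c logic.
# resolve_dst DST SRC_PREFIX REMOTE_REFS MODE
#   DST:         right-hand side as typed, e.g. tags/for-linus
#   SRC_PREFIX:  prefix inferred from the source, e.g. refs/heads/
#   REMOTE_REFS: space-separated refs assumed to exist on the remote
#   MODE:        "old" or "new"
resolve_dst () {
	dst=$1 src_prefix=$2 remote_refs=$3 mode=$4

	# DWYM: the first expansion that names an existing remote ref wins.
	dwym=
	for prefix in refs/ refs/tags/ refs/heads/
	do
		case " $remote_refs " in
		*" $prefix$dst "*)
			dwym=$prefix$dst
			break
			;;
		esac
	done

	by_src=$src_prefix$dst
	if test -z "$dwym"
	then
		# Nothing matched, so fall back to the source-based prefix.
		echo "$by_src"
	elif test "$mode" = new && test "$dwym" = "refs/$dst"
	then
		# The match only came from sticking "refs/" in front of
		# the name; prefer the source-based prefix instead.
		echo "$by_src"
	else
		echo "$dwym"
	fi
}

remote="refs/heads/master refs/tags/for-linus"
resolve_dst tags/for-linus refs/heads/ "$remote" old
resolve_dst tags/for-linus refs/heads/ "$remote" new
```

With the sample remote above, the first call prints refs/tags/for-linus and the second prints refs/heads/tags/for-linus, while a plain "master" still resolves to refs/heads/master in both modes.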
[PATCH v3 4/8] t6040 test: stop using global "script" variable
Change test code added in c0234b2ef6 ("stat_tracking_info(): clear object flags used during counting", 2008-07-03) to stop using the "script" variable also used for lazy prerequisites in test-lib-functions.sh. Since this test uses test_i18ncmp and expects to use its own "script" variable twice it implicitly depends on the C_LOCALE_OUTPUT prerequisite not being a lazy prerequisite. A follow-up change will make it a lazy prerequisite, so we must remove this landmine before inadvertently stepping on it as we make that change. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t6040-tracking-info.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/t/t6040-tracking-info.sh b/t/t6040-tracking-info.sh index 716283b274..970b25a289 100755 --- a/t/t6040-tracking-info.sh +++ b/t/t6040-tracking-info.sh @@ -38,7 +38,7 @@ test_expect_success setup ' advance h ' -script='s/^..\(b.\) *[0-9a-f]* \(.*\)$/\1 \2/p' +t6040_script='s/^..\(b.\) *[0-9a-f]* \(.*\)$/\1 \2/p' cat >expect <<\EOF b1 [ahead 1, behind 1] d b2 [ahead 1, behind 1] d @@ -53,7 +53,7 @@ test_expect_success 'branch -v' ' cd test && git branch -v ) | - sed -n -e "$script" >actual && + sed -n -e "$t6040_script" >actual && test_i18ncmp expect actual ' @@ -71,7 +71,7 @@ test_expect_success 'branch -vv' ' cd test && git branch -vv ) | - sed -n -e "$script" >actual && + sed -n -e "$t6040_script" >actual && test_i18ncmp expect actual ' -- 2.22.0.455.g172b71a6c5
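For readers who have not looked at t6040, here is a small sketch of what the renamed sed program does. The wrapper function and the sample lines are made up; the real test pipes "git branch -v" output into sed directly.

```shell
#!/bin/sh
# The sed program from the patch, under its new name.
t6040_script='s/^..\(b.\) *[0-9a-f]* \(.*\)$/\1 \2/p'

# Hypothetical wrapper for illustration only.
filter_branch_lines () {
	sed -n -e "$t6040_script"
}

# Made-up sample resembling "git branch -v" output: a two-character
# marker column, the branch name, an abbreviated hash, then the rest.
printf '%s\n' \
	'  b1     1234abc [ahead 1, behind 1] d' \
	'* main   89abcde x' |
filter_branch_lines
```

Only branches named "b" plus one character survive, with the hash dropped, so the sample prints "b1 [ahead 1, behind 1] d". Because the program lives in a plain shell variable, any library code that also assigns to "script" would silently replace it, hence the more specific name.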
[PATCH v3 1/8] config tests: simplify include cycle test
Simplify an overly verbose test added in 9b25a0b52e ("config: add include directive", 2012-02-06). The "expect" file was never used, and by using .gitconfig it's not as intuitive to reproduce this manually with "-d" as some other tests, since HOME needs to be set in the environment. Also remove the use of test_i18ngrep added in a769bfc74f ("config.c: mark more strings for translation", 2018-07-21) in favor of overriding the GIT_TEST_GETTEXT_POISON value. Using the i18n test wrappers hasn't been needed since my 6cdccfce1e ("i18n: make GETTEXT_POISON a runtime option", 2018-11-08). As a follow-up change to the yet-to-be-added t0017-env-helper.sh will show, doing it this way can hide a regression when combined with trace2's early config reading. That early config reading was added in bce9db6de9 ("trace2: use system/global config for default trace2 settings", 2019-04-15). So let's remove the testing for that potential regression here, I'll instead add it explicitly to t0017-env-helper.sh in a follow-up change. 
Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t1305-config-include.sh | 21 +++-- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/t/t1305-config-include.sh b/t/t1305-config-include.sh index 579a86b7f8..6b388ba2d0 100755 --- a/t/t1305-config-include.sh +++ b/t/t1305-config-include.sh @@ -310,20 +310,13 @@ test_expect_success SYMLINKS 'conditional include, gitdir matching symlink, icas ' test_expect_success 'include cycles are detected' ' - cat >.gitconfig <<-\EOF && - [test]value = gitconfig - [include]path = cycle - EOF - cat >cycle <<-\EOF && - [test]value = cycle - [include]path = .gitconfig - EOF - cat >expect <<-\EOF && - gitconfig - cycle - EOF - test_must_fail git config --get-all test.value 2>stderr && - test_i18ngrep "exceeded maximum include depth" stderr + git init --bare cycle && + git -C cycle config include.path cycle && + git config -f cycle/cycle include.path config && + test_must_fail \ + env GIT_TEST_GETTEXT_POISON= \ + git -C cycle config --get-all test.value 2>stderr && + grep "exceeded maximum include depth" stderr ' test_done -- 2.22.0.455.g172b71a6c5
[PATCH v3 7/8] tests: replace test_tristate with "git env--helper"
The test_tristate helper introduced in 83d842dc8c ("tests: turn on network daemon tests by default", 2014-02-10) can now be better implemented with "git env--helper" to give the variables in question the standard boolean behavior. The reason for the "tristate" was to have all of false/true/auto, where "auto" meant either "false" or "true" depending on what the fallback was. With the --default option to "git env--helper" we can simply have e.g. GIT_TEST_HTTPD where we know if it's true because the user asked explicitly ("true"), or true implicitly ("auto"). This breaks backwards compatibility for explicitly setting "auto" for these variables, but I don't think anyone cares. That was always intended to be internal. This means the test_normalize_bool() code in test-lib-functions.sh goes away in addition to test_tristate(). We still need the test_skip_or_die() helper, but now it takes the variable name instead of the value, and uses "git env--helper" to distinguish a default "true" from an explicit "true" (in those "explicit true" cases we want to fail the test in question). Signed-off-by: Ævar Arnfjörð Bjarmason --- t/lib-git-daemon.sh | 7 +++--- t/lib-git-svn.sh| 11 +++- t/lib-httpd.sh | 15 ++- t/t5512-ls-remote.sh| 3 +-- t/test-lib-functions.sh | 56 ++--- 5 files changed, 22 insertions(+), 70 deletions(-) diff --git a/t/lib-git-daemon.sh b/t/lib-git-daemon.sh index 7b3407134e..fb8f887080 100644 --- a/t/lib-git-daemon.sh +++ b/t/lib-git-daemon.sh @@ -15,8 +15,7 @@ # # test_done -test_tristate GIT_TEST_GIT_DAEMON -if test "$GIT_TEST_GIT_DAEMON" = false +if ! 
git env--helper --type=bool --default=true --exit-code GIT_TEST_GIT_DAEMON then skip_all="git-daemon testing disabled (unset GIT_TEST_GIT_DAEMON to enable)" test_done @@ -24,7 +23,7 @@ fi if test_have_prereq !PIPE then - test_skip_or_die $GIT_TEST_GIT_DAEMON "file system does not support FIFOs" + test_skip_or_die GIT_TEST_GIT_DAEMON "file system does not support FIFOs" fi test_set_port LIB_GIT_DAEMON_PORT @@ -73,7 +72,7 @@ start_git_daemon() { kill "$GIT_DAEMON_PID" wait "$GIT_DAEMON_PID" unset GIT_DAEMON_PID - test_skip_or_die $GIT_TEST_GIT_DAEMON \ + test_skip_or_die GIT_TEST_GIT_DAEMON \ "git daemon failed to start" fi } diff --git a/t/lib-git-svn.sh b/t/lib-git-svn.sh index c1271d6863..5d4ae629e1 100644 --- a/t/lib-git-svn.sh +++ b/t/lib-git-svn.sh @@ -69,14 +69,12 @@ svn_cmd () { maybe_start_httpd () { loc=${1-svn} - test_tristate GIT_SVN_TEST_HTTPD - case $GIT_SVN_TEST_HTTPD in - true) + if git env--helper --type=bool --default=false --exit-code GIT_TEST_HTTPD + then . "$TEST_DIRECTORY"/lib-httpd.sh LIB_HTTPD_SVN="$loc" start_httpd - ;; - esac + fi } convert_to_rev_db () { @@ -106,8 +104,7 @@ EOF } require_svnserve () { - test_tristate GIT_TEST_SVNSERVE - if ! test "$GIT_TEST_SVNSERVE" = true + if ! git env--helper --type=bool --default=false --exit-code GIT_TEST_SVNSERVE then skip_all='skipping svnserve test. (set $GIT_TEST_SVNSERVE to enable)' test_done diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh index b3cc62bd36..0d985758c6 100644 --- a/t/lib-httpd.sh +++ b/t/lib-httpd.sh @@ -41,15 +41,14 @@ then test_done fi -test_tristate GIT_TEST_HTTPD -if test "$GIT_TEST_HTTPD" = false +if ! git env--helper --type=bool --default=true --exit-code GIT_TEST_HTTPD then skip_all="Network testing disabled (unset GIT_TEST_HTTPD to enable)" test_done fi if ! test_have_prereq NOT_ROOT; then - test_skip_or_die $GIT_TEST_HTTPD \ + test_skip_or_die GIT_TEST_HTTPD \ "Cannot run httpd tests as root" fi @@ -95,7 +94,7 @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE if ! 
test -x "$LIB_HTTPD_PATH" then - test_skip_or_die $GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'" + test_skip_or_die GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'" fi HTTPD_VERSION=$($LIB_HTTPD_PATH -v | \ @@ -107,19 +106,19 @@ then then if ! test $HTTPD_VERSION -ge 2 then - test_skip_or_die $GIT_TEST_HTTPD \ + test_skip_or_die GIT_TEST_HTTPD \ "at least Apache version 2 is required" fi if
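The "default true" versus "explicit true" distinction from the commit message can be illustrated without git. The sketch below is a pure-shell stand-in for "git env--helper --type=bool --default=... --exit-code": the toy_ names are hypothetical, and the accepted spellings are a simplification (as I understand it, the real helper is case-insensitive, accepts other integers and dies on invalid values).

```shell
#!/bin/sh
# toy_env_bool DEFAULT VARNAME
# Succeeds if the named variable is true, using DEFAULT when the
# variable is not set at all. Hypothetical stand-in, not git code.
toy_env_bool () {
	eval "is_set=\${$2+set} value=\${$2-}"
	if test -z "$is_set"
	then
		value=$1
	fi
	case "$value" in
	true|yes|on|1)
		return 0
		;;
	*)
		return 1
		;;
	esac
}

# toy_skip_or_die VARNAME MESSAGE
# Asking with a default of "false" only succeeds when the user set
# the variable to true explicitly, and that case should fail loudly.
toy_skip_or_die () {
	if toy_env_bool false "$1"
	then
		echo "error: $2"
		return 1
	fi
	echo "skip: $2"
}

unset TOY_TEST_HTTPD
toy_env_bool true TOY_TEST_HTTPD && echo "enabled by default"
toy_skip_or_die TOY_TEST_HTTPD "no web server found"
```

The same variable is queried twice with different defaults: once with "true" to decide whether the tests are enabled at all, and once with "false" to find out whether the user insisted on them.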
[PATCH v3 3/8] config.c: refactor die_bad_number() to not call gettext() early
Prepare die_bad_number() for a change to specially handle GIT_TEST_GETTEXT_POISON calling git_env_bool() by making die_bad_number() not call gettext() early, which would in turn call git_env_bool(). There's no meaningful change here yet, just a re-arrangement of the current code to make that subsequent change easier to read. Signed-off-by: Ævar Arnfjörð Bjarmason --- config.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/config.c b/config.c index 296a6d9cc4..374cb33005 100644 --- a/config.c +++ b/config.c @@ -949,34 +949,35 @@ int git_parse_ssize_t(const char *value, ssize_t *ret) NORETURN static void die_bad_number(const char *name, const char *value) { - const char * error_type = (errno == ERANGE)? _("out of range"):_("invalid unit"); + const char *error_type = (errno == ERANGE) ? + N_("out of range") : N_("invalid unit"); + const char *bad_numeric = N_("bad numeric config value '%s' for '%s': %s"); if (!value) value = ""; if (!(cf && cf->name)) - die(_("bad numeric config value '%s' for '%s': %s"), - value, name, error_type); + die(_(bad_numeric), value, name, _(error_type)); switch (cf->origin_type) { case CONFIG_ORIGIN_BLOB: die(_("bad numeric config value '%s' for '%s' in blob %s: %s"), - value, name, cf->name, error_type); + value, name, cf->name, _(error_type)); case CONFIG_ORIGIN_FILE: die(_("bad numeric config value '%s' for '%s' in file %s: %s"), - value, name, cf->name, error_type); + value, name, cf->name, _(error_type)); case CONFIG_ORIGIN_STDIN: die(_("bad numeric config value '%s' for '%s' in standard input: %s"), - value, name, error_type); + value, name, _(error_type)); case CONFIG_ORIGIN_SUBMODULE_BLOB: die(_("bad numeric config value '%s' for '%s' in submodule-blob %s: %s"), - value, name, cf->name, error_type); + value, name, cf->name, _(error_type)); case CONFIG_ORIGIN_CMDLINE: die(_("bad numeric config value '%s' for '%s' in command line %s: %s"), - value, name, cf->name, error_type); + value, name, cf->name, 
_(error_type)); default: die(_("bad numeric config value '%s' for '%s' in %s: %s"), - value, name, cf->name, error_type); + value, name, cf->name, _(error_type)); } } -- 2.22.0.455.g172b71a6c5
[PATCH v3 8/8] tests: make GIT_TEST_FAIL_PREREQS a boolean
Change the GIT_TEST_FAIL_PREREQS variable from being "non-empty?" to being a more standard boolean variable. I recently added the variable in dfe1a17df9 ("tests: add a special setup where prerequisites fail", 2019-05-13), having to add another "non-empty?" special-case is what prompted me to write the "git env--helper" utility being used here. Converting this one is a bit tricky since we use it so early and frequently in the guts of the test code itself, so let's set a GIT_TEST_FAIL_PREREQS_INTERNAL which can be tested with the old "test -n" for the purposes of the shell code, and change the user-exposed and documented GIT_TEST_FAIL_PREREQS variable to a boolean. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/README| 2 +- t/t-basic.sh| 10 +- t/test-lib-functions.sh | 4 ++-- t/test-lib.sh | 23 +++ 4 files changed, 27 insertions(+), 12 deletions(-) diff --git a/t/README b/t/README index 072c9854d1..60d5b77bcc 100644 --- a/t/README +++ b/t/README @@ -334,7 +334,7 @@ that cannot be easily covered by a few specific test cases. These could be enabled by running the test suite with correct GIT_TEST_ environment set. -GIT_TEST_FAIL_PREREQS fails all prerequisites. This is +GIT_TEST_FAIL_PREREQS= fails all prerequisites. This is useful for discovering issues with the tests where say a later test implicitly depends on an optional earlier test. 
diff --git a/t/t-basic.sh b/t/t-basic.sh index 31de7e90f3..e89438e619 100755 --- a/t/t-basic.sh +++ b/t/t-basic.sh @@ -726,7 +726,7 @@ donthaveit=yes test_expect_success DONTHAVEIT 'unmet prerequisite causes test to be skipped' ' donthaveit=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit != yesyes then say "bug in test framework: prerequisite tags do not work reliably" exit 1 @@ -747,7 +747,7 @@ donthaveiteither=yes test_expect_success DONTHAVEIT,HAVEIT 'unmet prerequisites causes test to be skipped' ' donthaveiteither=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit$donthaveiteither != yesyesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit$donthaveiteither != yesyesyes then say "bug in test framework: multiple prerequisite tags do not work reliably" exit 1 @@ -763,7 +763,7 @@ test_expect_success !LAZY_TRUE 'missing lazy prereqs skip tests' ' donthavetrue=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a "$havetrue$donthavetrue" != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$havetrue$donthavetrue" != yesyes then say 'bug in test framework: lazy prerequisites do not work' exit 1 @@ -779,7 +779,7 @@ test_expect_success LAZY_FALSE 'missing negative lazy prereqs will skip' ' havefalse=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a "$nothavefalse$havefalse" != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$nothavefalse$havefalse" != yesyes then say 'bug in test framework: negative lazy prerequisites do not work' exit 1 @@ -790,7 +790,7 @@ test_expect_success 'tests clean up after themselves' ' test_when_finished clean=yes ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $clean != yes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $clean != yes then say "bug in test framework: basic cleanup command does not work reliably" exit 1 diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 527508c350..1cd0655f96 100644 
--- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -309,7 +309,7 @@ test_unset_prereq () { } test_set_prereq () { - if test -n "$GIT_TEST_FAIL_PREREQS" + if test -n "$GIT_TEST_FAIL_PREREQS_INTERNAL" then case "$1" in # The "!" case is handled below with @@ -1043,7 +1043,7 @@ perl () { # The error/skip message should be given by $2. # test_skip_or_die () { - if ! git env--helper --mode-bool --variable=$1 --default=0 --exit-code --quiet + if ! git env--helper --type=bool --default=false --exit-code $1 then skip_all=$2 test_done diff --git a/t/test-lib.sh b/t/test-lib.sh index ed5d69dfe5..1af4e50653 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -1389,6 +1389,25 @@ yes () { done } +# The GIT_TEST_FAIL_PREREQS code hooks into test_set_prereq(), and +# thus needs to be set up really early, and set an internal variable +# for convenience so
[PATCH v3 6/8] tests README: re-flow a previously changed paragraph
A previous change to the "GIT_TEST_GETTEXT_POISON" variable left this paragraph needing to be re-flowed. Let's do that in this separate change to make it easy to see that there's no change here when viewed with "--word-diff". Signed-off-by: Ævar Arnfjörð Bjarmason --- t/README | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/t/README b/t/README index 9a131f472e..072c9854d1 100644 --- a/t/README +++ b/t/README @@ -344,10 +344,10 @@ refactor to deal with it. The "SYMLINKS" prerequisite is currently excluded as so much relies on it, but this might change in the future. GIT_TEST_GETTEXT_POISON= turns all strings marked for -translation into gibberish if true. Used for -spotting those tests that need to be marked with a C_LOCALE_OUTPUT -prerequisite when adding more strings for translation. See "Testing -marked strings" in po/README for details. +translation into gibberish if true. Used for spotting those tests that +need to be marked with a C_LOCALE_OUTPUT prerequisite when adding more +strings for translation. See "Testing marked strings" in po/README for +details. GIT_TEST_SPLIT_INDEX= forces split-index mode on the whole test suite. Accept any boolean values that are accepted by git-config. -- 2.22.0.455.g172b71a6c5
[PATCH v3 5/8] tests: make GIT_TEST_GETTEXT_POISON a boolean
Change the GIT_TEST_GETTEXT_POISON variable from being "non-empty?" to being a more standard boolean variable. Since it needed to be checked in both C code and shellscript (via test -n) it was one of the remaining shellscript-like variables. Now that we have "env--helper" we can change that. There's a couple of tricky edge cases that arise because we're using git_env_bool() early, and the config-reading "env--helper". If GIT_TEST_GETTEXT_POISON is set to an invalid value die_bad_number() will die, but to do so it would usually call gettext(). Let's detect the special case of GIT_TEST_GETTEXT_POISON and always emit that message in the C locale, lest we infinitely loop. As seen in the updated tests in t0017-env-helper.sh there's also a caveat related to "env--helper" needing to read the config for trace2 purposes. Since the C_LOCALE_OUTPUT prerequisite is lazy and relies on "env--helper" we could get invalid results if we failed to read the config (e.g. because we'd loop on includes) when combined with e.g. "test_i18ngrep" wanting to check with "env--helper" if GIT_TEST_GETTEXT_POISON was true or not. I'm crossing my fingers and hoping that a test similar to the one I removed in the earlier "config tests: simplify include cycle test" change in this series won't happen again, and testing for this explicitly in "env--helper"'s own tests. This change breaks existing uses of e.g. GIT_TEST_GETTEXT_POISON=YesPlease, which we've documented in po/README and other places. As noted in [1] we might want to consider also accepting "YesPlease" in "env--helper" as a special-case. 
But as the lack of uproar over 6cdccfce1e ("i18n: make GETTEXT_POISON a runtime option", 2018-11-08) demonstrates the audience for this option is a really narrow set of git developers, who shouldn't have much trouble modifying their test scripts, so I think it's better to deal with that minor headache now and make all the relevant GIT_TEST_* variables boolean in the same way than carry the "YesPlease" special-case forward. 1. https://public-inbox.org/git/xmqqtvckm3h8@gitster-ct.c.googlers.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- ci/lib.sh | 2 +- config.c | 9 + gettext.c | 6 ++ git-sh-i18n.sh| 4 +++- po/README | 2 +- t/README | 4 ++-- t/t0017-env-helper.sh | 16 t/t0205-gettext-poison.sh | 7 ++- t/t1305-config-include.sh | 2 +- t/t7201-co.sh | 2 +- t/t9902-completion.sh | 2 +- t/test-lib.sh | 8 +++- 12 files changed, 46 insertions(+), 18 deletions(-) diff --git a/ci/lib.sh b/ci/lib.sh index 288a5b3884..fd799ae663 100755 --- a/ci/lib.sh +++ b/ci/lib.sh @@ -184,7 +184,7 @@ osx-clang|osx-gcc) export GIT_SKIP_TESTS="t9810 t9816" ;; GIT_TEST_GETTEXT_POISON) - export GIT_TEST_GETTEXT_POISON=YesPlease + export GIT_TEST_GETTEXT_POISON=true ;; esac diff --git a/config.c b/config.c index 374cb33005..b985d60fa4 100644 --- a/config.c +++ b/config.c @@ -956,6 +956,15 @@ static void die_bad_number(const char *name, const char *value) if (!value) value = ""; + if (!strcmp(name, "GIT_TEST_GETTEXT_POISON")) + /* +* We explicitly *don't* use _() here since it would +* cause an infinite loop with _() needing to call +* use_gettext_poison(). This is why marked up +* translations with N_() above. 
+*/ + die(bad_numeric, value, name, error_type); + if (!(cf && cf->name)) die(_(bad_numeric), value, name, _(error_type)); diff --git a/gettext.c b/gettext.c index d4021d690c..5c71f4c8b9 100644 --- a/gettext.c +++ b/gettext.c @@ -50,10 +50,8 @@ const char *get_preferred_languages(void) int use_gettext_poison(void) { static int poison_requested = -1; - if (poison_requested == -1) { - const char *v = getenv("GIT_TEST_GETTEXT_POISON"); - poison_requested = v && strlen(v) ? 1 : 0; - } + if (poison_requested == -1) + poison_requested = git_env_bool("GIT_TEST_GETTEXT_POISON", 0); return poison_requested; } diff --git a/git-sh-i18n.sh b/git-sh-i18n.sh index e1d917fd27..8eef60b43f 100644 --- a/git-sh-i18n.sh +++ b/git-sh-i18n.sh @@ -17,7 +17,9 @@ export TEXTDOMAINDIR # First decide what scheme to use... GIT_INTERNAL_GETTEXT_SH_SCHEME=fallthrough -if test -n "$GIT_TEST_GETTEXT_POISON" +if test -n "$GIT_TEST_GETTEXT_POISON" && + git env--helper --type=bool --default=0 --exit-code \ +
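To make the behavior change in this patch concrete, here is a minimal sketch contrasting the old "non-empty?" check with a boolean check. The functions old_check, new_check and describe are hypothetical and exist only for this illustration; new_check is a simplified stand-in for git's boolean parsing, not the real git_env_bool(), and it only knows a handful of spellings.

```sh
#!/bin/sh
# Old semantics: any non-empty value enables the feature.
old_check () {
	test -n "$1"
}

# New semantics (simplified, hypothetical stand-in for git's boolean
# parsing): only recognized spellings are accepted.
# Returns 0 for true, 1 for false, 2 for an unrecognized value.
new_check () {
	case "$1" in
	true|yes|on|1) return 0 ;;
	false|no|off|0|'') return 1 ;;
	*) return 2 ;;
	esac
}

# Print how both checks interpret a single value.
describe () {
	if old_check "$1"; then old=on; else old=off; fi
	rc=0
	new_check "$1" || rc=$?
	case $rc in
	0) new=on ;;
	1) new=off ;;
	*) new=invalid ;;
	esac
	printf '%s old=%s new=%s\n' "$1" "$old" "$new"
}

for value in true false 0 YesPlease
do
	describe "$value"
done
```

Under the old check "false" and "0" both switch the poison mode on, which is the surprise the boolean conversion removes. "YesPlease" becomes invalid in this sketch, matching the compatibility break discussed in the commit message.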
[PATCH v3 2/8] env--helper: new undocumented builtin wrapping git_env_*()
We have many GIT_TEST_* variables that accept a <boolean> because they're implemented in C, and then some that take <non-empty?> because they're implemented at least partially in shellscript. Add a helper that wraps git_env_bool() and git_env_ulong() as the first step in fixing this. This isn't being added as a test-tool mode because some of these are used outside the test suite. Part of what this tool does can be done via a trick with "git config" added in 83d842dc8c ("tests: turn on network daemon tests by default", 2014-02-10) for test_tristate(), i.e.: git -c magic.variable="$1" config --bool magic.variable 2>/dev/null But as subsequent changes will show, being able to pass along the default value makes all the difference, and we'll be able to replace test_tristate() itself with that. The --type=bool option will be used by subsequent patches, but not --type=ulong. I figured it was easy enough to add and test, so I left it in so that we'd have wrappers for both git_env_*() functions, and a template that makes it obvious how we'd add --type=int etc. if it's needed in the future. 
Signed-off-by: Ævar Arnfjörð Bjarmason --- .gitignore| 1 + Makefile | 1 + builtin.h | 1 + builtin/env--helper.c | 95 +++ git.c | 1 + t/t0017-env-helper.sh | 83 + 6 files changed, 182 insertions(+) create mode 100644 builtin/env--helper.c create mode 100755 t/t0017-env-helper.sh diff --git a/.gitignore b/.gitignore index 4470d7cfc0..1f7a83fb3c 100644 --- a/.gitignore +++ b/.gitignore @@ -58,6 +58,7 @@ /git-difftool /git-difftool--helper /git-describe +/git-env--helper /git-fast-export /git-fast-import /git-fetch diff --git a/Makefile b/Makefile index f58bf14c7b..f2cfc8d812 100644 --- a/Makefile +++ b/Makefile @@ -1059,6 +1059,7 @@ BUILTIN_OBJS += builtin/diff-index.o BUILTIN_OBJS += builtin/diff-tree.o BUILTIN_OBJS += builtin/diff.o BUILTIN_OBJS += builtin/difftool.o +BUILTIN_OBJS += builtin/env--helper.o BUILTIN_OBJS += builtin/fast-export.o BUILTIN_OBJS += builtin/fetch-pack.o BUILTIN_OBJS += builtin/fetch.o diff --git a/builtin.h b/builtin.h index ec7e0954c4..93bd49fe4f 100644 --- a/builtin.h +++ b/builtin.h @@ -160,6 +160,7 @@ int cmd_diff_index(int argc, const char **argv, const char *prefix); int cmd_diff(int argc, const char **argv, const char *prefix); int cmd_diff_tree(int argc, const char **argv, const char *prefix); int cmd_difftool(int argc, const char **argv, const char *prefix); +int cmd_env__helper(int argc, const char **argv, const char *prefix); int cmd_fast_export(int argc, const char **argv, const char *prefix); int cmd_fetch(int argc, const char **argv, const char *prefix); int cmd_fetch_pack(int argc, const char **argv, const char *prefix); diff --git a/builtin/env--helper.c b/builtin/env--helper.c new file mode 100644 index 00..1083c0f707 --- /dev/null +++ b/builtin/env--helper.c @@ -0,0 +1,95 @@ +#include "builtin.h" +#include "config.h" +#include "parse-options.h" + +static char const * const env__helper_usage[] = { + N_("git env--helper --type=[bool|ulong] "), + NULL +}; + +enum { + ENV_HELPER_TYPE_BOOL = 1, + ENV_HELPER_TYPE_ULONG +} 
cmdmode = 0; + +static int option_parse_type(const struct option *opt, const char *arg, +int unset) +{ + if (!strcmp(arg, "bool")) + cmdmode = ENV_HELPER_TYPE_BOOL; + else if (!strcmp(arg, "ulong")) + cmdmode = ENV_HELPER_TYPE_ULONG; + else + die(_("unrecognized --type argument, %s"), arg); + + return 0; +} + +int cmd_env__helper(int argc, const char **argv, const char *prefix) +{ + int exit_code = 0; + const char *env_variable = NULL; + const char *env_default = NULL; + int ret; + int ret_int, default_int; + unsigned long ret_ulong, default_ulong; + struct option opts[] = { + OPT_CALLBACK_F(0, "type", &cmdmode, N_("type"), + N_("value is given this type"), PARSE_OPT_NONEG, + option_parse_type), + OPT_STRING(0, "default", &env_default, N_("value"), + N_("default for git_env_*(...) to fall back on")), + OPT_BOOL(0, "exit-code", &exit_code, +N_("be quiet only use git_env_*() value as exit code")), + OPT_END(), + }; + + argc = parse_options(argc, argv, prefix, opts, env__helper_usage, +PARSE_OPT_KEEP_UNKNOWN); + if (env_default && !*env_default) + usage_with_options(env__helper_usage, opts)
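As a rough illustration of what the --default and --exit-code options buy a shell caller, the sketch below approximates the helper in pure shell. env_bool and DEMO_TEST_HTTPD are hypothetical names invented for this example. The parsing is an approximation of git's boolean rules (it ignores negative numbers and unit suffixes), and the error text is made up; only the exit status 128 mirrors what die() uses.

```sh
#!/bin/sh
# env_bool <variable> <default>
# Hypothetical pure-shell approximation of
#   git env--helper --type=bool --default=<default> --exit-code <variable>
# Returns 0 if the value is true, 1 if false, 128 if it is not a boolean.
env_bool () {
	name=$1
	# Use the default only when the variable is unset; a variable that
	# is set to the empty string counts as false.
	if eval "test -n \"\${$name+set}\""
	then
		eval "value=\$$name"
	else
		value=$2
	fi
	case $(printf '%s' "$value" | tr '[:upper:]' '[:lower:]') in
	true|yes|on)
		return 0 ;;
	false|no|off|'')
		return 1 ;;
	*[!0-9]*)
		echo "fatal: bad boolean value '$value' for '$name'" >&2
		return 128 ;;
	*[1-9]*)
		return 0 ;;	# non-zero integer
	*)
		return 1 ;;	# zero
	esac
}

# Usage: the variable is unset, so the default decides.
unset DEMO_TEST_HTTPD
if env_bool DEMO_TEST_HTTPD 1
then
	echo "DEMO_TEST_HTTPD: enabled by default"
fi
```

The point of passing the default in is that the caller no longer needs a separate "auto" state: the same variable can be evaluated with different defaults at different call sites.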
[PATCH v3 0/8] Change GIT_TEST_* variables to <boolean>
Now with:
* The --type=bool etc. UI change to env--helper.
* I considered supporting YesPlease for gettext poison, but didn't go for it. Details in updated commit message.
* A "default" "case" arm for the warning on some compilers.

Ævar Arnfjörð Bjarmason (8): config tests: simplify include cycle test env--helper: new undocumented builtin wrapping git_env_*() config.c: refactor die_bad_number() to not call gettext() early t6040 test: stop using global "script" variable tests: make GIT_TEST_GETTEXT_POISON a boolean tests README: re-flow a previously changed paragraph tests: replace test_tristate with "git env--helper" tests: make GIT_TEST_FAIL_PREREQS a boolean .gitignore| 1 + Makefile | 1 + builtin.h | 1 + builtin/env--helper.c | 95 + ci/lib.sh | 2 +- config.c | 28 +++ gettext.c | 6 +-- git-sh-i18n.sh| 4 +- git.c | 1 + po/README | 2 +- t/README | 12 ++--- t/lib-git-daemon.sh | 7 ++- t/lib-git-svn.sh | 11 ++--- t/lib-httpd.sh| 15 +++--- t/t-basic.sh | 10 ++-- t/t0017-env-helper.sh | 99 +++ t/t0205-gettext-poison.sh | 7 ++- t/t1305-config-include.sh | 21 +++-- t/t5512-ls-remote.sh | 3 +- t/t6040-tracking-info.sh | 6 +-- t/t7201-co.sh | 2 +- t/t9902-completion.sh | 2 +- t/test-lib-functions.sh | 58 --- t/test-lib.sh | 31 24 files changed, 298 insertions(+), 127 deletions(-) create mode 100644 builtin/env--helper.c create mode 100755 t/t0017-env-helper.sh Range-diff: 1: c3483c37a1 = 1: c3483c37a1 config tests: simplify include cycle test 2: e689759f7c ! 2: 39cb96739a env--helper: new undocumented builtin wrapping git_env_*() @@ -20,9 +20,11 @@ default value makes all the difference, and we'll be able to replace test_tristate() itself with that. -The --mode-bool option will be used by subsequent patches, but not ---mode-ulong. I figured it was easy enough to add it & test for it so -I left it in so we'd have wrappers for both git_env_*() functions. +The --type=bool option will be used by subsequent patches, but not +--type=ulong. 
I figured it was easy enough to add it & test for it so +I left it in so we'd have wrappers for both git_env_*() functions, and +to have a template to make it obvious how we'd add --type=int etc. if + it's needed in the future. Signed-off-by: Ævar Arnfjörð Bjarmason @@ -72,74 +74,95 @@ +#include "parse-options.h" + +static char const * const env__helper_usage[] = { -+ N_("git env--helper [--mode-bool | --mode-ulong] --env-variable= --env-default= []"), ++ N_("git env--helper --type=[bool|ulong] "), + NULL +}; + ++enum { ++ ENV_HELPER_TYPE_BOOL = 1, ++ ENV_HELPER_TYPE_ULONG ++} cmdmode = 0; ++ ++static int option_parse_type(const struct option *opt, const char *arg, ++ int unset) ++{ ++ if (!strcmp(arg, "bool")) ++ cmdmode = ENV_HELPER_TYPE_BOOL; ++ else if (!strcmp(arg, "ulong")) ++ cmdmode = ENV_HELPER_TYPE_ULONG; ++ else ++ die(_("unrecognized --type argument, %s"), arg); ++ ++ return 0; ++} ++ +int cmd_env__helper(int argc, const char **argv, const char *prefix) +{ -+ enum { -+ ENV_HELPER_BOOL = 1, -+ ENV_HELPER_ULONG, -+ } cmdmode = 0; + int exit_code = 0; -+ int quiet = 0; + const char *env_variable = NULL; + const char *env_default = NULL; + int ret; -+ int ret_int, tmp_int; -+ unsigned long ret_ulong, tmp_ulong; ++ int ret_int, default_int; ++ unsigned long ret_ulong, default_ulong; + struct option opts[] = { -+ OPT_CMDMODE(0, "mode-bool", &cmdmode, -+ N_("invoke git_env_bool(...)"), ENV_HELPER_BOOL), -+ OPT_CMDMODE(0, "mode-ulong", &cmdmode, -+ N_("invoke git_env_ulong(...)"), ENV_HELPER_ULONG), -+ OPT_STRING(0, "variable", &env_variable, N_("name"), -+ N_("which environment variable to ask git_env_*(...) about")), ++ OPT_CALLBACK_F(0, "type", &cmdmode, N_("type"), ++ N_("value is given this type"), PARSE_OPT_NONEG, ++ option_parse_type),
Re: [PATCH v2 2/8] env--helper: new undocumented builtin wrapping git_env_*()
On Fri, Jun 21 2019, Junio C Hamano wrote: > Junio C Hamano writes: > >> ... >> as I am getting >> >> error: 'ret' may be used uninitialized in this function >> [-Werror=maybe-uninitialized] >> >> from here. >> >> Giving an otherwise useless initial value to ret would be a >> workaround. > > I've added this on top of the topic before merging to keep the > integration going at least for now. > > commit 8f86948797a1152594a8dee50d0878604fec3e80 > Author: Junio C Hamano > Date: Thu Jun 20 15:13:14 2019 -0700 > > SQUASH??? avoid maybe-uninitialized > > diff --git a/builtin/env--helper.c b/builtin/env--helper.c > index 2bb65ecf3f..29df0567fb 100644 > --- a/builtin/env--helper.c > +++ b/builtin/env--helper.c > @@ -43,6 +43,9 @@ int cmd_env__helper(int argc, const char **argv, const char > *prefix) > usage_with_options(env__helper_usage, opts); > > switch (cmdmode) { > + default: > + BUG("wrong cmdmode"); > + break; > case ENV_HELPER_BOOL: > tmp_int = strtol(env_default, (char **)&env_default, 10); > if (*env_default) { In this case the compiler is wrong, and gcc/clang in e.g. Debian unstable doesn't warn about this since the analyzer sees that it's impossible for "ret" to be uninitialized. I can change it anyway, and if I rewrite the UI of this command it might go away anyway. Just thought I'd ask if appeasing older analyzers is what we want for these sorts of optional warnings in general.
Re: [RFC/PATCH] gc: run more pre-detach operations under lock
On Thu, Jun 20 2019, Duy Nguyen wrote: > On Thu, Jun 20, 2019 at 5:49 AM Ævar Arnfjörð Bjarmason > wrote: >> >> >> On Wed, Jun 19 2019, Jeff King wrote: >> >> > On Wed, Jun 19, 2019 at 08:01:55PM +0200, Ævar Arnfjörð Bjarmason wrote: >> > >> >> > You could sort of avoid the problem here too with >> >> > >> >> > parallel 'git fetch --no-auto-gc {}' ::: $(git remote) >> >> > git gc --auto >> >> > >> >> > It's definitely simpler, but of course we have to manually add >> >> > --no-auto-gc in everywhere we need, so not quite as elegant. >> >> > >> >> > Actually you could already do that with 'git -c gc.auto=false fetch', I >> >> > guess. >> >> >> >> The point of the 'parallel' example is to show disconnected git >> >> commands, think trying to run 'git' in a terminal while your editor >> >> asynchronously runs a polling 'fetch', or a server with multiple >> >> concurrent clients running 'gc --auto'. >> >> >> >> That's the question my RFC patch raises. As far as I can tell the >> >> approach in your patch is only needed because our locking for gc is >> >> buggy, rather than introduce the caveat that an fetch(N) operation won't >> >> do "gc" until it's finished (we may have hundreds, thousands of remotes, >> >> I use that for some more obscure use-cases) shouldn't we just fix the >> >> locking? >> > >> > I think there may be room for both approaches. Yours fixes the repeated >> > message in the more general case, but Duy's suggestion is the most >> > efficient thing. >> > >> > I agree that the "thousands of remotes" case means we might want to gc >> > in the interim. But we probably ought to do that deterministically >> > rather than hoping that the pattern of lock contention makes sense. >> >> We do it deterministically, when gc.auto thresholds et al are exceeded >> we kick one off without waiting for other stuff, if we can get the lock. >> >> I don't think this desire to just wait a bit until all the fetches are >> complete makes sense as a special-case. 
>> >> If, as you noted in <20190619190845.gd28...@sigill.intra.peff.net>, the >> desire is to reduce GC CPU use then you're better off just tweaking the >> limits upwards. Then you get that with everything, like when you run >> "commit" in a for-loop, not just this one special case of "fetch". >> >> We have existing potentially long-running operations like "fetch", >> "rebase" and "git svn fetch" that run "gc --auto" for their incremental >> steps, and that's a feature. > > gc --auto is added at arbitrary points to help garbage collection. I > don't think it's ever intended to "do gc at this and that exact > moment", just "hey this command has taken a lot of time already (i.e. > no instant response needed) and it may have added a bit more garbage, > let's just check real quick". I don't mean we can't ever change the algorithm, but that we've documented: When common porcelain operations that create objects are run, they will check whether the repository has grown substantially since the last maintenance[...] The "fetch" command is a common porcelain operation, when it fetches from N remotes it just runs an invocation of itself, so thus far it's both worked & been intuitive that if we needed (potentially multiple) gc's while doing that we'd just go ahead and run it then, even if something concurrent was happening. No that's not optimal in many cases, but at least doesn't create caveats we don't have now where we have runaway object growth. >> It keeps "gc --auto" dumb enough to avoid a pathological case where >> we'll have a ballooning objects dir because we figure we can run >> something "at the end", when "the end" could be hours away, and we're >> adding a new pack or hundreds of loose objects every second. > > Are we optimizing for a rare (large scale) case? Such setup requires > tuning regardless to me. At least for me it doesn't require custom tuning before this patch of yours. I.e. 
now "gc --auto" is dumb enough that you can run it on everything from stuff that just does "commit" from cron, user's laptops, massive rebases that take forever
[PATCH v2 5/8] tests: make GIT_TEST_GETTEXT_POISON a boolean
Change the GIT_TEST_GETTEXT_POISON variable from being "non-empty?" to being a more standard boolean variable. Since it needed to be checked in both C code and shellscript (via test -n) it was one of the remaining shellscript-like variables. Now that we have "env--helper" we can change that. There's a couple of tricky edge cases that arise because we're using git_env_bool() early, and the config-reading "env--helper". If GIT_TEST_GETTEXT_POISON is set to an invalid value die_bad_number() will die, but to do so it would usually call gettext(). Let's detect the special case of GIT_TEST_GETTEXT_POISON and always emit that message in the C locale, lest we infinitely loop. As seen in the updated tests in t0017-env-helper.sh there's also a caveat related to "env--helper" needing to read the config for trace2 purposes. Since the C_LOCALE_OUTPUT prerequisite is lazy and relies on "env--helper" we could get invalid results if we failed to read the config (e.g. because we'd loop on includes) when combined with e.g. "test_i18ngrep" wanting to check with "env--helper" if GIT_TEST_GETTEXT_POISON was true or not. I'm crossing my fingers and hoping that a test similar to the one I removed in the earlier "config tests: simplify include cycle test" change in this series won't happen again, and testing for this explicitly in "env--helper"'s own tests. 
Signed-off-by: Ævar Arnfjörð Bjarmason --- ci/lib.sh | 2 +- config.c | 9 + gettext.c | 6 ++ git-sh-i18n.sh| 4 +++- po/README | 2 +- t/README | 4 ++-- t/t0017-env-helper.sh | 16 t/t0205-gettext-poison.sh | 7 ++- t/t1305-config-include.sh | 2 +- t/t7201-co.sh | 2 +- t/t9902-completion.sh | 2 +- t/test-lib.sh | 8 +++- 12 files changed, 46 insertions(+), 18 deletions(-) diff --git a/ci/lib.sh b/ci/lib.sh index 288a5b3884..fd799ae663 100755 --- a/ci/lib.sh +++ b/ci/lib.sh @@ -184,7 +184,7 @@ osx-clang|osx-gcc) export GIT_SKIP_TESTS="t9810 t9816" ;; GIT_TEST_GETTEXT_POISON) - export GIT_TEST_GETTEXT_POISON=YesPlease + export GIT_TEST_GETTEXT_POISON=true ;; esac diff --git a/config.c b/config.c index 374cb33005..b985d60fa4 100644 --- a/config.c +++ b/config.c @@ -956,6 +956,15 @@ static void die_bad_number(const char *name, const char *value) if (!value) value = ""; + if (!strcmp(name, "GIT_TEST_GETTEXT_POISON")) + /* +* We explicitly *don't* use _() here since it would +* cause an infinite loop with _() needing to call +* use_gettext_poison(). This is why marked up +* translations with N_() above. +*/ + die(bad_numeric, value, name, error_type); + if (!(cf && cf->name)) die(_(bad_numeric), value, name, _(error_type)); diff --git a/gettext.c b/gettext.c index d4021d690c..5c71f4c8b9 100644 --- a/gettext.c +++ b/gettext.c @@ -50,10 +50,8 @@ const char *get_preferred_languages(void) int use_gettext_poison(void) { static int poison_requested = -1; - if (poison_requested == -1) { - const char *v = getenv("GIT_TEST_GETTEXT_POISON"); - poison_requested = v && strlen(v) ? 1 : 0; - } + if (poison_requested == -1) + poison_requested = git_env_bool("GIT_TEST_GETTEXT_POISON", 0); return poison_requested; } diff --git a/git-sh-i18n.sh b/git-sh-i18n.sh index e1d917fd27..de8ae67d7b 100644 --- a/git-sh-i18n.sh +++ b/git-sh-i18n.sh @@ -17,7 +17,9 @@ export TEXTDOMAINDIR # First decide what scheme to use... 
GIT_INTERNAL_GETTEXT_SH_SCHEME=fallthrough -if test -n "$GIT_TEST_GETTEXT_POISON" +if test -n "$GIT_TEST_GETTEXT_POISON" && + git env--helper --mode-bool --variable=GIT_TEST_GETTEXT_POISON \ + --default=0 --exit-code --quiet then GIT_INTERNAL_GETTEXT_SH_SCHEME=poison elif test -n "@@USE_GETTEXT_SCHEME@@" diff --git a/po/README b/po/README index aa704ffcb7..07595d369b 100644 --- a/po/README +++ b/po/README @@ -293,7 +293,7 @@ To smoke out issues like these, Git tested with a translation mode that emits gibberish on every call to gettext. To use it run the test suite with it, e.g.: -cd t && GIT_TEST_GETTEXT_POISON=YesPlease prove -j 9 ./t[0-9]*.sh +cd t && GIT_TEST_GETTEXT_POISON=true prove -j 9 ./t[0-9]*.sh If tests break with it you should inspect them manually and see if what you're translating is sane, i.e. that you're not translating diff --git a/t/README b/t/README index 9747971d58..9a131f472e
[PATCH v2 8/8] tests: make GIT_TEST_FAIL_PREREQS a boolean
Change the GIT_TEST_FAIL_PREREQS variable from being "non-empty?" to being a more standard boolean variable. I recently added the variable in dfe1a17df9 ("tests: add a special setup where prerequisites fail", 2019-05-13). Having to add another "non-empty?" special-case is what prompted me to write the "git env--helper" utility being used here. Converting this one is a bit tricky since we use it so early and frequently in the guts of the test code itself, so let's set a GIT_TEST_FAIL_PREREQS_INTERNAL which can be tested with the old "test -n" for the purposes of the shell code, and change the user-exposed and documented GIT_TEST_FAIL_PREREQS variable to a boolean. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/README| 2 +- t/t-basic.sh| 10 +- t/test-lib-functions.sh | 2 +- t/test-lib.sh | 25 + 4 files changed, 28 insertions(+), 11 deletions(-) diff --git a/t/README b/t/README index 072c9854d1..60d5b77bcc 100644 --- a/t/README +++ b/t/README @@ -334,7 +334,7 @@ that cannot be easily covered by a few specific test cases. These could be enabled by running the test suite with correct GIT_TEST_ environment set. -GIT_TEST_FAIL_PREREQS fails all prerequisites. This is +GIT_TEST_FAIL_PREREQS=<boolean> fails all prerequisites. This is useful for discovering issues with the tests where say a later test implicitly depends on an optional earlier test. 
diff --git a/t/t-basic.sh b/t/t-basic.sh index 31de7e90f3..e89438e619 100755 --- a/t/t-basic.sh +++ b/t/t-basic.sh @@ -726,7 +726,7 @@ donthaveit=yes test_expect_success DONTHAVEIT 'unmet prerequisite causes test to be skipped' ' donthaveit=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit != yesyes then say "bug in test framework: prerequisite tags do not work reliably" exit 1 @@ -747,7 +747,7 @@ donthaveiteither=yes test_expect_success DONTHAVEIT,HAVEIT 'unmet prerequisites causes test to be skipped' ' donthaveiteither=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit$donthaveiteither != yesyesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit$donthaveiteither != yesyesyes then say "bug in test framework: multiple prerequisite tags do not work reliably" exit 1 @@ -763,7 +763,7 @@ test_expect_success !LAZY_TRUE 'missing lazy prereqs skip tests' ' donthavetrue=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a "$havetrue$donthavetrue" != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$havetrue$donthavetrue" != yesyes then say 'bug in test framework: lazy prerequisites do not work' exit 1 @@ -779,7 +779,7 @@ test_expect_success LAZY_FALSE 'missing negative lazy prereqs will skip' ' havefalse=no ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a "$nothavefalse$havefalse" != yesyes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$nothavefalse$havefalse" != yesyes then say 'bug in test framework: negative lazy prerequisites do not work' exit 1 @@ -790,7 +790,7 @@ test_expect_success 'tests clean up after themselves' ' test_when_finished clean=yes ' -if test -z "$GIT_TEST_FAIL_PREREQS" -a $clean != yes +if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $clean != yes then say "bug in test framework: basic cleanup command does not work reliably" exit 1 diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 527508c350..3fba71c358 100644 
--- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -309,7 +309,7 @@ test_unset_prereq () { } test_set_prereq () { - if test -n "$GIT_TEST_FAIL_PREREQS" + if test -n "$GIT_TEST_FAIL_PREREQS_INTERNAL" then case "$1" in # The "!" case is handled below with diff --git a/t/test-lib.sh b/t/test-lib.sh index c45b0d2611..238ef62401 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -1389,6 +1389,27 @@ yes () { done } +# The GIT_TEST_FAIL_PREREQS code hooks into test_set_prereq(), and +# thus needs to be set up really early, and set an internal variable +# for convenience so the hot test_set_prereq() codepath doesn't need +# to call "git env--helper". Only do that work if needed by seeing if +# GIT_TEST_FAIL_PREREQS is set at all. +GIT_TEST_FAIL_PREREQS_INTERNAL= +if test -n "$GIT_TEST_FAIL_PREREQS" +then + if git env--helper --mode-bool --variable=GIT_TEST_FAIL_PREREQS \ + --default=
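The caching idea described in this patch (parse the boolean once, then let the hot path use a plain "test -n") can be sketched as follows. All names here are hypothetical: parse_bool stands in for the "git env--helper" call and counts its invocations so the effect is visible, and DEMO_FAIL_PREREQS and set_prereq are illustration-only counterparts of the real variable and function.

```sh
#!/bin/sh
# Hypothetical stand-in for "git env--helper --type=bool --exit-code";
# it counts its calls so the effect of caching is visible.
helper_calls=0
parse_bool () {
	helper_calls=$((helper_calls + 1))
	case "$1" in
	true|yes|on|1) return 0 ;;
	*) return 1 ;;
	esac
}

# Hypothetical user-facing variable, now a boolean.
DEMO_FAIL_PREREQS=true

# Evaluate the boolean once, early, and only if the variable is set at all.
DEMO_FAIL_PREREQS_INTERNAL=
if test -n "$DEMO_FAIL_PREREQS" && parse_bool "$DEMO_FAIL_PREREQS"
then
	DEMO_FAIL_PREREQS_INTERNAL=true
fi

# Hot path: a plain "test -n" on the cached value, no helper call.
set_prereq () {
	if test -n "$DEMO_FAIL_PREREQS_INTERNAL"
	then
		echo "prereq $1: forced to fail"
	else
		echo "prereq $1: satisfied"
	fi
}

for p in PIPE SYMLINKS PERL
do
	set_prereq "$p"
done
echo "helper calls: $helper_calls"
```

However many prerequisites are set, the (comparatively expensive) helper runs at most once, and not at all when the variable is unset or empty.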
[PATCH v2 7/8] tests: replace test_tristate with "git env--helper"
The test_tristate helper introduced in 83d842dc8c ("tests: turn on network daemon tests by default", 2014-02-10) can now be better implemented with "git env--helper" to give the variables in question the standard boolean behavior. The reason for the "tristate" was to have all of false/true/auto, where "auto" meant either "false" or "true" depending on what the fallback was. With the --default option to "git env--helper" we can simply have e.g. GIT_TEST_HTTPD where we know if it's true because the user asked explicitly ("true"), or true implicitly ("auto"). This breaks backwards compatibility for explicitly setting "auto" for these variables, but I don't think anyone cares. That was always intended to be internal. This means the test_normalize_bool() code in test-lib-functions.sh goes away in addition to test_tristate(). We still need the test_skip_or_die() helper, but now it takes the variable name instead of the value, and uses "git env--helper" to distinguish a default "true" from an explicit "true" (in those "explicit true" cases we want to fail the test in question). Signed-off-by: Ævar Arnfjörð Bjarmason --- t/lib-git-daemon.sh | 7 +++--- t/lib-git-svn.sh| 11 +++- t/lib-httpd.sh | 15 ++- t/t5512-ls-remote.sh| 3 +-- t/test-lib-functions.sh | 56 ++--- 5 files changed, 22 insertions(+), 70 deletions(-) diff --git a/t/lib-git-daemon.sh b/t/lib-git-daemon.sh index 7b3407134e..770c5218ea 100644 --- a/t/lib-git-daemon.sh +++ b/t/lib-git-daemon.sh @@ -15,8 +15,7 @@ # # test_done -test_tristate GIT_TEST_GIT_DAEMON -if test "$GIT_TEST_GIT_DAEMON" = false +if ! 
git env--helper --mode-bool --variable=GIT_TEST_GIT_DAEMON --default=1 --exit-code --quiet then skip_all="git-daemon testing disabled (unset GIT_TEST_GIT_DAEMON to enable)" test_done @@ -24,7 +23,7 @@ fi if test_have_prereq !PIPE then - test_skip_or_die $GIT_TEST_GIT_DAEMON "file system does not support FIFOs" + test_skip_or_die GIT_TEST_GIT_DAEMON "file system does not support FIFOs" fi test_set_port LIB_GIT_DAEMON_PORT @@ -73,7 +72,7 @@ start_git_daemon() { kill "$GIT_DAEMON_PID" wait "$GIT_DAEMON_PID" unset GIT_DAEMON_PID - test_skip_or_die $GIT_TEST_GIT_DAEMON \ + test_skip_or_die GIT_TEST_GIT_DAEMON \ "git daemon failed to start" fi } diff --git a/t/lib-git-svn.sh b/t/lib-git-svn.sh index c1271d6863..853d33a57a 100644 --- a/t/lib-git-svn.sh +++ b/t/lib-git-svn.sh @@ -69,14 +69,12 @@ svn_cmd () { maybe_start_httpd () { loc=${1-svn} - test_tristate GIT_SVN_TEST_HTTPD - case $GIT_SVN_TEST_HTTPD in - true) + if git env--helper --mode-bool --variable=GIT_TEST_HTTPD --default=0 --exit-code --quiet + then . "$TEST_DIRECTORY"/lib-httpd.sh LIB_HTTPD_SVN="$loc" start_httpd - ;; - esac + fi } convert_to_rev_db () { @@ -106,8 +104,7 @@ EOF } require_svnserve () { - test_tristate GIT_TEST_SVNSERVE - if ! test "$GIT_TEST_SVNSERVE" = true + if ! git env--helper --mode-bool --variable=GIT_TEST_SVNSERVE --default=0 --exit-code --quiet then skip_all='skipping svnserve test. (set $GIT_TEST_SVNSERVE to enable)' test_done diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh index b3cc62bd36..eef3250552 100644 --- a/t/lib-httpd.sh +++ b/t/lib-httpd.sh @@ -41,15 +41,14 @@ then test_done fi -test_tristate GIT_TEST_HTTPD -if test "$GIT_TEST_HTTPD" = false +if ! git env--helper --mode-bool --variable=GIT_TEST_HTTPD --default=1 --exit-code --quiet then skip_all="Network testing disabled (unset GIT_TEST_HTTPD to enable)" test_done fi if ! 
test_have_prereq NOT_ROOT; then - test_skip_or_die $GIT_TEST_HTTPD \ + test_skip_or_die GIT_TEST_HTTPD \ "Cannot run httpd tests as root" fi @@ -95,7 +94,7 @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE if ! test -x "$LIB_HTTPD_PATH" then - test_skip_or_die $GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'" + test_skip_or_die GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'" fi HTTPD_VERSION=$($LIB_HTTPD_PATH -v | \ @@ -107,19 +106,19 @@ then then if ! test $HTTPD_VERSION -ge 2 then - test_skip_or_die $GIT_TEST_HTTPD \ + test_skip_or_die GIT_TEST_HTTPD \ &