My inactivity in git recently

2019-10-16 Thread Ævar Arnfjörð Bjarmason
Hi, thought I'd send this to git@ as a public FYI. Some of you were
concerned about my inactivity recently. Rest assured I'm fine, I've just
been busy with other things.

Hoping to get back into it sooner rather than later. Sorry about not
replying to things I've been CC'd on.


Re: [PATCH v3 2/5] repo-settings: add feature.manyCommits setting

2019-07-31 Thread Ævar Arnfjörð Bjarmason


On Tue, Jul 30 2019, Derrick Stolee via GitGitGadget wrote:

> +feature.*::
> + The config settings that start with `feature.` modify the defaults of
> + a group of other config settings. These groups are created by the Git
> + developer community as recommended defaults and are subject to change.
> + In particular, new config options may be added with different defaults.
> +
> +feature.manyCommits::
> + Enable config options that optimize for repos with many commits. This
> + setting is recommended for repos with at least 100,000 commits. The
> + new default values are:
> ++
> +* `core.commitGraph=true` enables reading the commit-graph file.
> ++
> +* `gc.writeCommitGraph=true` enables writing the commit-graph file during
> +garbage collection.

During the whole new commit graph format discussion (which has now
landed) we discussed just auto toggling this:
https://public-inbox.org/git/87zhobr4fl@evledraar.gmail.com/

This looks fine, but have we backed out of simply enabling this by
default at this point? I don't see why we wouldn't, regardless of commit
count...


Re: [RFC PATCH v2] grep: allow for run time disabling of JIT in PCRE

2019-07-31 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 31 2019, Johannes Schindelin wrote:

> Hi,
>
> On Mon, 29 Jul 2019, Carlo Marcelo Arenas Belón wrote:
>
>>   $ git grep 'foo bar'
>>   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>
> My immediate reaction to this error message was: That's not helpful.
> What is `-48` supposed to mean? Why do we even think it sensible to
> throw such an error message at the end user? Can't we do a much better
> job translating that into something that makes actual sense without
> knowing implementation details?
>
> But then, I realized that -48 must be a well-known constant in PCRE2,
> and my reaction transformed into something much more hopeful: why don't
> we detect the situation where the JIT'ed code was not actually
> executable [*1*], and fall back to the non-JIT'ed code path ourselves,
> without troubling the end user (maybe warning, but maybe better not lest
> we annoy the user with something pointless)?
>
> Even after finding out that -48 disappointingly means
> PCRE2_ERROR_NOMEMORY (as opposed to something like
> PCRE2_ERROR_CANNOT_EXECUTE_JIT_CODE), I like the idea of not bothering
> end users and doing the sensible fallback under the hood.
>
> Ciao,
> Dscho
>
> Footnote *1*: Why anybody would think it sensible to build a PCRE2 with
> JIT on an OS that does not allow executing code that was written by the
> same process is beyond me. Or is there a mode in OpenBSD that *does*
> allow JIT'ed code to be executed?

We do detect if JIT isn't supported and fall back. That's what the
pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on) code in grep.c
does. This and the subsequent pcre2_pattern_info() call are how PCRE
documents that you should do this.

What hasn't been supported is all of that saying "yes, I support JIT"
and the feature then failing outright. I had not encountered that before.

So far that seems to be because Carlo just built a completely broken PCRE
v2 package, so I don't know if that's worth supporting on our
side. I.e. this isn't something I think could plausibly happen in the
wild.

That should *not* be confused with me thinking other stuff Carlo's
raised is a non-issue, e.g. running into the JIT stack limit etc. Some
of that's clearly bugs in our/my grep.c code that need fixing.
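
For illustration, the two situations above can be modelled like this. It
is a Python toy, not the code in grep.c: the function and parameter names
are hypothetical, and the quiet fallback on -48 is the behaviour Johannes
suggests, not something git does in the report above (where it dies).

```python
# Toy model of "detect JIT support, then fall back"; not git's grep.c.
# PCRE2_ERROR_NOMEMORY (-48) is the code from the report above; every
# function name below is hypothetical.
PCRE2_ERROR_NOMEMORY = -48


def choose_matcher(jit_configured, jit_compile):
    """Return "jit" or "interpreter" for one pattern.

    jit_configured: what the library claims at startup (True/False).
    jit_compile: callable returning 0 on success or a negative error code.
    """
    if not jit_configured:
        # Library built without JIT: the case that is already handled.
        return "interpreter"
    rc = jit_compile()
    if rc == 0:
        return "jit"
    if rc == PCRE2_ERROR_NOMEMORY:
        # Library claims JIT but cannot set up executable memory:
        # fall back quietly instead of dying.
        return "interpreter"
    raise RuntimeError(f"unexpected JIT compile error {rc}")
```

The third branch is the only new one; whether it is worth carrying is
exactly the open question in this thread.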


Re: [PATCH 0/4] gc docs: modernize and fix the documentation

2019-07-31 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 31 2019, Jeff King wrote:

> On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Michael Haggerty and I have (off-list) discussed variations on that, but
>> > it opens up a lot of new issues.  Moving something into quarantine isn't
>> > atomic. So you've still corrupted the repo, but now it's recoverable by
>> > reaching into the quarantine. Who notices that the repo is corrupt, and
>> > how? When do we expire objects from quarantine?
>> >
>> > I think the heart of the issue is really the lack of atomicity in the
>> > operations. You need some way to mark "I am using this now" in a way
>> > that cannot race with "looks like nobody is using this, so I'll delete
>> > it".
>> >
>> > And ideally without traversing large bits of the graph on the writing
>> > side, and without requiring any stop-the-world locks during pruning.
>>
>> I was thinking (but realize now that I didn't articulate) that the "gc
>> quarantine" would be another "alternate" implementing a copy-on-write
>> "lockless delete-but-be-able-to-rollback scheme" as you put it.
>>
>> So "gc" would decide (racily) what's unreachable, but instead of
>> unlink()-ing it would "mv" the loose object/pack into the
>> "unreferenced-objects" quarantine.
>>
>> Then in your example #1 "wants to reference ABCD. It sees that we have
>> it." would race on the "other side". I.e. maybe ABCD was *just* moved to
>> the quarantine. But in that case we'd move it back, which would bump the
>> mtime and thus make it ineligible for expiry.
>
> I think this is basically the same as the current freshening scheme,
> though. In general, you can replace "move it back" with "update its
> mtime". Neither is atomic with respect to other operations.
>
> It does seem like the twist is that "gc" is supposed to do the "move it
> back" step (and it's also the thing expiring, if we assume that there's
> only one gc running at a time). But again, how do we know somebody isn't
> referencing it _right now_ while we're deciding whether to move it back?

The twist is to create a "quarantine" area of the object store you can't
read any objects from without copying them to the "main" area (git-gc
itself would be an exception).

Hence step #2 and #6, respectively, in your examples in
https://public-inbox.org/git/20190319001829.gl29...@sigill.intra.peff.net/
would have update-ref/receive-pack fail to find "ABCD" in the "main"
store due to the exact same race we have now with mtimes & gc, then fall
back to the "quarantine" and (this is the important part) immediately
copy it back to the "main" store.

IOW yes, you'd have the exact same race you have now with the initial
move to the quarantine. You'd have ref updates & gc racing and
"unreachable" things would be moved to the quarantine, but really they
just became reachable again.

The difference is that instead of unlinking that unreachable object we
move it to the quarantine, so the next "gc" (which is what would delete
it) would notice it's reachable and move it to the "main" area before
proceeding, *and* anything that "faults" back to reading the
"quarantine" would do the same.

> I think there are lots of solutions you can come up with if you have
> atomicity. But fundamentally it isn't there in the way we handle updates
> now. You could imagine something like a shared/unique lock where anybody
> updating a ref takes the "shared" side, and multiple entities can hold
> it at once. But somebody pruning takes the "unique" side and excludes
> everybody else, stopping ref updates during the prune (which you'd
> obviously want to do in a way that you hold the lock for as short as
> possible; say, optimistically check reachability without the lock, then
> take the lock and check to see if anything has changed).
>
> (By shared/unique I basically mean a reader/writer lock, but I didn't
> want to use those terms in the paragraph since both holders are
> writing).
>
> It is tricky to find out when to hold the shared lock, though. It's
> _not_ just a ref write, for example. When you accept a push, you'd want
> to hold the lock while you are checking that you have all of the
> necessary objects to write the ref. For something like "git commit" it's
> even harder, because we implicitly rely on state created by commands run
> over the course of hours or days (e.g., "git add" to put a blob in the
> index and maybe create the tree via cache-tree, then a commit to
> reference it, and finally the ref write; each step adds state which the
> next step relies on).

I don't think this sort of approach would require any global locks, but
it would be vulnerable to operations that take longer than the
"main->quarantine->unlink()" cycle takes. E.g. a "hash-object" that
takes a month before the subsequent "write-tree" etc.
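
To make the scheme above concrete, here is a minimal in-memory sketch in
Python. It assumes a single gc at a time and models "reachable" as a set
handed in by the caller; the class and method names are made up for the
example and nothing here corresponds to a real git API.

```python
# Toy in-memory model of the proposed "gc quarantine"; nothing here is a
# real git API, and all names are hypothetical.
class Store:
    def __init__(self):
        self.main = {}        # object id -> content
        self.quarantine = {}  # unreachable candidates awaiting deletion

    def read(self, oid):
        """Read an object, faulting it back from quarantine if needed."""
        if oid in self.main:
            return self.main[oid]
        if oid in self.quarantine:
            # The important part: anything read from the quarantine is
            # immediately moved back to the main area.
            self.main[oid] = self.quarantine.pop(oid)
            return self.main[oid]
        raise KeyError(oid)

    def gc(self, reachable):
        """One gc pass; `reachable` is the (racily computed) live set."""
        # 1. Rescue quarantined objects that turned out to be reachable.
        for oid in [o for o in self.quarantine if o in reachable]:
            self.main[oid] = self.quarantine.pop(oid)
        # 2. Only what is still quarantined gets deleted for good.
        deleted = sorted(self.quarantine)
        self.quarantine.clear()
        # 3. Newly unreachable objects are moved, not unlinked.
        for oid in [o for o in self.main if o not in reachable]:
            self.quarantine[oid] = self.main.pop(oid)
        return deleted
```

An object therefore survives at least one full gc cycle after it is first
judged unreachable, which is also where the "operation that takes longer
than the cycle" weakness comes from.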

All of the above written with the previously stated "I may be missing
something" caveat etc. :)


Re: [PATCH] send-email: Ask if a patch should be sent twice

2019-07-30 Thread Ævar Arnfjörð Bjarmason


On Tue, Jul 30 2019, Dmitry Safonov wrote:

> I was almost certain that git won't let me send the same patch twice,
> but today I've managed to double-send a directory by a mistake:
>   git send-email --to linux-ker...@vger.kernel.org /tmp/timens/
>   --cc 'Dmitry Safonov <0x7f454...@gmail.com>' /tmp/timens/`
>
> [I haven't noticed that I put the directory twice ^^]
>
> Prevent this shipwreck from happening again by asking if a patch
> is sent multiple times on purpose.
>
> link: https://lkml.kernel.org/r/4d53ebc7-d5b2-346e-c383-606401d19...@gmail.com
> Cc: Andrei Vagin 
> Signed-off-by: Dmitry Safonov 
> ---
>  git-send-email.perl | 23 ++-
>  1 file changed, 22 insertions(+), 1 deletion(-)

There are tests for send-email in t/t9001-send-email.sh. See if you can
add one for what you're adding here, it seems simple enough in this case.

> diff --git a/git-send-email.perl b/git-send-email.perl
> index 5f92c89c1c1b..0caafc104478 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -33,6 +33,7 @@
>  use Net::Domain ();
>  use Net::SMTP ();
>  use Git::LoadCPAN::Mail::Address;
> +use experimental 'smartmatch';

We depend on Perl 5.8, this bumps the requirement to 5.10. Aside from
that, ~~ is its own can of worms in Perl and is best avoided.

>  Getopt::Long::Configure qw/ pass_through /;
>
> @@ -658,6 +659,17 @@ sub is_format_patch_arg {
>   }
>  }
>
> +sub send_file_twice {
> + my $f = shift;
> + $_ = ask(__("Patch $f will be sent twice, continue? [y]/n "),

These cases with a default should have "Y/n", not "y/n". See other
examples in the file.

> + default => "y",
> + valid_re => qr/^(?:yes|y|no|n)/i);
> + if (/^n/i) {
> + cleanup_compose_files();
> + exit(0);

Exit if we have just one of these? More on that later...

> + }
> +}
> +
>  # Now that all the defaults are set, process the rest of the command line
>  # arguments and collect up the files that need to be processed.
>  my @rev_list_opts;
> @@ -669,10 +681,19 @@ sub is_format_patch_arg {
>   opendir my $dh, $f
>   or die sprintf(__("Failed to opendir %s: %s"), $f, $!);
>
> - push @files, grep { -f $_ } map { catfile($f, $_) }
> + my @new_files = grep { -f $_ } map { catfile($f, $_) }
>   sort readdir $dh;
> + foreach my $nfile (@new_files) {
> + if ($nfile ~~ @files) {
> + send_file_twice($nfile);
> + }

One non-smartmatch idiom for this is:

    my %seen;
    for my $file (@files) {
        if ($seen{$file}++) { ... }
    }

Or:

    my %seen;
    my @dupes = grep { $seen{$_}++ } @files;

> + }
> + push @files, @new_files;
>   closedir $dh;
>   } elsif ((-f $f or -p $f) and !is_format_patch_arg($f)) {
> + if ($f ~~ @files) {
> + send_file_twice($f);
> + }
>   push @files, $f;

...but picking up the comment above, I'd expect this to be in the "if
($validate)" block below or something similar, seems like this fits
right in with --validate.

Then you can also ask "do you want to send this set of patches twice?".

As it stands the user is asked one file at a time.
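
A rough sketch of the "collect first, ask once" shape, written in Python
rather than Perl purely for illustration. The function names and the
prompt wording are hypothetical, and unlike the grep one-liner it reports
each duplicated path only once.

```python
# Sketch of "collect duplicates first, ask once"; a Python rendering of the
# %seen idiom, not a patch to git-send-email.perl. Names are hypothetical.
def find_duplicates(files):
    """Return each path that occurs more than once, in first-repeat order."""
    seen = set()
    dupes = []
    for path in files:
        if path in seen and path not in dupes:
            dupes.append(path)
        seen.add(path)
    return dupes


def confirm_send(files, ask):
    """Ask a single question covering the whole set of duplicates.

    `ask` is any callable taking a prompt and returning the user's answer;
    an empty answer means the default, "yes".
    """
    dupes = find_duplicates(files)
    if not dupes:
        return True
    prompt = "These patches would be sent twice: %s. Continue? [Y/n] " % (
        ", ".join(dupes))
    answer = ask(prompt).strip().lower()
    return not answer.startswith("n")
```

Passing `ask` in as a callable keeps the check testable without a
terminal, which is the same reason to put a test in t9001.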

>   } else {
>   push @rev_list_opts, $f;


Re: [PATCH] Documentation/git-fsck.txt: include fsck.* config variables

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, SZEDER Gábor wrote:

> The 'fsck.skipList' and 'fsck.' config variables might be
> easier to discover when they are documented in 'git fsck's man page.
>
> Signed-off-by: SZEDER Gábor 
> ---
>  Documentation/git-fsck.txt | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
> index e0eae642c1..d72d15be5b 100644
> --- a/Documentation/git-fsck.txt
> +++ b/Documentation/git-fsck.txt
> @@ -104,6 +104,11 @@ care about this output and want to speed it up further.
>   progress status even if the standard error stream is not
>   directed to a terminal.
>
> +CONFIGURATION
> +-
> +
> +include::config/fsck.txt[]

Before this include let's add:

The below documentation is the same as what’s found in
git-config(1):

As I did for a similar change in git-gc in b6a8d09f6d ("gc docs: include
the "gc.*" section from "config" in "gc"", 2019-04-07). Sometimes we
repeat ourselves, it helps the reader to know this isn't some slightly
different prose than what's in git-config.

> +
>  DISCUSSION
>  --


Re: Settings for minimizing repacking (and keeping 'rsync' happy)

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Jeff King wrote:

> On Sun, Jul 28, 2019 at 01:41:34AM +0200, ardi wrote:
>
>> Some of my Git repositories have mirrors, maintained with 'rsync'. I
>> want to have some level of repacking, so that the repositories are
>> efficient, but I also want it to minimize it, so that 'rsync' never
>> has to perform a big transfer for the repositories.
>
> Yes, this is a common problem. The solutions I've seen/used are:
>
>   - use a git-aware transport like git-fetch that can negotiate which
> objects to send
>
>   - use a tool that can find duplicated chunks across files. Many
> de-duping backup systems (e.g., borg) use a rolling hash similar to
> rsync to find moveable chunks, but then look up those chunks in a
> master index (whereas rsync is always looking to match chunks in a
> file of the same name). This works well in practice because Git is
> not usually rewriting most of the data, but just shuffling it around
> between files.
>
> In theory it shouldn't be that hard to tell the receiving rsync to
> look for source chunks not just in the file of the same name, but
> from a set of existing packfiles (say, everything already in
> .git/objects/pack/ on the receiver). But I don't know offhand of an
> option to rsync to do so.
>
>> For example, I think it would be fine if files are repacked just once
>> in their lifetimes, and then that resulting pack file is never
>> repacked again. I did read the gc.bigPackThreshold and
>> gc.autoPackLimit settings, but I don't think they would accomplish
>> that.
>>
>> Basically, what I'm describing is the behaviour of not packing files
>> until the resulting pack would be a given size (say 10MB for example),
>> and then never repack such ~10MB packs again, ever.
>>
>> Can this be done with some Git settings? And do you foresee any kind
>> of serious drawback or potential problem with this kind of behaviour?
>
> You can mark a pack to be kept forever by creating a matching
> "pack-1234abcd.keep" file. That doesn't do your automatic "I want 10MB
> packs" thing, but if you did it occasionally at the right frequency,
> you'd end up with a bunch of 10MB-ish packs.
>
> But there are downsides to having a bunch of packs:
>
>   - object lookups are O(log n) within a single pack, but O(n) over the
> number of packs. So if you get a very large number of packs, normal
> operations will start to suffer. This is mitigated by the new "midx"
> feature, which generates an index for multiple packs.
>
>   - git doesn't allow delta compression across packs. So imagine you
> have ten versions of a file that's 5kb, and each version changes
> about 100 bytes. In a single pack, we'd store one base object, plus
> 9 deltas, for a total of about 6kb (5000 + 9*100). Across two packs,
> we'd store ~11kb (2*5000 + 8*100). And the worst case is ten packs
> at 50kb.
>
> As a more real-world example, try this:
>
>   git -c pack.packsizelimit=10M repack -ad
>
> In a fresh clone of git.git, the size of the pack directory jumps
> from 88MB to 168MB. And in a time-based split (i.e., creating a new
> 10MB pack every week), it may be even worse. The command above
> ordered the objects optimally to keep deltas together and _then_
> split things. Whereas a time-based scheme would likely sprinkle
> versions of a file across more packs.
>
> It should be possible to loosen this restriction and allow
> cross-pack deltas, but it would be very risky. The assumption that
> packs are independent of each other is implicit in much of Git's
> repacking code, so it would be easy to introduce a bug where we
> generate a circular dependency (object A in pack X is a delta
> against object B in pack Y, which is a delta against object A --
> oops, we don't have a full copy anymore).

The thread I started at
https://public-inbox.org/git/87bmhiykvw@evledraar.gmail.com/ should
also be of interest. I.e. we could have some knobs to create more
"stable" packs. I know rsync does some in-file hashing, but I don't
know if/how that works if you have 1 file split into N where some chunks
in the N are in the one file.

But it's possible to imagine a repacking algorithm that would keep
producing entirely new packs but arrange for it to be ordered/delta'd in
such a way that it optimizes for page-by-page similarity to an older
pack to some degree.

So e.g. in the examples you mention break the delta chain at 5, then
pick it up again once it's 10 etc. So the intermediate packs where it's
6, 7, 8, 9 would have the new stuff at the end.
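
To put numbers on why the way a delta chain is split across packs
matters, the arithmetic from the quoted example can be written down as
below. This is only a back-of-the-envelope model: it assumes exactly one
full base per pack, a fixed delta size, and ignores compression and pack
overhead. The function name and defaults are chosen for this example.

```python
# Back-of-the-envelope model of Peff's example: every pack that holds some
# versions of the file needs one full base, everything else is a delta.
def stored_bytes(versions, packs, base=5000, delta=100):
    """Bytes needed for `versions` of one file spread over `packs` packs."""
    if not 1 <= packs <= versions:
        raise ValueError("need between 1 and `versions` packs")
    return packs * base + (versions - packs) * delta
```

With ten versions this gives 5900 bytes for one pack, 10800 for two and
50000 for ten, matching the ~6kb, ~11kb and 50kb figures quoted above.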


Re: Warnings in gc.log can prevent gc --auto from running

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Jeff King wrote:

> On Thu, Jul 25, 2019 at 07:18:57PM -0700, Gregory Szorc wrote:
>
>> I think I've found some undesirable behavior with regards to the
>> behavior of `git gc --auto`. The tl;dr is that a warning message written
>> to gc.log can result in `git gc --auto` effectively disabling itself for
>> gc.logExpiry. The problem is easier to trigger in 2.22 as a result of
>> enabling bitmap indices for bare repositories by default and the
>> behavior can easily result in performance degradation, especially on
>> servers.
>
> Yuck, thanks for reporting this.
>
> As you note, this is a special case of a much larger problem. The other
> common case is the "oops, you still have a lot of loose objects after
> repacking" warning. There's more discussion and some patches here:
>
>   https://public-inbox.org/git/20180716172717.237373-1-jonathanta...@google.com/
>
> though I don't think any of the work that came out of that fundamentally
> solves the issue.

To add to that, Gregory will probably find these two old reports of mine
interesting. The former is pretty much his report (but for a different
root cause, the loose object issue):
https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ &
https://public-inbox.org/git/87fu6bmr0j@evledraar.gmail.com/

>> I don't prescribe to know the best way to solve this problem. I just
>> know it is a footgun sitting in the default Git configuration. And the
>> footgun became a lot easier to fire with the introduction of warning
>> messages related to bitmap indices and again when bitmap indices were
>> enabled by default for bare repositories in Git 2.22.
>
> IMHO one way to mitigate this is to simply warn less. In particular, if
> we are auto-enabling bitmaps, then it doesn't necessarily make sense for
> us to warn about them being disabled.
>
> In the case of .keep files, we've already got 7328482253 (repack:
> disable bitmaps-by-default if .keep files exist, 2019-06-29), which
> should be in the next released version of Git. But I suspect that's
> racy with respect to somebody creating .keep files, and as you note
> there are other config options that might prevent us from generating
> bitmaps.
>
> Instead, it may make sense to turn the --write-bitmap-index option of
> pack-objects into a tri-state: true/false/auto. Then pack-objects would
> know that we are in best-effort mode, and would avoid warning in that
> case. That would also let git-repack express its intentions better to
> git-pack-objects, so we could replace 7328482253, and keep more of the
> logic in pack-objects, which is ultimately what has to make the decision
> about whether it can generate bitmaps.

Sounds like pentastate to me :) (penta = 5, had to look it up). I.e. in
most cases of "auto" we pick a true/false at the outset, whereas this is
true/true-but-dont-care-much/false/false-but-dont-care-much with "auto"
picking the "-but-dont-care-much" versions of a "soft" true/false.
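
Spelled out as a sketch, with the caveat that the state names and the
"bare repository" rule for resolving "auto" are assumptions made for this
example and not existing pack-objects options:

```python
# Sketch of the "soft"/"hard" distinction; the names are made up for this
# example and are not pack-objects options.
from enum import Enum


class Bitmaps(Enum):
    TRUE = "true"               # user asked for bitmaps explicitly
    SOFT_TRUE = "soft-true"     # auto resolved to "try, but don't care much"
    FALSE = "false"             # user disabled bitmaps explicitly
    SOFT_FALSE = "soft-false"   # auto resolved to "skip, but don't care much"
    AUTO = "auto"               # not yet resolved


def resolve(setting, bare_repo):
    """Turn AUTO into one of the soft states; leave explicit choices alone."""
    if setting is not Bitmaps.AUTO:
        return setting
    return Bitmaps.SOFT_TRUE if bare_repo else Bitmaps.SOFT_FALSE


def should_warn(setting, can_write_bitmaps):
    """Warn only when an explicit request for bitmaps cannot be honoured."""
    return setting is Bitmaps.TRUE and not can_write_bitmaps
```

The point of keeping the soft states around after resolution is that the
warning decision can be made late, by whatever code finally knows whether
bitmaps can be written.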

On this general topic a *soft* poke about replying to
https://public-inbox.org/git/8736lnxlig@evledraar.gmail.com/ if you
have time. I think a "loose pack" might be a way forward for the loose
object proliferation, but maybe I'm wrong.

More generally we're really straining the gc.log pass-along-a-message
facility.


Re: [RFC PATCH] grep: allow for run time disabling of JIT in PCRE

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Carlo Arenas wrote:

> On Mon, Jul 29, 2019 at 1:55 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>> On Mon, Jul 29 2019, Carlo Marcelo Arenas Belón wrote:
>>
>> > PCRE1 allowed for a compile time flag to disable JIT, but PCRE2 never
>> > had one, forcing the use of JIT if -P was requested.
>>
>> What's that PCRE1 compile-time flag?
>
> NO_LIBPCRE1_JIT at GIT compile time (regardless of JIT support in the
> PCRE1 library you are using)

Ah of course, I was reading this as "regexp
compile-time". I.e. something like (*NO_JIT). No *such* thing exists for
PCRE v1 JIT AFAIK as exposed by git-grep.

>> > After ed0479ce3d (Merge branch 'ab/no-kwset' into next, 2019-07-15)
>> > the PCRE2 engine will be used more broadly and therefore adding this
>> > knob will give users a fallback for situations like the one observed
>> > in OpenBSD with a JIT enabled PCRE2, because of W^X restrictions:
>> >
>> >   $ git grep 'foo bar'
>> >   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>> >   $ git grep -G 'foo bar'
>> >   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>> >   $ git grep -E 'foo bar'
>> >   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>> >   $ git grep -F 'foo bar'
>> >   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>>
>> Yeah that obviously sucks more with ab/no-kwset, but that seems like a
>> case where -P would have been completely broken before, and therefore I
>> can't imagine the package ever passed "make test". Or is W^X also
>> exposed as some run-time option on OpenBSD?
>
> ironically, you could use PCRE1 since that is not using the JIT fast
> path and therefore will fallback automatically to the interpreter

...because OpenBSD PCRE v1 was compiled with --disable-jit before, but
their v2 package has --enable-jit, it just doesn't work at all? Is this
your custom built git + OpenBSD packages of PCRE coming with the OS?

I don't use OpenBSD, but isn't this their recipe? Seems they use "make
test", and don't compile with PCRE at all if I'm reading it right:
https://github.com/openbsd/ports/blob/master/devel/git/Makefile

> there is also a convoluted way to make your binary work by moving
> it into a mount point that has been specially exempted from that W^X
> restriction.
>
>> I.e. aside from the merits of such a setting in general these examples
>> seem like just working around something that should be fixed at make
>> all/test time, or maybe I'm missing something.
>
> 1) before you could just avoid using -P and still be able to grep
> 2) there is no way to tell PCRE2 to get out of the way even if you are
> not using -P

Right, no arguments at all about ab/no-kwset making this worse (re: your
#1). I just really prefer not to expose/document config for what
*should* be something purely internal if the X-Y problem is a bug being
exposed that we should just fix.

Particularly because I think it's a losing battle to provide run-time
options for what are surely a *lot* of "make test" failures.

If it really is impossible to detect this before runtime in some common
configurations I have no problem with it, I just haven't encountered
that so far.

> you are right though that this is not a new problem and was reported
> before with patches and the last comment saying a configuration
> should be provided.

patches = your recent
https://public-inbox.org/git/20181209230024.43444-2-care...@gmail.com/
or something earlier?

That patch seems sane without having tested it. Seems like the
equivalent of what we do with v1 with PCRE2_JIT_COMPLETE.

I *am* curious if there's setups where fixing the code for PCRE v1 isn't
purely an academic exercise. Is there a reason why these platforms
can't just move to PCRE v2 in principle (dumpster fires in "next"
notwithstanding)?

>> To the extent that we'd want to make this sort of thing configurable, I
>> wonder if a continuation of my (*NO_JIT) patch isn't better, i.e. just
>> adding the ability to configure some string we'd inject at the start of
>> every pattern.
>
> looking at the number of lines of code, it would seem the configuration
> approach is simpler.
>
>> That would allow for setting any other number of options in
>> pcre2syntax(3) without us needing to carry config for each one,
>> e.g. (*LIMIT_HEAP=d), (*LIMIT_DEPTH=d) etc. It does present a larger
>> foot-gun surface though...
>
> the parameters I suspect users might need are not really accessible through
> that (ex: jit stacksize).
>
> it is important to note that currently we are not preventing any user to use
> those flags themselves in their patterns either.


Re: [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> 1-3 here are a re-roll on "next". I figured that was easier for
>> everyone with the state of the in-flight patches, it certainly was for
>> me. Sorry Junio if this creates a mess for you.
>
> As long as I can just apply all of them on top of no-kwset and keep
> it a single topic, it wouldn't be too much of a hassle.
>
>> 4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
>> skip UTF8 checks explicitally" in
>> https://public-inbox.org/git/20190721183115.14985-1-care...@gmail.com/
>>
>> As noted the bug isn't fully fixed until 8/8, and that patch relies on
>> unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
>> enough state to limp forward as noted in the rationale of those
>> commits.
>
> Yikes.  Perhaps we should kick the no-kwset thing out of 'next' and
> start from scratch?  It does not sound that the world is ready yet.

I have some fix-for-the-fix and was going to submit a v3 of this series,
but I think the more responsible thing to do at this point, especially
with various patches from Carlo that need to be integrated in one way or
another, is to back it out until the outstanding issues are addressed.

If it's not too much trouble, would you mind reverting just the two
patches at the tip of ab/no-kwset in "next"? I.e.

b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01)
48de2a768c ("grep: remove the kwset optimization", 2019-07-01)

I believe the rest are all settled & haven't had any issues raised with
them, and those tests & preparatory fixes would be very useful to have
in "master" for any re-roll without needing to be distracted by those
changes.

> But that is just a knee-jerk reaction before reading the actual
> patches.  Let's see how they look ;-)
>
> Thanks.
>
>> Ævar Arnfjörð Bjarmason (8):
>>   grep: remove overly paranoid BUG(...) code
>>   grep: stop "using" a custom JIT stack with PCRE v2
>>   grep: stop using a custom JIT stack with PCRE v1
>>   grep: consistently use "p->fixed" in compile_regexp()
>>   grep: create a "is_fixed" member in "grep_pat"
>>   grep: stess test PCRE v2 on invalid UTF-8 data
>>   grep: do not enter PCRE2_UTF mode on fixed matching
>>   grep: optimistically use PCRE2_MATCH_INVALID_UTF
>>
>>  Makefile|  1 +
>>  grep.c  | 68 +++--
>>  grep.h  | 13 ++-
>>  t/helper/test-pcre2-config.c| 12 ++
>>  t/helper/test-tool.c|  1 +
>>  t/helper/test-tool.h|  1 +
>>  t/t7812-grep-icase-non-ascii.sh | 39 +++
>>  7 files changed, 80 insertions(+), 55 deletions(-)
>>  create mode 100644 t/helper/test-pcre2-config.c


Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Jul 29 2019, Carlo Arenas wrote:
>
>> On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
>>  wrote:
>>>
>>> It's less confusing to use that variable consistently that switch back
>>> & forth between the two.
>>>
>>> Signed-off-by: Ævar Arnfjörð Bjarmason 
>>> ---
>>>  grep.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/grep.c b/grep.c
>>> index 9c2b259771..b94e998680 100644
>>> --- a/grep.c
>>> +++ b/grep.c
>>> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>> die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2"));
>>>
>>> pat_is_fixed = is_fixed(p->pattern, p->patternlen);
>>> -   if (opt->fixed || pat_is_fixed) {
>>> +   if (p->fixed || pat_is_fixed) {
>>
>> at the end of this series we have:
>>
>>   if (p->fixed || p->is_fixed)
>>
>> which doesn't make sense; at least with opt->fixed it was clear that
>> what was meant is that grep was passed -P
>
> I assume you mean "was passed -F...".
>
>> maybe is_fixed shouldn't exist and fixed when applied to the pattern
>> means we had determined it was a fixed
>> pattern and overridden the user selection of engine.
>
> They're two flags because p->fixed is "--fixed-strings", and p->is_fixed
> is "there's no metachars here". So the former case needs escaping, as
> the code just below might do (the two aren't mutually exclusive).
>
> I don't get how you think we can always fold them into one flag, but
> maybe I'm missing something...
>
>> that at least will give us a logical way to fix the pattern reported
>> in [1] and that currently requires the user to know
>> git's grep internals and know he can skip the "is_fixed" optimization
>> by doing something like :
>>
>>   $ git grep 'foo[ ]bar'
>>
>> [1] https://public-inbox.org/git/20190728235427.41425-1-care...@gmail.com/
>
> As I noted in a reply there this seems like a way to fix a bug in "next"
> with a config knob. Yes we should fix the bug, but we've had the kwset
> code in git for years without needing this distinction, so after we work
> out the bugs I don't see why we'd need this.
>
> The reason we ignore the user's choice here is because you might
> e.g. set grep.patternType=extended in your config, and you'd still want
> grepping for a fixed "foo" to be fast.

...and more generally, for any future sanity of implementation and
maintenance I think we should only make the promise that we support
certain syntax & semantics, not that the -F, -G, -E, -P options are
guaranteed to dispatch to a given codepath.

Internally we should be free to switch between those, so e.g. if a
pattern is fixed and you configure "basic" regexp, but we know your C
library is faster for those matches with REG_EXTENDED we should just
pass that regardless of -G or -E.

Of course that means we *must* expose the same semantics (to some
reasonable extent), which means I have a lot of bugs in "next" to
address.

I'm just saying that the presence of those bugs means we should be
inclined to fix them / back out certain changes, not work around them
with user-serviceable knobs.


Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>> It's less confusing to use that variable consistently than switch back
>> & forth between the two.
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason 
>> ---
>>  grep.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/grep.c b/grep.c
>> index 9c2b259771..b94e998680 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct 
>> grep_opt *opt)
>> die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>>
>> pat_is_fixed = is_fixed(p->pattern, p->patternlen);
>> -   if (opt->fixed || pat_is_fixed) {
>> +   if (p->fixed || pat_is_fixed) {
>
> at the end of this series we have:
>
>   if (p->fixed || p->is_fixed)
>
> which doesn't make sense; at least with opt->fixed it was clear that
> what was meant is that grep was passed -P

I assume you mean "was passed -F...".

> maybe is_fixed shouldn't exist and fixed when applied to the pattern
> means we had determined it was a fixed
> pattern and overridden the user selection of engine.

They're two flags because p->fixed is "--fixed-strings", and p->is_fixed
is "there's no metachars here". So the former case needs escaping, as
the code just below might do (the two aren't mutually exclusive).

I don't get how you think we can always fold them into one flag, but
maybe I'm missing something...
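
For anyone following along, the "no metachars here" check amounts to
roughly the following. This is a sketch and not the code in grep.c:
the function name and the exact set of special bytes are assumptions
for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an is_fixed()-style check, not git's code.
 * Returns 1 if the counted string contains none of the bytes treated
 * here as regex metacharacters, 0 otherwise.
 */
static int pattern_is_literal(const char *s, size_t len)
{
	/* Assumed metacharacter set, for illustration only */
	static const char special[] = "$()*+.?[\\^{|";
	size_t i;

	for (i = 0; i < len; i++) {
		/*
		 * Search the table minus its terminator, so a NUL byte
		 * inside the pattern does not count as special.
		 */
		if (memchr(special, s[i], sizeof(special) - 1))
			return 0;
	}
	return 1;
}
```

With this, 'foo bar' counts as literal while 'foo[ ]bar' does not,
which is why the latter sidesteps the optimization. A pattern given
with --fixed-strings is a separate matter: it may well contain
metachars, and those then need escaping before being handed to a
regex engine.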

> that at least will give us a logical way to fix the pattern reported
> in [1] and that currently requires the user to know
> git's grep internals and know he can skip the "is_fixed" optimization
> by doing something like :
>
>   $ git grep 'foo[ ]bar'
>
> [1] https://public-inbox.org/git/20190728235427.41425-1-care...@gmail.com/

As I noted in a reply there this seems like a way to fix a bug in "next"
with a config knob. Yes we should fix the bug, but we've had the kwset
code in git for years without needing this distinction, so after we work
out the bugs I don't see why we'd need this.

The reason we ignore the user's choice here is because you might
e.g. set grep.patternType=extended in your config, and you'd still want
grepping for a fixed "foo" to be fast.


Re: [RFC PATCH] grep: allow for run time disabling of JIT in PCRE

2019-07-29 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 29 2019, Carlo Marcelo Arenas Belón wrote:

> PCRE1 allowed for a compile time flag to disable JIT, but PCRE2 never
> had one, forcing the use of JIT if -P was requested.

What's that PCRE1 compile-time flag?

> After ed0479ce3d (Merge branch 'ab/no-kwset' into next, 2019-07-15)
> the PCRE2 engine will be used more broadly and therefore adding this
> knob will give users a fallback for situations like the one observed
> in OpenBSD with a JIT enabled PCRE2, because of W^X restrictions:
>
>   $ git grep 'foo bar'
>   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>   $ git grep -G 'foo bar'
>   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>   $ git grep -E 'foo bar'
>   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'
>   $ git grep -F 'foo bar'
>   fatal: Couldn't JIT the PCRE2 pattern 'foo bar', got '-48'

Yeah that obviously sucks more with ab/no-kwset, but that seems like a
case where -P would have been completely broken before, and therefore I
can't imagine the package ever passed "make test". Or is W^X also
exposed as some run-time option on OpenBSD?

I.e. aside from the merits of such a setting in general these examples
seem like just working around something that should be fixed at make
all/test time, or maybe I'm missing something.

To the extent that we'd want to make this sort of thing configurable, I
wonder if a continuation of my (*NO_JIT) patch isn't better, i.e. just
adding the ability to configure some string we'd inject at the start of
every pattern.

That would allow for setting any other number of options in
pcre2syntax(3) without us needing to carry config for each one,
e.g. (*LIMIT_HEAP=d), (*LIMIT_DEPTH=d) etc. It does present a larger
foot-gun surface though...


Re: [PATCH 3/3] grep: plug leak of pcre chartables in PCRE2

2019-07-27 Thread Ævar Arnfjörð Bjarmason


On Sat, Jul 27 2019, Carlo Marcelo Arenas Belón wrote:

> Just as it is done with PCRE1, make sure that the allocated chartables
> get freed at cleanup time.
>
> This assumes no global context is used (NULL passed when created the
> tables), but will likely be updated in tandem if that ever changes.
>
> Signed-off-by: Carlo Marcelo Arenas Belón 
> ---
>  grep.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/grep.c b/grep.c
> index d04635fad4..d9768c5f05 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -604,6 +604,7 @@ static void free_pcre2_pattern(struct grep_pat *p)
>   pcre2_match_data_free(p->pcre2_match_data);
>   pcre2_jit_stack_free(p->pcre2_jit_stack);
>   pcre2_match_context_free(p->pcre2_match_context);
> + free((void *)p->pcre_tables);

Is the cast really needed? I'm rusty on the rules, removing it from the
pcre_free() you might have copied this from produces a warning for me,
but not for free() itself. This is on GCC 8.3.0. How about for you &
what compiler(s)?


Re: [PATCH 1/3] grep: make pcre1_tables version agnostic

2019-07-27 Thread Ævar Arnfjörð Bjarmason


On Sat, Jul 27 2019, Carlo Marcelo Arenas Belón wrote:

> 6d4b5747f0 ("grep: change internal *pcre* variable & function names
> to be *pcre1*", 2017-05-25), renamed most variables to be PCRE1
> specific to give space to similarly named variables for PCRE2, but
> in this case the change wasn't needed as the types were compatible
> enough (const unsigned char* vs const uint8_t*) to be shared.

Both the v1 and v2 functions return const unsigned char *. I don't know
where I got the uint8_t from. This makes more sense.

This series looks good to me. Thanks for the fix. Just one caveat:

The point of 6d4b5747f0 was not to only split out those variables we
couldn't get away with re-using. Then I would have later re-used
e.g. pcre1_jit_on & pcre2_jit_on as just pcre_jit_on. We could also do
that now.

I think doing that & this part of your changes makes things less
readable. The two code branches we compile with ifdefs are mutually
exclusive, so having the variables be unique helps with eyeballing /
reasoning when changing the code.

> Revert that change, as 94da9193a6 ("grep: add support for PCRE v2",
> 2017-06-01) failed to create an equivalent PCRE2 version.
>
> Signed-off-by: Carlo Marcelo Arenas Belón 
> ---
>  grep.c | 6 +++---
>  grep.h | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index f7c3a5803e..cc65f7a987 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -389,14 +389,14 @@ static void compile_pcre1_regexp(struct grep_pat *p, 
> const struct grep_opt *opt)
>
>   if (opt->ignore_case) {
>   if (has_non_ascii(p->pattern))
> - p->pcre1_tables = pcre_maketables();
> + p->pcre_tables = pcre_maketables();
>   options |= PCRE_CASELESS;
>   }
>   if (is_utf8_locale() && has_non_ascii(p->pattern))
>   options |= PCRE_UTF8;
>
>   p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
> -   p->pcre1_tables);
> +   p->pcre_tables);
>   if (!p->pcre1_regexp)
>   compile_regexp_failed(p, error);
>
> @@ -462,7 +462,7 @@ static void free_pcre1_regexp(struct grep_pat *p)
>   {
>   pcre_free(p->pcre1_extra_info);
>   }
> - pcre_free((void *)p->pcre1_tables);
> + pcre_free((void *)p->pcre_tables);
>  }
>  #else /* !USE_LIBPCRE1 */
>  static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
> *opt)
> diff --git a/grep.h b/grep.h
> index 1875880f37..d34f66b384 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -89,7 +89,7 @@ struct grep_pat {
>   pcre *pcre1_regexp;
>   pcre_extra *pcre1_extra_info;
>   pcre_jit_stack *pcre1_jit_stack;
> - const unsigned char *pcre1_tables;
> + const unsigned char *pcre_tables;
>   int pcre1_jit_on;
>   pcre2_code *pcre2_pattern;
>   pcre2_match_data *pcre2_match_data;


Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Jul 26 2019, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason   writes:
>>
>>> diff --git a/Makefile b/Makefile
>>> index bd246f2989..dd38d5e527 100644
>>> --- a/Makefile
>>> +++ b/Makefile
>>> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
>>>  TEST_BUILTINS_OBJS += test-online-cpus.o
>>>  TEST_BUILTINS_OBJS += test-parse-options.o
>>>  TEST_BUILTINS_OBJS += test-path-utils.o
>>> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>>
>> This won't even build with any released pcre version; shouldn't we
>> make it at least conditionally compiled code?  Specifically...
>>
>>>  TEST_BUILTINS_OBJS += test-pkt-line.o
>>>  TEST_BUILTINS_OBJS += test-prio-queue.o
>>>  TEST_BUILTINS_OBJS += test-reach.o
>>> diff --git a/grep.c b/grep.c
>>> index c7c06ae08d..8b8b9efe12 100644
>>> --- a/grep.c
>>> +++ b/grep.c
>>> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
>>> const struct grep_opt *opt
>>> }
>>> if (!opt->ignore_locale && is_utf8_locale() && 
>>> has_non_ascii(p->pattern) &&
>>> !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> -   options |= PCRE2_UTF;
>>> +   options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>>>  p->patternlen, options, &error, 
>>> &erroffset,
>>> diff --git a/grep.h b/grep.h
>>> index c0c71eb4a9..506f05b97b 100644
>>> --- a/grep.h
>>> +++ b/grep.h
>>> @@ -21,6 +21,9 @@ typedef int pcre_extra;
>>>  #ifdef USE_LIBPCRE2
>>>  #define PCRE2_CODE_UNIT_WIDTH 8
>>>  #include <pcre2.h>
>>> +#ifndef PCRE2_MATCH_INVALID_UTF
>>> +#define PCRE2_MATCH_INVALID_UTF 0
>>> +#endif
>>
>> ... unlike this piece of code ...
>>
>>>  #else
>>>  typedef int pcre2_code;
>>>  typedef int pcre2_match_data;
>>> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
>>> new file mode 100644
>>> index 00..5258fdddba
>>> --- /dev/null
>>> +++ b/t/helper/test-pcre2-config.c
>>> @@ -0,0 +1,12 @@
>>> +#include "test-tool.h"
>>> +#include "cache.h"
>>> +#include "grep.h"
>>> +
>>> +int cmd__pcre2_config(int argc, const char **argv)
>>> +{
>>> +   if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
>>> +   int value = PCRE2_MATCH_INVALID_UTF;
>>
>> ... this part does not have any fallback definition.
>
> It works because we include grep.h, which'll define
> PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested
> this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works
> & runs/skips the appropriate tests.

Ah, I spoke too soon, of course that's all guarded by "are we using PCRE
v2 in general?". I'll fix it...


Re: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> diff --git a/grep.c b/grep.c
>> index 6d60e2e557..5bc0f4f32a 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct 
>> grep_opt *opt)
>>  die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>>
>>  p->is_fixed = is_fixed(p->pattern, p->patternlen);
>> +#ifdef USE_LIBPCRE2
>> +   if (!p->fixed && !p->is_fixed) {
>> +   const char *no_jit = "(*NO_JIT)";
>> +   const int no_jit_len = strlen(no_jit);
>> +   if (starts_with(p->pattern, no_jit) &&
>> +   is_fixed(p->pattern + no_jit_len,
>> +p->patternlen - no_jit_len))
>> +   p->is_fixed = 1;
>
> It is unfortunate that is_fixed() takes a counted string.
> Otherwise, using skip_prefix() to avoid "+no_jit_len" would have
> made it much easier to read. i.e.
>
>   /* an illustration that does not quite work */
>   char *pattern_body;
>   if (skip_prefix(p->pattern, "(*NO_JIT)", &pattern_body) &&
> is_fixed(pattern_body))
>   p->is_fixed = 1;

Indeed, but then we couldn't use this for patterns that have NUL in
them, which we otherwise support (and support here). So I think it's
worth keeping it so it takes ptr+len.
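
If the "+ no_jit_len" arithmetic is the readability concern, a counted
variant of skip_prefix() would get us most of the way. A sketch,
assuming a helper that doesn't exist today (the name and signature
are made up here):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical counted-string cousin of skip_prefix(). On a match it
 * returns 1 and sets *body / *body_len to what follows the prefix;
 * otherwise it returns 0 and leaves both untouched.
 */
static int skip_prefix_counted(const char *s, size_t len, const char *prefix,
			       const char **body, size_t *body_len)
{
	size_t plen = strlen(prefix);

	/* memcmp() instead of strncmp() so NUL bytes in s are compared too */
	if (len < plen || memcmp(s, prefix, plen))
		return 0;
	*body = s + plen;
	*body_len = len - plen;
	return 1;
}
```

The caller would then pass body and body_len on to is_fixed(), and a
pattern such as "(*NO_JIT)a\0b" keeps its full three-byte body
instead of being cut short at the NUL.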

>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 
>> data' '
>> +printf "\\200\\n" >invalid-0x80 &&
>> +echo "ævar" >expected &&
>> +cat expected >>invalid-0x80 &&
>> +git add invalid-0x80
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from 
>> invalid UTF-8 data' '
>> +git grep -h "var" invalid-0x80 >actual &&
>> +test_cmp expected actual &&
>> +git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>> +test_cmp expected actual
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from 
>> invalid UTF-8 data' '
>> +test_might_fail git grep -h "æ" invalid-0x80 >actual &&
>> +test_cmp expected actual &&
>> +test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
>> +test_cmp expected actual
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from 
>> invalid UTF-8 data with -i' '
>> +test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
>> +test_cmp expected actual &&
>> +test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
>> +test_cmp expected actual
>> +'
>> +
>>  test_done


Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> diff --git a/Makefile b/Makefile
>> index bd246f2989..dd38d5e527 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
>>  TEST_BUILTINS_OBJS += test-online-cpus.o
>>  TEST_BUILTINS_OBJS += test-parse-options.o
>>  TEST_BUILTINS_OBJS += test-path-utils.o
>> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>
> This won't even build with any released pcre version; shouldn't we
> make it at least conditionally compiled code?  Specifically...
>
>>  TEST_BUILTINS_OBJS += test-pkt-line.o
>>  TEST_BUILTINS_OBJS += test-prio-queue.o
>>  TEST_BUILTINS_OBJS += test-reach.o
>> diff --git a/grep.c b/grep.c
>> index c7c06ae08d..8b8b9efe12 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
>> const struct grep_opt *opt
>>  }
>>  if (!opt->ignore_locale && is_utf8_locale() && 
>> has_non_ascii(p->pattern) &&
>>  !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> -options |= PCRE2_UTF;
>> +options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>
>>  p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>>   p->patternlen, options, &error, 
>> &erroffset,
>> diff --git a/grep.h b/grep.h
>> index c0c71eb4a9..506f05b97b 100644
>> --- a/grep.h
>> +++ b/grep.h
>> @@ -21,6 +21,9 @@ typedef int pcre_extra;
>>  #ifdef USE_LIBPCRE2
>>  #define PCRE2_CODE_UNIT_WIDTH 8
>>  #include <pcre2.h>
>> +#ifndef PCRE2_MATCH_INVALID_UTF
>> +#define PCRE2_MATCH_INVALID_UTF 0
>> +#endif
>
> ... unlike this piece of code ...
>
>>  #else
>>  typedef int pcre2_code;
>>  typedef int pcre2_match_data;
>> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
>> new file mode 100644
>> index 00..5258fdddba
>> --- /dev/null
>> +++ b/t/helper/test-pcre2-config.c
>> @@ -0,0 +1,12 @@
>> +#include "test-tool.h"
>> +#include "cache.h"
>> +#include "grep.h"
>> +
>> +int cmd__pcre2_config(int argc, const char **argv)
>> +{
>> +if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
>> +int value = PCRE2_MATCH_INVALID_UTF;
>
> ... this part does not have any fallback definition.

It works because we include grep.h, which'll define
PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested
this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works
& runs/skips the appropriate tests.
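
Boiled down, the mechanism is just this. The sketch below is a
standalone stand-in for the helper, not the patch itself: it includes
no PCRE header at all, so the fallback always applies, and
probe_feature() is a made-up name.

```c
#include <string.h>

/*
 * Fallback as in the grep.h hunk. In this sketch nothing defines the
 * macro beforehand, so it always ends up as 0. Note that it sits
 * outside any other #ifdef, so it is always reached.
 */
#ifndef PCRE2_MATCH_INVALID_UTF
#define PCRE2_MATCH_INVALID_UTF 0
#endif

/*
 * Hypothetical stand-in for the test helper. The return value is meant
 * as a process exit status: 0 means "flag is supported".
 */
static int probe_feature(const char *name)
{
	if (!strcmp(name, "has-PCRE2_MATCH_INVALID_UTF"))
		return !PCRE2_MATCH_INVALID_UTF;
	return 1;
}
```

With the fallback in effect, OR-ing the flag into the compile options
is a no-op, and the probe reports "unsupported" so the tests that
depend on the flag get skipped.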


Re: [PATCH] grep: skip UTF8 checks explicitally

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason
>  wrote:
>> I'm not sure what a real fix for that is. Part of it is probably 8/8 in
>> the series I mention below, but more generally we'd need to be more
>> encoding aware at a much higher callsite than "grep". So e.g. we'd know
>> that we match "binary" data as not-UTF-8. Now we just throw arbitrary
>> bytes around and hope something sticks.
>
> I haven't looked yet at your proposed changes, but my gut feeling is that
> the work to support invalid UTF in the yet unreleased PCRE version would
> be needed as part of it, and therefore it might be better to keep PCRE
> out of the main path until that gets released and can be relied upon.

I'm hoping my 8-part series is good enough to move it forward, but as
48de2a768c ("grep: remove the kwset optimization", 2019-07-01) shows we
can always just fall back on using regcomp instead.

> kwset is not going away with this series anyway, regardless of the no-kwset
> name on the branch.

The larger context here is that this is the 1st step of a 2-step series
to get rid of kwset. If I can pull that off successfully is another
matter, but that's the plan. After it's applied we just use it in the
pickaxe code, and it's relatively straightforward to convert that. See:
https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/

>> > If we're already deciding to paper over things, I'd much rather prefer
>> > the simpler patch, i.e. Carlo's.
>>
>> As I noted upthread PCRE's own docs promise undefined behavior and fire
>> and brimstone if that patch is applied. Those last two not
>> guaranteed. So we need another solution.
>
> in my original reply I mentioned I explicitly didn't do a test because of this
> "undefined behavior", but I think it should be fair to mention that we are
> already affected by that because using the JIT fast path does skip any
> UTF-8 validations and is currently possible to get git into an infinite loop
> or make it segfault when using PCRE.

Right, this is a good point that we should take notice of. I.e. this is
*not* a new bug per-se, you can do this on master and get a UTF-8 bug
from git.git:

git grep -P '(*NO_JIT)[æ]'

> in that line, I am not sure I understand the pushback against making that
> explicit since it only makes both codepaths behave the same (bugs and
> risks of burning alike)

Because with my kwset series we're getting a lot more users of this
until-now obscure code, so we're finding old-but-new-to-us bugs.

We've had this bug dating all the way back to Duy's 18547aacf5
("grep/pcre: support utf-8", 2016-06-25). It was first released with git
2.10.

So why are we getting list discussion about it *now*? Because my kwset
series got merged to "next", and we apparently have a lot of users who'd
use fixed-string git-grep under locales, but never used PCRE via -P
explicitly before.

So it's worth getting the semantics right. As noted in the E-Mail I
linked to earlier my ulterior motive here is to get to a point where
we'll funnel all regex matching through PCRE implicitly if it's
available.

We need to get these UTF-8 edge cases right. I don't know if my recent
8-part series gets us 100% there, but hopefully it at least gets us
closer to it.


Re: [PATCH] grep: skip UTF8 checks explicitally

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  writes:
>
>> FWIW what I meant was not that we'd run around and iconv() things, it
>> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
>> valid", which presumably would be the end result of something like that.
>>
>> Rather that this model of assuming that a UTF-8 pattern means we can
>> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
>> kwset patches *revealed* that problem in a painful way, but it was there
>> already.
>
> We already do assume that pathnames are UTF-8 (pathspecs on MacOS
> are converted and then they are matched assuming that property).
> Further, with the same mechanism, I think there is an assumption
> that anything that comes from the command line is UTF-8 (and if I
> recall correctly, doesn't the Windows port of Git force us to use
> the same assumption---I recall we needed tests tweak for that).
>
> In the very very longer term, I do not think we would want to keep
> the assumption that the text encoding of blobs is always UTF-8, and
> it would be nice to extend the system, so that blob data could be
> marked in some way to say "I'm in Big-5, and not in UTF-8, so please
> treat me as such" and magically the needle and the haystack can be
> made to agree, with iconv() either one of them.
>
> But I do not think the current topic to fix the immediate/imminent
> breakage should be distracted by that.  Let's keep assuming that
> any blob, when it is text, is UTF-8.
>
> And from that point of view, I think the two pieces of idea in your
> earlier message does make sense.  We can try to match as binary most
> of the time, as UTF-8 would not let a valid UTF-8 needle match in
> the haystack starting in the middle of a character.

*nod*

> When the user is trying to match case-insensitively, we know the
> haystack in which the user is interested in finding the needle is
> text, even though there may be non-text blobs as well.
>
> For example, "git grep -i 'foo' t/" may find a few png files under
> the t/ directory.  We do not care if they happen to contain Foo and
> we do not mind if they appear or do not appear in the result.  The
> only two things we care about are (1) foo, Foo, FOO are found in the
> text files under t/ and (2) the command does not die in the middle,
> before processing all the files, only because a png file it found
> were not UTF-8 valid.

I think this part's a step too far, and not how e.g. GNU grep
works. Peeking into binary data in a text grep is what people expect,
e.g. because you might want to recursively grep mixed text/mp3s for an
author. The text part of the mp3s means that metadata will be grepped
for inside the binary files.

Getting that right is hard around the edges though...


Re: [PATCH] grep: skip UTF8 checks explicitally

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Thu, Jul 25 2019, Johannes Schindelin wrote:

> Hi Junio,
>
> On Thu, 25 Jul 2019, Junio C Hamano wrote:
>
>> Johannes Schindelin  writes:
>>
>> >> OK, in short, barfing and stopping is a problem, but that flag is
>> >> not the right knob to tweak.  And the right knob ...
>> >>
>> >> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
>> >> > reported here. I.e. there's no reason I can think of for why a
>> >> > fixed-string pattern should need PCRE2_UTF set when not combined
>> >> > with --ignore-case. We can just not do that, but maybe I'm missing
>> >> > something there.
>> >> >
>> >> >  2) We can do "try utf8, and fallback". A more advanced version of this
>> >> > is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
>> >> > does. I was thinking something closer to just carrying two compiled
>> >> > patterns, and falling back on the ~PCRE2_UTF one if we get a
>> >> > PCRE2_ERROR_UTF8_* error.
>> >>
>> >> ... lies somewhere along that line.  I think that is very sensible.
>> >
>> > I am glad that everybody agrees with my original comment on ab/no-kwset
>> > where I suggested that we should use our knowledge of the encoding of
>> > the haystack and convert it to UTF-8 if we detect that the pattern is
>> > UTF-8 encoded,...
>>
>> Please do not count me among "everybody", then.  I did not think
>> that Ævar meant to iconv the haystack when I wrote the message you
>> are responding to, but if that was what he meant, I would not have
>> said "very sensible".
>
> Okay, but in that case I cannot agree with your assessment that it is
> very sensible.

FWIW what I meant was not that we'd run around and iconv() things, it
wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
valid", which presumably would be the end result of something like that.

Rather that this model of assuming that a UTF-8 pattern means we can
consider everything in the repo UTF-8 in git-grep doesn't make sense. My
kwset patches *revealed* that problem in a painful way, but it was there
already.

I'm not sure what a real fix for that is. Part of it is probably 8/8 in
the series I mention below, but more generally we'd need to be more
encoding aware at a much higher callsite than "grep". So e.g. we'd know
that we match "binary" data as not-UTF-8. Now we just throw arbitrary
bytes around and hope something sticks.
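
To make "arbitrary bytes" concrete: whether a buffer is well-formed
UTF-8 is cheap to decide up front. The following is a sketch of such
a check, not something from the series; utf8_is_valid() is a made-up
name and nothing in grep.c calls it.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. Rejects lone continuation bytes, truncated sequences,
 * overlong encodings, surrogates and code points above U+10FFFF.
 */
static int utf8_is_valid(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		unsigned long cp;
		size_t need, k;

		if (c < 0x80) {
			i++;
			continue;
		} else if (c >= 0xc2 && c <= 0xdf) {
			need = 1;
			cp = c & 0x1f;
		} else if (c >= 0xe0 && c <= 0xef) {
			need = 2;
			cp = c & 0x0f;
		} else if (c >= 0xf0 && c <= 0xf4) {
			need = 3;
			cp = c & 0x07;
		} else {
			/* 0x80..0xc1 and 0xf5..0xff can never start a sequence */
			return 0;
		}

		/* All continuation bytes must be inside the buffer */
		if (len - i <= need)
			return 0;
		for (k = 1; k <= need; k++) {
			if ((buf[i + k] & 0xc0) != 0x80)
				return 0;
			cp = (cp << 6) | (buf[i + k] & 0x3f);
		}
		if ((need == 2 && cp < 0x800) ||
		    (need == 3 && cp < 0x10000) ||
		    (cp >= 0xd800 && cp <= 0xdfff) ||
		    cp > 0x10ffff)
			return 0;
		i += need + 1;
	}
	return 1;
}
```

A line containing "ævar" passes, while a file that starts with a lone
0x80 byte does not, even if the rest of it is perfectly good UTF-8.
That per-file (or per-line) distinction is the kind of knowledge we
don't have at the point where we pick PCRE2_UTF today.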

> If we're already deciding to paper over things, I'd much rather prefer
> the simpler patch, i.e. Carlo's.

As I noted upthread PCRE's own docs promise undefined behavior and fire
and brimstone if that patch is applied. Those last two not
guaranteed. So we need another solution.

I've submitted
https://public-inbox.org/git/20190726150818.6373-1-ava...@gmail.com/
just now. See what you think about it.


[PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF

2019-07-26 Thread Ævar Arnfjörð Bjarmason
As discussed in the "grep: stess test PCRE v2 on invalid UTF-8 data"
commit leading up to this one there's a regression in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01) when matching UTF-8 data.

This ultimately isn't straightforward to just "fix", because the kwset
backend was so dumb about icase matching that we'd skip it entirely on
non-ASCII. See the code removed in 48de2a768c ("grep: remove the kwset
optimization", 2019-07-01).

Just going back to the C library for those isn't ideal, since it's
likely to be even dumber about these mixed-encoding cases.

So let's support this "properly" using the PCRE2_MATCH_INVALID_UTF
flag. This is new code that's not in any released PCRE v2 version, so
we might need a fix that emulates it somehow. I figure that with the
non-icase bug out of the way this case is obscure enough to tell
people "upgrade your PCRE v2 too!". It'll likely be released by
the time we release the git version this commit is part of.

We can't just use PCRE2_NO_UTF_CHECK instead for the reasons discussed
in [1].

1. https://public-inbox.org/git/87lfwn70nb@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 Makefile|  1 +
 grep.c  |  2 +-
 grep.h  |  3 +++
 t/helper/test-pcre2-config.c| 12 
 t/helper/test-tool.c|  1 +
 t/helper/test-tool.h|  1 +
 t/t7812-grep-icase-non-ascii.sh | 13 -
 7 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

diff --git a/Makefile b/Makefile
index bd246f2989..dd38d5e527 100644
--- a/Makefile
+++ b/Makefile
@@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-reach.o
diff --git a/grep.c b/grep.c
index c7c06ae08d..8b8b9efe12 100644
--- a/grep.c
+++ b/grep.c
@@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
}
if (!opt->ignore_locale && is_utf8_locale() && 
has_non_ascii(p->pattern) &&
!(!opt->ignore_case && (p->fixed || p->is_fixed)))
-   options |= PCRE2_UTF;
+   options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
 p->patternlen, options, &error, 
&erroffset,
diff --git a/grep.h b/grep.h
index c0c71eb4a9..506f05b97b 100644
--- a/grep.h
+++ b/grep.h
@@ -21,6 +21,9 @@ typedef int pcre_extra;
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
 #include <pcre2.h>
+#ifndef PCRE2_MATCH_INVALID_UTF
+#define PCRE2_MATCH_INVALID_UTF 0
+#endif
 #else
 typedef int pcre2_code;
 typedef int pcre2_match_data;
diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
new file mode 100644
index 00..5258fdddba
--- /dev/null
+++ b/t/helper/test-pcre2-config.c
@@ -0,0 +1,12 @@
+#include "test-tool.h"
+#include "cache.h"
+#include "grep.h"
+
+int cmd__pcre2_config(int argc, const char **argv)
+{
+   if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
+   int value = PCRE2_MATCH_INVALID_UTF;
+   return !value;
+   }
+   return 1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index ce7e89028c..e022ce0e48 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -40,6 +40,7 @@ static struct test_cmd cmds[] = {
{ "online-cpus", cmd__online_cpus },
{ "parse-options", cmd__parse_options },
{ "path-utils", cmd__path_utils },
+   { "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "prio-queue", cmd__prio_queue },
{ "reach", cmd__reach },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index f805bb39ae..acd8af2a9d 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -30,6 +30,7 @@ int cmd__oidmap(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
+int cmd__pcre2_config(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 531eb59d57..848d46e4f9 100755
--- a/t/t7812-grep-icase-non-as

[PATCH v2 1/8] grep: remove overly paranoid BUG(...) code

2019-07-26 Thread Ævar Arnfjörð Bjarmason
Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in f002532784 ("grep: print
the pcre2_jit_on value", 2019-07-22) incorrectly copy/pasted some of
it in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01).

Let's just remove this code. Being this paranoid about the
pcre2?_config() function itself being broken is crossing the line into
unreasonable paranoia.

Reported-by: Beat Bolli 
Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index 0937c5bfff..95af88cb74 100644
--- a/grep.c
+++ b/grep.c
@@ -394,14 +394,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, 
const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-   if (p->pcre1_jit_on == 1) {
+   if (p->pcre1_jit_on) {
p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
if (!p->pcre1_jit_stack)
die("Couldn't allocate PCRE JIT stack");
pcre_assign_jit_stack(p->pcre1_extra_info, NULL, 
p->pcre1_jit_stack);
-   } else if (p->pcre1_jit_on != 0) {
-   BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-   p->pcre1_jit_on);
}
 #endif
 }
@@ -510,7 +507,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
}
 
pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-   if (p->pcre2_jit_on == 1) {
+   if (p->pcre2_jit_on) {
jitret = pcre2_jit_compile(p->pcre2_pattern, 
PCRE2_JIT_COMPLETE);
if (jitret)
die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", 
p->pattern, jitret);
@@ -545,9 +542,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
if (!p->pcre2_match_context)
die("Couldn't allocate PCRE2 match context");
pcre2_jit_stack_assign(p->pcre2_match_context, NULL, 
p->pcre2_jit_stack);
-   } else if (p->pcre2_jit_on != 0) {
-   BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-   p->pcre2_jit_on);
}
 }
 
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data

2019-07-26 Thread Ævar Arnfjörð Bjarmason
Since my b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-07-01) we've been dying on invalid UTF-8 data when
grepping for fixed strings if the following are all true:

* The subject string is non-ASCII (e.g. "ævar")
* We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
* We compiled with PCRE v2
* That PCRE v2 did not have JIT support

The last of those is why this wasn't caught earlier, per pcre2jit(3):

"unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
for validity. In the interests of speed, these checks do not
happen on the JIT fast path, and if invalid data is passed, the
result is undefined."

I.e. the subject being matched against our pattern was invalid. We
were lucky and got away with it on the JIT path, but the non-JIT
one is stricter.
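
To make the failure mode concrete, here is a minimal sketch of the kind
of structural check a validating (non-JIT) path applies to the subject.
This is not PCRE's code: utf8_is_valid() is a hypothetical helper
written for illustration, and it is deliberately simplified (it does
not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of git or PCRE: returns 1 if
 * buf[0..len) is structurally valid UTF-8, 0 otherwise.
 */
static int utf8_is_valid(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	assert(buf || !len);
	while (i < len) {
		unsigned char c = buf[i];
		size_t need;	/* continuation bytes expected after c */

		if (c < 0x80)
			need = 0;	/* ASCII */
		else if ((c & 0xE0) == 0xC0)
			need = 1;	/* lead byte of a 2-byte sequence */
		else if ((c & 0xF0) == 0xE0)
			need = 2;	/* lead byte of a 3-byte sequence */
		else if ((c & 0xF8) == 0xF0)
			need = 3;	/* lead byte of a 4-byte sequence */
		else
			return 0;	/* isolated byte with 0x80 bit set */

		if (len - i - 1 < need)
			return 0;	/* sequence truncated by end of buffer */
		for (i++; need; need--, i++)
			if ((buf[i] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
	}
	return 1;
}
```

A line such as "ævar" passes this check, while a buffer that starts
with a lone 0x80 byte, like the test file this patch creates, does not.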

This patch does nothing to fix that. Instead we sneak in support for
fixed patterns starting with "(*NO_JIT)", which disables the PCRE v2
JIT with implicit fixed-string matching for testing; see
pcre2syntax(3) for the syntax.
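
The prefix detection can be sketched in isolation as below. This is an
illustration under assumptions, not the patch itself:
is_fixed_sketch() is a stand-in for git's is_fixed() with an
approximate list of regex metacharacters, takes_fixed_path() is a
hypothetical name, and the "-F" (p->fixed) case is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for git's is_fixed(): no regex metacharacters in s[0..len). */
static int is_fixed_sketch(const char *s, size_t len)
{
	/* Approximate metacharacter list, assumed for this sketch. */
	static const char special[] = "$()*+.?[\\^{|";
	size_t i;

	for (i = 0; i < len; i++)
		if (memchr(special, s[i], sizeof(special) - 1))
			return 0;
	return 1;
}

/*
 * Returns 1 if the pattern would take the fixed-string path: either it
 * is plainly fixed, or it is "(*NO_JIT)" followed by a fixed string.
 */
static int takes_fixed_path(const char *pattern, size_t len)
{
	static const char no_jit[] = "(*NO_JIT)";
	const size_t no_jit_len = sizeof(no_jit) - 1;

	assert(pattern);
	if (is_fixed_sketch(pattern, len))
		return 1;
	return len >= no_jit_len &&
	       !memcmp(pattern, no_jit, no_jit_len) &&
	       is_fixed_sketch(pattern + no_jit_len, len - no_jit_len);
}
```

With this, "(*NO_JIT)var" is routed to the fixed-string path, while
"(*NO_JIT)v.r" is still treated as a regular expression.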

This is technically a change in behavior, but it's so obscure that I
figured it was OK. We'd previously consider this an invalid regular
expression as regcomp() would die on it, now we feed it to the PCRE v2
fixed-string path. I thought this was better than introducing yet
another GIT_TEST_* environment variable.

We're also relying on a behavior of PCRE v2 that technically could
change, but I think the test coverage is worth dipping our toe into
some somewhat undefined behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c  | 10 ++
 t/t7812-grep-icase-non-ascii.sh | 28 
 2 files changed, 38 insertions(+)

diff --git a/grep.c b/grep.c
index 6d60e2e557..5bc0f4f32a 100644
--- a/grep.c
+++ b/grep.c
@@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
p->is_fixed = is_fixed(p->pattern, p->patternlen);
+#ifdef USE_LIBPCRE2
+   if (!p->fixed && !p->is_fixed) {
+  const char *no_jit = "(*NO_JIT)";
+  const int no_jit_len = strlen(no_jit);
+  if (starts_with(p->pattern, no_jit) &&
+  is_fixed(p->pattern + no_jit_len,
+   p->patternlen - no_jit_len))
+  p->is_fixed = 1;
+   }
+#endif
if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
opt->pcre2 = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 0c685d3598..96c3572056 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,4 +53,32 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 
data' '
+   printf "\\200\\n" >invalid-0x80 &&
+   echo "ævar" >expected &&
+   cat expected >>invalid-0x80 &&
+   git add invalid-0x80
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid 
UTF-8 data' '
+   git grep -h "var" invalid-0x80 >actual &&
+   test_cmp expected actual &&
+   git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
+   test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from 
invalid UTF-8 data' '
+   test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+   test_cmp expected actual &&
+   test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+   test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from 
invalid UTF-8 data with -i' '
+   test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
+   test_cmp expected actual &&
+   test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+   test_cmp expected actual
+'
+
 test_done
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix

2019-07-26 Thread Ævar Arnfjörð Bjarmason
1-3 here are a re-roll on "next". I figured that was easier for
everyone with the state of the in-flight patches, it certainly was for
me. Sorry Junio if this creates a mess for you.

4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
skip UTF8 checks explicitally" in
https://public-inbox.org/git/20190721183115.14985-1-care...@gmail.com/

As noted the bug isn't fully fixed until 8/8, and that patch relies on
unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
enough state to limp forward as noted in the rationale of those
commits.

Ævar Arnfjörð Bjarmason (8):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1
  grep: consistently use "p->fixed" in compile_regexp()
  grep: create a "is_fixed" member in "grep_pat"
  grep: stress test PCRE v2 on invalid UTF-8 data
  grep: do not enter PCRE2_UTF mode on fixed matching
  grep: optimistically use PCRE2_MATCH_INVALID_UTF

 Makefile|  1 +
 grep.c  | 68 +++--
 grep.h  | 13 ++-
 t/helper/test-pcre2-config.c| 12 ++
 t/helper/test-tool.c|  1 +
 t/helper/test-tool.h|  1 +
 t/t7812-grep-icase-non-ascii.sh | 39 +++
 7 files changed, 80 insertions(+), 55 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

-- 
2.22.0.455.g172b71a6c5



[PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()

2019-07-26 Thread Ævar Arnfjörð Bjarmason
At the start of this function we do:

p->fixed = opt->fixed;

It's less confusing to use that variable consistently than to switch
back & forth between the two.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 9c2b259771..b94e998680 100644
--- a/grep.c
+++ b/grep.c
@@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-   if (opt->fixed || pat_is_fixed) {
+   if (p->fixed || pat_is_fixed) {
 #ifdef USE_LIBPCRE2
opt->pcre2 = 1;
if (pat_is_fixed) {
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching

2019-07-26 Thread Ævar Arnfjörð Bjarmason
As discussed in the last commit, partially fix a bug introduced in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input
we'd die on e.g.:

fatal: pcre2_match failed with error code -22: UTF-8 error:
isolated byte with 0x80 bit set

This happened when grepping for a non-ASCII fixed string. It is a more
general problem that's hard to fix, but we can at least fix the most
common case of grepping for a fixed string without "-i". I can't think
of a reason why we'd turn on PCRE2_UTF when matching byte-for-byte like that.
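
As a sketch of the resulting decision, the condition in the hunk below
can be written as a standalone predicate. The function and parameter
names are illustrative stand-ins for the grep_opt/grep_pat fields, and
"fixed" here stands for (p->fixed || p->is_fixed).

```c
#include <assert.h>
#include <stdbool.h>

/* Should PCRE2_UTF be set? Mirrors the condition in the diff. */
static bool want_pcre2_utf(bool ignore_locale, bool utf8_locale,
			   bool pattern_non_ascii, bool ignore_case,
			   bool fixed)
{
	/* Case-sensitive fixed strings can be compared byte for byte. */
	bool bytewise = !ignore_case && fixed;

	return !ignore_locale && utf8_locale && pattern_non_ascii &&
	       !bytewise;
}

/* A few representative cases under a UTF-8 locale. */
static void check_cases(void)
{
	/* grep "æ": fixed, case-sensitive, so no UTF mode */
	assert(!want_pcre2_utf(false, true, true, false, true));
	/* grep -i "Æ": case folding still needs UTF mode */
	assert(want_pcre2_utf(false, true, true, true, true));
	/* a non-ASCII pattern that is a real regex */
	assert(want_pcre2_utf(false, true, true, false, false));
	/* grep "var": ASCII pattern never needed UTF mode */
	assert(!want_pcre2_utf(false, true, false, false, true));
}
```

Only the first case changes behavior with this patch; the "-i" case
still enters PCRE2_UTF mode and so can still hit the validation error.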

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c  | 3 ++-
 t/t7812-grep-icase-non-ascii.sh | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 5bc0f4f32a..c7c06ae08d 100644
--- a/grep.c
+++ b/grep.c
@@ -472,7 +472,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
}
options |= PCRE2_CASELESS;
}
-   if (!opt->ignore_locale && is_utf8_locale() && 
has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && is_utf8_locale() && 
has_non_ascii(p->pattern) &&
+   !(!opt->ignore_case && (p->fixed || p->is_fixed)))
options |= PCRE2_UTF;
 
p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 96c3572056..531eb59d57 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -68,9 +68,9 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep 
ASCII from invalid UT
 '
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from 
invalid UTF-8 data' '
-   test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+   git grep -h "æ" invalid-0x80 >actual &&
test_cmp expected actual &&
-   test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+   git grep -h "(*NO_JIT)æ" invalid-0x80 &&
test_cmp expected actual
 '
 
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1

2019-07-26 Thread Ævar Arnfjörð Bjarmason
Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux 
GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 
LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

Test                                        HEAD~             HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to'                  0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
7820.7: perl grep '^how to'                 0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
7820.11: perl grep '[how] to'               0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare'   0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'      0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

It will also implicitly re-enable UTF-8 validation for PCRE v1. As
noted in [1] we now have cases as a result where PCRE v1 is more eager
to error out. Subsequent patches will fix that for v2, and I think
it's fair to tell v1 users "just upgrade" and not worry about that
edge case for v1.

1.  
https://public-inbox.org/git/capuesphzj_uv9o1-ydpjnla_q-f7gwxz9g1gcy2pyayn8ri...@mail.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 28 +---
 grep.h |  5 -
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 4b1e917ac5..9c2b259771 100644
--- a/grep.c
+++ b/grep.c
@@ -394,12 +394,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const 
struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-   if (p->pcre1_jit_on) {
-   p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-   if (!p->pcre1_jit_stack)
-   die("Couldn't allocate PCRE JIT stack");
-   pcre_assign_jit_stack(p->pcre1_extra_info, NULL, 
p->pcre1_jit_stack);
-   }
 #endif
 }
 
@@ -411,18 +405,9 @@ static int pcre1match(struct grep_pat *p, const char 
*line, const char *eol,
if (eflags & REG_NOTBOL)
flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-   if (p->pcre1_jit_on) {
-   ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-   eol - line, 0, flags, ovector,
-   ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-   } else
-#endif
-   {
-   ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-   eol - line, 0, flags, ovector,
-   ARRAY_SIZE(ovector));
-   }
+   ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+   eol - line, 0, flags, ovector,
+   ARRAY_SIZE(ovector));
 
if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
die("pcre_exec failed with error code %d", ret);
@@ -439,14 +424,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-   if (p->pcre1_jit_on) {
+   if (p->pcre1_jit_on)
pcre_free_study(p->pcre1_extra_info);
-   pcre_jit_stack_free(p->pcre1_jit_stack);
-   } else
+   else
 #endif
-   {
pcre_free(p->pcre1_extra_info);
-   }
pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index 4d8e300175..ce2d72571f 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -85,7 +81,6 @@ struct gre

[PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat"

2019-07-26 Thread Ævar Arnfjörð Bjarmason
This change paves the way for later using this value in the regex
compile functions themselves.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 7 +++
 grep.h | 1 +
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index b94e998680..6d60e2e557 100644
--- a/grep.c
+++ b/grep.c
@@ -606,7 +606,6 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
 {
int err;
int regflags = REG_NEWLINE;
-   int pat_is_fixed;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
@@ -615,11 +614,11 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
-   pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-   if (p->fixed || pat_is_fixed) {
+   p->is_fixed = is_fixed(p->pattern, p->patternlen);
+   if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
opt->pcre2 = 1;
-   if (pat_is_fixed) {
+   if (p->is_fixed) {
compile_pcre2_pattern(p, opt);
} else {
/*
diff --git a/grep.h b/grep.h
index ce2d72571f..c0c71eb4a9 100644
--- a/grep.h
+++ b/grep.h
@@ -88,6 +88,7 @@ struct grep_pat {
pcre2_compile_context *pcre2_compile_context;
uint32_t pcre2_jit_on;
unsigned fixed:1;
+   unsigned is_fixed:1;
unsigned ignore_case:1;
unsigned word_regexp:1;
 };
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2

2019-07-26 Thread Ævar Arnfjörð Bjarmason
As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2]. The reason I allocated a custom
stack at all was this advice in pcrejit(3) (same in pcre2jit(3)):

"By default, it uses 32KiB on the machine stack. However, some
large or complicated patterns need more than this"

Since we haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux 
GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 
LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

Test                                        HEAD~             HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to'                  0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
7820.7: perl grep '^how to'                 0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
7820.11: perl grep '[how] to'               0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare'   0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'      0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

As noted in [3] there are known regexes that will fail with the lower
stack limit. The way GNU grep fixed it is interesting, although I
believe the implementation is overly verbose: they could make PCRE v2
handle that gradual re-allocation, which is what min/max memory is
for.
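
To illustrate what a start/maximum pair buys, here is a hypothetical
model of a stack that grows on demand. It is neither PCRE v2's nor GNU
grep's implementation, the names are made up, the doubling policy is an
assumption, and it only tracks sizes rather than allocating memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: tracks sizes only, allocates nothing. */
struct growable_stack {
	size_t size;	/* current size in bytes */
	size_t max;	/* hard upper limit in bytes */
};

/*
 * Grow until "needed" bytes fit, doubling each time and clamping to
 * the maximum. Returns 0 on success, or -1 if the request exceeds the
 * maximum (the analogue of a "stack limit" error), in which case the
 * stack is left unchanged.
 */
static int stack_reserve(struct growable_stack *st, size_t needed)
{
	assert(st && st->size > 0 && st->size <= st->max);
	if (needed > st->max)
		return -1;
	while (st->size < needed) {
		size_t doubled = st->size * 2;

		st->size = doubled > st->max ? st->max : doubled;
	}
	return 0;
}
```

Starting at 32 KB with a 1 MB maximum, a simple pattern never grows the
stack, and only a pattern needing more than 1 MB fails.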

So we might end up bringing this back, but I'm more inclined to just kick
such cases upstairs to the PCRE maintainers as a bug; perhaps they'll add
some overall "just allocate more then" flag to make this easier. In
any case there's no functional change here since we didn't have a custom
stack, so let's apply this first. We can always revert it later.

1. https://public-inbox.org/git/20190721194052.15440-1-care...@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.
3. 
https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bzxktfplgkgiu34htjklhm-cmee04f...@mail.gmail.com/

Reported-by: Carlo Marcelo Arenas Belón 
Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 10 --
 grep.h |  4 
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index 95af88cb74..4b1e917ac5 100644
--- a/grep.c
+++ b/grep.c
@@ -534,14 +534,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
const struct grep_opt *opt
p->pcre2_jit_on = 0;
return;
}
-
-   p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, 
NULL);
-   if (!p->pcre2_jit_stack)
-   die("Couldn't allocate PCRE2 JIT stack");
-   p->pcre2_match_context = pcre2_match_context_create(NULL);
-   if (!p->pcre2_match_context)
-   die("Couldn't allocate PCRE2 match context");
-   pcre2_jit_stack_assign(p->pcre2_match_context, NULL, 
p->pcre2_jit_stack);
}
 }
 
@@ -585,8 +577,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
pcre2_compile_context_free(p->pcre2_compile_context);
pcre2_code_free(p->pcre2_pattern);
pcre2_match_data_free(p->pcre2_match_data);
-   pcre2_jit_stack_free(p->pcre2_jit_stack);
-   pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt 
*opt)
diff --git a/grep.h b/grep.h
index d35a137fcb..4d8e

Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 6:50 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>> On Fri, Jul 26 2019, Carlo Arenas wrote:
>>
>> > since this moves PCRE1 out of the JIT fast path,
>>
>> I think you're mostly replying to the wrong thread. None of the patches
>> I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
>> stack is resized, and for v2 some dead code removed.
>
> I didn't mean JIT was disabled, but that we are calling now the regular
> PCRE1 function which does UTF-8 validation (unlike the one used before)
>
>> > introduces the regression where git grep will abort if there is binary
>> > data or non UTF-8 text in the repository/log and should be IMHO hold
>> > out until a fix for that can be merged.
>>
>> You're talking about the kwset series, not this cleanup series.
>
> a combination of both (as seen in pu) and that will also happen in next if
> this series get merged there.
>
> before this cleanup series, a git compiled against PCRE1 and not using
> NO_LIBPCRE1_JIT will use the jit fast path function and therefore would
> have no problems with binary or non UTF-8 content in the repository, but
> will regress after.

I see. Yes you're right, I misread pcrejit(3) about how the "fast path
API" worked (or more accurately, misremembered). Yes, this is now a new
caveat.

I have some patches on top of next I'm about to send that hopefully make
this whole thing less of a mess.


Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1

2019-07-26 Thread Ævar Arnfjörð Bjarmason


On Fri, Jul 26 2019, Carlo Arenas wrote:

> since this moves PCRE1 out of the JIT fast path,

I think you're mostly replying to the wrong thread. None of the patches
I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
stack is resized, and for v2 some dead code removed.

> introduces the regression where git grep will abort if there is binary
> data or non UTF-8 text in the repository/log and should be IMHO hold
> out until a fix for that can be merged.

You're talking about the kwset series, not this cleanup series.

> this also needs additional changes to better support NO_LIBPCRE1_JIT,
> patch to follow

Looking forward to it, thanks!


Re: [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2

2019-07-24 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> Since we haven't had any reports of users running into
>> PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
>> that we can just use the library defaults instead and drop this
>> code.
>
> Does everybody use pcre2 with JIT with Git these days, or only those
> who want to live near the bleeding edge?

My informal survey of various package recipes suggests that all the big
*nix distros are using it by default now, so we have a lot of users in
the wild, including in the just-released Debian stable.

So I'm confident that if there were issues with e.g. it dying on
patterns in practical use we'd have heard about them.

>> This won't change with the wider use of PCRE v2 in
>> ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
>> fixed string search is not a "large or complicated pattern".
>
> In any case, if we were not "using" the custom stack anyway for v2,
> this change does not hurt anybody, possibly other than those who
> will learn about pcre2 support by reading this message and experiments
> with larger patterns.  And it should be simple to wire it back if it
> becomes necessary later.

*nod*


Re: [PATCH 0/3] grep: PCRE JIT fixes

2019-07-24 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> There's a couple of patches fixing mistakes in the JIT code I added
>> for PCRE in <20190722181923.21572-1-dev+...@drbeat.li> and
>> <20190721194052.15440-1-care...@gmail.com>
>>
>> This small series proposes to replace both of those. In both cases I
>> think we're better off just removing the relevant code. The commit
>> messages for the patches themselves make the case for that.
>
> I am not sure about the BUG() that practically never triggered so
> far (AFAICT, the check that guards the BUG() would trigger only if
> we later introduced a bug, calling the code to compile when we are
> not asked to do so)---wouldn't it be better to leave it in while
> there still are people who are touching the vicinity?

The BUG() in 1/3 is just checking if pcre2?_config() returns a boolean
when promised, so it amounts to black-box testing of that library.

I think code in that style is overly paranoid and verbose, it's
reasonable to just trust the library in that case.

I think the reason it ended up in the codebase in the first place was
converting some first-draft implementation I wrote where I was being
more paranoid about using the PCRE API as a black box.

> The other two I am perfectly OK with.  It is easy to resurrect the
> support for v1 (which may not even be needed for long) and resurrect
> the support for v2 with Carlo's fix, if it later turns out that some
> users may need to use a more complex pattern.
>
> Thanks.
>
>> Ævar Arnfjörð Bjarmason (3):
>>   grep: remove overly paranoid BUG(...) code
>>   grep: stop "using" a custom JIT stack with PCRE v2
>>   grep: stop using a custom JIT stack with PCRE v1
>>
>>  grep.c | 46 ++
>>  grep.h |  9 -
>>  2 files changed, 6 insertions(+), 49 deletions(-)


Re: [PATCH] grep: skip UTF8 checks explicitally

2019-07-24 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>>  wrote:
>> >
>> > So when PCRE2 complains about the top two bits not being 0x80, it fails
>> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>> > indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls
>> the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but
>> that it is not valid UTF-8, like all the png and a few txt files that
>> are not encoded as
>> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>>
>> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>> > does not happen here... But then, I don't need the `-I` option, and my
>> > output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you
>> should be able to reproduce if using a non JIT enabled PCRE2,
>> regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because
>> AFAIK there are no test cases using broken UTF-8 (the ones with binary data 
>> are
>> actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

If you know that your pattern is a valid UTF string, and you want to
skip this check for performance reasons, you can set the
PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
invalid UTF string as a pattern is undefined. It may cause your
program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it is based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are, I'm willing to bet this ruins your day, e.g. PCRE would
"skip ahead" over a 4-byte character because it sees a telltale
U+10000 through U+10FFFF start sequence, except that wasn't a character,
it was some arbitrary binary.
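
A minimal sketch of that "skip ahead" hazard, assuming a decoder that
trusts its input the way PCRE2_NO_UTF_CHECK asks the engine to. Both
functions are hypothetical helpers written for this illustration, not
PCRE internals.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a lead byte; 1 for everything else. */
static size_t utf8_lead_len(unsigned char c)
{
	if ((c & 0xF8) == 0xF0)
		return 4;	/* claims to start U+10000..U+10FFFF */
	if ((c & 0xF0) == 0xE0)
		return 3;
	if ((c & 0xE0) == 0xC0)
		return 2;
	return 1;
}

/*
 * Count "characters" without validating continuation bytes, as a
 * decoder that was promised valid UTF-8 might.
 */
static size_t count_chars_unchecked(const unsigned char *buf, size_t len)
{
	size_t i = 0, n = 0;

	assert(buf || !len);
	while (i < len) {
		i += utf8_lead_len(buf[i]);	/* may jump over real data */
		n++;
	}
	return n;
}
```

Given the four bytes 0xF0 'f' 'o' 'o', this counts a single character:
the stray 0xF0 swallows "foo", so a match starting at 'f' would never
be considered. A real engine may also read past the end of the buffer
in this situation, which this sketch does not model.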

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
reported here. I.e. there's no reason I can think of for why a
fixed-string pattern should need PCRE2_UTF set when not combined
with --ignore-case. We can just not do that, but maybe I'm missing
something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
does. I was thinking something closer to just carrying two compiled
patterns, and falling back on the ~PCRE2_UTF one if we get a
PCRE2_ERROR_UTF8_* error.
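
The two-pattern fallback in 2) could look roughly like the sketch
below. Everything here is an assumption for illustration: the result
codes, struct dual_pattern and the toy matchers are stand-ins for two
compiled PCRE v2 patterns, and no PCRE API is called.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes; real code would check PCRE2's errors. */
enum { MATCH_UTF_ERROR = -1, MATCH_NONE = 0, MATCH_OK = 1 };

typedef int (*match_fn)(const char *subject, size_t len);

/* One pattern compiled twice: once in UTF mode, once in byte mode. */
struct dual_pattern {
	match_fn utf;
	match_fn bytes;
};

/* Toy byte-mode matcher: looks for the single byte 'v'. */
static int toy_match_bytes(const char *subject, size_t len)
{
	return memchr(subject, 'v', len) ? MATCH_OK : MATCH_NONE;
}

/*
 * Toy UTF-mode matcher: refuses any subject containing a 0x80 byte,
 * a crude stand-in for real UTF-8 validation.
 */
static int toy_match_utf(const char *subject, size_t len)
{
	if (memchr(subject, 0x80, len))
		return MATCH_UTF_ERROR;
	return toy_match_bytes(subject, len);
}

/* Try the UTF pattern first; retry in byte mode on a UTF error. */
static int match_with_fallback(const struct dual_pattern *p,
			       const char *subject, size_t len)
{
	int ret;

	assert(p && p->utf && p->bytes);
	ret = p->utf(subject, len);
	if (ret == MATCH_UTF_ERROR)
		ret = p->bytes(subject, len);
	return ret;
}
```

The cost of this design is a second compiled pattern per grep_pat and a
second match attempt on every invalid subject, which is part of why the
PCRE2_MATCH_INVALID_UTF approach is attractive.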

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.


[PATCH 0/3] grep: PCRE JIT fixes

2019-07-24 Thread Ævar Arnfjörð Bjarmason
There's a couple of patches fixing mistakes in the JIT code I added
for PCRE in <20190722181923.21572-1-dev+...@drbeat.li> and
<20190721194052.15440-1-care...@gmail.com>

This small series proposes to replace both of those. In both cases I
think we're better off just removing the relevant code. The commit
messages for the patches themselves make the case for that.

Ævar Arnfjörð Bjarmason (3):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1

 grep.c | 46 ++
 grep.h |  9 -
 2 files changed, 6 insertions(+), 49 deletions(-)

-- 
2.22.0.455.g172b71a6c5



[PATCH 1/3] grep: remove overly paranoid BUG(...) code

2019-07-24 Thread Ævar Arnfjörð Bjarmason
Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in [1] incorrectly
copy/pasted some of it in 94da9193a6 ("grep: add support for PCRE v2",
2017-06-01).

Let's just remove it instead of fixing that bug. Being this paranoid
about what PCRE returns is crossing the line into unreasonable
paranoia.

1. https://public-inbox.org/git/20190722181923.21572-1-dev+...@drbeat.li/

Reported-by:  Beat Bolli 
Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..be4282fef3 100644
--- a/grep.c
+++ b/grep.c
@@ -406,14 +406,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, 
const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-   if (p->pcre1_jit_on == 1) {
+   if (p->pcre1_jit_on) {
p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
if (!p->pcre1_jit_stack)
die("Couldn't allocate PCRE JIT stack");
pcre_assign_jit_stack(p->pcre1_extra_info, NULL, 
p->pcre1_jit_stack);
-   } else if (p->pcre1_jit_on != 0) {
-   BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-   p->pcre1_jit_on);
}
 #endif
 }
@@ -522,7 +519,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
}
 
pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-   if (p->pcre2_jit_on == 1) {
+   if (p->pcre2_jit_on) {
jitret = pcre2_jit_compile(p->pcre2_pattern, 
PCRE2_JIT_COMPLETE);
if (jitret)
die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", 
p->pattern, jitret);
@@ -557,9 +554,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const 
struct grep_opt *opt
if (!p->pcre2_match_context)
die("Couldn't allocate PCRE2 match context");
pcre2_jit_stack_assign(p->pcre2_match_context, NULL, 
p->pcre2_jit_stack);
-   } else if (p->pcre2_jit_on != 0) {
-   BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-   p->pcre1_jit_on);
}
 }
 
-- 
2.22.0.455.g172b71a6c5



[PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1

2019-07-24 Thread Ævar Arnfjörð Bjarmason
Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux 
GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 
LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

Test                                        HEAD~             HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to'                  0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
7820.7: perl grep '^how to'                 0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
7820.11: perl grep '[how] to'               0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare'   0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'      0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 28 +---
 grep.h |  5 -
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 20ce95270a..6b52fed53a 100644
--- a/grep.c
+++ b/grep.c
@@ -406,12 +406,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const 
struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-   if (p->pcre1_jit_on) {
-   p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-   if (!p->pcre1_jit_stack)
-   die("Couldn't allocate PCRE JIT stack");
-   pcre_assign_jit_stack(p->pcre1_extra_info, NULL, 
p->pcre1_jit_stack);
-   }
 #endif
 }
 
@@ -423,18 +417,9 @@ static int pcre1match(struct grep_pat *p, const char 
*line, const char *eol,
if (eflags & REG_NOTBOL)
flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-   if (p->pcre1_jit_on) {
-   ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-   eol - line, 0, flags, ovector,
-   ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-   } else
-#endif
-   {
-   ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-   eol - line, 0, flags, ovector,
-   ARRAY_SIZE(ovector));
-   }
+   ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+   eol - line, 0, flags, ovector,
+   ARRAY_SIZE(ovector));
 
if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
die("pcre_exec failed with error code %d", ret);
@@ -451,14 +436,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-   if (p->pcre1_jit_on) {
+   if (p->pcre1_jit_on)
pcre_free_study(p->pcre1_extra_info);
-   pcre_jit_stack_free(p->pcre1_jit_stack);
-   } else
+   else
 #endif
-   {
pcre_free(p->pcre1_extra_info);
-   }
pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index a65f4a1ae1..a405fc870c 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -86,7 +82,6 @@ struct grep_pat {
regex_t regexp;
pcre *pcre1_regexp;
pcre_extra *pcre1_extra_info;
-   pcre_jit_stack *pcre1_jit_stack;
const unsigned char *pcre1_tables;
int pcre1_jit_on;
pcre2_code *pcre2_pattern;
-- 
2.22.0.455.g172b71a6c5



[PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2

2019-07-24 Thread Ævar Arnfjörð Bjarmason
As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB never took effect; we always used the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2]. The reason I allocated a custom
stack at all was this advice in pcrejit(3) (same in pcre2jit(3)):

"By default, it uses 32KiB on the machine stack. However, some
large or complicated patterns need more than this"

Since we haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15); a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux \
    GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' \
    ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

    Test                                          HEAD~             HEAD
    ---------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                    0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
    7820.7: perl grep '^how to'                   0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
    7820.11: perl grep '[how] to'                 0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'     0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'        0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

1. https://public-inbox.org/git/20190721194052.15440-1-care...@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.

Reported-by: Carlo Marcelo Arenas Belón 
Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 10 --
 grep.h |  4 
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index be4282fef3..20ce95270a 100644
--- a/grep.c
+++ b/grep.c
@@ -546,14 +546,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
const struct grep_opt *opt
p->pcre2_jit_on = 0;
return;
}
-
-   p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, 
NULL);
-   if (!p->pcre2_jit_stack)
-   die("Couldn't allocate PCRE2 JIT stack");
-   p->pcre2_match_context = pcre2_match_context_create(NULL);
-   if (!p->pcre2_match_context)
-   die("Couldn't allocate PCRE2 match context");
-   pcre2_jit_stack_assign(p->pcre2_match_context, NULL, 
p->pcre2_jit_stack);
}
 }
 
@@ -597,8 +589,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
pcre2_compile_context_free(p->pcre2_compile_context);
pcre2_code_free(p->pcre2_pattern);
pcre2_match_data_free(p->pcre2_match_data);
-   pcre2_jit_stack_free(p->pcre2_jit_stack);
-   pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt 
*opt)
diff --git a/grep.h b/grep.h
index 1875880f37..a65f4a1ae1 100644
--- a/grep.h
+++ b/grep.h
@@ -29,8 +29,6 @@ typedef int pcre_jit_stack;
 typedef int pcre2_code;
 typedef int pcre2_match_data;
 typedef int pcre2_compile_context;
-typedef int pcre2_match_context;
-typedef int pcre2_jit_stack;
 #endif
 #include "kwset.h"
 #include "thread-utils.h"
@@ -94,8 +92,6 @@ struct grep_pat {
pcre2_code *pcre2_pattern;
pcre2_match_data *pcre2_match_data;
pcre2_compile_context *pcre2_compile_context;
-   pcre2_match_context *pcre2_match_context;
-   pcre2_jit_stack *pcre2_jit_stack;
uint32_t pcre2_jit_on;
kwset_t kws;
unsigned fixed:1;
-- 
2.22.0.455.g172b71a6c5



Re: [PATCH] grep: skip UTF8 checks explicitally

2019-07-22 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 22 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Sun, 21 Jul 2019, Carlo Marcelo Arenas Belón wrote:
>
>> Usually PCRE is compiled with JIT support, and therefore the code
>> path used includes calling pcre2_jit_match (for PCRE2), that ignores
>> invalid UTF-8 in the corpus.
>>
>> Make that option explicit so it can be also used when JIT is not
>> enabled and pcre2_match is called instead, preventing `git grep`
>> to abort when hitting the first binary blob in a fixed match
>> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
>
> Good idea.
>
> The flag has been in PCRE1 since at least March 5, 2007, when the
> pcre.h.in file was first recorded in their Subversion repository:
> https://vcs.pcre.org/pcre/code/trunk/pcre.h.in?view=log
>
> It also was part of PCRE2 from the first revision (rev 4, in fact, where
> pcre2.h.in was added):
> https://vcs.pcre.org/pcre2/code/trunk/src/pcre2.h.in?view=log

Thanks for digging, that portability indeed sounds just fine.

> So I am fine with this patch.

I'm not, not because I dislike the approach. I haven't made up my mind
yet.

I stopped paying attention to this error-with-not-JIT discussion when I
heard that some other series going into next for Windows fixed that
issue[1]

But now we have it again in some form? My ab/no-kwset has a lot of tests
for encodings & locales combined with grep, don't some of those trigger
this? If so we should make any such failure a test & part of this patch.

Right now we don't have the info of whether we're really using the JIT
or not, but that would be easy to add to grep's --debug mode for use in
a test prereq.

As noted in [2] I'd be inclined to go the other way. If we indeed have
some cases where PCRE skips its own checks, does not dying actually give
us anything useful? I'd think not, so just ignoring the issue seems like
the wrong thing to do.

Surely we're not producing useful grep results at that point, so just
not dying and mysteriously returning either nothing or garbage isn't
going to help much...
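
To make "invalid UTF-8 in the corpus" concrete, here is a rough sketch of a
strict well-formedness check. It is only an illustration: it is not the check
PCRE itself performs, and is_valid_utf8() is a made-up name, not something in
grep.c.

```c
#include <stddef.h>

/*
 * Illustrative only: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. This is NOT PCRE's own validation code.
 */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */
        size_t k;

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= need)
            return 0; /* sequence is truncated */

        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```

A binary blob will typically fail a check like this within the first few
bytes, which is the situation where the matcher has to either die or carry on
with input it was promised would be UTF-8.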

1. https://public-inbox.org/git/xmqq4l3wxk8j@gitster-ct.c.googlers.com/
2. https://public-inbox.org/git/87pnms7kv0@evledraar.gmail.com/

> Thanks,
> Dscho
>
>> ---
>>  grep.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index fc0ed73ef3..146093f590 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, 
>> const struct grep_opt *opt)
>>  static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>>  regmatch_t *match, int eflags)
>>  {
>> -int ovector[30], ret, flags = 0;
>> +int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
>>
>>  if (eflags & REG_NOTBOL)
>>  flags |= PCRE_NOTBOL;
>> @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
>> const struct grep_opt *opt
>>  static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>>  regmatch_t *match, int eflags)
>>  {
>> -int ret, flags = 0;
>> +int ret, flags = PCRE2_NO_UTF_CHECK;
>>  PCRE2_SIZE *ovector;
>>  PCRE2_UCHAR errbuf[256];
>>
>> --
>> 2.22.0
>>
>>


Re: [PATCH v2 1/1] gettext: always use UTF-8 on native Windows

2019-07-04 Thread Ævar Arnfjörð Bjarmason


On Wed, Jul 03 2019, Karsten Blees via GitGitGadget wrote:

> From: Karsten Blees 
>
> On native Windows, Git exclusively uses UTF-8 for console output (both
> with MinTTY and native Win32 Console). Gettext uses `setlocale()` to
> determine the output encoding for translated text, however, MSVCRT's
> `setlocale()` does not support UTF-8. As a result, translated text is
> encoded in system encoding (as per `GetACP()`), and non-ASCII chars are
> mangled in console output.
>
> Side note: There is actually a code page for UTF-8: 65001. In practice,
> it does not work as expected at least on Windows 7, though, so we cannot
> use it in Git. Besides, if we overrode the code page, any process
> spawned from Git would inherit that code page (as opposed to the code
> page configured for the current user), which would quite possibly break
> e.g. diff or merge helpers. So we really cannot override the code page.
>
> In `init_gettext_charset()`, Git calls gettext's
> `bind_textdomain_codeset()` with the character set obtained via
> `locale_charset()`; Let's override that latter function to force the
> encoding to UTF-8 on native Windows.
>
> In Git for Windows' SDK, there is a `libcharset.h` and therefore we
> define `HAVE_LIBCHARSET_H` in the MINGW-specific section in
> `config.mak.uname`, therefore we need to add the override before that
> conditionally-compiled code block.
>
> Rather than simply defining `locale_charset()` to return the string
> `"UTF-8"`, though, we are careful not to break `LC_ALL=C`: the
> `ab/no-kwset` patch series, for example, needs to have a way to prevent
> Git from expecting UTF-8-encoded input.

It's not just the ab/no-kwset series I have cooking (though I'm happy to
have this take that into account); anything grep-like is also usually
much faster with LC_ALL=C. Isn't that also the case on Windows? Setting
locales affects a large variety of libc functions and third party
libraries (e.g. PCRE via us setting "use UTF-8" under locale).

> Signed-off-by: Karsten Blees 
> Signed-off-by: Johannes Schindelin 
> ---
>  gettext.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/gettext.c b/gettext.c
> index d4021d690c..3f2aca5c3b 100644
> --- a/gettext.c
> +++ b/gettext.c
> @@ -12,7 +12,25 @@
>  #ifndef NO_GETTEXT
>  #include 
>  #include 
> -#ifdef HAVE_LIBCHARSET_H
> +#ifdef GIT_WINDOWS_NATIVE
> +
> +static const char *locale_charset(void)
> +{
> + const char *env = getenv("LC_ALL"), *dot;
> +
> + if (!env || !*env)
> + env = getenv("LC_CTYPE");
> + if (!env || !*env)
> + env = getenv("LANG");
> +
> + if (!env)
> + return "UTF-8";
> +
> + dot = strchr(env, '.');
> + return !dot ? env : dot + 1;
> +}
> +
> +#elif defined HAVE_LIBCHARSET_H
>  #include 
>  #else
>  #include 

I'll take it on faith that this is what the locale_charset() should look
like.
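
For anyone else reading along, this is how I understand the selection logic
in the quoted function. It's a sketch with the same steps, except that the
three values are passed in as arguments (NULL meaning "unset") rather than
read with getenv(), so it can be tried in isolation. charset_from_env() is a
name I made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Same steps as the quoted locale_charset(): prefer LC_ALL, then
 * LC_CTYPE, then LANG; default to "UTF-8" when nothing is set;
 * otherwise return what follows the '.' in the locale name, or the
 * whole name if there is no '.'.
 */
const char *charset_from_env(const char *lc_all, const char *lc_ctype,
                             const char *lang)
{
    const char *env = lc_all, *dot;

    if (!env || !*env)
        env = lc_ctype;
    if (!env || !*env)
        env = lang;

    if (!env)
        return "UTF-8";

    dot = strchr(env, '.');
    return !dot ? env : dot + 1;
}
```

With LC_ALL=C this returns "C" rather than "UTF-8", which is the part that
keeps LC_ALL=C meaningful.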

I wonder if it wouldn't be better to always compile this function, and
just have init_gettext_charset() switch between the two. We've moved
more towards that sort of thing (e.g. with pthreads). I.e. prefer
redundant compilation to ifdefing platform-only code (which then only
gets compiled there). See "HAVE_THREADS" in the code.
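
Roughly what I have in mind, as a sketch only. USE_ENV_CHARSET,
env_charset(), system_charset() and pick_charset() are all hypothetical names,
and system_charset() is a placeholder for whatever libcharset.h or langinfo.h
would report:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 0/1 constant, in the spirit of HAVE_THREADS. */
#ifdef GIT_WINDOWS_NATIVE
#define USE_ENV_CHARSET 1
#else
#define USE_ENV_CHARSET 0
#endif

/* Always compiled: charset part of a locale name such as "en_US.UTF-8". */
const char *env_charset(const char *locale_name)
{
    const char *dot;

    if (!locale_name)
        return "UTF-8";
    dot = strchr(locale_name, '.');
    return dot ? dot + 1 : locale_name;
}

/* Placeholder for the platform's own charset lookup. */
const char *system_charset(void)
{
    return "placeholder-charset";
}

/* The platform choice is an ordinary branch, not an #ifdef'd body. */
const char *pick_charset(const char *locale_name)
{
    if (USE_ENV_CHARSET)
        return env_charset(locale_name);
    return system_charset();
}
```

Both branches get compiled everywhere, so a typo in the Windows-only path
would show up on any platform's build.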

It looks to me that with this patch the HAVE_LIBCHARSET_H docs in
"Makefile" become wrong. Shouldn't those be updated too?

We also still pass -DHAVE_LIBCHARSET_H to every file we compile, only to
never use it under GIT_WINDOWS_NATIVE, but perhaps fixing that isn't
possible with GIT_WINDOWS_NATIVE being a macro, and perhaps I've again
gotten the "native" vs. "mingw" etc. relationship wrong in my head and
the HAVE_LIBCHARSET_H docs are fine.

It just seems wrong that we have both the configure script &
config.mak.uname look for / declare that we have libcharset.h, only to
at this late point not use libcharset.h at all. Couldn't we just know if
GIT_WINDOWS_NATIVE will be true earlier & move that check up, so it &
HAVE_LIBCHARSET_H can be mutually exclusive (with accompanying #error if
we have both)?


Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2

2019-07-02 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 01 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason   writes:
>
>> This v3 has a new patch (3/10) that I believe fixes the regression on
>> MinGW Johannes noted in
>> https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/
>>
>> As noted in the updated commit message in 10/10 I believe just
>> skipping this test & documenting this in a commit message is the least
>> amount of suck for now. It's really an existing issue with us doing
>> nothing sensible when the log/grep haystack encoding doesn't match the
>> needle encoding supplied via the command line.
>
> Is that quite the case?  If they do not match, not finding the match
> is the right answer, because we are byte-for-byte matching/searching
> IIUC.
>
>> We swept that under the carpet with the kwset backend, but PCRE v2
>> exposes it.
>
> Is it exposing, or just showing the limitation of the rewritten
> implementation where it cannot do byte-for-byte matching/searching
> as we used to be able to?
>
> Without having a way to know what encoding is used on the command
> line, there is no sensible way to reencode them to match the
> haystack encoding (even when it is known), so "you got to feed the
> strings in the same encoding, as we are going to match/search
> byte-for-byte" is the only sensible way to work, given the design
> space, I would think.
>
> Not that it is all that useful to be able to match/search
> byte-for-byte, of course, so I am OK if we punt with these tests,
> but I'd prefer to see us admit we are punting when we do ;-).

I'm guilty as charged in punting on this larger encoding issue. As it
pertains to this patch series it unearths an obscure case I think nobody
cares about in practice, and I'd like to move on with the "remove kwset"
optimization.

But I strongly believe that the new behavior with the PCRE v2
optimization is the only sane thing to do, and to the extent we have
anything left to do (#leftoverbits) it's that we should modify git more
generally (aside from string searching) to do the same thing where
appropriate.

Remember, this only happens if the user has set a UTF-8 locale and thus
promised that they're going to give us UTF-8. We then take that promise
and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because
then "æ" won't match "Æ" under --ignore-case, and there are other cases
like that with matching word boundaries & other Unicode gotchas.

The best that can be hoped for at that point is some "loose UTF-8"
mode. I see both perl & GNU grep seem to support that (although I'm sure
it falls apart at some point). GNU grep will also die in the same way
that we now die with --perl-regexp (since it also uses PCRE).

I think that's saner, if the user thinks they're feeding us UTF-8 but
they're not I think they'd like to know rather than having the string
matching library fall back.
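
To illustrate the "æ" vs. "Æ" point, here's a small sketch of what a
byte-oriented case-insensitive comparison does. It's a simplified stand-in,
not kwset or any code in grep.c, and the function names are made up:

```c
#include <stddef.h>

/* Fold only ASCII A-Z, as a matcher with no Unicode tables would. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Returns 1 if the first len bytes of a and b are equal ignoring ASCII case. */
int bytes_equal_icase(const char *a, const char *b, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (fold_ascii((unsigned char)a[i]) !=
            fold_ascii((unsigned char)b[i]))
            return 0;
    }
    return 1;
}
```

In UTF-8 "æ" is the bytes C3 A6 and "Æ" is C3 86. Neither second byte is an
ASCII letter, so a comparison like this treats them as different, while
"INT" and "int" compare equal.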


Re: [PATCH v3 1/3] repo-settings: create core.featureAdoptionRate setting

2019-07-02 Thread Ævar Arnfjörð Bjarmason


On Tue, Jul 02 2019, Duy Nguyen wrote:

> On Mon, Jul 1, 2019 at 10:32 PM Derrick Stolee via GitGitGadget
>  wrote:
>> @@ -601,3 +602,22 @@ core.abbrev::
>> in your repository, which hopefully is enough for
>> abbreviated object names to stay unique for some time.
>> The minimum length is 4.
>> +
>> +core.featureAdoptionRate::
>> +   Set an integer value on a scale from 0 to 10 describing your
>> +   desire to adopt new performance features. Defaults to 0. As
>> +   the value increases, features are enabled by changing the
>> +   default values of other config settings. If a config variable
>> +   is specified explicitly, the explicit value will override these
>> +   defaults:
>
> This is because I'd like to keep core.* from growing too big (it's
> already big), hard to read, search and maintain. Perhaps this should
> belong to a separate group? Something like tuning.something or
> defaults.something.

The main thing users look at is "man git-config" (or its web rendering)
which renders it all in one page anyway.

I think in general adding more things to core.* sucks less than
explaining the special-case that "tuning.*" isn't a config for
git-tuning(1) (although we have some of that already, e.g. with
trace2.*).

Documentation/config/core.txt is ~600 lines. Maybe it would be a good
idea to split it up, similar to your split of
Documentation/config/*.txt, but let's not conflate how we'd like to
maintain stuff in git.git with a config interface we expose externally.

It's going to be very confusing for users if some settings that
otherwise would be in core aren't there because a file in git.git was
"too big" at the time. Users (mostly) aren't going to know/care in what
chronological order we added config keys.


Re: [PATCH v2 1/3] repo-settings: create core.featureAdoptionRate setting

2019-07-02 Thread Ævar Arnfjörð Bjarmason


On Wed, Jun 19 2019, Derrick Stolee via GitGitGadget wrote:

>  core.commitGraph::
>   If true, then git will read the commit-graph file (if it exists)
> - to parse the graph structure of commits. Defaults to false. See
> + to parse the graph structure of commits. Defaults to false, unless
> + `core.featureAdoptionRate` is at least three. See
>   linkgit:git-commit-graph[1] for more information.
>
>  core.useReplaceRefs::
> @@ -601,3 +602,21 @@ core.abbrev::
>   in your repository, which hopefully is enough for
>   abbreviated object names to stay unique for some time.
>   The minimum length is 4.
> +
> +core.featureAdoptionRate::
> + Set an integer value on a scale from 0 to 10 describing your
> + desire to adopt new performance features. Defaults to 0. As
> + the value increases, features are enabled by changing the
> + default values of other config settings. If a config variable
> + is specified explicitly, the explicit value will override these
> + defaults:
> ++
> +If the value is at least 3, then the following defaults are modified.
> +These represent relatively new features that have existed for multiple
> +major releases, and present significant performance benefits. They do
> +not modify the user-facing output of porcelain commands.
> ++
> +* `core.commitGraph=true` enables reading commit-graph files.
> ++
> +* `gc.writeCommitGraph=true` eneables writing commit-graph files during
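
For reference, my reading of the override rule in the quoted documentation
amounts to something like the following. This is only a sketch:
effective_commit_graph() is a hypothetical helper, and the -1 "unset"
convention is an assumption, not how the patch implements it.

```c
/*
 * explicit_value: -1 if core.commitGraph is not set in any config
 * file (assumed convention), otherwise 0 or 1.
 * adoption_rate: the core.featureAdoptionRate value, 0 to 10.
 */
int effective_commit_graph(int explicit_value, int adoption_rate)
{
    if (explicit_value >= 0)
        return explicit_value;  /* explicit config always wins */
    return adoption_rate >= 3;  /* otherwise the rate picks the default */
}
```

So core.commitGraph=false with a rate of 10 still means "off", and an unset
core.commitGraph with a rate of 3 means "on".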

I barked up a similar tree in
https://public-inbox.org/git/cacbzzx5sbyo5fvptk6lw1ff96nr5591rhhc-5wdjw-fmg1r...@mail.gmail.com/

I wonder if you've seen that & what you think about that
approach. I.e. have a core.version=2.28 (or core.version=+6) or whatever
to opt-in to features we'd make default in 2.28. Would that be your
core.featureAdoptionRate=6 (28-22 = 6)?

I admit that question is partly rhetorical, because I think it suggests
how hard it would be for users to reason about this.

The "core.version" idea also sucks, but at least it's bound to our
advertised version number, so it's obvious if you set it to e.g. +2 what
feature track you're on, and furthermore when we'd commit to making that
the default for users who don't set core.version (although we could of
course always change our minds...). It's also something that mirrors how
e.g. Perl, C compilers (with --std=*) treat this sort of thing.
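
As a sketch of how such a value could map onto releases: "core.version" does
not exist, resolve_opt_in_minor() is a made-up name, and I'm assuming only
"2.N" and "+N" forms for illustration.

```c
#include <stdio.h>

/*
 * Resolve a hypothetical core.version value to a 2.x minor number.
 * "+N" is relative to current_minor, "2.N" is absolute.
 * Returns -1 if the value is in neither form.
 */
int resolve_opt_in_minor(const char *value, int current_minor)
{
    int n;
    char trailing;

    /* The trailing %c only converts if junk follows the number. */
    if (sscanf(value, "+%d%c", &n, &trailing) == 1 && n >= 0)
        return current_minor + n;
    if (sscanf(value, "2.%d%c", &n, &trailing) == 1 && n >= 0)
        return n;
    return -1;
}
```

On a 2.22 build both "+6" and "2.28" resolve to minor version 28, so either
spelling tells the user which release's defaults they have opted into.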

So I'm all for a facility to have a setting to collectively opt-in to
new things early. But I think for such a thing we really should a) at
least in principle commit to making those things the default eventually
(if they don't suck) b) it needs to be obvious to the user how the
"rate" relates to git releases.

This "core.featureAdoptionRate" value seems more like zlib compression
values & unrelated to release numbers. It's also for "performance
features" only but squats a more general name. I suggested
"core.version" & then "core.uiVersion" (in
https://public-inbox.org/git/87pnunxz5i@evledraar.gmail.com/).

Regardless of whether we want to pin opt-in early-bird features to
version numbers in some way (which I think is a good idea, but maybe
others disagree), I think that if it's "just performance" it's good to
put that in the key name in such a way that we can have "early UI"
features, or other non-UI non-performance ones.

Thanks for working on this!


Re: ab/no-kwset, was Re: What's cooking in git.git (Jun 2019, #07; Fri, 28)

2019-07-01 Thread Ævar Arnfjörð Bjarmason


On Mon, Jul 01 2019, Johannes Schindelin wrote:

> Hi Junio & Ævar,
>
> On Fri, 28 Jun 2019, Junio C Hamano wrote:
>
>> * ab/no-kwset (2019-06-28) 9 commits
>>  - grep: use PCRE v2 for optimized fixed-string search
>>  - grep: remove the kwset optimization
>>  - grep: drop support for \0 in --fixed-strings 
>>  - grep: make the behavior for NUL-byte in patterns sane
>>  - grep tests: move binary pattern tests into their own file
>>  - grep tests: move "grep binary" alongside the rest
>>  - grep: inline the return value of a function call used only once
>>  - grep: don't use PCRE2?_UTF8 with "log --encoding="
>>  - log tests: test regex backends in "--encode=" tests
>>
>>  Retire use of kwset library, which is an optimization for looking
>>  for fixed strings, with use of pcre2 JIT.
>>
>>  Will merge to 'next'.
>
> There is still a test failure that I am not sure how Ævar wants to
> address:
>
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab

CC'd you there, but as a note here: I believe my v3 sent just now fixes
this:
https://public-inbox.org/git/20190701212100.27850-1-ava...@gmail.com/


[PATCH v3 08/10] grep: drop support for \0 in --fixed-strings

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Change "-f <file>" to not support patterns with a NUL-byte in them
under --fixed-strings. We'll now only support these under
"--perl-regexp" with PCRE v2.

A previous change to grep's documentation changed the description of
"-f <file>" to be vague enough as to not promise that this would work.
By dropping support for this we make it a whole lot easier to move
away from the kwset backend, which we'll do in a subsequent change.
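
The check itself is small. As a standalone sketch (pattern_has_nul() and
must_reject_pattern() are illustrative names, not functions in grep.c):

```c
#include <stddef.h>
#include <string.h>

/*
 * Patterns read from a file carry an explicit length, so an embedded
 * NUL has to be found with memchr(); strlen() would stop at it.
 */
int pattern_has_nul(const char *pattern, size_t patternlen)
{
    return memchr(pattern, 0, patternlen) != NULL;
}

/* A NUL-byte in the pattern is only acceptable when PCRE v2 is in use. */
int must_reject_pattern(const char *pattern, size_t patternlen, int pcre2)
{
    return pattern_has_nul(pattern, patternlen) && !pcre2;
}
```

For the three bytes 'y', NUL, 'f' strlen() reports 1, while the memchr()
based check sees the NUL and the pattern is rejected unless PCRE v2 is used.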

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +-
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/grep.c b/grep.c
index d6603bc950..8d0fff316c 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
 
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
+
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
return;
}
 
-   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-   die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
-
if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0

[PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search

2019-07-01 Thread Ævar Arnfjörð Bjarmason
7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
    4221.13: extended log --grep='æ'             7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
    4221.14: perl log --grep='æ'                 7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%

log with -i:

    Test                                         origin/master     HEAD
    ---------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'            7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
    4221.2: basic log -i --grep='int'            7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
    4221.3: extended log -i --grep='int'         7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
    4221.4: perl log -i --grep='int'             7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
    4221.6: fixed log -i --grep='uncommon'       7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
    4221.7: basic log -i --grep='uncommon'       6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
    4221.8: extended log -i --grep='uncommon'    7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
    4221.9: perl log -i --grep='uncommon'        6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
    4221.11: fixed log -i --grep='æ'             7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
    4221.12: basic log -i --grep='æ'             7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
    4221.13: extended log -i --grep='æ'          7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
    4221.14: perl log -i --grep='æ'              7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct 
grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+   size_t i;
+
+   for (i = 0; i < len; i++) {
+   if (is_regex_special(s[i]))
+   return 0;
+   }
+
+   return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, 
const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, 
struct grep_opt *opt)
compile_regexp_failed(p, errbuf);
}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
int err;
int regflags = REG_NEWLINE;
+   int pat_is_fixed;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
-   if (opt->fixed) {
+   pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+   if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+   opt->pcre2 = 1;
+   if (pat_is_fixed) {
+   compile_pcre2_pattern(p, opt);
+   } else {
+   /*
+* E.g. t7811-grep-open.sh relies on the
+* pattern being restored.
+*/
+   char *old_pattern = p->pattern;
+   size_t old_patternlen = p->patternlen;
+   struct strbuf sb = STRBUF_INIT;
+
+   /*
+* There is the PCRE2_LITERAL flag, but it's
+* only in PCRE v2 10.30 and later. Needing to
+* ifdef our way around that and dealing with
+* it + PCRE2_MULTILINE being an error is more
+* complex than just quoting this ourselves.
+   */
+   strbuf_add(&sb, "\\Q", 2);
+   strbuf_add(&sb, p->pattern, p->patternlen);
+   strbuf_add(&sb, "\\E", 2);
+
+   p->pattern = sb.buf;
+   p->patternlen = sb.len;
+   compile_pcre2_pattern(p, opt);
+   p->pattern = old_pattern;
+   p->patternlen = old_patternlen;
+   strbuf_release(&sb);
+   }
+#else /* !USE_LIBPCRE2 */
compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
return;
}
 
-- 
2.22.0.455.g172b71a6c5
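
The two pieces the last hunk relies on, "is this pattern free of regex
metacharacters?" and "wrap it in \Q...\E otherwise", can be sketched
standalone like this. It's a sketch under assumptions: is_special() is a
stand-in for git's is_regex_special() with a character set I picked for
illustration, quote_literal() is a made-up helper using malloc() instead of
a strbuf, and a pattern that itself contains "\E" is not handled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for is_regex_special(); the exact set is an assumption. */
static int is_special(char c)
{
    static const char specials[] = "$()+.^{|*?[\\";

    /* sizeof - 1 so that a NUL byte is not treated as special here. */
    return memchr(specials, c, sizeof(specials) - 1) != NULL;
}

/* Returns 1 if no byte of s[0..len) is a regex metacharacter. */
int is_fixed(const char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (is_special(s[i]))
            return 0;
    }
    return 1;
}

/*
 * Returns a malloc()'d copy of pattern wrapped in \Q...\E, with the
 * new length stored in *outlen, or NULL if allocation fails. The
 * caller frees the result.
 */
char *quote_literal(const char *pattern, size_t len, size_t *outlen)
{
    char *buf = malloc(len + 4 + 1);

    if (!buf)
        return NULL;
    memcpy(buf, "\\Q", 2);
    memcpy(buf + 2, pattern, len);
    memcpy(buf + 2 + len, "\\E", 2);
    buf[len + 4] = '\0';
    *outlen = len + 4;
    return buf;
}
```

A pattern like "how to" passes is_fixed() and can be handed to the regex
engine as-is, while "a.b" under -F would be passed on as "\Qa.b\E".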



[PATCH v3 09/10] grep: remove the kwset optimization

2019-07-01 Thread Ævar Arnfjörð Bjarmason
0.26) +19.2%
    4221.4: perl log -i --grep='int'             7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
    4221.6: fixed log -i --grep='uncommon'       6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
    4221.7: basic log -i --grep='uncommon'       6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
    4221.8: extended log -i --grep='uncommon'    7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
    4221.9: perl log -i --grep='uncommon'        6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
    4221.11: fixed log -i --grep='æ'             7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
    4221.12: basic log -i --grep='æ'             7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
    4221.13: extended log -i --grep='æ'          7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
    4221.14: perl log -i --grep='æ'              7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%

Some of those, as noted in [1], are because PCRE is faster at finding
fixed strings. This looks bad for some engines, but in the next change
we'll optimistically use PCRE v2 for all of these, so it'll look
better.

1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 63 +++---
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/grep.c b/grep.c
index 8d0fff316c..4468519d5c 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct 
grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-   size_t i;
-
-   for (i = 0; i < len; i++) {
-   if (is_regex_special(s[i]))
-   return 0;
-   }
-
-   return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
+   p->fixed = opt->fixed;
 
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f <file>). This 
is only supported with -P under PCRE v2"));
 
-   /*
-* Even when -F (fixed) asks us to do a non-regexp search, we
-* may not be able to correctly case-fold when -i
-* (ignore-case) is asked (in which case, we'll synthesize a
-* regexp to match the pattern that matches regexp special
-* characters literally, while ignoring case differences).  On
-* the other hand, even without -F, if the pattern does not
-* have any regexp special characters and there is no need for
-* case-folding search, we can internally turn it into a
-* simple string match using kws.  p->fixed tells us if we
-* want to use kws.
-*/
-   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-   if (p->fixed) {
-   p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-   kwsincr(p->kws, p->pattern, p->patternlen);
-   kwsprep(p->kws);
-   return;
-   }
-
if (opt->fixed) {
-   /*
-* We come here when the pattern has the non-ascii
-* characters we cannot case-fold, and asked to
-* ignore-case.
-*/
compile_fixed_regexp(p, opt);
return;
}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
case GREP_PATTERN: /* atom */
case GREP_PATTERN_HEAD:
case GREP_PATTERN_BODY:
-   if (p->kws)
-   kwsfree(p->kws);
-   else if (p->pcre1_regexp)
+   if (p->pcre1_regexp)
free_pcre1_regexp(p);
else if (p->pcre2_pattern)
free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char 
*name)
opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-   regmatch_t *match)
-{
-   struct kwsmatch kwsm;
-   size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-   if (offset == -1) {
-   match->rm_so = match->rm_eo = -1;
-   return REG_NOMATCH;
-   } else {
-   match->rm_so = offset;
-   match->rm_eo = match->rm_so + kwsm.size[0];
-  

[PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW

2019-07-01 Thread Ævar Arnfjörð Bjarmason
In 5212f91deb ("t4210: skip command-line encoding tests on mingw",
2014-07-17) the positive tests in this file were skipped. That left
the negative tests that don't produce a match.

An upcoming change to migrate the "fixed" backend of grep to PCRE v2
will cause these "log" commands to produce an error instead on
MinGW. This is because the command-line on that platform implicitly
has its encoding changed before being passed to git. See [1].

1. 
https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t4210-log-i18n.sh | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 515bcb7ce1..6e61f57f09 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -51,7 +51,7 @@ test_expect_success !MINGW 'log --grep does not find 
non-reencoded values (utf8)
test_must_be_empty actual
 '
 
-test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
+test_expect_success !MINGW 'log --grep does not find non-reencoded values 
(latin1)' '
git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
test_must_be_empty actual
 '
@@ -70,7 +70,7 @@ do
then
force_regex=.*
fi
-   test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c 
grep.patternType=$engine log --grep does not find non-reencoded values (latin1 
+ locale)" "
cat >expect <<-\EOF &&
latin1
utf8
@@ -79,12 +79,12 @@ do
test_cmp expect actual
"
 
-   test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c 
grep.patternType=$engine log --grep does not find non-reencoded values (latin1 
+ locale)" "
LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
test_must_be_empty actual
"
 
-   test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid 
needle)" "
+   test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c 
grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 
+ locale + invalid needle)" "
LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
test_must_be_empty actual
"
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 04/10] grep: inline the return value of a function call used only once

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(). Let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 1de4ab49c0..4e8d0645a8 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, 
struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-   int ascii_only;
int err;
int regflags = REG_NEWLINE;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
-   ascii_only = !has_non_ascii(p->pattern);
 
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (opt->fixed ||
has_null(p->pattern, p->patternlen) ||
is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || ascii_only;
+   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5
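
The condition touched by this patch decides whether a pattern may take the
fixed-string (kwset) path. A self-contained sketch of that decision follows.
The set of "special" bytes is an assumption for illustration (git consults its
own character-class table via is_regex_special()), the helpers take an
explicit length where git's has_non_ascii() takes a NUL-terminated string, the
has_null() test from the surrounding code is left out, and every name ending
in _sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed set of regex metacharacters; git uses its own ctype table. */
static int is_special_sketch(char c)
{
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* Returns 1 if no byte of the pattern is a regex metacharacter. */
static int is_fixed_sketch(const char *s, size_t len)
{
	size_t i;

	assert(s);
	for (i = 0; i < len; i++)
		if (is_special_sketch(s[i]))
			return 0;
	return 1;
}

/* Returns 1 if any byte of the pattern has the high bit set. */
static int has_non_ascii_sketch(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)s[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Mirrors the shape of:
 *   if (opt->fixed || is_fixed(...))
 *           p->fixed = !p->ignore_case || !has_non_ascii(...);
 */
static int use_fixed_path_sketch(const char *s, size_t len,
				 int opt_fixed, int ignore_case)
{
	if (!opt_fixed && !is_fixed_sketch(s, len))
		return 0;
	return !ignore_case || !has_non_ascii_sketch(s, len);
}
```

Under these assumptions "foo" takes the fixed path, "fo.o" takes it only when
-F is given, and a non-ASCII pattern combined with -i never takes it.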



[PATCH v3 01/10] log tests: test regex backends in "--encode=" tests

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

A follow-up commit will fix this bug; this test just demonstrates the
existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests and how liberal the kwset code is about
garbage, we missed it.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. It's possible that this
test breaks the "basic" and "extended" backends on systems that are
stricter than glibc about encoding and locale issues in the POSIX
regex functions, though I can't remember any such system; PCRE is
simply more careful about validation.

1. 
https://public-inbox.org/git/nycvar.qro.7.76.6.1906271113090...@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t4210-log-i18n.sh | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh
 
 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh
 
 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')
 
+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
test_tick &&
cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded 
values (latin1)' '
test_must_be_empty actual
 '
 
+for engine in fixed basic extended perl
+do
+   prereq=
+   result=success
+   if test $engine = "perl"
+   then
+   result=failure
+   prereq="PCRE"
+   else
+   prereq=""
+   fi
+   force_regex=
+   if test $engine != "fixed"
+   then
+   force_regex=.*
+   fi
+   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   cat >expect <<-\EOF &&
+   latin1
+   utf8
+   EOF
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+   test_cmp expect actual
+   "
+
+   test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+   test_must_be_empty actual
+   "
+
+   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid 
needle)" "
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+   test_must_be_empty actual
+   "
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5
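
Why the needles in this test behave differently can be seen by checking their
bytes against the UTF-8 encoding rules. The validator below is a simplified
sketch (it is not what PCRE does internally, and is_valid_utf8_sketch() is a
hypothetical name): it checks lead and continuation bytes and rejects overlong
forms, surrogates and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the len bytes at s form well-formed UTF-8, else 0. */
static int is_valid_utf8_sketch(const char *str, size_t len)
{
	const unsigned char *s = (const unsigned char *)str;
	size_t i = 0;

	assert(s || !len);

	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp, min;
		size_t need, k;

		if (c < 0x80) {		/* plain ASCII */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (len - i - 1 < need)
			return 0;	/* sequence is cut short */

		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0; /* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;	/* overlong, out of range or surrogate */

		i += need + 1;
	}
	return 1;
}
```

With this check the UTF-8 é (\303\251) is accepted, while both the invalid
needle (\303\50 followed by ")") and the lone Latin-1 byte \351 are rejected.
An engine that validates its input therefore treats these needles differently,
whereas kwset compares raw bytes and never notices.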



[PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding="

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blind spot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.

We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.

Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.

But for now matching this as non-UTF8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
isn't true when we've explicitly encoded it to be in a different
encoding.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c  | 8 
 grep.h  | 1 +
 revision.c  | 3 +++
 t/t4210-log-i18n.sh | 6 ++
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..1de4ab49c0 100644
--- a/grep.c
+++ b/grep.c
@@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, 
const struct grep_opt *opt)
int options = PCRE_MULTILINE;
 
if (opt->ignore_case) {
-   if (has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && has_non_ascii(p->pattern))
p->pcre1_tables = pcre_maketables();
options |= PCRE_CASELESS;
}
-   if (is_utf8_locale() && has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && is_utf8_locale() && 
has_non_ascii(p->pattern))
options |= PCRE_UTF8;
 
p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
@@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, 
const struct grep_opt *opt
p->pcre2_compile_context = NULL;
 
if (opt->ignore_case) {
-   if (has_non_ascii(p->pattern)) {
+   if (!opt->ignore_locale && has_non_ascii(p->pattern)) {
character_tables = pcre2_maketables(NULL);
p->pcre2_compile_context = 
pcre2_compile_context_create(NULL);
pcre2_set_character_tables(p->pcre2_compile_context, 
character_tables);
}
options |= PCRE2_CASELESS;
}
-   if (is_utf8_locale() && has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && is_utf8_locale() && 
has_non_ascii(p->pattern))
options |= PCRE2_UTF;
 
p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/grep.h b/grep.h
index 1875880f37..4bb8a79d93 100644
--- a/grep.h
+++ b/grep.h
@@ -173,6 +173,7 @@ struct grep_opt {
int funcbody;
int extended_regexp_option;
int pattern_type_option;
+   int ignore_locale;
char colors[NR_GREP_COLORS][COLOR_MAXLEN];
unsigned pre_context;
unsigned post_context;
diff --git a/revision.c b/revision.c
index 621feb9df7..a842fb158a 100644
--- a/revision.c
+++ b/revision.c
@@ -28,6 +28,7 @@
 #include "commit-graph.h"
 #include "prio-queue.h"
 #include "hashmap.h"
+#include "utf8.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
 
grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 &revs->grep_filter);
+   if (!is_encoding_utf8(get_log_output_encoding()))
+   revs->grep_filter.ignore_locale = 1;
compile_grep_patterns(&revs->grep_filter);
 
if (revs->reverse && revs->reflog_info)
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 86d22c1d4c..515bcb7ce1 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded 
values (latin1)' '
 for engine in fixed basic extended perl
 do
prereq=
-   result=success
if test $engine = "perl"
then
-   result=failure
prereq="PCRE"
else
prereq=""
@@ -72,7 +70,7 @@ do
then
force_regex=.*
fi
-   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   test_expect_success GETTEXT_LOCALE,$prereq
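
The effect of the new ignore_locale flag on the compile options can be
sketched without linking against PCRE. In this illustration SKETCH_CASELESS
and SKETCH_UTF are stand-in constants for PCRE2_CASELESS and PCRE2_UTF,
utf8_locale stands in for the result of is_utf8_locale(), and
pcre_options_sketch() is a hypothetical name. The character-table handling of
the real function is left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CASELESS 0x1u /* stand-in for PCRE2_CASELESS */
#define SKETCH_UTF      0x2u /* stand-in for PCRE2_UTF */

/* Returns 1 if the NUL-terminated string has a byte with the high bit set. */
static int has_non_ascii_sketch(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/*
 * Only request UTF mode when the locale is UTF-8, the pattern is
 * non-ASCII, and the caller has not said that the haystack is in
 * some other encoding (ignore_locale).
 */
static unsigned pcre_options_sketch(const char *pattern, int ignore_case,
				    int ignore_locale, int utf8_locale)
{
	unsigned options = 0;

	assert(pattern);

	if (ignore_case)
		options |= SKETCH_CASELESS;
	if (!ignore_locale && utf8_locale && has_non_ascii_sketch(pattern))
		options |= SKETCH_UTF;
	return options;
}
```

In the revision.c hunk above, ignore_locale is set whenever the log output
encoding is not UTF-8, so a run with --encoding=ISO-8859-1 corresponds to
calling this sketch with ignore_locale set to 1.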

[PATCH v3 05/10] grep tests: move "grep binary" alongside the rest

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => 
t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh 
b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane

2019-07-01 Thread Ævar Arnfjörð Bjarmason
The behavior of "grep" when patterns contained a NUL-byte has always
been haphazard, and has served the vagaries of the implementation more
than anything else. A pattern containing a NUL-byte can only be
provided via "-f <file>". Since pickaxe (log search) has no such flag
the NUL-byte in patterns has only ever been supported by "grep" (and
not "log --grep").

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a NUL-byte. Instead die if the backend in question can't
handle it, e.g. --basic-regexp is combined with such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 Documentation/git-grep.txt |  17 
 grep.c |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++---
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index 4e8d0645a8..d6603bc950 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-   /*
-* regcomp cannot accept patterns with NULs so when using it
-* we consider any pattern containing a NUL fixed.
-*/
-   if (memchr(s, 0, len))
-   return 1;
-
-   return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
 * simple string match using kws.  p->fixed tells us if we
 * want to use kws.
 */
-   if (opt->fixed ||
-   has_null(p->pattern, p->patternlen) ||
-   is_fixed(p->pattern, p->patternlen))
+   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
kwsincr(p->kws, p->pattern, p->patternlen);
kwsprep(p->kws);
return;
-   } else if (opt->fixed) {
+   }
+
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f <file>). This 
is only supported with -P under PCRE v2"));
+
+   if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
matches=$1
-   fla
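
The difference between a pattern's length and its C-string length is what the
new memchr() check in compile_regexp() guards. A minimal sketch, assuming a
hypothetical helper nul_pattern_unsupported() and a plain int flag in place of
opt->pcre2:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the pattern contains a NUL byte and the selected
 * engine is not PCRE v2, i.e. the case where the caller should die
 * instead of silently truncating or reinterpreting the pattern.
 */
static int nul_pattern_unsupported(const char *pattern, size_t len, int pcre2)
{
	assert(pattern);
	return memchr(pattern, 0, len) != NULL && !pcre2;
}
```

For the three-byte pattern "y\0f", strlen() reports 1, so an engine that
takes a NUL-terminated string would only ever see "y". The check has to use
the explicit length that came from reading the -f file.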

[PATCH v3 06/10] grep tests: move binary pattern tests into their own file

2019-07-01 Thread Ævar Arnfjörð Bjarmason
Move the tests for "-f <file>" where "<file>" contains a NUL byte
pattern into their own file. I added most of these tests in
966be95549 ("grep: add tests to fix blind spots with \0 patterns",
2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
on patterns containing a NUL-byte and considering them fixed, except in
cases where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t7815-grep-binary.sh | 101 -
 t/t7816-grep-binary-pattern.sh | 114 +
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-   matches=$1
-   flags=$2
-   pattern=$3
-   pattern_human=$(echo "$pattern" | sed 's/Q//g')
-
-   if test "$matches" = 1
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = 0
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   elif test "$matches" = T1
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = T0
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   else
-   test_expect_success "PANIC: Test framework error. Unknown 
matches value $matches" 'false'
-   fi
-}
-
 test_expect_success 'setup' "
echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 ''
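
Matching binary content (as opposed to binary patterns) relies on
REG_STARTEND, which lets regexec() take the bounds of the haystack from the
match structure instead of stopping at the first NUL. Below is a sketch in the
spirit of regexec_buf(), assuming a platform whose regex library provides the
REG_STARTEND extension (it is not part of POSIX). match_buf_sketch() is a
hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Match a compiled regex against a buffer of known size that may
 * contain NUL bytes. With REG_STARTEND the engine reads the start
 * and end offsets of the haystack from match->rm_so and
 * match->rm_eo rather than looking for a terminating NUL.
 */
static int match_buf_sketch(const regex_t *preg, const char *buf,
			    size_t size, regmatch_t *match)
{
	assert(preg && buf && match);

	match->rm_so = 0;
	match->rm_eo = (regoff_t)size;
	return regexec(preg, buf, 1, match, REG_STARTEND);
}
```

The pattern itself still goes through regcomp(), which takes a NUL-terminated
string. That is why binary content can be searched this way while binary
patterns cannot.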

[PATCH v3 00/10] grep: move from kwset to optional PCRE v2

2019-07-01 Thread Ævar Arnfjörð Bjarmason
This v3 has a new patch (3/10) that I believe fixes the regression on
MinGW Johannes noted in
https://public-inbox.org/git/nycvar.qro.7.76.6.1907011515150...@tvgsbejvaqbjf.bet/

As noted in the updated commit message in 10/10 I believe just
skipping this test & documenting this in a commit message is the least
amount of suck for now. It's really an existing issue with us doing
nothing sensible when the log/grep haystack encoding doesn't match the
needle encoding supplied via the command line.

We swept that under the carpet with the kwset backend, but PCRE v2
exposes it.

Ævar Arnfjörð Bjarmason (10):
  log tests: test regex backends in "--encode=" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding="
  t4210: skip more command-line encoding tests on MinGW
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings <pattern>
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt|  17 +++
 grep.c| 115 +++-
 grep.h|   3 +-
 revision.c|   3 +
 t/t4210-log-i18n.sh   |  41 +-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
 t/t7816-grep-binary-pattern.sh| 127 ++
 8 files changed, 234 insertions(+), 173 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => 
t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
 1:  cfc01f49d3 =  1:  cfc01f49d3 log tests: test regex backends in 
"--encode=" tests
 2:  4b59eb32f0 =  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log 
--encoding="
 -:  -- >  3:  676c76afe4 t4210: skip more command-line encoding tests 
on MinGW
 3:  cc4d3b50d5 =  4:  da9b491f70 grep: inline the return value of a function 
call used only once
 4:  d9b29bdd89 =  5:  c42d3268fa grep tests: move "grep binary" alongside the 
rest
 5:  f85614f435 =  6:  36b9c1c541 grep tests: move binary pattern tests into 
their own file
 6:  90afca8707 =  7:  3c54e782e6 grep: make the behavior for NUL-byte in 
patterns sane
 7:  526b925fdc =  8:  8e5f418189 grep: drop support for \0 in --fixed-strings 
<pattern>
 8:  14269bb295 =  9:  d1cb8319d5 grep: remove the kwset optimization
 9:  c0fd75d102 ! 10:  4de0c82314 grep: use PCRE v2 for optimized fixed-string 
search
@@ -15,6 +15,15 @@
 makes the behavior harder to understand and document, and makes tests
 for the different backends more painful.
 
+This does change the behavior under non-C locales when "log"'s
+"--encoding" option is used and the haystack/needle in the
+content/command-line doesn't have a matching encoding. See the recent
+change in "t4210: skip more command-line encoding tests on MinGW" in
+this series. I think that's OK. We did nothing sensible before
+then (just compared raw bytes that had no hope of matching). At least
+now the user will get some idea why their grep/log never matches in
+that edge case.
+
 I could also support the PCRE v1 backend here, but that would make the
 code more complex. I'd rather aim for simplicity here and in future
 changes to the diffcore. We're not going to have someone who
-- 
2.22.0.455.g172b71a6c5



Re: [PATCH 0/6] easy bulk commit creation in tests

2019-06-28 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 28 2019, Jeff King wrote:

> On Fri, Jun 28, 2019 at 02:41:03AM -0400, Jeff King wrote:
>
>> I think this would exercise it, at the cost of making the test more
>> expensive:
>>
>> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
>> index 82d7f7f6a5..8ed6982dcb 100755
>> --- a/t/t5310-pack-bitmaps.sh
>> +++ b/t/t5310-pack-bitmaps.sh
>> @@ -21,7 +21,7 @@ has_any () {
>>  }
>>
>>  test_expect_success 'setup repo with moderate-sized history' '
>> -for i in $(test_seq 1 10)
>> +for i in $(test_seq 1 100)
>>  do
>>  test_commit $i
>>  done &&
>>
>> It would be nice if we had a "test_commits_bulk" that used fast-import
>> to create larger numbers of commits.
>
> So here's a patch to do that. Writing the bulk commit function was a fun
> exercise, and I found a couple other places to apply it, too, shaving
> off ~7.5 seconds from my test runs. Not ground-breaking, but I think
> it's nice to have a solution where we don't have to be afraid to
> generate a bunch of commits.

Nice.

Just a side-note: I've wondered how much we could speed up the tests in
other places if rather than doing setup all over the place we simply
created a few "template" repository shapes, and the common case for
tests would be to simply cp(1) those over.

I.e. for things like fsck etc. we really do need some specific
repository layout, but a lot of our tests are simply re-doing setup
slightly differently just to get things like "I want a few commits on a
few branches" or "set up a repo like  but with some remotes" etc.


Re: [PATCH 1/6] test-lib: introduce test_commit_bulk

2019-06-28 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 28 2019, Jeff King wrote:

> Some tests need to create a string of commits. Doing this with
> test_commit is very heavy-weight, as it needs at least one process per
> commit (and in fact, uses several).
>
> For bulk creation, we can do much better by using fast-import, but it's
> often a pain to generate the input. Let's provide a helper to do so.
>
> We'll use t5310 as a guinea pig, as it has three 10-commit loops. Here
> are hyperfine results before and after:
>
>   [before]
>   Benchmark #1: ./t5310-pack-bitmaps.sh --root=/var/ram/git-tests
> Time (mean ± σ):  2.846 s ±  0.305 s[User: 3.042 s, System: 0.919 
> s]
> Range (min … max):2.250 s …  3.210 s10 runs
>
>   [after]
>   Benchmark #1: ./t5310-pack-bitmaps.sh --root=/var/ram/git-tests
> Time (mean ± σ):  2.210 s ±  0.174 s[User: 2.570 s, System: 0.604 
> s]
> Range (min … max):1.999 s …  2.590 s10 runs
>
> So we're over 20% faster, while making the callers slightly shorter. We
> added a lot more lines in test-lib-function.sh, of course, and the
> helper is way more featureful than we need here. But my hope is that it
> will be flexible enough to use in more places.
>
> Signed-off-by: Jeff King 
> ---
>  t/t5310-pack-bitmaps.sh |  15 +
>  t/test-lib-functions.sh | 131 
>  2 files changed, 134 insertions(+), 12 deletions(-)
>
> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index a26c8ba9a2..3aab7024ca 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -21,15 +21,9 @@ has_any () {
>  }
>
>  test_expect_success 'setup repo with moderate-sized history' '
> - for i in $(test_seq 1 10)
> - do
> - test_commit $i
> - done &&
> + test_commit_bulk --id=file 10 &&
>   git checkout -b other HEAD~5 &&
> - for i in $(test_seq 1 10)
> - do
> - test_commit side-$i
> - done &&
> + test_commit_bulk --id=side 10 &&
>   git checkout master &&
>   bitmaptip=$(git rev-parse master) &&
>   blob=$(echo tagged-blob | git hash-object -w --stdin) &&
> @@ -106,10 +100,7 @@ test_expect_success 'clone from bitmapped repository' '
>  '
>
>  test_expect_success 'setup further non-bitmapped commits' '
> - for i in $(test_seq 1 10)
> - do
> - test_commit further-$i
> - done
> + test_commit_bulk --id=further 10
>  '
>
>  rev_list_tests 'partial bitmap'
> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
> index 0367cec5fd..32a1db81a3 100644
> --- a/t/test-lib-functions.sh
> +++ b/t/test-lib-functions.sh
> @@ -233,6 +233,137 @@ test_merge () {
>   git tag "$1"
>  }
>
> +# Similar to test_commit, but efficiently create <nr> commits, each with a
> +# unique number $n (from 1 to <nr> by default) in the commit message.

Is it intentional not to follow test_commit's convention of creating a
tag as well? If so it would be helpful to note that difference here, or
rather, move this documentation to t/README where test_commit and
friends are documented.


Re: [PATCH v2 0/9] grep: move from kwset to optional PCRE v2

2019-06-28 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 28 2019, Ævar Arnfjörð Bjarmason wrote:

> A non-RFC since it seems people like this approach.
>
> This should fix the test failure noted by Johannes, there's two new
> patches at the start of this series. They address a bug that was there
> for a long time, but I happened to trip over since PCRE is more strict
> about UTF-8 validation than kwset (which doesn't care at all).
>
> I also added performance numbers to the relevant commit messages, took
> brian's suggestion of saying "NUL-byte" instead of "\0", and did some
> other copyediting of my own.
>
> The rest of the code changes are all just comments & rewording of
> previously added comments.

Junio, I thought I'd get this in before your merge to "next", but I
see that has already happened. Are you OK with rewinding it for this
(& maybe something else), or should I submit a v3 rebased on "next"?

I'd really prefer the improved commit messages with performance numbers,
and thought I'd have time to work on those details since it was an
RFC/PATCH :)

> Ævar Arnfjörð Bjarmason (9):
>   log tests: test regex backends in "--encode=" tests
>   grep: don't use PCRE2?_UTF8 with "log --encoding="
>   grep: inline the return value of a function call used only once
>   grep tests: move "grep binary" alongside the rest
>   grep tests: move binary pattern tests into their own file
>   grep: make the behavior for NUL-byte in patterns sane
>   grep: drop support for \0 in --fixed-strings 
>   grep: remove the kwset optimization
>   grep: use PCRE v2 for optimized fixed-string search
>
>  Documentation/git-grep.txt|  17 +++
>  grep.c| 115 +++-
>  grep.h|   3 +-
>  revision.c|   3 +
>  t/t4210-log-i18n.sh   |  39 +-
>  ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
>  ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
>  t/t7816-grep-binary-pattern.sh| 127 ++
>  8 files changed, 233 insertions(+), 172 deletions(-)
>  rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
>  rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
>  create mode 100755 t/t7816-grep-binary-pattern.sh
>
> Range-diff:
>  -:  -- >  1:  cfc01f49d3 log tests: test regex backends in "--encode=" tests
>  -:  -- >  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding="
>  1:  ad55d3be7e =  3:  cc4d3b50d5 grep: inline the return value of a function call used only once
>  2:  650bcc8582 =  4:  d9b29bdd89 grep tests: move "grep binary" alongside the rest
>  3:  ef10a8820d !  5:  f85614f435 grep tests: move binary pattern tests into their own file
> @@ -2,9 +2,10 @@
>
>  grep tests: move binary pattern tests into their own file
>
> -Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
> -into their own file. I added most of these tests in 966be95549 ("grep:
> -add tests to fix blind spots with \0 patterns", 2017-05-20).
> +Move the tests for "-f <file>" where "<file>" contains a NUL byte
> +pattern into their own file. I added most of these tests in
> +966be95549 ("grep: add tests to fix blind spots with \0 patterns",
> +2017-05-20).
>
>  Whether a regex engine supports matching binary content is very
>  different from whether it matches binary patterns. Since
> @@ -14,8 +15,8 @@
>  engine can sensibly match binary patterns.
>
>  Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
> -patterns containing "\0" and considering them fixed, except in cases
> -where "--ignore-case" is provided and they're non-ASCII, see
> +patterns containing NUL-byte and considering them fixed, except in
> +cases where "--ignore-case" is provided and they're non-ASCII, see
>  5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
>  2016-06-25). Subsequent commits will change this behavior.
>
>  4:  03e5637efc !  6:  90afca8707 grep: make the behavior for \0 in patterns sane
> @@ -1,12 +1,13 @@
>  Author: Ævar Arnfjörð Bjarmason 
>
> -grep: make the behavior for \0 in patterns sane
> +grep: make the behavior for NUL-byte in patterns sane
>
> -The behavi

Re: [PATCH] repack: disable bitmaps-by-default if .keep files exist

2019-06-28 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 28 2019, Eric Wong wrote:

> Jeff King  wrote:
>> On Sun, Jun 23, 2019 at 06:08:25PM +, Eric Wong wrote:
>>
>> > > I'm not sure of the right solution. For maximal backwards-compatibility,
>> > > the default for bitmaps could become "if not bare and if there are no
>> > > .keep files". But that would mean bitmaps sometimes not getting
>> > > generated because of the problems that ee34a2bead was trying to solve.
>> > >
>> > > That's probably OK, though; you can always flip the bitmap config to
>> > > "true" yourself if you _must_ have bitmaps.
>> >
>> > What about something like this?  Needs tests but I need to leave, now.
>>
>> Yeah, I think that's the right direction.
>
> OK.  I have a real patch with one additional test, below.
> (don't have a lot of time for hacking)
>
>> Though...
>>
>> > +static int has_pack_keep_file(void)
>> > +{
>> > +  DIR *dir;
>> > +  struct dirent *e;
>> > +  int found = 0;
>> > +
>> > +  if (!(dir = opendir(packdir)))
>> > +  return found;
>> > +
>> > +  while ((e = readdir(dir)) != NULL) {
>> > +  if (ends_with(e->d_name, ".keep")) {
>> > +  found = 1;
>> > +  break;
>> > +  }
>> > +  }
>> > +  closedir(dir);
>> > +  return found;
>> > +}
>>
>> I think this can be replaced with just checking p->pack_keep for each
>> item in the packed_git list.
>
> Good point, I tend to forget git C API internals as soon as I
> learn them :x
>
>> That's racy, but then so is your code here, since it's really the child
>> pack-objects which is going to deal with the .keep. I don't think we
>> need to care much about the race, though. Either:
>
> Agreed.  
>
> 8<---
> Subject: [PATCH] repack: disable bitmaps-by-default if .keep files exist
>
> Bitmaps aren't useful with multiple packs, and users with
> .keep files ended up with redundant packs when bitmaps
> got enabled by default in bare repos.
>
> So detect when .keep files exist and stop enabling bitmaps
> by default in that case.
>
> Wasteful (but otherwise harmless) race conditions with .keep files
> documented by Jeff King still apply and there's a chance we'd
> still end up with redundant data on the FS:
>
>   https://public-inbox.org/git/20190623224244.gb1...@sigill.intra.peff.net/
>
> Fixes: 36eba0323d3288a8 ("repack: enable bitmaps by default on bare repos")
> Signed-off-by: Eric Wong 
> Helped-by: Jeff King 
> Reported-by: Janos Farkas 
> ---
>  builtin/repack.c  | 18 --
>  t/t7700-repack.sh | 10 ++
>  2 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index caca113927..a9529d1afc 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -89,6 +89,17 @@ static void remove_pack_on_signal(int signo)
>   raise(signo);
>  }
>
> +static int has_pack_keep_file(void)
> +{
> + struct packed_git *p;
> +
> + for (p = get_packed_git(the_repository); p; p = p->next) {
> + if (p->pack_keep)
> + return 1;
> + }
> + return 0;
> +}
> +
>  /*
>   * Adds all packs hex strings to the fname list, which do not
>   * have a corresponding .keep file. These packs are not to
> @@ -343,9 +354,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>   (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>   die(_("--keep-unreachable and -A are incompatible"));
>
> - if (write_bitmaps < 0)
> + if (write_bitmaps < 0) {
>   write_bitmaps = (pack_everything & ALL_INTO_ONE) &&
> -  is_bare_repository();
> +  is_bare_repository() &&
> +  keep_pack_list.nr == 0 &&
> +  !has_pack_keep_file();
> + }
>   if (pack_kept_objects < 0)
>   pack_kept_objects = write_bitmaps;
>
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 86d05160a3..0acde3b1f8 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -239,4 +239,14 @@ test_expect_success 'bitmaps can be disabled on bare repos' '
>   test -z "$bitmap"
>  '

My feedback from before this patch still applies:
https://public-inbox.org/git/874l4f8h4c@evledraar.gmail.com/

In particular point "b" there, since "a" is clearly more work. I.e.
shouldn't we, at least in interactive mode on a "gc", print something
about skipping what we'd otherwise do?

Maybe that's tricky with the gc.log functionality, but I think we should
at least document this before the next guy shows up with "sometimes my
.bitmap files aren't generated...".
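
For reference, the default discussed in the quoted patch can be sketched
roughly as below. This is a minimal, self-contained illustration and not
the code in builtin/repack.c: "struct pack" is a hypothetical, simplified
stand-in for git's "struct packed_git", and default_write_bitmaps() is an
illustrative name. The patch's extra "keep_pack_list.nr == 0" check is
left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for git's struct packed_git. */
struct pack {
	int pack_keep;      /* non-zero if a matching .keep file exists */
	struct pack *next;  /* next pack in the list, or NULL */
};

/* Returns 1 if any pack in the list is marked as kept, else 0. */
static int has_pack_keep_file(const struct pack *packs)
{
	const struct pack *p;

	for (p = packs; p; p = p->next) {
		if (p->pack_keep)
			return 1;
	}
	return 0;
}

/*
 * Illustrative helper (not a git function): when the user has not
 * configured bitmaps explicitly, default to writing them only for an
 * all-into-one repack of a bare repository that has no kept packs.
 */
static int default_write_bitmaps(int all_into_one, int is_bare,
				 const struct pack *packs)
{
	assert(all_into_one == 0 || all_into_one == 1);
	assert(is_bare == 0 || is_bare == 1);
	return all_into_one && is_bare && !has_pack_keep_file(packs);
}
```

A caller that silently gets 0 here because of a .keep file is exactly
the case where a message or documentation would help.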


> +test_expect_success 'no bitmaps created if .keep files present' '
> + pack=$(ls bare.git/objects/pack/*.pack) &&
> + test_path_is_file "$pack" &&
> + keep=${pack%.pack}.keep &&
> + >"$keep" &&
> + git -C bare.git repack -ad &&
> + bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) &&
> + test -z "$bitmap"

Maybe more readable as:

 

[PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search

2019-06-27 Thread Ævar Arnfjörð Bjarmason
asic log -i --grep='int'                   7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
4221.3: extended log -i --grep='int'       7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
4221.4: perl log -i --grep='int'           7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
4221.6: fixed log -i --grep='uncommon'     7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
4221.7: basic log -i --grep='uncommon'     6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
4221.8: extended log -i --grep='uncommon'  7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
4221.9: perl log -i --grep='uncommon'      6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
4221.11: fixed log -i --grep='æ'           7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
4221.12: basic log -i --grep='æ'           7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
4221.13: extended log -i --grep='æ'        7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
4221.14: perl log -i --grep='æ'            7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.
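
The fixed-string handling in the diff below can be sketched on its own
as follows. This is a simplified illustration under stated assumptions,
not the code in grep.c: is_special() is a hypothetical approximation of
git's is_regex_special(), and quote_literal() is an illustrative name
for the "\Q...\E" wrapping that the patch does with a strbuf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical approximation of git's is_regex_special(): true for
 * bytes that commonly act as regex (or glob) metacharacters.
 */
static int is_special(char c)
{
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* Returns 1 if no byte in s[0..len) is a metacharacter, else 0. */
static int is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (is_special(s[i]))
			return 0;
	}
	return 1;
}

/*
 * Wrap a pattern in \Q...\E so that PCRE treats every byte literally.
 * Returns a malloc'd, NUL-terminated buffer (caller frees) and stores
 * its length in *out_len, or returns NULL on allocation failure.
 */
static char *quote_literal(const char *pattern, size_t len, size_t *out_len)
{
	char *buf;

	assert(pattern != NULL && out_len != NULL);
	buf = malloc(len + 5); /* "\Q" + pattern + "\E" + NUL */
	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pattern, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[len + 4] = '\0';
	*out_len = len + 4;
	return buf;
}
```

A pattern for which is_fixed() returns 1 can be handed to PCRE as-is;
only a -F pattern containing metacharacters needs the quoting. Note that
a pattern which itself contains "\E" would end the quoting early, which
this sketch does not handle.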

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+   size_t i;
+
+   for (i = 0; i < len; i++) {
+   if (is_regex_special(s[i]))
+   return 0;
+   }
+
+   return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
compile_regexp_failed(p, errbuf);
}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
int err;
int regflags = REG_NEWLINE;
+   int pat_is_fixed;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-   if (opt->fixed) {
+   pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+   if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+   opt->pcre2 = 1;
+   if (pat_is_fixed) {
+   compile_pcre2_pattern(p, opt);
+   } else {
+   /*
+* E.g. t7811-grep-open.sh relies on the
+* pattern being restored.
+*/
+   char *old_pattern = p->pattern;
+   size_t old_patternlen = p->patternlen;
+   struct strbuf sb = STRBUF_INIT;
+
+   /*
+* There is the PCRE2_LITERAL flag, but it's
+* only in PCRE v2 10.30 and later. Needing to
+* ifdef our way around that and dealing with
+* it + PCRE2_MULTILINE being an error is more
+* complex than just quoting this ourselves.
+   */
+   strbuf_add(&sb, "\\Q", 2);
+   strbuf_add(&sb, p->pattern, p->patternlen);
+   strbuf_add(&sb, "\\E", 2);
+
+   p->pattern = sb.buf;
+   p->patternlen = sb.len;
+   compile_pcre2_pattern(p, opt);
+   p->pattern = old_pattern;
+   p->patternlen = old_patternlen;
+   strbuf_release(&sb);
+   }
+#else /* !USE_LIBPCRE2 */
compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
return;
}
 
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 8/9] grep: remove the kwset optimization

2019-06-27 Thread Ævar Arnfjörð Bjarmason
0.26) +19.2%
4221.4: perl log -i --grep='int'           7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
4221.6: fixed log -i --grep='uncommon'     6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
4221.7: basic log -i --grep='uncommon'     6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
4221.8: extended log -i --grep='uncommon'  7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
4221.9: perl log -i --grep='uncommon'      6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
4221.11: fixed log -i --grep='æ'           7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
4221.12: basic log -i --grep='æ'           7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
4221.13: extended log -i --grep='æ'        7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
4221.14: perl log -i --grep='æ'            7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%

Some of those, as noted in [1] are because PCRE is faster at finding
fixed strings. This looks bad for some engines, but in the next change
we'll optimistically use PCRE v2 for all of these, so it'll look
better.

1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 63 +++---
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/grep.c b/grep.c
index 8d0fff316c..4468519d5c 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-   size_t i;
-
-   for (i = 0; i < len; i++) {
-   if (is_regex_special(s[i]))
-   return 0;
-   }
-
-   return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
+   p->fixed = opt->fixed;
 
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-   /*
-* Even when -F (fixed) asks us to do a non-regexp search, we
-* may not be able to correctly case-fold when -i
-* (ignore-case) is asked (in which case, we'll synthesize a
-* regexp to match the pattern that matches regexp special
-* characters literally, while ignoring case differences).  On
-* the other hand, even without -F, if the pattern does not
-* have any regexp special characters and there is no need for
-* case-folding search, we can internally turn it into a
-* simple string match using kws.  p->fixed tells us if we
-* want to use kws.
-*/
-   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-   if (p->fixed) {
-   p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-   kwsincr(p->kws, p->pattern, p->patternlen);
-   kwsprep(p->kws);
-   return;
-   }
-
if (opt->fixed) {
-   /*
-* We come here when the pattern has the non-ascii
-* characters we cannot case-fold, and asked to
-* ignore-case.
-*/
compile_fixed_regexp(p, opt);
return;
}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
case GREP_PATTERN: /* atom */
case GREP_PATTERN_HEAD:
case GREP_PATTERN_BODY:
-   if (p->kws)
-   kwsfree(p->kws);
-   else if (p->pcre1_regexp)
+   if (p->pcre1_regexp)
free_pcre1_regexp(p);
else if (p->pcre2_pattern)
free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name)
opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-   regmatch_t *match)
-{
-   struct kwsmatch kwsm;
-   size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-   if (offset == -1) {
-   match->rm_so = match->rm_eo = -1;
-   return REG_NOMATCH;
-   } else {
-   match->rm_so = offset;
-   match->rm_eo = match->rm_so + kwsm.size[0];
-  

[PATCH v2 4/9] grep tests: move "grep binary" alongside the rest

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5



[PATCH v2 7/9] grep: drop support for \0 in --fixed-strings

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Change "-f <file>" to not support patterns with a NUL-byte in them
under --fixed-strings. We'll now only support these under
"--perl-regexp" with PCRE v2.

A previous change to grep's documentation changed the description of
"-f <file>" to be vague enough as to not promise that this would work.
By dropping support for this we make it a whole lot easier to move
away from the kwset backend, which we'll do in a subsequent change.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +-
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/grep.c b/grep.c
index d6603bc950..8d0fff316c 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
 
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
return;
}
 
-   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-   die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
-
if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0

[PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane

2019-06-27 Thread Ævar Arnfjörð Bjarmason
The behavior of "grep" when patterns contained a NUL-byte has always
been haphazard, and has served the vagaries of the implementation more
than anything else. A pattern containing a NUL-byte can only be
provided via "-f <file>". Since pickaxe (log search) has no such flag
the NUL-byte in patterns has only ever been supported by "grep" (and
not "log --grep").

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a NUL-byte. Instead die if the backend in question can't
handle it, e.g. when --basic-regexp is combined with such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.
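
The distinction between pattern syntax and C-string limitations can be
illustrated with a small sketch. It assumes, as the diff below does,
that the pattern is carried as a pointer plus a length. The helper names
has_nul() and pattern_is_supported() are hypothetical and only for
illustration; the check itself is the memchr() call seen in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the len-byte pattern contains a NUL byte anywhere. */
static int has_nul(const char *pattern, size_t len)
{
	assert(pattern != NULL);
	return memchr(pattern, 0, len) != NULL;
}

/*
 * Hypothetical helper: a pattern is usable if it has no NUL byte, or
 * if the backend takes an explicit length instead of relying on a
 * NUL-terminated C string. The caller would die() when this returns 0.
 */
static int pattern_is_supported(const char *pattern, size_t len,
				int backend_is_length_aware)
{
	return !has_nul(pattern, len) || backend_is_length_aware;
}
```

With a three-byte pattern such as "y\0f", strlen() reports a length of
1, so any backend that only sees a C string would silently search for
"y" alone. Dying up front avoids that surprise.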

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 Documentation/git-grep.txt |  17 
 grep.c |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++---
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index 4e8d0645a8..d6603bc950 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-   /*
-* regcomp cannot accept patterns with NULs so when using it
-* we consider any pattern containing a NUL fixed.
-*/
-   if (memchr(s, 0, len))
-   return 1;
-
-   return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 * simple string match using kws.  p->fixed tells us if we
 * want to use kws.
 */
-   if (opt->fixed ||
-   has_null(p->pattern, p->patternlen) ||
-   is_fixed(p->pattern, p->patternlen))
+   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
kwsincr(p->kws, p->pattern, p->patternlen);
kwsprep(p->kws);
return;
-   } else if (opt->fixed) {
+   }
+
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
+   if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
matches=$1
-   fla

[PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding="

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blind spot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.

We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.

Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.

But for now matching this as non-UTF8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
isn't true when we've explicitly encoded it to be in a different
encoding.
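
The resulting rule for when to ask PCRE for UTF-8 mode can be sketched
in isolation. This is a simplified illustration, not the code in grep.c:
want_utf_mode() is a hypothetical name, and the utf8_locale argument
stands in for the result of git's is_utf8_locale().

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the NUL-terminated string has any byte >= 0x80. */
static int has_non_ascii(const char *s)
{
	assert(s != NULL);
	for (; *s; s++) {
		if ((unsigned char)*s & 0x80)
			return 1;
	}
	return 0;
}

/*
 * Hypothetical helper: only request UTF-8 matching when the locale is
 * UTF-8, the pattern actually has non-ASCII bytes, and we have not
 * been told to ignore the locale (as with a non-UTF-8 "--encoding=").
 */
static int want_utf_mode(int ignore_locale, int utf8_locale,
			 const char *pattern)
{
	return !ignore_locale && utf8_locale && has_non_ascii(pattern);
}
```

So under a UTF-8 locale a non-ASCII pattern selects UTF-8 mode for plain
"log --grep", but not once ignore_locale has been set because the log
output is being re-encoded to something like ISO-8859-1.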

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c  | 8 
 grep.h  | 1 +
 revision.c  | 3 +++
 t/t4210-log-i18n.sh | 6 ++
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..1de4ab49c0 100644
--- a/grep.c
+++ b/grep.c
@@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
int options = PCRE_MULTILINE;
 
if (opt->ignore_case) {
-   if (has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && has_non_ascii(p->pattern))
p->pcre1_tables = pcre_maketables();
options |= PCRE_CASELESS;
}
-   if (is_utf8_locale() && has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
options |= PCRE_UTF8;
 
p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
@@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
p->pcre2_compile_context = NULL;
 
if (opt->ignore_case) {
-   if (has_non_ascii(p->pattern)) {
+   if (!opt->ignore_locale && has_non_ascii(p->pattern)) {
character_tables = pcre2_maketables(NULL);
p->pcre2_compile_context = pcre2_compile_context_create(NULL);
pcre2_set_character_tables(p->pcre2_compile_context, character_tables);
}
options |= PCRE2_CASELESS;
}
-   if (is_utf8_locale() && has_non_ascii(p->pattern))
+   if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
options |= PCRE2_UTF;
 
p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/grep.h b/grep.h
index 1875880f37..4bb8a79d93 100644
--- a/grep.h
+++ b/grep.h
@@ -173,6 +173,7 @@ struct grep_opt {
int funcbody;
int extended_regexp_option;
int pattern_type_option;
+   int ignore_locale;
char colors[NR_GREP_COLORS][COLOR_MAXLEN];
unsigned pre_context;
unsigned post_context;
diff --git a/revision.c b/revision.c
index 621feb9df7..a842fb158a 100644
--- a/revision.c
+++ b/revision.c
@@ -28,6 +28,7 @@
 #include "commit-graph.h"
 #include "prio-queue.h"
 #include "hashmap.h"
+#include "utf8.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 &revs->grep_filter);
+   if (!is_encoding_utf8(get_log_output_encoding()))
+   revs->grep_filter.ignore_locale = 1;
compile_grep_patterns(&revs->grep_filter);
 
if (revs->reverse && revs->reflog_info)
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 86d22c1d4c..515bcb7ce1 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 for engine in fixed basic extended perl
 do
prereq=
-   result=success
if test $engine = "perl"
then
-   result=failure
prereq="PCRE"
else
prereq=""
@@ -72,7 +70,7 @@ do
then
force_regex=.*
fi
-   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+   test_expect_success GETTEXT_LOCALE,$prereq

[PATCH v2 1/9] log tests: test regex backends in "--encode=" tests

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

In a follow-up commit we'll fix this bug; this test just demonstrates
the existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests & how liberal the kwset code is about
garbage we missed this.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. It's possible that this
test breaks the "basic" and "extended" backends on systems whose POSIX
regex functions are stricter than glibc's about encoding under the
current locale. I can't recall any such issues, but PCRE is more
careful about the validation.

1. https://public-inbox.org/git/nycvar.qro.7.76.6.1906271113090...@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t4210-log-i18n.sh | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh
 
 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh
 
 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')
 
+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
test_tick &&
cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded 
values (latin1)' '
test_must_be_empty actual
 '
 
+for engine in fixed basic extended perl
+do
+   prereq=
+   result=success
+   if test $engine = "perl"
+   then
+   result=failure
+   prereq="PCRE"
+   else
+   prereq=""
+   fi
+   force_regex=
+   if test $engine != "fixed"
+   then
+   force_regex=.*
+   fi
+   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   cat >expect <<-\EOF &&
+   latin1
+   utf8
+   EOF
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+   test_cmp expect actual
+   "
+
+   test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not find non-reencoded values (latin1 + locale)" "
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+   test_must_be_empty actual
+   "
+
+   test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine 
log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid 
needle)" "
+   LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log 
--encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+   test_must_be_empty actual
+   "
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5
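
To make the three needles in the test above concrete, here is a small
standalone validator. It is only a sketch: looks_like_utf8() is a
hypothetical helper, not the validation code used by git or PCRE, and
it checks just lead and continuation bytes (no overlong-form or
surrogate checks). It is enough to show why $utf8_e is accepted while
$latin1_e and $invalid_e are not.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only. Returns 1 if buf[0..len)
 * has a structurally valid UTF-8 byte layout, 0 otherwise.
 */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i++];
		size_t need;

		if (c < 0x80)
			need = 0;	/* plain ASCII */
		else if ((c & 0xE0) == 0xC0)
			need = 1;	/* 110xxxxx: one continuation byte */
		else if ((c & 0xF0) == 0xE0)
			need = 2;	/* 1110xxxx: two continuation bytes */
		else if ((c & 0xF8) == 0xF0)
			need = 3;	/* 11110xxx: three continuation bytes */
		else
			return 0;	/* stray continuation or invalid lead byte */

		while (need--) {
			/* every continuation byte must look like 10xxxxxx */
			if (i >= len || (buf[i] & 0xC0) != 0x80)
				return 0;
			i++;
		}
	}
	return 1;
}
```

With the byte sequences from the test script, "\303\251" passes,
"\351" fails because its lead byte promises continuation bytes that
never arrive, and "\303\50)" fails because the byte after the lead
byte does not have its top bits set to 10. That last case is the kind
of input the "byte 2 top bits not 0x80" message from PCRE refers to.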



[PATCH v2 0/9] grep: move from kwset to optional PCRE v2

2019-06-27 Thread Ævar Arnfjörð Bjarmason
A non-RFC since it seems people like this approach.

This should fix the test failure noted by Johannes. There are two new
patches at the start of this series. They address a bug that has been
there for a long time, but that I only tripped over now because PCRE
is stricter about UTF-8 validation than kwset (which doesn't care at
all).

I also added performance numbers to the relevant commit messages, took
brian's suggestion of saying "NUL-byte" instead of "\0", and did some
other copyediting of my own.

The rest of the code changes are all just comments & rewording of
previously added comments.

Ævar Arnfjörð Bjarmason (9):
  log tests: test regex backends in "--encode=" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding="
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings 
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt|  17 +++
 grep.c| 115 +++-
 grep.h|   3 +-
 revision.c|   3 +
 t/t4210-log-i18n.sh   |  39 +-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
 t/t7816-grep-binary-pattern.sh| 127 ++
 8 files changed, 233 insertions(+), 172 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => 
t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
 -:  -- >  1:  cfc01f49d3 log tests: test regex backends in 
"--encode=" tests
 -:  -- >  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log 
--encoding="
 1:  ad55d3be7e =  3:  cc4d3b50d5 grep: inline the return value of a function 
call used only once
 2:  650bcc8582 =  4:  d9b29bdd89 grep tests: move "grep binary" alongside the 
rest
 3:  ef10a8820d !  5:  f85614f435 grep tests: move binary pattern tests into 
their own file
@@ -2,9 +2,10 @@
 
 grep tests: move binary pattern tests into their own file
 
-Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
-into their own file. I added most of these tests in 966be95549 ("grep:
-add tests to fix blind spots with \0 patterns", 2017-05-20).
+Move the tests for "-f <file>" where "<file>" contains a NUL byte
+pattern into their own file. I added most of these tests in
+966be95549 ("grep: add tests to fix blind spots with \0 patterns",
+2017-05-20).
 
 Whether a regex engine supports matching binary content is very
 different from whether it matches binary patterns. Since
@@ -14,8 +15,8 @@
 engine can sensibly match binary patterns.
 
 Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
-patterns containing "\0" and considering them fixed, except in cases
-where "--ignore-case" is provided and they're non-ASCII, see
+patterns containing NUL-byte and considering them fixed, except in
+cases where "--ignore-case" is provided and they're non-ASCII, see
 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
 2016-06-25). Subsequent commits will change this behavior.
 
 4:  03e5637efc !  6:  90afca8707 grep: make the behavior for \0 in patterns 
sane
@@ -1,12 +1,13 @@
 Author: Ævar Arnfjörð Bjarmason 
 
-grep: make the behavior for \0 in patterns sane
+grep: make the behavior for NUL-byte in patterns sane
 
-The behavior of "grep" when patterns contained "\0" has always been
-haphazard, and has served the vagaries of the implementation more than
-anything else. A "\0" in a pattern can only be provided via "-f
-<file>", and since pickaxe (log search) has no such flag "\0" in
-patterns has only ever been supported by "grep".
+The behavior of "grep" when patterns contained a NUL-byte has always
+been haphazard, and has served the vagaries of the implementation more
+than anything else. A pattern containing a NUL-byte can only be
+provided via "-f <file>". Since pickaxe (log search) has no such flag
+the NUL-byte in patterns has only ever been supported by "grep" (and
+not &

[PATCH v2 3/9] grep: inline the return value of a function call used only once

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 1de4ab49c0..4e8d0645a8 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, 
struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-   int ascii_only;
int err;
int regflags = REG_NEWLINE;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
-   ascii_only = !has_non_ascii(p->pattern);
 
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (opt->fixed ||
has_null(p->pattern, p->patternlen) ||
is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || ascii_only;
+   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5
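
The condition touched by this patch can be read as a small decision
function. The sketch below is an assumption-laden restatement, not the
code in grep.c: the *_sketch helpers are hypothetical stand-ins for
is_fixed(), is_regex_special() and has_non_ascii(), the set of special
characters is an approximation, and the has_null() leg of the real
condition is left out for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: an approximate set of regex metacharacters. */
int is_regex_special_sketch(char c)
{
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* A pattern is "fixed" if it contains no regex metacharacters. */
int is_fixed_sketch(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (is_regex_special_sketch(s[i]))
			return 0;
	return 1;
}

/* Any byte with the high bit set counts as non-ASCII. */
int has_non_ascii_sketch(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)s[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Mirrors "p->fixed = !p->ignore_case || !has_non_ascii(p->pattern)":
 * the fixed-string path is taken unless we would have to case-fold
 * non-ASCII bytes. Thanks to short-circuit evaluation the non-ASCII
 * scan only runs when ignore_case is set.
 */
int use_fixed_path(const char *pat, size_t len, int opt_fixed, int ignore_case)
{
	if (!(opt_fixed || is_fixed_sketch(pat, len)))
		return 0;
	return !ignore_case || !has_non_ascii_sketch(pat, len);
}
```

For example, "foo" takes the fixed path, "f.o" only does so under -F,
and a non-ASCII needle such as "\303\251" falls off the fixed path as
soon as -i is given.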



[PATCH v2 5/9] grep tests: move binary pattern tests into their own file

2019-06-27 Thread Ævar Arnfjörð Bjarmason
Move the tests for "-f <file>" where "<file>" contains a NUL byte
pattern into their own file. I added most of these tests in
966be95549 ("grep: add tests to fix blind spots with \0 patterns",
2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
patterns containing NUL-byte and considering them fixed, except in
cases where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t7815-grep-binary.sh | 101 -
 t/t7816-grep-binary-pattern.sh | 114 +
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-   matches=$1
-   flags=$2
-   pattern=$3
-   pattern_human=$(echo "$pattern" | sed 's/Q//g')
-
-   if test "$matches" = 1
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = 0
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   elif test "$matches" = T1
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = T0
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   else
-   test_expect_success "PANIC: Test framework error. Unknown 
matches value $matches" 'false'
-   fi
-}
-
 test_expect_success 'setup' "
echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 ''

Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2

2019-06-27 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 27 2019, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Wed, 26 Jun 2019, Johannes Schindelin wrote:
>
>> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>>
>> > This speeds things up a lot, but as shown in the patches & tests
>> > changed modifies the behavior where we have \0 in *patterns* (only
>> > possible with 'grep -f <file>').
>>
>> I agree that it is not worth a lot to care about NULs in search patterns.
>>
>> So I am in favor of the goal of this patch series.
>
> There seems to be a Windows-specific test failure:
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
>
> The output is this:
>
> -- snip --
> not ok 5 - log --grep does not find non-reencoded values (latin1)
>
> expecting success:
>   git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> &&
>   test_must_be_empty actual
>
> ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> not 0x80
> -- snap --
>
> Any quick ideas? (I _could_ imagine that it is yet another case of passing
> non-UTF-8-encoded stuff via command-line vs via file, which does not work
> on Windows.)

This is an existing issue that my patches just happen to uncover. I'm
working on a v2 which'll fix it.


Re: fprintf_ln() is slow

2019-06-27 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 27 2019, Duy Nguyen wrote:

> On Thu, Jun 27, 2019 at 1:00 PM Jeff King  wrote:
>>
>> On Thu, Jun 27, 2019 at 01:25:15AM -0400, Jeff King wrote:
>>
>> > Taylor and I noticed a slowdown in p1451 between v2.20.1 and v2.21.0. I
>> > was surprised to find that it bisects to bbb15c5193 (fsck: reduce word
>> > legos to help i18n, 2018-11-10).
>> >
>> > The important part, as it turns out, is the switch to using fprintf_ln()
>> > instead of a regular fprintf() with a "\n" in it. Doing this:
>> > [...]
>> > on top of the current tip of master yields this result:
>> >
>> >   Test                                            HEAD^             HEAD
>> >   ----------------------------------------------------------------------------------------
>> >   1451.3: fsck with 0 skipped bad commits         9.78(7.46+2.32)   8.74(7.38+1.36) -10.6%
>> >   1451.5: fsck with 1 skipped bad commits         9.78(7.66+2.11)   8.49(7.04+1.44) -13.2%
>> >   1451.7: fsck with 10 skipped bad commits        9.83(7.45+2.37)   8.53(7.26+1.24) -13.2%
>> >   1451.9: fsck with 100 skipped bad commits       9.87(7.47+2.40)   8.54(7.24+1.30) -13.5%
>> >   1451.11: fsck with 1000 skipped bad commits     9.79(7.67+2.12)   8.48(7.25+1.23) -13.4%
>> >   1451.13: fsck with 10000 skipped bad commits    9.86(7.58+2.26)   8.38(7.09+1.28) -15.0%
>> >   1451.15: fsck with 100000 skipped bad commits   9.58(7.39+2.19)   8.41(7.21+1.19) -12.2%
>> >   1451.17: fsck with 1000000 skipped bad commits  6.38(6.31+0.07)   6.35(6.26+0.07) -0.5%
>>
>> Ah, I think I see it.
>>
>> See how the system times for HEAD^ (with fprintf_ln) are higher? We're
>> flushing stderr more frequently (twice as much, since it's unbuffered,
>> and we now have an fprintf followed by a putc).
>>
>> I can get similar speedups by formatting into a buffer:
>>
>> diff --git a/strbuf.c b/strbuf.c
>> index 0e18b259ce..07ce9b9178 100644
>> --- a/strbuf.c
>> +++ b/strbuf.c
>> @@ -880,8 +880,22 @@ int printf_ln(const char *fmt, ...)
>>
>>  int fprintf_ln(FILE *fp, const char *fmt, ...)
>>  {
>> +   char buf[1024];
>> int ret;
>> va_list ap;
>> +
>> +   /* Fast path: format it ourselves and dump it via fwrite. */
>> +   va_start(ap, fmt);
>> +   ret = vsnprintf(buf, sizeof(buf), fmt, ap);
>> +   va_end(ap);
>> +   if (ret < sizeof(buf)) {
>> +   buf[ret++] = '\n';
>> +   if (fwrite(buf, 1, ret, fp) != ret)
>> +   return -1;
>> +   return ret;
>> +   }
>> +
>> +   /* Slow path: a normal fprintf/putc combo */
>> va_start(ap, fmt);
>> ret = vfprintf(fp, fmt, ap);
>> va_end(ap);
>>
>> But we shouldn't have to resort to that. We can use setvbuf() to toggle
>> buffering back and forth, but I'm not sure if there's a way to query the
>> current buffering scheme for a stdio stream. We'd need that to be able
>> to switch back correctly (and to avoid switching for things that are
>> already buffered).
>>
>> I suppose it would be enough to check for "fp == stderr", since that is
>> the only unbuffered thing we'd generally see.
>>
>> And it may be that the code above is really not much different anyway.
>> For an unbuffered stream, I'd guess it dumps an fwrite() directly to
>> write() anyway (since by definition it does not need to hold onto it,
>> and nor is there anything in the buffer ahead of it).
>>
>> Something like:
>>
>>   char buf[1024];
>>   if (fp == stderr)
>>     setvbuf(fp, buf, _IOLBF, sizeof(buf));
>>
>>   ... do fprintf and putc ...
>>
>>   if (fp == stderr)
>>     setvbuf(fp, NULL, _IONBF, 0);
>>
>> feels less horrible, but it's making the assumption that we were
>> unbuffered coming into the function. I dunno.
>
> How about doing all the formatting in a strbuf and only calling
> fwrite() at the last minute? That adds a bit more overhead from
> malloc(), so I don't know if it's an improvement or not.

Why shouldn't we just move back to plain fprintf() with "\n"? Your
9a0a30aa4b ("strbuf: convenience format functions with \n automatically
appended", 2012-04-23) doesn't explain why this is a convenience for
translators.

When I'm translating I tend to like knowing explicitly that something
ends in a newline, so why do we need to hide that from translators? They
also need to deal with a trailing \n in other messages, so these *_ln()
functions make things inconsistent.

It's also not possible for translators to drop or add a trailing newline
by mistake without being caught, because msgfmt will catch this (and
other common issues):

po/de.po:23: 'msgid' and 'msgstr' entries do not both end with '\n'
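
For reference, the "format into a buffer, then write once" idea from
the quoted patch can be sketched as a standalone function. This is an
illustration under assumptions, not the code that ended up in
strbuf.c: fprintf_ln_sketch() is a hypothetical name, the 1024-byte
buffer size is taken from the quoted diff, and the return value is the
number of bytes written including the newline.

```c
#include <stdarg.h>
#include <stdio.h>

int fprintf_ln_sketch(FILE *fp, const char *fmt, ...)
{
	char buf[1024];
	va_list ap;
	int ret;

	/* Fast path: format into a local buffer and emit it in one call. */
	va_start(ap, fmt);
	ret = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	if (ret >= 0 && (size_t)ret < sizeof(buf)) {
		/* Overwrite the terminating NUL with the newline. */
		buf[ret++] = '\n';
		if (fwrite(buf, 1, (size_t)ret, fp) != (size_t)ret)
			return -1;
		return ret;
	}

	/* Slow path: the message did not fit, use fprintf plus putc. */
	va_start(ap, fmt);
	ret = vfprintf(fp, fmt, ap);
	va_end(ap);
	if (ret < 0 || putc('\n', fp) == EOF)
		return -1;
	return ret + 1;
}
```

On an unbuffered stream such as stderr the fast path should result in
a single write for the whole line instead of one for the message and
one for the newline, which is the effect discussed above. Going back
to plain fprintf() with an explicit "\n" in the format string would
achieve the same thing without the extra buffer.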


[RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2

2019-06-25 Thread Ævar Arnfjörð Bjarmason
This speeds things up a lot but, as shown in the patches and the
changed tests, it modifies the behavior where we have \0 in *patterns*
(only possible with 'grep -f <file>').

I'd like to go down this route because it makes dropping kwset a lot
easier, and I don't think bending over backwards to support these \0
patterns is worth it.

But maybe others disagree, so I wanted to send what I had before I
tried tackling the pickaxe code. There I figured I'd just make -G's
ERE be a PCRE if we had the PCRE v2 backend, since unlike "grep"'s
default BRE the ERE syntax is mostly a subset of PCRE, but again
others might think that's too aggressive and would prefer to keep the
distinction, only using PCRE there in place of our current use of
kwset.

Ævar Arnfjörð Bjarmason (7):
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for \0 in patterns sane
  grep: drop support for \0 in --fixed-strings 
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt|  17 +++
 grep.c| 103 ++
 grep.h|   2 -
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --
 t/t7816-grep-binary-pattern.sh| 127 ++
 6 files changed, 183 insertions(+), 167 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => 
t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

-- 
2.22.0.455.g172b71a6c5



[RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file

2019-06-25 Thread Ævar Arnfjörð Bjarmason
Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
into their own file. I added most of these tests in 966be95549 ("grep:
add tests to fix blind spots with \0 patterns", 2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
patterns containing "\0" and considering them fixed, except in cases
where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t7815-grep-binary.sh | 101 -
 t/t7816-grep-binary-pattern.sh | 114 +
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-   matches=$1
-   flags=$2
-   pattern=$3
-   pattern_human=$(echo "$pattern" | sed 's/Q//g')
-
-   if test "$matches" = 1
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = 0
-   then
-   test_expect_success "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   elif test "$matches" = T1
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   git grep -f f $flags a
-   "
-   elif test "$matches" = T0
-   then
-   test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-   printf '$pattern' | q_to_nul >f &&
-   test_must_fail git grep -f f $flags a
-   "
-   else
-   test_expect_success "PANIC: Test framework error. Unknown 
matches value $matches" 'false'
-   fi
-}
-
 test_expect_success 'setup' "
echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 &#

[RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane

2019-06-25 Thread Ævar Arnfjörð Bjarmason
The behavior of "grep" when patterns contained "\0" has always been
haphazard, and has served the vagaries of the implementation more than
anything else. A "\0" in a pattern can only be provided via "-f
<file>", and since pickaxe (log search) has no such flag "\0" in
patterns has only ever been supported by "grep".

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a \0. Instead die if e.g. --basic-regexp is combined with
such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 Documentation/git-grep.txt |  17 
 grep.c |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++---
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index d3e6111c46..261bd3a342 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-   /*
-* regcomp cannot accept patterns with NULs so when using it
-* we consider any pattern containing a NUL fixed.
-*/
-   if (memchr(s, 0, len))
-   return 1;
-
-   return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
 * simple string match using kws.  p->fixed tells us if we
 * want to use kws.
 */
-   if (opt->fixed ||
-   has_null(p->pattern, p->patternlen) ||
-   is_fixed(p->pattern, p->patternlen))
+   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
kwsincr(p->kws, p->pattern, p->patternlen);
kwsprep(p->kws);
return;
-   } else if (opt->fixed) {
+   }
+
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f <file>). This 
is only supported with -P under PCRE v2"));
+
+   if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
matches=$1
-   flags=$2
-   pattern=$3
+   prereqs=$2
+   lc_all=$3
+   extra

[RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings

2019-06-25 Thread Ævar Arnfjörð Bjarmason
Change "-f <file>" to not support patterns with "\0" in them under
--fixed-strings; we'll now only support these under --perl-regexp with
PCRE v2.

A previous change to Documentation/git-grep.txt changed the
description of "-f <file>" to be vague enough as to not promise that
this would work, and by dropping support for this we make it a whole
lot easier to move away from the kwset backend, which a subsequent
change will try to do.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +-
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/grep.c b/grep.c
index 261bd3a342..14570c7ac1 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
 
+   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+   die(_("given pattern contains NULL byte (via -f <file>). This 
is only supported with -P under PCRE v2"));
+
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
return;
}
 
-   if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-   die(_("given pattern contains NULL byte (via -f <file>). This 
is only supported with -P under PCRE v2"));
-
if (opt->fixed) {
/*
 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0 '-i&
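
After this patch the NUL-byte check runs before the fixed-string
decision, so the rule the tests above encode can be summarized in a
few lines. The function below is a hypothetical restatement for
illustration (nul_pattern_allowed() does not exist in grep.c, and the
real code calls die() instead of returning 0):

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the pattern may be compiled, 0 if the caller should
 * refuse it. Patterns are length-delimited, so an embedded NUL has to
 * be found with memchr() rather than with C-string functions.
 */
int nul_pattern_allowed(const char *pattern, size_t len, int using_pcre2)
{
	if (!memchr(pattern, 0, len))
		return 1;	/* no NUL byte: every backend can handle it */
	return using_pcre2 != 0;	/* NUL byte: only -P with PCRE v2 */
}
```

In the nul_match tables, the rows marked "P" in the first two columns
correspond to the case where this returns 0 (the non-PCRE backends now
refuse the pattern), and the 1 or 0 in the last column is the
match result under -P with PCRE v2.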

[RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search

2019-06-25 Thread Ævar Arnfjörð Bjarmason
Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1], with the kwset
backend we were slower than PCRE v1 and v2 with JIT, so that
optimization was counterproductive.

This brings back the optimization for "-F", without changing the
semantics of "\0" in patterns. As seen in previous commits in this
series we could support it now, but I'd rather just leave that
edge-case aside so the tests don't need to do one thing or the other
depending on what --fixed-strings backend we're using.

I could also support the v1 backend here, but that would make the code
more complex, and I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.

1. https://public-inbox.org/git/87v9x793qi@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4716217837..6b75d5be68 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct 
grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+   size_t i;
+
+   for (i = 0; i < len; i++) {
+   if (is_regex_special(s[i]))
+   return 0;
+   }
+
+   return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, 
const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, 
struct grep_opt *opt)
compile_regexp_failed(p, errbuf);
}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
int err;
int regflags = REG_NEWLINE;
+   int pat_is_fixed;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
@@ -636,8 +649,38 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
-   if (opt->fixed) {
+   pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+   if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+   opt->pcre2 = 1;
+   if (pat_is_fixed) {
+   compile_pcre2_pattern(p, opt);
+   } else {
+   /*
+* E.g. t7811-grep-open.sh relies on the
+* pattern being restored, and unfortunately
+* there's no PCRE compile flag for "this is
+* fixed", so we need to munge it to
+* "\Q\E".
+*/
+   char *old_pattern = p->pattern;
+   size_t old_patternlen = p->patternlen;
+   struct strbuf sb = STRBUF_INIT;
+
+   strbuf_add(&sb, "\\Q", 2);
+   strbuf_add(&sb, p->pattern, p->patternlen);
+   strbuf_add(&sb, "\\E", 2);
+
+   p->pattern = sb.buf;
+   p->patternlen = sb.len;
+   compile_pcre2_pattern(p, opt);
+   p->pattern = old_pattern;
+   p->patternlen = old_patternlen;
+   strbuf_release(&sb);
+   }
+#else
compile_fixed_regexp(p, opt);
+#endif
return;
}
 
-- 
2.22.0.455.g172b71a6c5
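
As a rough illustration of the pattern munging in the patch above,
here is a minimal shell sketch. It is not the grep.c code: the helper
names are made up, and the set of characters treated as "special" is
an assumption standing in for is_regex_special(). A pattern without
special characters is passed through as-is, anything else is wrapped
in "\Q" and "\E" so PCRE v2 takes it literally.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the pattern contains none of the
# characters treated as regex metacharacters in this sketch. The
# character set is an assumption, not copied from git's ctype table.
is_fixed () {
	case "$1" in
	*'$'*|*'('*|*')'*|*'*'*|*'+'*|*'.'*|*'?'*|*'['*|*'\'*|*'^'*|*'{'*|*'|'*)
		return 1 ;;
	esac
	return 0
}

# Hypothetical helper: print the pattern that would be handed to the
# PCRE v2 compile step for a fixed-string search.
pcre2_pattern_for () {
	if is_fixed "$1"
	then
		# No metacharacters, the pattern is already literal.
		printf '%s\n' "$1"
	else
		# Quote everything between \Q and \E.
		printf '\\Q%s\\E\n' "$1"
	fi
}

pcre2_pattern_for 'foo bar'
pcre2_pattern_for 'a.b'
```

One caveat with this kind of wrapping: a pattern that itself contains
"\E" would end the quoting early, so a real implementation would need
to account for that.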



[RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest

2019-06-25 Thread Ævar Arnfjörð Bjarmason
Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => 
t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh 
b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5



[RFC/PATCH 1/7] grep: inline the return value of a function call used only once

2019-06-25 Thread Ævar Arnfjörð Bjarmason
Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..d3e6111c46 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, 
struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-   int ascii_only;
int err;
int regflags = REG_NEWLINE;
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
-   ascii_only = !has_non_ascii(p->pattern);
 
/*
 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
if (opt->fixed ||
has_null(p->pattern, p->patternlen) ||
is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || ascii_only;
+   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
if (p->fixed) {
p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5



[RFC/PATCH 6/7] grep: remove the kwset optimization

2019-06-25 Thread Ævar Arnfjörð Bjarmason
A later change will replace this optimization with a different one,
but as removing it and running the tests demonstrates, no grep
semantics depend on this backend anymore.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 grep.c | 63 +++---
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/grep.c b/grep.c
index 14570c7ac1..4716217837 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct 
grep_pat *p,
die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-   size_t i;
-
-   for (i = 0; i < len; i++) {
-   if (is_regex_special(s[i]))
-   return 0;
-   }
-
-   return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt 
*opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct 
grep_opt *opt)
 
p->word_regexp = opt->word_regexp;
p->ignore_case = opt->ignore_case;
+   p->fixed = opt->fixed;
 
if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
die(_("given pattern contains NULL byte (via -f ). This 
is only supported with -P under PCRE v2"));
 
-   /*
-* Even when -F (fixed) asks us to do a non-regexp search, we
-* may not be able to correctly case-fold when -i
-* (ignore-case) is asked (in which case, we'll synthesize a
-* regexp to match the pattern that matches regexp special
-* characters literally, while ignoring case differences).  On
-* the other hand, even without -F, if the pattern does not
-* have any regexp special characters and there is no need for
-* case-folding search, we can internally turn it into a
-* simple string match using kws.  p->fixed tells us if we
-* want to use kws.
-*/
-   if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-   p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-   if (p->fixed) {
-   p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-   kwsincr(p->kws, p->pattern, p->patternlen);
-   kwsprep(p->kws);
-   return;
-   }
-
if (opt->fixed) {
-   /*
-* We come here when the pattern has the non-ascii
-* characters we cannot case-fold, and asked to
-* ignore-case.
-*/
compile_fixed_regexp(p, opt);
return;
}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
case GREP_PATTERN: /* atom */
case GREP_PATTERN_HEAD:
case GREP_PATTERN_BODY:
-   if (p->kws)
-   kwsfree(p->kws);
-   else if (p->pcre1_regexp)
+   if (p->pcre1_regexp)
free_pcre1_regexp(p);
else if (p->pcre2_pattern)
free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char 
*name)
opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-   regmatch_t *match)
-{
-   struct kwsmatch kwsm;
-   size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-   if (offset == -1) {
-   match->rm_so = match->rm_eo = -1;
-   return REG_NOMATCH;
-   } else {
-   match->rm_so = offset;
-   match->rm_eo = match->rm_so + kwsm.size[0];
-   return 0;
-   }
-}
-
 static int patmatch(struct grep_pat *p, char *line, char *eol,
regmatch_t *match, int eflags)
 {
int hit;
 
-   if (p->fixed)
-   hit = !fixmatch(p, line, eol, match);
-   else if (p->pcre1_regexp)
+   if (p->pcre1_regexp)
hit = !pcre1match(p, line, eol, match, eflags);
else if (p->pcre2_pattern)
hit = !pcre2match(p, line, eol, match, eflags);
diff --git a/grep.h b/grep.h
index 1875880f37..90ca435aad 100644
--- a/grep.h
+++ b/grep.h
@@ -32,7 +32,6 @@ typedef int pcre2_compile_context;
 typedef int pcre2_match_context;
 typedef int pcre2_jit_stack;
 #endif
-#include "kwset.h"
 #include "thread-utils.h"
 #include "userdiff.h"
 
@@ -97,7 +96,6 @@ struct grep_pat {
pcre2_match_context *pcre2_match_context;
pcre2_jit_stack *pcre2_jit_stack;
uint32_t pcre2_jit_on;
-   kwset_t kws;
unsigned fixed:1;
unsigned ignore_case:1;
unsigned word_regexp:1;
-- 
2.22.0.455.g172b71a6c5



Re: 2.22.0 repack -a duplicating pack contents

2019-06-24 Thread Ævar Arnfjörð Bjarmason


On Mon, Jun 24 2019, Jeff King wrote:

> On Sun, Jun 23, 2019 at 06:08:25PM +, Eric Wong wrote:
>
>> > I'm not sure of the right solution. For maximal backwards-compatibility,
>> > the default for bitmaps could become "if not bare and if there are no
>> > .keep files". But that would mean bitmaps sometimes not getting
>> > generated because of the problems that ee34a2bead was trying to solve.
>> >
>> > That's probably OK, though; you can always flip the bitmap config to
>> > "true" yourself if you _must_ have bitmaps.
>>
>> What about something like this?  Needs tests but I need to leave, now.
>
> Yeah, I think that's the right direction.
>
> Though...
>
>> +static int has_pack_keep_file(void)
>> +{
>> +DIR *dir;
>> +struct dirent *e;
>> +int found = 0;
>> +
>> +if (!(dir = opendir(packdir)))
>> +return found;
>> +
>> +while ((e = readdir(dir)) != NULL) {
>> +if (ends_with(e->d_name, ".keep")) {
>> +found = 1;
>> +break;
>> +}
>> +}
>> +closedir(dir);
>> +return found;
>> +}
>
> I think this can be replaced with just checking p->pack_keep for each
> item in the packed_git list.
>
> That's racy, but then so is your code here, since it's really the child
> pack-objects which is going to deal with the .keep. I don't think we
> need to care much about the race, though. Either:
>
>   1. Somebody has made an old intentional .keep, which would not be
>  racy. We'd see it in both places.
>
>   2. Somebody _just_ made an intentional .keep; we'll race with that and
>  maybe duplicate objects from the kept pack. But this is a rare
>  occurrence, and there's no real ordering promise here anyway with
>  somebody creating .keep files alongside a running repack.
>
>   3. An incoming fetch/push may create a .keep file as a temporary lock,
>  which we see here but which goes away by the time pack-objects
>  runs. That's OK; we err on the side of not generating bitmaps, but
>  they're an optimization anyway (and if you really insist on having
>  them, you should tell Git to definitely make them instead of
>  relying on this default behavior).

This sort of thing (#3) strikes me as a fairly pathological case we
should try to avoid. Now that we've turned on bitmaps by default people
will take the sort of performance increase noted in [1] for granted.

So they'll be happily running with that & then get a CPU/IO spike as the
*.bitmap files they'd been implicitly relying on for years in their
default config go away, only to have them re-appear when "repack" runs
next.

I can't think of some great solution for this case, some thoughts:

 a. Perhaps we should split the *.keep flag into two things or
more.

We're using it for all of "I want this *.pack forever"
(e.g. debugging) and "I want only this *.pack to contain the data
found in it" (I/O & CPU optimization, what Janos wants) and "I'm
git.git code avoiding a race with myself" (what you describe in #3).

So maybe for the last of those we could also use and understand
*.tmp-keep, at which point we wouldn't have this race described in
#3. The 1st of those is a *.noprune and the 2nd is *.highlander (but
whether it's worth splitting all that out vs. just having
*.tmp-keep is another matter).

 b. Shouldn't we at least print some warning to STDERR in this case so
e.g. gc.log will note the performance degradation of the repo in its
current configuration?
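
To make (b) concrete, here is a minimal shell sketch of "default
bitmaps to off when a .keep file exists, and say so on STDERR". It is
only a model of the idea: the function names and the warning text are
made up, and the real check would live in the C code for "repack".

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given pack directory contains
# at least one *.keep file.
has_pack_keep_file () {
	for keep in "$1"/*.keep
	do
		# If nothing matched, the glob is left as-is and does not exist.
		test -e "$keep" && return 0
	done
	return 1
}

# Hypothetical helper: print the default for writing bitmaps, warning
# on stderr when a .keep file turns the default off.
bitmap_default () {
	if has_pack_keep_file "$1"
	then
		echo "warning: .keep file in $1, not writing bitmaps by default" >&2
		echo false
	else
		echo true
	fi
}
```

Something like `bitmap_default objects/pack` would then print "true"
or "false", with the warning ending up in e.g. gc.log.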

>   4. Like (3), but we _don't _see the temporary .keep here but _do_ see
>  it during pack-objects. That's OK, because we'll have told
>  pack-objects to pack those objects anyway, which is the right
>  thing.
>
> -Peff

1. https://github.blog/2015-09-22-counting-objects/


Re: 2.22.0 repack -a duplicating pack contents

2019-06-23 Thread Ævar Arnfjörð Bjarmason


On Sun, Jun 23 2019, Janos Farkas wrote:

> I'm using .keep files to... well.. keep packs to avoid some CPU time
> spent on repacking huge packs and make the process somewhat more
> incremental.
>
> Something changed with 22.2.0.  Now .bitmap files are also created,
> and no simple repacks re-create the pack data in a completely new
> file, wasting quite some storage:
>
> 02d03::master> find objects/pack/pack* -type f|xargs ls -sht
> 108K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.bitmap
> 524K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.idx
> 4.7M objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.pack
> 108K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.bitmap
> 524K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.idx
> 4.6M objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack
> 116K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.bitmap
> 524K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.idx
> 4.6M objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack
>0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep
> 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap
> 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx
> 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack
> 02d03::master > git repack -af
> Enumerating objects: 19001, done.
> Counting objects: 100% (19001/19001), done.
> Delta compression using up to 2 threads
> Compressing objects: 100% (18952/18952), done.
> Writing objects: 100% (19001/19001), done.
> warning: ignoring extra bitmap file:
> ./objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack
> warning: ignoring extra bitmap file:
> ./objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack
> warning: ignoring extra bitmap file:
> ./objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack
> Reusing bitmaps: 104, done.
> Selecting bitmap commits: 2550, done.
> Building bitmaps: 100% (130/130), done.
> Total 19001 (delta 14837), reused 4162 (delta 0)
> 02d03::master > find objects/pack/pack* -type f|xargs ls -sht
> 108K objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.bitmap
> 524K objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.idx
> 4.6M objects/pack/pack-8702a2550b7e29940af8bc62bc6fca011ccbd455.pack   <= 
> 108K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.bitmap
> 524K objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.idx
> 4.7M objects/pack/pack-879f2c28d15e57d353eb8e0ddbcb540655c844c9.pack
> 108K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.bitmap
> 524K objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.idx
> 4.6M objects/pack/pack-e7a7aebfc6dc6b1431f6f56bb8b2f7e730cc4a0c.pack
> 116K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.bitmap
> 524K objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.idx
> 4.6M objects/pack/pack-994c76cb1999e3b29552677d05e6364e6be2ae5e.pack
>0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep
> 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap
> 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx
> 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack
>
> The ccbd455 pack and its metadata seem quite pointless to be
> containing apparently all the data based on the size.
>
> If I use -ad, a new pack is still created,which, judging by the size,
> is essentially everything again, (but at least the extra packs are
> removed)
>
> 02d03::master> git repack -ad
> Enumerating objects: 19001, done.
> Counting objects: 100% (19001/19001), done.
> Delta compression using up to 2 threads
> Compressing objects: 100% (4114/4114), done.
> Writing objects: 100% (19001/19001), done.
> warning: ignoring extra bitmap file:
> ./objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack
> Reusing bitmaps: 104, done.
> Selecting bitmap commits: 2550, done.
> Building bitmaps: 100% (130/130), done.
> Total 19001 (delta 14838), reused 19001 (delta 14838)
> 02d03::master 9060> find objects/pack/pack* -type f|xargs ls -sht
> 116K objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.bitmap
> 524K objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.idx
> 4.6M objects/pack/pack-46ab64716d4220aac8d53b380d90a264d5293d3d.pack   <= 
>0 objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.keep
> 108K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.bitmap
> 524K objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.idx
> 4.6M objects/pack/pack-e5b8848e7c1096274dba2430323ccaf5320c6846.pack
>
> Previously, the kept pack would be kept, and no additional packs would
> be created if no new objects were born in the repro.
>
> With the .keep placeholder removed, the duplication does not happen,
> but all the repro is rewritten into a new pack, which does not look
> correct.  Am I doing something u

Re: Deadname rewriting

2019-06-21 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 21 2019, Phil Hord wrote:

> On Sat, Jun 15, 2019 at 1:19 AM Ævar Arnfjörð Bjarmason
>  wrote:
>> On Sat, Jun 15 2019, Phil Hord wrote:
>>
>> > At $work we have a long time employee who has changed their name from
>> > Alice to Bob.  Bob doesn't want anyone to call him "Alice" anymore and
>> > is prone to be offended if they do.  This is called "deadnaming".
> ...
>> What should be done is to extend the .mailmap support to other
>> cases. I.e. make tools like blame, shortlog etc. show the equivalent of
>> %aN and %aE by default.
>
> It seems that shortlog and blame do use %aE and %aN by default.  Even
> log does.  It is only because I didn't know about %aN 10 years ago
> that my custom log format does not.
>
> It's a pity the format author has the option to ignore the mailmap. I
> think it's a choice commonly made by mistake rather than intention.  I
> wonder if anyone would mind a forced-override config.  Maybe a force
> flag in the .mailmap file itself.
>
>   
>Other Authornick2 
>Alice Doe--force

Yeah I'm sure a lot of people who do %an really mean %aN, but blanket
forcing it seems a recipe for breakage since "log" and friends are also
used as plumbing where you really mean "what does it say in this commit
object".

E.g. I use %an intentionally for a company-internal tool to map an Alice
to Bob for reporting purposes, which presumably you'd also want.

But yeah, there'll be other uses that didn't intend it. I think probably
the best way forward is to just make git use %aN by default in
porcelain, and outside users presumably would get reports about such
issues eventually in cases like this where someone cared.

>> This topic was discussed at the last git contributor summit (brought up
>> by CB Bailey) resulting in this patch, which I see didn't make it in &
>> needs to be resurrected again:
>> https://public-inbox.org/git/20181212171052.13415-1...@hashpling.org/
>
> Thanks for the link.
>
> I didn't know about config options for mailmap.file and log.mailmap
> before. These do make this option much more useful, especially when we
> can insert default settings for them into /etc/gitconfig across the
> company.

Right, and to the extent that we don't --use-mailmap by default I think
that's mainly because nobody's cared enough to advocate for it. I think
it would be a sensible default.


Re: [PATCH 1/1] t0001: fix on case-insensitive filesystems

2019-06-21 Thread Ævar Arnfjörð Bjarmason


On Sun, Jun 09 2019, brian m. carlson wrote:

> On 2019-06-08 at 14:43:43, Johannes Schindelin via GitGitGadget wrote:
>> diff --git a/t/t0001-init.sh b/t/t0001-init.sh
>> index 42a263cada..f54a69e2d9 100755
>> --- a/t/t0001-init.sh
>> +++ b/t/t0001-init.sh
>> @@ -307,10 +307,20 @@ test_expect_success 'init prefers command line to 
>> GIT_DIR' '
>>  test_path_is_missing otherdir/refs
>>  '
>>
>> +downcase_on_case_insensitive_fs () {
>> +test false = "$(git config --get core.filemode)" || return 0
>> +for f
>
> TIL that “for f” is equivalent to “for f in "$@"”. Thanks for teaching
> me something new.

See also test_have_prereq in test-lib-functions.sh where this trick is
combined with IFS to loop over a "param,like,this" split by ",".
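
For anyone who hasn't seen either trick, a minimal sketch follows. The
function names are made up for illustration, and this only mirrors the
idea, it is not the test_have_prereq code.

```shell
#!/bin/sh
# "for f" without an "in" list iterates over the positional parameters.
print_args () {
	for f
	do
		echo "arg: $f"
	done
}

# Split a "param,like,this" string on "," by temporarily changing IFS,
# then loop over the resulting positional parameters.
print_csv () {
	save_IFS=$IFS
	IFS=,
	set -- $1
	IFS=$save_IFS
	for item
	do
		echo "item: $item"
	done
}

print_args one "two words"
print_csv 'a,b c,d'
```

Since IFS only contains "," during the split, "b c" stays one item.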


Re: [PATCH] tests: mark two failing tests under FAIL_PREREQS

2019-06-21 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 21 2019, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Thu, 20 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>
>> Fix a couple of tests that would potentially fail under
>> GIT_TEST_FAIL_PREREQS=true.
>>
>> I missed these when annotating other tests in dfe1a17df9 ("tests: add
>> a special setup where prerequisites fail", 2019-05-13) because on my
>> system I can only reproduce this failure when I run the tests as
>> "root", since the tests happen to depend on whether we can fall back
>> on GECOS info or not. I.e. they'd usually fail to look up the ident
>> info anyway, but not always.
>
> I had to read the commit message (in particular the oneline) a couple of
> times, and I have to admit that I wish it was a bit clearer...
>
> From the explanation, I would have assumed that those two test cases fail
> often, anyway, so they shouldn't care whether `FAIL_PREREQS` is in effect.
>
> The only reason why they should be exempt from the `FAIL_PREREQS` mode
> that I can think of is that later test cases would depend on them, but how
> can they? Those test cases would also have to have the `AUTOIDENT` prereq,
> and they would be skipped under `FAIL_PREREQS`, too, no?

The test doesn't depend on "AUTOIDENT", but "!AUTOIDENT", i.e. the
negated version. The effect of the FAIL_PREREQS mode is to set all
prereqs to false, and therefore "test_have_prereq AUTOIDENT" is false,
but "test_have_prereq !AUTOIDENT" is true.

So this test that would otherwise get skipped gets run.
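
A simplified model of that logic, as a shell sketch. This is not the
test_have_prereq implementation from test-lib-functions.sh; the
function and its second argument (the list of satisfied prerequisites)
are made up, and it assumes FAIL_PREREQS itself stays satisfied in
that mode.

```shell
#!/bin/sh
# Hypothetical model of a prerequisite check.
# $1: comma-separated prerequisites, each optionally negated with "!"
# $2: space-separated list of prerequisites considered satisfied
have_prereq () {
	satisfied=" $2 "
	save_IFS=$IFS
	IFS=,
	set -- $1
	IFS=$save_IFS
	for prereq
	do
		case "$prereq" in
		!*)
			negated=t
			prereq=${prereq#!}
			;;
		*)
			negated=
			;;
		esac
		case "$satisfied" in
		*" $prereq "*) found=t ;;
		*) found= ;;
		esac
		# A plain prerequisite must be satisfied, a negated one
		# must not be.
		test "$negated" != "$found" || return 1
	done
	return 0
}

# Everything forced to false: the negated prerequisite passes.
have_prereq '!AUTOIDENT' '' && echo "run (unpatched)"
# With the patch, FAIL_PREREQS being set skips the test.
have_prereq '!FAIL_PREREQS,!AUTOIDENT' 'FAIL_PREREQS' || echo "skip (patched)"
```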

I honestly didn't think much about these cases when I wrote dfe1a17df9
("tests: add a special setup where prerequisites fail", 2019-05-13), and
now I'm not quite sure whether it should be considered a bug or a
feature, but in the meantime this un-breaks the test suite under this
mode.

> In other words, I struggle to understand why this patch is necessary.
>
> Could you help me understand?
>
> Ciao,
> Dscho
>
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason 
>> ---
>>  t/t0007-git-var.sh  | 2 +-
>>  t/t7502-commit-porcelain.sh | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/t/t0007-git-var.sh b/t/t0007-git-var.sh
>> index 5868a87352..1f600e2cae 100755
>> --- a/t/t0007-git-var.sh
>> +++ b/t/t0007-git-var.sh
>> @@ -17,7 +17,7 @@ test_expect_success 'get GIT_COMMITTER_IDENT' '
>>  test_cmp expect actual
>>  '
>>
>> -test_expect_success !AUTOIDENT 'requested identites are strict' '
>> +test_expect_success !FAIL_PREREQS,!AUTOIDENT 'requested identites are 
>> strict' '
>>  (
>>  sane_unset GIT_COMMITTER_NAME &&
>>  sane_unset GIT_COMMITTER_EMAIL &&
>> diff --git a/t/t7502-commit-porcelain.sh b/t/t7502-commit-porcelain.sh
>> index 5733d9cd34..14c92e4c25 100755
>> --- a/t/t7502-commit-porcelain.sh
>> +++ b/t/t7502-commit-porcelain.sh
>> @@ -402,7 +402,7 @@ echo editor started >"$(pwd)/.git/result"
>>  exit 0
>>  EOF
>>
>> -test_expect_success !AUTOIDENT 'do not fire editor when committer is bogus' 
>> '
>> +test_expect_success !FAIL_PREREQS,!AUTOIDENT 'do not fire editor when 
>> committer is bogus' '
>>  >.git/result &&
>>
>>  echo >>negative &&
>> --
>> 2.22.0.455.g172b71a6c5
>>
>>


[PATCH] push: make "HEAD:tags/my-tag" consistently push to a branch

2019-06-21 Thread Ævar Arnfjörð Bjarmason
When a refspec like "HEAD:tags/my-tag" is pushed where "HEAD" is a
branch, we'll push a *branch* that'll be located at
"refs/heads/tags/my-tag". This is part of the rather straightforward
rules I documented in 2219c09e23 ("push doc: document the DWYM
behavior pushing to unqualified ", 2018-11-13).

However, if there exists a "refs/tags/my-tag" on the remote the
count_refspec_match() logic will, as a result of calling
refname_match(), match the partially-qualified RHS of the refspec to
"refs/tags/my-tag", because it's in a loop where it tries to match
"tags/my-tag" to "refs/tags/my-tag", then "refs/tags/tags/my-tag" etc.

This resulted in a case[1] where someone on LKML did:

git push kvm +HEAD:tags/for-linus

Which would have created a new "refs/heads/tags/for-linus" branch in
their "kvm" repository. But since they happened to have an existing
"refs/tags/for-linus" reference we pushed there instead, and replaced
an annotated tag with a lightweight tag.

We do want a RHS ref like "master" to match "refs/heads/master", but
it's confusing and dangerous that the DWYM behavior for matching
partial RHS refspecs acts differently when the start of the RHS
happens to be a second-level namespace under the "refs/" namespace
like "tags".

Now we'll print out the following advice when this happens, and act
differently as described therein:

hint: The  part of the refspec matched both of:
hint:
hint:   1. tags/my-tag -> refs/tags/my-tag
hint:   2. tags/my-tag -> refs/heads/tags/my-tag
hint:
hint: Earlier versions of git would have picked (1) as the RHS starts
hint: with a second-level ref prefix which could be fully-qualified by
hint: adding 'refs/' in front of it. We now pick (2) which uses the prefix
hint: inferred from the  part of the refspec.
hint:
hint: See the "..." rules  discussed in 'git help push'.

An earlier version of this patch[2] used the much more heavy-handed
approach of changing this logic in refname_match(). As shown by the
tests that patch needed to modify, that results in changes that are
overzealous for fixing this push-specific issue.

The right place to fix this is in match_explicit(). There we can see
if we have both a DWYM match and a match based on the prefix of the
LHS of the refspec. In those cases the match based on the LHS's ref
prefix should win.
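
As a rough model of the old and new behavior, here is a shell sketch.
It is not the match_explicit() code: the function names are made up,
and it only covers the one DWYM rule of prepending "refs/" to the RHS.

```shell
#!/bin/sh
# Hypothetical helper: qualify the RHS using the prefix of the ref the
# LHS resolved to.
# $1: fully-qualified ref for the LHS, e.g. refs/heads/master
# $2: partially-qualified RHS, e.g. tags/for-linus
dst_by_lhs_prefix () {
	case "$1" in
	refs/heads/*) echo "refs/heads/$2" ;;
	refs/tags/*) echo "refs/tags/$2" ;;
	*) return 1 ;;
	esac
}

# Old behavior (model): an existing "refs/<RHS>" on the remote wins.
# $3: space-separated list of refs that exist on the remote
pick_dst_old () {
	case " $3 " in
	*" refs/$2 "*) echo "refs/$2" ;;
	*) dst_by_lhs_prefix "$1" "$2" ;;
	esac
}

# New behavior (model): the match based on the LHS's prefix wins.
pick_dst_new () {
	dst_by_lhs_prefix "$1" "$2"
}

remote_refs="refs/heads/master refs/tags/for-linus"
pick_dst_old refs/heads/master tags/for-linus "$remote_refs"
pick_dst_new refs/heads/master tags/for-linus
```

With the "kvm" example from above the old model picks
"refs/tags/for-linus", the new one "refs/heads/tags/for-linus".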

1. https://lore.kernel.org/lkml/2d55fd2a-afbf-1b7c-ca82-8bffaa18e...@redhat.com/
2. https://public-inbox.org/git/20190526225445.21618-1-ava...@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---

Now that the 2.22.0 release is out I cleaned this up into a more
sensible patch.

 Documentation/config/advice.txt |  7 +++
 Documentation/git-push.txt  | 13 +
 advice.c|  2 ++
 advice.h|  1 +
 remote.c| 23 ++-
 t/t5505-remote.sh   | 18 ++
 6 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/advice.txt b/Documentation/config/advice.txt
index ec4f6ae658..36cb3db63a 100644
--- a/Documentation/config/advice.txt
+++ b/Documentation/config/advice.txt
@@ -37,6 +37,13 @@ advice.*::
we can still suggest that the user push to either
refs/heads/* or refs/tags/* based on the type of the
source object.
+   pushPartialAmbigiousName::
+   Shown when linkgit:git-push[1] is given a refspec
+   where the  in earlier versions of Git would have
+   matched a  on the remote based on its existence
+   over appending a prefix based on the type of the
+   . See the "..." documentation in
+   linkgit:git-push[1] for details.
statusHints::
Show directions on how to proceed from the current
state in the output of linkgit:git-status[1], in
diff --git a/Documentation/git-push.txt b/Documentation/git-push.txt
index 6a8a0d958b..5c46ef5e59 100644
--- a/Documentation/git-push.txt
+++ b/Documentation/git-push.txt
@@ -84,6 +84,19 @@ is ambiguous.
 
 * If  resolves to a ref starting with refs/heads/ or refs/tags/,
   then prepend that to .
++
+Versions of Git before 2.23.0 would override this rule and match
+e.g. `HEAD:tags/mark` to either `refs/tags/mark` or `refs/heads/tags/mark`
+depending on, respectively, if `refs/tags/mark` existed or not on the
+remote.
++
+We'll now consistently pick `refs/heads/tags/mark` based on this rule
+and so that we're not so eager in guessing the  on the remote
+that we'll pick a different  based on what refs exist there
+already than we otherwise would have. This exception guards for cases
+where the match would be different due to a subse

[PATCH v3 4/8] t6040 test: stop using global "script" variable

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Change test code added in c0234b2ef6 ("stat_tracking_info(): clear
object flags used during counting", 2008-07-03) to stop using the
"script" variable also used for lazy prerequisites in
test-lib-functions.sh.

Since this test uses test_i18ncmp and expects to use its own "script"
variable twice it implicitly depends on the C_LOCALE_OUTPUT
prerequisite not being a lazy prerequisite. A follow-up change will
make it a lazy prerequisite, so we must remove this landmine before
inadvertently stepping on it as we make that change.
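
A minimal sketch of the landmine, assuming a helper that keeps its
body in a global "script" variable. The run_lazy_prereq function here
is a made-up stand-in, not the real test-lib-functions.sh code.

```shell
#!/bin/sh
# Hypothetical stand-in for a lazy prerequisite helper which stores
# the code to run in a global variable named "script".
run_lazy_prereq () {
	script=$1
	eval "$script"
}

# The test's own "script" variable is clobbered by the helper...
script='s/a/b/'
run_lazy_prereq 'true'
echo "script is now: $script"

# ...while a uniquely named variable survives the call.
t6040_script='s/a/b/'
run_lazy_prereq 'true'
echo "t6040_script is still: $t6040_script"
```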

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t6040-tracking-info.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t6040-tracking-info.sh b/t/t6040-tracking-info.sh
index 716283b274..970b25a289 100755
--- a/t/t6040-tracking-info.sh
+++ b/t/t6040-tracking-info.sh
@@ -38,7 +38,7 @@ test_expect_success setup '
advance h
 '
 
-script='s/^..\(b.\) *[0-9a-f]* \(.*\)$/\1 \2/p'
+t6040_script='s/^..\(b.\) *[0-9a-f]* \(.*\)$/\1 \2/p'
 cat >expect <<\EOF
 b1 [ahead 1, behind 1] d
 b2 [ahead 1, behind 1] d
@@ -53,7 +53,7 @@ test_expect_success 'branch -v' '
cd test &&
git branch -v
) |
-   sed -n -e "$script" >actual &&
+   sed -n -e "$t6040_script" >actual &&
test_i18ncmp expect actual
 '
 
@@ -71,7 +71,7 @@ test_expect_success 'branch -vv' '
cd test &&
git branch -vv
) |
-   sed -n -e "$script" >actual &&
+   sed -n -e "$t6040_script" >actual &&
test_i18ncmp expect actual
 '
 
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 1/8] config tests: simplify include cycle test

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Simplify an overly verbose test added in 9b25a0b52e ("config: add
include directive", 2012-02-06). The "expect" file was never used, and
by using .gitconfig it's not as intuitive to reproduce this manually
with "-d" as some other tests, since HOME needs to be set in the
environment.

Also remove the use of test_i18ngrep added in a769bfc74f ("config.c:
mark more strings for translation", 2018-07-21) in favor of overriding
the GIT_TEST_GETTEXT_POISON value.

Using the i18n test wrappers hasn't been needed since my
6cdccfce1e ("i18n: make GETTEXT_POISON a runtime option", 2018-11-08).
As a follow-up change to the yet-to-be-added t0017-env-helper.sh will
show, doing it this way can hide a regression when combined with
trace2's early config reading. That early config reading was added in
bce9db6de9 ("trace2: use system/global config for default trace2
settings", 2019-04-15).

So let's remove the testing for that potential regression here, I'll
instead add it explicitly to t0017-env-helper.sh in a follow-up
change.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/t1305-config-include.sh | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/t/t1305-config-include.sh b/t/t1305-config-include.sh
index 579a86b7f8..6b388ba2d0 100755
--- a/t/t1305-config-include.sh
+++ b/t/t1305-config-include.sh
@@ -310,20 +310,13 @@ test_expect_success SYMLINKS 'conditional include, gitdir matching symlink, icas
 '
 
 test_expect_success 'include cycles are detected' '
-   cat >.gitconfig <<-\EOF &&
-   [test]value = gitconfig
-   [include]path = cycle
-   EOF
-   cat >cycle <<-\EOF &&
-   [test]value = cycle
-   [include]path = .gitconfig
-   EOF
-   cat >expect <<-\EOF &&
-   gitconfig
-   cycle
-   EOF
-   test_must_fail git config --get-all test.value 2>stderr &&
-   test_i18ngrep "exceeded maximum include depth" stderr
+   git init --bare cycle &&
+   git -C cycle config include.path cycle &&
+   git config -f cycle/cycle include.path config &&
+   test_must_fail \
+   env GIT_TEST_GETTEXT_POISON= \
+   git -C cycle config --get-all test.value 2>stderr &&
+   grep "exceeded maximum include depth" stderr
 '
 
 test_done
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 7/8] tests: replace test_tristate with "git env--helper"

2019-06-21 Thread Ævar Arnfjörð Bjarmason
The test_tristate helper introduced in 83d842dc8c ("tests: turn on
network daemon tests by default", 2014-02-10) can now be better
implemented with "git env--helper" to give the variables in question
the standard boolean behavior.

The reason for the "tristate" was to have all of false/true/auto,
where "auto" meant either "false" or "true" depending on what the
fallback was. With the --default option to "git env--helper" we can
simply have e.g. GIT_TEST_HTTPD where we know if it's true because the
user asked explicitly ("true"), or true implicitly ("auto").

This breaks backwards compatibility for explicitly setting "auto" for
these variables, but I don't think anyone cares. That was always
intended to be internal.

This means the test_normalize_bool() code in test-lib-functions.sh
goes away in addition to test_tristate(). We still need the
test_skip_or_die() helper, but now it takes the variable name instead
of the value, and uses "git env--helper" to distinguish a default "true"
from an explicit "true" (in those "explicit true" cases we want to
fail the test in question).
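
To make the explicit-versus-default distinction concrete, here is a rough plain-shell sketch. It is not part of this patch and does not call "git env--helper"; the names demo_bool and demo_skip_or_die are hypothetical, and the list of accepted spellings is a simplified assumption (lower-case only), whereas the real parsing happens in git_env_bool().

```shell
# demo_bool NAME DEFAULT
# Returns 0 if the variable NAME is true and 1 if it is false, using
# DEFAULT ("true" or "false") only when NAME is unset. An
# unrecognized value returns 2 and prints a message to stderr.
demo_bool () {
	eval "demo_val=\${$1-$2}"
	case "$demo_val" in
	true|yes|on|1)
		return 0 ;;
	false|no|off|0|'')
		return 1 ;;
	*)
		echo "bad boolean value '$demo_val' for '$1'" >&2
		return 2 ;;
	esac
}

# demo_skip_or_die NAME MESSAGE
# With a default of "false", demo_bool only succeeds when the user
# set NAME to true explicitly; in that case report an error instead
# of silently skipping.
demo_skip_or_die () {
	if demo_bool "$1" false
	then
		echo "error: $2"
		return 1
	fi
	echo "skip: $2"
}
```

The same variable can thus be asked two different questions: with a default of "true" to decide whether to run the tests at all, and with a default of "false" to find out whether the user insisted on them.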

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/lib-git-daemon.sh |  7 +++---
 t/lib-git-svn.sh| 11 +++-
 t/lib-httpd.sh  | 15 ++-
 t/t5512-ls-remote.sh|  3 +--
 t/test-lib-functions.sh | 56 ++---
 5 files changed, 22 insertions(+), 70 deletions(-)

diff --git a/t/lib-git-daemon.sh b/t/lib-git-daemon.sh
index 7b3407134e..fb8f887080 100644
--- a/t/lib-git-daemon.sh
+++ b/t/lib-git-daemon.sh
@@ -15,8 +15,7 @@
 #
 #  test_done
 
-test_tristate GIT_TEST_GIT_DAEMON
-if test "$GIT_TEST_GIT_DAEMON" = false
+if ! git env--helper --type=bool --default=true --exit-code GIT_TEST_GIT_DAEMON
 then
skip_all="git-daemon testing disabled (unset GIT_TEST_GIT_DAEMON to enable)"
test_done
@@ -24,7 +23,7 @@ fi
 
 if test_have_prereq !PIPE
 then
-   test_skip_or_die $GIT_TEST_GIT_DAEMON "file system does not support FIFOs"
+   test_skip_or_die GIT_TEST_GIT_DAEMON "file system does not support FIFOs"
 fi
 
 test_set_port LIB_GIT_DAEMON_PORT
@@ -73,7 +72,7 @@ start_git_daemon() {
kill "$GIT_DAEMON_PID"
wait "$GIT_DAEMON_PID"
unset GIT_DAEMON_PID
-   test_skip_or_die $GIT_TEST_GIT_DAEMON \
+   test_skip_or_die GIT_TEST_GIT_DAEMON \
"git daemon failed to start"
fi
 }
diff --git a/t/lib-git-svn.sh b/t/lib-git-svn.sh
index c1271d6863..5d4ae629e1 100644
--- a/t/lib-git-svn.sh
+++ b/t/lib-git-svn.sh
@@ -69,14 +69,12 @@ svn_cmd () {
 maybe_start_httpd () {
loc=${1-svn}
 
-   test_tristate GIT_SVN_TEST_HTTPD
-   case $GIT_SVN_TEST_HTTPD in
-   true)
+   if git env--helper --type=bool --default=false --exit-code GIT_SVN_TEST_HTTPD
+   then
. "$TEST_DIRECTORY"/lib-httpd.sh
LIB_HTTPD_SVN="$loc"
start_httpd
-   ;;
-   esac
+   fi
 }
 
 convert_to_rev_db () {
@@ -106,8 +104,7 @@ EOF
 }
 
 require_svnserve () {
-   test_tristate GIT_TEST_SVNSERVE
-   if ! test "$GIT_TEST_SVNSERVE" = true
+   if ! git env--helper --type=bool --default=false --exit-code GIT_TEST_SVNSERVE
then
skip_all='skipping svnserve test. (set $GIT_TEST_SVNSERVE to enable)'
test_done
diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
index b3cc62bd36..0d985758c6 100644
--- a/t/lib-httpd.sh
+++ b/t/lib-httpd.sh
@@ -41,15 +41,14 @@ then
test_done
 fi
 
-test_tristate GIT_TEST_HTTPD
-if test "$GIT_TEST_HTTPD" = false
+if ! git env--helper --type=bool --default=true --exit-code GIT_TEST_HTTPD
 then
skip_all="Network testing disabled (unset GIT_TEST_HTTPD to enable)"
test_done
 fi
 
 if ! test_have_prereq NOT_ROOT; then
-   test_skip_or_die $GIT_TEST_HTTPD \
+   test_skip_or_die GIT_TEST_HTTPD \
"Cannot run httpd tests as root"
 fi
 
@@ -95,7 +94,7 @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE
 
 if ! test -x "$LIB_HTTPD_PATH"
 then
-   test_skip_or_die $GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'"
+   test_skip_or_die GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'"
 fi
 
 HTTPD_VERSION=$($LIB_HTTPD_PATH -v | \
@@ -107,19 +106,19 @@ then
then
if ! test $HTTPD_VERSION -ge 2
then
-   test_skip_or_die $GIT_TEST_HTTPD \
+   test_skip_or_die GIT_TEST_HTTPD \
"at least Apache version 2 is required"
fi
if

[PATCH v3 3/8] config.c: refactor die_bad_number() to not call gettext() early

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Prepare die_bad_number() for a change to specially handle
GIT_TEST_GETTEXT_POISON calling git_env_bool() by making
die_bad_number() not call gettext() early, which would in turn call
git_env_bool().

There's no meaningful change here yet, just a re-arrangement of the
current code to make that subsequent change easier to read.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 config.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/config.c b/config.c
index 296a6d9cc4..374cb33005 100644
--- a/config.c
+++ b/config.c
@@ -949,34 +949,35 @@ int git_parse_ssize_t(const char *value, ssize_t *ret)
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
-   const char * error_type = (errno == ERANGE)? _("out of range"):_("invalid unit");
+   const char *error_type = (errno == ERANGE) ?
+   N_("out of range") : N_("invalid unit");
+   const char *bad_numeric = N_("bad numeric config value '%s' for '%s': %s");
 
if (!value)
value = "";
 
if (!(cf && cf->name))
-   die(_("bad numeric config value '%s' for '%s': %s"),
-   value, name, error_type);
+   die(_(bad_numeric), value, name, _(error_type));
 
switch (cf->origin_type) {
case CONFIG_ORIGIN_BLOB:
die(_("bad numeric config value '%s' for '%s' in blob %s: %s"),
-   value, name, cf->name, error_type);
+   value, name, cf->name, _(error_type));
case CONFIG_ORIGIN_FILE:
die(_("bad numeric config value '%s' for '%s' in file %s: %s"),
-   value, name, cf->name, error_type);
+   value, name, cf->name, _(error_type));
case CONFIG_ORIGIN_STDIN:
die(_("bad numeric config value '%s' for '%s' in standard input: %s"),
-   value, name, error_type);
+   value, name, _(error_type));
case CONFIG_ORIGIN_SUBMODULE_BLOB:
die(_("bad numeric config value '%s' for '%s' in submodule-blob %s: %s"),
-   value, name, cf->name, error_type);
+   value, name, cf->name, _(error_type));
case CONFIG_ORIGIN_CMDLINE:
die(_("bad numeric config value '%s' for '%s' in command line %s: %s"),
-   value, name, cf->name, error_type);
+   value, name, cf->name, _(error_type));
default:
die(_("bad numeric config value '%s' for '%s' in %s: %s"),
-   value, name, cf->name, error_type);
+   value, name, cf->name, _(error_type));
}
 }
 
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 8/8] tests: make GIT_TEST_FAIL_PREREQS a boolean

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Change the GIT_TEST_FAIL_PREREQS variable from being "non-empty?" to
being a more standard boolean variable. I recently added the variable
in dfe1a17df9 ("tests: add a special setup where prerequisites fail",
2019-05-13). Having to add another "non-empty?" special case is what
prompted me to write the "git env--helper" utility being used here.

Converting this one is a bit tricky since we use it so early and
frequently in the guts of the test code itself, so let's set a
GIT_TEST_FAIL_PREREQS_INTERNAL which can be tested with the old "test
-n" for the purposes of the shell code, and change the user-exposed
and documented GIT_TEST_FAIL_PREREQS variable to a boolean.
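
The "normalize once, test cheaply afterwards" idea can be sketched in plain shell as follows. This is only an illustration under assumptions: the DEMO_* variable names and both functions are hypothetical, and the boolean parsing is a simplified stand-in for what "git env--helper" does.

```shell
# demo_is_true VALUE
# Simplified stand-in for boolean parsing: a few common "true"
# spellings succeed, everything else (including empty) fails.
demo_is_true () {
	case "$1" in
	true|yes|on|1)
		return 0 ;;
	*)
		return 1 ;;
	esac
}

# demo_setup_internal: meant to run once, early. It leaves
# DEMO_FAIL_PREREQS_INTERNAL non-empty only when the user-facing
# DEMO_FAIL_PREREQS variable is a true boolean.
demo_setup_internal () {
	DEMO_FAIL_PREREQS_INTERNAL=
	if demo_is_true "${DEMO_FAIL_PREREQS-}"
	then
		DEMO_FAIL_PREREQS_INTERNAL=true
	fi
}
```

Hot paths can then keep using test -n "$DEMO_FAIL_PREREQS_INTERNAL" without re-parsing anything, while a user-supplied value such as "false", which the old "non-empty?" check would have treated as enabled, now correctly disables the mode.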

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/README|  2 +-
 t/t-basic.sh| 10 +-
 t/test-lib-functions.sh |  4 ++--
 t/test-lib.sh   | 23 +++
 4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/t/README b/t/README
index 072c9854d1..60d5b77bcc 100644
--- a/t/README
+++ b/t/README
@@ -334,7 +334,7 @@ that cannot be easily covered by a few specific test cases. These
 could be enabled by running the test suite with correct GIT_TEST_
 environment set.
 
-GIT_TEST_FAIL_PREREQS fails all prerequisites. This is
+GIT_TEST_FAIL_PREREQS=<boolean> fails all prerequisites. This is
 useful for discovering issues with the tests where say a later test
 implicitly depends on an optional earlier test.
 
diff --git a/t/t-basic.sh b/t/t-basic.sh
index 31de7e90f3..e89438e619 100755
--- a/t/t-basic.sh
+++ b/t/t-basic.sh
@@ -726,7 +726,7 @@ donthaveit=yes
 test_expect_success DONTHAVEIT 'unmet prerequisite causes test to be skipped' '
donthaveit=no
 '
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit != yesyes
 then
say "bug in test framework: prerequisite tags do not work reliably"
exit 1
@@ -747,7 +747,7 @@ donthaveiteither=yes
 test_expect_success DONTHAVEIT,HAVEIT 'unmet prerequisites causes test to be skipped' '
donthaveiteither=no
 '
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit$donthaveiteither != yesyesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit$donthaveiteither != yesyesyes
 then
say "bug in test framework: multiple prerequisite tags do not work reliably"
exit 1
@@ -763,7 +763,7 @@ test_expect_success !LAZY_TRUE 'missing lazy prereqs skip tests' '
donthavetrue=no
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a "$havetrue$donthavetrue" != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$havetrue$donthavetrue" != yesyes
 then
say 'bug in test framework: lazy prerequisites do not work'
exit 1
@@ -779,7 +779,7 @@ test_expect_success LAZY_FALSE 'missing negative lazy prereqs will skip' '
havefalse=no
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a "$nothavefalse$havefalse" != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$nothavefalse$havefalse" != yesyes
 then
say 'bug in test framework: negative lazy prerequisites do not work'
exit 1
@@ -790,7 +790,7 @@ test_expect_success 'tests clean up after themselves' '
test_when_finished clean=yes
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $clean != yes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $clean != yes
 then
say "bug in test framework: basic cleanup command does not work reliably"
exit 1
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 527508c350..1cd0655f96 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -309,7 +309,7 @@ test_unset_prereq () {
 }
 
 test_set_prereq () {
-   if test -n "$GIT_TEST_FAIL_PREREQS"
+   if test -n "$GIT_TEST_FAIL_PREREQS_INTERNAL"
then
case "$1" in
# The "!" case is handled below with
@@ -1043,7 +1043,7 @@ perl () {
 # The error/skip message should be given by $2.
 #
 test_skip_or_die () {
-   if ! git env--helper --mode-bool --variable=$1 --default=0 --exit-code --quiet
+   if ! git env--helper --type=bool --default=false --exit-code $1
then
skip_all=$2
test_done
diff --git a/t/test-lib.sh b/t/test-lib.sh
index ed5d69dfe5..1af4e50653 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1389,6 +1389,25 @@ yes () {
done
 }
 
+# The GIT_TEST_FAIL_PREREQS code hooks into test_set_prereq(), and
+# thus needs to be set up really early, and set an internal variable
+# for convenience so 

[PATCH v3 6/8] tests README: re-flow a previously changed paragraph

2019-06-21 Thread Ævar Arnfjörð Bjarmason
A previous change to the "GIT_TEST_GETTEXT_POISON" variable left this
paragraph needing to be re-flowed. Let's do that in this separate
change to make it easy to see that there's no change here when viewed
with "--word-diff".

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/README | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 9a131f472e..072c9854d1 100644
--- a/t/README
+++ b/t/README
@@ -344,10 +344,10 @@ refactor to deal with it. The "SYMLINKS" prerequisite is currently
 excluded as so much relies on it, but this might change in the future.
 
 GIT_TEST_GETTEXT_POISON=<boolean> turns all strings marked for
-translation into gibberish if true. Used for
-spotting those tests that need to be marked with a C_LOCALE_OUTPUT
-prerequisite when adding more strings for translation. See "Testing
-marked strings" in po/README for details.
+translation into gibberish if true. Used for spotting those tests that
+need to be marked with a C_LOCALE_OUTPUT prerequisite when adding more
+strings for translation. See "Testing marked strings" in po/README for
+details.
 
 GIT_TEST_SPLIT_INDEX=<boolean> forces split-index mode on the whole
 test suite. Accept any boolean values that are accepted by git-config.
-- 
2.22.0.455.g172b71a6c5



[PATCH v3 5/8] tests: make GIT_TEST_GETTEXT_POISON a boolean

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Change the GIT_TEST_GETTEXT_POISON variable from being "non-empty?" to
being a more standard boolean variable.

Since it needed to be checked in both C code and shellscript (via test
-n) it was one of the remaining shellscript-like variables. Now that
we have "env--helper" we can change that.

There's a couple of tricky edge cases that arise because we're using
git_env_bool() early, and the config-reading "env--helper".

If GIT_TEST_GETTEXT_POISON is set to an invalid value die_bad_number()
will die, but to do so it would usually call gettext(). Let's detect
the special case of GIT_TEST_GETTEXT_POISON and always emit that
message in the C locale, lest we infinitely loop.

As seen in the updated tests in t0017-env-helper.sh there's also a
caveat related to "env--helper" needing to read the config for trace2
purposes.

Since the C_LOCALE_OUTPUT prerequisite is lazy and relies on
"env--helper" we could get invalid results if we failed to read the
config (e.g. because we'd loop on includes) when combined with
e.g. "test_i18ngrep" wanting to check with "env--helper" if
GIT_TEST_GETTEXT_POISON was true or not.

I'm crossing my fingers and hoping that a test similar to the one I
removed in the earlier "config tests: simplify include cycle test"
change in this series won't happen again, and testing for this
explicitly in "env--helper"'s own tests.

This change breaks existing uses of
e.g. GIT_TEST_GETTEXT_POISON=YesPlease, which we've documented in
po/README and other places. As noted in [1] we might want to consider
also accepting "YesPlease" in "env--helper" as a special-case.

But as the lack of uproar over 6cdccfce1e ("i18n: make GETTEXT_POISON
a runtime option", 2018-11-08) demonstrates, the audience for this
option is a really narrow set of git developers, who shouldn't have
much trouble modifying their test scripts, so I think it's better to
deal with that minor headache now and make all the relevant GIT_TEST_*
variables boolean in the same way than carry the "YesPlease"
special-case forward.

1. https://public-inbox.org/git/xmqqtvckm3h8@gitster-ct.c.googlers.com/

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 ci/lib.sh |  2 +-
 config.c  |  9 +
 gettext.c |  6 ++
 git-sh-i18n.sh|  4 +++-
 po/README |  2 +-
 t/README  |  4 ++--
 t/t0017-env-helper.sh | 16 
 t/t0205-gettext-poison.sh |  7 ++-
 t/t1305-config-include.sh |  2 +-
 t/t7201-co.sh |  2 +-
 t/t9902-completion.sh |  2 +-
 t/test-lib.sh |  8 +++-
 12 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/ci/lib.sh b/ci/lib.sh
index 288a5b3884..fd799ae663 100755
--- a/ci/lib.sh
+++ b/ci/lib.sh
@@ -184,7 +184,7 @@ osx-clang|osx-gcc)
export GIT_SKIP_TESTS="t9810 t9816"
;;
 GIT_TEST_GETTEXT_POISON)
-   export GIT_TEST_GETTEXT_POISON=YesPlease
+   export GIT_TEST_GETTEXT_POISON=true
;;
 esac
 
diff --git a/config.c b/config.c
index 374cb33005..b985d60fa4 100644
--- a/config.c
+++ b/config.c
@@ -956,6 +956,15 @@ static void die_bad_number(const char *name, const char *value)
if (!value)
value = "";
 
+   if (!strcmp(name, "GIT_TEST_GETTEXT_POISON"))
+   /*
+* We explicitly *don't* use _() here since it would
+* cause an infinite loop with _() needing to call
+* use_gettext_poison(). This is why we marked up
+* translations with N_() above.
+*/
+   die(bad_numeric, value, name, error_type);
+
if (!(cf && cf->name))
die(_(bad_numeric), value, name, _(error_type));
 
diff --git a/gettext.c b/gettext.c
index d4021d690c..5c71f4c8b9 100644
--- a/gettext.c
+++ b/gettext.c
@@ -50,10 +50,8 @@ const char *get_preferred_languages(void)
 int use_gettext_poison(void)
 {
static int poison_requested = -1;
-   if (poison_requested == -1) {
-   const char *v = getenv("GIT_TEST_GETTEXT_POISON");
-   poison_requested = v && strlen(v) ? 1 : 0;
-   }
+   if (poison_requested == -1)
+   poison_requested = git_env_bool("GIT_TEST_GETTEXT_POISON", 0);
return poison_requested;
 }
 
diff --git a/git-sh-i18n.sh b/git-sh-i18n.sh
index e1d917fd27..8eef60b43f 100644
--- a/git-sh-i18n.sh
+++ b/git-sh-i18n.sh
@@ -17,7 +17,9 @@ export TEXTDOMAINDIR
 
 # First decide what scheme to use...
 GIT_INTERNAL_GETTEXT_SH_SCHEME=fallthrough
-if test -n "$GIT_TEST_GETTEXT_POISON"
+if test -n "$GIT_TEST_GETTEXT_POISON" &&
+   git env--helper --type=bool --default=0 --exit-code \
+   

[PATCH v3 2/8] env--helper: new undocumented builtin wrapping git_env_*()

2019-06-21 Thread Ævar Arnfjörð Bjarmason
We have many GIT_TEST_* variables that accept a <boolean> because
they're implemented in C, and then some that take <non-empty?> because
they're implemented at least partially in shellscript.

Add a helper that wraps git_env_bool() and git_env_ulong() as the
first step in fixing this. This isn't being added as a test-tool mode
because some of these are used outside the test suite.

Part of what this tool does can be done via a trick with "git config"
added in 83d842dc8c ("tests: turn on network daemon tests by default",
2014-02-10) for test_tristate(), i.e.:

git -c magic.variable="$1" config --bool magic.variable 2>/dev/null

But as subsequent changes will show being able to pass along the
default value makes all the difference, and we'll be able to replace
test_tristate() itself with that.

The --type=bool option will be used by subsequent patches, but not
--type=ulong. I figured it was easy enough to add it & test for it so
I left it in so we'd have wrappers for both git_env_*() functions, and
to have a template to make it obvious how we'd add --type=int etc. if
it's needed in the future.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 .gitignore|  1 +
 Makefile  |  1 +
 builtin.h |  1 +
 builtin/env--helper.c | 95 +++
 git.c |  1 +
 t/t0017-env-helper.sh | 83 +
 6 files changed, 182 insertions(+)
 create mode 100644 builtin/env--helper.c
 create mode 100755 t/t0017-env-helper.sh

diff --git a/.gitignore b/.gitignore
index 4470d7cfc0..1f7a83fb3c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -58,6 +58,7 @@
 /git-difftool
 /git-difftool--helper
 /git-describe
+/git-env--helper
 /git-fast-export
 /git-fast-import
 /git-fetch
diff --git a/Makefile b/Makefile
index f58bf14c7b..f2cfc8d812 100644
--- a/Makefile
+++ b/Makefile
@@ -1059,6 +1059,7 @@ BUILTIN_OBJS += builtin/diff-index.o
 BUILTIN_OBJS += builtin/diff-tree.o
 BUILTIN_OBJS += builtin/diff.o
 BUILTIN_OBJS += builtin/difftool.o
+BUILTIN_OBJS += builtin/env--helper.o
 BUILTIN_OBJS += builtin/fast-export.o
 BUILTIN_OBJS += builtin/fetch-pack.o
 BUILTIN_OBJS += builtin/fetch.o
diff --git a/builtin.h b/builtin.h
index ec7e0954c4..93bd49fe4f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -160,6 +160,7 @@ int cmd_diff_index(int argc, const char **argv, const char *prefix);
 int cmd_diff(int argc, const char **argv, const char *prefix);
 int cmd_diff_tree(int argc, const char **argv, const char *prefix);
 int cmd_difftool(int argc, const char **argv, const char *prefix);
+int cmd_env__helper(int argc, const char **argv, const char *prefix);
 int cmd_fast_export(int argc, const char **argv, const char *prefix);
 int cmd_fetch(int argc, const char **argv, const char *prefix);
 int cmd_fetch_pack(int argc, const char **argv, const char *prefix);
diff --git a/builtin/env--helper.c b/builtin/env--helper.c
new file mode 100644
index 00..1083c0f707
--- /dev/null
+++ b/builtin/env--helper.c
@@ -0,0 +1,95 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const env__helper_usage[] = {
+   N_("git env--helper --type=[bool|ulong]  "),
+   NULL
+};
+
+enum {
+   ENV_HELPER_TYPE_BOOL = 1,
+   ENV_HELPER_TYPE_ULONG
+} cmdmode = 0;
+
+static int option_parse_type(const struct option *opt, const char *arg,
+int unset)
+{
+   if (!strcmp(arg, "bool"))
+   cmdmode = ENV_HELPER_TYPE_BOOL;
+   else if (!strcmp(arg, "ulong"))
+   cmdmode = ENV_HELPER_TYPE_ULONG;
+   else
+   die(_("unrecognized --type argument, %s"), arg);
+
+   return 0;
+}
+
+int cmd_env__helper(int argc, const char **argv, const char *prefix)
+{
+   int exit_code = 0;
+   const char *env_variable = NULL;
+   const char *env_default = NULL;
+   int ret;
+   int ret_int, default_int;
+   unsigned long ret_ulong, default_ulong;
+   struct option opts[] = {
+   OPT_CALLBACK_F(0, "type", &cmdmode, N_("type"),
+  N_("value is given this type"), PARSE_OPT_NONEG,
+  option_parse_type),
+   OPT_STRING(0, "default", &env_default, N_("value"),
+  N_("default for git_env_*(...) to fall back on")),
+   OPT_BOOL(0, "exit-code", &exit_code,
+N_("be quiet only use git_env_*() value as exit code")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, prefix, opts, env__helper_usage,
+PARSE_OPT_KEEP_UNKNOWN);
+   if (env_default && !*env_default)
+   usage_with_options(env__helper_usage, opts)

[PATCH v3 0/8] Change <non-empty?> GIT_TEST_* variables to <boolean>

2019-06-21 Thread Ævar Arnfjörð Bjarmason
Now with:

 * The --type=bool etc. UI change to env--helper.

 * I considered supporting YesPlease for gettext poison, but didn't go
   for it. Details in updated commit message.

 * Added a "default" "case" arm to fix a warning on some compilers.

Ævar Arnfjörð Bjarmason (8):
  config tests: simplify include cycle test
  env--helper: new undocumented builtin wrapping git_env_*()
  config.c: refactor die_bad_number() to not call gettext() early
  t6040 test: stop using global "script" variable
  tests: make GIT_TEST_GETTEXT_POISON a boolean
  tests README: re-flow a previously changed paragraph
  tests: replace test_tristate with "git env--helper"
  tests: make GIT_TEST_FAIL_PREREQS a boolean

 .gitignore|  1 +
 Makefile  |  1 +
 builtin.h |  1 +
 builtin/env--helper.c | 95 +
 ci/lib.sh |  2 +-
 config.c  | 28 +++
 gettext.c |  6 +--
 git-sh-i18n.sh|  4 +-
 git.c |  1 +
 po/README |  2 +-
 t/README  | 12 ++---
 t/lib-git-daemon.sh   |  7 ++-
 t/lib-git-svn.sh  | 11 ++---
 t/lib-httpd.sh| 15 +++---
 t/t-basic.sh  | 10 ++--
 t/t0017-env-helper.sh | 99 +++
 t/t0205-gettext-poison.sh |  7 ++-
 t/t1305-config-include.sh | 21 +++--
 t/t5512-ls-remote.sh  |  3 +-
 t/t6040-tracking-info.sh  |  6 +--
 t/t7201-co.sh |  2 +-
 t/t9902-completion.sh |  2 +-
 t/test-lib-functions.sh   | 58 ---
 t/test-lib.sh | 31 
 24 files changed, 298 insertions(+), 127 deletions(-)
 create mode 100644 builtin/env--helper.c
 create mode 100755 t/t0017-env-helper.sh

Range-diff:
1:  c3483c37a1 = 1:  c3483c37a1 config tests: simplify include cycle test
2:  e689759f7c ! 2:  39cb96739a env--helper: new undocumented builtin wrapping git_env_*()
@@ -20,9 +20,11 @@
 default value makes all the difference, and we'll be able to replace
 test_tristate() itself with that.
 
-The --mode-bool option will be used by subsequent patches, but not
---mode-ulong. I figured it was easy enough to add it & test for it so
-I left it in so we'd have wrappers for both git_env_*() functions.
+The --type=bool option will be used by subsequent patches, but not
+--type=ulong. I figured it was easy enough to add it & test for it so
+I left it in so we'd have wrappers for both git_env_*() functions, and
+to have a template to make it obvious how we'd add --type=int etc. if
+    it's needed in the future.
 
 Signed-off-by: Ævar Arnfjörð Bjarmason 
 
@@ -72,74 +74,95 @@
 +#include "parse-options.h"
 +
 +static char const * const env__helper_usage[] = {
-+  N_("git env--helper [--mode-bool | --mode-ulong] --env-variable= --env-default= []"),
++  N_("git env--helper --type=[bool|ulong]  "),
 +  NULL
 +};
 +
++enum {
++  ENV_HELPER_TYPE_BOOL = 1,
++  ENV_HELPER_TYPE_ULONG
++} cmdmode = 0;
++
++static int option_parse_type(const struct option *opt, const char *arg,
++   int unset)
++{
++  if (!strcmp(arg, "bool"))
++  cmdmode = ENV_HELPER_TYPE_BOOL;
++  else if (!strcmp(arg, "ulong"))
++  cmdmode = ENV_HELPER_TYPE_ULONG;
++  else
++  die(_("unrecognized --type argument, %s"), arg);
++
++  return 0;
++}
++
 +int cmd_env__helper(int argc, const char **argv, const char *prefix)
 +{
-+  enum {
-+  ENV_HELPER_BOOL = 1,
-+  ENV_HELPER_ULONG,
-+  } cmdmode = 0;
 +  int exit_code = 0;
-+  int quiet = 0;
 +  const char *env_variable = NULL;
 +  const char *env_default = NULL;
 +  int ret;
-+  int ret_int, tmp_int;
-+  unsigned long ret_ulong, tmp_ulong;
++  int ret_int, default_int;
++  unsigned long ret_ulong, default_ulong;
 +  struct option opts[] = {
-+  OPT_CMDMODE(0, "mode-bool", &cmdmode,
-+  N_("invoke git_env_bool(...)"), ENV_HELPER_BOOL),
-+  OPT_CMDMODE(0, "mode-ulong", &cmdmode,
-+  N_("invoke git_env_ulong(...)"), ENV_HELPER_ULONG),
-+  OPT_STRING(0, "variable", &env_variable, N_("name"),
-+ N_("which environment variable to ask git_env_*(...) about")),
++  OPT_CALLBACK_F(0, "type", &cmdmode, N_("type"),
++ N_("value is given this type"), PARSE_OPT_NONEG,
++ option_parse_type),
 

Re: [PATCH v2 2/8] env--helper: new undocumented builtin wrapping git_env_*()

2019-06-21 Thread Ævar Arnfjörð Bjarmason


On Fri, Jun 21 2019, Junio C Hamano wrote:

> Junio C Hamano  writes:
>
>> ...
>> as I am getting
>>
>> error: 'ret' may be used uninitialized in this function 
>> [-Werror=maybe-uninitialized]
>>
>> from here.
>>
>> Giving an otherwise useless initial value to ret would be a
>> workaround.
>
> I've added this on top of the topic before merging to keep the
> integration going at least for now.
>
> commit 8f86948797a1152594a8dee50d0878604fec3e80
> Author: Junio C Hamano 
> Date:   Thu Jun 20 15:13:14 2019 -0700
>
> SQUASH??? avoid maybe-uninitialized
>
> diff --git a/builtin/env--helper.c b/builtin/env--helper.c
> index 2bb65ecf3f..29df0567fb 100644
> --- a/builtin/env--helper.c
> +++ b/builtin/env--helper.c
> @@ -43,6 +43,9 @@ int cmd_env__helper(int argc, const char **argv, const char 
> *prefix)
>   usage_with_options(env__helper_usage, opts);
>
>   switch (cmdmode) {
> + default:
> + BUG("wrong cmdmode");
> + break;
>   case ENV_HELPER_BOOL:
>   tmp_int = strtol(env_default, (char **)&env_default, 10);
>   if (*env_default) {

In this case the compiler is wrong, and gcc/clang in e.g. Debian
unstable don't warn about this since the analyzer sees that it's
impossible for "ret" to be uninitialized.

I can change it anyway, and if I rewrite the UI of this command the
warning might go away regardless.

Just thought I'd ask if appeasing older analyzers is what we want for
these sorts of optional warnings in general.


Re: [RFC/PATCH] gc: run more pre-detach operations under lock

2019-06-20 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 20 2019, Duy Nguyen wrote:

> On Thu, Jun 20, 2019 at 5:49 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>>
>> On Wed, Jun 19 2019, Jeff King wrote:
>>
>> > On Wed, Jun 19, 2019 at 08:01:55PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >
>> >> > You could sort of avoid the problem here too with
>> >> >
>> >> > parallel 'git fetch --no-auto-gc {}' ::: $(git remote)
>> >> > git gc --auto
>> >> >
>> >> > It's definitely simpler, but of course we have to manually add
>> >> > --no-auto-gc in everywhere we need, so not quite as elegant.
>> >> >
>> >> > Actually you could already do that with 'git -c gc.auto=false fetch', I 
>> >> > guess.
>> >>
>> >> The point of the 'parallel' example is to show disconnected git
>> >> commands, think trying to run 'git' in a terminal while your editor
>> >> asynchronously runs a polling 'fetch', or a server with multiple
>> >> concurrent clients running 'gc --auto'.
>> >>
>> >> That's the question my RFC patch raises. As far as I can tell the
>> >> approach in your patch is only needed because our locking for gc is
>> >> buggy, rather than introduce the caveat that a fetch(N) operation won't
>> >> do "gc" until it's finished (we may have hundreds, thousands of remotes,
>> >> I use that for some more obscure use-cases) shouldn't we just fix the
>> >> locking?
>> >
>> > I think there may be room for both approaches. Yours fixes the repeated
>> > message in the more general case, but Duy's suggestion is the most
>> > efficient thing.
>> >
>> > I agree that the "thousands of remotes" case means we might want to gc
>> > in the interim. But we probably ought to do that deterministically
>> > rather than hoping that the pattern of lock contention makes sense.
>>
>> We do it deterministically, when gc.auto thresholds et al are exceeded
>> we kick one off without waiting for other stuff, if we can get the lock.
>>
>> I don't think this desire to just wait a bit until all the fetches are
>> complete makes sense as a special-case.
>>
>> If, as you noted in <20190619190845.gd28...@sigill.intra.peff.net>, the
>> desire is to reduce GC CPU use then you're better off just tweaking the
>> limits upwards. Then you get that with everything, like when you run
>> "commit" in a for-loop, not just this one special case of "fetch".
>>
>> We have existing potentially long-running operations like "fetch",
>> "rebase" and "git svn fetch" that run "gc --auto" for their incremental
>> steps, and that's a feature.
>
> gc --auto is added at arbitrary points to help garbage collection. I
> don't think it's ever intended to "do gc at this and that exact
> moment", just "hey this command has taken a lot of time already (i.e.
> no instant response needed) and it may have added a bit more garbage,
> let's just check real quick".

I don't mean we can't ever change the algorithm, but that we've
documented:

When common porcelain operations that create objects are run, they
will check whether the repository has grown substantially since the
last maintenance[...]

The "fetch" command is a common porcelain operation, when it fetches
from N remotes it just runs an invocation of itself, so thus far it's
both worked & been intuitive that if we needed (potentially multiple)
gc's while doing that we'd just go ahead and run it then, even if
something concurrent was happening.

No, that's not optimal in many cases, but at least it doesn't create
caveats we don't have now, where we have runaway object growth.

>> It keeps "gc --auto" dumb enough to avoid a pathological case where
>> we'll have a ballooning objects dir because we figure we can run
>> something "at the end", when "the end" could be hours away, and we're
>> adding a new pack or hundreds of loose objects every second.
>
> Are we optimizing for a rare (large scale) case? Such setup requires
> tuning regardless to me.

At least for me it doesn't require custom tuning before this patch of
yours.

I.e. now "gc --auto" is dumb enough that you can run it on everything
from stuff that just does "commit" from cron, users' laptops, massive
rebases that take forever

[PATCH v2 5/8] tests: make GIT_TEST_GETTEXT_POISON a boolean

2019-06-20 Thread Ævar Arnfjörð Bjarmason
Change the GIT_TEST_GETTEXT_POISON variable from being "non-empty?" to
being a more standard boolean variable.

Since it needed to be checked in both C code and shellscript (via test
-n) it was one of the remaining shellscript-like variables. Now that
we have "env--helper" we can change that.
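
To make the difference between the two conventions concrete, here's a
rough sketch in plain shell. It isn't code from this series:
parse_bool, describe_bool and old_style_enabled are made-up names, and
the list of spellings only approximates what the boolean parsing
accepts (the series itself defers to git_env_bool() and "env--helper").

```sh
#!/bin/sh
# Illustration only: parse_bool, describe_bool and old_style_enabled
# are made-up names, not functions from git or its test suite.

# Old "non-empty?" convention: any non-empty value enables the mode,
# including "false" and "0".
old_style_enabled () {
	test -n "$1"
}

# Approximate the usual boolean spellings. Returns 0 for true, 1 for
# false and 2 for anything unrecognized.
parse_bool () {
	case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
	true|yes|on|1)
		return 0 ;;
	false|no|off|0|'')
		return 1 ;;
	*)
		return 2 ;;
	esac
}

# Print "on", "off" or "invalid" for a value under the boolean
# convention.
describe_bool () {
	if parse_bool "$1"
	then
		echo on
	else
		case $? in
		1) echo off ;;
		*) echo invalid ;;
		esac
	fi
}

for v in true false 0 YesPlease
do
	if old_style_enabled "$v"
	then
		old=on
	else
		old=off
	fi
	printf '%-10s non-empty: %-4s boolean: %s\n' \
		"$v" "$old" "$(describe_bool "$v")"
done
```

Under the old convention "false" and "0" both turn the mode on, and a
value like "YesPlease" is no longer a valid spelling once the variable
is a boolean, hence the s/YesPlease/true/ changes in the diff.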

There are a couple of tricky edge cases that arise because we're using
git_env_bool() early, and the config-reading "env--helper".

If GIT_TEST_GETTEXT_POISON is set to an invalid value die_bad_number()
will die, but to do so it would usually call gettext(). Let's detect
the special case of GIT_TEST_GETTEXT_POISON and always emit that
message in the C locale, lest we infinitely loop.

As seen in the updated tests in t0017-env-helper.sh there's also a
caveat related to "env--helper" needing to read the config for trace2
purposes.

Since the C_LOCALE_OUTPUT prerequisite is lazy and relies on
"env--helper" we could get invalid results if we failed to read the
config (e.g. because we'd loop on includes) when combined with
e.g. "test_i18ngrep" wanting to check with "env--helper" if
GIT_TEST_GETTEXT_POISON was true or not.

I'm crossing my fingers and hoping that a test similar to the one I
removed in the earlier "config tests: simplify include cycle test"
change in this series won't happen again, and testing for this
explicitly in "env--helper"'s own tests.

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 ci/lib.sh                 |  2 +-
 config.c                  |  9 +
 gettext.c                 |  6 ++
 git-sh-i18n.sh            |  4 +++-
 po/README                 |  2 +-
 t/README                  |  4 ++--
 t/t0017-env-helper.sh     | 16
 t/t0205-gettext-poison.sh |  7 ++-
 t/t1305-config-include.sh |  2 +-
 t/t7201-co.sh             |  2 +-
 t/t9902-completion.sh     |  2 +-
 t/test-lib.sh             |  8 +++-
 12 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/ci/lib.sh b/ci/lib.sh
index 288a5b3884..fd799ae663 100755
--- a/ci/lib.sh
+++ b/ci/lib.sh
@@ -184,7 +184,7 @@ osx-clang|osx-gcc)
export GIT_SKIP_TESTS="t9810 t9816"
;;
 GIT_TEST_GETTEXT_POISON)
-   export GIT_TEST_GETTEXT_POISON=YesPlease
+   export GIT_TEST_GETTEXT_POISON=true
;;
 esac
 
diff --git a/config.c b/config.c
index 374cb33005..b985d60fa4 100644
--- a/config.c
+++ b/config.c
@@ -956,6 +956,15 @@ static void die_bad_number(const char *name, const char *value)
if (!value)
value = "";
 
+   if (!strcmp(name, "GIT_TEST_GETTEXT_POISON"))
+   /*
+* We explicitly *don't* use _() here since it would
+* cause an infinite loop with _() needing to call
+* use_gettext_poison(). This is why we marked up
+* translations with N_() above.
+*/
+   die(bad_numeric, value, name, error_type);
+
if (!(cf && cf->name))
die(_(bad_numeric), value, name, _(error_type));
 
diff --git a/gettext.c b/gettext.c
index d4021d690c..5c71f4c8b9 100644
--- a/gettext.c
+++ b/gettext.c
@@ -50,10 +50,8 @@ const char *get_preferred_languages(void)
 int use_gettext_poison(void)
 {
static int poison_requested = -1;
-   if (poison_requested == -1) {
-   const char *v = getenv("GIT_TEST_GETTEXT_POISON");
-   poison_requested = v && strlen(v) ? 1 : 0;
-   }
+   if (poison_requested == -1)
+   poison_requested = git_env_bool("GIT_TEST_GETTEXT_POISON", 0);
return poison_requested;
 }
 
diff --git a/git-sh-i18n.sh b/git-sh-i18n.sh
index e1d917fd27..de8ae67d7b 100644
--- a/git-sh-i18n.sh
+++ b/git-sh-i18n.sh
@@ -17,7 +17,9 @@ export TEXTDOMAINDIR
 
 # First decide what scheme to use...
 GIT_INTERNAL_GETTEXT_SH_SCHEME=fallthrough
-if test -n "$GIT_TEST_GETTEXT_POISON"
+if test -n "$GIT_TEST_GETTEXT_POISON" &&
+   git env--helper --mode-bool --variable=GIT_TEST_GETTEXT_POISON \
+   --default=0 --exit-code --quiet
 then
GIT_INTERNAL_GETTEXT_SH_SCHEME=poison
 elif test -n "@@USE_GETTEXT_SCHEME@@"
diff --git a/po/README b/po/README
index aa704ffcb7..07595d369b 100644
--- a/po/README
+++ b/po/README
@@ -293,7 +293,7 @@ To smoke out issues like these, Git tested with a translation mode that
 emits gibberish on every call to gettext. To use it run the test suite
 with it, e.g.:
 
-cd t && GIT_TEST_GETTEXT_POISON=YesPlease prove -j 9 ./t[0-9]*.sh
+cd t && GIT_TEST_GETTEXT_POISON=true prove -j 9 ./t[0-9]*.sh
 
 If tests break with it you should inspect them manually and see if
 what you're translating is sane, i.e. that you're not translating
diff --git a/t/README b/t/README
index 9747971d58..9a131f472e

[PATCH v2 8/8] tests: make GIT_TEST_FAIL_PREREQS a boolean

2019-06-20 Thread Ævar Arnfjörð Bjarmason
Change the GIT_TEST_FAIL_PREREQS variable from being "non-empty?" to
being a more standard boolean variable. I recently added the variable
in dfe1a17df9 ("tests: add a special setup where prerequisites fail",
2019-05-13). Having to add another "non-empty?" special case is what
prompted me to write the "git env--helper" utility being used here.

Converting this one is a bit tricky since we use it so early and
frequently in the guts of the test code itself, so let's set a
GIT_TEST_FAIL_PREREQS_INTERNAL which can be tested with the old "test
-n" for the purposes of the shell code, and change the user-exposed
and documented GIT_TEST_FAIL_PREREQS variable to a boolean.
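
The general pattern, resolving the user-facing boolean once and
leaving a cheap "test -n" for the hot path, looks roughly like the
sketch below. The names (expensive_bool_check, resolve_fail_prereqs,
hot_path, DEMO_FAIL_PREREQS_INTERNAL) are hypothetical stand-ins, not
the code in test-lib.sh, and the boolean check is deliberately
simplified.

```sh
#!/bin/sh
# Illustration only: every name here is a made-up stand-in.

# Pretend this check is expensive (e.g. it spawns a process), so we
# don't want to run it every time a prerequisite is set.
expensive_bool_check () {
	case "$1" in
	true|yes|on|1) return 0 ;;
	*) return 1 ;;
	esac
}

# Resolve the user-facing boolean once, up front. Skip the expensive
# check entirely if the value is empty.
resolve_fail_prereqs () {
	DEMO_FAIL_PREREQS_INTERNAL=
	if test -n "$1" && expensive_bool_check "$1"
	then
		DEMO_FAIL_PREREQS_INTERNAL=true
	fi
}

# The hot path only needs the old-style "test -n" on the internal
# variable.
hot_path () {
	if test -n "$DEMO_FAIL_PREREQS_INTERNAL"
	then
		echo "failing prerequisite $1"
	else
		echo "setting prerequisite $1"
	fi
}

resolve_fail_prereqs false
hot_path PIPE
resolve_fail_prereqs true
hot_path PIPE
```

The point of the internal variable is that "false" ends up as the
empty string, so the existing "test -n"/"test -z" checks keep working
unchanged.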

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/README                |  2 +-
 t/t-basic.sh            | 10 +-
 t/test-lib-functions.sh |  2 +-
 t/test-lib.sh           | 25 +
 4 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/t/README b/t/README
index 072c9854d1..60d5b77bcc 100644
--- a/t/README
+++ b/t/README
@@ -334,7 +334,7 @@ that cannot be easily covered by a few specific test cases. These
 could be enabled by running the test suite with correct GIT_TEST_
 environment set.
 
-GIT_TEST_FAIL_PREREQS fails all prerequisites. This is
+GIT_TEST_FAIL_PREREQS=<boolean> fails all prerequisites. This is
 useful for discovering issues with the tests where say a later test
 implicitly depends on an optional earlier test.
 
diff --git a/t/t-basic.sh b/t/t-basic.sh
index 31de7e90f3..e89438e619 100755
--- a/t/t-basic.sh
+++ b/t/t-basic.sh
@@ -726,7 +726,7 @@ donthaveit=yes
 test_expect_success DONTHAVEIT 'unmet prerequisite causes test to be skipped' '
donthaveit=no
 '
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit != yesyes
 then
say "bug in test framework: prerequisite tags do not work reliably"
exit 1
@@ -747,7 +747,7 @@ donthaveiteither=yes
 test_expect_success DONTHAVEIT,HAVEIT 'unmet prerequisites causes test to be skipped' '
donthaveiteither=no
 '
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $haveit$donthaveit$donthaveiteither != yesyesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $haveit$donthaveit$donthaveiteither != yesyesyes
 then
say "bug in test framework: multiple prerequisite tags do not work reliably"
exit 1
@@ -763,7 +763,7 @@ test_expect_success !LAZY_TRUE 'missing lazy prereqs skip tests' '
donthavetrue=no
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a "$havetrue$donthavetrue" != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$havetrue$donthavetrue" != yesyes
 then
say 'bug in test framework: lazy prerequisites do not work'
exit 1
@@ -779,7 +779,7 @@ test_expect_success LAZY_FALSE 'missing negative lazy prereqs will skip' '
havefalse=no
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a "$nothavefalse$havefalse" != yesyes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a "$nothavefalse$havefalse" != yesyes
 then
say 'bug in test framework: negative lazy prerequisites do not work'
exit 1
@@ -790,7 +790,7 @@ test_expect_success 'tests clean up after themselves' '
test_when_finished clean=yes
 '
 
-if test -z "$GIT_TEST_FAIL_PREREQS" -a $clean != yes
+if test -z "$GIT_TEST_FAIL_PREREQS_INTERNAL" -a $clean != yes
 then
say "bug in test framework: basic cleanup command does not work reliably"
exit 1
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 527508c350..3fba71c358 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -309,7 +309,7 @@ test_unset_prereq () {
 }
 
 test_set_prereq () {
-   if test -n "$GIT_TEST_FAIL_PREREQS"
+   if test -n "$GIT_TEST_FAIL_PREREQS_INTERNAL"
then
case "$1" in
# The "!" case is handled below with
diff --git a/t/test-lib.sh b/t/test-lib.sh
index c45b0d2611..238ef62401 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1389,6 +1389,27 @@ yes () {
done
 }
 
+# The GIT_TEST_FAIL_PREREQS code hooks into test_set_prereq(), and
+# thus needs to be set up really early, and set an internal variable
+# for convenience so the hot test_set_prereq() codepath doesn't need
+# to call "git env--helper". Only do that work if needed by seeing if
+# GIT_TEST_FAIL_PREREQS is set at all.
+GIT_TEST_FAIL_PREREQS_INTERNAL=
+if test -n "$GIT_TEST_FAIL_PREREQS"
+then
+   if git env--helper --mode-bool --variable=GIT_TEST_FAIL_PREREQS \
+   --default=

[PATCH v2 7/8] tests: replace test_tristate with "git env--helper"

2019-06-20 Thread Ævar Arnfjörð Bjarmason
The test_tristate helper introduced in 83d842dc8c ("tests: turn on
network daemon tests by default", 2014-02-10) can now be better
implemented with "git env--helper" to give the variables in question
the standard boolean behavior.

The reason for the "tristate" was to have all of false/true/auto,
where "auto" meant either "false" or "true" depending on what the
fallback was. With the --default option to "git env--helper" we can
simply have e.g. GIT_TEST_HTTPD where we know if it's true because the
user asked explicitly ("true"), or true implicitly ("auto").

This breaks backwards compatibility for explicitly setting "auto" for
these variables, but I don't think anyone cares. That was always
intended to be internal.

This means the test_normalize_bool() code in test-lib-functions.sh
goes away in addition to test_tristate(). We still need the
test_skip_or_die() helper, but now it takes the variable name instead
of the value, and uses "git env--helper" to distinguish a default "true"
from an explicit "true" (in those "explicit true" cases we want to
fail the test in question).
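
To illustrate how asking the same question with two different defaults
replaces the "auto" state, here's a rough sketch in plain shell. The
is_true and skip_or_die functions and the DEMO_TEST_HTTPD variable are
hypothetical; this is not the test_skip_or_die() in this patch, which
calls "git env--helper", and the boolean parsing is simplified.

```sh
#!/bin/sh
# Illustration only: is_true, skip_or_die and DEMO_TEST_HTTPD are
# made-up names.

# $1 = variable *name*, $2 = value to assume if the variable is unset.
# Returns 0 if the resulting value is a "true" spelling.
is_true () {
	eval "is_true_value=\${$1-\$2}"
	case "$is_true_value" in
	true|yes|on|1) return 0 ;;
	*) return 1 ;;
	esac
}

# $1 = variable *name*, $2 = reason the test can't run.
skip_or_die () {
	if is_true "$1" false
	then
		# Only reachable if the user explicitly asked for the
		# tests, so not being able to run them is an error.
		echo "error: $2"
		return 1
	fi
	# The tests were only on by default: quietly skip them.
	echo "skip: $2"
}

unset DEMO_TEST_HTTPD
# On by default: the "should we run at all?" check uses default=true.
if is_true DEMO_TEST_HTTPD true
then
	skip_or_die DEMO_TEST_HTTPD "no web server found"
fi

DEMO_TEST_HTTPD=true
skip_or_die DEMO_TEST_HTTPD "no web server found" ||
	echo "(an explicit request turns the skip into a failure)"
```

With the variable unset, the first check (default true) passes and the
second (default false) doesn't, which is the old "auto" behavior.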

Signed-off-by: Ævar Arnfjörð Bjarmason 
---
 t/lib-git-daemon.sh     |  7 +++---
 t/lib-git-svn.sh        | 11 +++-
 t/lib-httpd.sh          | 15 ++-
 t/t5512-ls-remote.sh    |  3 +--
 t/test-lib-functions.sh | 56 ++---
 5 files changed, 22 insertions(+), 70 deletions(-)

diff --git a/t/lib-git-daemon.sh b/t/lib-git-daemon.sh
index 7b3407134e..770c5218ea 100644
--- a/t/lib-git-daemon.sh
+++ b/t/lib-git-daemon.sh
@@ -15,8 +15,7 @@
 #
 #  test_done
 
-test_tristate GIT_TEST_GIT_DAEMON
-if test "$GIT_TEST_GIT_DAEMON" = false
+if ! git env--helper --mode-bool --variable=GIT_TEST_GIT_DAEMON --default=1 --exit-code --quiet
 then
skip_all="git-daemon testing disabled (unset GIT_TEST_GIT_DAEMON to enable)"
test_done
@@ -24,7 +23,7 @@ fi
 
 if test_have_prereq !PIPE
 then
-   test_skip_or_die $GIT_TEST_GIT_DAEMON "file system does not support FIFOs"
+   test_skip_or_die GIT_TEST_GIT_DAEMON "file system does not support FIFOs"
 fi
 
 test_set_port LIB_GIT_DAEMON_PORT
@@ -73,7 +72,7 @@ start_git_daemon() {
kill "$GIT_DAEMON_PID"
wait "$GIT_DAEMON_PID"
unset GIT_DAEMON_PID
-   test_skip_or_die $GIT_TEST_GIT_DAEMON \
+   test_skip_or_die GIT_TEST_GIT_DAEMON \
"git daemon failed to start"
fi
 }
diff --git a/t/lib-git-svn.sh b/t/lib-git-svn.sh
index c1271d6863..853d33a57a 100644
--- a/t/lib-git-svn.sh
+++ b/t/lib-git-svn.sh
@@ -69,14 +69,12 @@ svn_cmd () {
 maybe_start_httpd () {
loc=${1-svn}
 
-   test_tristate GIT_SVN_TEST_HTTPD
-   case $GIT_SVN_TEST_HTTPD in
-   true)
+   if git env--helper --mode-bool --variable=GIT_TEST_HTTPD --default=0 --exit-code --quiet
+   then
. "$TEST_DIRECTORY"/lib-httpd.sh
LIB_HTTPD_SVN="$loc"
start_httpd
-   ;;
-   esac
+   fi
 }
 
 convert_to_rev_db () {
@@ -106,8 +104,7 @@ EOF
 }
 
 require_svnserve () {
-   test_tristate GIT_TEST_SVNSERVE
-   if ! test "$GIT_TEST_SVNSERVE" = true
+   if ! git env--helper --mode-bool --variable=GIT_TEST_SVNSERVE --default=0 --exit-code --quiet
then
skip_all='skipping svnserve test. (set $GIT_TEST_SVNSERVE to enable)'
test_done
diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
index b3cc62bd36..eef3250552 100644
--- a/t/lib-httpd.sh
+++ b/t/lib-httpd.sh
@@ -41,15 +41,14 @@ then
test_done
 fi
 
-test_tristate GIT_TEST_HTTPD
-if test "$GIT_TEST_HTTPD" = false
+if ! git env--helper --mode-bool --variable=GIT_TEST_HTTPD --default=1 --exit-code --quiet
 then
skip_all="Network testing disabled (unset GIT_TEST_HTTPD to enable)"
test_done
 fi
 
 if ! test_have_prereq NOT_ROOT; then
-   test_skip_or_die $GIT_TEST_HTTPD \
+   test_skip_or_die GIT_TEST_HTTPD \
"Cannot run httpd tests as root"
 fi
 
@@ -95,7 +94,7 @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE
 
 if ! test -x "$LIB_HTTPD_PATH"
 then
-   test_skip_or_die $GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'"
+   test_skip_or_die GIT_TEST_HTTPD "no web server found at '$LIB_HTTPD_PATH'"
 fi
 
 HTTPD_VERSION=$($LIB_HTTPD_PATH -v | \
@@ -107,19 +106,19 @@ then
then
if ! test $HTTPD_VERSION -ge 2
then
-   test_skip_or_die $GIT_TEST_HTTPD \
+   test_skip_or_die GIT_TEST_HTTPD \
&
