Confusing behavior with ignored submodules and `git commit -a`

2018-03-16 Thread Michael Forney
Hi,

In the past few months I have noticed some confusing behavior with
ignored submodules. I finally got around to bisecting this to commit
5556808690ea245708fb80383be5c1afee2fb3eb (add, reset: ensure
submodules can be added or reset).

Here is a demonstration of the problem:

First some repository initialization with a submodule marked as ignored.

$ git init inner && git -C inner commit --allow-empty -m 'inner 1'
Initialized empty Git repository in /tmp/inner/.git/
[master (root-commit) ef55bed] inner 1
$ git init outer && cd outer
Initialized empty Git repository in /tmp/outer/.git/
$ git submodule add ../inner
Cloning into '/tmp/outer/inner'...
done.
$ echo 1 > foo.txt && git add foo.txt
$ git commit -m 'outer 1'
[master (root-commit) efeb85c] outer 1
 3 files changed, 5 insertions(+)
 create mode 100644 .gitmodules
 create mode 100644 foo.txt
 create mode 160000 inner
$ git config submodule.inner.ignore all
$ git -C inner commit --allow-empty -m 'inner 2'
[master 7b7f0fa] inner 2
$ git status
On branch master
nothing to commit, working tree clean
$

Up to this point, everything is as expected. However, if I update
`foo.txt` and commit with `git commit -a`, changes to inner get
recorded unexpectedly. What's worse, neither the shortstat output of
`git commit -a` nor the diff output of `git show` gives any indication
that the submodule was changed.

$ echo 2 > foo.txt
$ git status
On branch master
Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

modified:   foo.txt

no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m 'update foo.txt'
[master 6ec564c] update foo.txt
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git show
commit 6ec564c15ddae099c71f01750b4c434557525653 (HEAD -> master)
Author: Michael Forney 
Date:   Fri Mar 16 20:18:37 2018 -0700

update foo.txt

diff --git a/foo.txt b/foo.txt
index d00491f..0cfbf08 100644
--- a/foo.txt
+++ b/foo.txt
@@ -1 +1 @@
-1
+2
$

There have been a couple of occasions where I accidentally pushed
local changes to ignored submodules because of this. Since they don't
show up in the log output, it is difficult to figure out what has
actually gone wrong.

Anyway, since the bisected commit (555680869) only mentions add and
reset, I'm assuming that this is a regression and not a deliberate
behavior change. The documentation for submodule.<name>.ignore states
that the setting should only affect `git status` and the diff family.
In terms of my expectations, I would go further and say it should only
affect `git status` and diffs against the working tree.

I took a brief look through the relevant sources, and it wasn't clear
to me how to fix this without accidentally changing the behavior of
other subcommands.

Any help with this issue is appreciated!


[PATCH 0/1] Fix typo in merge-strategies documentation

2018-03-16 Thread David Pursehouse
From: David Pursehouse 

Fixes a minor typo in the merge-strategies documentation.

David Pursehouse (1):
  Fix typo in merge-strategies documentation

 Documentation/merge-strategies.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.16.2



[PATCH 1/1] Fix typo in merge-strategies documentation

2018-03-16 Thread David Pursehouse
From: David Pursehouse 

Signed-off-by: David Pursehouse 
---
 Documentation/merge-strategies.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/merge-strategies.txt b/Documentation/merge-strategies.txt
index fd5d748d1..4a58aad4b 100644
--- a/Documentation/merge-strategies.txt
+++ b/Documentation/merge-strategies.txt
@@ -40,7 +40,7 @@ the other tree did, declaring 'our' history contains all that happened in it.
 
 theirs;;
This is the opposite of 'ours'; note that, unlike 'ours', there is
-   no 'theirs' merge stragegy to confuse this merge option with.
+   no 'theirs' merge strategy to confuse this merge option with.
 
 patience;;
With this option, 'merge-recursive' spends a little extra time
-- 
2.16.2



Re: [RFC] Rebasing merges: a journey to the ultimate solution (Road Clear)

2018-03-16 Thread Igor Djordjevic
Hi Sergey,

On 16/03/2018 08:31, Sergey Organov wrote:
> 
> > > As I said, putting myself on the user side, I'd prefer entirely
> > > separate first step of the algorithm, exactly as written, with
> > > its own conflict resolution, all running entirely the same way as
> > > it does with non-merge commits. I'm used to it and don't want to
> > > learn something new without necessity. I.e., I'd prefer to
> > > actually see it in two separate stages, like this:
> > >
> > > Rebasing mainline of the merge...
> > > [.. possible conflicts resolution ..]
> > > Merging in changes to side branch(es)...
> > > [.. possible conflicts resolution ..]
> > >
> > > And if the second stage gives non-trivial conflicts, I'd like to
> > > have a simple way to just do "merge -s ours <branch>" on top of
> > > already rebased mainline of the merge and go with it. Note that
> > > the latter is significantly different than re-merging everything
> > > from scratch, that would be the only choice with "all-in-one"
> > > approach, and it essentially gives me back those simple "rebase
> > > first parent and just record other parents" semantics when
> > > needed.
> > 
> > [...]
> > 
> > Also note that, for example, in case side branch(es) dropped some 
> > commits (interactively or otherwise), first step alone would still 
> > reintroduce those dropped changes, thus later possible `merge -s ours
> > <branch>` would be a pretty bad "evil merge" case and a wrong thing to
> > do in general.
> 
> Except that my presumption is that the second step has been run already
> and has stopped due to conflicts, so I see the conflicting result of
> dropping those commits on side branch(es), check the previous state of
> the right side of the conflicting merge, and decide that state, being
> the result of the first step after possibly demanding conflict
> resolution, is fine after all. Thus I just re-merge -x ours the
> branch(es), instead of re-merging everything from scratch only to finally
> get back to the same result, be it evil or not, the hard way.

Maybe my comment missed the point here; it should have been more
about what you said regarding "first step having its own conflict 
resolution" - in case of dropped commits on side branch(es), you would 
be trying to resolve conflicts using one tree that doesn`t/shouldn`t 
even exist anymore (rebased merge commit first parent changes), which 
might be pretty confusing, only to find the "second stage" later 
removing changes that you might have actually picked as "first stage" 
conflict resolution, making it all even worse.

Only once "huge merge" is done completely (meaning all steps involved 
in merge commit rebasing), user can have a more realistic overview of 
(possibly nested, even) conflicts to resolve (and knowing his resolution 
will actually stick).

Regarding `merge -s ours <branch>` you mention, as you say it would
happen only after "huge merge" is complete (with possible conflicts), 
I guess it`s unrelated to having "merge commit rebasing" happen in 
one go ("huge merge"), or iteratively, in stages (from user`s 
perspective, unrelated to underlying implementation)...?

Thus I`m questioning the use case for step-by-step merge commit rebasing
where each stage has its own conflict resolution, in the face of it 
possibly being more confusing than helpful.

Otherwise, I see the point in what you would like to accomplish with 
that `merge -s ours <branch>` (not from scratch), but I`m not sure
what would be the most sane way to allow it, and whether it would be
worth it in the first place, as it seems a pretty exotic use case.

Regards, Buga


Re: [RFC] Rebasing merges: a journey to the ultimate solution (Road Clear)

2018-03-16 Thread Igor Djordjevic
On 15/03/2018 00:11, Igor Djordjevic wrote:
> 
> > > > Second side note: if we can fast-forward, currently we prefer
> > > > that, and I think we should keep that behavior with -R, too.
> > >
> > > I agree.
> > 
> > I'm admittedly somewhat lost in the discussion, but are you
> > talking fast-forward on _rebasing_ existing merge? Where would it
> > go in any of the suggested algorithms of rebasing and why?
> > 
> > I readily see how it can break merges. E.g., any "git merge
> > --ff-only --no-ff" merge will magically disappear. So, even if
> > somehow supported, fast-forward should not be performed by default
> > during _rebasing_ of a merge.
> 
> Hmm, now that you brought this up, I can only agree, of course.
> 
> What I had in my mind was more similar to "no-rebase-cousins", like 
> if we can get away without actually rebasing the merge but still 
> using the original one, do it. But I guess that`s not what Johannes 
> originally asked about.
> 
> This is another definitive difference between rebasing (`pick`?) and 
> recreating (`merge`) a merge commit - in the case where we`re rebasing, 
> of course it doesn`t make sense to drop the commit this time (due to
> fast-forward). This does make sense in recreating the merge (only).

Eh, I might take this back. I think my original interpretation of (and
agreement with) fast-forwarding is correct.

But the confusion here comes from `--no-ff` as used for merging, as 
opposed to `--no-ff` as used for rebasing. I _think_ Johannes meant 
the latter one.

In rebasing, `--no-ff` means that even if a commit inside the todo
list isn`t to be changed, do not reuse it but create a new one. Here`s
an excerpt from the docs[1]:

  --no-ff
With --interactive, cherry-pick all rebased commits instead of 
fast-forwarding over the unchanged ones. This ensures that the 
entire history of the rebased branch is composed of new commits.

Without --interactive, this is a synonym for --force-rebase.


So fast-forwarding in the case of rebasing (merge commits as well) is
something you would want by default, as it wouldn`t drop/lose
anything, but merely reuse the existing commit (if unchanged), instead
of cherry-picking (rebasing) it into a new (merge) commit anyway.

The same goes for this part:

> > > > If the user wants to force a new merge, they simply remove that
> > > > -R flag.

This means that using the `-R` flag is sensitive to the `--no-ff`
rebase option: the original merge commit is _reused_ (fast-forwarded)
when possible (unchanged, and `--no-ff` not provided), or _rebased_
(when changed, or `--no-ff` provided).

If the `-R` flag is removed, then the merge commit is always
_recreated_, no matter whether the `--no-ff` option is used or not.

p.s. I`m still a bit opposed to the `-R` flag in the first place, as
discussed elsewhere[2][3], but that`s unrelated to this fast-forward 
discussion.

Related to it, though, if the `pick` command were used instead of
`merge -R` to signal merge commit _rebasing_, it would fit nicely into
the existing logic, where `pick` is already sensitive to the `--no-ff`
option (for rebasing regular commits). Then `merge` alone could be
naturally (and only) used for _recreating_ merge commits, as
originally intended (and intuitively expected).

Regards, Buga

[1] https://git-scm.com/docs/git-rebase#git-rebase---no-ff
[2] https://public-inbox.org/git/77b695d0-7564-80d7-d9e6-70a531e66...@gmail.com/
[3] https://public-inbox.org/git/a3d40dca-f508-5853-89bc-1f9ab3934...@gmail.com/


Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-16 Thread Junio C Hamano
Jonathan Nieder  writes:

>> I was hoping to hear quick Acks for remove-ignore-env (and also
>> Duy's reroll of their topics on it) from people involved in all the
>> related topics, so that we can advance them more-or-less at the same
>> time.
>
> I can go along with this for this series, but I want to explain why I
> am not thrilled with the process in general.
>
> The series "Moving global state into the repository object (part 1)"
> had gone through three revisions with relatively minor changes between
> each and we were in the middle of reviewing the fourth.  In response
> to a series that comes after it, "(part 2)", Duy created
> nd/remove-ignore-env, a series that I quite like.
>
> But with this reroll, the result is:
> ...
> So even though I agree with the goal and I like the improved
> initialization code, I am not happy.

Well, I do not think it was the "process" issue in particular,
though.  Yes, it was unfortunate that these three revisions were
reviewed but each resulted in only relatively minor changes
between iterations.  IOW, reviewers who lost fresh eyes, perhaps
including me, all missed a larger wart in the design that made the
"ignore-env" thing required in the structure, which was spotted by
somebody else.  It would have been nicer if such an improvement were
done during an earlier part of the cycle and did not require that
other reviewer.

But we didn't spot it, and now we know a better structure of the
topic, luckily before all the things touch 'next'.  We did not just
get a suggestion "it would be better to rewrite it like so---now go
ahead and do that"; we got something even better, and Duy did all the
work of restructuring the series and reshuffling the patches.  We should
be thankful, not unhappy.



Re: [GSoC] parse: not show usage help when invalid id

2018-03-16 Thread Junio C Hamano
Roger Solano  writes:

> patch for too chatty command when <id> is invalid,
> e.g. git tag --contains <invalid-id>
>
> Is it possible to skip 'goto show_usage_error' when
> parse_long_opt fails and return directly from there?

Can you explain how this relates to and/or compares with a recent
update [1]?

Thanks.

[1] https://public-inbox.org/git/20180306193116.23876-1-ungureanupaulsebast...@gmail.com

>
> Signed-off-by: Roger Solano 
> ---
>  parse-options.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/parse-options.c b/parse-options.c
> index 125e84f98..074915eb3 100644
> --- a/parse-options.c
> +++ b/parse-options.c
> @@ -550,7 +550,7 @@ int parse_options_step(struct parse_opt_ctx_t *ctx,
>   goto show_usage;
>   switch (parse_long_opt(ctx, arg + 2, options)) {
>   case -1:
> - goto show_usage_error;
> + return -1; 
>   case -2:
>   goto unknown;
>   }


Re: [PATCH 0/2] routines to generate JSON data

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Jeff King jotted:

> I really like the idea of being able to send our machine-readable output
> in some "standard" syntax for which people may already have parsers. But
> one big hangup with JSON is that it assumes all strings are UTF-8.

FWIW it's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
is valid, not that it changes anything you're pointing out, but people
on Win32 could use UTF-16 as-is if their filenames were in that format.

I'm just going to use UTF-8 synonymously with "Unicode encoding" for the
rest of this mail...

> Some possible solutions I can think of:
>
>   1. Ignore the UTF-8 requirement, making a JSON-like output (which I
>  think is what your patches do). I'm not sure what problems this
>  might cause on the parsing side.

Maybe some JSON parsers are more permissive, but they'll commonly just
die on non-Unicode (usually UTF-8) input, e.g.:

$ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') | perl 
-0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
malformed UTF-8 character in JSON string, at character offset 10 (before 
"\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

>   2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
>  to know the options here, but my understanding is that numeric
>  escapes are just for inserting unicode code points. _Can_ you
>  actually transport arbitrary binary data across JSON without
>  base64-encoding it (yech)?

There's no way to transfer binary data in JSON without it being shoved
into a UTF-8 encoding, so you'd need to know on the other side that
such-and-such a field has binary in it, i.e. you'll need to invent your
own schema.

E.g.:

head -c 10 /dev/urandom | perl -MDevel::Peek -MJSON::XS -wE '
  my $in = <STDIN>;
  my $roundtrip = decode_json(encode_json({str => $in}))->{str};
  utf8::decode($roundtrip) if $ARGV[0];
  say Dump [$in, $roundtrip]' 0

You can tweak that trailing "0" to "1" to toggle the ad-hoc schema,
i.e. after we decode the JSON we go and manually UTF-8 decode it to get
back at the same binary data, otherwise we end up with an UTF-8 escaped
version of what we put in.

>   3. Some other similar format. YAML comes to mind. Last time I looked
>  (quite a while ago), it seemed insanely complex, but I think you
>  could implement only a reasonable subset. OTOH, I think the tools
>  ecosystem for parsing JSON (e.g., jq) is much better.

The lack of fast schema-less formats that supported arrays, hashes
etc. and didn't suck when it came to mixed binary/UTF-8 led us to
implement our own at work: https://github.com/Sereal/Sereal

I think for git's use-case we're probably best off with JSON. It's going
to work almost all of the time, and when it doesn't it's going to be on
someone's weird non-UTF-8 repo, and those people are probably used to
dealing with crap because of that anyway and can just manually decode
their thing after it gets double-encoded.

That sucks, but given that we'll be using this either for just ASCII
(telemetry) or UTF-8 most of the time, and that realistically other
formats either suck more or aren't nearly as ubiquitous...


[PATCH v4 1/3] stash: fix nonsense pipeline

2018-03-16 Thread Junio C Hamano
An earlier change bba067d2 ("stash: don't delete untracked files
that match pathspec", 2018-01-06) was made by taking a suggestion in
a list discussion [1] but did not copy the suggested snippet
correctly.  And the bug was unnoticed during the review and slipped
through.

This fixes it.

[1] https://public-inbox.org/git/xmqqpo7byjwb@gitster.mtv.corp.google.com/

Signed-off-by: Junio C Hamano 
---
 git-stash.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/git-stash.sh b/git-stash.sh
index b48b164748..4c92ec931f 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -315,9 +315,9 @@ push_stash () {
 
if test $# != 0
then
-   git add -u -- "$@" |
-   git checkout-index -z --force --stdin
-   git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
+   git add -u -- "$@"
+   git diff-index -p --cached --binary HEAD -- "$@" |
+   git apply --index -R
else
git reset --hard -q
fi
-- 
2.17.0-rc0



[PATCH v4 0/3] stash push -u -- fixes

2018-03-16 Thread Junio C Hamano
Here is a preliminary fix for an earlier copy-pasto, plus two
patches from your v3.

I tried to reduce the nesting level by swapping the order of the
if/elif chain; please double-check the logic to ensure I didn't make
stupid mistakes while doing so.

Junio C Hamano (1):
  stash: fix nonsense pipeline

Thomas Gummerer (2):
  stash push: avoid printing errors
  stash push -u: don't create empty stash

 git-stash.sh | 13 +++--
 t/t3903-stash.sh | 22 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

-- 
2.17.0-rc0



[PATCH v4 3/3] stash push -u: don't create empty stash

2018-03-16 Thread Junio C Hamano
From: Thomas Gummerer 

When introducing the stash push feature, and thus allowing users to pass
in a pathspec to limit the files that would get stashed in
df6bba0937 ("stash: teach 'push' (and 'create_stash') to honor
pathspec", 2017-02-28), this developer missed one place where the
pathspec should be passed in.

Namely in the call to the 'untracked_files()' function in the
'no_changes()' function.  This resulted in 'git stash push -u --
<pathspec>' creating an empty stash when the untracked files in the
repository don't match the pathspec.

As 'git stash' never creates empty stashes, this behaviour is wrong and
confusing for users.  Instead it should just show a message "No local
changes to save", and not create a stash.

Luckily the 'untracked_files()' function already correctly respects
pathspecs that are passed to it, so the fix is simply to pass the
pathspec along to the function.

Reported-by: Marc Strapetz 
Signed-off-by: Thomas Gummerer 
Signed-off-by: Junio C Hamano 
---
 git-stash.sh | 2 +-
 t/t3903-stash.sh | 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/git-stash.sh b/git-stash.sh
index fa61151548..1652f64b5c 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -39,7 +39,7 @@ fi
 no_changes () {
git diff-index --quiet --cached HEAD --ignore-submodules -- "$@" &&
git diff-files --quiet --ignore-submodules -- "$@" &&
-   (test -z "$untracked" || test -z "$(untracked_files)")
+   (test -z "$untracked" || test -z "$(untracked_files "$@")")
 }
 
 untracked_files () {
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index f5b102a958..7e0b1c248c 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -1019,4 +1019,10 @@ test_expect_success 'stash -u -- <untracked> leaves rest of working tree in plac
test_path_is_file tracked
 '
 
+test_expect_success 'stash -u -- <non-existant> shows no changes when there are none' '
+   git stash push -u -- non-existant >actual &&
+   echo "No local changes to save" >expect &&
+   test_i18ncmp expect actual
+'
+
 test_done
-- 
2.17.0-rc0



[PATCH v4 2/3] stash push: avoid printing errors

2018-03-16 Thread Junio C Hamano
From: Thomas Gummerer 

'git stash push -u -- <pathspec>' prints the following errors if
<pathspec> only matches untracked files:

fatal: pathspec 'untracked' did not match any files
error: unrecognized input

This is because we first clean up the untracked files using 'git clean
<pathspec>', and then use a command chain involving 'git add -u
<pathspec>' and 'git apply' to clear the changes to files that are in
the index and were stashed.

As the <pathspec> only includes untracked files that were already
removed by 'git clean', the 'git add' call will barf, and so will 'git
apply', as there are no changes that need to be applied.

Fix this by making sure to only call this command chain if there are
still files that match <pathspec> after the call to 'git clean'.

[jc: swapped the order of if/elseif chain to reduce nesting levels]

Reported-by: Marc Strapetz 
Signed-off-by: Thomas Gummerer 
Signed-off-by: Junio C Hamano 
---
 git-stash.sh |  7 ---
 t/t3903-stash.sh | 16 
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/git-stash.sh b/git-stash.sh
index 4c92ec931f..fa61151548 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -313,13 +313,14 @@ push_stash () {
git clean --force --quiet -d $CLEAN_X_OPTION -- "$@"
fi
 
-   if test $# != 0
+   if test $# = 0
+   then
+   git reset --hard -q
+   elif git ls-files --error-unmatch -- "$@" >/dev/null 2>&1
then
git add -u -- "$@"
git diff-index -p --cached --binary HEAD -- "$@" |
git apply --index -R
-   else
-   git reset --hard -q
fi
 
if test "$keep_index" = "t" && test -n "$i_tree"
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index 1cd85e70e9..f5b102a958 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -1003,4 +1003,20 @@ test_expect_success 'stash -- <pathspec> works with binary files' '
test_path_is_file subdir/untracked
 '
 
+test_expect_success 'stash -u -- <untracked> doesnt print error' '
+   >untracked &&
+   git stash push -u -- untracked 2>actual &&
+   test_path_is_missing untracked &&
+   test_line_count = 0 actual
+'
+
+test_expect_success 'stash -u -- <untracked> leaves rest of working tree in place' '
+   >tracked &&
+   git add tracked &&
+   >untracked &&
+   git stash push -u -- untracked &&
+   test_path_is_missing untracked &&
+   test_path_is_file tracked
+'
+
 test_done
-- 
2.17.0-rc0



[GSoC] parse: not show usage help when invalid id

2018-03-16 Thread Roger Solano
patch for too chatty command when <id> is invalid,
e.g. git tag --contains <invalid-id>

Is it possible to skip 'goto show_usage_error' when
parse_long_opt fails and return directly from there?

Signed-off-by: Roger Solano 
---
 parse-options.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/parse-options.c b/parse-options.c
index 125e84f98..074915eb3 100644
--- a/parse-options.c
+++ b/parse-options.c
@@ -550,7 +550,7 @@ int parse_options_step(struct parse_opt_ctx_t *ctx,
goto show_usage;
switch (parse_long_opt(ctx, arg + 2, options)) {
case -1:
-   goto show_usage_error;
+   return -1; 
case -2:
goto unknown;
}
-- 
2.13.6



Re: [PATCH] RelNotes: add details on Perl module changes

2018-03-16 Thread Junio C Hamano
Todd Zullinger  writes:

> Document changes to core and non-core Perl module handling in 2.17.
> ...
> Maybe something like this?  I had intended to suggest a note about
> NO_PERL_CPAN_FALLBACKS as well, so that's included too.

Thanks.  Help like this in individual areas, from people more
familiar with them than I am, is greatly appreciated.



Re: [PATCH] RelNotes: add details on Perl module changes

2018-03-16 Thread Todd Zullinger
I wrote:
> Document changes to core and non-core Perl module handling in 2.17.

This should have:

Signed-off-by: Todd Zullinger 

And perhaps also:

Helped-by: Junio C Hamano 

since I borrowed liberally from your initial text. :)

> ---
> Junio C Hamano  writes:
> 
>>> I haven't wordsmithed it fully, but it should say something along
>>> the lines of ...
>>>
>>>  Documentation/RelNotes/2.16.0.txt | 10 ++
>>>  1 file changed, 10 insertions(+)
>>
>> Eh, of course the addition should go to 2.17 release notes ;-)  I
>> just happened to be reviewing a topic forked earlier.
> 
> Maybe something like this?  I had intended to suggest a note about
> NO_PERL_CPAN_FALLBACKS as well, so that's included too.  I don't know if that
> should be expanded to provide more of a hint to users/packagers on platforms
> where these modules are harder to install, letting them know that we now have
> fallbacks to Error and Mail::Address.  That might allow scripts which were
> previously excluded to be included on their platforms.
> 
>  Documentation/RelNotes/2.17.0.txt | 14 ++
>  INSTALL   |  3 ++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/RelNotes/2.17.0.txt b/Documentation/RelNotes/2.17.0.txt
> index c828d37345..085bf1dba1 100644
> --- a/Documentation/RelNotes/2.17.0.txt
> +++ b/Documentation/RelNotes/2.17.0.txt
> @@ -75,6 +75,20 @@ Performance, Internal Implementation, Development Support etc.
>   * The build procedure for perl/ part has been greatly simplified by
> weaning ourselves off of MakeMaker.
>  
> + * Perl 5.8 or greater has been required since Git 1.7.4 released in
> +   2010, but we continued to assume some core modules may not exist and
> +   used a conditional "eval { require <module> }"; we no longer do
> +   this.  Some platforms (Fedora/RedHat/CentOS, for example) ship Perl
> +   without all core modules by default (e.g. Digest::MD5, File::Temp,
> +   File::Spec, Net::Domain, Net::SMTP).  Users on such platforms may
> +   need to install these additional modules.
> +
> + * As a convenience, we install copies of Perl modules we require which
> +   are not part of the core Perl distribution (e.g. Error and
> +   Mail::Address).  Users and packagers whose operating system provides
> +   these modules can set NO_PERL_CPAN_FALLBACKS to avoid installing the
> +   bundled modules.
> +
>   * In preparation for implementing narrow/partial clone, the machinery
> for checking object connectivity used by gc and fsck has been
> taught that a missing object is OK when it is referenced by a
> diff --git a/INSTALL b/INSTALL
> index 60e515eaf7..c39006e8e7 100644
> --- a/INSTALL
> +++ b/INSTALL
> @@ -126,7 +126,8 @@ Issues of note:
> Redhat/Fedora are reported to ship Perl binary package with some
> core modules stripped away (see http://lwn.net/Articles/477234/),
> so you might need to install additional packages other than Perl
> -   itself, e.g. Time::HiRes.
> +   itself, e.g. Digest::MD5, File::Spec, File::Temp, Net::Domain,
> +   Net::SMTP, and Time::HiRes.
>  
>   - git-imap-send needs the OpenSSL library to talk IMAP over SSL if
> you are using libcurl older than 7.34.0.  Otherwise you can use

-- 
Todd
~~
The secret of life is honesty and fair dealing. If you can fake that,
you've got it made.
-- Groucho Marx



[PATCH] RelNotes: add details on Perl module changes

2018-03-16 Thread Todd Zullinger
Document changes to core and non-core Perl module handling in 2.17.
---
Junio C Hamano  writes:

>> I haven't wordsmithed it fully, but it should say something along
>> the lines of ...
>>
>>  Documentation/RelNotes/2.16.0.txt | 10 ++
>>  1 file changed, 10 insertions(+)
>
> Eh, of course the addition should go to 2.17 release notes ;-)  I
> just happened to be reviewing a topic forked earlier.

Maybe something like this?  I had intended to suggest a note about
NO_PERL_CPAN_FALLBACKS as well, so that's included too.  I don't know if that
should be expanded to provide more of a hint to users/packagers on platforms
where these modules are harder to install, letting them know that we now have
fallbacks to Error and Mail::Address.  That might allow scripts which were
previously excluded to be included on their platforms.

 Documentation/RelNotes/2.17.0.txt | 14 ++
 INSTALL   |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/RelNotes/2.17.0.txt b/Documentation/RelNotes/2.17.0.txt
index c828d37345..085bf1dba1 100644
--- a/Documentation/RelNotes/2.17.0.txt
+++ b/Documentation/RelNotes/2.17.0.txt
@@ -75,6 +75,20 @@ Performance, Internal Implementation, Development Support etc.
  * The build procedure for perl/ part has been greatly simplified by
weaning ourselves off of MakeMaker.
 
+ * Perl 5.8 or greater has been required since Git 1.7.4 released in
+   2010, but we continued to assume some core modules may not exist and
+   used a conditional "eval { require <module> }"; we no longer do
+   this.  Some platforms (Fedora/RedHat/CentOS, for example) ship Perl
+   without all core modules by default (e.g. Digest::MD5, File::Temp,
+   File::Spec, Net::Domain, Net::SMTP).  Users on such platforms may
+   need to install these additional modules.
+
+ * As a convenience, we install copies of Perl modules we require which
+   are not part of the core Perl distribution (e.g. Error and
+   Mail::Address).  Users and packagers whose operating system provides
+   these modules can set NO_PERL_CPAN_FALLBACKS to avoid installing the
+   bundled modules.
+
  * In preparation for implementing narrow/partial clone, the machinery
for checking object connectivity used by gc and fsck has been
taught that a missing object is OK when it is referenced by a
diff --git a/INSTALL b/INSTALL
index 60e515eaf7..c39006e8e7 100644
--- a/INSTALL
+++ b/INSTALL
@@ -126,7 +126,8 @@ Issues of note:
  Redhat/Fedora are reported to ship Perl binary package with some
  core modules stripped away (see http://lwn.net/Articles/477234/),
  so you might need to install additional packages other than Perl
- itself, e.g. Time::HiRes.
+ itself, e.g. Digest::MD5, File::Spec, File::Temp, Net::Domain,
+ Net::SMTP, and Time::HiRes.
 
- git-imap-send needs the OpenSSL library to talk IMAP over SSL if
  you are using libcurl older than 7.34.0.  Otherwise you can use
-- 
2.17.0.rc0



Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-16 Thread Jonathan Nieder
Hi,

Junio C Hamano wrote:
> Jonathan Tan  writes:
>> On Wed, 14 Mar 2018 18:34:49 -0700
>> Junio C Hamano  wrote:

>>> * sb/object-store (2018-03-05) 27 commits
>>
>> [snip list of commits]
>>
>>>  (this branch is used by sb/packfiles-in-repository; uses 
>>> nd/remove-ignore-env-field.)
>>>
>>>  Refactoring the internal global data structure to make it possible
>>>  to open multiple repositories, work with and then close them.
>>>
>>>  Rerolled by Duy on top of a separate preliminary clean-up topic.
>>>  The resulting structure of the topics looked very sensible.
>>>
>>>  Waiting for a follow-up discussion.
>>
>> Would it be possible for this set to go in independently of
>> nd/remove-ignore-env-field? I understand that some patches might be
>> cleaner if ignore_env is first removed, but this patch set has already
>> undergone several rounds of review and (I think) is an improvement to
>> the codebase on its own.
>
> I thought the "remove-ignore-env-field" thing is a quite small and
> more-or-less straightforward improvements that would serve as a good
> preparatory change to give a solid foundation to the object-store
> topic.
>
> I was hoping to hear quick Acks for remove-ignore-env (and also
> Duy's reroll of their topics on it) from people involved in all the
> related topics, so that we can advance them more-or-less at the same
> time.

I can go along with this for this series, but I want to explain why I
am not thrilled with the process in general.

The series "Moving global state into the repository object (part 1)"
had gone through three revisions with relatively minor changes between
each and we were in the middle of reviewing the fourth.  In response
to a series that comes after it, "(part 2)", Duy created
nd/remove-ignore-env, a series that I quite like.

But with this reroll, the result is:

 * The patches that have already been reviewed 3 times (in their
   current incarnation; parts of this series have also appeared on
   list earlier too) and we were in the middle of reviewing a fourth
   time are now on a new base.

 * These three series, each of a manageable size, have been combined
   into a larger series of 44 patches.

 * I have fears about when they're ever going to land and be part of
   an API I can start making use of.

So even though I agree with the goal and I like the improved
initialization code, I am not happy.

I would be happier with just the new initialization code being its own
series that we can fast-track, then deal with part 1 separately on
top, and then deal with part 2 after that.

Thanks,
Jonathan


Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-16 Thread Junio C Hamano
Jonathan Tan  writes:

> On Wed, 14 Mar 2018 18:34:49 -0700
> Junio C Hamano  wrote:
>
>> * sb/object-store (2018-03-05) 27 commits
>
> [snip list of commits]
>
>>  (this branch is used by sb/packfiles-in-repository; uses 
>> nd/remove-ignore-env-field.)
>> 
>>  Refactoring the internal global data structure to make it possible
>>  to open multiple repositories, work with and then close them.
>> 
>>  Rerolled by Duy on top of a separate preliminary clean-up topic.
>>  The resulting structure of the topics looked very sensible.
>> 
>>  Waiting for a follow-up discussion.
>
> Would it be possible for this set to go in independently of
> nd/remove-ignore-env-field? I understand that some patches might be
> cleaner if ignore_env is first removed, but this patch set has already
> undergone several rounds of review and (I think) is an improvement to
> the codebase on its own.

I thought the "remove-ignore-env-field" thing is a quite small and
more-or-less straightforward improvements that would serve as a good
preparatory change to give a solid foundation to the object-store
topic.

I was hoping to hear quick Acks for remove-ignore-env (and also
Duy's reroll of their topics on it) from people involved in all the
related topics, so that we can advance them more-or-less at the same
time.




Re: [RFC PATCH] git-submodule.sh:cmd_update: if submodule branch exists, fetch that instead of default

2018-03-16 Thread Eric Sunshine
On Fri, Mar 16, 2018 at 5:33 PM, Thomas Gummerer  wrote:
> Subject: [PATCH] SubmittingPatches: mention the git contacts command
>
> Instead of just mentioning 'git blame' and 'git shortlog', which make
> it harder than necessary for new contributors to pick out the
> appropriate list of people to cc on their patch series, mention the
> 'git contacts' utility, which should make it much easier to get a
> reasonable list of contacts for a change.
>
> Signed-off-by: Thomas Gummerer 
> ---
> diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
> @@ -260,8 +260,8 @@ that starts with `-----BEGIN PGP SIGNED MESSAGE-----`.  That is
>  Send your patch with "To:" set to the mailing list, with "cc:" listing
> -people who are involved in the area you are touching (the output from
> -`git blame $path` and `git shortlog --no-merges $path` would help to
> +people who are involved in the area you are touching (the `git
> +contacts` command in `contrib/contacts/` can help to
>  identify them), to solicit comments and reviews.

It may be a disservice to remove mention of git-blame and git-shortlog
since git-contacts may not be suitable for everyone. Instead, perhaps
advertise git-contacts as a potentially simpler alternative to the
already-mentioned git-blame & git-shortlog?

Also, would it make sense to mention Felipe's git-related[1] which is
the original (and now more configurable) script from which
git-contacts functionality was derived?

[1]: https://github.com/felipec/git-related


Re: [PATCH] format-patch: use --compact-summary instead of --summary

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> For reviewers, a couple fewer lines from diffstat (either cover letter
> or patches) is a good thing because there's less to read.

I can certainly accept that there may be some reviewers who share
that view, but personally I would rather have them outside the
main diffstat section, so that additions and removals of files,
which are rarer events, do stand out.

I guess people would not mind an option to use it, but they can
already say "git format-patch --compact-summary $upstream..", so...




Re: [PATCH v4 09/11] pack-objects: shrink size field in struct object_entry

2018-03-16 Thread Junio C Hamano
Junio C Hamano  writes:

> Also, don't we want to use uintmax_t throughout the callchain?  How
> would the code in this series work when your ulong is 32-bit?

My own answer to this question is "no conversion to uintmax_t, at
least not in this series."  As long as the original code uses
"unsigned long", this series also should, I think.



Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-16 Thread Jonathan Tan
On Wed, 14 Mar 2018 18:34:49 -0700
Junio C Hamano  wrote:

> * sb/object-store (2018-03-05) 27 commits

[snip list of commits]

>  (this branch is used by sb/packfiles-in-repository; uses 
> nd/remove-ignore-env-field.)
> 
>  Refactoring the internal global data structure to make it possible
>  to open multiple repositories, work with and then close them.
> 
>  Rerolled by Duy on top of a separate preliminary clean-up topic.
>  The resulting structure of the topics looked very sensible.
> 
>  Waiting for a follow-up discussion.

Would it be possible for this set to go in independently of
nd/remove-ignore-env-field? I understand that some patches might be
cleaner if ignore_env is first removed, but this patch set has already
undergone several rounds of review and (I think) is an improvement to
the codebase on its own.


Re: [PATCH v3 1/2] stash push: avoid printing errors

2018-03-16 Thread Junio C Hamano
Thomas Gummerer  writes:

> @@ -322,9 +322,12 @@ push_stash () {
>  
>   if test $# != 0
>   then
> - git add -u -- "$@" |
> - git checkout-index -z --force --stdin

This obviously is not something this patch breaks, but I am finding
this pipeline that was already here quite puzzling.

The "add -u" is about adding the changes in paths that match the
pathspec to the index; the output from it is meant for human
consumption and certainly is not something "--stdin" should expect
to be understandable, let alone with "-z".

... goes and digs ...

I think you mis-copied the suggestion in

https://public-inbox.org/git/xmqqpo7byjwb@gitster.mtv.corp.google.com/

when you made bba067d2 ("stash: don't delete untracked files that
match pathspec", 2018-01-06), and nobody caught that breakage during
the review.

> - git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
> + if git ls-files --error-unmatch -- "$@" >/dev/null 2>/dev/null
> + then
> + git add -u -- "$@" |
> + git checkout-index -z --force --stdin

And the same breakage is inherited here; just drop "|" and
downstream "checkout-index" and you should be OK.

> + git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R

And while at it, let's split this into two lines after "|".

> + fi



Re: [RFC PATCH] git-submodule.sh:cmd_update: if submodule branch exists, fetch that instead of default

2018-03-16 Thread Thomas Gummerer
On 03/08, Eddy Petrișor wrote:
> 2018-03-06 22:21 GMT+02:00 Jonathan Nieder :
> >
> > (cc list snipped)
> > Hi,
> >
> > Eddy Petrișor wrote:
> >
> > > Cc: [a lot of people]
> >
> > Can you say a little about how this cc list was created?  E.g. should
> > "git send-email" get a feature to warn about long cc lists?
> 
> 
> I did it as advised by the documentation, git blame:
> 
> https://github.com/git/git/blob/master/Documentation/SubmittingPatches#L264
> 
> I was surprised that there is no specialized script that does this, as
> there is for the kernel or u-boot, so I ran first
> 
> git log --pretty=format:'%an <%ae>' git-submodule.sh | sort -u > mail-list-submodule
> 
> then configured git to use that custom output and ignore the patch,
> since it was trying to convert every line of it into an email address:
> 
> git config sendemail.ccCmd 'cat mail-list-submodule # '
> 
> Then "git send-email 0001..." did the rest.

There's a 'git contacts' command which we've been carrying in
"contrib/" for a while.  Maybe we could advertise that in
"Documentation/SubmittingPatches" instead of 'git blame' and 'git
shortlog'?

Maybe something like the patch below?

--- >8 ---
Subject: [PATCH] SubmittingPatches: mention the git contacts command

Instead of just mentioning 'git blame' and 'git shortlog', which make
it harder than necessary for new contributors to pick out the
appropriate list of people to cc on their patch series, mention the
'git contacts' utility, which should make it much easier to get a
reasonable list of contacts for a change.

Signed-off-by: Thomas Gummerer 
---
 Documentation/SubmittingPatches | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index a1d0feca36..945f8edb46 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -260,8 +260,8 @@ that starts with `-----BEGIN PGP SIGNED MESSAGE-----`.  That is
 not a text/plain, it's something else.
 
 Send your patch with "To:" set to the mailing list, with "cc:" listing
-people who are involved in the area you are touching (the output from
-`git blame $path` and `git shortlog --no-merges $path` would help to
+people who are involved in the area you are touching (the `git
+contacts` command in `contrib/contacts/` can help to
 identify them), to solicit comments and reviews.
 
 :1: footnote:[The current maintainer: gits...@pobox.com]
-- 
2.16.2.804.g6dcf76e11



Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 09:01:42PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Suggestion for a thing to add to it, I don't have the time on the Go
> tuits:
> 
> One thing that can make repositories very pathological is if the ratio
> of trees to commits is too low.
> 
> I was dealing with a repo the other day that had several thousand files
> all in the same root directory, and no subdirectories.

We've definitely run into this problem before (CocoaPods/Specs, for
example). The metric that would hopefully show this off is "what is the
tree object with the most entries". Or possibly "what is the average
number of entries in a tree object".

That's not the _whole_ story, because the really pathological case is
when you then touch that giant tree a lot. But if you assume the paths
touched by commits are reasonably distributed over the tree, then having
a huge number of entries in one tree will also mean that more commits
will touch that tree. Sort of a vaguely quadratic problem.

Doing it at the root is obviously the worst case, but the same thing can
happen if you have "foo/bar" as a huge tree, and every single commit
needs to touch some variant of "foo/bar/baz".

That's why I suspect some "average per tree object" may show this type
of problem, because you'd have a lot of near-identical copies of that
giant tree if it's being modified a lot.

> But it's not something where you can just say having more trees is
> better, because on the other end of the spectrum we can imagine a repo
> like linux.git where each file like COPYING instead exists at
> C/O/P/Y/I/N/G, that would also be pathological.
> 
> It would be very interesting to do some tests to see what the optimal
> value would be.

I suspect there's some math that could give us the solution. You want
approximately equal-sized trees, so maybe log(N) levels?
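
One back-of-the-envelope model (assumptions mine: uniform fan-out,
one modified path per commit): with N paths and fan-out k, a tree is
about log_k(N) levels deep, and rewriting the modified path's
leaf-to-root chain costs one k-entry tree per level, so the work per
modification is roughly

  W(k) ~ k * log_k(N) = (k / ln k) * ln(N)

which is minimized at k = e, since d/dk (k / ln k) =
(ln k - 1) / (ln k)^2 vanishes there. Real per-object overheads push
the optimum well above 3 entries, but it does suggest that moderate,
balanced fan-outs beat both extremes.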

-Peff


Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

>   struct option builtin_gc_options[] = {
>   OPT__QUIET(&quiet, N_("suppress progress reporting")),
> @@ -362,6 +390,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>   OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
>   OPT_BOOL(0, "auto", &auto_gc, N_("enable auto-gc mode")),
>   OPT_BOOL(0, "force", &force, N_("force running gc even if there may be another gc running")),
> + OPT_BOOL(0, "keep-base-pack", &keep_base_pack,
> +  N_("repack all other packs except the base pack")),
>   OPT_END()
>   };

There's an easy-to-solve merge conflict here between the current master
and this. I pushed out a solution (for my own use) at
https://github.com/avar/git/ gc-auto-keep-base-pack. Interdiff with
yours:

@@ -112,9 +112,9 @@
struct option builtin_gc_options[] = {
OPT__QUIET(&quiet, N_("suppress progress reporting")),
 @@
-   OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
-   OPT_BOOL(0, "auto", &auto_gc, N_("enable auto-gc mode")),
-   OPT_BOOL(0, "force", &force, N_("force running gc even if there may be another gc running")),
+   OPT_BOOL_F(0, "force", &force,
+  N_("force running gc even if there may be another gc running"),
+  PARSE_OPT_NOCOMPLETE),
 +  OPT_BOOL(0, "keep-base-pack", &keep_base_pack,
 +   N_("repack all other packs except the base pack")),
OPT_END()


Re: [PATCH v2] travis-ci: enable more warnings on travis linux-gcc job

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 08:33:55PM +0100, Nguyễn Thái Ngọc Duy wrote:

> We have the DEVELOPER config to enable more warnings, but since we can't
> set a fixed gcc version for all devs, that has to be a bit more conservative.
> On travis, we know almost exactly which version will be used (bumped to 6.x
> for more warnings), so we could be more aggressive.

Certainly it makes sense to dial up any static checks we can for the
Travis environment, but...

> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> index 3735ce413f..f6f346c468 100755
> --- a/ci/run-build-and-tests.sh
> +++ b/ci/run-build-and-tests.sh
> @@ -7,6 +7,22 @@
>  
>  ln -s "$cache_dir/.prove" t/.prove
>  
> +if [ "$jobname" = linux-gcc ]; then
> + gcc-6 --version
> + cat >config.mak <<-EOF
> + CC=gcc-6
> + CFLAGS = -g -O2 -Wall
> + CFLAGS += -Wextra
> + CFLAGS += -Wmissing-prototypes
> + CFLAGS += -Wno-empty-body
> + CFLAGS += -Wno-maybe-uninitialized
> + CFLAGS += -Wno-missing-field-initializers
> + CFLAGS += -Wno-sign-compare
> + CFLAGS += -Wno-unused-function
> + CFLAGS += -Wno-unused-parameter
> + EOF
> +fi

Why isn't this just turning on DEVELOPER=1 if we know we have a capable
compiler?

-Peff


Re: getting fatal error trying to open git gui

2018-03-16 Thread Bryan Turner
On Fri, Mar 16, 2018 at 2:01 PM, Briggs, John  wrote:
> No, it was a fresh install.  Plus file search reveals only one copy of the 
> file.
>
> I also noticed that I cannot use the file properties to run as administrator. 
> I must right-click on Git GUI and select "More >> Run as administrator" in 
> the start menu. Even though I have "run as administrator" checked on both 
> shortcut and the program (or any combo of).

You definitely shouldn't need to run it as an administrator at all. I
have the same version installed:

$ git version
git version 2.16.2.windows.1

Running Git GUI from the Start menu, it starts without a UAC prompt
(which means it's not running as Administrator) and runs without
issue. Starting it from a non-escalated command prompt via "git gui"
also does not have a UAC prompt and runs without issue.

Seems like there must be something more going on. What options did you
select during installation? What installer did you use?

Bryan


Re: [PATCH 0/2] routines to generate JSON data

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 07:40:55PM +, g...@jeffhostetler.com wrote:

> This patch series adds a set of utility routines to compose data in JSON
> format into a "struct strbuf".  The resulting string can then be output
> by commands wanting to support a JSON output format.
> 
> This is a stand alone patch.  Nothing currently uses these routines.  I'm
> currently working on a series to log "telemetry" data (as we discussed
> briefly during Ævar's "Performance Misc" session [1] in Barcelona last
> week).  And I want emit the data in JSON rather than a fixed column/field
> format.  The JSON routines here are independent of that, so it made sense
> to submit the JSON part by itself.
> 
> Back when we added porcelain=v2 format to status, we talked about adding a
> JSON format.  I think the routines in this patch would let us easily do
> that, if someone were interested.  (Extending status is not on my radar
> right now, however.)

I really like the idea of being able to send our machine-readable output
in some "standard" syntax for which people may already have parsers. But
one big hangup with JSON is that it assumes all strings are UTF-8. That
may be OK for telemetry data, but it would probably lead to problems for
something like status porcelain, since Git's view of paths is just a
string of bytes (not to mention possible uses elsewhere like author
names, subject lines, etc).

Before we commit to a standardized format, I think we need to work out
a solution there (because I'd much rather not go down this road for
telemetry data only to find that we cannot use the same standardized
format in other parts of Git).

Some possible solutions I can think of:

  1. Ignore the UTF-8 requirement, making a JSON-like output (which I
 think is what your patches do). I'm not sure what problems this
 might cause on the parsing side.

  2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
 to know the options here, but my understanding is that numeric
 escapes are just for inserting unicode code points. _Can_ you
 actually transport arbitrary binary data across JSON without
 base64-encoding it (yech)?  (See the sketch after this list.)

  3. Some other similar format. YAML comes to mind. Last time I looked
 (quite a while ago), it seemed insanely complex, but I think you
 could implement only a reasonable subset. OTOH, I think the tools
 ecosystem for parsing JSON (e.g., jq) is much better.
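
As a strawman for option 2 (a convention of my own, not anything Git
or the JSON spec mandates), one could escape every control or
non-ASCII byte as a \u00XX code point. The output is well-formed
JSON, but a decoder hands back code points U+0080..U+00FF rather than
raw bytes, so the consumer still needs out-of-band knowledge that the
field is binary:

  #include <stdio.h>

  /* Emit buf as a JSON string, escaping control and non-ASCII bytes
   * as \u00XX (a latin-1-style convention; the reader must map code
   * points below U+0100 back to single bytes). */
  static void emit_json_string(const unsigned char *buf, size_t len)
  {
          size_t i;
          putchar('"');
          for (i = 0; i < len; i++) {
                  unsigned char c = buf[i];
                  if (c == '"' || c == '\\')
                          printf("\\%c", c);
                  else if (c < 0x20 || c >= 0x80)
                          printf("\\u%04x", c);
                  else
                          putchar(c);
          }
          putchar('"');
  }

  int main(void)
  {
          const unsigned char path[] = "foo\xfe.txt";
          printf("{\"path\": ");
          emit_json_string(path, sizeof(path) - 1);
          printf("}\n"); /* {"path": "foo\u00fe.txt"} */
          return 0;
  }

That keeps strict parsers happy, but it just moves the schema problem
around: the consumer has to know which fields to map back to bytes.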

> Documentation for the new API is given in json-writer.h at the bottom of
> the first patch.

The API generally looks pleasant, but the nesting surprised me.  E.g.,
I'd have expected:

  jw_array_begin(out);
  jw_array_begin(out);
  jw_array_append_int(out, 42);
  jw_array_end(out);
  jw_array_end(out);

to result in an array containing an array containing an integer. But
array_begin() actually resets the strbuf, so you can't build up nested
items like this internally. Ditto for objects within objects. You have
to use two separate strbufs and copy the data an extra time.

To make the above work, I think you'd have to store a little more state.
E.g., the "array_append" functions check "out->len" to see if they need
to add a separating comma. That wouldn't work if we might be part of a
nested array. So I think you'd need a context struct like:

  struct json_writer {
int first_item;
struct strbuf out;
  };
  #define JSON_WRITER_INIT { 1, STRBUF_INIT }

to store the state and the output. As a bonus, you could also use it to
store some other sanity checks (e.g., keep a "depth" counter and BUG()
when somebody tries to access the finished strbuf with a hanging-open
object or array).
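
To make that concrete, here is a minimal, self-contained sketch of
that first_item bookkeeping (illustrative only: a fixed buffer stands
in for the strbuf, and the jw_* names mirror the hypothetical calls
above, not the posted API):

  #include <stdio.h>
  #include <string.h>

  struct json_writer {
          int first_item;  /* no comma needed before the next item? */
          char out[1024];  /* stand-in for a growable strbuf */
  };
  #define JSON_WRITER_INIT { 1, "" }

  /* Add a separating comma unless this is the first item at this level. */
  static void jw_sep(struct json_writer *jw)
  {
          if (!jw->first_item)
                  strcat(jw->out, ",");
          jw->first_item = 0;
  }

  static void jw_array_begin(struct json_writer *jw)
  {
          jw_sep(jw);
          strcat(jw->out, "[");
          jw->first_item = 1;  /* next append is the first in this array */
  }

  static void jw_array_append_int(struct json_writer *jw, int v)
  {
          char buf[32];
          jw_sep(jw);
          snprintf(buf, sizeof(buf), "%d", v);
          strcat(jw->out, buf);
  }

  static void jw_array_end(struct json_writer *jw)
  {
          strcat(jw->out, "]");
          jw->first_item = 0;  /* the enclosing level now has an item */
  }

  int main(void)
  {
          struct json_writer jw = JSON_WRITER_INIT;
          jw_array_begin(&jw);
          jw_array_begin(&jw);
          jw_array_append_int(&jw, 42);
          jw_array_end(&jw);
          jw_array_end(&jw);
          puts(jw.out);  /* prints [[42]] */
          return 0;
  }

With a real strbuf and a depth counter, the same state would be
enough to BUG() on unbalanced begin/end calls as well.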

> I wasn't sure how to unit test the API from a shell script, so I added a
> helper command that does most of the work in the second patch.

In general I'd really prefer to keep the shell script as the driver for
the tests, and have t/helper programs just be conduits. E.g., something
like:

  cat >expect <<-\EOF &&
  {"key": "value", "foo": 42}
  EOF
  test-json-writer >actual \
object_begin \
object_append_string key value \
object_append_int foo 42 \
object_end &&
  test_cmp expect actual

It's a bit tedious (though fairly mechanical) to expose the API in this
way, but it makes it much easier to debug, modify, or add tests later
on (for example, I had to modify the C program to find out that my
append example above wouldn't work).

-Peff


Re: [PATCH v3 4/7] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:


> + struct packed_git * p = find_base_packs(&base_pack, 0);

Style nit: space after "*".


Re: [PATCH v3 5/7] gc: handle a corner case in gc.bigPackThreshold

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

> This config allows us to keep <N> packs back if their size is larger
> than a limit. But if this N >= gc.autoPackLimit, we may have a
> problem. We are supposed to reduce the number of packs after a
> threshold because it affects performance.
>
> We could tell the user that they have incompatible gc.bigPackThreshold
> and gc.autoPackLimit, but it's kinda hard when 'git gc --auto' runs in
> background. Instead let's fall back to the next best strategy: try to
> reduce the number of packs anyway, but keep the base pack out. This
> reduces the number of packs to two and hopefully won't take up too
> many resources to repack (the assumption still is that the base pack takes
> most resources to handle).

I think this strategy makes perfect sense.

Those with, say, a 1GB "base" pack might set this to 500MB or
something large like that; then it's realistically never going to
happen that you get a collision between gc.bigPackThreshold and
gc.autoPackLimit. Even if your clone is many years old, you've
*maybe* accumulated 5-10 of those 500MB packs in any sane repo.

But this also allows for setting this value really low, e.g. 50MB,
to place a very low upper bound on how much memory GC takes on a
regular basis, though of course you'll need to repack that set of
50MB packs eventually.

Great!


Re: [PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

> +--keep-base-pack::
> + All packs except the base pack and those marked with a `.keep`
> + file are consolidated into a single pack. The largest pack is
> + considered the base pack.
> +

I wonder if all of this would be less confusing as:

> +--keep-biggest-pack::
> + All packs except the largest pack and those marked with a `.keep`
> + file are consolidated into a single pack.

I.e. just skimming these docs, I'd expect "base" to somehow be the
thing that we initially cloned; of course in almost all cases that
*is* the largest pack, but not necessarily. So rather than communicate
that expectation, let's just say largest/biggest?

Maybe I'm the only one who finds this confusing...


Re: [PATCH v3 3/7] gc: detect base packs based on gc.bigPackThreshold config

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Nguyễn Thái Ngọc Duy jotted:

Awesome, thanks for making this support N large packs.

> +gc.bigPackThreshold::
> + If non-zero, all packs larger than this limit are kept when
> + `git gc` is run. This is very similar to `--keep-base-pack`
> + except that all packs that meet the threshold are kept, not
> + just the base pack. Defaults to zero.
> +

We should add:

+
Common unit suffixes of 'k', 'm', or 'g' are supported.

Since this now supports those suffixes (yay!), see existing copy/pasting
of that phrase in "git help config".


Re: [PATCH v4 11/11] pack-objects.h: reorder members to shrink struct object_entry

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> Previous patches leave lots of holes and padding in this struct. This
> patch reorders the members and shrinks the struct down to 80 bytes
> (from 136 bytes, before any field shrinking is done) with 16 bits to
> spare (and a couple more in in_pack_header_size when we really run out
> of bits).

Nice.

I am wondering if we need some conditional code for 32-bit platform.
For example, you have uint32_t field and do things like this:

static inline int oe_size_less_than(const struct object_entry *e,
unsigned long limit)
{
if (e->size_valid)
return e->size_ < limit;
if (limit > maximum_unsigned_value_of_type(uint32_t))
return 1;
return oe_size(e) < limit;
}

Do we and compilers do the right thing when your ulong is uint32_t?


RE: getting fatal error trying to open git gui

2018-03-16 Thread Briggs, John
No, it was a fresh install.  Plus a file search reveals only one copy of the file.

I also noticed that I cannot use the file properties to run as administrator. I 
must right-click on Git GUI and select "More >> Run as administrator" in the 
start menu, even though I have "run as administrator" checked on both the 
shortcut and the program (or any combination of the two).

John

-Original Message-
From: Jonathan Nieder  
Sent: Friday, March 16, 2018 2:53 PM
To: Briggs, John 
Cc: git@vger.kernel.org; git-for-wind...@googlegroups.com
Subject: Re: getting fatal error trying to open git gui

Briggs, John wrote:
> Jonathan Nieder wrote:
>> Briggs, John wrote:

>>> I just installed git for windows 10 and am getting "git-gui: fatal 
>>> error" "Cannot parse Git version string.
>>>
>>> When I execute "git version" in the command prompt I get Git version
>>> 2.16.2.windows.1
>>>
>>> Everything else seems to be working. How can I get the gui to work?
>>
>> That's strange indeed.  Why is Git capitalized when you run "git version"?
>
> Got it figured out. Git gui must be ran as administrator.

Hm, that leaves me even more mystified.

Before v1.7.4-rc0~155^2~4 (git-gui: generic version trimming, 2010-10-07), 
git-gui was not able to handle "git version" output like "git version 
2.16.2.windows.1", but since then, it should have been able to cope fine with 
it.

I wonder: do you have multiple versions of git gui installed?

Thanks,
Jonathan


Re: [PATCH v4 07/11] pack-objects: refer to delta objects by index instead of pointer

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> These delta pointers always point to elements in the objects[] array
> in packing_data struct. We can only hold maximum 4GB of those objects

4GB, as in "number of bytes"?  Or "We can hold 4 billion or so of
those objects"?

> because the array length, nr_objects, is uint32_t. We could use
> uint32_t indexes to address these elements instead of pointers. On
> 64-bit architecture (8 bytes per pointer) this would save 4 bytes per
> pointer.
>
> Convert these delta pointers to indexes. Since we need to handle NULL
> pointers as well, the index is shifted by one [1].
>
> [1] This means we can only index 2^32-2 objects even though nr_objects
> could contain 2^32-1 objects. It should not be a problem in
> practice because when we grow objects[], nr_alloc would probably
> blow up long before nr_objects hits the wall.
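
A minimal sketch of that shift-by-one encoding (field and helper names
are assumed here, not necessarily what the series uses):

  /* 0 means "no delta"; i + 1 refers to pack->objects[i]. */
  static inline struct object_entry *oe_delta(struct packing_data *pack,
                                              const struct object_entry *e)
  {
          return e->delta_idx ? &pack->objects[e->delta_idx - 1] : NULL;
  }

  static inline void oe_set_delta(struct packing_data *pack,
                                  struct object_entry *e,
                                  const struct object_entry *delta)
  {
          e->delta_idx = delta ? (uint32_t)(delta - pack->objects + 1) : 0;
  }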



Re: getting fatal error trying to open git gui

2018-03-16 Thread Jonathan Nieder
Briggs, John wrote:
> Jonathan Nieder wrote:
>> Briggs, John wrote:

>>> I just installed git for windows 10 and am getting "git-gui: fatal
>>> error" "Cannot parse Git version string.
>>>
>>> When I execute "git version" in the command prompt I get Git version
>>> 2.16.2.windows.1
>>>
>>> Everything else seems to be working. How can I get the gui to work?
>>
>> That's strange indeed.  Why is Git capitalized when you run "git version"?
>
> Got it figured out. Git gui must be ran as administrator.

Hm, that leaves me even more mystified.

Before v1.7.4-rc0~155^2~4 (git-gui: generic version trimming,
2010-10-07), git-gui was not able to handle "git version" output like
"git version 2.16.2.windows.1", but since then, it should have been
able to cope fine with it.

I wonder: do you have multiple versions of git gui installed?

Thanks,
Jonathan


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 11:33:55AM -0700, Junio C Hamano wrote:

> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
> 
> A well-built packfile tends to have newer objects in base form and
> has deltas that go in the backward direction (older objects are
> represented as delta against newer ones).  This helps walking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
> 
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptographically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I coincidentally was doing some experiments in this area a few weeks
ago, and found a few things:

  1. The ordering makes a _huge_ difference for accessing trees and
 blobs.

  2. Pack order (not pack-idx order) is actually the best order, since
 it tends to follow the delta patterns (it's close to traversal
 order, but packs delta families more tightly).

  3. None of this really matters for commits, since we almost never
 store them as deltas anyway.

Here are a few experiments people can do themselves to demonstrate (my
numbers here are all from linux.git, which is sort of a worst-case
for bad ordering because its size stresses the default delta cache):

  [every object in sha1 order: slow]
  $ time git cat-file --batch-all-objects --batch >/dev/null
  real  8m44.041s
  user  8m31.359s
  sys   0m12.262s

  [every object from a traversal: faster, but --objects traversals are
   actually CPU heavy due to all of the hash lookups for each tree. Note
   not just wall-clock time but the CPU since it's split across two
   processes]
  $ time git rev-list --objects --all |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  1m2.667s
  user  0m58.537s
  sys   0m32.392s

  [every object in pack order: fastest. This is due to skipping the
   traversal overhead, and should use our delta cache quite efficiently.
   I'm assuming a single pack and no loose objects here, but the
   performance should generalize since accessing the "big" pack
   dominates]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
 sort -n |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m51.718s
  user  0m50.963s
  sys   0m7.068s

  [just commits, sha1 order: not horrible]
  $ time git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
 grep ^commit |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m8.115s
  user  0m14.033s
  sys   0m1.170s

  [just commits, pack order: slightly worse due to the extra piping, but
   obviously that could be done more quickly internally]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
 sort -n |
 cut -d' ' -f2 |
 git cat-file --batch-check='%(objecttype) %(objectname)' |
 grep ^commit |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m21.670s
  user  0m24.867s
  sys   0m9.600s

  [and the reason is that hardly any commits get deltas]
  $ git cat-file --batch-all-objects --batch-check='%(objecttype) %(deltabase)' |
    grep ^commit >commits
  $ wc -l commits
  692596
  $ grep -v '' commits | wc -l
  18856

For the purposes of this patch series, I don't think the order matters
much, since we're only dealing with commits. For doing --batch-check, I
think the sha1 ordering given by "cat-file --batch-all-objects" is
convenient, and doesn't have a big impact on performance. But it's
_awful_ for --batch. I think we may want to add a sorting option to just
return the objects in the original packfile order.
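Such an option essentially only needs to sort the candidate objects by
their offset in the pack before emitting them; a sketch (assuming a
single pack, using offsets from nth_packed_object_offset()):

  struct oid_off {
          struct object_id oid;
          off_t offset;
  };

  static int cmp_by_pack_offset(const void *a_, const void *b_)
  {
          const struct oid_off *a = a_, *b = b_;
          if (a->offset != b->offset)
                  return a->offset < b->offset ? -1 : 1;
          return 0;
  }

  /* fill an array of oid_off from the .idx, then: */
  /* QSORT(list, nr, cmp_by_pack_offset); */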

-Peff


Re: [PATCH v4 02/11] pack-objects: turn type and in_pack_type to bitfields

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> An extra field type_valid is added to carry the equivalent of OBJ_BAD
> in the original "type" field. in_pack_type always contains a valid
> type so we only need 3 bits for it.
> ...
> @@ -1570,7 +1576,7 @@ static void drop_reused_delta(struct object_entry *entry)
>  	entry->depth = 0;
>  
>  	oi.sizep = &entry->size;
> -	oi.typep = &entry->type;
> +	oi.typep = &type;
>  	if (packed_object_info(entry->in_pack, entry->in_pack_offset, &oi) < 0) {
>  		/*
>  		 * We failed to get the info from this pack for some reason;
> @@ -1578,8 +1584,10 @@ static void drop_reused_delta(struct object_entry *entry)
>  		 * And if that fails, the error will be recorded in entry->type

This "entry->type" needs updating.

>  		 * and dealt with in prepare_pack().
>  		 */
> -		entry->type = sha1_object_info(entry->idx.oid.hash,
> -					       &entry->size);
> +		oe_set_type(entry, sha1_object_info(entry->idx.oid.hash,
> +						    &entry->size));
> +	} else {
> +		oe_set_type(entry, type);
>  	}
>  }


[PATCH v3 1/2] stash push: avoid printing errors

2018-03-16 Thread Thomas Gummerer
Currently 'git stash push -u -- <pathspec>' prints the following errors
if <pathspec> only matches untracked files:

fatal: pathspec 'untracked' did not match any files
error: unrecognized input

This is because we first clean up the untracked files using 'git clean
<pathspec>', and then use a command chain involving 'git add -u
<pathspec>' and 'git apply' to clear the changes to files that are in
the index and were stashed.

As the <pathspec> only includes untracked files that were already
removed by 'git clean', the 'git add' call will barf, and so will 'git
apply', as there are no changes that need to be applied.

Fix this by making sure to only call this command chain if there are
still files that match <pathspec> after the call to 'git clean'.

Reported-by: Marc Strapetz 
Signed-off-by: Thomas Gummerer 
---
 git-stash.sh |  9 ++---
 t/t3903-stash.sh | 16 
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/git-stash.sh b/git-stash.sh
index fc8f8ae640..4de9f9bea8 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -322,9 +322,12 @@ push_stash () {
 
 	if test $# != 0
 	then
-		git add -u -- "$@" |
-		git checkout-index -z --force --stdin
-		git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
+		if git ls-files --error-unmatch -- "$@" >/dev/null 2>/dev/null
+		then
+			git add -u -- "$@" |
+			git checkout-index -z --force --stdin
+			git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
+		fi
 	else
 		git reset --hard -q
 	fi
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index aefde7b172..8b1a8d2982 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -1096,4 +1096,20 @@ test_expect_success 'stash --  works with binary files' '
test_path_is_file subdir/untracked
 '
 
+test_expect_success 'stash -u -- <untracked> doesnt print error' '
+   >untracked &&
+   git stash push -u -- untracked 2>actual &&
+   test_path_is_missing untracked &&
+   test_line_count = 0 actual
+'
+
+test_expect_success 'stash -u -- <untracked> leaves rest of working tree in place' '
+   >tracked &&
+   git add tracked &&
+   >untracked &&
+   git stash push -u -- untracked &&
+   test_path_is_missing untracked &&
+   test_path_is_file tracked
+'
+
 test_done
-- 
2.16.2.804.g6dcf76e11



[PATCH v3 2/2] stash push -u: don't create empty stash

2018-03-16 Thread Thomas Gummerer
When introducing the stash push feature, and thus allowing users to pass
in a pathspec to limit the files that would get stashed in
df6bba0937 ("stash: teach 'push' (and 'create_stash') to honor
pathspec", 2017-02-28), this developer missed one place where the
pathspec should be passed in.

Namely in the call to the 'untracked_files()' function in the
'no_changes()' function.  This resulted in 'git stash push -u --
<pathspec>' creating an empty stash when there are untracked files
in the repository other than those that match the pathspec.

As 'git stash' never creates empty stashes, this behaviour is wrong and
confusing for users.  Instead it should just show a message "No local
changes to save", and not create a stash.

Luckily the 'untracked_files()' function already correctly respects
pathspecs that are passed to it, so the fix is simply to pass the
pathspec along to the function.

Reported-by: Marc Strapetz 
Signed-off-by: Thomas Gummerer 
---
 git-stash.sh | 2 +-
 t/t3903-stash.sh | 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/git-stash.sh b/git-stash.sh
index 4de9f9bea8..dbedc7fb9f 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -39,7 +39,7 @@ fi
 no_changes () {
git diff-index --quiet --cached HEAD --ignore-submodules -- "$@" &&
git diff-files --quiet --ignore-submodules -- "$@" &&
-   (test -z "$untracked" || test -z "$(untracked_files)")
+   (test -z "$untracked" || test -z "$(untracked_files "$@")")
 }
 
 untracked_files () {
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index 8b1a8d2982..7efc52fe11 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -1112,4 +1112,10 @@ test_expect_success 'stash -u -- <untracked> leaves rest of working tree in plac
test_path_is_file tracked
 '
 
+test_expect_success 'stash -u -- <non-existant> shows no changes when there are none' '
+   git stash push -u -- non-existant >actual &&
+   echo "No local changes to save" >expect &&
+   test_i18ncmp expect actual
+'
+
 test_done
-- 
2.16.2.804.g6dcf76e11



[PATCH v3 0/2] stash push -u -- fixes

2018-03-16 Thread Thomas Gummerer
Thanks Marc for catching the regression I almost introduced and Junio
for the review of the second patch.  Here's a re-roll that should fix
the issues of v2.

Interdiff below:

diff --git a/git-stash.sh b/git-stash.sh
index 7a4ec98f6b..dbedc7fb9f 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -39,7 +39,7 @@ fi
 no_changes () {
git diff-index --quiet --cached HEAD --ignore-submodules -- "$@" &&
git diff-files --quiet --ignore-submodules -- "$@" &&
-   (test -z "$untracked" || test -z "$(untracked_files $@)")
+   (test -z "$untracked" || test -z "$(untracked_files "$@")")
 }
 
 untracked_files () {
@@ -320,11 +320,14 @@ push_stash () {
git clean --force --quiet -d $CLEAN_X_OPTION -- "$@"
fi
 
-	if test $# != 0 && git ls-files --error-unmatch -- "$@" >/dev/null 2>/dev/null
+	if test $# != 0
 	then
-		git add -u -- "$@" |
-		git checkout-index -z --force --stdin
-		git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
+		if git ls-files --error-unmatch -- "$@" >/dev/null 2>/dev/null
+		then
+			git add -u -- "$@" |
+			git checkout-index -z --force --stdin
+			git diff-index -p --cached --binary HEAD -- "$@" | git apply --index -R
+		fi
else
git reset --hard -q
fi
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index 5e7078c083..7efc52fe11 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -1103,6 +1103,15 @@ test_expect_success 'stash -u -- <untracked> doesnt print error' '
test_line_count = 0 actual
 '
 
+test_expect_success 'stash -u -- <untracked> leaves rest of working tree in place' '
+   >tracked &&
+   git add tracked &&
+   >untracked &&
+   git stash push -u -- untracked &&
+   test_path_is_missing untracked &&
+   test_path_is_file tracked
+'
+
 test_expect_success 'stash -u -- <non-existant> shows no changes when there are none' '
git stash push -u -- non-existant >actual &&
echo "No local changes to save" >expect &&

Thomas Gummerer (2):
  stash push: avoid printing errors
  stash push -u: don't create empty stash

 git-stash.sh | 11 +++
 t/t3903-stash.sh | 22 ++
 2 files changed, 29 insertions(+), 4 deletions(-)

-- 
2.16.2.804.g6dcf76e11


Re: [PATCH v4 01/11] pack-objects: a bit of document about struct object_entry

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> The role of this comment block becomes more important after we shuffle
> fields around to shrink this struct. It will be much harder to see what
> field is related to what.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  pack-objects.h | 44 
>  1 file changed, 44 insertions(+)
>
> diff --git a/pack-objects.h b/pack-objects.h
> index 03f1191659..85345a4af1 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -1,6 +1,50 @@
>  #ifndef PACK_OBJECTS_H
>  #define PACK_OBJECTS_H
>  
> +/*
> + * basic object info
> + * -----------------
> + * idx.oid is filled up before delta searching starts. idx.crc32 and
> + * is only valid after the object is written out and will be used for

"and is"?

> + * generating the index. idx.offset will be both gradually set and
> + * used in writing phase (base objects get offset first, then deltas
> + * refer to them)
> + *
> + * "size" is the uncompressed object size. Compressed size is not
> + * cached (ie. raw data in a pack) but available via revindex.

I am having a hard time understanding what "ie. raw data in a pack"
is doing in that sentence.

It is correct that compressed size is not cached; it does not even
exist and the only way to know it is to compute it by reversing the
.idx file (or actually uncompressing the compressed stream).

Perhaps:

Compressed size of the raw data for an object in a pack is not
stored anywhere but is computed and made available when reverse
.idx is made.

> + * "hash" contains a path name hash which is used for sorting the
> + * delta list and also during delta searching. Once prepare_pack()
> + * returns it's no longer needed.

Hmm, that suggests an interesting optimization opportunity ;-)

> + * source pack info
> + * ----------------
> + * The (in_pack, in_pack_offset, in_pack_header_size) tuple contains
> + * the location of the object in the source pack, with or without
> + * header.

"with or without", meaning...?  An object in the source pack may or
may not have any in_pack_header, in which case in_pack_header_size
is zero, or something?  Not suggesting to rephrase (at least not
yet), but trying to understand.

> + * "type" and "in_pack_type" both describe object type. in_pack_type
> + * may contain a delta type, while type is always the canonical type.
> + *
> + * deltas
> + * ------
> + * Delta links (delta, delta_child and delta_sibling) are created
> + * reflect that delta graph from the source pack then updated or added
> + * during delta searching phase when we find better deltas.

Isn't anything missing after "are created"?  Perhaps "to"?

> + *
> + * delta_child and delta_sibling are last needed in
> + * compute_write_order(). "delta" and "delta_size" must remain valid
> + * at object writing phase in case the delta is not cached.

True.  I thought child and sibling are only needed during write
order computing, so there may be an optimization opportunity there.

> + * If a delta is cached in memory and is compressed delta_data points

s/compressed delta_data/compressed, delta_data/;

> + * to the data and z_delta_size contains the compressed size. If it's
> + * uncompressed [1], z_delta_size must be zero. delta_size is always
> + * the uncompressed size and must be valid even if the delta is not
> + * cached.
> + *
> + * [1] during try_delta phase we don't bother with compressing because
> + * the delta could be quickly replaced with a better one.
> + */
>  struct object_entry {
>   struct pack_idx_entry idx;
>   unsigned long size; /* uncompressed size */


RE: getting fatal error trying to open git gui

2018-03-16 Thread Briggs, John
Got it figured out. Git gui must be ran as administrator.

John

-Original Message-
From: Jonathan Nieder  
Sent: Friday, March 16, 2018 1:58 PM
To: Briggs, John 
Cc: git@vger.kernel.org; git-for-wind...@googlegroups.com
Subject: Re: getting fatal error trying to open git gui

Hi,

Briggs, John wrote:

> I just installed git for windows 10 and am getting "git-gui: fatal error" 
> "Cannot parse Git version string.
>
> When I execute "git version" in the command prompt I get Git version 
> 2.16.2.windows.1
>
> Everything else seems to be working. How can I get the gui to work?

That's strange indeed.  Why is Git capitalized when you run "git version"?

Thanks,
Jonathan


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 04:06:39PM -0400, Jeff King wrote:

> > Furthermore, in order to look at an object it has to be zlib inflated
> > first, and since commit objects tend to be much smaller than trees and
> > especially blobs, there are a lot less bytes to inflate:
> > 
> >   $ grep ^commit type-size |cut -d' ' -f2 |avg
> >   34395730 / 53754 = 639
> >   $ cat type-size |cut -d' ' -f2 |avg
> >   3866685744 / 244723 = 15800
> > 
> > So a simple revision walk inflates less than 1% of the bytes that the
> > "enumerate objects packfiles" approach has to inflate.
> 
> I don't think this is quite accurate. It's true that we have to
> _consider_ every object, but Git is smart enough not to inflate each one
> to find its type. For loose objects we just inflate the header. For
> packed objects, we either pick the type directly out of the packfile
> header (for a non-delta) or can walk the delta chain (without actually
> looking at the data bytes!) until we hit the base.
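
In pseudo-C the type lookup looks roughly like this (the helper names
are made up; the real logic lives in the packfile code):

  /* Sketch: find the canonical type without inflating object data. */
  static enum object_type packed_type(struct packed_git *p, off_t offset)
  {
          enum object_type type = read_entry_header(p, &offset); /* assumed */

          while (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
                  /* hop to the delta base; no data bytes are inflated */
                  offset = delta_base_offset(p, offset); /* assumed */
                  type = read_entry_header(p, &offset);
          }
          return type; /* OBJ_COMMIT, OBJ_TREE, OBJ_BLOB or OBJ_TAG */
  }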

Hmm, so that's a big part of the problem with this patch series. It
actually _does_ unpack every object with --stdin-packs to get the type,
which is just silly. With the patch below, my time for "commit-graph
write --stdin-packs" on linux.git goes from over 5 minutes (I got bored
and killed it) to 17 seconds.

diff --git a/commit-graph.c b/commit-graph.c
index 6348bab82b..cf1da2e8c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -491,11 +491,12 @@ static int add_packed_commits(const struct object_id *oid,
 {
struct packed_oid_list *list = (struct packed_oid_list*)data;
enum object_type type;
-   unsigned long size;
-   void *inner_data;
off_t offset = nth_packed_object_offset(pack, pos);
-	inner_data = unpack_entry(pack, offset, &type, &size);
-   FREE_AND_NULL(inner_data);
+   struct object_info oi = OBJECT_INFO_INIT;
+
+	oi.typep = &type;
+	if (packed_object_info(pack, offset, &oi) < 0)
+   die("unable to get type of object %s", oid_to_hex(oid));
 
if (type != OBJ_COMMIT)
return 0;

-Peff


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread SZEDER Gábor
On Fri, Mar 16, 2018 at 7:33 PM, Junio C Hamano  wrote:
> SZEDER Gábor  writes:
>
>> You should forget '--stdin-packs' and use '--stdin-commits' to generate
>> the initial graph, it's much faster even without '--additive'[1].  See
>>
>>   
>> https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/
>>
>> I still think that the default behaviour for 'git commit-graph write'
>> should simply walk history from all refs instead of enumerating all
>> objects in all packfiles.
>
> Somehow I missed that one.  Thanks for the link to it.
>
> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
>
> A well-built packfile tends to have newer objects in base form and
> has deltas that go in the backward direction (older objects are
> represented as delta against newer ones).  This helps walking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
>
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptographically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I came up with a different explanation back then: we are only interested
in commit objects when creating the commit graph, and only a small-ish
fraction of all objects are commit objects, so the "enumerate objects in
packfiles" approach has to look at a lot more objects:

  # in my git fork
  $ git rev-list --all --objects |cut -d' ' -f1 |\
git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
  $ grep -c ^commit type-size
  53754
  $ wc -l type-size
  244723 type-size

I.e. only about 20% of all objects are commit objects.

Furthermore, in order to look at an object it has to be zlib inflated
first, and since commit objects tend to be much smaller than trees and
especially blobs, there are a lot less bytes to inflate:

  $ grep ^commit type-size |cut -d' ' -f2 |avg
  34395730 / 53754 = 639
  $ cat type-size |cut -d' ' -f2 |avg
  3866685744 / 244723 = 15800

So a simple revision walk inflates less than 1% of the bytes that the
"enumerate objects packfiles" approach has to inflate.


>> [1] - Please excuse the bikeshed: '--additive' is such a strange
>>   sounding option name, at least for me.  '--append', perhaps?
>
> Yeah, I think "fetch --append" is probably a precedence.


Re: Git "branch properties"-- thoughts?

2018-03-16 Thread clime
>But for a generic branch like 'master', that arrangement would not
>work well, I am afraid.  You may have N copies of the same project,
>where the 'master' branch of each is used to work on separate topic.
>The focus of the 'master' branch of the overall project would be
>quite different from each of these copies you have, hence it is
>likely that it would be inappropriate to share the task list and
>stuff you would want to add to branch descriptions and branch
>properties between the shared repository's 'master' and any of your
>'master' in these N copies you have.
>
>So...

I think it could work in a way that other people's branch properties
are only taken into account if you don't have your own properties of
the same names already set. In that case, git would initialize
your branch properties from the other branch of the same name and
inform you about it. Of course, you should always have a way to overwrite
any property with a value of your choice. So what you would be getting
from others would basically be default values (inspired by
https://docs.python.org/3.5/library/stdtypes.html#dict.setdefault).
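
In code terms the proposal is a simple two-level lookup where the
local value wins (everything here is hypothetical; none of these
functions exist in git today):

  /* Hypothetical: local property wins, the shared one is only a default. */
  static const char *branch_property(const char *branch, const char *key)
  {
          const char *v = local_branch_property(branch, key);
          if (v)
                  return v;
          return shared_branch_property(branch, key); /* the "setdefault" part */
  }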

Would it work? This sounds pretty good to me. I would really love
to have a chance to implement this feature if people were
interested in it.

clime


Re: [PATCH v4 09/11] pack-objects: shrink size field in struct object_entry

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> It's very very rare that an uncompressed object is larger than 4GB
> (partly because Git does not handle those large files very well to
> begin with). Let's optimize it for the common case where object size
> is smaller than this limit.
>
> Shrink size field down to 32 bits [1] and one overflow bit. If the size
> is too large, we read it back from disk.

OK.

> Add two compare helpers that can take advantage of the overflow
> bit (e.g. if the file is 4GB+, chances are it's already larger than
> core.bigFileThreshold and there's no point in comparing the actual
> value).

I had trouble reading the callers of these helpers.

> +static inline int oe_size_less_than(const struct object_entry *e,
> + unsigned long limit)
> +{
> + if (e->size_valid)
> + return e->size_ < limit;
> + if (limit > maximum_unsigned_value_of_type(uint32_t))
> + return 1;

When size_valid bit is false, that means that the size is larger
than 4GB.  If "limit" is larger than 4GB, then we do not know
anything, no?  I'd understand if this "optimization" were

if (limit < 4GB) {
/*
 * we know e whose size won't fit in 4GB is larger
 * than that!
 */
return 0;
}

> + return oe_size(e) < limit;
> +}

Also, don't we want to use uintmax_t throughout the callchain?  How
would the code in this series work when your ulong is 32-bit?

> +
> +static inline int oe_size_greater_than(const struct object_entry *e,
> +unsigned long limit)
> +{
> + if (e->size_valid)
> + return e->size_ > limit;
> + if (limit <= maximum_unsigned_value_of_type(uint32_t))
> + return 1;
> + return oe_size(e) > limit;
> +}
> +
> +static inline void oe_set_size(struct object_entry *e,
> +unsigned long size)
> +{
> + e->size_ = size;
> + e->size_valid = e->size_ == size;
> +}
> +
>  #endif


Re: [PATCH v2 4/5] pack-objects: show some progress when counting kept objects

2018-03-16 Thread Duy Nguyen
On Fri, Mar 16, 2018 at 8:14 PM, Duy Nguyen  wrote:
> On Mon, Mar 12, 2018 at 7:32 PM, Ævar Arnfjörð Bjarmason
>  wrote:
>>
>> On Tue, Mar 06 2018, Nguyễn Thái Ngọc Duy jotted:
>>
>>> We only show progress when there are new objects to be packed. But
>>> when --keep-pack is specified on the base pack, we will exclude most
>>> of objects. This makes 'pack-objects' stay silent for a long time
>>> while the counting phase is going.
>>>
>>> Let's show some progress whenever we visit an object instead. The
>>> number of packed objects will be shown after if it's not the same as
>>> the number of visited objects.
>>>
>>> Since the meaning of this number has changed, use another word instead
>>> of "Counting" to hint about the change.
>>
>> Can you elaborate on how the meaning has changed? With/without this on
>> linux.git I get:
>>
>> With:
>>
>> Enumerating objects: 5901144, done.
>> Getting object details: 100% (5901145/5901145), done.
>> Delta compression using up to 8 threads.
>>
>> Without:
>>
>> Counting objects: 5901145, done.
>> Delta compression using up to 8 threads.
>>
>> So now we're seemingly off-by-one but otherwise doing the same thing?
>
> Yep, it's an off-by-one bug.
>
>> As for as user feedback goes we might as well have said "Reticulating
>> splines", but I have some bias towards keeping the current "Counting
>> objects..." phrasing. We ourselves have other docs referring to it that
>> aren't changed by this patch, and there's
>> e.g. https://githubengineering.com/counting-objects/ and lots of other
>> 3rd party docs that refer to this.
>
> This is why I changed the phrase. The counting is now a bit different.
> Documents describing this exact phrase won't apply to the new version.
>
> The old way counts objects that will be packed. The new way simply
> counts objects that are visited. When you keep some packs, the number
> of objects you visit but not pack could be very high, while in normal
> case the two numbers should be the same (e.g. you pack everything you
> visit). I would prefer to print both values (e.g. "counting objects:
> /") but it's not possible with the current progress
> code.

On second thought, maybe instead of introducing a new line "getting
object details" i could just rename that line to "counting objects"?
They are exactly the same, except that in the new version, this
"counting objects" line could run a lot faster than the old line.


-- 
Duy


Re: [PATCH v2 1/2] stash push: avoid printing errors

2018-03-16 Thread Thomas Gummerer
On 03/15, Marc Strapetz wrote:
> On 14.03.2018 22:46, Thomas Gummerer wrote:
> >Currently 'git stash push -u -- <pathspec>' prints the following errors
> >if <pathspec> only matches untracked files:
> >
> > fatal: pathspec 'untracked' did not match any files
> > error: unrecognized input
> >
> >This is because we first clean up the untracked files using 'git clean
> ><pathspec>', and then use a command chain involving 'git add -u
> ><pathspec>' and 'git apply' to clear the changes to files that are in
> >the index and were stashed.
> >
> >As the <pathspec> only includes untracked files that were already
> >removed by 'git clean', the 'git add' call will barf, and so will 'git
> >apply', as there are no changes that need to be applied.
> >
> >Fix this by making sure to only call this command chain if there are
> >still files that match <pathspec> after the call to 'git clean'.
> >
> >Reported-by: Marc Strapetz 
> >Signed-off-by: Thomas Gummerer 
> >---
> >
> >>Either way I'll try to address this as soon as I can get some
> >>time to look at it.
> >
> >I finally got around to do this.  The fix (in the second patch) turns
> >out to be fairly simple, I just forgot to pass the pathspec along to
> >one function when originally introducing the pathspec feature in git
> >stash push (more explanation in the commit message for the patch
> >itself).  Thanks Marc for reporting the two breakages!
> 
> Thanks, I confirm that both issues are resolved. There is another issue now
> which seems to be a regression. When *successfully* stashing an untracked
> file, local modifications of other files are cleared, too.
> 
> $ git init
> $ touch file1
> $ git add file1
> $ git commit -m "initial import"
> $ echo "a" > file1
> $ touch file2
> $ git status --porcelain
>  M file1
> ?? file2
> $ git stash push -u -- file2
> Saved working directory and index state WIP on master: 25352d7 initial
> import
> $ git status
> On branch master
> nothing to commit, working tree clean
> 
> Hence, by stashing just "file2" the local modification of "file1" became
> reset.

Ah yes, this is indeed a regression, thanks for catching it.  I'll fix
it in the re-roll and add another test for this case.  Sorry about the
false starts here :/

> -Marc


Re: [PATCH v2 2/2] stash push -u: don't create empty stash

2018-03-16 Thread Thomas Gummerer
On 03/15, Junio C Hamano wrote:
> Thomas Gummerer  writes:
> 
> >  no_changes () {
> > git diff-index --quiet --cached HEAD --ignore-submodules -- "$@" &&
> > git diff-files --quiet --ignore-submodules -- "$@" &&
> > -   (test -z "$untracked" || test -z "$(untracked_files)")
> > +   (test -z "$untracked" || test -z "$(untracked_files $@)")
> 
> Don't you need a pair of double-quotes around that $@?

Ah yes indeed.  Will fix, thanks!


[PATCH] format-patch: use --compact-summary instead of --summary

2018-03-16 Thread Nguyễn Thái Ngọc Duy
For reviewers, a couple fewer lines from diffstat (either cover letter
or patches) is a good thing because there's less to read.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 I realize that rename percentage is lost with this change. I just
 hope that nobody cares about that, or I'll have to implement rename
 percentage in --compact-summary too in order to move this patch
 forward.

 builtin/log.c |  9 +---
 ...tach_--stdout_--suffix=.diff_initial..side |  7 +++---
 ...at-patch_--attach_--stdout_initial..master | 19 +++-
 ...t-patch_--attach_--stdout_initial..master^ | 12 +-
 ...rmat-patch_--attach_--stdout_initial..side |  7 +++---
 ..._--stdout_--numbered-files_initial..master | 19 +++-
 ..._--subject-prefix=TESTCASE_initial..master | 19 +++-
 ...at-patch_--inline_--stdout_initial..master | 19 +++-
 ...t-patch_--inline_--stdout_initial..master^ | 12 +-
 ...-patch_--inline_--stdout_initial..master^^ |  7 +++---
 ...rmat-patch_--inline_--stdout_initial..side |  7 +++---
 ...-stdout_--cover-letter_-n_initial..master^ | 22 ---
 ...tch_--stdout_--no-numbered_initial..master | 19 +++-
 ...-patch_--stdout_--numbered_initial..master | 19 +++-
 ...diff.format-patch_--stdout_initial..master | 19 +++-
 ...iff.format-patch_--stdout_initial..master^ | 12 +-
 .../diff.format-patch_--stdout_initial..side  |  7 +++---
 t/t4052-stat-output.sh|  4 ++--
 18 files changed, 103 insertions(+), 136 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 94ee177d56..c8e3be5de7 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1062,7 +1062,8 @@ static void make_cover_letter(struct rev_info *rev, int use_stdout,
return;
 
	memcpy(&opts, &rev->diffopt, sizeof(opts));
-   opts.output_format = DIFF_FORMAT_SUMMARY | DIFF_FORMAT_DIFFSTAT;
+   opts.output_format = DIFF_FORMAT_DIFFSTAT;
+   opts.flags.stat_with_summary = 1;
opts.stat_width = MAIL_DEFAULT_WRAP;
 
diff_setup_done();
@@ -1615,8 +1616,10 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 
if (!use_patch_format &&
(!rev.diffopt.output_format ||
-rev.diffopt.output_format == DIFF_FORMAT_PATCH))
-	rev.diffopt.output_format = DIFF_FORMAT_DIFFSTAT | DIFF_FORMAT_SUMMARY;
+rev.diffopt.output_format == DIFF_FORMAT_PATCH)) {
+   rev.diffopt.output_format = DIFF_FORMAT_DIFFSTAT;
+   rev.diffopt.flags.stat_with_summary = 1;
+   }
if (!rev.diffopt.stat_width)
rev.diffopt.stat_width = MAIL_DEFAULT_WRAP;
 
diff --git a/t/t4013/diff.format-patch_--attach_--stdout_--suffix=.diff_initial..side b/t/t4013/diff.format-patch_--attach_--stdout_--suffix=.diff_initial..side
index 547ca065a5..89ab87fb4e 100644
--- a/t/t4013/diff.format-patch_--attach_--stdout_--suffix=.diff_initial..side
+++ b/t/t4013/diff.format-patch_--attach_--stdout_--suffix=.diff_initial..side
@@ -12,11 +12,10 @@ Content-Type: text/plain; charset=UTF-8; format=fixed
 Content-Transfer-Encoding: 8bit
 
 ---
- dir/sub | 2 ++
- file0   | 3 +++
- file3   | 4 
+ dir/sub | 2 ++
+ file0   | 3 +++
+ file3 (new) | 4 
  3 files changed, 9 insertions(+)
- create mode 100644 file3
 
 
 --g-i-t--v-e-r-s-i-o-n
diff --git a/t/t4013/diff.format-patch_--attach_--stdout_initial..master b/t/t4013/diff.format-patch_--attach_--stdout_initial..master
index 52fedc179e..48f284f1d9 100644
--- a/t/t4013/diff.format-patch_--attach_--stdout_initial..master
+++ b/t/t4013/diff.format-patch_--attach_--stdout_initial..master
@@ -14,11 +14,10 @@ Content-Transfer-Encoding: 8bit
 
 This is the second commit.
 ---
- dir/sub | 2 ++
- file0   | 3 +++
- file2   | 3 ---
+ dir/sub  | 2 ++
+ file0| 3 +++
+ file2 (gone) | 3 ---
  3 files changed, 5 insertions(+), 3 deletions(-)
- delete mode 100644 file2
 
 
 --g-i-t--v-e-r-s-i-o-n
@@ -73,10 +72,9 @@ Content-Type: text/plain; charset=UTF-8; format=fixed
 Content-Transfer-Encoding: 8bit
 
 ---
- dir/sub | 2 ++
- file1   | 3 +++
+ dir/sub | 2 ++
+ file1 (new) | 3 +++
  2 files changed, 5 insertions(+)
- create mode 100644 file1
 
 
 --g-i-t--v-e-r-s-i-o-n
@@ -121,11 +119,10 @@ Content-Type: text/plain; charset=UTF-8; format=fixed
 Content-Transfer-Encoding: 8bit
 
 ---
- dir/sub | 2 ++
- file0   | 3 +++
- file3   | 4 
+ dir/sub | 2 ++
+ file0   | 3 +++
+ file3 (new) | 4 
  3 files changed, 9 insertions(+)
- create mode 100644 file3
 
 
 --g-i-t--v-e-r-s-i-o-n
diff --git a/t/t4013/diff.format-patch_--attach_--stdout_initial..master^ b/t/t4013/diff.format-patch_--attach_--stdout_initial..master^
index 1c3cde251b..be6de1290a 100644
--- a/t/t4013/diff.format-patch_--attach_--stdout_initial..master^
+++ 

Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 08:48:49PM +0100, SZEDER Gábor wrote:

> I came up with a different explanation back then: we are only interested
> in commit objects when creating the commit graph, and only a small-ish
> fraction of all objects are commit objects, so the "enumerate objects in
> packfiles" approach has to look at a lot more objects:
> 
>   # in my git fork
>   $ git rev-list --all --objects |cut -d' ' -f1 |\
> git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
>   $ grep -c ^commit type-size
>   53754
>   $ wc -l type-size
>   244723 type-size
> 
> I.e. only about 20% of all objects are commit objects.
> 
> Furthermore, in order to look at an object it has to be zlib inflated
> first, and since commit objects tend to be much smaller than trees and
> especially blobs, there are a lot less bytes to inflate:
> 
>   $ grep ^commit type-size |cut -d' ' -f2 |avg
>   34395730 / 53754 = 639
>   $ cat type-size |cut -d' ' -f2 |avg
>   3866685744 / 244723 = 15800
> 
> So a simple revision walk inflates less than 1% of the bytes that the
> "enumerate objects packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.

So starting from scratch:

  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

is in the same ballpark for most repos as:

  git rev-list --all |
  git cat-file --batch

though in my timings the traversal is a little bit faster (and I'd
expect that to remain the case when doing it all in a single process,
since the traversal only follows commit links, whereas processing the
object list has to do the type lookup for each object before deciding
whether to inflate it).

I'm not sure, though, if that edge would remain for incremental updates.
For instance, after we take in some new objects via "fetch", the
traversal strategy would want to do something like:

  git rev-list $new_tips --not --all |
  git cat-file --batch

whose performance will depend on the refs _currently_ in the repository,
as we load them as UNINTERESTING tips for the walk. Whereas doing:

  git show-index <.git/objects/pack/the-one-new-packfile.idx |
  cut -d' ' -f2 |
  git cat-file --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

always scales exactly with the size of the new objects (obviously that's
kind of baroque and this would all be done internally, but I'm trying to
demonstrate the algorithmic complexity). I'm not sure what the plan
would be if we explode loose objects, though. ;)

-Peff


Re: [PATCH v4 08/11] pack-objects: shrink z_delta_size field in struct object_entry

2018-03-16 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> We only cache deltas when they are smaller than a certain limit. This limit
> defaults to 1000 but we save the compressed length in a 64-bit field.
> Shrink that field down to 16 bits, so you can only cache 65kb deltas.
> Larger deltas must be recomputed when the pack is written down.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---

>   if (entry->delta_data && !pack_to_stdout) {
> -	entry->z_delta_size = do_compress(&entry->delta_data,
> -   entry->delta_size);
> - cache_lock();
> - delta_cache_size -= entry->delta_size;
> - delta_cache_size += entry->z_delta_size;
> - cache_unlock();
> + unsigned long size;
> +
> +		size = do_compress(&entry->delta_data, entry->delta_size);
> + entry->z_delta_size = size;
> + if (entry->z_delta_size == size) {

It is confusing to readers to write

A = B;
if (A == B) {
/* OK, A was big enough */
} else {
/* No, B is too big to fit on A */
}

I actually was about to complain that you attempted an unrelated
micro-optimization to skip cache_lock/unlock when delta_size and
z_delta_size are the same, and made a typo.  Something like:

size = do_compress(...);
if (size < (1 << OE_Z_DELTA_BITS)) {
entry->z_delta_size = size;
cache_lock();
...
cache_unlock();
} else {
FREE_AND_NULL(entry->delta_data);
entry->z_delta_size = 0;
}

would have saved me a few dozens of seconds of head-scratching.

> + cache_lock();
> + delta_cache_size -= entry->delta_size;
> + delta_cache_size += entry->z_delta_size;
> + cache_unlock();
> + } else {
> + FREE_AND_NULL(entry->delta_data);
> + entry->z_delta_size = 0;
> + }
>   }


Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Michael Haggerty jotted:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the respository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
>
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
> bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
> names or file paths, huge commit messages, etc.
>
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

This is a very useful tool. I've been using it to get insight into some
bad repositories.

Suggestion for a thing to add to it, I don't have the time or the Go
tuits:

One thing that can make repositories very pathological is if the ratio
of trees to commits is too low.

I was dealing with a repo the other day that had several thousand files
all in the same root directory, and no subdirectories.

This meant that doing `git log -- <path>` was very expensive. I wrote a
bit about this on this related ticket the other day:
https://gitlab.com/gitlab-org/gitlab-ce/issues/42104#note_54933512

But it's not something where you can just say having more trees is
better, because on the other end of the spectrum we can imagine a repo
like linux.git where each file like COPYING instead exists at
C/O/P/Y/I/N/G; that would also be pathological.

It would be very interesting to do some tests to see what the optimal
value would be.

I also suspect it's not really about the commit / tree ratio, but that
you have some reasonable amount of nested trees per file, *and* that
changes to them are reasonably spread out. I.e. it doesn't help if you
have a doc/ and a src/ directory if 99% of your commits change src/, and
if you're doing 'git log -- src/something.c'.

Which is all a very long-winded way of saying that I don't know what the
general rule is, though I have some suspicions; having all your files
in the root is definitely bad.


[PATCH 0/2] routines to generate JSON data

2018-03-16 Thread git
From: Jeff Hostetler 

This patch series adds a set of utility routines to compose data in JSON
format into a "struct strbuf".  The resulting string can then be output
by commands wanting to support a JSON output format.

This is a stand-alone patch.  Nothing currently uses these routines.  I'm
currently working on a series to log "telemetry" data (as we discussed
briefly during Ævar's "Performance Misc" session [1] in Barcelona last
week).  And I want to emit the data in JSON rather than a fixed column/field
format.  The JSON routines here are independent of that, so it made sense
to submit the JSON part by itself.

Back when we added porcelain=v2 format to status, we talked about adding a
JSON format.  I think the routines in this patch would let us easily do
that, if someone were interested.  (Extending status is not on my radar
right now, however.)

Documentation for the new API is given in json-writer.h at the bottom of
the first patch.

I wasn't sure how to unit test the API from a shell script, so I added a
helper command that does most of the work in the second patch.

[1] https://public-inbox.org/git/20180313004940.gg61...@google.com/T/
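
As a quick orientation, composing a flat object with this API looks
like the following (based on the signatures in patch 1/2; error
handling omitted):

  struct strbuf out = STRBUF_INIT;

  jw_object_begin(&out);
  jw_object_append_string(&out, "a", "abc");
  jw_object_append_int(&out, "b", 42);
  jw_object_append_true(&out, "c");
  jw_object_end(&out);

  /* out.buf now holds: {"a":"abc","b":42,"c":true} */
  printf("%s\n", out.buf);
  strbuf_release(&out);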


Jeff Hostetler (2):
  json_writer: new routines to create data in JSON format
  json-writer: unit test

 Makefile|   2 +
 json-writer.c   | 224 
 json-writer.h   | 120 
 t/helper/test-json-writer.c | 146 +
 t/t0019-json-writer.sh  |  10 ++
 5 files changed, 502 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

-- 
2.9.3



[PATCH 1/2] json_writer: new routines to create data in JSON format

2018-03-16 Thread git
From: Jeff Hostetler 

Add basic routines to generate data in JSON format.

Signed-off-by: Jeff Hostetler 
---
 Makefile  |   1 +
 json-writer.c | 224 ++
 json-writer.h | 120 +++
 3 files changed, 345 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h

diff --git a/Makefile b/Makefile
index 1a9b23b..9000369 100644
--- a/Makefile
+++ b/Makefile
@@ -815,6 +815,7 @@ LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
+LIB_OBJS += json-writer.o
 LIB_OBJS += kwset.o
 LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
diff --git a/json-writer.c b/json-writer.c
new file mode 100644
index 000..755ff80
--- /dev/null
+++ b/json-writer.c
@@ -0,0 +1,224 @@
+#include "cache.h"
+#include "json-writer.h"
+
+/*
+ * Append JSON-quoted version of the given string to 'out'.
+ */
+static void jw_append_quoted_string(struct strbuf *out, const char *in)
+{
+   strbuf_addch(out, '"');
+   for (/**/; *in; in++) {
+   unsigned char c = (unsigned char)*in;
+   if (c == '"')
+   strbuf_add(out, "\\\"", 2);
+   else if (c == '\\')
+   strbuf_add(out, "", 2);
+   else if (c == '\n')
+   strbuf_add(out, "\\n", 2);
+   else if (c == '\r')
+   strbuf_add(out, "\\r", 2);
+   else if (c == '\t')
+   strbuf_add(out, "\\t", 2);
+   else if (c == '\f')
+   strbuf_add(out, "\\f", 2);
+   else if (c == '\b')
+   strbuf_add(out, "\\b", 2);
+   else if (c < 0x20)
+   strbuf_addf(out, "\\u%04x", c);
+   else
+   strbuf_addch(out, c);
+   }
+   strbuf_addch(out, '"');
+}
+
+void jw_object_begin(struct strbuf *out)
+{
+   strbuf_reset(out);
+   strbuf_addch(out, '{');
+}
+
+void jw_object_append_string(struct strbuf *out, const char *key,
+const char *value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addch(out, ':');
+   jw_append_quoted_string(out, value);
+}
+
+void jw_object_append_int(struct strbuf *out, const char *key, int value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addf(out, ":%d", value);
+}
+
+void jw_object_append_uint64(struct strbuf *out, const char *key, uint64_t value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addf(out, ":%"PRIuMAX, value);
+}
+
+void jw_object_append_double(struct strbuf *out, const char *key, double value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addf(out, ":%f", value);
+}
+
+void jw_object_append_true(struct strbuf *out, const char *key)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addstr(out, ":true");
+}
+
+void jw_object_append_false(struct strbuf *out, const char *key)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addstr(out, ":false");
+}
+
+void jw_object_append_null(struct strbuf *out, const char *key)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addstr(out, ":null");
+}
+
+void jw_object_append_object(struct strbuf *out, const char *key,
+const char *value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addch(out, ':');
+   strbuf_addstr(out, value);
+}
+
+void jw_object_append_array(struct strbuf *out, const char *key,
+   const char *value)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+
+   jw_append_quoted_string(out, key);
+   strbuf_addch(out, ':');
+   strbuf_addstr(out, value);
+}
+
+void jw_object_end(struct strbuf *out)
+{
+   strbuf_addch(out, '}');
+}
+
+void jw_array_begin(struct strbuf *out)
+{
+   strbuf_reset(out);
+   strbuf_addch(out, '[');
+}
+
+void jw_array_append_string(struct strbuf *out, const char *elem)
+{
+   struct strbuf qe = STRBUF_INIT;
+
+   jw_append_quoted_string(&qe, elem);
+
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+   strbuf_addstr(out, qe.buf);
+
+   strbuf_release(&qe);
+}
+
+void jw_array_append_int(struct strbuf *out, int elem)
+{
+   if (out->len > 1)
+   strbuf_addch(out, ',');
+   

[PATCH 2/2] json-writer: unit test

2018-03-16 Thread git
From: Jeff Hostetler 

Signed-off-by: Jeff Hostetler 
---
 Makefile|   1 +
 t/helper/test-json-writer.c | 146 
 t/t0019-json-writer.sh  |  10 +++
 3 files changed, 157 insertions(+)
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

diff --git a/Makefile b/Makefile
index 9000369..57f58e6 100644
--- a/Makefile
+++ b/Makefile
@@ -662,6 +662,7 @@ TEST_PROGRAMS_NEED_X += test-fake-ssh
 TEST_PROGRAMS_NEED_X += test-genrandom
 TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
+TEST_PROGRAMS_NEED_X += test-json-writer
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
diff --git a/t/helper/test-json-writer.c b/t/helper/test-json-writer.c
new file mode 100644
index 000..bb43efb
--- /dev/null
+++ b/t/helper/test-json-writer.c
@@ -0,0 +1,146 @@
+#include "cache.h"
+#include "json-writer.h"
+
+const char *expect_obj1 = "{\"a\":\"abc\",\"b\":42,\"c\":true}";
+const char *expect_obj2 = "{\"a\":-1,\"b\":2147483647,\"c\":0}";
+const char *expect_obj3 = "{\"a\":0,\"b\":4294967295,\"c\":18446744073709551615}";
+const char *expect_obj4 = "{\"t\":true,\"f\":false,\"n\":null}";
+const char *expect_obj5 = "{\"abc\\tdef\":\"abc\\\\def\"}";
+
+struct strbuf obj1 = STRBUF_INIT;
+struct strbuf obj2 = STRBUF_INIT;
+struct strbuf obj3 = STRBUF_INIT;
+struct strbuf obj4 = STRBUF_INIT;
+struct strbuf obj5 = STRBUF_INIT;
+
+void make_obj1(void)
+{
+   jw_object_begin(&obj1);
+   jw_object_append_string(&obj1, "a", "abc");
+   jw_object_append_int(&obj1, "b", 42);
+   jw_object_append_true(&obj1, "c");
+   jw_object_end(&obj1);
+}
+
+void make_obj2(void)
+{
+   jw_object_begin(&obj2);
+   jw_object_append_int(&obj2, "a", -1);
+   jw_object_append_int(&obj2, "b", 0x7fffffff);
+   jw_object_append_int(&obj2, "c", 0);
+   jw_object_end(&obj2);
+}
+
+void make_obj3(void)
+{
+   jw_object_begin(&obj3);
+   jw_object_append_uint64(&obj3, "a", 0);
+   jw_object_append_uint64(&obj3, "b", 0xffffffff);
+   jw_object_append_uint64(&obj3, "c", 0xffffffffffffffff);
+   jw_object_end(&obj3);
+}
+
+void make_obj4(void)
+{
+   jw_object_begin(&obj4);
+   jw_object_append_true(&obj4, "t");
+   jw_object_append_false(&obj4, "f");
+   jw_object_append_null(&obj4, "n");
+   jw_object_end(&obj4);
+}
+
+void make_obj5(void)
+{
+   jw_object_begin(&obj5);
+   jw_object_append_string(&obj5, "abc" "\x09" "def", "abc" "\\" "def");
+   jw_object_end(&obj5);
+}
+
+const char *expect_arr1 = "[\"abc\",42,true]";
+const char *expect_arr2 = "[-1,2147483647,0]";
+const char *expect_arr3 = "[0,4294967295,18446744073709551615]";
+const char *expect_arr4 = "[true,false,null]";
+
+struct strbuf arr1 = STRBUF_INIT;
+struct strbuf arr2 = STRBUF_INIT;
+struct strbuf arr3 = STRBUF_INIT;
+struct strbuf arr4 = STRBUF_INIT;
+
+void make_arr1(void)
+{
+   jw_array_begin(&arr1);
+   jw_array_append_string(&arr1, "abc");
+   jw_array_append_int(&arr1, 42);
+   jw_array_append_true(&arr1);
+   jw_array_end(&arr1);
+}
+
+void make_arr2(void)
+{
+   jw_array_begin(&arr2);
+   jw_array_append_int(&arr2, -1);
+   jw_array_append_int(&arr2, 0x7fffffff);
+   jw_array_append_int(&arr2, 0);
+   jw_array_end(&arr2);
+}
+
+void make_arr3(void)
+{
+   jw_array_begin(&arr3);
+   jw_array_append_uint64(&arr3, 0);
+   jw_array_append_uint64(&arr3, 0xffffffff);
+   jw_array_append_uint64(&arr3, 0xffffffffffffffff);
+   jw_array_end(&arr3);
+}
+
+void make_arr4(void)
+{
+   jw_array_begin(&arr4);
+   jw_array_append_true(&arr4);
+   jw_array_append_false(&arr4);
+   jw_array_append_null(&arr4);
+   jw_array_end(&arr4);
+}
+
+char *expect_nest1 =
+   "{\"obj1\":{\"a\":\"abc\",\"b\":42,\"c\":true},\"arr1\":[\"abc\",42,true]}";
+
+struct strbuf nest1 = STRBUF_INIT;
+
+void make_nest1(void)
+{
+   jw_object_begin(&nest1);
+   jw_object_append_object(&nest1, "obj1", obj1.buf);
+   jw_object_append_array(&nest1, "arr1", arr1.buf);
+   jw_object_end(&nest1);
+}
+
+void cmp(const char *test, const struct strbuf *buf, const char *exp)
+{
+   if (!strcmp(buf->buf, exp))
+   return;
+
+   printf("error[%s]: observed '%s' expected '%s'\n",
+  test, buf->buf, exp);
+   exit(1);
+}
+
+#define t(v) do { make_##v(); cmp(#v, &v, expect_##v); } while (0)
+
+int cmd_main(int argc, const char **argv)
+{
+   t(obj1);
+   t(obj2);
+   t(obj3);
+   t(obj4);
+   t(obj5);
+
+   t(arr1);
+   t(arr2);
+   t(arr3);
+   t(arr4);
+
+   t(nest1);
+
+   return 0;
+}
diff --git a/t/t0019-json-writer.sh b/t/t0019-json-writer.sh
new file mode 100755
index 000..7b86087
--- /dev/null
+++ b/t/t0019-json-writer.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='test json-writer JSON generation'
+. ./test-lib.sh
+
+test_expect_success 'unit test of json-writer routines' '
+   test-json-writer
+'
+
+test_done
-- 
2.9.3



Re: getting fatal error trying to open git gui

2018-03-16 Thread Jonathan Nieder
Hi,

Briggs, John wrote:

> I just installed git for windows 10 and am getting "git-gui: fatal error" 
> "Cannot parse Git version string.
>
> When I execute "git version" in the command prompt I get
> Git version 2.16.2.windows.1
>
> Everything else seems to be working. How can I get the gui to work?

That's strange indeed.  Why is Git capitalized when you run
"git version"?

Thanks,
Jonathan


[PATCH v2] travis-ci: enable more warnings on travis linux-gcc job

2018-03-16 Thread Nguyễn Thái Ngọc Duy
We have the DEVELOPER config to enable more warnings, but since we can't
assume a fixed gcc version for all devs, that has to be a bit more
conservative. On Travis, we know almost exactly which version will be
used (bumped to 6.x for more warnings), so we can be more aggressive.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 I still think it's a good thing to do. So here's the rebased version
 (v1 collides badly with another series, which has landed)

 .travis.yml   |  3 +++
 ci/run-build-and-tests.sh | 16 
 2 files changed, 19 insertions(+)

diff --git a/.travis.yml b/.travis.yml
index 5f5ee4f3bd..0b3c50f5e7 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -16,10 +16,13 @@ compiler:
 
 addons:
   apt:
+sources:
+- ubuntu-toolchain-r-test
 packages:
 - language-pack-is
 - git-svn
 - apache2
+- gcc-6
 
 matrix:
   include:
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 3735ce413f..f6f346c468 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -7,6 +7,22 @@
 
 ln -s "$cache_dir/.prove" t/.prove
 
+if [ "$jobname" = linux-gcc ]; then
+   gcc-6 --version
+   cat >config.mak <<-EOF
+   CC=gcc-6
+   CFLAGS = -g -O2 -Wall
+   CFLAGS += -Wextra
+   CFLAGS += -Wmissing-prototypes
+   CFLAGS += -Wno-empty-body
+   CFLAGS += -Wno-maybe-uninitialized
+   CFLAGS += -Wno-missing-field-initializers
+   CFLAGS += -Wno-sign-compare
+   CFLAGS += -Wno-unused-function
+   CFLAGS += -Wno-unused-parameter
+   EOF
+fi
+
 make --jobs=2
 make --quiet test
 if test "$jobname" = "linux-gcc"
-- 
2.16.2.903.gd04caf5039



[PATCH v3 0/7] nd/repack-keep-pack updates

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This is not in a mergeable state yet. But since there are lots of changes
and I'm pretty sure there's still room for improvement when it comes
to configuration and stuff, I'm posting it anyway to keep the
discussion going.

- --keep-pack does not imply --honor-pack-keep or --pack-kept-objects
  anymore. If you want it, you pass those in. If not, only
  --keep-pack'd packs are kept.

- v2 03/05 feels too big to me so it's now broken down into a couple
  of patches.

- FreeBSD support is added and tested (which probably means OS X
  support is good too). Windows code for getting memory size remains
  untested.

- platforms that do not have support for getting memory will not have
  this "keep base pack on gc --auto" turned on. This may be just safer
  than printing "unrecognized platform" which may not reach platform
  developers anyway.

- gc.bigBasePackThreshold is renamed to gc.bigPackThreshold and now
  keeps all packs larger than this limit.

- fix the off-by-one in "enumerating objects" code.

- while I tend to agree that the "50% of physical memory" threshold
  should be configurable, I'm not sure if I should add a config var for
  that (which leads to more questions like, should we allow saying "30%
  of _free_ memory" as well?) or just give the memory estimate to the
  user, who can decide what to do with it via hooks.

Interdiff

-- 8< --
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 120cf6bac9..ce40112e31 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1549,12 +1549,16 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
-gc.bigBasePackThreshold::
-   Make `git gc --auto` only enable `--keep-base-pack` when the
-   base pack's size is larger than this limit (in bytes).
-   Defaults to zero, which disables this check and lets
-   `git gc --auto` determine when to enable `--keep-base-pack`
-   based on memory usage.
+gc.bigPackThreshold::
+   If non-zero, all packs larger than this limit are kept when
+   `git gc` is run. This is very similar to `--keep-base-pack`
+   except that all packs that meet the threshold are kept, not
+   just the base pack. Defaults to zero.
++
+Note that if the number of kept packs is more than gc.autoPackLimit,
+this configuration variable is ignored and all packs except the base pack
+will be repacked. After this the number of packs should go below
+gc.autoPackLimit and gc.bigPackThreshold should be respected again.
 
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 35ad420d5c..19b0d1741b 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -9,7 +9,7 @@ git-gc - Cleanup unnecessary files and optimize the local 
repository
 SYNOPSIS
 
 [verse]
-'git gc' [--aggressive] [--auto] [--quiet] [--prune=<date> | --no-prune] [--force]
+'git gc' [--aggressive] [--auto] [--quiet] [--prune=<date> | --no-prune] [--force] [--keep-base-pack]
 
 DESCRIPTION
 ---
@@ -55,15 +55,16 @@ all loose objects are combined into a single pack using
 disables automatic packing of loose objects.
 +
 If the number of packs exceeds the value of `gc.autoPackLimit`,
-then existing packs (except those marked with a `.keep` file)
+then existing packs (except those marked with a `.keep` file
+or over `gc.bigPackThreshold` limit)
 are consolidated into a single pack by using the `-A` option of
-'git repack'. Setting `gc.autoPackLimit` to 0 disables
-automatic consolidation of packs.
-+
-If the physical amount of memory is considered not enough for `git
-repack` to run smoothly, `--keep-base-pack` is enabled. This could be
-overridden by setting `gc.bigBasePackThreshold` which only enables
-`--keep-base-pack` when the base pack is larger the specified limit.
+'git repack'.
+If the amount of available memory is estimated to be insufficient for
+`git repack` to run smoothly and `gc.bigPackThreshold` is not set, the largest
+pack will also be excluded (this is the equivalent of running `git gc`
+with `--keep-base-pack`).
+Setting `gc.autoPackLimit` to 0 disables automatic consolidation of
+packs.
 
 --prune=<date>::
Prune loose objects older than date (default is 2 weeks ago,
@@ -84,8 +85,10 @@ overridden by setting `gc.bigBasePackThreshold` which only 
enables
instance running on this repository.
 
 --keep-base-pack::
-   All packs except the base pack are consolidated into a single
-   pack. The largest pack is considered the base pack.
+   All packs except the base pack and those marked with a `.keep`
+   file are consolidated into a single pack. The largest pack is
+   considered the base pack. When this option is used,
+   `gc.bigPackThreshold` is ignored.
 
 Configuration
 -
@@ -176,10 +179,6 @@ run commands concurrently have to live with some risk of 
corruption (which
 seems to be low in 

[PATCH v3 2/7] gc: add --keep-base-pack

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This adds a new repack mode that combines everything into a secondary
pack, leaving the largest/base pack alone.

This could help reduce memory pressure. On linux-2.6.git, valgrind
massif reports 1.6GB heap in "pack all" case, and 535MB in "pack
all except the base pack" case. We save roughly 1GB memory by
excluding the base pack.

This should also lower I/O because we don't have to rewrite a giant
pack every time (e.g. for linux-2.6.git that's a 1.4GB pack file).

PS. The use of string_list here seems overkill, but we'll need it in
the next patch...

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/git-gc.txt |  7 +-
 builtin/gc.c | 47 
 t/t6500-gc.sh| 22 +++
 3 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..1717517043 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -9,7 +9,7 @@ git-gc - Cleanup unnecessary files and optimize the local 
repository
 SYNOPSIS
 
 [verse]
-'git gc' [--aggressive] [--auto] [--quiet] [--prune=<date> | --no-prune] [--force]
+'git gc' [--aggressive] [--auto] [--quiet] [--prune=<date> | --no-prune] [--force] [--keep-base-pack]
 
 DESCRIPTION
 ---
@@ -78,6 +78,11 @@ automatic consolidation of packs.
Force `git gc` to run even if there may be another `git gc`
instance running on this repository.
 
+--keep-base-pack::
+   All packs except the base pack and those marked with a `.keep`
+   file are consolidated into a single pack. The largest pack is
+   considered the base pack.
+
 Configuration
 -
 
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..362dd537a4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -164,6 +164,24 @@ static int too_many_loose_objects(void)
return needed;
 }
 
+static void find_base_packs(struct string_list *packs)
+{
+   struct packed_git *p, *base = NULL;
+
+   prepare_packed_git();
+
+   for (p = packed_git; p; p = p->next) {
+   if (!p->pack_local)
+   continue;
+   if (!base || base->pack_size < p->pack_size) {
+   base = p;
+   }
+   }
+
+   if (base)
+   string_list_append(packs, base->pack_name);
+}
+
 static int too_many_packs(void)
 {
struct packed_git *p;
@@ -187,7 +205,13 @@ static int too_many_packs(void)
return gc_auto_pack_limit < cnt;
 }
 
-static void add_repack_all_option(void)
+static int keep_one_pack(struct string_list_item *item, void *data)
+{
+   argv_array_pushf(&repack, "--keep-pack=%s", basename(item->string));
+   return 0;
+}
+
+static void add_repack_all_option(struct string_list *keep_pack)
 {
if (prune_expire && !strcmp(prune_expire, "now"))
	argv_array_push(&repack, "-a");
@@ -196,6 +220,9 @@ static void add_repack_all_option(void)
if (prune_expire)
	argv_array_pushf(&repack, "--unpack-unreachable=%s", 
prune_expire);
}
+
+   if (keep_pack)
+   for_each_string_list(keep_pack, keep_one_pack, NULL);
 }
 
 static void add_repack_incremental_option(void)
@@ -219,7 +246,7 @@ static int need_to_gc(void)
 * there is no need.
 */
if (too_many_packs())
-   add_repack_all_option();
+   add_repack_all_option(NULL);
else if (too_many_loose_objects())
add_repack_incremental_option();
else
@@ -353,6 +380,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
const char *name;
pid_t pid;
int daemonized = 0;
+   int keep_base_pack = -1;
 
struct option builtin_gc_options[] = {
	OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -362,6 +390,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
OPT_BOOL(0, "aggressive", , N_("be more thorough 
(increased runtime)")),
OPT_BOOL(0, "auto", _gc, N_("enable auto-gc mode")),
OPT_BOOL(0, "force", , N_("force running gc even if there 
may be another gc running")),
+	OPT_BOOL(0, "keep-base-pack", &keep_base_pack,
+N_("repack all other packs except the base pack")),
OPT_END()
};
 
@@ -427,8 +457,17 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 */
daemonized = !daemonize();
}
-   } else
-   add_repack_all_option();
+   } else {
+   struct string_list keep_pack = STRING_LIST_INIT_NODUP;
+
+   if (keep_base_pack != -1) {
+   if (keep_base_pack)
+			find_base_packs(&keep_pack);
+   }
+
+	add_repack_all_option(&keep_pack);
+	string_list_clear(&keep_pack, 0);
+   }
 
name = 

[PATCH v3 5/7] gc: handle a corner case in gc.bigPackThreshold

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This config allows us to keep <N> packs back if their size is larger
than a limit. But if this <N> >= gc.autoPackLimit, we may have a
problem. We are supposed to reduce the number of packs after a
threshold because it affects performance.

We could tell the user that they have incompatible gc.bigPackThreshold
and gc.autoPackLimit, but it's kinda hard when 'git gc --auto' runs in
the background. Instead let's fall back to the next best strategy: try
to reduce the number of packs anyway, but keep the base pack out. This
reduces the number of packs to two and hopefully won't take up too
many resources to repack (the assumption still is that the base pack
takes the most resources to handle).

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt | 5 +
 builtin/gc.c | 9 +++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index c12c58813c..ce40112e31 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1554,6 +1554,11 @@ gc.bigPackThreshold::
`git gc` is run. This is very similar to `--keep-base-pack`
except that all packs that meet the threshold are kept, not
just the base pack. Defaults to zero.
++
+Note that if the number of kept packs is more than gc.autoPackLimit,
+this configuration variable is ignored and all packs except the base pack
+will be repacked. After this the number of packs should go below
+gc.autoPackLimit and gc.bigPackThreshold should be respected again.
 
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
diff --git a/builtin/gc.c b/builtin/gc.c
index c0f1922c24..140c1bb7dd 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -336,9 +336,14 @@ static int need_to_gc(void)
if (too_many_packs()) {
struct string_list keep_pack = STRING_LIST_INIT_NODUP;
 
-   if (big_pack_threshold)
+   if (big_pack_threshold) {
		find_base_packs(&keep_pack, big_pack_threshold);
-   else {
+   if (keep_pack.nr >= gc_auto_pack_limit) {
+   big_pack_threshold = 0;
+			string_list_clear(&keep_pack, 0);
+			find_base_packs(&keep_pack, 0);
+   }
+   } else {
		struct packed_git *p = find_base_packs(&keep_pack, 0);
uint64_t mem_have, mem_want;
 
-- 
2.16.2.903.gd04caf5039



[PATCH v3 4/7] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-16 Thread Nguyễn Thái Ngọc Duy
pack-objects can be a big memory hog, especially on large repos;
everybody knows that. The suggestion to stick a .keep file on the
giant base pack to avoid this problem has also been known for a long time.

Recent patches add an option to do just this, but it has to be either
configured or activated manually. This patch lets `git gc --auto`
activate this mode automatically when it thinks `repack -ad` will use
a lot of memory and start affecting the system due to swapping or
flushing the OS cache.

gc --auto decides to do this based on an estimation of pack-objects
memory usage, which is quite accurate at least for the heap part, and
whether that fits in half of system memory (the assumption here is a
desktop environment where there are many other applications running).

This mechanism only kicks in if gc.bigPackThreshold is not configured.
If it is, it is assumed that the user already knows what they want.
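
A condensed sketch of that decision (the total_ram() and
estimate_repack_memory() helpers below are the ones this patch adds;
the combined check itself is a paraphrase for illustration, not the
actual code):

    static int should_keep_base_pack(struct packed_git *base_pack)
    {
            uint64_t mem_want = estimate_repack_memory(base_pack);
            uint64_t mem_have = total_ram();

            /* unknown platform or empty repo: do a normal full repack */
            if (!mem_have || !mem_want)
                    return 0;
            /* keep the base pack if repack would need >= 50% of RAM */
            return mem_want >= mem_have / 2;
    }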

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/git-gc.txt |  9 +++-
 builtin/gc.c | 99 +++-
 builtin/pack-objects.c   |  2 +-
 config.mak.uname |  1 +
 git-compat-util.h|  4 ++
 pack-objects.h   |  2 +
 t/t6500-gc.sh|  7 +++
 7 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 89f074f924..19b0d1741b 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -58,8 +58,13 @@ If the number of packs exceeds the value of 
`gc.autoPackLimit`,
 then existing packs (except those marked with a `.keep` file
 or over `gc.bigPackThreshold` limit)
 are consolidated into a single pack by using the `-A` option of
-'git repack'. Setting `gc.autoPackLimit` to 0 disables
-automatic consolidation of packs.
+'git repack'.
+If the amount of available memory is estimated to be insufficient for
+`git repack` to run smoothly and `gc.bigPackThreshold` is not set, the largest
+pack will also be excluded (this is the equivalent of running `git gc`
+with `--keep-base-pack`).
+Setting `gc.autoPackLimit` to 0 disables automatic consolidation of
+packs.
 
 --prune=<date>::
Prune loose objects older than date (default is 2 weeks ago,
diff --git a/builtin/gc.c b/builtin/gc.c
index 849f0821a9..c0f1922c24 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,10 @@
 #include "argv-array.h"
 #include "commit.h"
 #include "packfile.h"
+#include "pack.h"
+#include "pack-objects.h"
+#include "blob.h"
+#include "tree.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -40,6 +44,7 @@ static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 static unsigned long big_pack_threshold;
+static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
 static struct argv_array pack_refs_cmd = ARGV_ARRAY_INIT;
 static struct argv_array reflog = ARGV_ARRAY_INIT;
@@ -128,6 +133,7 @@ static void gc_config(void)
git_config_get_expiry("gc.logexpiry", _log_expire);
 
git_config_get_ulong("gc.bigpackthreshold", _pack_threshold);
+   git_config_get_ulong("pack.deltacachesize", _delta_cache_size);
 
git_config(git_default_config, NULL);
 }
@@ -167,7 +173,8 @@ static int too_many_loose_objects(void)
return needed;
 }
 
-static void find_base_packs(struct string_list *packs, unsigned long limit)
+static struct packed_git *find_base_packs(struct string_list *packs,
+ unsigned long limit)
 {
struct packed_git *p, *base = NULL;
 
@@ -186,6 +193,8 @@ static void find_base_packs(struct string_list *packs, 
unsigned long limit)
 
if (base)
string_list_append(packs, base->pack_name);
+
+   return base;
 }
 
 static int too_many_packs(void)
@@ -211,6 +220,79 @@ static int too_many_packs(void)
return gc_auto_pack_limit < cnt;
 }
 
+static uint64_t total_ram(void)
+{
+#if defined(HAVE_SYSINFO)
+   struct sysinfo si;
+
+	if (!sysinfo(&si))
+   return si.totalram;
+#elif defined(HAVE_BSD_SYSCTL) && (defined(HW_MEMSIZE) || defined(HW_PHYSMEM))
+   int64_t physical_memory;
+   int mib[2];
+   size_t length;
+
+   mib[0] = CTL_HW;
+# if defined(HW_MEMSIZE)
+   mib[1] = HW_MEMSIZE;
+# else
+   mib[1] = HW_PHYSMEM;
+# endif
+   length = sizeof(int64_t);
+	if (!sysctl(mib, 2, &physical_memory, &length, NULL, 0))
+   return physical_memory;
+#elif defined(GIT_WINDOWS_NATIVE)
+   MEMORYSTATUSEX memInfo;
+
+   memInfo.dwLength = sizeof(MEMORYSTATUSEX);
+	if (GlobalMemoryStatusEx(&memInfo))
+   return memInfo.ullTotalPhys;
+#endif
+   return 0;
+}
+
+static uint64_t estimate_repack_memory(struct packed_git *pack)
+{
+   unsigned long nr_objects = approximate_object_count();
+   size_t os_cache, heap;
+
+   if (!pack || !nr_objects)
+   return 0;
+
+   /*
+* First we have to scan 

[PATCH v3 3/7] gc: detect base packs based on gc.bigPackThreshold config

2018-03-16 Thread Nguyễn Thái Ngọc Duy
The --keep-base-pack option is not very convenient to use because you
need to tell gc to do this explicitly (and probably on just a few
large repos).

Add a config key that enables this mode when packs larger than a
limit are found.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt |  6 ++
 Documentation/git-gc.txt |  6 --
 builtin/gc.c | 26 --
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f57e9cf10c..c12c58813c 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1549,6 +1549,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.bigPackThreshold::
+   If non-zero, all packs larger than this limit are kept when
+   `git gc` is run. This is very similar to `--keep-base-pack`
+   except that all packs that meet the threshold are kept, not
+   just the base pack. Defaults to zero.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 1717517043..89f074f924 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -55,7 +55,8 @@ all loose objects are combined into a single pack using
 disables automatic packing of loose objects.
 +
 If the number of packs exceeds the value of `gc.autoPackLimit`,
-then existing packs (except those marked with a `.keep` file)
+then existing packs (except those marked with a `.keep` file
+or over `gc.bigPackThreshold` limit)
 are consolidated into a single pack by using the `-A` option of
 'git repack'. Setting `gc.autoPackLimit` to 0 disables
 automatic consolidation of packs.
@@ -81,7 +82,8 @@ automatic consolidation of packs.
 --keep-base-pack::
All packs except the base pack and those marked with a `.keep`
files are consolidated into a single pack. The largest pack is
-   considered the base pack.
+   considered the base pack. When this option is used,
+   `gc.bigPackThreshold` is ignored.
 
 Configuration
 -
diff --git a/builtin/gc.c b/builtin/gc.c
index 362dd537a4..849f0821a9 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -39,6 +39,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static unsigned long big_pack_threshold;
 
 static struct argv_array pack_refs_cmd = ARGV_ARRAY_INIT;
 static struct argv_array reflog = ARGV_ARRAY_INIT;
@@ -126,6 +127,8 @@ static void gc_config(void)
git_config_get_expiry("gc.worktreepruneexpire", 
&prune_worktrees_expire);
git_config_get_expiry("gc.logexpiry", _log_expire);
 
+	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
+
git_config(git_default_config, NULL);
 }
 
@@ -164,7 +167,7 @@ static int too_many_loose_objects(void)
return needed;
 }
 
-static void find_base_packs(struct string_list *packs)
+static void find_base_packs(struct string_list *packs, unsigned long limit)
 {
struct packed_git *p, *base = NULL;
 
@@ -173,7 +176,10 @@ static void find_base_packs(struct string_list *packs)
for (p = packed_git; p; p = p->next) {
if (!p->pack_local)
continue;
-   if (!base || base->pack_size < p->pack_size) {
+   if (limit) {
+   if (p->pack_size >= limit)
+   string_list_append(packs, p->pack_name);
+   } else if (!base || base->pack_size < p->pack_size) {
base = p;
}
}
@@ -245,9 +251,15 @@ static int need_to_gc(void)
 * we run "repack -A -d -l".  Otherwise we tell the caller
 * there is no need.
 */
-   if (too_many_packs())
-   add_repack_all_option(NULL);
-   else if (too_many_loose_objects())
+   if (too_many_packs()) {
+   struct string_list keep_pack = STRING_LIST_INIT_NODUP;
+
+   if (big_pack_threshold)
+			find_base_packs(&keep_pack, big_pack_threshold);
+
+		add_repack_all_option(&keep_pack);
+		string_list_clear(&keep_pack, 0);
+   } else if (too_many_loose_objects())
add_repack_incremental_option();
else
return 0;
@@ -462,7 +474,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
if (keep_base_pack != -1) {
if (keep_base_pack)
-			find_base_packs(&keep_pack);
+			find_base_packs(&keep_pack, 0);
+   } else if (big_pack_threshold) {
+		find_base_packs(&keep_pack, 

[PATCH v3 6/7] pack-objects: show some progress when counting kept objects

2018-03-16 Thread Nguyễn Thái Ngọc Duy
We only show progress when there are new objects to be packed. But
when --keep-pack is specified on the base pack, we will exclude most
of the objects. This makes 'pack-objects' stay silent for a long time
while the counting phase is going on.

Let's show some progress whenever we visit an object instead. The
number of packed objects will be shown afterwards if it's not the same as
the number of visited objects.

Since the meaning of this number has changed, use another word instead
of "Counting" to hint about the change.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6abde6ec6d..f74e9117f7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,7 +44,7 @@ static const char *pack_usage[] = {
 static struct packing_data to_pack;
 
 static struct pack_idx_entry **written_list;
-static uint32_t nr_result, nr_written;
+static uint32_t nr_result, nr_written, nr_seen;
 
 static int non_empty;
 static int reuse_delta = 1, reuse_object = 1;
@@ -1094,6 +1094,8 @@ static int add_object_entry(const struct object_id *oid, 
enum object_type type,
off_t found_offset = 0;
uint32_t index_pos;
 
+   display_progress(progress_state, ++nr_seen);
+
if (have_duplicate_entry(oid, exclude, _pos))
return 0;
 
@@ -1109,8 +1111,6 @@ static int add_object_entry(const struct object_id *oid, 
enum object_type type,
create_object_entry(oid, type, pack_name_hash(name),
exclude, name && no_try_delta(name),
index_pos, found_pack, found_offset);
-
-   display_progress(progress_state, nr_result);
return 1;
 }
 
@@ -1121,6 +1121,8 @@ static int add_object_entry_from_bitmap(const struct 
object_id *oid,
 {
uint32_t index_pos;
 
+   display_progress(progress_state, ++nr_seen);
+
if (have_duplicate_entry(oid, 0, _pos))
return 0;
 
@@ -1128,8 +1130,6 @@ static int add_object_entry_from_bitmap(const struct 
object_id *oid,
return 0;
 
create_object_entry(oid, type, name_hash, 0, 0, index_pos, pack, 
offset);
-
-   display_progress(progress_state, nr_result);
return 1;
 }
 
@@ -3210,7 +3210,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
}
 
if (progress)
-   progress_state = start_progress(_("Counting objects"), 0);
+   progress_state = start_progress(_("Enumerating objects"), 0);
if (!use_internal_rev_list)
read_object_list_from_stdin();
else {
-- 
2.16.2.903.gd04caf5039



[PATCH v3 1/7] repack: add --keep-pack option

2018-03-16 Thread Nguyễn Thái Ngọc Duy
We allow keeping existing packs by having companion .keep files. This
is helpful when a pack is permanently kept. In the next patch, git-gc
just wants to keep a pack temporarily, for one pack-objects
run. git-gc can use --keep-pack for this use case.

A note about why the pack_keep field cannot be reused and
pack_keep_in_core has to be added. This is about the case when
--keep-pack is specified together with either --keep-unreachable or
--unpack-unreachable, but --honor-pack-keep is NOT specified.

In this case, we want to exclude objects from the packs specified on
command line, not from ones with .keep files. If only one bit flag is
used, we have to clear pack_keep on pack files with the .keep file.

But we can't make any assumption about unreachable objects in .keep
packs. If "pack_keep" field is false for .keep packs, we could
potentially pull lots of unreachable objects into the new pack, or
unpack them loose. The safer approach is to ignore all packs with
either a .keep file or --keep-pack.
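
A condensed sketch of the rule the two bits implement (a hypothetical
helper written for illustration; the real logic lives in
want_found_object(), partially shown in the diff below):

    static int keep_pack_objects(const struct packed_git *p,
                                 int honor_pack_keep, /* --honor-pack-keep */
                                 int keep_in_core)    /* --keep-pack given */
    {
            /* the two keep sources are checked independently */
            if (honor_pack_keep && p->pack_keep)       /* .keep file */
                    return 1;
            if (keep_in_core && p->pack_keep_in_core)  /* --keep-pack */
                    return 1;
            return 0;
    }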

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/git-pack-objects.txt |  9 +-
 Documentation/git-repack.txt   |  9 +-
 builtin/pack-objects.c | 48 ++
 builtin/repack.c   | 21 +++--
 cache.h|  1 +
 t/t7700-repack.sh  | 25 
 6 files changed, 102 insertions(+), 11 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 81bc490ac5..403524652a 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,7 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-   [--revs [--unpacked | --all]]
+	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
	[--stdout [--filter=<filter-spec>] | base-name]
[--shallow] [--keep-true-parents] < object-list
 
@@ -126,6 +126,13 @@ base-name::
has a .keep file to be ignored, even if it would have
otherwise been packed.
 
+--keep-pack=<pack-name>::
+	This flag causes an object already in the given pack to be
+	ignored, even if it would have otherwise been
+	packed. `<pack-name>` is the pack file name without
+	leading directory (e.g. `pack-123.pack`). The option can be
+	specified multiple times to keep multiple packs.
+
 --incremental::
This flag causes an object already in a pack to be ignored
even if it would have otherwise been packed.
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ae750e9e11..ce497d9d12 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,7 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>] [--threads=<n>]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
 
 DESCRIPTION
 ---
@@ -133,6 +133,13 @@ other objects in that pack they already have locally.
with `-b` or `repack.writeBitmaps`, as it ensures that the
bitmapped packfile has the necessary objects.
 
+--keep-pack=<pack-name>::
+	Exclude the given pack from repacking. This is the equivalent
+	of having a `.keep` file on the pack. `<pack-name>` is the
+	pack file name without leading directory (e.g. `pack-123.pack`).
+	The option can be specified multiple times to keep multiple
+	packs.
+
 --unpack-unreachable=<approxidate>::
When loosening unreachable objects, do not bother loosening any
	objects older than `<approxidate>`. This can be used to optimize out
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5c674b2843..7b9fe6c89f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -28,6 +28,7 @@
 #include "argv-array.h"
 #include "list.h"
 #include "packfile.h"
+#include "dir.h"
 
 static const char *pack_usage[] = {
N_("git pack-objects --stdout [...] [<  | < 
]"),
@@ -53,7 +54,7 @@ static int pack_loose_unreachable;
 static int local;
 static int have_non_local_packs;
 static int incremental;
-static int ignore_packed_keep;
+static int ignore_packed_keep, ignore_packed_keep_in_core;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -982,13 +983,15 @@ static int want_found_object(int exclude, struct 
packed_git *p)
 * Otherwise, we signal "-1" at the end to tell the caller that we do
 * not know either way, and it needs to check more packs.
 */
-   if (!ignore_packed_keep &&
+   if (!ignore_packed_keep && !ignore_packed_keep_in_core &&
(!local || !have_non_local_packs))
return 1;
 
if (local && !p->pack_local)
return 0;
-  

[PATCH v3 7/7] pack-objects: display progress in get_object_details()

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This code is mostly about reading object headers, which is cheap. But
when the number of objects is very large (e.g. 6.5M on linux-2.6.git)
and the system is under memory pressure, this could take some time (86
seconds on my system).

Show something during this time to let the user know pack-objects is
still going strong.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f74e9117f7..ac8f29dd52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1715,6 +1715,10 @@ static void get_object_details(void)
uint32_t i;
struct object_entry **sorted_by_offset;
 
+   if (progress)
+   progress_state = start_progress(_("Getting object details"),
+   to_pack.nr_objects);
+
sorted_by_offset = xcalloc(to_pack.nr_objects, sizeof(struct 
object_entry *));
for (i = 0; i < to_pack.nr_objects; i++)
sorted_by_offset[i] = to_pack.objects + i;
@@ -1725,7 +1729,9 @@ static void get_object_details(void)
check_object(entry);
if (big_file_threshold < entry->size)
entry->no_try_delta = 1;
+   display_progress(progress_state, i + 1);
}
+	stop_progress(&progress_state);
 
/*
 * This must happen in a second pass, since we rely on the delta
-- 
2.16.2.903.gd04caf5039



Re: [PATCH v2 4/5] pack-objects: show some progress when counting kept objects

2018-03-16 Thread Duy Nguyen
On Mon, Mar 12, 2018 at 7:32 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Tue, Mar 06 2018, Nguyễn Thái Ngọc Duy jotted:
>
>> We only show progress when there are new objects to be packed. But
>> when --keep-pack is specified on the base pack, we will exclude most
>> of objects. This makes 'pack-objects' stay silent for a long time
>> while the counting phase is going.
>>
>> Let's show some progress whenever we visit an object instead. The
>> number of packed objects will be shown after if it's not the same as
>> the number of visited objects.
>>
>> Since the meaning of this number has changed, use another word instead
>> of "Counting" to hint about the change.
>
> Can you elaborate on how the meaning has changed? With/without this on
> linux.git I get:
>
> With:
>
> Enumerating objects: 5901144, done.
> Getting object details: 100% (5901145/5901145), done.
> Delta compression using up to 8 threads.
>
> Without:
>
> Counting objects: 5901145, done.
> Delta compression using up to 8 threads.
>
> So now we're seemingly off-by-one but otherwise doing the same thing?

Yep, it's an off-by-one bug.

> As for as user feedback goes we might as well have said "Reticulating
> splines", but I have some bias towards keeping the current "Counting
> objects..." phrasing. We ourselves have other docs referring to it that
> aren't changed by this patch, and there's
> e.g. https://githubengineering.com/counting-objects/ and lots of other
> 3rd party docs that refer to this.

This is why I changed the phrase. The counting is now a bit different.
Documents describing this exact phrase won't apply to the new version.

The old way counts objects that will be packed. The new way simply
counts objects that are visited. When you keep some packs, the number
of objects you visit but do not pack could be very high, while in the
normal case the two numbers should be the same (e.g. you pack everything you
visit). I would prefer to print both values (e.g. "counting objects:
<packed>/<visited>") but it's not possible with the current progress
code.
-- 
Duy


getting fatal error trying to open git gui

2018-03-16 Thread Briggs, John
I just installed git for windows 10 and am getting "git-gui: fatal error" 
"Cannot parse Git version string.

When I execute "git version" in the command prompt I get
Git version 2.16.2.windows.1

Everything else seems to be working. How can I get the gui to work?

John



Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Junio C Hamano
SZEDER Gábor  writes:

> You should forget '--stdin-packs' and use '--stdin-commits' to generate
> the initial graph, it's much faster even without '--additive'[1].  See
>
>   
> https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/
>
> I still think that the default behaviour for 'git commit-graph write'
> should simply walk history from all refs instead of enumerating all
> objects in all packfiles.

Somehow I missed that one.  Thanks for the link to it.

It is not so surprising that history walking runs rings around
enumerating objects in packfiles, if packfiles are built well.

A well-built packfile tends to have newer objects in base form and
deltas that go in the backward direction (older objects are
represented as deltas against newer ones).  This helps walking from
the tips of the history quite a bit, because your delta base cache
will tend to have the base object (i.e. objects in the newer part of
the history you just walked) that will be required to access the
"next" older part of the history more often than not.

Trying to read the objects in the pack in their object name order
would essentially mean reading them in a cryptographically random
order.  Half the time you will end up wanting to access an object
that is near the tip of a very deep delta chain even before you've
accessed any of the base objects in the delta chain.

> [1] - Please excuse the bikeshed: '--additive' is such a strange
>   sounding option name, at least for me.  '--append', perhaps?

Yeah, I think "fetch --append" is probably a precedence.


[PATCH v4 03/11] pack-objects: use bitfield for object_entry::dfs_state

2018-03-16 Thread Nguyễn Thái Ngọc Duy
Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c |  3 +++
 pack-objects.h | 28 +---
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 13f6a44fb2..09f8b4ef3e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3049,6 +3049,9 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
OPT_END(),
};
 
+   if (DFS_NUM_STATES > (1 << OE_DFS_STATE_BITS))
+   die("BUG: too many dfs states, increase OE_DFS_STATE_BITS");
+
check_replace_refs = 0;
 
reset_pack_idx_option(_idx_opts);
diff --git a/pack-objects.h b/pack-objects.h
index 38d3ff167f..2bb1732098 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -1,6 +1,21 @@
 #ifndef PACK_OBJECTS_H
 #define PACK_OBJECTS_H
 
+#define OE_DFS_STATE_BITS  2
+
+/*
+ * State flags for depth-first search used for analyzing delta cycles.
+ *
+ * The depth is measured in delta-links to the base (so if A is a delta
+ * against B, then A has a depth of 1, and B a depth of 0).
+ */
+enum dfs_state {
+   DFS_NONE = 0,
+   DFS_ACTIVE,
+   DFS_DONE,
+   DFS_NUM_STATES
+};
+
 /*
  * basic object info
  * -
@@ -72,19 +87,10 @@ struct object_entry {
unsigned no_try_delta:1;
unsigned tagged:1; /* near the very tip of refs */
unsigned filled:1; /* assigned write-order */
+   unsigned dfs_state:OE_DFS_STATE_BITS;
 
-   /*
-* State flags for depth-first search used for analyzing delta cycles.
-*
-* The depth is measured in delta-links to the base (so if A is a delta
-* against B, then A has a depth of 1, and B a depth of 0).
-*/
-   enum {
-   DFS_NONE = 0,
-   DFS_ACTIVE,
-   DFS_DONE
-   } dfs_state;
int depth;
+
 };
 
 struct packing_data {
-- 
2.16.2.903.gd04caf5039



[PATCH v4 09/11] pack-objects: shrink size field in struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
It's very, very rare that an uncompressed object is larger than 4GB
(partly because Git does not handle those large files very well to
begin with). Let's optimize it for the common case where object size
is smaller than this limit.

Shrink the size field down to 32 bits [1] plus one overflow bit. If the size
is too large, we read it back from disk.

Add two compare helpers that can take advantage of the overflow
bit (e.g. if the file is 4GB+, chances are it's already larger than
core.bigFileThreshold and there's no point in comparing the actual
value).

A small note about the conditional oe_set_size() in
check_object(). Technically if we don't get a valid type, it's not
wrong if we set the uninitialized value "size" (we don't pre-initialize
this and sha1_object_info will not assign anything when it fails to
get the info).

This however changes the writing code path slightly, which emits different
error messages (either way we die). One of our tests in t5530 depends
on this specific error message. Let's just keep the test as-is and
play safe by not assigning a random value. That might trigger valgrind
anyway.

[1] it's actually already 32 bits on Windows
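
A self-contained sketch of the encoding (the oe_size()/oe_set_size()
naming follows this patch, but the struct is simplified and the
read-from-pack fallback is stubbed out, since the pack-objects.h hunk
is truncated below):

    struct entry_sketch {
            unsigned size_:32;      /* possibly truncated size */
            unsigned size_valid:1;  /* 0: 4GB+, re-read from disk */
    };

    static void oe_set_size_sketch(struct entry_sketch *e,
                                   unsigned long size)
    {
            e->size_ = size;
            e->size_valid = e->size_ == size; /* did the value survive? */
    }

    static unsigned long oe_size_sketch(const struct entry_sketch *e)
    {
            if (e->size_valid)
                    return e->size_;
            return 0; /* the real code re-reads the object header here */
    }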

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 49 ++
 pack-objects.h | 43 +++-
 2 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9a0962cf31..14aa4acd50 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -274,7 +274,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
 
if (!usable_delta) {
if (oe_type(entry) == OBJ_BLOB &&
-   entry->size > big_file_threshold &&
+   oe_size_greater_than(entry, big_file_threshold) &&
	   (st = open_istream(entry->idx.oid.hash, &type, &size, NULL)) != NULL)
buf = NULL;
else {
@@ -384,12 +384,13 @@ static off_t write_reuse_object(struct hashfile *f, 
struct object_entry *entry,
unsigned char header[MAX_PACK_OBJECT_HEADER],
  dheader[MAX_PACK_OBJECT_HEADER];
unsigned hdrlen;
+   unsigned long entry_size = oe_size(entry);
 
if (DELTA(entry))
type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
hdrlen = encode_in_pack_object_header(header, sizeof(header),
- type, entry->size);
+ type, entry_size);
 
offset = entry->in_pack_offset;
revidx = find_pack_revindex(p, offset);
@@ -406,7 +407,7 @@ static off_t write_reuse_object(struct hashfile *f, struct 
object_entry *entry,
datalen -= entry->in_pack_header_size;
 
if (!pack_to_stdout && p->index_version == 1 &&
-	   check_pack_inflate(p, &w_curs, offset, datalen, entry->size)) {
+	   check_pack_inflate(p, &w_curs, offset, datalen, entry_size)) {
error("corrupt packed object for %s",
		  oid_to_hex(&entry->idx.oid));
		unuse_pack(&w_curs);
@@ -1412,6 +1413,8 @@ static void cleanup_preferred_base(void)
 
 static void check_object(struct object_entry *entry)
 {
+   unsigned long size;
+
if (IN_PACK(entry)) {
struct packed_git *p = IN_PACK(entry);
struct pack_window *w_curs = NULL;
@@ -1431,13 +1434,14 @@ static void check_object(struct object_entry *entry)
 */
used = unpack_object_header_buffer(buf, avail,
					   &type,
-					   &entry->size);
+					   &size);
if (used == 0)
goto give_up;
 
if (type < 0)
die("BUG: invalid type %d", type);
entry->in_pack_type = type;
+   oe_set_size(entry, size);
 
/*
 * Determine if this is a delta and if so whether we can
@@ -1505,7 +1509,7 @@ static void check_object(struct object_entry *entry)
 */
oe_set_type(entry, entry->in_pack_type);
SET_DELTA(entry, base_entry);
-   entry->delta_size = entry->size;
+   entry->delta_size = oe_size(entry);
entry->delta_sibling_idx = base_entry->delta_child_idx;
SET_DELTA_CHILD(base_entry, entry);
	unuse_pack(&w_curs);
@@ -1513,14 +1517,17 @@ static void check_object(struct object_entry *entry)
}
 
if (oe_type(entry)) {
+   unsigned long size;
+
+	size = get_size_from_delta(p, &w_curs,
+   

[PATCH v4 07/11] pack-objects: refer to delta objects by index instead of pointer

2018-03-16 Thread Nguyễn Thái Ngọc Duy
These delta pointers always point to elements in the objects[] array
in the packing_data struct. We can only hold a maximum of 4G of those
objects because the array length, nr_objects, is uint32_t. We could use
uint32_t indexes to address these elements instead of pointers. On
64-bit architecture (8 bytes per pointer) this would save 4 bytes per
pointer.

Convert these delta pointers to indexes. Since we need to handle NULL
pointers as well, the index is shifted by one [1].

[1] This means we can only index 2^32-2 objects even though nr_objects
could contain 2^32-1 objects. It should not be a problem in
practice because when we grow objects[], nr_alloc would probably
blow up long before nr_objects hits the wall.
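
The pack-objects.h part of the diff is truncated below, but the two
central accessors are essentially of this shape (paraphrased):

    /* delta_idx == 0 encodes NULL; entry i is stored as i + 1 */
    static inline struct object_entry *oe_delta(const struct packing_data *pack,
                                                const struct object_entry *e)
    {
            if (e->delta_idx)
                    return &pack->objects[e->delta_idx - 1];
            return NULL;
    }

    static inline void oe_set_delta(struct packing_data *pack,
                                    struct object_entry *e,
                                    struct object_entry *delta)
    {
            if (delta)
                    e->delta_idx = (uint32_t)(delta - pack->objects) + 1;
            else
                    e->delta_idx = 0;
    }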

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 116 ++---
 pack-objects.h |  67 ++--
 2 files changed, 124 insertions(+), 59 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ca993e55dd..cdbad57082 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -30,6 +30,12 @@
 #include "packfile.h"
 
 #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
+#define DELTA(obj) oe_delta(&to_pack, obj)
+#define DELTA_CHILD(obj) oe_delta_child(&to_pack, obj)
+#define DELTA_SIBLING(obj) oe_delta_sibling(&to_pack, obj)
+#define SET_DELTA(obj, val) oe_set_delta(&to_pack, obj, val)
+#define SET_DELTA_CHILD(obj, val) oe_set_delta_child(&to_pack, obj, val)
+#define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
 
 static const char *pack_usage[] = {
N_("git pack-objects --stdout [...] [<  | < 
]"),
@@ -127,11 +133,11 @@ static void *get_delta(struct object_entry *entry)
	buf = read_sha1_file(entry->idx.oid.hash, &type, &size);
if (!buf)
die("unable to read %s", oid_to_hex(>idx.oid));
-	base_buf = read_sha1_file(entry->delta->idx.oid.hash, &type,
+	base_buf = read_sha1_file(DELTA(entry)->idx.oid.hash, &type,
	  &base_size);
if (!base_buf)
die("unable to read %s",
-		    oid_to_hex(&entry->delta->idx.oid));
+		    oid_to_hex(&DELTA(entry)->idx.oid));
delta_buf = diff_delta(base_buf, base_size,
	   buf, size, &delta_size, 0);
if (!delta_buf || delta_size != entry->delta_size)
@@ -288,12 +294,12 @@ static unsigned long write_no_reuse_object(struct 
hashfile *f, struct object_ent
size = entry->delta_size;
buf = entry->delta_data;
entry->delta_data = NULL;
-   type = (allow_ofs_delta && entry->delta->idx.offset) ?
+   type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
} else {
buf = get_delta(entry);
size = entry->delta_size;
-   type = (allow_ofs_delta && entry->delta->idx.offset) ?
+   type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
}
 
@@ -317,7 +323,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
 * encoding of the relative offset for the delta
 * base from this object's position in the pack.
 */
-   off_t ofs = entry->idx.offset - entry->delta->idx.offset;
+   off_t ofs = entry->idx.offset - DELTA(entry)->idx.offset;
unsigned pos = sizeof(dheader) - 1;
dheader[pos] = ofs & 127;
while (ofs >>= 7)
@@ -343,7 +349,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
return 0;
}
hashwrite(f, header, hdrlen);
-   hashwrite(f, entry->delta->idx.oid.hash, 20);
+   hashwrite(f, DELTA(entry)->idx.oid.hash, 20);
hdrlen += 20;
} else {
if (limit && hdrlen + datalen + 20 >= limit) {
@@ -379,8 +385,8 @@ static off_t write_reuse_object(struct hashfile *f, struct 
object_entry *entry,
  dheader[MAX_PACK_OBJECT_HEADER];
unsigned hdrlen;
 
-   if (entry->delta)
-   type = (allow_ofs_delta && entry->delta->idx.offset) ?
+   if (DELTA(entry))
+   type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
hdrlen = encode_in_pack_object_header(header, sizeof(header),
  type, entry->size);
@@ -408,7 +414,7 @@ static off_t write_reuse_object(struct hashfile *f, struct 
object_entry *entry,
}
 
if (type == OBJ_OFS_DELTA) {
-   off_t ofs = entry->idx.offset - entry->delta->idx.offset;
+   off_t ofs = entry->idx.offset - DELTA(entry)->idx.offset;
unsigned pos = sizeof(dheader) - 1;
   

[PATCH v4 10/11] pack-objects: shrink delta_size field in struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
Allowing a delta size of 64 bits is crazy. Shrink this field down to
31 bits with one overflow bit.

If we find an existing delta larger than 2GB, we do not cache
delta_size at all and will get the value from oe_size(), potentially
from disk if it's larger than 4GB.
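
The reading side then falls back transparently, roughly like this
(paraphrased; the delta_size_/delta_size_valid field names are assumed
from the patch's value-plus-validity-bit convention):

    static inline unsigned long oe_delta_size(struct packing_data *pack,
                                              const struct object_entry *e)
    {
            if (e->delta_size_valid)
                    return e->delta_size_;
            /* 2GB+ delta: not cached, derive it from the object size */
            return oe_size(pack, e);
    }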

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 24 ++--
 pack-objects.h | 23 ++-
 2 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 14aa4acd50..c388d87c3e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -30,10 +30,12 @@
 #include "packfile.h"
 
 #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
+#define DELTA_SIZE(obj) oe_delta_size(&to_pack, obj)
 #define DELTA(obj) oe_delta(&to_pack, obj)
 #define DELTA_CHILD(obj) oe_delta_child(&to_pack, obj)
 #define DELTA_SIBLING(obj) oe_delta_sibling(&to_pack, obj)
 #define SET_DELTA(obj, val) oe_set_delta(&to_pack, obj, val)
+#define SET_DELTA_SIZE(obj, val) oe_set_delta_size(&to_pack, obj, val)
 #define SET_DELTA_CHILD(obj, val) oe_set_delta_child(&to_pack, obj, val)
 #define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
 
@@ -140,7 +142,7 @@ static void *get_delta(struct object_entry *entry)
		oid_to_hex(&DELTA(entry)->idx.oid));
delta_buf = diff_delta(base_buf, base_size,
	   buf, size, &delta_size, 0);
-   if (!delta_buf || delta_size != entry->delta_size)
+   if (!delta_buf || delta_size != DELTA_SIZE(entry))
die("delta size changed");
free(buf);
free(base_buf);
@@ -291,14 +293,14 @@ static unsigned long write_no_reuse_object(struct 
hashfile *f, struct object_ent
FREE_AND_NULL(entry->delta_data);
entry->z_delta_size = 0;
} else if (entry->delta_data) {
-   size = entry->delta_size;
+   size = DELTA_SIZE(entry);
buf = entry->delta_data;
entry->delta_data = NULL;
type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
} else {
buf = get_delta(entry);
-   size = entry->delta_size;
+   size = DELTA_SIZE(entry);
type = (allow_ofs_delta && DELTA(entry)->idx.offset) ?
OBJ_OFS_DELTA : OBJ_REF_DELTA;
}
@@ -1509,7 +1511,7 @@ static void check_object(struct object_entry *entry)
 */
oe_set_type(entry, entry->in_pack_type);
SET_DELTA(entry, base_entry);
-   entry->delta_size = oe_size(entry);
+   SET_DELTA_SIZE(entry, oe_size(entry));
entry->delta_sibling_idx = base_entry->delta_child_idx;
SET_DELTA_CHILD(base_entry, entry);
unuse_pack(_curs);
@@ -1895,7 +1897,7 @@ static int try_delta(struct unpacked *trg, struct 
unpacked *src,
max_size = trg_size/2 - 20;
ref_depth = 1;
} else {
-   max_size = trg_entry->delta_size;
+   max_size = DELTA_SIZE(trg_entry);
ref_depth = trg->depth;
}
max_size = (uint64_t)max_size * (max_depth - src->depth) /
@@ -1966,10 +1968,12 @@ static int try_delta(struct unpacked *trg, struct 
unpacked *src,
	delta_buf = create_delta(src->index, trg->data, trg_size, &delta_size, 
max_size);
if (!delta_buf)
return 0;
+   if (delta_size >= maximum_unsigned_value_of_type(uint32_t))
+   return 0;
 
if (DELTA(trg_entry)) {
/* Prefer only shallower same-sized deltas. */
-   if (delta_size == trg_entry->delta_size &&
+   if (delta_size == DELTA_SIZE(trg_entry) &&
src->depth + 1 >= trg->depth) {
free(delta_buf);
return 0;
@@ -1984,7 +1988,7 @@ static int try_delta(struct unpacked *trg, struct 
unpacked *src,
free(trg_entry->delta_data);
cache_lock();
if (trg_entry->delta_data) {
-   delta_cache_size -= trg_entry->delta_size;
+   delta_cache_size -= DELTA_SIZE(trg_entry);
trg_entry->delta_data = NULL;
}
if (delta_cacheable(src_size, trg_size, delta_size)) {
@@ -1997,7 +2001,7 @@ static int try_delta(struct unpacked *trg, struct 
unpacked *src,
}
 
SET_DELTA(trg_entry, src_entry);
-   trg_entry->delta_size = delta_size;
+   SET_DELTA_SIZE(trg_entry, delta_size);
trg->depth = src->depth + 1;
 
return 1;
@@ -2120,11 +2124,11 @@ static void find_deltas(struct object_entry **list, 
unsigned *list_size,
if (entry->delta_data && !pack_to_stdout) {
unsigned long size;
 
-   size = 

[PATCH v4 05/11] pack-objects: move in_pack_pos out of struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
This field is only needed for pack-bitmap, which is an optional
feature. Move it to a separate array that is only allocated when
pack-bitmap is used (like objects[], it is not freed, either).

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c |  3 ++-
 pack-bitmap-write.c|  8 +---
 pack-bitmap.c  |  2 +-
 pack-bitmap.h  |  4 +++-
 pack-objects.h | 16 +++-
 5 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 668eaf8cd7..b281487b96 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -879,7 +879,8 @@ static void write_pack_file(void)
 
if (write_bitmap_index) {
bitmap_writer_set_checksum(oid.hash);
-   bitmap_writer_build_type_index(written_list, 
nr_written);
+   bitmap_writer_build_type_index(
+			&to_pack, written_list, nr_written);
}
 
finish_tmp_packfile(, pack_tmp_name,
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index fd11f08940..f7c897515b 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -48,7 +48,8 @@ void bitmap_writer_show_progress(int show)
 /**
  * Build the initial type index for the packfile
  */
-void bitmap_writer_build_type_index(struct pack_idx_entry **index,
+void bitmap_writer_build_type_index(struct packing_data *to_pack,
+   struct pack_idx_entry **index,
uint32_t index_nr)
 {
uint32_t i;
@@ -57,12 +58,13 @@ void bitmap_writer_build_type_index(struct pack_idx_entry 
**index,
writer.trees = ewah_new();
writer.blobs = ewah_new();
writer.tags = ewah_new();
+   ALLOC_ARRAY(to_pack->in_pack_pos, to_pack->nr_objects);
 
for (i = 0; i < index_nr; ++i) {
struct object_entry *entry = (struct object_entry *)index[i];
enum object_type real_type;
 
-   entry->in_pack_pos = i;
+   oe_set_in_pack_pos(to_pack, entry, i);
 
switch (oe_type(entry)) {
case OBJ_COMMIT:
@@ -147,7 +149,7 @@ static uint32_t find_object_pos(const unsigned char *sha1)
"(object %s is missing)", sha1_to_hex(sha1));
}
 
-   return entry->in_pack_pos;
+   return oe_in_pack_pos(writer.to_pack, entry);
 }
 
 static void show_object(struct object *object, const char *name, void *data)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 9270983e5f..865d9ecc4e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1032,7 +1032,7 @@ int rebuild_existing_bitmaps(struct packing_data *mapping,
oe = packlist_find(mapping, sha1, NULL);
 
if (oe)
-   reposition[i] = oe->in_pack_pos + 1;
+   reposition[i] = oe_in_pack_pos(mapping, oe) + 1;
}
 
rebuild = bitmap_new();
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3742a00e14..5ded2f139a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -44,7 +44,9 @@ int rebuild_existing_bitmaps(struct packing_data *mapping, 
khash_sha1 *reused_bi
 
 void bitmap_writer_show_progress(int show);
 void bitmap_writer_set_checksum(unsigned char *sha1);
-void bitmap_writer_build_type_index(struct pack_idx_entry **index, uint32_t 
index_nr);
+void bitmap_writer_build_type_index(struct packing_data *to_pack,
+   struct pack_idx_entry **index,
+   uint32_t index_nr);
 void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
unsigned int indexed_commits_nr, int max_bitmaps);
diff --git a/pack-objects.h b/pack-objects.h
index 50908d1f2d..dae160e7c2 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -78,7 +78,6 @@ struct object_entry {
unsigned in_pack_type:TYPE_BITS; /* could be delta */
unsigned type_valid:1;
uint32_t hash;  /* name hint hash */
-   unsigned int in_pack_pos;
unsigned char in_pack_header_size;
unsigned preferred_base:1; /*
* we do not pack this, but is available
@@ -98,6 +97,8 @@ struct packing_data {
 
int32_t *index;
uint32_t index_size;
+
+   unsigned int *in_pack_pos;
 };
 
 struct object_entry *packlist_alloc(struct packing_data *pdata,
@@ -143,4 +144,17 @@ static inline void oe_set_type(struct object_entry *e,
e->type_ = (unsigned)type;
 }
 
+static inline unsigned int oe_in_pack_pos(const struct packing_data *pack,
+ const struct object_entry *e)
+{
+   return pack->in_pack_pos[e - pack->objects];
+}
+
+static inline void oe_set_in_pack_pos(const struct packing_data *pack,
+ 

[PATCH v4 06/11] pack-objects: move in_pack out of struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
Instead of using 8 bytes (on a 64-bit arch) to store a pointer to a
pack, use an index, since the number of packs should be relatively
small.

This limits the number of packs we can handle to 16k. For now if you hit
the 16k pack file limit, pack-objects will simply fail [1].

[1] The escape hatch is a .keep file to limit the non-kept pack files
below the 16k limit. Then you can go for another pack-objects run to
combine another 16k pack files. Repeat until you're satisfied.
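
The mapping is roughly the following (paraphrased, since the
pack-objects.h hunk is truncated below): each pack gets a small
per-run index when it is first seen, and every entry stores that
index in a 14-bit (OE_IN_PACK_BITS) field instead of an 8-byte
pointer:

    static inline struct packed_git *oe_in_pack(const struct packing_data *pack,
                                                const struct object_entry *e)
    {
            /* in_pack_by_idx[] maps the small index back to the pack */
            return pack->in_pack_by_idx[e->in_pack_idx];
    }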

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/git-pack-objects.txt |  9 ++
 builtin/pack-objects.c | 40 +--
 cache.h|  1 +
 pack-objects.h | 44 +-
 4 files changed, 79 insertions(+), 15 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 3503c9e3e6..b8d936ccf5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -269,6 +269,15 @@ Unexpected missing object will raise an error.
locally created objects [without .promisor] and objects from the
promisor remote [with .promisor].)  This is used with partial clone.
 
+LIMITATIONS
+---
+
+This command can only handle 16384 existing pack files at a time.
+If you have more than this, you need to exclude some pack files with
+a ".keep" file and the --honor-pack-keep option, to combine 16k pack files
+into one, then remove these .keep files and run pack-objects one more
+time.
+
 SEE ALSO
 
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b281487b96..ca993e55dd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -29,6 +29,8 @@
 #include "list.h"
 #include "packfile.h"
 
+#define IN_PACK(obj) oe_in_pack(&to_pack, obj)
+
 static const char *pack_usage[] = {
N_("git pack-objects --stdout [...] [<  | < 
]"),
N_("git pack-objects [...]  [<  | < 
]"),
@@ -367,7 +369,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
 static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
unsigned long limit, int usable_delta)
 {
-   struct packed_git *p = entry->in_pack;
+   struct packed_git *p = IN_PACK(entry);
struct pack_window *w_curs = NULL;
struct revindex_entry *revidx;
off_t offset;
@@ -478,7 +480,7 @@ static off_t write_object(struct hashfile *f,
 
if (!reuse_object)
to_reuse = 0;   /* explicit */
-   else if (!entry->in_pack)
+   else if (!IN_PACK(entry))
to_reuse = 0;   /* can't reuse what we don't have */
else if (oe_type(entry) == OBJ_REF_DELTA ||
 oe_type(entry) == OBJ_OFS_DELTA)
@@ -1025,7 +1027,7 @@ static int want_object_in_pack(const struct object_id 
*oid,
if (*found_pack) {
want = want_found_object(exclude, *found_pack);
if (want != -1)
-   return want;
+   goto done;
}
 
	list_for_each(pos, &packed_git_mru) {
@@ -1048,11 +1050,16 @@ static int want_object_in_pack(const struct object_id 
*oid,
if (!exclude && want > 0)
		list_move(&p->mru, &packed_git_mru);
if (want != -1)
-   return want;
+   goto done;
}
}
 
-   return 1;
+   want = 1;
+done:
+   if (want && *found_pack && !(*found_pack)->index)
+		oe_add_pack(&to_pack, *found_pack);
+
+   return want;
 }
 
 static void create_object_entry(const struct object_id *oid,
@@ -1074,7 +1081,7 @@ static void create_object_entry(const struct object_id 
*oid,
else
nr_result++;
if (found_pack) {
-   entry->in_pack = found_pack;
+   oe_set_in_pack(entry, found_pack);
entry->in_pack_offset = found_offset;
}
 
@@ -1399,8 +1406,8 @@ static void cleanup_preferred_base(void)
 
 static void check_object(struct object_entry *entry)
 {
-   if (entry->in_pack) {
-   struct packed_git *p = entry->in_pack;
+   if (IN_PACK(entry)) {
+   struct packed_git *p = IN_PACK(entry);
struct pack_window *w_curs = NULL;
const unsigned char *base_ref = NULL;
struct object_entry *base_entry;
@@ -1535,14 +1542,16 @@ static int pack_offset_sort(const void *_a, const void 
*_b)
 {
const struct object_entry *a = *(struct object_entry **)_a;
const struct object_entry *b = *(struct object_entry **)_b;
+   const struct packed_git *a_in_pack = IN_PACK(a);
+   const struct packed_git *b_in_pack = IN_PACK(b);
 
/* avoid filesystem trashing with loose objects */
-   if (!a->in_pack && !b->in_pack)
+   if (!a_in_pack && !b_in_pack)
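
To make the pointer-to-index conversion above more concrete, the
accessors boil down to something like the following (a toy sketch
using names from this series; the real definitions live in
pack-objects.h and also handle growing and searching the pack table):

    #define OE_IN_PACK_BITS 14      /* so at most 1 << 14 == 16384 packs */

    struct packed_git;              /* opaque for this sketch */

    struct packing_data_sketch {
            /* slot 0 stays NULL, meaning "not in any pack" */
            struct packed_git *in_pack_by_idx[1 << OE_IN_PACK_BITS];
            unsigned in_pack_count;
    };

    struct object_entry_sketch {
            /* a 14-bit index replaces an 8-byte struct packed_git * */
            unsigned in_pack_idx:OE_IN_PACK_BITS;
    };

    static inline struct packed_git *
    oe_in_pack_sketch(const struct packing_data_sketch *pack,
                      const struct object_entry_sketch *e)
    {
            return pack->in_pack_by_idx[e->in_pack_idx];
    }

This is also where the 16384-pack limit documented earlier in this
patch comes from: the index must fit in OE_IN_PACK_BITS bits.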

[PATCH v4 08/11] pack-objects: shrink z_delta_size field in struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
We only cache deltas when they are smaller than a certain limit. This
limit defaults to 1000, but we save the compressed length in a 64-bit
field. Shrink that field down to 16 bits, so we can only cache deltas
up to 64 kB. Larger deltas must be recomputed when the pack is written
out.
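
The implementation below relies on an assign-then-compare idiom:
storing into the 16-bit bitfield truncates, so reading the value back
and comparing against the original detects deltas that do not fit, and
those are simply dropped from the cache. A standalone illustration
(toy code, not part of the patch):

    #include <stdio.h>

    struct entry { unsigned z_delta_size:16; };

    int main(void)
    {
            struct entry e;
            unsigned long size = 70000;     /* does not fit in 16 bits */

            e.z_delta_size = size;          /* silently truncated */
            if (e.z_delta_size == size)
                    puts("fits: keep the cached delta");
            else
                    puts("too big: drop the cached delta"); /* this runs */
            return 0;
    }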

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt |  3 ++-
 builtin/pack-objects.c   | 22 --
 pack-objects.h   |  3 ++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9bd3f5a789..00fa824448 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2449,7 +2449,8 @@ pack.deltaCacheLimit::
The maximum size of a delta, that is cached in
linkgit:git-pack-objects[1]. This cache is used to speed up the
writing object phase by not having to recompute the final delta
-   result once the best match for all objects is found. Defaults to 1000.
+   result once the best match for all objects is found.
+   Defaults to 1000. Maximum value is 65535.
 
 pack.threads::
Specifies the number of threads to spawn when searching for best
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index cdbad57082..9a0962cf31 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2105,12 +2105,19 @@ static void find_deltas(struct object_entry **list, 
unsigned *list_size,
 * between writes at that moment.
 */
if (entry->delta_data && !pack_to_stdout) {
-   entry->z_delta_size = do_compress(&entry->delta_data,
- entry->delta_size);
-   cache_lock();
-   delta_cache_size -= entry->delta_size;
-   delta_cache_size += entry->z_delta_size;
-   cache_unlock();
+   unsigned long size;
+
+   size = do_compress(&entry->delta_data, entry->delta_size);
+   entry->z_delta_size = size;
+   if (entry->z_delta_size == size) {
+   cache_lock();
+   delta_cache_size -= entry->delta_size;
+   delta_cache_size += entry->z_delta_size;
+   cache_unlock();
+   } else {
+   FREE_AND_NULL(entry->delta_data);
+   entry->z_delta_size = 0;
+   }
}
 
/* if we made n a delta, and if n is already at max
@@ -3089,6 +3096,9 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (depth >= (1 << OE_DEPTH_BITS))
die(_("delta chain depth %d is greater than maximum limit %d"),
depth, (1 << OE_DEPTH_BITS));
+   if (cache_max_small_delta_size >= (1 << OE_Z_DELTA_BITS))
+   die(_("pack.deltaCacheLimit is greater than maximum limit %d"),
+   1 << OE_Z_DELTA_BITS);
 
argv_array_push(, "pack-objects");
if (thin) {
diff --git a/pack-objects.h b/pack-objects.h
index 7f32de2a35..a66c37e35a 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -4,6 +4,7 @@
 #define OE_DFS_STATE_BITS  2
 #define OE_DEPTH_BITS  12
 #define OE_IN_PACK_BITS14
+#define OE_Z_DELTA_BITS16
 
 /*
  * State flags for depth-first search used for analyzing delta cycles.
@@ -78,7 +79,7 @@ struct object_entry {
 */
void *delta_data;   /* cached delta (uncompressed) */
unsigned long delta_size;   /* delta data size (uncompressed) */
-   unsigned long z_delta_size; /* delta data size (compressed) */
+   unsigned z_delta_size:OE_Z_DELTA_BITS;
unsigned type_:TYPE_BITS;
unsigned in_pack_type:TYPE_BITS; /* could be delta */
unsigned type_valid:1;
-- 
2.16.2.903.gd04caf5039



[PATCH v4 11/11] pack-objects.h: reorder members to shrink struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
Previous patches leave lots of holes and padding in this struct. This
patch reorders the members and shrinks the struct down to 80 bytes
(from 136 bytes, before any field shrinking is done) with 16 bits to
spare (and a couple more in in_pack_header_size when we really run out
of bits).
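
The mechanics are plain C struct padding: members are padded to their
alignment requirements, so interleaving small and large members wastes
bytes. A toy illustration (not the real object_entry):

    #include <stdio.h>

    struct interleaved { char a; void *p; char b; void *q; };
    struct grouped     { void *p; void *q; char a; char b; };

    int main(void)
    {
            /* on a typical 64-bit ABI: 32 vs 24 bytes */
            printf("interleaved: %zu, grouped: %zu\n",
                   sizeof(struct interleaved), sizeof(struct grouped));
            return 0;
    }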

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 pack-objects.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/pack-objects.h b/pack-objects.h
index f430d938c6..0fa0c83294 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -70,35 +70,36 @@ enum dfs_state {
  */
 struct object_entry {
struct pack_idx_entry idx;
-   /* object uncompressed size _if_ size_valid is true */
-   uint32_t size_;
-   unsigned size_valid:1;
-   unsigned in_pack_idx:OE_IN_PACK_BITS;   /* already in pack */
+   void *delta_data;   /* cached delta (uncompressed) */
off_t in_pack_offset;
+   uint32_t hash;  /* name hint hash */
+   uint32_t size_; /* object uncompressed size _if_ size_valid is true */
uint32_t delta_idx; /* delta base object */
uint32_t delta_child_idx; /* deltified objects who bases me */
uint32_t delta_sibling_idx; /* other deltified objects who
 * uses the same base as me
 */
-   void *delta_data;   /* cached delta (uncompressed) */
uint32_t delta_size_:OE_DELTA_SIZE_BITS; /* delta data size 
(uncompressed) */
uint32_t delta_size_valid:1;
+   unsigned in_pack_idx:OE_IN_PACK_BITS;   /* already in pack */
+   unsigned size_valid:1;
unsigned z_delta_size:OE_Z_DELTA_BITS;
+   unsigned type_valid:1;
unsigned type_:TYPE_BITS;
unsigned in_pack_type:TYPE_BITS; /* could be delta */
-   unsigned type_valid:1;
-   uint32_t hash;  /* name hint hash */
-   unsigned char in_pack_header_size;
unsigned preferred_base:1; /*
* we do not pack this, but is available
* to be used as the base object to delta
* objects against.
*/
unsigned no_try_delta:1;
+   unsigned char in_pack_header_size;
unsigned tagged:1; /* near the very tip of refs */
unsigned filled:1; /* assigned write-order */
unsigned dfs_state:OE_DFS_STATE_BITS;
unsigned depth:OE_DEPTH_BITS;
+
+   /* size: 80, bit_padding: 16 bits */
 };
 
 struct packing_data {
-- 
2.16.2.903.gd04caf5039



[PATCH v4 02/11] pack-objects: turn type and in_pack_type to bitfields

2018-03-16 Thread Nguyễn Thái Ngọc Duy
An extra field type_valid is added to carry the equivalent of OBJ_BAD
in the original "type" field. in_pack_type always contains a valid
type so we only need 3 bits for it.

A note about accepting OBJ_NONE as a "valid" type. The function
read_object_list_from_stdin() can pass this value [1] and it
eventually calls create_object_entry(), where the current code skips
setting the "type" field if the incoming type is zero. This does not
have any bad side effects because the "type" field should be
memset()'d anyway.

But since we also need to set type_valid now, skipping oe_set_type()
leaves type_valid zero/false, which will make oe_type() return
OBJ_BAD, not OBJ_NONE anymore. Apparently we do care about OBJ_NONE in
prepare_pack(). This switch from OBJ_NONE to OBJ_BAD may trigger

fatal: unable to get type of object ...

Accepting OBJ_NONE [2] does sound wrong, but this is how it has been
for a very long time and I haven't had time to dig in further.

[1] See 5c49c11686 (pack-objects: better check_object() performances -
2007-04-16)

[2] 21666f1aae (convert object type handling from a string to a number
- 2007-02-26)

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 builtin/pack-objects.c | 58 +-
 cache.h|  2 ++
 object.h   |  1 -
 pack-bitmap-write.c|  6 ++---
 pack-objects.h | 20 +--
 5 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5c674b2843..13f6a44fb2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -265,7 +265,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
struct git_istream *st = NULL;
 
if (!usable_delta) {
-   if (entry->type == OBJ_BLOB &&
+   if (oe_type(entry) == OBJ_BLOB &&
entry->size > big_file_threshold &&
(st = open_istream(entry->idx.oid.hash, &type, &size, NULL)) != NULL)
buf = NULL;
@@ -371,7 +371,7 @@ static off_t write_reuse_object(struct hashfile *f, struct 
object_entry *entry,
struct pack_window *w_curs = NULL;
struct revindex_entry *revidx;
off_t offset;
-   enum object_type type = entry->type;
+   enum object_type type = oe_type(entry);
off_t datalen;
unsigned char header[MAX_PACK_OBJECT_HEADER],
  dheader[MAX_PACK_OBJECT_HEADER];
@@ -480,11 +480,12 @@ static off_t write_object(struct hashfile *f,
to_reuse = 0;   /* explicit */
else if (!entry->in_pack)
to_reuse = 0;   /* can't reuse what we don't have */
-   else if (entry->type == OBJ_REF_DELTA || entry->type == OBJ_OFS_DELTA)
+   else if (oe_type(entry) == OBJ_REF_DELTA ||
+oe_type(entry) == OBJ_OFS_DELTA)
/* check_object() decided it for us ... */
to_reuse = usable_delta;
/* ... but pack split may override that */
-   else if (entry->type != entry->in_pack_type)
+   else if (oe_type(entry) != entry->in_pack_type)
to_reuse = 0;   /* pack has delta which is unusable */
else if (entry->delta)
to_reuse = 0;   /* we want to pack afresh */
@@ -705,8 +706,8 @@ static struct object_entry **compute_write_order(void)
 * And then all remaining commits and tags.
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
-   if (objects[i].type != OBJ_COMMIT &&
-   objects[i].type != OBJ_TAG)
+   if (oe_type(&objects[i]) != OBJ_COMMIT &&
+   oe_type(&objects[i]) != OBJ_TAG)
continue;
add_to_write_order(wo, &wo_end, &objects[i]);
}
@@ -715,7 +716,7 @@ static struct object_entry **compute_write_order(void)
 * And then all the trees.
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
-   if (objects[i].type != OBJ_TREE)
+   if (oe_type(&objects[i]) != OBJ_TREE)
continue;
add_to_write_order(wo, &wo_end, &objects[i]);
}
@@ -1066,8 +1067,7 @@ static void create_object_entry(const struct object_id 
*oid,
 
entry = packlist_alloc(&to_pack, oid->hash, index_pos);
entry->hash = hash;
-   if (type)
-   entry->type = type;
+   oe_set_type(entry, type);
if (exclude)
entry->preferred_base = 1;
else
@@ -1407,6 +1407,7 @@ static void check_object(struct object_entry *entry)
unsigned long avail;
off_t ofs;
unsigned char *buf, c;
+   enum object_type type;
 
buf = use_pack(p, &w_curs, entry->in_pack_offset, &avail);
 
@@ -1415,11 +1416,15 @@ static void check_object(struct object_entry *entry)
 * since non-delta representations could still be reused.
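
For reference, the new wrappers amount to something like this (a
sketch assuming the field names in this series; the real definitions
are in pack-objects.h):

    enum object_type { OBJ_BAD = -1, OBJ_NONE = 0, OBJ_COMMIT = 1 /* ... */ };
    #define TYPE_BITS 3

    struct object_entry_sketch {
            unsigned type_:TYPE_BITS;
            unsigned type_valid:1;
    };

    static inline enum object_type oe_type(const struct object_entry_sketch *e)
    {
            return e->type_valid ? e->type_ : OBJ_BAD;
    }

    static inline void oe_set_type(struct object_entry_sketch *e,
                                   enum object_type type)
    {
            e->type_valid = type >= OBJ_NONE;   /* OBJ_BAD is negative */
            e->type_ = (unsigned)type;
    }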
 

[PATCH v4 04/11] pack-objects: use bitfield for object_entry::depth

2018-03-16 Thread Nguyễn Thái Ngọc Duy
Because of struct packing, from now on we can only handle a max depth
of 4095 (or even lower when new booleans are added to this struct).
This should be ok since long delta chains cause significant slowdown
anyway.
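
The limit follows directly from the field width; as a quick sanity
check (throwaway code, not part of the patch):

    #include <stdio.h>

    #define OE_DEPTH_BITS 12

    int main(void)
    {
            /* a 12-bit unsigned bitfield holds 0..4095 */
            printf("max depth: %u\n", (1u << OE_DEPTH_BITS) - 1);
            return 0;
    }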

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt   | 1 +
 Documentation/git-pack-objects.txt | 4 +++-
 Documentation/git-repack.txt   | 4 +++-
 builtin/pack-objects.c | 4 
 pack-objects.h | 5 ++---
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f57e9cf10c..9bd3f5a789 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2412,6 +2412,7 @@ pack.window::
 pack.depth::
The maximum delta depth used by linkgit:git-pack-objects[1] when no
maximum depth is given on the command line. Defaults to 50.
+   Maximum value is 4095.
 
 pack.windowMemory::
The maximum size of memory that is consumed by each thread
diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 81bc490ac5..3503c9e3e6 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -96,7 +96,9 @@ base-name::
it too deep affects the performance on the unpacker
side, because delta data needs to be applied that many
times to get to the necessary object.
-   The default value for --window is 10 and --depth is 50.
++
+The default value for --window is 10 and --depth is 50. The maximum
+depth is 4095.
 
 --window-memory=::
This option provides an additional limit on top of `--window`;
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ae750e9e11..25c83c4927 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -90,7 +90,9 @@ other objects in that pack they already have locally.
space. `--depth` limits the maximum delta depth; making it too deep
affects the performance on the unpacker side, because delta data needs
to be applied that many times to get to the necessary object.
-   The default value for --window is 10 and --depth is 50.
++
+The default value for --window is 10 and --depth is 50. The maximum
+depth is 4095.
 
--threads=<n>::
This option is passed through to `git pack-objects`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 09f8b4ef3e..668eaf8cd7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3068,6 +3068,10 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (pack_to_stdout != !base_name || argc)
usage_with_options(pack_usage, pack_objects_options);
 
+   if (depth >= (1 << OE_DEPTH_BITS))
+   die(_("delta chain depth %d is greater than maximum limit %d"),
+   depth, (1 << OE_DEPTH_BITS));
+
argv_array_push(, "pack-objects");
if (thin) {
use_internal_rev_list = 1;
diff --git a/pack-objects.h b/pack-objects.h
index 2bb1732098..50908d1f2d 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -2,6 +2,7 @@
 #define PACK_OBJECTS_H
 
 #define OE_DFS_STATE_BITS  2
+#define OE_DEPTH_BITS  12
 
 /*
  * State flags for depth-first search used for analyzing delta cycles.
@@ -88,9 +89,7 @@ struct object_entry {
unsigned tagged:1; /* near the very tip of refs */
unsigned filled:1; /* assigned write-order */
unsigned dfs_state:OE_DFS_STATE_BITS;
-
-   int depth;
-
+   unsigned depth:OE_DEPTH_BITS;
 };
 
 struct packing_data {
-- 
2.16.2.903.gd04caf5039



[PATCH v4 00/11] nd/pack-objects-pack-struct updates

2018-03-16 Thread Nguyễn Thái Ngọc Duy
The most important change in v4 is that it fixes a case where I failed
to propagate an error condition to later code in 02/11. This results
in new wrappers, oe_type() and oe_set_type(). This also reveals an
extra object type, OBJ_NONE, that's also used by pack-objects.

Other changes are comment fixes, commit message fixes, and off-by-one
bug fixes. No more savings compared to v3.

I also changed my approach a bit. I stopped trying to make the struct
reduction visible in every patch. Each of these patches shrinks some
field even if the struct size stays the same. The reordering and
packing happen in the last patch.

I'm not super happy that many corner cases of my changes are not
covered by the test suite. In many cases it's very hard or expensive
to create the right error condition. If only this code were part of
libgit.a so that I could write C unit tests for it...

Interdiff

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0f65e0f243..c388d87c3e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -275,7 +275,7 @@ static unsigned long write_no_reuse_object(struct hashfile 
*f, struct object_ent
struct git_istream *st = NULL;
 
if (!usable_delta) {
-   if (entry->type == OBJ_BLOB &&
+   if (oe_type(entry) == OBJ_BLOB &&
oe_size_greater_than(entry, big_file_threshold) &&
(st = open_istream(entry->idx.oid.hash, &type, &size, NULL)) != NULL)
buf = NULL;
@@ -381,7 +381,7 @@ static off_t write_reuse_object(struct hashfile *f, struct 
object_entry *entry,
struct pack_window *w_curs = NULL;
struct revindex_entry *revidx;
off_t offset;
-   enum object_type type = entry->type;
+   enum object_type type = oe_type(entry);
off_t datalen;
unsigned char header[MAX_PACK_OBJECT_HEADER],
  dheader[MAX_PACK_OBJECT_HEADER];
@@ -491,11 +491,12 @@ static off_t write_object(struct hashfile *f,
to_reuse = 0;   /* explicit */
else if (!IN_PACK(entry))
to_reuse = 0;   /* can't reuse what we don't have */
-   else if (entry->type == OBJ_REF_DELTA || entry->type == OBJ_OFS_DELTA)
+   else if (oe_type(entry) == OBJ_REF_DELTA ||
+oe_type(entry) == OBJ_OFS_DELTA)
/* check_object() decided it for us ... */
to_reuse = usable_delta;
/* ... but pack split may override that */
-   else if (entry->type != entry->in_pack_type)
+   else if (oe_type(entry) != entry->in_pack_type)
to_reuse = 0;   /* pack has delta which is unusable */
else if (DELTA(entry))
to_reuse = 0;   /* we want to pack afresh */
@@ -716,8 +717,8 @@ static struct object_entry **compute_write_order(void)
 * And then all remaining commits and tags.
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
-   if (objects[i].type != OBJ_COMMIT &&
-   objects[i].type != OBJ_TAG)
+   if (oe_type(&objects[i]) != OBJ_COMMIT &&
+   oe_type(&objects[i]) != OBJ_TAG)
continue;
add_to_write_order(wo, &wo_end, &objects[i]);
}
@@ -726,7 +727,7 @@ static struct object_entry **compute_write_order(void)
 * And then all the trees.
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
-   if (objects[i].type != OBJ_TREE)
+   if (oe_type(&objects[i]) != OBJ_TREE)
continue;
add_to_write_order(wo, &wo_end, &objects[i]);
}
@@ -1083,8 +1084,7 @@ static void create_object_entry(const struct object_id 
*oid,
 
entry = packlist_alloc(&to_pack, oid->hash, index_pos);
entry->hash = hash;
-   if (type)
-   entry->type = type;
+   oe_set_type(entry, type);
if (exclude)
entry->preferred_base = 1;
else
@@ -1453,9 +1453,9 @@ static void check_object(struct object_entry *entry)
switch (entry->in_pack_type) {
default:
/* Not a delta hence we've already got all we need. */
-   entry->type = entry->in_pack_type;
+   oe_set_type(entry, entry->in_pack_type);
entry->in_pack_header_size = used;
-   if (entry->type < OBJ_COMMIT || entry->type > OBJ_BLOB)
+   if (oe_type(entry) < OBJ_COMMIT || oe_type(entry) > 
OBJ_BLOB)
goto give_up;
unuse_pack(&w_curs);
return;
@@ -1509,7 +1509,7 @@ static void check_object(struct object_entry *entry)
 * deltify other objects against, in order to avoid
 * circular deltas.
 */
-   entry->type = entry->in_pack_type;
+   oe_set_type(entry, entry->in_pack_type);

[PATCH v4 01/11] pack-objects: a bit of document about struct object_entry

2018-03-16 Thread Nguyễn Thái Ngọc Duy
The role of this comment block becomes more important after we shuffle
fields around to shrink this struct. It will be much harder to see what
field is related to what.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 pack-objects.h | 44 
 1 file changed, 44 insertions(+)

diff --git a/pack-objects.h b/pack-objects.h
index 03f1191659..85345a4af1 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -1,6 +1,50 @@
 #ifndef PACK_OBJECTS_H
 #define PACK_OBJECTS_H
 
+/*
+ * basic object info
+ * -
+ * idx.oid is filled up before delta searching starts. idx.crc32 is
+ * only valid after the object is written out and will be used for
+ * generating the index. idx.offset will be both gradually set and
+ * used in writing phase (base objects get offset first, then deltas
+ * refer to them)
+ *
+ * "size" is the uncompressed object size. Compressed size is not
+ * cached (ie. raw data in a pack) but available via revindex.
+ *
+ * "hash" contains a path name hash which is used for sorting the
+ * delta list and also during delta searching. Once prepare_pack()
+ * returns it's no longer needed.
+ *
+ * source pack info
+ * 
+ * The (in_pack, in_pack_offset, in_pack_header_size) tuple contains
+ * the location of the object in the source pack, with or without
+ * header.
+ *
+ * "type" and "in_pack_type" both describe object type. in_pack_type
+ * may contain a delta type, while type is always the canonical type.
+ *
+ * deltas
+ * --
+ * Delta links (delta, delta_child and delta_sibling) are created to
+ * reflect the delta graph from the source pack, then updated or added
+ * during the delta searching phase when we find better deltas.
+ *
+ * delta_child and delta_sibling are last needed in
+ * compute_write_order(). "delta" and "delta_size" must remain valid
+ * at object writing phase in case the delta is not cached.
+ *
+ * If a delta is cached in memory and is compressed delta_data points
+ * to the data and z_delta_size contains the compressed size. If it's
+ * uncompressed [1], z_delta_size must be zero. delta_size is always
+ * the uncompressed size and must be valid even if the delta is not
+ * cached.
+ *
+ * [1] during try_delta phase we don't bother with compressing because
+ * the delta could be quickly replaced with a better one.
+ */
 struct object_entry {
struct pack_idx_entry idx;
unsigned long size; /* uncompressed size */
-- 
2.16.2.903.gd04caf5039



Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-16 Thread Eric Sunshine
On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamano  wrote:
> Eric Sunshine  writes:
>> However, I'm having a tough time imagining cases in which callers
>> would want same_encoding() to return true if both arguments are NULL,
>> but outright crash if only one is NULL (which is the behavior even
>> before this patch). In other words, same_encoding() takes advantage of
>> is_encoding_utf8() for its convenience, not for its NULL-handling.
>> Given that view, the two explicit is_encoding_utf8() calls in
>> same_encoding() seem redundant once the same_utf_encoding() call is
>> added.
>
> So... does that mean we'd want something like this, or do you have
> something else in mind?
>
> int same_encoding(const char *src, const char *dst)
> {
> static const char utf8[] = "UTF-8";
>
> if (!src)
> src = utf8;
> if (!dst)
> dst = utf8;
> if (same_utf_encoding(src, dst))
> return 1;
> return !strcasecmp(src, dst);
> }

I am not proposing anything like that for this patch or patch series.
I'm merely asking why, after this patch, same_encoding() still
contains the (in my mind) now-unneeded conditional:

if (is_encoding_utf8(src) && is_encoding_utf8(dst))
return 1;

If I'm reading the current code correctly, same_encoding() will crash
when either (but not both) of its arguments is NULL and the non-NULL
argument is not a variation of "UTF-8". If both arguments are NULL,
then it won't crash, but I don't believe that it was intentional that
it should crash for some NULL cases but not others; rather, not
crashing when both are NULL is an _accidental_ side-effect of relying
upon is_encoding_utf8().

If I understood him correctly, Lars's justification for retaining the
conditional in question even after the patch is to maintain the
non-crashing behavior for both arguments being NULL, even though it
will continue to crash when only one is NULL. That justification
doesn't make sense to me since I can't imagine clients relying on
accidental behavior of sometimes crashing, sometimes not, hence my
question to Lars.

As for your snippet above which ensures that 'src' and 'dst' are
non-NULL, that (or some variation) would be fine if we ever expect
callers of same_encoding() to pass NULL for either of those arguments,
but such a fix is outside the scope of this patch and even this patch
series (which does not require such a fix), IMHO.
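
To spell out the crash path I have in mind (relying only on
is_encoding_utf8() treating a NULL name as UTF-8, which is its current
behavior):

    same_encoding(NULL, NULL);      /* both treated as UTF-8: returns 1 */
    same_encoding("latin1", NULL);  /* the UTF-8 check fails, so this
                                     * falls through to
                                     * strcasecmp("latin1", NULL):
                                     * undefined behavior, a crash in
                                     * practice */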


Re: [PATCH v4 1/3] branch: introduce dont_fail parameter for branchname validation

2018-03-16 Thread Kaartic Sivaraam
On Friday 16 March 2018 01:57 AM, Junio C Hamano wrote:
> Kaartic Sivaraam  writes:
> 
> So... I am not finding dont_fail that was mentioned on the title
> anywhere else in the patch.  Such lack of attention to detail is
> a bit off-putting.
> 

I'm absolutely sorry x-<

I must have forgotten to update the commit subject during one of the
old rebases in which I renamed the parameter from 'dont_fail' to 'gently'.

I shouldn't have assumed that the commit messages of the old series
still held for the reboot. I should have re-read them completely and
double-checked them. :-(


>> +enum branch_validation_result {
>> +/* Flags that convey there are fatal errors */
>> +VALIDATION_FATAL_BRANCH_EXISTS_NO_FORCE = -3,
>> +VALIDATION_FATAL_CANNOT_FORCE_UPDATE_CURRENT_BRANCH,
>> +VALIDATION_FATAL_INVALID_BRANCH_NAME,
>> +/* Flags that convey there are no fatal errors */
>> +VALIDATION_PASS_BRANCH_DOESNT_EXIST = 0,
>> +VALIDATION_PASS_BRANCH_EXISTS = 1,
>> +VALIDATION_WARN_BRANCH_EXISTS = 2
>> +};
> 
> where adding new error types will force us to touch _two_ lines
> (i.e. either you add a new error before NO_FORCE with value -4 and
> then remove the "= -3" from NO_FORCE, or you add a new error after
> INVALID, and update NO_FORCE to -4), which can easily be screwed up
> by a careless developer.  The current code is not wrong per-se, but
> I wonder if it can be made less error prone.
> 

At the top of my head I could think of 2 ways to get around this,

- Assigning the actual value to every single entry in the enum.

  This should solve the issue, as any new entry that gets added is
  expected to come "with a value". The compiler would warn you in
  the case of duplicate values. The drawback: it might be a little
  restrictive and a little ugly. It would also likely cause
  maintenance issues if the number of values in the enum gets
  bigger.

  (Of course this doesn't hold if the careless programmer shatters
   "consistency" and adds an entry without a value to the enum and
   that change gets merged into the codebase ;-) )

- Using non-negative values for both errors and non-errors.

  This might make it hard to distinguish errors from non-errors,
  but it would avoid the numbering mistakes completely without
  much trouble otherwise.

I might prefer the former, as I find it more likely that we will need
to distinguish errors from non-errors than that we will need to keep
adding new entries to the enum.
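
Concretely, the first option would make every value explicit:

    enum branch_validation_result {
            /* Flags that convey there are fatal errors */
            VALIDATION_FATAL_BRANCH_EXISTS_NO_FORCE             = -3,
            VALIDATION_FATAL_CANNOT_FORCE_UPDATE_CURRENT_BRANCH = -2,
            VALIDATION_FATAL_INVALID_BRANCH_NAME                = -1,
            /* Flags that convey there are no fatal errors */
            VALIDATION_PASS_BRANCH_DOESNT_EXIST                 = 0,
            VALIDATION_PASS_BRANCH_EXISTS                       = 1,
            VALIDATION_WARN_BRANCH_EXISTS                       = 2
    };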

Any other ideas/suggestions ?

-- 

Kaartic

QUOTE:

“The most valuable person on any team is the person who makes everyone
else on the team more valuable, not the person who knows the most.”

  - Joel Spolsky





Re: [ANNOUNCE] Git v2.17.0-rc0

2018-03-16 Thread Junio C Hamano
Junio C Hamano  writes:

> I haven't wordsmithed it fully, but it should say something along
> the lines of ...
>
>  Documentation/RelNotes/2.16.0.txt | 10 ++
>  1 file changed, 10 insertions(+)

Eh, of course the addition should go to 2.17 release notes ;-)  I
just happened to be reviewing a topic forked earlier.


Re: [ANNOUNCE] Git v2.17.0-rc0

2018-03-16 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> On Fri, Mar 16 2018, Junio C. Hamano jotted:
>
>>   gitweb: hard-depend on the Digest::MD5 5.8 module
>
> I've just noticed this now, but while this module is in 5.8 RedHat's
> butchered perl doesn't have it in the base system, thus this introduces
> the do-we-even-care regression that git's full test suite won't pass on
> a RedHat (or CentOS) base system, because the gitweb tests will fail to
> "use" Digest::MD5.
>
> I'm slightly leaning towards not caring about it, since there's no other
> perl distributor that does this sort of split-out of the core, and if
> you're on a RedHat system they're solving your package problems, so this
> really only impacts the edge case of git developers and redhat
> packagers, both of whom can just do "yum install -y perl-Digest-MD5" to
> fix it.

Thanks for noting.  I agree that this is not something that requires
more than a mention near the beginning of release notes.

I haven't wordsmithed it fully, but it should say something along
the lines of ...

 Documentation/RelNotes/2.16.0.txt | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/RelNotes/2.16.0.txt 
b/Documentation/RelNotes/2.16.0.txt
index 8f0461eefd..8b4c24200b 100644
--- a/Documentation/RelNotes/2.16.0.txt
+++ b/Documentation/RelNotes/2.16.0.txt
@@ -6,6 +6,16 @@ Backward compatibility notes and other notable changes.
  * Use of an empty string as a pathspec element that is used for
'everything matches' is now an error.
 
+ * Parts of Git that depend on Perl have required at least Perl 5.8
+   since Git v1.7.4 released in 2010, but we used to assume some core
+   modules from the Perl distribution may not exist on the system and
+   did a conditional "eval { require <module> }"; we no longer do this.
+   On a platform that ships a stripped-down Perl by default, the user
+   may have to install modules the platform chooses not to ship as
+   part of its core (e.g. Digest::MD5, File::Temp, File::Spec,
+   Net::SMTP, Net::Domain).  RedHat/CentOS excludes Digest::MD5 from
+   its base installation, for example.
+
 
 Updates since v2.15
 ---



Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-16 Thread Junio C Hamano
Eric Sunshine  writes:

> However, I'm having a tough time imagining cases in which callers
> would want same_encoding() to return true if both arguments are NULL,
> but outright crash if only one is NULL (which is the behavior even
> before this patch). In other words, same_encoding() takes advantage of
> is_encoding_utf8() for its convenience, not for its NULL-handling.
> Given that view, the two explicit is_encoding_utf8() calls in
> same_encoding() seem redundant once the same_utf_encoding() call is
> added.

So... does that mean we'd want something like this, or do you have
something else in mind?

int same_encoding(const char *src, const char *dst)
{
static const char utf8[] = "UTF-8";

if (!src)
src = utf8;
if (!dst)
dst = utf8;
if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);
}


Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

2018-03-16 Thread Duy Nguyen
On Thu, Mar 15, 2018 at 8:21 PM, Ævar Arnfjörð Bjarmason
 wrote:
>
> On Thu, Mar 15 2018, Duy Nguyen jotted:
>
>> On Mon, Mar 12, 2018 at 8:30 PM, Ævar Arnfjörð Bjarmason
>>  wrote:
>>> We already have pack.packSizeLimit, perhaps we could call this
>>> e.g. gc.keepPacksSize=2GB?
>>
>> I'm OK either way. The "base pack" concept comes from the
>> "--keep-base-pack" option where we do keep _one_ base pack. But gc
>> config var has a slightly different semantics when it can keep
>> multiple packs.
>
> I see, yeah it would be great to generalize it to N packs.
>
>>> Finally I wonder if there should be something equivalent to
>>> gc.autoPackLimit for this. I.e. with my proposed semantics above it's
>>> possible that we end up growing forever, i.e. I could have 1000 2GB
>>> packs and then 50 very small packs per gc.autoPackLimit.
>>>
>>> Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
>>> gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
>>> two smallest one and not issue a --keep-pack for those, although then
>>> maybe our memory use would spike past the limit.
>>>
>>> I don't know, maybe we can leave that for later, but I'm quite keen to
>>> turn the top-level config variable into something that just considers
>>> size instead of "base" if possible, and it seems we're >95% of the way
>>> to that already with this patch.
>>
>> At least I will try to ignore gc.keepPacksSize if all packs are kept
>> because of it. That repack run will hurt. But then we're back to one
>> giant pack and plenty of small packs that will take some time to grow
>> up to 2GB again.
>
> I think that semantic really should have its own option. The usefulness
> of this is significantly diminished if it's not a guarantee on the
> resource use of git-gc.
>
> Consider a very large repo where we clone and get a 4GB pack. Then as
> time goes on we end up with lots of loose objects and small packs from
> pulling, and eventually end up with say 4GB + 2x 500MB packs (if our
> limit is 500MB).
>
> If I understand what you're saying correctly if we ever match the gc
> --auto requirements because we have *just* the big packs and then a
> bunch of loose objects (say we rebased a lot) then we'll try to create a
> giant 5GB pack (+ loose objects).

Yes. There isn't a simple and easy solution here, and I consider
packing (even if it's expensive) to regain performance better than
not packing at all. I could tweak that a bit by keeping the largest
pack out (so we have two packs in the end). After a long, long time,
when your second pack gets to 5 GB, we hit the most expensive
repack. But that should be ok for now, I guess.

I think this repack strategy was discussed here at some point in the
past by Gerrit guys. Their goal was to reduce I/O, I believe. A
perfect solution probably could be found, but I don't want to hold
this series back until it's found and I don't want to introduce a
zillion config knobs that become useless later on when the perfect
solution is found.

>>> Actually maybe that should be a "if we're that low on memory, forget
>>> about GC for now" config, but urgh, there's a lot of potential
>>> complexity to be handled here...
>>
>> Yeah I think what you want is a hook. You can make custom rules then.
>> We already have pre-auto-gc hook and could pretty much do what you
>> want without pack-objects memory estimation. But if you want it, maybe
>> we can export the info to the hook somehow.
>
> I can do away with that particular thing, but I'd really like to do
> without the hook. I can automate it on some machines, but then we also
> have un-managed laptops run by users who clone big repos. It's much
> easier to tell them to set a few git config variables than have them
> install & keep some hook up-to-date.

That sounds like we need a mechanism to push hooks (and config stuff)
automatically from the clone source. I think this topic was touched on
at the summit? I don't object to adding new config, but we need to
figure out what we need, and from this thread I think there are too
many "I don't know"s to settle on a solution.
-- 
Duy


Re: [PATCH v12 03/10] strbuf: add a case insensitive starts_with()

2018-03-16 Thread Junio C Hamano
Morten Welinder  writes:

> You need to cast those arguments of tolower to (unsigned char).  That's
> arguably a bug of the language.

Don't we do so in git-compat-util.h already?


Re: [PATCH v12 03/10] strbuf: add a case insensitive starts_with()

2018-03-16 Thread Morten Welinder
You need to cast those arguments of tolower to (unsigned char).  That's
arguably a bug in the language.

Character values less than zero aren't valid for tolower, unless they
happen to equal EOF, in which case the tolower calls don't mean what
you want them to mean.  Per the man page:

"If c is neither an unsigned char value nor EOF, the behavior of these
functions is undefined."

M.


On Thu, Mar 15, 2018 at 6:57 PM,   wrote:
> From: Lars Schneider 
>
> Check in a case insensitive manner if one string is a prefix of another
> string.
>
> This function is used in a subsequent commit.
>
> Signed-off-by: Lars Schneider 
> ---
>  git-compat-util.h | 1 +
>  strbuf.c  | 9 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index 68b2ad531e..95c9b34832 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, 
> va_list params);
>  extern void set_die_is_recursing_routine(int (*routine)(void));
>
>  extern int starts_with(const char *str, const char *prefix);
> +extern int istarts_with(const char *str, const char *prefix);
>
>  /*
>   * If the string "str" begins with the string found in "prefix", return 1.
> diff --git a/strbuf.c b/strbuf.c
> index b635f0bdc4..99812b8488 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
> return 0;
>  }
>
> +int istarts_with(const char *str, const char *prefix)
> +{
> +   for (; ; str++, prefix++)
> +   if (!*prefix)
> +   return 1;
> +   else if (tolower(*str) != tolower(*prefix))
> +   return 0;
> +}
> +
>  int skip_to_optional_arg_default(const char *str, const char *prefix,
>  const char **arg, const char *def)
>  {
> --
> 2.16.2
>


Re: Why don't we symlink libexec/git-core/* to bin/git?

2018-03-16 Thread Duy Nguyen
On Thu, Mar 15, 2018 at 6:16 PM, Johannes Schindelin
 wrote:
> To add some interesting information to this: in MinGit (the light-weight
> "Git for applications" we bundle to avoid adding a hefty 230MB to any
> application that wants to bundle Git for Windows), we simply ignored that
> old promise. We do support hooks written as Unix shell scripts in MinGit,
> and we have not had a single report since offering MinGit with v2.9.2 on
> July 16th, 2016, that it broke anybody's scripts, so it seems that users
> are more sensible than our promises ;-)

That's very good to hear. Perhaps we could slowly move away from
symlinking (or even hard linking) these builtin commands (with a
couple of exceptions like receive-pack and such)? We don't have to do
it right now, but we can start announcing that we will drop it in
maybe 2 or 3 releases. We would provide a new make target to recreate
these links so that packagers can make a "compat" package that
contains just these links if they want to. But by default a git
package would have no links.
-- 
Duy


[ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository

2018-03-16 Thread Michael Haggerty
What makes a Git repository unwieldy to work with and host? It turns
out that the repository's on-disk size in gigabytes is only part of
the story. From our experience at GitHub, repositories cause problems
because of poor internal layout at least as often as because of their
overall size. For example,

* blobs or trees that are too large
* large blobs that are modified frequently (e.g., database dumps)
* large trees that are modified frequently
* trees that expand to unreasonable size when checked out (e.g., "Git
bombs" [2])
* too many tiny Git objects
* too many references
* other oddities, such as giant octopus merges, super long reference
names or file paths, huge commit messages, etc.

`git-sizer` [1] is a new open-source tool that computes various
size-related statistics for a Git repository and points out those that
are likely to cause problems or inconvenience to its users.

I tried to make the output of `git-sizer` "opinionated" and easy to
interpret. Example output for the Linux kernel is appended below. I
also made it memory-efficient and resistant against git bombs.

I've written a blog post [3] about `git-sizer` with more explanation
and examples, and the main project page [1] has a long README with
some information about what the individual metrics mean and tips for
fixing problems.

I also put quite a bit of effort into making `git-sizer` fast. It does
its work (including figuring out path names for large objects) based
on a single traversal of the repository history using `git rev-list
--objects --reverse [...]`, followed by using the output of `git
cat-file --batch` or `git cat-file --batch-check` to get information
about individual objects.

On that subject, let me share some more technical details. `git-sizer`
is written in Go. I prototyped several ways of extracting object
information, which is critical to the performance because `git-sizer`
has to read all of the reachable non-blob objects in the repository.
The results surprised me:

| Mechanism for accessing Git data                 | Time   |
| ------------------------------------------------ | -----: |
| `libgit2/git2go`                                 | 25.5 s |
| `libgit2/git2go` with `ManagedTree` optimization | 18.9 s |
| `src-d/go-git`                                   | 63.0 s |
| Git command line client                          |  6.6 s |

It was almost a factor of four faster to read and parse the output of
Git plumbing commands (mainly `git for-each-ref`, `git rev-list
--objects`, `git cat-file --batch-check`, and `git cat-file --batch`)
than it was to use the Go bindings to libgit2. (I expect that part of
the reason is that Go's peculiar stack layout makes it quite expensive
to call out to C.) Even after Carlos Martin implemented an
experimental `ManagedTree` optimization that removed the need to call
C for every entry in a tree, it was still not competitive with the Git
CLI. `go-git`, which is a Git implementation in pure Go, was even
slower. So the final version of `git-sizer` calls `git` for accessing
the repository.

Feedback is welcome, including about the weightings [4] that I use to
compute the "level of concern" of the various metrics.

Have fun,
Michael

[1] https://github.com/github/git-sizer
[2] https://kate.io/blog/git-bomb/
[3] 
https://blog.github.com/2018-03-05-measuring-the-many-sizes-of-a-git-repository/
[4] 
https://github.com/github/git-sizer/blob/2e9a30f241ac357f2af01d42f0dd51fbbbae4b0b/sizes/output.go#L330-L401

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                     | Value     | Level of concern |
| ------------------------ | --------- | ---------------- |
| Overall repository size  |           |                  |
| * Commits                |           |                  |
|   * Count                | 723 k     | *                |
|   * Total size           | 525 MiB   | **               |
| * Trees                  |           |                  |
|   * Count                | 3.40 M    | **               |
|   * Total size           | 9.00 GiB  |                  |
|   * Total tree entries   | 264 M     | *                |
| * Blobs                  |           |                  |
|   * Count                | 1.65 M    | *                |
|   * Total size           | 55.8 GiB  | *                |
| * Annotated tags         |           |                  |
|   * Count                | 534       |                  |
| * References             |           |                  |
|   * Count                | 539       |                  |

Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Junio C Hamano
Johannes Schindelin  writes:

>> > Stolee, you definitely want to inspect those changes (`git log --check`
>> > was introduced to show you whitespace problems). If all of those
>> > whitespace issues are unintentional, you can fix them using `git rebase
>> > --whitespace=fix` in the most efficient way.
>> 
>> Another way that may be easier (depending on the way Derrick works)
>> is to fetch from me and start working from there, as if they were
>> the last set of commits that were sent to the list.  "git log
>> --first-parent --oneline master..pu" would show where the tip of the
>> topic is.
>
> That is not really easier. We had that discussion before. Stolee would
> have to remove your Signed-off-by: lines *manually*.

In return, all the whitespace fixes (and other fixes if any) I did
on my end can be reused for free by the submitter, instead of having
to redo them *manually*.

If a reroll of the series does not touch one specific commit, that
commit can be left as-is; I do not see a need to remove anybody's
sign-off or add yet another of your own, if the last two sign-offs
are from you and your upstream maintainer, if you did not change
anythning in what you got from the latter.  This depends on what
tool is used to work on refinement, but with "rebase -i", you'd
leave "pick" as "pick" and not "edit" or "reword" and it would do
the right thing.

If you did refine, you get an editor when you record that
refinement, so it is just a few key strokes, either "dd" or \C-k, to
do that removal *manually*.  So I am not sure why you are making a
mountain out of this molehill.

If you do want to remove the last two sign-off (i.e. penultimate one
by the author done during the initial submission, plus the last one
by me), well, "rebase -i" is open source.  We can add features to
the tool to help everybody collaborate better.  Extending changes
like planned addition of --signoff by Phillip, it is not all that
far-fetched to add a mechanism that notices a project-specific
trailer rewrite rules in-tree and uses that in between each step to
rewrite the trailer block of the commit message, for example, and
the rule

> I understand that it is a trade-off between time you have to spend and
> that others have to spend, and since you do not scale, that trade-off has
> to be in your favor.

That tradeoff may exist, but it does not weigh in the picture above
at all.

Perhaps it is better to try to actually think of a way to work
together better, instead of just whining.


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread SZEDER Gábor
On Fri, Mar 16, 2018 at 4:06 PM, Ævar Arnfjörð Bjarmason
 wrote:

> I noticed that it takes a *long* time to generate the graph; on a bigger
> repo I have, it takes 20 minutes, and this is a repo where repack -A -d
> itself takes 5-8 minutes (probably on the upper end of that with the
> bitmap), but once you do that it's relatively snappy with --stdin-commits
> --additive when I feed it the new commits.
>
> I don't have any need really to make this run in 10m instead of 20m,
> just something I found interesting, i.e. how it compares to the repack
> itself.

You should forget '--stdin-packs' and use '--stdin-commits' to generate
the initial graph; it's much faster even without '--additive' [1].  See

  
https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/

I still think that the default behaviour for 'git commit-graph write'
should simply walk history from all refs instead of enumerating all
objects in all packfiles.


[1] - Please excuse the bikeshed: '--additive' is such a strange
  sounding option name, at least for me.  '--append', perhaps?


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Lars Schneider

> On 14 Mar 2018, at 21:43, Junio C Hamano  wrote:
> 
> Derrick Stolee  writes:
> 
>> This v6 includes feedback around csum-file.c and the rename of hashclose()
>> to finalize_hashfile(). These are the first two commits of the series, so
>> they could be pulled out independently.
>> 
>> The only other change since v5 is that I re-ran the performance numbers
>> in "commit: integrate commit graph with commit parsing".
> 
> Thanks.
> 
>> Hopefully this version is ready to merge. I have several follow-up topics
>> in mind to submit soon after, including:
> 
> A few patches add trailing blank lines and other whitespace
> breakages, which will stop my "git merge" later to 'next' and down,
> as I have a pre-commit hook to catch them.

@stolee: 

I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect
these kinds of things. I run this as part of my "prepare patch" [1] script
which is inspired by a similar script originally written by Dscho.

Do you think it would make sense to mention (or even
recommend) such a script in your awesome GfW CONTRIBUTING.md?


- Lars


[1] 
https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71




Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Wed, Mar 14 2018, Derrick Stolee jotted:

> Hopefully this version is ready to merge. I have several follow-up topics
> in mind to submit soon after, including:

I've been doing some preliminary testing of this internally, all good so
far, on a relatively small repo (~100k commits) I was using for testing:

- git -c core.commitGraph=true -C  rev-list --all:
  - /mnt/ext4_graph => min:273  mean:279  max:389  -- (273 274 275 276 277 279 282 282 345 389)
  - /mnt/ext4       => min:1087 mean:1123 max:1175 -- (1087 1092 1092 1104 1117 1123 1126 1136 1143 1175)

This is on a fresh clone with one giant pack and where the commit graph
data was generated afterwards with "git commit-graph write" for the
*_graph repo, so it contains all the commits.

So less than 25% of the mean time it spent before. Nice. Those are
times in milliseconds over 10 runs. For this particular one I didn't
get much of an improvement with --graph, but still ~10%:

- git -c core.commitGraph=true -C  log --oneline --graph:
  - /mnt/ext4_graph => min:1420 mean:1449 max:1586 -- (1420 1423 1428 1434 1449 1449 1490 1548 1567 1586)
  - /mnt/ext4       => min:1547 mean:1616 max:2136 -- (1547 1557 1581 1585 1598 1616 1621 1740 1964 2136)

I noticed that it takes a *long* time to generate the graph; on a bigger
repo I have, it takes 20 minutes, and this is a repo where repack -A -d
itself takes 5-8 minutes (probably on the upper end of that with the
bitmap), but once you do that it's relatively snappy with --stdin-commits
--additive when I feed it the new commits.

I don't have any need really to make this run in 10m instead of 20m,
just something I found interesting, i.e. how it compares to the repack
itself.


Re: Help...!

2018-03-16 Thread Konstantin Khomoutov
On Fri, Mar 16, 2018 at 07:38:07PM +0530, JYOTIK MAYUR wrote:

> I am working on a project that is a git hosting website like GitHub. I
> am a student, so I don't know much about how to make a website like
> GitHub. Could you please tell me what the appropriate steps would be
> to make a website like that (mostly the server part)?

Sure, just study the code of the following projects:

- Go: https://gitea.io/
- C#: https://bonobogitserver.com/
- Java: http://gitblit.com/

Still, please note that this list deals with the development of Git,
so your question is clearly off-topic; please refrain from asking
such questions here.

You could try https://www.reddit.com/r/git/ for a start.



Help...!

2018-03-16 Thread JYOTIK MAYUR
I am working on a project that is a git hosting website like GitHub. I
am a student, so I don't know much about how to make a website like
GitHub. Could you please tell me what the appropriate steps would be
to make a website like that (mostly the server part)?

