Re: Revision walking, commit dates, slop

2019-05-17 Thread Mike Hommey
On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote:
> On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote:
> > On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote:
> > > There are established corner cases, where in a repo where commit dates
> > > are not monotonically increasing, revision walking can go horribly
> > > wrong. This was discussed in the past in e.g.
> > > https://public-inbox.org/git/20150521061553.ga29...@glandium.org/
> > > 
> > > The only (simple) workable way, given the current algorithm, to get an
> > > accurate view of rev-list is to essentially make slop infinite. This
> > > works fine, at the expense of runtime.
> > > 
> > > Now, ignoring any modification for the above, I'm hitting another corner
> > > case in some other "weird" history, where I have 500k commits all with
> > > the same date. With such a commit dag, something as trivial as
> > > `git rev-list HEAD~..HEAD` goes through all commits from the root commit
> > > to HEAD, which takes multiple seconds, when the (obvious) output is one
> > > commit.
> > > 
> > > It looks like the only way revision walking stops going through all the
> > > ancestry is through slop, and slop is essentially made infinite by the
> > > fact all commits have the same date (because of the date check in
> > > still_interesting()). By extension, this means the workaround for the
> > > first corner case above, which is to make slop infinite, essentially
> > > makes all rev walking go through the entire ancestry of the commits
> > > given on the command line.
> > > 
> > > It feels like some cases of everybody_uninteresting should shortcut slop
> > > entirely, but considering the only way for slop to decrease at all is
> > > when everybody_uninteresting returns true, that would seem like a wrong
> > > assumption. But I'm also not sure what slop helps with in the first
> > > place (but I don't have a clear view of the broader picture of how the
> > > entire revision walking works).
> > > 
> > > Anyways, a rather easy way to witness this happening is to create a
> > > dummy repo like:
> > >   git init foo
> > >   cd foo
> > >   for i in $(seq 1 50); do
> > >     echo $i > a;
> > >     git add a;
> > >     git commit -a -m $i;
> > >   done
> > > 
> > > Then something as simple as `git rev-list HEAD~..HEAD` will go through
> > > all 50 commits (assuming the script above created commits in the same
> > > second, which it did on my machine).
> > > 
> > > By the time both HEAD~ and HEAD have been processed, the revision
> > > walking should have enough information to determine that it doesn't need
> > > to go further, but still does. Even with something like HEAD~2..HEAD,
> > > after the first round of processing parents it should be able to see
> > > there's not going to be any more interesting commits.
> > > 
> > > I'm willing to dig into this, but if someone familiar with the
> > > algorithm could give me some hints as to what I might be missing in the
> > > big picture, that would be helpful.
> > 
> > All the above is without commit-graph, I presume?  If so, then you
> > should give it a try, as it might bring immediate help in your
> > pathological repo.  With 5k commits in the same second (enforced via
> > 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get:
> > 
> >   $ best-of-five -q git rev-list HEAD~..HEAD
> >   0.069
> >   $ git commit-graph write --reachable
> >   Computing commit graph generation numbers: 100% (5000/5000), done.
> >   $ best-of-five -q git rev-list HEAD~..HEAD
> >   0.004
> 
> I'm not observing any difference from using commit-graph, whether in
> time or in the number of commits that are looked at in limit_list().

-c core.commitGraph=true does make a difference in time, but not in the
number of commits looked at in limit_list(). So it's only faster because
each iteration of the loop is faster. It means it's still dependent on
the depth of the dag, and the larger the repo grows, the slower it will
get.
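
For anyone wanting to reproduce the timing part on such a synthetic repo,
something along these lines does it (counting what limit_list() looks at
requires instrumenting git itself):

  $ git commit-graph write --reachable
  $ time git -c core.commitGraph=false rev-list HEAD~..HEAD >/dev/null
  $ time git -c core.commitGraph=true rev-list HEAD~..HEAD >/dev/null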

Mike


Re: [PATCH 2/2] revision: keep topo-walk free of uninteresting commits

2019-05-21 Thread Mike Hommey
On Tue, May 21, 2019 at 09:59:53AM -0400, Derrick Stolee wrote:
> When updating the topo-order walk in b454241 (revision.c: generation-based
> topo-order algorithm, 2018-11-01), the logic was a huge rewrite of the
> walk logic. In that massive change, we accidentally included the
> UNINTERESTING commits in expand_topo_walk(). This means that a simple
> query like
> 
> git rev-list --topo-order HEAD~1..HEAD
> 
> will expand the topo walk for all commits reachable from HEAD, and not
> just one commit.
> 
> This change should speed up these cases, but there is still a need
> for corrected commit-date for some A..B queries.
> 
> Signed-off-by: Derrick Stolee 
> ---
> 
> Sorry for the patch-spam, but I took a moment to check this command
> on the Git repo, and was able to reproduce the slowness. That didn't
> make sense to me, so I added some log messages to expand_topo_walk()
> and noticed we were walking the UNINTERESTING commits. This is part
> of the reason the new logic is slower for A..B commands, but not the
> whole reason.
> 
> You'll want this patch as well for a test.

Both patches help, thanks.

Mike


`git stash <command> <n>` stopped working in 2.22.0

2019-06-14 Thread Mike Hommey
Hi,

`git stash <command> <n>` where n is a number used to work until 2.21.*.
It doesn't work in 2.22.0.
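
For illustration (assuming at least two stash entries exist), something
like:

  $ git stash show 1

used to show the diff for stash@{1}, and no longer does in 2.22.0.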

Bisection points to:

dc7bd382b1063303f4f45d243bff371899285acb is the first bad commit
commit dc7bd382b1063303f4f45d243bff371899285acb
Author: Paul-Sebastian Ungureanu 
Date:   Mon Feb 25 23:16:20 2019 +

stash: convert show to builtin

which I guess makes sense :)

Mike


Re: [PATCH] stash: fix show referencing stash index

2019-06-24 Thread Mike Hommey
On Sat, Jun 15, 2019 at 12:26:18PM +0100, Thomas Gummerer wrote:
> On 06/14, Mike Hommey wrote:
> > Hi,
> > 
> > `git stash <command> <n>` where n is a number used to work until 2.21.*.
> > It doesn't work in 2.22.0.
> > 
> > Bisection points to:
> > 
> > dc7bd382b1063303f4f45d243bff371899285acb is the first bad commit
> > commit dc7bd382b1063303f4f45d243bff371899285acb
> > Author: Paul-Sebastian Ungureanu 
> > Date:   Mon Feb 25 23:16:20 2019 +
> > 
> > stash: convert show to builtin
> > 
> > which I guess makes sense :)
> 
> Yup, this is definitely a bug.  I think it only affected 'git stash
> show' however, and not other stash subcommands.  If not, could you
> point me to where else you saw this bug?

I confirmed pop, apply, branch, and drop are not affected.

Mike


Multiple urls for remotes?

2019-06-26 Thread Mike Hommey
Hi,

I was surprised to find out that the urls/pushurls set up for remotes can
be set multiple times, and that those urls then end up being used
sequentially.
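
For illustration, this is the kind of setup I mean (urls made up):

  $ git config --add remote.origin.url https://example.com/repo.git
  $ git config --add remote.origin.pushurl ssh://example.com/repo.git
  $ git config --add remote.origin.pushurl ssh://mirror.example.com/repo.git

With that, `git push origin` pushes to each pushurl, one after the other.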

Sadly, this has the side effect that one cannot override the config from
the command line, as the url is then added as an extra url, which is
tried after the already configured one.

e.g.
git -c remote.origin.url=/tmp/foo push origin

will keep pushing to wherever the .git/config's remote.origin.url points
to.

With all the configuration items that work this way, it's actually kind
of sad that it's not possible to force the value from the command line to
override anything that is set in the configuration.

It's worth noting that the documentation for -c says:

Pass a configuration parameter to the command. The value given will
override values from configuration files.

... which implies -c remote.origin.url=/tmp/foo should, in fact, replace
any other value already set.

Coming back to the fact that remote.origin.url and remote.origin.pushurl
can actually have multiple values, it's also worth noting that none of
the git-config, git-fetch or git-push documentation seem to say anything
about the fact there can be multiple values.

The git-config documentation says:
    remote.<name>.url
        The URL of a remote repository. See git-fetch(1) or git-push(1).

    remote.<name>.pushurl
        The push URL of a remote repository. See git-push(1).

The git-fetch and git-push documentation always talks about _the_ URL.

The only thing that seems to be talking about this is the documentation
for git-remote, which has --add for set-url. There doesn't seem to be any
documentation about what conditions are required for the non-primary URL
to be used.

Mike


[ANNOUNCE] git-cinnabar 0.5.2

2019-06-30 Thread Mike Hommey
Hi,

Git-cinnabar is a git remote helper to interact with mercurial
repositories. It allows to clone, pull and push from/to mercurial remote
repositories, using git.

Code on https://github.com/glandium/git-cinnabar
This release on
https://github.com/glandium/git-cinnabar/releases/tag/0.5.2

What's new since 0.5.1?

- Updated git to 2.22.0 for the helper.
- cinnabarclone support is now enabled by default. See details in
  `README.md` and `mercurial/cinnabarclone.py`.
- cinnabarclone now supports grafted repositories.
- `git cinnabar fsck` now does incremental checks against last known
  good state.
- Avoid git cinnabar sometimes thinking the helper is not up-to-date
  when it is.
- Removing bookmarks on a Mercurial server is now working properly.

Mike


Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
Hi,

I was looking at the disk size of the gecko repository on github[1],
which started at 4.7GB, and `git gc --aggressive`'d it, which made that
into 2.0GB. But achieving that required quite some resources.

My first attempt failed with OOM, on an AWS instance with 16 cores and
32GB RAM. I then went to another AWS instance, with 36 cores and 96GB
RAM. And that went through after a while... with a peak memory usage
above 60GB!

Since then, Peff kindly repacked the repo on the github end, so it
doesn't really need repacking locally anymore, but I can still reproduce
the > 60GB memory usage with the packed repository.

I gathered some data[2], all on the same 36 cores, 96GB RAM instance, with
36, 16 and 1 threads, and here's what can be observed:

With 36 threads, the overall process takes 45 minutes:
- 50 seconds enumerating and counting objects.
- ~22 minutes compressing objects
- ~22 minutes writing objects

Of the 22 minutes compressing objects, more than 15 minutes are spent on
the last percent of objects, and only during that part does the memory
usage balloon above 20GB.

Memory usage goes back to 2.4GB after compressing finishes.

With 16 threads, the overall process takes about the same time as above,
with about the same breakdown.

But less time is spent on compressing the last percent of objects, and
memory usage goes above 20GB later than with 36 threads.

Finally, with 1 thread, the picture changes greatly. The overall process
takes 2.5h:
- 50 seconds enumerating and counting objects.
- ~2.5h compressing objects.
- 3 minutes and 25 seconds writing objects!

Memory usage stays reasonable, except at some point after 47 minutes,
where it starts to increase up to 12.7GB, and then goes back down about
half an hour later, all while stalling around the 13% progress mark.

My guess is all those stalls are happening when processing the files I
already had problems with in the past[3], except there are more of them
now (thankfully, they were removed, so there won't be more, but that
doesn't make the existing ones go away).

I never ended up working on making that diff faster; maybe that would
help a little here, but it would probably not help much wrt the memory
usage. I wonder what git could reasonably do to avoid OOMing in
this case. Reduce the window size temporarily? Trade memory with time,
by not keeping the objects in memory?
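
(For what it's worth, the knobs I'm aware of that could bound this,
though I haven't verified how much they help here, are along these
lines:

  $ git -c pack.threads=4 -c pack.windowMemory=1g -c pack.deltaCacheSize=256m \
      repack -a -d -f

but that's the user working around the problem rather than git doing
something smarter on its own.)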

I'm puzzled by the fact writing objects is so much faster with 1 thread.

It's worth noting that the AWS instances don't have swap by default,
which is actually good in this case, because if it had started to swap,
it would have taken forever.

1. https://github.com/mozilla/gecko
2. 
https://docs.google.com/spreadsheets/d/1IE8E3BhKurXsXgwBYFXs4mRBT_512v--ip6Vhxc3o-Y/edit?usp=sharing
3. https://public-inbox.org/git/20180703223823.qedmoy2imp4dc...@glandium.org/T/

Any thoughts?

Mike


Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
On Thu, Jul 04, 2019 at 12:04:11PM +, Eric Wong wrote:
> Mike Hommey  wrote:
> > I'm puzzled by the fact writing objects is so much faster with 1 thread.
> 
> I/O contention in the multi-threaded cases?
> 
> "public-inbox-index" (reading from git, writing to Xapian+SQLite)
> on a dev machine got slow because core count exceeded what SATA
> could handle and had to cap the default Xapian shard count to 3
> by default for v2 inboxes.
 
AFAICT, git doesn't write from multiple threads.
 
> As for memory use, does git use the producer/consumer pattern
> for malloc+free so free happens in a different thread from the
> malloc?
> 
> Because current versions of glibc gets pathological with that case
> (but I have a long-term fix for it):
> 
> https://public-inbox.org/libc-alpha/20180731084936.g4yw6wnvt677miti@dcvr/T/
> Only locklessinc malloc seemed OK in the IPC department
> (But I haven't kept up with jemalloc, tcmalloc developments in years)

Oh right, I forgot to mention:
- I thought this memory usage thing was [1] but it turns out it was real
  memory usage.
- glibc's mallinfo stores values as int, so it's useless to know how
  much memory was allocated when it's more than 4GB.
- glibc's malloc_stats relies on the same int data, so while it does
  print "in use" data, it can't print values above 4GB correctly.
- glibc has a malloc_info function that, according to its manual page
  "addresses the deficiencies in malloc_stats and mallinfo", but while
  it outputs a large XML dump, it doesn't contain anything that looks
  remotely like the "in use" from malloc_stats.
- So all in all, I used jemalloc to gather the "allocated" stats.
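
(For anyone wanting to gather the same kind of numbers, one way to get
jemalloc's "allocated" statistics is along these lines, with the library
path depending on the system:

  $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
      MALLOC_CONF=stats_print:true git repack -a -d -f

which makes jemalloc dump its statistics when each process exits.)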

Mike

1. https://sourceware.org/bugzilla/show_bug.cgi?id=23416


Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
On Thu, Jul 04, 2019 at 07:05:30PM +0900, Mike Hommey wrote:
> Hi,
> 
> I was looking at the disk size of the gecko repository on github[1],
> which started at 4.7GB, and `git gc --aggressive`'d it, which made that
> into 2.0G. But to achieve that required quite some resources.
> 
> My first attempt failed with OOM, on an AWS instance with 16 cores and
> 32GB RAM. I then went to another AWS instance, with 36 cores and 96GB
> RAM. And that went through after a while... with a peak memory usage
> above 60GB!
> 
> Since then, Peff kindly repacked the repo on the github end, so it
> doesn't really need repacking locally anymore, but I can still reproduce
> the > 60GB memory usage with the packed repository.
> 
> I gathered some data[2], all on the same 36 cores, 96GB RAM instance, with
> 36, 16 and 1 threads, and here's what can be observed:
> 
> With 36 threads, the overall process takes 45 minutes:
> - 50 seconds enumerating and counting objects.
> - ~22 minutes compressing objects
> - ~22 minutes writing objects
> 
> Of the 22 minutes compressing objects, more than 15 minutes are spent on
> the last percent of objects, and only during that part the memory usage
> balloons above 20GB.
> 
> Memory usage goes back to 2.4G after finishing to compress.
> 
> With 16 threads, the overall process takes about the same time as above,
> with about the same repartition.
> 
> But less time is spent on compressing the last percent of objects, and
> memory usage goes above 20GB later than with 36 threads.
> 
> Finally, with 1 thread, the picture changes greatly. The overall process
> takes 2.5h:
> - 50 seconds enumerating and counting objects.
> - ~2.5h compressing objects.
> - 3 minutes and 25 seconds writing objects!
> 
> Memory usage stays reasonable, except at some point after 47 minutes,
> where it starts to increase up to 12.7GB, and then goes back down about
> half an hour later, all while stalling around the 13% progress mark.
> 
> My guess is all those stalls are happening when processing the files I
> already had problems with in the past[3], except there are more of them
> now (thankfully, they were removed, so there won't be more, but that
> doesn't make the existing ones go away).
> 
> I never ended up working on trying to make that diff faster, maybe that
> would help a little here, but that would probably not help much wrt the
> memory usage. I wonder what git could reasonably do to avoid OOMing in
> this case. Reduce the window size temporarily? Trade memory with time,
> by not keeping the objects in memory?
> 
> I'm puzzled by the fact writing objects is so much faster with 1 thread.

Here's a perf report from the portion of "Writing" that is particularly
slow with compression having happened on 36 threads:
  100.00%  0.00%  git  [unknown]     [k] 0x
   99.97%  0.02%  git  git           [.] write_one
   99.97%  0.00%  git  git           [.] write_pack_file
   99.97%  0.00%  git  git           [.] cmd_pack_objects
   99.96%  0.00%  git  git           [.] write_object (inlined)
   99.96%  0.00%  git  git           [.] write_reuse_object (inlined)
   99.92%  0.00%  git  git           [.] write_no_reuse_object
   98.12%  0.00%  git  git           [.] get_delta (inlined)
   72.36%  0.00%  git  git           [.] diff_delta (inlined)
   64.86% 64.20%  git  git           [.] create_delta_index
   26.32%  0.00%  git  git           [.] repo_read_object_file (inlined)
   26.32%  0.00%  git  git           [.] read_object_file_extended
   26.32%  0.00%  git  git           [.] read_object
   26.32%  0.00%  git  git           [.] oid_object_info_extended
   26.25%  0.00%  git  git           [.] packed_object_info
   26.24%  0.00%  git  git           [.] cache_or_unpack_entry (inlined)
   24.30%  0.01%  git  git           [.] unpack_entry
   17.62%  0.00%  git  git           [.] memcpy (inlined)
   17.52% 17.46%  git  libc-2.27.so  [.] __memmove_avx_unaligned_erms
   15.98%  0.22%  git  git           [.] patch_delta
    7.60%  0.00%  git  git           [.] unpack_compressed_entry
    7.49%  7.42%  git

Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
On Thu, Jul 04, 2019 at 07:05:30PM +0900, Mike Hommey wrote:
> My guess is all those stalls are happening when processing the files I
> already had problems with in the past[3], except there are more of them
> now (thankfully, they were removed, so there won't be more, but that
> doesn't make the existing ones go away).
>
> 3. 
> https://public-inbox.org/git/20180703223823.qedmoy2imp4dc...@glandium.org/T/

I've more or less confirmed that's the cause of the long stalls during
the compression phase. It can be reproduced to some extent with:

$ git clone https://github.com/mozilla/gecko
$ cd gecko
$ git rev-list --all -- testing/web-platform/meta/MANIFEST.json |
    xargs -I '{}' git ls-tree '{}' testing/web-platform/meta/MANIFEST.json |
    sort -u | awk '{print $3}' |
    git cat-file --batch-check | sort -n -k 3 | awk '{print $1}' |
    git pack-objects --window=250 --depth=50 --no-reuse-delta my_pack

There might be some other file than testing/web-platform/meta/MANIFEST.json 
involved, because this uses "only" 40GB RAM, but that file is the one I know
of.

This however doesn't reproduce the "writing takes forever" part of the
problem.

Mike


Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
On Fri, Jul 05, 2019 at 01:09:55AM -0400, Jeff King wrote:
> On Thu, Jul 04, 2019 at 07:05:30PM +0900, Mike Hommey wrote:
> > Finally, with 1 thread, the picture changes greatly. The overall process
> > takes 2.5h:
> > - 50 seconds enumerating and counting objects.
> > - ~2.5h compressing objects.
> > - 3 minutes and 25 seconds writing objects!
> 
> That's weird. I'd expect us to find similar amounts of deltas, but we
> don't have the writing slow-down. I wonder if there is some bad
> interaction between the multi-threaded code and the delta cache.
> 
> Did you run the second, single-thread run against the exact same
> original repository you had? Or did you re-run it on the result of the
> multi-thread run? Another explanation is that the original repository
> had some poor patterns that made objects expensive to access (say, a ton
> of really deep delta chains). And so the difference between the two runs
> was not the threads, but just the on-disk repository state.
> 
> Kind of a long shot, but if that is what happened, try running another
> multi-threaded "repack -f" and see if its writing phase is faster.

I've run 36-threads, 16-threads and 1-thread in sequence on the same
repo, so 16-threads was repacking what was repacked by the 36-threads,
and 1-thread was repacking what was repacked by the 16-threads. I
assumed it didn't matter, but come to think of it, I guess it can.

Mike


Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-04 Thread Mike Hommey
On Fri, Jul 05, 2019 at 01:14:13AM -0400, Jeff King wrote:
> On Thu, Jul 04, 2019 at 10:13:20PM +0900, Mike Hommey wrote:
> 
> > > "public-inbox-index" (reading from git, writing to Xapian+SQLite)
> > > on a dev machine got slow because core count exceeded what SATA
> > > could handle and had to cap the default Xapian shard count to 3
> > > by default for v2 inboxes.
> >  
> > AFAICT, git doesn't write from multiple threads.
> 
> Right. That's always single threaded, and the main difference there is
> going to be what's in the delta base cache.
> 
> > Oh right, I forgot to mention:
> > - I thought this memory usage thing was [1] but it turns out it was real
> >   memory usage.
> > - glibc's mallinfo stores values as int, so it's useless to know how
> >   much memory was allocated when it's more than 4GB.
> > - glibc's malloc_stats relies on the same int data, so while it does
> >   print "in use" data, it can't print values above 4GB correctly.
> > - glibc has a malloc_stats function that, according to its manual page
> >   "addresses the deficiencies in malloc_stats and mallinfo", but while
> >   it outputs a large XML dump, it doesn't contain anything that looks
> >   remotely like the "in use" from malloc_stats.
> > - So all in all, I used jemalloc to gather the "allocated" stats.
> 
> I think I explained all of the memory-usage questions in my earlier
> response, but just for reference: if you have access to it, valgrind's
> "massif" tool is really good for this kind of profiling. Something like:
> 
>   valgrind --tool=massif git pack-objects ...
>   ms_print massif.out.*
> 
> which shows heap usage at various times, points out the snapshot with
> peak usage, and shows a backtrace of the main culprits at a few
> snapshots.

At the expense of time ;) A run would likely last an entire day under
massif (by which I mean a full 24 hours, not a 9-5 day).

Mike


Re: Surprising use of memory and time when repacking mozilla's gecko repository

2019-07-05 Thread Mike Hommey
On Fri, Jul 05, 2019 at 02:45:16PM +0900, Mike Hommey wrote:
> On Fri, Jul 05, 2019 at 01:09:55AM -0400, Jeff King wrote:
> > On Thu, Jul 04, 2019 at 07:05:30PM +0900, Mike Hommey wrote:
> > > Finally, with 1 thread, the picture changes greatly. The overall process
> > > takes 2.5h:
> > > - 50 seconds enumerating and counting objects.
> > > - ~2.5h compressing objects.
> > > - 3 minutes and 25 seconds writing objects!
> > 
> > That's weird. I'd expect us to find similar amounts of deltas, but we
> > don't have the writing slow-down. I wonder if there is some bad
> > interaction between the multi-threaded code and the delta cache.
> > 
> > Did you run the second, single-thread run against the exact same
> > original repository you had? Or did you re-run it on the result of the
> > multi-thread run? Another explanation is that the original repository
> > had some poor patterns that made objects expensive to access (say, a ton
> > of really deep delta chains). And so the difference between the two runs
> > was not the threads, but just the on-disk repository state.
> > 
> > Kind of a long shot, but if that is what happened, try running another
> > multi-threaded "repack -f" and see if its writing phase is faster.
> 
> I've run 36-threads, 16-threads and 1-thread in sequence on the same
> repo, so 16-threads was repacking what was repacked by the 36-threads,
> and 1-thread was repacking what was repacked by the 16-threads. I
> assumed it didn't matter, but come to think of it, I guess it can.

I tried:
- fresh clone -> 36-threads
- fresh clone -> 1-thread -> 36-threads

The 36-threads gc in the latter was only marginally faster than in the
former (between 19 and 20 minutes instead of 22 for both "Compressing"
and "Writing").

Mike


Re: cannot clone --single-commit instead of --single-branch

2019-08-01 Thread Mike Hommey
On Thu, Aug 01, 2019 at 07:43:22PM -0700, Jonathan Nieder wrote:
> Hi,
> 
> Bryan Turner wrote:
> 
> > Promisor remotes and other in-flight changes might help provide some
> > of what you're looking for, but I'm not aware of any already-available
> > solution.
> 
> You can do
> 
>   git clone --no-checkout --filter=blob:none $url repo

TIL. Why doesn't --filter appear in the git-clone manual page?

Mike


fast-import slowness when importing large files with small differences

2018-06-29 Thread Mike Hommey
Hi,

I noticed some slowness when fast-importing data from the Firefox mercurial
repository, where fast-import spends more than 5 minutes importing ~2000
revisions of one particular file. I reduced a testcase while still
using real data. One could synthesize data with kind of the same
properties, but I figured real data could be useful.

To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000

(--depth=2000 to minimize the pack size)

The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null

real0m14.564s
user0m9.813s
sys 0m4.703s

It generates about 26GB of data from that 4.2MB data.gz.

$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
-----------------------------------------------------------------------
Alloc'd objects:       5000
Total objects:         1868 (       133 duplicates                  )
      blobs  :         1868 (       133 duplicates       1867 deltas of       1868 attempts)
      trees  :            0 (         0 duplicates          0 deltas of          0 attempts)
      commits:            0 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:           0 (         0 loads     )
      marks:           1024 (         0 unique    )
      atoms:              0
Memory total:          2282 KiB
       pools:          2048 KiB
     objects:           234 KiB
-----------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 35184372088832
pack_report: pack_used_ctr            =          0
pack_report: pack_mmap_calls          =          0
pack_report: pack_open_windows        =          0 /          0
pack_report: pack_mapped              =          0 /          0
-----------------------------------------------------------------------

321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps

(The resulting pack is 5.3MB, fwiw)

Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:

63.52%  git-fast-import  git-fast-import [.] create_delta_index
17.46%  git-fast-import  git-fast-import [.] sha1_compression_states
 9.89%  git-fast-import  git-fast-import [.] ubc_check
 6.23%  git-fast-import  git-fast-import [.] create_delta
 2.49%  git-fast-import  git-fast-import [.] sha1_process

That's a whole lot of time spent on create_delta_index.

FWIW, if the delta computation were 100% free (yes, I tested that), the
fast-import would take 1:40 with the following profile:

58.74%  git-fast-import  git-fast-import [.] sha1_compression_states
32.45%  git-fast-import  git-fast-import [.] ubc_check
 8.25%  git-fast-import  git-fast-import [.] sha1_process

I toyed with the idea of eliminating common head and tail before
creating the delta, and got some promising result: a fast-import taking
3:22 instead of 5:50, with the following profile:

34.67%  git-fast-import  git-fast-import [.] create_delta_index
30.88%  git-fast-import  git-fast-import [.] sha1_compression_states
17.15%  git-fast-import  git-fast-import [.] ubc_check
 7.25%  git-fast-import  git-fast-import [.] store_object
 4.47%  git-fast-import  git-fast-import [.] sha1_process
 2.72%  git-fast-import  git-fast-import [.] create_delta2

The resulting pack is however much larger (for some reason, many objects
are left non-deltaed), and the deltas are partly broken (they don't
apply cleanly), but that just means the code is not ready to be sent. I
don't expect working code would be much slower than this. The remaining
question is whether this is beneficial for more normal cases.

I also seem to remember from when I tested a while ago that somehow xdiff
handles those files faster than diff-delta, and I'm wondering if it
would make sense to make the pack code use xdiff. So I tested
replacing diff_delta with a call to xdi_diff_outf with a callback that
does nothing and zeroed out xpparam_t and xdemitconf_t (not sure that's
best, though, I haven't looked very deeply), and that finished in 5:15
with the following profile (without common head trimming,
xdiff-interface apparently does common tail trimming):

32.99%  git-fast-import  git-fast-import [.] xdl_prepare_ctx.isra.0
20.42%  git-fast-import  git-fast-import [.] sha1_compression_states
15.26%  git-fast-import  git-fast-import [.] xdl_hash_record
11.65%  git-fast-import  git-fast-import [.] ubc_check
 3.09%  git-fast-import  git-

Re: fast-import slowness when importing large files with small differences

2018-06-29 Thread Mike Hommey
On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jun 29 2018, Mike Hommey wrote:
> 
> > I noticed some slowness when fast-importing data from the Firefox mercurial
> > repository, where fast-import spends more than 5 minutes importing ~2000
> > revisions of one particular file. I reduced a testcase while still
> > using real data. One could synthesize data with kind of the same
> > properties, but I figured real data could be useful.
> >
> > To reproduce:
> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> > $ git init bar
> > $ cd bar
> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >
> > [...]
> > So maybe it would make sense to consolidate the diff code (after all,
> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> > of common head and tail, this gets down to 3:33.
> >
> > I'll also note that Facebook has imported xdiff from the git code base
> > into mercurial and improved performance on it, so it might also be worth
> > looking at what's worth taking from there.
> 
> It would be interesting to see how does this compares with a more naïve
> approach of committing every version of this file one-at-a-time into a
> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> suboptimal compared to just writing out a lot of redundant data and
> repacking it all at once later.

"Just" writing 26GB? And that's only one file. If I were to do that for
the whole repository, it would yield a > 100GB pack. Instead of < 2GB
currently.

Mike


[PATCH] fast-import: Don't count delta attempts against an empty buffer

2018-06-30 Thread Mike Hommey
When the reference buffer is empty, diff_delta returns NULL without
really attempting anything, yet fast-import counts that as a delta
attempt.

Signed-off-by: Mike Hommey 
---
 fast-import.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fast-import.c b/fast-import.c
index 4d55910ab9..12195d54d7 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1076,7 +1076,7 @@ static int store_object(
return 1;
}
 
-   if (last && last->data.buf && last->depth < max_depth
+   if (last && last->data.len && last->data.buf && last->depth < max_depth
&& dat->len > the_hash_algo->rawsz) {
 
delta_count_attempts_by_type[type]++;
-- 
2.18.0



Checks added for CVE-2018-11235 too restrictive?

2018-07-03 Thread Mike Hommey
Hi,

(Background) I'm the author of a git remote-helper that can talk
directly to mercurial servers, known as git-cinnabar. One of the design
decisions was to use git objects to store all the metadata necessary to
reconstruct mercurial changesets, manifests and files, and one special
thing that's done is that git trees are (ab)used to store mercurial
sha1s in commit references. This design decision was made so that the
metadata could possibly be exchanged, allowing clones of large
repositories to be jumpstarted faster than by converting from scratch
from mercurial.

I had a first shot at that a few months ago, but the format of the
metadata branch made it impossible to push to github, hitting its push
size limit. With some pre-processing work, I was able to push the data
to github, with the intent to come back with a new metadata branch
format that would make the push directly possible.

Fast forward to this week, where I was trying to upload a new metadata
branch, and failed in a rather interesting way: multiple lines looking
like:

remote: error: object 81eae74b046d284c47e788143bbbcc681cb53418: 
gitmodulesMissing: unable to read .gitmodules blob

which, apart from the fact that they have some formatting issue, appear
to be new from the fixes for CVE-2018-11235.

I can see what those fixes are trying to prevent, but they seem to be
overly restrictive, at least in the context of transfer.fsckObjects.

The core problem is that the mercurial repository has some .gitmodules
files in some subdirectories. As a consequence, the git objects storing
the info for those mercurial files contain trees with .gitmodules files
with a commit ref, and that's what the remote is complaining about.

(Surprisingly, it doesn't hit the "non-blob found at .gitmodules" case
first, which is weird).

A small testcase to reproduce looks like this:

$ git init bar; cd bar
$ git fast-import <<EOF
commit refs/heads/bar
committer Bar <bar> 0 +0000
data 0
M 160000 81eae74b046d284c47e788143bbbcc681cb53418 bar/.gitmodules
EOF
$ git init ../qux
$ git -C ../qux config receive.fsckObjects true
$ git push -q ../qux bar
remote: error: object 81eae74b046d284c47e788143bbbcc681cb53418: 
gitmodulesMissing: unable to read .gitmodules blob
remote: fatal: fsck error in pack objects
error: remote unpack failed: unpack-objects abnormal exit
To ../qux
 ! [remote rejected] bar -> bar (unpacker error)
error: failed to push some refs to '../qux'

Would it be reasonable to make the transfer.fsckObjects checks ignore
non-blob .gitmodules?

Cheers,

Mike


Re: [PATCH] fast-import: Don't count delta attempts against an empty buffer

2018-07-03 Thread Mike Hommey
On Tue, Jul 03, 2018 at 11:41:42AM -0700, Junio C Hamano wrote:
> Mike Hommey  writes:
> 
> > When the reference buffer is empty, diff_delta returns NULL without
> > really attempting anything, yet fast-import counts that as a delta
> > attempt.
> 
> But that is an attempt nevertheless, no?  Attempted and failed to
> find anything useful, that is.  What problem are you trying to solve
> and what issue are you trying to address, exactly?
> 
> ... goes and reads the patch ...
> 
> > Signed-off-by: Mike Hommey 
> > ---
> >  fast-import.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fast-import.c b/fast-import.c
> > index 4d55910ab9..12195d54d7 100644
> > --- a/fast-import.c
> > +++ b/fast-import.c
> > @@ -1076,7 +1076,7 @@ static int store_object(
> > return 1;
> > }
> >  
> > -   if (last && last->data.buf && last->depth < max_depth
> > +   if (last && last->data.len && last->data.buf && last->depth < max_depth
> > && dat->len > the_hash_algo->rawsz) {
> >  
> > delta_count_attempts_by_type[type]++;
> 
> This is a misleading patch as the title and the proposed log message
> both talk about "attempts are counted but we didn't actually do
> anything", implying that the only problem is that the counter is not
> aligned with reality.  The fact that the post-context we see here
> only shows the "counting" part does not help us, either.
> 
> But the next line in the post-context is actually code that does
> something important, which is ...
> 
>   delta = diff_delta(last->data.buf, last->data.len,
>   dat->buf, dat->len,
>   &deltalen, dat->len - the_hash_algo->rawsz);
>   } else
>   delta = NULL;
> 
> 
> ... and makes the reader realize that the change itself is much
> better than the patch with 3-line context, the title, and the
> proposed log message advertises it as ;-)
> 
> How about selling it this way instead?
> 
>   fast-import: do not call diff_delta() with empty buffer
> 
>   We know diff_delta() returns NULL, saying "no good delta
>   exists for it", when fed an empty data.  Check the length of
>   the data in the caller to avoid such a call.  
> 
>   This incidentally reduces the number of attempted deltification
>   we see in the final statistics.
> 
> or something like that?

Fair enough. Do you want me to send a v2 with this description?

Mike


Re: Checks added for CVE-2018-11235 too restrictive?

2018-07-03 Thread Mike Hommey
On Tue, Jul 03, 2018 at 10:15:19AM -0400, Jeff King wrote:
> On Tue, Jul 03, 2018 at 04:06:50PM +0900, Mike Hommey wrote:
> 
> > I had a first shot at that a few months ago, but the format of the
> > metadata branch made it impossible to push to github, hitting its push
> > size limit. With some pre-processing work, I was able to push the data
> > to github, with the intent to come back with a new metadata branch
> > format that would make the push directly possible.
> 
> This is sort of tangential to your email here, but I'm curious which
> limit you hit. Too-big blob, tree, or commit?

Too-big pack, iirc.

> > Fast forward to this week, where I was trying to upload a new metadata
> > branch, and failed in a rather interesting way: multiple lines looking
> > like:
> > 
> > remote: error: object 81eae74b046d284c47e788143bbbcc681cb53418: 
> > gitmodulesMissing: unable to read .gitmodules blob
> > 
> > which, apart from the fact that they have some formatting issue, appear
> > to be new from the fixes for CVE-2018-11235.
> 
> Also tangential to your main point (I promise I'll get to it ;) ), but
> what formatting issue do you mean? Are you referring to
> "gitmodulesMissing"? That's a machine-readable tag which means you can
> set "fsck.gitmodulesMissing" to "ignore".

Oh, I was reading it as something finishing with "gitmodules", and
another message starting with "Missing". The fact the error is so
long and on one line made me think that, I guess.
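
For reference, ignoring that check on the receiving side (assuming one
controls the remote) would be something along these lines:

  $ git config receive.fsck.gitmodulesMissing ignore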

> > I can see what those fixes are trying to prevent, but they seem to be
> > overly restrictive, at least in the context of transfer.fsckObjects.
> > 
> > The core problem is that the mercurial repository has some .gitmodules
> > files in some subdirectories. As a consequence, the git objects storing
> > the info for those mercurial files contain trees with .gitmodules files
> > with a commit ref, and that's what the remote is complaining about.
> > 
> > (Surpringly, it doesn't hit the "non-blob found at .gitmodules" case
> > first, which is weird).
> 
> The reason it doesn't hit the "non-blob" case is that we are trying to
> analyze the object itself. So we don't pay any attention to the mode in
> the containing tree, but instead literally look for 81eae74b0. If it
> were a non-blob we'd complain then, but in fact we don't have it at all
> (which is otherwise OK because it's a gitlink).
> 
> > A small testcase to reproduce looks like this:
> > 
> > $ git init bar; cd bar
> > $ git fast-import <<EOF
> > commit refs/heads/bar
> > committer Bar <bar> 0 +0000
> > data 0
> > M 160000 81eae74b046d284c47e788143bbbcc681cb53418 bar/.gitmodules
> > EOF
> >
> > [...]
> > 
> > Would it be reasonable to make the transfer.fsckObject checks ignore
> > non-blob .gitmodules?
> 
> I'm open to the idea that the new checks are too restrictive (both this
> and the gitmodulesParse error we're discussing in [1]). They definitely
> outlaw things that _used_ to be OK. And the security benefit is a little
> hand-wavy. They're not strictly needed to block the recent
> vulnerability[2].  However, they _could_ protect us from future problems
> (e.g., an overflow in the config-parsing code, which is not accessible
> to attackers outside of .gitmodules). So there is some value in being
> restrictive, but it's mostly hypothetical for now.
> 
> So I think we could simply loosen these cases without immediate effect.
> That said, I'm not sure that having a gitlink .gitmodules is sane in the
> first place. Attempting to "git submodule init" there is going to cause
> errors. Well, sort of -- your example actually includes it in a
> subdirectory of the repository, so technically we wouldn't use it for
> real submodules. That fudging (finding .gitmodules anywhere instead of
> just at the root) was a conscious decision to reduce the complexity and
> cost of the check.
> 
> It sounds like in this case you don't have existing history that does
> this with .gitmodules, but are rather looking into designing a new
> feature that stuffs data into git trees. I'd be interested to hear a bit
> more about that feature to see if there are other alternatives.

Yes, in fact, this history is not even meant to be checked out. It's
internal stuff, stuck in a ref in a non-traditional ref namespace (that
is, it's not under refs/heads, refs/tags, etc., it's 

Re: fast-import slowness when importing large files with small differences

2018-07-03 Thread Mike Hommey
On Tue, Jul 03, 2018 at 06:05:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jun 29 2018, Mike Hommey wrote:
> 
> > On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Fri, Jun 29 2018, Mike Hommey wrote:
> >>
> >> > I noticed some slowness when fast-importing data from the Firefox 
> >> > mercurial
> >> > repository, where fast-import spends more than 5 minutes importing ~2000
> >> > revisions of one particular file. I reduced a testcase while still
> >> > using real data. One could synthesize data with kind of the same
> >> > properties, but I figured real data could be useful.
> >> >
> >> > To reproduce:
> >> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git 
> >> > foo
> >> > $ git init bar
> >> > $ cd bar
> >> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >> >
> >> > [...]
> >> > So maybe it would make sense to consolidate the diff code (after all,
> >> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> >> > of common head and tail, this gets down to 3:33.
> >> >
> >> > I'll also note that Facebook has imported xdiff from the git code base
> >> > into mercurial and improved performance on it, so it might also be worth
> >> > looking at what's worth taking from there.
> >>
> >> It would be interesting to see how does this compares with a more naïve
> >> approach of committing every version of this file one-at-a-time into a
> >> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> >> suboptimal compared to just writing out a lot of redundant data and
> >> repacking it all at once later.
> >
> > "Just" writing 26GB? And that's only one file. If I were to do that for
> > the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> > currently.
> 
> To clarify on my terse response. I mean to try this on an isolated test
> case to see to what extent the problem you're describing is unique to
> fast-import, and to what extent it's encountered during "normal" git use
> when you commit all the revisions of that file in succession.
> 
> Perhaps the difference between the two would give some hint as to how to
> proceed, or not.

AIUI, git repack will end up creating delta indexes for every blob, so the
problem should exist there, but because it will be comparing "random"
blobs, it can't take the same kinds of shortcuts as fast-import could,
because fast-import only cares about diffing with the last imported
blob. So while fast-import can reduce the amount of work it does by not
creating an index for common heads and tails of the compared blobs, git
repack can't.

Mike


Re: Checks added for CVE-2018-11235 too restrictive?

2018-07-04 Thread Mike Hommey
On Wed, Jul 04, 2018 at 07:30:30AM +0900, Mike Hommey wrote:
> That being said, I'm not even sure this particular use case is worth a
> new feature. I'm not storing random stuff as gitlinks, I'm storing
> sha1s. Well, maybe a mode that makes the distinction between "git oid"
> and "external oid" might make things clearer for git itself, especially
> for fsck.

Actually, I'm also abusing the lower bits of the gitlink mode, to
differentiate between regular files, executable files, and symlinks.

Mike


Re: "Give me a break"... well, you gave me one

2019-03-06 Thread Mike Hommey
On Wed, Mar 06, 2019 at 03:14:11PM +0100, Johannes Schindelin wrote:
> Hi Stefan,
> 
> just wanted to express my gratitude for your idea to introduce the `break`
> command in `git rebase -i`'s todo list. I use it *all* the time now.

+1. Before that, I was using `x bash`, and ended up doing `git rebase
--continue` in that shell in many cases, which didn't end up so well
when I terminated said shell (just an error message, though, nothing
actually broke as a consequence). This feature is a blessing.
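
For anyone who hasn't tried it yet, the idea is simply to add a "break"
line in the todo list where you want the rebase to stop, e.g. (hashes
made up):

  $ git rebase -i HEAD~3
  # in the todo list:
  #   pick 1234abc first commit
  #   break
  #   pick 5678def second commit

and `git rebase --continue` resumes from there.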

Mike


[PATCH] fix pack protocol example client/server communication

2019-03-16 Thread Mike Hommey
The pkt-line formatted lines contained the wrong pkt-len.

Signed-off-by: Mike Hommey 
---
 Documentation/technical/pack-protocol.txt | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index 7a2375a55d..c73e72de0e 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -657,14 +657,14 @@ can be rejected.
 An example client/server communication might look like this:
 
 
-   S: 007c74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/local\0report-status delete-refs ofs-delta\n
+   S: 006274730d410fcb6603ace96f1dc55ea6196122532d refs/heads/local\0report-status delete-refs ofs-delta\n
    S: 003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n
    S: 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\n
-   S: 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/team\n
+   S: 003d74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/team\n
    S: 0000
 
-   C: 003e7d1665144a3a975c05f1f43902ddaf084e784dbe 74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/debug\n
-   C: 003e74730d410fcb6603ace96f1dc55ea6196122532d 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/master\n
+   C: 00677d1665144a3a975c05f1f43902ddaf084e784dbe 74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/debug\n
+   C: 006874730d410fcb6603ace96f1dc55ea6196122532d 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/master\n
    C: 0000
    C: [PACKDATA]
 
-- 
2.21.0



Re: git-fast-import yields huge packfile

2019-03-16 Thread Mike Hommey
On Sat, Mar 16, 2019 at 02:04:33PM -0700, Linus Torvalds wrote:
> On Sat, Mar 16, 2019 at 1:31 PM Richard Hipp  wrote:
> >
> > Maybe I'm doing something wrong with the fast-import stream that is
> > defeating Git's attempts at delta compression
> 
> fast-import doesn't do fancy delta compression becayse that would
> defeat the "fast" part of fast-import.

fast-import however does try to do delta compression of blobs against the
last blob that was imported, so if you put your blobs in an order where
they can be deltified, you can win without a git repack.

For one-shot conversions, you can just rely on git repack.
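
In other words, something along these lines (file name made up):

  $ git fast-import < dump.fi
  $ git repack -a -d -f

where the repack does the delta search that fast-import didn't.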

Mike


How to properly find git config in a libgit.a-using executable?

2019-03-20 Thread Mike Hommey
Hi,

In git-cinnabar (the remote-helper that can talk to mercurial servers),
I'm using a fast-import-derived helper to do a lot of the heavy lifting,
because $REASONS. Anyways, while built (mostly) with the git build system,
using libgit.a, etc. the helper doesn't live in the GIT_EXEC_PATH. That
leads me to a very subtle problem: it doesn't necessarily find the
system config that git uses.

Because the system config that git uses depends on how git was built,
the result might not be the right thing. For one, a Linux distro git
will likely have been built with prefix=/usr, which makes the system
config be /etc/gitconfig, but someone building their own git will have
a system config in etc/gitconfig relative to their git.

The latter is more of a nitpick, because practically speaking, it
doesn't matter much. Which is why I've been building with prefix=/usr
(at least for the helper binaries that I ship pre-built; locally built
helpers actually don't get this treatment; but that's also not much of
a practical problem because it seems Linux distros don't ship a
/etc/gitconfig anyways (at least Debian doesn't)).

Anyways, the real problem comes on Windows, because git-for-windows does
come with a system config that does make important tweaks, like setting
http.sslcainfo to a path that actually exists. And without reading that
system config, the helper doesn't find the cainfo file and fails to
connect to HTTPS mercurial servers.

Now, my question here is, what would you suggest I do to make my helper
find the right config?

I thought of a few options (it's worth noting the helper is invoked in a
way that has $GIT_EXEC_PATH set, which can help a little):
- spawn `$GIT_EXEC_PATH/git-config -l -z`, parse its output, and set the
  internal config from that. That's the barbarian option.
- build the helper with RUNTIME_PREFIX, and modify the RUNTIME_PREFIX
  code to use $GIT_EXEC_PATH if it's set, rather than the path the
  executable is in. That actually sounds reasonable enough that I'd send
  a patch for git itself. But that doesn't quite address the nitpick case
  where ETC_GITCONFIG could be either `/etc/gitconfig` or
  `etc/gitconfig` depending how git was compiled, and there's no way to
  know which is the right one.

WDYT?

Mike


Re: How to properly find git config in a libgit.a-using executable?

2019-03-22 Thread Mike Hommey
On Fri, Mar 22, 2019 at 02:39:43PM +0100, Johannes Schindelin wrote:
> Hi Peff & Mike,
> 
> On Fri, 22 Mar 2019, Jeff King wrote:
> 
> > On Wed, Mar 20, 2019 at 07:19:41PM +0900, Mike Hommey wrote:
> >
> > > I thought of a few options (it's worth noting the helper is invoked in a
> > > way that makes $GIT_EXEC_PATH set, which can help a little):
> > > - spawn `$GIT_EXEC_PATH/git-config -l -z`, parse its output, and set the
> > >   internal config from that. That's the barbarian option.
> > > - build the helper with RUNTIME_PREFIX, and modify the RUNTIME_PREFIX
> > >   code to use $GIT_EXEC_PATH if it's set, rather than the path the
> > >   executable is in. That actually sounds reasonable enough that I'd send
> > >   a patch for git itself. But that doesn't quite address the nitpick case
> > >   where ETC_GITCONFIG could be either `/etc/gitconfig` or
> > >   `etc/gitconfig` depending how git was compiled, and there's no way to
> > >   know which is the right one.
> >
> > I'm not entirely sure I understand the problem, but it sounds like you
> > want to know the baked-in ETC_GITCONFIG for a built version of git (that
> > isn't necessarily the one that shares your build of libgit.a).
> >
> > There's no direct way to have Git print that out. It would be reasonable
> > to add one to rev-parse, I think.
> >
> > Barring that, here's a hack:
> >
> >   git config --system --show-origin --list -z |
> >   perl -lne '/^file:(.*?)\0/ and print $1 and exit 0'
> >
> > If the file is empty, it won't print anything, of course. But then,
> > you'd know that it also has no config in it. :)
> 
> How about
> 
>   GIT_EDITOR=echo git config --system -e 2>/dev/null
> 
> It will error out if the directory does not exist, for some reason, e.g.
> when you installed Git in your home directory via `make install` from a
> fresh clone. So you'll have to cope with that contingency.

Thank you both, I can probably work with this, although I might have to
alter the git init sequence. I'm not sure my usecase needs git to cater
for it more generally, though. Who else uses libgit.a?
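
(Concretely, "work with this" would be something along the lines of:

  $ sysconfig=$(GIT_EDITOR=echo git config --system -e 2>/dev/null)
  $ test -n "$sysconfig" && git config -f "$sysconfig" -l -z

and feeding that output to the helper's config initialization.)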

Mike


Auto-gc in the background can take a long time to be put in the background

2019-03-25 Thread Mike Hommey
Hi,

Recently, I've noticed that whenever the auto-gc message about packing
in the background shows up, it still takes a while for git to return to
the shell.

I've finally looked at what it was stuck on, and it's 
`git reflog expire --all` taking more than 30s. I guess the question is
whether there's a reason this shouldn't run in the background? Another
is whether there's something that makes this slower than it should be.

Mike


[ANNOUNCE] git-cinnabar 0.5.0

2018-08-11 Thread Mike Hommey
Hi,

Git-cinnabar is a git remote helper to interact with mercurial
repositories. It allows to clone, pull and push from/to mercurial remote
repositories, using git.

Code on https://github.com/glandium/git-cinnabar
This release on
https://github.com/glandium/git-cinnabar/releases/tag/0.5.0

What's new since 0.4.0?

- git-cinnabar-helper is now mandatory. You can either download one with `git 
cinnabar download` on supported platforms or build one with `make`.
- Performance and memory consumption improvements.
- Metadata changes require to run `git cinnabar upgrade`.
- Mercurial tags are consolidated in a separate (fake) repository. See the 
README file.
- Updated git to 2.18.0 for the helper.
- Improved memory consumption and performance.
- Improved experimental support for pushing merges.
- Support for 
[clonebundles](https://www.mercurial-scm.org/wiki/ClonebundlesExtension) for 
faster clones when the server provides them.
- Removed support for the .git/hgrc file for mercurial specific configuration.
- Support any version of Git (was previously limited to 1.8.5 minimum)
- Git packs created by git-cinnabar are now smaller.
- Fixed incompatibilities with Mercurial 3.4 and >= 4.4.
- Fixed tag cache, which could lead to missing tags.
- The prebuilt helper for Linux now works across more distributions (as long as 
libcurl.so.4 is present, it should work)
- Properly support the `pack.packsizelimit` setting.
- Experimental support for initial clone from a git repository containing 
git-cinnabar metadata.
- Now can successfully clone the pypy and GNU octave mercurial repositories.
- More user-friendly errors.

Development process changes

It took about 6 months between version 0.3 and 0.4. It took more than 18 months 
to reach version 0.5 after that. That's a long time to wait for a new version, 
considering all the improvements that have happened under the hood.

From now on, the `release` branch will point to the last tagged release,
which is roughly the same as before, but won't be the default branch when
cloning anymore.

The default branch when cloning will now be `master`, which will receive 
changes that are acceptable for dot releases (0.5.x). These include:
- Changes in behavior that are backwards compatible (e.g. adding new options 
which default to the current behavior).
- Changes that improve error handling.
- Changes to existing experimental features, and additions of new experimental 
features (that require knobs to be enabled).
- Changes to Continuous Integration/Tests.
- Git version upgrades for the helper.

The `next` branch will receive changes for the next "major" release, which as 
of writing is planned to be 0.6.0. These include:
- Changes in behavior.
- Changes in metadata.
- Stabilizing experimental features.
- Remove backwards compatibility with older metadata (< 0.5.0).

Mike


[PATCH] fast-import: Reinitialize command_buf rather than detach it.

2019-08-24 Thread Mike Hommey
command_buf.buf is also stored in cmd_hist, so instead of
strbuf_release, the current code uses strbuf_detach in order to
"leak" the buffer as far as the strbuf is concerned.

However, strbuf_detach does more than "leak" the strbuf buffer: it
possibly reallocates it to ensure a terminating nul character. And when
that happens, what is already stored in cmd_hist may now be a free()d
buffer.

In practice, though, command_buf.buf is most of the time a nul
terminated string, meaning command_buf.len < command_buf.alloc, and
strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
allocate a 1 byte buffer to store a nul character in it, which is then
leaked.

Since the code using strbuf_detach is assuming it does nothing to
command_buf.buf, it's overall safer to use strbuf_init, which has
the same practical effect in the usual case, and works appropriately
when command_buf is empty.
---
 fast-import.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..b1d07efe8c 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,9 @@ static int read_next_command(void)
} else {
struct recent_command *rc;
 
-   strbuf_detach(&command_buf, NULL);
+   // command_buf is either empty or also stored in cmd_hist,
+   // reinitialize it.
+   strbuf_init(&command_buf, 0);
stdin_eof = strbuf_getline_lf(&command_buf, stdin);
if (stdin_eof)
return EOF;
@@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
char *term = xstrdup(data);
size_t term_len = command_buf.len - (data - command_buf.buf);
 
-   strbuf_detach(&command_buf, NULL);
+   // command_buf is either empty or also stored in cmd_hist,
+   // reinitialize it.
+   strbuf_init(&command_buf, 0);
for (;;) {
if (strbuf_getline_lf(&command_buf, stdin) == EOF)
die("EOF in data (terminator '%s' not found)", 
term);
-- 
2.23.0
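
For illustration, a minimal standalone sketch of the two behaviors described
above (this assumes git's strbuf API and building against strbuf.h; it is
not part of the patch):

  #include "git-compat-util.h"
  #include "strbuf.h"

  static void leaks_one_byte(void)
  {
          struct strbuf buf = STRBUF_INIT;  /* buf.buf == strbuf_slopbuf */
          /*
           * strbuf_detach() must hand back an allocated, nul-terminated
           * string, so on an empty strbuf it allocates a 1-byte buffer;
           * when the caller ignores the result, that byte is leaked.
           */
          strbuf_detach(&buf, NULL);
  }

  static void leaks_nothing(void)
  {
          struct strbuf buf = STRBUF_INIT;
          /* reinitializing allocates nothing and leaves the slopbuf alone */
          strbuf_init(&buf, 0);
  }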



[PATCH] notes: avoid leaking duplicate entries

2019-08-24 Thread Mike Hommey
When add_note is called multiple times with the same key/value pair, the
leaf_node it creates is leaked by note_tree_insert.
---
 notes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/notes.c b/notes.c
index 75c028b300..ec35f5b551 100644
--- a/notes.c
+++ b/notes.c
@@ -269,8 +269,10 @@ static int note_tree_insert(struct notes_tree *t, struct int_node *tree,
case PTR_TYPE_NOTE:
if (oideq(&l->key_oid, &entry->key_oid)) {
/* skip concatenation if l == entry */
-   if (oideq(&l->val_oid, &entry->val_oid))
+   if (oideq(&l->val_oid, &entry->val_oid)) {
+   free(entry);
return 0;
+   }
 
ret = combine_notes(&l->val_oid,
&entry->val_oid);
-- 
2.23.0
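
For context, the calling pattern that triggers the leak looks roughly like
this (a sketch, assuming an already initialized notes tree `t` and two
object ids; not taken from an actual caller):

  /* first insertion: a leaf_node is allocated and stored in the tree */
  add_note(t, &object_oid, &note_oid, combine_notes_overwrite);
  /*
   * same key/value pair again: add_note allocates another leaf_node,
   * note_tree_insert sees that both key and value are identical and
   * returns early -- before this patch, without freeing that second
   * leaf_node
   */
  add_note(t, &object_oid, &note_oid, combine_notes_overwrite);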



Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it.

2019-08-25 Thread Mike Hommey
On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:
> On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote:
> 
> > command_buf.buf is also stored in cmd_hist, so instead of
> > strbuf_release, the current code uses strbuf_detach in order to
> > "leak" the buffer as far as the strbuf is concerned.
> > 
> > However, strbuf_detach does more than "leak" the strbuf buffer: it
> > possibly reallocates it to ensure a terminating nul character. And when
> > that happens, what is already stored in cmd_hist may now be a free()d
> > buffer.
> > 
> > In practice, though, command_buf.buf is most of the time a nul
> > terminated string, meaning command_buf.len < command_buf.alloc, and
> > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> > allocate a 1 byte buffer to store a nul character in it, which is then
> > leaked.
> 
> I think this is stronger than just "most of the time". It's an invariant
> for strbufs to have a NUL, so the only case where detaching isn't a noop
> is the empty slopbuf case you mention.

If it's an invariant, why does detach actively try to realloc and set
the nul terminator, as if that could happen in more cases than just the
slopbuf one?

> Splitting hairs, perhaps, but I think with that explanation, we could
> probably argue that this case will never come up: strbuf_getline will
> either have allocated a buffer or will have returned EOF.

Note that the slopbuf case _does_ come up, and we always leak a 1 byte
buffer.

> That said, I do think it's quite confusing and is worth fixing, both for
> readability and for future-proofing. But...
(...)

I do agree the way fast-import works between cmd_hist and command_buf is
very brittle, as you've shown. I didn't feel like digging into it
though. Thanks for having gone further than I did.

Mike


Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership

2019-08-25 Thread Mike Hommey
On Sun, Aug 25, 2019 at 04:10:55AM -0400, Jeff King wrote:
> Fast-import's read_next_command() has somewhat odd memory ownership
> semantics for the command_buf strbuf. After reading a command, we copy
> the strbuf's pointer (without duplicating the string) into our cmd_hist
> array of recent commands. And then when we're about to read a new
> command, we clear the strbuf by calling strbuf_detach(), dropping
> ownership from the strbuf (leaving the cmd_hist reference as the
> remaining owner).
> 
> This has a few surprising implications:
> 
>   - if the strbuf hasn't been copied into cmd_hist (e.g., because we
> haven't ready any commands yet), then the strbuf_detach() will leak
> the resulting string
> 
>   - any modification to command_buf risks invalidating the pointer held
> by cmd_hist. There doesn't seem to be any way to trigger this
> currently (since we tend to modify it only by detaching and reading
> in a new value), but it's subtly dangerous.
> 
>   - any pointers into an input string will remain valid as long as
> cmd_hist points to them. So in general, you can point into
> command_buf.buf and call read_next_command() up to 100 times before
> your string is cycled out and freed, leaving you with a dangling
> pointer. This makes it easy to miss bugs during testing, as they
> might trigger only for a sufficiently large commit (e.g., the bug
> fixed in the previous commit).
> 
> Instead, let's make a new string to copy the command into the history
> array, rather than having dual ownership with the old. Then we can drop
> the strbuf_detach() calls entirely, and just reuse the same buffer
> within command_buf over and over. We'd normally have to strbuf_reset()
> it before using it again, but in both cases here we're using
> strbuf_getline(), which does it automatically for us.
> 
> This fixes the leak, and it means that even a single call to
> read_next_command() will invalidate any held pointers, making it easier
> to find bugs. In fact, we can drop the extra input lines added to the
> test case by the previous commit, as the unfixed bug would now trigger
> just from reading the commit message, even without any modified files in
> the commit.
> 
> Reported-by: Mike Hommey 
> Signed-off-by: Jeff King 
> ---
>  fast-import.c  | 4 +---
>  t/t9300-fast-import.sh | 5 -
>  2 files changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/fast-import.c b/fast-import.c
> index ee7258037a..1f9160b645 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,6 @@ static int read_next_command(void)
>   } else {
>   struct recent_command *rc;
>  
> - strbuf_detach(&command_buf, NULL);
>   stdin_eof = strbuf_getline_lf(&command_buf, stdin);
>   if (stdin_eof)
>   return EOF;
> @@ -1784,7 +1783,7 @@ static int read_next_command(void)
>   free(rc->buf);
>   }
>  
> - rc->buf = command_buf.buf;
> + rc->buf = xstrdup(command_buf.buf);

You could xstrndup(command_buf.buf, command_buf.len), which would avoid
a hidden strlen.

Mike


Re: [PATCH] notes: avoid leaking duplicate entries

2019-08-25 Thread Mike Hommey
On Sun, Aug 25, 2019 at 02:18:18PM +0900, Mike Hommey wrote:
> When add_note is called multiple times with the same key/value pair, the
> leaf_node it creates is leaked by notes_tree_insert.

For completeness, since I realized it was missing:
Signed-off-by: Mike Hommey 

Mike


[PATCH] commit: free the right buffer in release_commit_memory

2019-08-25 Thread Mike Hommey
The index field in the commit object is used to find the buffer
corresponding to that commit in the buffer_slab. Resetting it first
means free_commit_buffer is not going to free the right buffer.

Signed-off-by: Mike Hommey 
---
 commit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a98de16e3d..3fe5f8fa9c 100644
--- a/commit.c
+++ b/commit.c
@@ -364,8 +364,8 @@ struct object_id *get_commit_tree_oid(const struct commit *commit)
 void release_commit_memory(struct parsed_object_pool *pool, struct commit *c)
 {
set_commit_tree(c, NULL);
-   c->index = 0;
free_commit_buffer(pool, c);
+   c->index = 0;
free_commit_list(c->parents);
 
c->object.parsed = 0;
-- 
2.23.0
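
The reason the order matters, as a simplified sketch (the accessor name
below comes from commit-slab usage in commit.c and is approximate, not a
quote of the actual code):

  /*
   * Commit slabs are indexed by c->index, so the per-commit buffer is
   * found through it; once c->index has been cleared, the lookup lands
   * on slot 0 instead of this commit's slot.
   */
  struct commit_buffer *v = buffer_slab_peek(pool->buffer_slab, c);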



[PATCH] packfile: free packed_git memory when closing object store

2019-08-25 Thread Mike Hommey
Signed-off-by: Mike Hommey 
---
 packfile.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

Note, I'm not sure this is the right place to do it.

diff --git a/packfile.c b/packfile.c
index fc43a6c52c..b0cb84adda 100644
--- a/packfile.c
+++ b/packfile.c
@@ -339,13 +339,16 @@ void close_pack(struct packed_git *p)
 
 void close_object_store(struct raw_object_store *o)
 {
-   struct packed_git *p;
+   struct packed_git *p = o->packed_git;
 
-   for (p = o->packed_git; p; p = p->next)
+   while (p) {
+   struct packed_git *current = p;
if (p->do_not_close)
BUG("want to close pack marked 'do-not-close'");
-   else
-   close_pack(p);
+   close_pack(p);
+   p = p->next;
+   free(current);
+   }
 
if (o->multi_pack_index) {
close_midx(o->multi_pack_index);
-- 
2.23.0



[PATCH] http: don't leak urlmatch_config.vars

2019-08-26 Thread Mike Hommey
Signed-off-by: Mike Hommey 
---
 http.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/http.c b/http.c
index 27aa0a3192..9e33584f2d 100644
--- a/http.c
+++ b/http.c
@@ -1073,6 +1073,7 @@ void http_init(struct remote *remote, const char *url, int proactive_auth)
 
git_config(urlmatch_config_entry, &config);
free(normalized_url);
+   string_list_clear(&config.vars, 1);
 
 #if LIBCURL_VERSION_NUM >= 0x073800
if (http_ssl_backend) {
-- 
2.23.0



revision.c alters commit object state ; also no cleanup

2019-08-26 Thread Mike Hommey
Hi,

First, a little context: As you may have noticed, I've recently found a
small bunch of memory leaks in different parts of the code base. The
reason I did is that git-cinnabar[1] uses libgit.a, and I very recently
upgraded its CI to use a slightly more recent version of GCC, which
either enabled leak detection by default in ASAN or detects more leaks
than before. Anyway, my CI was busted because of leaks detected after
the upgrade, which made me look around.

git-cinnabar does make some specific uses of libgit.a that git itself
doesn't, and this is where things kind of fall apart:

First, revision.c doesn't come with a function to clear a struct
rev_info. I did create a function that does enough cleanup for my own
use (which turned out to be not enough), but I'm wondering if git
shouldn't itself have one in revision.c (and seeing that there's now
e.g. repo_clear, which didn't exist when I started git-cinnabar, I think
the answer would be that it should).
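
For reference, the kind of usage in question looks roughly like this (a
sketch assuming current libgit.a APIs; error handling mostly omitted):

  struct rev_info revs;
  const char *argv[] = { "rev-list", "HEAD", NULL };

  repo_init_revisions(the_repository, &revs, NULL);
  setup_revisions(2, argv, &revs, NULL);
  if (prepare_revision_walk(&revs))
          die("revision walk setup failed");
  while (get_revision(&revs))
          ; /* consume commits */
  /* ...and there is no revision.c counterpart to repo_clear() to call
   * here to release what the walk allocated... */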

Then, revision.c kind of does nasty things to commit objects:
- remove_duplicate_parents alters commit->parents to skip duplicates[2]
- try_to_simplify_commit removes all parents but one[3] or all of them
  [4] in some cases

That's for the case that I did encounter. Maybe there's more.

One problem this causes is that it leaks the commit_list items that used
to contain the removed parents.

Another problem is that if something else uses the commits afterwards,
those commits' parents don't match what you'd get before revision
walking. And I don't know how this should be addressed.

Ideas welcome.

Relatedly, the subthread that started at
 is kind of relevant. There
are two main reasons I'm using libgit.a: one is that I'm doing things
low-level enough that I don't know any other library that would work for
my use case. Another is that I was hoping that in the long term, it
could be merged into git.git. I guess the more time passes, the less
likely the latter is to happen. Which is fine, but it's good to know.

Mike


1. https://github.com/glandium/git-cinnabar/
2. 
https://github.com/git/git/blob/745f6812895b31c02b29bdfe4ae8e5498f776c26/revision.c#L2742
3. 
https://github.com/git/git/blob/745f6812895b31c02b29bdfe4ae8e5498f776c26/revision.c#L869
4. 
https://github.com/git/git/blob/745f6812895b31c02b29bdfe4ae8e5498f776c26/revision.c#L888


Re: [PATCH] packfile: free packed_git memory when closing object store

2019-08-26 Thread Mike Hommey
On Mon, Aug 26, 2019 at 11:06:29AM -0700, Junio C Hamano wrote:
> Mike Hommey  writes:
> 
> > Signed-off-by: Mike Hommey 
> > ---
> >  packfile.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > Note, I'm not sure this is the right place to do it.
> 
> I do not think this patch is complete, given that o->packed_git
> still has a non-NULL pointer.  IIRC, close_pack() was written for
> the explicit purpose of releasing resources while allowing us to
> reopen with open_packed_git() on it, so with the current
> arrangement, after releasing the resources held for this object
> store and doing something else, you should be able to come back to
> this object store and work in it again---this patch makes it harder
> if not impossible to do so.
> 
> I _think_ the patch is OK if you assigned NULL to o->packed_git,
> after making sure that the intention of all the callers of
> close_object_store() is to declare that this object store will not
> be accessed any longer during the lifetime of the process, and write
> it down as the contract between the callers and this function in a
> comment perhaps in packfile.h where the function is declared.

Maybe it would make more sense to do the complete cleanup in
raw_object_store_clear, then?

Relatedly, while looking around the other things that close_object_store
does, I saw that multi_pack_index is a sort of linked list... and
close_midx doesn't follow the links. Which raises the question whether
it should, or whether close_object_store should (considering it's
similar to packed_git in that regard, it would seem like
close_object_store should). It also raises the question what should be
free()ing multi_pack_index, because like packed_git, it's not free()d.

Mike


Re: revision.c alters commit object state ; also no cleanup

2019-08-27 Thread Mike Hommey
On Mon, Aug 26, 2019 at 11:14:13AM -0700, Junio C Hamano wrote:
> Mike Hommey  writes:
> 
> > First, revision.c doesn't come with a function to clear a struct
> > rev_info
> > Then, revision.c kind of does nasty things to commit objects...
> 
> Yeah, these two stem from the "run once and let exit() clean things
> up" design the oldest part of the system shares.  Regarding the
> latter, i.e. parent rewriting, I recall that there is a codepath
> that saves the original true parents away in a second field when we
> know we would want to show it (and avoid the overhead of copying
> when we do not have to), so you should be able to extend it to keep
> the original parent list.

I guess you're referring to save_parents/get_saved_parents? Would it make
sense for things to be reversed, as in, make revision.c keep the
simplified parents separately, and traverse through that?

Mike


error: cannot cherry-pick during a revert

2019-08-28 Thread Mike Hommey
Hi,

This just happened to me while cherry-pick'ing:

$ git cherry-pick HEAD@{1}
error: could not apply 614fe5e629b84... try
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
Recorded preimage for 'taskcluster/ci/build/linux.yml'

(... this is where I fix my conflict ...)

$ git add -u
$ git cherry-pick --continue 
error: cannot cherry-pick during a revert.
fatal: cherry-pick failed

So apparently, cherry-pick thinks it was doing a revert when it hit a
conflict?

(This is with git 2.23)

Mike


[PATCH] replace: stop replace lookup when reaching a self-reference

2019-08-29 Thread Mike Hommey
It is possible to end up in situations where a replace ref points to
itself. In that case, it would seem better to stop the lookup rather
than try to follow the link infinitely and fail with "replace depth too
high".

Signed-off-by: Mike Hommey 
---
 replace-object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

I'm not entirely sure whether it's actually worth fixing. Arguably, `git
replace` should also avoid the footgun in the first place (although in
my case, it wasn't due to `git replace` itself).

diff --git a/replace-object.c b/replace-object.c
index e295e87943..97e479fe2b 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -66,7 +66,7 @@ const struct object_id *do_lookup_replace_object(struct repository *r,
while (depth-- > 0) {
struct replace_object *repl_obj =
oidmap_get(r->objects->replace_map, cur);
-   if (!repl_obj)
+   if (!repl_obj || oideq(cur, &repl_obj->replacement))
return cur;
cur = &repl_obj->replacement;
}
-- 
2.23.0



Re: [DISCUSSION] Growing the Git community

2019-09-19 Thread Mike Hommey
On Thu, Sep 19, 2019 at 12:30:13PM -0400, Derrick Stolee wrote:
> During the Virtual Git Contributors' Summit, Dscho brought up the topic of
> "Inclusion & Diversity". We discussed ideas for how to make the community
> more welcoming to new contributors of all kinds. Let's discuss some of
> the ideas we talked about, and some that have been growing since.
> 
> Feel free to pick apart all of the claims I make below. This is based
> on my own experience and opinions. It should be a good baseline
> for us to all arrive with valuable action items.
> 
> I have CC'd some of the people who were part of that discussion. Sorry
> if I accidentally left someone out.
> 
> I. Goals and Perceived Problems
> 
> As a community, our number one goal is for Git to continue to be the best
> distributed version control system. At minimum, it should continue to be
> the most widely-used DVCS. Towards that goal, we need to make sure Git is
> the best solution for every kind of developer in every industry. The
> community cannot do this without including developers of all kinds. This
> means having a diverse community, for all senses of the word: Diverse in
> physical location, gender, professional status, age, and others.
> 
> In addition, the community must continue to grow, but members leave the
> community on a regular basis for multiple reasons. New contributors must
> join and mature within the community or the community will dwindle. Without
> dedicating effort and attention to this, natural forces may result in the
> community being represented only by contributors working at large tech
> companies focused on the engineering systems of very large groups.
> 
> It is worth noting that this community growth must never be at the cost
> of code quality. We must continue to hold all contributors to a high
> standard so Git stays a stable product.
> 
> Here are some problems that may exist within the Git community and may
> form a barrier to new contributors entering:
> 
> 1. Discovering how to contribute to Git is non-obvious.
> 
> 2. Submitting to a mailing list is a new experience for most developers.
>This includes the full review and discussion process.
> 
> 3. The high standards for patch quality are intimidating to new contributors.
> 
> 4. Some people do not feel comfortable engaging in a community without
>a clear Code of Conduct. This discomfort is significant and based on real
>experiences throughout society.
> 
> 5. Since Git development happens in a different place than where users
> acquire the end product, some are not aware that they can contribute.

6. Newcomers don't really have any idea /what/ they could contribute.
They either have to come with their own itch to scratch, or read the
code to figure out if there's something to fix.

Mike


Re: Bring together merge and rebase

2017-12-26 Thread Mike Hommey
On Fri, Dec 22, 2017 at 11:10:19PM -0700, Carl Baldwin wrote:
> The big contention among git users is whether to rebase or to merge
> changes [2][3] while iterating. I used to firmly believe that merging
> was the way to go and rebase was harmful. More recently, I have worked
> in some environments where I saw rebase used very effectively while
> iterating on changes and I relaxed my stance a lot. Now, I'm on the
> fence. I appreciate the strengths and weaknesses of both approaches. I
> waffle between the two depending on the situation, the tools being
> used, and I guess, to some extent, my mood.
> 
> I think what git needs is something brand new that brings the two
> together and has all of the advantages of both approaches. Let me
> explain what I've got in mind...
> 
> I've been calling this proposal `git replay` or `git replace` but I'd
> like to hear other suggestions for what to name it. It works like
> rebase except with one very important difference. Instead of orphaning
> the original commit, it keeps a pointer to it in the commit just like
> a `parent` entry but calls it `replaces` instead to distinguish it
> from regular history. In the resulting commit history, following
> `parent` pointers shows exactly the same history as if the commit had
> been rebased. Meanwhile, the history of iterating on the change itself
> is available by following `replaces` pointers. The new commit replaces
> the old one but keeps it around to record how the change evolved.
> 
> The git history now has two dimensions. The first shows a cleaned up
> history where fix ups and code review feedback have been rolled into
> the original changes and changes can possibly be ordered in a nice
> linear progression that is much easier to understand. The second
> drills into the history of a change. There is no loss and you don't
> change history in a way that will cause problems for others who have
> the older commits.
> 
> Replay handles collaboration between multiple authors on a single
> change. This is difficult and prone to accidental loss when using
> rebase and it results in a complex history when done with merge. With
> replay, collaborators could merge while collaborating on a single
> change and a record of each one's contributions can be preserved.
> Attempting this level of collaboration caused me many headaches when I
> worked with the gerrit workflow (which in many ways, I like a lot).
> 
> I blogged about this proposal earlier this year when I first thought
> of it [1]. I got busy and didn't think about it for a while. Now with
> a little time off of work, I've come back to revisit it. The blog
> entry has a few examples showing how it works and how the history will
> look in a few examples. Take a look.
> 
> Various git commands will have to learn how to handle this kind of
> history. For example, things like fetch, push, gc, and others that
> move history around and clean out orphaned history should treat
> anything reachable through `replaces` pointers as precious. Log and
> related history commands may need new switches to traverse the history
> differently in different situations. Bisect is a interesting one. I
> tend to think that bisect should prefer the regular commit history but
> have the ability to drill into the change history if necessary.
> 
> In my opinion, this proposal would bring together rebase and merge in
> a powerful way and could end the contention. Thanks for your
> consideration.

FWIW, your proposal has a lot in common with (but is not quite equivalent
to) mercurial's obsolescence markers and changeset evolution features.

Mike


Re: [PATCH] fetch-pack: always allow fetching of literal SHA1s

2017-05-09 Thread Mike Hommey
On Wed, May 10, 2017 at 12:33:44AM -0400, Jeff King wrote:
> On Tue, May 09, 2017 at 09:22:11PM -0700, Shawn Pearce wrote:
> 
> > > Hmm. That makes sense generally, as the request should succeed. But it
> > > seems like we're creating a client that will sometimes succeed and
> > > sometimes fail, and the reasoning will be somewhat opaque to the user.
> > > I have a feeling I'm missing some context on when you'd expect this to
> > > kick in.
> > 
> > Specifically, someone I know was looking at building an application
> > that is passed only a SHA-1 for the tip of a ref on a popular hosting
> > site[1]. They wanted to run `git fetch URL SHA1`, but this failed
> > because the site doesn't have upload.allowtipsha1inwant enabled.
> > However the SHA1 was clearly in the output of git ls-remote.
> 
> OK. So this is basically a case where we expect that the user knows what
> they're doing.
> 
> > For various reasons they expected this to work, because it works
> > against other sites that do have upload.allowtipsha1inwant enabled.
> > And I personally just expected it to work because the fetch client
> > accepts SHA-1s, and the wire protocol uses "want SHA1" not "want ref",
> > and the SHA-1 passed on the command line was currently in the
> > advertisement when the connection opened, so its certainly something
> > the server is currently willing to serve.
> 
> Right, makes sense.  I wondered if GitHub should be turning on
> allowTipSHA1InWant, but it really doesn't make sense to. We _do_ hide
> some internal refs[1], and they're things that users wouldn't want to
> fetch. The problem for your case really is just on the client side, and
> this patch fixes it.

More broadly, I think it is desirable that any commit that can be
reached from public refs can be fetched by an explicit sha1 without
allowTipSHA1InWant.

Mike


[ANNOUNCE] git-cinnabar 0.5.0b1

2017-06-03 Thread Mike Hommey
Hi,

Git-cinnabar is a git remote helper to interact with mercurial
repositories. It allows cloning, pulling and pushing from/to mercurial remote
repositories, using git.

Code on https://github.com/glandium/git-cinnabar
This release on
https://github.com/glandium/git-cinnabar/releases/tag/0.5.0b1

What's new since 0.4.0?

- git-cinnabar-helper is now mandatory. You can either download one with
  `git cinnabar download` on supported platforms or build one with `make
  helper`.
- Metadata changes require running `git cinnabar fsck`.
- Mercurial tags are consolidated in a separate (fake) repository. See
  the README file.
- Updated git to 2.13.0 for git-cinnabar-helper.
- Improved memory consumption and performance.
- Improved experimental support for pushing merges.
- Experimental support for clonebundles
  (https://www.mercurial-scm.org/wiki/ClonebundlesExtension).
- Removed support for the .git/hgrc file for mercurial specific
  configuration.
- Support any version of Git (was previously limited to 1.8.5 minimum)

Mike


Re: Feature request: Please add support to stash specific files

2017-06-06 Thread Mike Hommey
On Tue, Jun 06, 2017 at 02:38:08PM -0400, rajdeep mondal wrote:
> Hi Randall,
> 
> I completely agree to what you are saying, but sometimes it just so
> happens that in the middle of a change, i feel like if some portion of
> the changes are fine I can commit them.  Stashing some of the files
> and being able to check the compile/tests at this point would be a
> really awesome change.
> 
> Stash supports a -p option to deal with this, it becomes cumbersome
> when the number of files are many.  Maybe it is something which would
> be a good to have feature. People need not use it if they dont want
> to.

Git 2.13.0 has that already.

git stash -- file1 file2

Mike


[PATCH] fast-import: Increase the default pack depth to 50

2017-06-07 Thread Mike Hommey
In 618e613a70, 10 years ago, the default for pack depth used for
git-pack-objects and git-repack was changed from 10 to 50, while
leaving fast-import's default at 10.

There doesn't seem to be a reason besides oversight for the change not
having happened in fast-import as well.

Interestingly, fast-import uses pack.depth when it's set, and the
git-config manual says the default for pack.depth is 50. While the
git-fast-import manual does say the default depth is 10, the
inconsistency is also confusing.

Signed-off-by: Mike Hommey 
---
 Documentation/git-fast-import.txt | 2 +-
 fast-import.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 2b762654bf..3d3d219e58 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -121,7 +121,7 @@ Performance and Compression Tuning
 
--depth=<n>::
Maximum delta depth, for blob and tree deltification.
-   Default is 10.
+   Default is 50.
 
 --export-pack-edges=::
After creating a packfile, print a line of data to
diff --git a/fast-import.c b/fast-import.c
index e69d219682..05c73858b0 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -280,7 +280,7 @@ struct recent_command {
 };
 
 /* Configured limits on output */
-static unsigned long max_depth = 10;
+static unsigned long max_depth = 50;
 static off_t max_packsize;
 static int unpack_limit = 100;
 static int force_update;
-- 
2.13.0



Re: [PATCH] fast-import: Increase the default pack depth to 50

2017-06-08 Thread Mike Hommey
On Thu, Jun 08, 2017 at 03:05:37AM -0400, Jeff King wrote:
> On Thu, Jun 08, 2017 at 02:34:36PM +0900, Mike Hommey wrote:
> 
> > In 618e613a70, 10 years ago, the default for pack depth used for
> > git-pack-objects and git-repack was changed from 10 to 50, while
> > leaving fast-import's default to 10.
> > 
> > There doesn't seem to be a reason besides oversight for the change not
> > having happened in fast-import as well.
> > 
> > Interestingly, fast-import uses pack.depth when it's set, and the
> > git-config manual says the default for pack.depth is 50. While the
> > git-fast-import manual does say the default depth is 10, the
> > inconsistency is also confusing.
> 
> Makes sense. If anything, fast-import would want to allow a deeper depth
> than normal, since (IIRC) its delta chains are always completely linear.
> Whereas in a real pack, if we decide not to make a delta off the 50th
> item in a chain, we usually find the 48th or 49th, and end up with a
> bushier graph.
> 
> It probably doesn't matter that much, though, as you'd really want to
> `repack -f` afterwards if you care about getting good deltas. And one
> base object every 50 versions is probably fine for keeping the initial
> pack manageable.

It actually is possible to have non-linear delta chains with
fast-import, because the cat-blob command resets the delta base when
storing a blob. See
https://github.com/git/git/blob/v2.13.1/fast-import.c#L2984-L2987

As a side effect of that, git-cinnabar[1] ends up with decent-ish delta
chains out of the box, without having to go with repack -f, when the
mercurial server gives out changegroup version 2.

Mike

1. https://github.com/glandium/git-cinnabar/


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Mike Hommey
On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> Footnote *1*: SHA-256, as all hash functions whose output is essentially
> the entire internal state, are susceptible to a so-called "length
> extension attack", where the hash of a secret+message can be used to
> generate the hash of secret+message+piggyback without knowing the secret.
> This is not the case for Git: only visible data are hashed. The type of
> attacks Git has to worry about is very different from the length extension
> attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> say, a collision attack.

What do the experts think of SHA512/256, which completely removes the
concerns over length extension attacks? (which I'd argue is better than
sweeping them under the carpet)

Mike


Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan

2017-06-15 Thread Mike Hommey
On Thu, Jun 15, 2017 at 09:01:45AM -0400, Jeff King wrote:
> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> 
> > On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > > Footnote *1*: SHA-256, as all hash functions whose output is essentially
> > > the entire internal state, are susceptible to a so-called "length
> > > extension attack", where the hash of a secret+message can be used to
> > > generate the hash of secret+message+piggyback without knowing the secret.
> > > This is not the case for Git: only visible data are hashed. The type of
> > > attacks Git has to worry about is very different from the length extension
> > > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > > say, a collision attack.
> > 
> > What do the experts think or SHA512/256, which completely removes the
> > concerns over length extension attack? (which I'd argue is better than
> > sweeping them under the carpet)
> 
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).

AIUI, length extension does make brute force collision attacks (which is
really what Shattered was) cheaper by allowing one to create the collision
with a small message and extend it later.

This might not be a credible threat against git, but if we go by that
standard, post-Shattered SHA-1 is still fine for git. As a matter of
fact, MD5 would also be fine: there is still, to this day, no preimage
attack against either of them.

Mike


[ANNOUNCE] git-cinnabar 0.5.0b2

2017-06-15 Thread Mike Hommey
Hi,

Git-cinnabar is a git remote helper to interact with mercurial
repositories. It allows cloning, pulling and pushing from/to mercurial remote
repositories, using git.

Code on https://github.com/glandium/git-cinnabar
This release on
https://github.com/glandium/git-cinnabar/releases/tag/0.5.0b2

What's new since 0.5.0 beta 1?

- Enabled support for clonebundles 
(https://www.mercurial-scm.org/wiki/ClonebundlesExtension) for faster clones 
when the server provides them.
- Git packs created by git-cinnabar are now smaller.
- Added a new `git cinnabar upgrade` command to handle metadata
  upgrade separately from `fsck`.
- Metadata upgrade is now significantly faster.
- `git cinnabar fsck` is also faster.
- Both now also use significantly less memory.
- Updated git to 2.13.1 for git-cinnabar-helper.

Mike


Re: [PATCH] strbuf: use designated initializers in STRBUF_INIT

2017-07-10 Thread Mike Hommey
On Mon, Jul 10, 2017 at 10:57:57AM -0400, Ben Peart wrote:
> Correct.  MSVC also supports designated initializers but does not fully
> support C99.

To be precise: *recent versions* of MSVC support designated initializers.
MSVC 2013 introduced them, but there were bugs until 2015; see e.g.
https://stackoverflow.com/questions/24090739/possible-compiler-bug-in-msvc12-vs2013-with-designated-initializer
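
For readers who haven't met the feature: a designated initializer names the
members it sets and zero-initializes the rest, which is what the STRBUF_INIT
change relies on. A generic example:

  struct point { int x, y, z; };
  struct point p = { .y = 2 };  /* x and z are implicitly 0 */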

Mike


Re: [PATCH] strbuf: use designated initializers in STRBUF_INIT

2017-07-14 Thread Mike Hommey
On Fri, Jul 14, 2017 at 09:11:33AM -0700, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > Oh, absolutely.
> >
> > Here is another possible test balloon, that may actually be useful
> > as an example.  I think there is a topic in flight that touches this
> > array, unfortunately, so I probably would find another one that is
> > more stable for a real follow-up patch to the one from Peff.
> 
> And here it is.
> 
> As to other things that we currently not allow in our codebase that
> newer compilers can grok, here is what *I* think.  It is *not* meant
> to be an exhaustive "what's new in C99 that is not in C89? what is
> the final verdict on each of them?":
> 
>  - There were occasional cases where we wished if variable-length
>arrays, flexible array members and variadic macros were available
>in our codebase during the course of this project.  We would
>probably want to add a similar test baloon patch for each of
>them to this series that is currently two-patch long.

FWIW, variadic macros have subtle implementation differences that can
cause problems.
For instance, MSVC only supports ##__VA_ARGS__ by way of somehow ignoring
the ##, but it drops the comma by default when __VA_ARGS__ is empty, which
means that `, __VA_ARGS__` *without* ## behaves differently there than on
other compilers.
See also 
https://connect.microsoft.com/VisualStudio/feedback/details/380090/variadic-macro-replacement
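
A small illustration of the difference (based on the MSVC behavior described
above; exact behavior varies with compiler version):

  #include <stdio.h>

  /* GNU extension: ## swallows the comma when no variadic args are given */
  #define warn_gnu(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
  /* plain C99: the comma stays */
  #define warn_c99(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)

  int main(void)
  {
          warn_gnu("hello %s\n", "world");
          warn_gnu("hello\n");       /* fine with gcc/clang; MSVC accepts it
                                        too, but by dropping the comma, not
                                        because of ## */
          warn_c99("hello %s\n", "world");
          /* warn_c99("hello\n"); */ /* trailing comma: error on gcc/clang,
                                        silently "works" on old MSVC */
          return 0;
  }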

Mike


Re: Git packs friendly to block-level deduplication

2018-01-24 Thread Mike Hommey
On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:
> If you have a bunch of git repositories cloned of the same project on
> the same filesystem, it would be nice of the packs that are produced
> would be friendly to block-level deduplication.
> 
> This would save space, and the blocks would be more likely to be in
> cache when you access them, likely speeding up git operations even if
> the packing itself is less efficient.
> 
> Here's a hacky one-liner that clones git/git and peff/git (almost the
> same content) and md5sums each 4k packed block, and sort | uniq -c's
> them to see how many are the same:
> 
> (
>cd /tmp &&
>rm -rf git*;
>git clone --reference ~/g/git --dissociate g...@github.com:git/git.git 
> git1 &&
>git clone --reference ~/g/git --dissociate 
> g...@github.com:peff/git.git git2 &&
>for repo in git1 git2
>do
>(
>cd $repo &&
>git repack -A -d --max-pack-size=10m
>)
>done &&
>parallel "perl -MDigest::MD5=md5_hex -wE 'open my \$fh, q[<], shift; 
> my \$s; while (read \$fh, \$s, 2**12) { say md5_hex(\$s) }' {}" ::: \
>$(find /tmp/git*/.git/objects/pack -type f)|sort|uniq -c|sort 
> -nr|awk '{print $1}'|sort|uniq -c|sort -nr
> )
> 
> This produces a total of 0 blocks that are the same. If after the repack
> we throw this in there after the repack:
> 
> echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object 
> --revs .git/objects/pack/manual
> 
> Just over 8% of the blocks are the same, and of course this pack
> entirely duplicates the existing packs, and I don't know how to coerce
> repack/pack-objects into keeping this manual-* pack and re-packing the
> rest, removing any objects that exist in the manual-* pack.
> 
> Documentation/technical/pack-heuristics.txt goes over some of the ideas
> behind the algorithm, and Junio's 1b4bb16b9e ("pack-objects: optimize
> "recency order"", 2011-06-30) seems to be the last major tweak to it.
> 
> I couldn't find any references to someone trying to get this particular
> use-case working on-list. I.e. to pack different repositories with a
> shared history in such a way as to optimize for getting the most amount
> of identical blocks within packs.
> 
> It should be possible to produce such a pack, e.g. by having a repack
> mode that would say:
> 
>  1. Find what the main branch is
>  2. Get its commits in reverse order, produce packs of some chunk-size
> of commit batches.
>  3. Pack all the remaining content
> 
> This would delta much less efficiently, but as noted above the
> block-level deduplication might make up for it, and in any case some
> might want to use less disk space.
> 
> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

FWIW, I sidestep the problem entirely by using alternatives.

Mike


Re: Git packs friendly to block-level deduplication

2018-01-24 Thread Mike Hommey
On Wed, Jan 24, 2018 at 02:23:57PM -0800, Junio C Hamano wrote:
> Mike Hommey  writes:
> 
> > FWIW, I sidestep the problem entirely by using alternatives.
> 
> That's a funny way to use the word "side-step", I would say, as the
> alternate object store support is there exactly for this use case.

It's footgunny, though.

Mike


Re: SHA1 collisions found

2017-02-25 Thread Mike Hommey
On Sat, Feb 25, 2017 at 02:26:56PM -0500, Jeff King wrote:
> On Sat, Feb 25, 2017 at 06:50:50PM +, brian m. carlson wrote:
> 
> > > As long as the reader can tell from the format of object names
> > > stored in the "new object format" object from what era is being
> > > referred to in some way [*1*], we can name new objects with only new
> > > hash, I would think.  "new refers only to new" that stratifies
> > > objects into older and newer may make things simpler, but I am not
> > > convinced yet that it would give our users a smooth enough
> > > transition path (but I am open to be educated and pursuaded the
> > > other way).
> > 
> > I would simply use multihash[0] for this purpose.  New-style objects
> > serialize data in multihash format, so it's immediately obvious what
> > hash we're referring to.  That makes future transitions less
> > problematic.
> > 
> > [0] https://github.com/multiformats/multihash
> 
> I looked at that earlier, because I think it's a reasonable idea for
> future-proofing. The first byte is a "varint", but I couldn't find where
> they defined that format.
> 
> The closest I could find is:
> 
>   https://github.com/multiformats/unsigned-varint
> 
> whose README says:
> 
>   This unsigned varint (VARiable INTeger) format is for the use in all
>   the multiformats.
> 
> - We have not yet decided on a format yet. When we do, this readme
>   will be updated.
> 
> - We have time. All multiformats are far from requiring this varint.
> 
> which is not exactly confidence inspiring. They also put the length at
> the front of the hash. That's probably convenient if you're parsing an
> unknown set of hashes, but I'm not sure it's helpful inside Git objects.
> And there's an incentive to minimize header data at the front of a hash,
> because every byte is one more byte that every single hash will collide
> over, and people will have to type when passing hashes to "git show",
> etc.
> 
> I'd almost rather use something _really_ verbose like
> 
>   sha256:1234abcd...
> 
> in all of the objects. And then when we get an unadorned hash from the
> user, we guess it's sha256 (or whatever), and fallback to treating it as
> a sha1.
> 
> Using a syntactically-obvious name like that also solves one other
> problem: there are sha1 hashes whose first bytes will encode as a "this
> is sha256" multihash, creating some ambiguity.

Indeed, multihash only really is interesting when *all* hashes use it.
And obviously, git can't change the existing sha1s.

Mike


Re: SHA1 collisions found

2017-03-02 Thread Mike Hommey
On Thu, Mar 02, 2017 at 02:27:15PM -0800, Linus Torvalds wrote:
> On Thu, Mar 2, 2017 at 1:54 PM, Joey Hess  wrote:
> >
> > There's a surprising result of combining iterated hash functions, that
> > the combination is no more difficult to attack than the strongest hash
> > function used.
> 
> Duh. I should actually have known that. I started reading the paper
> and went "this seems very familiar". I'm pretty sure I've been pointed
> at that paper before (or maybe just a similar one), and I just didn't
> react enough for it to leave a lasting impact.

What if the "object version" is a hash of the content (as opposed to
header + content like the normal git hash)?

Mike


Re: RFC: Another proposed hash function transition plan

2017-03-06 Thread Mike Hommey
On Mon, Mar 06, 2017 at 03:40:30PM -0800, Jonathan Nieder wrote:
> David Lang wrote:
> 
> >> Translation table
> >> ~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> > This seems like a rather nontrival thing to design. It will need to
> > hold millions of mappings, and be quickly searchable from either
> > direction (sha1->new and new->sha1) while still be fairly fast to
> > insert new records into.
> 
> I am currently thinking of using LevelDB, since it has the advantages of
> being simple, already existing, and having already been ported to Java
> (allowing JGit can read and write the same format).
> 
> If that doesn't work, we'd try some other key-value store like Samba's
> tdb or Kyoto Cabinet.

FWIW, I'm using notes-like data to store mercurial->git mappings in
git-cinnabar, (ab)using the commit type in tree items. It's fast enough.

Mike


[PATCH] notes: Fix note_tree_consolidate not to break the note_tree structure

2017-03-25 Thread Mike Hommey
After a note is removed, note_tree_consolidate is called to eliminate
some useless nodes. The typical case is that if you had an int_node
with 2 PTR_TYPE_NOTEs in it, and remove one of them, then the
PTR_TYPE_INTERNAL pointer in the parent tree can be replaced with the
remaining PTR_TYPE_NOTE.

This works fine when PTR_TYPE_NOTEs are involved, but falls flat when
other types are involved.

To put things in more practical terms, let's say we start from an empty
notes tree, and add 3 notes:
- one for a sha1 that starts with 424
- one for a sha1 that starts with 428
- one for a sha1 that starts with 4c

To keep track of this, note_tree.root will have a PTR_TYPE_INTERNAL at
a[4], pointing to an int_node*.
In turn, that int_node* will have a PTR_TYPE_NOTE at a[0xc], pointing to
the leaf_node* with the key and value, and a PTR_TYPE_INTERNAL at a[2],
pointing to another int_node*.
That other int_node* will have 2 PTR_TYPE_NOTE, one at a[4] and the
other at a[8].

When looking for the note for the sha1 starting with 428, get_note() will
recurse through (simplified) root.a[4].a[2].a[8].

Now, if we remove the note for the sha1 that starts with 4c, we're left
with a int_node* with only one PTR_TYPE_INTERNAL entry in it. After
note_tree_consolidate runs, root.a[4] now points to what used to be
pointed at by root.a[4].a[2].

Which means looking up for the note for the sha1 starting with 428 now
fails because there is nothing at root.a[4].a[2] anymore: there is only
root.a[4].a[4] and root.a[4].a[8], which don't match the expected
structure for the lookup.

So if all there is left in an int_node* is a PTR_TYPE_INTERNAL pointer,
we can't safely remove it. I think the same applies for PTR_TYPE_SUBTREE
pointers. IOW, only PTR_TYPE_NOTEs are safe to be moved to the parent
int_node*.

This doesn't have a practical effect on git because all that happens
after a remove_note is a write_notes_tree, which just iterates the entire
note tree, but this affects anything using libgit.a that would try to do
lookups after removing notes.

Signed-off-by: Mike Hommey 
---
 notes.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/notes.c b/notes.c
index 2bab961ac..542563b28 100644
--- a/notes.c
+++ b/notes.c
@@ -153,8 +153,8 @@ static struct leaf_node *note_tree_find(struct notes_tree *t,
  * How to consolidate an int_node:
  * If there are > 1 non-NULL entries, give up and return non-zero.
  * Otherwise replace the int_node at the given index in the given parent node
- * with the only entry (or a NULL entry if no entries) from the given tree,
- * and return 0.
+ * with the only NOTE entry (or a NULL entry if no entries) from the given
+ * tree, and return 0.
  */
 static int note_tree_consolidate(struct int_node *tree,
struct int_node *parent, unsigned char index)
@@ -173,6 +173,8 @@ static int note_tree_consolidate(struct int_node *tree,
}
}
 
+   if (p && (GET_PTR_TYPE(p) != PTR_TYPE_NOTE))
+   return -2;
/* replace tree with p in parent[index] */
parent->a[index] = p;
free(tree);
-- 
2.12.1.dirty
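
To make the scenario above concrete, here is a rough sketch of the libgit.a
call sequence that used to break lookups (signatures as of the sha1-based
notes API of that era, from memory; the sha1 values are made up to match the
424/428/4c example from the commit message):

  unsigned char k424[20] = { 0x42, 0x40 };  /* sha1 starting with 424... */
  unsigned char k428[20] = { 0x42, 0x80 };  /* sha1 starting with 428... */
  unsigned char k4c[20]  = { 0x4c };        /* sha1 starting with 4c... */
  unsigned char note[20] = { 0x01 };        /* some note blob sha1 */
  struct notes_tree t = { NULL };

  init_notes(&t, "refs/notes/demo", combine_notes_overwrite, NOTES_INIT_EMPTY);
  add_note(&t, k424, note, NULL);
  add_note(&t, k428, note, NULL);
  add_note(&t, k4c, note, NULL);

  remove_note(&t, k4c);
  /* before this patch, the over-eager consolidation after remove_note left
   * the tree in a shape where this lookup came back empty */
  get_note(&t, k428);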



Revision resolution for remote-helpers?

2017-08-17 Thread Mike Hommey
Hi,

As you might remember, I'm maintaining a remote helper that allows
talking directly to mercurial servers with git: git-cinnabar.

When dealing with "foreign (non-git) repositories", it is often
necessary to refer to revisions with their native name. With mercurial,
that's a sha1, with svn it's a revision number, etc.

Git-cinnabar does provide a helper command that gives back the git
commit from the mercurial revision (and vice versa), but that's
cumbersome to use in commands.

I was thinking it could be useful to have a special syntax for revisions
that would query a helper program. The helper program could use a
similar protocol to that of the remote helpers.

My thought is that a string like <vcs>::<revision> could be used
wherever a committish is expected. That would call some helper and
request to resolve the revision, and the helper would provide a git
commit as a response.

The reason for the <vcs>:: prefix is that it matches the <transport>::
prefix used for remote helpers.

Now, there are a few caveats:
- <revision>, for e.g. svn, pretty much would depend on the remote.
  OTOH, in mercurial, it doesn't. I think it would be fair for the
  helper to have to deal with making what appears after :: relevant
  to find the right revision, by possibly including a remote name.
- msys likes to completely fuck up command lines when they contain ::.
  For remote helpers, the alternative that works is
  :///etc.

Which leads me to think some "virtual" ref namespace could be a solution
to the problem. So instead of <vcs>::, the prefix would be <vcs>/.
For e.g. svn, svn/$remote/$rev would be a natural way to specify the
revision for a given remote. For mercurial, hg/$sha1.

Potentially, this could be a sort of pluggable ref store, which could
be used for extensions such as the currently discussed reftable.

On the opposite end of the problem, I'm also thinking about git log
--decorate= displaying the mercurial revisions where branch
decorations would normally go.

I have no patch to show for it. Those are ideas that I first want to
discuss before implementing anything.

Thoughts?

Mike


Re: Revision resolution for remote-helpers?

2017-08-18 Thread Mike Hommey
On Fri, Aug 18, 2017 at 08:15:09AM -0400, Jeff King wrote:
> On Fri, Aug 18, 2017 at 03:42:08PM +0900, Mike Hommey wrote:
> 
> > I was thinking it could be useful to have a special syntax for revisions
> > that would query a helper program. The helper program could use a
> > similar protocol to that of the remote helpers.
> 
> That sounds like a reasonable thing to want.
> 
> > My thought is that a string like :: could be used
> > wherever a committish is expected. That would call some helper
> > and request to resolve revision, and the helper would provide a git
> > commit as a response.
> 
> So I'm guessing this would look something like:
> 
>   git log svn::12345
> 
> I think even without Git support, you could do something like:
> 
>   git log $(git svn map 12345)

That's what I do, but subshells and all is extra cumbersome.

> which is similarly complex in terms of concepts, and not too many more
> characters. That would be a little more awkward outside of a shell,
> though.
> 
> But it did get me wondering if we could do _better_ with something built
> into Git. For example, could we have an external resolution helper that
> resolves names to object ids as a fallback after internal resolution has
> failed. And then you could do:
> 
>  git log 12345
> 
> and it would just work. Efficiency shouldn't be a big problem, because
> we'd hit the helper only in the error case.
> 
> I'd be more concerned about awkward ambiguities, though. If mercurial is
> also using sha1s, then there's nothing syntactic to differentiate the
> two. For that matter, 12345 has the same problem, since it could be a
> partial sha1.

For something as short, the likelihood of hitting an actual existing
abbreviated sha1 is quite high, too.

> It might work to actually check if we have the object and then bail
> to the remote resolver only if we don't. But that's actually conflating
> name resolution with object lookup, which our internals typically keep
> separate.
> 
> So maybe this is a bad direction to go in. I'm mostly just thinking out
> loud here.
> 
> > Which leads me to think some "virtual" ref namespace could be a solution
> > to the problem. So instead of ::, the prefix would be /.
> > For e.g. svn, svn/$remote/$rev would be a natural way to specify the
> > revision for a given remote. For mercurial, hg/$sha1.
> 
> Interesting. I do like the general idea of having external helpers to
> fill in bits of the virtual namespace. But it may also open many cans of
> worms. :)
> 
> > Potentially, this could be a sort of pluggable ref stores, which could
> > be used for extensions such as the currently discussed reftable.
> 
> The current pluggable code is definitely geared for build-time
> pluggability, not runtime. But I think you could have a builtin
> pluggable store that does the overlay, and then chains to another
> backend. I.e., configure something like:
> 
>   [extensions]
>   refBackend = externalOverlay
> 
>   [externalOverlay "svn"]
>   namespace = refs/svn
>   command = my-svn-mapper
> 
>   [externalOverlay]
>   chain = reftable
> 
> That would allow the externalOverlay thing to develop independent of the
> core of Git's refs code.

That's a lot of configuration, but it's definitely an interesting
proposition.

> > On the opposite end of the problem, I'm also thinking about git log
> > --decorate= displaying the mercurial revisions where branch
> > decorations would normally go.
> 
> Interesting thought. I'm not sure if that would be a good thing or a bad
> thing. But one of the virtual methods for pluggable backends is
> "enumerate all refs". If you're mapping every mercurial revision, that's
> going to be a lot of refs (and potentially a lot of overhead for certain
> operations).
> 
> I think the decorate code just looks at a few parts of the refs
> namespace right now (so a "refs/svn" would probably get ignored by
> default).

I think decorate would need its own special entry to the "ref query" API
to answer the question "what ref points to <commit>" instead of scanning
the whole namespace.

Mike


Re: Revision resolution for remote-helpers?

2017-08-18 Thread Mike Hommey
On Fri, Aug 18, 2017 at 03:06:37PM -0700, Jonathan Nieder wrote:
> Hi,
> 
> Mike Hommey wrote:
> 
> > My thought is that a string like :: could be used
> > wherever a committish is expected. That would call some helper
> > and request to resolve revision, and the helper would provide a git
> > commit as a response.
> 
> I like this idea.
> 
> > The reason for the :: prefix is that it matches the ::
> > prefix used for remote helpers.
> >
> > Now, there are a few caveats:
> > - , for e.g. svn, pretty much would depend on the remote.
> >   OTOH, in mercurial, it doesn't. I think it would be fair for the
> >   helper to have to deal with making what appears after :: relevant
> >   to find the right revision, by possibly including a remote name.
> > - msys likes to completely fuck up command lines when they contain ::.
> >   For remote helpers, the alternative that works is
> >   :///etc.
> 
> Hm --- is there a bug already open about this (e.g. in the Git for
> Windows project or in msys) where I can read more?

It's entirely an msys problem. Msys has weird rules to translate between
unix paths and windows paths on the command line, and botches everything
as a consequence. That's by "design".

http://www.mingw.org/wiki/Posix_path_conversion

(Particularly, see the last two entries)

That only happens when calling native Windows programs from a msys
shell.

> > Which leads me to think some "virtual" ref namespace could be a solution
> > to the problem. So instead of ::, the prefix would be /.
> > For e.g. svn, svn/$remote/$rev would be a natural way to specify the
> > revision for a given remote. For mercurial, hg/$sha1.
> 
> I see.  My naive assumption would be that a string like r12345 would be
> the most natural way for a user to want to specify a Subversion
> revision, but you're right that those only have meaning scoped to a
> particular server.  That makes the svn/ prefix more tolerable.
> 
> > Potentially, this could be a sort of pluggable ref stores, which could
> > be used for extensions such as the currently discussed reftable.
> >
> > On the opposite end of the problem, I'm also thinking about git log
> > --decorate= displaying the mercurial revisions where branch
> > decorations would normally go.
> >
> > I have no patch to show for it. Those are ideas that I first want to
> > discuss before implementing anything.
> 
> One additional thought: unlike refs, these are not necessarily
> enumerable (and I wouldn't expect "git show-ref" to show them) and
> they do not affect "git prune"'s reachability computation.
> 
> So internally I don't think refs is the right concept to map these to.
> I think the right layer is revision syntax handling (revision.c).

Makes sense.

Mike


Re: [git-for-windows] Re: Revision resolution for remote-helpers?

2017-08-24 Thread Mike Hommey
On Tue, Aug 22, 2017 at 10:15:20PM +0200, Johannes Schindelin wrote:
> Hi,
> 
> On Fri, 18 Aug 2017, Jonathan Nieder wrote:
> 
> > Mike Hommey wrote[1]:
> > > On Fri, Aug 18, 2017 at 03:06:37PM -0700, Jonathan Nieder wrote:
> > >> Mike Hommey wrote:
> > 
> > >>> The reason for the :: prefix is that it matches the ::
> > >>> prefix used for remote helpers.
> > >>>
> > >>> Now, there are a few caveats:
> > [...]
> > >>> - msys likes to completely fuck up command lines when they contain ::.
> > >>>   For remote helpers, the alternative that works is
> > >>>   :///etc.
> > >>
> > >> Hm --- is there a bug already open about this (e.g. in the Git for
> > >> Windows project or in msys) where I can read more?
> > >
> > > It's entirely an msys problem. Msys has weird rules to translate between
> > > unix paths and windows paths on the command line, and botches everything
> > > as a consequence. That's by "design".
> > >
> > > http://www.mingw.org/wiki/Posix_path_conversion
> > >
> > > (Particularly, see the last two entries)
> > >
> > > That only happens when calling native Windows programs from a msys
> > > shell.
> > 
> > Cc-ing the Git for Windows mailing list as an FYI.
> > 
> > I have faint memories that some developers on that project have had to
> > delve deep into Msys path modification rules.  It's possible they can
> > give us advice (e.g. about :: having been a bad choice of
> > syntax in the first place :)).
> 
> I think it is safe to assume that :: is not part of any Unix-y path. That
> is why the MSYS2 runtime does not try to play games with it by converting
> it to a Windows path.
> 
> (And indeed, I just tested this, an argument of the form
> "a::file://b/x/y/z" is not converted to a "Windows path")

Note that there are people out there using msys, *and* git for windows,
although I don't know if such people exist outside Mozilla.

Mike


Re: [git-for-windows] Re: Revision resolution for remote-helpers?

2017-08-25 Thread Mike Hommey
On Fri, Aug 25, 2017 at 12:58:52PM +0200, Johannes Schindelin wrote:
> Hi Mike,
> 
> On Thu, 24 Aug 2017, Mike Hommey wrote:
> 
> > On Tue, Aug 22, 2017 at 10:15:20PM +0200, Johannes Schindelin wrote:
> > > 
> > > On Fri, 18 Aug 2017, Jonathan Nieder wrote:
> > > 
> > > > Mike Hommey wrote[1]:
> > > > > On Fri, Aug 18, 2017 at 03:06:37PM -0700, Jonathan Nieder wrote:
> > > > >> Mike Hommey wrote:
> > > > 
> > > > >>> The reason for the :: prefix is that it matches the 
> > > > >>> ::
> > > > >>> prefix used for remote helpers.
> > > > >>>
> > > > >>> Now, there are a few caveats:
> > > > [...]
> > > > >>> - msys likes to completely fuck up command lines when they contain 
> > > > >>> ::.
> > > > >>>   For remote helpers, the alternative that works is
> > > > >>>   :///etc.
> > > > >>
> > > > >> Hm --- is there a bug already open about this (e.g. in the Git for
> > > > >> Windows project or in msys) where I can read more?
> > > > >
> > > > > It's entirely an msys problem. Msys has weird rules to translate 
> > > > > between
> > > > > unix paths and windows paths on the command line, and botches 
> > > > > everything
> > > > > as a consequence. That's by "design".
> > > > >
> > > > > http://www.mingw.org/wiki/Posix_path_conversion
> > > > >
> > > > > (Particularly, see the last two entries)
> > > > >
> > > > > That only happens when calling native Windows programs from a msys
> > > > > shell.
> > > > 
> > > > Cc-ing the Git for Windows mailing list as an FYI.
> > > > 
> > > > I have faint memories that some developers on that project have had to
> > > > delve deep into Msys path modification rules.  It's possible they can
> > > > give us advice (e.g. about :: having been a bad choice of
> > > > syntax in the first place :)).
> > > 
> > > I think it is safe to assume that :: is not part of any Unix-y path. That
> > > is why the MSYS2 runtime does not try to play games with it by converting
> > > it to a Windows path.
> > > 
> > > (And indeed, I just tested this, an argument of the form
> > > "a::file://b/x/y/z" is not converted to a "Windows path")
> > 
> > Note that there are people out there using msys, *and* git for windows,
> > although I don't know if such people exist outside Mozilla.
> 
> Note that I am maintainer of Git for Windows, not of any setup that uses
> MSys. Please do not even try to put more stuff on my plate.

I'm not trying to do that. I'm just saying that there are setups where
the current way of using remote helpers doesn't work out, and it's
completely independent of git or git for windows, and there's not much
git for windows can do about it except maybe unmangling what msys does,
but it's about as horrible as not doing anything.

This does bring the question, though, whether there should be an
alternative syntax, which there actually is, but it doesn't really allow
to convey things with a protocol after the double colons (e.g.
you can't really distinguish between hg::http://... and hg::https://...
with the hg:// form ; git-cinnabar allows the protocol to appear as part
of the port number, e.g. hg://host:http/... and hg:// defaults to https)

And this brings the question whether :: would be the right "trigger" for
the feature that opened this thread originally.

Mike


Re: [git-for-windows] Re: Revision resolution for remote-helpers?

2017-09-20 Thread Mike Hommey
On Fri, Aug 25, 2017 at 09:02:36PM +0900, Mike Hommey wrote:
> On Fri, Aug 25, 2017 at 12:58:52PM +0200, Johannes Schindelin wrote:
> > > > > Cc-ing the Git for Windows mailing list as an FYI.
> > > > > 
> > > > > I have faint memories that some developers on that project have had to
> > > > > delve deep into Msys path modification rules.  It's possible they can
> > > > > give us advice (e.g. about :: having been a bad choice of
> > > > > syntax in the first place :)).
> > > > 
> > > > I think it is safe to assume that :: is not part of any Unix-y path. 
> > > > That
> > > > is why the MSYS2 runtime does not try to play games with it by 
> > > > converting
> > > > it to a Windows path.
> > > > 
> > > > (And indeed, I just tested this, an argument of the form
> > > > "a::file://b/x/y/z" is not converted to a "Windows path")
> > > 
> > > Note that there are people out there using msys, *and* git for windows,
> > > although I don't know if such people exist outside Mozilla.
> > 
> > Note that I am maintainer of Git for Windows, not of any setup that uses
> > MSys. Please do not even try to put more stuff on my plate.
> 
> I'm not trying to do that. I'm just saying that there are setups where
> the current way of using remote helpers doesn't work out, and it's
> completely independent of git or git for windows, and there's not much
> git for windows can do about it except maybe unmangling what msys does,
> but it's about as horrible as not doing anything.
> 
> This does bring the question, though, whether there should be an
> alternative syntax, which there actually is, but it doesn't really allow
> to convey things with a protocol after the double colons (e.g.
> you can't really distinguish between hg::http://... and hg::https://...
> with the hg:// form ; git-cinnabar allows the protocol to appear as part
> of the port number, e.g. hg://host:http/... and hg:// defaults to https)
> 
> And this brings the question whether :: would be the right "trigger" for
> the feature that opened this thread originally.

(FYI, FWIW)

So, interestingly, I tried using the instructions on
https://github.com/git-for-windows/git/wiki/Install-inside-MSYS2-proper
today, and that led me to the same problem, being that the msys path
munging was breaking :: syntax.

It turns out, I had placed the git-for-windows section last in
pacman.conf, so msys2-runtime hadn't been updated. Once it is updated,
the :: syntax is not munged anymore, and everything works
as expected.

Meaning, in fact, that git-for-windows has addressed the problem on its
end, but the problem still exists when running git-for-windows from a
msys2 shell without the git-for-windows msys2 runtime.

Also, the munging happens on the caller side, so the shell needs to be
using git-for-windows's msys2 runtime; it's not enough that git itself
does.
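
For what it's worth, a plain msys2 shell can also be told to leave such
arguments alone through the runtime's own escape hatch
(MSYS2_ARG_CONV_EXCL, a semicolon-separated list of argument prefixes to
exclude from path conversion). A sketch, with a made-up URL:

  # skip path conversion for any argument starting with "hg::"
  $ MSYS2_ARG_CONV_EXCL="hg::" git clone hg::https://example.com/some-repo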

Mike


Re: git-clone causes out of memory

2017-10-13 Thread Mike Hommey
On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> There's a gitbomb on github. It is undoubtedly creative and funny, but since
> this is a bug in git, I thought it'd be nice to report. The command:
> 
>   $ git clone https://github.com/x0rz/ShadowBrokersFiles

What fills memory is actually the checkout part of the command. git
clone -n doesn't fail.

Credit should go where it's due: https://kate.io/blog/git-bomb/
(with the bonus that it comes with explanations)
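
For anyone who wants to poke at it without filling up memory, the object
transfer alone is harmless; only the checkout blows up. A sketch, using
the repository from that blog post:

  $ git clone -n https://github.com/Katee/git-bomb.git
  $ cd git-bomb
  $ git ls-tree HEAD    # inspecting the objects is cheap; only checkout explodes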

Mike


Re: git-clone causes out of memory

2017-10-13 Thread Mike Hommey
On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
> On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey  wrote:
> > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> >> There's a gitbomb on github. It is undoubtedly creative and funny, but 
> >> since
> >> this is a bug in git, I thought it'd be nice to report. The command:
> >>
> >>   $ git clone https://github.com/x0rz/ShadowBrokersFiles
> >
> > What fills memory is actually the checkout part of the command. git
> > clone -n doesn't fail.
> >
> > Credit should go where it's due: https://kate.io/blog/git-bomb/
> > (with the bonus that it comes with explanations)
> 
> Yeah, there is a thread on Hacker News about this too:
> 
> https://news.ycombinator.com/item?id=15457076
> 
> The original repo on GitHub is:
> 
> https://github.com/Katee/git-bomb.git
> 
> After cloning it with -n, there is the following "funny" situation:
> 
> $ time git rev-list HEAD
> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
> 18ed56cbc5012117e24a603e7c072cf65d36d469
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.004s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.005s
> user    0m0.008s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.203s
> user    0m0.112s
> sys     0m0.088s
> $ time git rev-list HEAD -- d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m1.305s
> user    0m0.720s
> sys     0m0.580s
> $ time git rev-list HEAD -- d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m12.135s
> user    0m6.700s
> sys     0m5.412s
> 
> So `git rev-list` becomes exponentially more expensive when you run it
> on a shorter directory path, though it is fast if you run it without a
> path.

That's because there are 10^7 files under d0/d0/d0, 10^6 under
d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.

So really, this is all about things being slower when there's a crazy
number of files. Picture me surprised.

What makes it kind of special is that the repository contains a lot of
paths/files, but very few objects, because it's duplicating everything.

All the 10^10 blobs have the same content, all the 10^9 trees that point
to them have the same content, all the 10^8 trees that point to those
trees have the same content, etc.

If git wasn't effectively deduplicating identical content, the repository
would be multiple gigabytes large.
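
To make the deduplication concrete, here is a much smaller sketch of the
same construction, with made-up names and the depth and fan-out shrunk
down, built with plain plumbing commands:

  git init bomb-demo && cd bomb-demo
  blob=$(echo hello | git hash-object -w --stdin)
  # one tree whose 10 entries all point at the same blob
  tree=$(for i in 0 1 2 3 4 5 6 7 8 9; do
    printf '100644 blob %s\tf%s\n' "$blob" "$i"
  done | git mktree)
  # three more levels, each tree pointing 10 times at the previous tree
  for level in 1 2 3; do
    tree=$(for i in 0 1 2 3 4 5 6 7 8 9; do
      printf '040000 tree %s\td%s\n' "$tree" "$i"
    done | git mktree)
  done
  git commit-tree -m demo "$tree"
  git count-objects -v    # a handful of objects backing 10^4 identical paths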

Mike


Re: Make the git codebase thread-safe

2014-02-12 Thread Mike Hommey
On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> Stefan Zager  writes:
> 
> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup  wrote:
> >
> >> Really, give the above patch a try.  I am taking longer to finish it
> >> than anticipated (with a lot due to procrastination but that is,
> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> and nontrivial progress reports for my freely available work on GNU
> >> LilyPond).
> >
> > I will give that a try.  How much of a performance improvement have
> > you clocked?
> 
> Depends on file type and size.  With large files with lots of small
> changes, performance improvements get more impressive.
> 
> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> (performance improvement about a factor of 3), a large file in the style
> of /usr/share/dict/words clocking in at a factor of about 5.
> 
> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> are no improvements in system time (I/O) except for patch 4 of the
> series which helps perhaps 20% or so.
> 
> So the benefits of the patch will come into play mostly for big, bad
> files on Windows: other than that, the I/O time is likely to be the
> dominant player anyway.

How much fragmentation does that add to the files, though?

Mike


Re: Make the git codebase thread-safe

2014-02-12 Thread Mike Hommey
On Wed, Feb 12, 2014 at 12:00:19PM +0100, Karsten Blees wrote:
> Am 12.02.2014 04:43, schrieb Duy Nguyen:
> > On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson  
> > wrote:
> >> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
> >>> We in the chromium project have a keen interest in adding threading to
> >>> git in the pursuit of performance for lengthy operations (checkout,
> >>> status, blame, ...).  Our motivation comes from hitting some
> >>> performance walls when working with repositories the size of chromium
> >>> and blink:
> >> +1 from Gentoo on performance improvements for large repos.
> >>
> >> The main repository in the ongoing Git migration project looks to be in
> >> the 1.5GB range (and for those that want to propose splitting it up, we
> >> have explored that option and found it lacking), with very deep history
> >> (but no branches of note, and very few tags).
> > 
> > From v1.9 shallow clone should work for all push/pull/clone... so
> > history depth does not matter (on the client side). As for
> > gentoo-x86's large worktree, using index v4 and avoid full-tree
> > operations (e.g. "status .", not "status"..) should make all
> > operations reasonably fast. I plan to make "status" fast even without
> > path limiting with the help of inotify, but that's not going to be
> > finished soon. Did I miss anything else?
> > 
> 
> Regarding git-status on msysgit, enable core.preloadindex and core.fscache 
> (as of 1.8.5.2).
> 
> There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to
> keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1
> instead of C:\Program Files).

You can use GetLongPathNameW to get the latter from the former.

Mike


Re: Make the git codebase thread-safe

2014-02-13 Thread Mike Hommey
On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:
> Mike Hommey  writes:
> 
> > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> >> Stefan Zager  writes:
> >> 
> >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup  wrote:
> >> >
> >> >> Really, give the above patch a try.  I am taking longer to finish it
> >> >> than anticipated (with a lot due to procrastination but that is,
> >> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> >> and nontrivial progress reports for my freely available work on GNU
> >> >> LilyPond).
> >> >
> >> > I will give that a try.  How much of a performance improvement have
> >> > you clocked?
> >> 
> >> Depends on file type and size.  With large files with lots of small
> >> changes, performance improvements get more impressive.
> >> 
> >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> >> (performance improvement about a factor of 3), a large file in the style
> >> of /usr/share/dict/words clocking in at a factor of about 5.
> >> 
> >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> >> are no improvements in system time (I/O) except for patch 4 of the
> >> series which helps perhaps 20% or so.
> >> 
> >> So the benefits of the patch will come into play mostly for big, bad
> >> files on Windows: other than that, the I/O time is likely to be the
> >> dominant player anyway.
> >
> > How much fragmentation does that add to the files, though?
> 
> Uh, git-blame is a read-only operation.  It does not add fragmentation
> to any file.  The patch will add a diff of probably a few dozen hunks to
> builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> that I expect even
> 
> git blame builtin/blame.c
> 
> to be faster than before.  But that interpretation of your question
> probably tries to make too much sense out of what is just nonsense in
> the given context.

Sorry, I thought you were talking about write operations, not reads.

Mike


Re: Make the git codebase thread-safe

2014-02-13 Thread Mike Hommey
On Thu, Feb 13, 2014 at 06:34:39PM +0900, Mike Hommey wrote:
> On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:
> > Mike Hommey  writes:
> > 
> > > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> > >> Stefan Zager  writes:
> > >> 
> > >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup  wrote:
> > >> >
> > >> >> Really, give the above patch a try.  I am taking longer to finish it
> > >> >> than anticipated (with a lot due to procrastination but that is,
> > >> >> unfortunately, a large part of my workflow), and it's cutting into my
> > >> >> "paychecks" (voluntary donations which to a good degree depend on 
> > >> >> timely
> > >> >> and nontrivial progress reports for my freely available work on GNU
> > >> >> LilyPond).
> > >> >
> > >> > I will give that a try.  How much of a performance improvement have
> > >> > you clocked?
> > >> 
> > >> Depends on file type and size.  With large files with lots of small
> > >> changes, performance improvements get more impressive.
> > >> 
> > >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> > >> (performance improvement about a factor of 3), a large file in the style
> > >> of /usr/share/dict/words clocking in at a factor of about 5.
> > >> 
> > >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> > >> are no improvements in system time (I/O) except for patch 4 of the
> > >> series which helps perhaps 20% or so.
> > >> 
> > >> So the benefits of the patch will come into play mostly for big, bad
> > >> files on Windows: other than that, the I/O time is likely to be the
> > >> dominant player anyway.
> > >
> > > How much fragmentation does that add to the files, though?
> > 
> > Uh, git-blame is a read-only operation.  It does not add fragmentation
> > to any file.  The patch will add a diff of probably a few dozen hunks to
> > builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> > that I expect even
> > 
> > git blame builtin/blame.c
> > 
> > to be faster than before.  But that interpretation of your question
> > probably tries to make too much sense out of what is just nonsense in
> > the given context.
> 
> Sorry, I thought you were talking about write operations, not reads.

Specifically, I thought you were talking about git checkout.

Mike


Re: [msysGit] Git for Windows 1.9.0 (fwd)

2014-02-17 Thread Mike Hommey
On Tue, Feb 18, 2014 at 12:52:28AM +0100, Johannes Schindelin wrote:
> Hopefully the Postfix Greylisting Policy Server will not try again to
> greylist me, as it might however do without violating the RFC.
> 
> -- Forwarded message --
> Date: Tue, 18 Feb 2014 00:38:54 +0100 (CET)
> From: Johannes Schindelin 
> To: msys...@googlegroups.com
> Cc: git@vger.kernel.org
> Subject: [msysGit] Git for Windows 1.9.0
> 
> Dear Git fanbois,
> 
> this announcement informs you that the small team of volunteers who keep
> the Git ship afloat for the most prevalent desktop operating system
> managed to release yet another version of Git for Windows:
> 
> Git Release Notes (Git-1.9.0-preview20140217)
> Last update: 17 February 2013
> 
> Changes since Git-1.8.5.2-preview20131230
> 
> New Features
> - Comes with Git 1.9.0 plus Windows-specific patches.
> - Better work-arounds for Windows-specific path length limitations (pull
>   request #122)
> - Uses optimized TortoiseGitPLink when detected (msysGit pull request
>   #154)
> - Allow Windows users to use Linux Git on their files, using Vagrant
>   http://www.vagrantup.com/ (msysGit pull request #159)

I was curious about this, so I went to GitHub and... couldn't find any
pull request above #126.

Mike


Re: [msysGit] Git for Windows 1.9.0 (fwd)

2014-02-18 Thread Mike Hommey
On Tue, Feb 18, 2014 at 10:04:54AM +0100, Erik Faye-Lund wrote:
> It's right here: https://github.com/msysgit/msysgit/pull/159
> 
> You probably looked in our git repo rather than our msysGit repo.

Oh indeed I was, thanks.

Mike


fast-import deltas

2014-04-01 Thread Mike Hommey
Hi,

I am currently prototyping a "native" mercurial remote handler for git,
and it seems silly for git to compute deltas itself when I'm getting
deltas from the mercurial remote itself, albeit in a different form.

Would adding a fast-import command to handle deltas be considered useful
for git? If so, what kind of format would be suitable?

Cheers,

Mike


Re: fast-import deltas

2014-04-01 Thread Mike Hommey
On Tue, Apr 01, 2014 at 07:45:03AM -0400, Jeff King wrote:
> On Tue, Apr 01, 2014 at 07:25:54PM +0900, Mike Hommey wrote:
> 
> > I am currently prototyping a "native" mercurial remote handler for git,
> 
> For my own curiosity, how does this differ from what is in
> contrib/remote-helpers/git-remote-hg?

contrib/remote-helpers/git-remote-hg does a local mercurial clone before
doing the git conversion. While this is not really a problem for most
mercurial projects, it tends to be slow with big ones, like the firefox
source code. What I'm aiming at is something that can talk directly to a
remote mercurial server.
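
(The end goal being that one could eventually do something along the
lines of:

  $ git clone hg::https://hg.example.org/some-repo

against the remote server directly, without a full local mercurial clone
sitting next to the git one; the URL here is obviously made up.)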

> > Would adding a fast-import command to handle deltas be considered useful
> > for git? If so, what kind of format would be suitable?
> 
> It breaks fast-import's "lowest common denominator" data model that is
> just passing commits and their contents over the stream. But we already
> do that in other cases for the sake of performance. I think the
> important thing is that the alternate formats are optional and enabled
> by the caller with command-line options.
> 
> That being said, I foresee a few complications:
> 
>   1. Git needs to know the sha1 of the full object. So unless the
>  generating script knows that ahead of time, git has to expand the
>  delta immediately anyway (this could still be a win if we end up
>  using a good delta from elsewhere rather than doing our own delta
>  search, but I suspect it's not so big a win as if we can just blit
>  the delta straight to disk).

Good point. That could quickly become a problem with long delta chains.

>   2. Git does not store on-disk deltas between objects that are not in
>  the same packfile. So you'd only be able to delta against an object
>  that came in the same stream (or you'd have to "fix" the packfile
>  on disk by adding an extra copy of the delta base, but that
>  probably eliminates any savings).

Arguably, this would make the most difference on initial clone of big
projects, or large incremental updates (like, after a few weeks), which
would use a single pack anyways.

> As for format, I believe that git is basically xdelta under the hood, so
> you'd want either that or something that can be trivially converted to
> it.

It seems to me fast-import keeps a kind of human-readable format for its
protocol, so I wonder if the xdelta format would fit the bill. That being said,
I also wonder if i shouldn't just try to write a pack on my own...

Cheers,

Mike


Re: fast-import deltas

2014-04-01 Thread Mike Hommey
On Tue, Apr 01, 2014 at 09:15:12AM -0400, Jeff King wrote:
> > It seems to me fast-import keeps a kind of human readable format for its
> > protocol, i wonder if xdelta format would fit the bill. That being said,
> > I also wonder if i shouldn't just try to write a pack on my own...
> 
> The fast-import commands are human readable, but the blob contents are
> included inline. I don't see how sending a binary delta is any worse
> than sending a literal binary blob over the stream.

OTOH, the xdelta format is not exactly straightforward to produce, with
the variable length encoding of integers. Not exactly hard, but when
everything else in fast-import is straightforward, one has to wonder.

Mike


Re: fast-import deltas

2014-04-01 Thread Mike Hommey
On Tue, Apr 01, 2014 at 10:14:02AM -0700, Junio C Hamano wrote:
> Mike Hommey  writes:
> 
> > On Tue, Apr 01, 2014 at 09:15:12AM -0400, Jeff King wrote:
> >> > It seems to me fast-import keeps a kind of human readable format for its
> >> > protocol, i wonder if xdelta format would fit the bill. That being said,
> >> > I also wonder if i shouldn't just try to write a pack on my own...
> >> 
> >> The fast-import commands are human readable, but the blob contents are
> >> included inline. I don't see how sending a binary delta is any worse
> >> than sending a literal binary blob over the stream.
> >
> > OTOH, the xdelta format is not exactly straightforward to produce, with
> > the variable length encoding of integers. Not exactly hard, but when
> > everything else in fast-import is straightforward, one has to wonder.
> 
> Unless you already have your change in the xdelta on hand, or the
> format your foreign change is in gives sufficient information to
> produce a corresponding xdelta without looking at the content that
> your foreign change applies to, it is silly to try to convert your
> foreign change into xdelta and feed it to fast-import.
> 
> What constitutes "sufficient" information?  The xdelta format is a
> series of instructions that lets you:
> 
>  - copy N bytes from offset in the source material to the
>destination; or
>  - copy these N literal bytes to the destination.
> 
> to an existing piece of content, identified by the object name of
> the "source material", to produce a result of "applying delta".

The patch format I'm getting from mercurial boils down to:
  - replace N bytes from offset in the source material with the given
M bytes.
Which can easily be converted to xdelta without looking at the original.

Mike


Re: fast-import deltas

2014-04-01 Thread Mike Hommey
On Tue, Apr 01, 2014 at 03:32:49PM -0700, Junio C Hamano wrote:
> [Footnote]
> 
> *1* I am still not sure how useful the feature would be outside
> slurping from Hg and Git---for obvious reasons, though.  As long as
> the change is to a cleanly isolated codepath, it would be OK, I
> guess.

That's why I started the thread by asking if there would be some
interest for this feature. I'm not even sure it would be entirely
beneficial to my usecase, just a hunch.

Mike


Re: fast-import deltas

2014-04-01 Thread Mike Hommey
On Wed, Apr 02, 2014 at 01:29:13AM +0200, Max Horn wrote:
> 
> On 01.04.2014, at 15:15, Jeff King  wrote:
> 
> > On Tue, Apr 01, 2014 at 10:07:03PM +0900, Mike Hommey wrote:
> > 
> >>> For my own curiosity, how does this differ from what is in
> >>> contrib/remote-helpers/git-remote-hg?
> >> 
> >> contrib/remote-helpers/git-remote-hg does a local mercurial clone
> >> before doing the git conversion. While this is not really a problem
> >> for most mercurial projects, it tends to be slow with big ones,
> >> like the firefox source code. What I'm aiming at is something that
> >> can talk directly to a remote mercurial server.
> > 
> > Ah, that makes sense. Thanks for explaining.
> 
> 
> Hm, myself, I am not quite convinced. Yes, there is an overhead, but
> it is one-time (well, the space overhead is not, but Mike only
> mentioned time, not space).

I didn't mention it, but it definitely is a factor. As for the
performance factor, it certainly is more than a one-time overhead.

> I wonder if it is really worth the effort
> to start yet another project on this... Moreover, I don't see a
> fundamental reason why one could not modify git-remote-hg to work this
> way.

The way git-remote-hg works is fundamentally different to how one can
directly get and push stuff to a remote mercurial server. As such, there
is not much value in the current git-remote-hg code for that purpose.
Also, I'm currently prototyping something to see whether what I think
should work really works. Starting from the current git-remote-hg code
would add useless constraints to the prototype.

Mike


Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
Hi,

For some reason, I need to know the sha1 corresponding to some marks
I'm putting in a fast-import stream. Unfortunately, this does not appear
to be possible.
- I'd rather not require a checkpoint to export marks each time I need
  such a sha1, and I'd rather not do that work that requires them after
  all the marks have been created (although currently, I'm forced to).
- fast-import's cat-blob only allows to cat blobs, which, well, is kind
  of in its name; how about adding an equivalent to the git-cat-file
  command?
- fast-import's `ls` command documentation about its output format
  mentions that the output may contain commits, so I tried the trick of
  creating a tree with commits, but fast-import then fails with:
fatal: Not a blob (actually a commit)
  which I totally understand, but then I wonder why the documentation
  mentions it and how one would get a tree containing references to
  commits. I guess the documentation should be fixed.

So, while there's a possible solution with an equivalent to the
git-cat-file command if that'd be accepted, that's also overkill for my
need, which is to simply get the sha1 corresponding to a mark. Would you
consider a patch adding a command that allows such a resolution?
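
For reference, the workaround I'd like to avoid looks roughly like this
(a made-up, minimal stream; in a long-running helper a checkpoint would
be needed before the marks file can be read back, rather than waiting
for the end of the stream):

git init marks-demo && cd marks-demo
git fast-import --export-marks=marks <<'EOF'
blob
mark :1
data 6
hello
commit refs/heads/imported
mark :2
committer Demo <demo@example.com> 0 +0000
data 5
demo
M 100644 :1 greeting
EOF
grep '^:2 ' marks    # prints ":2 <commit sha1>"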

Cheers,

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Tue, Nov 18, 2014 at 09:34:26AM +0900, Mike Hommey wrote:
> Hi,
> 
> For some reason, I need to know the sha1 corresponding to some marks
> I'm putting in a fast-import stream. Unfortunately, this does not appear
> to be possible.
> - I'd rather not require a checkpoint to export marks each time I need
>   such a sha1, and I'd rather not do that work that requires them after
>   all the marks have been created (although currently, I'm forced to).

BTW, if it so happens that all the operations that were done end up
creating objects that already existed for some reason, checkpoint
doesn't do anything, which is fine for the pack and tags, but not
necessarily so for export-marks.

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Mon, Nov 17, 2014 at 05:40:28PM -0800, Jonathan Nieder wrote:
> Mike Hommey wrote:
> 
> > - fast-import's `ls` command documentation about its output format
> >   mentions that the output may contain commits, so I tried the trick of
> >   creating a tree with commits, but fast-import then fails with:
> > fatal: Not a blob (actually a commit)
> >   which I totally understand, but then I wonder why the documentation
> >   mentions it and how one would get a tree containing references to
> >   commits. I guess the documentation should be fixed.
> 
> Odd.  Here's what happens when I try:
> 
>  $ echo "ls $(git rev-parse HEAD)" | git fast-import --quiet
>  fatal: Missing space after tree-ish: ls a4a226a366ab0a173ed9e5f70f2a95d0d21e54c5
>  fast-import: dumping crash report to .git/fast_import_crash_14080
>  $ echo "ls $(git rev-parse HEAD) " | git fast-import --quiet
>  040000 tree d3d38e7d71cb40ebbaf2798b01837b3de43fd4a1
> 
> How did you get that "Not a blob" message?

When trying to *create* a tree with a commit in it, so instead of giving
the mark for a blob to a filemodify command, giving a mark for a commit.
That is what fails with "Not a blob".
So it's not possible to create a tree with a reference to a commit, at
least with fast-import.
But, the documentation for the ls command says this:

   Output uses the same format as `git ls-tree <tree> -- <path>`:

     <mode> SP ('blob' | 'tree' | 'commit') SP <dataref> HT <path> LF

The 'commit' string certainly seems like it cannot be there.

> I think a good fix would be to teach parse_ls a mode with no 
> parameter.  Something like this (untested; needs cleanup and tests):

This would make both your commands output the same thing, right? It
wouldn't help my case :)

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Tue, Nov 18, 2014 at 11:21:37AM +0900, Mike Hommey wrote:
> On Tue, Nov 18, 2014 at 09:34:26AM +0900, Mike Hommey wrote:
> > Hi,
> > 
> > For some reason, I need to know the sha1 corresponding to some marks
> > I'm putting in a fast-import stream. Unfortunately, this does not appear
> > to be possible.
> > - I'd rather not require a checkpoint to export marks each time I need
> >   such a sha1, and I'd rather not do that work that requires them after
> >   all the marks have been created (although currently, I'm forced to).
> 
> BTW, if it so happens that all the operations that were done end up
> creating objects that already existed for some reason, checkpoint
> doesn't do anything, which is fine for the pack and tags, but not
> necessarily so for export-marks.

And while I'm here, it's sad that one needs to emit a dummy cat-blob or
ls command to wait for a checkpoint to be finished because they are the
only commands that trigger something written on the cat-blob fd that can
be waited for.

Mike.


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Mon, Nov 17, 2014 at 06:51:31PM -0800, Jonathan Nieder wrote:
> Mike Hommey wrote:
> > On Mon, Nov 17, 2014 at 05:40:28PM -0800, Jonathan Nieder wrote:
> 
> >> How did you get that "Not a blob" message?
> >
> > When trying to *create* a tree with a commit in it, so instead of giving
> > the mark for a blob to a filemodify command, giving a mark for a commit.
> > That is what fails with "Not a blob".
> 
> Ah, I see what you were trying now.  It's complaining that the data
> and mode don't match up.  See  under 'filemodify' in the manual.
> 
> Something like
> 
>   M 160000 :1 mycommit
> 
> should work fine, though that's a pretty ugly workaround for the
> inability to do
> 
>   ls :1

Actually, for my use, that ugly workaround actually improves things for
me, avoiding to use blobs in some of the stuff I want to store :) How
did I miss that? Thanks a lot for the enlightenment.

> [...]
> >> I think a good fix would be to teach parse_ls a mode with no 
> >> parameter.  Something like this (untested; needs cleanup and tests):
> >
> > This would make both your commands output the same thing, right? It
> > wouldn't help my case :)
> 
> It's easily possible my patch has a typo somewhere, but the expected
> output format would be
> 
>   commit 6066a7eac4b2bcdb86971783b583e4e408b32e81
> 
> That wouldn't help?

Oh, so `ls :1` would print out what :1 is? That would
definitely help, although with the trick above, I probably wouldn't
actually need it anymore.

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Mon, Nov 17, 2014 at 06:53:59PM -0800, Jonathan Nieder wrote:
> Mike Hommey wrote:
> 
> > BTW, if it so happens that all the operations that were done end up
> > creating objects that already existed for some reason, checkpoint
> > doesn't do anything, which is fine for the pack and tags, but not
> > necessarily so for export-marks.
> 
> Does something like this help?

I'm not sure about branches and tags, as they would mostly be noops,
but dump_marks definitely needs to happen. I did try to move dump_marks
out of the if already, and that did what I wanted.

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-17 Thread Mike Hommey
On Mon, Nov 17, 2014 at 07:27:41PM -0800, Jonathan Nieder wrote:
> Mike Hommey wrote:
> 
> > And while I'm here, it's sad that one needs to emit a dummy cat-blob or
> > ls command to wait for a checkpoint to be finished
> 
> That's a good point.  (Though relying on checkpoints to read back
> information is an ugly trick, so if we can get other commands to
> provide the information you need then that would be better.)
> 
> The old-fashioned way is to do "progress checkpoint done".  Alas, that
> writes to the progress fd, which doesn't go to the remote helper's
> stdin.  How about something like this?

How about something more generic, like "sync", whose purpose would be to
request synchronisation with fast-import, which would respond "sync" on
the cat-blob fd?

Or "ping"<->"pong".

Although... I wonder if that would really be useful for anything else
than checkpoint...

That said, I, for one, would be fine if this is not "fixed" as long as
marks can be resolved some other way (and, in fact, it may turn out that
I don't need to resolve marks to sha1s at all)

Thanks for your help

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-18 Thread Mike Hommey
On Tue, Nov 18, 2014 at 12:11:47PM +0900, Mike Hommey wrote:
> Oh, so `ls ` would print out what  is? That would
> definitely help, although with the trick above, I probably wouldn't
> actually need it anymore.

So, in the end, I was able to do everything with what's currently
provided by git fast-import, but one thing would probably make life
easier for me: being able to initialize a commit tree from a commit
that's not one of the direct parents.

Because the data I'm using gives diffs against possibly unrelated
commits, and because starting a tree from scratch is actually slow when
you have thousands of subdirectories, it would be easier if I could just
start from that tree I have a diff against and apply the changes.
Without this, there would be a lot of `ls` command emitting involved,
and I'm actually not sure that wouldn't be as slow as starting from
scratch (haven't implemented yet, so I can't tell). Also, I'm not sure
how I'm supposed to know how much to read back from `ls`.

Mike


Re: Getting a commit sha1 from fast-import in a remote-helper

2014-11-18 Thread Mike Hommey
On Tue, Nov 18, 2014 at 06:21:22PM -0800, Jonathan Nieder wrote:
> Mike Hommey wrote:
> 
> > So, in the end, I was able to do everything with what's currently
> > provided by git fast-import, but one thing would probably make life
> > easier for me: being able to initialize a commit tree from a commit
> > that's not one of the direct parents.
> 
> IIRC then 'M 040000' wants a tree object, not a commit object, so
> you'd have to do
> 
>   ls <commit> ""
>   M 040000 <tree> ""

That's what I'm planning to try; would doing:
M 040000 <tree> ""
M 0644 <dataref> some/path
D other/path

work? Or do I have to actually build a tree from the combination of the
output from various ls and those filedelete/filemodify?

Mike


Re: [PATCH] remote-hg: do not fail on invalid bookmarks

2013-12-29 Thread Mike Hommey
On Sun, Dec 29, 2013 at 12:30:02PM +0100, Antoine Pelisse wrote:
> Mercurial can have bookmarks pointing to "nullid" (the empty root
> revision), while Git can not have references to it.
> When cloning or fetching from a Mercurial repository that has such a
> bookmark, the import will fail because git-remote-hg will not be able to
> create the corresponding reference.
> 
> Warn the user about the invalid reference, and continue the import,
> instead of stopping right away.

It's not invalid, it's used to indicate deleted bookmarks. (Tags have
the same property)

Mike


Re: [PATCH] remote-hg: do not fail on invalid bookmarks

2013-12-30 Thread Mike Hommey
On Mon, Dec 30, 2013 at 08:41:13AM +0100, Antoine Pelisse wrote:
> On Sun, Dec 29, 2013 at 11:24 PM, Mike Hommey  wrote:
> > On Sun, Dec 29, 2013 at 12:30:02PM +0100, Antoine Pelisse wrote:
> >> Mercurial can have bookmarks pointing to "nullid" (the empty root
> >> revision), while Git can not have references to it.
> >> When cloning or fetching from a Mercurial repository that has such a
> >> bookmark, the import will fail because git-remote-hg will not be able to
> >> create the corresponding reference.
> >>
> >> Warn the user about the invalid reference, and continue the import,
> >> instead of stopping right away.
> >
> > It's not invalid, it's used to indicate deleted bookmarks. (Tags have
> > the same property)
> 
> Hey Mike,
> Indeed, I don't know how I ended-up with such a bookmark, but it
> prevented me from git-cloning the repository (and the backtrace was
> not very helpful at first).
> But I'm still not sure what you mean by "deleted bookmarks" ?
> I guess it's not "hg bookmark --delete", as it would not be listed at
> all. Is it "hg strip some_changeset" that end-up deleting the
> bookmarked changeset ? I think I've tested this use-case and it moved
> the bookmark to a parent changeset.

Mmmm after looking at the mercurial code, it looks like i was wrong and
bookmarks are not handled like tags. You can actually create such a
bookmark on purpose with:

$ hg bookmark -r null foo

Then, if you do, say:

$ hg up -r foo
$ echo a > a
$ hg add a
$ hg commit -m a

Then you end up with a completely new head with no ancestors in common
with the others.

In git terms,

$ hg bookmark -r null foo
$ hg up -r foo

is equivalent to

$ git checkout --orphan foo

But git never creates an actual ref in that case.

Mike


Re: Consistency question

2014-01-17 Thread Mike Hommey
On Wed, Jan 15, 2014 at 06:13:30AM -0500, Jeff King wrote:
> On Wed, Jan 15, 2014 at 11:37:08AM +0100, David Kastrup wrote:
> 
> > The question is what guarantees I have with regard to the commit date of
> > a commit in relation to that of its parent commits:
> > 
> > a) none
> > b) commitdate(child) >= commitdate(parent)
> > c) commitdate(child) > commitdate(parent)
> 
> a) none
> 
> > Obviously, I can rely on c) being true "almost always":
> 
> Actually, b) is quite often the case in automated processes (e.g., "git
> am" or "git rebase"). The author dates are different, but the committer
> dates may be in the same second.
> 
> And of course a) is the result of clock skew and software bugs.

... or importing non-git repositories that don't have commit info
separated from author info like git does. In such cases, it's usual to
duplicate the author info as commit info so that clones of the same
non-git repo end up with the same git sha1s. Mercurial easily allows
author dates to be in a non topological order.

Mike


Re: [PATCH] repack.c: chmod +w before rename()

2014-01-24 Thread Mike Hommey
On Fri, Jan 24, 2014 at 10:49:13PM +0000, brian m. carlson wrote:
> On Fri, Jan 24, 2014 at 11:24:36PM +0100, Johannes Schindelin wrote:
> > > In general, I'm wary of changing permissions on a file to suit Windows's
> > > rename because of the symlink issue and the security issues that can
> > > result.
> > 
> > I agree on the Windows issue.
> 
> I personally feel that if Windows needs help to change permissions for a
> rename, that code should only ever be used on Windows.  Doesn't
> mingw_rename automatically do this anyway, and if it doesn't, shouldn't
> we put the code there instead?  Furthermore, it makes me very nervous to
> make the file 666.  Isn't 644 enough?

Arguably, umask is supposed to take care of making things right. OTOH,
since it's the destination file that's the problem, not the renamed file,
the equivalent to mv -f would be to unlink() that file first, not to change
its permissions. That would work properly on unix too.

Mike


Re: inotify support, nearly there

2014-01-28 Thread Mike Hommey
On Wed, Jan 29, 2014 at 01:47:30PM +0700, Duy Nguyen wrote:
> Just a quick update for the enthusiasts. My branch file-watcher [1]
> has got working per-user inotify support. It's a 20 patch series so
> I'll refrain from spamming git@vger for a while, even though it hurts
> your eyes a lot less than what I have posted so far. The test suite
> ran fine with it so it's not that buggy. It has new tests too, even
> though real inotify is not tested in the new tests. Documentation is
> there, either in .txt or comments. Using it is simple:
> 
> $ mkdir ~/.watcher
> $ git file-watcher --detach ~/.watcher
> $ git config --global filewatcher.path $HOME/.watcher
> 
> There's still some polishing work to do. But I think the core logic is
> done. I have some ideas what to be polished, but I'd appreciate
> feedback if anyone uses it. We may need to make lookup code faster
> later.
> 
> MacOS, FreeBSD and Windows contributors. If you have time and are
> interested, have a look at the protocol, which is basically documented
> in file-watcher.c:handle_command(), and see if something is
> incompatible with your file notification mechanism. MacOS and FreeBSD
> may reuse code from file-watcher.c, at least the unix socket part. I'm
> not so sure about Windows. It probably needs a brand new daemon
> because little could be shared, I don't know. I deliberately design
> the daemon dumb so writing a completely new one won't be so hard. My
> plan is focus on inotify and get it merged first, then new OS support
> can come later (with refactoring if needed, but should not change the
> protocol drastically).
> 
> [1] git clone https://github.com/pclouds/git.git file-watcher

Haven't looked at the code, so I don't know if you've done that, but in
case you haven't, it would be nice to have an environment variable or a
config option to make git use the file-watcher *and* normal lstat
operations, to check consistency.

Mike


[PATCH] fast-import: support the ls command anywhere comments are accepted

2014-11-25 Thread Mike Hommey
The git-fast-import manual page says about both cat-blob and ls that they can
be used "anywhere in the stream that comments are accepted", but in practice
it turns out it was only true for cat-blob. This change makes fast-import
behavior match its documentation.

Signed-off-by: Mike Hommey 
---
 fast-import.c | 4 
 1 file changed, 4 insertions(+)

The downside of this change is that if a script relies on the fixed behavior,
it won't work with older versions of git. I'm not sure it is better than
fixing the documentation to match the unfortunate current limitation?


diff --git a/fast-import.c b/fast-import.c
index d0bd285..7fd59ef 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1910,6 +1910,10 @@ static int read_next_command(void)
 			parse_cat_blob(p);
 			continue;
 		}
+		if (skip_prefix(command_buf.buf, "ls ", &p)) {
+			parse_ls(p, NULL);
+			continue;
+		}
 		if (command_buf.buf[0] == '#')
 			continue;
 		return 0;
-- 
2.1.1



Re: [PATCH] fast-import: support the ls command anywhere comments are accepted

2014-11-25 Thread Mike Hommey
On Wed, Nov 26, 2014 at 07:25:39AM +0900, Mike Hommey wrote:
> The git-fast-import manual page says about both cat-blob and ls that they can
> be used "anywhere in the stream that comments are accepted", but in practice
> it turns out it was only true for cat-blob. This change makes fast-import
> behavior match its documentation.
> 
> Signed-off-by: Mike Hommey 
> ---
>  fast-import.c | 4 
>  1 file changed, 4 insertions(+)
> 
> The downside of this change is that if a script relies on the fixed behavior,
> it won't work with older versions of git. I'm not sure it is better than
> fixing the documentation to match the unfortunate current limitation?

Note, if my reading of the code is correct, then cat-blob can't be used
between filemodify | filedelete | filecopy | filerename | filedeleteall
| notemodify either...

Mike


"git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file

2014-11-25 Thread Mike Hommey
Hi,

I have a note tree with a bit more than 200k notes.

$ time git notes --ref foo show $sha1 > /dev/null
real    0m0.147s
user    0m0.136s
sys     0m0.008s

That's a lot of time, especially when you have a script that does that
on a fair amount of sha1s.

Now, the interesting thing is this:

$ time git ls-tree -r refs/notes/foo $sha1 ${sha1:0:2}/${sha1:2:38} ${sha1:0:2}/${sha1:2:2}/${sha1:4:36} > /dev/null
real    0m0.006s
user    0m0.008s
sys     0m0.000s

$ time git cat-file blob $objsha1 > /dev/null
real    0m0.002s
user    0m0.000s
sys     0m0.000s

And even better:

$ wc -l /tmp/note
39 /tmp/note
$ time git ls-tree refs/notes/foo $(awk '{print $1, substr($1,1,2) "/" substr($1,3), substr($1,1,2) "/" substr($1,3,2) "/" substr($1,5)}' /tmp/note) | awk '{print $3}' | git cat-file --batch > /dev/null
real    0m0.035s
user    0m0.028s
sys     0m0.004s

Reading 39 notes with ls-tree and cat-file --batch takes less time than
using git notes show for only one of them...

(and reading all 39 of them with git notes show takes 5.5s)
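
For comparison, reading them one by one boils down to a loop like this (a
sketch; /tmp/note is the same list as above, with the annotated sha1 in
the first field):

  $ time while read sha1 rest; do git notes --ref foo show $sha1; done < /tmp/note > /dev/null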

Mike


Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file

2014-11-25 Thread Mike Hommey
On Tue, Nov 25, 2014 at 08:24:49PM -0500, Jeff King wrote:
> On Tue, Nov 25, 2014 at 08:00:51PM -0500, Jeff King wrote:
> 
> > On Wed, Nov 26, 2014 at 09:42:42AM +0900, Mike Hommey wrote:
> > 
> > > I have a note tree with a bit more than 200k notes.
> > >
> > > $ time git notes --ref foo show $sha1 > /dev/null
> > > real    0m0.147s
> > > user    0m0.136s
> > > sys     0m0.008s
> > > 
> > > That's a lot of time, especially when you have a script that does that
> > > on a fair amount of sha1s.
> > 
> > IIRC, the notes code populates an in-memory data structure, which gives
> > faster per-commit lookup at the cost of some setup time. Obviously for a
> > single lookup, that's going to be a bad tradeoff (but it does make sense
> > for "git log --notes"). I don't know offhand how difficult it would be
> > to tune the data structure differently (or avoid it altogether) if we
> > know ahead of time we are only going to do a small number of lookups.
> > But Johan (cc'd) might.
> 
> One other question: how were your notes created?
> 
> I tried to replicate your setup by creating one note per commit in
> linux.git (over 400k notes total). I did it with one big mktree,
> creating a single top-level notes tree. Doing a single "git notes show"
> lookup on the tree was something like 800ms.
> 
> However, this is not what trees created by git-notes look like. It
> shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
> dynamically in a way that keeps each individual tree size low. The
> in-memory data structure then only "faults in" tree objects as they are
> needed. So a single lookup should only hit a small part of the total
> tree.
> 
> Doing a single "git notes edit HEAD" in my case caused the notes code to
> write the result using its sharding algorithm. Subsequent "git notes
> show" invocations were only 14ms.
> 
> Did you use something besides git-notes to create the tree? From your
> examples, it looks like you were accounting for the sharding during
> lookup, so maybe this is leading in the wrong direction (but if so, I
> could not reproduce your times at all even with a much larger case).

So... this is interesting. I happen to have recreated the notes tree
"manually", and now each git notes show takes under 10ms.

Now, looking at the notes tree reflog, I see that at some point, some
notes were added at the top-level of the tree, without being nested,
which is strange.
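
For what it's worth, the quickest way to see that is to list the top
level of the notes tree: properly fanned-out notes show up as
2-hex-digit subtrees, while those stray notes show up as blobs named
with a full 40-hex-digit sha1:

  $ git ls-tree refs/notes/foo | head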

And it looks like it's related to how I've been adding them, through
git-fast-import. I was using notemodify commands, and was using the
filemodify command to load the previous notes tree instead of using the
from command because I don't care about keeping the notes history.
So fast-import was actually filling the notes tree as if it were
starting over with whatever new notes were added with notemodify (which,
in a case where there were many, it filled with one level of
indirection).

I'm not sure this is a case worth fixing in fast-import. I can easily
work around it.

Mike


Re: "git notes show" is orders of magnitude slower than doing it manually with ls-tree and cat-file

2014-11-25 Thread Mike Hommey
On Tue, Nov 25, 2014 at 08:34:57PM -0500, Jeff King wrote:
> On Tue, Nov 25, 2014 at 08:24:48PM -0500, Jeff King wrote:
> 
> > However, this is not what trees created by git-notes look like. It
> > shards the object sha1s into subtrees (1a/2b/{36}), and I think does so
> > dynamically in a way that keeps each individual tree size low. The
> > in-memory data structure then only "faults in" tree objects as they are
> > needed. So a single lookup should only hit a small part of the total
> > tree.
> > 
> > Doing a single "git notes edit HEAD" in my case caused the notes code to
> > write the result using its sharding algorithm. Subsequent "git notes
> > show" invocations were only 14ms.
> > 
> > Did you use something besides git-notes to create the tree? From your
> > examples, it looks like you were accounting for the sharding during
> > lookup, so maybe this is leading in the wrong direction (but if so, I
> > could not reproduce your times at all even with a much larger case).
> 
> Hmph. Having just written all that, I looked at your example again, and
> you are running "git ls-tree -r", which would read the whole tree
> anyway. So "git notes" should be _faster_ for a single lookup.

The -r actually doesn't matter, since what's being listed is a blob, not
a tree, so there is no recursion.

Mike


[PATCH] sha1_name: avoid unnecessary sha1 lookup in find_unique_abbrev

2014-11-26 Thread Mike Hommey
An example where this happens is when doing an ls-tree on a tree that
contains a commit link. In that case, find_unique_abbrev is called
to get a non-abbreviated hex sha1, but still, a lookup is done as
to whether the sha1 is in the repository (which ends up looking for
a loose object in .git/objects), while the result of that lookup is
not used when returning a non-abbreviated hex sha1.

Signed-off-by: Mike Hommey 
---
 sha1_name.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

This hit me on a corner case, where I kind of abuse commit links and
have a tree with tens of thousands of them.

Without the patch:
  $ time git ls-tree -r HEAD | wc -l
  114987

  real  0m4.412s
  user  0m1.980s
  sys   0m2.480s

With the patch:
  $ time git ls-tree -r HEAD | wc -l
  114987

  real  0m0.205s
  user  0m0.196s
  sys   0m0.012s


diff --git a/sha1_name.c b/sha1_name.c
index 5b004f5..cb88170 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -372,10 +372,10 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len)
 	int status, exists;
 	static char hex[41];
 
-	exists = has_sha1_file(sha1);
 	memcpy(hex, sha1_to_hex(sha1), 40);
 	if (len == 40 || !len)
 		return hex;
+	exists = has_sha1_file(sha1);
 	while (len < 40) {
 		unsigned char sha1_ret[20];
 		status = get_short_sha1(hex, len, sha1_ret, GET_SHA1_QUIETLY);
-- 
2.1.1


