[RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-07 Thread Derrick Stolee
Commentary: This file format uses the large offsets from the pack-index
version 2 format, but drops the CRC32 hashes from that format.

Also: I included the HASH footer at the end only because it is already in
the pack and pack-index formats, but not because it is particularly useful
here. If possible, I'd like to remove it and speed up MIDX writes.

-- >8 --

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 149 +++
 1 file changed, 149 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt 
b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d31b03dec5
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short, with suffix ".midx")
+stores a list of objects and their offsets into multiple pack-
+files. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+A new config setting 'core.midx' must be enabled before writing
+or reading MIDX files.
+
+The MIDX files are updated by the 'midx' builtin with the
+following common parameter combinations:
+
+- 'git midx' gives the hash of the current MIDX head.
+- 'git midx --write --update-head --delete-expired' writes a new
+  MIDX file, points the MIDX head to that file, and deletes the
+  existing MIDX file if out-of-date.
+- 'git midx --read' lists some basic information about the current
+  MIDX head. Used for basic tests.
+- 'git midx --clear' deletes the current MIDX head.
+
+Design Details
+--------------
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.
+
+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+- If an object appears in multiple packfiles, then only one copy
+  is stored in the MIDX. This has a possible performance issue:
+  If an object appears as the delta-base of multiple objects from
+  multiple packs, then cross-pack delta calculations may slow down.
+  This is currently only theoretical and has not been demonstrated
+  to be a measurable issue.
+
+Current Limitations
+-------------------
+
+- MIDX files are managed only by the midx builtin and are not
+  automatically updated on clone or fetch.
+
+- There is no '--verify' option for the midx builtin to verify
+  the contents of the MIDX file against the pack contents.
+
+- Constructing a MIDX file currently requires the single-pack
+  index for every pack being added to the MIDX.
+
+- The fsck builtin does not check MIDX files, but should.
+
+- The repack builtin is not aware of the MIDX files, and may
+  invalidate the MIDX files by deleting existing packfiles. The
+  MIDX may also be e
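
To illustrate the design above: lookup is a binary search over the
sorted OID list, whose index then selects the (pack, offset) pair from
the parallel metadata list. A minimal sketch, with illustrative names
rather than the ones used in this series, and SHA-1 assumed:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct object_id { unsigned char hash[20]; };   /* SHA-1 assumed */

struct midx_entry {
	uint32_t pack_int_id;   /* j: which packfile holds the object */
	uint64_t offset;        /* offset of the object within pack j */
};

/*
 * oids[] is the sorted OID list; entries[] is the parallel metadata
 * list. Returns 0 and fills *out on a hit, -1 on a miss.
 */
static int midx_lookup(const struct object_id *oids,
		       const struct midx_entry *entries, size_t nr,
		       const struct object_id *want, struct midx_entry *out)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {       /* O(log N) for any number of packs */
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(want->hash, oids[mid].hash, 20);

		if (!cmp) {
			*out = entries[mid];
			return 0;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;      /* not in the MIDX; scan packed_git as usual */
}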

[RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Derrick Stolee
This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
  MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

* Design a proper command-line interface for constructing and checking
  MIDX files. The current "midx" builtin is probably inadequate.

* Determine what additional changes are needed before the feature can
  be merged. Specifically, I'm interested in the interactions with
  repack and fsck. The current patch also does not update the MIDX on
  a fetch (which adds a packfile), but doing so would be valuable. Whenever
  possible, I tried to leave out features that could be added in a
  later patch.

* Consider splitting this patch into multiple patches, such as:

     i. The MIDX design document.
    ii. The command-line interface for building and reading MIDX files.
   iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following command:

git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+---------
        1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |  +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it were fully repacked, so there is still room for improvement.

Changing the command to

git log --oneline --raw --parents --abbrev=40

has no observable difference (sub-1% change in all cases). This is likely
because the repack I used puts commits and trees in a small number of
packfiles, so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

git midx --write --update-head --delete-expired --pack-dir=

As that release deploys we will gather more specific numbers on the
performance improvements and report them in this thread.

Derrick Stolee (18):
  docs: Multi-Pack Index (MIDX) Design Notes
  midx: specify midx file format
  midx: create core.midx config setting
  midx: write multi-pack indexes for an object list
  midx: create midx builtin with --write mode
  midx: add t5318-midx.sh test script
  midx: teach midx --write to update midx-head
  midx: teach git-midx to read midx file details
  midx: find details of nth object in midx
  midx: use existing midx when writing
  midx: teach git-midx to clear midx files
  midx: teach git-midx to delete expired files
  t5318-midx.sh: confirm git actions are stable
  midx: load midx files when loading packs
  midx: use midx for approximate object count
  midx: nth_midxed_object_oid() and bsearch_midx()
  sha1_name: use midx for abbreviations
  packfile: use midx for object loads

 .gitignore   |   1 +
 Documentation/config.txt |   3 +
 Documentation/git-midx.txt   | 106 
 Documentation/technical/multi-pack-index.txt | 149 +
 Documentation/technical/pack-format.txt  |  85 +++
 Makefile |   2 +
 builtin.h|   1 +
 builtin/midx.c   | 352 +++
 cache.h  |   1 +
 command-list.txt |   1 +
 config.c |   5 +
 environment.c|   2 +
 git.c|   1 +
 midx.c   | 850 +++
 midx.h   | 136 +
 packfile.c   |  79 ++-
 packfile.h  

[RFC PATCH 02/18] midx: specify midx file format

2018-01-07 Thread Derrick Stolee
A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that led us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/pack-format.txt | 85 +
 1 file changed, 85 insertions(+)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 8e5bf60be3..ab459ef142 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
 corresponding packfile.
 
 20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The multi-pack-index (MIDX) files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX format, we
+organize the body into "chunks" and provide a lookup table at the beginning
+of the body. The header includes certain length values, such as the number
+of packs, the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'M', 'I', 'D', 'X'}
+
+   4-byte version number:
+   Git currently only supports version 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)
+
+   1-byte number (I) of base multi-pack-index files:
+   This value is currently always zero.
+
+   1-byte number (C) of "chunks"
+
+   4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes providing the chunk offsets:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.
+   (Chunks are provided in file-order, so you can infer the length
+   using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of objects (N). The number of objects with first byte
+   value i is (F[i] - F[i-1]) for i > 0.
+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all objects in the MIDX are stored in lexicographic
+   order in this chunk.
+
+   Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+   Stores two 4-byte values for every object.
+   1: The pack-int-id for the pack storing this object.
+   2: The offset within the pack.
+   If all offsets are less than 2^31, then the large offset chunk
+   will not exist and offsets are stored as in IDX v1.
+   If there is at least one offset value larger than 2^32-1, then
+   the large offset chunk must exist. If the large offset chunk
+   exists and the 31st bit is on, then removing that bit reveals
+   the row in the large offsets containing the 8-byte offset of
+   this object.
+
+   [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+   8-byte offsets into large packfiles.
+
+   Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+   P * 4 bytes storing the offset in the packfile name chunk for
+   the null-terminated string containing the filename for the
+   ith packfile. The filename is relative to the MIDX file's parent
+   directory.
+
+   Packfile Names (ID: {'P', 'N', 'A', 'M'})
+   Stores the packfile names as concatenated, null-terminated strings.
+   Packfiles must be list
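
As a reading aid for the format above, here is a minimal sketch of
parsing the fixed-size header and locating the chunk lookup table. It
assumes the file is fully mapped in memory, that the chunk table
immediately follows the 16-byte header, and that the 8-byte chunk
offsets are in network order like the 4-byte values; the names are
illustrative, not the ones used in this series:

#include <stdint.h>
#include <string.h>

struct midx_header {
	unsigned char oid_version;      /* 1 = SHA-1 */
	unsigned char hash_len;         /* H */
	unsigned char num_base;         /* I; always zero for now */
	unsigned char num_chunks;       /* C */
	uint32_t num_packs;             /* P */
};

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
}

static uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* 'data' points at the start of a mapped MIDX file. */
static int read_midx_header(const unsigned char *data, struct midx_header *hdr)
{
	if (memcmp(data, "MIDX", 4))
		return -1;              /* bad signature */
	if (get_be32(data + 4) != 1)
		return -1;              /* unknown version */
	hdr->oid_version = data[8];
	hdr->hash_len = data[9];
	hdr->num_base = data[10];
	hdr->num_chunks = data[11];
	hdr->num_packs = get_be32(data + 12);
	return 0;
}

The chunk lookup table then sits at offset 16: (C + 1) rows of a 4-byte
id (read with get_be32()) and an 8-byte file offset (get_be64()),
terminated by a zero id. Once the OIDF chunk is located, a lookup for
an OID whose first byte is b only needs to binary-search the slice
[F[b-1], F[b]) of the OID list, with F[-1] taken as zero.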

Re: Question about the ahead-behind computation and status

2017-12-15 Thread Derrick Stolee

On 12/15/2017 1:30 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


The biggest reason for the 20 seconds is not just the number of
commits in the ahead/behind but how many commits are walked (including
common to both branches) before paint_down_to_common() breaks its
while loop due to queue_has_nonstale().

Hmm, queue_has_nonstale() looks to see if any element is not STALE
(where the definition of STALE is "known to be a common ancestor")
by potentially checking all elements in the queue.  I wonder if we
have an opportunity for a trivial optimization?  When the caller
knows that it dug one level and added the parents that are not
stale, it does not have to ask queue_has_nonstale() if there is any
non stale element, for example.


I thought this, too, but my tracing did not show significant time spent 
in this method. 99% of the time is spent unzipping and parsing commits.


If this was taking too long, then we could track a minimum timestamp for 
a commit that entered the queue in a non-stale state, but this will 
delay the termination condition a bit since commits can be marked stale 
after they enter the queue.



What do you exactly mean by "not just the number of commits in the
ahead/behind"?  Aren't the number of these commits pretty much
proportional to the number of commits we need to paint down to
common ancestors?  Is the latter a lot larger than the former
(i.e. are we somehow not stopping when we _could_ notice that we
can with better information)?




With the wide history, there is actually a large set of commits that are 
in the common history but have newer commit times than the oldest commit 
in only one history. Consider the following ASCII art:


  A
  |
  1
  |
  2
  |
  3
  |\
  4 B
  |\|
  5 C
  |\|
  6 D
  |\|
   .
   .
   .
  |\|
  N Z
  |/
  0

Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, 
commits B,C,D,...,Z are in the common history, but still must be walked.


Now imagine these two sets are actually much MUCH wider (thousands of 
commits that are pairwise non-ancestors). This causes the number of 
walked commits to be larger than just the number of commits in the 
symmetric difference of the histories.
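
For reference, the loop in question is sketched below; this is a
condensed paraphrase of paint_down_to_common() in commit.c (git's own
types and flags; setup and result collection elided). A commit
reachable from both tips gets STALE, but its parents must still be
painted, which is why the wide common history above is walked:

	while (queue_has_nonstale(&queue)) {
		struct commit *commit = prio_queue_get(&queue); /* newest first */
		struct commit_list *parents;
		int flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);

		if (flags == (PARENT1 | PARENT2))
			flags |= STALE;         /* reachable from both: common */

		for (parents = commit->parents; parents; parents = parents->next) {
			struct commit *p = parents->item;
			if ((p->object.flags & flags) == flags)
				continue;       /* nothing new to propagate */
			parse_commit(p);        /* the unzip-and-parse cost above */
			p->object.flags |= flags;
			prio_queue_put(&queue, p);
		}
	}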


Unfortunately, generation numbers are not the only optimization needed 
to make this call be sub-100ms. A graph data structure that lists the 
edges between commits would prevent the time spent in zlib decompressing 
and parsing commits. I'm working on investigating how such a data 
structure (and corresponding file format) could integrate in the 
commit-walking code.


Thanks,
-Stolee


Re: Question about the ahead-behind computation and status

2017-12-15 Thread Derrick Stolee

On 12/15/2017 10:08 AM, Jeff Hostetler wrote:

On 12/15/2017 5:08 AM, Jeff King wrote:

On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:

[*] Sadly, the local repo was only about 20 days out of
 date (including the Thanksgiving holidays)


Taking 20 seconds to traverse 20 days worth of history sounds pretty
awful. How many commits is it?


150,000 commits, give or take.  The graph is many thousands of lanes
wide because of the merges and that isn't helping.

(But I should give you folks lots of credit -- it *only* took 20
seconds to find, unzip and decode 150,000 commit objects and walk
the commit graph.)


To give a few more data points, I created similar situation by checking 
out a big repo I hadn't updated in three months and it was 16,000 
commits behind. That took 1.5s to calculate the ahead/behind. Moving it 
100,000 commits behind, it took 5s. This repo has about 300,000 total 
commits versus the 1.5 million commits in the Windows repo.


The biggest reason for the 20 seconds is not just the number of commits 
in the ahead/behind but how many commits are walked (including common to 
both branches) before paint_down_to_common() breaks its while loop due 
to queue_has_nonstale().


Thanks,
-Stolee



Re: [PATCH] builtin/tag.c: return appropriate value when --points-at finds an empty list

2017-12-11 Thread Derrick Stolee

On 12/11/2017 8:44 AM, George Papanikolaou wrote:

`git tag --points-at` can simply return if the given rev does not have
any tags pointing to it. It's not a failure but it shouldn't return
with 0 value.


I disagree. I think the 0 return means "I completed successfully" and 
the empty output means "I didn't find any tags pointing to this object."


Changing the return value here could break a lot of scripts out in the 
wild, and I consider this to be an "API" compatibility that needs to 
stay as-is.


Where are you using "--points-at" such that you need a nonzero exit code 
instead of a different indicator?


Thanks,
-Stolee



[RFC] 'unsigned long' to 'size_t' conversion

2017-12-06 Thread Derrick Stolee
There are several places in Git where we refer to the size of an object 
by an 'unsigned long' instead of a 'size_t'. In 64-bit Linux, 'unsigned 
long' is 8 bytes, but in 64-bit Windows it is 4 bytes.


The main issue with this difference is that large objects fail to load 
(they seem to hash and store just fine). For example, the following 
'blob8gb' is an 8 GB file where the ith byte is equal to i % 256:


$ git hash-object -w --no-filters blob8gb
5391939346b98600acc0283dda24649450cec51f

$ git cat-file -s 5391939346b98600acc0283dda24649450cec51f
error: bad object header
fatal: git cat-file: could not get object info

An existing discussion can be found here: 
https://github.com/git-for-windows/git/issues/1063


The error message results from unpack_object_header_buffer() which had 
its most-recent meaningful change in 'ea4f9685: 
unpack_object_header_buffer(): clear the size field upon error' (in 2011).
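
The width mismatch is easy to demonstrate in isolation; a standalone
illustration (not git code):

#include <stdio.h>

int main(void)
{
	unsigned long long size = 8ULL << 30;           /* 8 GB object size */
	unsigned long stored = (unsigned long)size;     /* 32 bits under LLP64 */

	/*
	 * On 64-bit Windows (LLP64) 'unsigned long' is 4 bytes, so this
	 * prints "8589934592 becomes 0"; on 64-bit Linux (LP64) the value
	 * survives intact.
	 */
	printf("%llu becomes %lu\n", size, stored);
	return 0;
}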


In my opinion, the correct thing to do would be to replace all 'unsigned 
long's that refer to an object size with 'size_t'. 
However, a simple "git grep 'unsigned long size'" reveals 194 results, 
and there are other permutations of names and pointer types all over.


This conversion would be a significant patch, so I wanted to get the 
community's thoughts on this conversion.


If there are small, isolated chunks that can be done safely, then this 
may be a good target for a first patch.


Thanks,

-Stolee



Re: [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-04 Thread Derrick Stolee

On 12/4/2017 11:56 AM, Jeff King wrote:

When you put your cover letter first, you need to use "scissors" like:
-- >8 --

to separate it from the commit message. Using three-dashes means "git
am" will include your cover letter as the commit message, and omit your
real commit message entirely.


Thanks. I updated our internal best-practices accordingly.

-Stolee


[PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-04 Thread Derrick Stolee
Thanks for the feedback on v1. This version fixes the cruft_cb
bug and streamlines the strlen(). I would include an inter-diff
but it was the same size as the patch.

Thanks,
-Stolee

---

Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --parents --raw -1000'
had the following performance:

 HEAD~1            HEAD

 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c  | 12 +++-
 t/perf/p4211-line-log.sh |  4 
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2fc8fa93b4 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1903,7 +1903,6 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
origlen = path->len;
strbuf_complete(path, '/');
strbuf_addf(path, "%02x", subdir_nr);
-   baselen = path->len;
 
dir = opendir(path->buf);
if (!dir) {
@@ -1914,15 +1913,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
 
oid.hash[0] = subdir_nr;
+   strbuf_addch(path, '/');
+   baselen = path->len;
 
while ((de = readdir(dir))) {
+   size_t namelen;
if (is_dot_or_dotdot(de->d_name))
continue;
 
+   namelen = strlen(de->d_name);
strbuf_setlen(path, baselen);
-   strbuf_addf(path, "/%s", de->d_name);
-
-   if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
+   strbuf_add(path, de->d_name, namelen);
+   if (namelen == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
if (obj_cb) {
@@ -1941,7 +1943,7 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
closedir(dir);
 
-   strbuf_setlen(path, baselen);
+   strbuf_setlen(path, baselen - 1);
if (!r && subdir_cb)
r = subdir_cb(subdir_nr, path->buf, data);
 
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+   git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0



Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-01 Thread Derrick Stolee

On 12/1/2017 1:22 PM, Jeff King wrote:

On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote:

[snip]

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644

This overall looks good, but I noticed one bug and a few cosmetic
improvements.


Thanks for finding quality improvements to this patch. I'll let it sit 
over the weekend for more feedback before sending a v2.





--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
  
  	oid.hash[0] = subdir_nr;

+   strbuf_add(path, "/", 1);

Maybe:

   strbuf_addch(path, '/');

would be a bit more readable (it's also a tiny bit faster, but this
isn't part of the tight loop).


+   baselen = path->len;

We set this here so that the '/' is included as part of the base. Makes
sense, but can we now drop the earlier setting of baselen before the
opendir() call?


Yeah, probably. I had briefly considered just adding the '/' before the 
first assignment of "baselen", but didn't want to change the error 
output. I also don't know if there are side effects for other platforms 
by calling opendir() with a '/'-terminated path.



while ((de = readdir(dir))) {
if (is_dot_or_dotdot(de->d_name))
continue;
  
-		strbuf_setlen(path, baselen);

-   strbuf_addf(path, "/%s", de->d_name);
-
if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
+   strbuf_setlen(path, baselen);
+   strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);

Moving this code into the conditional makes sense here, since that's
where we know the actual size. But what about the case when we _don't_
hit this conditional. We still do:

   if (cruft_cb)
 cruft_cb(path->buf);

which is now broken (it will just get the directory name without the
entry appended).


Good catch! A big reason to pull it inside and use strbuf_add over 
strbuf_addstr is to avoid a duplicate strlen() calculation. However, I 
can store the length before the conditional.



I think the optimized versions probably needs to be something more like:

   while (de = readdir(...)) {
   strbuf_setlen(path, baselen);
   if (size is HEXSZ - 2) {
   strbuf_add(path, de->d_name, HEXSZ - 2);
  obj_cb(path->buf);
   } else if (cruft_cb) {
   strbuf_addstr(path, de->d_name);
  cruft_cb(path->buf);
   }
   }


Small change by storing the length in advance of the conditional:

while (de = readdir(...)) {
    int namelen = strlen(de->d_name);
    strbuf_setlen(path, baselen);
    strbuf_add(path, de->d_name, namelen);

    if (namelen == HEXSZ - 2)
        obj_cb(path->buf)
    else
        cruft_cb(path->buf);
}


Two other thoughts:

   - from past discussions, I suspect most of your performance
 improvement actually comes from avoiding addf(), and the difference
 between addstr() and add() may be negligible here. It might be worth
 timing strbuf_addstr(). If that's similarly fast, that means we
 could keep the logic out of the conditional entirely.


addstr() duplicates the strlen(), which isn't much but we might as well 
save cycles where we can if it isn't too hard.



   - there's an extra micro-optimization there, which is that if there's
 no obj_cb, we have no need to assemble the full path at all. I doubt
 it makes much of a difference, as most callers would pass an object
 callback (I'd be surprised if we even have one that doesn't).


After doing a few 'git grep' commands, I found several that include a 
NULL cruft_cb but none that have a NULL obj_cb.


Thanks,
-Stolee


[PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-01 Thread Derrick Stolee
Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --raw --parents -1000'
had the following performance:

 HEAD~1            HEAD

 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c  | 7 ---
 t/perf/p4211-line-log.sh | 4 
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
 
oid.hash[0] = subdir_nr;
+   strbuf_add(path, "/", 1);
+   baselen = path->len;
 
while ((de = readdir(dir))) {
if (is_dot_or_dotdot(de->d_name))
continue;
 
-   strbuf_setlen(path, baselen);
-   strbuf_addf(path, "/%s", de->d_name);
-
if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
+   strbuf_setlen(path, baselen);
+   strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
if (obj_cb) {
				r = obj_cb(&oid, path->buf, data);
if (r)
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+   git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0



Re: cherry-pick very slow on big repository

2017-11-10 Thread Derrick Stolee

On 11/10/2017 7:37 AM, Peter Krefting wrote:

Jeff King:


Can you get a backtrace? I'd do something like:


Seems that it spends most time in diffcore_count_changes(), that is 
where it hits whenever I hit Ctrl+C (various line numbers 199-207 in 
diffcore-delta.c; this is on the v2.15.0 tag).


(gdb) bt
#0  diffcore_count_changes (src=src@entry=0x5db99970,
    dst=dst@entry=0x5d6a4810,
    src_count_p=src_count_p@entry=0x5db8,
    dst_count_p=dst_count_p@entry=0x5d6a4838,
    src_copied=src_copied@entry=0x7fffd3e0,
    literal_added=literal_added@entry=0x7fffd3f0)
    at diffcore-delta.c:203
#1  0x556dee1a in estimate_similarity (minimum_score=3,
    dst=0x5d6a4810, src=0x5db99970) at diffcore-rename.c:193
#2  diffcore_rename (options=options@entry=0x7fffd4f0)
    at diffcore-rename.c:560
#3  0x55623d83 in diffcore_std (
    options=options@entry=0x7fffd4f0) at diff.c:5846
...


Git is spending time detecting renames, which implies you probably 
renamed a folder or added and deleted a large number of files. This 
rename detection is quadratic (# adds times # deletes).
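
Schematically, the cost comes from scoring every deleted path against 
every added path before the best pairs are chosen. A sketch of the shape 
of that loop, with a stub standing in for diffcore-rename's real 
similarity estimate:

#include <stddef.h>

struct file_spec { const char *path; /* plus blob data in the real code */ };

/* stand-in for diffcore-rename's estimate_similarity() */
static int similarity_stub(const struct file_spec *src,
			   const struct file_spec *dst)
{
	(void)src;
	(void)dst;
	return 0;
}

/* O(nd * na): each estimate may read and fingerprint both blobs */
static void score_all_pairs(const struct file_spec *deleted, size_t nd,
			    const struct file_spec *added, size_t na)
{
	size_t i, j;

	for (i = 0; i < nd; i++)
		for (j = 0; j < na; j++)
			similarity_stub(&deleted[i], &added[j]);
}

Exceeding diff.renameLimit makes git skip this matrix entirely, which 
is why the setting below helps.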


You can remove this rename detection by running your cherry-pick with 
`git -c diff.renameLimit=1 cherry-pick ...`


See https://git-scm.com/docs/diff-config#diff-config-diffrenameLimit

Thanks,
-Stolee


Re: [RFC] protocol version 2

2017-10-25 Thread Derrick Stolee

On 10/20/2017 1:18 PM, Brandon Williams wrote:

  Overview
==========

This document presents a specification for a version 2 of Git's wire
protocol.  Protocol v2 will improve upon v1 in the following ways:

   * Instead of multiple service names, multiple commands will be
 supported by a single service
   * Easily extendable as capabilities are moved into their own section
 of the protocol, no longer being hidden behind a NUL byte and
 limited by the size of a pkt-line (as there will be a single
 capability per pkt-line).
   * Separate out other information hidden behind NUL bytes (e.g. agent
 string as a capability and symrefs can be requested using 'ls-ref')
   * Ref advertisement will be omitted unless explicitly requested
   * ls-ref command to explicitly request some refs


Hi Brandon,

I'm very interested in your protocol as a former server-side dev for the 
VSTS Git server, and understand some of these headaches. We built 
limited refs specifically to target the problem you are solving with 
ls-ref, but it requires knowledge about the authenticated user in order 
to work. I believe your suggestion is a better solution for the Git 
protocol.


The "easily extendable" part has specifically caught my interest, as we 
(Microsoft) would like to move most of the GVFS protocol into core Git, 
and this is a great way to do it. Even if not all features are accepted 
by upstream, we could use our GVFS-specific fork of Git to communicate 
to our servers without breaking normal users' interactions.


Please CC me in future versions of this proposal. Let me know if you 
want to chat directly about the "TODO" items below.


Speaking of TODOs, how much of this concept do you have working in a 
prototype? Do you have code that performs this version 2 handshake and 
communicates the ls-refs result?



  Ls-refs
---------

Ls-refs can be looked at as the equivalent of the current ls-remote as
it is a way to query a remote for the references that it has.  Unlike
the current ls-remote, the filtering of the output is done on the server
side by passing a number of parameters to the server-side command
instead of the filtering occurring on the client.

Ls-ref takes in the following parameters:

   --head, --tags: Limit to only refs/heads or refs/tags


Nit: It would be better to use "--heads" to match refs/heads and your 
use of "--tags" for refs/tags.



   --refs: Do not show peeled tags or pseudorefs like HEAD


Assuming we are in the case where the server has a HEAD ref, why would 
that ever be advertised? Also, does this imply that without the --refs 
option we would peel annotated tags until we find non-tag OIDs? Neither 
of these functions seem useful as default behavior.



   --symref: In addition to the object pointed by it, show the underlying
 ref pointed by it when showing a symbolic ref
   <ref-pattern>: When specified, only references matching the given patterns
  are displayed.


Can you be specific about the patterns? For instance, it is not a good 
idea to allow the client to submit a regex for the server to compute. 
Instead, can we limit this pattern-matching to a prefix-set, such as the 
following list of prefixes:


    refs/heads/master
    refs/releases/*
    refs/heads/user/me/*
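
A minimal sketch of what that server-side filtering could look like, 
restricted to exact names and trailing-'*' prefixes so no 
client-supplied regex is ever evaluated (illustrative only, not part of 
the proposal):

#include <string.h>

static int ref_matches(const char *refname, const char **patterns, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(patterns[i]);

		if (len && patterns[i][len - 1] == '*') {
			/* trailing-asterisk pattern: literal prefix match */
			if (!strncmp(refname, patterns[i], len - 1))
				return 1;
		} else if (!strcmp(refname, patterns[i])) {
			return 1;       /* exact ref name */
		}
	}
	return 0;
}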

  Fetch
-------

Fetch will need to be a modified version of the v1 fetch protocol.  Some
potential areas for improvement are: Ref-in-want, CDN offloading,
Fetch-options.

Since we'll have an 'ls-ref' service we can eliminate the need of fetch
to perform a ref-advertisement, instead a client can run the 'ls-refs'
service first, in order to find out what refs the server has, and then
request those refs directly using the fetch service.

//TODO Flush out the design

  Fetch-object
--------------

This service could be used by partial clones in order to request missing
objects.

//TODO Flush out the design


As you flesh out these "fetch" and "fetch-object" commands, keep in mind 
that partial clones could mean any of the following:


 * fetch all reachable objects except for blobs.

 * fetch all reachable objects except for blobs above a
   certain size.

 * fetch all commits, trees, (and blobs?) within a certain
   "cone" of the file system.


  Push
------

Push will need to be a modified version of the v1 push protocol.  Some
potential areas for improvement are: Fix push-options, Negotiation for
force push.


Negotiation is something to keep in mind for all pushes, especially in 
an ecosystem full of fork-based workflows. If you are working across 
forks and someone else syncs data between your remotes, you may re-push 
a large chunk of objects that are already present in a fork. Adding an 
ls-refs step before push would be a step in the right direction.

  Other Considerations
======================

   * Move away from pkt-line framing?
   * Have responses structured in well known formats (e.g. JSON)
   * Eliminate initial round-trip using 'GIT_PROTOCOL' side-channel
   * Additional 

Re: [PATCH] revision: quit pruning diff more quickly when possible

2017-10-13 Thread Derrick Stolee

On 10/13/2017 11:27 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:


On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case of the
bug it seems it will not be a huge performance improvement in most cases.
Still worth fixing and I'm looking at your suggestions to try and learn this
area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

Here it is cleaned up and with a commit message. There's another case
that can be optimized, too: --remove-empty with an all-deletions commit.
That's probably even more obscure and pathological, but it was easy to
cover in the same breath.

I didn't bother making a perf script, since this really isn't indicative
of real-world performance. If we wanted to do perf regression tests
here, I think the best path forward would be:

   1. Make sure the perf tests cover pathspecs (maybe in p0001?).

   2. Make it easy to run the whole perf suite against a "bomb" repo.
  This surely isn't the only slow thing of interest.

-- >8 --
Subject: revision: quit pruning diff more quickly when possible

When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.

But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.

But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:

   perl -e '
 chomp(my $blob = `git hash-object -w --stdin </dev/null`);

[snip]

Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).

Signed-off-by: Jeff King <p...@peff.net>
---
  diff.h |  1 +
  revision.c | 16 +---
  2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/diff.h b/diff.h
index 7dcfcfbef7..4a34d256f1 100644
--- a/diff.h
+++ b/diff.h
@@ -180,6 +180,7 @@ struct diff_options {
pathchange_fn_t pathchange;
change_fn_t change;
add_remove_fn_t add_remove;
+   void *change_fn_data;
diff_format_fn_t format_callback;
void *format_callback_data;
diff_prefix_fn_t output_prefix;
diff --git a/revision.c b/revision.c
index 8fd222f3bf..a3f245e2cc 100644
--- a/revision.c
+++ b/revision.c
@@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct 
rev_info *revs,
   * if the whole diff is removal of old data, and otherwise
   * REV_TREE_DIFFERENT (of course if the trees are the same we
   * want REV_TREE_SAME).
- * That means that once we get to REV_TREE_DIFFERENT, we do not
- * have to look any further.
+ *
+ * The only time we care about the distinction is when
+ * remove_empty_trees is in effect, in which case we care only about
+ * whether the whole change is REV_TREE_NEW, or if there's another type
+ * of change. Which means we can stop the diff early in either of these
+ * cases:
+ *
+ *   1. We're not using remove_empty_trees at all.
+ *
+ *   2. We saw anything except REV_TREE_NEW.
   */
  static int tree_difference = REV_TREE_SAME;
  
@@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,

const char *fullpath, unsigned dirty_submodule)
  {
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
+   struct rev_info *revs = options->change_fn_data;
  
  	tree_difference |= diff;

-   if (tree_difference == REV_TREE_DIFFERENT)
+   if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
DIFF_OPT_SET(options, HAS_CHANGES);
  }
  
@@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)

	DIFF_OPT_SET(&revs->pruning, QUICK);
revs->pruning.add_remove = file_add_remove;
revs->pruning.change = file_change;
+	revs->pruning.change_fn_data = revs;

Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:26 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case of the
bug it seems it will not be a huge performance improvement in most cases.
Still worth fixing and I'm looking at your suggestions to try and learn this
area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

-Peff


I just ran a first-level folder history in the Windows repository and 
`git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, 
a good percentage improvement but even on this enormous repo the bug 
doesn't present an issue to human users.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:20 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:


Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
  
  	tree_difference |= diff;

-   if (tree_difference == REV_TREE_DIFFERENT)
+   if (tree_difference & REV_TREE_DIFFERENT)
DIFF_OPT_SET(options, HAS_CHANGES);
  }
  


But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).

I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).

Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).

Answering my own question a little, there is a hint in the comment
a few lines above:

   /*
* The goal is to get REV_TREE_NEW as the result only if the
* diff consists of all '+' (and no other changes), REV_TREE_OLD
* if the whole diff is removal of old data, and otherwise
* REV_TREE_DIFFERENT (of course if the trees are the same we
* want REV_TREE_SAME).
* That means that once we get to REV_TREE_DIFFERENT, we do not
* have to look any further.
*/

So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the different between NEW, OLD, and DIFFERENT.

Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():

  case REV_TREE_NEW:
   if (revs->remove_empty_trees &&
   rev_same_tree_as_empty(revs, p)) {
   /* We are adding all the specified
* paths from this parent, so the
* history beyond this parent is not
* interesting.  Remove its parents
* (they are grandparents for us).
* IOW, we pretend this parent is a
* "root" commit.
*/
   if (parse_commit(p) < 0)
   die("cannot simplify commit %s (invalid %s)",
    oid_to_hex(&commit->object.oid),
    oid_to_hex(&p->object.oid));
   p->parents = NULL;
   }
 /* fallthrough */
 case REV_TREE_OLD:
 case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.

-Peff


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is 
causing diff_can_quit_early() to return false. Due to the corner-case of 
the bug it seems it will not be a huge performance improvement in most 
cases. Still worth fixing and I'm looking at your suggestions to try and 
learn this area better.


It will speed up natural cases of someone adding or renaming a folder 
with a lot of contents, including someone initializing a repository from 
an existing codebase. This git bomb case happens to add all of the 
folders at once, which is why the performance is so noticeable. I use a 
version of this "exponential file growth" for testing that increases the 
folder-depth one commit at a time, so the contents are rather small in 
the first "add", hence not noticing this issue.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:50 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:


Since I don't understand enough about the consumers to diff_tree_oid() (and
the fact that the recursive behavior may be wanted in some cases), I think
we can fix this in builtin/rev-list.c with this simple diff:

---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char
*prefix)

     git_config(git_default_config, NULL);
     init_revisions(&revs, prefix);
+
+   revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+


(Note: I'm running tests now and see that this breaks behavior. 
Definitely not the solution we want.)



Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes.

We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
your pathspec goes). We should be able to see immediately that the entry
is not present between the two and not bother descending. After all,
we've set the QUICK flag in init_revisions(). So the real question is
why QUICK is not kicking in.

-Peff


I'm struggling to understand your meaning. We want to walk from root to 
d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe 
that's what the QUICK flag is supposed to do.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:15 AM, Derrick Stolee wrote:

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder <christian.cou...@gmail.com> writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?

Anything that is possible with a software should be considered as a possible
usecase. It's in fact a DoS attack. Imagine there's a server that using git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional stand point.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by 
VSTS. In our algorithm, we only walk the list of objects from commit 
to the tree containing the path item. For example, in the path 
d0/d0/d0, we would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.


The reason we have exponential behavior in `git rev-list` is because 
we are calling diff_tree_oid() in tree-diff.c recursively without 
short-circuiting on equal OIDs.


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: 
avoid duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in 
deep repositories. It will not be very measurable in the git or Linux 
repos because of their shallow folder structure.


Thanks,
-Stolee


Since I don't understand enough about the consumers to diff_tree_oid() 
(and the fact that the recursive behavior may be wanted in some cases), 
I think we can fix this in builtin/rev-list.c with this simple diff:


---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const 
char *prefix)


    git_config(git_default_config, NULL);
    init_revisions(&revs, prefix);
+
+   revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
    revs.abbrev = DEFAULT_ABBREV;
    revs.commit_format = CMIT_FMT_UNSPECIFIED;
    argc = setup_revisions(argc, argv, &revs, NULL);



Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder  writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?


Anything that is possible with a software should be considered as a possible
usecase. It's in fact a DoS attack. Imagine there's a server that using git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional stand point.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by VSTS. 
In our algorithm, we only walk the list of objects from commit to the 
tree containing the path item. For example, in the path d0/d0/d0, we 
would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.
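
A sketch of that walk is below; lookup_tree_entry() is a hypothetical 
stand-in for "read one tree object and find the named entry's OID" 
(this is neither the VSTS code nor git's, and the both-sides-missing 
case is elided):

#include <string.h>

struct object_id { unsigned char hash[20]; };

/* hypothetical helper: returns 0 and fills *out, or -1 if absent */
int lookup_tree_entry(const struct object_id *tree, const char *name,
		      struct object_id *out);

/*
 * Is a commit TREESAME to its parent for a path like "d0/d0/d0"?
 * Compare only the trees on that path, never their siblings.
 */
static int treesame_for_path(struct object_id c_tree, struct object_id p_tree,
			     const char **components, int depth)
{
	int i;

	for (i = 0; i < depth; i++) {
		struct object_id a, b;

		if (lookup_tree_entry(&c_tree, components[i], &a) ||
		    lookup_tree_entry(&p_tree, components[i], &b))
			return 0;       /* path missing on one side */
		if (!memcmp(a.hash, b.hash, sizeof(a.hash)))
			return 1;       /* equal subtrees: TREESAME here */
		c_tree = a;             /* differ: descend one level */
		p_tree = b;
	}
	return 0;       /* the entry at the full path differs */
}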


The reason we have exponential behavior in `git rev-list` is because we 
are calling diff_tree_oid() in tree-diff.c recursively without 
short-circuiting on equal OIDs.


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid 
duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in 
deep repositories. It will not be very measurable in the git or Linux 
repos because of their shallow folder structure.


Thanks,
-Stolee


Re: [PATCH v5 0/4] Improve abbreviation disambiguation

2017-10-12 Thread Derrick Stolee

On 10/12/2017 8:02 AM, Derrick Stolee wrote:

Changes since previous version:

* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages

Thanks,
  Stolee

I forgot to mention that I rebased on master this morning to be sure 
this doesn't conflict with the binary-search patch that was queued this 
week.


Thanks,
 Stolee


[PATCH v5 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-12 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.
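
The neighbor comparison reduces to computing a common hex-prefix
length; a standalone sketch of that helper (the patch itself does this
via extend_abbrev_len(), so this is for illustration only):

static int common_hex_prefix_len(const unsigned char *a,
				 const unsigned char *b, int hexsz)
{
	int i;

	for (i = 0; i < hexsz; i++) {
		unsigned char av = (i & 1) ? (a[i / 2] & 0x0f) : (a[i / 2] >> 4);
		unsigned char bv = (i & 1) ? (b[i / 2] & 0x0f) : (b[i / 2] >> 4);

		if (av != bv)
			break;
	}
	return i;
}

One more than the longest such length over the neighboring objects is
the shortest unique abbreviation within that pack.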

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD     | Rel%  |
|-------|--------|----------|----------|-------|
|     1 |      0 |  41.27 s |  38.93 s | -4.8% |
|    24 |      0 |  98.04 s |  91.35 s | -5.7% |
|    23 | 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 76 +
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index fdd2711a6..7fd5f5f71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
uint32_t num, last, i, first = 0;
const struct object_id *current = NULL;
 
-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
num = p->num_objects;
last = num;
while (first < last) {
@@ -478,6 +480,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +508,67 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);
+   } else if (first < num - 1) {
+   nth_packed_object_oid(&oid, p, first + 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   if (first > 0) {
+   nth_packed_object_oid(&oid, p, first - 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+   struct packed_git *p;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next)
+   find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
struct disambiguate_state ds;
@@ -536,19 +600,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
 
-   if (init_object_disambiguation(hex, len, &ds) < 0)
-   return -1;
-
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
+   mad.hash = sha1;
+
+   find_abbrev_len_packed(&mad);
+
+   if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+   return -1;
 
ds.fn = extend_abbrev_len;
ds.always_call_fn = 1;
ds.cb_data = (void*)&mad;
 
find_short_object_filename(&ds);
-   find_short_packed_object(&ds);
(void)finish_object_disambiguation(&ds, &oid_ret);
 
hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test

2017-10-12 Thread Derrick Stolee
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time computing
diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+   git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 2/4] sha1_name: unroll len loop in find_unique_abbrev_r

2017-10-12 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index c7c5ab376..19603713f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 3/4] sha1_name: parse less while finding common prefix

2017-10-12 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.
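
A worked example of the nibble indexing (editorial note; the function 
itself appears in the diff below): for hash bytes { 0xab, 0xcd, 0xef },

    pos = 4 -> hex[hash[2] >> 4]  = hex[0xe] = 'e'
    pos = 5 -> hex[hash[2] & 0xf] = hex[0xf] = 'f'

so one character is produced without formatting all 40 of them.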

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 19603713f..fdd2711a6 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+unsigned int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 0/4] Improve abbreviation disambiguation

2017-10-12 Thread Derrick Stolee
Changes since previous version:

* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages 

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.
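
The heart of the single-pass approach can be sketched as follows (an 
editorial illustration; the actual callback in this series is 
extend_abbrev_len() in sha1_name.c):

    struct min_abbrev { unsigned int init_len, cur_len; const char *hex; };

    /* called once per object that matches the initial prefix */
    static void consider(struct min_abbrev *m, const char *candidate_hex)
    {
        unsigned int i = m->init_len;
        while (m->hex[i] && m->hex[i] == candidate_hex[i])
            i++; /* length of the common prefix with this candidate */
        if (i >= m->cur_len)
            m->cur_len = i + 1; /* one more char keeps us unambiguous */
    }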

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time
computing diffs. Below we report performance statistics for perf test
4211.6 from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|---||---|--|---|
|  1|  0 |   41.27 s |  38.93 s | -4.8% |
| 24|  0 |   98.04 s |  91.35 s | -5.7% |
| 23| 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: unroll len loop in find_unique_abbrev_r
  sha1_name: parse less while finding common prefix
  sha1_name: minimize OID comparisons during disambiguation
 sha1_name.c  | 135 +--
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 123 insertions(+), 16 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty



Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-11 Thread Derrick Stolee

On 10/10/2017 9:30 AM, Jeff King wrote:

On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote:


On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King <p...@peff.net> writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

    git pack-objects .git/objects/pack/pack </dev/null

to create an empty packfile, then the guard needs to be:

    if (open_pack_index(p) || !p->num_objects)
            return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?

I prefer to fix the problem in all code clones when they cause review
friction, so I'll send a fifth commit showing just the diff for these
packfile issues in sha1_name.c. See patch below.

Ah, that answers my earlier question. Junio meant unique_in_pack(). And
yeah, I think it suffers from the same problem.


Should open_pack_index() return a non-zero status if the packfile is empty?
Or, is there a meaningful reason to have empty packfiles?

I can't think of a compelling reason to have an empty packfile. But nor
do I think we should consider them an error when we can handle them
quietly (and returning non-zero status would cause Git to complain on
many operations in a repository that has such a file).

-Peff


Thanks for the comments. I found some typos in my commit messages, too.

I plan to send a (hopefully) final version tomorrow (Thursday). It will 
include:


* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages

If there are other issues, then please let me know.

Thanks,
-Stolee


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

    git pack-objects .git/objects/pack/pack </dev/null

to create an empty packfile, then the guard needs to be:

    if (open_pack_index(p) || !p->num_objects)
            return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?


I prefer to fix the problem in all code clones when they cause review 
friction, so I'll send a fifth commit showing just the diff for these 
packfile issues in sha1_name.c. See patch below.


Should open_pack_index() return a non-zero status if the packfile is 
empty? Or, is there a meaningful reason to have empty packfiles?


Thanks,
-Stolee


diff --git a/sha1_name.c b/sha1_name.c
index 49ba67955..9f8a33e82 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
    uint32_t num, last, i, first = 0;
    const struct object_id *current = NULL;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {
@@ -513,7 +515,9 @@ static void find_abbrev_len_for_pack(struct 
packed_git *p,

    uint32_t num, last, first = 0;
    struct object_id oid;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {



Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/9/2017 9:49 AM, Jeff King wrote:

On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote:


@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
  }
  
+static void find_abbrev_len_for_pack(struct packed_git *p,

+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
[...]

Your cover letter lists:

   * Silently skip packfiles that fail to open with open_pack_index()

as a change from the previous version. But this looks the same as the
last round. I think this _does_ end up skipping such packfiles because
p->num_objects will be zero. Is it worth having a comment to that
effect (or even just an early return) to make it clear that the
situation is intentional?

Although...


+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);

If we have zero objects in the pack, what would nth_packed_object_oid()
be returning here?

So I actually think we do want an early return, not just when
open_packed_index() fails, but also when p->num_objects is zero.

-Peff


Sorry about this. I caught this while I was writing my cover letter and 
amended my last commit to include the following:


    if (open_pack_index(p))
        return;

After I amended the commit, I forgot to 'format-patch' again. I can send 
a diff between the commits after review has calmed.


Thanks,
-Stolee


[PATCH v4 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test

2017-10-08 Thread Derrick Stolee
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time computing
diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+   git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-08 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD | Rel%  |
|---||--|--|---|
|  1|  0 |  41.27 s |  38.93 s | -4.8% |
| 24|  0 |  98.04 s |  91.35 s | -5.7% |
| 23| 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..49ba67955 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);
+   } else if (first < num - 1) {
+   nth_packed_object_oid(&oid, p, first + 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   if (first > 0) {
+   nth_packed_object_oid(&oid, p, first - 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+   struct packed_git *p;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next)
+   find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
struct disambiguate_state ds;
@@ -536,19 +596,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
 
-   if (init_object_disambiguation(hex, len, &ds) < 0)
-   return -1;
-
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
+   mad.hash = sha1;
+
+   find_abbrev_len_packed(&mad);
+
+   if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+   return -1;
 
ds.fn = extend_abbrev_len;
ds.always_call_fn = 1;
ds.cb_data = (void*)&mad;
 
find_short_object_filename(&ds);
-   find_short_packed_object(&ds);
(void)finish_object_disambiguation(&ds, &oid_ret);
 
hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 3/4] sha1_name: parse less while finding common prefix

2017-10-08 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 0/4] Improve abbreviation disambiguation

2017-10-08 Thread Derrick Stolee
Changes since previous version:

* Fixed an overflow error in the binary search. I sent a separate patch
  to fix this error in existing searches; that patch should be applied
  before this one.

* Removed test-list-objects and test-abbrev in favor of a new git log
  test in p4211-line-log.sh. Limited perf numbers for Linux repo are
  given in cover letter and commit 4/4.

* Silently skip packfiles that fail to open with open_pack_index()

Thanks for all the comments from Jeff, Junio, Ramsay, and Stefan!

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time
computing diffs. Below we report performance statistics for perf test
4211.6 from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|---||---|--|---|
|  1|  0 |   41.27 s |  38.93 s | -4.8% |
| 24|  0 |   98.04 s |  91.35 s | -5.7% |
| 23| 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: unroll len loop in find_unique_abbrev_r
  sha1_name: parse less while finding common prefix
  sha1_name: minimize OID comparisons during disambiguation

 sha1_name.c  | 129 +--
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 118 insertions(+), 15 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 2/4] sha1_name: unroll len loop in find_unique_abbrev_r

2017-10-08 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2] cleanup: fix possible overflow errors in binary search

2017-10-08 Thread Derrick Stolee
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;

This translation is safe since the operation occurs inside a loop
conditioned on "min < max". The included changes were found using
the following git grep:

git grep '/ *2;' '*.c'

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.
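
A small standalone demonstration of the hazard (editorial sketch with
illustrative values, not part of the patch):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int min = 1, max = INT_MAX;
        /*
         * (min + max) / 2 would be undefined behavior here, since
         * 1 + INT_MAX overflows a signed int. The rewritten form
         * never leaves the [min, max] range.
         */
        int mid = min + (max - min) / 2;
        printf("mid = %d\n", mid);
        return 0;
    }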

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c  | 4 ++--
 builtin/pack-objects.c| 2 +-
 builtin/unpack-objects.c  | 2 +-
 cache-tree.c  | 2 +-
 compat/regex/regex_internal.c | 4 ++--
 compat/regex/regexec.c| 2 +-
 packfile.c| 2 +-
 sha1-lookup.c | 4 ++--
 sha1_name.c   | 2 +-
 string-list.c | 2 +-
 utf8.c| 2 +-
 xdiff/xpatience.c | 2 +-
 12 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum 
object_type type)
int first = 0, last = nr_ofs_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ofs_delta_entry *delta = &ofs_deltas[next];
int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum 
object_type type)
int first = 0, last = nr_ref_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ref_delta_entry *delta = &ref_deltas[next];
int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
int lo = 0;
int hi = done_pbase_paths_num;
while (lo < hi) {
-   int mi = (hi + lo) / 2;
+   int mi = lo + (hi - lo) / 2;
if (done_pbase_paths[mi] == hash)
return mi;
if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, 
unsigned long delta_size,
lo = 0;
hi = nr;
while (lo < hi) {
-   mid = (lo + hi)/2;
+   mid = lo + (hi - lo) / 2;
if (base_offset < obj_list[mid].offset) {
hi = mid;
} else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char 
*path, int pathlen)
lo = 0;
hi = it->subtree_nr;
while (lo < hi) {
-   int mi = (lo + hi) / 2;
+   int mi = lo + (hi - lo) / 2;
struct cache_tree_sub *mdl = down[mi];
int cmp = subtree_name_cmp(path, pathlen,
   mdl->name, mdl->namelen);
diff --git a/compat/regex/regex_internal.c b/compat/regex/regex_internal.c
index d4121f2f4..98342b831 100644
--- a/compat/regex/regex_internal.c
+++ b/compat/regex/regex_internal.c
@@ -613,7 +613,7 @@ re_string_reconstruct (re_string_t *pstr, int idx, int 
eflags)
  int low = 0, high = pstr->valid_len, mid;
  do
{
- mid = (high + low) / 2;
+ mid = low + (high - low) / 2;
  if (pstr->offsets[mid] > offset)
high = mid;
  else if (pstr->offsets[mid] < offset)
@@ -1394,7 +1394,7 @@ re_node_set_contains (const re_node_set *set, int elem)
   right = set->nelem - 1;
   while (idx < right)
 {
-  mid = (idx + right) / 2;
+  mid = idx + (right - idx) / 2;
   if (set->elems[mid] < elem)
idx = mid + 1;
   else
diff --git a/compat/regex/regexec.c b/compat/regex/regexec.c
index 0a745d9c3..6f2b48a78 100644
--- a/compat/regex/regexec.c
+++ b/compat/regex/regexec.c
@@ -4284,7 +4284,7 @@ search_cur_bkref_entry (const re_match_context_t *mctx, 
int str_idx)
   last = right = mctx->nbkref_ents;
   for (left = 0; left < right;)
 {
-  mid = (left + right) / 2;
+  mid = left + (right - 

Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-07 Thread Derrick Stolee

On 10/6/2017 10:11 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 08:39:42AM -0400, Derrick Stolee wrote:

I'll run some perf numbers for these commands you recommend, and also see if
I can replicate some of the pain points that triggered this change using the
Linux repo.


Thanks!

-Peff



In my local copy, I added a test to p4211-line-log.sh that runs "git log 
--raw -r" and tested it on three copies of the Linux repo. In order, 
they have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles 
(~324,000 loose).


4211.6: git log --raw -r  43.34(42.62+0.65)   40.47(40.16+0.27)  -6.6%
4211.6: git log --raw -r  88.77(86.54+2.12)   82.44(81.87+0.52)  -7.1%
4211.6: git log --raw -r 108.86(103.97+4.81) 103.92(100.63+3.19) -4.5%

We have moderate performance gains for this command, despite the command 
doing many more things than just checking abbreviations.


I plan to re-roll my patch on Monday including the following feedback items:

* Remove test-list-objects and test-abbrev in favor of a new "git log"
  performance test.

* Fix binary search overflow error.

* Check response from open_pack_index(p) in find_abbrev_len_for_pack().
  I plan to return without failure on a non-zero result, so a bad pack
  causes no failure and the abbreviation length will be the minimum
  required among all valid packs. (Thanks Stefan!)

I made note of a few things, but will not include them in my re-roll. 
I'll revisit them later if they are valuable:


- nth_packed_object_sha1() could be simplified in
  find_abbrev_len_for_pack().

- Teach 'cat-file' to --batch-check='%(objectsize:short)'. (Peff already
  included a patch, perhaps that could be reviewed separately.)

- Ramsay caught my lack of "static" in test-list-objects.c, but that
  file will be removed in the next patch. I'll make sure to use "static"
  in the future.

I'm not re-rolling immediately to allow for some extra review over the 
weekend, if anyone is so inclined.


Thanks,
-Stolee


Re: [PATCH] cleanup: fix possible overflow errors in binary search

2017-10-06 Thread Derrick Stolee

On 10/6/2017 10:18 AM, Jeff King wrote:

On Fri, Oct 06, 2017 at 09:52:31AM -0400, Derrick Stolee wrote:


A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;

Great, thank you for picking this up!


The included changes were found using the following greps:

grep "/ 2;" *.c
grep "/ 2;" */*.c
grep "/2;" */*.c

You can use[1]:

   git grep '/ 2;' '*.c'

to have Git expand the wildcard. That catches a few extra cases in
compat/regex/*.c.  Even though it's imported code, it might be
nice to cover those, too (since it's a possible bug, and also as a good
example).

[1] I'd actually write:

   git grep '/ *2;' '*.c'

 to do it all in one grep. :)


Thanks for the grep lesson! I knew there would be a simpler way to do 
what I wanted.



---
  builtin/index-pack.c | 4 ++--
  builtin/pack-objects.c   | 2 +-
  builtin/unpack-objects.c | 2 +-
  cache-tree.c | 2 +-
  packfile.c   | 2 +-
  sha1-lookup.c| 2 +-
  sha1_name.c  | 2 +-
  string-list.c| 2 +-
  utf8.c   | 2 +-
  xdiff/xpatience.c| 2 +-
  10 files changed, 11 insertions(+), 11 deletions(-)

These all look good to me (really the only way the conversion could be
bad is if "min" was higher than "max", and each case is just inside a
loop condition which makes sure that is not the case).

-Peff

I thought this should be simple enough. When we are all happy with the 
idea of this cleanup, I'll re-roll with the missed examples.


Thanks,
-Stolee


[PATCH] cleanup: fix possible overflow errors in binary search

2017-10-06 Thread Derrick Stolee
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;

The included changes were found using the following greps:

grep "/ 2;" *.c
grep "/ 2;" */*.c
grep "/2;" */*.c

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c | 4 ++--
 builtin/pack-objects.c   | 2 +-
 builtin/unpack-objects.c | 2 +-
 cache-tree.c | 2 +-
 packfile.c   | 2 +-
 sha1-lookup.c| 2 +-
 sha1_name.c  | 2 +-
 string-list.c| 2 +-
 utf8.c   | 2 +-
 xdiff/xpatience.c| 2 +-
 10 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum 
object_type type)
int first = 0, last = nr_ofs_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ofs_delta_entry *delta = &ofs_deltas[next];
int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum 
object_type type)
int first = 0, last = nr_ref_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ref_delta_entry *delta = &ref_deltas[next];
int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
int lo = 0;
int hi = done_pbase_paths_num;
while (lo < hi) {
-   int mi = (hi + lo) / 2;
+   int mi = lo + (hi - lo) / 2;
if (done_pbase_paths[mi] == hash)
return mi;
if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, 
unsigned long delta_size,
lo = 0;
hi = nr;
while (lo < hi) {
-   mid = (lo + hi)/2;
+   mid = lo + (hi - lo) / 2;
if (base_offset < obj_list[mid].offset) {
hi = mid;
} else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char 
*path, int pathlen)
lo = 0;
hi = it->subtree_nr;
while (lo < hi) {
-   int mi = (lo + hi) / 2;
+   int mi = lo + (hi - lo) / 2;
struct cache_tree_sub *mdl = down[mi];
int cmp = subtree_name_cmp(path, pathlen,
   mdl->name, mdl->namelen);
diff --git a/packfile.c b/packfile.c
index eab754248..4a5fe7ab1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1743,7 +1743,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
   sha1[0], sha1[1], sha1[2], lo, hi, p->num_objects);
 
while (lo < hi) {
-   unsigned mi = (lo + hi) / 2;
+   unsigned mi = lo + (hi - lo) / 2;
int cmp = hashcmp(index + mi * stride, sha1);
 
if (debug_lookup)
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 2552b7902..df08f8355 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -95,7 +95,7 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t 
nr,
hi = mi;
else
lo = mi + 1;
-   mi = (hi + lo) / 2;
+   mi = lo + (hi - lo) / 2;
} while (lo < hi);
return -lo-1;
 }
diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..c7c5ab376 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -157,7 +157,7 @@ static void unique_in_pack(struct packed_git *p,
num = p->num_objects;
last = num;
while (first < last) {
-   uint32_t mid = (first + last) / 2;
+   uint32_t mid = first + (last - first) / 2;
const unsigned char *current;
int cmp;
 
diff --git a/string-list.c b/string-list.c
index 806b4c872..a0cf0cfe8 100644
--- a/

Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Derrick Stolee



On 10/5/2017 6:00 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 06:48:10PM +0900, Junio C Hamano wrote:


Jeff King writes:


This is weirdly specific. Can we accomplish the same thing with existing
tools?

E.g., could:

   git cat-file --batch-all-objects --batch-check='%(objectname)' |
   shuffle |
   head -n 100

do the same thing?

I know that "shuffle" isn't available everywhere, but I'd much rather
see us fill in portability gaps in a general way, rather than
introducing one-shot C code that needs to be maintained (and you
wouldn't _think_ that t/helper programs need much maintenance, but try
perusing "git log t/helper" output; they have to adapt to the same
tree-wide changes as the rest of the code).

I was thinking about this a bit more, and came to the conclusion
that "sort -R" and "shuf" are wrong tools to use.  We would want to
measure with something close to real world workload.  for example,
letting

git rev-list --all --objects

produce the list of objects in traversal order (i.e. this is very
similar to the order in which "git log -p" needs to access the
objects) and chomping at the number of sample objects you need in
your test would give you such a list.

Actually, I'd just as soon see timings for "git log --format=%h" or "git
log --raw", as opposed to patches 1 and 2.

You won't see a 90% speedup there, but you will see the actual
improvement that real-world users are going to experience, which is way
more important, IMHO.

-Peff

Thanks for thinking hard about this.

For some real-user context: Some engineers using Git for the Windows 
repo were seeing extremely slow commands, such as 'fetch' or 'commit', 
and when we took a trace we saw most of the time spinning in this 
abbreviation code. Our workaround so far has been to set core.abbrev=40.


I'll run some perf numbers for these commands you recommend, and also 
see if I can replicate some of the pain points that triggered this 
change using the Linux repo.


Thanks,
-Stolee


Re: [PATCH] test-list-objects: mark file-local symbols as static

2017-10-04 Thread Derrick Stolee

On 10/3/2017 5:51 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones 
---

Hi Derrick,

If you need to re-roll your 'ds/find-unique-abbrev-optim' branch,
could you please squash this into the relevant patch (commit 3792c78ba0,
"test-list-objects: list a subset of object ids", 01-10-2017).

Thanks!

ATB,
Ramsay Jones

  t/helper/test-list-objects.c | 32 
  1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
index 22bc9b4e6..5c5d3c03f 100644
--- a/t/helper/test-list-objects.c
+++ b/t/helper/test-list-objects.c
@@ -6,43 +6,43 @@ struct count {
int select_mod;
  };
  
-int count_loose(const struct object_id *oid,
-   const char *path,
-   void *data)
+static int count_loose(const struct object_id *oid,
+  const char *path,
+  void *data)
  {
((struct count*)data)->total++;
return 0;
  }
  
-int count_packed(const struct object_id *oid,
-struct packed_git *pack,
-uint32_t pos,
-void* data)
+static int count_packed(const struct object_id *oid,
+   struct packed_git *pack,
+   uint32_t pos,
+   void* data)
  {
((struct count*)data)->total++;
return 0;
  }
  
-void output(const struct object_id *oid,
-   struct count *c)
+static void output(const struct object_id *oid,
+  struct count *c)
  {
if (!(c->total % c->select_mod))
printf("%s\n", oid_to_hex(oid));
c->total--;
  }
  
-int output_loose(const struct object_id *oid,
-const char *path,
-void *data)
+static int output_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
  {
output(oid, (struct count*)data);
return 0;
  }
  
-int output_packed(const struct object_id *oid,
- struct packed_git *pack,
- uint32_t pos,
- void* data)
+static int output_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
  {
output(oid, (struct count*)data);
return 0;

Thanks, Ramsay. I applied these changes locally. I'll remember "static"
in the future.


Thanks,
-Stolee


Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-04 Thread Derrick Stolee

On 10/4/2017 2:07 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;

The "always_call_fn" thing is a big sledgehammer that overrides
everything else in update_candidates().  It bypasses the careful
machinery set up to avoid having to open ambiguous object to learn
their types as much as possible.  One narrow exception when it is OK
to use is if we never limit our candidates with type.


I do not modify get_short_oid, which uses these advanced options, 
depending on the flags given. find_unique_abbrev_r() does not use these 
advanced options.



And it might appear that the conversion is safe (if only because we
do not see any type limitation in the get_short_oid() call above),
but I think there is one case where this patch changes the
behaviour: what happens if core.disambiguate was set to anything
other than "none"?  The new code does not know anything about type
based filtering, so it can end up reporting longer abbreviation than
it was asked to produce.  It may not be a problem in practice, though.

I am not sure if setting core.disambiguate is generally a good idea
in the first place, and if it is OK to break find_unique_abbrev()
with respect to the configuration variable like this patch does.


I do not think that type-aware disambiguation goes through this code 
path, since it requires giving different parameters to get_short_oid(). 
Test t1512-rev-parse-disambituagion.sh has a test 'core.disambiguate 
config can prefer types' that verifies this behavior.



I'd feel safe if we get extra input from Peff, who introduced the
feature in 5b33cb1f ("get_short_sha1: make default disambiguation
configurable", 2016-09-27).


I look forward to more feedback. Thanks for taking the time to look at 
my patch series.


Thanks,
-Stolee


Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-04 Thread Derrick Stolee

On 10/4/2017 2:10 AM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:
...

I understand that this patch on its own does not have good numbers. I
split patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.

Now you confused me even more.  When we read the similar table that
appears in [Patch 4/5], what does the "Base Time" column mean?
Vanilla Git with [Patch 3/5] applied?  Vanilla Git with [Patch 4/5]
alone applied?  Something else?

In PATCH 3, 4, and 5, I used the commit-by-commit diff for the perf 
numbers, so the "Base Time" for PATCH 4 is the time calculated when 
PATCH 3 is applied. The table in the [PATCH 0/5] message includes the 
relative change for all commits.


I recalculated the relative change for each patch against the 
baseline (PATCH 2). Looking again, it appears I misspoke and PATCH 4 
does include a +8% change for a fully-repacked Linux repo relative to 
PATCH 2. Since PATCH 5 includes an optimization targeted directly at 
large packfiles, the final performance gain is significant in the 
fully-packed cases.


It is also worth looking at the absolute times for these cases: since 
the fully-packed case is significantly faster than the multiple-packfile 
case, the relative change impacts users less.


One final note: the improvement was clearer in test p0008.1 when the 
test included "sort -R" to shuffle the known OIDs. Providing OIDs in 
lexicographic order has had a significant effect on the performance, 
which does not reflect real-world usage. I removed the "sort -R" because 
it is a GNU-ism, but if there is a good cross-platform alternative I 
would be happy to replace it.


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

| Repo  | Baseline | Patch 3 | Rel % | Patch 4 | Rel % | Patch 5 | Rel % |
|---|--|-|---|-|---|-|---|
| Git   | 0.09 | 0.06    | -33%  | 0.05    | -44%  | 0.05    | -44%  |
| Git   | 0.11 | 0.08    | -27%  | 0.08    | -27%  | 0.08    | -27%  |
| Git   | 0.09 | 0.07    | -22%  | 0.06    | -33%  | 0.06    | -33%  |
| Linux | 0.13 | 0.32    | 146%  | 0.14    | + 8%  | 0.05    | -62%  |
| Linux | 1.13 | 1.12    | - 1%  | 0.94    | -17%  | 0.88    | -22%  |
| Linux | 1.08 | 1.05    | - 3%  | 0.86    | -20%  | 0.80    | -26%  |
| VSTS  | 0.12 | 0.23    | +92%  | 0.11    | - 8%  | 0.05    | -58%  |
| VSTS  | 1.02 | 1.08    | + 6%  | 0.95    | - 7%  | 0.95    | - 7%  |
| VSTS  | 2.25 | 2.08    | - 8%  | 1.82    | -19%  | 1.93    | -14%  |

(Each repo has three versions, in order: 1 packfile, multiple packfiles, 
and multiple packfiles and loose objects.)


Thanks,
-Stolee



Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-03 Thread Derrick Stolee

On 10/3/2017 11:55 AM, Stefan Beller wrote:

@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
 return 0;
  }

+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);

coverity complained here with
 Calling "open_pack_index" without checking return value
 (as is done elsewhere 13 out of 15 times).

Good catch! This same line is repeated in unique_in_pack() in this same 
file, so if this is worth fixing then we should probably fix it there, too.

I think the easiest way out is just a

 if (open_pack_index(p))
 die(_("Cannot open existing pack idx file for '%s'"), p);

or is there another good approach?

You probably intended to have p->pack_name in the die();

Using `cat *.c | grep -A 2 "if (open_pack_index("` and `cat */*.c | grep 
-A 2 "if (open_pack_index("` I see a few places that return error codes 
or quietly fail. The cases that use die() are inside builtin/ so I don't 
think die() is the right choice here.


Since find_abbrev_len_for_pack() is intended to extend the abbreviation 
length when necessary, I think a silent return is best here:


    if (open_pack_index(p))
        return;

Thanks,
-Stolee


Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-03 Thread Derrick Stolee

On 10/3/2017 6:49 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New| |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%|
|---|---|-||||-|
| Git   | 1 |  230078 |  0 | 0.09 s | 0.06 s | - 33.3% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | - 27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux | 1 | 5606645 |  0 | 0.12 s | 0.32 s | +146.2% |
| Linux |24 | 5606645 |  0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 2.08 s | -  7.6% |

The above does not look so good, especially in cases where a
repository is well maintained by packing into smaller number of
packs, we get a much worse result?

I understand that this patch on its own does not have good numbers. I
split patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.



p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.08 s | -87.9% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.13 s | -85.6% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.10 s | -87.3% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.32 s | -33.3% |
| Linux |24 | 5606645 |  0 | 4.41 s | 1.09 s | -75.3% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.18 s | -79.9% |

The question is if this is even measuring a relevant workload.  How
often would we have a full 40-hex object name for which we actually
do not have the object, and ask its name to be abbreviated?

Compared to it, the result from p0008.1 feels a lot more important.
We know we make tons of "abbreviate the object name for this object
we have" and we see them every day in our "git log -p" output.

Seeing a lot more impressive result from p0008.2 than p0008.1 makes
me unsure if this patch is optimizing for the right case.

I'll have to see the code a bit deeper before I can comment on it.

Thanks.

I agree that p0008.1 is more important. p0008.2 covers an important case
of the previous implementation. The line

    exists = has_sha1_file(sha1);

will inspect all packfiles and scan the full loose-object directory that
would contain the object. The only reason for this call is to determine
how to inspect the result of get_short_oid(), but it is a significant
portion of the time that is gained here.


Thanks,
-Stolee


[PATCH v3 0/5] Improve abbreviation disambiguation

2017-10-02 Thread Derrick Stolee
Thanks for the feedback on my previous versions, and for patience
with my inexperience on the mailing list. I tried to respond to all
feedback, but decided to not apply one suggestion:

* Removed the "sort -R" from p0008-abbrev.sh, since it is a GNU-
  specific flag. The perf test now considers objects in "discovery"
  order, which improved performance all versions of the test as
  expected. New perf numbers are provided.

* get_hex_char_from_oid(): renamed `i` to `pos`. I did not unify the
  code in this method with that in hex.c:sha1_to_hex_r due to the
  extra branch in get_hex_char_from_oid and wanting to inline the
  method. If we want to make this method available for other callers,
  then I would be happy to submit a separate patch for this change
  after the current patch is accepted.

* Removed unnecessary includes.

* Use uint32_t instead of unsigned int in test-list-objects.c

* Rearranged arguments in test-list-objects and fixed the n == 0 bug.

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids,
which we reorder using `sort -R` before using them as input to a
performance test. 

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate performance improvements in this area.

I include performance test numbers in the commit messages for each
change, but I also include the difference between the baseline and the
final change here:


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.09 s | 0.05 s | -44.4% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | -27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.06 s | -33.3% |
| Linux | 1 | 5606645 |  0 | 0.13 s | 0.05 s | -61.5% |
| Linux |24 | 5606645 |  0 | 1.13 s | 0.88 s | -22.1% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 0.80 s | -25.9% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.05 s | -58.3% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 0.95 s | - 6.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 1.93 s | -14.2% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
     Base Time: 5.69 s
      New Time: 4.09 s
         Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.07 s | -89.4% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.12 s | -86.7% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.09 s | -88.6% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.04 s | -91.7% |
| Linux |24 | 5606645 |  0 | 4.41 s | 0.85 s | -80.7% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.78 s | -81.0% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.04 s | -91.3% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 0.98 s | -81.9% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.04 s | -82.3% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
     Base Time: 38.9 s
      New Time:  2.7 s
         Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation
 Makefile 

[PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-02 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New| |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%|
|---|---|-||||-|
| Git   | 1 |  230078 |  0 | 0.09 s | 0.06 s | - 33.3% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | - 27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux | 1 | 5606645 |  0 | 0.12 s | 0.32 s | +146.2% |
| Linux |24 | 5606645 |  0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 2.08 s | -  7.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.62 s
 Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.08 s | -87.9% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.13 s | -85.6% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.10 s | -87.3% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.32 s | -33.3% |
| Linux |24 | 5606645 |  0 | 4.41 s | 1.09 s | -75.3% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.18 s | -79.9% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 39.0 s
  New Time:  3.9 s
 Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
+}

[PATCH v3 4/5] sha1_name: Parse less while finding common prefix

2017-10-02 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.
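
To see what the parsing change does, here is a stand-alone sketch of
the nibble lookup (hex_char_at() is illustrative; the patch's function
takes a struct object_id):

    #include <stdio.h>

    /* Even positions read a byte's high nibble, odd positions its low
     * nibble, so no intermediate hex buffer is ever filled.
     */
    static char hex_char_at(const unsigned char *hash, int pos)
    {
            static const char hex[] = "0123456789abcdef";

            if ((pos & 1) == 0)
                    return hex[hash[pos >> 1] >> 4];
            else
                    return hex[hash[pos >> 1] & 0xf];
    }

    int main(void)
    {
            const unsigned char hash[] = { 0x1f, 0xa0 };
            int pos;

            for (pos = 0; pos < 4; pos++)
                    putchar(hex_char_at(hash, pos));
            putchar('\n'); /* prints "1fa0" */
            return 0;
    }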

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.07 s | 0.06 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.14 s | -56.3% |
| Linux |24 | 5606645 |  0 | 1.12 s | 0.94 s | -16.1% |
| Linux |23 | 5283204 | 323441 | 1.05 s | 0.86 s | -18.1% |
| VSTS  | 1 | 4355923 |  0 | 0.23 s | 0.11 s | -52.2% |
| VSTS  |32 | 4355923 |  0 | 1.08 s | 0.95 s | -12.0% |
| VSTS  |31 | 4276829 |  79094 | 2.08 s | 1.82 s | -12.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.61 s
  New Time: 4.61 s
 Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.07 s | -12.5% |
| Git   | 5 |  230162 |  0 | 0.13 s | 0.12 s | - 7.7% |
| Git   | 4 |  154310 |  75852 | 0.10 s | 0.09 s | -10.0% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.13 s | -59.4% |
| Linux |24 | 5606645 |  0 | 1.09 s | 0.89 s | -18.3% |
| Linux |23 | 5283204 | 323441 | 0.99 s | 0.83 s | -16.2% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.11 s | -56.0% |
| VSTS  |32 | 4355923 |  0 | 1.15 s | 0.99 s | -13.9% |
| VSTS  |31 | 4276829 |  79094 | 1.18 s | 1.02 s | -13.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.91 s
  New Time: 3.08 s
 Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-02 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care in choosing which two to inspect, in case the object id
exists in the packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.
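
The shape of the search is sketched below (illustrative only; the
patch walks a pack-index via nth_packed_object_sha1() instead of a
plain array). Only the entries immediately around the returned
position can share the longest prefix with the target, so only they
need to be inspected.

    #include <stdio.h>
    #include <string.h>

    #define RAWSZ 20

    /* Binary search over a sorted table of raw hashes: returns the
     * position of target if present (*match = 1), otherwise the
     * position where it would be inserted. The mid computation avoids
     * unsigned overflow.
     */
    static unsigned find_insert_pos(const unsigned char table[][RAWSZ],
                                    unsigned num,
                                    const unsigned char *target,
                                    int *match)
    {
            unsigned first = 0, last = num;

            *match = 0;
            while (first < last) {
                    unsigned mid = first + (last - first) / 2;
                    int cmp = memcmp(target, table[mid], RAWSZ);

                    if (!cmp) {
                            *match = 1;
                            return mid;
                    }
                    if (cmp > 0)
                            first = mid + 1;
                    else
                            last = mid;
            }
            return first;
    }

    int main(void)
    {
            static const unsigned char table[3][RAWSZ] = {
                    { 0x10 }, { 0x80 }, { 0xf0 }, /* other bytes zero */
            };
            unsigned char target[RAWSZ] = { 0x9f };
            int match;

            /* 0x9f... sorts between 0x80... and 0xf0...: pos 2, no match */
            printf("pos=%u match=%d\n",
                   find_insert_pos(table, 3, target, &match), match);
            return 0;
    }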

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.05 s | 0.05 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.06 s | 0.06 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.14 s | 0.05 s | -64.3% |
| Linux |24 | 5606645 |  0 | 0.94 s | 0.88 s | - 6.4% |
| Linux |23 | 5283204 | 323441 | 0.86 s | 0.80 s | - 7.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.05 s | -54.5% |
| VSTS  |32 | 4355923 |  0 | 0.95 s | 0.95 s |   0.0% |
| VSTS  |31 | 4276829 |  79094 | 1.82 s | 1.93 s | + 6.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.62 s
  New Time: 4.09 s
 Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.07 s | 0.07 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.12 s | 0.12 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.09 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.13 s | 0.04 s | -69.2% |
| Linux |24 | 5606645 |  0 | 0.89 s | 0.85 s | - 4.5% |
| Linux |23 | 5283204 | 323441 | 0.83 s | 0.78 s | - 6.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.04 s | -63.6% |
| VSTS  |32 | 4355923 |  0 | 0.99 s | 0.98 s | - 1.0% |
| VSTS  |31 | 4276829 |  79094 | 1.02 s | 1.04 s | + 2.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.08 s
  New Time: 2.72 s
 Rel %: -11.8%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..54b3a37da 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = (first + last) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   

[PATCH v3 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf

2017-10-02 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For
test 0008.1, these objects exist. For test 0008.2 these objects do not
exist in the repo (with high probability).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 18 ++
 t/perf/p0008-abbrev.sh | 22 ++
 4 files changed, 42 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index 50a2eab80..63438a44e 100644
--- a/Makefile
+++ b/Makefile
@@ -638,6 +638,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9696f54bb..2190781ff 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..3ad88611a
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,18 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+   struct object_id oid;
+   char hex[GIT_MAX_HEXSZ + 2];
+   const char *end;
+
+   setup_git_directory();
+
+   while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+   hex[GIT_MAX_HEXSZ] = 0;
+   if (!parse_oid_hex(hex, &oid, &end))
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..5cbc8a888
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+   test-abbrev < objs.txt
+'
+
+test-list-objects --missing 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+   test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v3 1/5] test-list-objects: List a subset of object ids

2017-10-02 Thread Derrick Stolee
Create test-list-objects helper program to output a random sample of
OIDs present in the repo.

If no command line arguments are provided, all OIDs are output.

The last command line argument specifies how many samples to output.
Samples are collected across the entire set of available OIDs to avoid
clustering, and are therefore quite uniformly distributed.
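
The selection itself is plain modulo arithmetic; a stand-alone sketch
with toy numbers (not the helper code):

    #include <stdio.h>

    /* With `total` objects and a requested sample size n, emitting
     * every select_mod-th object yields at most n samples spread
     * evenly across the whole enumeration.
     */
    int main(void)
    {
            int total = 10, n = 3;
            int select_mod = 1 + total / n; /* 4 */
            int remaining = total;
            int id;

            for (id = 0; id < total; id++) {
                    if (!(remaining % select_mod))
                            printf("selected %d\n", id); /* ids 2 and 6 */
                    remaining--;
            }
            return 0;
    }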

If a command line argument "--missing" is given before the sample count,
then a list of OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile |  1 +
 t/helper/.gitignore  |  1 +
 t/helper/test-list-objects.c | 87 
 3 files changed, 89 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index ed4ca438b..50a2eab80 100644
--- a/Makefile
+++ b/Makefile
@@ -652,6 +652,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 7c9d28a83..9696f54bb 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..22bc9b4e6
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,87 @@
+#include "cache.h"
+#include "packfile.h"
+
+struct count {
+   int total;
+   int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+int count_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+void output(const struct object_id *oid,
+   struct count *c)
+{
+   if (!(c->total % c->select_mod))
+   printf("%s\n", oid_to_hex(oid));
+   c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+const char *path,
+void *data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int output_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void* data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+   uint32_t hash_delt = 0xFDB97531;
+   uint32_t hash_base = 0x01020304;
+   int i, n = -1;
+   struct count c;
+   const int u_per_oid = sizeof(struct object_id) / sizeof(uint32_t);
+
+   c.total = 0;
+   if (ac > 1)
+   n = atoi(av[ac - 1]);
+
+   if (ac > 2 && !strcmp(av[1], "--missing")) {
+   while (c.total++ < n) {
+   for (i = 0; i < u_per_oid; i++) {
+   printf("%08x", hash_base);
+   hash_base += hash_delt;
+   }
+   printf("\n");
+   }
+   } else {
+   setup_git_directory();
+
+   if (n > 0) {
+   for_each_loose_object(count_loose, &c, 0);
+   for_each_packed_object(count_packed, &c, 0);
+   c.select_mod = 1 + c.total / n;
+   } else {
+   c.select_mod = 1;
+   }
+
+   for_each_loose_object(output_loose, &c, 0);
+   for_each_packed_object(output_packed, &c, 0);
+   }
+
+   exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty



Re: [PATCH v2 4/5] sha1_name: Parse less while finding common prefix

2017-10-02 Thread Derrick Stolee

My v3 patch is incoming, but I wanted to respond directly to this message.

On 9/25/2017 7:42 PM, Stefan Beller wrote:

On Mon, Sep 25, 2017 at 2:54 AM, Derrick Stolee <dsto...@microsoft.com> wrote:

Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.17 s | 0.16 s | - 5.9% |
| Git   | 4 |  154310 |  75852 | 0.14 s | 0.12 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.50 s | 0.25 s | -50.0% |
| Linux |24 | 5606645 |  0 | 2.41 s | 2.08 s | -13.7% |
| Linux |23 | 5283204 | 323441 | 1.99 s | 1.69 s | -15.1% |
| VSTS  | 1 | 4355923 |  0 | 0.40 s | 0.22 s | -45.0% |
| VSTS  |32 | 4355923 |  0 | 2.09 s | 1.99 s | - 4.8% |
| VSTS  |31 | 4276829 |  79094 | 3.60 s | 3.20 s | -11.1% |

For the Windows repo running in Windows Subsystem for Linux:

 Pack Files: 50
Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 4.61 s
   New Time: 4.61 s
  Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.14 s | 0.15 s | + 7.1% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.40 s | 0.17 s | -57.5% |
| Linux |24 | 5606645 |  0 | 1.59 s | 1.30 s | -18.2% |
| Linux |23 | 5283204 | 323441 | 1.23 s | 1.10 s | -10.6% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.12 s | -52.0% |
| VSTS  |32 | 4355923 |  0 | 1.45 s | 1.34 s | - 7.6% |
| VSTS  |31 | 4276829 |  79094 | 1.59 s | 1.34 s | -15.7% |

For the Windows repo running in Windows Subsystem for Linux:

 Pack Files: 50
Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 3.91 s
   New Time: 3.08 s
  Rel %: -21.1%


These numbers look pretty cool!


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

double signoff?


Oops! I'll be more careful with my format-patch in the future.




---
  sha1_name.c | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
 char *hex;
  };

+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)

'i' is not very descriptive, maybe add a comment?
(I realize it is just walking through the char*s one by one)

I renamed 'i' to 'pos' in my v3.



Maybe this function (together with the change in the while below)
could go into hex.c as "int progressively_cmp_oids" that returns
the position at which the given oids differ?


+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}

sha1_to_hex_r has very similar code, though it iterates less
and covers both cases in the loop body.

That is the actual reason I propose moving this function
(or a variant thereof) to hex.c as there we can share code.


You're right that sha1_to_hex_r is similar, in fact I based my work on 
it. There are a few reasons I didn't combine the two implementations:


* I wanted to be sure my patch was only touching the code for 
disambiguating short-shas. Modifying code in hex.c would touch many more 
code paths.


* I realize that the extra branch in my version is slower than the 
branchless loop body in sha1_to_hex_r, so either I would slow that 
method or make the method call more complicated by returning two chars 
at a time.


* I wanted to strongly hint that the method should be inlined, but I'm 
not sure how to guarantee that happens across a linker boundary without 
doing strange things in header files.
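
(For reference, the standard way to keep such a helper inlinable
across translation units is a static inline definition in a shared
header. The sketch below is illustrative, not a proposal for a
specific header; it assumes struct object_id from cache.h.)

    /* In a header: every including file gets its own inlinable copy,
     * so no inlining across the linker boundary is needed.
     */
    static inline char get_hex_char_from_oid(const struct object_id *oid,
                                             int pos)
    {
            static const char hex[] = "0123456789abcdef";

            if ((pos & 1) == 0)
                    return hex[oid->hash[pos >> 1] >> 4];
            else
                    return hex[oid->hash[pos >> 1] & 0xf];
    }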


I'm happy to revisit this after my patch is complete, since I think 
there are interesting trade-offs to consider.
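
For concreteness, Stefan's suggested helper might look something like
this (a hypothetical sketch; no such function exists in hex.c, and it
leans on the patch's get_hex_char_from_oid() and GIT_SHA1_HEXSZ from
cache.h):

    /* Return the index of the first hex character at which two object
     * ids differ, or GIT_SHA1_HEXSZ if they are identical.
     */
    static int progressively_cmp_oids(const struct object_id *a,
                                      const struct object_id *b)
    {
            int i;

            for (i = 0; i < GIT_SHA1_HEXSZ; i++)
                    if (get_hex_char_from_oid(a, i) !=
                        get_hex_char_from_oid(b, i))
                            return i;
            return GIT_SHA1_HEXSZ;
    }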

Re: What's cooking in git.git (Sep 2017, #06; Fri, 29)

2017-09-29 Thread Derrick Stolee

Hi Junio,

On 9/29/2017 12:34 AM, Junio C Hamano wrote:

* ds/find-unique-abbrev-optim (2017-09-19) 4 commits
  - SQUASH???
  - sha1_name: parse less while finding common prefix
  - sha1_name: unroll len loop in find_unique_abbrev_r()
  - sha1_name: create perf test for find_unique_abbrev()


I'll re-roll my patch on Monday if reviews have stabilized. I think I 
understand your comments this time (especially around 32-bit ints).

Since I'm new to the list, I'm not sure what the change in messages
means here.

What does "SQUASH???" mean? Is that why there are three meaningful 
commits in this note, despite my five-commit patch? Would you like me to 
squash the commits in v3?


Thanks,
-Stolee


[PATCH v2 0/5] Improve abbreviation disambiguation

2017-09-25 Thread Derrick Stolee
Thanks for the feedback on my v1 patch. Thanks also to Jeff Hostetler
for helping me with this v2 patch, which includes an extra performance
improvement in commit 5.

Changes since last version:

* Added helper program test-list-objects to construct a list of
  existing object ids.

* test-abbrev now disambiguates a list of OIDs from stdin.

* p0008-abbrev.sh now has two tests:
* 0008.1 tests 100,000 known OIDs
* 0008.2 tests 100,000 missing OIDs

* Added a third performance improvement that uses binary search within
  packfiles to inspect at most two object ids per packfile.

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids,
which we reorder using `sort -R` before using them as input to a
performance test. 

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate performance improvements in this area.

I include performance test numbers in the commit messages for each
change, but I also include the difference between the baseline and the
final change here:


p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.12 s | 0.05 s | -58.3% |
| Git   | 5 |  230162 |  0 | 0.25 s | 0.15 s | -40.0% |
| Git   | 4 |  154310 |  75852 | 0.18 s | 0.11 s | -38.9% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.10 s | -68.8% |
| Linux |24 | 5606645 |  0 | 2.77 s | 2.00 s | -27.8% |
| Linux |23 | 5283204 | 323441 | 2.86 s | 1.62 s | -43.4% |
| VSTS  | 1 | 4355923 |  0 | 0.27 s | 0.09 s | -66.7% |
| VSTS  |32 | 4355923 |  0 | 2.41 s | 2.01 s | -16.6% |
| VSTS  |31 | 4276829 |  79094 | 4.22 s | 3.02 s | -28.4% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.09 s
 Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.61 s | 0.04 s | -93.4% |
| Git   | 5 |  230162 |  0 | 1.30 s | 0.15 s | -88.5% |
| Git   | 4 |  154310 |  75852 | 1.07 s | 0.12 s | -88.8% |
| Linux | 1 | 5606645 |  0 | 0.65 s | 0.05 s | -92.3% |
| Linux |24 | 5606645 |  0 | 7.12 s | 1.28 s | -82.0% |
| Linux |23 | 5283204 | 323441 | 6.33 s | 0.96 s | -84.8% |
| VSTS  | 1 | 4355923 |  0 | 0.64 s | 0.05 s | -92.2% |
| VSTS  |32 | 4355923 |  0 | 7.77 s | 1.36 s | -82.5% |
| VSTS  |31 | 4276829 |  79094 | 7.76 s | 1.36 s | -82.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 38.9 s
  New Time:  2.7 s
 Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation

 Makefile |   2 +
 sha1_name.c  | 128 ++-
 t/helper/.gitignore  |   2 +
 t/helper/test-abbrev.c   |  19 +++
 t/helper/test-list-objects.c |  85 
 t/perf/p0008-abbrev.sh   |  22 
 6 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100644 t/helper/test-list-objects.c
 create mode 100755 t/perf/p0008-abbrev.sh

[PATCH v2 4/5] sha1_name: Parse less while finding common prefix

2017-09-25 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.17 s | 0.16 s | - 5.9% |
| Git   | 4 |  154310 |  75852 | 0.14 s | 0.12 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.50 s | 0.25 s | -50.0% |
| Linux |24 | 5606645 |  0 | 2.41 s | 2.08 s | -13.7% |
| Linux |23 | 5283204 | 323441 | 1.99 s | 1.69 s | -15.1% |
| VSTS  | 1 | 4355923 |  0 | 0.40 s | 0.22 s | -45.0% |
| VSTS  |32 | 4355923 |  0 | 2.09 s | 1.99 s | - 4.8% |
| VSTS  |31 | 4276829 |  79094 | 3.60 s | 3.20 s | -11.1% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.61 s
  New Time: 4.61 s
 Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.14 s | 0.15 s | + 7.1% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.40 s | 0.17 s | -57.5% |
| Linux |24 | 5606645 |  0 | 1.59 s | 1.30 s | -18.2% |
| Linux |23 | 5283204 | 323441 | 1.23 s | 1.10 s | -10.6% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.12 s | -52.0% |
| VSTS  |32 | 4355923 |  0 | 1.45 s | 1.34 s | - 7.6% |
| VSTS  |31 | 4276829 |  79094 | 1.59 s | 1.34 s | -15.7% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.91 s
  New Time: 3.08 s
 Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-09-25 Thread Derrick Stolee
Create test-list-objects helper program to output a random sample of
OIDs present in the repo.

If no command line arguments are provided, all OIDs are output.

The first command line argument specifies how many samples to output.
Samples are collected across the entire set of available OIDs to avoid
clustering, and are therefore quite uniformly distributed.

If a second command line argument "--missing" is given, then a list of
OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile |  1 +
 t/helper/.gitignore  |  1 +
 t/helper/test-list-objects.c | 85 
 3 files changed, 87 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index f2bb7f2f6..af94b655a 100644
--- a/Makefile
+++ b/Makefile
@@ -647,6 +647,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..9dd1c9f3c 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..83b1250fe
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,85 @@
+#include "cache.h"
+#include "packfile.h"
+#include 
+
+struct count {
+   int total;
+   int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+int count_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+void output(const struct object_id *oid,
+   struct count *c)
+{
+   if (!(c->total % c->select_mod))
+   printf("%s\n", oid_to_hex(oid));
+   c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+const char *path,
+void *data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int output_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void* data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+   unsigned int hash_delt = 0xFDB97531;
+   unsigned int hash_base = 0x01020304;
+   int i, n = 0;
+   struct count c;
+   const int u_per_oid = sizeof(struct object_id) / sizeof(unsigned int);
+
+   c.total = 0;
+   if (ac > 1)
+   n = atoi(av[1]);
+
+   if (ac > 2 && !strcmp(av[2], "--missing")) {
+   while (c.total++ < n) {
+   for (i = 0; i < u_per_oid; i++) {
+   printf("%08x", hash_base);
+   hash_base += hash_delt;
+   }
+   printf("\n");
+   }
+   } else {
+   setup_git_directory();
+
+   for_each_loose_object(count_loose, &c, 0);
+   for_each_packed_object(count_packed, &c, 0);
+
+   c.select_mod = 1 + c.total / n;
+
+   for_each_loose_object(output_loose, &c, 0);
+   for_each_packed_object(output_packed, &c, 0);
+   }
+
+   exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-09-25 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care in choosing which two to inspect, in case the object id
exists in the packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.05 s | -37.5% |
| Git   | 5 |  230162 |  0 | 0.16 s | 0.15 s | - 6.3% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.11 s | - 8.3% |
| Linux | 1 | 5606645 |  0 | 0.25 s | 0.10 s | -60.0% |
| Linux |24 | 5606645 |  0 | 2.08 s | 2.00 s | - 3.8% |
| Linux |23 | 5283204 | 323441 | 1.69 s | 1.62 s | - 4.1% |
| VSTS  | 1 | 4355923 |  0 | 0.22 s | 0.09 s | -59.1% |
| VSTS  |32 | 4355923 |  0 | 1.99 s | 2.01 s | + 1.0% |
| VSTS  |31 | 4276829 |  79094 | 3.20 s | 3.02 s | - 5.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.62 s
  New Time: 4.09 s
 Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.05 s | 0.04 s | -20.0% |
| Git   | 5 |  230162 |  0 | 0.15 s | 0.15 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.17 s | 0.05 s | -70.6% |
| Linux |24 | 5606645 |  0 | 1.30 s | 1.28 s | - 1.5% |
| Linux |23 | 5283204 | 323441 | 1.10 s | 0.96 s | -12.7% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.05 s | -58.3% |
| VSTS  |32 | 4355923 |  0 | 1.34 s | 1.36 s | + 1.5% |
| VSTS  |31 | 4276829 |  79094 | 1.34 s | 1.36 s | + 1.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.08 s
  New Time: 2.72 s
 Rel %: -11.8%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index bb47b6702..1566cd4fc 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
@@ -504,6 +505,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = (first + last) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+ 

[PATCH v2 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf

2017-09-25 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For
test 0008.1, these objects exist. For test 0008.2 these objects do not
exist in the repo (with high probability). For each test, use `sort -R`
to (deterministically) shuffle the sample of object ids so that
abbreviations are not checked in lexicographic order.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 19 +++
 t/perf/p0008-abbrev.sh | 22 ++
 4 files changed, 43 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index af94b655a..75835c7c0 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9dd1c9f3c..ab9df8369 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..6866896eb
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,19 @@
+#include "cache.h"
+#include 
+
+int cmd_main(int ac, const char **av)
+{
+   struct object_id oid;
+   char hex[GIT_MAX_HEXSZ + 2];
+   const char *end;
+
+   setup_git_directory();
+
+   while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+   hex[GIT_MAX_HEXSZ] = 0;
+   if (!parse_oid_hex(hex, &oid, &end))
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..ba25e7824
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+   test-abbrev < objs.txt
+'
+
+test-list-objects 100000 --missing | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+   test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-09-25 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.12 s | 0.08 s | -33.3% |
| Git   | 5 |  230162 |  0 | 0.25 s | 0.17 s | -32.0% |
| Git   | 4 |  154310 |  75852 | 0.18 s | 0.14 s | -22.2% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.50 s | +56.2% |
| Linux |24 | 5606645 |  0 | 2.77 s | 2.41 s | -13.0% |
| Linux |23 | 5283204 | 323441 | 2.86 s | 1.99 s | -30.4% |
| VSTS  | 1 | 4355923 |  0 | 0.27 s | 0.40 s | +48.1% |
| VSTS  |32 | 4355923 |  0 | 2.41 s | 2.09 s | -13.3% |
| VSTS  |31 | 4276829 |  79094 | 4.22 s | 3.60 s | -14.7% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.62 s
 Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.61 s | 0.06 s | -90.2% |
| Git   | 5 |  230162 |  0 | 1.30 s | 0.14 s | -89.2% |
| Git   | 4 |  154310 |  75852 | 1.07 s | 0.12 s | -88.8% |
| Linux | 1 | 5606645 |  0 | 0.65 s | 0.40 s | -38.5% |
| Linux |24 | 5606645 |  0 | 7.12 s | 1.59 s | -77.7% |
| Linux |23 | 5283204 | 323441 | 6.33 s | 1.23 s | -80.6% |
| VSTS  | 1 | 4355923 |  0 | 0.64 s | 0.25 s | -60.9% |
| VSTS  |32 | 4355923 |  0 | 7.77 s | 1.45 s | -81.3% |
| VSTS  |31 | 4276829 |  79094 | 7.76 s | 1.59 s | -79.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 39.0 s
  New Time:  3.9 s
 Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
+}

Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()

2017-09-18 Thread Derrick Stolee

On 9/17/2017 5:51 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+int cmd_main(int ac, const char **av)
+{
+   setup_git_directory();


As far as I recall, we do not (yet) allow declaration after
statement in our codebase.  Move this down to make it after all
decls.


Will fix.


+
+   unsigned int hash_delt = 0x13579BDF;
+   unsigned int hash_base = 0x01020304;
+   struct object_id oid;
+
+   int i, count = 0;
+   int n = sizeof(struct object_id) / sizeof(int);


It probably is technically OK to assume sizeof(int) always equals
sizeof(unsigned), but because you use 'n' _only_ to work with uint
and never with int, it would make more sense to match.


Will fix. I also notice that "n" should be const.


But I do not think we want this "clever" optimization that involves
'n' in the first place.

+   while (count++ < 100000) {

+   for (i = 0; i < n; i++)
+   ((unsigned int*)oid.hash)[i] = hash_base;


Does it make sense to assume that uint is always 4-byte (so this
code won't work if it is 8-byte on your platform) and doing this is
faster than using platform-optimized memcpy()?


I'm not sure what you mean by using memcpy to improve this, because
it would require calling memcpy in the inner loop, such as

for (i = 0; i < n; i++)
memcpy(oid.hash + i * sizeof(unsigned), &hash_base,
   sizeof(unsigned));

I'm probably misunderstanding your intended use of memcpy().


+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+   hash_base += hash_delt;
+   }


I also wonder if this is measuring the right thing.  I am guessing
that by making many queries for a unique abbreviation of "random"
(well, it is deterministic, but my point is these queries are not
based on the names of objects that exist in the repository) hashes,
this test measures how much time it takes for us to decide that such
a random hash does not exist.  In the real life use, we make the
opposite query far more frequently: we have an object that we _know_
exists in the repository and we try to find a sufficient length to
disambiguate it from others, and I suspect that these two use
different logic.  Don't you need to be measuring the time it takes
to compute the shortest abbreviation of an object that exists
instead?


First, this doesn't just measure the time it takes to determine non-
existence, because it finds the len required to disambiguate an
"incoming" hash from all known hashes. When testing, I put in a
simple printf to report the result abbreviation so I could see how
often it needed to be extended. In this sense, the test exposes the
while loop that is removed by PATCH 2/3.

Second, your criticism about extant hashes is valid, and one I
struggled to reconcile. I see two issues with testing known hashes:

1. By determining the hash exists, we have inspected the file that
   contains it (either a .idx or the loose-object directory). This
   has side-effects that warm up the file cache so the looped method
   is artificially faster to find matching objects. The effect is
   particularly significant on a repo with exactly one packfile.

2. If we iterate over the entire set of objects, this test takes
   O(N*t(N)) time, where t(N) is the average time to compute a
   minimum abbreviation. For large repos, this can take several
   minutes. Instead, even with the constant 100,000 test hashes, we
   have an O(t(N)) test. We could avoid the asymptotic growth by
   limiting the number of existing hashes we use, but how do we
   find a sufficiently uniform sample from them?

By looking at some other perf tests, I see that we can add a pre-
requisite action. I will investigate making another helper that
uniformly selects a set of hashes from the repo and writes them
to stdout in a random order. p0008-abbrev.sh will run the helper as
a prerequisite, saving the data to a file. test-abbrev will read
the hashes from stdin to test find_unique_abbrev. This should avoid
the side-effects I mentioned (test-abbrev will not warm up indexes)
while also testing abbreviation lengths for existing objects.

I'll get started on these changes and send a new patch with new perf
data in a couple days. Please let me know if there is a better way
to measure performance for this change.

Thanks,
-Stolee



[PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()

2017-09-15 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for 100,000 object ids. The ids are created
by iterating an unsigned int hash_base by a constant hash_delta and
copying hash_base five times across the sha1. Iterating by hash_delta
does not create a duplicate value for over 10,000,000 iterations.
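
The no-duplicates claim follows because the increment is odd: the
sequence hash_base, hash_base + hash_delt, ... visits all 2^32
unsigned values before repeating, and 10,000,000 < 2^32. A stand-alone
sketch of the generator (illustrative only):

    #include <stdio.h>

    int main(void)
    {
            unsigned int hash_delt = 0x13579BDF; /* odd: full 2^32 period */
            unsigned int hash_base = 0x01020304;
            int i;

            /* print the first few 32-bit seeds copied across each oid */
            for (i = 0; i < 4; i++) {
                    printf("seed %d: %08x\n", i, hash_base);
                    hash_base += hash_delt;
            }
            return 0;
    }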

test-abbrev demonstrates the performance improvements that will be
shown by later improvements to find_unique_abbrev(). The value of
100,000 is large enough to show the significance of the later
improvements while only taking a few seconds on large repos.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 23 +++
 t/perf/p0008-abbrev.sh | 12 
 4 files changed, 37 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index f2bb7f2f6..081ca05e8 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..80ce7d836 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..cb3551df9
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+   setup_git_directory();
+
+   unsigned int hash_delt = 0x13579BDF;
+   unsigned int hash_base = 0x01020304;
+   struct object_id oid;
+
+   int i, count = 0;
+   int n = sizeof(struct object_id) / sizeof(int);
+   while (count++ < 100000) {
+   for (i = 0; i < n; i++)
+   ((unsigned int*)oid.hash)[i] = hash_base;
+
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+   hash_base += hash_delt;
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..7c3fad807
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'find_unique_abbrev()' '
+   test-abbrev
+'
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 3/3] sha1_name: Parse less while finding common prefix

2017-09-15 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

This change decreases the time to run test-abbrev by up to 40% on
large repos.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 0/3] Improve abbreviation disambiguation

2017-09-15 Thread Derrick Stolee
Hello,

My name is Derrick Stolee and I just switched teams at Microsoft from
the VSTS Git Server to work on performance improvements in core Git.

This is my first patch submission, and I look forward to your feedback.

Thanks,
 Stolee


When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate this performance improvement. Here are some
performance results for the three included commits, using
GIT_PERF_REPEAT_COUNT=10 since the first test is frequently an outlier
due to the file cache being cold.

Running git on a Linux VM, we see the following gains.

| Repo| Pack-Files | Loose Objs | Baseline | Patch 2 | Patch 3 |
|-|||--|-|-|
| Git.git | 1  | 0  | 0.46 s   | -87.0%  | -87.0%  |
| Git.git | 5  | 0  | 1.04 s   | -84.6%  | -85.6%  |
| Git.git | 4  | 75852  | 0.88 s   | -86.4%  | -86.4%  |
| Linux   | 1  | 0  | 0.63 s   | -38.1%  | -69.8%  |
| Linux   | 24 | 0  | 5.41 s   | -69.3%  | -71.5%  |
| Linux   | 23 | 323441 | 5.41 s   | -70.6%  | -73.4%  |

Running a similar patch on Git for Windows, we see the following gains.

| Repo  | Pack-Files | Loose | Baseline | Patch 2 | Patch 3 |
|---||---|--|-|-|
| GitForWindows | 6  | 319   | 7.19 s   | -91.1%  | -91.5%  |
| VSTS  | 3  | 38| 7.83 s   | -88.9%  | -90.9%  |
| Linux | 3  | 0 | 7.92 s   | -87.9%  | -90.2%  |
| Windows   | 50 | 219   | 17.8 s   | -98.6%  | -98.6%  |

Note that performance improves in all cases, but the gain is larger
when there are multiple, large pack-files. The gain comes from avoiding
repeated inspection of pack-index files, whose contents are not cached
in memory once inspected.


Derrick Stolee (3):
  sha1_name: Create perf test for find_unique_abbrev()
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix

 Makefile   |  1 +
 sha1_name.c| 66 ++
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 22 +
 t/perf/p0008-abbrev.sh | 12 +
 5 files changed, 87 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-09-15 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty


