Re: [PATCH v2 5/5] sha1_file: support loading lazy objects

2017-08-08 Thread Ben Peart



On 7/31/2017 5:02 PM, Jonathan Tan wrote:

Teach sha1_file to invoke the command configured in
extensions.lazyObject whenever an object is requested and unavailable.

The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate missing objects (without invoking the command) or be
more efficient in invoking this command.


To prevent fetch from downloading all missing objects, you will also 
need to add logic in check_connected.  The simplest model is to simply 
return 0 if repository_format_lazy_object is set.


/*
 * Running a with lazy_objects there will be objects that are
 * missing locally and we don't want to download a bunch of
 * commits, trees, and blobs just to make sure everything is
 * reachable locally so this option will skip reachablility
 * checks below that use rev-list.  This will stop the check
 * before uploadpack runs to determine if there is anything to
 * fetch.  Returning zero for the first check will also prevent the
 * uploadpack from happening.  It will also skip the check after
 * the fetch is finished to make sure all the objects where
 * downloaded in the pack file.  This will allow the fetch to
 * run and get all the latest tip commit ids for all the branches
 * in the fetch but not pull down commits, trees, or blobs via
 * upload pack.
 */
if (repository_format_lazy_object)
return 0;

[...]

diff --git a/sha1_file.c b/sha1_file.c
index b60ae15f7..1785c61d8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,11 @@
  #include "list.h"
  #include "mergesort.h"
  #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
+#include "lazy-object.h"
+#include "sha1-array.h"
  
  #define SZ_FMT PRIuMAX

  static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -2984,6 +2989,7 @@ int sha1_object_info_extended(const unsigned char *sha1, 
struct object_info *oi,
const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
lookup_replace_object(sha1) :
sha1;
+   int already_retried = 0;
  
  	if (!oi)

oi = _oi;
@@ -3008,30 +3014,38 @@ int sha1_object_info_extended(const unsigned char 
*sha1, struct object_info *oi,
}
}
  
-	if (!find_pack_entry(real, )) {

-   /* Most likely it's a loose object. */
-   if (!sha1_loose_object_info(real, oi, flags)) {
-   oi->whence = OI_LOOSE;
-   return 0;
-   }
+retry:
+   if (find_pack_entry(real, ))
+   goto found_packed;
  
-		/* Not a loose object; someone else may have just packed it. */

-   if (flags & OBJECT_INFO_QUICK) {
-   return -1;
-   } else {
-   reprepare_packed_git();
-   if (!find_pack_entry(real, ))
-   return -1;
-   }
+   /* Most likely it's a loose object. */
+   if (!sha1_loose_object_info(real, oi, flags)) {
+   oi->whence = OI_LOOSE;
+   return 0;
}
  
+	/* Not a loose object; someone else may have just packed it. */

+   reprepare_packed_git();
+   if (find_pack_entry(real, ))
+   goto found_packed;


Same feedback as before.  I like to avoid using goto's as flow control 
other than in error handling.


Also, this patch looses the OBJECT_INFO_QUICK logic which could be restored.

[...]


diff --git a/t/t0410/lazy-object b/t/t0410/lazy-object
new file mode 100755
index 0..4f4a9c38a
--- /dev/null
+++ b/t/t0410/lazy-object
@@ -0,0 +1,102 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git lazyObject protocol version 1. See
+# the documentation for extensions.lazyObject in
+# Documentation/technical/repository-version.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = $ARGV[0];
+


At some point, this should be based on the refactored pkt_* functions 
currently contained in the ObjectDB patch series.



+sub packet_bin_read {
+   my $buffer;
+   my $bytes_read = read STDIN, $buffer, 4;
+   if ( $bytes_read == 0 ) {
+
+   # EOF - Git stopped talking to us!
+   exit();
+   }
+   elsif ( $bytes_read != 4 ) {
+   die "invalid packet: '$buffer'";
+   }
+   my $pkt_size = 

Re: [PATCH v2 5/5] sha1_file: support loading lazy objects

2017-07-31 Thread Junio C Hamano
Jonathan Tan  writes:

> Teach sha1_file to invoke the command configured in
> extensions.lazyObject whenever an object is requested and unavailable.
>
> The usage of the hook can be suppressed through a flag when invoking
> has_object_file_with_flags() and other similar functions.
>
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate missing objects (without invoking the command) or be
> more efficient in invoking this command.
>
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>  regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>  need to check callers that know about the loose/packed distinction
>  and operate on both differently, and ensure that they can handle
>  the concept of objects that are neither loose nor packed)
>
> (1) is handled by the modification to sha1_object_info_extended().
>
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>  - reachable - only to find recent objects
>  - builtin/fsck - already knows about missing objects
>  - builtin/cat-file - warning message added in this commit
>
> Callers of the other functions do not need to be changed:
>  - parse_pack_index
>- http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>- this searches a single pack that is provided as an argument; the
>  caller already knows (through other means) that the sought object
>  is in a specific pack
>  - find_sha1_pack
>- fast-import - appears to be an optimization to not store a
>  file if it is already in a pack
>- http-walker - to search through a struct alt_base
>- http-push - to search through remote packs
>  - has_sha1_pack
>- builtin/fsck - already knows about promised objects
>- builtin/count-objects - informational purposes only (check if loose
>  object is also packed)
>- builtin/prune-packed - check if object to be pruned is packed (if
>  not, don't prune it)
>- revision - used to exclude packed objects if requested by user
>- diff - just for optimization
>
> An alternative design that I considered but rejected:
>
>  - Adding a hook whenever a packed object is requested, not on any
>object.  That is, whenever we attempt to search the packfiles for an
>object, if it is missing (from the packfiles and from the loose
>object storage), to invoke the hook (which must then store it as a
>packfile), open the packfile the hook generated, and report that the
>object is found in that new packfile. This reduces the amount of
>analysis needed (in that we only need to look at how packed objects
>are handled), but requires that the hook generate packfiles (or for
>sha1_file to pack whatever loose objects are generated), creating one
>packfile for each missing object and potentially very many packfiles
>that must be linearly searched. This may be tolerable now for repos
>that only have a few missing objects (for example, repos that only
>want to exclude large blobs), and might be tolerable in the future if
>we have batching support for the most commonly used commands, but is
>not tolerable now for repos that exclude a large amount of objects.
>
> Helped-by: Ben Peart 
> Signed-off-by: Jonathan Tan 
> ---

Even though I said a hugely negative thing about the "missing
objects are always OK" butchering of fsck, I do like what this patch
does.  The interface is reasonably well isolated, and moving of the
long-running-process documentation to a standalone file is very
sensible.



[PATCH v2 5/5] sha1_file: support loading lazy objects

2017-07-31 Thread Jonathan Tan
Teach sha1_file to invoke the command configured in
extensions.lazyObject whenever an object is requested and unavailable.

The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate missing objects (without invoking the command) or be
more efficient in invoking this command.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
 regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
 need to check callers that know about the loose/packed distinction
 and operate on both differently, and ensure that they can handle
 the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about missing objects
 - builtin/cat-file - warning message added in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
 caller already knows (through other means) that the sought object
 is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
 file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised objects
   - builtin/count-objects - informational purposes only (check if loose
 object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
 not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed object is requested, not on any
   object.  That is, whenever we attempt to search the packfiles for an
   object, if it is missing (from the packfiles and from the loose
   object storage), to invoke the hook (which must then store it as a
   packfile), open the packfile the hook generated, and report that the
   object is found in that new packfile. This reduces the amount of
   analysis needed (in that we only need to look at how packed objects
   are handled), but requires that the hook generate packfiles (or for
   sha1_file to pack whatever loose objects are generated), creating one
   packfile for each missing object and potentially very many packfiles
   that must be linearly searched. This may be tolerable now for repos
   that only have a few missing objects (for example, repos that only
   want to exclude large blobs), and might be tolerable in the future if
   we have batching support for the most commonly used commands, but is
   not tolerable now for repos that exclude a large amount of objects.

Helped-by: Ben Peart 
Signed-off-by: Jonathan Tan 
---
 Documentation/Makefile |   1 +
 Documentation/gitattributes.txt|  54 ++-
 Documentation/gitrepository-layout.txt |   3 +
 .../technical/long-running-process-protocol.txt|  50 ++
 Documentation/technical/repository-version.txt |  16 
 Makefile   |   1 +
 builtin/cat-file.c |   2 +
 builtin/fsck.c |   2 +-
 cache.h|   2 +
 lazy-object.c  |  80 
 lazy-object.h  |  12 +++
 object.c   |   7 ++
 object.h   |  13 +++
 sha1_file.c|  44 ++---
 t/t0410-lazy-object.sh |  18 
 t/t0410/lazy-object| 102 +
 16 files changed, 345 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/technical/long-running-process-protocol.txt
 create mode 100644 lazy-object.c
 create mode 100644 lazy-object.h
 create mode 100755 t/t0410/lazy-object

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2415e0d65..5657d49c2 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -69,6 +69,7 @@