Re: [PATCH] partial-clone: design doc

2017-12-14 Thread Jeff Hostetler



On 12/13/2017 8:17 AM, Philip Oakley wrote:

From: "Junio C Hamano" 

"Philip Oakley"  writes:


+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.


Is a comment needed here noting that currently, IIUC, the complete
trees are fetched in the packfiles, it's just the un-necessary blobs
that are omitted ?


I probably am misreading what you meant to say, but the above
statement with "currently" taken literally to mean the system
without JeffH's changes, is false.


I meant JeffH's current V6 series, rather than the last Git release.

In one of the previous discussions Jeff had noted that (at that time) his 
partial design would provide a full set of trees for the selected commits 
(excluding the trees already available locally), but only a few of the file 
blobs (based on the filter spec).

So yes, I should have been clearer to avoid talking at cross purposes.


Right, we build upon the existing thin-pack capabilities such that a
fetch following a clone gets a packfile that assumes the client already
has all of the objects in the "edge".  So a fetch would not need to
receive trees and blobs that are already present in the edge commits.

What we are adding here is a way to filter/restrict even further the
set of objects sent to the client.





When the receiver says it has commit A and the sender wants to send
a commit B (because the receiver said it does not have it, and it
wants it), trees in A are not sent in the pack the sender sends to
give objects sufficient to complete B, which the receiver wanted to
have, even if B also has those trees.  If you fetch from me twice
and between that time Documentation/ directory did not change, the
second fetch will not have the tree object that corresponds to that
hierarchy (and of course no blobs and sub trees inside it).


Though, after the fetch has completed (v2.15 Git), the receiver will have the 
'full set of trees and blobs'. In Jeff's design (V6) the receiver would still 
have a full set of trees, but only a partial set of the blobs. So my viewpoint 
was not of the pack file but of the receiver's object store after the fetch.


Currently (with our changes) the receiver will have all of the trees
and only some of the blobs.  If we later add another filter that can
filter trees, the client will have missing but referenced trees too.
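
As a rough illustration of "all of the trees and only some of the blobs",
here is a toy sketch in Python.  It is not how pack-objects is implemented.
The names Obj, filter_objects and blob_limit are invented for this example,
and the size rule is only loosely modeled on a blob size-limit filter.

```python
from typing import List, NamedTuple, Optional


class Obj(NamedTuple):
    oid: str
    kind: str   # "commit", "tree", or "blob"
    size: int   # size in bytes


def filter_objects(objs: List[Obj], blob_limit: Optional[int]) -> List[Obj]:
    """Keep every commit and tree; keep a blob only if it is under blob_limit.

    blob_limit=0 omits every blob; blob_limit=None disables filtering.
    (Hypothetical helper, for illustration only.)
    """
    if blob_limit is None:
        return list(objs)
    return [o for o in objs if o.kind != "blob" or o.size < blob_limit]


wanted = [
    Obj("c1", "commit", 200),
    Obj("t1", "tree", 70),
    Obj("b1", "blob", 10),
    Obj("b2", "blob", 5000),
]
sent = filter_objects(wanted, blob_limit=1024)
sent_oids = [o.oid for o in sent]
# Objects still referenced by the trees that were sent, but absent locally.
missing_oids = [o.oid for o in wanted if o not in sent]
```

In this sketch the tree t1 is sent even though the blob b2 it refers to is
not, which is the sense in which the resulting packfile is "incomplete".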

 


So "the complete trees are fetched" is not true.  What is true (and
what matters more in JeffH's document) is that fetching is done in
such a way that objects resulting in the receiving repository are
complete in the current system that does not allow promised objects.
If some objects resulting in the receiving repository are incomplete,
the current system considers that we corrupted the repository.

The promise mechanism says that it is fine for the receiving end to
lack blobs, trees or commits, as long as the promisor repository
tells it that these "missing" objects can be obtained from it later.


True. (Though I'm not sure exactly how Jeff decides about commits - I thought 
they were not part of this optimisation.)


I've not talked about commit filtering -- mainly because we already
have such machinery in shallow-clone -- and I did not want to mess
with the haves/wants computations.

But it will work with missing commits.  Because of the way object lookup
happens, a missing commit will trigger the fetch-object code just like a
missing blob does.  The ODB layer doesn't really care what type of
object it is -- just that it is missing and needs to be dynamically fetched.
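
To make the type-agnostic lookup concrete, here is a minimal Python sketch.
It is a toy model, not Git's C code: the class LazyObjectStore, the
fetch_from_promisor hook and the object ids are all made up for the example.

```python
from typing import Callable, Dict, Tuple

# An object is modeled as (type, payload); this shape is illustrative only.
GitObject = Tuple[str, bytes]


class LazyObjectStore:
    """Toy object store that backfills missing objects on demand."""

    def __init__(self, fetch_from_promisor: Callable[[str], GitObject]):
        self.objects: Dict[str, GitObject] = {}
        self.fetch_from_promisor = fetch_from_promisor  # hypothetical hook
        self.fetch_count = 0

    def read(self, oid: str) -> GitObject:
        # The lookup never asks what type of object is missing.
        if oid not in self.objects:
            self.objects[oid] = self.fetch_from_promisor(oid)
            self.fetch_count += 1
        return self.objects[oid]


remote = {
    "c1": ("commit", b"tree t1"),
    "t1": ("tree", b"blob b1"),
    "b1": ("blob", b"hello"),
}
store = LazyObjectStore(lambda oid: remote[oid])
# The second read of "b1" is served locally and does not fetch again.
kinds = [store.read(oid)[0] for oid in ("c1", "t1", "b1", "b1")]
```

The same read() path serves commits, trees and blobs, which is why a
missing commit would be backfilled in the same way as a missing blob.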
 
Thanks

Jeff


Re: [PATCH] partial-clone: design doc

2017-12-14 Thread Jeff Hostetler


Sorry, I didn't see this message in my inbox when I posted V2 of the
design doc.  I'll address questions here and update the doc as necessary.


On 12/12/2017 6:31 PM, Philip Oakley wrote:

From: "Jeff Hostetler" 

From: Jeff Hostetler 

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler 
Signed-off-by: Jonathan Tan 
---
Documentation/technical/partial-clone.txt | 240 ++
1 file changed, 240 insertions(+)
create mode 100644 Documentation/technical/partial-clone.txt

diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 000..7ab39d8
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,240 @@
+Partial Clone Design Notes
+==
+
+The "Partial Clone" feature is a performance optimization for git that
+allows git to function without having a complete copy of the repository.
+


I think it would be worthwhile at least listing the issues that make the 
'optimisation' necessary, and then the available factors that make the 
optimisation possible. This helps for future adjustments when those issues and 
factors change.

I think the issues are:
* the size of the repository that is being cloned, both in the width of a 
commit (you mentioned 100M trees) and the time (hours to days) / size to clone 
over the connection.

While the supporting factor is:
* the remote is always on-line and available for on-demand object fetching 
(seconds)

The solution choice then should fall out fairly obviously, and we can separate 
out the other optimisations that are based on other views about the issues. 
E.g. my desire for a solution in the off-line case.

In fact the current design, apart from some terminology, does look well 
matched, with only a couple of places that would be affected.

The airplane-mode expectations of a partial clone should also be stated.


Good points.  I'll try to work these into V3.
 
 

+During clone and fetch operations, git normally downloads the complete
+contents and history of the repository.  That is, during clone the client
+receives all of the commits, trees, and blobs in the repository into a
+local ODB.  Subsequent fetches extend the local ODB with any new objects.
+For large repositories, this can take significant time to download and
+large amounts of diskspace to store.
+
+The goal of this work is to allow git better handle extremely large
+repositories.


Shouldn't this goal be nearer the top?


Maybe.  I'll see about reordering the paragraphs in the introduction.





   Often in these repositories there are many files that the
+user does not need such as ancient versions of source files, files in
+portions of the worktree outside of the user's work area, or large binary
+assets.  If we can avoid downloading such unneeded objects *in advance*
+during clone and fetch operations, we can decrease download times and
+reduce ODB disk usage.
+


Does this need to distinguish between the shallow clone mechanism for reducing 
the cloning of old history and the desire for a width-wise partial clone of 
only the user's narrow work area, and/or without large files/blobs?


I tried to state in the next section that partial clone is independent of
shallow clone.  That is, our stuff works on filtering *within* the
set of commits received.  The existing shallow clone and have/wants
commit limiting features still apply.  I didn't go into detail on the
specific filters, because they are documented elsewhere and I view them
as an expandable set.  The primary goal here is to describe how we
handle missing objects without regard to why an object is missing.

 

+
+Non-Goals
+-
+
+Partial clone is independent of and not intended to conflict with
+shallow-clone, refspec, or limited-ref mechanisms since these all operate
+at the DAG level whereas partial clone and fetch works *within* the set
+of commits already chosen for download.
+

[...]

+Design Details
+--

[...]

+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.


Is a comment needed here noting that currently, IIUC, the complete trees are 
fetched in the packfiles, it's just the un-necessary blobs that are omitted ?


Currently, we have filters to omit unwanted blobs.  Later, we hope to
add other filters to omit trees too.  My point was that the packfiles
are incomplete (have missing objects).  I'll reword the above statement
a little.



+
+
+ How the local repository gracefully handles missing objects
+
+With partial clone, the fact that objects can be missing makes such
+repositories incompatible with older versions of Git, necessitating a
+repository extension (see the documentation of "extensions.partialClone"
+for more information).
+
+An 

Re: [PATCH] partial-clone: design doc

2017-12-13 Thread Jeff Hostetler



On 12/8/2017 3:14 PM, Junio C Hamano wrote:

Jeff Hostetler  writes:


From: Jeff Hostetler 

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler 
Signed-off-by: Jonathan Tan 
---


Thanks.


+Non-Goals
+-
+
+Partial clone is independent of and not intended to conflict with
+shallow-clone, refspec, or limited-ref mechanisms since these all operate
+at the DAG level whereas partial clone and fetch works *within* the set
+of commits already chosen for download.


It probably is not a huge deal (simply because it is about
"Non-Goals") but I have no idea what "refspec" and "limited-ref
mechanism" refer to in the above sentence, and I suspect many others
share the same puzzlement.


I'll reword this.  There was a question on the list earlier about
having a filter for commits in addition to ones for blobs and trees.

I just wanted to emphasize that we already have ways to filter or
limit commits using --shallow-* or --single-branch in clone and 1 or
more '' args in fetch.

 

+An object may be missing due to a partial clone or fetch, or missing due
+to repository corruption. To differentiate these cases, the local
+repository specially indicates packfiles obtained from the promisor
+remote. These "promisor packfiles" consist of a ".promisor" file
+with arbitrary contents (like the ".keep" files), in addition to
+their ".pack" and ".idx" files. (In the future, this ability
+may be extended to loose objects[a].)
+ ...
+Foot Notes
+--
+
+[a] Remembering that loose objects are promisor objects is mainly
+important for trees, since they may refer to promisor blobs that
+the user does not have.  We do not need to mark loose blobs as
+promisor because they do not refer to other objects.


I fail to see any logical link between the "loose" and "tree".
Putting it differently, I do not see why "tree" is so special.

A promisor pack that contains a tree but lacks blobs the tree refers
to would be sufficient to let us remember that these missing blobs
are not corruption.  A loose commit or a tag that is somehow marked
as obtained from a promisor, if it can serve just like a commit or a
tag in a promisor pack to promise its direct pointee, would equally
be useful (if very inefficient).

In any case, I suspect "since they may refer to promisor blobs" is a
typo of "since they may refer to promised blobs".


Right.  Good point.  I was only thinking about the tree==>blob
relationship.





+- Currently, dynamic object fetching invokes fetch-pack for each item
+  because most algorithms stumble upon a missing object and need to have
+  it resolved before continuing their work.  This may incur significant
+  overhead -- and multiple authentication requests -- if many objects are
+  needed.
+
+  We need to investigate use of a long-running process, such as proposed
+  in [5,6] to reduce process startup and overhead costs.


Also perhaps in some operations we can enumerate the objects we will
need upfront and ask for them in one go (e.g. "git log -p A..B" may
internally want to do "rev-list --objects A..B" to enumerate trees
and blobs that we may lack upfront).  I do not think having the
other side guess is a good idea, though.


right.
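
A small Python sketch of the difference between fetching each missing object
as it is stumbled upon and enumerating the needed objects up front.  This is
a toy model under stated assumptions: CountingRemote and both helper
functions are hypothetical names, and one call to fetch() stands in for one
fetch-pack invocation.

```python
from typing import Dict, Iterable, List


class CountingRemote:
    """Stand-in for a promisor remote that counts round trips."""

    def __init__(self, objects: Dict[str, bytes]):
        self.objects = objects
        self.requests = 0

    def fetch(self, oids: Iterable[str]) -> Dict[str, bytes]:
        self.requests += 1  # one request, however many objects it names
        return {oid: self.objects[oid] for oid in oids}


def fetch_one_by_one(remote: CountingRemote, local: Dict[str, bytes],
                     needed: List[str]) -> None:
    # Each missing object is resolved the moment it is encountered.
    for oid in needed:
        if oid not in local:
            local.update(remote.fetch([oid]))


def fetch_in_one_go(remote: CountingRemote, local: Dict[str, bytes],
                    needed: List[str]) -> None:
    # Enumerate everything that is missing first, then ask once.
    missing = [oid for oid in needed if oid not in local]
    if missing:
        local.update(remote.fetch(missing))


objects = {"t1": b"tree", "b1": b"one", "b2": b"two", "b3": b"three"}
needed = ["t1", "b1", "b2", "b3"]

slow_remote = CountingRemote(objects)
slow_local = {"t1": b"tree"}
fetch_one_by_one(slow_remote, slow_local, needed)

fast_remote = CountingRemote(objects)
fast_local = {"t1": b"tree"}
fetch_in_one_go(fast_remote, fast_local, needed)
```

Both local stores end up with the same objects.  Only the number of requests
differs, which is where the process startup and authentication overhead
would be paid.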




+- We currently only promisor packfiles.  We need to add support for
+  promisor loose objects as described earlier.


The earlier description was not convincing enough to feel the need
to me; at least not yet.


It seems like we need it if a promisor packfile gets unpacked for any
reason.  But right, I'm not sure how urgent it is.


Thanks
Jeff




Re: [PATCH] partial-clone: design doc

2017-12-13 Thread Philip Oakley

From: "Junio C Hamano" 

"Philip Oakley"  writes:


+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.


Is a comment needed here noting that currently, IIUC, the complete
trees are fetched in the packfiles, it's just the un-necessary blobs
that are omitted ?


I probably am misreading what you meant to say, but the above
statement with "currently" taken literally to mean the system
without JeffH's changes, is false.


I meant JeffH's current V6 series, rather than the last Git release.


In one of the previous discussions Jeff had noted that (at that time) his 
partial design would provide a full set of trees for the selected commits 
(excluding the trees already available locally), but only a few of the file 
blobs (based on the filter spec).


So yes, I should have been clearer to avoid talking at cross purposes.



When the receiver says it has commit A and the sender wants to send
a commit B (because the receiver said it does not have it, and it
wants it), trees in A are not sent in the pack the sender sends to
give objects sufficient to complete B, which the receiver wanted to
have, even if B also has those trees.  If you fetch from me twice
and between that time Documentation/ directory did not change, the
second fetch will not have the tree object that corresponds to that
hierarchy (and of course no blobs and sub trees inside it).


Though, after the fetch has completed (v2.15 Git), the receiver will have 
the 'full set of trees and blobs'. In Jeff's design (V6) the receiver would 
still have a full set of trees, but only a partial set of the blobs. So my 
viewpoint was not of the pack file but of the receiver's object store after 
the fetch.




So "the complete trees are fetched" is not true.  What is true (and
what matters more in JeffH's document) is that fetching is done in
such a way that objects resulting in the receiving repository are
complete in the current system that does not allow promised objects.
If some objects resulting in the receiving repository are incomplete,
the current system considers that we corrupted the repository.

The promise mechanism says that it is fine for the receiving end to
lack blobs, trees or commits, as long as the promisor repository
tells it that these "missing" objects can be obtained from it later.


True. (Though I'm not sure exactly how Jeff decides about commits - I 
thought they were not part of this optimisation.)



The way the receiving end, on noticing that it lacks an otherwise
required blob, tree or commit, determines that the object is one
promised by the promisor repository is to see if it is referenced by
a pack that came from such a promisor repository.


.. and marked as such with the ".promisor" extension.
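
For illustration, a short Python sketch of how packs marked this way could be
picked out of a pack directory.  It only assumes what the design doc states,
namely that a promisor pack has a ".promisor" file next to its ".pack" and
".idx" files.  The function name promisor_packs and the pack names are
invented, and the demo runs in a throwaway temporary directory rather than a
real repository.

```python
import tempfile
from pathlib import Path
from typing import List


def promisor_packs(pack_dir: Path) -> List[str]:
    """Return names of packs in pack_dir that have a sibling .promisor file."""
    return sorted(
        pack.name
        for pack in pack_dir.glob("*.pack")
        if pack.with_suffix(".promisor").exists()
    )


with tempfile.TemporaryDirectory() as tmp:
    pack_dir = Path(tmp)
    # pack-aaa came from the promisor remote; pack-bbb did not.
    for name in ("pack-aaa.pack", "pack-aaa.idx", "pack-aaa.promisor",
                 "pack-bbb.pack", "pack-bbb.idx"):
        (pack_dir / name).touch()
    found = promisor_packs(pack_dir)
```

An object that is missing locally but referenced from a pack in this list
would be treated as promised, whereas one referenced only from other packs
would still count as corruption.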



Thanks. 



Re: [PATCH] partial-clone: design doc

2017-12-12 Thread Junio C Hamano
"Philip Oakley"  writes:

>> +  These filtered packfiles are incomplete in the traditional sense because
>> +  they may contain trees that reference blobs that the client does not have.
>
> Is a comment needed here noting that currently, IIUC, the complete
> trees are fetched in the packfiles, it's just the un-necessary blobs
> that are omitted ?

I probably am misreading what you meant to say, but the above
statement with "currently" taken literally to mean the system
without JeffH's changes, is false.

When the receiver says it has commit A and the sender wants to send
a commit B (because the receiver said it does not have it, and it
wants it), trees in A are not sent in the pack the sender sends to
give objects sufficient to complete B, which the receiver wanted to
have, even if B also has those trees.  If you fetch from me twice
and between that time Documentation/ directory did not change, the
second fetch will not have the tree object that corresponds to that
hierarchy (and of course no blobs and sub trees inside it).

So "the complete trees are fetched" is not true.  What is true (and
what matters more in JeffH's document) is that fetching is done in
such a way that objects resulting in the receiving repository are
complete in the current system that does not allow promised objects.
If some objects resulting in the receiving repository are incomplete,
the current system considers that we corrupted the repository.

The promise mechanism says that it is fine for the receiving end to
lack blobs, trees or commits, as long as the promisor repository
tells it that these "missing" objects can be obtained from it later.
The way the receiving end, on noticing that it lacks an otherwise
required blob, tree or commit, determines that the object is one
promised by the promisor repository is to see if it is referenced by
a pack that came from such a promisor repository.




Re: [PATCH] partial-clone: design doc

2017-12-12 Thread Philip Oakley

From: "Jeff Hostetler" 

From: Jeff Hostetler 

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler 
Signed-off-by: Jonathan Tan 
---
Documentation/technical/partial-clone.txt | 240 ++
1 file changed, 240 insertions(+)
create mode 100644 Documentation/technical/partial-clone.txt

diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 000..7ab39d8
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,240 @@
+Partial Clone Design Notes
+==
+
+The "Partial Clone" feature is a performance optimization for git that
+allows git to function without having a complete copy of the repository.
+


I think it would be worthwhile at least listing the issues that make the 
'optimisation' necessary, and then the available factors that make the 
optimisation possible. This helps for future adjustments when those issues 
and factors change.


I think the issues are:
* the size of the repository that is being cloned, both in the width of a 
commit (you mentioned 100M trees) and the time (hours to days) / size to 
clone over the connection.


While the supporting factor is:
* the remote is always on-line and available for on-demand object fetching 
(seconds)


The solution choice then should fall out fairly obviously, and we can 
separate out the other optimisations that are based on other views about the 
issues. E.g. my desire for a solution in the off-line case.


In fact the current design, apart from some terminology, does look well 
matched, with only a couple of places that would be affected.


The airplane-mode expectations of a partial clone should also be stated.



+During clone and fetch operations, git normally downloads the complete
+contents and history of the repository.  That is, during clone the client
+receives all of the commits, trees, and blobs in the repository into a
+local ODB.  Subsequent fetches extend the local ODB with any new objects.
+For large repositories, this can take significant time to download and
+large amounts of diskspace to store.
+
+The goal of this work is to allow git better handle extremely large
+repositories.


Shouldn't this goal be nearer the top?


   Often in these repositories there are many files that the
+user does not need such as ancient versions of source files, files in
+portions of the worktree outside of the user's work area, or large binary
+assets.  If we can avoid downloading such unneeded objects *in advance*
+during clone and fetch operations, we can decrease download times and
+reduce ODB disk usage.
+


Does this need to distinguish between the shallow clone mechanism for 
reducing the cloning of old history and the desire for a width-wise partial 
clone of only the user's narrow work area, and/or without large files/blobs?



+
+Non-Goals
+-
+
+Partial clone is independent of and not intended to conflict with
+shallow-clone, refspec, or limited-ref mechanisms since these all operate
+at the DAG level whereas partial clone and fetch works *within* the set
+of commits already chosen for download.
+
+
+Design Overview
+---
+
+Partial clone logically consists of the following parts:
+
+- A mechanism for the client to describe unneeded or unwanted objects to
+  the server.
+
+- A mechanism for the server to omit such unwanted objects from packfiles
+  sent to the client.
+
+- A mechanism for the client to gracefully handle missing objects (that
+  were previously omitted by the server).
+
+- A mechanism for the client to backfill missing objects as needed.
+
+
+Design Details
+--
+
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+  upload-pack negotiation.
+
+  This uses the existing capability discovery mechanism.
+  See "filter" in Documentation/technical/pack-protocol.txt.
+
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+  server to request filtering during packfile construction.
+
+  There are various filters available to accomodate different situations.
+  See "--filter=" in Documentation/rev-list-options.txt.
+
+- On the server pack-objects applies the requested filter-spec as it
+  creates "filtered" packfiles for the client.
+
+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.


Is a comment needed here noting that currently, IIUC, the complete trees are 
fetched in the packfiles, it's just the un-necessary blobs that are omitted?



+
+
+ How the local repository gracefully handles missing objects
+
+With partial clone, the fact that objects can be missing makes such
+repositories incompatible with older versions of Git, necessitating a
+repository extension (see the 

Re: [PATCH] partial-clone: design doc

2017-12-08 Thread Junio C Hamano
Jeff Hostetler  writes:

> From: Jeff Hostetler 
>
> First draft of design document for partial clone feature.
>
> Signed-off-by: Jeff Hostetler 
> Signed-off-by: Jonathan Tan 
> ---

Thanks.

> +Non-Goals
> +-
> +
> +Partial clone is independent of and not intended to conflict with
> +shallow-clone, refspec, or limited-ref mechanisms since these all operate
> +at the DAG level whereas partial clone and fetch works *within* the set
> +of commits already chosen for download.

It probably is not a huge deal (simply because it is about
"Non-Goals") but I have no idea what "refspec" and "limited-ref
mechanism" refer to in the above sentence, and I suspect many others
share the same puzzlement.

> +An object may be missing due to a partial clone or fetch, or missing due
> +to repository corruption. To differentiate these cases, the local
> +repository specially indicates packfiles obtained from the promisor
> +remote. These "promisor packfiles" consist of a ".promisor" file
> +with arbitrary contents (like the ".keep" files), in addition to
> +their ".pack" and ".idx" files. (In the future, this ability
> +may be extended to loose objects[a].)
> + ...
> +Foot Notes
> +--
> +
> +[a] Remembering that loose objects are promisor objects is mainly
> +important for trees, since they may refer to promisor blobs that
> +the user does not have.  We do not need to mark loose blobs as
> +promisor because they do not refer to other objects.

I fail to see any logical link between the "loose" and "tree".
Putting it differently, I do not see why "tree" is so special.

A promisor pack that contains a tree but lacks blobs the tree refers
to would be sufficient to let us remember that these missing blobs
are not corruption.  A loose commit or a tag that is somehow marked
as obtained from a promisor, if it can serve just like a commit or a
tag in a promisor pack to promise its direct pointee, would equally
be useful (if very inefficient).

In any case, I suspect "since they may refer to promisor blobs" is a
typo of "since they may refer to promised blobs".

> +- Currently, dynamic object fetching invokes fetch-pack for each item
> +  because most algorithms stumble upon a missing object and need to have
> +  it resolved before continuing their work.  This may incur significant
> +  overhead -- and multiple authentication requests -- if many objects are
> +  needed.
> +
> +  We need to investigate use of a long-running process, such as proposed
> +  in [5,6] to reduce process startup and overhead costs.

Also perhaps in some operations we can enumerate the objects we will
need upfront and ask for them in one go (e.g. "git log -p A..B" may
internally want to do "rev-list --objects A..B" to enumerate trees
and blobs that we may lack upfront).  I do not think having the
other side guess is a good idea, though.

> +- We currently only promisor packfiles.  We need to add support for
> +  promisor loose objects as described earlier.

The earlier description was not convincing enough to feel the need
to me; at least not yet.