Re: [PATCH v4 2/3] implement fetching of moved submodules

2017-10-19 Thread Stefan Beller
On Wed, Oct 18, 2017 at 5:35 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>>> but if we already have a submodule with that name (the most likely
>>> explanation for its existence is because it started its life there
>>> and then later moved), and the submodule is bound to a different
>>> path, then that is a different submodule.  Skipping and warning both
>>> are sensible thing to do.
>>
>> Skipping and warning is sensible once we decide to go this way.
>>
>> I propose to take a step back and not throw away the information
>> whether the given string is a name or path, as then we do not have
>> to warn&skip, but we can treat both correctly.
>
> Now either one of us is utterly confused, and I suspect it is me, as
> I do not see how "treat both correctly" could possibly work in the
> case this code warns and skips.
>
> At this point in the flow, we already know that it is not name,
> because we asked and got a "Nah, there is no submodule registered in
> .gitmodules at that path" from submodule_from_path().  Then we ask
> submodule_from_name() if there is any submodule registered under the
> name it would have got if it were added there, and we indeed find
> one.  And that is definitely *not* a submodule we are looking for,
> because if it were, its .path would have pointed at the path we were
> using to ask in the first place.  The one we originally found at
> path and are interested in finding out the details is not known to
> .gitmodules, and the one under that name is not the one that we are
> intereted in, so fetching from the repository the other one that
> happens to have the same name but is different from the submodule we
> are interested in would simply be wrong.

Eventually we'd want to also init new submodules on fetch
(if you use submodule.active to specify the interesting submodules),
and in that case I would imagine to fetch both submodules.

As I wrote the code to further improve this series,
I realized that this is maybe "good enough" for now,
so assume that I have reviewed this series and found it good.

> If we only have path without any .gitmodules entry (hence there is
> not even URL), how would we proceed from that point on?

Oh well, right. We can only offer to keep the existing behavior
which means supporting existing repos in place at that path.

Stefan

>


Re: [PATCH v4 2/3] implement fetching of moved submodules

2017-10-18 Thread Junio C Hamano
Stefan Beller  writes:

>> but if we already have a submodule with that name (the most likely
>> explanation for its existence is because it started its life there
>> and then later moved), and the submodule is bound to a different
>> path, then that is a different submodule.  Skipping and warning both
>> are sensible thing to do.
>
> Skipping and warning is sensible once we decide to go this way.
>
> I propose to take a step back and not throw away the information
> whether the given string is a name or path, as then we do not have
> to warn&skip, but we can treat both correctly.

Now either one of us is utterly confused, and I suspect it is me, as
I do not see how "treat both correctly" could possibly work in the
case this code warns and skips.

At this point in the flow, we already know that it is not name,
because we asked and got a "Nah, there is no submodule registered in
.gitmodules at that path" from submodule_from_path().  Then we ask
submodule_from_name() if there is any submodule registered under the
name it would have got if it were added there, and we indeed find
one.  And that is definitely *not* a submodule we are looking for,
because if it were, its .path would have pointed at the path we were
using to ask in the first place.  The one we originally found at
path and are interested in finding out the details is not known to
.gitmodules, and the one under that name is not the one that we are
intereted in, so fetching from the repository the other one that
happens to have the same name but is different from the submodule we
are interested in would simply be wrong.

If we only have path without any .gitmodules entry (hence there is
not even URL), how would we proceed from that point on?



Re: [PATCH v4 2/3] implement fetching of moved submodules

2017-10-18 Thread Stefan Beller
On Tue, Oct 17, 2017 at 5:03 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>>> +   /* make sure name does not collide with existing 
>>> one */
>>> +   submodule = submodule_from_name(commit_oid, name);
>>> +   if (submodule) {
>>> +   warning("Submodule in commit %s at path: "
>>> +   "'%s' collides with a submodule 
>>> named "
>>> +   "the same. Skipping it.",
>>> +   oid_to_hex(commit_oid), name);
>>> +   name = NULL;
>>> +   }
>>
>> This is the ugly part of using one string list and storing names or
>> path in it. I wonder if we could omit this warning if we had 2 string lists?
>
> We are keying off of 'name', because that is what will give a module
> its identity.  If we have a gitlink whose path is not in .gitmodules
> in the same tree, then we are seeing an unregistered submodule.

Right, so it has no submodule specific identity and we chose to "fake it"
by pretending its path is its name. However this requires checking as
there might be overlap in the name-namespace and the path-namespace.


>  If
> we were to "git add" it, then we'd use its path as the default name,

I presume "git submodule add"

> but if we already have a submodule with that name (the most likely
> explanation for its existence is because it started its life there
> and then later moved), and the submodule is bound to a different
> path, then that is a different submodule.  Skipping and warning both
> are sensible thing to do.

Skipping and warning is sensible once we decide to go this way.

I propose to take a step back and not throw away the information
whether the given string is a name or path, as then we do not have
to warn&skip, but we can treat both correctly.

As we only need to store an additional boolean (is it path or name?),
I had suggested to just use two lists, one for key-by-name and one
key-by-path, where we intend to use the key-by-name for submodules
and the by-path only for those with no name (i.e. lone gitlinks), hence
making this a "fallback list"

>
> I do not know what you see as ugly here,

the necessity of warn&skip instead of having a solution that
works in corner cases just fine.

> and more importantly, I am
> not sure how having two lists would help.

The current situation is that we use the path of the submodules only,
which makes it work without warn&skip, but it has other disadvantages
(i.e. new & moved submodules are not detected), which we want to fix.

We can add this functionality without caving in to skip the corner case
by storing an additional bit of information. The renaming is detected by
having a constant name before and after, just the path changed.
So we could continue to use by-path logic and only have the name
for rename detection. However that seems to be ugly, too. So we
seem to think that the by-name is better (as it is more in line with what
we think should happen, it is easier to explain, review and maintain(?)).

So we could have by-name keys, with the extra information of whether the
key is genuine or a "fake" key, which is t be resolved to a path instead.
And as that is just one bit, I proposed two lists for that.

Do I miss an essential part here?

Thanks,
Stefan


Re: [PATCH v4 2/3] implement fetching of moved submodules

2017-10-17 Thread Junio C Hamano
Stefan Beller  writes:

>> +   /* make sure name does not collide with existing one 
>> */
>> +   submodule = submodule_from_name(commit_oid, name);
>> +   if (submodule) {
>> +   warning("Submodule in commit %s at path: "
>> +   "'%s' collides with a submodule 
>> named "
>> +   "the same. Skipping it.",
>> +   oid_to_hex(commit_oid), name);
>> +   name = NULL;
>> +   }
>
> This is the ugly part of using one string list and storing names or
> path in it. I wonder if we could omit this warning if we had 2 string lists?

We are keying off of 'name', because that is what will give a module
its identity.  If we have a gitlink whose path is not in .gitmodules
in the same tree, then we are seeing an unregistered submodule.  If
we were to "git add" it, then we'd use its path as the default name,
but if we already have a submodule with that name (the most likely
explanation for its existence is because it started its life there
and then later moved), and the submodule is bound to a different
path, then that is a different submodule.  Skipping and warning both
are sensible thing to do.

I do not know what you see as ugly here, and more importantly, I am
not sure how having two lists would help.


Re: [PATCH v4 2/3] implement fetching of moved submodules

2017-10-17 Thread Stefan Beller
On Mon, Oct 16, 2017 at 6:58 AM, Heiko Voigt  wrote:
> We store the changed submodules paths to calculate which submodule needs
> fetching. This does not work for moved submodules since their paths do
> not stay the same in case of a moved submodules. In case of new
> submodules we do not have a path in the current checkout, since they
> just appeared in this fetch.
>
> It is more general to collect the submodule names for changes instead of
> their paths to include the above cases. If we do not have a
> configuration for a gitlink we rely on constructing a default name from
> the path if a git repository can be found at its path. We skip
> non-configured gitlinks whose default name collides with a configured
> one.

Thanks for working on this!

As detailed below, I wonder if it is easier (in maintenance, explaining
correctness, reviewing) if we'd rather keep two lists around. One for
based on names, and if we cannot lookup a name for a submodule, we
rather use the second path based list as a fallback. That would avoid
potential namespace collisions between names and paths, as well as
not having the confusion of mapping back and forth.

Most functions would then need to operate on path, as the name->path
mapping can be looked up for the first list, but the path->name mapping
cannot be looked up for the second list.

Cheers,
Stefan

> With the change described above we implement 'on-demand' fetching of
> changes in moved submodules.
>
> Signed-off-by: Heiko Voigt 

> ---
>  submodule-config.h  |   3 +
>  submodule.c | 138 
> 
>  t/t5526-fetch-submodules.sh |  35 +++
>  3 files changed, 138 insertions(+), 38 deletions(-)
>
> diff --git a/submodule-config.h b/submodule-config.h
> index e3845831f6..a5503a5d17 100644
> --- a/submodule-config.h
> +++ b/submodule-config.h
> @@ -22,6 +22,9 @@ struct submodule {
> int recommend_shallow;
>  };
>
> +#define SUBMODULE_INIT { NULL, NULL, NULL, RECURSE_SUBMODULES_NONE, \
> +   NULL, NULL, SUBMODULE_UPDATE_STRATEGY_INIT, {0}, -1 };
> +
>  struct submodule_cache;
>  struct repository;
>
> diff --git a/submodule.c b/submodule.c
> index 63e7094e16..71d1773e2e 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -21,7 +21,7 @@
>  #include "parse-options.h"
>
>  static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
> -static struct string_list changed_submodule_paths = STRING_LIST_INIT_DUP;
> +static struct string_list changed_submodule_names = STRING_LIST_INIT_DUP;
>  static int initialized_fetch_ref_tips;
>  static struct oid_array ref_tips_before_fetch;
>  static struct oid_array ref_tips_after_fetch;
> @@ -674,11 +674,11 @@ const struct submodule *submodule_from_ce(const struct 
> cache_entry *ce)
>  }
>
>  static struct oid_array *submodule_commits(struct string_list *submodules,
> -  const char *path)
> +  const char *name)
>  {
> struct string_list_item *item;
>
> -   item = string_list_insert(submodules, path);
> +   item = string_list_insert(submodules, name);
> if (item->util)
> return (struct oid_array *) item->util;
>
> @@ -687,39 +687,67 @@ static struct oid_array *submodule_commits(struct 
> string_list *submodules,
> return (struct oid_array *) item->util;
>  }
>
> +struct collect_changed_submodules_cb_data {
> +   struct string_list *changed;
> +   const struct object_id *commit_oid;
> +};
> +
> +/*
> + * this would normally be two functions: default_name_from_path() and

Please start the comment capitalised. (minor nit)

> + * path_from_default_name(). Since the default name is the same as
> + * the submodule path we can get away with just one function which only
> + * checks whether there is a submodule in the working directory at that
> + * location.

This is an interesting comment, as it hints that we ought to keep it that way.
Earlier I was wondering if we want to make the name distinctively different
than its path, as that will confuse users *less* IMHO. (I just remember
someone asking on the mailing list why their "rename" did not work, as
they just renamed everything in the .gitmodules that looked like the path)

As the path/name is confusing, I'd wish we'd be super concise, such that
errors are harder to sneak into. For example, the arguments name should
be "path" as that is the only thing we can look up using is_sub_pop_gently,
if a "name" is given, than it just works because the chosen default name
was its path.

> +static const char *default_name_or_path(const char *path_or_name)
> +{
> +   int error_code;
> +
> +   if (!is_submodule_populated_gently(path_or_name, &error_code))
> +   return NULL;
> +
> +   return path_or_name;
> +}
> +


> +   if (submodule)
> +   name = submodule->name;
> +   else {
> +   name = default_name_o

[PATCH v4 2/3] implement fetching of moved submodules

2017-10-16 Thread Heiko Voigt
We store the changed submodules paths to calculate which submodule needs
fetching. This does not work for moved submodules since their paths do
not stay the same in case of a moved submodules. In case of new
submodules we do not have a path in the current checkout, since they
just appeared in this fetch.

It is more general to collect the submodule names for changes instead of
their paths to include the above cases. If we do not have a
configuration for a gitlink we rely on constructing a default name from
the path if a git repository can be found at its path. We skip
non-configured gitlinks whose default name collides with a configured
one.

With the change described above we implement 'on-demand' fetching of
changes in moved submodules.

Signed-off-by: Heiko Voigt 
---
 submodule-config.h  |   3 +
 submodule.c | 138 
 t/t5526-fetch-submodules.sh |  35 +++
 3 files changed, 138 insertions(+), 38 deletions(-)

diff --git a/submodule-config.h b/submodule-config.h
index e3845831f6..a5503a5d17 100644
--- a/submodule-config.h
+++ b/submodule-config.h
@@ -22,6 +22,9 @@ struct submodule {
int recommend_shallow;
 };
 
+#define SUBMODULE_INIT { NULL, NULL, NULL, RECURSE_SUBMODULES_NONE, \
+   NULL, NULL, SUBMODULE_UPDATE_STRATEGY_INIT, {0}, -1 };
+
 struct submodule_cache;
 struct repository;
 
diff --git a/submodule.c b/submodule.c
index 63e7094e16..71d1773e2e 100644
--- a/submodule.c
+++ b/submodule.c
@@ -21,7 +21,7 @@
 #include "parse-options.h"
 
 static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
-static struct string_list changed_submodule_paths = STRING_LIST_INIT_DUP;
+static struct string_list changed_submodule_names = STRING_LIST_INIT_DUP;
 static int initialized_fetch_ref_tips;
 static struct oid_array ref_tips_before_fetch;
 static struct oid_array ref_tips_after_fetch;
@@ -674,11 +674,11 @@ const struct submodule *submodule_from_ce(const struct 
cache_entry *ce)
 }
 
 static struct oid_array *submodule_commits(struct string_list *submodules,
-  const char *path)
+  const char *name)
 {
struct string_list_item *item;
 
-   item = string_list_insert(submodules, path);
+   item = string_list_insert(submodules, name);
if (item->util)
return (struct oid_array *) item->util;
 
@@ -687,39 +687,67 @@ static struct oid_array *submodule_commits(struct 
string_list *submodules,
return (struct oid_array *) item->util;
 }
 
+struct collect_changed_submodules_cb_data {
+   struct string_list *changed;
+   const struct object_id *commit_oid;
+};
+
+/*
+ * this would normally be two functions: default_name_from_path() and
+ * path_from_default_name(). Since the default name is the same as
+ * the submodule path we can get away with just one function which only
+ * checks whether there is a submodule in the working directory at that
+ * location.
+ */
+static const char *default_name_or_path(const char *path_or_name)
+{
+   int error_code;
+
+   if (!is_submodule_populated_gently(path_or_name, &error_code))
+   return NULL;
+
+   return path_or_name;
+}
+
 static void collect_changed_submodules_cb(struct diff_queue_struct *q,
  struct diff_options *options,
  void *data)
 {
+   struct collect_changed_submodules_cb_data *me = data;
+   struct string_list *changed = me->changed;
+   const struct object_id *commit_oid = me->commit_oid;
int i;
-   struct string_list *changed = data;
 
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
struct oid_array *commits;
+   const struct submodule *submodule;
+   const char *name;
+
if (!S_ISGITLINK(p->two->mode))
continue;
 
-   if (S_ISGITLINK(p->one->mode)) {
-   /*
-* NEEDSWORK: We should honor the name configured in
-* the .gitmodules file of the commit we are examining
-* here to be able to correctly follow submodules
-* being moved around.
-*/
-   commits = submodule_commits(changed, p->two->path);
-   oid_array_append(commits, &p->two->oid);
-   } else {
-   /* Submodule is new or was moved here */
-   /*
-* NEEDSWORK: When the .git directories of submodules
-* live inside the superprojects .git directory some
-* day we should fetch new submodules directly into
-* that location too when config or options request
-* that so they can be