Re: [PATCH v2 11/25] object_array_remove_duplicates(): rewrite to reduce copying

2013-06-02 Thread Junio C Hamano
Michael Haggerty  writes:

>> The only caller of remove-duplicates is bundle.c, which gets many
>> starting points and end points from the command line and tries to be
>> nice by removing obvious duplicates, e.g.
>> 
>>  git bundle create t.bundle master master
>> 
>> but I think its logic of deduping is wrong.  It runs dwim_ref() on
>> the incoming refs after the remove-duplicates call, so
>> 
>>  git bundle create t.bundle master heads/mater
>> 
>> will end up with two copies of refs/heads/master.  To fix it, the
>> code must dedup the result of running dwim_ref(), and at that point,
>> there is no reason to call object_array_remove_duplicates().
>
> That sounds reasonable.
>
> I poked around this code a bit to understand what is going on, and it
> occurred to me that the object_array can include both positive and
> negative references, right?  And yet object_array_remove_duplicates()
> only considers names, not flags.  So it seems to me that if the deduping
> code would see the same reference twice, once positive and once
> negative, then it would throw an arbitrary one of them out, which would
> be wrong.
>
> But I couldn't provoke this situation, so perhaps setup_revisions()
> already specially treats combinations like "master ^master"?  (If that's
> true then why? and wouldn't it get confused by "master ^heads/master"?)

With "git bundle create t.bundle ^master master", you see two
entries in revs.pending.objects[] but they are the same object and
is already marked as uninteresting, so you will not see 'master' in
the result.

This parsing loop predates the more recent revs->cmdline mechanism,
that treats these two command line arguments as separate entities,
so that we can more reliably tell what the real end-user input is.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 11/25] object_array_remove_duplicates(): rewrite to reduce copying

2013-05-30 Thread Michael Haggerty
On 05/29/2013 06:18 PM, Junio C Hamano wrote:
> Michael Haggerty  writes:
> 
>> The old version copied one entry to its destination position, then
>> deleted any matching entries from the tail of the array.  This
>> required the tail of the array to be copied multiple times.  It didn't
>> affect the complexity of the algorithm because the whole tail has to
>> be searched through anyway.  But all the copying was unnecessary.
>>
>> Instead, check for the existence of an entry with the same name in the
>> *head* of the list before copying an entry to its final position.
>> This way each entry has to be copied at most one time.
>>
>> Extract a helper function contains_name() to do a bit of the work.
>>
>> Signed-off-by: Michael Haggerty 
>> ---
>>  object.c | 32 +---
>>  object.h |  6 +-
>>  2 files changed, 26 insertions(+), 12 deletions(-)
>>
>> diff --git a/object.c b/object.c
>> index fcd4a82..10b5349 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -294,22 +294,32 @@ void object_array_filter(struct object_array *array,
>>  array->nr = dst;
>>  }
>>  
>> +/*
>> + * Return true iff array already contains an entry with name.
>> + */
>> +static int contains_name(struct object_array *array, const char *name)
>> +{
>> +unsigned nr = array->nr, i;
>> +struct object_array_entry *object = array->objects;
>> +
>> +for (i = 0; i < nr; i++, object++)
>> +if (!strcmp(object->name, name))
>> +return 1;
>> +return 0;
>> +}
> 
> Because some codepaths (e.g. patch 14/25) stuff NULL in the name
> field, we may want to be more careful with this.
> 
> This is not a new problem, and I think the longer term solution is
> to get rid of object_array_remove_duplicates(), so it is perfectly
> fine to leave this function broken with respect to NULL input as-is.

You make a good point about NULL names, but I agree to leave it for now
since it needs work anyway.

> The only caller of remove-duplicates is bundle.c, which gets many
> starting points and end points from the command line and tries to be
> nice by removing obvious duplicates, e.g.
> 
>   git bundle create t.bundle master master
> 
> but I think its logic of deduping is wrong.  It runs dwim_ref() on
> the incoming refs after the remove-duplicates call, so
> 
>   git bundle create t.bundle master heads/mater
> 
> will end up with two copies of refs/heads/master.  To fix it, the
> code must dedup the result of running dwim_ref(), and at that point,
> there is no reason to call object_array_remove_duplicates().

That sounds reasonable.

I poked around this code a bit to understand what is going on, and it
occurred to me that the object_array can include both positive and
negative references, right?  And yet object_array_remove_duplicates()
only considers names, not flags.  So it seems to me that if the deduping
code would see the same reference twice, once positive and once
negative, then it would throw an arbitrary one of them out, which would
be wrong.

But I couldn't provoke this situation, so perhaps setup_revisions()
already specially treats combinations like "master ^master"?  (If that's
true then why? and wouldn't it get confused by "master ^heads/master"?)

I suppose someday I will have to dig into these functions and maybe even
document them.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 11/25] object_array_remove_duplicates(): rewrite to reduce copying

2013-05-29 Thread Junio C Hamano
Michael Haggerty  writes:

> The old version copied one entry to its destination position, then
> deleted any matching entries from the tail of the array.  This
> required the tail of the array to be copied multiple times.  It didn't
> affect the complexity of the algorithm because the whole tail has to
> be searched through anyway.  But all the copying was unnecessary.
>
> Instead, check for the existence of an entry with the same name in the
> *head* of the list before copying an entry to its final position.
> This way each entry has to be copied at most one time.
>
> Extract a helper function contains_name() to do a bit of the work.
>
> Signed-off-by: Michael Haggerty 
> ---
>  object.c | 32 +---
>  object.h |  6 +-
>  2 files changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/object.c b/object.c
> index fcd4a82..10b5349 100644
> --- a/object.c
> +++ b/object.c
> @@ -294,22 +294,32 @@ void object_array_filter(struct object_array *array,
>   array->nr = dst;
>  }
>  
> +/*
> + * Return true iff array already contains an entry with name.
> + */
> +static int contains_name(struct object_array *array, const char *name)
> +{
> + unsigned nr = array->nr, i;
> + struct object_array_entry *object = array->objects;
> +
> + for (i = 0; i < nr; i++, object++)
> + if (!strcmp(object->name, name))
> + return 1;
> + return 0;
> +}

Because some codepaths (e.g. patch 14/25) stuff NULL in the name
field, we may want to be more careful with this.

This is not a new problem, and I think the longer term solution is
to get rid of object_array_remove_duplicates(), so it is perfectly
fine to leave this function broken with respect to NULL input as-is.

The only caller of remove-duplicates is bundle.c, which gets many
starting points and end points from the command line and tries to be
nice by removing obvious duplicates, e.g.

git bundle create t.bundle master master

but I think its logic of deduping is wrong.  It runs dwim_ref() on
the incoming refs after the remove-duplicates call, so

git bundle create t.bundle master heads/mater

will end up with two copies of refs/heads/master.  To fix it, the
code must dedup the result of running dwim_ref(), and at that point,
there is no reason to call object_array_remove_duplicates().

> +
>  void object_array_remove_duplicates(struct object_array *array)
>  {
> - unsigned int ref, src, dst;
> + unsigned nr = array->nr, src;
>   struct object_array_entry *objects = array->objects;
>  
> - for (ref = 0; ref + 1 < array->nr; ref++) {
> - for (src = ref + 1, dst = src;
> -  src < array->nr;
> -  src++) {
> - if (!strcmp(objects[ref].name, objects[src].name))
> - continue;
> - if (src != dst)
> - objects[dst] = objects[src];
> - dst++;
> + array->nr = 0;
> + for (src = 0; src < nr; src++) {
> + if (!contains_name(array, objects[src].name)) {
> + if (src != array->nr)
> + objects[array->nr] = objects[src];
> + array->nr++;
>   }
> - array->nr = dst;
>   }
>  }

>  
> diff --git a/object.h b/object.h
> index 0d39ff4..6c1c27f 100644
> --- a/object.h
> +++ b/object.h
> @@ -96,7 +96,11 @@ typedef int (*object_array_each_func_t)(struct 
> object_array_entry *, void *);
>  void object_array_filter(struct object_array *array,
>object_array_each_func_t want, void *cb_data);
>  
> -void object_array_remove_duplicates(struct object_array *);
> +/*
> + * Remove from array all but the first entry with a given name.
> + * Warning: this function uses an O(N^2) algorithm.
> + */
> +void object_array_remove_duplicates(struct object_array *array);
>  
>  void clear_object_flags(unsigned flags);
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 11/25] object_array_remove_duplicates(): rewrite to reduce copying

2013-05-25 Thread Michael Haggerty
The old version copied one entry to its destination position, then
deleted any matching entries from the tail of the array.  This
required the tail of the array to be copied multiple times.  It didn't
affect the complexity of the algorithm because the whole tail has to
be searched through anyway.  But all the copying was unnecessary.

Instead, check for the existence of an entry with the same name in the
*head* of the list before copying an entry to its final position.
This way each entry has to be copied at most one time.

Extract a helper function contains_name() to do a bit of the work.

Signed-off-by: Michael Haggerty 
---
 object.c | 32 +---
 object.h |  6 +-
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/object.c b/object.c
index fcd4a82..10b5349 100644
--- a/object.c
+++ b/object.c
@@ -294,22 +294,32 @@ void object_array_filter(struct object_array *array,
array->nr = dst;
 }
 
+/*
+ * Return true iff array already contains an entry with name.
+ */
+static int contains_name(struct object_array *array, const char *name)
+{
+   unsigned nr = array->nr, i;
+   struct object_array_entry *object = array->objects;
+
+   for (i = 0; i < nr; i++, object++)
+   if (!strcmp(object->name, name))
+   return 1;
+   return 0;
+}
+
 void object_array_remove_duplicates(struct object_array *array)
 {
-   unsigned int ref, src, dst;
+   unsigned nr = array->nr, src;
struct object_array_entry *objects = array->objects;
 
-   for (ref = 0; ref + 1 < array->nr; ref++) {
-   for (src = ref + 1, dst = src;
-src < array->nr;
-src++) {
-   if (!strcmp(objects[ref].name, objects[src].name))
-   continue;
-   if (src != dst)
-   objects[dst] = objects[src];
-   dst++;
+   array->nr = 0;
+   for (src = 0; src < nr; src++) {
+   if (!contains_name(array, objects[src].name)) {
+   if (src != array->nr)
+   objects[array->nr] = objects[src];
+   array->nr++;
}
-   array->nr = dst;
}
 }
 
diff --git a/object.h b/object.h
index 0d39ff4..6c1c27f 100644
--- a/object.h
+++ b/object.h
@@ -96,7 +96,11 @@ typedef int (*object_array_each_func_t)(struct 
object_array_entry *, void *);
 void object_array_filter(struct object_array *array,
 object_array_each_func_t want, void *cb_data);
 
-void object_array_remove_duplicates(struct object_array *);
+/*
+ * Remove from array all but the first entry with a given name.
+ * Warning: this function uses an O(N^2) algorithm.
+ */
+void object_array_remove_duplicates(struct object_array *array);
 
 void clear_object_flags(unsigned flags);
 
-- 
1.8.2.3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html