Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-31 Thread Johannes Schindelin
Hi Brian,

On Wed, 30 May 2018, brian m. carlson wrote:

> On Wed, May 30, 2018 at 09:14:06AM -0700, Stefan Beller wrote:
> > Good point. I remember my initial reaction to the file names was
> > expecting some hungarian notation, which totally didn't make sense, so
> > I refrained from commenting. Searching the web for the algorithm,
> > maybe 'lapjv.c' is adequate?  (short for "Linear Assignment Problem
> > Jonker Volgenant") Matlab has a function named lapjv solving the same
> > problem, so it would fall in line with the outside world.
> > 
> > Out of interest, why is it called hungarian in the first place? (I
> > presume that comes from some background of DScho in image processing
> > or such, so the the answer will be interesting for sure:)
> 
> I think it's because tbdiff uses the hungarian Python module, which
> implements the Hungarian method, also known as the Kuhn-Munkres
> algorithm, for solving the linear assignment problem.  This is the
> Jonker-Volgenant algorithm, which solves the same problem.  It's faster,
> but less tolerant.
> 
> At least this is what I just learned after about ten minutes of
> searching.

You learned well.

The Assignment Problem (or "Linear Assignment Problem") is generally
solved by the Hungarian algorithm. I forgot why it is called that way.
Kuhn-Munkres came up with a simplification of the algorithm IIRC but it
still is O(N^4). Then Jonker-Volgenant took a very different approach that
somehow results in O(N^3). It's been *years* since I studied both
implementations, so I cannot really explain what they do, and how the
latter achieves its order-of-magnitude better performance.

And after reading these mails, I agree that the name "hungarian" might be
confusing.

I also think that "lapjv" is confusing. In general, I try to let Matlab
conventions inform on my naming as little as possible, and I find my
naming fu a lot better for it.

So in this case, how about `linear-assignment.c`?

Ciao,
Dscho


Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-30 Thread brian m. carlson
On Wed, May 30, 2018 at 09:14:06AM -0700, Stefan Beller wrote:
> Good point. I remember my initial reaction to the file names was expecting
> some hungarian notation, which totally didn't make sense, so I refrained from
> commenting. Searching the web for the algorithm, maybe 'lapjv.c' is adequate?
> (short for "Linear Assignment Problem Jonker Volgenant") Matlab has a function
> named lapjv solving the same problem, so it would fall in line with the 
> outside
> world.
> 
> Out of interest, why is it called hungarian in the first place? (I presume 
> that
> comes from some background of DScho in image processing or such, so the
> the answer will be interesting for sure:)

I think it's because tbdiff uses the hungarian Python module, which
implements the Hungarian method, also known as the Kuhn-Munkres
algorithm, for solving the linear assignment problem.  This is the
Jonker-Volgenant algorithm, which solves the same problem.  It's faster,
but less tolerant.

At least this is what I just learned after about ten minutes of
searching.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-30 Thread Stefan Beller
On Wed, May 30, 2018 at 6:55 AM, SZEDER Gábor  wrote:
>> The Jonker-Volgenant algorithm was implemented to answer questions such
>> as: given two different versions of a topic branch (or iterations of a
>> patch series), what is the best pairing of commits/patches between the
>> different versions?
>>
>> Signed-off-by: Johannes Schindelin 
>> ---
>>  Makefile|   1 +
>>  hungarian.c | 205 
>>  hungarian.h |  19 +
>
> (Nit: I personally don't really like these filenames, I know they will
> surprise and distract me every time I notice them for years to come... :)

Good point. I remember my initial reaction to the file names was expecting
some hungarian notation, which totally didn't make sense, so I refrained from
commenting. Searching the web for the algorithm, maybe 'lapjv.c' is adequate?
(short for "Linear Assignment Problem Jonker Volgenant") Matlab has a function
named lapjv solving the same problem, so it would fall in line with the outside
world.

Out of interest, why is it called hungarian in the first place? (I presume that
comes from some background of DScho in image processing or such, so the
the answer will be interesting for sure:)


Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-30 Thread SZEDER Gábor
> The Jonker-Volgenant algorithm was implemented to answer questions such
> as: given two different versions of a topic branch (or iterations of a
> patch series), what is the best pairing of commits/patches between the
> different versions?
> 
> Signed-off-by: Johannes Schindelin 
> ---
>  Makefile|   1 +
>  hungarian.c | 205 
>  hungarian.h |  19 +

(Nit: I personally don't really like these filenames, I know they will
surprise and distract me every time I notice them for years to come... :)

> +int compute_assignment(int column_count, int row_count, double *cost,
> +int *column2row, int *row2column)
> +{
> + double *v = xmalloc(sizeof(double) * column_count), *d;
> + int *free_row, free_count = 0, saved_free_count, *pred, *col;
> + int i, j, phase;



> + for (free_count = 0; free_count < saved_free_count; free_count++) {
> + int i1 = free_row[free_count], low = 0, up = 0, last, k;
> + double min, c, u1;



> + /* augmentation */
> + do {
> + if (j < 0)
> + BUG("negative j: %d", j);
> + i = pred[j];
> + column2row[j] = i;
> + k = j;
> + j = row2column[i];
> + row2column[i] = k;

Coccinelle suggests to replace the last three lines above with:

  SWAP(j, row2column[i]);

I think it's right, using the SWAP macro makes the resulting code not
only shorter and clearer, but it also saves the reader from thinking
about whether it's important to set 'k = j' (I think it's not), or 'k'
is just used here in lieu of a dedicated 'tmp' variable (I think it
is).

> + } while (i1 != i);
> + }
> +


Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-05 Thread Johannes Schindelin
Hi Peff,

On Sat, 5 May 2018, Jeff King wrote:

> On Fri, May 04, 2018 at 05:34:29PM +0200, Johannes Schindelin wrote:
> 
> > The Jonker-Volgenant algorithm was implemented to answer questions such
> > as: given two different versions of a topic branch (or iterations of a
> > patch series), what is the best pairing of commits/patches between the
> > different versions?
> 
> I love git-tbdiff, so I'm excited to see it getting more exposure (and a
> speedup). Thanks for working on this!

:-)

> Two minor nits on this patch:
> 
> > +/*
> > + * The parameter `cost` is the cost matrix: the cost to assign column j to 
> > row
> > + * i is `cost[j + column_count * i].
> > + */
> > +int compute_assignment(int column_count, int row_count, double *cost,
> > +  int *column2row, int *row2column)
> > +{
> > +   double *v = xmalloc(sizeof(double) * column_count), *d;
> 
> Please use st_mult, xcalloc, or ALLOC_ARRAY here to avoid unchecked
> multiplication in an allocation. This is probably hard to exploit in
> practice (since you'd need sizeof(size_t)/8 columns, which presumably
> requires allocating some heavier-weight struct per item). But it makes
> auditing easier if we avoid the pattern altogether.

Sure. I did mean to return errors in those case, but I guess it is not
worth the trouble (what would we do in case of out-of-memory?).

> > +/*
> > + * Compute an assignment of columns -> rows (and vice versa) such that 
> > every
> > + * column is assigned to at most one row (and vice versa) minimizing the
> > + * overall cost.
> > + *
> > + * The parameter `cost` is the cost matrix: the cost to assign column j to 
> > row
> > + * i is `cost[j + column_count * i].
> > + *
> > + * The arrays column2row and row2column will be populated with the 
> > respective
> > + * assignments (-1 for unassigned, which can happen only if column_count !=
> > + * row_count).
> > + */
> > +int compute_assignment(int column_count, int row_count, double *cost,
> > +  int *column2row, int *row2column);
> 
> It looks like this always returns zero. Is there a ever a case where we
> would return an error if this? If not, should it just be void?

Fixed.

Ciao,
Dscho


Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-05 Thread Jeff King
On Fri, May 04, 2018 at 05:34:29PM +0200, Johannes Schindelin wrote:

> The Jonker-Volgenant algorithm was implemented to answer questions such
> as: given two different versions of a topic branch (or iterations of a
> patch series), what is the best pairing of commits/patches between the
> different versions?

I love git-tbdiff, so I'm excited to see it getting more exposure (and a
speedup). Thanks for working on this!

Two minor nits on this patch:

> +/*
> + * The parameter `cost` is the cost matrix: the cost to assign column j to 
> row
> + * i is `cost[j + column_count * i].
> + */
> +int compute_assignment(int column_count, int row_count, double *cost,
> +int *column2row, int *row2column)
> +{
> + double *v = xmalloc(sizeof(double) * column_count), *d;

Please use st_mult, xcalloc, or ALLOC_ARRAY here to avoid unchecked
multiplication in an allocation. This is probably hard to exploit in
practice (since you'd need sizeof(size_t)/8 columns, which presumably
requires allocating some heavier-weight struct per item). But it makes
auditing easier if we avoid the pattern altogether.

> +/*
> + * Compute an assignment of columns -> rows (and vice versa) such that every
> + * column is assigned to at most one row (and vice versa) minimizing the
> + * overall cost.
> + *
> + * The parameter `cost` is the cost matrix: the cost to assign column j to 
> row
> + * i is `cost[j + column_count * i].
> + *
> + * The arrays column2row and row2column will be populated with the respective
> + * assignments (-1 for unassigned, which can happen only if column_count !=
> + * row_count).
> + */
> +int compute_assignment(int column_count, int row_count, double *cost,
> +int *column2row, int *row2column);

It looks like this always returns zero. Is there a ever a case where we
would return an error if this? If not, should it just be void?

-Peff


[PATCH v2 01/18] Add a function to solve least-cost assignment problems

2018-05-04 Thread Johannes Schindelin
The Jonker-Volgenant algorithm was implemented to answer questions such
as: given two different versions of a topic branch (or iterations of a
patch series), what is the best pairing of commits/patches between the
different versions?

Signed-off-by: Johannes Schindelin 
---
 Makefile|   1 +
 hungarian.c | 205 
 hungarian.h |  19 +
 3 files changed, 225 insertions(+)
 create mode 100644 hungarian.c
 create mode 100644 hungarian.h

diff --git a/Makefile b/Makefile
index 50da82b0169..96f2e76a904 100644
--- a/Makefile
+++ b/Makefile
@@ -829,6 +829,7 @@ LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
 LIB_OBJS += hashmap.o
+LIB_OBJS += hungarian.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
diff --git a/hungarian.c b/hungarian.c
new file mode 100644
index 000..346299a97d9
--- /dev/null
+++ b/hungarian.c
@@ -0,0 +1,205 @@
+/*
+ * Based on: Jonker, R., & Volgenant, A. (1987). A shortest augmenting path
+ * algorithm for dense and sparse linear assignment problems. Computing,
+ * 38(4), 325-340.
+ */
+#include "cache.h"
+#include "hungarian.h"
+#include 
+
+#define COST(column, row) cost[(column) + column_count * (row)]
+
+/*
+ * The parameter `cost` is the cost matrix: the cost to assign column j to row
+ * i is `cost[j + column_count * i].
+ */
+int compute_assignment(int column_count, int row_count, double *cost,
+  int *column2row, int *row2column)
+{
+   double *v = xmalloc(sizeof(double) * column_count), *d;
+   int *free_row, free_count = 0, saved_free_count, *pred, *col;
+   int i, j, phase;
+
+   memset(column2row, -1, sizeof(int) * column_count);
+   memset(row2column, -1, sizeof(int) * row_count);
+
+   /* column reduction */
+   for (j = column_count - 1; j >= 0; j--) {
+   int i1 = 0;
+
+   for (i = 1; i < row_count; i++)
+   if (COST(j, i1) > COST(j, i))
+   i1 = i;
+   v[j] = COST(j, i1);
+   if (row2column[i1] == -1) {
+   /* row i1 unassigned */
+   row2column[i1] = j;
+   column2row[j] = i1;
+   } else {
+   if (row2column[i1] >= 0)
+   row2column[i1] = -2 - row2column[i1];
+   column2row[j] = -1;
+   }
+   }
+
+   /* reduction transfer */
+   free_row = xmalloc(sizeof(int) * row_count);
+   for (int i = 0; i < row_count; i++) {
+   int j1 = row2column[i];
+   if (j1 == -1)
+   free_row[free_count++] = i;
+   else if (j1 < -1)
+   row2column[i] = -2 - j1;
+   else {
+   double min = COST(!j1, i) - v[!j1];
+   for (j = 1; j < column_count; j++)
+   if (j != j1 && min > COST(j, i) - v[j])
+   min = COST(j, i) - v[j];
+   v[j1] -= min;
+   }
+   }
+
+   if (free_count ==
+   (column_count < row_count ? row_count - column_count : 0)) {
+   free(v);
+   free(free_row);
+   return 0;
+   }
+
+   /* augmenting row reduction */
+   for (phase = 0; phase < 2; phase++) {
+   int k = 0;
+
+   saved_free_count = free_count;
+   free_count = 0;
+   while (k < saved_free_count) {
+   double u1, u2;
+   int j1 = 0, j2, i0;
+
+   i = free_row[k++];
+   u1 = COST(j1, i) - v[j1];
+   j2 = -1;
+   u2 = DBL_MAX;
+   for (j = 1; j < column_count; j++) {
+   double c = COST(j, i) - v[j];
+   if (u2 > c) {
+   if (u1 < c) {
+   u2 = c;
+   j2 = j;
+   } else {
+   u2 = u1;
+   u1 = c;
+   j2 = j1;
+   j1 = j;
+   }
+   }
+   }
+   if (j2 < 0) {
+   j2 = j1;
+   u2 = u1;
+   }
+
+   i0 = column2row[j1];
+   if (u1 < u2)
+   v[j1] -= u2 - u1;
+   else if (i0 >= 0) {
+   j1 = j2;
+   i0 = column2row[j1];
+