Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
Hi Brian, On Wed, 30 May 2018, brian m. carlson wrote: > On Wed, May 30, 2018 at 09:14:06AM -0700, Stefan Beller wrote: > > Good point. I remember my initial reaction to the file names was > > expecting some hungarian notation, which totally didn't make sense, so > > I refrained from commenting. Searching the web for the algorithm, > > maybe 'lapjv.c' is adequate? (short for "Linear Assignment Problem > > Jonker Volgenant") Matlab has a function named lapjv solving the same > > problem, so it would fall in line with the outside world. > > > > Out of interest, why is it called hungarian in the first place? (I > > presume that comes from some background of DScho in image processing > > or such, so the the answer will be interesting for sure:) > > I think it's because tbdiff uses the hungarian Python module, which > implements the Hungarian method, also known as the Kuhn-Munkres > algorithm, for solving the linear assignment problem. This is the > Jonker-Volgenant algorithm, which solves the same problem. It's faster, > but less tolerant. > > At least this is what I just learned after about ten minutes of > searching. You learned well. The Assignment Problem (or "Linear Assignment Problem") is generally solved by the Hungarian algorithm. I forgot why it is called that way. Kuhn-Munkres came up with a simplification of the algorithm IIRC but it still is O(N^4). Then Jonker-Volgenant took a very different approach that somehow results in O(N^3). It's been *years* since I studied both implementations, so I cannot really explain what they do, and how the latter achieves its order-of-magnitude better performance. And after reading these mails, I agree that the name "hungarian" might be confusing. I also think that "lapjv" is confusing. In general, I try to let Matlab conventions inform on my naming as little as possible, and I find my naming fu a lot better for it. So in this case, how about `linear-assignment.c`? Ciao, Dscho
Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
On Wed, May 30, 2018 at 09:14:06AM -0700, Stefan Beller wrote: > Good point. I remember my initial reaction to the file names was expecting > some hungarian notation, which totally didn't make sense, so I refrained from > commenting. Searching the web for the algorithm, maybe 'lapjv.c' is adequate? > (short for "Linear Assignment Problem Jonker Volgenant") Matlab has a function > named lapjv solving the same problem, so it would fall in line with the > outside > world. > > Out of interest, why is it called hungarian in the first place? (I presume > that > comes from some background of DScho in image processing or such, so the > the answer will be interesting for sure:) I think it's because tbdiff uses the hungarian Python module, which implements the Hungarian method, also known as the Kuhn-Munkres algorithm, for solving the linear assignment problem. This is the Jonker-Volgenant algorithm, which solves the same problem. It's faster, but less tolerant. At least this is what I just learned after about ten minutes of searching. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 signature.asc Description: PGP signature
Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
On Wed, May 30, 2018 at 6:55 AM, SZEDER Gábor wrote: >> The Jonker-Volgenant algorithm was implemented to answer questions such >> as: given two different versions of a topic branch (or iterations of a >> patch series), what is the best pairing of commits/patches between the >> different versions? >> >> Signed-off-by: Johannes Schindelin >> --- >> Makefile| 1 + >> hungarian.c | 205 >> hungarian.h | 19 + > > (Nit: I personally don't really like these filenames, I know they will > surprise and distract me every time I notice them for years to come... :) Good point. I remember my initial reaction to the file names was expecting some hungarian notation, which totally didn't make sense, so I refrained from commenting. Searching the web for the algorithm, maybe 'lapjv.c' is adequate? (short for "Linear Assignment Problem Jonker Volgenant") Matlab has a function named lapjv solving the same problem, so it would fall in line with the outside world. Out of interest, why is it called hungarian in the first place? (I presume that comes from some background of DScho in image processing or such, so the the answer will be interesting for sure:)
Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
> The Jonker-Volgenant algorithm was implemented to answer questions such > as: given two different versions of a topic branch (or iterations of a > patch series), what is the best pairing of commits/patches between the > different versions? > > Signed-off-by: Johannes Schindelin > --- > Makefile| 1 + > hungarian.c | 205 > hungarian.h | 19 + (Nit: I personally don't really like these filenames, I know they will surprise and distract me every time I notice them for years to come... :) > +int compute_assignment(int column_count, int row_count, double *cost, > +int *column2row, int *row2column) > +{ > + double *v = xmalloc(sizeof(double) * column_count), *d; > + int *free_row, free_count = 0, saved_free_count, *pred, *col; > + int i, j, phase; > + for (free_count = 0; free_count < saved_free_count; free_count++) { > + int i1 = free_row[free_count], low = 0, up = 0, last, k; > + double min, c, u1; > + /* augmentation */ > + do { > + if (j < 0) > + BUG("negative j: %d", j); > + i = pred[j]; > + column2row[j] = i; > + k = j; > + j = row2column[i]; > + row2column[i] = k; Coccinelle suggests to replace the last three lines above with: SWAP(j, row2column[i]); I think it's right, using the SWAP macro makes the resulting code not only shorter and clearer, but it also saves the reader from thinking about whether it's important to set 'k = j' (I think it's not), or 'k' is just used here in lieu of a dedicated 'tmp' variable (I think it is). > + } while (i1 != i); > + } > +
Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
Hi Peff, On Sat, 5 May 2018, Jeff King wrote: > On Fri, May 04, 2018 at 05:34:29PM +0200, Johannes Schindelin wrote: > > > The Jonker-Volgenant algorithm was implemented to answer questions such > > as: given two different versions of a topic branch (or iterations of a > > patch series), what is the best pairing of commits/patches between the > > different versions? > > I love git-tbdiff, so I'm excited to see it getting more exposure (and a > speedup). Thanks for working on this! :-) > Two minor nits on this patch: > > > +/* > > + * The parameter `cost` is the cost matrix: the cost to assign column j to > > row > > + * i is `cost[j + column_count * i]. > > + */ > > +int compute_assignment(int column_count, int row_count, double *cost, > > + int *column2row, int *row2column) > > +{ > > + double *v = xmalloc(sizeof(double) * column_count), *d; > > Please use st_mult, xcalloc, or ALLOC_ARRAY here to avoid unchecked > multiplication in an allocation. This is probably hard to exploit in > practice (since you'd need sizeof(size_t)/8 columns, which presumably > requires allocating some heavier-weight struct per item). But it makes > auditing easier if we avoid the pattern altogether. Sure. I did mean to return errors in those case, but I guess it is not worth the trouble (what would we do in case of out-of-memory?). > > +/* > > + * Compute an assignment of columns -> rows (and vice versa) such that > > every > > + * column is assigned to at most one row (and vice versa) minimizing the > > + * overall cost. > > + * > > + * The parameter `cost` is the cost matrix: the cost to assign column j to > > row > > + * i is `cost[j + column_count * i]. > > + * > > + * The arrays column2row and row2column will be populated with the > > respective > > + * assignments (-1 for unassigned, which can happen only if column_count != > > + * row_count). > > + */ > > +int compute_assignment(int column_count, int row_count, double *cost, > > + int *column2row, int *row2column); > > It looks like this always returns zero. Is there a ever a case where we > would return an error if this? If not, should it just be void? Fixed. Ciao, Dscho
Re: [PATCH v2 01/18] Add a function to solve least-cost assignment problems
On Fri, May 04, 2018 at 05:34:29PM +0200, Johannes Schindelin wrote: > The Jonker-Volgenant algorithm was implemented to answer questions such > as: given two different versions of a topic branch (or iterations of a > patch series), what is the best pairing of commits/patches between the > different versions? I love git-tbdiff, so I'm excited to see it getting more exposure (and a speedup). Thanks for working on this! Two minor nits on this patch: > +/* > + * The parameter `cost` is the cost matrix: the cost to assign column j to > row > + * i is `cost[j + column_count * i]. > + */ > +int compute_assignment(int column_count, int row_count, double *cost, > +int *column2row, int *row2column) > +{ > + double *v = xmalloc(sizeof(double) * column_count), *d; Please use st_mult, xcalloc, or ALLOC_ARRAY here to avoid unchecked multiplication in an allocation. This is probably hard to exploit in practice (since you'd need sizeof(size_t)/8 columns, which presumably requires allocating some heavier-weight struct per item). But it makes auditing easier if we avoid the pattern altogether. > +/* > + * Compute an assignment of columns -> rows (and vice versa) such that every > + * column is assigned to at most one row (and vice versa) minimizing the > + * overall cost. > + * > + * The parameter `cost` is the cost matrix: the cost to assign column j to > row > + * i is `cost[j + column_count * i]. > + * > + * The arrays column2row and row2column will be populated with the respective > + * assignments (-1 for unassigned, which can happen only if column_count != > + * row_count). > + */ > +int compute_assignment(int column_count, int row_count, double *cost, > +int *column2row, int *row2column); It looks like this always returns zero. Is there a ever a case where we would return an error if this? If not, should it just be void? -Peff
[PATCH v2 01/18] Add a function to solve least-cost assignment problems
The Jonker-Volgenant algorithm was implemented to answer questions such as: given two different versions of a topic branch (or iterations of a patch series), what is the best pairing of commits/patches between the different versions? Signed-off-by: Johannes Schindelin --- Makefile| 1 + hungarian.c | 205 hungarian.h | 19 + 3 files changed, 225 insertions(+) create mode 100644 hungarian.c create mode 100644 hungarian.h diff --git a/Makefile b/Makefile index 50da82b0169..96f2e76a904 100644 --- a/Makefile +++ b/Makefile @@ -829,6 +829,7 @@ LIB_OBJS += gpg-interface.o LIB_OBJS += graph.o LIB_OBJS += grep.o LIB_OBJS += hashmap.o +LIB_OBJS += hungarian.o LIB_OBJS += help.o LIB_OBJS += hex.o LIB_OBJS += ident.o diff --git a/hungarian.c b/hungarian.c new file mode 100644 index 000..346299a97d9 --- /dev/null +++ b/hungarian.c @@ -0,0 +1,205 @@ +/* + * Based on: Jonker, R., & Volgenant, A. (1987). A shortest augmenting path + * algorithm for dense and sparse linear assignment problems. Computing, + * 38(4), 325-340. + */ +#include "cache.h" +#include "hungarian.h" +#include + +#define COST(column, row) cost[(column) + column_count * (row)] + +/* + * The parameter `cost` is the cost matrix: the cost to assign column j to row + * i is `cost[j + column_count * i]. + */ +int compute_assignment(int column_count, int row_count, double *cost, + int *column2row, int *row2column) +{ + double *v = xmalloc(sizeof(double) * column_count), *d; + int *free_row, free_count = 0, saved_free_count, *pred, *col; + int i, j, phase; + + memset(column2row, -1, sizeof(int) * column_count); + memset(row2column, -1, sizeof(int) * row_count); + + /* column reduction */ + for (j = column_count - 1; j >= 0; j--) { + int i1 = 0; + + for (i = 1; i < row_count; i++) + if (COST(j, i1) > COST(j, i)) + i1 = i; + v[j] = COST(j, i1); + if (row2column[i1] == -1) { + /* row i1 unassigned */ + row2column[i1] = j; + column2row[j] = i1; + } else { + if (row2column[i1] >= 0) + row2column[i1] = -2 - row2column[i1]; + column2row[j] = -1; + } + } + + /* reduction transfer */ + free_row = xmalloc(sizeof(int) * row_count); + for (int i = 0; i < row_count; i++) { + int j1 = row2column[i]; + if (j1 == -1) + free_row[free_count++] = i; + else if (j1 < -1) + row2column[i] = -2 - j1; + else { + double min = COST(!j1, i) - v[!j1]; + for (j = 1; j < column_count; j++) + if (j != j1 && min > COST(j, i) - v[j]) + min = COST(j, i) - v[j]; + v[j1] -= min; + } + } + + if (free_count == + (column_count < row_count ? row_count - column_count : 0)) { + free(v); + free(free_row); + return 0; + } + + /* augmenting row reduction */ + for (phase = 0; phase < 2; phase++) { + int k = 0; + + saved_free_count = free_count; + free_count = 0; + while (k < saved_free_count) { + double u1, u2; + int j1 = 0, j2, i0; + + i = free_row[k++]; + u1 = COST(j1, i) - v[j1]; + j2 = -1; + u2 = DBL_MAX; + for (j = 1; j < column_count; j++) { + double c = COST(j, i) - v[j]; + if (u2 > c) { + if (u1 < c) { + u2 = c; + j2 = j; + } else { + u2 = u1; + u1 = c; + j2 = j1; + j1 = j; + } + } + } + if (j2 < 0) { + j2 = j1; + u2 = u1; + } + + i0 = column2row[j1]; + if (u1 < u2) + v[j1] -= u2 - u1; + else if (i0 >= 0) { + j1 = j2; + i0 = column2row[j1]; + } + +