[EM] Fast Condorcet-Kemeny calculation times, clarification of NP-hardness issue

Richard Fobes Sun, 04 Mar 2012 12:45:09 -0800

Finally, after reading the articles cited by Warren Smith (listed at the bottom of this reply) plus some related articles, I can reply to his insistence that Condorcet-Kemeny calculations take too long to calculate. Also, this reply addresses the same claim that appears in Wikipedia both in the "Kemeny-Young method" article and in the comparison table within the Wikipedia "Voting systems" article (in the "polynomial time" column that Markus Schulze added).

One source of confusion is that Warren, and perhaps others, regard the Condorcet-Kemeny problem as a "decision problem" that only has a "yes" or "no" answer. This view is suggested by Warren's reference (below and in other messages) to the problem as being NP-complete, which only applies to decision problems. Although it is possible to formulate a decision problem based on one or more specified characteristics of the Condorcet-Kemeny method, that is a different problem than the Condorcet-Kemeny problem.

In the real world of elections, the Condorcet-Kemeny problem is to calculate a ranking of all choices (e.g. candidates) that maximizes the sequence score (or minimizes the "Kemeny score").

Clearly the Condorcet-Kemeny problem is an optimization problem, not a decision problem (and not a search problem). It is an optimization problem because we have a way to measure how closely the solution reaches its goal.
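
To make the "measure" concrete, here is a minimal Python sketch. It assumes the usual definition of the sequence score: for a given ranking, add up the pairwise counts for every pair in which the higher-ranked choice is preferred over the lower-ranked one. The pairwise counts and the function names are hypothetical, chosen only for illustration, and the exhaustive search shown is the "examine every sequence" approach, not the VoteFair algorithm.

```python
from itertools import permutations

def sequence_score(ranking, pairwise):
    """Sum pairwise[a][b] over every pair where a is ranked above b."""
    score = 0
    for i, a in enumerate(ranking):
        for b in ranking[i + 1:]:
            score += pairwise[a][b]
    return score

def best_ranking_exhaustive(pairwise):
    """Try every ordering (n! of them) and keep the highest-scoring one."""
    choices = sorted(pairwise)
    return max(permutations(choices),
               key=lambda r: sequence_score(r, pairwise))

# Hypothetical counts: pairwise[a][b] = number of voters preferring a over b.
pairwise = {
    "A": {"B": 6, "C": 7},
    "B": {"A": 4, "C": 8},
    "C": {"A": 3, "B": 2},
}

best = best_ranking_exhaustive(pairwise)
print(best, sequence_score(best, pairwise))  # ('A', 'B', 'C') 21
```

The exhaustive loop is what grows factorially with the number of choices; the score function itself is cheap to evaluate for any single ranking.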

(For contrast, consider the NP-hard "subset sum problem" in which the goal is to determine whether a specified list of integers contains a non-empty subset whose members add up to zero. Any subset either sums to zero or it doesn't sum to zero. This makes it easy to formulate the related decision (yes/no) problem that asks whether such a subset exists for a given set of numbers.)

Because the Condorcet-Kemeny problem is an optimization problem, the solution to the Condorcet-Kemeny problem can be an approximation. If this approach is used, it becomes relevant to ask how closely the approximation reaches the ranking that has the highest sequence score. Yet even this question -- of "how close?" -- is not a decision problem (because it goes beyond a yes or no answer).

Keeping in mind that VoteFair popularity ranking calculations are mathematically equivalent to the Condorcet-Kemeny method, my claim is that VoteFair popularity ranking calculations yield, at the least, the same top-ranked choice, and the same few top-ranked choices, as the solution produced by examining every sequence score -- except (and this is the important part) in cases where the voter preferences are so convoluted that any top-ranked choice and any few top-ranked choices would be controversial. As one academic paper elegantly put it: "garbage in, garbage out".

More specifically, here is a set of claims that state the above ambiguous claim more rigorously.

Claim 1: For _some_ _instances_, a polynomial-time calculation can identify the full ranking that produces the highest Condorcet-Kemeny sequence score.

Claim 2: For _some_ _instances_, a polynomial-time calculation can rank the top most-popular candidates/choices and this partial ranking will be the same as the top portion of the full ranking as determined by identifying the highest Condorcet-Kemeny sequence score.

Claim 3: For the _remaining_ _instances_ (not covered in Claims 1 and 2), an approximation of the full Condorcet-Kemeny ranking can be calculated in polynomial time.

Claim 4: For any cases in which the top-ranked candidate/choice according to the VoteFair popularity ranking algorithm differs from the top-ranked candidate/choice according to a full calculation of all sequence scores, the outcome of a runoff election between the two candidates/choices would be difficult to predict.

As done in the academic literature, I am excluding the cases in which more than one sequence has the same highest sequence score.


To help clarify the validity of these claims, I'll use an analogy.

Consider a special case of the rigorously studied Traveling Salesman Problem (TSP), which is NP-hard to solve. (The TSP also can be expressed as a decision problem, in which case the decision problem is NP-complete, but that variation is not the problem discussed here.)

The special case -- which I will refer to as the non-returning Traveling Salesman Problem -- is that we want to know which city the salesman visits first, and we want to know, with successively less interest, which city the salesman visits second, third, and so on. Additionally, for this special case, we specify that the cities to be visited are roughly located between a beginning point "B" and an ending point "E".

To make this special case mathematically equivalent to the normal Traveling Salesman Problem in which the salesman returns to the starting city, we create a path of closely spaced cities (labeled "+" below) that lead back to the starting city "B".

Here is a diagram of this problem. Remember that the most important thing we want to know is which city ("*") the salesman visits first.


B = Beginning city
* = City to visit
E = Ending city for main portion
+ = City on path back to beginning
(periods = background; assumes monospace font)

Instance 1:
.................................................B.
.....................................*............+
..................................................+
.....................................*............+
...................................*..............+
..............................*...................+
..................................................+
................................*.................+
.........................*........................+
......................*.....*.....................+
..................................................+
..................*..*.....*......................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

In this case it is obvious which city is the first one on the path from B to E. And it is obvious which are the next four cities on the path.

What we do not know is the sequence of cities after that (for the path that is shortest).

Now let's consider a different instance of this non-returning Traveling Salesman Problem.


Instance 2:
.................................................B.
..........................*.......................+
........................*....*....................+
................*.........*...*...................+
.............*.........*....*...*.*...............+
................*...*......*.....*...*............+
.......................*......*...*......*........+
..........*......*.........*......*...*...........+
.............*........*.........*......*..........+
..................*.........*......*..............+
.........*.....*.......*..........................+
.............*.....*..........*....*..............+
..................*..*.....*......................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

In this instance we cannot know which city is the first city on the shortest path until we know the shortest path through all the cities.

Calculating the absolute shortest path in a convoluted case like Instance 2 might require a calculation time that is super-polynomial (more than what can be expressed as a polynomial function of the city count).


However, we can estimate the shortest path.

Such an approximation might identify a first city that is different from the first city on the absolute shortest path. If the "wrong" city is identified as the first-visited city, this is understandable because there is no clearly identifiable first-visit city in this instance.
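
As one illustration of estimating a path in polynomial time, here is a small Python sketch of the well-known nearest-neighbor heuristic, applied to the non-returning case (start at "B", visit every city once, no return trip). The coordinates and function name are hypothetical. This is only one of many possible heuristics, and it can pick a "wrong" first city in crowded layouts like Instance 2.

```python
import math

def nearest_neighbor_path(start, cities):
    """Greedy estimate: repeatedly move to the closest unvisited city.

    Runs in O(n^2) time for n cities; the result is not guaranteed
    to be the shortest path.
    """
    path = []
    current = start
    remaining = list(cities)
    while remaining:
        nxt = min(remaining, key=lambda c: math.dist(current, c))
        remaining.remove(nxt)
        path.append(nxt)
        current = nxt
    return path

# Hypothetical layout: cities spread out roughly along a line from B.
start = (0.0, 0.0)
cities = [(5.0, 5.0), (1.0, 1.0), (3.0, 2.0), (9.0, 9.0)]

path = nearest_neighbor_path(start, cities)
print(path[0])  # (1.0, 1.0) is the estimated first city
```

When the cities are strung out as in Instance 1, the first few cities found this way are the obvious ones; the estimate only becomes questionable where the layout itself is ambiguous.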


This analogy can be extended to the Condorcet-Kemeny problem.

In normal election situations, the most important part of the solution is the first-ranked winner. In fact, most voting methods are not _designed_ to identify more than the first-ranked winner.

In contrast, the Condorcet-Kemeny problem is designed to identify a full ranking. Accordingly, the second-most important part (of solving the Condorcet-Kemeny problem) is to identify the top few highest-ranked choices.

Both of these important goals can be achieved without fully ranking all the choices. This is analogous to solving Instance 1 of the non-returning Traveling Salesman Problem.

The importance of calculating the few top-ranked choices, and the reduced importance of calculating the lower-ranked choices, is further demonstrated when the Condorcet-Kemeny method is used to aggregate (merge/join/etc.) separate rankings from different search engines (to yield "meta-search" results, which is the intended goal specified by IBM employees who authored one of the cited articles about Condorcet-Kemeny calculations). Specifically, a search-engine user is unlikely to look at the search results beyond the first few pages, which means that carefully calculating the full meta-search ranking for thousands of search results is pointless, and therefore the calculation time for a full ranking is irrelevant.

(As a further contrast, to clarify this point about a partial solution being useful, the subset-sum problem does not have a partial solution. All that matters is the existence of at least one solution, or the absence of any solution.)

Therefore, in some instances we can solve the NP-hard Condorcet-Kemeny problem "quickly" (in polynomial time) in the same way that we can "quickly" (in polynomial time) solve some instances -- such as Instance 1 -- of the NP-hard non-returning Traveling Salesman Problem.

In instances where we use an approximate solution for the Condorcet-Kemeny problem, the approximate solution can be calculated in polynomial time. Specifically, the algorithm used for VoteFair popularity ranking, which seeks to maximize the Condorcet-Kemeny sequence score, always completes in polynomial time (as evidenced by all the programming loops being bounded).

To further clarify these points, consider the following instance of the non-returning Traveling Salesman Problem.


Instance 3:
.................................................B.
..........................*.......................+
........................*....*....................+
................*.........*...*...................+
.............*.........*....*...*.*...............+
................*...*......*.....*...*............+
.......................*......*...*......*........+
.................*.........*......*...*...........+
.............*........*.........*......*..........+
..................*.........*......*..............+
.......................*..........................+
...................*..............................+
..................*..*............................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

For this instance, we can calculate the absolute shortest path through the group of cities closest to the starting point "B" without also calculating the absolute shortest path through the group of cities closest to the ending point "E".

Similarly, some instances of the Condorcet-Kemeny problem do not require calculating the exact order of lower-ranked choices (e.g. candidates) in order to exactly find the maximum-sequence-score ranking of the top-ranked choices.

Now that the word "instance" and the concept of a partial order are clear, I will offer proofs for Claims 1, 2, and 3.

Proof of Claim 1: If an instance has a Condorcet winner and each successively ranked choice is pairwise preferred over all the other remaining choices, this instance can be ranked in polynomial time.

Proof of Claim 2: If an instance has a Condorcet winner and the next few successively ranked choices are each pairwise preferred over all the remaining choices, the top-ranked choices for this instance can be ranked in polynomial time.
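
The condition used in the proofs of Claims 1 and 2 can be checked with a short Python sketch. It repeatedly looks for a choice that is pairwise preferred over every remaining choice, removes it, and stops when no such choice exists. The function name and the pairwise counts are hypothetical, and this is an illustration of the condition, not the VoteFair code. With n choices the loops amount to roughly n^3 pairwise comparisons.

```python
def condorcet_prefix(pairwise):
    """Peel off successive choices that beat all remaining choices.

    Returns (prefix, unresolved): the top portion that can be ranked
    this way, and the set of choices left when no remaining choice
    beats all the others.
    """
    remaining = set(pairwise)
    prefix = []
    while len(remaining) > 1:
        winner = None
        for a in remaining:
            if all(pairwise[a][b] > pairwise[b][a]
                   for b in remaining if b != a):
                winner = a
                break
        if winner is None:
            break  # cycle or tie among the remaining choices
        prefix.append(winner)
        remaining.remove(winner)
    if len(remaining) == 1:
        prefix.append(remaining.pop())  # last choice needs no comparison
    return prefix, remaining

# Hypothetical counts (10 voters): A beats everyone, B beats C, D, E,
# and C, D, E form a cycle (C > D, D > E, E > C).
pairwise = {
    "A": {"B": 6, "C": 7, "D": 7, "E": 8},
    "B": {"A": 4, "C": 6, "D": 6, "E": 7},
    "C": {"A": 3, "B": 4, "D": 6, "E": 4},
    "D": {"A": 3, "B": 4, "C": 4, "E": 6},
    "E": {"A": 2, "B": 3, "C": 6, "D": 4},
}

prefix, unresolved = condorcet_prefix(pairwise)
print(prefix, sorted(unresolved))  # ['A', 'B'] ['C', 'D', 'E']
```

An empty `unresolved` set corresponds to the situation in Claim 1 (the full ranking is found); a non-empty one corresponds to Claim 2 (only the top portion is settled this way).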

Proof of Claim 3: There are polynomial-time approximation methods that can efficiently find a sequence that has a Condorcet-Kemeny sequence score that is close to the largest sequence score.
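
As a rough illustration of what such a polynomial-time approximation can look like, here is a Python sketch of a simple local search. It is NOT the VoteFair popularity ranking algorithm; it is a swapped-in, generic heuristic with hypothetical names and data. It starts from an order based on each choice's total pairwise support and then swaps adjacent choices whenever the swap raises the sequence score. Every loop is bounded, and the result is only guaranteed to be a local optimum.

```python
def approximate_ranking(pairwise, max_passes=None):
    """Heuristic ranking: initial sort plus bounded adjacent-swap passes."""
    # Initial order: highest total pairwise support first.
    choices = sorted(pairwise, key=lambda a: -sum(pairwise[a].values()))
    n = len(choices)
    if max_passes is None:
        max_passes = n * n  # hard bound keeps the running time polynomial
    for _ in range(max_passes):
        improved = False
        for i in range(n - 1):
            a, b = choices[i], choices[i + 1]
            # Swapping adjacent a and b changes the sequence score by
            # pairwise[b][a] - pairwise[a][b]; swap only if that is a gain.
            if pairwise[b][a] > pairwise[a][b]:
                choices[i], choices[i + 1] = b, a
                improved = True
        if not improved:
            break
    return choices

# Hypothetical counts (10 voters): A has the most total support,
# but B is pairwise preferred over A.
pairwise = {
    "A": {"B": 4, "C": 9},
    "B": {"A": 6, "C": 5},
    "C": {"A": 1, "B": 5},
}

result = approximate_ranking(pairwise)
print(result)  # ['B', 'A', 'C']
```

In this small example the heuristic happens to reach the highest-scoring sequence (score 20), but in general a local search of this kind can stop short of the maximum.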

(Clarification: I am not claiming that a ranking result based on approximation will have the same fairness characteristics that are attributed to the "exact" Condorcet-Kemeny method.)

Using lots of real-life data, plus data that has unusual calculation-related characteristics, I have tested the VoteFair ranking algorithm against the full approach that calculates all sequence scores for up to six choices. In all these cases there are no differences in the top-ranked choice, nor are there any differences in the full ranking for the cases that have no ties. (The cases that involve ties involve multiple sequences that have the same highest score, the resolution of which is not specified in the Condorcet-Kemeny method.)

Of course Claim 4 would be difficult to prove. (This claim says that if the two methods do not identify the same winner, the outcome of a runoff election would be difficult to predict.) The point of Claim 4 is to clarify the concept of "controversial" and state that if the two methods identify different winners, neither winner is uncontroversial.

As a reminder (especially for anyone skimming), I am not saying that the Traveling Salesman Problem is mathematically related to the Condorcet-Kemeny problem (beyond both being categorized as NP-hard problems). Instead I am using the well-studied Traveling Salesman Problem as an analogy to clarify characteristics of the Condorcet-Kemeny problem that some election-method experts seem to misunderstand.

Perhaps the misunderstanding arises because the Condorcet-Kemeny method must fully rank all the choices in order to identify the top-ranked choice. In contrast, other methods do the opposite, namely they identify the top-ranked choice and then, if a further ranking is needed, the process is repeated (although for instant-runoff voting and the Condorcet-Schulze method the process of calculating the winner yields information that can be used to determine some or all of a full ranking).

If anyone has questions about the calculations done by the open-source VoteFair popularity ranking software, and especially about its ability to efficiently identify the highest sequence score based on meaningful voter preferences, I invite them to look at the clearly commented code. The code is on GitHub (in the CPSolver account) and on the Perl CPAN archive (which is mirrored on more than two hundred servers around the world).

In summary, although the Condorcet-Kemeny method is mathematically categorized as an NP-hard problem, the instances that are NP-hard to solve involve either the less-important lower-ranked choices (analogous to Instance 1 in the non-returning Traveling Salesman Problem), or involve convoluted top-ranked voter preferences that yield controversial results (analogous to Instances 2 and 3), or both. For all other instances -- which include all meaningful election situations -- score-optimized top-ranking results can be calculated in polynomial time.

Clearly, in contrast to what Warren Smith and Markus Schulze and some other election-method experts claim, the calculation time required by the Condorcet-Kemeny method is quite practical for use in real-life elections.

I'll close with a quote from the article by (IBM researchers) Davenport and Kalagnanam that Warren cited: "NP-hardness is a only [sic] worst case complexity result which may not reflect the difficulty of solving problems which arise in practice."


Richard Fobes

About the citations below: I was not able to read the article by Bartholdi, Tovey, and Trick because it requires paying a $35 fee. Alas, it is the article that other articles refer to for the proof of NP-hardness. However, the other articles, plus related academic articles, plus Wikipedia articles, provided sufficient perspective.


Again, thank you Warren, for providing the citations.


On 12/24/2011 10:25 AM, Warren Smith wrote:

Rank Aggregation Revisited
Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar
http://www.eecs.harvard.edu/~michaelm/CS222/rank2.pdf

J. J. Bartholdi, C. A. Tovey, and M. A. Trick: Voting schemes for
which it can be difficult to tell who won the
election, Social Choice and Welfare, 6(2):157–165, 1989.

A Computational Study of the Kemeny Rule for Preference Aggregation
Andrew Davenport and Jayant Kalagnanam
http://www.aaai.org/Papers/AAAI/2004/AAAI04-110.pdf

Cohen, W.; Schapire, R.; and Singer, Y. 1999. Learning to order
things. Journal of Artificial Intelligence Research 10:213-270.
http://www.jair.org/media/587/live-587-1788-jair.ps

etc
and it should be noted that it is NP-complete to find an ordering better than X,
and also NP-hard merely to find the Kemeny winner...



----
Election-Methods mailing list - see http://electorama.com/em for list info
