Matching to minimize a cost is known as the linear assignment problem, can be solved in n^3 cost, and is implemented in scikit-learn in sklearn.utils.linear_assignment_.linear_assignment or in recent versions of scipy as scipy.optimize.linear_sum_assignment
Of course, this problem will require much more coding (you need to build your pairwise cost matrix) and much more computing cost (n^3 instead of n^2) than a standard nearest-neighbor. Gaël On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote: > Hi Jake, > Thanks for the reply. Yes, trying this out resulted from looking for ways in > python to implement propensity score matching. I found a package, pscore_match > (http://www.kellieottoboni.com/pscore_match/), but the matching was really > terrible. Specifically, I'm matching based on age, race, gender, HIV status, > hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for > matching performed WAY better, I was so surprised at how well every factor was > matched for. The only issue is that it uses replacement. > Here's what I'm currently testing. I need each case to match to 20 controls, > so > since NearestNeighbors uses replacement, I'm matching each case to many > controls (15000), taking all of the distances for all of the pairs, and > retaining only the smallest distances for each control. Since many controls > are > re-used (since the algorithm uses replacement), the hope is that enough > controls are matched to many different cases so that each case ends up being > matched to 20 unique controls. Does this method make sense?? > Best, > Randy > On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jake...@cs.washington.edu> > wrote: > On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljel...@gmail.com> > wrote: > Hello to the Scikit-learn community! > I am doing case-control matching for an electronic health records > study. My question is, is it possible to run Sklearn's > NearestNeighbors > function without replacement? As in, match the treated group to the > untreated group without re-using any of the untreated group data > points? If so, how? By default, it uses replacement. I know this > because I tested it on some data of mine. > The code I used is in the confirmed answer here: https:// > stats.stackexchange.com/questions/206832/matched-pai > rs-in-python-propensity-score-matching > Thanks so much in advance, > No, pairwise matching without replacement is not implemented within > scikit-learn's nearest neighbors routines. > It seems like an algorithm you would have to think carefully about because > the number of potential pairs grows exponentially with the number of > points, and I don't think it's true that choosing the nearest available > neighbor of points in sequence will guarantee you to find the optimal > configuration. You'd also have to carefully define what you mean by > "optimal"... are you seeking to minimize the sum of all distances? The sum > of squared distances? The maximum distance? The results would change > depending on the metric you define. And you'd probably have to figure out > some way to reduce the exponential search space in order to calculate the > result in a reasonable amount of time for your data. > You might look into the literature on propensity score matching; I think > that's one area where this kind of neighbors-without-replacement algorithm > is often used. > Best, > Jake > -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux _______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn