I received some great ideas (see below) from a number of people to whom I am grateful.
Here is the code I put together from many of their suggestions:
lx <- length(x)
lw <- length(w)
## |w[j] - x[i]| laid out as an lw x lx matrix (row j corresponds to w[j])
z <- matrix(abs( rep( x , lw ) - rep( w, each = lx ) ),
            nrow=lw, ncol=lx, byrow=TRUE)
## row maxima: the largest |w[j] - x[i]| is always at an extreme of x
s <- pmax( abs( w - min(x) ), abs( w - max(x) ) )
## tricube weights; rep() recycles s down the columns, scaling row j by s[j]
z <- (1 - (z/rep(s, length.out=lx*lw))^3)^3
## normalize each row to sum to 1
sums <- rowSums(z)
z <- z/rep(sums, length.out=lx*lw)
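As a quick sanity check (an editor's example using the data from the original post quoted below), the condensed code reproduces the outer()/sweep() result, and each row of the result can feed sample() directly for the weighted multinomial draw:

```r
## Example data from the original post
w <- c(1, 2, 3, 7)
x <- c(0, 1.5, 3)
lx <- length(x); lw <- length(w)

## Condensed version
z <- matrix(abs(rep(x, lw) - rep(w, each = lx)), nrow = lw, ncol = lx, byrow = TRUE)
s <- pmax(abs(w - min(x)), abs(w - max(x)))
z <- (1 - (z / rep(s, length.out = lx * lw))^3)^3
z <- z / rowSums(z)

## Original outer()/sweep() version
z0 <- abs(outer(w, x, "-"))
s0 <- apply(z0, 1, max)
z0 <- (1 - sweep(z0, 1, s0, FUN = "/")^3)^3
z0 <- sweep(z0, 1, apply(z0, 1, sum), FUN = "/")

all.equal(z, z0)   # TRUE: both give the same probability matrix

## One multinomial draw per element of w: a subscript into x
idx <- apply(z, 1, function(p) sample(lx, size = 1, prob = p))
```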
This improved the speed by roughly 25% over the original code. Unfortunately I realized
(once again) that any solution of this type based on outer() or something that
generates an lx*lw matrix does not scale well for large problems because of memory
usage. So I'm probably going to resort to a Ratfor/Fortran solution to this problem.
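For what it's worth, the memory problem can also be mitigated at the R level by processing w in blocks, so that only a chunk-by-lx slice of the weight matrix exists at any time. This is only an editor's sketch under that idea (match.blocked and its chunk argument are illustrative, not Frank's eventual Fortran code):

```r
## Block-wise variant: memory is O(chunk * lx) rather than O(lw * lx).
## Returns, for each element of w, a sampled subscript into x.
match.blocked <- function(w, x, chunk = 1000) {
  lx <- length(x)
  idx <- integer(length(w))
  for (i in seq(1, length(w), by = chunk)) {
    j <- i:min(i + chunk - 1, length(w))
    wj <- w[j]
    z <- abs(outer(wj, x, "-"))
    s <- pmax(abs(wj - min(x)), abs(wj - max(x)))
    z <- (1 - (z / s)^3)^3      # tricube; s recycles down the columns
    z <- z / rowSums(z)         # each row sums to 1
    idx[j] <- apply(z, 1, function(p) sample(lx, size = 1, prob = p))
  }
  idx
}
```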
Again thanks to the people mentioned below.
------------------------
Tim Hesterberg <[EMAIL PROTECTED]>:
z <- abs(outer(w, x, "-"))
>s <- apply(z, 1, max)
>z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
Instead of sweep, use
z / rep(s, length.out = length(z))
>sums <- apply(z, 1, sum)
rowSums(z)
>z <- sweep(z, 1, sums, FUN='/')
z / rep(sums, length.out = length(z))
[EMAIL PROTECTED]:
The sweeps could be replaced by simple division using the recycling rule,
but I haven't tested whether it is significantly faster.
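To make the recycling-rule point concrete (editor's example): a matrix is stored column-major, so dividing by a vector of length nrow(z) divides row j by its jth element, which is exactly what the row-wise sweep() does:

```r
## sweep(z, 1, s, FUN = "/") and z / s agree: s recycles down the
## columns, so row j of z is divided by s[j]
z <- abs(outer(c(1, 2, 3, 7), c(0, 1.5, 3), "-"))
s <- apply(z, 1, max)
all.equal(sweep(z, 1, s, FUN = "/"), z / s)   # TRUE
```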
Charles Berry <[EMAIL PROTECTED]>:
1) calculate the transpose of z. If z really is big, that should
reduce caching in subsequent calcs.
2) unroll the code in 'outer' and 'sweep', e.g. for 'outer':
l.w <- length(w)
l.x <- length(x)
z <- abs( rep( x , l.w ) - rep( w, each = l.x ) )
dim( z ) <- c( l.x, l.w )
3) find s as follows:
s <- pmax( abs( w - min(x) ), abs( w - max(x) ) )
4) use colSums() instead of apply(, , sum)
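A quick check of points 2)-4) above (editor's example): the rep()-based construction gives the transpose of abs(outer(w, x, "-")), the pmax() shortcut matches the row maxima, and colSums() on the transposed matrix gives the per-w sums directly:

```r
w <- c(1, 2, 3, 7); x <- c(0, 1.5, 3)
l.w <- length(w); l.x <- length(x)

## 2) unrolled outer: columns indexed by w, rows by x (i.e. the transpose)
z <- abs(rep(x, l.w) - rep(w, each = l.x))
dim(z) <- c(l.x, l.w)
all.equal(z, t(abs(outer(w, x, "-"))))                       # TRUE

## 3) the max |w[j] - x[i]| is always attained at min(x) or max(x)
all.equal(pmax(abs(w - min(x)), abs(w - max(x))),
          apply(abs(outer(w, x, "-")), 1, max))              # TRUE

## 4) column sums of the transpose = row sums of the original
all.equal(colSums(z), apply(abs(outer(w, x, "-")), 1, sum))  # TRUE
```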
Thomas W Blackwell <[EMAIL PROTECTED]>:
Matrix z is potentially enormous. On large matrices, it's
a familiar truism that as.vector(z %*% rep(1, dim(z)[2]))
is MUCH faster than apply(z, 1, sum).
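That truism is easy to demonstrate (editor's example); the matrix-vector product does the summing in compiled code instead of calling an R function once per row:

```r
## Row sums via matrix multiplication agree with apply(z, 1, sum)
z <- matrix(runif(6 * 4), nrow = 6)
all.equal(as.vector(z %*% rep(1, ncol(z))), apply(z, 1, sum))   # TRUE
```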
The equivalent "inline" code for "max" is not so nice.
So here's an alternative version (untested):
z <- abs(outer(w, x, "-"))
wid <- seq(along=w)
top <- 1 + round(max(z), 0) # we know z is nonnegative by construction
s <- z[ sort.list((z / top) + matrix(wid, nrow=dim(z)[1], ncol=dim(z)[2],
        byrow=F))[wid * dim(z)[2]] ]
z <- (1 - (z / matrix(s, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F))^3)^3
sums <- as.vector(z %*% rep(1, dim(z)[2]))
z <- z / matrix(sums, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F)
Excruciatingly ugly, but it works. Depends how much you're
willing to sacrifice transparent code in favor of speed.
(And, clearly I could save eight calls to dim(z) by defining
a couple of temporary variables.)
But surely, you know these hacks from way back. [Not necessarily, Thomas! -Frank]
Original post:
> For each element in w I want to find a good match (subscript number) of an element
>in x. x and w can be long. Instead of just finding the closest match I want to use
>weighted multinomial sampling (which I've already figured out once I have the
>probabilities) where the probabilities come from the tricube function of absolute
>differences between donor and target values, but normalized to sum to one, and using
>the maximum absolute difference as the scaling factor. This is similar to the loess
>weighting function with f=1. Here's code that works, to get the probability matrix
>to use for sampling:
>
> z <- abs(outer(w, x, "-"))
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
>
> Example:
> w <- c(1,2,3,7)
> x <- c(0,1.5,3)
> z <- abs(outer(w,x,"-"))
> > z
> [,1] [,2] [,3]
> [1,] 1 0.5 2
> [2,] 2 0.5 1
> [3,] 3 1.5 0
> [4,] 7 5.5 4
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> z
> [1,] 0.6699219 0.9538536 0.0000000
> [2,] 0.0000000 0.9538536 0.6699219
> [3,] 0.0000000 0.6699219 1.0000000
> [4,] 0.0000000 0.1365445 0.5381833
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
> z # each row represents multinomial probabilities summing to 1
> [1,] 0.4125705 0.5874295 0.0000000
> [2,] 0.0000000 0.5874295 0.4125705
> [3,] 0.0000000 0.4011696 0.5988304
> [4,] 0.0000000 0.2023697 0.7976303
>
>
> The code is moderately fast. Does anyone know of a significantly faster method or
>have any comments on the choice of weighting function for such sampling? This will
>be used in the context of predictive mean matching for multiple imputation. Thanks
>- Frank
--
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat
______________________________________________
[EMAIL PROTECTED] mailing list
http://www.stat.math.ethz.ch/mailman/listinfo/r-help