I received some great ideas (see below) from a number of people to whom I am grateful.
Here is the code I put together from many of their suggestions:
lx <- length(x)
lw <- length(w)
## |w[j] - x[i]| laid out as an lw x lx matrix (row j corresponds to w[j])
z <- matrix(abs( rep( x , lw ) - rep( w, each = lx ) ),
            nrow=lw, ncol=lx, byrow=TRUE)
## row maxima: the largest |w[j] - x[i]| is always at an extreme of x
s <- pmax( abs( w - min(x) ), abs( w - max(x) ) )
## tricube weights; rep() recycles s down the columns, scaling row j by s[j]
z <- (1 - (z/rep(s, length.out=lx*lw))^3)^3
## normalize each row to sum to 1
sums <- rowSums(z)
z <- z/rep(sums, length.out=lx*lw)
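As a quick sanity check (an editor's example using the data from the original post quoted below), the condensed code reproduces the outer()/sweep() result, and each row of the result can feed sample() directly for the weighted multinomial draw:

```r
## Example data from the original post
w <- c(1, 2, 3, 7)
x <- c(0, 1.5, 3)
lx <- length(x); lw <- length(w)

## Condensed version
z <- matrix(abs(rep(x, lw) - rep(w, each = lx)), nrow = lw, ncol = lx, byrow = TRUE)
s <- pmax(abs(w - min(x)), abs(w - max(x)))
z <- (1 - (z / rep(s, length.out = lx * lw))^3)^3
z <- z / rowSums(z)

## Original outer()/sweep() version
z0 <- abs(outer(w, x, "-"))
s0 <- apply(z0, 1, max)
z0 <- (1 - sweep(z0, 1, s0, FUN = "/")^3)^3
z0 <- sweep(z0, 1, apply(z0, 1, sum), FUN = "/")

all.equal(z, z0)   # TRUE: both give the same probability matrix

## One multinomial draw per element of w: a subscript into x
idx <- apply(z, 1, function(p) sample(lx, size = 1, prob = p))
```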
This improved the speed by roughly 25% over the original code. Unfortunately I realized
(once again) that any solution of this type based on outer() or something that
generates an lx*lw matrix does not scale well for large problems because of memory
usage. So I'm probably going to resort to a Ratfor/Fortran solution to this problem.
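For what it's worth, the memory problem can also be mitigated at the R level by processing w in blocks, so that only a chunk-by-lx slice of the weight matrix exists at any time. This is only an editor's sketch under that idea (match.blocked and its chunk argument are illustrative, not Frank's eventual Fortran code):

```r
## Block-wise variant: memory is O(chunk * lx) rather than O(lw * lx).
## Returns, for each element of w, a sampled subscript into x.
match.blocked <- function(w, x, chunk = 1000) {
  lx <- length(x)
  idx <- integer(length(w))
  for (i in seq(1, length(w), by = chunk)) {
    j <- i:min(i + chunk - 1, length(w))
    wj <- w[j]
    z <- abs(outer(wj, x, "-"))
    s <- pmax(abs(wj - min(x)), abs(wj - max(x)))
    z <- (1 - (z / s)^3)^3      # tricube; s recycles down the columns
    z <- z / rowSums(z)         # each row sums to 1
    idx[j] <- apply(z, 1, function(p) sample(lx, size = 1, prob = p))
  }
  idx
}
```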
Again thanks to the people mentioned below.
------------------------
Tim Hesterberg <[EMAIL PROTECTED]>:
z <- abs(outer(w, x, "-"))
>s <- apply(z, 1, max)
>z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
Instead of sweep, use
z / rep(s, length.out = length(z))
>sums <- apply(z, 1, sum)
rowSums(z)
>z <- sweep(z, 1, sums, FUN='/')
z / rep(sums, length.out = length(z))
[EMAIL PROTECTED]:
The sweeps could be replaced by simple division using the recycling rule,
but I haven't tested whether it is significantly faster.
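To make the recycling-rule point concrete (editor's example): a matrix is stored column-major, so dividing by a vector of length nrow(z) divides row j by its jth element, which is exactly what the row-wise sweep() does:

```r
## sweep(z, 1, s, FUN = "/") and z / s agree: s recycles down the
## columns, so row j of z is divided by s[j]
z <- abs(outer(c(1, 2, 3, 7), c(0, 1.5, 3), "-"))
s <- apply(z, 1, max)
all.equal(sweep(z, 1, s, FUN = "/"), z / s)   # TRUE
```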
Charles Berry <[EMAIL PROTECTED]>:
1) calculate the transpose of z. If z really is big, that should
reduce caching in subsequent calcs.
2) unroll the code in 'outer' and 'sweep', e.g. for 'outer':
l.w <- length(w)
l.x <- length(x)
z <- abs( rep( x , l.w ) - rep( w, each = l.x ) )
dim( z ) <- c( l.x, l.w )
3) find s as follows:
s <- pmax( abs( w - min(x) ), abs( w - max(x) ) )
4) use colSums() instead of apply(, , sum)
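A quick check of points 2)-4) above (editor's example): the rep()-based construction gives the transpose of abs(outer(w, x, "-")), the pmax() shortcut matches the row maxima, and colSums() on the transposed matrix gives the per-w sums directly:

```r
w <- c(1, 2, 3, 7); x <- c(0, 1.5, 3)
l.w <- length(w); l.x <- length(x)

## 2) unrolled outer: columns indexed by w, rows by x (i.e. the transpose)
z <- abs(rep(x, l.w) - rep(w, each = l.x))
dim(z) <- c(l.x, l.w)
all.equal(z, t(abs(outer(w, x, "-"))))                       # TRUE

## 3) the max |w[j] - x[i]| is always attained at min(x) or max(x)
all.equal(pmax(abs(w - min(x)), abs(w - max(x))),
          apply(abs(outer(w, x, "-")), 1, max))              # TRUE

## 4) column sums of the transpose = row sums of the original
all.equal(colSums(z), apply(abs(outer(w, x, "-")), 1, sum))  # TRUE
```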
Thomas W Blackwell <[EMAIL PROTECTED]>:
Matrix z is potentially enormous. On large matrices, it's
a familiar truism that as.vector(z %*% rep(1, dim(z)[2]))
is MUCH faster than apply(z, 1, sum).
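That truism is easy to demonstrate (editor's example); the matrix-vector product does the summing in compiled code instead of calling an R function once per row:

```r
## Row sums via matrix multiplication agree with apply(z, 1, sum)
z <- matrix(runif(6 * 4), nrow = 6)
all.equal(as.vector(z %*% rep(1, ncol(z))), apply(z, 1, sum))   # TRUE
```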
The equivalent "inline" code for "max" is not so nice.
So here's an alternative version (untested):
z <- abs(outer(w, x, "-"))
wid <- seq(along=w)
top <- 1 + round(max(z), 0) # we know z is nonnegative by construction
s <- z[ sort.list((z / top) + matrix(wid, nrow=dim(z)[1], ncol=dim(z)[2],
        byrow=F))[wid * dim(z)[2]] ]
z <- (1 - (z / matrix(s, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F))^3)^3
sums <- as.vector(z %*% rep(1, dim(z)[2]))
z <- z / matrix(sums, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F)
Excruciatingly ugly, but it works. Depends how much you're
willing to sacrifice transparent code in favor of speed.
(And, clearly I could save eight calls to dim(z) by defining
a couple of temporary variables.)
But surely, you know these hacks from way back. [Not necessarily, Thomas! -Frank]
Original post:
> For each element in w I want to find a good match (subscript number) of an element
>in x. x and w can be long. Instead of just finding the closest match I want to use
>weighted multinomial sampling (which I've already figured out once I have the
>probabilities) where the probabilities come from the tricube function of absolute
>differences between donor and target values, but normalized to sum to one, and using
>the maximum absolute difference as the scaling factor. This is similar to the loess
>weighting function with f=1. Here's code that works, to get the probability matrix
>to use for sampling:
>
> z <- abs(outer(w, x, "-"))
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
>
> Example:
> w <- c(1,2,3,7)
> x <- c(0,1.5,3)
> z <- abs(outer(w,x,"-"))
> > z
> [,1] [,2] [,3]
> [1,] 1 0.5 2
> [2,] 2 0.5 1
> [3,] 3 1.5 0
> [4,] 7 5.5 4
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> z
> [1,] 0.6699219 0.9538536 0.0000000
> [2,] 0.0000000 0.9538536 0.6699219
> [3,] 0.0000000 0.6699219 1.0000000
> [4,] 0.0000000 0.1365445 0.5381833
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
> z # each row represents multinomial probabilities summing to 1
> [1,] 0.4125705 0.5874295 0.0000000
> [2,] 0.0000000 0.5874295 0.4125705
> [3,] 0.0000000 0.4011696 0.5988304
> [4,] 0.0000000 0.2023697 0.7976303
>
>
> The code is moderately fast. Does anyone know of a significantly faster method or
>have any comments on the choice of weighting function for such sampling? This will
>be used in the context of predictive mean matching for multiple imputation. Thanks
>- Frank
--
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat
______________________________________________
[EMAIL PROTECTED] mailing list
http://www.stat.math.ethz.ch/mailman/listinfo/r-help