On Tue, 2005-06-28 at 13:03 -0400, ferdinand principia wrote: > Hi, > > I need to calculate Kendall's tau for large data > vectors (length > 100'000). > Is somebody aware of a faster algorithm or package > function than "cor(, method="kendall")"? > There are ties in the data to be considered (Kendall's > tau-b). > > Any suggestions? > > Regards > Ferdinand
The time intensive part of the process is typically the ranking/ordering of the vector pairs to calculate the numbers of concordant and discordant pairs. If the number of _unique pairs_ in your data is substantially less than the number of total pairs (in other words, creating a smaller 2d contingency table from a pair of your vectors makes sense), then the following may be of help. # Calculate CONcordant Pairs in a table # cycle through x[r, c] and multiply by # sum(x elements below and to the right of x[r, c]) # x = table concordant <- function(x) { x <- matrix(as.numeric(x), dim(x)) # get sum(matrix values > r AND > c) # for each matrix[r, c] mat.lr <- function(r, c) { lr <- x[(r.x > r) & (c.x > c)] sum(lr) } # get row and column index for each # matrix element r.x <- row(x) c.x <- col(x) # return the sum of each matrix[r, c] * sums # using mapply to sequence thru each matrix[r, c] sum(x * mapply(mat.lr, r = r.x, c = c.x)) } # Calculate DIScordant Pairs in a table # cycle through x[r, c] and multiply by # sum(x elements below and to the left of x[r, c]) # x = table discordant <- function(x) { x <- matrix(as.numeric(x), dim(x)) # get sum(matrix values > r AND < c) # for each matrix[r, c] mat.ll <- function(r, c) { ll <- x[(r.x > r) & (c.x < c)] sum(ll) } # get row and column index for each # matrix element r.x <- row(x) c.x <- col(x) # return the sum of each matrix[r, c] * sums # using mapply to sequence thru each matrix[r, c] sum(x * mapply(mat.ll, r = r.x, c = c.x)) } # Calculate Kendall's Tau-b # x = table calc.KTb <- function(x) { x <- matrix(as.numeric(x), dim(x)) c <- concordant(x) d <- discordant(x) n <- sum(x) SumR <- rowSums(x) SumC <- colSums(x) KTb <- (2 * (c - d)) / sqrt(((n ^ 2) - (sum(SumR ^ 2))) * ((n ^ 2) - (sum(SumC ^ 2)))) KTb } Note that I made some modifications of the above, relative to prior versions that I have posted to handle large numbers of pairs to avoid integer overflows in summations. Hence the: x <- matrix(as.numeric(x), dim(x)) conversion in each function. Now, create some random test data, with 100,000 elements in each vector, sampling from 'letters', which would yield a 26 x 26 table: a <- sample(letters, 100000, replace = TRUE) b <- sample(letters, 100000, replace = TRUE) > dim(table(a, b)) [1] 26 26 > system.time(print(calc.KTb(table(a, b)))) [1] 0.0006906088 [1] 0.77 0.02 0.83 0.00 0.00 Note that in the above, the initial table takes most of the time: > system.time(table(a, b)) [1] 0.55 0.00 0.56 0.00 0.00 Hence: > tab.ab <- table(a, b) > system.time(print(calc.KTb(tab.ab))) [1] 0.0006906088 [1] 0.25 0.01 0.27 0.00 0.00 I should note that I also ran: > system.time(print(cor(a, b, method = "kendall"))) [1] 0.0006906088 [1] 694.80 7.72 931.89 0.00 0.00 Nice to know the results work out at least... :-) I have not tested with substantially larger 2d matrices, but would envision that as the dimensions of the resultant tabulation increases, my method probably approaches and may even become less efficient than the approach implemented in cor(). Some testing would validate this and perhaps point to coding the concordant() and discordant() functions in C for improvement in timing. HTH, Marc Schwartz ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html