[R] faster algorithm for Kendall's tau

2005-06-28 Thread ferdinand principia
Hi,

I need to calculate Kendall's tau for large data
vectors (length > 100'000). 
Is somebody aware of a faster algorithm or package
function than "cor(, method="kendall")"? 
There are ties in the data to be considered (Kendall's
tau-b).

Any suggestions?

Regards
Ferdinand

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] faster algorithm for Kendall's tau

2005-06-28 Thread Marc Schwartz (via MN)
On Tue, 2005-06-28 at 13:03 -0400, ferdinand principia wrote:
> Hi,
> 
> I need to calculate Kendall's tau for large data
> vectors (length > 100'000). 
> Is somebody aware of a faster algorithm or package
> function than "cor(, method="kendall")"? 
> There are ties in the data to be considered (Kendall's
> tau-b).
> 
> Any suggestions?
> 
> Regards
> Ferdinand


The time intensive part of the process is typically the ranking/ordering
of the vector pairs to calculate the numbers of concordant and
discordant pairs.

If the number of _unique pairs_ in your data is substantially less than
the number of total pairs (in other words, creating a smaller 2d
contingency table from a pair of your vectors makes sense), then the
following may be of help.

# Calculate CONcordant Pairs in a table
# cycle through x[r, c] and multiply by
# sum(x elements below and to the right of x[r, c])
# x = table
concordant <- function(x)
{
  x <- matrix(as.numeric(x), dim(x))
  
  # get sum(matrix values > r AND > c)
  # for each matrix[r, c]
  mat.lr <- function(r, c)
  {
    lr <- x[(r.x > r) & (c.x > c)]
    sum(lr)
  }

  # get row and column index for each
  # matrix element
  r.x <- row(x)
  c.x <- col(x)

  # return the sum of each matrix[r, c] * sums
  # using mapply to sequence thru each matrix[r, c]
  sum(x * mapply(mat.lr, r = r.x, c = c.x))
}

# Calculate DIScordant Pairs in a table
# cycle through x[r, c] and multiply by
# sum(x elements below and to the left of x[r, c])
# x = table
discordant <- function(x)
{
  x <- matrix(as.numeric(x), dim(x))
  
  # get sum(matrix values > r AND < c)
  # for each matrix[r, c]
  mat.ll <- function(r, c)
  {
    ll <- x[(r.x > r) & (c.x < c)]
    sum(ll)
  }

  # get row and column index for each
  # matrix element
  r.x <- row(x)
  c.x <- col(x)

  # return the sum of each matrix[r, c] * sums
  # using mapply to sequence thru each matrix[r, c]
  sum(x * mapply(mat.ll, r = r.x, c = c.x))
}


# Calculate Kendall's Tau-b
# x = table
calc.KTb <- function(x)
{
  x <- matrix(as.numeric(x), dim(x))
  
  c <- concordant(x)
  d <- discordant(x)

  n <- sum(x)
  SumR <- rowSums(x)
  SumC <- colSums(x)

  KTb <- (2 * (c - d)) / sqrt(((n ^ 2) -
         (sum(SumR ^ 2))) * ((n ^ 2) -
         (sum(SumC ^ 2))))

  KTb
}
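As a quick hand-checkable sanity sketch (my own toy example, not part of the original post): for a 2 x 2 table the concordant and discordant counts reduce to single products of diagonally opposed cells, so the tau-b formula above can be verified against cor(..., method = "kendall") on the expanded raw vectors.

```r
## 2 x 2 table: cells (1,1)=3, (2,1)=2, (1,2)=1, (2,2)=4 (column-major)
x <- matrix(c(3, 2, 1, 4), nrow = 2)
con <- x[1, 1] * x[2, 2]                  # 12 concordant pairs
dis <- x[1, 2] * x[2, 1]                  # 2 discordant pairs
n <- sum(x)
tau_b <- 2 * (con - dis) /
  sqrt((n^2 - sum(rowSums(x)^2)) * (n^2 - sum(colSums(x)^2)))

## expand the table back to raw vectors and compare with cor()
a <- rep(c(1, 1, 2, 2), c(3, 1, 2, 4))
b <- rep(c(1, 2, 1, 2), c(3, 1, 2, 4))
all.equal(tau_b, cor(a, b, method = "kendall"))   # TRUE
```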


Note that, relative to prior versions I have posted, I modified the
above to handle large numbers of pairs and avoid integer overflow in
the summations. Hence the:

  x <- matrix(as.numeric(x), dim(x))

conversion in each function.
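A small illustration (my own, not from the post) of why the conversion matters: R's integers are 32-bit, so summing counts drawn from a very large table can overflow, while the same sum in double precision does not at this scale.

```r
n_max <- .Machine$integer.max       # 2147483647L, the largest R integer
sum(c(n_max, 1L))                   # NA, with an integer overflow warning
sum(c(as.numeric(n_max), 1))        # 2147483648, computed as a double
```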

Now, create some random test data, with 100,000 elements in each vector,
sampling from 'letters', which would yield a 26 x 26 table:

 a <- sample(letters, 100000, replace = TRUE)
 b <- sample(letters, 100000, replace = TRUE)
 
 > dim(table(a, b))
 [1] 26 26

 > system.time(print(calc.KTb(table(a, b))))
[1] 0.0006906088
[1] 0.77 0.02 0.83 0.00 0.00

Note that in the above, the initial table takes most of the time:

> system.time(table(a, b))
[1] 0.55 0.00 0.56 0.00 0.00

Hence:

> tab.ab <- table(a, b)
> system.time(print(calc.KTb(tab.ab)))
[1] 0.0006906088
[1] 0.25 0.01 0.27 0.00 0.00


I should note that I also ran:

> system.time(print(cor(a, b, method = "kendall")))
[1] 0.0006906088 
[1] 694.80   7.72 931.89   0.00   0.00 

Nice to know the results work out at least...  :-)


I have not tested with substantially larger 2d matrices, but would
envision that as the dimensions of the resultant tabulation increase,
my method's advantage shrinks and it may even become less efficient
than the approach implemented in cor(). Some testing would validate
this and perhaps point to coding the concordant() and discordant()
functions in C to improve timing.
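For vectors without ties there is also an O(n log n) route that avoids the quadratic pair scan entirely: after sorting by the first vector, the number of discordant pairs equals the number of inversions in the second, which a merge sort can count. A minimal sketch in pure R (my own illustration, not from the thread; it computes tau-a only, with no tie correction, so it does not directly give tau-b):

```r
## count inversions in v via merge sort; returns inversions and sorted v
count_inv <- function(v) {
  n <- length(v)
  if (n < 2) return(list(inv = 0, sorted = v))
  mid <- n %/% 2
  left  <- count_inv(v[1:mid])
  right <- count_inv(v[(mid + 1):n])
  a <- left$sorted; b <- right$sorted
  merged <- numeric(n)
  inv <- left$inv + right$inv
  i <- 1; j <- 1; k <- 1
  while (i <= length(a) && j <= length(b)) {
    if (a[i] <= b[j]) {
      merged[k] <- a[i]; i <- i + 1
    } else {
      ## b[j] is smaller than all remaining left elements: count them
      merged[k] <- b[j]; j <- j + 1
      inv <- inv + length(a) - i + 1
    }
    k <- k + 1
  }
  if (i <= length(a)) merged[k:n] <- a[i:length(a)]
  if (j <= length(b)) merged[k:n] <- b[j:length(b)]
  list(inv = inv, sorted = merged)
}

## Kendall's tau-a for tie-free vectors: discordant pairs = inversions
fast_tau <- function(x, y) {
  n <- length(x)
  d <- count_inv(y[order(x)])$inv
  1 - 4 * d / (n * (n - 1))
}
```

With no ties this agrees with cor(x, y, method = "kendall"), e.g. fast_tau(1:5, 5:1) is -1 and fast_tau on two rnorm() vectors matches cor() to machine precision.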

HTH,

Marc Schwartz



Re: [R] faster algorithm for Kendall's tau

2005-06-28 Thread ferdinand principia
Sorry,

I should specify in more detail what my data look like. The data
vectors (simulations) are mostly composed of floats (for which ties
are pretty unlikely), but integer values occur as well (up to 10% of
the vector elements).
As I understand it, Marc's algorithm is not suited to this case.

Is there another solution?

Thanks
Ferdinand


--- "Marc Schwartz (via MN)" <[EMAIL PROTECTED]>
wrote:

> On Tue, 2005-06-28 at 13:03 -0400, ferdinand
> principia wrote:
> > Hi,
> > 
> > I need to calculate Kendall's tau for large data
> > vectors (length > 100'000). 
> > Is somebody aware of a faster algorithm or package
> > function than "cor(, method="kendall")"? 
> > There are ties in the data to be considered
> (Kendall's
> > tau-b).
> > 
> > Any suggestions?
> > 
> > Regards
> > Ferdinand
> 
> 
> The time intensive part of the process is typically
> the ranking/ordering
> of the vector pairs to calculate the numbers of
> concordant and
> discordant pairs.
> 
> If the number of _unique pairs_ in your data is
> substantially less than
> the number of total pairs (in other words, creating
> a smaller 2d
> contingency table from a pair of your vectors makes
> sense), then the
> following may be of help.
> 
> # Calculate CONcordant Pairs in a table
> # cycle through x[r, c] and multiply by
> # sum(x elements below and to the right of x[r, c])
> # x = table
> concordant <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   # get sum(matrix values > r AND > c)
>   # for each matrix[r, c]
>   mat.lr <- function(r, c)
>   { 
> lr <- x[(r.x > r) & (c.x > c)]
> sum(lr)
>   }
> 
>   # get row and column index for each
>   # matrix element
>   r.x <- row(x)
>   c.x <- col(x)
> 
>   # return the sum of each matrix[r, c] * sums
>   # using mapply to sequence thru each matrix[r, c]
>   sum(x * mapply(mat.lr, r = r.x, c = c.x))
> }
> 
> # Calculate DIScordant Pairs in a table
> # cycle through x[r, c] and multiply by
> # sum(x elements below and to the left of x[r, c])
> # x = table
> discordant <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   # get sum(matrix values > r AND < c)
>   # for each matrix[r, c]
>   mat.ll <- function(r, c)
>   { 
> ll <- x[(r.x > r) & (c.x < c)]
> sum(ll)
>   }
> 
>   # get row and column index for each
>   # matrix element
>   r.x <- row(x)
>   c.x <- col(x)
> 
>   # return the sum of each matrix[r, c] * sums
>   # using mapply to sequence thru each matrix[r, c]
>   sum(x * mapply(mat.ll, r = r.x, c = c.x))
> }
> 
> 
> # Calculate Kendall's Tau-b
> # x = table
> calc.KTb <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   c <- concordant(x)
>   d <- discordant(x)
> 
>   n <- sum(x)
>   SumR <- rowSums(x)
>   SumC <- colSums(x)
> 
>   KTb <- (2 * (c - d)) / sqrt(((n ^ 2) -
>  (sum(SumR ^ 2))) * ((n ^ 2) -
>  (sum(SumC ^ 2
> 
>   KTb
> }
> 
> 
> Note that I made some modifications of the above,
> relative to prior
> versions that I have posted to handle large numbers
> of pairs to avoid
> integer overflows in summations. Hence the:
> 
>   x <- matrix(as.numeric(x), dim(x))
> 
> conversion in each function.
> 
> Now, create some random test data, with 100,000
> elements in each vector,
> sampling from 'letters', which would yield a 26 x 26
> table:
> 
>  a <- sample(letters, 10, replace = TRUE)
>  b <- sample(letters, 10, replace = TRUE)
>  
>  > dim(table(a, b))
>  [1] 26 26
> 
>  > system.time(print(calc.KTb(table(a, b
> [1] 0.0006906088
> [1] 0.77 0.02 0.83 0.00 0.00
> 
> Note that in the above, the initial table takes most
> of the time:
> 
> > system.time(table(a, b))
> [1] 0.55 0.00 0.56 0.00 0.00
> 
> Hence:
> 
> > tab.ab <- table(a, b)
> > system.time(print(calc.KTb(tab.ab)))
> [1] 0.0006906088
> [1] 0.25 0.01 0.27 0.00 0.00
> 
> 
> I should note that I also ran:
> 
> > system.time(print(cor(a, b, method = "kendall")))
> [1] 0.0006906088 
> [1] 694.80   7.72 931.89   0.00   0.00 
> 
> Nice to know the results work out at least...  :-)
> 
> 
> I have not tested with substantially larger 2d
> matrices, but would
> envision that as the dimensions of the resultant
> tabulation increases,
> my method probably approaches and may even become
> less efficient than
> the approach implemented in cor(). Some testing
> would validate this and
> perhaps point to coding the concordant() and
> discordant() functions in C
> for improvement in timing.
> 
> HTH,
> 
> Marc Schwartz
> 
> 
>
