It so happens I have been looking at very similar changes, as well as
adding multi-threading support for dist(); these should make it into
R-devel later this summer.

Best,

luke

On Sun, 17 Jun 2012, Stefan Evert wrote:


I'm working on analyzing a large data set, let's assume that
dim(Data)=c(1000,8700). I want to calculate the canberra distance
between the columns of this matrix, and using a toy example ('test' is
a matrix filled with random numbers 0-1):

system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, upper = FALSE, p = 2)))
   user   system  elapsed
1417.713    3.219 1421.144
The system.time results also confuse me a bit, since 99% of the time
is not system time but user time. What does that mean?

User time is the CPU time that R spends working on your problem; system time 
is CPU time spent by the operating system on R's behalf, e.g. disk access, 
managing locks and, most importantly, swapping when you run out of RAM.  With 
multi-threading, CPU time is summed over all threads, so user plus system time 
can be much larger than the time that has actually elapsed.
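
To see the three figures side by side, here is a minimal sketch on a small 
random matrix (the sizes and variable names are my own choice, not from the 
original question; the actual numbers will differ on every machine):

```r
# Time a small CPU-bound task; sizes are arbitrary and chosen to finish quickly.
timing <- system.time({
  x <- matrix(runif(200 * 300), 200, 300)
  d <- dist(t(x), method = "canberra")
})
print(timing)

# "user.self" and "sys.self" are CPU times, "elapsed" is wall-clock time.
# For a single-threaded, CPU-bound task the ratio below is usually close to 1;
# it can be Inf or NaN if the task is too fast for the timer resolution.
cpu_share <- unname((timing["user.self"] + timing["sys.self"]) / timing["elapsed"])
cpu_share
```

A ratio well below 1 would suggest the process was waiting (I/O, swapping), 
while a ratio above 1 would point to several threads working at once.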


Is there any way to calculate the distance which would take less time?

Well, one thing you can do is to get a faster computer. :-)  The command above 
takes only 670 seconds on my MacBook Pro (without multi-threading).

Calculating a distance matrix is an expensive computation.  In your example, R 
has to carry out (8700 * 8700 * 1000) / 2 = 37.8 billion floating point 
divisions.  With approx. 27 clock cycles per division (according to tables I've 
found on the Web), this takes at least 340 seconds even on a 3GHz CPU (and 
ignoring memory access, addition/subtraction, loops, etc.).
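
The back-of-the-envelope estimate can be reproduced as follows. The 27 cycles 
per division and the 3 GHz clock are the rough figures quoted above, not 
measurements; the variable names are only for illustration:

```r
n_cols <- 8700          # number of columns (vectors to compare)
n_rows <- 1000          # length of each vector
cycles_per_div <- 27    # approximate figure quoted in the text
clock_hz <- 3e9         # 3 GHz CPU

# One division per vector element and per pair of columns (approximated as n^2 / 2)
divisions <- n_cols * n_cols * n_rows / 2
seconds <- divisions * cycles_per_div / clock_hz

# The exact number of distinct pairs is choose(n, 2), which is marginally smaller
exact_divisions <- choose(n_cols, 2) * n_rows

divisions / 1e9   # about 37.8 (billion divisions)
seconds           # about 340 (seconds)
```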

You can shave off some of the time if you implement the distance calculation in 
C, inline the code to avoid callback functions in loops, operate on columns of 
the matrix directly (which should be more cache-friendly than rows) and don't 
check for NA's, NaN's and other degenerate cases.
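
The column-wise idea can be illustrated in plain R as well. This is only a 
sketch, not the C code from the package: the function name canberra_cols is 
made up for this example, it assumes a numeric matrix with at least two 
columns, and, like the C version, it does no checking for NA's or for terms 
where numerator and denominator are both zero:

```r
# Canberra distance between all pairs of columns of M, without special-casing
# NA/NaN or 0/0 terms (dist() handles those; this sketch does not).
canberra_cols <- function(M) {
  n <- ncol(M)
  D <- matrix(0, n, n)
  for (j in seq_len(n - 1)) {
    # Compare column j against all later columns in one vectorised step;
    # M[, j] is recycled down each column of 'rest'.
    rest <- M[, (j + 1):n, drop = FALSE]
    d <- colSums(abs(rest - M[, j]) / abs(rest + M[, j]))
    D[j, (j + 1):n] <- d
    D[(j + 1):n, j] <- d
  }
  D
}

set.seed(42)
small <- matrix(runif(50 * 8), 50, 8)
d_ref <- unname(as.matrix(dist(t(small), method = "canberra")))
d_own <- canberra_cols(small)
all.equal(d_own, d_ref)
```

Whether this is faster than dist() will depend on the matrix shape; the point 
is only to show the computation working directly on columns.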

I've done just that in my experimental R package "wordspace", which isn't on 
CRAN yet:

library(wordspace)
A <- matrix(runif(8.7e6), 1000, 8700)

system.time(d1 <- as.matrix(dist(t(A), method="canberra")))
  user  system elapsed
669.207   2.724 669.305

system.time(d2 <- dist.matrix(A, method="canberra", byrow=FALSE))
  user  system elapsed
250.534   0.784 250.301

all(d1 == d2)
[1] TRUE

If you aren't tied to Canberra distance, you can use a less expensive metric 
such as the Manhattan distance for an additional, more substantial speed boost:

system.time(d3 <- dist.matrix(A, method="manhattan", byrow=FALSE))
  user  system elapsed
42.488   0.999  43.569

This is still single-threaded, so you can run several of these calculations in 
parallel, depending on how many cores your server has.
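
A minimal sketch of running several independent distance calculations side by 
side, using mclapply() from the parallel package that ships with R. It uses 
base dist() on small made-up matrices so that it runs without wordspace; the 
list of data sets and the core count are assumptions for illustration only:

```r
library(parallel)

set.seed(1)
# Three small, independent data sets (hypothetical stand-ins for real data)
datasets <- replicate(3, matrix(runif(20 * 10), 20, 10), simplify = FALSE)

manhattan_cols <- function(m) as.matrix(dist(t(m), method = "manhattan"))

# mclapply() forks worker processes; forking is not available on Windows,
# where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
results <- mclapply(datasets, manhattan_cols, mc.cores = n_cores)

# Same calculation run sequentially, for comparison
sequential <- lapply(datasets, manhattan_cols)
identical(results, sequential)
```

Keep in mind that each worker holds its own distance matrix in memory, so RAM 
may become the limit before the number of cores does.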

Hope this helps,
Stefan


PS: In case you'd like to give it a try yourself and aren't daunted by a 
complete lack of documentation:

        svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/pkg


[ ev...@linglit.tu-darmstadt.de | http://purl.org/stefan.evert ]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
