Re: [R] Efficient distance calculation on big matrix

Boel Brynedal Mon, 18 Jun 2012 07:29:48 -0700

Stefan, that looks wonderful! I am most certainly going to try to
download and use your 'wordspace', even though I am unsure on how to
even download it at this point. Many thanks! But yes, I need the
Canberra distance unfortunately. But decreasing the time by more than
60% will save me days! If I can get there.


2012/6/17 Stefan Evert <stefa...@collocations.de>:
>
>> I'm working on analyzing a large data set, lets asume that
>> dim(Data)=c(1000,8700). I want to calculate the canberra distance
>> between the columns of this matrix, and using a toy example ('test' is
>> a matrix filled with random numbers 0-1):
>>
>>> system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, 
>>> upper = FALSE, p = 2)))
>>    user   system  elapsed
>> 1417.713 3.219 1421.144
>> The system.time results also confuse me a bit, since 99% of the time
>> is not system time but user time. What does that mean?
>
> User time is the time that R spends working on your problem; system time 
> refers to tasks done by the operating system, e.g. disk access, managing 
> locks and, most importantly, swapping when you run out of RAM.  With 
> multi-threading, system time can be much larger than the time that has 
> actually elapsed.
>
>>
>> Is there any way to calculate the distance which would take less time?
>
> Well, one thing you can do is to get a faster computer. :-)  The command 
> above takes only 670 seconds on my MacBook Pro (without multi-threading).
>
> Calculating a distance matrix is an expensive computation.  In your example, 
> R has to carry out (8700 * 8700 * 1000) / 2 = 37.8 billion floating point 
> divisions.  With approx. 27 clock cycles per division (according to tables 
> I've found on the Web), this takes at least 340 seconds even on a 3GHz CPU 
> (and ignoring memory access, addition/subtraction, loops, etc.).
>
> You can shave off some of the time if you implement the distance calculation 
> in C, inline the code to avoid callback functions in loops, operate on 
> columns of the matrix directly (which should be more cache-friendly than 
> rows) and don't check for NA's, NaN's and other degenerate cases.
>
> I've done just that in my experimental R package "wordspace", which isn't on 
> CRAN yet:
>
>> library(wordspace)
>> A <- matrix(runif(8.7e6), 1000, 8700)
>
>> system.time(d1 <- as.matrix(dist(t(A), method="canberra")))
>   user  system elapsed
> 669.207 2.724 669.305
>
>> system.time(d2 <- dist.matrix(A, method="canberra", byrow=FALSE))
>   user  system elapsed
> 250.534 0.784 250.301
>
>> all(d1 == d2)
> [1] TRUE
>
> If you aren't tied to Canberra distance, you can use a less expensive metric 
> such as the Manhattan distance for an additional, more substantial speed 
> boost:
>
>> system.time(d3 <- dist.matrix(A, method="manhattan", byrow=FALSE))
>   user  system elapsed
>  42.488   0.999  43.569
>
> This is still single-threaded, so you can run multiple of these calculations 
> in parallel depending on how many cores your server has.
>
> Hope this helps,
> Stefan
>
>
> PS: In case you'd like to give it a try yourself and aren't daunted by a 
> complete lack of documentation:
>
>        svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/pkg
>
>
> [ ev...@linglit.tu-darmstadt.de | http://purl.org/stefan.evert ]
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Efficient distance calculation on big matrix

Reply via email to