Stefan, that looks wonderful! I am most certainly going to try to download and use your 'wordspace', even though I am unsure on how to even download it at this point. Many thanks! But yes, I need the Canberra distance unfortunately. But decreasing the time by more than 60% will save me days! If I can get there.
2012/6/17 Stefan Evert <stefa...@collocations.de>: > >> I'm working on analyzing a large data set, lets asume that >> dim(Data)=c(1000,8700). I want to calculate the canberra distance >> between the columns of this matrix, and using a toy example ('test' is >> a matrix filled with random numbers 0-1): >> >>> system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, >>> upper = FALSE, p = 2))) >> user system elapsed >> 1417.713 3.219 1421.144 >> The system.time results also confuse me a bit, since 99% of the time >> is not system time but user time. What does that mean? > > User time is the time that R spends working on your problem; system time > refers to tasks done by the operating system, e.g. disk access, managing > locks and, most importantly, swapping when you run out of RAM. With > multi-threading, system time can be much larger than the time that has > actually elapsed. > >> >> Is there any way to calculate the distance which would take less time? > > Well, one thing you can do is to get a faster computer. :-) The command > above takes only 670 seconds on my MacBook Pro (without multi-threading). > > Calculating a distance matrix is an expensive computation. In your example, > R has to carry out (8700 * 8700 * 1000) / 2 = 37.8 billion floating point > divisions. With approx. 27 clock cycles per division (according to tables > I've found on the Web), this takes at least 340 seconds even on a 3GHz CPU > (and ignoring memory access, addition/subtraction, loops, etc.). > > You can shave off some of the time if you implement the distance calculation > in C, inline the code to avoid callback functions in loops, operate on > columns of the matrix directly (which should be more cache-friendly than > rows) and don't check for NA's, NaN's and other degenerate cases. > > I've done just that in my experimental R package "wordspace", which isn't on > CRAN yet: > >> library(wordspace) >> A <- matrix(runif(8.7e6), 1000, 8700) > >> system.time(d1 <- as.matrix(dist(t(A), method="canberra"))) > user system elapsed > 669.207 2.724 669.305 > >> system.time(d2 <- dist.matrix(A, method="canberra", byrow=FALSE)) > user system elapsed > 250.534 0.784 250.301 > >> all(d1 == d2) > [1] TRUE > > If you aren't tied to Canberra distance, you can use a less expensive metric > such as the Manhattan distance for an additional, more substantial speed > boost: > >> system.time(d3 <- dist.matrix(A, method="manhattan", byrow=FALSE)) > user system elapsed > 42.488 0.999 43.569 > > This is still single-threaded, so you can run multiple of these calculations > in parallel depending on how many cores your server has. > > Hope this helps, > Stefan > > > PS: In case you'd like to give it a try yourself and aren't daunted by a > complete lack of documentation: > > svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/pkg > > > [ ev...@linglit.tu-darmstadt.de | http://purl.org/stefan.evert ] > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.