Dear R developers,
I am visualising high-dimensional genomic data, and for this purpose I need to
compute pairwise distances between many points in a high-dimensional space (say
I have a matrix of 5,000 rows and 20,000 columns, so the result is a
5,000 x 5,000 matrix, or its upper triangle). Computing this in R takes many
hours (I am doing it on a Linux server with more than 100 GB of RAM, so memory
is not the problem). When I write the matrix to disk, read it and compute the
distances in C, write them to disk and read them back into R, it takes 10 - 15
minutes (and I did not spend much time optimising my C code).
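
For reference, here is a minimal sketch of the kind of C kernel I have in mind
(an illustration only, not my actual code; it assumes Euclidean distance and a
row-major layout with one observation per contiguous row):

#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Pairwise Euclidean distances between the rows of an n x p matrix stored
 * row-major (each observation contiguous in memory).  The upper triangle is
 * written row by row into d, which must hold n*(n-1)/2 doubles. */
static void pairwise_dist(const double *x, size_t n, size_t p, double *d)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        const double *xi = x + i * p;          /* row i, contiguous */
        for (size_t j = i + 1; j < n; j++) {
            const double *xj = x + j * p;      /* row j, contiguous */
            double s = 0.0;
            for (size_t m = 0; m < p; m++) {   /* sequential reads */
                double diff = xi[m] - xj[m];
                s += diff * diff;
            }
            d[k++] = sqrt(s);
        }
    }
}

int main(void)
{
    /* Toy example: 3 points in 2 dimensions. */
    double x[] = { 0.0, 0.0,
                   3.0, 4.0,
                   0.0, 4.0 };
    double d[3];
    pairwise_dist(x, 3, 2, d);
    printf("%g %g %g\n", d[0], d[1], d[2]);    /* prints: 5 4 3 */
    return 0;
}

With this layout the inner loop reads both rows sequentially, so the cache is
used well even with 20,000 columns.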
The question is: why is the R function so slow? I understand that it calls C
(or C++) to compute the distances. My suspicion is that the matrix is passed to
C effectively transposed, so each distance is computed between two columns of
the matrix; since C stores matrices by rows, this is very inefficient and
causes many cache misses (my first C implementation worked this way, and I had
to stop the run after an hour when it failed to complete).

If my suspicion is correct, is it possible to re-write the dist function so
that it works faster on large matrices?
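
To illustrate the access pattern I suspect (purely a sketch, not the actual
dist() source): if the same n x p matrix is stored column-major, as R stores
matrices, the inner loop has to jump n doubles between consecutive coordinates
of an observation:

#include <math.h>
#include <stddef.h>

/* Same computation as above, but with the n x p matrix in column-major
 * (R/Fortran) order, so element (i, m) lives at x[i + m*n].  For large n,
 * consecutive coordinates of one observation are n doubles apart, so almost
 * every read in the inner loop can touch a different cache line. */
void pairwise_dist_colmajor(const double *x, size_t n, size_t p, double *d)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double s = 0.0;
            for (size_t m = 0; m < p; m++) {
                /* strided reads: stride is n doubles, not 1 */
                double diff = x[i + m * n] - x[j + m * n];
                s += diff * diff;
            }
            d[k++] = sqrt(s);
        }
    }
}

The arithmetic is identical to the row-major kernel above; only the memory
layout changes, which matches my experience with the first C implementation I
had to stop.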
Best regards,
Moshe Olshansky
Monash University
