Re: [R] Calculate Closest 5 Cases?

Tom Blackwell Fri, 13 Feb 2004 10:43:09 -0800

Danny  -

The flip answer is, it depends on the size of your computer.
One can readily calculate the number of entries in the pairwise
distance matrix that you would like to calculate, and ask whether
it will fit inside the physical memory installed in your computer.
It is 50,000 x 50,000 entries x 8 bytes per floating point number,
for a total of 20,000,000,000 bytes or 20 gigabytes.  The critical
information that's still missing is that R needs enough space
for 10 or 20 copies of the largest object in its workspace, in
order to turn around and assign that object to a new name, or
do any summaries on it, etc.  So,  . . .  if you have a computer
with between 200 and 400 gigabytes of random access memory, yes,
you can calculate and summarize the matrix of pairwise distances.
But that requires more memory slots than any ordinary motherboard
provides.  (It would be a mother of a motherboard !)
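
For anyone who wants to check the arithmetic, here is a minimal sketch in R.
The variable names are made up for illustration, and the factor of 10 to 20
working copies is only the rule of thumb stated above, not a measured figure.
The lower-triangle line is included because dist() stores each pair once
rather than the full square matrix.

```r
n_cases          <- 50000   # number of cases in the data set
bytes_per_double <- 8       # one double-precision floating point number

# Full square matrix of pairwise distances, in gigabytes
full_matrix_gb <- n_cases^2 * bytes_per_double / 1e9
full_matrix_gb

# dist() keeps only the n * (n - 1) / 2 entries below the diagonal
lower_tri_gb <- n_cases * (n_cases - 1) / 2 * bytes_per_double / 1e9
lower_tri_gb

# Rule-of-thumb working space: 10 to 20 copies of the full matrix
working_gb <- full_matrix_gb * c(10, 20)
working_gb
```

Even the half-size dist object is about 10 gigabytes before any copies are
made, so the conclusion does not change.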


So, failing that, you could always use Adrian Raftery and Chris
Fraley's 'mclust' package to cluster your data into 50 or more
clusters of very similar cases (instructions for running mclust()
on large data sets are found in the manual which comes with the
package), then calculate all pairwise distances only between the
cases within each cluster.  That's a bit of work to code up.
You wouldn't want to work interactively for each of 50 clusters.
But it certainly can be done in R.  Depends how much effort you
want to put into it.
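
A rough sketch of that cluster-then-compare idea follows.  It rests on
several assumptions: kmeans() from base R stands in for mclust so that the
example runs without extra packages, the data are random numbers rather than
your file, and closest_in_cluster() is a hypothetical helper written for this
message.  Treat it as a starting point, not a finished solution.

```r
set.seed(1)

# Stand-in data (assumption): 2000 cases by 10 percent-like variables
x <- matrix(runif(2000 * 10, min = 0, max = 25), ncol = 10)
rownames(x) <- sprintf("id%04d", seq_len(nrow(x)))

# Step 1: split the cases into clusters.
# kmeans() is a swapped-in substitute for mclust here.
cl <- kmeans(x, centers = 20, nstart = 5, iter.max = 50)$cluster

# Step 2 (hypothetical helper): for each case in one cluster, return the
# ids of its k closest cluster-mates, padded with NA in small clusters.
closest_in_cluster <- function(m, k = 5) {
  ids <- rownames(m)
  d   <- as.matrix(dist(m))   # distances within this cluster only
  out <- matrix(NA_character_, nrow = nrow(m), ncol = k,
                dimnames = list(ids, paste0("CLOSEID", seq_len(k))))
  for (i in seq_len(nrow(m))) {
    others  <- setdiff(seq_len(nrow(m)), i)          # exclude the case itself
    nearest <- head(others[order(d[i, others])], k)  # k smallest distances
    out[i, seq_along(nearest)] <- ids[nearest]
  }
  out
}

# Step 3: run the helper on every cluster and stack the results
parts <- lapply(split(seq_len(nrow(x)), cl),
                function(idx) closest_in_cluster(x[idx, , drop = FALSE]))
result <- do.call(rbind, unname(parts))
result <- result[rownames(x), , drop = FALSE]   # back to the original order

head(result)
```

One caveat with this shortcut: a case near the edge of its cluster may have
a closer neighbour in an adjacent cluster, so the five ids returned are the
closest within the cluster and not necessarily the closest overall.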

-  tom blackwell  -  u michigan medical school  -  ann arbor  -

On Fri, 13 Feb 2004 [EMAIL PROTECTED] wrote:

> I've only begun investigating R as a substitute for SPSS.
>
> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest).  I
> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
> small sample.  The entire matrix cannot be processed at once due to memory
> limitations.
>
> The data are all percents, so they are easily comparable.
>
> Is there any way to do this in R?
>
> Below is a small sample of the data (from SPSS) and the desired output.
>
> Thanks,
>
> Danny
>
>
>
>
> *Sample Data.
> DATA LIST LIST /id(F8) var1(F8.2) var2(F8.2) var3(F8.2) var4(F8.2) var5(F8.2)
>   var6(F8.2) var7(F8.2) var8(F8.2) var9(F8.2) var10(F8.2) var11(F8.2).
> BEGIN DATA.
> 10170069   3.51   4.02   6.53  11.05   6.53   8.04  13.57  20.10  11.05   8.55   7.04
> 10190229   1.89   5.66   4.61   7.62   8.45  13.21   9.50  20.82  16.07   9.36   3.77
> 10540023   3.40   5.08   3.39   4.52  10.18  14.71  13.56  16.38   9.60   7.89  11.85
> 10650413   6.64   6.64   3.73   4.70   3.78  13.23  19.82  15.98  12.26   8.48   3.78
> 10662074   5.11   5.81   4.37   5.11   6.55  14.60  18.97  11.68  10.25   8.75   8.79
> 10770041   6.43   4.17   6.34   4.26   6.34   4.26  19.11  19.20  14.95  12.77   4.35
> 11010422   3.14   4.71   6.81   7.85   5.75   6.81  15.18  15.18  13.61  11.00   9.44
> 11060762   7.03   5.03   6.95   5.99   5.92  12.94  15.01  12.06  11.98   8.06   9.02
> 11070078   4.61   9.22   4.61   7.94   6.27  12.75  14.02  20.49   7.75   7.75   4.61
> 11180646   4.48   5.35   6.29   5.42   4.55  11.71  20.74  15.32  14.45   8.09   3.61
> 11460001   5.71   7.34   6.48   5.68   4.07  10.55  13.83  18.69  12.15   9.76   4.87
> 11650133   6.00   3.72   6.72   6.00   7.50  17.94  13.44  16.37  13.51   5.15   3.65
> 11650275   4.02   8.06   6.06   8.10   5.06   8.10  17.16  14.12  12.14  14.12   4.02
> 11780034   4.25   4.28   5.30   5.33   6.38  14.88  15.96  18.08  14.85   7.48   3.20
> 11790016   4.40   4.40   5.54   4.40   4.40  10.93  17.67  19.72  13.20  12.13   4.33
> 12660338   6.60   7.54   5.66   8.49  10.38  11.31  16.06  12.26   8.49   8.49   4.73
> 12660644   5.51   3.14   3.95   7.09   7.11  14.98  15.72  18.90   9.44   5.50   8.65
> 12661667   5.44   4.50   5.44   4.50   5.44  12.69  13.63  11.81   9.07  13.68  13.79
> END DATA.
>
> *Output should be:.
> *.
> *     ID1     CLOSEID1        CLOSEID2        CLOSEID3        CLOSEID4        CLOSEID5.
> *     ID2     CLOSEID1        CLOSEID2        CLOSEID3        CLOSEID4        CLOSEID5.
> *     ID3     CLOSEID1        CLOSEID2        CLOSEID3        CLOSEID4        CLOSEID5.
> *     ID4     CLOSEID1        CLOSEID2        CLOSEID3        CLOSEID4        CLOSEID5.
> *     ID5     CLOSEID1        CLOSEID2        CLOSEID3        CLOSEID4        CLOSEID5.
>
> ______________________________________________
> [EMAIL PROTECTED] mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
