[R] Sparse Matrices in R

2004-08-31 Thread Danny Heuman

I have data in (i, j, r) format, where r is the value in location A[i,j] for some imaginary matrix A.

I need to build this matrix A, but given the sizes of i and j, I believe a sparse 
format would be most appropriate.

Hopefully this will still allow me to perform basic matrix manipulations such as 
multiplication, addition, row sums, transposition, and subsetting.

Is there any way to achieve this in R?
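
For concreteness, something along these lines is what I have in mind -- a rough
sketch only, assuming a package such as Matrix with a sparseMatrix() constructor
can do the job:

library(Matrix)

#toy i, j, r triplets; the real vectors are much longer
i <- c(1, 3, 5)
j <- c(2, 2, 4)
r <- c(1.5, -2.0, 7.1)

#build A in a compressed sparse format directly from the triplets
A <- sparseMatrix(i = i, j = j, x = r)

#the usual operations should then work without ever densifying A
A %*% t(A)     #multiplication
A + A          #addition
rowSums(A)     #row sums
t(A)           #transposition
A[1:2, ]       #subsetting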
Thanks,
 
Danny



[R] Calculating sum of squares deviation between 2 similar matrices

2004-07-13 Thread Danny Heuman
Hi all,
 
I've got clusters and would like to match individual records to each
cluster based on a sum of squares deviation.  For each cluster and
individual, I've got 50 variables to use (measured in the same way).
 
Matrix 1 is individuals and is 25000x50.  Matrix 2 is the cluster
centroids and is 100x50.  The same variables are found in each matrix
in the same order.  I'd like to calculate the 'distance' of matrix 1 to
matrix 2 and get a ranking of matrix 2's distances (and row
IDs 1 to 100) sorted by distance.
 
I tried using the rdist and dist functions, but they compute true (Euclidean)
distances and all I want is the sum-of-squares deviation across the 50 variables.
I don't know how to program that sum-of-squares deviation across the 50
variables and do it efficiently.  Because of the size of the data I'm not sure
that apply would work well here, which is why I was using a for loop.
 
The (highly inefficient) code I was using is below if that helps at all.
I give you permission to laugh if you want.  I'm not remotely close to a
programmer.
 
Are there any suggestions from the general readership?  I'm using R 1.9.0
on Windows XP with 1 GB of RAM.
 
Thanks for your attention,
Danny

---
#Calculate Euclidean distances between the rows of two matrices.
library(foreign)
library(fields)

#centroid is a small file, 100 x 50
centroid <- as.data.frame(read.spss("C:\\centroid.sav"))
#in_data is 25000 x 50
in_data <- as.data.frame(read.spss("C:\\in_vars.sav"))

#Loop through the in_data records, calculate distances to the 100 centroids,
#sort the distances in ascending order and write out the centroid # and
#distance for all 100.

for (i in 1:nrow(in_data)) {

    #first column is the centroid #.  columns 2 through 51 have data.
    aa <- as.matrix(centroid[, 2:51])

    #first column is a unique identifier.  columns 2 through 51 have data.
    bb <- as.matrix(in_data[i, 2:51])

    #stack the in_data row on top of the 100 centroids and calculate Euclidean distances.
    cc <- rdist(rbind(bb, aa))

    #take the first row of the distance matrix - the distances from this
    #in_data row to all 100 centroids.
    dd <- as.matrix(cc[1, 2:101])

    #sort dd on distance and attach the centroid number.
    ee <- c(t(cbind(sort.list(dd), sort(dd))))

    #write the sorted distances to file
    write(ee, file = "C:\\cluster_distances.txt", ncol = 200, append = TRUE)

}
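
The kind of vectorized approach I suspect exists (but cannot work out myself)
would be something like the sketch below, with the data already in numeric
matrices ind (25000 x 50) and cent (100 x 50); it leans on the identity
sum((x - c)^2) = sum(x^2) + sum(c^2) - 2 * sum(x * c):

#ss[p, k] = sum of squared deviations between individual p and centroid k
ss <- outer(rowSums(ind^2), rowSums(cent^2), "+") - 2 * ind %*% t(cent)

#ranked centroid IDs and deviations for, say, individual 1
ord <- order(ss[1, ])
cbind(centroid = ord, ssdev = ss[1, ord])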




Re: [R] R-crash using read.shape (maptools)

2004-04-28 Thread Danny Heuman
Hi Herry,

On Thu, 29 Apr 2004 12:20:44 +1000, you wrote:

>Hi List,
>
>I am trying to read a large shapefile (~37,000 polys) using read.shape (winxp, 1 gig 
>ram, dell box). I receive the following error:
>
>AppName: rgui.exe   AppVer: 1.90.30412.0ModName: maptools.dll
>ModVer: 1.90.30412.0Offset: 309d
>
>The getinfo.shape returns info, and the shapefile is readable in arcmap. 
>
>Any ideas on how to overcome this?
>
>Thanks Herry
>
>---
>Alexander Herr - Herry
>
> 
>


I've had difficulty when there is too much detail in the polygon
definition (i.e. too many nodes).  Try thinning the polygons and then
reading the file again.
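
A minimal sketch of that thinning step, using the newer sf package rather than
maptools (the file names here are only illustrative, and the tolerance value is
arbitrary):

library(sf)

polys   <- st_read("large_shapefile.shp")        #read the shapefile
thinned <- st_simplify(polys, dTolerance = 100)  #simplify (thin) the polygon boundaries
st_write(thinned, "thinned_shapefile.shp")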

Danny



[R] How to improve this code?

2004-04-03 Thread Danny Heuman
Hi all,

I've got some functioning code that has literally taken me hours to
write.  My R coding is getting better... it used to take days :)

I know I've done a poor job of optimizing the code.  In addition, I'm
missing an important step and don't know where to put it.

So, three questions:

1)  I'd like the resulting output to be sorted on distance (ascending)
and to have the 'rank' column represent the sort order, so that rank 1
is the closest customer and rank 10 is the 10th closest.  Where do I do this?

2)  Can someone suggest ways of 'optimizing' or improving the code?
It's the only way I'm going to learn better ways of approaching R.

3)  If there are no customers in the store's Trade Area, I'd like
nothing to be written to the output file.  How can I do that?

All help is appreciated.

Thanks,

Danny


*
library(fields)

#Format of input files:  ID, LONGITUDE, LATITUDE

#Generate Store List
storelist <- cbind(1:100,
                   matrix(rnorm(100, mean = -60, sd = 3), ncol = 1),
                   matrix(rnorm(100, mean = 50, sd = 3), ncol = 1))

#Generate Customer List (customer count was garbled in the archive; 10,000 assumed here)
customerlist <- cbind(1:10000,
                      matrix(rnorm(10000, mean = -60, sd = 20), ncol = 1),
                      matrix(rnorm(10000, mean = 50, sd = 10), ncol = 1))


#Output file
outfile <- "c:\\output.txt"
outfilecolnames <- c("rank","storeid","custid","distance")
write.table(t(outfilecolnames), file = outfile, append = TRUE,
            sep = ",", row.names = FALSE, col.names = FALSE)

#Trade Area Size
TAsize <- c(100)

custlatlon <- customerlist[, 2:3]

for (i in 1:length(TAsize)) {
    for (j in 1:nrow(storelist)) {
        cat("Store: ", storelist[j], "  TA Size = ", TAsize[i], "\n")

        storelatlon <- storelist[j, 2:3]

        whichval <- which(rdist.earth(t(as.matrix(storelatlon)),
                                      as.matrix(custlatlon), miles = F) <= TAsize[i])

        dist <- as.data.frame(rdist.earth(t(as.matrix(storelatlon)),
                                          as.matrix(custlatlon), miles = F)[whichval])

        storetag <- as.data.frame(cbind(1:nrow(dist), storelist[j, 1]))
        fincalc  <- as.data.frame(cbind(1:nrow(dist),
                                        customerlist[whichval, 1],
                                        rdist.earth(t(as.matrix(storelatlon)),
                                                    as.matrix(custlatlon),
                                                    miles = F)[whichval]))

        combinedata <- data.frame(storetag, fincalc)

        combinefinal <- subset(combinedata, select = c(-1, -3))

        flush.console()

        write.table(combinefinal, file = outfile, append = TRUE,
                    sep = ",", col.names = FALSE)
    }

}
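
For what it's worth, the direction I imagine an improved version might take is
sketched below (untested; it reuses the storelist, customerlist, custlatlon,
outfile and TAsize objects defined above, ranks customers by distance, and
skips stores with no customers in their Trade Area):

for (i in 1:length(TAsize)) {
    for (j in 1:nrow(storelist)) {

        #distances (km) from this store to every customer
        d <- rdist.earth(matrix(storelist[j, 2:3], nrow = 1),
                         custlatlon, miles = F)[1, ]

        #customers inside the trade area; skip the store if there are none
        inTA <- which(d <= TAsize[i])
        if (length(inTA) == 0) next

        #rank 1 = closest customer
        o   <- inTA[order(d[inTA])]
        out <- data.frame(rank     = seq_along(o),
                          storeid  = storelist[j, 1],
                          custid   = customerlist[o, 1],
                          distance = d[o])

        write.table(out, file = outfile, append = TRUE, sep = ",",
                    row.names = FALSE, col.names = FALSE)
    }
}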



[R] Cluster Analysis with minimum cluster size?

2004-03-26 Thread Danny Heuman
Hi all,

Is it possible to run kmeans, pam or clara with a constraint such that
no resulting cluster has fewer than X cases?

These kmeans algorithms often find clusters that are too small for my
use.  There are usually a few clusters with 1-10 cases (generally
substantial outliers).  I then have to manually assign the small ones
to other sizable clusters.

If this doesn't exist, is there another algorithm that does this?
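
For reference, what I currently do by hand is roughly the post-processing
sketched below (this is only a sketch, not a constrained clustering algorithm;
X stands for a numeric data matrix and minsize for the smallest cluster size I
would accept -- both names, and the example call at the end, are made up):

#run kmeans, then fold any cluster smaller than minsize into the nearest
#sizable cluster, judged by distance to the large clusters' centroids
reassign_small <- function(X, k, minsize) {
    km    <- kmeans(X, centers = k)
    sizes <- table(km$cluster)
    small <- as.integer(names(sizes)[sizes <  minsize])
    large <- as.integer(names(sizes)[sizes >= minsize])
    for (cl in small) {
        idx <- which(km$cluster == cl)
        d   <- as.matrix(dist(rbind(X[idx, , drop = FALSE],
                                    km$centers[large, , drop = FALSE])))
        d   <- d[seq_along(idx), length(idx) + seq_along(large), drop = FALSE]
        km$cluster[idx] <- large[apply(d, 1, which.min)]
    }
    km$cluster
}

#e.g. newlabels <- reassign_small(as.matrix(mydata), k = 50, minsize = 25)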

Thanks,

Danny



[R] Distance and Aggregate Data - Again...

2004-02-25 Thread Danny Heuman
I appreciate the help I've been given so far.  The issue I face is
that the data I'm working with has 53,000 rows, so calculating the
distances, finding all recids that fall within 2 km, summing the
population, etc. (a) takes too long and (b) gives no sense of progress.

Below is a loop that reads each recid one at a time, calculates the
distance and identifies the recids that fall within 2 km.  It iterates
through all records successfully.

Where I'm stuck is how to get the sum of population and dwellings and
the mean age for the records that are selected.  Also, the desired
output should have the following fields:  recid, sum(pop), sum(dwell),
mean(age).  I don't know how to write only those fields out to the
file.

Any suggestions?

Thank you for your help,

Danny


#
library(fields)

d <- as.matrix( read.csv("filein.csv") )

for (i in 1:nrow(d)) {
    lonlat1 <- d[i, 3:2]   #rdist.earth expects longitude first, then latitude
    lonlat2 <- d[, 3:2]
    distval <- d[, 1][which(rdist.earth(t(as.matrix(lonlat1)),
                                        as.matrix(lonlat2), miles = F) < 2)]
    write(distval, file = "C:\\outfile.out", ncol = 1, append = TRUE)
}
#
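
The sort of extension I'm after would, I think, look something like the sketch
below (untested; the column names are taken from the sample data underneath,
the output file name is reused from above, rdist.earth wants longitude before
latitude so the lat/long columns are reordered, and the cat() line is just a
crude progress indicator):

library(fields)

d <- read.csv("filein.csv")

out <- data.frame(recid = d$recid, sumpop = NA, sumdwell = NA, meanage = NA)

for (i in 1:nrow(d)) {
    #distances (km) from record i to every record
    km  <- rdist.earth(as.matrix(d[i, c("long", "lat")]),
                       as.matrix(d[, c("long", "lat")]), miles = F)[1, ]
    nbr <- which(km < 2)
    out$sumpop[i]   <- sum(d$pop[nbr])
    out$sumdwell[i] <- sum(d$dwell[nbr])
    out$meanage[i]  <- mean(d$age[nbr])
    cat(i, "of", nrow(d), "\n")
}

write.csv(out, file = "C:\\outfile.out", row.names = FALSE)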


--
Sample Input Data
--
recid,lat,long,pop,dwell,age
10010265,47.5971174,-52.7039227,584,219,38
10010260,47.5971574,-52.7039147,488,188,34
10010263,47.5936538,-52.7037037,605,232,43
10010287,47.5739426,-52.7035365,548,256,29
10010290,47.570,-52.703182,559,336,36
10010284,47.5958782,-52.7013245,394,261,61
10010191,47.5322617,-52.7037037,892,323,23
10010291,47.5700412,-52.7009,0,0,0
10010289,47.5714152,-52.70023,0,0,0
10010285,47.5832183,-52.6995828,469,239,44
10010273,47.5800199,-52.6984875,855,283,28
10010190,47.472353,-52.697991,0,0,0
10010274,47.6018197,-52.6978362,344,117,51
10010288,47.5755249,-52.6978207,33,0,19
10010275,47.6005037,-52.697991,232,93,43
10010279,47.5915368,-52.6954916,983,437,33
10010276,47.5993086,-52.6954808,329,131,28
10010278,47.5958782,-52.6934253,251,107,27
10010354,47.5991086,-52.6934037,27,14,47
10010277,47.5968782,-52.6914148,515,194,37
10010293,47.5778754,-52.6954808,58,0,40
10010292,47.5722183,-52.6899332,1112,523,28
10010353,47.6356972,-52.6896838,1387,471,32
10010283,47.5958439,-52.6884621,531,296,41
10010281,47.5983891,-52.6880528,307,113,52
10010280,47.5958439,-52.6878177,374,129,18
10010282,47.5999645,-52.6880528,637,226,22
10010286,47.5797909,-52.6872042,446,280,32
10010355,47.5797609,-52.6872055,197,72,39
