>>>>> Gavin Simpson <gavin.simp...@ucl.ac.uk>
>>>>>     on Fri, 28 Jan 2011 09:23:05 +0000 writes:

> On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
>> Hello,
>>
>> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
>>
>> i.e.
>>
>> > str(iMatrix)
>>  num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
>>  - attr(*, "dimnames")=List of 2
>>   ..$ : NULL
>>   ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...

Ok, so in the end you want to draw a dendrogram for 23'371
observational units, really?

I think I would not use a hierarchical clustering method for so
many units, but rather clara() or maybe pam(), or else model-based
or other methods, rather than fully hierarchical ones ...

... but yes, that's not the issue here; see further down ...

BTW: The object 'iMatrix' you provided for download has only 50
columns, not 56 ...

>>
>> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.

GS> Sorry, my bad. Try this:

GS> apply(iMatrix, 1, function(x) all(is.na(x)))

GS> will check that you have no fully `NA` rows.

GS> Also look at str(iMatrix) for potential problems.

GS> Finally, try:

GS> out <- dist(iMatrix)
GS> any(is.na(out))

GS> should repeat what agnes is doing to compute the
GS> dissimilarity matrix. If that returns TRUE, go and find
GS> which samples are giving NA dissimilarity and why.

GS> The issue is not NA in the input data, but that your
GS> input data is leading to NA in the computed
GS> dissimilarities. This might be due to NA's in your input
GS> data, where a pair of samples has no common set of data
GS> for example.

Yes, that's spot on, thank you Gavin.

This is indeed true: agnes() *does* allow NAs (in the data
matrix), but if the pattern of NAs is such that the dissimilarity
between two observations becomes undefined, e.g. if they have no
common non-missing values, then ``that's too much''.

In general, I'd recommend using

    dm <- daisy(....,...)

trying methods that cope better with NAs, e.g.
Gower's metric, until dm has {nearly} no NAs, and then figure out
some imputation to replace all NAs in dm by "reasonable values",
then do the clustering with the resulting dissimilarity "matrix" dm.

HOWEVER, in your case, dm would correspond to a 23371 x 23371
dissimilarity matrix; stored as a double precision matrix (on a
64-bit platform) that's an object of size 4.4 GBytes, not very
convenient to work with. As a dissimilarity object it will only be
about half of that size, but that's still ``a bit large'' ...

As I said above, for such data, I would never do fully
hierarchical clustering, but rather something else.

Martin Maechler, ETH Zurich

GS> HTH

GS> G

>> The part of the agnes documentation I was referring to is :
>>
>> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed."
>>
>> So, I'm under the impression it handles NAs on its own ?
>>
>> - Dario.
>>
>> ---- Original message ----
>> >Date: Thu, 27 Jan 2011 12:53:27 +0000
>> >From: Gavin Simpson <gavin.simp...@ucl.ac.uk>
>> >Subject: Re: [R] agnes clustering and NAs
>> >To: Uwe Ligges <lig...@statistik.tu-dortmund.de>
>> >Cc: d.strbe...@garvan.org.au, r-help@r-project.org
>> >
>> >On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
>> >>
>> >> On 27.01.2011 05:00, Dario Strbenac wrote:
>> >> > Hello,
>> >> >
>> >> > In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :
>> >> >
>> >> > > m <- matrix(c(
>> >> > 1, 1, 1, 2,
>> >> > 1, NA, 1, 1,
>> >> > 1, 2, 2, 2), nrow = 3, byrow = TRUE)
>> >> > > agnes(m)
>> >> > Call: agnes(x = m)
>> >> > Agglomerative coefficient: 0.1614168
>> >> > Order of objects:
>> >> > [1] 1 2 3
>> >> > Height (summary):
>> >> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>> >> >   1.155   1.247   1.339   1.339   1.431   1.524
>> >> >
>> >> > Available components:
>> >> > [1] "order" "height" "ac" "merge" "diss" "call" "method" "data"
>> >> >
>> >> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :
>> >> >
>> >> > > agnes(iMatrix)
>> >> > Error in agnes(iMatrix) :
>> >> > No clustering performed, NA-values in the dissimilarity matrix.
>> >> >
>> >> > I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.
>> >>
>> >>
>> >> I haven't looked in the file, but you need to get rid of all NA, or in
>> >> other words, all rows that contain *any* NA values.
>> >
>> >If one believes the documentation, then that only applies to the case
>> >where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw
>> >data matrix or data frame.
>> >
>> >The only way the OP could have gotten that error with the call shown is
>> >if iMatrix were not a dissimilarity matrix inheriting from class "dist",
>> >so `NA`s should be allowed.
>> >
>> >My guess would be that the OP didn't get rid of all the `NA`s.
>> >
>> >Dario: what does:
>> >
>> >sapply(iMatrix, function(x) any(is.na(x)))
>> >
>> >or if iMatrix is a matrix:
>> >
>> >apply(iMatrix, 2, function(x) any(is.na(x)))
>> >
>> >say?
>> >
>> >G
>> >
>> >> Uwe Ligges
>> >>
>> >>
>> >>
>> >> > The matrix I'm using can be obtained here :
>> >> > http://129.94.136.7/file_dump/dario/iMatrix.obj
>> >> >
>> >> > --------------------------------------
>> >> > Dario Strbenac
>> >> > Research Assistant
>> >> > Cancer Epigenetics
>> >> > Garvan Institute of Medical Research
>> >> > Darlinghurst NSW 2010
>> >> > Australia
>> >> >
>> >--
>> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>> > Dr.
Gavin Simpson             [t] +44 (0)20 7679 0522
>> > ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
>> > Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
>> > Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
>> > UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
>> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
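
For anyone who wants to reproduce the diagnosis from this thread without
downloading iMatrix, here is a small sketch using only base R's dist().
The toy matrix m and all variable names are made up for illustration.
No row of m is entirely NA, yet rows 1 and 2 have no column observed in
both, so their dissimilarity is undefined. The last lines redo the size
estimate for n = 23371, assuming 8 bytes per double.

```r
# Hypothetical toy data: rows 1 and 2 share no column where both are observed.
m <- matrix(c( 1, NA,  2, NA,
              NA,  3, NA,  4,
               1,  2,  3,  4),
            nrow = 3, byrow = TRUE)

# No row is entirely NA, so the "all-NA row" check finds nothing ...
all_na_row <- apply(m, 1, function(x) all(is.na(x)))

# ... yet dist() returns NA for the pair with no common non-missing values.
d <- dist(m)
has_na <- any(is.na(d))

# Locate the offending pairs (upper triangle only, so each pair is listed once).
dm <- as.matrix(d)
bad_pairs <- which(is.na(dm) & upper.tri(dm), arr.ind = TRUE)

# Rough storage needed for the dissimilarities of n = 23371 rows, in bytes.
n <- 23371
full_bytes <- n^2 * 8               # full n x n double matrix
dist_bytes <- n * (n - 1) / 2 * 8   # one triangle only, as a "dist" object stores it
```

On the real data, as.matrix() on the full set of dissimilarities would
itself need several GB, so it is probably wiser to try this on a subset
of rows first.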