Re: [R] agnes clustering and NAs
Hello,

Thank you for the clarification about the NAs.

For your interest, thankfully my end goal was not to plot a dendrogram with 23371 elements, but just to use the output of the clustering to re-order the rows of a matrix before plotting it with image(). Since clara() and pam() are partitioning-based approaches, I suppose I could instead stay with hclust() after removing the offending rows, so that I have the ordering position of each gene, not its cluster membership.

I have 12 GB RAM on my 64-bit system, so the time it takes to run should be my only problem.

- Dario.

Original message
>Date: Fri, 28 Jan 2011 12:34:26 +0100
>From: Martin Maechler
>Subject: Re: [R] agnes clustering and NAs
>To: gavin.simp...@ucl.ac.uk
>Cc: d.strbe...@garvan.org.au, r-help@r-project.org, Uwe Ligges
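The re-ordering plan in Dario's reply could be sketched roughly as follows. This is only an illustration under stated assumptions: `x` is a small made-up stand-in for iMatrix, and "offending rows" is taken to mean every row involved in at least one NA dissimilarity, which is simple but may drop more rows than strictly necessary.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 8)   # hypothetical stand-in for iMatrix
x[1, 1:3] <- NA                    # row 1 is observed only in columns 4-5
x[2, 4:5] <- NA                    # row 2 is observed only in columns 1-3

# Rows 1 and 2 share no observed column, so their distance is NA.
d <- as.matrix(dist(x))

# Assumption: drop every row that takes part in an NA dissimilarity.
bad  <- rowSums(is.na(d)) > 0
x_ok <- x[!bad, , drop = FALSE]

# hclust() gives an ordering of the remaining rows, not a cluster membership.
hc        <- hclust(dist(x_ok))
x_ordered <- x_ok[hc$order, , drop = FALSE]
```

Passing `t(x_ordered)` to image() would then draw the rows in dendrogram order.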
Re: [R] agnes clustering and NAs
>>>>> Gavin Simpson
>>>>> on Fri, 28 Jan 2011 09:23:05 + writes:

> On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
>> Hello,
>>
>> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
>>
>> i.e.
>>
>> > str(iMatrix)
>>  num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
>>  - attr(*, "dimnames")=List of 2
>>   ..$ : NULL
>>   ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...

Ok, so in the end you want to draw a dendrogram for 23'371 observational units, really?

I think I would not use a hierarchical clustering method for so many units, but rather clara() or maybe pam(), or then model-based or other methods, rather than fully hierarchical ones ...
but yes, that's not the issue here, and see further down ...

BTW: The object 'iMatrix' you provided for download has only 50 columns, not 56...

>> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.

GS> Sorry, my bad. Try this:

GS> apply(iMatrix, 1, function(x) all(is.na(x)))

GS> will check that you have no fully `NA` rows.

GS> Also look at str(iMatrix) for potential problems.

GS> Finally, try:

GS> out <- dist(iMatrix)
GS> any(is.na(out))

GS> should repeat what agnes is doing to compute the
GS> dissimilarity matrix. If that returns TRUE, go and find
GS> which samples are giving NA dissimilarity and why.

GS> The issue is not NA in the input data, but that your
GS> input data is leading to NA in the computed
GS> dissimilarities. This might be due to NA's in your input
GS> data, where a pair of samples has no common set of data
GS> for example.

Yes, that's right on spot, thank you Gavin.

This is indeed true:
It *does* allow for NA's (in the data matrix), but if the pattern of NA's is such that the dissimilarity between two observations becomes undefined, namely e.g. if they have no common non-missings, then ``that's too much''.

In general, I'd recommend to use
    dm <- daisy(,...)
trying methods that are better with NAs, e.g. Gower's metric, until dm has {nearly} no NAs, and then figure out some imputation to replace all NA's in dm by "reasonable values", then do clustering with the resulting dissimilarity "matrix" dm.

HOWEVER, in your case, dm would correspond to a 23371 x 23371 dissimilarity matrix; stored as a double precision matrix (on a 64-bit platform) that's an object of size 4.4 GBytes, not very convenient to work with.
As a dissimilarity object it will only be about half of that size, but that's still ``a bit large''.
As I said above, for such data, I would never do fully hierarchical clustering, but rather something else.

Martin Maechler, ETH Zurich

GS> HTH
GS> G

>> The part of the agnes documentation I was referring to is :
>>
>> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed."
>>
>> So, I'm under the impression it handles NAs on its own ?
>>
>> - Dario.
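The size estimate in Martin's message can be checked with a little arithmetic. The sketch below assumes 8 bytes per double and decimal gigabytes (1e9 bytes), and ignores the small object overhead R adds; the variable names are illustrative only.

```r
n <- 23371

# Full n x n matrix of doubles, 8 bytes each.
bytes_full <- n^2 * 8

# A "dist"-style object stores only the n * (n - 1) / 2 entries below the diagonal.
bytes_dist <- n * (n - 1) / 2 * 8

sizes_gb <- c(full = bytes_full, dist = bytes_dist) / 1e9
print(round(sizes_gb, 2))
```

This comes to roughly 4.4 GB for the full matrix and about 2.2 GB for the lower triangle, in line with the figures quoted above.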
Re: [R] agnes clustering and NAs
On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
> Hello,
>
> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
>
> i.e.
>
> > str(iMatrix)
>  num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
>  - attr(*, "dimnames")=List of 2
>   ..$ : NULL
>   ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...
>
> For the snippet of checking for NAs, I get all TRUEs, so I have at least one
> NA in each column.

Sorry, my bad. Try this:

apply(iMatrix, 1, function(x) all(is.na(x)))

will check that you have no fully `NA` rows.

Also look at str(iMatrix) for potential problems.

Finally, try:

out <- dist(iMatrix)
any(is.na(out))

should repeat what agnes is doing to compute the dissimilarity matrix. If that returns TRUE, go and find which samples are giving NA dissimilarity and why.

The issue is not NA in the input data, but that your input data is leading to NA in the computed dissimilarities. This might be due to NA's in your input data, where a pair of samples has no common set of data for example.

HTH

G

> The part of the agnes documentation I was referring to is :
>
> "In case of a matrix or data frame, each row corresponds to an observation,
> and each column corresponds to a variable. All variables must be numeric.
> Missing values (NAs) are allowed."
>
> So, I'm under the impression it handles NAs on its own ?
>
> - Dario.
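To make Gavin's dist() check concrete, here is a small made-up example (the matrix `m` is hypothetical, not the poster's data) in which two rows have no observed column in common. Base R's dist() skips missing values pairwise and returns NA for a pair with nothing left to compare; whether agnes() computes its dissimilarities in exactly the same way is assumed here, following the suggestion above.

```r
m <- matrix(c(
   1,  2, NA, NA,    # row 1: observed only in columns 1-2
  NA, NA,  3,  4,    # row 2: observed only in columns 3-4
   2,  2,  3,  5),   # row 3: fully observed
  nrow = 3, byrow = TRUE)

d <- dist(m)
any(is.na(d))        # the check suggested in the message above

# Locate the offending pairs, each reported once (upper triangle only).
dm       <- as.matrix(d)
na_pairs <- which(is.na(dm) & upper.tri(dm), arr.ind = TRUE)
print(na_pairs)
```

Here only the pair of rows 1 and 2 is undefined; row 3 overlaps with both, so its distances are computed from the shared columns.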
Re: [R] agnes clustering and NAs
Hello,

Yes, that's right, it is a values matrix. Not a dissimilarity matrix.

i.e.

> str(iMatrix)
 num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...

For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.

The part of the agnes documentation I was referring to is :

"In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed."

So, I'm under the impression it handles NAs on its own ?

- Dario.

Original message
>Date: Thu, 27 Jan 2011 12:53:27 +
>From: Gavin Simpson
>Subject: Re: [R] agnes clustering and NAs
>To: Uwe Ligges
>Cc: d.strbe...@garvan.org.au, r-help@r-project.org

--
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] agnes clustering and NAs
On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
>
> On 27.01.2011 05:00, Dario Strbenac wrote:
> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and
> > it runs for about a minute, then gives an error :
> >
> >> agnes(iMatrix)
> > Error in agnes(iMatrix) :
> >   No clustering performed, NA-values in the dissimilarity matrix.
> >
> > I've also tried getting rid of rows with all NAs in them, and it still gave
> > me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the
> > claim made by its documentation.
>
> I haven't looked in the file, but you need to get rid of all NA, or in
> other words, all rows that contain *any* NA values.

If one believes the documentation, then that only applies to the case where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw data matrix or data frame.

The only way the OP could have gotten that error with the call shown is if iMatrix were not a dissimilarity matrix inheriting from class "dist", so `NA`s should be allowed.

My guess would be that the OP didn't get rid of all the `NA`s.

Dario: what does:

sapply(iMatrix, function(x) any(is.na(x)))

or if iMatrix is a matrix:

apply(iMatrix, 2, function(x) any(is.na(x)))

say?

G

> Uwe Ligges

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
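The two checks discussed in this thread (any NA per column, all NA per row) can be sketched on a toy matrix. The matrix `m` below is made up for illustration; colSums() and rowSums() over is.na() are used as a vectorised equivalent of the apply() calls. Note that sapply() on a plain matrix visits single elements rather than columns, which is presumably why the apply() variant is offered for the matrix case.

```r
m <- matrix(c( 1, NA,  3,
              NA, NA, NA,    # a fully missing row
               7,  8,  9),
            nrow = 3, byrow = TRUE)

# Per column: is there at least one NA?
col_has_na <- colSums(is.na(m)) > 0

# Per row: is every value NA?
row_all_na <- rowSums(is.na(m)) == ncol(m)

# sapply() on a matrix returns one result per element, not per column.
per_element <- sapply(m, function(x) any(is.na(x)))

print(col_has_na)
print(which(row_all_na))
print(length(per_element))
```

A single fully missing row is enough to make every column report TRUE, so the per-column check alone says little about where the problem rows are.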
Re: [R] agnes clustering and NAs
On 27.01.2011 05:00, Dario Strbenac wrote:
> Hello,
>
> In the documentation for agnes in the package 'cluster', it says that NAs are
> allowed, and sure enough it works for a small example like :
>
>> m <- matrix(c(
>     1, 1, 1, 2,
>     1, NA, 1, 1,
>     1, 2, 2, 2), nrow = 3, byrow = TRUE)
>> agnes(m)
> Call:    agnes(x = m)
> Agglomerative coefficient:  0.1614168
> Order of objects:
> [1] 1 2 3
> Height (summary):
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   1.155   1.247   1.339   1.339   1.431   1.524
>
> Available components:
> [1] "order" "height" "ac" "merge" "diss" "call" "method" "data"
>
> But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it
> runs for about a minute, then gives an error :
>
>> agnes(iMatrix)
> Error in agnes(iMatrix) :
>   No clustering performed, NA-values in the dissimilarity matrix.
>
> I've also tried getting rid of rows with all NAs in them, and it still gave me
> the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim
> made by its documentation.

I haven't looked in the file, but you need to get rid of all NA, or in other words, all rows that contain *any* NA values.

Uwe Ligges

> The matrix I'm using can be obtained here :
> http://129.94.136.7/file_dump/dario/iMatrix.obj
>
> --
> Dario Strbenac
> Research Assistant
> Cancer Epigenetics
> Garvan Institute of Medical Research
> Darlinghurst NSW 2010
> Australia
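Uwe's suggestion of removing every row that contains any NA can be sketched with complete.cases() from base R, shown here on the small matrix from the original post. This is the most conservative option; as the later replies point out, agnes() itself tolerates NAs in a data matrix as long as every pair of rows still shares some observed values.

```r
m <- matrix(c(
  1,  1, 1, 2,
  1, NA, 1, 1,
  1,  2, 2, 2), nrow = 3, byrow = TRUE)

# TRUE for rows without any missing value.
keep <- complete.cases(m)

# drop = FALSE keeps the result a matrix even if only one row survives.
m_complete <- m[keep, , drop = FALSE]
print(m_complete)
```

On real data with at least one NA in many rows, this can discard a large share of the observations, so it is worth checking `sum(keep)` before clustering.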
[R] agnes clustering and NAs
Hello,

In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :

> m <- matrix(c(
    1, 1, 1, 2,
    1, NA, 1, 1,
    1, 2, 2, 2), nrow = 3, byrow = TRUE)
> agnes(m)
Call:    agnes(x = m)
Agglomerative coefficient:  0.1614168
Order of objects:
[1] 1 2 3
Height (summary):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.155   1.247   1.339   1.339   1.431   1.524

Available components:
[1] "order" "height" "ac" "merge" "diss" "call" "method" "data"

But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :

> agnes(iMatrix)
Error in agnes(iMatrix) :
  No clustering performed, NA-values in the dissimilarity matrix.

I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.

The matrix I'm using can be obtained here :
http://129.94.136.7/file_dump/dario/iMatrix.obj

--
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia