Re: [R] Cluster analysis
Hi,

R has a vast array of tools for cluster analysis. There's even a task view:
https://cran.r-project.org/web/views/Cluster.html

Working out which method is best for your needs will require spending some time understanding the pros and cons of each, and possibly consulting a local statistician.

Sarah

On Sun, Mar 31, 2019 at 4:20 PM bienvenidoz...@gmail.com wrote:
>
> Hi,
> I have data from farmers with different variables. I would like to classify
> them according to some variables. Can you help me with "R" to find the best
> variables to classify them and how to classify them with "R". Some variables
> are numerical others are ordinal.
>
> Best regards,
> Bienvenue

--
Sarah Goslee (she/her)
http://www.numberwright.com

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Cluster analysis
Hi,

I have data from farmers with several variables, and I would like to classify the farmers according to some of them. Can you help me use R to find the best variables for the classification, and to carry out the classification itself? Some variables are numerical, others are ordinal.

Best regards,
Bienvenue
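One common starting point for data that mix numerical and ordinal variables is Gower dissimilarity followed by partitioning around medoids. The following is only a minimal sketch under stated assumptions: the `farmers` data frame, its column names, and the choice of k are all invented for illustration, and this is one option among the many listed in the task view, not a recommendation for this particular data set.

```r
library(cluster)  # recommended package, ships with R

set.seed(1)
# Hypothetical farmer data: two numeric variables and one ordinal one (all invented)
farmers <- data.frame(
  farm_size = c(rnorm(10, 2, 0.5), rnorm(10, 10, 1)),
  herd      = c(rpois(10, 3), rpois(10, 25)),
  education = factor(sample(c("none", "primary", "secondary"), 20, replace = TRUE),
                     levels = c("none", "primary", "secondary"), ordered = TRUE)
)

# Gower dissimilarity handles numeric and ordered-factor columns together
d <- daisy(farmers, metric = "gower")

# Partition around medoids into k groups; k = 2 is an arbitrary choice here
fit <- pam(d, k = 2)
table(fit$clustering)

# Average silhouette width for several k, as one rough guide to choosing k
sapply(2:5, function(k) pam(d, k = k)$silinfo$avg.width)
```

Which variables to include remains a subject-matter decision; rerunning the sketch with different column subsets shows how sensitive the grouping is to that choice.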
[R] Cluster analysis with Weighted attribute
Hi all,

I'm not very familiar with R, so I have been looking for an R function or package that fits my problem.

What I would like to know is whether there is any R function or package for cluster analysis that takes attribute weights into account. I have seen several papers on attribute value weighting in k-modes clustering, but I could not find a related R function or package.

We obtained the weight of each attribute by interviewing experts. What we want to do is run a cluster analysis that uses those weights on the attributes.

Do you have any suggestions? They would be much appreciated. Thanks for your interest in my question!
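One possibility, offered as a sketch rather than as an implementation of the weighted k-modes methods in those papers: `daisy()` in the cluster package accepts a `weights` argument for Gower dissimilarity, and the weighted dissimilarity can then be passed to `pam()`. The data frame `x` and the weights `w` below are invented placeholders for the real attributes and the expert-derived weights.

```r
library(cluster)

# Hypothetical categorical data and expert weights (both invented)
x <- data.frame(
  colour = factor(c("red", "red", "blue", "blue", "green", "green")),
  shape  = factor(c("round", "round", "square", "round", "square", "square")),
  size   = factor(c("S", "M", "S", "L", "L", "M"))
)
w <- c(0.6, 0.3, 0.1)   # one weight per column, in column order

# Weighted Gower dissimilarity: mismatches on heavily weighted columns count more
d_w <- daisy(x, metric = "gower", weights = w)

# Cluster on the weighted dissimilarity; k = 2 is an arbitrary choice
fit <- pam(d_w, k = 2)
fit$clustering
```

For purely categorical columns this amounts to a weighted simple-matching dissimilarity, which is close in spirit to k-modes but uses medoids rather than modes as cluster centres.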
Re: [R] cluster analysis
Hi

> -----Original Message-----
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Venky
> Sent: Wednesday, June 17, 2015 8:43 AM
> To: R Help R
> Subject: [R] cluster analysis
>
> Hi friends,
>
> I have data like this

In R or elsewhere?

> Group       Employee size  WOE         Employee size2  Weight of Evidence
> 1081680995  0              0.12875537   0.128755       -0.30761
> 1007079896  1              0.48380133  -0.46544        -0.70464
> 1000507407  2              0.26029825  -0.46544         0.070221
> 1006400720  3              0.12875537   0.128755        0.151385
> 1006916029  4              0.12875537  -0.05955         0.320269
> 1006717587  5              0.12875537
> 1002032301  6              0.12875537
> 1007021594  7              0.26029825
> 1007118066  8              0.26029825
>
> In this data first variable (Employee size) has 10 rows and variable 2
> (employee size2) has only 5 rows

Extremely messy due to HTML posting. Please post in plain text, as the Posting Guide recommends.

> Question 1: there are different number of rows so that, we can able to do
> K-means cluster or not?

I am not an expert, but why not try it?

> Question 2: If we run k-means clustering in R answer not coming because of
> NA exists
>
> I have used dataset <- na.omit(dataset)
>
> But that time also i cannot able to run clustering

Perhaps not enough data remained after removing the NAs. To get a better answer you should provide a reproducible example, or at least some usable data.

Cheers
Petr

> Please help me to find this answer
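The point about too little data remaining after NA removal can be checked directly. This is a minimal sketch with invented numbers loosely shaped like the posted table (the column names `woe1` and `woe2` are hypothetical): drop incomplete rows, count what is left, and only then call `kmeans()`.

```r
set.seed(42)
# Invented data: 9 rows, second column missing in the last 4 rows
dataset <- data.frame(
  woe1 = c(0.13, 0.48, 0.26, 0.13, 0.13, 0.13, 0.13, 0.26, 0.26),
  woe2 = c(-0.31, -0.70, 0.07, 0.15, 0.32, NA, NA, NA, NA)
)

complete <- na.omit(dataset)   # note the assignment arrow `<-`, not `-`
nrow(complete)                 # rows left after dropping NAs

k <- 2
if (nrow(complete) > k) {
  # kmeans() needs more distinct rows than clusters
  fit <- kmeans(complete, centers = k, nstart = 10)
  print(fit$cluster)
} else {
  message("Too few complete rows for k = ", k)
}
```

If only a handful of rows survive `na.omit()`, clustering on the fully observed columns alone may be the more useful analysis.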
[R] cluster analysis
Hi friends,

I have data like this:

Group       Employee size  WOE         Employee size2  Weight of Evidence
1081680995  0              0.12875537   0.128755       -0.30761
1007079896  1              0.48380133  -0.46544        -0.70464
1000507407  2              0.26029825  -0.46544         0.070221
1006400720  3              0.12875537   0.128755        0.151385
1006916029  4              0.12875537  -0.05955         0.320269
1006717587  5              0.12875537
1002032301  6              0.12875537
1007021594  7              0.26029825
1007118066  8              0.26029825

In this data the first variable (Employee size) has 10 rows and the second variable (Employee size2) has only 5 rows.

Question 1: The variables have different numbers of rows. Can we still run k-means clustering?

Question 2: When we run k-means clustering in R, no result is returned because NAs exist. I have used

    dataset <- na.omit(dataset)

but even then I cannot run the clustering.

Please help me to find the answer.
Re: [R] Cluster analysis using term frequencies
Dear Sun Shine,

> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))
>
> However, I get an error message when trying to plot this:
> Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
>   invalid dendrogram input.

I don't see anything wrong with the code, so what I'd do is run str(dtes) and str(dtesFreq) to see whether these are what they should be (or if not, what they are instead).

> I'm clearly screwing something up, either in my source data.frame or in
> my setting hclust up, but don't know which, nor how.

Can't comment on your source data, but generally, whatever you do, use str() or even print() to see whether the R objects are all right or what went wrong.

> More than just identifying the error however, I am interested in finding
> a smart (efficient/ elegant) way of checking the occurrence and frequency
> value of the terms that may be associated with 'sports', 'learning', and
> 'extra-mural' and extracting these into a matrix or data frame so that I
> can analyse and plot their clustering to see if how I associated these
> terms is actually supported statistically.

The first thing that comes to my mind (not necessarily the best/most elegant) is to run

    dtes3 <- cutree(dtesFreq, 3)

and to table dtes3 against your manual classification. Note that 3 is the most natural number of clusters to cut the tree here but may not be the best to match your classification (for example, you may have a one-point cluster in the 3-cluster solution, so it may effectively be a two-cluster solution with an outlier). Your dendrogram, if you succeed in plotting it, may give you a hint about that.

Hope this helps,
Christian

> I'm sure that there must be a way of doing this in R, but I'm obviously
> not going about it correctly. Can anyone shine a light please?
>
> Thanks for any help/ guidance.
>
> Regards,
> Sun

--
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
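The checks suggested in this reply can be put together as follows. This is a sketch on invented term frequencies, not the poster's data: `terms`, `manual`, and the counts are made up. It also uses `rownames()` for the labels, on the assumption that rows are the clustered items; a `labels` vector whose length differs from the number of clustered rows is one plausible cause of the "invalid dendrogram input" error, though that is not confirmed for the original data.

```r
set.seed(7)
# Invented term frequencies: rows are terms, columns are documents
terms  <- c("soccer", "football", "chess", "drama", "reading", "maths")
manual <- c("sports", "sports", "extra-mural", "extra-mural", "learning", "learning")
tes.df <- data.frame(matrix(rpois(6 * 4, lambda = 5), nrow = 6,
                            dimnames = list(terms, paste0("doc", 1:4))))

dtes     <- dist(tes.df, method = "euclidean")
dtesFreq <- hclust(dtes, method = "ward.D")

str(dtes)        # inspect the objects before plotting
str(dtesFreq)

# One label per clustered row: rownames(), not names() (which gives the columns)
plot(dtesFreq, labels = rownames(tes.df))

# Cut into 3 clusters and cross-tabulate against the manual classification
dtes3 <- cutree(dtesFreq, k = 3)
table(cluster = dtes3, manual = manual)
```

A table with most counts concentrated in one cell per row would suggest the manual grouping agrees with the clustering; counts spread across a row would suggest it does not.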
[R] Cluster analysis using term frequencies
Hi list,

I am using the 'tm' package to review meeting notes at a school to identify terms frequently associated with 'learning', 'sports', and 'extra-mural' activities, and then to sort any terms according to these three headers in a way that could be supported statistically (as opposed to, say, my own bias, etc.).

To accomplish this, I have done the following:

(1) After the usual pre-processing of the text data, loading it as a corpus and then converting it into a document term matrix (called 'allTerms'), I have identified the 20 most frequently occurring terms in the meeting notes and extracted these into a named vector called 'freqTerms'. Many of the terms returned have nothing to do with any of the three themes of 'learning', 'sports', or 'extra-mural'.

(2) Therefore, I have also manually generated a list of terms and synonyms for 'learning' and 'sports', etc. (e.g. 'football', 'soccer', 'drama', 'chess', etc.) and then tested for the occurrence of each of these terms in the corpus, e.g.:

    allTerms['soccer']

and have come up with a list of some 30 terms together with their frequencies. I manually sorted these according to three headers 'learning', 'sports', and 'extra-mural' and dropped these into a table in a word processing document. Some of these terms are also in the freqTerms vector.

What I want to do now is to use cluster analysis (hclust, from the 'stats' package) to plot a dendrogram of the terms I have manually checked and put into the table, in order to see how similar the terms are and whether they cluster in ways similar to the way I manually sorted them under the table column headers of 'learning', 'sports', and 'extra-mural'.

To do this, I dropped these manually sorted terms into a data frame together with the associated values (which I called 'tes.df') and then tried plotting this as follows:

    dtes <- dist(tes.df, method = 'euclidean')
    dtesFreq <- hclust(dtes, method = 'ward.D')
    plot(dtesFreq, labels = names(tes.df))

However, I get an error message when trying to plot this:

    Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
      invalid dendrogram input.

I'm clearly screwing something up, either in my source data.frame or in my setting hclust up, but don't know which, nor how.

More than just identifying the error however, I am interested in finding a smart (efficient/elegant) way of checking the occurrence and frequency value of the terms that may be associated with 'sports', 'learning', and 'extra-mural' and extracting these into a matrix or data frame so that I can analyse and plot their clustering to see if how I associated these terms is actually supported statistically.

I'm sure that there must be a way of doing this in R, but I'm obviously not going about it correctly. Can anyone shine a light please?

Thanks for any help/guidance.

Regards,
Sun
[R] cluster analysis
I want to do agglomerative hierarchical clustering with the complete linkage method in R, using the function agnes or hclust.

1. Can I do a cluster analysis of h = (n+p+1)/2 out of n observations? Note that p = number of variables (dependent and independent).
2. Can I plot the dendrogram and get the cluster history of this analysis in R?
3. Can I use the cluster with the largest values to sort the n observations in ascending order?

Your assistance and guidance in solving problems 1-3 will be greatly appreciated.

Thanks,
EKELE ALIH
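Questions 1 and 2 can be illustrated with base R. This is a rough sketch on invented data; in particular, which h observations to keep is not specified in the question, so taking the first h rows below is purely an assumption for illustration.

```r
set.seed(3)
x <- matrix(rnorm(20 * 3), nrow = 20)   # invented data: n = 20, p = 3

# Complete-linkage clustering of all n observations
hc <- hclust(dist(x), method = "complete")
plot(hc)                                 # dendrogram

# Cluster history: which items/clusters merged at each step, and at what height
# (negative entries are single observations, positive ones are earlier steps)
history <- data.frame(step = seq_len(nrow(hc$merge)), hc$merge, height = hc$height)
head(history)

# Clustering a subset of h = (n + p + 1) / 2 observations: subset rows first
h    <- floor((nrow(x) + ncol(x) + 1) / 2)
hc_h <- hclust(dist(x[seq_len(h), ]), method = "complete")

# cluster::agnes() offers the same linkage; its merge history is in ag$merge
library(cluster)
ag <- agnes(x, method = "complete")
```

For question 3, `cutree(hc, k)` returns the cluster label of each observation, which can then be used with `order()` to sort the rows by cluster.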
[R] Cluster analysis
I am doing cluster analysis of my SNP data. I have two questions:

1. I draw the cluster dendrogram with hclust using the following code:

    data <- read.table(as.matrix(file.choose()), header=T, row.names = 1, sep="\t")
    plot(hclust(as.dist(data), method="complete"))

It is drawn horizontally, and I don't know how to change it to a vertical layout.

2. I would like to have bootstrap values, but have had no luck. I am using the following code:

    result <- pvclust(as.dist(data), method.dist="cor", method.hclust="complete", nboot=1000)

    Error in cor(x, method = "pearson", use = use.cor) :
      supply both 'x' and 'y' or a matrix-like 'x'

I would appreciate it if someone could help me, please.

--
Abbasali Ali Ravanlou
PhD candidate of Plant Pathology
Dept. of Crop Sci.
University of Illinois-UC
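Both questions can be sketched on invented data. For orientation, converting the tree with `as.dendrogram()` allows `horiz = TRUE`. For the bootstrap, `pvclust()` resamples the raw data and computes the distances itself, so it is given the data matrix rather than a `dist` object, which is the likely source of the `cor()` error above. The matrix `snp` is a made-up stand-in, and the sketch assumes the raw marker data are available, not only a distance matrix.

```r
set.seed(11)
# Invented data: 30 markers (rows) by 6 samples (columns)
snp <- matrix(rnorm(30 * 6), nrow = 30, dimnames = list(NULL, paste0("s", 1:6)))

# hclust() wants a dist object; here 1 - correlation between samples
d  <- as.dist(1 - cor(snp))
hc <- hclust(d, method = "complete")

plot(hc)                                  # root at the top
plot(as.dendrogram(hc), horiz = TRUE)     # root at the side, leaves along the y axis

# pvclust clusters the columns of a data matrix and bootstraps the rows
if (requireNamespace("pvclust", quietly = TRUE)) {
  res <- pvclust::pvclust(snp, method.dist = "correlation",
                          method.hclust = "complete", nboot = 100)
  plot(res)                               # dendrogram annotated with p-values
}
```

`nboot = 100` only keeps the sketch quick; a value such as the 1000 used in the question is more usual for reporting.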
[R] Cluster analysis on weighted survey data with continuous and categorical variables
I am trying to perform cluster analysis on survey data where each respondent has answered several questions, some of which have categorical answers (blue, pink, green, etc.) and some of which have scale answers (rating from 1 to 10, etc.).

My problem is that certain age groups were over-sampled and I need to weight the data collected in order to accurately reflect the current population.

Will it make a difference if I do the cluster analysis on the weighted data, and if so, how do I do cluster analysis on the weighted data?

Any advice would be much appreciated!

Thanks,
Emma
Re: [R] Cluster analysis on weighted survey data with continuous and categorical variables
On Wed, Mar 20, 2013 at 3:55 AM, Emma Gibson waterbab...@hotmail.com wrote:

> I am trying to perform cluster analysis on survey data where each
> respondent has answered several questions, some of which have categorical
> answers (blue pink green etc) and some of which have scale answers (rating
> from 1 to 10 etc). My problem is that certain age groups were over-sampled
> and I need to weight the data collected in order to accurately reflect the
> current population. Will it make a difference if I do the cluster analysis
> on the weighted data, and if so, how do I do cluster analysis on the
> weighted data? Any advice would be much appreciated! Thanks Emma

The unequal sampling will have some effect on most clustering methods (e.g. not single-linkage, but k-means or average-linkage). Whether this matters depends on whether you have genuinely separate clusters in the population or a general mush that you are trying to segment in some convenient way.

If you have genuine well-separated clusters, then ignoring the oversampling is likely to do well. If you don't, you will get a segmentation into clusters that partitions the over-sampled people too finely and the under-sampled people too coarsely.

I don't know of any R functions that cluster with sampling weights. If your data set is fairly small, you could expand it by making duplicates (perhaps jittered) of some points, and cluster the expanded data set. On the other hand, if it is very large, you can thin it out to a uniform sample by sampling from it with probability inversely proportional to the original sampling probability.

- thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
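The two workarounds described in this reply, thinning and expanding, might be sketched as below. Everything here is assumed for illustration: the `survey` data frame, the sampling probabilities in `samp_prob`, and the use of Gower dissimilarity with `pam()` as the clustering step, which the reply does not prescribe.

```r
library(cluster)
set.seed(5)

# Invented survey: young respondents were sampled at 3x the rate of old ones
n <- 200
survey <- data.frame(
  age_group = factor(rep(c("young", "old"), times = c(150, 50))),
  colour    = factor(sample(c("blue", "pink", "green"), n, replace = TRUE)),
  rating    = sample(1:10, n, replace = TRUE)
)
samp_prob <- ifelse(survey$age_group == "young", 0.3, 0.1)  # assumed known

# Thinning: keep each row with probability proportional to 1 / samp_prob
keep    <- runif(n) < min(samp_prob) / samp_prob
thinned <- survey[keep, ]

# Expansion: replicate each row in proportion to its weight 1 / samp_prob
reps     <- round((1 / samp_prob) / min(1 / samp_prob))
expanded <- survey[rep(seq_len(n), times = reps), ]

# Cluster either version; Gower dissimilarity copes with factors and ratings
fit <- pam(daisy(thinned[, c("colour", "rating")], metric = "gower"), k = 3)
table(fit$clustering)
```

Thinning discards data and expansion inflates the apparent sample size, so either is best treated as an exploratory device; comparing the result with an unweighted run shows whether the oversampling matters for the data at hand.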
[R] Cluster analysis in the setting of repeated measures
Does R have any function for performing cluster analysis when each subject contributes more than one observation to the analysis, i.e. a repeated measures cluster analysis? I prefer an agglomerative clustering, but would certainly be happy with a k-means or other clustering technique. To the best of my knowledge, the standard R clustering functions (e.g. kmeans, hclust, pvclust) all assume that each subject contributes a single line of data to the analyses.

Thanks,
John

John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)
[R] Cluster Analysis and PCoA (mixt variables)
Hello everyone,

I am writing to you because of my lack of knowledge of statistics.

I'm using cluster analysis and PCoA (though maybe I should use other techniques) to determine the differences and similarities between a large sample of plants, using different kinds of traits held in a matrix of mixed variables.

I understand that the daisy() function, with the Gower metric and with the type of each variable declared, is a good way to deal with such mixed variables. And in fact my plots (from agnes() in the cluster package, more so than my PCoA) reflect quite well what I was expecting from the appearance of those different plants.

My problem: I now need to understand which variables are taken into account in producing the dissimilarity matrix that is used for the cluster analysis or the PCoA. In other words, how are the branches of my cluster analysis tree constructed?

I have spent a month working out most of what I know today about data analysis and R. I'm really sorry for asking such simple things that are not strictly R issues, but I have tried in many ways and just can't figure it out.

Thank you,
Julien Mehl Vettori
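As background for the question above: Gower's coefficient averages per-variable dissimilarities, each scaled to lie between 0 and 1, so every column passed to `daisy()` enters the matrix unless weights say otherwise. The sketch below, on invented traits, shows the pipeline and one rough, informal way to gauge each variable's influence by comparing its single-variable dissimilarity with the overall one. The `traits` data frame and the `influence` measure are illustrative assumptions, not an established diagnostic.

```r
library(cluster)
set.seed(9)

# Invented plant traits of mixed type
traits <- data.frame(
  height = rnorm(15, 50, 10),
  leaf   = factor(sample(c("simple", "compound"), 15, replace = TRUE)),
  habit  = factor(sample(c("herb", "shrub", "tree"), 15, replace = TRUE)),
  shade  = factor(sample(c("low", "mid", "high"), 15, replace = TRUE),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

d <- daisy(traits, metric = "gower")
summary(d)                      # reports how each column was typed

tree <- agnes(d, method = "average")   # hierarchical clustering on d
pcoa <- cmdscale(d, k = 2)             # principal coordinates on the same d

# Rough influence check: correlate the dissimilarity computed from one
# variable alone with the overall dissimilarity
influence <- sapply(names(traits), function(v) {
  cor(as.numeric(daisy(traits[, v, drop = FALSE], metric = "gower")), as.numeric(d))
})
round(influence, 2)
```

The tree itself is then built only from `d`: at each step `agnes()` merges the two clusters with the smallest between-cluster dissimilarity under the chosen linkage, so the variables act on the branches only through the dissimilarity matrix.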
[R] cluster analysis error - mclust package
I am following instructions online for cluster analysis using the mclust package, and keep getting errors.

http://www.statmethods.net/advstats/cluster.html

These are the instructions (there is no sample dataset unfortunately):

    # Model Based Clustering
    library(mclust)
    fit <- Mclust(mydata)
    plot(fit, mydata) # plot results
    print(fit) # display the best model

This is what I did and the error I get:

    library(mclust)
    fit <- Mclust(mydat)
    plot(fit, mydat) #plot results
    Error in match.arg(what, c("BIC", "classification", "uncertainty", "density"), :
      'arg' must be NULL or a character vector

My data is arranged so that each row represents one individual with 9 values of morphological data. I want to see if they will group into 2 clusters, representing gender.

I have tried using the instructions from the cran-r website, but they didn't work either.

Any help would be great, thank you.

-- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-error-mclust-package-tp4650842.html
Sent from the R help mailing list archive at Nabble.com.
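The error message itself points at the likely cause: the second argument of `plot()` for an Mclust fit is `what` (one of "BIC", "classification", "uncertainty", "density"), so passing the data there fails in `match.arg()`. A minimal sketch of the adjusted call, on invented data standing in for `mydat` and assuming the mclust package is installed:

```r
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)

  set.seed(2)
  # Invented stand-in for `mydat`: 60 individuals, 3 measurements, two groups
  mydat <- rbind(matrix(rnorm(90, mean = 0), ncol = 3),
                 matrix(rnorm(90, mean = 4), ncol = 3))

  fit <- Mclust(mydat)            # number of clusters chosen by BIC
  print(fit)

  # Name the plot type instead of passing the data as the second argument
  plot(fit, what = "classification")
  plot(fit, what = "BIC")

  head(round(fit$z, 3))           # posterior membership probabilities per row
}
```

The `z` component holds, for each individual, the estimated probability of belonging to each cluster, which may be useful when the aim is to see how confidently individuals split into two groups.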
Re: [R] cluster analysis in R
It's hard to answer these questions without knowing what the errors are and how they can be reproduced.

Best,
Ingmar

On Thu, Nov 22, 2012 at 1:03 AM, KitKat katherinewri...@trentu.ca wrote:

> Thanks, I have been trying that site and another one
> (http://www.statmethods.net/advstats/cluster.html)
>
> I don't know if I should be doing mclust or mcclust, but either way, the
> codes are not working. I am following the guidelines online at:
> mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
> mclust - http://cran.r-project.org/
>
> I am relatively new to R, but so far I have been able to figure out dfa,
> manova, pca... I cannot get these codes to work, I keep getting various
> errors. Are there other resources that have details about what codes to
> use or what to do when errors result? I have not found anything else
> helpful
>
> Thank you
Re: [R] cluster analysis in R
These are the errors I've been having. I have been trying 3 different things.

1- Mclust

This is the example I have been following:

    # Model Based Clustering
    library(mclust)
    fit <- Mclust(mydata)
    plot(fit, mydata) # plot results
    print(fit) # display the best model

What I have done:

    fit <- Mclust(mydat)
    plot(fit, mydat) #plot results
    Error in match.arg(what, c("BIC", "classification", "uncertainty", "density"), :
      'arg' must be NULL or a character vector

2- Mclust using different website (cran-r) instructions

This is the example:

    mydatMclust <- Mclust(mydat)
    summary(mydatMclust)
    summary(mydatMclust, parameters = TRUE)
    plot(mydatMclust)

There are a couple of other steps but the plot is the problem. I get two plots where there should be four. One should be plotting all my individuals but it's plotting my variables instead. It's also taking a very long time. R script at this point says: Waiting to confirm page change…

3- Mcclust

Instructions from cran-r:

    data(cls.draw2)  # sample of 500 clusterings from a Bayesian cluster model
    tru.class <- rep(1:8, each=50)  # the true grouping of the observations
    psm2 <- comp.psm(cls.draw2)  # posterior similarity matrix
    # optimize criteria based on PSM
    mbind2 <- minbinder(psm2)
    mpear2 <- maxpear(psm2)
    # Relabelling
    k <- apply(cls.draw2, 1, function(cl) length(table(cl)))
    max.k <- as.numeric(names(table(k))[which.max(table(k))])
    relab2 <- relabel(cls.draw2[k==max.k,])
    # compare clusterings found by different methods with true grouping
    arandi(mpear2$cl, tru.class)
    arandi(mbind2$cl, tru.class)
    arandi(relab2$cl, tru.class)

I called my data mydat, so I changed that where appropriate. I cannot get past one early step, psm2 <- comp.psm(cls.draw2). The error reads:

    Error: could not find function comp.psm

I think I have all the appropriate packages installed. I don't know what more to do about these three errors. Any help would be great!

Thank you

-- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650466.html
Sent from the R help mailing list archive at Nabble.com.
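A "could not find function" error usually means the package providing the function is installed but not attached in the current session, or not installed at all. The sketch below checks both for mcclust, assuming that is where `comp.psm()` comes from, as the quoted instructions indicate. Note also that, going by the comment in those instructions, `cls.draw2` is a sample of clusterings from a Bayesian model, so substituting raw measurements such as `mydat` for it is probably not what these functions expect.

```r
pkg <- "mcclust"
if (requireNamespace(pkg, quietly = TRUE)) {
  library(mcclust)                 # attach, so comp.psm() and friends are visible
  data(cls.draw2)                  # example data shipped with mcclust
  psm2 <- comp.psm(cls.draw2)      # posterior similarity matrix
  print(dim(psm2))
} else {
  message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
}

# exists() reports whether a function is currently visible in the session
found <- exists("comp.psm")
found
```

Installing a package with `install.packages()` is done once; attaching it with `library()` is needed in every new R session before its functions can be called by name.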
Re: [R] cluster analysis in R
Thank you for replying! I made a new post asking if there are any websites or files on how to download package mclust (or other Bayesian cluster analysis packages) and the appropriate R functions? Sorry, I don't know how this forum works yet.

-- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650341.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] cluster analysis in R
http://cran.r-project.org/web/views/Cluster.html might be a good start

Brian

On Nov 21, 2012, at 1:36 PM, KitKat wrote:

> Thank you for replying! I made a new post asking if there are any websites
> or files on how to download package mclust (or other Bayesian cluster
> analysis packages) and the appropriate R functions? Sorry I don't know how
> this forum works yet
Re: [R] cluster analysis in R
Thanks, I have been trying that site and another one (http://www.statmethods.net/advstats/cluster.html).

I don't know if I should be doing mclust or mcclust, but either way, the codes are not working. I am following the guidelines online at:

mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
mclust - http://cran.r-project.org/

I am relatively new to R, but so far I have been able to figure out dfa, manova, pca... I cannot get these codes to work; I keep getting various errors. Are there other resources that have details about what codes to use or what to do when errors result? I have not found anything else helpful.

Thank you

-- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650397.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] cluster analysis in R
Dear Katherine,

function flexmixedruns in package fpc may do what you want; it fits mixtures with continuous and categorical variables, can use the BIC for giving you the number of mixture components and also gives you posterior probabilities for cases to belong to components.

Note that generally finding the right cluster analysis method is a complicated task and depends crucially on your application, what use you want to make of the clusters etc., so what's best cannot be conclusively said on a mailing list.

The same holds for whether and how to select variables. Certainly it's not wrong in general to use all the variables that you have, but whether it's better otherwise depends on what meaning your variables have and how this relates to the aim of clustering, what to do with the variables afterwards etc.

You may have a look at
http://www.rss.org.uk/site/cms/contentviewarticle.asp?article=866#Link%20to%20Nov.%202012%20paper
where I discuss a number of related issues.

Best regards,
Christian

--
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] on behalf of KitKat [katherinewri...@trentu.ca]
Sent: 15 November 2012 18:14
To: r-help@r-project.org
Subject: [R] cluster analysis in R

> I have two issues.
>
> 1-I am trying to use morphology to identify gender. I have 9 variables,
> both continuous and categorical. I was using two-step cluster analysis in
> SPSS because two-step could deal with different types of variables. But
> the output tells me that an animal is in cluster 1 or 2, it does not give
> me a probability (ex. 0.70 cluster 2). I also did not want to specify that
> I want two clusters, I wanted to see if analysis would naturally give me
> two clusters. These were all advantages to using SPSS but now I'm having
> trouble. Does cluster analysis in R give probabilities? Which type of
> cluster analysis in R is best to use? I did not think hierarchical
> analysis was a great choice, but maybe I'm wrong. I don't want to create
> the average variable, I want the analysis to do it on its own. I'm also
> new to R so would have to figure out the right codes to enter, etc.
>
> 2-I was also told to analyze each variable on its own before including it
> in cluster analysis. I had first included them all then teased out which
> ones were not important, but now have been asked to do the reverse. I
> cannot do cluster analysis on one variable -for example, one variable is
> either present or absent on an individual so of course cluster analysis
> gives me two clusters, one representing present and one representing
> absent. I was told to use regression, but how can regression also not give
> the same result? I feel like it would give me a line connecting a bunch of
> 0s to 1s. I don't know what to use, or if I can analyze each variable like
> this before putting them into cluster analysis. I ultimately want to only
> use the smallest number of variables necessary to identify gender.
>
> I have tried reading manuals etc and talking to people at my school, but
> nothing has helped. If anyone has any insight, that would be much
> appreciated
>
> Thank you!
[R] cluster analysis in R
I have two issues. 1-I am trying to use morphology to identify gender. I have 9 variables, both continuous and categorical. I was using two-step cluster analysis in SPSS because two-step could deal with different types of variables. But the output tells me that an animal is in cluster 1 or 2, it does not give me a probability (ex. 0.70 cluster 2). I also did not want to specify that I want two clusters, I wanted to see if analysis would naturally give me two clusters. These were all advantages to using SPSS but now I'm having trouble. Does cluster analysis in R give probabilities? Which type of cluster analysis in R is best to use? I did not think hierarchical analysis was a great choice, but maybe I'm wrong. I don't want to create the average variable, I want the analysis to do it on its own. I'm also new to R so would have to figure out the right codes to enter, etc. 2-I was also told to analyze each variable on its own before including it in cluster analysis. I had first included them all then teased out which ones were not important, but now have been asked to do the reverse. I cannot do cluster analysis on one variable -for example, one variable is either present or absent on an individual so of course cluster analysis gives me two clusters, one representing present and one representing absent. I was told to use regression, but how can regression also not give the same result? I feel like it would give me a line connecting a bunch of 0s to 1s. I don't know what to use, or if I can analyze each variable like this before putting them into cluster analysis. I ultimately want to only use the smallest number of variables necessary to identify gender. I have tried reading manuals etc and talking to people at my school, but nothing has helped. If anyone has any insight, that would be much appreciated Thank you! -- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html Sent from the R help mailing list archive at Nabble.com. 
Re: [R] cluster analysis in R
Dear KitKat, After installing R and reading some introductory material on getting started with R you may want to check the CRAN task view on cluster analysis: http://cran.r-project.org/web/views/Cluster.html which has many useful references to all kinds and flavors of clustering techniques, hierarchical or not, selecting the nr of clusters based on some model selection statistic, et cetera. hth, Ingmar
Re: [R] cluster analysis in R
Have a look at the package mclust. Jose
Re: [R] Cluster Analysis
Hi, Taisa, It depends on many factors, e.g. the nature of your data, the volume of the data set etc. The analog of SAS FASTCLUS in R is kmeans (for a practical example check slide #35 here: http://www.slideshare.net/whitish/textmining-with-r). Check also k-medoids (pam) and hclust. Good luck, -Alex
Re: [R] Cluster Analysis
At the R command prompt:

    ?kmeans   (for info on the R equivalent to FASTCLUS)
    ?hclust   (for info on the R equivalent to CLUSTER)

Install package clusterSim and look at function index.G1 for the Calinski-Harabasz pseudo F-statistic.

-- David L Carlson Associate Professor of Anthropology Texas A&M University College Station, TX 77843-4352
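For the pseudo F and the R squared plot asked about in this thread, the quantities can also be computed directly from what kmeans() returns. The following is only a sketch under assumptions: the data are simulated, the range of k is arbitrary, and the names x, ks and stats are made up for illustration. It does not cover the Bonferroni-corrected significance test.

```r
# Simulated data: three groups in two dimensions (illustrative only)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
n <- nrow(x)

ks <- 2:6
stats <- t(sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  # R squared: share of the total sum of squares explained by the clusters
  r2 <- km$betweenss / km$totss
  # Calinski-Harabasz pseudo F: between-cluster vs. within-cluster mean squares
  pseudo_f <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
  c(k = k, r2 = r2, pseudo_f = pseudo_f)
}))
stats <- as.data.frame(stats)
print(stats)

# R squared against the number of clusters
plot(stats$k, stats$r2, type = "b",
     xlab = "Number of clusters", ylab = "R squared")
```

The same two formulas apply to a partition obtained from cutree() on an hclust tree, provided the between and within sums of squares are computed for that partition.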
[R] Cluster Analysis
Hi, I was wondering what the best equivalent to SAS's FASTCLUS and PROC CLUSTER would be. I need to be able to test the significance of the clusters by comparing the probability of obtaining an equal or greater pseudo F to the Bonferroni-corrected level. I will also need to plot r squared against the number of clusters. Thanks so much, Taisa [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] cluster analysis with pairwise data
Hello, I want to do a cluster analysis with my data. The problem is that the variables don't consist of single values; the entries are pairs of values. It looks like this:

    Variable 1:   Variable 2:   Variable 3:   ...
    (1,2)         (1,5)         (4,2)
    (7,8)         (3,88)        (6,5)
    (4,7)         (12,4)        (4,4)
    ...           ...           ...

Is it possible to perform a cluster analysis with this kind of data in R? I don't even know how to get this data into a matrix or a data frame or anything like this. It would be really nice if somebody could help me. Best regards and happy Easter, Claudia
Re: [R] cluster analysis with pairwise data
You can create distance matrices for each Variable, square them, sum them, and take the square root. As for getting the data into a data frame, the simplest would be to enter the three variables into six columns like the following:

    data
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    2    1    5    4    2
    [2,]    7    8    3   88    6    5
    [3,]    4    7   12    4    4    4

Then use dist() on each pair of columns: 1:2, 3:4, 5:6 . . . e.g. for the 3 rows of data you provided:

    # all-zero dist object with one entry per pair of rows
    dm <- dist(rep(0, nrow(data)))
    for(i in seq(1, 6, 2)) {
        dm <- dm + dist(data[,i:(i+1)])^2
    }
    dm <- sqrt(dm)
    dm

-- David L Carlson Associate Professor of Anthropology Texas A&M University College Station, TX 77843-4352
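The square, sum and square-root recipe can be checked on the three example rows. The sketch below assumes the six-column layout described above and uses made-up names (blocks, sq). It also shows that, with plain Euclidean distance on every pair, the combined distance is the same as dist() on all six columns, so the per-pair route mainly matters when the pairs are to be weighted or measured differently.

```r
# The three example rows, one pair of columns per original variable
data <- matrix(c(1, 2,  1,  5, 4, 2,
                 7, 8,  3, 88, 6, 5,
                 4, 7, 12,  4, 4, 4), nrow = 3, byrow = TRUE)

# Squared Euclidean distance for each pair of columns (1:2, 3:4, 5:6)
blocks <- seq(1, ncol(data), by = 2)
sq <- lapply(blocks, function(i) dist(data[, i:(i + 1)])^2)

# Sum the squared distances and take the square root
dm <- sqrt(Reduce(`+`, sq))
dm

# Same result as the ordinary Euclidean distance on all six columns
all.equal(as.vector(dm), as.vector(dist(data)))

# The combined distances can be passed on to a clustering function
hc <- hclust(dm, method = "average")
hc$merge
```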
Re: [R] cluster analysis with pairwise data
Hi. The data as they are may be read into R as character data. The exact way depends on the format of the data in the file. The result may look like the following.

    Var1 <- c("(1,2)", "(7,8)", "(4,7)")
    Var2 <- c("(1,5)", "(3,88)", "(12,4)")
    Var3 <- c("(4,2)", "(6,5)", "(4,4)")
    DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)

If you want to use a distance between pairs depending on the numbers (and not only equal/different pair), then the data should be transformed to a numeric format. For example, as follows

    trans <- function(x) {
        y <- strsplit(gsub("[()]", "", x), ",")
        unname(t(vapply(y, FUN=as.numeric, FUN.VALUE=c(0, 0))))
    }
    DF <- data.frame(Var1=trans(Var1), Var2=trans(Var2), Var3=trans(Var3))
    DF
      Var1.1 Var1.2 Var2.1 Var2.2 Var3.1 Var3.2
    1      1      2      1      5      4      2
    2      7      8      3     88      6      5
    3      4      7     12      4      4      4

Then, see library(help=cluster). Hope this helps. Petr Savicky.
Re: [R] cluster analysis with pairwise data
On Wed, Apr 4, 2012 at 10:12 AM, Petr Savicky savi...@cs.cas.cz wrote:
> On Wed, Apr 04, 2012 at 01:32:10PM +0200, paladini wrote:
>
>     Var1 <- c("(1,2)", "(7,8)", "(4,7)")
>     Var2 <- c("(1,5)", "(3,88)", "(12,4)")
>     Var3 <- c("(4,2)", "(6,5)", "(4,4)")
>     DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)
>
> If you want to use a distance between pairs depending on the numbers (and not only equal/different pair), then the data should to be transformed to a numeric format.

Or if the pairs have unique meaning, ?daisy, also in the cluster package, comes in handy (in this case you'll want to keep Vi as factors in the call to DF). Cheers
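A small sketch of the daisy() route, assuming each pair is treated as an unordered label rather than as two numbers. The fourth row and the repeated pairs are invented here so that some rows actually share values; with the three original rows every pair is distinct and all dissimilarities would be 1.

```r
library(cluster)

# Pairs kept as factors: each distinct pair is just a category.
# Row 4 and the repeated values are made up for illustration.
DF <- data.frame(
  Var1 = factor(c("(1,2)", "(7,8)",  "(1,2)", "(4,7)")),
  Var2 = factor(c("(1,5)", "(3,88)", "(1,5)", "(3,88)")),
  Var3 = factor(c("(4,2)", "(6,5)",  "(4,4)", "(6,5)"))
)

# With only nominal variables, Gower dissimilarity is the share of
# variables on which two rows disagree
d <- daisy(DF, metric = "gower")
as.matrix(d)

# The dissimilarities can then go to pam() or hclust()
pam(d, k = 2)$clustering
```

Rows 1 and 3 agree on two of the three variables, so their dissimilarity is 1/3, while rows 1 and 2 agree on none and get 1.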
[R] cluster analysis on extreme event
Dear all, I'm modelling extreme rainfall, particularly events that lie above a threshold. I was searching for a suitable package in R which may enable a cluster analysis on those extreme events and would really appreciate any suggestions. Thanks, Fir
[R] Cluster analysis, factor variables, large data set
Dear R helpers, I have a large data set with 36 variables and about 50,000 cases. The variables represent labour market status during 36 months; there are 8 different variable values (e.g. Full-time Employment, Student, ...). Only cases with at least one change in labour market status are included in the data set. To analyse subsets of the data, I have used daisy in the cluster package to create a distance matrix and then used pam (or pamk in the fpc package) to get a k-medoids cluster solution. Now I want to analyse the whole set. clara is said to cope with large data sets, but the first step in the cluster analysis, the creation of the distance matrix, must be done by another function since clara only works with numeric data. Is there an alternative to the daisy - clara route that does not require as much RAM? What functions would you recommend for a cluster analysis of this kind of data on a large data set? regards, Hans Ekbrand
Re: [R] Cluster analysis, factor variables, large data set
Dear Hans, clara doesn't require a distance matrix as input (and therefore doesn't require you to run daisy), it will work with the raw data matrix using Euclidean distances implicitly. I can't tell you whether Euclidean distances are appropriate in this situation (this depends on the interpretation and variables and particularly on how they are scaled), but they may be fine at least after some transformation and standardisation of your variables. Hope this helps, Christian

*** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
Re: [R] Cluster analysis, factor variables, large data set
On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> Dear Hans, clara doesn't require a distance matrix as input (and therefore doesn't require you to run daisy), it will work with the raw data matrix using Euclidean distances implicitly. I can't tell you whether Euclidean distances are appropriate in this situation (this depends on the interpretation and variables and particularly on how they are scaled), but they may be fine at least after some transformation and standardisation of your variables.

The variables are unordered factors, stored as integers 1:9, where

    1 means Full-time employment
    2 means Part-time employment
    3 means Student
    4 means Full-time self-employee
    ...

Do Euclidean distances make sense on unordered factors coded as integers?
Re: [R] Cluster analysis, factor variables, large data set
On Thu, Mar 31, 2011 at 08:48:02PM +0200, Hans Ekbrand wrote:
> Does euclidean distances make sense on unordered factors coded as integers?

To be clear, here is an extract:

    my.df.full[900:910, 16:19]
        PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
    900                 2                 2                 1                 2
    901                 1                 1                 1                 1
    902                 1                 1                 1                 1
    903                 2                 2                 2                 2
    904                 1                 1                 1                 1
    905                 2                 2                 2                 2
    906                 7                 8                 2                 7
    907                 5                 5                 5                 5
    908                 1                 1                 1                 1
    909                 1                 1                 1                 1
    910                 1                 1                 1                 1

    class(my.df.full[,16])
    [1] "integer"
Re: [R] Cluster analysis, factor variables, large data set
On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand h...@sociologi.cjb.net wrote:
> The variables are unordered factors, stored as integers 1:9, where 1 means Full-time employment 2 means Part-time employment 3 means Student 4 means Full-time self-employee ... Does euclidean distances make sense on unordered factors coded as integers?

It probably doesn't. You said you have some 36 observations for each case, correct? You can turn these 36 observations into a vector of length 36 * 9 on which Euclidean distance will make some sense, namely k changes will produce a distance of sqrt(2*k). For each observation with value p (p between 1 and 9), create a vector r = c(0,0,1,0,...0) where the entry 1 is in the p-th component. Hence, if values p1 and p2 are the same, the Euclidean distance between r1 and r2 is zero; if they are not the same, the Euclidean distance is sqrt(2). Here's some possible R code:

    transform = function(obsVector, maxVal)
    {
      templateMat = matrix(0, maxVal, maxVal);
      diag(templateMat) = 1;
      return(as.vector(templateMat[, obsVector]));
    }

    set.seed(10)
    n = 4; m = 5; max = 4;
    data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

    data
         [,1] [,2] [,3] [,4]
    [1,]    3    3    1    2
    [2,]    1    3    3    2
    [3,]    3    3    2    4
    [4,]    1    2    4    2
    [5,]    4    1    4    1

    trafoData = apply(data, 2, transform, maxVal = max);

    trafoData
          [,1] [,2] [,3] [,4]
     [1,]    0    0    1    0
     [2,]    0    0    0    1
     [3,]    1    1    0    0
     [4,]    0    0    0    0
     [5,]    1    0    0    0
     [6,]    0    0    0    1
     [7,]    0    1    1    0
     [8,]    0    0    0    0
     [9,]    0    0    0    0
    [10,]    0    0    1    0
    [11,]    1    1    0    0
    [12,]    0    0    0    1
    [13,]    1    0    0    0
    [14,]    0    1    0    1
    [15,]    0    0    0    0
    [16,]    0    0    1    0
    [17,]    0    1    0    1
    [18,]    0    0    0    0
    [19,]    0    0    0    0
    [20,]    1    0    1    0

The code assumes that cases are in columns and observations in rows of data. Examine data and trafoData to see how the transformation works. Once you have the transformed data, simply apply your favorite clustering method that uses Euclidean distance. HTH, Peter
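To connect the indicator coding with clara, here is a rough sketch on simulated data. Everything in it is an assumption made for illustration: the sizes (200 cases, 6 months, 4 states), the names status, one_hot and X, and the choice of k = 3. Unlike the code above, cases are in rows, because clara expects one observation per row.

```r
library(cluster)

# Simulated labour-market states: cases in rows, months in columns
set.seed(42)
n_cases <- 200; n_months <- 6; n_states <- 4
status <- matrix(sample(n_states, n_cases * n_months, replace = TRUE),
                 nrow = n_cases)

# Indicator coding of one case: one block of n_states 0/1 values per month
one_hot <- function(obs, n_states) {
  as.vector(diag(n_states)[, obs])
}
X <- t(apply(status, 1, one_hot, n_states = n_states))
dim(X)   # n_cases rows, n_months * n_states columns

# Two cases that differ in k months are sqrt(2 * k) apart
k12 <- sum(status[1, ] != status[2, ])
all.equal(as.numeric(dist(X[1:2, ])), sqrt(2 * k12))

# clara works on the numeric matrix directly; no full distance matrix is stored
fit <- clara(X, k = 3, samples = 20)
table(fit$clustering)
```

Whether sqrt(2 * k) is a meaningful dissimilarity for the real data is a separate question; it treats every change of status as equally large and ignores when in the 36 months it happens.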
Re: [R] cluster analysis: predefined clusters
Peter Langfelder wrote:
> On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert derik2...@yahoo.de wrote:
>> Dear list, running a hierarchical cluster analysis I want to define a number of objects that build a cluster already. In other words: I want to force some of the cases to be in the same cluster from the start of the algorithm. Any hints? Thanks in advance!
>
> The hclust function has an argument 'members' that should allow you to do that. You will need to specify the dissimilarity matrix accordingly. Peter

Thank you! But specifying the dissimilarity matrix correctly seems to be a major task. Has anyone done so so far?

-- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-predefined-clusters-tp3060433p3067215.html Sent from the R help mailing list archive at Nabble.com.
[R] cluster analysis: predefined clusters
Dear list, running a hierarchical cluster analysis I want to define a number of objects that build a cluster already. In other words: I want to force some of the cases to be in the same cluster from the start of the algorithm. Any hints? Thanks in advance! Derik
Re: [R] cluster analysis: predefined clusters
On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert derik2...@yahoo.de wrote:
> Dear list, running a hierarchical cluster analysis I want to define a number of objects that build a cluster already. In other words: I want to force some of the cases to be in the same cluster from the start of the algorithm. Any hints? Thanks in advance!

The hclust function has an argument 'members' that should allow you to do that. You will need to specify the dissimilarity matrix accordingly. Peter
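One way to set up the dissimilarity matrix for 'members' follows the pattern of the example in ?hclust: replace each predefined group by its centroid, compute squared distances between the centroids, and pass the group sizes as members. The sketch below is only an illustration under assumptions: the data and the grouping vector group are invented, and the centroid method is used because that is the setting the help page example demonstrates.

```r
# 20 made-up cases in two dimensions
set.seed(3)
x <- matrix(rnorm(40), ncol = 2)

# Hypothetical pre-assignment: cases 1-5 form one group, cases 6-8 another,
# and every remaining case starts as its own singleton group
group <- c(rep(1, 5), rep(2, 3), 3:14)

# Group sizes and group centroids (rows ordered by group label)
sizes <- as.vector(table(group))
cent  <- rowsum(x, group) / sizes

# Cluster the groups instead of the single cases; 'members' tells hclust
# how many cases each starting object already represents
hc <- hclust(dist(cent)^2, method = "centroid", members = sizes)

# Map a cut of the tree back to the original cases
case_cluster <- cutree(hc, k = 3)[as.character(group)]
table(case_cluster)
```

Because the pre-grouped cases enter the algorithm as a single object, they cannot be separated at any level of the tree.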
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Hi Ulrich,

I'm studying the principles of Affinity Propagation and I'm really glad to use your package (apcluster) to cluster my data. I have just one issue to solve. If I call the function

    apcluster(sim)

where sim is the matrix of dissimilarities, I sometimes get the warning message:

    Algorithm did not converge. Turn on details and call plot() to monitor net similarity. Consider increasing maxits and convits, and, if oscillations occur also increasing damping factor lam.

together with too high a number of clusters. I thought I could solve the problem by setting the argument p of apcluster() to mean(preferenceRange(sim)):

    apcluster(sim, p=mean(preferenceRange(sim)))

and it actually seems to be a good solution, because I no longer receive any warning message and the number of clusters is lower. Do you think it's a good solution? I should add that I have to use apcluster() in an automatic procedure, so I can't adjust the arguments of the function by hand.

Thanks in advance.
Giuseppe
Re: [R] Cluster analysis
Pablo,

we've had success using http://mephisto.unige.ch/traminer/preview.shtml to look at marketing paths. The question would be: how many distinct case step descriptions are there?

HTH,
Jim

On Jul 26, 2010 9:44 AM, Pablo Cerdeira pablo.cerde...@gmail.com wrote:
> Hi all, [...] I have a large dataset with many steps of judicial cases. A sample is attached. I'd like to do a cluster analysis to understand which is the most usual path followed by these legal cases. After that, I'd like to plot a cluster tree. [...]
Re: [R] Cluster analysis
Hi Allan,

It helps a lot. I'll try to read more about it. As you asked, here is a brief explanation of the relevant columns of the sample data pasted at the end:

- id_processo: identifies a legal case; it is its primary key.
- ordem_andamento: the step number within a legal case (id_processo).
- id_andamento: the primary key of the step.

I'd like to identify the most common sequences (ordem_andamento) of steps (id_andamento) across a lot of legal cases (id_processo). Probably a cluster analysis with a dendrogram plot is what I'm looking for.

Here is a sample of two different legal cases (2 different id_processo).

Best regards and thank you in advance

id_processo,proc_num,ordem_andamento,id_andamento,andamento,data,dias,origem_tribunal,data_entrada,relator,duracao_dias
1480010,1,1,208,DISTRIBUIDO,1988-10-06 00:00:00,5,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,2,69,CONCLUSAO,1988-10-06 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,3,180,DESPACHO ORDINATORIO,1988-10-11 00:00:00,8,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,4,465,PEDIDO DE INFORMACOES,1988-10-19 00:00:00,1,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,5,465,PEDIDO DE INFORMACOES,1988-10-20 00:00:00,15,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,6,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-11-04 00:00:00,24,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,7,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-11-28 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,8,69,CONCLUSAO,1988-11-28 00:00:00,38,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,9,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-01-05 00:00:00,874,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,10,26,AUTOS DEVOLVIDOS,1991-05-29 00:00:00,8,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,11,75,CONCLUSOS AO RELATOR,1991-05-29 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,12,578,VISTA AO ADVOGADO-GERAL DA UNIAO,1991-06-06 00:00:00,232,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,13,507,RECEBIMENTO DOS AUTOS,1992-01-24 00:00:00,10,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,14,75,CONCLUSOS AO RELATOR,1992-02-03 00:00:00,21,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,15,284,JULG. POR DESPACHO - NEGADO SEGUIMENTO,1992-02-24 00:00:00,3,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,16,497,PUBLICADO DESPACHO NO DJ,1992-02-27 00:00:00,12,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,17,163,DECORRIDO O PRAZO,1992-03-10 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,18,34,BAIXA AO ARQUIVO DO STF,1992-03-10 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480183,2,1,208,DISTRIBUIDO,1988-10-12 00:00:00,8,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,2,69,CONCLUSAO,1988-10-12 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,3,352,JULGAMENTO NO PLENO,1988-10-20 00:00:00,22,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,4,476,PETICAO AVULSA,1988-11-11 00:00:00,13,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,5,531,REMESSA DOS AUTOS,1988-11-11 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,6,495,PUBLICADO ACORDAO, DJ:,1988-11-24 00:00:00,11,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,7,163,DECORRIDO O PRAZO,1988-12-05 00:00:00,8,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,8,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-12-13 00:00:00,63,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,9,69,CONCLUSAO,1988-12-13 00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,10,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-02-14 00:00:00,83,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,11,69,CONCLUSAO,1989-05-08 00:00:00,91,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,12,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-08-07 00:00:00,21,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,13,69,CONCLUSAO,1989-08-28 00:00:00,2,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,14,484,PROCESSO A JULGAMENTO -
Re: [R] Cluster analysis
Hi Jim,

Wow! Very nice work at http://mephisto.unige.ch/traminer/preview.shtml. I'm going to read more about it. I have a lot of different steps, in sequence: 586 different possible steps across 4269 legal cases, with a maximum of 379 steps per case. If you want, I can send this dataset to you.

Best regards and thank you very much,

On Tue, Jul 27, 2010 at 10:16 AM, Jim Porzak jpor...@gmail.com wrote:
> Pablo, we've had success using http://mephisto.unige.ch/traminer/preview.shtml to look at marketing paths. The question would be: how many distinct case step descriptions are there? HTH, Jim

--
*Pablo de Camargo Cerdeira* pa...@fgv.br pablo.cerde...@gmail.com +55 (21) 3799-6065
[R] Cluster analysis
Hi all,

I have no idea whether this question is too easy to be answered, but I'm just starting with R. So, here we go. I have a large dataset with many steps of judicial cases. A sample is attached. I'd like to do a cluster analysis to understand which is the most usual path followed by these legal cases. After that, I'd like to plot a cluster tree. In the attached sample, the columns are:

- id_processo is the primary key of a legal case;
- number is the step number in the legal case;
- andamento is the description of the legal case step.

I have no idea how to do it using R. Can someone help me?

Thanks in advance

--
*Pablo de Camargo Cerdeira* pa...@fgv.br pablo.cerde...@gmail.com +55 (21) 3799-6065
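For readers who want to try the idea in this thread before turning to a dedicated sequence-analysis package, here is a rough base R sketch. It assumes long-format data with the column names id_processo, ordem_andamento and id_andamento used in the sample; the toy values, the helper edit_dist() and the choice of average linkage are illustrative assumptions, not something proposed by the posters. It tabulates the most frequent complete paths and then clusters the cases on an edit distance between their step sequences.

```r
# Toy long-format data (assumed values), one row per step of a legal case.
steps <- data.frame(
  id_processo     = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  ordem_andamento = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  id_andamento    = c(208, 69, 180, 34, 208, 69, 352, 208, 69, 180, 34)
)

# One ordered vector of step ids per case.
steps <- steps[order(steps$id_processo, steps$ordem_andamento), ]
paths <- split(steps$id_andamento, steps$id_processo)

# Most frequent complete paths.
path_labels <- vapply(paths, paste, character(1), collapse = ">")
path_freq   <- sort(table(path_labels), decreasing = TRUE)

# Hypothetical helper: Levenshtein distance between two vectors of step ids
# (insertions, deletions and substitutions each cost 1).
edit_dist <- function(a, b) {
  prev <- 0:length(b)
  for (i in seq_along(a)) {
    cur <- c(i, rep(0, length(b)))
    for (j in seq_along(b)) {
      cur[j + 1] <- min(prev[j + 1] + 1,              # delete a[i]
                        cur[j] + 1,                   # insert b[j]
                        prev[j] + (a[i] != b[j]))     # substitute or match
    }
    prev <- cur
  }
  prev[length(b) + 1]
}

# Pairwise distances between cases, then a cluster tree.
n <- length(paths)
d <- matrix(0, n, n, dimnames = list(names(paths), names(paths)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- edit_dist(paths[[i]], paths[[j]])
  }
}
hc <- hclust(as.dist(d), method = "average")
# plot(hc)   # draws the dendrogram
```

With several thousand cases of up to a few hundred steps each, this pure R double loop would likely be slow, which is one reason a specialised package such as the one Jim mentions may be the more practical route.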
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
abanero wrote:
> Do you know something like “knn1” that works with categorical variables too? Do you have any suggestion?

There are surely plenty of clustering algorithms around that do not require a vector space structure on the inputs (as KNN does). I think agglomerative clustering would solve the problem, as would kernel-based clustering (assuming that you have a positive semi-definite measure of the similarity of two samples). Probably the simplest way is Affinity Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation; see the CRAN package apcluster, which I have co-developed). All you need is a way of measuring the similarity of samples, which is straightforward both for numerical and categorical variables, as well as for mixtures of both (the choice of the similarity measures and how to aggregate the different variables is left to you, of course). Your final classification task can be accomplished simply by assigning the new sample to the cluster whose exemplar is most similar.

Joris Meys wrote:
> Not a direct answer, but from your description it looks like you are better off with supervised classification algorithms instead of unsupervised clustering.

If you say that this is a purely supervised task that can be solved without clustering, I disagree. abanero does not mention any class labels. So it seems to me that it is indeed necessary to do unsupervised clustering first. However, I agree that the second task of assigning new samples to clusters/classes/whatever can also be solved by almost any supervised technique if samples are labeled according to their cluster membership first.

Cheers,
Ulrich
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Dear abanero,

In principle, k nearest neighbours classification can be computed on any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume Euclidean vectors as input, which restricts their use. I'd probably compute an appropriate dissimilarity between points (have a look at Gower's distance in daisy, package cluster), and then implement nearest neighbours classification myself if I needed it. It should be pretty straightforward to implement. If you want unsupervised classification (clustering) instead, you have the choice between all kinds of dissimilarity-based algorithms (hclust, pam, agnes etc.).

Christian

On Thu, 27 May 2010, Ulrich Bodenhofer wrote:
> [...] All you need is a way of measuring the similarity of samples, which is straightforward both for numerical and categorical variables, as well as for mixtures of both (the choice of the similarity measures and how to aggregate the different variables is left to you, of course). Your final classification task can be accomplished simply by assigning the new sample to the cluster whose exemplar is most similar. [...]

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
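A small sketch of what such a hand-written nearest neighbours step on a Gower dissimilarity could look like, using daisy() from the cluster package. The toy data frame, the column names size, type and M, and the choice k = 3 are assumptions for illustration; the prediction for the new observation is simply the mean of M over its k nearest training cases, in the spirit of the original question.

```r
library(cluster)

# Toy training data (assumed): one numeric and one categorical attribute,
# plus the measure M that is known only for the training cases.
train <- data.frame(
  size = c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8),
  type = factor(c("a", "a", "a", "b", "b", "b")),
  M    = c(10, 12, 11, 50, 52, 49)
)

# New observation with the same attributes but no measure.
new_obs <- data.frame(size = 5.1,
                      type = factor("b", levels = levels(train$type)))

# Gower dissimilarities are computed on the training cases and the new case
# together, so that all rows are scaled with the same ranges.
vars <- c("size", "type")
both <- rbind(train[, vars], new_obs[, vars])
d    <- as.matrix(daisy(both, metric = "gower"))

n     <- nrow(train)
d_new <- d[n + 1, 1:n]            # dissimilarities from the new case to the training cases

k      <- 3                       # assumed number of neighbours
nn     <- order(d_new)[1:k]       # indices of the k nearest training cases
pred_M <- mean(train$M[nn])       # predicted measure for the new case
```

One caveat with this shortcut: Gower's coefficient rescales numeric variables by their observed range, so a new observation lying outside the range of the training data would slightly change the scaling of all dissimilarities.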
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Hi, thank you Joris and Ulrich for your answers.

Joris Meys wrote:
> see the library randomForest for example

I'm trying to find an example in randomForest with categorical variables, but I haven't found anything. Do you know of any example with both categorical and numerical variables? Anyway, I don't have any class labels yet. How could I find clusters with randomForest?

Ulrich wrote:
> Probably the simplest way is Affinity Propagation[...] All you need is a way of measuring the similarity of samples which is straightforward both for numerical and categorical variables.

I had a look at the documentation of the package apcluster. That's interesting, but do you have any example using it with both categorical and numerical variables? I'd like to test it with a large dataset.

Thanks a lot!
Cheers
Giuseppe
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Hi Abanero,

first, I have to correct myself. Knn1 is a supervised learning algorithm, so my comment wasn't completely correct. In any case, if you want to do a clustering prior to a supervised classification, the function daisy() can handle any kind of variable. The resulting distance matrix can be used with a number of different methods. And you're right, randomForest doesn't handle categorical variables either. So I haven't been of great help here...

Cheers
Joris

On Thu, May 27, 2010 at 1:25 PM, abanero gdevi...@xtel.it wrote:
> [...] I'm trying to find an example in randomForest with categorical variables, but I haven't found anything. Do you know of any example with both categorical and numerical variables? Anyway, I don't have any class labels yet. How could I find clusters with randomForest? [...]

--
Joris Meys
Statistical Consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
Coupure Links 653, B-9000 Gent
tel : +32 9 264 59 87
joris.m...@ugent.be
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
I'm confusing myself :-)

randomForest cannot handle character vectors as predictors. (Which is why I found out, to my surprise, that a categorical variable could not be used in the function.) It can handle categorical variables as predictors IF they are supplied as factors. Obviously, it also handles a categorical variable as the response. I hope I'm not going to add more mistakes; it's been enough for the day...

Cheers
Joris

On Thu, May 27, 2010 at 2:08 PM, steve_fried...@nps.gov wrote:
> Joris,
> I've been following this thread for a few days as I am beginning to use randomForest in my work. I am confused by your last email. What do you mean that randomForest does not handle categorical variables? It can be used in either regression or classification analysis. Do you mean that categorical predictors are not suitable? Certainly they are as the response. Would you be so kind as to clarify what you were suggesting?
>
> Thanks,
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor), Homestead, Florida 33034
> steve_fried...@nps.gov
> Office (305) 224 - 4282, Fax (305) 224 - 4147

--
Joris Meys
Statistical Consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
Coupure Links 653, B-9000 Gent
tel : +32 9 264 59 87
joris.m...@ugent.be
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
> I had a look at the documentation of the package apcluster. That's interesting but do you have any example using it with both categorical and numerical variables? I'd like to test it with a large dataset.

Your posting has opened my eyes: problems where both numerical and categorical features occur are probably among the most attractive applications of affinity propagation. So I am considering including such an example in a future release. Here is a very crude example (download imports-85.data from http://archive.ics.uci.edu/ml/machine-learning-databases/autos/ first):

    library(cluster)
    library(apcluster)
    # stringsAsFactors=TRUE makes text columns factors, which daisy() treats
    # as nominal (this used to be the read.table() default before R 4.0.0)
    automobiles <- read.table("imports-85.data", header=FALSE, sep=",",
                              na.strings="?", stringsAsFactors=TRUE)
    # negated dissimilarities serve as similarities
    sim <- -as.matrix(daisy(automobiles))
    apcluster(sim)

The most essential part here is to use daisy() from the package cluster for computing distances/similarities. Have a look at the help page of daisy() to get a better impression of how it works and how to tailor the distance/similarity calculations to your needs. I do not know whether this is a good data set for clustering. Affinity propagation produces quite a number of clusters. Maybe fiddling with the input preferences is necessary (see Section 4 of the vignette of package apcluster).

Best regards,
Ulrich
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Sorry, Joris, I overlooked that you already mentioned daisy() in your posting. I should have credited your recommendation in my previous message.

Cheers,
Ulrich
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Ulrich wrote:
> Affinity propagation produces quite a number of clusters.

I tried with q=0 and it produces 17 clusters. Anyway, that's a good idea, thanks. I'm looking forward to testing it with my dataset. So I'll probably use daisy() to compute an appropriate dissimilarity, then apcluster() or another method to determine clusters. What do you suggest for assigning a new observation to one of the resulting clusters? It seems that randomForest doesn't work with both numerical and categorical predictors (thanks to Joris).

Christian wrote:
> and then implement nearest neighbours classification myself if I needed it. It should be pretty straightforward to implement.

Do you intend to modify the code of the knn1() function yourself?

Thanks to everyone!
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
> Christian wrote:
>> and then implement nearest neighbours classification myself if I needed it. It should be pretty straightforward to implement.
>
> Do you intend to modify the code of the knn1() function yourself?

No; if you understand what the nearest neighbours method does, it's not very complicated to implement it from scratch (assuming that your dataset is small enough that you don't have to worry too much about optimising computing times). A bit of programming experience is required, though. (It's not that I intend to do it right now; I suggest that you do it if you can...)

Christian

> Thanks to everyone!

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
> What do you suggest in order to assign a new observation to a determined cluster?

As I mentioned already, I would simply assign the new observation to the cluster whose exemplar it is most similar to (in a knn1-like fashion). To compute these similarities, you can use the daisy() function. However, you have to use some tricks, since daisy() is designed for computing square matrices of all mutual distances for a given data set. I did not find another function that is better suited (e.g. a function that simply computes the distance between two distinct samples). Maybe others have an idea. In any case, you have to make sure either that the data remain unscaled, or that your new observation is scaled with exactly the same parameters that were used for the clustering before.

Cheers,
Ulrich
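The following is one possible way to carry out the "trick" described above, sketched under stated assumptions. To keep the example dependent only on the cluster package, pam() medoids stand in for affinity propagation exemplars; the toy data, the column names x and g, and k = 2 are likewise assumptions. The new observation is appended to the original data, daisy() is run on the stacked data, and only the row of dissimilarities between the new observation and the exemplars is used.

```r
library(cluster)

set.seed(42)
# Toy mixed-type data (assumed): one numeric and one categorical variable.
dat <- data.frame(
  x = c(rnorm(10, mean = 0), rnorm(10, mean = 6)),
  g = factor(rep(c("u", "v"), each = 10))
)

# Cluster on Gower dissimilarities; medoids play the role of exemplars here.
d_train <- daisy(dat, metric = "gower")
fit     <- pam(d_train, k = 2)
ex      <- fit$id.med                 # row indices of the exemplars (medoids)

# New observation with the same columns and factor levels.
new_obs <- data.frame(x = 5.5, g = factor("v", levels = levels(dat$g)))

# Stack the new observation under the full data so that daisy() scales it
# together with the data used for clustering, then keep only the needed row.
full    <- as.matrix(daisy(rbind(dat, new_obs), metric = "gower"))
d_to_ex <- full[nrow(dat) + 1, ex]    # dissimilarity of the new case to each exemplar

cl_new <- unname(which.min(d_to_ex))  # index of the closest exemplar = cluster number
```

Stacking the full data rather than only the exemplars keeps the range-based scaling of Gower's coefficient close to what was used for the clustering; it still shifts slightly if the new observation lies outside the observed range of a numeric variable, which is the scaling issue raised in the message above.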
[R] cluster analysis and supervised classification: an alternative to knn1?
Hi,

I have 1,000 observations with 10 attributes (of different types: numeric, dichotomous, categorical, etc.) and a measure M. I need to cluster these observations in order to assign a new observation (with the same 10 attributes but not the measure) to a cluster. For the new observation, I want to calculate a measure as the average of the measures M of the observations in the assigned cluster. I would use cluster analysis (the “Clara” algorithm?) and then “knn1” (in package class) to assign the new observation to a cluster. The problem is that I'm not able to use “knn1” because some of the attributes are categorical. Do you know something like “knn1” that works with categorical variables too? Do you have any suggestion?
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Not a direct answer, but from your description it looks like you are better off with supervised classification algorithms than with unsupervised clustering. See the package randomForest, for example. Alternatively, you can try a logistic regression or a multinomial regression approach, but these are parametric methods and put requirements on the data. randomForest is completely non-parametric.
Cheers
Joris
On Wed, May 26, 2010 at 3:45 PM, abanero gdevi...@xtel.it wrote:
> [...]
-- Joris Meys Statistical Consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control Coupure Links 653 B-9000 Gent tel : +32 9 264 59 87 joris.m...@ugent.be
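To make the randomForest suggestion concrete, here is a hedged sketch. It assumes the contributed package randomForest is installed, and the simulated data frame `train`, its column names, and the way M is generated are all invented for illustration. Since M is numeric, the forest is fitted in regression mode and predicts M directly instead of going through clusters.

```r
library(randomForest)  # contributed package; assumed to be installed

set.seed(42)
n <- 200
# Hypothetical data: mixed attribute types plus a numeric measure M
train <- data.frame(
  size  = rnorm(n),
  flag  = factor(sample(c("yes", "no"), n, replace = TRUE)),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
train$M <- 2 * train$size + ifelse(train$flag == "yes", 1, 0) + rnorm(n, sd = 0.2)

# A numeric response makes randomForest() fit a regression forest
fit <- randomForest(M ~ ., data = train, ntree = 200)

# A new observation must use the same factor levels as the training data
new_obs <- data.frame(
  size  = 0.5,
  flag  = factor("yes", levels = levels(train$flag)),
  group = factor("b",   levels = levels(train$group))
)
pred <- predict(fit, newdata = new_obs)
```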
[R] Cluster analysis: dissimilar results between R and SPSS
Hello everyone! My data is composed of 277 individuals measured on 8 binary variables (1=yes, 2=no). I did two similar cluster analyses, one in SPSS 18.0 and one in R 2.9.2. The objective is to obtain the means of each variable per retained cluster.
1) The R analysis ran as follows:
    # call data
    dist=dist(data,method="euclidean")
    cluster=hclust(dist,method="ward")
    cluster
    Call: hclust(d = dist, method = "ward")
    Cluster method : ward
    Distance : euclidean
    Number of objects: 277
    plot(cluster)
    rect.hclust(cluster, k=4, border="red")
    x=rect.hclust(cluster, k=4, border="red")
    sapply(x, function(i) colMeans(data[i,]))
    round(sapply(x, function(i) colMeans(data[i,])),2)
2) The SPSS analysis ran as follows: Analysis -- Classify -- Hierarchical cluster analysis -- Cluster method = Ward's method and Distance measure = Interval: Squared Euclidean distance. After that, I computed the means of each variable for each cluster.
The problem is that I get different results from the two analyses (different clusters and means). However, when I use the Euclidean distance (unsquared) in SPSS, I get the same results! I thought that "euclidean" in R meant the usual squared distance between the two vectors (2 norm), as specified in the documentation, not the unsquared distance. Does it not? Thanks for any comments!
Jeffrey
Re: [R] Cluster analysis: dissimilar results between R and SPSS
Hi Jeoffrey, How stable are the results in general? If you repeat the analysis in R several times, does it yield the same results?
Tal
Contact me: tal.gal...@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
On Mon, Apr 26, 2010 at 3:37 PM, Jeoffrey Gaspard jeoffrey.gasp...@gmail.com wrote:
> [...]
Re: [R] Cluster analysis: dissimilar results between R and SPSS
I'm not sure why you'd expect Euclidean distance and squared Euclidean distance to give the same results. Euclidean distance is the square root of the sum of squared differences over the variables, and that's exactly what dist() returns. http://en.wikipedia.org/wiki/Euclidean_distance On a map, it's the length of the hypotenuse, and you can measure it with a ruler and get the same number. Euclidean distance has a specific geometric meaning. Squared Euclidean distance is not the same thing, and it is not the standard definition, though it is the one you seem to be expecting. If that's what you want, then square the output of dist() before you perform the clustering.
Sarah
On Mon, Apr 26, 2010 at 8:37 AM, Jeoffrey Gaspard jeoffrey.gasp...@gmail.com wrote:
> [...]
-- Sarah Goslee http://www.functionaldiversity.org
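The advice to square the output of dist() can be sketched like this. The matrix `x` is made-up data, and the method names are those of current R: as far as I understand ?hclust, the former "ward" method is called "ward.D" in R versions after 3.0.3, while "ward.D2" squares the dissimilarities internally. That reading of the help page is the assumption the comparison rests on.

```r
set.seed(7)
# Hypothetical data: 40 observations on 3 numeric variables
x <- matrix(rnorm(40 * 3), ncol = 3)

d <- dist(x, method = "euclidean")  # unsquared Euclidean distances

# dist() agrees with the textbook formula for the first pair of rows
manual <- sqrt(sum((x[1, ] - x[2, ])^2))

# Ward clustering fed with squared distances, as suggested above
hc_sq <- hclust(d^2, method = "ward.D")
# Ward clustering that squares the unsquared distances internally
hc_d2 <- hclust(d, method = "ward.D2")

# Cut both trees into four groups for comparison
groups_sq <- cutree(hc_sq, k = 4)
groups_d2 <- cutree(hc_d2, k = 4)
```

Comparing `groups_sq` with a run of `hclust(d, method = "ward.D")` on your own data shows how much the squaring matters for the resulting clusters.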
[R] cluster analysis :: urgent
Hi, how can I do cluster analysis on spatial data (longitude, latitude)? I used the function clust of the clustTool package and it didn't work at all:
    cl <- clust(dv, 3, method="hclustAverage", distMethod="euclidean")
Thanks a lot,
Karine HEERAH Master 2, oceanography and marine environments, Université Pierre et Marie Curie (Paris 6)
Re: [R] cluster analysis labels for dendrogram
Hi Samantha, Did you check out the help for plclust? There's a labels argument that is used to label the leaves of your dendrogram. By default, the row names of your data frame are used.
Sarah
On Wed, Mar 10, 2010 at 9:01 PM, Samantha samantha.fra...@gmail.com wrote:
> [...]
-- Sarah Goslee http://www.functionaldiversity.org
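A small sketch of the labels approach, using plot() on the hclust object, which takes the same kind of labels argument. The data frame `dat` and its column names are invented for illustration.

```r
set.seed(3)
# Hypothetical data: three numeric variables plus a categorical site column
dat <- data.frame(
  v1 = rnorm(12), v2 = rnorm(12), v3 = rnorm(12),
  site = rep(c("north", "south", "east"), times = 4)
)

# Cluster on the numeric columns only; site plays no part in the distances
hc <- hclust(dist(dat[, c("v1", "v2", "v3")]))

# Label the leaves by site; the labels are given in the row order of the data
leaf_labels <- as.character(dat$site)
plot(hc, labels = leaf_labels, main = "Leaves labelled by site")
```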
Re: [R] cluster analysis labels for dendrogram
Hi Samantha, You can check out the graph and source code on this page: http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=79
best, Xian
[R] cluster analysis labels for dendrogram
Hi, I am clustering data based on three numeric variables. I have a fourth variable that is categorical (site) which I would like to use to label the leaves of my dendrogram, so I can see how the different sites are grouped throughout the tree, but I do NOT want to use this variable in the cluster analysis itself. Is there any way I can do this?
Thanks, Samantha
[R] cluster analysis
Hi Folks, I want to apply cluster analysis to a categorical data set. Could you recommend some R packages and give me some suggestions? Thanks!
Dong
Re: [R] cluster analysis
Without knowing what your data set really looks like, I'd look to decision trees, specifically package rpart with method = "class". Your problem may not be appropriate in that environment, but it is hard to say with such a limited explanation of the issues. Good luck,
Steve Friedman Ph. D. Spatial Statistical Analyst Everglades and Dry Tortugas National Park 950 N Krome Ave (3rd Floor) Homestead, Florida 33034 steve_fried...@nps.gov Office (305) 224 - 4282 Fax (305) 224 - 4147
Dong He dongh...@gmail.com, sent by r-help-boun...@r-project.org to r-help@r-project.org on 02/18/2010 04:54 PM, subject "[R] cluster analysis":
> [...]
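For what it is worth, a classification tree on purely categorical predictors could look like the sketch below. Keep in mind that rpart is supervised, so it needs a known class column. The data frame `dat`, its columns, and the rule that generates `class` are all hypothetical.

```r
library(rpart)  # recommended package shipped with R

set.seed(11)
n <- 120
# Hypothetical categorical data set
dat <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), n, replace = TRUE)),
  shape  = factor(sample(c("round", "square"), n, replace = TRUE))
)
# Hypothetical known class; a tree cannot be grown without one
dat$class <- factor(ifelse(dat$colour == "red" | dat$shape == "round", "A", "B"))

# method = "class" requests a classification tree
fit <- rpart(class ~ colour + shape, data = dat, method = "class")

# Predicted class for each row
pred <- predict(fit, newdata = dat, type = "class")
```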
Re: [R] Cluster analysis: hclust manipulation possible?
On 16.11.2009 19:13, Charles C. Berry wrote:
>> The question: Can this be accomplished in the *dendrogram plot* by manipulating the resulting hclust data structure or by some other means, and if yes, how?
> Yes, you need to study ?hclust, particularly the part about 'Value', from which you will see what needs modification. Here is a very simple example: [...]
Dear Chuck, Many thanks for spending your valuable time on the suggestions and the example. However, the drawback is that, as a humanist, I have had considerable difficulty figuring out what exactly to do. After hours of experimenting I could modify another dendrogram (without crashing R), but I still fail to get the result I want: the added leaf is not attached where I intend it to be; instead, other adjacent leaves have their heights turned to 0. The question, to put it more clearly perhaps: Is there any straightforward procedure to add just a single leaf to any dendrogram, next to an existing leaf at height 0, and if there is, what might that be? As of now, it seems that $merge has to be modified correctly, but what is the exact strategy, if there is one (other than redoing the whole clustering by hand)?
> Alternatively, you could use as.dendrogram( res ) as the point of departure and manipulate the value.
Possibly, yes, but I am even less well equipped to edit that sort of data type.
Sincerely, Jopi Harri Musicologist University of Turku Finland
Re: [R] Cluster analysis: hclust manipulation possible?
Original Message
Subject: Re: [R] Cluster analysis: hclust manipulation possible?
Date: Mon, 16 Nov 2009 19:22:54 -0800
From: Charles C. Berry cbe...@tajo.ucsd.edu
To: Jopi Harri jopi.ha...@utu.fi
On Mon, 16 Nov 2009, Jopi Harri wrote:
> [...] As of now, it seems that the $merge has to be modified correctly, but what is the exact strategy, if there is one (other than redoing the whole clustering by hand)?
First, read the ?hclust page and see what it says about merge. Then look at a really simple example like
    cl <- hclust( dist( c(1,2,4) ) )
    plot(cl)
    unclass( cl )
The unclass() strips the class attribute and allows print() to give you a bit more detail. Now make the figure a bit more complicated:
    cl2 <- hclust(dist(as.matrix(c(1,2,4,4.5))))
    plot(cl2)
    unclass(cl2)
and see what has changed in $merge, $height, and $order. Once you get the hang of it, you'll be in a position to modify an existing hclust object.
Chuck
p.s. It is best to post replies like yours to the whole list; others may want to know the same thing that you want to know, or others may give a better reply than I have.
Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cbe...@tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
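Spelled out, the inspection suggested above might look like the following sketch. It looks at the three components directly instead of printing the whole unclassed object; nothing here goes beyond base R.

```r
# Three points on a line; complete linkage is the default method of hclust()
cl <- hclust(dist(c(1, 2, 4)))

cl$merge   # one row per merge; negative = single observation, positive = earlier row
cl$height  # height at which each merge happens
cl$order   # left-to-right order of the leaves in the plot

# A fourth point close to 4 changes all three components
cl2 <- hclust(dist(c(1, 2, 4, 4.5)))
cl2$merge
cl2$height
cl2$order
```

In the first tree, observations 1 and 2 merge at height 1, and observation 3 joins that pair at height 3, which is the largest distance between 3 and the pair under complete linkage.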
Re: [R] Cluster analysis: hclust manipulation possible?
On 17.11.2009 5:22, Charles C. Berry wrote:
> Once you get the hang of it, you'll be in a position to modify an existing hclust object.
I believe that I managed to solve the problem. (The code may not be too refined, and my R is perhaps a bit dialectal. The function may fail, especially if the addition of multiple identical labels is attempted.) So, for the addition of a single duplicate label, one needs to increment the positive values in $merge by one and keep the negative values, except for that of the original of the duplicate, which is given the value +1. Then the duplicate pair [the value for the new label being -(abs(min($merge))+1)] is added on top of $merge. The other manipulations involved are the addition of a height 0, the addition of the label for the duplicate, and placing it properly in $order. Once more, thanks for the assistance.
Jopi Harri
    dup.hclust=function(Hc,Label,DupLabel)
    # We add to hclust Hc the duplicate DupLabel of Label.
    # May fail in certain conditions, but shouldn't in normal use.
    {
      if (is.null(Hc$labels)) return("Labels are required!");
      Mer=Hc$merge; Hght=Hc$height; Ord=Hc$order; Labs=Hc$labels;
      DupLNo=abs(min(Mer))+1;                  # index of the new leaf
      LNo=which(Labs==Label);                  # index of the original leaf
      LPlace=which(Labs[Ord]==Label);          # its position in the plotting order
      Hght=c(0,Hght);
      Labs=c(Labs,DupLabel);
      Ord=append(Ord,DupLNo,after=LPlace[1]);  # put the duplicate right after the original
      NewMer=matrix(ifelse(Mer<0,Mer,Mer+1),nrow(Mer));
      NewMer[NewMer==-LNo]=1;
      NewMer=as.matrix(rbind(-cbind(LNo,DupLNo),NewMer));
      NewMer=cbind(NewMer[,1],NewMer[,2]);     # drop the dimnames
      Hc$merge=NewMer; Hc$height=Hght; Hc$order=Ord; Hc$labels=Labs;
      return(Hc);
    }
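The recipe described in this message can be traced by hand on a three-point toy example. This is only a sketch of the steps as I read them; the labels "a", "b", "c" and "c_dup" are made up, and the final cutree() call is just my way of checking that the duplicate ends up attached at height 0.

```r
# Toy tree with labelled leaves; names of the vector become the leaf labels
cl <- hclust(dist(c(a = 1, b = 2, c = 4)))

n    <- length(cl$labels)          # number of original leaves
orig <- which(cl$labels == "c")    # leaf that gets a duplicate

# Shift references to earlier merges by one, keep single observations as they are
m <- ifelse(cl$merge < 0, cl$merge, cl$merge + 1L)
# The original leaf is now reached through the new first merge
m[m == -orig] <- 1L

cl2 <- cl
cl2$merge  <- rbind(c(-orig, -(n + 1L)), m)   # duplicate pair goes on top
cl2$height <- c(0, cl$height)                 # and merges at height 0
cl2$labels <- c(cl$labels, "c_dup")
cl2$order  <- append(cl$order, n + 1L, after = which(cl$order == orig))

plot(cl2)

# Cutting just above 0 should put "c" and "c_dup" in the same group
ct <- cutree(cl2, h = 0.5)
```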
[R] Cluster analysis: hclust manipulation possible?
I am doing cluster analysis [hclust(Dist, method="average")] on data that potentially contains redundant objects. As expected, the inclusion of redundant objects affects the clustering result, i.e., the data a1 = a2 = a3, b, c, d, e1 = e2 is likely to cluster differently from the same data without the redundancy, i.e., a1, b, c, d, e1. This is apparent when the outcome is visualized as a dendrogram. Now, it seems that the clustering result for which the redundancy has been eliminated is more robust for the present assignment than that of the redundant data. Naturally, there is no problem in the elimination: just exclude the redundant objects from Dist. However, it would be very convenient to be able to include the redundant objects in the *dendrogram* by attaching them as 0-level branches to the subtrees, i.e.:
[ASCII dendrogram, height axis 0.0 to 1.0: leaves a1, a2, a3, b, c, d, e1, e2, with a1, a2, a3 joined at height 0.0 and e1, e2 joined at height 0.0]
instead of
[ASCII dendrogram, height axis 0.0 to 1.0: leaves a1, b, c, d, e1]
The question: Can this be accomplished in the *dendrogram plot* by manipulating the resulting hclust data structure or by some other means, and if yes, how?
Jopi Harri
Re: [R] Cluster analysis: hclust manipulation possible?
On Mon, 16 Nov 2009, Jopi Harri wrote:
> [...] The question: Can this be accomplished in the *dendrogram plot* by manipulating the resulting hclust data structure or by some other means, and if yes, how?
Yes, you need to study ?hclust, particularly the part about 'Value', from which you will see what needs modification. Here is a very simple example:
    res <- hclust(dist(1-diag(3)*rnorm(3)))
    plot(res)
    res2 <- res
    res2$merge <- rbind(-cbind(1:3,4:6), matrix(ifelse( res2$merge<0, -res2$merge, res2$merge+sum(res2$merge<0)),2))
    res2$height <- c(rep(0,3), res2$height)
    res2$order <- as.vector( rbind(res2$order,(4:6)[res2$order]) )
    plot(res2)
    str( res )
    str( res2 )
Alternatively, you could use as.dendrogram( res ) as the point of departure and manipulate the value.
HTH, Chuck
Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cbe...@tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
[R] Cluster analysis with missing data
Hi folks, I tried hclust for the first time. Unfortunately, with missing data in my data file, it doesn't seem to work. I found no information about how to deal with missing data. Omitting all cases with missing values is not really an option, as I would lose too many cases.
Thanks in advance, Holger
Re: [R] Cluster analysis with missing data
vegdist() in the vegan package optionally allows pairwise deletion of missing values when computing dissimilarities. The result can be used as the first argument to hclust(). ('Caveat emptor', of course.)
From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf Of Hollix [holger.steinm...@web.de] Sent: 14 July 2009 16:42 To: r-help@r-project.org Subject: [R] Cluster analysis with missing data
> [...]
Re: [R] Cluster analysis with missing data
On Mon, 2009-07-13 at 23:42 -0700, Hollix wrote:
> [...]
Holger, hclust takes a dissimilarity matrix as input, not your data, so the problem lies in finding an appropriate dissimilarity/distance coefficient that handles missing data. One such measure is Gower's coefficient, which is implemented in the function 'daisy' in the recommended package 'cluster'. Try:
    require(cluster)
    ?daisy
to read about it. Also, 'vegdist' in package 'vegan' can ignore pairwise missingness. See ?vegdist after loading 'vegan', and in particular the 'na.rm' argument. Whether either of these (i.e. the resulting dissimilarities) makes sense for your particular problem is another matter...
HTH
G
-- Dr. Gavin Simpson, ECRC, UCL Geography, Pearson Building, Gower Street, London, UK. WC1E 6BT. [t] +44 (0)20 7679 0522 [f] +44 (0)20 7679 0565 [e] gavin.simpsonATNOSPAMucl.ac.uk [w] http://www.ucl.ac.uk/~ucfagls/ [w] http://www.freshwaters.org.uk
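A minimal sketch of the daisy() route with missing values follows. The data frame `dat` is invented, and the choice of average linkage and of two groups is arbitrary. One thing to watch: if two rows share no non-missing variable at all, their dissimilarity is itself NA, and hclust() will refuse the input.

```r
library(cluster)  # recommended package shipped with R

# Hypothetical data with missing values in every column
dat <- data.frame(
  height = c(1.2, NA, 3.1, 2.8, 1.0, 3.3),
  weight = c(10, 12, NA, 30, 11, 29),
  type   = factor(c("a", "a", "b", "b", NA, "b"))
)

# Gower dissimilarities; each pair of rows uses only the variables both have
d <- daisy(dat, metric = "gower")

# The dissimilarities can go straight into hclust()
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
```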
[R] Cluster analysis, defining center seeds or number of clusters
I use kmeans to classify spectral events in high and low 1/3 octave bands:
    #Do cluster analysis
    CyclA <- data.frame(LlowA, LhghA)
    CntrA <- matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE)
    ClstA <- kmeans(CyclA, centers=CntrA, nstart=50, algorithm="MacQueen")
This works well when the actual data show 1, 2 or 3 groups that are not too close in a cross plot. The MacQueen algorithm will give one or more empty groups, which is what I want. However, there are cases where the groups are closer together, less compact or diffuse, which leads to the situation where visually only 2 groups are apparent but the algorithm returns 3, splitting one group in two. I looked at the package 'cluster', specifically at clara (cannot use pam as I have 1 observations). But clara always returns as many groups as you ask for. Is there a way to help find a seed for the initial cluster centers? Equivalently, is there a way to find a priori the number of groups? I know this is not an easy problem. I have looked at principal components (princomp, prcomp) because there is a connection with cluster analysis. It is not obvious to me how to program that connection, though. http://en.wikipedia.org/wiki/Principal_Component_Analysis http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
Thanks in advance, Alex van der Spek
Re: [R] Cluster analysis, defining center seeds or number of clusters
Dear Alex, actually fixing the number of clusters in kmeans end then ending up with a smaller number because of empty clusters is not a standard method of estimating the number of clusters. I may happen (as apparently in some of your examples), but it is generally rather unusual. In most cases, kmeans, as well as clara, pam and other clustering methods, only give you the number of clusters you ask for. Even with some reasonable separation between clusters kmeans cannot generally be expected to come up with empty clusters if the number is initially chosen too high or too many initially centers are specified. The help page for pam.object in library cluster shows you a method to estimate the optimal number of clusters based on pam. However, this problem strongly depends on what cluster concept you have in mind and what you want to use your clusters for. There are alternative indexes that could be optimised to find the best number of clusters. Some of them are implemented in the function cluster.stats in package fpc. I strongly advise reading some literature about this to understand the problem better; the help page of cluster.stats gives a few references. The BIC gives you an estimate of the number of cluster together with Gaussian mixtures, see package mclust. If you can specify things like maximum within-cluster distances, you may get something from using cutree together with a hierarchical clustering method in hclust, for example complete linkage. dbscan and fixmahal in package fpc are further alternatives, requiring one or two tuning constants to come up with an automatical number of clusters. 
Best regards, Christian On Thu, 11 Jun 2009, am...@xs4all.nl wrote: I use kmeans to classify spectral events in high and low 1/3 octave bands: #Do cluster analysis CyclA-data.frame(LlowA,LhghA) CntrA-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE) ClstA-kmeans(CyclA,centers=CntrA,nstart=50,algorithm=MacQueen) This works well when the actual data shows 1,2 or 3 groups that are not too close in a cross plot. The MacQueen algorithm will give one or more empty groups which is what I want. However, there are cases when the groups are closer together, less compact or diffuse which leads to the situation where visually only 2 groups are apparent but the algorithm returns 3 splitting one group in two. I looked at the package 'cluster' specifically at clara (cannot use pam as I have 1 observations). But clara always returns as many groups as you aks for. Is there a way to help find a seed for the intial cluster centers? Equivalently, is there a way to find a priori the number of groups? I know this is not an easy problem. I have looked at principal components (princomp, prcomp) because there is a connection with cluster analysis. It is not obvious to me how to program that connection though. http://en.wikipedia.org/wiki/Principal_Component_Analysis http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf Thanks in advance, Alex van der Spek __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
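Two of the suggestions in the reply above can be sketched in a few lines. This is only an illustration on simulated data (the three blobs, their spread, and the cut height of 3 are assumptions made up for the example, not values from the thread). It follows the average-silhouette-width idea shown on the pam.object help page, and then cuts a complete-linkage tree at a maximum within-cluster distance.

```r
library(cluster)  # pam(); a recommended package shipped with standard R installs

# Simulated stand-in data: three well-separated 2-D groups (purely illustrative)
set.seed(1)
centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(30, centers[i, 1], 0.3), rnorm(30, centers[i, 2], 0.3))))

# Average silhouette width for k = 2..6; the k with the largest value is the candidate
asw <- numeric(6)
for (k in 2:6) asw[k] <- pam(x, k)$silinfo$avg.width
k_best <- which.max(asw)

# Alternative: complete linkage, cut where within-cluster distances would exceed 3
hc <- hclust(dist(x), method = "complete")
groups <- cutree(hc, h = 3)
table(groups)
```

Which criterion is appropriate still depends on the cluster concept in mind, as the reply stresses; the silhouette width is only one of several indexes.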
[R] cluster analysis: mean values for each variable and cluster
Hi all!

I'm new to R and don't know much about it. Because it is free, I managed to learn it a little bit. Here is my problem: I did a cluster analysis on 30 observations and 16 variables (monde, figaro, liberation, etc.). Here is the .txt data file:

monde,figaro,liberation,yespeople,nopeople,bxl,europe,ue,union_eur,other,yesmeto,nometo,yesfonc,nofonc,yestone,notone
1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0
0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1
1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,1
1,0,0,0,1,0,0,0,0,1,0,1,1,0,1,0
1,0,0,0,1,0,0,0,0,1,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,0,1,0,1,0,1,1,0
1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,1
0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0
0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,1,0,0,1,1,0,1,0
0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,0
0,1,0,1,0,0,1,0,0,0,0,1,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,0,1
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
1,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0

The steps I took were these:

data = read.table("/data.csv", header=T, sep=",")
data
dist = dist(data, method="euclidean")
dist
cluster = hclust(dist, method="ward")
cluster
plot(cluster)
rect.hclust(cluster, k=4, border="red")

I extracted 4 clusters from the data. My question is: is it possible to produce a summary of the mean values of each variable for each of the 4 clusters?

Thanks a lot in advance,
Jeoffrey
Re: [R] cluster analysis: mean values for each variable and cluster
jgaspard wrote:

> Hi all! I'm new to R and don't know many about it. Because it is free, I managed to learn it a little bit. Here is my problem: I did a cluster analysis on 30 observations and 16 variables (monde, figaro, liberation, etc.). [...]
>
> The steps I made were those:
>
> data = read.table("/data.csv", header=T, sep=",")
> data
> dist = dist(data, method="euclidean")
> dist
> cluster = hclust(dist, method="ward")
> cluster
> plot(cluster)
> rect.hclust(cluster, k=4, border="red")
>
> I extracted 4 clusters from the data. My question is: is it possible to produce a summary of every mean values for each variable of each of the 4 clusters?

Well, I think this is not what you want. You probably want to use the Manhattan distance (rather than Euclidean) for 0/1 data, and you want to know the number of 1s and the total number of observations in each cluster.

Anyway, to answer your question, assign the result of the last call, such as:

x <- rect.hclust(cluster, k=4, border="red")
sapply(x, function(i) colMeans(data[i,]))

Uwe Ligges

> Thanks a lot in advance, Jeoffrey
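A variant of the answer above that does not need an open plot can be sketched with cutree() and aggregate(). The tiny data frame `toy` is a made-up stand-in for the poster's file, and Manhattan distance with complete linkage is an assumption chosen for the illustration, following the remark about 0/1 data.

```r
# Hypothetical 0/1 data, standing in for the 30 x 16 table in the question
toy <- data.frame(
  monde  = c(1, 1, 1, 0, 0, 0),
  figaro = c(0, 0, 0, 1, 1, 1),
  bxl    = c(0, 0, 1, 1, 1, 1)
)

# Manhattan distance counts the number of differing 0/1 entries
hc <- hclust(dist(toy, method = "manhattan"), method = "complete")
groups <- cutree(hc, k = 2)          # cluster label for every row

# Per-cluster mean of each variable (for 0/1 data: the share of 1s)
means <- aggregate(toy, by = list(cluster = groups), FUN = mean)

# Per-cluster number of 1s and number of observations
ones  <- aggregate(toy, by = list(cluster = groups), FUN = sum)
sizes <- table(groups)

means
```

With the poster's objects, the same calls would use `data` and `k = 4` in place of `toy` and `k = 2`.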
Re: [R] Cluster analysis question
Dan,

I don't use the flexclust package, but if I understand your question correctly, you can use your own distance measure to calculate a dissimilarity matrix and pass that to, e.g., agnes() in the cluster package.

Stephen

On Fri, Feb 6, 2009 at 9:42 AM, Jim Porzak jpor...@gmail.com wrote:

> Dan, Check out Fritz Leisch's flexclust package. HTH, Jim Porzak TGN.com San Francisco, CA http://www.linkedin.com/in/jimporzak use R! Group SF: http://ia.meetup.com/67/
>
> On Fri, Feb 6, 2009 at 7:11 AM, Dan Stanger dstan...@eatonvance.com wrote:
>
>> Hello All, I have data where each feature data point is a vector, and my distance measurement is a weighted dot product between vectors. I would like to use R to perform a cluster analysis on this data. Does one of the R cluster analysis routines provide for a user provided distance function?
>> Dan Stanger Eaton Vance Management 255 State Street Boston, MA 02109 617 598 8261

-- Rochester, Minn. USA
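The approach described above (compute your own dissimilarity matrix, then hand it to a clustering routine) might look roughly like this. The data `X` and the weights `w` are invented for the example. Because a dot product is a similarity rather than a distance, the sketch converts it to a weighted cosine dissimilarity; that conversion is an assumption, and the original poster's measure may need a different one.

```r
# Hypothetical feature vectors (one per row) and attribute weights
X <- rbind(a = c(1, 0, 2), b = c(2, 0, 4), c = c(0, 3, 0), d = c(0, 1, 0.1))
w <- c(1, 2, 0.5)

# Weighted dot products between all pairs of rows
S <- X %*% diag(w) %*% t(X)

# Turn the similarity into a dissimilarity: 1 - weighted cosine (assumed conversion)
norms <- sqrt(diag(S))
D <- pmax(1 - S / outer(norms, norms), 0)  # pmax guards against tiny negative rounding

# Any routine that accepts a "dist" object can take it from here
d <- as.dist(D)
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
```

The same `d` object can be passed to agnes() or pam() from the cluster package, which also accept dissimilarities.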
[R] Cluster analysis question
Hello All,

I have data where each feature data point is a vector, and my distance measurement is a weighted dot product between vectors. I would like to use R to perform a cluster analysis on this data. Does one of the R cluster analysis routines provide for a user provided distance function?

Dan Stanger
Eaton Vance Management
255 State Street
Boston, MA 02109
617 598 8261
Re: [R] Cluster analysis question
Dan,

Check out Fritz Leisch's flexclust package.

HTH,
Jim Porzak
TGN.com
San Francisco, CA
http://www.linkedin.com/in/jimporzak
use R! Group SF: http://ia.meetup.com/67/

On Fri, Feb 6, 2009 at 7:11 AM, Dan Stanger dstan...@eatonvance.com wrote:

> Hello All, I have data where each feature data point is a vector, and my distance measurement is a weighted dot product between vectors. I would like to use R to perform a cluster analysis on this data. Does one of the R cluster analysis routines provide for a user provided distance function?
> Dan Stanger Eaton Vance Management 255 State Street Boston, MA 02109 617 598 8261
[R] Cluster analysis using numeric and factor variables
Hi,

Are there any algorithms that handle numeric and factor variables together in a cluster analysis?

Thank you,
Nagu
Re: [R] Cluster analysis using numeric and factor variables
If you can define a distance between two vectors (where each one has some numerical and some categorical coordinates), then you can proceed with any clustering algorithm that works on a distance matrix. One way to get such a distance is to use randomForest, which can produce a proximity matrix that can be turned into a distance matrix.

Regards,
Moshe.

--- On Wed, 11/6/08, Nagu [EMAIL PROTECTED] wrote:

> From: Nagu [EMAIL PROTECTED]
> Subject: [R] Cluster analysis using numeric and factor variables
> To: r-help@r-project.org
> Received: Wednesday, 11 June, 2008, 11:49 AM
>
> Hi, Are there any algorithms that handle numeric and factor variables together in a cluster analysis? Thank you, Nagu
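As one concrete instance of "define a distance, then cluster", here is a small sketch using Gower's dissimilarity from daisy() in the cluster package. Note that this is a swapped-in technique, not the random forest proximity mentioned in the reply, and the data frame `df` is invented for the illustration.

```r
library(cluster)  # daisy() and pam(); a recommended package

# Hypothetical mixed data: one numeric variable and one factor
df <- data.frame(
  size = c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8),
  crop = factor(c("maize", "maize", "maize", "rice", "rice", "rice"))
)

# Gower's coefficient handles numeric and factor columns together
d <- daisy(df, metric = "gower")

# Partitioning around medoids works directly on the dissimilarities
fit <- pam(d, k = 2)
fit$clustering
```

Ordinal variables can be passed as ordered factors; how much weight each variable should get remains a subject-matter decision.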
[R] Cluster analysis with numeric and categorical variables
Dear all,

I would like to perform a clustering analysis on a data frame with two coordinate variables (X and Y) and a categorical variable where only a != b can be established. As far as I understood classification analyses, they are not an option as they partition the training set only in k classes of the test set. By searching through the book Modern Applied Statistics with S I did not find a satisfactory solution. I will be grateful for any suggestions.

Best regards
Miha
Re: [R] Cluster analysis with numeric and categorical variables
Dear Miha,

a general way to do this is as follows: Define a distance measure by aggregating the Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if the category is the same, 1 otherwise) on the categorical variable. Perform cluster analysis (whichever you want) on the resulting distance matrix.

Note that there is more than one way to do this. The 0-1 distance could be incorporated in the definition of the Euclidean distance (taking the place of a squared difference (x_i-y_i)^2), or a weighted average of the distances in X-, Y- and categorical space could be computed. Weights of variables (including possibly rescaling) have to be decided. How to do this precisely should depend on the subject matter and prior information about variable importance etc. In the absence of such information, you may standardise the variable-wise sums of squared pairwise distances to be equal.

Hope this helps (and you can figure out the relevant R code yourself).

Christian

On Tue, 3 Jun 2008, Miha Staut wrote:

> Dear all, I would like to perform a clustering analysis on a data frame with two coordinate variables (X and Y) and a categorical variable where only a != b can be established. As far as I understood classification analyses, they are not an option as they partition the training set only in k classes of the test set. By searching through the book Modern Applied Statistics with S I did not find a satisfactory solution. I will be grateful for any suggestions. Best regards Miha
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche
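The weighted-average variant described in the reply above could be written along these lines. Everything specific here is an assumption for illustration: the toy points `pts`, the weight `w = 0.3` on the categorical part, and the simple rescaling of the spatial distances to a maximum of 1 (a cruder choice than the standardisation of sums of squared distances suggested in the reply).

```r
# Hypothetical data: two coordinates and one unordered category
pts <- data.frame(
  X   = c(0, 0.1, 0.2, 5, 5.1, 5.2),
  Y   = c(0, 0.2, 0.1, 5, 5.2, 5.1),
  cat = c("a", "a", "b", "b", "b", "a")
)

# Euclidean distance on the standardised coordinates
d_xy <- dist(scale(pts[, c("X", "Y")]))

# Trivial 0-1 distance: 0 if the category is the same, 1 otherwise
d_cat <- as.dist(outer(pts$cat, pts$cat, "!=") * 1)

# Weighted average of the two parts; w is a subject-matter choice
w <- 0.3
d_mix <- (1 - w) * d_xy / max(d_xy) + w * d_cat

hc <- hclust(d_mix, method = "average")
groups <- cutree(hc, k = 2)
groups
```

Raising `w` makes category agreement count for more relative to spatial proximity, so it is worth checking how stable the groups are across a few values.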
Re: [R] cluster analysis
AMINA SHAHZADI,

The eternal question. What I do is generate a range of solutions, profile them on the variables used to cluster the data (and on any other information I have to profile the groups on), and then present the solutions to a group of others to assess meaningfulness, debate the solutions and attempt to reach a consensus.

In many cases, e.g. for algorithms based on k-means and hierarchical clustering, you are using an exploratory technique and there are no right or wrong answers to this. Having used cluster analysis for years, here are some things to look at, because there is no way to answer this statistically (unless you are using a latent class type model with goodness-of-fit measures):

1. What is the minimum size you believe to be robust for a single cluster (e.g. n=30, n=100)? The larger the number of clusters you generate relative to sample size, the smaller your clusters will be, so there must be a cut-off point below which you are not prepared to go.

2. If you run the data through different algorithms, how comparable are the results (cluster stability)?

3. What differences emerge between 2, 3, 4 cluster solutions etc.? As you use larger numbers of clusters, does this still produce a meaningful result in that the clusters are distinct and unique, or are you just cutting larger clusters into smaller clusters without generating unique and usable information? Examine the clusters via a series of cross tabs: as you go from 2 to 3 to 4 cluster solutions, what happens to the members within clusters, are they distributed differently, etc.?

Thanks
Paul

----- Original Message -----
From: amna khan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, November 02, 2007 2:19 AM
Subject: [R] cluster analysis

> Hi Sir How can we select the optimum number of clusters? Best Regards -- AMINA SHAHZADI Department of Statistics GC University Lahore, Pakistan.
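The three checks listed in the reply above (minimum cluster size, agreement between algorithms, and cross tabs between successive solutions) can be sketched with base R alone. The built-in `iris` measurements are used purely as stand-in data, and Ward linkage and k-means are assumed choices for the two algorithms being compared.

```r
# Stand-in data: the four numeric iris measurements, standardised
x <- scale(iris[, 1:4])

# A range of hierarchical solutions: one column per k
hc  <- hclust(dist(x), method = "ward.D2")
sol <- cutree(hc, k = 2:4)

# 1. Cluster sizes per solution, and the smallest cluster in each
sizes    <- lapply(2:4, function(k) table(sol[, as.character(k)]))
smallest <- sapply(sizes, min)

# 2. Agreement with a different algorithm for the same k
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)
agree <- table(hclust = sol[, "3"], kmeans = km$cluster)

# 3. Cross tab showing how members move when going from 2 to 3 clusters
xtab23 <- table(k2 = sol[, "2"], k3 = sol[, "3"])

smallest
agree
xtab23
```

In a hierarchical solution each 3-cluster group sits entirely inside one 2-cluster group, so the last table mainly shows which group was split; the cross tab against k-means is where real disagreement would show up.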
[R] cluster analysis
Hi Sir

How can we select the optimum number of clusters?

Best Regards
--
AMINA SHAHZADI
Department of Statistics
GC University Lahore, Pakistan.
[R] Cluster Analysis
Dear all,

I would like to know if I can do a hierarchical cluster analysis in R using my own similarity matrix and how.

Thanks.
Katia Freire.
Re: [R] Cluster Analysis
Take a look at hclust().

Dieter

Katia Freire wrote:

> Dear all, I would like to know if I can do a hierarchical cluster analysis in R using my own similarity matrix and how. Thanks. Katia Freire.
Re: [R] Cluster Analysis
> Subject: [R] Cluster Analysis
>
> Dear all, I would like to know if I can do a hierarchical cluster analysis in R using my own similarity matrix and how. Thanks. Katia Freire.

Yes. ;)

Reading the help for dist() and hclust() should make the procedure for doing this appear fairly straightforward. For interpreting the results, cutree() should be helpful.

--elijah
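A minimal sketch of the procedure hinted at above, assuming the similarities lie between 0 and 1 so that `1 - S` is a sensible dissimilarity (the matrix `S` and that conversion are assumptions; other similarity scales need a different transformation):

```r
# Hypothetical symmetric similarity matrix for four objects
S <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

# hclust() expects dissimilarities in a "dist" object
d  <- as.dist(1 - S)
hc <- hclust(d, method = "average")

# Cut the tree into two groups to read off the memberships
groups <- cutree(hc, k = 2)
groups
```

`plot(hc)` draws the dendrogram, which helps when deciding where to cut.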
[R] cluster analysis
Hi Sir

How to perform cluster analysis using Ward's method and K-means clustering?

Regards
--
AMINA SHAHZADI
Department of Statistics
GC University Lahore, Pakistan.
Re: [R] cluster analysis
On 10/18/07, amna khan [EMAIL PROTECTED] wrote:

> Hi Sir How to perform cluster analysis using Ward's method and K-means clustering?

To begin with, try performing it using the Rcmdr GUI.

Regards,
Liviu
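For readers who prefer the command line to the GUI, one common way of combining the two methods is to run Ward's hierarchical clustering first and use its cluster centroids as starting centers for k-means. The sketch below assumes that combination (the question does not say how the two should be combined), uses the built-in `USArrests` data as a stand-in, and picks `k = 4` arbitrarily.

```r
# Stand-in data: standardised USArrests (built into R)
x <- scale(USArrests)

# Step 1: Ward's method on Euclidean distances, cut into k groups
k  <- 4
hc <- hclust(dist(x), method = "ward.D2")
ward_groups <- cutree(hc, k = k)

# Step 2: centroids of the Ward clusters (k rows, one column per variable)
centers <- apply(x, 2, function(col) tapply(col, ward_groups, mean))

# Step 3: k-means started from those centroids
km <- kmeans(x, centers = centers)

# Compare the two partitions
table(ward = ward_groups, kmeans = km$cluster)
```

Starting k-means from the Ward centroids removes the dependence on random starts; k-means then only reassigns observations where that lowers the within-cluster sum of squares.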