Re: [R] Cluster analysis

2019-03-31 Thread Sarah Goslee
Hi,

R has a vast array of tools for cluster analysis. There's even a task
view: https://cran.r-project.org/web/views/Cluster.html

Working out which method best suits your needs will take some time spent
understanding the pros and cons, and possibly a consultation with a
local statistician.

Sarah
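
As one possible starting point for mixed numeric and ordinal data of the kind described in the question, a Gower dissimilarity followed by hierarchical clustering can be sketched as below. This is only an illustration: the data frame `farms` and its columns are invented, and the choice of average linkage and of two groups is an assumption, not a recommendation.

```r
library(cluster)  # recommended package, shipped with standard R installations

set.seed(1)
# Hypothetical farm survey: two numeric variables and one ordinal variable
farms <- data.frame(
  area_ha   = c(rnorm(10, 5, 1), rnorm(10, 50, 5)),
  herd_size = c(rpois(10, 3), rpois(10, 40)),
  education = factor(sample(c("none", "primary", "secondary"), 20, replace = TRUE),
                     levels = c("none", "primary", "secondary"), ordered = TRUE)
)

# Gower dissimilarity handles numeric and ordered-factor columns together
d <- daisy(farms, metric = "gower")

# Average-linkage hierarchical clustering, cut into two groups
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 2)
table(groups)
```

Whether two groups, or this linkage, make sense depends on the data and on what the classification is for.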

On Sun, Mar 31, 2019 at 4:20 PM bienvenidoz...@gmail.com wrote:
>
> Hi,
> I have data from farmers with different variables. I would like to classify 
> them according to some variables. Can you help me with "R" to find the best 
> variables to classify them and how to classify them with "R". Some variables 
> are numerical others are ordinal.
>
> Best regards,
> Bienvenue
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Sarah Goslee (she/her)
http://www.numberwright.com



[R] Cluster analysis

2019-03-31 Thread bienvenidoz...@gmail.com
Hi,
I have data on farmers covering a number of variables, and I would like to
classify the farmers according to some of them. Can you help me use R to find
the best variables for the classification, and to carry out the classification
itself? Some variables are numerical; others are ordinal.

Best regards,
Bienvenue


[R] Cluster analysis with Weighted attribute

2016-06-03 Thread Ahreum Lee
Hi all,

I'm not very familiar with R, so I have been looking for an R function or
package that fits my problem.

Is there any R function or package that performs cluster analysis with
weighted attributes?

I have seen several papers on attribute value weighting in k-modes
clustering, but I could not find an R function or package that implements it.

We obtained the weight of each attribute by interviewing experts, and we want
to run a cluster analysis that takes those attribute weights into account.

Do you have any suggestions? They would be much appreciated.

Thanks for your interest in my question!
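
One way to bring expert weights into a clustering of categorical attributes is sketched below. It is not the weighted k-modes method from the papers mentioned above. Instead it swaps in a weighted Gower dissimilarity (the `weights` argument of `daisy()` in the 'cluster' package) followed by k-medoids via `pam()`. The data frame `items`, the weights `w` and the choice of k = 2 are all invented for illustration.

```r
library(cluster)

# Hypothetical categorical data: 8 items described by 3 attributes
items <- data.frame(
  colour = factor(c("red", "red", "blue", "blue", "red", "blue", "green", "green")),
  shape  = factor(c("round", "round", "square", "square", "round", "square", "round", "square")),
  size   = factor(c("S", "S", "L", "L", "M", "L", "M", "S"))
)

# Hypothetical expert weights, one per attribute, in column order
w <- c(colour = 3, shape = 1, size = 0.5)

# Weighted Gower dissimilarity: mismatches on 'colour' count most
d <- daisy(items, metric = "gower", weights = unname(w))

# k-medoids on the weighted dissimilarities
fit <- pam(d, k = 2)
fit$clustering
```

Changing `w` and re-running shows how strongly the grouping depends on the expert weights.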
 
​ 
 
 


Re: [R] cluster analysis

2015-06-17 Thread PIKAL Petr
Hi

> -----Original Message-----
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Venky
> Sent: Wednesday, June 17, 2015 8:43 AM
> To: R Help R
> Subject: [R] cluster analysis
>
> Hi friends,
>
> I have data like this

In R or elsewhere?

> Group
>   Employee size WOE Employee size2 Weight of Evidence 1081680995 0
> 0.12875537 0.128755 -0.30761 1007079896 1 0.48380133 -0.46544 -0.70464
> 1000507407 2 0.26029825 -0.46544 0.070221 1006400720 3 0.12875537
> 0.128755
> 0.151385 1006916029 4 0.12875537 -0.05955 0.320269 1006717587 5
> 0.12875537
> 1002032301 6 0.12875537 1007021594 7 0.26029825 1007118066 8 0.26029825
> In this data first variable (Employee size) has 10 rows and variable 2
> (employee size2) has only 5 rows

This is extremely messy because of HTML posting. Please post in plain text,
as recommended by the Posting Guide.

> Question 1:there are different number of rows so that, we can able to
> do K-means cluster or not?

I am not an expert, but why not try it?

> Question 2:If we run k-means clustering in R answer not coming  because
> of NA exists
>
> I have used dataset <- na.omit(dataset)
>
> But that time also i cannot able to run clustering

Perhaps not enough data remained after the NAs were removed.

To get a better answer, you should provide a reproducible example, or at
least some usable data.

Cheers
Petr
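
The point about too little data remaining after NA removal can be checked directly. The sketch below uses invented data (`dat`, `woe1`, `woe2` are hypothetical names), drops incomplete rows, counts what is left, and only then calls `kmeans()`.

```r
set.seed(42)
# Hypothetical data: two numeric columns, the second with missing values
dat <- data.frame(
  woe1 = c(rnorm(10, 0), rnorm(10, 3)),
  woe2 = c(rnorm(10, 0), rnorm(10, 3))
)
dat$woe2[c(2, 5, 13)] <- NA

# kmeans() does not accept NA, so keep complete rows only
complete <- na.omit(dat)
nrow(complete)   # number of rows left to cluster

# Only cluster if more rows remain than requested centres
k <- 2
stopifnot(nrow(complete) > k)
fit <- kmeans(complete, centers = k, nstart = 10)
fit$size
```

If a column is mostly NA, `na.omit()` can remove nearly every row. In that case it may be better to drop the column, not the rows.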



> Please help me to find this answer




[R] cluster analysis

2015-06-17 Thread Venky
Hi friends,

I have data like this



Group       Employee size  WOE         Employee size2  Weight of Evidence
1081680995  0              0.12875537   0.128755       -0.30761
1007079896  1              0.48380133  -0.46544        -0.70464
1000507407  2              0.26029825  -0.46544         0.070221
1006400720  3              0.12875537   0.128755        0.151385
1006916029  4              0.12875537  -0.05955         0.320269
1006717587  5              0.12875537
1002032301  6              0.12875537
1007021594  7              0.26029825
1007118066  8              0.26029825

In this data the first variable (Employee size) has 10 rows and the second
variable (Employee size2) has only 5 rows.

Question 1: Since the variables have different numbers of rows, can we still
run k-means clustering?
Question 2: When we run k-means clustering in R, no answer is returned
because NAs exist.

I have used dataset <- na.omit(dataset)

But even then I cannot run the clustering.

Please help me find the answer.



Re: [R] Cluster analysis using term frequencies

2015-03-24 Thread Christian Hennig

Dear Sun Shine,


> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))
>
> However, I get an error message when trying to plot this: Error in
> graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : invalid
> dendrogram input.

I don't see anything wrong with the code, so what I'd do is run
str(dtes) and str(dtesFreq) to see whether these are what they should be
(or if not, what they are instead).

> I'm clearly screwing something up, either in my source data.frame or in my
> setting hclust up, but don't know which, nor how.

I can't comment on your source data, but generally, whatever you do, use
str() or even print() to check whether the R objects are all right or what
went wrong.

> More than just identifying the error however, I am interested in finding a
> smart (efficient/ elegant) way of checking the occurrence and frequency value
> of the terms that may be associated with 'sports', 'learning', and
> 'extra-mural' and extracting these into a matrix or data frame so that I can
> analyse and plot their clustering to see if how I associated these terms is
> actually supported statistically.

The first thing that comes to my mind (not necessarily the best or most
elegant) is to run

dtes3 <- cutree(dtesFreq, 3)

and to tabulate dtes3 against your manual classification.
Note that 3 is the most natural number of clusters at which to cut the tree
here, but it may not be the best match for your classification (for example,
you may have a one-point cluster in the 3-cluster solution, so it may
effectively be a two-cluster solution with an outlier). Your
dendrogram, if you succeed in plotting it, may give you a hint about that.


Hope this helps,
Christian
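
The inspect, cut and tabulate steps described above can be sketched as follows. The term-frequency table `tes.df` and the vector `manual` are invented stand-ins. One plausible cause of the "invalid dendrogram input" error (an assumption, not confirmed anywhere in the thread) is that `names(tes.df)` returns column names, so the label vector does not have one entry per clustered row. The sketch uses `rownames()` instead.

```r
# Hypothetical term-frequency table: one row per term, one column per document
tes.df <- data.frame(
  doc1 = c(12, 10, 1, 0, 3, 2),
  doc2 = c(9, 11, 0, 2, 4, 3),
  doc3 = c(1, 0, 8, 9, 3, 4),
  row.names = c("football", "soccer", "reading", "maths", "drama", "chess")
)
# Hypothetical manual classification of the same six terms
manual <- c("sports", "sports", "learning", "learning", "extra-mural", "extra-mural")

dtes     <- dist(tes.df, method = "euclidean")
dtesFreq <- hclust(dtes, method = "ward.D")

# Inspect the objects before plotting
str(dtes)
str(dtesFreq)

# Labels need one entry per clustered row, hence rownames() rather than names()
plot(dtesFreq, labels = rownames(tes.df))

# Cut into three clusters and compare with the manual grouping
dtes3 <- cutree(dtesFreq, 3)
table(cluster = dtes3, manual = manual)
```

A cross-table in which each cluster lines up with a single manual category would support the manual sorting. Clusters spread across categories would not.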




> I'm sure that there must be a way of doing this in R, but I'm obviously not
> going about it correctly. Can anyone shine a light please?
>
> Thanks for any help/ guidance.
>
> Regards,
> Sun



*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



[R] Cluster analysis using term frequencies

2015-03-24 Thread Sun Shine

Hi list

I am using the 'tm' package to review meeting notes at a school to 
identify terms frequently associated with 'learning', 'sports', and 
'extra-mural' activities, and then to sort any terms according to these 
three headers in a way that could be supported statistically (as opposed 
to, say, my own bias, etc.).


To accomplish this, I have done the following:

(1) After the usual pre-processing of the text data, loading it as a 
corpus and then converting it into a document term matrix (called 
'allTerms'), I have identified the 20 most frequently occurring terms in 
the meeting notes and extracted these into a named vector called 
'freqTerms'. Many of the terms returned have nothing to do with any of 
the three themes of 'learning', 'sports', or 'extra-mural'.


(2) Therefore, I have also manually generated a list of terms and 
synonyms for 'learning' and 'sports', etc. (e.g. 'football', 'soccer', 
'drama', 'chess', etc.) and then tested for the occurrence of each of 
these terms in the corpus, e.g.:


 allTerms['soccer']

and have come up with a list of some 30 terms together with their 
frequencies. I manually sorted these according to three headers 
'learning', 'sports', and 'extra-mural' and dropped these into a table 
in a word processing document. Some of these terms are also in the 
freqTerms vector.


What I want to do now is to use cluster analysis (hclust, from the 
'stats' package) to plot a dendrogram of the terms I have manually 
checked and put into the table, in order to see how similar the 
terms are and whether they cluster in a way similar to how I 
manually sorted them under the table column headers of 'learning', 
'sports', and 'extra-mural'.


To do this, I dropped these manually sorted terms into a data frame 
together with the associated values (which I called 'tes.df') and then 
tried plotting this as follows:


 dtes <- dist(tes.df, method = 'euclidean')
 dtesFreq <- hclust(dtes, method = 'ward.D')
 plot(dtesFreq, labels = names(tes.df))

However, I get an error message when trying to plot this: Error in 
graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : 
invalid dendrogram input.


I'm clearly screwing something up, either in my source data.frame or in 
my setting hclust up, but don't know which, nor how.


More than just identifying the error however, I am interested in finding 
a smart (efficient/ elegant) way of checking the occurrence and 
frequency value of the terms that may be associated with 'sports', 
'learning', and 'extra-mural' and extracting these into a matrix or data 
frame so that I can analyse and plot their clustering to see if how I 
associated these terms is actually supported statistically.


I'm sure that there must be a way of doing this in R, but I'm obviously 
not going about it correctly. Can anyone shine a light please?


Thanks for any help/ guidance.

Regards,
Sun



[R] cluster analysis

2013-07-04 Thread Ekele Alih
I want to do agglomerative hierarchical clustering with the complete linkage 
method in R, using the function agnes or hclust. 
1. Can I do a cluster analysis of h=(n+p+1)/2 out of n observations? Note that 
p = number of variables (dependent and independent).
2. Can I plot the dendrogram and get the cluster history of this analysis in R?
3. Can I use the cluster with the largest values to sort the n observations in 
ascending order?
Your assistance and guidance in solving problems 1-3 will be greatly appreciated.
Thanks
EKELE ALIH
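
For question 2, a minimal sketch of complete-linkage agglomerative clustering with a dendrogram and a merge history is given below. It uses the built-in `USArrests` data purely as a stand-in. The data frame `history` is an illustrative name, and the subset size h from question 1 is not addressed here.

```r
library(cluster)

x <- scale(USArrests)                  # built-in data, for illustration only

ag <- agnes(x, method = "complete")    # agglomerative clustering, complete linkage
hc <- as.hclust(ag)                    # convert so base hclust tools can be used

plot(hc)                               # dendrogram

# Cluster history: one row per merge step.
# Negative entries are single observations, positive entries are earlier steps.
history <- data.frame(step = seq_len(nrow(hc$merge)),
                      hc$merge,
                      height = hc$height)
head(history)
```

The same history is available from `hclust(dist(x), method = "complete")` through its `merge` and `height` components.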


[R] Cluster analysis

2013-04-11 Thread ravanlou
I am doing cluster analysis of my SNP data. I have two questions:
1. I draw the dendrogram with hclust using the following code, and I would
like to change its direction to vertical.

data <- read.table(as.matrix(file.choose()), header=T, row.names = 1,
sep="\t")
 plot(hclust(as.dist(data),method="complete"))

 It is drawn horizontally, and I don't know how to change it to a vertical
layout.

2. I would like to have bootstrap values, but have had no luck. I am using
the following code:

 result <- pvclust(as.dist(data), method.dist="cor",
method.hclust="complete", nboot=1000)

Error in cor(x, method = "pearson", use = use.cor) :
  supply both 'x' and 'y' or a matrix-like 'x'


I would appreciate it if someone could help me.
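
On the first question, the orientation of a dendrogram can be switched by converting the hclust result with `as.dendrogram()` and using the `horiz` argument of its plot method. The sketch below uses an invented distance matrix `m` in place of the SNP data. It does not address the pvclust error.

```r
# Hypothetical symmetric distance matrix between five samples
set.seed(3)
m <- as.matrix(dist(matrix(rnorm(20), nrow = 5,
                           dimnames = list(paste0("s", 1:5), NULL))))

hc   <- hclust(as.dist(m), method = "complete")
dend <- as.dendrogram(hc)

# Draw both orientations side by side
op <- par(mfrow = c(1, 2))
plot(dend, main = "horiz = FALSE")
plot(dend, horiz = TRUE, main = "horiz = TRUE")
par(op)
```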


-- 
Abbasali Ali Ravanlou
PhD candidate of Plant Pathology
Dept. of Crop Sci.
University of Illinois-UC



[R] Cluster analysis on weighted survey data with continuous and categorical variables

2013-03-19 Thread Emma Gibson
I am trying to perform cluster analysis on survey data where each respondent 
has answered several questions, some of which have categorical answers (blue, 
pink, green, etc.) and some of which have scale answers (rating from 1 to 10, 
etc.).

My problem is that certain age groups were over-sampled and I need to 
weight the data collected in order to accurately reflect the current 
population.

Will it make a difference if I do the cluster analysis on the weighted data, 
and if so, how do I do cluster analysis on the weighted data?

Any advice would be much appreciated!

Thanks
Emma
   


Re: [R] Cluster analysis on weighted survey data with continuous and categorical variables

2013-03-19 Thread Thomas Lumley
On Wed, Mar 20, 2013 at 3:55 AM, Emma Gibson waterbab...@hotmail.com wrote:

> I am trying to perform cluster analysis on survey data where each
> respondent has answered several questions, some of which have categorical
> answers (blue pink green etc) and some of which have scale answers
> (rating from 1 to 10 etc). My problem is that certain age groups were
> over-sampled and I need to weight the data collected in order to accurately
> reflect the current population. Will it make a difference if I do the
> cluster analysis on the weighted data, and if so, how do I do cluster
> analysis on the weighted data? Any advice would be much appreciated! Thanks
> Emma



The unequal sampling will have some effect on most clustering methods (eg
not single-linkage, but k-means or average-linkage).  Whether this matters
depends on whether you have genuinely separate clusters in the population
or a general mush that you are trying to segment in some convenient way.

If you have genuine well-separated clusters, then ignoring the oversampling
is likely to do well.  If you don't, you will get a segmentation into
clusters that partitions the over-sampled people too finely and the
under-sampled people too coarsely.

I don't know of any R functions that cluster with sampling weights.

If your data set is fairly small, you could expand it by making duplicates
(perhaps jittered) of some points, and cluster the expanded data set.  On
the other hand, if it is very large, you can thin it out to a uniform
sample by sampling from it with probability inversely proportional to the
original sampling probability.
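
Both workarounds described above, thinning a large sample and expanding a small one, can be sketched in a few lines. Everything in the example is invented for illustration: the data frame `svy`, the two age groups, and the assumed sampling probabilities in `samp_prob`.

```r
set.seed(7)
# Hypothetical survey in which "young" respondents were sampled at four times
# the rate of "old" respondents
n <- 500
svy <- data.frame(
  age_group = sample(c("young", "old"), n, replace = TRUE, prob = c(0.8, 0.2)),
  score     = rnorm(n)
)
# Assumed known sampling probability for each respondent
svy$samp_prob <- ifelse(svy$age_group == "young", 0.04, 0.01)

# Thinning (large data): keep each row with probability proportional to
# 1 / samp_prob, scaled so the least-sampled group is always kept
keep_prob <- min(svy$samp_prob) / svy$samp_prob
thinned   <- svy[runif(n) < keep_prob, ]

# Expansion (small data): replicate rows in proportion to their weight
wt       <- round((1 / svy$samp_prob) / min(1 / svy$samp_prob))
expanded <- svy[rep(seq_len(n), times = wt), ]

table(thinned$age_group)
table(expanded$age_group)
```

Either `thinned` or `expanded` can then be passed to whatever clustering method is being used. Jittering the duplicated rows, as suggested above, is left out of this sketch.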

   - thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland



[R] Cluster analysis in the setting of repeated measures

2013-03-10 Thread John Sorkin
Does R have any function for performing cluster analysis when each subject 
contributes more than one observation to the analysis, i.e. a repeated-measures 
cluster analysis? I would prefer agglomerative clustering, but would certainly be 
happy with k-means or another clustering technique. To the best of my knowledge, 
the standard R clustering functions (e.g. kmeans, hclust, pvclust) all assume 
that each subject contributes a single line of data to the analysis.
Thanks,
John
 
 
 
John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)


[R] Cluster Analysis and PCoA (mixt variables)

2013-01-19 Thread Julien Mvdb
Hello everyone,

I am writing to you because of my lack of knowledge of statistics.
I'm using cluster analysis (CA) and PCoA (though perhaps I should use some
other technique) to determine the differences and similarities between a
large sample of plants, using different kinds of traits held in a matrix of
mixed variables.
I understood that the daisy() function, using the Gower metric and declaring
the type of each variable, is a good way to deal with such mixed variables.
And in fact my plots (cluster{agnes}), more than my PCoA, reflect quite well
what I was expecting from the appearance of those plants.

My problem:
I now need to understand which variables contribute to the dissimilarity
matrix that is used for the cluster analysis or the PCoA. In other words,
how are the branches of my cluster analysis tree constructed?

I started learning data analysis and R only a month ago, so I'm sorry for
asking such simple things that are not strictly about R, but I have tried in
many ways and just can't figure it out.
Thank you

Julien Mehl Vettori
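
One rough way to see which variables drive a Gower dissimilarity is to drop each variable in turn and check how much the dissimilarity matrix changes. This leave-one-out check is a heuristic offered as an assumption, not an established diagnostic from the 'cluster' package. The data frame `traits` and its columns are invented.

```r
library(cluster)

set.seed(11)
# Hypothetical plant traits of mixed type
traits <- data.frame(
  height_cm = c(rnorm(6, 20, 3), rnorm(6, 120, 10)),
  leaf_type = factor(rep(c("simple", "compound"), 6)),
  woody     = factor(rep(c("no", "yes"), each = 6)),
  shade_tol = factor(sample(c("low", "mid", "high"), 12, replace = TRUE),
                     levels = c("low", "mid", "high"), ordered = TRUE)
)

d_all <- daisy(traits, metric = "gower")
summary(d_all)   # includes the type daisy() assigned to each column

# Leave one variable out and correlate the new dissimilarities with the full
# ones. A lower correlation suggests the dropped trait has more influence.
influence <- sapply(names(traits), function(v) {
  d_drop <- daisy(traits[, setdiff(names(traits), v), drop = FALSE],
                  metric = "gower")
  cor(as.vector(d_all), as.vector(d_drop))
})
sort(influence)
```

With unweighted Gower dissimilarity every variable enters with equal weight, so differences in influence come from how the values are spread across the plants.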



[R] cluster analysis error - mclust package

2012-11-26 Thread KitKat
I am following instructions online for cluster analysis using the mclust
package, and keep getting errors.
http://www.statmethods.net/advstats/cluster.html

These are the instructions (there is no sample dataset unfortunately):
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit, mydata) # plot results 
print(fit) # display the best model 

This is what I did and the error I get:
 library(mclust)
 fit <- Mclust(mydat)
 plot(fit, mydat) #plot results
Error in match.arg(what, c("BIC", "classification", "uncertainty",
"density"),  : 
  'arg' must be NULL or a character vector

My data is arranged so I have each row representing one individual with 9
values for morphological data. I want to see if they will group into 2
clusters, representing gender. 

I have tried using the instructions from the cran-r website, but they didn't
work either

Any help would be great, thank you
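
The error message suggests that the second argument of `plot()` for an Mclust fit is being matched to `what` (the plot type) and not to the data. That is how the mclust versions I am aware of behave, but it is worth checking against `?plot.Mclust` for the installed version. A minimal sketch under that assumption follows, using the built-in `iris` measurements purely as a stand-in for the morphological data.

```r
library(mclust)   # assumes the mclust package is installed

# Built-in data used only as a stand-in for the nine morphological variables
mydat <- iris[, 1:4]

# Let BIC choose among 1 to 4 mixture components
fit <- Mclust(mydat, G = 1:4)
summary(fit)

# Name the plot type explicitly; do not pass the data as second argument
plot(fit, what = "classification")

# Posterior membership probabilities and hard assignments for each row
head(round(fit$z, 3))
head(fit$classification)
```

The matrix `fit$z` holds, for each individual, the estimated probability of belonging to each cluster.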





Re: [R] cluster analysis in R

2012-11-22 Thread Ingmar Visser
It's hard to answer these questions without knowing what the errors are and
how they can be reproduced.
Best, Ingmar

On Thu, Nov 22, 2012 at 1:03 AM, KitKat katherinewri...@trentu.ca wrote:

> Thanks, I have been trying that site and another one
> (http://www.statmethods.net/advstats/cluster.html)
>
> I don't know if I should be doing mclust or mcclust, but either way, the
> codes are not working. I am following the guidelines online at:
> mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
> mclust - http://cran.r-project.org/
>
> I am relatively new to R, but so far I have been able to figure out dfa,
> manova, pca... I cannot get these codes to work, I keep getting various
> errors. Are there other resources that have details about what codes to use
> or what to do when errors result? I have not found anything else helpful
>
> Thank you




Re: [R] cluster analysis in R

2012-11-22 Thread KitKat
These are the errors I've been having. I have been trying 3 different things

1- Mclust:
This is the example I have been following:
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit, mydata) # plot results 
print(fit) # display the best model 
 
What I have done:
 fit <- Mclust(mydat)
 plot(fit, mydat) #plot results
Error in match.arg(what, c("BIC", "classification", "uncertainty",
"density"),  : 
  'arg' must be NULL or a character vector

2- Mclust using different website (cran-r) instructions
This is the example: 
 mydatMclust <- Mclust(mydat)
 summary(mydatMclust)
 summary(mydatMclust, parameters = TRUE)
 plot(mydatMclust)

There are a couple other steps but the plot is the problem. I get two plots,
there should be four. One should be plotting all my individuals but it's
plotting my variables instead. It's also taking a very long time. R script
at this point says: Waiting to confirm page change… 

3. Mcclust 
Instructions from cran-r:
data(cls.draw2)
# sample of 500 clusterings from a Bayesian cluster model
tru.class <- rep(1:8,each=50)
# the true grouping of the observations
psm2 <- comp.psm(cls.draw2)
# posterior similarity matrix
# optimize criteria based on PSM
mbind2 <- minbinder(psm2)
mpear2 <- maxpear(psm2)
# Relabelling
k <- apply(cls.draw2,1, function(cl) length(table(cl)))
max.k <- as.numeric(names(table(k))[which.max(table(k))])
relab2 <- relabel(cls.draw2[k==max.k,])
# compare clusterings found by different methods with true grouping
arandi(mpear2$cl, tru.class)
arandi(mbind2$cl, tru.class)
arandi(relab2$cl, tru.class)

I called my data: mydat so I changed that where appropriate. I cannot get
past one early step, psm2 - comp.psm(cls.draw2).. the error reads: Error:
could not find function comp.psm

I think I have all appropriate packages installed. I don't know what more to
do on these three errors.  Any help would be great! Thank you
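
A "could not find function" error often means that a package has been installed but not attached with `library()` in the current session. Whether that is the cause here is an assumption, since the thread does not show which packages were loaded. The sketch below shows the difference, using the 'cluster' package and its `pam()` function only as a stand-in for mcclust and `comp.psm()`.

```r
# 'cluster' is used here only as a stand-in for any installed package
installed <- "cluster" %in% rownames(installed.packages())
installed                      # TRUE if the package is on disk

# In a fresh session, pam() is not visible until the package is attached
before <- exists("pam")

library(cluster)               # attach the package for this session
after <- exists("pam")

c(before = before, after = after)
```

For the mcclust example in the message above, the corresponding step would be `library(mcclust)` before calling `comp.psm()`.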






Re: [R] cluster analysis in R

2012-11-21 Thread KitKat
Thank you for replying! 
I made a new post asking if there are any websites or files on how to
download package mclust (or other Bayesian cluster analysis packages) and
the appropriate R functions? Sorry I don't know how this forum works yet





Re: [R] cluster analysis in R

2012-11-21 Thread Brian Feeny


http://cran.r-project.org/web/views/Cluster.html

might be a good start

Brian

On Nov 21, 2012, at 1:36 PM, KitKat wrote:

> Thank you for replying!
> I made a new post asking if there are any websites or files on how to
> download package mclust (or other Bayesian cluster analysis packages) and
> the appropriate R functions? Sorry I don't know how this forum works yet



Re: [R] cluster analysis in R

2012-11-21 Thread KitKat
Thanks, I have been trying that site and another one
(http://www.statmethods.net/advstats/cluster.html)

I don't know if I should be doing mclust or mcclust, but either way, the
codes are not working. I am following the guidelines online at:
mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
mclust - http://cran.r-project.org/

I am relatively new to R, but so far I have been able to figure out dfa,
manova, pca... I cannot get these codes to work, I keep getting various
errors. Are there other resources that have details about what codes to use
or what to do when errors result? I have not found anything else helpful 

Thank you





Re: [R] cluster analysis in R

2012-11-16 Thread Hennig, Christian
Dear Katherine,

The function flexmixedruns in package fpc may do what you want; it fits mixtures 
with continuous and categorical variables, can use the BIC to choose the 
number of mixture components, and also gives you posterior probabilities for 
cases to belong to components.

Note that generally finding the right cluster analysis method is a complicated 
task and depends crucially on your application, what use you want to make of 
the clusters etc., so what's best cannot be conclusively said on a mailing 
list. The same holds for whether and how to select variables. Certainly it's 
not wrong in general to use all the variables that you have but whether it's 
better otherwise depends on what meaning your variables have and how this 
relates to the aim of clustering, what to do with the variables afterwards etc.

You may have a look at 
http://www.rss.org.uk/site/cms/contentviewarticle.asp?article=866#Link%20to%20Nov.%202012%20paper
where I discuss a number of related issues.

Best regards,
Christian


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche




[R] cluster analysis in R

2012-11-15 Thread KitKat
I have two issues. 

1-I am trying to use morphology to identify gender. I have 9 variables, both
continuous and categorical. I was using two-step cluster analysis in SPSS
because two-step could deal with different types of variables. But the
output tells me that an animal is in cluster 1 or 2, it does not give me a
probability (ex. 0.70 cluster 2).  I also did not want to specify that I
want two clusters, I wanted to see if analysis would naturally give me two
clusters. These were all advantages to using SPSS but now I'm having
trouble.

Does cluster analysis in R give probabilities?
Which type of cluster analysis in R is best to use? I did not think
hierarchical analysis was a great choice, but maybe I'm wrong. I don't want
to create the average variable, I want the analysis to do it on its own. 
I'm also new to R so would have to figure out the right codes to enter, etc.

2-I was also told to analyze each variable on its own before including it in
cluster analysis. I had first included them all then teased out which ones
were not important, but now have been asked to do the reverse. I cannot do
cluster analysis on one variable -for example, one variable is either
present or absent on an individual so of course cluster analysis gives me
two clusters, one representing present and one representing absent. I was
told to use regression, but how can regression also not give the same
result? I feel like it would give me a line connecting a bunch of 0s to 1s.
I don't know what to use, or if I can analyze each variable like this before
putting them into cluster analysis. I ultimately want to only use the
smallest number of variables necessary to identify gender. 

I have tried reading manuals etc and talking to people at my school, but
nothing has helped. If anyone has any insight, that would be much
appreciated
Thank you!



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis in R

2012-11-15 Thread Ingmar Visser
Dear KitKat,

After installing R and reading some introductory material on getting
started with R you may want to check the CRAN task view on cluster analysis:
http://cran.r-project.org/web/views/Cluster.html
which has many useful references to all kinds and flavors of clustering
techniques, hierarchical or not, selecting the nr of clusters based on some
model selection statistic, et cetera.

hth, Ingmar



Re: [R] cluster analysis in R

2012-11-15 Thread Jose Iparraguirre
Have a look at the package mclust.
Jose


Re: [R] Cluster Analysis

2012-04-19 Thread Alekseiy Beloshitskiy
Hi, Taisa,

It depends on many factors, e.g. the nature of your data, the size of the data 
set, etc.

The R analog of SAS FASTCLUS is kmeans (for a practical example, see slide #35 
here:
 http://www.slideshare.net/whitish/textmining-with-r)

Also check k-medoids (pam) and hclust.

Good luck,
-Alex




Re: [R] Cluster Analysis

2012-04-16 Thread David L Carlson
At the R command prompt 
?kmeans (for info on the R equivalent to FASTCLUS)
?hclust (for info on the R equivalent to CLUSTER)

Install package clusterSim 
and look at function index.G1 for the Calinski-Harabasz pseudo F-statistic
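
As a rough illustration of the quantities asked about, the sketch below computes the Calinski-Harabasz pseudo F and R squared directly from kmeans() output using only base R, then plots R squared against the number of clusters. The iris measurements and the range of k are stand-ins chosen for illustration and are not part of the original question; the sketch does not attempt the Bonferroni-corrected significance test.

```r
# Stand-in data: the four numeric columns of the built-in iris data set.
x <- scale(iris[, 1:4])
n <- nrow(x)
ks <- 2:8

set.seed(123)
stats <- t(sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  c(k = k,
    # share of the total sum of squares explained by the partition
    r_squared = fit$betweenss / fit$totss,
    # Calinski-Harabasz pseudo F: between / within mean squares
    pseudo_F = (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k)))
}))
stats

plot(stats[, "k"], stats[, "r_squared"], type = "b",
     xlab = "Number of clusters", ylab = "R squared")
```

The same loop could be run over cutree() results from hclust() if the hierarchical route is preferred; in that case the sums of squares would have to be computed by hand.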

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352



[R] Cluster Analysis

2012-04-14 Thread Taisa Brown
Hi,

I was wondering what the best equivalent to SAS's FASTCLUS and PROC CLUSTER 
would be.  I need to be able to test the significance of the clusters by 
comparing the probability of obtaining an equal or greater pseudo F to the 
Bonferroni-corrected level. I will also need to plot r squared against the 
number of clusters.

Thanks so much,

Taisa

[[alternative HTML version deleted]]



[R] cluster analysis with pairwise data

2012-04-04 Thread paladini

Hello,
I want to do a cluster analysis with my data. The problem is that the 
variables don't consist of single values; the entries are pairs of 
values.

That looks like this:


Variable 1:   Variable 2:   Variable 3:   ...
(1,2)         (1,5)         (4,2)
(7,8)         (3,88)        (6,5)
(4,7)         (12,4)        (4,4)
.             .             .
.             .             .
.             .             .
Is it possible to perform a cluster analysis with this kind of data in 
R?
I don't even know how to get this data into a matrix or a data frame or 
anything like that.


It would be really nice if somebody could help me.

Best regards and happy Easter

Claudia



Re: [R] cluster analysis with pairwise data

2012-04-04 Thread David L Carlson
You can create distance matrices for each Variable, square them, sum them,
and take the square root. As for getting the data into a data frame, the
simplest would be to enter the three variables into six columns like the
following:

> data
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    1    5    4    2
[2,]    7    8    3   88    6    5
[3,]    4    7   12    4    4    4

Then use dist() on each pair of columns:

1:2, 3:4, 5:6 . . .

e.g. for the 3 rows of data you provided

# start from an all-zero "dist" object with one entry per pair of rows
dm <- dist(rep(0, nrow(data)))
for(i in seq(1, 6, 2)) {
  # add the squared distances for the pair of columns i, i+1
  dm <- dm + dist(data[,i:(i+1)])^2
}
dm <- sqrt(dm)
dm
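
A quick check of the recipe above, written as a self-contained sketch: because squared Euclidean distances add across columns, summing the squared per-pair distances and taking the square root should reproduce dist() applied to all six columns at once. The matrix re-enters the three example rows; passing the result to hclust() with average linkage is only one illustrative choice, not a recommendation.

```r
data <- matrix(c(1, 2,  1,  5, 4, 2,
                 7, 8,  3, 88, 6, 5,
                 4, 7, 12,  4, 4, 4),
               nrow = 3, byrow = TRUE)

# Sum squared Euclidean distances over the column pairs 1:2, 3:4, 5:6.
pair_starts <- seq(1, ncol(data), by = 2)
sq <- Reduce(`+`, lapply(pair_starts, function(i) dist(data[, i:(i + 1)])^2))
dm <- sqrt(sq)

# Compare with the ordinary Euclidean distance on all six columns.
all.equal(as.vector(dm), as.vector(dist(data)))

# The combined distances can go straight into a hierarchical clustering.
hc <- hclust(dm, method = "average")
plot(hc)
```

The per-pair route only becomes necessary if the pairs should be weighted or scaled differently before they are combined.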

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352





Re: [R] cluster analysis with pairwise data

2012-04-04 Thread Petr Savicky
Hi.

The data as they are may be read into R as character data. The
exact way depends on the format of the data in the file. The
result may look like the following.

  Var1 <- c("(1,2)", "(7,8)", "(4,7)")
  Var2 <- c("(1,5)", "(3,88)", "(12,4)")
  Var3 <- c("(4,2)", "(6,5)", "(4,4)")
  DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)

If you want to use a distance between pairs that depends on the
numbers (and not only on whether two pairs are equal or different),
then the data should be transformed to a numeric format. For example,
as follows

  trans <- function(x)
  {
      # strip the parentheses, then split each entry at the comma
      y <- strsplit(gsub("[()]", "", x), ",")
      # one row per entry, two numeric columns
      unname(t(vapply(y, FUN=as.numeric, FUN.VALUE=c(0, 0))))
  }

  DF <- data.frame(Var1=trans(Var1), Var2=trans(Var2), Var3=trans(Var3))
  DF

    Var1.1 Var1.2 Var2.1 Var2.2 Var3.1 Var3.2
  1      1      2      1      5      4      2
  2      7      8      3     88      6      5
  3      4      7     12      4      4      4

Then, see library(help=cluster).

Hope this helps.

Petr Savicky.



Re: [R] cluster analysis with pairwise data

2012-04-04 Thread ilai
On Wed, Apr 4, 2012 at 10:12 AM, Petr Savicky savi...@cs.cas.cz wrote:
> On Wed, Apr 04, 2012 at 01:32:10PM +0200, paladini wrote:
>
>   Var1 <- c("(1,2)", "(7,8)", "(4,7)")
>   Var2 <- c("(1,5)", "(3,88)", "(12,4)")
>   Var3 <- c("(4,2)", "(6,5)", "(4,4)")
>   DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)
>
> If you want to use a distance between pairs depending on the
> numbers (and not only equal/different pair), then the data should
> to be transformed to a numeric format.

Or, if each pair has a unique meaning, ?daisy, also in the cluster
package, comes in handy (in this case you'll want to keep the Vi as
factors in the call to data.frame()).
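
A minimal sketch of the daisy() route, assuming each pair is treated as an unordered label rather than as two numbers. The small data frame is made up for illustration (it repeats some pairs so that some rows agree), and k = 2 is an arbitrary choice. With all-factor columns and metric = "gower", the dissimilarity between two rows is the share of variables on which they differ.

```r
library(cluster)  # recommended package that ships with R

# Made-up rows: pairs stored as factor labels, not as numbers.
DF <- data.frame(
  Var1 = c("(1,2)", "(1,2)", "(4,7)", "(4,7)"),
  Var2 = c("(1,5)", "(3,88)", "(3,88)", "(12,4)"),
  Var3 = c("(4,2)", "(4,2)", "(4,4)", "(4,4)"),
  stringsAsFactors = TRUE
)

# Gower dissimilarity: fraction of variables whose labels differ.
d <- daisy(DF, metric = "gower")
round(as.matrix(d), 2)

# k-medoids on the dissimilarities; k = 2 is an arbitrary choice here.
fit <- pam(d, k = 2)
fit$clustering
```

Rows 1 and 2 differ only in Var2, so their dissimilarity is 1/3, while rows 1 and 3 share no label and sit at the maximum of 1.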

Cheers



[R] cluster analysis on extreme event

2011-05-27 Thread FMH
Dear all,

I'm modelling extreme rainfall, particularly events that lie above a threshold, 
and was searching for a suitable package in R that would enable a cluster 
analysis of those extreme events. I would really appreciate any 
suggestions.

Thanks,
Fir



[R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand
Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The
variables represent labour market status during 36 months; there are 8
different variable values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are
included in the data set.

To analyse sub sets of the data, I have used daisy in the
cluster-package to create a distance matrix and then used pam (or pamk
in the fpc-package), to get a k-medoids cluster-solution. Now I want
to analyse the whole set.

clara is said to cope with large data sets, but the first step in the
cluster analysis, the creation of the distance matrix must be done by
another function since clara only works with numeric data.

Is there an alternative to the daisy - clara route that does not
require as much RAM?
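
To put a number on the RAM problem: a full pairwise dissimilarity object stores n(n-1)/2 values. The back-of-the-envelope sketch below assumes 8-byte doubles and ignores any temporary copies made along the way, so real usage would likely be higher.

```r
n_cases <- 50000
n_pairs <- n_cases * (n_cases - 1) / 2   # entries in the lower triangle
bytes   <- n_pairs * 8                   # assuming 8 bytes per double
gib     <- bytes / 2^30
round(gib, 1)  # about 9.3 GiB
```

By contrast, clara runs pam on subsamples of size sampsize, so it never needs the full n-by-n dissimilarity object.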

What functions would you recommend for a cluster analysis of this kind
of data on large data set?


regards,

Hans Ekbrand



Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Christian Hennig

Dear Hans,

clara doesn't require a distance matrix as input (and therefore doesn't 
require you to run daisy), it will work with the raw data matrix using
Euclidean distances implicitly.
I can't tell you whether Euclidean distances are appropriate in this 
situation (this depends on the interpretation and variables and 
particularly on how they are scaled), but they may be fine at least after 
some transformation and standardisation of your variables.


Hope this helps,
Christian




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand
On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> Dear Hans,
>
> clara doesn't require a distance matrix as input (and therefore
> doesn't require you to run daisy), it will work with the raw data
> matrix using
> Euclidean distances implicitly.
> I can't tell you whether Euclidean distances are appropriate in this
> situation (this depends on the interpretation and variables and
> particularly on how they are scaled), but they may be fine at least
> after some transformation and standardisation of your variables.

The variables are unordered factors, stored as integers 1:9, where 

1 means Full-time employment
2 means Part-time employment
3 means Student
4 means Full-time self-employee
...

Do Euclidean distances make sense on unordered factors coded as
integers?



Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand
To be clear, here is an extract

> my.df.full[900:910, 16:19]
    PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
900                 2                 2                 1                 2
901                 1                 1                 1                 1
902                 1                 1                 1                 1
903                 2                 2                 2                 2
904                 1                 1                 1                 1
905                 2                 2                 2                 2
906                 7                 8                 2                 7
907                 5                 5                 5                 5
908                 1                 1                 1                 1
909                 1                 1                 1                 1
910                 1                 1                 1                 1

> class(my.df.full[,16])
[1] "integer"



Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Peter Langfelder
On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand h...@sociologi.cjb.net wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means Full-time employment
> 2 means Part-time employment
> 3 means Student
> 4 means Full-time self-employee
> ...
>
> Does euclidean distances make sense on unordered factors coded as
> integers?

It probably doesn't. You said you have some 36 observations for each
case, correct? You can turn these 36 observations into a vector of
length 36 * 9 on which Euclidean distance will make some sense, namely
k changes will produce a distance of sqrt(2*k). For each observation
with value p (p between 1 and 9), create a vector r = c(0,0,1,0,...0)
where the entry 1 is in the p-th component. Hence, if values p1 and p2
are the same, the Euclidean distance between r1 and r2 is zero; if they
are not the same, the Euclidean distance is sqrt(2).

Here's some possible R code:


transform = function(obsVector, maxVal)
{
  templateMat = matrix(0, maxVal, maxVal);
  diag(templateMat) = 1;

  return(as.vector(templateMat[, obsVector]));
}

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1


trafoData = apply(data, 2, transform, maxVal = max);

> trafoData
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0


The code assumes that cases are in columns and observations in rows of
data. Examine data and trafoData to see how the transformation works.
Once you have the transformed data, simply apply your favorite
clustering method that uses Euclidean distance.
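
As one possible way to carry this through to a clustering on a large data set, here is a sketch that puts cases in rows (the layout clara expects), builds the indicator coding, checks the sqrt(2*k) property for one pair of cases, and runs clara. The toy data, the helper name one_hot, and the choices k = 3 and samples = 20 are all assumptions made for illustration.

```r
library(cluster)  # recommended package that ships with R; provides clara()

# Hypothetical toy data: 200 cases (rows) x 6 months (columns), statuses 1..4.
set.seed(1)
n_cases <- 200; n_months <- 6; n_levels <- 4
status <- matrix(sample(n_levels, n_cases * n_months, replace = TRUE),
                 n_cases, n_months)

# Indicator coding: each month becomes n_levels 0/1 columns.
one_hot <- function(x, n_levels) {
  do.call(cbind, lapply(seq_len(ncol(x)), function(j) {
    outer(x[, j], seq_len(n_levels), "==") * 1
  }))
}
indicators <- one_hot(status, n_levels)

# Two cases that differ in k_diff months are sqrt(2 * k_diff) apart.
k_diff <- sum(status[1, ] != status[2, ])
d12 <- sqrt(sum((indicators[1, ] - indicators[2, ])^2))
stopifnot(isTRUE(all.equal(d12, sqrt(2 * k_diff))))

# clara works on the raw indicator matrix; no full distance matrix is built.
fit <- clara(indicators, k = 3, metric = "euclidean", samples = 20)
table(fit$clustering)
```

With 36 months and 9 status codes the indicator matrix would have 324 columns, which is still far smaller than a 50,000 by 50,000 distance matrix.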

HTH,

Peter




Re: [R] cluster analysis: predefined clusters

2010-12-01 Thread deriK2000


Peter Langfelder wrote:
 
 On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert derik2...@yahoo.de wrote:
 Dear list,

 running a hierarchical cluster analysis I want to define a number of
 objects that build a cluster already. In other words: I want to force
 some of the cases to be in the same cluster from the start of the
 algorithm.

 Any hints? Thanks in advance!
 
 The hclust function has an argument 'members' that should allow you to
 do that. You will need to specify the dissimilarity matrix
 accordingly.
 
 Peter
 
 

Thank you! But specifying the dissimilarity matrix correctly seems to be
a major task. Has anyone done so so far?
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-predefined-clusters-tp3060433p3067215.html
Sent from the R help mailing list archive at Nabble.com.



[R] cluster analysis: predefined clusters

2010-11-26 Thread Derik Burgert
Dear list,
 
running a hierarchical cluster analysis I want to define a number of objects 
that build a cluster already. In other words: I want to force some of the cases 
to be in the same cluster from the start of the algorithm.
 
Any hints? Thanks in advance!
 
Derik




Re: [R] cluster analysis: predefined clusters

2010-11-26 Thread Peter Langfelder
On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert derik2...@yahoo.de wrote:
> Dear list,
>
> running a hierarchical cluster analysis I want to define a number of objects
> that build a cluster already. In other words: I want to force some of the
> cases to be in the same cluster from the start of the algorithm.
>
> Any hints? Thanks in advance!

The hclust function has an argument 'members' that should allow you to
do that. You will need to specify the dissimilarity matrix
accordingly.

Peter
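
One way the 'members' route could look is sketched below, following the
combination that the hclust help page describes as easy to compute by
hand: centroid linkage on squared Euclidean distances between cluster
means. The data, the grouping vector and all variable names are made up
for illustration.

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 10, ncol = 2)   # 10 hypothetical cases, 2 variables

# Cases 1-3 are forced into one starting cluster; every other case
# starts as a singleton.
group <- c(1, 1, 1, 2:8)

size <- as.vector(table(group))        # number of cases per starting cluster
cent <- rowsum(x, group) / size        # centroid (mean) of each starting cluster

# With 'members' given, the dissimilarities are read as being between
# clusters: here squared Euclidean distances between the centroids.
hc <- hclust(dist(cent)^2, method = "centroid", members = size)

# Map the tree back to the original cases, e.g. for a 3-cluster solution
cl <- unname(cutree(hc, k = 3)[as.character(group)])
```

Cases 1 to 3 necessarily share a label in cl, because the algorithm
never sees them as separate objects. Other linkage methods would need
the between-cluster dissimilarities computed in their own way.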



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-09-27 Thread abanero

Hi Ulrich,
I'm studying the principles of Affinity Propagation and I'm really glad to
use your package (apcluster) to cluster my data. I have just one
issue to solve.

If I apply the function apcluster(sim),

where sim is the similarity matrix derived from my dissimilarities, I
sometimes encounter the warning message:

Algorithm did not converge. Turn on details
and call plot() to monitor net similarity. Consider
increasing maxits and convits, and, if oscillations occur
also increasing damping factor lam.
 
together with too many clusters.

I thought I could solve the problem by setting the argument p of
apcluster() to mean(preferenceRange(sim)):


apcluster(sim, p=mean(preferenceRange(sim)))

and it actually seems to be a good solution, because I no longer receive
any warning message and the number of clusters is smaller.

Do you think it's a good solution? I should add that I have to use
apcluster() in an automatic procedure, so I can't adjust the arguments of
the function by hand.

Thanks in advance.
Giuseppe
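
For an automatic procedure, the settings named in the warning (maxits,
convits, lam) can be fixed up front together with p. This is only a
sketch on made-up data; the particular values are assumptions, not
recommendations, and the code is skipped when apcluster is not
installed.

```r
if (requireNamespace("apcluster", quietly = TRUE)) {
  library(apcluster)
  set.seed(42)
  # Two hypothetical groups of 20 points each in two dimensions
  x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

  sim <- negDistMat(x, r = 2)        # similarities: negative squared distances
  pr  <- preferenceRange(sim)        # lower and upper end of useful preferences

  # Assumed settings: a mid-range preference, more iterations and
  # stronger damping than the defaults
  res <- apcluster(sim, p = mean(pr),
                   maxits = 2000, convits = 200, lam = 0.95)

  n_clusters <- length(res@exemplars)   # number of clusters found
}
```

Lower preferences generally give fewer clusters, so where p sits within
the range returned by preferenceRange() is a tuning choice.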
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2715278.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Cluster analysis

2010-07-27 Thread Jim Porzak
Pablo, we've had success using
http://mephisto.unige.ch/traminer/preview.shtml to look at marketing paths.
The question would be: how many distinct case step descriptions are there?

HTH, Jim



Re: [R] Cluster analysis

2010-07-27 Thread Pablo Cerdeira
Hi Allan,

It helps a lot. I'll try to read more about it.

But, as you asked, here is a brief explanation of the relevant
columns of the sample data pasted at the end:

id_processo: identifies a legal case; it is its primary key.
ordem_andamento: the step number within a legal case (id_processo).
id_andamento: the primary key of the step.

I'd like to identify the most common sequences (ordem_andamento) of steps
(id_andamento) across a large number of legal cases (id_processo). A
cluster analysis with a dendrogram plot is probably what I'm looking for.
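
A rough base-R sketch of that idea follows, on a tiny made-up table that
reuses the three column names above. It is not the TraMineR approach
mentioned elsewhere in this thread: each step code is mapped to one
letter (which only works for up to 26 distinct codes, far fewer than in
the real data), each case becomes a string, and cases are compared by
edit distance.

```r
# Hypothetical toy data: four cases with two or three steps each
steps <- data.frame(
  id_processo     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  ordem_andamento = c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 3),
  id_andamento    = c(208, 69, 180, 208, 69, 352, 208, 69, 208, 69, 180)
)

# Put the steps of every case in order
steps <- steps[order(steps$id_processo, steps$ordem_andamento), ]

# One letter per distinct step code
codes <- sort(unique(steps$id_andamento))
steps$sym <- LETTERS[match(steps$id_andamento, codes)]

# One string per case, e.g. "CAB"
paths <- sapply(split(steps$sym, steps$id_processo), paste, collapse = "")

# Most common complete paths first
path_counts <- sort(table(paths), decreasing = TRUE)

# Edit distance between cases, counted in steps, and a cluster tree
d  <- adist(paths)
hc <- hclust(as.dist(d), method = "average")
```

plot(hc) would draw the dendrogram. With several thousand cases the
full distance matrix becomes large, so a dedicated sequence-analysis
package may be the more practical route.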

Here is the sample, covering two different legal cases (two different
id_processo values):

Best regards, and thank you in advance

id_processo,proc_num,ordem_andamento,id_andamento,andamento,data,dias,origem_tribunal,data_entrada,relator,duracao_dias
1480010,1,1,208,DISTRIBUIDO,1988-10-06 00:00:00,5,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,2,69,CONCLUSAO,1988-10-06 00:00:00,0,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,3,180,DESPACHO ORDINATORIO,1988-10-11 00:00:00,8,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,4,465,PEDIDO DE INFORMACOES,1988-10-19 00:00:00,1,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,5,465,PEDIDO DE INFORMACOES,1988-10-20 00:00:00,15,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,6,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-11-04
00:00:00,24,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN.
CÉLIO BORJA,1251
1480010,1,7,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-11-28
00:00:00,0,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN.
CÉLIO BORJA,1251
1480010,1,8,69,CONCLUSAO,1988-11-28 00:00:00,38,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,9,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-01-05
00:00:00,874,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN.
CÉLIO BORJA,1251
1480010,1,10,26,AUTOS DEVOLVIDOS,1991-05-29 00:00:00,8,FÓRUM DA COMARCA
DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,11,75,CONCLUSOS AO RELATOR,1991-05-29 00:00:00,0,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,12,578,VISTA AO ADVOGADO-GERAL DA UNIAO,1991-06-06
00:00:00,232,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN.
CÉLIO BORJA,1251
1480010,1,13,507,RECEBIMENTO DOS AUTOS,1992-01-24 00:00:00,10,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,14,75,CONCLUSOS AO RELATOR,1992-02-03 00:00:00,21,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,15,284,JULG. POR DESPACHO - NEGADO SEGUIMENTO,1992-02-24
00:00:00,3,FÓRUM DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN.
CÉLIO BORJA,1251
1480010,1,16,497,PUBLICADO DESPACHO NO DJ,1992-02-27 00:00:00,12,FÓRUM
DA COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,17,163,DECORRIDO O PRAZO,1992-03-10 00:00:00,0,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480010,1,18,34,BAIXA AO ARQUIVO DO STF,1992-03-10 00:00:00,0,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-06 00:00:00,MIN. CÉLIO BORJA,1251
1480183,2,1,208,DISTRIBUIDO,1988-10-12 00:00:00,8,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,2,69,CONCLUSAO,1988-10-12 00:00:00,0,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,3,352,JULGAMENTO NO PLENO,1988-10-20 00:00:00,22,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,4,476,PETICAO AVULSA,1988-11-11 00:00:00,13,FÓRUM DA COMARCA
DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,5,531,REMESSA DOS AUTOS,1988-11-11 00:00:00,0,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,6,495,PUBLICADO ACORDAO, DJ:,1988-11-24 00:00:00,11,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,7,163,DECORRIDO O PRAZO,1988-12-05 00:00:00,8,FÓRUM DA
COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,8,241,INFORMACOES RECEBIDAS, OFICIO NRO.:,1988-12-13
00:00:00,63,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN.
PAULO BROSSARD,6677
1480183,2,9,69,CONCLUSAO,1988-12-13 00:00:00,0,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,10,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-02-14
00:00:00,83,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN.
PAULO BROSSARD,6677
1480183,2,11,69,CONCLUSAO,1989-05-08 00:00:00,91,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,12,584,VISTA AO PROCURADOR-GERAL DA REPUBLICA,1989-08-07
00:00:00,21,FÓRUM DA COMARCA DE RANCHARIA,1988-10-12 00:00:00,MIN.
PAULO BROSSARD,6677
1480183,2,13,69,CONCLUSAO,1989-08-28 00:00:00,2,FÓRUM DA COMARCA DE
RANCHARIA,1988-10-12 00:00:00,MIN. PAULO BROSSARD,6677
1480183,2,14,484,PROCESSO A JULGAMENTO - 

Re: [R] Cluster analysis

2010-07-27 Thread Pablo Cerdeira
Hi Jim,

Wow! Very nice work at http://mephisto.unige.ch/traminer/preview.shtml. I'm
going to read more about it.

I have a lot of different steps, in sequence: 586 distinct possible steps
across 4269 legal cases, with a maximum of 379 steps per case.

If you want, I can send this dataset to you.

Best regards and thank you very much,







-- 
*Pablo de Camargo Cerdeira*
pa...@fgv.br
pablo.cerde...@gmail.com
+55 (21) 3799-6065



[R] Cluster analysis

2010-07-26 Thread Pablo Cerdeira
Hi all,

I have no idea if this question is too easy to be answered, but I'm just
starting with R. So, here we go.

I have a large dataset with many steps of judicial cases. A sample is
attached.

I'd like to do a cluster analysis to try to understand which is the most
usual path followed by these legal cases.

After that, I'd like to plot a cluster tree.

In the attached sample, the columns are:

- id_processo is the primary key of a legal case;
- number is the step number in the legal case;
- andamento is the description of the legal case step.

I have no idea how to do it using R. Can someone help me?

Thanks in advance

-- 
*Pablo de Camargo Cerdeira*
pa...@fgv.br
pablo.cerde...@gmail.com
+55 (21) 3799-6065


Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer

abanero wrote:

> Do you know something like “knn1” that works with categorical variables
> too?
> Do you have any suggestion?

There are surely plenty of clustering algorithms around that do not require
a vector space structure on the inputs (as KNN does). I think
agglomerative clustering would solve the problem, as would kernel-based
clustering (assuming that you have a positive semi-definite measure
of the similarity of two samples). Probably the simplest way is Affinity
Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
see the CRAN package apcluster, which I have co-developed). All you need is
a way of measuring the similarity of samples, which is straightforward both
for numerical and categorical variables, as well as for mixtures of both
(the choice of the similarity measures and how to aggregate the different
variables is left to you, of course). Your final classification task can
be accomplished simply by assigning the new sample to the cluster whose
exemplar is most similar.
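
The last step might be sketched as follows. To stay within packages
shipped with R, the sketch swaps in pam() medoids from package cluster
as the exemplars instead of affinity propagation, and uses Gower
dissimilarities from daisy(); the data and all names are made up.

```r
library(cluster)

# Hypothetical training data with one numerical and one categorical variable
train <- data.frame(
  size = c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8),
  type = factor(c("a", "a", "a", "b", "b", "b"))
)

fit <- pam(daisy(train, metric = "gower"), k = 2)
exemplars <- fit$id.med            # row numbers of the medoids (the exemplars)

# A new observation with the same columns and factor levels
new_obs <- data.frame(size = 4.9,
                      type = factor("b", levels = levels(train$type)))

# daisy() only returns all mutual dissimilarities, so append the new row
# and pick out its dissimilarities to the exemplars
d_all <- as.matrix(daisy(rbind(train, new_obs), metric = "gower"))
d_new <- d_all[nrow(train) + 1, exemplars]

# Cluster of the most similar exemplar
assigned <- unname(fit$clustering[exemplars][which.min(d_new)])
```

Note that Gower's coefficient scales numerical variables by their
observed range, so appending new rows can shift the dissimilarities
slightly compared with those used for the clustering.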

Joris Meys wrote:

> Not a direct answer, but from your description it looks like you are
> better off with supervised classification algorithms instead of
> unsupervised clustering.

If you say that this is a purely supervised task that can be solved without
clustering, I disagree. abanero does not mention any class labels, so it
seems to me that it is indeed necessary to do unsupervised clustering first.
However, I agree that the second task of assigning new samples to
clusters/classes/whatever can also be solved by almost any supervised
technique if samples are first labeled according to their cluster
membership.

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2232902.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig

Dear abanero,

In principle, k nearest neighbours classification can be computed on 
any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume 
Euclidean vectors as input, which restricts their use.


I'd probably compute an appropriate dissimilarity between points (have a 
look at Gower's distance in daisy, package cluster), and then implement 
nearest neighbours classification myself if I needed it. It should be 
pretty straightforward to implement.


If you want unsupervised classification (clustering) instead, you then have
the choice between all kinds of dissimilarity-based algorithms (hclust, pam,
agnes etc.).
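
A minimal sketch of that route on made-up mixed data follows. The
column names, the linkage method and the number of clusters are
arbitrary assumptions.

```r
library(cluster)

# Hypothetical data: one numerical, one nominal and one ordinal variable
obs <- data.frame(
  amount = c(1.5, 2.0, 1.8, 20, 25, 22),
  kind   = factor(c("x", "x", "y", "z", "z", "y")),
  level  = factor(c("low", "low", "medium", "high", "high", "medium"),
                  levels = c("low", "medium", "high"), ordered = TRUE)
)

# Gower dissimilarities handle the mixed variable types
d <- daisy(obs, metric = "gower")

# Two of the dissimilarity-based methods named above
groups_hc  <- cutree(hclust(d, method = "average"), k = 2)
groups_pam <- pam(d, k = 2)$clustering
```

agnes(d) from the same package could be used in place of hclust(d) in
the same way.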


Christian



*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread abanero

Hi,

thank you Joris and Ulrich for your answers.

Joris Meys wrote: 

> see the library randomForest for example


I'm trying to find an example in randomForest with categorical variables
but I haven't found anything. Do you know any example with both categorical
and numerical variables? Anyway, I don't have any class labels yet. How could 
I find clusters with randomForest? 


Ulrich wrote:

> Probably the simplest way is Affinity Propagation[...] All you need is a
> way of measuring the similarity of samples which is straightforward both
> for numerical and categorical variables.

I had a look at the documentation of the package apcluster. That's
interesting, but do you have any example using it with both categorical and
numerical variables? I'd like to test it with a large dataset.

Thanks a lot!
Cheers

Giuseppe

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2232950.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Joris Meys
Hi Abanero,

first, I have to correct myself. Knn1 is a supervised learning algorithm, so
my comment wasn't completely correct. In any case, if you want to do a
clustering prior to a supervised classification, the function daisy() can
handle any kind of variable. The resulting distance matrix can be used with
a number of different methods.

And you're right, randomForest doesn't handle categorical variables either.
So I haven't been of great help here...
Cheers
Joris





-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Joris Meys
I'm confusing myself :-)

randomForest cannot handle character vectors as predictors (which is why I
found out, to my surprise, that a categorical variable could not be used in
the function). It can handle categorical variables as predictors IF they are
put in as factors.

Obviously it can also handle a categorical variable as the response.

I hope I'm not going to add any more mistakes; it's been enough for the
day...
Cheers
Joris
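
As a sketch of both points (factor predictors, and the earlier question
of finding clusters without class labels): randomForest runs in
unsupervised mode when no response is given and can return a proximity
matrix, which can then be clustered. The data are made up, the choice of
hclust with average linkage is an assumption, and the code is skipped
when randomForest is not installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(1)

  # Hypothetical data: one numerical predictor and one categorical
  # predictor stored as a factor (not as a character vector)
  obs <- data.frame(
    size = c(rnorm(10, mean = 1), rnorm(10, mean = 5)),
    type = factor(rep(c("a", "b"), each = 10))
  )

  # No response given: unsupervised mode, with proximities between rows
  rf <- randomForest(x = obs, proximity = TRUE)

  # Turn proximities into dissimilarities and cluster them
  hc <- hclust(as.dist(1 - rf$proximity), method = "average")
  groups <- cutree(hc, k = 2)
}
```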

On Thu, May 27, 2010 at 2:08 PM, steve_fried...@nps.gov wrote:

> Joris,
>
> I've been following this thread for a few days as I am beginning to use
> randomForest in my work.  I am confused by your last email.
>
> What do you mean that randomForest does not handle categorical variables?
>
> It can be used in either regression or classification analysis.  Do you
> mean that categorical predictors are not suitable? Certainly they are
> suitable as the response.
> Would you be so kind as to clarify what you were suggesting?
>
> Thanks,
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> steve_fried...@nps.gov
> Office (305) 224 - 4282
> Fax (305) 224 - 4147








-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


> I had a look at the documentation of the package apcluster.
> That's interesting but do you have any example using it with both
> categorical and numerical variables? I'd like to test it with a large
> dataset.

Your posting has opened my eyes: problems where both numerical and
categorical features occur are probably among the most attractive
applications of affinity propagation. So I am considering including such an
example in a future release.

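Here is a very crude example (download the file imports-85.data from
http://archive.ics.uci.edu/ml/machine-learning-databases/autos/ first):

 library(cluster)
 library(apcluster)
 # stringsAsFactors=TRUE keeps text columns as factors, which daisy() needs
 # (this was the default of read.table() when this was posted)
 automobiles <- read.table("imports-85.data", header=FALSE, sep=",",
                           na.strings="?", stringsAsFactors=TRUE)
 sim <- -as.matrix(daisy(automobiles))
 apcluster(sim)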

The most essential part here is to use daisy() from the package cluster
for computing distances/similarities. Have a look at the help page of
daisy() to get a better impression of how it works and how to tailor the
distance/similarity calculations to your needs.

I do not know whether this is a good data set for clustering. Affinity
propagation produces quite a number of clusters. Maybe fiddling with the
input preferences is necessary (see Section 4 of the vignette of package
apcluster).

Best regards,
Ulrich


-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233053.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer

Sorry, Joris, I overlooked that you already mentioned daisy() in your
posting. I should have credited your recommendation in my previous message.

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233055.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread abanero


Ulrich wrote: 
> Affinity propagation produces quite a number of clusters. 


I tried with q=0 and it produces 17 clusters. Anyway, that's a good idea,
thanks. I'm looking forward to testing it with my dataset.

So I'll probably use daisy() to compute an appropriate dissimilarity, then
apcluster() or another method to determine clusters.

What do you suggest for assigning a new observation to one of the
resulting clusters?

It seems that randomForest doesn't work with both numerical and categorical
predictors (thanks to Joris).

Christian wrote: 
> and then implement
> nearest neighbours classification myself if I needed it. 
> It should be pretty straightforward to implement. 

Do you mean modifying the code of the knn1() function yourself?


Thanks to everyone!

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233210.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig



>> Christian wrote:
>>
>> and then implement
>> nearest neighbours classification myself if I needed it.
>> It should be pretty straightforward to implement.
>
>
> Do you intend to modify the code of the knn1() function yourself?


No; if you understand what the nearest neighbours method does, it's not 
very complicated to implement it from scratch (assuming that your dataset 
is small enough that you don't have to worry too much about optimising 
computing times). A bit of programming experience is required, though. 
(It's not that I intend to do it right now, I suggest that you do it if 
you can...)
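
A minimal sketch of what such a from-scratch nearest-neighbour step might look
like for mixed numeric and categorical data. It assumes the recommended
package cluster is installed; the helper name nn1_gower and the toy data are
hypothetical and only serve as illustration. Note that Gower's coefficient
scales numeric variables by their range, so appending the new observation can
change that range slightly.

```r
library(cluster)  # provides daisy()

# Hypothetical helper: assign one new observation the label of its nearest
# training observation under Gower dissimilarity.
nn1_gower <- function(train, labels, new_obs) {
  stopifnot(nrow(new_obs) == 1, identical(names(train), names(new_obs)))
  combined <- rbind(train, new_obs)            # new observation becomes the last row
  d <- as.matrix(daisy(combined, metric = "gower"))
  n <- nrow(train)
  nearest <- which.min(d[n + 1, seq_len(n)])   # closest training row
  labels[nearest]
}

# Toy data (made up): one numeric and one categorical attribute
train <- data.frame(size = c(1.0, 1.2, 5.0, 5.5),
                    type = factor(c("a", "a", "b", "b")))
cluster_id <- c(1, 1, 2, 2)
new_obs <- data.frame(size = 5.2, type = factor("b", levels = c("a", "b")))

assigned <- nn1_gower(train, cluster_id, new_obs)
```

For a dataset of about 1,000 rows, recomputing the full dissimilarity matrix
for each new observation is wasteful but should still be tolerable.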


Christian







*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


> What do you suggest in order to assign a new observation to a determined
> cluster?

As I mentioned already, I would simply assign the new observation to the
cluster whose exemplar it is most similar to (in a
knn1-like fashion). To compute these similarities, you can use the daisy()
function. However, you have to do some tricks, since daisy() is designed for
computing square matrices of all mutual distances for a given data set. I
did not find another function that is better suited (e.g. a function that
simply computes the distance between two distinct samples). Maybe others
have an idea. In any case, you have to make sure that the data either remain
unscaled or that your new observation is scaled
with exactly the same parameters that were used for clustering before.
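
The scaling point can be sketched for purely numeric data as follows. This
is only an illustration with made-up numbers: base R's scale() stores the
centring and scaling parameters as attributes, and these can be reused for a
new observation. Mixed-type data handled by daisy() would need its own
treatment.

```r
# Training data (hypothetical): 3 observations, 2 numeric attributes
train <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

scaled <- scale(train)                       # centre and scale column-wise
ctr <- attr(scaled, "scaled:center")         # column means used
scl <- attr(scaled, "scaled:scale")          # column standard deviations used

# A new observation must be transformed with the SAME parameters,
# not with parameters estimated from the new observation itself.
new_obs <- matrix(c(2, 20), nrow = 1)
new_scaled <- scale(new_obs, center = ctr, scale = scl)
```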

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233308.html
Sent from the R help mailing list archive at Nabble.com.



[R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-26 Thread abanero

Hi,
I have 1,000 observations with 10 attributes (of different types: numeric,
dichotomous, categorical, etc.) and a measure M.

I need to cluster these observations in order to assign a new observation
(with the same 10 attributes but not the measure) to a cluster.

For the new observation, I want to calculate a measure as the average of the
measures M of the observations in the assigned cluster.

I would use cluster analysis (the “Clara” algorithm?) and then “knn1” (in
package class) to assign the new observation to a cluster.

The problem is: I’m not able to use “knn1” because some of the attributes are
categorical.

Do you know of something like “knn1” that works with categorical variables
too? Do you have any suggestions?
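
A rough sketch of the workflow described above, under stated assumptions:
pam() on a Gower dissimilarity from daisy() (both in the recommended package
cluster) is swapped in for Clara because it accepts mixed data through a
dissimilarity matrix. The data and variable names are made up.

```r
library(cluster)  # daisy(), pam()

# Toy data (hypothetical): one numeric and one categorical attribute
dat <- data.frame(
  income = c(10, 12, 11, 40, 42, 41),
  region = factor(c("n", "n", "n", "s", "s", "s"))
)
M <- c(1.0, 1.2, 1.1, 3.0, 3.2, 3.1)   # the measure, known for old observations

d   <- daisy(dat, metric = "gower")    # dissimilarity for mixed attribute types
fit <- pam(d, k = 2, diss = TRUE)      # partition into 2 clusters

# Average of M within each cluster; a new observation assigned to
# cluster j would receive cluster_mean_M[as.character(j)].
cluster_mean_M <- tapply(M, fit$clustering, mean)
```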

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2231656.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-26 Thread Joris Meys
Not a direct answer, but from your description it looks like you are better
off with supervised classification algorithms instead of unsupervised
clustering. See the randomForest package, for example. Alternatively, you can
try a logistic regression or a multinomial regression approach, but these
are parametric methods and put requirements on the data. randomForest is
completely non-parametric.

Cheers
Joris





-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



[R] Cluster analysis: dissimilar results between R and SPSS

2010-04-26 Thread Jeoffrey Gaspard
Hello everyone!

My data is composed of 277 individuals measured on 8 binary variables
(1=yes, 2=no). 

I did two similar cluster analyses, one on SPSS 18.0 and one on R 2.9.2. The
objective is to have the means for each variable per retained cluster.

1) The R analysis ran as follows:

 call data
 dist=dist(data,method="euclidean")
 cluster=hclust(dist,method="ward")
 cluster

Call:
hclust(d = dist, method = "ward")

Cluster method   : ward
Distance         : euclidean
Number of objects: 277

 plot(cluster)
 rect.hclust(cluster, k=4, border="red")
 x=rect.hclust(cluster, k=4, border="red")
 sapply(x, function(i) colMeans(data[i,]))
 round(sapply(x, function(i) colMeans(data[i,])),2)

2) The SPSS analysis ran as follows:

Analysis -- Classify -- Hierarchical cluster analysis -- Cluster method=
Ward's method and Distance measure= Interval:  Squared Euclidean distance.
After that, I computed the means of each variable for each cluster.

The problem is I have different results between the two analyses (different
clusters and means).

However, when I use the Euclidean distance (unsquared) in SPSS, I have the
same results! 

I thought the R "euclidean" method meant the usual squared distance between
the two vectors (2 norm) as specified in the documentation, not the
unsquared distance. Is that not the case?

Thanks for the comment!

Jeffrey





Re: [R] Cluster analysis: dissimilar results between R and SPSS

2010-04-26 Thread Tal Galili
Hi Jeoffrey,

How stable are the results in general ?
If you repeat the analysis in R several times, does it yield the same
results ?


Tal

----------------Contact Details:----------------
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
------------------------------------------------






Re: [R] Cluster analysis: dissimilar results between R and SPSS

2010-04-26 Thread Sarah Goslee
I'm not sure why you'd expect Euclidean distance and squared Euclidean
distance to give the same results.

Euclidean distance is the square root of the sum of squared
differences for each variable, and that's exactly what dist() returns.

http://en.wikipedia.org/wiki/Euclidean_distance

On a map, it's the length of the hypotenuse, and you can measure it
with a ruler and get the same number. Euclidean distance has a specific
geometric meaning.

Squared Euclidean distance is not the same thing, and not the standard
definition you seem to be expecting. If that's what you want, then square
the output of dist() before you perform the clustering.
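
Squaring the output of dist() can be sketched as below, on random toy data
rather than the poster's. The method names "ward.D" and "ward.D2" are those
of current R (3.1.0 and later), where the old "ward" option was renamed
"ward.D"; as I read ?hclust, "ward.D2" squares the dissimilarities
internally, so the two calls should give the same tree with heights related
by a square root.

```r
set.seed(42)
x <- matrix(rnorm(40), ncol = 4)        # 10 hypothetical observations

d    <- dist(x, method = "euclidean")   # ordinary (unsquared) Euclidean distance
d_sq <- d^2                             # squared Euclidean; still a "dist" object

hc_sq <- hclust(d_sq, method = "ward.D")   # squared distances supplied by hand
hc_d2 <- hclust(d,    method = "ward.D2")  # squaring done inside hclust()

# Compare merge heights on a common scale
same_heights <- isTRUE(all.equal(sqrt(hc_sq$height), hc_d2$height))
```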

Sarah






-- 
Sarah Goslee
http://www.functionaldiversity.org



[R] cluster analysis :: urgent

2010-04-11 Thread karine heerah

Hi,

How can I do cluster analysis on spatial data (longitude and latitude)?

I used the function clust of the clustTool package and it didn't work
at all:

cl <- clust(dv,3,method="hclustAverage",distMethod="euclidean")

Thanks a lot
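
One possible route using only base R, sketched under assumptions: Euclidean
distance on raw degrees ignores the Earth's curvature, so this example
computes great-circle (haversine) distances and passes them to hclust(). The
helper name haversine_km, the Earth radius of 6371 km and the coordinates
are illustrative choices, not taken from the original post.

```r
# Great-circle distance in km between points given in decimal degrees
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up positions: two near Paris, two near London
pts <- data.frame(lon = c(2.35, 2.40, -0.13, -0.10),
                  lat = c(48.85, 48.90, 51.51, 51.50))

idx  <- seq_len(nrow(pts))
dmat <- outer(idx, idx, function(i, j)
  haversine_km(pts$lon[i], pts$lat[i], pts$lon[j], pts$lat[j]))

hc     <- hclust(as.dist(dmat), method = "average")
groups <- cutree(hc, k = 2)   # cluster membership for each position
```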



Karine HEERAH
 
Master 2 , océanographie et environnements marins
Université Pierre et Marie Curie (Paris 6)




  


Re: [R] cluster analysis labels for dendrogram

2010-03-11 Thread Sarah Goslee
Hi Samantha,

Did you check out the help for plclust? There's a labels argument that
is used to label the leaves of your dendrogram. By default, the rownames
of your dataframe are used.
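
A small sketch of the idea with made-up data: the clustering uses only the
three numeric columns, and the categorical column is passed as leaf labels.
It uses the plot() method for hclust objects, which takes a labels argument
in the same way; plclust itself may not be available in recent versions of
R. The column names are hypothetical.

```r
# Toy data: three numeric variables plus a site factor used only for labelling
dat <- data.frame(
  depth    = c(1.0, 1.2, 5.0, 5.3, 9.1, 9.4),
  temp     = c(20.1, 19.8, 15.2, 15.0, 9.9, 10.3),
  salinity = c(30.0, 30.2, 33.1, 33.0, 35.2, 35.0),
  site     = factor(c("north", "north", "south", "south", "east", "east"))
)

# Cluster on the numeric variables only
hc <- hclust(dist(dat[, c("depth", "temp", "salinity")]))

# Label the leaves with the site of each row
plot(hc, labels = as.character(dat$site))
```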

Sarah

-- 
Sarah Goslee
http://www.functionaldiversity.org



Re: [R] cluster analysis labels for dendrogram

2010-03-11 Thread xian

Hi Samantha,

You can check out the graph and source code on this page:

http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=79


best, Xian

-- 
View this message in context: 
http://n4.nabble.com/cluster-analysis-labels-for-dendrogram-tp1588347p1588790.html
Sent from the R help mailing list archive at Nabble.com.



[R] cluster analysis labels for dendrogram

2010-03-10 Thread Samantha

Hi,

I am clustering data based on three numeric variables.  I have a fourth
variable that is categorical (site) which I would like to use to label the
leaves of my dendrogram, so I can see how the different sites are grouped
throughout the tree, but I do NOT want to use this variable in the cluster
analysis itself.  Is there any way I can do this?

Thanks,
Samantha
-- 
View this message in context: 
http://n4.nabble.com/cluster-analysis-labels-for-dendrogram-tp1588347p1588347.html
Sent from the R help mailing list archive at Nabble.com.



[R] cluster analysis

2010-02-18 Thread Dong He
Hi Folks,

I want to apply cluster analysis to a categorical data set. Could you
recommend some R packages and offer suggestions?

Thanks!

Dong



Re: [R] cluster analysis

2010-02-18 Thread Steve_Friedman
Without knowing what your data set really looks like, I'd look to decision
trees - specifically package rpart with method = "class".

Your problem may not be appropriate in that environment, but it is hard to
say with such a limited explanation of the issues.
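
A minimal sketch of a classification tree on categorical predictors, assuming
a known grouping variable exists (which makes this supervised classification
rather than clustering). The data are invented so that the grouping follows
one predictor exactly.

```r
library(rpart)  # recommended package shipped with R

# Toy categorical data (hypothetical): group depends only on colour
dat <- data.frame(
  colour = factor(rep(c("red", "blue"), each = 20)),
  shape  = factor(rep(c("round", "square"), times = 20))
)
dat$group <- factor(ifelse(dat$colour == "red", "A", "B"))

# Classification tree; method = "class" requests a classification fit
fit  <- rpart(group ~ colour + shape, data = dat, method = "class")
pred <- predict(fit, newdata = dat, type = "class")

accuracy <- mean(pred == dat$group)   # share of rows classified correctly
```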

good luck

Steve Friedman Ph. D.
Spatial Statistical Analyst
Everglades and Dry Tortugas National Park
950 N Krome Ave (3rd Floor)
Homestead, Florida 33034

steve_fried...@nps.gov
Office (305) 224 - 4282
Fax (305) 224 - 4147


   


Re: [R] Cluster analysis: hclust manipulation possible?

2009-11-17 Thread Jopi Harri
On 16.11.2009 19:13, Charles C. Berry wrote:
>> The question: Can this be accomplished in the *dendrogram plot*
>> by manipulating the resulting hclust data structure or by some
>> other means, and if yes, how?
>
> Yes, you need to study
>
>   ?hclust
>
> particularly the part about 'Value' from which you will see what needs
> modification.
>
> Here is a very simple example:
>
> res <- hclust(dist(1-diag(3)*rnorm(3)))
> plot(res)
> res2 <- res
> res2$merge <- rbind(-cbind(1:3,4:6), matrix(ifelse( res2$merge<0,
>   -res2$merge, res2$merge+sum(res2$merge<0)),2))
> res2$height <- c(rep(0,3), res2$height)
> res2$order <- as.vector( rbind(res2$order,(4:6)[res2$order]) )
> plot(res2)
> str( res )
> str( res2 )


Dear Chuck,

Many thanks for spending your valuable time on the suggestions
and the example. However, the drawback is that as a humanist I
have been having considerable difficulties in figuring out what
exactly to do. After hours of experimenting I could modify
another dendrogram (without crashing R), but I still fail to get
the result I want: the added leaf is not attached where I
intend it to be; instead, other adjacent leaves have their
height turned to 0.

The question, to put it more clearly perhaps: Is there any
straightforward procedure to just add a single leaf to any
dendrogram, next to an existing leaf at the height 0, and if
there is, what might that be?

As of now, it seems that the $merge has to be modified correctly,
but what is the exact strategy, if there is one (other than
redoing the whole clustering by hand)?

> Alternatively, you could use as.dendrogram( res ) as the point of
> departure and manipulate the value.

Possibly, yes, but I am even less well-equipped to edit that
sort of data type.


Sincerely,


Jopi Harri
Musicologist
University of Turku
Finland



Re: [R] Cluster analysis: hclust manipulation possible?

2009-11-17 Thread Jopi Harri


 Original Message 
Subject: Re: [R] Cluster analysis: hclust manipulation possible?
Date: Mon, 16 Nov 2009 19:22:54 -0800
From: Charles C. Berry cbe...@tajo.ucsd.edu
To: Jopi Harri jopi.ha...@utu.fi
References: 4b016237.7050...@utu.fi
pine.lnx.4.64.0911160906420.27...@tajo.ucsd.edu
4b01bc5d.3020...@utu.fi


First, read the ?hclust page and see what it says about merge.

Then look at a really simple example like

cl <- hclust( dist( c(1,2,4) ) )

plot(cl)

unclass( cl )

The unclass() strips the class attribute and allows print() to
give you a bit more detail.

Now make the figure a bit more complicated:

cl2 <- hclust(dist(as.matrix(c(1,2,4,4.5))))
plot(cl2)
unclass(cl2)

and see what has changed in $merge, $height, and $order.
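
For readers following along, the components in question can be inspected
directly. This sketch repeats the two toy clusterings and pulls out $merge,
$height and $order. Per ?hclust, row i of $merge describes step i of the
clustering: a negative entry -j means single observation j was merged, and a
positive entry j refers to the cluster formed at the earlier row j.

```r
# Three points on a line; hclust() defaults to complete linkage
cl <- hclust(dist(c(1, 2, 4)))
cl$merge    # one row per merge step
cl$height   # height at which each merge happens
cl$order    # left-to-right order of leaves in the plot

# Adding a fourth point close to the third changes all three components
cl2 <- hclust(dist(c(1, 2, 4, 4.5)))
cl2$merge
cl2$height
cl2$order
```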

Once you get the hang of it, you'll be in a position to modify an
existing hclust object.

Chuck

p.s. it is best to post replies like yours to the whole list;
others may want to know the same thing that you want to know or others may
give a better reply than I have.

  


Charles C. Berry(858) 534-2098
 Dept of
Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu   UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego
92093-0901



Re: [R] Cluster analysis: hclust manipulation possible?

2009-11-17 Thread Jopi Harri
On 17.11.2009 5:22, Charles C. Berry wrote:
> Once you get the hang of it, you'll be in a position to modify an existing
> hclust object.

I believe that I managed to solve the problem. (The code may not
be too refined, and my R is perhaps a bit dialectal. The function
may fail especially if the addition of multiple identical labels
is attempted.)

So, for the addition of a single duplicate label, one needs to
increment the positive values in $merge by one, and keep the
negative values except for the original of the duplicate, which
will be given +1. Then, the duplicate pair [the value for
the new label being -(abs(min($merge))+1)] is added on top of $merge.

The other manipulations involved are the addition of height 0,
the label for the duplicate, and placing it properly in $order.

Once more thanks for the assistance.


Jopi Harri


dup.hclust=function(Hc,Label,DupLabel)
# We add to hclust Hc the duplicate DupLabel of Label.
# May fail in certain conditions, but shouldn't in normal use.
{
if (is.null(Hc$labels)) return("Labels are required!");
Mer=Hc$merge;
Hght=Hc$height;
Ord=Hc$order;
Labs=Hc$labels;
DupLNo=abs(min(Mer))+1;          # index of the new leaf (n+1)
LNo=which(Labs==Label);          # index of the leaf being duplicated
LPlace=which(Labs[Ord]==Label);  # its position in the plotting order
Hght=c(0,Hght);                  # the new merge happens at height 0
Labs=c(Labs,DupLabel);
Ord=append(Ord,DupLNo,after=LPlace[1]);  # place the duplicate next to the original
NewMer=matrix(ifelse(Mer<0,Mer,Mer+1),nrow(Mer));  # shift cluster references by one row
NewMer[NewMer==-LNo]=1;          # the original leaf now sits inside cluster 1
NewMer=as.matrix(rbind(-cbind(LNo,DupLNo),NewMer));
NewMer=cbind(NewMer[,1],NewMer[,2]);  # drop the column names added by cbind()
Hc$merge=NewMer;
Hc$height=Hght;
Hc$order=Ord;
Hc$labels=Labs;
return(Hc);
}
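
As a cross-check of the steps described above, here they are worked by hand
on a three-point example, independently of the function. This is only a
sketch: the labels "a", "b", "c" and the duplicate name "a2" are made up,
and cutree() is used to confirm that the duplicate ends up in the same
cluster as its original when the tree is cut just above height 0.

```r
hc <- hclust(dist(c(a = 1, b = 2, c = 4)))   # labels a, b, c

n    <- length(hc$labels)                    # number of original leaves
orig <- which(hc$labels == "a")              # leaf to duplicate

m <- hc$merge
m <- ifelse(m < 0L, m, m + 1L)               # shift references to earlier merges
m[m == -orig] <- 1L                          # original leaf is now inside merge 1

hc2 <- hc
hc2$merge  <- rbind(c(-orig, -(n + 1L)), m)  # new first step: original + duplicate
hc2$height <- c(0, hc$height)                # ... joined at height 0
hc2$labels <- c(hc$labels, "a2")
hc2$order  <- append(hc$order, n + 1L, after = which(hc$order == orig))

grp <- cutree(hc2, h = 0.5)                  # cut just above the zero-height merge
```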



[R] Cluster analysis: hclust manipulation possible?

2009-11-16 Thread Jopi Harri
I am doing cluster analysis [hclust(Dist, method="average")] on
data that potentially contains redundant objects. As expected,
the inclusion of redundant objects affects the clustering result,
i.e., the data a1 = a2 = a3, b, c, d, e1 = e2 is likely to
cluster differently from the same data without the redundancy,
i.e., a1, b, c, d, e1. This is apparent when the outcome is
visualized as a dendrogram.

Now, it seems that the clustering result for which the redundancy
has been eliminated is more robust for the present assignment
than that of the redundant data. Naturally, there is no problem
in the elimination: just exclude the redundant objects from Dist.

However, it would be very convenient to be able to include the
redundant objects in the *dendrogram* by attaching them as
0-level branches to the subtrees, i.e.:

[ASCII dendrogram, height axis 0.0 to 1.0: leaves a1, a2, a3, b, c, d, e1, e2,
with a2 and a3 joined to a1, and e2 joined to e1, at height 0.0]

instead of

[ASCII dendrogram, height axis 0.0 to 1.0: leaves a1, b, c, d, e1]

The question: Can this be accomplished in the *dendrogram plot*
by manipulating the resulting hclust data structure or by some
other means, and if yes, how?

Jopi Harri



Re: [R] Cluster analysis: hclust manipulation possible?

2009-11-16 Thread Charles C. Berry




Yes, you need to study

?hclust

particularly the part about 'Value' from which you will see what needs 
modification.



Here is a very simple example:


res <- hclust(dist(1-diag(3)*rnorm(3)))
plot(res)
res2 <- res
res2$merge <- rbind(-cbind(1:3,4:6), matrix(ifelse( res2$merge<0, -res2$merge,
res2$merge+sum(res2$merge<0)),2))
res2$height <- c(rep(0,3), res2$height)
res2$order <- as.vector( rbind(res2$order,(4:6)[res2$order]) )
plot(res2)
str( res )
str( res2 )



Alternatively, you could use as.dendrogram( res ) as the point of 
departure and manipulate the value.
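
To give a flavour of the as.dendrogram() route, here is a small sketch that
only relabels leaves with dendrapply(). Adding a new leaf would mean
rebuilding the nested list structure and is more involved. The data and the
relabel helper are hypothetical.

```r
hc   <- hclust(dist(c(a = 1, b = 2, c = 4)))
dend <- as.dendrogram(hc)   # nested list; each node carries attributes

# Hypothetical helper: upper-case the label of every leaf node
relabel <- function(node) {
  if (is.leaf(node)) {
    attr(node, "label") <- toupper(attr(node, "label"))
  }
  node
}

dend2 <- dendrapply(dend, relabel)   # apply the helper to every node
leaf_labels <- labels(dend2)         # leaf labels in plotting order
```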


HTH,

Chuck








Charles C. Berry(858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu   UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



[R] Cluster analysis with missing data

2009-07-14 Thread Hollix

Hi folks,

I tried hclust for the first time. Unfortunately, with missing data in my
data file, it doesn't seem
to work. I found no information about how to handle missing data.

Omitting all cases with missing values is not really an option, as I would
lose too many cases.

Thanks in advance
Holger
-- 
View this message in context: 
http://www.nabble.com/Cluster-analysis-with-missing-data-tp24474486p24474486.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Cluster analysis with missing data

2009-07-14 Thread Bill.Venables
vegdist() in the vegan package optionally allows pairwise deletion of missing 
values when computing dissimilarities.  The result can be used as the first 
argument to hclust().

('Caveat emptor', of course.)



Re: [R] Cluster analysis with missing data

2009-07-14 Thread Gavin Simpson
On Mon, 2009-07-13 at 23:42 -0700, Hollix wrote:
 Hi folks,
 
 I tried for the first time hclust. Unfortunately, with missing data in my
 data file, it doesn't seem
 to work. I found no information about how to consider missing data.
 
 Omission of all missings is not really an option as I would lose too many
 cases.

Holger,

hclust takes a dissimilarity matrix as input, not your data, so the
problem is in finding an appropriate dissimilarity/distance coefficient
that handles missing data.

One such measure is Gower's coefficient, which is implemented in the function
'daisy' in the recommended package 'cluster'. Try:

require(cluster)
?daisy

to read about it.

Also 'vegdist' in package 'vegan' has an ability to not consider
pairwise missingness. See ?vegdist after loading 'vegan' and in
particular, the 'na.rm' argument.

Whether either of these (i.e. the resulting dissimilarities) make sense
for your particular problem is another matter...
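
As a rough illustration of the daisy() route, the sketch below builds Gower
dissimilarities from a small made-up data frame with a few missing values and
passes them to hclust(). The data, the average linkage and k = 3 are
assumptions for illustration only, not recommendations.

```r
library(cluster)  # recommended package, provides daisy()

set.seed(1)
# Hypothetical data: 20 cases, 3 numeric variables, a few NAs
x <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
x$a[c(2, 7)]  <- NA
x$b[c(5, 11)] <- NA

# Gower's coefficient uses only the variables observed in both cases of a pair
d <- daisy(x, metric = "gower")

# A pair with no jointly observed variable would give NA, so check first
stopifnot(!any(is.na(d)))

hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)
table(groups)
```

With vegdist() the corresponding step would be to set its 'na.rm' argument,
as described above.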

HTH

G
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



[R] Cluster analysis, defining center seeds or number of clusters

2009-06-11 Thread amvds
I use kmeans to classify spectral events in high and low 1/3 octave bands:

#Do cluster analysis
CyclA <- data.frame(LlowA, LhghA)
CntrA <- matrix(c(0.9, 0.8, 0.8, 0.75, 0.65, 0.65), nrow = 3, ncol = 2, byrow = TRUE)
ClstA <- kmeans(CyclA, centers = CntrA, nstart = 50, algorithm = "MacQueen")

This works well when the actual data shows 1,2 or 3 groups that are not
too close in a cross plot. The MacQueen algorithm will give one or more
empty groups which is what I want.

However, there are cases when the groups are closer together, less compact
or diffuse which leads to the situation where visually only 2 groups are
apparent but the algorithm returns 3 splitting one group in two.

I looked at the package 'cluster', specifically at clara (cannot use pam as
I have 1 observations). But clara always returns as many groups as you
ask for.

Is there a way to help find a seed for the initial cluster centers?
Equivalently, is there a way to find a priori the number of groups?

I know this is not an easy problem. I have looked at principal components
(princomp, prcomp) because there is a connection with cluster analysis. It
is not obvious to me how to program that connection though.

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek



Re: [R] Cluster analysis, defining center seeds or number of clusters

2009-06-11 Thread Christian Hennig

Dear Alex,

actually fixing the number of clusters in kmeans and then ending up with a 
smaller number because of empty clusters is not a standard method of 
estimating the number of clusters. It may happen (as apparently in some of 
your examples), but it is generally rather unusual. In most cases, kmeans, 
as well as clara, pam and other clustering methods, only give you the 
number of clusters you ask for. Even with some reasonable separation 
between clusters, kmeans cannot generally be expected to come up with empty 
clusters if the number is initially chosen too high or too many 
initial centers are specified.


The help page for pam.object in library cluster shows you a method to 
estimate the optimal number of clusters based on pam.
However, this problem strongly depends on what cluster concept you have in 
mind and what you want to use your clusters for. There are alternative 
indexes that could be optimised to find the best number of clusters. Some 
of them are implemented in the function cluster.stats in package fpc.
I strongly advise reading some literature about this to understand the 
problem better; the help page of cluster.stats gives a few references.
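
A minimal sketch of the pam-based approach, assuming the average silhouette
width is the index being optimised (the data and the candidate range 2:6 are
made up for illustration):

```r
library(cluster)

set.seed(10)
# Made-up data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

ks  <- 2:6
# Average silhouette width of the pam solution for each candidate k
asw <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

best_k <- ks[which.max(asw)]
print(data.frame(k = ks, avg_sil_width = round(asw, 3)))
best_k
```

Other indexes, such as those returned by cluster.stats, could be compared
over the same range of k in the same way.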


Together with Gaussian mixtures, the BIC gives you an estimate of the number 
of clusters; see package mclust.


If you can specify things like maximum within-cluster distances, you may 
get something from using cutree together with a hierarchical clustering 
method in hclust, for example complete linkage.
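
The cutree idea can be sketched as follows. The data and the threshold
'max_within' are hypothetical; with complete linkage, cutting the tree at
height h leaves clusters whose largest internal distance is at most h.

```r
set.seed(4)
x <- matrix(rnorm(100), ncol = 2)     # made-up data: 50 cases, 2 variables
d <- dist(x)

hc <- hclust(d, method = "complete")

max_within <- 1.5                     # hypothetical maximum within-cluster distance
grp <- cutree(hc, h = max_within)     # number of clusters follows from the threshold

# Largest pairwise distance inside any cluster; should not exceed max_within
dm <- as.matrix(d)
worst <- max(sapply(split(seq_along(grp), grp),
                    function(i) max(dm[i, i])))
length(unique(grp))
worst
```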


dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatically determined
number of clusters.

Best regards,
Christian




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



[R] cluster analysis: mean values for each variable and cluster

2009-02-20 Thread jgaspard

Hi all!

I'm new to R and don't know much about it. Because it is free, I managed to
learn it a little bit.

Here is my problem: I did a cluster analysis on 30 observations and 16
variables (monde, figaro, liberation, etc.). Here is the .txt data file:

monde,figaro,liberation,yespeople,nopeople,bxl,europe,ue,union_eur,other,yesmeto,nometo,yesfonc,nofonc,yestone,notone
1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0
0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,1
1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,1
1,0,0,0,1,0,0,0,0,1,0,1,1,0,1,0
1,0,0,0,1,0,0,0,0,1,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,0,1,0,1,0,1,1,0
1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,1
0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0
0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,1,1,0
0,1,0,0,1,0,0,0,1,0,0,1,1,0,1,0
0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,0
0,1,0,1,0,0,1,0,0,0,0,1,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0
0,1,0,0,1,1,0,0,0,0,1,0,0,1,0,1
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0
1,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0


These are the steps I took:

data <- read.table("/data.csv", header = TRUE, sep = ",")
data
dist <- dist(data, method = "euclidean")
dist
# Ward's method: spelled "ward" in R < 3.1.0, "ward.D" since then
cluster <- hclust(dist, method = "ward.D")
cluster
plot(cluster)
rect.hclust(cluster, k = 4, border = "red")

I extracted 4 clusters from the data. My question is: is it possible to
produce a summary of the mean value of each variable for each of the 4
clusters?

Thanks a lot in advance,

Jeoffrey






Re: [R] cluster analysis: mean values for each variable and cluster

2009-02-20 Thread Uwe Ligges



jgaspard wrote:


I extracted 4 clusters from the data. My question is: is it possible to
produce a summary of the mean value of each variable for each of the 4
clusters?



Well, I think this is not what you want.
You probably want to use Manhattan distance (rather than Euclidean) for 0/1 
data, and you want to know the number of 1s and the total number in each 
cluster.


Anyway, to answer your question, assign the result of rect.hclust() at the 
end, such as:


x <- rect.hclust(cluster, k = 4, border = "red")
sapply(x, function(i) colMeans(data[i, ]))

Uwe Ligges
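
An alternative that does not need the plot is to take the memberships from
cutree() and summarise by group. The sketch below uses made-up 0/1 data in
place of the posted file, Manhattan distance as suggested above, and average
linkage as an arbitrary choice. For 0/1 variables the column means are the
proportions of 1s in each cluster.

```r
set.seed(42)
# Hypothetical stand-in for the posted data: 30 cases, 6 binary variables
data <- as.data.frame(matrix(rbinom(30 * 6, 1, 0.5), nrow = 30))

d   <- dist(data, method = "manhattan")
hc  <- hclust(d, method = "average")
grp <- cutree(hc, k = 4)              # cluster membership for each case

sizes  <- table(grp)                  # number of cases per cluster
means  <- aggregate(data, by = list(cluster = grp), FUN = mean)
counts <- rowsum(data, group = grp)   # number of 1s per variable and cluster

sizes
means
counts
```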






Re: [R] Cluster analysis question

2009-02-08 Thread Stephen Weigand
Dan,

I don't use the flexclust package, but if I understand your question
correctly, you can use your own distance measure to calculate a
dissimilarity matrix and pass that to, e.g., agnes() in the cluster
package.

Stephen
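
A sketch of that route, under loudly stated assumptions: a weighted dot
product is a similarity rather than a distance, so the code below rescales it
to a cosine-type similarity and uses 1 - similarity as the dissimilarity. The
data, the weights 'w' and that conversion are hypothetical choices, not
something taken from the original question.

```r
library(cluster)

set.seed(7)
X <- matrix(rnorm(12 * 4), nrow = 12)   # made-up data: 12 cases, 4 features
w <- c(2, 1, 1, 0.5)                    # hypothetical positive weights

# Weighted dot products between all pairs of rows
S <- X %*% diag(w) %*% t(X)

# Assumption: rescale to a cosine-type similarity and use 1 - similarity
nrm <- sqrt(diag(S))
D <- as.dist(1 - S / outer(nrm, nrm))

# agnes() accepts a dissimilarity object when diss = TRUE
ag  <- agnes(D, diss = TRUE, method = "average")
grp <- cutree(as.hclust(ag), k = 3)
table(grp)
```

hclust() from the stats package takes the same kind of "dist" object, so it
could be used in place of agnes().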


-- 
Rochester, Minn. USA



[R] Cluster analysis question

2009-02-06 Thread Dan Stanger
Hello All,

I have data where each feature data point is a vector, and my distance
measurement is a weighted dot product between vectors. 

I would like to use R to perform a cluster analysis on this data.  Does
one of the R cluster analysis routines allow a user-provided
distance function?

 

Dan Stanger

Eaton Vance Management
255 State Street
Boston, MA 02109
617 598 8261

 




Re: [R] Cluster analysis question

2009-02-06 Thread Jim Porzak
Dan,

Check out Fritz Leisch's flexclust package.
HTH,
Jim Porzak
TGN.com
San Francisco, CA
http://www.linkedin.com/in/jimporzak
use R! Group SF: http://ia.meetup.com/67/





[R] Cluster analysis using numeric and factor variables

2008-06-10 Thread Nagu
Hi,

Are there any algorithms that handle numeric and factor variables
together in a cluster analysis?

Thank you,
Nagu



Re: [R] Cluster analysis using numeric and factor variables

2008-06-10 Thread Moshe Olshansky
If you can define a distance between two vectors (where each one has some 
numerical and some categorical coordinates), then you can proceed with any 
clustering algorithm that works from a distance matrix.

One way to get such a distance is to use the randomForest package, which can 
produce a proximity matrix that can be turned into a distance matrix.

Regards,

Moshe.
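
The "define a distance, then cluster" route can be sketched as below. Note
that this swaps in Gower's coefficient from daisy() in the recommended
'cluster' package rather than a random forest proximity; the data and k = 2
are made up for illustration.

```r
library(cluster)

set.seed(3)
# Made-up mixed data: two numeric variables and one factor
df <- data.frame(
  x = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  y = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  f = factor(rep(c("a", "b"), each = 10))
)

d   <- daisy(df, metric = "gower")  # handles numeric and factor columns together
fit <- pam(d, k = 2)                # any method that accepts dissimilarities would do
table(cluster = fit$clustering, f = df$f)
```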



[R] Cluster analysis with numeric and categorical variables

2008-06-03 Thread Miha Staut
Dear all,

I would like to perform a clustering analysis on a data frame with two 
coordinate variables (X and Y) and a categorical variable where only a != b can 
be established.  As far as I understood classification analyses, they are not 
an option as they partition the training set only in k classes of the test set. 
 By searching through the book Modern Applied Statistics with S I did not 
find a satisfactory solution. 

I will be grateful for any suggestions.

Best regards
Miha





Re: [R] Cluster analysis with numeric and categorical variables

2008-06-03 Thread Christian Hennig

Dear Miha,

a general way to do this is as follows:
Define a distance measure by aggregating the 
Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if 
the category is the same, 1 otherwise) on the categorical variable. Perform 
cluster analysis (whichever you want) on the resulting distance matrix.


Note that there is more than one way to do this. The 0-1 distance could be 
incorporated in the definition of the Euclidean distance (instead of 
(x_i-y_i)^2), or a weighted average of the distances in X-, Y- and 
categorical space could be computed. Weights of variables (including 
possibly rescaling) have to be decided. How to do this precisely should 
depend on the subject matter and prior information about variable 
importance etc. In absence of such information, you may standardise the 
variablewise sums of squared pairwise distances to be equal.
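
One possible reading of this recipe in R, as a sketch only: each variable's
pairwise distances are rescaled so that their squares sum to 1 (equal
weights, for lack of prior information), and the 0-1 distance then enters a
Euclidean-type aggregate. The data, the helper name 'unit_ss', the linkage
and k = 3 are hypothetical.

```r
# Made-up data: coordinates plus a nominal variable
set.seed(11)
df <- data.frame(X = runif(15), Y = runif(15),
                 cat = factor(rep(c("a", "b", "c"), times = 5)))

dX <- dist(df$X)                     # pairwise |x_i - x_j|
dY <- dist(df$Y)
lab <- as.character(df$cat)
dC <- as.dist(outer(lab, lab, "!=") * 1)   # 0 if same category, 1 otherwise

# Rescale a "dist" object so that its squared distances sum to 1
unit_ss <- function(d) d / sqrt(sum(d^2))

# Euclidean-type aggregation of the three standardised distances
d <- sqrt(unit_ss(dX)^2 + unit_ss(dY)^2 + unit_ss(dC)^2)

hc  <- hclust(d, method = "average")
grp <- cutree(hc, k = 3)
table(grp, df$cat)
```

Different weights can be applied by multiplying each standardised term by a
chosen factor before summing.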


Hope this helps (and you can figure out the relevant R code yourself).

Christian




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche



Re: [R] cluster analysis

2007-11-02 Thread paulandpen
AMINA SHAHZADI,

The eternal question.

What I do is generate a range of solutions and profile them on the variables 
used to cluster the data, plus any other information I have on the cluster 
groups. I then present the solutions to a group of others to assess 
meaningfulness, debate the solutions and attempt to reach a consensus.

In many cases, e.g. for algorithms based on k-means and hierarchical 
clustering, you are using an exploratory technique and there are no 
right/wrong answers to this.

Having used cluster analysis for years, here are some things to look at, 
because there is no way to answer this statistically (unless you are using a 
latent class type model with goodness-of-fit measures):

1.  What is the minimum size you believe to be robust for a single cluster 
(e.g. n=30, n=100)? The larger the number of clusters you generate 
relative to sample size, the smaller your clusters will be, so there must be 
a defined cut-off point below which you are not prepared to go.
2.  If you run the data through different algorithms, how comparable are 
the results (cluster stability)?
3.  What differences emerge between 2, 3, 4 cluster solutions etc.? As you 
use larger numbers of clusters, does this still produce a meaningful 
result in that the clusters are distinct and unique, or are you just cutting 
larger clusters into smaller clusters without generating unique and usable 
information? Examine the clusters via a series of cross tabs: as you go 
from 2 to 3 to 4 cluster solutions, what happens to the members within 
clusters, and are they distributed differently?

Thanks Paul
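
The three checks above might look like this in R. This is only a sketch on
made-up data; the choice of Ward linkage, k-means as the second algorithm and
the range k = 2:4 are assumptions for illustration.

```r
set.seed(5)
# Made-up data: three groups of 20 cases in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))

hc   <- hclust(dist(x), method = "ward.D2")
memb <- cutree(hc, k = 2:4)          # one column of memberships per k

# 1. Cluster sizes for each solution (is any cluster too small?)
lapply(colnames(memb), function(k) table(memb[, k]))

# 2. Agreement between two algorithms for the same number of clusters
km <- kmeans(x, centers = 3, nstart = 25)
table(hclust = memb[, "3"], kmeans = km$cluster)

# 3. How members move when going from 2 to 3 to 4 clusters
tab23 <- table(k2 = memb[, "2"], k3 = memb[, "3"])
tab34 <- table(k3 = memb[, "3"], k4 = memb[, "4"])
tab23
tab34
```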



[R] cluster analysis

2007-11-01 Thread amna khan
Hi Sir

How can we select the optimum number of clusters?

Best Regards

-- 
AMINA SHAHZADI
Department of Statistics
GC University Lahore, Pakistan.



[R] Cluster Analysis

2007-10-29 Thread Katia Freire
Dear all,
   
  I would like to know if I can do a hierarchical cluster analysis in R using 
my own similarity matrix and how. Thanks. Katia Freire.




Re: [R] Cluster Analysis

2007-10-29 Thread Dieter Vanderelst
take a look at hclust()

Dieter



Re: [R] Cluster Analysis

2007-10-29 Thread elw


 Subject: [R] Cluster Analysis
 
 Dear all,

  I would like to know if I can do a hierarchical cluster analysis in R 
 using my own similarity matrix and how. Thanks. Katia Freire.

Yes. ;)

Reading the help for dist() and hclust() should make the procedure for 
doing this appear fairly straightforward.  For interpreting the results, 
cutree() should be helpful.

--elijah
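
A minimal sketch, assuming the similarities lie between 0 and 1 so that
1 - similarity is a sensible dissimilarity (hclust() expects dissimilarities).
The matrix 'sim', the average linkage and k = 2 are made up for illustration.

```r
# Hypothetical symmetric similarity matrix for four objects
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))

d  <- as.dist(1 - sim)               # convert similarity to dissimilarity
hc <- hclust(d, method = "average")
plot(hc)                             # dendrogram

grp <- cutree(hc, k = 2)             # A and B in cluster 1, C and D in cluster 2
grp
```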



[R] cluster analysis

2007-10-18 Thread amna khan
Hi Sir

How to perform cluster analysis using Ward's method and K-means clustering?

Regards

-- 
AMINA SHAHZADI
Department of Statistics
GC University Lahore, Pakistan.



Re: [R] cluster analysis

2007-10-18 Thread Liviu Andronic
On 10/18/07, amna khan [EMAIL PROTECTED] wrote:
 Hi Sir

 How to perform cluster analysis using Ward's method and K-means clustering?

To begin with, try performing it using the Rcmdr GUI.

Regards,
Liviu
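
For readers who prefer the console, both methods are also available in base
R. The sketch below uses the built-in iris measurements, standardised, with
k = 3; those choices are arbitrary. It assumes R 3.1.0 or later, where
hclust() offers the "ward.D2" method.

```r
x <- scale(iris[, 1:4])              # built-in data, standardised

# Ward's method (hierarchical)
hc <- hclust(dist(x), method = "ward.D2")
grp_ward <- cutree(hc, k = 3)

# K-means with several random starts
set.seed(2)
km <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate the two partitions
table(ward = grp_ward, kmeans = km$cluster)
```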
