
Re: [R] Cluster analysis

2019-03-31 Thread Sarah Goslee

Hi,

R has a vast array of tools for cluster analysis. There's even a task
view: https://cran.r-project.org/web/views/Cluster.html

Working out which method is best for your needs will require spending
some time understanding the pros and cons, and possibly consulting a
local statistician.
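As a starting point only, here is a minimal sketch of one common approach for a mix of numeric and ordinal variables: Gower dissimilarity from daisy() in the cluster package, followed by hierarchical clustering. The data frame, its column names and the choice of two groups are all made up for illustration; they are assumptions, not a recommendation for your data.

```r
library(cluster)

# Hypothetical farmer data: two numeric variables and one ordinal variable
farmers <- data.frame(
  farm_ha   = c(1.5, 2.0, 12, 15, 1.0, 14),
  herd      = c(3, 5, 40, 55, 2, 48),
  education = factor(c("none", "primary", "secondary",
                       "secondary", "none", "primary"),
                     levels = c("none", "primary", "secondary"),
                     ordered = TRUE)
)

# Gower dissimilarity handles numeric and ordered-factor columns together
d <- daisy(farmers, metric = "gower")

# Average-linkage hierarchical clustering, cut into 2 groups (arbitrary choice)
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 2)
table(groups)
```

Whether Gower dissimilarity and average linkage suit a given data set is exactly the kind of question that needs some reading or a statistician.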

Sarah

On Sun, Mar 31, 2019 at 4:20 PM bienvenidoz...@gmail.com
 wrote:
>
> Hi,
> I have data from farmers with different variables. I would like to classify 
> them according to some variables. Can you help me with "R" to find the best 
> variables to classify them and how to classify them with "R". Some variables 
> are numerical others are ordinal.
>
> Best regards,
> Bienvenue
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Sarah Goslee (she/her)
http://www.numberwright.com


[R] Cluster analysis

2019-03-31 Thread bienvenidoz...@gmail.com

Hi,
I have data from farmers with different variables. I would like to classify 
them according to some variables. Can you help me with "R" to find the best 
variables to classify them and how to classify them with "R". Some variables 
are numerical others are ordinal.

Best regards,
Bienvenue

Re: [R] cluster samples using self organizing map in R

2018-10-10 Thread Sarah Goslee

Hi Tina,

What's wrong with what you did?

The output object of som() contains the classification of each sample.

You probably do need to read more about self-organizing maps, since
you specified you wanted the samples classified into nine groups, and
that's unlikely to be your actual intent.

I have no idea what you thought your hierarchical clustering step was
supposed to do, either.

Here's one way to get 3 groups instead of 9:

library(kohonen)
iris.sc <- scale(iris[, 1:4])
iris.som <- som(iris.sc, grid=somgrid(xdim = 1, ydim=3,
topo="rectangular"), rlen=100, alpha=c(0.05,0.01))

table(iris.som$unit.classif, iris$Species)
plot(iris.som)
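If the aim is instead to keep the 3x3 map and group its units, each sample's group can be looked up through the unit it was assigned to. A minimal sketch, assuming kohonen version 3 or later (where codes is a list) and an arbitrary seed:

```r
library(kohonen)

set.seed(42)  # arbitrary seed, only for reproducibility
iris.sc <- scale(iris[, 1:4])
iris.som <- som(iris.sc,
                grid = somgrid(xdim = 3, ydim = 3, topo = "hexagonal"),
                rlen = 100, alpha = c(0.05, 0.01))

# Cluster the 9 codebook vectors (map units) into 3 groups
unit.cluster <- cutree(hclust(dist(iris.som$codes[[1]])), k = 3)

# unit.classif holds the winning unit of each sample, so indexing the
# unit-level clusters by it gives one cluster label per sample
sample.cluster <- unit.cluster[iris.som$unit.classif]

table(sample.cluster, iris$Species)
```

This gives 150 labels, one per row of iris, rather than one per map unit.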

Sarah

On Wed, Oct 10, 2018 at 2:14 AM A DNA RNA  wrote:
>
> Dear All,
>
> How can I use Self Organizing Map (SOM) results to cluster samples? I have
> tried the following but this gives me only the clustering of grids, while I
> want to cluster (150) samples:
>
> library(kohonen)
> iris.sc <- scale(iris[, 1:4])
> iris.som <- som(iris.sc, grid=somgrid(xdim = 3, ydim=3, topo="hexagonal"),
>    rlen=100, alpha=c(0.05,0.01))
> ##hierarchical clustering
> groups <- 3
> iris.hc <- cutree(hclust(dist(iris.som$codes[[1]])), groups)
> iris.hc
> #V1 V2 V3 V4 V5 V6 V7 V8 V9
> #1  1  2  1  1  2  3  3  2
>
>
> Can anyone help me with this please?
> --
> Tina

-- 
Sarah Goslee
http://www.functionaldiversity.org


Re: [R] cluster samples using self organizing map in R

2018-10-10 Thread Bert Gunter

Search!

The rseek.org site gives many hits for "self organizing maps", including
the som package among others.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Oct 9, 2018 at 11:14 PM A DNA RNA  wrote:

> Dear All,
>
> How can I use Self Organizing Map (SOM) results to cluster samples? I have
> tried the following but this gives me only the clustering of grids, while I
> want to cluster (150) samples:
>
> library(kohonen)
> iris.sc <- scale(iris[, 1:4])
> iris.som <- som(iris.sc, grid=somgrid(xdim = 3, ydim=3, topo="hexagonal"),
>    rlen=100, alpha=c(0.05,0.01))
> ##hierarchical clustering
> groups <- 3
> iris.hc <- cutree(hclust(dist(iris.som$codes[[1]])), groups)
> iris.hc
> #V1 V2 V3 V4 V5 V6 V7 V8 V9
> #1  1  2  1  1  2  3  3  2
>
>
> Can anyone help me with this please?
> --
> Tina
>


[R] cluster samples using self organizing map in R

2018-10-09 Thread A DNA RNA

Dear All,

How can I use Self Organizing Map (SOM) results to cluster samples? I have
tried the following but this gives me only the clustering of grids, while I
want to cluster (150) samples:

library(kohonen)
iris.sc <- scale(iris[, 1:4])
iris.som <- som(iris.sc, grid=somgrid(xdim = 3, ydim=3, topo="hexagonal"),
   rlen=100, alpha=c(0.05,0.01))
##hierarchical clustering
groups <- 3
iris.hc <- cutree(hclust(dist(iris.som$codes[[1]])), groups)
iris.hc
#V1 V2 V3 V4 V5 V6 V7 V8 V9
#1  1  2  1  1  2  3  3  2


Can anyone help me with this please?
--
Tina


Re: [R] cluster data in lattice dotplot and show stdev

2017-02-16 Thread Duncan Mackay

Hi Luigi

I think your data is duplicated

> xtabs(~cluster+type+target,my.data)
, , target = A

       type
cluster blank negative positive
  run_1     2        2        2
  run_2     0        0        0

, , target = B

       type
cluster blank negative positive
  run_1     0        0        0
  run_2     2        2        2

> xtabs(~cluster+target,my.data)
       target
cluster A B
  run_1 6 0
  run_2 0 6

I am not sure exactly what you want, partly because of what Jim has plotted.
I have thought of 2 ways. I have added columns coding the factors as numeric
to make it flexible.

1. By runs

# helper columns coding the factors as numeric positions
my.data$Target <- paste0(rep(LETTERS[1:2], each = 6), rep(1:2, each = 3))
my.data$x <- rep(c(0.8, 1.2), each = 3)
my.data$xrun <- rep(1:3)

xyplot(value ~ x | target, my.data,
       groups = type,
       xlim = c(0.5, 1.5),
       scales = list(x = list(at = c(0.8, 1.2),
                              labels = paste("Run", 1:2)),
                     alternating = 1),
       auto.key = list(points = TRUE, lines = FALSE),
       pch = 16,
       panel = panel.superpose,
       panel.groups = function(x, y, ...) {
         panel.xyplot(x, y, ...)
       }
)

2. By type

xyplot(value ~ xrun | target, my.data,
       groups = cluster,  # the run factor is called 'cluster' in my.data
       xlim = c(0, 4),
       par.settings = list(strip.background = list(col = "transparent")),
       scales = list(x = list(at = 1:3,
                              labels = unique(my.data$type),
                              alternating = 1)),
       auto.key = list(points = TRUE, lines = FALSE),
       pch = 16,
       panel = panel.superpose,
       panel.groups = function(x, y, ...) {
         panel.xyplot(x, y, ...)
       }
)

If you want error bars, use the functions in
demo("intervals", package = "lattice")
or write your own using panel.segments.

If you decide not to use the default colours etc., use

par.settings = list(superpose.symbol = list(pch = ... ,
                                            col = ... ,
                                            cex = 1))

which makes keys easier.

Example with error bars drawn by hand:

xyplot(value ~ xrun | target, my.data,
       groups = cluster,  # the run factor is called 'cluster' in my.data
       xlim = c(0, 4),
       par.settings = list(strip.background = list(col = "transparent"),
                           grid.pars = list(lineend = "butt")),
       scales = list(x = list(at = 1:3,
                              labels = unique(my.data$type),
                              alternating = 1)),
       auto.key = list(points = TRUE, lines = FALSE),
       pch = 16,
       panel = panel.superpose,
       panel.groups = function(x, y, ..., group.number) {
         panel.xyplot(x, y, ...)
         # bar positions are placeholders; substitute mean - sd and mean + sd
         panel.arrows(group.number + 0.3, group.number - 0.6,
                      group.number + 0.3, group.number - 0.4,
                      length = 0.04,
                      unit = "inches",
                      angle = 90,
                      code = 3)
       }
)
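
The bar positions in the hand-drawn example are placeholders. As a rough sketch of how real mean and standard deviation bars might be computed and drawn, the following rebuilds the poster's data, summarises the duplicates with aggregate(), and passes the bar limits to a custom panel function. The names summ, lo and hi are hypothetical helpers, and grouping by type and target is my assumption about what the duplicates are.

```r
library(lattice)

# Rebuild the example data (same values as in the original post)
my.data <- data.frame(
  cluster = rep(c("run_1", "run_2"), each = 6),
  type    = rep(c("blank", "positive", "negative"), 4),
  target  = rep(c("A", "B"), each = 6),
  value   = c(0.01, 1.1, 0.5, 0.02, 1.6, 0.8,
              0.07, 1.4, 0.7, 0.03, 1.4, 0.4)
)

# Mean and sd of the duplicates for every type within every target
summ <- aggregate(value ~ type + target, data = my.data,
                  FUN = function(v) c(mean = mean(v), sd = sd(v)))
summ <- do.call(data.frame, summ)     # columns value.mean and value.sd
summ$x <- as.numeric(factor(summ$type))

lo <- summ$value.mean - summ$value.sd
hi <- summ$value.mean + summ$value.sd

p <- xyplot(value.mean ~ x | target, data = summ,
            lower = lo, upper = hi,
            xlim = c(0.5, 3.5), ylim = extendrange(c(lo, hi)),
            scales = list(x = list(at = 1:3,
                                   labels = levels(factor(summ$type)))),
            xlab = "type", ylab = "mean value",
            pch = 16,
            panel = function(x, y, lower, upper, subscripts, ...) {
              panel.xyplot(x, y, ...)
              # subscripts selects the rows that belong to this panel
              panel.segments(x, lower[subscripts], x, upper[subscripts])
            })
print(p)
```

Replacing panel.segments with panel.arrows (angle = 90, code = 3) would add caps to the bars, as in the example above.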

Regards

Duncan

Duncan Mackay
Department of Agronomy and Soil Science
University of New England
Armidale NSW 2351
Email: home: mac...@northnet.com.au

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Luigi
Marongiu
Sent: Friday, 17 February 2017 02:31
To: r-help
Subject: [R] cluster data in lattice dotplot and show stdev

dear all,
i have a set of data that is separated in the variables: cluster (two
runs), type (blank, negative and positive) and target (A and B), each
duplicated. I am plotting it with lattice and the result is a 2x2 matrix
plot in which the top two cells (or panels) are relative to run 2, the
lower to run 2; each panel is then subdivided in target A or B and I have
colour-coded the dots to match the target.
However i would like to have a 1x2 panel plot representing the targets, and
within each panel having a cluster of 3 dots (representing the types) for
run 1 and another for run 2. I tried to represent such requirement in the
rough construction at the end of the example.
also, since each run is actually formed by duplicates, each dot should
indicate the standard deviation of the values.
How would I do that? any tips?
thanks
luigi

>>>
cluster <- c(rep("run_1", 6), rep("run_2", 6))
type <- rep(c("blank", "positive", "negative"),2)
target <- c(rep("A", 6), rep("B", 6))
value <- c(0.01, 1.1, 0.5,
   0.02, 1.6, 0.8,
   0.07, 1.4, 0.7,
   0.03, 1.4, 0.4)
my.data <- data.frame(cluster, type, target, value)

library(lattice)
dotplot(
  value ~ type|cluster + target,
  my.data,
  groups = type,
  pch=21,
  main = "Luminex analysis MTb humans",
  xlab = "Target", ylab = "Reading",
  col = c("grey", "green", "red"),
  par.se

Re: [R] cluster data in lattice dotplot and show stdev

2017-02-16 Thread Jim Lemon

Hi Luigi,
Are you looking for something like this?

library(plotrix)
ylim=c(0,1.7)
png("lmplot.png",width=600,height=300)
par(mfrow=c(1,2))
brkdn.plot(value~type,data=my.data[my.data$target=="A",],
 main="Run 1",ylab="Value",xlab="",xaxlab="target",ylim=ylim,
 mct="mean",md="sd",pch=c("B","N","P"))
brkdn.plot(value~type,data=my.data[my.data$target=="B",],
 main="Run 2",ylab="Value",xlab="",xaxlab="target",ylim=ylim,
 mct="mean",md="sd",pch=c("B","N","P"))
dev.off()

Jim


On Fri, Feb 17, 2017 at 2:30 AM, Luigi Marongiu
 wrote:
> dear all,
> i have a set of data that is separated in the variables: cluster (two
> runs), type (blank, negative and positive) and target (A and B), each
> duplicated. I am plotting it with lattice and the result is a 2x2 matrix
> plot in which the top two cells (or panels) are relative to run 2, the
> lower to run 2; each panel is then subdivided in target A or B and I have
> colour-coded the dots to match the target.
> However i would like to have a 1x2 panel plot representing the targets, and
> within each panel having a cluster of 3 dots (representing the types) for
> run 1 and another for run 2. I tried to represent such requirement in the
> rough construction at the end of the example.
> also, since each run is actually formed by duplicates, each dot should
> indicate the standard deviation of the values.
> How would I do that? any tips?
> thanks
> luigi
>

> cluster <- c(rep("run_1", 6), rep("run_2", 6))
> type <- rep(c("blank", "positive", "negative"),2)
> target <- c(rep("A", 6), rep("B", 6))
> value <- c(0.01, 1.1, 0.5,
>0.02, 1.6, 0.8,
>0.07, 1.4, 0.7,
>0.03, 1.4, 0.4)
> my.data <- data.frame(cluster, type, target, value)
>
> library(lattice)
> dotplot(
>   value ~ type|cluster + target,
>   my.data,
>   groups = type,
>   pch=21,
>   main = "Luminex analysis MTb humans",
>   xlab = "Target", ylab = "Reading",
>   col = c("grey", "green", "red"),
>   par.settings = list(strip.background = list(col="paleturquoise")),
>   scales = list(alternating = FALSE, x = list(labels = c("", "", ""))),
>   key = list(
> space = "top",
> columns = 3,
> text = list(c("Blank", "Negative", "Positive"), col="black"),
> rectangles = list(col=c("grey", "green", "red"))
>   )
> )
>
> x <- 1:7
> plot(x , c(max(my.data$value), min(my.data$value), my.data$value[1:5]),
> col="white", xaxt = "n", ylab="value", xlab="target")
> points(x[1], mean(my.data$value[c(1, 4)]), col="grey")
> points(x[2], mean(my.data$value[c(2, 5)]), col="red")
> points(x[3], mean(my.data$value[c(3, 6)]), col="green")
> points(x[5], mean(my.data$value[c(7, 10)]), col="grey")
> points(x[6], mean(my.data$value[c(8, 11)]), col="red")
> points(x[7], mean(my.data$value[c(9, 12)]), col="green")
> axis(side=1, at = x[2], lab = "A", cex.axis=1)
> axis(side=1, at = x[6], lab = "B", cex.axis=1)
>

[R] cluster data in lattice dotplot and show stdev

2017-02-16 Thread Luigi Marongiu

Dear all,
I have a set of data that is separated into the variables cluster (two
runs), type (blank, negative and positive) and target (A and B), each
duplicated. I am plotting it with lattice and the result is a 2x2 matrix
plot in which the top two cells (or panels) are relative to run 2, the
lower to run 2; each panel is then subdivided into target A or B and I have
colour-coded the dots to match the target.
However, I would like to have a 1x2 panel plot representing the targets, and
within each panel a cluster of 3 dots (representing the types) for
run 1 and another for run 2. I tried to represent this requirement in the
rough construction at the end of the example.
Also, since each run is actually formed by duplicates, each dot should
indicate the standard deviation of the values.
How would I do that? Any tips?
Thanks,
Luigi

>>>
cluster <- c(rep("run_1", 6), rep("run_2", 6))
type <- rep(c("blank", "positive", "negative"),2)
target <- c(rep("A", 6), rep("B", 6))
value <- c(0.01, 1.1, 0.5,
   0.02, 1.6, 0.8,
   0.07, 1.4, 0.7,
   0.03, 1.4, 0.4)
my.data <- data.frame(cluster, type, target, value)

library(lattice)
dotplot(
  value ~ type|cluster + target,
  my.data,
  groups = type,
  pch=21,
  main = "Luminex analysis MTb humans",
  xlab = "Target", ylab = "Reading",
  col = c("grey", "green", "red"),
  par.settings = list(strip.background = list(col="paleturquoise")),
  scales = list(alternating = FALSE, x = list(labels = c("", "", ""))),
  key = list(
space = "top",
columns = 3,
text = list(c("Blank", "Negative", "Positive"), col="black"),
rectangles = list(col=c("grey", "green", "red"))
  )
)

x <- 1:7
plot(x , c(max(my.data$value), min(my.data$value), my.data$value[1:5]),
col="white", xaxt = "n", ylab="value", xlab="target")
# mean() takes a single vector; a second positional argument is read as 'trim'
points(x[1], mean(my.data$value[c(1, 4)]), col="grey")
points(x[2], mean(my.data$value[c(2, 5)]), col="red")
points(x[3], mean(my.data$value[c(3, 6)]), col="green")
points(x[5], mean(my.data$value[c(7, 10)]), col="grey")
points(x[6], mean(my.data$value[c(8, 11)]), col="red")
points(x[7], mean(my.data$value[c(9, 12)]), col="green")
axis(side=1, at = x[2], lab = "A", cex.axis=1)
axis(side=1, at = x[6], lab = "B", cex.axis=1)


[R] Cluster analysis with Weighted attribute

2016-06-03 Thread Ahreum Lee

Hi all,

I'm not very familiar with R, so I have been trying to find an R function or
package that could help with my problem.

What I would like to know is whether there is any R function or package that
performs cluster analysis with weighted attributes.

I have seen several papers on attribute value weighting in k-modes
clustering, but I could not find an R function or package that implements it.

We obtained the weight of each attribute by interviewing experts, and we want
to run a cluster analysis that takes those attribute weights into account.

Are there any suggestions? They would be much appreciated!

Thanks for your interest in my question!
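
One possible route, sketched here under stated assumptions, is not the attribute-weighted k-modes algorithm from those papers but a substitute: a weighted Gower dissimilarity from daisy() in the cluster package (its weights argument takes one weight per column), followed by partitioning around medoids. The data, the weights and the choice of two clusters are all hypothetical.

```r
library(cluster)

# Hypothetical categorical attributes for six objects
x <- data.frame(
  colour = factor(c("red", "red", "blue", "blue", "green", "green")),
  size   = factor(c("S", "S", "M", "L", "L", "L")),
  shape  = factor(c("round", "square", "round", "square", "round", "square"))
)

# Hypothetical expert weights, one per attribute (column)
w <- c(colour = 3, size = 2, shape = 0.5)

# Weighted Gower dissimilarity: mismatches on heavily weighted
# attributes contribute more to the distance between two objects
d <- daisy(x, metric = "gower", weights = w)

# Partitioning around medoids on the dissimilarity matrix, 2 clusters
fit <- pam(d, k = 2)
fit$clustering
```

Because the weights enter through the dissimilarity, the same object d could also be passed to hclust() or another method that accepts dissimilarities.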
 
 
 
 


Re: [R] cluster analysis

2015-06-17 Thread PIKAL Petr

Hi

> -Original Message-
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Venky
> Sent: Wednesday, June 17, 2015 8:43 AM
> To: R Help R
> Subject: [R] cluster analysis
>
> Hi friends,
>
> I have data like this
>

In R or elsewhere?

>
>
> Group
>   Employee size WOE Employee size2 Weight of Evidence 1081680995 0
> 0.12875537 0.128755 -0.30761 1007079896 1 0.48380133 -0.46544 -0.70464
> 1000507407 2 0.26029825 -0.46544 0.070221 1006400720 3 0.12875537
> 0.128755
> 0.151385 1006916029 4 0.12875537 -0.05955 0.320269 1006717587 5
> 0.12875537
> 1002032301 6 0.12875537 1007021594 7 0.26029825 1007118066 8 0.26029825
> In this data first variable (Employee size) has 10 rows and variable 2
> (employee size2) has only 5 rows

Extremely messy due to HTML posting. Please post in plain text, as recommended 
by the posting guide.

>
> Question 1:there are different number of rows so that, we can able to
> do K-means cluster or not?

I am not an expert, but why not try it?

> Question 2:If we run k-means clustering in R answer not coming  because
> of NA exists
>
> I have used dataset<-na.omit(dataset)
>
> But that time also i cannot able to run clustering

Perhaps not enough data remained after removing the NAs.

To get a better answer you should provide a reproducible example or at least 
some usable data.
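
To illustrate the point about too little data remaining, here is a small sketch with made-up numbers (not the poster's data, and the column names are hypothetical): one column is complete, the other has values in only 5 of 10 rows, so na.omit() keeps 5 rows and k-means can only run if the number of centres is smaller than the number of distinct rows left.

```r
set.seed(1)  # arbitrary seed; kmeans uses random starts

# Made-up data: size2 is missing for half of the rows
dataset <- data.frame(
  size1 = c(0.13, 0.48, 0.26, 0.13, 0.13, 0.13, 0.13, 0.26, 0.26, 0.26),
  size2 = c(0.13, -0.47, -0.47, 0.13, -0.06, NA, NA, NA, NA, NA)
)

# na.omit() drops every row that has an NA in any column
complete <- na.omit(dataset)
nrow(complete)

# kmeans needs more distinct rows than cluster centres
k <- 2
stopifnot(nrow(unique(complete)) > k)

fit <- kmeans(complete, centers = k, nstart = 10)
fit$cluster
```

If too few rows survive, clustering only on the columns that are complete is one alternative to consider.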

Cheers
Petr


>
> Please help me to find this answer
>



[R] cluster analysis

2015-06-16 Thread Venky

Hi friends,

I have data like this



Group
  Employee size WOE Employee size2 Weight of Evidence 1081680995 0
0.12875537 0.128755 -0.30761 1007079896 1 0.48380133 -0.46544 -0.70464
1000507407 2 0.26029825 -0.46544 0.070221 1006400720 3 0.12875537 0.128755
0.151385 1006916029 4 0.12875537 -0.05955 0.320269 1006717587 5 0.12875537
1002032301 6 0.12875537 1007021594 7 0.26029825 1007118066 8 0.26029825
In this data first variable (Employee size) has 10 rows and variable 2
(employee size2) has only 5 rows

Question 1: The variables have different numbers of rows. Can we still run
k-means clustering?
Question 2: When we run k-means clustering in R we get no answer because
NAs exist.

I have used dataset<-na.omit(dataset)

But even then I am not able to run the clustering.

Please help me to find this answer


Re: [R] Cluster analysis using term frequencies

2015-03-24 Thread Christian Hennig


Dear Sun Shine,


> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))
>
> However, I get an error message when trying to plot this: "Error in 
> graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : invalid 
> dendrogram input".


I don't see anything wrong with the code, so what I'd do is run
str(dtes) and str(dtesFreq) to see whether these are what they should be 
(or if not, what they are instead).


> I'm clearly screwing something up, either in my source data.frame or in my 
> setting hclust up, but don't know which, nor how.


Can't comment on your source data but generally, whatever you do, use 
str() or even print() to see whether the R objects are all right or what 
went wrong.


> More than just identifying the error however, I am interested in finding a 
> smart (efficient/ elegant) way of checking the occurrence and frequency value 
> of the terms that may be associated with 'sports', 'learning', and 
> 'extra-mural' and extracting these into a matrix or data frame so that I can 
> analyse and plot their clustering to see if how I associated these terms is 
> actually supported statistically.


The first thing that comes to my mind (not necessarily the best/most 
elegant) is to run...

dtes3 <- cutree(dtesFreq,3)
...and to table dtes3 against your manual classification.
Note that 3 is the most "natural" number of clusters to cut the tree 
here but may not be the best to match your classification (for example, 
you may have a one-point cluster in the 3-cluster solution, so it may 
effectively be a two-cluster solution with an outlier). Your 
dendrogram, if you succeed plotting it, may give you a hint about that.
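
The checks suggested above could look roughly like the following sketch. The term frequencies and the manual classification are invented for illustration. One thing that may be worth checking (an assumption, since the real tes.df is not shown) is that labels has one entry per clustered row: names() of a data frame returns its column names, whereas rownames() gives one label per row.

```r
# Invented term frequencies, one row per term
tes.df <- data.frame(
  freq = c(12, 10, 11, 3, 4, 2, 25, 27, 24),
  row.names = c("football", "soccer", "rugby",
                "chess", "drama", "choir",
                "lesson", "homework", "exam")
)
# Invented manual classification, in the same row order
manual <- rep(c("sports", "extra-mural", "learning"), each = 3)

dtes <- dist(tes.df, method = "euclidean")
dtesFreq <- hclust(dtes, method = "ward.D")

# Inspect the objects before plotting
str(dtes)
str(dtesFreq)

# One label per clustered row
plot(dtesFreq, labels = rownames(tes.df))

# Cut into 3 clusters and compare with the manual grouping
dtes3 <- cutree(dtesFreq, 3)
table(dtes3, manual)
```

A table with one non-zero cell per row and column would mean the cut agrees with the manual grouping; anything else shows where they differ.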


Hope this helps,
Christian




> I'm sure that there must be a way of doing this in R, but I'm obviously not 
> going about it correctly. Can anyone shine a light please?
>
> Thanks for any help/ guidance.
>
> Regards,
> Sun




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


[R] Cluster analysis using term frequencies

2015-03-24 Thread Sun Shine


Hi list

I am using the 'tm' package to review meeting notes at a school to 
identify terms frequently associated with 'learning', 'sports', and 
'extra-mural' activities, and then to sort any terms according to these 
three headers in a way that could be supported statistically (as opposed 
to, say, my own bias, etc.).


To accomplish this, I have done the following:

(1) After the usual pre-processing of the text data, loading it as a 
corpus and then converting it into a document term matrix (called 
'allTerms'), I have identified the 20 most frequently occurring terms in 
the meeting notes and extracted these into a named vector called 
'freqTerms'. Many of the terms returned have nothing to do with any of 
the three themes of 'learning', 'sports', or 'extra-mural'.


(2) Therefore, I have also manually generated a list of terms and 
synonyms for 'learning' and 'sports', etc. (e.g. 'football', 'soccer', 
'drama', 'chess', etc.) and then tested for the occurrence of each of 
these terms in the corpus, e.g.:


> allTerms['soccer']

and have come up with a list of some 30 terms together with their 
frequencies. I manually sorted these according to three headers 
'learning', 'sports', and 'extra-mural' and dropped these into a table 
in a word processing document. Some of these terms are also in the 
freqTerms vector.


What I want to do now is to use cluster analysis (hclust, from the 
'stats' package) to plot a dendrogram of the terms I have manually 
checked and put into the table, in order to see how similar the 
terms are and whether they cluster in ways similar to the way I 
manually sorted them under the table column headers of 'learning', 
'sports', and 'extra-mural'.


To do this, I dropped these manually sorted terms into a data frame 
together with the associated values (which I called 'tes.df') and then 
tried plotting this as follows:


> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))

However, I get an error message when trying to plot this: "Error in 
graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : 
invalid dendrogram input".


I'm clearly screwing something up, either in my source data.frame or in 
my setting hclust up, but don't know which, nor how.


More than just identifying the error however, I am interested in finding 
a smart (efficient/ elegant) way of checking the occurrence and 
frequency value of the terms that may be associated with 'sports', 
'learning', and 'extra-mural' and extracting these into a matrix or data 
frame so that I can analyse and plot their clustering to see if how I 
associated these terms is actually supported statistically.


I'm sure that there must be a way of doing this in R, but I'm obviously 
not going about it correctly. Can anyone shine a light please?


Thanks for any help/ guidance.

Regards,
Sun


Re: [R] Cluster mapping data

2015-03-08 Thread Leask, Graham

Bert,

Thank you for the suggestion, but I am familiar with the clustering routines in 
R. My issue is how to carry out a grouping analysis on multivariate data that 
includes postcode shape file data as a variable.

Rather than obtain clusters spread across the map I am looking to limit the 
solution to groups that are entirely contiguous. I know how to accomplish this 
with SAS but I am looking to accomplish this using R.

Kind Regards

Dr Graham Leask
Economics & Strategy Group
Aston University
Aston Triangle
Birmingham
B4 7ET

Tel: 0121 204 3150

> On 8 Mar 2015, at 17:14, Bert Gunter  wrote:
> 
> Have you looked at the "Cluster" task view on CRAN?
> 
> http://cran.r-project.org/web/views/
> 
> -- Bert
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
> (650) 467-7374
> 
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> Clifford Stoll
> 
> 
> 
> 
>> On Sun, Mar 8, 2015 at 9:58 AM, Leask, Graham  wrote:
>> I am looking to cluster some data including a postcode shape file but need 
>> to ensure that the resulting groups are contiguous.
>> 
>> How do I accomplish this using R?
>> 
>> Kind Regards
>> 
>> Dr Graham Leask
>> Economics & Strategy Group
>> Aston University
>> Aston Triangle
>> Birmingham
>> B4 7ET
>> 
>> Tel: 0121 204 3150

Re: [R] Cluster mapping data

2015-03-08 Thread Bert Gunter

Have you looked at the "Cluster" task view on CRAN?

http://cran.r-project.org/web/views/

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Sun, Mar 8, 2015 at 9:58 AM, Leask, Graham  wrote:
> I am looking to cluster some data including a postcode shape file but need to 
> ensure that the resulting groups are contiguous.
>
> How do I accomplish this using R?
>
> Kind Regards
>
> Dr Graham Leask
> Economics & Strategy Group
> Aston University
> Aston Triangle
> Birmingham
> B4 7ET
>
> Tel: 0121 204 3150
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Cluster mapping data

2015-03-08 Thread Leask, Graham

I am looking to cluster some data including a postcode shape file but need to 
ensure that the resulting groups are contiguous.

How do I accomplish this using R?

Kind Regards

Dr Graham Leask
Economics & Strategy Group
Aston University
Aston Triangle
Birmingham
B4 7ET

Tel: 0121 204 3150
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cluster + tt terms in coxph

2014-11-06 Thread Henric Winell


On 2014-11-05 14:50, Therneau, Terry M., Ph.D. wrote:


This is fixed in version 2.37-8 of the survival package, which has been
in my "send to CRAN real-soon-now" queue for 6 months.  Your note is a
prod to get it done.  I've been updating and adding vignettes.


Is your fixed code publicly available somewhere?  (The 'survival' 
repository at R-forge doesn't seem to have been updated since January.)


Henric Winell




Terry Therneau


On 11/05/2014 05:00 AM, r-help-requ...@r-project.org wrote:

I am receiving the following error when trying to include both tt
(time transforms) and frailty terms in coxph

> coxph(Surv(time, status) ~ ph.ecog + tt(age) + cluster(sex), data=lung,
+       tt=function(x,t,...) pspline(x + t/365.25))
Error in residuals.coxph(fit2, type = "dfbeta", collapse = cluster, weighted = TRUE) :
   Wrong length for 'collapse'

I tried both 64 bit (R.3.1.0) and 32 bit (R.3.1.2) in Windows 7 64bit
and get the same errors

Inclusion of tt and cluster terms worked fine in R2.9.2-2.15.1 under
Windows Vista 32 bit and Ubuntu 64 bit

Any ideas?





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cluster + tt terms in coxph

2014-11-05 Thread Therneau, Terry M., Ph.D.

This is fixed in version 2.37-8 of the survival package, which has been in my "send to 
CRAN real-soon-now" queue for 6 months.  Your note is a prod to get it done.  I've been 
updating and adding vignettes.
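
A quick way to see whether an installation already has the fix is to compare the installed version of survival against 2.37-8, the version named above. This is only a minimal sketch; the variable names are illustrative and the check says nothing about when that version reached CRAN.

```r
# Compare the installed 'survival' version with the version said to contain the fix.
if (requireNamespace("survival", quietly = TRUE)) {
  installed_version <- packageVersion("survival")
  has_fix <- installed_version >= "2.37-8"
} else {
  # survival is not installed in this library
  installed_version <- NA
  has_fix <- NA
}
has_fix
```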


Terry Therneau


On 11/05/2014 05:00 AM, r-help-requ...@r-project.org wrote:

I am receiving the following error when trying to include both tt (time 
transforms) and frailty terms in coxph

> coxph(Surv(time, status) ~ ph.ecog + tt(age) + cluster(sex), data=lung,
+       tt=function(x,t,...) pspline(x + t/365.25))
Error in residuals.coxph(fit2, type = "dfbeta", collapse = cluster, weighted = TRUE) :
   Wrong length for 'collapse'

I tried both 64 bit (R.3.1.0) and 32 bit (R.3.1.2) in Windows 7 64bit and get 
the same errors

Inclusion of tt and cluster terms worked fine in R2.9.2-2.15.1 under Windows 
Vista 32 bit and Ubuntu 64 bit

Any ideas?


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Cluster -- Agnes function

2014-09-24 Thread David L Carlson

Read the documentation for cutree(). You will have to decide how many clusters 
you want to use since agnes() provides results for everything from n clusters 
(where n is the number of observations) to 1 cluster.

?cutree
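
A minimal sketch of that suggestion, assuming the tree was built with agnes() from the cluster package. USArrests stands in for the patient data, and k = 4 is an arbitrary choice for illustration. The agnes object is converted with as.hclust() before cutting, so that cutree() sees the structure it documents.

```r
library(cluster)

# Stand-in data; replace with the patient data set.
ag <- agnes(USArrests, method = "average")

# Choose the number of clusters (arbitrary here), then cut the tree.
k <- 4
membership <- cutree(as.hclust(ag), k = k)

# Row ids belonging to each cluster, as a list with one element per cluster.
rows_by_cluster <- split(rownames(USArrests), membership)
rows_by_cluster[[1]]
```

split() returns a list, so rows_by_cluster[["2"]] gives the row ids of the second cluster, and so on.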

-
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Sohail Khan
Sent: Wednesday, September 24, 2014 9:14 AM
To: r-help@r-project.org
Subject: [R] Cluster -- Agnes function

Dear All,

I have clustered a patient data set by agnes.

I want to extract information for each cluster, I.E. all row ids
belonging to each cluster.

Thank you.






[[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Cluster -- Agnes function

2014-09-24 Thread Bart Kastermans

On 24/09/14 16:13, Sohail Khan wrote:
> Dear All,
> 
> I have clustered a patient data set by agnes.
> 
> I want to extract information for each cluster, I.E. all row ids
> belonging to each cluster.

Fascinating, thank you for sharing.

Best,
Bart

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Cluster -- Agnes function

2014-09-24 Thread Sohail Khan

Dear All,

I have clustered a patient data set by agnes.

I want to extract information for each cluster, I.E. all row ids
belonging to each cluster.

Thank you.

 

 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Error creating daisy matrix in R cluster package - Cannot allocate vector of size 66.0 Gb

2014-06-21 Thread Scott Davis

My purpose involves creating a dissimilarity matrix using the daisy() function
from the cluster package in R before applying k-medoid clustering for customer
segmentation. The dataset has 133,153 observations of 35 variables in a
data.frame with numerical and categorical columns, blank cells and missing
values. Missing values refer to NA, while a blank cell means nothing is
present within the data.frame.

Here's my OS:

> sessionInfo()

R version 3.1.0 (2014-04-10)

Platform x86_64-w64-mingw32/x64 (64-bit)

I have 35 variables, but here is a description of the first 5:

> head(df)

  user_id   Age Gender Household.Income Marital.Status
1   12945         Male
2   12947         Male
3   12990
4   13160 25-34   Male        100k-125k         Single
5   13195         Male         75k-100k         Single
6   13286

Since the Windows computer has 3 Gb RAM, I increased the virtual memory to
100Gb hoping that would be enough to create the matrix - it didn't
work. I've looked into other R packages for solving the memory problem, but
they don't work. I cannot use the `bigmemory` with the `biganalytics`
package because it only accepts numeric matrices. The `clara` and `ff`
packages also accept only numeric matrices. Here's the daisy script:

#Load csv file

> Store1 <- read.csv("/Users/name/Client1.csv", head = TRUE)

#Convert csv to data.frame

> df <-as.data.frame(Store1)

#Increase memory allocation in R to 70 GB using the command:

> memory.limit(size = 7)

[1] 7

#Load cluster package

> library(cluster)

#Create daisy dissimilarity matrix

#Use Gower distance coefficient for mixed variables

#Set type as ratio scaled variable

> daisy1 <- daisy(df, metric = "gower",
+                 type = list(ordratio = c(1:35)))

#Error: cannot allocate vector of size 66.0 Gb


How can I fix the error?
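
The size in the error message follows from the number of pairwise dissimilarities, so the arithmetic is worth writing out. The sketch below first reproduces the roughly 66 Gb figure for 133,153 rows, then shows one possible workaround that nobody in this thread proposed: computing Gower dissimilarities on a random subsample and running pam() on that. The toy data frame, its column names, the subsample size and k = 3 are all assumptions made for illustration.

```r
library(cluster)

# A dissimilarity object stores n * (n - 1) / 2 doubles of 8 bytes each.
n <- 133153
n_pairs <- n * (n - 1) / 2
size_gib <- n_pairs * 8 / 2^30
size_gib  # roughly 66, matching the error message

# Hypothetical mixed-type data standing in for the customer table.
set.seed(1)
toy <- data.frame(
  age    = factor(sample(c("18-24", "25-34", "35-44"), 200, replace = TRUE),
                  ordered = TRUE),
  gender = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  spend  = runif(200)
)

# Work on a random subsample small enough for the full dissimilarity matrix.
idx <- sample(nrow(toy), 50)
d   <- daisy(toy[idx, ], metric = "gower")
fit <- pam(d, k = 3)
table(fit$clustering)
```

Whether a subsample is representative enough for the segmentation is a judgment call that depends on the data.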
-- 
Scott Davis
Cell: (408)826-9561
Skype ID: Scdavis61
San Jose, CA.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cluster option in stata for random intercept model in the R language?

2013-10-15 Thread David Winsemius



On Oct 15, 2013, at 3:32 AM, Martin Batholdy wrote:


Dear R-list,

I am currently working on a dataset with a colleague who uses stata.
We fit a random intercept model to the data (decisions clustered in  
participants) and get closely the same results in stata (using xtreg  
re) and R (using the lme4 or multilevel package).



Now in stata, there is an additional option for the regression to  
control for clustering; the vce(cluster clustvar) option, which  
changes the standard errors quite a bit.
(see http://www.stata.com/support/faqs/statistics/standard-errors-and-vce-cluster-option/ 
 or http://www.stata.com/manuals13/xtxtreg.pdf).


Unfortunately I don't understand what this 'correction' does and why  
it yields different results.
First I thought it would control for autocorrelations over time  
(decisions), but if I model this directly with a random-intercept  
random-slope model, I don't get nearly the same results.


Can someone help me understand what stata is doing here?
And what would be the equivalent in R to get similar results?


http://www.stata.com/statalist/





David Winsemius, MD
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] cluster option in stata for random intercept model in the R language?

2013-10-15 Thread Martin Batholdy

Dear R-list,

I am currently working on a dataset with a colleague who uses stata.
We fit a random intercept model to the data (decisions clustered in 
participants) and get closely the same results in stata (using xtreg re) and R 
(using the lme4 or multilevel package).


Now in stata, there is an additional option for the regression to control for 
clustering; the vce(cluster clustvar) option, which changes the standard errors 
quite a bit.
(see 
http://www.stata.com/support/faqs/statistics/standard-errors-and-vce-cluster-option/
 or http://www.stata.com/manuals13/xtxtreg.pdf).

Unfortunately I don't understand what this 'correction' does and why it yields 
different results.
First I thought it would control for autocorrelations over time (decisions), 
but if I model this directly with a random-intercept random-slope model, I 
don't get nearly the same results.

Can someone help me understand what stata is doing here?
And what would be the equivalent in R to get similar results?
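
As background, vce(cluster clustvar) requests cluster-robust ("sandwich") standard errors: the coefficient estimates stay the same, and only their variance estimate changes to allow arbitrary correlation within each cluster. The base-R sketch below illustrates the idea on an ordinary lm() fit with simulated data. It is not a reproduction of xtreg, re, and the finite-sample factor used is one common choice, stated here as an assumption.

```r
set.seed(42)

# Simulated data: 30 participants with 8 decisions each (hypothetical).
n_id  <- 30
n_obs <- 8
id <- rep(seq_len(n_id), each = n_obs)
x  <- rnorm(n_id * n_obs)
u  <- rnorm(n_id)[id]                      # disturbance shared within participant
y  <- 1 + 0.5 * x + u + rnorm(n_id * n_obs)

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)

# "Bread": (X'X)^{-1}
bread <- solve(crossprod(X))

# "Meat": sum over clusters of (X_g' e_g)(X_g' e_g)'
scores <- rowsum(X * e, group = id)        # one row of summed scores per cluster
meat   <- crossprod(scores)

# Finite-sample adjustment (assumed form; software packages differ here).
G <- n_id
N <- nrow(X)
K <- ncol(X)
adj <- (G / (G - 1)) * ((N - 1) / (N - K))

vc_cluster <- adj * bread %*% meat %*% bread
se_cluster <- sqrt(diag(vc_cluster))
se_classic <- sqrt(diag(vcov(fit)))
cbind(classic = se_classic, cluster = se_cluster)
```

Contributed packages offer ready-made versions of this calculation; the hand-rolled form is shown only to make the steps visible.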


thanks!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cluster package - Installation problems

2013-07-15 Thread Uwe Ligges




On 15.07.2013 23:51, David Stevens wrote:

Group - I'm having problems with the 'cluster' package. Installation
appears successful but attempts to load it with either library() or
require() result in the error message

Error in library(cluster) : there is no package called ‘cluster’

All that appears to be installed is cluster.dll in the
library/cluster/libs/x64 folder. Several reinstallation attempts haven't
helped. I'm using R-3.0.1 under windows 7. Any help is appreciated.


Restart your R (and close other R processes) and try again, probably the 
dll was locked by an(other) R process when you tried to update.
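
Before reinstalling, it can help to see what each library directory actually contains for the package. The sketch below uses only base R; the column names of the result are illustrative. A complete installation has a DESCRIPTION file next to the libs folder.

```r
# For every library path, check whether a 'cluster' directory exists
# and whether it contains a DESCRIPTION file (present in complete installs).
libs <- .libPaths()
pkg_dirs <- file.path(libs, "cluster")
status <- data.frame(
  library         = libs,
  dir_exists      = dir.exists(pkg_dirs),
  has_description = file.exists(file.path(pkg_dirs, "DESCRIPTION")),
  stringsAsFactors = FALSE
)
status
```

A row with dir_exists TRUE and has_description FALSE would be consistent with the half-finished installation described in the question.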


best,
Uwe Ligges




Regards

David



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] cluster package - Installation problems

2013-07-15 Thread David Stevens

Group - I'm having problems with the 'cluster' package. Installation 
appears successful but attempts to load it with either library() or 
require() result in the error message


Error in library(cluster) : there is no package called ‘cluster’

All that appears to be installed is cluster.dll in the 
library/cluster/libs/x64 folder. Several reinstallation attempts haven't 
helped. I'm using R-3.0.1 under windows 7. Any help is appreciated.


Regards

David

--
David K Stevens, P.E., Ph.D.
Professor and Head, Environmental Engineering
Civil and Environmental Engineering
Utah Water Research Laboratory
8200 Old Main Hill
Logan, UT  84322-8200
435 797 3229 - voice
435 797 1363 - fax
david.stev...@usu.edu

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] cluster analysis

2013-07-04 Thread Ekele Alih

I want to do agglomerative hierarchical clustering using the complete linkage 
method in R, using the function agnes or hclust. 
1. Can I do a cluster analysis of h=(n+p+1)/2 out of n observations? Note that 
p = number of variables (dependent and independent).
2. Can I plot the dendrogram and get the cluster history of this analysis in R?
3. Can I use the cluster with the largest values to sort the n observations in 
ascending order?
Your assistance and guide will be greatly appreciated in solving problems 1-3
Thanks
EKELE ALIH
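
For question 2, a minimal sketch with hclust() and complete linkage follows. USArrests is a stand-in data set and k = 3 is an arbitrary cut, both assumptions for illustration. The merge and height components of the fitted object hold the cluster history.

```r
# Complete-linkage agglomerative clustering on stand-in data.
hc <- hclust(dist(USArrests), method = "complete")

# Cluster history: which two groups merged at each step, and at what height.
# Negative entries are single observations, positive entries are earlier steps.
history <- data.frame(
  step   = seq_along(hc$height),
  left   = hc$merge[, 1],
  right  = hc$merge[, 2],
  height = hc$height
)
head(history)

# Draw the dendrogram when running interactively.
if (interactive()) plot(hc, hang = -1)

# Cut into k groups and list the members of the largest one.
groups  <- cutree(hc, k = 3)
sizes   <- table(groups)
largest <- as.integer(names(sizes)[which.max(sizes)])
in_largest <- names(groups)[groups == largest]
```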
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] cluster gene list

2013-04-22 Thread Catarina Maia

Hello,

I'm just a beginner and probably there is a better way to do it but here it
goes:

# Cluster analysis: Euclidean distances, then Ward clustering.
# (The method called "ward" was renamed "ward.D" in R 3.1.0.)

Euclidean_Distance <- dist(mydata, method="euclidean")
hc <- hclust(Euclidean_Distance, method="ward.D")
plot(hc, hang=-1)


# K=4 (I chose to divide my data into 4 clusters)

X <- rect.hclust(hc, k=4, border="red")
summary(X)

# To isolate each cluster, identify the samples in each one.
# as.vector() gives row indices; names(X[[1]]) gives the row names instead.

cluster_1 <- as.vector(X[[1]])
cluster_2 <- as.vector(X[[2]])
cluster_3 <- as.vector(X[[3]])
cluster_4 <- as.vector(X[[4]])

hope that helps.
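
For the heatmap.2 case specifically, one possible approach is to repeat the row clustering outside the plot and cut it into 7 groups. By default heatmap.2 clusters rows with dist() and hclust(), so the sketch below uses the same two functions. The simulated matrix and gene names are placeholders, and the result only matches the heatmap if heatmap.2 was called with its default distance and clustering functions on the same matrix.

```r
# Placeholder expression matrix: 40 genes by 6 samples.
set.seed(1)
expr <- matrix(rnorm(40 * 6), nrow = 40,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))

# Same row clustering that heatmap.2 performs by default.
row_hc <- hclust(dist(expr))

# Cut the row dendrogram into 7 clusters and list the genes in each.
gene_cluster <- cutree(row_hc, k = 7)
genes_by_cluster <- split(names(gene_cluster), gene_cluster)
lengths(genes_by_cluster)
```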

Best regards,

Catarina Maia

2013/4/21 Sudhir Singh 

>  Hi,
>
> I have created a heatmap using heatmap.2 having 7 clusters.  I would like
> to extract the list of genes that are in these 7 clusters.
>
>  Is there any function that can be used to extract genes for each cluster?
>
> Cheers,
>
> Sudhir
>
> --
> __
>
> SAVE PAPER - Please do not print this e-mail unless absolutely necessary
> Being happy doesn't mean everything's perfect.
> It means you've decided to see beyond  imperfections.
> __
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] cluster gene list

2013-04-20 Thread Sudhir Singh

 Hi,

I have created a heatmap using heatmap.2 having 7 clusters.  I would like
to extract the list of genes that are in these 7 clusters.

 Is there any function that can be used to extract genes for each cluster?

Cheers,

Sudhir

-- 
__

SAVE PAPER - Please do not print this e-mail unless absolutely necessary
Being happy doesn't mean everything's perfect.
It means you've decided to see beyond  imperfections.
__

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Cluster analysis

2013-04-11 Thread ravanlou

I am doing cluster analysis of my SNP data. I have 2 questions:
1. I draw the dendrogram with hclust using the following code, and I would
like to change its direction to vertical.

>data <- read.table(as.matrix(file.choose()), header=T, row.names = 1,
sep="\t")
> plot(hclust(as.dist(data),method="complete"))

It is horizontal, and I don't know how to change it to a vertical shape.

2. I would like to have bootstraps, but no luck. I am using the following
codes:

> result <- pvclust(as.dist(data), method.dist="cor",
method.hclust="complete", nboot=1000)

Error in cor(x, method = "pearson", use = use.cor) :
  supply both 'x' and 'y' or a matrix-like 'x'


I will appreciate if someone could help me please
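
On the first question, the orientation of a dendrogram can be switched by converting the hclust result with as.dendrogram() and using the horiz argument of its plot method. The sketch below is a minimal illustration; the distance matrix is built from USArrests as a stand-in for the SNP distance matrix read from file.

```r
# Stand-in for a square distance matrix read from a file.
m <- as.matrix(dist(USArrests[1:10, ]))

hc   <- hclust(as.dist(m), method = "complete")
dend <- as.dendrogram(hc)

if (interactive()) {
  plot(dend)                # default orientation
  plot(dend, horiz = TRUE)  # rotated orientation
}
```

On the second question, the error message suggests that the correlation step received a dist object where it expected a matrix-like table of raw data, so passing the original data matrix rather than as.dist(data) may be worth trying.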


-- 
Abbasali "Ali" Ravanlou
PhD candidate of Plant Pathology
Dept. of Crop Sci.
University of Illinois-UC

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Cluster analysis on weighted survey data with continuous and categorical variables

2013-03-19 Thread Thomas Lumley

On Wed, Mar 20, 2013 at 3:55 AM, Emma Gibson wrote:

> I am trying to perform cluster analysis on survey data where each
> respondent has answered several questions, some of which have categorical
> answers ("blue", "pink", "green", etc.) and some of which have scale answers
> (rating from 1 to 10, etc.). My problem is that certain age groups were
> over-sampled and I need to weight the data collected in order to accurately
> reflect the current population. Will it make a difference if I do the
> cluster analysis on the weighted data, and if so, how do I do cluster
> analysis on the weighted data? Any advice would be much appreciated!
> Thanks
> Emma
>

The unequal sampling will have some effect on most clustering methods (eg
not single-linkage, but k-means or average-linkage).  Whether this matters
depends on whether you have genuinely separate clusters in the population
or a general mush that you are trying to segment in some convenient way.

If you have genuine well-separated clusters, then ignoring the oversampling
is likely to do well.  If you don't, you will get a segmentation into
clusters that partitions the over-sampled people too finely and the
under-sampled people too coarsely.

I don't know of any R functions that cluster with sampling weights.

If your data set is fairly small, you could expand it by making duplicates
(perhaps jittered) of some points, and cluster the expanded data set.  On
the other hand, if it is very large, you can thin it out to a uniform
sample by sampling from it with probability inversely proportional to the
original sampling probability.
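
Both suggestions can be sketched in a few lines of base R. The survey data, the column names and the sampling probabilities below are invented for illustration; in practice the probabilities would come from the survey design.

```r
set.seed(123)

# Hypothetical survey in which the "young" group was sampled at twice the rate.
survey <- data.frame(
  age_group = sample(c("young", "old"), 300, replace = TRUE, prob = c(0.7, 0.3)),
  rating    = sample(1:10, 300, replace = TRUE)
)
survey$samp_prob <- ifelse(survey$age_group == "young", 0.02, 0.01)

# Thinning: keep each row with probability inversely proportional to its
# sampling probability, scaled so the least-sampled group is kept entirely.
keep_prob <- min(survey$samp_prob) / survey$samp_prob
thinned <- survey[runif(nrow(survey)) < keep_prob, ]

# Expansion: duplicate rows in proportion to their sampling weight.
w    <- 1 / survey$samp_prob
reps <- round(w / min(w))
expanded <- survey[rep(seq_len(nrow(survey)), times = reps), ]

table(thinned$age_group)
table(expanded$age_group)
```

Either data set can then be passed to the clustering method of choice. The jittering of duplicates mentioned above is left out of this sketch.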

   - thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Cluster analysis on weighted survey data with continuous and categorical variables

2013-03-19 Thread Emma Gibson

I am trying to perform cluster analysis on survey data where each respondent 
has answered several questions, some of which have categorical answers ("blue", 
"pink", "green", etc.) and some of which have scale answers (rating from 1 to 
10, etc.). My problem is that certain age groups were over-sampled and I need 
to weight the data collected in order to accurately reflect the current 
population. Will it make a difference if I do the cluster analysis on the 
weighted data, and if so, how do I do cluster analysis on the weighted data? 
Any advice would be much appreciated!

Thanks
Emma
   
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Cluster analysis in the setting of repeated measures

2013-03-10 Thread John Sorkin

Does R have any function for performing cluster analysis when each subject 
contributes more than one observation to the analysis, i.e. a repeated measures 
cluster analysis? I prefer an agglomerative clustering, but would certainly be 
happy with k-means or another clustering technique. To the best of my knowledge, 
the "standard" R clustering functions (e.g. kmeans, hclust, pvclust) all assume 
that each subject contributes a single line of data to the analyses.
Thanks,
John
 
 
 
John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)
Confidentiality Statement:
This email message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged information.  
Any unauthorized use, disclosure or distribution is prohibited.  If you are not 
the intended recipient, please contact the sender by reply email and destroy 
all copies of the original message. 
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Is it possible to obtain an agglomeration schedule with R cluster analysis

2013-02-23 Thread Bob Green


William,

Many thanks. I'll check this against my data tomorrow when I'm back 
at work.  This looks just what I wanted.


Regards

Bob


At 09:27 AM 24/02/2013, William Dunlap wrote:

You didn't show what the tabular summary should look like.
However, look at the height and merge components of
an hclust object:

> hc3 <- hclust(dist(USArrests[1:8, c(1,2,4)]))
> data.frame(hc3[2:1])
      height merge.1 merge.2
1   9.297849      -1      -8
2  13.609188      -2      -5
3  23.779193      -4      -6
4  33.865321      -3       2
5  48.229659       1       3
6 104.636227       4       5
7 185.135221      -7       6

The two merge.* columns identify which groups merged at
the corresponding height value.  Negative values, i, refer to the
-i'th leaf value in the 'labels' component and positive values, i, refer
to the cluster created in the i'th row of the data.frame.  The following
function transforms those references into names:

f <- function(hc){
    data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
               height=hc$height,
               components=ifelse(hc$merge<0, hc$labels[abs(hc$merge)],
                                 paste0("Cluster",hc$merge)),
               stringsAsFactors=FALSE)
}

as in
> f(hc3)
             height components.1 components.2
Cluster1   9.297849      Alabama     Delaware
Cluster2  13.609188       Alaska   California
Cluster3  23.779193     Arkansas     Colorado
Cluster4  33.865321      Arizona     Cluster2
Cluster5  48.229659     Cluster1     Cluster3
Cluster6 104.636227     Cluster4     Cluster5
Cluster7 185.135221  Connecticut     Cluster6

Compare that to the output of str(as.dendrogram(hc3)):

> str(as.dendrogram(hc3))
--[dendrogram w/ 2 branches and 8 members at h = 185]
  |--leaf "Connecticut"
  `--[dendrogram w/ 2 branches and 7 members at h = 105]
     |--[dendrogram w/ 2 branches and 3 members at h = 33.9]
     |  |--leaf "Arizona"
     |  `--[dendrogram w/ 2 branches and 2 members at h = 13.6]
     |     |--leaf "Alaska"
     |     `--leaf "California"
     `--[dendrogram w/ 2 branches and 4 members at h = 48.2]
        |--[dendrogram w/ 2 branches and 2 members at h = 9.3]
        |  |--leaf "Alabama"
        |  `--leaf "Delaware"
        `--[dendrogram w/ 2 branches and 2 members at h = 23.8]
           |--leaf "Arkansas"
           `--leaf "Colorado"

Does f() produce the information you need for your display?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -Original Message-
> From: r-help-boun...@r-project.org 
[mailto:r-help-boun...@r-project.org] On Behalf

> Of Bob Green
> Sent: Saturday, February 23, 2013 12:49 PM
> To: Uwe Ligges
> Cc: r-help@r-project.org
> Subject: Re: [R] Is it possible to obtain an agglomeration 
schedule with R cluster analyis

>
> Hello Uwes,
>
> Thanks. Re-reading the hclust pages I found that using the hclust
> 'USArrests' data  that the command > plot (hc1)  will generate the
> order in which cases joined. however, I still can't see how to obtain
> the respective height at which each case joined each cluster or the
> height when clusters merge.
>
>
> The dendrogram {stats} page provides the following code which
> produces the information that I require. However, what I would like
> to obtain is a table of the height at which cluster formed.
>
>  > hc <- hclust(dist(USArrests), "ave")
>  > (dend1 <- as.dendrogram(hc)) # "print()" method
>  > str(dend1)  # "str()" method
>
> I also found as.hclust which plots what I want, but I still can't
> find a way to produce the actual height values which are being
> plotted, for example as a tabular summary.
>
>   plot(hc) ;  mtext("hclust", side=1)
>
> Any assistance is appreciated,
>
> Bob
>
>
>
> At 04:01 AM 24/02/2013, Uwe Ligges wrote:
>
>
> >On 22.02.2013 11:41, Bob Green wrote:
> >>Hello,
> >>
> >>In SPSS the cluster analysis output includes an agglomerations schedule,
> >>which details the stages when cases are joined.
> >>
> >>Is it possible to obtain such output when performing cluster analysis in
> >>R?  If so, I'd appreciate advice regarding how to obtain this 
information.

> >
> >
> >If you are talking about hierarchical clustering via hclust(), see ?hclust
> >It tells you that the relevant information is available inside the
> >object and you can even see it via the plot method.
> >
> >Uwe Ligges
> >
> >
> >
> >>
> >>Any assistance is appreciated,
> >>
> >>Regards
> >>
> >>Bob
> >>
> >>

Re: [R] Is it possible to obtain an agglomeration schedule with R cluster analysis

2013-02-23 Thread William Dunlap

You didn't show what the tabular summary should look like.
However, look at the height and merge components of
an hclust object:

> hc3 <- hclust(dist(USArrests[1:8, c(1,2,4)]))
> data.frame(hc3[2:1])
      height merge.1 merge.2
1   9.297849      -1      -8
2  13.609188      -2      -5
3  23.779193      -4      -6
4  33.865321      -3       2
5  48.229659       1       3
6 104.636227       4       5
7 185.135221      -7       6

The two merge.* columns identify which groups merged at
the corresponding height value.  Negative values, i, refer to the
-i'th leaf value in the 'labels' component and positive values, i, refer
to the cluster created in the i'th row of the data.frame.  The following
function transforms those references into names:

f <- function(hc){
    data.frame(row.names=paste0("Cluster",seq_along(hc$height)),
               height=hc$height,
               components=ifelse(hc$merge<0, hc$labels[abs(hc$merge)],
                                 paste0("Cluster",hc$merge)),
               stringsAsFactors=FALSE)
}

as in
> f(hc3)
             height components.1 components.2
Cluster1   9.297849      Alabama     Delaware
Cluster2  13.609188       Alaska   California
Cluster3  23.779193     Arkansas     Colorado
Cluster4  33.865321      Arizona     Cluster2
Cluster5  48.229659     Cluster1     Cluster3
Cluster6 104.636227     Cluster4     Cluster5
Cluster7 185.135221  Connecticut     Cluster6

Compare that to the output of str(as.dendrogram(hc3)):

> str(as.dendrogram(hc3))
--[dendrogram w/ 2 branches and 8 members at h = 185]
  |--leaf "Connecticut"
  `--[dendrogram w/ 2 branches and 7 members at h = 105]
     |--[dendrogram w/ 2 branches and 3 members at h = 33.9]
     |  |--leaf "Arizona"
     |  `--[dendrogram w/ 2 branches and 2 members at h = 13.6]
     |     |--leaf "Alaska"
     |     `--leaf "California"
     `--[dendrogram w/ 2 branches and 4 members at h = 48.2]
        |--[dendrogram w/ 2 branches and 2 members at h = 9.3]
        |  |--leaf "Alabama"
        |  `--leaf "Delaware"
        `--[dendrogram w/ 2 branches and 2 members at h = 23.8]
           |--leaf "Arkansas"
           `--leaf "Colorado"

Does f() produce the information you need for your display?
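
An SPSS-style schedule also reports the stage at which each newly formed cluster is next merged. That column can be derived from the merge component alone. The sketch below reuses the hc3 example from this message; the name next_stage and the use of 0 for the final stage are illustrative choices.

```r
hc3 <- hclust(dist(USArrests[1:8, c(1, 2, 4)]))

# For each stage i, find the later stage whose merge row refers to cluster i.
stage <- seq_len(nrow(hc3$merge))
next_stage <- vapply(stage, function(i) {
  hit <- which(rowSums(hc3$merge == i) > 0)
  if (length(hit)) hit[1] else 0L   # 0 marks the final merge
}, integer(1))

schedule <- data.frame(
  stage      = stage,
  cluster1   = hc3$merge[, 1],
  cluster2   = hc3$merge[, 2],
  height     = hc3$height,
  next_stage = next_stage
)
schedule
```

Read against the table above, Cluster1 is next used at stage 5 and Cluster2 at stage 4, which is what the next_stage column records.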

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -Original Message-
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
> Behalf
> Of Bob Green
> Sent: Saturday, February 23, 2013 12:49 PM
> To: Uwe Ligges
> Cc: r-help@r-project.org
> Subject: Re: [R] Is it possible to obtain an agglomeration schedule with R 
> cluster analyis
> 
> Hello Uwes,
> 
> Thanks. Re-reading the hclust pages I found that using the hclust
> 'USArrests' data  that the command > plot (hc1)  will generate the
> order in which cases joined. however, I still can't see how to obtain
> the respective height at which each case joined each cluster or the
> height when clusters merge.
> 
> 
> The dendrogram {stats} page provides the following code which
> produces the information that I require. However, what I would like
> to obtain is a table of the height at which cluster formed.
> 
>  > hc <- hclust(dist(USArrests), "ave")
>  > (dend1 <- as.dendrogram(hc)) # "print()" method
>  > str(dend1)  # "str()" method
> 
> I also found as.hclust which plots what I want, but I still can't
> find a way to produce the actual height values which are being
> plotted, for example as a tabular summary.
> 
>   plot(hc) ;  mtext("hclust", side=1)
> 
> Any assistance is appreciated,
> 
> Bob
> 
> 
> 
> At 04:01 AM 24/02/2013, Uwe Ligges wrote:
> 
> 
> >On 22.02.2013 11:41, Bob Green wrote:
> >>Hello,
> >>
> >>In SPSS the cluster analysis output includes an agglomerations schedule,
> >>which details the stages when cases are joined.
> >>
> >>Is it possible to obtain such output when performing cluster analysis in
> >>R?  If so, I'd appreciate advice regarding how to obtain this information.
> >
> >
> >If you are talking about hierarchical clustering via hclust(), see ?hclust
> >It tells you that the relevant information is available inside the
> >object and you can even see it via the plot method.
> >
> >Uwe Ligges
> >
> >
> >
> >>
> >>Any assistance is appreciated,
> >>
> >>Regards
> >>
> >>Bob
> >>
> >>__
> >>R-help@r-project.org mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide
>

Re: [R] Is it possible to obtain an agglomeration schedule with R cluster analysis

2013-02-23 Thread Bob Green

Hello Uwe,

Thanks. Re-reading the hclust pages, I found that with the hclust 
'USArrests' example data the command plot(hc1) will show the 
order in which cases joined. However, I still can't see how to obtain 
the respective height at which each case joined each cluster, or the 
height at which clusters merge.

The dendrogram {stats} page provides the following code which 
produces the information that I require. However, what I would like 
to obtain is a table of the height at which cluster formed.

> hc <- hclust(dist(USArrests), "ave")
> (dend1 <- as.dendrogram(hc)) # "print()" method
> str(dend1)  # "str()" method

I also found as.hclust which plots what I want, but I still can't 
find a way to produce the actual height values which are being 
plotted, for example as a tabular summary.

 plot(hc) ;  mtext("hclust", side=1)

Any assistance is appreciated,

Bob
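
For reference, a minimal sketch of one way to tabulate the heights asked about above. The object returned by hclust() stores the merge order in its merge component and the matching heights in its height component (see ?hclust). The data frame name schedule and its column names are illustrative only and not part of any API.

```r
# Average-linkage clustering of the USArrests data, as on the ?dendrogram page
hc <- hclust(dist(USArrests), "ave")

# One row per merge step. In hc$merge, a negative entry -i is the single
# observation i; a positive entry j is the cluster formed at step j.
schedule <- data.frame(
  stage    = seq_along(hc$height),
  cluster1 = hc$merge[, 1],
  cluster2 = hc$merge[, 2],
  height   = hc$height
)

head(schedule)                          # first few merges and their heights
rownames(USArrests)[-hc$merge[1, ]]     # names of the two states joined first
```

This is roughly the information an SPSS agglomeration schedule reports, though the layout and labelling differ.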

At 04:01 AM 24/02/2013, Uwe Ligges wrote:

>On 22.02.2013 11:41, Bob Green wrote:
>>Hello,
>>
>>In SPSS the cluster analysis output includes an agglomerations schedule,
>>which details the stages when cases are joined.
>>
>>Is it possible to obtain such output when performing cluster analysis in
>>R?  If so, I'd appreciate advice regarding how to obtain this information.
>
>If you are talking about hierarchical clustering via hclust(), see ?hclust
>It tells you that the relevant information is available inside the
>object and you can even see it via the plot method.
>
>Uwe Ligges
>
>>Any assistance is appreciated,
>>
>>Regards
>>
>>Bob


Re: [R] Is it possible to obtain an agglomeration schedule with R cluster analyis

2013-02-23 Thread Uwe Ligges




On 22.02.2013 11:41, Bob Green wrote:
> Hello,
>
> In SPSS the cluster analysis output includes an agglomerations schedule,
> which details the stages when cases are joined.
>
> Is it possible to obtain such output when performing cluster analysis in
> R?  If so, I'd appreciate advice regarding how to obtain this information.

If you are talking about hierarchical clustering via hclust(), see ?hclust
It tells you that the relevant information is available inside the
object and you can even see it via the plot method.

Uwe Ligges

> Any assistance is appreciated,
>
> Regards
>
> Bob


[R] Is it possible to obtain an agglomeration schedule with R cluster analyis

2013-02-22 Thread Bob Green


Hello,

In SPSS the cluster analysis output includes an agglomerations 
schedule, which details the stages when cases are joined.


Is it possible to obtain such output when performing cluster analysis 
in R?  If so, I'd appreciate advice regarding how to obtain this information.



Any assistance is appreciated,

Regards

Bob


[R] Cluster Analysis and PCoA (mixt variables)

2013-01-19 Thread Julien Mvdb

Hello everyone,

I am writing because of my lack of knowledge of statistics.
I'm using cluster analysis (CA) and PCoA (though perhaps I should use other
techniques) to determine the differences and similarities among a large
sample of plants, using different kinds of traits stored in a matrix of
mixed variables.
I understand that the daisy() function, using the Gower metric and declaring
the type of each variable, is a good way to deal with such mixed
variables. In fact, my plots (agnes() from package cluster, more so than my
PCoA) reflect quite well what I was expecting from the appearance of those
plants.

My problem:
I now need to understand which variables are taken into account when
producing the dissimilarity matrix that is used for the cluster analysis
or the PCoA. In other words, how are the branches of my cluster analysis
tree constructed?

I have spent the past month working out most of what I know today about
data analysis and R. I'm sorry for asking such a simple question that is
not strictly about R, but I have tried many approaches and just can't
figure it out.
Thank you

Julien Mehl Vettori
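
A minimal sketch of the workflow described above, assuming a small made-up trait table (the data frame traits and its columns are hypothetical). Every column handed to daisy() enters the Gower dissimilarity, and both agnes() and the PCoA see only that dissimilarity matrix. Correlating the dissimilarity obtained from each trait alone with the full matrix is one rough, informal way to see which traits drive the result.

```r
library(cluster)  # daisy(), agnes()

# Hypothetical trait table: one numeric, one nominal, one ordinal trait
traits <- data.frame(
  height = c(12, 30, 45, 8, 60, 25),
  leaf   = factor(c("simple", "compound", "simple",
                    "simple", "compound", "compound")),
  shade  = factor(c("low", "mid", "high", "low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

# Every column passed to daisy() contributes to the Gower dissimilarity
d <- daisy(traits, metric = "gower")
summary(d)   # also reports the variable types daisy() used

# Dissimilarity from each trait alone, correlated with the full matrix
sapply(names(traits), function(v) {
  cor(as.vector(daisy(traits[v], metric = "gower")), as.vector(d))
})

tree <- agnes(d, method = "average")   # the clustering sees only d
pcoa <- cmdscale(d, k = 2)             # classical PCoA on the same d
```

To leave a trait out of the tree, drop its column before calling daisy(); the weights argument of daisy() can be used to down-weight a trait instead of removing it.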


[R] cluster analysis error - mclust package

2012-11-26 Thread KitKat

I am following instructions online for cluster analysis using the mclust
package, and keep getting errors.
http://www.statmethods.net/advstats/cluster.html

These are the instructions (there is no sample dataset unfortunately):
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit, mydata) # plot results 
print(fit) # display the best model 

This is what I did and the error I get:
> library(mclust)
> fit <- Mclust(mydat)
> plot(fit, mydat) #plot results
Error in match.arg(what, c("BIC", "classification", "uncertainty",
"density"),  : 
  'arg' must be NULL or a character vector

My data is arranged so I have each row representing one individual with 9
values for morphological data. I want to see if they will group into 2
clusters, representing gender. 

I have tried using the instructions from the cran-r website, but they didn't
work either

Any help would be great, thank you
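
The error message suggests that the second positional argument of plot() is being matched to the what argument rather than treated as data. A minimal sketch under that assumption, using the built-in faithful data in place of mydat (which is not available here): pass only the fitted object, plus an optional what string taken from the choices listed in the error message.

```r
library(mclust)

fit <- Mclust(faithful)              # number of clusters chosen by BIC
summary(fit)                         # selected model and cluster sizes

plot(fit, what = "BIC")              # BIC for each model / number of clusters
plot(fit, what = "classification")   # observations coloured by cluster

head(fit$z)                  # membership probability of each row in each cluster
table(fit$classification)    # hard assignment derived from those probabilities
```

To force exactly two clusters rather than letting BIC choose, Mclust(faithful, G = 2) can be used. Note that Mclust() fits Gaussian mixtures, so it is intended for continuous variables.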



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-error-mclust-package-tp4650842.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis in R

2012-11-22 Thread KitKat

These are the errors I've been having. I have been trying 3 different things

1- Mclust:
This is the example I have been following:
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit, mydata) # plot results 
print(fit) # display the best model 
 
What I have done:
> fit <- Mclust(mydat)
> plot(fit, mydat) #plot results
Error in match.arg(what, c("BIC", "classification", "uncertainty",
"density"),  : 
  'arg' must be NULL or a character vector

2- Mclust using different website (cran-r) instructions
This is the example: 
> mydatMclust <- Mclust(mydat)
> summary(mydatMclust)
> summary(mydatMclust, parameters = TRUE)
> plot(mydatMclust)

There are a couple other steps but the plot is the problem. I get two plots,
there should be four. One should be plotting all my individuals but it's
plotting my variables instead. It's also taking a very long time. R script
at this point says: "Waiting to confirm page change… "

3- mcclust
Instructions from cran-r:
data(cls.draw2)
# sample of 500 clusterings from a Bayesian cluster model
tru.class <- rep(1:8,each=50)
# the true grouping of the observations
psm2 <- comp.psm(cls.draw2)
# posterior similarity matrix
# optimize criteria based on PSM
mbind2 <- minbinder(psm2)
mpear2 <- maxpear(psm2)
# Relabelling
k <- apply(cls.draw2,1, function(cl) length(table(cl)))
max.k <- as.numeric(names(table(k))[which.max(table(k))])
relab2 <- relabel(cls.draw2[k==max.k,])
# compare clusterings found by different methods with true grouping
arandi(mpear2$cl, tru.class)
arandi(mbind2$cl, tru.class)
arandi(relab2$cl, tru.class)

I called my data: mydat so I changed that where appropriate. I cannot get
past one early step, psm2 <- comp.psm(cls.draw2).. the error reads: "Error:
could not find function "comp.psm""

I think I have all appropriate packages installed. I don't know what more to
do on these three errors.  Any help would be great! Thank you




--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650466.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis in R

2012-11-22 Thread Ingmar Visser

It's hard to answer these questions without knowing what the errors are and
how they can be reproduced.
Best, Ingmar

On Thu, Nov 22, 2012 at 1:03 AM, KitKat  wrote:

> Thanks, I have been trying that site and another one
> (http://www.statmethods.net/advstats/cluster.html)
>
> I don't know if I should be doing mclust or mcclust, but either way, the
> codes are not working. I am following the guidelines online at:
> mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
> mclust - http://cran.r-project.org/
>
> I am relatively new to R, but so far I have been able to figure out dfa,
> manova, pca... I cannot get these codes to work, I keep getting various
> errors. Are there other resources that have details about what codes to use
> or what to do when errors result? I have not found anything else helpful
>
> Thank you
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650397.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] cluster analysis in R

2012-11-21 Thread KitKat

Thanks, I have been trying that site and another one
(http://www.statmethods.net/advstats/cluster.html)

I don't know if I should be doing mclust or mcclust, but either way, the
codes are not working. I am following the guidelines online at:
mcclust - http://cran.r-project.org/web/packages/mcclust/mcclust.pdf
mclust - http://cran.r-project.org/

I am relatively new to R, but so far I have been able to figure out dfa,
manova, pca... I cannot get these codes to work, I keep getting various
errors. Are there other resources that have details about what codes to use
or what to do when errors result? I have not found anything else helpful 

Thank you



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650397.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis in R

2012-11-21 Thread Brian Feeny



http://cran.r-project.org/web/views/Cluster.html

might be a good start

Brian

On Nov 21, 2012, at 1:36 PM, KitKat wrote:

> Thank you for replying! 
> I made a new post asking if there are any websites or files on how to
> download package mclust (or other Bayesian cluster analysis packages) and
> the appropriate R functions? Sorry I don't know how this forum works yet
> 
> 
> 
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650341.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: [R] cluster analysis in R

2012-11-21 Thread KitKat

Thank you for replying! 
I made a new post asking if there are any websites or files on how to
download package mclust (or other Bayesian cluster analysis packages) and
the appropriate R functions? Sorry I don't know how this forum works yet



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635p4650341.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis in R

2012-11-16 Thread Hennig, Christian

Dear Katherine,

function flexmixedruns in package fpc may do what you want; it fits mixtures 
with continuous and categorical variables, can use the BIC for giving you the 
number of mixture components and also gives you posterior probabilities for 
cases to belong to components.

Note that generally finding the right cluster analysis method is a complicated 
task and depends crucially on your application, what use you want to make of 
the clusters etc., so what's best cannot be conclusively said on a mailing 
list. The same holds for whether and how to select variables. Certainly it's 
not wrong in general to use all the variables that you have but whether it's 
better otherwise depends on what meaning your variables have and how this 
relates to the aim of clustering, what to do with the variables afterwards etc.

You may have a look at 
http://www.rss.org.uk/site/cms/contentviewarticle.asp?article=866#Link%20to%20Nov.%202012%20paper
where I discuss a number of related issues.

Best regards,
Christian


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] on behalf of 
KitKat [katherinewri...@trentu.ca]
Sent: 15 November 2012 18:14
To: r-help@r-project.org
Subject: [R] cluster analysis in R

I have two issues.

1-I am trying to use morphology to identify gender. I have 9 variables, both
continuous and categorical. I was using two-step cluster analysis in SPSS
because two-step could deal with different types of variables. But the
output tells me that an animal is in cluster 1 or 2, it does not give me a
probability (ex. 0.70 cluster 2).  I also did not want to specify that I
want two clusters, I wanted to see if analysis would naturally give me two
clusters. These were all advantages to using SPSS but now I'm having
trouble.

Does cluster analysis in R give probabilities?
Which type of cluster analysis in R is best to use? I did not think
hierarchical analysis was a great choice, but maybe I'm wrong. I don't want
to create the average variable, I want the analysis to do it on its own.
I'm also new to R so would have to figure out the right codes to enter, etc.

2-I was also told to analyze each variable on its own before including it in
cluster analysis. I had first included them all then teased out which ones
were not important, but now have been asked to do the reverse. I cannot do
cluster analysis on one variable -for example, one variable is either
present or absent on an individual so of course cluster analysis gives me
two clusters, one representing present and one representing absent. I was
told to use regression, but how can regression also not give the same
result? I feel like it would give me a line connecting a bunch of 0s to 1s.
I don't know what to use, or if I can analyze each variable like this before
putting them into cluster analysis. I ultimately want to only use the
smallest number of variables necessary to identify gender.

I have tried reading manuals etc and talking to people at my school, but
nothing has helped. If anyone has any insight, that would be much
appreciated
Thank you!



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis in R

2012-11-15 Thread Jose Iparraguirre

Have a look at the package mclust.
Jose

From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf Of 
Ingmar Visser [i.vis...@uva.nl]
Sent: 15 November 2012 21:10
To: KitKat
Cc: r-help@r-project.org
Subject: Re: [R] cluster analysis in R

Dear KitKat,

After installing R and reading some introductory material on getting
started with R you may want to check the CRAN task view on cluster analysis:
http://cran.r-project.org/web/views/Cluster.html
which has many useful references to all kinds and flavors of clustering
techniques, hierarchical or not, selecting the nr of clusters based on some
model selection statistic, et cetera.

hth, Ingmar

On Thu, Nov 15, 2012 at 7:14 PM, KitKat  wrote:

> I have two issues.
>
> 1-I am trying to use morphology to identify gender. I have 9 variables,
> both
> continuous and categorical. I was using two-step cluster analysis in SPSS
> because two-step could deal with different types of variables. But the
> output tells me that an animal is in cluster 1 or 2, it does not give me a
> probability (ex. 0.70 cluster 2).  I also did not want to specify that I
> want two clusters, I wanted to see if analysis would naturally give me two
> clusters. These were all advantages to using SPSS but now I'm having
> trouble.
>
> Does cluster analysis in R give probabilities?
> Which type of cluster analysis in R is best to use? I did not think
> hierarchical analysis was a great choice, but maybe I'm wrong. I don't want
> to create the average variable, I want the analysis to do it on its own.
> I'm also new to R so would have to figure out the right codes to enter,
> etc.
>
> 2-I was also told to analyze each variable on its own before including it
> in
> cluster analysis. I had first included them all then teased out which ones
> were not important, but now have been asked to do the reverse. I cannot do
> cluster analysis on one variable -for example, one variable is either
> present or absent on an individual so of course cluster analysis gives me
> two clusters, one representing present and one representing absent. I was
> told to use regression, but how can regression also not give the same
> result? I feel like it would give me a line connecting a bunch of 0s to 1s.
> I don't know what to use, or if I can analyze each variable like this
> before
> putting them into cluster analysis. I ultimately want to only use the
> smallest number of variables necessary to identify gender.
>
> I have tried reading manuals etc and talking to people at my school, but
> nothing has helped. If anyone has any insight, that would be much
> appreciated
> Thank you!
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] cluster analysis in R

2012-11-15 Thread Ingmar Visser

Dear KitKat,

After installing R and reading some introductory material on getting
started with R you may want to check the CRAN task view on cluster analysis:
http://cran.r-project.org/web/views/Cluster.html
which has many useful references to all kinds and flavors of clustering
techniques, hierarchical or not, selecting the nr of clusters based on some
model selection statistic, et cetera.

hth, Ingmar

On Thu, Nov 15, 2012 at 7:14 PM, KitKat  wrote:

> I have two issues.
>
> 1-I am trying to use morphology to identify gender. I have 9 variables,
> both
> continuous and categorical. I was using two-step cluster analysis in SPSS
> because two-step could deal with different types of variables. But the
> output tells me that an animal is in cluster 1 or 2, it does not give me a
> probability (ex. 0.70 cluster 2).  I also did not want to specify that I
> want two clusters, I wanted to see if analysis would naturally give me two
> clusters. These were all advantages to using SPSS but now I'm having
> trouble.
>
> Does cluster analysis in R give probabilities?
> Which type of cluster analysis in R is best to use? I did not think
> hierarchical analysis was a great choice, but maybe I'm wrong. I don't want
> to create the average variable, I want the analysis to do it on its own.
> I'm also new to R so would have to figure out the right codes to enter,
> etc.
>
> 2-I was also told to analyze each variable on its own before including it
> in
> cluster analysis. I had first included them all then teased out which ones
> were not important, but now have been asked to do the reverse. I cannot do
> cluster analysis on one variable -for example, one variable is either
> present or absent on an individual so of course cluster analysis gives me
> two clusters, one representing present and one representing absent. I was
> told to use regression, but how can regression also not give the same
> result? I feel like it would give me a line connecting a bunch of 0s to 1s.
> I don't know what to use, or if I can analyze each variable like this
> before
> putting them into cluster analysis. I ultimately want to only use the
> smallest number of variables necessary to identify gender.
>
> I have tried reading manuals etc and talking to people at my school, but
> nothing has helped. If anyone has any insight, that would be much
> appreciated
> Thank you!
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


[R] cluster analysis in R

2012-11-15 Thread KitKat

I have two issues. 

1-I am trying to use morphology to identify gender. I have 9 variables, both
continuous and categorical. I was using two-step cluster analysis in SPSS
because two-step could deal with different types of variables. But the
output tells me that an animal is in cluster 1 or 2, it does not give me a
probability (ex. 0.70 cluster 2).  I also did not want to specify that I
want two clusters, I wanted to see if analysis would naturally give me two
clusters. These were all advantages to using SPSS but now I'm having
trouble.

Does cluster analysis in R give probabilities?
Which type of cluster analysis in R is best to use? I did not think
hierarchical analysis was a great choice, but maybe I'm wrong. I don't want
to create the average variable, I want the analysis to do it on its own. 
I'm also new to R so would have to figure out the right codes to enter, etc.

2-I was also told to analyze each variable on its own before including it in
cluster analysis. I had first included them all then teased out which ones
were not important, but now have been asked to do the reverse. I cannot do
cluster analysis on one variable -for example, one variable is either
present or absent on an individual so of course cluster analysis gives me
two clusters, one representing present and one representing absent. I was
told to use regression, but how can regression also not give the same
result? I feel like it would give me a line connecting a bunch of 0s to 1s.
I don't know what to use, or if I can analyze each variable like this before
putting them into cluster analysis. I ultimately want to only use the
smallest number of variables necessary to identify gender. 

I have tried reading manuals etc and talking to people at my school, but
nothing has helped. If anyone has any insight, that would be much
appreciated
Thank you!



--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-in-R-tp4649635.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster of points

2012-07-31 Thread Jean V Adams

Frederico,

This is not exactly what you're after, but perhaps it will help.  In this 
example I fit a cluster analysis to the data, then I "cut the tree" at a 
height of 3 (you would do this with your data at a height of 40).  It's 
not a perfect solution, but it might be good enough, depending on the 
spatial distribution of your points.

library(MASS)  # provides eqscplot()

# example data frame with x and y (on the same scale)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# cluster analysis
tree <- hclust(dist(df[, c("x", "y")], method="euclidean"),
               method="complete")

# cut the tree at height 3: with complete linkage, all points within a
# group are then no more than 3 units from each other
df$group <- cutree(tree, h=3)

# plot the data, using color and symbol to identify group membership
eqscplot(df$x, df$y, col=df$group, pch=df$group)

Jean
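
A short sketch of how the choice of linkage changes what "within a distance of 40" means, using simulated coordinates (the names pts, grp_complete and grp_single are illustrative). With complete linkage, every pair of points in a group is at most 40 apart. With single linkage, each point is within 40 of at least one other member of its group, so groups can form long chains.

```r
set.seed(1)
pts <- data.frame(x = runif(50, 0, 300), y = runif(50, 0, 300))
d <- dist(pts)   # Euclidean distances between all pairs of points

# Complete linkage: group diameter (largest pairwise distance) is at most 40
pts$grp_complete <- cutree(hclust(d, method = "complete"), h = 40)

# Single linkage: points are chained together through neighbours within 40
pts$grp_single <- cutree(hclust(d, method = "single"), h = 40)

# Check the largest within-group distance under complete linkage
dm   <- as.matrix(d)
same <- outer(pts$grp_complete, pts$grp_complete, "==")
max(dm[same])

table(pts$grp_single)   # group sizes under single linkage
```

Which of the two definitions is appropriate depends on whether chained groups are acceptable for the application.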


"Frederico Mestre"  wrote on 07/30/2012 
07:29:00 AM:
> 
> Hello:
> 
> What I want to do is quite simple, but I can't find a way.
> 
> I have a data frame with several points (x and y coords). I want to add
> another column with cluster membership. For example aggregate all the
> points that stand within a distance of 40 from each other.
> 
> I've tried using "nncluster" from the package nnclust, but the results
> are not correct, for some reason (probably my mistake).
> 
> This is what I did:
> 
> x <- nncluster(as.matrix(dframe[,1:2]), threshold=35, fill = 1,
>                maxclust = NULL, give.up = 500, verbose=FALSE,
>                start=NULL) # evaluate the clusters
> 
> Thanks,
> 
> Frederico


[R] cluster of points

2012-07-30 Thread Frederico Mestre

Hello:

What I want to do is quite simple, but I can't find a way.

I have a data frame with several points (x and y coords). I want to add
another column with cluster membership. For example aggregate all the points
that stand within a distance of 40 from each other.

I've tried using "nncluster" from the package nnclust, but the results are
not correct, for some reason (probably my mistake).

This is what I did:

x <- nncluster(as.matrix(dframe[,1:2]), threshold=35, fill = 1,
               maxclust = NULL, give.up = 500, verbose=FALSE,
               start=NULL) # evaluate the clusters

Thanks,

Frederico

 



Re: [R] cluster algorithm with fixed cluster size

2012-06-07 Thread Martin Gütlein

Hi,

okay, and which algorithm is it? I had a closer look at the manual and could
not find it, but there is quite a number of methods in there, maybe I missed
it.

Thanks,
Martin

--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-algorithm-with-fixed-cluster-size-tp4632523p4632746.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster algorithm with fixed cluster size

2012-06-06 Thread Özgür Asar

Hi,

See the package cluster in R.

Ozgur

--
View this message in context: 
http://r.789695.n4.nabble.com/cluster-algorithm-with-fixed-cluster-size-tp4632523p4632540.html
Sent from the R help mailing list archive at Nabble.com.


[R] cluster algorithm with fixed cluster size

2012-06-06 Thread Martin Guetlein

Hi all,

Does anyone know a cluster algorithm in R that allows to set the
cluster size (not the number of clusters) to a fixed value?

With best regards,

Martin



-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 7633 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetl...@informatik.uni-freiburg.de


Re: [R] cluster with mahalanobis distance

2012-05-31 Thread David L Carlson

Use distance() in package ecodist to compute the mahalanobis distance matrix
and pass that to hclust(). 

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
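
As an alternative sketch that needs only base R (this is not the ecodist approach suggested above): transforming the data with the inverse Cholesky factor of the covariance matrix makes ordinary Euclidean distances on the transformed data equal to Mahalanobis distances on the original data. The iris measurements stand in for real data here, and the object names are illustrative.

```r
X <- as.matrix(iris[, 1:4])   # example data; replace with your own numeric matrix
S <- cov(X)                   # pooled covariance of all observations

# chol(S) returns upper-triangular R with S = t(R) %*% R.
# Euclidean distance between rows of Z equals Mahalanobis distance in X.
Z <- X %*% solve(chol(S))
d_mah <- dist(Z)

hc <- hclust(d_mah, method = "average")
plot(hc, labels = FALSE, main = "Average linkage, Mahalanobis distance")

# Cross-check one pair against stats::mahalanobis(), which returns the
# squared distance
as.matrix(d_mah)[1, 2]^2
mahalanobis(X[1, ], X[2, ], S)
```

The covariance used here is computed over the whole data set; if within-group covariance is intended, S would need to be estimated differently.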

> -Original Message-
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-
> project.org] On Behalf Of Maria Froes
> Sent: Wednesday, May 30, 2012 6:42 PM
> To: r-help@r-project.org
> Subject: Re: [R] cluster with mahalanobis distance
> 
> How can I perform cluster analysis using the mahalanobis distance
> instead of the euclidean distance?
> 
> 
> 
> Thank you
> 
> Maria Froes
> 
> 
>   [[alternative HTML version deleted]]
> 

Re: [R] cluster with mahalanobis distance

2012-05-30 Thread Maria Froes

How can I perform cluster analysis using the Mahalanobis distance instead of
the Euclidean distance?

Thank you

Maria Froes



Re: [R] Cluster Analysis

2012-04-19 Thread Alekseiy Beloshitskiy

Hi, Taisa,

It depends on many factors, e.g. the nature of your data, the size of the data
set, etc.

The R analog of SAS FASTCLUS is kmeans() (for a practical example, see slide #35
here:
 http://www.slideshare.net/whitish/textmining-with-r)

Also check k-medoids (pam() in package cluster) and hclust().
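
A minimal sketch of trying the three functions side by side may help here. It assumes the built-in iris measurements as stand-in data and three clusters; neither comes from the original question, and the object names are illustrative only.

```r
library(cluster)   # recommended package shipped with R; provides pam()

x <- scale(as.matrix(iris[, 1:4]))   # stand-in data, standardised

set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)   # rough counterpart of FASTCLUS
pm <- pam(x, k = 3)                         # k-medoids
hc <- hclust(dist(x), method = "ward.D2")   # hierarchical clustering

# Cross-tabulate the partitions to see how far the methods agree
table(kmeans = km$cluster, pam = pm$clustering)
table(kmeans = km$cluster, hclust = cutree(hc, k = 3))
```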

Good luck,
-Alex


From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] on behalf of 
Taisa Brown [taisa.br...@unb.ca]
Sent: 15 April 2012 03:28
To: r-help@r-project.org
Subject: [R] Cluster Analysis

Hi,

I was wondering what the best equivalent to SAS's FASTCLUS and PROC CLUSTER 
would be.  I need to be able to test the significance of the clusters by 
comparing the probability of obtaining an equal or greater pseudo F to the 
Bonferroni-corrected level. I will also need to plot r squared against the 
number of clusters.

Thanks so much,

Taisa


Re: [R] Cluster Analysis

2012-04-16 Thread David L Carlson

At the R command prompt 
?kmeans (for info on the R equivalent to FASTCLUS)
?hclust (for info on the R equivalent to CLUSTER)

Install package clusterSim 
and look at function index.G1 for the Calinski-Harabasz pseudo F-statistic
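
As a rough illustration of the quantities asked about, the pseudo F and R squared can also be computed by hand from the sums of squares that kmeans() returns. The sketch below does not use clusterSim; it assumes the built-in iris data as a stand-in, and the range of k is arbitrary. It does not address the significance test.

```r
x  <- scale(as.matrix(iris[, 1:4]))   # stand-in data
n  <- nrow(x)
ks <- 2:8                             # candidate numbers of clusters

set.seed(1)
stats <- t(sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  c(k        = k,
    # R squared: share of the total sum of squares between clusters
    r2       = fit$betweenss / fit$totss,
    # Calinski-Harabasz pseudo F
    pseudo_f = (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k)))
}))
stats

plot(stats[, "k"], stats[, "r2"], type = "b",
     xlab = "Number of clusters", ylab = "R squared")
```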

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

> -Original Message-
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-
> project.org] On Behalf Of Taisa Brown
> Sent: Saturday, April 14, 2012 7:29 PM
> To: r-help@r-project.org
> Subject: [R] Cluster Analysis
> 
> Hi,
> 
> I was wondering what the best equivalent to SAS's FASTCLUS and PROC
> CLUSTER would be.  I need to be able to test the significance of the
> clusters by comparing the probability of obtaining an equal or greater
> pseudo F to the Bonferroni-corrected level. I will also need to plot r
> squared against the number of clusters.
> 
> Thanks so much,
> 
> Taisa
> 

[R] Cluster Analysis

2012-04-14 Thread Taisa Brown

Hi,

I was wondering what the best equivalent to SAS's FASTCLUS and PROC CLUSTER 
would be.  I need to be able to test the significance of the clusters by 
comparing the probability of obtaining an equal or greater pseudo F to the 
Bonferroni-corrected level. I will also need to plot r squared against the 
number of clusters.

Thanks so much,

Taisa


Re: [R] cluster analysis with pairwise data

2012-04-04 Thread ilai

On Wed, Apr 4, 2012 at 10:12 AM, Petr Savicky  wrote:
> On Wed, Apr 04, 2012 at 01:32:10PM +0200, paladini wrote:

>  Var1 <- c("(1,2)", "(7,8)", "(4,7)")
>  Var2 <- c("(1,5)", "(3,88)", "(12,4)")
>  Var3 <- c("(4,2)", "(6,5)", "(4,4)")
>  DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)
>
> If you want to use a distance between pairs depending on the
> numbers (and not only equal/different pair), then the data should
> be transformed to a numeric format.

Or, if each pair has a unique meaning, ?daisy (also in the cluster
package) comes in handy. In this case you'll want to keep the Vi as
factors when building DF.
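
A small sketch of that route, under the assumption that Gower's coefficient is an acceptable dissimilarity for the labels. The fourth row is hypothetical and added only so that two rows share a pair.

```r
library(cluster)

# Pairs kept as labels (factors); row 4 is made up and repeats the
# first Var1 pair so that not all rows are maximally different
DF <- data.frame(
  Var1 = c("(1,2)", "(7,8)", "(4,7)", "(1,2)"),
  Var2 = c("(1,5)", "(3,88)", "(12,4)", "(2,2)"),
  Var3 = c("(4,2)", "(6,5)", "(4,4)", "(9,9)"),
  stringsAsFactors = TRUE
)

# For factors, Gower dissimilarity is the share of variables that differ
d <- daisy(DF, metric = "gower")
as.matrix(d)   # rows 1 and 4 agree on one of three variables

hc <- hclust(as.dist(d), method = "average")
```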

Cheers

> For example, as follows
>
>  trans <- function(x)
>  {
>      y <- strsplit(gsub("[()]", "", x), ",")
>      unname(t(vapply(y, FUN=as.numeric, FUN.VALUE=c(0, 0))))
>  }
>
>  DF <- data.frame(Var1=trans(Var1), Var2=trans(Var2), Var3=trans(Var3))
>  DF
>
>    Var1.1 Var1.2 Var2.1 Var2.2 Var3.1 Var3.2
>  1      1      2      1      5      4      2
>  2      7      8      3     88      6      5
>  3      4      7     12      4      4      4
>
> Then, see library(help=cluster).
>
> Hope this helps.
>
> Petr Savicky.
>

Re: [R] cluster analysis with pairwise data

2012-04-04 Thread Petr Savicky

On Wed, Apr 04, 2012 at 01:32:10PM +0200, paladini wrote:
> Hello,
> I want to do a cluster analysis with my data. The problem is that the
> variables don't consist of single values; the entries are pairs of
> values.
> That looks like this:
>
> Variable 1:   Variable 2:   Variable 3:   ...
> (1,2)         (1,5)         (4,2)
> (7,8)         (3,88)        (6,5)
> (4,7)         (12,4)        (4,4)
> .             .             .
> .             .             .
> .             .             .
> Is it possible to perform a cluster analysis with this kind of data in
> R?
> I don't even know how to get this data into a matrix or a data frame or
> anything like this.

Hi.

The data as they are may be read into R as character data. The
exact way depends on the format of the data in the file. The
result may look like the following.

  Var1 <- c("(1,2)", "(7,8)", "(4,7)")
  Var2 <- c("(1,5)", "(3,88)", "(12,4)")
  Var3 <- c("(4,2)", "(6,5)", "(4,4)")
  DF <- data.frame(Var1, Var2, Var3, stringsAsFactors=FALSE)

If you want to use a distance between pairs depending on the
numbers (and not only equal/different pair), then the data should
be transformed to a numeric format. For example, as follows

  trans <- function(x)
  {
      # strip the parentheses, then split each entry at the comma
      y <- strsplit(gsub("[()]", "", x), ",")
      # one row per entry, two numeric columns
      unname(t(vapply(y, FUN=as.numeric, FUN.VALUE=c(0, 0))))
  }

  DF <- data.frame(Var1=trans(Var1), Var2=trans(Var2), Var3=trans(Var3))
  DF

    Var1.1 Var1.2 Var2.1 Var2.2 Var3.1 Var3.2
  1      1      2      1      5      4      2
  2      7      8      3     88      6      5
  3      4      7     12      4      4      4

Then, see library(help=cluster).
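
As one possible continuation, the numeric columns can be passed to dist() and hclust(). The sketch below is self-contained; standardising with scale() and cutting into two groups are assumptions made for the example, not part of the original advice.

```r
# Convert "(a,b)" strings into a two-column numeric matrix
trans <- function(x) {
  y <- strsplit(gsub("[()]", "", x), ",")
  unname(t(vapply(y, FUN = as.numeric, FUN.VALUE = c(0, 0))))
}

Var1 <- c("(1,2)", "(7,8)", "(4,7)")
Var2 <- c("(1,5)", "(3,88)", "(12,4)")
Var3 <- c("(4,2)", "(6,5)", "(4,4)")

num <- cbind(trans(Var1), trans(Var2), trans(Var3))

# Standardise first so that one large column (here the 88) does not
# dominate the Euclidean distances
d  <- dist(scale(num))
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
groups
```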

Hope this helps.

Petr Savicky.


Re: [R] cluster analysis with pairwise data

2012-04-04 Thread David L Carlson

You can create distance matrices for each Variable, square them, sum them,
and take the square root. As for getting the data into a data frame, the
simplest would be to enter the three variables into six columns like the
following:

data
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    1    5    4    2
[2,]    7    8    3   88    6    5
[3,]    4    7   12    4    4    4

Then use dist() on each pair of columns:

1:2, 3:4, 5:6 . . .

e.g. for the 3 rows of data you provided

n <- nrow(data)
dm <- dist(matrix(0, n, 1))   # all-zero "dist" object with n*(n-1)/2 entries
for(i in seq(1, 6, 2)) {
  dm <- dm + dist(data[,i:(i+1)])^2
}
dm <- sqrt(dm)
dm
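
A quick self-contained check of this recipe, using the three rows from the question: with unweighted pairs, the combined distance is the same as a single Euclidean distance over all six columns, so the loop mainly pays off when the pairs are to be weighted or standardised separately. The variable names are illustrative.

```r
data <- matrix(c(1, 2,  1,  5, 4, 2,
                 7, 8,  3, 88, 6, 5,
                 4, 7, 12,  4, 4, 4), nrow = 3, byrow = TRUE)

n  <- nrow(data)
dm <- dist(matrix(0, n, 1))            # zero-filled "dist" object
for (i in seq(1, ncol(data), by = 2)) {
  dm <- dm + dist(data[, i:(i + 1)])^2 # add squared per-pair distances
}
dm <- sqrt(dm)

# Compare with one Euclidean distance over all six columns
all.equal(as.vector(dm), as.vector(dist(data)))
```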

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352



-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of paladini
Sent: Wednesday, April 04, 2012 6:32 AM
To: r-help@r-project.org
Subject: [R] cluster analysis with pairwise data

Hello,
I want to do a cluster analysis with my data. The problem is that the
variables don't consist of single values; the entries are pairs of
values.
That looks like this:

Variable 1:   Variable 2:   Variable 3:   ...
(1,2)         (1,5)         (4,2)
(7,8)         (3,88)        (6,5)
(4,7)         (12,4)        (4,4)
.             .             .
.             .             .
.             .             .
Is it possible to perform a cluster analysis with this kind of data in
R?
I don't even know how to get this data into a matrix or a data frame or
anything like this.

It would be really nice if somebody could help me.

Best regards and happy Easter

Claudia


[R] cluster analysis with pairwise data

2012-04-04 Thread paladini


Hello,
I want to do a cluster analysis with my data. The problem is that the
variables don't consist of single values; the entries are pairs of
values.

That looks like this:

Variable 1:   Variable 2:   Variable 3:   ...
(1,2)         (1,5)         (4,2)
(7,8)         (3,88)        (6,5)
(4,7)         (12,4)        (4,4)
.             .             .
.             .             .
.             .             .
Is it possible to perform a cluster analysis with this kind of data in
R?
I don't even know how to get this data into a matrix or a data frame or
anything like this.


It would be really nice if somebody could help me.

Best regards and happy Easter

Claudia


[R] Cluster GUI package worth publishing/enhancing?

2012-02-28 Thread Todd Gillette

For a school course, a partner and I developed a GUI in R designed to
enable exploration of data via visualization of hierarchical
clustering and correlation of cluster partitions with external
metadata. The key features were the ability to load in a distance
matrix (most GUI-based clustering programs require feature vector
input), and the ability to dynamically subset the data via the GUI
built using a user-provided metadata file. I didn't think this was
sufficient to publish a package, but it did seem like it could make
for a good foundation. I so far have not come across other software
that provides all of these features. I was hoping to get initial
feedback in terms of whether anyone thought this, with or without
certain enhancements, might be worthwhile. For a sense of what the
existing tool looks like and what it does:
http://mason.gmu.edu/~tgillett/R_Cluster_GUI/RClusterGUI.html

We intend to extend the tool to enable feature vector input,
additional forms of visualization beyond cluster dendrogram,
multidimensional scaling of distance matrix input, internal and
external cluster statistics (e.g. Davies-Bouldin index, Rand index),
random subsampling of data to account for the instability of clusters,
and non-hierarchical clustering methods. Obviously a well-organized
GUI is important, and we would seek out feedback and perhaps partner
developers.

I'd appreciate any direct feedback here, or suggestions of more
appropriate forums for getting feedback and having a discussion on the
topic.

Thank you,
Todd Gillette


Re: [R] cluster by unique value

2011-07-18 Thread Petr Savicky

On Mon, Jul 18, 2011 at 06:36:13AM -0400, Sarah Goslee wrote:
> Your data1 and your data1_class file differ in the first three
> columns. Assuming that's an error, here's one way to do it:
> 
> > data1 <- data.frame(layer1=c(.2, .5, .2, .8, .2, .5, .5, .8, .2, .8),
> >   layer2=c(2,3,2,2,1,2,3,2,2,2), layer3=c(1,1,1,1,1,1,1,1,1,4))
> > data1 <- cbind(data1, class=as.numeric(as.factor(do.call(paste, data1))))
> > data1
>    layer1 layer2 layer3 class
> 1     0.2      2      1     2
> 2     0.5      3      1     4
> 3     0.2      2      1     2
> 4     0.8      2      1     5
> 5     0.2      1      1     1
> 6     0.5      2      1     3
> 7     0.5      3      1     4
> 8     0.8      2      1     5
> 9     0.2      2      1     2
> 10    0.8      2      4     6
> 
> You didn't give a reproducible example, and I didn't want to type in
> all the decimal places, but you should be able to get the idea from
> this example. Also, the class numbers are assigned on sorted character
> rows, from lowest to highest, and not starting with the first one, as
> in your example.  If you do need the latter, some combination of
> unique() and subsetting or merge() may work for you.

Let me suggest the following modification, which assigns numbers
to the classes according to their first occurrence.

  data1 <- data.frame(layer1=c(.2, .5, .2, .8, .2, .5, .5, .8, .2, .8),
  layer2=c(2,3,2,2,1,2,3,2,2,2), layer3=c(1,1,1,1,1,1,1,1,1,4))
  x <- do.call(paste, data1)
  data1 <- cbind(data1, class=as.numeric(factor(x, levels=unique(x))))
  data1

     layer1 layer2 layer3 class
  1     0.2      2      1     1
  2     0.5      3      1     2
  3     0.2      2      1     1
  4     0.8      2      1     3
  5     0.2      1      1     4
  6     0.5      2      1     5
  7     0.5      3      1     2
  8     0.8      2      1     3
  9     0.2      2      1     1
  10    0.8      2      4     6
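
An equivalent way to get first-occurrence numbering, sketched here as an alternative and not as the method used above, is match() against the unique row keys. The names key and class are illustrative.

```r
data1 <- data.frame(layer1 = c(.2, .5, .2, .8, .2, .5, .5, .8, .2, .8),
                    layer2 = c(2, 3, 2, 2, 1, 2, 3, 2, 2, 2),
                    layer3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 4))

key <- do.call(paste, data1)             # one string per row
data1$class <- match(key, unique(key))   # position of first occurrence
data1$class
# 1 2 1 3 4 5 2 3 1 6
```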

Hope this helps.

Petr Savicky.


Re: [R] cluster by unique value

2011-07-18 Thread jim holtman

Also read FAQ 7.31 before using 'numerics' as grouping factors.
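
To illustrate the concern: two doubles that print alike need not be equal, so rows that look identical can land in different groups (or the reverse, depending on how the keys are formatted). One way around it, offered as an assumption about what precision matters and not as the only remedy, is to build the grouping key at a fixed number of decimals.

```r
x <- c(0.1 + 0.2, 0.3, 0.5)

x[1] == x[2]                    # FALSE: the two values differ in the last bits
isTRUE(all.equal(x[1], x[2]))   # TRUE: equal up to numerical tolerance

# Group on a key with a fixed number of decimals (6 is an arbitrary choice)
key <- sprintf("%.6f", x)
cls <- as.integer(factor(key, levels = unique(key)))
cls
# 1 1 2
```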

On Mon, Jul 18, 2011 at 6:36 AM, Sarah Goslee  wrote:
> Your data1 and your data1_class file differ in the first three
> columns. Assuming that's an error, here's one way to do it:
>
>> data1 <- data.frame(layer1=c(.2, .5, .2, .8, .2, .5, .5, .8, .2, .8),
>>   layer2=c(2,3,2,2,1,2,3,2,2,2), layer3=c(1,1,1,1,1,1,1,1,1,4))
>> data1 <- cbind(data1, class=as.numeric(as.factor(do.call(paste, data1))))
>> data1
>   layer1 layer2 layer3 class
> 1     0.2      2      1     2
> 2     0.5      3      1     4
> 3     0.2      2      1     2
> 4     0.8      2      1     5
> 5     0.2      1      1     1
> 6     0.5      2      1     3
> 7     0.5      3      1     4
> 8     0.8      2      1     5
> 9     0.2      2      1     2
> 10    0.8      2      4     6
>
> You didn't give a reproducible example, and I didn't want to type in
> all the decimal places, but you should be able to get the idea from
> this example. Also, the class numbers are assigned on sorted character
> rows, from lowest to highest, and not starting with the first one, as
> in your example.  If you do need the latter, some combination of
> unique() and subsetting or merge() may work for you.
>
> Sarah
>
> On Mon, Jul 18, 2011 at 6:23 AM, Alfredo Alessandrini
>  wrote:
>> Hi,
>>
>> I need to make a cluster classification by the unique values of the data 
>> frame.
>>
>> I explain the problem. I need to classify this table, and assign to
>> the same cluster each row that has the same combination of value:
>>
>>
>>> data1
>>             layer_1 layer_2 layer_3
>>   [1,] 0.246000       2    -0.1
>>   [2,] 0.546000       3    -0.1
>>   [3,] 0.246000       2    -0.1
>>   [4,] 0.846000       2    -0.1
>>   [5,] 0.246000       1    -0.1
>>   [6,] 0.546000       2    -0.1
>>   [7,] 0.246000       2    -0.1
>>   [8,] 0.846000       2    -0.1
>>   [9,] 0.246000       2    -0.1
>>  [10,] 0.246000       2    -0.1
>>
>>
>>> data1_class
>>             layer_1 layer_2 layer_3 class
>>   [1,] 0.246000       2    -0.1  1
>>   [2,] 0.546000       3    -0.1  2
>>   [3,] 0.246000       2    -0.1  1
>>   [4,] 0.846000       2    -0.1  3
>>   [5,] 0.246000       1    -0.1  4
>>   [6,] 0.546000       2    -0.1  5
>>   [7,] 0.546000       3    -0.1  2
>>   [8,] 0.846000       2    -0.1  3
>>   [9,] 0.246000       2    -0.1  1
>>  [10,] 0.846000       2    -0.4  6
>>
>>
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


Re: [R] cluster by unique value

2011-07-18 Thread Sarah Goslee

Your data1 and your data1_class file differ in the first three
columns. Assuming that's an error, here's one way to do it:

> data1 <- data.frame(layer1=c(.2, .5, .2, .8, .2, .5, .5, .8, .2, .8),
+   layer2=c(2,3,2,2,1,2,3,2,2,2), layer3=c(1,1,1,1,1,1,1,1,1,4))
> data1 <- cbind(data1, class=as.numeric(as.factor(do.call(paste, data1))))
> data1
   layer1 layer2 layer3 class
1     0.2      2      1     2
2     0.5      3      1     4
3     0.2      2      1     2
4     0.8      2      1     5
5     0.2      1      1     1
6     0.5      2      1     3
7     0.5      3      1     4
8     0.8      2      1     5
9     0.2      2      1     2
10    0.8      2      4     6

You didn't give a reproducible example, and I didn't want to type in
all the decimal places, but you should be able to get the idea from
this example. Also, the class numbers are assigned on sorted character
rows, from lowest to highest, and not starting with the first one, as
in your example.  If you do need the latter, some combination of
unique() and subsetting or merge() may work for you.

Sarah

On Mon, Jul 18, 2011 at 6:23 AM, Alfredo Alessandrini
 wrote:
> Hi,
>
> I need to make a cluster classification by the unique values of the data 
> frame.
>
> I explain the problem. I need to classify this table, and assign to
> the same cluster each row that has the same combination of value:
>
>
>> data1
>             layer_1 layer_2 layer_3
>   [1,] 0.246000       2    -0.1
>   [2,] 0.546000       3    -0.1
>   [3,] 0.246000       2    -0.1
>   [4,] 0.846000       2    -0.1
>   [5,] 0.246000       1    -0.1
>   [6,] 0.546000       2    -0.1
>   [7,] 0.246000       2    -0.1
>   [8,] 0.846000       2    -0.1
>   [9,] 0.246000       2    -0.1
>  [10,] 0.246000       2    -0.1
>
>
>> data1_class
>             layer_1 layer_2 layer_3 class
>   [1,] 0.246000       2    -0.1  1
>   [2,] 0.546000       3    -0.1  2
>   [3,] 0.246000       2    -0.1  1
>   [4,] 0.846000       2    -0.1  3
>   [5,] 0.246000       1    -0.1  4
>   [6,] 0.546000       2    -0.1  5
>   [7,] 0.546000       3    -0.1  2
>   [8,] 0.846000       2    -0.1  3
>   [9,] 0.246000       2    -0.1  1
>  [10,] 0.846000       2    -0.4  6
>
>

-- 
Sarah Goslee
http://www.functionaldiversity.org


[R] cluster by unique value

2011-07-18 Thread Alfredo Alessandrini

Hi,

I need to make a cluster classification by the unique values of the data frame.

Let me explain the problem. I need to classify this table and assign
each row that has the same combination of values to the same cluster:


> data1
         layer_1 layer_2 layer_3
   [1,] 0.246000       2    -0.1
   [2,] 0.546000       3    -0.1
   [3,] 0.246000       2    -0.1
   [4,] 0.846000       2    -0.1
   [5,] 0.246000       1    -0.1
   [6,] 0.546000       2    -0.1
   [7,] 0.246000       2    -0.1
   [8,] 0.846000       2    -0.1
   [9,] 0.246000       2    -0.1
  [10,] 0.246000       2    -0.1


> data1_class
         layer_1 layer_2 layer_3 class
   [1,] 0.246000       2    -0.1     1
   [2,] 0.546000       3    -0.1     2
   [3,] 0.246000       2    -0.1     1
   [4,] 0.846000       2    -0.1     3
   [5,] 0.246000       1    -0.1     4
   [6,] 0.546000       2    -0.1     5
   [7,] 0.546000       3    -0.1     2
   [8,] 0.846000       2    -0.1     3
   [9,] 0.246000       2    -0.1     1
  [10,] 0.846000       2    -0.4     6



Thanks in advance,

Alfredo


Re: [R] cluster() or frailty() in coxph

2011-06-27 Thread Terry Therneau

Addition of a cluster() term fits a Generalized Estimating Equations
(GEE) type of model; addition of frailty() fits a random effects model
(Mixed Effect or ME).  In glm analysis (linear regression, logistic
regression, etc) the arguments about the advantages/disadvantages of GEE
vs ME would easily fill a volume.  Most of this argument carries over to
the coxph case; I find both approaches useful.

Caveats:
  1. Coxph with cluster() only allows the "working independence"
variance structure.  The details for other variance structures were
worked out by Alicia Z in her Iowa State PhD thesis, but I've never
gotten around to implementing it.
  2. For random effects, the coxme function is preferred.
  3. In comparing GEE and ME one part of the argument is that the
former model is "marginal" and the second "conditional", and thus the
coefficients from the models mean different things.  I take this with a
grain of salt.  Remember that ALL models are wrong.
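
A short sketch of the two fits side by side, using the rats data from package survival as in the original question. It assumes that newer versions of the data set carry a sex column and that the female subset corresponds to the 150 rats analysed in the question; the object names are illustrative.

```r
library(survival)

# Newer survival versions ship both sexes; keep the females if so
rats_f <- if ("sex" %in% names(rats)) subset(rats, sex == "f") else rats

# GEE-type (marginal) fit: same coefficient as an ordinary Cox model,
# with a robust "working independence" variance by litter
gee_fit <- coxph(Surv(time, status) ~ rx + cluster(litter), data = rats_f)

# Random-effects (conditional) fit: gamma frailty per litter
me_fit <- coxph(Surv(time, status) ~ rx + frailty(litter), data = rats_f)

summary(gee_fit)$coefficients[, c("coef", "se(coef)", "robust se")]
coef(me_fit)["rx"]
```

Comparing the naive and robust standard errors in the first fit gives a feel for how much the within-litter correlation matters.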

Terry Therneau


Re: [R] cluster() or frailty() in coxph

2011-06-26 Thread Joshua Wiley

Hi Ehsan,

My understanding (hopefully someone will jump in if this is wrong) is
that cluster() identifies a variable that is an indicator for
correlated observations (rats in a litter, children in a classroom,
etc.).  The relative risk from treatment (rx) is for a random sample
of rats.

frailty() estimates the relative risk from treatment (rx) within
litters.  Also, by default, it (frailty) uses a gamma distribution and
estimates the scale parameter unless specified by theta (or df).
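
The options mentioned above can be tried directly. The sketch below assumes the rats data from package survival (female subset where a sex column exists) and picks theta = 1 purely for illustration.

```r
library(survival)

rats_f <- if ("sex" %in% names(rats)) subset(rats, sex == "f") else rats

# Default: gamma frailty, variance (theta) estimated from the data
fit_gamma <- coxph(Surv(time, status) ~ rx + frailty(litter), data = rats_f)

# Gaussian frailty instead of gamma
fit_gauss <- coxph(Surv(time, status) ~ rx +
                     frailty(litter, distribution = "gaussian"),
                   data = rats_f)

# Gamma frailty with the variance fixed at an arbitrary value
fit_fixed <- coxph(Surv(time, status) ~ rx + frailty(litter, theta = 1),
                   data = rats_f)

# Treatment coefficient under each specification
sapply(list(gamma = fit_gamma, gaussian = fit_gauss, fixed = fit_fixed),
       function(f) unname(coef(f)["rx"]))
```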

There is an entire chapter devoted to frailty models in Therneau &
Grambsch's book on the Cox model (Modeling Survival Data: Extending
the Cox Model).

HTH,

Josh

On Sat, Jun 25, 2011 at 3:48 PM, Ehsan Karim  wrote:
> Dear List,
>
> Can anyone please explain the difference between cluster() and
> frailty() in a coxph? I am a bit puzzled about it. Would appreciate
> any useful reference or direction.
>
> cheers,
>
> Ehsan
>
>
>
>> marginal.model <- coxph(Surv(time, status) ~ rx + cluster(litter), rats)
>> frailty.model  <- coxph(Surv(time, status) ~ rx + frailty(litter), rats)
>> marginal.model
> Call:
> coxph(formula = Surv(time, status) ~ rx + cluster(litter), data = rats)
>
>
>    coef exp(coef) se(coef) robust se    z      p
> rx 0.905      2.47    0.318     0.303 2.99 0.0028
>
> Likelihood ratio test=7.98  on 1 df, p=0.00474  n= 150
>> frailty.model
> Call:
> coxph(formula = Surv(time, status) ~ rx + frailty(litter), data = rats)
>
>                coef  se(coef) se2   Chisq DF   p
> rx              0.914 0.323    0.319  8.01  1.0 0.0046
> frailty(litter)                      17.69 14.4 0.2400
>
> Iterations: 6 outer, 24 Newton-Raphson
>     Variance of random effect= 0.499   I-likelihood = -180.8
> Degrees of freedom for terms=  1.0 14.4
> Likelihood ratio test=37.6  on 15.4 df, p=0.00124  n= 150
>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


[R] cluster() or frailty() in coxph

2011-06-25 Thread Ehsan Karim

Dear List,

Can anyone please explain the difference between cluster() and
frailty() in a coxph? I am a bit puzzled about it. Would appreciate
any useful reference or direction.

cheers,

Ehsan



> marginal.model <- coxph(Surv(time, status) ~ rx + cluster(litter), rats)
> frailty.model  <- coxph(Surv(time, status) ~ rx + frailty(litter), rats)
> marginal.model
Call:
coxph(formula = Surv(time, status) ~ rx + cluster(litter), data = rats)


    coef exp(coef) se(coef) robust se    z      p
rx 0.905      2.47    0.318     0.303 2.99 0.0028

Likelihood ratio test=7.98  on 1 df, p=0.00474  n= 150
> frailty.model
Call:
coxph(formula = Surv(time, status) ~ rx + frailty(litter), data = rats)

                coef  se(coef) se2   Chisq DF   p
rx              0.914 0.323    0.319  8.01  1.0 0.0046
frailty(litter)                      17.69 14.4 0.2400

Iterations: 6 outer, 24 Newton-Raphson
 Variance of random effect= 0.499   I-likelihood = -180.8
Degrees of freedom for terms=  1.0 14.4
Likelihood ratio test=37.6  on 15.4 df, p=0.00124  n= 150


[R] cluster analysis on extreme event

2011-05-27 Thread FMH

Dear all,

I'm modelling extreme rainfall, particularly events that lie above a threshold,
and am looking for a suitable R package that would enable a cluster
analysis of those extreme events. I would really appreciate any
suggestions.

Thanks,
Fir


Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Peter Langfelder

On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand  wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Do Euclidean distances make sense on unordered factors coded as
> integers?

It probably doesn't. You said you have some 36 observations for each
case, correct? You can turn these 36 observations into a vector of
length 36 * 9 on which Euclidean distance will make some sense, namely
k changes will produce a distance of sqrt(2*k). For each observation
with value p (p between 1 and 9), create a vector r = c(0,0,1,0,...0)
where the entry 1 is in the p-th component. Hence, if values p1 and p2
are the same, the Euclidean distance between r1 and r2 is zero; if they
are not the same, the Euclidean distance is sqrt(2).

Here's some possible R code:

transform = function(obsVector, maxVal)
{
  templateMat = matrix(0, maxVal, maxVal);
  diag(templateMat) = 1;

  return(as.vector(templateMat[, obsVector]));
}

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1

trafoData = apply(data, 2, transform, maxVal = max);

> trafoData
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0

The code assumes that cases are in columns and observations in rows of
data. Examine data and trafoData to see how the transformation works.
Once you have the transformed data, simply apply your favorite
clustering method that uses Euclidean distance.
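
The sqrt(2*k) property can be checked on a toy example. The helper one_hot() and the three made-up cases below are illustrative only; note that cases are stored in rows here, unlike the code above where they are columns.

```r
# Indicator coding: one block of max_val zeros/ones per observation
one_hot <- function(obs, max_val) as.vector(diag(max_val)[, obs])

a <- c(1, 2, 3, 4, 1)
b <- c(1, 3, 3, 2, 1)          # differs from a in k = 2 positions
k <- sum(a != b)

d <- sqrt(sum((one_hot(a, 4) - one_hot(b, 4))^2))
all.equal(d, sqrt(2 * k))      # TRUE

# Cluster a few cases on the coded data (cases in rows)
cases <- rbind(a, b, c = c(4, 4, 4, 4, 4))
coded <- t(apply(cases, 1, one_hot, max_val = 4))
hc <- hclust(dist(coded), method = "average")
```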

HTH,

Peter


Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand

On Thu, Mar 31, 2011 at 08:48:02PM +0200, Hans Ekbrand wrote:
> On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> > Dear Hans,
> > 
> > clara doesn't require a distance matrix as input (and therefore
> > doesn't require you to run daisy), it will work with the raw data
> > matrix using
> > Euclidean distances implicitly.
> > I can't tell you whether Euclidean distances are appropriate in this
> > situation (this depends on the interpretation and variables and
> > particularly on how they are scaled), but they may be fine at least
> > after some transformation and standardisation of your variables.
> 
> The variables are unordered factors, stored as integers 1:9, where 
> 
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
> 
> Do Euclidean distances make sense on unordered factors coded as
> integers?

To be clear, here is an extract

> my.df.full[900:910, 16:19]
    PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
900                 2                 2                 1                 2
901                 1                 1                 1                 1
902                 1                 1                 1                 1
903                 2                 2                 2                 2
904                 1                 1                 1                 1
905                 2                 2                 2                 2
906                 7                 8                 2                 7
907                 5                 5                 5                 5
908                 1                 1                 1                 1
909                 1                 1                 1                 1
910                 1                 1                 1                 1

> class(my.df.full[,16])
[1] "integer"


Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand

On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> Dear Hans,
> 
> clara doesn't require a distance matrix as input (and therefore
> doesn't require you to run daisy), it will work with the raw data
> matrix using
> Euclidean distances implicitly.
> I can't tell you whether Euclidean distances are appropriate in this
> situation (this depends on the interpretation and variables and
> particularly on how they are scaled), but they may be fine at least
> after some transformation and standardisation of your variables.

The variables are unordered factors, stored as integers 1:9, where 

1 means "Full-time employment"
2 means "Part-time employment"
3 means "Student"
4 means "Full-time self-employee"
...

Do Euclidean distances make sense on unordered factors coded as
integers?


Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Christian Hennig


Dear Hans,

clara doesn't require a distance matrix as input (and therefore doesn't
require you to run daisy), it will work with the raw data matrix using
Euclidean distances implicitly.
I can't tell you whether Euclidean distances are appropriate in this
situation (this depends on the interpretation and variables and
particularly on how they are scaled), but they may be fine at least after
some transformation and standardisation of your variables.
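
One possible transformation, sketched here under the assumption that the statuses are purely nominal, is to dummy-code every variable (one 0/1 column per level) and pass that numeric matrix to clara(). With metric = "manhattan" the distance between two dummy-coded rows is exactly twice the number of variables on which the two cases differ, so it behaves like a simple-matching distance, and no full distance matrix has to be stored. The simulated data, the names status and one_hot, and the choice of k = 4 are all hypothetical.

```r
library(cluster)

set.seed(1)
n <- 2000  # stand-in for the ~50,000 cases

# Hypothetical data: 6 monthly status variables with 8 unordered levels each.
status <- as.data.frame(lapply(
  setNames(1:6, paste0("month", 1:6)),
  function(i) factor(sample(1:8, n, replace = TRUE), levels = 1:8)
))

# One 0/1 indicator column per (variable, level) pair.
one_hot <- do.call(cbind, lapply(status, function(f) {
  outer(as.integer(f), seq_along(levels(f)), "==") * 1
}))

# The Manhattan distance between two dummy-coded rows is twice the number
# of variables on which the two cases differ.
mismatches <- sum(vapply(status, function(f) f[1] != f[2], logical(1)))
manhattan_12 <- sum(abs(one_hot[1, ] - one_hot[2, ]))

# clara() clusters subsamples, so no n x n distance matrix is ever built.
fit <- clara(one_hot, k = 4, metric = "manhattan", samples = 20, sampsize = 200)
print(table(fit$clustering))
```

Whether a matching-type distance is appropriate for month-by-month sequences is a separate modelling question that this sketch does not settle.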


Hope this helps,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

[R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Hans Ekbrand

Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The
variables represent labour market status during 36 months; there are 8
different variable values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are
included in the data set.

To analyse subsets of the data, I have used daisy in the
cluster package to create a distance matrix and then used pam (or pamk
in the fpc package) to get a k-medoids cluster solution. Now I want
to analyse the whole set.

clara is said to cope with large data sets, but the first step in the
cluster analysis, the creation of the distance matrix, must be done by
another function since clara only works with numeric data.

Is there an alternative to the daisy -> clara route that does not
require as much RAM?

What functions would you recommend for a cluster analysis of this kind
of data on a large data set?


regards,

Hans Ekbrand

Re: [R] cluster analysis: predefined clusters

2010-12-01 Thread deriK2000



Peter Langfelder wrote:
> 
> On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert  wrote:
>> Dear list,
>>
>> running a hierarchical cluster analysis I want to define a number of
>> objects that build a cluster already. In other words: I want to force
>> some of the cases to be in the same cluster from the start of the
>> algorithm.
>>
>> Any hints? Thanks in advance!
> 
> The hclust function has an argument 'members' that should allow you to
> do that. You will need to specify the dissimilarity matrix
> accordingly.
> 
> Peter
> 
> 

Thank you! But to "specify the dissimilarity matrix" correctly seems to be
a major task. Has anyone done so before?
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-predefined-clusters-tp3060433p3067215.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis: predefined clusters

2010-11-26 Thread Peter Langfelder

On Fri, Nov 26, 2010 at 6:55 AM, Derik Burgert  wrote:
> Dear list,
>
> running a hierarchical cluster analysis I want to define a number of objects 
> that build a cluster already. In other words: I want to force some of the 
> cases to be in the same cluster from the start of the algorithm.
>
> Any hints? Thanks in advance!

The hclust function has an argument 'members' that should allow you to
do that. You will need to specify the dissimilarity matrix
accordingly.
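
A minimal sketch of what this could look like, assuming centroid linkage on squared Euclidean distances (the combination for which between-cluster dissimilarities are easy to compute by hand): collapse each predefined group to its centroid, compute the dissimilarities between centroids, and tell hclust() how many cases each starting cluster holds via 'members'. The toy coordinates and the grouping vector start are hypothetical.

```r
# Toy data: 6 points in 2-D; rows 1 and 2 are forced into one starting cluster.
x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6), c(10, 0), c(10, 1))
start <- c(1, 1, 2, 3, 4, 5)   # hypothetical predefined grouping

# Size and centroid of each starting cluster (rows ordered by group label).
sizes <- as.vector(table(start))
cent <- rowsum(x, start) / sizes

# With 'members' given, the dissimilarities are taken to be between the
# starting clusters rather than between single cases.
hc <- hclust(dist(cent)^2, method = "centroid", members = sizes)

# Cut the tree and map the result back to the original six rows.
groups <- cutree(hc, k = 3)
merged <- unname(groups[as.character(start)])
print(merged)
```

Rows 1 and 2 can never be separated here because the algorithm only ever sees them as one starting cluster. For other linkage methods the between-cluster dissimilarities would have to be worked out differently.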

Peter

[R] cluster analysis: predefined clusters

2010-11-26 Thread Derik Burgert

Dear list,
 
When running a hierarchical cluster analysis, I want to define a number of objects
that already form a cluster. In other words: I want to force some of the cases
to be in the same cluster from the start of the algorithm.
 
Any hints? Thanks in advance!
 
Derik


Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-09-27 Thread abanero


Hi Ulrich,
I'm studying the principles of Affinity Propagation and I'm really glad to
use your package (apcluster) to cluster my data. I have just one
issue to solve.

If I apply the function: apcluster(sim)

where sim is the matrix of dissimilarities, I sometimes encounter the
warning message:

"Algorithm did not converge. Turn on details
and call plot() to monitor net similarity. Consider
increasing maxits and convits, and, if oscillations occur
also increasing damping factor lam."
 
together with too high a number of clusters.

I thought I could solve the problem by setting the argument "p" of the function
apcluster() to mean(preferenceRange(sim)):


apcluster(sim, p=mean(preferenceRange(sim)))

and it does seem to be a good solution, because I don't receive any
warning message and the number of clusters is lower.

Do you think it's a good solution? I should mention that I have to use apcluster()
in an automatic procedure, so I can't manipulate the arguments of
the function directly.

Thanks in advance.
Giuseppe
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2715278.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] Cluster analysis

2010-07-27 Thread Pablo Cerdeira

Hi Jim,

Wow! Very nice work at http://mephisto.unige.ch/traminer/preview.shtml. I'm
going to read more about it.

I have a lot of different steps in a sequence. Actually, there are 586 different
possible steps, but I have 4269 legal cases, with a maximum of 379 steps
each.

If you want, I can send this dataset to you.

Best regards and thank you very much,

-- 
*Pablo de Camargo Cerdeira*
pa...@fgv.br
pablo.cerde...@gmail.com
+55 (21) 3799-6065

Re: [R] Cluster analysis

2010-07-27 Thread Pablo Cerdeira

Hi Allan,

It helps a lot. I'll try to read more about it.

But, as you asked, here is a brief explanation of the relevant
columns of the sample data pasted at the end:

id_processo: identifies a legal case; it is its primary key.
ordem_andamento: the step number within a legal case (id_processo).
id_andamento: the primary key of the step.

I'd like to identify the most common sequences (ordem_andamento) of steps
(id_andamento) across a lot of legal cases (id_processo). Probably a
cluster analysis with a dendrogram plot is what I'm looking for.
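
Before any clustering, a simple frequency count of whole paths may already answer part of the question. The sketch below is plain base R, not TraMineR: it assumes a long-format data frame with the three columns described above, and the nine toy rows are invented for illustration (only the column names and a few step codes are taken from the sample).

```r
# Hypothetical long-format data: one row per step of each legal case.
steps <- data.frame(
  id_processo     = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  ordem_andamento = c(1, 2, 3, 1, 2, 3, 2, 1, 3),
  id_andamento    = c(208, 69, 180, 208, 69, 352, 69, 208, 180)
)

# Make sure the steps of each case are in their recorded order.
steps <- steps[order(steps$id_processo, steps$ordem_andamento), ]

# Collapse each case into one path string such as "208-69-180".
paths <- tapply(steps$id_andamento, steps$id_processo, paste, collapse = "-")

# Count identical paths, most frequent first.
path_counts <- sort(table(paths), decreasing = TRUE)
print(path_counts)
```

With several hundred possible steps and long cases, exact paths will rarely repeat, which is where a sequence dissimilarity followed by clustering becomes useful.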

Here is a sample of two different legal cases (2 different
id_processo):

Best regards and thank you in advance

id_processo,proc_num,ordem_andamento,id_andamento,andamento,data,dias,origem_tribunal,data_entrada,relator,duracao_dias
1480010,1,1,208,DISTRIBUIDO,"1988-10-06 00:00:00",5,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,2,69,CONCLUSAO,"1988-10-06 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,3,180,"DESPACHO ORDINATORIO","1988-10-11 00:00:00",8,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,4,465,"PEDIDO DE INFORMACOES","1988-10-19 00:00:00",1,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,5,465,"PEDIDO DE INFORMACOES","1988-10-20 00:00:00",15,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,6,241,"INFORMACOES RECEBIDAS, OFICIO NRO.:","1988-11-04 00:00:00",24,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,7,241,"INFORMACOES RECEBIDAS, OFICIO NRO.:","1988-11-28 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,8,69,CONCLUSAO,"1988-11-28 00:00:00",38,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,9,584,"VISTA AO PROCURADOR-GERAL DA REPUBLICA","1989-01-05 00:00:00",874,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,10,26,"AUTOS DEVOLVIDOS","1991-05-29 00:00:00",8,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,11,75,"CONCLUSOS AO RELATOR","1991-05-29 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,12,578,"VISTA AO ADVOGADO-GERAL DA UNIAO","1991-06-06 00:00:00",232,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,13,507,"RECEBIMENTO DOS AUTOS","1992-01-24 00:00:00",10,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,14,75,"CONCLUSOS AO RELATOR","1992-02-03 00:00:00",21,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,15,284,"JULG. POR DESPACHO - NEGADO SEGUIMENTO","1992-02-24 00:00:00",3,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,16,497,"PUBLICADO DESPACHO NO DJ","1992-02-27 00:00:00",12,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,17,163,"DECORRIDO O PRAZO","1992-03-10 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480010,1,18,34,"BAIXA AO ARQUIVO DO STF","1992-03-10 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-06 00:00:00","MIN. CÉLIO BORJA",1251
1480183,2,1,208,DISTRIBUIDO,"1988-10-12 00:00:00",8,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,2,69,CONCLUSAO,"1988-10-12 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,3,352,"JULGAMENTO NO PLENO","1988-10-20 00:00:00",22,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,4,476,"PETICAO AVULSA","1988-11-11 00:00:00",13,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,5,531,"REMESSA DOS AUTOS","1988-11-11 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,6,495,"PUBLICADO ACORDAO, DJ:","1988-11-24 00:00:00",11,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,7,163,"DECORRIDO O PRAZO","1988-12-05 00:00:00",8,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,8,241,"INFORMACOES RECEBIDAS, OFICIO NRO.:","1988-12-13 00:00:00",63,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,9,69,CONCLUSAO,"1988-12-13 00:00:00",0,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,10,584,"VISTA AO PROCURADOR-GERAL DA REPUBLICA","1989-02-14 00:00:00",83,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,11,69,CONCLUSAO,"1989-05-08 00:00:00",91,"FÓRUM DA COMARCA DE RANCHARIA","1988-10-12 00:00:00","MIN. PAULO BROSSARD",6677
1480183,2,12,584,"VISTA AO PROCURADOR-GERAL DA

Re: [R] Cluster analysis

2010-07-27 Thread Jim Porzak

Pablo, we've had success using
http://mephisto.unige.ch/traminer/preview.shtml to look at marketing paths.
The question would be: how many distinct case step descriptions are there?

HTH, Jim

[R] Cluster analysis

2010-07-26 Thread Pablo Cerdeira

Hi all,

I have no idea if this question is too easy to be answered, but I'm just starting
with R. So, here we go.

I have a large dataset with a lot of steps of judicial cases. A sample is
attached.

I'd like to do a cluster analysis to try to understand which one is the most
usual path followed by these legal cases.

After that, I'd like to plot a cluster tree.

In the attached sample, the column:

- "id_processo" is the primary key of a legal case;
- "number" is the "step number" in the legal case;
- "andamento" is the description of the legal case step.

I have no idea how to do it using R. Can someone help me?

Thanks in advance

-- 
*Pablo de Camargo Cerdeira*
pa...@fgv.br
pablo.cerde...@gmail.com
+55 (21) 3799-6065

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


>
> What do you suggest in order to assign a new observation to a determined
> cluster?
>
As I mentioned already, I would simply assign the new observation to the
cluster whose exemplar it is most similar to (in a
knn1-like fashion). To compute these similarities, you can use the daisy()
function. However, you have to do some tricks, since daisy() is designed for
computing square matrices of all mutual distances for a given data set. I
did not find another function that is better suited (e.g. a function that
simply computes the distance between two distinct samples). Maybe others
have an idea. In any case, you have to make sure that the data either remain
unscaled or that you take care yourself that your new observation is scaled
with exactly the same parameters that were used for clustering before.
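
One way the "trick" could look is sketched below. It is only an illustration under stated assumptions: pam() from the cluster package stands in for whatever method produced the exemplars (its medoids play the role of exemplars), the toy data and the helper assign_to_cluster() are hypothetical, and the new observation is appended to the full training data before calling daisy() so that the range scaling of the numeric variables stays tied to the original data (this holds only while the new values lie inside the observed ranges).

```r
library(cluster)

# Toy mixed-type data (hypothetical): one numeric and one categorical column.
train <- data.frame(
  income = c(10, 12, 11, 50, 52, 49),
  status = factor(c("student", "student", "student",
                    "employed", "employed", "employed"),
                  levels = c("student", "employed"))
)

# Cluster on Gower dissimilarities; the medoids act as exemplars.
fit <- pam(daisy(train, metric = "gower"), k = 2)
exemplars <- fit$id.med                        # row indices of the medoids
exemplar_clusters <- fit$clustering[exemplars] # cluster label of each medoid

# Hypothetical helper: append the new row, recompute dissimilarities, and
# pick the cluster of the closest exemplar.
assign_to_cluster <- function(new_obs, train, exemplars, exemplar_clusters) {
  combined <- rbind(train, new_obs)
  dm <- as.matrix(daisy(combined, metric = "gower"))
  d_to_ex <- dm[nrow(combined), exemplars]
  unname(exemplar_clusters[which.min(d_to_ex)])
}

new_obs <- data.frame(
  income = 48,
  status = factor("employed", levels = levels(train$status))
)
cl <- assign_to_cluster(new_obs, train, exemplars, exemplar_clusters)
print(cl)
```

Recomputing the full matrix for every new observation is wasteful for large data; the point here is only to show the mechanics.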

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233308.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig




> Christian wrote:
> > and then implement
> > nearest neighbours classification myself if I needed it.
> > It should be pretty straightforward to implement.
>
> Do you intend to modify the code of the knn1() function yourself?


No; if you understand what the nearest neighbours method does, it's not 
very complicated to implement it from scratch (assuming that your dataset 
is small enough that you don't have to worry too much about optimising 
computing times). A bit of programming experience is required, though. 
(It's not that I intend to do it right now, I suggest that you do it if 
you can...)


Christian




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread abanero

Ulrich wrote: 
>Affinity propagation produces quite a number of clusters. 

I tried with q=0 and it produces 17 clusters. Anyway, that's a good idea,
thanks. I'm looking forward to testing it with my dataset.

So I'll probably use daisy() to compute an appropriate dissimilarity then
apcluster() or another method to determine clusters.

What do you suggest in order to assign a new observation to a determined
cluster?

 It seems that RandomForest doesn't work with both numerical and categorical
predictors (thanks to Joris).

Christian wrote: 
>and then implement
>nearest neighbours classification myself if I needed it.
>It should be pretty straightforward to implement.

Do you intend to modify the code of the knn1() function yourself?

thanks to everyone!

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233210.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


Sorry, Joris, I overlooked that you already mentioned daisy() in your
posting. I should have credited your recommendation in my previous message.

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233055.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


>
> I had a look at the documentation of the package apcluster.
> That's interesting but do you have any example using it with both
> categorical
> and numerical variables? I'd like to test it with a large dataset..
>
Your posting has opened my eyes: problems where both numerical and
categorical features occur are probably among the most attractive
applications of affinity propagation. So I am considering including such an
example in a future release.

Here is a very crude example (download the imports-85.data from
http://archive.ics.uci.edu/ml/machine-learning-databases/autos/ first):

> library(cluster)
> library(apcluster)
> # stringsAsFactors=TRUE so that daisy() sees text columns as factors
> automobiles <- read.table("imports-85.data", header=FALSE, sep=",",
>                           na.strings="?", stringsAsFactors=TRUE)
> sim <- -as.matrix(daisy(automobiles))
> apcluster(sim)

The most essential part here is to use daisy() from the package "cluster"
for computing distances/similarities. Have a look at the help page of
daisy() to get a better impression of how it works and how to tailor the
distance/similarity calculations to your needs.
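
As a small illustration of such tailoring, the sketch below shows how the declared type of a column changes what daisy() computes: the same three-level variable is treated once as an unordered factor and once as an ordered factor. The toy data frame and its column names are hypothetical; only daisy() and its metric argument are taken from the cluster package.

```r
library(cluster)

# Hypothetical mixed data: one numeric and one three-level categorical variable.
x_nom <- data.frame(
  age = c(30, 40, 50),
  edu = factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
)

# Same data, but with the categorical variable declared as ordered.
x_ord <- x_nom
x_ord$edu <- factor(x_ord$edu, levels = c("low", "mid", "high"), ordered = TRUE)

d_nom <- as.matrix(daisy(x_nom, metric = "gower"))
d_ord <- as.matrix(daisy(x_ord, metric = "gower"))

# As an unordered factor, "low" vs "mid" counts as a full mismatch; as an
# ordered factor, it counts as being only part of the way along the scale.
print(round(d_nom, 2))
print(round(d_ord, 2))

# Similarity matrix in the form used above: negated dissimilarities.
sim_ord <- -d_ord
```

So before negating the result into a similarity matrix, it is worth checking that every column has the class (numeric, factor, ordered) that matches its meaning.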

I do not know whether this is a good data set for clustering. Affinity
propagation produces quite a number of clusters. Maybe fiddling with the
input preferences is necessary (see Section 4 of the vignette of package
"apcluster").

Best regards,
Ulrich


-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233053.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Joris Meys

I'm confusing myself :-)

randomForest cannot handle character vectors as predictors. (Which is why I,
to my surprise, found out that a categorical variable could not be used in
the function). It can handle categorical variables as predictors IF they are
put in as a factor.
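
In practice this usually means converting character columns to factors before fitting. The sketch below shows only that conversion step in base R; the data frame is made up, and the subsequent randomForest call is deliberately left out.

```r
# Hypothetical data frame with one character and one numeric predictor.
df <- data.frame(
  colour = c("red", "blue", "red"),
  size = c(1.5, 2.0, 3.2),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; leave other columns untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```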

Obviously it can handle a categorical variable as the response.

I hope I'm not going to add any more mistakes; it's been enough for the
day...
Cheers
Joris

On Thu, May 27, 2010 at 2:08 PM,  wrote:

> Joris,
>
> I've been following this thread for a few days as I am beginning to use
> randomForest in my work.  I am confused by your last email.
>
> What do you mean that randomForest does not handle categorical variables ?
>
> It can be used in either regression or classification analysis.  Do you
> mean that categorical predictors are not suitable? Certainly they are as
> the response.
> Would you be so kind, and clarify what you were suggesting.
>
> Thanks,
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> steve_fried...@nps.gov
> Office (305) 224 - 4282
> Fax (305) 224 - 4147
>

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Joris Meys

Hi Abanero,

first, I have to correct myself. Knn1 is a supervised learning algorithm, so
my comment wasn't completely correct. In any case, if you want to do a
clustering prior to a supervised classification, the function daisy() can
handle any kind of variable. The resulting distance matrix can be used with
a number of different methods.

And you're right, randomForest doesn't handle categorical variables either.
So I haven't been of great help here...
Cheers
Joris


-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread abanero

Hi,

thank you Joris and Ulrich for your answers.

Joris Meys wrote: 

>see the library randomForest for example

I'm trying to find some example in randomForest with categorical variables
but I haven't found anything. Do you know any example with both categorical
and numerical variables? Anyway I don't have any class labels yet. How could 
I  find clusters with randomForest? 

Ulrich wrote:

>Probably the simplest way is Affinity Propagation[...] All you need is a
>way of measuring the similarity of samples which is straightforward both
>for numerical and categorical variables.

I had a look at the documentation of the package apcluster. That's
interesting but do you have any example using it with both categorical and
numerical variables? I'd like to test it with a large dataset..

Thanks a lot!
Cheers

Giuseppe

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2232950.html
Sent from the R help mailing list archive at Nabble.com.

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig

Dear abanero,

In principle, k nearest neighbours classification can be computed on
any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume
Euclidean vectors as input, which restricts their use.

I'd probably compute an appropriate dissimilarity between points (have a
look at Gower's distance in daisy, package cluster), and then implement
nearest neighbours classification myself if I needed it. It should be
pretty straightforward to implement.

If you want unsupervised classification (clustering) instead, you have the
choice between all kinds of dissimilarity based algorithms then (hclust, pam,
agnes etc.)
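
A minimal sketch of this suggestion, assuming made-up data: the data frame
`train`, the measure `M`, the helper `nn1_gower()` and the choice of k = 2 are
all hypothetical and only illustrate the idea. daisy() computes Gower
dissimilarities on mixed-type data, pam() clusters on them, and a new
observation takes the cluster of its nearest training point.

```r
library(cluster)  # daisy(), pam()

set.seed(42)
# Hypothetical training data: numeric and categorical attributes plus a measure M
n <- 40
train <- data.frame(
  size = c(rnorm(n / 2, mean = 10), rnorm(n / 2, mean = 20)),
  type = factor(rep(c("a", "b"), each = n / 2), levels = c("a", "b")),
  flag = factor(sample(c("no", "yes"), n, replace = TRUE), levels = c("no", "yes"))
)
M <- c(rnorm(n / 2, mean = 1), rnorm(n / 2, mean = 5))

# Step 1: Gower dissimilarities handle mixed variable types
d <- daisy(train, metric = "gower")

# Step 2: cluster on the dissimilarities (PAM here; hclust or agnes also work)
cl <- pam(d, k = 2)$clustering

# Step 3: hand-written 1-nearest-neighbour assignment.
# The dissimilarity is recomputed on training and new rows together, so the
# range scaling of numeric variables can shift slightly (a simplification).
nn1_gower <- function(train, new, labels) {
  dm <- as.matrix(daisy(rbind(train, new), metric = "gower"))
  idx <- nrow(train) + seq_len(nrow(new))
  nearest <- apply(dm[idx, seq_len(nrow(train)), drop = FALSE], 1, which.min)
  labels[nearest]
}

new_obs <- data.frame(
  size = 19.5,
  type = factor("b", levels = c("a", "b")),
  flag = factor("yes", levels = c("no", "yes"))
)
new_cl <- nn1_gower(train, new_obs, cl)

# Predicted measure: mean of M over the assigned cluster
M_hat <- mean(M[cl == new_cl])
```

For a large data set the full dissimilarity matrix may not fit in memory; in
that case only the dissimilarities between the new row and the training rows
are needed for step 3.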

Christian

On Thu, 27 May 2010, Ulrich Bodenhofer wrote:

> abanero wrote:
>
>> Do you know something like “knn1” that works with categorical variables
>> too?
>> Do you have any suggestion?
>
> There are surely plenty of clustering algorithms around that do not require
> a vector space structure on the inputs (like KNN does). I think
> agglomerative clustering would solve the problem as well as a kernel-based
> clustering (assuming that you have a positive semi-definite measure
> of the similarity of two samples). Probably the simplest way is Affinity
> Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
> see CRAN package "apcluster" I have co-developed). All you need is a way of
> measuring the similarity of samples which is straightforward both for
> numerical and categorical variables - as well as for mixtures of both (the
> choice of the similarity measures and how to aggregate the different
> variables is left to you, of course). Your final "classification" task can
> be accomplished simply by assigning the new sample to the cluster whose
> exemplar is most similar.
>
> Joris Meys wrote:
>
>> Not a direct answer, but from your description it looks like you are
>> better off with supervised classification algorithms instead of
>> unsupervised clustering.
>
> If you say that this is a purely supervised task that can be solved without
> clustering, I disagree. abanero does not mention any class labels. So it
> seems to me that it is indeed necessary to do unsupervised clustering first.
> However, I agree that the second task of assigning new samples to
> clusters/classes/whatever can also be solved by almost any supervised
> technique if samples are labeled according to their cluster membership
> first.
>
> Cheers, Ulrich

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Ulrich Bodenhofer


abanero wrote:
>
> Do you know  something like “knn1” that works with categorical variables
> too?
> Do you have any suggestion? 
>
There are surely plenty of clustering algorithms around that do not require
a vector space structure on the inputs (like KNN does). I think
agglomerative clustering would solve the problem as well as a kernel-based
clustering (assuming that you have a positive semi-definite measure
of the similarity of two samples). Probably the simplest way is Affinity
Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
see CRAN package "apcluster" I have co-developed). All you need is a way of
measuring the similarity of samples which is straightforward both for
numerical and categorical variables - as well as for mixtures of both (the
choice of the similarity measures and how to aggregate the different
variables is left to you, of course). Your final "classification" task can
be accomplished simply by assigning the new sample to the cluster whose
exemplar is most similar.
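
A rough sketch of this workflow, under stated assumptions: the data are made
up, and taking similarity as 1 minus the Gower dissimilarity from
cluster::daisy() is only one possible choice, not something prescribed above.
The slots `exemplars` and `clusters` of the apcluster() result are used for the
final assignment step.

```r
library(cluster)    # daisy()
library(apcluster)  # apcluster()

set.seed(1)
# Hypothetical mixed-type data (names and values are made up)
dat <- data.frame(
  x = c(rnorm(15, mean = 0), rnorm(15, mean = 6)),
  g = factor(rep(c("a", "b"), each = 15), levels = c("a", "b"))
)
new_obs <- data.frame(x = 5.5, g = factor("b", levels = c("a", "b")))

# Gower dissimilarity over old and new samples together, so that both are
# scaled identically; similarity = 1 - dissimilarity (an assumed choice)
s_all <- 1 - as.matrix(daisy(rbind(dat, new_obs), metric = "gower"))
n <- nrow(dat)

# Cluster the original samples only, passing the similarity matrix directly
ap <- apcluster(s_all[seq_len(n), seq_len(n)])
ex <- ap@exemplars  # row indices of the exemplars

# Assign the new sample to the cluster whose exemplar is most similar
assigned <- which.max(s_all[n + 1, ex])
members <- ap@clusters[[assigned]]
```

With a large data set the full similarity matrix grows quadratically, so it is
worth checking memory requirements before running this on all samples.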

Joris Meys wrote:
>
> Not a direct answer, but from your description it looks like you are
> better off with supervised classification algorithms instead of
> unsupervised clustering.
>
If you say that this is a purely supervised task that can be solved without
clustering, I disagree. abanero does not mention any class labels. So it
seems to me that it is indeed necessary to do unsupervised clustering first.
However, I agree that the second task of assigning new samples to
clusters/classes/whatever can also be solved by almost any supervised
technique if samples are labeled according to their cluster membership
first.

Cheers, Ulrich
-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2232902.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-26 Thread Joris Meys

Not a direct answer, but from your description it looks like you are better
off with supervised classification algorithms instead of unsupervised
clustering. See the randomForest package, for example. Alternatively, you can
try a logistic regression or a multinomial regression approach, but these
are parametric methods and put requirements on the data. randomForest is
completely non-parametric.

Cheers
Joris

On Wed, May 26, 2010 at 3:45 PM, abanero  wrote:

>
> Hi,
> I have 1,000 observations with 10 attributes (of different types:
> numeric, dichotomous, categorical, etc.) and a measure M.
>
> I need to cluster these observations in order to assign a new observation
> (with the same 10 attributes but not the measure) to a cluster.
>
> I want to calculate for the new observation a measure as the average of the
> measures M of the observations in the assigned cluster.
>
> I would use cluster analysis ("Clara" algorithm?) and then "knn1" (in
> package class) to assign the new observation to a cluster.
>
> The problem is: I'm not able to use "knn1" because some of the attributes
> are categorical.
>
> Do you know something like "knn1" that works with categorical variables
> too? Do you have any suggestion?
>



-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


[R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-26 Thread abanero


Hi,
I have 1,000 observations with 10 attributes (of different types: numeric,
dichotomous, categorical, etc.) and a measure M.

I need to cluster these observations in order to assign a new observation
(with the same 10 attributes but not the measure) to a cluster.

I want to calculate for the new observation a measure as the average of the
measures M of the observations in the assigned cluster.

I would use cluster analysis (“Clara” algorithm?) and then “knn1” (in
package class) to assign the new observation to a cluster.

The problem is: I’m not able to use “knn1” because some of the attributes
are categorical.

Do you know something like “knn1” that works with categorical variables
too? Do you have any suggestion?

-- 
View this message in context: 
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2231656.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Cluster procedure using geographical neighborhood

2010-05-07 Thread Martin Maechler

Dear Dario Sacco,

> "DS" == Dario Sacco 
> on Thu, 06 May 2010 17:45:30 +0200 writes:

DS> Dear Dr. Maechler,
DS> I am an agronomist and a researcher at the University of Turin. I am 
DS> also teaching "Applied statistics", then I have some knowledge in 
DS> Statistics, but not in numerical computation.

DS> I found your email at the Cran website.

DS> At now I am working on segmentation of a GIS database. My problem is 
DS> that I have a set of points over a region and I need to define sub 
DS> region characterised by small inside variability.
DS> The application seems to apply a hierarchical cluster analysis, but the 
DS> agglomeration procedure should consider only pairs of clusters or of 
DS> points that are neighbours.

DS> This can be performed deleting the dissimilarities in the dissimilarity 
DS> matrix (for example calculated with the dist() procedure in R) that 
DS> refers to pairs of points that are not neighbours.

Deleting is not ok; you should make them "large" in some way.

I think you should just define your  dissimilarities by *both*
the "variability" (your current dist())
*and* the geographical distance, maybe giving much more weight
to the geographical distance, something like

   D_{i,j} :=  d_{i,j} +  w * d~(X_i, X_j)

where d_{i,j} are your dist() or daisy() dissimilarities,
'w' is a weight factor and d~(u,v) is e.g. the geodesic distance
between u and v.
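
A small sketch of the combined dissimilarity, under stated assumptions: the
coordinates and attributes are simulated, plain Euclidean distance between
coordinates stands in for the geodesic distance d~, and w = 5, average linkage
and k = 4 are arbitrary illustrative choices.

```r
set.seed(7)
n <- 30

# Hypothetical point locations and attribute values (made up for illustration)
coords <- cbind(x = runif(n), y = runif(n))
vars <- data.frame(v1 = rnorm(n), v2 = rnorm(n))

# d_{i,j}: dissimilarity in the measured attributes
d_attr <- dist(scale(vars))

# d~(X_i, X_j): geographical distance; planar Euclidean distance is used here
# as a stand-in for a geodesic distance
d_geo <- dist(coords)

# D_{i,j} := d_{i,j} + w * d~(X_i, X_j), with a large weight on geography
w <- 5
D <- as.dist(as.matrix(d_attr) + w * as.matrix(d_geo))

# Ordinary hierarchical clustering on the combined dissimilarity
hc <- hclust(D, method = "average")
regions <- cutree(hc, k = 4)
```

A larger w makes the resulting groups more spatially compact; this weighting
encourages contiguity but does not strictly enforce that only neighbouring
clusters are merged.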

I'm CC'ing this to the R-help mailing list,
as I think you could get more advice from there.

Martin Maechler, ETH Zurich

DS> However if I do that the procedure hclust () does not work anymore. 
DS> Moreover, even if it would work, after the first agglomeration any 
DS> further agglomeration should take into account only pairs of point or 
DS> clusters that are geographically neighbour.
DS> My idea is to create a procedure able to read the list of pairs of 
DS> points that are neighbours, and after each agglomeration, indicate to 
DS> the procedure which pairs are neighbours, but I am not able to 
DS> understand the source code that I downloaded from the CRAN web site.

DS> So, my questions are:
DS> could you help me in solving the problem?
DS> Or, alternatively, could you send to me the agglomeration procedure 
DS> applied by R in hcluster() as a programme written in command of R or as 
DS> a code for Visual Basic. These two programming language are the only 
DS> two that I am able to understand.

DS> Thank you in advance for any suggestion or help you will give me.
DS> Best regards,

DS> Dario Sacco

DS> -- 
DS> Dr. Dario Sacco
DS> Dept. of Agronomy, Forestry and Land Management
DS> University of Turin


Re: [R] Cluster analysis: dissimilar results between R and SPSS

2010-04-26 Thread Sarah Goslee

I'm not sure why you'd expect Euclidean distance and squared Euclidean
distance to give the same results.

Euclidean distance is the square root of the sum of squared
differences over all variables, and that's exactly what dist() returns.

http://en.wikipedia.org/wiki/Euclidean_distance

On a map, it's the length of the hypotenuse, and you can measure it
with a ruler and get the same number. Euclidean distance has a specific
geometric meaning.

Squared Euclidean distance is not the same thing, and not the standard
definition you seem to be expecting. If that's what you want, then square
the output of dist() before you perform the clustering.
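
To make this concrete, here is a short sketch on made-up data (the matrix `x`
and k = 3 are assumptions for illustration). It checks dist() against the
formula by hand and then clusters on squared distances. Note that in current
versions of R (3.1.0 and later) the hclust() method "ward" has been renamed
"ward.D", and "ward.D2" applies Ward's criterion to unsquared distances.

```r
set.seed(3)
x <- matrix(rnorm(40), nrow = 10)  # hypothetical data: 10 rows, 4 variables

# dist() returns the square root of the sum of squared differences
d <- dist(x, method = "euclidean")
manual <- sqrt(sum((x[1, ] - x[2, ])^2))
same_as_manual <- isTRUE(all.equal(as.matrix(d)[1, 2], manual))

# Squared Euclidean distance: square the output of dist() before clustering
d2 <- d^2
hc_sq <- hclust(d2, method = "ward.D")

# Ward's criterion on the unsquared distances, for comparison
hc_d2 <- hclust(d, method = "ward.D2")
same_partition <- identical(cutree(hc_sq, k = 3), cutree(hc_d2, k = 3))
```

On binary data such as the 277 x 8 table in the question, tied distances are
common, so different programs may still break ties differently even when the
distance measure matches.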

Sarah

On Mon, Apr 26, 2010 at 8:37 AM, Jeoffrey Gaspard
 wrote:
> Hello everyone!
>
> My data is composed of 277 individuals measured on 8 binary variables
> (1=yes, 2=no).
>
> I did two similar cluster analyses, one on SPSS 18.0 and one on R 2.9.2. The
> objective is to have the means for each variable per retained cluster.
>
> 1) The R analysis ran as follows:
>
>> call data
>> dist=dist(data,method="euclidean")
>> cluster=hclust(dist,method="ward")
>> cluster
>
> Call:
> hclust(d = dist, method = "ward")
>
> Cluster method   : ward
> Distance         : euclidean
> Number of objects: 277
>
>> plot(cluster)
>> rect.hclust(cluster, k=4, border="red")
>> x=rect.hclust(cluster, k=4, border="red")
>> sapply(x, function(i) colMeans(data[i,]))
>> round(sapply(x, function(i) colMeans(data[i,])),2)
>
> 2) The SPSS analysis ran as follows:
>
> Analysis --> Classify --> Hierarchical cluster analysis --> Cluster method=
> Ward's method and Distance measure= Interval:  Squared Euclidean distance.
> After that, I computed the means of each variable for each cluster.
>
> The problem is I have different results between the two analyses (different
> clusters and means).
>
> However, when I use the "Euclidean distance" (unsquared) in SPSS, I have the
> same results!
>
> I thought the R "euclidean" command meant the "usual square distance between
> the two vectors (2 norm)" as specified in the documentation, not the
> unsquared distance. Did it not?
>
> Thanks for the comment!
>
> Jeffrey
>
>

-- 
Sarah Goslee
http://www.functionaldiversity.org


Re: [R] Cluster analysis: dissimilar results between R and SPSS

2010-04-26 Thread Tal Galili

Hi Jeoffrey,

How stable are the results in general?
If you repeat the analysis in R several times, does it yield the same
results?


Tal

Contact Details:
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)




On Mon, Apr 26, 2010 at 3:37 PM, Jeoffrey Gaspard <
jeoffrey.gasp...@gmail.com> wrote:

> Hello everyone!
>
> My data is composed of 277 individuals measured on 8 binary variables
> (1=yes, 2=no).
>
> I did two similar cluster analyses, one on SPSS 18.0 and one on R 2.9.2.
> The
> objective is to have the means for each variable per retained cluster.
>
> 1) The R analysis ran as follows:
>
> > call data
> > dist=dist(data,method="euclidean")
> > cluster=hclust(dist,method="ward")
> > cluster
>
> Call:
> hclust(d = dist, method = "ward")
>
> Cluster method   : ward
> Distance : euclidean
> Number of objects: 277
>
> > plot(cluster)
> > rect.hclust(cluster, k=4, border="red")
> > x=rect.hclust(cluster, k=4, border="red")
> > sapply(x, function(i) colMeans(data[i,]))
> > round(sapply(x, function(i) colMeans(data[i,])),2)
>
> 2) The SPSS analysis ran as follows:
>
> Analysis --> Classify --> Hierarchical cluster analysis --> Cluster method=
> Ward's method and Distance measure= Interval:  Squared Euclidean distance.
> After that, I computed the means of each variable for each cluster.
>
> The problem is I have different results between the two analyses (different
> clusters and means).
>
> However, when I use the "Euclidean distance" (unsquared) in SPSS, I have
> the
> same results!
>
> I thought the R "euclidean" command meant the "usual square distance
> between
> the two vectors (2 norm)" as specified in the documentation, not the
> unsquared distance. Did it not?
>
> Thanks for the comment!
>
> Jeffrey
>
>
>