[R] hierarchical clustering of large dataset

Massimo Di Stefano Thu, 08 Mar 2012 05:54:58 -0800

Hello All,

I have a set of observations in the following form:


a,    b,    c,    d,    e,    f
67.12,    4.28,    1.7825,    30,    3,    16001
67.12,    4.28,    1.7825,    30,    3,    16001
66.57,    4.28,    1.355,    30,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
63.64,    9.726,    1.3004,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
…
….

55,000 observations in total.

where a, b, c, d and e are environmental parameters and f is a label.

As you can see, some rows are duplicated; this means that the observation
occurred more than once.

(In my use case an observation is the presence of a specific biological
species in a photo; if a photo contains more than one individual of the same
species, I get a duplicated row.)


I'm trying to learn how to use R to build a dendrogram that will help me
'group' several species into communities, based on the similarity of the
environmental parameters.

I tried

d <- dist(as.matrix(mydata))
hc <- hclust(d)

but it doesn't work.
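
No error message is shown here, so the following is only a guess at one
possible cause: a full distance matrix for this many rows may simply not fit
in memory. A rough estimate, assuming 55,000 rows and that each pairwise
distance is stored as an 8-byte double:

```r
# Rough memory estimate for a full distance object.
# n is an assumption taken from the row count quoted in the post.
n <- 55000

# A distance object keeps one value per unordered pair of rows
n_pairs <- n * (n - 1) / 2

# Each distance is stored as an 8-byte double
bytes <- n_pairs * 8
gib <- bytes / 1024^3

cat(sprintf("%.0f pairwise distances, about %.1f GiB\n", n_pairs, gib))
```

If that estimate is right, the distance matrix alone needs more than 11 GiB,
before hclust() does any work of its own.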

Is the 'redundancy' of my data (multiple rows with the same information) a
problem?
Should I remove all the rows that are exactly the same?
If so, how do I account for the fact that I have multiple observations for the
same environmental parameters?
Or is this information perhaps not relevant for building the dendrogram?
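
To make the duplicate question concrete, here is a minimal sketch of one way
the identical rows could be collapsed while keeping their counts. It is not a
confirmed solution. The data frame `obs` is a hypothetical toy built from a few
of the sample rows above, and the helper column `n`, the scaling step, the
"average" linkage and the use of the counts as the `members` argument of
hclust() are all assumptions made for illustration:

```r
# Hypothetical toy data copied from a few of the sample rows in the post
obs <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2, 66.2, 63.64, 63.28, 63.28),
  b = c(4.28, 4.28, 4.28, 4.28, 4.28, 9.726, 9.725, 9.725),
  c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3459, 1.3004, 1.2755, 1.2755),
  d = c(30, 30, 30, 13, 13, 6, 6, 6),
  e = c(3, 3, 3, 3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001, 16001, 11012, 11012, 11012)
)

# Collapse identical rows, keeping how often each one occurred in column n
obs$n <- 1
uniq <- aggregate(n ~ a + b + c + d + e + f, data = obs, FUN = sum)

# Use only the environmental parameters (not the label f) for the distances,
# and drop columns that are constant, since they cannot be scaled
env <- uniq[, c("a", "b", "c", "d", "e")]
env <- env[, sapply(env, sd) > 0, drop = FALSE]
env <- scale(env)  # put the parameters on a comparable scale (an assumption)

# Cluster the unique rows; members passes the counts so that each unique row
# is treated as a cluster of that many observations
hc <- hclust(dist(env), method = "average", members = uniq$n)

# Cut the tree into two groups and compare them with the label
groups <- cutree(hc, k = 2)
print(table(groups, label = uniq$f))
```

Whether weighting by the counts is appropriate depends on what the duplicates
mean biologically. Collapsing the rows only helps with size if the number of
unique rows is much smaller than the number of observations.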

Could you please suggest a valid approach for clustering such a dataset?
Forgive me, I have an evident lack of statistical knowledge. Thank you very
much for your help!

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
