Re: [R] hierarchical clustering of large dataset

Sarah Goslee Thu, 08 Mar 2012 06:04:58 -0800

See inline:

On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
<massimodisa...@gmail.com> wrote:
>
> Hello All,
>
> I have a set of observations in the form:
>
> a,    b,    c,    d,    e,    f
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 66.57,    4.28,    1.355,    30,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 63.64,    9.726,    1.3004,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> …
> ….
>
> 55,000 observations in total.
>
> where a, b, c, d, e are environmental parameters
> and f is a label.
>
> As you can see, some rows are duplicated;
> this means that the observation occurred more than once


If you use dput() for the first 10 or 20 rows of your data, then you will
have provided the requested reproducible example.
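
As a rough sketch of what that could look like, assuming the data have been
read into a data frame (called `mydata` here, a made-up name; the values are
the first few rows quoted above):

```r
# Hypothetical data frame standing in for the poster's data
mydata <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2),
  b = c(4.28, 4.28, 4.28, 4.28),
  c = c(1.7825, 1.7825, 1.355, 1.3459),
  d = c(30, 30, 30, 13),
  e = c(3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001)
)

# Print a plain-text representation of (up to) the first 20 rows;
# this output can be pasted straight into an email
dput(head(mydata, 20))

# Anyone reading the email can rebuild the object from that text
txt  <- capture.output(dput(head(mydata, 20)))
copy <- eval(parse(text = txt))
```

The point of dput() is that `copy` ends up as the same data frame, column
types included, which a pasted table does not guarantee.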

> (in my use case an observation is the presence of a specific biological
> species in a photo; if a photo contains more than one individual of the
> same species, I have a duplicated row)
>
>
> I'm trying to learn how to use R in order to build a dendrogram
> that will help me 'group' several species into communities, based on the
> similarity of the environmental parameters.
>
> i tried with
>
> d <- diet(as.matrix(my data))
> hc <- hclust(d)
>
> but it doesn't work.

I'm assuming you mean dist() instead of diet()? I don't know of any
function named diet().
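
For reference, a minimal sketch of the dist()/hclust() sequence, assuming the
data are in a data frame (`mydata` is a made-up name, filled with the distinct
rows quoted above) and assuming the label column f should be left out of the
distance calculation:

```r
# Hypothetical data frame built from the distinct rows in the post
mydata <- data.frame(
  a = c(67.12, 66.57, 66.2, 63.64, 63.28),
  b = c(4.28, 4.28, 4.28, 9.726, 9.725),
  c = c(1.7825, 1.355, 1.3459, 1.3004, 1.2755),
  d = c(30, 30, 13, 6, 6),
  e = c(3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 11012, 11012)
)

# Keep only the environmental parameters (assumption: f is just a label)
env <- mydata[, c("a", "b", "c", "d", "e")]

d  <- dist(as.matrix(env))  # Euclidean distance is the default
hc <- hclust(d)             # complete linkage is the default

# Dendrogram with leaves labelled by f
plot(hc, labels = as.character(mydata$f))
```

Whether Euclidean distance on the raw values is appropriate for these
variables is a separate question; this only shows the mechanics.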

What "doesn't work"? We can't answer your question unless we know what it is.

> Is the 'redundancy' of my data (multiple rows with the same information) a
> problem?
> Should I remove all the rows that are exactly the same?

Yes. Identical rows have a distance of 0, so they're clustered together
immediately, so a dendrogram that includes them is identical to one that has
only unique rows.
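
A small sketch of how one might check this, using the rows quoted above and
the default complete linkage (`obs` and the other object names are made up
for illustration; other linkage methods may weight duplicates differently):

```r
# Rows from the post, with their duplicates (2, 1, 6, 1 and 5 copies)
n_copies <- c(2, 1, 6, 1, 5)
obs <- data.frame(
  a = rep(c(67.12, 66.57, 66.2, 63.64, 63.28), times = n_copies),
  b = rep(c(4.28, 4.28, 4.28, 9.726, 9.725), times = n_copies),
  c = rep(c(1.7825, 1.355, 1.3459, 1.3004, 1.2755), times = n_copies),
  d = rep(c(30, 30, 13, 6, 6), times = n_copies),
  e = rep(3, sum(n_copies))
)

# Drop rows that are exact duplicates
obs_unique <- unique(obs)

hc_all    <- hclust(dist(obs))         # all 15 rows
hc_unique <- hclust(dist(obs_unique))  # 5 distinct rows

# Duplicates merge at height 0; compare the remaining merge heights
h_all    <- sort(hc_all$height[hc_all$height > 0])
h_unique <- sort(hc_unique$height)
same_heights <- isTRUE(all.equal(h_all, h_unique))
```

If `same_heights` is TRUE, the two dendrograms differ only in the zero-height
merges among the duplicated rows.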

> In that case, how do I account for the fact that I have multiple
> observations for the same environmental parameters?
> Or maybe this information is not relevant for building the dendrogram?
>
> Please, can you suggest a valid approach for clustering such a
> dataset?
> Forgive me, I have an evident lack of statistical knowledge. Thank you very
> much for your help!

Perhaps some reading in one of the many excellent ecologically-based
multivariate statistics books is called for?

Sarah



-- 
Sarah Goslee
http://www.functionaldiversity.org

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
