Re: [R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-22 Thread Martin Maechler
>>>>> David L Carlson 
>>>>> on Sun, 21 Feb 2016 16:55:56 + writes:

> I do not think this is quite true. When the medoids are
> not specified, pam/clara looks for a good initial set
> (build phase) and then finds a local minimum of the
> objective function (swap phase). Both pam/clara and kmeans
> can find local minima that are not the global minimum. 

Indeed, thank you for the explanation so far.

> If the build phase involves any random element, two runs
> could produce different results. 

One of the important features of pam (over kmeans) is indeed
that the build phase is entirely deterministic and typically
much better than random starts.
But there is an option to pam() to skip the build phase and
start from user-specified medoids, so you can also do random
starts, as with kmeans, by using
  pam(..., medoids = sample(n, k))
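
A minimal sketch of the two ways of starting pam() described above. The simulated data, the number of random starts (10), and the names fit_a, fit_b, random_fits and best are illustrative assumptions, not part of the original message.

```r
library(cluster)

# Illustrative data (assumption): two well-separated groups in 2 dimensions.
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
n <- nrow(x)
k <- 2

# Default: deterministic build phase, so repeated calls agree.
fit_a <- pam(x, k)
fit_b <- pam(x, k)
identical(fit_a$id.med, fit_b$id.med)

# Random starts: skip the build phase by supplying initial medoids.
random_fits <- lapply(1:10, function(i) pam(x, k, medoids = sample(n, k)))

# Keep the run with the smallest objective after the swap phase.
objectives <- sapply(random_fits, function(f) f$objective[["swap"]])
best <- random_fits[[which.min(objectives)]]
best$id.med
```

Comparing best$objective with fit_a$objective shows, for a given data set, whether the random starts found anything better than the deterministic build phase.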

And now you should not lump clara and pam together:
clara randomly chooses a subset of the LARge Application
("LARA") data and then runs the build phase (and more) on only
that subset; hence clara necessarily has an element of
randomness to it.
However, for historical reasons, CLARA's original authors chose
to always use the same random seed by default (and a cheap
non-R RNG).  For that reason, clara() has for a long time had
an argument 'rngR', which you "should" set to TRUE in order to
get the variability of random starts.
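
A short sketch of the 'rngR' behaviour described above. The simulated data, the seed value and the object names are assumptions for illustration only.

```r
library(cluster)

# Illustrative data (assumption): 2000 points in two groups.
set.seed(1)
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 5), ncol = 2))

# Default (rngR = FALSE): clara's own generator with a fixed seed,
# so two calls draw the same subsamples and return the same medoids.
a <- clara(x, 3)
b <- clara(x, 3)
identical(a$i.med, b$i.med)

# rngR = TRUE: subsamples are drawn with R's RNG, so set.seed() controls
# them and different seeds can give different subsamples.
set.seed(101); c1 <- clara(x, 3, rngR = TRUE)
set.seed(101); c2 <- clara(x, 3, rngR = TRUE)
identical(c1$i.med, c2$i.med)
```

With rngR = TRUE and no set.seed() call, repeated runs may return different medoids, which is the variability of random starts mentioned above.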

.. more feature requests are welcome!

Martin Maechler,
ETH Zurich  == maintainer("cluster")


> If not, then the original
> order of the data determines the final result, but the
> final result is not necessarily the best one possible
> (assuming the order of the data is irrelevant to the
> analysis so we are not looking at observations taken along
> a line in time or space). That is why kmeans includes an
> argument to run the algorithm multiple times and pick the
> best result.

> -
> David L Carlson Department of Anthropology Texas A&M
> University College Station, TX 77840-4352

> -Original Message-
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Sarah Goslee
> Sent: Friday, February 19, 2016 1:47 PM
> To: ABABAEI, Behnam
> Cc: r-help@r-project.org
> Subject: Re: [R] Why CLARA clustering method does not give the same
> classes as when I do clustering manually?

> clara() is a version of pam() adapted to use large
> datasets.

> pam() uses the entire dataset, and should give results
> identical to your manual procedure, or nearly so. clara()
> works on subsets of the data, so it may give a slightly
> different result each time you run it.

> The default parameters for clara() are very small, so you
> can get substantially different results from run to run on
> a large dataset if you don't change them.

> Sarah

> On Fri, Feb 19, 2016 at 6:30 AM, ABABAEI, Behnam wrote:
>> Hi,
>> 
>> 
>> I am using CLARA (in 'cluster' package). This method is
>> supposed to assign each observation to the closest
>> 'medoid'. But when I calculate the distance of medoids
>> and observations manually and assign them manually, the
>> results are slightly different (1-2 percent of occurrence
>> probability). Does anyone know how clara calculates
>> dissimilarities and why I get different clustering
>> results?
>> 
>> 
>> Behnam.

> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and
> more, see https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide
> commented, minimal, self-contained, reproducible code.



Re: [R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-21 Thread ABABAEI, Behnam
By the way, I should mention that my data contain missing values. That is why 
I am using clara (or possibly pam): kmeans, although very good at dealing 
with large datasets, cannot handle missing values.
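
With missing values, a manual distance calculation has to choose how to treat the NAs, and that choice changes the distances. The sketch below shows the convention documented for stats::dist(): use only the variables observed in both rows and scale the sum of squares up by the proportion of variables used. Whether a given version of clara() applies exactly this scaling is an assumption to check against ?clara; newer versions of 'cluster' document a 'correct.d' argument for the NA case. The vectors a and b and the helper na_euclid() are hypothetical.

```r
# Two observations with p = 4 variables; the third is missing in `a`.
a <- c(1, 2, NA, 4)
b <- c(2, 4, 6, 8)

# NA-aware Euclidean distance (hypothetical helper): use the variables
# observed in both rows, then scale the sum of squares by p / (number used).
na_euclid <- function(u, v) {
  ok <- !is.na(u) & !is.na(v)
  sqrt(sum((u[ok] - v[ok])^2) * length(u) / sum(ok))
}

scaled <- na_euclid(a, b)
from_dist <- as.numeric(dist(rbind(a, b)))  # stats::dist() uses the same scaling

# Naive alternative: drop the missing term without rescaling.
naive <- sqrt(sum((a - b)^2, na.rm = TRUE))

c(scaled = scaled, dist = from_dist, naive = naive)
```

The scaled and naive values differ, so a manual assignment that simply drops NAs can place some observations near a cluster boundary with a different medoid.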

Behnam.





Re: [R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-21 Thread Hennig, Christian
By default, clara uses the Euclidean distance (see its 'metric' argument). 
Why you get different results can only be determined if you provide reproducible 
code for both what you did in clara and what you did "manually".
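
A reproducible comparison of this kind might look like the sketch below. It assumes complete data (no missing values) and the default Euclidean metric; the simulated data and the names fit, med, d and manual are illustrative.

```r
library(cluster)

# Illustrative complete data (assumption): three groups in 2 dimensions.
set.seed(7)
x <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
           matrix(rnorm(400, mean = 3), ncol = 2),
           matrix(rnorm(400, mean = 6), ncol = 2))

fit <- clara(x, k = 3)
med <- fit$medoids  # k x p matrix of medoid coordinates

# Manual step: Euclidean distance from every observation to every medoid
# (one column per medoid), then assign each row to the nearest medoid.
d <- sapply(seq_len(nrow(med)),
            function(j) sqrt(rowSums(sweep(x, 2, med[j, ])^2)))
manual <- apply(d, 1, which.min)

# Cross-tabulate clara's assignment against the manual one.
table(clara = fit$clustering, manual = manual)
mean(fit$clustering == manual)
```

If the two assignments disagree on your own data, posting the equivalent of this script with the data (or a subset) lets others see where the difference arises.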

Best wishes,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche






Re: [R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-21 Thread David L Carlson
I do not think this is quite true. When the medoids are not specified, 
pam/clara looks for a good initial set (build phase) and then finds a local 
minimum of the objective function (swap phase). Both pam/clara and kmeans can 
find local minima that are not the global minimum. If the build phase involves 
any random element, two runs could produce different results. If not, then the 
original order of the data determines the final result, but the final result is 
not necessarily the best one possible (assuming the order of the data is 
irrelevant to the analysis so we are not looking at observations taken along a 
line in time or space). That is why kmeans includes an argument to run the 
algorithm multiple times and pick the best result.
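
The kmeans argument referred to above is 'nstart'. A minimal sketch, with simulated data and object names that are assumptions for illustration:

```r
# Illustrative data (assumption): three groups in 2 dimensions.
set.seed(123)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))

# One random start: the result depends on the initial centres drawn.
one_start <- kmeans(x, centers = 3, nstart = 1)

# 25 random starts: kmeans keeps the run with the smallest total
# within-cluster sum of squares.
many_start <- kmeans(x, centers = 3, nstart = 25)

c(one = one_start$tot.withinss, many = many_start$tot.withinss)
```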

-
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352




Re: [R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-19 Thread Sarah Goslee
clara() is a version of pam() adapted to use large datasets.

pam() uses the entire dataset, and should give results identical to
your manual procedure, or nearly so. clara() works on subsets of the
data, so it may give a slightly different result each time you run it.

The default parameters for clara() are very small, so you can get
substantially different results from run to run on a large dataset if
you don't change them.
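
The parameters in question are 'samples' (number of subsamples drawn) and 'sampsize' (observations per subsample). The sketch below compares the defaults with larger values; the data and the particular values 50 and 500 are arbitrary assumptions for illustration.

```r
library(cluster)

# Illustrative data (assumption): 2000 points in two groups.
set.seed(99)
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 4), ncol = 2))
k <- 2

# Defaults: only a few small subsamples (see ?clara for the exact values).
fit_default <- clara(x, k)

# More and larger subsamples; sampsize must not exceed nrow(x).
fit_larger <- clara(x, k, samples = 50, sampsize = 500)

# Average dissimilarity of observations to their medoid, for each fit.
c(default = fit_default$objective, larger = fit_larger$objective)

# Cross-tabulate the two clusterings (labels may be permuted between fits).
table(default = fit_default$clustering, larger = fit_larger$clustering)
```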

Sarah




[R] Why CLARA clustering method does not give the same classes as when I do clustering manually?

2016-02-19 Thread ABABAEI, Behnam
Hi,


I am using CLARA (in 'cluster' package). This method is supposed to assign each 
observation to the closest 'medoid'. But when I calculate the distance of 
medoids and observations manually and assign them manually, the results are 
slightly different (1-2 percent of occurrence probability). Does anyone know 
how clara calculates dissimilarities and why I get different clustering results?


Behnam.

