>>>>> Martin Maechler <maech...@stat.math.ethz.ch>
>>>>>     on Mon, 22 Feb 2016 16:48:39 +0100 writes:

>>>>> Sarah Goslee <sarah.gos...@gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:

>> Ah, my guess about the confusion was wrong, then. You're
>> misunderstanding silhouette() instead.
>>
>> From ?silhouette:
>>
>> Observations with a large s(i) (almost 1) are very well
>> clustered, a small s(i) (around 0) means that the observation
>> lies between two clusters, and observations with a negative
>> s(i) are probably placed in the wrong cluster.
>>
>> In more detail, they're looking at different things.
>>
>> clara() assigns each point to a cluster based on the distance
>> to the nearest medoid.
>>
>> silhouette() does something different: instead of comparing
>> the distances to the closest medoid and the next closest
>> medoid, which is what you seem to be assuming, silhouette()
>> looks at the mean distance to ALL other points assigned to
>> that cluster, vs the mean distance to all points in other
>> clusters. The distance to the medoid is irrelevant, except as
>> it is one of the points in that cluster.
>>
>> So a negative silhouette value is entirely possible, and
>> means that the cluster produced doesn't represent the dataset
>> very well.

> Indeed ... and this extends to pam(), even; as you say above,
> "silhouette() does something different":

> If you look at the plots of
>     example(silhouette)
> where the silhouettes of pam(ruspini, k = k'), k' = 2,..,6
> are displayed, or if you directly look at
>     plot( silhouette(ruspini, k = 6) )

oops... that should have been

      plot( silhouette(pam(ruspini, k = 6)) )

> you will notice that pam() itself can easily lead to negative
> silhouette values.

> Martin Maechler [ == maintainer("cluster") ]

>> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
>> <behnam.abab...@limagrain.com> wrote:
>>> Sarah, sorry for taking up your time.
>>>
>>> I totally agree with you about how it works.
>>> But please let's take a look at this part of the
>>> description:
>>>
>>> "Once k representative objects have been selected from the
>>> sub-dataset, each observation of the entire dataset is
>>> assigned to the nearest medoid. The mean (equivalent to the
>>> sum) of the dissimilarities of the observations to their
>>> closest medoid is used as a measure of the quality of the
>>> clustering. The sub-dataset for which the mean (or sum) is
>>> minimal, is retained. A further analysis is carried out on
>>> the final partition."
>>>
>>> It says each observation is finally assigned to the closest
>>> medoid. The whole clustering process may be imperfect in
>>> terms of isolation of clusters, but each observation is
>>> already assigned to the closest one and, according to the
>>> silhouette formula, the silhouette value cannot be negative,
>>> as a must always be less than b.
>>>
>>> Regards, Behnam.
>>>
>>> ________________________________________
>>> From: Sarah Goslee <sarah.gos...@gmail.com>
>>> Sent: 19 February 2016 20:58
>>> To: ABABAEI, Behnam
>>> Cc: r-help@r-project.org
>>> Subject: Re: [R] How a clustering algorithm in R can end up
>>> with negative silhouette values?
>>>
>>> You need to think more carefully about the details of the
>>> clara() method.
>>>
>>> The algorithm draws repeated samples of sampsize from the
>>> larger dataset, as specified by the arguments to the
>>> function. It clusters each sample in turn, and saves the
>>> best one. It uses the medoids from the best one to assign
>>> all of the points to a cluster.
>>>
>>> But because the clustering is based on a subsample, it may
>>> not be representative of the dataset as a whole, and may not
>>> provide a good clustering overall. Just because it clusters
>>> the subsample well doesn't mean it clusters the entirety
>>> well. The details section of the help describes this, and
>>> the book reference goes into more detail.
>>>
>>> Sarah
>>>
>>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
>>> <behnam.abab...@limagrain.com> wrote:
>>>> Hi Sarah,
>>>>
>>>> Thank you for the response. But it is said in its
>>>> description that after each run (sample), each observation
>>>> in the whole dataset is assigned to the closest cluster. So
>>>> how is it possible for one observation to be wrongly
>>>> allocated, even with clara?
>>>>
>>>> Behnam
>>>>
>>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
>>>> <sarah.gos...@gmail.com> wrote:
>>>>
>>>> That means that points have been assigned to the wrong
>>>> groups. This may readily happen with a clustering method
>>>> like cluster::clara() that uses a subset of the data to
>>>> cluster a dataset too large to analyze as a unit. Negative
>>>> silhouette numbers strongly suggest that your clustering
>>>> parameters should be changed.
>>>>
>>>> Sarah
>>>>
>>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
>>>> <behnam.abab...@limagrain.com> wrote:
>>>>> Hi,
>>>>>
>>>>> We know that clustering methods in R assign observations
>>>>> to the closest medoids. Hence, it is supposed to be the
>>>>> closest cluster each observation can have. So, I wonder
>>>>> how it is possible to have negative values of silhouette,
>>>>> while we are supposedly assigning each observation to the
>>>>> closest cluster and the formula in the silhouette method
>>>>> cannot become negative?
>>>>>
>>>>> Behnam.
>>>>>

>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and
>> more, see https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html and provide
>> commented, minimal, self-contained, reproducible code.
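
For anyone who wants to check the arithmetic behind the replies above, here is a minimal sketch in R. It is not from the thread: the data vector x and the two medoids are made up for illustration. It relies on the definition in ?silhouette, where a(i) is the mean dissimilarity between i and the other members of its own cluster, and b(i) is the smallest mean dissimilarity between i and the members of any other cluster. Being closest to a medoid therefore does not guarantee a(i) < b(i).

```r
library(cluster)  # silhouette(); "cluster" is a recommended package shipped with R

# Hypothetical 1-D data: a tight group near 1 and a widely spread group.
x <- c(0, 1, 2, 16, 22, 30, 38, 44)

# Assumed medoids (not computed by pam() or clara()); each one happens to
# be the medoid of the group it ends up with.
medoids <- c(1, 30)

# Assign every observation to its nearest medoid.
d_to_medoids <- abs(outer(x, medoids, "-"))
cl <- apply(d_to_medoids, 1, which.min)

# Silhouette widths computed from all pairwise dissimilarities.
sil <- silhouette(cl, dist(x))

# Observation 4 (x = 16) is nearer to medoid 30 (distance 14) than to
# medoid 1 (distance 15), so it lands in cluster 2.
i <- 4
a <- mean(abs(x[i] - x[cl == 2 & seq_along(x) != i]))  # 17.5
b <- mean(abs(x[i] - x[cl == 1]))                      # 15
s_i <- (b - a) / max(a, b)                             # -0.1428571

sil[i, "sil_width"]  # agrees with s_i: negative despite nearest-medoid assignment
```

With only two clusters, b(i) is simply the mean distance to the other cluster. The observation at 16 sits at the edge of a wide cluster, so its mean distance to its own cluster mates exceeds its mean distance to the compact neighbouring cluster, which is the situation described in the replies above.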