Re: problems with running K-means on hadoop's pseudo-distributed mode

Ajay Sharma Mon, 09 Jun 2014 09:03:37 -0700

K-means Clustering
K-means: widely used clustering technique.
Initialization: blind random choice from the input data.
Drawback: very sensitive to the choice of initial cluster centers (seeds).
A local optimum can be arbitrarily bad with respect to the objective
function, compared to the globally optimal clustering.


Idea (k-means++): spread the k initial cluster centers away from each other.
Result: O(log k)-competitive with the optimal clustering, plus substantial
convergence-time speedups (empirical).
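
For reference, the O(log k) guarantee is usually stated in expectation over
the random seeding. The symbol \phi below is my own notation, not from the
slides: it stands for the k-means objective (sum of squared distances from
each point to its nearest center), and \phi_OPT is its value for the optimal
clustering. As far as I recall the k-means++ analysis, the bound is:

```latex
\mathbb{E}[\phi] \;\le\; 8\,(\ln k + 2)\,\phi_{\mathrm{OPT}}
```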

C <- {a point sampled uniformly at random from X}
while |C| < k do
    sample x ∈ X with probability proportional to D(x)^2
    C <- C ∪ {x}
end while

c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest center c that has already been
chosen
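
The seeding loop above can be sketched in plain Python as follows. This is
only a minimal sketch, not Mahout's implementation: the function names
(squared_distance, kmeanspp_seeds) and the sample coordinates are my own
assumptions, and it assumes Euclidean distance on points stored as tuples.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers from `points` by D(x)^2-weighted sampling."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # First center: sampled uniformly at random from X.
    centers = [rng.choice(points)]
    # d2[i] holds D(x_i)^2, the squared distance to the nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            new_center = rng.choice(points)
        else:
            # Sample x with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Update D(x)^2 against the newly added center.
        d2 = [min(old, squared_distance(p, new_center))
              for p, old in zip(points, d2)]
    return centers


# Hypothetical stand-in for a "4 squares (n=16)" test set: the slides do not
# give coordinates, so these values are made up for illustration.
corners = [(0, 0), (0, 10), (10, 0), (10, 10)]
points = [(cx + dx, cy + dy)
          for cx, cy in corners for dx in (0, 1) for dy in (0, 1)]
print(kmeanspp_seeds(points, k=4, rng=random.Random(42)))
```

Points that are already centers get weight zero, so they cannot be drawn
again unless all remaining points coincide with a center.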

Implementation test dataset: 4 squares (n=16)

Expected: 4 nice clusters

Evaluation on Test Dataset
• 200 clustering runs, each with and without k-means++ initialization
• Measure RSS (intra-class variance)
• K-means: optimal clustering in 115 runs (57.5%)
• K-means++: optimal clustering in 182 runs (91%)
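
The RSS measure in the list above could be computed along these lines. This
is a hedged sketch: I am assuming RSS means the residual sum of squares, i.e.
the sum over all points of the squared Euclidean distance to the nearest
center; the function names and the toy data are my own, not from the slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def rss(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Made-up toy data: two tight pairs of points far apart from each other.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_centers = [(0, 1), (10, 1)]   # one center per pair
bad_centers = [(5, 1)]             # a single center in the middle
print(rss(toy_points, good_centers), rss(toy_points, bad_centers))
```

A lower RSS means tighter clusters, so comparing RSS across runs is one way
to check which runs reached the optimal clustering.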

Figure: Comparison of the frequency distribution of RSS values between
k-means and k-means++ on the evaluation dataset (n=200)

Figure: Comparison of the frequency distribution of RSS values between
k-means and k-means++ on the UCI real-world dataset (n=500)

On Mon, Jun 9, 2014 at 10:50 AM, sumit sharma <pro.su...@gmail.com> wrote:

> Naïve Bayes can be used for text clustering effectively in Mahout.
>
>
> On Mon, Jun 9, 2014 at 7:07 PM, Eeti Jain <eetijai...@gmail.com> wrote:
>
> >
> > Sir, I have been working on hadoop/mahout platform and performing
> > clustering
> > on twitter data in my thesis work. I just want to know whether Mahout can
> > handle text documents in some other language? Please if you can help me
> > sir?
> >
> >
> >
> >
> >
>
>
> --
>
> Best Regards:
> Sumit Sharma
>
