Hi Chitturi,

Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int).
I think it is caused by the initialization step. As far as I can tell, the default "k-means||" initialization makes several passes over the dataset (that is what setInitializationSteps controls), so if your dataset is large, it takes a long time to initialize. I just changed it to "random". Hope it helps.

On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma <learnings.chitt...@gmail.com> wrote:

> Hi All,
>
> I am facing the same issue. Taking k values from 60 to 120, incrementing
> by 10 each time (i.e. k takes the values 60, 70, 80, ..., 120), the
> algorithm takes around 2.5h on an 800 MB data set with 38 dimensions.
>
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
> <[hidden email]> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I am having the same problem as you did.
>> I want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each with about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
>> cluster has 7 executors, each with 8 cores...
>>
>> If I set k=5000, which is the required value for my task, the job goes on
>> forever...
>>
>> Thanks,
>> David
> View this message in context: Re: Why KMeans with mllib is so slow ?
> <http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p26467.html>

--
Regards,
David
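
P.S. To make the point about initialization cost concrete, here is a toy Python sketch. It is not Spark's implementation: the function names, the oversampling factor and the random test data are all made up for illustration, and the final step that reduces the candidates to k centers is left out. It only counts how many distance evaluations each style of initialization needs. Random initialization needs none, while a multi-pass scheme in the spirit of "k-means||" rescans the whole dataset on every pass. In MLlib itself the switch is setInitializationMode("random") on the same KMeans class.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, rng):
    """Pick k distinct points uniformly at random.

    Returns (centers, distance_evaluations); no distances are needed.
    """
    return rng.sample(points, k), 0


def multipass_init(points, k, steps, rng):
    """Toy multi-pass oversampling in the spirit of k-means||.

    Each pass samples points with probability proportional to their squared
    distance from the nearest candidate, then rescans the data to update
    those distances. Reducing the candidates to k centers is omitted.
    Returns (candidates, distance_evaluations).
    """
    first = rng.choice(points)
    candidates = [first]
    costs = [sq_dist(p, first) for p in points]
    evals = len(points)
    for _ in range(steps):
        total = sum(costs)
        if total == 0:
            break  # every point coincides with a candidate
        # The oversampling factor 2 * k is an assumption of this sketch.
        new = [p for p, cost in zip(points, costs)
               if rng.random() < min(1.0, 2 * k * cost / total)]
        # Rescan the data: one distance per (point, new candidate) pair.
        for i, p in enumerate(points):
            for cand in new:
                d = sq_dist(p, cand)
                evals += 1
                if d < costs[i]:
                    costs[i] = d
        candidates.extend(new)
    return candidates, evals


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 500 random vectors with 38 dimensions.
    data = [tuple(rng.random() for _ in range(38)) for _ in range(500)]
    results = {
        "random": random_init(data, 10, rng),
        "multi-pass": multipass_init(data, 10, 5, rng),
    }
    for name, (centers, evals) in results.items():
        print(f"{name}: {len(centers)} candidates, {evals} distance evaluations")
```

The multi-pass count grows with the number of points, the number of passes and the number of candidates sampled per pass, which is why a large k on a large dataset can make the initialization alone expensive.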