Re: Why KMeans with mllib is so slow ?

Xi Shen Sat, 12 Mar 2016 23:08:20 -0800

Hi Chitturi,

Please check out
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int)


I think it is caused by the initialization step. The "k-means||" method does
not initialize the dataset in parallel. If your dataset is large, it takes a
long time to initialize. Just change the initialization mode to "random".
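
For example, here is a minimal PySpark sketch of that change. It assumes your
data is already an RDD of numeric vectors. The helper names k_values and
train_random_init are made up for illustration, max_iterations=20 is an
arbitrary choice, and the 60 to 120 sweep simply mirrors the k values you
mentioned. KMeans.train and its initializationMode argument come from
pyspark.mllib.clustering.

```python
def k_values(start=60, stop=120, step=10):
    """Return the k values to sweep, inclusive of `stop`."""
    return list(range(start, stop + 1, step))


def train_random_init(rdd, k, max_iterations=20):
    """Train MLlib k-means on an RDD of vectors with random initialization."""
    # Imported here so k_values() can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # initializationMode defaults to "k-means||"; "random" instead picks
    # k points sampled from the data as the starting centers.
    return KMeans.train(rdd, k, maxIterations=max_iterations,
                        initializationMode="random")


# Usage, assuming `data` is an existing RDD of numeric vectors:
# models = {k: train_random_init(data, k) for k in k_values()}
```

Whether this helps will depend on your data and cluster, so it is worth
timing one value of k with each initialization mode before running the
whole sweep.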

Hope it helps.


On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma <learnings.chitt...@gmail.com>
wrote:

> Hi All,
>
>   I am facing the same issue. Taking k values from 60 to 120 in
> increments of 10 (i.e. k = 60, 70, 80, ..., 120), the algorithm takes
> around 2.5h on an 800 MB data set with 38 dimensions.
>
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
> <[hidden email]> wrote:
>
>> Hi Jao,
>>
>> Sorry to revive this old thread. I am having the same problem you did.
>> I want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
-- 

Regards,
David
