Hi Xi Shen,

  Changing the initialization mode from "k-means||" to "random" decreased
the execution time from 2 hours to 6 minutes. However, the number of runs
defaults to 1, and if I set it to 10 the job execution time goes up again.
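
For reference, this is roughly how I am setting things up (a sketch only;
the data path, the CSV parsing and the value k=60 are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// sc is the SparkContext (e.g. the one provided by spark-shell).
// Parse each CSV line into a dense feature vector.
val data = sc.textFile("hdfs:///path/to/data.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(60)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM) // was KMeans.K_MEANS_PARALLEL: 2 hrs -> 6 min
  .setRuns(10)                          // default is 1; setting 10 slows the job again
  .run(data)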

How should I proceed?

By the way, how is the "random" initialization mode different from
"k-means||"?


Regards,
Padma Ch



On Sun, Mar 13, 2016 at 12:37 PM, Xi Shen <davidshe...@gmail.com> wrote:

> Hi Chitturi,
>
> Please check out
> https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int).
>
> I think it is caused by the initialization step. The "k-means||" method
> makes several passes over the dataset during initialization, so if your
> dataset is large the initialization alone takes a long time. Just change
> it to "random".
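>
> Something along these lines (an untested sketch; the k value and the step
> count are just examples, not tuned values):
>
> // Option 1: switch to random initialization.
> val model = new KMeans()
>   .setK(100)                             // example value
>   .setInitializationMode(KMeans.RANDOM)
>   .run(data)
>
> // Option 2: keep "k-means||" but reduce the number of initialization steps.
> // new KMeans().setK(100).setInitializationSteps(2).run(data)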
>
> Hope it helps.
>
>
> On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma <
> learnings.chitt...@gmail.com> wrote:
>
>> Hi All,
>>
>>   I am facing the same issue. Taking k values from 60 to 120,
>> incrementing by 10 each time (i.e. k takes the values 60, 70, 80, ..., 120),
>> the algorithm takes around 2.5 hours on an 800 MB data set with 38
>> dimensions.
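>>
>> Roughly what I am running (a sketch; the parsing and maxIterations = 20
>> are placeholders):
>>
>> // One model per k in 60, 70, ..., 120; record the clustering cost for each.
>> val ks = 60 to 120 by 10
>> val costs = ks.map { k =>
>>   val model = KMeans.train(data, k, 20)   // 20 = maxIterations
>>   (k, model.computeCost(data))
>> }
>>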
>> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
>> <[hidden email]> wrote:
>>
>>> Hi Jao,
>>>
>>> Sorry to pop up this old thread. I am having the same problem you did,
>>> and I want to know if you have figured out how to improve k-means on
>>> Spark.
>>>
>>> I am using Spark 1.2.0. My data set is about 270k vectors, each with
>>> about 350 dimensions. If I set k=500, the job takes about 3 hours on my
>>> cluster. The cluster has 7 executors, each with 8 cores...
>>>
>>> If I set k=5000, which is the value my task requires, the job goes on
>>> forever...
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>
> --
>
> Regards,
> David
>
