More specifically, since Spark 2.0 the "runs" parameter in the mllib KMeans implementation has been ignored and is always 1. This means that a lot of the code that wraps per-run state in arrays could be simplified considerably. I'll take a shot at optimizing this code and see if I can measure an effect.
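To make the kind of simplification concrete, here is a rough sketch in plain Scala. It is not the Spark source: the object name RunsSketch, the helpers sqDist, costsPerRun and cost, and the data shapes are all hypothetical, chosen only to show how an outer "per run" array becomes dead weight once runs is always 1.

```scala
// Hypothetical illustration only; names and shapes are assumptions,
// not taken from the actual mllib KMeans code.
object RunsSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // "Multi-run" shape: one set of centers per run, so every per-point
  // quantity is an array indexed by run.
  def costsPerRun(p: Point, centersPerRun: Array[Array[Point]]): Array[Double] =
    centersPerRun.map(centers => centers.map(c => sqDist(p, c)).min)

  // Single-run shape: the outer array disappears and the cost is a scalar.
  def cost(p: Point, centers: Array[Point]): Double =
    centers.map(c => sqDist(p, c)).min

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    val p = Array(1.0, 2.0)

    val wrapped = costsPerRun(p, Array(centers)) // runs = 1
    val direct = cost(p, centers)

    // With a single run both shapes carry the same information.
    assert(wrapped.length == 1 && wrapped(0) == direct)
    println(direct) // prints 5.0
  }
}
```

With runs fixed at 1 the wrapped version allocates a one-element array per point for no benefit, which is the sort of overhead I would expect to go away. Whether it is measurable in practice is exactly what needs testing.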

On Fri, Sep 2, 2016 at 6:33 PM, Sean Owen <so...@cloudera.com> wrote:
> Yes it works fine, though each iteration of the parallel init step is
> slow indeed -- about 5 minutes on my cluster. Given your question I
> think you are actually 'hanging' because resources are being killed.
>
> I think this init may need some love and optimization. For example, I
> think treeAggregate might work better. An Array[Float] may be just
> fine and cut down memory usage, etc.
>
> On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
> <georgesamaras...@gmail.com> wrote:
>> So you were able to execute the minimal example I posted?
>>
>> I mean that the application doesn't progress, it hangs (I would be OK if
>> it was just slower). It doesn't seem to me a configuration issue.
>>
>> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Hm, what do you mean? k-means|| init is certainly slower because it's
>>> making passes over the data in order to pick better initial centroids.
>>> The idea is that you might then spend fewer iterations converging
>>> later, and converge to a better clustering.
>>>
>>> Your problem doesn't seem to be related to scale. You aren't even
>>> running out of memory it seems. Your memory settings are causing YARN
>>> to kill the executors for using more memory than they advertise. That
>>> could mean it never proceeds if this happens a lot.
>>>
>>> I don't have any problems with it.
>>>
>>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>>> <georgesamaras...@gmail.com> wrote:
>>> > Dear all,
>>> >
>>> > the random initialization works well, but the default initialization is
>>> > k-means|| and has made me struggle. Also, I had heard people one year ago
>>> > struggling with it too, and everybody would just skip it and use random, but
>>> > I cannot keep it inside me!
>>> >
>>> > I have posted a minimal example here..
>>> >
>>> > Please advise,
>>> > George Samaras