Yes, it works fine, though each iteration of the parallel init step is indeed slow -- about 5 minutes on my cluster. Given your question, I think you are actually 'hanging' because your executors are being killed.
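If the hang really does come from YARN killing executors that outgrow their containers, it may help to check how much memory each container is actually granted. Here is a rough sketch of that arithmetic in plain Python. It assumes the default overhead rule documented for Spark on YARN around the 2.0 releases (overhead = max(384 MB, 10% of executor memory), tunable via spark.yarn.executor.memoryOverhead). The function name and the 8 GB figure are only illustrative, and YARN may round the request up further.

```python
# Sketch: estimate the YARN container size requested for one Spark executor.
# Assumption: the overhead defaults to max(384 MB, 10% of executor memory),
# the rule documented for Spark on YARN around version 2.0.

MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Return executor heap plus memory overhead, in MB, for one executor."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB,
                          int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    # Hypothetical 8 GB executor: default overhead, then an explicit 2 GB.
    print(container_request_mb(8192))
    print(container_request_mb(8192, overhead_mb=2048))
```

If the executors use more off-heap memory than that overhead allows, raising the overhead (rather than the heap) is usually the knob to try first.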
I think this init may need some love and optimization. For example, I think treeAggregate might work better. An Array[Float] may be just fine and cut down memory usage, etc.

On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
<georgesamaras...@gmail.com> wrote:
> So you were able to execute the minimal example I posted?
>
> I mean that the application doesn't progress, it hangs (I would be OK if
> it was just slower). It doesn't seem to me a configuration issue.
>
> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Hm, what do you mean? k-means|| init is certainly slower because it's
>> making passes over the data in order to pick better initial centroids.
>> The idea is that you might then spend fewer iterations converging
>> later, and converge to a better clustering.
>>
>> Your problem doesn't seem to be related to scale. You aren't even
>> running out of memory it seems. Your memory settings are causing YARN
>> to kill the executors for using more memory than they advertise. That
>> could mean it never proceeds if this happens a lot.
>>
>> I don't have any problems with it.
>>
>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>> <georgesamaras...@gmail.com> wrote:
>> > Dear all,
>> >
>> > the random initialization works well, but the default initialization
>> > is k-means|| and has made me struggle. Also, I had heard people one
>> > year ago struggling with it too, and everybody would just skip it and
>> > use random, but I cannot keep it inside me!
>> >
>> > I have posted a minimal example here..
>> >
>> > Please advise,
>> > George Samaras
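
To make the treeAggregate suggestion at the top of this message a bit more concrete: a plain aggregate sends every partition's partial result straight to the driver, while a tree-style aggregate first merges partials in groups over a few levels, so the final step only combines a handful. Below is a toy, single-process sketch of that shape in plain Python. It is not Spark code; `flat_aggregate`, `tree_aggregate`, `fan_in` and the sample data are hypothetical names chosen for illustration. Spark's own treeAggregate takes a depth argument instead and does the intermediate merging on the executors.

```python
from functools import reduce


def flat_aggregate(partials, combine):
    """Combine every partial in one place, as a plain aggregate does at the driver."""
    return reduce(combine, partials), len(partials)


def tree_aggregate(partials, combine, fan_in=2):
    """Merge partials in groups of `fan_in`, level by level.

    Returns the final value and how many partials the last step combined.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    while len(level) > fan_in:
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return reduce(combine, level), len(level)


def add_vectors(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    # 16 hypothetical per-partition partial sums.
    partials = [[float(i), 1.0] for i in range(16)]
    print(flat_aggregate(partials, add_vectors))
    print(tree_aggregate(partials, add_vectors, fan_in=4))
```

Both calls give the same sum; the difference is that the last step of the tree version combines 4 partials instead of 16, which is the point when each partial is a large array of candidate centers or costs.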