Re: Is Spark's KMeans unable to handle bigdata?

Sean Owen Fri, 02 Sep 2016 01:09:24 -0700

Hm, what do you mean? k-means|| init is certainly slower because it's
making passes over the data in order to pick better initial centroids.
The idea is that you might then spend fewer iterations converging
later, and converge to a better clustering.
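
To illustrate why k-means|| costs extra passes, here is a minimal pure-Python sketch of the idea, not Spark's implementation. The function name, the `rounds` and `oversample` parameters, and the simplified weighted k-means++ reduction at the end are all assumptions made for illustration. Each round scans the whole data set to compute costs and sample new candidates, which is where the additional initialization time goes.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)

def kmeans_parallel_init(points, k, rounds=2, oversample=None, seed=0):
    """Hypothetical sketch of k-means||-style seeding (illustrative only)."""
    rng = random.Random(seed)
    ell = oversample if oversample is not None else 2 * k
    candidates = [rng.choice(points)]

    for _ in range(rounds):
        # Full pass over the data: cost of every point w.r.t. current candidates.
        costs = [nearest_sq_dist(p, candidates) for p in points]
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a candidate
        # Sample each point independently, proportionally to its cost.
        candidates.extend(
            p for p, c in zip(points, costs) if rng.random() < ell * c / total
        )

    # One more pass: weight each candidate by how many points it is nearest to.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1

    # Reduce the small candidate set to k centers with weighted k-means++.
    # (Simplification: may return fewer than k if too few candidates exist.)
    centers = [rng.choices(candidates, weights=weights)[0]]
    while len(centers) < k:
        scores = [w * nearest_sq_dist(c, centers)
                  for c, w in zip(candidates, weights)]
        if sum(scores) == 0:
            break
        centers.append(rng.choices(candidates, weights=scores)[0])
    return centers
```

Random initialization, by contrast, just picks k points and skips all of these passes, which is why it starts faster but may need more iterations afterwards.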

Your problem doesn't seem to be related to scale. You aren't even
running out of memory, it seems. Your memory settings are causing YARN
to kill the executors for using more memory than they requested. If
that happens a lot, the job may never make progress.
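
As a rough sketch of the arithmetic behind this: on YARN, an executor's container is sized as the executor heap plus an off-heap overhead allowance, and YARN kills the container if the process exceeds that total. The helper below is hypothetical. It assumes the default overhead rule documented for Spark releases of this era (the larger of 384 MB and 10% of executor memory, overridable with `spark.yarn.executor.memoryOverhead`), and it ignores YARN's rounding to its allocation increment. Check the defaults for your own Spark version.

```python
# Assumed defaults; verify against the "Running on YARN" docs for your version.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10

def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Approximate memory YARN is asked for per executor, in MB.

    executor_memory_mb: the heap size (spark.executor.memory).
    overhead_mb: explicit overhead (spark.yarn.executor.memoryOverhead),
                 or None to apply the assumed default rule.
    """
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb

# An 8 GB heap with the default allowance versus an explicit 2 GB overhead.
default_limit = yarn_container_mb(8192)
raised_limit = yarn_container_mb(8192, overhead_mb=2048)
```

If executors are being killed for exceeding the limit, raising the overhead setting gives the off-heap portion more headroom without changing the heap size.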

I don't have any problems with k-means|| init myself.

On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
<[email protected]> wrote:
> Dear all,
>
>   the random initialization works well, but the default initialization is
> k-means|| and has made me struggle. Also, I had heard people one year ago
> struggling with it too, and everybody would just skip it and use random, but
> I cannot keep it inside me!
>
>   I have posted a minimal example here..
>
> Please advise,
> George Samaras
