Eh... more specifically, since Spark 2.0 the "runs" parameter in the
mllib KMeans implementation has been ignored and is always 1. This
means a lot of the code that wraps per-run state in arrays could be
simplified considerably. I'll take a shot at optimizing this code and
see if I can measure an effect.
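
For anyone following along, here is a rough sketch of the tree-style
aggregation idea from the quoted message below. It is plain Python that
only simulates the pattern, not Spark's treeAggregate or the actual
KMeans code; tree_aggregate, seq_op, comb_op and fan_in are hypothetical
names, and the float32 array stands in for the Array[Float] suggestion.

```python
from array import array
from functools import reduce


def seq_op(acc, point):
    """Fold one point into a partition-local accumulator (element-wise sum)."""
    for i, x in enumerate(point):
        acc[i] += x
    return acc


def comb_op(a, b):
    """Merge accumulator b into accumulator a element-wise."""
    for i, x in enumerate(b):
        a[i] += x
    return a


def tree_aggregate(partitions, dim, fan_in=2):
    """Sum points per partition, then merge partials level by level.

    Hypothetical helper: merging in small groups means no single step
    has to combine every partial result at once.
    """
    if not partitions:
        return array("f", [0.0] * dim)
    # One float32 accumulator per partition (assumption: 4-byte floats
    # are precise enough for these sums).
    partials = [reduce(seq_op, part, array("f", [0.0] * dim))
                for part in partitions]
    # Merge groups of fan_in partials until a single one remains.
    while len(partials) > 1:
        partials = [reduce(comb_op, partials[i + 1:i + fan_in], partials[i])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]


partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)], [(7.0, 8.0)]]
total = tree_aggregate(partitions, dim=2)
print(list(total))  # [16.0, 20.0]
```

With three partitions and fan_in=2 the merge takes two levels. Whether
this actually helps the init step would still have to be measured.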

On Fri, Sep 2, 2016 at 6:33 PM, Sean Owen <so...@cloudera.com> wrote:
> Yes it works fine, though each iteration of the parallel init step is
> slow indeed -- about 5 minutes on my cluster. Given your question I
> think you are actually 'hanging' because resources are being killed.
>
> I think this init may need some love and optimization. For example, I
> think treeAggregate might work better. An Array[Float] may be just
> fine and cut down memory usage, etc.
>
> On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
> <georgesamaras...@gmail.com> wrote:
>> So you were able to execute the minimal example I posted?
>>
>> I mean that the application doesn't progress; it hangs (I would be OK if
>> it was just slower). It doesn't seem to me to be a configuration issue.
>>
>> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Hm, what do you mean? k-means|| init is certainly slower because it's
>>> making passes over the data in order to pick better initial centroids.
>>> The idea is that you might then spend fewer iterations converging
>>> later, and converge to a better clustering.
>>>
>>> Your problem doesn't seem to be related to scale. You aren't even
>>> running out of memory it seems. Your memory settings are causing YARN
>>> to kill the executors for using more memory than they advertise. That
>>> could mean it never proceeds if this happens a lot.
>>>
>>> I don't have any problems with it.
>>>
>>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>>> <georgesamaras...@gmail.com> wrote:
>>> > Dear all,
>>> >
>>> >   the random initialization works well, but the default initialization
>>> > is k-means||, and it has made me struggle. I also heard of people
>>> > struggling with it a year ago; everybody would just skip it and use
>>> > random, but I cannot let it go!
>>> >
>>> >   I have posted a minimal example here..
>>> >
>>> > Please advise,
>>> > George Samaras
>>
>>

