Re: Is Spark's KMeans unable to handle bigdata?

Sean Owen Fri, 02 Sep 2016 10:34:51 -0700

Yes, it works fine, though each iteration of the parallel init step is
indeed slow -- about 5 minutes on my cluster. Given your question, I
think you are actually 'hanging' because executors are being killed.
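
For reference, a rough sketch of how YARN sizes an executor container,
assuming the Spark 2.0 default where spark.yarn.executor.memoryOverhead
is 10% of executor memory with a 384 MB floor. The function name and the
sample values are made up for illustration; they are not taken from the
posted example.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      overhead_factor=0.10, min_overhead_mb=384):
    """Approximate memory (MB) requested from YARN for one executor.

    Assumes the default rule: overhead = max(384 MB, 10% of executor memory),
    unless an explicit overhead is configured.
    """
    if overhead_mb is None:
        overhead_mb = max(min_overhead_mb,
                          int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb


# Hypothetical settings, for illustration only.
small = yarn_container_mb(2048)          # floor of 384 MB applies
large = yarn_container_mb(8192)          # 10% rule applies
tuned = yarn_container_mb(8192, 2048)    # explicit overhead
print(small, large, tuned)
```

If the executor process grows past that container size (heap plus
off-heap usage), YARN kills it, so raising
spark.yarn.executor.memoryOverhead is a reasonable thing to try.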


I think this init may need some love and optimization. For example, I
think treeAggregate might work better. An Array[Float] may be just
fine and cut down memory usage, etc.
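
To illustrate the treeAggregate idea, here is a plain-Python sketch (no
Spark involved) of combining per-partition partial results in rounds, so
that the final merge only sees a few values instead of one per
partition. tree_aggregate and fan_in are hypothetical names; Spark's
RDD.treeAggregate takes a depth parameter and is implemented
differently.

```python
from functools import reduce


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Aggregate partitions in a tree instead of all at once.

    Returns (result, number of partials merged in the final step).
    `zero` must be an identity value for both seq_op and comb_op.
    """
    # Step 1: fold each partition locally into one partial result.
    partials = [reduce(seq_op, part, zero) for part in partitions]

    # Step 2: merge partials in groups of `fan_in` until few remain.
    while len(partials) > fan_in:
        partials = [reduce(comb_op, partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]

    # Step 3: the last merge (the "driver" step) sees at most fan_in values.
    return reduce(comb_op, partials, zero), len(partials)


# Toy data: five partitions of point costs, summed as a tree.
parts = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
total, final_merge_size = tree_aggregate(
    parts, 0, lambda acc, x: acc + x, lambda a, b: a + b)
print(total, final_merge_size)
```

A flat aggregate would hand all five partials to the final step; here it
only merges two, which is the effect I would hope for in the init.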

On Fri, Sep 2, 2016 at 5:47 PM, Georgios Samaras
<georgesamaras...@gmail.com> wrote:
> So you were able to execute the minimal example I posted?
>
> I mean that the application doesn't progress; it hangs (I would be OK if
> it was just slower). It doesn't seem to me to be a configuration issue.
>
> On Fri, Sep 2, 2016 at 1:07 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Hm, what do you mean? k-means|| init is certainly slower because it's
>> making passes over the data in order to pick better initial centroids.
>> The idea is that you might then spend fewer iterations converging
>> later, and converge to a better clustering.
>>
>> Your problem doesn't seem to be related to scale. You aren't even
>> running out of memory it seems. Your memory settings are causing YARN
>> to kill the executors for using more memory than they advertise. That
>> could mean it never proceeds if this happens a lot.
>>
>> I don't have any problems with it.
>>
>> On Thu, Sep 1, 2016 at 11:35 PM, Georgios Samaras
>> <georgesamaras...@gmail.com> wrote:
>> > Dear all,
>> >
>> >   the random initialization works well, but the default initialization is
>> > k-means|| and has made me struggle. Also, I had heard people one year ago
>> > struggling with it too, and everybody would just skip it and use random, but
>> > I cannot keep it inside me!
>> >
>> >   I have posted a minimal example here..
>> >
>> > Please advise,
>> > George Samaras
>
>

