My comments are at the end as some people do not like top posts.

 

From: scikit-learn <scikit-learn-bounces+avigross=verizon....@python.org> On 
Behalf Of Matti Viljamaa
Sent: Friday, January 25, 2019 3:31 PM
To: Scikit-learn mailing list <scikit-learn@python.org>
Subject: Re: [scikit-learn] How to determine suitable cluster algo

 

Also,

 

Remember that some algorithms may exhibit “sweet spots” in the trade-off between 
computation time and accuracy gained.

 

So you might want to keep measuring “explained variance” as you add complexity 
to your models, and then plot model complexity against explained variance.

 

E.g. with MLPClassifier you’d plot the number of hidden layers against explained 
variance to figure out where adding hidden layers starts to yield diminishing 
gains in explained variance.
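
A minimal sketch of such a complexity-vs-score plot, here using KMeans (since the 
thread is about clustering) and its inertia_ (within-cluster sum of squares) as a 
rough stand-in for explained variance; the blob data is only a placeholder for 
your own feature matrix X:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder data; substitute your own feature matrix X.
X, _ = make_blobs(n_samples=2000, centers=5, random_state=0)

ks = range(2, 15)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # lower inertia ~ more variance explained

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters (model complexity)")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()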

 

Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

 

From: Matti Viljamaa <mailto:matti.v.vilja...@gmail.com> 
Sent: Friday, 25 January 2019 13.43
To: Scikit-learn mailing list <mailto:scikit-learn@python.org> 
Subject: RE: [scikit-learn] How to determine suitable cluster algo

 

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run

https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/
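
In the same spirit as those links, one rough way to estimate run time yourself is 
to time fits on increasing subsample sizes and extrapolate. A sketch only; 
MiniBatchKMeans and the blob data are just stand-ins for your own algorithm and 
data:

import time
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Placeholder data; replace with (a slice of) your real dataset.
X, _ = make_blobs(n_samples=50000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
for n in (1000, 5000, 10000, 50000):
    sample = X[rng.choice(len(X), size=n, replace=False)]
    t0 = time.perf_counter()
    MiniBatchKMeans(n_clusters=10, random_state=0).fit(sample)
    print(n, round(time.perf_counter() - t0, 2), "seconds")
# If the timings grow roughly linearly in n, extrapolate to the full dataset size.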
 
 

 

Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

 

From: lampahome <mailto:pahome.c...@mirlab.org> 
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list <mailto:scikit-learn@python.org> 
Subject: Re: [scikit-learn] How to determine suitable cluster algo

 

Maybe the suitable way is trial and error?

 

What concerns me is that my dataset is very large, and I can't try every number 
of clusters from 1 to N if I have N samples.

That costs too much time for me.

 

Maybe I should define the initial number of clusters based on execution time?

 

Then analyze whether the next step is to increase or decrease the number of clusters?

 

thx

 

My comments:

This is a question, not a suggestion.

 

The poster suggested they have such a large amount of data that searching over 
larger numbers of clusters to find a ‘sweet spot’ may take too much time.

 

Is there any value in taking a much smaller random sample of the data that is 
still large enough to be representative, and trying that over a reasonable range 
of cluster counts? The results would not be definitive but might supply a clue as 
to what range to try again with the full data.
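
As a sketch of what I mean (not a definitive recipe): subsample, scan a coarse 
range of cluster counts on the subsample only, and use some score, e.g. the 
silhouette score, to narrow the range before touching the full data. 
MiniBatchKMeans and the blob data here are just stand-ins:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Placeholder for the full (huge) dataset.
X, _ = make_blobs(n_samples=200000, centers=8, random_state=0)

# Draw a random subsample that is still "big enough".
rng = np.random.default_rng(0)
sub = X[rng.choice(len(X), size=10000, replace=False)]

# Coarse scan of cluster counts on the subsample only.
for k in (2, 4, 8, 16, 32):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(sub)
    print(k, silhouette_score(sub, labels, sample_size=2000, random_state=0))
# The k range that scores well here is a candidate range to retry on the full data.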

 

As I see mentioned, the run time may not necessarily go up if the data is held 
constant and only the number of clusters varies. I am not sure which clustering 
algorithms you want to use, but for something like K-means on reasonable data, 
the number of clusters that shows meaningful results is usually much smaller than 
the number of items in the data. The algorithms often terminate when successive 
iterations show little change, and that threshold is likely a tunable parameter. 
So if you ask it to make N+1 clusters, it may even terminate sooner than for N, 
if that number of clusters more closely matches the variation in the data. 
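
For what it's worth, in scikit-learn's KMeans that stopping behaviour is indeed 
tunable, via tol and max_iter (a sketch; the values below are illustrative, not 
recommendations):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder data.
X, _ = make_blobs(n_samples=10000, random_state=0)

# tol: tolerance on how much the cluster centers may still move between
# iterations before the run is declared converged; max_iter: hard cap.
km = KMeans(n_clusters=20, tol=1e-3, max_iter=100, n_init=5, random_state=0).fit(X)
print("iterations actually run:", km.n_iter_)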

 

And, again, if you are using a K-means variant, it may be better to use some 
human intervention to check whether a particular level of clustering fits some 
model you can make that explains what each cluster has in common. If you overfit, 
the number of clusters can effectively become the number of unique items in your 
data, which probably serves no meaningful purpose.

 

Again, just a question. There are algorithms out there that deal better with 
large data than others.

 

Avi

 

 

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
