GridSearchCV is meant for tuning a model's hyperparameters over ranges of 
configurations and parameter values, as the documentation explains:

https://scikit-learn.org/stable/modules/grid_search.html

(and it also has some examples)
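
A minimal usage sketch (just a toy example with the iris data, not taken from 
the docs):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# grid of candidate values; every combination gets cross-validated
param_grid = {
    "C": [0.1, 1, 10, 100],       # regularization strength
    "kernel": ["linear", "rbf"],  # kernel candidates
}

search = GridSearchCV(SVC(), param_grid, cv=10)  # 10-fold CV per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)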

Cross-validation (e.g. 10-fold) serves as a measure of both accuracy (how 
accurately the different folds attain the value of the statistic) and 
generalization (whether the accuracy stays similar between folds); at least 
that's what I'm taught at uni.
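
Roughly the idea, as I understand it (a toy sketch, not an official recipe):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(C=1.0), X, y, cv=10)

print(scores.mean())  # accuracy: average score over the 10 folds
print(scores.std())   # generalization: how much the folds disagree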

A bigger problem is deciding which parameters, or which parameter ranges, to 
search over. Some float-valued parameters have ranges that are "more often 
used", while other ranges may not work most of the time. Likewise, some 
kernels and similar options are generally robust, while others become 
computationally very expensive when combined with certain other parameters 
(for example, in MLPClassifier some combinations of activation function and 
hidden_layer_sizes mainly increase computation cost without necessarily 
increasing accuracy).
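
For example, a grid of the kind I mean for MLPClassifier; the ranges below are 
just guesses of mine, not recommended values:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],  # larger -> slower
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3, 1e-2],                       # L2 penalty
}

search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5)
# search.fit(X, y)  # already 3 * 2 * 3 combinations * 5 folds = 90 fits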

The best I've figured out so far is to:

Start with a few of the most often used / major parameters and try to get them 
to produce results that are as accurate as possible within still-affordable 
computation time. Only after that, consider adding more parameters.
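
In code that coarse-then-fine idea looks roughly like this (my own habit, 
nothing official; the values are arbitrary):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# pass 1: coarse grid over the one or two "major" parameters
coarse = GridSearchCV(SVC(), {"C": [0.01, 1, 100]}, cv=5)
# coarse.fit(X, y)

# pass 2: zoom in around the best value, and only then add more parameters
fine = GridSearchCV(
    SVC(),
    {"C": [50, 100, 200], "gamma": ["scale", 0.01, 0.001]},
    cv=5,
)
# fine.fit(X, y)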

However, I haven't found much information on how the parameters of different 
methods are ordered in terms of "significance". One could assume that the 
parameters listed earlier are more major than the later ones. However, some 
parameters also clearly "correlate" with each other, so they have 
cross-effects on accuracy and so on.
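
One thing I do to get a rough feel for which parameters matter is to dump 
cv_results_ into a DataFrame and eyeball it (again just my habit, not a proper 
significance analysis):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
search.fit(X, y)

# sort the tried combinations by mean score to see which parameters move it
results = pd.DataFrame(search.cv_results_)
print(
    results[["params", "mean_test_score", "std_test_score"]]
    .sort_values("mean_test_score", ascending=False)
    .head()
)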

The best bet is probably to just start trying, and then perhaps write down any 
general patterns you notice about what works.

There’s also:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

for designing "pipelines", a sort of "Design of Experiments" on sklearn algos. 
I also found this:
https://towardsdatascience.com/design-your-engineering-experiment-plan-with-a-simple-python-command-35a6ba52fa35
but I haven't tried it, nor do I know whether it's necessary.
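
A minimal Pipeline example (toy sketch; the grid reaches into the steps via 
the "step__param" naming):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)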

BR, Matti

Sent from Mail for Windows 10

From: lampahome
Sent: Thursday, 24 January 2019 11.14
To: Scikit-learn mailing list
Subject: [scikit-learn] How to determine suitable cluster algo

I want to make a customized clustering algo for my datasets; that's because I 
don't want to try every algo and its hyperparameters.

I thought I would just define the default range of important hyperparameters, 
e.g. the number of clusters in K-means.

I want to iterate over some possible clustering algos like K-means, DBSCAN, 
AP, etc., and choose the suitable algo for clustering.

I'm not sure if that is doable, but does GridSearchCV work for me?

Or any other ways to determine that?

thx




