Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-09 Thread Matthias Feurer

Hi Andy,

Having distribution objects would be useful for several reasons:

1. Having a uniform way to programmatically access the parameters of
all kinds of distribution objects. Currently, I could parse the 'args'
item in 'distribution.__dict__'. I don't know how important this is
for others, though.
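
For illustration, this is what one can already do today with a frozen
scipy.stats distribution (just a demonstration, not a proposed API):

import scipy.stats

u = scipy.stats.uniform(3, 5)
print(u.args)       # (3, 5) -- the positional parameters
print(u.dist.name)  # 'uniform' -- the kind of distribution, buried one level down
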
2. Having a helpful __repr__. Currently, printing a distribution does
not even tell you which kind of distribution it is:

import scipy.stats

uniform = scipy.stats.uniform(3, 5)
print(uniform)
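# prints something like:
# <scipy.stats._distn_infrastructure.rv_frozen object at 0x...>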



3. Some useful distributions cannot easily be expressed with
scipy.stats. Can you give me examples of how you would handle the
following with scipy.stats:

* tuning the number of layers and the number of hidden neurons of the
MLPClassifier?
* tuning C and gamma of SVC on a log scale, e.g. between 2^-12 and
2^12?

I couldn't find appropriate objects in scipy.stats and ended up
defining my own; a sketch of what I mean follows.
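
For the log-scale case, this is roughly what I had to write myself (a
minimal sketch; the class name and interface are mine, not scipy's or
scikit-learn's):

import numpy as np
from sklearn.utils import check_random_state


class LogUniform(object):
    # Sample x so that log2(x) is uniform on [log2(low), log2(high)];
    # rvs mimics the scipy.stats interface RandomizedSearchCV expects.

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, size=None, random_state=None):
        rng = check_random_state(random_state)
        exponent = rng.uniform(np.log2(self.low), np.log2(self.high),
                               size=size)
        return 2.0 ** exponent


# e.g. param_distributions = {"C": LogUniform(2 ** -12, 2 ** 12),
#                             "gamma": LogUniform(2 ** -12, 2 ** 12)}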


Best,
Matthias




On 08.05.2016 23:49, Andreas Mueller wrote:

Hi Matthias.
Can you explain this point again?
Is it about the bad __repr__?

Thanks,
Andy

On 05/07/2016 08:56 AM, Matthias Feurer wrote:

Dear Joel,

Thank you for taking the time to answer my email. I didn't see the PR
on this topic; thanks for pointing me to it. I can see your points with
regard to the get_params() method, and it might be better if I write
more serialization code on my side (although, for example,
RandomizedSearchCV also returns a lot of parameters one would not
consider searching over).


Nevertheless, I still think it would be a good idea to have 
distribution objects in scikit-learn since some common use cases 
cannot be easily handled with scipy.stats (see my last email for 
examples).


Best regards,
Matthias

On 07.05.2016 14:41, Joel Nothman wrote:
On 7 May 2016 at 19:12, Matthias Feurer
<feur...@informatik.uni-freiburg.de> wrote:


1. Return the fit and predict time in `grid_scores_`


This has been proposed for many years as part of an overhaul of 
grid_scores_. The latest attempt is currently underway at 
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a 
good chance of being merged.


2. Add distribution objects to scikit-learn which have get_params and
set_params attributes


Your use of get_params to perform serialisation is certainly not 
what get_params is designed for, though I understand your use of it 
that way... as long as all your parameters are either primitives or 
objects supporting get_params. However, this is not by design. 
Further, param_distributions is a dict whose values are scipy.stats 
rvs; get_params currently does not traverse dicts, so this is 
already unfamiliar territory requiring a lot of design, even once we 
were convinced that this were a valuable use-case, which I am not 
certain of.


3. Add get_params and set_params to CV objects


get_params and set_params are intended to allow programmatic search 
over those parameter settings. This is not often what one does with 
the parameters of CV splitting methods, but I acknowledge that 
supporting this would not be difficult. Still, if serialisation is 
the purpose of this, it's not really the point.





Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Matthias Feurer

Dear Joel,

Thank you for taking the time to answer my email. I didn't see the PR
on this topic; thanks for pointing me to it. I can see your points with
regard to the get_params() method, and it might be better if I write
more serialization code on my side (although, for example,
RandomizedSearchCV also returns a lot of parameters one would not
consider searching over).


Nevertheless, I still think it would be a good idea to have distribution 
objects in scikit-learn since some common use cases cannot be easily 
handled with scipy.stats (see my last email for examples).


Best regards,
Matthias

On 07.05.2016 14:41, Joel Nothman wrote:
On 7 May 2016 at 19:12, Matthias Feurer
<feur...@informatik.uni-freiburg.de> wrote:


1. Return the fit and predict time in `grid_scores_`


This has been proposed for many years as part of an overhaul of 
grid_scores_. The latest attempt is currently underway at 
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a good 
chance of being merged.


2. Add distribution objects to scikit-learn which have get_params and
set_params attributes


Your use of get_params to perform serialisation is certainly not what 
get_params is designed for, though I understand your use of it that 
way... as long as all your parameters are either primitives or objects 
supporting get_params. However, this is not by design. Further, 
param_distributions is a dict whose values are scipy.stats rvs; 
get_params currently does not traverse dicts, so this is already 
unfamiliar territory requiring a lot of design, even once we were 
convinced that this were a valuable use-case, which I am not certain of.


3. Add get_params and set_params to CV objects


get_params and set_params are intended to allow programmatic search 
over those parameter settings. This is not often what one does with 
the parameters of CV splitting methods, but I acknowledge that 
supporting this would not be difficult. Still, if serialisation is the 
purpose of this, it's not really the point.






[Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Matthias Feurer
Dear scikit-learn team,

First of all, the model selection module is really easy to use and has
a nice and clean interface; I really like that. Nevertheless, while
using it for benchmarks I found some shortcomings where I think the
module could be improved.

1. Return the fit and predict time in `grid_scores_`

BaseSearchCV relies on a function called _fit_and_score to produce the
entries in grid_scores_. This function measures the time it takes to
fit a model, predict on the (cross-)validation set, and calculate the
score. It returns this time, which is then discarded:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_search.py#L569

I propose storing this time in grid_scores_ and making it accessible
to the user. The time taken to refit the model (line 596 and the
following lines) should also be measured and made accessible to the
user; a rough sketch of the idea follows.
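
As a rough sketch of what I mean (plain timing code, not the actual
patch):

import time

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
estimator = SVC()

start = time.time()
estimator.fit(iris.data, iris.target)
fit_time = time.time() - start

start = time.time()
score = estimator.score(iris.data, iris.target)
score_time = time.time() - start

# Proposal: keep fit_time and score_time in each grid_scores_ entry
# instead of discarding them after _fit_and_score returns.
print(fit_time, score_time, score)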

2. Add distribution objects to scikit-learn which have get_params and 
set_params attributes

When printing the parameter distributions proposed for use with the
model selection module (scipy.stats objects), the result is something
which cannot be parsed:
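
For example (the exact type and memory address may differ between
scipy versions):

import scipy.stats

dist = scipy.stats.expon(scale=100)
print(dist)
# <scipy.stats._distn_infrastructure.rv_frozen object at 0x...>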



It's also not possible to access the parameters with the
scikit-learn-like methods get_params() and set_params() (actually, the
first of the two should suffice). I propose adding distribution
objects for commonly used distributions (a rough sketch of such an
object follows the list):

1. Categorical variables - replaces the previously used lists
2. RandInt - replaces scipy.stats.randint
3. Uniform - might replace scipy.stats.uniform; I'm not sure whether
that one accepts a lower and an upper bound at construction time
4. LogUniform - does not exist so far; useful for searching C and
gamma in SVMs, the learning rate in NNs, etc.
5. LogUniformInt - the same thing, but returning an integer; useful
for min_samples_split in RF and ET
6. MultipleUniformInt - this is a bit weird as it would return a tuple
of integers, but I could not find any other way to tune both the
number of hidden layers and their size in the MLPClassifier
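
A rough sketch of what such an object could look like (all names and
the exact interface are hypothetical; only rvs mimics scipy.stats):

from sklearn.utils import check_random_state


class Uniform(object):
    # Hypothetical: a uniform distribution on [low, high] that is
    # introspectable via get_params(), unlike a frozen scipy.stats rv.

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, size=None, random_state=None):
        # the same entry point RandomizedSearchCV uses to draw samples
        rng = check_random_state(random_state)
        return rng.uniform(self.low, self.high, size=size)

    def get_params(self, deep=True):
        return {"low": self.low, "high": self.high}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def __repr__(self):
        return "Uniform(low=%r, high=%r)" % (self.low, self.high)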

3. Add get_params and set_params to CV objects

Currently, CV objects like StratifiedKFold look nice when printed, but
it is not possible to access their parameters programmatically in
order to serialize them (without pickle). Since they are part of
BaseSearchCV and returned by a call to BaseSearchCV.get_params(), I
propose adding a parameter getter and setter to the CV objects as
well, to maintain a consistent interface.


I think these changes are not too hard to implement and I am willing to 
do so if you approve these suggestions.

Best regards,
Matthias



Re: [Scikit-learn-general] GSoC2015 Hyperparameter Optimization topic

2015-03-26 Thread Matthias Feurer

Dear Christof, dear scikit-learn team,

This is a great idea; I highly encourage integrating Bayesian
optimization into scikit-learn, since automatically configuring
scikit-learn is quite powerful. It was done by the three winning teams
of the first automated machine learning competition:
https://sites.google.com/a/chalearn.org/automl/


I am writing this e-mail because our research group on learning,
optimization and automated algorithm design
(http://aad.informatik.uni-freiburg.de/) is working on very similar
things which might be useful in this context. Some people in our lab
(together with people from other universities) developed a framework
for robust Bayesian optimization with minimal external dependencies.
It currently depends on GPy, but this dependency could easily be
replaced by the scikit-learn GP. It is probably not as lightweight as
you want it to be for scikit-learn, but you might want to have a look
at the source code. I will provide a link as soon as the project is
public (which will be soon). In the meantime, I can grant read access
to those who are interested. It might be helpful for you to have a
look at the structure of the module.


Besides these remarks, I think that using a GP is a good way to tune
the few hyperparameters of a single model. Another remark: instead of
comparing GPSearchCV only to Spearmint, you should also consider the
TPE algorithm implemented in hyperopt
(https://github.com/hyperopt/hyperopt). You could consider the
following benchmarks:


1. Together with a fellow student I implemented a library called
HPOlib, which provides a few benchmarks for hyperparameter
optimization (for example, some from the 2012 Spearmint paper):
https://github.com/automl/HPOlib It is further described in this
paper: http://automl.org/papers/13-BayesOpt_EmpiricalFoundation.pdf
2. If you are looking for a small pipeline, you can use
sklearn.feature_selection.SelectPercentile with a fixed scoring
function together with a classification algorithm. It adds a single
hyperparameter which should be a good fit for the GP; a sketch
follows below.
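
For illustration, such a pipeline might look like this (a minimal
sketch; the percentile value shown is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline([
    # fixed scoring function, so only 'percentile' remains to be tuned
    ("select", SelectPercentile(score_func=f_classif)),
    ("clf", SVC()),
])

# The single additional hyperparameter for the optimizer:
pipe.set_params(select__percentile=50)
pipe.fit(iris.data, iris.target)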


Best regards,
Matthias




Re: [Scikit-learn-general] Subject: Hyperparameters in scikit-learn

2015-03-25 Thread Matthias Feurer

Hi Andy,

On 24.03.2015 21:00, Andy wrote:

Hi Matthias.
I think that is an interesting direction to go into, and I actually 
thought a bit about if and how we could add something like that to 
scikit-learn.

Is there online documentation for paramsklearn?
I just compiled the current state of the documentation and put it on my 
website here: 
http://aad.informatik.uni-freiburg.de/~feurerm/paramsklearn/index.html. 
Currently, the documentation only shows how to do random search, but I 
hope that I can add an example for pySMAC 
<https://github.com/sfalkner/pysmac/> soon.
It is a bit hard to say what good defaults are, I think, and it often
encodes intuition about the problem.

Hm, yes indeed. You could choose between three approaches here:

* Use defaults which work well across a lot of problem domains.
* Use defaults which are calculated based on dataset properties (like in 
some scikit-learn models).
* Use techniques like Meta-Learning or Algorithm Selection to do this 
job for you.


Because I am working on the last of the three approaches, the defaults 
in the SVC example are more or less the scikit-learn defaults. 
Furthermore, the default is an optional argument. Thus, a hyperparameter 
optimization algorithm is not obliged to make use of it.
The parameter spaces that you want to search are probably different 
between GridSearchCV and a model-based approach, too.
Probably. After some more thinking about this, you have to design a grid 
depending on the computing time available, right? This would make it 
hard to provide a configuration space for GridSearchCV at all.

Do you have any examples or benchmarks available online?
There is nothing besides the random search example. The only benchmark I 
can provide at this moment is that the ParamSklearn approach (together 
with an ensemble post-processing technique) placed third in the manual 
track and first in the auto track of the Chalearn AutoML competition 
<https://sites.google.com/a/chalearn.org/automl/>. If you have a 
specific dataset/benchmark in mind, I can configure ParamSklearn with 
SMAC and tell you about the results.


Best regards,
Matthias


Cheers,
Andy



On 03/24/2015 03:50 PM, Matthias Feurer wrote:

Dear scikit-learn team,

After reading the proposal by Christoph Angermüller to enhance
scikit-learn with Bayesian optimization
(http://sourceforge.net/p/scikit-learn/mailman/message/33630274/) as
a GSoC project, you might also want to think again about integrating
a hyperparameter concept into scikit-learn.


Our group built a framework called ParamSklearn 
(https://bitbucket.org/mfeurer/paramsklearn/overview), which provides 
hyperparameter definitions for a subset of classifiers, regressors 
and preprocessors in scikit-learn. The result is similar to what
James Bergstra did in hpsklearn
(https://github.com/hyperopt/hyperopt-sklearn) and a post from 2010 
(http://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/aanlktilvznvavqr-sbiixcguwyuf6jyq_ijvytdx7...@mail.gmail.com/?page=0). 
In the end you get a configuration space which can then be read by a 
Sequential Model-based Optimization package. For example, we used 
this module for our AutoSklearn entry in the first automated machine 
learning competition: https://sites.google.com/a/chalearn.org/automl/


Optimizing hyperparameters is a challenge in itself, but defining
relevant ranges is also a difficult task for non-experts. Thus, it
would be nice to find a way to integrate the hyperparameter
definitions into scikit-learn (see the bottom of this e-mail for a
suggestion) such that they can be used by the not-yet-existing
GPSearchCV, by the existing RandomizedSearchCV and GridSearchCV, and
also by external tools like our ParamSklearn. The hyperparameter
definitions would leave a user with only two mandatory choices: the
number of evaluations/runtime and the estimator to use.


What do you think?

Best regards,
Matthias Feurer

Currently, we define the hyperparameters with a package called 
HPOlibConfigSpace (https://github.com/automl/HPOlibConfigSpace). For 
the SVC it looks like this:


C = UniformFloatHyperparameter("C", 0.03125, 32768, log=True,
                               default=1.0)
kernel = CategoricalHyperparameter(name="kernel",
                                   choices=["rbf", "poly", "sigmoid"],
                                   default="rbf")
degree = UniformIntegerHyperparameter("degree", 1, 5, default=3)
gamma = UniformFloatHyperparameter("gamma", 3.0517578125e-05, 8,
                                   log=True, default=0.1)
coef0 = UniformFloatHyperparameter("coef0", -1, 1, default=0)
shrinking = CategoricalHyperparameter("shrinking", ["True", "False"],
                                      default="True")
tol = UniformFloatHyperparameter("tol", 1e-5, 1e-1, default=1e-4,
                                 log=True)
class_weight = CategoricalHyperparameter("class_weight",
                                         ["None", "auto"],
                                         default="None")

[Scikit-learn-general] Subject: Hyperparameters in scikit-learn

2015-03-24 Thread Matthias Feurer

Dear scikit-learn team,

After reading the proposal by Christoph Angermüller to enhance
scikit-learn with Bayesian optimization
(http://sourceforge.net/p/scikit-learn/mailman/message/33630274/) as
a GSoC project, you might also want to think again about integrating
a hyperparameter concept into scikit-learn.


Our group built a framework called ParamSklearn 
(https://bitbucket.org/mfeurer/paramsklearn/overview), which provides 
hyperparameter definitions for a subset of classifiers, regressors and 
preprocessors in scikit-learn. The result is similar to what
James Bergstra did in hpsklearn 
(https://github.com/hyperopt/hyperopt-sklearn) and a post from 2010 
(http://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/aanlktilvznvavqr-sbiixcguwyuf6jyq_ijvytdx7...@mail.gmail.com/?page=0). 
In the end you get a configuration space which can then be read by a 
Sequential Model-based Optimization package. For example, we used this 
module for our AutoSklearn entry in the first automated machine learning 
competition: https://sites.google.com/a/chalearn.org/automl/


Optimizing hyperparameters is a challenge in itself, but defining
relevant ranges is also a difficult task for non-experts. Thus, it
would be nice to find a way to integrate the hyperparameter
definitions into scikit-learn (see the bottom of this e-mail for a
suggestion) such that they can be used by the not-yet-existing
GPSearchCV, by the existing RandomizedSearchCV and GridSearchCV, and
also by external tools like our ParamSklearn. The hyperparameter
definitions would leave a user with only two mandatory choices: the
number of evaluations/runtime and the estimator to use.


What do you think?

Best regards,
Matthias Feurer

Currently, we define the hyperparameters with a package called 
HPOlibConfigSpace (https://github.com/automl/HPOlibConfigSpace). For the 
SVC it looks like this:


C = UniformFloatHyperparameter("C", 0.03125, 32768, log=True,
                               default=1.0)
kernel = CategoricalHyperparameter(name="kernel",
                                   choices=["rbf", "poly", "sigmoid"],
                                   default="rbf")
degree = UniformIntegerHyperparameter("degree", 1, 5, default=3)
gamma = UniformFloatHyperparameter("gamma", 3.0517578125e-05, 8,
                                   log=True, default=0.1)
coef0 = UniformFloatHyperparameter("coef0", -1, 1, default=0)
shrinking = CategoricalHyperparameter("shrinking", ["True", "False"],
                                      default="True")
tol = UniformFloatHyperparameter("tol", 1e-5, 1e-1, default=1e-4,
                                 log=True)
class_weight = CategoricalHyperparameter("class_weight",
                                         ["None", "auto"],
                                         default="None")
max_iter = UnParametrizedHyperparameter("max_iter", -1)

cs = ConfigurationSpace()
cs.add_hyperparameter(C)
cs.add_hyperparameter(kernel)
cs.add_hyperparameter(degree)
cs.add_hyperparameter(gamma)
cs.add_hyperparameter(coef0)
cs.add_hyperparameter(shrinking)
cs.add_hyperparameter(tol)
cs.add_hyperparameter(class_weight)
cs.add_hyperparameter(max_iter)

degree_depends_on_poly = EqualsCondition(degree, kernel, "poly")
coef0_condition = InCondition(coef0, kernel, ["poly", "sigmoid"])
cs.add_condition(degree_depends_on_poly)
cs.add_condition(coef0_condition)

The code is more verbose than it has to be, but we are working on
that. The ConfigurationSpace object can then be accessed via a
@staticmethod and used as a parameter description object inside
*SearchCV; a sketch of what we have in mind follows. We can provide a
stripped-down version of HPOlibConfigSpace for integration in
sklearn.externals, as well as the hyperparameter definitions we have
so far.
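
For example, an estimator could expose its search space roughly like
this (the method name is hypothetical, and the snippet reuses the
names from the listing above; BaseSVC stands in for the real base
class):

class SVC(BaseSVC):

    @staticmethod
    def get_hyperparameter_search_space():
        # Build and return the ConfigurationSpace defined above.
        cs = ConfigurationSpace()
        cs.add_hyperparameter(UniformFloatHyperparameter(
            "C", 0.03125, 32768, log=True, default=1.0))
        # ... remaining hyperparameters and conditions as above ...
        return cs

# A *SearchCV object could then obtain the space via:
# cs = SVC.get_hyperparameter_search_space()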