And that branch now includes a prototype of BaseSearchCV using iter_fits,
with enet_path backing an iter_fits method. (Unfortunately, I can't easily
do the same for lars_path: its output does not map as straightforwardly
onto estimator coefs as enet_path's does, and its input does not allow
specifying the alphas explicitly.)
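For concreteness, here is a rough sketch of how an iter_fits built on
enet_path might work. The function name, signature, and grouping key below
are my own assumptions for illustration, not the actual branch API: one
path call serves a whole group of candidate alphas, and we yield one
fitted clone per setting.

```python
import numpy as np
from itertools import groupby
from sklearn.base import clone
from sklearn.linear_model import ElasticNet, enet_path

def iter_fits_enet(estimator, param_settings, X, y):
    """Hypothetical iter_fits for ElasticNet (illustration only).

    Groups candidate settings by l1_ratio, computes one regularization
    path per group with enet_path, and yields (params, fitted clone)
    pairs. Assumes fit_intercept=False / pre-centered data.
    """
    keyfunc = lambda p: p.get('l1_ratio', 0.5)
    for l1_ratio, group in groupby(sorted(param_settings, key=keyfunc), keyfunc):
        by_alpha = {p['alpha']: p for p in group}
        # One coordinate-descent path covers every alpha in this group.
        # enet_path returns the alphas it actually used (sorted), so we
        # match coefs back to settings via the returned alphas.
        alphas_out, coefs, _ = enet_path(
            X, y, l1_ratio=l1_ratio, alphas=list(by_alpha))
        for alpha, coef in zip(alphas_out, coefs.T):
            params = by_alpha[float(alpha)]
            est = clone(estimator).set_params(**params)
            est.coef_ = coef
            est.intercept_ = 0.0  # pre-centered data assumed
            yield params, est
```

The point is only that fit work shared between candidates (the path) is
computed once per group rather than once per candidate.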
For example, take the grid-searched pipeline in
https://gist.github.com/jnothman/5624440 (StandardScaler, SelectKBest,
ElasticNet; cv=3, n_jobs=1):
(iter_fits)$ python -m cProfile example_pipeline.py | grep '(\(fit\|transform\|fit_transform\))'
        2    0.000    0.000    0.002    0.001 base.py:360(fit_transform)
      241    0.003    0.000    0.124    0.001 coordinate_descent.py:153(fit)
        1    0.000    0.000    0.445    0.445 grid_search.py:679(fit)
        1    0.000    0.000    0.002    0.002 pipeline.py:128(fit)
        7    0.000    0.000    0.001    0.000 preprocessing.py:301(fit)
      247    0.006    0.000    0.019    0.000 preprocessing.py:332(transform)
      265    0.007    0.000    0.043    0.000 univariate_selection.py:285(transform)
        7    0.000    0.000    0.006    0.001 univariate_selection.py:338(fit)
(master)$ python -m cProfile example_pipeline.py | grep '(\(fit\|transform\|fit_transform\))'
      482    0.002    0.000    0.289    0.001 base.py:343(fit_transform)
      481    0.012    0.000    0.102    0.000 base.py:61(transform)
      241    0.004    0.000    0.105    0.000 coordinate_descent.py:152(fit)
        1    0.000    0.000    1.208    1.208 grid_search.py:672(fit)
      241    0.002    0.000    0.402    0.002 pipeline.py:126(fit)
      241    0.002    0.000    0.032    0.000 preprocessing.py:301(fit)
      481    0.013    0.000    0.037    0.000 preprocessing.py:332(transform)
      241    0.003    0.000    0.184    0.001 univariate_selection.py:292(fit)
(And yes, even with this inefficient implementation of iter_fits, and no
major time-saving components in the pipeline, the iter_fits version seems
to run in 67% of the total time.)
A couple of caveats discovered en route: iter_fits reorders the
parameter_iterator used in BaseSearchCV, so:
* iter_fits must be deterministic in its reordering of the input
param_iterator to match results across different folds;
* GridSearchCV results reshaping is no longer straightforward; and
* where multiple candidates produce the same score, the reordering affects
which argmax is selected.
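To make the first caveat concrete, here is a small sketch (my own
illustration, not code from the branch) of a deterministic reordering: a
stable sort on a canonical string key, so every fold enumerates the
candidates in the same order even when parameter values (functions, numpy
arrays) are not orderable or hashable.

```python
def reorder_candidates(candidates, shared_keys):
    """Deterministically reorder candidate param dicts so that settings
    sharing the same values for shared_keys (e.g. an expensive earlier
    Pipeline step's params) become adjacent.

    repr() is used as a canonical sort key because parameter values may
    not be orderable or hashable; sorted() is stable, so ties keep their
    original relative order and the result is identical across folds.
    """
    def key(params):
        return tuple(repr(params[k]) for k in shared_keys)
    return sorted(candidates, key=key)

candidates = [
    {'sel__k': 2, 'clf__C': 1},
    {'sel__k': 1, 'clf__C': 1},
    {'sel__k': 2, 'clf__C': 10},
    {'sel__k': 1, 'clf__C': 10},
]
ordered = reorder_candidates(candidates, shared_keys=['sel__k'])
```

With a key like this, all sel__k=1 candidates come before all sel__k=2
candidates in every fold, so the earlier step's fit can be reused within
each group, and the reordering is reproducible.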
~J
On Tue, May 21, 2013 at 12:49 PM, Joel Nothman <[email protected]> wrote:
> Doing the ordering in the parameter space rather than in sampled
> candidates would also be difficult for regularization paths. And it becomes
> much more complicated when Pipeline steps can themselves be set, because
> then one parameter's value determines the availability and ordering of
> other parameter names.
>
> In the meantime, I've procrastinated by hacking up an implementation of
> iter_fits for Pipeline, as yet ignoring the setting of steps.
>
> With https://github.com/jnothman/scikit-learn/tree/iter_fits
>
> >>> from sklearn import (pipeline, grid_search, datasets, linear_model,
> ...                      feature_selection)
> >>> clf = pipeline.Pipeline([('sel', feature_selection.SelectKBest()),
> ...                          ('clf', linear_model.LogisticRegression())])
> >>> params = grid_search.ParameterGrid(
> ...     {'sel__k': [1, 2],
> ...      'sel__score_func': [feature_selection.chi2, feature_selection.f_classif],
> ...      'clf__C': [1, 10]})
> >>> iris = datasets.load_iris()
> >>> gen = clf.iter_fits(params, iris.data, iris.target == 1)
> >>> for params, est in gen:
> ...     print(params)
> {'sel__k': 2, 'clf__C': 1, 'sel__score_func': <function chi2 at 0x3803770>}
> {'sel__k': 2, 'clf__C': 10, 'sel__score_func': <function chi2 at 0x3803770>}
> {'sel__k': 1, 'clf__C': 1, 'sel__score_func': <function chi2 at 0x3803770>}
> {'sel__k': 1, 'clf__C': 10, 'sel__score_func': <function chi2 at 0x3803770>}
> {'sel__k': 1, 'clf__C': 1, 'sel__score_func': <function f_classif at 0x3803730>}
> {'sel__k': 1, 'clf__C': 10, 'sel__score_func': <function f_classif at 0x3803730>}
> {'sel__k': 2, 'clf__C': 1, 'sel__score_func': <function f_classif at 0x3803730>}
> {'sel__k': 2, 'clf__C': 10, 'sel__score_func': <function f_classif at 0x3803730>}
>
>
>
> On Tue, May 21, 2013 at 12:24 PM, Joel Nothman <[email protected]> wrote:
>
>> Hi Ken,
>>
>> It's not just the optimal parameter ordering, but the estimators noting
>> when fit does or does not need to be re-run. This is complicated but
>> important in the Pipeline case, where earlier steps affect later ones but
>> not the other way around. And some of the earlier steps may be expensive
>> to fit, such as PCA. So you would have to specify not only the optimal
>> ordering, but also when particular estimators and sub-estimators can be
>> reused without refitting.
>>
>> Currently, the parameter searches clone and fit the estimator for each
>> candidate. Anything that relies on some optimal ordering needs to not do
>> that.
>>
>> And apart from designing an API (which we'd still need to do for your
>> solution), and implementing planners for tricky cases such as Pipeline,
>> it's actually not that complicated.
>>
>> However, you are right to the extent that optimal ordering can often be
>> done when specifying the parameter space (manually or automatically),
>> before sampling candidates from it, and that may be a clever way to go
>> about it. But it may also be tricky to extend to, say, hyperopt parameter
>> spaces.
>>
>>
>> On Tue, May 21, 2013 at 12:11 PM, Kenneth C. Arnold <[email protected]> wrote:
>>
>>> I haven't been following the details of this thread, but I thought: why
>>> automate? GridSearch could, e.g., take an OrderedDict of parameters, and
>>> try combinations in C-array order. (For parallelism, maybe batches could be
>>> queued up in the opposite (i.e., Fortran) order, though I haven't thought
>>> that one through in detail.) Then the user who wants to turbocharge their
>>> grid search can think through the optimal parameter ordering themselves. Or
>>> a dedicated Planner or Driver could work out how to order parameters,
>>> distribute work between machines, and maybe even explore the parameter
>>> space intelligently (a la hyperopt), but the simple objects don't have to
>>> care.
>>>
>>>
>>> -Ken
>>>
>>>
>>> On Mon, May 20, 2013 at 9:58 PM, Joel Nothman <[email protected]> wrote:
>>>
>>>> Another advantage of the approach I proposed the other day is that its
>>>> overhead in sorting through the parameters is done once per search.
>>>> Solutions that integrate the planning into the fitting require the planning
>>>> to be done once per fold.
>>>>
>>>> A big frustration in implementing any of this: grouping by parameter
>>>> values is complicated by the fact that parameter values may not be
>>>> orderable, boolean-comparable or hashable (numpy arrays are none of these),
>>>> which means the standard groupby(sorted(...)) and defaultdict(list) cannot
>>>> be used to bin them together. (This could be avoided if we had your
>>>> original proposal of receiving a set of values for each parameter; but we
>>>> can't really make the assumption that everything is a grid.)
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Try New Relic Now & We'll Send You this Cool Shirt
>>>> New Relic is the only SaaS-based application performance monitoring
>>>> service
>>>> that delivers powerful full stack analytics. Optimize and monitor your
>>>> browser, app, & servers with just a few lines of code. Try New Relic
>>>> and get this awesome Nerd Life shirt!
>>>> http://p.sf.net/sfu/newrelic_d2d_may
>>>>
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>>
>>
>