Hi Joel.
Thanks for working on this :)
I like the idea of making the estimator know which parameters need no
refit. I'm not quite sure how _plan_refits would really work, though.
For single estimators _plan_refits would probably be easy; I think the
main question is how to implement _plan_refits for a Pipeline. This
seems somewhat non-trivial, but a central aspect of the proposal. We
should have a non-trivial example of how it would work in some
situations, something like a StandardScaler -> SelectKBest -> LassoLars
pipeline.
Cheers,
Andy
On 05/19/2013 01:07 AM, Joel Nothman wrote:
scikit-learn's general parameter searches currently require calling
fit() on their estimator for every parameter variation, even those
where re-fitting is unnecessary.
Andy has proposed a solution
<https://github.com/scikit-learn/scikit-learn/issues/1626> which
involves providing the estimator with a set of values for each
parameter varied (i.e. a grid), and its predict function will predict
a result for each parameter setting. Unless its return value is a
generator, this may be expensive in terms of memory, and coefficients
for each setting may also need to be returned; but mostly I think it's
a bad idea because it sounds like it will be difficult to implement as
a simple extension of the current API. (I also proposed a solution on
that issue, but it has some big flaws...)
After Lars mentioned that GridSearch needs to call fit for every
transform, I lay awake in bed last night and came up with the following:
BaseEstimator adds a refit(X, y=None, **params) method which has two
explicit preconditions:
1. the data arguments (X, y; and sample_weight, etc., though I'm not
sure how those fit into the method signature) are identical to those in
the most recent call to fit().
2. params are exactly the parameters changed since the last re/fit.
Both of these are implicit in the use of warm_start=True elsewhere,
which is one reason the latter should be deprecated in favour of a
more explicit, general API for minimal re-fitting.
The default implementation looks something like this:
    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if hasattr(self, '_refit_noop_params') and all(
                name in self._refit_noop_params for name in params):
            # every changed parameter is a no-op for fitting
            return self
        return self.fit(X, y)
For example, on SelectKBest, _refit_noop_params would be ['k'], because
fit does not need to be called again when k is modified, though it
does when score_func is modified. Similarly, we need a refit_transform
in TransformerMixin.
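For illustration, here's a toy SelectKBest-like transformer wired up with
the proposed default refit(). The class, its variance-based scoring and
the n_fits_ counter are inventions for this sketch, not existing
scikit-learn API:

```python
import numpy as np

class SelectTopK:
    """Toy SelectKBest-like transformer (illustrative, not sklearn API)."""

    _refit_noop_params = ('k',)  # changing k alone needs no re-fit

    def __init__(self, score_func=np.var, k=10):
        self.score_func = score_func
        self.k = k
        self.n_fits_ = 0  # instrumentation for this example only

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # score every column up front; choosing the top k is deferred
        # to transform, which is what makes k a no-op parameter
        self.scores_ = np.array([self.score_func(col) for col in X.T])
        self.n_fits_ += 1
        return self

    def refit(self, X, y=None, **params):
        # the proposed default implementation
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

    def transform(self, X):
        top = np.argsort(self.scores_)[::-1][:self.k]
        return X[:, np.sort(top)]

X = np.random.RandomState(0).rand(20, 5)
est = SelectTopK(k=2).fit(X)
est.refit(X, k=3)                # only k changed: no re-fit
assert est.n_fits_ == 1
est.refit(X, score_func=np.ptp)  # score_func changed: full re-fit
assert est.n_fits_ == 2
```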
We also implement BaseEstimator._plan_refits(self, param_iterator).
This has two return values. One is a reordering of param_iterator that
attempts to minimise work if fit were called with the first parameter
setting and refit with each subsequent setting in order. The second is
an expected cost for each parameter setting if executed in this order.
For example:

    SelectKBest._plan_refits(ParameterGrid({'score_func': [chi2, f_classif],
                                            'k': [10, 20]}))

might return:

    ([{'score_func': chi2, 'k': 10},
      {'score_func': chi2, 'k': 20},
      {'score_func': f_classif, 'k': 10},
      {'score_func': f_classif, 'k': 20}],
     array([1, 0, 1, 0]))

(array([0, 0, 1, 0]) would have the same effect as a cost and is what
is returned by the below implementation.)
GridSearch may then operate by first calling _plan_refits on its
estimator, then dividing the work by folds and by cost-based partitions
of the reordered parameter space, with each parallelised task calling
clone and fit once, and refit many times.
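That driver loop can be sketched as follows. run_partition and the
counting Stub are hypothetical names, deepcopy stands in for sklearn's
clone, and scoring and fold handling are elided:

```python
import copy

class Stub:
    """Counting estimator obeying the proposed refit contract (illustrative)."""

    _refit_noop_params = ('k',)

    def __init__(self, score_func=None, k=None):
        self.score_func, self.k, self.n_fits = score_func, k, 0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        return self

    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

def run_partition(estimator, X, y, settings):
    """One cost-based partition: clone (deepcopy here) and fit once,
    then refit with exactly the changed parameters for each later setting."""
    fitted = copy.deepcopy(estimator).set_params(**settings[0]).fit(X, y)
    for prev, params in zip(settings, settings[1:]):
        # precondition 2: params passed to refit are exactly the changes
        delta = {k: v for k, v in params.items() if prev.get(k) != v}
        fitted = fitted.refit(X, y, **delta)
    return fitted

settings = [{'score_func': 'chi2', 'k': 10},
            {'score_func': 'chi2', 'k': 20},
            {'score_func': 'f_classif', 'k': 10},
            {'score_func': 'f_classif', 'k': 20}]
fitted = run_partition(Stub(), None, None, settings)
assert fitted.n_fits == 2  # one initial fit + one at the score_func change
```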
A default implementation looks something like this:
    def _plan_refits(self, param_iterator):
        try:
            NOOP_NAMES = set(self._refit_noop_params)
        except AttributeError:
            # presumably fit will be called every time
            param_iterator = list(param_iterator)
            return param_iterator, np.zeros(len(param_iterator))
        # bin parameter settings by common non-noop params
        groups = defaultdict(list)
        for params in param_iterator:
            # sort parameters into two types
            op_params = []
            noop_params = []
            for k, v in params.items():
                (noop_params if k in NOOP_NAMES else op_params).append((k, v))
            groups[tuple(sorted(op_params))].append(noop_params)
        # merge bins and assign nonzero cost at transitions
        groups = list(groups.items())
        reordered = [dict(list(op_params) + noop_params)
                     for op_params, noop_seq in groups
                     for noop_params in noop_seq]
        costs = np.zeros(len(reordered))
        # the first setting of each group after the first needs a full fit
        costs[np.cumsum([len(noop_seq) for op_params, noop_seq in
                         groups[:-1]], dtype=int)] = 1
        return reordered, costs
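As a sanity check, here is the same planning logic as a standalone
function (mirroring the default sketch above, with illustrative names
and string stand-ins for chi2/f_classif), applied to the example grid:

```python
from collections import defaultdict
import numpy as np

def plan_refits(settings, noop_names):
    """Standalone version of the default _plan_refits sketch (illustrative)."""
    # bin parameter settings by their shared non-noop parameters
    groups = defaultdict(list)
    for params in settings:
        op, noop = [], []
        for k, v in params.items():
            (noop if k in noop_names else op).append((k, v))
        groups[tuple(sorted(op))].append(noop)
    groups = list(groups.items())
    # within a group only noop parameters vary, so only the first
    # setting of each later group costs a full fit
    reordered = [dict(list(op) + noop) for op, seq in groups for noop in seq]
    costs = np.zeros(len(reordered))
    costs[np.cumsum([len(seq) for _, seq in groups[:-1]], dtype=int)] = 1
    return reordered, costs

# string stand-ins for chi2 / f_classif keep the example self-contained
grid = [{'score_func': sf, 'k': k}
        for sf in ('chi2', 'f_classif') for k in (10, 20)]
reordered, costs = plan_refits(grid, noop_names={'k'})
assert costs.tolist() == [0.0, 0.0, 1.0, 0.0]
assert reordered[0]['score_func'] == reordered[1]['score_func'] == 'chi2'
```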
While these generic implementations have convenient properties, such as
working entirely on the basis of parameter names and not values, we
can't assume those hold in the general case. In particular, a Pipeline
implementation where steps can be set requires a somewhat more
sophisticated implementation, and non-binary costs. Pipeline.refit may
refit only the tail end of the pipeline, depending on the parameters
it's passed.
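To make the tail-refitting idea concrete, here is a minimal sketch.
TailPipeline, CountingStep and the step-name parsing are hypothetical
stand-ins; a real implementation would delegate to each step's refit and
_plan_refits rather than always calling fit downstream:

```python
class CountingStep:
    """Stub transformer/estimator that counts calls to fit (illustrative)."""

    def __init__(self):
        self.n_fits = 0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        return self

    def transform(self, X):
        return X

class TailPipeline:
    """Hypothetical sketch: refit() re-runs only the steps at and after
    the first step whose parameters changed, reusing upstream fits."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        Xt = X
        for _, est in self.steps[:-1]:
            Xt = est.fit(Xt, y).transform(Xt)
        self.steps[-1][1].fit(Xt, y)
        return self

    def refit(self, X, y=None, **params):
        changed = {name.split('__', 1)[0] for name in params}
        first = min(i for i, (name, _) in enumerate(self.steps)
                    if name in changed)
        Xt = X
        for _, est in self.steps[:first]:
            Xt = est.transform(Xt)  # upstream steps: reuse fitted state
        for i, (name, est) in enumerate(self.steps[first:], first):
            sub = {k.split('__', 1)[1]: v for k, v in params.items()
                   if k.startswith(name + '__')}
            est.set_params(**sub)
            # downstream steps see new inputs, so they are re-fitted here;
            # a smarter version would use est.refit for the first step
            est.fit(Xt, y)
            if i < len(self.steps) - 1:
                Xt = est.transform(Xt)
        return self

pipe = TailPipeline([('scale', CountingStep()),
                     ('select', CountingStep()),
                     ('clf', CountingStep())]).fit([[0.], [1.]])
pipe.refit([[0.], [1.]], clf__alpha=0.1)
# only the final step was re-fitted
assert [est.n_fits for _, est in pipe.steps] == [1, 1, 2]
```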
Cheers,
- Joel
------------------------------------------------------------------------------
AlienVault Unified Security Management (USM) platform delivers complete
security visibility with the essential security capabilities. Easily and
efficiently configure, manage, and operate all of your security controls
from a single console and one unified framework. Download a free trial.
http://p.sf.net/sfu/alienvault_d2d
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general