Hi Joel.
Thanks for working on this :)
I like the idea of making the estimator know which parameters need no
refit. I'm not quite sure how _plan_refits would really work, though.
For single estimators _plan_refits would probably be easy; I think the
main question is how to implement _plan_refits for a Pipeline. This
seems somewhat non-trivial, but a central aspect of the proposal. We
should have a non-trivial example of how it would work in some
situations, something like a StandardScaler -> SelectKBest -> LassoLars
pipeline.
Cheers,
Andy
On 05/19/2013 01:07 AM, Joel Nothman wrote:
scikit-learn's general parameter searches currently require calling
fit() on their estimator for every parameter variation, even those
where re-fitting is unnecessary.
Andy has proposed a solution
<https://github.com/scikit-learn/scikit-learn/issues/1626> which
involves providing the estimator with a set of values for each
parameter varied (i.e. a grid), and its predict function will predict
a result for each parameter setting. Unless its return value is a
generator, this may be expensive in terms of memory, and coefficients
for each setting may also need to be returned; but mostly I think it's
a bad idea because it sounds like it will be difficult to implement as
a simple extension of the current API. (I also proposed a solution on
that issue, but it has some big flaws...)
After Lars mentioned that GridSearch needs to call fit for every
transform, I lay awake in bed last night and came up with the following:
BaseEstimator adds a refit(X, y=None, **params) method which has two
explicit preconditions:
1. the data arguments (X, y; and sample_weight, etc., though I'm not
sure how those fit into the method signature) are identical to those in
the most recent call to fit().
2. params are exactly the parameters changed since the last re/fit.
Both of these are implicit in the use of warm_start=True elsewhere,
which is one reason the latter should be deprecated in favour of a
more explicit, general API for minimal re-fitting.
The default implementation looks something like this:
    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if hasattr(self, '_refit_noop_params') and all(
                name in self._refit_noop_params for name in params):
            # every changed parameter is a no-op for fitting
            return self
        return self.fit(X, y)
For example, on SelectKBest, _refit_noop_params would be ['k'], because
fit does not need to be called again when k is modified, though it
does when score_func is modified. Similarly, we need a refit_transform
in TransformerMixin.
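For illustration, here's a toy SelectKBest-like transformer wired up with
the proposed default refit(). The class, its variance-based scoring and
the n_fits_ counter are inventions for this sketch, not existing
scikit-learn API:

```python
import numpy as np

class SelectTopK:
    """Toy SelectKBest-like transformer (illustrative, not sklearn API)."""

    _refit_noop_params = ('k',)  # changing k alone needs no re-fit

    def __init__(self, score_func=np.var, k=10):
        self.score_func = score_func
        self.k = k
        self.n_fits_ = 0  # instrumentation for this example only

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # score every column up front; choosing the top k is deferred
        # to transform, which is what makes k a no-op parameter
        self.scores_ = np.array([self.score_func(col) for col in X.T])
        self.n_fits_ += 1
        return self

    def refit(self, X, y=None, **params):
        # the proposed default implementation
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

    def transform(self, X):
        top = np.argsort(self.scores_)[::-1][:self.k]
        return X[:, np.sort(top)]

X = np.random.RandomState(0).rand(20, 5)
est = SelectTopK(k=2).fit(X)
est.refit(X, k=3)                # only k changed: no re-fit
assert est.n_fits_ == 1
est.refit(X, score_func=np.ptp)  # score_func changed: full re-fit
assert est.n_fits_ == 2
```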
We also implement BaseEstimator._plan_refits(self, param_iterator).
This has two return values. One is a reordering of param_iterator that
attempts to minimise work if fit were called with the first parameter
setting and refit with each subsequent setting in order. The second is
an expected cost for each parameter setting if executed in this order.
For example:

    SelectKBest._plan_refits(ParameterGrid({'score_func': [chi2, f_classif],
                                            'k': [10, 20]}))

might return:

    ([{'score_func': chi2, 'k': 10},
      {'score_func': chi2, 'k': 20},
      {'score_func': f_classif, 'k': 10},
      {'score_func': f_classif, 'k': 20}],
     array([1, 0, 1, 0]))

(array([0, 0, 1, 0]) would have the same effect as a cost and is what
is returned by the below implementation.)
GridSearch may then operate by first calling _plan_refits on its
estimator, then dividing the work by folds and by cost-based partitions
of the reordered parameter space, with each parallelised task calling
clone and fit once, and refit many times.
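That driver loop can be sketched as follows. run_partition and the
counting Stub are hypothetical names, deepcopy stands in for sklearn's
clone, and scoring and fold handling are elided:

```python
import copy

class Stub:
    """Counting estimator obeying the proposed refit contract (illustrative)."""

    _refit_noop_params = ('k',)

    def __init__(self, score_func=None, k=None):
        self.score_func, self.k, self.n_fits = score_func, k, 0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        return self

    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

def run_partition(estimator, X, y, settings):
    """One cost-based partition: clone (deepcopy here) and fit once,
    then refit with exactly the changed parameters for each later setting."""
    fitted = copy.deepcopy(estimator).set_params(**settings[0]).fit(X, y)
    for prev, params in zip(settings, settings[1:]):
        # precondition 2: params passed to refit are exactly the changes
        delta = {k: v for k, v in params.items() if prev.get(k) != v}
        fitted = fitted.refit(X, y, **delta)
    return fitted

settings = [{'score_func': 'chi2', 'k': 10},
            {'score_func': 'chi2', 'k': 20},
            {'score_func': 'f_classif', 'k': 10},
            {'score_func': 'f_classif', 'k': 20}]
fitted = run_partition(Stub(), None, None, settings)
assert fitted.n_fits == 2  # one initial fit + one at the score_func change
```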
A default implementation looks something like this:
    def _plan_refits(self, param_iterator):
        try:
            NOOP_NAMES = set(self._refit_noop_params)
        except AttributeError:
            # presumably fit will be called every time
            param_iterator = list(param_iterator)
            return param_iterator, np.zeros(len(param_iterator))
        # bin parameter settings by common non-noop params
        groups = defaultdict(list)
        for params in param_iterator:
            # sort parameters into two types
            op_params = []
            noop_params = []
            for k, v in params.items():
                (noop_params if k in NOOP_NAMES else op_params).append((k, v))
            groups[tuple(sorted(op_params))].append(noop_params)
        # merge bins and assign nonzero cost at transitions
        groups = list(groups.items())
        reordered = [dict(list(op_params) + noop_params)
                     for op_params, noop_seq in groups
                     for noop_params in noop_seq]
        costs = np.zeros(len(reordered))
        # the first setting of each group after the first needs a full fit
        costs[np.cumsum([len(noop_seq) for op_params, noop_seq in
                         groups[:-1]], dtype=int)] = 1
        return reordered, costs
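As a sanity check, here is the same planning logic as a standalone
function (mirroring the default sketch above, with illustrative names
and string stand-ins for chi2/f_classif), applied to the example grid:

```python
from collections import defaultdict
import numpy as np

def plan_refits(settings, noop_names):
    """Standalone version of the default _plan_refits sketch (illustrative)."""
    # bin parameter settings by their shared non-noop parameters
    groups = defaultdict(list)
    for params in settings:
        op, noop = [], []
        for k, v in params.items():
            (noop if k in noop_names else op).append((k, v))
        groups[tuple(sorted(op))].append(noop)
    groups = list(groups.items())
    # within a group only noop parameters vary, so only the first
    # setting of each later group costs a full fit
    reordered = [dict(list(op) + noop) for op, seq in groups for noop in seq]
    costs = np.zeros(len(reordered))
    costs[np.cumsum([len(seq) for _, seq in groups[:-1]], dtype=int)] = 1
    return reordered, costs

# string stand-ins for chi2 / f_classif keep the example self-contained
grid = [{'score_func': sf, 'k': k}
        for sf in ('chi2', 'f_classif') for k in (10, 20)]
reordered, costs = plan_refits(grid, noop_names={'k'})
assert costs.tolist() == [0.0, 0.0, 1.0, 0.0]
assert reordered[0]['score_func'] == reordered[1]['score_func'] == 'chi2'
```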
While these generic implementations have convenient properties, such as
working entirely on the basis of parameter names and not values, we
can't assume those hold in the general case. In particular, a Pipeline
implementation where steps can be set requires a somewhat more
sophisticated implementation, and non-binary costs. Pipeline.refit may
refit only the tail end of the pipeline, depending on the parameters
it's passed.
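To make the tail-refitting idea concrete, here is a minimal sketch.
TailPipeline, CountingStep and the step-name parsing are hypothetical
stand-ins; a real implementation would delegate to each step's refit and
_plan_refits rather than always calling fit downstream:

```python
class CountingStep:
    """Stub transformer/estimator that counts calls to fit (illustrative)."""

    def __init__(self):
        self.n_fits = 0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        return self

    def transform(self, X):
        return X

class TailPipeline:
    """Hypothetical sketch: refit() re-runs only the steps at and after
    the first step whose parameters changed, reusing upstream fits."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        Xt = X
        for _, est in self.steps[:-1]:
            Xt = est.fit(Xt, y).transform(Xt)
        self.steps[-1][1].fit(Xt, y)
        return self

    def refit(self, X, y=None, **params):
        changed = {name.split('__', 1)[0] for name in params}
        first = min(i for i, (name, _) in enumerate(self.steps)
                    if name in changed)
        Xt = X
        for _, est in self.steps[:first]:
            Xt = est.transform(Xt)  # upstream steps: reuse fitted state
        for i, (name, est) in enumerate(self.steps[first:], first):
            sub = {k.split('__', 1)[1]: v for k, v in params.items()
                   if k.startswith(name + '__')}
            est.set_params(**sub)
            # downstream steps see new inputs, so they are re-fitted here;
            # a smarter version would use est.refit for the first step
            est.fit(Xt, y)
            if i < len(self.steps) - 1:
                Xt = est.transform(Xt)
        return self

pipe = TailPipeline([('scale', CountingStep()),
                     ('select', CountingStep()),
                     ('clf', CountingStep())]).fit([[0.], [1.]])
pipe.refit([[0.], [1.]], clf__alpha=0.1)
# only the final step was re-fitted
assert [est.n_fits for _, est in pipe.steps] == [1, 1, 2]
```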
Cheers,
- Joel
------------------------------------------------------------------------------
AlienVault Unified Security Management (USM) platform delivers complete
security visibility with the essential security capabilities. Easily and
efficiently configure, manage, and operate all of your security controls
from a single console and one unified framework. Download a free trial.
http://p.sf.net/sfu/alienvault_d2d
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general