Hi Joel.
Thanks for working on this :)
I like the idea of making the estimator know which parameters need no refit.
I'm not quite sure how the _plan_refits would really work, though.

For single estimators the "_plan_refits" would probably be easy; I think the main question is how to implement the "_plan_refits" of the pipeline. This seems somewhat non-trivial, but it is
a central aspect of the proposal.

We should have a non-trivial example of how it would work in some situations,
something like a StandardScaler -> SelectKBest -> LassoLars pipeline.

Cheers,
Andy


On 05/19/2013 01:07 AM, Joel Nothman wrote:
scikit-learn's general parameter searches currently require calling fit() on their estimator for every parameter variation, even those where re-fitting is unnecessary.

Andy has proposed a solution <https://github.com/scikit-learn/scikit-learn/issues/1626> which involves providing the estimator with a set of values for each parameter varied (i.e. a grid), and its predict function will predict a result for each parameter setting. Unless its return value is a generator, this may be expensive in terms of memory, and coefficients for each setting may also need to be returned; but mostly I think it's a bad idea because it sounds like it will be difficult to implement as a simple extension of the current API. (I also proposed a solution on that issue, but it has some big flaws...)

After Lars mentioned that GridSearch needs to call fit for every transform, I lay awake in bed last night and came up with the following:

BaseEstimator adds a refit(X, y=None, **params) method which has two explicit preconditions:

 1. the data arguments (X, y; and sample_weight, etc., though I'm not
    sure how those fit into the method signature) are identical to the
    most recent call to fit().
 2. params are exactly the changes since the last re/fit.

Both of these are implicit in the use of warm_start=True elsewhere, which is one reason the latter should be deprecated in favour of a more explicit, general API for minimal re-fitting.

The default implementation looks something like this:

def refit(self, X, y=None, **params):
    self.set_params(**params)
    if hasattr(self, '_refit_noop_params') and all(
            name in self._refit_noop_params for name in params):
        return self
    return self.fit(X, y)

For example, on SelectKBest, _refit_noop_params would be ['k'] because fit does not need to be called again when k is modified, though it does when score_func is modified. Similarly, we need a refit_transform in TransformerMixin.
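To make this concrete, here is a toy sketch of the proposed protocol (ToySelector, refit and _refit_noop_params are illustrative, not existing scikit-learn API): the estimator declares which parameters can change without invalidating its fitted state, and the default refit skips fit() when only those change.

```python
class ToySelector:
    _refit_noop_params = ['k']  # changing k never requires refitting

    def __init__(self, score_func=None, k=10):
        self.score_func = score_func
        self.k = k
        self.n_fits = 0  # count real fits, for demonstration

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        self.scores_ = [self.score_func(row) for row in X]
        return self

    def refit(self, X, y=None, **params):
        # default refit logic from the proposal: skip fit() when every
        # changed parameter is a known no-op
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

est = ToySelector(score_func=sum).fit([[1, 2], [3, 4]])
est.refit([[1, 2], [3, 4]], k=20)            # no-op: k changes for free
assert est.n_fits == 1
est.refit([[1, 2], [3, 4]], score_func=max)  # score_func changed: must refit
assert est.n_fits == 2
```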

We also implement BaseEstimator._plan_refits(self, param_iterator). This has two return values. One is a reordering of param_iterator that attempts to minimise work if fit were called with the first parameter setting and refit with each of the subsequent settings in order. The second is an expected cost for each parameter setting if executed in this order.

For example:
SelectKBest._plan_refits(ParameterGrid({'score_func': [chi2, f_classif], 'k': [10, 20]}))
might return:
    ([
      {'score_func': chi2, 'k': 10},
      {'score_func': chi2, 'k': 20},
      {'score_func': f_classif, 'k': 10},
      {'score_func': f_classif, 'k': 20},
     ],
     array([1, 0, 1, 0])
    )

(array([0, 0, 1, 0]) would have the same effect as a cost vector, and is what is returned by the below implementation.)

GridSearch may then operate by first calling _plan_refits on its estimator, and dividing the work by folds and cost-based partitions of the reordered parameter space; the parallelised function calls clone and fit once, and refit many times.
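As a sketch of how such a search loop might consume a plan (run_plan and CountingEstimator are hypothetical, for illustration only): fit once on the first setting, then walk the reordered settings via refit, passing only the parameters that changed since the previous setting.

```python
class CountingEstimator:
    _refit_noop_params = ['k']

    def __init__(self):
        self.n_fits = 0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        self.n_fits += 1
        return self

    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self
        return self.fit(X, y)

def run_plan(est, X, y, reordered):
    prev = {}
    for params in reordered:
        # refit expects exactly the changes since the last re/fit
        changed = {k: v for k, v in params.items() if prev.get(k) != v}
        if not prev:
            est.set_params(**changed).fit(X, y)
        else:
            est.refit(X, y, **changed)
        prev = params
        yield dict(params)

plan = [{'score_func': 'chi2', 'k': 10}, {'score_func': 'chi2', 'k': 20},
        {'score_func': 'f_classif', 'k': 10}, {'score_func': 'f_classif', 'k': 20}]
est = CountingEstimator()
list(run_plan(est, [[0]], [0], plan))
assert est.n_fits == 2  # one fit per score_func group; k changes are free
```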

A default implementation looks something like this:

import numpy as np
from collections import defaultdict

def _plan_refits(self, param_iterator):
    try:
        noop_names = set(self._refit_noop_params)
    except AttributeError:
        # presumably fit will be called every time
        param_iterator = list(param_iterator)
        return param_iterator, np.zeros(len(param_iterator))

    # bin parameter settings by common non-noop params
    groups = defaultdict(list)
    for params in param_iterator:
        # sort parameters into two types
        op_params = []
        noop_params = []
        for k, v in params.items():
            (noop_params if k in noop_names else op_params).append((k, v))
        groups[tuple(sorted(op_params))].append(noop_params)

    # merge bins and assign nonzero cost at transitions between bins
    groups = list(groups.items())
    reordered = [dict(list(op_params) + noop_params)
                 for op_params, noop_seq in groups
                 for noop_params in noop_seq]
    costs = np.zeros(len(reordered))
    costs[np.cumsum([len(noop_seq)
                     for op_params, noop_seq in groups[:-1]], dtype=int)] = 1
    return reordered, costs

While these generic implementations have some nice properties, such as working entirely on the basis of parameter names and not values, we can't assume that in the general case. In particular, a Pipeline implementation where steps can be set requires a somewhat more sophisticated implementation, and non-binary costs. Pipeline.refit may refit only the tail end of the pipeline, depending on the parameters it is passed.
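As a very rough sketch of that tail-refitting idea (pipeline_refit, Step and the input-caching scheme are all hypothetical, not real scikit-learn code): on refit, find the earliest step whose parameters changed, restore the cached input that step saw on the last fit, and re-run only from there onwards.

```python
def pipeline_refit(steps, cached_inputs, y, **params):
    """steps: list of (name, estimator); cached_inputs[i] is the input
    step i saw on the last fit. Returns index of first refitted step."""
    # parameters are addressed as 'stepname__param', as in Pipeline
    changed_steps = {name.split('__', 1)[0] for name in params}
    first = next(i for i, (name, _) in enumerate(steps)
                 if name in changed_steps)
    Xt = cached_inputs[first]
    for i in range(first, len(steps)):
        name, est = steps[i]
        step_params = {k.split('__', 1)[1]: v for k, v in params.items()
                       if k.split('__', 1)[0] == name}
        est.set_params(**step_params)
        # downstream steps must refit even if their own params are
        # unchanged, since their input may have changed
        if i < len(steps) - 1:
            Xt = est.fit(Xt, y).transform(Xt)
        else:
            est.fit(Xt, y)
    return first

class Step:
    def __init__(self):
        self.n_fits = 0
    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self
    def fit(self, X, y=None):
        self.n_fits += 1
        return self
    def transform(self, X):
        return X

steps = [('scale', Step()), ('select', Step()), ('clf', Step())]
X, y = [[1.0]], [0]
# initial full fit, caching each step's input
cached, Xt = [], X
for name, est in steps:
    cached.append(Xt)
    Xt = est.fit(Xt, y).transform(Xt)

first = pipeline_refit(steps, cached, y, select__k=5)
assert first == 1
assert steps[0][1].n_fits == 1  # head of the pipeline untouched
assert steps[1][1].n_fits == 2 and steps[2][1].n_fits == 2
```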

Cheers,

- Joel


------------------------------------------------------------------------------
AlienVault Unified Security Management (USM) platform delivers complete
security visibility with the essential security capabilities. Easily and
efficiently configure, manage, and operate all of your security controls
from a single console and one unified framework. Download a free trial.
http://p.sf.net/sfu/alienvault_d2d


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
