I couldn't help but work on it, it seems.

The Pipeline's refit is trivial given that all sub-estimators have a refit
that will do nothing when no relevant parameters are passed (and in case
not all of them have set their _refit_noop_params, we can explicitly refit
only from the first step where a parameter has changed).

By trivial, I mean just take the current fit(_transform) implementation and
replace fit(_transform) with refit(_transform), passing each step its
parameters.
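To make that concrete, here is a minimal sketch of the shape I have in
mind, using toy steps rather than real estimators (the Step class and its
parameters are illustrative only; refit_transform and _refit_noop_params
are the proposed additions):

```python
# Minimal sketch of Pipeline.refit, assuming every step gains the
# proposed refit_transform.  Step is a toy transformer, not a real
# scikit-learn estimator.

class Step:
    _refit_noop_params = ('k',)  # changing k needs no refit

    def __init__(self, k=1):
        self.k = k
        self.n_fits = 0  # count fits so we can see refit skipping work

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_transform(self, X):
        self.n_fits += 1
        return X  # identity transform, for simplicity

    def refit_transform(self, X, **params):
        # Precondition: X is the same data as in the last fit, and
        # params are exactly the changes since then.
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return X  # only no-op params changed: reuse the previous fit
        return self.fit_transform(X)


class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X):
        for name, step in self.steps:
            X = step.fit_transform(X)
        return self

    def refit(self, X, **params):
        # Split 'stepname__param' keys by step, then replay the chain,
        # handing each step only its own changed parameters.
        by_step = {name: {} for name, _ in self.steps}
        for key, value in params.items():
            step_name, _, param = key.partition('__')
            by_step[step_name][param] = value
        for name, step in self.steps:
            X = step.refit_transform(X, **by_step[name])
        return self
```

A pipeline fit once and then refit with only a no-op parameter changed
never calls fit_transform again; a fuller version would refit from the
first step whose non-noop parameters actually changed.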

Its _plan_refits is not as trivial. First, let us assume the steps
themselves are not set as parameters
<https://github.com/scikit-learn/scikit-learn/pull/1769>. Consider the
parameters as a tree, like the one attached, with each layer corresponding
to a step and the parameters set there. (The attached tree corresponds to a
grid, but it needn't be so balanced.) If you reorder the children of each
node using the relevant _plan_refits (perhaps with memoization for the
common grid case), and aggregate the costs appropriately, you should arrive
at a good plan.

Where one can set the steps of the Pipeline as parameters, you simply need
to operate over a separate tree for each setting of steps. [The trees
aren't an in-memory data structure but a recursive descent.]
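As a rough illustration of that descent (all names here are hypothetical;
the per-step planner groups settings by their non-noop parameters, as in
the generic implementation, and the costs count only each step's own
refits, not the downstream refits an upstream change would force):

```python
# Hypothetical sketch: plan refits for a pipeline by recursive descent
# over steps, earliest step varying slowest, with per-grid memoization.

from collections import defaultdict

def plan_step(settings, noop_names):
    """Order one step's settings so non-noop changes are adjacent.
    Returns (ordered settings, costs): cost 1 where this step must
    actually refit, else 0."""
    groups = defaultdict(list)
    for params in settings:
        op = tuple(sorted((k, v) for k, v in params.items()
                          if k not in noop_names))
        groups[op].append(params)
    ordered, costs = [], []
    for i, members in enumerate(groups.values()):
        for j, params in enumerate(members):
            ordered.append(params)
            costs.append(1 if (j == 0 and i > 0) else 0)
    return ordered, costs

def plan_pipeline(step_grids, noop_by_step):
    """step_grids: list of (step_name, list of param dicts), earliest
    step first.  Returns a flat list of (merged params, cost) leaves."""
    if not step_grids:
        return [({}, 0)]
    (name, grid), rest = step_grids[0], step_grids[1:]
    ordered, costs = plan_step(grid, noop_by_step.get(name, ()))
    # in a grid every node's subtree is identical, so plan it once
    # (the memoization mentioned above)
    sub_plan = plan_pipeline(rest, noop_by_step)
    plan = []
    for params, cost in zip(ordered, costs):
        for sub_params, sub_cost in sub_plan:
            merged = {name + '__' + k: v for k, v in params.items()}
            merged.update(sub_params)
            plan.append((merged, cost + sub_cost))
            cost = 0  # only the first leaf under a node pays its cost
    return plan
```

So for a two-step grid where the second step's only parameter is a no-op,
a cost is charged just once, at the transition in the first step's
setting.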

Of course, this still does some unnecessary work that your proposed
solution avoids: transform will always be called for every step.
Memoization is potentially an easy solution to that.
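For instance, a transformer could memoise its last output, keyed on the
identity of its input (a fuller version would also key on the non-noop
parameters); a toy sketch, with the class and its doubling transform
purely illustrative:

```python
# Toy sketch of memoising transform so a no-op refit need not recompute
# its output for unchanged input.

class MemoisedTransformer:
    def __init__(self):
        self._cache_key = None
        self._cache_out = None
        self.n_transforms = 0  # count the expensive calls

    def _do_transform(self, X):
        self.n_transforms += 1
        return [x * 2 for x in X]  # stand-in for real work

    def transform(self, X):
        key = id(X)  # plus non-noop params, in a fuller version
        if key != self._cache_key:
            self._cache_out = self._do_transform(X)
            self._cache_key = key
        return self._cache_out
```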

Now, I don't know enough about regularisation paths. I am worried they
don't fit naturally into this framework, because they need all candidate
values for the relevant parameter to be specified at once; I was hoping
someone would shout that at me when I proposed this. Could you please
clarify?

Is there something interesting about StandardScaler, or have you thrown it
in for fun? Or as an example where transform is more expensive than fit?

- Joel



On Sun, May 19, 2013 at 11:58 PM, Andreas Mueller
<[email protected]> wrote:

>  Hi Joel.
> Thanks for working on this :)
> I like the idea of making the estimator know which parameters need no
> refit.
> I'm not quite sure how the _plan_refits would really work, though.
>
> As the "_plan_refits" for single estimators would probably be easy, I
> think the main question is how to implement the "_plan_refits" of the
> pipeline. This seems somewhat non-trivial, but a central aspect of the
> proposal.
>
> We should have some non-trivial example of how it would work in some
> situations, something like a StandardScaler -> SelectKBest -> LassoLars
> pipeline.
>
> Cheers,
> Andy
>
>
>
> On 05/19/2013 01:07 AM, Joel Nothman wrote:
>
>  scikit-learn's general parameter searches currently require calling
> fit() on their estimator for every parameter variation, even those where
> re-fitting is unnecessary.
>
>  Andy has proposed a solution
> <https://github.com/scikit-learn/scikit-learn/issues/1626> which
> involves providing the estimator with a set of values for each parameter
> varied (i.e. a grid), and its predict function will predict a result for
> each parameter setting. Unless its return value is a generator, this may
> be expensive in terms of memory, and coefficients for each setting may
> also need to be returned; but mostly I think it's a bad idea because it
> sounds like it will be difficult to implement as a simple extension of
> the current API. (I also proposed a solution on that issue, but it has
> some big flaws...)
>
>  After Lars mentioned that GridSearch needs to call fit for every
> transform, I lay awake in bed last night and came up with the following:
>
>  BaseEstimator adds a refit(X, y=None, **params) method which has two
> explicit preconditions:
>
>    1. the data arguments (X, y; and sample_weight, etc., though I'm not
>    sure how they fit into the method signature) are identical to those of
>    the most recent call to fit().
>    2. params are exactly the changes since the last re/fit.
>
>  Both of these are implicit in the use of warm_start=True elsewhere,
> which is one reason the latter should be deprecated in favour of a more
> explicit, general API for minimal re-fitting.
>
>  The default implementation looks something like this:
>
>  def refit(self, X, y=None, **params):
>      self.set_params(**params)
>      if hasattr(self, '_refit_noop_params') and all(
>              name in self._refit_noop_params for name in params):
>          return self
>      return self.fit(X, y)
>
>  For example, on SelectKBest, _refit_noop_params would be ['k'], because
> fit does not need to be called again when k is modified, though it does
> when score_func is modified. Similarly, we need a refit_transform in
> TransformerMixin.
>
>  We also implement BaseEstimator._plan_refits(self, param_iterator). This
> has two return values. One is a reordering of param_iterator that
> attempts to minimise work if fit were called with the first parameter
> setting, and refit with each of the subsequent settings in order. The
> second is an expected cost for each parameter setting if executed in
> this order.
>
>  For example:
>
>      SelectKBest._plan_refits(ParameterGrid({'score_func': [chi2, f_classif],
>                                              'k': [10, 20]}))
>
>  might return:
>
>      ([{'score_func': chi2, 'k': 10},
>        {'score_func': chi2, 'k': 20},
>        {'score_func': f_classif, 'k': 10},
>        {'score_func': f_classif, 'k': 20}],
>       array([1, 0, 1, 0]))
>
>  (array([0, 0, 1, 0]) would have the same effect as a cost, and is what
> the implementation below returns.)
>
>  GridSearch may then operate by first calling _plan_refits on its
> estimator, dividing the work by folds and cost-based partitions of the
> reordered parameter space, with the parallelised function calling clone
> and fit once, and refit many times.
>
>  A default implementation looks something like this:
>
>  from collections import defaultdict
>  import numpy as np
>
>  def _plan_refits(self, param_iterator):
>      try:
>          NOOP_NAMES = set(self._refit_noop_params)
>      except AttributeError:
>          # presumably fit will be called every time
>          param_iterator = list(param_iterator)
>          return param_iterator, np.zeros(len(param_iterator))
>
>      # bin parameter settings by common non-noop params
>      groups = defaultdict(list)
>      for params in param_iterator:
>          # sort parameters into two types
>          op_params = []
>          noop_params = []
>          for k, v in params.items():
>              (noop_params if k in NOOP_NAMES else op_params).append((k, v))
>          groups[tuple(sorted(op_params))].append(noop_params)
>
>      # merge bins and assign nonzero cost at transitions
>      groups = list(groups.items())
>      reordered = [dict(op_params + noop_params)
>                   for op_params, noop_seq in groups
>                   for noop_params in noop_seq]
>      costs = np.zeros(len(reordered))
>      costs[np.cumsum([len(noop_seq) for op_params, noop_seq in groups[:-1]],
>                      dtype=int)] = 1
>      return reordered, costs
>
>  While these generic implementations have some properties such as working
> entirely on the basis of parameter names and not values, we can't assume
> that in the general case. In particular, a Pipeline implementation where
> steps can be set requires a somewhat more sophisticated implementation, and
> non-binary costs. Pipeline.refit may refit only the tail end of the
> pipeline, depending on the parameters it's passed.
>
>  Cheers,
>
>  - Joel
>
>
> ------------------------------------------------------------------------------
> AlienVault Unified Security Management (USM) platform delivers complete
> security visibility with the essential security capabilities. Easily and
> efficiently configure, manage, and operate all of your security controls
> from a single console and one unified framework. Download a free trial.
> http://p.sf.net/sfu/alienvault_d2d
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

Attachment: pipeline_params_tree.pdf
Description: Adobe PDF document
