On 05/20/2013 05:20 AM, Joel Nothman wrote:
I couldn't help but work on it, it seems.

The Pipeline's refit is trivial given that all sub-estimators have a refit that will do nothing if certain parameters are not passed (and in case not all have set their noop_params, we can explicitly only refit from the first step where a parameter is changed).

By trivial, I mean just take the current fit(_transform) implementation and replace fit(_transform) with refit(_transform), passing each step its parameters.
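
To make that concrete, here is a minimal sketch of the idea in plain Python. It is not scikit-learn code: `ToyPipeline`, `CountingStep` and `refit_transform` are hypothetical names invented for illustration. The sketch assumes each step's output from the last fit is cached, so refitting can start at the first step whose parameters actually change.

```python
class CountingStep:
    """Toy transformer: adds `offset` to every value and counts real fits."""

    def __init__(self, offset=0):
        self.offset = offset
        self.n_fits = 0

    def fit_transform(self, X):
        self.n_fits += 1
        return [x + self.offset for x in X]


class ToyPipeline:
    """Hypothetical pipeline that caches each step's output between fits."""

    def __init__(self, steps):
        self.steps = steps      # list of (name, step) pairs
        self._input = None      # data passed to the last full fit
        self._cache = {}        # step name -> output of that step

    def fit_transform(self, X):
        self._input = X
        for name, step in self.steps:
            X = step.fit_transform(X)
            self._cache[name] = X
        return X

    def refit_transform(self, **params):
        """Refit from the first step whose parameters change.

        `params` uses the usual `step__param` naming, e.g. b__offset=20.
        """
        changed = {}
        for key, value in params.items():
            name, attr = key.split("__", 1)
            changed.setdefault(name, {})[attr] = value

        X = self._input
        dirty = False  # becomes True once an upstream step has been refit
        for name, step in self.steps:
            new = changed.get(name, {})
            unchanged = all(getattr(step, k) == v for k, v in new.items())
            if not dirty and unchanged:
                X = self._cache[name]  # no-op: reuse the cached output
                continue
            dirty = True
            for k, v in new.items():
                setattr(step, k, v)
            X = step.fit_transform(X)
            self._cache[name] = X
        return X


first = CountingStep(offset=1)
second = CountingStep(offset=10)
pipe = ToyPipeline([("a", first), ("b", second)])
full = pipe.fit_transform([0, 1])
partial = pipe.refit_transform(b__offset=20)
```

Here only step `b` is fitted a second time; step `a` is served from the cache because none of its parameters changed.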

Its _plan_refits is not as trivial. First, let us assume no steps are set as parameters <https://github.com/scikit-learn/scikit-learn/pull/1769>. Consider the parameters as a tree, like that attached, with each layer corresponding to a step and the parameters set there. (The tree attached corresponds to a grid, but it needn't be so balanced.) If you reorder the children of each node using the relevant _plan_refits (perhaps with memoization for the common grid case), and aggregate the costs appropriately, you should end up with a good plan.
I will have to think about this after the NIPS deadline ;)
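
As a rough illustration of the parameter-tree ordering described above, here is a simplified sketch. It assumes every step costs the same to fit and simply sorts candidates so that those sharing early-step parameters sit next to each other; the real proposal would reorder children using each step's own _plan_refits and aggregate real costs. The function names and the `pca` / `svc` step names are made up for the example.

```python
def step_key(candidate, step_order):
    """Group a candidate's parameters by pipeline step, in step order."""
    return tuple(
        tuple(sorted((name, value) for name, value in candidate.items()
                     if name.split("__", 1)[0] == step))
        for step in step_order
    )


def plan_refits(candidates, step_order):
    """Order candidates so shared prefixes of the tree are adjacent."""
    return sorted(candidates, key=lambda c: step_key(c, step_order))


def count_step_fits(ordered, step_order):
    """Count step fits when refitting starts at the first changed step."""
    total = 0
    previous = None
    for candidate in ordered:
        key = step_key(candidate, step_order)
        first_changed = 0
        if previous is not None:
            while (first_changed < len(step_order)
                   and key[first_changed] == previous[first_changed]):
                first_changed += 1
        # Every step from the first changed one onwards must be refit.
        total += len(step_order) - first_changed
        previous = key
    return total


steps = ["pca", "svc"]  # illustrative step names
naive = [
    {"pca__n_components": 2, "svc__C": 1},
    {"pca__n_components": 3, "svc__C": 1},
    {"pca__n_components": 2, "svc__C": 10},
    {"pca__n_components": 3, "svc__C": 10},
]
planned = plan_refits(naive, steps)
naive_cost = count_step_fits(naive, steps)
planned_cost = count_step_fits(planned, steps)
```

On this 2x2 grid the interleaved order needs 8 step fits, while the planned order needs 6, because the first step is only refit when its own parameter changes.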

Now, I don't know enough about regularisation paths. I am worried they don't fit naturally in this framework, because they need all candidate values for the relevant parameter to be specified at once; I was hoping someone would shout that at me when I proposed this. Could you please clarify?
Basically you have a highest and a lowest value of the regularization parameter, fit one model for the highest value, and can then efficiently produce models for all values in between.
The thing here is that you can efficiently compute the models for all values of the parameter if you compute them together.
This is basically the opposite of your "group by value" strategy.
If you are not so much into linear models, another way of thinking about it is with trees / forests: if you compute a tree up to a certain depth, you basically get the tree with smaller depths for free - similar things are true for boosting.

For these you basically need to know a maximum and / or minimum parameter setting and compute solutions for all parameter settings at once. I feel like this is the really interesting case, which we should try to solve.
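
A toy example of the "compute them together" case, in plain Python. This is not how any scikit-learn estimator is implemented; it is one-feature ridge regression without an intercept, chosen only because the closed form w = sxy / (sxx + alpha) makes the sharing obvious. The function name `ridge_path_1d` is made up for illustration.

```python
def ridge_path_1d(x, y, alphas):
    """Ridge coefficient of a 1-D, no-intercept model for every alpha.

    Minimising sum((y_i - w * x_i) ** 2) + alpha * w ** 2 gives
    w = sxy / (sxx + alpha).
    """
    # The pass over the data happens once ...
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # ... and every additional alpha costs a single division.
    return {alpha: sxy / (sxx + alpha) for alpha in alphas}


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
path = ridge_path_1d(x, y, alphas=[0.0, 14.0])
```

The estimator has to see all candidate values in one call to share the work, which is why this kind of parameter is hard to express as a sequence of independent refits.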

Basically your proposal addresses cases where one doesn't need to touch parts of the pipeline at all.
It wouldn't help us get rid of any of the CV objects, though.


Is there something interesting about StandardScaler, or have you thrown it in for fun? Or as an example where transform is more expensive than fit?

Just for fun ;) Basically I thought that was one that you don't really need to refit at all (for a given fold) as you usually don't search over any parameters.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
