On 05/20/2013 05:20 AM, Joel Nothman wrote:
I couldn't help but work on it, it seems.
The Pipeline's refit is trivial given that all sub-estimators have a
refit that will do nothing if certain parameters are not passed (and
in case not all have set their noop_params, we can explicitly only
refit from the first step where a parameter is changed).
By trivial, I mean just take the current fit(_transform)
implementation and replace fit(_transform) with refit(_transform),
passing each step its parameters.
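To make the idea above concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `Scale`, `ToyPipeline` and `refit_transform` are illustration-only names, not scikit-learn API, and the pipeline caches each step's input so it can restart from the first step whose parameters changed. It only borrows the `step__param` naming convention.

```python
class Scale:
    """Toy transformer standing in for a real pipeline step (hypothetical)."""

    def __init__(self, factor=1.0):
        self.factor = factor
        self.n_fits = 0  # counts how often this step was (re)fitted

    def fit_transform(self, X):
        self.n_fits += 1
        return [x * self.factor for x in X]


class ToyPipeline:
    """Hypothetical pipeline that refits only from the first changed step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, step) pairs

    def fit_transform(self, X):
        self.inputs_ = []  # cache the input each step saw
        for _, step in self.steps:
            self.inputs_.append(X)
            X = step.fit_transform(X)
        self.output_ = X
        return X

    def refit_transform(self, **params):
        # Group "step__param" keys by step name.
        changed = {}
        for key, value in params.items():
            name, param = key.split("__", 1)
            changed.setdefault(name, {})[param] = value
        touched = [i for i, (name, _) in enumerate(self.steps) if name in changed]
        if not touched:
            return self.output_  # no parameter passed: refit does nothing
        start = touched[0]
        X = self.inputs_[start]  # reuse the cached input of the first changed step
        for i in range(start, len(self.steps)):
            name, step = self.steps[i]
            for param, value in changed.get(name, {}).items():
                setattr(step, param, value)
            self.inputs_[i] = X
            X = step.fit_transform(X)
        self.output_ = X
        return X


scale_a, scale_b = Scale(2), Scale(3)
pipe = ToyPipeline([("a", scale_a), ("b", scale_b)])
out_fit = pipe.fit_transform([1, 2])
out_refit = pipe.refit_transform(b__factor=10)  # only step "b" is refitted
```

In this sketch, changing only `b__factor` leaves step "a" untouched, which is the saving the proposal is after.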
Its _plan_refits is not as trivial. First, let us assume no steps are
set as parameters
<https://github.com/scikit-learn/scikit-learn/pull/1769>. Consider the
parameters as a tree, like that attached, with each layer
corresponding to a step and the parameters set there. (The tree
attached corresponds to a grid, but it needn't be so balanced.) If you
reorder the children of each node using the relevant _plan_refits
(perhaps with memoization for the common grid case), and aggregate the
costs appropriately, you should end up with a good plan.
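A rough sketch of the tree idea, assuming the simplest possible cost model (every step fit costs 1) and candidates given as dicts with `step__param` keys. The helper names `step_key`, `plan_order` and `count_step_fits` are made up for illustration; a real _plan_refits would reorder children by estimated cost rather than by first appearance.

```python
from itertools import product


def step_key(candidate, step):
    """Parameters of `candidate` that belong to `step`, as a hashable key."""
    return tuple(sorted((k, v) for k, v in candidate.items()
                        if k.split("__", 1)[0] == step))


def plan_order(candidates, step_names, depth=0):
    """Walk the parameter tree depth-first, one layer per pipeline step."""
    if depth == len(step_names):
        return list(candidates)
    groups = {}
    for cand in candidates:
        groups.setdefault(step_key(cand, step_names[depth]), []).append(cand)
    ordered = []
    for group in groups.values():  # order of first appearance, not cost-sorted
        ordered.extend(plan_order(group, step_names, depth + 1))
    return ordered


def count_step_fits(ordered, step_names):
    """Step fits needed when each candidate refits from its first changed step."""
    total, previous = 0, None
    for cand in ordered:
        first_changed = 0
        if previous is not None:
            first_changed = len(step_names)  # identical candidate: nothing to do
            for i, step in enumerate(step_names):
                if step_key(cand, step) != step_key(previous, step):
                    first_changed = i
                    break
        total += len(step_names) - first_changed
        previous = cand
    return total


steps = ["pca", "svc"]
# Naive ordering: the parameter of the *first* step changes fastest.
naive = [{"pca__n_components": n, "svc__C": C}
         for C, n in product([1, 10], [2, 5])]
planned = plan_order(naive, steps)
naive_cost = count_step_fits(naive, steps)
planned_cost = count_step_fits(planned, steps)
```

On this 2x2 grid the naive order refits both steps for every candidate, while the planned order keeps candidates that share the first step's setting next to each other.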
I will have to think about this after the NIPS deadline ;)
Now, I don't know enough about regularisation paths. I am worried they
don't fit naturally in this framework, because they need all candidate
values for the relevant parameter to be specified at once; I was
hoping someone would shout that at me when I proposed this. Could you
please clarify?
Basically you have a highest and a lowest value of the regularization
parameter; you fit one model for the highest value and can then
efficiently produce models for all values in between.
The thing here is that you can efficiently compute the models for all
values of the parameter if you compute them together.
This is basically the opposite of your "group by value" strategy.
If you are not so much into linear models, another way of thinking about
it is with trees / forests: if you compute a tree up to a certain depth,
you basically get the trees with smaller depths for free. Similar things
are true for boosting.
For these you basically need to know a maximum and / or minimum
parameter setting and compute solutions for all parameter settings at once.
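To illustrate the boosting case, here is a small self-contained sketch in plain Python (not scikit-learn code; `fit_stump`, `boost` and `staged_predict` are illustration-only helpers, and the fixed learning rate of 0.5 is an arbitrary assumption). Fitting the largest number of stages once yields the model for every smaller number of stages as a prefix. If I remember correctly, the gradient boosting estimators in scikit-learn expose the same idea through `staged_predict`.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)  # left is never empty: t is one of xs
        rmean = sum(right) / len(right) if right else 0.0
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]


def boost(xs, ys, n_stages, lr=0.5):
    """Fit `n_stages` stumps, each on the residuals of the stages before it."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((t, lmean, rmean))
        preds = [p + lr * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return stumps


def staged_predict(stumps, x, lr=0.5):
    """Yield the prediction for `x` after 1, 2, ..., len(stumps) stages."""
    pred = 0.0
    for t, lmean, rmean in stumps:
        pred += lr * (lmean if x <= t else rmean)
        yield pred


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 1, 4, 4]
full = boost(xs, ys, n_stages=5)           # one fit at the maximum setting
per_stage = list(staged_predict(full, 2))  # predictions for n_stages = 1..5
```

A grid search that treats `n_stages` like any other parameter would call `boost` five times here, whereas one call at the maximum already contains all five models.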
I feel like this is the really interesting case, which we should try to
solve.
Basically your proposal addresses cases where one doesn't need to touch
parts of the pipeline at all.
It wouldn't help us get rid of any of the CV objects, though.
Is there something interesting about StandardScaler, or have you
thrown it in for fun? Or as an example where transform is more
expensive than fit?
Just for fun ;) Basically I thought that was one that you don't really
need to refit at all (for a given fold) as you usually don't search over
any parameters.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general