On Sat, May 18, 2013 at 10:44 PM, Lars Buitinck <[email protected]> wrote:
> 2013/5/18 Joel Nothman <[email protected]>:
> >> I think that this is software sophistication that makes it harder to use
> >> for people who are not used to complex software constructs (think the
> >> matlab 101 user), and for this reason I am -1.
>
> Agree...
>
> > So you'd +1 the transform_threshold object parameter?
>
> There's one issue with this, which is that grid searching
> transform_threshold would re-train the estimator many times in the
> loop to change a parameter that does not actually affect fit.
>
This is a more general problem in the current search implementation. But at
least such a parameter would suffice for Pipeline and FeatureUnion.
> As regards methods vs. meta-estimators, I'm not too fond of extra
> methods on classifiers to overload them with feature selection. If we
> use a meta-estimator, then we can add an option to make it "forget"
> the underlying estimator and keep only the mask.
Or perhaps the importances so that the mask can be calculated for different
thresholds, k-bests, etc.
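To illustrate keeping the importances rather than only a mask, here is a minimal sketch assuming NumPy. The `importances` vector and both helper names are made up for illustration; they stand in for whatever a once-fitted estimator exposes and are not an existing scikit-learn API.

```python
import numpy as np

# Hypothetical per-feature importances, e.g. derived from coef_ or
# feature_importances_ of an estimator that was fit exactly once.
importances = np.array([0.02, 0.40, 0.15, 0.00, 0.43])


def threshold_mask(importances, threshold):
    """Keep features whose importance is at least `threshold`."""
    return importances >= threshold


def k_best_mask(importances, k):
    """Keep the k features with the largest importances."""
    mask = np.zeros(importances.shape, dtype=bool)
    mask[np.argsort(importances)[::-1][:k]] = True
    return mask


# Several selections derived from the same stored vector, with no refit.
masks = {t: threshold_mask(importances, t) for t in (0.01, 0.1, 0.3)}
top2 = k_best_mask(importances, 2)

# Applying a mask is just column indexing.
X = np.arange(15.0).reshape(3, 5)
X_reduced = X[:, top2]
```

If the importances are a single value per feature, storing them costs n_features floats instead of n_features booleans, but the n_classes factor is still gone.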
> I'm currently doing
> this manually with linear SVMs, because in multiclass classification,
> the coef_ is n_classes × n_features × sizeof(np.float64), while
> n_features × sizeof(np.bool) suffices to do feature selection. At 3e6
> features × 6 classes, this greatly reduces the stored model's size.
>
> (Also, is it an idea to extend SelectKBest and SelectPercentile to
> work with estimators that have feature_importances_?)
>
One of the problems with extending SelectKBest and SelectPercentile to new
cases is that they expect a score_func that returns both scores and
pvalues. Otherwise your suggestion just involves exposing a
get_feature_importances method...?

I have implemented a more generic function mask_by_score (in
https://github.com/jnothman/scikit-learn/commits/feat_sel_overhaul) that
takes the following arguments (and others I'm less certain about):

* scores
* minimum bound
* maximum bound
* limit (a maximum number or fraction of features)

and have reimplemented Select* and the mixin currently under discussion in
terms of it (and also a SelectByScore wrapper).

One then also needs some no-op selectors, like SelectByIndices and
SelectByMask that wrap transformers around support masks extracted by
whatever means.
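
As a rough illustration of how the four arguments listed above could combine, here is a sketch assuming NumPy. It is not the code in the branch linked above: the keyword names `minimum`, `maximum` and `limit`, the inclusive bounds, and the int-versus-float handling of `limit` are all assumptions made for the example.

```python
import numpy as np


def mask_by_score(scores, minimum=None, maximum=None, limit=None):
    """Return a boolean support mask over features (illustrative sketch).

    minimum / maximum: inclusive bounds on the score, None for no bound.
    limit: an int caps the number of kept features; a float is read as
        a fraction of all features. The highest scores are kept.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.ones(scores.shape, dtype=bool)
    if minimum is not None:
        mask &= scores >= minimum
    if maximum is not None:
        mask &= scores <= maximum
    if limit is not None:
        if isinstance(limit, float):
            # Assumption: a float limit means a fraction of all features.
            limit = int(limit * scores.size)
        candidates = np.flatnonzero(mask)
        if candidates.size > limit:
            # Keep only the best-scoring candidates; mergesort is stable,
            # so ties are broken by feature order.
            order = np.argsort(-scores[candidates], kind="mergesort")
            mask = np.zeros(scores.shape, dtype=bool)
            mask[candidates[order[:limit]]] = True
    return mask


scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
by_threshold = mask_by_score(scores, minimum=0.3)   # threshold-style
by_k = mask_by_score(scores, limit=2)               # k-best-style
by_fraction = mask_by_score(scores, limit=0.4)      # percentile-style
```

With one function like this, threshold, k-best and percentile selection differ only in which arguments are passed.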
- Joel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general