[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181238#comment-14181238 ]

Gavin Brown edited comment on SPARK-1473 at 10/23/14 11:20 AM:
---------------------------------------------------------------

Hello, I am the first author of the paper being discussed.

Our paper did indeed separate the two tasks of (1) estimating the probability 
distributions, and (2) the process of selecting features once you have those 
probabilities.  So, as Sam says, yes, it is entirely possible that bad 
estimation of those probabilities could lead to bad choices of features.
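
To make that separation concrete, here is a minimal Python sketch (my own 
illustration for this thread, not the exact criteria from the paper): stage 
(1) is a plug-in estimate of mutual information from raw counts, and stage 
(2) ranks features using only what stage (1) produced, so any estimation 
error flows straight into the ranking:

from collections import Counter
from math import log

def estimate_mi(xs, ys):
    """Stage (1): plug-in mutual information I(X;Y) from counts, in nats."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed (x, y) pairs of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def rank_features(columns, labels):
    """Stage (2): rank features by their estimated relevance to the labels."""
    scores = {name: estimate_mi(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

cols = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
print(rank_features(cols, [0, 0, 1, 1]))  # ['f1', 'f2']: f1 matches the labels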

Estimating those probabilities is an unsolved problem in general; it is known 
as "entropy estimation".  Sam rightly points out that you could use Laplace 
smoothing, or other smoothing methods.  Which one will work best on an 
arbitrary dataset is unknown in general; however, many of these smoothing 
methods perform essentially identically once you have a reasonable number of 
data points per feature.  Small-sample feature selection with 
information-theoretic methods is an open problem, but one could use good 
heuristics like AIC or BIC to regularize.
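
For illustration, here is a minimal plug-in entropy estimator with optional 
Laplace-style smoothing (the parameter names alpha and support are my own for 
this sketch; support is the assumed alphabet size, which smoothing needs in 
order to give unseen symbols some mass):

from collections import Counter
from math import log

def entropy(xs, alpha=0.0, support=None):
    """Plug-in entropy estimate in nats; alpha=1.0 gives Laplace smoothing."""
    counts = Counter(xs)
    k = support if support is not None else len(counts)
    total = len(xs) + alpha * k
    probs = [(c + alpha) / total for c in counts.values()]
    probs += [alpha / total] * (k - len(counts))  # unseen symbols
    return -sum(p * log(p) for p in probs if p > 0)

# With only 3 samples from an assumed 4-symbol alphabet, the raw plug-in
# estimate is biased low; smoothing pulls it toward the uniform maximum ln 4.
print(entropy("aab", alpha=0.0, support=4))  # ~0.64
print(entropy("aab", alpha=1.0, support=4))  # ~1.28

The right alpha, like the right smoothing method, depends on the data, which 
is exactly the point above about small samples.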

In answer to the question by Sam - "Is there any literature that has attempted 
to *formalize the way we introduce independence*" ... no, our paper is the only 
one I know of.

I think of these information-theoretic filter methods as a way to *explore* 
large data. If you wanted to build the best possible classifier, with no limit 
on computation, you'd use a wrapper method around that classifier.  Filter 
methods by nature assume independence between the feature selection stage and 
the classifier building; that assumption is clearly false, but it works well 
in practice on very large datasets where unlimited compute time is not 
available.  If you want to explore data, a filter is a good heuristic, and the 
information-theoretic ones are the most theoretically grounded "heuristics" I 
know of.
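
The compute gap is easy to sketch. Below, fit_and_score is a hypothetical 
stand-in for training any classifier on a feature subset and returning a 
validation score, and rank_features is from the sketch earlier in this 
comment:

def filter_select(columns, labels, k):
    """Filter: score each feature once, keep the top k.
    No classifier is ever trained during selection."""
    return rank_features(columns, labels)[:k]

def wrapper_select(feature_names, fit_and_score):
    """Wrapper: greedy forward selection, retraining the classifier at
    every step (on the order of d^2 trainings for d features)."""
    selected, remaining = [], set(feature_names)
    best_score = fit_and_score(selected)  # baseline with no features
    while remaining:
        candidate = max(remaining,
                        key=lambda f: fit_and_score(selected + [f]))
        score = fit_and_score(selected + [candidate])
        if score <= best_score:
            break  # no single additional feature helps; stop early
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected

On tens of thousands of features, those repeated trainings are what make the 
wrapper impractical and the filter attractive.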

Happy to answer further questions or give my perspectives.  Everybody - 
thanks for the attention to the paper - very happy to see it is useful.


> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Assignee: Alexander Ulanov
>            Priority: Minor
>              Labels: features
>
> For classification tasks involving large feature spaces, on the order of tens 
> of thousands of features or more (e.g., text classification with n-grams, 
> where n > 1), it is often useful to rank and filter out irrelevant features, 
> thereby reducing the feature space by at least one or two orders of magnitude 
> without impacting performance on key evaluation metrics 
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain being a priority as 
> it has been shown to be among the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below), which are more practical for 
> lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
> likelihood maximisation: a unifying framework for information theoretic 
> feature selection. *The Journal of Machine Learning Research*, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection 
> metrics for text classification. *The Journal of Machine Learning Research*, 
> 3, 1289-1305.


