[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-10-23 Thread Gavin Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181238#comment-14181238
 ] 

Gavin Brown edited comment on SPARK-1473 at 10/23/14 11:20 AM:
---

Hello, I am the first author of the paper being discussed.

Our paper did indeed separate the two tasks of (1) estimating the probability 
distributions, and (2) the process/dynamics of selecting features once you have 
those probabilities.  So as Sam says, yes it is entirely possible that bad 
estimation of those probabilities could lead to bad choices of features.

The task of estimating those probabilities, known as entropy estimation, is an 
unsolved problem in general.  Sam rightly points out that you could use Laplace 
or other smoothing methods.  Which one will work best on an arbitrary dataset 
is unknown in general; however, many of these smoothing methods perform 
essentially identically once you have a reasonable number of datapoints per 
feature.  Small-sample feature selection using information theoretic methods is 
an open problem - but one could use good heuristics like AIC or BIC to 
regularize.
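
For concreteness, here is a minimal sketch (in Scala, since this is an MLlib 
ticket) of an add-alpha (Laplace) smoothed entropy estimate for a discrete 
feature. The object name, helper names, and default alpha are illustrative 
only - this is not code from the paper or from any pull request.

{code}
// Minimal sketch: add-alpha (Laplace) smoothed entropy estimate for a
// discrete feature. Names and the default alpha are illustrative only.
object SmoothedEntropy {
  /** Entropy in nats of a discrete sample taking values in 0 until numBins,
    * with add-alpha smoothing so unseen bins get nonzero probability. */
  def estimate(xs: Seq[Int], numBins: Int, alpha: Double = 1.0): Double = {
    val counts = xs.groupBy(identity).map { case (v, occ) => v -> occ.size.toDouble }
    val denom = xs.size + alpha * numBins
    (0 until numBins).map { b =>
      val p = (counts.getOrElse(b, 0.0) + alpha) / denom
      -p * math.log(p)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // Small sample over 4 possible values; bin 3 is never observed
    // but still receives smoothed probability mass.
    println(estimate(Seq(0, 0, 1, 2, 2, 2), numBins = 4))
  }
}
{code}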

In answer to Sam's question - is there any literature that has attempted to 
*formalize the way we introduce independence*? No, our paper is the only one I 
know of.

I think of these information theoretic filter methods as a way to *explore* 
large data. If you wanted to build the best possible classifier, with no limit 
on computation, you'd use a wrapper method around it.  Filter methods by nature 
assume independence between the feature selection stage and the classifier 
building - an assumption that is clearly false, but one that works well in 
practice on very large datasets where compute time is limited.  If you want to 
explore data, a filter is a good heuristic - and the information theoretic ones 
are the most theoretically grounded heuristics I know of.
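
To illustrate the filter idea, here is a short Scala sketch (assuming discrete 
features; all names are illustrative, not the paper's code) that scores each 
feature by its empirical mutual information with the label and keeps the top k, 
leaving classifier training as an entirely separate step:

{code}
// Sketch of an information theoretic filter: rank discrete features by
// empirical mutual information with the label, keep the top k. The
// classifier is then trained separately on the reduced feature set.
object MutualInfoFilter {
  // I(X;Y) = sum over (x,y) of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
  def mutualInformation(xs: Seq[Int], ys: Seq[Int]): Double = {
    val n = xs.size.toDouble
    val pxy = xs.zip(ys).groupBy(identity).map { case (k, v) => k -> v.size / n }
    val px  = xs.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val py  = ys.groupBy(identity).map { case (k, v) => k -> v.size / n }
    pxy.map { case ((x, y), p) => p * math.log(p / (px(x) * py(y))) }.sum
  }

  /** Indices of the k features scoring highest against the labels;
    * `columns(i)` holds the values of feature i across all datapoints. */
  def selectTopK(columns: Seq[Seq[Int]], labels: Seq[Int], k: Int): Seq[Int] =
    columns.indices.sortBy(i => -mutualInformation(columns(i), labels)).take(k)
}
{code}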

Happy to answer further questions or give my perspectives. Everybody - thanks 
for the attention to the paper - very happy to see it is useful.



 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces on the order of tens 
 of thousands of features or more (e.g., text classification with n-grams, where 
 n > 1), it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at least two 
 methods should be implemented, with Information Gain being a priority as it 
 has been shown to be amongst the most reliable.
[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov edited comment on SPARK-1473 at 8/8/14 8:27 AM:
-

I've implemented Chi-Squared and added a pull request 
https://github.com/apache/spark/pull/1484
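
For reference, the statistic in question is Pearson's chi-squared test of 
independence between a discrete feature and the class label. Below is a 
minimal self-contained Scala sketch of that statistic computed from a 
contingency table - illustrative only, and not the code in the pull request:

{code}
// Illustrative sketch of Pearson's chi-squared statistic for one discrete
// feature against the class label - not the code from the pull request.
object ChiSquaredSketch {
  /** Sum over all (feature value, label) cells of (O - E)^2 / E,
    * where E = rowTotal * colTotal / n under independence. */
  def chiSquared(feature: Seq[Int], labels: Seq[Int]): Double = {
    val n = feature.size.toDouble
    val observed  = feature.zip(labels).groupBy(identity)
                      .map { case (cell, occ) => cell -> occ.size.toDouble }
    val rowTotals = feature.groupBy(identity).map { case (f, occ) => f -> occ.size.toDouble }
    val colTotals = labels.groupBy(identity).map { case (l, occ) => l -> occ.size.toDouble }
    (for ((f, rt) <- rowTotals; (l, ct) <- colTotals) yield {
      val e = rt * ct / n
      val o = observed.getOrElse((f, l), 0.0)
      (o - e) * (o - e) / e
    }).sum
  }
}
{code}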



 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces on the order of tens 
 of thousands of features or more (e.g., text classification with n-grams, where 
 n > 1), it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at least two 
 methods should be implemented, with Information Gain being a priority as it 
 has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below), which are more practical for 
 lower-dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
 likelihood maximisation: a unifying framework for information theoretic 
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, G. (2003). An extensive empirical study of feature selection metrics 
 for text classification. *The Journal of Machine Learning Research*, *3*, 
 1289-1305.
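
To make the quoted request for a flexible feature evaluation interface 
concrete, here is one hypothetical shape such an interface could take in 
Scala. The trait and method names are invented for illustration and do not 
reflect the actual MLlib design:

{code}
// Hypothetical sketch of a pluggable feature-evaluation interface; names
// are invented for illustration and do not reflect the actual MLlib API.
trait FeatureScorer {
  /** Higher score means the feature is judged more relevant to the label. */
  def score(feature: Seq[Int], labels: Seq[Int]): Double
}

// Any ranking criterion (information gain, chi-squared, ...) plugs in here.
class TopKSelector(scorer: FeatureScorer, k: Int) {
  /** Indices of the k highest-scoring feature columns. */
  def select(columns: Seq[Seq[Int]], labels: Seq[Int]): Seq[Int] =
    columns.indices.sortBy(i => -scorer.score(columns(i), labels)).take(k)
}
{code}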



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org