[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-10-23 Thread Gavin Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181238#comment-14181238
 ] 

Gavin Brown edited comment on SPARK-1473 at 10/23/14 11:20 AM:
---

Hello, I am the first author of the paper being discussed.

Our paper did indeed separate the two tasks of (1) estimating the probability 
distributions, and (2) the process/dynamics of selecting features once you have 
those probabilities.  So as Sam says, yes it is entirely possible that bad 
estimation of those probabilities could lead to bad choices of features.

The task of estimating those probabilities, known as entropy estimation, is an 
unsolved problem in general.  Sam rightly points out that you could use Laplace 
or other smoothing methods.  Which one will work best on an arbitrary dataset 
is unknown in general; however, many of these smoothing methods perform 
essentially identically once you have a reasonable number of datapoints per 
feature.  Small-sample feature selection using information theoretic methods is 
an open problem - but one could use good heuristics like AIC or BIC to 
regularize.
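
For concreteness, here is a minimal sketch (in Scala, since this is an MLlib 
ticket) of an add-alpha (Laplace) smoothed entropy estimate for a discrete 
feature. The object name, helper names, and default alpha are illustrative 
only - this is not code from the paper or from any pull request.

{code}
// Minimal sketch: add-alpha (Laplace) smoothed entropy estimate for a
// discrete feature. Names and the default alpha are illustrative only.
object SmoothedEntropy {
  /** Entropy in nats of a discrete sample taking values in 0 until numBins,
    * with add-alpha smoothing so unseen bins get nonzero probability. */
  def estimate(xs: Seq[Int], numBins: Int, alpha: Double = 1.0): Double = {
    val counts = xs.groupBy(identity).map { case (v, occ) => v -> occ.size.toDouble }
    val denom = xs.size + alpha * numBins
    (0 until numBins).map { b =>
      val p = (counts.getOrElse(b, 0.0) + alpha) / denom
      -p * math.log(p)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // Small sample over 4 possible values; bin 3 is never observed
    // but still receives smoothed probability mass.
    println(estimate(Seq(0, 0, 1, 2, 2, 2), numBins = 4))
  }
}
{code}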

In answer to Sam's question - is there any literature that has attempted to 
*formalize the way we introduce independence*? No, our paper is the only one I 
know of.

I think of these information theoretic filter methods as a way to *explore* 
large data. If you wanted to build the best possible classifier, with no limit 
on computation, you'd use a wrapper method around it.  Filter methods by nature 
assume independence between the feature selection stage and the classifier 
building - an assumption that is clearly false, but one that works well in 
practice on very large datasets where compute time is limited.  If you want to 
explore data, a filter is a good heuristic - and the information theoretic ones 
are the most theoretically grounded heuristics I know of.
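
To illustrate the filter idea, here is a short Scala sketch (assuming discrete 
features; all names are illustrative, not the paper's code) that scores each 
feature by its empirical mutual information with the label and keeps the top k, 
leaving classifier training as an entirely separate step:

{code}
// Sketch of an information theoretic filter: rank discrete features by
// empirical mutual information with the label, keep the top k. The
// classifier is then trained separately on the reduced feature set.
object MutualInfoFilter {
  // I(X;Y) = sum over (x,y) of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
  def mutualInformation(xs: Seq[Int], ys: Seq[Int]): Double = {
    val n = xs.size.toDouble
    val pxy = xs.zip(ys).groupBy(identity).map { case (k, v) => k -> v.size / n }
    val px  = xs.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val py  = ys.groupBy(identity).map { case (k, v) => k -> v.size / n }
    pxy.map { case ((x, y), p) => p * math.log(p / (px(x) * py(y))) }.sum
  }

  /** Indices of the k features scoring highest against the labels;
    * `columns(i)` holds the values of feature i across all datapoints. */
  def selectTopK(columns: Seq[Seq[Int]], labels: Seq[Int], k: Int): Seq[Int] =
    columns.indices.sortBy(i => -mutualInformation(columns(i), labels)).take(k)
}
{code}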

Happy to answer further questions or give my perspectives. Everybody - thanks 
for the attention to the paper - very happy to see it is useful.



 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces on the order of tens 
 of thousands of features or more (e.g., text classification with n-grams, where 
 n > 1), it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at least two 
 methods should be implemented, with Information Gain being a priority as it 
 has been shown to be amongst the most reliable.
[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov edited comment on SPARK-1473 at 8/8/14 8:27 AM:
-

I've implemented Chi-Squared and added a pull request 
https://github.com/apache/spark/pull/1484
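
For reference, the statistic in question is Pearson's chi-squared test of 
independence between a discrete feature and the class label. Below is a 
minimal self-contained Scala sketch of that statistic computed from a 
contingency table - illustrative only, and not the code in the pull request:

{code}
// Illustrative sketch of Pearson's chi-squared statistic for one discrete
// feature against the class label - not the code from the pull request.
object ChiSquaredSketch {
  /** Sum over all (feature value, label) cells of (O - E)^2 / E,
    * where E = rowTotal * colTotal / n under independence. */
  def chiSquared(feature: Seq[Int], labels: Seq[Int]): Double = {
    val n = feature.size.toDouble
    val observed  = feature.zip(labels).groupBy(identity)
                      .map { case (cell, occ) => cell -> occ.size.toDouble }
    val rowTotals = feature.groupBy(identity).map { case (f, occ) => f -> occ.size.toDouble }
    val colTotals = labels.groupBy(identity).map { case (l, occ) => l -> occ.size.toDouble }
    (for ((f, rt) <- rowTotals; (l, ct) <- colTotals) yield {
      val e = rt * ct / n
      val o = observed.getOrElse((f, l), 0.0)
      (o - e) * (o - e) / e
    }).sum
  }
}
{code}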



 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces on the order of tens 
 of thousands of features or more (e.g., text classification with n-grams, where 
 n > 1), it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at least two 
 methods should be implemented, with Information Gain being a priority as it 
 has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below), which are more practical for 
 lower-dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
 likelihood maximisation: a unifying framework for information theoretic 
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, G. (2003). An extensive empirical study of feature selection metrics 
 for text classification. *The Journal of Machine Learning Research*, *3*, 
 1289-1305.
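
To make the quoted request for a flexible feature evaluation interface 
concrete, here is one hypothetical shape such an interface could take in 
Scala. The trait and method names are invented for illustration and do not 
reflect the actual MLlib design:

{code}
// Hypothetical sketch of a pluggable feature-evaluation interface; names
// are invented for illustration and do not reflect the actual MLlib API.
trait FeatureScorer {
  /** Higher score means the feature is judged more relevant to the label. */
  def score(feature: Seq[Int], labels: Seq[Int]): Double
}

// Any ranking criterion (information gain, chi-squared, ...) plugs in here.
class TopKSelector(scorer: FeatureScorer, k: Int) {
  /** Indices of the k highest-scoring feature columns. */
  def select(columns: Seq[Seq[Int]], labels: Seq[Int]): Seq[Int] =
    columns.indices.sortBy(i => -scorer.score(columns(i), labels)).take(k)
}
{code}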



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org