[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122386#comment-14122386 ]

sam commented on SPARK-1473:
----------------------------

Good paper; the theory is very solid. My only concern is that the paper does 
not explicitly tackle the problem of probability estimation in high 
dimensions, which is even harder for sparse data. It only touches on 
the problem, saying:

"This in turn causes increasingly poor judgements for the in- clusion/exclusion 
of features. For precisely this reason, the research community have developed 
various low-dimensional approximations to (9). In the following sections, we 
will investigate the implicit statistical assumptions and empirical effects of 
these approximations"

Those sections do not go into theoretical detail, so I disagree that the paper 
provides a "single unified information theoretic framework for feature 
selection": it essentially leaves the problem of probability estimation to the 
reader's choice, and merely suggests the reader assume some level of 
independence between features in order to implement an algorithm.
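
To make that concern concrete, here is a minimal sketch (mine, not taken from 
the paper or from MLlib) of the kind of low-order quantity those approximations 
fall back on: the mutual information I(X;Y) between a single binary feature and 
the label, estimated from empirical counts. The full joint over d binary 
features would need on the order of 2^d probability cells, which sparse 
high-dimensional data cannot populate, so the approximations effectively assume 
some independence between features and work with these per-feature (or 
pairwise) estimates instead.

{code:scala}
// Sketch only: plug-in estimate of I(X; Y) for one binary feature X and a
// binary label Y from empirical counts. The joint over d binary features
// would need ~2^d cells, which is why sparse data forces low-order
// approximations that treat features (near-)independently.
object MiSketch {
  def mutualInformation(samples: Seq[(Int, Int)]): Double = {
    val n = samples.size.toDouble
    // empirical joint and marginal probabilities
    val pxy = samples.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val px  = samples.groupBy(_._1).map { case (k, v) => k -> v.size / n }
    val py  = samples.groupBy(_._2).map { case (k, v) => k -> v.size / n }
    pxy.map { case ((x, y), p) =>
      p * math.log(p / (px(x) * py(y))) / math.log(2) // bits
    }.sum
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1))
    println(f"I(X;Y) = ${mutualInformation(data)}%.4f bits")
  }
}
{code}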

 [~dmborque] Do you know of any literature that does approach the problem of 
probability estimation in an information-theoretic and philosophically 
justified way?

Anyway, despite my concerns, this paper is still by far the best treatment of 
feature selection I have seen.

> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Assignee: Alexander Ulanov
>            Priority: Minor
>              Labels: features
>
> For classification tasks involving large feature spaces on the order of tens 
> of thousands of features or more (e.g., text classification with n-grams, 
> where n > 1), it is often useful to rank and filter out irrelevant features, 
> thereby reducing the feature space by at least one or two orders of magnitude 
> without impacting performance on key evaluation metrics 
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain being a priority as 
> it has been shown to be among the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below), which are more practical for 
> lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M.-J., & Luján, M. (2012). Conditional 
> likelihood maximisation: a unifying framework for information theoretic 
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection 
> metrics for text classification. The Journal of Machine Learning Research, 
> 3, 1289-1305.
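
For reference, a rough sketch of the filter-style Information Gain ranking the 
ticket describes: score each binary (term present/absent) feature by 
IG(Y; X) = H(Y) - H(Y | X) and keep the top k. The names and signatures below 
are illustrative only; they are not the MLlib interface this issue asks for.

{code:scala}
// Illustrative only: rank features by Information Gain with the class label.
object InfoGainRanking {
  private def entropy(counts: Iterable[Int]): Double = {
    val n = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  /** labels(i) is the class of document i; features(i) is its set of active feature indices. */
  def topFeatures(labels: Seq[Int], features: Seq[Set[Int]], k: Int): Seq[(Int, Double)] = {
    val n = labels.size
    val hy = entropy(labels.groupBy(identity).map(_._2.size))
    features.flatten.distinct.map { f =>
      // split documents by presence/absence of feature f and compute H(Y | X_f)
      val (present, absent) = labels.zip(features).partition(_._2.contains(f))
      val hCond = Seq(present, absent).map { part =>
        if (part.isEmpty) 0.0
        else part.size.toDouble / n * entropy(part.groupBy(_._1).map(_._2.size))
      }.sum
      f -> (hy - hCond) // information gain of feature f
    }.sortBy(-_._2).take(k)
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq(1, 1, 0, 0)
    val docs   = Seq(Set(0, 2), Set(0), Set(1, 2), Set(1))
    println(topFeatures(labels, docs, 2)) // features 0 and 1 perfectly predict the label
  }
}
{code}

A wrapper method would instead train the actual classifier on candidate feature 
subsets and score them on held-out data, which is why, as the description 
notes, wrappers are only practical for lower-dimensional data.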



