[
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181238#comment-14181238
]
Gavin Brown edited comment on SPARK-1473 at 10/23/14 11:20 AM:
---
Hello, I am the first author of the paper being discussed.
Our paper did indeed separate the two tasks of (1) estimating the probability
distributions, and (2) the process/dynamics of selecting features once you have
those probabilities. So as Sam says, yes it is entirely possible that bad
estimation of those probabilities could lead to bad choices of features.
The task of estimating those probabilities is an unsolved problem in general,
but is known as "entropy estimation". Sam rightly points out that you could
use Laplace or other smoothing methods. Which one will work best on an
arbitrary dataset is unknown in general; however, many of these smoothing
methods perform near-identically once you have a reasonable number of data
points per feature. Small-sample feature selection with information-theoretic
methods is an open problem - but one could use good heuristics like AIC or BIC
to regularize.
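To make the smoothing step concrete, here is a minimal, hypothetical Scala sketch (not from the paper and not MLlib code - all names are made up for illustration) that estimates the mutual information I(X;Y) between a discrete feature and a class label from add-alpha (Laplace) smoothed counts:
{code:scala}
// Hypothetical sketch: Laplace-smoothed estimate of mutual information I(X; Y)
// between a discrete feature x and a class label y, from paired samples.
object SmoothedMI {
  def mutualInformation(xs: Seq[Int], ys: Seq[Int], alpha: Double = 1.0): Double = {
    require(xs.length == ys.length && xs.nonEmpty)
    val n = xs.length.toDouble
    val xVals = xs.distinct
    val yVals = ys.distinct
    // Joint distribution with add-alpha smoothing over every (x, y) cell.
    val pxy = (for (x <- xVals; y <- yVals) yield {
      val c = xs.zip(ys).count { case (a, b) => a == x && b == y }
      ((x, y), (c + alpha) / (n + alpha * xVals.size * yVals.size))
    }).toMap
    // Marginals derived from the smoothed joint, so everything stays consistent.
    val px = xVals.map(x => x -> yVals.map(y => pxy((x, y))).sum).toMap
    val py = yVals.map(y => y -> xVals.map(x => pxy((x, y))).sum).toMap
    // I(X;Y) = sum p(x,y) log [ p(x,y) / (p(x) p(y)) ], reported in bits.
    pxy.map { case ((x, y), p) => p * math.log(p / (px(x) * py(y))) }.sum / math.log(2)
  }

  def main(args: Array[String]): Unit = {
    val feature = Seq(0, 0, 1, 1, 0, 1, 1, 0)
    val label   = Seq(0, 0, 1, 1, 0, 1, 0, 0)
    println(f"I(X;Y) ~ ${mutualInformation(feature, label)}%.4f bits")
  }
}
{code}
With a tiny sample like the one above, the choice of alpha visibly changes the estimate; with many data points per cell the smoothed and unsmoothed estimates converge, which is the point made above.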
In answer to the question by Sam - "Is there any literature that has attempted
to *formalize the way we introduce independence*" ... no, our paper is the only
one I know of.
I think of these information-theoretic filter methods as a way to *explore*
large data. If you wanted to build the best possible classifier, with no limit
on computation, you'd use a wrapper method around it. Filter methods by nature
assume independence between the feature-selection stage and the classifier
building; that assumption is clearly false, but it works well in practice on
very large datasets where compute time is limited. If you want to explore data,
a filter is a good heuristic - and the information-theoretic ones are the most
theoretically grounded "heuristics" I know of.
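As a concrete (and again hypothetical, non-MLlib) illustration of the filter idea, the sketch below ranks features purely by their smoothed mutual information with the label and keeps the top k, without ever consulting the classifier that will be trained afterwards. `SmoothedMI.mutualInformation` refers to the sketch above.
{code:scala}
// Hypothetical filter-style ranking: score each column of a discrete data
// matrix by its (smoothed) mutual information with the label, independently
// of whatever classifier is trained later, and keep the top k feature indices.
object MIFilter {
  def topK(data: Seq[Array[Int]], labels: Seq[Int], k: Int): Seq[Int] = {
    val numFeatures = data.head.length
    val scored = (0 until numFeatures).map { j =>
      val column = data.map(_(j))
      j -> SmoothedMI.mutualInformation(column, labels)
    }
    scored.sortBy { case (_, score) => -score }.take(k).map(_._1)
  }
}
{code}
This is only the simplest criterion (score each feature in isolation); the criteria unified in the paper additionally account for redundancy and complementarity with already-selected features, but the separation between the filter stage and the classifier building is the same.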
Happy to answer further questions or give my perspectives. Everybody -
thanks for the attention to the paper - very happy to see it is useful.
> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens
> of thousands of features or more (e.g., text classification with n-grams, where n > 1),
> it is often useful to rank and filter out irrelevant features, thereby
> reducing the feature space by at least one or two orders of magnitude without
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at
> least two methods should be