[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168590#comment-14168590 ]
sam commented on SPARK-1473:
----------------------------

[~torito1984] Thank you for the response, and apologies for my delay in responding. Yes, the difficulty of estimating probabilities when no independence assumptions are made does indeed make it necessary to consider some features independent. My question is: *how* should we do this? Is there any literature that has attempted to **formalize the way we introduce independence** in *information theoretic* terms?

Moreover, I see this problem, and feature selection in general, as tightly coupled with the way probability estimation is performed. Suppose in the simplest case we wish to decide whether features F_1 and F_2 are dependent (we could consider arbitrary conjunctions too). The information theorist would consider the Mutual Information, i.e. the KL divergence between the joint and the product of marginals:

KL( p(F_1, F_2) || p(F_1) * p(F_2) )

and then use a threshold, or a ranking over feature pairs, to decide whether to treat them as dependent.

This is where we are tightly coupled to the means by which we estimate the probabilities p(F_1, F_2), p(F_1) and p(F_2). We could use Maximum Likelihood with Laplace smoothing, MAP / regularization, etc., or the much lesser known Carnap's Continuum of Inductive Methods. Which method we choose, along with the usual arbitrary choice of some constant (e.g. alpha in Laplace/additive smoothing), determines p(F_1, F_2), p(F_1) and p(F_2), and therefore determines whether or not F_1 and F_2 are to be considered dependent. (Two small illustrative sketches are appended after the quoted issue description below: one of this estimation/independence coupling, and one of the Information Gain ranking the ticket proposes.) Current practice in Machine Learning has been to choose the estimation method based on cross-validation results rather than on some deep philosophical justification. The work of Prof. Jeff Paris and his colleagues is the only work I've seen that attempts to use information theoretic principles to estimate probabilities; unfortunately it is a little incomplete with regard to practical application.

To summarize: although I like the paper, especially its principled approach (versus the "just test and see" approach commonly seen in Data Science), how independence is to be assumed (to solve the exponential sparsity problem) is left arbitrary, and so is the choice of probability estimation, so the approach is not fully principled nor fully foundational. Please do not interpret this comment as a rejection of, or attack on, the paper; rather, I consider it a little incomplete and was hoping someone may have found a line of research more successful than my own to fill in the gaps.

> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens
> of thousands of features or higher (e.g., text classification with n-grams, where n > 1),
> it is often useful to rank and filter out irrelevant features, thereby reducing
> the feature space by at least one or two orders of magnitude without impacting
> performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least two
> methods should be implemented, with Information Gain being a priority as it has
> been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper
> methods (see the research papers below), which are more practical for lower
> dimensional data.
>
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
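To make the estimation/independence coupling concrete, here is a minimal sketch in plain Scala (no Spark dependency). Everything in it is illustrative: the counts, the alpha values and the 0.01 threshold are made up for this comment, not taken from the paper. It estimates p(F_1, F_2), p(F_1) and p(F_2) by additive (Laplace) smoothing and computes the mutual information KL( p(F_1, F_2) || p(F_1) * p(F_2) ); with these toy counts the dependent/independent verdict changes as alpha alone is varied.

{code:scala}
// Minimal sketch: MI between two binary features, with the probabilities
// estimated by additive (Laplace) smoothing. All numbers are toy/illustrative.
object MutualInfoSketch {

  /** counts(i)(j) = number of samples with F1 = i and F2 = j (i, j in {0, 1}). */
  def mutualInformation(counts: Array[Array[Double]], alpha: Double): Double = {
    val n = counts.map(_.sum).sum
    val k = 4.0 // number of cells in the joint table of two binary features
    // Additive smoothing of the joint distribution.
    val joint = counts.map(_.map(c => (c + alpha) / (n + alpha * k)))
    // Marginals derived from the smoothed joint.
    val p1 = joint.map(_.sum)                    // p(F1 = i)
    val p2 = Array(joint(0)(0) + joint(1)(0),    // p(F2 = 0)
                   joint(0)(1) + joint(1)(1))    // p(F2 = 1)
    // KL( joint || product of marginals ), in nats.
    (for (i <- 0 to 1; j <- 0 to 1)
      yield joint(i)(j) * math.log(joint(i)(j) / (p1(i) * p2(j)))).sum
  }

  def main(args: Array[String]): Unit = {
    // Sparse toy data: 10 samples, F1 = 1 and F2 = 1 never observed together.
    val counts = Array(Array(8.0, 1.0), Array(1.0, 0.0))
    val threshold = 0.01 // arbitrary cut-off for calling the pair "dependent"
    for (alpha <- Seq(0.01, 1.0, 10.0)) {
      val mi = mutualInformation(counts, alpha)
      println(f"alpha = $alpha%5.2f  MI = $mi%.4f  dependent? ${mi > threshold}")
    }
  }
}
{code}

Nothing in the data changes between the three runs, only alpha, yet the thresholded decision differs. That is exactly the arbitrariness I am pointing at: the "independence structure" we end up with is partly an artifact of the estimator and its smoothing constant.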
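The ticket itself asks for ranking and filtering by Information Gain. A second, equally rough sketch of that step follows (again plain Scala; the object and method names and the toy count tables are mine, hypothetical, and not a proposal for the MLlib interface this issue is about). It scores each binary feature by IG(F) = H(C) - H(C|F) against a binary class label C, estimated directly from counts, and keeps the top k features.

{code:scala}
// Minimal sketch: rank binary features by Information Gain with the class label
// and keep the top k. Names and numbers are hypothetical, for illustration only.
object InfoGainFilterSketch {

  private def entropy(p: Seq[Double]): Double =
    -p.filter(_ > 0).map(x => x * math.log(x)).sum

  /** counts(i)(j) = number of samples with F = i and class C = j (i, j in {0, 1}). */
  def infoGain(counts: Array[Array[Double]]): Double = {
    val n = counts.map(_.sum).sum
    val pC = Seq(counts(0)(0) + counts(1)(0), counts(0)(1) + counts(1)(1)).map(_ / n)
    // H(C|F) = sum_i p(F = i) * H(C | F = i)
    val hCGivenF = counts.map { row =>
      val rowSum = row.sum
      if (rowSum == 0) 0.0
      else (rowSum / n) * entropy(row.map(_ / rowSum).toSeq)
    }.sum
    entropy(pC) - hCGivenF
  }

  /** Rank features by information gain and return the indices of the top k. */
  def selectTopK(featureLabelCounts: Seq[Array[Array[Double]]], k: Int): Seq[Int] =
    featureLabelCounts.zipWithIndex
      .sortBy { case (c, _) => -infoGain(c) }
      .take(k)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    // One toy 2x2 count table per feature.
    val tables = Seq(
      Array(Array(45.0, 5.0), Array(5.0, 45.0)),   // strongly predictive
      Array(Array(25.0, 25.0), Array(25.0, 25.0)), // uninformative
      Array(Array(35.0, 15.0), Array(15.0, 35.0))  // moderately predictive
    )
    println(s"kept feature indices: ${selectTopK(tables, k = 2)}")
  }
}
{code}

The same questions from my comment apply here too: whether we smooth these counts, and how, changes the scores and therefore which features survive the filter.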