[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

Alexander Ulanov (JIRA) Tue, 11 Oct 2016 13:21:42 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566467#comment-15566467
 ]


Alexander Ulanov commented on SPARK-17870:
------------------------------------------

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
 works with "a Function taking two arrays X and y, and returning a pair of 
arrays (scores, pvalues) or a single array with scores". According to what you 
observe, it uses pvalues for sorting of `chi2` outputs. Indeed, it is the case 
for all functions that return two arrays: 
https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331.
 Alternative, one case use raw `chi2` scores for sorting. She need to pass only 
the first array from `chi2` to `SelectKBest`. As far as I remember, using raw 
chi2 scores is default in Weka's 
[ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html).
 So, I would not claim that either of approaches is incorrect. According to 
[Introduction to 
IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html),
 there might be an issue with computing p-values because then chi2-test is used 
multiple times. Using plain chi2 values does not involve statistical test, so 
it might be treated as just some ranking with no statistical implications.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>
> The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala  
> (line 233) is wrong.
> For feature selection method ChiSquareSelector, it is based on the 
> ChiSquareTestResult.statistic (ChiSqure value) to select the features. It 
> select the features with the largest ChiSqure value. But the Degree of 
> Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and 
> for different df, you cannot base on ChiSqure value to select features.
> Because of the wrong method to count ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If use selectKBest to select: the feature 3 will be selected.
> If use selectFpr to select: feature 1 and 2 will be selected. 
> This is strange. 
> I use scikit learn to test the same data with the same parameters. 
> When use selectKBest to select: feature 1 will be selected. 
> When use selectFpr to select: feature 1 and 2 will be selected. 
> This result is make sense. because the df of each feature in scikit learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

Reply via email to