[ 
https://issues.apache.org/jira/browse/SPARK-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6310.
------------------------------------
    Resolution: Duplicate

> ChiSqTest should check for too few counts
> -----------------------------------------
>
>                 Key: SPARK-6310
>                 URL: https://issues.apache.org/jira/browse/SPARK-6310
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough 
> (have enough counts) s.t. the central limit theorem kicks in.  It would be 
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of 
> instances being used (or counts in the contingency table entries, to be more 
> precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is 
> insignificant



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to