[ 
https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12026:
------------------------------------

    Assignee: Apache Spark

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>            Assignee: Apache Spark
>              Labels: mllib, stats
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My 
> understanding is that internally it creates jobs to run on batches of 1000 
> features at a time.
> I was under the impression that the features are treated as independent, but 
> this does not appear to be the case.  When the number of features is large 
> (160k in my case), each batch gets slower and slower.  As an example, running 
> on 25 m3.2xlarge instances on Amazon EMR, it started at just over 1 minute 
> per batch.  By the end, each batch was taking over 30 minutes.
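
For illustration only, here is a plain-Python sketch of the batching the
reporter describes. It is not Spark's implementation: the helper name
feature_batches is hypothetical, the batch size of 1000 comes from the
reporter's stated understanding, and "160k" is assumed to mean 160,000
features. Under those assumptions there are 160 batches, so if batches
were truly independent and each took about one minute, the whole run
would take roughly 160 minutes.

```python
def feature_batches(num_features, batch_size=1000):
    """Yield (start, end) index ranges that cover num_features in fixed-size batches.

    Hypothetical helper for illustration; end is exclusive.
    """
    for start in range(0, num_features, batch_size):
        yield start, min(start + batch_size, num_features)


# Assumption: "160k" features means 160,000.
batches = list(feature_batches(160_000))
num_batches = len(batches)

# Baseline if every batch cost the same as the first (about 1 minute each,
# per the report). This is an idealized figure, not a measurement.
baseline_minutes = num_batches * 1

print(num_batches, baseline_minutes)
```

The growth from about 1 minute to over 30 minutes per batch is what
departs from this constant-cost baseline.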



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
