[ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-12026:
--------------------------------------
            Assignee: yuhao yang
    Target Version/s: 1.6.1, 2.0.0

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>            Assignee: yuhao yang
>              Labels: mllib, stats
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction. My
> understanding is that internally it creates jobs to run on batches of 1000
> features at a time.
> I was under the impression that the features are treated as independent, but
> this does not appear to be the case. When the number of features is large
> (160k in my case), each batch gets slower and slower. As an example, running
> on 25 m3.2xlarge instances on Amazon EMR, it started at just over 1 minute
> per batch. By the end, each batch was taking over 30 minutes.
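
The expectation described in the report, that each feature is tested independently of the others, can be illustrated with a small plain-Python sketch. This is not Spark's implementation: the function names `chi_sq_statistic` and `chi_sq_in_batches` are hypothetical, and the default batch size of 1000 is taken only from the reporter's description. The point is that when each statistic depends only on one feature column and the labels, the work per batch should stay roughly constant from the first batch to the last.

```python
from collections import Counter


def chi_sq_statistic(feature_values, labels):
    """Pearson chi-squared statistic for one categorical feature vs. the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # contingency table cells
    feature_totals = Counter(feature_values)         # row totals
    label_totals = Counter(labels)                   # column totals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def chi_sq_in_batches(rows, labels, batch_size=1000):
    """Compute one statistic per feature, walking the features in fixed-size batches.

    Hypothetical helper: each batch touches only its own columns, so no
    state is carried over from earlier batches.
    """
    num_features = len(rows[0])
    stats = []
    for start in range(0, num_features, batch_size):
        end = min(start + batch_size, num_features)
        for j in range(start, end):
            column = [row[j] for row in rows]
            stats.append(chi_sq_statistic(column, labels))
    return stats


if __name__ == "__main__":
    rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
    labels = [0, 0, 1, 1]
    # Feature 0 matches the label exactly; feature 1 is unrelated to it.
    print(chi_sq_in_batches(rows, labels, batch_size=1))
```

In this sketch the batch size only changes how the features are grouped, not the result or the cost per feature, which is the behaviour the reporter expected and did not observe.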