[ https://issues.apache.org/jira/browse/SPARK-7718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549846#comment-14549846 ]
Apache Spark commented on SPARK-7718:
-------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/6256

> Speed up data source partitioning by avoiding cleaning closures
> ---------------------------------------------------------------
>
>                 Key: SPARK-7718
>                 URL: https://issues.apache.org/jira/browse/SPARK-7718
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>
> The new partitioning support strategy creates many RDDs (one per partition, potentially several thousand) and then calls `mapPartitions` on every single one of them. As a result, the same closure is cleaned many times. Since the closure is provided by Spark itself, we know for sure it is serializable, so we can bypass the cleaning for performance.
>
> According to [~yhuai], cleaning 5000 closures takes up to 6-7 seconds of a 12-second job that involves data source partitioning.
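For illustration only, here is a minimal, self-contained sketch of the bottleneck described above. It does not use Spark's actual ClosureCleaner: a plain Java-serialization round trip stands in for the serializability check that cleaning performs, and the object name ClosureCleaningSketch, the helper checkSerializable, and the printed timings are hypothetical, not taken from the pull request.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCleaningSketch {
  // Stand-in for the serializability check done during closure cleaning:
  // serialize the function with plain Java serialization and discard the bytes.
  def checkSerializable(f: Iterator[Int] => Iterator[Int]): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(f)
    out.close()
  }

  def main(args: Array[String]): Unit = {
    val numPartitionRdds = 5000  // one RDD per discovered partition
    val func = (iter: Iterator[Int]) => iter.map(_ + 1)

    // Current behavior: every per-partition RDD's `mapPartitions` call
    // cleans (and serializability-checks) the same closure again.
    var start = System.nanoTime()
    (1 to numPartitionRdds).foreach(_ => checkSerializable(func))
    println(f"cleaned $numPartitionRdds%d times: ${(System.nanoTime() - start) / 1e6}%.1f ms")

    // Proposed behavior: the closure comes from Spark itself and is known to
    // be serializable, so the check is done once (or skipped entirely).
    start = System.nanoTime()
    checkSerializable(func)
    println(f"cleaned once: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }
}
{code}

The proposed change amounts to the second pattern: because the closure passed to `mapPartitions` is constructed inside Spark and known to be serializable, the repeated per-RDD cleaning step can be skipped.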