[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liang-Chi Hsieh resolved SPARK-29181.
-------------------------------------
    Resolution: Duplicate

> Cache preferred locations of checkpointed RDD
> ---------------------------------------------
>
>                 Key: SPARK-29181
>                 URL: https://issues.apache.org/jira/browse/SPARK-29181
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting
> goes well, but in the next step, when we union all factors, the union
> operation is very slow.
> Looking at the driver stack dump, the driver appears to spend a lot of
> time computing preferred locations. Because we checkpoint the training
> data before fitting ALS, the time is spent in
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS
> interface to query file status and block locations. Because a large number
> of partitions are derived from the checkpointed RDD, the union spends a lot
> of time querying the same information repeatedly.
> This proposes adding a Spark config to control the caching behavior of
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled,
> getPreferredLocations computes the preferred locations only once and caches
> them for later use.
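
The caching idea in the description can be illustrated with a small, self-contained sketch. This is not Spark code and not the implementation that was eventually merged: `CheckpointLocations`, `CountingLookup`, and `lookups_needed` are hypothetical names, and the "DFS query" is simulated by a counter so the effect of the cache flag is visible.

```python
from typing import Callable, Dict, List


class CheckpointLocations:
    """Hypothetical stand-in for an RDD backed by checkpoint files."""

    def __init__(self, lookup: Callable[[int], List[str]], cache_enabled: bool):
        self._lookup = lookup              # expensive call, e.g. a DFS block-location query
        self._cache_enabled = cache_enabled  # plays the role of the proposed config flag
        self._cache: Dict[int, List[str]] = {}

    def preferred_locations(self, partition: int) -> List[str]:
        if not self._cache_enabled:
            # Caching disabled: every call goes back to the file system.
            return self._lookup(partition)
        if partition not in self._cache:
            # First request for this partition: query once and remember the answer.
            self._cache[partition] = self._lookup(partition)
        return self._cache[partition]


class CountingLookup:
    """Simulated DFS query that records how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, partition: int) -> List[str]:
        self.calls += 1
        return [f"host-{partition % 3}"]  # made-up host names


def lookups_needed(cache_enabled: bool, num_partitions: int = 4, num_queries: int = 5) -> int:
    """Ask for every partition's locations num_queries times; return the DFS call count."""
    lookup = CountingLookup()
    rdd = CheckpointLocations(lookup, cache_enabled)
    for _ in range(num_queries):
        for p in range(num_partitions):
            rdd.preferred_locations(p)
    return lookup.calls


if __name__ == "__main__":
    print("without cache:", lookups_needed(False))
    print("with cache:   ", lookups_needed(True))
```

With the flag off, the number of simulated DFS calls grows with every repeated request; with it on, it is bounded by the number of partitions. Any such cache can return stale locations if blocks move after the first query, which is one reason to keep the behavior behind a config.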