Liang-Chi Hsieh created SPARK-29181:
---------------------------------------

             Summary: Cache preferred locations of checkpointed RDD
                 Key: SPARK-29181
                 URL: https://issues.apache.org/jira/browse/SPARK-29181
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh


One Spark job in our cluster fits many ALS models in parallel. The fitting goes 
well, but afterwards, when we union all the factors, the union operation is very 
slow.

Looking at the driver stack dump, the driver appears to spend a lot of time 
computing preferred locations. Because we checkpoint the training data before 
fitting ALS, the time is spent in ReliableCheckpointRDD.getPreferredLocations. 
This method calls the DFS interface to query file status and block locations. 
Because a large number of partitions are derived from the checkpointed RDD, the 
union spends a lot of time querying the same information repeatedly.

This issue proposes adding a Spark config to control the caching behavior of 
ReliableCheckpointRDD.getPreferredLocations. If it is enabled, 
getPreferredLocations computes the preferred locations only once and caches 
them for later use.
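
The effect of such a cache can be illustrated with a small, self-contained
sketch. This is not Spark code: the class CheckpointedLocations, the helper
count_dfs_calls, and the fake DFS lookup are all hypothetical names invented
for illustration, and the real change would live inside ReliableCheckpointRDD.
The sketch only shows how memoizing the per-partition lookup reduces the
number of DFS queries when several derived RDDs ask for the same locations.

```python
from typing import Callable, Dict, List


class CheckpointedLocations:
    """Toy stand-in (hypothetical) for a checkpointed RDD's location lookup."""

    def __init__(self, lookup: Callable[[int], List[str]], cache_enabled: bool) -> None:
        self._lookup = lookup                # assumed to be the expensive DFS query
        self._cache_enabled = cache_enabled  # plays the role of the proposed config
        self._cache: Dict[int, List[str]] = {}

    def preferred_locations(self, partition: int) -> List[str]:
        if not self._cache_enabled:
            # Caching disabled: every call goes to the DFS.
            return self._lookup(partition)
        if partition not in self._cache:
            # First request for this partition: query once and remember it.
            self._cache[partition] = self._lookup(partition)
        return self._cache[partition]


def count_dfs_calls(cache_enabled: bool, num_partitions: int = 4, num_derived: int = 3) -> int:
    """Count DFS lookups when several derived RDDs ask for the same locations."""
    calls = 0

    def fake_dfs_lookup(partition: int) -> List[str]:
        nonlocal calls
        calls += 1
        return [f"host-{partition % 2}"]  # made-up host names

    rdd = CheckpointedLocations(fake_dfs_lookup, cache_enabled)
    # Each derived RDD asks the checkpointed parent about every partition.
    for _ in range(num_derived):
        for partition in range(num_partitions):
            rdd.preferred_locations(partition)
    return calls
```

With the defaults above, the uncached variant issues one lookup per derived
RDD per partition, while the cached variant issues one lookup per partition.
A likely reason to keep the behavior behind a config is that cached block
locations can become stale if the DFS moves blocks, though that trade-off is
an assumption here rather than something stated in this issue.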





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
