[ https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-44306:
-----------------------------------
    Labels: pull-request-available  (was: )

> Group FileStatus with few RPC calls within Yarn Client
> ------------------------------------------------------
>
>                 Key: SPARK-44306
>                 URL: https://issues.apache.org/jira/browse/SPARK-44306
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>    Affects Versions: 0.9.2, 2.3.0, 3.5.0
>            Reporter: SHU WANG
>            Priority: Major
>              Labels: pull-request-available
>
> It is inefficient to obtain *FileStatus* for each resource [one by
> one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1].
> In our company setting, we run Spark on Hadoop YARN and HDFS. We noticed
> the current behavior has two major drawbacks:
> # Since each *getFileStatus* call involves network delays, the overall delay
> can be *large* and adds *uncertainty* to the overall Spark job runtime.
> We quantified this overhead within our cluster: the p50 overhead is around
> 10 s, p80 is 1 min, and p100 is up to 15 min. When HDFS is overloaded, the
> delays become more severe.
> # Our cluster issues nearly 100 million *getFileStatus* calls to HDFS
> daily. We noticed that in our cluster, most resources come from the same HDFS
> directory for each user (see our [engineering blog
> post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-]
> about why we took this approach). Therefore, we could reduce nearly
> 100 million *getFileStatus* calls to 0.1 million *listStatus* calls daily,
> further reducing load on the HDFS side.
> All in all, a more efficient way to fetch the *FileStatus* for each resource
> is highly needed.
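The batching idea above can be sketched as follows. This is a minimal, hypothetical illustration of the technique (not the actual `ClientDistributedCacheManager` change from the pull request): group the resource URIs by their parent directory, issue one `listStatus`-style lookup per directory, and resolve each file's status from the directory listing. The `FileStatus` case class and the `listDir` parameter are stand-ins for Hadoop's `org.apache.hadoop.fs.FileStatus` and `FileSystem.listStatus`, so the sketch stays self-contained.

```scala
import java.net.URI

object BatchFileStatus {
  // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
  case class FileStatus(path: String, length: Long)

  /** Resolve the status of every URI with one directory-listing RPC
    * per distinct parent directory, instead of one getFileStatus RPC
    * per file. `listDir` stands in for FileSystem.listStatus(dir).
    */
  def statusesByBatch(
      uris: Seq[URI],
      listDir: String => Seq[FileStatus]): Map[URI, FileStatus] = {
    // Group the requested files by their parent directory.
    val byParent: Map[String, Seq[URI]] =
      uris.groupBy(u => u.resolve(".").toString.stripSuffix("/"))
    byParent.flatMap { case (parent, children) =>
      // One RPC covers every requested file in this directory.
      val listing: Map[String, FileStatus] =
        listDir(parent).map(fs => fs.path -> fs).toMap
      children.flatMap(u => listing.get(u.toString).map(u -> _))
    }
  }
}
```

With N files spread over M directories (M << N in the per-user-directory layout the issue describes), this replaces N `getFileStatus` RPCs with M `listStatus` RPCs, which is the roughly 1000x reduction quoted above.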
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org