[ 
https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44306.
-----------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 42357
[https://github.com/apache/spark/pull/42357]

> Group FileStatus with few RPC calls within Yarn Client
> ------------------------------------------------------
>
>                 Key: SPARK-44306
>                 URL: https://issues.apache.org/jira/browse/SPARK-44306
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>    Affects Versions: 0.9.2, 2.3.0, 3.5.0
>            Reporter: SHU WANG
>            Assignee: SHU WANG
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> It's inefficient to obtain *FileStatus* for each resource [one by 
> one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1].
>  In our company setting, we are running Spark with Hadoop Yarn and HDFS. We 
> noticed the current behavior has two major drawbacks:
>  # Since each *getFileStatus* call involves network delays, the overall delay 
> can be *large* and add *uncertainty* to the overall Spark job runtime. 
> Specifically, we quantify this overhead within our cluster. We see the p50 
> overhead is around 10s, p80 is 1 min, and p100 is up to 15 mins. When HDFS is 
> overloaded, the delays become more severe. 
>  # In our cluster, we have nearly 100 million *getFileStatus* call to HDFS 
> daily. We noticed that in our cluster, most resources come from the same HDFS 
> directory for each user (See our [engineer blog 
> post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-]
>  about why we took this approach). Therefore, we can greatly reduce nearly 
> 100 million *getFileStatus* call to 0.1 million *listStatus* calls daily. 
> This will further reduce overhead from the HDFS side. 
> All in all, a more efficient way to fetch the *FileStatus* for each resource 
> is highly needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to