Rong Tang created SPARK-25409:
---------------------------------

             Summary: Speed up Spark History at start if there are tens of thousands of applications.
                 Key: SPARK-25409
                 URL: https://issues.apache.org/jira/browse/SPARK-25409
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Rong Tang

We run a Spark History Server that stores 7 days of applications, usually 10K to 20K attempts. At startup it can take hours to load and replay the logs in the event-log folder, so newly finished applications have to wait several hours before they can be seen. I made two improvements to address this.

# We run Spark on YARN, so information about ongoing applications is also available from the ResourceManager. I introduced a flag, spark.history.fs.load.incomplete, that controls whether logs for incomplete attempts are loaded.
# Incremental loading of applications. As noted, we store more than 10K applications, and loading all of them the first time can take hours. I introduced a config, spark.history.fs.appRefreshNum, that sets how many applications to load in each pass; after each pass the server gets a chance to check for the latest updates.

Here are the benchmarks I ran. Our system has 1K incomplete applications (they were not cleaned up for some reason; that is a separate issue I need to investigate), and an application's log can be gigabytes in size.

Not loading incomplete attempts:

| |Load Count|Load incomplete apps|Total attempts|Time Cost|Increases with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
|2|All|No|13K|31 minutes|Yes|

Limiting how many to load each time:

| |Load Count|Load incomplete apps|Total attempts|Worst Cost|Increases with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
|2|3000|Yes|13K|42 minutes, except for the last 1.6K attempts, which took extremely long (2.5 hours)|No|

Limiting how many to load each time, and not loading incomplete jobs:

| |Load Count|Load incomplete apps|Total attempts|Worst Cost|Avg|Increases with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
|2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|

| |Load Count|Load incomplete apps|Total attempts|Worst Cost|Avg|Increases with more attempts|
|1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
|2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
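
To make the two settings concrete, here is a minimal sketch of how they might combine. It is only a model, not the actual patch to the History Server: the `Attempt` class and the `select_attempts` and `batches` helpers are hypothetical names invented for illustration. Only the two config names and the 13K / 1K / 3000 figures come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Attempt:
    """Hypothetical stand-in for one application attempt's event log."""
    app_id: str
    completed: bool


def select_attempts(attempts: Iterable[Attempt], load_incomplete: bool) -> List[Attempt]:
    """Models spark.history.fs.load.incomplete: optionally skip incomplete attempts."""
    return [a for a in attempts if load_incomplete or a.completed]


def batches(attempts: List[Attempt], refresh_num: int) -> Iterator[List[Attempt]]:
    """Models spark.history.fs.appRefreshNum: at most refresh_num attempts per pass."""
    if refresh_num <= 0:
        raise ValueError("refresh_num must be positive")
    for start in range(0, len(attempts), refresh_num):
        # In a real server, a rescan for newly finished logs would run between passes.
        yield attempts[start:start + refresh_num]


# 13,000 attempts, of which 1,000 are incomplete (the figures from the benchmark).
attempts = [Attempt(f"app-{i}", completed=(i >= 1000)) for i in range(13_000)]
to_load = select_attempts(attempts, load_incomplete=False)
passes = list(batches(to_load, refresh_num=3000))
print(len(to_load), [len(p) for p in passes])  # 12000 [3000, 3000, 3000, 3000]
```

Under these assumptions the 12K complete attempts split into four passes of 3000. That lines up with the reported average of about 10 minutes per pass and 41 minutes in total, and newly finished applications only wait for the current pass rather than for the whole backlog.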