[ https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618213#comment-16618213 ]
Rong Tang commented on SPARK-25409: ----------------------------------- Create a pull request for it. [https://github.com/apache/spark/pull/22444] > Speed up Spark History at start if there are tens of thousands of > applications. > ------------------------------------------------------------------------------- > > Key: SPARK-25409 > URL: https://issues.apache.org/jira/browse/SPARK-25409 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Rong Tang > Priority: Major > Attachments: SPARK-25409.0001.patch > > > We have a spark history server, storing 7 days' applications. it usually has > 10K to 20K attempts. > We found that it can take hours at start up,loading/replaying the logs in > event-logs folder. thus, new finished applications have to wait several > hours to be seem. So I made 2 improvements for it. > # As we run spark on yarn. the on-going applications' information can also > be seen via resource manager, so I introduce in a flag > spark.history.fs.load.incomplete to say loading logs for incomplete attempts > or not. > # Incremental loading applications. as I said, we have more then 10K > applications stored, it can take hours to load all of them at the first time. > so I introduced in a config spark.history.fs.appRefreshNum to say how many > application to load each time, then it gets a chance the check the latest > updates. > Here are the benchmark I did. our system has 1K incomplete application ( it > was not cleaned up for some reason, that is another issue that I need > investigate), and applications' log size can be gigabytes. > > Not load incomplete attempts. > | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase > with more attempts| > |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|All|No|13K|31 minutes| yes| > > > Limit each time how much to load. > > | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase > with more attempts| > |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|3000|Yes|13K|42 minutes except last 1.6K > (The last 1.6K attempts cost extremely long 2.5 hours)|NO| > > > Limit each time how many to load, and not load incomplete jobs. > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes| > |2|3000|NO|12K|17minutes > |10 minutes > ( 41 minutes in total)|NO| > > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 ( current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes| > |2|3000|NO|18.5K|20minutes|18 minutes > (2 hours 18 minutes in total) > |NO| > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org