[ https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-18010:
------------------------------------

    Assignee: Apache Spark

> Remove unneeded heavy work performed by FsHistoryProvider for building up the
> application listing UI page
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-18010
>                 URL: https://issues.apache.org/jira/browse/SPARK-18010
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 1.6.2, 2.0.1, 2.1.0
>            Reporter: Vinayak Joshi
>            Assignee: Apache Spark
>
> There are known complaints about the History Server's application list not
> updating quickly enough when the event log files that need replay are huge.
> Currently, the FsHistoryProvider design causes the entire event log file to
> be replayed when building the initial application listing (see the method
> mergeApplicationListing(fileStatus: FileStatus)). The process of replay
> involves:
> - reading each line in the event log as a string,
> - parsing the string to a Json structure,
> - converting the Json to the corresponding Scala classes with nested
>   structures.
> Parsing the string to Json and then converting it to Scala classes is the
> expensive part. Tests show that the majority of the time spent in replay
> goes to this work.
> When the replay is performed for building the application listing, the only
> two events that the code really cares about are
> "SparkListenerApplicationStart" and "SparkListenerApplicationEnd", since the
> only listener attached to the ReplayListenerBus at that point is the
> ApplicationEventListener. This means that when processing an event log file
> with a huge number of events (hundreds of thousands, possibly more), the
> work done to deserialize all of these events and then replay them is not
> needed. We are interested in only two events, and this can be used to ensure
> that when replay is performed for the purpose of building the application
> list, we only make the effort to replay these two events and not the others.
> My tests show that this drastically improves application list load time. For
> a 150MB event log from a user, with over 100,000 events, the load time (run
> locally on my Mac) comes down from about 16 seconds to under 1 second using
> this approach. For customers that typically execute applications with large
> event logs, and thus have multiple large event logs present, this can
> considerably speed up how soon the history server UI lists the apps.
> I will be updating a pull request with a take at fixing this.
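
The idea described in the issue, skipping the Json parse for every line that cannot be one of the two events of interest, can be sketched outside of Spark. The following is a minimal Python illustration, not the FsHistoryProvider or ReplayListenerBus code. It assumes each event log line is a compact Json object whose first key is "Event"; the names LISTING_EVENT_PREFIXES, replay_all, replay_for_listing and build_sample_log are hypothetical and exist only for this example.

```python
import json

# Assumption: every log line is compact Json and starts with the "Event" key.
# These prefixes are illustrative and are not copied from Spark's source.
LISTING_EVENT_PREFIXES = (
    '{"Event":"SparkListenerApplicationStart"',
    '{"Event":"SparkListenerApplicationEnd"',
)


def replay_all(lines, handle):
    """Parse every line, as a full replay does. Returns the number parsed."""
    parsed = 0
    for line in lines:
        handle(json.loads(line))
        parsed += 1
    return parsed


def replay_for_listing(lines, handle):
    """Parse only the lines that pass a cheap string-prefix check."""
    parsed = 0
    for line in lines:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if line.startswith(LISTING_EVENT_PREFIXES):
            handle(json.loads(line))
            parsed += 1
    return parsed


def build_sample_log(task_events):
    """Build a synthetic log: one start event, many task events, one end event."""
    def dump(obj):
        # Compact separators so the output matches the assumed prefixes.
        return json.dumps(obj, separators=(",", ":"))

    lines = [dump({"Event": "SparkListenerApplicationStart", "App Name": "demo"})]
    for i in range(task_events):
        lines.append(dump({"Event": "SparkListenerTaskEnd", "Task ID": i}))
    lines.append(dump({"Event": "SparkListenerApplicationEnd", "Timestamp": 0}))
    return lines


if __name__ == "__main__":
    log = build_sample_log(100_000)
    listing_events = []
    parsed = replay_for_listing(log, listing_events.append)
    print(f"parsed {parsed} of {len(log)} lines for the listing")
```

The prefix check still reads every line as a string, so only the first of the three replay steps is paid for lines that are skipped. A filter like this depends on how the log writer serializes events; if the "Event" key were not first, or the output were pretty-printed, the check would have to change.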