[ https://issues.apache.org/jira/browse/SPARK-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-13988.
-----------------------------------
    Resolution: Fixed
 Fix Version/s: 2.0.0

> Large history files block new applications from showing up in History UI.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13988
>                 URL: https://issues.apache.org/jira/browse/SPARK-13988
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Parth Brahmbhatt
>            Assignee: Parth Brahmbhatt
>             Fix For: 2.0.0
>
>
> Some of our Spark users complain that their applications do not show up in
> the history server UI. Our analysis suggests that this is a side effect of
> some applications' event logs being too large. This is especially common for
> Spark ML applications, which may run many iterations, but it applies to
> other kinds of Spark jobs too. For example, on my local machine, just
> running the following generates an event log of about 80MB:
> {code}
> ./spark-shell --master yarn --deploy-mode client --conf
> spark.eventLog.enabled=true --conf
> spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events
> val words = sc.textFile("test.txt")
> for (i <- 1 to 10000) words.count
> sc.stop()
> {code}
> For one of our users this file was as large as 12GB; he was running logistic
> regression with Spark ML. Since each application generates its own event log
> and event logs are processed serially in a single thread, one huge
> application can prevent many users from seeing their applications in the
> main UI. To overcome this issue, I propose making the replay execution
> multi-threaded, so that a single large event log won't block other
> applications from being rendered in the UI.
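The multi-threaded replay idea above can be sketched with a plain `java.util.concurrent` thread pool. This is a minimal illustration, not the actual Spark history server code: `replayLog` and `logPaths` are hypothetical stand-ins for the real replay logic and log listing.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ReplaySketch {
  // Hypothetical stand-in for replaying one application's event log.
  def replayLog(path: String): Unit =
    println(s"replayed $path")

  def main(args: Array[String]): Unit = {
    val logPaths = Seq("app-1.log", "huge-app-2.log", "app-3.log")
    // With a fixed-size pool, a single 12GB log ties up only one
    // worker thread while the others keep processing smaller logs,
    // instead of blocking the whole (previously serial) scan.
    val pool = Executors.newFixedThreadPool(4)
    logPaths.foreach { p =>
      pool.submit(new Runnable { def run(): Unit = replayLog(p) })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```

As the description notes, this does not help when many large logs arrive at once (the pool still saturates), but it removes the head-of-line blocking caused by one oversized file.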
> This still cannot solve the issue completely if there are too many large
> event logs, but the alternatives I have considered (reading chunks from the
> beginning and end to get the application start and end events, or modifying
> the event log format so it carries this info in a header or footer) are all
> more intrusive.
> In addition, there are several other things we can do to improve the history
> server implementation:
> * During the log-checking phase that identifies application start and end
> times, the replaying thread processes the whole event log and throws away
> everything except the application start and end events. This is hugely
> wasteful, because as soon as a user clicks on the application we reprocess
> the same event log to get the job/task details. We should either optimize
> the first level of parsing so it reads only chunks from the beginning and
> end to identify the application-level details, or better yet cache the
> job/task details when we process the file for the first time.
> * On the job details page there is no pagination, and we only show the last
> 1000 job events when there are more than 1000. Granted, users with more than
> 1K jobs probably won't page through all of them, but not even having the
> option is a bad experience. Also, if that page were paginated, we could
> defer the remaining processing of the event log until the user asks for the
> next page. This would help in cases where processing very large files causes
> OOM issues, since we would only process a subset of the file at a time.
> * On startup, the history server reprocesses every event log. For the
> top-level application details, we could persist the processing results from
> the last run in a more compact and searchable format to improve bootstrap
> time. This is briefly mentioned in SPARK-6951.
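The pagination point above can be illustrated with a small sketch. This is a hypothetical helper, assuming a 1-indexed page number over an in-memory sequence of job events; the real fix would also defer parsing the log until a page is requested.

```scala
object PaginationSketch {
  // Hypothetical pagination over job events: render one page at a
  // time instead of truncating to the last 1000 events.
  def page[A](events: Seq[A], pageNum: Int, pageSize: Int = 100): Seq[A] =
    events.slice((pageNum - 1) * pageSize, pageNum * pageSize)

  def main(args: Array[String]): Unit = {
    val events = (1 to 2500).toSeq
    // First page starts at event 1; every full page has pageSize events.
    println(page(events, 1).take(3))
    println(page(events, 25).size)
  }
}
```

Because `slice` only materializes the requested window, a UI built this way never needs the full event list in memory at once, which is the OOM angle mentioned above.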
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org