Parth Brahmbhatt created SPARK-13988:
----------------------------------------
Summary: Large history files block new applications from showing
up in History UI.
Key: SPARK-13988
URL: https://issues.apache.org/jira/browse/SPARK-13988
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.6.1
Reporter: Parth Brahmbhatt
Some of our Spark users complain that their applications do not show up in the
History Server UI. Our analysis suggests that this is a side effect of some
applications' event logs being too big. This is especially true for Spark ML
applications that may run a lot of iterations, but it applies to other kinds of
Spark jobs too. For example, on my local machine just running the following
generates an event log of 80MB.
{code}
./spark-shell --master yarn --deploy-mode client \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events

val words = sc.textFile("test.txt")
for (i <- 1 to 10000) words.count
sc.stop()
{code}
For one of our users this file was as big as 12GB; he was running logistic
regression using Spark ML. Given that each application generates its own event
log, and that event logs are processed serially in a single thread, one huge
application can prevent a lot of users from seeing their applications on the
main UI. To overcome this issue I propose making the replay execution
multi-threaded, so that a single large event log won't block other applications
from being rendered in the UI. This still does not solve the issue completely
if there are too many large event logs, but the alternatives I have considered
(reading chunks from the beginning and end of the log to get the application
start and end events, or modifying the event log format so it carries this info
in a header or footer) are all more intrusive.
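The multi-threaded replay proposed above could be sketched roughly as below. This is a minimal sketch using a plain fixed-size thread pool; {{replayEventLog}} and {{AppInfo}} are hypothetical stand-ins for the history provider's existing serial replay logic, not the actual Spark internals:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-ins for the history server's replay machinery.
case class AppInfo(id: String)
def replayEventLog(path: String): AppInfo =
  AppInfo(path) // placeholder for the (potentially slow) replay of one log

// Fixed-size pool: one slow 12GB log ties up a single worker while the
// remaining workers keep replaying the other applications' logs.
val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val logPaths = Seq("app-1.log", "app-2.log", "app-3.log")
val futures  = logPaths.map(p => Future(replayEventLog(p)))

// Wait for all replays; order of results matches logPaths.
val apps = Await.result(Future.sequence(futures), Duration.Inf)
pool.shutdown()
```

With the current serial design the equivalent of this loop runs one log at a time, so the 12GB log delays every application behind it in the listing.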
In addition, there are several other things we can do to improve the History
Server implementation.
* During the log-checking phase, to identify the application start and end
times the replaying thread processes the whole event log and throws away all
the information apart from the application start and end events. This is a
pretty huge waste, given that as soon as a user clicks on the application we
reprocess the same event log to get the job/task details. We should either
optimize the first level of parsing so it reads chunks from the beginning and
end to identify the application-level details, or better yet cache the
job/task-level details when we process the file for the first time.
* On the job details page there is no pagination, and we only show the last
1000 job events when there are more than 1000. Granted, when users have more
than 1K jobs they probably won't page through them, but not even having that
option is a bad experience. Also, if that page were paginated we could probably
defer processing parts of the event log until the user wants to view the next
page. This can help in cases where processing really large files causes OOM
issues, as we would only be processing a subset of the file at a time.
* On startup, the history server reprocesses all event logs from scratch. For
the top-level application details, we could persist the processing results from
the last run in a more compact and searchable format to improve the bootstrap
time. This is briefly mentioned in SPARK-6951.
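The caching idea from the first bullet, i.e. keeping the job/task details from the first full parse so a click-through does not replay the same file, could be sketched as below. All names here ({{ParsedLog}}, {{parse}}, {{getParsed}}) are illustrative assumptions, not Spark's actual internals; the stored mtime invalidates stale entries when a log is rewritten:

```scala
import scala.collection.mutable

// Hypothetical result of one full replay: the application-level summary
// plus the job/task details that the UI needs on click-through.
case class ParsedLog(appSummary: String, jobDetails: Seq[String])

var replays = 0 // counts how often we pay for the expensive full replay

def parse(path: String): ParsedLog = {
  replays += 1 // stand-in for serially replaying the whole event log
  ParsedLog(s"summary of $path", Seq(s"jobs of $path"))
}

// Cache keyed by log path; the stored mtime detects stale entries.
val cache = mutable.Map.empty[String, (Long, ParsedLog)]

def getParsed(path: String, mtime: Long): ParsedLog =
  cache.get(path) match {
    case Some((m, parsed)) if m == mtime => parsed // hit: no second replay
    case _ =>
      val parsed = parse(path) // miss or stale: replay once and remember
      cache(path) = mtime -> parsed
      parsed
  }

val first  = getParsed("app-1.log", 100L) // replays the log
val second = getParsed("app-1.log", 100L) // served from the cache
```

A persisted variant of this same map (written to disk in a compact, searchable format) is essentially what the startup-time bullet and SPARK-6951 describe.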
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]