Parth Brahmbhatt created SPARK-13988:
----------------------------------------

             Summary: Large history files block new applications from showing 
up in History UI.
                 Key: SPARK-13988
                 URL: https://issues.apache.org/jira/browse/SPARK-13988
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.6.1
            Reporter: Parth Brahmbhatt


Some of our Spark users complain that their applications do not show up in the 
History Server UI. Our analysis suggests that this is a side effect of some 
applications' event logs being too big. This is especially true for Spark ML 
applications, which may run many iterations, but it applies to other kinds of 
Spark jobs too. For example, on my local machine just running the following 
generates an event log of 80MB:

{code}
./spark-shell --master yarn --deploy-mode client \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events

val words = sc.textFile("test.txt")
for (i <- 1 to 10000) words.count
sc.stop()
{code}

For one of our users this file was as big as 12GB; he was running logistic 
regression using Spark ML. Given that each application generates its own event 
log and event logs are processed serially in a single thread, one huge 
application can prevent a lot of users from seeing their applications on the 
main UI. To overcome this issue I propose making the replay execution 
multi-threaded, so a single large event log won't block other applications from 
being rendered in the UI. This still cannot solve the issue completely if there 
are too many large event logs, but the alternatives I have considered (read 
chunks from the beginning and end to get the application start and end events; 
modify the event log format so it carries this info in a header or footer) are 
all more intrusive.
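
As a rough sketch of the proposed threading change (this is not the actual 
FsHistoryProvider code; replayEventLog is a hypothetical stand-in for the 
existing single-threaded replay logic, and the pool size of 4 is arbitrary):

{code}
import java.util.concurrent.{Executors, TimeUnit}

object ParallelReplaySketch {
  // Hypothetical stand-in for the existing replay logic that parses one
  // event log and registers the application in the history listing.
  def replayEventLog(logPath: String): Unit =
    println(s"Replaying $logPath")

  def main(args: Array[String]): Unit = {
    val logPaths = Seq("app-1", "app-2", "app-3")

    // Replay several logs concurrently so one huge log cannot block
    // smaller applications from appearing in the listing.
    val pool = Executors.newFixedThreadPool(4)
    logPaths.foreach { path =>
      pool.submit(new Runnable {
        override def run(): Unit = replayEventLog(path)
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)
  }
}
{code}

A bounded pool also caps how many logs are parsed at once, which matters 
because each replay can hold a lot of listener state in memory.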

In addition, there are several other things we can do to improve the History 
Server implementation.
* During the log-checking phase, to identify the application start and end 
times, the replaying thread processes the whole event log and throws away 
everything except the application start and end events. This is a huge waste 
given that, as soon as a user clicks on the application, we reprocess the same 
event log to get the job/task details. We should either optimize the first 
level of parsing so it reads only chunks from the beginning and end of the file 
to identify the application-level details (see the first sketch after this 
list), or better yet cache the job/task-level details when we process the file 
for the first time.
* On the job details page there is no pagination, and we only show the last 
1000 job events when there are more than 1000. Granted, when users have more 
than 1K jobs they probably won't page through them, but not even having that 
option is a bad experience. Also, if that page were paginated we could get away 
with processing the event log only partially, deferring the rest until the user 
wants to view the next page. This can help in cases where processing really 
large files causes OOM issues, as we would only be processing a subset of the 
file.
* On startup, the history server reprocesses the whole event log. For the 
top-level application details, we could persist the processing results from the 
last run in a more compact and searchable format to improve the bootstrap time 
(see the second sketch after this list). This is briefly mentioned in 
SPARK-6951.
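
For the first bullet, a minimal sketch of the head/tail scan, assuming an 
uncompressed JSON-lines event log in which the SparkListenerApplicationStart 
event sits near the head and the SparkListenerApplicationEnd event near the 
tail (compressed logs would need to be decompressed first, and lines straddling 
a chunk boundary are ignored for brevity); the 64KB chunk size and the helper 
names are illustrative:

{code}
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

object HeadTailScanSketch {
  private val ChunkSize = 64 * 1024

  // Read up to ChunkSize bytes starting at offset and split into lines.
  private def readLines(raf: RandomAccessFile, offset: Long): Array[String] = {
    raf.seek(offset)
    val buf = new Array[Byte](math.min(ChunkSize.toLong, raf.length() - offset).toInt)
    raf.readFully(buf)
    new String(buf, StandardCharsets.UTF_8).split("\n")
  }

  // Return the raw JSON lines for the application start and end events,
  // if they fall inside the first and last chunk respectively.
  def appStartAndEnd(path: String): (Option[String], Option[String]) = {
    val raf = new RandomAccessFile(path, "r")
    try {
      val headLines = readLines(raf, 0)
      val tailLines = readLines(raf, math.max(0L, raf.length() - ChunkSize))
      (headLines.find(_.contains("\"Event\":\"SparkListenerApplicationStart\"")),
       tailLines.reverse.find(_.contains("\"Event\":\"SparkListenerApplicationEnd\"")))
    } finally {
      raf.close()
    }
  }
}
{code}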
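
And for the last bullet, a sketch of persisting the application-level listing 
between runs so startup can skip full replay of unchanged logs; the AppInfo 
case class and the one-line-per-application TSV layout are illustrative 
assumptions, not Spark's actual on-disk format:

{code}
import java.io.{File, PrintWriter}
import scala.io.Source

// Illustrative record for the application-level details shown in the listing.
case class AppInfo(id: String, name: String, startTime: Long, endTime: Long)

object ListingCacheSketch {
  // Write one tab-separated line per application.
  def save(cache: File, apps: Seq[AppInfo]): Unit = {
    val out = new PrintWriter(cache)
    try {
      apps.foreach { a =>
        out.println(s"${a.id}\t${a.name}\t${a.startTime}\t${a.endTime}")
      }
    } finally {
      out.close()
    }
  }

  // Load the cached listing; an absent cache just means a cold start.
  def load(cache: File): Seq[AppInfo] =
    if (!cache.exists()) Seq.empty
    else {
      val src = Source.fromFile(cache)
      try {
        src.getLines().map { line =>
          val Array(id, name, start, end) = line.split("\t", 4)
          AppInfo(id, name, start.toLong, end.toLong)
        }.toList
      } finally {
        src.close()
      }
    }
}
{code}

On restart, the provider would load this cache, compare it against the event 
log directory (e.g. by file size or modification time), and replay only new or 
changed logs.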



