[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388230#comment-16388230 ]
Marcelo Vanzin commented on SPARK-23607:
----------------------------------------
I think this is a nice trick to speed things up, even though it only works for
HDFS. I have some ideas for a more generic speedup in this code, but I just
haven't had the time to sit down and try them out; this could help in the
meantime.
> Use HDFS extended attributes to store application summary to improve the
> Spark History Server performance
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Web UI
> Affects Versions: 2.3.0
> Reporter: Ye Zhou
> Priority: Major
> Fix For: 2.4.0
>
>
> Currently in the Spark History Server, the checkForLogs thread creates replay
> tasks for log files whose size has changed. Each replay task filters out most
> of the log file content and keeps only the application summary: applicationId,
> user, attemptACL, start time, and end time. The application summary data is
> written into listing.ldb and serves the application list on the SHS home page.
> For a long-running application, the log file whose name ends with "inprogress"
> gets replayed multiple times just to extract this summary. This wastes compute
> and data-reading resources in the SHS and delays the application showing up on
> the home page. Internally we have a patch that uses HDFS extended attributes
> to speed up retrieval of the application summary in the SHS. With this patch,
> the Driver writes the application summary into extended attributes as
> key/value pairs, and the SHS first tries to read the summary from the extended
> attributes; if that fails, it falls back to replaying the log file content as
> usual (a rough sketch of this flow is included at the end of this
> description). The feature can be enabled or disabled through configuration.
> This patch has been running fine internally for 4 months, and the last-updated
> timestamp on the SHS stays within 1 minute, which is the refresh interval we
> configure. Before the patch we saw delays of up to 30 minutes at our scale,
> where a large number of Spark applications run per day.
> We want to see whether this kind of approach is acceptable to the community.
> Please comment. If so, I will post a pull request for the changes. Thanks.
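> A minimal sketch of the read/write flow described above, for illustration
> only: it uses the Hadoop FileSystem xattr API (setXAttr/getXAttr), but the
> attribute name and the JSON summary format here are assumptions, not what the
> internal patch actually uses.
> {code:scala}
> import java.nio.charset.StandardCharsets.UTF_8
>
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // Hypothetical attribute name; HDFS user xattrs must use the "user." namespace.
> val SummaryXAttr = "user.spark.appSummary"
>
> // Driver side (sketch): store the application summary as a small JSON blob in
> // an extended attribute on the event log file.
> def writeSummary(fs: FileSystem, logPath: Path, summaryJson: String): Unit = {
>   fs.setXAttr(logPath, SummaryXAttr, summaryJson.getBytes(UTF_8))
> }
>
> // SHS side (sketch): prefer the extended attribute; return None when it is
> // missing or unreadable so the caller can fall back to replaying the log file.
> def readSummary(fs: FileSystem, logPath: Path): Option[String] = {
>   try {
>     Option(fs.getXAttr(logPath, SummaryXAttr)).map(bytes => new String(bytes, UTF_8))
>   } catch {
>     case _: Exception => None
>   }
> }
> {code}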