[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387126#comment-16387126 ]
Ye Zhou commented on SPARK-23607:
---------------------------------
[~vanzin] Any comments? Thanks.

> Use HDFS extended attributes to store application summary to improve the Spark History Server performance
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23607
>                 URL: https://issues.apache.org/jira/browse/SPARK-23607
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0
>            Reporter: Ye Zhou
>            Priority: Major
>             Fix For: 2.4.0
>
> Currently in the Spark History Server (SHS), the checkForLogs thread creates replay tasks for log files whose size has changed. Each replay task filters out most of the log file content and keeps only the application summary: applicationId, user, attempt ACLs, start time, and end time. This summary is written to listing.ldb and serves the application list on the SHS home page. For a long-running application, the log file (whose name ends with "inprogress") gets replayed multiple times just to extract the same summary. This wastes compute and data-reading resources in the SHS, and delays applications from showing up on the home page.
>
> Internally we have a patch that uses HDFS extended attributes to speed up retrieval of the application summary in the SHS. With this patch, the driver writes the application summary into extended attributes as key/value pairs. The SHS first tries to read the extended attributes; if that read fails, it falls back to replaying the log file content as usual. The feature can be enabled or disabled through configuration.
>
> We have been running this patch internally for 4 months without issues, and the last-updated timestamp on the SHS stays within 1 minute, matching our configured refresh interval of 1 minute.
> Originally the delay could be as long as 30 minutes at our scale, where a large number of Spark applications run per day.
>
> We want to see whether this kind of approach is also acceptable to the community. Please comment. If so, I will post a pull request for the changes. Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
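The write path (driver) and read-with-fallback path (SHS) described in the issue can be sketched as follows. This is a minimal simulation, not the actual patch: an in-memory dict stands in for HDFS extended attributes, and the xattr key name, helper functions, and summary fields are all hypothetical.

```python
import json

# Assumed xattr key in the "user." namespace; the real patch may name it differently.
XATTR_KEY = "user.spark.appSummary"

def write_summary(xattrs, path, summary):
    """Driver side: store the application summary as a key/value extended attribute."""
    xattrs.setdefault(path, {})[XATTR_KEY] = json.dumps(summary)

def read_summary(xattrs, path, replay_log):
    """SHS side: try the cheap xattr read first; fall back to a full log replay."""
    try:
        return json.loads(xattrs[path][XATTR_KEY])
    except KeyError:
        # Expensive path: parse the whole event log, as the SHS does today.
        return replay_log(path)

# Usage: one log has the summary xattr, the other forces the fallback.
xattrs = {}
summary = {"appId": "app-1", "user": "zhou", "startTime": 1520000000}
write_summary(xattrs, "/logs/app-1.inprogress", summary)

fast = read_summary(xattrs, "/logs/app-1.inprogress",
                    replay_log=lambda p: {"appId": "replayed"})
slow = read_summary(xattrs, "/logs/app-2.inprogress",
                    replay_log=lambda p: {"appId": "replayed"})
```

On real HDFS, the driver would presumably use the Hadoop FileSystem setXAttr/getXAttr API (available since Hadoop 2.5) with a key in the "user." namespace; the fallback branch corresponds to the existing replay code in checkForLogs.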