[ https://issues.apache.org/jira/browse/SPARK-33841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-33841.
-----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30845
[https://github.com/apache/spark/pull/30845]

> Jobs disappear intermittently from the SHS under high load
> ----------------------------------------------------------
>
>                 Key: SPARK-33841
>                 URL: https://issues.apache.org/jira/browse/SPARK-33841
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>         Environment: SHS is running locally on Ubuntu 19.04
>            Reporter: Vladislav Glinskiy
>            Assignee: Vladislav Glinskiy
>            Priority: Major
>             Fix For: 3.2.0
>
> Ran into an issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
>
> The issue is caused by SPARK-29043, which was intended to improve the concurrent performance of the History Server. The [change|https://github.com/apache/spark/pull/25797/files#] breaks the ["app deletion" logic|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563] because {{processing}} event log entries are not properly synchronized. Since the SHS now [filters out|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462] all {{processing}} event log entries, such entries never get the chance to be [updated with the new {{lastProcessed}}|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472] time. As a result, any entry that completes processing right after the [filtering|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462] and before [the check for stale entries|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560] is identified as stale and is deleted from the UI until the next {{checkForLogs}} run. This happens because the [updated {{lastProcessed}} time is used as the staleness criterion|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557], and event log entries that were not updated with the new time match that criterion.
>
> The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. Around 800 copies (82.6 MB) of an event log file were created using the [shs-monitor|https://github.com/vladhlinsky/shs-monitor] script. The SHS then counted the total number of applications inconsistently: at first, the number increased as expected, but on the next page refresh the total number of applications decreased. No errors were logged by the SHS.
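To make the race easier to follow, here is a minimal, hypothetical Scala sketch of a {{checkForLogs}}-style scan loop. The names ({{LogInfo}}, {{listing}}, {{processing}}) only loosely mirror {{FsHistoryProvider}} and are assumptions for illustration; this is not the actual Spark code.

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.concurrent.TrieMap

// One listed entry per known event log.
case class LogInfo(logPath: String, lastProcessed: Long)

object ShsRaceSketch {

  // Paths whose event logs are currently being parsed on another thread.
  val processing: java.util.Set[String] = ConcurrentHashMap.newKeySet[String]()

  // The listing "database" that backs the UI.
  val listing = new TrieMap[String, LogInfo]()

  def checkForLogs(scannedPaths: Seq[String]): Unit = {
    val newLastScanTime = System.currentTimeMillis()

    // Step 1: entries that are still being processed are filtered out,
    // so their lastProcessed time is NOT refreshed during this scan.
    val updated = scannedPaths.filterNot(path => processing.contains(path))
    updated.foreach { path =>
      listing.put(path, LogInfo(path, lastProcessed = newLastScanTime))
    }

    // A log can finish processing right here: it is removed from
    // `processing`, but its listed lastProcessed is still the old value.

    // Step 2: anything whose lastProcessed predates this scan is treated
    // as stale and dropped until the next checkForLogs run.
    val stale = listing.values.filter(_.lastProcessed < newLastScanTime).toList
    stale.foreach(info => listing.remove(info.logPath))
  }
}
{code}

In this sketch, a log whose parsing finishes between step 1 and step 2 keeps its old {{lastProcessed}} value and is therefore deleted as stale, which matches the intermittent disappearance described above.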