[ https://issues.apache.org/jira/browse/SPARK-33841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33841:
----------------------------------
    Affects Version/s: 3.2.0
                       3.1.0

> Jobs disappear intermittently from the SHS under high load
> ----------------------------------------------------------
>
>                 Key: SPARK-33841
>                 URL: https://issues.apache.org/jira/browse/SPARK-33841
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>         Environment: SHS is running locally on Ubuntu 19.04
>  
>            Reporter: Vladislav Glinskiy
>            Assignee: Vladislav Glinskiy
>            Priority: Major
>             Fix For: 3.1.0, 3.2.0
>
>
> Ran into an issue where a particular job was displayed in the SHS,
> disappeared after some time, and then showed up again several minutes later.
> The issue is caused by SPARK-29043, which was intended to improve the
> concurrent performance of the History Server. The
> [change|https://github.com/apache/spark/pull/25797/files#] breaks the ["app
> deletion"
> logic|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563]
>  because it lacks proper synchronization for {{processing}} event log
> entries. Since the SHS now [filters
> out|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462]
>  all {{processing}} event log entries, such entries never get [updated with
> the new
> {{lastProcessed}}|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472]
>  time, so any entry that completes processing right after
> [filtering|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462]
>  and before [the check for stale
> entries|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560]
>  is identified as stale and deleted from the UI until the next
> {{checkForLogs}} run. This is because the [updated {{lastProcessed}} time is
> used as the staleness
> criterion|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557],
>  and event log entries that were never updated with the new time match it.
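> To make the race concrete, here is a minimal, self-contained sketch of the
> flawed sequence (hypothetical names and data structures; the real logic
> lives in {{FsHistoryProvider.checkForLogs}}). An entry that finishes
> processing between the filter and the stale check keeps its old
> {{lastProcessed}} time and is wrongly removed:
> {code:scala}
> import scala.collection.concurrent.TrieMap
> 
> object StaleCheckRace {
>   case class LogInfo(path: String, lastProcessed: Long)
> 
>   // Stand-ins for the SHS listing store and the set of in-flight entries.
>   val store = TrieMap.empty[String, LogInfo]
>   val processing = TrieMap.empty[String, Unit]
> 
>   def checkForLogs(newLastScanTime: Long): Unit = {
>     // 1. Entries still being processed are filtered out of the scan...
>     val scanned = store.values.filterNot(i => processing.contains(i.path))
>     // 2. ...so only the survivors get the new lastProcessed time.
>     scanned.foreach { i =>
>       store.update(i.path, i.copy(lastProcessed = newLastScanTime))
>     }
>     // An entry may finish processing *here*: too late to be refreshed.
>     // 3. Stale check: anything not refreshed this scan is treated as
>     //    deleted on disk and removed from the UI until the next run.
>     store.values.filter(_.lastProcessed < newLastScanTime)
>       .foreach(i => store.remove(i.path))
>   }
> }
> {code}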
> The issue can be reproduced by generating a large number of event logs and
> uploading them to the SHS event log directory on S3. Around 800 (82.6 MB)
> copies of an event log file were created using the
> [shs-monitor|https://github.com/vladhlinsky/shs-monitor] script. The SHS
> then showed strange behavior when counting the total number of applications:
> at first the number increased as expected, but on the next page refresh the
> total decreased. No errors were logged by the SHS.
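> For illustration only, the fan-out step could be approximated with a sketch
> like the following (hypothetical names, and a local directory in place of
> the S3 bucket; the actual shs-monitor script may work differently):
> {code:scala}
> import java.nio.file.{Files, Paths, StandardCopyOption}
> 
> object GenerateLogs extends App {
>   // Any finished event log file; the name here is made up.
>   val src = Paths.get("application_1608000000000_0001")
>   // Stand-in for spark.history.fs.logDirectory.
>   val dir = Paths.get("/tmp/spark-events")
>   Files.createDirectories(dir)
>   // Fan the log out ~800 times, as in the report above.
>   (1 to 800).foreach { i =>
>     Files.copy(src, dir.resolve(f"application_1608000000000_$i%04d"),
>       StandardCopyOption.REPLACE_EXISTING)
>   }
> }
> {code}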


