[
https://issues.apache.org/jira/browse/YUNIKORN-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948709#comment-17948709
]
Mit Desai commented on YUNIKORN-3076:
-------------------------------------
This is an example of what the response looks like. The first object is the one
that causes the failure to load: it does not have the stateLog object.
When checking for the corresponding applicationID, I found that the cluster no
longer had that job in the namespace. Most likely the job was terminated and the
pod removed, but for some reason YuniKorn still thinks that the pod is in
the system.
{noformat}
[{
"applicationID": "spark-app-id1",
"partition": "default",
"queueName": "root.NAMESPACE.SUBQUEUE",
"submissionTime": 1231231231321232123,
"applicationState": "New",
"user": "username",
"groups": ["group1","group2"],
"maxRequestPriority": -2147483648
},{
"applicationID": "spark-app-id2",
"usedResource": {
"memory": 2000000000000,
"pods": 201,
"vcore": 600000
},
"maxUsedResource": {
"memory": 2000000000000,
"pods": 201,
"vcore": 600000
},
"pendingResource": {
"memory": 0,
"pods": 0,
"vcore": 0
},
"partition": "default",
"queueName": "root.NAMESPACE.SUBQUEUE",
"submissionTime": 1231231231321232123,
"allocations": [{
"allocationKey": "6f4e57a2-43e5-413a-8e65-9853bccb593f",
"allocationTags": {
"tags-removed": "not-applicable"
},
"requestTime": 1231231231321232123,
"allocationTime": 1231231231321232123,
"allocationDelay": 2134568776,
"uuid": "6f4e57a2-43e5-413a-8e65-9853bccb593f-0",
"allocationID": "6f4e57a2-43e5-413a-8e65-9853bccb593f-0",
"resource": {
"memory": 7682104331,
"pods": 1,
"vcore": 3000
},
"priority": "0",
"nodeId": "node-address",
"applicationId": "spark-app-id2",
"partition": "default"
}],
"applicationState": "Running",
"user": "username",
"groups": ["group1", "group2", "group3"],
"stateLog": [{
"time": 1231231231321232123,
"applicationState": "Accepted"
}, {
"time": 1231231231321232123,
"applicationState": "Starting"
}, {
"time": 1231231231321232123,
"applicationState": "Running"
}],
"maxRequestPriority": -2147483648
}]
{noformat}
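To spot the offending entry in a captured payload, something like the following could be used. This is only a sketch: the interface names, the function name, and the trimmed sample data are hypothetical, and only the fields shown in the response above are assumed. The time values are small placeholders rather than the real timestamps.

```typescript
// Hypothetical minimal shape of one entry in the applications payload.
// Only the fields needed for this check are declared.
interface StateLogEntry {
  time: number;
  applicationState: string;
}

interface AppSummary {
  applicationID: string;
  applicationState: string;
  stateLog?: StateLogEntry[] | null;
}

// Returns the IDs of applications whose stateLog is missing or is not an array.
function findAppsMissingStateLog(apps: AppSummary[]): string[] {
  return apps
    .filter((app) => !Array.isArray(app.stateLog))
    .map((app) => app.applicationID);
}

// Trimmed-down copy of the two objects in the response above
// (placeholder times, most fields omitted).
const payload: AppSummary[] = [
  { applicationID: "spark-app-id1", applicationState: "New" },
  {
    applicationID: "spark-app-id2",
    applicationState: "Running",
    stateLog: [
      { time: 1, applicationState: "Accepted" },
      { time: 2, applicationState: "Starting" },
      { time: 3, applicationState: "Running" },
    ],
  },
];

// Logs the IDs that lack a stateLog; here only spark-app-id1 qualifies.
console.log(findAppsMissingStateLog(payload));
```

Running this against the payload copied from the browser's network tab would list the applications that trip the UI.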
> Web UI Fails to Load Applications for Certain Queues on Heavily Loaded
> Clusters
> -------------------------------------------------------------------------------
>
> Key: YUNIKORN-3076
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3076
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: webapp
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
> Fix For: 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3
>
>
> On heavily loaded clusters, the Web UI randomly fails to load applications
> for certain queues. The allocations and resource usage are reported
> correctly, but the list of applications does not load for some queues. There
> is no definitive way to reproduce this scenario, but it has been observed
> frequently.
> {*}Initial Assumptions{*}: Initially, it was assumed that this issue could be
> due to a large payload being exchanged between the scheduler and the web UI,
> causing network latency or client-side parsing delays for a large number of
> applications/pods. However, this does not seem to be the case, as the issue
> was observed yesterday on a queue with just 3 applications and approximately
> 200 pods.
> {*}Root Cause{*}: Upon further debugging, it was found that not all
> applications come back with a 'stateLog' object. When the UI rendering
> occurs, there is an unconditional access to the stateLog object, which fails
> for applications that do not have it. This causes the rendering process to
> fail and results in a blank applications page.
> {*}Steps to Validate{*}:
> # When experiencing such issues in the Web UI, open the inspect panel and
> navigate to the network tab.
> # Clear any existing network items. Note: Clear the network items if you are
> moving to a different queue, as the UI will cache the applications object
> unless the page is refreshed.
> # Go to the applications tab and select the desired queue from the drop-down
> menu.
> # An 'Applications' request should appear in the network tab, showing the
> payload it received.
> # If the UI is not loading the applications, there will be an application
> with {{applicationState=New}} that does not have a stateLog object.
> {*}Proposed Solution{*}: Modify the UI rendering logic to handle cases where
> the stateLog object is missing, ensuring that it does not fail and give up on
> rendering the entire applications page. Implement error handling to either
> skip or provide a default value for applications without a stateLog object.
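
The proposed handling could look roughly like the sketch below. The actual UI code is not shown in this issue, so the interface and function names here are hypothetical and the "latest state change time" is only an assumed example of what the rendering reads from stateLog. The point is that a missing stateLog falls back to a default instead of throwing.

```typescript
// Hypothetical shapes; only the fields used by this sketch are declared.
interface StateChange {
  time: number;
  applicationState: string;
}

interface AppInfo {
  applicationID: string;
  applicationState: string;
  stateLog?: StateChange[] | null;
}

// Returns the time of the most recent state change, or null when the
// application has no stateLog (missing, null, or empty).
function lastStateChangeTime(app: AppInfo): number | null {
  const log = app.stateLog ?? []; // default to an empty log instead of failing
  const last = log[log.length - 1];
  return last ? last.time : null;
}

// An application in the New state without a stateLog no longer breaks the lookup.
const newApp: AppInfo = { applicationID: "spark-app-id1", applicationState: "New" };
console.log(lastStateChangeTime(newApp)); // null
```

With a guard like this, one application without a stateLog yields a default value while the remaining applications in the queue still render.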