hippozjs opened a new issue, #6710: URL: https://github.com/apache/kyuubi/issues/6710
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### What would you like to be improved?

Problem description: when multiple Spark jobs are submitted concurrently to the same Kubernetes (K8s) namespace via Kyuubi, several Spark drivers may start successfully and enter the running state at the same time. Together they can exhaust the namespace's resources, leaving nothing for the Spark executors to start with. Each driver then waits for executor resources without releasing its own, producing a deadlock in which the drivers are mutually waiting on resources held by one another.

### How should we improve?

Proposed solution: we tried YuniKorn gang scheduling to solve this, but it cannot completely avoid the deadlock. We therefore propose adding a switch to Kyuubi. When the switch is on, Spark jobs submitted to the same namespace are processed sequentially instead of in parallel, and submission of the current job depends on the running state of the previous job:

1. If both the driver and an executor exist, the previous Spark job has successfully acquired its resources, and the current Spark job can be submitted.
2. If only the driver exists and no executor is present, the system waits for a period of time and then re-checks the previous job's status, repeating until both the driver and an executor are present, at which point the current Spark job is submitted.
3. A timeout can be configured. If the timeout is greater than 0 and the driver-and-executor condition is not met before it elapses, the previous driver is killed and the current Spark job is submitted.
If the configured timeout is less than or equal to 0, the system waits indefinitely until the previous Spark job has both its driver and an executor before submitting the current Spark job.

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [ ] No. I cannot submit a PR at this time.
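One possible shape for the proposed switch in `kyuubi-defaults.conf`. All property names below are hypothetical placeholders invented for this proposal, not existing Kyuubi configurations:

```properties
# Hypothetical switch: process Spark jobs targeting the same K8s namespace
# sequentially instead of in parallel
kyuubi.kubernetes.spark.submit.sequential.enabled=true

# Hypothetical timeout: how long to wait for the previous job to have both a
# driver and an executor before killing its driver; a value <= 0 means wait
# indefinitely
kyuubi.kubernetes.spark.submit.sequential.timeoutMs=300000
```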
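The waiting logic in points 1–3 could be sketched roughly as below. This is a minimal illustration of the proposed behavior, not Kyuubi's actual implementation; `driverRunning`, `executorRunning`, and `killDriver` are hypothetical callbacks standing in for Kubernetes pod-status checks and the driver-kill action:

```scala
object SequentialSubmitGuard {
  sealed trait Decision
  // Previous job holds both driver and executor: safe to submit the next job.
  case object Submit extends Decision
  // Timeout elapsed without an executor: previous driver was killed first.
  case object KillPreviousAndSubmit extends Decision

  /**
   * Poll the previous job's state until its driver and at least one executor
   * coexist, or until the (positive) timeout elapses. A timeoutMs <= 0 means
   * wait indefinitely.
   */
  def awaitPreviousJob(
      driverRunning: () => Boolean,
      executorRunning: () => Boolean,
      killDriver: () => Unit,
      timeoutMs: Long,
      pollIntervalMs: Long = 100L): Decision = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (true) {
      // Case 1: driver and executor both present -> resources acquired.
      if (driverRunning() && executorRunning()) {
        return Submit
      }
      // Case 3: positive timeout exceeded -> kill the previous driver.
      if (timeoutMs > 0 && System.currentTimeMillis() >= deadline) {
        killDriver()
        return KillPreviousAndSubmit
      }
      // Case 2: only the driver exists -> wait and re-check.
      Thread.sleep(pollIntervalMs)
    }
    Submit // unreachable; satisfies the type checker
  }
}
```

With `timeoutMs <= 0` the loop never takes the kill branch, matching the "wait indefinitely" behavior described above.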
