zuston opened a new issue, #503: URL: https://github.com/apache/incubator-uniffle/issues/503
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

I found that full GC occurs when a shuffle server holds too many partitions, and the pause can last 20s+.

Env:
1. Using Java 8
2. xmx=30g, buffer-capacity=10g, read-capacity=10g
3. From the metric dashboard, during peak hours there are 20k partitions on a shuffle server, but the disk used-capacity ratio is only 0.1-0.2

My guess is that the object allocation rate is higher than the rate at which the GC can reclaim memory, which causes the stop-the-world (STW) pause.

### Affects Version(s)

master

### Uniffle Server Log Output

_No response_

### Uniffle Engine Log Output

_No response_

### Uniffle Server Configurations

```yaml
We use the default G1 GC configuration.
```

### Uniffle Engine Configurations

_No response_

### Additional context

### Possible solutions

1. Change the garbage collector to CMS.
2. Expand the Uniffle cluster by adding more shuffle servers.
3. If the number of partitions on one shuffle server exceeds a threshold, make it fall back to ESS.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
