gavinchou commented on PR #64167: URL: https://github.com/apache/doris/pull/64167#issuecomment-4849868512
Robustness issue in `TenantLevelColocateTableCheckerAndBalancer#matchGroups()`: one bad group can abort the whole checker round.

`matchGroups()` iterates all tenant-level colocate groups without per-group exception isolation. Some deeper checks use `Preconditions.checkState(...)`, for example when matching backend bucket sequences and tablet counts. If one group has inconsistent metadata/tablets and throws, the method exits and later groups are not checked or repaired in this round. If the bad group stays inconsistent, it can keep blocking other groups.

Relevant code paths:
- group loop without per-group try/catch: https://github.com/apache/doris/blob/cde59482ce5a548a2652c3aead57096a9c832f22/fe/fe-core/src/main/java/org/apache/doris/clone/TenantLevelColocateTableCheckerAndBalancer.java#L186-L221
- `Preconditions` inside matching: https://github.com/apache/doris/blob/cde59482ce5a548a2652c3aead57096a9c832f22/fe/fe-core/src/main/java/org/apache/doris/clone/TenantLevelColocateTableCheckerAndBalancer.java#L301-L323

Can we isolate failures per group, log/mark only that group as unstable, and continue checking the remaining groups? That would also match the tenant-level isolation goal of this feature.
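For illustration, a minimal sketch of the per-group isolation suggested above. It is not the PR's code: `PerGroupIsolationSketch`, `GroupMatcher`, `matchOneGroup`, and the string group ids are hypothetical stand-ins, and the real method would mark the group unstable in the colocate index rather than collect ids in a list. The only fact relied on is that Guava's `Preconditions.checkState` throws `IllegalStateException`, which is a `RuntimeException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the checker; none of these names come from the Doris code base.
public class PerGroupIsolationSketch {

    /** Hypothetical per-group check (bucket sequence / tablet matching in the real code). */
    public interface GroupMatcher {
        void matchOneGroup(String groupId);
    }

    /**
     * Runs the matcher for every group. A failure in one group is caught, logged,
     * and recorded; the loop then continues with the remaining groups.
     * Returns the ids of the groups that failed in this round.
     */
    public static List<String> matchGroups(List<String> groupIds, GroupMatcher matcher) {
        List<String> unstableGroups = new ArrayList<>();
        for (String groupId : groupIds) {
            try {
                matcher.matchOneGroup(groupId);
            } catch (RuntimeException e) {
                // Covers IllegalStateException from Preconditions.checkState(...).
                System.err.println("match failed for group " + groupId + ": " + e.getMessage());
                unstableGroups.add(groupId);
            }
        }
        return unstableGroups;
    }

    public static void main(String[] args) {
        List<String> checked = new ArrayList<>();
        List<String> unstable = matchGroups(Arrays.asList("g1", "g2", "g3"), groupId -> {
            if (groupId.equals("g2")) {
                // Simulates a failed precondition on inconsistent metadata.
                throw new IllegalStateException("tablet count mismatch");
            }
            checked.add(groupId);
        });
        System.out.println("checked=" + checked + " unstable=" + unstable);
    }
}
```

In this sketch, `g3` is still checked even though `g2` throws, and only `g2` ends up in the returned list. Catching `RuntimeException` per group is a deliberate choice here; a narrower catch would work too if the set of expected failures is known.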
