abhishekrb19 opened a new pull request, #15941: URL: https://github.com/apache/druid/pull/15941
This PR makes a few functional and non-functional changes in the auto-kill `KillUnusedSegments` duty. Functional changes: - Add a new metric `killTask/candidateSegments/count` that tracks the number of candidate unused segments per kill cycle. It's candidate because it's the kill task that ultimately kills the unused segments and the number can change. - Move a debug log to info for improved visibility. This should be fine as it gets logged once per indexing/kill task period. - Bug fix - auto-kill throws NPE when there are no datasources present and defaults mismatch. - Change the kill duty DruidException validations from user to operator persona as they're configured by operators typically. - Remove redundant checks in the code for code readability. - Remove all the unnecessary `@VisibleForTesting` annotations as the test code now verifies the duty directly instead of mocks + various hooks into the code. - Finalize some parameters and variables to clarify the intent. Testing changes: - Updated tests from using mocks (the mocks were difficult to use and buggy - for example, last updated timestamp was never used. - Fix a bug where `TestDruidCoordinatorConfig.DEFAULT_COORDINATOR_KILL_BUFFER_PERIOD ` was incorrectly set to 1 day instead of 90 days. While at it, I changed all the test coordinator defaults to match the same ISO 8601 period format as `DruidCoordinatorConfig`, so it's easy to spot check. - Add more test coverage for the different config parameters. - Add a couple of unit tests that are ignored currently as they fail. The duty doesn't consider segments spanning eternity, or end in `DateTimes.MAX` (outside the normal datetime string comparison range). TODO: need to clean up some test setup. ### Motivation In general, I found it difficult to track the progress of the auto-kill coordinator duty. Also, see some discussion in the community slack: https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1707941058448649. My preliminary conclusion from looking at a couple of small-to-medium clusters is that the auto-kill defaults are overly conservative and need to be revisited as https://github.com/apache/druid/issues/15912 notes. I would like to address some of the concerns raised in https://github.com/apache/druid/issues/15911 and https://github.com/apache/druid/issues/15912 separately once we have tests from this patch and some additional data in place. #### Release note <!-- Give your best effort to summarize your changes in a couple of sentences aimed toward Druid users. If your change doesn't have end user impact, you can skip this section. For tips about how to write a good release note, see [Release notes](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#release-notes). --> <hr> ##### Key changed/added classes in this PR * `KillUnusedSegments.java` * `KillUnusedSegmentsTest.java` <hr> <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. --> This PR has: - [ ] been self-reviewed. - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.) - [ ] added documentation for new or modified features or behaviors. - [ ] a release note entry in the PR description. - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md) - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [ ] added integration tests. - [ ] been tested in a test Druid cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
