Akshat-Jain opened a new pull request, #17907: URL: https://github.com/apache/druid/pull/17907
### Description

Lately, there have been quite a few occurrences of the `standard-its / (Compile=openjdk17, Run=openjdk17, Cluster Build On K8s` standard integration test (which triggers `ITNestedQueryPushDownTest`) getting stuck and failing after 6 hours. Sample run: https://github.com/apache/druid/actions/runs/14400145606/job/40386733833

TLDR: This PR updates the Maven command for this particular GitHub Actions job to exclude the `web-console` module. It's fine to do so, as that module doesn't have anything to do with the test anyway.

Adding detailed investigation notes below:

On investigating, all such failed runs seem to be stuck at the following line in `setup_druid_on_k8s.sh`:

```
mvn -B -ff -q \
  install \
  -Pdist,bundle-contrib-exts \
  -Pskip-static-checks,skip-tests \
  -Dmaven.javadoc.skip=true -T1C
```

On removing `-q` (quiet mode) from the above command, I found that the command hung immediately after successfully finishing the `benchmarks` module.

I tried adding some debug steps in the workflow to get some monitoring data for a failed run vs a successful run. The failed run had an `npm ci` process (which is most likely the stuck process), whereas the successful one didn't.

Also, the reactor build order that gets used has the `web-console` module immediately after the `benchmarks` module, so this aligns with the previous observation that the command seemed stuck immediately after successfully finishing the `benchmarks` module.

This PR mainly excludes the `web-console` module from the Maven command by adding `-pl '!web-console'`.

Apart from this, a couple of other things have been done:

1. The Maven command has been moved from `setup_druid_on_k8s.sh` to `standard-its.yml`.
   Previously, `standard-its.yml` was calling `MAVEN_OPTS='-Xmx2048m' ${MVN} verify -pl integration-tests -P int-tests-config-file ${IT_TEST} ${MAVEN_SKIP} -Dpod.name=${POD_NAME} -Dpod.namespace=${POD_NAMESPACE} -Dbuild.druid.cluster=${BUILD_DRUID_CLUSTER}`, which called a chain of bash scripts, which called the other Maven command, which then got stuck. Essentially, a Maven command was triggering another Maven command via a chain of bash scripts. Such nested Maven commands could potentially run into other issues, like competing for a lock on the local Maven repo, which are very difficult to debug. Hence, I brought that "inner" Maven command to the same level as the "outer" Maven command, and now they run sequentially. It's also easier to reason about the flow this way.
2. I have also added `set -x` to enable debugging for the bash scripts involved in running this IT. I think it's better to have it: it's not a lot of noise, and those scripts are only used for this IT. So I think it'd be good to improve the ability to debug in the future. On a similar note, I have removed the `-q` (quiet mode) argument from the Maven command.

I tried a bunch of runs for this test with this PR's change, and all of them passed:

1. 8/8 successful runs [here](https://github.com/Akshat-Jain/druid/actions/runs/14429675741/job/40464412441?pr=7)
2. 5/5 successful runs [here](https://github.com/Akshat-Jain/druid/actions/runs/14429950271/job/40464375268?pr=9)
3. 2/2 successful runs [here](https://github.com/Akshat-Jain/druid/actions/runs/14430480632/job/40464926328?pr=8)

<hr>

This PR has:

- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
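
For reference, the build command change described in the PR (dropping `-q` and excluding `web-console` with `-pl '!web-console'`) can be sketched as a small dry-run script. This is only an illustration assembled from the flags quoted in the description, not the actual contents of `standard-its.yml`; the position of `-pl` among the arguments and the `DRY_RUN` switch are assumptions made for this sketch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Arguments taken from the install command quoted in the description,
# without -q (quiet mode) and with web-console excluded from the reactor.
# Where -pl sits among the arguments is an assumption of this sketch.
build_args=(
  -B -ff
  install
  -pl '!web-console'
  -Pdist,bundle-contrib-exts
  -Pskip-static-checks,skip-tests
  -Dmaven.javadoc.skip=true
  -T1C
)

# DRY_RUN is a hypothetical switch for this sketch, not part of the PR.
# By default the command is only printed, so the script is safe to run
# on a machine without Maven or a Druid checkout.
if [[ "${DRY_RUN:-1}" == "1" ]]; then
  echo "mvn ${build_args[*]}"
else
  mvn "${build_args[@]}"
fi
```

Running it with `DRY_RUN=0` from the repository root would invoke Maven with these arguments. The single quotes around `!web-console` keep an interactive shell from treating `!` as history expansion.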
