Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 merged PR #14917: URL: https://github.com/apache/kafka/pull/14917 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1888236729 Test failures appear unrelated, the test which is being changed has no failures, and this is just a timeout increase. Merging.
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
vamossagar12 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1871795472 I noticed another instance of this test's failure here: https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15072/5/tests/
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841939038 Okay, I'll let it keep running, but it appears that the 4-minute timeout has an ~85% fail rate at 20% CPU, and a ~0% fail rate at 30% CPU. The 2-minute timeout has a 100% fail rate at 20% and 30% CPU, and a 50% fail rate at 40% CPU. The additional timeout is helping, but still isn't as tolerant of low CPU percentages as the BeforeAll strategy.
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841810754 I did some stress-testing on my "BeforeAll" fix overnight and found a new failure mode:

```
org.opentest4j.AssertionFailedError:
expected: <{"state":"RUNNING","spec":{"class":"org.apache.kafka.trogdor.task.NoOpTaskSpec","startMs":552,"durationMs":500},"startedMs":552,"status":"receiving"}>
but was: <{"state":"RUNNING","spec":{"class":"org.apache.kafka.trogdor.task.NoOpTaskSpec","startMs":552,"durationMs":500},"startedMs":552,"status":"active"}>
```

This occurred in 29 of 7531 runs (0.38% chance) at 20% CPU. I'm going to re-run the test with your branch to see how effective the 4-minute timeout is.
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
splett2 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841683355 @gharris1727 pretty odd that class loading takes so long, but the analysis seems reasonable.
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1839818405 I tested out adding a MiniTrogdorCluster start/stop in a `@BeforeAll` method in the test, and I'm getting reliable test passes at about 20% CPU, where before it was consistent failures. It appears that the first start of MiniTrogdorCluster takes ~110 seconds, while subsequent starts take ~2 seconds. That would explain the trend I was seeing in gradle enterprise, where one test has >30s runtime, while every other test has ~2s runtime. I would probably recommend increasing the timeout in this test to 4 minutes, or adding in the `@BeforeAll` warm-up method I mentioned, in lieu of disabling this test.
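
The warm-up idea in the comment above can be illustrated with a small simulation that uses only the JDK. This is a sketch under stated assumptions, not the change made in the PR: `SlowFirstStart` is a hypothetical stand-in for a component whose first start is expensive (it is not `MiniTrogdorCluster`), the one-time cost is simulated with `Thread.sleep`, and the 300 ms cost and 150 ms timeout are arbitrary illustrative values. The untimed call plays the role that a `@BeforeAll` method would play in the real test.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WarmUpSketch {

    /** Hypothetical stand-in for a component whose first start is slow. */
    static final class SlowFirstStart {
        private final long firstStartCostMs;
        private boolean warmed = false;

        SlowFirstStart(long firstStartCostMs) {
            this.firstStartCostMs = firstStartCostMs;
        }

        // Only the first call pays the simulated one-time cost.
        synchronized void startAndStop() throws InterruptedException {
            if (!warmed) {
                Thread.sleep(firstStartCostMs);
                warmed = true;
            }
        }
    }

    /** Runs body on a worker thread; returns true if it finishes within timeoutMs. */
    static boolean finishesWithin(Callable<Void> body, long timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> future = executor.submit(body);
            try {
                future.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the still-running body
                return false;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    /** Simulates one timed test, optionally preceded by an untimed warm-up. */
    static boolean timedTestPasses(boolean warmUpFirst, long firstStartCostMs, long timeoutMs)
            throws Exception {
        SlowFirstStart component = new SlowFirstStart(firstStartCostMs);
        if (warmUpFirst) {
            // Untimed start/stop, analogous to a warm-up step before the test timer starts.
            component.startAndStop();
        }
        return finishesWithin(() -> {
            component.startAndStop();
            return null;
        }, timeoutMs);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("cold:   " + timedTestPasses(false, 300, 150));
        System.out.println("warmed: " + timedTestPasses(true, 300, 150));
    }
}
```

With these values the cold run exceeds its timeout while the warmed run finishes well inside it, which mirrors the observation that only the first start in the suite is slow. Raising the timeout instead, as was eventually merged, trades the extra set-up step for a looser bound on every run.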
Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
gharris1727 commented on PR #14917: URL: https://github.com/apache/kafka/pull/14917#issuecomment-1839438204

I was looking into this too, and haven't figured out a stabilization yet. I profiled the tests in an environment limited to 40% CPU, and reproduced very similar flakes in several of the tests in this suite. Since I'm profiling one test at a time, I'm seeing a lot of impact from loading classes (`loadClass` highlighted in purple, wall clock trace):

![Screenshot 2023-12-04 at 12 32 03 PM](https://github.com/apache/kafka/assets/5856969/e9244101-920a-4ff9-88e8-cb28d2e7d456)

In terms of runtime, `JsonRestServer.start` is taking up 75% of the test duration and doesn't seem to have any obvious slowdowns.

My current working theory is that this test is arbitrarily picked to be the first test in the suite, and the `loadClass` overhead is enough to push it over the 2-minute test timeout. If that's the case, we might expect disabling the test to cause one of the other tests to become flaky. I would support merging this change just to test that hypothesis.

I was planning on exploring "warming" the classloader cache before the test timer starts, or increasing the test timeout.
[PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]
splett2 opened a new pull request, #14917: URL: https://github.com/apache/kafka/pull/14917

### What
Disables `testTaskRequestWithOldStartMsGetsUpdated` which seems to flake in CI frequently. I tried to take a look and could not get the test failure to repro locally. The trogdor code is not production critical and is mostly stable as well, so disabling seems fine.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)