Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2024-01-11 Thread via GitHub


gharris1727 merged PR #14917:
URL: https://github.com/apache/kafka/pull/14917


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2024-01-11 Thread via GitHub


gharris1727 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1888236729

   Test failures appear unrelated: the test being changed has no failures, 
and this is just a timeout increase. Merging.





Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-28 Thread via GitHub


vamossagar12 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1871795472

   I noticed another instance of this test's failure here: 
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15072/5/tests/





Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-05 Thread via GitHub


gharris1727 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841939038

   Okay, I'll let it keep running, but it appears that the 4-minute timeout has 
an ~85% fail rate at 20% CPU, and ~0% fail rate at 30% CPU.
   
   The 2-minute timeout has a 100% fail rate at 20% and 30% CPU, and a 50% fail 
rate at 40% CPU. The additional timeout is helping, but still isn't tolerant of 
low CPU percentages like the BeforeAll strategy.
   





Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-05 Thread via GitHub


gharris1727 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841810754

   I did some stress-testing on my "BeforeAll" fix overnight and found a new 
failure mode:
   
   ```
   org.opentest4j.AssertionFailedError: expected: <{"state":"RUNNING","spec":{"class":"org.apache.kafka.trogdor.task.NoOpTaskSpec","startMs":552,"durationMs":500},"startedMs":552,"status":"receiving"}>
   but was: <{"state":"RUNNING","spec":{"class":"org.apache.kafka.trogdor.task.NoOpTaskSpec","startMs":552,"durationMs":500},"startedMs":552,"status":"active"}>
   ```
   
   This occurred in 29 of 7531 runs (0.38% chance) at 20% CPU.
   
   I'm going to re-run the test with your branch to see how effective the 
4-minute timeout is.
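
For reference, the arithmetic behind the figure above can be sketched as follows. 29 failures in 7531 runs is roughly 0.385%, which the comment truncates to 0.38%. The class and method names are hypothetical and not part of the Kafka code base. The second method assumes runs fail independently, which may not hold on a shared CI host.

```java
// Hypothetical helper for reasoning about flake rates; not part of Kafka.
final class FlakeRate {
    private FlakeRate() {}

    /** Observed failure rate as a percentage of all runs. */
    static double percent(long failures, long runs) {
        if (runs <= 0 || failures < 0 || failures > runs) {
            throw new IllegalArgumentException("need runs > 0 and 0 <= failures <= runs");
        }
        return 100.0 * failures / runs;
    }

    /**
     * Probability (0..1) that at least one of n runs fails.
     * Assumption: runs are independent with the same per-run failure rate.
     */
    static double atLeastOneFailure(double perRunRate, int n) {
        return 1.0 - Math.pow(1.0 - perRunRate, n);
    }

    public static void main(String[] args) {
        double pct = percent(29, 7531); // counts taken from the comment above
        System.out.printf("failure rate: %.3f%%%n", pct);
        System.out.printf("P(at least one failure in 100 runs): %.3f%n",
                atLeastOneFailure(pct / 100.0, 100));
    }
}
```

Under the independence assumption, even a per-run rate this small adds up over many CI builds, which is one way to judge whether a rare failure mode is worth chasing.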





Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-05 Thread via GitHub


splett2 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1841683355

   @gharris1727 pretty odd that class loading takes so long, but the analysis 
seems reasonable.





Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-04 Thread via GitHub


gharris1727 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1839818405

   I tested out adding a MiniTrogdorCluster start/stop in a `@BeforeAll` method 
in the test, and I'm getting reliable test passes at about 20% CPU, where 
before it was consistent failures.
   It appears that the first start of MiniTrogdorCluster takes ~110 seconds, 
while subsequent starts take ~2 seconds. That would explain the trend I was 
seeing in Gradle Enterprise, where one test has >30s runtime, while every other 
test has ~2s runtime.
   
   I would probably recommend increasing the timeout in this test to 4 minutes, 
or adding in the `@BeforeAll` warm-up method I mentioned, in lieu of disabling 
this test.
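
The warm-up idea described above can be sketched in plain Java as follows. This is only an illustration: `FakeCluster` is a hypothetical stand-in for `MiniTrogdorCluster`, and the "cold start" flag models the one-time cost (such as class loading) that the first start pays. In a JUnit 5 test class, `warmUp` would be a static method annotated with `@BeforeAll`, assuming the timeout in question covers only the test methods.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for MiniTrogdorCluster; the real class is not used here.
final class FakeCluster implements AutoCloseable {
    // Shared by all instances: models JVM-wide one-time costs such as class loading.
    private static final AtomicBoolean WARM = new AtomicBoolean(false);
    private final boolean coldStart;

    private FakeCluster(boolean coldStart) {
        this.coldStart = coldStart;
    }

    /** Starts a cluster; only the first start in this JVM counts as cold. */
    static FakeCluster start() {
        return new FakeCluster(WARM.compareAndSet(false, true));
    }

    boolean wasColdStart() {
        return coldStart;
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}

final class WarmUpSketch {
    private WarmUpSketch() {}

    // Would be a static @BeforeAll method in a JUnit 5 test class (assumption:
    // the test timeout applies to test methods only, not to this hook).
    static void warmUp() {
        try (FakeCluster cluster = FakeCluster.start()) {
            // Start and stop immediately; the point is only to pay the one-time cost.
        }
    }

    // Stands in for a test body that runs under a timeout.
    static boolean timedTestSeesColdStart() {
        try (FakeCluster cluster = FakeCluster.start()) {
            return cluster.wasColdStart();
        }
    }

    public static void main(String[] args) {
        warmUp();
        System.out.println("cold start inside timed test: " + timedTestSeesColdStart());
    }
}
```

The design choice here is to move the expensive first start out of the timed region rather than widen the timeout, so whichever test happens to run first no longer absorbs the start-up cost.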


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-04 Thread via GitHub


gharris1727 commented on PR #14917:
URL: https://github.com/apache/kafka/pull/14917#issuecomment-1839438204

   I was looking into this too, and haven't figured out a stabilization yet.
   
   I profiled the tests in an environment limited to 40% CPU, and reproduced 
similar flakes in several of the tests in this suite. Since I'm 
profiling one test at a time, I'm seeing a lot of impact from loading classes 
(`loadClass` highlighted in purple, wall clock trace)
   ![Screenshot 2023-12-04 at 12 32 03 PM](https://github.com/apache/kafka/assets/5856969/e9244101-920a-4ff9-88e8-cb28d2e7d456)
   In terms of runtime, the JsonRestServer.start is taking up 75% of the test 
duration and doesn't seem to have any obvious slowdowns.
   
   My current working theory is that this test is arbitrarily picked to be the 
first test in the suite, and the loadClass overhead is enough to tip over the 
2-minute test timeout. If that's the case, we might expect disabling the test 
to cause one of the other tests to become flaky. I would support merging this 
change just to test that hypothesis.
   
   I was planning on exploring "warming" the classloader cache before the test 
timer starts, or increasing the test timer.
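
One possible shape for "warming" the class loader before the timer starts is to force-load a known list of classes up front. The sketch below is an assumption about how that could look, not what the PR did: `ClassPreloader` and the class names passed to it are hypothetical, and a real warm-up would need the list of classes the test suite actually touches.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Kafka or Trogdor.
final class ClassPreloader {
    private ClassPreloader() {}

    /**
     * Loads and initializes each named class, returning how many succeeded.
     * Classes that cannot be found or linked are skipped rather than failing the warm-up.
     */
    static int preload(List<String> classNames) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        int loaded = 0;
        for (String name : classNames) {
            try {
                // initialize = true also runs static initializers, which are part of the first-use cost.
                Class.forName(name, true, loader);
                loaded++;
            } catch (ClassNotFoundException | LinkageError e) {
                // Skip: a missing class should not abort the warm-up.
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        // Example names only; the second one is deliberately missing.
        List<String> names = Arrays.asList(
                "java.util.concurrent.ConcurrentSkipListMap",
                "no.such.Clazz");
        System.out.println("classes preloaded: " + preload(names));
    }
}
```

Compared with starting and stopping a whole cluster, this approach is cheaper but more brittle, since the class list has to be kept in sync with what the tests load.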
   





[PR] KAFKA-15760: Disable flaky test testTaskRequestWithOldStartMsGetsUpdated [kafka]

2023-12-04 Thread via GitHub


splett2 opened a new pull request, #14917:
URL: https://github.com/apache/kafka/pull/14917

   ### What
   Disables `testTaskRequestWithOldStartMsGetsUpdated`, which seems to flake in 
CI frequently. I tried to take a look and could not reproduce the test failure 
locally. The trogdor code is not production-critical and is mostly stable 
as well, so disabling seems fine.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

