[jira] [Updated] (KAFKA-19286) RemoteIndexCacheTest.testConcurrentRemoveReadForCache1 is flaky

Uladzislau Blok (Jira) Sat, 15 Nov 2025 13:09:06 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-19286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uladzislau Blok updated KAFKA-19286:
------------------------------------
    Description: 
[https://github.com/apache/kafka/actions/runs/15046629450/job/42291644617?pr=19493]
 build failure for this test.

Over the past 7 days, 463 runs gave 25 flakes and 4 failures.

[https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.storage.internals.log.RemoteIndexCacheTest]

I was able to re-produce error with repeated test: 
{code:java}
100000 tests completed, 3 failed
> Task :storage:test FAILED{code}
 

Looks like cache wasn't propagated:
{code:java}
org.opentest4j.AssertionFailedError: Condition not met within timeout 15000. 
cache did not adhere to expected size of 1; was 0{code}
Error message is similar for all runs.

 

Logs from success run:
{code:java}
2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,715] ERROR Inside markForCleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,715] ERROR Inside createCacheEntry 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:362)
2025-11-15T21:42:23.716+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,716] ERROR Inside cleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
 

Logs from failed run:
{code:java}
2025-11-15T21:42:23.720+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,719] ERROR Inside markForCleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
2025-11-15T21:42:23.722+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,721] ERROR Inside cleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
 

This means that entry still was in cache. The root cause is too low delay for 
scheduler:
{code:java}
cache.setFileDeleteDelayMs(0); {code}
Before scheduler starts clean up, he's triggering read thread. If it's 
happening before RemoteIndexCache#remove does remove entity from cache, reader 
thread will successfully read entity, and won't trigger fetching from remote log

TL;DR

We need to increase value in cache.setFileDeleteDelayMs(); 

  was:
[https://github.com/apache/kafka/actions/runs/15046629450/job/42291644617?pr=19493]
 build failure for this test.

Over the past 7 days, 463 runs gave 25 flakes and 4 failures.

[https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.storage.internals.log.RemoteIndexCacheTest]

I was able to re-produce error with repeated test: 
{code:java}
100000 tests completed, 3 failed
> Task :storage:test FAILED{code}
 

Looks like cache wasn't propagated:
{code:java}
org.opentest4j.AssertionFailedError: Condition not met within timeout 15000. 
cache did not adhere to expected size of 1; was 0{code}
Error message is similar for all runs.

 

Did deeper analysis:

Logs from success run:

 
{code:java}
2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,715] ERROR Inside markForCleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,715] ERROR Inside createCacheEntry 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:362)
2025-11-15T21:42:23.716+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,716] ERROR Inside cleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
Logs from failed run:

 

 
{code:java}
2025-11-15T21:42:23.720+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,719] ERROR Inside markForCleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
2025-11-15T21:42:23.722+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
21:42:23,721] ERROR Inside cleanup 
(org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
This means that entry still was in cache. The root cause is too low delay for 
scheduler:
{code:java}
cache.setFileDeleteDelayMs(0); {code}
Before scheduler starts clean up, he's triggering read thread. If it's 
happening before RemoteIndexCache#remove does remove entity from cache, reader 
thread will successfully read entity, and won't trigger fetching from remote log

 


> RemoteIndexCacheTest.testConcurrentRemoveReadForCache1 is flaky
> ---------------------------------------------------------------
>
>                 Key: KAFKA-19286
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19286
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Andrew Schofield
>            Assignee: Uladzislau Blok
>            Priority: Major
>
> [https://github.com/apache/kafka/actions/runs/15046629450/job/42291644617?pr=19493]
>  build failure for this test.
> Over the past 7 days, 463 runs gave 25 flakes and 4 failures.
> [https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.storage.internals.log.RemoteIndexCacheTest]
> I was able to re-produce error with repeated test: 
> {code:java}
> 100000 tests completed, 3 failed
> > Task :storage:test FAILED{code}
>  
> Looks like cache wasn't propagated:
> {code:java}
> org.opentest4j.AssertionFailedError: Condition not met within timeout 15000. 
> cache did not adhere to expected size of 1; was 0{code}
> Error message is similar for all runs.
>  
> Logs from success run:
> {code:java}
> 2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
> 21:42:23,715] ERROR Inside markForCleanup 
> (org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
> 2025-11-15T21:42:23.715+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
> 21:42:23,715] ERROR Inside createCacheEntry 
> (org.apache.kafka.storage.internals.log.RemoteIndexCache:362)
> 2025-11-15T21:42:23.716+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
> 21:42:23,716] ERROR Inside cleanup 
> (org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
>  
> Logs from failed run:
> {code:java}
> 2025-11-15T21:42:23.720+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
> 21:42:23,719] ERROR Inside markForCleanup 
> (org.apache.kafka.storage.internals.log.RemoteIndexCache:555)
> 2025-11-15T21:42:23.722+0100 [DEBUG] [TestEventLogger]     [2025-11-15 
> 21:42:23,721] ERROR Inside cleanup 
> (org.apache.kafka.storage.internals.log.RemoteIndexCache:570) {code}
>  
> This means that entry still was in cache. The root cause is too low delay for 
> scheduler:
> {code:java}
> cache.setFileDeleteDelayMs(0); {code}
> Before scheduler starts clean up, he's triggering read thread. If it's 
> happening before RemoteIndexCache#remove does remove entity from cache, 
> reader thread will successfully read entity, and won't trigger fetching from 
> remote log
> TL;DR
> We need to increase value in cache.setFileDeleteDelayMs(); 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-19286) RemoteIndexCacheTest.testConcurrentRemoveReadForCache1 is flaky

Reply via email to