yashmayya opened a new pull request, #17622:
URL: https://github.com/apache/pinot/pull/17622
## Summary
Fixes critical bugs in `ResponseStoreCleaner` that caused:
- Response store directories on brokers to grow unbounded (100s of GBs)
until disk exhaustion
- Cleanup jobs to fail with timeout errors when response stores are large
- Cascading 404 errors during delete operations due to duplicate requests
## Problem
The `ResponseStoreCleaner` periodic task had several issues preventing
effective cleanup:
1. **URL accumulation bug**: The `brokerUrls` list was declared outside the
broker iteration loop but never cleared between iterations. This caused:
- Delete URLs to accumulate across all brokers
- Requests being sent to wrong brokers (e.g., trying to delete broker1's
responses via broker2)
- Duplicate delete requests causing 404 errors
2. **Insufficient timeout**: The 3-second timeout was too short for brokers
with large response stores, as `getAllStoredResponses()` must read and
deserialize every response file.
3. **404 treated as failure**: Delete operations returning 404 were treated
as errors, but for idempotent cleanup, a missing response means the goal is
already achieved.
4. **Single broker failure blocked all cleanup**: An exception from any
broker stopped processing of all remaining brokers.
## Changes
- **Split timeouts**: Increased GET timeout to 60s (from 3s) for listing
operations; DELETE timeout set to 10s
- **Fixed URL scoping**: Moved `brokerUrls` declaration inside the broker
loop so each broker's URLs are processed independently
- **Idempotent deletes**: 404 responses are now treated as success (response
already deleted)
- **Partial failure tolerance**: Cleanup continues for remaining brokers
even if one fails; only throws if ALL brokers fail
- **Better error isolation**: Separate try-catch blocks for auth, GET, and
per-broker DELETE operations
- **Improved logging**: Added structured logging for cleanup progress and
partial failures
## Test Plan
- [ ] Verify existing `CursorIntegrationTest` passes
- [ ] Manual testing with multiple brokers to verify per-broker URL isolation
- [ ] Verify 404 responses during delete are logged at DEBUG level and don't
cause failures
- [ ] Test partial failure scenario (one broker down) to confirm other
brokers are still cleaned
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]