itschrispeck commented on code in PR #14217:
URL: https://github.com/apache/pinot/pull/14217#discussion_r1800411604
##########
pinot-controller/src/main/java/org/apache/pinot/controller/validation/RealtimeSegmentValidationManager.java:
##########
@@ -169,6 +171,10 @@ private void runSegmentLevelValidation(TableConfig tableConfig, StreamConfig str
     if (_llcRealtimeSegmentManager.isDeepStoreLLCSegmentUploadRetryEnabled()) {
       _llcRealtimeSegmentManager.uploadToDeepStoreIfMissing(tableConfig, segmentsZKMetadata);
     }
+
+    if (_segmentAutoResetOnErrorAtValidation) {
+      _pinotHelixResourceManager.resetSegments(realtimeTableName, null, true);
+    }
Review Comment:
Adding a bit more background: on some of our largest clusters (>1M segments
per ZK) we sporadically find segments in error state. Many of these can be
fixed with a simple reset, and we would like to avoid operator intervention in
these cases.
One example case:
1. Server 1 completes the segment build but fails to upload it to deep store.
2. Server 2 is being restarted/upgraded/replaced; when it starts up, peer
download fails.
3. Server 1 backfills deep store via the async upload task.
4. Server 2's segment needs to be reset to trigger a deep store download and
load the segment.
We have tried increasing the deep store upload and peer download
timeouts/retries, but this isn't a great solution for us since it introduces
more delays into the ingestion path.
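
To make the example case concrete, here is a minimal, self-contained sketch of
why a reset only helps after the deep store backfill. It is a toy model under
stated assumptions, not Pinot or Helix code: `ErrorSegmentResetSketch`,
`ReplicaState`, and every method on them are hypothetical names invented for
illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model. None of these types are Pinot or Helix classes.
class ErrorSegmentResetSketch {
  enum ReplicaState { ONLINE, ERROR }

  // segment name -> (server name -> replica state)
  private final Map<String, Map<String, ReplicaState>> _replicaStates = new LinkedHashMap<>();
  // Segments that currently have a copy in deep store.
  private final Set<String> _deepStore = new HashSet<>();

  void setState(String segment, String server, ReplicaState state) {
    _replicaStates.computeIfAbsent(segment, k -> new LinkedHashMap<>()).put(server, state);
  }

  // Models the async upload task backfilling deep store (step 3).
  void uploadToDeepStore(String segment) {
    _deepStore.add(segment);
  }

  /**
   * One validation pass: try to reset every replica in ERROR. In this model a reset
   * recovers the replica (ERROR -> ONLINE) only if a deep store copy exists to download;
   * otherwise the replica stays in ERROR and is retried on the next pass.
   * Returns the recovered replicas as "segment@server".
   */
  List<String> resetErrorReplicas() {
    List<String> recovered = new ArrayList<>();
    for (Map.Entry<String, Map<String, ReplicaState>> segmentEntry : _replicaStates.entrySet()) {
      String segment = segmentEntry.getKey();
      for (Map.Entry<String, ReplicaState> replica : segmentEntry.getValue().entrySet()) {
        if (replica.getValue() == ReplicaState.ERROR && _deepStore.contains(segment)) {
          replica.setValue(ReplicaState.ONLINE);
          recovered.add(segment + "@" + replica.getKey());
        }
      }
    }
    return recovered;
  }

  public static void main(String[] args) {
    ErrorSegmentResetSketch sketch = new ErrorSegmentResetSketch();
    // Steps 1-2: server1 built the segment but the upload failed; server2's peer download failed.
    sketch.setState("seg_0", "server1", ReplicaState.ONLINE);
    sketch.setState("seg_0", "server2", ReplicaState.ERROR);
    System.out.println("before backfill: " + sketch.resetErrorReplicas());
    // Step 3: the async upload task backfills deep store.
    sketch.uploadToDeepStore("seg_0");
    // Step 4: the next validation pass resets server2's replica.
    System.out.println("after backfill: " + sketch.resetErrorReplicas());
  }
}
```

In this model the first pass recovers nothing because no deep store copy exists
yet, which is why a periodic reset fits better than a one-shot retry at server
startup.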
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]