rseetham opened a new issue, #18314:
URL: https://github.com/apache/pinot/issues/18314

   Currently, this feature always triggers when an ingestion delay is 
detected at segment creation. It effectively doubles ingestion, which we might 
not always have the resources for. There is no mechanism to stop it if there 
are issues or if the cluster is overwhelmed, for example when one topic 
constantly triggers this feature or when all the topics on a cluster trigger 
it at once. We need some circuit-breaking mechanisms.
   
   **Proposal**
   1. Introduce a backfill pause property 
`stream.kafka.consumer.prop.auto.offset.reset.pause`. This can be set 
programmatically (through a controller API that will be added) or through an 
automatic process, detailed below. This property will be checked before the 
backfill is triggered at segment creation.
   2. Add another property 
`stream.kafka.consumer.prop.auto.offset.reset.maxSegmentsBeforeBackfillSkip`. 
If the number of segments on the table is already high, we will not trigger 
backfill. When ingestion is permanently high rather than spiky, the ingestion 
doubling creates more segments, so the znode limit is reached faster.
   3. Add a dsc property `maxConcurrentBackfillsPerController`. If more than 
this number of backfills are ongoing on a controller, we will not trigger any 
more backfills. If the cluster was down and restarted after a while, all 
topics will have ingestion lag, so backfill will trigger for all tables. This 
is not really necessary in that case because the cluster can recover on its 
own. If all topics backfill instead, we will create a lot of topics for this. 
Since topics cannot be cleanly removed, we would need to carry these topics in 
the stream config forever, which is not ideal.
   4. If a backfill is ongoing, we will not trigger another backfill until the 
current backfill is complete. We will also pause the backfill using the 
property from point 1 and emit a metric in this case. This feature was built 
for occasional spikes. If backfills trigger constantly, there is more 
throughput than expected and we should increase the number of partitions on 
the main topic.
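
The four checks above could be combined into a single decision made before a backfill is triggered at segment creation. The following is only a minimal sketch of that idea, not Pinot code: the class `BackfillCircuitBreaker`, the `Decision` enum, and the `evaluate` method are hypothetical names, and the order of the checks is an assumption. It also assumes that both limits are inclusive, meaning a backfill is skipped once the count reaches the configured value, which the proposal does not pin down.

```java
/** Hypothetical sketch of the proposed pre-backfill checks; not part of Pinot. */
final class BackfillCircuitBreaker {

  /** Outcome of the checks; every SKIP_* value names the rule that blocked the backfill. */
  enum Decision {
    TRIGGER,
    SKIP_PAUSED,
    SKIP_ALREADY_RUNNING,
    SKIP_TOO_MANY_SEGMENTS,
    SKIP_CONTROLLER_BUSY
  }

  // Would come from ...auto.offset.reset.maxSegmentsBeforeBackfillSkip (point 2).
  private final int maxSegmentsBeforeBackfillSkip;
  // Would come from maxConcurrentBackfillsPerController (point 3).
  private final int maxConcurrentBackfillsPerController;

  BackfillCircuitBreaker(int maxSegmentsBeforeBackfillSkip,
                         int maxConcurrentBackfillsPerController) {
    this.maxSegmentsBeforeBackfillSkip = maxSegmentsBeforeBackfillSkip;
    this.maxConcurrentBackfillsPerController = maxConcurrentBackfillsPerController;
  }

  /**
   * @param paused                 value of the pause property for this table (point 1)
   * @param tableBackfillOngoing   whether this table already has a backfill running (point 4)
   * @param tableSegmentCount      current number of segments on the table (point 2)
   * @param ongoingOnController    backfills currently running on this controller (point 3)
   */
  Decision evaluate(boolean paused,
                    boolean tableBackfillOngoing,
                    int tableSegmentCount,
                    int ongoingOnController) {
    if (paused) {
      return Decision.SKIP_PAUSED;
    }
    if (tableBackfillOngoing) {
      // Per point 4, the caller would also set the pause property and emit a metric here.
      return Decision.SKIP_ALREADY_RUNNING;
    }
    // Assumption: the limit is inclusive, so reaching it is enough to skip.
    if (tableSegmentCount >= maxSegmentsBeforeBackfillSkip) {
      return Decision.SKIP_TOO_MANY_SEGMENTS;
    }
    // Assumption: at most maxConcurrentBackfillsPerController backfills run at once.
    if (ongoingOnController >= maxConcurrentBackfillsPerController) {
      return Decision.SKIP_CONTROLLER_BUSY;
    }
    return Decision.TRIGGER;
  }
}
```

Returning the reason rather than a plain boolean would let the caller emit a distinct metric per rule, which may make it easier to tell a constantly triggering topic apart from a cluster-wide restart.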
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

