aruraghuwanshi opened a new issue, #19038:
URL: https://github.com/apache/druid/issues/19038

   ### Description
   
   Add an optional mechanism that automatically samples (drops a configurable fraction of) incoming rows during streaming ingestion when the ingestion rate exceeds a configurable threshold. Below the threshold, all rows would be ingested as normal; above it, sampling would kick in to reduce load.
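
   A minimal sketch of the core mechanic, assuming a fixed one-second measurement window (the class and parameter names below are hypothetical, not existing Druid code): rows within a per-second budget pass through untouched, and rows beyond it are kept only with a configured probability.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not an existing Druid class: within each one-second
// window, the first maxRowsPerSec rows are ingested normally; rows beyond
// that budget are kept only with probability sampleRatio.
public class RateThresholdSampler
{
  private final long maxRowsPerSec;   // threshold above which sampling kicks in
  private final double sampleRatio;   // fraction of over-threshold rows to keep

  private long windowStartNanos = System.nanoTime();
  private long rowsInWindow = 0;

  public RateThresholdSampler(long maxRowsPerSec, double sampleRatio)
  {
    this.maxRowsPerSec = maxRowsPerSec;
    this.sampleRatio = sampleRatio;
  }

  /** Returns true if the row should be ingested, false if it should be dropped. */
  public synchronized boolean accept()
  {
    final long now = System.nanoTime();
    if (now - windowStartNanos >= 1_000_000_000L) {
      // Start a new one-second measurement window.
      windowStartNanos = now;
      rowsInWindow = 0;
    }
    rowsInWindow++;
    if (rowsInWindow <= maxRowsPerSec) {
      return true;  // under the threshold: ingest everything
    }
    // Over the threshold: keep only a uniformly sampled fraction.
    return ThreadLocalRandom.current().nextDouble() < sampleRatio;
  }
}
```

   A real implementation would more likely smooth the rate over a moving average (see the implementation considerations below); the fixed window is only the simplest illustration.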
   
   This would allow users to maintain ingestion stability during traffic spikes or bursts, trading some data completeness for stability when scaling out (via the existing task autoscaler) is insufficient or undesirable.
   
   ### Motivation
   
   **Use case:** Streaming ingestion from Kafka or Kinesis can experience sudden spikes in event volume. When the incoming rate exceeds what the cluster can sustainably process, lag accumulates, tasks may fail, backpressure builds, and the system can become unstable. Today, Druid addresses this primarily by scaling out (adding more ingestion tasks via the autoscaler), but there are scenarios where:
   
   - Additional task capacity is not immediately available (e.g., worker slots 
constrained)
   - Users prefer to sample during bursts rather than risk ingestion failures 
or growing lag
   - The use case tolerates approximate data (e.g., metrics, sampling-friendly 
analytics)
   
   **Rationale:** Providing a built-in, configurable way to sample when the rate exceeds a threshold would give operators a predictable, observable fallback when the stream overwhelms available capacity. It would complement (not replace) the existing autoscaling approach: users could configure both and have sampling kick in only when scaling alone is insufficient.
   
   **Benefit:** Users would have a declarative way to prioritize ingestion 
stability over data completeness during overload, with sampled events tracked 
in existing metrics (e.g., `ingest/events/thrownAway` with a distinct reason) 
for visibility.
   
   ### Implementation considerations
   
   Any implementation would likely build on existing building blocks:
   
   - the row-level filtering already applied during streaming ingestion (e.g., `InputRowFilter`, `FilteringCloseableInputRowIterator`)
   - the throughput and thrown-away metrics already exposed per task (`RowIngestionMeters`, and the moving averages in `DropwizardRowIngestionMeters`)
   - the streaming tuning config as the home for new parameters (`SeekableStreamIndexTaskTuningConfig`)
   
   The cost-based and lag-based autoscalers already consume task-level stats for scaling decisions; similar metrics could inform when sampling is warranted.
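
   As a rough sketch of how those pieces might compose (the names below are illustrative, not Druid's actual signatures): the sampling decision could be a predicate wrapped around the existing row filter, with the observed rate read from a moving average such as those `DropwizardRowIngestionMeters` maintains, and the threshold and ratio supplied via the tuning config.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;
import java.util.function.Predicate;

// Hypothetical sketch, not Druid's actual API: wrap the row filter that
// streaming ingestion already applies with a rate-based sampling decision.
public final class SamplingRowFilter
{
  private SamplingRowFilter() {}

  public static <R> Predicate<R> wrap(
      final Predicate<R> existingFilter,  // the filter applied to rows today
      final DoubleSupplier rowsPerSec,    // observed rate, e.g. a one-minute moving average
      final double maxRowsPerSec,         // proposed tuning-config threshold
      final double sampleRatio            // fraction of rows to keep while over threshold
  )
  {
    return row -> {
      if (!existingFilter.test(row)) {
        return false;  // row was already excluded by the existing filter
      }
      if (rowsPerSec.getAsDouble() <= maxRowsPerSec) {
        return true;   // under the threshold: ingest every row
      }
      // Over the threshold: keep only a uniformly sampled fraction. Dropped
      // rows could be counted under ingest/events/thrownAway with a distinct
      // reason so sampling stays observable.
      return ThreadLocalRandom.current().nextDouble() < sampleRatio;
    };
  }
}
```

   Driving the decision off a moving average rather than an instantaneous count would make sampling less sensitive to momentary blips, at the cost of reacting slightly more slowly to a genuine spike.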
   
   

