Hi Spark community,

I have a Spark Structured Streaming application that reads data from a socket source (implemented very similarly to TextSocketMicroBatchStream). The issue is that the source can produce data faster than Spark can process it, so pending data piles up in memory and the application eventually dies with an OutOfMemoryError.

I'm looking for advice on the most idiomatic/recommended way in Spark to rate-limit data ingestion to avoid overwhelming the system.

Approaches I've considered:

1. Using a fixed-size BlockingQueue between the socket reader thread and Spark to throttle ingestion (first sketch below, after this list). However, this requires careful tuning of the queue capacity: too small and it limits throughput, too large and individual batches risk taking too long.

2. Fetching only a limited number of records per micro-batch, buffering them in a queue that the PartitionReader's next() drains until it is empty (second sketch below). However, I'm not sure whether there is a built-in way to scale the number of records fetched dynamically (i.e., to compute the batch's end offset) based on system load and processing capacity.
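
For reference, this is roughly what I mean by (1). IngestBuffer and its method names are placeholders of mine, not Spark APIs; the idea is that once the queue is full, the reader thread blocks on put(), stops draining the socket, and TCP flow control pushes back on the sender:

    import java.util.concurrent.ArrayBlockingQueue

    // Hypothetical sketch: a bounded buffer between the socket reader
    // thread and the PartitionReader.
    class IngestBuffer(capacity: Int) {
      private val queue = new ArrayBlockingQueue[String](capacity)

      // Called from the socket reader thread; blocks once `capacity`
      // lines are buffered, which in turn throttles the TCP sender.
      def put(line: String): Unit = queue.put(line)

      // Called from PartitionReader.next(); blocks until a line arrives.
      def take(): String = queue.take()
    }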
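
And for (2), here is roughly what I have in mind, based on the SupportsAdmissionControl / ReadLimit interfaces I found in the DSv2 API (please correct me if this is not their intended use). ThrottledSocketStream, LineOffset, and highWaterMark are placeholder names of mine; partition planning is elided since it would match the unthrottled source:

    import java.util.concurrent.atomic.AtomicLong

    import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}
    import org.apache.spark.sql.connector.read.streaming.{
      MicroBatchStream, Offset, ReadLimit, ReadMaxRows, SupportsAdmissionControl}

    // The offset is just an index into the stream of lines received so far.
    case class LineOffset(index: Long) extends Offset {
      override def json: String = index.toString
    }

    class ThrottledSocketStream(maxRowsPerBatch: Long) extends MicroBatchStream
        with SupportsAdmissionControl {

      // Highest line index buffered from the socket so far (placeholder
      // for whatever bookkeeping the real reader thread does).
      private val highWaterMark = new AtomicLong(0L)

      override def getDefaultReadLimit: ReadLimit = ReadLimit.maxRows(maxRowsPerBatch)

      // Spark passes the effective ReadLimit here; capping the end offset
      // means a single micro-batch never covers more than maxRows lines.
      override def latestOffset(start: Offset, limit: ReadLimit): Offset = {
        val from = start.asInstanceOf[LineOffset].index
        val available = highWaterMark.get()
        limit match {
          case max: ReadMaxRows => LineOffset(math.min(available, from + max.maxRows()))
          case _ => LineOffset(available) // e.g. ReadLimit.allAvailable()
        }
      }

      // With SupportsAdmissionControl, the no-arg variant should not be called.
      override def latestOffset(): Offset =
        throw new UnsupportedOperationException("use latestOffset(start, limit)")

      override def initialOffset(): Offset = LineOffset(0L)
      override def deserializeOffset(json: String): Offset = LineOffset(json.toLong)
      override def commit(end: Offset): Unit = () // could free lines up to `end` here
      override def stop(): Unit = ()

      // Elided: same as the unthrottled version of the source.
      override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = ???
      override def createReaderFactory(): PartitionReaderFactory = ???
    }

This caps the batch size at a fixed maximum, but I still don't see how to adapt that cap to the observed processing rate, which is the part I'm unsure about.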

So, in summary: what is the recommended way to dynamically rate-limit a streaming source to match Spark's processing capacity and avoid out-of-memory errors? Are there any best practices or configuration options I should look at? Any guidance would be much appreciated, and I'm happy to share more details.

Thanks,
Mert
