I've seen the feature work very well. For tuning, you've got:

- spark.streaming.backpressure.pid.proportional (default 1.0, non-negative): weight for the response to the "error" (the change between the last batch and this batch).
- spark.streaming.backpressure.pid.integral (default 0.2, non-negative): weight for the response to the accumulation of error. This has a dampening effect.
- spark.streaming.backpressure.pid.derived (default 0.0, non-negative): weight for the response to the trend in error. This can cause arbitrary/noise-induced fluctuations in batch size, but can also help react quickly to increased/reduced capacity.
- spark.streaming.backpressure.pid.minRate (default 100, must be positive): batch size won't go below this.
- spark.streaming.receiver.maxRate: batch size won't go above this.

Cheers,
Richard
https://richardstartin.com/

________________________________
From: Liren Ding <sky.gonna.bri...@gmail.com>
Sent: 05 December 2016 22:18
To: dev@spark.apache.org; u...@spark.apache.org
Subject: Back-pressure to Spark Kafka Streaming?

Hey all,

Does backpressure actually work on Spark Kafka streaming? According to the latest Spark Streaming documentation (http://spark.apache.org/docs/latest/streaming-programming-guide.html):

"In Spark 1.5, we have introduced a feature called backpressure that eliminate the need to set this rate limit, as Spark Streaming automatically figures out the rate limits and dynamically adjusts them if the processing conditions change. This backpressure can be enabled by setting the configuration parameter spark.streaming.backpressure.enabled to true."

But I also see a few open Spark JIRA tickets on this option:
https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes a similar issue to ours. We use Kafka to send large batches (10~100M) to Spark Streaming, and the streaming interval is set to 1~4 minutes. With backpressure set to true, the queued active batches still pile up whenever the average batch processing time exceeds the interval. After the Spark driver is restarted, all queued batches turn into one giant batch, which blocks subsequent batches and also has a good chance of failing eventually.

The only config we found that might help is "spark.streaming.kafka.maxRatePerPartition". It does limit the incoming batch size, but it is not a perfect solution, since the resulting batch size depends on the number of partitions as well as the length of the batch interval. In our case, hundreds of partitions times minutes of interval still produce a number that is too large for each batch. So we still want to figure out how to make backpressure work in Spark Kafka streaming, if it is supposed to work there. Thanks.
Liren
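
The arithmetic behind the maxRatePerPartition complaint above: the cap it imposes scales with both the partition count and the batch interval, so a per-partition rate that looks modest can still admit an enormous batch. A minimal sketch with hypothetical numbers (the thread only says "hundreds of partitions" and "minutes of interval"; the figures below are illustrative, not from the thread):

```python
# Upper bound that spark.streaming.kafka.maxRatePerPartition places
# on a single micro-batch:
#   records <= rate_per_partition * num_partitions * batch_interval_seconds
# All numbers below are hypothetical, chosen only to illustrate scale.

def batch_size_bound(rate_per_partition, num_partitions, batch_interval_s):
    """Maximum number of records one micro-batch can ingest."""
    return rate_per_partition * num_partitions * batch_interval_s

# e.g. 1000 records/sec/partition, 300 partitions, 4-minute interval
print(batch_size_bound(1000, 300, 240))  # -> 72000000
```

Holding each batch near some target would require setting the rate to roughly target / (partitions x interval), and re-tuning it whenever the partition count or interval changes, which is exactly the dependence Liren points out.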