Hey all,

Does backpressure actually work on Spark Kafka streaming? According to the
latest Spark Streaming documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"*In Spark 1.5, we have introduced a feature called backpressure that
eliminate the need to set this rate limit, as Spark Streaming automatically
figures out the rate limits and dynamically adjusts them if the processing
conditions change. This backpressure can be enabled by setting the
configuration parameter spark.streaming.backpressure.enabled to true.*"
But I also see a few open Spark JIRA tickets related to this option:

https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes an issue similar to ours. We use
Kafka to send large batches (10~100M) to Spark Streaming, and the streaming
batch interval is set to 1~4 minutes. With backpressure set to true, queued
active batches still pile up whenever the average batch processing time
exceeds the batch interval. After the Spark driver is restarted, all queued
batches are merged into one giant batch, which blocks subsequent batches and
is very likely to fail eventually. The only config we found that might help
is "spark.streaming.kafka.maxRatePerPartition". It does limit the incoming
batch size, but it is not a perfect solution, since the resulting cap depends
on the number of partitions as well as the length of the batch interval. In
our case, hundreds of partitions times minutes of interval still produces a
per-batch count that is too large. So we still want to figure out how to
make backpressure work in Spark Kafka streaming, if it is supposed to work
there. Thanks.
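
For reference, here is a minimal sketch of the kind of direct-stream setup we
are describing (spark-streaming-kafka-0-10 API; the broker list, topic, group
id, rate value, and interval below are placeholders, not our actual values):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object BackpressureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-backpressure-sketch")
      // The rate controller from the streaming guide quoted above.
      .set("spark.streaming.backpressure.enabled", "true")
      // Per-partition, per-second cap. Effective batch size is roughly
      // maxRatePerPartition * #partitions * interval seconds, e.g.
      // 1000 * 200 partitions * 60 s = 12M records (illustrative numbers).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    val ssc = new StreamingContext(conf, Minutes(1))  // placeholder interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "backpressure-sketch",            // placeholder
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)  // placeholder topic
    )

    // Trivial action so each batch gets materialized and shows up in the UI.
    stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}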


Liren
