I am 100% sure.

println(conf.get("spark.streaming.backpressure.enabled")) prints true.

On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sure, did you set

spark.streaming.backpressure.enabled

to true?

On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane <s...@dindane.com> wrote:


On 10/12/2016 04:40 PM, Cody Koeninger wrote:

How would backpressure know anything about the capacity of your system
on the very first batch?

Isn't "spark.streaming.backpressure.initialRate" supposed to do this?


You should be able to set maxRatePerPartition at a value that makes
sure your first batch doesn't blow things up, and let backpressure
scale from there.
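
For reference, a minimal sketch of that combination, assuming the conf is built programmatically (the app name and rate value are placeholders, not anything from this thread):

import org.apache.spark.SparkConf

// Sketch: cap the first batch with a hard per-partition limit and let the
// backpressure estimator adjust the rate for later batches. Values are illustrative.
val conf = new SparkConf()
  .setAppName("kafka-backpressure-example") // hypothetical app name
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard upper bound: records per Kafka partition per second, so the very
  // first batch stays manageable.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")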

Backpressure doesn't scale the rate even when maxRatePerPartition is set: with
backpressure enabled and maxRatePerPartition set to n, I always get n records
(per partition), even when my batch takes longer than batchDuration to finish.

Example (sketched in code below):
* I set batchDuration to 1 second: `val ssc = new StreamingContext(conf, Durations.seconds(1))`
* I set backpressure.initialRate and/or maxRatePerPartition to 100,000 and enable backpressure
* Since I can't handle 100,000 records in 1 second, I expect backpressure to kick in on the second batch and give me fewer than 100,000 records; but this does not happen
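
Concretely, the setup looks roughly like this (a sketch only; the topic name, Kafka parameters, and the kafka-0-10 direct stream API are my assumptions, since they aren't shown in the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("backpressure-repro") // hypothetical app name
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100000")
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")

// 1-second batches, as in the example above.
val ssc = new StreamingContext(conf, Durations.seconds(1))

// Hypothetical Kafka parameters and topic name.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "backpressure-repro",
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // With the settings above, every batch keeps returning ~100,000 records
  // per partition instead of being throttled down by backpressure.
  println(s"batch size: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()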

What am I missing here?




On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane <s...@dindane.com> wrote:

That's what I was looking for, thank you.

Unfortunately, none of

* spark.streaming.backpressure.initialRate
* spark.streaming.backpressure.enabled
* spark.streaming.receiver.maxRate
* spark.streaming.receiver.initialRate

changes how many records I get (I tried many different combinations).

The only configuration that has an effect is
"spark.streaming.kafka.maxRatePerPartition".
That's better than nothing, but it'd be useful to have backpressure working
so the rate scales automatically.

Do you have any idea why backpressure isn't working? How can I debug this?
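
One way to see what the rate estimator is actually computing (assuming the estimator lives under org.apache.spark.streaming.scheduler.rate, as in the Spark source) is to raise that package's log level on the driver:

import org.apache.log4j.{Level, Logger}

// Sketch: surface the backpressure estimator's computed rates in the driver log.
// The package name is taken from the Spark source; adjust it if your version differs.
Logger.getLogger("org.apache.spark.streaming.scheduler.rate").setLevel(Level.TRACE)

If nothing shows up there, backpressure isn't engaging at all; if rates do show up but batch sizes stay flat, the computed rate is probably being capped by the configured maximums.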


On 10/11/2016 06:08 PM, Cody Koeninger wrote:


http://spark.apache.org/docs/latest/configuration.html

"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."

On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane <s...@dindane.com> wrote:


Hi,

Is it possible to limit the size of the batches returned by the Kafka
consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records,
and it takes ages to process and checkpoint them.

Thank you.

Samy





