Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
Hi, On 10/13/2016 04:35 PM, Cody Koeninger wrote: So I see in the logs that PIDRateEstimator is choosing a new rate, and the rate it's choosing is 100. But it's always choosing 100, while all the other variables (processing time, latestRate, etc.) change. Also, the records per batch is
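
For anyone wanting to see those PIDRateEstimator decisions themselves: the estimator logs its computations at TRACE level, so one way (a sketch, assuming the log4j 1.2 that Spark of this era bundles) is to raise that logger's level programmatically:

    // Sketch: surface the rate estimator's TRACE logs (same effect as a
    // log4j.properties entry for this logger).
    import org.apache.log4j.{Level, Logger}

    Logger.getLogger("org.apache.spark.streaming.scheduler.rate.PIDRateEstimator")
      .setLevel(Level.TRACE)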

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Cody Koeninger
So I see in the logs that PIDRateEstimator is choosing a new rate, and the rate it's choosing is 100. That happens to be the default minimum of an (apparently undocumented) setting, spark.streaming.backpressure.pid.minRate Try setting that to 1 and see if there's different behavior. BTW, how
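
For concreteness, a minimal Scala sketch of that suggestion (the value 1 is the one proposed above):

    // Sketch: enable backpressure and lower the PID estimator's floor,
    // spark.streaming.backpressure.pid.minRate, which defaults to 100.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.pid.minRate", "1")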

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
Hey Cody, Thanks for the reply. Really helpful. Following your suggestion, I set spark.streaming.backpressure.enabled to true and maxRatePerPartition to 10. I know I can handle 100k records at the same time, but definitely not in 1 second (the batchDuration), so I expect the backpressure

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
Cool, just wanted to make sure. To answer your question ("Isn't spark.streaming.backpressure.initialRate supposed to do this?"): that configuration was added well after the integration of the direct stream with the backpressure code, and was added only to the receiver code, which the

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
I am 100% sure. println(conf.get("spark.streaming.backpressure.enabled")) prints true. On 10/12/2016 05:48 PM, Cody Koeninger wrote: Just to make 100% sure, did you set spark.streaming.backpressure.enabled to true? On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
How would backpressure know anything about the capacity of your system on the very first batch? You should be able to set maxRatePerPartition at a value that makes sure your first batch doesn't blow things up, and let backpressure scale from there. On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane
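
A sketch of that setup (the cap of 1,000 records/sec/partition is a made-up placeholder; the right value depends on your cluster):

    // Sketch: maxRatePerPartition bounds the first batch, for which
    // backpressure has no history yet; backpressure then adapts the rate
    // for the batches after it.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // hypothetical cap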

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
That's what I was looking for, thank you. Unfortunately, none of
* spark.streaming.backpressure.initialRate
* spark.streaming.backpressure.enabled
* spark.streaming.receiver.maxRate
* spark.streaming.receiver.initialRate
changes how many records I get (I tried many different combinations). The
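
That outcome is consistent with a receiverless setup. A sketch (assuming the Spark 1.6-era spark-streaming-kafka 0.8 direct API; the broker address and topic name are hypothetical): the direct stream runs without a receiver, so the spark.streaming.receiver.* settings have nothing to act on.

    // Sketch: a Kafka direct stream has no receiver, so the
    // spark.streaming.receiver.* settings do not apply to it.
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-stream-example"), Seconds(1))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))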

Re: Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html "This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote: > Hi, > > Is it

Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Samy Dindane
Hi, Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Thank you. Samy
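
As the replies above work out, the knob for this with the direct stream is spark.streaming.kafka.maxRatePerPartition. A minimal sketch (the app name, rate, and batch duration are placeholders):

    // Sketch: with this cap, a batch holds at most
    //   maxRatePerPartition * <number of Kafka partitions> * batchDuration
    // records, e.g. 10,000/sec * 8 partitions * 1 s = 80,000 records.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("bounded-first-batch") // hypothetical name
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
    val ssc = new StreamingContext(conf, Seconds(1))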