Thanks Cody, will try to do some estimation. Thanks Nicolae, will try out this config.
Thanks,
Sourabh

On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:

> Hi,
>
> Set 10ms and spark.streaming.backpressure.enabled=true.
>
> This should automatically delay the next batch until the current one is
> processed, or at least create that balance over a few batches/periods
> between the consume/process rate and the ingestion rate.
>
> Nicu
>
> ------------------------------
> *From:* Cody Koeninger <c...@koeninger.org>
> *Sent:* Thursday, October 1, 2015 11:46 PM
> *To:* Sourabh Chandak
> *Cc:* user
> *Subject:* Re: spark.streaming.kafka.maxRatePerPartition for direct stream
>
> That depends on your job, your cluster resources, the number of seconds
> per batch...
>
> You'll need to do some empirical work to figure out how many messages per
> batch a given executor can handle. Divide that by the number of seconds
> per batch.
>
> On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
>
>> Hi,
>>
>> I am writing a Spark Streaming job using the direct stream method for
>> Kafka and want to handle the case of checkpoint failure, when we'll have
>> to reprocess the entire data from the beginning. By default, for every new
>> checkpoint it tries to load everything from each partition, and that takes
>> a lot of time to process. After some searching I found that there is a
>> config, spark.streaming.kafka.maxRatePerPartition, which can be used to
>> tackle this. My question is: what would be a suitable range for this
>> config if we have ~12 million messages in Kafka with a maximum message
>> size of ~10 MB?
>>
>> Thanks,
>> Sourabh
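For readers finding this thread later: below is a minimal sketch of how the two settings discussed above could be wired into a direct-stream job, assuming Spark 1.5+ and the Kafka 0.8 direct API. The broker address, topic name, partition count, and rate figures are placeholders; the messages-per-batch number should come from the empirical measurement Cody describes, and the arithmetic turning it into a per-partition rate is only illustrative.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RateLimitedDirectStream {
  def main(args: Array[String]): Unit = {
    // Hypothetical numbers -- measure messagesPerBatch empirically for your job.
    val batchSeconds     = 10
    val messagesPerBatch = 50000   // messages one batch can comfortably process
    val kafkaPartitions  = 10      // number of partitions in the topic

    // maxRatePerPartition is expressed in messages per second per partition.
    val maxRatePerPartition = messagesPerBatch / batchSeconds / kafkaPartitions

    val conf = new SparkConf()
      .setAppName("rate-limited-direct-stream")
      // Let Spark adapt the ingestion rate to the observed processing rate.
      .set("spark.streaming.backpressure.enabled", "true")
      // Cap the rate so a fresh checkpoint does not try to pull everything at once.
      .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition.toString)

    val ssc = new StreamingContext(conf, Seconds(batchSeconds))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics      = Set("events")                                 // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // Replace with the real processing logic; this just reports batch sizes.
      println(s"Batch size: ${rdd.count()} messages")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With backpressure enabled, the cap mainly matters for the first batches after starting from a new checkpoint, before the rate controller has any feedback; that is exactly the replay-from-the-beginning case Sourabh is asking about.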