That depends on your job, your cluster resources, the number of seconds per batch...
You'll need to do some empirical work to figure out how many messages per batch a given executor can handle, then divide that by the number of seconds per batch. A rough sketch of where that number ends up is below, after the quoted message.

On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
> Hi,
>
> I am writing a Spark Streaming job using the direct stream method for
> Kafka and wanted to handle the case of checkpoint failure, where we'll have
> to reprocess the entire data from the beginning. By default, for every new
> checkpoint it tries to load everything from each partition, and that takes a
> lot of time to process. After some searching I found out that there is a
> config, spark.streaming.kafka.maxRatePerPartition, which can be used
> to tackle this. My question is: what would be a suitable range for this
> config if we have ~12 million messages in Kafka with a maximum message size
> of ~10 MB?
>
> Thanks,
> Sourabh
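For concreteness, here is a minimal sketch against the Spark 1.x Scala direct stream API showing where the measured rate gets plugged in. Everything specific in it is a hypothetical placeholder: the broker address, the topic name ("events"), the 10-second batch interval, and the 500 messages/sec rate, which stands in for whatever your own profiling produces.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Hypothetical profiling result: one executor keeps up with ~5,000
    // messages per 10-second batch, so 5,000 / 10 = 500 messages/sec.
    val conf = new SparkConf()
      .setAppName("KafkaDirectBackfill")
      // Caps the number of records fetched per Kafka partition per second;
      // records per batch = rate * batch seconds * number of partitions.
      .set("spark.streaming.kafka.maxRatePerPartition", "500")

    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092", // hypothetical broker
      "auto.offset.reset" -> "smallest"         // replay from the earliest offsets
    )

    // "events" is a hypothetical topic name.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Stand-in for the real processing; just exercises the pipeline.
      println(s"Processed ${rdd.count()} messages")
    }

    ssc.start()
    ssc.awaitTermination()

Keep in mind that maxRatePerPartition is a per-partition, per-second cap, so the total fetched per batch scales with your partition count. And with messages as large as ~10 MB each, the sustainable rate will likely be far lower than for small messages, so it's safer to start conservatively and raise the cap once batches are reliably finishing within the batch interval.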