In my use case, if I need to stop Spark Streaming for a while, a large backlog accumulates on the Kafka topic partitions. After I restart the Spark Streaming job, the workers' heaps run out of memory while fetching the first batch.
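Concretely, the kind of throttle I am hoping exists is a simple rate cap in the job configuration, something along these lines (sketch only; `spark.streaming.receiver.maxRate` is a config key I have seen for receiver-based streams, but I am not certain it applies to the Kafka receiver, and the numbers are placeholders):

```
# Sketch of the throttling I am looking for (values are placeholders)

# Cap the records/sec each receiver ingests (0 = unlimited):
spark.streaming.receiver.maxRate    10000
```

If something like this worked for the Kafka DStream, the first batch after a restart would be bounded regardless of how much data has accumulated on the topic.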
I am wondering:

* Is there a way to throttle reading from Kafka in Spark Streaming jobs?
* Is there a way to control how far the Kafka DStream can read on a topic partition (via offsets, for example)? Setting this to a small number would force the DStream to read less data initially.
* Is there a way to limit the consumption rate on the Kafka side? (This one is not actually about Spark Streaming and may not belong in this group, but I am raising it here anyway.)

I have looked at the code example below, but this doesn't seem to be supported:

KafkaUtils.createStream ...

Thanks, All

--
Chen Song