Hi folks, I am seeing some strange behavior when using the Spark Kafka connector with Spark Streaming.
I have a Kafka topic with 8 partitions, and a Kafka producer that pumps messages into this topic. On the consumer side I have a Spark Streaming application with 8 executors on 8 worker nodes and 8 ReceiverInputDStreams, all with the same Kafka group id, connected to the topic's 8 partitions. The Kafka consumer property "auto.offset.reset" is set to "smallest".

Here is the sequence of steps:

1. Start the Spark Streaming app.
2. Start the producer.

At this point I see the messages being pumped from the producer show up in Spark Streaming. Then I:

1. Stop the producer.
2. Wait for all the messages to be consumed.
3. Stop the Spark Streaming app.

Now when I restart the Spark Streaming app (note: the producer is still down and no messages are being pumped into the topic), Spark Streaming starts reading each partition from the very beginning. This is not what I was expecting. My understanding is that "auto.offset.reset" should only apply when the group has no committed offset, so I was expecting the consumers started by Spark Streaming to resume from where they left off.

Is my assumption incorrect that the consumers (the Kafka/Spark connector) should resume reading the topic from where they last left off? Has anyone else seen this behavior? Is there a way to make it start from where it left off?

Regards,
- Abraham
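
P.S. In case it helps, here is roughly how my receivers are set up. This is a minimal sketch of the receiver-based (ZooKeeper) connector; the topic name "my-topic", group id "my-group", ZooKeeper quorum, batch interval, and storage level below are placeholders, not my exact values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object EightReceivers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-eight-receivers")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // High-level (ZooKeeper-based) consumer config; all 8 receivers share the group id.
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",   // placeholder quorum
      "group.id"          -> "my-group",   // placeholder group id
      "auto.offset.reset" -> "smallest")

    // One ReceiverInputDStream per partition (8 total), each with 1 consumer thread,
    // then union them into a single stream for processing.
    val streams = (1 to 8).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    ssc.union(streams).map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The real app does more than print, of course, but the Kafka wiring is the same as above.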