Hi Folks,

I am seeing some strange behavior when using the Spark Kafka connector in
Spark Streaming.

I have a Kafka topic with 8 partitions, and a Kafka producer that pumps some
messages into this topic.
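The producer is nothing special - roughly something like the sketch below
(the broker address, topic name and message contents here are placeholders,
not the real ones):

import java.util.Properties
import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}

object TestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // placeholder broker address
    props.put("metadata.broker.list", "broker-host:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val producer = new Producer[String, String](new ProducerConfig(props))
    // pump a batch of test messages into the 8-partition topic
    (1 to 1000).foreach { i =>
      producer.send(new KeyedMessage[String, String]("my-topic", s"message-$i"))
    }
    producer.close()
  }
}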

On the consumer side I have a Spark Streaming application that has 8
executors on 8 worker nodes and 8 ReceiverInputDStreams, all with the same
Kafka group id, connected to the 8 partitions of the topic. The Kafka
consumer property "auto.offset.reset" is also set to "smallest".
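The consumer side is set up roughly like the sketch below (again, the topic
name, ZooKeeper quorum and group id are placeholders, and the real job does
more than count messages):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaOffsetTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-offset-test")
    val ssc = new StreamingContext(conf, Seconds(10))

    // placeholder ZK quorum and group id
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",
      "group.id"          -> "my-consumer-group",
      "auto.offset.reset" -> "smallest")

    // one ReceiverInputDStream per partition, all sharing the same group id
    val streams = (1 to 8).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
    }

    // union the eight streams and just count messages per batch
    ssc.union(streams).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}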


Now here is the sequence of steps -

(1) Start the Spark Streaming app.
(2) Start the producer.

At this point I see the messages being pumped by the producer show up in
Spark Streaming.  Then I -

(1) Stop the producer.
(2) Wait for all the messages to be consumed.
(3) Stop the Spark Streaming app.

Now when I restart the Spark Streaming app (note - the producer is still
down and no messages are being pumped into the topic), I observe the
following -

(1) Spark Streaming starts reading from each partition right from the
beginning.


This is not what I was expecting. I was expecting the consumers started by
Spark Streaming to resume from where they left off.

Is my assumption incorrect that the consumers (the Kafka/Spark connector)
should start reading from the topic where they last left off?

Has anyone else seen this behavior? Is there a way to make it such that it
starts from where it left off?

Regards,
- Abraham
