Hi all, I'm currently developing a Spark Structured Streaming application that joins/aggregates messages from ~7 Kafka topics and produces messages onto another Kafka topic.
Quite often in my development cycle, I want to "reprocess from scratch": I stop the program, delete the target topic and the associated checkpoint information, and restart the application with the query.

My assumption was that the newly started query would then process all messages already on the input topics, set the watermark according to the freshest message on the topics, and emit all output messages that have moved past the watermark and can thus be safely produced. As an example: if the freshest message on the topic has an event time of "2019-05-20 10:13", I restart the query at "2019-05-20 11:30", and I have a watermark delay of 10 minutes, I would expect the query to end up with an eventTime watermark of "2019-05-20 10:03" and to emit all earlier results.

But my observations indicate that after the initial query startup and reading of all input topics, the watermark stays at the Unix epoch (1970-01-01) and no messages are produced. Only once a new message comes in, after the start of the query, is the watermark moved ahead and all the pending messages are produced.

Is this the expected behaviour and is my assumption wrong? Or am I doing something wrong during query setup?

-- CU, Joe
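To make my mental model of the observed behaviour explicit, here is a tiny plain-Python sketch (not Spark code; `WatermarkSim` and its methods are made up for illustration). It assumes the watermark is recomputed at the *end* of each micro-batch, so any batch runs with the watermark left behind by the previous batch, and the very first batch after a checkpoint reset always runs with the epoch watermark:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

class WatermarkSim:
    """Toy model of per-batch watermark advancement (illustration only).

    The watermark visible to batch N is the value computed at the end
    of batch N-1, so a fresh query starts at the epoch watermark.
    """

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.watermark = EPOCH  # fresh checkpoint -> epoch watermark

    def run_batch(self, event_times):
        # The batch is evaluated with the *previous* watermark...
        wm_for_this_batch = self.watermark
        # ...and only a non-empty batch advances it afterwards.
        if event_times:
            self.watermark = max(self.watermark,
                                 max(event_times) - self.delay)
        return wm_for_this_batch

sim = WatermarkSim(timedelta(minutes=10))

# Batch 1: reads the whole backlog (freshest event 2019-05-20 10:13),
# but runs with watermark == epoch, so nothing is emitted yet.
wm1 = sim.run_batch([datetime(2019, 5, 20, 10, 13)])

# Batch 2: triggered only when a new message arrives; now the advanced
# watermark (10:03) applies and the backlog results can be emitted.
wm2 = sim.run_batch([datetime(2019, 5, 20, 11, 30)])

print(wm1)  # 1970-01-01 00:00:00
print(wm2)  # 2019-05-20 10:03:00
```

If this model is right, it would explain why restarting alone does not flush the backlog: no second batch runs (and thus the advanced watermark is never applied) until new input arrives.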