We have a Structured Streaming job that processes a stream of events. It performs a stateful aggregation, for which we use flatMapGroupsWithState.
The job also loads some domain data that has to be refreshed periodically. To refresh it, we restart the query, as Tathagata suggested in this thread:
http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3CCA%2BAHuKn%2BvSEWkJD%3DbSSt6G5bDZDaS6wmN%2Bfwmn4jTm1X1nDAPA%40mail.gmail.com%3E

This works for the domain data refresh. However, when the query restarts, the state maintained by flatMapGroupsWithState is lost.

Is there a way to retain the state across a query restart?

One option I am considering is to split the job in two, so that the domain data refresh and the state-based processing are handled separately. Does this make sense? Are there other ways to solve this?

Thanks,
Ashutosh Joshi
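
P.S. For concreteness, here is a minimal sketch of the kind of setup I mean, not our actual code. The Event and RunningTotal types, the rate source, the console sink, the checkpoint path and the one-hour interval are all placeholders made up for illustration. It shows a flatMapGroupsWithState aggregation wrapped in a loop that stops and restarts the query on a timer. It assumes the restarted query reuses the same checkpointLocation; my understanding is that this is a precondition for Spark to recover the state after a restart, but please correct me if that is wrong.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, StreamingQuery}

// Hypothetical event and output types, for illustration only.
case class Event(key: String, value: Long)
case class RunningTotal(key: String, total: Long)

object RestartSketch {
  // Hypothetical path. The assumption is that it stays the same across restarts.
  val checkpointDir = "/tmp/checkpoints/events-agg"

  // Stateful aggregation: keeps a running total per key in GroupState.
  def updateTotals(
      key: String,
      events: Iterator[Event],
      state: GroupState[Long]): Iterator[RunningTotal] = {
    val total = state.getOption.getOrElse(0L) + events.map(_.value).sum
    state.update(total)
    Iterator(RunningTotal(key, total))
  }

  // Builds and starts the query. The domain data would be (re)loaded here,
  // so each restart picks up a fresh copy; that part is omitted.
  def startQuery(spark: SparkSession): StreamingQuery = {
    import spark.implicits._

    // Placeholder source: the built-in "rate" source stands in for the real event stream.
    val events: Dataset[Event] = spark.readStream
      .format("rate")
      .load()
      .selectExpr("CAST(value % 10 AS STRING) AS key", "value")
      .as[Event]

    events
      .groupByKey(_.key)
      .flatMapGroupsWithState[Long, RunningTotal](
        OutputMode.Update(), GroupStateTimeout.NoTimeout())(updateTotals _)
      .writeStream
      .outputMode("update")
      .format("console") // placeholder sink
      .option("checkpointLocation", checkpointDir)
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("restart-sketch")
      .master("local[2]")
      .getOrCreate()

    val refreshIntervalMs = 60 * 60 * 1000L // hypothetical: refresh every hour

    while (true) {
      val query = startQuery(spark)
      // Returns after the timeout, or earlier if the query terminates;
      // a failed query rethrows its exception here.
      query.awaitTermination(refreshIntervalMs)
      query.stop()
    }
  }
}
```

If the checkpoint location were generated fresh on every restart (for example a temporary directory), each new query would start with empty state, which would match what we are seeing.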