Hi,

Just to add to Gabor's excellent answer: checkpointing and offsets are infrastructure concerns and should not really be in the hands of Spark devs, who should instead focus on the business purpose of the code (not offsets, which are very low-level and not really that important).
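For illustration, a minimal sketch of what that looks like on the Structured Streaming side (the topic name, bootstrap servers and checkpoint path are made-up placeholders): the engine records offsets in the checkpoint, so restarting with the same checkpointLocation resumes exactly where it left off, with no offset-handling code at all.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-kafka").getOrCreate()

// Read from Kafka; includeHeaders exposes the record headers Ali mentions below
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true")
  .option("startingOffsets", "latest")   // only consulted on the very first run
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")

val query = events.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // offsets and progress live here
  .start()

query.awaitTermination()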
BTW that's what happens in Kafka Streams too.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>


On Sun, Apr 4, 2021 at 12:28 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> There is no way to store offsets in Kafka and restart from the stored
> offset. Structured Streaming stores offsets in the checkpoint and restarts
> from there without any user code.
>
> Offsets can be stored with a listener, but that can only be used for lag
> calculation.
>
> BR,
> G
>
>
> On Sat, 3 Apr 2021, 21:09 Ali Gouta, <ali.go...@gmail.com> wrote:
>
>> Hello,
>>
>> I was reading the Spark docs about Spark Structured Streaming, since we
>> are thinking about updating our code base that today uses DStreams, hence
>> Spark Streaming. One main reason for this change is that reading headers
>> in Kafka messages is only supported in Structured Streaming and not in
>> DStreams.
>>
>> I was surprised not to see an obvious way to handle offsets manually by
>> committing them to Kafka. In Spark Streaming we used to do it with
>> something similar to these lines of code:
>>
>> stream.foreachRDD { rdd =>
>>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>
>>   // some time later, after outputs have completed
>>   stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
>> }
>>
>> And this works perfectly! In particular, it works very nicely in case of
>> job failure/restart... I am wondering how this can be achieved in Spark
>> Structured Streaming?
>>
>> I read about checkpoints, and this reminds me of the old way of doing
>> things in Spark 1.5/Kafka 0.8, and it is not ideal since we are not
>> deciding ourselves when to commit offsets.
>>
>> Did I miss anything? What would be the best way of committing offsets to
>> Kafka with Spark Structured Streaming for the relevant consumer group?
>>
>> Best regards,
>> Ali Gouta.
>>
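P.S. On Gabor's listener remark: a rough sketch (the class name and details are illustrative only) of how the processed offsets can be surfaced from query progress for lag monitoring. Restart still comes from the checkpoint, never from whatever you write back to Kafka.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class OffsetLagListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON string such as {"events":{"0":42,"1":17}};
    // parse it and, if you want lag visible in Kafka tooling, commit it there
    // with a plain KafkaConsumer -- purely for monitoring, not for recovery.
    event.progress.sources.foreach { source =>
      println(s"source=${source.description} endOffset=${source.endOffset}")
    }
  }
}

// spark.streams.addListener(new OffsetLagListener())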