It’s not intermittent; it seems to happen every time. Spark fails when it starts up from the last checkpoint and complains that the offset is old. I checked the offset, and it is indeed true that the offset had expired on the Kafka side. My version of Spark is 2.4.4, using the Kafka 0.10 source.
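For context, the failure described above matches the Kafka source’s data-loss check: when records referenced by the checkpoint have already been aged out by Kafka retention, Spark aborts the query on restart unless `failOnDataLoss` is set to `false`. A minimal sketch of the relevant options for Spark 2.4 follows; the broker address, topic name, and paths are placeholders for illustration, not values from this thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-checkpoint-restart")
  .getOrCreate()

// failOnDataLoss=false lets the query continue past a checkpointed offset
// range that Kafka retention has already deleted, instead of failing the
// query. Records in the expired range are silently skipped, not recovered.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .option("failOnDataLoss", "false")
  .load()

stream.writeStream
  .format("parquet")
  .option("path", "/data/out")                            // placeholder
  .option("checkpointLocation", "/checkpoints/my-query")  // placeholder
  .start()
```

Note the trade-off: with `failOnDataLoss=false` the skipped records are lost for good, so if the data matters, the safer fix is usually to raise the topic’s `retention.ms` on the Kafka side so offsets outlive any expected job downtime.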
On Sun, Apr 19, 2020 at 3:38 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> That sounds odd. Is it intermittent, or always reproducible if you start
> with the same checkpoint? What's the version of Spark?
>
> On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li <liruijin...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a question on how structured streaming does checkpointing. I’m
>> noticing that Spark is not reading from the max / latest offset it’s seen.
>> For example, in HDFS, I see it stored offset file 30, which contains
>> partition: offset {1: 2000}
>>
>> But after stopping the job and restarting it, I see it instead reads
>> from offset file 9, which contains {1: 1000}
>>
>> Can someone explain why Spark doesn’t take the max offset?
>>
>> Thanks.
>> --
>> Cheers,
>> Ruijing Li

--
Cheers,
Ruijing Li