Re: Spark structured streaming - Fallback to earliest offset

2020-04-19 Thread Jungtaek Lim
You may want to check "where" the job is stuck via taking thread dump - it could be in kafka consumer, in Spark codebase, etc. Without the information it's hard to say. On Thu, Apr 16, 2020 at 4:22 PM Ruijing Li wrote: > Thanks Jungtaek, that makes sense. > > I tried Burak’s solution of just

Re: Spark structured streaming - Fallback to earliest offset

2020-04-16 Thread Ruijing Li
Thanks Jungtaek, that makes sense. I tried Burak’s solution of just turning failOnDataLoss to be false, but instead of failing, the job is stuck. I’m guessing that the offsets are being deleted faster than the job can process them and it will be stuck unless I increase resources? Or does once the

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Jungtaek Lim
I think Spark is trying to ensure that it reads the input "continuously" without any missing. Technically it may be valid to say the situation is a kind of "data-loss", as the query couldn't process the offsets which are being thrown out, and owner of the query needs to be careful as it affects

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
I see, I wasn’t sure if that would work as expected. The docs seems to suggest to be careful before turning off that option, and I’m not sure why failOnDataLoss is true by default. On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz wrote: > Just set `failOnDataLoss=false` as an option in readStream? >

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream? On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, > > I have a spark structured streaming app that is consuming from a kafka > topic with retention set up. Sometimes I face an issue where my query has > not finished

Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
Hi all, I have a spark structured streaming app that is consuming from a kafka topic with retention set up. Sometimes I face an issue where my query has not finished processing a message but the retention kicks in and deletes the offset, which since I use the default setting of