You may want to check "where" the job is stuck by taking a thread dump - it
could be in the Kafka consumer, in the Spark codebase, etc. Without that
information it's hard to say.

On Thu, Apr 16, 2020 at 4:22 PM Ruijing Li <liruijin...@gmail.com> wrote:

> Thanks Jungtaek, that makes sense.
>
> I tried Burak’s solution of just setting failOnDataLoss to false, but
> instead of failing, the job is stuck. I’m guessing that the offsets are
> being deleted faster than the job can process them and it will stay stuck
> unless I increase resources? Or will Spark hang once the exception happens?
>
> On Tue, Apr 14, 2020 at 10:48 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I think Spark is trying to ensure that it reads the input "continuously"
>> without missing anything. Technically it may be valid to say the situation
>> is a kind of "data loss", as the query couldn't process the offsets that
>> were thrown out, and the owner of the query needs to be careful as it
>> affects the result.
>>
>> If your streaming query keeps up with the input rate, it's pretty rare for
>> the query to fall behind retention. Even if it lags a bit, it'd be safe as
>> long as retention is set to a long enough period. The ideal state is to
>> ensure your query processes all offsets before they are thrown out by
>> retention (don't leave the query lagging behind - either increase
>> processing power or increase the retention duration, though most probably
>> you'll need to do the former). But if you can't guarantee that and you
>> understand the risk, then yes, you can turn off the option and accept the
>> risk.
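>>
>> One way to keep an eye on whether the query is lagging behind (just a
>> sketch using the standard StreamingQueryListener API, not something from
>> this thread; `spark` is assumed to be your SparkSession) is to log
>> per-batch progress and compare the input and processed rates:
>>
>>   import org.apache.spark.sql.streaming.StreamingQueryListener
>>   import org.apache.spark.sql.streaming.StreamingQueryListener._
>>
>>   spark.streams.addListener(new StreamingQueryListener {
>>     override def onQueryStarted(e: QueryStartedEvent): Unit = ()
>>     override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
>>     override def onQueryProgress(e: QueryProgressEvent): Unit = {
>>       val p = e.progress
>>       // If processedRowsPerSecond stays below inputRowsPerSecond, the
>>       // query is falling behind and may eventually hit retention.
>>       println(s"batch=${p.batchId} input/s=${p.inputRowsPerSecond} " +
>>         s"processed/s=${p.processedRowsPerSecond}")
>>     }
>>   })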
>>
>>
>> On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li <liruijin...@gmail.com> wrote:
>>
>>> I see, I wasn’t sure if that would work as expected. The docs seem to
>>> suggest being careful before turning off that option, and I’m not sure why
>>> failOnDataLoss is true by default.
>>>
>>> On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> Just set `failOnDataLoss=false` as an option in readStream?
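>>>>
>>>> Something along these lines (just a sketch; the topic and bootstrap
>>>> servers are placeholders, and `spark` is your SparkSession):
>>>>
>>>>   val df = spark.readStream
>>>>     .format("kafka")
>>>>     .option("kafka.bootstrap.servers", "broker1:9092")
>>>>     .option("subscribe", "my-topic")
>>>>     // Log and move on when offsets have been aged out by retention,
>>>>     // instead of failing the query.
>>>>     .option("failOnDataLoss", "false")
>>>>     .load()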
>>>>
>>>> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li <liruijin...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a Spark Structured Streaming app that is consuming from a Kafka
>>>>> topic with retention set up. Sometimes I face an issue where my query has
>>>>> not finished processing a message, but retention kicks in and deletes the
>>>>> offset, which, since I use the default setting of “failOnDataLoss=true”,
>>>>> causes my query to fail. The solution I currently have is manual: delete
>>>>> the offsets directory and rerun.
>>>>>
>>>>> I would instead like to have Spark automatically fall back to the
>>>>> earliest offset available. The solutions I saw recommend setting
>>>>> auto.offset.reset = earliest, but for Structured Streaming you cannot
>>>>> set that. How do I do this for Structured Streaming?
>>>>>
>>>>> Thanks!
>>>>> --
>>>>> Cheers,
>>>>> Ruijing Li
>>>>>
>>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>> --
> Cheers,
> Ruijing Li
>
