Hi Bruno,

sorry for getting back to you so late. I just tried to access your logs to investigate the problem, but transfer.sh tells me that they are no longer there. Could you re-upload them or send them directly to my mail address? Sorry for not taking a look at your problem sooner, and for the inconvenience with the upload.
Cheers,
Till

On Thu, Mar 21, 2019 at 4:30 PM Bruno Aranda <bara...@apache.org> wrote:

> Ok, here it goes:
>
> https://transfer.sh/12qMre/jobmanager-debug.log
>
> In an attempt to make it smaller, I removed the noisy "http wire" entries and masked a couple of things. Not sure this covers everything you would like to see.
>
> Thanks!
>
> Bruno
>
> On Thu, 21 Mar 2019 at 15:24, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Bruno,
>>
>> could you upload the logs to https://transfer.sh/ or https://gist.github.com/ and then post a link? This will be crucial for further debugging. It would be really good if you could set the log level to DEBUG.
>>
>> Concerning the number of registered TMs, the new mode (not the legacy mode) no longer respects the `-n` setting when you start a yarn session. Instead, it will dynamically start as many containers as are needed to run the submitted jobs. That's why you don't see the spare TM; this is the expected behaviour.
>>
>> The community intends to add support for ranges of how many TMs must be active at any given time [1].
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-11078
>>
>> Cheers,
>> Till
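Expanding on my point above about the new mode: the containers a yarn session brings up follow the slots the submitted jobs actually request, which is driven by each job's parallelism and by the configured slots per TM (taskmanager.numberOfTaskSlots), not by the `-n` value. Purely as an illustrative sketch, with made-up class name, pipeline and parallelism (nothing taken from your jobs): a job submitted with parallelism 12 against TMs configured with 6 slots each would need 12 slots, so the session would request roughly two containers for it on demand.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: in the new mode the job's parallelism, not the session's -n,
// determines how many slots (and therefore YARN containers) get requested.
public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(12); // 12 slots needed; with 6 slots per TM, roughly 2 containers

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .returns(Types.INT) // explicit type info because of Java lambda type erasure
           .print();

        env.execute("parallelism-sketch");
    }
}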
>>
>> On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <bara...@apache.org> wrote:
>>
>>> Hi Andrey,
>>>
>>> Thanks for your response. I was trying to get the logs somewhere, but they are biggish (~4 MB). Can you suggest somewhere I could put them?
>>>
>>> In any case, I can see exceptions like this:
>>>
>>> 2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
>>> 2019/03/18 10:11:50,807 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched from state RUNNING to FAILING.
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
>>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>> ...
>>>
>>> It looks like a TM may crash, and then the JM, and then the JM is not able to find slots for the tasks in a reasonable time frame? Weirdly, we are running 13 TMs with 6 slots each (we used legacy mode in 1.6), and we always try to keep an extra TM's worth of free slots just in case. Looking at the dashboard, I see 12 TMs and 2 free slots, but we tell Flink 13 are available when we start the session in yarn.
>>>
>>> Any ideas? It is way less stable for us these days, even though we have hardly changed the settings since we started using Flink around 1.2 some time back.
>>>
>>> Thanks,
>>>
>>> Bruno
>>>
>>> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <and...@ververica.com> wrote:
>>>
>>>> Hi Bruno,
>>>>
>>>> could you also share the job master logs?
>>>>
>>>> Thanks,
>>>> Andrey
>>>>
>>>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated.
>>>>>
>>>>> We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2 (running on AWS EMR). The road to the upgrade was fairly rocky, but we felt it was working sufficiently well in our pre-production environments that we rolled it out to prod.
>>>>>
>>>>> However, we're now seeing the jobmanager crash spontaneously several times a day. There doesn't seem to be any pattern to when this happens: it doesn't coincide with an increase in the data flowing through the system, nor does it happen at the same time of day.
>>>>>
>>>>> The big problem is that when it recovers, sometimes a lot of the jobs fail to resume with the following exception:
>>>>>
>>>>> org.apache.flink.util.FlinkException: JobManager responsible for 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>>>> //...
>>>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>>>     ... 16 more
>>>>>
>>>>> Starting them manually afterwards doesn't resume from the checkpoint, which for most jobs means they start from the end of the source Kafka topic. This means that whenever this surprise jobmanager restart happens, we have a ticking clock during which we're losing data.
>>>>>
>>>>> We speculate that those jobs die first, and while they wait to be restarted (they have a 30-second delay strategy), the jobmanager restarts and does not recover them? In any case, we have never seen so many job failures and JM restarts with exactly the same EMR config.
>>>>>
>>>>> We've got some functionality in the works that uses the StreamingFileSink over S3 and relies on the related bug fixes in 1.7.2, so rolling back isn't an ideal option.
>>>>>
>>>>> Looking through the mailing list, we found https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible this might be related?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Bruno
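Coming back to the original message above: when you say the jobs have a 30 second delay strategy, I am assuming a fixed-delay restart strategy roughly along the lines of the sketch below. The class name, the number of attempts and the toy pipeline are made up for illustration; only the 30 s delay is taken from your description.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: a job that retries a few times with a 30 s pause between
// attempts. If the JM restarts while a job is sitting in that pause, that would
// be the window you describe.
public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3,                  // number of restart attempts (made-up value)
                Time.seconds(30))); // 30 s delay between attempts, as you mention

        env.fromElements("a", "b", "c").print();
        env.execute("restart-strategy-sketch");
    }
}

The exact values don't matter here; I only want to make sure we are talking about the same kind of strategy while the JM is recovering. If your jobs configure it differently, e.g. via the cluster-wide default in the Flink configuration, it would help to know that too.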