Vishal, from which version did you upgrade to 1.5.1? Maybe from the 1.5.0
release? Knowing that might help narrow down the source of this.

On Wed, Aug 15, 2018 at 11:38 AM Juho Autio <juho.au...@rovio.com> wrote:

> Thanks, Gary.
>
> What could be blocking the RPC threads? Slow checkpointing?
>
> In production we're still using a self-built Flink package 1.5-SNAPSHOT,
> flink commit 8395508b0401353ed07375e22882e7581d46ac0e, and the jobs are
> stable.
>
> Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every
> day. What changed between commit 8395508b0401353ed07375e22882e7581d46ac0e &
> release 1.5.2?
>
> Also, I just tried to run a slightly heavier job. It eventually had some
> heartbeat timeouts, and then this:
>
> 2018-08-15 01:49:58,156 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) ->
> XFilter -> EventMapFilter(AppFilters) (4/8)
> (da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
> org.apache.flink.util.FlinkException: The assigned slot
> container_1534167926397_0001_01_000002_1 was removed.
>
> After that the job tried to restart according to the Flink restart
> strategy, but that kept failing with this error:
>
> 2018-08-15 02:00:22,000 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job X
> (19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 36, slots allocated: 12
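>
> If I'm matching the right setting, that 300000 ms looks like Flink's
> slot.request.timeout, which defaults to 300000 ms in 1.5 and could be
> raised in flink-conf.yaml (the 600000 here is just an illustration):
>
> slot.request.timeout: 600000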
>
> This was repeated until all restart attempts had been used (we've set it
> to 50), and then the job finally failed.
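>
> For reference, that cap is the fixed-delay restart strategy in
> flink-conf.yaml; roughly this (the 10 s delay is just an illustration):
>
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 50
> restart-strategy.fixed-delay.delay: 10 s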
>
> I would also like to know how to prevent Flink from getting into such a
> bad state. At least it should exit immediately instead of retrying in
> such a situation. And why was the assigned slot container removed?
>
> On Tue, Aug 14, 2018 at 11:24 PM Gary Yao <g...@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> It seems in your case the JobMaster did not receive a heartbeat from the
>> TaskManager in time [1]. Heartbeat requests and answers are sent over the
>> RPC framework, and RPCs of one component (e.g., TaskManager, JobMaster,
>> etc.) are dispatched by a single thread. Therefore, the reasons for
>> heartbeat timeouts include:
>>
>>     1. The RPC threads of the TM or JM are blocked. In this case
>> heartbeat requests or answers cannot be dispatched.
>>     2. The scheduled task for sending the heartbeat requests [2] died.
>>     3. The network is flaky.
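>>
>> To check reason 1, a thread dump of the JM/TM process can help; for
>> example (assuming the RPC threads are named after Akka's default
>> dispatcher, as they are in 1.5):
>>
>> jstack <taskmanager-pid> | grep -A 15 "flink-akka.actor.default-dispatcher"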
>>
>> If you are confident that the network is not the culprit, I would suggest
>> setting the logging level to DEBUG and looking for periodic log messages
>> (JM and TM logs) that are related to heartbeating. If the periodic log
>> messages are overdue, it is a hint that the main thread of the RPC
>> endpoint is blocked somewhere.
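>>
>> For example, with the log4j.properties that ships with the Flink
>> distribution, DEBUG can be enabled for just the heartbeat package:
>>
>> log4j.logger.org.apache.flink.runtime.heartbeat=DEBUG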
>>
>> Best,
>> Gary
>>
>> [1]
>> https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
>> [2]
>> https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64
>>
>> On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>
>>> I also have jobs failing on a daily basis with the error "Heartbeat of
>>> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>>>
>>> Could anyone suggest how to debug possible causes?
>>>
>>> I already set these in flink-conf.yaml, but I'm still getting failures:
>>> heartbeat.interval: 10000
>>> heartbeat.timeout: 100000
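>>>
>>> (If I read the docs right, both values are in milliseconds, i.e. a 10 s
>>> interval and a 100 s timeout.)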
>>>
>>> Thanks.
>>>
>>> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
>>>> According to the UI, it seems that "
>>>>
>>>> org.apache.flink.util.FlinkException: The assigned slot 
>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>
>>>> " was the cause of a pipe restart.
>>>>
>>>> As to the TM, it is an artifact of the new slot allocation regime,
>>>> which will exhaust all slots on one TM rather than distributing them
>>>> equitably. Some TMs are thus under more stress than in a pure
>>>> round-robin distribution, I think. We may have to lower the number of
>>>> slots on each TM to define a good upper bound, as sketched below. You
>>>> are correct, 50 s is a pretty generous value.
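>>>>
>>>> Lowering the per-TM slot count would be taskmanager.numberOfTaskSlots
>>>> in flink-conf.yaml; e.g. (the 4 is just an illustration):
>>>>
>>>> taskmanager.numberOfTaskSlots: 4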
>>>>
>>>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The first exception should only be logged on INFO level. It's expected
>>>>> to see this exception when a TaskManager unregisters from the
>>>>> ResourceManager.
>>>>>
>>>>> Heartbeats can be configured via heartbeat.interval and
>>>>> heartbeat.timeout [1]. The default timeout is 50 s, which should be a
>>>>> generous value. It is probably a good idea to find out why the
>>>>> heartbeats cannot be answered by the TM.
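>>>>>
>>>>> In flink-conf.yaml terms, the defaults are (both in milliseconds):
>>>>>
>>>>> heartbeat.interval: 10000
>>>>> heartbeat.timeout: 50000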
>>>>>
>>>>> Best,
>>>>> Gary
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>>>
>>>>>
>>>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>>>>> vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Two issues we are seeing on 1.5.1 on a streaming pipeline:
>>>>>>
>>>>>> org.apache.flink.util.FlinkException: The assigned slot 
>>>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>>>
>>>>>>
>>>>>> and
>>>>>>
>>>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 
>>>>>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>>>
>>>>>>
>>>>>> Not sure about the first, but how do we increase the heartbeat
>>>>>> interval of a TM?
>>>>>>
>>>>>> Thanks much
>>>>>>
>>>>>> Vishal
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
