Vishal, from which version did you upgrade to 1.5.1? Maybe from 1.5.0 (release)? Knowing that might help narrow down the source of this.
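For Gary's suggestion (quoted below) of checking at DEBUG level whether the periodic heartbeat messages are overdue, a small script along these lines might help. It is only a sketch: the timestamp layout is copied from the log excerpts in this thread, while the `heartbeat` keyword, the 15 s threshold, and the function name `find_heartbeat_gaps` are assumptions, not anything Flink defines. Adjust the keyword to whatever the DEBUG messages actually contain in your logs.

```python
import re
from datetime import datetime

# Timestamp layout as seen in the log excerpts in this thread,
# e.g. "2018-08-15 01:49:58,156".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def find_heartbeat_gaps(lines, keyword="heartbeat", max_gap_s=15.0):
    """Return (previous, current, gap_seconds) for each overdue heartbeat line.

    Both defaults are assumptions: `keyword` must match the real DEBUG
    message text, and `max_gap_s` should sit a little above the configured
    heartbeat.interval (10000 ms in the flink-conf.yaml quoted below).
    """
    gaps = []
    previous = None
    for line in lines:
        if keyword.lower() not in line.lower():
            continue  # not a heartbeat-related message
        match = TS_RE.match(line)
        if match is None:
            continue  # continuation line without a leading timestamp
        current = datetime.strptime(match.group(1), TS_FORMAT)
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > max_gap_s:
                gaps.append((previous, current, gap))
        previous = current
    return gaps


# Usage (hypothetical path): iterate over a JM or TM log file.
# with open("jobmanager.log") as log:
#     for prev, cur, gap in find_heartbeat_gaps(log):
#         print(f"{gap:.1f}s without a heartbeat message: {prev} -> {cur}")
```

A long gap in one component's log while the other component keeps logging on schedule would point at a blocked RPC main thread on that side rather than at the network.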

On Wed, Aug 15, 2018 at 11:38 AM Juho Autio <juho.au...@rovio.com> wrote:

> Thanks Gary..
>
> What could be blocking the RPC threads? Slow checkpointing?
>
> In production we're still using a self-built Flink package 1.5-SNAPSHOT,
> flink commit 8395508b0401353ed07375e22882e7581d46ac0e, and the jobs are
> stable.
>
> Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every
> day. What changed between commit 8395508b0401353ed07375e22882e7581d46ac0e &
> release 1.5.2?
>
> Also, I just tried to run a slightly heavier job. It eventually had some
> heartbeat timeouts, and then this:
>
> 2018-08-15 01:49:58,156 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
> Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) ->
> XFilter -> EventMapFilter(AppFilters) (4/8)
> (da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
> org.apache.flink.util.FlinkException: The assigned slot
> container_1534167926397_0001_01_000002_1 was removed.
>
> After that the job tried to restart according to the Flink restart strategy,
> but that kept failing with this error:
>
> 2018-08-15 02:00:22,000 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job X
> (19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 36, slots allocated: 12
>
> This was repeated until all restart attempts had been used (we've set it
> to 50), and then the job finally failed.
>
> I would also like to know how to prevent Flink from going into such a bad
> state. At least it should exit immediately instead of retrying in such a
> situation. And why was "slot container removed"?
>
> On Tue, Aug 14, 2018 at 11:24 PM Gary Yao <g...@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> It seems in your case the JobMaster did not receive a heartbeat from the
>> TaskManager in time [1]. Heartbeat requests and answers are sent over the
>> RPC framework, and RPCs of one component (e.g., TaskManager, JobMaster,
>> etc.) are dispatched by a single thread. Therefore, the reasons for
>> heartbeat timeouts include:
>>
>> 1. The RPC threads of the TM or JM are blocked. In this case
>>    heartbeat requests or answers cannot be dispatched.
>> 2. The scheduled task for sending the heartbeat requests [2] died.
>> 3. The network is flaky.
>>
>> If you are confident that the network is not the culprit, I would suggest
>> setting the logging level to DEBUG and looking for periodic log messages
>> (JM and TM logs) that are related to heartbeating. If the periodic log
>> messages are overdue, it is a hint that the main thread of the RPC
>> endpoint is blocked somewhere.
>>
>> Best,
>> Gary
>>
>> [1]
>> https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
>> [2]
>> https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64
>>
>> On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>
>>> I also have jobs failing on a daily basis with the error "Heartbeat of
>>> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>>>
>>> Could anyone suggest how to debug possible causes?
>>>
>>> I already set these in flink-conf.yaml, but I'm still getting failures:
>>> heartbeat.interval: 10000
>>> heartbeat.timeout: 100000
>>>
>>> Thanks.
>>>
>>> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> According to the UI it seems that "
>>>>
>>>> org.apache.flink.util.FlinkException: The assigned slot
>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>
>>>> " was the cause of a pipe restart.
>>>>
>>>> As to the TM, it is an artifact of the new job allocation regime, which
>>>> will exhaust all slots on a TM rather than distributing them equitably.
>>>> Some TMs are under more stress than in a pure RR distribution, I
>>>> think. We may have to lower the slots on each TM to define a good upper
>>>> bound. You are correct, 50s is a pretty generous value.
>>>>
>>>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <g...@data-artisans.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The first exception should be only logged on info level. It's expected
>>>>> to see this exception when a TaskManager unregisters from the
>>>>> ResourceManager.
>>>>>
>>>>> Heartbeats can be configured via heartbeat.interval and
>>>>> heartbeat.timeout [1]. The default timeout is 50s, which should be a
>>>>> generous value. It is probably a good idea to find out why the
>>>>> heartbeats cannot be answered by the TM.
>>>>>
>>>>> Best,
>>>>> Gary
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>>>
>>>>>
>>>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> 2 issues we are seeing on 1.5.1 on a streaming pipeline
>>>>>>
>>>>>> org.apache.flink.util.FlinkException: The assigned slot
>>>>>> 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>>>
>>>>>>
>>>>>> and
>>>>>>
>>>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id
>>>>>> 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>>>
>>>>>>
>>>>>> Not sure about the first, but how do we increase the heartbeat
>>>>>> interval of a TM?
>>>>>>
>>>>>> Thanks much
>>>>>>
>>>>>> Vishal
>>>>>>