Yes, that would be a good idea. I think it should go into the release
notes. Will add it.

On Wed, Sep 26, 2018 at 10:24 AM Fabian Hueske <fhue...@gmail.com> wrote:

> Should we add a warning to the release announcements?
>
> Fabian
>
> On Wed, Sep 26, 2018 at 10:22 AM Robert Metzger <
> rmetz...@apache.org> wrote:
>
>> Hey Jamie,
>>
>> we've been facing the same issue with dA Platform when running Flink
>> 1.6.1.
>> I assume a lot of people will be affected by this.
>>
>>
>>
>> On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Jamie,
>>>
>>> thanks for the update on how to fix the problem. This is very helpful
>>> for the rest of the community.
>>>
>>> The removal of the execution mode parameter (FLINK-8696) from the
>>> start-up scripts was actually released with Flink 1.5.0. As a result, the
>>> hostname became the second positional parameter. When the start-up scripts
>>> are called with the old syntax, the execution mode argument is therefore
>>> interpreted as the hostname. However, this hostname option was not properly
>>> evaluated until we fixed it in Flink 1.5.4, which is why the problem only
>>> surfaced now.
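
The argument shift described above can be sketched in plain shell (an
illustration of the positional-parameter change only, not the real
`jobmanager.sh` parsing logic):

```shell
# Pre-1.5 syntax:  jobmanager.sh <action> <execution-mode> [host] ...
# 1.5+ syntax:     jobmanager.sh <action> [host] ...
# With the old argument list, the former execution mode lands in the
# position that the 1.5+ scripts read as the bind host.
set -- start-foreground cluster   # old-style argument list
action=$1
shift                             # the 1.5+ scripts drop only the action...
host=$1                           # ...so 'cluster' is now read as the host
echo "action=$action host=$host"  # prints: action=start-foreground host=cluster
```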
>>>
>>> We definitely need to treat the start-up scripts as a stable API as
>>> well. So far, we don't have good tooling that ensures we don't
>>> introduce breaking changes. In the future we need to be more careful!
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:
>>>
>>>> Update on this:
>>>>
>>>> The issue was the command being used to start the jobmanager:
>>>> `jobmanager.sh start-foreground cluster`.  This was a command left over in
>>>> our automation that used to be the correct way to start the JM -- however
>>>> now, in Flink 1.5.4, that second parameter, `cluster`, is interpreted
>>>> as the hostname for the jobmanager to bind to.
>>>>
>>>> The solution was just to remove `cluster` from that command.
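
Since stale invocations like this can easily survive in automation, a small
guard in a wrapper script can reject the obsolete syntax before the token is
misread as a hostname. A minimal sketch (the mode names `cluster` and `local`
are an assumption based on this thread; adapt the list to your own wrapper):

```shell
#!/bin/sh
# Guard for start-up automation: refuse the obsolete pre-1.5 syntax in
# which an execution mode followed the action. On Flink >= 1.5 that
# token would silently be treated as the bind hostname.
check_jm_args() {
    case "$2" in
        cluster|local)
            echo "error: '$2' looks like a pre-1.5 execution mode; remove it" >&2
            return 1 ;;
    esac
    return 0
}

check_jm_args start-foreground cluster || echo "rejected stale syntax"
check_jm_args start-foreground && echo "ok"
```

In a real wrapper the script would `exec` the actual `jobmanager.sh` only
after the check passes.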
>>>>
>>>>
>>>>
>>>> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>>>>
>>>>> Anybody else seen this and know the solution?  We're dead in the water
>>>>> with Flink 1.5.4.
>>>>>
>>>>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>>>>
>>>>>> We started to see the same errors after upgrading from Flink 1.4.2 to
>>>>>> 1.6.0. We have one JM and 5 TMs on Kubernetes, with the JM running in
>>>>>> HA mode. The taskmanagers sometimes lose their connection to the JM
>>>>>> and then report the same error you have:
>>>>>>
>>>>>> *2018-09-19 12:36:40,687 INFO
>>>>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve
>>>>>> ResourceManager address
>>>>>> akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying
>>>>>> in 10000 ms: Ask timed out on
>>>>>> [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/),
>>>>>> Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent
>>>>>> message of type "akka.actor.Identify".*
>>>>>>
>>>>>> Once a TM starts reporting "Could not resolve ResourceManager", it
>>>>>> does not recover until I restart the TM pod.
>>>>>>
>>>>>> *Here is the content of our flink-conf.yaml:*
>>>>>> blob.server.port: 6124
>>>>>> jobmanager.rpc.address: flink-jobmanager
>>>>>> jobmanager.rpc.port: 6123
>>>>>> jobmanager.heap.mb: 4096
>>>>>> jobmanager.web.history: 20
>>>>>> jobmanager.archive.fs.dir: s3://our_path
>>>>>> taskmanager.rpc.port: 6121
>>>>>> taskmanager.heap.mb: 16384
>>>>>> taskmanager.numberOfTaskSlots: 10
>>>>>> taskmanager.log.path: /opt/flink/log/output.log
>>>>>> web.log.path: /opt/flink/log/output.log
>>>>>> state.checkpoints.num-retained: 3
>>>>>> metrics.reporters: prom
>>>>>> metrics.reporter.prom.class:
>>>>>> org.apache.flink.metrics.prometheus.PrometheusReporter
>>>>>>
>>>>>> high-availability: zookeeper
>>>>>> high-availability.jobmanager.port: 50002
>>>>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>> high-availability.cluster-id: profileservice
>>>>>> high-availability.storageDir: s3://our_path
>>>>>>
>>>>>> Any help will be greatly appreciated!
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from:
>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>>>
>>>>>
