Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Robert Metzger Wed, 26 Sep 2018 01:22:36 -0700

Hey Jamie,

we've been facing the same issue with dA Platform when running Flink 1.6.1.
I assume a lot of people will be affected by this.
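
For anyone with older automation, here is a minimal sketch of a guard around
the fix described in the quoted messages below. It assumes the legacy second
argument was either `cluster` or `local` (only `cluster` is confirmed in this
thread), and `normalize_jm_args` is a hypothetical helper name, not something
that ships with Flink. It only prints the resulting command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Flink): build an argument list for
# jobmanager.sh that drops a legacy execution-mode argument.
normalize_jm_args() {
  local action="$1"
  shift
  # Per this thread, the old syntax passed an execution mode (e.g. `cluster`)
  # as the second argument; newer scripts read that position as the hostname.
  # Treating `local` as a legacy mode value is an assumption.
  if [ "$#" -gt 0 ] && { [ "$1" = "cluster" ] || [ "$1" = "local" ]; }; then
    shift
  fi
  echo "$action" "$@"
}

# Show the command that would be run, without executing it.
echo "jobmanager.sh $(normalize_jm_args start-foreground cluster)"
```

One trade-off: a host that is literally named `cluster` or `local` would also
be dropped, so this is only reasonable as a transitional shim until the
automation itself is updated.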




On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Jamie,
>
> thanks for the update on how to fix the problem. This is very helpful for
> the rest of the community.
>
> The change that removed the execution mode parameter (FLINK-8696) from the
> startup scripts was actually released with Flink 1.5.0. As a result, the
> hostname became the 2nd parameter. When the startup scripts are called with
> the old syntax, the execution mode parameter is interpreted as the hostname.
> This hostname option was, however, not properly evaluated until we fixed it
> in Flink 1.5.4. Therefore, the problem only surfaced now.
>
> We definitely need to treat the startup scripts as a stable API as well.
> So far, we don't have good tooling to ensure that we don't introduce
> breaking changes. In the future we need to be more careful!
>
> Cheers,
> Till
>
> On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:
>
>> Update on this:
>>
>> The issue was the command being used to start the jobmanager:
>> `jobmanager.sh start-foreground cluster`. This was a command left over in
>> our automation that used to be the correct way to start the JM -- however
>> now, in Flink 1.5.4, that second parameter, `cluster`, is being interpreted
>> as the hostname for the jobmanager to bind to.
>>
>> The solution was just to remove `cluster` from that command.
>>
>>
>>
>> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>>
>>> Anybody else seen this and know the solution?  We're dead in the water
>>> with Flink 1.5.4.
>>>
>>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>>
>>>> We started to see the same errors after upgrading to Flink 1.6.0 from
>>>> 1.4.2. We have one JM and 5 TMs on Kubernetes. The JM is running in HA
>>>> mode. TaskManagers sometimes lose the connection to the JM and report
>>>> the following error, like the one you have.
>>>>
>>>> 2018-09-19 12:36:40,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify".
>>>>
>>>> Once a TM starts reporting "Could not resolve ResourceManager", it does
>>>> not recover until I restart the TM pod.
>>>>
>>>> *Here is the content of our flink-conf.yaml:*
>>>> blob.server.port: 6124
>>>> jobmanager.rpc.address: flink-jobmanager
>>>> jobmanager.rpc.port: 6123
>>>> jobmanager.heap.mb: 4096
>>>> jobmanager.web.history: 20
>>>> jobmanager.archive.fs.dir: s3://our_path
>>>> taskmanager.rpc.port: 6121
>>>> taskmanager.heap.mb: 16384
>>>> taskmanager.numberOfTaskSlots: 10
>>>> taskmanager.log.path: /opt/flink/log/output.log
>>>> web.log.path: /opt/flink/log/output.log
>>>> state.checkpoints.num-retained: 3
>>>> metrics.reporters: prom
>>>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>>>
>>>> high-availability: zookeeper
>>>> high-availability.jobmanager.port: 50002
>>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>>> high-availability.zookeeper.path.root: /flink
>>>> high-availability.cluster-id: profileservice
>>>> high-availability.storageDir: s3://our_path
>>>>
>>>> Any help will be greatly appreciated!
>>>
