Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Till Rohrmann Fri, 28 Sep 2018 06:11:35 -0700

What do you think about reverting this change (FLINK-8696), since the
resulting failure is really hard for users to debug? The downside is that a
revert would break anyone who now relies on the second argument being the
hostname.


An alternative could be to filter out `cluster` and `local` if they appear
as the second argument. This could, however, lead to problems if a user
wants to set the hostname to either `local` or `cluster` via jobmanager.sh.
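
To make the filtering idea concrete, here is a minimal sketch. It is not the
actual jobmanager.sh code: the helper name `filter_legacy_mode` and the
warning text are made up for illustration. It drops a second argument that
looks like an old execution mode and passes everything else through as the
hostname.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT taken from Flink's jobmanager.sh.
# Prints the hostname to use, or nothing if the value looks like a
# legacy execution mode from the pre-1.5 calling convention.
filter_legacy_mode() {
    local candidate="${1:-}"
    case "$candidate" in
        cluster|local)
            # Old syntax: warn on stderr and ignore the value.
            echo "WARN: ignoring legacy execution mode '$candidate'" >&2
            ;;
        *)
            # Anything else is treated as the hostname (may be empty).
            printf '%s' "$candidate"
            ;;
    esac
}

# Usage: the old syntax yields an empty host, a real hostname passes through.
HOST="$(filter_legacy_mode "cluster")"
echo "host='${HOST}'"
HOST="$(filter_legacy_mode "flink-jobmanager")"
echo "host='${HOST}'"
```

The sketch also shows the drawback mentioned above: a machine that really is
named `local` or `cluster` could no longer be passed as the hostname.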

Cheers,
Till

On Wed, Sep 26, 2018 at 11:24 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Yes, that would be a good idea. I think it should go into the release
> notes. Will add it.
>
> On Wed, Sep 26, 2018 at 10:24 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Should we add a warning to the release announcements?
>>
>> Fabian
>>
>> On Wed, Sep 26, 2018 at 10:22 AM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hey Jamie,
>>>
>>> we've been facing the same issue with dA Platform when running Flink
>>> 1.6.1. I assume a lot of people will be affected by this.
>>>
>>>
>>>
>>> On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <trohrm...@apache.org>
>>> wrote:
>>>
>>>> Hi Jamie,
>>>>
>>>> thanks for the update on how to fix the problem. This is very helpful
>>>> for the rest of the community.
>>>>
>>>> The change of removing the execution mode parameter (FLINK-8696) from
>>>> the start up scripts was actually released with Flink 1.5.0. That way, the
>>>> host name became the 2nd parameter. By calling the start up scripts with
>>>> the old syntax, the execution mode parameter was interpreted as the
>>>> hostname. This host name option was, however, not properly evaluated until
>>>> we fixed it with Flink 1.5.4. Therefore, the problem only surfaced now.
>>>>
>>>> We definitely need to treat the start up scripts as a stable API as
>>>> well. So far, we don't have good tooling which ensures that we don't
>>>> introduce breaking changes. In the future we need to be more careful!
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <jgr...@lyft.com> wrote:
>>>>
>>>>> Update on this:
>>>>>
>>>>> The issue was the command being used to start the jobmanager:
>>>>> `jobmanager.sh start-foreground cluster`. This command was left over in
>>>>> our automation and used to be the correct way to start the JM. However,
>>>>> in Flink 1.5.4 the second parameter, `cluster`, is now interpreted as
>>>>> the hostname for the jobmanager to bind to.
>>>>>
>>>>> The solution was just to remove `cluster` from that command.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <jgr...@lyft.com> wrote:
>>>>>
>>>>>> Anybody else seen this and know the solution?  We're dead in the
>>>>>> water with Flink 1.5.4.
>>>>>>
>>>>>> On Sun, Sep 23, 2018 at 11:46 PM alex <ek.rei...@gmail.com> wrote:
>>>>>>
>>>>>>> We started to see the same errors after upgrading to Flink 1.6.0 from
>>>>>>> 1.4.2. We have one JM and 5 TMs on Kubernetes. The JM is running in HA
>>>>>>> mode. The TaskManagers sometimes lose their connection to the JM and
>>>>>>> log the following error, like the one you have.
>>>>>>>
>>>>>>> 2018-09-19 12:36:40,687 INFO
>>>>>>> org.apache.flink.runtime.taskexecutor.TaskExecutor -
>>>>>>> Could not resolve ResourceManager address
>>>>>>> akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager,
>>>>>>> retrying in 10000 ms: Ask timed out on
>>>>>>> [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/),
>>>>>>> Path(/user/resourcemanager)]] after [10000 ms].
>>>>>>> Sender[null] sent message of type "akka.actor.Identify"..
>>>>>>>
>>>>>>> Once a TM starts logging "Could not resolve ResourceManager", it does
>>>>>>> not recover on its own until I restart the TM pod.
>>>>>>>
>>>>>>> *Here is the content of our flink-conf.yaml:*
>>>>>>> blob.server.port: 6124
>>>>>>> jobmanager.rpc.address: flink-jobmanager
>>>>>>> jobmanager.rpc.port: 6123
>>>>>>> jobmanager.heap.mb: 4096
>>>>>>> jobmanager.web.history: 20
>>>>>>> jobmanager.archive.fs.dir: s3://our_path
>>>>>>> taskmanager.rpc.port: 6121
>>>>>>> taskmanager.heap.mb: 16384
>>>>>>> taskmanager.numberOfTaskSlots: 10
>>>>>>> taskmanager.log.path: /opt/flink/log/output.log
>>>>>>> web.log.path: /opt/flink/log/output.log
>>>>>>> state.checkpoints.num-retained: 3
>>>>>>> metrics.reporters: prom
>>>>>>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>> high-availability.jobmanager.port: 50002
>>>>>>> high-availability.zookeeper.quorum: zookeeper_instance_list
>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>> high-availability.cluster-id: profileservice
>>>>>>> high-availability.storageDir: s3://our_path
>>>>>>>
>>>>>>> Any help will be greatly appreciated!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from:
>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>>>>
>>>>>>
