Another update:  Looking more carefully in my appmaster log, I see the
following

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Registering
as new framework.

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
-----------------------------------------------------------------------------

---

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos Info:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
URL: 10.0.18.246:5050

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
Info:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
(none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
flink-test

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
Timeout (secs): 604800.0

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role: *

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
Capabilities:
(none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
(none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
311dcf7fd77c

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web UI:
http://311dcf7fd77c:8081

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
-----------------------------------------------------------------------------

---


which is picking up the mesos.master and
mesos.resourcemanager.framework.name params I am passing to
mesos-appmaster.sh


In my Mesos dashboard I can see the framework has been created with the
right name, but has no associated agents/tasks to it. So at least Flink has
been able to connect to the Mesos master to create the framework


Later in the mesos-appmaster log is when I see the Mesos connection errors:


2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting the
slot manager.

2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
(StoppedState -> StoppedState) with data ()

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
heartbeat request.

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State change
(Suspended -> Suspended) with data ReconciliationData(Map(),0)

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
heartbeat request.

2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to Mesos...

2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
(StoppedState -> ConnectingState) with data ()

2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
resource manager started.

2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
(Suspended -> Suspended) with data GatherData(List(),List())

2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to
Mesos; still trying...

2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
heartbeat request.

2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
heartbeat request.




So why the appmaster was able to connect to Mesos master to create the
framework but failed to connect later to do whatever it does later?


One possible issue I see is that the framework is set with web UI in h
ttp://311dcf7fd77c:8081 which can not be resolved from the Mesos
master. 311dcf7fd77c
is the result of doing hostname on the Docker container, and the Mesos
master can not resolve that name. I could try to replace the Docker
container hostname with the Docker host hostname, but the host port that
gets mapped to 8081 on the container is a random port that I can not know
beforehand. Does Mesos master try to reach Flink using that Web UI setting?
Could this be the issue causing my connection problem, or is this a red
herring and the problem is a different one?


Thanks,


Javier Vegas








On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jve...@strava.com> wrote:

> Thanks, Matthias!
>
> There are lots of apps deployed to the Mesos cluster, the task manager
> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
> Job manager agent starting, but no error messages related to it. As you
> say, TaskManagers don't even have the chance to get confused about
> variables, since the Job Manager can not connect to the Mesos master to
> tell it to start the Task Managers.
>
> Thanks,
>
> Javier
>
> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> Hi Javier,
>> I don't see anything that's configured in the wrong way based on the
>> jobmanager logs you've provided. Have you been able to deploy other
>> applications to this Mesos cluster? Do the Mesos master logs reveal
>> anything? The variable resolution on the TaskManager side is a valid
>> concern shared by Roman since it's easy to run into such an issue. But the
>> JobManager logs indicate that the JobManager is not able to contact the
>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>> not coming up.
>>
>> Best,
>> Matthias
>>
>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> No additional ports need to be open as far as I know.
>>>
>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>
>>> Please also make sure that the following gets executed before
>>> mesos-appmaster.sh:
>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>> (as per the documentation you linked)
>>>
>>> Regards,
>>> Roman
>>>
>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jve...@strava.com> wrote:
>>> >
>>> > I am trying to start Flink 1.13.2 on Mesos following the instrucions
>>> in
>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>> and using Marathon to deploy a Docker image with both the Flink and my
>>> binaries.
>>> >
>>> > My entrypoint for the Docker image is:
>>> >
>>> >
>>> > /opt/flink/bin/mesos-appmaster.sh \
>>> >
>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>> >
>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>> >
>>> >       -Dmesos.master=10.0.18.246:5050 \
>>> >
>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>> >
>>> >
>>> >
>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>> >
>>> >
>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>> >
>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>> >
>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>> executor on 10.0.20.177
>>> >
>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>> >
>>> > WARNING: Your kernel does not support swap limit capabilities or the
>>> cgroup is not mounted. Memory limited without swap.
>>> >
>>> > WARNING: An illegal reflective access operation has occurred
>>> >
>>> > WARNING: Illegal reflective access by
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>> sun.security.krb5.Config.getInstance()
>>> >
>>> > WARNING: Please consider reporting this to the maintainers of
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >
>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>> illegal reflective access operations
>>> >
>>> > WARNING: All illegal access operations will be denied in a future
>>> release
>>> >
>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>> >
>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>>> master@10.0.18.246:5050
>>> >
>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
>>> Attempting to register without authentication
>>> >
>>> >
>>> > where the "New master detected" line is promising.
>>> >
>>> > However, on the Flink UI I see only the jobmanager started, and there
>>> are no task managers.  Getting into the Docker container, I see this in the
>>> log:
>>> >
>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>> connect to Mesos; still trying...
>>> >
>>> >
>>> > I have verified that from the container I can access the Mesos
>>> container 10.0.18.246:5050
>>> >
>>> >
>>> > Does any other port besides the web UI port 5050 need to be open for
>>> mesos-appmaster to connect with the Mesos master?
>>> >
>>> >
>>> > In the appmaster log (attached) I see one exception that I don't know
>>> if they are related to the Mesos connection problem, one is
>>> >
>>> >
>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are
>>> unset.
>>> >
>>> >         at
>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>> >
>>> >         at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>> >
>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>> >
>>> >         at
>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>> >
>>> >         at
>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>> >
>>> >         at
>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>> >
>>> >         at
>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>> >
>>> >         at
>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>> >
>>> >         at
>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>> >
>>> >         at
>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>> >
>>> >         at
>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>> >
>>> >         at
>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>> >
>>> >         at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>> >
>>> >         at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>> Source)
>>> >
>>> >         at
>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>> Source)
>>> >
>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>> >
>>> >         at
>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>> >
>>> >         at
>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>> >
>>> >         at
>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>> >
>>> >
>>> >
>>> >
>>> > I am not trying (yet) to run in high availability mode, so I am not
>>> sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>> about HADOOP_HOME in the FLink docs.
>>> >
>>> >
>>> >
>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>> Flink can connect to my Mesos master?
>>> >
>>> >
>>> > Thanks,
>>> >
>>> >
>>> > Javier Vegas
>>> >
>>> >
>>
>>

Reply via email to