Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks for sharing. I was wondering why you don't use $PORT0 in your command. Also, are the ports properly configured in the Marathon network configuration [1]? That said, the error seems to be unrelated to that setting.

Other than that, I cannot see any other issue with the configuration. Could it be that the HOST IP is blocked?

[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas wrote:

> Full appmaster log in debug mode is attached.
> My startup command was
>
> /opt/flink/bin/mesos-appmaster.sh \
>   -Drest.bind-port=8081 \
>   -Drest.port=8081 \
>   -Djobmanager.rpc.address=$HOST \
>   -Djobmanager.rpc.port=$PORT1 \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=4 \
>   -Dmesos.resourcemanager.tasks.container.type=docker \
>   -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>   -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 refers to my second host open port, mapped to 6123 on the
> Docker container (first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values,
> 10.0.20.25 and 31608.
Re: Unable to connect to Mesos on mesos-appmaster.sh start
...and if possible, it would be helpful to provide debug logs as well.
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Could you provide the entire JobManager logs so that we can see what's going on?
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks again, Matthias!

Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh, I see in the log that they seem to resolve to the correct values

-Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009

but a bit later the appmaster dies with this new error. It is unclear what address it is trying to bind. I added explicit params -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port was somehow interfering, but that didn't help.

2021-09-29 10:29:59.845 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Unknown Source)
    at java.base/sun.nio.ch.Net.bind(Unknown Source)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
    at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
.
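One possible reading of the BindException above, offered as an assumption rather than something this thread confirms: inside a Docker container with bridged networking, the agent IP that Marathon exposes as $HOST may not be assigned to any interface in the container, and binding a socket to an address that is not local fails with "Cannot assign requested address". A minimal shell sketch to check this from inside the container follows. The helper name is_local_addr is hypothetical, and the usage line assumes a Linux image that ships `hostname -I`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if address $1 appears in the
# whitespace-separated list of local addresses passed as $2.
is_local_addr() {
    addr=$1
    for local_addr in $2; do
        if [ "$local_addr" = "$addr" ]; then
            return 0
        fi
    done
    return 1
}

# Usage inside the container (assumes `hostname -I` is available):
#   if is_local_addr "$HOST" "$(hostname -I)"; then
#       echo "$HOST is assigned to a local interface"
#   else
#       echo "$HOST is not local; binding to it would fail"
#   fi
```

If the check reports that $HOST is not local, the address Flink advertises and the address it binds to would need to differ; how to configure that is outside what this thread establishes.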
Re: Unable to connect to Mesos on mesos-appmaster.sh start
The port has its own configuration parameter, jobmanager.rpc.port [1].

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
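The point above can be sketched in shell: jobmanager.rpc.address takes only a host, and the port goes into jobmanager.rpc.port. This is a minimal sketch, assuming an endpoint string of the form host:port. The function name rpc_flags is hypothetical, and the sample value is the one that appears in the log in this thread.

```shell
#!/bin/sh
# Hypothetical helper: split "host:port" into the two separate Flink options.
rpc_flags() {
    endpoint=$1
    host=${endpoint%:*}    # everything before the last colon
    port=${endpoint##*:}   # everything after the last colon
    printf '%s %s\n' "-Djobmanager.rpc.address=$host" "-Djobmanager.rpc.port=$port"
}

# Prints the two options on one line for the value seen in the log.
rpc_flags "10.0.22.114:31894"
```

With Marathon, the same effect is obtained by passing $HOST and $PORT0 as two separate options instead of joining them with a colon.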
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0, which in the log I see resolves properly to the host IP and the port mapped to 8081

2021-09-29 07:58:05.452 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=10.0.22.114:31894

which is very promising. But sadly, a little bit later the appmaster dies with this error:

2021-09-29 07:58:05.648 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2021-09-29 07:58:05.674 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
    at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
    at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
    at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
    at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
    at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
    at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: java.lang.IllegalArgumentException
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
    at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
    ... 17 more
.
2021-09-29 07:58:05.685 [main] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint MesosSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
    at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
    at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
    at org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
    at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
    at org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
    at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
Re: Unable to connect to Mesos on mesos-appmaster.sh start
One thing that was puzzling me yesterday when reading your post: Have you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I played around with Mesos, I remember using HOST to resolve the host's IP address instead of the host's name. It could be that the hostname itself cannot be resolved to the right IP address. But I struggled to find proper documentation to back that up. Only in the recipes section of the Marathon docs [1] was HOST used as well.

Matthias

[1] https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks

On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas wrote:

> Another update: Looking more carefully in my appmaster log, I see the
> following
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Registering as new framework.
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos Info:
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master URL: 10.0.18.246:5050
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework Info:
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: flink-test
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover Timeout (secs): 604800.0
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: *
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Capabilities: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: 311dcf7fd77c
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web UI: http://311dcf7fd77c:8081
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------
>
> which is picking up the mesos.master and
> mesos.resourcemanager.framework.name params I am passing to
> mesos-appmaster.sh
>
> In my Mesos dashboard I can see the framework has been created with the
> right name, but has no associated agents/tasks to it. So at least Flink has
> been able to connect to the Mesos master to create the framework
>
> Later in the mesos-appmaster log is when I see the Mesos connection errors:
>
> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting the slot manager.
> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> StoppedState) with data ()
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> ConnectingState) with data ()
> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos resource manager started.
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Another update: Looking more carefully in my appmaster log, I see the following

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Registering as new framework.
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - - ---
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos Info:
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master URL: 10.0.18.246:5050
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework Info:
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: flink-test
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover Timeout (secs): 604800.0
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: *
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Capabilities: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: 311dcf7fd77c
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web UI: http://311dcf7fd77c:8081
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - - ---

which is picking up the mesos.master and mesos.resourcemanager.framework.name params I am passing to mesos-appmaster.sh

In my Mesos dashboard I can see the framework has been created with the right name, but has no associated agents/tasks to it. So at least Flink has been able to connect to the Mesos master to create the framework

Later in the mesos-appmaster log is when I see the Mesos connection errors:

2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting the slot manager.
2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> StoppedState) with data ()
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> ConnectingState) with data ()
2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos resource manager started.
2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator - State change (Suspended -> Suspended) with data GatherData(List(),List())
2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.

So why the appmaster was able to connect to Mesos master to create the framework but failed to connect later to do whatever it does l
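
To follow the connection attempts without the surrounding heartbeat noise, the log can be reduced to its ConnectionMonitor entries. The following is only a minimal sketch, not part of the Flink tooling: `summarize_connection` is a hypothetical helper name, and it assumes the appmaster log is piped in on stdin in the format shown in the excerpt above.

```shell
# Hypothetical helper: summarize ConnectionMonitor activity in an appmaster log.
# Reads the log on stdin and counts state changes and "Unable to connect" warnings.
summarize_connection() {
  awk '
    /ConnectionMonitor/ && /State change/               { transitions++ }
    /ConnectionMonitor/ && /Unable to connect to Mesos/ { failures++ }
    END { printf "transitions=%d failures=%d\n", transitions, failures }
  '
}

# Example call (the log path is an assumption, adjust to your container):
# summarize_connection < /opt/flink/log/appmaster.log
```

A failure count that keeps growing while the transition count stays flat would match the excerpt, where the monitor enters ConnectingState and then only logs retries.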
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks, Matthias! There are lots of apps deployed to the Mesos cluster, the task manager itself is deployed to Mesos via Marathon. In the Mesos log I can see the Job manager agent starting, but no error messages related to it. As you say, TaskManagers don't even have the chance to get confused about variables, since the Job Manager can not connect to the Mesos master to tell it to start the Task Managers.

Thanks,
Javier

On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl wrote:
> Hi Javier,
> I don't see anything that's configured in the wrong way based on the
> jobmanager logs you've provided. Have you been able to deploy other
> applications to this Mesos cluster? Do the Mesos master logs reveal
> anything? The variable resolution on the TaskManager side is a valid
> concern shared by Roman since it's easy to run into such an issue. But the
> JobManager logs indicate that the JobManager is not able to contact the
> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
> not coming up.
>
> Best,
> Matthias
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks, Roman! Looking at the log, it seems that the TaskManager can resolve $HOSTNAME to its own hostname (07a6b681ee0f), as seen in these lines:

2021-09-27 22:02:41.067 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=*07a6b681ee0f*
2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at *07a6b681ee0f*:8081
2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://*07a6b681ee0f*:8081 was granted leadership with leaderSessionID=----
2021-09-27 22:02:43.026 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://*07a6b681ee0f*:8081.

I am deploying to Mesos with Marathon, so I have no way other than $HOSTNAME to indicate the host that will execute mesos-appmaster.sh.

The environment variables are set. This is what I can see if I hop into the Docker container:

root@07a6b681ee0f:/opt/flink# echo $HADOOP_CLASSPATH
/opt/flink/hadoop-3.2.2/etc/hadoop:/opt/flink/hadoop-3.2.2/share/hadoop/common/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/common/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/*:/opt/flink/lib
root@07a6b681ee0f:/opt/flink# echo $MESOS_NATIVE_JAVA_LIBRARY
/usr/lib/libmesos.so

On Tue, Sep 28, 2021 at 5:45 AM Roman Khachatryan wrote:
> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Hi Javier,

I don't see anything that's configured in the wrong way based on the jobmanager logs you've provided. Have you been able to deploy other applications to this Mesos cluster? Do the Mesos master logs reveal anything? The variable resolution on the TaskManager side is a valid concern shared by Roman since it's easy to run into such an issue. But the JobManager logs indicate that the JobManager is not able to contact the Mesos master. Hence, I'd assume that it's not related to the TaskManagers not coming up.

Best,
Matthias

On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan wrote:
> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman
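
The two exports in Roman's checklist can be verified at the top of the container entrypoint before mesos-appmaster.sh is launched. This is only a sketch under assumptions: `preflight` is a hypothetical function name, and treating a missing libmesos.so file as fatal is a choice made here for illustration, not something the Flink documentation prescribes.

```shell
# Hypothetical helper: check the environment described in the thread before
# starting mesos-appmaster.sh. Returns non-zero if anything is missing.
preflight() {
  preflight_status=0
  # HADOOP_CLASSPATH is expected to be exported (e.g. from `hadoop classpath`).
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is not set" >&2
    preflight_status=1
  fi
  # MESOS_NATIVE_JAVA_LIBRARY is expected to point at an existing libmesos.so.
  if [ ! -f "${MESOS_NATIVE_JAVA_LIBRARY:-}" ]; then
    echo "MESOS_NATIVE_JAVA_LIBRARY does not point to a file" >&2
    preflight_status=1
  fi
  return "$preflight_status"
}

# Example use in an entrypoint script:
# preflight && exec /opt/flink/bin/mesos-appmaster.sh "$@"
```

Failing fast here separates a broken image from a genuine connectivity problem between the JobManager and the Mesos master.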
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Hi,

No additional ports need to be open as far as I know.

Probably, $HOSTNAME is substituted for something not resolvable on TMs?

Please also make sure that the following gets executed before mesos-appmaster.sh:

export HADOOP_CLASSPATH=$(hadoop classpath)
export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so

(as per the documentation you linked)

Regards,
Roman

On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas wrote:
> I am trying to start Flink 1.13.2 on Mesos following the instructions in
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
> and using Marathon to deploy a Docker image with both the Flink and my
> binaries.
> [...]
Unable to connect to Mesos on mesos-appmaster.sh start
I am trying to start Flink 1.13.2 on Mesos following the instructions in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ and using Marathon to deploy a Docker image with both Flink and my binaries.

My entrypoint for the Docker image is:

/opt/flink/bin/mesos-appmaster.sh \
    -Djobmanager.rpc.address=$HOSTNAME \
    -Dmesos.resourcemanager.framework.user=flink \
    -Dmesos.master=10.0.18.246:5050 \
    -Dmesos.resourcemanager.tasks.cpus=6

When mesos-appmaster.sh starts, I see this in stderr:

I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3
I0927 16:50:43.624439 328 sched.cpp:336] New master detected at master@10.0.18.246:5050
I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. Attempting to register without authentication

where the "New master detected" line is promising.

However, on the Flink UI I see only the jobmanager started, and there are no task managers. Getting into the Docker container, I see this in the log:

WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...

I have verified that from the container I can access the Mesos master at 10.0.18.246:5050.

Does any other port besides the web UI port 5050 need to be open for mesos-appmaster to connect with the Mesos master?

In the appmaster log (attached) I see one exception, and I don't know if it is related to the Mesos connection problem:

java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
    at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
    at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
    at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
    at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
    at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
    at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)

I am not trying (yet) to run in high availability mode, so I am not sure if I need to have HADOOP_HOME set or not, but I don't see anything about HADOOP_HOME in the Flink docs.

Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can connect to my Mesos master?

Thanks,
Javier Vegas

flink--mesos-appmaster-6c49aa87e1d4.log
Description: Binary data
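
On the jobmanager.rpc.address setting in the entrypoint above: inside a Docker container, $HOSTNAME is typically the container ID rather than a name other machines can resolve, while Marathon exports $HOST for the agent running the task. The following is a hedged sketch of an entrypoint fragment that prefers $HOST when it is present. `rpc_address` is a hypothetical helper, and whether this change resolves the connection problem is not established in this message.

```shell
# Hypothetical helper: choose the address to advertise for the JobManager RPC endpoint.
# Prefers $HOST (exported by Marathon for the task), then $HOSTNAME, then localhost.
rpc_address() {
  printf '%s\n' "${HOST:-${HOSTNAME:-localhost}}"
}

# Example use in the entrypoint (other -D options omitted):
# /opt/flink/bin/mesos-appmaster.sh -Djobmanager.rpc.address="$(rpc_address)"
```

Keeping the fallback to $HOSTNAME means the same script still starts outside Marathon, where $HOST is usually unset.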