Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-30 Thread Matthias Pohl
Thanks for sharing. I was wondering why you don't use $PORT0 in your
command. And: Are the ports properly configured in the Marathon network
configuration [1]? But the error seems to be unrelated to that setting.
Other than that, I cannot see any other issue with the configuration. It
could be that the HOST IP is blocked?

[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
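
One possible reading of the `java.net.BindException: Cannot assign requested address` reported earlier in this thread is that the process tries to bind an IP address that is not assigned to any network interface it can see. This can happen when a host IP is passed into a Docker container on a bridge network. The thread does not confirm this; it is only a guess. Below is a minimal sketch for checking it from inside the container. The helper name `is_local_address` is hypothetical, and `hostname -I` is assumed to be available in the image to list local addresses.

```shell
# Hypothetical helper: succeed if address $1 is in the whitespace-separated
# list of local addresses given as $2.
is_local_address() {
  for addr in $2; do
    if [ "$addr" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Example with the host IP from the thread and made-up container addresses.
if is_local_address "10.0.20.25" "127.0.0.1 172.17.0.2"; then
  echo "local address"
else
  echo "not a local address"
fi

# Inside the container one might run (assumption: hostname -I exists there):
#   is_local_address "$HOST" "$(hostname -I)"
```

If the check fails for the value passed as jobmanager.rpc.address, the bind error would be consistent with the address simply not existing inside the container.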

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas  wrote:

>
> Full appmaster log in debug mode is attached.
> My startup command was
> /opt/flink/bin/mesos-appmaster.sh \
>   -Drest.bind-port=8081 \
>   -Drest.port=8081 \
>   -Djobmanager.rpc.address=$HOST \
>   -Djobmanager.rpc.port=$PORT1 \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=4 \
>   -Dmesos.resourcemanager.tasks.container.type=docker \
>   -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>   -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 refers to my second open host port, mapped to 6123 on the
> Docker container (the first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values,
> 10.0.20.25 and 31608.
>
> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl 
> wrote:
>
>> ...and if possible, it would be helpful to provide debug logs as well.
>>
>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl 
>> wrote:
>>
>>> Could you provide the entire JobManager logs so that we can see what's
>>> going on?

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
...and if possible, it would be helpful to provide debug logs as well.

On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl 
wrote:

> Could you provide the entire JobManager logs so that we can see what's going
> on?
>
> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:
>
>> Thanks again, Matthias!
>>
>> Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
>> as params for appmaster.sh,
>> I see in the log that they resolve to the correct values
>>
>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>
>> but a bit later the appmaster dies with this new error. It is unclear
>> what address it is trying to bind. I added explicit params
>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>> was somehow interfering, but that didn't help.
>>
>> 2021-09-29 10:29:59.845 [main] INFO  
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
>> MesosSessionClusterEntrypoint down with application status FAILED. 
>> Diagnostics java.net.BindException: Cannot assign requested address
>>  at java.base/sun.nio.ch.Net.bind0(Native Method)
>>  at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>  at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>  at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>  at java.base/java.lang.Thread.run(Unknown Source)
>>
>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
>> wrote:
>>
>>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
Could you provide the entire JobManager logs so that we can see what's going
on?

On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:

> Thanks again, Matthias!
>
> Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
> as params for appmaster.sh,
> I see in the log that they resolve to the correct values
>
> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>
> but a bit later the appmaster dies with this new error. It is unclear what
> address it is trying to bind. I added explicit params
> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
> was somehow interfering, but that didn't help.
>
> 2021-09-29 10:29:59.845 [main] INFO  
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
> MesosSessionClusterEntrypoint down with application status FAILED. 
> Diagnostics java.net.BindException: Cannot assign requested address
>   at java.base/sun.nio.ch.Net.bind0(Native Method)
>   at java.base/sun.nio.ch.Net.bind(Unknown Source)
>   at java.base/sun.nio.ch.Net.bind(Unknown Source)
>   at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>   at 
> org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Unknown Source)
>
> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
> wrote:
>
>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>
>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:
>>
>>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>>> properly to the host IP and port mapped to 8081
>>>
>>> 2021-09-29 07:58:05.452 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>
>>> which is very promising. But sadly a little bit later appmaster dies
>>> with this error:
>>>
>>> 2021-09-29 07:58:05.648 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>> cluster services.
>>> 2021-09-29 07:58:05.674 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>> The configured hostname is not valid
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>> at
>>> 

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Javier Vegas
Thanks again, Matthias!

Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
as params for appmaster.sh,
I see in the log that they resolve to the correct values

-Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009

but a bit later the appmaster dies with this new error. It is unclear what
address it is trying to bind. I added explicit params
-Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
was somehow interfering, but that didn't help.

2021-09-29 10:29:59.845 [main] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
MesosSessionClusterEntrypoint down with application status FAILED.
Diagnostics java.net.BindException: Cannot assign requested address
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)


.


On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
wrote:

> The port has its separate configuration parameter jobmanager.rpc.port [1]
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>
> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:
>
>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>> properly to the host IP and port mapped to 8081
>>
>> 2021-09-29 07:58:05.452 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>> -Djobmanager.rpc.address=10.0.22.114:31894
>>
>> which is very promising. But sadly a little bit later appmaster dies with
>> this error:
>>
>> 2021-09-29 07:58:05.648 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>> cluster services.
>> 2021-09-29 07:58:05.674 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>> MesosSessionClusterEntrypoint down with application status FAILED.
>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>> The configured hostname is not valid
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>> at
>> 

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
The port has its separate configuration parameter jobmanager.rpc.port [1]

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
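
As a minimal sketch of that separation (not taken from the thread): a combined value such as 10.0.22.114:31894 can be split so that the address and the port each go to their own option. The helper name `split_host_port` is hypothetical, and the sketch assumes an IPv4 address or plain hostname, since an IPv6 literal contains colons of its own.

```shell
# Hypothetical helper (not part of Flink or Marathon): turn "host:port"
# into two separate -D options, one for the address and one for the port.
split_host_port() {
  hostport="$1"
  host="${hostport%:*}"    # text before the last colon
  port="${hostport##*:}"   # text after the last colon
  printf '%s %s\n' "-Djobmanager.rpc.address=$host" "-Djobmanager.rpc.port=$port"
}

split_host_port "10.0.22.114:31894"
```

In a Marathon command the same effect is obtained more directly by passing $HOST and $PORT0 to the two options individually rather than joining them with a colon.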

On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:

> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
> properly to the host IP and port mapped to 8081
>
> 2021-09-29 07:58:05.452 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
> -Djobmanager.rpc.address=10.0.22.114:31894
>
> which is very promising. But sadly a little bit later appmaster dies with
> this error:
>
> 2021-09-29 07:58:05.648 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
> cluster services.
> 2021-09-29 07:58:05.674 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
> MesosSessionClusterEntrypoint down with application status FAILED.
> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
> The configured hostname is not valid
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
> at
> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
> Caused by: java.lang.IllegalArgumentException
> at
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
> ... 17 more
> .
> 2021-09-29 07:58:05.685 [main] ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
> cluster entrypoint MesosSessionClusterEntrypoint.
> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to
> initialize the cluster entrypoint MesosSessionClusterEntrypoint.
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
> Caused by: org.apache.flink.configuration.IllegalConfigurationException:
> The configured hostname is not valid
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
> at
> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
> at
> 

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Javier Vegas
Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
properly to the host IP and port mapped to 8081

2021-09-29 07:58:05.452 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
-Djobmanager.rpc.address=10.0.22.114:31894

which is very promising. But sadly a little bit later appmaster dies with
this error:

2021-09-29 07:58:05.648 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
cluster services.
2021-09-29 07:58:05.674 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
MesosSessionClusterEntrypoint down with application status FAILED.
Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
The configured hostname is not valid
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
at
org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: java.lang.IllegalArgumentException
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
... 17 more
.
2021-09-29 07:58:05.685 [main] ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
cluster entrypoint MesosSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to
initialize the cluster entrypoint MesosSessionClusterEntrypoint.
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: org.apache.flink.configuration.IllegalConfigurationException:
The configured hostname is not valid
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
at
org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
One thing that was puzzling me yesterday when reading your post: Have you
tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
played around with Mesos, I remember using HOST to resolve the host's IP
address instead of the host's name. It could be that the hostname itself
cannot be resolved to the right IP address. But I struggled to find proper
documentation to back that up. Only in the recipes section of the Marathon
docs [1], HOST was used as well.

Matthias

[1]
https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
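
A minimal sketch of that idea, assuming Marathon exports HOST for the task and that HOSTNAME inside a Docker container is the container's own name (by default the container ID), which other machines may not be able to resolve. The helper name `pick_rpc_address` is hypothetical.

```shell
# Hypothetical helper: prefer Marathon's $HOST, fall back to $HOSTNAME
# when HOST is unset or empty.
pick_rpc_address() {
  printf '%s\n' "${HOST:-$HOSTNAME}"
}

# Values of the kind seen in this thread, set here only for illustration.
HOST="10.0.23.35"
HOSTNAME="311dcf7fd77c"
echo "-Djobmanager.rpc.address=$(pick_rpc_address)"
```

With both variables set as above, the option is built from the HOST value; the container-style HOSTNAME is only used when HOST is missing.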


Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
Another update: looking more carefully at my appmaster log, I see the
following:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Registering as new framework.
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - --------------------------------------------------------------------------------
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos Info:
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Master URL: 10.0.18.246:5050
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework Info:
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - ID: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Name: flink-test
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Failover Timeout (secs): 604800.0
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Role: *
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Capabilities: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Principal: (none)
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Host: 311dcf7fd77c
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Web UI: http://311dcf7fd77c:8081
2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - --------------------------------------------------------------------------------


which shows that it is picking up the mesos.master and
mesos.resourcemanager.framework.name params I am passing to
mesos-appmaster.sh.

In my Mesos dashboard I can see that the framework has been created with the
right name, but it has no agents/tasks associated with it. So at least Flink
has been able to connect to the Mesos master to create the framework.

Later in the mesos-appmaster log I see the Mesos connection errors:


2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting the slot manager.
2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change (StoppedState -> StoppedState) with data ()
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.
2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to Mesos...
2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change (StoppedState -> ConnectingState) with data ()
2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos resource manager started.
2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change (Suspended -> Suspended) with data GatherData(List(),List())
2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...
2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.
2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.




So why was the appmaster able to connect to the Mesos master to create the
framework, but unable to connect later to do whatever it does
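
To look at only the connection-related entries of an appmaster log like the one above, a small filter may help. This is just a sketch: the function name mesos_conn_lines and the log path in the usage line are assumptions, and it assumes each log entry is on a single line (as in the log file itself, not as wrapped by the mail client).

```shell
#!/bin/sh
# Print only the log lines emitted by the ConnectionMonitor logger
# or containing the "Unable to connect to Mesos" warning.
# $1: path to the appmaster log file
mesos_conn_lines() {
    grep -E 'ConnectionMonitor|Unable to connect to Mesos' "$1"
}
```

Usage (the path is hypothetical): mesos_conn_lines /opt/flink/log/appmaster.log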

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
Thanks, Matthias!

There are lots of apps deployed to the Mesos cluster; the task manager
itself is deployed to Mesos via Marathon. In the Mesos log I can see the
Job Manager agent starting, but no error messages related to it. As you
say, TaskManagers don't even have the chance to get confused about
variables, since the Job Manager cannot connect to the Mesos master to
tell it to start the Task Managers.

Thanks,

Javier


Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
Thanks, Roman!

Looking at the log, it seems that the TaskManager can resolve $HOSTNAME to
its own hostname (07a6b681ee0f), as seen in these lines:

2021-09-27 22:02:41.067 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - -Djobmanager.rpc.address=07a6b681ee0f
2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Rest endpoint listening at 07a6b681ee0f:8081
2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - http://07a6b681ee0f:8081 was granted leadership with leaderSessionID=----
2021-09-27 22:02:43.026 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Web frontend listening at http://07a6b681ee0f:8081.


I am deploying to Mesos with Marathon, so I have no way other than
$HOSTNAME to indicate the host that will execute mesos-appmaster.sh

The environment variables are set; this is what I see if I hop into the
Docker container:

root@07a6b681ee0f:/opt/flink# echo $HADOOP_CLASSPATH

/opt/flink/hadoop-3.2.2/etc/hadoop:/opt/flink/hadoop-3.2.2/share/hadoop/common/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/common/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/*:/opt/flink/lib


root@07a6b681ee0f:/opt/flink# echo $MESOS_NATIVE_JAVA_LIBRARY

/usr/lib/libmesos.so
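
A long HADOOP_CLASSPATH value like the one above is easier to check one entry per line. The following is a rough sketch, not something from the Flink docs; the function name check_classpath is hypothetical, and it only checks that the directory behind each entry exists (a trailing /* wildcard is stripped first).

```shell
#!/bin/sh
# Print each colon-separated classpath entry on its own line, prefixed with
# "ok" if the path (minus any trailing "/*") exists, or "MISSING" otherwise.
# $1: the classpath string to inspect
check_classpath() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        # Skip empty entries produced by leading, trailing or doubled colons.
        [ -n "$entry" ] || continue
        dir=${entry%/\*}
        if [ -e "$dir" ]; then
            echo "ok      $entry"
        else
            echo "MISSING $entry"
        fi
    done
}
```

Usage inside the container: check_classpath "$HADOOP_CLASSPATH"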





Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Matthias Pohl
Hi Javier,
I don't see anything that's configured in the wrong way based on the
jobmanager logs you've provided. Have you been able to deploy other
applications to this Mesos cluster? Do the Mesos master logs reveal
anything? The variable resolution on the TaskManager side is a valid
concern shared by Roman since it's easy to run into such an issue. But the
JobManager logs indicate that the JobManager is not able to contact the
Mesos master. Hence, I'd assume that it's not related to the TaskManagers
not coming up.

Best,
Matthias


Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Roman Khachatryan
Hi,

No additional ports need to be open as far as I know.

Perhaps $HOSTNAME is substituted with something that is not resolvable on the TMs?

Please also make sure that the following gets executed before
mesos-appmaster.sh:
export HADOOP_CLASSPATH=$(hadoop classpath)
export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
(as per the documentation you linked)
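
A small guard along these lines could run before mesos-appmaster.sh to catch a missing variable early. It is only a sketch under assumptions: the function name preflight_mesos_env is hypothetical, and the checks (non-empty classpath, library path pointing to an existing file) are my guess at what "properly set" means here.

```shell
#!/bin/sh
# Return 0 only if HADOOP_CLASSPATH is non-empty and
# MESOS_NATIVE_JAVA_LIBRARY names an existing file; print what is wrong otherwise.
preflight_mesos_env() {
    status=0
    if [ -z "${HADOOP_CLASSPATH:-}" ]; then
        echo "HADOOP_CLASSPATH is empty or unset" >&2
        status=1
    fi
    if [ ! -f "${MESOS_NATIVE_JAVA_LIBRARY:-}" ]; then
        echo "MESOS_NATIVE_JAVA_LIBRARY does not point to a file: ${MESOS_NATIVE_JAVA_LIBRARY:-<unset>}" >&2
        status=1
    fi
    return "$status"
}
```

Usage in an entrypoint script: preflight_mesos_env && exec /opt/flink/bin/mesos-appmaster.sh "$@"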

Regards,
Roman

On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas wrote:
>
> I am trying to start Flink 1.13.2 on Mesos following the instructions in
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
> and using Marathon to deploy a Docker image with both the Flink binaries
> and my own.
>
> My entrypoint for the Docker image is:
>
>
> /opt/flink/bin/mesos-appmaster.sh \
>   -Djobmanager.rpc.address=$HOSTNAME \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=6
>
>
>
> When mesos-appmaster.sh starts, in the stderr I see this:
>
>
> I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
> I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
> I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
> I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
> WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
> I0927 16:50:43.624439   328 sched.cpp:336] New master detected at master@10.0.18.246:5050
> I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>
>
> where the "New master detected" line is promising.
>
> However, in the Flink UI I see only the jobmanager started, and there are no
> task managers. Getting into the Docker container, I see this in the log:
>
> WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...
>
> I have verified that from the container I can access the Mesos container
> 10.0.18.246:5050
>
> Does any other port besides the web UI port 5050 need to be open for
> mesos-appmaster to connect with the Mesos master?
>
> In the appmaster log (attached) I see one exception, and I don't know whether
> it is related to the Mesos connection problem:
>
>
> java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>     at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>     at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>     at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>     at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>     at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>     at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>     at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>     at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>     at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>     at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>     at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>
> at 
>