Thanks a bunch! I replaced 127.0.1.1 with the actual IP address and it
works now :)

On 3/15/21 3:22 PM, Robert Metzger wrote:
> Hey Matthias,
>
> are you sure you can connect to 127.0.1.1, since everything between
> 127.0.0.1 and  127.255.255.255 is bound to the loopback device?:
> https://serverfault.com/a/363098 <https://serverfault.com/a/363098>
>
>
>
> On Mon, Mar 15, 2021 at 11:13 AM Matthias Seiler
> <matthias.sei...@campus.tu-berlin.de
> <mailto:matthias.sei...@campus.tu-berlin.de>> wrote:
>
>     Hi Arvid,
>
>     I listened to ports with netcat and connected via telnet and each
>     node can connect to the other and itself.
>
>     The `/etc/hosts` file looks like this
>     ```
>     127.0.0.1   localhost
>     127.0.1.1   node-2.example.com <http://node-2.example.com>   node-2
>
>     <ip-node-1>   node-1
>     ```
>     Is the second line the reason it fails? I also replaced all
>     hostnames with IP addresses in the config files (flink-conf,
>     workers, masters) but without effect...
>
>     Do you have any ideas what else I could try?
>
>     Thanks again,
>     Matthias
>
>     On 2/24/21 2:17 PM, Arvid Heise wrote:
>>     Hi Matthias,
>>
>>     most of the debug statements are just noise. You can ignore that.
>>
>>     Something with your network seems fishy to me. Either taskmanager
>>     1 cannot connect to taskmanager 2 (and vice versa), or the
>>     taskmanager cannot connect locally.
>>
>>     I found this fragment, which seems suspicious
>>
>>     Failed to connect to /127.0.*1*.1:32797. Giving up.
>>
>>     localhost is usually 127.0.0.1. Can you double check that you
>>     connect from all machines to all machines (including themselves)
>>     by opening trivial text sockets on random ports?
>>
>>     On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
>>     <matthias.sei...@campus.tu-berlin.de
>>     <mailto:matthias.sei...@campus.tu-berlin.de>> wrote:
>>
>>         Hi Till,
>>
>>         thanks for the hint, you seem about right. Setting the log
>>         level to DEBUG reveals more information, but I don't know
>>         what to do about it.
>>
>>         All logs throw some Java related exceptions:
>>         `java.lang.UnsupportedOperationException: Reflective
>>         setAccessible(true) disabled`
>>         and
>>         `java.lang.IllegalAccessException: class
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>>         cannot access class jdk.internal.misc.Unsafe (in module
>>         java.base) because module java.base does not export
>>         jdk.internal.misc to unnamed module`
>>
>>         The log of node-2's TaskManager reveals connection problems:
>>         `org.apache.flink.runtime.net.ConnectionUtils                
>>         [] - Failed to connect from address 'node-2/127.0.1.1
>>         <http://127.0.1.1>': Invalid argument (connect failed)`
>>         `java.net.ConnectException: Invalid argument (connect failed)`
>>
>>         What's more, both TaskManagers (node-1 and node-2) are having
>>         trouble to load
>>         `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>>         but load some version eventually.
>>
>>
>>         There is quite a lot going on here that I don't understand.
>>         Can you (or someone) shed some light on it and tell me what I
>>         could try?
>>
>>         Some more information:
>>         I appended the following to the `/etc/hosts` file:
>>         ```
>>         <ip-node-1> node-1
>>         <ip-node-2> node-2
>>         ```
>>         And the `flink/conf/workers` consists of:
>>         ```
>>         node-1
>>         node-2
>>         ```
>>
>>         Thanks,
>>         Matthias
>>
>>         P.S. I attached the logs for further reference. `<ip-node-1>`
>>         is of course the real IP address instead.
>>
>>
>>         On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>>         Hi Matthias,
>>>
>>>         Can you make sure that node-1 and node-2 can talk to each
>>>         other? It looks to me that node-2 fails to open a connection
>>>         to the other TaskManager. Maybe the logs give some more
>>>         insights. You can change the log level to DEBUG to gather
>>>         more information.
>>>
>>>         Cheers,
>>>         Till
>>>
>>>         On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>>>         <matthias.sei...@campus.tu-berlin.de
>>>         <mailto:matthias.sei...@campus.tu-berlin.de>> wrote:
>>>
>>>             Hi Everyone,
>>>
>>>             I'm trying to setup a Flink cluster in standealone mode
>>>             with two
>>>             machines. However, running a job throws the following
>>>             exception:
>>>             `org.apache.flink.runtime.io
>>>             
>>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>>             Sending the partition request to 'null' failed`
>>>
>>>             Here is some background:
>>>
>>>             Machines:
>>>             - node-1: JobManager, TaskManager
>>>             - node-2: TaskManager
>>>
>>>             flink-conf-yaml looks like this:
>>>             ```
>>>             jobmanager.rpc.address: node-1
>>>             taskmanager.numberOfTaskSlots: 8
>>>             parallelism.default: 2
>>>             cluster.evenly-spread-out-slots: true
>>>             ```
>>>
>>>             Deploying the cluster works: I can see both TaskManagers
>>>             in the WebUI.
>>>
>>>             I ran the streaming WordCount example: `flink run
>>>             flink-1.12.1/examples/streaming/WordCount.jar --input
>>>             lorem-ipsum.txt`
>>>             - the job has been submitted
>>>             - job failed (with the above exception)
>>>             - the log of the node-2 also shows the exception, the
>>>             other logs are
>>>             fine (graceful stop)
>>>
>>>             I played around with the config and observed that
>>>             - if parallelism is set to 1, node-1 gets all the slots
>>>             and node-2 none
>>>             - if parallelism is set to 2, each TaskManager occupies
>>>             1 TaskSlot (but
>>>             fails because of node-2)
>>>
>>>             I suspect, that the problem must be with the
>>>             communication between
>>>             TaskManagers
>>>             - job runs successful if
>>>                 - node-1 is the **only** node with x TaskManagers
>>>             (tested with x=1
>>>             and x=2)
>>>                 - node-2 is the **only** node with x TaskManagers
>>>             (tested with x=1
>>>             and x=2)
>>>             - job fails if
>>>                 - node-1 **and** node-2 have one TaskManager
>>>
>>>             The full exception is:
>>>             ```
>>>             org.apache.flink.client.program.ProgramInvocationException:
>>>             The main
>>>             method caused an error:
>>>             org.apache.flink.client.program.ProgramInvocationException:
>>>             Job failed
>>>             (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>>             // ... Job failed, Recovery is suppressed by
>>>             NoRestartBackoffTimeStrategy, ...
>>>             Caused by:
>>>             org.apache.flink.runtime.io
>>>             
>>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>>             Sending the partition request to 'null' failed.
>>>                 at
>>>             org.apache.flink.runtime.io
>>>             
>>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>>                 at
>>>             org.apache.flink.runtime.io
>>>             
>>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>                 at java.base/java.lang.Thread.run(Thread.java:834)
>>>             Caused by: java.nio.channels.ClosedChannelException
>>>                 at
>>>             
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>>                 ... 11 more
>>>             ```
>>>
>>>             Thanks in advance,
>>>             Matthias
>>>

Reply via email to