Hi Matthias,

most of the debug statements are just noise. You can ignore that.

Something with your network seems fishy to me. Either taskmanager 1 cannot
connect to taskmanager 2 (and vice versa), or the taskmanager cannot
connect locally.

I found this fragment, which seems suspicious

Failed to connect to /127.0.*1*.1:32797. Giving up.

localhost is usually 127.0.0.1. Can you double check that you connect from
all machines to all machines (including themselves) by opening trivial text
sockets on random ports?

On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler <
matthias.sei...@campus.tu-berlin.de> wrote:

> Hi Till,
>
> thanks for the hint, you seem about right. Setting the log level to DEBUG
> reveals more information, but I don't know what to do about it.
>
> All logs throw some Java related exceptions:
> `java.lang.UnsupportedOperationException: Reflective setAccessible(true)
> disabled`
> and
> `java.lang.IllegalAccessException: class
> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
> cannot access class jdk.internal.misc.Unsafe (in module java.base) because
> module java.base does not export jdk.internal.misc to unnamed module`
>
> The log of node-2's TaskManager reveals connection problems:
> `org.apache.flink.runtime.net.ConnectionUtils                 [] - Failed
> to connect from address 'node-2/127.0.1.1': Invalid argument (connect
> failed)`
> `java.net.ConnectException: Invalid argument (connect failed)`
>
> What's more, both TaskManagers (node-1 and node-2) are having trouble to
> load `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
> but load some version eventually.
>
>
> There is quite a lot going on here that I don't understand. Can you (or
> someone) shed some light on it and tell me what I could try?
>
> Some more information:
> I appended the following to the `/etc/hosts` file:
> ```
> <ip-node-1> node-1
> <ip-node-2> node-2
> ```
> And the `flink/conf/workers` consists of:
> ```
> node-1
> node-2
> ```
>
> Thanks,
> Matthias
>
> P.S. I attached the logs for further reference. `<ip-node-1>` is of course
> the real IP address instead.
>
>
> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>
> Hi Matthias,
>
> Can you make sure that node-1 and node-2 can talk to each other? It looks
> to me that node-2 fails to open a connection to the other TaskManager.
> Maybe the logs give some more insights. You can change the log level to
> DEBUG to gather more information.
>
> Cheers,
> Till
>
> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler <
> matthias.sei...@campus.tu-berlin.de> wrote:
>
>> Hi Everyone,
>>
>> I'm trying to setup a Flink cluster in standealone mode with two
>> machines. However, running a job throws the following exception:
>> `org.apache.flink.runtime.io
>> .network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed`
>>
>> Here is some background:
>>
>> Machines:
>> - node-1: JobManager, TaskManager
>> - node-2: TaskManager
>>
>> flink-conf-yaml looks like this:
>> ```
>> jobmanager.rpc.address: node-1
>> taskmanager.numberOfTaskSlots: 8
>> parallelism.default: 2
>> cluster.evenly-spread-out-slots: true
>> ```
>>
>> Deploying the cluster works: I can see both TaskManagers in the WebUI.
>>
>> I ran the streaming WordCount example: `flink run
>> flink-1.12.1/examples/streaming/WordCount.jar --input lorem-ipsum.txt`
>> - the job has been submitted
>> - job failed (with the above exception)
>> - the log of the node-2 also shows the exception, the other logs are
>> fine (graceful stop)
>>
>> I played around with the config and observed that
>> - if parallelism is set to 1, node-1 gets all the slots and node-2 none
>> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot (but
>> fails because of node-2)
>>
>> I suspect, that the problem must be with the communication between
>> TaskManagers
>> - job runs successful if
>>     - node-1 is the **only** node with x TaskManagers (tested with x=1
>> and x=2)
>>     - node-2 is the **only** node with x TaskManagers (tested with x=1
>> and x=2)
>> - job fails if
>>     - node-1 **and** node-2 have one TaskManager
>>
>> The full exception is:
>> ```
>> org.apache.flink.client.program.ProgramInvocationException: The main
>> method caused an error:
>> org.apache.flink.client.program.ProgramInvocationException: Job failed
>> (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>> // ... Job failed, Recovery is suppressed by
>> NoRestartBackoffTimeStrategy, ...
>> Caused by:
>> org.apache.flink.runtime.io
>> .network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed.
>>     at
>> org.apache.flink.runtime.io
>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>     at
>> org.apache.flink.runtime.io
>> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>     at java.base/java.lang.Thread.run(Thread.java:834)
>> Caused by: java.nio.channels.ClosedChannelException
>>     at
>>
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>     ... 11 more
>> ```
>>
>> Thanks in advance,
>> Matthias
>>
>>

Reply via email to