Re: ConnectTimeoutException when createPartitionRequestClient

Wenrui Meng Tue, 08 Jan 2019 05:59:19 -0800

Hi Till,

Thanks for your reply. Our cluster is Yarn cluster. I found that if we
decrease the total parallel the timeout issue can be avoided. But we do
need that amount of taskManagers to process data. In addition, once I
increase the netty server threads to 128, the error is changed to to
following error. It seems the cause is different. Could you help take a
look?


2b0ac47c1eb1bcbbbe4a97) switched from RUNNING to FAILED.
java.io.IOException: Connecting the channel failed: Connecting to remote
task manager + 'athena464-sjc1/10.70.129.13:39466' has failed. This might
indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
        at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
        at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
        at
org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.requestPartitions(UnionInputGate.java:134)
        at
org.apache.flink.runtime.io.network.partition.consumer.UnionInputGate.getNextBufferOrEvent(UnionInputGate.java:148)
        at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connecting to remote task manager + 'athena464-sjc1/10.70.129.13:39466' has
failed. This might indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 common frames omitted
Caused by: java.net.ConnectException: Connection timed out: athena464-sjc1/
10.70.129.13:39466
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
        ... 6 common frames omitted


Thanks,
Wenrui

On Mon, Jan 7, 2019 at 2:38 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Wenrui,
>
> the code to set the connect timeout looks ok to me [1]. I also tested it
> locally and checked that the timeout is correctly registered in Netty's
> AbstractNioChannel [2].
>
> Increasing the number of threads to 128 should not be necessary. But it
> could indicate that there is some long lasting or blocking operation being
> executed by the threads.
>
> How does the job submission and cluster configuration work with AthenaX?
> Will the platform spawn for each job a new Flink cluster for which you can
> specify the cluster configuration?
>
> [1]
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java#L102
> [2]
> https://github.com/netty/netty/blob/netty-4.0.27.Final/transport/src/main/java/io/netty/channel/nio/AbstractNioChannel.java#L207
>
> Cheers,
> Till
>
> On Sat, Jan 5, 2019 at 2:22 AM Wenrui Meng <wenruim...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Thanks for your reply and help on this issue.
>>
>> I increased taskmanager.network.netty.client.connectTimeoutSec to 1200
>> which is 20 minutes. But it seems the connection not respects this timeout.
>> In addition, I increase both taskmanager.network.request-backoff.max
>> and taskmanager.registration.max-backoff to 20min.
>>
>> One thing I found is helpful to some extent is increasing
>> the taskmanager.network.netty.server.numThreads. I increase it to 128
>> threads, it can succeed sometimes. But keep increasing it doesn't solve the
>> problem. We have 100 parallel intermediate results, so there are too many
>> partition requests. I think that's why it timeout. The solution should let
>> the connection timeout increase. But I think there is some issue that
>> connection doesn't respect the timeout config.
>>
>> We will definitely try the latest flink version. But at Uber, there is a
>> team who is responsible to provide a platform with Flink. They will upgrade
>> it at the end of this Month. Meanwhile, I would like to ask some help to
>> investigate how to increase the connection timeout and make it respected.
>>
>> Thanks,
>> Wenrui
>>
>> On Fri, Jan 4, 2019 at 5:27 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Wenrui,
>>>
>>> from the logs I cannot spot anything suspicious. Which configuration
>>> parameters have you changed exactly? Does the JobManager log contain
>>> anything suspicious?
>>>
>>> The current Flink version changed quite a bit wrt 1.4. Thus, it might be
>>> worth a try to run the job with the latest Flink version.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Jan 3, 2019 at 3:00 PM Wenrui Meng <wenruim...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I consistently get connection timeout issue when creating
>>>> partitionRequestClient in flink 1.4. I tried to ping from the connecting
>>>> host to the connected host, but the ping latency is less than 0.1 ms
>>>> consistently. So it's probably not due to the cluster status. I also tried
>>>> increase max backoff, nettowrk timeout and some other setting, it doesn't
>>>> help.
>>>>
>>>> I enabled the debug log of flink but not find any suspicious or useful
>>>> information to help me fix the issue. Here is the link
>>>> <https://www.dropbox.com/sh/sul62muz5pk0bqk/AABX8QbMrNmSq3k8I289mGmSa?dl=0>
>>>> of the jobManager and taskManager logs. The connecting host is the host
>>>> which throw the exception. The connected host is the host the connecting
>>>> host try to request partition from.
>>>>
>>>> Since our platform is not up to date yet, the flink version I used in
>>>> this is 1.4. But I noticed that there is not much change of these code on
>>>> the Master branch. Any help will be appreciated.
>>>>
>>>> Here is stack trace of the exception
>>>>
>>>> from RUNNING to FAILED.
>>>> java.io.IOException: Connecting the channel failed: Connecting to
>>>> remote task manager + 'athena485-sjc1/10.70.132.8:34185' has failed.
>>>> This might indicate that the remote task manager has been lost.
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
>>>> at
>>>> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
>>>> at
>>>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
>>>> at
>>>> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
>>>> at
>>>> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
>>>> at
>>>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
>>>> at
>>>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>>>> at
>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>>>> at java.lang.Thread.run(Thread.java:748)
>>>> Caused by:
>>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>>> Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185'
>>>> has failed. This might indicate that the remote task manager has been lost.
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>>>> at
>>>> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>> ... 1 common frames omitted
>>>> Caused by:
>>>> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
>>>> connection timed out: athena485-sjc1/10.70.132.8:34185
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
>>>> ... 6 common frames omitted
>>>>
>>>> Thanks,
>>>> Wenrui
>>>>
>>>

Re: ConnectTimeoutException when createPartitionRequestClient

Reply via email to