[ https://issues.apache.org/jira/browse/FLINK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gameking updated FLINK-15074:
-----------------------------
    Summary: Connection timed out, Standalone cluster  (was: Connection timed out, Standalone)

> Connection timed out, Standalone cluster
> ----------------------------------------
>
>                 Key: FLINK-15074
>                 URL: https://issues.apache.org/jira/browse/FLINK-15074
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.1
>         Environment: Flink version: 1.5.1, 1.9.1
> JDK version: 1.8.0_181
> Number of servers: 15
> Number of taskmanagers: 178
> Number of slots: 178
>            Reporter: gameking
>            Priority: Major
>         Attachments: flink-conf.yaml, jobmanager.log, taskmanager.log
>
>
> I am running a Flink streaming application on a standalone cluster.
> It works well when the job's parallelism is low, for example 96.
> But when I increase the job's parallelism to a higher value, such as 164 or
> more, the job fails within 10-15 minutes with a connection timeout error.
> I have tried to solve this problem by increasing TaskManager config values such as
> 'taskmanager.network.netty.server.numThreads',
> 'taskmanager.network.netty.client.numThreads',
> 'taskmanager.network.request-backoff.max', 'akka.ask.timeout' and so on, but it
> doesn't work.
> I have also tried different versions of Flink, such as 1.5.1 and 1.9.1, but
> that doesn't help either.
> Does anyone know how to fix this problem? I have no idea what else to try. It looks like a
> bug.
> I have uploaded my config and logs as attachments; the error trace is below:
>
> ------------------------------------------------------------------
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Connection timed out
>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:172) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
> Caused by: java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_181]
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_181]
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_181]
>     at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>     ... 6 common frames omitted

--
This message was sent by Atlassian Jira
(v8.3.4#803005)