Thanks! Regarding HA: understood, the job will auto-recover after a crash. Is that also true when deploying a new version? Is it as easy as canceling one job, updating the version, and, once it is up and running, doing the same for the other one? Is there anything that should be highlighted when working in HA mode (e.g. configuration or machine setup), besides what is written in the JobManager HA documentation <https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/jobmanager_high_availability.html>?

On Thu, Jan 17, 2019 at 9:39 PM Jamie Grier <jgr...@lyft.com> wrote:

> Avi,
>
> The stack trace there is pretty much a red herring. That happens whenever
> a job shuts down for any reason and is not a root cause. To diagnose this
> you will want to look at all the TaskManager logs as well as the JobManager
> logs. If you have a way to easily grep these (all of them at once) I would
> search for a string like "to FAILED" on the TaskManagers and look at those
> error lines and stack traces.
>
> Don't be misled by the exception reported in the Flink UI. It OFTEN isn't
> the true root cause, but it's a hard problem to solve. You have to look at
> the TaskManager logs to really be sure.
>
> The taskmanager.network.netty.server.numThreads
> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
> is also a red herring. I would leave that alone.
>
> Finally, if you have HA and checkpointing set up correctly you will not
> lose any state even in the case of losing a JobManager. The job will
> auto-recover as soon as a new JobManager becomes available.
>
> I hope that helps.
>
> -Jamie
>
> On Thu, Jan 17, 2019 at 7:10 AM Dominik Wosiński <wos...@gmail.com> wrote:
>
>> *Hey,*
>> As for the question about taskmanager.network.netty.server.numThreads
>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>:
>> it is the size of the thread pool that will be used by the Netty server.
>> The default value is -1, which results in a thread pool with a size
>> equal to the number of task slots of your TaskManager.
>>
>> Best Regards,
>> Dom.
>>
>> On Thu, Jan 17, 2019 at 00:52 Avi Levi <avi.l...@bluevoyant.com> wrote:
>>
>>> Hi Guys,
>>>
>>> We ran some load tests and got the exception below. I saw that the
>>> JobManager was restarted. If I understood correctly, it will get a new
>>> job id and the state will be lost. Is that correct?
>>> How is the state handled when HA is set up as described here
>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/jobmanager_high_availability.html>?
>>> What actually happens to the state (keyed state using RocksDB) if one of
>>> the JobManagers crashes?
>>>
>>> One of the properties that might be relevant to this exception is
>>> taskmanager.network.netty.server.numThreads
>>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
>>> with a default value of -1. What does this default value actually mean,
>>> and should it be set to a different value according to the number of cores?
>>>
>>> Thanks for your advice.
>>>
>>> Avi
>>>
>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>> Lost connection to task manager 'xxxx:1234'. This indicates that the remote
>>> task manager was lost.
>>>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:160)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:87)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>>     at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>>>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>>     ... 6 more
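
For anyone following the thread, the log search Jamie suggests (look through all TaskManager logs at once for a string like "to FAILED" and read the error lines and stack traces that follow) could be sketched roughly as below. This is only an illustration, not part of Flink: the function name find_failed_transitions, the "*.log" file pattern, the assumption that all TaskManager logs have been copied into one local directory, and the number of context lines are all made up for the example.

```python
import re
from pathlib import Path

# The search string suggested in the thread; everything else here is an assumption.
FAILED_PATTERN = re.compile(r"to FAILED")


def find_failed_transitions(log_dir, glob="*.log", context=5):
    """Yield (file name, line number, snippet) for each log line containing "to FAILED".

    log_dir is assumed to be a directory holding copies of all TaskManager logs.
    The snippet is the matching line plus up to `context` following lines, which is
    usually where the stack trace of the failure appears.
    """
    for path in sorted(Path(log_dir).glob(glob)):
        lines = path.read_text(errors="replace").splitlines()
        for index, line in enumerate(lines):
            if FAILED_PATTERN.search(line):
                snippet = "\n".join(lines[index:index + 1 + context])
                yield path.name, index + 1, snippet
```

Calling list(find_failed_transitions("/path/to/collected/logs")) returns one entry per match, so the earliest failure on each TaskManager can be compared against the exception shown in the Flink UI. The directory path is a placeholder.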