Hi Barisa,

thanks for sharing this. I'm gonna add Till to this thread. He might have some insights.

Best,
Matthias

On Wed, Feb 10, 2021 at 4:19 PM Barisa Obradovic <bbaj...@gmail.com> wrote:

> I'm trying to understand whether the behaviour of the Flink JobManager
> during a ZooKeeper upgrade is expected or not.
>
> I'm running Flink 1.11.2 in Kubernetes, with ZooKeeper server 3.5.4-beta.
> While I'm doing the ZooKeeper upgrade, there is a 20-second ZooKeeper
> downtime. I'd expect either the Flink job to restart or a few warnings in
> the logs during those 20 seconds. Instead, I see the whole Flink JVM crash
> (and later the pod restart).
>
> I expected Flink to internally retry ZooKeeper requests, so I'm surprised
> it crashes. Is this expected, or is it a bug?
>
> From the logs:
>
>     org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
> [09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
> [09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
> [09-Feb-2021 11:30:00.198 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
> [09-Feb-2021 11:30:02.294 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
> [09-Feb-2021 11:30:02.295 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
> [09-Feb-2021 11:30:02.295 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
>     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
> [09-Feb-2021 11:30:03.841 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
> [09-Feb-2021 11:30:03.842 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
> [09-Feb-2021 11:30:03.842 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
>     at sun.nio.ch.IOUtil.rea
>
> FYI: I've asked the same question on Stack Overflow:
> https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/