Hi all,

My name is Raúl Gracia and I work on the Pravega project (an open-source project for data stream storage): http://pravega.io/.

I'm currently working on a Pravega branch that uses "zookeeper-3.5.5-rc6", as we are interested in allowing Curator (4.0.1) to use a Zookeeper version with the bugfix proposed in ZOOKEEPER-2184 <https://issues.apache.org/jira/browse/ZOOKEEPER-2184>. The integration has been pretty smooth: 99% of tests are successful in a Pravega build, and the original issue that motivated the upgrade to zookeeper-3.5.5 also seems to be solved.

However, there are failures related to a specific type of test in Pravega in which we instantiate a Zookeeper server process (for testing Pravega in standalone mode). Such failures only occur when running the standalone tests with SSL enabled, which includes configuring the Zookeeper server process with SSL as well.

To constrain the scope of the problem, I have built zookeeper-3.5.5-rc6 ("mvn package") and executed the server (e.g., "./bin/zkServer.sh start") with the appropriate security configuration to enable SSL:

export SERVER_JVMFLAGS="
-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
-Dzookeeper.ssl.keyStore.location=.../server.keystore.jks
-Dzookeeper.ssl.keyStore.password=password
-Dzookeeper.ssl.trustStore.location=.../client.truststore.jks
-Dzookeeper.ssl.trustStore.password=password"

(I have also added secureClientPort=2281 in zoo.cfg as indicated in the admin instructions)

With the Zookeeper server running separately, I executed all the Pravega standalone tests (with and without SSL) pointing to that external Zookeeper server (and disabling the Zookeeper server process that is created as part of the test workflow).
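
For comparison with the in-process case, here is a minimal sketch of how the same server-side flags could be applied as JVM system properties before an embedded server is started. This is not the actual Pravega or Zookeeper code: the class ZkServerSslProps and its methods are hypothetical names, the property keys are the ones from the -D flags above, and the whitespace check is an extra guard I am assuming is useful (a stray space in a path or password is easy to introduce and hard to spot).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (hypothetical name, not part of Pravega or Zookeeper).
class ZkServerSslProps {

    // Builds the same key/value pairs that the -D flags pass to the server JVM.
    static Map<String, String> build(String keyStore, String keyStorePassword,
                                     String trustStore, String trustStorePassword) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("zookeeper.serverCnxnFactory",
                "org.apache.zookeeper.server.NettyServerCnxnFactory");
        props.put("zookeeper.ssl.keyStore.location", keyStore);
        props.put("zookeeper.ssl.keyStore.password", keyStorePassword);
        props.put("zookeeper.ssl.trustStore.location", trustStore);
        props.put("zookeeper.ssl.trustStore.password", trustStorePassword);
        return props;
    }

    // Validates all values first, then sets them as system properties.
    // Assumption: values with leading/trailing whitespace are treated as mistakes.
    static void apply(Map<String, String> props) {
        for (Map.Entry<String, String> e : props.entrySet()) {
            String value = e.getValue();
            if (!value.equals(value.trim())) {
                throw new IllegalArgumentException(
                        "Value for " + e.getKey() + " has leading/trailing whitespace");
            }
        }
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.setProperty(e.getKey(), e.getValue());
        }
    }
}
```

The idea would be to call ZkServerSslProps.apply(ZkServerSslProps.build(...)) with the real keystore and truststore paths before creating the server, so that both scenarios see exactly the same property values.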

Regarding configuration, in our tests the clients are configured with the security settings recommended in the administration guide:

System.setProperty("zookeeper.client.secure", "true");
System.setProperty("zookeeper.clientCnxnSocket", "org.apache.zookeeper.ClientCnxnSocketNetty");
System.setProperty("zookeeper.ssl.trustStore.location", ".../client.truststore.jks");
System.setProperty("zookeeper.ssl.trustStore.password", "password");
System.setProperty("zookeeper.ssl.keyStore.location", ".../server.keystore.jks");
System.setProperty("zookeeper.ssl.keyStore.password", "password");

In this case, all the Pravega standalone tests succeeded. This leaves the way we configure SSL in the Zookeeper server process in Pravega standalone as the most plausible cause of the problem. This is intriguing, as the security settings used are the same in both scenarios (zkServer.sh / Zookeeper server process started in the test code).

I have also confirmed this by running the Zookeeper server process used in standalone with/without SSL and connecting to it via zkCli. Without SSL configured I can connect to it properly, whereas with SSL enabled I get the following error in the client:

2019-05-15 19:59:40,479 [myid:] - INFO [main:ZooKeeper@868] - Initiating client connection, connectString=localhost:2281 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@621be5d1
2019-05-15 19:59:40,507 [myid:] - INFO [main:X509Util@79] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2019-05-15 19:59:40,791 [myid:] - INFO [main:ClientCnxnSocket@237] - jute.maxbuffer value is 4194304 Bytes
2019-05-15 19:59:40,798 [myid:] - INFO [main:ClientCnxn@1653] - zookeeper.request.timeout value is 0. feature enabled=
2019-05-15 19:59:40,817 [myid:localhost:2281] - INFO [main-SendThread(localhost:2281):ClientCnxn$SendThread@1112] - Opening socket connection to server localhost/127.0.0.1:2281. Will not attempt to authenticate using SASL (unknown error)
Welcome to ZooKeeper!
JLine support is enabled
[zk: localhost:2281(CONNECTING) 0] 2019-05-15 19:59:41,168 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxnSocketNetty$ZKClientPipelineFactory@460] - SSL handler added for channel: [id: 0x7bf11dfa]
2019-05-15 19:59:41,176 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /127.0.0.1:52652, server: localhost/127.0.0.1:2281
2019-05-15 19:59:41,178 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxnSocketNetty$1@188] - channel is connected: [id: 0x7bf11dfa, L:/127.0.0.1:52652 - R:localhost/127.0.0.1:2281]
2019-05-15 19:59:41,614 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxn$SendThread@1394] - Session establishment complete on server localhost/127.0.0.1:2281, sessionid = 0x10002239ae10000, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2281(CONNECTED) 0] ls /
2019-05-15 20:00:01,616 [myid:localhost:2281] - WARN [main-SendThread(localhost:2281):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server in 20004ms for sessionid 0x10002239ae10000
2019-05-15 20:00:01,618 [myid:localhost:2281] - INFO [main-SendThread(localhost:2281):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server in 20004ms for sessionid 0x10002239ae10000, closing socket connection and attempting reconnect
2019-05-15 20:00:01,630 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxnSocketNetty$ZKClientHandler@473] - channel is disconnected: [id: 0x7bf11dfa, L:/127.0.0.1:52652 ! R:localhost/127.0.0.1:2281]
2019-05-15 20:00:01,631 [myid:localhost:2281] - INFO [epollEventLoopGroup-2-1:ClientCnxnSocketNetty@253] - channel is told closing
KeeperErrorCode = ConnectionLoss for /
[zk: localhost:2281(CONNECTED) 1]

I see some suspicious messages in these logs that I will need to investigate further. But as a general observation, it looks like the way we instantiate the Zookeeper server process for Pravega standalone is not valid in zookeeper-3.5.5-rc6 (to inspect how we create the Zookeeper server process, please see the methods initialize() and start() in this file <https://github.com/pravega/pravega/blob/master/segmentstore/storage/impl/src/main/java/io/pravega/segmentstore/storage/impl/bookkeeper/ZooKeeperServiceRunner.java>).

In summary, if the error I'm getting is related to changes in the SSL configuration introduced in zookeeper-3.5.5, it would be great to get your feedback on whether I'm missing something. On the other hand, if the way we are creating the Zookeeper server process is not the recommended one, I'm also open to suggestions here.

Thanks in advance and sorry for the long email,
Raúl.

PS: I have also tried to run the Zookeeper server process with SSL while forcing it to use only the netty and boringSSL library versions that are used either in Pravega (netty*:4.1.30.Final, netty-tcnative-boringssl-static:2.0.17) or in Zookeeper 3.5.5 (netty*:4.1.29.Final, netty-tcnative-boringssl-static:2.0.7), but none of these combinations made any difference in the behavior of the Zookeeper server process.

PS2: The JDK version I use is: openjdk version "1.8.0_212".
