Emil Kleszcz created ZOOKEEPER-4334: ---------------------------------------
             Summary: SASL authentication fails when using host aliases
                 Key: ZOOKEEPER-4334
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4334
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.1
            Reporter: Emil Kleszcz

I faced an issue while trying to use alternative aliases with a ZooKeeper quorum when SASL is enabled. The errors I get in the ZooKeeper log are the following:
```
2021-07-12 21:04:46,437 [myid:3] - WARN [NIOWorkerThread-3:ZooKeeperServer@1661] - Client /<IP addr>:37368 failed to SASL authenticate: {}
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)]
    at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:199)
    at org.apache.zookeeper.server.ZooKeeperSaslServer.evaluateResponse(ZooKeeperSaslServer.java:49)
    at org.apache.zookeeper.server.ZooKeeperServer.processSasl(ZooKeeperServer.java:1650)
    at org.apache.zookeeper.server.ZooKeeperServer.processPacket(ZooKeeperServer.java:1599)
    at org.apache.zookeeper.server.NIOServerCnxn.readRequest(NIOServerCnxn.java:379)
    at org.apache.zookeeper.server.NIOServerCnxn.readPayload(NIOServerCnxn.java:182)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
    at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:856)
    at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:342)
    at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:285)
    at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:167)
    ... 11 more
Caused by: KrbException: Checksum failed
    at sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType.decrypt(Aes256CtsHmacSha1EType.java:102)
    at sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType.decrypt(Aes256CtsHmacSha1EType.java:94)
    at sun.security.krb5.EncryptedData.decrypt(EncryptedData.java:175)
    at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:281)
    at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:149)
    at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:108)
    at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:829)
    ... 14 more
Caused by: java.security.GeneralSecurityException: Checksum failed
    at sun.security.krb5.internal.crypto.dk.AesDkCrypto.decryptCTS(AesDkCrypto.java:451)
    at sun.security.krb5.internal.crypto.dk.AesDkCrypto.decrypt(AesDkCrypto.java:272)
    at sun.security.krb5.internal.crypto.Aes256.decrypt(Aes256.java:76)
    at sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType.decrypt(Aes256CtsHmacSha1EType.java:100)
    ... 20 more
```

What did I do?

1) Created host aliases for each quorum node (a, b, c): zk1, zk2, zk3

2) Changed zoo.cfg from:

    server.1=a
    server.2=b
    server.3=c

to:

    server.1=zk1
    server.2=zk2
    server.3=zk3

(At this stage, after restarting the ensemble, everything works as expected.)

3) Generated a new keytab, zookeeper.keytab, containing both the alias-based principals and the host-based principals

4) Changed the Server definition in jaas.conf from:

    Server {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/etc/zookeeper/conf/zookeeper.keytab"
      storeKey=true
      useTicketCache=false
      principal="zookeeper/a.com@COM";
    };

to:

    Server {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/etc/zookeeper/conf/zookeeper.keytab"
      storeKey=true
      useTicketCache=false
      principal="zookeeper/zk1.com@COM";
    };

From that moment, after restarting the quorum members, I get the above error.

Now, why do I do this? To allow other services such as zkfc, hbase, hdfs and yarn to connect to the quorum using aliases. Interestingly, without changing the zookeeper principal, hbase works perfectly, but the other 3 services fail with:
```
<2021-07-12T20:45:19.491+0200> <INFO> <org.apache.zookeeper.ZooKeeper>: <Initiating client connection, connectString=zk01.com:2181,zk02.com:2181,zk03.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@3246fb96>
<2021-07-12T20:45:19.519+0200> <INFO> <org.apache.zookeeper.Login>: <Client successfully logged in.>
<2021-07-12T20:45:19.521+0200> <INFO> <org.apache.zookeeper.Login>: <TGT refresh thread started.>
<2021-07-12T20:45:19.524+0200> <INFO> <org.apache.zookeeper.Login>: <TGT valid starting at: Mon Jul 12 20:45:19 CEST 2021>
<2021-07-12T20:45:19.524+0200> <INFO> <org.apache.zookeeper.Login>: <TGT expires: Tue Jul 13 21:45:19 CEST 2021>
<2021-07-12T20:45:19.524+0200> <INFO> <org.apache.zookeeper.Login>: <TGT refresh sleeping until: Tue Jul 13 17:05:16 CEST 2021>
<2021-07-12T20:45:19.524+0200> <INFO> <org.apache.zookeeper.client.ZooKeeperSaslClient>: <Client will use GSSAPI as SASL mechanism.>
<2021-07-12T20:45:19.530+0200> <INFO> <org.apache.zookeeper.ClientCnxn>: <Opening socket connection to server zk02.com/<ip addr>:2181. Will attempt to SASL-authenticate using Login Context section 'Client'>
<2021-07-12T20:45:19.535+0200> <INFO> <org.apache.zookeeper.ClientCnxn>: <Socket connection established to zk02.com/<ip addr>:2181, initiating session>
<2021-07-12T20:45:19.543+0200> <INFO> <org.apache.zookeeper.ClientCnxn>: <Session establishment complete on server zk02.com/<ip addr>:2181, sessionid = 0x200247870fb0007, negotiated timeout = 10000>
<2021-07-12T20:45:19.561+0200> <ERROR> <org.apache.zookeeper.client.ZooKeeperSaslClient>: <SASL authentication failed using login context 'Client' with exception: {}>
javax.security.sasl.SaslException: Error in authenticating with a Zookeeper Quorum member: the quorum member's saslToken is null.
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:279)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.respondToServer(ZooKeeperSaslClient.java:242)
    at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:805)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
<2021-07-12T20:45:19.564+0200> <INFO> <org.apache.zookeeper.ClientCnxn>: <Unable to read additional data from server sessionid 0x200247870fb0007, likely server has closed socket, closing socket connection and attempting reconnect>
<2021-07-12T20:45:19.671+0200> <INFO> <org.apache.hadoop.ha.ActiveStandbyElector>: <Session connected.>
<2021-07-12T20:45:19.672+0200> <ERROR> <org.apache.hadoop.hdfs.tools.DFSZKFailoverController>: <DFSZKFailOverController exiting due to earlier exception java.io.IOException: Couldn't determine existence of znode
```

When I change the zookeeper principal, hbase starts failing with this error, and the other services, except for zookeeper itself, are somehow working fine.

After that, I cannot connect manually to the zk quorum using zkCli and zookeeper-client with all possible combinations of principals. I wonder if that may have something to do with the "Server environment:host.name=" pointing to the canonical name (and not the alias) during the startup.
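
One way to probe the canonical-name hypothesis is to compare the service principal built from the alias as typed with the one built from its canonical name. The sketch below is only an illustration, not ZooKeeper's own code: the class `SaslPrincipalCheck` and its helpers are hypothetical names, and the hosts and realm are the placeholders from this report. It assumes the client derives the server principal as `zookeeper/<host>@<realm>` and may canonicalize `<host>` first (as far as I know this is governed by the client property `zookeeper.sasl.client.canonicalize.hostname`; worth verifying against the 3.6.1 documentation). If the two principals differ, a Kerberos ticket issued for one cannot be decrypted with the key of the other, which would be consistent with a "Checksum failed" on the server side.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of ZooKeeper.
class SaslPrincipalCheck {

    // Builds a Kerberos service principal of the form service/host@REALM.
    static String servicePrincipal(String service, String host, String realm) {
        return service + "/" + host + "@" + realm;
    }

    // Resolves the host and asks the resolver for its canonical name.
    // getCanonicalHostName() falls back to the textual IP address when no
    // name can be determined; if the host does not resolve at all, the
    // input is returned unchanged.
    static String canonicalName(String host) {
        try {
            return InetAddress.getByName(host).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return host;
        }
    }

    public static void main(String[] args) {
        String realm = "COM"; // placeholder realm from the report
        String[] aliases = {"zk1.com", "zk2.com", "zk3.com"}; // placeholder aliases
        for (String alias : aliases) {
            String canonical = canonicalName(alias);
            System.out.println(alias + " -> " + canonical);
            System.out.println("  as typed:      " + servicePrincipal("zookeeper", alias, realm));
            System.out.println("  canonicalized: " + servicePrincipal("zookeeper", canonical, realm));
        }
    }
}
```

Run on a client host, this prints both candidate principals per alias. Whichever one the client actually requests would have to match the `principal=` entry in the server's jaas.conf and be present in zookeeper.keytab.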