Hey everyone,

(I forgot to add a subject in the previous email, so I'm resending it with a clear subject.)

I'd like to share some weird inconsistency bugs we saw recently on prod, along with the root cause and potential fixes. It took us around a month to investigate, reproduce, and find the root cause; hopefully the information here will help people avoid hitting the same issue.

[Trigger conditions and behavior]

The inconsistency only happened when running ZK with OpenJDK 10 on SKL machines. It is not caused by bugs inside ZK, but by a macro-assembly bug inside the JDK. The symptoms may include:

* NONODE returned by getData for a child that exists when queried with getChildren, with no delete issued
* NONODE returned when trying to create a child under a parent node that was just successfully created, with no delete issued
* No client able to acquire a lock even though the previous session that held the lock is already dead

[Root cause]

The direct cause of the misbehavior above is that keys/values put into the ZooKeeperServer.outstandingChangesForPath HashMap or the DataNode.children HashSet are not visible to later get or remove calls. As a result, outstanding changes are not visible when the leader prepares the following txns, or a node is deleted but not removed from DataNode.children.

This 'bad' HashMap/HashSet behavior is not due to concurrency bugs inside ZK, but to a bug in the macro-assembly code used to generate the String.equals intrinsic in JDK 9 and 10. The bug was introduced in JDK-8144771, which added AVX-512 instruction support to the JDK to optimize the String.equals intrinsic with 512-bit vector operations. Because of the bug, String.equals may incorrectly return false when using the high bank of CPU registers (xmm16 - xmm31) with a non-empty stack on SKL machines where AVX-512 is available.

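As a rough illustration of why a wrong String.equals result leaves stale entries in these collections, here is a minimal Java sketch. It does not reproduce the JDK bug itself. FlakyKey and simulateBrokenEquals are hypothetical names made up for this example, standing in for String and for the miscompiled intrinsic; none of this is ZooKeeper or JDK code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not ZooKeeper or JDK code.
final class FlakyEqualsDemo {

    // Stand-in for the broken intrinsic: when true, equals() wrongly
    // reports "not equal" even for keys with identical content.
    static volatile boolean simulateBrokenEquals = false;

    // Hypothetical key type that plays the role of String in this sketch.
    static final class FlakyKey {
        private final String value;

        FlakyKey(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            return value.hashCode(); // hashing is unaffected, only equals() misbehaves
        }

        @Override
        public boolean equals(Object other) {
            if (simulateBrokenEquals) {
                return false; // simulated wrong answer
            }
            return other instanceof FlakyKey && value.equals(((FlakyKey) other).value);
        }
    }

    // Adds a child, then tries to remove it through an equal but distinct key
    // object while equals() is misbehaving.
    // Returns { removed, stillPresentOnceEqualsWorksAgain }.
    static boolean[] removeWithBrokenEquals() {
        Set<FlakyKey> children = new HashSet<>();
        children.add(new FlakyKey("child-1"));

        simulateBrokenEquals = true;
        boolean removed = children.remove(new FlakyKey("child-1"));
        simulateBrokenEquals = false;

        boolean stillPresent = children.contains(new FlakyKey("child-1"));
        return new boolean[] { removed, stillPresent };
    }

    public static void main(String[] args) {
        boolean[] result = removeWithBrokenEquals();
        System.out.println("removed=" + result[0] + ", stillPresent=" + result[1]);
    }
}
```

The sketch uses distinct key objects on purpose: HashMap and HashSet check reference identity before calling equals, so a lookup with the very same object would still succeed. The remove call misses the entry, and the entry is still in the set afterwards, which mirrors the "deleted but not removed from DataNode.children" symptom.
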
The macro-assembly bug we hit is in vptest, which is used in the string_compare macro assembly code <http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>. It uses add/sub instructions when temporarily saving and restoring register values on the stack, which clobbers the ZF (zero flag) in the FLAGS register set by the preceding test instruction. In our case, if the key exists in the DataNode.children HashSet, the result of the test instruction is zero and the ZF bit is set to 1. If the RSP value is not 0 (i.e., the stack is not empty) after the addptr code, the ZF bit is changed to 0, so the String.equals comparison during removeNode returns false and the key is not removed.

There is a bug reported as JDK-8207746 whose described behavior is different; we've confirmed the issue by adding assembly code to log it in JDK 10.

[Solutions]

The possible mitigations are:

1. Disable AVX-512 with the JVM option -XX:UseAVX=2
2. Use an OpenJDK version higher than 10, which includes the fix from JDK-8207746

Upgrading to OpenJDK 11+ is the better option, since 10 is not well supported and AVX-512 does help performance.

We use JDK 10 because of the SSL quorum socket close stall issue mentioned in ZOOKEEPER-3384 <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>, where the SO_LINGER option is not honored in JDK 11. We've unblocked JDK 11 by closing the quorum socket asynchronously, and we're upstreaming that in ZOOKEEPER-3574 <https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.

Thanks,
Fangmin