Enrico,

As Andor mentioned, the issue has been fixed in JDK 11 since b27, so you should be fine :)
Fangmin

On Mon, Oct 28, 2019 at 10:44 PM Andor Molnar wrote:

> Here’s the JDK issue that Fangmin mentioned:
>
> https://bugs.openjdk.java.net/browse/JDK-8207746
>
> It’s a JDK 10 & 11 bug that has been fixed since JDK 11 b27.
>
> Andor
>
> > On 2019. Oct 28., at 8:00, Enrico Olivelli wrote:
> >
> > Fangmin,
> >
> > On Mon, Oct 28, 2019, 02:23 Fangmin Lv wrote:
> >
> >> Hey everyone,
> >>
> >> (Forgot to add a subject in the previous email; resending with a
> >> clear subject.)
> >>
> >> I'd like to share some weird inconsistency bugs we saw recently in
> >> prod, along with the root cause and potential fixes. It took us
> >> around a month to investigate, reproduce, and find the root cause;
> >> hopefully the information here will help people avoid hitting the
> >> same issue.
> >>
> >> [Trigger conditions and behavior]
> >>
> >> The inconsistency only happened when running ZK with OpenJDK 10 on
> >> SKL (Skylake) machines, and it is not caused by bugs inside ZK but
> >> by a macro-assembly bug inside the JDK.
> >>
> >> The observed symptoms include:
> >>
> >> * NONODE returned by getData for a child that getChildren reports
> >>   as existing, with no delete issued
> >> * NONODE returned when trying to create a child under a parent node
> >>   that was just successfully created, with no delete issued
> >> * No client able to acquire a lock even though the session that
> >>   previously held it is already dead
> >>
> >> [Root cause]
> >>
> >> The direct cause of the misbehavior above is that keys/values put
> >> into the ZooKeeperServer.outstandingChangesForPath HashMap or the
> >> DataNode.children HashSet are not visible to later get or remove
> >> calls. As a result, outstanding changes are not visible when the
> >> leader prepares the following txns, or a node is deleted but not
> >> removed from DataNode.children.
> >>
> >> The 'bad' HashMap/HashSet behavior is not caused by concurrency
> >> bugs inside ZK, but by a bug in the macro-assembly used to generate
> >> the String.equals intrinsic in JDK 9 and 10. The bug was introduced
> >> in JDK-8144771, which added AVX-512 instruction support to the JDK
> >> to optimize the String.equals intrinsic with 512-bit vector
> >> operations. Because of the bug, String.equals may return a wrong
> >> result when it uses the high bank of CPU registers (xmm16 - xmm31)
> >> with a non-empty stack on SKL machines, where AVX-512 is available.
> >>
> >> The macro-assembly bug we hit is in vptest, which is used in the
> >> string_compare macro-assembly code
> >> <http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>.
> >> It uses add/sub instructions when temporarily saving and restoring
> >> register values on the stack, which clobbers the ZF (zero flag) in
> >> the FLAGS register set by the preceding test instruction.
> >>
> >> In our case, if the key exists in the DataNode.children HashSet,
> >> the test instruction result is zero and the ZF bit is set to 1. If
> >> the RSP value is not 0 (e.g. the stack is not empty) after the
> >> addptr code, the ZF bit is changed to 0, so the String.equals
> >> comparison during removeNode wrongly returns false and the key is
> >> not removed.
> >>
> >> There is a bug reported in JDK-8207746; the behavior described
> >> there is different, but we've confirmed the issue by adding
> >> assembly code to log it in JDK 10.
> >>
> >> [Solutions]
> >>
> >> The possible mitigations are:
> >>
> >> 1. Disabling AVX-512 with the JVM option -XX:UseAVX=2
> >> 2. Using an OpenJDK version higher than 10, which has the fix from
> >>    JDK-8207746
> >>
> >> Upgrading to OpenJDK 11+ is the better option, since 10 is not well
> >> supported and AVX-512 does help performance.
> >>
> >> We use JDK 10 because of the SSL quorum socket close stall issue
> >> mentioned in ZOOKEEPER-3384
> >> <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>, and the
> >> SO_LINGER option is not honored in JDK 11. We've unblocked JDK 11
> >> by closing the quorum socket asynchronously, and we're upstreaming
> >> that in ZOOKEEPER-3574
> >> <https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.
> >>
> >> Thanks,
> >> Fangmin
> >
> > Thank you for sharing this.
> > Do you have any pointer to the JDK 11 bugs? Is it solved in 12+?
> >
> > I am running with JDK 11-13 but without SSL, so I have never seen
> > problems.
> >
> > Enrico
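
For anyone who wants to see the failure mode from the root-cause section in isolation, here is a minimal sketch. It does not reproduce the JDK bug itself (that needs the affected JIT and AVX-512 hardware); it only simulates an equals() that wrongly returns false for equal keys and shows that HashSet.remove then silently leaves the entry in place. The class FlakyEqualsDemo, the Key type, and the simulateBug switch are hypothetical names made up for illustration; they are not ZooKeeper or JDK code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; none of these names come from ZooKeeper or the JDK.
class FlakyEqualsDemo {

    /** Key whose equals() can be forced to return false, standing in for the broken intrinsic. */
    static final class Key {
        static boolean simulateBug = false; // toggled only by this demo
        private final String path;

        Key(String path) {
            this.path = path;
        }

        @Override
        public int hashCode() {
            return path.hashCode(); // hashing is unaffected, so the right bucket is still found
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Key)) return false;
            if (simulateBug) return false; // wrong answer for keys with equal content
            return path.equals(((Key) o).path);
        }
    }

    /** Adds a child, then removes it through a distinct but equal key; returns what remove() reports. */
    static boolean addThenRemove(boolean buggy) {
        Set<Key> children = new HashSet<>();
        children.add(new Key("child-1"));
        Key.simulateBug = buggy;
        try {
            return children.remove(new Key("child-1"));
        } finally {
            Key.simulateBug = false; // always reset the demo switch
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy remove: " + addThenRemove(false));
        System.out.println("buggy remove:   " + addThenRemove(true));
    }
}
```

With a healthy equals() the remove succeeds; with the simulated bug it reports false and the stale child stays in the set, which matches the "node deleted but not removed from DataNode.children" symptom described above.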
