Here’s the JDK issue that Fangmin mentioned:

https://bugs.openjdk.java.net/browse/JDK-8207746

It’s a JDK 10 and 11 bug that has been fixed since JDK 11 b27.

Andor



> On 2019. Oct 28., at 8:00, Enrico Olivelli <[email protected]> wrote:
> 
> Fangmin,
> 
> On Mon, 28 Oct 2019, 02:23 Fangmin Lv <[email protected]> wrote:
> 
>> Hey everyone,
>> 
>> (Forgot to add subject in the previous email, resent with clear subject.)
>> 
>> I'd like to share some weird inconsistency bugs we saw recently in prod,
>> along with their root cause and potential fixes. It took us around a month
>> to investigate, reproduce and find the root cause; hopefully the
>> information here will help people avoid hitting the same issue.
>> 
>> [Trigger conditions and behavior]
>> 
>> The inconsistency issue only happened when running ZK with OpenJDK 10 on
>> SKL machines, and it is caused not by bugs inside ZK but by a
>> macro-assembly bug inside the JDK.
>> 
>> The observed behavior might be:
>> 
>> * NONODE returned by getData for a child that is listed by getChildren,
>> although no delete was issued
>> * NONODE error returned when trying to create a child under a parent node
>> that was just successfully created, although no delete was issued
>> * No client is able to acquire the lock even though the previous session
>> that held the lock is already dead
>> 
>> [Root cause]
>> 
>> The direct cause of the misbehavior above is that keys/values put into
>> the ZooKeeperServer.outstandingChangesForPath HashMap or the
>> DataNode.children HashSet are not visible to later get or remove calls.
>> As a result, outstanding changes are not visible when the leader prepares
>> the following txns, or a node is deleted but not removed from
>> DataNode.children.
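
To make the effect concrete, here is a minimal sketch of how a HashSet behaves when equals() wrongly reports "not equal" for an equal key. The Key class and its simulateBug switch are hypothetical stand-ins for String and the miscompiled intrinsic; this is not ZooKeeper or JDK code.

```java
import java.util.HashSet;
import java.util.Set;

public class BrokenEqualsDemo {
    // Hypothetical key type standing in for String; not ZooKeeper code.
    static final class Key {
        final String value;
        final boolean simulateBug; // true = equals() always reports "not equal"

        Key(String value, boolean simulateBug) {
            this.value = value;
            this.simulateBug = simulateBug;
        }

        @Override
        public int hashCode() {
            return value.hashCode(); // hashing is unaffected, so the right bucket is found
        }

        @Override
        public boolean equals(Object other) {
            if (simulateBug) {
                return false; // models the wrong comparison result
            }
            return other instanceof Key && value.equals(((Key) other).value);
        }
    }

    // Adds one child, then tries to remove it with an equal key.
    // Returns true if the child is still in the set afterwards.
    static boolean childSurvivesRemove(boolean simulateBug) {
        Set<Key> children = new HashSet<>();
        children.add(new Key("child-1", false));
        children.remove(new Key("child-1", simulateBug));
        return children.contains(new Key("child-1", false));
    }

    public static void main(String[] args) {
        System.out.println("healthy equals, child survives: " + childSurvivesRemove(false));
        System.out.println("broken equals, child survives: " + childSurvivesRemove(true));
    }
}
```

With the simulated bug the remove call silently does nothing, which matches the "deleted but still listed in children" symptom described above.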
>> 
>> The 'bad' HashMap/HashSet behavior is not caused by concurrency bugs
>> inside ZK, but by a bug in the macro-assembly code used to generate the
>> String.equals intrinsic in JDK 9 and 10. The bug was introduced in
>> JDK-8144771, which added AVX-512 instruction support to the JDK to
>> optimize the String.equals intrinsic with 512-bit vector operations. Due
>> to the bug, String.equals may return a wrong result when using the upper
>> bank of CPU registers (xmm16 - xmm31) with a non-empty stack on SKL
>> machines where AVX-512 is available.
>> 
>> The macro-assembly bug we hit is in vptest, which is used in the
>> string_compare macro-assembly code
>> <http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>.
>> It uses add/sub instructions when temporarily saving register values to
>> the stack and restoring them, which overwrites the ZF (zero flag) in the
>> FLAGS register set by the preceding test instruction.
>> 
>> In our case, if the key exists in the DataNode.children HashSet, the test
>> instruction result will be zero and the ZF bit will be set to 1. If the RSP
>> value is not 0 (i.e. the stack is not empty) after the addptr code linked
>> above, the ZF bit will be changed to 0, so the String.equals comparison
>> during removeNode will return a wrong result, and the key won't be removed.
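
The flag sequence described above can be sketched as a toy model. This is only an illustration of the reasoning, not the JDK's macro-assembler code: the zf and rsp fields, the stack pointer value, and the 64-byte adjustment are all assumptions made up for the example.

```java
public class ZeroFlagDemo {
    // Toy model of two pieces of CPU state; not real JDK or assembler code.
    static boolean zf; // zero flag
    static long rsp;   // stack pointer

    // Like a test instruction: ZF = 1 when the tested value is zero.
    static void test(long value) {
        zf = (value == 0);
    }

    // Like an add on the stack pointer: ZF is overwritten from the new RSP value.
    static void addToRsp(long bytes) {
        rsp += bytes;
        zf = (rsp == 0);
    }

    // Compares two words in the order the thread describes: test first,
    // then optionally restore the stack pointer, then read ZF.
    static boolean equalsViaFlags(long a, long b, boolean restoreStackAfterTest) {
        rsp = 0x7ffc0000L - 64; // hypothetical stack pointer with 64 bytes reserved
        test(a ^ b);            // a == b gives zero, so ZF = 1
        if (restoreStackAfterTest) {
            addToRsp(64);       // RSP is non-zero afterwards, so ZF becomes 0
        }
        return zf;              // "equal" only if ZF is still 1
    }

    public static void main(String[] args) {
        System.out.println("no stack restore:   " + equalsViaFlags(42, 42, false));
        System.out.println("with stack restore: " + equalsViaFlags(42, 42, true));
    }
}
```

In this model two equal values compare as equal only when nothing touches the flags between the test and the read of ZF.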
>> 
>> A bug is reported in JDK-8207746. The behavior described there is
>> different, but we've confirmed the issue by adding assembly code to log it
>> in JDK 10.
>> 
>> [Solutions]
>> 
>> The possible mitigations are:
>> 
>> 1. Disabling AVX-512 with the JVM option -XX:UseAVX=2
>> 2. Using an OpenJDK version higher than 10, in which the issue has been
>> fixed by JDK-8207746
>> 
>> Upgrading to OpenJDK 11+ is the better option, since 10 is not well
>> supported, and AVX-512 does help improve performance.
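
For anyone who wants to verify the first mitigation on a running JVM, here is a small sketch that reads the effective UseAVX value through HotSpotDiagnosticMXBean. It assumes a HotSpot JVM on x86; on other JVMs or architectures the flag may not exist, in which case the sketch reports "unknown". The class and method names are made up for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class AvxCheck {
    // Reads the current UseAVX value, or returns "unknown" when the flag
    // or the HotSpot diagnostic bean is not available on this JVM.
    static String useAvxSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("UseAVX").getValue();
        } catch (RuntimeException e) {
            return "unknown";
        }
    }

    // UseAVX=3 enables AVX-512; values of 2 or lower keep it off.
    // Returns false when the value cannot be parsed, i.e. "not confirmed".
    static boolean avx512Disabled(String useAvx) {
        try {
            return Integer.parseInt(useAvx.trim()) <= 2;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String setting = useAvxSetting();
        System.out.println("java.specification.version = "
            + System.getProperty("java.specification.version"));
        System.out.println("UseAVX = " + setting);
        System.out.println("AVX-512 confirmed disabled: " + avx512Disabled(setting));
    }
}
```

Running it once with and once without -XX:UseAVX=2 shows whether the option took effect.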
>> 
>> We use JDK 10 due to the SSL quorum socket close stall issue mentioned in
>> ZOOKEEPER-3384 <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>, and
>> the SO_LINGER option is not honored in JDK 11. We've unblocked JDK 11 by
>> asynchronously closing the quorum socket, and we're upstreaming that in
>> ZOOKEEPER-3574 <https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.
>> 
>> Thanks,
>> Fangmin
>> 
> 
> 
> Thank you for sharing this.
> Do you have any pointers to the JDK 11 bugs? Is it solved in 12+?
> 
> I am running with JDK 11-13 but without SSL, so I have never seen problems.
> 
> Enrico
> 
>> 
