Hey everyone,

(Forgot to add subject in the previous email, resent with clear subject.)

I'd like to share some weird inconsistency bugs we saw recently in prod,
along with their root cause and potential fixes. It took us around a month
to investigate, reproduce, and find the root cause. Hopefully the
information here will help others avoid hitting the same issue.

[Trigger conditions and behavior]

The inconsistency only happens when running ZK with OpenJDK 10 on Skylake
(SKL) machines. It is not caused by a bug inside ZK but by a
macro-assembler bug inside the JDK.

The symptoms we observed include:

* NONODE returned by getData for a child that getChildren still lists,
even though no delete was issued
* NONODE returned when creating a child under a parent node that was just
created successfully, even though no delete was issued
* No client able to acquire a lock even though the previous session that
held it is already dead

[Root cause]

The direct cause of the misbehavior above is that keys/values put into
the ZooKeeperServer.outstandingChangesForPath HashMap or the
DataNode.children HashSet are not visible to later get or remove calls.
As a result, outstanding changes are not visible when the leader prepares
the following txns, or a node is deleted but not removed from
DataNode.children.

This 'bad' HashMap/HashSet behavior is not caused by concurrency bugs
inside ZK, but by a bug in the macro-assembler code used to generate the
String.equals intrinsic in JDK 9 and 10. The bug was introduced in
JDK-8144771, which added AVX-512 support to the JDK to speed up the
String.equals intrinsic with 512-bit vector operations. Because of the
bug, String.equals may return a wrong result when it uses the upper bank
of vector registers (xmm16 - xmm31) with a non-empty stack on SKL machines
where AVX-512 is available.
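
To illustrate why a wrong equals result alone is enough to produce the
symptoms above, here is a minimal sketch. FlakyKey and BrokenEqualsDemo are
hypothetical names made up for this example (they are not ZK or JDK code),
and the faulty intrinsic is only imitated with a switch that forces
equals() to return false.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a String key whose equals() can be forced to
// misbehave, imitating the faulty intrinsic. Not ZooKeeper or JDK code.
final class FlakyKey {
    static boolean faulty = false; // when true, equals() wrongly says "not equal"
    private final String value;

    FlakyKey(String value) { this.value = value; }

    @Override public int hashCode() { return value.hashCode(); }

    @Override public boolean equals(Object o) {
        if (this == o) return true;                 // same instance
        if (!(o instanceof FlakyKey)) return false;
        if (faulty) return false;                   // imitated wrong result
        return value.equals(((FlakyKey) o).value);
    }
}

class BrokenEqualsDemo {
    /** Adds a key, then tries to remove it via a different but equal instance. */
    static boolean removeWithFreshInstance(boolean faulty) {
        Set<FlakyKey> children = new HashSet<>();
        children.add(new FlakyKey("child-1"));
        FlakyKey.faulty = faulty;
        try {
            return children.remove(new FlakyKey("child-1"));
        } finally {
            FlakyKey.faulty = false; // always reset the switch
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy equals, removed: " + removeWithFreshInstance(false));
        System.out.println("faulty equals, removed: " + removeWithFreshInstance(true));
    }
}
```

The hash code still matches, so the lookup reaches the right bucket, but
the entry is then skipped because equals() says the keys differ. No
exception is thrown, which is part of why this is hard to spot.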

The macro-assembler bug we hit is in vptest, which is used in the
string_compare macro-assembler code
<http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>.
It uses add/sub instructions when temporarily saving register values to
and restoring them from the stack, which overwrites the ZF (zero flag) in
the FLAGS register set by the preceding test instruction.

In our case, if the key exists in the DataNode.children HashSet, the
result of the test instruction is zero and the ZF bit is set to 1. If the
RSP value is not 0 (i.e. the stack is not empty) after the addptr in that
code, the ZF bit is changed back to 0, so the String.equals comparison
during removeNode returns false and the key is not removed.
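
The flag clobbering can be modeled roughly as follows. This is a toy model
written for this email, not HotSpot code: ZeroFlagModel and its methods are
hypothetical, only ZF and RSP are modeled, and the 64-byte slot size is an
assumption based on the width of a 512-bit register.

```java
// Toy model of the zero flag and stack pointer. Hypothetical, not HotSpot code.
final class ZeroFlagModel {
    private long rsp;   // modeled stack pointer
    private boolean zf; // modeled zero flag

    private ZeroFlagModel(long rsp) { this.rsp = rsp; }

    // test-like instruction: ZF is set when (a & b) is zero
    private void test(long a, long b) { zf = (a & b) == 0; }

    // add on the stack pointer: ZF now reflects the result of the add itself
    private void addToRsp(long imm) { rsp += imm; zf = (rsp == 0); }

    /**
     * diffBits == 0 means the compared data is identical. Returns what the
     * caller reads from ZF afterwards (true means "equal").
     */
    static boolean reportsEqual(long diffBits, long rsp, boolean restoreWithAdd) {
        ZeroFlagModel cpu = new ZeroFlagModel(rsp);
        cpu.test(diffBits, diffBits);
        if (restoreWithAdd) {
            cpu.addToRsp(64); // assumed: release one 64-byte scratch slot
        }
        return cpu.zf;
    }

    public static void main(String[] args) {
        long rsp = 0x7ffe0000L; // arbitrary non-zero stack pointer
        System.out.println("no add after test:  " + reportsEqual(0, rsp, false));
        System.out.println("add after test:     " + reportsEqual(0, rsp, true));
    }
}
```

In this model, identical inputs are reported as different once the add
runs after the test, while inputs that really differ are still reported
as different. That matches what we saw: lookups and removes miss entries
that are actually present.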

A bug is reported in JDK-8207746. The behavior described there is
different, but we've confirmed the issue by adding assembly code to log it
in JDK 10.

[Solutions]

The possible mitigations are:

1. Disable AVX-512 with the JVM option -XX:UseAVX=2
2. Use an OpenJDK version higher than 10, where the issue has been fixed
in JDK-8207746
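
If you want to check whether a given JVM is exposed, something like the
sketch below may help. Avx512RiskCheck is a hypothetical helper written for
this email. It assumes a HotSpot JVM on x86, that UseAVX level 3 means
AVX-512, and that the UseAVX value reported by the JVM is the level
actually in effect. The "JDK 9 or 10" rule is taken from the root cause
described above.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, assuming a HotSpot JVM on x86.
class Avx512RiskCheck {
    /** Turns "1.8", "10" or "11" into 8, 10 or 11. */
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Assumption: only JDK 9 and 10 with AVX-512 enabled (UseAVX >= 3) are exposed. */
    static boolean isExposed(int featureVersion, int useAvxLevel) {
        boolean affectedJdk = featureVersion == 9 || featureVersion == 10;
        return affectedJdk && useAvxLevel >= 3;
    }

    /** Reads UseAVX from the running JVM; returns -1 if it cannot be read. */
    static int currentUseAvx() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return Integer.parseInt(bean.getVMOption("UseAVX").getValue());
        } catch (RuntimeException e) {
            return -1; // flag missing (e.g. non-x86) or bean unavailable
        }
    }

    public static void main(String[] args) {
        int feature = featureVersion(System.getProperty("java.specification.version"));
        int useAvx = currentUseAvx();
        System.out.println("JDK feature version: " + feature);
        System.out.println("UseAVX: " + useAvx);
        System.out.println("possibly exposed: " + isExposed(feature, useAvx));
    }
}
```

Run it with the same JVM and options as the ZK server, otherwise the
reported UseAVX value may not match what the server actually uses.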

Upgrading to OpenJDK 11+ is the better option, since 10 is not well
supported and AVX-512 does help performance.

We use JDK 10 because of the SSL quorum socket close stall issue mentioned
in ZOOKEEPER-3384 <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>;
the SO_LINGER option is not honored in JDK 11. We've unblocked JDK 11 by
closing the quorum socket asynchronously, and we're upstreaming that in
ZOOKEEPER-3574 <https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.

Thanks,
Fangmin
