Hi, I'm David's colleague. I'm the one who's spotted this issue twice while GraalVM is doing its points-to analysis with jdk11u-dev. The fuller exception is here:
Caused by: java.lang.NullPointerException at java.base/java.lang.ClassValue$ClassValueMap.loadFromCache(ClassValue.java:535) at java.base/java.lang.ClassValue$ClassValueMap.probeHomeLocation(ClassValue.java:541) at java.base/java.lang.ClassValue.get(ClassValue.java:101) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIMetaAccessContext.fromClass(HotSpotJVMCIMetaAccessContext.java:163) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.fromClass(HotSpotJVMCIRuntime.java:339) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotResolvedObjectTypeImpl.fromObjectClass(HotSpotResolvedObjectTypeImpl.java:83) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotResolvedObjectTypeImpl.fromMetaspace(HotSpotResolvedObjectTypeImpl.java:97) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.CompilerToVM.resolveTypeInPool(Native Method) at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotConstantPool.loadReferencedType(HotSpotConstantPool.java:727) at java.base/jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.oracle.graal.pointsto.infrastructure.WrappedConstantPool.loadReferencedType(WrappedConstantPool.java:88) at com.oracle.graal.pointsto.infrastructure.WrappedConstantPool.loadReferencedType(WrappedConstantPool.java:105) at com.oracle.svm.hosted.phases.SubstrateClassInitializationPlugin.loadReferencedType(SubstrateClassInitializationPlugin.java:71) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.loadReferenceType(BytecodeParser.java:4384) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.maybeEagerlyResolve(BytecodeParser.java:4366) at com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.maybeEagerlyResolve(SharedGraphBuilderPhase.java:127) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.lookupMethod(BytecodeParser.java:4316) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.genInvokeStatic(BytecodeParser.java:1659) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.processBytecode(BytecodeParser.java:5340) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.iterateBytecodesForBlock(BytecodeParser.java:3423) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.processBlock(BytecodeParser.java:3230) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.build(BytecodeParser.java:1088) at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.buildRootMethod(BytecodeParser.java:982) at jdk.internal.vm.compiler/org.graalvm.compiler.java.GraphBuilderPhase$Instance.run(GraphBuilderPhase.java:84) at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.run(Phase.java:49) at jdk.internal.vm.compiler/org.graalvm.compiler.phases.BasePhase.apply(BasePhase.java:214) at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.apply(Phase.java:42) at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.apply(Phase.java:38) at com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.parse(MethodTypeFlowBuilder.java:224) at com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.apply(MethodTypeFlowBuilder.java:351) at com.oracle.graal.pointsto.flow.MethodTypeFlow.doParse(MethodTypeFlow.java:322) at com.oracle.graal.pointsto.flow.MethodTypeFlow.ensureParsed(MethodTypeFlow.java:311) at com.oracle.graal.pointsto.flow.MethodTypeFlow.addContext(MethodTypeFlow.java:112) at com.oracle.graal.pointsto.flow.StaticInvokeTypeFlow.update(InvokeTypeFlow.java:437) at com.oracle.graal.pointsto.BigBang$2.run(BigBang.java:530) at com.oracle.graal.pointsto.util.CompletionExecutor.lambda$execute$0(CompletionExecutor.java:173) at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426) ... 5 more On Fri, Aug 7, 2020 at 11:36 PM John Rose <john.r.r...@oracle.com> wrote: > On Aug 7, 2020, at 9:52 AM, Andrew Haley <a...@redhat.com> wrote: > > > > On 8/7/20 4:39 PM, David Lloyd wrote: > > > > > However, I'm wondering if this isn't a side effect of what appears > > > to be an incorrect double-checked lock at lines 374-378 of > > > ClassValue.java [1]. In order for the write to the non-volatile > > > `cache` field of ClassValueMap, it is my understanding that there > > > must be some acquire/release edge from where the variable is > > > published to guarantee that all of the writes are visible to the > > > read site, and I'm not really sure that the exit-the-constructor > > > edge is going to match up with with the synchronized block which > > > protects a different non-volatile field. And even if my feeling > > > that this is dodgy is valid, I'm not sure whether this NPE is a > > > possible outcome of that problem! > > > > It certainly doesn't look right to me. AFAICS this is classic broken > > double-checked locking. It'd need some sort of fence after > > constructing the ClassValueMap instance and before publishing it. > > Y’all are calling my baby ugly but he’s really cute if you look > at him from the right angle. > > (First, a quick question: This bug shows up on x86, right? > Yes, the issue is happening on x86. > If so, we probably are not looking at a hardware reordering > problem but something stemming from Graal’s reordering > of operations compared to C1 and C2, perhaps after a different > set of inlining decisions uncharacteristic of C2.) > To the best of my knowledge, GraalVM points-to analysis is just a JVM process that figures out which code is used by the end-user application. The result of this process is then used to create the native executable. I'm not aware of the Graal compiler having an impact here. >From what I've learnt looking at this code and more recent JVMCI implementations, jdk11u-dev relies on reflection to understand the end-user application, whereas more modern JVMCI stacks avoid the use of reflection and query the JVM directly. I've been trying to create a vanilla JDK stress test that would trigger such NPE but I've not had luck so far. I don't know whether JVMCI has a part to play here. > > The bug indeed looks like a dropped fence at the end of the > constructor for ClassValueMap. (I don’t see an easier way to > get a null at the NPE point.) The fence might not actually be > dropped, but a write which the fence is supposed to fence > might have been reordered after the fence, and after an > unlock operation, allowing someone to see the initial > null after the unlock. > > The double-check idiom breaks when the outer check > (not synchronized) gets half-baked data and acts on it. > This much-maligned idiom would not be broken in the > present case under the assumption that a data dependency > on type.classValueMap (set in initializeMap) will see > a fully initialized value of ClassValueMap.cacheArray. > > Now we get into the fine print, and I agree there is a bug > here. The easy fix is to repair the logic under which that > ugly everybody’s-hating-on-it double check idiom would > be in fact correct. > > The fine print requires a couple different things that are not > quite fulfilled by the present code, and Graal has either > reordered the memory accesses to expose the problem, or > it (itself) has a complementary bug. (Or it has an unrelated > bug, and you people are *staring* at my baby!) > > First, as Paul noted, you need a final variable in your class > to get the benefit of a fence at the end of the constructor. > Oops #1: ClassValueMap doesn’t have *any* finals. That’s > awkward. Two fixes for that: (a) add a dummy final, and > (b) add a real fence in the constructor. For better luck, > maybe put the fence at the end of sizeCache. > > On machines (not x86) which need fences *before* final > reads, another explicit fence should be placed before the > unsynchronized read (or else inside the getCache method). > > By the letter of the JMM, the helpful effects of the fence > at the end of the constructor are only guaranteed for > references which depend on final variables set by that > constructor, because only those final variables get a > “freeze” operation which stabilizes them (and values > dependently loaded via them). Throwing in a dummy > final is therefore not guaranteed to work. But throwing > in a fence will work. One downside of the fence is that > the JMM is silent about fences, but the JMM is widely > recognized as a partial story, not the full or final story. > > (Here’s a tidbit of JMM politics: I once heard Doug Lea > considering whether maybe all fields should be treated > more like final fields. I don’t know if this is still a live > idea, but it would make this bug go way, since nearly all > constructors would then get fences of some sort.) > > Here’s a bit of explanatory (non-normative) text from the draft > POPL paper for JMM, where the parenthetic comment indicates > (I think) the sort of thing that is happening here: > > > Let us now assume that the acquire and release do happen. As long as > > these actions take place after object has been constructed (and there > > is no code motion around the end of the constructor), the diffs that > > the second processor acquires are guaranteed to reflect the correctly > > constructed object. > > Basically, the synchronization statement is expected to do a > release *after* the non-null value is stored. It looks like this > is failing due to a lost and/or reordered fence. > > Second, David says: > > > I’m not really sure that the exit-the-constructor edge is going to match > > up with with the synchronized block which protects a different > > non-volatile field. > > By the letter of the JMM (AFAIUI), this code is coloring way outside > the lines. (Yes, I admit it.) The problem is pretty fundamental. > I went and swapped in some JMM spec just now (don’t have it in > short term memory; go figure), and here’s what I re-learned. > > In JMM terms, a “release” is the earlier end of any edge in relation > called “synchronizes-with”. An “acquire” is the latter end of such > an edge. In the JMM this relation applies only to lock, unlock, > and volatile reads and writes. Your guess is as good as mine whether > other operations (CAS, fences, etc.) participate; the JMM is silent. > Crucially, acquires and release do not touch plain fields. This is > counter-intuitive for those of us who like to reason with fences. > > The JMM prevents regular writes from floating past what we like > to think of as “release points” (unlocks) by appealing to processor > order inside lock/unlock pairs, and also by appealing to global > ordering of multiple lock/unlock pairs (or stuff with volatiles > and other corner cases which we will neglect here). In order > to work right, a normal write has to come before a release *and* > a matching normal read has to come after an acquire, and such > an acquire is nothing other than the lock of a later lock/unlock > pair, typically in another thread. > > So the rules which reliably connect normal writes to normal > reads in other threads work differently from acquire fences and > release fences as we (or just I?) usually think of them. Yes, there > are “acquires” and “releases” in the JMM. No, they are not > primitives but rather descriptions of the way the happens-before > relation is constrained by (among other things) lock and unlock > actions. > > On x86, you don’t have to worry about acquires; you just set up > the right release fences. In the JMM, there’s no way to get either > an acquire or a release without a synchronized statement (or > other fancy stuff like volatiles). That’s why double-check is > broken in the pure JMM, and why it can be repaired if you > add some (non-JMM) fences. > > So why is this showing up suddenly with Graal? Possibly it’s > got a really aggressive interpretation of the JMM, relative to > the standard HotSpot JITs. Possibly it’s got a bug. Graal might > be allowing the initialization of ClassValueMap.cacheArray > to float down past publication of the ClassValueMap on > type.classValueMap (in initializeMap). Less likely, it is allowing > the initialization to totally escape the lock/unlock pair, something > the JMM might allow but the Cookbook advises against (in its > “Can Reorder” table). > > Perhaps Graal is modeling the “freeze” operations on finals more > accurately (and the Cookbook gives instructions for this). That > would allow the ClassValueMap to correctly initialize its frozen > finals, but not (unfortunately) the not-frozen cacheArray field. > It appears that the new JIT on the block is exposing my neglect > for my poor baby, in not putting the right fences around its playpen. > > Bottom line: Add a release fence before type.cVM is set, and add > a (no-op on x86) acquire fence in getCache. Quick test: Add a dummy > final to CVM, to see if that makes the bug let go. > > — John > >