Re: null checks vs class resolution: taking a few steps back

Brian Goetz Mon, 20 Apr 2020 11:26:42 -0700

As Fred mentions, he shared this with me a few days ago, and I found
these arguments very persuasive. We have already, several times,
fixated on nullity when the problem was elsewhere, and this feels like
yet another case of that. Fixing the underlying technical debt -- that
the language that `checkcast` and friends have for referring to classes
is inadequate -- feels like addressing the real problem.


On 4/20/2020 1:20 PM, Frederic Parain wrote:

Here are a few thoughts about the null checks vs class resolution issue
(many thanks to Brian for his review and his improvements to this
document).
    Checkcast: is it a null issue or a type issue?

There has been some discussion recently on how casts should be translated.
While the static compiler has considerable latitude on how to translate
language constructs to bytecode, I’d like to make sure that we first
have a clean story at the bytecode level, and then take up the
translation story (if we still need to).


        History, and historical inconveniences
Before Valhalla, classfiles had two ways to denote a reference type: the
plain name used in |CONSTANT_Class_info| entries, and the name within an
envelope in the field and method descriptors used in
|CONSTANT_Fieldref_info|, |CONSTANT_Methodref_info| and
|CONSTANT_InterfaceMethodref_info| entries. Having two syntaxes was
already a sign that something was weird, but we mostly wrote that off as
a historical accident. (Worse, it is not even applied uniformly: arrays
are always denoted with their envelope, even in |CONSTANT_Class_info|
entries.) Aesthetics aside, it worked because there was a single
unambiguous translation from a class name to a class name with envelope.
In the bytecode sequence:
|aload_1
checkcast #10 // class Foo
invokestatic #19 // Method Bar:(LFoo;)V |
the real meaning of the |checkcast| was: “I guarantee that the top of
stack is a reference to an instance of class |Foo| (a.k.a. |LFoo;|),
otherwise I’ll throw an exception”. Because |null| is a valid value of
all reference types, the JVM does not load the class |Foo| if the value
on the top of the stack is |null|, and the verifier is still satisfied
that the arguments on the stack match the signature of the method being
invoked.
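
The null-tolerant behavior described above is visible from plain Java,
since a source-level cast to a reference type compiles to |checkcast|.
The sketch below is only an illustration; the class and method names
(|LegacyNullCast|, |castToString|, |castFails|) are made up for this
example, and it shows only that |null| passes the cast, not whether the
target class gets loaded.

```java
final class LegacyNullCast {
    // The cast below is compiled to a checkcast against java/lang/String.
    static String castToString(Object o) {
        return (String) o;
    }

    // Reports whether the cast rejects the given reference.
    static boolean castFails(Object o) {
        try {
            castToString(o);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castFails(null));    // null passes the cast
        System.out.println(castFails("hello")); // a String passes too
        System.out.println(castFails(42));      // an Integer is rejected
    }
}
```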


        Valhalla turns up the pressure
The Valhalla project introduces a new kind of envelope: |Q*;|. The
spelling has remained the same, but its meaning has evolved with each
prototype:

  * With the |v*| bytecodes, it was a marker of a /new kind of type/;
  * In L-world, it became a marker of /null-hostility/;
  * In the current user model, it has become /part of the type/.

The last two points require some explanation. In L-world, the L and Q
flavors of an inline class were projected from a single set of class
metadata. In this world, there were really three names (the L projection
of C, the Q projection of C, and the class C itself), all of which could
be given meaning. So it still could make sense to denote a class just by
name, but it’s not clear this was a very good idea.

For instance, the |defaultvalue| bytecode used a |CONSTANT_Class_info|
entry referring to the value class by its plain name. This was
unambiguous, because /of course/ the |defaultvalue| bytecode was
referring to the Q-version of the type. (Until some future when we want
to apply |defaultvalue| to reference types, and get |null| out.) The
information was missing from the constant pool entry but deduced from
the context because of the implicit assumption that |defaultvalue| only
applies to Q-types. But there were other cases where even such implicit
assumptions were not sufficient to deduce which variant of a value type
should be used. The |checkcast| bytecode was one of these cases; it then
became necessary to denote the class argument with the full envelope in
order to express the expected behavior.

With the new model of inline types, a class can only have one envelope:
either |Q| if it is an inline type, or |L| otherwise. This means that
|LFoo;| and |QFoo;| are not two variants of the same type, but are in
fact /two different types/.

As much as we’d like to ignore it, if |Foo| is an inline type, it is
still possible to forge a reference with type |LFoo;|: we can create a
class that declares a field of type |LFoo;|, instantiate an instance,
and read the field. This |LFoo;| is a pretty silly type; it cannot
interact with any other type, and it can only hold |null|. But the JVM
has to deal with such silly types all the time, such as |LBar;| when
|Bar| is a nonexistent class. The reality is that |LFoo;| and |QFoo;|
are two different types (with completely disjoint value sets!), and we
should be honest about it.

    In the current inline type model, the envelope is an essential part
    of the identification of a type.


    Checkcast
The legacy behavior of |checkcast| is on a collision course with the new
type system. If the following bytecode sequence:

|aload_1
checkcast #10 // class Foo |

still means the same as before (checking that the reference on the top
of the stack is of type |LFoo;|), we have a problem if |Foo| is an
inline class: if the top of stack holds |null|, the |checkcast| will
succeed (because null is indeed a valid value of the otherwise-useless
type |LFoo;|), but this is not really what we had in mind when we asked
whether the top of the stack held a |Foo|.

It is easy to assume that this is just yet another bad nullity behavior,
and forgivable to make this assumption because |null| has been the
source of so much bad behavior in the past. But this would be putting
the blame in the wrong place.

    In this example, the |checkcast| operation is simply operating on
    the wrong type, assuming |LFoo;| where it has no right to do so:
    |LFoo;| and |QFoo;| are completely distinct types.


        Quick, plug the hole!
There was a lot of discussion on the EG mailing list, and many proposals
for ways to restore peace and tranquility. Unfortunately, they all seem
to be “quick fixes”, each likely to generate new problems of its own.
Without recapitulating the details of each of them, here’s a summary of
their shortcomings:

  * *Generate a different sequence of bytecodes when casting to an inline
    type.* This is a workaround for the current |checkcast| behavior, but
    is likely to cause trouble for generic code in the future that is
    specializable over both identity and inline types, because the goal
    is to share the bytecode across instantiations, and only patch the
    constant pool or type descriptors.
  * *Use |Class::cast|.* |Class::cast| is a generic method returning T,
    which is erased to |Object|, which will hide the type information
    the verifier needs to guarantee correctness of method argument
    types.
  * *Use |invokedynamic| to call custom behavior.* This has serious risk
    of bootstrapping issues.
  * *Invent a |checknull| bytecode.* This, and other solutions focusing
    on the handling of |null|, address the symptom, not the problem. The
    problem is not the handling of |null|, it is /checking that a
    particular value is within the value set of this particular type/.
    The |null| reference should not be handled separately; its treatment
    should just fall out of addressing the general question of whether a
    given value is in the value set of a given type.
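
The erasure concern raised for |Class::cast| can be observed with
reflection: the method that actually exists in |java.lang.Class| returns
|Object|, so its descriptor carries no trace of the target type. This is
a small illustration only; the class name |ClassCastErasure| and its
helper method are made up for the example.

```java
import java.lang.reflect.Method;

final class ClassCastErasure {
    // Looks up Class.cast(Object) and returns its erased return type.
    static Class<?> erasedReturnTypeOfCast() {
        try {
            Method cast = Class.class.getMethod("cast", Object.class);
            return cast.getReturnType();
        } catch (NoSuchMethodException e) {
            // Not expected: Class.cast(Object) is part of java.lang.Class.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The generic return type T has been erased to its bound.
        System.out.println(erasedReturnTypeOfCast().getName()); // java.lang.Object
    }
}
```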

All of these solutions feel like quick fixes that are likely to bite us
back in the future. Let’s solve the real problem instead.


    Concrete proposal
Let’s fix this by fixing the underlying problem: being explicit about
what type we are dealing with. Specifically, from Valhalla and beyond,
the way to denote a class type in a classfile is always a class name
with an envelope. The two possible envelopes (currently) are the
L-envelope for types with a value set containing |null|, and the
Q-envelope for types with a value set not containing |null|.

This has several pleasant consequences:

  * All representations within the class file itself are unified:
    |CONSTANT_Class_info|, |CONSTANT_Fieldref_info|,
    |CONSTANT_Methodref_info| and |CONSTANT_InterfaceMethodref_info|
    will all use the same syntax, with no more translation required
    between names and type descriptors.
  * Class denotation will be aligned with array denotation, which
    already uses type descriptors in |CONSTANT_Class_info| entries.
  * All bytecodes referencing a |CONSTANT_Class_info| entry will have
    access to the full denotation, envelope + name, even when the class
    has not been loaded yet.
  * The verifier will no longer have to translate between names and type
    descriptors.

For the |checkcast| bytecode, the semantics have to be rephrased:
|checkcast| must ensure that the reference on the top of the stack is
within the value set of the type specified in its argument, or throw an
exception. For |L| types, this is the same behavior as before, but for
|Q| types, the behavior reflects the value set of the type specified in
the classfile. If we have:

|aload_1
checkcast #10 // class LFoo; |

then |checkcast| is being used with a type using an L-envelope, so we
still know |null| is within the value set of |Foo| without having to
load |Foo|. If the top of stack is not the |null| reference, then |Foo|
must be loaded to check if this value is part of the remainder of
|Foo|’s value set, as before.

On the other hand, if we have:

|aload_1
checkcast #11 // class QBar; |

then |checkcast| is used with a type using a Q-envelope, which means
|null| cannot be part of the value set of |Bar|. So if the top of stack
contains the |null| reference, an exception can be thrown (again,
without loading |Bar| if we so desire). If the top of stack is not the
|null| reference, then |Bar| must be loaded to check if this value is
part of |Bar|’s value set, as before.

The bytecode sequence is the same for both inline types and
not-inline-types, with the behavior being controlled by a constant pool
entry, making it suitable for our specialization model, and the
semantics being derived from the type on which |checkcast| operates.
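
The two cases above can be modeled in a few lines of Java. This is only
a sketch of the proposed semantics, not JVM code: the class
|EnvelopeCheckcast|, the |loader| function standing in for class
loading, and the |loaded| log are all hypothetical, ordinary Java
classes stand in for |Foo| and |Bar|, and the choice of
|ClassCastException| for a |null| under a Q-envelope is an assumption
(the text above only says that an exception can be thrown).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class EnvelopeCheckcast {
    // Hypothetical stand-in for class loading: class name -> Class object.
    private final Function<String, Class<?>> loader;
    // Records every class name that had to be "loaded" by a checkcast.
    final List<String> loaded = new ArrayList<>();

    EnvelopeCheckcast(Function<String, Class<?>> loader) {
        this.loader = loader;
    }

    // Models checkcast on an enveloped name such as "LFoo;" or "QBar;".
    Object checkcast(Object ref, String descriptor) {
        if (descriptor.length() < 3 || !descriptor.endsWith(";")) {
            throw new IllegalArgumentException("not an enveloped name: " + descriptor);
        }
        char envelope = descriptor.charAt(0);
        if (envelope != 'L' && envelope != 'Q') {
            throw new IllegalArgumentException("unknown envelope: " + descriptor);
        }
        String name = descriptor.substring(1, descriptor.length() - 1);
        if (ref == null) {
            // The envelope alone decides the outcome; no class is loaded.
            if (envelope == 'L') {
                return null;
            }
            // Assumption: the exception type for this case is not fixed by the text.
            throw new ClassCastException("null is not in the value set of " + descriptor);
        }
        // Non-null reference: the class is needed to check the value set.
        loaded.add(name);
        Class<?> target = loader.apply(name);
        if (target == null) {
            throw new NoClassDefFoundError(name);
        }
        if (!target.isInstance(ref)) {
            throw new ClassCastException(ref.getClass().getName() + " is not a " + descriptor);
        }
        return ref;
    }

    public static void main(String[] args) {
        // String and Integer are stand-ins for the classes Foo and Bar.
        Map<String, Class<?>> known = Map.of("Foo", String.class, "Bar", Integer.class);
        EnvelopeCheckcast vm = new EnvelopeCheckcast(known::get);
        vm.checkcast(null, "LFoo;");       // accepted without loading Foo
        System.out.println(vm.loaded);     // still empty at this point
        vm.checkcast(7, "QBar;");          // non-null: Bar is loaded and checked
        System.out.println(vm.loaded);
        try {
            vm.checkcast(null, "QBar;");   // rejected by the envelope alone
        } catch (ClassCastException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same |checkcast| method handles both envelopes; only the descriptor
string changes, which mirrors the point that a single bytecode sequence
can be shared and steered by the constant pool entry.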

The benefits of always using a name+envelope will be less significant
for other bytecodes, but they still do exist. (For example, using |new|
on an inline type could be caught at verification time instead of
runtime.)

    Let’s take this opportunity to address the real problem, correct
    denotation of types, rather than pinning the blame on |null|
    (however many sins it committed in the past). The current loose
    treatment of non-enveloped names has already caused trouble, and
    will be a huge source of technical debt going forward. Let’s just
    pay it off.


        Backward compatibility
Pre-Valhalla class files only know about the L-envelope, so the JVM can
continue to deal with them applying the old default translation from
names to |L*;| descriptors. The implementation of |checkcast| won’t have
to check the class file version, as the behavior can be deduced directly
from the content of the |CONSTANT_Class_info| (plain name -> old syntax,
name with envelope -> new syntax). New classfiles will reject the old
syntax.
