As Fred mentions, he shared this with me a few days ago, and I found these arguments very persuasive.  We have already, several times, fixated on nullity when the problem was elsewhere, and this feels like yet another case of that.  Fixing the underlying technical debt -- that the language that `checkcast` and friends have for referring to classes is inadequate -- feels like addressing the real problem.

On 4/20/2020 1:20 PM, Frederic Parain wrote:



Here are a few thoughts about the null checks vs class resolution issue
(many thanks to Brian for his review and his improvements to this document).


    Checkcast: is it a null issue or a type issue?

There has been some discussion recently on how casts should be translated.
While the static compiler has considerable latitude on how to translate language constructs to bytecode, I’d like to make sure that we first have a clean story at the bytecode level, and then take up the translation story (if we still need to).


        History, and historical inconveniences

Before Valhalla, classfiles had two ways to denote a reference type: the plain name used in |CONSTANT_Class_info| entries, and the name within an envelope in
the field and method descriptors used in |CONSTANT_Fieldref_info|,
|CONSTANT_Methodref_info| and |CONSTANT_InterfaceMethodref_info| entries.

Having two syntaxes was already a sign that something was weird, but we mostly
wrote that off as a historical accident. (Worse, it is not even applied
uniformly: arrays are always denoted with their envelope, even in
|CONSTANT_Class_info| entries.) Aesthetics aside, it worked because there was a single unambiguous translation from a class name to a class name with envelope.

In the bytecode sequence:

    aload_1
    checkcast #10 // class Foo
    invokestatic #19 // Method Bar:(LFoo;)V

the real meaning of the |checkcast| was: “I guarantee that the top of stack is a reference to an instance of class |Foo| (a.k.a. |LFoo;|), otherwise I’ll throw an exception”. Because |null| is a valid value of all reference types, the JVM does not load the class |Foo| if the value on the top of the stack is |null|, and the verifier is still satisfied that the arguments on the stack match the signature of the method being invoked.


        Valhalla turns up the pressure

The Valhalla project introduces a new kind of envelope: |Q*;|. The spelling has
remained the same, but its meaning has evolved with each prototype:

  * With the |v*| bytecodes, it was a marker of a /new kind of type/;
  * In L-world, it became a marker of /null-hostility/;
  * In the current user model, it has become /part of the type/.

The last two points require some explanation. In L-world, the L and Q flavors of an inline class were projected from a single set of class metadata. In this world, there were really three names — the L projection of C, the Q projection
of C, and the class C itself — all of which could be given meaning. So it
still could make sense to denote a class just by name — but it’s not clear this
was a very good idea.

For instance, the |defaultvalue| bytecode used a |CONSTANT_Class_info| entry referring to the value class by its plain name. This was unambiguous, because /of course/ the |defaultvalue| bytecode was referring to the Q-version of the type. (Until some future when we want to apply |defaultvalue| to reference types, and get |null| out.) The information was missing from the constant pool entry but deduced from the context because of the implicit assumption that |defaultvalue| only applies to Q-types. But there were other cases where even such implicit assumptions were not sufficient to deduce which variant of a value type should be used. The |checkcast| bytecode was one of these cases; it then became necessary to denote the class argument with the full envelope in order to express the expected behavior.

With the new model of inline types, a class can only have one envelope: either |Q| if it is an inline type, or |L| otherwise. Which means that |LFoo;| and |QFoo;| are not two variants of the same type, but are in fact /two different types/.

As much as we’d like to ignore it, if |Foo| is an inline type, it is still possible to forge a reference with type |LFoo;|: we can create a class that declares a field of type |LFoo;|, instantiate it, and read the field. This |LFoo;| is a pretty silly type; it cannot interact with any other type, and it can only hold |null|. But the JVM has to deal with such silly types all the time, such as |LBar;| when |Bar| is a nonexistent class. The reality is that |LFoo;| and |QFoo;| are two different types (with completely disjoint value sets!), and we should be honest about it.

    In the current inline type model, the envelope is an essential part of the identification of a type.
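
To make the point concrete, here is a minimal Java sketch that models a type denotation as an (envelope, name) pair. The `Denotation` record, its `parse` helper, and `admitsNull` are hypothetical names introduced purely for illustration; they are not part of any JVM or JDK API, and the "L admits null, Q does not" rule is simply the model described in this note.

```java
public class DenotationDemo {
    // Hypothetical model: a class type is identified by its envelope AND its name.
    record Denotation(char envelope, String name) {
        // Parses an enveloped name such as "LFoo;" or "QFoo;".
        static Denotation parse(String descriptor) {
            int last = descriptor.length() - 1;
            if (descriptor.length() < 3 || descriptor.charAt(last) != ';') {
                throw new IllegalArgumentException("not an enveloped name: " + descriptor);
            }
            char env = descriptor.charAt(0);
            if (env != 'L' && env != 'Q') {
                throw new IllegalArgumentException("unknown envelope: " + env);
            }
            return new Denotation(env, descriptor.substring(1, last));
        }

        // Assumption of this model: only L-enveloped types have null in their value set.
        boolean admitsNull() {
            return envelope == 'L';
        }
    }

    public static void main(String[] args) {
        Denotation l = Denotation.parse("LFoo;");
        Denotation q = Denotation.parse("QFoo;");
        System.out.println("same name:  " + l.name().equals(q.name()));
        System.out.println("same type:  " + l.equals(q));
        System.out.println("admit null: " + l.admitsNull() + " / " + q.admitsNull());
    }
}
```

Record equality compares both components, so two denotations that share a name but differ in envelope compare as different types, which is the property argued for above.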


    Checkcast

The legacy behavior of |checkcast| is on a collision course with the new type
system. If the following bytecode sequence:

    aload_1
    checkcast #10 // class Foo

still means the same as before (checking that the reference on the top of the stack is of type |LFoo;|), we have a problem if |Foo| is an inline class: if the top of stack holds |null|, the |checkcast| will succeed (because |null| is indeed a valid value of the otherwise-useless type |LFoo;|), but this is not really what we had in mind when we asked whether the top of the stack held a |Foo|.

It is easy to assume that this is just yet another bad nullity behavior, and forgivable to make this assumption because |null| has been the source of so much
bad behavior in the past. But this would be putting the blame in the wrong
place.

    In this example, the |checkcast| operation is simply operating on the wrong type, assuming |LFoo;| where it has no right to do so: |LFoo;| and |QFoo;| are completely distinct types.


        Quick, plug the hole!

There was a lot of discussion on the EG mailing list, and many proposals for
ways to restore peace and tranquility. Unfortunately, they all seem to be
“quick fixes”, each likely to generate new problems of its own. Without
recapitulating the details of each of them, here’s a summary of their
shortcomings:

  * *Generate a different sequence of bytecodes when casting to an inline type.* This is a workaround for the current |checkcast| behavior, but is likely to cause trouble for generic code in the future that is specializable over both identity and inline types, because the goal is to share the bytecode across instantiations, and only patch the constant pool or type descriptors.
  * *Use |Class::cast|.* |Class::cast| is a generic method returning T, which is erased to |Object|, which will hide the type information the verifier needs to guarantee correctness of method argument types.
  * *Use |invokedynamic| to call custom behavior.* This carries a serious risk of bootstrapping issues.
  * *Invent a |checknull| bytecode.* This, and other solutions focusing on the handling of |null|, address the symptom, not the problem. The problem is not the handling of |null|, it is /checking that a particular value is within the value set of this particular type/. The |null| reference should not be handled separately; its treatment should just fall out of addressing the general question of whether a given value is in the value set of a given type.
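
The erasure concern in the |Class::cast| item can be observed with plain reflection. The following small Java sketch (the class and method names `CastErasureDemo` and `erasedReturnTypeOfCast` are made up for this illustration) asks for the runtime-visible return type of `Class.cast`, which is what a method descriptor, and hence the verifier, would see.

```java
import java.lang.reflect.Method;

public class CastErasureDemo {
    // Returns the erased return type of Class::cast, i.e. the type in its descriptor.
    static Class<?> erasedReturnTypeOfCast() {
        try {
            Method cast = Class.class.getMethod("cast", Object.class);
            return cast.getReturnType();
        } catch (NoSuchMethodException e) {
            // Class.cast(Object) is a public JDK method, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Method cast = Class.class.getMethod("cast", Object.class);
        // The generic signature mentions the type variable T ...
        System.out.println("generic: " + cast.getGenericReturnType());
        // ... but the erased return type is Object.
        System.out.println("erased:  " + erasedReturnTypeOfCast());
    }
}
```

Since the erased return type is `Object`, a call to `Class.cast` alone leaves the verifier with no more precise type for the value on the stack than it had before the call.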

All of these solutions feel like quick fixes that are likely to come back to bite us
in the future. Let’s solve the real problem instead.


    Concrete proposal

Let’s fix this by fixing the underlying problem — being explicit about what type we are dealing with. Specifically, from Valhalla and beyond, the way to denote a class type in a classfile is always a class name with an envelope.

The two possible envelopes (currently) are the L-envelope for types with a value
set containing |null|, and the Q-envelope for types with a value set not
containing |null|.

This has several pleasant consequences:

  * All representations within the class file itself are unified: |CONSTANT_Class_info|, |CONSTANT_Fieldref_info|, |CONSTANT_Methodref_info| and |CONSTANT_InterfaceMethodref_info| will all use the same syntax, with no more translation required between names and type descriptors.
  * Class denotation will be aligned with array denotation, which already uses type descriptors in |CONSTANT_Class_info| entries.
  * All bytecodes referencing a |CONSTANT_Class_info| entry will have access to the full denotation, envelope + name, even when the class has not been loaded yet.
  * The verifier will no longer have to translate between names and type descriptors.

For the |checkcast| bytecode, the semantics have to be rephrased: |checkcast| must ensure that the reference on the top of the stack is within the value set of the type specified as its argument, or throw an exception. For |L| types, this is the same behavior as before, but for |Q| types, the behavior reflects the value set of the type specified in the classfile. If we have:

    aload_1
    checkcast #10 // class LFoo;

then |checkcast| is being used with a type using an L-envelope, so we still know |null| is within the value set of |Foo| without having to load |Foo|. If the top of stack is not the |null| reference, then |Foo| must be loaded to check if this value is part of the remainder of |Foo|’s value set, as before.

On the other hand, if we have:

    aload_1
    checkcast #11 // class QBar;

then |checkcast| is used with a type using a Q-envelope, which means |null| cannot be part of the value set of |Bar|. So if the top of stack contains the |null| reference, an exception can be thrown (again, without loading |Bar| if we so desire). If the top of stack is not the |null| reference, then |Bar| must be loaded to check if this value is part of |Bar|’s value set, as before.

The bytecode sequence is the same for both inline and non-inline types. The behavior is controlled by a constant pool entry, which makes it suitable for our specialization model, and the semantics are derived from the type on which |checkcast| operates.
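
The rephrased semantics can be modelled in a few lines of ordinary Java. This is only a sketch under stated assumptions: `CheckcastSketch`, `load` and the `loads` counter are hypothetical, class loading is simulated with `Class.forName`, an ordinary class stands in for an inline class under the Q-envelope, and the choice of `ClassCastException` for a `null` cast to a Q-type is an assumption (the text above only says "an exception").

```java
public class CheckcastSketch {
    // Counts simulated class loads, so that "null is decided without loading" is observable.
    static int loads = 0;

    // Hypothetical stand-in for resolving the class named by the constant pool entry.
    static Class<?> load(String internalName) {
        loads++;
        try {
            return Class.forName(internalName.replace('/', '.'));
        } catch (ClassNotFoundException e) {
            throw new NoClassDefFoundError(internalName);
        }
    }

    // Models checkcast against an enveloped denotation such as "Ljava/lang/String;".
    static Object checkcast(Object value, String descriptor) {
        char envelope = descriptor.charAt(0);
        String name = descriptor.substring(1, descriptor.length() - 1);
        if (value == null) {
            // Decided from the envelope alone: the class is never loaded on this path.
            if (envelope == 'Q') {
                // Assumption: the exact exception type is not specified in the proposal.
                throw new ClassCastException("null is not in the value set of " + descriptor);
            }
            return null;
        }
        // Non-null: the class must be loaded to check the rest of the value set.
        if (!load(name).isInstance(value)) {
            throw new ClassCastException(value.getClass().getName() + " is not a " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(checkcast(null, "Ljava/lang/String;") + ", loads=" + loads);
        System.out.println(checkcast("hi", "Ljava/lang/String;") + ", loads=" + loads);
        try {
            checkcast(null, "Qjava/lang/String;"); // String merely stands in for an inline class
        } catch (ClassCastException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that the same `checkcast` routine serves both envelopes; only the string taken from the (simulated) constant pool entry changes, which mirrors the specialization argument made above.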

The benefits of always using a name+envelope will be less significant for other
bytecodes, but they still do exist. (For example, using |new| on an inline
type could be caught at verification time instead of runtime.)

    Let’s take this opportunity to address the real problem (correct denotation of types) rather than pinning the blame on |null|, however many sins it committed in the past. The current loose treatment of non-enveloped names has already caused trouble, and will be a huge source of technical debt going forward. Let’s just pay it off.


        Backward compatibility

Pre-Valhalla class files only know about the L-envelope, so the JVM can continue
to deal with them by applying the old default translation from names to |L*;|
descriptors. The implementation of |checkcast| won’t have to check the class file version, as the behavior can be deduced directly from the content of the
|CONSTANT_Class_info| (plain name -> old syntax, name with envelope -> new
syntax). New classfiles will reject the old syntax.
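
The content-based rule (plain name -> old syntax, name with envelope -> new syntax) could be sketched as follows. `ClassInfoCompat` and `normalize` are hypothetical names; the sketch relies on the existing rule that `;` cannot appear in a class name, so a string of the form |L...;| or |Q...;| cannot be a legacy plain name. Rejecting the old syntax in new classfiles is a parsing-time check and is not modelled here.

```java
public class ClassInfoCompat {
    // Normalizes the content of a CONSTANT_Class_info entry to an enveloped denotation.
    static String normalize(String content) {
        if (content.startsWith("[")) {
            return content; // arrays already use descriptor syntax
        }
        boolean enveloped = content.length() >= 3
                && (content.charAt(0) == 'L' || content.charAt(0) == 'Q')
                && content.endsWith(";");
        if (enveloped) {
            return content; // new syntax: keep the envelope as written
        }
        return "L" + content + ";"; // old syntax: default translation to the L-envelope
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Foo", "LFoo;", "QBar;", "[LFoo;", "List"}) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

A plain name that merely starts with `L` or `Q` (such as `List`) is still treated as old syntax, because it lacks the trailing `;`.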

