Number 2 of 100 in a series of “What we learned in Phase I of Project Valhalla.” This one focuses on the challenges of evolving a class to be any-generic, while interacting with existing erased code. No solutions here, just recaps of problems and challenges.

Let’s imagine a class today:

|interface Boxy<T> {
    T get();
    void set(T t);
}

class Foo<T> implements Boxy<T> {
    public T t;
    public T[] tArray;

    public Foo(T t) { set(t); }

    public static <T> Foo<T> of(T t) { return new Foo<>(t); }

    // Interface methods are implicitly public, so the implementations must be too.
    public T get() { return t; }

    @SuppressWarnings("unchecked")
    public void set(T t) {
        this.t = t;
        this.tArray = (T[]) new Object[] { t };
    }
}
|

and client code

|Foo<String> fs = new Foo<>("boo");
println(fs.t);
println(fs.tArray);
println(fs.get());
Foo<?> wc = fs;
if (wc instanceof Foo) { ... }
|

When we compile this code, we’ll encounter |LFoo;| or |Constant_class[Foo]| or just plain |Foo| in the following contexts:

 * Foo extends Bar
 * instanceof/checkcast Foo
 * new Foo
 * anewarray Foo[]
 * getfield Foo.t:Object
 * invokevirtual Foo.get():Object
 * Method descriptors of |Foo::of|

We translate raw |Foo|, |Foo<String>|, and |Foo<?>| all the same way today — |LFoo|.
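To make the "all the same way" point concrete, here is a small sketch that inspects a cut-down copy of |Foo| with reflection. The names |ErasedFoo| and |ErasureDemo| are made up for this illustration; the sketch only shows today's erased behavior, not anything about specialization.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Cut-down stand-in for the Foo<T> above (hypothetical name, illustration only).
class ErasedFoo<T> {
    public T t;
    public ErasedFoo(T t) { this.t = t; }
    public T get() { return t; }
}

class ErasureDemo {
    // Every instantiation, and the wildcard view, share one runtime class today.
    static boolean sameRuntimeClass() {
        ErasedFoo<String> fs = new ErasedFoo<>("boo");
        ErasedFoo<Integer> fi = new ErasedFoo<>(42);
        ErasedFoo<?> wc = fs;
        return fs.getClass() == fi.getClass() && wc.getClass() == ErasedFoo.class;
    }

    // The erased field type that "getfield Foo.t:Object" refers to.
    static Class<?> fieldType() {
        try {
            Field f = ErasedFoo.class.getField("t");
            return f.getType();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The erased return type that "invokevirtual Foo.get():Object" refers to.
    static Class<?> returnType() {
        try {
            Method m = ErasedFoo.class.getMethod("get");
            return m.getReturnType();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(fieldType());
        System.out.println(returnType());
    }
}
```

Both the field and the method report |java.lang.Object|, which is exactly what legacy client classfiles have baked into their descriptors.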


       Tentative simplification: reference instantiations are always erased

The specialization transform takes a template class and a set of type parameters and produces a specialized class. This can cause member (and supertype) signatures to change; for example, if we have

|T get() |

which erases to

|Object get() |

when we specialize with T=int, we’ll have

|int get() |

In theory, there’s nothing to stop us from specializing |List<T>| with T=String. However, in the earlier exploration, we settled on the tentative simplification of always erasing reference instantiations, and only specializing value instantiations. This is a tradeoff; we’re still throwing away potentially useful type information (erasure haters will be disappointed), in exchange for much greater sharing, and avoiding some compatibility issues (existing generic code is rife with tricks like “casting through wildcards” to coerce a |Foo<A>| to |Foo<B>|, which only works as long as we erase; dirty tricks like this are often necessary as there are some things that are hard to express in the generic type system, even though the programmer knows them.)
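For readers who have not met the "casting through wildcards" trick, a minimal sketch follows. It uses |java.util.List| rather than |Foo|, and the class name |WildcardCastDemo| is invented for the example. The cast succeeds only because both parameterizations share one erased runtime class, so there is nothing for the VM to check.

```java
import java.util.ArrayList;
import java.util.List;

class WildcardCastDemo {
    // Coerce List<String> to List<Object> by going through List<?>.
    @SuppressWarnings("unchecked")
    static List<Object> coerce(List<String> strings) {
        List<?> wild = strings;       // always legal: the wildcard is a supertype
        return (List<Object>) wild;   // unchecked: erasure leaves nothing to check at run time
    }

    // The coerced reference is the very same object, not a copy.
    static boolean sameObject() {
        List<String> strings = new ArrayList<>();
        strings.add("boo");
        List<Object> objects = coerce(strings);
        return (Object) objects == (Object) strings && objects.get(0).equals("boo");
    }

    public static void main(String[] args) {
        System.out.println(sameObject());
    }
}
```

If reference instantiations were specialized into distinct runtime types, a cast like this would presumably have to fail or be given some new meaning, which is the compatibility issue the paragraph above alludes to.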

Ignoring multiple type parameters for the moment, when |Foo| becomes specializable, our model is that it will have an /erased/ species — call it |Foo<erased>|. (If you ask it what its type parameters are, it will say “erased”. That is, we reify the fact that it is erased…) While migrating from erased to specialized generics requires source changes and recompilation at the generic class declaration, it should not require any changes or recompilation for clients. That means that legacy client classfiles that talk about |Foo| must be considered to be talking about |Foo<erased>|. (Hierarchies can be specialized from the top down, so it is OK to specialize |Bar| before |Foo|, but not the other way around.)

While the generic specialization machinery will have no problem with specializing to L-types, I think it’s a simplification we should hold on to, that we treat all L type parameters as “erased” for purposes of specialization.


       Additional simplification: let’s not worry about primitives

In Burlington, we concluded that as long as there’s a Pox class for each primitive, we can convert primitives to/from poxes through source compiler transforms, and not worry about specializing over primitives. Instead, when the user wants to specialize |List<int>|, we specialize for int’s pox. Except for those pesky arrays … more on that later.


       Assumption: wild means wild

On the other hand, one of the non-simplifying assumptions we want to make is that a wildcard type — |Foo<?>| — should describe any instantiation of |Foo|, even when the wildcard-using code doesn’t know about specialization. (Same with raw usages of |Foo|.) For example, if the user has written a method:

|void takeFoo(Foo<?> anyFoo) { anyFoo.m(); } |

in legacy (erased) code, we should be able to call |takeFoo()| with both erased and specialized instances of |Foo|. As we’ll see, this complicates member access, and really complicates arrays.

We will find utterances like

|invokevirtual Foo.get():Object
getfield Foo.m:Object
|

in legacy code; we want these to work against any specialization of |Foo|.

In the case where the instance is erased, things obviously have a decent chance of lining up properly, as the erased members will not have been specialized away. If our receiver is a specialized |Foo|, it gets harder, as the member signatures will have changed due to specialization.

Starting in Model 2, we handled this with bridge methods; for each specialized method, we also had an erased bridge. This is possible because there’s an easy coercion from |QPoint| to |LObject|. (There are other ways to get there besides bridges.)
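As an analogy only: javac already emits bridge methods today when an override narrows a return type, and that existing mechanism gives a feel for what an erased bridge does. In the sketch below, |DemoPoint|, |ErasedBoxy|, |PointFoo| and |BridgeDemo| are all hypothetical names, and |DemoPoint| is an ordinary reference class standing in for |QPoint|. The Model 2 bridges coerced value types; this sketch does not attempt that.

```java
import java.lang.reflect.Method;

// Hypothetical payload type; a reference-class stand-in for the QPoint in the text.
final class DemoPoint {
    final int x, y;
    DemoPoint(int x, int y) { this.x = x; this.y = y; }
}

// The erased signature that legacy clients were compiled against.
interface ErasedBoxy {
    Object get();
}

// Hand-written stand-in for a specialization whose get() has a narrower signature.
class PointFoo implements ErasedBoxy {
    private final DemoPoint t;
    PointFoo(DemoPoint t) { this.t = t; }

    // javac also emits a synthetic bridge "Object get()" that forwards to this method.
    @Override
    public DemoPoint get() { return t; }
}

class BridgeDemo {
    // Legacy-style call site: it links against get() returning Object.
    static Object legacyGet(ErasedBoxy boxy) { return boxy.get(); }

    // Counts compiler-generated bridges named get whose return type is Object.
    static int erasedBridgeCount() {
        int count = 0;
        for (Method m : PointFoo.class.getDeclaredMethods()) {
            if (m.isBridge() && m.getName().equals("get") && m.getReturnType() == Object.class) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DemoPoint p = new DemoPoint(1, 2);
        System.out.println(legacyGet(new PointFoo(p)) == p);
        System.out.println(erasedBridgeCount());
    }
}
```

The legacy call site never mentions |DemoPoint|; it reaches the narrower method only through the bridge, which is the property the erased bridges were meant to provide.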

Where this completely runs out of gas is in field access; there’s no such thing as a “bridge field”. So legacy code that does |getfield Foo.t:Object| will fall over at link time, since the type of field |t| in a specialized |Foo| might not be |Object|.

Another place this falls short is when a signature has |T[]| in it. Even with bridge methods, without either array covariance (this is what I meant when I said it might come back) or a willingness to copy, a legacy client that invokes a method that returns a |T[]| will invoke it expecting an |Object[]|, but without array covariance, a |Point.Val[]| is not an |Object[]|. (Note that relatively few methods actually expose |T[]| parameters, so it’s possible there are other dodges here.)
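Today's Java already has one family of arrays that are not |Object[]|, namely primitive arrays, so they make a convenient stand-in for the |Point.Val[]| problem. The sketch below (class name |ArrayCovarianceDemo| is made up, and |int[]| is only an assumption-laden proxy for a value-type array) shows the failing subtype test and the "willingness to copy" alternative.

```java
class ArrayCovarianceDemo {
    // Reference arrays are covariant: a String[] is an Object[].
    static boolean referenceArrayIsObjectArray() {
        Object a = new String[] { "boo" };
        return a instanceof Object[];
    }

    // Primitive arrays are not: an int[] is an Object, but not an Object[].
    static boolean primitiveArrayIsObjectArray() {
        Object a = new int[] { 1 };
        return a instanceof Object[];
    }

    // The copying dodge: box each element into a fresh Object[].
    // The result is a snapshot; writes to it do not reach the original array.
    static Object[] copyToObjectArray(int[] values) {
        Object[] out = new Object[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(referenceArrayIsObjectArray());
        System.out.println(primitiveArrayIsObjectArray());
        System.out.println(copyToObjectArray(new int[] { 1, 2, 3 }).length);
    }
}
```

Copying changes aliasing semantics, which is part of why it is only a dodge and not a general answer.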


       Wildcards

One of the central challenges of pushing specialization into the VM is how we’re going to handle wildcards. Given a generic class |Foo|, the wildcard type |Foo<?>| is a supertype of any instantiation |Foo<X>| of |Foo|. The wildcard type also erases to |LFoo|.

In Model 2, we modeled wildcards as interfaces, with lots and lots of bridges, but this still fell short in a number of ways: no support for non-public methods or for fields, and we had to deal with fields by hoisting them into virtual bridges on the interface.
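A rough source-level picture of the wildcard-as-interface idea, with a field hoisted into a virtual accessor, might look like the following. Every name here (|AnyFoo|, |ErasedSpecies|, |IntSpecies|, |fieldT|) is hypothetical, |int| plays the role of a value type, and the real Model 2 translation worked on classfiles, so treat this strictly as a sketch of the shape.

```java
// Hypothetical wildcard interface: the common supertype standing in for Foo<?>.
interface AnyFoo {
    Object get();      // erased view of T get()
    Object fieldT();   // the field t, hoisted into a virtual accessor
}

// Stand-in for the erased species.
final class ErasedSpecies implements AnyFoo {
    Object t;
    ErasedSpecies(Object t) { this.t = t; }
    public Object get() { return t; }
    public Object fieldT() { return t; }
}

// Stand-in for a specialized species; int plays the role of a value type.
final class IntSpecies implements AnyFoo {
    int t;
    IntSpecies(int t) { this.t = t; }
    int getInt() { return t; }                 // the specialized member
    public Object get() { return getInt(); }   // bridge: coerces to Object by boxing
    public Object fieldT() { return t; }       // field bridge, also boxing
}

class WildcardInterfaceDemo {
    // Wildcard-style client code: works against any species.
    static Object takeFoo(AnyFoo anyFoo) { return anyFoo.get(); }
    static Object readT(AnyFoo anyFoo) { return anyFoo.fieldT(); }

    public static void main(String[] args) {
        System.out.println(takeFoo(new ErasedSpecies("boo")));
        System.out.println(takeFoo(new IntSpecies(7)));
        System.out.println(readT(new IntSpecies(7)));
    }
}
```

The sketch also hints at the shortfalls listed above: interface members here are all public, and a client that reads the field directly, not through |fieldT()|, gets no help from the interface.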

Note that the wildcard subtyping also matters to the verifier, in addition to handling bytecodes; the verifier must know that any specialization of |Foo| is a subtype of the wildcard |LFoo|.


       But what does |LFoo| mean?

Careful readers will notice that we’ve been playing fast and loose with the meaning of |Foo|; sometimes it means the class, sometimes the wildcard, and sometimes the erased species.

The best intuition we’ve been able to come up with is:

 * There are /classes/ and /crasses/.
 * A crass describes a single runtime type; it has a layout, methods,
   constructors, etc.
 * A (template) class describes a family of runtime types.
 * A (template) class is like an abstract type; it has members and
   subtypes, but can’t be instantiated directly.
 * All the crasses derived from a class are subtypes of the class.
 * For purposes of instantiation, we interpret |new Foo| as creating an
   instance of the erased species, and a similar game with |<init>|
   methods.


   Model 3 classfile extensions

In Model 3, we extended the constant pool with some new entries:

*TypeVar[n, erasure].* This is a use of a type variable, identified by its index /n/. (There was a table-of-contents attribute listing all the type variables declared in a generic class or method, including those declared in enclosing generic classes or methods.) Since the erasure of a type variable is not merely a property of the type variable, but in fact a property of how it is used, each use of a type variable carries around its own erasure. For a field whose type is |T|, the |NameAndType| points not to |Object|, but to |TypeVar[0, Object]|.

When specializing a type variable to |erased|, any uses of that type variable are replaced with the erasure in the |TypeVar| entry.

*MethodType[D,T…].* This is largely a syntactic mechanism, allowing us to represent method descriptors with holes (it also had the benefit of compressing the constant pool somewhat). The parameter |D| was a method type descriptor, except that in addition to the existing types, one could specify |#| to indicate a hole; the |T...| parameters are CP indexes to other types (which could be UTF8 strings, or |TypeVar|, or the other type CP entries listed below).

For example, a method

|int size(T t) |

would have a signature

|#1 = TypeVar[0, Object]
#2 = MethodType[(#)I, #1]
|

When specializing a |MethodType|, its parameters are recursively specialized, and then the resulting strings concatenated.

*ParamType[C,T…].* This represents a parameterized type, where |C| is a class name, and |T...| are the type parameters. So |List<int>| would be represented as |ParamType[List,I]|, and |List<T>| would be represented as |ParamType[List,TypeVar[0,e]]|. When specializing a |ParamType|, its parameters are recursively specialized, and then the resulting instantiation is computed.

*ArrayType[T,rank].* This represents an array of given rank.

The type parameters of a |ParamType|, |ArrayType|, or |MethodType| can themselves be a |TypeVar|, |ParamType|, or |ArrayType|, as well as a UTF8.
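The recursive specialization described for |TypeVar|, |MethodType| and |ArrayType| can be sketched as a toy model over descriptor strings. All class names below are invented, a |null| binding is this sketch's own convention for "erased", and |ParamType| is left out because the text does not say what form its computed instantiation takes. This is not the prototype's code.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the Model 3 constant-pool entries (all names hypothetical).
// bindings[n] is the descriptor bound to type variable n, or null for "erased".
interface CpTypeEntry {
    String specialize(String[] bindings);
}

// A plain UTF8 descriptor such as "I"; specialization leaves it unchanged.
final class Utf8Entry implements CpTypeEntry {
    final String descriptor;
    Utf8Entry(String descriptor) { this.descriptor = descriptor; }
    public String specialize(String[] bindings) { return descriptor; }
}

// TypeVar[n, erasure]: each use carries its own erasure.
final class TypeVarEntry implements CpTypeEntry {
    final int index;
    final String erasure;
    TypeVarEntry(int index, String erasure) { this.index = index; this.erasure = erasure; }
    public String specialize(String[] bindings) {
        String bound = bindings[index];
        return bound == null ? erasure : bound;
    }
}

// ArrayType[T, rank]: rank copies of '[' followed by the specialized element type.
final class ArrayTypeEntry implements CpTypeEntry {
    final CpTypeEntry element;
    final int rank;
    ArrayTypeEntry(CpTypeEntry element, int rank) { this.element = element; this.rank = rank; }
    public String specialize(String[] bindings) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rank; i++) sb.append('[');
        return sb.append(element.specialize(bindings)).toString();
    }
}

// MethodType[D, T...]: a descriptor template in which each '#' is a hole.
final class MethodTypeEntry implements CpTypeEntry {
    final String template;
    final List<CpTypeEntry> holes;
    MethodTypeEntry(String template, CpTypeEntry... holes) {
        this.template = template;
        this.holes = Arrays.asList(holes);
    }
    public String specialize(String[] bindings) {
        StringBuilder sb = new StringBuilder();
        int next = 0;
        for (char c : template.toCharArray()) {
            // Fill holes left to right with the recursively specialized parameters.
            if (c == '#') sb.append(holes.get(next++).specialize(bindings));
            else sb.append(c);
        }
        return sb.toString();
    }
}

class SpecializeDemo {
    static final String OBJECT = "Ljava/lang/Object;";
    // int size(T t)  ==>  MethodType[(#)I, TypeVar[0, Object]]
    static final CpTypeEntry SIZE = new MethodTypeEntry("(#)I", new TypeVarEntry(0, OBJECT));
    // T[] toArray()  ==>  MethodType[()#, ArrayType[TypeVar[0, Object], 1]]
    static final CpTypeEntry TO_ARRAY =
            new MethodTypeEntry("()#", new ArrayTypeEntry(new TypeVarEntry(0, OBJECT), 1));

    public static void main(String[] args) {
        System.out.println(SIZE.specialize(new String[] { null }));     // erased
        System.out.println(SIZE.specialize(new String[] { "I" }));      // T=int
        System.out.println(TO_ARRAY.specialize(new String[] { "I" }));  // T=int
    }
}
```

With |T| erased, |SIZE| specializes to |(Ljava/lang/Object;)I|; with |T=int| it becomes |(I)I|, and |TO_ARRAY| becomes |()[I|.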

We found that as a template language, these types allowed exactly the sort of expressiveness needed, and specialized efficiently down to concrete descriptors. (In the M3 prototype, we had concrete descriptors of the form |List$0=I| to describe |List<int>|; obviously we don’t want that here.) These designs captured all the complexity we needed (especially that of erasure), and allowed a mechanical translation into Java 8 classfiles.
