Re: Revisiting default values

Brian Goetz Mon, 15 Mar 2021 08:52:47 -0700

Picking this issue up again. To summarize Dan's buckets:

Bucket 1 -- the zero default is in the domain, and is a sensible default value. Zero for numerics, empty optionals.


Bucket 2 -- there is a sensible default value, but all-zero-bits isn't it.

Bucket 3 -- there simply is no sensible default value.

Ultimately, though, this is not about defaults; it is about _uninitialized variables_. The default only comes into play when the user uses an uninitialized variable, which usually means (a) uninitialized fields or (b) uninitialized array elements. It is possible that the language could give us seat belts to dramatically narrow the chance of uninitialized fields, but uninitialized array elements are much harder to stamp out.
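For reference, here is how the two cases look in today's Java. This is a small illustration using only existing language features; the class and field names are made up, and nothing here is part of any proposal in this thread.

```java
// Illustration using today's Java only; names are hypothetical.
final class DefaultsDemo {
    static int count;     // (a) uninitialized field: silently holds 0
    static String name;   // (a) uninitialized reference field: silently holds null

    public static void main(String[] args) {
        int[] numbers = new int[3];       // (b) array elements start out as 0
        String[] labels = new String[3];  // (b) array elements start out as null

        System.out.println(count);        // 0
        System.out.println(name);         // null
        System.out.println(numbers[1]);   // 0
        System.out.println(labels[1]);    // null
    }
}
```

Local variables are a different story: the compiler's definite-assignment rules reject a read before a write, which is why the discussion centers on fields and array elements.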

It is an attractive distraction to get caught up in designing mechanisms for supplying an alternate default ("just let the user declare a no-arg constructor"), but this is focusing on the "writing code" part of the problem, not the "keeping code safe" part of the problem.

In some sense, it is the existence (and size) of Bucket 1 that causes the problem; Bucket 1 is what gives us our sense that it is safe to use uninitialized variables. In the current language, uninitialized reference variables are also safe in that if you use them before they are initialized, you get an exception before anything bad can happen. Uninitialized primitives in today's language are more dangerous, because we may interpret the uninitialized value, but this has been a problem we've been able to live with because today's primitives are pretty limited and zero is usually a good-enough default in most domains. As we extend primitives to look more like objects, with behavior, this gets harder.

Both buckets 2 and 3 can be remediated without help from the language or VM, perhaps inconveniently, by careful coding on the part of the author of the primitive class:


 - don't expose fields to users (a good practice anyway)
 - check for zero on entry to each method

These are options A and E. The difference between Buckets 2 (A) and 3 (E) in this model is what do we do when we find a zero; for bucket 2, we substitute some pre-baked value and use that, and for bucket 3, we throw something (what we throw is a separate discussion.) The various remediation techniques Dan offers represent a menu which allows us to trade off reliability/cost/intrusiveness.

I think we should lean on the model currently implemented by reference types, where _accessing_ an uninitialized field is OK, but _using_ the value in the field is not. If we have:


    String s;

All of the following are fine:

    String t = s;
    if (s == null) { ... }
    if (s == t) { ... }

The thing that is not fine is s-dot-something. These are the E/F/G options, not the H/I options.
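A runnable version of that distinction, using today's reference types only. The field is static here just so it can be left unassigned; the class and method names are made up for illustration.

```java
// Illustration with today's reference types; names are hypothetical.
final class AccessVersusUse {
    static String s; // never assigned, so it holds null

    // Reading, copying and comparing the uninitialized value is harmless.
    static boolean accessIsFine() {
        String t = s;
        return s == null && s == t;
    }

    // Dereferencing it is the point where the failure surfaces.
    static boolean useThrows() {
        try {
            s.length();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(accessIsFine()); // true
        System.out.println(useThrows());    // true
    }
}
```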

Secondarily, H/I, which attempt to hide the default, create another problem down the road: when we get to specialized generics, `T.default` would become partial.

Some of the solutions for Bucket 3 generalize well enough to Bucket 2 that we might consider merging them (though there are still messy details). Option F, for example, injects code at the top of each method body:


    int m() {
        if (this == <zero-value>)
            throw new NullPointerException();
        /* body of m */
    }

A corresponding feature for Bucket 2 might inject slightly different code:


    int m() {
        if (this == <zero-value>)
            return <better-default>.m();
        /* body of m */
    }
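To make the two shapes concrete, here is a hand-written stand-in for what such injected guards could amount to. This is only a sketch under assumptions: ordinary final classes emulate primitive classes, a private isZero() emulates the `this == <zero-value>` test, and every name (GuardedVersion, DefaultingZone, BETTER_DEFAULT) is hypothetical.

```java
// Hand-written stand-in for the injected guards; all names are hypothetical.
final class GuardedVersion {                 // Bucket 3 flavor: throw on zero
    static final GuardedVersion ZERO = new GuardedVersion(null);
    private final String text;               // null in the emulated all-zero instance

    GuardedVersion(String text) { this.text = text; }

    private boolean isZero() { return text == null; }

    int length() {
        if (isZero()) throw new NullPointerException("default GuardedVersion");
        return text.length();                // body of the method
    }

    public static void main(String[] args) {
        System.out.println(new GuardedVersion("11.0.2").length()); // 6
        System.out.println(DefaultingZone.ZERO.id());              // UTC
    }
}

final class DefaultingZone {                 // Bucket 2 flavor: substitute on zero
    static final DefaultingZone ZERO = new DefaultingZone(null);
    static final DefaultingZone BETTER_DEFAULT = new DefaultingZone("UTC");
    private final String id;                 // null in the emulated all-zero instance

    DefaultingZone(String id) { this.id = id; }

    private boolean isZero() { return id == null; }

    String id() {
        if (isZero()) return BETTER_DEFAULT.id();  // delegate instead of throwing
        return id;                                 // body of the method
    }
}
```

The only difference between the two is the action taken once the zero value is detected, which is what makes merging the features plausible.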

Another thing that has evolved since we started this discussion is recognizing the difference between .val and .ref projections. Imagine you could declare your membership in bucket 3:


    __bucket_3 primitive class NGD { ... }

If, in addition to some way of generating an NPE on dereference (F, G, etc), we mucked with the conversion of NGD.val to NGD.ref (which the compiler can inject code on), we could actually put a null on top of the stack. Then, code like:


    if (ngd == null) { ... }

would actually work, because to do the comparison, we'd first promote ngd to a reference type (null is already a reference), and we'd compare two nulls.
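A rough emulation of that conversion in today's Java, purely as a sketch: an ordinary final class stands in for the primitive class, and a hypothetical toRef() helper plays the role of the code the compiler might inject at the .val to .ref conversion. The `id` field and the ZERO constant are assumptions made for the example.

```java
// Sketch only: NGD is emulated by an ordinary class, and toRef() stands in
// for conversion code the compiler could inject. Names are hypothetical.
final class NGD {
    static final NGD ZERO = new NGD(0);    // emulates the default (all-zero) value
    final long id;                         // 0 in the all-zero instance

    NGD(long id) { this.id = id; }

    boolean isZero() { return id == 0; }

    // Emulated NGD.val -> NGD.ref conversion: the zero value becomes null.
    static NGD toRef(NGD val) {
        return val.isZero() ? null : val;
    }

    public static void main(String[] args) {
        NGD ngd = NGD.ZERO;                                  // an "uninitialized" NGD.val
        System.out.println(toRef(ngd) == null);              // true: compares two nulls
        System.out.println(toRef(new NGD(42)) == null);      // false
    }
}
```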




On 7/10/2020 2:23 PM, Dan Smith wrote:

Brian pointed out that my list of candidate inline classes in the Identity Warnings JEP 
(JDK-8249100) includes a number of classes that, despite being "value-based 
classes" and disavowing their identity, might not end up as inline classes. The 
problem? Default values.

This might be a good time to revisit the open design issues surrounding default 
values and see if we can make some progress.

Background/status quo: every inline class has a default instance, which 
provides the initial value of fields and array components that have the inline 
type (e.g., in 'new Point[10]'). It's also the prototype instance used to 
create all other instances (start with 'vdefault', then apply 'withfield' as 
needed). The default value is, by fiat, the class instance produced by setting 
all fields to *their* default values. Often, but not always, this means 
field/array initialization amounts to setting all the bits to 0. Importantly, 
no user code is involved in creating a default instance.

Real code is always useful for grounding design discussions, so let's start 
there. Among the classes I listed as inline class candidates, we can put them 
in three buckets:

Bucket #1: Have a reasonable default, as declared.
- wrapper classes (the primitive zeros)
- Optional & friends (empty)
- From java.time: Instant (start of 1970-01-01), LocalTime (midnight), Duration 
(0s), Period (0d), Year (1 BC, if that's acceptable)

Bucket #2: Could have a reasonable default after re-interpreting fields.
- From java.time: LocalDate, YearMonth, MonthDay, LocalDateTime, ZonedDateTime, 
OffsetTime, OffsetDateTime, ZoneOffset, ZoneRegion, MinguoDate, HijrahDate, 
JapaneseDate, ThaiBuddhistDate (months and days should be nonzero; null 
Strings, ZoneIds, HijrahChronologies, and JapaneseEras require special handling)
- ListN, SetN, MapN (null array interpreted as empty)

Bucket #3: No good default.
- Runtime.Version (need a non-null List<Integer>)
- ProcessHandleImpl (need a valid process ID)
- List12, Set12, Map1 (need a non-null value)
- All ConstantDesc implementations (need real class & method names, etc.)

There's some subjectivity between the 2nd and 3rd buckets, but the idea behind the 2nd is that, with some 
translation layer between physical fields and interpretation of those fields, we can come up with an 
intuitive default (e.g., "0 means January"; "a null String means time zone 'UTC'"). In 
contrast, in the third bucket, any attempt to define a default value is going to be pretty unintuitive 
("A null method name means 'toString'").

The question here is how much work the JVM and language are willing to do, or 
how much work we're willing to ask clients to do, in order to support use cases 
that don't fall into Bucket #1.

I don't think totally excluding Buckets #2 and #3 is a very good outcome. It 
means that, in many cases, inline classes need to be built up exclusively from 
primitives or other inline types, because if you use reference types, your 
default value will have a null field. (Sometimes, as in Optional, null fields 
have straightforward interpretations, but most of the time programs are 
designed to prevent them.)

Whether we support Bucket #2 but not Bucket #3 is a harder question. It 
wouldn't be so bad if none of the examples above in Bucket #3 become inline 
classes—for the most part they're handled via interfaces, anyway. 
(Counterpoint: inline class instances that are immediately typed with interface 
types still potentially provide a performance boost.) But I'm also not sure 
this is representative. We've noted before that many use cases, like database 
records or data structure cursors, don't have meaningful defaults (what's a 
default mailing address?). The ConstantDesc classes really illustrate this, 
even though they happen to not be public.

Another observation is that if we support Bucket #3 but not Bucket #2, that's 
probably not a big deal—I'm not sure anybody really *wants* to deal with the 
default instance; it's just the price you pay for being an inline class. If 
there's a way to opt out of that extra weirdness and move from Bucket #2 to 
Bucket #3, great.

With that discussion in mind, here are some summaries of approaches we've 
considered, or that I think we ought to consider, for supporting buckets #2 and 
#3. (This is as best as I recall. If there's something I've missed, add it to 
the list!)

[Weighing in for myself: my current preference is to do one of F, G, or I. I'm 
not that interested in supporting Bucket #2, for reasons given above, although 
Option A works for programmers who really want it.]



=== Solutions to support Bucket #2 ===

Two broad strategies here: re-interpreting fields (A, B), and re-interpreting 
the default instance (C, D).

---

Option A: Encourage programmers to re-interpret fields

Guidance to programmers: when you declare an inline class, identify any fields 
for which the default instance should hold something other than zero/null; 
define a mapping for your implementation from zero/null to the value you want.

One way to do this is to define a (possibly private) getter for each field, and include 
logic like 'return month + 1' or 'return id == null ? "UTC" : id'. Or maybe you 
inline that logic, as long as you're careful to do so everywhere. Importantly, you also 
need to reverse the logic in your constructor—for the sake of '==', if somebody manually 
creates the default instance, you should set fields to zero/null.
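A minimal sketch of that guidance, assuming an ordinary final class as a stand-in for an inline class. MonthInZone, its fields, and sameAs() (which emulates field-wise '==') are hypothetical names; DEFAULT is built through the constructor only because today's Java has no other way to produce the all-zero instance.

```java
// Sketch of Option A; an ordinary class emulates the inline class.
final class MonthInZone {
    // The constructor maps (1, "UTC") to the zero/null representation.
    static final MonthInZone DEFAULT = new MonthInZone(1, "UTC");

    private final int monthIndex;   // zero-based storage: 0 means January
    private final String zoneId;    // null means "UTC"

    MonthInZone(int month, String zoneId) {
        if (month < 1 || month > 12) throw new IllegalArgumentException("month: " + month);
        // Reverse the mapping so the logical default is stored as zero/null.
        this.monthIndex = month - 1;
        this.zoneId = "UTC".equals(zoneId) ? null : zoneId;
    }

    int month()   { return monthIndex + 1; }
    String zone() { return zoneId == null ? "UTC" : zoneId; }

    // Field-wise comparison, emulating '==' on inline class instances.
    boolean sameAs(MonthInZone other) {
        boolean sameZone = (zoneId == null) ? other.zoneId == null : zoneId.equals(other.zoneId);
        return monthIndex == other.monthIndex && sameZone;
    }

    public static void main(String[] args) {
        System.out.println(DEFAULT.month());                            // 1
        System.out.println(DEFAULT.zone());                             // UTC
        System.out.println(new MonthInZone(1, "UTC").sameAs(DEFAULT));  // true
    }
}
```

Without the reverse mapping in the constructor, a manually built "January, UTC" would store a non-null zone string and compare unequal to the default instance.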

This doesn't work if you want public fields, but that's life as an OO 
programmer.

In this approach, it would be important that inline classes be expected to document their 
default instance in Javadoc (perhaps with a new Javadoc tag)—the interpretation of the 
default instance is less apparent to users than "all zeros".

Limitations:

- It's a fairly error-prone approach. Programmers will absolutely forget to 
apply the mapping in one place, and everything will be fine until somebody 
tries to invoke a particular method on the default instance. Put that bug in a 
security-sensitive context, and maybe you have an exploit. (Something that 
could help some is choosing good names—call your field 'monthIndex', not plain 
'month', to remind yourself that it's zero-based.)

- Performance impact of an extra layer of computation on all field accesses. Probably not 
a big deal in general, but all those null checks, etc., could have a negative impact in 
certain contexts. And the *appearance* of extra cost might scare programmers away from 
doing the right thing ("eh, I probably won't use the default value anyway, I'll just 
ignore it to make my code faster").

---

Option B: Language support for field re-interpretation

The language allows inline classes to declare fields with mappings to/from an 
internal representation. Just like Option A, but with guarantees that the 
internal representation isn't inappropriately accessed directly.

This pulls on a thread we explored a bit for Amber a while back, some form of "abstract 
fields" or "virtual fields". Maybe there's something there, but it seems like a 
general-purpose feature, and one we're not likely to reach a final solution on anytime soon.

---

Option C: Language support for a designated default

The language provides some way for programmers to declare the "logical" default instance 
(something like a special static field). The compiler inserts a test for the "physical" 
default on any field/array access, and replaces it with the logical default.

That is:

Point p = points[3];

compiles to

Point p$0 = points[3];
Point p = (p$0 == [vdefault Point]) ? Point.DEFAULT : p$0;

This is much less bug-prone than Option A—the compiler does all the work—and 
much more achievable in the short/medium term than Option B.
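As a rough, runnable emulation of that translation (a sketch under assumptions, not how a compiler would actually emit it): an ordinary final class stands in for the inline class, PHYSICAL_DEFAULT emulates [vdefault Point], sameAs() emulates '==', and readElement() and newArray() are hypothetical helpers standing in for the inserted read logic and for a flattened array's initial contents.

```java
// Sketch of the read translation in Option C; helper names are hypothetical.
final class Point {
    static final Point PHYSICAL_DEFAULT = new Point(0, 0); // emulates [vdefault Point]
    static final Point DEFAULT = new Point(1, 1);          // the declared "logical" default

    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    boolean sameAs(Point o) { return x == o.x && y == o.y; }

    // Emulates a flattened array, whose elements start as the physical default.
    static Point[] newArray(int n) {
        Point[] a = new Point[n];
        for (int i = 0; i < n; i++) a[i] = PHYSICAL_DEFAULT;
        return a;
    }

    // Stand-in for the test the compiler would insert on every array read.
    static Point readElement(Point[] points, int i) {
        Point p0 = points[i];
        return p0.sameAs(PHYSICAL_DEFAULT) ? DEFAULT : p0;
    }

    public static void main(String[] args) {
        Point[] points = newArray(4);
        points[2] = new Point(5, 6);
        System.out.println(readElement(points, 3).x); // 1
        System.out.println(readElement(points, 2).x); // 5
    }
}
```

Note that a deliberately stored (0, 0) would also read back as DEFAULT in this emulation.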

Compared to Option B, this pushes the computation overhead from inline class 
field accesses to reads of the inline type from fields/arrays. I don't know if 
that's good or bad—maybe a wash, heavily dependent on the use case.

A few big problems:

- The physical default still exists, and malicious bytecode can use it. If 
programmers want strong guarantees, they'll have to check and throw wherever an 
untrusted instance is provided. (Clients with access to the inline class's 
fields have to do so, too.)

- Covariant arrays mean every read from any array type that might be flattened 
(Object[], Runnable[], ConstantDesc[], ...) has to go through translation logic.

- There's an assumption here that the programmer doesn't intend to use the 
physical default as a valid non-default instance. That's hard for the compiler 
to enforce, and weird stuff happens in fields/arrays if the programmer doesn't 
prevent it. (Could be mitigated with extra implicit logic on field/array writes 
or in constructors.)
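The covariant-array problem can be seen with today's Java alone. In this small illustration (class and method names are made up), a method compiled once against Object[] is handed arrays of different runtime component types, so nothing in its static types says whether a given read would need translating.

```java
// Today's Java only; shows that the static type Object[] hides the
// runtime component type of the array being read.
final class CovariantRead {
    // Compiled once, yet callable with any reference array type.
    static Object first(Object[] items) {
        return items[0];
    }

    public static void main(String[] args) {
        Object[] strings = new String[] { "a" };   // legal: arrays are covariant
        Object[] runnables = new Runnable[1];

        System.out.println(first(strings));                           // a
        System.out.println(strings.getClass().getComponentType());    // class java.lang.String
        System.out.println(runnables.getClass().getComponentType());  // interface java.lang.Runnable
    }
}
```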

---

Option D: JVM support for a designated default

The VM allows inline classes to designate a logical default instance, and the 
field/array access instructions map from the physical default to the logical 
default. The 'vdefault' instruction produces the logical default instance; 
something else is used by the class's factories to build from the physical 
default.

This addresses the first two problems with Option C—the VM gives strong 
guarantees, and can make the translation a virtual operation of certain arrays.

To address the third problem, it seems like we'd need the more complex logic I 
hinted at: on writes, map the physical default to the logical default, and map 
the logical default to the physical default. Do the reverse on reads.
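The swap is easier to see in code. In this sketch the VM's work is emulated by an ordinary wrapper around an int[]; plain ints stand in for inline class instances, and SwappingCells with its two constants is entirely hypothetical.

```java
// Sketch of the read/write swap in Option D; all names are hypothetical.
final class SwappingCells {
    static final int PHYSICAL_DEFAULT = 0;   // all-zero bits
    static final int LOGICAL_DEFAULT = 1;    // what the class author declared

    private final int[] storage;             // starts out as all physical defaults

    SwappingCells(int n) { storage = new int[n]; }

    // Exchanges the two defaults and leaves every other value alone.
    private static int swap(int v) {
        if (v == PHYSICAL_DEFAULT) return LOGICAL_DEFAULT;
        if (v == LOGICAL_DEFAULT) return PHYSICAL_DEFAULT;
        return v;
    }

    void write(int i, int v) { storage[i] = swap(v); }
    int read(int i)          { return swap(storage[i]); }

    public static void main(String[] args) {
        SwappingCells cells = new SwappingCells(3);
        System.out.println(cells.read(0));   // 1: untouched slot reads as the logical default
        cells.write(1, PHYSICAL_DEFAULT);
        System.out.println(cells.read(1));   // 0: the physical default survives a round trip
    }
}
```

Because the swap is its own inverse, every value written reads back unchanged, while untouched storage reads as the logical default.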

The problem here is bytecode complexity/slowdowns. We've already added some 
complexity to 'aaload'/'aastore' (covariant flattened arrays), and anticipate 
similar changes to 'putfield'/'getfield' (specialized fields), so maybe that 
means we might as well do more. Or maybe it means we're already over budget. :-)

From the users' perspective, if any performance reduction on reads/writes can 
be limited to the inline classes in Bucket #2, *all* the options have a similar 
cost, whether imposed by the programmer, language, or VM. So, to a first 
approximation, slower opcode execution is fine.



=== Solutions to support Bucket #3 ===

Two broad strategies here: rejecting member accesses on the default instance 
(E, F, G), and preventing programs from ever seeing the default instance (H, I).

---

Option E: Encourage programmers to guard against default instances

Guidance to programmers: if you don't like your class's default instance, check 
for it in your methods and throw. Maybe Java SE defines a new RuntimeException 
to encourage this.

The simple way to do this is with some boilerplate at the start of all your 
methods:

if (this == MyClass.default) throw new InvalidDefaultException();

More permissive classes could just do some validation on the fields that are 
relevant to a particular operation. (E.g., 'getMonth' doesn't care if 'zoneId' 
is null.)
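A sketch of both styles side by side, assuming an ordinary final class as a stand-in for the inline class. ZonedMonth, its members, and the InvalidDefaultException type are hypothetical; DEFAULT_INSTANCE is built through a private constructor only to emulate the all-zero instance.

```java
// Sketch of Option E; an ordinary class emulates the inline class.
final class ZonedMonth {
    static final ZonedMonth DEFAULT_INSTANCE = new ZonedMonth(0, null); // emulated all-zero instance

    private final int monthIndex;   // zero-based, so 0 is a usable value
    private final String zoneId;    // null only in the default instance

    private ZonedMonth(int monthIndex, String zoneId) {
        this.monthIndex = monthIndex;
        this.zoneId = zoneId;
    }

    static ZonedMonth of(int month, String zoneId) {
        if (month < 1 || month > 12 || zoneId == null) throw new IllegalArgumentException();
        return new ZonedMonth(month - 1, zoneId);
    }

    private boolean isDefault() { return monthIndex == 0 && zoneId == null; }

    // Strict guard: rejects the default instance outright.
    String describe() {
        if (isDefault()) throw new InvalidDefaultException("default ZonedMonth");
        return getMonth() + "@" + zoneId;
    }

    // Permissive: only the month matters here, so a null zoneId is tolerated.
    int getMonth() { return monthIndex + 1; }

    // Validates just the field this operation needs.
    String getZone() {
        if (zoneId == null) throw new InvalidDefaultException("no zone");
        return zoneId;
    }
}

// Hypothetical exception type; the platform does not define one.
final class InvalidDefaultException extends RuntimeException {
    InvalidDefaultException(String message) { super(message); }
}
```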

This doesn't work if you want public fields, but that's life as an OO 
programmer.

It's not ideal that an invalid instance can float around a program until 
somebody trips on one of these checks, rather than detecting the invalid value 
earlier—we're propagating the NPE problem. And it takes some getting used to 
that there are two null-like values in the reference type's domain.

---

Option F: Language support for default instance guards

An inline class declaration can indicate that the default instance is invalid. 
The compiler generates guards, as in Option E, at the start of all instance 
method bodies, and perhaps on all field accesses outside of those methods.

Programmers give up finer-grained control, but get more safety. I'm sure most 
would be happy with that trade.

Improper/separately-compiled bytecode can skip the field access checks, but 
that's a minor concern.

Same issues as Option E regarding adding a "new NPE" to the platform.

---

Option G: JVM support for default instance guards

Inline class files can indicate that their default instance is invalid. All 
attempts to operate on that instance (via field/method accesses, other than 
'withfield') result in an exception.

This tightens up Option F, making it just as impossible to access members of 
the default instance as it is to access members of 'null'.

Same issues as Option E regarding adding a "new NPE" to the platform.

---

Option H: Language checks on field/array reads

An inline class declaration can indicate that the default instance is invalid. Every 
field and array access that may involve an uninitialized field/array component of that 
inline type gets augmented with a check that rejects reads of the default value (treating 
it as "you forgot to initialize this variable").

That is:

Point p = points[3];

compiles to

Point p$0 = points[3];
if (p$0 == [vdefault Point]) throw new UninitializedVariableException();
Point p = p$0;

This is much like Option C, and has roughly the same advantages/problems. 
There's not a strong guarantee that the default value won't pop up from 
untrusted bytecode (or unreliable inline class authors), and lots of array 
types need guards.

---

Option I: JVM checks on field/array reads

Inline class files can indicate that their default instance is invalid. When reading from 
a field/array component of the inline type ('getfield'/'getstatic'/'aaload'), an 
exception is thrown if the default value is found (treating it as "you forgot to 
initialize this variable"). The 'vdefault' instruction, like 'withfield', is illegal 
outside of the inline class's nest.

Better than Option H in that it can be optimized to occur on only certain reads, and in 
that it provides strong guarantees—only the inline class can ever "see" the 
default instance.

Well, unless the inline class chooses to share that instance with the world. Not sure how 
we prevent that. But maybe at that point, anything bad/weird that happens is the author's 
own fault. (E.g., putting the default value in an array will make that component 
effectively "uninitialized" again.)

Like Option D, there's a question of whether we're willing to add this 
complexity to the 'getfield'/'getstatic'/'aaload' instructions. My sense is 
that at least it's less complexity than you have in Option D.
