Addressing the full range of use cases

Dan Smith Mon, 04 Oct 2021 16:35:01 -0700

When we talk about use cases for Valhalla, we've often considered a very broad 
set of class abstractions that represent immutable, identity-free data. JEP 401 
mentions varieties of integers and floats, points, dates and times, tuples, 
records, subarrays, cursors, etc. However, as shorthand this broad set often 
gets reduced to an example like Point or Int128, and these latter examples are 
not necessarily representative of all candidate value types.


Specifically, our favorite example classes have a property that doesn't 
generalize: they'll happily accept any combination of field values as a valid 
instance. (In fact, they're even happy to accept any combination of *bits* of 
the appropriate length.) Many candidate primitive classes don't have this 
property—the constructors do important validation work, and only certain 
combinations of fields are allowed to represent valid instances.
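
To make the contrast concrete, here is a minimal sketch using present-day Java records as stand-ins. This is not Valhalla syntax, and the names Point, Range, and ValidationDemo are hypothetical, chosen only for illustration. Point accepts any pair of ints, while Range relies on its constructor to reject some field combinations.

```java
// Present-day Java records used as stand-ins; not Valhalla syntax.
// All names here are hypothetical and exist only for illustration.

// "Works like an int": every combination of field values is a valid instance.
record Point(int x, int y) {}

// Constructor does validation work: only some combinations are valid.
record Range(int lo, int hi) {
    Range {
        if (lo > hi) {
            throw new IllegalArgumentException("lo must not exceed hi: " + lo + " > " + hi);
        }
    }
}

class ValidationDemo {
    // Returns true if the Range constructor accepts the given fields.
    static boolean accepts(int lo, int hi) {
        try {
            new Range(lo, hi);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Point(-7, 42).x()); // any ints are fine: prints -7
        System.out.println(accepts(1, 5));         // prints true
        System.out.println(accepts(5, 1));         // prints false
    }
}
```

Any mechanism that produces a Range without running this constructor (a default value, or a partially written heap slot) can bypass the lo <= hi invariant.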

Related areas of concern that we've had on the radar for a while:

- The "all zeros is your default value" strategy forces an all-zero instance 
into the class's value set, even if that doesn't make sense for the class. Many 
candidate classes have no reasonable default at all, leading naturally to a wish 
for "null is your default value" (or other, more exotic, strategies involving 
revisiting the idea that every type has a default value). We've provided 
'P.ref' for those use sites that *need* null, but haven't provided a complete 
story for value types that want it to be *their* default value, too.

- Non-atomic heap updates can be used to create new instances that arbitrarily 
combine previously-validated instances' fields. There is no guarantee that the 
new combination of fields is semantically valid. Again, while there's precedent 
for this with 'double' and 'long' (JLS 17.7), those are special cases that 
don't generalize: any combination of the two 32-bit halves is *still a valid 
double*. (This is usually described as "tearing", although JLS 17.6 has 
something else in mind when it uses that word...) The language provides 
'volatile' as a use-site opt-in to atomicity, and we've toyed with a 
declaration-site opt-in as well. But object integrity being "off" by default 
may not be ideal.

- Existing class types like LocalDate are both nullable and atomic. These are 
useful properties to preserve during migration; nullability, in particular, is 
essential for source compatibility. We've provided reference-default 
declarations as a mechanism to make reference types (which have these 
properties) the default, with 'P.val' as an opt-in to value types. But in doing 
so we take away the many benefits of value types by default, and force new code 
to work with the "bad name".
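
The first two concerns above can be sketched in present-day Java. This is only a model under stated assumptions: Interval and FlatIntervalSlot are hypothetical names, and the two-field slot merely imitates what flattened, non-atomically updated storage might look like. It is not how any VM is known to lay out values. The LocalDate check uses the real java.time API to show that an all-zero year, month, and day is not a valid date.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

// Hypothetical stand-in for a validated value class; not Valhalla syntax.
record Interval(long lo, long hi) {
    Interval {
        if (lo > hi) throw new IllegalArgumentException("lo > hi");
    }
}

// Hypothetical model of flattened heap storage for an Interval: two separate
// fields written one at a time, as a non-atomic update might do.
class FlatIntervalSlot {
    long lo, hi;
    void writeLo(Interval v) { lo = v.lo(); }
    void writeHi(Interval v) { hi = v.hi(); }
    boolean holdsValidInterval() { return lo <= hi; }
}

class IntegrityDemo {
    // An all-zero LocalDate would mean year 0, month 0, day 0; month 0 is rejected.
    static boolean zeroDateIsValid() {
        try {
            LocalDate.of(0, 0, 0);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    // Deterministically replays one bad interleaving: the writer has stored
    // only the first field of the new value when a reader looks at the slot.
    static FlatIntervalSlot tornSlot() {
        Interval a = new Interval(1, 2);
        Interval b = new Interval(10, 20);
        FlatIntervalSlot slot = new FlatIntervalSlot();
        slot.writeLo(a);
        slot.writeHi(a);   // slot holds (1, 2)
        slot.writeLo(b);   // writer is halfway through storing (10, 20)
        return slot;       // reader observes (10, 2)
    }

    public static void main(String[] args) {
        System.out.println(zeroDateIsValid());               // prints false
        System.out.println(tornSlot().holdsValidInterval()); // prints false
    }
}
```

Both a and b passed validation, yet the observed combination (10, 2) never went through the constructor and violates its invariant.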

While we can provide enough knobs to accommodate all of these special cases, 
we're left with a complex user model which asks class authors to make n 
different choices whose consequences they may not immediately grasp, and class 
users to keep 2^n different categories straight in their heads.

As an alternative, we've been exploring whether a simpler model is workable. It 
is becoming clear that there are (at least) two clusters of uses for value 
types.  The "classic" value types are like numerics -- they'll happily accept 
any combination of field values as a valid instance, and the zero value is a 
sensible (often the best possible) default value.  They make relatively little 
use of encapsulation.  These are the ones that best "work like an int."  The 
"encapsulated" value types are those that are more like typical aggregates 
("codes like a class") -- their constructors do important validation work, and 
only certain combinations of fields are allowed to represent valid instances.  
These are more likely to not have valid zero values (and hence want to be 
nullable).  

Some questions to consider for this approach:

- How do we group features into clusters so that they meet the sweet spot of 
user expectations and use cases while minimizing complexity? Is two clusters 
the right number? Is two already too many? (And what do we call them? What 
keywords best convey the intended intuitions?)

- If there are knobs within the clusters, what are the right defaults? E.g., 
should atomicity be opt-in or opt-out?

- What are the performance costs (or, in the other direction, performance 
gains) associated with each feature? For certain feature combinations, have we 
canceled out the performance gains over identity classes? And at that point, is 
that combination even worth supporting?
