Lately I've been thinking about string representation. The world turned out not to be UCS-2 or UTF-16, after all, and we often have to deal with strings that are generally encoded as ASCII or UTF-8, but are not always encoded this way (and there might not even be a charset declaration, see the ELF spec).
(a) byte[] with defensive copies. Internal storage is byte[], and a copy is made before returning it to the caller. Quite common across the JDK.

(b) byte[] without defensive copies. Internal storage is byte[], and a reference is returned. In the past, this could be a security bug, and usually, it was adjusted to (a) when noticed. Without security requirements, this can be quite efficient, but there is ample potential for API misuse.

(c) java.lang.String with ISO-8859-1 decoding/encoding. Sometimes done by reconfiguring the entire JVM to run with ISO-8859-1, usually so that it is possible to process malformed UTF-8. The advantage is that there is rich API support, including regular expressions, and good optimization. There is also language support for string literals.

(d) java.lang.String with UTF-8 decoding/encoding and replacement. This seems to be very common, but it is not completely accurate and can lead to subtle bugs (or completely unprocessable data). Otherwise it has the same advantages as (c).

(e) Various variants of ByteBuffer. I have not seen this much in practice (outside binary file format parsers). In the past, it needed deep defensive copies on input for security (because there isn't an immutably backed ByteBuffer), and shallow copies for access. The ByteBuffer objects themselves are also quite heavy when they can't be optimized away. For that reason, it is probably most useful on interfaces, and not for storage.

(f) Custom, immutable ByteString class. Quite common, but it has cross-library interoperability issues, and a full complement of support (matching java.lang.String) is quite hard to provide.

(g) Something based on VarHandle. I haven't seen this yet. Probably not useful for storage.

Anything that I have missed?

Considering these choices, what is the expected direction on the JDK side for new code? Option (d) for things generally ASCII/UTF-8, and (b) for things of a more binary nature? What to do if the choice is difficult?
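
To make the difference between (c) and (d) concrete, here is a minimal sketch of what happens to a byte sequence that is not valid UTF-8 when it is round-tripped through java.lang.String. The class and method names (RoundTripDemo, roundTripLatin1, roundTripUtf8) are made up for illustration. The only JDK behavior it relies on is that the String(byte[], Charset) constructor replaces malformed input instead of throwing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any JDK API.
final class RoundTripDemo {

    // Option (c): ISO-8859-1 maps each byte value 0x00..0xFF to exactly one
    // char, so decoding and re-encoding returns the original bytes.
    static byte[] roundTripLatin1(byte[] input) {
        String s = new String(input, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Option (d): the String constructor replaces malformed UTF-8 with
    // U+FFFD, so the original bytes cannot be recovered afterwards.
    static byte[] roundTripUtf8(byte[] input) {
        String s = new String(input, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never occur in well-formed UTF-8.
        byte[] malformed = { 'a', (byte) 0xFF, 'b' };

        System.out.println("ISO-8859-1 round trip preserved bytes: "
                + Arrays.equals(malformed, roundTripLatin1(malformed)));
        System.out.println("UTF-8 round trip preserved bytes: "
                + Arrays.equals(malformed, roundTripUtf8(malformed)));
    }
}
```

With this input, the ISO-8859-1 round trip gives back the same three bytes, while the UTF-8 round trip yields five bytes, because the 0xFF byte comes back as the three-byte encoding of U+FFFD. That is the kind of silent data change I mean by "not completely accurate" in (d).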