Lately I've been thinking about string representation.  The world
turned out not to be UCS-2 or UTF-16, after all, and we often have to
deal with strings that are generally encoded as ASCII or UTF-8, but
are not always encoded this way (and there might not even be a charset
declaration, see the ELF spec).

The representations I have seen so far:

(a) byte[] with defensive copies.
    Internal storage is byte[], and a copy is made before returning
    it to the caller.  Quite common across the JDK.

(b) byte[] without defensive copies.
    Internal storage is byte[], and a reference is returned.  In the
    past, this could be a security bug, and usually, it was adjusted
    to (a) when noticed.  Without security requirements, this can be
    quite efficient, but there is ample potential for API misuse.

(c) java.lang.String with ISO-8859-1 decoding/encoding.
    Sometimes done by reconfiguring the entire JVM to run with
    ISO-8859-1, usually so that it is possible to process malformed
    UTF-8.  The advantage is that there is rich API support, including
    regular expressions, and good optimization.  There is also
    language support for string literals.

(d) java.lang.String with UTF-8 decoding/encoding and replacement.
    This seems to be very common, but it is lossy (malformed input is
    replaced with U+FFFD), which can lead to subtle bugs (or data
    that cannot be processed at all).  Otherwise has the same
    advantages as (c).

(e) Various variants of ByteBuffer.
    Have not seen this much in practice (outside binary file format
    parsers).  In the past, it needed deep defensive copies on input
    for security (because there isn't an immutably backed ByteBuffer),
    and shallow copies for access.  The ByteBuffer objects themselves
    are also quite heavy when they can't be optimized away.  For that
    reason, probably most useful on interfaces, and not for storage.

(f) Custom, immutable ByteString class.
    Quite common, but has cross-library interoperability issues,
    and providing a full complement of support (matching
    java.lang.String) is quite hard.

(g) Something based on VarHandle.
    Haven't seen this yet.  Probably not useful for storage.
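
To make the difference between (c) and (d) concrete, here is a minimal
sketch of a byte[] -> String -> byte[] round trip with both charsets.
The class and method names (RoundTrip, viaLatin1, viaUtf8) are made up
for illustration; only the java.lang.String and StandardCharsets calls
are real JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTrip {
    // Option (c): every byte maps to exactly one char (U+0000..U+00FF),
    // so decoding and re-encoding returns the original bytes.
    static byte[] viaLatin1(byte[] raw) {
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Option (d): malformed input is replaced with U+FFFD while
    // decoding, so the original bytes cannot be recovered.
    static byte[] viaUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] raw = { 'a', (byte) 0xFF, 'b' };
        System.out.println(Arrays.equals(raw, viaLatin1(raw)));
        System.out.println(Arrays.equals(raw, viaUtf8(raw)));
    }
}
```

With the malformed input above, the ISO-8859-1 round trip preserves
the bytes, while the UTF-8 round trip turns the single 0xFF byte into
the three-byte encoding of U+FFFD.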

Anything that I have missed?

Considering these choices, what is the expected direction on the JDK
side for new code?  Option (d) for things generally ASCII/UTF-8, and
(b) for things of a more binary nature?  What to do if the choice is
difficult?
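
For reference, the difference between (a) and (b) comes down to a
single clone() call on the accessor.  A minimal sketch, assuming a
hypothetical SectionName holder class (not an existing JDK type):

```java
final class SectionName {
    private final byte[] name;

    SectionName(byte[] name) {
        // Copy on input so later changes by the caller are not visible.
        this.name = name.clone();
    }

    // Option (a): the caller gets a private copy and cannot affect
    // the internal state.
    byte[] nameCopy() {
        return name.clone();
    }

    // Option (b): no allocation, but the caller can now modify the
    // internal array.
    byte[] nameShared() {
        return name;
    }

    public static void main(String[] args) {
        SectionName n = new SectionName(new byte[] { 'a', 'b' });
        n.nameCopy()[0] = 'x';    // no effect on n
        n.nameShared()[0] = 'y';  // changes n's internal storage
        System.out.println((char) n.nameCopy()[0]);
    }
}
```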
