Hi Mark,
On 8/20/20 3:10 PM, mark.reinh...@oracle.com wrote:
2020/8/19 14:14:56 -0700, roger.ri...@oracle.com:
Please review a java.util.Hex API to encode and decode hexadecimal
strings to and from byte arrays.
JavaDoc:
http://cr.openjdk.java.net/~rriggs/hex-javadoc/java.base/java/util/Hex.html
This mostly looks good -- it’ll be nice to have a standard utility for
this sort of thing.
A few comments:
- Why do the short-form `encoder` factory methods return encoders that
produce upper-case hex strings? `Integer::toHexString` and other
existing `toHex` methods return lower-case hex strings. That’s also
what you get from common Unix CLI tools (e.g., `od -tx1`).
Please consider changing these methods to return lower-case encoders.
It's (almost) a toss up and easy to change; many of the existing uses
produce uppercase.
Perhaps the short form no-arg should be replaced with a short form
constructor that
takes true/false, so it is explicit at the use site or put the case in
the name.
encodeToUpper(), encodeToLower().
(A boolean parameter is not very informative, a enum would be better but
perhaps a bit heavyweight).
- Is it worth having static `Hex.encode(byte[])` and
`Hex.decode(CharSequence)` convenience methods for the simplest
cases?
There was some discussion of that but it idea was to minimize the
surface area.
- [Warning: Bikeshed] The verbs “encode” and “decode” seem unfortunate.
Over in `java.nio.charsets` we have encoders that transform
characters to bytes, and decoders that transform bytes to characters.
The coded thing is the bytes; the uncoded thing is the characters.
In `java.util` we already have the `Base64` class, which provides
encoders that transform bytes to characters, and decoders that
transform characters to bytes. The coded thing is the characters;
the uncoded thing is the bytes.
The use of “encode” and “decode” in `Base64` was likely inspired by
the fact that the format has been known as “base 64 encoding” for
decades, having originated as a hack for transporting non-ASCII data
via SMTP. Developers looking to do base-64 operations will, thus,
expect this terminology.
The bias of encoding vs decoding terminology is subtle, based the 'native'
form of the data. For the Charset classes, the native form of the data
is characters,
and the encoded form is bytes. For Base64, the native form of the data
is bytes,
and the encoded form is Base64 lines.
Here you’re proposing that the `Hex` class follow the `Base64` class.
Consistency with existing nearby APIs is a worthy goal. If I were
just looking to convert a byte array into a readable hex string,
however, I’d probably want to “format” it rather than “encode” it,
something like `String.format("%x")` on steroids. Likewise, if I
were looking to convert a hex string into bytes then I’d want to
“parse” it rather than “decode” it, i.e., `Integer::parseInt` on
steroids.
If you were to rename the nested classes to `Hex.Formatter` and
`Hex.Parser`, and rename all methods accordingly, then this API would
be inconsistent with the nearby `Base64` but likely more consistent
with developer expectations.
(`Hex` is already inconsistent with `Base64` in that it doesn’t
prefix the names of its factory methods with `get`, which is a good
thing.)
The question is at what level of control does "encoding" become formatting.
There are very few formatting features in the API, no control of leading
zeros, no control
over indicental whitespace, and no control over width. Similarly with
parsing,
what flexibility does parsing imply that decoding does not, is
whitespace ignored, line endings/joins, etc.
Thanks, Roger
- Mark