On 2/8/18, 10:59 AM, joe darcy wrote:
Hello,

On 2/8/2018 3:53 AM, Alan Bateman wrote:
On 07/02/2018 22:12, joe darcy wrote:
Hello,

Text in java.lang.Character states that a UTF-16 character encoding is used for java.lang.String. While this was true for many years, it is not necessarily true, and not true in practice, as of JDK 9 due to the improvements from JEP 254: Compact Strings.

The statement about the encoding should be corrected.

Please review the patch below which does this. (I've formatted the patch so that the change in text is made clear; I'll re-flow the paragraph before pushing.)
I'm not sure that this is worth changing. You could replace "classes" with "API" and add a note to say that an implementation may use a more optimized representation, but I don't think it's really needed.


In response to this feedback and others, how about:

     [...] The Java
  * platform uses the UTF-16 representation in {@code char} arrays and
- * in the {@code String} and {@code StringBuffer} classes. In
+ * presents a UTF-16 model in the string-related API.

IMO anyway, I think saying "uses a UTF-16 representation for String" is at best misleading with the current implementation, since 8 != 16 and the compressed representation is used for all Latin-1 strings.


Well, encoding/charset is the concept of a mapping between a character and a corresponding code point value. We are still using the UTF-16 encoding scheme to represent a character in the JVM. How that UTF-16 code point value is represented/stored in the String class is an implementation detail: 16 bits for "char", and in the String class 1 byte per char for "latin1" strings (still in the Unicode charset) and 2 bytes per char for the rest.
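
A minimal sketch of the distinction discussed here, assuming a JDK 9 or later runtime. The class name Utf16ModelDemo and its helper methods are hypothetical and are not part of the patch or the JDK. It only shows that the public String API reports UTF-16 code units whether or not the characters would fit in one byte each. The internal storage is not observable through these calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the patch under review.
final class Utf16ModelDemo {

    // Number of UTF-16 code units that the String API reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when the string is explicitly encoded as UTF-16 (big endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin1 = "caf\u00e9";        // all chars are below U+0100
        String emoji = "\uD83D\uDE00";      // U+1F600, one surrogate pair

        // The API counts chars (UTF-16 code units), independent of storage.
        System.out.println(codeUnits(latin1));   // 4
        System.out.println(utf16Bytes(latin1));  // 8

        // A supplementary character is two chars but one code point.
        System.out.println(codeUnits(emoji));    // 2
        System.out.println(codePoints(emoji));   // 1
        System.out.println(utf16Bytes(emoji));   // 4
    }
}
```

Whether the Latin-1 string above is held internally as 4 bytes or 8 depends on the implementation and on whether compact strings are enabled. Nothing in the output changes either way, which is the sense in which the API "presents a UTF-16 model".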

As I said in my previous email, the mention of 8859-1 in the JEP might be the cause of the confusion. At an early stage of the project we were really experimenting with different "encodings", including UTF-8. But the project ended up staying with UTF-16, with a "customized/compressed" storage mechanism to store the UTF-16 code point value.

-Sherman
