Re: RFR: 8260265: UTF-8 by Default

Giacomo Baso Wed, 14 Jul 2021 05:47:04 -0700

On Thu, 8 Jul 2021 21:23:00 GMT, Naoto Sato <na...@openjdk.org> wrote:


> This is an implementation for the `JEP 400: UTF-8 by Default`. The gist of 
> the changes is `Charset.defaultCharset()` returning `UTF-8` and 
> `file.encoding` system property being added in the spec, but another notable 
> modification is in `java.io.PrintStream` where it continues to use the 
> `Console` encoding as the default charset instead of `UTF-8`. Other changes 
> are mostly clarification of the term "default charset" and their links. 
> Corresponding CSR has also been drafted.
> 
> JEP 400: https://bugs.openjdk.java.net/browse/JDK-8187041
> CSR: https://bugs.openjdk.java.net/browse/JDK-8260266

> Consider an application that creates a java.io.FileWriter with its 
> one-argument constructor and then uses it to write some text to a file. The 
> resulting file will contain a sequence of bytes encoded using the default 
> charset of the JDK running the application. A second application, run on a 
> different machine or by a different user on the same machine, creates a 
> java.io.FileReader with its one-argument constructor and uses it to read the 
> bytes in that file. The resulting text contains a sequence of characters 
> decoded using the default charset of the JDK running the second application. 
> If the default charset differs between the JDK of the first application and 
> the JDK of the second application, then the resulting text may be silently 
> corrupted or incomplete, since these APIs replace erroneous input rather than 
> fail.
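
To make the quoted scenario concrete, here is a minimal sketch of it. The class and method names (`CharsetMismatchDemo`, `roundTrip`) are made up for illustration, and the two charsets are passed explicitly to stand in for the differing defaults of two JDKs. It uses `OutputStreamWriter` and `InputStreamReader`, which, like the one-argument `FileWriter` and `FileReader` constructors, replace erroneous input instead of failing.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatchDemo {

    // Hypothetical helper: writes text to a temp file with one charset and
    // reads it back with another, simulating two JDKs with different defaults.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (Writer writer = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), writeCharset)) {
                    writer.write(text);
                }
                StringBuilder result = new StringBuilder();
                try (Reader reader = new InputStreamReader(
                        new FileInputStream(file.toFile()), readCharset)) {
                    int c;
                    while ((c = reader.read()) != -1) {
                        result.append((char) c);
                    }
                }
                return result.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escape so the result does not depend on the source encoding.
        String original = "caf\u00e9";
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(same));   // true
        System.out.println(original.equals(mixed));  // false, no exception is thrown
    }
}
```

In the mismatched case the two UTF-8 bytes of `é` are decoded as two separate ISO-8859-1 characters, so the text comes back changed without any error being reported.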

It's even worse than that, because many OpenSSH installations are configured by 
default to [forward](https://man.openbsd.org/ssh_config.5#SendEnv) and 
[accept](https://man.openbsd.org/sshd_config.5#AcceptEnv) the user's locale 
(see, for example, [RHEL 7](https://access.redhat.com/solutions/974273)).

So a single application on a single remote machine can unknowingly be started 
by a single user with different locales, and therefore different encodings, 
depending on how the user connected to that machine. For example, connecting 
from Windows via PowerShell results in `LANG=en_US.UTF-8`, while connecting 
from WSL2 results in `LANG=C.UTF-8`. With Java 11 on a RHEL 7 machine, 
`file.encoding` is `UTF-8` in the first case but `ANSI_X3.4-1968` in the 
second, which leads to a default charset of `US-ASCII`.
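
A quick way to check what a given session ends up with is a small probe like the sketch below. The class and method names (`ShowDefaultCharset`, `describe`) are hypothetical; the calls themselves are standard JDK APIs. The values it reports depend entirely on the JDK version and the environment it is started in.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {

    // Hypothetical helper: collects the locale-related settings discussed above.
    static String describe() {
        return "LANG=" + System.getenv("LANG") + "\n"
                + "file.encoding=" + System.getProperty("file.encoding") + "\n"
                + "defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once per connection method (for instance, once from each SSH client) shows whether the forwarded locale changes the default charset on the remote side.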

It is also worth mentioning that `Charset.forName("default")` does not return 
the default charset: `default` is just an alias for `US-ASCII`, per 
`sun.nio.cs.StandardCharsets$Aliases`.
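
The difference can be checked with a sketch like the following. The names `DefaultAliasCheck` and `resolve` are made up for illustration, and since the alias is an implementation detail of the JDK, the sketch assumes it may not exist on every release and handles that case explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

public class DefaultAliasCheck {

    // Hypothetical helper: resolves a charset name or alias to its canonical
    // name, or reports that the running JDK does not know it.
    static String resolve(String name) {
        try {
            return Charset.forName(name).name();
        } catch (UnsupportedCharsetException e) {
            return "unsupported";
        }
    }

    public static void main(String[] args) {
        // The alias lookup and the actual default are independent of each other.
        System.out.println("forName(\"default\") -> " + resolve("default"));
        System.out.println("defaultCharset()   -> " + Charset.defaultCharset().name());
    }
}
```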

-------------

PR: https://git.openjdk.java.net/jdk/pull/4733
