Re: RFR: 8195129: System.load() fails to load from unicode paths [v3]

Naoto Sato Fri, 04 Jun 2021 10:38:07 -0700

On Fri, 4 Jun 2021 14:00:25 GMT, Maxim Kartashev 
<github.com+28651297+mkartas...@openjdk.org> wrote:


>> I'm not an expert, but my understanding is that the VM only deals with
>> modified UTF-8, as does JNI. So the incoming string should be modified
>> UTF-8, IMO, and then converted to UTF-16.
>> 
>> That said, this is shared code being modified on the JDK side so you can't 
>> just change the type of string being passed in without updating all the 
>> implementations of os::dll_load to support that!
>
> I think we need to establish some common ground before proceeding further 
> with this fix. It's a bit of a long read; please, bear with me.
> 
> The path name starts its life as a `jstring` in 
> `Java_jdk_internal_loader_NativeLibraries_load()`; its encoding is irrelevant 
> at this point.
> 
> Next, the name has to be passed down to `JVM_LoadLibrary()` that takes 
> `char*`. So we need to convert from `jstring` to `char*` (point (a)). 
> Following that, `os::dll_load()` that actually performs loading in a 
> platform-specific manner also receives `char*`. All platform implementations 
> of `os::dll_load()` pass the path name down to their respective platform's 
> APIs unmodified, but I think that's just incidental and here we have another 
> possible point of conversion (point (b)). Other consumers of the path name 
> are exception(c) and logging(d) messages; they also take `char*`, but 
> potentially of a different encoding.
> 
> Let me try to enumerate all conceivably valid conversions for 
> `JVM_LoadLibrary()` consumption (point (a)):
> 1. jstring -> platform-specific encoding (the status quo: a possibly lossy 
> encoding on Windows and UTF-8 elsewhere, AFAICT),
> 2. jstring -> modified UTF-8,
> 3. jstring -> UTF-8.
> 
> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs 
> because conversion (1) may lose information on Windows if the platform 
> encoding is not UTF-8 (which is often, or even always, the case). So 
> that's a no-go and we are left with either (2) or (3).
> 
> On macOS and Linux, the "platform" encoding is already UTF-8, and since all 
> the platform APIs happily consume UTF-8, no further conversion is necessary 
> (neither for actual library loading, nor for log or exception messages; the 
> latter have to convert to UTF-16, but do that under the hood).
> 
> On Windows, we require at least these variants of the path name:
> 1. UTF-16 for library loading (Unicode Windows API),
> 2. "platform" encoding for logging (yes, losing information here, but that's 
> tolerable),
> 3. "platform" (lossy) or UTF-8 (lossless) encoding for exception messages 
> (prefer lossless).
> 
> This is what's behind my choice of UTF-8 for the path name encoding as it 
> gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of 
> course, in which case all platforms - not just Windows - will have to do the 
> conversion on their own, losing the benefit of the knowledge about the 
> original string encoding (the String.coder field of jstring).

I am hesitant to change the JVM interface from modified UTF-8 to standard 
UTF-8, as it would be the only location in the JNI/JVM interface that uses 
standard UTF-8. Instead, I would implement `convert_UTF8_to_UTF16`, or 
rather `convert_mUTF8_to_UTF16`, with fairly simple arithmetic.
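
To illustrate why the arithmetic is simple: modified UTF-8 encodes U+0000 as 
the two bytes 0xC0 0x80 and encodes supplementary characters as two 3-byte 
surrogates, so every 1-, 2- or 3-byte sequence maps to exactly one UTF-16 code 
unit. Below is a minimal sketch of such a decoder. The function name, the 
signature and the use of `std::u16string` are assumptions made for this 
illustration; it is not the code in the PR, and it assumes well-formed, 
NUL-terminated input.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper for illustration only (not the HotSpot code).
// Decodes a NUL-terminated modified UTF-8 string into UTF-16 code units.
// Decoding stops at the first malformed or truncated sequence.
std::u16string convert_mutf8_to_utf16(const char* s) {
    assert(s != nullptr);
    std::u16string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p != 0) {
        const unsigned char b0 = *p;
        if (b0 < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char16_t>(b0));
            p += 1;
        } else if ((b0 & 0xE0) == 0xC0) {
            // 2 bytes: 110xxxxx 10xxxxxx (includes 0xC0 0x80 for U+0000)
            if (p[1] == 0) break;  // truncated input
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) |
                                                (p[1] & 0x3F)));
            p += 2;
        } else if ((b0 & 0xF0) == 0xE0) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (includes surrogates)
            if (p[1] == 0 || p[2] == 0) break;  // truncated input
            out.push_back(static_cast<char16_t>(((b0 & 0x0F) << 12) |
                                                ((p[1] & 0x3F) << 6) |
                                                (p[2] & 0x3F)));
            p += 3;
        } else {
            // Stray continuation byte or a 4-byte lead byte: neither is
            // valid at this position in modified UTF-8.
            break;
        }
    }
    return out;
}
```

On Windows, the resulting code units could then be handed to a wide-character 
API; a production version would also need to report malformed input rather 
than silently truncating.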

-------------

PR: https://git.openjdk.java.net/jdk/pull/4169
