Re: RFR: 8195129: System.load() fails to load from unicode paths [v3]

Maxim Kartashev Mon, 07 Jun 2021 04:05:43 -0700

On Sun, 6 Jun 2021 22:25:44 GMT, David Holmes wrote:


>> I think we need to establish some common ground before proceeding further 
>> with this fix. It's a bit of a long read; please, bear with me.
>> 
>> The path name starts its life as a `jstring` in 
>> `Java_jdk_internal_loader_NativeLibraries_load()`, its encoding is 
>> irrelevant at this point.
>> 
>> Next, the name has to be passed down to `JVM_LoadLibrary()` that takes 
>> `char*`. So we need to convert from `jstring` to `char*` (point (a)).
>> Following that, `os::dll_load()` that actually performs loading in a 
>> platform-specific manner also receives `char*`. All platform implementations 
>> of `os::dll_load()` pass the path name down to their respective platform's 
>> APIs unmodified, but I think that's just incidental and here we have another 
>> possible point of conversion (point (b)). Other consumers of the path name 
>> are exception (c) and logging (d) messages; they also take `char*`, but
>> potentially of a different encoding.
>> 
>> Let me try to enumerate all conceivably valid conversions for 
>> `JVM_LoadLibrary()` consumption (point (a)):
>> 1. jstring -> platform-specific encoding (the status quo: possibly lossy
>> on Windows and UTF-8 elsewhere, AFAICT),
>> 2. jstring -> modified UTF-8,
>> 3. jstring -> UTF-8.
>> 
>> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs
>> because conversion (1) may lose information on Windows if the platform
>> encoding is not UTF-8 (which is often, or even always, the case). So
>> that's a no-go and we are left with either (2) or (3).
>> 
>> On macOS and Linux, the "platform" encoding is already UTF-8, and since all the
>> platform APIs happily consume UTF-8, no further conversion is necessary 
>> (neither for actual library loading, nor for log or exception messages; the 
>> latter have to convert to UTF-16, but do that under the hood).
>> 
>> On Windows, we require at least these variants of the path name:
>> 1. UTF-16 for library loading (Unicode Windows API),
>> 2. "platform" encoding for logging (yes, losing information here, but
>> that's tolerable),
>> 3. "platform" (lossy) or UTF-8 (lossless) encoding for exception messages
>> (prefer lossless).
>> 
>> This is what's behind my choice of UTF-8 for the path name encoding as it 
>> gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of 
>> course, in which case all platforms - not just Windows - will have to do the 
>> conversion on their own, losing the benefit of the knowledge about the
>> original string encoding (the String.coder field of jstring).
>
> @mkartashev thank you for the detailed explanation.
> 
> It is not clear to me that the JDK's conformance to being a Unicode 
> application has significantly changed since the evaluation of JDK-8017274 - 
> @naotoj can you comment on that and related discussion from the CCC for
> JDK-4958170? In particular I'm not sure that using the platform encoding is
> wrong, nor how we can have a path that cannot be represented by the platform 
> encoding?
> 
> Not being an expert in this area I cannot evaluate the effects of these
> shared code changes on other platforms, and so am reluctant to introduce any 
> change that affects any non-Windows platforms. Also the JVM and JNI work with 
> modified UTF-8 so I do not think we should diverge from that.
> I would hate to see Windows-specific code introduced into the JDK or JVM's
> shared code for these APIs, but that may be the only choice to avoid 
> potential disruption to other platforms. Though perhaps we could push the 
> initial conversion down into the JVM?

> I think I am hesitant to change the JVM interface from modified UTF-8 to 
> standard UTF-8, 

AFAICT all platforms except Windows already use standard UTF-8 on that path 
(from `Java_jdk_internal_loader_NativeLibraries_load()` to `JVM_LoadLibrary()`) 
because the "platform" encoding for those happens to be UTF-8. So at the
current stage this patch actually maintains the status quo for all platforms
except Windows, the only platform where the bug exists.

But I am not against changing the encoding to modified UTF-8 and updating
`os::dll_load()` for all platforms. I just wanted to reach some consensus before
proceeding with that change.
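
To make the three candidate conversions for point (a) more concrete, here is a
rough Java sketch (not part of the patch). The class and method names are
hypothetical, ISO-8859-1 merely stands in for some non-UTF-8 Windows code page,
and modified UTF-8 is produced with `DataOutputStream.writeUTF()` (minus its
2-byte length prefix) instead of the JNI functions the native code would use.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in the JDK sources.
public class PathEncodings {

    // Conversion (1): "platform" encoding. Unmappable characters are
    // silently replaced by the charset's replacement byte(s).
    static byte[] toPlatform(String s, Charset platform) {
        return s.getBytes(platform);
    }

    // Conversion (2): modified UTF-8. writeUTF() emits a 2-byte length
    // followed by the modified UTF-8 bytes; the length prefix is dropped.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Conversion (3): standard UTF-8.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: a non-UTF-8 code page, as on a typical Windows setup.
        Charset platform = StandardCharsets.ISO_8859_1;

        // Path containing a Cyrillic letter (U+0436), inside the BMP.
        String cyrillic = "C:/libs/\u0436.dll";
        String roundTrip = new String(toPlatform(cyrillic, platform), platform);
        System.out.println("(1) round trip lossless: " + roundTrip.equals(cyrillic));
        System.out.println("(2) equals (3) for a BMP path: "
                + Arrays.equals(toModifiedUtf8(cyrillic), toUtf8(cyrillic)));

        // Path containing a supplementary character (U+1F600, a surrogate pair).
        String supplementary = "C:/libs/\uD83D\uDE00.dll";
        System.out.println("(2) length: " + toModifiedUtf8(supplementary).length
                + ", (3) length: " + toUtf8(supplementary).length);
    }
}
```

For a path like the Cyrillic one, (1) loses the name while (2) and (3) produce
identical bytes; (2) and (3) only diverge for supplementary characters (two
3-byte surrogates versus one 4-byte sequence) and for an embedded U+0000.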

-------------

PR: https://git.openjdk.java.net/jdk/pull/4169
