On Fri, 4 Jun 2021 14:00:25 GMT, Maxim Kartashev <github.com+28651297+mkartas...@openjdk.org> wrote:
>> Not an expert, but my understanding is that the VM only deals with modified
>> UTF-8, as does JNI. So the incoming string should be modified UTF-8 IMO and
>> then converted to UTF-16.
>>
>> That said, this is shared code being modified on the JDK side so you can't
>> just change the type of string being passed in without updating all the
>> implementations of os::dll_load to support that!
>
> I think we need to establish some common ground before proceeding further
> with this fix. It's a bit of a long read; please bear with me.
>
> The path name starts its life as a `jstring` in
> `Java_jdk_internal_loader_NativeLibraries_load()`; its encoding is irrelevant
> at this point.
>
> Next, the name has to be passed down to `JVM_LoadLibrary()`, which takes
> `char*`. So we need to convert from `jstring` to `char*` (point (a)).
> Following that, `os::dll_load()`, which actually performs the loading in a
> platform-specific manner, also receives `char*`. All platform implementations
> of `os::dll_load()` pass the path name down to their respective platform's
> APIs unmodified, but I think that's just incidental, and here we have another
> possible point of conversion (point (b)). Other consumers of the path name
> are exception (c) and logging (d) messages; they also take `char*`, but
> potentially of a different encoding.
>
> Let me try to enumerate all conceivably valid conversions for
> `JVM_LoadLibrary()` consumption (point (a)):
> 1. jstring -> platform-specific encoding (the status quo, meaning a possibly
>    lossy encoding on Windows and UTF-8 elsewhere AFAICT),
> 2. jstring -> modified UTF-8,
> 3. jstring -> UTF-8.
>
> This bug [8195129](https://bugs.openjdk.java.net/browse/JDK-8195129) occurs
> because conversion (1) may lose information on Windows if the platform
> encoding happens to be NOT UTF-8 (which is often, or even always, the case).
> So that's a no-go and we are left with either (2) or (3).
>
> On macOS and Linux, the "platform" encoding already is UTF-8, and since all
> the platform APIs happily consume UTF-8, no further conversion is necessary
> (neither for actual library loading, nor for log or exception messages; the
> latter have to convert to UTF-16, but do that under the hood).
>
> On Windows, we require at least these variants of the path name:
> 1. UTF-16 for library loading (Unicode Windows API),
> 2. "platform" encoding for logging (yes, losing information here, but that's
>    tolerable),
> 3. "platform" (lossy) or UTF-8 (lossless) encoding for exception messages
>    (prefer lossless).
>
> This is what's behind my choice of UTF-8 for the path name encoding as it
> gets passed down to `JVM_LoadLibrary()`. We can go with modified UTF-8, of
> course, in which case all platforms - not just Windows - will have to do the
> conversion on their own, losing the benefit of the knowledge about the
> original string encoding (the String.coder field of jstring).

I am hesitant to change the JVM interface from modified UTF-8 to standard
UTF-8, as it would be the only place in the JNI/JVM interface that uses
standard UTF-8. Instead, I would implement `convert_UTF8_to_UTF16`, or rather
`convert_mUTF8_to_UTF16`, with fairly simple arithmetic logic.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4169
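
To illustrate what "fairly simple arithmetic logic" could look like, here is a minimal sketch of a modified UTF-8 to UTF-16 decoder. It is not the HotSpot code: the function name `convert_mutf8_to_utf16`, its signature, and the use of `std::string`, `std::u16string`, and exceptions are assumptions made for this example only. The sketch relies on the property that modified UTF-8 has only 1-, 2-, and 3-byte forms, each producing exactly one UTF-16 code unit (supplementary characters arrive as two 3-byte surrogates, and NUL arrives as `C0 80`).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of HotSpot): decodes modified UTF-8 bytes
// into UTF-16 code units. It checks sequence lengths and continuation bytes,
// but does not reject overlong forms or validate surrogate pairing.
std::u16string convert_mutf8_to_utf16(const std::string& in) {
    const std::size_t n = in.size();
    std::u16string out;
    out.reserve(n);  // every code unit consumes at least one input byte

    // Returns the low 6 payload bits of the continuation byte at index k.
    auto cont = [&](std::size_t k) -> unsigned {
        if (k >= n) throw std::invalid_argument("truncated sequence");
        const unsigned b = static_cast<unsigned char>(in[k]);
        if ((b & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
        return b & 0x3F;
    };

    std::size_t i = 0;
    while (i < n) {
        const unsigned b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {                       // 0xxxxxxx: one byte
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0) {      // 110xxxxx 10xxxxxx (incl. C0 80 for NUL)
            const unsigned c1 = cont(i + 1);
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | c1));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0) {      // 1110xxxx 10xxxxxx 10xxxxxx
            const unsigned c1 = cont(i + 1);
            const unsigned c2 = cont(i + 2);
            out.push_back(static_cast<char16_t>(((b0 & 0x0F) << 12) | (c1 << 6) | c2));
            i += 3;
        } else {                               // 4-byte forms never occur in modified UTF-8
            throw std::invalid_argument("byte not valid in modified UTF-8");
        }
    }
    assert(out.size() <= n);  // post-condition implied by the loop above
    return out;
}
```

On Windows, where `wchar_t` is 16 bits wide, the resulting code units could then be handed to a wide-character API such as `LoadLibraryW`; how that would be wired into `os::dll_load()` is left open here.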