On Wed, 29 Oct 2025 10:50:53 GMT, Liam Miller-Cushon <[email protected]> wrote:
> This PR proposes adding a new overload to `MemorySegment::getString` that > takes a known byte length of the content. > > This was previously proposed in https://github.com/openjdk/jdk/pull/20725, > but the outcome of [JDK-8333843](https://bugs.openjdk.org/browse/JDK-8333843) > was to update `MemorySegment#getString` to suggest > > > byte[] bytes = new byte[length]; > MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length); > return new String(bytes, charset); > > > However this is less efficient than what the implementation of getString does > after [JDK-8362893](https://bugs.openjdk.org/browse/JDK-8362893), it now uses > `JavaLangAccess::uncheckedNewStringNoRepl` to avoid the copy. Some preliminary comments. I didn't look at the tests yet. src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1338: > 1336: * access operation will occur > 1337: * @param length byte length to be used for string conversion (not > including any > 1338: * null termination) I think 'to be used for string conversion' is a bit too vague (used _how_?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes'). Also, what happens if: - The length _does_ include a null terminator - The length is not a multiple of the byte size of a character in the given charset. On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size. src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1349: > 1347: * <ul> > 1348: * <li>{@code B} is the size, in bytes, of the string > encoded using the > 1349: * provided charset (e.g. {@code > str.getBytes(charset).length});</li> Isn't `B` equal to the `length` argument? src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1351: > 1349: * provided charset (e.g. {@code > str.getBytes(charset).length});</li> > 1350: * <li>{@code N} is the size (in bytes) of the > terminator char according > 1351: * to the provided charset. For instance, this is 1 for Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either. src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1363: > 1361: */ > 1362: String getString(long offset, int length, Charset charset); > 1363: I'd suggest putting the `length` parameter at the end, so that this becomes a telescoping overload of the length-less variant. src/java.base/share/classes/jdk/internal/foreign/StringSupport.java line 66: > 64: case SINGLE_BYTE -> readByte(segment, offset, len, charset); > 65: case DOUBLE_BYTE -> readShort(segment, offset, len, charset); > 66: case QUAD_BYTE -> readInt(segment, offset, len, charset); These 3 methods appear to be identical ------------- PR Review: https://git.openjdk.org/jdk/pull/28043#pullrequestreview-3394314370 PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473855397 PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473837599 PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473832085 PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473823655 PR Review Comment: https://git.openjdk.org/jdk/pull/28043#discussion_r2473819485
