On Sat, 22 Nov 2025 09:37:31 GMT, ExE Boss <[email protected]> wrote:

>> This implements an API to return the byte length of a String encoded in a 
>> given charset. See 
>> [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>> 
>> ---
>> 
>> 
>> Benchmark                              (encoding)  (stringLength)   Mode  Cnt          Score          Error  Units
>> StringLoopJmhBenchmark.getBytes             ASCII              10  thrpt    5  406782650.595 ± 16960032.852  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII             100  thrpt    5  172936926.189 ±  4532029.201  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII            1000  thrpt    5   38830681.232 ±  2413274.766  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII          100000  thrpt    5     458881.155 ±    12818.317  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1              10  thrpt    5   37193762.990 ±  3962947.391  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1             100  thrpt    5   55400876.236 ±  1267331.434  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1            1000  thrpt    5   11104514.001 ±    41718.545  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1          100000  thrpt    5     182535.414 ±    10296.120  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16              10  thrpt    5  113474681.457 ±  8326589.199  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16             100  thrpt    5   37854103.127 ±  4808526.773  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16            1000  thrpt    5    4139833.009 ±    70636.784  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16          100000  thrpt    5      57644.637 ±     1887.112  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII              10  thrpt    5  946701647.247 ± 76938927.141  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII             100  thrpt    5  396615374.479 ± 15167234.884  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII            1000  thrpt    5  100464784.979 ±   794027.897  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII          100000  thrpt    5    1215487.689 ±     1916.468  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1              10  thrpt    5  221265102.323 ± 17013983.056  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1             100  thrpt    5  137617873.887 ±  5842185.781  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1            1000  thrpt    5   92540259.1...
>
> src/java.base/share/classes/java/lang/String.java line 2127:
> 
>> 2125:      *          equivalent to this string, {@code false} otherwise
>> 2126:      *
>> 2127:      * @see  #compareTo(String)
> 
> For the **BOM**‑less **UTF‑16** charsets, this can simply return 
> `value.length << (1 - coder())`[^1]:
> 
> Suggestion:
> 
>         if (cs instanceof sun.nio.cs.UTF_16LE ||
>             cs instanceof sun.nio.cs.UTF_16BE) {
>             return value.length << (1 - coder());
>         }
>         return getBytes(cs).length;
> 
> 
> [^1]: Lone surrogates get replaced with `U+FFFD` when encoding to **UTF‑16** 
> by [`String​::getBytes​(Charset)`], and all of **LATIN1** can be encoded in 
> **UTF‑16**.
> 
> [`String​::getBytes​(Charset)`]: 
> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/String.html#getBytes(java.nio.charset.Charset)

Thanks!

There is more work that could be done for other charsets here. I focused on 
UTF-8 and the bytesCompatible case as a proof of concept and as a way to start 
the discussion. It may or may not make sense to add optimized paths for all 
of the other standard charsets.
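
For reference, the length identity behind the UTF-16 suggestion can be checked 
with public API only. This is a minimal sketch, not code from the PR: the class 
`Utf16LengthSketch` and the helper `utf16ByteLength` are hypothetical names. 
Because `value` and `coder()` are internal to `String`, the helper uses 
`s.length() << 1`, which should equal `value.length << (1 - coder())` for both 
coders, and compares it against `getBytes(...)` for the BOM-less charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16LengthSketch {

    // Hypothetical helper (not part of the PR): byte length of s when encoded
    // in a BOM-less UTF-16 charset. Each char is one 2-byte code unit, and a
    // lone surrogate is replaced by U+FFFD, which is also 2 bytes.
    // Assumption: s.length() is below 2^30, so the shift cannot overflow.
    static int utf16ByteLength(String s) {
        return s.length() << 1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "",               // empty string
            "hello",          // ASCII only
            "caf\u00e9",      // Latin-1 content
            "\uD83D\uDE00",   // supplementary code point (surrogate pair)
            "\uD800"          // lone high surrogate
        };
        Charset[] charsets = { StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE };

        for (Charset cs : charsets) {
            for (String s : samples) {
                int encoded = s.getBytes(cs).length;
                int computed = utf16ByteLength(s);
                if (encoded != computed) {
                    throw new AssertionError(cs + ": " + encoded + " != " + computed);
                }
                System.out.println(cs + " chars=" + s.length() + " bytes=" + computed);
            }
        }
    }
}
```

The shortcut does not carry over to `StandardCharsets.UTF_16`, whose encoder 
writes a byte-order mark, so that charset would still need the 
`getBytes(cs).length` fallback.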

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2556171650
