Re: [External] : Proposal for primitive arrays

Markus KARG Sun, 12 Apr 2026 01:53:25 -0700

Arnav,

I think the most interesting point for your further research would be ifvectorization would bring a considerable improvement. If so, providing aPR for Apache Commons would be justified.


-Markus


Am 11.04.2026 um 18:55 schrieb Arnav Somaghatta:

Hi all,
Following the earlier discussion, I took David's suggestion and madesome JMH benchmarks to quantify the tradeoff between a linearcontains(char[] a, char key) on unsorted primitive arrays and the"sort once + Arrays.binarySearch" approach for repeated lookups.
Markus, I also want to thank you for your continued encouragement andyour offer to help with JMH, the support over the past week has meanta lot to me as I learn how to do this properly! :)
My benchmark setup where I used the JMH Maven archetype and createdtwo benchmark classes:
 1. MyBenchmark, single searches:
   - linearContains_* on the original unsorted char[]
   - binarySearchSorted_* on a pre sorted copy
- sortThenBinarySearch_* which clones, sorts, and then searches oneach call
 2. RepeatedSearchBenchmark, models David's scenario:
   - Start with an unsorted char[] whose length is 8, 16, 32, 64, or 128.
   - For each (size, searches) pair, compare:
      - linear_N_*: N linear searches on the same unsorted array
- sortOnceThenBinary_N_*: sort a clone once, then do N binarysearches on the sorted copy
   - searches is one of 1, 2, 5, 10, or 100.
I measured both "hit" (key guaranteed present) and "miss" (keyguaranteed not present) cases.
-----------------------------------------------------------------------------------------
Both benchmarks use @BenchmarkMode(Mode.AverageTime) withTimeUnit.NANOSECONDS, and JMH options -wi 3 -i 5 -f 1 -w 1s -r 1s. Keydata: repeated searches, hit case. Here is the key part of theRepeatedSearchBenchmark hit results (ns/op):
Benchmark (searches) (size) Mode Cnt Score Error UnitsRepeatedSearchBenchmark.linear_N_hit 1 8 avgt 5 20.921 ± 0.413 ns/opRepeatedSearchBenchmark.linear_N_hit 1 16 avgt 5 76.929 ± 121.552 ns/opRepeatedSearchBenchmark.linear_N_hit 1 32 avgt 5 26.490 ± 51.510 ns/opRepeatedSearchBenchmark.linear_N_hit 1 64 avgt 5 108.940 ± 212.914 ns/opRepeatedSearchBenchmark.linear_N_hit 1 128 avgt 5 124.535 ± 2.990 ns/op
RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 8 avgt 5 94.878 ± 29.396 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 16 avgt 5 160.161 ± 60.178 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 32 avgt 5 362.436 ± 61.600 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 64 avgt 5 994.649 ± 305.249 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 128 avgt 5 3775.735 ± 1371.752 ns/op
For a single search, linear search is faster than "sort once thenbinary search" at all tested sizes 8-128, most of the times by a largefactor, which matches the expectation that sort cost dominates at suchsmall n.
-----------------------------------------------------------------------------------------
With more repeated searches, the picture changes gradually. For example:
RepeatedSearchBenchmark.linear_N_hit 10 16 avgt 5 340.180 ± 663.784 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 10 16 avgt 5 267.033 ± 77.084 ns/op
RepeatedSearchBenchmark.linear_N_hit 100 32 avgt 5 2175.411 ± 4591.489 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 32 avgt 5 1088.907 ± 228.760 ns/op
RepeatedSearchBenchmark.linear_N_hit 100 64 avgt 5 15563.332 ± 29450.784 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 64 avgt 5 1472.852 ± 353.327 ns/op
-----------------------------------------------------------------------------------------
So, very briefly:
- For small N (1-5 searches per array), linear search on the unsortedchar[] is faster or comparable for all tested sizes up to 128. - For larger N (e.g. 100 searches per array), sorting once andreusing a sorted copy does become faster than repeating linear search,especially for size >= 16, as theory would predict.The "miss" variants show the same pattern, just with larger absolutetimes because linear search typically scans the whole array on a miss.
Single searches: sort then binary vs direct linear
In MyBenchmark, I also measured the simpler "one search per call" case:
- linearContains_* on the unsorted array.
- binarySearchSorted_* on a pre sorted copy (best case when data isalready kept sorted).
- sortThenBinarySearch_* (fresh sort + binary search per call).
As expected, if the array is already sorted, binary search iscompetitive at these sizes, but when you pay the sort cost per call,linear search generally wins for 8-128 elements.
I understand that this is well explored territory and that sorting +binary search is known to win beyond a low inflection point. Mymeasurements seem consistent with that: - For tiny primitive arrays with only a handful of lookups per array,a simple linear contains(char[] a, char key) is not obviouslydominated by "sort then binarySearch"; it can be faster or at leastclearly competitive. - For moderate numbers of lookups (e.g., 100 per array), "sort once +binarySearch" starts to win in a way that makes the linear helper lessattractive, especially as size grows.
My proposal is not trying to replace the sort + binarySearch, but toprovide a readable entry point for the small unsorted array, small Nwhere users currently write their own loops anyway.If you see flaws in the benchmark methodology or important scenariosI've missed, I would really appreciate suggestions on how to improveit, especially to make sure I am being fair to the "sort once thenreuse" side of the tradeoff, since that seems central to the API bardiscussion.
Best regards,
Arnav
On Apr 11, 2026, at 5:22 AM, Markus KARG <[email protected]> wrote:


Arnav,

I like to amend: This is a great opportunity for you to learn about
benchmarking using JMH. If you need help with this, feel free to contact
me off-list.

Having said that, I want to propose one more option for you: If OpenJDK
finally rejects your vectorized implementation even in case you proof
with JMH that your implementation is faster than the proposed sorted
binary search alternative, you might like to donate the vectorized
implementation to Apache. There already is ArrayUtils.contains(...)
which is based on a simple for-loop (find indexOf(...) on Github), and
vectorizing it might be of interest for Apache:
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/ArrayUtils.html#contains(byte%5B%5D,byte).
Note that any further discussion about Apache is certainly off-topic on
this list.

Regards

-Markus


Am 11.04.2026 um 04:45 schrieb David Alayachew:
Would a prototype (for example starting with char[]) be useful for
debating if the API bar could be met?

Not by itself, but metrics from it might. If the tradeoff threshold for
searching an unsorted array multiple times was high enough compared to
sorting then searching that array multiple times, then that might sway
opinions.

Sadly, what you are trying to do is extremely well explored territory,
which is why Chen and others are telling you this is not going towork out.We have a VERY good idea of what the answer will be -- theinflection point
where sorting an array beforehand starts to outperform searching it
unsorted is embarrassingly low, to the point where it makes no sense to
even include the API in the first place.

It's extremely unlikely that the answer has changed since most of us
checked, but by all means, run the numbers and see what you get. Atthe end
of the day, evidence born from testing something out is the strongest
argument you can make. It's just extremely unlikely that theevidence will
work in your favor.

On Fri, Apr 10, 2026 at 4:04 PM Arnav Somaghatta <
[email protected]> wrote:
Hello Chen,
Thank you for clarifying the API surface concern. I agree that thebar for
additions to Arrays should remain high.

The reason I still think this may be worth considering is that the
repeated manual loop appears frequently for small unsorted primitive
arrays, and a dedicated API would provide both readability and alsoa veryclear optimization hook for future implementations, includingvectorized
fast paths where appropriate.
I don't want to unnecessarily expand Arrays, but I want to explorewhether
this specific operation is common and fundamental enough to justify a
standard library entry point.

Markus’s earlier point about possible optimization opportunities for
unsorted primitive arrays especially made me think more that thismay be a
direction that is worth validating a little more.

Would a prototype (for example starting with char[]) be useful for
debating if the API bar could be met?

Regards,
Arnav
On Apr 10, 2026, at 3:55 PM, Chen Liang <[email protected]>wrote:
Hello all,
I still don't think you should add these APIs.
This linear search is easily written by users.
As I said again, you shouldn't use a linear array for passing suchdata
around. This usually represents a failure in program structure design.
Vectorization can be done with the Vector API.
Arrays is not the home of a million garbage APIs. If everyone says"oh I
have a proposal" and add something, we can see this class blow upreally
quickly.
So there is a bar for API additions.
This "contains" proposal clearly doesn't meet that bar, as it can be
replaced by more high-level APIs like sort+binarySearch, or somesort of
String KMP algorithm applied onto arrays.
Regards,
Chen Liang
From: Arnav Somaghatta <[email protected]>
Sent: Friday, April 10, 2026 5:45 AM
To: Markus KARG <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [External] : Proposal for primitive arrays

Hello all,

Thank you all for the feedback so far, and Markus especially for the
helpful perspective on the potential optimization and vectorization
opportunities for unsorted primitive array searches.
I have now submitted my OCA form as well and waiting for it to get
approved.
Based on the discussion so far, would it make sense for me to begin
working on an initial prototype patch (starting with the char[]overloadand its tests) so the API shape and implementation can be reviewedbetter?
And Markus, if this direction still seems aligned, would you be
comfortable acting as sponsor/reviewer for an initial contribution?
Regards,
Arnav
On Apr 6, 2026, at 6:00 AM, Markus KARG <[email protected]>
wrote:
Chen,

how does this explain that the propopsal "is not good"?
It is not on us to decide what others do with the tools weprovide, as
long as we DO provide them. IF people are using primitive arrays,and IFthey need to search them in an UNSORTED fashion, Arnav's proposedAPI wouldallows us to apply vectorization. Hence, the result COULD bemultiple timesfaster than custom search loops. As long as primitive arrays DOexist, I
would rather say, his proposal "is good".
-Markus
Am 06.04.2026 um 02:40 schrieb Chen Liang:
Hello, I don't think this is a good proposal.
Nowadays, primitive arrays are used as an efficient storage format,
much more compact than reference arrays. However, they all sufferfrom the
mutability issue as all the arrays: when you pass arrays across trust
boundaries, you should always sanitize the input arrays and outputarrays.In this sanitization process, we usually can convert these arraysto more
efficient formats, such as performing a Arrays.sort call, and the
subsequent search can be a Arrays.binarySearch(a, key) >= 0 call.
Regards,
Chen Liang
________________________________
From: Arnav Somaghatta <[email protected]>
Sent: Sunday, April 5, 2026 9:58 AM
To: [email protected] <[email protected]>
Subject: [External] : Proposal for primitive arrays

Hello,
My name is Arnav, and I am a 14-year-old developer who is interested
in contributing to the OpenJDK. I would like to propose adding a set of
"contains" methods to the java.util.Arrays class.
Currently, to check if a primitive array (like char or int) contains
a specific value needs a large manual for loop, converting to aList, or
using a very complicated Stream pipeline.
My idea proposes adding a series of static "contains" methodsfor all
8 primitive types to java.util.Arrays
public static boolean contains(char[] a, char key)
public static boolean contains(int[] a, int key)
public static boolean contains(byte[] a, byte key)
public static boolean contains(short[] a, short key)
public static boolean contains(long[] a, long key)
public static boolean contains(float[] a, float key)
public static boolean contains(double[] a, double key)
public static boolean contains(boolean[] a, boolean key)

Why should you consider this proposal?
Arrays.contains(myChars, 'a') is much more readable than workarounds
like a for loop.
For newer Java learners, not having a "contains" method on arrays is
very confusing when you compare it to other languages like Python.
If this proposal is viewed favorably, I am willing to make allof the
8 primitive overloads and any of the corresponding unit tests. I amalsolooking for a sponsor to help me navigate the OCA process forminors and
guide me through the technical process of pushing these changes.
I look forward to your feedback on whether this is a direction the
Core Libs group would support. Thank you so much for considering myidea
and reading through this email, your support means a lot to me!
Thank you,
Arnav

Re: [External] : Proposal for primitive arrays

Reply via email to