Hi all, Following the earlier discussion, I took David's suggestion and made some JMH benchmarks to quantify the tradeoff between a linear contains(char[] a, char key) on unsorted primitive arrays and the "sort once + Arrays.binarySearch" approach for repeated lookups. Markus, I also want
to thank you for your continued encouragement and your offer to help with JMH, the support over the past week has meant a lot to me as I learn how to do this properly! :) My benchmark setup where I used the JMH Maven archetype and created two benchmark classes: 1. MyBenchmark, single searches: -
linearContains_* on the original unsorted char[] - binarySearchSorted_* on a pre sorted copy - sortThenBinarySearch_* which clones, sorts, and then searches on each call 2. RepeatedSearchBenchmark, models David's scenario: - Start with an unsorted char[] whose length is 8, 16, 32, 64, or 128. - For
each (size, searches) pair, compare: - linear_N_*: N linear searches on the same unsorted array - sortOnceThenBinary_N_*: sort a clone once, then do N binary searches on the sorted copy - searches is one of 1, 2, 5, 10, or 100. I measured both "hit" (key guaranteed present) and
"miss" (key guaranteed not present) cases. ----------------------------------------------------------------------------------------- Both benchmarks use @BenchmarkMode(Mode.AverageTime) with TimeUnit.NANOSECONDS, and JMH options -wi 3 -i 5 -f 1 -w 1s -r 1s. Key data: repeated searches,
hit case. Here is the key part of the RepeatedSearchBenchmark hit results (ns/op): Benchmark (searches) (size) Mode Cnt Score Error Units RepeatedSearchBenchmark.linear_N_hit 1 8 avgt 5 20.921 ± 0.413 ns/op RepeatedSearchBenchmark.linear_N_hit 1 16 avgt 5 76.929 ± 121.552 ns/op
RepeatedSearchBenchmark.linear_N_hit 1 32 avgt 5 26.490 ± 51.510 ns/op RepeatedSearchBenchmark.linear_N_hit 1 64 avgt 5 108.940 ± 212.914 ns/op RepeatedSearchBenchmark.linear_N_hit 1 128 avgt 5 124.535 ± 2.990 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 8 avgt 5 94.878 ± 29.396 ns/op
RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 16 avgt 5 160.161 ± 60.178 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 32 avgt 5 362.436 ± 61.600 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 64 avgt 5 994.649 ± 305.249 ns/op
RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 128 avgt 5 3775.735 ± 1371.752 ns/op For a single search, linear search is faster than "sort once then binary search" at all tested sizes 8-128, most of the times by a large factor, which matches the expectation that sort cost dominates
at such small n. ----------------------------------------------------------------------------------------- With more repeated searches, the picture changes gradually. For example: RepeatedSearchBenchmark.linear_N_hit 10 16 avgt 5 340.180 ± 663.784 ns/op
RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 10 16 avgt 5 267.033 ± 77.084 ns/op RepeatedSearchBenchmark.linear_N_hit 100 32 avgt 5 2175.411 ± 4591.489 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 32 avgt 5 1088.907 ± 228.760 ns/op RepeatedSearchBenchmark.linear_N_hit 100 64 avgt
5 15563.332 ± 29450.784 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 64 avgt 5 1472.852 ± 353.327 ns/op ----------------------------------------------------------------------------------------- So, very briefly: - For small N (1-5 searches per array), linear search on the unsorted
char[] is faster or comparable for all tested sizes up to 128. - For larger N (e.g. 100 searches per array), sorting once and reusing a sorted copy does become faster than repeating linear search, especially for size >= 16, as theory would predict. The "miss" variants show the same
pattern, just with larger absolute times because linear search typically scans the whole array on a miss. Single searches: sort then binary vs direct linear In MyBenchmark, I also measured the simpler "one search per call" case: - linearContains_* on the unsorted array. -
binarySearchSorted_* on a pre sorted copy (best case when data is already kept sorted). - sortThenBinarySearch_* (fresh sort + binary search per call). As expected, if the array is already sorted, binary search is competitive at these sizes, but when you pay the sort cost per call, linear search
generally wins for 8-128 elements. I understand that this is well explored territory and that sorting + binary search is known to win beyond a low inflection point. My measurements seem consistent with that: - For tiny primitive arrays with only a handful of lookups per array, a simple linear
contains(char[] a, char key) is not obviously dominated by "sort then binarySearch"; it can be faster or at least clearly competitive. - For moderate numbers of lookups (e.g., 100 per array), "sort once + binarySearch" starts to win in a way that makes the linear helper less
attractive, especially as size grows. My proposal is not trying to replace the sort + binarySearch, but to provide a readable entry point for the small unsorted array, small N where users currently write their own loops anyway. If you see flaws in the benchmark methodology or important scenarios
I've missed, I would really appreciate suggestions on how to improve it, especially to make sure I am being fair to the "sort once then reuse" side of the tradeoff, since that seems central to the API bar discussion. Best regards, Arnav On Apr 11, 2026, at 5:22 AM, Markus KARG
<[email protected]> wrote: Arnav, I like to amend: This is a great opportunity for you to learn about benchmarking using JMH. If you need help with this, feel free to contact me off-list. Having said that, I want to propose one more option for you: If OpenJDK finally rejects your
vectorized implementation even in case you proof with JMH that your implementation is faster than the proposed sorted binary search alternative, you might like to donate the vectorized implementation to Apache. There already is ArrayUtils.contains(...) which is based on a simple for-loop (find
indexOf(...) on Github), and vectorizing it might be of interest for Apache: https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/ArrayUtils.html#contains (byte%5B%5D,byte). Note that any further discussion about Apache is certainly off-topic on this list. Regards -Markus
Am 11.04.2026 um 04:45 schrieb David Alayachew: Would a prototype (for example starting with char[]) be useful for debating if the API bar could be met? Not by itself, but metrics from it might. If the tradeoff threshold for searching an unsorted array multiple times was high enough compared to
sorting then searching that array multiple times, then that might sway opinions. Sadly, what you are trying to do is extremely well explored territory, which is why Chen and others are telling you this is not going to work out. We have a VERY good idea of what the answer will be -- the inflection
point where sorting an array beforehand starts to outperform searching it unsorted is embarrassingly low, to the point where it makes no sense to even include the API in the first place. It's extremely unlikely that the answer has changed since most of us checked, but by all means, run the numbers
and see what you get. At the end of the day, evidence born from testing something out is the strongest argument you can make. It's just extremely unlikely that the evidence will work in your favor. On Fri, Apr 10, 2026 at 4:04 PM Arnav Somaghatta < [email protected] > wrote: Hello
Chen, Thank you for clarifying the API surface concern. I agree that the bar for additions to Arrays should remain high. The reason I still think this may be worth considering is that the repeated manual loop appears frequently for small unsorted primitive arrays, and a dedicated API would provide
both readability and also a very clear optimization hook for future implementations, including vectorized fast paths where appropriate. I don't want to unnecessarily expand Arrays, but I want to explore whether this specific operation is common and fundamental enough to justify a standard library
entry point. Markus’s earlier point about possible optimization opportunities for unsorted primitive arrays especially made me think more that this may be a direction that is worth validating a little more. Would a prototype (for example starting with char[]) be useful for debating if the API bar
could be met? Regards, Arnav On Apr 10, 2026, at 3:55 PM, Chen Liang <[email protected]> wrote: Hello all, I still don't think you should add these APIs. This linear search is easily written by users. As I said again, you shouldn't use a linear array for passing such data around. This
usually represents a failure in program structure design. Vectorization can be done with the Vector API. Arrays is not the home of a million garbage APIs. If everyone says "oh I have a proposal" and add something, we can see this class blow up really quickly. So there is a bar for API
additions. This "contains" proposal clearly doesn't meet that bar, as it can be replaced by more high-level APIs like sort+binarySearch, or some sort of String KMP algorithm applied onto arrays. Regards, Chen Liang From: Arnav Somaghatta <[email protected]> Sent: Friday,
April 10, 2026 5:45 AM To: Markus KARG <[email protected]> Cc: [email protected] <[email protected]> Subject: Re: [External] : Proposal for primitive arrays Hello all, Thank you all for the feedback so far, and Markus especially for the helpful perspective on the
potential optimization and vectorization opportunities for unsorted primitive array searches. I have now submitted my OCA form as well and waiting for it to get approved. Based on the discussion so far, would it make sense for me to begin working on an initial prototype patch (starting with the
char[] overload and its tests) so the API shape and implementation can be reviewed better? And Markus, if this direction still seems aligned, would you be comfortable acting as sponsor/reviewer for an initial contribution? Regards, Arnav On Apr 6, 2026, at 6:00 AM, Markus KARG
<[email protected]> wrote: Chen, how does this explain that the propopsal "is not good"? It is not on us to decide what others do with the tools we provide, as long as we DO provide them. IF people are using primitive arrays, and IF they need to search them in an UNSORTED
fashion, Arnav's proposed API would allows us to apply vectorization. Hence, the result COULD be multiple times faster than custom search loops. As long as primitive arrays DO exist, I would rather say, his proposal "is good". -Markus Am 06.04.2026 um 02:40 schrieb Chen Liang: Hello, I
don't think this is a good proposal. Nowadays, primitive arrays are used as an efficient storage format, much more compact than reference arrays. However, they all suffer from the mutability issue as all the arrays: when you pass arrays across trust boundaries, you should always sanitize the input
arrays and output arrays. In this sanitization process, we usually can convert these arrays to more efficient formats, such as performing a Arrays.sort call, and the subsequent search can be a Arrays.binarySearch(a, key) >= 0 call. Regards, Chen Liang ________________________________ From: Arnav
Somaghatta <[email protected]> Sent: Sunday, April 5, 2026 9:58 AM To: [email protected] <[email protected]> Subject: [External] : Proposal for primitive arrays Hello, My name is Arnav, and I am a 14-year-old developer who is interested in contributing to the
OpenJDK. I would like to propose adding a set of "contains" methods to the java.util.Arrays class. Currently, to check if a primitive array (like char or int) contains a specific value needs a large manual for loop, converting to a List, or using a very complicated Stream pipeline. My
idea proposes adding a series of static "contains" methods for all 8 primitive types to java.util.Arrays public static boolean contains(char[] a, char key) public static boolean contains(int[] a, int key) public static boolean contains(byte[] a, byte key) public static boolean
contains(short[] a, short key) public static boolean contains(long[] a, long key) public static boolean contains(float[] a, float key) public static boolean contains(double[] a, double key) public static boolean contains(boolean[] a, boolean key) Why should you consider this proposal?
Arrays.contains(myChars, 'a') is much more readable than workarounds like a for loop. For newer Java learners, not having a "contains" method on arrays is very confusing when you compare it to other languages like Python. If this proposal is viewed favorably, I am willing to make all of
the 8 primitive overloads and any of the corresponding unit tests. I am also looking for a sponsor to help me navigate the OCA process for minors and guide me through the technical process of pushing these changes. I look forward to your feedback on whether this is a direction the Core Libs group
would support. Thank you so much for considering my idea and reading through this email, your support means a lot to me! Thank you, Arnav