Arnav,I think the most interesting point for your further research would be if vectorization would bring a considerable improvement. If so, providing a PR for Apache Commons would be justified.
-Markus Am 11.04.2026 um 18:55 schrieb Arnav Somaghatta:
Hi all,Following the earlier discussion, I took David's suggestion and made some JMH benchmarks to quantify the tradeoff between a linear contains(char[] a, char key) on unsorted primitive arrays and the "sort once + Arrays.binarySearch" approach for repeated lookups.Markus, I also want to thank you for your continued encouragement and your offer to help with JMH, the support over the past week has meant a lot to me as I learn how to do this properly! :)My benchmark setup where I used the JMH Maven archetype and created two benchmark classes:1. MyBenchmark, single searches: - linearContains_* on the original unsorted char[] - binarySearchSorted_* on a pre sorted copy- sortThenBinarySearch_* which clones, sorts, and then searches on each call2. RepeatedSearchBenchmark, models David's scenario: - Start with an unsorted char[] whose length is 8, 16, 32, 64, or 128. - For each (size, searches) pair, compare: - linear_N_*: N linear searches on the same unsorted array- sortOnceThenBinary_N_*: sort a clone once, then do N binary searches on the sorted copy- searches is one of 1, 2, 5, 10, or 100.I measured both "hit" (key guaranteed present) and "miss" (key guaranteed not present) cases.-----------------------------------------------------------------------------------------Both benchmarks use @BenchmarkMode(Mode.AverageTime) with TimeUnit.NANOSECONDS, and JMH options -wi 3 -i 5 -f 1 -w 1s -r 1s. Key data: repeated searches, hit case. Here is the key part of the RepeatedSearchBenchmark hit results (ns/op):Benchmark (searches) (size) Mode Cnt Score Error Units RepeatedSearchBenchmark.linear_N_hit 1 8 avgt 5 20.921 ± 0.413 ns/op RepeatedSearchBenchmark.linear_N_hit 1 16 avgt 5 76.929 ± 121.552 ns/op RepeatedSearchBenchmark.linear_N_hit 1 32 avgt 5 26.490 ± 51.510 ns/op RepeatedSearchBenchmark.linear_N_hit 1 64 avgt 5 108.940 ± 212.914 ns/op RepeatedSearchBenchmark.linear_N_hit 1 128 avgt 5 124.535 ± 2.990 ns/opRepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 8 avgt 5 94.878 ± 29.396 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 16 avgt 5 160.161 ± 60.178 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 32 avgt 5 362.436 ± 61.600 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 64 avgt 5 994.649 ± 305.249 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1 128 avgt 5 3775.735 ± 1371.752 ns/opFor a single search, linear search is faster than "sort once then binary search" at all tested sizes 8-128, most of the times by a large factor, which matches the expectation that sort cost dominates at such small n.----------------------------------------------------------------------------------------- With more repeated searches, the picture changes gradually. For example:RepeatedSearchBenchmark.linear_N_hit 10 16 avgt 5 340.180 ± 663.784 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 10 16 avgt 5 267.033 ± 77.084 ns/opRepeatedSearchBenchmark.linear_N_hit 100 32 avgt 5 2175.411 ± 4591.489 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 32 avgt 5 1088.907 ± 228.760 ns/opRepeatedSearchBenchmark.linear_N_hit 100 64 avgt 5 15563.332 ± 29450.784 ns/op RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 100 64 avgt 5 1472.852 ± 353.327 ns/op----------------------------------------------------------------------------------------- So, very briefly:- For small N (1-5 searches per array), linear search on the unsorted char[] is faster or comparable for all tested sizes up to 128. - For larger N (e.g. 100 searches per array), sorting once and reusing a sorted copy does become faster than repeating linear search, especially for size >= 16, as theory would predict. The "miss" variants show the same pattern, just with larger absolute times because linear search typically scans the whole array on a miss.Single searches: sort then binary vs direct linear In MyBenchmark, I also measured the simpler "one search per call" case: - linearContains_* on the unsorted array.- binarySearchSorted_* on a pre sorted copy (best case when data is already kept sorted).- sortThenBinarySearch_* (fresh sort + binary search per call).As expected, if the array is already sorted, binary search is competitive at these sizes, but when you pay the sort cost per call, linear search generally wins for 8-128 elements.I understand that this is well explored territory and that sorting + binary search is known to win beyond a low inflection point. My measurements seem consistent with that: - For tiny primitive arrays with only a handful of lookups per array, a simple linear contains(char[] a, char key) is not obviously dominated by "sort then binarySearch"; it can be faster or at least clearly competitive. - For moderate numbers of lookups (e.g., 100 per array), "sort once + binarySearch" starts to win in a way that makes the linear helper less attractive, especially as size grows.My proposal is not trying to replace the sort + binarySearch, but to provide a readable entry point for the small unsorted array, small N where users currently write their own loops anyway. If you see flaws in the benchmark methodology or important scenarios I've missed, I would really appreciate suggestions on how to improve it, especially to make sure I am being fair to the "sort once then reuse" side of the tradeoff, since that seems central to the API bar discussion.Best regards, ArnavOn Apr 11, 2026, at 5:22 AM, Markus KARG <[email protected]> wrote: Arnav, I like to amend: This is a great opportunity for you to learn about benchmarking using JMH. If you need help with this, feel free to contact me off-list. Having said that, I want to propose one more option for you: If OpenJDK finally rejects your vectorized implementation even in case you proof with JMH that your implementation is faster than the proposed sorted binary search alternative, you might like to donate the vectorized implementation to Apache. There already is ArrayUtils.contains(...) which is based on a simple for-loop (find indexOf(...) on Github), and vectorizing it might be of interest for Apache:https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/ArrayUtils.html#contains(byte%5B%5D,byte).Note that any further discussion about Apache is certainly off-topic on this list. Regards -Markus Am 11.04.2026 um 04:45 schrieb David Alayachew:Would a prototype (for example starting with char[]) be useful fordebating if the API bar could be met? Not by itself, but metrics from it might. If the tradeoff threshold for searching an unsorted array multiple times was high enough compared to sorting then searching that array multiple times, then that might sway opinions. Sadly, what you are trying to do is extremely well explored territory,which is why Chen and others are telling you this is not going to work out. We have a VERY good idea of what the answer will be -- the inflection pointwhere sorting an array beforehand starts to outperform searching it unsorted is embarrassingly low, to the point where it makes no sense to even include the API in the first place. It's extremely unlikely that the answer has changed since most of uschecked, but by all means, run the numbers and see what you get. At the endof the day, evidence born from testing something out is the strongestargument you can make. It's just extremely unlikely that the evidence willwork in your favor. On Fri, Apr 10, 2026 at 4:04 PM Arnav Somaghatta < [email protected]> wrote:Hello Chen,Thank you for clarifying the API surface concern. I agree that the bar foradditions to Arrays should remain high. The reason I still think this may be worth considering is that the repeated manual loop appears frequently for small unsorted primitivearrays, and a dedicated API would provide both readability and also a very clear optimization hook for future implementations, including vectorizedfast paths where appropriate.I don't want to unnecessarily expand Arrays, but I want to explore whetherthis specific operation is common and fundamental enough to justify a standard library entry point. Markus’s earlier point about possible optimization opportunities forunsorted primitive arrays especially made me think more that this may be adirection that is worth validating a little more. Would a prototype (for example starting with char[]) be useful for debating if the API bar could be met? Regards, ArnavOn Apr 10, 2026, at 3:55 PM, Chen Liang <[email protected]> wrote: Hello all, I still don't think you should add these APIs. This linear search is easily written by users.As I said again, you shouldn't use a linear array for passing such dataaround. This usually represents a failure in program structure design.have a proposal" and add something, we can see this class blow up reallyVectorization can be done with the Vector API.Arrays is not the home of a million garbage APIs. If everyone says "oh Iquickly.replaced by more high-level APIs like sort+binarySearch, or some sort ofSo there is a bar for API additions. This "contains" proposal clearly doesn't meet that bar, as it can beString KMP algorithm applied onto arrays.Regards, Chen Liang From: Arnav Somaghatta <[email protected]> Sent: Friday, April 10, 2026 5:45 AM To: Markus KARG <[email protected]> Cc: [email protected] <[email protected]> Subject: Re: [External] : Proposal for primitive arrays Hello all, Thank you all for the feedback so far, and Markus especially for thehelpful perspective on the potential optimization and vectorization opportunities for unsorted primitive array searches.I have now submitted my OCA form as well and waiting for it to getapproved.working on an initial prototype patch (starting with the char[] overload and its tests) so the API shape and implementation can be reviewed better?Based on the discussion so far, would it make sense for me to beginAnd Markus, if this direction still seems aligned, would you becomfortable acting as sponsor/reviewer for an initial contribution?Regards, ArnavOn Apr 6, 2026, at 6:00 AM, Markus KARG <[email protected]>wrote:long as we DO provide them. IF people are using primitive arrays, and IF they need to search them in an UNSORTED fashion, Arnav's proposed API would allows us to apply vectorization. Hence, the result COULD be multiple times faster than custom search loops. As long as primitive arrays DO exist, IChen, how does this explain that the propopsal "is not good"?It is not on us to decide what others do with the tools we provide, aswould rather say, his proposal "is good".much more compact than reference arrays. However, they all suffer from the-MarkusAm 06.04.2026 um 02:40 schrieb Chen Liang: Hello, I don't think this is a good proposal. Nowadays, primitive arrays are used as an efficient storage format,mutability issue as all the arrays: when you pass arrays across trustboundaries, you should always sanitize the input arrays and output arrays. In this sanitization process, we usually can convert these arrays to moreefficient formats, such as performing a Arrays.sort call, and the subsequent search can be a Arrays.binarySearch(a, key) >= 0 call.Regards, Chen Liang ________________________________ From: Arnav Somaghatta <[email protected]> Sent: Sunday, April 5, 2026 9:58 AM To: [email protected] <[email protected]> Subject: [External] : Proposal for primitive arrays Hello, My name is Arnav, and I am a 14-year-old developer who is interestedin contributing to the OpenJDK. I would like to propose adding a set of "contains" methods to the java.util.Arrays class.a specific value needs a large manual for loop, converting to a List, orCurrently, to check if a primitive array (like char or int) containsusing a very complicated Stream pipeline.My idea proposes adding a series of static "contains" methods for all8 primitive types to java.util.Arrayspublic static boolean contains(char[] a, char key) public static boolean contains(int[] a, int key) public static boolean contains(byte[] a, byte key) public static boolean contains(short[] a, short key) public static boolean contains(long[] a, long key) public static boolean contains(float[] a, float key) public static boolean contains(double[] a, double key) public static boolean contains(boolean[] a, boolean key) Why should you consider this proposal? Arrays.contains(myChars, 'a') is much more readable than workaroundslike a for loop.For newer Java learners, not having a "contains" method on arrays isvery confusing when you compare it to other languages like Python.8 primitive overloads and any of the corresponding unit tests. I am also looking for a sponsor to help me navigate the OCA process for minors andIf this proposal is viewed favorably, I am willing to make all of theguide me through the technical process of pushing these changes.Core Libs group would support. Thank you so much for considering my ideaI look forward to your feedback on whether this is a direction theand reading through this email, your support means a lot to me!Thank you, Arnav
