Re: [External] : Proposal for primitive arrays

David Alayachew Mon, 13 Apr 2026 04:45:59 -0700

Glad you can bounce back, as well as see the point behind these things.

Just for context, you are walking in the footsteps of some big names lol.
The entire reason why the Eclipse Collections library and the Apache
Commons libraries were created was because the developers of those
libraries tried to do something very similar to what you did. They tried to
add some very useful libraries to the existing JDK, and while those
libraries clearly had a lot of value, they didn't clear the bar for being
in the JDK. Doesn't change the fact that these libraries became
foundational for Java developers everywhere.


So hey, maybe a few years down the road, I'll be putting a dependency from
io.github.arnav in my pom.xml. You still did point out a valid use case,
and one that is worth meeting.

Finally, you mentioned that you intend to return with future proposals,
which is good, so let me give some advice on how best to accomplish that.
Everyone you have talked to on this list (Chen especially, he has more than
the rest of us combined lol) has gotten stuff merged into the JDK, and they
usually follow the same pattern flow.

At the end of the day, the best proposals are those that start off not as a
proposal, but a problem declaration.

You start by pointing out a pain point you ran into while using the JDK to
do a non-trivial project. The sweet spot is a project that would take you
at least a couple days to build from the ground up. Highlight those pain
points on the relevant mailing list (if you don't know which list, ask on
the discuss list), and see if other people ran into this and how they got
past it. The goal here is to try and ensure that this actually is a full
blown deficiency, rather than a speed bump. If multiple others have run
into this, and no effective solutions are proposed, great, you have
completed Step 1. But please understand that in order for step 1 to be a
success, there really needs to be multiple people who agree with you that
there is a problem. Silence is a response, and it does not support you, so
try not to get dissuaded if this is where things end up. And of course,
feel free to poke the list once or twice, but anything past that will
probably be considered spamming.

Step 2 is to propose that this pain point (whether bug or an enhancement)
get added to the Java Bug System (JBS) tracker. This is usually the point
where someone from the OpenJDK will actually respond with their thoughts if
they haven't already. Only people with Author (?) or higher status can make
JBS tickets, and once it is on JBS, it is officially on the radar. Not
necessarily a priority, but definitely on the radar. Assuming they don't
disagree, you will now have a submission on JBS with a corresponding id.
Step 2 complete.

Once that is done, Step 3 is to go back to the mailing list and propose to
solve that JBS submission with your idea. Again, go to the mailing list.
This is NOT the time to make a PR. Suggest your idea and gauge feedback.
And same as step 1, silence means no -- you need multiple people agreeing
with you that your idea makes sense. That is how you pass step 3.

Then finally, step 4 is to make the PR. The skara bot gives you guidance on
what to do there, so just do what it says.


On Mon, Apr 13, 2026, 6:37 AM Arnav Somaghatta <[email protected]>
wrote:

> Hi David, Markus, and all,
>
>
> Thank you for the thoughtful feedback, and for encouraging me to validate
> the idea with JMH.
>
> After extending the benchmarks with indexed and simple unrolled linear
> scan variants, the results still support the broader point David raised:
> while direct linear membership checks can clearly win for one or only a few
> lookups on small unsorted primitive arrays, the crossover point where the
> sort once + binarySearch becomes better is still low enough that this
> likely does not justify the API and maintenance cost of adding eight new
> overloads to Arrays.
>
> The unrolled scan numbers were still interesting from a performance
> perspective, especially in miss heavy cases:
>
>    - size 32 miss: linear ~106 ns, unrolled ~23 ns
>    - size 64 miss: linear ~188 ns, unrolled ~40 ns
>    - size 128 miss: linear ~335 ns, unrolled ~77 ns
>
> So I do think the benchmarks at least show there is meaningful
> optimization headroom in the “small unsorted primitive array” case, even if
> that use case is too narrow for the JDK API bar.
>
> I understand the maintenance and surface-area concerns much better now,
> and I really appreciate the time everyone took to explain not just the
> “no”, but the reasoning behind where the bar sits for Arrays.
>
> Markus, David: thank you especially for pushing me toward measurement and
> methodology. I learned a lot from building these JMH benchmarks and from
> seeing how the API discussion shifts once the data is on the table.
>
> I really like the Apache Commons / Guava direction and would be happy to
> explore contributing this idea there instead, where a utility focused API
> surface may make more sense. By the way, I am not done here! I will be
> coming back later with more proposals as I have acquired new knowledge and
> will back up my proposals with evidence, stay tuned! :)
>
>
> Best regards,
>
> Arnav
>
> On Apr 12, 2026, at 12:25 PM, David Alayachew <[email protected]>
> wrote:
>
> 
> By all means, try out the vector approach.
>
> But no, those numbers were about what I expected.
>
> To give context, the entire point of the Arrays class is to make doing the
> right thing performant and easy, so you don't have to reinvent the wheel.
> As a result, the most common use cases, are covered by this API (sorting,
> searching, etc.).
>
> And as your own benchmarks showed (at least without vectors), in the
> overwhelming majority of cases, sorting beforehand is more performant. And
> even in the few places where it isn't, it's not that bad.
>
> So, adding this api you suggested would only help in a few cases, and
> would actually hurt in most others. Obviously, one should use the right
> tool for the job, but it's also the JDK's job to provide the tools to give
> everyone a solid baseline to build other tools. Frankly, telling people to
> sort then search is really not bad enough to justify this new API.
>
> Hopefully you see what we are saying now @Arnav Somaghatta
> <[email protected]>? We're not against you. But the inclusion
> of an API, even a small one, has weight. Not just in API surface, but
> primarily in maintenance cost. The methods in this Arrays class are
> reviewed meticulously and given a lot of attention. Therefore, the wider
> the API in this class, the more time and attention needs to be spent. And
> thus, a very high bar that a method needs to cross to be added to this API.
> As is, your benchmarks demonstrate (without vector) that they just don't
> clear that bar. And it's not like we could put these methods anywhere else
> in the JDK -- it is about Arrays, and therefore, this is the class that
> they belong in.
>
> I do think that @Markus KARG <[email protected]> had a good
> suggestion about proposing this to the Apache team. They don't have an
> entire JDK to maintain, so while they pay just as much attention to their
> API as the OpenJDK maintainers do, they have a much smaller API to manage,
> allowing them to have more attention to spare. Therefore, your API sounds
> ideal for that or for Guava or for a few other places. I encourage you to
> show them your benchmarks. Let us know if you need help with that.
>
> On Sun, Apr 12, 2026, 4:53 AM Markus KARG <[email protected]> wrote:
>
>> Arnav,
>>
>> I think the most interesting point for your further research would be if
>> vectorization would bring a considerable improvement. If so, providing a
>> PR for Apache Commons would be justified.
>>
>> -Markus
>>
>>
>> Am 11.04.2026 um 18:55 schrieb Arnav Somaghatta:
>> > Hi all,
>> >
>> > Following the earlier discussion, I took David's suggestion and made
>> > some JMH benchmarks to quantify the tradeoff between a linear
>> > contains(char[] a, char key) on unsorted primitive arrays and the
>> > "sort once + Arrays.binarySearch" approach for repeated lookups.
>> >
>> > Markus, I also want to thank you for your continued encouragement and
>> > your offer to help with JMH, the support over the past week has meant
>> > a lot to me as I learn how to do this properly! :)
>> >
>> > My benchmark setup where I used the JMH Maven archetype and created
>> > two benchmark classes:
>> >  1. MyBenchmark, single searches:
>> >    - linearContains_* on the original unsorted char[]
>> >    - binarySearchSorted_* on a pre sorted copy
>> >    - sortThenBinarySearch_* which clones, sorts, and then searches on
>> > each call
>> >
>> >  2. RepeatedSearchBenchmark, models David's scenario:
>> >    - Start with an unsorted char[] whose length is 8, 16, 32, 64, or
>> 128.
>> >    - For each (size, searches) pair, compare:
>> >       - linear_N_*: N linear searches on the same unsorted array
>> >       - sortOnceThenBinary_N_*: sort a clone once, then do N binary
>> > searches on the sorted copy
>> >    - searches is one of 1, 2, 5, 10, or 100.
>> > I measured both "hit" (key guaranteed present) and "miss" (key
>> > guaranteed not present) cases.
>> >
>> -----------------------------------------------------------------------------------------
>> > Both benchmarks use @BenchmarkMode(Mode.AverageTime) with
>> > TimeUnit.NANOSECONDS, and JMH options -wi 3 -i 5 -f 1 -w 1s -r 1s. Key
>> > data: repeated searches, hit case. Here is the key part of the
>> > RepeatedSearchBenchmark hit results (ns/op):
>> >
>> > Benchmark                              (searches)  (size) Mode
>> > Cnt      Score       Error  Units
>> > RepeatedSearchBenchmark.linear_N_hit            1       8 avgt
>> > 5     20.921 ±     0.413  ns/op
>> > RepeatedSearchBenchmark.linear_N_hit            1      16 avgt
>> > 5     76.929 ±   121.552  ns/op
>> > RepeatedSearchBenchmark.linear_N_hit            1      32 avgt
>> > 5     26.490 ±    51.510  ns/op
>> > RepeatedSearchBenchmark.linear_N_hit            1      64 avgt    5
>> > 108.940 ±   212.914  ns/op
>> > RepeatedSearchBenchmark.linear_N_hit            1     128 avgt    5
>> > 124.535 ±     2.990  ns/op
>> >
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit 1       8  avgt
>> > 5     94.878 ±    29.396  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit       1 16  avgt
>> > 5    160.161 ±    60.178  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit       1 32  avgt
>> > 5    362.436 ±    61.600  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit       1 64  avgt
>> > 5    994.649 ±   305.249  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit       1 128  avgt
>> > 5   3775.735 ±  1371.752  ns/op
>> >
>> > For a single search, linear search is faster than "sort once then
>> > binary search" at all tested sizes 8-128, most of the times by a large
>> > factor, which matches the expectation that sort cost dominates at such
>> > small n.
>> >
>> -----------------------------------------------------------------------------------------
>> > With more repeated searches, the picture changes gradually. For example:
>> >
>> > RepeatedSearchBenchmark.linear_N_hit                 10 16  avgt
>> > 5    340.180 ±   663.784  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit     10 16  avgt
>> > 5    267.033 ±    77.084  ns/op
>> >
>> > RepeatedSearchBenchmark.linear_N_hit                100 32  avgt
>> > 5   2175.411 ±  4591.489  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit    100 32  avgt
>> > 5   1088.907 ±   228.760  ns/op
>> >
>> > RepeatedSearchBenchmark.linear_N_hit                100 64  avgt    5
>> > 15563.332 ± 29450.784  ns/op
>> > RepeatedSearchBenchmark.sortOnceThenBinary_N_hit    100 64  avgt
>> > 5   1472.852 ±   353.327  ns/op
>> >
>> -----------------------------------------------------------------------------------------
>> > So, very briefly:
>> >  - For small N (1-5 searches per array), linear search on the unsorted
>> > char[] is faster or comparable for all tested sizes up to 128.
>> >  - For larger N (e.g. 100 searches per array), sorting once and
>> > reusing a sorted copy does become faster than repeating linear search,
>> > especially for size >= 16, as theory would predict.
>> > The "miss" variants show the same pattern, just with larger absolute
>> > times because linear search typically scans the whole array on a miss.
>> >
>> > Single searches: sort then binary vs direct linear
>> > In MyBenchmark, I also measured the simpler "one search per call" case:
>> > - linearContains_* on the unsorted array.
>> > - binarySearchSorted_* on a pre sorted copy (best case when data is
>> > already kept sorted).
>> > - sortThenBinarySearch_* (fresh sort + binary search per call).
>> > As expected, if the array is already sorted, binary search is
>> > competitive at these sizes, but when you pay the sort cost per call,
>> > linear search generally wins for 8-128 elements.
>> >
>> > I understand that this is well explored territory and that sorting +
>> > binary search is known to win beyond a low inflection point. My
>> > measurements seem consistent with that:
>> >  - For tiny primitive arrays with only a handful of lookups per array,
>> > a simple linear contains(char[] a, char key) is not obviously
>> > dominated by "sort then binarySearch"; it can be faster or at least
>> > clearly competitive.
>> >  - For moderate numbers of lookups (e.g., 100 per array), "sort once +
>> > binarySearch" starts to win in a way that makes the linear helper less
>> > attractive, especially as size grows.
>> >
>> > My proposal is not trying to replace the sort + binarySearch, but to
>> > provide a readable entry point for the small unsorted array, small N
>> > where users currently write their own loops anyway.
>> > If you see flaws in the benchmark methodology or important scenarios
>> > I've missed, I would really appreciate suggestions on how to improve
>> > it, especially to make sure I am being fair to the "sort once then
>> > reuse" side of the tradeoff, since that seems central to the API bar
>> > discussion.
>> >
>> > Best regards,
>> > Arnav
>> >
>> >> On Apr 11, 2026, at 5:22 AM, Markus KARG <[email protected]>
>> wrote:
>> >>
>> >>
>> >> Arnav,
>> >>
>> >> I like to amend: This is a great opportunity for you to learn about
>> >> benchmarking using JMH. If you need help with this, feel free to
>> contact
>> >> me off-list.
>> >>
>> >> Having said that, I want to propose one more option for you: If OpenJDK
>> >> finally rejects your vectorized implementation even in case you proof
>> >> with JMH that your implementation is faster than the proposed sorted
>> >> binary search alternative, you might like to donate the vectorized
>> >> implementation to Apache. There already is ArrayUtils.contains(...)
>> >> which is based on a simple for-loop (find indexOf(...) on Github), and
>> >> vectorizing it might be of interest for Apache:
>> >>
>> https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/ArrayUtils.html#contains(byte%5B%5D,byte).
>>
>> >>
>> >> Note that any further discussion about Apache is certainly off-topic on
>> >> this list.
>> >>
>> >> Regards
>> >>
>> >> -Markus
>> >>
>> >>
>> >> Am 11.04.2026 um 04:45 schrieb David Alayachew:
>> >>>> Would a prototype (for example starting with char[]) be useful for
>> >>> debating if the API bar could be met?
>> >>>
>> >>> Not by itself, but metrics from it might. If the tradeoff threshold
>> for
>> >>> searching an unsorted array multiple times was high enough compared to
>> >>> sorting then searching that array multiple times, then that might sway
>> >>> opinions.
>> >>>
>> >>> Sadly, what you are trying to do is extremely well explored territory,
>> >>> which is why Chen and others are telling you this is not going to
>> >>> work out.
>> >>> We have a VERY good idea of what the answer will be -- the
>> >>> inflection point
>> >>> where sorting an array beforehand starts to outperform searching it
>> >>> unsorted is embarrassingly low, to the point where it makes no sense
>> to
>> >>> even include the API in the first place.
>> >>>
>> >>> It's extremely unlikely that the answer has changed since most of us
>> >>> checked, but by all means, run the numbers and see what you get. At
>> >>> the end
>> >>> of the day, evidence born from testing something out is the strongest
>> >>> argument you can make. It's just extremely unlikely that the
>> >>> evidence will
>> >>> work in your favor.
>> >>>
>> >>> On Fri, Apr 10, 2026 at 4:04 PM Arnav Somaghatta <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Hello Chen,
>> >>>>
>> >>>> Thank you for clarifying the API surface concern. I agree that the
>> >>>> bar for
>> >>>> additions to Arrays should remain high.
>> >>>>
>> >>>> The reason I still think this may be worth considering is that the
>> >>>> repeated manual loop appears frequently for small unsorted primitive
>> >>>> arrays, and a dedicated API would provide both readability and also
>> >>>> a very
>> >>>> clear optimization hook for future implementations, including
>> >>>> vectorized
>> >>>> fast paths where appropriate.
>> >>>>
>> >>>> I don't want to unnecessarily expand Arrays, but I want to explore
>> >>>> whether
>> >>>> this specific operation is common and fundamental enough to justify a
>> >>>> standard library entry point.
>> >>>>
>> >>>> Markus’s earlier point about possible optimization opportunities for
>> >>>> unsorted primitive arrays especially made me think more that this
>> >>>> may be a
>> >>>> direction that is worth validating a little more.
>> >>>>
>> >>>> Would a prototype (for example starting with char[]) be useful for
>> >>>> debating if the API bar could be met?
>> >>>>
>> >>>> Regards,
>> >>>> Arnav
>> >>>>
>> >>>>> On Apr 10, 2026, at 3:55 PM, Chen Liang <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>> 
>> >>>>> Hello all,
>> >>>>> I still don't think you should add these APIs.
>> >>>>> This linear search is easily written by users.
>> >>>>> As I said again, you shouldn't use a linear array for passing such
>> >>>>> data
>> >>>> around. This usually represents a failure in program structure
>> design.
>> >>>>> Vectorization can be done with the Vector API.
>> >>>>> Arrays is not the home of a million garbage APIs. If everyone says
>> >>>>> "oh I
>> >>>> have a proposal" and add something, we can see this class blow up
>> >>>> really
>> >>>> quickly.
>> >>>>> So there is a bar for API additions.
>> >>>>> This "contains" proposal clearly doesn't meet that bar, as it can be
>> >>>> replaced by more high-level APIs like sort+binarySearch, or some
>> >>>> sort of
>> >>>> String KMP algorithm applied onto arrays.
>> >>>>> Regards,
>> >>>>> Chen Liang
>> >>>>> From: Arnav Somaghatta <[email protected]>
>> >>>>> Sent: Friday, April 10, 2026 5:45 AM
>> >>>>> To: Markus KARG <[email protected]>
>> >>>>> Cc: [email protected] <[email protected]>
>> >>>>> Subject: Re: [External] : Proposal for primitive arrays
>> >>>>>
>> >>>>> Hello all,
>> >>>>>
>> >>>>> Thank you all for the feedback so far, and Markus especially for the
>> >>>> helpful perspective on the potential optimization and vectorization
>> >>>> opportunities for unsorted primitive array searches.
>> >>>>> I have now submitted my OCA form as well and waiting for it to get
>> >>>> approved.
>> >>>>> Based on the discussion so far, would it make sense for me to begin
>> >>>> working on an initial prototype patch (starting with the char[]
>> >>>> overload
>> >>>> and its tests) so the API shape and implementation can be reviewed
>> >>>> better?
>> >>>>> And Markus, if this direction still seems aligned, would you be
>> >>>> comfortable acting as sponsor/reviewer for an initial contribution?
>> >>>>> Regards,
>> >>>>> Arnav
>> >>>>>
>> >>>>>> On Apr 6, 2026, at 6:00 AM, Markus KARG <[email protected]>
>> >>>> wrote:
>> >>>>>> Chen,
>> >>>>>>
>> >>>>>> how does this explain that the propopsal "is not good"?
>> >>>>>>
>> >>>>>> It is not on us to decide what others do with the tools we
>> >>>>>> provide, as
>> >>>> long as we DO provide them. IF people are using primitive arrays,
>> >>>> and IF
>> >>>> they need to search them in an UNSORTED fashion, Arnav's proposed
>> >>>> API would
>> >>>> allows us to apply vectorization. Hence, the result COULD be
>> >>>> multiple times
>> >>>> faster than custom search loops. As long as primitive arrays DO
>> >>>> exist, I
>> >>>> would rather say, his proposal "is good".
>> >>>>>> -Markus
>> >>>>>>
>> >>>>>>
>> >>>>>>> Am 06.04.2026 um 02:40 schrieb Chen Liang:
>> >>>>>>> Hello, I don't think this is a good proposal.
>> >>>>>>> Nowadays, primitive arrays are used as an efficient storage
>> format,
>> >>>> much more compact than reference arrays. However, they all suffer
>> >>>> from the
>> >>>> mutability issue as all the arrays: when you pass arrays across trust
>> >>>> boundaries, you should always sanitize the input arrays and output
>> >>>> arrays.
>> >>>> In this sanitization process, we usually can convert these arrays
>> >>>> to more
>> >>>> efficient formats, such as performing a Arrays.sort call, and the
>> >>>> subsequent search can be a Arrays.binarySearch(a, key) >= 0 call.
>> >>>>>>> Regards,
>> >>>>>>> Chen Liang
>> >>>>>>> ________________________________
>> >>>>>>> From: Arnav Somaghatta <[email protected]>
>> >>>>>>> Sent: Sunday, April 5, 2026 9:58 AM
>> >>>>>>> To: [email protected] <[email protected]>
>> >>>>>>> Subject: [External] : Proposal for primitive arrays
>> >>>>>>>
>> >>>>>>> Hello,
>> >>>>>>> My name is Arnav, and I am a 14-year-old developer who is
>> interested
>> >>>> in contributing to the OpenJDK. I would like to propose adding a set
>> of
>> >>>> "contains" methods to the java.util.Arrays class.
>> >>>>>>> Currently, to check if a primitive array (like char or int)
>> contains
>> >>>> a specific value needs a large manual for loop, converting to a
>> >>>> List, or
>> >>>> using a very complicated Stream pipeline.
>> >>>>>>> My idea proposes adding a series of static "contains" methods
>> >>>>>>> for all
>> >>>> 8 primitive types to java.util.Arrays
>> >>>>>>> public static boolean contains(char[] a, char key)
>> >>>>>>> public static boolean contains(int[] a, int key)
>> >>>>>>> public static boolean contains(byte[] a, byte key)
>> >>>>>>> public static boolean contains(short[] a, short key)
>> >>>>>>> public static boolean contains(long[] a, long key)
>> >>>>>>> public static boolean contains(float[] a, float key)
>> >>>>>>> public static boolean contains(double[] a, double key)
>> >>>>>>> public static boolean contains(boolean[] a, boolean key)
>> >>>>>>>
>> >>>>>>> Why should you consider this proposal?
>> >>>>>>> Arrays.contains(myChars, 'a') is much more readable than
>> workarounds
>> >>>> like a for loop.
>> >>>>>>> For newer Java learners, not having a "contains" method on arrays
>> is
>> >>>> very confusing when you compare it to other languages like Python.
>> >>>>>>> If this proposal is viewed favorably, I am willing to make all
>> >>>>>>> of the
>> >>>> 8 primitive overloads and any of the corresponding unit tests. I am
>> >>>> also
>> >>>> looking for a sponsor to help me navigate the OCA process for
>> >>>> minors and
>> >>>> guide me through the technical process of pushing these changes.
>> >>>>>>> I look forward to your feedback on whether this is a direction the
>> >>>> Core Libs group would support. Thank you so much for considering my
>> >>>> idea
>> >>>> and reading through this email, your support means a lot to me!
>> >>>>>>> Thank you,
>> >>>>>>> Arnav
>> >
>
>

Re: [External] : Proposal for primitive arrays

Reply via email to