On Fri, 30 Jan 2026 08:33:57 GMT, Emanuel Peter <[email protected]> wrote:

>>> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and 
>>> > > one that shows a regression. How are we to proceed?
>>> > > It seems that without loads [#28442 
>>> > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>>> > >  this patch leads to a regression.
>>> > > Only if there is a load from one of the last elements that the 
>>> > > `Arrays.fill` stored to with a masked operation do we get a slowdown, 
>>> > > because of missing store-to-load forwarding. If we instead started 
>>> > > loading from the first elements, those would probably already be in 
>>> > > cache, and we would not have any latency issues, right?
>>> > > But is it not rather an edge-case that we load from the last elements 
>>> > > immediately after the `Arrays.fill`? At least for longer arrays, it 
>>> > > seems an edge case. For short arrays it is probably more likely that we 
>>> > > access the last element soon after the fill.
>>> > > It does not seem like a trivial decision to me if this patch is an 
>>> > > improvement or not. What do you think @vamsi-parasa ?
>>> > > @sviswa7 @dwhite-intel You already approved this PR. What are your 
>>> > > thoughts here?
>>> > 
>>> > 
>>> > @eme64 My thoughts are to go ahead with this PR replacing masked stores 
>>> > with scalar tail processing. As we have seen from 
>>> > https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big 
>>> > regression in certain scenarios: accessing elements just written, or any 
>>> > other adjacent data that happens to fall in the masked store range.
>>> 
>>> @sviswa7 But once this PR is integrated, I could file a performance 
>>> regression with the benchmarks from [up 
>>> here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). 
>>> So what is the argument for which choice is better, given that we have a 
>>> mix of speedups and regressions going either way, and both are probably 
>>> in the 10-20% range?
>> 
>> @eme64 You have a point there, but if you see the performance numbers for 
>> ByteMatrix.java (from JDK-8349452) in the PR description above we are 
>> talking about a recovery of 3x or so. The ByteMatrix.java is doing only 
>> Arrays.fill() on individual arrays of a 2D array. The individual arrays 
>> happened to be allocated alongside each other by the JVM and the next store 
>> sees stalls due to the masked store of previous array initialization. That 
>> was the reason to look for a solution without masked stores.
>
> @sviswa7 Ah right, the ByteMatrix.java is yet another case. There, we don't 
> seem to have any loads.
> 
>> The individual arrays happened to be allocated alongside each other by the 
>> JVM and the next store sees stalls due to the masked store of previous array 
>> initialization.
> 
> Ah, that sounds interesting! Is there some tool that would let me see that it 
> was due to masked store stalls?
> My (probably uneducated) guess would have been that it is just because a 
> single-element store is much cheaper than a masked operation. If you 
> only access one or two elements, then a masked store is not yet 
> profitable. What if the masked stores were a bit further away, say a 
> cacheline away? Would that be significantly faster, because there are no 
> stalls? Or still slow because of the inherent higher cost of masked 
> operations?
> 
> If we take the ByteMatrix.java benchmark: how would the performance change if 
> we increase the size of the arrays (height)? Is there some height below 
> which the non-masked solution is faster, and above which the masked one is 
> faster?
> 
> Would it be a solution to use scalar stores for very small arrays, and only 
> use the masked loop starting at a certain threshold?
> 
> -----------------------
> 
> I would like to see a summary of all the benchmarks we have here, and in 
> which cases we get speedups/slowdowns, and for which reason. Maybe listing 
> those reasons lets us see some third option we did not yet consider. And 
> listing all the reasons and code shapes may help us find out which shapes we 
> care about most, and then come to a decision that weighs the pros and 
> cons.
> 
> We should also document our decision nicely in the code, so that if someone 
> gets a regression in the future, we can see if we had already considered that 
> code shape.
> 
> Does that make sense? Or do you have a better idea how to make a good 
> decision here?

Hi Emanuel (@eme64),

Based on the discussion, I will run further experiments to see if the 
regressions can be addressed and get back to you at a later date.
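
For reference, the code shapes discussed in this thread can be sketched 
roughly as below. This is only an illustration under stated assumptions: the 
class name, the method names, and the `SCALAR_THRESHOLD` value are 
hypothetical and do not come from the PR or from ByteMatrix.java, and the 
sketch is not a benchmark harness.

```java
import java.util.Arrays;

// Hypothetical illustration class; none of these names come from the PR.
final class FillShapes {
    // Assumed cutoff, for illustration only. A real value would have to be measured.
    static final int SCALAR_THRESHOLD = 16;

    // Shape 1: fill, then immediately load one of the last elements written.
    // This is the shape where a masked tail store was reported to hurt.
    // Assumes a non-empty array.
    static byte fillThenLoadLast(byte[] a, byte v) {
        Arrays.fill(a, v);
        return a[a.length - 1];
    }

    // Shape 2: fill, then load the first element, which is not in the tail.
    // Assumes a non-empty array.
    static byte fillThenLoadFirst(byte[] a, byte v) {
        Arrays.fill(a, v);
        return a[0];
    }

    // Shape 3: ByteMatrix-like. Only stores, no loads: each row of a 2D array
    // is filled in turn, and rows may sit next to each other on the heap.
    static void fillRows(byte[][] m, byte v) {
        for (byte[] row : m) {
            Arrays.fill(row, v);
        }
    }

    // Java-level picture of the threshold idea: plain scalar stores for very
    // small arrays, the library fill otherwise. The real decision would live
    // in the intrinsic stub, and the JIT may treat this loop as a fill idiom
    // as well, so this only shows the intended control flow.
    static void hybridFill(byte[] a, byte v) {
        if (a.length < SCALAR_THRESHOLD) {
            for (int i = 0; i < a.length; i++) {
                a[i] = v;
            }
        } else {
            Arrays.fill(a, v);
        }
    }

    public static void main(String[] args) {
        byte[] a = new byte[37];
        System.out.println(fillThenLoadLast(a, (byte) 1));
        System.out.println(fillThenLoadFirst(a, (byte) 2));
        byte[][] m = new byte[4][5];
        fillRows(m, (byte) 3);
        System.out.println(Arrays.toString(m[3]));
        byte[] small = new byte[3];
        hybridFill(small, (byte) 4);
        System.out.println(Arrays.toString(small));
    }
}
```

Measuring any of these shapes meaningfully would need a proper harness such 
as JMH, with array sizes on both sides of whatever threshold is being 
considered.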

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825349231
