On Fri, 24 Jan 2020 17:16:13 GMT, Frederic Thevenet 
<github.com+7450507+ftheve...@openjdk.org> wrote:

>> I don't, to be honest. 
>> The results for some dimensions (not always the same) can vary pretty 
>> widely from one run to another, despite all my efforts to repeat results and 
>> remove outliers.
>> Out of curiosity, I also tried to eliminate the GC as a possible culprit by 
>> running it with Epsilon, but it seems to make no significant difference.
>> I ran that test on a laptop with integrated Intel graphics and no dedicated 
>> VRAM (Intel UHD Graphics 620), though, so this might be why. 
>> Maybe someone could try running the bench on hardware with a discrete GPU?
> As to why the tiling version is significantly slower, though, I 
> do have a pretty good idea; as Kevin hinted, the pixel copy into a temporary 
> buffer before copying into the final image is where most of the extra time is 
> spent.
> The reason why it is so much slower is a bit of a pity, though; 
> profiling a run of the benchmark shows that a lot of time is spent in 
> `IntTo4ByteSameConverter::doConvert`. As it turns out, the reason for this is 
> that, under Windows and the D3D pipeline anyway, the `WritableImage` used to 
> collate the tiles and the tiles returned from the RTTexture have different 
> pixel formats (IntARGB for the tile and byteBGRA for the `WritableImage`).
> So if we could use a `WritableImage` with an IntARGB pixel format as the 
> recipient for the snapshot (at least as long as no image was provided by the 
> caller), I suspect that the copy would be much faster.
> Unfortunately it seems the only way to choose the pixel format for a 
> `WritableImage` is to initialize it with a `PixelBuffer`, but then one can no 
> longer use a `PixelWriter` to update it, and it doesn't seem to me that there 
> is a way to safely access the `PixelBuffer` from an image's reference alone.
> I'm pretty new to this code base though (which is quite large; I haven't read 
> it all quite yet... ;-), so hopefully there's a way to do that that has 
> simply eluded me so far.
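
To make the format mismatch concrete, here is a small plain-Java sketch (no JavaFX involved) of what the int-to-byte conversion has to do for every pixel. Unpacking an ARGB int low byte first yields the bytes in B, G, R, A order, which is why the two formats are "the same" yet still need a per-pixel pass between an `int[]` and a `byte[]`. The class and method names are hypothetical and only serve the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; not part of the JavaFX sources.
class ArgbToBgraDemo {

    // Unpack one ARGB int per pixel into four bytes, low byte first.
    static byte[] unpack(int[] argb) {
        byte[] out = new byte[argb.length * 4];
        int d = 0;
        for (int pixel : argb) {
            out[d++] = (byte) (pixel);        // blue
            out[d++] = (byte) (pixel >> 8);   // green
            out[d++] = (byte) (pixel >> 16);  // red
            out[d++] = (byte) (pixel >> 24);  // alpha
        }
        return out;
    }

    // Same byte layout, obtained through a little-endian view of a byte buffer.
    static byte[] unpackViaBuffer(int[] argb) {
        ByteBuffer buf = ByteBuffer.allocate(argb.length * 4)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.asIntBuffer().put(argb);
        return buf.array();
    }

    public static void main(String[] args) {
        // alpha=0x80, red=0xFF, green=0x40, blue=0x20
        byte[] bytes = unpack(new int[] { 0x80FF4020 });
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

If the destination could hold ints directly, this whole pass would presumably reduce to a bulk array copy, which is the saving hoped for above.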

This is a bit naive, but what if you parallelize the code there? I didn't test 
that this produces the correct result, but you can try to replace the loops 
with this:
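IntStream.range(0, h).parallel().forEach(y -> {
    // A lambda cannot mutate captured locals, so each row derives its own
    // offsets from y; srcoff, dstoff and the scan gaps must be effectively final.
    int src = srcoff + y * (w + srcscanints);
    int dst = dstoff + y * (w * 4 + dstscanbytes);
    for (int x = 0; x < w; x++) {
        int pixel = srcarr[src++];
        dstarr[dst++] = (byte) (pixel      );
        dstarr[dst++] = (byte) (pixel >>  8);
        dstarr[dst++] = (byte) (pixel >> 16);
        dstarr[dst++] = (byte) (pixel >> 24);
    }
});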
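
For anyone who wants to try it outside the code base, here is a self-contained sketch of the row-parallel idea, checked against a plain sequential loop. The class name `RowParallelConvert` and both method names are made up for the example and are not part of the JavaFX sources. The gap parameters are assumed to be the leftover elements after each row of `w` pixels, matching the way the offsets are advanced at the end of each row in the snippet above. I have not measured whether this is actually faster; the fork/join overhead may well outweigh the gain for small tiles.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical stand-alone class; not part of the JavaFX code base.
class RowParallelConvert {

    // Sequential reference: unpack each int pixel into 4 bytes, low byte first.
    // srcGap and dstGap are the unused elements after each row of w pixels.
    static void convertSequential(int[] src, int srcOff, int srcGap,
                                  byte[] dst, int dstOff, int dstGap,
                                  int w, int h) {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int pixel = src[srcOff++];
                dst[dstOff++] = (byte) (pixel);
                dst[dstOff++] = (byte) (pixel >> 8);
                dst[dstOff++] = (byte) (pixel >> 16);
                dst[dstOff++] = (byte) (pixel >> 24);
            }
            srcOff += srcGap;
            dstOff += dstGap;
        }
    }

    // Parallel over rows only. Each row computes its own start offsets from y,
    // so no shared counter is mutated and rows write to disjoint byte ranges.
    static void convertParallel(int[] src, int srcOff, int srcGap,
                                byte[] dst, int dstOff, int dstGap,
                                int w, int h) {
        int srcStride = w + srcGap;
        int dstStride = w * 4 + dstGap;
        IntStream.range(0, h).parallel().forEach(y -> {
            int s = srcOff + y * srcStride;
            int d = dstOff + y * dstStride;
            for (int x = 0; x < w; x++) {
                int pixel = src[s++];
                dst[d++] = (byte) (pixel);
                dst[d++] = (byte) (pixel >> 8);
                dst[d++] = (byte) (pixel >> 16);
                dst[d++] = (byte) (pixel >> 24);
            }
        });
    }

    public static void main(String[] args) {
        int w = 64, h = 48, srcGap = 3, dstGap = 5;
        int[] src = new int[h * (w + srcGap)];
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 0x9E3779B9; // arbitrary test pattern
        }
        byte[] expected = new byte[h * (w * 4 + dstGap)];
        byte[] actual = new byte[expected.length];
        convertSequential(src, 0, srcGap, expected, 0, dstGap, w, h);
        convertParallel(src, 0, srcGap, actual, 0, dstGap, w, h);
        System.out.println("identical: " + Arrays.equals(expected, actual));
    }
}
```

Only the outer loop is parallelized here; nesting a second parallel stream over `x` would split the work into units of four byte stores each, which is very unlikely to pay off.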


