On Fri, 24 Jan 2020 17:16:13 GMT, Frederic Thevenet
<[email protected]> wrote:
>> I don't, to be honest.
>> The results for some dimensions (not always the same) can vary pretty
>> widely from one run to another, despite all my effort to repeat results and
>> remove outliers.
>> Out of curiosity, I also tried to eliminate the GC as a possible culprit by
>> running it with the Epsilon GC, but it seems to make no significant difference.
>> I ran that test on a laptop with Integrated Intel graphics and no dedicated
>> vram (Intel UHD Graphics 620), though, so this might be why.
>> Maybe someone could try and run the bench on hardware with a discrete GPU?
>
> As for why the tiling version is significantly slower, though, I
> do have a pretty good idea; as Kevin hinted, the pixel copy into a temporary
> buffer before copying into the final image is where most of the extra time is
> spent.
> The reason why it is so much slower is a little bit of a pity, though;
> profiling a run of the benchmark shows that a lot of time is spent in
> `IntTo4ByteSameConverter::doConvert`. As it turns out, the reason for this is
> that, under Windows and the D3D pipeline anyway, the `WritableImage` used to
> collate the tiles and the tiles returned from the RTTexture have different
> pixel formats (IntARGB for the tile and byteBGRA for the `WritableImage`).
> So if we could use a `WritableImage` with an IntARGB pixel format as the
> recipient for the snapshot (at least as long as no image was provided by the
> caller), I suspect that the copy would be much faster.
> Unfortunately it seems the only way to choose the pixel format for a
> `WritableImage` is to initialize it with a `PixelBuffer`, but then one can no
> longer use a `PixelWriter` to update it, and it doesn't seem to me that there
> is a way to safely access the `PixelBuffer` from an image's reference alone.
> I'm pretty new to this code base though (which is quite large; I haven't read
> it all quite yet... ;-), so hopefully there's a way to do that that has
> simply eluded me so far.

> profiling a run of the benchmark shows that a lot of time is spent in
> `IntTo4ByteSameConverter::doConvert`

This is a bit naive, but what if you parallelize the code there? I didn't test
that this produces the correct result, but you can try to replace the loops
with this:
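
    // Derive each row's offsets from y: a lambda cannot mutate the captured
    // srcoff/dstoff, and shared counters would race across threads anyway.
    // srcscanints/dstscanbytes are the per-row padding, as in the loops above.
    final int srcBase = srcoff, dstBase = dstoff;
    final int srcStride = w + srcscanints, dstStride = 4 * w + dstscanbytes;
    IntStream.range(0, h).parallel().forEach(y -> {
        int src = srcBase + y * srcStride;
        int dst = dstBase + y * dstStride;
        for (int x = 0; x < w; x++) {
            int pixel = srcarr[src++];
            dstarr[dst++] = (byte) (pixel      );
            dstarr[dst++] = (byte) (pixel >>  8);
            dstarr[dst++] = (byte) (pixel >> 16);
            dstarr[dst++] = (byte) (pixel >> 24);
        }
    });
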
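
Since the snippet above is untested, here is a minimal, self-contained sketch of how a row-parallel conversion could be checked against a plain sequential one outside of JavaFX. The class name `ConvertCheck`, both method names and their signatures are hypothetical and are not the actual `IntTo4ByteSameConverter` API; in particular, the stride arguments here are assumed to be full row strides rather than per-row padding. It only checks that the output bytes match and says nothing about whether parallelizing is actually faster.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

public class ConvertCheck {
    // Sequential reference: unpack each int pixel into four bytes, low byte first.
    // srcStride is in ints and dstStride in bytes (full row strides, an assumption).
    public static void convertSequential(int[] src, int srcOff, int srcStride,
                                         byte[] dst, int dstOff, int dstStride,
                                         int w, int h) {
        for (int y = 0; y < h; y++) {
            int s = srcOff + y * srcStride;
            int d = dstOff + y * dstStride;
            for (int x = 0; x < w; x++) {
                int pixel = src[s++];
                dst[d++] = (byte) (pixel);
                dst[d++] = (byte) (pixel >> 8);
                dst[d++] = (byte) (pixel >> 16);
                dst[d++] = (byte) (pixel >> 24);
            }
        }
    }

    // Row-parallel variant: each row computes its own offsets from y,
    // so no mutable state is shared between threads.
    public static void convertParallel(int[] src, int srcOff, int srcStride,
                                       byte[] dst, int dstOff, int dstStride,
                                       int w, int h) {
        IntStream.range(0, h).parallel().forEach(y -> {
            int s = srcOff + y * srcStride;
            int d = dstOff + y * dstStride;
            for (int x = 0; x < w; x++) {
                int pixel = src[s++];
                dst[d++] = (byte) (pixel);
                dst[d++] = (byte) (pixel >> 8);
                dst[d++] = (byte) (pixel >> 16);
                dst[d++] = (byte) (pixel >> 24);
            }
        });
    }

    public static void main(String[] args) {
        // Strides deliberately larger than the row width to exercise padding.
        int w = 7, h = 5, srcStride = 9, dstStride = 4 * w + 3;
        int[] src = new Random(42).ints((long) srcStride * h).toArray();
        byte[] expected = new byte[dstStride * h];
        byte[] actual = new byte[dstStride * h];
        convertSequential(src, 0, srcStride, expected, 0, dstStride, w, h);
        convertParallel(src, 0, srcStride, actual, 0, dstStride, w, h);
        System.out.println(Arrays.equals(expected, actual) ? "match" : "MISMATCH");
    }
}
```

Whether this helps in practice would still need to be measured with the benchmark from the PR, since for small tiles the fork/join overhead may well outweigh the gain.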
-------------
PR: https://git.openjdk.java.net/jfx/pull/68