On Fri, 24 Jan 2020 17:16:13 GMT, Frederic Thevenet <github.com+7450507+ftheve...@openjdk.org> wrote:
>> I don't, to be honest.
>> The results for some dimensions (not always the same) can vary pretty
>> widely from one run to another, despite all my efforts to repeat results
>> and remove outliers.
>> Out of curiosity, I also tried to eliminate the GC as a possible culprit by
>> running it with Epsilon, but it seems to make no significant difference.
>> I ran that test on a laptop with integrated Intel graphics and no dedicated
>> VRAM (Intel UHD Graphics 620), though, so this might be why.
>> Maybe someone could try and run the bench on hardware with a discrete GPU?
>
> As to why the tiling version is significantly slower, though, I do have a
> pretty good idea; as Kevin hinted, the pixel copy into a temporary buffer
> before copying into the final image is where most of the extra time is
> spent.
> The reason why it is so much slower is a little bit of a pity, though;
> profiling a run of the benchmark shows that a lot of time is spent in
> `IntTo4ByteSameConverter::doConvert`. As it turns out, the reason for this is
> that, under Windows and the D3D pipeline anyway, the `WritableImage` used to
> collate the tiles and the tiles returned from the RTTexture have different
> pixel formats (IntARGB for the tile and byteBGRA for the `WritableImage`).
> So if we could use a `WritableImage` with an IntARGB pixel format as the
> recipient for the snapshot (at least as long as no image was provided by the
> caller), I suspect that the copy would be much faster.
> Unfortunately it seems the only way to choose the pixel format for a
> `WritableImage` is to initialize it with a `PixelBuffer`, but then one can no
> longer use a `PixelWriter` to update it, and it doesn't seem to me that
> there is a way to safely access the `PixelBuffer` from an image's reference
> alone.
> I'm pretty new to this code base though (which is quite large; I haven't read
> it all quite yet... ;-), so hopefully there's a way to do that that has
> simply eluded me so far.

> profiling a run of the benchmark shows that a lot of time is spent in
> `IntTo4ByteSameConverter::doConvert`

This is a bit naive, but what if you parallelize the code there? I didn't
test that this produces the correct result, but you can try to replace the
loops with something like this:

    // Rows are independent, so only the outer loop is parallel.
    // A lambda cannot mutate srcoff/dstoff, so each row derives its own
    // offsets from y. This assumes srcscanints and dstscanbytes are still
    // the full row strides (no "-= w" adjustment applied beforehand).
    IntStream.range(0, h).parallel().forEach(y -> {
        int src = srcoff + y * srcscanints;
        int dst = dstoff + y * dstscanbytes;
        for (int x = 0; x < w; x++) {
            int pixel = srcarr[src++];
            dstarr[dst++] = (byte) (pixel      );
            dstarr[dst++] = (byte) (pixel >>  8);
            dstarr[dst++] = (byte) (pixel >> 16);
            dstarr[dst++] = (byte) (pixel >> 24);
        }
    });

-------------

PR: https://git.openjdk.java.net/jfx/pull/68
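
Since the parallel suggestion above was not tested, here is a minimal, self-contained sketch that checks a row-parallel conversion against a plain sequential loop. It is only an illustration: the class `ParallelConvertSketch` and its methods are hypothetical and not part of the JavaFX code base, and it assumes that `srcscanints` and `dstscanbytes` are full row strides. It checks correctness only; whether the parallel version is actually faster is not measured here.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical stand-alone sketch; not the JavaFX converter itself.
final class ParallelConvertSketch {

    // Reference version: rows are converted one after another.
    static void convertSequential(int[] srcarr, int srcoff, int srcscanints,
                                  byte[] dstarr, int dstoff, int dstscanbytes,
                                  int w, int h) {
        for (int y = 0; y < h; y++) {
            convertRow(srcarr, srcoff + y * srcscanints,
                       dstarr, dstoff + y * dstscanbytes, w);
        }
    }

    // Row-parallel version: each row computes its own offsets from y,
    // so no shared counter is mutated inside the lambda.
    static void convertParallel(int[] srcarr, int srcoff, int srcscanints,
                                byte[] dstarr, int dstoff, int dstscanbytes,
                                int w, int h) {
        IntStream.range(0, h).parallel().forEach(y ->
            convertRow(srcarr, srcoff + y * srcscanints,
                       dstarr, dstoff + y * dstscanbytes, w));
    }

    // Splits each int pixel into 4 bytes, lowest byte first.
    private static void convertRow(int[] srcarr, int src,
                                   byte[] dstarr, int dst, int w) {
        for (int x = 0; x < w; x++) {
            int pixel = srcarr[src++];
            dstarr[dst++] = (byte) (pixel);
            dstarr[dst++] = (byte) (pixel >> 8);
            dstarr[dst++] = (byte) (pixel >> 16);
            dstarr[dst++] = (byte) (pixel >> 24);
        }
    }

    public static void main(String[] args) {
        int w = 300, h = 200;
        int srcScan = w + 3;       // padded source stride, in ints
        int dstScan = w * 4 + 8;   // padded destination stride, in bytes
        int[] src = new int[h * srcScan];
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 0x9E3779B9; // arbitrary but deterministic pixel data
        }
        byte[] expected = new byte[h * dstScan];
        byte[] actual = new byte[h * dstScan];
        convertSequential(src, 0, srcScan, expected, 0, dstScan, w, h);
        convertParallel(src, 0, srcScan, actual, 0, dstScan, w, h);
        System.out.println("results match: " + Arrays.equals(expected, actual));
    }
}
```

Parallel streams run on the common fork/join pool, so the benefit depends on image size and on what else is using that pool; for small tiles the scheduling overhead may well outweigh the gain.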