Hi,

I believe I found the issue, but I still need to take proper
measurements. After some initial profiling I found that this function spent
most of its time in two load operations, where it loads the read and write
pointers back into registers. This was very suspicious, as the function
should not use all that many registers and shouldn't need to write them
back out to memory. The function accepts references to the read and write
pointers, which seems to be throwing the compiler off. Capturing them in
local variables as plain pointers, and writing them back before returning,
seems to eliminate the problem. But I need to take more measurements
before I submit a patch.

On Tue, 23 Apr 2019 at 10:26, Kevin Wheatley <[email protected]>
wrote:

> another data point.
>
> When I first experimented with adding DWA to Nuke using OpenEXR 2.2.0 I
> had to patch the configure so I could enable f16c instructions for gcc
> 4.1.2, after doing so vtune pointed to the copyFromFrameBuffer function
> when going from half to float for ~30+% of the CPU when reading files from
> local SSD. (Aside, there were a number of other namespace related fixes
> that were needed too, all of these are in the latest OpenEXR versions). I
> came to the conclusion that to make the performance any better it would
> need a f16c based half to float conversion function rather than going via
> the LUT, at least for those CPUs supporting those instructions. I also have
> some notes about testing memory mapped reading, but no conclusions.
>
> This was not the case when f16c was disabled, as other functions appeared
> higher in the profile. The total performance was lower without f16c (no
> surprise); copyFromFrameBuffer only bubbled to the top because f16c
> reduced the cost of the other functions.
>
> I didn't try RLE compression.
> Kevin
>
_______________________________________________
Openexr-devel mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/openexr-devel