I'll be curious to hear your new perf numbers after the inlining fix 
(was that you? thanks!) and after trying all the applicable things from my 
previous email.

At that point, you should also feel free to post the full statistics output 
after your trial (texturesys->getstats()) and maybe there's something I can 
spot there that would give me other ideas.

I read Matt's texture cache whitepaper, but I have not tried to directly 
benchmark it (and I noticed, at least in the draft I read, that he 
conspicuously did not compare directly against OIIO). Benchmarking rendering 
components is so hard... how would we make a fair comparison? Is it even 
possible to be fair, considering that some of his restrictions are showstoppers 
for us (we have almost no use cases for 3x8 bits)? His method (notwithstanding 
the many showstopper simplifications) certainly resembles ours, which is 
probably not a coincidence considering that he and I have actually co-authored 
production renderers in the distant past. :-)

And of course, there could surely be things in OIIO's texture handling that 
could be sped up.

For us, in large production, it performs very well and is a relatively small 
fraction of overall render time (the ray tracing itself, and all the rest of 
the shading operations, tend to dominate). When texture mapping is the 
bottleneck, it is usually a pathological I/O issue -- either they are accessing 
much too incoherently and thrashing the cache (usually a combination of 
inadvertently point-sampling textures with 0 derivs and having a too-small 
texture cache size), or else it's a facility-wide problem of the file servers 
just not being able to keep up with all the texture I/O in flight at a time 
when things are extra crazy or the servers are not healthy.

Another tip for production renderers -- when shading very incoherent diffuse 
rays or extremely glossy reflection/refraction (situations in which you'll 
never see a coherent image of the reflected texture), we blur A LOT. We set 
options.sblur = options.tblur = 1/64.0, which basically forces the diffuse 
texture lookups to use the level of the MIP map that fits on exactly one 
texture cache tile (we use 64x64 pixel tiles), and for those rays we also use 
InterpBilinear and MipModeTrilinear. We don't make users figure this out -- 
the renderer just 
automatically gooses the lookup parameters for those rays. But it is helpful to 
have an option to be able to turn that off, because every once in a while 
you'll have a diffuse reflector directly abutting a high-contrast textured area 
light or something like that, and you might notice the extra blur in that one 
case, so it's helpful to turn it off for that one shot.

        -- lg


> On May 12, 2017, at 12:55 AM, Stefan Werner <[email protected]> wrote:
> 
> Hi Larry,
> 
> thanks for the comprehensive answer. The eventual use case for this is in 
> production as well, so it’s well understood that even slow texture caching 
> will be better than letting the OS’ virtual memory handle 100s of GBs of 
> textures.
> 
> I’ll play a bit with comparing anisotropic to isotropic filtering. Our 
> differentials are at the moment a bit tighter than they probably need to 
> be, which may cause more bicubic interpolation than necessary. I did not 
> observe that causing excessive cache misses yet, but it is an area where I 
> intend to improve things anyway.
> 
> With regards to compile options, it turns out that our Windows build did not 
> inline things like the overloaded * and + operators of the SIMD class, 
> leading to significant performance loss. I notice that you already integrated 
> the patch* for that, thank you.
> 
> Has anyone taken a look at PBRT’s texture cache, btw? It was published just 
> this March, and the performance claim is: "Our implementation performs just 
> as well as pbrt does with preloaded textures where there are no such cache 
> maintenance complexities”. From skimming the code, I notice it seems to be 
> more specialised and not as generic, and cutting corners will surely give 
> them a performance advantage. Tiled MIP maps are in a custom file format with 
> specific alignment, whereas OIIO accepts any tiled TIFF or EXR. The restriction 
> to 3 channels @ 8 bits is a showstopper for many production use cases too.
> 
> -Stefan
> 
> *https://github.com/OpenImageIO/oiio/commit/7373a254463ef58d91964be83028c5fc6ca9c6ea
> 
>> On 12. May 2017, at 00:24, Larry Gritz <[email protected]> wrote:
>> 
>> I think just one TextureSystem overall should be fine. I don't think there 
>> is any advantage to having it be per-thread, and you *really* wouldn't want 
>> to have any accident where a per-thread TS inadvertently ended up with a 
>> separate ImageCache per thread.
>> 
>> A bunch of suggestions, in no particular order, because I don't know how 
>> many you are already doing:
>> 
>> Be sure you are preprocessing all your textures with maketx so that they are 
>> tiled and MIP-mapped. That's definitely better than forcing it to emulate 
>> tiling/mipmapping, which will happen if you use untiled, un-mipped textures.
>> 
>> Note that there are two varieties of each call, for example,
>> 
>>   bool texture (ustring filename, TextureOpt &options, ...)
>> 
>> and
>> 
>>   bool texture (TextureHandle *texture_handle, Perthread *thread_info,
>>                 TextureOpt &options, ...)
>> 
>> You can reduce the per-call overhead somewhat if you use the latter 
>> call -- that is, if each thread already knows its thread_info (which you can 
>> retrieve ONCE per thread with get_thread_info()), and also if you pass the 
>> handle rather than the filename (which you can retrieve ONCE per filename, 
>> using get_texture_handle()).
>> 
>> And if you have to use the first variety of the call, where you look up by 
>> filename and without knowing the per-thread info already, then at least 
>> ensure that you are creating the ustring ONCE and passing it repeatedly, and 
>> not inadvertently constructing a ustring every time.
>> 
>> In other words, this is the most wasteful thing to do:
>> 
>>   texturesys->texture (ustring("foo.exr"), /* construct ustring every time */
>>                        options, s, t, ...);
>> 
>> and this is the most efficient thing to do:
>> 
>>   // ONCE per thread:   my_thread_info = texturesys->get_thread_info();
>>   // ONCE per texture:  handle = texturesys->get_texture_handle(filename);
>>   // for each texture lookup:
>>   texturesys->texture (handle, my_thread_info, options, ...);
>> 
>> Are your derivatives reasonable? If they are 0 or very small, you'll always 
>> be sampling from the finest level of the MIP-map, which is probably not kind 
>> to caches, and also that finest level of the MIPmap will tend to use bicubic 
>> sampling unless you force bilinear everywhere (somewhat more math). If you 
>> are using correct derivs and your textures are sized well to handle all your 
>> views (without forcing the highest-res level), then you should be in good 
>> shape; as long as you're not "magnifying"/blurring/on the top level, 
>> "SmartBicubic" will actually give you bilinear most of the time.
>> 
>> Another difference you may be seeing is from our anisotropic texturing, 
>> compared to your old engine. If you don't require the anisotropy, then you 
>> may want to set options.mipmode to MipModeTrilinear rather than MipModeAniso 
>> (which is the default).
>> 
>> What kind of hardware are you compiling for? Are you using appropriate 
>> USE_SIMD flags? Because that can speed up the texture system quite a bit.
>> 
>> I'm not sure how you are benchmarking, but make sure your benchmark run is 
>> long enough (in time) that you are measuring the steady state, and not 
>> having it dominated by initial texture read time. For example, if your prior 
>> system was reading whole textures in one shot, and the new one is reading 
>> tiles on demand (and reading multiple MIP levels as well), the total read 
>> time may be a bit higher. That won't matter at all for a 1 hour render, but 
>> the increase in disk read may show up as significant for a 15 second 
>> benchmark.
>> 
>> Assuming you're doing all this... well, you may just be seeing the overhead 
>> of all the flexibility of TextureSystem. Remember that in some sense, it is 
>> NOT designed to be the fastest possible texture implementation for texture 
>> sets that fit in memory. Rather, it's supposed to be acceptable speed and 
>> degrade gracefully as the texture set grows. In production, we routinely 
>> render frames that reference many thousands of textures totalling many 
>> hundreds of GB (well into the TB range), using a memory cache of perhaps 
>> only 2 or 4GB, and it performs very, very well. Texture sets much larger 
>> than available memory are the case where it really shines.
>> 
>>      -- lg
>> 
>> 
>>> On May 11, 2017, at 7:26 AM, Stefan Werner <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I’m in the middle of integrating OIIO’s TextureSys into a path tracer. 
>>> Previously, textures were just loaded into memory in full, and lookups 
>>> would always happen at the full resolution, without mip maps. When 
>>> replacing that with TextureSys, I’m noticing a significant performance 
>>> drop, up to the point where texture lookups (sample_bilinear() for example, 
>>> sample_bicubic() even more) occupy 30% or more of the render time. This is 
>>> with good cache hit rates, the cache size exceeds the size of all textures 
>>> and the OIIO stats report a cache miss rate of < 0.01% (in addition, I 
>>> tried hardcoding dsdx/dsdy/dtdx/dtdy to 0.01, just to be sure).
>>> 
>>> I did expect some performance drop compared to the previous naive strategy, 
>>> but this is a bit steeper than I expected. I am wondering if I am doing 
>>> something wrong on my side and if there are some best practices on how to 
>>> integrate OIIO into a path tracer. (I had it running in a REYES renderer 
>>> years ago and don’t remember it being that slow.)
>>> 
>>> I am creating one TextureSys instance per CPU thread, with a shared 
>>> ImageCache - are separate caches per thread any better? I cache per-thread 
>>> data and do lookups using TextureHandle, not the texture name. Do people 
>>> generally use smartbicubic for path tracing or do you not see enough of a 
>>> difference and stay with bilinear (as pbrt does)? For any 
>>> diffuse/sss/smooth glossy/etc bounces, I use MipModeNoMIP/InterpClosest. I 
>>> am observing this on macOS, Windows and Ubuntu, OIIO built with whatever 
>>> compiler flags CMake picks for a Release build. Is it worth forcing more 
>>> aggressive optimisation (-O3 -flto -ffast-math…)?
>>> 
>>> Thanks,
>>> Stefan
>>> _______________________________________________
>>> Oiio-dev mailing list
>>> [email protected]
>>> http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org
>> 
>> --
>> Larry Gritz
>> [email protected]
>> 
>> 
> 

--
Larry Gritz
[email protected]


