Hi Larry,

Thanks for the comprehensive answer. The eventual use case for this is in
production as well, so it’s well understood that even slow texture caching
will be better than letting the OS’s virtual memory handle hundreds of GB of
textures.
I’ll play a bit with comparing anisotropic to isotropic filtering. Our
differentials are currently a bit tighter than they probably need to be,
which may cause more bicubic interpolation than necessary. I have not
observed that causing excessive cache misses yet, but it is an area where I
intend to improve things anyway.

With regard to compile options, it turned out that our Windows build did not
inline things like the overloaded * and + operators of the SIMD class,
leading to a significant performance loss. I notice that you already
integrated the patch* for that, thank you.

Has anyone taken a look at PBRT’s texture cache, btw? It was published just
this March, and the performance claim is: "Our implementation performs just
as well as pbrt does with preloaded textures where there are no such cache
maintenance complexities”. From skimming the code, I notice it seems to be
more specialised and not as generic, and cutting corners will surely give
them a performance advantage. Tiled MIP maps are in a custom file format
with specific alignment, whereas OIIO accepts any tiled TIFF or EXR. The
restriction to 3 channels @ 8 bits is a showstopper for many production use
cases, too.

-Stefan

* https://github.com/OpenImageIO/oiio/commit/7373a254463ef58d91964be83028c5fc6ca9c6ea

> On 12. May 2017, at 00:24, Larry Gritz <[email protected]> wrote:
>
> I think just one TextureSystem overall should be fine. I don't think there
> is any advantage to having it be per-thread, and you *really* wouldn't want
> to have any accident where a per-thread TS inadvertently ended up with a
> separate ImageCache per thread.
>
> A bunch of suggestions, in no particular order, because I don't know how
> many you are already doing:
>
> Be sure you are preprocessing all your textures with maketx so that they
> are tiled and MIP-mapped. That's definitely better than forcing it to
> emulate tiling/mipmapping, which will happen if you use untiled, un-mipped
> textures.
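[For reference, a minimal maketx invocation along the lines described above
might look like the following. Filenames and the tile size are placeholders;
check `maketx --help` for the options available in your OIIO version.]

```shell
# Preprocess a scanline image into a tiled, MIP-mapped .tx file that
# TextureSystem can page in tile-by-tile.  maketx produces tiled,
# MIP-mapped output by default; --tile sets the tile size explicitly.
maketx --tile 64 64 -o diffuse.tx diffuse.exr
```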
>
> Note that there are two varieties of each call, for example,
>
>     bool texture (ustring filename, TextureOpt &options, ...)
>
> and
>
>     bool texture (TextureHandle *texture_handle, Perthread *thread_info,
>                   TextureOpt &options, ...)
>
> You can reduce the per-call overhead somewhat if you use the latter call --
> that is, if each thread already knows its thread_info (which you can
> retrieve ONCE per thread with get_thread_info()), and also if you pass the
> handle rather than the filename (which you can retrieve ONCE per filename,
> using get_texture_handle()).
>
> And if you have to use the first variety of the call, where you look up by
> filename and without knowing the per-thread info already, then at least
> ensure that you are creating the ustring ONCE and passing it repeatedly,
> and not inadvertently constructing a ustring every time.
>
> In other words, this is the most wasteful thing to do:
>
>     texturesys->texture (ustring("foo.exr"), /* construct ustring every time */
>                          options, s, t, ...);
>
> and this is the most efficient thing to do:
>
>     // ONCE per thread:
>     my_thread_info = texturesys->get_thread_info();
>     // ONCE per texture:
>     handle = texturesys->get_texture_handle (filename);
>     // for each texture lookup:
>     texturesys->texture (handle, my_thread_info, options, ...);
>
> Are your derivatives reasonable? If they are 0 or very small, you'll always
> be sampling from the finest level of the MIP map, which is probably not
> kind to caches, and also that finest level of the MIP map will tend to use
> bicubic sampling unless you force bilinear everywhere (somewhat more math).
> If you are using correct derivs and your textures are sized well to handle
> all your views (without forcing the highest-res level), then you should be
> in good shape, and as long as you're not "magnifying"/blurring/on the top
> level, "SmartBicubic" will actually give you bilinear most of the time.
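[A fuller sketch of the handle-based pattern above, against the OIIO
TextureSystem API. The filename, derivative values, and loop are
placeholders, and the exact signatures should be checked against the
texture.h of your OIIO version -- for instance, the per-thread accessor is
spelled get_perthread_info() in the headers.]

```cpp
// Sketch: cache the ustring, handle, and per-thread info once, then do
// cheap per-sample lookups.  Assumes an OIIO 1.8-era TextureSystem.
#include <OpenImageIO/texture.h>
#include <OpenImageIO/ustring.h>
using namespace OIIO;

void shade_points (TextureSystem *texsys, int npoints)
{
    // ONCE per filename: construct the ustring and resolve it to a handle.
    ustring filename ("diffuse.tx");   // placeholder texture name
    TextureSystem::TextureHandle *handle =
        texsys->get_texture_handle (filename);

    // ONCE per thread: fetch the per-thread info.
    TextureSystem::Perthread *perthread = texsys->get_perthread_info ();

    TextureOpt opt;    // defaults: smart bicubic, anisotropic MIP selection
    float result[3];

    for (int i = 0; i < npoints; ++i) {
        // s,t and the derivatives would come from the hit point;
        // these constants are placeholders.
        float s = 0.5f, t = 0.5f;
        float dsdx = 0.01f, dtdx = 0.0f, dsdy = 0.0f, dtdy = 0.01f;
        texsys->texture (handle, perthread, opt,
                         s, t, dsdx, dtdx, dsdy, dtdy,
                         3 /*nchannels*/, result);
    }
}
```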
>
> Another difference you may be seeing is from our anisotropic texturing,
> compared to your old engine. If you don't require the anisotropy, then you
> may want to set options.mipmode to MipModeTrilinear rather than MipModeAniso
> (which is the default).
>
> What kind of hardware are you compiling for? Are you using appropriate
> USE_SIMD flags? Because that can speed up the texture system quite a bit.
>
> I'm not sure how you are benchmarking, but make sure your benchmark run is
> long enough (in time) that you are measuring the steady state, and not
> having it dominated by initial texture read time. For example, if your
> prior system was reading whole textures in one shot, and the new one is
> reading tiles on demand (and reading multiple MIP levels as well), the
> total read time may be a bit higher. That won't matter at all for a 1-hour
> render, but the increase in disk reads may show up as significant for a
> 15-second benchmark.
>
> Assuming you're doing all this... well, you may just be seeing the overhead
> of all the flexibility of TextureSystem. Remember that in some sense, it is
> NOT designed to be the fastest possible texture implementation for texture
> sets that fit in memory. Rather, it's supposed to be acceptable speed that
> degrades gracefully as the texture set grows. In production, we routinely
> render frames that reference many thousands of textures totalling many
> hundreds of GB (well into the TB range), using a memory cache of perhaps
> only 2 or 4 GB, and it performs very, very well. Texture sets much larger
> than available memory are the case where it really shines.
>
> -- lg
>
>
>> On May 11, 2017, at 7:26 AM, Stefan Werner <[email protected]> wrote:
>>
>> Hi,
>>
>> I’m in the middle of integrating OIIO’s TextureSys into a path tracer.
>> Previously, textures were just loaded into memory in full, and lookups
>> would always happen at the full resolution, without MIP maps.
>> When replacing that with TextureSys, I’m noticing a significant
>> performance drop, up to the point where texture lookups (sample_bilinear()
>> for example, sample_bicubic() even more) occupy 30% or more of the render
>> time. This is with good cache hit rates: the cache size exceeds the size
>> of all textures and the OIIO stats report a cache miss rate of < 0.01%
>> (in addition, I tried hardcoding dsdx/dsdy/dtdx/dtdy to 0.01, just to be
>> sure).
>>
>> I did expect some performance drop compared to the previous naive
>> strategy, but this is a bit steeper than I expected. I am wondering if I
>> am doing something wrong on my side and if there are some best practices
>> on how to integrate OIIO into a path tracer. (I had it running in a REYES
>> renderer years ago and don’t remember it being that slow.)
>>
>> I am creating one TextureSys instance per CPU thread, with a shared
>> ImageCache -- are separate caches per thread any better? I cache
>> per-thread data and do lookups using TextureHandle, not texture name. Do
>> people generally use smart bicubic for path tracing, or do you not see
>> enough of a difference and stay with bilinear (as pbrt does)? For any
>> diffuse/SSS/smooth glossy/etc. bounces, I use MipModeNoMIP/InterpClosest.
>> I am observing this on macOS, Windows and Ubuntu, OIIO built with whatever
>> compiler flags CMake picks for a Release build. Is it worth forcing more
>> aggressive optimisation (-O3 -lto -ffast-math…)?
>>
>> Thanks,
>> Stefan
>> _______________________________________________
>> Oiio-dev mailing list
>> [email protected]
>> http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org
>
> --
> Larry Gritz
> [email protected]
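[The knobs discussed in this thread -- trilinear MIP filtering instead of
anisotropic, and plain bilinear instead of smart bicubic -- are ordinary
TextureOpt fields. A sketch, with enum names as they appear in OIIO's
texture.h (verify against your version):]

```cpp
// Sketch: a TextureOpt tuned for speed over filtering quality, per the
// suggestions above.  All other fields keep their defaults.
#include <OpenImageIO/texture.h>
using namespace OIIO;

TextureOpt make_fast_opts ()
{
    TextureOpt opt;
    opt.mipmode    = TextureOpt::MipModeTrilinear;  // instead of MipModeAniso
    opt.interpmode = TextureOpt::InterpBilinear;    // instead of InterpSmartBicubic
    return opt;
}
```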
