The inlining fix was a team effort: I discovered that MSVC wasn't inlining
those functions, but it was LazyDodo/Ray who figured out why MSVC refused
to do so and submitted the pull request. It wasn't very hard to find:
without inlining, the texturing calls occupied over 50% of the execution
time! ( http://pasteall.org/pic/index.php?id=115483 ) I don't have the
equivalent profiler screenshot for Windows to show, but on OS X I see
texturing taking around 20% of rendering time, mostly spent in the
sample_bilinear calls. Admittedly, my current test scenes are picked for
their heavy use of image texturing; scenes where the bottleneck is more in
procedurals or geometric complexity will behave differently. Eventually,
this should also get used on production scenes (when you upgrade your
render nodes from 64GB to 128GB of RAM, you know it's time to implement
texture caching...), but for now I'm using smaller test scenes to get
quicker turnaround times.

With regard to differentials, they mostly follow the BSDF differentials
as they were in an earlier version of OSL, back when it still included
BSDFs, with some extra work on top to take dNdu/dNdv and softness into
account for speculars. Especially for curved chrome objects (think faucets,
car parts, etc.), dNd* makes a big difference. I will try the sblur/tblur
adjustment for incoherent rays; that sounds like it should help too.
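
For reference, a rough sketch of the kind of reflection-differential update
I mean (Igehy-style, written with hypothetical float3/dot helpers rather
than our actual code); the dNdx term is what makes curvature matter:

  // d/dx of R = D - 2(D.N)N, propagating both the direction and the
  // normal derivatives. D, dDdx: incoming direction and its screen-x
  // derivative; N, dNdx: shading normal and its screen-x derivative,
  // built from dNdu/dNdv and the UV derivatives.
  float3 reflect_differential (float3 D, float3 dDdx, float3 N, float3 dNdx)
  {
      float DdotN    = dot (D, N);
      float dDdotNdx = dot (dDdx, N) + dot (D, dNdx);
      return dDdx - 2.0f * (DdotN * dNdx + dDdotNdx * N);
  }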

-Stefan

On Fri, May 12, 2017 at 3:53 PM, Larry Gritz <[email protected]> wrote:

> I'll be curious to hear what your new perf numbers are after the inline
> fix (was that you? thanks!) and trying all the applicable things from my
> previous email.
>
> At that point, you should also feel free to post the full statistics
> output after your trial (texturesys->getstats()) and maybe there's
> something I can spot there that would give me other ideas.
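>
> That's just a string, so roughly (a one-liner sketch; I believe verbosity
> level 2 and up include the per-file details):
>
>   std::cout << texturesys->getstats (2 /*verbosity*/) << "\n";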
>
> I read Matt's texture cache whitepaper, but I have not tried to directly
> benchmark it (and I noticed, at least in the draft I read, that he
> conspicuously did not compare directly against OIIO). Benchmarking
> rendering components is so hard... how would we make a fair comparison? Is
> it even possible to be fair, considering that some of his restrictions are
> showstoppers for us (we have almost no use cases for 3x8 bits)? His method
> (notwithstanding the many showstopper simplifications) certainly resembles
> ours, which is probably not a coincidence considering that he and I have
> actually co-authored production renderers in the distant past. :-)
>
> And of course, there could surely be things in OIIO's texture handling
> that could be sped up.
>
> For us, in large production, it performs very well and is a relatively
> small fraction of overall render time (the ray tracing itself, and all the
> rest of the shading operations, tend to dominate). When texture mapping is
> the bottleneck, it is usually a pathological I/O issue -- either they are
> accessing much too incoherently and thrashing the cache (usually a
> combination of inadvertently point-sampling textures with 0 derivs and
> having a too-small texture cache size), or else it's a facility-wide
> problem of the file servers just not being able to keep up with all the
> texture I/O in flight at a time when things are extra crazy or the servers
> are not healthy.
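>
> The cache size itself is just an attribute, by the way; roughly (note
> that it takes a float, not an int):
>
>   // e.g. raise the cache from the default to 4 GB
>   texturesys->attribute ("max_memory_MB", 4096.0f);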
>
> Another tip for production renderers -- when shading very incoherent
> diffuse rays or extremely glossy reflection/refraction (situations in which
> you'll never see a coherent image of the reflected texture), we blur A LOT.
> We set options.sblur = tblur = 1/64.0 (which basically forces the diffuse
> texture lookups to use the level of the MIP map that fits on exactly one
> texture cache tile; we use 64x64 pixel tiles), and for those rays we also
> use InterpBilinear and MipModeTrilinear. We don't make users figure this out --
> the renderer just automatically gooses the lookup parameters for those
> rays. But it is helpful to have an option to be able to turn that off,
> because every once in a while you'll have a diffuse reflector directly
> abutting a high-contrast textured area light or something like that, and
> you might notice the extra blur in that one case, so it's helpful to turn
> it off for that one shot.
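>
> In terms of the actual fields, the gist is roughly this (field and enum
> names as in OIIO's texture.h):
>
>   // "blur a lot" setup for incoherent diffuse / very glossy rays
>   OIIO::TextureOpt opt;
>   opt.sblur = opt.tblur = 1.0f / 64.0f;   // about one 64x64 tile of blur
>   opt.interpmode = OIIO::TextureOpt::InterpBilinear;
>   opt.mipmode    = OIIO::TextureOpt::MipModeTrilinear;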
>
>         -- lg
>
>
> > On May 12, 2017, at 12:55 AM, Stefan Werner <[email protected]> wrote:
> >
> > Hi Larry,
> >
> > thanks for the comprehensive answer. The eventual use case for this is
> in production as well, so it’s well understood that even slow texture
> caching will be better than letting the OS’ virtual memory handle 100s of
> GBs of textures.
> >
> > I’ll play a bit with comparing anisotropic to isotropic filtering. Our
> differentials are at the moment a bit tighter than they probably need
> to be, which may cause more bicubic interpolation than necessary. I did not
> observe that causing excessive cache misses yet, but it is an area where I
> intend to improve things anyway.
> >
> > With regard to compile options, it turns out that our Windows build did
> not inline things like the overloaded * and + operators of the SIMD class,
> leading to significant performance loss. I notice that you already
> integrated the patch* for that, thank you.
> >
> > Has anyone taken a look at PBRT’s texture cache, btw? It was published
> just this March, and the performance claim is: "Our implementation performs
> just as well as pbrt does with preloaded textures where there are no such
> cache maintenance complexities". From skimming the code, I notice it seems
> to be more specialised and less generic, and cutting corners will surely
> give them a performance advantage. Tiled mip maps are in a custom file
> format with specific alignment, whereas OIIO accepts any tiled TIFF or EXR.
> The restriction to 3 channels @ 8 bits is a showstopper for many production
> use cases too.
> >
> > -Stefan
> >
> > *https://github.com/OpenImageIO/oiio/commit/7373a254463ef58d91964be83028c5fc6ca9c6ea
> >
> >> On 12. May 2017, at 00:24, Larry Gritz <[email protected]> wrote:
> >>
> >> I think just one TextureSystem overall should be fine. I don't think
> there is any advantage to having it be per-thread, and you *really*
> wouldn't want to have any accident where a per-thread TS inadvertently
> ended up with a separate ImageCache per thread.
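> >>
> >> Concretely, creation looks roughly like this (shared=true means that even
> >> if you accidentally made more than one TextureSystem, they would all use
> >> the one shared ImageCache):
> >>
> >>   OIIO::TextureSystem *texsys = OIIO::TextureSystem::create (true /*shared*/);
> >>   // ... hand this same pointer to every render thread ...
> >>   OIIO::TextureSystem::destroy (texsys);   // at shutdown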
> >>
> >> A bunch of suggestions, in no particular order, because I don't know
> how many you are already doing:
> >>
> >> Be sure you are preprocessing all your textures with maketx so that
> they are tiled and MIP-mapped. That's definitely better than forcing it to
> emulate tiling/mipmapping, which will happen if you use untiled, un-mipped
> textures.
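> >>
> >> Typically that's just something like
> >>
> >>   maketx --oiio wood_color.exr -o wood_color.tx
> >>
> >> (the filenames here are made up; --oiio just picks settings tuned for
> >> this TextureSystem).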
> >>
> >> Note that there are two varieties of each call, for example,
> >>
> >>   bool texture (ustring filename, TextureOpt &options, ...)
> >>
> >> and
> >>
> >>   bool texture (TextureHandle *texture_handle, Perthread *thread_info,
> >>                 TextureOpt &options, ...)
> >>
> >> You can reduce the per-call overhead somewhat if you use the
> latter call -- that is, if each thread already knows its thread_info (which
> you can retrieve ONCE per thread with get_thread_info()), and also if you
> pass the handle rather than the filename (which you can retrieve ONCE per
> filename, using get_texture_handle()).
> >>
> >> And if you have to use the first variety of the call, where you look up
> by filename and without knowing the per-thread info already, then at least
> ensure that you are creating the ustring ONCE and passing it repeatedly,
> and not inadvertently constructing a ustring every time.
> >>
> >> In other words, this is the most wasteful thing to do:
> >>
> >>   texturesys->texture (ustring("foo.exr"), /* construct ustring every time */
> >>                        options, s, t, ...);
> >>
> >> and this is the most efficient thing to do:
> >>
> >>   // ONCE per thread:   my_thread_info = texturesys->get_thread_info();
> >>   // ONCE per texture:  handle = texturesys->get_texture_handle(filename);
> >>   // for each texture lookup:
> >>   texturesys->texture (handle, my_thread_info, options, ...);
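> >>
> >> Spelled out a bit more completely, the fast path looks roughly like the
> >> sketch below (note that in recent texture.h the per-thread accessor is
> >> spelled get_perthread_info(), and the exact texture() signature varies a
> >> little between OIIO versions):
> >>
> >>   #include <OpenImageIO/texture.h>
> >>   #include <OpenImageIO/ustring.h>
> >>   using namespace OIIO;
> >>
> >>   // sketch: one lookup along the "fast path"
> >>   bool lookup_once (float s, float t,
> >>                     float dsdx, float dtdx, float dsdy, float dtdy,
> >>                     float rgb[3])
> >>   {
> >>       static TextureSystem *texsys = TextureSystem::create (true /*shared*/);
> >>       static ustring filename ("wood_color.tx");      // made-up name
> >>       static TextureSystem::TextureHandle *handle =
> >>           texsys->get_texture_handle (filename);      // once per texture
> >>       // in a real renderer, cache this once per thread, not per call:
> >>       TextureSystem::Perthread *thread_info = texsys->get_perthread_info ();
> >>       TextureOpt opt;   // wrap/filter/blur defaults; set fields as needed
> >>       return texsys->texture (handle, thread_info, opt,
> >>                               s, t, dsdx, dtdx, dsdy, dtdy,
> >>                               3 /*nchannels*/, rgb);
> >>   }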
> >>
> >> Are your derivatives reasonable? If they are 0 or very small, you'll
> always be sampling from the finest level of the MIP-map, which is probably
> not kind to caches, and also that finest level of the MIPmap will tend to
> use bicubic sampling unless you force bilinear everywhere (somewhat more
> math). If you are using correct derivs and your textures are sized well to
> handle all your views (without forcing the highest-res level), then you
> should be in good shape; as long as you're not "magnifying"/blurring/on
> the top level, "SmartBicubic" will actually give you bilinear most of
> the time.
> >>
> >> Another difference you may be seeing is from our anisotropic texturing,
> compared to your old engine. If you don't require the anisotropy, then you
> may want to set options.mipmode to MipModeTrilinear rather than
> MipModeAniso (which is the default).
> >>
> >> What kind of hardware are you compiling for? Are you using appropriate
> USE_SIMD flags? Because that can speed up the texture system quite a bit.
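> >>
> >> (USE_SIMD is a build-time CMake setting, e.g. something along the lines of
> >>
> >>   cmake -DUSE_SIMD=avx2,f16c ..
> >>
> >> though the exact values accepted depend on the OIIO version and on the
> >> oldest CPU you need to support.)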
> >>
> >> I'm not sure how you are benchmarking, but make sure your benchmark run
> is long enough (in time) that you are measuring the steady state, and not
> having it dominated by initial texture read time. For example, if your
> prior system was reading whole textures in one shot, and the new one is
> reading tiles on demand (and reading multiple MIP levels as well), the
> total read time may be a bit higher. That won't matter at all for a 1 hour
> render, but the increase in disk read may show up as significant for a 15
> second benchmark.
> >>
> >> Assuming you're doing all this... well, you may just be seeing the
> overhead of all the flexibility of TextureSystem. Remember that in some
> sense, it is NOT designed to be the fastest possible texture implementation
> for texture sets that fit in memory. Rather, it's supposed to be acceptable
> speed and degrade gracefully as the texture set grows. In production, we
> routinely render frames that reference many thousands of textures totalling
> many hundreds of GB (well into the TB range), using a memory cache of
> perhaps only 2 or 4GB, and it performs very, very well. Texture sets much
> larger than available memory are the case where it really shines.
> >>
> >>      -- lg
> >>
> >>
> >>> On May 11, 2017, at 7:26 AM, Stefan Werner <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I’m in the middle of integrating OIIO’s TextureSys into a path tracer.
> Previously, textures were just loaded into memory in full, and lookups
> would always happen at the full resolution, without mip maps. When
> replacing that with TextureSys, I’m noticing a significant performance
> drop, up to the point where texture lookups (sample_bilinear() for example,
> sample_bicubic() even more) occupy 30% or more of the render time. This is
> with good cache hit rates, the cache size exceeds the size of all textures
> and the OIIO stats report a cache miss rate of < 0.01% (in addition, I
> tried hardcoding dsdx/dsdy/dtdx/dtdy to 0.01, just to be sure).
> >>>
> >>> I did expect some performance drop compared to the previous naive
> strategy, but this is a bit steeper than I expected. I am wondering if I am
> doing something wrong on my side and if there are some best practices on
> how to integrate OIIO into a path tracer. (I had it running in a REYES
> renderer years ago and don’t remember it being that slow.)
> >>>
> >>> I am creating one TextureSys instance per CPU thread, with a shared
> ImageCache - are separate caches per thread any better? I cache perthread
> data and do lookups using TextureHandle, not texture name. Do people
> generally use smartbicubic for path tracing or do you not see enough of a
> difference and stay with bilinear (as pbrt does)? For any
> diffuse/sss/smooth glossy/etc bounces, I use MipModeNoMIP/InterpClosest. I
> am observing this on macOS, Windows and Ubuntu, OIIO built with whatever
> compiler flags CMake picks for a Release build. Is it worth forcing more
> aggressive optimisation (-O3 -flto -ffast-math…)?
> >>>
> >>> Thanks,
> >>> Stefan
> >>
> >> --
> >> Larry Gritz
> >> [email protected]
> >>
> >>
> >
>
> --
> Larry Gritz
> [email protected]
>
>
>
_______________________________________________
Oiio-dev mailing list
[email protected]
http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org
