Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Feb 17, 2016, at 7:07 PM, Roland Scheidegger wrote:
>
> You could use different functions for avx and avx2 code, and plug the
> right ones in at runtime, as you can link them both just fine. It just
> requires that your code containing avx2 code is in a different compile
> unit to the one containing avx-only code. This way you only really have
> separate compiled code for the functions where there's really a
> difference (obviously, this prevents the compiler from using avx2 on its
> own in the shared parts, but I doubt that's a problem). Albeit if you
> have lots of differences scattered around (the worst would probably be
> different structures based on such difference used everywhere...) this
> might not be very practical (at a first glance, didn't look like it at
> least for avx and avx2).
> Though I'm not actually sure how you would do that for c++ template
> code, maybe it doesn't work as easily...
> In any case, so far for llvmpipe we didn't bother (except for the jitted
> code of course) to optimize for newer instruction sets precisely due to
> it being annoying (certainly prevents you from doing "let's just
> optimize this math here in this little inline function when avx is
> available" - so we still have rasterization functions which emulate
> sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).

Unfortunately we have avx and avx2 usage in the general swr code, hidden behind some macros which emulate the missing avx2 instructions on avx, so there isn’t a clear boundary layer inside the swr rasterizer we can load behind. Additionally some of the structures will start changing size when we add avx512 support.

I was thinking that “objcopy --prefix-symbols” might be the answer to the problem of creating two versions of the rasterizer that could be linked together with the driver, but it does a global rename on all symbols (internal and externals like malloc/free/c++ constructors/etc.) leaving unresolvable externals.

Maybe a global c++ namespace might work, but I don’t see a nonintrusive way of adding that. -Tim ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
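
To make the namespace idea concrete, here is a minimal sketch, assuming the rasterizer sources could be wrapped in a namespace whose name is chosen per build. Everything in it (`DEFINE_RASTERIZER`, `arch_avx`, `arch_avx2`, `RasterizerApi`, `SelectRasterizer`) is hypothetical and not taken from the swr tree. Unlike a global symbol prefix, a namespace only renames what is declared inside it, so references to malloc/free and other externals stay resolvable.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the rasterizer sources. In a real build the body
// would live in ordinary source files compiled twice, once per target, with
// the namespace name passed in (e.g. -DARCH_NS=arch_avx2 -mavx2). A macro
// stamps out both copies here so the sketch fits in one file.
#define DEFINE_RASTERIZER(ARCH_NS, HAS_256BIT_INT)                      \
    namespace ARCH_NS {                                                 \
    /* Same function names in every copy; the namespace keeps them */   \
    /* apart at link time. */                                           \
    inline bool HasNative256BitInt() { return HAS_256BIT_INT; }         \
    inline std::string Describe() { return #ARCH_NS; }                  \
    }

DEFINE_RASTERIZER(arch_avx, false)   // AVX: 256-bit float, 128-bit integer ops
DEFINE_RASTERIZER(arch_avx2, true)   // AVX2: adds 256-bit integer ops

// Table of entry points the driver would call through (hypothetical).
struct RasterizerApi {
    bool (*has_native_256bit_int)();
    std::string (*describe)();
};

// Picks one copy at runtime; the caller supplies the CPU feature check.
inline RasterizerApi SelectRasterizer(bool cpu_has_avx2) {
    if (cpu_has_avx2) {
        return {arch_avx2::HasNative256BitInt, arch_avx2::Describe};
    }
    return {arch_avx::HasNative256BitInt, arch_avx::Describe};
}

inline std::string DescribeSelected(const RasterizerApi& api) {
    assert(api.describe != nullptr);
    return api.describe();
}
```

The intrusive part is that every rasterizer source and header would have to open the namespace, which matches the concern raised above about there being no nonintrusive way of adding it.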
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On Thu, Feb 18, 2016 at 02:07:25AM +0100, Roland Scheidegger wrote:
> On 17.02.2016 at 22:09, Rowley, Timothy O wrote:
> >> On Nov 18, 2015, at 12:34 PM, Emil Velikov wrote:
> >> I have no objections against getting this merged, although here
> >> are a couple of things that should be sorted. Some of these are
> >> just reiteration from others:
> >
> > Sorry about the delay responding to this; we’ve been working on a
> > number of the issues you mentioned (plus the usual year-end holidays
> > and other work).
> >
> >> - First and foremost - please base your work against master. Mesa,
> >> like most other open-source projects, tries to keep features out
> >> of bugfix releases. As such basing things against 11.0 is not
> >> suitable.
> >
> > Basing our efforts on a particular Mesa branch was an initial
> > development decision to keep a stable base while we figured out how
> > to build a driver from scratch. We have now rebased to the Mesa
> > master and periodically merge updates.
> >
> >> - Further combinatorial explosion of build configurations - with
> >> internal/external core, swr-arch, etc. Some of these can (should?)
> >> be nuked, although further comments will follow as patch(es) hit
> >> the mailing list.
> >
> > All the additional swr build options have been removed, leaving swr
> > simply as an additional gallium driver that can be enabled. The
> > build-time architecture dependence has been addressed by building the
> > swr driver twice (avx and avx2), and having swr_create_screen check
> > the architecture and load the appropriate library. I’m not
> > completely satisfied with the current solution: since the driver is
> > part of the loaded library, we need to link most of mesa into the
> > “driver”. The fix for this seems to be to just build the core swr
> > rasterizer architecture-specific and dlopen/dlsym the fifty or so API
> > entry points. However this interim solution simplifies things for
> > our users and removes the swr specific options from the general Mesa
> > build system.
>
> You could use different functions for avx and avx2 code, and plug the
> right ones in at runtime, as you can link them both just fine. It just
> requires that your code containing avx2 code is in a different compile
> unit to the one containing avx-only code. This way you only really have
> separate compiled code for the functions where there's really a
> difference (obviously, this prevents the compiler from using avx2 on its
> own in the shared parts, but I doubt that's a problem). Albeit if you
> have lots of differences scattered around (the worst would probably be
> different structures based on such difference used everywhere...) this
> might not be very practical (at a first glance, didn't look like it at
> least for avx and avx2).

You can set feature flags on a per-function basis now, so it's possible to have an avx and avx2 function in the same module. I haven't actually tried this, though, so I'm not sure how well it's working at the moment.

-Tom

> Though I'm not actually sure how you would do that for c++ template
> code, maybe it doesn't work as easily...
> In any case, so far for llvmpipe we didn't bother (except for the jitted
> code of course) to optimize for newer instruction sets precisely due to
> it being annoying (certainly prevents you from doing "let's just
> optimize this math here in this little inline function when avx is
> available" - so we still have rasterization functions which emulate
> sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).
>
> Roland
>
> >> - Using llvm's C++ interface, building against multiple LLVM
> >> versions. If openswr supports only limited versions of llvm,
> >> then the build should bail out accordingly - more
> >> comments/suggestions as patch(es) hit the ML.
> >
> > OpenSWR now supports llvm 3.6, 3.7, and 3.8. We don’t explicitly
> > prevent people from trying to use llvm-svn, though as you say the C++
> > api is not stable so they might encounter problems.
> >
> >> - Will patches porting core openswr functionality from the
> >> internal tree be part of the public discussions? The VMWare people
> >> have done a great thing trying to keep things open, and people
> >> have, on the rare occasion, found nitpicks in their patches.
> >
> > Moving patches from the internal rasterizer tree can be scripted at a
> > top level, but unfortunately that’s the easy bit of keeping the two
> > in sync when changes happen on both sides of the fence. I can try
> > tracking individual patches up to my git knowledge.
> >
> >> - And last but not least - please split patches sensibly (for your
> >> submission and further work). The "Initial public Mesa+SWR"
> >> touches files in quite a few different places.
> >
> > I’m about to send the patches to the list for review, splitting them
> > into the driver, rasterizer, mesa changes, and build system.
> >
> >> Mildly related - I'll be resending/merging a
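
The per-function feature flags mentioned in this message can be sketched as follows. This is a minimal illustration, assuming GCC or Clang on x86: `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are compiler extensions there, while the function names (`add_i32_generic`, `add_i32_avx2`, `select_add_i32`) are made up for the example. On other compilers or architectures the sketch falls back to the plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_TARGET_ATTR 1
#else
#define HAVE_X86_TARGET_ATTR 0
#endif

// Baseline version, built with whatever flags the rest of the file uses.
static void add_i32_generic(const std::int32_t* a, const std::int32_t* b,
                            std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if HAVE_X86_TARGET_ATTR
// Only this function is compiled with AVX2 enabled, so it can sit in the
// same translation unit as code built without -mavx2.
__attribute__((target("avx2")))
static void add_i32_avx2(const std::int32_t* a, const std::int32_t* b,
                         std::int32_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // eight 32-bit lanes per 256-bit register
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // remaining elements
}
#endif

using add_i32_fn = void (*)(const std::int32_t*, const std::int32_t*,
                            std::int32_t*, std::size_t);

// Plugs in the right implementation at runtime based on the running CPU.
static add_i32_fn select_add_i32() {
#if HAVE_X86_TARGET_ATTR
    if (__builtin_cpu_supports("avx2")) return add_i32_avx2;
#endif
    return add_i32_generic;
}
```

A caller would fetch the pointer once, for example `add_i32_fn add = select_add_i32();`, and reuse it. This only helps where the avx and avx2 variants differ at function granularity; it does not address data structures whose layout depends on the instruction set.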
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 17.02.2016 at 22:09, Rowley, Timothy O wrote:
>> On Nov 18, 2015, at 12:34 PM, Emil Velikov wrote:
>> I have no objections against getting this merged, although here are
>> a couple of things that should be sorted. Some of these are just
>> reiteration from others:
>
> Sorry about the delay responding to this; we’ve been working on a
> number of the issues you mentioned (plus the usual year-end holidays
> and other work).
>
>> - First and foremost - please base your work against master. Mesa,
>> like most other open-source projects, tries to keep features out
>> of bugfix releases. As such basing things against 11.0 is not
>> suitable.
>
> Basing our efforts on a particular Mesa branch was an initial
> development decision to keep a stable base while we figured out how
> to build a driver from scratch. We have now rebased to the Mesa
> master and periodically merge updates.
>
>> - Further combinatorial explosion of build configurations - with
>> internal/external core, swr-arch, etc. Some of these can (should?)
>> be nuked, although further comments will follow as patch(es) hit
>> the mailing list.
>
> All the additional swr build options have been removed, leaving swr
> simply as an additional gallium driver that can be enabled. The
> build-time architecture dependence has been addressed by building the
> swr driver twice (avx and avx2), and having swr_create_screen check
> the architecture and load the appropriate library. I’m not
> completely satisfied with the current solution: since the driver is
> part of the loaded library, we need to link most of mesa into the
> “driver”. The fix for this seems to be to just build the core swr
> rasterizer architecture-specific and dlopen/dlsym the fifty or so API
> entry points. However this interim solution simplifies things for
> our users and removes the swr specific options from the general Mesa
> build system.

You could use different functions for avx and avx2 code, and plug the right ones in at runtime, as you can link them both just fine. It just requires that your code containing avx2 code is in a different compile unit to the one containing avx-only code. This way you only really have separate compiled code for the functions where there's really a difference (obviously, this prevents the compiler from using avx2 on its own in the shared parts, but I doubt that's a problem). Albeit if you have lots of differences scattered around (the worst would probably be different structures based on such difference used everywhere...) this might not be very practical (at a first glance, didn't look like it at least for avx and avx2).
Though I'm not actually sure how you would do that for c++ template code, maybe it doesn't work as easily...
In any case, so far for llvmpipe we didn't bother (except for the jitted code of course) to optimize for newer instruction sets precisely due to it being annoying (certainly prevents you from doing "let's just optimize this math here in this little inline function when avx is available" - so we still have rasterization functions which emulate sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).

Roland

>
>> - Using llvm's C++ interface, building against multiple LLVM
>> versions. If openswr supports only limited versions of llvm,
>> then the build should bail out accordingly - more
>> comments/suggestions as patch(es) hit the ML.
>
> OpenSWR now supports llvm 3.6, 3.7, and 3.8. We don’t explicitly
> prevent people from trying to use llvm-svn, though as you say the C++
> api is not stable so they might encounter problems.
>
>> - Will patches porting core openswr functionality from the
>> internal tree be part of the public discussions? The VMWare people
>> have done a great thing trying to keep things open, and people
>> have, on the rare occasion, found nitpicks in their patches.
>
> Moving patches from the internal rasterizer tree can be scripted at a
> top level, but unfortunately that’s the easy bit of keeping the two
> in sync when changes happen on both sides of the fence. I can try
> tracking individual patches up to my git knowledge.
>
>> - And last but not least - please split patches sensibly (for your
>> submission and further work). The "Initial public Mesa+SWR"
>> touches files in quite a few different places.
>
> I’m about to send the patches to the list for review, splitting them
> into the driver, rasterizer, mesa changes, and build system.
>
>> Mildly related - I'll be resending/merging a series which reworks
>> things in src/gallium/auxiliary/target-helpers/ so things might
>> clash as you rebase your work.
>
> No problem - all part of working with a larger project. Thanks for
> the heads-up.
>
> -Tim
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Nov 18, 2015, at 12:34 PM, Emil Velikov wrote:
>
> I have no objections against getting this merged, although here are a
> couple of things that should be sorted. Some of these are just
> reiteration from others:

Sorry about the delay responding to this; we’ve been working on a number of the issues you mentioned (plus the usual year-end holidays and other work).

> - First and foremost - please base your work against master. Mesa,
> like most other open-source projects, tries to keep features out of
> bugfix releases. As such basing things against 11.0 is not suitable.

Basing our efforts on a particular Mesa branch was an initial development decision to keep a stable base while we figured out how to build a driver from scratch. We have now rebased to the Mesa master and periodically merge updates.

> - Further combinatorial explosion of build configurations - with
> internal/external core, swr-arch, etc. Some of these can (should?) be
> nuked, although further comments will follow as patch(es) hit the
> mailing list.

All the additional swr build options have been removed, leaving swr simply as an additional gallium driver that can be enabled. The build-time architecture dependence has been addressed by building the swr driver twice (avx and avx2), and having swr_create_screen check the architecture and load the appropriate library. I’m not completely satisfied with the current solution: since the driver is part of the loaded library, we need to link most of mesa into the “driver”. The fix for this seems to be to just build the core swr rasterizer architecture-specific and dlopen/dlsym the fifty or so API entry points. However this interim solution simplifies things for our users and removes the swr specific options from the general Mesa build system.

> - Using llvm's C++ interface, building against multiple LLVM
> versions. If openswr supports only limited versions of llvm, then
> the build should bail out accordingly - more comments/suggestions as
> patch(es) hit the ML.

OpenSWR now supports llvm 3.6, 3.7, and 3.8. We don’t explicitly prevent people from trying to use llvm-svn, though as you say the C++ api is not stable so they might encounter problems.

> - Will patches porting core openswr functionality from the internal
> tree be part of the public discussions? The VMWare people have done a
> great thing trying to keep things open, and people have, on the rare
> occasion, found nitpicks in their patches.

Moving patches from the internal rasterizer tree can be scripted at a top level, but unfortunately that’s the easy bit of keeping the two in sync when changes happen on both sides of the fence. I can try tracking individual patches up to my git knowledge.

> - And last but not least - please split patches sensibly (for your
> submission and further work). The "Initial public Mesa+SWR" touches
> files in quite a few different places.

I’m about to send the patches to the list for review, splitting them into the driver, rasterizer, mesa changes, and build system.

> Mildly related - I'll be resending/merging a series which reworks
> things in src/gallium/auxiliary/target-helpers/ so things might clash
> as you rebase your work.

No problem - all part of working with a larger project. Thanks for the heads-up.

-Tim
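
The dlopen/dlsym approach described in this message (an architecture-specific core library whose API entry points are resolved at load time) could look roughly like the sketch below. It assumes a POSIX system with `<dlfcn.h>`; `EntryPoint` and `LoadEntryPoints` are hypothetical names, and no real swr library or symbol names are used. Older glibc versions need `-ldl` at link time.

```cpp
#include <cassert>
#include <cstddef>
#include <dlfcn.h>  // POSIX dlopen / dlsym / dlclose / dlerror
#include <string>
#include <vector>

// One resolved entry point: exported symbol name plus its address.
struct EntryPoint {
    std::string name;
    void* address;
};

// Opens `library_path` (nullptr means the running program and the libraries
// it was started with) and resolves every name in `names`. Returns false and
// fills `error` if the library or any symbol is missing. On success the
// handle is returned through `handle_out` and must stay open for as long as
// the resolved addresses are in use.
inline bool LoadEntryPoints(const char* library_path,
                            const std::vector<std::string>& names,
                            std::vector<EntryPoint>& out,
                            void*& handle_out,
                            std::string& error) {
    out.clear();
    handle_out = dlopen(library_path, RTLD_NOW | RTLD_LOCAL);
    if (!handle_out) {
        const char* msg = dlerror();
        error = msg ? msg : "dlopen failed";
        return false;
    }
    for (const std::string& name : names) {
        dlerror();  // clear any stale error state before the lookup
        void* sym = dlsym(handle_out, name.c_str());
        if (!sym) {
            const char* msg = dlerror();
            error = msg ? std::string(msg) : "symbol resolved to null: " + name;
            dlclose(handle_out);
            handle_out = nullptr;
            out.clear();
            return false;
        }
        out.push_back({name, sym});
    }
    return true;
}
```

A driver would call this once from its screen-creation path with a library path chosen after a CPU check (something like an avx or avx2 build of the core; the actual file names are not given here) and cast each address to the matching function-pointer type.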
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
Hi Tim,

I have no objections against getting this merged, although here are a couple of things that should be sorted. Some of these are just reiteration from others:

- First and foremost - please base your work against master. Mesa, like most other open-source projects, tries to keep features out of bugfix releases. As such basing things against 11.0 is not suitable.

- Further combinatorial explosion of build configurations - with internal/external core, swr-arch, etc. Some of these can (should?) be nuked, although further comments will follow as patch(es) hit the mailing list.

- Using llvm's C++ interface, building against multiple LLVM versions. If openswr supports only limited versions of llvm, then the build should bail out accordingly - more comments/suggestions as patch(es) hit the ML.

- Will patches porting core openswr functionality from the internal tree be part of the public discussions? The VMWare people have done a great thing trying to keep things open, and people have, on the rare occasion, found nitpicks in their patches.

- And last but not least - please split patches sensibly (for your submission and further work). The "Initial public Mesa+SWR" touches files in quite a few different places.

Mildly related - I'll be resending/merging a series which reworks things in src/gallium/auxiliary/target-helpers/ so things might clash as you rebase your work.

Thanks
Emil
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Oct 22, 2015, at 4:17 PM, Jose Fonseca wrote:
>
> They do share a lot already, Mesa, gallium statetracker, and gallivm. If
> further development in openswr is planned, it might require to jump through a
> few hoops, but I think it's worth to figure out what would take to get this
> merged into master so that, whenever there are interface changes, openswr
> won't get the short stick.

Yes, openswr and llvmpipe share a fair bit. It is my hope that as we start working more on openswr performance, some of the effort will benefit both drivers.

We’re willing to jump through the hoops needed to merge into master. To that end, I’ve pushed some updates that amongst other things allow us to support both llvm 3.6 and 3.7 (and possibly llvm-svn). Are there any other hoops that spring to mind?

-Tim
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 10.11.2015 at 20:36, Rowley, Timothy O wrote:
>
>> On Oct 22, 2015, at 4:17 PM, Jose Fonseca wrote:
>>
>> They do share a lot already, Mesa, gallium statetracker, and gallivm. If
>> further development in openswr is planned, it might require to jump through
>> a few hoops, but I think it's worth to figure out what would take to get
>> this merged into master so that, whenever there are interface changes,
>> openswr won't get the short stick.
>
> Yes, openswr and llvmpipe share a fair bit. It is my hope that as we start
> working more on openswr performance, some of the effort will benefit both
> drivers.
>
> We’re willing to jump through the hoops needed to merge into master. To that
> end, I’ve pushed some updates that amongst other things allow us to support
> both llvm 3.6 and 3.7 (and possibly llvm-svn). Are there any other hoops
> that spring to mind?
>

FWIW this looks ok to me. You didn't really touch any shared code apart from adding some extern C wrappers, so there's not much to review there. Plus of course some build changes, which seem fairly obvious though someone else might want to look at that. I didn't look that closely at the driver bits but as long as you're able to maintain it it should be fine...

Roland
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 22/10/15 00:43, Rowley, Timothy O wrote:
>> On Oct 20, 2015, at 5:58 PM, Jose Fonseca wrote:
>>
>> Thanks for the explanations. It's closer now, but still a bit of gap:
>>
>> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
>> SWR create screen!
>> This processor supports AVX2.
>> --> numThreads = 3
>> 1102 frames in 5.002 seconds = 220.312 FPS
>> 1133 frames in 5.001 seconds = 226.555 FPS
>> 1130 frames in 5.002 seconds = 225.91 FPS
>> ^C
>> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
>> 1456 frames in 5 seconds = 291.2 FPS
>> 1617 frames in 5.003 seconds = 323.206 FPS
>> 1571 frames in 5.002 seconds = 314.074 FPS
>
> A bit more of an apples to apples comparison might be single-threaded
> llvmpipe (LP_NUM_THREADS=1) and single-threaded swr
> (KNOB_SINGLE_THREADED=1). Running gloss and glxgears (another favorite
> “benchmark” :) ) under these conditions show swr running a bit slower,
> though a little closer than your numbers.

Indeed that seems a better comparison.

$ KNOB_SINGLE_THREADED=1 ./gloss
SWR create screen!
This processor supports AVX2.
733 frames in 5.003 seconds = 146.512 FPS
787 frames in 5.004 seconds = 157.274 FPS
793 frames in 5.005 seconds = 158.442 FPS
799 frames in 5.001 seconds = 159.768 FPS
787 frames in 5.005 seconds = 157.243 FPS
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=0 ./gloss
939 frames in 5.002 seconds = 187.725 FPS
1032 frames in 5.001 seconds = 206.359 FPS
1017 frames in 5.002 seconds = 203.319 FPS
1021 frames in 5 seconds = 204.2 FPS
1039 frames in 5.002 seconds = 207.717 FPS

> Examining performance traces, we think swr’s concept of hot-tiles, the
> working memory representation of the render target, and the associated
> load/store functions contribute to most of the difference. We might be
> able to optimize those conversions; additionally fast clear would help
> these demos. For larger workloads this small per-frame cost doesn’t
> really affect the performance.
>
> These initial observations from you and others regarding performance
> have been interesting. Our performance work has been with large
> workloads on high core count configurations, where while some of the
> decisions such as a dedicated core for the application/API might have
> cost performance a bit, the percentage is much less than on the dual
> and quad core processors. We’ll look into some changes/tuning that will
> benefit both extremes, though we might have to end up conceding that
> llvmpipe will be faster at glxgears. :-)

I don't care for gears -- it practically measures present/blit rate -- but gloss, despite being simple, is sensitive to texturing performance.

>> Final thoughts: I understand this project has its own history, but I
>> echo what Roland said -- it would be nice to unify with llvmpipe at
>> one point, in some way or fashion. Our (VMware's) focus has been
>> desktop composition, but there's no reason why a single SW renderer
>> can't satisfy both ends of the spectrum, especially for JIT enable
>> renderers, since they can emit at runtime the code most suited for
>> the workload.
>
> We would be happy for someone to take some of the ideas from swr to
> speed up llvmpipe, but for now our development will continue on the swr
> core and driver. We’re not planning on replacing llvmpipe - its intent
> of working on any architecture is admirable. In the ideal world the
> solution would be something that combines the best traits of both
> rasterizers, but at this point the shortest path to having a performant
> solution for our customers is with swr.

Fair enough.

They do share a lot already, Mesa, gallium statetracker, and gallivm. If further development in openswr is planned, it might require to jump through a few hoops, but I think it's worth to figure out what would take to get this merged into master so that, whenever there are interface changes, openswr won't get the short stick.

>> That said, it's really nice seeing Mesa and Gallium enabling this
>> sort of experiments with SW rendering.
>
> Yes, we were quite happy with how fast we were able to get a new driver
> functioning with gallium. The major thing slowing us was the
> documentation, which is not uniform in coverage. There was a lot of
> reading other drivers’ source to figure out how things were supposed to
> work.

Yes, that's a fair comment.

Jose
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Oct 20, 2015, at 2:03 PM, Roland Scheidegger wrote:
>
> Certainly looks interesting...
> From a high level point of view, seems quite similar to llvmpipe (both
> tile based, using llvm for jitting shaders, ...). Of course llvmpipe
> isn't well suited for these kind of workloads (the most important use
> case is desktop compositing, so a couple dozen vertices per frame but
> millions of pixels...). Making vertex loads scale is something which
> just wasn't worth the effort so far (there's not actually that many
> people working on llvmpipe), albeit we realize that the completely
> non-parallel nature of it currently actually can hinder scaling quite a
> bit even for "typical" workloads (not desktop compositing, but "simple"
> 3d apps) once you've got enough cores/threads (8 or so), but that's
> something we're not worried too much about.
> I think requiring llvm 3.6 probably isn't going to work if you want to
> upstream this, a minimum version of 3.6 is fine but the general rule is
> things should still work with newer versions (including current
> development version, seems like you're using c++ interface of llvm quite
> a bit so that's probably going to require some #ifdef mess). Albeit I
> guess if you just don't try to build the driver with non-released
> versions that's probably ok (but will limit the ability for some people
> to try out your driver).

Some differences between llvmpipe and swr based on my understanding of llvmpipe’s architecture:

threading model
  llvmpipe: single threaded vertex processing, up to 16 rasterization threads
  swr: common thread pool that pick up frontend or backend work as available
vertex processing
  llvmpipe: entire draw call processed in a single pass
  swr: large draws chopped into chunks that can be processed in parallel
frontend/backend coupling
  llvmpipe: separate binning pass in single threaded frontend
  swr: frontend vertex processing and binning combined in a single pass
primitive assembly and binning
  llvmpipe: scalar c code
  swr: x86 avx/avx2 working on vector of primitives
fragment processing
  llvmpipe: single jitted shader combining depth/fragment/stencil/blend on 16x16 block
  swr: separate jitted fragment and blend shaders, plus templated depth test
in-memory representation
  llvmpipe: direct access to render targets
  swr: hot-tile working representation with load and/or store at required times

As you say, we do use LLVM’s C++ API. While that has some advantages, it’s not guaranteed to be stable and can/does make nontrivial changes. 3.6 to 3.7 made some change to at least the GEP instruction which we could work around if necessary for upstreaming.

-Tim
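
The "large draws chopped into chunks" item can be illustrated with a short sketch. It is not swr's implementation: `DrawChunk`, `SplitDraw`, and `ProcessChunk` are hypothetical names, the per-vertex work is a placeholder scale, and the chunks are processed sequentially here. The point is only that each chunk writes to its own slice of the output, which is what would let a thread pool hand chunks to whichever worker is free.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [first, first + count) of vertices belonging to one chunk.
struct DrawChunk {
    std::size_t first;
    std::size_t count;
};

// Splits a draw of `vertex_count` vertices into chunks of at most `max_chunk`
// vertices, emitted in submission order. For a triangle list a caller would
// presumably pick a multiple of 3 so no primitive straddles two chunks
// (an assumption of this sketch, not something stated in the thread).
inline std::vector<DrawChunk> SplitDraw(std::size_t vertex_count,
                                        std::size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<DrawChunk> chunks;
    for (std::size_t first = 0; first < vertex_count; first += max_chunk) {
        chunks.push_back({first, std::min(max_chunk, vertex_count - first)});
    }
    return chunks;
}

// Placeholder for frontend work: scales each input value and writes only to
// the chunk's own slice of `out`, so chunks never touch each other's data.
inline void ProcessChunk(const DrawChunk& c, const std::vector<float>& in,
                         std::vector<float>& out, float scale) {
    assert(c.first + c.count <= in.size() && in.size() == out.size());
    for (std::size_t i = c.first; i < c.first + c.count; ++i) {
        out[i] = in[i] * scale;
    }
}
```

Because results are indexed by vertex position rather than by completion time, the output order matches submission order regardless of which chunk finishes first. Binned primitives would still need extra logic to stay correctly ordered once chunks really run in parallel.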
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 22.10.2015 at 00:41, Rowley, Timothy O wrote:
>
>> On Oct 20, 2015, at 2:03 PM, Roland Scheidegger wrote:
>>
>> Certainly looks interesting...
>> From a high level point of view, seems quite similar to llvmpipe (both
>> tile based, using llvm for jitting shaders, ...). Of course llvmpipe
>> isn't well suited for these kind of workloads (the most important use
>> case is desktop compositing, so a couple dozen vertices per frame but
>> millions of pixels...). Making vertex loads scale is something which
>> just wasn't worth the effort so far (there's not actually that many
>> people working on llvmpipe), albeit we realize that the completely
>> non-parallel nature of it currently actually can hinder scaling quite a
>> bit even for "typical" workloads (not desktop compositing, but "simple"
>> 3d apps) once you've got enough cores/threads (8 or so), but that's
>> something we're not worried too much about.
>> I think requiring llvm 3.6 probably isn't going to work if you want to
>> upstream this, a minimum version of 3.6 is fine but the general rule is
>> things should still work with newer versions (including current
>> development version, seems like you're using c++ interface of llvm quite
>> a bit so that's probably going to require some #ifdef mess). Albeit I
>> guess if you just don't try to build the driver with non-released
>> versions that's probably ok (but will limit the ability for some people
>> to try out your driver).
>
> Some differences between llvmpipe and swr based on my understanding of
> llvmpipe’s architecture:
>
> threading model
> llvmpipe: single threaded vertex processing, up to 16 rasterization
> threads

The limit is actually pretty much arbitrary. Though since vertex processing is single threaded, there are definitely practical scaling limits (and having more threads than render tiles wouldn't show any advantage).

> swr: common thread pool that pick up frontend or backend work as
> available
> vertex processing
> llvmpipe: entire draw call processed in a single pass
> swr: large draws chopped into chunks that can be processed in parallel
> frontend/backend coupling
> llvmpipe: separate binning pass in single threaded frontend
> swr: frontend vertex processing and binning combined in a single pass

There are definite advantages to swr there. llvmpipe's binning pass isn't really separate from vertex processing, so this being single-threaded is more of a result of vertex processing also being handled in the same frontend thread (though of course if it were multithreaded some extra logic would be needed for things to stay correctly in order). Part of it is due to draw really being separate from llvmpipe (it can be and is used by other drivers), so the "interface" between vs and fs is rather simple. But certainly it's not like this is set in stone, rather no one had the time to do something a bit more scalable there...

> primitive assembly and binning
> llvmpipe: scalar c code

There's actually some jit code there plus some manual sse code (though still c fallback). Albeit it is indeed not quite as parallel as I'd like (only works on a single primitive at a time).

> swr: x86 avx/avx2 working on vector of primitives
> fragment processing
> llvmpipe: single jitted shader combining depth/fragment/stencil/blend
> on 16x16 block

It is working on a 4x4 block actually, but otherwise that's right.

> swr: separate jitted fragment and blend shaders, plus templated depth
> test
> in-memory representation
> llvmpipe: direct access to render targets
> swr: hot-tile working representation with load and/or store at required
> times

This is actually an interesting difference, of course also tied to llvmpipe integrating everything together into the fragment shader.

So yes, these are all definitely significant architectural differences to llvmpipe. But most of it (ok the combined fragment shader / backend jit code is not) is not really due to a conscious design decision - I'd happily accept patches to make it possible to do vertex processing in parallel :-).

> As you say, we do use LLVM’s C++ API. While that has some advantages, it’s
> not guaranteed to be stable and can/does make nontrivial changes. 3.6 to 3.7
> made some change to at least the GEP instruction which we could work around
> if necessary for upstreaming.

IMHO you should really try to keep up at least with llvm releases (and ideally llvm head). Otherwise you make it a pain to build not just for users but developers alike (and if stuff doesn't get at least built, it has a tendency to break quite often when there's gallium interface changes etc.).

Roland

>
> -Tim
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Oct 20, 2015, at 5:58 PM, Jose Fonseca wrote:
>
> Thanks for the explanations. It's closer now, but still a bit of gap:
>
> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
> SWR create screen!
> This processor supports AVX2.
> --> numThreads = 3
> 1102 frames in 5.002 seconds = 220.312 FPS
> 1133 frames in 5.001 seconds = 226.555 FPS
> 1130 frames in 5.002 seconds = 225.91 FPS
> ^C
> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
> 1456 frames in 5 seconds = 291.2 FPS
> 1617 frames in 5.003 seconds = 323.206 FPS
> 1571 frames in 5.002 seconds = 314.074 FPS

A bit more of an apples to apples comparison might be single-threaded llvmpipe (LP_NUM_THREADS=1) and single-threaded swr (KNOB_SINGLE_THREADED=1). Running gloss and glxgears (another favorite “benchmark” :) ) under these conditions show swr running a bit slower, though a little closer than your numbers.

Examining performance traces, we think swr’s concept of hot-tiles, the working memory representation of the render target, and the associated load/store functions contribute to most of the difference. We might be able to optimize those conversions; additionally fast clear would help these demos. For larger workloads this small per-frame cost doesn’t really affect the performance.

> One final question: you said that one thread is reserved for the API, but I
> see all threads (with top `H`) maxing up the CPU. So if the thread reserved
> for the API is not doing vertex/fragment processing, then what is it using
> 100% of a CPU thread for?

With a trivial application main loop and light api usage, the API thread is going to end up spending most of the time waiting for the other threads to finish work.

These initial observations from you and others regarding performance have been interesting. Our performance work has been with large workloads on high core count configurations, where while some of the decisions such as a dedicated core for the application/API might have cost performance a bit, the percentage is much less than on the dual and quad core processors. We’ll look into some changes/tuning that will benefit both extremes, though we might have to end up conceding that llvmpipe will be faster at glxgears. :-)

> Final thoughts: I understand this project has its own history, but I echo
> what Roland said -- it would be nice to unify with llvmpipe at one point, in
> some way or fashion. Our (VMware's) focus has been desktop composition, but
> there's no reason why a single SW renderer can't satisfy both ends of the
> spectrum, especially for JIT enable renderers, since they can emit at runtime
> the code most suited for the workload.

We would be happy for someone to take some of the ideas from swr to speed up llvmpipe, but for now our development will continue on the swr core and driver. We’re not planning on replacing llvmpipe - its intent of working on any architecture is admirable. In the ideal world the solution would be something that combines the best traits of both rasterizers, but at this point the shortest path to having a performant solution for our customers is with swr.

> That said, it's really nice seeing Mesa and Gallium enabling this sort of
> experiments with SW rendering.

Yes, we were quite happy with how fast we were able to get a new driver functioning with gallium. The major thing slowing us was the documentation, which is not uniform in coverage. There was a lot of reading other drivers’ source to figure out how things were supposed to work.

-Tim
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 20.10.2015 at 19:11, Rowley, Timothy O wrote:
> Hi. I'd like to introduce the Mesa3D community to a software project
> that we hope to upstream. We're a small team at Intel working on
> software defined visualization (http://sdvis.org/), and have
> opensource projects in both the raytracing (Embree, OSPRay) and
> rasterization (OpenSWR) realms.
>
> We're a different Intel team from that of i965 fame, with a different
> type of customer and workloads. Our customers have large clusters of
> compute nodes that for various reasons do not have GPUs, and are
> working with extremely large geometry models.
>
> We've been working on a high performance, highly scalable rasterizer
> and driver to interface with Mesa3D. Our rasterizer functions as a
> "software gpu", relying on the mature well-supported Mesa3D to provide
> API and state tracking layers.
>
> We would like to contribute this code to Mesa3D and continue doing
> active development in your source repository. We welcome discussion
> about how this will happen and questions about the project itself.
> Below are some answers to what we think might be frequently asked
> questions.
>
> Bruce and I will be the public contacts for this project, but this
> project isn't solely our work - there's a dedicated group of people
> working on the core SWR code.
>
> Tim Rowley
> Bruce Cherniak
>
> Intel Corporation
>
> Why another software rasterizer?
>
> Good question, given there are already three (swrast, softpipe,
> llvmpipe) in the Mesa3D tree. Two important reasons for this:
>
> * Architecture - given our focus on scientific visualization, our
>   workloads are much different than the typical game; we have heavy
>   vertex load and relatively simple shaders. In addition, the core
>   counts of machines we run on are much higher. These parameters led
>   to design decisions much different than llvmpipe.
>
> * Historical - Intel had developed a high performance software
>   graphics stack for internal purposes. Later we adapted this
>   graphics stack for use in visualization and decided to move forward
>   with Mesa3D to provide a high quality API layer while at the same
>   time benefiting from the excellent performance the software
>   rasterizer gives us.
>
> What's the architecture?
>
> SWR is a tile based immediate mode renderer with a sort-free threading
> model which is arranged as a ring of queues. Each entry in the ring
> represents a draw context that contains all of the draw state and work
> queues. An API thread sets up each draw context and worker threads
> will execute both the frontend (vertex/geometry processing) and
> backend (fragment) work as required. The ring allows for backend
> threads to pull work in order. Large draws are split into chunks to
> allow vertex processing to happen in parallel, with the backend work
> pickup preserving draw ordering.
>
> Our pipeline uses just-in-time compiled code for the fetch shader that
> does vertex attribute gathering and AOS to SOA conversions, the vertex
> shader and fragment shaders, streamout, and fragment blending. SWR
> core also supports geometry and compute shaders but we haven't exposed
> them through our driver yet. The fetch shader, streamout, and blend are
> built internally to swr core using LLVM directly, while for the vertex
> and pixel shaders we reuse bits of llvmpipe from
> gallium/auxiliary/gallivm to build the kernels, which we wrap
> differently than llvmpipe's auxiliary/draw code.
>
> What's the performance?
>
> For the types of high-geometry workloads we're interested in, we are
> significantly faster than llvmpipe. This is to be expected, as
> llvmpipe only threads the fragment processing and not the geometry
> frontend.
>
> The linked slide below shows some performance numbers from a benchmark
> dataset and application. On a 36 total core dual E5-2699v3 we see
> performance 29x to 51x that of llvmpipe.
>
> http://openswr.org/slides/SWR_Sept15.pdf
>
> While our current performance is quite good, we know there is more
> potential in this architecture. When we switched from a prototype
> OpenGL driver to Mesa we regressed performance severely, some due to
> interface issues that need tuning, some differences in shader code
> generation, and some due to conformance and feature additions to the
> core swr. We are looking to recover most of this performance.
>
> What's the conformance?
>
> The major applications we are targeting are all based on the
> Visualization Toolkit (VTK), and as such our development efforts have
> been focused on making sure these work as best as possible. Our
> current code passes vtk's rendering tests with their new "OpenGL2"
> (really OpenGL 3.2) backend at 99%.
>
> piglit testing shows a much lower pass rate, roughly 80% at the time
> of writing. Core SWR undergoes
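
The chunking idea in the architecture section above (large draws split into chunks, vertex work done in parallel, backend pickup in draw order) can be illustrated with a toy model. This is only a sketch and not SWR code: `split_draw`, `frontend` and `render` are hypothetical names, the "vertex shader" is a placeholder multiplication, and a thread pool with ordered results stands in for the ring of draw contexts.

```python
from concurrent.futures import ThreadPoolExecutor


def split_draw(num_vertices, chunk_size):
    """Split one large draw into (first_vertex, count) chunks."""
    return [(start, min(chunk_size, num_vertices - start))
            for start in range(0, num_vertices, chunk_size)]


def frontend(chunk):
    """Stand-in for vertex processing of one chunk (placeholder math)."""
    first, count = chunk
    return [v * 2 for v in range(first, first + count)]


def render(num_vertices, chunk_size, workers=4):
    """Process chunks in parallel, then consume them in draw order."""
    chunks = split_draw(num_vertices, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks may finish in any order, but map() yields results in
        # submission order, modelling an order-preserving backend pickup.
        results = pool.map(frontend, chunks)
        return [v for chunk_out in results for v in chunk_out]


print(split_draw(10, 4))
print(render(10, 4))
```

The point of the model is only that parallel frontend work and in-order consumption are compatible; the real ring of queues, tiling and work stealing are not represented here.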
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
[re-adding mesa-dev, dropped accidentally]

On Tue, Oct 20, 2015 at 1:43 PM, Ilia Mirkin wrote:
> On Tue, Oct 20, 2015 at 1:11 PM, Rowley, Timothy O wrote:
>> Does one build work on both AVX and AVX2?
>>
>> * Unfortunately, no. The architecture support is fixed at compile
>>   time. While the AVX version of course will run on AVX2 machines
>>   and the jitted code will use AVX2, the overall performance will
>>   suffer relative to a full AVX2 build.
>>
>> * There is some idea that if we move some code from the driver back
>>   to SWR core, we could build two versions of libSWR and dynamically
>>   load the correct version at runtime. Unfortunately this mechanism
>>   would not work with AVX512, as some of the SWR state structures
>>   would change size.
>
> Without commenting on any of the other issues, I believe one of your
> stated goals is to ease distribution to your end-users. If you expect
> them to build their own code, that's no problem. However if you're
> thinking of relying on distros to include your driver and have end
> users use that, then you should consider some solution that enables
> runtime selection of this stuff (even if that's building 3 versions of
> the driver -- swr-avx, swr-avx2, swr-avx512, and having e.g. loader
> magic determine which the right one is for the current CPU).
>
> Cheers,
>
> -ilia
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Oct 20, 2015, at 12:44 PM, Ilia Mirkin wrote:
>
> On Tue, Oct 20, 2015 at 1:43 PM, Ilia Mirkin wrote:
>> On Tue, Oct 20, 2015 at 1:11 PM, Rowley, Timothy O wrote:
>>> Does one build work on both AVX and AVX2?
>>>
>>> * Unfortunately, no. The architecture support is fixed at compile
>>>   time. While the AVX version of course will run on AVX2 machines
>>>   and the jitted code will use AVX2, the overall performance will
>>>   suffer relative to a full AVX2 build.
>>>
>>> * There is some idea that if we move some code from the driver back
>>>   to SWR core, we could build two versions of libSWR and dynamically
>>>   load the correct version at runtime. Unfortunately this mechanism
>>>   would not work with AVX512, as some of the SWR state structures
>>>   would change size.
>>
>> Without commenting on any of the other issues, I believe one of your
>> stated goals is to ease distribution to your end-users. If you expect
>> them to build their own code, that's no problem. However if you're
>> thinking of relying on distros to include your driver and have end
>> users use that, then you should consider some solution that enables
>> runtime selection of this stuff (even if that's building 3 versions of
>> the driver -- swr-avx, swr-avx2, swr-avx512, and having e.g. loader
>> magic determine which the right one is for the current CPU).

We’ve found that the large clusters tend to roll their own user environment specific to their system configuration, so this problem of binary support hasn’t been an immediate concern for the initial users.

We hadn’t considered building complete driver/core-swr combinations behind a loader; we’ll consider this as a possibility for avx512. Most of the code movement needed for runtime selection at the interface layer between core SWR and the driver has been done; we would still need to check for any stray AVX/AVX2 architecture differences in the driver and add loader logic.

-Tim
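
The "loader magic" suggested in the quoted text could look roughly like the sketch below. It is a hedged illustration, not Mesa or SWR code: `read_cpu_flags` and `pick_swr_variant` are hypothetical names, the flag spellings (`avx`, `avx2`, `avx512f`) are the ones Linux lists in /proc/cpuinfo, and treating `avx512f` alone as sufficient for an AVX512 build is an assumption. Only the three build names (swr-avx, swr-avx2, swr-avx512) come from the thread.

```python
def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags listed in a Linux cpuinfo file."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def pick_swr_variant(flags):
    """Choose the most capable build the CPU can run, or None."""
    # Assumption: avx512f alone is treated as enough for the AVX512 build.
    if "avx512f" in flags:
        return "swr-avx512"
    if "avx2" in flags:
        return "swr-avx2"
    if "avx" in flags:
        return "swr-avx"
    return None  # no AVX at all: caller would fall back to another driver


# Example with an explicit flag set (no file access needed):
print(pick_swr_variant({"sse4_2", "avx", "avx2"}))  # swr-avx2
```

A real loader would then open the matching shared object; that step is left out because the library naming and entry points were not settled in this discussion.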
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 20/10/15 18:11, Rowley, Timothy O wrote:
> Hi. I'd like to introduce the Mesa3D community to a software project
> that we hope to upstream. We're a small team at Intel working on
> software defined visualization (http://sdvis.org/), and have
> opensource projects in both the raytracing (Embree, OSPRay) and
> rasterization (OpenSWR) realms.
>
> We're a different Intel team from that of i965 fame, with a different
> type of customer and workloads. Our customers have large clusters of
> compute nodes that for various reasons do not have GPUs, and are
> working with extremely large geometry models.
>
> We've been working on a high performance, highly scalable rasterizer
> and driver to interface with Mesa3D. Our rasterizer functions as a
> "software gpu", relying on the mature well-supported Mesa3D to provide
> API and state tracking layers.
>
> We would like to contribute this code to Mesa3D and continue doing
> active development in your source repository. We welcome discussion
> about how this will happen and questions about the project itself.
> Below are some answers to what we think might be frequently asked
> questions.
>
> Bruce and I will be the public contacts for this project, but this
> project isn't solely our work - there's a dedicated group of people
> working on the core SWR code.
>
> Tim Rowley
> Bruce Cherniak
>
> Intel Corporation
>
> Why another software rasterizer?
>
> Good question, given there are already three (swrast, softpipe,
> llvmpipe) in the Mesa3D tree. Two important reasons for this:
>
> * Architecture - given our focus on scientific visualization, our
>   workloads are much different than the typical game; we have heavy
>   vertex load and relatively simple shaders. In addition, the core
>   counts of machines we run on are much higher. These parameters led
>   to design decisions much different than llvmpipe.
>
> * Historical - Intel had developed a high performance software
>   graphics stack for internal purposes. Later we adapted this
>   graphics stack for use in visualization and decided to move forward
>   with Mesa3D to provide a high quality API layer while at the same
>   time benefiting from the excellent performance the software
>   rasterizer gives us.

It wouldn't be too difficult to make llvmpipe's vertex-shading distributed across threads.

> What's the architecture?
>
> SWR is a tile based immediate mode renderer with a sort-free threading
> model which is arranged as a ring of queues. Each entry in the ring
> represents a draw context that contains all of the draw state and work
> queues. An API thread sets up each draw context and worker threads
> will execute both the frontend (vertex/geometry processing) and
> backend (fragment) work as required. The ring allows for backend
> threads to pull work in order. Large draws are split into chunks to
> allow vertex processing to happen in parallel, with the backend work
> pickup preserving draw ordering.
>
> Our pipeline uses just-in-time compiled code for the fetch shader that
> does vertex attribute gathering and AOS to SOA conversions, the vertex
> shader and fragment shaders, streamout, and fragment blending. SWR
> core also supports geometry and compute shaders but we haven't exposed
> them through our driver yet. The fetch shader, streamout, and blend are
> built internally to swr core using LLVM directly, while for the vertex
> and pixel shaders we reuse bits of llvmpipe from
> gallium/auxiliary/gallivm to build the kernels, which we wrap
> differently than llvmpipe's auxiliary/draw code.
>
> What's the performance?
>
> For the types of high-geometry workloads we're interested in, we are
> significantly faster than llvmpipe. This is to be expected, as
> llvmpipe only threads the fragment processing and not the geometry
> frontend.
>
> The linked slide below shows some performance numbers from a benchmark
> dataset and application. On a 36 total core dual E5-2699v3 we see
> performance 29x to 51x that of llvmpipe.
>
> http://openswr.org/slides/SWR_Sept15.pdf
>
> While our current performance is quite good, we know there is more
> potential in this architecture. When we switched from a prototype
> OpenGL driver to Mesa we regressed performance severely, some due to
> interface issues that need tuning, some differences in shader code
> generation, and some due to conformance and feature additions to the
> core swr. We are looking to recover most of this performance.

I tried it on my i7-5500U, but I ran into two issues:

- OpenSWR seems to only use 2 threads (even though my system supports 4 threads)

- and even when I compensate llvmpipe to only use 2 rasterizer threads, I still only get half the framerate of llvmpipe with the "gloss" Mesa demo (a very simple texturing demo):

$ ./gloss
SWR create screen!
This processor supports AVX2.
720 frames in 5.004 seconds = 143.885 FPS
737 frames in 5.005 seconds = 147.253 FPS
729 frames in 5.004 seconds = 145.683 FPS
732 frames in 5.002 seconds = 146.341 FPS
735
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
On 20/10/15 23:16, Rowley, Timothy O wrote:
> On Oct 20, 2015, at 4:23 PM, Jose Fonseca wrote:
>>
>> I tried it on my i7-5500U, but I ran into two issues:
>>
>> - OpenSWR seems to only use 2 threads (even though my system supports 4
>> threads)
>>
>> - and even when I compensate llvmpipe to only use 2 rasterizer threads, I
>> still only get half the framerate of llvmpipe with the "gloss" Mesa demo (a
>> very simple texturing demo):
>>
>> $ ./gloss
>> SWR create screen!
>> This processor supports AVX2.
>> 720 frames in 5.004 seconds = 143.885 FPS
>> 737 frames in 5.005 seconds = 147.253 FPS
>> 729 frames in 5.004 seconds = 145.683 FPS
>> 732 frames in 5.002 seconds = 146.341 FPS
>> 735 frames in 5.001 seconds = 146.971 FPS
>> [...]
>> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
>> 1539 frames in 5.002 seconds = 307.677 FPS
>> 1719 frames in 5 seconds = 343.8 FPS
>> 1780 frames in 5.002 seconds = 355.858 FPS
>> 1497 frames in 5.002 seconds = 299.28 FPS
>> 1548 frames in 5.001 seconds = 309.538 FPS
>> [..]
>>
>> I see similar ratio with more complex workload with the trace from:
>>
>> http://people.freedesktop.org/~jrfonseca/traces/furmark-1.8.2-svga.trace
>>
>> (you'll need to download https://github.com/apitrace/apitrace and build)
>>
>> My questions are:
>>
>> - Is this the expected performance when texturing is used? Or is there
>> something wrong with my setup?
>
> Two things are happening here to cause the behavior you’re seeing.
> First, OpenSWR only generates threads equal to the number of physical
> cores. On our workloads, going beyond that and using hyperthreads was
> a minimal or negative performance increase. Second, one thread is
> reserved for the API thread, which does not participate in either
> frontend (geometry) or backend (fragment) work. Thus on your two core
> 5500U OpenSWR only had one raster thread versus llvmpipe’s two, giving
> half the performance. If you want to switch OpenSWR to using
> hyperthreads, set the environment variable KNOB_MAX_THREADS_PER_CORE=0.

Thanks for the explanations. It's closer now, but still a bit of gap:

$ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
SWR create screen!
This processor supports AVX2.
--> numThreads = 3
1102 frames in 5.002 seconds = 220.312 FPS
1133 frames in 5.001 seconds = 226.555 FPS
1130 frames in 5.002 seconds = 225.91 FPS
^C
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
1456 frames in 5 seconds = 291.2 FPS
1617 frames in 5.003 seconds = 323.206 FPS
1571 frames in 5.002 seconds = 314.074 FPS

One final question: you said that one thread is reserved for the API, but I see all threads (with top `H`) maxing out the CPU. So if the thread reserved for the API is not doing vertex/fragment processing, then what is it using 100% of a CPU thread for?

Final thoughts: I understand this project has its own history, but I echo what Roland said -- it would be nice to unify with llvmpipe at one point, in some way or fashion. Our (VMware's) focus has been desktop composition, but there's no reason why a single SW renderer can't satisfy both ends of the spectrum, especially for JIT-enabled renderers, since they can emit at runtime the code most suited for the workload.

That said, it's really nice seeing Mesa and Gallium enabling these sorts of experiments with SW rendering.

Jose
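
To put a number on "closer now, but still a bit of gap", the frame rates from the two pairs of gloss runs in this message can be averaged and compared. The sketch below only does that arithmetic; the FPS values are copied from the logs above, while the variable and function names are made up for illustration.

```python
from statistics import mean

# FPS samples copied from the gloss runs quoted in this message.
swr_default = [143.885, 147.253, 145.683, 146.341, 146.971]
llvmpipe_first = [307.677, 343.8, 355.858, 299.28, 309.538]
swr_hyperthreads = [220.312, 226.555, 225.91]
llvmpipe_second = [291.2, 323.206, 314.074]


def fps_ratio(a, b):
    """Ratio of the mean frame rates of two runs."""
    return mean(a) / mean(b)


print(f"default:      {fps_ratio(swr_default, llvmpipe_first):.2f}")
print(f"hyperthreads: {fps_ratio(swr_hyperthreads, llvmpipe_second):.2f}")
```

With these samples the ratio moves from roughly 0.45 to roughly 0.72 of llvmpipe's frame rate. The runs are short and few, so the figures are indicative at best.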
Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer
> On Oct 20, 2015, at 4:23 PM, Jose Fonseca wrote:
>
> I tried it on my i7-5500U, but I ran into two issues:
>
> - OpenSWR seems to only use 2 threads (even though my system supports 4
> threads)
>
> - and even when I compensate llvmpipe to only use 2 rasterizer threads, I
> still only get half the framerate of llvmpipe with the "gloss" Mesa demo (a
> very simple texturing demo):
>
> $ ./gloss
> SWR create screen!
> This processor supports AVX2.
> 720 frames in 5.004 seconds = 143.885 FPS
> 737 frames in 5.005 seconds = 147.253 FPS
> 729 frames in 5.004 seconds = 145.683 FPS
> 732 frames in 5.002 seconds = 146.341 FPS
> 735 frames in 5.001 seconds = 146.971 FPS
> [...]
> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
> 1539 frames in 5.002 seconds = 307.677 FPS
> 1719 frames in 5 seconds = 343.8 FPS
> 1780 frames in 5.002 seconds = 355.858 FPS
> 1497 frames in 5.002 seconds = 299.28 FPS
> 1548 frames in 5.001 seconds = 309.538 FPS
> [..]
>
> I see similar ratio with more complex workload with the trace from:
>
> http://people.freedesktop.org/~jrfonseca/traces/furmark-1.8.2-svga.trace
>
> (you'll need to download https://github.com/apitrace/apitrace and build)
>
> My questions are:
>
> - Is this the expected performance when texturing is used? Or is there
> something wrong with my setup?
>

Two things are happening here to cause the behavior you’re seeing. First, OpenSWR only generates threads equal to the number of physical cores. On our workloads, going beyond that and using hyperthreads gave a minimal performance increase, or even a decrease. Second, one thread is reserved for the API thread, which does not participate in either frontend (geometry) or backend (fragment) work. Thus on your two core 5500U OpenSWR only had one raster thread versus llvmpipe’s two, giving half the performance. If you want to switch OpenSWR to using hyperthreads, set the environment variable KNOB_MAX_THREADS_PER_CORE=0.

> I understand that OpenSWR actually leverages llvmpipe (well gallivm's) code
> for texture sampling, so I was expecting a smaller gap.

Yes, we use gallivm’s texture sampler so our performance should be similar on texture-limited workloads. I tried a quick test of openarena on a 4-core machine and the performance delta was about 6% (default N-1 OpenSWR worker threads).

> - What exactly was the benchmark used for SWR_Sept15.pdf's figures ? Was
> there any texture sampling used on it, or was it just simple lighting?

I don’t have the apitrace in front of me, but I believe the turbulence data was two-sided lit, with a textured plane.

Tim
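
The thread-count explanation earlier in this reply can be written out as a small calculation. This is a sketch under stated assumptions, not the SWR implementation: `swr_worker_threads` is a hypothetical helper, the default limit of one thread per core is inferred from "threads equal to the number of physical cores", and a value of 0 is modelled as "use every hardware thread", which is how KNOB_MAX_THREADS_PER_CORE=0 is described above.

```python
def swr_worker_threads(physical_cores, hw_threads_per_core,
                       max_threads_per_core=1):
    """Worker (raster) threads left after reserving one for the API.

    max_threads_per_core == 0 is modelled as "no limit", i.e. every
    hardware thread is used (assumed meaning of the knob set to 0).
    """
    if max_threads_per_core == 0:
        per_core = hw_threads_per_core
    else:
        per_core = min(max_threads_per_core, hw_threads_per_core)
    total = physical_cores * per_core
    return total - 1  # one thread is reserved for the API


# Two-core, four-thread i7-5500U from the report:
print(swr_worker_threads(2, 2))     # default: 1 worker
print(swr_worker_threads(2, 2, 0))  # hyperthreads enabled: 3 workers
```

The first result matches the single raster thread described above, and a four-core machine at the default setting gives the N-1 = 3 workers mentioned for the openarena test.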