Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2016-02-22 Thread Rowley, Timothy O

> On Feb 17, 2016, at 7:07 PM, Roland Scheidegger  wrote:
> 
> You could use different functions for avx and avx2 code, and plug the
> right ones in at runtime, as you can link them both just fine. It just
> requires that your code containing avx2 code is in a different compile
> unit to the one containing avx-only code. This way you only really have
> separate compiled code for the functions where there's really a
> difference (obviously, this prevents the compiler from using avx2 on its
> own in the shared parts, but I doubt that's a problem). That said, if you
> have lots of differences scattered around (the worst case would probably be
> structures that differ based on the ISA and are used everywhere...), this
> might not be very practical (at first glance, that didn't seem to be the
> case, at least for avx and avx2).
> Though I'm not actually sure how you would do that for c++ template
> code, maybe it doesn't work as easily...
> In any case, so far for llvmpipe we didn't bother (except for the jitted
> code of course) to optimize for newer instruction sets precisely due to
> it being annoying (certainly prevents you from doing "let's just
> optimize this math here in this little inline function when avx is
> available" - so we still have rasterization functions which emulate
> sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).

Unfortunately we have avx and avx2 usage in the general swr code, hidden behind
some macros which emulate the missing avx2 instructions on avx, so there isn’t
a clear boundary inside the swr rasterizer behind which we could load
architecture-specific code.  Additionally, some of the structures will start
changing size when we add avx512 support.
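
Both Tim's macros and the llvmpipe case Roland mentions follow the same pattern: a helper that emulates a newer instruction with older ones. A minimal sketch, assuming x86 with SSE2 (baseline on x86-64); the helper name `mul_epi32_sse2` is hypothetical. It rebuilds SSE4.1's signed `_mm_mul_epi32` from SSE2's unsigned `_mm_mul_epu32`. Reading a signed lane's bits as unsigned gives a_s = a_u - 2^32*[a<0], so a_s*b_s = a_u*b_u - 2^32*(b_u*[a<0] + a_u*[b<0]) modulo 2^64, and the code subtracts exactly that correction in each 64-bit lane.

```c
#include <emmintrin.h> /* SSE2 intrinsics only; no SSE4.1 needed */

/* Emulates SSE4.1 _mm_mul_epi32: multiplies the signed 32-bit values in
 * lanes 0 and 2 of a and b and returns two signed 64-bit products. */
static __m128i mul_epi32_sse2(__m128i a, __m128i b)
{
    /* Unsigned 32x32->64 products of lanes 0 and 2. */
    __m128i prod = _mm_mul_epu32(a, b);

    /* All ones in every 32-bit lane holding a negative value. */
    __m128i a_neg = _mm_srai_epi32(a, 31);
    __m128i b_neg = _mm_srai_epi32(b, 31);

    /* b where a < 0, plus a where b < 0 (mod 2^32 per lane)... */
    __m128i corr = _mm_add_epi32(_mm_and_si128(a_neg, b),
                                 _mm_and_si128(b_neg, a));

    /* ...moved into the upper half of each 64-bit lane, i.e. times 2^32. */
    corr = _mm_slli_epi64(corr, 32);

    return _mm_sub_epi64(prod, corr);
}
```

The correction only matters modulo 2^32 because it ends up in the upper half of each 64-bit lane, which is why plain 32-bit adds are enough.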

I was thinking that “objcopy --prefix-symbols” might be the answer to the
problem of creating two versions of the rasterizer that could be linked
together with the driver, but it renames every symbol globally (internal ones
as well as externals like malloc/free, C++ constructors, etc.), leaving
unresolvable external references.

A global C++ namespace might work, but I don’t see a nonintrusive way of
adding one.
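
Since `objcopy --prefix-symbols` also renames undefined references such as `malloc`, the usual source-level alternative is to give each exported rasterizer symbol an architecture-specific name at compile time. A minimal C sketch of that idea; `SWR_ARCH`, `SWR_FN` and `triangle_area2` are hypothetical names, and routing every declaration through the macro is exactly the intrusiveness mentioned above. In C++ the equivalent would be wrapping the sources in a namespace named after the architecture.

```c
/* Hypothetical build: the same rasterizer source compiled once per ISA,
 *   cc -c -mavx  -DSWR_ARCH=avx  rast.c -o rast_avx.o
 *   cc -c -mavx2 -DSWR_ARCH=avx2 rast.c -o rast_avx2.o
 * so both objects can be linked into one driver without symbol clashes.
 * Undefined references (malloc, free, ...) keep their normal names. */
#ifndef SWR_ARCH
#define SWR_ARCH generic /* default so this sketch also compiles on its own */
#endif

/* Two-level expansion so SWR_ARCH is macro-expanded before pasting. */
#define SWR_PASTE_(prefix, name) prefix##_##name
#define SWR_PASTE(prefix, name) SWR_PASTE_(prefix, name)

/* SWR_FN(triangle_area2) becomes e.g. avx2_triangle_area2. */
#define SWR_FN(name) SWR_PASTE(SWR_ARCH, name)

/* Twice the signed area of a triangle, e.g. for triangle setup. */
int SWR_FN(triangle_area2)(int x0, int y0, int x1, int y1, int x2, int y2)
{
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
}
```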

-Tim

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2016-02-17 Thread Tom Stellard
On Thu, Feb 18, 2016 at 02:07:25AM +0100, Roland Scheidegger wrote:
> On 17.02.2016 at 22:09, Rowley, Timothy O wrote:
> >> On Nov 18, 2015, at 12:34 PM, Emil Velikov
> >>  wrote: I have no objections against
> >> getting this merged, although here are a couple of things that
> >> should be sorted. Some of these are just reiteration from others:
> > 
> > Sorry about the delay responding to this; we’ve been working on a
> > number of the issues you mentioned (plus the usual year-end holidays
> > and other work).
> > 
> >> - First and foremost - please base your work against master. Mesa, 
> >> like most other open-source projects, tries to keep features out
> >> of bugfix releases. As such basing things against 11.0 is not
> >> suitable.
> > 
> > Basing our efforts on a particular Mesa branch was an initial
> > development decision to keep a stable base while we figured out how
> > to build a driver from scratch.  We have now rebased to the Mesa
> > master and periodically merge updates.
> > 
> >> - Further combinatorial explosion of build configurations - with 
> >> internal/external core, swr-arch, etc. Some of these can (should?)
> >> be nuked, although further comments will follow as patch(es) hit
> >> the mailing list.
> > 
> > All the additional swr build options have been removed, leaving swr
> > simply as an additional gallium driver that can be enabled.  The
> > build-time architecture dependence has been addressed by building the
> > swr driver twice (avx and avx2), and having swr_create_screen check
> > the architecture and load the appropriate library.  I’m not
> > completely satisfied with the current solution: since the driver is
> > part of the loaded library, we need to link most of mesa into the
> > “driver”.  The fix for this seems to be to build just the core swr
> > rasterizer per architecture and dlopen/dlsym its fifty or so API
> > entry points.  However, this interim solution simplifies things for
> > our users and removes the swr specific options from the general Mesa
> > build system.
> You could use different functions for avx and avx2 code, and plug the
> right ones in at runtime, as you can link them both just fine. It just
> requires that your code containing avx2 code is in a different compile
> unit to the one containing avx-only code. This way you only really have
> separate compiled code for the functions where there's really a
> difference (obviously, this prevents the compiler from using avx2 on its
> own in the shared parts, but I doubt that's a problem). That said, if you
> have lots of differences scattered around (the worst case would probably be
> structures that differ based on the ISA and are used everywhere...), this
> might not be very practical (at first glance, that didn't seem to be the
> case, at least for avx and avx2).

You can set feature flags on a per-function basis now, so it's possible
to have an avx and avx2 function in the same module.  I haven't actually
tried this, though, so I'm not sure how well it's working at the moment.
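
Tom's remark may refer to LLVM IR functions in a JIT module; for ahead-of-time C code, GCC and Clang expose the same idea through `__attribute__((target("avx2")))`, which lets an AVX2 function and a baseline function share one translation unit. A minimal sketch, assuming GCC or Clang; the function names are hypothetical, and `__builtin_cpu_supports` picks a version at runtime. On non-x86 targets only the portable version is built.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#endif

typedef void (*add_i32_fn)(int32_t *dst, const int32_t *a, const int32_t *b,
                           size_t n);

/* Baseline version, built with the translation unit's default flags. */
static void add_i32_generic(int32_t *dst, const int32_t *a, const int32_t *b,
                            size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#ifdef HAVE_X86
/* Only this function is compiled as if -mavx2 had been passed. */
static __attribute__((target("avx2"))) void
add_i32_avx2(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {           /* 8 x 32-bit lanes per step */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; i++)                     /* scalar tail */
        dst[i] = a[i] + b[i];
}
#endif

/* Returns the AVX2 version only on CPUs that report AVX2 support. */
static add_i32_fn select_add_i32(void)
{
#ifdef HAVE_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return add_i32_avx2;
#endif
    return add_i32_generic;
}
```

Because only the one function is built for AVX2, the shared code around it stays baseline, which is the same trade-off as the separate compile units Roland describes.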

-Tom

> Though I'm not actually sure how you would do that for c++ template
> code, maybe it doesn't work as easily...
> In any case, so far for llvmpipe we didn't bother (except for the jitted
> code of course) to optimize for newer instruction sets precisely due to
> it being annoying (certainly prevents you from doing "let's just
> optimize this math here in this little inline function when avx is
> available" - so we still have rasterization functions which emulate
> sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).
> 
> Roland
> 
> 
> > 
> >> - Using llvm's C++ interface, building against multiple LLVM 
> >> versions. If openswr supports only limited versions of llvm,
> >> then the build should bail out accordingly - more
> >> comments/suggestions as patch(es) hit the ML.
> > 
> > OpenSWR now supports llvm 3.6, 3.7, and 3.8.  We don’t explicitly
> > prevent people from trying to use llvm-svn, though as you say the C++
> > api is not stable so they might encounter problems.
> > 
> >> - Will patches porting core openswr functionality from the
> >> internal tree be part of the public discussions? The VMware people
> >> have done a great thing trying to keep things open, and people
> >> have, on the rare occasion, found nitpicks in their patches.
> > 
> > Moving patches from the internal rasterizer tree can be scripted at a
> > top level, but unfortunately that’s the easy bit of keeping the two
> > in sync when changes happen on both sides of the fence.  I can try
> > tracking individual patches as far as my git knowledge allows.
> > 
> >> - And last but not least - please split patches sensibly (for your
> >> submission and further work). The "Initial public Mesa+SWR"
> >> touches files in quite a few different places.
> > 
> > I’m about to send the patches to the list for review; splitting them
> > into the driver, rasterizer, mesa changes, and build system.
> > 
> >> Mildly related - I'll be resending/merging a 

Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2016-02-17 Thread Roland Scheidegger
On 17.02.2016 at 22:09, Rowley, Timothy O wrote:
>> On Nov 18, 2015, at 12:34 PM, Emil Velikov
>>  wrote: I have no objections against
>> getting this merged, although here are a couple of things that
>> should be sorted. Some of these are just reiteration from others:
> 
> Sorry about the delay responding to this; we’ve been working on a
> number of the issues you mentioned (plus the usual year-end holidays
> and other work).
> 
>> - First and foremost - please base your work against master. Mesa, 
>> like most other open-source projects, tries to keep features out
>> of bugfix releases. As such basing things against 11.0 is not
>> suitable.
> 
> Basing our efforts on a particular Mesa branch was an initial
> development decision to keep a stable base while we figured out how
> to build a driver from scratch.  We have now rebased to the Mesa
> master and periodically merge updates.
> 
>> - Further combinatorial explosion of build configurations - with 
>> internal/external core, swr-arch, etc. Some of these can (should?)
>> be nuked, although further comments will follow as patch(es) hit
>> the mailing list.
> 
> All the additional swr build options have been removed, leaving swr
> simply as an additional gallium driver that can be enabled.  The
> build-time architecture dependence has been addressed by building the
> swr driver twice (avx and avx2), and having swr_create_screen check
> the architecture and load the appropriate library.  I’m not
> completely satisfied with the current solution: since the driver is
> part of the loaded library, we need to link most of mesa into the
> “driver”.  The fix for this seems to be to build just the core swr
> rasterizer per architecture and dlopen/dlsym its fifty or so API
> entry points.  However, this interim solution simplifies things for
> our users and removes the swr specific options from the general Mesa
> build system.
You could use different functions for avx and avx2 code, and plug the
right ones in at runtime, as you can link them both just fine. It just
requires that your code containing avx2 code is in a different compile
unit to the one containing avx-only code. This way you only really have
separate compiled code for the functions where there's really a
difference (obviously, this prevents the compiler from using avx2 on its
own in the shared parts, but I doubt that's a problem). That said, if you
have lots of differences scattered around (the worst case would probably be
structures that differ based on the ISA and are used everywhere...), this
might not be very practical (at first glance, that didn't seem to be the
case, at least for avx and avx2).
Though I'm not actually sure how you would do that for c++ template
code, maybe it doesn't work as easily...
In any case, so far for llvmpipe we didn't bother (except for the jitted
code of course) to optimize for newer instruction sets precisely due to
it being annoying (certainly prevents you from doing "let's just
optimize this math here in this little inline function when avx is
available" - so we still have rasterization functions which emulate
sse41 _mm_mul_epi32 with _mm_mul_epu32 and so on).
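
The suggestion above amounts to a small table of function pointers, filled once at startup and called from otherwise ISA-agnostic code. A minimal sketch; `rast_funcs`, `fill_span_*` and `rast_init` are hypothetical names, and since one file cannot carry two sets of compiler flags, portable C bodies stand in for what would really be separate `-mavx` and `-mavx2` compilation units. How `has_avx2` is detected is left to the caller.

```c
#include <stddef.h>

/* Hypothetical per-ISA variants. In a real build each would sit in its own
 * compilation unit (say fill_avx.c built with -mavx and fill_avx2.c with
 * -mavx2); portable C bodies stand in for them here. */
static void fill_span_avx(unsigned *dst, size_t n, unsigned value)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

static void fill_span_avx2(unsigned *dst, size_t n, unsigned value)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)         /* process 8 values per step */
        for (size_t k = 0; k < 8; k++)
            dst[i + k] = value;
    for (; i < n; i++)                 /* remaining tail */
        dst[i] = value;
}

/* Only functions that really differ per ISA go through this table. */
struct rast_funcs {
    void (*fill_span)(unsigned *dst, size_t n, unsigned value);
};

static struct rast_funcs rast;

/* Fill the table once, e.g. at screen creation. */
static void rast_init(int has_avx2)
{
    rast.fill_span = has_avx2 ? fill_span_avx2 : fill_span_avx;
}

/* Shared code is compiled once and never names an ISA directly. */
static void clear_row(unsigned *row, size_t width, unsigned color)
{
    rast.fill_span(row, width, color);
}
```

Only the table entries are duplicated per ISA, which is why this gets awkward when ISA-dependent macros and structure sizes are spread through the whole rasterizer.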

Roland


> 
>> - Using llvm's C++ interface, building against multiple LLVM 
>> versions. If openswr supports only limited versions of llvm,
>> then the build should bail out accordingly - more
>> comments/suggestions as patch(es) hit the ML.
> 
> OpenSWR now supports llvm 3.6, 3.7, and 3.8.  We don’t explicitly
> prevent people from trying to use llvm-svn, though as you say the C++
> api is not stable so they might encounter problems.
> 
>> - Will patches porting core openswr functionality from the
>> internal tree be part of the public discussions? The VMware people
>> have done a great thing trying to keep things open, and people
>> have, on the rare occasion, found nitpicks in their patches.
> 
> Moving patches from the internal rasterizer tree can be scripted at a
> top level, but unfortunately that’s the easy bit of keeping the two
> in sync when changes happen on both sides of the fence.  I can try
> tracking individual patches as far as my git knowledge allows.
> 
>> - And last but not least - please split patches sensibly (for your
>> submission and further work). The "Initial public Mesa+SWR"
>> touches files in quite a few different places.
> 
> I’m about to send the patches to the list for review; splitting them
> into the driver, rasterizer, mesa changes, and build system.
> 
>> Mildly related - I'll be resending/merging a series which reworks
>> things in src/gallium/auxiliary/target-helpers/ so things might
>> clash as you rebase your work.
> 
> No problem - all part of working with a larger project.  Thanks for
> the heads-up.
> 
> -Tim
> 

Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2016-02-17 Thread Rowley, Timothy O
> On Nov 18, 2015, at 12:34 PM, Emil Velikov  wrote:
> I have no objections against getting this merged, although here are a
> couple of things that should be sorted. Some of these are just
> reiteration from others:

Sorry about the delay responding to this; we’ve been working on a number of the 
issues you mentioned (plus the usual year-end holidays and other work).

> - First and foremost - please base your work against master. Mesa,
> like most other open-source projects, tries to keep features out of
> bugfix releases. As such basing things against 11.0 is not suitable.

Basing our efforts on a particular Mesa branch was an initial development 
decision to keep a stable base while we figured out how to build a driver from 
scratch.  We have now rebased to the Mesa master and periodically merge updates.

> - Further combinatorial explosion of build configurations - with
> internal/external core, swr-arch, etc. Some of these can (should?) be
> nuked, although further comments will follow as patch(es) hit the
> mailing list.

All the additional swr build options have been removed, leaving swr simply as 
an additional gallium driver that can be enabled.  The build-time architecture 
dependence has been addressed by building the swr driver twice (avx and avx2), 
and having swr_create_screen check the architecture and load the appropriate 
library.  I’m not completely satisfied with the current solution: since the
driver is part of the loaded library, we need to link most of mesa into the
“driver”.  The fix for this seems to be to build just the core swr rasterizer
per architecture and dlopen/dlsym its fifty or so API entry points.
However, this interim solution simplifies things for our users and removes the
swr specific options from the general Mesa build system. 
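
The longer-term plan above (build only the rasterizer core per architecture and dlopen/dlsym its entry points) could look roughly like this sketch, assuming a POSIX system. The library names, `swr_load_core` and `swr_resolve` are hypothetical, not the actual swr loader; the point is that a missing library or missing symbol is reported as a failure up front instead of crashing later.

```c
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical file names for the two builds of the rasterizer core. */
static const char *swr_core_library(int has_avx2)
{
    return has_avx2 ? "libswr_core_avx2.so" : "libswr_core_avx.so";
}

/* Resolve each named entry point from an open library into out[].
 * Returns 0 on success, or -1 as soon as one symbol is missing. */
static int swr_resolve(void *handle, const char *const names[], void *out[],
                       size_t count)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = dlsym(handle, names[i]);
        if (out[i] == NULL)
            return -1;
    }
    return 0;
}

/* Open the ISA-specific core and fill the entry-point table. Returns the
 * dlopen handle, or NULL if the library or any symbol is missing. */
static void *swr_load_core(int has_avx2, const char *const names[],
                           void *out[], size_t count)
{
    void *handle = dlopen(swr_core_library(has_avx2), RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;
    if (swr_resolve(handle, names, out, count) != 0) {
        dlclose(handle);
        return NULL;
    }
    return handle;
}
```

A caller would then cast each `out[i]` to the matching function-pointer type, a conversion POSIX permits for pointers returned by `dlsym`.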

> - Using llvm's C++ interface, building against multiple LLVM
> versions. If openswr supports only limited versions of llvm, then
> the build should bail out accordingly - more comments/suggestions as
> patch(es) hit the ML.

OpenSWR now supports llvm 3.6, 3.7, and 3.8.  We don’t explicitly prevent 
people from trying to use llvm-svn, though as you say the C++ api is not stable 
so they might encounter problems.
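
The request to bail out on unsupported LLVM versions, and the version-dependent #ifdefs Roland anticipates, can both hang off one number derived from the `LLVM_VERSION_MAJOR` and `LLVM_VERSION_MINOR` macros in `llvm/Config/llvm-config.h`. A minimal sketch; the `SWR_LLVM_VERSION` encoding and `swr_llvm_is_tested` are hypothetical, and the fallback definitions exist only so the snippet compiles without LLVM installed.

```c
/* In a real build LLVM_VERSION_MAJOR/MINOR come from
 * <llvm/Config/llvm-config.h>. The fallbacks below are placeholders so this
 * sketch compiles without LLVM installed. */
#ifndef LLVM_VERSION_MAJOR
#define LLVM_VERSION_MAJOR 3
#define LLVM_VERSION_MINOR 7
#endif

/* One comparable number per release, e.g. 3.6 -> 306, 3.8 -> 308. */
#define SWR_LLVM_VERSION (LLVM_VERSION_MAJOR * 100 + LLVM_VERSION_MINOR)

/* Bail out at build time on releases older than the supported range. */
#if SWR_LLVM_VERSION < 306
#error "swr requires LLVM 3.6 or newer"
#endif

/* Newer releases (or llvm-svn) are not rejected, only reported as
 * untested, matching the policy described above. */
static int swr_llvm_is_tested(void)
{
    return SWR_LLVM_VERSION >= 306 && SWR_LLVM_VERSION <= 308;
}

/* C++ API differences between releases would then be isolated like:
 *
 *   #if SWR_LLVM_VERSION >= 307
 *       ... spelling required by 3.7 and later ...
 *   #else
 *       ... 3.6 spelling ...
 *   #endif
 */
```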

> - Will patches porting core openswr functionality from the internal
> tree be part of the public discussions? The VMware people have done a
> great thing trying to keep things open, and people have, on the rare
> occasion, found nitpicks in their patches.

Moving patches from the internal rasterizer tree can be scripted at a top 
level, but unfortunately that’s the easy bit of keeping the two in sync when 
changes happen on both sides of the fence.  I can try tracking individual 
patches as far as my git knowledge allows.

> - And last but not least - please split patches sensibly (for your
> submission and further work). The "Initial public Mesa+SWR" touches
> files in quite a few different places.

I’m about to send the patches to the list for review; splitting them into the 
driver, rasterizer, mesa changes, and build system.

> Mildly related - I'll be resending/merging a series which reworks
> things in src/gallium/auxiliary/target-helpers/ so things might clash
> as you rebase your work.

No problem - all part of working with a larger project.  Thanks for the 
heads-up.

-Tim



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-11-18 Thread Emil Velikov
Hi Tim,

I have no objections against getting this merged, although here are a
couple of things that should be sorted. Some of these are just
reiteration from others:

 - First and foremost - please base your work against master. Mesa,
like most other open-source projects, tries to keep features out of
bugfix releases. As such basing things against 11.0 is not suitable.

 - Further combinatorial explosion of build configurations - with
internal/external core, swr-arch, etc. Some of these can (should?) be
nuked, although further comments will follow as patch(es) hit the
mailing list.

 - Using llvm's C++ interface, building against multiple LLVM
versions. If openswr supports only limited versions of llvm, then
the build should bail out accordingly - more comments/suggestions as
patch(es) hit the ML.

 - Will patches porting core openswr functionality from the internal
tree be part of the public discussions? The VMware people have done a
great thing trying to keep things open, and people have, on the rare
occasion, found nitpicks in their patches.

 - And last but not least - please split patches sensibly (for your
submission and further work). The "Initial public Mesa+SWR" touches
files in quite a few different places.

Mildly related - I'll be resending/merging a series which reworks
things in src/gallium/auxiliary/target-helpers/ so things might clash
as you rebase your work.

Thanks
Emil


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-11-10 Thread Rowley, Timothy O

> On Oct 22, 2015, at 4:17 PM, Jose Fonseca  wrote:
> 
> They do share a lot already: Mesa, the gallium state tracker, and gallivm. If
> further development in openswr is planned, it might require jumping through a
> few hoops, but I think it's worth figuring out what it would take to get this
> merged into master so that, whenever there are interface changes, openswr
> won't get the short end of the stick.

Yes, openswr and llvmpipe share a fair bit.  It is my hope that as we start 
working more on openswr performance, some of the effort will benefit both 
drivers.

We’re willing to jump through the hoops needed to merge into master.  To that 
end, I’ve pushed some updates that amongst other things allow us to support 
both llvm 3.6 and 3.7 (and possibly llvm-svn).  Are there any other hoops that 
spring to mind?

-Tim



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-11-10 Thread Roland Scheidegger
On 10.11.2015 at 20:36, Rowley, Timothy O wrote:
> 
>> On Oct 22, 2015, at 4:17 PM, Jose Fonseca  wrote:
>>
>> They do share a lot already: Mesa, the gallium state tracker, and gallivm. If
>> further development in openswr is planned, it might require jumping through
>> a few hoops, but I think it's worth figuring out what it would take to get
>> this merged into master so that, whenever there are interface changes,
>> openswr won't get the short end of the stick.
> 
> Yes, openswr and llvmpipe share a fair bit.  It is my hope that as we start 
> working more on openswr performance, some of the effort will benefit both 
> drivers.
> 
> We’re willing to jump through the hoops needed to merge into master.  To that 
> end, I’ve pushed some updates that amongst other things allow us to support 
> both llvm 3.6 and 3.7 (and possibly llvm-svn).  Are there any other hoops 
> that spring to mind?
> 

FWIW this looks ok to me. You didn't really touch any shared code apart
from adding some extern "C" wrappers, so there's not much to review there.
Plus of course some build changes, which seem fairly obvious, though
someone else might want to look at those.
I didn't look that closely at the driver bits, but as long as you're able
to maintain it, it should be fine...

Roland



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-22 Thread Jose Fonseca

On 22/10/15 00:43, Rowley, Timothy O wrote:

> On Oct 20, 2015, at 5:58 PM, Jose Fonseca  wrote:
>>
>> Thanks for the explanations.  It's closer now, but still a bit of gap:
>>
>> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
>> SWR create screen!
>> This processor supports AVX2.
>> --> numThreads = 3
>> 1102 frames in 5.002 seconds = 220.312 FPS
>> 1133 frames in 5.001 seconds = 226.555 FPS
>> 1130 frames in 5.002 seconds = 225.91 FPS
>> ^C
>> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
>> 1456 frames in 5 seconds = 291.2 FPS
>> 1617 frames in 5.003 seconds = 323.206 FPS
>> 1571 frames in 5.002 seconds = 314.074 FPS

> A bit more of an apples-to-apples comparison might be single-threaded
> llvmpipe (LP_NUM_THREADS=1) and single-threaded swr (KNOB_SINGLE_THREADED=1).
> Running gloss and glxgears (another favorite “benchmark” :) ) under these
> conditions shows swr running a bit slower, though a little closer than your
> numbers.

Indeed that seems a better comparison.

$ KNOB_SINGLE_THREADED=1 ./gloss
SWR create screen!
This processor supports AVX2.
733 frames in 5.003 seconds = 146.512 FPS
787 frames in 5.004 seconds = 157.274 FPS
793 frames in 5.005 seconds = 158.442 FPS
799 frames in 5.001 seconds = 159.768 FPS
787 frames in 5.005 seconds = 157.243 FPS
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=0 ./gloss
939 frames in 5.002 seconds = 187.725 FPS
1032 frames in 5.001 seconds = 206.359 FPS
1017 frames in 5.002 seconds = 203.319 FPS
1021 frames in 5 seconds = 204.2 FPS
1039 frames in 5.002 seconds = 207.717 FPS
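
Converting one steady-state line from each single-threaded run into time per frame (158.442 FPS for swr, 204.2 FPS for llvmpipe) expresses the gap as a per-frame cost. This is only arithmetic on the numbers above and says nothing about where the time goes:

```latex
t_{\text{frame}} = \frac{1000\ \text{ms}}{\text{FPS}}, \qquad
t_{\text{swr}} = \frac{1000}{158.442} \approx 6.31\ \text{ms}, \qquad
t_{\text{llvmpipe}} = \frac{1000}{204.2} \approx 4.90\ \text{ms}
\\
\Delta t \approx 6.31 - 4.90 = 1.41\ \text{ms per frame}, \qquad
\frac{158.442}{204.2} \approx 0.776
```

So in this run swr delivers roughly 78% of llvmpipe's frame rate, or about 1.4 ms more per frame.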

> Examining performance traces, we think swr’s concept of hot-tiles, the
> working memory representation of the render target, and the associated
> load/store functions contribute to most of the difference.  We might be
> able to optimize those conversions; additionally fast clear would help these
> demos.  For larger workloads this small per-frame cost doesn’t really affect
> the performance.

> These initial observations from you and others regarding performance have
> been interesting.  Our performance work has been with large workloads on
> high core count configurations where, although some decisions such as a
> dedicated core for the application/API might have cost a bit of
> performance, the percentage is much smaller than on dual and quad core
> processors.  We’ll look into some changes/tuning that will benefit both
> extremes, though we might have to end up conceding that llvmpipe will be
> faster at glxgears. :-)

I don't care about gears (it practically measures present/blit rate), but
gloss, despite being simple, is sensitive to texturing performance.

>> Final thoughts: I understand this project has its own history, but I echo
>> what Roland said -- it would be nice to unify with llvmpipe at some point,
>> in some way or fashion.  Our (VMware's) focus has been desktop composition,
>> but there's no reason why a single SW renderer can't satisfy both ends of
>> the spectrum, especially for JIT-enabled renderers, since they can emit at
>> runtime the code most suited for the workload.
>
> We would be happy for someone to take some of the ideas from swr to speed up
> llvmpipe, but for now our development will continue on the swr core and
> driver.  We’re not planning on replacing llvmpipe - its goal of working on
> any architecture is admirable.  In the ideal world the solution would be
> something that combines the best traits of both rasterizers, but at this
> point the shortest path to having a performant solution for our customers is
> with swr.

Fair enough.

They do share a lot already: Mesa, the gallium state tracker, and gallivm.
If further development in openswr is planned, it might require jumping
through a few hoops, but I think it's worth figuring out what it would
take to get this merged into master so that, whenever there are
interface changes, openswr won't get the short end of the stick.

>> That said, it's really nice seeing Mesa and Gallium enabling this sort of
>> experiment with SW rendering.

> Yes, we were quite happy with how fast we were able to get a new driver
> functioning with gallium.  The major thing slowing us down was the
> documentation, which is not uniform in coverage.  There was a lot of reading
> other drivers’ source to figure out how things were supposed to work.

Yes, that's a fair comment.

Jose


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-21 Thread Rowley, Timothy O

> On Oct 20, 2015, at 2:03 PM, Roland Scheidegger  wrote:
> 
> Certainly looks interesting...
> From a high level point of view, seems quite similar to llvmpipe (both
> tile based, using llvm for jitting shaders, ...). Of course llvmpipe
> isn't well suited for this kind of workload (the most important use
> case is desktop compositing, so a couple dozen vertices per frame but
> millions of pixels...). Making vertex loads scale is something which
> just wasn't worth the effort so far (there's not actually that many
> people working on llvmpipe), albeit we realize that the completely
> non-parallel nature of it currently actually can hinder scaling quite a
> bit even for "typical" workloads (not desktop compositing, but "simple"
> 3d apps) once you've got enough cores/threads (8 or so), but that's
> something we're not worried too much about.
> I think requiring llvm 3.6 probably isn't going to work if you want to
> upstream this, a minimum version of 3.6 is fine but the general rule is
> things should still work with newer versions (including current
> development version, seems like you're using c++ interface of llvm quite
> a bit so that's probably going to require some #ifdef mess). Albeit I
> guess if you just don't try to build the driver with non-released
> versions that's probably ok (but will limit the ability for some people
> to try out your driver).

Some differences between llvmpipe and swr based on my understanding of 
llvmpipe’s architecture:

threading model
  llvmpipe: single threaded vertex processing, up to 16 rasterization threads
  swr: common thread pool that picks up frontend or backend work as available
vertex processing
  llvmpipe: entire draw call processed in a single pass
  swr: large draws chopped into chunks that can be processed in parallel
frontend/backend coupling
  llvmpipe: separate binning pass in single threaded frontend
  swr: frontend vertex processing and binning combined in a single pass
primitive assembly and binning
  llvmpipe: scalar c code
  swr: x86 avx/avx2 working on vector of primitives
fragment processing
  llvmpipe: single jitted shader combining depth/fragment/stencil/blend on 16x16 block
  swr: separate jitted fragment and blend shaders, plus templated depth test
in-memory representation
  llvmpipe: direct access to render targets
  swr: hot-tile working representation with load and/or store at required times
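
The "large draws chopped into chunks" plus "common thread pool" combination can be illustrated with a shared atomic chunk counter: each thread repeatedly claims the next unprocessed chunk until none remain, so idle threads pick up work as it becomes available. A minimal sketch assuming POSIX threads and C11 atomics; `run_draw`, the chunk size and the per-vertex "work" (summing indices so the result is checkable) are hypothetical stand-ins. For brevity threads are created per draw rather than kept in a persistent pool, and the binning and ordering logic a real frontend needs is omitted.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CHUNKS 1024
#define MAX_WORKERS 16

struct draw_job {
    size_t num_verts;
    size_t chunk_size;
    size_t num_chunks;
    atomic_size_t next_chunk;              /* first chunk nobody has claimed */
    unsigned long long chunk_sum[MAX_CHUNKS];
};

/* Claim chunks until none are left. */
static void *draw_worker(void *arg)
{
    struct draw_job *job = arg;
    for (;;) {
        size_t c = atomic_fetch_add(&job->next_chunk, 1);
        if (c >= job->num_chunks)
            break;
        size_t begin = c * job->chunk_size;
        size_t end = begin + job->chunk_size;
        if (end > job->num_verts)
            end = job->num_verts;
        unsigned long long sum = 0;
        for (size_t v = begin; v < end; v++)
            sum += v;                      /* stand-in for processing vertex v */
        job->chunk_sum[c] = sum;           /* only the claiming thread writes c */
    }
    return NULL;
}

/* Process num_verts vertices in chunks using up to num_workers extra threads.
 * Returns 0 and the combined result in *total, or -1 on bad arguments. */
static int run_draw(size_t num_verts, size_t chunk_size, int num_workers,
                    unsigned long long *total)
{
    struct draw_job job;
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    if (chunk_size == 0 || num_workers < 0 || num_workers > MAX_WORKERS)
        return -1;
    job.num_verts = num_verts;
    job.chunk_size = chunk_size;
    job.num_chunks = (num_verts + chunk_size - 1) / chunk_size;
    if (job.num_chunks > MAX_CHUNKS)
        return -1;
    atomic_init(&job.next_chunk, 0);

    for (int i = 0; i < num_workers; i++)
        if (pthread_create(&tids[started], NULL, draw_worker, &job) == 0)
            started++;

    /* The calling thread pulls chunks too, so the draw completes even if
     * no worker thread could be created. */
    draw_worker(&job);
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    *total = 0;
    for (size_t c = 0; c < job.num_chunks; c++)
        *total += job.chunk_sum[c];
    return 0;
}
```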

As you say, we do use LLVM’s C++ API.  While that has some advantages, it’s not
guaranteed to be stable and can (and does) change in nontrivial ways.  Going from
3.6 to 3.7 changed at least the GEP instruction, which we could work around if
necessary for upstreaming.

-Tim



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-21 Thread Roland Scheidegger
On 22.10.2015 at 00:41, Rowley, Timothy O wrote:
> 
>> On Oct 20, 2015, at 2:03 PM, Roland Scheidegger  wrote:
>>
>> Certainly looks interesting...
>> From a high level point of view, seems quite similar to llvmpipe (both
>> tile based, using llvm for jitting shaders, ...). Of course llvmpipe
>> isn't well suited for this kind of workload (the most important use
>> case is desktop compositing, so a couple dozen vertices per frame but
>> millions of pixels...). Making vertex loads scale is something which
>> just wasn't worth the effort so far (there's not actually that many
>> people working on llvmpipe), albeit we realize that the completely
>> non-parallel nature of it currently actually can hinder scaling quite a
>> bit even for "typical" workloads (not desktop compositing, but "simple"
>> 3d apps) once you've got enough cores/threads (8 or so), but that's
>> something we're not worried too much about.
>> I think requiring llvm 3.6 probably isn't going to work if you want to
>> upstream this, a minimum version of 3.6 is fine but the general rule is
>> things should still work with newer versions (including current
>> development version, seems like you're using c++ interface of llvm quite
>> a bit so that's probably going to require some #ifdef mess). Albeit I
>> guess if you just don't try to build the driver with non-released
>> versions that's probably ok (but will limit the ability for some people
>> to try out your driver).
> 
> Some differences between llvmpipe and swr based on my understanding of 
> llvmpipe’s architecture:
> 
> threading model
>   llvmpipe: single threaded vertex processing, up to 16 rasterization threads
The limit is actually pretty much arbitrary. Though since vertex
processing is single threaded, there are definitely practical scaling
limits (and having more threads than render tiles wouldn't show any
advantage).

>   swr: common thread pool that picks up frontend or backend work as available
> vertex processing
>   llvmpipe: entire draw call processed in a single pass
>   swr: large draws chopped into chunks that can be processed in parallel
> frontend/backend coupling
>   llvmpipe: separate binning pass in single threaded frontend
>   swr: frontend vertex processing and binning combined in a single pass
There are definite advantages to swr there. llvmpipe's binning pass
isn't really separate from vertex processing, so its being
single-threaded is more a result of vertex processing also being
handled in the same frontend thread (though of course if it were
multithreaded, some extra logic would be needed to keep things
correctly ordered).
Part of it is due to draw really being separate from llvmpipe (it can be,
and is, used by other drivers), so the "interface" between vs and fs is
rather simple. But certainly it's not like this is set in stone; rather,
no one has had the time to do something a bit more scalable there...

> primitive assembly and binning
>   llvmpipe: scalar c code
there's actually some jit code there plus some manual sse code (though
there is still a c fallback). That said, it is indeed not quite as parallel
as I'd like (it only works on a single primitive at a time).

>   swr: x86 avx/avx2 working on vector of primitives
> fragment processing
>   llvmpipe: single jitted shader combining depth/fragment/stencil/blend on 16x16 block
It is working on a 4x4 block actually, but otherwise that's right.

>   swr: separate jitted fragment and blend shaders, plus templated depth test
> in-memory representation
>   llvmpipe: direct access to render targets
>   swr: hot-tile working representation with load and/or store at required times
This is actually an interesting difference, of course also tied to
llvmpipe integrating everything together into the fragment shader.

So yes, these are all definitely significant architectural differences
from llvmpipe. But most of it (ok, the combined fragment shader / backend
jit code is not) is not really due to a conscious design decision - I'd
happily accept patches to make it possible to do vertex processing in
parallel :-).


> As you say, we do use LLVM’s C++ API.  While that has some advantages, it’s
> not guaranteed to be stable and can (and does) change in nontrivial ways.
> Going from 3.6 to 3.7 changed at least the GEP instruction, which we could
> work around if necessary for upstreaming.
IMHO you should really try to keep up at least with llvm releases (and
ideally llvm head). Otherwise you make it a pain to build for users and
developers alike (and if stuff doesn't at least get built, it tends to
break quite often when there are gallium interface changes etc.).


Roland

> 
> -Tim
> 

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-21 Thread Rowley, Timothy O

> On Oct 20, 2015, at 5:58 PM, Jose Fonseca  wrote:
> 
> Thanks for the explanations.  It's closer now, but still a bit of gap:
> 
> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
> SWR create screen!
> This processor supports AVX2.
> --> numThreads = 3
> 1102 frames in 5.002 seconds = 220.312 FPS
> 1133 frames in 5.001 seconds = 226.555 FPS
> 1130 frames in 5.002 seconds = 225.91 FPS
> ^C
> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
> 1456 frames in 5 seconds = 291.2 FPS
> 1617 frames in 5.003 seconds = 323.206 FPS
> 1571 frames in 5.002 seconds = 314.074 FPS

A bit more of an apples-to-apples comparison might be single-threaded llvmpipe
(LP_NUM_THREADS=1) versus single-threaded swr (KNOB_SINGLE_THREADED=1).  Running
gloss and glxgears (another favorite “benchmark” :) ) under these conditions
shows swr running a bit slower, though the gap is a little smaller than in your
numbers.  Examining performance traces, we think swr’s hot-tiles (the working
memory representation of the render target) and the associated load/store
functions account for most of the difference.  We might be able to optimize
those conversions; additionally, fast clear would help these demos.  For larger
workloads this small per-frame cost doesn’t really affect performance.
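To make the hot-tile load/store cost concrete, here is a minimal C sketch of the kind of conversion being described. It is not SWR's code: the 8x8 tile size, the linear RGBA8 render-target format, the one-float-plane-per-channel tile layout, and the names HotTile, load_hot_tile and store_hot_tile are all hypothetical. Every pixel the tile covers is unpacked on load and repacked on store, which is a roughly fixed per-frame cost that a tiny demo like gloss cannot amortize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tile size and formats (not taken from SWR): an 8x8
 * hot-tile stored as one float plane per channel, backed by a linear
 * RGBA8 render target. */
#define TILE_W 8
#define TILE_H 8

typedef struct {
    float r[TILE_H][TILE_W];
    float g[TILE_H][TILE_W];
    float b[TILE_H][TILE_W];
    float a[TILE_H][TILE_W];
} HotTile;

/* Load: unpack the TILE_W x TILE_H region at (x0, y0) of an RGBA8
 * surface (pitch = bytes per row) into normalized floats. */
static void load_hot_tile(HotTile *t, const uint8_t *surf, size_t pitch,
                          int x0, int y0)
{
    for (int y = 0; y < TILE_H; ++y) {
        const uint8_t *row = surf + (size_t)(y0 + y) * pitch + (size_t)x0 * 4;
        for (int x = 0; x < TILE_W; ++x) {
            t->r[y][x] = row[4 * x + 0] / 255.0f;
            t->g[y][x] = row[4 * x + 1] / 255.0f;
            t->b[y][x] = row[4 * x + 2] / 255.0f;
            t->a[y][x] = row[4 * x + 3] / 255.0f;
        }
    }
}

/* Clamp to [0, 1] and round to the nearest 8-bit unorm value. */
static uint8_t float_to_unorm8(float v)
{
    if (v <= 0.0f) return 0;
    if (v >= 1.0f) return 255;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Store: pack the tile back into the RGBA8 surface. */
static void store_hot_tile(const HotTile *t, uint8_t *surf, size_t pitch,
                           int x0, int y0)
{
    for (int y = 0; y < TILE_H; ++y) {
        uint8_t *row = surf + (size_t)(y0 + y) * pitch + (size_t)x0 * 4;
        for (int x = 0; x < TILE_W; ++x) {
            row[4 * x + 0] = float_to_unorm8(t->r[y][x]);
            row[4 * x + 1] = float_to_unorm8(t->g[y][x]);
            row[4 * x + 2] = float_to_unorm8(t->b[y][x]);
            row[4 * x + 3] = float_to_unorm8(t->a[y][x]);
        }
    }
}
```

The fragment and blend stages would then work on the float planes, and only the load/store pair touches the real render-target format.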

> One final question: you said that one thread is reserved for the API, but I
> see all threads (with top `H`) maxing out the CPU. So if the thread reserved
> for the API is not doing vertex/fragment processing, then what is it using
> 100% of a CPU thread for?

With a trivial application main loop and light api usage, the API thread is 
going to end up spending most of the time waiting for the other threads to 
finish work.

These initial observations from you and others regarding performance have been
interesting.  Our performance work has been with large workloads on high core
count configurations, where some decisions, such as dedicating a core to the
application/API, may cost a little performance, but that cost is a much smaller
percentage than on dual- and quad-core processors.  We’ll look into some
changes/tuning that will benefit both extremes, though we might have to end up
conceding that llvmpipe will be faster at glxgears. :-)

> Final thoughts: I understand this project has its own history, but I echo
> what Roland said -- it would be nice to unify with llvmpipe at some point, in
> some way or fashion.  Our (VMware's) focus has been desktop composition, but
> there's no reason why a single SW renderer can't satisfy both ends of the
> spectrum, especially for JIT-enabled renderers, since they can emit at runtime
> the code most suited for the workload.

We would be happy for someone to take some of the ideas from swr to speed up
llvmpipe, but for now our development will continue on the swr core and driver.
We’re not planning on replacing llvmpipe - its goal of working on any
architecture is admirable.  In an ideal world the solution would be something
that combines the best traits of both rasterizers, but at this point the
shortest path to a performant solution for our customers is swr.

> That said, it's really nice seeing Mesa and Gallium enabling these sorts of
> experiments with SW rendering.

Yes, we were quite happy with how fast we were able to get a new driver
functioning with gallium.  The major thing slowing us down was the documentation,
whose coverage is uneven.  We did a lot of reading of other drivers’
source to figure out how things were supposed to work.

-Tim



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Roland Scheidegger
Am 20.10.2015 um 19:11 schrieb Rowley, Timothy O:
> Hi.  I'd like to introduce the Mesa3D community to a software project
> that we hope to upstream.  We're a small team at Intel working on
> software defined visualization (http://sdvis.org/), and have
> opensource projects in both the raytracing (Embree, OSPRay) and
> rasterization (OpenSWR) realms.
> 
> We're a different Intel team from that of i965 fame, with a different
> type of customer and workloads.  Our customers have large clusters of
> compute nodes that for various reasons do not have GPUs, and are
> working with extremely large geometry models.
> 
> We've been working on a high performance, highly scalable rasterizer
> and driver to interface with Mesa3D.  Our rasterizer functions as a
> "software gpu", relying on the mature well-supported Mesa3D to provide
> API and state tracking layers.
> 
> We would like to contribute this code to Mesa3D and continue doing
> active development in your source repository.  We welcome discussion
> about how this will happen and questions about the project itself.
> Below are some answers to what we think might be frequently asked
> questions.
> 
> Bruce and I will be the public contacts for this project, but this
> project isn't solely our work - there's a dedicated group of people
> working on the core SWR code.
> 
>   Tim Rowley
>   Bruce Cherniak
> 
>   Intel Corporation
> 
> Why another software rasterizer?
> 
> 
> Good question, given there are already three (swrast, softpipe,
> llvmpipe) in the Mesa3D tree. Two important reasons for this:
> 
>  * Architecture - given our focus on scientific visualization, our
>    workloads are very different from those of a typical game; we have heavy
>    vertex load and relatively simple shaders.  In addition, the core
>    counts of machines we run on are much higher.  These parameters led
>    to design decisions very different from llvmpipe's.
> 
>  * Historical - Intel had developed a high performance software
>    graphics stack for internal purposes.  Later we adapted this
>    graphics stack for use in visualization and decided to move forward
>    with Mesa3D to provide a high quality API layer while at the same
>    time benefiting from the excellent performance the software
>    rasterizer gives us.
> 
> What's the architecture?
> 
> 
> SWR is a tile based immediate mode renderer with a sort-free threading
> model which is arranged as a ring of queues.  Each entry in the ring
> represents a draw context that contains all of the draw state and work
> queues.  An API thread sets up each draw context and worker threads
> will execute both the frontend (vertex/geometry processing) and
> backend (fragment) work as required.  The ring allows for backend
> threads to pull work in order.  Large draws are split into chunks to
> allow vertex processing to happen in parallel, with the backend work
> pickup preserving draw ordering.
> 
> Our pipeline uses just-in-time compiled code for the fetch shader that
> does vertex attribute gathering and AOS to SOA conversions, the vertex
> shader and fragment shaders, streamout, and fragment blending. SWR
> core also supports geometry and compute shaders but we haven't exposed
> them through our driver yet. The fetch shader, streamout, and blend are
> built internally to swr core using LLVM directly, while for the vertex
> and pixel shaders we reuse bits of llvmpipe from
> gallium/auxiliary/gallivm to build the kernels, which we wrap
> differently than llvmpipe's auxiliary/draw code.
> 
> What's the performance?
> ---
> 
> For the types of high-geometry workloads we're interested in, we are
> significantly faster than llvmpipe.  This is to be expected, as
> llvmpipe only threads the fragment processing and not the geometry
> frontend.
> 
> The linked slide below shows some performance numbers from a benchmark
> dataset and application.  On a 36 total core dual E5-2699v3 we see
> performance 29x to 51x that of llvmpipe.  
> 
>   http://openswr.org/slides/SWR_Sept15.pdf
> 
> While our current performance is quite good, we know there is more
> potential in this architecture.  When we switched from a prototype
> OpenGL driver to Mesa we regressed performance severely, some due to
> interface issues that need tuning, some differences in shader code
> generation, and some due to conformance and feature additions to the
> core swr.  We are looking to recover most of this performance.
> 
> What's the conformance?
> ---
> 
> The major applications we are targeting are all based on the
> Visualization Toolkit (VTK), and as such our development efforts have
> been focused on making sure these work as best as possible.  Our
> current code passes vtk's rendering tests with their new "OpenGL2"
> (really OpenGL 3.2) backend at 99%.
> 
> piglit testing shows a much lower pass rate, roughly 80% at the time
> of writing.  Core SWR undergoes 

Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Ilia Mirkin
[re-adding mesa-dev, dropped accidentally]

On Tue, Oct 20, 2015 at 1:43 PM, Ilia Mirkin  wrote:
> On Tue, Oct 20, 2015 at 1:11 PM, Rowley, Timothy O
>  wrote:
>> Does one build work on both AVX and AVX2?
>> -
>>
>>  * Unfortunately, no.  The architecture support is fixed at compile
>>    time.  While the AVX version of course will run on AVX2 machines
>>    and the jitted code will use AVX2, the overall performance will
>>    suffer relative to a full AVX2 build.
>>
>>  * There is some idea that if we move some code from the driver back
>>    to SWR core, we could build two versions of libSWR and dynamically
>>    load the correct version at runtime.  Unfortunately this mechanism
>>    would not work with AVX512, as some of the SWR state structures
>>    would change size.
>
> Without commenting on any of the other issues, I believe one of your
> stated goals is to ease distribution to your end-users. If you expect
> them to build their own code, that's no problem. However if you're
> thinking of relying on distros to include your driver and have end
> users use that, then you should consider some solution that enables
> runtime selection of this stuff (even if that's building 3 versions of
> the driver -- swr-avx, swr-avx2, swr-avx512, and having e.g. loader
> magic determine which the right one is for the current CPU).
>
> Cheers,
>
>   -ilia


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Rowley, Timothy O

> On Oct 20, 2015, at 12:44 PM, Ilia Mirkin  wrote:
> 
> On Tue, Oct 20, 2015 at 1:43 PM, Ilia Mirkin  wrote:
>> On Tue, Oct 20, 2015 at 1:11 PM, Rowley, Timothy O
>>  wrote:
>>> Does one build work on both AVX and AVX2?
>>> -
>>> 
>>> * Unfortunately, no.  The architecture support is fixed at compile
>>>   time.  While the AVX version of course will run on AVX2 machines
>>>   and the jitted code will use AVX2, the overall performance will
>>>   suffer relative to a full AVX2 build.
>>> 
>>> * There is some idea that if we move some code from the driver back
>>>   to SWR core, we could build two versions of libSWR and dynamically
>>>   load the correct version at runtime.  Unfortunately this mechanism
>>>   would not work with AVX512, as some of the SWR state structures
>>>   would change size.
>> 
>> Without commenting on any of the other issues, I believe one of your
>> stated goals is to ease distribution to your end-users. If you expect
>> them to build their own code, that's no problem. However if you're
>> thinking of relying on distros to include your driver and have end
>> users use that, then you should consider some solution that enables
>> runtime selection of this stuff (even if that's building 3 versions of
>> the driver -- swr-avx, swr-avx2, swr-avx512, and having e.g. loader
>> magic determine which the right one is for the current CPU).

We’ve found that the large clusters tend to roll their own user environment 
specific to their system configuration, so this problem of binary support 
hasn’t been an immediate concern for the initial users.  We hadn’t considered 
building complete driver/core-swr combinations behind a loader; we’ll consider 
this as a possibility for avx512.

Most of the code movement needed for runtime selection at the interface layer
between core SWR and the driver has been done; we would still need to check the
driver for any stray AVX/AVX2 architecture differences and add loader logic.
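A minimal sketch of the loader-side selection Ilia suggests, assuming three separately built driver/core-swr libraries. The library names are hypothetical, and the CPU query uses the GCC/Clang `__builtin_cpu_supports` builtin (x86 only); the actual loading step would then be done with dlopen(3)/dlsym(3), which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical library names, one per driver + core-swr build. */
static const char *const SWR_LIB_AVX    = "libswr-avx.so";
static const char *const SWR_LIB_AVX2   = "libswr-avx2.so";
static const char *const SWR_LIB_AVX512 = "libswr-avx512.so";

typedef struct {
    bool avx;
    bool avx2;
    bool avx512f;
} CpuFeatures;

/* Query the running CPU via GCC/Clang builtins; on other compilers or
 * non-x86 targets every feature is reported as absent. */
static CpuFeatures detect_cpu_features(void)
{
    CpuFeatures f = { false, false, false };
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.avx     = __builtin_cpu_supports("avx");
    f.avx2    = __builtin_cpu_supports("avx2");
    f.avx512f = __builtin_cpu_supports("avx512f");
#endif
    return f;
}

/* Pick the most capable build this CPU can run; NULL means no AVX at
 * all, so the caller would have to fall back to another driver. */
static const char *select_swr_library(CpuFeatures f)
{
    if (f.avx512f) return SWR_LIB_AVX512;
    if (f.avx2)    return SWR_LIB_AVX2;
    if (f.avx)     return SWR_LIB_AVX;
    return NULL;
}
```

Keeping the selection separate from the detection makes the policy easy to test without the matching hardware.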

-Tim



Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Jose Fonseca

On 20/10/15 18:11, Rowley, Timothy O wrote:

Hi.  I'd like to introduce the Mesa3D community to a software project
that we hope to upstream.  We're a small team at Intel working on
software defined visualization (http://sdvis.org/), and have
opensource projects in both the raytracing (Embree, OSPRay) and
rasterization (OpenSWR) realms.

We're a different Intel team from that of i965 fame, with a different
type of customer and workloads.  Our customers have large clusters of
compute nodes that for various reasons do not have GPUs, and are
working with extremely large geometry models.

We've been working on a high performance, highly scalable rasterizer
and driver to interface with Mesa3D.  Our rasterizer functions as a
"software gpu", relying on the mature well-supported Mesa3D to provide
API and state tracking layers.

We would like to contribute this code to Mesa3D and continue doing
active development in your source repository.  We welcome discussion
about how this will happen and questions about the project itself.
Below are some answers to what we think might be frequently asked
questions.

Bruce and I will be the public contacts for this project, but this
project isn't solely our work - there's a dedicated group of people
working on the core SWR code.

   Tim Rowley
   Bruce Cherniak

   Intel Corporation

Why another software rasterizer?


Good question, given there are already three (swrast, softpipe,
llvmpipe) in the Mesa3D tree. Two important reasons for this:

  * Architecture - given our focus on scientific visualization, our
    workloads are very different from those of a typical game; we have heavy
    vertex load and relatively simple shaders.  In addition, the core
    counts of machines we run on are much higher.  These parameters led
    to design decisions very different from llvmpipe's.

  * Historical - Intel had developed a high performance software
    graphics stack for internal purposes.  Later we adapted this
    graphics stack for use in visualization and decided to move forward
    with Mesa3D to provide a high quality API layer while at the same
    time benefiting from the excellent performance the software
    rasterizer gives us.


It wouldn't be too difficult to distribute llvmpipe's vertex shading
across threads.



What's the architecture?


SWR is a tile based immediate mode renderer with a sort-free threading
model which is arranged as a ring of queues.  Each entry in the ring
represents a draw context that contains all of the draw state and work
queues.  An API thread sets up each draw context and worker threads
will execute both the frontend (vertex/geometry processing) and
backend (fragment) work as required.  The ring allows for backend
threads to pull work in order.  Large draws are split into chunks to
allow vertex processing to happen in parallel, with the backend work
pickup preserving draw ordering.
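As a rough illustration of the ordering property described above, here is a single-threaded C sketch of a ring of draw contexts: the API thread fills slots at the tail, frontend chunks of any draw may complete in any order, and the backend retires draws only from the head. Everything here (ring size, chunk limit, names) is hypothetical; the real SWR backend works per tile rather than per whole draw, and a multithreaded version would need atomics or locks around head, tail and the chunk counters.

```c
#include <stdbool.h>

/* Hypothetical limits; the thread does not give SWR's real values. */
#define RING_SIZE  4   /* draw contexts in flight */
#define MAX_CHUNKS 8   /* frontend chunks per draw */

typedef struct {
    int  draw_id;
    int  num_chunks;
    int  chunks_remaining;
    bool chunk_done[MAX_CHUNKS];
} DrawContext;

typedef struct {
    DrawContext slots[RING_SIZE];
    unsigned head;  /* oldest draw not yet retired by the backend */
    unsigned tail;  /* next slot the API thread will fill */
} DrawRing;

/* API thread: set up a draw context, splitting num_verts into chunks of
 * chunk_size vertices.  Returns the draw's sequence number, or -1 if the
 * ring is full or the draw would need more than MAX_CHUNKS chunks. */
static int ring_submit(DrawRing *r, int draw_id, int num_verts, int chunk_size)
{
    int chunks = (num_verts + chunk_size - 1) / chunk_size;
    if (r->tail - r->head == RING_SIZE || chunks > MAX_CHUNKS)
        return -1;
    DrawContext *dc = &r->slots[r->tail % RING_SIZE];
    dc->draw_id = draw_id;
    dc->num_chunks = chunks;
    dc->chunks_remaining = chunks;
    for (int i = 0; i < MAX_CHUNKS; ++i)
        dc->chunk_done[i] = false;
    return (int)r->tail++;
}

/* Worker, frontend: mark one chunk of draw `seq` as processed.  Chunks of
 * different draws may finish in any order. */
static void ring_complete_chunk(DrawRing *r, int seq, int chunk)
{
    DrawContext *dc = &r->slots[(unsigned)seq % RING_SIZE];
    if (chunk >= 0 && chunk < dc->num_chunks && !dc->chunk_done[chunk]) {
        dc->chunk_done[chunk] = true;
        dc->chunks_remaining--;
    }
}

/* Worker, backend: only the oldest draw may be retired, and only once all
 * of its frontend chunks are done.  Returns the retired draw_id, or -1 if
 * nothing can be retired yet. */
static int ring_retire(DrawRing *r)
{
    if (r->head == r->tail)
        return -1;
    DrawContext *dc = &r->slots[r->head % RING_SIZE];
    if (dc->chunks_remaining > 0)
        return -1;
    r->head++;
    return dc->draw_id;
}
```

For example, if a small draw's frontend finishes before the three chunks of the large draw submitted ahead of it, ring_retire keeps returning -1 until the large draw is complete, so backend work still starts in submission order.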

Our pipeline uses just-in-time compiled code for the fetch shader that
does vertex attribute gathering and AOS to SOA conversions, the vertex
shader and fragment shaders, streamout, and fragment blending. SWR
core also supports geometry and compute shaders but we haven't exposed
them through our driver yet. The fetch shader, streamout, and blend are
built internally to swr core using LLVM directly, while for the vertex
and pixel shaders we reuse bits of llvmpipe from
gallium/auxiliary/gallivm to build the kernels, which we wrap
differently than llvmpipe's auxiliary/draw code.
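For readers unfamiliar with the AOS to SOA step the fetch shader performs, a scalar C sketch follows. It assumes a single float4 vertex attribute and an 8-wide SIMD (a 256-bit AVX register holds 8 floats); the type and function names are hypothetical, and the real fetch shader is JIT-compiled through LLVM using vector gathers/shuffles rather than a C loop.

```c
/* Hypothetical layout: one float4 attribute per vertex, 8 SIMD lanes. */
#define SIMD_WIDTH 8

typedef struct { float x, y, z, w; } VertexAOS;

typedef struct {
    float x[SIMD_WIDTH];
    float y[SIMD_WIDTH];
    float z[SIMD_WIDTH];
    float w[SIMD_WIDTH];
} VertexSOA;

/* Gather the vertices named by indices[] from an AOS vertex buffer and
 * transpose them so each component occupies one SIMD-width array.  Only
 * the first SIMD_WIDTH indices are used; lanes past count are zeroed. */
static void fetch_aos_to_soa(const VertexAOS *verts, const unsigned *indices,
                             int count, VertexSOA *out)
{
    for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
        if (lane < count) {
            const VertexAOS *v = &verts[indices[lane]];
            out->x[lane] = v->x;
            out->y[lane] = v->y;
            out->z[lane] = v->z;
            out->w[lane] = v->w;
        } else {
            out->x[lane] = out->y[lane] = out->z[lane] = out->w[lane] = 0.0f;
        }
    }
}
```

After this step a vertex shader can process 8 vertices at once, with each SIMD instruction operating on the same component of all 8.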

What's the performance?
---

For the types of high-geometry workloads we're interested in, we are
significantly faster than llvmpipe.  This is to be expected, as
llvmpipe only threads the fragment processing and not the geometry
frontend.

The linked slide below shows some performance numbers from a benchmark
dataset and application.  On a 36 total core dual E5-2699v3 we see
performance 29x to 51x that of llvmpipe.

http://openswr.org/slides/SWR_Sept15.pdf

While our current performance is quite good, we know there is more
potential in this architecture.  When we switched from a prototype
OpenGL driver to Mesa we regressed performance severely, some due to
interface issues that need tuning, some differences in shader code
generation, and some due to conformance and feature additions to the
core swr.  We are looking to recover most of this performance.


I tried it on my i7-5500U, but I ran into two issues:

- OpenSWR seems to only use 2 threads (even though my system supports 4
threads)

- and even when I limit llvmpipe to 2 rasterizer threads,
I still only get half of llvmpipe's framerate with the "gloss" Mesa
demo (a very simple texturing demo):


$ ./gloss
SWR create screen!
This processor supports AVX2.
720 frames in 5.004 seconds = 143.885 FPS
737 frames in 5.005 seconds = 147.253 FPS
729 frames in 5.004 seconds = 145.683 FPS
732 frames in 5.002 seconds = 146.341 FPS
735 

Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Jose Fonseca

On 20/10/15 23:16, Rowley, Timothy O wrote:



On Oct 20, 2015, at 4:23 PM, Jose Fonseca  wrote:

I tried it on my i7-5500U, but I ran into two issues:

- OpenSWR seems to only use 2 threads (even though my system supports 4 threads)

- and even when I limit llvmpipe to 2 rasterizer threads, I still only get
half of llvmpipe's framerate with the "gloss" Mesa demo (a very simple
texturing demo):

$ ./gloss
SWR create screen!
This processor supports AVX2.
720 frames in 5.004 seconds = 143.885 FPS
737 frames in 5.005 seconds = 147.253 FPS
729 frames in 5.004 seconds = 145.683 FPS
732 frames in 5.002 seconds = 146.341 FPS
735 frames in 5.001 seconds = 146.971 FPS
[...]
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
1539 frames in 5.002 seconds = 307.677 FPS
1719 frames in 5 seconds = 343.8 FPS
1780 frames in 5.002 seconds = 355.858 FPS
1497 frames in 5.002 seconds = 299.28 FPS
1548 frames in 5.001 seconds = 309.538 FPS
[..]

I see a similar ratio with a more complex workload with the trace from:

  http://people.freedesktop.org/~jrfonseca/traces/furmark-1.8.2-svga.trace

(you'll need to download https://github.com/apitrace/apitrace and build)

My questions are:

- Is this the expected performance when texturing is used? Or is there 
something wrong with my setup?



Two things are happening here to cause the behavior you’re seeing.  First,
OpenSWR only creates as many threads as there are physical cores.  On our
workloads, going beyond that and using hyperthreads gave minimal or even
negative performance gains.  Second, one thread is reserved as the API thread,
which does not participate in either frontend (geometry) or backend (fragment)
work.  Thus on your two-core 5500U OpenSWR had only one raster thread versus
llvmpipe’s two, giving half the performance.  If you want to switch OpenSWR to
using hyperthreads, set the environment variable KNOB_MAX_THREADS_PER_CORE=0.


Thanks for the explanations.  It's closer now, but still a bit of gap:

$ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
SWR create screen!
This processor supports AVX2.
--> numThreads = 3
1102 frames in 5.002 seconds = 220.312 FPS
1133 frames in 5.001 seconds = 226.555 FPS
1130 frames in 5.002 seconds = 225.91 FPS
^C
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
1456 frames in 5 seconds = 291.2 FPS
1617 frames in 5.003 seconds = 323.206 FPS
1571 frames in 5.002 seconds = 314.074 FPS


One final question: you said that one thread is reserved for the API,
but I see all threads (with top `H`) maxing out the CPU.  So if the
thread reserved for the API is not doing vertex/fragment processing,
then what is it using 100% of a CPU thread for?



Final thoughts: I understand this project has its own history, but I
echo what Roland said -- it would be nice to unify with llvmpipe at some
point, in some way or fashion.  Our (VMware's) focus has been desktop
composition, but there's no reason why a single SW renderer can't
satisfy both ends of the spectrum, especially for JIT-enabled renderers,
since they can emit at runtime the code most suited for the workload.


That said, it's really nice seeing Mesa and Gallium enabling these sorts
of experiments with SW rendering.



Jose


Re: [Mesa-dev] Introducing OpenSWR: High performance software rasterizer

2015-10-20 Thread Rowley, Timothy O

> On Oct 20, 2015, at 4:23 PM, Jose Fonseca  wrote:
> 
> I tried it on my i7-5500U, but I ran into two issues:
>
> - OpenSWR seems to only use 2 threads (even though my system supports 4
> threads)
>
> - and even when I limit llvmpipe to 2 rasterizer threads, I
> still only get half of llvmpipe's framerate with the "gloss" Mesa demo (a
> very simple texturing demo):
> 
> $ ./gloss
> SWR create screen!
> This processor supports AVX2.
> 720 frames in 5.004 seconds = 143.885 FPS
> 737 frames in 5.005 seconds = 147.253 FPS
> 729 frames in 5.004 seconds = 145.683 FPS
> 732 frames in 5.002 seconds = 146.341 FPS
> 735 frames in 5.001 seconds = 146.971 FPS
> [...]
> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
> 1539 frames in 5.002 seconds = 307.677 FPS
> 1719 frames in 5 seconds = 343.8 FPS
> 1780 frames in 5.002 seconds = 355.858 FPS
> 1497 frames in 5.002 seconds = 299.28 FPS
> 1548 frames in 5.001 seconds = 309.538 FPS
> [..]
> 
> I see a similar ratio with a more complex workload with the trace from:
> 
>  http://people.freedesktop.org/~jrfonseca/traces/furmark-1.8.2-svga.trace
> 
> (you'll need to download https://github.com/apitrace/apitrace and build)
> 
> My questions are:
> 
> - Is this the expected performance when texturing is used? Or is there 
> something wrong with my setup?
> 

Two things are happening here to cause the behavior you’re seeing.  First,
OpenSWR only creates as many threads as there are physical cores.  On our
workloads, going beyond that and using hyperthreads gave minimal or even
negative performance gains.  Second, one thread is reserved as the API thread,
which does not participate in either frontend (geometry) or backend (fragment)
work.  Thus on your two-core 5500U OpenSWR had only one raster thread versus
llvmpipe’s two, giving half the performance.  If you want to switch OpenSWR to
using hyperthreads, set the environment variable KNOB_MAX_THREADS_PER_CORE=0.
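The thread-count arithmetic above can be written out as a small C sketch. The function names are hypothetical, and treating an unset KNOB_MAX_THREADS_PER_CORE as 1 and the value 0 as "no per-core limit" is an assumption based only on this message. On a 2-core/4-thread i7-5500U it gives 1 worker by default and 3 with KNOB_MAX_THREADS_PER_CORE=0, which appears consistent with the "--> numThreads = 3" line in the follow-up gloss run in this thread.

```c
#include <stdlib.h>

/* Assumption: unset means one thread per physical core, and 0 means
 * "no per-core limit" (use every hyperthread). */
static int knob_max_threads_per_core(void)
{
    const char *s = getenv("KNOB_MAX_THREADS_PER_CORE");
    if (s == NULL || *s == '\0')
        return 1;
    return (int)strtol(s, NULL, 10);
}

/* Hypothetical helper: threads left for frontend/backend work once one
 * hardware thread is set aside as the API thread. */
static int swr_worker_threads(int physical_cores, int hw_threads_per_core,
                              int max_threads_per_core)
{
    int per_core = hw_threads_per_core;
    if (max_threads_per_core > 0 && max_threads_per_core < per_core)
        per_core = max_threads_per_core;
    int total = physical_cores * per_core;
    /* Reserve one thread for the API; with a single hardware thread
     * there is nothing left over for workers. */
    return total > 1 ? total - 1 : 0;
}
```

The same formula gives the "default N-1 OpenSWR worker threads" on the 4-core openarena test mentioned below, and 35 workers on the 36-core dual E5-2699v3 from the slides.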

> I understand that OpenSWR actually leverages llvmpipe's (well, gallivm's) code
> for texture sampling, so I was expecting a smaller gap.

Yes, we use gallivm’s texture sampler so our performance should be similar on 
texture-limited workloads.  I tried a quick test of openarena on a 4-core 
machine and the performance delta was about 6% (default N-1 OpenSWR worker 
threads).

> - What exactly was the benchmark used for SWR_Sept15.pdf's figures? Was
> there any texture sampling used on it, or was it just simple lighting?

I don’t have the apitrace in front of me, but I believe the turbulence data was 
two-sided lit, with a textured plane.

Tim
