Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-06-11 Thread Jose Fonseca

On 11/06/17 07:59, Jacob Lifshay wrote:
On Sat, Jun 10, 2017 at 3:25 PM Jose Fonseca wrote:


I don't see how to effectively tack triangle setup onto the vertex
shader: the vertex shader applies to vertices, whereas triangle setup and
binning apply to primitives.  Usually, each vertex gets transformed
only once with llvmpipe, no matter how many triangles refer to that vertex.
The only way to tack triangle setup onto vertex shading would be if
you processed vertices a primitive at a time.  Of course one could put
in an if-statement to skip reprocessing a vertex that was already
processed, but then you have race conditions, and no benefit from
inlining.

I was mostly thinking of non-indexed vertices.



And I'm afraid that tacking on rasterization, too, is one of those things
that sounds great on paper but is quite bad in practice.  And I speak from
experience: in fact llvmpipe had the last step of rasterization bolted
onto the fragment shaders for some time.  But we took it out because it
was _slower_.

The issue is that if you bolt it onto the shader body, you either:

- inline into the shader body the code for the maximum number of planes
(which is 7: 3 sides of the triangle, plus 4 sides of a scissor rect), and
waste CPU cycles going through all of those tests, even though most of the
time many of those tests aren't needed

- or you generate if/for blocks for each plane, so you only do the
needed tests, but then you have branch prediction issues...

Whereas if you keep rasterization _outside_ the shader you can have
specialized functions that do the rasterization based on the primitive
itself (if the triangle is fully inside the scissor, you need only 3
planes; if the stamp is fully inside the triangle, you need zero).
Essentially you can "compose" by coupling two function calls: you call a
rasterization function that's specialized for the primitive, then a
shading function that's specialized for the state (but does not depend on
the primitive).

It makes sense: rasterization needs to be specialized for the primitive,
not the graphics state; whereas the shader needs to be specialized for
the state.
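[Editor's note: the composition described above — a per-primitive rasterization function paired with a per-state shading function — can be sketched roughly like this. All names are invented for illustration; this is not llvmpipe's actual API.]

```cpp
#include <cassert>
#include <cstdint>

struct Triangle { float x0, y0, x1, y1, x2, y2; };

// A rasterizer variant is chosen per primitive: how many edge/scissor
// planes actually need testing for this triangle and this tile.
using RasterFn = uint64_t (*)(const Triangle&);  // returns a coverage mask
using ShadeFn  = void (*)(uint64_t coverage);    // specialized for state only

// Tile fully inside the triangle: zero plane tests needed.
static uint64_t raster_full_tile(const Triangle&) { return ~0ull; }

// Scissor known to contain the tile: only the 3 triangle edges need testing.
static uint64_t raster_3_planes(const Triangle& t) {
    uint64_t mask = 0;
    // ... per-pixel edge tests would go here (stub for illustration) ...
    (void)t;
    return mask;
}

static void shade_opaque(uint64_t coverage) { (void)coverage; /* run fragment shader */ }

// Per-primitive dispatch: pick the cheapest rasterizer that is still
// correct, then call the state-specialized shader. No per-pixel branching
// on state, and no wasted plane tests.
void render(const Triangle& t, bool tile_inside_tri, ShadeFn shade) {
    RasterFn r = tile_inside_tri ? raster_full_tile : raster_3_planes;
    shade(r(t));
}
```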

I am planning on generating a function for each primitive type and state 
combination, or I can convert all primitives into triangles and just 
have a function for each state. The state includes things like whether a 
particular clipping/scissor equation needs to be checked. I did it that 
way in my proof-of-concept code by using C++ templates to do the code 
duplication: 
https://github.com/programmerjake/tiled-renderer/blob/47e09f5d711803b8e899c3669fbeae3e62c9e32c/main.cpp#L366
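[Editor's note: a minimal sketch of the template trick described above — one instantiation per state combination, selected once per draw so the disabled per-pixel tests compile away. The state bits and names here are invented; the linked proof-of-concept differs in detail. Requires C++17 for `if constexpr`.]

```cpp
#include <cassert>

// One function is stamped out per combination of enabled tests; disabled
// branches are removed entirely at compile time.
template <bool NeedScissor, bool NeedClipPlane>
int rasterize_pixel(int x, int y, int sx0, int sx1) {
    if constexpr (NeedScissor)
        if (x < sx0 || x >= sx1) return 0;   // compiled out when unused
    if constexpr (NeedClipPlane)
        if (y < 0) return 0;                 // compiled out when unused
    return 1;                                // pixel survives all enabled tests
}

// The driver picks the instantiation once per draw, outside the pixel loop:
using PixelFn = int (*)(int, int, int, int);
PixelFn select(bool scissor, bool clip) {
    if (scissor) return clip ? rasterize_pixel<true, true>
                             : rasterize_pixel<true, false>;
    return clip ? rasterize_pixel<false, true>
                : rasterize_pixel<false, false>;
}
```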


I'm not sure there will be enough benefit from inlining to compensate for 
the time spent compiling 2**7 variants of each shader to cope with all 
possible incoming triangles.



And this is just one of those non-intuitive things that's not obvious
until one actually does a lot of profiling and a lot of experimentation.
And trust me, a lot of time was spent fine-tuning this for llvmpipe (not
by me -- most of the rasterization was done by Keith Whitwell.)  And by
throwing llvmpipe out of the window and starting a new software
renderer from scratch you'd just be signing up to do it all over
again.

Whereas if instead of starting from scratch, you take llvmpipe and
rewrite/replace one component at a time, you can reach exactly the same
destination you want to reach; however, you'll have something working
every step of the way, so when you take a bad step, you can measure the
performance impact and readjust.  Plus if you run out of time, you have
something useful -- not yet another half-finished project, which will
quickly rot away.

In the case that the project is not finished this summer, I'm still 
planning on working on it, just at a reduced rate. If all else fails, we 
will at least have an up-to-date SPIR-V to LLVM converter that handles 
the GLSL SPIR-V extensions.


Regarding generating SPIR-V -> scalar LLVM, then doing whole-function
vectorization: I don't think it's a bad idea per se.  If I were writing
llvmpipe from scratch today I'd do something like that.  Especially
because (scalar) LLVM IR is so pervasive in the graphics ecosystem
anyway.

It was only after I had tgsi -> llvm ir all done that I stumbled into
http://compilers.cs.uni-saarland.de/projects/wfv/ .

I think the important thing here is that, once you've vectorized the
shader, and you've converted your "texture_sample" to
"texture_sample.vector8", your "output_merger" intrinsics to
"output_merger.vector8", and your log2/exp2, you then slot in the
fine-tuned llvmpipe code for texture sampling and blending and math, as
that's where your bottlenecks tend to be.  Because if you plan to write
all texture sampling from scratch then you need a time/clone machine to
complete this in a summer; and if you just use LLVM's / the standard C
runtime's sqrt/log2/exp2/sin/cos then it would be dead slow.

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-06-11 Thread Jacob Lifshay
On Sat, Jun 10, 2017 at 3:25 PM Jose Fonseca wrote:

> I don't see how to effectively tack triangle setup onto the vertex
> shader: the vertex shader applies to vertices, whereas triangle setup and
> binning apply to primitives.  Usually, each vertex gets transformed
> only once with llvmpipe, no matter how many triangles refer to that vertex.
>   The only way to tack triangle setup onto vertex shading would be if
> you processed vertices a primitive at a time.  Of course one could put
> in an if-statement to skip reprocessing a vertex that was already
> processed, but then you have race conditions, and no benefit from inlining.
>
I was mostly thinking of non-indexed vertices.

> And I'm afraid that tacking on rasterization, too, is one of those things
> that sounds great on paper but is quite bad in practice.  And I speak from
> experience: in fact llvmpipe had the last step of rasterization bolted
> onto the fragment shaders for some time.  But we took it out because it
> was _slower_.
>
> The issue is that if you bolt it onto the shader body, you either:
>
> - inline into the shader body the code for the maximum number of planes
> (which is 7: 3 sides of the triangle, plus 4 sides of a scissor rect), and
> waste CPU cycles going through all of those tests, even though most of the
> time many of those tests aren't needed
>
> - or you generate if/for blocks for each plane, so you only do the
> needed tests, but then you have branch prediction issues...
>
> Whereas if you keep rasterization _outside_ the shader you can have
> specialized functions that do the rasterization based on the primitive
> itself (if the triangle is fully inside the scissor, you need only 3
> planes; if the stamp is fully inside the triangle, you need zero).
> Essentially you can "compose" by coupling two function calls: you call a
> rasterization function that's specialized for the primitive, then a
> shading function that's specialized for the state (but does not depend on
> the primitive).
>
> It makes sense: rasterization needs to be specialized for the primitive,
> not the graphics state; whereas the shader needs to be specialized for
> the state.
>
I am planning on generating a function for each primitive type and state
combination, or I can convert all primitives into triangles and just have a
function for each state. The state includes things like whether a particular
clipping/scissor equation needs to be checked. I did it that way in my
proof-of-concept code by using C++ templates to do the code duplication:
https://github.com/programmerjake/tiled-renderer/blob/47e09f5d711803b8e899c3669fbeae3e62c9e32c/main.cpp#L366


> And this is just one of those non-intuitive things that's not obvious
> until one actually does a lot of profiling and a lot of experimentation.
> And trust me, a lot of time was spent fine-tuning this for llvmpipe (not
> by me -- most of the rasterization was done by Keith Whitwell.)  And by
> throwing llvmpipe out of the window and starting a new software
> renderer from scratch you'd just be signing up to do it all over again.
>
> Whereas if instead of starting from scratch, you take llvmpipe, and you
> rewrite/replace one component at a time, you can reach exactly the same
> destination you want to reach, however you'll have something working
> every step of the way, so when you take a bad step, you can measure
> performance impact, and readjust.  Plus if you run out of time, you have
> something useful -- not yet another half finished project, which quickly
> will rot away.
>
In the case that the project is not finished this summer, I'm still
planning on working on it, just at a reduced rate. If all else fails, we
will at least have an up-to-date SPIR-V to LLVM converter that handles the
GLSL SPIR-V extensions.

> Regarding generating SPIR-V -> scalar LLVM, then doing whole-function
> vectorization: I don't think it's a bad idea per se.  If I were writing
> llvmpipe from scratch today I'd do something like that.  Especially
> because (scalar) LLVM IR is so pervasive in the graphics ecosystem anyway.
>
> It was only after I had tgsi -> llvm ir all done that I stumbled into
> http://compilers.cs.uni-saarland.de/projects/wfv/ .
>
> I think the important thing here is that, once you've vectorized the
> shader, and you've converted your "texture_sample" to
> "texture_sample.vector8", your "output_merger" intrinsics to
> "output_merger.vector8", and your log2/exp2, you then slot in the
> fine-tuned llvmpipe code for texture sampling and blending and math, as
> that's where your bottlenecks tend to be.  Because if you plan to write
> all texture sampling from scratch then you need a time/clone machine to
> complete this in a summer; and if you just use LLVM's / the standard C
> runtime's sqrt/log2/exp2/sin/cos then it would be dead slow.
>
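[Editor's note: to illustrate the point in the quote above — that calling libm's scalar exp2 per lane would be dead slow — software rasterizers typically substitute a cheap polynomial approximation. This is a hedged sketch of the classic trick (build 2^ipart from exponent bits, approximate 2^fpart with a low-order polynomial); llvmpipe's real implementation differs in precision and is vectorized.]

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

float fast_exp2(float x) {
    // clamp to the exponent range representable in a normal float
    if (x < -126.0f) x = -126.0f;
    if (x > 127.0f)  x = 127.0f;
    float ipart = std::floor(x);
    float fpart = x - ipart;               // fractional part, in [0, 1)
    // build 2^ipart by writing the biased exponent bits directly
    uint32_t bits = (uint32_t)((int32_t)ipart + 127) << 23;
    float scale;
    std::memcpy(&scale, &bits, sizeof scale);
    // low-order polynomial approximation of 2^fpart on [0, 1)
    // (coefficients chosen so the endpoints are near-exact; ~0.2% max error)
    float poly = 1.0f + fpart * (0.6518f + fpart * 0.3442f);
    return scale * poly;
}
```

Accuracy/speed trade-offs like the polynomial order are exactly the kind of fine-tuning the quote refers to.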
I am planning on using C++ templates to help with a lot of the texture
sampler code generation -- clang can convert it to LLVM IR and then I can
inline it into the appropriate places. I think that all of the
non-compressed 

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-06-10 Thread Jose Fonseca
I know this is an old thread.  I completely missed it the first time, 
but recently rediscovered it after reading 
http://www.phoronix.com/scan.php?page=news_item=Vulkan-CPU-Repository 
, and perhaps it's not too late for a couple of comments, FWIW.


On 13/02/17 02:17, Jacob Lifshay wrote:

forgot to add mesa-dev when I sent.
-- Forwarded message --
From: "Jacob Lifshay" <programmerj...@gmail.com>

Date: Feb 12, 2017 6:16 PM
Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc
To: "Dave Airlie" <airl...@gmail.com>
Cc:



On Feb 12, 2017 5:34 PM, "Dave Airlie" <airl...@gmail.com> wrote:


 > I'm assuming that control barriers in Vulkan are identical to
barriers
 > across a work-group in opencl. I was going to have a work-group
be a single
 > OS thread, with the different work-items mapped to SIMD lanes. If
we need to
 > have additional scheduling, I have written a javascript compiler that
 > supports generator functions, so I mostly know how to write a
llvm pass to
 > implement that. I was planning on writing the shader compiler
using llvm,
 > using the whole-function-vectorization pass I will write, and
using the
 > pre-existing spir-v to llvm translation layer. I would also write
some llvm
 > passes to translate from texture reads and stuff to basic vector ops.

Well, the problem is that the number of work-groups that get launched
could be quite high, and this can cause a large overhead in the number
of host threads
that have to be launched. There was some discussion of this in the
mesa-dev archives back when I added softpipe compute shaders.


I would start a thread for each cpu, then have each thread run the 
compute shader a number of times instead of having a thread per shader 
invocation.


At least for llvmpipe, last time I looked into this, using OS green 
threads seemed a simple non-intrusive method of dealing with this --

https://lists.freedesktop.org/archives/mesa-dev/2016-April/114790.html
-- but it sounds like LLVM coroutines can handle this more effectively.




 > I have a prototype rasterizer, however I haven't implemented
binning for
 > triangles yet or implemented interpolation. currently, it can handle
 > triangles in 3D homogeneous and calculate edge equations.
 > https://github.com/programmerjake/tiled-renderer
 > A previous 3d renderer that doesn't implement any vectorization
and has
 > opengl 1.x level functionality:
 >
https://github.com/programmerjake/lib3d/blob/master/softrender.cpp

Well I think we already have a completely fine rasterizer and binning
and whatever
else in the llvmpipe code base. I'd much rather any Mesa based
project doesn't
throw all of that away, there is no reason the same swrast backend
couldn't
be abstracted to be used for both GL and Vulkan and introducing another
just because it's interesting isn't a great fit for long term project
maintenance..

If there are improvements to llvmpipe that need to be made, then that
is something
to possibly consider, but I'm not sure why a swrast vulkan needs a
from scratch
raster implemented. For a project that is so large in scope, I'd think
reusing that code
would be of some use. Since most of the fun stuff is all the texture
sampling etc.


I actually think implementing the rasterization algorithm is the best 
part. I wanted the rasterization algorithm to be included in the 
shaders, e.g. triangle setup and binning would be tacked on to the end of 
the vertex shader, parameter interpolation and early z tests would be 
tacked on to the beginning of the fragment shader, and blending on to the 
end. That way, llvm could do more specialization and instruction 
scheduling than is possible in llvmpipe now.


Parameter interpolation, early z test, and blending *are* tacked onto 
llvmpipe's fragment shaders.



I don't see how to effectively tack triangle setup onto the vertex 
shader: the vertex shader applies to vertices, whereas triangle setup and 
binning apply to primitives.  Usually, each vertex gets transformed 
only once with llvmpipe, no matter how many triangles refer to that vertex. 
 The only way to tack triangle setup onto vertex shading would be if 
you processed vertices a primitive at a time.  Of course one could put 
in an if-statement to skip reprocessing a vertex that was already 
processed, but then you have race conditions, and no benefit from inlining.



And I'm afraid that tacking on rasterization, too, is one of those things 
that sounds great on paper but is quite bad in practice.

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-14 Thread Nicolai Hähnle

On 14.02.2017 09:58, Jacob Lifshay wrote:



On Feb 14, 2017 12:18 AM, "Nicolai Hähnle" wrote:

On 13.02.2017 17:54, Jacob Lifshay wrote:

the algorithm I was going to use would get the union of the sets of
live variables at the barriers (union over barriers), create an array
of structs that holds them all, then for each barrier: insert the code
to store all live variables, end the for loop over tid_in_workgroup,
run the memory barrier, start another for loop over tid_in_workgroup,
and load all live variables.
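[Editor's note: the transformation described above can be sketched in scalar form. Unlike the tiny example shader later in the thread, this sketch deliberately reuses `a` after the barrier so there is actually a live value to spill; the struct layout and names are illustrative.]

```cpp
#include <atomic>
#include <cassert>
#include <vector>

constexpr int kWorkgroupSize = 8;

// One slot per work-item holding every variable live across the barrier.
struct BarrierSlot { int a; };

// Source shader (conceptually):  a = b + tid; barrier(); d = a + e + f;
void run_workgroup(const int* b, const int* e, const int* f, int* d) {
    std::vector<BarrierSlot> live(kWorkgroupSize);
    // first loop over tid: code before barrier(); spill live values
    for (int tid = 0; tid < kWorkgroupSize; ++tid)
        live[tid].a = b[tid] + tid;
    // the barrier's memory fence (trivially satisfied single-threaded)
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // second loop over tid: reload live values; code after barrier()
    for (int tid = 0; tid < kWorkgroupSize; ++tid)
        d[tid] = live[tid].a + e[tid] + f[tid];
}
```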


Okay, sounds reasonable in theory.

There are some issues, like: how do you actually determine live
variables? If you're working off TGSI like llvmpipe does today,
you'd need to write your own analysis for that, but in a structured
control flow graph like TGSI has, that shouldn't be too difficult.


I was planning on using the spir-v to llvm translator and never using
tgsi.


Cool, it would be interesting to see how that goes. Mind you, I don't 
think that code is being maintained very well.




I could implement the pass using LLVM coroutines; however, I'd
need several additional passes to convert the output, and it might
not optimize all the way because we would still be left with the switch
on the suspend-point index. Also, according to the docs from LLVM trunk,
LLVM doesn't support reducing the space required by using the minimum
size needed to store all the live variables at the suspend point with
the largest space requirements; instead, it allocates separate space for
each variable at each suspend
point: http://llvm.org/docs/Coroutines.html#areas-requiring-attention


Yes, that actually makes sense. About the switches, though, I'm not so 
sure how you can really avoid those. Consider kernel code like this:


void main()
{
    if (cond) {
        ...
        barrier();
        ...
    } else {
        ...
        barrier();
        ...
    }
}

This kernel is perfectly valid and will work as expected if (and only 
if) cond is uniform across the threads of a workgroup.


Consider what you'd want the control flow in the LLVM implementation to 
look like, how you'd handle the fact that the set of live values would 
be different across the different barriers.


As a bonus, perhaps you could set things up so that the user gets a nice 
error message when the kernel is incorrect (i.e., when cond is _not_ 
uniform across a workgroup).


Cheers,
Nicolai




I'd still recommend you to at least seriously read through the LLVM
coroutine stuff.

Cheers,
Nicolai

Jacob Lifshay

On Feb 13, 2017 08:45, "Nicolai Hähnle" wrote:

[ re-adding mesa-dev on the assumption that it got dropped
by accident ]

On 13.02.2017 17:27, Jacob Lifshay wrote:

I would start a thread for each cpu, then have each
thread run the
compute shader a number of times instead of having a
thread per
shader
invocation.


This will not work.

Please, read again what the barrier() instruction
does: When the
barrier() call is reached, _all_ threads within the
workgroup are
supposed to be run until they reach that barrier() call.


to clarify, I had meant that each os thread would run the
sections of
the shader between the barriers for all the shaders in a
work group,
then, when it finished the work group, it would go to
the next work
group assigned to the os thread.

so, if our shader is:
a = b + tid;
barrier();
d = e + f;

and our simd width is 4, our work-group size is 128, and
we have
16 os
threads, then it will run for each os thread:
for(workgroup = os_thread_index; workgroup <
workgroup_count;
workgroup++)
{
for(tid_in_workgroup = 0; tid_in_workgroup < 128;
tid_in_workgroup += 4)
{
ivec4 tid = ivec4(0, 1, 2, 3) +
ivec4(tid_in_workgroup +
workgroup * 128);
a[tid_in_workgroup / 4] =
ivec_add(b[tid_in_workgroup /
4], tid);
}
memory_fence(); // if needed
for(tid_in_workgroup = 0; tid_in_workgroup < 128;
tid_in_workgroup += 4)
{
d[tid_in_workgroup / 4] 

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-14 Thread Jacob Lifshay
On Feb 14, 2017 12:18 AM, "Nicolai Hähnle" wrote:

On 13.02.2017 17:54, Jacob Lifshay wrote:

> the algorithm i was going to use would get the union of the sets of live
> variables at the barriers (union over barriers), create an array of
> structs that holds them all, then for each barrier, insert the code to
> store all live variables, then end the for loop over tid_in_workgroup,
> then run the memory barrier, then start another for loop over
> tid_in_workgroup, then load all live variables.
>

Okay, sounds reasonable in theory.

There are some issues, like: how do you actually determine live variables?
If you're working off TGSI like llvmpipe does today, you'd need to write
your own analysis for that, but in a structured control flow graph like
TGSI has, that shouldn't be too difficult.


I was planning on using the spir-v to llvm translator and never using tgsi.
I could implement the pass using LLVM coroutines; however, I'd need to have
several additional passes to convert the output, and it might not optimize
all the way because we would still be left with the switch on the suspend
point index. Also, according to the docs from LLVM trunk, LLVM doesn't
support reducing the space required by using the minimum size needed to
store all the live variables at the suspend point with the largest space
requirements; instead, it allocates separate space for each variable at
each suspend point:
http://llvm.org/docs/Coroutines.html#areas-requiring-attention


I'd still recommend you to at least seriously read through the LLVM
coroutine stuff.

Cheers,
Nicolai

Jacob Lifshay
>
> On Feb 13, 2017 08:45, "Nicolai Hähnle" wrote:
>
> [ re-adding mesa-dev on the assumption that it got dropped by accident
> ]
>
> On 13.02.2017 17:27, Jacob Lifshay wrote:
>
> I would start a thread for each cpu, then have each
> thread run the
> compute shader a number of times instead of having a
> thread per
> shader
> invocation.
>
>
> This will not work.
>
> Please, read again what the barrier() instruction does: When
> the
> barrier() call is reached, _all_ threads within the
> workgroup are
> supposed to be run until they reach that barrier() call.
>
>
> to clarify, I had meant that each os thread would run the
> sections of
> the shader between the barriers for all the shaders in a work
> group,
> then, when it finished the work group, it would go to the next work
> group assigned to the os thread.
>
> so, if our shader is:
> a = b + tid;
> barrier();
> d = e + f;
>
> and our simd width is 4, our work-group size is 128, and we have
> 16 os
> threads, then it will run for each os thread:
> for(workgroup = os_thread_index; workgroup < workgroup_count;
> workgroup++)
> {
> for(tid_in_workgroup = 0; tid_in_workgroup < 128;
> tid_in_workgroup += 4)
> {
> ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup +
> workgroup * 128);
> a[tid_in_workgroup / 4] = ivec_add(b[tid_in_workgroup /
> 4], tid);
> }
> memory_fence(); // if needed
> for(tid_in_workgroup = 0; tid_in_workgroup < 128;
> tid_in_workgroup += 4)
> {
> d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4],
> f[tid_in_workgroup / 4]);
> }
> }
> // after this, we run the next rendering or compute job
>
>
> Okay good, that's the right concept.
>
> Actually doing that is not at all straightforward though: consider
> that the barrier() might occur inside a loop in the shader.
>
> So if you implemented that within the framework of llvmpipe, you'd
> make a lot of people very happy: it would allow finally adding
> compute shader support to llvmpipe. Mind you, that in itself would
> already be a pretty decent-sized project for GSoC!
>
> Cheers,
> Nicolai
>
>

Jacob Lifshay
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-14 Thread Nicolai Hähnle

On 13.02.2017 17:54, Jacob Lifshay wrote:

the algorithm i was going to use would get the union of the sets of live
variables at the barriers (union over barriers), create an array of
structs that holds them all, then for each barrier, insert the code to
store all live variables, then end the for loop over tid_in_workgroup,
then run the memory barrier, then start another for loop over
tid_in_workgroup, then load all live variables.


Okay, sounds reasonable in theory.

There are some issues, like: how do you actually determine live 
variables? If you're working off TGSI like llvmpipe does today, you'd 
need to write your own analysis for that, but in a structured control 
flow graph like TGSI has, that shouldn't be too difficult.


I'd still recommend you to at least seriously read through the LLVM 
coroutine stuff.


Cheers,
Nicolai


Jacob Lifshay

On Feb 13, 2017 08:45, "Nicolai Hähnle" wrote:

[ re-adding mesa-dev on the assumption that it got dropped by accident ]

On 13.02.2017 17:27, Jacob Lifshay wrote:

I would start a thread for each cpu, then have each
thread run the
compute shader a number of times instead of having a
thread per
shader
invocation.


This will not work.

Please, read again what the barrier() instruction does: When the
barrier() call is reached, _all_ threads within the
workgroup are
supposed to be run until they reach that barrier() call.


to clarify, I had meant that each os thread would run the
sections of
the shader between the barriers for all the shaders in a work group,
then, when it finished the work group, it would go to the next work
group assigned to the os thread.

so, if our shader is:
a = b + tid;
barrier();
d = e + f;

and our simd width is 4, our work-group size is 128, and we have
16 os
threads, then it will run for each os thread:
for(workgroup = os_thread_index; workgroup < workgroup_count;
workgroup++)
{
for(tid_in_workgroup = 0; tid_in_workgroup < 128;
tid_in_workgroup += 4)
{
ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup +
workgroup * 128);
a[tid_in_workgroup / 4] = ivec_add(b[tid_in_workgroup /
4], tid);
}
memory_fence(); // if needed
for(tid_in_workgroup = 0; tid_in_workgroup < 128;
tid_in_workgroup += 4)
{
d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4],
f[tid_in_workgroup / 4]);
}
}
// after this, we run the next rendering or compute job
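[Editor's note: a runnable scalar sketch of the striping loop quoted above. One detail worth flagging: for the workgroups handled by different OS threads to be disjoint, the outer loop should stride by the OS-thread count rather than increment by 1 as the quoted pseudocode does.]

```cpp
#include <cassert>
#include <vector>

// Each OS thread handles a disjoint stripe of workgroups; scalar lanes
// stand in for the SIMD vectors of the quoted pseudocode.
void run_thread(int os_thread_index, int os_thread_count,
                int workgroup_count, int workgroup_size,
                std::vector<int>& hits) {
    for (int wg = os_thread_index; wg < workgroup_count;
         wg += os_thread_count)                      // stride, not ++
        for (int tid = 0; tid < workgroup_size; ++tid)
            ++hits[wg * workgroup_size + tid];       // each invocation runs once
}
```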


Okay good, that's the right concept.

Actually doing that is not at all straightforward though: consider
that the barrier() might occur inside a loop in the shader.

So if you implemented that within the framework of llvmpipe, you'd
make a lot of people very happy: it would allow finally adding
compute shader support to llvmpipe. Mind you, that in itself would
already be a pretty decent-sized project for GSoC!

Cheers,
Nicolai





Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-13 Thread Roland Scheidegger
Am 13.02.2017 um 03:17 schrieb Jacob Lifshay:
> forgot to add mesa-dev when I sent.
> -- Forwarded message --
> From: "Jacob Lifshay" <programmerj...@gmail.com>
> Date: Feb 12, 2017 6:16 PM
> Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc
> To: "Dave Airlie" <airl...@gmail.com>
> Cc:
> 
> 
> 
> On Feb 12, 2017 5:34 PM, "Dave Airlie" <airl...@gmail.com> wrote:
> 
> > I'm assuming that control barriers in Vulkan are identical to barriers
> > across a work-group in opencl. I was going to have a work-group be
> a single
> > OS thread, with the different work-items mapped to SIMD lanes. If
> we need to
> > have additional scheduling, I have written a javascript compiler that
> > supports generator functions, so I mostly know how to write a llvm
> pass to
> > implement that. I was planning on writing the shader compiler
> using llvm,
> > using the whole-function-vectorization pass I will write, and
> using the
> > pre-existing spir-v to llvm translation layer. I would also write
> some llvm
> > passes to translate from texture reads and stuff to basic vector ops.
> 
> Well the problem is number of work-groups that gets launched could be
> quite high, and this can cause a large overhead in number of host
> threads
> that have to be launched. There was some discussion on this in mesa-dev
> archives back when I added softpipe compute shaders.
> 
> 
> I would start a thread for each cpu, then have each thread run the
> compute shader a number of times instead of having a thread per shader
> invocation.
> 
> 
> > I have a prototype rasterizer, however I haven't implemented
> binning for
> > triangles yet or implemented interpolation. currently, it can handle
> > triangles in 3D homogeneous and calculate edge equations.
> > https://github.com/programmerjake/tiled-renderer
> > A previous 3d renderer that doesn't implement any vectorization
> and has
> > opengl 1.x level functionality:
> > https://github.com/programmerjake/lib3d/blob/master/softrender.cpp
> 
> Well I think we already have a completely fine rasterizer and binning
> and whatever
> else in the llvmpipe code base. I'd much rather any Mesa based
> project doesn't
> throw all of that away, there is no reason the same swrast backend
> couldn't
> be abstracted to be used for both GL and Vulkan and introducing another
> just because it's interesting isn't a great fit for long term project
> maintenance..
> 
> If there are improvements to llvmpipe that need to be made, then that
> is something
> to possibly consider, but I'm not sure why a swrast vulkan needs a
> from scratch
> raster implemented. For a project that is so large in scope, I'd think
> reusing that code
> would be of some use. Since most of the fun stuff is all the texture
> sampling etc.
> 
> 
> I actually think implementing the rasterization algorithm is the best
> part. I wanted the rasterization algorithm to be included in the
> shaders, eg. triangle setup and binning would be tacked on to the end of
> the vertex shader and parameter interpolation and early z tests would be
> tacked on to the beginning of the fragment shader and blending on to the
> end. That way, llvm could do more specialization and instruction
> scheduling than is possible in llvmpipe now.
> 
> so the tile rendering function would essentially be:
> 
> for(i = 0; i < triangle_count; i+= vector_width)
> jit_functions[i](tile_x, tile_y, &setup_results[i]);
> 
> as opposed to the current llvmpipe code where there is a large amount of
> fixed code that isn't optimized with the shaders.

That isn't true for llvmpipe, for the fragment side at least:
parameter interpolation, early z (if possible, otherwise late z), blend,
etc. are all part of the fragment jit function in the end. The actual
edge-function evaluation is not, although that uses optimized assembly as
well (though this isn't quite as universal -- only specifically x86
SSE2 and PowerPC AltiVec; on other archs rasterization might take quite
noticeable CPU time with the scalar edge-function evaluation).
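[Editor's note: for reference, the scalar edge-function evaluation mentioned above boils down to a signed-area test per edge. A minimal sketch with integer coordinates, counter-clockwise winding, and no fill-rule handling; the optimized SSE2/AltiVec paths evaluate the same functions for a whole block of pixels at once.]

```cpp
#include <cassert>

// Twice the signed area of triangle (ax,ay)-(bx,by)-(px,py); positive when
// p lies to the left of the directed edge a->b.
int edge(int ax, int ay, int bx, int by, int px, int py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// A pixel is covered when it is on the inside of all three edges.
bool inside_triangle(int x0, int y0, int x1, int y1, int x2, int y2,
                     int px, int py) {
    return edge(x0, y0, x1, y1, px, py) >= 0 &&
           edge(x1, y1, x2, y2, px, py) >= 0 &&
           edge(x2, y2, x0, y0, px, py) >= 0;
}
```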

On the vertex side though, llvmpipe can't do threaded setup or binning
(nor vertex shader execution itself for that matter). Clearly, th

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-13 Thread Jacob Lifshay
the algorithm i was going to use would get the union of the sets of live
variables at the barriers (union over barriers), create an array of structs
that holds them all, then for each barrier, insert the code to store all
live variables, then end the for loop over tid_in_workgroup, then run the
memory barrier, then start another for loop over tid_in_workgroup, then
load all live variables.
Jacob Lifshay

On Feb 13, 2017 08:45, "Nicolai Hähnle" wrote:

> [ re-adding mesa-dev on the assumption that it got dropped by accident ]
>
> On 13.02.2017 17:27, Jacob Lifshay wrote:
>
>> I would start a thread for each cpu, then have each thread run the
>> compute shader a number of times instead of having a thread per
>> shader
>> invocation.
>>
>>
>> This will not work.
>>
>> Please, read again what the barrier() instruction does: When the
>> barrier() call is reached, _all_ threads within the workgroup are
>> supposed to be run until they reach that barrier() call.
>>
>>
>> to clarify, I had meant that each os thread would run the sections of
>> the shader between the barriers for all the shaders in a work group,
>> then, when it finished the work group, it would go to the next work
>> group assigned to the os thread.
>>
>> so, if our shader is:
>> a = b + tid;
>> barrier();
>> d = e + f;
>>
>> and our simd width is 4, our work-group size is 128, and we have 16 os
>> threads, then it will run for each os thread:
>> for(workgroup = os_thread_index; workgroup < workgroup_count; workgroup++)
>> {
>>     for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
>>     {
>>         ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup + workgroup * 128);
>>         a[tid_in_workgroup / 4] = ivec_add(b[tid_in_workgroup / 4], tid);
>>     }
>>     memory_fence(); // if needed
>>     for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
>>     {
>>         d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4], f[tid_in_workgroup / 4]);
>>     }
>> }
>> // after this, we run the next rendering or compute job
>>
>
> Okay good, that's the right concept.
>
> Actually doing that is not at all straightforward though: consider that
> the barrier() might occur inside a loop in the shader.
>
> So if you implemented that within the framework of llvmpipe, you'd make a
> lot of people very happy: it would allow finally adding compute shader
> support to llvmpipe. Mind you, that in itself would already be a pretty
> decent-sized project for GSoC!
>
> Cheers,
> Nicolai
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-13 Thread Nicolai Hähnle

[ re-adding mesa-dev on the assumption that it got dropped by accident ]

On 13.02.2017 17:27, Jacob Lifshay wrote:

I would start a thread for each cpu, then have each thread run the
compute shader a number of times instead of having a thread per
shader
invocation.


This will not work.

Please, read again what the barrier() instruction does: When the
barrier() call is reached, _all_ threads within the workgroup are
supposed to be run until they reach that barrier() call.


to clarify, I had meant that each os thread would run the sections of
the shader between the barriers for all the shaders in a work group,
then, when it finished the work group, it would go to the next work
group assigned to the os thread.

so, if our shader is:
a = b + tid;
barrier();
d = e + f;

and our simd width is 4, our work-group size is 128, and we have 16 os
threads, then it will run for each os thread:
for(workgroup = os_thread_index; workgroup < workgroup_count; workgroup++)
{
    for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
    {
        ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup + workgroup * 128);
        a[tid_in_workgroup / 4] = ivec_add(b[tid_in_workgroup / 4], tid);
    }
    memory_fence(); // if needed
    for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
    {
        d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4], f[tid_in_workgroup / 4]);
    }
}
// after this, we run the next rendering or compute job


Okay good, that's the right concept.

Actually doing that is not at all straightforward though: consider that 
the barrier() might occur inside a loop in the shader.


So if you implemented that within the framework of llvmpipe, you'd make 
a lot of people very happy: it would allow finally adding compute shader 
support to llvmpipe. Mind you, that in itself would already be a pretty 
decent-sized project for GSoC!


Cheers,
Nicolai


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-13 Thread Jacob Lifshay
forgot to add mesa-dev when I sent (again).
-- Forwarded message --
From: "Jacob Lifshay" <programmerj...@gmail.com>
Date: Feb 13, 2017 8:27 AM
Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc
To: "Nicolai Hähnle" <nhaeh...@gmail.com>
Cc:


>
> On Feb 13, 2017 7:54 AM, "Nicolai Hähnle" <nhaeh...@gmail.com> wrote:
>
> On 13.02.2017 03:17, Jacob Lifshay wrote:
>
>> On Feb 12, 2017 5:34 PM, "Dave Airlie" <airl...@gmail.com
>> <mailto:airl...@gmail.com>> wrote:
>>
>> > I'm assuming that control barriers in Vulkan are identical to
>> barriers
>> > across a work-group in opencl. I was going to have a work-group be
>> a single
>> > OS thread, with the different work-items mapped to SIMD lanes. If
>> we need to
>> > have additional scheduling, I have written a javascript compiler
>> that
>> > supports generator functions, so I mostly know how to write a llvm
>> pass to
>> > implement that. I was planning on writing the shader compiler
>> using llvm,
>> > using the whole-function-vectorization pass I will write, and
>> using the
>> > pre-existing spir-v to llvm translation layer. I would also write
>> some llvm
>> > passes to translate from texture reads and stuff to basic vector
>> ops.
>>
>> Well the problem is number of work-groups that gets launched could be
>> quite high, and this can cause a large overhead in number of host
>> threads
>> that have to be launched. There was some discussion on this in
>> mesa-dev
>> archives back when I added softpipe compute shaders.
>>
>>
>> I would start a thread for each cpu, then have each thread run the
>> compute shader a number of times instead of having a thread per shader
>> invocation.
>>
>
> This will not work.
>
> Please, read again what the barrier() instruction does: When the barrier()
> call is reached, _all_ threads within the workgroup are supposed to be run
> until they reach that barrier() call.
>
>
> to clarify, I had meant that each os thread would run the sections of the
> shader between the barriers for all the shaders in a work group, then, when
> it finished the work group, it would go to the next work group assigned to
> the os thread.
>
> so, if our shader is:
> a = b + tid;
> barrier();
> d = e + f;
>
> and our simd width is 4, our work-group size is 128, and we have 16 os
> threads, then it will run for each os thread:
> for(workgroup = os_thread_index; workgroup < workgroup_count; workgroup++)
> {
>     for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
>     {
>         ivec4 tid = ivec4(0, 1, 2, 3) + ivec4(tid_in_workgroup + workgroup * 128);
>         a[tid_in_workgroup / 4] = ivec_add(b[tid_in_workgroup / 4], tid);
>     }
>     memory_fence(); // if needed
>     for(tid_in_workgroup = 0; tid_in_workgroup < 128; tid_in_workgroup += 4)
>     {
>         d[tid_in_workgroup / 4] = vec_add(e[tid_in_workgroup / 4], f[tid_in_workgroup / 4]);
>     }
> }
> // after this, we run the next rendering or compute job
>
>
>> > I have a prototype rasterizer, however I haven't implemented
>> binning for
>> > triangles yet or implemented interpolation. currently, it can handle
>> > triangles in 3D homogeneous and calculate edge equations.
>> > https://github.com/programmerjake/tiled-renderer
>> <https://github.com/programmerjake/tiled-renderer>
>> > A previous 3d renderer that doesn't implement any vectorization
>> and has
>> > opengl 1.x level functionality:
>> > https://github.com/programmerjake/lib3d/blob/master/softrender.cpp
>> <https://github.com/programmerjake/lib3d/blob/master/softrender.cpp>
>>
>> Well I think we already have a completely fine rasterizer and binning
>> and whatever
>> else in the llvmpipe code base. I'd much rather any Mesa based
>> project doesn't
>> throw all of that away, there is no reason the same swrast backend
>> couldn't
>> be abstracted to be used for both GL and Vulkan and introducing
>> another
>> just because it's interesting isn't a great fit for long term project
>> maintenance..
>>
>> If there are improvements to llvmpipe that need to be made, then that
>> is something
>> to possibly consider, but I'm not sure why a swrast vulkan needs a
>>

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-13 Thread Nicolai Hähnle

On 13.02.2017 03:17, Jacob Lifshay wrote:

On Feb 12, 2017 5:34 PM, "Dave Airlie" > wrote:

> I'm assuming that control barriers in Vulkan are identical to barriers
> across a work-group in opencl. I was going to have a work-group be
a single
> OS thread, with the different work-items mapped to SIMD lanes. If
we need to
> have additional scheduling, I have written a javascript compiler that
> supports generator functions, so I mostly know how to write a llvm
pass to
> implement that. I was planning on writing the shader compiler
using llvm,
> using the whole-function-vectorization pass I will write, and
using the
> pre-existing spir-v to llvm translation layer. I would also write
some llvm
> passes to translate from texture reads and stuff to basic vector ops.

Well the problem is number of work-groups that gets launched could be
quite high, and this can cause a large overhead in number of host
threads
that have to be launched. There was some discussion on this in mesa-dev
archives back when I added softpipe compute shaders.


I would start a thread for each cpu, then have each thread run the
compute shader a number of times instead of having a thread per shader
invocation.


This will not work.

Please, read again what the barrier() instruction does: When the 
barrier() call is reached, _all_ threads within the workgroup are 
supposed to be run until they reach that barrier() call.


So you need a way of suspending and resuming shader threads when they 
reach the barrier() call.


The brute-force way of doing this would be to have one OS thread per 
shader thread (or per N shader threads, where N is a fixed number 
corresponding to SIMD lanes), but that gives you a giant number of OS 
threads to contend with.


The alternative is to do "threads" in user space, and there are a bunch 
of options for that. LLVM coroutines are worth checking out, since I 
think they're more or less designed for that kind of thing. Another 
option is user space stack switching, or perhaps something entirely 
different.


Nicolai




> I have a prototype rasterizer, however I haven't implemented
binning for
> triangles yet or implemented interpolation. currently, it can handle
> triangles in 3D homogeneous and calculate edge equations.
> https://github.com/programmerjake/tiled-renderer

> A previous 3d renderer that doesn't implement any vectorization
and has
> opengl 1.x level functionality:
> https://github.com/programmerjake/lib3d/blob/master/softrender.cpp


Well I think we already have a completely fine rasterizer and binning
and whatever
else in the llvmpipe code base. I'd much rather any Mesa based
project doesn't
throw all of that away, there is no reason the same swrast backend
couldn't
be abstracted to be used for both GL and Vulkan and introducing another
just because it's interesting isn't a great fit for long term project
maintenance..

If there are improvements to llvmpipe that need to be made, then that
is something
to possibly consider, but I'm not sure why a swrast vulkan needs a
from scratch
raster implemented. For a project that is so large in scope, I'd think
reusing that code
would be of some use. Since most of the fun stuff is all the texture
sampling etc.


I actually think implementing the rasterization algorithm is the best
part. I wanted the rasterization algorithm to be included in the shaders,
e.g. triangle setup and binning would be tacked on to the end of the
vertex shader, parameter interpolation and early z tests would be tacked
on to the beginning of the fragment shader, and blending on to the end.
That way, llvm could do more specialization and instruction scheduling
than is possible in llvmpipe now.

so the tile rendering function would essentially be:

for(i = 0; i < triangle_count; i += vector_width)
    jit_functions[i](tile_x, tile_y, &setup_results[i]);

as opposed to the current llvmpipe code where there is a large amount of
fixed code that isn't optimized with the shaders.


> The scope that I intended to complete is the bare minimum to be vulkan
> conformant (i.e. no tessellation and no geometry shaders), so
implementing a
> loadable ICD for linux and windows that implements a single queue,
vertex,
> fragment, and compute shaders, implementing events, semaphores,
and fences,
> implementing images with the minimum requirements, supporting a
f32 depth
> buffer or a f24 with 8bit stencil, and supporting a
yet-to-be-determined
> compressed format. For the image optimal layouts, I will probably
use the
> same chunked layout I use in
>

Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-12 Thread Jacob Lifshay
forgot to add mesa-dev when I sent.
-- Forwarded message --
From: "Jacob Lifshay" <programmerj...@gmail.com>
Date: Feb 12, 2017 6:16 PM
Subject: Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc
To: "Dave Airlie" <airl...@gmail.com>
Cc:



On Feb 12, 2017 5:34 PM, "Dave Airlie" <airl...@gmail.com> wrote:

> I'm assuming that control barriers in Vulkan are identical to barriers
> across a work-group in opencl. I was going to have a work-group be a
single
> OS thread, with the different work-items mapped to SIMD lanes. If we need
to
> have additional scheduling, I have written a javascript compiler that
> supports generator functions, so I mostly know how to write a llvm pass to
> implement that. I was planning on writing the shader compiler using llvm,
> using the whole-function-vectorization pass I will write, and using the
> pre-existing spir-v to llvm translation layer. I would also write some
llvm
> passes to translate from texture reads and stuff to basic vector ops.

Well the problem is number of work-groups that gets launched could be
quite high, and this can cause a large overhead in number of host threads
that have to be launched. There was some discussion on this in mesa-dev
archives back when I added softpipe compute shaders.


I would start a thread for each cpu, then have each thread run the compute
shader a number of times instead of having a thread per shader invocation.


> I have a prototype rasterizer, however I haven't implemented binning for
> triangles yet or implemented interpolation. currently, it can handle
> triangles in 3D homogeneous and calculate edge equations.
> https://github.com/programmerjake/tiled-renderer
> A previous 3d renderer that doesn't implement any vectorization and has
> opengl 1.x level functionality:
> https://github.com/programmerjake/lib3d/blob/master/softrender.cpp

Well I think we already have a completely fine rasterizer and binning
and whatever
else in the llvmpipe code base. I'd much rather any Mesa based project
doesn't
throw all of that away, there is no reason the same swrast backend couldn't
be abstracted to be used for both GL and Vulkan and introducing another
just because it's interesting isn't a great fit for long term project
maintenance..

If there are improvements to llvmpipe that need to be made, then that
is something
to possibly consider, but I'm not sure why a swrast vulkan needs a from
scratch
raster implemented. For a project that is so large in scope, I'd think
reusing that code
would be of some use. Since most of the fun stuff is all the texture
sampling etc.


I actually think implementing the rasterization algorithm is the best part.
I wanted the rasterization algorithm to be included in the shaders, e.g.
triangle setup and binning would be tacked on to the end of the vertex
shader, parameter interpolation and early z tests would be tacked on to
the beginning of the fragment shader, and blending on to the end. That way,
llvm could do more specialization and instruction scheduling than is
possible in llvmpipe now.

so the tile rendering function would essentially be:

for(i = 0; i < triangle_count; i += vector_width)
    jit_functions[i](tile_x, tile_y, &setup_results[i]);

as opposed to the current llvmpipe code where there is a large amount of
fixed code that isn't optimized with the shaders.


> The scope that I intended to complete is the bare minimum to be vulkan
> conformant (i.e. no tessellation and no geometry shaders), so
implementing a
> loadable ICD for linux and windows that implements a single queue, vertex,
> fragment, and compute shaders, implementing events, semaphores, and
fences,
> implementing images with the minimum requirements, supporting a f32 depth
> buffer or a f24 with 8bit stencil, and supporting a yet-to-be-determined
> compressed format. For the image optimal layouts, I will probably use the
> same chunked layout I use in
> https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59
,
> where I have a linear array of chunks where each chunk has a linear array
of
> texels. If you think that's too big, we could leave out all of the image
> formats except the two depth-stencil formats, the 8-bit and 32-bit integer
> and 32-bit float formats.
>

Seems like quite a large scope, possibly a bit big for a GSoC though,
especially one that intends to not use any existing Mesa code.


most of the vulkan functions have a simple implementation when we don't
need to worry about building stuff for a gpu and synchronization (because
we have only one queue), and llvm implements most of the rest of the needed
functionality. If we leave out most of the image formats, that would
probably cut the amount of code by a third.


Dave.


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-12 Thread Dave Airlie
> I'm assuming that control barriers in Vulkan are identical to barriers
> across a work-group in opencl. I was going to have a work-group be a single
> OS thread, with the different work-items mapped to SIMD lanes. If we need to
> have additional scheduling, I have written a javascript compiler that
> supports generator functions, so I mostly know how to write a llvm pass to
> implement that. I was planning on writing the shader compiler using llvm,
> using the whole-function-vectorization pass I will write, and using the
> pre-existing spir-v to llvm translation layer. I would also write some llvm
> passes to translate from texture reads and stuff to basic vector ops.

Well the problem is number of work-groups that gets launched could be
quite high, and this can cause a large overhead in number of host threads
that have to be launched. There was some discussion on this in mesa-dev
archives back when I added softpipe compute shaders.

> I have a prototype rasterizer, however I haven't implemented binning for
> triangles yet or implemented interpolation. currently, it can handle
> triangles in 3D homogeneous and calculate edge equations.
> https://github.com/programmerjake/tiled-renderer
> A previous 3d renderer that doesn't implement any vectorization and has
> opengl 1.x level functionality:
> https://github.com/programmerjake/lib3d/blob/master/softrender.cpp

Well I think we already have a completely fine rasterizer and binning
and whatever
else in the llvmpipe code base. I'd much rather any Mesa based project doesn't
throw all of that away, there is no reason the same swrast backend couldn't
be abstracted to be used for both GL and Vulkan and introducing another
just because it's interesting isn't a great fit for long term project
maintenance..

If there are improvements to llvmpipe that need to be made, then that
is something
to possibly consider, but I'm not sure why a swrast vulkan needs a from scratch
raster implemented. For a project that is so large in scope, I'd think
reusing that code
would be of some use. Since most of the fun stuff is all the texture
sampling etc.

> The scope that I intended to complete is the bare minimum to be vulkan
> conformant (i.e. no tessellation and no geometry shaders), so implementing a
> loadable ICD for linux and windows that implements a single queue, vertex,
> fragment, and compute shaders, implementing events, semaphores, and fences,
> implementing images with the minimum requirements, supporting a f32 depth
> buffer or a f24 with 8bit stencil, and supporting a yet-to-be-determined
> compressed format. For the image optimal layouts, I will probably use the
> same chunked layout I use in
> https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59 ,
> where I have a linear array of chunks where each chunk has a linear array of
> texels. If you think that's too big, we could leave out all of the image
> formats except the two depth-stencil formats, the 8-bit and 32-bit integer
> and 32-bit float formats.
>

Seems like quite a large scope, possibly a bit big for a GSoC though,
especially one that intends to not use any existing Mesa code.

Dave.


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-12 Thread Jacob Lifshay
I'm assuming that control barriers in Vulkan are identical to barriers
across a work-group in opencl. I was going to have a work-group be a single
OS thread, with the different work-items mapped to SIMD lanes. If we need
to have additional scheduling, I have written a javascript compiler that
supports generator functions, so I mostly know how to write a llvm pass to
implement that. I was planning on writing the shader compiler using llvm,
using the whole-function-vectorization pass I will write, and using the
pre-existing spir-v to llvm translation layer. I would also write some llvm
passes to translate from texture reads and stuff to basic vector ops.

I have a prototype rasterizer; however, I haven't implemented binning for
triangles yet or implemented interpolation. Currently, it can handle
triangles in 3D homogeneous coordinates and calculate edge equations.
https://github.com/programmerjake/tiled-renderer
A previous 3d renderer that doesn't implement any vectorization and has
opengl 1.x level functionality:
https://github.com/programmerjake/lib3d/blob/master/softrender.cpp

The scope that I intended to complete is the bare minimum to be vulkan
conformant (i.e. no tessellation and no geometry shaders), so implementing
a loadable ICD for linux and windows that implements a single queue,
vertex, fragment, and compute shaders, implementing events, semaphores, and
fences, implementing images with the minimum requirements, supporting a f32
depth buffer or a f24 with 8bit stencil, and supporting a
yet-to-be-determined compressed format. For the image optimal layouts, I
will probably use the same chunked layout I use in
https://github.com/programmerjake/tiled-renderer/blob/master2/image.h#L59 ,
where I have a linear array of chunks where each chunk has a linear array
of texels. If you think that's too big, we could leave out all of the image
formats except the two depth-stencil formats, the 8-bit and 32-bit integer
and 32-bit float formats.

As mentioned by Roland Mainz, I plan to implement it so all state is stored
in the VkDevice structure or structures created from VkDevice, so there are
no global variables that prevent the library from being completely
reentrant. I might have global variables for something like detecting cpu
features, but that will be protected by a mutex.

Jacob Lifshay

On Sun, Feb 12, 2017 at 3:14 PM Dave Airlie  wrote:

> On 11 February 2017 at 09:03, Jacob Lifshay 
> wrote:
> > I would like to write a software implementation of Vulkan for inclusion
> in
> > mesa3d. I wanted to use a tiled renderer coupled with llvm and either
> write
> > or use a whole-function-vectorization pass. Would anyone be willing to
> > mentor me for this project? I would probably only need help getting it
> > committed, and would be able to do the rest with minimal help.
>
> So I started writing a vulkan->gallium swrast layer
>
> https://cgit.freedesktop.org/~airlied/mesa/log/?h=not-a-vulkan-swrast
>
> with the intention of using it to prove a vulkan swrast driver on top
> of llvmpipe eventually.
>
> This was because I was being too lazy to just rewrite llvmpipe as a
> vulkan driver,
> and it seemed easier to just write the layer to investigate. The thing
> about vulkan is that it is already very much based around the idea of
> command streams and parallel building/execution, so having the
> gallium/vulkan layer record a CPU command stream and execute that isn't
> going to be as large an overhead as doing something similar with hw
> drivers.
>
> I got it working with softpipe after adding a bunch of features to
> softpipe, however to
> get it going with llvmpipe, there would need to be a lot of work on
> improving llvmpipe.
>
> Vulkan really wants images and compute shaders (i.e. it requires
> them), and so far we haven't got image and compute shader support for
> llvmpipe. There are a few threads previously on this, but the main
> problem with compute shaders is getting efficient barriers working,
> which needs some kind of threading model; maybe llvm's coroutine
> support is useful for this, but we won't know until we try, I suppose.
>
> I'd probably be happy to mentor on the project, but you'd want to
> define the scope of it pretty
> well, as there is a lot of work to get the non-graphics pieces even if
> you are just ripping stuff
> out of llvmpipe.
>
> Dave.
>


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-12 Thread Dave Airlie
On 11 February 2017 at 09:03, Jacob Lifshay  wrote:
> I would like to write a software implementation of Vulkan for inclusion in
> mesa3d. I wanted to use a tiled renderer coupled with llvm and either write
> or use a whole-function-vectorization pass. Would anyone be willing to
> mentor me for this project? I would probably only need help getting it
> committed, and would be able to do the rest with minimal help.

So I started writing a vulkan->gallium swrast layer

https://cgit.freedesktop.org/~airlied/mesa/log/?h=not-a-vulkan-swrast

with the intention of using it to prove a vulkan swrast driver on top
of llvmpipe eventually.

This was because I was being too lazy to just rewrite llvmpipe as a
vulkan driver, and it seemed easier to just write the layer to
investigate. The thing about vulkan is that it is already very much based
around the idea of command streams and parallel building/execution, so
having the gallium/vulkan layer record a CPU command stream and execute
that isn't going to be as large an overhead as doing something similar
with hw drivers.

I got it working with softpipe after adding a bunch of features to
softpipe, however to
get it going with llvmpipe, there would need to be a lot of work on
improving llvmpipe.

Vulkan really wants images and compute shaders (i.e. it requires
them), and so far we haven't got image and compute shader support for
llvmpipe. There are a few threads previously on this, but the main
problem with compute shaders is getting efficient barriers working,
which needs some kind of threading model; maybe llvm's coroutine
support is useful for this, but we won't know until we try, I suppose.

I'd probably be happy to mentor on the project, but you'd want to
define the scope of it pretty
well, as there is a lot of work to get the non-graphics pieces even if
you are just ripping stuff
out of llvmpipe.

Dave.


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-11 Thread Jacob Lifshay
By tiled renderer, I meant that I would split the render target into small
pieces, then, for each triangle, decide which pieces contain the triangle
and add that triangle to per-piece render lists. Afterwards, I'd use the
constructed render lists and render all the parts of triangles in a piece,
then go to the next piece. Obviously, I'd use multiple threads that all
render their separate pieces simultaneously. I'm not sure if you'd be
able to use the whole-function-vectorization pass with gallium3d; you'd
need to translate the shader to llvm ir and back. The
whole-function-vectorization pass would still output scalar code for
statically uniform values; llvm (as of 3.9.1) doesn't have a pass to
devectorize vectors where all elements are identical.
Jacob Lifshay

On Feb 11, 2017 11:11, "Roland Scheidegger"  wrote:

Am 11.02.2017 um 00:03 schrieb Jacob Lifshay:
> I would like to write a software implementation of Vulkan for inclusion
> in mesa3d. I wanted to use a tiled renderer coupled with llvm and either
> write or use a whole-function-vectorization pass. Would anyone be
> willing to mentor me for this project? I would probably only need help
> getting it committed, and would be able to do the rest with minimal help.
> Jacob Lifshay

This sounds like a potentially interesting project, though I don't have
much of an idea if it's feasible as gsoc.
By "using a tiled renderer" do you mean you want to "borrow" that,
presumably from either llvmpipe or openswr?
The whole-function-vectorization idea for shader execution looks
reasonable to me, just not sure if it will deliver good results. I guess
it would be nice if that could sort of be used as a replacement for the
current gallivm cpu shader execution implementation (used by both
llvmpipe and openswr).

Roland


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-11 Thread Roland Scheidegger
Am 11.02.2017 um 00:03 schrieb Jacob Lifshay:
> I would like to write a software implementation of Vulkan for inclusion
> in mesa3d. I wanted to use a tiled renderer coupled with llvm and either
> write or use a whole-function-vectorization pass. Would anyone be
> willing to mentor me for this project? I would probably only need help
> getting it committed, and would be able to do the rest with minimal help.
> Jacob Lifshay

This sounds like a potentially interesting project, though I don't have
much of an idea if it's feasible as gsoc.
By "using a tiled renderer" do you mean you want to "borrow" that,
presumably from either llvmpipe or openswr?
The whole-function-vectorization idea for shader execution looks
reasonable to me, just not sure if it will deliver good results. I guess
it would be nice if that could sort of be used as a replacement for the
current gallivm cpu shader execution implementation (used by both
llvmpipe and openswr).

Roland



Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-10 Thread Jacob Lifshay
I think vulkan is supposed to be reentrant already.
Jacob Lifshay

On Feb 10, 2017 3:38 PM, "Roland Mainz"  wrote:

> On Sat, Feb 11, 2017 at 12:03 AM, Jacob Lifshay
>  wrote:
> > I would like to write a software implementation of Vulkan for inclusion
> in
> > mesa3d. I wanted to use a tiled renderer coupled with llvm and either
> write
> > or use a whole-function-vectorization pass. Would anyone be willing to
> > mentor me for this project? I would probably only need help getting it
> > committed, and would be able to do the rest with minimal help.
>
> Please do me a favour and implement the renderer in a reentrant way,
> i.e. no global variables (e.g. put all variables which are "global" in
> a "handle" struct which is then passed around, e.g. like libpam was
> implemented). This helps a lot with later multithreading and helps
> with debugging the code.
>
> 
>
> Bye,
> Roland
>
> --
>   __ .  . __
>  (o.\ \/ /.o) roland.ma...@nrubsig.org
>   \__\/\/__/  MPEG specialist, C&&& programmer
>   /O /==\ O\  TEL +49 641 3992797
>  (;O/ \/ \O;)
>


Re: [Mesa-dev] software implementation of vulkan for gsoc/evoc

2017-02-10 Thread Roland Mainz
On Sat, Feb 11, 2017 at 12:03 AM, Jacob Lifshay
 wrote:
> I would like to write a software implementation of Vulkan for inclusion in
> mesa3d. I wanted to use a tiled renderer coupled with llvm and either write
> or use a whole-function-vectorization pass. Would anyone be willing to
> mentor me for this project? I would probably only need help getting it
> committed, and would be able to do the rest with minimal help.

Please do me a favour and implement the renderer in a reentrant way,
i.e. no global variables (e.g. put all variables which are "global" in
a "handle" struct which is then passed around, e.g. like libpam was
implemented). This helps a lot with later multithreading and helps
with debugging the code.



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&&& programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)