Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-21 Thread Iago Toral
On Wed, 2015-10-21 at 14:58 +0300, Francisco Jerez wrote:
> Iago Toral  writes:
> 
> > On Wed, 2015-10-21 at 13:00 +0300, Francisco Jerez wrote:
> >> Iago Toral  writes:
> >> 
> >> > Hi Curro,
> >> >
> >> > On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
> >> >> Iago Toral  writes:
> >> >> 
> >> >> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> >> >> >> Iago Toral Quiroga  writes:
> >> >> >> 
> >> >> >> > This allows us to re-use the results of previous ssbo loads in 
> >> >> >> > situations
> >> >> >> > that are safe (i.e. when there are no stores, atomic operations or
> >> >> >> > memory barriers in between).
> >> >> >> >
> >> >> >> > This is particularly useful for things like matrix 
> >> >> >> > multiplications, where
> >> >> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 
> >> >> >> > reads of
> >> >> >> > each column) down to 4 (1 read of each column).
> >> >> >> >
> >> >> >> > The pass can only cache ssbo loads that involve constant blocks and
> >> >> >> > offsets, but could be extended to compare sub-expressions for these
> >> >> >> > as well, similar to a CSE pass.
> >> >> >> >
> >> >> >> > The way the cache works is simple: ssbo loads with constant 
> >> >> >> > block/offset
> >> >> >> > are included in a cache as they are seen. Stores invalidate cache 
> >> >> >> > entries.
> >> >> >> > Stores with non-constant offset invalidate all cached loads for 
> >> >> >> > the block
> >> >> >> > and stores with non-constant block invalidate all cache entries. 
> >> >> >> > There is
> >> >> >> > room to improve this by using the actual variable name we are 
> >> >> >> > accessing to
> >> >> >> > limit the entries that should be invalidated. We also need to 
> >> >> >> > invalidate
> >> >> >> > cache entries when we exit the block in which they have been 
> >> >> >> > defined
> >> >> >> > (i.e. inside if/else blocks or loops).
> >> >> >> >
> >> >> >> > The cache optimization is built as a separate pass, instead of 
> >> >> >> > merging it
> >> >> >> > inside the lower_ubo_reference pass for a number of reasons:
> >> >> >> >
> >> >> >> > 1) The way we process assignments in visitors is that the LHS is
> >> >> >> > processed before the RHS. This creates a problem for an 
> >> >> >> > optimization
> >> >> >> > such as this when we do things like a = a + 1, since we would see 
> >> >> >> > the
> >> >> >> > store before the read when the actual execution order is reversed.
> >> >> >> > This could be fixed by re-implementing the logic in the visit_enter
> >> >> >> > method for ir_assignment in lower_ubo_reference and then returning
> >> >> >> > visit_continue_with_parent.
> >> >> >> >
> >> >> >> > 2) Some writes/reads need to be split into multiple smaller
> >> >> >> > writes/reads, and we need to handle caching for each one. This 
> >> >> >> > happens
> >> >> >> > deep inside the code that handles the lowering and some
> >> >> >> > of the information we need to do this is not available. This could 
> >> >> >> > also
> >> >> >> > be fixed by passing more data into the corresponding functions or 
> >> >> >> > by
> >> >> >> > making this data available as class members, but the current 
> >> >> >> > implementation
> >> >> >> > is already complex enough and  this would only contribute to the 
> >> >> >> > complexity.
> >> >> >> >
> >> >> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In 
> >> >> >> > these cases
> >> >> >> > the current code in lower_uo_reference would see the store before 
> >> >> >> > the read.
> >> >> >> > Probably fixable, but again would add more complexity to the 
> >> >> >> > lowering.
> >> >> >> >
> >> >> >> > On the other hand, a separate pass that runs after the lowering 
> >> >> >> > sees
> >> >> >> > all the individal loads and stores in the correct order (so we 
> >> >> >> > don't need
> >> >> >> > to do any tricks) and it allows us to sepearate the lowering logic 
> >> >> >> > (which
> >> >> >> > is already complex) from the caching logic. It also gives us a 
> >> >> >> > chance to
> >> >> >> > run it after other optimization passes have run and turned constant
> >> >> >> > expressions for block/offset into constants, enabling more 
> >> >> >> > opportunities
> >> >> >> > for caching.
> >> >> >> 
> >> >> >> Seems like a restricted form of CSE that only handles SSBO loads, and
> >> >> >> only the ones with constant arguments.  Why don't we CSE these? (and
> >> >> >> other memory access operations like image loads)
> >> >> >
> >> >> > There is not a CSE pass in GLSL IR any more so we would have to do it 
> >> >> > in
> >> >> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
> >> >> > level seemed like a win from this perspective.
> >> >> >
> >> >> > Then there is the fact that we cannot just CSE these. We need to make
> >> >> > sure that we only CSE them when it is safe to do so (i.e. no
> >> >> > stores/atomics to the 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-21 Thread Francisco Jerez
Iago Toral  writes:

> Hi Curro,
>
> On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
>> Iago Toral  writes:
>> 
>> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
>> >> Iago Toral Quiroga  writes:
>> >> 
>> >> > This allows us to re-use the results of previous ssbo loads in 
>> >> > situations
>> >> > that are safe (i.e. when there are no stores, atomic operations or
>> >> > memory barriers in between).
>> >> >
>> >> > This is particularly useful for things like matrix multiplications, 
>> >> > where
>> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads 
>> >> > of
>> >> > each column) down to 4 (1 read of each column).
>> >> >
>> >> > The pass can only cache ssbo loads that involve constant blocks and
>> >> > offsets, but could be extended to compare sub-expressions for these
>> >> > as well, similar to a CSE pass.
>> >> >
>> >> > The way the cache works is simple: ssbo loads with constant block/offset
>> >> > are included in a cache as they are seen. Stores invalidate cache 
>> >> > entries.
>> >> > Stores with non-constant offset invalidate all cached loads for the 
>> >> > block
>> >> > and stores with non-constant block invalidate all cache entries. There 
>> >> > is
>> >> > room to improve this by using the actual variable name we are accessing 
>> >> > to
>> >> > limit the entries that should be invalidated. We also need to invalidate
>> >> > cache entries when we exit the block in which they have been defined
>> >> > (i.e. inside if/else blocks or loops).
>> >> >
>> >> > The cache optimization is built as a separate pass, instead of merging 
>> >> > it
>> >> > inside the lower_ubo_reference pass for a number of reasons:
>> >> >
>> >> > 1) The way we process assignments in visitors is that the LHS is
>> >> > processed before the RHS. This creates a problem for an optimization
>> >> > such as this when we do things like a = a + 1, since we would see the
>> >> > store before the read when the actual execution order is reversed.
>> >> > This could be fixed by re-implementing the logic in the visit_enter
>> >> > method for ir_assignment in lower_ubo_reference and then returning
>> >> > visit_continue_with_parent.
>> >> >
>> >> > 2) Some writes/reads need to be split into multiple smaller
>> >> > writes/reads, and we need to handle caching for each one. This happens
>> >> > deep inside the code that handles the lowering and some
>> >> > of the information we need to do this is not available. This could also
>> >> > be fixed by passing more data into the corresponding functions or by
>> >> > making this data available as class members, but the current 
>> >> > implementation
>> >> > is already complex enough and  this would only contribute to the 
>> >> > complexity.
>> >> >
>> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these 
>> >> > cases
>> >> > the current code in lower_uo_reference would see the store before the 
>> >> > read.
>> >> > Probably fixable, but again would add more complexity to the lowering.
>> >> >
>> >> > On the other hand, a separate pass that runs after the lowering sees
>> >> > all the individal loads and stores in the correct order (so we don't 
>> >> > need
>> >> > to do any tricks) and it allows us to sepearate the lowering logic 
>> >> > (which
>> >> > is already complex) from the caching logic. It also gives us a chance to
>> >> > run it after other optimization passes have run and turned constant
>> >> > expressions for block/offset into constants, enabling more opportunities
>> >> > for caching.
>> >> 
>> >> Seems like a restricted form of CSE that only handles SSBO loads, and
>> >> only the ones with constant arguments.  Why don't we CSE these? (and
>> >> other memory access operations like image loads)
>> >
>> > There is not a CSE pass in GLSL IR any more so we would have to do it in
>> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
>> > level seemed like a win from this perspective.
>> >
>> > Then there is the fact that we cannot just CSE these. We need to make
>> > sure that we only CSE them when it is safe to do so (i.e. no
>> > stores/atomics to the same offsets in between, no memory barriers, etc).
>> > The current CSE pass in NIR does not support this as far as I can see. I
>> > suppose that we could look into changing the pass to accommodate
>> > restrictions such as this if we think is worth it.
>> >
>> Not really sure if the NIR CSE pass would be adequate, but at least at
>> the i965 back-end level this could be handled easily in the CSE pass
>> (for typed and untyped surface read opcodes in general) in roughly the
>> same way that source variable interference is handled -- Just kill
>> potentially overlapping entries from the AEB whenever an atomic or write
>> instruction for the same surface is seen.
>
> I've been having a quick look at this option but I don't think this is
> going to work for us. The first 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-21 Thread Francisco Jerez
Iago Toral  writes:

> On Wed, 2015-10-21 at 13:00 +0300, Francisco Jerez wrote:
>> Iago Toral  writes:
>> 
>> > Hi Curro,
>> >
>> > On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
>> >> Iago Toral  writes:
>> >> 
>> >> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
>> >> >> Iago Toral Quiroga  writes:
>> >> >> 
>> >> >> > This allows us to re-use the results of previous ssbo loads in 
>> >> >> > situations
>> >> >> > that are safe (i.e. when there are no stores, atomic operations or
>> >> >> > memory barriers in between).
>> >> >> >
>> >> >> > This is particularly useful for things like matrix multiplications, 
>> >> >> > where
>> >> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 
>> >> >> > reads of
>> >> >> > each column) down to 4 (1 read of each column).
>> >> >> >
>> >> >> > The pass can only cache ssbo loads that involve constant blocks and
>> >> >> > offsets, but could be extended to compare sub-expressions for these
>> >> >> > as well, similar to a CSE pass.
>> >> >> >
>> >> >> > The way the cache works is simple: ssbo loads with constant 
>> >> >> > block/offset
>> >> >> > are included in a cache as they are seen. Stores invalidate cache 
>> >> >> > entries.
>> >> >> > Stores with non-constant offset invalidate all cached loads for the 
>> >> >> > block
>> >> >> > and stores with non-constant block invalidate all cache entries. 
>> >> >> > There is
>> >> >> > room to improve this by using the actual variable name we are 
>> >> >> > accessing to
>> >> >> > limit the entries that should be invalidated. We also need to 
>> >> >> > invalidate
>> >> >> > cache entries when we exit the block in which they have been defined
>> >> >> > (i.e. inside if/else blocks or loops).
>> >> >> >
>> >> >> > The cache optimization is built as a separate pass, instead of 
>> >> >> > merging it
>> >> >> > inside the lower_ubo_reference pass for a number of reasons:
>> >> >> >
>> >> >> > 1) The way we process assignments in visitors is that the LHS is
>> >> >> > processed before the RHS. This creates a problem for an optimization
>> >> >> > such as this when we do things like a = a + 1, since we would see the
>> >> >> > store before the read when the actual execution order is reversed.
>> >> >> > This could be fixed by re-implementing the logic in the visit_enter
>> >> >> > method for ir_assignment in lower_ubo_reference and then returning
>> >> >> > visit_continue_with_parent.
>> >> >> >
>> >> >> > 2) Some writes/reads need to be split into multiple smaller
>> >> >> > writes/reads, and we need to handle caching for each one. This 
>> >> >> > happens
>> >> >> > deep inside the code that handles the lowering and some
>> >> >> > of the information we need to do this is not available. This could 
>> >> >> > also
>> >> >> > be fixed by passing more data into the corresponding functions or by
>> >> >> > making this data available as class members, but the current 
>> >> >> > implementation
>> >> >> > is already complex enough and  this would only contribute to the 
>> >> >> > complexity.
>> >> >> >
>> >> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In 
>> >> >> > these cases
>> >> >> > the current code in lower_uo_reference would see the store before 
>> >> >> > the read.
>> >> >> > Probably fixable, but again would add more complexity to the 
>> >> >> > lowering.
>> >> >> >
>> >> >> > On the other hand, a separate pass that runs after the lowering sees
>> >> >> > all the individal loads and stores in the correct order (so we don't 
>> >> >> > need
>> >> >> > to do any tricks) and it allows us to sepearate the lowering logic 
>> >> >> > (which
>> >> >> > is already complex) from the caching logic. It also gives us a 
>> >> >> > chance to
>> >> >> > run it after other optimization passes have run and turned constant
>> >> >> > expressions for block/offset into constants, enabling more 
>> >> >> > opportunities
>> >> >> > for caching.
>> >> >> 
>> >> >> Seems like a restricted form of CSE that only handles SSBO loads, and
>> >> >> only the ones with constant arguments.  Why don't we CSE these? (and
>> >> >> other memory access operations like image loads)
>> >> >
>> >> > There is not a CSE pass in GLSL IR any more so we would have to do it in
>> >> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
>> >> > level seemed like a win from this perspective.
>> >> >
>> >> > Then there is the fact that we cannot just CSE these. We need to make
>> >> > sure that we only CSE them when it is safe to do so (i.e. no
>> >> > stores/atomics to the same offsets in between, no memory barriers, etc).
>> >> > The current CSE pass in NIR does not support this as far as I can see. I
>> >> > suppose that we could look into changing the pass to accommodate
>> >> > restrictions such as this if we think is worth it.
>> >> >
>> >> Not really sure if the NIR CSE pass would be adequate, but 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-21 Thread Iago Toral
On Wed, 2015-10-21 at 13:00 +0300, Francisco Jerez wrote:
> Iago Toral  writes:
> 
> > Hi Curro,
> >
> > On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
> >> Iago Toral  writes:
> >> 
> >> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> >> >> Iago Toral Quiroga  writes:
> >> >> 
> >> >> > This allows us to re-use the results of previous ssbo loads in 
> >> >> > situations
> >> >> > that are safe (i.e. when there are no stores, atomic operations or
> >> >> > memory barriers in between).
> >> >> >
> >> >> > This is particularly useful for things like matrix multiplications, 
> >> >> > where
> >> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 
> >> >> > reads of
> >> >> > each column) down to 4 (1 read of each column).
> >> >> >
> >> >> > The pass can only cache ssbo loads that involve constant blocks and
> >> >> > offsets, but could be extended to compare sub-expressions for these
> >> >> > as well, similar to a CSE pass.
> >> >> >
> >> >> > The way the cache works is simple: ssbo loads with constant 
> >> >> > block/offset
> >> >> > are included in a cache as they are seen. Stores invalidate cache 
> >> >> > entries.
> >> >> > Stores with non-constant offset invalidate all cached loads for the 
> >> >> > block
> >> >> > and stores with non-constant block invalidate all cache entries. 
> >> >> > There is
> >> >> > room to improve this by using the actual variable name we are 
> >> >> > accessing to
> >> >> > limit the entries that should be invalidated. We also need to 
> >> >> > invalidate
> >> >> > cache entries when we exit the block in which they have been defined
> >> >> > (i.e. inside if/else blocks or loops).
> >> >> >
> >> >> > The cache optimization is built as a separate pass, instead of 
> >> >> > merging it
> >> >> > inside the lower_ubo_reference pass for a number of reasons:
> >> >> >
> >> >> > 1) The way we process assignments in visitors is that the LHS is
> >> >> > processed before the RHS. This creates a problem for an optimization
> >> >> > such as this when we do things like a = a + 1, since we would see the
> >> >> > store before the read when the actual execution order is reversed.
> >> >> > This could be fixed by re-implementing the logic in the visit_enter
> >> >> > method for ir_assignment in lower_ubo_reference and then returning
> >> >> > visit_continue_with_parent.
> >> >> >
> >> >> > 2) Some writes/reads need to be split into multiple smaller
> >> >> > writes/reads, and we need to handle caching for each one. This happens
> >> >> > deep inside the code that handles the lowering and some
> >> >> > of the information we need to do this is not available. This could 
> >> >> > also
> >> >> > be fixed by passing more data into the corresponding functions or by
> >> >> > making this data available as class members, but the current 
> >> >> > implementation
> >> >> > is already complex enough and  this would only contribute to the 
> >> >> > complexity.
> >> >> >
> >> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In 
> >> >> > these cases
> >> >> > the current code in lower_uo_reference would see the store before the 
> >> >> > read.
> >> >> > Probably fixable, but again would add more complexity to the lowering.
> >> >> >
> >> >> > On the other hand, a separate pass that runs after the lowering sees
> >> >> > all the individal loads and stores in the correct order (so we don't 
> >> >> > need
> >> >> > to do any tricks) and it allows us to sepearate the lowering logic 
> >> >> > (which
> >> >> > is already complex) from the caching logic. It also gives us a chance 
> >> >> > to
> >> >> > run it after other optimization passes have run and turned constant
> >> >> > expressions for block/offset into constants, enabling more 
> >> >> > opportunities
> >> >> > for caching.
> >> >> 
> >> >> Seems like a restricted form of CSE that only handles SSBO loads, and
> >> >> only the ones with constant arguments.  Why don't we CSE these? (and
> >> >> other memory access operations like image loads)
> >> >
> >> > There is not a CSE pass in GLSL IR any more so we would have to do it in
> >> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
> >> > level seemed like a win from this perspective.
> >> >
> >> > Then there is the fact that we cannot just CSE these. We need to make
> >> > sure that we only CSE them when it is safe to do so (i.e. no
> >> > stores/atomics to the same offsets in between, no memory barriers, etc).
> >> > The current CSE pass in NIR does not support this as far as I can see. I
> >> > suppose that we could look into changing the pass to accommodate
> >> > restrictions such as this if we think is worth it.
> >> >
> >> Not really sure if the NIR CSE pass would be adequate, but at least at
> >> the i965 back-end level this could be handled easily in the CSE pass
> >> (for typed and untyped surface read opcodes in general) in roughly the

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-21 Thread Iago Toral
Hi Curro,

On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
> Iago Toral  writes:
> 
> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> >> Iago Toral Quiroga  writes:
> >> 
> >> > This allows us to re-use the results of previous ssbo loads in situations
> >> > that are safe (i.e. when there are no stores, atomic operations or
> >> > memory barriers in between).
> >> >
> >> > This is particularly useful for things like matrix multiplications, where
> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
> >> > each column) down to 4 (1 read of each column).
> >> >
> >> > The pass can only cache ssbo loads that involve constant blocks and
> >> > offsets, but could be extended to compare sub-expressions for these
> >> > as well, similar to a CSE pass.
> >> >
> >> > The way the cache works is simple: ssbo loads with constant block/offset
> >> > are included in a cache as they are seen. Stores invalidate cache 
> >> > entries.
> >> > Stores with non-constant offset invalidate all cached loads for the block
> >> > and stores with non-constant block invalidate all cache entries. There is
> >> > room to improve this by using the actual variable name we are accessing 
> >> > to
> >> > limit the entries that should be invalidated. We also need to invalidate
> >> > cache entries when we exit the block in which they have been defined
> >> > (i.e. inside if/else blocks or loops).
> >> >
> >> > The cache optimization is built as a separate pass, instead of merging it
> >> > inside the lower_ubo_reference pass for a number of reasons:
> >> >
> >> > 1) The way we process assignments in visitors is that the LHS is
> >> > processed before the RHS. This creates a problem for an optimization
> >> > such as this when we do things like a = a + 1, since we would see the
> >> > store before the read when the actual execution order is reversed.
> >> > This could be fixed by re-implementing the logic in the visit_enter
> >> > method for ir_assignment in lower_ubo_reference and then returning
> >> > visit_continue_with_parent.
> >> >
> >> > 2) Some writes/reads need to be split into multiple smaller
> >> > writes/reads, and we need to handle caching for each one. This happens
> >> > deep inside the code that handles the lowering and some
> >> > of the information we need to do this is not available. This could also
> >> > be fixed by passing more data into the corresponding functions or by
> >> > making this data available as class members, but the current 
> >> > implementation
> >> > is already complex enough and  this would only contribute to the 
> >> > complexity.
> >> >
> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these 
> >> > cases
> >> > the current code in lower_uo_reference would see the store before the 
> >> > read.
> >> > Probably fixable, but again would add more complexity to the lowering.
> >> >
> >> > On the other hand, a separate pass that runs after the lowering sees
> >> > all the individal loads and stores in the correct order (so we don't need
> >> > to do any tricks) and it allows us to sepearate the lowering logic (which
> >> > is already complex) from the caching logic. It also gives us a chance to
> >> > run it after other optimization passes have run and turned constant
> >> > expressions for block/offset into constants, enabling more opportunities
> >> > for caching.
> >> 
> >> Seems like a restricted form of CSE that only handles SSBO loads, and
> >> only the ones with constant arguments.  Why don't we CSE these? (and
> >> other memory access operations like image loads)
> >
> > There is not a CSE pass in GLSL IR any more so we would have to do it in
> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
> > level seemed like a win from this perspective.
> >
> > Then there is the fact that we cannot just CSE these. We need to make
> > sure that we only CSE them when it is safe to do so (i.e. no
> > stores/atomics to the same offsets in between, no memory barriers, etc).
> > The current CSE pass in NIR does not support this as far as I can see. I
> > suppose that we could look into changing the pass to accommodate
> > restrictions such as this if we think is worth it.
> >
> Not really sure if the NIR CSE pass would be adequate, but at least at
> the i965 back-end level this could be handled easily in the CSE pass
> (for typed and untyped surface read opcodes in general) in roughly the
> same way that source variable interference is handled -- Just kill
> potentially overlapping entries from the AEB whenever an atomic or write
> instruction for the same surface is seen.

I've been having a quick look at this option but I don't think this is
going to work for us. The first thing I noticed is that the CSE pass in
i965 was not currently acting on ssbo loads at all: that is, it would
add them to the AEB but immediately remove them because of this:

/* Kill any AEB entries using 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-20 Thread Iago Toral
On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> Iago Toral Quiroga  writes:
> 
> > This allows us to re-use the results of previous ssbo loads in situations
> > that are safe (i.e. when there are no stores, atomic operations or
> > memory barriers in between).
> >
> > This is particularly useful for things like matrix multiplications, where
> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
> > each column) down to 4 (1 read of each column).
> >
> > The pass can only cache ssbo loads that involve constant blocks and
> > offsets, but could be extended to compare sub-expressions for these
> > as well, similar to a CSE pass.
> >
> > The way the cache works is simple: ssbo loads with constant block/offset
> > are included in a cache as they are seen. Stores invalidate cache entries.
> > Stores with non-constant offset invalidate all cached loads for the block
> > and stores with non-constant block invalidate all cache entries. There is
> > room to improve this by using the actual variable name we are accessing to
> > limit the entries that should be invalidated. We also need to invalidate
> > cache entries when we exit the block in which they have been defined
> > (i.e. inside if/else blocks or loops).
> >
> > The cache optimization is built as a separate pass, instead of merging it
> > inside the lower_ubo_reference pass for a number of reasons:
> >
> > 1) The way we process assignments in visitors is that the LHS is
> > processed before the RHS. This creates a problem for an optimization
> > such as this when we do things like a = a + 1, since we would see the
> > store before the read when the actual execution order is reversed.
> > This could be fixed by re-implementing the logic in the visit_enter
> > method for ir_assignment in lower_ubo_reference and then returning
> > visit_continue_with_parent.
> >
> > 2) Some writes/reads need to be split into multiple smaller
> > writes/reads, and we need to handle caching for each one. This happens
> > deep inside the code that handles the lowering and some
> > of the information we need to do this is not available. This could also
> > be fixed by passing more data into the corresponding functions or by
> > making this data available as class members, but the current implementation
> > is already complex enough and  this would only contribute to the complexity.
> >
> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these cases
> > the current code in lower_uo_reference would see the store before the read.
> > Probably fixable, but again would add more complexity to the lowering.
> >
> > On the other hand, a separate pass that runs after the lowering sees
> > all the individal loads and stores in the correct order (so we don't need
> > to do any tricks) and it allows us to sepearate the lowering logic (which
> > is already complex) from the caching logic. It also gives us a chance to
> > run it after other optimization passes have run and turned constant
> > expressions for block/offset into constants, enabling more opportunities
> > for caching.
> 
> Seems like a restricted form of CSE that only handles SSBO loads, and
> only the ones with constant arguments.  Why don't we CSE these? (and
> other memory access operations like image loads)

There is not a CSE pass in GLSL IR any more so we would have to do it in
NIR and some drivers would lose the optimization. Doing it at GLSL IR
level seemed like a win from this perspective.

Then there is the fact that we cannot just CSE these. We need to make
sure that we only CSE them when it is safe to do so (i.e. no
stores/atomics to the same offsets in between, no memory barriers, etc).
The current CSE pass in NIR does not support this as far as I can see. I
suppose that we could look into changing the pass to accommodate
restrictions such as this if we think is worth it.

Iago

> 
> > ---
> >  src/glsl/Makefile.sources  |   1 +
> >  src/glsl/ir_optimization.h |   1 +
> >  src/glsl/opt_ssbo_load.cpp | 338 
> > +
> >  3 files changed, 340 insertions(+)
> >  create mode 100644 src/glsl/opt_ssbo_load.cpp
> >
> > diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources
> > index ca87036..73c7514 100644
> > --- a/src/glsl/Makefile.sources
> > +++ b/src/glsl/Makefile.sources
> > @@ -201,6 +201,7 @@ LIBGLSL_FILES = \
> > opt_noop_swizzle.cpp \
> > opt_rebalance_tree.cpp \
> > opt_redundant_jumps.cpp \
> > +   opt_ssbo_load.cpp \
> > opt_structure_splitting.cpp \
> > opt_swizzle_swizzle.cpp \
> > opt_tree_grafting.cpp \
> > diff --git a/src/glsl/ir_optimization.h b/src/glsl/ir_optimization.h
> > index ce5c492..26677d7 100644
> > --- a/src/glsl/ir_optimization.h
> > +++ b/src/glsl/ir_optimization.h
> > @@ -125,6 +125,7 @@ bool lower_clip_distance(gl_shader *shader);
> >  void lower_output_reads(unsigned stage, exec_list *instructions);
> >  bool lower_packing_builtins(exec_list 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-20 Thread Francisco Jerez
Iago Toral Quiroga  writes:

> This allows us to re-use the results of previous ssbo loads in situations
> that are safe (i.e. when there are no stores, atomic operations or
> memory barriers in between).
>
> This is particularly useful for things like matrix multiplications, where
> for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
> each column) down to 4 (1 read of each column).
>
> The pass can only cache ssbo loads that involve constant blocks and
> offsets, but could be extended to compare sub-expressions for these
> as well, similar to a CSE pass.
>
> The way the cache works is simple: ssbo loads with constant block/offset
> are included in a cache as they are seen. Stores invalidate cache entries.
> Stores with non-constant offset invalidate all cached loads for the block
> and stores with non-constant block invalidate all cache entries. There is
> room to improve this by using the actual variable name we are accessing to
> limit the entries that should be invalidated. We also need to invalidate
> cache entries when we exit the block in which they have been defined
> (i.e. inside if/else blocks or loops).
>
> The cache optimization is built as a separate pass, instead of merging it
> inside the lower_ubo_reference pass for a number of reasons:
>
> 1) The way we process assignments in visitors is that the LHS is
> processed before the RHS. This creates a problem for an optimization
> such as this when we do things like a = a + 1, since we would see the
> store before the read when the actual execution order is reversed.
> This could be fixed by re-implementing the logic in the visit_enter
> method for ir_assignment in lower_ubo_reference and then returning
> visit_continue_with_parent.
>
> 2) Some writes/reads need to be split into multiple smaller
> writes/reads, and we need to handle caching for each one. This happens
> deep inside the code that handles the lowering and some
> of the information we need to do this is not available. This could also
> be fixed by passing more data into the corresponding functions or by
> making this data available as class members, but the current implementation
> is already complex enough and  this would only contribute to the complexity.
>
> 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these cases
> the current code in lower_uo_reference would see the store before the read.
> Probably fixable, but again would add more complexity to the lowering.
>
> On the other hand, a separate pass that runs after the lowering sees
> all the individal loads and stores in the correct order (so we don't need
> to do any tricks) and it allows us to sepearate the lowering logic (which
> is already complex) from the caching logic. It also gives us a chance to
> run it after other optimization passes have run and turned constant
> expressions for block/offset into constants, enabling more opportunities
> for caching.

Seems like a restricted form of CSE that only handles SSBO loads, and
only the ones with constant arguments.  Why don't we CSE these? (and
other memory access operations like image loads)


> ---
>  src/glsl/Makefile.sources  |   1 +
>  src/glsl/ir_optimization.h |   1 +
>  src/glsl/opt_ssbo_load.cpp | 338 
> +
>  3 files changed, 340 insertions(+)
>  create mode 100644 src/glsl/opt_ssbo_load.cpp
>
> diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources
> index ca87036..73c7514 100644
> --- a/src/glsl/Makefile.sources
> +++ b/src/glsl/Makefile.sources
> @@ -201,6 +201,7 @@ LIBGLSL_FILES = \
>   opt_noop_swizzle.cpp \
>   opt_rebalance_tree.cpp \
>   opt_redundant_jumps.cpp \
> + opt_ssbo_load.cpp \
>   opt_structure_splitting.cpp \
>   opt_swizzle_swizzle.cpp \
>   opt_tree_grafting.cpp \
> diff --git a/src/glsl/ir_optimization.h b/src/glsl/ir_optimization.h
> index ce5c492..26677d7 100644
> --- a/src/glsl/ir_optimization.h
> +++ b/src/glsl/ir_optimization.h
> @@ -125,6 +125,7 @@ bool lower_clip_distance(gl_shader *shader);
>  void lower_output_reads(unsigned stage, exec_list *instructions);
>  bool lower_packing_builtins(exec_list *instructions, int op_mask);
>  void lower_ubo_reference(struct gl_shader *shader, exec_list *instructions);
> +bool opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions);
>  void lower_packed_varyings(void *mem_ctx,
> unsigned locations_used, ir_variable_mode mode,
> unsigned gs_input_vertices, gl_shader *shader);
> diff --git a/src/glsl/opt_ssbo_load.cpp b/src/glsl/opt_ssbo_load.cpp
> new file mode 100644
> index 000..5404907
> --- /dev/null
> +++ b/src/glsl/opt_ssbo_load.cpp
> @@ -0,0 +1,338 @@
> +/*
> + * Copyright © 2015 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the 

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-20 Thread Francisco Jerez
Iago Toral  writes:

> On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
>> Iago Toral Quiroga  writes:
>> 
>> > This allows us to re-use the results of previous ssbo loads in situations
>> > that are safe (i.e. when there are no stores, atomic operations or
>> > memory barriers in between).
>> >
>> > This is particularly useful for things like matrix multiplications, where
>> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
>> > each column) down to 4 (1 read of each column).
>> >
>> > The pass can only cache ssbo loads that involve constant blocks and
>> > offsets, but could be extended to compare sub-expressions for these
>> > as well, similar to a CSE pass.
>> >
>> > The way the cache works is simple: ssbo loads with constant block/offset
>> > are included in a cache as they are seen. Stores invalidate cache entries.
>> > Stores with non-constant offset invalidate all cached loads for the block
>> > and stores with non-constant block invalidate all cache entries. There is
>> > room to improve this by using the actual variable name we are accessing to
>> > limit the entries that should be invalidated. We also need to invalidate
>> > cache entries when we exit the block in which they have been defined
>> > (i.e. inside if/else blocks or loops).
>> >
>> > The cache optimization is built as a separate pass, instead of merging it
>> > inside the lower_ubo_reference pass for a number of reasons:
>> >
>> > 1) The way we process assignments in visitors is that the LHS is
>> > processed before the RHS. This creates a problem for an optimization
>> > such as this when we do things like a = a + 1, since we would see the
>> > store before the read when the actual execution order is reversed.
>> > This could be fixed by re-implementing the logic in the visit_enter
>> > method for ir_assignment in lower_ubo_reference and then returning
>> > visit_continue_with_parent.
>> >
>> > 2) Some writes/reads need to be split into multiple smaller
>> > writes/reads, and we need to handle caching for each one. This happens
>> > deep inside the code that handles the lowering and some
>> > of the information we need to do this is not available. This could also
>> > be fixed by passing more data into the corresponding functions or by
>> > making this data available as class members, but the current implementation
>> > is already complex enough and  this would only contribute to the 
>> > complexity.
>> >
>> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these 
>> > cases
>> > the current code in lower_uo_reference would see the store before the read.
>> > Probably fixable, but again would add more complexity to the lowering.
>> >
>> > On the other hand, a separate pass that runs after the lowering sees
>> > all the individal loads and stores in the correct order (so we don't need
>> > to do any tricks) and it allows us to sepearate the lowering logic (which
>> > is already complex) from the caching logic. It also gives us a chance to
>> > run it after other optimization passes have run and turned constant
>> > expressions for block/offset into constants, enabling more opportunities
>> > for caching.
>> 
>> Seems like a restricted form of CSE that only handles SSBO loads, and
>> only the ones with constant arguments.  Why don't we CSE these? (and
>> other memory access operations like image loads)
>
> There is not a CSE pass in GLSL IR any more so we would have to do it in
> NIR and some drivers would lose the optimization. Doing it at GLSL IR
> level seemed like a win from this perspective.
>
> Then there is the fact that we cannot just CSE these. We need to make
> sure that we only CSE them when it is safe to do so (i.e. no
> stores/atomics to the same offsets in between, no memory barriers, etc).
> The current CSE pass in NIR does not support this as far as I can see. I
> suppose that we could look into changing the pass to accommodate
> restrictions such as this if we think is worth it.
>
Not really sure if the NIR CSE pass would be adequate, but at least at
the i965 back-end level this could be handled easily in the CSE pass
(for typed and untyped surface read opcodes in general) in roughly the
same way that source variable interference is handled -- Just kill
potentially overlapping entries from the AEB whenever an atomic or write
instruction for the same surface is seen.

> Iago
>
>> 
>> > ---
>> >  src/glsl/Makefile.sources  |   1 +
>> >  src/glsl/ir_optimization.h |   1 +
>> >  src/glsl/opt_ssbo_load.cpp | 338 
>> > +
>> >  3 files changed, 340 insertions(+)
>> >  create mode 100644 src/glsl/opt_ssbo_load.cpp
>> >
>> > diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources
>> > index ca87036..73c7514 100644
>> > --- a/src/glsl/Makefile.sources
>> > +++ b/src/glsl/Makefile.sources
>> > @@ -201,6 +201,7 @@ LIBGLSL_FILES = \
>> >opt_noop_swizzle.cpp \
>> >

Re: [Mesa-dev] [PATCH 1/2] glsl: Implement a SSBO load optimization pass

2015-10-20 Thread Iago Toral
On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote:
> Iago Toral  writes:
> 
> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote:
> >> Iago Toral Quiroga  writes:
> >> 
> >> > This allows us to re-use the results of previous ssbo loads in situations
> >> > that are safe (i.e. when there are no stores, atomic operations or
> >> > memory barriers in between).
> >> >
> >> > This is particularly useful for things like matrix multiplications, where
> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads of
> >> > each column) down to 4 (1 read of each column).
> >> >
> >> > The pass can only cache ssbo loads that involve constant blocks and
> >> > offsets, but could be extended to compare sub-expressions for these
> >> > as well, similar to a CSE pass.
> >> >
> >> > The way the cache works is simple: ssbo loads with constant block/offset
> >> > are included in a cache as they are seen. Stores invalidate cache 
> >> > entries.
> >> > Stores with non-constant offset invalidate all cached loads for the block
> >> > and stores with non-constant block invalidate all cache entries. There is
> >> > room to improve this by using the actual variable name we are accessing 
> >> > to
> >> > limit the entries that should be invalidated. We also need to invalidate
> >> > cache entries when we exit the block in which they have been defined
> >> > (i.e. inside if/else blocks or loops).
> >> >
> >> > The cache optimization is built as a separate pass, instead of merging it
> >> > inside the lower_ubo_reference pass for a number of reasons:
> >> >
> >> > 1) The way we process assignments in visitors is that the LHS is
> >> > processed before the RHS. This creates a problem for an optimization
> >> > such as this when we do things like a = a + 1, since we would see the
> >> > store before the read when the actual execution order is reversed.
> >> > This could be fixed by re-implementing the logic in the visit_enter
> >> > method for ir_assignment in lower_ubo_reference and then returning
> >> > visit_continue_with_parent.
> >> >
> >> > 2) Some writes/reads need to be split into multiple smaller
> >> > writes/reads, and we need to handle caching for each one. This happens
> >> > deep inside the code that handles the lowering and some
> >> > of the information we need to do this is not available. This could also
> >> > be fixed by passing more data into the corresponding functions or by
> >> > making this data available as class members, but the current 
> >> > implementation
> >> > is already complex enough and  this would only contribute to the 
> >> > complexity.
> >> >
> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these 
> >> > cases
> >> > the current code in lower_uo_reference would see the store before the 
> >> > read.
> >> > Probably fixable, but again would add more complexity to the lowering.
> >> >
> >> > On the other hand, a separate pass that runs after the lowering sees
> >> > all the individal loads and stores in the correct order (so we don't need
> >> > to do any tricks) and it allows us to sepearate the lowering logic (which
> >> > is already complex) from the caching logic. It also gives us a chance to
> >> > run it after other optimization passes have run and turned constant
> >> > expressions for block/offset into constants, enabling more opportunities
> >> > for caching.
> >> 
> >> Seems like a restricted form of CSE that only handles SSBO loads, and
> >> only the ones with constant arguments.  Why don't we CSE these? (and
> >> other memory access operations like image loads)
> >
> > There is not a CSE pass in GLSL IR any more so we would have to do it in
> > NIR and some drivers would lose the optimization. Doing it at GLSL IR
> > level seemed like a win from this perspective.
> >
> > Then there is the fact that we cannot just CSE these. We need to make
> > sure that we only CSE them when it is safe to do so (i.e. no
> > stores/atomics to the same offsets in between, no memory barriers, etc).
> > The current CSE pass in NIR does not support this as far as I can see. I
> > suppose that we could look into changing the pass to accommodate
> > restrictions such as this if we think is worth it.
> >
> Not really sure if the NIR CSE pass would be adequate, but at least at
> the i965 back-end level this could be handled easily in the CSE pass
> (for typed and untyped surface read opcodes in general) in roughly the
> same way that source variable interference is handled -- Just kill
> potentially overlapping entries from the AEB whenever an atomic or write
> instruction for the same surface is seen.

I didn't think of solving this for i965 only, but if we can get
non-constant block/offset handled easily it is definitely worth a try.
I'll have a go at it.

Even if we do that for i965, I wonder if we still want to have something
like this at GLSL IR though.

Iago

> > Iago
> >
> >> 
> >> > ---
> >> >