Re: [Mesa-dev] [PATCH 01/22] glsl: Reorder optimization-passes

2015-01-17 Thread Matt Turner
On Sat, Jan 17, 2015 at 8:31 AM, Thomas Helland
thomashellan...@gmail.com wrote:
 2015-01-03 22:48 GMT+01:00 Matt Turner matts...@gmail.com:
 On Sat, Jan 3, 2015 at 11:18 AM, Thomas Helland
 thomashellan...@gmail.com wrote:
 This allows opt_algebraic to resolve open-coded
 saturates into ir_unop_saturate before we potentially
 mess it up by removing the min or max in min/max-pruning.

 Since we are now emitting more free saturates on i965
 this gives us some decrease in instruction count.

 total instructions in shared programs: 1317459 - 1317065 (-0.03%)
 instructions in affected programs: 4084 - 3690 (-9.65%)
 GAINED:0
 LOST:  0

 You're definitely onto something here. On our collection of shaders:

 total instructions in shared programs: 5876617 - 5875919 (-0.01%)
 instructions in affected programs: 9443 - 8745 (-7.39%)

 with some fragment shaders hurt in Natural Selection 2 and Kerbal Space 
 program.

 I'll investigate these.

 Hi Matt,

 Don't want to be a nuisance (if that is even the right word?
 English is not my native tongue), but did you find the
 time to look at these regressions?

Nuisance is indeed the right word, but you are not being one. :)

I'll definitely look into this. Sorry that I haven't had a chance yet.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
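
For readers unfamiliar with the term, an "open-coded saturate" is a clamp to [0, 1] written out as min(max(x, 0), 1). The following is only a minimal sketch of why pass ordering matters for recognizing it; the function names are hypothetical and nothing here is taken from the Mesa sources.

```c
#include <assert.h>

static float min_f(float a, float b) { return a < b ? a : b; }
static float max_f(float a, float b) { return a > b ? a : b; }

/* Open-coded saturate: the min(max(x, 0), 1) shape a shader may contain. */
static float open_coded_saturate(float x)
{
    return min_f(max_f(x, 0.0f), 1.0f);
}

/* Reference for what a single dedicated saturate operation computes. */
static float saturate_ref(float x)
{
    if (x < 0.0f)
        return 0.0f;
    if (x > 1.0f)
        return 1.0f;
    return x;
}

/* The same expression once an earlier pass has dropped the inner max()
 * (which it would only do when it knows x >= 0).  The result is unchanged
 * for such inputs, but the min(max(...)) shape is gone, so a pattern
 * matcher running afterwards can no longer turn it into a saturate. */
static float after_max_removed(float x)
{
    return min_f(x, 1.0f);
}
```

Running the algebraic pass first lets it see the full pattern before any pruning simplifies half of it away, which is the ordering the patch description argues for.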


Re: [Mesa-dev] [PATCH 2/5] nir: use Python to autogenerate opcode information

2015-01-17 Thread Connor Abbott
On Sat, Jan 17, 2015 at 11:42 AM, ahmad luig...@yandex.com wrote:
 hi.

 "#! /usr/bin/env python" resolves to the Python 3.x series on some major
 distros (Arch, Fedora, ...) and to Python 2.x on others.

 Python 2.x and Python 3.x are not source compatible with each other.

 Python 3.x no longer has the xrange function.

 range vs. xrange is only meaningful for Python 2.x.

 http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/

 Distros that still use the 2.x series as the default Python interpreter are
 moving to 3.x.

 regards.

Yes... if you look at the part of the patch that modifies Makefile.am,
it's actually called with $(PYTHON) which will be python2 on distros
where python3 is the default. Unfortunately, on some distros there's
no python2, so #!/usr/bin/env python2 won't work either... you can't
please everyone. So the line you mentioned is more customary than
anything else.


Re: [Mesa-dev] [PATCH 2/5] nir: use Python to autogenerate opcode information

2015-01-17 Thread Dylan Baker
On Saturday, January 17, 2015 01:09:45 PM Connor Abbott wrote:
 On Sat, Jan 17, 2015 at 11:42 AM, ahmad luig...@yandex.com wrote:
  hi.
 
  "#! /usr/bin/env python" resolves to the Python 3.x series on some major
  distros (Arch, Fedora, ...) and to Python 2.x on others.
 
  Python 2.x and Python 3.x are not source compatible with each other.
 
  Python 3.x no longer has the xrange function.
 
  range vs. xrange is only meaningful for Python 2.x.
 
  http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/
 
  Distros that still use the 2.x series as the default Python interpreter are
  moving to 3.x.
 
  regards.
 
 Yes... if you look at the part of the patch that modifies Makefile.am,
 it's actually called with $(PYTHON) which will be python2 on distros
 where python3 is the default. Unfortunately, on some distros there's
 no python2, so #!/usr/bin/env python2 won't work either... you can't
 please everyone. So the line you mentioned is more customary than
 anything else.

While I agree with you, Connor, when I did a survey for piglit I found
that OSX was the only major OS that didn't provide a python2 symlink;
Arch, Gentoo, Debian, Fedora, and CentOS all did, and Windows doesn't
care about a shebang.

I would actually be in favor of using /usr/bin/python2 anyway,
just because it makes it clear we're using python2, but ultimately
you're right and it doesn't really matter.

It's also been on my todo list to get all of the python in mesa working
with both python2 and python3, but I have too many things to get done.

Dylan



[Mesa-dev] [PATCH 1/7] i965: Enable L3 caching of buffer surfaces.

2015-01-17 Thread Francisco Jerez
And remove the mocs argument of the emit_buffer_surface_state vtbl hook.  Its
semantics vary greatly from one generation to another, so it kind of
encourages the caller to pass 0 which is the only valid setting across
generations.  After this commit the hardware-specific code decides what the
best cacheability settings are for buffer surfaces, just like we do for
textures.

This together with some additional changes coming is expected to improve
performance of pull constants, buffer textures, atomic counters and image
objects on Gen7 and up.
---
 src/mesa/drivers/dri/i965/brw_context.h   | 1 -
 src/mesa/drivers/dri/i965/brw_wm_surface_state.c  | 4 +---
 src/mesa/drivers/dri/i965/gen7_wm_surface_state.c | 4 +---
 src/mesa/drivers/dri/i965/gen8_surface_state.c| 3 +--
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_context.h 
b/src/mesa/drivers/dri/i965/brw_context.h
index a4b29fa..6195d3d 100644
--- a/src/mesa/drivers/dri/i965/brw_context.h
+++ b/src/mesa/drivers/dri/i965/brw_context.h
@@ -975,7 +975,6 @@ struct brw_context
 unsigned surface_format,
 unsigned buffer_size,
 unsigned pitch,
-unsigned mocs,
 bool rw);
 
   /**
diff --git a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c 
b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
index 85a08d5..ece352b 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
@@ -221,7 +221,6 @@ gen4_emit_buffer_surface_state(struct brw_context *brw,
unsigned surface_format,
unsigned buffer_size,
unsigned pitch,
-   unsigned mocs,
bool rw)
 {
uint32_t *surf = brw_state_batch(brw, AUB_TRACE_SURFACE_STATE,
@@ -279,7 +278,6 @@ brw_update_buffer_texture_surface(struct gl_context *ctx,
brw_format,
size / texel_size,
texel_size,
-   0, /* mocs */
false /* rw */);
 }
 
@@ -382,7 +380,7 @@ brw_create_constant_surface(struct brw_context *brw,
 
brw->vtbl.emit_buffer_surface_state(brw, out_offset, bo, offset,
BRW_SURFACEFORMAT_R32G32B32A32_FLOAT,
-   elements, stride, 0, false);
+   elements, stride, false);
 }
 
 /**
diff --git a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c 
b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
index e2c347a..24547d9 100644
--- a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
@@ -225,7 +225,6 @@ gen7_emit_buffer_surface_state(struct brw_context *brw,
unsigned surface_format,
unsigned buffer_size,
unsigned pitch,
-   unsigned mocs,
bool rw)
 {
uint32_t *surf = brw_state_batch(brw, AUB_TRACE_SURFACE_STATE,
@@ -241,7 +240,7 @@ gen7_emit_buffer_surface_state(struct brw_context *brw,
surf[3] = SET_FIELD(((buffer_size - 1) >> 21) & 0x3f, BRW_SURFACE_DEPTH) |
  (pitch - 1);
 
-   surf[5] = SET_FIELD(mocs, GEN7_SURFACE_MOCS);
+   surf[5] = SET_FIELD(GEN7_MOCS_L3, GEN7_SURFACE_MOCS);
 
if (brw->is_haswell) {
   surf[7] |= (SET_FIELD(HSW_SCS_RED,   GEN7_SURFACE_SCS_R) |
@@ -385,7 +384,6 @@ gen7_create_raw_surface(struct brw_context *brw, 
drm_intel_bo *bo,
   BRW_SURFACEFORMAT_RAW,
   size,
   1,
-  0 /* mocs */,
   true /* rw */);
 }
 
diff --git a/src/mesa/drivers/dri/i965/gen8_surface_state.c 
b/src/mesa/drivers/dri/i965/gen8_surface_state.c
index d1b095c..8d4e180 100644
--- a/src/mesa/drivers/dri/i965/gen8_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_surface_state.c
@@ -116,9 +116,9 @@ gen8_emit_buffer_surface_state(struct brw_context *brw,
unsigned surface_format,
unsigned buffer_size,
unsigned pitch,
-   unsigned mocs,
bool rw)
 {
+   const unsigned mocs = brw->gen >= 9 ? SKL_MOCS_WB : BDW_MOCS_WB;
uint32_t *surf = allocate_surface_state(brw, out_offset);
 
surf[0] = BRW_SURFACE_BUFFER << BRW_SURFACE_TYPE_SHIFT |
@@ -286,7 +286,6 @@ gen8_create_raw_surface(struct brw_context *brw, 
drm_intel_bo *bo,
 

[Mesa-dev] [PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.

2015-01-17 Thread Francisco Jerez
It's a wrapper around emit_buffer_surface_state with format=RAW, pitch=1,
rw=true and the remaining arguments ordered differently.  There's no point in
having a separate vtbl pointer for that.
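
As a rough illustration of why the hook is redundant, here is a minimal sketch of a wrapper that only fixes some arguments and reorders the rest. The struct, the function names and the value of SURFACEFORMAT_RAW are hypothetical stand-ins, not the driver's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SURFACEFORMAT_RAW = 1 }; /* hypothetical stand-in value */

/* Simplified record of what a buffer surface state would describe. */
struct buffer_surface {
    uint32_t offset;
    unsigned format;
    unsigned size;
    unsigned pitch;
    bool rw;
};

/* Stand-in for the general hook: every parameter is chosen by the caller. */
static void emit_buffer_surface(struct buffer_surface *out, uint32_t offset,
                                unsigned format, unsigned size,
                                unsigned pitch, bool rw)
{
    out->offset = offset;
    out->format = format;
    out->size = size;
    out->pitch = pitch;
    out->rw = rw;
}

/* Stand-in for the removed hook: same call with format=RAW, pitch=1 and
 * rw=true baked in, and the remaining arguments in a different order. */
static void create_raw_surface(uint32_t offset, uint32_t size,
                               struct buffer_surface *out)
{
    emit_buffer_surface(out, offset, SURFACEFORMAT_RAW, size, 1, true);
}
```

Since the wrapper adds no behaviour of its own, callers can spell out the three fixed arguments themselves, which is what the hunks below do.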
---
 src/mesa/drivers/dri/i965/brw_binding_tables.c|  8 +---
 src/mesa/drivers/dri/i965/brw_context.h   |  6 --
 src/mesa/drivers/dri/i965/brw_wm_surface_state.c  |  6 +++---
 src/mesa/drivers/dri/i965/gen7_wm_surface_state.c | 19 ---
 src/mesa/drivers/dri/i965/gen8_surface_state.c| 16 
 5 files changed, 8 insertions(+), 47 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_binding_tables.c 
b/src/mesa/drivers/dri/i965/brw_binding_tables.c
index ea82e71..08e4191 100644
--- a/src/mesa/drivers/dri/i965/brw_binding_tables.c
+++ b/src/mesa/drivers/dri/i965/brw_binding_tables.c
@@ -68,9 +68,11 @@ brw_upload_binding_table(struct brw_context *brw,
} else {
   /* Upload a new binding table. */
   if (INTEL_DEBUG & DEBUG_SHADER_TIME) {
- brw->vtbl.create_raw_surface(
-brw, brw->shader_time.bo, 0, brw->shader_time.bo->size,
-&stage_state->surf_offset[prog_data->binding_table.shader_time_start], true);
+ brw->vtbl.emit_buffer_surface_state(
+brw, &stage_state->surf_offset[
+prog_data->binding_table.shader_time_start],
+brw->shader_time.bo, 0, BRW_SURFACEFORMAT_RAW,
+brw->shader_time.bo->size, 1, true);
   }
 
   uint32_t *bind = brw_state_batch(brw, AUB_TRACE_BINDING_TABLE,
diff --git a/src/mesa/drivers/dri/i965/brw_context.h 
b/src/mesa/drivers/dri/i965/brw_context.h
index 6195d3d..d21e175 100644
--- a/src/mesa/drivers/dri/i965/brw_context.h
+++ b/src/mesa/drivers/dri/i965/brw_context.h
@@ -962,12 +962,6 @@ struct brw_context
   void (*update_null_renderbuffer_surface)(struct brw_context *brw,
   unsigned unit);
 
-  void (*create_raw_surface)(struct brw_context *brw,
- drm_intel_bo *bo,
- uint32_t offset,
- uint32_t size,
- uint32_t *out_offset,
- bool rw);
   void (*emit_buffer_surface_state)(struct brw_context *brw,
 uint32_t *out_offset,
 drm_intel_bo *bo,
diff --git a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c 
b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
index ece352b..e5f2058 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
@@ -920,9 +920,9 @@ brw_upload_abo_surfaces(struct brw_context *brw,
   drm_intel_bo *bo = intel_bufferobj_buffer(
  brw, intel_bo, binding->Offset, intel_bo->Base.Size - binding->Offset);
 
-  brw->vtbl.create_raw_surface(brw, bo, binding->Offset,
-   bo->size - binding->Offset,
-   &surf_offsets[i], true);
+  brw->vtbl.emit_buffer_surface_state(brw, &surf_offsets[i], bo,
+  binding->Offset, BRW_SURFACEFORMAT_RAW,
+  bo->size - binding->Offset, 1, true);
}
 
if (prog->NumAtomicBuffers)
diff --git a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c 
b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
index 24547d9..1421ac4 100644
--- a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
@@ -370,24 +370,6 @@ gen7_update_texture_surface(struct gl_context *ctx,
 }
 
 /**
- * Create a raw surface for untyped R/W access.
- */
-static void
-gen7_create_raw_surface(struct brw_context *brw, drm_intel_bo *bo,
-uint32_t offset, uint32_t size,
-uint32_t *out_offset, bool rw)
-{
-   gen7_emit_buffer_surface_state(brw,
-  out_offset,
-  bo,
-  offset,
-  BRW_SURFACEFORMAT_RAW,
-  size,
-  1,
-  true /* rw */);
-}
-
-/**
  * Creates a null renderbuffer surface.
  *
  * This is used when the shader doesn't write to any color output.  An FB
@@ -563,6 +545,5 @@ gen7_init_vtable_surface_functions(struct brw_context *brw)
brw->vtbl.update_renderbuffer_surface = gen7_update_renderbuffer_surface;
brw->vtbl.update_null_renderbuffer_surface =
   gen7_update_null_renderbuffer_surface;
-   brw->vtbl.create_raw_surface = gen7_create_raw_surface;
brw->vtbl.emit_buffer_surface_state = gen7_emit_buffer_surface_state;
 }
diff --git a/src/mesa/drivers/dri/i965/gen8_surface_state.c 
b/src/mesa/drivers/dri/i965/gen8_surface_state.c
index 8d4e180..9ddbbad 100644
--- a/src/mesa/drivers/dri/i965/gen8_surface_state.c

[Mesa-dev] [PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in lower_load_payload().

2015-01-17 Thread Francisco Jerez
It's perfectly fine to read the second half of a register written with
force_writemask_all from a first half MOV instruction or vice versa, and
lower_load_payload shouldn't mark the whole MOV as belonging to the second
half in that case.  Replicate the same metadata to both halves of the
destination when writemasking is disabled.
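
A minimal sketch of the propagation rule described above, assuming a simplified per-register metadata record. The struct and function names are hypothetical and only loosely modelled on the fields touched by the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-register bookkeeping (hypothetical layout). */
struct reg_metadata {
    bool written;
    bool force_sechalf;
    bool force_writemask_all;
};

/* Copy the metadata of a two-register source to a two-register destination.
 * With writemasking disabled, the first register's metadata is replicated to
 * both halves; otherwise each half keeps its own record. */
static void propagate_pair(struct reg_metadata dst[2],
                           const struct reg_metadata src[2])
{
    dst[0] = src[0];
    if (src[0].force_writemask_all)
        dst[1] = src[0];
    else
        dst[1] = src[1];
}
```

The replicated case is the one that avoids wrongly tagging a whole MOV as second-half just because one of its halves was written that way.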
---
 src/mesa/drivers/dri/i965/brw_fs.cpp | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 4a61943..d585a67 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -3059,9 +3059,11 @@ fs_visitor::lower_load_payload()
   }
 
   if (inst->dst.file == MRF || inst->dst.file == GRF) {
- bool force_sechalf = inst->force_sechalf;
+ bool force_sechalf = inst->force_sechalf &&
+  !inst->force_writemask_all;
  bool toggle_sechalf = inst->dst.width == 16 &&
-   type_sz(inst->dst.type) == 4;
+   type_sz(inst->dst.type) == 4 &&
+   !inst->force_writemask_all;
  for (int i = 0; i < inst->regs_written; ++i) {
 metadata[dst_reg + i].written = true;
 metadata[dst_reg + i].force_sechalf = force_sechalf;
@@ -3104,11 +3106,15 @@ fs_visitor::lower_load_payload()
   mov->force_writemask_all = metadata[src_reg].force_writemask_all;
   metadata[dst_reg] = metadata[src_reg];
   if (dst.width * type_sz(dst.type) > 32) {
- assert((!metadata[src_reg].written ||
- !metadata[src_reg].force_sechalf) &&
-(!metadata[src_reg + 1].written ||
- metadata[src_reg + 1].force_sechalf));
- metadata[dst_reg + 1] = metadata[src_reg + 1];
+ if (metadata[src_reg].force_writemask_all) {
+metadata[dst_reg + 1] = metadata[src_reg];
+ } else {
+assert((!metadata[src_reg].written ||
+!metadata[src_reg].force_sechalf) &&
+   (!metadata[src_reg + 1].written ||
+metadata[src_reg + 1].force_sechalf));
+metadata[dst_reg + 1] = metadata[src_reg + 1];
+ }
   }
} else {
   metadata[dst_reg].force_writemask_all = false;
-- 
2.1.3



[Mesa-dev] [PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the target cache.

2015-01-17 Thread Francisco Jerez
brw_set_dp_read_message already had a target_cache argument, but its
interpretation was rather contrived: on Gen7+ it was ignored and the data
cache was always used; on Gen6 the render cache was used if the caller asked
for it, otherwise the sampler cache was used instead.

brw_set_dp_write_message used the data cache on Gen7+ except for
RENDER_TARGET_WRITE messages, in which case it would use the render cache.  On
Gen6 the render cache was always used.

Makes no functional changes.  Some of the nested ternary operators introduced
here will go away in a future commit.
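
To make the "no functional changes" claim concrete, here is a minimal sketch of the write path: the old callee-side selection, and the new arrangement where the caller picks the cache. The enum values and function names are hypothetical stand-ins for the driver's SFID constants, and the caller-side policy shown is an assumption that simply mirrors the old behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the shared function IDs. */
enum sfid {
    SFID_DATAPORT_WRITE,
    SFID_RENDER_CACHE,
    SFID_DATA_CACHE
};

/* Old behaviour: the callee decides from the generation and message type. */
static enum sfid old_write_sfid(int gen, bool is_rt_write)
{
    if (gen >= 7)
        return is_rt_write ? SFID_RENDER_CACHE : SFID_DATA_CACHE;
    if (gen == 6)
        return SFID_RENDER_CACHE;
    return SFID_DATAPORT_WRITE;
}

/* New behaviour: on Gen6+ the callee just uses what the caller asked for. */
static enum sfid new_write_sfid(int gen, enum sfid target_cache)
{
    return gen >= 6 ? target_cache : SFID_DATAPORT_WRITE;
}

/* Assumed caller-side policy that reproduces the old selection. */
static enum sfid caller_target_cache(int gen, bool is_rt_write)
{
    if (gen >= 7 && !is_rt_write)
        return SFID_DATA_CACHE;
    return SFID_RENDER_CACHE;
}
```

With the policy moved to the callers, a later commit can change the cache for one kind of message without touching the shared helper.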
---
 src/mesa/drivers/dri/i965/brw_eu.h   |  1 +
 src/mesa/drivers/dri/i965/brw_eu_emit.c  | 58 
 src/mesa/drivers/dri/i965/brw_vec4_generator.cpp | 16 ++-
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_eu.h 
b/src/mesa/drivers/dri/i965/brw_eu.h
index 22d5a0a..60f6f69 100644
--- a/src/mesa/drivers/dri/i965/brw_eu.h
+++ b/src/mesa/drivers/dri/i965/brw_eu.h
@@ -225,6 +225,7 @@ void brw_set_dp_write_message(struct brw_compile *p,
  unsigned binding_table_index,
  unsigned msg_control,
  unsigned msg_type,
+  unsigned target_cache,
  unsigned msg_length,
  bool header_present,
  unsigned last_render_target,
diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c 
b/src/mesa/drivers/dri/i965/brw_eu_emit.c
index 8f15db9..c2e490d 100644
--- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
@@ -675,6 +675,7 @@ brw_set_dp_write_message(struct brw_compile *p,
 unsigned binding_table_index,
 unsigned msg_control,
 unsigned msg_type,
+ unsigned target_cache,
 unsigned msg_length,
 bool header_present,
 unsigned last_render_target,
@@ -683,20 +684,8 @@ brw_set_dp_write_message(struct brw_compile *p,
 unsigned send_commit_msg)
 {
struct brw_context *brw = p->brw;
-   unsigned sfid;
-
-   if (brw->gen >= 7) {
-  /* Use the Render Cache for RT writes; otherwise use the Data Cache */
-  if (msg_type == GEN6_DATAPORT_WRITE_MESSAGE_RENDER_TARGET_WRITE)
-sfid = GEN6_SFID_DATAPORT_RENDER_CACHE;
-  else
-sfid = GEN7_SFID_DATAPORT_DATA_CACHE;
-   } else if (brw->gen == 6) {
-  /* Use the render cache for all write messages. */
-  sfid = GEN6_SFID_DATAPORT_RENDER_CACHE;
-   } else {
-  sfid = BRW_SFID_DATAPORT_WRITE;
-   }
+   const unsigned sfid = (brw->gen >= 6 ? target_cache :
+  BRW_SFID_DATAPORT_WRITE);
 
brw_set_message_descriptor(p, insn, sfid, msg_length, response_length,
  header_present, end_of_thread);
@@ -722,18 +711,8 @@ brw_set_dp_read_message(struct brw_compile *p,
unsigned response_length)
 {
struct brw_context *brw = p->brw;
-   unsigned sfid;
-
-   if (brw->gen >= 7) {
-  sfid = GEN7_SFID_DATAPORT_DATA_CACHE;
-   } else if (brw->gen == 6) {
-  if (target_cache == BRW_DATAPORT_READ_TARGET_RENDER_CACHE)
-sfid = GEN6_SFID_DATAPORT_RENDER_CACHE;
-  else
-sfid = GEN6_SFID_DATAPORT_SAMPLER_CACHE;
-   } else {
-  sfid = BRW_SFID_DATAPORT_READ;
-   }
+   const unsigned sfid = (brw->gen >= 6 ? target_cache :
+  BRW_SFID_DATAPORT_READ);
 
brw_set_message_descriptor(p, insn, sfid, msg_length, response_length,
  header_present, false);
@@ -1989,6 +1968,10 @@ void brw_oword_block_write_scratch(struct brw_compile *p,
   unsigned offset)
 {
struct brw_context *brw = p->brw;
+   const unsigned target_cache =
+  (brw->gen >= 7 ? GEN7_SFID_DATAPORT_DATA_CACHE :
+   brw->gen >= 6 ? GEN6_SFID_DATAPORT_RENDER_CACHE :
+   BRW_DATAPORT_READ_TARGET_RENDER_CACHE);
uint32_t msg_control, msg_type;
int mlen;
 
@@ -2077,6 +2060,7 @@ void brw_oword_block_write_scratch(struct brw_compile *p,
   255, /* binding table index (255=stateless) */
   msg_control,
   msg_type,
+   target_cache,
   mlen,
   true, /* header_present */
   0, /* not a render target */
@@ -2102,6 +2086,10 @@ brw_oword_block_read_scratch(struct brw_compile *p,
 unsigned offset)
 {
struct brw_context *brw = p->brw;
+   const unsigned target_cache =
+  (brw->gen >= 7 ? GEN7_SFID_DATAPORT_DATA_CACHE :
+   brw->gen >= 6 ? GEN6_SFID_DATAPORT_RENDER_CACHE :
+   BRW_DATAPORT_READ_TARGET_RENDER_CACHE);

[Mesa-dev] [PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.

2015-01-17 Thread Francisco Jerez
This reverts to using the oword block read messages for uniform pull constant
loads, as used to be the case until 4c1fdae0a01b3f92ec03b61aac1d3df5.  There
are two important differences though: Now the L3 cacheability bits are set up
correctly, and we target the constant cache instead of the data cache.  The
latter turns out to get no L3 way allocation on boot on most platforms, so
data cache messages are currently *not* cached on L3 regardless of the MOCS
bits, which probably explains the apparent slowness of oword fetches back then.

Constant cache loads seem to perform better than SIMD4x2 sampler loads
in a number of cases; they alleviate some of the cache thrashing
caused by the competition with textures for the L1/L2 sampler caches,
and they allow fetching up to 8 consecutive owords (128B) with just
one message.

FPS deltas relative to master are shown below for all generations since Gen6
and all oword block sizes from 1 to 8.

                           1oword   2oword   4oword   8oword

OglShMapPcf   SNB              3%       3%       5%       6%
              IVB              9%      11%      28%      41%
              BYT              5%       7%      25%      42%
              HSW              9%       9%      20%      30%
              BDW              3%       5%      19%      33%
              BSW              3%       5%      25%      44%

ubo-worst     SNB              2%       2%       2%       3%
              IVB            -85%     -71%     -71%     -71%
              BYT             14%      14%      14%      14%
              HSW              0%      -1%      -1%      -1%
              BDW            -40%       0%       0%       0%
              BSW              0%       0%       0%       0%

ubo-best      SNB            191%     190%     205%     205%
              IVB            152%     350%     474%     563%
              BYT             83%     208%     292%     353%
              HSW            292%     464%     615%     726%
              BDW             38%     267%     546%     581%
              BSW            124%     721%    1135%     580%

shader-db     HSW
  gained - lost                 0       -3       -1       -1
  instruction delta in      3.44%    1.90%   -1.77%   -3.48%
  affected programs

OglShMapPcf is a PCF shadow mapping benchmark from SynMark that exercises pull
constants and texture sampling; other tests from the Finnish benchmarking
system show either a smaller improvement or no significant change.  ubo-worst
and ubo-best are the worst- and best-case scenarios of a simple microbenchmark
that reads n constants (with n between 1 and 128) from a UBO, accessing up to
2kB of memory per invocation with alignment taken into account.  Typically the
gap between master and my constant cache branch increases with the amount of
bandwidth used by the shader, with n=128 showing the greatest improvement.

IVB's apparent worst-case regression deserves an explanation.  After some
investigation it seems to be caused by a hardware bug that serializes read
requests to the L3 for the same cacheline, as a result of a mechanism of the
L3 (buggy on IVB) meant to preserve coherency.  Because read requests for
matching cachelines from any L3 client are not pipelined, throughput will
decrease in cases where there are no non-overlapping requests left in the
queue that can be processed in between.  I suspect that this situation is
relatively uncommon in real-world applications, as the regression disappears
completely from my microbenchmark as soon as each individual shader invocation
accesses more than two non-overlapping cachelines from L3.

To make this situation less likely we should make sure that we don't use the
1/2 oword messages at all if the shader intends to read from any other
location in the same cacheline at some other point.  This is generally a good
idea anyway on all generations because using the 1 and 2 oword messages is
expected to waste bandwidth, since the minimum L3 request size for the DC is
exactly 4 owords (i.e. one cacheline); this probably explains the negative
result in the first column for BDW.  A future commit will have this effect.
I haven't been able to find any real-world example where this would still
result in a regression, but if someone happens to find one it shouldn't be too
difficult to add an IVB-specific heuristic that falls back to using the
sampler for pull constant loads when a shader uses less than a certain amount
of L3 bandwidth.
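
The bandwidth argument above comes down to simple arithmetic: an oword is 16 bytes, a cacheline is 4 owords (64 bytes), and 8 owords are 128 bytes. A minimal sketch, assuming as a simplified model that every message is rounded up to whole cachelines on the L3 side; the helper names are hypothetical.

```c
#include <assert.h>

enum {
    OWORD_BYTES = 16,      /* one oword is 128 bits */
    CACHELINE_OWORDS = 4   /* minimum L3 request size stated above */
};

/* Bytes pulled from L3 for a block read of `owords` owords, under the
 * assumption that requests are rounded up to whole cachelines. */
static unsigned l3_bytes_fetched(unsigned owords)
{
    const unsigned lines = (owords + CACHELINE_OWORDS - 1) / CACHELINE_OWORDS;
    return lines * CACHELINE_OWORDS * OWORD_BYTES;
}

/* Bytes fetched from L3 that the message does not return to the shader. */
static unsigned wasted_bytes(unsigned owords)
{
    return l3_bytes_fetched(owords) - owords * OWORD_BYTES;
}
```

Under this model a 1 oword message discards 48 of the 64 bytes fetched and a 2 oword message discards 32, while 4 and 8 oword messages discard nothing.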
---
 src/mesa/drivers/dri/i965/brw_eu_emit.c|  5 +-
 src/mesa/drivers/dri/i965/brw_fs.cpp   | 34 ---
 src/mesa/drivers/dri/i965/brw_fs.h |  2 +-
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 84 --
 4 files changed, 40 insertions(+), 85 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c 
b/src/mesa/drivers/dri/i965/brw_eu_emit.c
index c2e490d..7829878 100644
--- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
@@ -2194,7 +2194,7 @@ gen7_block_read_scratch(struct brw_compile *p,
 }
 
 /**
- * Read a float[4] vector from the data port Data Cache (const buffer).
+ * Read a 

[Mesa-dev] [PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.

2015-01-17 Thread Francisco Jerez
Not used anymore.  It was just a scalar MOV.
---
 src/mesa/drivers/dri/i965/brw_defines.h|  1 -
 src/mesa/drivers/dri/i965/brw_fs.h |  3 ---
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 26 --
 src/mesa/drivers/dri/i965/brw_shader.cpp   |  2 --
 4 files changed, 32 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_defines.h 
b/src/mesa/drivers/dri/i965/brw_defines.h
index f02a0b8..fe255cc 100644
--- a/src/mesa/drivers/dri/i965/brw_defines.h
+++ b/src/mesa/drivers/dri/i965/brw_defines.h
@@ -933,7 +933,6 @@ enum opcode {
FS_OPCODE_DISCARD_JUMP,
FS_OPCODE_SET_OMASK,
FS_OPCODE_SET_SAMPLE_ID,
-   FS_OPCODE_SET_SIMD4X2_OFFSET,
FS_OPCODE_PACK_HALF_2x16_SPLIT,
FS_OPCODE_UNPACK_HALF_2x16_SPLIT_X,
FS_OPCODE_UNPACK_HALF_2x16_SPLIT_Y,
diff --git a/src/mesa/drivers/dri/i965/brw_fs.h 
b/src/mesa/drivers/dri/i965/brw_fs.h
index 8349ad2..28a427e 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.h
+++ b/src/mesa/drivers/dri/i965/brw_fs.h
@@ -830,9 +830,6 @@ private:
struct brw_reg src0,
struct brw_reg src1);
 
-   void generate_set_simd4x2_offset(fs_inst *inst,
-struct brw_reg dst,
-struct brw_reg offset);
void generate_discard_jump(fs_inst *inst);
 
void generate_pack_half_2x16_split(fs_inst *inst,
diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp 
b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
index b1fca41..e9cd0d9 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
@@ -1276,28 +1276,6 @@ fs_generator::generate_pixel_interpolator_query(fs_inst *inst,
  inst->regs_written);
 }
 
-
-/**
- * Sets the first word of a vgrf for gen7+ simd4x2 uniform pull constant
- * sampler LD messages.
- *
- * We don't want to bake it into the send message's code generation because
- * that means we don't get a chance to schedule the instructions.
- */
-void
-fs_generator::generate_set_simd4x2_offset(fs_inst *inst,
-  struct brw_reg dst,
-  struct brw_reg value)
-{
-   assert(value.file == BRW_IMMEDIATE_VALUE);
-
-   brw_push_insn_state(p);
-   brw_set_default_compression_control(p, BRW_COMPRESSION_NONE);
-   brw_set_default_mask_control(p, BRW_MASK_DISABLE);
-   brw_MOV(p, retype(brw_vec1_reg(dst.file, dst.nr, 0), value.type), value);
-   brw_pop_insn_state(p);
-}
-
 /* Sets vstride=16, width=8, hstride=2 or vstride=0, width=1, hstride=0
  * (when mask is passed as a uniform) of register mask before moving it
  * to register dst.
@@ -1947,10 +1925,6 @@ fs_generator::generate_code(const cfg_t *cfg, int 
dispatch_width)
  generate_untyped_surface_read(inst, dst, src[0], src[1]);
  break;
 
-  case FS_OPCODE_SET_SIMD4X2_OFFSET:
- generate_set_simd4x2_offset(inst, dst, src[0]);
- break;
-
   case FS_OPCODE_SET_OMASK:
  generate_set_omask(inst, dst, src[0]);
  break;
diff --git a/src/mesa/drivers/dri/i965/brw_shader.cpp 
b/src/mesa/drivers/dri/i965/brw_shader.cpp
index d76134b..f77c9a2 100644
--- a/src/mesa/drivers/dri/i965/brw_shader.cpp
+++ b/src/mesa/drivers/dri/i965/brw_shader.cpp
@@ -512,8 +512,6 @@ brw_instruction_name(enum opcode op)
   return "set_omask";
case FS_OPCODE_SET_SAMPLE_ID:
   return "set_sample_id";
-   case FS_OPCODE_SET_SIMD4X2_OFFSET:
-  return "set_simd4x2_offset";
 
case FS_OPCODE_PACK_HALF_2x16_SPLIT:
   return "pack_half_2x16_split";
-- 
2.1.3



[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-17 Thread Francisco Jerez
This is the first part of a series meant to improve our usage of the L3 cache.
Currently it's far from ideal since the following objects aren't taking any
advantage of it:
 - Pull constants (i.e. UBOs and demoted uniforms)
 - Buffer textures
 - Shader scratch space (i.e. register spills and fills)
 - Atomic counters
 - (Soon) Images

This first series addresses the first two issues.  Fixing the last three is
going to be a bit more difficult because we need to modify the partitioning of
the L3 cache in order to increase the number of ways assigned to the DC, which
happens to be zero on boot until Gen8.  That's likely to require kernel
changes because we don't have any extremely satisfactory API to change that
from userspace right now.

The first patch in the series sets the MOCS L3 cacheability bit in the surface
state structure for buffers so the mentioned memory objects (except the shader
scratch space that gets its MOCS from elsewhere) have a chance of getting
cached in L3.

The fourth patch in the series switches to using the constant cache (which,
unlike the data cache that was used years ago before we started using the
sampler, is cached on L3 with the default partitioning on all gens) for
uniform pull constants loads.  The overall performance numbers I've collected
are included in the commit message of the same patch for future reference.
Most of it points at the constant cache being faster than the sampler in a
number of cases (assuming the L3 caching settings are correct), it's also
likely to alleviate some cache thrashing caused by the competition with
textures for the L1/L2 sampler caches, and it allows fetching up to eight
consecutive owords (128B) with just one message.

The sixth patch enables 4 oword loads because they're basically for free and
they avoid some of the shortcomings of the 1 and 2 oword messages (see the
commit message for more details).  I'll have a look into enabling 8 oword
loads but it's going to require an analysis pass to avoid wasting bandwidth
and increasing the register pressure unnecessarily when the shader doesn't
actually need as many constants.

We could do something similar for non-uniform offset pull constant loads and
for both kinds of pull constant loads on the vec4 back-end, but I don't have
enough performance data to support that yet.

[PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
[PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
[PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the target cache.
[PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
[PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in lower_load_payload().
[PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
[PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.


[Mesa-dev] [PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.

2015-01-17 Thread Francisco Jerez
Asking the DC for less than one cacheline (4 owords) of data for uniform pull
constants is suboptimal because the DC itself cannot request less than that
from L3, resulting in wasted bandwidth, unnecessary message dispatch overhead
and exacerbating the L3 serialization bug on IVB.  Improves performance of
pull constants on all generations I've tested so far.  On BDW and BSW the FPS
of a microbenchmark increases up to 5-6x; see the third column of the table in
"i965/fs: Switch to the constant cache for uniform pull constants." for more
detailed numbers.

Going up to 8 oword blocks would improve performance of pull constants even
more, but at the cost of some additional bandwidth and register pressure, so
I'd rather do that as a follow-up together with some on-demand mechanism to
calculate the block size based on the number of constants actually used by the
shader.

Currently untested on Gen4-5.
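
A minimal sketch of the address arithmetic involved in fetching a whole cacheline: the byte offset of a constant is rounded down to the start of its block, and the constant is then found at some offset inside the fetched data. It assumes 4-byte constants and 32-byte registers; the helper names are hypothetical, and how the real patch consumes the in-block offset is not shown in this excerpt.

```c
#include <assert.h>

/* Start (in bytes) of the block containing constant number `pull_index`,
 * for a fetch spanning `num_regs` 32-byte registers. */
static unsigned block_base(unsigned pull_index, unsigned num_regs)
{
    const unsigned block_bytes = 32 * num_regs;

    /* The mask trick below only works for power-of-two block sizes. */
    assert(block_bytes != 0 && (block_bytes & (block_bytes - 1)) == 0);
    return (pull_index * 4) & ~(block_bytes - 1);
}

/* Byte offset of the same constant inside the fetched block. */
static unsigned offset_in_block(unsigned pull_index, unsigned num_regs)
{
    return pull_index * 4 - block_base(pull_index, num_regs);
}
```

With num_regs = 2 (4 owords), constants 0 through 15 share the block at byte 0 and constant 17 lives 4 bytes into the block at byte 64, so neighbouring constants are served by a single message.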
---
 src/mesa/drivers/dri/i965/brw_eu_emit.c| 10 
 src/mesa/drivers/dri/i965/brw_fs.cpp   | 33 --
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 13 +-
 src/mesa/drivers/dri/i965/brw_fs_nir.cpp   | 25 +++
 src/mesa/drivers/dri/i965/brw_fs_visitor.cpp   | 27 -
 5 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c 
b/src/mesa/drivers/dri/i965/brw_eu_emit.c
index 7829878..b30db88 100644
--- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
@@ -2194,7 +2194,7 @@ gen7_block_read_scratch(struct brw_compile *p,
 }
 
 /**
- * Read a float[4] vector from the data port constant cache.
+ * Read four float[4] vectors from the data port constant cache.
  * Location (in buffer) should be a multiple of 16.
  * Used for fetching shader constants.
  */
@@ -2231,8 +2231,8 @@ void brw_oword_block_read(struct brw_compile *p,
 
brw_inst *insn = next_insn(p, BRW_OPCODE_SEND);
 
-   /* cast dest to a uword[8] vector */
-   dest = retype(vec8(dest), BRW_REGISTER_TYPE_UW);
+   /* cast dest to a dword[16] vector */
+   dest = retype(vec16(dest), BRW_REGISTER_TYPE_UD);
 
brw_set_dest(p, insn, dest);
if (brw->gen >= 6) {
@@ -2245,12 +2245,12 @@ void brw_oword_block_read(struct brw_compile *p,
brw_set_dp_read_message(p,
   insn,
   bind_table_index,
-  BRW_DATAPORT_OWORD_BLOCK_1_OWORDLOW,
+  BRW_DATAPORT_OWORD_BLOCK_4_OWORDS,
   BRW_DATAPORT_READ_MESSAGE_OWORD_BLOCK_READ,
   target_cache,
   1, /* msg_length */
true, /* header_present */
-  1); /* response_length (1 reg, 2 owords!) */
+  2); /* response_length (2 regs, 4 owords!) */
 
brw_pop_insn_state(p);
 }
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index d585a67..3c41e01 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -2261,29 +2261,40 @@ fs_visitor::demote_pull_constants()
  current_annotation = inst->annotation;
 
  fs_reg 
surf_index(stage_prog_data->binding_table.pull_constants_start);
- fs_reg dst = fs_reg(this, glsl_type::float_type);
 
  /* Generate a pull load into dst. */
  if (inst->src[i].reladdr) {
+const fs_reg dst = fs_reg(this, glsl_type::float_type);
 exec_list list = VARYING_PULL_CONSTANT_LOAD(dst,
 surf_index,
 *inst->src[i].reladdr,
 pull_index);
 inst->insert_before(block, list);
+
+/* Rewrite the instruction to use the temporary VGRF. */
+inst->src[i].file = GRF;
 inst->src[i].reladdr = NULL;
+inst->src[i].reg = dst.reg;
+inst->src[i].reg_offset = 0;
  } else {
-fs_reg offset = fs_reg((unsigned)(pull_index * 4) & ~15);
+const unsigned num_regs = 2; /* Fetch 4 owords at a time. */
+const unsigned base = (pull_index * 4) & ~(32 * num_regs - 1);
+const fs_reg dst(GRF, virtual_grf_alloc(num_regs),
+ BRW_REGISTER_TYPE_F, dispatch_width);
 fs_inst *pull =
-   new(mem_ctx) fs_inst(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD, 8,
-dst, surf_index, offset);
+   new(mem_ctx) fs_inst(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
+dst, surf_index, fs_reg(base));
+pull->force_writemask_all = true;
+pull->regs_written = num_regs;
 inst->insert_before(block, pull);
-inst->src[i].set_smear(pull_index & 3);
+
+/* Rewrite the instruction to 

Re: [Mesa-dev] [PATCH 1/7] i965: Enable L3 caching of buffer surfaces.

2015-01-17 Thread Kenneth Graunke
On Sunday, January 18, 2015 01:04:03 AM Francisco Jerez wrote:
 And remove the mocs argument of the emit_buffer_surface_state vtbl hook.  Its
 semantics vary greatly from one generation to another, so it kind of
 encourages the caller to pass 0 which is the only valid setting across
 generations.  After this commit the hardware-specific code decides what the
 best cacheability settings are for buffer surfaces, just like we do for
 textures.
 
 This together with some additional changes coming is expected to improve
 performance of pull constants, buffer textures, atomic counters and image
 objects on Gen7 and up.

Thanks!  I had a version of this lying around, but never measured any gain
from it, so I never bothered to send it.  I definitely like removing the
parameter, and we probably should set it - we do everywhere else...

This patch is:
Reviewed-by: Kenneth Graunke kenn...@whitecape.org



Re: [Mesa-dev] [PATCH 2/5] nir: use Python to autogenerate opcode information

2015-01-17 Thread ahmad
That makes sense.

Regards.


[Mesa-dev] [PATCH] i965: Work around mysterious Gen4 GPU hangs with minimal state changes.

2015-01-17 Thread Kenneth Graunke
Gen4 hardware appears to GPU hang frequently when using Chromium, and
also when running 'glmark2 -b ideas'.  Most of the error states contain
3DPRIMITIVE commands in quick succession, with very few state packets
between them - usually VERTEX_BUFFERS/ELEMENTS and CONSTANT_BUFFER.

I trimmed an apitrace of the glmark2 hang down to two draw calls with a
glUniformMatrix4fv call between the two.  Either draw by itself works
fine, but together, they hang the GPU.  Removing the glUniform call
makes the hangs disappear.  In the hardware state, this translates to
removing the CONSTANT_BUFFER packet between the two 3DPRIMITIVE packets.

Flushing before emitting CONSTANT_BUFFER packets also appears to make
the hangs disappear.  I observed a slowdown in glxgears by doing it all
the time, so I've chosen to only do it when BRW_NEW_BATCH and
BRW_NEW_PSP are unset (i.e. we haven't done a CS_URB_STATE change or
already flushed the whole pipeline).
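
The flush condition described above can be sketched as a small predicate. This
is a simplified illustration, not the driver code: the flag values are
placeholders rather than the real BRW_NEW_* bits, and
needs_flush_before_constant_buffer() is a hypothetical name.

```c
#include <stdbool.h>

/* Placeholder dirty bits; the real driver defines its own values. */
enum {
    NEW_BATCH = 1 << 0,  /* a fresh batch was started */
    NEW_PSP   = 1 << 1,  /* pipelined state pointers changed */
    NEW_OTHER = 1 << 2   /* any unrelated state change */
};

/*
 * Returns true when an explicit flush should precede CONSTANT_BUFFER:
 * only on original Gen4 (not G4x), and only if nothing else has already
 * forced a flush for this draw.
 */
static bool needs_flush_before_constant_buffer(int gen, bool is_g4x,
                                               unsigned dirty)
{
    if (gen != 4 || is_g4x)
        return false;

    return (dirty & (NEW_BATCH | NEW_PSP)) == 0;
}
```

Restricting the flush this way keeps the common path untouched whenever a
flush has already happened for another reason.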

I'd much rather understand the problem, but at this point, I don't see
how we'd ever be able to track it down further.  We have no real tools,
and the hardware people moved on years ago.  I've analyzed 20+ error
states and read every scrap of documentation I could find.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=80568
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=85367
Signed-off-by: Kenneth Graunke kenn...@whitecape.org
Cc: 10.4 10.3 mesa-sta...@lists.freedesktop.org
---
 src/mesa/drivers/dri/i965/brw_curbe.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/mesa/drivers/dri/i965/brw_curbe.c 
b/src/mesa/drivers/dri/i965/brw_curbe.c
index c3d3b9d..d0ec859 100644
--- a/src/mesa/drivers/dri/i965/brw_curbe.c
+++ b/src/mesa/drivers/dri/i965/brw_curbe.c
@@ -285,6 +285,19 @@ brw_upload_constant_buffer(struct brw_context *brw)
 */
 
 emit:
+   /* Work around mysterious 965 hangs that appear to happen if you do
+* two 3DPRIMITIVEs with only a CONSTANT_BUFFER inbetween.  If we
+* haven't already flushed for some other reason, explicitly do so.
+*
+* We've found no documented reason why this should be necessary.
+*/
+   if (brw->gen == 4 && !brw->is_g4x &&
+   (brw->state.dirty.brw & (BRW_NEW_BATCH | BRW_NEW_PSP)) == 0) {
+  BEGIN_BATCH(1);
+  OUT_BATCH(MI_FLUSH);
+  ADVANCE_BATCH();
+   }
+
/* BRW_NEW_URB_FENCE: From the gen4 PRM, volume 1, section 3.9.8
 * (CONSTANT_BUFFER (CURBE Load)):
 *
-- 
2.2.2



Re: [Mesa-dev] [PATCH 1/2] i965/fs: Don't use backend_visitor::instructions after creating the CFG.

2015-01-17 Thread Kenneth Graunke
On Friday, January 16, 2015 11:55:33 PM Matt Turner wrote:
 On Fri, Jan 16, 2015 at 11:45 PM, Kenneth Graunke kenn...@whitecape.org 
 wrote:
  On Tuesday, January 13, 2015 03:35:57 PM Matt Turner wrote:
  This is a fix for a regression introduced in commit a9f8296d (i965/fs:
  Preserve the CFG in a few more places.).
 
  The errata this code works around is described in a comment before the 
  function:
 
 [DevBW, DevCL] Errata: A destination register from a send can not be
  used as a destination register until after it has been sourced by an
  instruction with a different destination register.
 
  The framebuffer write's sources must be in message registers, which SEND
  instructions cannot have as a destination. There's no way for this
  errata to affect anything at the end of the program. Just remove the
  code.
 
  I don't think that's the point.  The idea is that code such as
 
 SEND g10  ...sources... rlen 4
 MUL  g10  ... ...
 
  needs a workaround - you can't write to the destination of a SEND safely
  without reading them first.  You'd have to do:
 
 SEND g10  ...sources... rlen 4
 MOV  null g10   pointless read of g10, any instruction will do
 MUL  g10  ...
 
  Normally, the results of SEND instructions are actually used.  However, they
  aren't always - for example, depth texturing returns 4 values, but we only
  care about the .X channel.
 
 Right, and we throw up our hands and resolve all remaining
 dependencies when we see the end of the basic block because there's a
 subsequent basic block that may write the destination.
 
 At the end of the program though... we can't possibly need to resolve
 anything outstanding because we can't possibly overwrite it. Can we?

I agree, I think this should be safe.  It sounds like the effects of the bug
are an undefined write ordering...probably not GPU hangs.  If that's true,
then we're obviously fine - we never overwrite it.
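
To make the dependency rule concrete, here is a toy model of the bookkeeping
in C. It is a sketch under assumptions, not the i965 implementation: the
send_tracker type and its functions are hypothetical, and registers are
plain indices.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_REGS 128

/* Hypothetical tracker: which registers a SEND wrote that nobody read yet. */
struct send_tracker {
    bool pending[NUM_REGS];
};

static void tracker_init(struct send_tracker *t)
{
    memset(t, 0, sizeof(*t));
}

/* A SEND wrote `count` consecutive registers starting at `reg`. */
static void tracker_send(struct send_tracker *t, int reg, int count)
{
    for (int i = 0; i < count && reg + i < NUM_REGS; i++)
        t->pending[reg + i] = true;
}

/* Any instruction that sources `reg` resolves the dependency. */
static void tracker_read(struct send_tracker *t, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    t->pending[reg] = false;
}

/*
 * Returns true if a dummy read (such as "MOV null, reg") has to be emitted
 * before `reg` is written again, and marks the dependency as resolved.
 */
static bool tracker_write_needs_workaround(struct send_tracker *t, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    bool needed = t->pending[reg];
    t->pending[reg] = false;
    return needed;
}
```

In this model, reaching the end of the program with pending entries costs
nothing, because no later write ever queries them.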

On the completely paranoid side of things, there could be some bit in the
hardware that leaves the register stuck: "I'm not done with the last write,
I need to stall until it completes before doing this one."  And, it's
possible it could persist between threads.  Which would leave us stalled
forever, and we'd hang the GPU.

But I sincerely doubt that's the case, and I agree with you that this should
be fine.  I would like to see the commit message updated - instead of the bit
about MRFs, say that we think it's pointless to apply the workaround for
registers that are never written again, and that deleting the code is an
alternative to making it work in CFG-land.

With an updated commit message and Piglit passing (I'll test and let you know),
Reviewed-by: Kenneth Graunke kenn...@whitecape.org



[Mesa-dev] [Bug 88523] sha1.c:37: error: 'SHA1_CTX' undeclared (first use in this function)

2015-01-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=88523

José Fonseca jfons...@vmware.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jfons...@vmware.com

--- Comment #4 from José Fonseca jfons...@vmware.com ---
I think that for src/util either we:

- name headers as "prefix_foo.h" and include them as

  #include "prefix_foo.h"

- or we always include the directory name

  #include "util/foo.h"

Naming headers as "foo.h" and including as "foo.h" is bound to cause conflicts.


I also think that "util" might not be a good prefix for this.  I'd suggest we
rename src/util to, for example, src/cgr -- for "common graphics runtime".



Re: [Mesa-dev] [PATCH 2/2] i965/fs: Only apply Gen4 work-arounds if regs_written > 1.

2015-01-17 Thread Kenneth Graunke
On Tuesday, January 13, 2015 03:40:32 PM Matt Turner wrote:
 On Tue, Jan 13, 2015 at 3:35 PM, Matt Turner matts...@gmail.com wrote:
  Otherwise, we would have necessarily read the results or eliminated the
  dead SEND. In either case, no work around is necessary.
 
  Noticed when debugging the problem the previous patch fixed that any
  time we hit a math instruction, we'd walk through subsequent
  instructions, and of course each time discover that its result was in
  fact used.
  ---
 
 I was thinking through the pre-send dependency work around:
 
 /**
  * Implements this workaround for the original 965:
  *
  * [DevBW, DevCL] Implementation Restrictions: As the hardware does not
  *  check for post destination dependencies on this instruction, software
  *  must ensure that there is no destination hazard for the case of ‘write
  *  followed by a posted write’ shown in the following example.
  *
  *  1. mov r3 0
  *  2. send r3.xy <rest of send instruction>
  *  3. mov r2 r3
  *
  *  Due to no post-destination dependency check on the ‘send’, the above
  *  code sequence could have two instructions (1 and 2) in flight at the
  *  same time that both consider ‘r3’ as the target of their final writes.
  */
 
 While this is a hardware problem or something, isn't it impossible for
 us to hit? If the first MOV's results weren't read, we would have dead
 code eliminated it. If they were read (necessarily between it and the
 SEND), we would never have both instructions in flight at once.

It's definitely pretty rare, though I'm not certain I can say it never
happens.  If you care to look into it further, I found the bug report
which spawned this code:

https://bugs.freedesktop.org/show_bug.cgi?id=58960

The attachment contains a sample application which I managed to compile via:
$ for file in *.h; do moc-qt4 $file > moc-$(basename $file .h).cpp; done
$ g++ -Wall -g $(pkg-config --libs --cflags QtCore QtGui QtOpenGL gl) *.cpp

It would be great if we could make a Piglit test.

 Is there some case where we could realistically hit this problem?
 Maybe with control flow?
 
 I would like to mention that neither of these work arounds are
 implemented in the vec4 backend.

That's true, but they probably should be.  We originally reproduced this bug
with texturing instructions, which at the time were only supported in the FS.

--Ken



[Mesa-dev] [PATCH] gallium: add MULTISAMPLE_Z_RESOLVE cap

2015-01-17 Thread Axel Davy
Resolving a multisampled depth texture into
a single-sampled texture is supported on >= SM4.1
hardware. Some earlier hardware may support it as well.

The ability was tested on radeonsi and nvc0.
Apparently it is also supported on Radeon >= r700.

This patch adds the MULTISAMPLE_Z_RESOLVE cap and
adds it to the drivers. It is advertised by drivers
that are known to support the ability.
A comment was added for drivers where the feature
is probably supported.

Signed-off-by: Axel Davy axel.d...@ens.fr
---
This feature corresponds to the RESZ d3d9 hack.
d3d9 hacks are equivalent to GL extensions.

RESZ is advertised under Windows by AMD >= r700 and Intel >= G45.
Nvidia doesn't advertise the extension but allows a similar feature
in some Nvidia-specific API.

I'm not sending the Gallium Nine RESZ support patch right away,
as I want other patches to be merged first.

 src/gallium/docs/source/screen.rst   | 2 ++
 src/gallium/drivers/freedreno/freedreno_screen.c | 1 +
 src/gallium/drivers/i915/i915_screen.c   | 1 +
 src/gallium/drivers/ilo/ilo_screen.c | 1 +
 src/gallium/drivers/llvmpipe/lp_screen.c | 2 ++
 src/gallium/drivers/nouveau/nv30/nv30_screen.c   | 1 +
 src/gallium/drivers/nouveau/nv50/nv50_screen.c   | 1 +
 src/gallium/drivers/nouveau/nvc0/nvc0_screen.c   | 1 +
 src/gallium/drivers/r300/r300_screen.c   | 1 +
 src/gallium/drivers/r600/r600_pipe.c | 2 ++
 src/gallium/drivers/radeonsi/si_pipe.c   | 1 +
 src/gallium/drivers/softpipe/sp_screen.c | 2 ++
 src/gallium/drivers/svga/svga_screen.c   | 1 +
 src/gallium/drivers/vc4/vc4_screen.c | 1 +
 src/gallium/include/pipe/p_defines.h | 1 +
 15 files changed, 19 insertions(+)

diff --git a/src/gallium/docs/source/screen.rst 
b/src/gallium/docs/source/screen.rst
index 55d114c..b2485bc 100644
--- a/src/gallium/docs/source/screen.rst
+++ b/src/gallium/docs/source/screen.rst
@@ -241,6 +241,8 @@ The integer capabilities:
   semantics. Only relevant if geometry shaders are supported.
   (Currently not possible to query availability of these two semantics outside
   this, at least BASEVERTEX should be exposed separately too).
+* ``PIPE_CAP_MULTISAMPLE_Z_RESOLVE``: Whether the driver supports blitting
+  a multisampled depth buffer into a single-sampled texture (or depth buffer).
 
 
 .. _pipe_capf:
diff --git a/src/gallium/drivers/freedreno/freedreno_screen.c 
b/src/gallium/drivers/freedreno/freedreno_screen.c
index 084a0ec..bf8d4e9 100644
--- a/src/gallium/drivers/freedreno/freedreno_screen.c
+++ b/src/gallium/drivers/freedreno/freedreno_screen.c
@@ -229,6 +229,7 @@ fd_screen_get_param(struct pipe_screen *pscreen, enum 
pipe_cap param)
case PIPE_CAP_SAMPLER_VIEW_TARGET:
case PIPE_CAP_CLIP_HALFZ:
case PIPE_CAP_VERTEXID_NOBASE:
+   case PIPE_CAP_MULTISAMPLE_Z_RESOLVE:
return 0;
 
case PIPE_CAP_MAX_VIEWPORTS:
diff --git a/src/gallium/drivers/i915/i915_screen.c 
b/src/gallium/drivers/i915/i915_screen.c
index 1277de3..1393e7e 100644
--- a/src/gallium/drivers/i915/i915_screen.c
+++ b/src/gallium/drivers/i915/i915_screen.c
@@ -227,6 +227,7 @@ i915_get_param(struct pipe_screen *screen, enum pipe_cap 
cap)
case PIPE_CAP_CONDITIONAL_RENDER_INVERTED:
case PIPE_CAP_CLIP_HALFZ:
case PIPE_CAP_VERTEXID_NOBASE:
+   case PIPE_CAP_MULTISAMPLE_Z_RESOLVE:
   return 0;
 
case PIPE_CAP_MAX_DUAL_SOURCE_RENDER_TARGETS:
diff --git a/src/gallium/drivers/ilo/ilo_screen.c 
b/src/gallium/drivers/ilo/ilo_screen.c
index 0c948f4..a4c9b03 100644
--- a/src/gallium/drivers/ilo/ilo_screen.c
+++ b/src/gallium/drivers/ilo/ilo_screen.c
@@ -470,6 +470,7 @@ ilo_get_param(struct pipe_screen *screen, enum pipe_cap 
param)
case PIPE_CAP_TGSI_FS_FINE_DERIVATIVE:
case PIPE_CAP_CONDITIONAL_RENDER_INVERTED:
case PIPE_CAP_SAMPLER_VIEW_TARGET:
+   case PIPE_CAP_MULTISAMPLE_Z_RESOLVE: /* may be supported */
   return 0;
 
case PIPE_CAP_VENDOR_ID:
diff --git a/src/gallium/drivers/llvmpipe/lp_screen.c 
b/src/gallium/drivers/llvmpipe/lp_screen.c
index 0e4456a..f6e1e52 100644
--- a/src/gallium/drivers/llvmpipe/lp_screen.c
+++ b/src/gallium/drivers/llvmpipe/lp_screen.c
@@ -284,6 +284,8 @@ llvmpipe_get_param(struct pipe_screen *screen, enum 
pipe_cap param)
   return 1;
case PIPE_CAP_VERTEXID_NOBASE:
   return 0;
+   case PIPE_CAP_MULTISAMPLE_Z_RESOLVE: /* may be supported */
+  return 0;
}
/* should only get here on unhandled cases */
debug_printf("Unexpected PIPE_CAP %d query\n", param);
diff --git a/src/gallium/drivers/nouveau/nv30/nv30_screen.c 
b/src/gallium/drivers/nouveau/nv30/nv30_screen.c
index 46c21a1..f7809cb 100644
--- a/src/gallium/drivers/nouveau/nv30/nv30_screen.c
+++ b/src/gallium/drivers/nouveau/nv30/nv30_screen.c
@@ -158,6 +158,7 @@ nv30_screen_get_param(struct pipe_screen *pscreen, enum 
pipe_cap param)
case PIPE_CAP_SAMPLER_VIEW_TARGET:
case PIPE_CAP_CLIP_HALFZ:
case 

[Mesa-dev] [Bug 88534] include/c11/threads_posix.h PTHREAD_MUTEX_RECURSIVE_NP not defined

2015-01-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=88534

            Bug ID: 88534
           Summary: include/c11/threads_posix.h PTHREAD_MUTEX_RECURSIVE_NP
                    not defined
           Product: Mesa
           Version: git
          Hardware: Other
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Mesa core
          Assignee: mesa-dev@lists.freedesktop.org
          Reporter: felix.ja...@posteo.de

Created attachment 112394
  --> https://bugs.freedesktop.org/attachment.cgi?id=112394&action=edit
Proposed patch

The non-portable version of PTHREAD_MUTEX_RECURSIVE is used since older glibc
didn't have the POSIX version. The attached patch makes the code only fall back
to PTHREAD_MUTEX_RECURSIVE_NP if PTHREAD_MUTEX_RECURSIVE is not defined. This
fixes compilation with other libcs such as musl, which don't have the
nonstandard version.
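
The fallback described above might look roughly like the following C sketch.
It is not the attached patch: MUTEX_RECURSIVE_TYPE and recursive_mutex_init()
are hypothetical names chosen for illustration, and the preprocessor check
assumes the libc exposes the POSIX name as a macro (when it does not, the
sketch falls back to the _NP name).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* request pthread_mutexattr_settype() */
#endif

#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Prefer the POSIX name; use the non-portable one only as a fallback. */
#if defined(PTHREAD_MUTEX_RECURSIVE)
#define MUTEX_RECURSIVE_TYPE PTHREAD_MUTEX_RECURSIVE
#else
#define MUTEX_RECURSIVE_TYPE PTHREAD_MUTEX_RECURSIVE_NP
#endif

/* Initializes *m as a recursive mutex; returns 0 or a pthread error code. */
static int recursive_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;

    assert(m != NULL);

    err = pthread_mutexattr_init(&attr);
    if (err != 0)
        return err;

    err = pthread_mutexattr_settype(&attr, MUTEX_RECURSIVE_TYPE);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}
```

A mutex initialized this way can be locked twice by the same thread, as long
as it is unlocked the same number of times.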



[Mesa-dev] [Bug 64449] xorg hangs randomly with Radeon HD 7450A

2015-01-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=64449

Alberto Salvia Novella es204904...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|medium                      |highest
                URL|                            |https://bugs.launchpad.net/
                   |                            |ubuntu/+source/xserver-xorg
                   |                            |-video-ati/+bug/881526
                 CC|                            |es204904...@gmail.com
          Component|Drivers/Gallium/r600        |GLX
           Assignee|dri-devel@lists.freedesktop |mesa-dev@lists.freedesktop.
                   |.org                        |org
            Summary|AMD graphics hardware hangs |xorg hangs randomly with
                   |with an homogeneous         |Radeon HD 7450A
                   |coloured screen or blank    |
                   |screen, and with chirp      |
                   |coming from the graphics    |
                   |card                        |



Re: [Mesa-dev] [PATCH 05/22] glsl: Add sqrt, rsq, exp, exp2 to get_range

2015-01-17 Thread Thomas Helland
I see why you are worried, and I agree 100%.
This just reinforces my impression that expanding this pass does
not give adequate return on investment.
If we had even better coverage we just might get some advantage,
but even then I have a bad feeling about this.

Do you have any suggestions for operations apart from
expressions and constants that we can get a range of?
If so I could work on it some more to figure out if this is
getting us anywhere at all. If I recall correctly the z
component of gl_Position is bound between 0 and 1?

2015-01-09 4:15 GMT+01:00 Connor Abbott cwabbo...@gmail.com:
 On Sat, Jan 3, 2015 at 2:18 PM, Thomas Helland
 thomashellan...@gmail.com wrote:
 Also handle undefined behaviour for sqrt(x) where x < 0
 and rsq(x) where x <= 0.

 This gives us some reduction in instruction count on three
 Dungeon Defenders shaders as they are doing: max(exp(x), 0)

 So initially when you said that Dungeon Defenders was doing
 max(exp(x), 0), my thought was "wat?" but after thinking about it some
 more, I can see why it would do this. The GLSL spec doesn't guarantee
 that implementations of +, *, exp(), etc. will return NaN when one of
 the arguments is NaN, but it also doesn't guarantee that they *won't*;
 in other words, if for some strange reason you need the old-style
 never-return-NaN functionality, you need to do something like what
 this game is doing. For implementations that don't return NaN, this
 optimization is just fine, but if you remove it when the HW does
 return NaN, then whatever's using the result might get a NaN when it's
 not expecting it, leading to Bad Things happening. Maybe it isn't an
 issue with this particular game, but in order to be correct here it
 seems like we do have to take NaN's into account after all.

 There was a related thread (and other discussions) about the behavior
 of min/max wrt NaN's:

 http://lists.freedesktop.org/archives/mesa-dev/2014-December/073182.html

 My conclusion is that basically everyone that actually produces NaN's
 follows the IEEE/D3D behavior here, which I'm assuming the Dungeon
 Defenders developers were probably depending on.


 v2: Change to use new IS_CONSTANT() macro
 Fix high unintentionally not being returned
 Add some air for readability
 Comment on the exploit of undefined behavior
 Constify mem_ctx
 ---
  src/glsl/opt_minmax.cpp | 31 +++
  1 file changed, 31 insertions(+)

 diff --git a/src/glsl/opt_minmax.cpp b/src/glsl/opt_minmax.cpp
 index 56805c0..2faa3c3 100644
 --- a/src/glsl/opt_minmax.cpp
 +++ b/src/glsl/opt_minmax.cpp
 @@ -274,9 +274,40 @@ get_range(ir_rvalue *rval)
 minmax_range r0;
 minmax_range r1;

 +   void *const mem_ctx = ralloc_parent(rval);
 +
 +   ir_constant *low = NULL;
 +   ir_constant *high = NULL;
 +
 if (expr) {
switch (expr->operation) {

 +  case ir_unop_exp:
 +  case ir_unop_exp2:
 +  case ir_unop_sqrt:
 +  case ir_unop_rsq:
 + r0 = get_range(expr->operands[0]);
 +
 + /* The spec says sqrt is undefined if x < 0
 +  * We can use this to set the range to whatever we want
 +  */
 + if (expr->operation == ir_unop_sqrt &&
 + IS_CONSTANT(r0.high, <, 0.0f))
 +high = new(mem_ctx) ir_constant(0.0f);
 +
 + /* The spec says rsq is undefined if x <= 0
 +  * We can use this to set the range to whatever we want
 +  */
 + if (expr->operation == ir_unop_rsq &&
 + IS_CONSTANT(r0.high, <=, 0.0f))
 +high = new(mem_ctx) ir_constant(0.0f);
 +
 + /* TODO: If we know, i.e, the lower range of the operand
 +  * we can calculate the lower range
 +  */
 + low = new(mem_ctx) ir_constant(0.0f);
 + return minmax_range(low, high);
 +
case ir_binop_min:
case ir_binop_max:
   r0 = get_range(expr->operands[0]);
 --
 2.2.1



Re: [Mesa-dev] [PATCH 01/22] glsl: Reorder optimization-passes

2015-01-17 Thread Thomas Helland
2015-01-03 22:48 GMT+01:00 Matt Turner matts...@gmail.com:
 On Sat, Jan 3, 2015 at 11:18 AM, Thomas Helland
 thomashellan...@gmail.com wrote:
 This allows opt_algebraic to resolve open-coded
 saturates into ir_unop_saturate before we potentially
 mess it up by removing the min or max in min/max-pruning.

 Since we are now emitting more free saturates on i965
 this gives us some decrease in instruction count.

 total instructions in shared programs: 1317459 -> 1317065 (-0.03%)
 instructions in affected programs: 4084 -> 3690 (-9.65%)
 GAINED:0
 LOST:  0

 You're definitely onto something here. On our collection of shaders:

 total instructions in shared programs: 5876617 -> 5875919 (-0.01%)
 instructions in affected programs: 9443 -> 8745 (-7.39%)

 with some fragment shaders hurt in Natural Selection 2 and Kerbal Space 
 program.

 I'll investigate these.

Hi Matt,

Don't want to be a nuisance (if that is even the right word?
English is not my native tongue), but did you find the
time to look at these regressions?

If I had some information about what regressions you are
seeing I could try to work them out.
Then this patch would be merge-material I guess.

The rest of the series I'm not that happy about.
Seems to me the return on investment is not adequate.
But I'll leave that up to other people to decide.


Re: [Mesa-dev] [PATCH 2/5] nir: use Python to autogenerate opcode information

2015-01-17 Thread ahmad
Hi.

#!/usr/bin/env python resolves to the Python 3.x series on some major distros
(Arch, Fedora, ...) and to Python 2.x on others.

Python 2.x and Python 3.x are not source-compatible with each other.

Python 3.x no longer has the xrange function.

The range vs. xrange distinction is only meaningful in Python 2.x.

http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/

Distros that still use the 2.x series as the default Python interpreter are
moving to 3.x.

Regards.