On Tue, Jun 25, 2013 at 10:27 AM, Paul Berry <stereotype...@gmail.com> wrote: > On 19 June 2013 19:45, Anuj Phogat <anuj.pho...@gmail.com> wrote: >> >> Current implementation of ext_framebuffer_multisample_blit_scaled in >> i965/blorp uses nearest filtering for multisample scaled blits. Using >> nearest filtering produces blocky artifacts and negates the benefits >> of MSAA. That is the reason why extension was not enabled on i965. >> >> This patch implements the bilinear filtering of samples in blorp engine. >> Images generated with this patch are free from blocky artifacts and show >> big improvement in visual quality. >> >> Observed no piglit and gles3 regressions. >> >> V3: >> - Algorithm used for filtering assumes a rectangular grid of samples >> roughly corresponding to sample locations. >> - Test the boundary conditions on the edges of texture. >> >> Signed-off-by: Anuj Phogat <anuj.pho...@gmail.com> > > > Thanks for all your effort on this, Anuj. I think we have an algorithm > that's going to get us good quality. I have a number of minor comments > below, but I think the basic approach is good. > Thanks for the detailed review Paul. All of your suggestions are very useful to improve the patch. >> >> --- >> src/mesa/drivers/dri/i965/brw_blorp.h | 11 ++ >> src/mesa/drivers/dri/i965/brw_blorp_blit.cpp | 257 >> +++++++++++++++++++++++++-- >> 2 files changed, 258 insertions(+), 10 deletions(-) >> >> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.h >> b/src/mesa/drivers/dri/i965/brw_blorp.h >> index ffc27cc..0a15b89 100644 >> --- a/src/mesa/drivers/dri/i965/brw_blorp.h >> +++ b/src/mesa/drivers/dri/i965/brw_blorp.h >> @@ -319,6 +319,17 @@ struct brw_blorp_blit_prog_key >> * than one sample per pixel. >> */ >> bool persample_msaa_dispatch; >> + >> + /* True for scaled blitting. */ >> + bool blit_scaled; >> + >> + /* Source rectangle dimensions. Used to test boundary conditions in >> shader >> + * program. >> + */ >> + float src_x0; >> + float src_y0; >> + float src_x1; >> + float src_y1; > > > Two comments about this: > > (1) Rather than clamp to the boundary of the source rectangle, I have a > minor preference for clamping to the boundary of the source image. This is > what I've observed in nVidia's blitter, and it's what's implemented in > Mesa's meta path for scaled blits. Also, it would have the benefit of > simplifying some of the logic below, since the lower clamp will always be 0. > > (2) We can't put these numbers in the program key, since they're going to > change frequently, and changing the program key causes the program to be > recompiled. They need to go in push constants, like we do for the > destination rectangle coordinates and the multipliers and offsets. > I agree.
>> >> }; >> >> class brw_blorp_blit_params : public brw_blorp_params >> diff --git a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp >> b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp >> index 8694128..e99c9df 100644 >> --- a/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp >> +++ b/src/mesa/drivers/dri/i965/brw_blorp_blit.cpp >> @@ -622,7 +622,8 @@ private: >> void kill_if_outside_dst_rect(); >> void translate_dst_to_src(); >> void single_to_blend(); >> - void manual_blend(unsigned num_samples); >> + void manual_blend_average(unsigned num_samples); >> + void manual_blend_linear(unsigned num_samples); > > > Minor nit pick: can we call this function manual_blend_bilinear()? "linear" > is such a generic term that it's hard to guess what it does, but "bilinear" > makes it obvious. > >> >> void sample(struct brw_reg dst); >> void texel_fetch(struct brw_reg dst); >> void mcs_fetch(); >> @@ -676,6 +677,16 @@ private: >> */ >> struct brw_reg y_coords[2]; >> >> + /* X, Y coordinates of the pixel from which we need to fetch the >> specific >> + * sample. These are used for multisample scaled blitting. >> + */ >> + struct brw_reg x_sample_coords; >> + struct brw_reg y_sample_coords; >> + >> + /* Store the values to use to interpolate in x and y directions */ >> + struct brw_reg x_lerp; >> + struct brw_reg y_lerp; >> + > > > I had trouble guessing what the "lerp" variables are used for. How about > something like this instead? > > /* Fractional parts of the x and y coordinates, used as bilinear > interpolation coefficients */ > struct brw_reg x_frac; > struct brw_reg y_frac; > I'll change the variable names. >> >> /* Which element of x_coords and y_coords is currently in use. >> */ >> int xy_coord_index; >> @@ -814,15 +825,17 @@ brw_blorp_blit_program::compile(struct brw_context >> *brw, >> * that we want to texture from. Exception: if we are blending, then >> S is >> * irrelevant, because we are going to fetch all samples. >> */ >> - if (key->blend) { >> + if (key->blend && !key->blit_scaled) { >> if (brw->intel.gen == 6) { >> /* Gen6 hardware an automatically blend using the SAMPLE message >> */ >> single_to_blend(); >> sample(texture_data[0]); >> } else { >> /* Gen7+ hardware doesn't automaticaly blend. */ >> - manual_blend(key->src_samples); >> + manual_blend_average(key->src_samples); >> } >> + } else if(key->blend && key->blit_scaled) { >> + manual_blend_linear(key->src_samples); >> } else { >> /* We aren't blending, which means we just want to fetch a single >> sample >> * from the source surface. The address that we want to fetch from >> is >> @@ -913,6 +926,18 @@ brw_blorp_blit_program::alloc_regs() >> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD); >> reg += 2; >> } >> + >> + if (key->blit_scaled && key->blend) { >> + this->x_sample_coords = brw_vec8_grf(reg, 0); >> + reg += 2; >> + this->y_sample_coords = brw_vec8_grf(reg, 0); >> + reg += 2; >> + this->x_lerp = brw_vec8_grf(reg, 0); >> + reg += 2; >> + this->y_lerp = brw_vec8_grf(reg, 0); >> + reg += 2; >> + } >> + >> this->xy_coord_index = 0; >> this->sample_index >> = retype(brw_vec8_grf(reg, 0), BRW_REGISTER_TYPE_UD); >> @@ -1368,11 +1393,82 @@ brw_blorp_blit_program::translate_dst_to_src() >> brw_MUL(&func, Y_f, Yp_f, y_transform.multiplier); >> brw_ADD(&func, X_f, X_f, x_transform.offset); >> brw_ADD(&func, Y_f, Y_f, y_transform.offset); >> - /* Round the float coordinates down to nearest integer by moving to >> - * UD registers. >> - */ >> - brw_MOV(&func, Xp, X_f); >> - brw_MOV(&func, Yp, Y_f); >> + if (key->blit_scaled && key->blend) { >> + float x_scale = 2.0; >> + float y_scale = key->src_samples / 2.0; > > > It would be nice to have a comment above x_scale and y_scale explaining that > they are the scale factor between the pixel grid and the grid of samples > that we're texturing from. > > Also, I'd recommend moving them to someplace more global (either the class > definition or the program key) so that they can be accessed from > manual_blend_linear(). > yes. I'll do. >> >> + /* Translate coordinates to lay out the samples in a rectangular >> grid >> + * roughly corresponding to sample locations. >> >> + */ >> + brw_ADD(&func, X_f, X_f, brw_imm_f(-0.25)); >> + brw_ADD(&func, Y_f, Y_f, brw_imm_f(-1.0 / key->src_samples)); >> + brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale)); >> + brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale)); > > > The additive constants -0.25 and -1.0/key->src_samples are a bit difficult > to understand. I believe this is equivalent, and would be easier to follow: > > /* Translate coordinates to lay out the samples in a rectangular grid > * roughly corresponding to sample locations. > */ > brw_MUL(&func, X_f, X_f, brw_imm_f(x_scale)); > brw_MUL(&func, Y_f, Y_f, brw_imm_f(y_scale)); > > /* Adjust coordinates so that integers represent pixel centers rather > * than pixel edges. > */ > brw_ADD(&func, X_f, X_f, brw_imm_f(-0.5)); > brw_ADD(&func, Y_f, Y_f, brw_imm_f(-0.5)); > Yes, this looks better. >> >> + >> + /* Test boundary conditions to properly handle the sampling of >> texels on >> + * texture edges. >> + */ >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, >> + X_f, brw_imm_f(2.0 * key->src_x0)); >> + brw_IF(&func, BRW_EXECUTE_16); >> + { >> + /* Left Edge in X */ >> + brw_MOV(&func, x_lerp, brw_imm_f(1.0)); >> + } >> + brw_ELSE(&func); >> + { >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE, >> + X_f, brw_imm_f(2.0 * (key->src_x1 - 0.50))); >> + brw_IF(&func, BRW_EXECUTE_16); >> + { >> + /* Right Edge in X */ >> + brw_MOV(&func, x_lerp, brw_imm_f(0.0)); >> + } >> + brw_ELSE(&func); >> + { >> + /* Store the fractional part to be used as color interpolator >> */ >> + brw_FRC(&func, x_lerp, X_f); >> + } >> + brw_ENDIF(&func); >> + } >> + brw_ENDIF(&func); >> >> + >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, >> + Y_f, brw_imm_f((y_scale) * key->src_y0)); >> + brw_IF(&func, BRW_EXECUTE_16); >> + { >> + /* Bottom Edge in Y */ >> + brw_MOV(&func, y_lerp, brw_imm_f(1.0)); >> + } >> + brw_ELSE(&func); >> + { >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_GE, Y_f, >> + brw_imm_f((y_scale) * (key->src_y1 - 1 / y_scale))); >> + brw_IF(&func, BRW_EXECUTE_16); >> + { >> + /* Top Edge in Y */ >> + brw_MOV(&func, y_lerp, brw_imm_f(0.0)); >> + } >> + brw_ELSE(&func); >> + { >> + /* Store the fractional part to be used as color interpolator >> */ >> + brw_FRC(&func, y_lerp, Y_f); >> + } >> + brw_ENDIF(&func); >> + } >> + brw_ENDIF(&func); > > > A few thoughts on the above conditional logic: > > (1) I think the math would be easier to follow if the following expressions > were changed to equivalent forms: > > "2.0 * key->src_x0" => "x_scale * key->src_x0" > "2.0 * (key->src_x1 - 0.50)" => "x_scale * key->src_x1 - 1.0" > "(y_scale) * (key->src_y1 - 1 / y_scale)" => "y_scale * key->src_y1 - 1.0" > > This will also help in the future when we want to support hardware with 16x > MSAA, because in that case we'll want x_scale = y_scale = 4. > > (2) Effectively what this code is doing is clamping the x and y coordinates > to the boundary of the source rectangle. But since it's accomplishing the > clamping by adjusting x_lerp to 0.0 or 1.0, can only clamp things that > aren't too far outside the source rectangle to begin with (i.e. no more than > 1.0 outside the boundary). I *think* that will work, but it seems like an > unnecessary risk (and an unnecessary source of confusion). Why not just > clamp X_f to the range [x_scale * key->src_x0, x_scale * key->src_x1 - 1.0] > (and similar for Y_f), and then afterward compute x_lerp = FRC(X_f)? That > way there will be no doubt that the coordinate is properly in range. > > (3) You can probably save a lot of instructions if you use conditional moves > to do the clamping, e.g. > > brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, X_f, > brw_imm_f(x_scale * key->src_x0)); > brw_MOV(&func, X_f, brw_imm_f(x_scale * key->src_x0)); > brw_set_predicate_control(&func, BRW_PREDICATE_NONE); > ...etc. > yeah, this will save a lot of instructions. >> >> + >> + /* Round the float coordinates down to nearest integer */ >> + brw_RNDD(&func, Xp_f, X_f); >> + brw_RNDD(&func, Yp_f, Y_f); >> + brw_MUL(&func, X_f, Xp_f, brw_imm_f(0.5)); >> + brw_MUL(&func, Y_f, Yp_f, brw_imm_f(1 / y_scale)); >> + } else { >> + /* Round the float coordinates down to nearest integer by moving to >> + * UD registers. >> + */ >> + brw_MOV(&func, Xp, X_f); >> + brw_MOV(&func, Yp, Y_f); >> + } >> SWAP_XY_AND_XPYP(); >> brw_set_compression_control(&func, BRW_COMPRESSION_NONE); >> } >> @@ -1418,7 +1514,7 @@ inline int count_trailing_one_bits(unsigned value) >> >> >> void >> -brw_blorp_blit_program::manual_blend(unsigned num_samples) >> +brw_blorp_blit_program::manual_blend_average(unsigned num_samples) >> { >> if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS) >> mcs_fetch(); >> @@ -1523,6 +1619,135 @@ brw_blorp_blit_program::manual_blend(unsigned >> num_samples) >> brw_ENDIF(&func); >> } >> >> +void >> +brw_blorp_blit_program::manual_blend_linear(unsigned num_samples) >> +{ >> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS) >> + mcs_fetch(); > > > This won't work. The MCS value we fetch has to match up with the pixel that > we're sampling from. Since this function samples from different pixels in > each iteration of the "for" loop below, the call to mcs_fetch() needs to go > inside the loop, and it needs to happen after storing the coordinates in X > and Y. > >> >> + >> + /* We do this computation by performing the following operations: >> + * >> + * In case of 4x, 8x MSAA: >> + * - Compute the pixel coordinates and sample numbers (a, b, c, d) >> + * which are later used for interpolation >> + * - linearly interpolate samples a and b in X >> + * - linearly interpolate samples c and d in X >> + * - linearly interpolate the results of last two operations in Y >> + * >> + * result = lrp(lrp(a + b) + lrp(c + d)) >> + */ >> + struct brw_reg Xp_f = retype(Xp, BRW_REGISTER_TYPE_F); >> + struct brw_reg Yp_f = retype(Yp, BRW_REGISTER_TYPE_F); >> + struct brw_reg t1_f = retype(t1, BRW_REGISTER_TYPE_F); >> + struct brw_reg t2_f = retype(t2, BRW_REGISTER_TYPE_F); >> + >> + for (unsigned i = 0; i < 4; ++i) { >> + assert(i < ARRAY_SIZE(texture_data)); >> + s_is_zero = false; >> + >> + /* Compute pixel coordinates */ >> + brw_ADD(&func, vec16(x_sample_coords), Xp_f, >> + brw_imm_f((float)(i & 0x1) * 0.5)); >> + brw_ADD(&func, vec16(y_sample_coords), Yp_f, >> + brw_imm_f((float)(i & 0x2) * (1.0 / num_samples))); >> + brw_MOV(&func, vec16(X), x_sample_coords); >> + brw_MOV(&func, vec16(Y), y_sample_coords); >> + >> + /* Compute sample index. In case of 4x MSAA, sample index is >> + * equal to sample number. >> + */ > > > This comment should explain what you mean by "sample index" vs "sample > number". > >> >> + brw_FRC(&func, vec16(t1_f), x_sample_coords); >> + brw_FRC(&func, vec16(t2_f), y_sample_coords); >> + brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(2.0)); >> + brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(num_samples)); > > > How about this instead? It's a little more future proof and it clarifies > that what we're doing is computing our position within the grid of samples > corresponding to a single pixel: > > brw_MUL(&func, vec16(t1_f), t1_f, brw_imm_f(x_scale)); > brw_MUL(&func, vec16(t2_f), t2_f, brw_imm_f(x_scale * y_scale)); > >> >> + brw_ADD(&func, vec16(t1_f), t1_f, t2_f); >> + brw_MOV(&func, vec16(S), t1_f); >> + >> + if (num_samples == 8) { >> + /* Map the sample index to a sample number */ >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_L, >> + S, brw_imm_d(4)); >> + brw_IF(&func, BRW_EXECUTE_16); >> + { >> + brw_MOV(&func, vec16(t2), brw_imm_d(5)); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(1)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(2)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(2)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(4)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(3)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(6)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + } >> + brw_ELSE(&func); >> + { >> + brw_MOV(&func, vec16(t2), brw_imm_d(0)); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(5)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(3)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(6)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(7)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_EQ, >> + S, brw_imm_d(7)); >> + brw_MOV(&func, vec16(t2), brw_imm_d(1)); >> + brw_set_predicate_control(&func, BRW_PREDICATE_NONE); >> + } >> + brw_ENDIF(&func); >> + brw_MOV(&func, vec16(S), t2); > > > This seems like a lot of work to accomplish what is effectively a lookup > table. If this winds up becoming a performance bottleneck, you might want > to consider passing the table in via a push constant, and using indirect > addressing to convert sample index to sample number. > Yes, that would be helpful. Is it acceptable to do this optimization as a follow up patch? >> >> + } >> + texel_fetch(texture_data[i]); >> + >> + if (i == 0 && key->tex_layout == INTEL_MSAA_LAYOUT_CMS) { >> + /* The Ivy Bridge PRM, Vol4 Part1 p27 (Multisample Control >> Surface) >> + * suggests an optimization: >> + * >> + * "A simple optimization with probable large return in >> + * performance is to compare the MCS value to zero >> (indicating >> + * all samples are on sample slice 0), and sample only from >> + * sample slice 0 using ld2dss if MCS is zero." >> + * >> + * Note that in the case where the MCS value is zero, sampling >> from >> + * sample slice 0 using ld2dss and sampling from sample 0 using >> + * ld2dms are equivalent (since all samples are on sample slice >> 0). >> + * Since we have already sampled from sample 0, all we need to >> do is >> + * skip the remaining fetches and averaging if MCS is zero. >> + */ >> + brw_CMP(&func, vec16(brw_null_reg()), BRW_CONDITIONAL_NZ, >> + mcs_data, brw_imm_ud(0)); >> + brw_IF(&func, BRW_EXECUTE_16); >> + } > > > We can't do this optimization for scaled multisample blits, because it > relies on the assumption that all the samples we're blending together belong > to the same pixel. I'd recommend just pulling this code out. > Makes sense. I'll remove this code. >> >> + } >> + >> +#define SAMPLE(x, y) offset(texture_data[x], y) >> + brw_set_access_mode(&func, BRW_ALIGN_16); >> + for (int index = 3; index > 0; ) { >> + for (int k = 0; k < 8; ++k) >> + brw_LRP(&func, >> + vec8(SAMPLE(index - 1, k)), >> + offset(x_lerp, k & 1), >> + SAMPLE(index, k), >> + SAMPLE(index - 1, k)); >> + index -= 2; >> + } >> + for (int k = 0; k < 8; ++k) >> + brw_LRP(&func, >> + vec8(SAMPLE(0, k)), >> + offset(y_lerp, k & 1), >> + vec8(SAMPLE(2, k)), >> + vec8(SAMPLE(0, k))); >> + brw_set_access_mode(&func, BRW_ALIGN_1); > > > I'm confused why we need loops from 0 to 7 here. It looks like you're > trying to interpolate each component of the SIMD8 register separately. That > shouldn't be necessary. > Output of each SIMD16 sampler instruction (send) is written to 8 grf registers. So, I'm looping from 0 to 7 to interpolate all these 8 registers. If we lower the loop counter, we end up with color data 'not-interpolated' in few color channels. Does this explain your concern? >> >> +#undef SAMPLE >> + if (key->tex_layout == INTEL_MSAA_LAYOUT_CMS) >> + brw_ENDIF(&func); >> +} >> + >> /** >> * Emit code to look up a value in the texture using the SAMPLE message >> (which >> * does blending of MSAA surfaces). >> @@ -1535,7 +1760,8 @@ brw_blorp_blit_program::sample(struct brw_reg dst) >> SAMPLER_MESSAGE_ARG_V_FLOAT >> }; >> >> - texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args, >> ARRAY_SIZE(args)); >> + texture_lookup(dst, GEN5_SAMPLER_MESSAGE_SAMPLE, args, >> + ARRAY_SIZE(args)); >> } >> >> /** >> @@ -1872,6 +2098,17 @@ brw_blorp_blit_params::brw_blorp_blit_params(struct >> brw_context *brw, >> wm_prog_key.persample_msaa_dispatch = true; >> } >> >> + /* Scaled blitting or not. */ >> + wm_prog_key.blit_scaled = >> + ((dst_x1 - dst_x0) == (src_x1 - src_x0) && >> + (dst_y1 - dst_y0) == (src_y1 - src_y0)) ? false : true; >> + >> + /* Source rectangle dimensions */ >> + wm_prog_key.src_x0 = src_x0; >> + wm_prog_key.src_y0 = src_y0; >> + wm_prog_key.src_x1 = src_x1; >> + wm_prog_key.src_y1 = src_y1; >> + >> /* The render path must be configured to use the same number of >> samples as >> * the destination buffer. >> */ >> -- >> 1.8.1.4 >> > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev