On Mon, Jun 12, 2017 at 9:26 PM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
> On Mon, Jun 12, 2017 at 7:38 PM, Connor Abbott <cwabbo...@gmail.com> wrote:
>>
>> On Mon, Jun 12, 2017 at 7:19 PM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
>> > On Mon, Jun 12, 2017 at 11:58 AM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
>> >>
>> >> On 12.06.2017 20:50, Connor Abbott wrote:
>> >>>
>> >>> On Mon, Jun 12, 2017 at 2:17 AM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
>> >>>>
>> >>>> On 10.06.2017 01:44, Connor Abbott wrote:
>> >>>>>
>> >>>>> From: Connor Abbott <cwabbo...@gmail.com>
>> >>>>>
>> >>>>> These are properties of the instruction that must be respected
>> >>>>> when moving it around, in addition to the usual SSA dominance
>> >>>>> guarantee. Previously, we only had special handling for fddx and
>> >>>>> fddy, in a very ad-hoc way. But with arb_shader_ballot and
>> >>>>> arb_shader_group_vote, we'll have to start handling a lot more
>> >>>>> instructions with similar constraints, so we want to add a more
>> >>>>> formal model of what the optimizer can and cannot do.
>> >>>>>
>> >>>>> v2: don't add attribute for ALU instructions
>> >>>>> v3: special-case derivative ALU instructions
>> >>>>>
>> >>>>> Signed-off-by: Connor Abbott <cwabbo...@gmail.com>
>> >>>>> ---
>> >>>>>  src/compiler/nir/nir.h | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>>>  1 file changed, 80 insertions(+)
>> >>>>>
>> >>>>> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
>> >>>>> index 3b827bf..64caccb 100644
>> >>>>> --- a/src/compiler/nir/nir.h
>> >>>>> +++ b/src/compiler/nir/nir.h
>> >>>>> @@ -985,6 +985,25 @@ typedef enum {
>> >>>>>      * intrinsic are due to the register reads/writes.
>> >>>>>      */
>> >>>>>     NIR_INTRINSIC_CAN_REORDER = (1 << 1),
>> >>>>> +
>> >>>>> +   /**
>> >>>>> +    * Indicates whether this intrinsic is "cross-thread". An
>> >>>>> +    * operation is cross-thread if results in one thread depend on
>> >>>>> +    * inputs in another thread, and therefore optimizations cannot
>> >>>>> +    * change the execution mask when the operation is called.
>> >>>>> +    * Examples of cross-thread operations include screen-space
>> >>>>> +    * derivatives, the "any" reduction which returns "true" in all
>> >>>>> +    * threads if any thread inputs "true", etc.
>> >>>>> +    */
>> >>>>> +   NIR_INTRINSIC_CROSS_THREAD = (1 << 2),
>> >>>>> +
>> >>>>> +   /**
>> >>>>> +    * Indicates that this intrinsic is "convergent". An operation
>> >>>>> +    * is convergent when it must always be called in convergent
>> >>>>> +    * control flow, that is, control flow with the same execution
>> >>>>> +    * mask as when the program started. If an operation is
>> >>>>> +    * convergent, it must be cross-thread as well, since the
>> >>>>> +    * optimizer must maintain the guarantee.
>> >>>>> +    */
>> >>>>> +   NIR_INTRINSIC_CONVERGENT = (1 << 3),
>> >>>>
>> >>>>
>> >>>> This is inconsistent with LLVM's definition of 'convergent', and I'd
>> >>>> like you to change it to match up with LLVM.
>> >>>>
>> >>>> LLVM's definition of convergent is: "The operation must not be made
>> >>>> control-dependent on additional values."
>> >>>>
>> >>>> In the language of execution masks, this means that optimizations
>> >>>> must guarantee that the execution mask for the instruction can only
>> >>>> become a superset of what it was originally.
>> >>>> This means lifting is actually okay.
>> >>>>
>> >>>> This is relevant because e.g. texture instructions with implicit
>> >>>> derivatives are actually convergent operations (in the LLVM sense),
>> >>>> but obviously they can be called with exec masks that are subsets of
>> >>>> the exec mask at program start.
>> >>>
>> >>>
>> >>> Actually, according to GLSL (and I think SPIR-V, although I'm not
>> >>> 100% sure), they can't be called that way -- results are undefined if
>> >>> derivatives (or textures that take implicit derivatives) aren't
>> >>> called in uniform control flow, full stop. That's why I changed the
>> >>> definition compared to LLVM - this definition of convergent allows
>> >>> all the optimizations that the LLVM definition does, but it opens up
>> >>> additional optimization opportunities since we can assume that
>> >>> control flow is always uniform when doing divergence analysis. Also,
>> >>> as-is, the definition matches the GLSL/SPIR-V semantics closely, and
>> >>> since the purpose of the convergent attribute is to model derivatives
>> >>> in GLSL and SPIR-V, I'd like to keep that. If GLSL or SPIR-V change
>> >>> their semantics to allow what you describe, then we can add something
>> >>> closer to the LLVM convergent semantics. If you want me to change the
>> >>> name to avoid confusion with LLVM, that's fair though -- suggestions
>> >>> welcome on what to call it ;)
>> >>
>> >>
>> >> Okay, I'm convinced that it makes sense to have these semantics, but a
>> >> different name would be good.
>> >
>> >
>> > I'm not quite so convinced. :-) The LLVM definition seems, at first
>> > brush, more powerful than the proposed definition and I think it's
>> > actually what you want for most optimizations.
>>
>> The LLVM definition is actually less powerful than my definition - see
>> my response to Nicolai in patch 3.
>
>
> I'm still not sure I'm convinced. I'll give it more thought.
>
>> > The only advantage I can see to the strict uniform definition is that
>> > it would let us imply information about control-flow uniformity from
>> > instructions. However, while probably technically correct, that sounds
>> > like a dangerous path to go down. What specific optimizations were you
>> > thinking this stricter definition would enable?
>>
>> It would let us assume that control flow is uniform in more places,
>> which means that we can infer that more values are uniform. This would
>> be useful, for example, for barriers in compute shaders, since if the
>> author is doing some complicated stuff with loads and stores but is
>> using a barrier, then we can infer that control is still uniform, and
>> hopefully use less storage space for registers if we can infer that
>> they're uniform too, enable SPF on Intel, use fewer instructions for
>> control flow on AMD, etc. The original implementation of the Divergence
>> Analysis paper by Coutinho et al. actually does this [1].
>
>
> Yeah... That's exactly the kind of inference I'm afraid of. Barrier may
> be safe, but my gut tells me that people get derivatives (whether
> explicit or implicit in texturing) wrong all the time. I'm afraid of
> doing crazy optimizations based on a texture instruction being in our
> block. Maybe it's ok, but it makes me very nervous...
>
> I'll give it more thought.
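
To make the inference discussed above concrete, a divergence analysis
could do something like the rough sketch below. block_must_be_uniform()
is a made-up name for illustration; only nir_instr_is_convergent() comes
from this patch.

/* If a block contains a convergent instruction, the program is
 * undefined unless that block executes in uniform control flow, so the
 * analysis may assume uniformity. Note this is only sound with the
 * stricter (uniform-control) definition, not with LLVM's, where the
 * exec mask may legally be a subset of the one at program start.
 */
static bool
block_must_be_uniform(nir_block *block)
{
   nir_foreach_instr(instr, block) {
      if (nir_instr_is_convergent(instr))
         return true;
   }
   return false;
}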
So, I realized that we might still need something like the LLVM
definition of convergent, at least in two places:

1. read_invocation is just convergent, not cross-thread, since if you
call it with a superset of active invocations, you'll still get the
same result.

2. AMD_shader_ballot adds a whole-workgroup reduction operation, which
is guaranteed to work in non-uniform control flow as long as the
subgroups are uniform within a workgroup. Basically this turns into a
subgroup reduction, followed by some atomic operations on shared memory
and a barrier. But the barrier has to execute in non-uniform control
flow, so it really is LLVM-convergent. So if we want to do the lowering
in NIR, we'll need to have a barrier that can occur in non-uniform
control flow.
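
For concreteness, the lowering in point 2 would look roughly like the
sketch below. This is only an illustration: subgroup_reduce_add(),
subgroup_elect(), shared_atomic_add(), workgroup_barrier() and
shared_load() are hypothetical stand-ins for intrinsics/builder helpers
we would have to add, not existing NIR API; the sketch also assumes the
nir_builder if-helpers.

static nir_ssa_def *
lower_workgroup_reduce_add(nir_builder *b, nir_ssa_def *value,
                           nir_variable *shared_accum)
{
   /* Reduce within each subgroup first. */
   nir_ssa_def *partial = subgroup_reduce_add(b, value);

   /* One invocation per subgroup combines the partial results in
    * shared memory.
    */
   nir_push_if(b, subgroup_elect(b));
   shared_atomic_add(b, shared_accum, partial);
   nir_pop_if(b, NULL);

   /* Wait for every subgroup's contribution. This barrier runs in
    * whatever (possibly non-uniform) control flow the reduction was
    * called in -- exactly the LLVM-convergent case.
    */
   workgroup_barrier(b);

   /* Every invocation reads back the combined result. */
   return shared_load(b, shared_accum);
}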
Should we keep the original thing, rename it to uniform_control or
whatever, and add convergent, or just get rid of it? Thoughts?

>
> --Jason
>
>>
>> [1]
>> https://github.com/gtcasl/gpuocelot/blob/master/ocelot/ocelot/analysis/implementation/DivergenceAnalysis.cpp#L514
>>
>> >>
>> >> How about NIR_INTRINSIC_UNIFORM_CONTROL?
>> >
>> >
>> > That works.
>> >
>> > --Jason
>> >
>> >> Cheers,
>> >> Nicolai
>> >>
>> >>>
>> >>>>
>> >>>> LLVM currently has no equivalent to cross_thread, and we hack
>> >>>> around it as I'm sure you're well aware. The nightmare is trying to
>> >>>> find a sound definition of "cross_thread" that works in LLVM's
>> >>>> execution model.
>> >>>
>> >>>
>> >>> Yeah... this stuff is really tricky to reason about. I think that
>> >>> eventually, we're going to have to add the notions of control flow
>> >>> divergence and re-convergence to LLVM's execution model, even though
>> >>> there's been pushback from some LLVM developers about it. I just
>> >>> don't see any way we'll be able to do stuff like LICM, aggressive
>> >>> CSE, etc. effectively in the presence of these cross-thread
>> >>> operations, when whether you can do those things at all depends on
>> >>> whether branch conditions are uniform.
>> >>>
>> >>>>
>> >>>> Cheers,
>> >>>> Nicolai
>> >>>>
>> >>>>
>> >>>>> } nir_intrinsic_semantic_flag;
>> >>>>>
>> >>>>>  /**
>> >>>>> @@ -1459,6 +1478,67 @@ NIR_DEFINE_CAST(nir_instr_as_parallel_copy, nir_instr,
>> >>>>>                  type, nir_instr_type_parallel_copy)
>> >>>>>
>> >>>>>  /*
>> >>>>> + * Helpers to determine if an instruction is cross-thread or
>> >>>>> + * convergent. See NIR_INTRINSIC_{CONVERGENT|CROSS_THREAD} for the
>> >>>>> + * definitions.
>> >>>>> + */
>> >>>>> +static inline bool
>> >>>>> +nir_instr_is_convergent(const nir_instr *instr)
>> >>>>> +{
>> >>>>> +   switch (instr->type) {
>> >>>>> +   case nir_instr_type_alu:
>> >>>>> +      switch (nir_instr_as_alu(instr)->op) {
>> >>>>> +      case nir_op_fddx:
>> >>>>> +      case nir_op_fddy:
>> >>>>> +      case nir_op_fddx_fine:
>> >>>>> +      case nir_op_fddy_fine:
>> >>>>> +      case nir_op_fddx_coarse:
>> >>>>> +      case nir_op_fddy_coarse:
>> >>>>> +         /* Partial derivatives are convergent */
>> >>>>> +         return true;
>> >>>>> +
>> >>>>> +      default:
>> >>>>> +         return false;
>> >>>>> +      }
>> >>>>> +
>> >>>>> +   case nir_instr_type_intrinsic: {
>> >>>>> +      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
>> >>>>> +      return nir_intrinsic_infos[intrin->intrinsic].flags &
>> >>>>> +             NIR_INTRINSIC_CONVERGENT;
>> >>>>> +   }
>> >>>>> +
>> >>>>> +   case nir_instr_type_tex:
>> >>>>> +      switch (nir_instr_as_tex(instr)->op) {
>> >>>>> +      case nir_texop_tex:
>> >>>>> +      case nir_texop_txb:
>> >>>>> +      case nir_texop_lod:
>> >>>>> +         /* These three take implicit derivatives, so they are convergent */
>> >>>>> +         return true;
>> >>>>> +
>> >>>>> +      default:
>> >>>>> +         return false;
>> >>>>> +      }
>> >>>>> +
>> >>>>> +   default:
>> >>>>> +      return false;
>> >>>>> +   }
>> >>>>> +}
>> >>>>> +
>> >>>>> +static inline bool
>> >>>>> +nir_instr_is_cross_thread(const nir_instr *instr)
>> >>>>> +{
>> >>>>> +   switch (instr->type) {
>> >>>>> +   case nir_instr_type_intrinsic: {
>> >>>>> +      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
>> >>>>> +      return nir_intrinsic_infos[intrin->intrinsic].flags &
>> >>>>> +             NIR_INTRINSIC_CROSS_THREAD;
>> >>>>> +   }
>> >>>>> +
>> >>>>> +   default:
>> >>>>> +      return nir_instr_is_convergent(instr);
>> >>>>> +   }
>> >>>>> +}
>> >>>>> +
>> >>>>> +/*
>> >>>>>  * Control flow
>> >>>>>  *
>> >>>>>  * Control flow consists of a tree of control flow nodes, which include
>> >>>>>

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev