Francisco Jerez <curroje...@riseup.net> writes: > Jason Ekstrand <ja...@jlekstrand.net> writes: > >> On May 15, 2015 2:40 PM, "Francisco Jerez" <curroje...@riseup.net> wrote: >>> >>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>> >>> > On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <curroje...@riseup.net> >> wrote: >>> >> Jason Ekstrand <ja...@jlekstrand.net> writes: >>> >> >>> >>> I haven't said much about this series up until now. I've mostly sat >>> >>> and watched and focused my time on other things. As I said in the >>> >>> meeting today, I think that part of the problem here is that there are >>> >>> at least 3 "refactors" in here besides the new feature. By >>> >>> "refactors", I mean "new ways of solving problem X". Part of the >>> >>> problem with the discussion is that all of this has been conflated and >>> >>> we aren't actually talking about how to solve each of those problems. >>> >>> I'm going to try and break it out here to try and aid in the >>> >>> discussion. >>> >>> >>> >> Thanks a lot Jason for giving a more constructive turn to this >>> >> discussion. :) >>> >> >>> >>> 1) The builder. I think *everyone* likes the builder. The argument >>> >>> here is not over whether or not we want it. People have expressed >>> >>> concern that adding the builder now without actually doing the >>> >>> refactor will result in duplicated code that will get out-of-sync. I >>> >>> think the best solution here is to go ahead and do the builder stuff >>> >>> immediately after the branch point. >>> >>> >>> >> Agreed. >>> >> >>> >>> 2) SIMD8 instruction splitting. As I said in the call, we've done >>> >>> this a variety of ways in the past usually by emitting the split >>> >>> instructions in the visitor and manually using half() and >>> >>> inst->force_sechalf. We also have a few places that I think still >>> >>> split the code in the generator. What you're doing is a somewhat more >>> >>> elegant version of the "emit it in the visitor" stuff that we've done >>> >>> in the past. You've got your zip/unzip functions etc. to handle >>> >>> splitting and combining and then you have a loop to actually emit the >>> >>> sends. Really, it's fairly nice and concise. The question is whether >>> >>> or not it's really what we want to do (more on this in a bit). >>> >>> >>> >> There is another closely related problem that this infrastructure >> solves >>> >> and may not be obvious in this series (it was in my original branch >>> >> though [1]), but I think it's worth keeping in mind, I'll call it 2.1: >>> >> Some of the surface messages lacked SIMD4x2 variants, so a strided copy >>> >> is required to transpose the original AoS vector into SoA form. This >>> >> turns out to be algorithmically identical to the strided copy required >>> >> to "unzip" a SIMD16 vector, so it can be accomplished without >> additional >>> >> effort. >>> >> >>> >> These transformations are largely transparent for the functions >> building >>> >> typed/untyped surface messages, they are hidden behind an interface >> that >>> >> doesn't seem to have received any attention yet: >>> >> >>> >> - struct vector_layout: This represents the layout of a vector in a >>> >> payload. It's initialized based on the restrictions of the hardware >>> >> for the specific message we want to send, and the current code >>> >> generation parameters (i.e. SIMD mode). >>> >> >>> >> - emit_insert(): This takes a vector with the layout native to the EU >>> >> (e.g. an fs_reg) and performs the required transformations to turn >> it >>> >> into a form suitable for the shared unit according to a >> vector_layout >>> >> struct (what may involve unzipping, a strided copy or no work at >>> >> all). >>> >> >>> >> - emit_extract(): The converse of emit_insert(). This takes a payload >>> >> fragment (array_reg) and transforms it into a vector of native >>> >> layout, according to a vector_layout struct. >>> >> >>> >> My intention was to come up with an interface more general than SIMD16 >>> >> vector splitting and SIMD4x2 to SIMD8 conversion alone, I was also >>> >> thinking texturing instruction payload quirks (which might require >>> >> padding or different permutations of the same values depending on the >>> >> generation even if the SIMD mode is natively supported -- let's call >>> >> this 2.2) and framebuffer writes (2.3). >>> > >>> > Thank you for pointing out what else you were trying to do with it. I >>> > understand that you're trying to solve a dozen different problems and >>> > that the infrastructure you've created is one that you think solves >>> > all of them at the same time. It helps to know what all problems you >>> > were trying to solve. However, since we are putting SIMD4x2 aside for >>> > now, in the interest of keeping us from getting side-tracked, let's >>> > table this for now. It's a problem that may need solving, but it's a >>> > *different* problem. >>> > >>> If it can be solved with the same approach, it is a closely related >>> problem, isn't it? And wouldn't it be a bit silly to knowingly >>> implement a less general solution when the more general solution >>> involves *less* work and while we know we'll want to address the >>> remaining problems solved by the more general solution eventually? ;) >> >> Perhaps, but trying to solve all the problems at the same time is why the >> conversation has gone nowhere on the past. We need to keep things split >> out. >> > > >>> >>> 3) Array reg. This is something of a helper to (2). >>> >> >>> >> In fact the primary motivation for array_reg wasn't instruction >>> >> splitting, but emit_collect(), more on that later. >>> >> >>> >>> I agree that it's nice to be able to have access to the size of a >>> >>> register without having to look at the allocator. The problem here >>> >>> (and why we don't have a size in fs_reg) is that we copy the reigster >>> >>> around. Every register does have some sort of "underlying storage" >>> >>> that it shares with other registers with the same number. But that >>> >>> storage is represented by the size in the allocator and nothing else. >>> >>> Is that bad? Maybe. Should we have a way of passing the size around >>> >>> with it? Maybe. In any case, I think it's best to table this >>> >>> discussion until we've talked a bit about (2) because I think the >>> >>> resolution there will make (3) more clear. >>> >>> >>> >> I think the mechanism we use to determine the size of a register hardly >>> >> ever involves the allocator, and there are good reasons not to rely on >>> >> it: It only works for VGRFs, can easily break with register coalescing, >>> >> and it makes "slicing" and other zero-copy transformations of registers >>> >> difficult (e.g. taking an individual element out of a VGRF array >> without >>> >> copying it into a one-component register). Typically the size of a >>> >> register is implicitly assumed by the instruction using it, and where >>> >> it's not (because the instruction expects a variable length payload) it >>> >> is provided separately from the register via "mlen". Neither of these >>> >> two options works here as will be obvious shortly. >>> > >>> > Understood. >>> > >>> >>> One note on 3 before we table it. People have been pulling >>> >>> emit_collect out as a straw-man and beating on it but I really do like >>> >>> it in principle. There are piles of times where you want a payload >>> >>> and you have to allocate an array, do stuff, and then put it into the >>> >>> payload and it's really annoying. It would be much easier if we had a >>> >>> helper that just did it all in one function and that's what >>> >>> emit_collect tries to do. *Thank you* for finally trying to solve >>> >>> that problem and make payloads less painful! However, I do wish it >>> >>> were implemented a bit differently but I'll make those comments on the >>> >>> patch itself. >>> >>> >>> >> >>> >> Yeah. I think it was Matt who mentioned in the call that >> emit_collect() >>> >> is really complex. I completely agree with that statement. It needs >> to >>> >> take care of a bunch of things to build the payload correctly: >>> >> >>> >> - Allocate a register that will hold the result, calculating the >>> >> correct total size based on the amount of data we want to >>> >> concatenate. >>> >> >>> >> - Calculate the correct header size. >>> >> >>> >> - Allocate and release memory of the correct size to hold the array of >>> >> LOAD_PAYLOAD sources, also based on the amount of data we want to >>> >> concatenate. >>> >> >>> >> - Split up the arguments to be passed to LOAD_PAYLOAD as sources in >>> >> scalars of the correct width (which depends on whether the argument >>> >> is part of the header or not, and what the dispatch width is), and >>> >> initialize the array. >>> >> >>> >> - Create a LOAD_PAYLOAD instruction with the result of the previous >>> >> steps. >>> >> >>> >> Each one of these tasks is non-trivial, repetitive, can conceivably go >>> >> wrong, and until now duplicated wherever we had to build a payload -- >>> >> obscuring the actually meaningful work being done. This is why I think >>> >> that the assertion that emit_collect() makes code difficult to >>> >> understand is backwards. Thanks to emit_collect() the task of building >>> >> a payload is tremendously simplified. >>> > >>> > Yes. I really don't think people are disagreeing with you too badly >>> > about emit_collect(). I also don't think it's as complicated as you >>> > make it sound. >>> > >>> Heh, maybe "tremendously" wasn't the word I was looking for, but it's >>> appreciably simplified. :P >>> >>> >> For emit_collect() to be possible there has to be some way to represent >>> >> a register+size pair, because it needs to know the size of each >>> >> argument, and because its result is itself a register+size pair -- The >>> >> caller cannot do anything useful with a payload of unknown size, unless >>> >> it re-calculates the size of the result what would defeat much of the >>> >> purpose of emit_collect(). >>> >> >>> >> The biggest benefit from being able to represent a register+size pair >> as >>> >> a value is that emit_collect() can easily be composed with other >>> >> functions (including with itself), so the construction of specific >>> >> chunks of the payload can easily be decoupled and re-used. My code >>> >> takes advantage of this feature extensively in order to share code that >>> >> constructs surface message headers or transforms some value into the >>> >> form expected by the shared unit (doing vector splitting or a strided >>> >> copy). >>> > >>> > Yes, I understand that it's useful for that. Hence the suggestion of >>> > either a reg/size pair struct or passing the size as an argument. The >>> > struct has the advantage, as you said, of allowing chaining. >>> > >>> >>> ### SIMD16 Instruction Splitting ### >>> >>> >>> >>> SIMD16 instruction splitting is an unfortunate fact of our hardware. >>> >>> There are a variety of times we have to do it including dual-source FB >>> >>> writes, some texturing, math ops on older gens and maybe another place >>> >>> or two. Classically, this has been done in one of two places: The >>> >>> visitor as we emit the code, or the generator. The problem with doing >>> >>> it in the generator is that we can't schedule it and, if it involves a >>> >>> payload, it's not really possible. The result is that we usually do >>> >>> it in the visitor these days. >>> >>> >>> >>> Unfortunately, even in the visitor, it's gen-specific and annoying. >>> >>> It gets even worse when you're working with something such as the >>> >>> untyped surface read/write messages that work with multi-component >>> >>> values that have to be zipped/unzipped to use Curro's terminology. >>> >>> Curro came up with some helper functions to make it substantially less >>> >>> annoying but it still involves nasty looping. >>> >>> >>> >>> At some point in the past I proposed a completely different and more >>> >>> solution to this problem. Unfortunately, while I've talked to Matt & >>> >>> Ken about it, it's never really been discussed all that publicly so >>> >>> Curro may not be aware of it. I'm going to lay it out here for >>> >>> Curro's sake as well as the sake of public record. >>> >>> >>> >>> The solution involves first changing the way we handle sends into a >>> >>> two step process. First, we emit a logical instruction that contains >>> >>> all of the data needed for the actual instruction. Then, we convert >>> >>> from the logical to the actual in a lowering pass. Take, for example, >>> >>> FB writes with which I am fairly familiar. We would first emit a >>> >>> logical FS_FB_WRITE_# instruction that has separate sources for color, >>> >>> depth, replicated alpha, etc. Then, in the lower_fb_writes pass >>> >>> (which we would have to implement), we would construct the payload >>> >>> from the sources provided on the logical instruction and emit the >>> >>> actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes >>> >>> function would then get called before the optimization loop so that >>> >>> the rest of the optimization could would never see it. >>> >>> >>> >>> Second, we add a split_instruction helper that would take a SIMD16 >>> >>> instruction and naively split it into two SIMD8 instructions. Such a >>> >>> helper really shouldn't be that hard to write. It would have to know >>> >>> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that >>> >>> shouldn't be bad. Any of these new logical send instructions would >>> >>> have their values as separate sources so they should be safe to split. >>> >>> >>> >>> Third, we add a lower_simd16_to_simd8 pass that walks the >>> >>> instructions, picks out the ones that need splitting, and calls >>> >>> split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16 >>> >>> knowledge would be contained in this one pass. This pass would happen >>> >>> between actually emitting code and running lower_fb_writes (or >>> >>> whatever other send lowering passes we have). >>> >>> >>> >>> Finally, and this is the icing on the cake, we write a >>> >>> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32 >>> >>> instructions (that act on full-sized data-types) to SIMD16 >>> >>> instructions. Once the rest of the work is done, we get this pass, >>> >>> and with it SIMD32 mode, almost for free. >>> >>> >>> >>> I know this approach looks like more work and, to be honest, it may >>> >>> be. However, I think it makes a lot of things far more >>> >>> straightforward. In particular, it means that people working on the >>> >>> visitor code don't have to think about whether or not an instruction >>> >>> needs splitting. You also don't have to deal with the complexity of >>> >>> zipping/unzipping sources every time. Instead, we put all that code >>> >>> in one place and get to stop thinking about it. Also, if we *ever* >>> >>> want to get SIMD32, we will need some sort of automatic instruction >>> >>> splitting and this seems like a reasonable way to do it. >>> >>> >>> >>> I've talked to Ken about this approach and he's 100% on-board. I >>> >>> don't remember what Matt thinks of it. If we like the approach, then >>> >>> we should just split the tasks up and make it happen. It's a bit of >>> >>> refactoring but it shouldn't be terrible. If we wanted to demo it, I >>> >>> would probably suggest starting with FB writes as those are fairly >>> >>> complex but yet self-contained. They also have a case where we do >>> >>> split an instruction so it would be interesting to see what the code >>> >>> looks like before and after. >>> >>> >>> >> >>> >> I generally like your proposal. I guess the question we need to answer >>> >> is whether we want this complexity to be in a lowering pass or in a >>> >> helper function used to build the send-like instruction -- In either >>> >> case we need code to handle zipping and unzipping of SIMD16 vectors, >>> >> it's just about whether this code is called by a lowering pass or >> higher >>> >> up in the visitor. >>> >> >>> >> I can think of several benefits of the approach you propose over mine: >>> >> >>> >> - It's more transparent for the visitor code emitting the message -- I >>> >> completely agree with you that the explicit loops are rather ugly. >>> >> >>> >> - Instructions with explicit separate sources are likely to be more >>> >> suitable for certain optimization passes. Pull constant loads use a >>> >> similar approach with an expression-style opcode which is at some >>> >> point lowered to a load payload and send message. This may not be >>> >> terribly important at this point because of the optimizations >> already >>> >> performed in GLSL IR and NIR and due to the nature of the majority >> of >>> >> opcodes that don't support SIMD16, but it still seems appealing. >>> > >>> > There's one more really big one that you missed: >>> > >>> > It scales! We can't afford to have a for loop for every ADD and MUL >>> > instruction. Sure, we might be able to afford it on sends, but not >>> > for everything. >>> > >>> Well, I doubt you'd want to implement your proposal to the letter for >>> non-send instructions: For those we don't need separate lowered and >>> non-lowered instructions because there's no payload to assemble, so we >>> can just do with a single opcode with different execution widths. When >>> you take payloads out of the picture all the disadvantages mentioned of >>> the lowering pass approach no longer apply, so I totally agree with you >>> that we want a general lowering pass to lower instructions that expect >>> their arguments as separate sources in their final form. In any case >>> send-like instructions need a somewhat different (and less scalable) >>> treatment. >>> >>> >> Some disadvantages come to my mind too: >>> >> >>> >> - It increases the amount of work required to add a new send-like >>> >> instruction because you need lowered and unlowered variants for >> each. >>> >> >>> >> - It seems tricky to get right when splitting an instruction in halves >>> >> involves changing the actual contents of the payload beyond zipping >>> >> and unzipping its arguments -- This might not seem like a big deal >>> >> right now, but it will be a problem when we implement SIMD32. The >>> >> surface messages that take a sample mask as argument are a good >>> >> example, because they only have 16 bits of space for it so you >>> >> actually need to provide different values depending on the "slot >>> >> group" the message is meant for. This can be worked around easily >> in >>> >> the visitor by shifting the sample mask register but it seems harder >>> >> to fix up later. >>> > >>> > Why do sample masks need to be part of the logical instruction? Can't >>> > we figure that out when we lower from logical to physical based on the >>> > quarter control? >>> > >>> You can surely do anything you want during the logical-to-physical >>> conversion, including rewriting the header, the problem is that it that >>> forces you to have a pile of message-specific handling code in the >>> lowering pass. How are you planning to address that? With a separate >>> lowering pass for each message opcode or a general one with >>> message-specific knowledge? >> >> Yes, the lowering pass will have *all* of the message-specific >> information. Probably broken out into helper functions exactly the way the >> message emit code is broken out now. The lowering pass then just knows >> what helper to call for what instruction. >> > I think I only buy your proposal if it saves us more work than it > creates in the long term, e.g. by using general splitting and payload > assembly algorithms shared among all opcodes with minimal > message-specific information. Otherwise what you are describing sounds > like a "bureaucratic" variant of my proposal, with lowered and unlowered > versions of each opcode and with the payload assembly code (functionally > almost the same as mine) hidden behind a lowering pass under a > switch-case statement instead of being called up front. >
I've given this idea a shot. Can you have a look at the image-load-store-lower branch of my tree [1]? It's just a quick and dirty proof of concept, so don't bother to review it carefully, just let me know if you agree with the general design before I spend more time on it. [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store-lower >>> >> - My intention was to handle vector layout peculiarities beyond >>> >> SIMDN-to-SIMD8 splitting under a consistent framework, leaving it >>> >> under the control of the helper function that builds the message >>> >> which quirks to apply. With the lowering pass approach you'd either >>> >> need message-specific lowering passes, message-specific code in a >>> >> general lowering pass, or some way for the builder code to pass >>> >> vector_layout-like metadata to the lowering pass -- In the first two >>> >> cases the policy on how to format each source in the payload would >> be >>> >> "delocalized" between the visitor and lowering pass, what I don't >>> >> particularly like. >>> > >>> > We already do that and we would continue to. The logical instruction >>> > wouldn't be much more than a serialized version of the helper >>> > function. >>> > >>> > Thanks for your reply. >>> > --Jason >>> > >>> >>> Thoughts? >>> >>> >>> >>> Question for Curro: Supposing for the moment that we decide to do >>> >>> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff >>> >>> would still be needed for image_load_store? >>> >>> >>> >> I think I've already answered this question, but I'll answer it again >>> >> here for clarity -- It would still help with building and passing >> around >>> >> payloads concisely using emit_collect(), regardless of how we implement >>> >> SIMD16 instruction splitting. >>> >> >>> >>> I hope this e-mail provides us with some good talking points and a >> way forward. >>> >>> --Jason >>> >> >>> >> Thanks Jason! >>> >> >>> >> [1] >> http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev