I haven't said much about this series up until now. I've mostly sat and watched and focused my time on other things. As I said in the meeting today, I think that part of the problem here is that there are at least 3 "refactors" in here besides the new feature. By "refactors", I mean "new ways of solving problem X". Part of the problem with the discussion is that all of this has been conflated and we aren't actually talking about how to solve each of those problems. I'm going to try and break it out here to try and aid in the discussion.
1) The builder. I think *everyone* likes the builder. The argument here is not over whether or not we want it. People have expressed concern that adding the builder now without actually doing the refactor will result in duplicated code that will get out-of-sync. I think the best solution here is to go ahead and do the builder stuff immediately after the branch point. 2) SIMD8 instruction splitting. As I said in the call, we've done this a variety of ways in the past usually by emitting the split instructions in the visitor and manually using half() and inst->force_sechalf. We also have a few places that I think still split the code in the generator. What you're doing is a somewhat more elegant version of the "emit it in the visitor" stuff that we've done in the past. You've got your zip/unzip functions etc. to handle splitting and combining and then you have a loop to actually emit the sends. Really, it's fairly nice and concise. The question is whether or not it's really what we want to do (more on this in a bit). 3) Array reg. This is something of a helper to (2). I agree that it's nice to be able to have access to the size of a register without having to look at the allocator. The problem here (and why we don't have a size in fs_reg) is that we copy the reigster around. Every register does have some sort of "underlying storage" that it shares with other registers with the same number. But that storage is represented by the size in the allocator and nothing else. Is that bad? Maybe. Should we have a way of passing the size around with it? Maybe. In any case, I think it's best to table this discussion until we've talked a bit about (2) because I think the resolution there will make (3) more clear. One note on 3 before we table it. People have been pulling emit_collect out as a straw-man and beating on it but I really do like it in principle. There are piles of times where you want a payload and you have to allocate an array, do stuff, and then put it into the payload and it's really annoying. It would be much easier if we had a helper that just did it all in one function and that's what emit_collect tries to do. *Thank you* for finally trying to solve that problem and make payloads less painful! However, I do wish it were implemented a bit differently but I'll make those comments on the patch itself. ### SIMD16 Instruction Splitting ### SIMD16 instruction splitting is an unfortunate fact of our hardware. There are a variety of times we have to do it including dual-source FB writes, some texturing, math ops on older gens and maybe another place or two. Classically, this has been done in one of two places: The visitor as we emit the code, or the generator. The problem with doing it in the generator is that we can't schedule it and, if it involves a payload, it's not really possible. The result is that we usually do it in the visitor these days. Unfortunately, even in the visitor, it's gen-specific and annoying. It gets even worse when you're working with something such as the untyped surface read/write messages that work with multi-component values that have to be zipped/unzipped to use Curro's terminology. Curro came up with some helper functions to make it substantially less annoying but it still involves nasty looping. At some point in the past I proposed a completely different and more solution to this problem. Unfortunately, while I've talked to Matt & Ken about it, it's never really been discussed all that publicly so Curro may not be aware of it. I'm going to lay it out here for Curro's sake as well as the sake of public record. The solution involves first changing the way we handle sends into a two step process. First, we emit a logical instruction that contains all of the data needed for the actual instruction. Then, we convert from the logical to the actual in a lowering pass. Take, for example, FB writes with which I am fairly familiar. We would first emit a logical FS_FB_WRITE_# instruction that has separate sources for color, depth, replicated alpha, etc. Then, in the lower_fb_writes pass (which we would have to implement), we would construct the payload from the sources provided on the logical instruction and emit the actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes function would then get called before the optimization loop so that the rest of the optimization could would never see it. Second, we add a split_instruction helper that would take a SIMD16 instruction and naively split it into two SIMD8 instructions. Such a helper really shouldn't be that hard to write. It would have to know how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that shouldn't be bad. Any of these new logical send instructions would have their values as separate sources so they should be safe to split. Third, we add a lower_simd16_to_simd8 pass that walks the instructions, picks out the ones that need splitting, and calls split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16 knowledge would be contained in this one pass. This pass would happen between actually emitting code and running lower_fb_writes (or whatever other send lowering passes we have). Finally, and this is the icing on the cake, we write a lower_simd32_to_simd16 pass that goes through and lowers all SIMD32 instructions (that act on full-sized data-types) to SIMD16 instructions. Once the rest of the work is done, we get this pass, and with it SIMD32 mode, almost for free. I know this approach looks like more work and, to be honest, it may be. However, I think it makes a lot of things far more straightforward. In particular, it means that people working on the visitor code don't have to think about whether or not an instruction needs splitting. You also don't have to deal with the complexity of zipping/unzipping sources every time. Instead, we put all that code in one place and get to stop thinking about it. Also, if we *ever* want to get SIMD32, we will need some sort of automatic instruction splitting and this seems like a reasonable way to do it. I've talked to Ken about this approach and he's 100% on-board. I don't remember what Matt thinks of it. If we like the approach, then we should just split the tasks up and make it happen. It's a bit of refactoring but it shouldn't be terrible. If we wanted to demo it, I would probably suggest starting with FB writes as those are fairly complex but yet self-contained. They also have a case where we do split an instruction so it would be interesting to see what the code looks like before and after. Thoughts? Question for Curro: Supposing for the moment that we decide to do SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff would still be needed for image_load_store? I hope this e-mail provides us with some good talking points and a way forward. --Jason _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev