On Sun, Jul 5, 2015 at 4:45 PM, Francisco Jerez <curroje...@riseup.net> wrote:
> Hi Matt,
>
> Matt Turner <matts...@gmail.com> writes:
>
>> On Fri, Jul 3, 2015 at 3:46 AM, Francisco Jerez <curroje...@riseup.net>
>> wrote:
>>> Heh, I happened to come across this comment yesterday while looking for
>>> the remaining no16 calls and wondered why on earth it couldn't do the
>>> same that the normal interpolation code does. After this patch and a
>>> series coming up that will remove all SIMD8 fallbacks from the texturing
>>> code, the only case left still applicable to Gen7 hardware and later
>>> will be "SIMD16 explicit accumulator operands unsupported". Anyone?
>>
>> I can explain the problem:
>>
>> Prior to Gen7, there were two accumulator registers usable for most
>> datatypes (acc0, acc1). On Gen7, they removed integer-support from
>> acc1, which was necessary to implement SIMD16 integer multiplication
>> using the normal MUL/MACH sequence.
>
> IIRC they got rid of the acc1 register on IVB altogether, but managed to
> emulate it for floating point types by taking advantage of the extra
> precision not normally used for floating point arithmetic (the fake acc1
> basically uses the same storage in the EU that holds the 32 MSBs of each
> component of acc0), which explains the apparent asymmetry between integer
> and floating point data types.

I've never read anything that told me that -- what have you seen?

>> I implemented 32-bit integer multiplication without using the
>> accumulator in:
>>
>> commit f7df169ba13d22338e9276839a7e9629ca0a6b4f
>> Author: Matt Turner <matts...@gmail.com>
>> Date: Wed May 13 18:34:03 2015 -0700
>>
>> i965/fs: Implement integer multiply without mul/mach.
>>
>> The remaining cases of "SIMD16 explicit accumulator operands
>> unsupported" are ADDC, SUBB, and 32x32 -> high 32-bit multiplication.
>> The remaining multiplication case can probably be reimplemented
>> without the accumulator, like I did for the low 32-bit result.
>>
> Hmm, I have the suspicion that high 32-bit multiplication is the one
> legit use-case of the accumulator we have left, any algorithm breaking
> it up into individual 32/16-bit MULs would end up doing more
> multiplications than the two MUL/MACH instructions we do now, because we
> wouldn't be able to take advantage of the full precision implemented in
> the hardware if we truncate the 48-bit intermediate results to fit in a
> 32-bit register.

That's probably true. It's just that Sandybridge and earlier don't
expose the functionality (but could do 64-bit integer multiplication
just fine), Ivybridge has the quarter-control/accumulator bug, Haswell
works fine if you split the multiplication sequence into SIMD8, and
Broadwell lets you do 32x32 -> 64-bit multiplication without the
accumulator.

So you have only two platforms where you have to use the accumulator,
and one of them is broken (but I guess can be trivially fixed by some
force-writemask-all hackery).

The best SIMD16 code for [iu]mulExtended() where both lsb and msb
results are used is probably 2 sets of mul/mach/mov (with some kind of
workaround for Ivybridge), but that's kind of hard to recognize.

> How about we use the SIMD width lowering pass to split the computation
> in half?
> It should be quite straightforward but will probably require
> adding a new virtual opcode so that the SIMD width lowering pass doesn't
> have to deal with (seriously fucked-up) accumulators directly.

Seems fine to me.

>> The ADDC and SUBB instructions implicitly write a bit to the
>> accumulator if their operations overflowed. The 1Q/2Q quarter control
>> is supposed to select which register is implicitly written -- except
>> that there is no acc1 for integer types. Haswell and newer ignore the
>> quarter control and always write acc0, but IVB (and presumably BYT)
>> attempt to write to the nonexistent acc1.
>>
>> You could split the SIMD16 operations into 2x SIMD8s and set
>> force_writemask_all on the second, followed by a 2Q MOV from the
>> accumulator. Maybe we'd rather use the .o (overflow) conditional mod
>> on a result ADD to implement this.
>>
> Yeah. I did in fact try to implement uaddCarry last Friday without
> using the accumulator by doing something like:
>
> | CMP.o tmp, src0, -src1
> | MOV dst, -tmp
>
> ...which of course didn't work because of the extra argument precision
> post-source modifiers and also because the .o condmod doesn't work at
> all on CMP, but...

Ah, you were trying to use the fact that CMP returns 0/-1. That's a
cool idea. It's too bad that the CMP instruction doesn't do .o.

I'd been thinking of doing "ADD.o tmp, src0, src1" and then something
that sets/selects 0/1 based on the flag register. Maybe even a move
from the flag register would be best.
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
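
For readers following the thread, here is a minimal scalar sketch in plain C of the two accumulator-free ideas discussed above. It is only an illustration of the arithmetic, not i965 backend code or EU assembly, and the function names (mul_high_u32, uadd_carry) are hypothetical. The first function computes the high 32 bits of a 32x32 product from 16x16 partial products, which shows the cost raised in the thread: four multiplies instead of the MUL/MACH pair. The second derives the uaddCarry result from the wrapped sum, loosely analogous to an ADD followed by selecting 0/1 on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High 32 bits of an unsigned 32x32 -> 64-bit product, built only from
 * 16x16 -> 32-bit partial products (hypothetical helper, illustration only). */
static uint32_t mul_high_u32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffffu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffffu, b_hi = b >> 16;

    /* Four multiplies; each result fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Middle column: three 16-bit quantities, so the sum cannot overflow. */
    uint32_t mid = (ll >> 16) + (lh & 0xffffu) + (hl & 0xffffu);

    return hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* uaddCarry-style addition: stores the wrapped sum in *sum and returns
 * 1 if the 32-bit addition overflowed, 0 otherwise (hypothetical helper). */
static uint32_t uadd_carry(uint32_t a, uint32_t b, uint32_t *sum)
{
    assert(sum != NULL);
    *sum = a + b;        /* unsigned arithmetic wraps modulo 2^32 */
    return *sum < a;     /* the sum is smaller than an operand only on wrap */
}
```

The partial-product version makes the trade-off visible: it needs no wide intermediate register, but pays for that with extra multiplies and adds.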