Francisco Jerez <curroje...@riseup.net> writes: > Neil Roberts <n...@linux.intel.com> writes: > >> Ok, that makes sense, thanks for the detailed explanation. I think >> you're right, given the extra restrictions avoiding the integer dword >> multiplication seems like the simplest solution. >> >> I can't say that I fully understand what the documentation is trying to >> say here. Does it mean that the horizontal strides must be set such that >> each element in the source is offset to the same qword within its >> register as the corresponding element in the destination? >> > > The way I understand it one's required to: > - Align the stride in bytes (i.e. horizontal stride times the size of > the register type) between components of an individual source or > destination to a multiple of 64b (if the register region is more than > a single scalar that is). The multiple must be the same for all > non-scalar sources and the destination. > - Use the same sub-register offsets for all non-scalar sources and the > destination. > - Set vertical stride to horizontal stride times width. > - Don't use indirect addressing or ARF registers. > >> I tried changing the test caseĀ¹ so that one of the sources comes from a >> uniform. That would make the register for the uniform source be a scalar >> and thus all of its elements would come from the first qword, whereas >> for the destination the elements would be written to different qwords. >> That ends up with an instruction like this: >> >> mul(16) g120<1>D g9<8,8,1>D g6<0,1,0>D { align1 >> 1H compacted }; >> > > Yeah, this should definitely violate the alignment rule on both the > destination and first source. > >> However, the test still passes on my BXT so I've probably misunderstood >> something. On the other hand this does look similar to the stride you >> had in your example instruction that the simulator complained about. >> >> Given that the documentation and simulator implies this shouldn't work >> I'm happy to leave this patch and I'm not suggesting we do anything now. >> > > Honestly I'm also not 100% sure if this restriction really applies to > all BXT hardware or if it's specific to some preproduction steppings -- > The documentation seems to imply the former, but I guess it could be > wrong. > I forgot to mention that when I tried this yesterday the simulator seemed to complain about the alignment rule regardless of the stepping for all the ones it recognised (a-c IIRC).
>> Regards, >> - Neil >> >> 1. >> https://github.com/bpeel/piglit/blob/test-integer-multiply-uniform/tests/general/mult32.c >> >> Francisco Jerez <curroje...@riseup.net> writes: >> >>> Neil Roberts <n...@linux.intel.com> writes: >>> >>>> Are you sure this patch is necessary? The documentation for the multiply >>>> instruction on BDW+ says: >>>> >>>> SourceType : *D >>>> DestinationType : *D >>>> Project : EXCLUDE(CHV) >>>> >>>> This to me implies that it should work on BXT because it doesn't say >>>> EXCLUDE(BXT). >>>> >>> >>> In fact both CHV and BXT *can* support 32x32 multiplication, that's not >>> really the reason that motivated this patch -- The problem is that 32x32 >>> multiplication has QWORD execution type which has several restrictions >>> on CHV and BXT, the annoying one is: >>> >>> "CHV, BXT >>> >>> When source or destination datatype is 64b or operation is integer DWord >>> multiply, regioning in Align1 must follow these rules: >>> >>> Source and Destination horizontal stride must be aligned to the same >>> qword. >>> >>> Example: >>> [...] >>> // mul (4) r10.0<2>:d r11.0<8;4,2>:d r12.0<8;4,2>:d // Source and >>> Destination stride must be 2 since the execution type is Qword." >>> >>> So it sounds like we could use the native 32x32 multiply with some >>> additional effort to pass the arguments with the correct stride, but >>> it's not clear to me whether the solution would be any more better than >>> splitting up the multiply into 32x16 multiplies, so doing the same as in >>> CHV and pre-Gen8 seemed like the most KISS solution for now. >>> >>>> I made a little test case and ran it on my BXT and it seems to work even >>>> without this patch. I looked at the assembler source and it is >>>> definitely doing a MUL instruction with D types for the dst and two >>>> sources. >>>> >>>> https://github.com/bpeel/piglit/blob/test-integer-multiply/tests/general/mult32.c >>>> >>> Hmm, I guess the docs could be wrong, but I'm not sure what the >>> consequences would be of violating the alignment rule, I guess the >>> failure may be non-deterministic. What stepping of BXT did you test >>> this on? >>> >>>> Regards, >>>> - Neil >>>> >>>> Francisco Jerez <curroje...@riseup.net> writes: >>>> >>>>> AFAIK BXT has the same annoying alignment limitation as CHV on the >>>>> source register regions of 32x32 bit MULs, give it the same treatment. >>>>> --- >>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 4 ++-- >>>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> index 244f299..fc9f007 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> @@ -3126,9 +3126,9 @@ fs_visitor::lower_integer_multiplication() >>>>> bool progress = false; >>>>> >>>>> /* Gen8's MUL instruction can do a 32-bit x 32-bit -> 32-bit operation >>>>> - * directly, but Cherryview cannot. >>>>> + * directly, but CHV/BXT cannot. >>>>> */ >>>>> - if (devinfo->gen >= 8 && !devinfo->is_cherryview) >>>>> + if (devinfo->gen >= 8 && !devinfo->is_cherryview && >>>>> !devinfo->is_broxton) >>>>> return false; >>>>> >>>>> foreach_block_and_inst_safe(block, fs_inst, inst, cfg) { >>>>> -- >>>>> 2.4.6 >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev