Hi Nilay,

I was asking about the meaning of "upper lane merging" as I was not aware of this behaviour for SSE instructions. Reading through the manual, it seems that the upper octword remains unchanged for legacy SSE instructions (as Giacomo was mentioning). However, some SSE instructions, the extended (VEX-encoded) ones, do perform zeroing.
Having said all this, I don't see why there needs to be a read of the old value for every write; a masked OR is a lot more efficient. For this aspect, the difference between legacy and extended SSE/AVX instructions becomes just applying a different mask over the entire register before writing to it (an AND followed by an OR). For example, if the register is 256 bits and we are writing the lower 128 bits with a legacy SSE instruction, the write operation should be (128bits_new_value) | (upper_mask & 256bits_old_value), where upper_mask has ones in bits 255:128 and zeros in bits 127:0. If we are writing with an extended SSE instruction, the operation should be (128bits_new_value) | (0x0 & 256bits_old_value). Implementing this in hardware might be more efficient than reading the register before every register write.

However, what the Intel implementation actually does is save the upper lanes of the registers on every transition from AVX/extended SSE to legacy SSE instructions [1]. The VEX prefix should be a good indicator of these transitions. Also, programmers and compilers are encouraged to reduce these transitions for better performance, as that reduces the overhead of saving the upper lanes [2]. At first glance, it seems there are a few instructions that read and write just the upper parts of the vector registers, at least for x86.

Best,
Alex

[1] https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
[2] https://www.pgroup.com/lit/articles/insider/v3n2a4.htm

________________________________________
From: Nilay Vaish [[email protected]]
Sent: Monday, July 06, 2015 9:17 PM
To: Dutu, Alexandru
Cc: Default
Subject: Re: Review Request 2828: cpu: implements vector registers

On Mon, 6 Jul 2015, Alexandru Dutu wrote:

>> On July 6, 2015, 12:42 p.m., Giacomo Gabrielli wrote:
>>> These are my current thoughts about this patch:
>>>
>>> 1.
>>> My impression is that there is still not enough architectural support to
>>> understand if the new vector register type as it stands can address all the
>>> different corner cases efficiently; I'd leave it to the wider gem5 community
>>> to decide where we want to draw that line...
>>>
>>> 2. Legacy SSE requires merging of upper lanes, while AVX does zeroing;
>>> also, ARMv8 AArch64 scalar FP and NEON instructions perform zeroing.
>>> Assuming that destination vectors are always read is going to
>>> introduce unneeded serialization for those ISA extensions if they
>>> are going to be ported to the new scheme, so I'd suggest avoiding
>>> an implicit read on write. Also, for cases where merging is
>>> required, maybe something smarter should be done to avoid unneeded
>>> serialization; without optimizations, any sequence of x86 FP scalar
>>> instructions could be significantly slower than real hw
>>> implementations.
>
> Could you please detail a bit the merging issue for legacy SSE?

I am not sure what you are asking for exactly. I have two interpretations of your question:

1. SSE instructions work with 128-bit registers. AVX instructions work with 128-bit, 256-bit and 512-bit registers. Since the actual underlying set of registers is the same, we need to do something about the bits that are not part of the output. For SSE instructions, bits 128 to VLmax-1 are retained as before. For AVX, the instructions that output only 128 bits zero the rest of the bits in the register. For example, suppose we are doing 32-bit adds on two 128-bit registers, but the underlying register is 256 bits wide. Then C[0..3] = A[0..3] + B[0..3] for both SSE and AVX, but C[4..7] = C_old[4..7] for SSE and C[4..7] = 0 for AVX.

2. In the implementation that I posted, we only maintain the largest register size that the ISA supports. So, if the largest vector width is 512 bits, then all vector registers are 512 bits wide. While executing SSE instructions, we need to retain the previous data.
So while writing to the output register, we need to perform a merge between the new and the old values. This means we need to read the old values first, so there would be serialization between instructions that read and write different parts of the vector register. But now that I think about it, most instructions are going to read / write the lower bits, so the serialization would occur anyway.

--
Nilay

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
