Hi Nilay,

I was asking about the meaning of "upper lane merging" as I was not aware of this behaviour for SSE instructions. Reading through the manual, it seems that the upper octword remains unchanged for legacy SSE instructions (as Giacomo was mentioning). However, some SSE instructions, the extended (VEX-encoded) ones, do perform zeroing.
Having said all this, I don't see why there needs to be a read of the old value for every write; a masked OR is a lot more efficient. For this aspect, the difference between legacy and extended SSE/AVX instructions becomes just applying a different mask over the entire register before writing to it (an AND followed by an OR). For example, if the register is 256 bits and we are writing the lower 128 bits with a legacy SSE instruction, the write operation should be (128bits_new_value) | (upper_mask & 256bits_old_value), where upper_mask has ones in bits 255:128 and zeros in bits 127:0. If we are writing with an extended SSE instruction, the operation should be (128bits_new_value) | (0x0 & 256bits_old_value). Implementing this in hardware might be more efficient than reading the register before every register write.

However, what the Intel implementation actually does is save the upper lanes of the registers on every transition from AVX/extended SSE to legacy SSE instructions [1]. The VEX prefix should be a good indicator of these transitions. Also, programmers and compilers are encouraged to reduce these transitions for better performance, as that reduces the overhead of saving the upper lanes [2]. At first glance, it seems there are a few instructions that read and write just the upper parts of the vector registers, at least for x86.

Best,
Alex

[1] https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
[2] https://www.pgroup.com/lit/articles/insider/v3n2a4.htm

________________________________________
From: Nilay Vaish [[email protected]]
Sent: Monday, July 06, 2015 9:17 PM
To: Dutu, Alexandru
Cc: Default
Subject: Re: Review Request 2828: cpu: implements vector registers

On Mon, 6 Jul 2015, Alexandru Dutu wrote:

>> On July 6, 2015, 12:42 p.m., Giacomo Gabrielli wrote:
>>> These are my current thoughts about this patch:
>>>
>>> 1.
>>> My impression is that there is still not enough architectural support to
>>> understand if the new vector register type as it stands can address all the
>>> different corner cases efficiently; I'd leave it to the wider gem5 community
>>> to decide where we want to draw that line...
>>>
>>> 2. Legacy SSE requires merging of upper lanes, while AVX does zeroing;
>>> also, ARMv8 AArch64 scalar FP and NEON instructions perform zeroing.
>>> Assuming that destination vectors are always read is going to
>>> introduce unneeded serialization for those ISA extensions if they
>>> are going to be ported to the new scheme, so I'd suggest avoiding
>>> an implicit read on write. Also, for cases where merging is
>>> required, maybe something smarter should be done to avoid unneeded
>>> serialization; without optimizations, any sequence of x86 FP scalar
>>> instructions could be significantly slower than real hw
>>> implementations.
>
> Could you please detail a bit the merging issue for legacy SSE?

I am not sure what you are asking for exactly. I have two interpretations of your question:

1. SSE instructions work with 128-bit registers. AVX instructions work with 128-bit, 256-bit and 512-bit registers. Since the actual underlying set of registers is the same, we need to do something about the bits that are not part of the output. For SSE instructions, bits 128 to VLmax-1 are retained as before. For AVX, the instructions that output only 128 bits zero the rest of the bits in the register. For example, suppose we are doing 32-bit adds on two 128-bit registers, but the underlying register is 256 bits wide. Then C[0..3] = A[0..3] + B[0..3] for both SSE and AVX, but C[4..7] = C_old[4..7] for SSE and C[4..7] = 0 for AVX.

2. In the implementation that I posted, we only maintain the largest register size that the ISA supports. So, if the largest vector width is 512 bits, then all vector registers are 512 bits wide. While executing SSE instructions, we need to retain the previous data.
So while writing to the output register, we need to perform a merge between the new and the old values. This means we need to read the old values first, so there would be serialization between instructions that read and write different parts of the vector register. But now that I think about it, most instructions are going to read / write the lower bits, so the serialization would occur anyway.

--
Nilay

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
