On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
> 
> 1. As input arguments, out_insn represents an rtl expression that
> potentially "produces" a store to memory and in_insn represents an rtl
> expression that potentially "consumes" a value recently stored to memory.
> 
You have this reversed; the code is trying to find situations where
out_insn is producing a value that in_insn will be storing to memory.
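
To make that concrete, here is a minimal sketch with made-up register
numbers and modes (not real rs6000 patterns):

  ;; out_insn: produces a value in r3
  (set (reg:SI 3) (plus:SI (reg:SI 4) (reg:SI 5)))

  ;; in_insn: stores that value to memory
  (set (mem:SI (reg:SI 6)) (reg:SI 3))

The point is that r3 is the data being stored, not part of the address.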

> 2. If the memory store produced matches the memory fetch consumed, this
> function returns true to indicate that this sequence of two instructions
> qualifies for a special "bypass" latency that represents the fact that
> the fetch will obtain the value out of the write buffer.  So, whereas
> the instruction scheduler might normally expect that this sequence of
> two instructions would experience Load-Hit-Store penalties associated
> with cache coherency hardware costs, since these two instructions qualify
> for the store_data_bypass optimization, the instruction scheduler counts
> the latency as only 1 or 2 cycles (potentially).  [This is what I
> understand, but I may be wrong, so please correct me if so.]
> 
In general, yes, if the function returns true then the sequence has been
identified and the target will take appropriate action (adjusting
latency or whatever).
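
For reference, targets typically wire this up in the machine description
with a define_bypass that uses this function as its guard; roughly (the
reservation names here are made up):

  ;; If the ALU result is only used as the data of a following store,
  ;; charge a reduced latency instead of the normal one.
  (define_bypass 1 "my_cpu_alu" "my_cpu_store" "store_data_bypass_p")

So when the scheduler sees a dependence from a "my_cpu_alu" insn to a
"my_cpu_store" insn and store_data_bypass_p returns true for the pair, it
uses the bypass latency (1 here) rather than the default.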


As for the remainder below dealing with PARALLELs, I don't have any
history on that, so hopefully others can chime in. For the rs6000 port, I
don't see the handling of multiple SET operations making much sense, but
then again I don't know whether that case actually occurs, given the
places where store_data_bypass_p is used.
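
For what it's worth, the kind of multi-SET PARALLEL that could show up
here would be something like a store-with-update, where one arm is the
memory store and the other updates the base register (made-up RTL, not an
actual pattern):

  (parallel [(set (mem:SI (plus:SI (reg:SI 9) (const_int 4)))
                  (reg:SI 3))
             (set (reg:SI 9) (plus:SI (reg:SI 9) (const_int 4)))])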

-Pat

> 3. Actually, what I described above is only the "simple" case.  It may
> be that the rtl for either out_insn or in_insn is really a parallel
> clause with multiple rtl trees beneath it.  In this case, we compare the
> subtrees in a "similar" way to see if the compound expressions qualify
> for the store_data_bypass_p "optimization".  (I've got some questions
> about how this is done below.)  As currently implemented, special
> handling is given to a CLOBBER subtree as part of either PARALLEL
> expression: we ignore it.  This is because CLOBBER does not represent
> any real machine instructions.  It just represents semantic information
> that might be used by the compiler.
> 
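Just to illustrate that case for anyone following the thread: a PARALLEL
with a CLOBBER arm looks roughly like the following (made-up registers),
and only the SET arm takes part in the comparison; the CLOBBER is skipped.

  (parallel [(set (reg:SI 3) (plus:SI (reg:SI 4) (reg:SI 5)))
             (clobber (reg:CC 100))])
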
> In addition to seeking confirmation of my existing understanding of the
> code as outlined above, the specific questions that I am seeking help
> with are:
> 
> 1. In the current implementation (as I understand it), near the top of
> the function body, we handle the case that the consumer (in_insn) rtl is
> a single SET expression and the producer (out_insn) rtl is a PARALLEL
> expression containing multiple sets.  The way I read this code, we are
> requiring that every one of the producer's parallel SET instructions
> produce the same value that is to be consumed in order to qualify this
> sequence as a "store data bypass".  That seems wrong to me.  I would
> expect that we only need "one" of the produced values to match the
> consumed value in order to qualify for the "store data bypass"
> optimization.  Please explain.  (The same confusing behavior happens
> below in the same function, in the case that the consumer rtl is a
> PARALLEL expression of multiple SETs: we require that every producer's
> stored value match every consumer's fetched value.)
> 
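For concreteness, the case being described would be something like the
following (made-up RTL), where only one of the two produced values is the
one that gets stored:

  ;; out_insn: PARALLEL producer with two SETs
  (parallel [(set (reg:SI 3) (plus:SI (reg:SI 4) (reg:SI 5)))
             (set (reg:SI 6) (plus:SI (reg:SI 4) (const_int 4)))])

  ;; in_insn: stores just one of those results
  (set (mem:SI (reg:SI 7)) (reg:SI 3))
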
> 2. A "bigger" concern is that any time any SETs are buried within a
> PARALLEL tree, I'm not sure the answer produced by this function, as
> currently implemented, is at all reliable:
> 
>  a) PARALLEL does not necessarily mean all of its subtrees happen in
> parallel on hardware.  It just means that there is no sequencing imposed
> by the source code, so the final order in which the multiple subtrees
> beneath the PARALLEL node are emitted is not known at this stage of
> compilation.
> 
>  b) It seems to me that it doesn't really make sense to speak of whether
> a whole bunch of producers combined with a whole bunch of consumers
> qualify for an optimized store data bypass latency.  If we say that they
> do qualify (as a group), which pair(s) of producer and consumer machine
> instructions qualify?  It seems we need to know which producer matches
> with which consumer in order to know where the bypass latencies "fit"
> into the schedule.
> 
>  c) Furthermore, if it turns out that the "arbitrary" order in which the
> producer instructions and consumer instructions are emitted places too
> much "distance" between a producer and the matching consumer, then it is
> possible that by the time the hardware executes the consumer, the stored
> value is no longer in the write buffer, so even though we might have
> "thought" two PARALLEL rtl expressions qualified for the store bypass
> optimization, we really should have returned false.
> 
> Can someone help me understand this better?
> 
> Thanks much.
> 
> 
