On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
>
> 1. As input arguments, out_insn represents an rtl expression that
> potentially "produces" a store to memory and in_insn represents an rtl
> expression that potentially "consumes" a value recently stored to memory.
>

You have this reversed; the code is trying to find situations where
out_insn is producing a value that in_insn will be storing to memory.
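For illustration, a qualifying pair might look roughly like the
following RTL (the register numbers and modes here are invented for
the example, not taken from any real dump):

    ;; out_insn (producer): computes the value
    (set (reg:SI 3) (plus:SI (reg:SI 4) (reg:SI 5)))

    ;; in_insn (consumer): stores that value to memory
    (set (mem:SI (reg:SI 6)) (reg:SI 3))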
> 2. If the memory store produced matches the memory fetch consumed, this
> function returns true to indicate that this sequence of two instructions
> qualifies for a special "bypass" latency that represents the fact that
> the fetch will obtain the value out of the write buffer. So, whereas
> the instruction scheduler might normally expect that this sequence of
> two instructions would experience Load-Hit-Store penalties associated
> with cache coherency hardware costs, since these two instructions qualify
> for the store_data_bypass optimization, the instruction scheduler counts
> the latency as only 1 or 2 cycles (potentially). [This is what I
> understand, but I may be wrong, so please correct me if so.]
>

In general, yes, if the function returns true then the sequence has been
identified and the target will take appropriate action (adjusting latency
or whatever).

As for the remainder below dealing with PARALLELs, I don't have any
history on that so hopefully others can chime in. For the rs6000 port,
I don't see the handling of multiple SET operations making much sense,
but then again I don't know if it will actually occur either based on
the places where store_data_bypass_p is used. (A concrete example of
the PARALLEL case is sketched at the end of this message.)

-Pat

> 3. Actually, what I described above is only the "simple" case. It may
> be that the rtl for either out_insn or in_insn is really a parallel
> clause with multiple rtl trees beneath it. In this case, we compare the
> subtrees in a "similar" way to see if the compound expressions qualify
> for the store_data_bypass_p "optimization". (I've got some questions
> about how this is done below.) As currently implemented, special
> handling is given to a CLOBBER subtree as part of either PARALLEL
> expression: we ignore it. This is because CLOBBER does not represent
> any real machine instructions. It just represents semantic information
> that might be used by the compiler.
>
> In addition to seeking confirmation of my existing understanding of the
> code as outlined above, the specific questions that I am seeking help
> with are:
>
> 1. In the current implementation (as I understand it), near the top of
> the function body, we handle the case that the consumer (in_insn) rtl is
> a single SET expression and the producer (out_insn) rtl is a PARALLEL
> expression containing multiple sets. The way I read this code, we are
> requiring that every one of the producer's parallel SET instructions
> produce the same value that is to be consumed in order to qualify this
> sequence as a "store data bypass". That seems wrong to me. I would
> expect that we only need "one" of the produced values to match the
> consumed value in order to qualify for the "store data bypass"
> optimization. Please explain. (The same confusing behavior happens
> below in the same function, in the case that the consumer rtl is a
> PARALLEL expression of multiple SETs: we require that every producer's
> stored value match every consumer's fetched value.)
>
> 2. A "bigger" concern is that any time any SETs are buried within a
> PARALLEL tree, I'm not sure the answer produced by this function, as
> currently implemented, is at all reliable:
>
> a) PARALLEL does not necessarily mean all of its subtrees happen in
> parallel on hardware. It just means that there is no sequencing imposed
> by the source code, so the final order in which the multiple subtrees
> beneath the PARALLEL node appear is not known at this stage of
> compilation.
>
> b) It seems to me that it doesn't really make sense to speak of whether
> a whole bunch of producers combined with a whole bunch of consumers
> qualify for an optimized store data bypass latency. If we say that they
> do qualify (as a group), which pair(s) of producer and consumer machine
> instructions qualify? It seems we need to know which producer matches
> with which consumer in order to know where the bypass latencies "fit"
> into the schedule.
>
> c) Furthermore, if it turns out that the "arbitrary" order in which the
> producer instructions and consumer instructions are emitted places too
> much "distance" between a producer and the matching consumer, then it is
> possible that by the time the hardware executes the consumer, the stored
> value is no longer in the write buffer, so even though we might have
> "thought" two PARALLEL rtl expressions qualified for the store bypass
> optimization, we really should have returned false.
>
> Can someone help me understand this better?
>
> Thanks much.
>
>
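To make the PARALLEL question above concrete, here is a made-up
producer with two results feeding a store of only one of them (the
pattern, register numbers, and modes are invented for illustration and
are not taken from any real port):

    ;; Hypothetical out_insn: a PARALLEL setting two registers plus a clobber
    (parallel [(set (reg:SI 3) (div:SI (reg:SI 4) (reg:SI 5)))
               (set (reg:SI 6) (mod:SI (reg:SI 4) (reg:SI 5)))
               (clobber (reg:CC 100))])

    ;; in_insn: stores only the first of the two results
    (set (mem:SI (reg:SI 7)) (reg:SI 3))

Under the "every SET must match" reading described in question 1, a
pair like this would not qualify even though one of the produced values
is exactly the datum being stored, which is the behavior being
questioned.

For context on where the result of store_data_bypass_p is consumed: it
is typically named as the guard of a define_bypass in a port's machine
description, roughly like the following (the latency and reservation
names here are placeholders, not copied from rs6000.md):

    (define_bypass 2 "hypothetical-alu" "hypothetical-store"
                     "store_data_bypass_p")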