On Tue, 20 Apr 2021, Xionghu Luo wrote:

> 
> 
> On 2021/4/15 19:34, Richard Biener wrote:
> > On Thu, 15 Apr 2021, Xionghu Luo wrote:
> > 
> >> Thanks,
> >>
> >> On 2021/4/14 14:41, Richard Biener wrote:
> >>>> "#538,#235,#234,#233" will all be sunk from bb 35 to bb 37 by rtl-sink,
> >>>> but it moves #538 first, then #235, there is strong dependency here. It
> >>>> seemsdoesn't like the LCM framework that could solve all and do the
> >>>> delete-insert in one iteration.
> >>> So my question was whether we want to do both within the LCM store
> >>> sinking framework.  The LCM dataflow is also used by RTL PRE which
> >>> handles both loads and non-loads so in principle it should be able
> >>> to handle stores and non-stores for the sinking case (PRE on the
> >>> reverse CFG).
> >>>
> >>> A global dataflow is more powerful than any local ad-hoc method.
> >>
> >> My biggest concern is whether the LCM DF framework could support sinking
> >> *multiple* reverse-dependent non-store instructions together by *one*
> >> calling of LCM DF.   If this is not supported, we need run multiple LCM
> >> until no new changes, it would be time consuming obviously (unless
> >> compiling time is not important here).
> > 
> > As said it is used for PRE and there it most definitely can do that.
> 
> I did some investigation about PRE and attached a case to show how it
> works, it is quite like store-motion, and actually there is a rtl-hoist
> pass in gcse.c which only works for code size.  All of them are
> leveraging the LCM framework to move instructions upward or downward.
> 
> PRE and rtl-hoist move instructions upward, they analyze/hash the SOURCE
> exprs and call pre_edge_lcm, store-motion and rtl-sink move instructions
> downward, so they analyze/hash the DEST exprs and call pre_edge_rev_lcm.
> The four problems are all converted to the LCM DF problem with
> n_basic_blocks * m_exprs of 4 matrix (antic, transp, avail, kill) as input
> and two outputs of where to insert/delete.
> 
> PRE scan each instruction and hash the SRC to table without *checking the
> relationship between instructions*, for the case attached, BB 37, BB 38
> and BB 41 both contains SOURCE expr "r262:DI+r139:DI", but BB 37 and BB 41
> save it to index 106, BB 38 save it to index 110. After finishing this pass,
> "r262:DI+r139:DI" BB41 is replaced with "r194:DI=r452:DI", then insert
> expr to BB 75~BB 80 to create full redundancies from partial redundancies,
> finally update instruction in BB 37.

I'm not familiar with the actual PRE code but reading the toplevel comment
it seems that indeed it can only handle expressions contained in a single
insn unless a REG_EQUAL note provides a short-hand for the larger one.

That of course means it would need to mark things as not transparent
for correctness where they'd be if moved together.  Now, nothing
prevents you changing the granularity of what you feed LCM.

So originally we arrived at looking into LCM because there's already
a (store) sinking pass on RTL (using LCM) so adding another (loop-special)
one didn't look like the best obvious solution.

That said, LCM would work for single-instruction expressions.  
Alternatively a greedy algorithm like you prototyped could be used.
Another pass to look at would be RTL invariant motion which seems to
compute some kind of dependency graph - not sure if that would be
adaptable for the reverse CFG problem.

> Issues witnessed in the PRE run:
> 1) "r262:DI+r139:DI" in BLOCK 38 is not replaced;
> 2) PRE also use pseudo register to store the result like store-motion and
> even rtl-hoist. Actually we need real "move" instead of "replace" for
> rtl-sink to improve performance though also potential register pressure issue
> like rtl-hoist?
> 3) SRC instruction is scanned WITHOUT back chain check cross instructions
> in hash_scan_set, it couldn't handle below case though a+c is same with b+d.
> So I am afraid single run couldn't solve the instruction dependency issue
> to sink multiple instructions out as before for rtl-sink.
> 
> BB1:
> a = b;
> c = d;
> s = a + c;
> BB2:
> s = b + d;
> 
> 
> gcse.c:
> changed = one_pre_gcse_pass ()
> alloc_hash_table (&expr_hash_table);
> compute_hash_table (&expr_hash_table);
> compute_hash_table_work (table);
>   FOR_EACH_BB_FN (current_bb, cfun)
>    FOR_BB_INSNS (current_bb, insn)
>     hash_scan_insn (insn, table);
>       hash_scan_set (pat, insn, table);
>        if REG_P (dest)
>          insert_expr_in_table (src, GET_MODE (dest), insn, antic_p, avail_p,
>          max_distance, table);
>            hash = hash_expr (x, mode, &do_not_record_p, table->size);
> dump_hash_table (dump_file, "Expression", &expr_hash_table);
> edge_list = compute_pre_data ();
> compute_local_properties (transp, comp, antloc, &expr_hash_table);
> edge_list = pre_edge_lcm (expr_hash_table.n_elems, transp, comp, antloc,
> ae_kill, &pre_insert_map, &pre_delete_map);
> changed |= pre_gcse (edge_list);
> changed = pre_delete ();  /* Create a pseudo-reg to store the result of
> reaching expressions into. */
> did_insert = pre_edge_insert (edge_list, index_map); 
> 
> > 
> >>
> >>>
> >>> Richard.
> >>>
> >>>> However, there are still some common methods could be shared, like the
> >>>> def-use check(though store-motion is per bb, rtl-sink is per loop),
> >>>> insert_store, commit_edge_insertions etc.
> >>>>
> >>>>
> >>>>     508: L508:
> >>>>     507: NOTE_INSN_BASIC_BLOCK 34
> >>>>      12: r139:DI=r140:DI
> >>>>         REG_DEAD r140:DI
> >>>>     240: L240:
> >>>>     231: NOTE_INSN_BASIC_BLOCK 35
> >>>>     232: r142:DI=zero_extend(r139:DI#0)
> >>>>     233: r371:SI=r142:DI#0-0x1
> >>>>     234: r243:DI=zero_extend(r371:SI)
> >>>>         REG_DEAD r371:SI
> >>>>     235: r452:DI=r262:DI+r139:DI
> >>>>     538: r194:DI=r452:DI
> >>>>     236: r372:CCUNS=cmp(r142:DI#0,r254:DI#0)
> >>
> >>
> >> Like here, Each instruction's dest reg is calculated in the input vector
> >> bitmap, after solving the equations by calling pre_edge_rev_lcm,
> >> move #538 out of loop for the first call, then move #235 out of loop
> >> after a second call... 4 repeat calls needed in total here, is the LCM
> >> framework smart enough to move the all 4 instruction within one iteration?
> >> I am worried that the input vector bitmap couldn't solve the dependency
> >> problem for two back chained instructions.
> >>
> >>
> >>
> > 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Reply via email to