2011/8/8 Richard Sandiford <[email protected]>
>
> Ayal Zaks <[email protected]> writes:
> >> OK. For the follow-on iv patch, it seemed easier to keep both bounds
> >> inclusive for the loop, then make the "end" exclusive when setting the
> >> out parameters. (Note that there shouldn't be any problem with overflow
> >> when making the bound exclusive, because the size of the range has been
> >> restricted to "ii" by that stage. The follow-on patch does use
> >> saturating maths for all ops though.)
> >
> > Sure, no problem having "end" inclusive, as it simplifies things. We
> > can even keep it inclusive all the way through, iterating over: for (c
> > = start; c != end+step; c += step)...
>
> Yeah, I'd prefer that too. I might do it once Revital's and
> Roman's changes are in.
>
> >> I don't have powerpc hardware that I can do meaningful performance
> >> testing on, but I did run it through a Popular* Embedded Benchmark
> >> on an ARM Cortex-A8 board with -O3 -fmodulo-sched
> >> -fmodulo-sched-allow-regmoves. There were no changes. (And this is
> >> a benchmark that does benefit from modulo scheduling, in some cases
> >> by a significant amount.)
> >>
> > Ok, the patch should be neutral performance-wise. One could check btw
> > if SMS produces exactly the same output in both cases, even w/o a
> > powerpc (or any other) HW.
>
> OK. The patch does change the output a little because of the MEM_DEP
> range thing.
Ahh, right.
>
> To compensate for that, I tried compiling libav on ARM with:
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c 2011-08-05 11:23:16.000000000 +0100
> +++ gcc/modulo-sched.c 2011-08-05 11:27:57.000000000 +0100
> @@ -1680,7 +1680,7 @@ get_sched_window (partial_schedule_ptr p
> p_st, early_start, e->latency);
>
> if (e->data_type == MEM_DEP)
> - end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> + end = MIN (end, SCHED_TIME (v_node) + ii);
> }
> else if (dump_file)
> fprintf (dump_file, "the node is not scheduled\n");
> @@ -1729,7 +1729,7 @@ get_sched_window (partial_schedule_ptr p
> s_st, late_start, e->latency);
>
> if (e->data_type == MEM_DEP)
> - end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> + end = MAX (end, SCHED_TIME (v_node) - ii);
> if (dump_file)
> fprintf (dump_file, "end = %d\n", end);
>
> @@ -1791,7 +1791,7 @@ get_sched_window (partial_schedule_ptr p
> count_preds++;
>
> if (e->data_type == MEM_DEP)
> - end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> + end = MIN (end, SCHED_TIME (v_node) + ii);
> }
> else if (dump_file)
> fprintf (dump_file, "the node is not scheduled\n");
>
> applied, then tried the same thing with the revised patch below.
> The output was the same.
ok, that's a good sanity check.
>
> (FWIW, libav did show up extra differences when using the patch
> that I'd originally submitted. They were due to the count_preds
> and count_succs thing that you picked up in your review.)
(These differences had no noticable consequences performance-wise, right?)
>
> >> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the
> >> additional patch:
> >>
> >> Index: gcc/opts.c
> >> ===================================================================
> >> --- gcc/opts.c 2011-08-03 10:56:50.000000000 +0100
> >> +++ gcc/opts.c 2011-08-03 10:56:50.000000000 +0100
> >> @@ -449,6 +449,8 @@ static const struct default_options defa
> >> { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
> >> { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
> >> { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
> >> + { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 },
> >> + { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 },
> >>
> >> /* -O2 optimizations. */
> >> { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
> >>
> >> applied for both the "before" and "after" runs. OK to install?
> >
> >
> > Yes, with just a couple of minor non-mandatory comments:
> > 1. Setting "count_preds = psp_not_empty;" and "count_succs =
> > pss_not_empty;" suggests that you want to decrement non-interesting
> > neighbors instead of incrementing interesting ones. Otherwise can
> > initialize to zero (doesn't really matter, just a bit confusing).
>
> Yeah, good point. I initialised them to zero, like you said,
> and changed the reversal condition to:
>
> if (pss_not_empty && count_succs >= count_preds)
>
> Which I have admit makes a lot more sense than what I submitted,
> and avoids some unwanted changes in output.
>
> > 2. I find it easier to read "late_start = MIN (late_start, early_start
> > + (ii - 1));" instead of "late_start = MIN (early_start + (ii - 1),
> > late_start);".
>
> Oops, that was accidental. Fixed.
>
> > 3. Suggest to set step=1 at the beginning, explaining from the start
> > that we first compute a forward range (start<end), and may later
> > possibly revert it.
>
> Yeah, that's better.
>
> > 4. "late_start" would better be renamed "late_end".
>
> Since you said this was optional, I decided to leave it as it is.
> The name and value of "late start" come directly from the paper.
Ok. We can write another paper ;-)
>
> > 5. While we're at it, as you've unified the treatments, we can now
> > actually do with "start" and "end" only, and do away with early_start
> > and late_start altogether...
>
> The IV patch needs the same distinction as we have now, so I'd prefer
> to keep it.
Ok, looking forward to the IV patch then...
>
> Thanks for the review. Here's what I applied after testing on
> powerpc-ibm-aix5.3.
>
Thanks for looking into this,
Ayal.
>
> Richard
>
>
> gcc/
> * modulo-sched.c (get_sched_window): Use just one loop for predecessors
> and one loop for successors. Fix upper bound of memory range.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> *** gcc/modulo-sched.c 2011-08-05 10:39:27.000000000 +0100
> --- gcc/modulo-sched.c 2011-08-05 10:40:49.000000000 +0100
> *************** #define MAX_SPLIT_NUM 10
> *** 1630,1638 ****
>
> static int
> get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node,
> ! sbitmap sched_nodes, int ii, int *start_p, int *step_p, int
> *end_p)
> {
> int start, step, end;
> ddg_edge_ptr e;
> sbitmap psp = sbitmap_alloc (ps->g->num_nodes);
> sbitmap pss = sbitmap_alloc (ps->g->num_nodes);
> --- 1630,1640 ----
>
> static int
> get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node,
> ! sbitmap sched_nodes, int ii, int *start_p, int *step_p,
> ! int *end_p)
> {
> int start, step, end;
> + int early_start, late_start;
> ddg_edge_ptr e;
> sbitmap psp = sbitmap_alloc (ps->g->num_nodes);
> sbitmap pss = sbitmap_alloc (ps->g->num_nodes);
> *************** get_sched_window (partial_schedule_ptr p
> *** 1640,1645 ****
> --- 1642,1649 ----
> sbitmap u_node_succs = NODE_SUCCESSORS (u_node);
> int psp_not_empty;
> int pss_not_empty;
> + int count_preds;
> + int count_succs;
>
> /* 1. compute sched window for u (start, end, step). */
> sbitmap_zero (psp);
> *************** get_sched_window (partial_schedule_ptr p
> *** 1647,1861 ****
> psp_not_empty = sbitmap_a_and_b_cg (psp, u_node_preds, sched_nodes);
> pss_not_empty = sbitmap_a_and_b_cg (pss, u_node_succs, sched_nodes);
>
> ! if (psp_not_empty && !pss_not_empty)
> ! {
> ! int early_start = INT_MIN;
> !
> ! end = INT_MAX;
> ! for (e = u_node->in; e != 0; e = e->next_in)
> ! {
> ! ddg_node_ptr v_node = e->src;
> !
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge: ");
> ! print_ddg_edge (dump_file, e);
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in psp_not_empty,"
> ! " checking p %d (%d): ", u_node->cuid,
> ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
> !
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int p_st = SCHED_TIME (v_node);
> !
> ! early_start =
> ! MAX (early_start, p_st + e->latency - (e->distance * ii));
> !
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "pred st = %d; early_start = %d; latency: %d",
> ! p_st, early_start, e->latency);
> !
> ! if (e->data_type == MEM_DEP)
> ! end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
> ! }
> ! start = early_start;
> ! end = MIN (end, early_start + ii);
> ! /* Schedule the node close to it's predecessors. */
> ! step = 1;
> !
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in a window (%d..%d) with step %d\n",
> ! u_node->cuid, INSN_UID (u_node->insn), start, end, step);
> ! }
> !
> ! else if (!psp_not_empty && pss_not_empty)
> ! {
> ! int late_start = INT_MAX;
> !
> ! end = INT_MIN;
> ! for (e = u_node->out; e != 0; e = e->next_out)
> ! {
> ! ddg_node_ptr v_node = e->dest;
> !
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge:");
> ! print_ddg_edge (dump_file, e);
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in pss_not_empty,"
> ! " checking s %d (%d): ", u_node->cuid,
> ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
> !
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int s_st = SCHED_TIME (v_node);
> !
> ! late_start = MIN (late_start,
> ! s_st - e->latency + (e->distance * ii));
> !
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "succ st = %d; late_start = %d; latency = %d",
> ! s_st, late_start, e->latency);
> !
> ! if (e->data_type == MEM_DEP)
> ! end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> ! if (dump_file)
> ! fprintf (dump_file, "end = %d\n", end);
> !
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
>
> ! }
> ! start = late_start;
> ! end = MAX (end, late_start - ii);
> ! /* Schedule the node close to it's successors. */
> ! step = -1;
>
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in a window (%d..%d) with step %d\n",
> ! u_node->cuid, INSN_UID (u_node->insn), start, end, step);
>
> ! }
>
> ! else if (psp_not_empty && pss_not_empty)
> ! {
> ! int early_start = INT_MIN;
> ! int late_start = INT_MAX;
> ! int count_preds = 0;
> ! int count_succs = 0;
>
> ! start = INT_MIN;
> ! end = INT_MAX;
> ! for (e = u_node->in; e != 0; e = e->next_in)
> ! {
> ! ddg_node_ptr v_node = e->src;
>
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge:");
> ! print_ddg_edge (dump_file, e);
> fprintf (dump_file,
> ! "\nScheduling %d (%d) in psp_pss_not_empty,"
> ! " checking p %d (%d): ", u_node->cuid, INSN_UID
> ! (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
> !
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int p_st = SCHED_TIME (v_node);
> !
> ! early_start = MAX (early_start,
> ! p_st + e->latency
> ! - (e->distance * ii));
> !
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "pred st = %d; early_start = %d; latency = %d",
> ! p_st, early_start, e->latency);
> !
> ! if (e->type == TRUE_DEP && e->data_type == REG_DEP)
> ! count_preds++;
>
> ! if (e->data_type == MEM_DEP)
> ! end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
>
> ! }
> ! for (e = u_node->out; e != 0; e = e->next_out)
> ! {
> ! ddg_node_ptr v_node = e->dest;
>
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge:");
> ! print_ddg_edge (dump_file, e);
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in psp_pss_not_empty,"
> ! " checking s %d (%d): ", u_node->cuid, INSN_UID
> ! (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
>
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int s_st = SCHED_TIME (v_node);
>
> ! late_start = MIN (late_start,
> ! s_st - e->latency
> ! + (e->distance * ii));
>
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "succ st = %d; late_start = %d; latency = %d",
> ! s_st, late_start, e->latency);
>
> ! if (e->type == TRUE_DEP && e->data_type == REG_DEP)
> ! count_succs++;
>
> ! if (e->data_type == MEM_DEP)
> ! start = MAX (start, SCHED_TIME (v_node) - ii + 1);
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
>
> ! }
> ! start = MAX (start, early_start);
> ! end = MIN (end, MIN (early_start + ii, late_start + 1));
> ! step = 1;
> ! /* If there are more successors than predecessors schedule the
> ! node close to it's successors. */
> ! if (count_succs >= count_preds)
> ! {
> ! int old_start = start;
>
> ! start = end - 1;
> ! end = old_start - 1;
> ! step = -1;
> ! }
> ! }
> ! else /* psp is empty && pss is empty. */
> ! {
> ! start = SCHED_ASAP (u_node);
> ! end = start + ii;
> ! step = 1;
> }
>
> *start_p = start;
> *step_p = step;
> *end_p = end;
> --- 1651,1772 ----
> psp_not_empty = sbitmap_a_and_b_cg (psp, u_node_preds, sched_nodes);
> pss_not_empty = sbitmap_a_and_b_cg (pss, u_node_succs, sched_nodes);
>
> ! /* We first compute a forward range (start <= end), then decide whether
> ! to reverse it. */
> ! early_start = INT_MIN;
> ! late_start = INT_MAX;
> ! start = INT_MIN;
> ! end = INT_MAX;
> ! step = 1;
> !
> ! count_preds = 0;
> ! count_succs = 0;
> !
> ! /* Calculate early_start and limit end. Both bounds are inclusive. */
> ! if (psp_not_empty)
> ! for (e = u_node->in; e != 0; e = e->next_in)
> ! {
> ! ddg_node_ptr v_node = e->src;
>
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge: ");
> ! print_ddg_edge (dump_file, e);
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in psp_not_empty,"
> ! " checking p %d (%d): ", u_node->cuid,
> ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
>
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int p_st = SCHED_TIME (v_node);
>
> ! early_start = MAX (early_start,
> ! p_st + e->latency - (e->distance * ii));
>
> ! if (e->data_type == MEM_DEP)
> ! end = MIN (end, p_st + ii - 1);
>
> ! if (e->type == TRUE_DEP && e->data_type == REG_DEP)
> ! count_preds++;
>
> ! if (dump_file)
> fprintf (dump_file,
> ! "pred st = %d; early_start = %d; latency: %d;"
> ! " end: %d\n", p_st, early_start, e->latency, end);
>
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
> ! }
>
> ! /* Calculate late_start and limit start. Both bounds are inclusive. */
> ! if (pss_not_empty)
> ! for (e = u_node->out; e != 0; e = e->next_out)
> ! {
> ! ddg_node_ptr v_node = e->dest;
>
> ! if (dump_file)
> ! {
> ! fprintf (dump_file, "\nProcessing edge:");
> ! print_ddg_edge (dump_file, e);
> ! fprintf (dump_file,
> ! "\nScheduling %d (%d) in pss_not_empty,"
> ! " checking s %d (%d): ", u_node->cuid,
> ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID
> ! (v_node->insn));
> ! }
>
> ! if (TEST_BIT (sched_nodes, v_node->cuid))
> ! {
> ! int s_st = SCHED_TIME (v_node);
>
> ! late_start = MIN (late_start,
> ! s_st - e->latency + (e->distance * ii));
>
> ! if (e->data_type == MEM_DEP)
> ! start = MAX (start, s_st - ii + 1);
>
> ! if (e->type == TRUE_DEP && e->data_type == REG_DEP)
> ! count_succs++;
>
> ! if (dump_file)
> ! fprintf (dump_file,
> ! "succ st = %d; late_start = %d; latency = %d;"
> ! " start=%d", s_st, late_start, e->latency, start);
>
> ! }
> ! else if (dump_file)
> ! fprintf (dump_file, "the node is not scheduled\n");
> ! }
>
> ! /* Get a target scheduling window no bigger than ii. */
> ! if (early_start == INT_MIN && late_start == INT_MAX)
> ! early_start = SCHED_ASAP (u_node);
> ! else if (early_start == INT_MIN)
> ! early_start = late_start - (ii - 1);
> ! late_start = MIN (late_start, early_start + (ii - 1));
> !
> ! /* Apply memory dependence limits. */
> ! start = MAX (start, early_start);
> ! end = MIN (end, late_start);
> !
> ! /* If there are at least as many successors as predecessors, schedule the
> ! node close to its successors. */
> ! if (pss_not_empty && count_succs >= count_preds)
> ! {
> ! int tmp = end;
> ! end = start;
> ! start = tmp;
> ! step = -1;
> }
>
> + /* Now that we've finalized the window, make END an exclusive rather
> + than an inclusive bound. */
> + end += step;
> +
> *start_p = start;
> *step_p = step;
> *end_p = end;
> *************** get_sched_window (partial_schedule_ptr p
> *** 1867,1876 ****
> if (dump_file)
> fprintf (dump_file, "\nEmpty window: start=%d, end=%d, step=%d\n",
> start, end, step);
> ! return -1;
> }
>
> ! return 0;
> }
>
> /* Calculate MUST_PRECEDE/MUST_FOLLOW bitmaps of U_NODE; which is the
> --- 1778,1787 ----
> if (dump_file)
> fprintf (dump_file, "\nEmpty window: start=%d, end=%d, step=%d\n",
> start, end, step);
> ! return -1;
> }
>
> ! return 0;
> }
>
> /* Calculate MUST_PRECEDE/MUST_FOLLOW bitmaps of U_NODE; which is the