Re: VLIW scheduling and delayed branch

2007-12-10 Thread Pranav Bhandarkar
On Dec 9, 2007 2:19 AM, Thomas Sailer [EMAIL PROTECTED] wrote:
> > Has anyone faced a similar problem before? Are there targets for which
> > both VLIW and DBR are enabled? Perhaps ia64?


Ok, this was a long time back, but yes, I have faced a similar problem.
We disabled delayed branch scheduling and used the machdep reorg pass
instead. Moving backwards from the branch instruction, we examined its
dependencies and marked every instruction (and its containing insn
bundle) that the branch depended upon. Then, again moving backwards from
the branch insn, we picked the first insn bundle whose insns were all
unmarked (and whose cycle size equaled the number of delay slots of a
branch insn) and put that bundle into the delay slot.
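
A minimal sketch of the marking scheme described above, in C. This is not
the TMS320C6x code: the structs, the register bitmasks, and the names
mark_branch_deps, pick_delay_bundle and example_pick are all hypothetical.
It tracks only true (read-after-write) dependencies; a real pass would
presumably also have to check anti- and output dependencies against the
bundles that the chosen bundle is moved across.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS 4
#define REG(n) (1u << (n))

/* Hypothetical toy model, not GCC's RTL: registers are bits in a mask.  */
struct insn
{
  unsigned defs;    /* registers written */
  unsigned uses;    /* registers read */
  int marked;       /* nonzero if the branch depends on this insn */
};

struct bundle
{
  struct insn insns[MAX_INSNS];
  size_t n_insns;
  int cycles;       /* cycles the bundle occupies */
};

/* Walk backwards from the branch and mark every insn that (transitively)
   produces a register the branch reads.  */
static void
mark_branch_deps (struct bundle *b, size_t n, unsigned branch_uses)
{
  unsigned needed = branch_uses;

  for (size_t i = n; i-- > 0;)
    {
      unsigned new_uses = 0;

      assert (b[i].n_insns <= MAX_INSNS);
      for (size_t j = 0; j < b[i].n_insns; j++)
        {
          struct insn *in = &b[i].insns[j];

          in->marked = (in->defs & needed) != 0;
          if (in->marked)
            new_uses |= in->uses;
        }
      /* Insns in one bundle issue together, so their inputs only become
         "needed" for bundles that come earlier.  */
      needed |= new_uses;
    }
}

/* Scanning backwards again, return the index of the first bundle with no
   marked insn and the right cycle count, or -1 if there is none.  */
static int
pick_delay_bundle (const struct bundle *b, size_t n, int delay_slots)
{
  for (size_t i = n; i-- > 0;)
    {
      int any_marked = 0;

      for (size_t j = 0; j < b[i].n_insns; j++)
        any_marked |= b[i].insns[j].marked;
      if (!any_marked && b[i].cycles == delay_slots)
        return (int) i;
    }
  return -1;
}

/* Three single-insn bundles followed by a branch that reads r0.  */
int
example_pick (void)
{
  struct bundle block[3] = {
    { .insns = { { .defs = REG (1) } }, .n_insns = 1, .cycles = 1 },
    { .insns = { { .defs = REG (2), .uses = REG (3) } },
      .n_insns = 1, .cycles = 1 },
    { .insns = { { .defs = REG (0), .uses = REG (1) } },
      .n_insns = 1, .cycles = 1 },
  };

  mark_branch_deps (block, 3, REG (0));
  return pick_delay_bundle (block, 3, 1);
}
```

In this example the last bundle computes r0 from r1 and the first one
computes r1, so both get marked; the middle bundle is the only candidate
and example_pick returns its index.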

This approach worked fine for the small testcases that we had, but we
never really tested it on any monstrous piece of software. We
implemented this for the TMS320C6x VLIW DSP.

HTH,
Pranav


Re: VLIW scheduling and delayed branch

2007-12-10 Thread Hariharan Sandanagobalane

Hi Thomas,
Thanks for your reply. A couple of questions below.

Thomas Sailer wrote:
> > Has anyone faced a similar problem before? Are there targets for which
> > both VLIW and DBR are enabled? Perhaps ia64?
>
> I did something similar a few months ago.


What was your target? Is the target code available in GCC mainline? If
not, could you pass your code to me?




> The problem is that haifa and the delayed branch scheduling passes don't
> really fit together. Delayed branch scheduling happily undoes all the
> haifa decisions.
>
> The question is how much you gain by delayed branch scheduling. I don't
> have numbers, but it wasn't much in my case. And since your company name
> is picochip, you certainly value size more than speed ?!


Yeah. We do. But, in our architecture, a branch has to have a delay slot
instruction anyway. In the absence of one, we put a nop in there. If
GCC manages to move a single-instruction VLIW into the delay slot, we
would benefit in both size and speed; otherwise, there is no impact on
either.




> I pursued two approaches. The first one was to insert stop bit pseudo
> insns into the RTL stream in machdep reorg, so I didn't have to rely on
> TImode insn flags during output. But then delayed branch scheduling just
> took one insn out of an insn group and put it into the delay slot,
> meaning there was usually no cycle gain at all, just larger code size
> (due to insn duplication).


This seems fairly straightforward to implement.



> The second approach was having lots of parallel insns (using
> match_parallel and a custom predicate). machdep reorg then converts insn
> bundles into a single parallel insn. Delayed branch scheduling then does
> the right thing. This approach works fairly well for me, but there are a
> few complications. My output code is pretty hackish, as I didn't want to
> duplicate outputting a single insn / outputting the same insn as component
> of a parallel insn group.


When do you un-parallel those instructions? And, how?

Regards
Hari



> Tom



Re: VLIW scheduling and delayed branch

2007-12-10 Thread Thomas Sailer

> When do you un-parallel those instructions? And, how?

I don't; I use a C function to output such an insn group.

In that C function, I basically save the global state of final, and use
functions of final.c to output constituent insns.

The insn group output function basically looks like this:

first prepare:
  static char buf[256];
  FILE *old_out_file;
  /* open memory file */
  old_out_file = asm_out_file;
  asm_out_file = fmemopen (buf, sizeof(buf), "w");
  gcc_assert (asm_out_file);


then loop over all constituent insns:
  cleanup_subreg_operands (insn);
  if (! constrain_operands_cached (1))
    fatal_insn_not_found (insn);
  current_output_insn = insn;
  /* Find the proper template for this insn.  */
  template = get_insn_template (insn_code_number, insn);
  gcc_assert (template);
  gcc_assert (!(template[0] == '#' && template[1] == '\0'));
  fprintf (asm_out_file, "\t||");
  output_asm_insn (template, recog_data.operand);
  /* Step back over the trailing newline so the next insn overwrites it.  */
  fseek (asm_out_file, ftell (asm_out_file) - 1, SEEK_SET);

finally cleanup:
  fclose (asm_out_file);
  asm_out_file = old_out_file;
  /* Skip the separator emitted in front of the first insn.  */
  return &buf[4];

That's why I wrote it's kind of hackish :-)  fmemopen also isn't
necessarily very portable, but is needed since all the final output
routines directly output to a FILE *, and I need to intercept that
output.
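
The interception trick above can be tried outside GCC. The following is a
minimal, self-contained sketch in C that assumes a POSIX.1-2008 libc
providing fmemopen. The names out_file, emit_insn_text and format_group are
hypothetical stand-ins for asm_out_file and the final.c routines; none of
this is the actual port code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for GCC's global asm_out_file; everything here is hypothetical.  */
static FILE *out_file;

/* Stand-in for an output routine that can only write to out_file.  */
static void
emit_insn_text (const char *text)
{
  fprintf (out_file, "\t%s\n", text);
}

/* Format N insns as one "||"-separated group in BUF and return the text.  */
const char *
format_group (const char *const *insns, size_t n, char *buf, size_t size)
{
  FILE *saved = out_file;
  size_t len;

  assert (n > 0 && size > 4);
  memset (buf, 0, size);
  out_file = fmemopen (buf, size, "w");
  assert (out_file);

  for (size_t i = 0; i < n; i++)
    {
      fputs ("\t||", out_file);
      emit_insn_text (insns[i]);
      /* Step back over the newline so the next write overwrites it.  */
      fseek (out_file, ftell (out_file) - 1, SEEK_SET);
    }

  fclose (out_file);
  out_file = saved;

  /* Drop the last newline if the library left it in the buffer.  */
  len = strlen (buf);
  if (len > 0 && buf[len - 1] == '\n')
    buf[len - 1] = '\0';

  return buf + 4;   /* skip the "\t||\t" in front of the first insn */
}
```

Called with two insn strings and a buffer of, say, 64 bytes, format_group
returns both insns on one line with the "||" separator between them. The
explicit newline cleanup after fclose is a choice made in this sketch so
the result does not depend on where the library places the terminating
null byte.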

Tom




VLIW scheduling and delayed branch

2007-12-08 Thread Hariharan Sandanagobalane

Hi,
I am trying to enable delayed branch scheduling on our port of GCC for
picochip (16-bit VLIW DSP). I understand that delayed-branch is run as a
separate pass after the DFA scheduling is done. We basically depend on
the TImode set on the cycle-start instructions to decide which
instructions form a valid VLIW. With delayed-branch enabled, the
delay-branch pass seems to take any instruction and put it in the
delay slot. It sometimes picks an instruction that has TImode set, but
does not seem to set TImode on the next instruction.
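
The effect described above can be shown in a toy model. The sketch below is
plain C, not GCC code: struct stream_insn, its starts_cycle flag (a
stand-in for the TImode marker), and the functions count_bundles,
remove_insn and example_bundles are all hypothetical. It only illustrates
why taking a cycle-start insn out of the stream merges two bundles unless
the marker is handed to the following insn.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct stream_insn
{
  const char *text;
  int starts_cycle;   /* stand-in for the TImode marker */
};

/* Every marked insn opens a new bundle, so counting markers counts bundles.  */
static size_t
count_bundles (const struct stream_insn *s, size_t n)
{
  size_t bundles = 0;

  for (size_t i = 0; i < n; i++)
    if (s[i].starts_cycle)
      bundles++;
  return bundles;
}

/* Remove the insn at IDX and return the new length.  If FIXUP is nonzero
   and the removed insn opened a bundle, hand the marker to its successor.  */
static size_t
remove_insn (struct stream_insn *s, size_t n, size_t idx, int fixup)
{
  int was_head;

  assert (idx < n);
  was_head = s[idx].starts_cycle;
  memmove (&s[idx], &s[idx + 1], (n - idx - 1) * sizeof *s);
  n--;
  if (fixup && was_head && idx < n)
    s[idx].starts_cycle = 1;
  return n;
}

/* Two bundles {A, B} and {C, D}; take C away as a delay-slot filler would.  */
size_t
example_bundles (int fixup)
{
  struct stream_insn s[4] = {
    { "A", 1 }, { "B", 0 }, { "C", 1 }, { "D", 0 },
  };
  size_t n = remove_insn (s, 4, 2, fixup);

  return count_bundles (s, n);
}
```

Without the fixup, D silently joins the bundle of A and B; with it, D
starts a bundle of its own and the original cycle boundary survives.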


Has anyone faced a similar problem before? Are there targets for which 
both VLIW and DBR are enabled? Perhaps ia64?


Thanks for your help.

Regards
Hari


Re: VLIW scheduling and delayed branch

2007-12-08 Thread Thomas Sailer
> Has anyone faced a similar problem before? Are there targets for which
> both VLIW and DBR are enabled? Perhaps ia64?

I did something similar a few months ago.

The problem is that haifa and the delayed branch scheduling passes don't
really fit together. Delayed branch scheduling happily undoes all the
haifa decisions.

The question is how much you gain by delayed branch scheduling. I don't
have numbers, but it wasn't much in my case. And since your company name
is picochip, you certainly value size more than speed ?!

I pursued two approaches. The first one was to insert stop bit pseudo
insns into the RTL stream in machdep reorg, so I didn't have to rely on
TImode insn flags during output. But then delayed branch scheduling just
took one insn out of an insn group and put it into the delay slot,
meaning there was usually no cycle gain at all, just larger code size
(due to insn duplication).

The second approach was having lots of parallel insns (using
match_parallel and a custom predicate). machdep reorg then converts insn
bundles into a single parallel insn. Delayed branch scheduling then does
the right thing. This approach works fairly well for me, but there are a
few complications. My output code is pretty hackish, as I didn't want to
duplicate outputting a single insn / outputting the same insn as component
of a parallel insn group.
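
The grouping step of the second approach can be pictured with a small C
sketch. It works on a flat array instead of RTL, and the names flat_insn,
group, fuse_bundles and example_fuse are hypothetical; the real port builds
PARALLEL insns in machdep reorg, which is not shown here. The point is only
that once each bundle is a single unit, a later pass that moves whole units
cannot split a bundle.

```c
#include <assert.h>
#include <stddef.h>

struct flat_insn
{
  const char *text;
  int starts_cycle;   /* stand-in for the TImode marker */
};

/* One fused bundle: a run of insns in the flat stream.  */
struct group
{
  size_t first;       /* index of the first insn in the run */
  size_t len;         /* number of insns in the run */
};

/* Fuse the flat stream into groups, opening a new group at every marked
   insn.  OUT must have room for N entries.  Returns the group count.  */
static size_t
fuse_bundles (const struct flat_insn *s, size_t n, struct group *out)
{
  size_t n_groups = 0;

  assert (out != NULL);
  for (size_t i = 0; i < n; i++)
    {
      if (s[i].starts_cycle || n_groups == 0)
        {
          out[n_groups].first = i;
          out[n_groups].len = 0;
          n_groups++;
        }
      out[n_groups - 1].len++;
    }
  return n_groups;
}

/* Five insns forming the bundles {A, B} and {C, D, E}.  */
size_t
example_fuse (struct group *out)
{
  static const struct flat_insn s[5] = {
    { "A", 1 }, { "B", 0 }, { "C", 1 }, { "D", 0 }, { "E", 0 },
  };

  return fuse_bundles (s, 5, out);
}
```

A delay-slot filler working on the resulting group array would then move
a whole group or nothing, which is the property the parallel insns provide.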

Tom