Re: VLIW scheduling and delayed branch
On Dec 9, 2007 2:19 AM, Thomas Sailer [EMAIL PROTECTED] wrote:
> Has anyone faced a similar problem before? Are there targets for which both VLIW and DBR are enabled? Perhaps ia64?

OK, this was a long time back, but yes, I have faced a similar problem. We disabled delayed branch scheduling and used the machdep reorg pass instead.

We examined the dependencies of the branch instruction, moving backwards from it and marking every instruction (and its containing insn bundle) that the branch depended upon. Then, again moving backwards from the branch insn, we picked the first insn bundle in which all insns were unmarked (and whose cycle size equalled the number of delay slots of the branch insn) and put that bundle into the delay slot.

This approach worked fine for the small testcases that we had, but we really didn't test it on any monstrous piece of software. We implemented this for the TMS320C6x VLIW DSP.

HTH,
Pranav
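The two backward walks described above can be sketched roughly as follows. This is only an illustrative model, not the TMS320C6x port's code: `struct insn`, `struct bundle`, the register bitmasks and `pick_delay_slot_bundle` are all hypothetical names, and real RTL dependence analysis is far richer than a def/use mask. The sketch also models only the check against the branch itself; a real pass would presumably also have to make sure the chosen bundle can legally move past the bundles that follow it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSNS   4    /* hypothetical issue width */
#define MAX_BUNDLES 64   /* hypothetical limit for this sketch */

/* Hypothetical insn model: bit i set means register i. */
struct insn {
    uint32_t defs;   /* registers written */
    uint32_t uses;   /* registers read */
};

/* Hypothetical bundle model: insns issued together. */
struct bundle {
    struct insn insns[MAX_INSNS];
    size_t n_insns;
    unsigned cycles;   /* cycles the bundle occupies */
};

/* Return the index of the bundle to move into the delay slot, or -1.
   bundles[n_bundles - 1] is the bundle closest to the branch. */
int pick_delay_slot_bundle(const struct bundle *bundles, size_t n_bundles,
                           uint32_t branch_uses, unsigned delay_slots)
{
    bool marked[MAX_BUNDLES] = { false };
    uint32_t needed = branch_uses;

    assert(n_bundles <= MAX_BUNDLES);

    /* Walk 1: going backwards, mark every bundle containing an insn that
       defines a register the branch (transitively) depends on. */
    for (size_t i = n_bundles; i-- > 0; ) {
        uint32_t new_uses = 0;
        for (size_t j = 0; j < bundles[i].n_insns; j++) {
            const struct insn *in = &bundles[i].insns[j];
            if (in->defs & needed) {
                marked[i] = true;
                new_uses |= in->uses;   /* its inputs are needed too */
            }
        }
        needed |= new_uses;
    }

    /* Walk 2: going backwards again, take the first unmarked bundle whose
       cycle count matches the number of delay slots. */
    for (size_t i = n_bundles; i-- > 0; )
        if (!marked[i] && bundles[i].cycles == delay_slots)
            return (int) i;

    return -1;
}
```

The marking is deliberately conservative: registers are only ever added to `needed`, so a bundle may be rejected even though a more precise analysis would allow it.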
Re: VLIW scheduling and delayed branch
Hi Thomas,

Thanks for your reply. A couple of questions below.

Thomas Sailer wrote:
>> Has anyone faced a similar problem before? Are there targets for which both VLIW and DBR are enabled? Perhaps ia64?
>
> I did something similar a few months ago.

What was your target? Is the target code available in GCC mainline? If not, could you pass your code to me?

> The problem is that haifa and the delayed branch scheduling passes don't really fit together. Delayed branch scheduling happily undoes all the haifa decisions. The question is how much you gain by delayed branch scheduling. I don't have numbers, but it wasn't much in my case. And since your company name is picochip, you certainly value size more than speed?!

Yeah, we do. But in our architecture, a branch has to have a delay slot instruction anyway. In the absence of one, we put a nop in there. If GCC manages to move a single-instruction VLIW into the delay slot, we would benefit in both size and speed; otherwise, there is simply no impact on either.

> I pursued two approaches. The first one was to insert stop bit pseudo insns into the RTL stream in machdep reorg, so I didn't have to rely on TImode insn flags during output. But then delayed branch scheduling just took one insn out of an insn group and put it into the delay slot, meaning there was usually no cycle gain at all, just larger code size (due to insn duplication).

This seems fairly straightforward to implement.

> The second approach was having lots of parallel insns (using match parallel and a custom predicate). machdep reorg then converts insn bundles into a single parallel insn. Delayed branch scheduling then does the right thing. This approach works fairly well for me, but there are a few complications. My output code is pretty hackish, as I didn't want to duplicate outputting a single insn / outputting the same insn as component of a parallel insn group.

When do you un-parallel those instructions? And how?

Regards
Hari

> Tom
Re: VLIW scheduling and delayed branch
> When do you un-parallel those instructions? And, how?

I don't; I use a C function to output such an insn group. In that C function, I basically save the global state of final, and use functions of final.c to output the constituent insns. The insn group output function basically looks like this:

first prepare:

    static char buf[256];
    FILE *old_out_file;

    /* open memory file */
    old_out_file = asm_out_file;
    asm_out_file = fmemopen (buf, sizeof(buf), "w");
    gcc_assert (asm_out_file);

then loop over all constituent insns:

    cleanup_subreg_operands (insn);
    if (! constrain_operands_cached (1))
      fatal_insn_not_found (insn);
    current_output_insn = insn;
    /* Find the proper template for this insn.  */
    template = get_insn_template (insn_code_number, insn);
    gcc_assert (template);
    gcc_assert (!(template[0] == '#' && template[1] == '\0'));
    fprintf (asm_out_file, "\t||");
    output_asm_insn (template, recog_data.operand);
    /* Back up over the last character written.  */
    fseek (asm_out_file, ftell (asm_out_file) - 1, SEEK_SET);

finally cleanup:

    fclose (asm_out_file);
    asm_out_file = old_out_file;
    return &buf[4];

That's why I wrote it's kind of hackish :-) fmemopen also isn't necessarily very portable, but it is needed since all the final output routines write directly to a FILE *, and I need to intercept that output.

Tom
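For readers who want to try the interception trick outside of GCC, here is a minimal standalone sketch of the same idea, assuming a POSIX system that provides fmemopen. The names `out_file`, `emit_insn_text` and `format_group` are made up for illustration and merely stand in for `asm_out_file`, the final.c output routines and the insn group output function. Unlike the fragment above, it writes the separator only between insns and terminates the string explicitly, so it does not depend on where fmemopen places its null byte.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a global output stream such as asm_out_file. */
static FILE *out_file;

/* Stand-in for an output routine that can only write to the global
   stream and always terminates the line. */
static void emit_insn_text(const char *text)
{
    fprintf(out_file, "%s\n", text);
}

/* Render n insns as a single "a || b || c" string by temporarily
   redirecting the global stream into a memory buffer. */
const char *format_group(const char *const *insns, size_t n)
{
    static char buf[256];
    FILE *saved = out_file;
    long end;

    memset(buf, 0, sizeof buf);          /* start from a clean buffer */
    out_file = fmemopen(buf, sizeof buf, "w");
    assert(out_file);

    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            fputs(" || ", out_file);
        emit_insn_text(insns[i]);
        /* Step back over the newline so the next write overwrites it. */
        fseek(out_file, ftell(out_file) - 1, SEEK_SET);
    }

    end = ftell(out_file);               /* position of the last newline */
    fclose(out_file);
    out_file = saved;                    /* restore the global state */

    if (end < 0 || (size_t) end >= sizeof buf)
        end = (long) sizeof buf - 1;
    buf[end] = '\0';                     /* drop the trailing newline */
    return buf;
}
```

Calling `format_group` with the two strings `"add r1,r2"` and `"ld r3,(r4)"` yields `"add r1,r2 || ld r3,(r4)"`. Output longer than the fixed buffer is truncated, which a real port would need to handle.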
VLIW scheduling and delayed branch
Hi,

I am trying to enable delayed branch scheduling on our port of GCC for picochip (16-bit VLIW DSP). I understand that delayed-branch scheduling is run as a separate pass after the DFA scheduling is done. We basically depend on the TImode set on the cycle-start instructions to decide which instructions form a valid VLIW.

With delayed-branch enabled, it seems that the delayed-branch pass takes any instruction and puts it in the delay slot. It sometimes seems to pick the instructions that have TImode set, but does not seem to set TImode on the next instruction.

Has anyone faced a similar problem before? Are there targets for which both VLIW and DBR are enabled? Perhaps ia64?

Thanks for your help.

Regards
Hari
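The effect described in this message can be reproduced with a toy model. The sketch below is not GCC code: `struct insn`, its `starts_cycle` flag (standing in for the TImode mark on a cycle-start insn), `count_bundles` and `take_insn` are all hypothetical. It only shows why taking a cycle-start insn out of the stream without passing the mark on to its successor silently merges the rest of that bundle into the previous one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical insn model; starts_cycle stands in for the TImode mark. */
struct insn {
    const char *text;
    bool starts_cycle;
};

/* A bundle begins at every insn carrying the cycle-start mark. */
size_t count_bundles(const struct insn *insns, size_t n)
{
    size_t bundles = 0;
    for (size_t i = 0; i < n; i++)
        if (insns[i].starts_cycle)
            bundles++;
    return bundles;
}

/* Remove insns[idx] from the stream (as if it were moved into a delay
   slot) and return the new length.  With fix_flags set, a removed
   cycle-start insn hands its mark to the insn that followed it. */
size_t take_insn(struct insn *insns, size_t n, size_t idx, bool fix_flags)
{
    assert(idx < n);
    bool was_start = insns[idx].starts_cycle;

    for (size_t i = idx; i + 1 < n; i++)
        insns[i] = insns[i + 1];
    n--;

    if (fix_flags && was_start && idx < n)
        insns[idx].starts_cycle = true;
    return n;
}
```

With the stream `A* B C* D` (a star marking a cycle start), removing `C` without the fix leaves `A* B D`, a single three-insn bundle that the scheduler never approved. With the fix the stream becomes `A* B D*` and the bundle boundary survives.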
Re: VLIW scheduling and delayed branch
> Has anyone faced a similar problem before? Are there targets for which both VLIW and DBR are enabled? Perhaps ia64?

I did something similar a few months ago.

The problem is that haifa and the delayed branch scheduling passes don't really fit together. Delayed branch scheduling happily undoes all the haifa decisions. The question is how much you gain by delayed branch scheduling. I don't have numbers, but it wasn't much in my case. And since your company name is picochip, you certainly value size more than speed?!

I pursued two approaches. The first one was to insert stop bit pseudo insns into the RTL stream in machdep reorg, so I didn't have to rely on TImode insn flags during output. But then delayed branch scheduling just took one insn out of an insn group and put it into the delay slot, meaning there was usually no cycle gain at all, just larger code size (due to insn duplication).

The second approach was having lots of parallel insns (using match parallel and a custom predicate). machdep reorg then converts insn bundles into a single parallel insn. Delayed branch scheduling then does the right thing. This approach works fairly well for me, but there are a few complications. My output code is pretty hackish, as I didn't want to duplicate outputting a single insn / outputting the same insn as component of a parallel insn group.

Tom