On 07/28/2011 06:25 PM, Richard Guenther wrote:
> On Thu, 28 Jul 2011, Tom de Vries wrote:
>
>> On 07/28/2011 12:22 PM, Richard Guenther wrote:
>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>
>>>> On 07/27/2011 05:27 PM, Richard Guenther wrote:
>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>
>>>>>> On 07/27/2011 02:12 PM, Richard Guenther wrote:
>>>>>>> On Wed, 27 Jul 2011, Tom de Vries wrote:
>>>>>>>
>>>>>>>> On 07/27/2011 01:50 PM, Tom de Vries wrote:
>>>>>>>>> Hi Richard,
>>>>>>>>>
>>>>>>>>> I have a patch set for bug 43513 - The stack pointer is adjusted
>>>>>>>>> twice.
>>>>>>>>>
>>>>>>>>> 01_pr43513.3.patch
>>>>>>>>> 02_pr43513.3.test.patch
>>>>>>>>> 03_pr43513.3.mudflap.patch
>>>>>>>>>
>>>>>>>>> The patch set has been bootstrapped and reg-tested on x86_64.
>>>>>>>>>
>>>>>>>>> I will sent out the patches individually.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The patch replaces a vla __builtin_alloca that has a constant argument
>>>>>>>> with an
>>>>>>>> array declaration.
>>>>>>>>
>>>>>>>> OK for trunk?
>>>>>>>
>>>>>>> I don't think it is safe to try to get at the VLA type the way you do.
>>>>>>
>>>>>> I don't understand in what way it's not safe. Do you mean I don't manage
>>>>>> to find
>>>>>> the type always, or that I find the wrong type, or something else?
>>>>>
>>>>> I think you might get the wrong type,
>>>>
>>>> Ok, I'll review that code one more time.
>>>>
>>>>> you also do not transform code
>>>>> like
>>>>>
>>>>> int *p = alloca(4);
>>>>> *p = 3;
>>>>>
>>>>> as there is no array type involved here.
>>>>>
>>>>
>>>> I was trying to stay away from non-vla allocas. A source declared alloca
>>>> has
>>>> function livetime, so we could have a single alloca in a loop, called 10
>>>> times,
>>>> with all 10 instances live at the same time. This patch does not detect
>>>> such
>>>> cases, and thus stays away from non-vla allocas. A vla decl does not have
>>>> such
>>>> problems, the lifetime ends when it goes out of scope.
>>>
>>> Yes indeed - that probably would require more detailed analysis.
>>>
>>>>>>> In fact I would simply do sth like
>>>>>>>
>>>>>>> elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
>>>>>>> n_elem = size * 8 / BITS_PER_UNIT;
>>>>>>> array_type = build_array_type_nelts (elem_type, n_elem);
>>>>>>> var = create_tmp_var (array_type, NULL);
>>>>>>> return fold_convert (TREE_TYPE (lhs), build_fold_addr_expr (var));
>>>>>>>
>>>>>>
>>>>>> I tried this code on the example, and it works, but the newly declared
>>>>>> type has
>>>>>> an 8-bit alignment, while the vla base type has a 32 bit alignment.
>>>>>> This make
>>>>>> the memory access in the example potentially unaligned, which prohibits
>>>>>> an
>>>>>> ivopts optimization, so the resulting text size is 68 instead of the 64
>>>>>> achieved
>>>>>> with my current patch.
>>>>>
>>>>> Ok, so then set DECL_ALIGN of the variable to something reasonable
>>>>> like MIN (size * 8, GET_MODE_PRECISION (word_mode)). Basically the
>>>>> alignment that the targets alloca function would guarantee.
>>>>>
>>>>
>>>> I tried that, but that doesn't help. It's the alignment of the type that
>>>> matters, not of the decl.
>>>
>>> It shouldn't. All accesses are performed with the original types and
>>> alignment comes from that (plus the underlying decl).
>>>
>>
>> I managed to get it all working by using build_aligned_type rather that
>> DECL_ALIGN.
>
> That's really odd, DECL_ALIGN should just work - nothing refers to the
> type of the decl in the IL. Can you try also setting DECL_USER_ALIGN to
> 1 maybe?
>
This doesn't work either.
/* Declare array. */
elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
n_elem = size * 8 / BITS_PER_UNIT;
align = MIN (size * 8, GET_MODE_PRECISION (word_mode));
array_type = build_array_type_nelts (elem_type, n_elem);
var = create_tmp_var (array_type, NULL);
DECL_ALIGN (var) = align;
DECL_USER_ALIGN (var) = 1;
Maybe this clarifies it:
Breakpoint 1, may_be_unaligned_p (ref=0xf7d9d410, step=0xf7d3d578) at
/home/vries/local/google/src/gcc-mainline/gcc/tree-ssa-loop-ivopts.c:1621
(gdb) call debug_generic_expr (ref)
MEM[(int[0:D.2579] *)&D.2595][0]
(gdb) call debug_generic_expr (step)
4
1627 base = get_inner_reference (ref, &bitsize, &bitpos, &toffset, &mode,
(gdb) call debug_generic_expr (base)
D.2595
1629 base_type = TREE_TYPE (base);
(gdb) call debug_generic_expr (base_type)
<unnamed-unsigned:8>[40]
1630 base_align = TYPE_ALIGN (base_type);
(gdb) p base_align
$1 = 8
So the align is 8-bits, and we return true here:
(gdb) n
1632 if (mode != BLKmode)
(gdb) n
1634 unsigned mode_align = GET_MODE_ALIGNMENT (mode);
(gdb)
1636 if (base_align < mode_align
(gdb)
1639 return true;
Here we can see that the base actually has the (user) align on it:
(gdb) call debug_tree (base)
<var_decl 0xf7e1b420 D.2595
type <array_type 0xf7e1b360
type <integer_type 0xf7e1b2a0 public unsigned QI
size <integer_cst 0xf7d3d604 constant 8>
unit size <integer_cst 0xf7d3d620 constant 1>
align 8 symtab 0 alias set -1 canonical type 0xf7e1b2a0 precision 8
min <integer_cst 0xf7dffaf0 0> max <integer_cst 0xf7dffb0c 255>
pointer_to_this <pointer_type 0xf7e1b3c0>>
BLK
size <integer_cst 0xf7d5d070 constant 320>
unit size <integer_cst 0xf7dde2a0 constant 40>
align 8 symtab 0 alias set -1 canonical type 0xf7e1b360
domain <integer_type 0xf7de12a0
type <integer_type 0xf7d51000 unsigned int>
SI
size <integer_cst 0xf7d3d78c constant 32>
unit size <integer_cst 0xf7d3d578 constant 4>
align 32 symtab 0 alias set -1 canonical type 0xf7de12a0
precision 32 min <integer_cst 0xf7d3d594 0>
max <integer_cst 0xf7dde284 39>>
pointer_to_this <pointer_type 0xf7e1b480>>
addressable used ignored BLK file pr43513.c line 4 col 6
size <integer_cst 0xf7d5d070 320> unit size <integer_cst 0xf7dde2a0 40>
user align 32 context <function_decl 0xf7dfd480 foo3>>
>>
>>>> So should we try to find the base type of the vla, and use that, or use the
>>>> nonstandard char type?
>>>
>>> I don't think we can reliably find the base type of the vla - well,
>>> in practice we may because we control how we lower VLAs during
>>> gimplification, but nothing in the IL constraints say that the
>>> resulting pointer type should be special.
>>>
>>> Using a char[] decl shouldn't be a problem IMHO.
>>>
>>>>>>> And obviously you lose the optimization we arrange with inserting
>>>>>>> __builtin_stack_save/restore pairs that way - stack space will no
>>>>>>> longer be shared for subsequent VLAs. Which means that you'd
>>>>>>> better limit the size you allow this promotion.
>>>>>>>
>>>>>>
>>>>>> Right, I could introduce a parameter for this.
>>>>>
>>>>> I would think you could use PARAM_LARGE_STACK_FRAME for now and say,
>>>>> allow a size of PARAM_LARGE_STACK_FRAME / 10?
>>>>>
>>>>
>>>> That unfortunately is too small for the example from bug report. The
>>>> default
>>>> value of the param is 250, so that would be a threshold of 25, and the
>>>> alloca
>>>> size of the example is 40. Perhaps we can try a threshold of
>>>> PARAM_LARGE_STACK_FRAME - estimated_stack_size or some such?
>>>
>>> Hm. estimated_stack_size is not O(1), so no. I think we need to
>>> find a sensible way of allowing stack sharing. Eventually Michas
>>> patch for introducing points-of-death would help here, if we'd
>>> go for folding this during stack-save/restore optimization.
>>>
>>
>> I changed the heuristics to this:
>>
>> + /* Heuristic: don't fold large vlas. */
>> + threshold = (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME);
>> + /* In case a vla is declared at function scope, it has the same lifetime
>> as a
>> + declared array, so we allow a larger size. */
>> + block = gimple_block (stmt);
>> + if (!(cfun->after_inlining
>> + && TREE_CODE (BLOCK_SUPERCONTEXT (block)) == FUNCTION_DECL))
>> + threshold /= 10;
>> + if (size > threshold)
>> + return NULL_TREE;
>>
>> The heuristics distinguishes between before and after inlining.
>>
>> After inlining, vla's declared at function scope have the same lifetimes as
>> declared arrays, and don't share their space. There should be no negative
>> effects from folding an alloca in this case, but for safety we set a
>> threshold
>> of PARAM_LARGE_STACK_FRAME.
>>
>> Before inlining, such a vla might be inlined and share its space with another
>> vla, so we stick with the normal threshold before inlining.
>
> That sounds reasonable, though the block check should probably use the
> original VLA decl block, not that of the basic-block of the allocation,
> but unfortunately we don't have access to that. So I suppose using
> the allocation basic-block BLOCK is good enough (still we don't
> really care about BLOCK boundaries when doing CFG manipulations, so
> the allocation bbs block may be not the outermost scope in more cases
> than necessary).
>
>> However, using this heuristic we still don't generate optimal code.
>>
>> During the first pass_ccp, the folding is not done, because the size (40) is
>> larger than the threshold 25. The threshold is 25, because inlining is not
>> yet done.
>>
>> During pass_fold_builtins, the folding is done because it's after inlining,
>> but
>> it's later than pass_iv_optimize, so that still doesn't yield the optimal
>> size
>> of 64.
>>
>> The folding is not done during any of the other invocations or pass_ccp,
>> because
>> the argument has already become constant in the earlier invocation.
>
> Yeah, that's the issue with relying on folding to do this transformation.
>
>> Using this change, I manage to trigger folding during the second invocation
>> of
>> pass_ccp, before iv_optimize so we generate optimal code.
>>
>> Index: gcc/tree-ssa-ccp.c
>> ===================================================================
>> --- gcc/tree-ssa-ccp.c (revision 173734)
>> +++ gcc/tree-ssa-ccp.c (working copy)
>> @@ -1727,6 +1727,13 @@ ccp_fold_stmt (gimple_stmt_iterator *gsi
>> if (gimple_call_internal_p (stmt))
>> return false;
>>
>> + /* The heuristic of fold_builtin_alloca differs before and after
>> + inlining, so we don't require the arg to be changed into a
>> constant
>> + for folding, but just to be constant. */
>> + if (gimple_call_alloca_for_var_p (stmt)
>> + && get_constant_value (gimple_call_arg (stmt, 0)))
>
> Probably reverse the get_constant_value check and the transformation
Done.
> (gimple_call_alloca_for_var_p isn't a predicate as it has side-effects,
> so its name should be changed).
>
>> + return true;
>> +
>> /* Propagate into the call arguments. Compared to replace_uses_in
>> this can use the argument slot types for type verification
>> instead of the current argument type. We also can safely
>>
>> But, to me it feels like a hack. Do you have any ideas how to do this better?
>
> It's somewhat of a hack, but at least it is more of a defined place
> for this transformation - which then suggests to remove it from
> generic folding and only keep calling it from CCP this way.
>
Done.
> Richard.
>
>> Attaching untested patch for reference (will test overnight).
>>
>> Thanks,
>> - Tom
>>
>> 2011-07-28 Tom de Vries <[email protected]>
>>
>> PR middle-end/43513
>> * gimple-fold.c (params.h): Include.
>> (fold_builtin_alloca): New function.
>> (gimple_fold_builtin): Use fold_builtin_alloca.
>> * tree-ssa-ccp.c (ccp_fold_stmt): Force folding of vla-related alloca.
>> * Makefile.in (gimple-fold.o): Add $(PARAMS_H) to rule.
Thanks,
- Tom