failed attempt: retain identifier length from frontend to backend
Hello,

my last attempt at a substantial improvement was about three weeks ago: keeping the length of every string parsed in the front end alive through the whole compilation, all the way to assembly output. I was hoping that would enable faster hashes (knowing the length lets us hash 4 or 8 bytes per iteration), quicker strcmp() in various places, and fewer strlen() calls, which show up especially on -g3 compilations that store huge macro strings.

I'll post no patch here, since what I currently have is a mess spread across 3 different branches and most of it doesn't even compile. I tried various approaches.

First I tried adding an extra length parameter to all relevant functions, starting from the assembly generation and working my way upwards. This got too complex, and I'd really like to ask whether you see any point in changing target-specific hooks and macros to actually accept length as an argument (e.g. ASM_OUTPUT_*) or return it (e.g. ASM_GENERATE_*). The changes seemed too intrusive for me to continue.

But seeing that the identifier length is there inside struct ht_identifier (or cpp_hashnode) and never lost, I then tried the approach of having the length at str[-4] for all identifiers. To achieve this I changed ht_identifier to store str with the flexible array member hack. Unfortunately I hadn't noticed that ht_identifier is part of tree_node and also part of too many other structs, so changing all those structs to have variable size was not without side effects. In the end it compiled, but I got crashes all over, and I'm sure I didn't do things right, since I broke things like the static assertion in libcpp/identifiers.c, which I don't even understand:

  /* We don't need a proxy since the hash table's identifier comes first
     in cpp_hashnode.  However, in case this is ever changed, we have a
     static assertion for it.  */
  -extern char proxy_assertion_broken[offsetof (struct cpp_hashnode, ident) == 0 ? 1 : -1];

Anyway, my last attempt was to decouple ht_identifier completely from trees and other structs by storing a pointer to it, but I was pretty worn out and quickly gave up after getting errors on gengtype-generated files that I didn't even know how to handle.

Was this whole project too ambitious? I'd appreciate all input.

Thanks,
Dimitris
[DF] RFC: obstacks in DF
Hi,

while I was happy using obstacks in other parts of the compiler, I thought they would provide a handy solution for the XNEWVECs/XRESIZEVECs in df-scan.c, especially df_install_refs(), which is the heaviest malloc() user after the rest of my patches. In the process I realised that obstacks weren't exactly suitable (thanks matz for helping on IRC); nevertheless I have a patch which is stable, a bit faster and more memory friendly (I don't know why, but RSS for common non-pathological compilations was smaller). However, after trying various incorrect changes I'm aware that there are leaks in there that can't be closed without dirty hacks, so I gave up. I'm posting the simplest but stable version of my changes and would really like to hear ideas.

There are "holes" in the obstack that should have been freed, but in the end I didn't see memory increase. I don't know what would happen for huge functions that keep the obstack alive for long; I didn't have such a testcase. Finally, I think my next try will involve converting the chains to a pool_alloc'ed linked list. Do you think that makes sense?

Thanks,
Dimitris

=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c	2012-07-16 11:32:42 +
+++ gcc/df-scan.c	2012-08-20 14:01:59 +
@@ -184,6 +184,17 @@ struct df_scan_problem_data
   bitmap_obstack insn_bitmaps;
 };

+/* Obstack for storing all of
+   insn_info->{defs,uses,eq_uses,mw_hardregs} and
+   bb_info->artificial_{defs,uses}.  */
+static struct obstack df_refs_obstack;
+
+/* Keep the obstack initialised (only 4k overhead) and use this pointer to
+   actually empty it fast.  */
+static void *df_first_refs_obj;
+static int df_refs_obstack_is_init;
+
+
 typedef struct df_scan_bb_info *df_scan_bb_info_t;

@@ -193,36 +204,13 @@
 df_scan_free_internal (void)
 {
   struct df_scan_problem_data *problem_data
     = (struct df_scan_problem_data *) df_scan->problem_data;
-  unsigned int i;
-  basic_block bb;

-  /* The vectors that hold the refs are not pool allocated because
-     they come in many sizes.  This makes them impossible to delete
-     all at once.  */
-  for (i = 0; i < DF_INSN_SIZE(); i++)
-    {
-      struct df_insn_info *insn_info = DF_INSN_UID_GET(i);
-      /* Skip the insns that have no insn_info or have been
-	 deleted.  */
-      if (insn_info)
-	{
-	  df_scan_free_ref_vec (insn_info->defs);
-	  df_scan_free_ref_vec (insn_info->uses);
-	  df_scan_free_ref_vec (insn_info->eq_uses);
-	  df_scan_free_mws_vec (insn_info->mw_hardregs);
-	}
-    }
-
-  FOR_ALL_BB (bb)
-    {
-      unsigned int bb_index = bb->index;
-      struct df_scan_bb_info *bb_info = df_scan_get_bb_info (bb_index);
-      if (bb_info)
-	{
-	  df_scan_free_ref_vec (bb_info->artificial_defs);
-	  df_scan_free_ref_vec (bb_info->artificial_uses);
-	}
-    }
+  /* Delete at once all vectors that hold the refs (all of
+     insn_info->{defs,uses,eq_uses,mw_hardregs} and
+     bb_info->artificial_{defs,uses}) but keep the obstack initialised, so
+     that it's ready for more BBs.  */
+  obstack_free (&df_refs_obstack, df_first_refs_obj);
+  df_first_refs_obj = NULL;

   free (df->def_info.refs);
   free (df->def_info.begin);
@@ -364,6 +352,14 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
       bb_info->artificial_uses = NULL;
     }

+  if (! df_refs_obstack_is_init)
+    {
+      obstack_init (&df_refs_obstack);
+      df_refs_obstack_is_init = 1;
+    }
+  gcc_assert (df_first_refs_obj == NULL);
+  df_first_refs_obj = XOBNEW (&df_refs_obstack, void *);
+
   bitmap_initialize (&df->hardware_regs_used, &problem_data->reg_bitmaps);
   bitmap_initialize (&df->regular_block_artificial_uses, &problem_data->reg_bitmaps);
   bitmap_initialize (&df->eh_block_artificial_uses, &problem_data->reg_bitmaps);
@@ -791,9 +787,15 @@ df_install_ref_incremental (df_ref ref)
     }

   ref_rec = *ref_rec_ptr;
+  /* fprintf (stderr, "count:%d ref_rec:%p\n", count, ref_rec); */
   if (count)
     {
-      ref_rec = XRESIZEVEC (df_ref, ref_rec, count+2);
+      /* Impossible to actually know if ref_rec was malloc'ed or obstack'ed
+	 thus we might have a leak here!  */
+      df_ref *ref_rec2 = XOBNEWVEC (&df_refs_obstack, df_ref, count+2);
+      memcpy (ref_rec2, ref_rec, count*sizeof(df_ref));
+      /* free (ref_rec); */
+      ref_rec = ref_rec2;
       *ref_rec_ptr = ref_rec;
       ref_rec[count] = ref;
       ref_rec[count+1] = NULL;
@@ -801,7 +803,7 @@ df_install_ref_incremental (df_ref ref)
     }
   else
     {
-      df_ref *ref_rec = XNEWVEC (df_ref, 2);
+      ref_rec = XOBNEWVEC (&df_refs_obstack, df_ref, 2);
       ref_rec[0] = ref;
       ref_rec[1] = NULL;
       *ref_rec_ptr = ref_rec;
@@ -1070,8 +1072,9 @@ df_ref_chain_delete (df_ref *ref_rec)

   /* If the list is empty, it has a special shared element that is not
      to be deleted.  */
-  if (*start)
-    free (start);
+  /* if (*start) */
+  /* free (sta
Re: [graphds.h] Allocate graph from obstack
Hi Paolo,

On Mon, 20 Aug 2012, Paolo Bonzini wrote:
> On 19/08/2012 18:55, Richard Guenther wrote:
>>> Initially I had one obstack per struct graph, which was better than
>>> using XNEW for every edge, but still obstack_init() called from
>>> new_graph() was too frequent.  So in this iteration of the patch the
>>> obstack is static global, initialised once and never freed.  Please
>>> advise on whether this is acceptable, and also where I should
>>> initialise the obstack once, and avoid checking if it's NULL in every
>>> use.  Minor speed gains (couple of ms), tested with pre-C++
>>> conversion snapshot, I'll retest soon and post update.
>
> Either an obstack per graph or the ability to specify an obstack for
> allocation.  A global obstack with global lifetime is bad IMHO.
>
> Dimitrios's patch has a per-file obstack with per-pass lifetime

Notice that I never call XOBDELETE with a NULL argument; I only free the first object, which means that the 4KB-per-obstack overhead is retained until program exit. I did that to save on malloc() calls.

Thanks,
Dimitris
Re: alloc_pool for tree-ssa-pre.c:phi_translate_table
On Mon, 20 Aug 2012, Jakub Jelinek wrote:
> I'd note for all the recently posted patches from Dimitrios, the gcc/
> prefix doesn't belong to the ChangeLog entry pathnames, the filenames
> are relative to the corresponding ChangeLog location.

Ah sorry, it's what the mklog utility generates; it seems I got carried away over-using it. :-)

Dimitris
Re: enlarge hot allocation pools
Hi Steven,

On Sun, 19 Aug 2012, Steven Bosscher wrote:
> On Sun, Aug 19, 2012 at 8:31 PM, Dimitrios Apostolou wrote:
>> Hello,
>>
>> 2012-08-19  Dimitrios Apostolou
>>
>> 	* gcc/cselib.c (cselib_init): Make allocation pools larger since
>> 	they are too hot and expand often according to the profiler.
>> 	* gcc/df-problems.c (df_chain_alloc): Same.
>> 	* gcc/et-forest.c (et_new_occ, et_new_tree): Same.
>> 	* gcc/tree-ssa-pre.c (init_pre): Same.
>>
>> These allocation pools are the ones that I've noticed calling
>> malloc() too often, for expanding their size.
>
> I don't like the way these pools are sized with a seemingly arbitrary
> initial size.  They're there to hold something, and they grow because
> there are "more somethings" than initially guessed.  I think you
> should look at what the pools hold and choose an initial size based on
> some representative measure.  E.g. if a pool holds some kind of
> expressions, then you should be able to make an initial guess of the
> size of the pool based on the number of pseudos or ssa names.  Ideally
> you'd simply "derive" the initial pool size by a regression analysis
> of the final pool size and some potential representative measures
> (# of regs, basic blocks, edges, whatever).

Some time ago I tried changing the pool allocator growth strategy from linear to exponential, doubling the size of the pool chunk every time it needed another chunk. It was for the exact same reason: to set the pool size according to its contents. Unfortunately I remember it didn't make a difference, so I dumped it and only manually changed the important pools, which made a small difference. I'll now try deducing the size at runtime; thanks for the tip.

Dimitris
enlarge hot allocation pools
Hello,

2012-08-19  Dimitrios Apostolou

	* gcc/cselib.c (cselib_init): Make allocation pools larger since
	they are too hot and expand often according to the profiler.
	* gcc/df-problems.c (df_chain_alloc): Same.
	* gcc/et-forest.c (et_new_occ, et_new_tree): Same.
	* gcc/tree-ssa-pre.c (init_pre): Same.

These allocation pools are the ones that I've noticed calling malloc() too often to expand their size. Also, I don't like the way the pools are created in et_new_{occ,tree}(), but I couldn't find a single point to move the initialisation to either. Any ideas on this one?

Thanks,
Dimitris

=== modified file 'gcc/cselib.c'
--- gcc/cselib.c	2012-08-02 22:39:57 +
+++ gcc/cselib.c	2012-08-19 15:13:28 +
@@ -2659,12 +2659,12 @@
 void
 cselib_init (int record_what)
 {
   elt_list_pool = create_alloc_pool ("elt_list",
-				     sizeof (struct elt_list), 10);
+				     sizeof (struct elt_list), 128);
   elt_loc_list_pool = create_alloc_pool ("elt_loc_list",
-					 sizeof (struct elt_loc_list), 10);
+					 sizeof (struct elt_loc_list), 128);
   cselib_val_pool = create_alloc_pool ("cselib_val_list",
-				       sizeof (cselib_val), 10);
-  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 100);
+				       sizeof (cselib_val), 128);
+  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 128);
   cselib_record_memory = record_what & CSELIB_RECORD_MEMORY;
   cselib_preserve_constants = record_what & CSELIB_PRESERVE_CONSTANTS;
   cselib_any_perm_equivs = false;

=== modified file 'gcc/df-problems.c'
--- gcc/df-problems.c	2012-08-17 09:42:06 +
+++ gcc/df-problems.c	2012-08-19 15:29:37 +
@@ -1993,7 +1993,7 @@ df_chain_alloc (bitmap all_blocks ATTRIB
 {
   df_chain_remove_problem ();
   df_chain->block_pool = create_alloc_pool ("df_chain_block pool",
-					    sizeof (struct df_link), 50);
+					    sizeof (struct df_link), 128);
   df_chain->optional_p = true;
 }

=== modified file 'gcc/et-forest.c'
--- gcc/et-forest.c	2012-05-31 16:43:31 +
+++ gcc/et-forest.c	2012-08-19 15:50:25 +
@@ -444,8 +444,8 @@ et_new_occ (struct et_node *node)
 {
   struct et_occ *nw;

-  if (!et_occurrences)
-    et_occurrences = create_alloc_pool ("et_occ pool", sizeof (struct et_occ), 300);
+  if (!et_occurrences)
+    et_occurrences = create_alloc_pool ("et_occ pool", sizeof (struct et_occ), 1024);
   nw = (struct et_occ *) pool_alloc (et_occurrences);

   nw->of = node;
@@ -467,8 +467,8 @@ et_new_tree (void *data)
 {
   struct et_node *nw;

-  if (!et_nodes)
-    et_nodes = create_alloc_pool ("et_node pool", sizeof (struct et_node), 300);
+  if (!et_nodes)
+    et_nodes = create_alloc_pool ("et_node pool", sizeof (struct et_node), 512);
   nw = (struct et_node *) pool_alloc (et_nodes);

   nw->data = data;

=== modified file 'gcc/tree-ssa-pre.c'
--- gcc/tree-ssa-pre.c	2012-08-18 06:31:26 +
+++ gcc/tree-ssa-pre.c	2012-08-19 15:28:21 +
@@ -4812,9 +4812,9 @@ init_pre (bool do_fre)
   phi_translate_table.create (5110);
   expression_to_id.create (num_ssa_names * 3);
   bitmap_set_pool = create_alloc_pool ("Bitmap sets",
-				       sizeof (struct bitmap_set), 30);
+				       sizeof (struct bitmap_set), 128);
   pre_expr_pool = create_alloc_pool ("pre_expr nodes",
-				     sizeof (struct pre_expr_d), 30);
+				     sizeof (struct pre_expr_d), 32);
   FOR_ALL_BB (bb)
     {
       EXP_GEN (bb) = bitmap_set_new ();
alloc_pool for tree-ssa-pre.c:phi_translate_table
2012-08-19  Dimitrios Apostolou

	* gcc/tree-ssa-pre.c (phi_translate_pool): New static global
	alloc_pool, used for allocating struct expr_pred_trans_d for
	phi_translate_table.
	(phi_trans_add, init_pre, fini_pre): Use it; avoids thousands of
	malloc() and free() calls.

This avoids lots of malloc/free calls and slow iterations during the numerous htab_delete() calls in fini_pre(). Tested on the pre-C++ snapshot; I will update info as soon as a post-C++ one is available.

Thanks,
Dimitris

=== modified file 'gcc/tree-ssa-pre.c'
--- gcc/tree-ssa-pre.c	2012-08-17 08:03:54 +
+++ gcc/tree-ssa-pre.c	2012-08-18 16:43:02 +
@@ -486,7 +486,7 @@ static bitmap need_ab_cleanup;

 /* A three tuple {e, pred, v} used to cache phi translations in the
    phi_translate_table.  */

-typedef struct expr_pred_trans_d : typed_free_remove
+typedef struct expr_pred_trans_d : typed_noop_remove
 {
   /* The expression.  */
   pre_expr e;
@@ -508,6 +508,12 @@ typedef struct expr_pred_trans_d : typed
 } *expr_pred_trans_t;
 typedef const struct expr_pred_trans_d *const_expr_pred_trans_t;

+/* Pool of memory for the above.  */
+
+static alloc_pool phi_translate_pool;
+
+/* Return the hash value for a phi translation table entry.  */
+
 inline hashval_t
 expr_pred_trans_d::hash (const expr_pred_trans_d *e)
 {
@@ -561,7 +567,8 @@
 static inline void
 phi_trans_add (pre_expr e, pre_expr v, basic_block pred)
 {
   expr_pred_trans_t *slot;
-  expr_pred_trans_t new_pair = XNEW (struct expr_pred_trans_d);
+  expr_pred_trans_t new_pair =
+    (expr_pred_trans_t) pool_alloc (phi_translate_pool);
   new_pair->e = e;
   new_pair->pred = pred;
   new_pair->v = v;
@@ -570,7 +577,8 @@ phi_trans_add (pre_expr e, pre_expr v, b

   slot = phi_translate_table.find_slot_with_hash (new_pair, new_pair->hashcode,
						  INSERT);
-  free (*slot);
+  if (*slot)
+    pool_free (phi_translate_pool, *slot);
   *slot = new_pair;
 }

@@ -4798,6 +4806,9 @@ init_pre (bool do_fre)
   calculate_dominance_info (CDI_DOMINATORS);

   bitmap_obstack_initialize (&grand_bitmap_obstack);
+  phi_translate_pool = create_alloc_pool ("phi_translate_table",
+					  sizeof (struct expr_pred_trans_d),
+					  512);
   phi_translate_table.create (5110);
   expression_to_id.create (num_ssa_names * 3);
   bitmap_set_pool = create_alloc_pool ("Bitmap sets",
@@ -4832,6 +4843,7 @@ fini_pre (bool do_fre)
   free_alloc_pool (bitmap_set_pool);
   free_alloc_pool (pre_expr_pool);
   phi_translate_table.dispose ();
+  free_alloc_pool (phi_translate_pool);
   expression_to_id.dispose ();
   VEC_free (unsigned, heap, name_to_id);
obstack for equiv_class_label, more vectors on stack
2012-08-19  Dimitrios Apostolou

	* gcc/tree-ssa-structalias.c: Change declaration of ce_s type
	vector from heap to stack.  Update all relevant functions to
	VEC_alloc() such a vector upfront with enough (32) slots so that
	malloc() calls are mostly avoided.
	(equiv_class_obstack): New global static obstack for allocating
	struct equiv_class_label.
	(equiv_class_add): Use the above instead of malloc().
	(perform_var_substitution): Don't allow entries of
	location_equiv_class_table to be freed, because they are freed...
	(free_var_substitution_info): ...here by freeing the obstack.
	* gcc/vecir.h: Add declaration of stack allocated tree type
	vector.
	* gcc/tree-ssa-sccvn.c (vn_phi_insert, print_scc, compare_ops)
	(sort_scc, copy_reference, extract_and_process_scc_for_name): Use
	it, instead of heap allocated vector.

Not all of these places are hot on the profiler, but since I changed a few I had to change them all to completely remove the heap ce_s vector. Passes tests and offers small gains (a couple of ms), but expect a more thorough report and testing against a new snapshot by next week.

Thanks,
Dimitris

=== modified file 'gcc/tree-ssa-structalias.c'
--- gcc/tree-ssa-structalias.c	2012-08-16 14:27:51 +
+++ gcc/tree-ssa-structalias.c	2012-08-18 16:43:02 +
@@ -472,11 +472,14 @@ struct constraint_expr
 typedef struct constraint_expr ce_s;
 DEF_VEC_O(ce_s);
-DEF_VEC_ALLOC_O(ce_s, heap);
-static void get_constraint_for_1 (tree, VEC(ce_s, heap) **, bool, bool);
-static void get_constraint_for (tree, VEC(ce_s, heap) **);
-static void get_constraint_for_rhs (tree, VEC(ce_s, heap) **);
-static void do_deref (VEC (ce_s, heap) **);
+DEF_VEC_ALLOC_O_STACK(ce_s);
+#define VEC_ce_s_stack_alloc(alloc) \
+  VEC_stack_alloc (ce_s, alloc)
+
+static void get_constraint_for_1 (tree, VEC(ce_s, stack) **, bool, bool);
+static void get_constraint_for (tree, VEC(ce_s, stack) **);
+static void get_constraint_for_rhs (tree, VEC(ce_s, stack) **);
+static void do_deref (VEC (ce_s, stack) **);

 /* Our set constraints are made up of two constraint expressions, one
    LHS, and one RHS.
@@ -1893,6 +1896,9 @@ static htab_t pointer_equiv_class_table;
    classes.  */
 static htab_t location_equiv_class_table;

+/* Pool of memory for storing the above.  */
+static struct obstack equiv_class_obstack;
+
 /* Hash function for a equiv_class_label_t */

 static hashval_t
@@ -1942,7 +1948,7 @@ equiv_class_add (htab_t table, unsigned
		 bitmap labels)
 {
   void **slot;
-  equiv_class_label_t ecl = XNEW (struct equiv_class_label);
+  equiv_class_label_t ecl = XOBNEW (&equiv_class_obstack, struct equiv_class_label);

   ecl->labels = labels;
   ecl->equivalence_class = equivalence_class;
@@ -2153,10 +2159,12 @@ perform_var_substitution (constraint_gra
   struct scc_info *si = init_scc_info (size);

   bitmap_obstack_initialize (&iteration_obstack);
+  gcc_obstack_init (&equiv_class_obstack);
+  /* NULL free function, we'll free the whole pool at the end of the
+     pass.  */
   pointer_equiv_class_table = htab_create (511, equiv_class_label_hash,
-					   equiv_class_label_eq, free);
+					   equiv_class_label_eq, NULL);
   location_equiv_class_table = htab_create (511, equiv_class_label_hash,
-					    equiv_class_label_eq, free);
+					    equiv_class_label_eq, NULL);
   pointer_equiv_class = 1;
   location_equiv_class = 1;
@@ -2263,6 +2271,7 @@ free_var_substitution_info (struct scc_i
   sbitmap_free (graph->direct_nodes);
   htab_delete (pointer_equiv_class_table);
   htab_delete (location_equiv_class_table);
+  obstack_free (&equiv_class_obstack, NULL);
   bitmap_obstack_release (&iteration_obstack);
 }

@@ -2741,7 +2750,7 @@ new_scalar_tmp_constraint_exp (const cha
   If address_p is t
more malloc mitigation
Hi,

2012-08-18  Dimitrios Apostolou

	* gcc/tree-ssa-sccvn.c (struct vn_tables_s): Add obstack_start to
	mark the first allocated object on the obstack.
	(process_scc, allocate_vn_table): Use it.
	(init_scc_vn): Don't truncate shared_lookup_references vector.
	(prealloc_ref_ops_vec): New static vector.
	(create_reference_ops_from_ref, create_reference_ops_from_call):
	Use it instead of locally allocated one.
	(free_scc_vn): Truncate vectors to 0, but keep them at most 16
	slots long.

I'd have used a stack vector for create_reference_ops_from_{ref,call} instead of a static one on the heap, but the functions return heap vectors. Can I just cast a stack vector to a heap one? Also, is it acceptable to leak 4K by never freeing the whole obstack?

Tested on x86 with an old snapshot; I will post an update soon after testing with a more recent one.

Thanks,
Dimitris

=== modified file 'gcc/tree-ssa-sccvn.c'
--- gcc/tree-ssa-sccvn.c	2012-08-16 14:27:51 +
+++ gcc/tree-ssa-sccvn.c	2012-08-18 16:43:02 +
@@ -105,6 +105,7 @@ typedef struct vn_tables_s
   htab_t phis;
   htab_t references;
   struct obstack nary_obstack;
+  long *obstack_start;
   alloc_pool phis_pool;
   alloc_pool references_pool;
 } *vn_tables_t;
@@ -973,16 +974,21 @@ copy_reference_ops_from_call (gimple cal
     }
 }

+/* Preallocated vector with sufficient space - see free_scc_vn().  Helps to
+   avoid reallocations and space, since when we VEC_copy() it, only its exact
+   length is duplicated.  */
+
+static VEC(vn_reference_op_s, heap) *prealloc_ref_ops_vec;
+
 /* Create a vector of vn_reference_op_s structures from REF, a
-   REFERENCE_CLASS_P tree.  The vector is not shared. */
+   REFERENCE_CLASS_P tree.  The vector is not shared.  */

 static VEC(vn_reference_op_s, heap) *
 create_reference_ops_from_ref (tree ref)
 {
-  VEC (vn_reference_op_s, heap) *result = NULL;
-
-  copy_reference_ops_from_ref (ref, &result);
-  return result;
+  VEC_truncate (vn_reference_op_s, prealloc_ref_ops_vec, 0);
+  copy_reference_ops_from_ref (ref, &prealloc_ref_ops_vec);
+  return VEC_copy (vn_reference_op_s, heap, prealloc_ref_ops_vec);
 }

 /* Create a vector of vn_reference_op_s structures from CALL, a
@@ -991,10 +997,9 @@ create_reference_ops_from_ref (tree ref)
 static VEC(vn_reference_op_s, heap) *
 create_reference_ops_from_call (gimple call)
 {
-  VEC (vn_reference_op_s, heap) *result = NULL;
-
-  copy_reference_ops_from_call (call, &result);
-  return result;
+  VEC_truncate (vn_reference_op_s, prealloc_ref_ops_vec, 0);
+  copy_reference_ops_from_call (call, &prealloc_ref_ops_vec);
+  return VEC_copy (vn_reference_op_s, heap, prealloc_ref_ops_vec);
 }

 /* Fold *& at position *I_P in a vn_reference_op_s vector *OPS.  Updates
@@ -3610,8 +3615,9 @@ process_scc (VEC (tree, heap) *scc)
       htab_empty (optimistic_info->nary);
       htab_empty (optimistic_info->phis);
       htab_empty (optimistic_info->references);
-      obstack_free (&optimistic_info->nary_obstack, NULL);
-      gcc_obstack_init (&optimistic_info->nary_obstack);
+      /* Quick way to empty the obstack but keep it ready for reuse.  */
+      obstack_free (&optimistic_info->nary_obstack,
+		    optimistic_info->obstack_start);
       empty_alloc_pool (optimistic_info->phis_pool);
       empty_alloc_pool (optimistic_info->references_pool);
       FOR_EACH_VEC_ELT (tree, scc, i, var)
@@ -3796,6 +3802,7 @@ allocate_vn_table (vn_tables_t table)
				   free_reference);

   gcc_obstack_init (&table->nary_obstack);
+  table->obstack_start = XOBNEW (&table->nary_obstack, long);
   table->phis_pool = create_alloc_pool ("VN phis",
					sizeof (struct vn_phi_s),
					30);
@@ -3842,7 +3849,6 @@ init_scc_vn (void)
   gcc_obstack_init (&vn_ssa_aux_obstack);

   shared_lookup_phiargs = NULL;
-  shared_lookup_references = NULL;
   rpo_numbers = XNEWVEC (int, last_basic_block);
   rpo_numbers_temp = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
   pre_and_rev_post_order_compute (NULL, rpo_numbers_temp, false);
@@ -3887,9 +3893,19 @@ free_scc_vn (void)
   htab_delete (constant_to_value_id);
   BITMAP_FREE (constant_value_ids);
   VEC_fr
[graphds.h] Allocate graph from obstack
Initially I had one obstack per struct graph, which was better than using XNEW for every edge, but obstack_init() called from new_graph() was still too frequent. So in this iteration of the patch the obstack is static global, initialised once and never freed. Please advise on whether this is acceptable, and also on where I should initialise the obstack once, to avoid checking whether it's NULL on every use.

Minor speed gains (a couple of ms), tested with the pre-C++ conversion snapshot; I'll retest soon and post an update.

Thanks,
Dimitris

2012-08-18  Dimitrios Apostolou

	* gcc/graphds.h: Guard the header file.  Include libiberty.h and
	obstack.h.
	(graph_obstack): New static obstack to allocate graphs and
	graph_edges from.
	(new_graph, add_edge): Add obstack argument.
	(free_graph): Remove.
	* gcc/graphds.c (new_graph): Allocate graph, vertices from
	obstack.
	(add_edge): Allocate edge from obstack.
	(free_graph): Remove; all graph data is now freed when freeing
	the object in the obstack.
	* gcc/tree-data-ref.h (free_rdg): Same.
	(build_empty_rdg): Add obstack argument.
	* gcc/cfgloopanal.c (mark_irreducible_loops): Initialise the
	obstack if it isn't already, use it for graph allocations.
	* gcc/dominance.c (iterate_fix_dominators): Same.
	* gcc/graphite-sese-to-poly.c (build_alias_set_optimal_p)
	(build_base_obj_set_for_drs): Same.
	* gcc/tree-data-ref.c (rdg_vertex_for_stmt)
	(create_rdg_edge_for_ddr, create_rdg_edges_for_scalar)
	(create_rdg_edges, create_rdg_vertices)
	(known_dependences_p, build_empty_rdg, build_rdg, free_rdg): Same.
	* gcc/tree-loop-distribution.c (distribute_loop): Same.

=== modified file 'gcc/cfgloopanal.c'
--- gcc/cfgloopanal.c	2012-05-31 20:19:00 +
+++ gcc/cfgloopanal.c	2012-08-18 16:43:02 +
@@ -93,8 +93,14 @@ mark_irreducible_loops (void)
	e->flags &= ~EDGE_IRREDUCIBLE_LOOP;
     }

+  if (graph_obstack == NULL)
+    {
+      graph_obstack = XNEW (struct obstack);
+      obstack_init (graph_obstack);
+    }
+
   /* Create the edge lists.  */
-  g = new_graph (last_basic_block + num);
+  g = new_graph (last_basic_block + num, graph_obstack);

   FOR_BB_BETWEEN (act, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, next_bb)
     FOR_EACH_EDGE (e, ei, act->succs)
@@ -134,7 +140,7 @@ mark_irreducible_loops (void)
	    src = LOOP_REPR (cloop);
	  }

-	add_edge (g, src, dest)->data = e;
+	add_edge (g, src, dest, graph_obstack)->data = e;
       }

   /* Find the strongly connected components.  */
@@ -161,7 +167,8 @@ mark_irreducible_loops (void)
	    real->src->flags |= BB_IRREDUCIBLE_LOOP;
	}

-  free_graph (g);
+  /* Destroy all data allocated for the graph.  */
+  XOBDELETE (graph_obstack, g);

   loops_state_set (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS);
   return irred_loop_found;

=== modified file 'gcc/dominance.c'
--- gcc/dominance.c	2012-08-14 15:59:41 +
+++ gcc/dominance.c	2012-08-18 16:43:02 +
@@ -1341,7 +1341,13 @@ iterate_fix_dominators (enum cdi_directi
     }
   *pointer_map_insert (map, ENTRY_BLOCK_PTR) = (void *) (size_t) n;

-  g = new_graph (n + 1);
+  if (graph_obstack == NULL)
+    {
+      graph_obstack = XNEW (struct obstack);
+      obstack_init (graph_obstack);
+    }
+
+  g = new_graph (n + 1, graph_obstack);
   for (y = 0; y < g->n_vertices; y++)
     g->vertices[y].data = BITMAP_ALLOC (NULL);
   FOR_EACH_VEC_ELT (basic_block, bbs, i, bb)
@@ -1358,7 +1364,7 @@ iterate_fix_dominators (enum cdi_directi
	  if (!bitmap_set_bit ((bitmap) g->vertices[dom_i].data, i))
	    continue;

-	  add_edge (g, dom_i, i);
+	  add_edge (g, dom_i, i, graph_obstack);
	}
     }
   for (y = 0; y < g->n_vertices; y++)
@@ -1392,7 +1398,8 @@ iterate_fix_dominators (enum cdi_directi
   free (brother);
   free (parent);

-  free_graph (g);
+  /* Destroy all data allocated for the graph.  */
+  XOBDELETE (graph_obstack, g);
 }

 void

=== modified file 'gcc/graphds.c'
--- gcc/graphds.c	2009-11-25 10:55:54 +
+++ gcc/graphds.c	2012-08-18 16:43:02 +
@@ -53,28 +53,29 @@ dump_graph (FILE *f, struct graph *g)
     }
 }

-/* Creates a new graph with N_VERTICES vertices.  */
+/* Creates a new graph with N_VERTICES vertices from obstack O.  */

 struct graph *
-new_graph (int n_vertices)
+new_graph (int n_vertices, struct obstack *o)
 {
-  struct graph *g = XNEW (struct graph);
+  struct graph *g = XOBNEW (o, struct graph);

   g->n_vertices = n_vertices;
-  g->vertices = XCNEWVEC (struct vertex, n_vertices);
+  g->vertices = XOBNEWVEC (o, struct vertex, n_vertices);
+  memset (g->vertices, 0, n_vertices * sizeof (*g->vertices));

   return g;
 }

-/* Adds an edge from F to T to graph G.  The new edge is returned.  */
+/
Re: Speedups/Cleanups: End of GSOC patch collection
Hi,

these are the type-safe obstack macros that I use in other patches. As requested, I also changed XOBFINISH to return (T *) and changed all callers; it felt a little strange to change (char *) to char. I also replaced all _obstack_begin() calls in libcpp with obstack_init(), which is the documented way of initialising obstacks.

2012-08-18  Dimitrios Apostolou

	* include/libiberty.h (XOBDELETE, XOBGROW, XOBGROWVEC, XOBSHRINK)
	(XOBSHRINKVEC, XOBFINISH): New type-safe macros for obstack
	operations.
	(XOBFINISH): Changed to return (T *) instead of T.  All callers
	updated.
	* libcpp/include/symtab.h (obstack_chunk_alloc)
	(obstack_chunk_free): Define, so that obstack_init() can be used.
	* libcpp/internal.h (struct cset_converter): Same.
	* libcpp/files.c (_cpp_init_files): Changed _obstack_begin() to
	obstack_init().
	* libcpp/identifiers.c (_cpp_init_hashtable): Same.
	* libcpp/symtab.c (ht_create): Same.
	* libcpp/init.c (cpp_create_reader): Same.

Thanks,
Dimitris

=== modified file 'gcc/c-family/c-lex.c'
--- gcc/c-family/c-lex.c	2012-06-19 19:55:33 +
+++ gcc/c-family/c-lex.c	2012-08-18 13:42:37 +
@@ -1037,7 +1037,7 @@ lex_string (const cpp_token *tok, tree *
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-    strs = XOBFINISH (&str_ob, cpp_string *);
+    strs = XOBFINISH (&str_ob, cpp_string);

   if (concats && !objc_string && !in_system_header)
     warning (OPT_Wtraditional,

=== modified file 'gcc/collect2.c'
--- gcc/collect2.c	2012-05-31 20:19:00 +
+++ gcc/collect2.c	2012-08-18 13:42:51 +
@@ -514,7 +514,7 @@ extract_string (const char **pp)

   obstack_1grow (&temporary_obstack, '\0');
   *pp = p;
-  return XOBFINISH (&temporary_obstack, char *);
+  return XOBFINISH (&temporary_obstack, char);
 }

 void
@@ -535,7 +535,7 @@ dump_ld_file (const char *name, FILE *to
	  const char *word, *p;
	  char *result;
	  obstack_1grow (&temporary_obstack, '\0');
-	  word = XOBFINISH (&temporary_obstack, const char *);
+	  word = XOBFINISH (&temporary_obstack, const char);

	  if (*word == '.')
	    ++word, putc ('.', to);
@@ -943,7 +943,7 @@ maybe_run_lto_and_relink (char **lto_ld_
       lto_o_files = XNEWVEC (char *, num_files + 1);
       lto_o_files[num_files] = NULL;

-      start = XOBFINISH (&temporary_obstack, char *);
+      start = XOBFINISH (&temporary_obstack, char);
       for (i = 0; i < num_files; ++i)
	{
	  end = start;

=== modified file 'gcc/dbxout.c'
--- gcc/dbxout.c	2012-06-24 17:58:46 +
+++ gcc/dbxout.c	2012-08-18 13:48:19 +
@@ -864,7 +864,7 @@ dbxout_finish_complex_stabs (tree sym, s

       obstack_1grow (&stabstr_ob, '\0');
       len = obstack_object_size (&stabstr_ob);
-      chunk = str = XOBFINISH (&stabstr_ob, char *);
+      chunk = str = XOBFINISH (&stabstr_ob, char);

       /* Within the buffer are a sequence of NUL-separated strings,
	 each of which is to be written out as a separate stab
@@ -897,7 +897,7 @@ dbxout_finish_complex_stabs (tree sym, s
	 comma than to do a two-character fputs.  */
       obstack_grow (&stabstr_ob, "\",", 2);
       len = obstack_object_size (&stabstr_ob);
-      str = XOBFINISH (&stabstr_ob, char *);
+      str = XOBFINISH (&stabstr_ob, char);
       fwrite (str, 1, len, asm_out_file);

       DBX_FINISH_STABS (sym, code, line, addr, label, number);

=== modified file 'gcc/gcc.c'
--- gcc/gcc.c	2012-08-15 01:56:07 +
+++ gcc/gcc.c	2012-08-18 13:47:06 +
@@ -1435,7 +1435,7 @@ init_spec (void)
	}

       obstack_1grow (&obstack, '\0');
-      libgcc_spec = XOBFINISH (&obstack, const char *);
+      libgcc_spec = XOBFINISH (&obstack, const char);
     }
 #endif
 #ifdef USE_AS_TRADITIONAL_FORMAT
@@ -1444,7 +1444,7 @@ init_spec (void)
       static const char tf[] = "--traditional-format ";
       obst
Re: Speedups/Cleanups: End of GSOC patch collection
2012-08-18  Dimitrios Apostolou

	* dwarf2out.c (output_indirect_string): Use
	ASM_OUTPUT_INTERNAL_LABEL instead of slower ASM_OUTPUT_LABEL.
	* varasm.c (assemble_string): Don't break string in chunks, this
	is assembler specific and already done in most versions of
	ASM_OUTPUT_ASCII.

I think there is no correctness issue regarding output_indirect_string() since .debug_str are always compiler generated labels, and this gives a small speedup with -g3 debug info. And regarding assemble_string() I find it superfluous to break the string in two places. I found only the following versions of ASM_OUTPUT_ASCII not caring about string length, I guess the assemblers don't have a limit:

	arm.c
	vmsdbgout.c: ASM_OUTPUT_ASCII
	picochip.c: picochip_output_ascii()
	pdp11.c: output_ascii()

Thanks, Dimitris

=== modified file 'gcc/varasm.c'
--- gcc/varasm.c	2012-08-15 01:56:07 +
+++ gcc/varasm.c	2012-08-16 06:12:28 +
@@ -1726,22 +1726,7 @@ assemble_align (int align)
 void
 assemble_string (const char *p, int size)
 {
-  int pos = 0;
-  int maximum = 2000;
-
-  /* If the string is very long, split it up.  */
-
-  while (pos < size)
-    {
-      int thissize = size - pos;
-      if (thissize > maximum)
-	thissize = maximum;
-
-      ASM_OUTPUT_ASCII (asm_out_file, p, thissize);
-
-      pos += thissize;
-      p += thissize;
-    }
+  ASM_OUTPUT_ASCII (asm_out_file, p, size);
 }

=== modified file 'gcc/dwarf2out.c'
--- gcc/dwarf2out.c	2012-08-15 01:56:07 +
+++ gcc/dwarf2out.c	2012-08-16 06:19:19 +
@@ -20887,7 +20887,7 @@ output_indirect_string (void **h, void *
   if (node->form == DW_FORM_strp)
     {
       switch_to_section (debug_str_section);
-      ASM_OUTPUT_LABEL (asm_out_file, node->label);
+      ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, node->label);
       assemble_string (node->str, strlen (node->str) + 1);
     }
Speedups/Cleanups: End of GSOC patch collection
Hello list,

For the following couple of days I'll be posting my collection of patches under this thread. Unless otherwise mentioned they've been bootstrapped and tested on x86, but with a three-week-old snapshot, that is pre-C++ conversion. I plan to test again next week with the latest snapshot, so please wait for that before applying.

Most things are minor changes and tweaks, some are ported patches from last year, some are new. I also have some bigger patches that are flawed (some don't even compile), either because of difficulties I encountered or simply lack of time. I'll try to post what is presentable from them; I'd appreciate discussion on whether/how to continue with those.

Thanks, Dimitris
Re: [PATCH][C++] Save memory and reallocations in name-lookup
Hi,

On Fri, 17 Aug 2012, Jakub Jelinek wrote:
> On Fri, Aug 17, 2012 at 06:41:37AM -0500, Gabriel Dos Reis wrote:
>> I am however concerned with:
>>
>>   static void
>>   store_bindings (tree names, VEC(cxx_saved_binding,gc) **old_bindings)
>>   {
>> !   static VEC(tree,heap) *bindings_need_stored = NULL;
>>
>> I would be more comfortable to see the cache be on per-scope
>> (e.g. namespace scope) basis as opposed a blanket global cache stored
>> in a global variable.
>
> It is not any kind of cache.  It could be in theory an automatic
> variable vector pointer, it is only used during that function.  The
> reason why it is a static variable instead is just to avoid constant
> allocation/deallocation of the vector, this way after the first call
> it will be already allocated (but, upon entry to store_bindings will
> always be empty).

Why not use a stack vector of sufficient size for most cases then? I usually do something like:

  VEC (tree, stack) *bindings_need_stored = VEC_alloc (tree, stack, 32);
  ...
  VEC_free (tree, stack, bindings_need_stored);

if I've measured that 32 (or whatever) is enough for most cases. I don't have a clue about this case though.

Dimitris
Re: Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)
On Tue, 7 Aug 2012, Ian Lance Taylor wrote:
> On Tue, Aug 7, 2012 at 2:24 PM, Dimitrios Apostolou wrote:
>> BTW I can't find why ELF_STRING_LIMIT is only 256, it seems GAS
>> supports arbitrary lengths.  I'd have to change my code if we ever
>> set it too high (or even unlimited) since I allocate the buffer on
>> the stack.
>
> See the comment for ELF_STRING_LIMIT in config/elfos.h.  gas is not
> the only ELF assembler.

I've seen it and thought that for non-GAS ELF platforms you should define it to override elfos.h; I obviously misunderstood. So now I'm curious: is this limit there, dominating all platforms, because of the worst assembler? :-)

Dimitris

P.S. There is an obvious typo in the comment, STRING_LIMIT should be ELF_STRING_LIMIT (I think I introduced the typo last year...)
Re: Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)
I should mention that with my patch .ascii is used more aggressively than before, so if a string is longer than ELF_STRING_LIMIT all of it will be written as .ascii, while in the past .string would be used for the string's tail. Example diff against the original behaviour:

 .LASF15458:
-	.ascii "SSA_VAR_P(DECL) (TREE_CODE (DECL) == VA"
-	.string "R_DECL || TREE_CODE (DECL) == PARM_DECL || TREE_CODE (DECL) == RESULT_DECL || (TREE_CODE (DECL) == SSA_NAME && (TREE_CODE (SSA_NAME_VAR (DECL)) == VAR_DECL || TREE_CODE (SSA_NAME_VAR (DECL)) == PARM_DECL || TREE_CODE (SSA_NAME_VAR (DECL)) == RESULT_DECL)))"
+	.ascii "SSA_VAR_P(DECL) (TREE_CODE (DECL) == VAR_DECL || TREE_CODE (D"
+	.ascii "ECL) == PARM_DECL || TREE_CODE (DECL) == RESULT_DECL || (TREE"
+	.ascii "_CODE (DECL) == SSA_NAME && (TREE_CODE (SSA_NAME_VAR (DECL)) "
+	.ascii "== VAR_DECL || TREE_CODE (SSA_NAME_VAR (DECL)) == PARM_DECL |"
+	.ascii "| TREE_CODE (SSA_NAME_VAR (DECL)) == RESULT_DECL)))\000"

BTW I can't find why ELF_STRING_LIMIT is only 256, it seems GAS supports arbitrary lengths. I'd have to change my code if we ever set it too high (or even unlimited) since I allocate the buffer on the stack.

Dimitris
Re: add strnlen to libiberty (was Re: Assembly output optimisations)
On Mon, 6 Aug 2012, Ian Lance Taylor wrote:
> On Mon, Aug 6, 2012 at 9:34 PM, Dimitrios Apostolou wrote:
>> As an addendum to my previous patch, I made an attempt to properly
>> add strnlen() to libiberty, with the code copied from gnulib.
>> Unfortunately it seems I've messed it up somewhere since defining
>> HAVE_STRNLEN to 0 doesn't seem to build strnlen.o for me.  Any ideas?
>
> What do you mean by "defining HAVE_STRNLEN to 0"?  The thing that will
> control building strnlen.o is having AC_REPLACE_FUNCS in
> libiberty/configure.in fail to find strnlen.  One way you can test
> this is by adding this to libiberty/config.cache:
>
>   ac_cv_func_strnlen=${ac_cv_func_strnlen=yes}
>
> before invoking configure.

Thanks Ian, that helped a lot. I changed it to ac_cv_func_strnlen=no and reconfigured with -C, and strnlen.o was built. :-)

What else is missing to make this patch appropriate for libiberty? Should I change the prolog in strnlen.c, since I only copied it intact from gnulib?

Thanks, Dimitris
add strnlen to libiberty (was Re: Assembly output optimisations)
As an addendum to my previous patch, I made an attempt to properly add strnlen() to libiberty, with the code copied from gnulib. Unfortunately it seems I've messed it up somewhere since defining HAVE_STRNLEN to 0 doesn't seem to build strnlen.o for me. Any ideas?

Thanks, Dimitris

=== modified file 'libiberty/Makefile.in'
--- libiberty/Makefile.in	2012-04-27 14:14:14 +
+++ libiberty/Makefile.in	2012-08-07 03:52:53 +
@@ -151,7 +151,7 @@ CFILES = alloca.c argv.c asprintf.c atex
 	spaces.c splay-tree.c stack-limit.c stpcpy.c stpncpy.c		\
 	strcasecmp.c strchr.c strdup.c strerror.c strncasecmp.c		\
 	strncmp.c strrchr.c strsignal.c strstr.c strtod.c strtol.c	\
-	strtoul.c strndup.c strverscmp.c				\
+	strtoul.c strndup.c strnlen.c strverscmp.c			\
 	timeval-utils.c tmpnam.c					\
 	unlink-if-ordinary.c						\
 	vasprintf.c vfork.c vfprintf.c vprintf.c vsnprintf.c vsprintf.c	\
@@ -218,6 +218,7 @@ CONFIGURED_OFILES = ./asprintf.$(objext)
 	./strncmp.$(objext) ./strndup.$(objext) ./strrchr.$(objext)	\
 	./strstr.$(objext) ./strtod.$(objext) ./strtol.$(objext)	\
 	./strtoul.$(objext) ./strverscmp.$(objext)			\
+	./strnlen.$(objext)						\
 	./tmpnam.$(objext)						\
 	./vasprintf.$(objext) ./vfork.$(objext) ./vfprintf.$(objext)	\
 	./vprintf.$(objext) ./vsnprintf.$(objext) ./vsprintf.$(objext)	\
@@ -622,8 +623,7 @@ $(CONFIGURED_OFILES): stamp-picdir
 	  else true; fi
 	$(COMPILE.c) $(srcdir)/crc32.c $(OUTPUT_OPTION)
 
-./dwarfnames.$(objext): $(srcdir)/dwarfnames.c $(INCDIR)/dwarf2.h \
-	$(INCDIR)/dwarf2.def
+./dwarfnames.$(objext): $(srcdir)/dwarfnames.c $(INCDIR)/dwarf2.h
 	if [ x"$(PICFLAG)" != x ]; then \
 	  $(COMPILE.c) $(PICFLAG) $(srcdir)/dwarfnames.c -o pic/$@; \
 	else true; fi
@@ -656,7 +656,8 @@ $(CONFIGURED_OFILES): stamp-picdir
 	  else true; fi
 	$(COMPILE.c) $(srcdir)/fibheap.c $(OUTPUT_OPTION)
 
-./filename_cmp.$(objext): $(srcdir)/filename_cmp.c config.h $(INCDIR)/filenames.h \
+./filename_cmp.$(objext): $(srcdir)/filename_cmp.c config.h $(INCDIR)/ansidecl.h \
+	$(INCDIR)/filenames.h $(INCDIR)/hashtab.h \
 	$(INCDIR)/safe-ctype.h
 	if [ x"$(PICFLAG)" != x ]; then \
 	  $(COMPILE.c) $(PICFLAG) $(srcdir)/filename_cmp.c -o pic/$@; \
@@ -757,7 +758,7 @@ $(CONFIGURED_OFILES): stamp-picdir
 	$(COMPILE.c) $(srcdir)/insque.c $(OUTPUT_OPTION)
 
 ./lbasename.$(objext): $(srcdir)/lbasename.c config.h $(INCDIR)/ansidecl.h \
-	$(INCDIR)/filenames.h $(INCDIR)/libiberty.h \
+	$(INCDIR)/filenames.h $(INCDIR)/hashtab.h $(INCDIR)/libiberty.h \
 	$(INCDIR)/safe-ctype.h
 	if [ x"$(PICFLAG)" != x ]; then \
 	  $(COMPILE.c) $(PICFLAG) $(srcdir)/lbasename.c -o pic/$@; \
@@ -1043,7 +1044,7 @@ $(CONFIGURED_OFILES): stamp-picdir
 	  else true; fi
 	$(COMPILE.c) $(srcdir)/splay-tree.c $(OUTPUT_OPTION)
 
-./stack-limit.$(objext): $(srcdir)/stack-limit.c config.h
+./stack-limit.$(objext): $(srcdir)/stack-limit.c config.h $(INCDIR)/ansidecl.h
 	if [ x"$(PICFLAG)" != x ]; then \
 	  $(COMPILE.c) $(PICFLAG) $(srcdir)/stack-limit.c -o pic/$@; \
 	else true; fi
@@ -1104,6 +1105,12 @@ $(CONFIGURED_OFILES): stamp-picdir
 	  else true; fi
 	$(COMPILE.c) $(srcdir)/strndup.c $(OUTPUT_OPTION)
 
+./strnlen.$(objext): $(srcdir)/strnlen.c
+	if [ x"$(PICFLAG)" != x ]; then \
+	  $(COMPILE.c) $(PICFLAG) $(srcdir)/strnlen.c -o pic/$@; \
+	else true; fi
+	$(COMPILE.c) $(srcdir)/strnlen.c $(OUTPUT_OPTION)
+
 ./strrchr.$(objext): $(srcdir)/strrchr.c $(INCDIR)/ansidecl.h
 	if [ x"$(PICFLAG)" != x ]; then \
 	  $(COMPILE.c) $(PICFLAG) $(srcdir)/strrchr.c -o pic/$@; \

=== modified file 'libiberty/config.in'
--- libiberty/config.in	2011-07-22 08:33:37 +
+++ libiberty/config.in	2012-08-07 03:29:18 +
@@ -262,6 +262,9 @@
 /* Define to 1 if you have the `strndup' function. */
 #undef HAVE_STRNDUP
 
+/* Define to 1 if you have the `strnlen' function. */
+#undef HAVE_STRNLEN
+
 /* Define to 1 if you have the `strrchr' function. */
 #undef HAVE_STRRCHR

=== modified file 'libiberty/configure'
--- libiberty/configure	2012-01-23 06:25:28 +
+++ libiberty/configure	2012-08-07 03:10:54 +
@@ -5340,6 +5340,7 @@ funcs="$funcs strchr"
 funcs="$funcs strdup"
 funcs="$funcs strncasecmp"
 funcs="$funcs strndup"
+funcs="$funcs strnlen"
 funcs="$funcs strrchr"
 funcs="$funcs strstr"
 funcs="$funcs strtod"
@@ -5380,8 +5381,8 @@ if test "x" = "y"; then
     random realpath rename rindex \
     sbrk setenv setproctitle setrlimit sigsetmask snprintf spawnve spawnvpe \
     stpcpy stpncpy strcasecmp strchr strdup \
-
Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)
Hello list,

these clean-ups and minor speedups complete some TODOs and semi-finished changes I have gathered in the ELF backend. In a nutshell: fixed comment style, used INT_BITS_STRLEN_BOUND from gnulib to be future proof on integer representation string length, replaced long arguments in fast printing functions with HOST_WIDE_INT, which is always at least as large a type (also asserted that), and converted some never-negative ints to unsigned. Guarded the output.h:default_elf_asm_output_* declarations, mimicking varasm.c (I'm not exactly sure why this is guarded in the first place). Changed default_elf_asm_output_* to be clearer and faster; they now fwrite() line by line instead of putting char by char. Implemented fast octal output in default_elf_asm_output_*; this should give a good boost to -flto, but I haven't measured a big testcase for this one.

All in all I get a speed-up of ~30 M instr out of ~2 G instr, for a -g3 compilation of reload.c. Actually saving all the putc() calls gives a more significant gain, but I lost a tiny bit because of converting the [sf]print_* functions from long to HOST_WIDE_INT, for PR 51094. So on i586, where HOST_WIDE_INT is 8 bytes wide, I can see slow calls to __u{div,mod}di3 taking place. I don't know whether there is a meaning in writing LEB128 values greater than 2^31, but I could change all that to HOST_WIDEST_FAST_INT if you think so. Time savings are minor too, about 10 ms out of 0.85 s. Memory usage is the same.

Bootstrapped on x86, no regressions for the C,C++ testsuite.

Thanks Andreas, hp, Mike, for your comments. Mike, I'd appreciate it if you elaborated on how to speed up sprint_uw_rev(), I don't think I understood what you have in mind.

Thanks, Dimitris

2012-08-07  Dimitrios Apostolou

	* final.c: Assert that HOST_WIDE_INT is at least as wide as long.
	(output_addr_const): Use fprint_w() instead of fprintf() when
	CONST_INT or CONST_FIXED.
	(_sprint_uw): New static function.
	(sprint_ul_rev): Change to:
	(_sprint_uw_rev): Accept HOST_WIDE_INT arg instead of long.
	Changed i to unsigned.
	(INT_BITS_STRLEN_BOUND): Copied from gnulib.
	(HOST_WIDE_INT_STRING_LEN): Define.
	(fprint_ul, sprint_ul): Change to:
	(fprint_uw, sprint_uw): Accept HOST_WIDE_INT arg instead of long.
	Changed counter variables to unsigned.
	(fprint_uw_hex): Renamed from fprint_whex.
	* output.h (fprint_ul, sprint_ul): Remove declarations.
	(fprint_w, fprint_uw, sprint_uw): Declare.
	(default_elf_asm_output_limited_string)
	(default_elf_asm_output_ascii): Wrap in #ifdef ELF_ASCII_ESCAPES.
	(fprint_uw_hex): Renamed from fprint_whex.
	* elfos.h (ASM_GENERATE_INTERNAL_LABEL): Use sprint_uw() instead
	of sprint_ul().
	(ASM_OUTPUT_ASCII): Removed question mark at the end of macro.
	* i386.c (print_reg): Use fprint_uw() instead of fprint_ul().
	* dwarf2asm.c (asm_output_data_sleb128): Change fprintf() to
	fputs() plus fprint_w(). Change fputc() to putc() in hot path.
	(dw2_assemble_integer, dw2_asm_output_data)
	(dw2_asm_output_data_uleb128): fprint_whex() renamed to
	fprint_uw_hex().
	* dwarf2out.c (dwarf2out_source_line): Changed comment. Use
	fprint_uw() instead of fprint_ul().
	* varasm.c (_elf_escape_char): New static function that writes a
	char to a string according to ELF_ASCII_ESCAPES.
	(_elf_output_ascii_line): New static function that writes to file
	a single .ascii assembler declaration.
	(default_elf_asm_output_limited_string)
	(default_elf_asm_output_ascii): Rewrote functions so that they
	fwrite() a full assembler line instead of putting char by char.

=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h	2012-06-19 19:55:33 +
+++ gcc/config/elfos.h	2012-08-06 03:19:16 +
@@ -119,7 +119,7 @@ see the files COPYING3 and COPYING.RUNTI
       (LABEL)[0] = '*';					\
       (LABEL)[1] = '.';					\
       __p = stpcpy (&(LABEL)[2], PREFIX);		\
-      sprint_ul (__p, (unsigned long) (NUM));		\
+      sprint_uw (__p, (unsigned HOST_WIDE_INT) (NUM));	\
     }							\
   while (0)
 
@@ -418,7 +418,7 @@ see the files COPYING3 and COPYING.RUNTI
 #undef  ASM_OUTPUT_ASCII
 #define ASM_OUTPUT_ASCII(FILE, STR, LENGTH)		\
-  default_elf_asm_output_ascii ((FILE), (STR), (LENGTH));
+  default_elf_asm_output_ascii ((FILE), (STR), (LENGTH))
 
 /* Allow the use of the -frecord-gcc-switches switch via the
    elf_record_gcc_switches function defined in varasm.c.  */

=== modified file 'gcc/config/i386/i386.c'
--- gcc/config/i386/i386.c	2012-07-28 09:16:52 +
+++ gcc/c
Re: [libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)
On Sat, 4 Aug 2012, Ian Lance Taylor wrote:
>> On Fri, 3 Aug 2012, Ian Lance Taylor wrote:
>>> I'm not sure where you are looking.  I only see one call to
>>> _obstack_begin in the gcc directory, and it could easily be replaced
>>> with a call to obstack_specify_allocation instead.
>>
>> In libcpp/ mostly, but only 4 cases so that's not too many, I can
>> change those with obstack_init/begin.
>>
>>> +#define XOBFINISH(O, PT)	((PT) obstack_finish ((O)))
>>>
>>> For XOBNEW, etc., we use (T *) rather than (PT).  Using (PT) seems
>>> error-prone--it's the only use of the obstack with a different type
>>> parameter.  Why not use T rather than PT here, and return (T *)?
>>
>> I'd have to change many (about 60) occurrences of XOBFINISH if I
>> change that.  I'd go for it if I was sure it's what we want, it can
>> be a separate patch later on.
>
> I'm sorry to ask you to change a lot of code, but it simply doesn't
> make sense to me to have all but one macro take the type as an
> argument, and have one macro take a pointer to the type.  They really
> have to be consistent.

No problem, if anyone else doesn't object I'll change those (in a second patch, right?).

Thanks, Dimitris
Re: [libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)
On Fri, 3 Aug 2012, Ian Lance Taylor wrote:
>> 2012-08-04  Dimitrios Apostolou
>>
>> 	* libiberty.h (XOBDELETE, XOBGROW, XOBGROWVEC, XOBSHRINK)
>> 	(XOBSHRINKVEC): New type-safe macros for obstack allocation.
>> 	(XOBFINISH): Renamed argument to PT since it is a pointer to T.
>>
>> +/* Type-safe obstack allocator.  You must first initialize the obstack with
>> +   obstack_init() or _obstack_begin().
>
> This should recommend obstack_init, or obstack_begin, but not
> _obstack_begin.  Also obstack_specify_allocation and
> obstack_specify_allocation_with_arg are OK, so really it might be
> better not to list the functions, but simply say "You must first
> initialize the obstack."

Grep reveals that 9 out of 10 times we use _obstack_begin(), and we set alignment to 0 (isn't it suboptimal?).

>> +   T: Type, O: Obstack, N: Number of elements, S: raw Size,
>
> s/Size/size/
>
>> +#define XOBSHRINK(O, T)	obstack_blank ((O), -1 * sizeof (T))
>> +#define XOBSHRINKVEC(O, T, N)	obstack_blank ((O), -1 * sizeof (T) * (N))
>
> These are hard to use safely.  I'm not sure we should define them at
> all.

I've already used XOBSHRINK and it looks clear to me, but I could use obstack_blank() directly if necessary.

>> +#define XOBFINISH(O, PT)	((PT) obstack_finish ((O)))
>
> For XOBNEW, etc., we use (T *) rather than (PT).  Using (PT) seems
> error-prone--it's the only use of the obstack with a different type
> parameter.  Why not use T rather than PT here, and return (T *)?

I'd have to change many (about 60) occurrences of XOBFINISH if I change that. I'd go for it if I was sure it's what we want, it can be a separate patch later on.

Thanks, Dimitris
[libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)
Hi Dodji,

I appreciate your review, but I'm replying for now only for the libiberty.h part (attached updated patch), which I've needed elsewhere and seems the easiest to quickly apply.

On Wed, 18 Jul 2012, Dodji Seketeli wrote:
>> === modified file 'include/libiberty.h'
>> --- include/libiberty.h	2011-09-28 19:04:30 +
>> +++ include/libiberty.h	2012-07-07 16:04:01 +
>> [...]
>> -/* Type-safe obstack allocator. */
>> +/* Type-safe obstack allocator.  You must first initialize the obstack with
>> +   obstack_init() or _obstack_begin(). */
>
> Why not just using the _obstack_begin function?  Why introducing the
> use of the obstack_init macro?

Please correct me if I'm wrong, but my impression is that obstack_init is the documented way (see [1]); _obstack_begin seems hidden and gives access to dangerous (performance-wise) parameters, even though we always use it with the defaults (hmmm, double checking I see that we mostly set size 0, which is the default, but also alignment 0???).

[1] http://www.gnu.org/software/libc/manual/html_node/Preparing-for-Obstacks.html

>> -#define XOBFINISH(O, T) ((T) obstack_finish ((O)))
>
> I think this change is not needed.  You basically remove this line to
> replace it with the hunk below, that comes later in the patch:
>
>> +#define XOBFINISH(O, PT) ((PT) obstack_finish ((O)))
>
> So I believe you could do away with the change.

I just wanted to point out that it expects and returns a Pointer to the Type, e.g. after you XOBGROW(O, int, DATA) you should XOBFINISH(O, int *). It would probably be better for XOBFINISH to accept T and return (T *), but I didn't want to change the existing status quo.

I think I addressed your other concerns (style, comments, unneeded macro params). CC'd maintainers. What do you think about the attached patch?

Thanks, Dimitris

P.S. Personally I prefer the original obstack interface, i.e. obstack_grow(), obstack_finish() etc. I just added the macros to follow existing style and avoid typecasting because of C++. Let me know if it's OK to use obstacks directly and I'll back out the patch.

2012-08-04  Dimitrios Apostolou

	* libiberty.h (XOBDELETE, XOBGROW, XOBGROWVEC, XOBSHRINK)
	(XOBSHRINKVEC): New type-safe macros for obstack allocation.
	(XOBFINISH): Renamed argument to PT since it is a pointer to T.

=== modified file 'include/libiberty.h'
--- include/libiberty.h	2011-09-28 19:04:30 +
+++ include/libiberty.h	2012-08-03 23:51:55 +
@@ -361,12 +361,21 @@ extern unsigned int xcrc32 (const unsign
 #define XDUPVAR(T, P, S1, S2)	((T *) xmemdup ((P), (S1), (S2)))
 #define XRESIZEVAR(T, P, S)	((T *) xrealloc ((P), (S)))
 
-/* Type-safe obstack allocator. */
+/* Type-safe obstack allocator.  You must first initialize the obstack with
+   obstack_init() or _obstack_begin().
+   T: Type, O: Obstack, N: Number of elements, S: raw Size,
+   PT: Pointer to T, D: Data, P: Pointer to element. */
 
 #define XOBNEW(O, T)		((T *) obstack_alloc ((O), sizeof (T)))
 #define XOBNEWVEC(O, T, N)	((T *) obstack_alloc ((O), sizeof (T) * (N)))
 #define XOBNEWVAR(O, T, S)	((T *) obstack_alloc ((O), (S)))
-#define XOBFINISH(O, T)		((T) obstack_finish ((O)))
+#define XOBDELETE(O, P)		(obstack_free ((O), (P)))
+
+#define XOBGROW(O, T, D)	obstack_grow ((O), (D), sizeof (T))
+#define XOBGROWVEC(O, T, D, N)	obstack_grow ((O), (D), sizeof (T) * (N))
+#define XOBSHRINK(O, T)		obstack_blank ((O), -1 * sizeof (T))
+#define XOBSHRINKVEC(O, T, N)	obstack_blank ((O), -1 * sizeof (T) * (N))
+#define XOBFINISH(O, PT)	((PT) obstack_finish ((O)))
 
 /* hex character manipulation routines */
Re: cosmetic change - simplify cse.c:preferable()
On Thu, 19 Jul 2012, Richard Guenther wrote:
> I don't think it's any good or clearer to understand.

Hi Richi,

I had forgotten I prepared this for PR #19832, maybe you want to take a look. FWIW, with my patch applied there is a difference of ~3 M instr, which is almost unmeasurable in time. But we can close the PR even with the simple patch Christophe posted, since mine does not make things clearer.

Thanks, Dimitris
Re: ping again - Show hash table stats when -fmem-report
I'm always forgetting something; this time it was the changelog, see attached (same as old, nothing significant changed).

On Fri, 3 Aug 2012, Dimitrios Apostolou wrote:

Hi, I've updated this patch to trunk and rebootstrapped it, so I'm resubmitting it. I'm also making a try to CC all relevant maintainers, sorry for spamming, but even though this patch is simple it touches many parts. I'm also adding stats to pointer_{set,map}, but that will come in a separate patch. The notes quoted from an earlier mail still apply:

On Sun, 8 Jul 2012, Dimitrios Apostolou wrote:

Hi, this patch adds many nice stats about hash tables when gcc is run with -fmem-report. Attached patch tested on x86, no regressions. Also attached is sample output of -fmem-report when compiling reload.c with -O2 -g. Especially interesting is the extreme htab usage in var-tracking (of course we already knew that... :-) and maybe symtab's slightly high collision rate (given it's a very hot hash table). Moreover it's notable that since last year the collision rate for mem_attrs_htab has been reduced from around 8 to 1, still not good enough but a major improvement.

Jan: It has negligible overhead in runtime, that's why I'm pushing it again as it is. Having compile-time instrumentation is something different, but still I'm not quite sure it's worth the effort. IMHO doing separate builds to get memory info is tedious and even I've stopped doing them. If it has no overhead maybe it's better as a runtime option? Anyway, compile-time instrumentation can come later.

Finally I repost some of my notes I'd like to get feedback on:

* Is it OK that I included <assert.h> in libiberty's hashtab.c? What's the proper way to assert stuff, since gcc_assert() is not accessible?
* Many hash tables are created with htab_create_ggc(), for example referenced_vars and default_defs in tree-ssa.c. I collect statistics in delete_tree_ssa() but maybe some are not deallocated in there but automatically garbage collected?
* Obviously hash table sizes are inflated where two entries might reference the same element (for example in symtab_hash) but I don't handle this.
* Changes that reduce load factor have been backed out since they brought controversy. I still think they are good, at least for symtab. I'll measure numbers separately for this after this patch makes it.

Thanks, Dimitris

2012-05-21  Dimitrios Apostolou

	Print various statistics about hash tables when called with
	-fmem-report. If the tables are created once use
	htab_dump_statistics(), if they are created/destroyed multiple
	times then introduce global variables to track statistics.

	* cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump
	stats about hash tables.
	* gcc/symtab.c, cgraph.h (symtab_dump_stats): New function to
	dump stats about hash tables.
	* cgraph.c (call_site_hash_{num,expands,searches,collisions}):
	New globals to keep statistics about call_site_hash hash tables.
	(cgraph_remove_node_callees, cgraph_remove_node): if (mem_report)
	then keep statistics about hash tables.
	* cselib.c, cselib.h (cselib_dump_stats): New function to dump
	stats about cselib_hash_table.
	* cselib.c (cselib_htab_{num,expands,searches,collisions}): New
	globals to keep hash table statistics.
	(cselib_finish): if (mem_report) keep hash table statistics.
	* dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if
	-fmem_report.
	* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
	dump statistics about mem_attrs_htab hash table.
	* tree.h, tree-ssa.c (tree_ssa_dump_stats): New function to print
	statistics about all referenced_vars and default_defs hash
	tables.
	* tree-ssa.c (default_defs_{num,expands,searches,collisions})
	(referenced_vars_{num,expands,searches,collisions}): New globals
	to keep statistics about hash tables.
	(delete_tree_ssa): Keep statistics for hash tables by increasing
	the new global variables.
	* tree.c (dump_tree_statistics): Call tree_ssa_dump_stats().
	(print_type_hash_statistics): Used the new htab_dump_statistics()
	function.
	* var-tracking.c (vars_htab_{num,expands,searches,collisions})
	(dropval_htab_{num,expands,searches,collisions})
	(cv_htab_{num,expands,searches,collisions}): New globals to keep
	hash table statistics.
	(shared_hash_destroy, vt_finalize): Keep statistics by increasing
	values of new global variables if -fmem-report.
	* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
	stats about vars->htab, dropped_variables and value_chains hash
	tables.
	* toplev.c: Included cselib.h for cselib_dump_stats().
	(dump_memory_report):
Re: ping again - Show hash table stats when -fmem-report
Hi,

I've updated this patch to trunk and rebootstrapped it, so I'm resubmitting it. I'm also making a try to CC all relevant maintainers, sorry for spamming, but even though this patch is simple it touches many parts. I'm also adding stats to pointer_{set,map}, but that will come in a separate patch. The notes quoted from an earlier mail still apply:

On Sun, 8 Jul 2012, Dimitrios Apostolou wrote:

Hi, this patch adds many nice stats about hash tables when gcc is run with -fmem-report. Attached patch tested on x86, no regressions. Also attached is sample output of -fmem-report when compiling reload.c with -O2 -g. Especially interesting is the extreme htab usage in var-tracking (of course we already knew that... :-) and maybe symtab's slightly high collision rate (given it's a very hot hash table). Moreover it's notable that since last year the collision rate for mem_attrs_htab has been reduced from around 8 to 1, still not good enough but a major improvement.

Jan: It has negligible overhead in runtime, that's why I'm pushing it again as it is. Having compile-time instrumentation is something different, but still I'm not quite sure it's worth the effort. IMHO doing separate builds to get memory info is tedious and even I've stopped doing them. If it has no overhead maybe it's better as a runtime option? Anyway, compile-time instrumentation can come later.

Finally I repost some of my notes I'd like to get feedback on:

* Is it OK that I included <assert.h> in libiberty's hashtab.c? What's the proper way to assert stuff, since gcc_assert() is not accessible?
* Many hash tables are created with htab_create_ggc(), for example referenced_vars and default_defs in tree-ssa.c. I collect statistics in delete_tree_ssa() but maybe some are not deallocated in there but automatically garbage collected?
* Obviously hash table sizes are inflated where two entries might reference the same element (for example in symtab_hash) but I don't handle this.
* Changes that reduce load factor have been backed out since they brought controversy. I still think they are good, at least for symtab. I'll measure numbers separately for this after this patch makes it.

Thanks, Dimitris

=== modified file 'gcc/Makefile.in'
--- gcc/Makefile.in	2012-07-26 13:10:04 +
+++ gcc/Makefile.in	2012-08-03 14:37:47 +
@@ -2679,7 +2679,7 @@ toplev.o : toplev.c $(CONFIG_H) $(SYSTEM
   langhooks.h insn-flags.h $(CFGLOOP_H) hosthooks.h \
   $(CGRAPH_H) $(COVERAGE_H) alloc-pool.h $(GGC_H) \
   $(OPTS_H) params.def tree-mudflap.h $(TREE_PASS_H) $(GIMPLE_H) \
-  tree-ssa-alias.h $(PLUGIN_H) realmpfr.h tree-diagnostic.h \
+  tree-ssa-alias.h $(PLUGIN_H) cselib.h realmpfr.h tree-diagnostic.h \
   $(TREE_PRETTY_PRINT_H) opts-diagnostic.h $(COMMON_TARGET_H)
 
 hwint.o : hwint.c $(CONFIG_H) $(SYSTEM_H) $(DIAGNOSTIC_CORE_H)

=== modified file 'gcc/cgraph.c'
--- gcc/cgraph.c	2012-07-16 11:32:42 +
+++ gcc/cgraph.c	2012-08-03 14:37:47 +
@@ -1066,6 +1066,13 @@ cgraph_update_edges_for_call_stmt (gimpl
 }
 
+/* Keep statistics about call_site_hash hash tables.  */
+static unsigned int call_site_hash_num;
+static unsigned long call_site_hash_expands;
+static unsigned long call_site_hash_searches;
+static unsigned long call_site_hash_collisions;
+
+
 /* Remove all callees from the node.  */
 
 void
@@ -1096,6 +1103,16 @@ cgraph_node_remove_callees (struct cgrap
   node->callees = NULL;
   if (node->call_site_hash)
     {
+      if (mem_report)
+	{
+	  call_site_hash_num++;
+	  call_site_hash_expands +=
+	    htab_expands_num (node->call_site_hash);
+	  call_site_hash_searches +=
+	    htab_searches_num (node->call_site_hash);
+	  call_site_hash_collisions +=
+	    htab_collisions_num (node->call_site_hash);
+	}
       htab_delete (node->call_site_hash);
       node->call_site_hash = NULL;
     }
@@ -1252,6 +1269,16 @@ cgraph_remove_node (struct cgraph_node *
   node->symbol.decl = NULL;
   if (node->call_site_hash)
     {
+      if (mem_report)
+	{
+	  call_site_hash_num++;
+	  call_site_hash_expands +=
+	    htab_expands_num (node->call_site_hash);
+	  call_site_hash_searches +=
+	    htab_searches_num (node->call_site_hash);
+	  call_site_hash_collisions +=
+	    htab_collisions_num (node->call_site_hash);
+	}
       htab_delete (node->call_site_hash);
       node->call_site_hash = NULL;
     }
@@ -2457,4 +2484,18 @@ verify_cgraph (void)
   FOR_EACH_FUNCTION (node)
     verify_cgraph_node (node);
 }
+
+void
+cgraph_dump_stats (void)
+{
+  fprintf (stderr, "\ncgraph.c hash table stats:\n");
+  fprintf (stderr, "\t%u cgraph_node->call_site_hash hash tables:\n",
+	   call_site_hash_num);
+  fprintf (stderr, "
PR 51094 - fprint_w() in output_addr_const() reinstated
Since output_addr_const() shows up as pretty hot in the compiler, I reinstated the fprint_w() call in place of fprintf(). This patch relies on two things: two's complement representation for negative integers, and HOST_WIDE_INT being at least as large a type as long on all platforms (will it always be?). Bootstrapped/tested on i386, regtested on x86_64 multilib, i386-pc-solaris2.10 (thanks ro) and i686-darwin9 (thanks iains).

2012-07-09  Dimitrios Apostolou

	* final.c, output.h (fprint_w): New function to write a
	HOST_WIDE_INT to a file, fast.
	* final.c (output_addr_const): Use fprint_w() instead of
	fprintf().
	(sprint_ul_rev): New static function to write a HOST_WIDE_INT to
	a string in reverse, fast.

Thanks,
Dimitris

=== modified file 'gcc/final.c'
--- gcc/final.c	2012-06-24 17:58:46 +0000
+++ gcc/final.c	2012-07-09 04:17:48 +0000
@@ -3693,7 +3693,7 @@ output_addr_const (FILE *file, rtx x)
       break;
 
     case CONST_INT:
-      fprintf (file, HOST_WIDE_INT_PRINT_DEC, INTVAL (x));
+      fprint_w (file, INTVAL (x));
       break;
 
     case CONST:
@@ -3827,11 +3827,12 @@ fprint_whex (FILE *f, unsigned HOST_WIDE
     }
 }
 
-/* Internal function that prints an unsigned long in decimal in reverse.
-   The output string IS NOT null-terminated. */
+/* Internal function that writes to a string an unsigned HOST_WIDE_INT or an
+   unsigned long in decimal in reverse.  The output string IS NOT
+   null-terminated. */
 
 static int
-sprint_ul_rev (char *s, unsigned long value)
+sprint_ul_rev (char *s, unsigned HOST_WIDE_INT value)
 {
   int i = 0;
   do
@@ -3849,6 +3850,32 @@ sprint_ul_rev (char *s, unsigned long va
   return i;
 }
 
+/* Write a signed HOST_WIDE_INT as decimal to a file, fast. */
+
+void
+fprint_w (FILE *f, HOST_WIDE_INT value)
+{
+  /* python says: len(str(2**64)) == 20 */
+  char s[20];
+  int i;
+
+  if (value >= 0)
+    i = sprint_ul_rev (s, value);
+  else
+    {
+      i = sprint_ul_rev (s, (unsigned HOST_WIDE_INT) (~value) + 1);
+      putc ('-', f);
+    }
+
+  /* It's probably too small so don't bother with string reversal and fputs */
+  do
+    {
+      i--;
+      putc (s[i], f);
+    }
+  while (i != 0);
+}
+
 /* Write an unsigned long as decimal to a file, fast. */
 
 void

=== modified file 'gcc/output.h'
--- gcc/output.h	2012-06-24 17:58:46 +0000
+++ gcc/output.h	2012-07-08 09:24:59 +0000
@@ -125,6 +125,7 @@ extern void output_addr_const (FILE *, r
 #define ATTRIBUTE_ASM_FPRINTF(m, n) ATTRIBUTE_NONNULL(m)
 #endif
 
+extern void fprint_w (FILE *, HOST_WIDE_INT);
 extern void fprint_whex (FILE *, unsigned HOST_WIDE_INT);
 extern void fprint_ul (FILE *, unsigned long);
 extern int sprint_ul (char *, unsigned long);
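The reverse-then-emit trick in fprint_w() can be shown in isolation. Below is a simplified standalone sketch (the sketch_* names are mine, not the patch's; it writes into a null-terminated buffer instead of a FILE so the result is easy to check):

```c
/* Write VALUE in decimal into S in reverse order; return the digit
   count.  Like the patch's sprint_ul_rev, the output here is NOT
   null-terminated.  */
static int
sketch_sprint_rev (char *s, unsigned long long value)
{
  int i = 0;
  do
    {
      s[i++] = "0123456789"[value % 10];
      value /= 10;
    }
  while (value != 0);
  return i;
}

/* Forward-order wrapper: emit the sign, convert the magnitude in
   reverse, then drain the scratch buffer back to front.  */
static void
sketch_sprint (char *dst, long long value)
{
  char rev[20];                 /* 2**64 needs at most 20 decimal digits */
  int i;

  if (value >= 0)
    i = sketch_sprint_rev (rev, (unsigned long long) value);
  else
    {
      /* Two's complement negation done on the unsigned type, as in
         the patch, to avoid overflow on the most negative value.  */
      i = sketch_sprint_rev (rev, (unsigned long long) (~value) + 1);
      *dst++ = '-';
    }
  while (i-- > 0)
    *dst++ = rev[i];
  *dst = '\0';
}
```

The point of the reversed scratch buffer is that the digit loop only needs one division per digit and no length computation up front.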
cosmetic change - simplify cse.c:preferable()
Hello, I've had this patch for some time now; it's simple and cosmetic only. I did it while trying to understand expression costs in CSE, and I think it's more readable than the previous version. FWIW it passed all tests on x86. Thanks, Dimitris

=== modified file 'gcc/cse.c'
--- gcc/cse.c	2012-06-15 09:22:00 +0000
+++ gcc/cse.c	2012-07-08 07:28:52 +0000
@@ -713,32 +713,25 @@ approx_reg_cost (rtx x)
 static int
 preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
 {
-  /* First, get rid of cases involving expressions that are entirely
-     unwanted.  */
-  if (cost_a != cost_b)
-    {
-      if (cost_a == MAX_COST)
-	return 1;
-      if (cost_b == MAX_COST)
-	return -1;
-    }
+  int cost_diff = cost_a - cost_b;
+  int regcost_diff = regcost_a - regcost_b;
 
-  /* Avoid extending lifetimes of hardregs.  */
-  if (regcost_a != regcost_b)
+  if (cost_diff != 0)
     {
-      if (regcost_a == MAX_COST)
-	return 1;
-      if (regcost_b == MAX_COST)
-	return -1;
+      /* If none of the expressions are entirely unwanted */
+      if ((cost_a != MAX_COST) && (cost_b != MAX_COST)
+	  /* AND only one of the regs is HARD_REG */
+	  && (regcost_diff != 0)
+	  && ((regcost_a == MAX_COST) || (regcost_b == MAX_COST))
+	  )
+	/* Then avoid extending lifetime of HARD_REG */
+	return regcost_diff;
+
+      return cost_diff;
     }
 
-  /* Normal operation costs take precedence.  */
-  if (cost_a != cost_b)
-    return cost_a - cost_b;
-  /* Only if these are identical consider effects on register pressure.  */
-  if (regcost_a != regcost_b)
-    return regcost_a - regcost_b;
-  return 0;
+  /* cost_a == cost_b, consider effects on register pressure */
+  return regcost_diff;
 }
 
 /* Internal function, to compute cost when X is not a register; called
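The intent of the rewritten comparator (MAX_COST expressions always lose; a hard-register cost breaks ties against the register) can be sketched as a standalone function. This is my own condensed model, with an illustrative SKETCH_MAX_COST instead of GCC's real MAX_COST; a negative result means A is preferable, positive means B:

```c
#define SKETCH_MAX_COST 1000    /* stand-in for cse.c's MAX_COST */

/* Mirror of the patched preferable(): negative means A preferable,
   positive means B preferable, zero means no preference.  */
static int
sketch_preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
{
  int cost_diff = cost_a - cost_b;
  int regcost_diff = regcost_a - regcost_b;

  if (cost_diff != 0)
    {
      /* If neither expression is entirely unwanted...  */
      if (cost_a != SKETCH_MAX_COST && cost_b != SKETCH_MAX_COST
          /* ...and exactly one of them ties up a hard register...  */
          && regcost_diff != 0
          && (regcost_a == SKETCH_MAX_COST
              || regcost_b == SKETCH_MAX_COST))
        /* ...then avoid extending the hard register's lifetime.  */
        return regcost_diff;
      /* Otherwise operation cost decides; since MAX_COST is the
         largest value, an unwanted expression loses automatically.  */
      return cost_diff;
    }
  /* Equal operation costs: fall back to register pressure.  */
  return regcost_diff;
}
```

Note how the MAX_COST special cases of the old code collapse into plain subtraction: when one cost is MAX_COST and the costs differ, the sign of cost_diff already makes the unwanted expression lose.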
PR #53525 - track-macro-expansion performance regression
With the attached patches I introduce four new obstacks in struct cpp_reader to substitute malloc's/realloc's when expanding macros. Numbers have been posted in the PR, but to summarize: before: 0.785 s or 2201 M instr after: 0.760 s or 2108 M instr Memory overhead is some tens kilobytes worst case. Tested on x86, no regressions. I think this version of the patch is OK to merge, besides some TODO comments (I'd appreciate opinions on them) and the following maybe: IMHO the new obstack_{mark,release} functions are the ones that will be harder to apply, because they make gcc's obstacks even more different than libc's. I sent the patch to libc-alpha but the feedback I got was that I should first make the two obstack versions (gcc,libc) identical, and then try to push changes. I've noted the primary differences and plan on tackling this, but it's not as trivial as it seems. So if it's not OK, where could the new obstack_{mark,release} go? Thanks, Dimitris === modified file 'include/libiberty.h' --- include/libiberty.h 2011-09-28 19:04:30 + +++ include/libiberty.h 2012-07-07 16:04:01 + @@ -361,12 +361,19 @@ extern unsigned int xcrc32 (const unsign #define XDUPVAR(T, P, S1, S2) ((T *) xmemdup ((P), (S1), (S2))) #define XRESIZEVAR(T, P, S)((T *) xrealloc ((P), (S))) -/* Type-safe obstack allocator. */ +/* Type-safe obstack allocator. You must first initialize the obstack with + obstack_init() or _obstack_begin(). 
*/ #define XOBNEW(O, T) ((T *) obstack_alloc ((O), sizeof (T))) #define XOBNEWVEC(O, T, N) ((T *) obstack_alloc ((O), sizeof (T) * (N))) #define XOBNEWVAR(O, T, S) ((T *) obstack_alloc ((O), (S))) -#define XOBFINISH(O, T) ((T) obstack_finish ((O))) +#define XOBDELETE(O, T, P) (obstack_free ((O), (P))) + +#define XOBGROW(O, T, D) obstack_grow ((O), (D), sizeof (T)) +#define XOBGROWVEC(O, T, D, N) obstack_grow ((O), (D), sizeof (T) * (N)) +#define XOBSHRINK(O, T)obstack_blank ((O), -1 * sizeof (T)) +#define XOBSHRINKVEC(O, T, N) obstack_blank ((O), -1 * sizeof (T) * (N)) +#define XOBFINISH(O, PT) ((PT) obstack_finish ((O))) /* hex character manipulation routines */ === modified file 'libcpp/init.c' --- libcpp/init.c 2012-04-30 11:43:43 + +++ libcpp/init.c 2012-07-07 16:04:01 + @@ -241,10 +241,20 @@ cpp_create_reader (enum c_lang lang, has /* The expression parser stack. */ _cpp_expand_op_stack (pfile); +#define obstack_chunk_alloc ((void *(*) (long)) xmalloc) +#define obstack_chunk_free ((void (*) (void *)) free) + /* Initialize the buffer obstack. */ - _obstack_begin (&pfile->buffer_ob, 0, 0, - (void *(*) (long)) xmalloc, - (void (*) (void *)) free); + obstack_init(&pfile->buffer_ob); + + /* Initialize the macro obstacks. */ + obstack_init (&pfile->exp_ob); + if (CPP_OPTION (pfile, track_macro_expansion)) +{ + obstack_init (&pfile->virt_locs_ob); + obstack_init (&pfile->arg_locs_ob); + obstack_init (&pfile->exp_locs_ob); +} _cpp_init_files (pfile); @@ -253,6 +263,9 @@ cpp_create_reader (enum c_lang lang, has return pfile; } +#undef obstack_chunk_alloc +#undef obstack_chunk_free + /* Set the line_table entry in PFILE. This is called after reading a PCH file, as the old line_table will be incorrect. 
*/ void @@ -289,6 +302,14 @@ cpp_destroy (cpp_reader *pfile) deps_free (pfile->deps); obstack_free (&pfile->buffer_ob, 0); + obstack_free (&pfile->exp_ob, 0); + if (CPP_OPTION (pfile, track_macro_expansion)) +{ + obstack_free (&pfile->virt_locs_ob, 0); + obstack_free (&pfile->arg_locs_ob, 0); + obstack_free (&pfile->exp_locs_ob, 0); +} + _cpp_destroy_hashtable (pfile); _cpp_cleanup_files (pfile); _cpp_destroy_iconv (pfile); === modified file 'libcpp/internal.h' --- libcpp/internal.h 2012-05-29 09:36:29 + +++ libcpp/internal.h 2012-07-07 17:18:53 + @@ -555,6 +555,13 @@ struct cpp_reader /* If non-null, the lexer will use this location for the next token instead of getting a location from the linemap. */ source_location *forced_token_location_p; + + /* Obstacks used to speed up macro expansion and virt_locs tracking. */ + struct obstack exp_ob; /* for expanding macro arguments */ + /* The rest are used only when -ftrack-macro-expansion */ + struct obstack exp_locs_ob; /* virt_locs of expanded macro arguments */ + struct obstack arg_locs_ob; /* virt_locs of macro arguments */ + struct obstack virt_locs_ob; /* virt locs for all other macros */ }; /* Character classes. Based on the more primitive macros in safe-ctype.h. === modified file 'libcpp/macro.c' --- libcpp/macro.c 2012-05-29 14:53:50 + +++ libcpp/macro.c 2012-07-07 18:08:44 + @@ -24,10 +24,20 @@ along with this program; see the file CO You are forbidden to forbid anyone else to use, share and improve what you give them
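The obstack_mark/obstack_release functions proposed here are not part of libc's obstack API, which is exactly the compatibility concern raised above. The idea behind them, though, is just stack-like rollback of a bump allocator. A toy arena makes that concrete (sketch only, hypothetical names, no growth or error handling):

```c
#include <stdlib.h>

/* Toy bump arena illustrating the mark/release idea behind the
   proposed obstack_mark()/obstack_release().  This is NOT the real
   obstack API; a real obstack also grows by chunks.  */
struct arena
{
  char *base;
  size_t used, cap;
};

static void
arena_init (struct arena *a, size_t cap)
{
  a->base = (char *) malloc (cap);
  a->used = 0;
  a->cap = cap;
}

/* Allocation is a pointer bump; no growth handling in this sketch.  */
static void *
arena_alloc (struct arena *a, size_t n)
{
  void *p = a->base + a->used;
  a->used += n;
  return p;
}

/* "Mark" remembers the current high-water point...  */
static size_t
arena_mark (struct arena *a)
{
  return a->used;
}

/* ...and "release" rolls back to it: every allocation made since the
   mark is freed in O(1), with no malloc()/free() traffic.  This is
   what lets macro-argument expansion reuse the same storage on every
   expansion.  */
static void
arena_release (struct arena *a, size_t mark)
{
  a->used = mark;
}
```

With plain obstacks the same effect is achieved by obstack_free() back to a remembered address, which is why the question of where the new entry points should live is mostly about API shape, not mechanism.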
ping again - Show hash table stats when -fmem-report
uot;, - (unsigned long) table->nslots); - fprintf (stderr, "deleted\t\t%lu\n", + fprintf (stderr, "\tentries\t\t%lu (%.2f%%)\n", + (unsigned long) nelts, nelts * 100.0 / table->nslots); + fprintf (stderr, "\tdeleted\t\t%lu\n", (unsigned long) deleted); - fprintf (stderr, "bytes\t\t%lu%c (%lu%c overhead)\n", + fprintf (stderr, "\tslots\t\t%u\n", + table->nslots); + fprintf (stderr, "\tstring bytes\t%lu%c (%lu%c obstack_memory_used)\n", SCALE (total_bytes), LABEL (total_bytes), - SCALE (overhead), LABEL (overhead)); - fprintf (stderr, "table size\t%lu%c\n", + SCALE (obmem), LABEL (obmem)); + fprintf (stderr, "\ttable size\t%lu%c\n", SCALE (headers), LABEL (headers)); exp_len = (double)total_bytes / (double)nelts; exp2_len = exp_len * exp_len; exp_len2 = (double) sum_of_squares / (double) nelts; - fprintf (stderr, "coll/search\t%.4f\n", + fprintf (stderr, "\texpansions\t%u\n", + table->expands); + fprintf (stderr, "\tsearches\t%u\n", + table->searches); + fprintf (stderr, "\tcollisions\t%u\n", + table->collisions); + fprintf (stderr, "\tcoll/search\t%.4f\n", (double) table->collisions / (double) table->searches); - fprintf (stderr, "ins/search\t%.4f\n", + fprintf (stderr, "\tins/search\t%.4f\n", (double) nelts / (double) table->searches); - fprintf (stderr, "avg. entry\t%.2f bytes (+/- %.2f)\n", + fprintf (stderr, "\tavg. entry\t%.2f bytes (+/- %.2f)\n", exp_len, approx_sqrt (exp_len2 - exp2_len)); - fprintf (stderr, "longest entry\t%lu\n", + fprintf (stderr, "\tlongest entry\t%lu\n", (unsigned long) longest); #undef SCALE #undef LABEL === modified file 'libiberty/hashtab.c' --- libiberty/hashtab.c 2011-02-03 07:23:20 + +++ libiberty/hashtab.c 2012-07-07 19:10:57 + @@ -58,6 +58,7 @@ Boston, MA 02110-1301, USA. 
*/ #endif #include +#include #include "libiberty.h" #include "ansidecl.h" @@ -563,6 +564,7 @@ htab_expand (htab_t htab) htab->size_prime_index = nindex; htab->n_elements -= htab->n_deleted; htab->n_deleted = 0; + htab->expands++; p = oentries; do @@ -812,6 +814,107 @@ htab_collisions (htab_t htab) return (double) htab->collisions / (double) htab->searches; } +/* Return the number of expands */ + +unsigned int +htab_expands_num (htab_t htab) +{ + return htab->expands; +} + +/* Return the number of collisions */ + +unsigned int +htab_collisions_num (htab_t htab) +{ + return htab->collisions; +} + +/* Return the number of searches */ + +unsigned int +htab_searches_num (htab_t htab) +{ + return htab->searches; +} + +/* Dump allocation statistics to stderr. If elem_size > 0 display total memory + * usage of hash table too. */ + +void +htab_dump_statistics (htab_t table, size_t elem_size) +{ + size_t n_valid, headers, contents, empties, deleted, total; + void **p, **limit; + +#define SCALE(x) ((unsigned long) ((x) < 1024*10 \ + ? (x) \ + : ((x) < 1024*1024*10 \ +? (x) / 1024 \ +: (x) / (1024*1024 +#define LABEL(x) ((x) < 1024*10 ? ' ' : ((x) < 1024*1024*10 ? 
'k' : 'M')) + + deleted = n_valid = empties = 0; + p = table->entries; + limit = p + table->size; + do +if (*p == HTAB_DELETED_ENTRY) + ++deleted; +else if (*p == HTAB_EMPTY_ENTRY) + ++empties; +else + ++n_valid; + while (++p < limit); + + assert (deleted == table->n_deleted); + assert (empties == table->size - table->n_elements); + assert (n_valid + deleted == table->n_elements); + + headers = table->size * sizeof (*(table->entries)); + contents = n_valid * elem_size; + total = sizeof (*table) + headers + contents; + + fprintf (stderr, "\tslots\t\t%lu\n", + (unsigned long) table->size); + + fprintf (stderr, "\tused\t\t%lu (%.2f%%)\n", + (unsigned long) table->n_elements, + table->n_elements * 100.0 / table->size); + fprintf (stderr, "\t\tvalid\t\t%lu\n", + (unsigned long) n_valid); + fprintf (stderr, "\t\tdeleted\t\t%lu\n", + (unsigned long) table->n_deleted); + + fprintf(stderr, "\ttotal size\t%lu %cB\n", + SCALE (total), LABEL (total)); + fprintf (stderr, "\t\tstruct htab\t%lu\t B\n", + (unsigned long) sizeof (*table)); + fprintf (stderr, "\t\ttable\t\t%lu\t%cB\n", + SCALE (headers), LABEL (headers)); + if (elem_size > 0) +{ + fprintf (stderr, "\t\tone element\t%lu\t B\n", + (u
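The SCALE/LABEL macro pair used throughout these dumps picks a display unit so the numbers stay short: below 10k print raw bytes, below 10M print kilobytes, otherwise megabytes. The same logic as plain functions (my own restatement of the macros, for clarity):

```c
/* Mirror of the dump code's SCALE macro: reduce X to the value that
   will be printed next to the unit chosen by sketch_label().  */
static unsigned long
sketch_scale (unsigned long x)
{
  if (x < 1024UL * 10)
    return x;                       /* small: print bytes as-is */
  if (x < 1024UL * 1024 * 10)
    return x / 1024;                /* medium: print kilobytes */
  return x / (1024UL * 1024);       /* large: print megabytes */
}

/* Mirror of the LABEL macro: the unit suffix matching sketch_scale.  */
static char
sketch_label (unsigned long x)
{
  if (x < 1024UL * 10)
    return ' ';
  return x < 1024UL * 1024 * 10 ? 'k' : 'M';
}
```

Both must be called with the same raw value, which is why the original keeps them as a macro pair used together in each fprintf().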
Re: [line-map] simple oneliner that speeds up track-macro-expansion
Hi Dodji, On Mon, 4 Jun 2012, Dodji Seketeli wrote: Hello Dimitrios, I cannot approve or deny your patch, but I have one question. Who should I CC then? I saw that you have commits in that file. I am wondering why this change implies better performance. Is this because when we later want to encode a new line/column, and hit the spot below, (in linemap_line_start via linemap_position_for_column), we call less linemap_add (and thus allocate less maps): Almost. To be exact we don't even enter linemap_line_start() because of the check in linemap_position_for_column(). As a result we avoid almost 450K calls to linemap_line_start() (and many calls to malloc() subsequently), for my example case of compiling reload.c. Here is the source annotated by callgrind: Before: . source_location . linemap_position_for_column (struct line_maps *set, unsigned int to_column) 5,193,344 { 649,168source_location r = set->highest_line; . 3,895,008linemap_assert . (!linemap_macro_expansion_map_p (LINEMAPS_LAST_ORDINARY_MAP (set))); . 1,298,336if (to_column >= set->max_column_hint) . { 1,696,684if (r >= 0xC00 || to_column > 10) . { . /* Running low on source_locations - disable column numbers. */ . return r; . } .else . { . struct line_map *map = LINEMAPS_LAST_ORDINARY_MAP (set); 5,938,394 r = linemap_line_start (set, SOURCE_LINE (map, r), to_column + 50); 36,425,117 => line-map.c:linemap_line_start (424171x) . } . } 649,168r = r + to_column; 1,298,336if (r >= set->highest_location) 640,785 set->highest_location = r; .return r; 3,895,008 } After . source_location . linemap_position_for_column (struct line_maps *set, unsigned int to_column) 5,193,344 { 649,168source_location r = set->highest_line; . 3,895,008linemap_assert . (!linemap_macro_expansion_map_p (LINEMAPS_LAST_ORDINARY_MAP (set))); . 1,298,336if (to_column >= set->max_column_hint) . { 48if (r >= 0xC00 || to_column > 10) . { . /* Running low on source_locations - disable column numbers. */ . return r; . } .else . { . 
struct line_map *map = LINEMAPS_LAST_ORDINARY_MAP (set); 168 r = linemap_line_start (set, SOURCE_LINE (map, r), to_column + 50); 2,573 => line-map.c:linemap_line_start (12x) . } . } 649,168r = r + to_column; 1,298,336if (r >= set->highest_location) 640,805 set->highest_location = r; .return r; 3,895,008 } Dimitris
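The effect discussed above is easiest to see from the caller's guard: a location is just line start plus column, and the slow path (linemap_line_start and its allocations) is only taken when the column exceeds the current hint. A condensed model of that control flow (simplified, not the real line-map types or signatures):

```c
/* Condensed model of linemap_position_for_column's hot path.  Zeroing
   max_column_hint on every macro expansion forces the slow path on the
   next token after each expansion, which is what the patch avoids.  */
struct sketch_set
{
  unsigned highest_line, max_column_hint, highest_location;
};

static unsigned
sketch_position_for_column (struct sketch_set *set, unsigned to_column,
                            unsigned *slow_path_calls)
{
  unsigned r = set->highest_line;

  if (to_column >= set->max_column_hint)
    {
      /* Stand-in for the linemap_line_start() call: widen the hint so
         subsequent tokens on the line stay on the fast path.  */
      ++*slow_path_calls;
      set->max_column_hint = to_column + 50;
    }

  r += to_column;
  if (r >= set->highest_location)
    set->highest_location = r;
  return r;
}
```

In the callgrind numbers above this is the difference between 424171 and 12 calls into linemap_line_start() for the same input.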
[line-map] simple oneliner that speeds up track-macro-expansion
Hello list, the attached simple patch is a small part of the fix in Bug #53525, but it is independent and offers a noticeable performance improvement for track-macro-expansion. Tested on x86 with no regressions, besides memory usage going up by some KB. I don't get why max_column_hint was zeroed for every macro. Maybe there are pathological cases that I don't see?

2012-06-04  Dimitrios Apostolou

	* line-map.c (linemap_enter_macro): Don't zero max_column_hint
	in every macro. This improves performance by reducing the number
	of reallocations when track-macro-expansion is on.

Thanks
Dimitris

=== modified file 'libcpp/line-map.c'
--- libcpp/line-map.c	2012-04-30 11:42:12 +0000
+++ libcpp/line-map.c	2012-05-27 06:52:08 +0000
@@ -331,7 +331,6 @@ linemap_enter_macro (struct line_maps *s
 	  num_tokens * sizeof (source_location));
   LINEMAPS_MACRO_CACHE (set) = LINEMAPS_MACRO_USED (set) - 1;
 
-  set->max_column_hint = 0;
   return map;
 }
Re: [DF] Generate REFs in REGNO order
On Tue, 22 May 2012, Paolo Bonzini wrote:

> Il 22/05/2012 18:26, Dimitrios Apostolou ha scritto:
>> You are right, and I noticed that if we reverse (actually put
>> straight) the loop for the PARALLEL defs inside df_defs_record ()
>> then the speedup stands for both x86 and ppc64. The following patch
>> was tested on x86; do you think it is meaningful for the generic
>> case?
>
> It's really a lottery, but if it doesn't make things worse for x86 why
> not.

It doesn't, so please apply it. I'll still try to understand this in
depth and see if I can think of a proper way of generating DEFs in the
proper order.

Dimitris
Re: [DF] Generate REFs in REGNO order
On Tue, 22 May 2012, Paolo Bonzini wrote:

> Il 21/05/2012 19:49, Dimitrios Apostolou ha scritto:
>> Thanks for reviewing, in the meantime I'll try to figure out why this
>> patch doesn't offer any speed-up on ppc64 (doesn't break anything
>> though), so expect a followup by tomorrow.
>
> Perhaps you hit this?
>
>       else if (GET_CODE (XEXP (note, 0)) == CLOBBER)
>         {
>           if (REG_P (XEXP (XEXP (note, 0), 0)))
>             {
>               unsigned int regno = REGNO (XEXP (XEXP (note, 0), 0));
>               if (!TEST_HARD_REG_BIT (defs_generated, regno))
>                 df_defs_record (collection_rec, XEXP (note, 0), bb,
>                                 insn_info, flags);
>             }

You are right, and I noticed that if we reverse (actually put straight)
the loop for the PARALLEL defs inside df_defs_record() then the speedup
stands for both x86 and ppc64. The following patch was tested on x86;
do you think it is meaningful for the generic case?

--- gcc/df-scan.c	2012-05-06 04:08:43 +0000
+++ gcc/df-scan.c	2012-05-22 13:08:33 +0000
@@ -3010,7 +3010,7 @@ df_defs_record (struct df_collection_rec
       break;
 
     case PARALLEL:
-      for (i = XVECLEN (x, 0) - 1; i >= 0; i--)
+      for (i = 0; i < XVECLEN (x, 0); i++)
 	df_defs_record (collection_rec, XVECEXP (x, 0, i),
 			bb, insn_info, flags);
       break;

Thanks,
Dimitris
Re: [DF] Generate REFs in REGNO order
Hi Paolo,

On Mon, 21 May 2012, Paolo Bonzini wrote:

> Il 20/05/2012 20:50, Dimitrios Apostolou ha scritto:
>> Paolo: I couldn't find a single test-case where the mw_reg_pool was
>> heavily used so I reduced its size. You think it's OK for all archs?
>
> Makes sense, we can see if something breaks. I'll commit the patch
> tomorrow after a re-review.

Hopefully nothing will break; things will only become slower if the
mw_reg_pool gets resized to satisfy requests. But since I can't find an
mw_reg_pool-intensive testcase, I guess this won't happen.

Thanks for reviewing. In the meantime I'll try to figure out why this
patch doesn't offer any speed-up on ppc64 (it doesn't break anything
there though), so expect a followup by tomorrow.

Thanks,
Dimitris
Re: Show hash table stats when -fmem-report
One-line patch to update the Makefile.

2012-05-21  Dimitrios Apostolou

	* gcc/Makefile.in (toplev.o): Depend on cselib.h.

=== modified file 'gcc/Makefile.in'
--- gcc/Makefile.in	2012-05-04 20:04:47 +0000
+++ gcc/Makefile.in	2012-05-21 14:08:45 +0000
@@ -2751,7 +2751,7 @@ toplev.o : toplev.c $(CONFIG_H) $(SYSTEM
    langhooks.h insn-flags.h $(CFGLAYOUT_H) $(CFGLOOP_H) hosthooks.h \
    $(CGRAPH_H) $(COVERAGE_H) alloc-pool.h $(GGC_H) $(INTEGRATE_H) \
    $(OPTS_H) params.def tree-mudflap.h $(TREE_PASS_H) $(GIMPLE_H) \
-   tree-ssa-alias.h $(PLUGIN_H) realmpfr.h tree-diagnostic.h \
+   tree-ssa-alias.h $(PLUGIN_H) cselib.h realmpfr.h tree-diagnostic.h \
    tree-pretty-print.h opts-diagnostic.h $(COMMON_TARGET_H)
 hwint.o : hwint.c $(CONFIG_H) $(SYSTEM_H) $(DIAGNOSTIC_CORE_H)
Show hash table stats when -fmem-report
Hello list,

I ported a patch from last year's GSOC to current trunk, removing
deprecated hash tables and adding some new ones. I also backed out a
change that reduced collisions by decreasing the load factor, because it
had created controversy, so this patch should apply easily. CC'd
whoever is relevant to the subsystems or had commented on this last
year. Some notes:

* Is it OK that I included <assert.h> in libiberty's hashtab.c?

* Many hash tables are created with htab_create_ggc(), for example
  referenced_vars and default_defs in tree-ssa.c. I collect statistics
  in delete_tree_ssa(), but maybe some are not deallocated in there and
  are automatically garbage collected instead?

* Obviously hash table sizes are inflated where two entries might
  reference the same element (for example in symtab_hash), but I don't
  handle this.

* Maybe the best overall solution would be to add a string parameter to
  htab_create*() and store statistics in an internal hash table keyed
  by this string. Statistics would then be printed for all hash tables
  by iterating over this internal table. Since it would cause a
  significant slowdown, -fmem-report would be better as a compile-time
  option than a run-time one. This is probably overkill, so I think
  I'll skip it.

Thanks,
Dimitris

2012-05-21  Dimitrios Apostolou

	Print various statistics about hash tables when called with
	-fmem-report. If a table is created once, use
	htab_dump_statistics(); if tables are created/destroyed multiple
	times, introduce global variables to track statistics.

	* cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump
	stats about hash tables.
	* gcc/symtab.c, cgraph.h (symtab_dump_stats): New function to
	dump stats about hash tables.
	* cgraph.c (call_site_hash_{num,expands,searches,collisions}):
	New globals to keep statistics about call_site_hash hash tables.
	(cgraph_remove_node_callees, cgraph_remove_node): If (mem_report)
	then keep statistics about hash tables.
	* cselib.c, cselib.h (cselib_dump_stats): New function to dump
	stats about cselib_hash_table.
	* cselib.c (cselib_htab_{num,expands,searches,collisions}): New
	globals to keep hash table statistics.
	(cselib_finish): If (mem_report) keep hash table statistics.
	* dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if
	-fmem-report.
	* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
	dump statistics about mem_attrs_htab hash table.
	* tree.h, tree-ssa.c (tree_ssa_dump_stats): New function to print
	statistics about all referenced_vars and default_defs hash
	tables.
	* tree-ssa.c (default_defs_{num,expands,searches,collisions})
	(referenced_vars_{num,expands,searches,collisions}): New globals
	to keep statistics about hash tables.
	(delete_tree_ssa): Keep statistics for hash tables by increasing
	the new global variables.
	* tree.c (dump_tree_statistics): Call tree_ssa_dump_stats().
	(print_type_hash_statistics): Use the new htab_dump_statistics()
	function.
	* var-tracking.c (vars_htab_{num,expands,searches,collisions})
	(dropval_htab_{num,expands,searches,collisions})
	(cv_htab_{num,expands,searches,collisions}): New globals to keep
	hash table statistics.
	(shared_hash_destroy, vt_finalize): Keep statistics by increasing
	values of new global variables if -fmem-report.
	* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
	stats about vars->htab, dropped_variables and value_chains hash
	tables.
	* toplev.c: Include cselib.h for cselib_dump_stats().
	(dump_memory_report): Call all the above functions to provide
	better statistics.
	* hashtab.c, hashtab.h: Add "expands" variable to htab_t for
	tracking the total number of times the hash table was expanded.
	* hashtab.c, hashtab.h (htab_dump_statistics, htab_collisions_num)
	(htab_searches_num, htab_expands_num): New functions for
	statistics.
	* hashtab.c: Include <assert.h> for checks in
	htab_dump_statistics.
	* cgraph.h, varpool.c (varpool_dump_stats): New function to dump
	stats about varpool_hash hash table.
* libcpp/symtab.c, symtab.h: Added "expands" variable to hash_table type for tracking total times the hash table was expanded. * symtab.c (ht_dump_statistics): Beautified stats output. Number of expanded macros: 23397 Average number of tokens per macro expansion: 10 Line Table allocations during the compilation process Number of ordinary maps used: 576 Ordinary map used size: 13k Number of ordinary maps allocated:1365 Ordinary maps allocated size: 31k Number of macro maps used: 20k Macro
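The accumulate-on-delete pattern used for call_site_hash above generalizes to any short-lived table: the per-table counters would be gone before -fmem-report prints, so they are folded into globals at deletion time. A generic sketch of the pattern (the sketch_* names are mine; the real counters live inside libiberty's htab):

```c
/* Minimal model of a hash table that tracks its own activity.  */
struct sketch_htab
{
  unsigned searches, collisions, expands;
};

/* Globals that survive after short-lived tables are destroyed, so a
   final memory report can still print aggregate numbers.  */
static unsigned sketch_num;
static unsigned long sketch_searches, sketch_collisions, sketch_expands;

static void
sketch_htab_delete (struct sketch_htab *t, int mem_report)
{
  if (mem_report)
    {
      /* Fold this table's counters into the global totals before the
         table disappears.  */
      sketch_num++;
      sketch_searches += t->searches;
      sketch_collisions += t->collisions;
      sketch_expands += t->expands;
    }
  /* ...actual table teardown would go here...  */
}
```

The cost is a handful of global counters per table family, paid only under -fmem-report, which is why the patch guards every accumulation with if (mem_report).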
[DF] Generate REFs in REGNO order
Hello list,

I'm resubmitting this patch from last year's GSOC, which speeds up
compilation by avoiding thousands of calls to qsort(). I measured its
impact on compilation speed again, this time compiling (cc1) gcc's
reload.c on i386:

orig:    0.734 s
patched: 0.720 s

Tested on i686, ppc64. No regressions.

Paolo: I couldn't find a single test-case where the mw_reg_pool was
heavily used, so I reduced its size. You think it's OK for all archs?

2012-05-20  Dimitrios Apostolou
	    Paolo Bonzini

	Provide almost 2% speedup on -O0 compilations by generating the
	DF_REF_BASE register defs in REGNO order, so that collection_rec
	is already sorted and most qsort() calls are avoided. In detail:

	(df_def_record_1): Assert a parallel must contain an EXPR_LIST at
	this point. Receive the LOC and move its extraction...
	(df_defs_record): ... here. Rewrote logic with a switch statement
	instead of multiple if-else.
	(df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions
	that duplicate the logic of df_defs_record() and df_def_record_1()
	but without actually recording any DEFs, only marking them in the
	defs HARD_REG_SET.
	(df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs that
	are the result of the call. Record DF_REF_BASE DEFs in REGNO
	order. Use HARD_REG_SET instead of bitmap for
	regs_invalidated_by_call. Changed defs_generated from bitmap to
	HARD_REG_SET, it's much faster.
	(df_insn_refs_collect): Record DF_REF_REGULAR DEFs after
	df_get_call_refs().
	(df_scan_alloc): Rounded up allocation pool size; reduced the
	mw_reg_pool size, it was unnecessarily large.
Thanks, Dimitris === modified file 'gcc/df-scan.c' --- gcc/df-scan.c 2012-02-29 08:12:04 + +++ gcc/df-scan.c 2012-05-20 15:46:06 + @@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c rtx, rtx *, basic_block, struct df_insn_info *, enum df_ref_type, int ref_flags); -static void df_def_record_1 (struct df_collection_rec *, rtx, +static void df_def_record_1 (struct df_collection_rec *, rtx *, basic_block, struct df_insn_info *, int ref_flags); static void df_defs_record (struct df_collection_rec *, rtx, @@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU { struct df_scan_problem_data *problem_data; unsigned int insn_num = get_max_uid () + 1; - unsigned int block_size = 400; + unsigned int block_size = 512; basic_block bb; /* Given the number of pools, this is really faster than tearing @@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU sizeof (struct df_reg_info), block_size); problem_data->mw_reg_pool = create_alloc_pool ("df_scan mw_reg", -sizeof (struct df_mw_hardreg), block_size); +sizeof (struct df_mw_hardreg), block_size / 16); bitmap_obstack_initialize (&problem_data->reg_bitmaps); bitmap_obstack_initialize (&problem_data->insn_bitmaps); @@ -2917,40 +2917,27 @@ df_read_modify_subreg_p (rtx x) } -/* Process all the registers defined in the rtx, X. +/* Process all the registers defined in the rtx pointed by LOC. Autoincrement/decrement definitions will be picked up by df_uses_record. */ static void df_def_record_1 (struct df_collection_rec *collection_rec, - rtx x, basic_block bb, struct df_insn_info *insn_info, + rtx *loc, basic_block bb, struct df_insn_info *insn_info, int flags) { - rtx *loc; - rtx dst; - - /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL - construct. */ - if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER) -loc = &XEXP (x, 0); - else -loc = &SET_DEST (x); - dst = *loc; + rtx dst = *loc; /* It is legal to have a set destination be a parallel. 
*/ if (GET_CODE (dst) == PARALLEL) { int i; - for (i = XVECLEN (dst, 0) - 1; i >= 0; i--) { rtx temp = XVECEXP (dst, 0, i); - if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER - || GET_CODE (temp) == SET) - df_def_record_1 (collection_rec, - temp, bb, insn_info, -GET_CODE (temp) == CLOBBER -? flags | DF_REF_MUST_CLOBBER : flags); + gcc_assert (GET_CODE (temp) == EXPR_LIST); + df_def_record_1 (collection_rec, &XEXP (temp, 0), + bb, insn_info, flags); } return; } @@ -3004,26 +2991,98 @@ df_defs_record (struct df_collection_rec int flags) { RTX_CODE code = GET
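The payoff of emitting DEFs already in REGNO order is that the later sort can usually be skipped: one linear scan detects the sorted case, and qsort() runs only when it has to. A sketch of that "verify sorted, only then sort" idea, with a plain int comparator standing in for the real ref comparison:

```c
#include <stdlib.h>

/* Plain comparator for qsort(); stands in for the real DF ref
   comparison keyed on REGNO.  */
static int
sketch_cmp (const void *a, const void *b)
{
  int x = *(const int *) a, y = *(const int *) b;
  return (x > y) - (x < y);
}

/* Sort only when needed: a single linear scan detects the
   already-ordered case that the patch arranges for, so the qsort()
   call (and its function-pointer overhead) is avoided entirely on the
   common path.  Returns 1 if a sort was performed.  */
static int
sketch_sort_if_needed (int *regs, size_t n)
{
  size_t i;
  for (i = 1; i < n; i++)
    if (regs[i - 1] > regs[i])
      {
        qsort (regs, n, sizeof *regs, sketch_cmp);
        return 1;               /* had to sort */
      }
  return 0;                     /* already in REGNO order */
}
```

The linear pre-check costs O(n) per insn, negligible next to the qsort() it replaces when the refs are generated sorted in the first place.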
Re: [PATCH] Fix Linux/sparc build after generic asm output optimizations.
Hi,

On Sat, 12 Nov 2011, Eric Botcazou wrote:

> We just need to declare it in system.h in order to use the definition
> in libiberty.

OK, this should be fine. Do the patches I sent for bug #51094 solve the
problems? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51094

Thanks,
Dimitris
Re: [PATCH] Fix Linux/sparc build after generic asm output optimizations.
Hi David,

I couldn't imagine such breakage... If too many platforms break perhaps
we should undo the optimisation - see attached patch.

Thanks,
Dimitris

P.S. See also bug #51094, where I've attached some more fixes.

=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h	2011-10-30 01:45:46 +0000
+++ gcc/config/elfos.h	2011-11-12 02:51:39 +0000
@@ -125,9 +125,6 @@ see the files COPYING3 and COPYING.RUNTI
     }						\
   while (0)
 
-#undef  TARGET_ASM_INTERNAL_LABEL
-#define TARGET_ASM_INTERNAL_LABEL default_elf_internal_label
-
 /* Output the label which precedes a jumptable.  Note that for all svr4
    systems where we actually generate jumptables (which is to say every
    svr4 target except i386, where we use casesi instead) we put the jump-
Re: repost: [DF] Use HARD_REG_SETs instead of bitmaps
On Mon, 7 Nov 2011, Jakub Jelinek wrote:

> On Mon, Nov 07, 2011 at 12:01:29AM +0200, Dimitrios Apostolou wrote:
>> On Sun, 6 Nov 2011, Joern Rennecke wrote:
>>> But where HARD_REG_SETS make no material difference in speed, and
>>> the compilation unit has no other tight coupling with tm.h, it would
>>> really be cleaner to move from HARD_REG_SETS to a target-independent
>>> type, like sbitmap or bitmap. Maybe we want something more
>>> lightweight than an sbitmap, like passing in a HARD_REG_ELT_TYPE
>>> *base, int n_elements pair to describe the data - n_elements would
>>> be set to HARD_REG_SET_LONGS. When you are doing anything
>>> time-critical with hard register sets, there is little point having
>>> the same size stored in multiple places, you might as well keep it
>>> in a register and pass it around. Basic operations could presumably
>>> be inlined, so the main overhead would be loop overhead. With the
>>> right loop unrolling applied, these loops should be well
>>> predictable.
>>
>> Working with bitmaps in gcc made me really miss the simple bitmap.h
>> of the linux kernel. I'm not sure if we can borrow from there. More
>> info at: http://lxr.linux.no/#linux+v3.1/include/linux/bitmap.h

Hm, what else is hard-reg-set.h? It's about the same performance-wise,
but:

* much uglier because of all the #ifdef cruft
* less useful since it can't be used as a generic bitmap
** Closest generic bitmap is sbitmap.h, but IMHO it's more complex than
   it should be.

I'd love to have the kernel bitmap.h in libiberty.

> When you are dealing with pseudos etc. sparse bitmaps are very often
> better.

Indeed. They are much better memory-wise, but they have a non-negligible
overhead performance-wise. I've noted somewhere to replace all hot
bitmaps with ebitmaps and see what happens, whenever I have the time...

Dimitris
Re: repost: [DF] Use HARD_REG_SETs instead of bitmaps
On Sun, 6 Nov 2011, Joern Rennecke wrote:

> But where HARD_REG_SETS make no material difference in speed, and the
> compilation unit has no other tight coupling with tm.h, it would
> really be cleaner to move from HARD_REG_SETS to a target-independent
> type, like sbitmap or bitmap. Maybe we want something more lightweight
> than an sbitmap, like passing in a HARD_REG_ELT_TYPE *base, int
> n_elements pair to describe the data - n_elements would be set to
> HARD_REG_SET_LONGS. When you are doing anything time-critical with
> hard register sets, there is little point having the same size stored
> in multiple places, you might as well keep it in a register and pass
> it around. Basic operations could presumably be inlined, so the main
> overhead would be loop overhead. With the right loop unrolling
> applied, these loops should be well predictable.

Working with bitmaps in gcc made me really miss the simple bitmap.h of
the linux kernel. I'm not sure if we can borrow from there. More info
at: http://lxr.linux.no/#linux+v3.1/include/linux/bitmap.h

Dimitris

P.S. we badly need a gxr site :-)
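For comparison, the kernel-style fixed-width bitmap admired above boils down to a handful of one-liners over an array of words. A generic sketch (my own names; this is neither GCC's HARD_REG_SET macros nor Linux's actual bitmap.h):

```c
#include <limits.h>

#define SKETCH_BITS_PER_WORD (sizeof (unsigned long) * CHAR_BIT)

/* Number of words needed to hold NBITS bits, rounding up.  */
#define SKETCH_BITMAP_WORDS(nbits) \
  (((nbits) + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

static void
sketch_set_bit (unsigned long *map, unsigned bit)
{
  map[bit / SKETCH_BITS_PER_WORD] |= 1UL << (bit % SKETCH_BITS_PER_WORD);
}

static void
sketch_clear_bit (unsigned long *map, unsigned bit)
{
  map[bit / SKETCH_BITS_PER_WORD] &= ~(1UL << (bit % SKETCH_BITS_PER_WORD));
}

static int
sketch_test_bit (const unsigned long *map, unsigned bit)
{
  return (map[bit / SKETCH_BITS_PER_WORD]
          >> (bit % SKETCH_BITS_PER_WORD)) & 1;
}
```

The appeal is that all operations are branch-free word arithmetic with the size known at the call site, which is also why a lightweight base-pointer-plus-count interface, as suggested in the quoted text, costs essentially nothing.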
Re: [dwarf2out, elfos] Output assembly faster
Hello list,

I updated this patch to latest trunk and also incorporated apinski's suggestion to use stpcpy() instead of a custom loop. Bootstrapped and tested again on i386, no regressions. The comments from my previous email still apply; I never got a reply:

On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:

Hello again, most of this patch was posted at the beginning of my GSOC adventure and primarily is about replacing fprintf() calls with custom faster ones. Since then I changed minor things you suggested, omitted some complexities that offered minor speed-up, and made the code as clear as possible, while following the coding guidelines more closely.

Bootstrapped and tested on both i386 and x86_64, showing a runtime improvement when compiling with debug info enabled. The only thing I am not sure about is the differentiation between HOST_WIDE_INT and long, so I provided two versions of the same function. Please let me know if I should handle this differently.

Future speedup in assembly generation *is possible* but requires more intrusive changes, maybe making the assembly hard to read by humans:
* Always use hex numbers, they are much faster to produce.
* Call the GNU assembler with the -s parameter, though it's pretty hard to be compliant.
* Even further in the future we could generate binary data, if we *know* the assembler is GAS.

Slightly more descriptive changelog:

2011-08-12  Dimitrios Apostolou

        * final.c, output.h (fprint_whex, fprint_w, fprint_ul,
        sprint_ul): New functions serving as fast replacements for
        fprintf() integer to string conversions. They were used in the
        following changes.
        * final.c (sprint_ul_rev): Internal helper for the above.
        (output_addr_const): case CONST_INT: don't use fprintf().
        * elfos.h (ASM_GENERATE_INTERNAL_LABEL): Don't use
        sprintf("%u"), use sprint_ul() and stpcpy() which are much
        faster.
        (TARGET_ASM_INTERNAL_LABEL): Define as
        default_elf_internal_label.
        (ELF_ASCII_ESCAPES, ELF_STRING_LIMIT): Are the old ESCAPES and
        STRING_LIMIT macros.
        (ASM_OUTPUT_LIMITED_STRING, ASM_OUTPUT_ASCII): Macros now just
        call respective functions that provide the same functionality.
        Those are default_elf_asm_output_limited_string() and
        default_elf_asm_output_ascii() in varasm.c.
        * varasm.c: Fixed some whitespace inconsistencies.
        (default_elf_asm_output_limited_string)
        (default_elf_asm_output_ascii): The above macros from elfos.h
        are implemented here as these functions, avoiding superfluous
        calls to fprintf().
        (default_elf_internal_label): Hook for
        targetm.asm_out.internal_label and ASM_OUTPUT_DEBUG_LABEL.
        * i386.c: Don't call fprintf("%u") but fprint_ul() instead.
        * defaults.h (ASM_OUTPUT_LABEL, ASM_OUTPUT_INTERNAL_LABEL):
        Expanded the macros on multiple lines for readability.
        (ASM_OUTPUT_LABELREF): Have two calls to fputs() instead of one
        to asm_fprintf().
        * dwarf2asm.c (dw2_assemble_integer, dw2_asm_output_data)
        (dw2_asm_data_uleb128, dw2_asm_delta_uleb128)
        (dw2_asm_delta_sleb128): Convert fprintf() calls to the new
        faster functions.
        * dwarf2out.c (dwarf2out_source_line): Convert fprintf() calls
        to the new faster functions.

I had measured during the summer ([1] before, [2] after; if necessary I'll refresh the measurements) that this gives a good 2-3% improvement, especially when debug info is enabled. The major obstacle was that it made code less readable, but I think the new function names are more understandable.

[1] http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=4.6.0-baseline_prof.txt
[2] http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=4.6.0-changed1_prof.txt

Jakub, you had raised concerns that it's better to avoid this optimisation completely and make gcc always optimise fprintf("%d") calls, to keep the code readable. Do you think that's still the case?

I appreciate all comments,
Dimitris

P.S.
Cc'd i386 backend maintainers

=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h	2010-11-21 00:54:14 +
+++ gcc/config/elfos.h	2011-10-29 22:57:35 +
@@ -117,10 +117,17 @@ see the files COPYING3 and COPYING.RUNTI
 #define ASM_GENERATE_INTERNAL_LABEL(LABEL, PREFIX, NUM)	\
   do								\
     {								\
-      sprintf (LABEL, "*.%s%u", PREFIX, (unsigned) (NUM));	\
+      char *__p;						\
+      (LABEL)[0] = '*';						\
+      (LABEL)[1] = '.';
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
Hello all,

I received my GSOC t-shirt yesterday, which reminded me I have a promise to keep... After realising that it can take forever to find enough free time to work on GCC, I decided to work a couple of hours whenever I can and post updates to my patches as time permits. Hopefully some of them will make it into 4.7?

On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:
> For the record I'm posting here the final version of this patch, in
> case it gets applied. It adds minor stylistic fixes, plus a small
> change in alloc_pool sizes. Any further testing I do will be posted
> under this thread. The previously posted Changelog applies, with the
> following addition:
>
> (df_scan_alloc): Rounded up allocation pools size, reduced the
> mw_reg_pool size, it was unnecessarily large.
>
> Paolo, did I assume correctly that the mw_reg_pool is significantly
> smaller than the rest? That was the case on i386, I assumed it would
> be similar in other arch as well.

The attached patch (df2b.diff, exactly the same as the one in the parent email) applies successfully to the latest gcc snapshot. In addition to previous testing (i386, x86_64) I've just finished testing on sparc-linux-gnu at the GCC compile farm, with no regressions. Finally, I think Steven's tests on IA64 went OK. Wasn't testing the only thing holding this patch?

On sparc, the runtime of compiling df-scan.c seems to have been reduced from 34s to 33s user time, for a debug build (--enable-checking=assert,misc,runtime,rtl,df). But measurements are too flaky since the node is busy.

The complete changelog is the following:

2011-07-29  Dimitrios Apostolou
            Paolo Bonzini

        (df_def_record_1): Assert a parallel must contain an EXPR_LIST
        at this point. Receive the LOC and move its extraction...
        (df_defs_record): ... here. Rewrote logic with a switch
        statement instead of multiple if-else.
(df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions that duplicate the logic of df_defs_record() and df_def_record_1() but without actually recording any DEFs, only marking them in the defs HARD_REG_SET. (df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs that are the result of the call. Record DF_REF_BASE DEFs in REGNO order. Use regs_invalidated_by_call HARD_REG_SET instead of regs_invalidated_by_call_regset bitmap. (df_insn_refs_collect): Record DF_REF_REGULAR DEFs after df_get_call_refs(). (df_scan_alloc): Rounded up allocation pools size, reduced the mw_reg_pool size, it was unnecessarily large. Thanks, Dimitris=== modified file 'gcc/df-scan.c' --- gcc/df-scan.c 2011-02-02 20:08:06 + +++ gcc/df-scan.c 2011-08-22 15:17:18 + @@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c rtx, rtx *, basic_block, struct df_insn_info *, enum df_ref_type, int ref_flags); -static void df_def_record_1 (struct df_collection_rec *, rtx, +static void df_def_record_1 (struct df_collection_rec *, rtx *, basic_block, struct df_insn_info *, int ref_flags); static void df_defs_record (struct df_collection_rec *, rtx, @@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU { struct df_scan_problem_data *problem_data; unsigned int insn_num = get_max_uid () + 1; - unsigned int block_size = 400; + unsigned int block_size = 512; basic_block bb; /* Given the number of pools, this is really faster than tearing @@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU sizeof (struct df_reg_info), block_size); problem_data->mw_reg_pool = create_alloc_pool ("df_scan mw_reg", -sizeof (struct df_mw_hardreg), block_size); +sizeof (struct df_mw_hardreg), block_size / 16); bitmap_obstack_initialize (&problem_data->reg_bitmaps); bitmap_obstack_initialize (&problem_data->insn_bitmaps); @@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x) } -/* Process all the registers defined in the rtx, X. +/* Process all the registers defined in the rtx pointed by LOC. 
Autoincrement/decrement definitions will be picked up by df_uses_record. */ static void df_def_record_1 (struct df_collection_rec *collection_rec, - rtx x, basic_block bb, struct df_insn_info *insn_info, + rtx *loc, basic_block bb, struct df_insn_info *insn_info, int flags) { - rtx *loc; - rtx dst; - - /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL - construct. */ - if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER) -loc = &XEXP (x, 0); - else -loc = &SET_DEST (x); - dst = *loc; + rtx dst = *loc; /* It is legal to have a se
Re: [var-tracking] small speed-ups
On Tue, 23 Aug 2011, Jakub Jelinek wrote:
> On Tue, Aug 23, 2011 at 02:40:56PM +0300, Dimitrios Apostolou wrote:
>>    dst->vars = (shared_hash) pool_alloc (shared_hash_pool);
>>    dst->vars->refcount = 1;
>>    dst->vars->htab
>> -    = htab_create (MAX (src1_elems, src2_elems), variable_htab_hash,
>> +    = htab_create (2 * MAX (src1_elems, src2_elems), variable_htab_hash,
>>                     variable_htab_eq, variable_htab_free);
>
> This looks wrong, 2 * max is definitely too much. For a hash table to
> fit N elements, it has to have at least 4/3*N slots, or 2*N slots if
> htab has the 50% load factor I was proposing. For var-tracking, 50%
> load factor is IMHO a very bad idea, memory consumption of
> var-tracking is already high right now, we IMHO don't have the luxury
> to waste more RAM.

Agreed, then I'll change it to 4/3 * MAX so that I avoid expansions.

>> In total for dataflow_set_destroy I can see that calls to
>> attrs_list_clear() have been reduced from 500K to 250K, and I can
>> also see a reduction of free() calls from htab_delete(), from 30K to
>
> free calls? If you avoid free calls, then you end up with a memory
> leak. I can understand when you avoid pool_free calls...

You are right, I just mentioned the total difference in free() calls from all my patches. But in this part there is no free() involved, so the major gain should be from avoiding htab_delete() iterating many times over the hash tables.

Annotated source from the callgrind profiler (shows instruction count):

Before:

          .  void
 15,820,597  htab_delete (htab_t htab)
    250,914  {
     41,819    size_t size = htab_size (htab);
     41,819    PTR *entries = htab->entries;
          .    int i;
          .
     83,638    if (htab->del_f)
 66,493,884      for (i = size - 1; i >= 0; i--)
 49,825,998        if (entries[i] != HTAB_EMPTY_ENTRY && entries[i] != HTAB_DELETED_ENTRY)
  1,813,018          (*htab->del_f) (entries[i]);
  1,800,731  => tree-into-ssa.c:def_blocks_free (10988x)
    345,910  => tree-into-ssa.c:repl_map_free (2815x)
     53,950  => tree-scalar-evolution.c:del_scev_info (1197x)
    139,354  => tree-ssa-loop-im.c:memref_free (198x)
    281,777  => tree-ssa-sccvn.c:free_phi (3788x)
     81,726  => tree-ssa-uncprop.c:equiv_free (463x)
 17,359,572  => var-tracking.c:variable_htab_free (835512x)
     42,161  => cfgloop.c:loop_exit_free (720x)
    284,904  => ira-costs.c:cost_classes_del (2270x)
        157  => tree-ssa-loop-im.c:vtoe_free (1x)
 11,454,221  => ???:free (37228x)
  1,460,684  => tree-ssa-sccvn.c:free_reference (11329x)
          .
    125,457    if (htab->free_f != NULL)

After:

          .  void
  6,543,474  htab_delete (htab_t htab)
    250,914  {
     41,819    size_t size = htab_size (htab);
     41,819    PTR *entries = htab->entries;
          .    int i;
          .
     83,638    if (htab->del_f)
 29,288,268      for (i = size - 1; i >= 0; i--)
 21,927,330        if (entries[i] != HTAB_EMPTY_ENTRY && entries[i] != HTAB_DELETED_ENTRY)
  1,738,584          (*htab->del_f) (entries[i]);
    139,344  => tree-ssa-loop-im.c:memref_free (198x)
        157  => tree-ssa-loop-im.c:vtoe_free (1x)
     81,762  => tree-ssa-uncprop.c:equiv_free (463x)
    281,884  => tree-ssa-sccvn.c:free_phi (3788x)
     42,145  => cfgloop.c:loop_exit_free (720x)
    345,870  => tree-into-ssa.c:repl_map_free (2815x)
  1,462,000  => tree-ssa-sccvn.c:free_reference (11329x)
  1,800,951  => tree-into-ssa.c:def_blocks_free (10988x)
 16,518,004  => var-tracking.c:variable_htab_free (824010x)
    284,979  => ira-costs.c:cost_classes_del (2270x)
  9,080,245  => ???:free (11513x)
     53,921  => tree-scalar-evolution.c:del_scev_info (1197x)
          .
    125,457    if (htab->free_f != NULL)

So if the part relevant to this patch is the number of calls to variable_htab_free, I suppose it won't make a big difference.
But if I take a look at the top callers and top callees of htab_delete():

Before:

 19,590,149  < tree-ssa-structalias.c:solve_constraints (2058x) [cc1]
 33,299,064  < var-tracking.c:shared_hash_destroy (6923x) [cc1]
 68,941,719  < tree-ssa-pre.c:execute_pre (2058x) [cc1]
134,915,334  * hashtab.c:htab_delete [cc1]
 17,359,572  > var-tracking.c:variable_htab_free (835512x) [cc1]
 11,454,221  > ???:free (108456x) [libc-2.12.so]
  1,800,731  > tree-into-ssa.c:def_blocks_free (10988x) [cc1]

After:

  8,756,328  < tree-ssa-structalias.c:delete_points_to_sets (1029x) [cc1]
  9,316,165  < tree-ssa-dom.c:tree_ssa_dominator_optimize (562x) [cc1]
 83,696,211  < var-tracking.c:shared_hash_destroy (6923x) [cc1]
 60,459,493  * hashtab.c:htab_delete [cc1]
 16,518,004  > var-tracking.c:variable_htab_free (824010x) [cc1]
  9,080,245  > ???:free (82741x) [libc-2.12.so]
  1,800,951  > tree-into-ssa.c:def_blocks_free (10988x) [cc1]

Then I'm more confused, htab_delete() seems to be much colder, pr
Re: [var-tracking] small speed-ups
Hi jakub, On Mon, 22 Aug 2011, Jakub Jelinek wrote: On Mon, Aug 22, 2011 at 01:30:33PM +0300, Dimitrios Apostolou wrote: @@ -1191,7 +1189,7 @@ dv_uid2hash (dvuid uid) static inline hashval_t dv_htab_hash (decl_or_value dv) { - return dv_uid2hash (dv_uid (dv)); + return (hashval_t) (dv_uid (dv)); } Why? dv_uid2hash is an inline that does exactly that. @@ -1202,7 +1200,7 @@ variable_htab_hash (const void *x) { const_variable const v = (const_variable) x; - return dv_htab_hash (v->dv); + return (hashval_t) (dv_uid (v->dv)); } Why? /* Compare the declaration of variable X with declaration Y. */ @@ -1211,9 +1209,8 @@ static int variable_htab_eq (const void *x, const void *y) { const_variable const v = (const_variable) x; - decl_or_value dv = CONST_CAST2 (decl_or_value, const void *, y); - return (dv_as_opaque (v->dv) == dv_as_opaque (dv)); + return (v->dv) == y; } Why? I was hoping you'd ask so I'll ask back :-) Why are we doing it that way? Why so much indirection in the first place? Why create inline functions just to typecast and why do we need this CONST_CAST2 ugliness in C code. I bet there are things I don't understand so I'd be happy to listen... The reason I did this (and many more I didn't publish) simplifications within var-tracking is because it hurt my brains to follow the logic. Even with the help of TAGS I have a specific stack depth before I forget where I begun diving into TAG declarations. Well in var-tracking this limit was surpassed by much... @@ -1397,19 +1398,40 @@ shared_var_p (variable var, shared_hash || shared_hash_shared (vars)); } +/* Copy all variables from hash table SRC to hash table DST without rehashing + any values. 
*/ + +static htab_t +htab_dup (htab_t src) +{ + htab_t dst; + + dst = (htab_t) xmalloc (sizeof (*src)); + memcpy (dst, src, sizeof (*src)); + dst->entries = (void **) xmalloc (src->size * sizeof (*src->entries)); + memcpy (dst->entries, src->entries, + src->size * sizeof (*src->entries)); + return dst; +} + This certainly doesn't belong here, it should go into libiberty/hashtab.c and prototype into include/hashtab.h. It relies on hashtab.c implementation details. OK I'll do that in the future. Should I also move some other htab functions I saw in var-tracking and rtl? FOR_EACH_HTAB_ELEMENT comes to mind, probably other too. @@ -2034,7 +2041,8 @@ val_resolve (dataflow_set *set, rtx val, static void dataflow_set_init (dataflow_set *set) { - init_attrs_list_set (set->regs); + /* Initialize the set (array) SET of attrs to empty lists. */ + memset (set->regs, 0, sizeof (set->regs)); set->vars = shared_hash_copy (empty_shared_hash); set->stack_adjust = 0; set->traversed_vars = NULL; I'd say you should instead just implement init_attrs_list_set inline using memset. It's used only once, that's why I deleted the function. I'll bring it back if you think it helps. dst->vars = (shared_hash) pool_alloc (shared_hash_pool); dst->vars->refcount = 1; dst->vars->htab -= htab_create (MAX (src1_elems, src2_elems), variable_htab_hash, += htab_create (2 * MAX (src1_elems, src2_elems), variable_htab_hash, variable_htab_eq, variable_htab_free); This looks wrong, 2 * max is definitely too much. For a hash table to fit N elements, it has to have at least 4/3*N slots, or 2*N slots if htab has the 50% load factor I was proposing. @@ -8996,11 +9006,13 @@ vt_finalize (void) FOR_ALL_BB (bb) { - dataflow_set_destroy (&VTI (bb)->in); - dataflow_set_destroy (&VTI (bb)->out); + /* The "false" do_free parameter means to not bother to iterate and free +all hash table elements, since we'll destroy the pools. 
*/
+  dataflow_set_destroy (&VTI (bb)->in, false);
+  dataflow_set_destroy (&VTI (bb)->out, false);
   if (VTI (bb)->permp)
     {
-      dataflow_set_destroy (VTI (bb)->permp);
+      dataflow_set_destroy (VTI (bb)->permp, false);
       XDELETE (VTI (bb)->permp);
     }
 }

> How much does this actually speed things up (the not freeing of
> pool-allocated stuff during finalization)? Is it really worth it?

In total for dataflow_set_destroy I can see that calls to attrs_list_clear() have been reduced from 500K to 250K, and I can also see a reduction of free() calls from htab_delete(), from 30K to 10K. I'm willing to bet that much of this is because of this change. I have kept only the ones that showed a difference, and I remember clearly that var-tracking is iterating over hash tables too much, either directly or from htab_traverse()/htab_delete().

Thanks,
Dimitris
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
On Mon, 22 Aug 2011, Dimitrios Apostolou wrote: Hi Steven, On Mon, 1 Aug 2011, Steven Bosscher wrote: On Sun, Jul 31, 2011 at 11:59 PM, Steven Bosscher wrote: On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher wrote: I'll see if I can test the patch on the compile farm this weekend, just to be sure. Bootstrap on ia64-unknown-linux-gnu is in stage2 but it is taking forever (on gcc60)... Just to be clear, it is taking forever without the patch too. I did time -O2 non-bootstrap builds but there was no difference worth mentioning. Did the testsuite finish with no regressions on IA64? I'd test on a sparcstation but unfortunately the machine crashed before finishing and I can't regain access to it. I'll hopefully test it in some other platform by next week, I'm curious to actually measure runtime there. For the record I'm posting here the final version of this patch, in case it gets applied. It adds minor stylistic fixes, plus a small change in alloc_pool sizes. Any further testing I do will be posted under this thread. The previously posted Changelog applies, with the following addition: (df_scan_alloc): Rounded up allocation pools size, reduced the mw_reg_pool size, it was unnecessarily large. Paolo, did I assume correctly that the mw_reg_pool is significantly smaller than the rest? That was the case on i386, I assumed it would be similar in other arch as well. 
Thanks, Dimitris=== modified file 'gcc/df-scan.c' --- gcc/df-scan.c 2011-02-02 20:08:06 + +++ gcc/df-scan.c 2011-08-22 15:17:18 + @@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c rtx, rtx *, basic_block, struct df_insn_info *, enum df_ref_type, int ref_flags); -static void df_def_record_1 (struct df_collection_rec *, rtx, +static void df_def_record_1 (struct df_collection_rec *, rtx *, basic_block, struct df_insn_info *, int ref_flags); static void df_defs_record (struct df_collection_rec *, rtx, @@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU { struct df_scan_problem_data *problem_data; unsigned int insn_num = get_max_uid () + 1; - unsigned int block_size = 400; + unsigned int block_size = 512; basic_block bb; /* Given the number of pools, this is really faster than tearing @@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU sizeof (struct df_reg_info), block_size); problem_data->mw_reg_pool = create_alloc_pool ("df_scan mw_reg", -sizeof (struct df_mw_hardreg), block_size); +sizeof (struct df_mw_hardreg), block_size / 16); bitmap_obstack_initialize (&problem_data->reg_bitmaps); bitmap_obstack_initialize (&problem_data->insn_bitmaps); @@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x) } -/* Process all the registers defined in the rtx, X. +/* Process all the registers defined in the rtx pointed by LOC. Autoincrement/decrement definitions will be picked up by df_uses_record. */ static void df_def_record_1 (struct df_collection_rec *collection_rec, - rtx x, basic_block bb, struct df_insn_info *insn_info, + rtx *loc, basic_block bb, struct df_insn_info *insn_info, int flags) { - rtx *loc; - rtx dst; - - /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL - construct. */ - if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER) -loc = &XEXP (x, 0); - else -loc = &SET_DEST (x); - dst = *loc; + rtx dst = *loc; /* It is legal to have a set destination be a parallel. 
*/ if (GET_CODE (dst) == PARALLEL) { int i; - for (i = XVECLEN (dst, 0) - 1; i >= 0; i--) { rtx temp = XVECEXP (dst, 0, i); - if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER - || GET_CODE (temp) == SET) - df_def_record_1 (collection_rec, - temp, bb, insn_info, -GET_CODE (temp) == CLOBBER -? flags | DF_REF_MUST_CLOBBER : flags); + gcc_assert (GET_CODE (temp) == EXPR_LIST); + df_def_record_1 (collection_rec, &XEXP (temp, 0), + bb, insn_info, flags); } return; } @@ -3003,26 +2990,98 @@ df_defs_record (struct df_collection_rec int flags) { RTX_CODE code = GET_CODE (x); + int i; - if (code == SET || code == CLOBBER) -{ - /* Mark the single def within the pattern. */ - int clobber_flags = flags; - clobber_flags |= (code == CLOBBER) ? DF_REF_MUST_CLOBBER : 0; - df_def_record_1 (collection_rec, x, bb, insn_info, clobber_flags); -} - else if (code == COND_EXEC) + switch (code) { +
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
Hi Steven,

On Mon, 1 Aug 2011, Steven Bosscher wrote:
> On Sun, Jul 31, 2011 at 11:59 PM, Steven Bosscher wrote:
>> On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher wrote:
>>> I'll see if I can test the patch on the compile farm this weekend,
>>> just to be sure.
>>
>> Bootstrap on ia64-unknown-linux-gnu is in stage2 but it is taking
>> forever (on gcc60)...
>
> Just to be clear, it is taking forever without the patch too. I did
> time -O2 non-bootstrap builds but there was no difference worth
> mentioning.

Did the testsuite finish with no regressions on IA64? I'd test on a sparcstation but unfortunately the machine crashed before finishing and I can't regain access to it. I'll hopefully test it on some other platform by next week, I'm curious to actually measure runtime there.

Thanks for helping!
Dimitris
[dwarf2out, elfos] Output assembly faster
Hello again,

most of this patch was posted at the beginning of my GSOC adventure and primarily is about replacing fprintf() calls with custom faster ones. Since then I changed minor things you suggested, omitted some complexities that offered minor speed-up, and made the code as clear as possible, while following the coding guidelines more closely.

Bootstrapped and tested on both i386 and x86_64, showing a runtime improvement when compiling with debug info enabled. The only thing I am not sure about is the differentiation between HOST_WIDE_INT and long, so I provided two versions of the same function. Please let me know if I should handle this differently.

Future speedup in assembly generation *is possible* but requires more intrusive changes, maybe making the assembly hard to read by humans:
* Always use hex numbers, they are much faster to produce.
* Call the GNU assembler with the -s parameter, though it's pretty hard to be compliant.
* Even further in the future we could generate binary data, if we *know* the assembler is GAS.

Changelog:

2011-08-12  Dimitrios Apostolou

        * final.c, output.h (fprint_whex, fprint_w, fprint_ul,
        sprint_ul): New functions serving as fast replacements for
        fprintf() integer to string conversions. They were used in the
        following changes.
        * final.c (sprint_ul_rev): Internal helper for the above.
        (output_addr_const): case CONST_INT: don't use fprintf().
        * elfos.h (ASM_GENERATE_INTERNAL_LABEL): Don't use
        sprintf("%u"), use sprint_ul() and a custom loop which are much
        faster.
        (TARGET_ASM_INTERNAL_LABEL): Define as
        default_elf_internal_label.
        (ELF_ASCII_ESCAPES, ELF_STRING_LIMIT): Are the old ESCAPES and
        STRING_LIMIT macros.
        (ASM_OUTPUT_LIMITED_STRING, ASM_OUTPUT_ASCII): Macros now just
        call respective functions that provide the same functionality.
        Those are default_elf_asm_output_limited_string() and
        default_elf_asm_output_ascii() in varasm.c.
        * varasm.c: Fixed some whitespace inconsistencies.
        (default_elf_asm_output_limited_string)
        (default_elf_asm_output_ascii): The above macros from elfos.h
        are implemented here, avoiding superfluous calls to fprintf().
        (default_elf_internal_label): Hook for
        targetm.asm_out.internal_label and ASM_OUTPUT_DEBUG_LABEL.
        * i386.c: Don't call fprintf("%u") but fprint_ul() instead.
        * defaults.h (ASM_OUTPUT_LABEL, ASM_OUTPUT_INTERNAL_LABEL):
        Expanded the macros on multiple lines for readability.
        (ASM_OUTPUT_LABELREF): Have two calls to fputs() instead of one
        to asm_fprintf().
        * dwarf2asm.c (dw2_assemble_integer, dw2_asm_output_data)
        (dw2_asm_data_uleb128, dw2_asm_delta_uleb128)
        (dw2_asm_delta_sleb128): Convert fprintf() calls to faster,
        simpler functions.
        * dwarf2out.c (dwarf2out_source_line): Convert fprintf() calls
        to faster, simpler functions.

Thanks,
Dimitris

=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h	2010-11-21 00:54:14 +
+++ gcc/config/elfos.h	2011-08-22 11:39:43 +
@@ -117,10 +117,23 @@ see the files COPYING3 and COPYING.RUNTI
 #define ASM_GENERATE_INTERNAL_LABEL(LABEL, PREFIX, NUM)	\
   do								\
     {								\
-      sprintf (LABEL, "*.%s%u", PREFIX, (unsigned) (NUM));	\
+      int __i = 2;						\
+      int __j = 0;						\
+      (LABEL)[0] = '*';						\
+      (LABEL)[1] = '.';						\
+      do							\
+	{							\
+	  (LABEL)[__i] = (PREFIX)[__j];				\
+	  __i++; __j++;						\
+	}							\
+      while ((PREFIX)[__j] != '\0');				\
+      sprint_ul (&(LABEL)[__i], (unsigned long) (NUM));		\
     }								\
   while (0)

+#undef TARGET_ASM_INTERNAL_LABEL
+#define TARGET_ASM_INTERNAL_LABEL default_elf_internal_label
+
 /* Output the label which precedes a jumptable. Note that for all svr4 systems where we actually generate jumptables (which is to say every svr4 target except i386, where we use casesi instead) we put the jump-
@@ -371,7 +384,7 @@ see the files COPYING3 and COPYING.RUNTI
 the i386) don't know about that. Also, we don't use \v since some versions of gas, such as 2.2 did not accept it. */
-#define ESCAPES \
+#define ELF_ASCII_ESCAPES \
 "\1\
Re: Dump stats about hottest hash tables when -fmem-report
I should note here that the specialised hash tables in pointer-set.c have a load factor of at most 25%. Also another very fast hash table I've studied, dense_hash_map from google's sparsehash library, has a load factor of 50% max.

As I understand it, a good hash function gives a perfectly random value up to N. So if we have only 25% of slots empty, like we do with today's 75% load factor, the chances that we won't have a collision are small. And collisions cost more for an open-addressing hash table.

But for implementing a closed-addressing hash table with linked lists for collision resolution, we'd need allocator pools within libiberty. Is it OK to move alloc-pool.[ch] into libiberty as a first step?

Dimitris
[var-tracking] [not-good!] disable shared_hash and other simplifications
Hello,

the attached patch applies after my previous one, and actually cancels all runtime gains from it. It doesn't make things worse than initially, so it's not *that* bad.

While trying to understand var-tracking I deleted the whole shared hash table concept and some other indirections. It didn't make things much worse, and it makes me think that getting rid of other indirections, like shared variables reference counting, would actually help. But it got too complicated and I had to give up.

While I don't understand variable tracking, I couldn't help but notice some things that could be discussed:

* This is the part of the compiler that makes the heaviest use of hash tables. htab_traverse(), htab_delete(), FOR_EACH_HTAB_ELEMENT (should that be moved into libiberty?) all iterate way too many times over hash tables. This is all because decl/value UIDs are sparse.
* Since var-tracking happens per-function, is there a way to produce per-function UIDs and use those in a regular array instead of a sparse hash table? These UIDs could be produced at tree/rtx level, and I bet other parts of the compiler could make good use of them.
* Another alternative would be to store variables in a vector, sorted by UID value. That way checking membership would be O(logn), but merging would be a very simple operation. Still, merging dataflow sets is way too complex to follow, so I didn't manage to change this.
* From a very wide field of view, all this DF solving reminded me a lot of what I've seen in df-*.c. Why can't variable tracking be integrated in the main DF pass of the compiler, the one that happens *after* register allocation?

Thanks,
Dimitris

=== modified file 'gcc/var-tracking.c'
--- gcc/var-tracking.c	2011-08-22 08:27:47 +
+++ gcc/var-tracking.c	2011-08-22 10:36:15 +
@@ -233,17 +233,6 @@ typedef struct attrs_def
   HOST_WIDE_INT offset;
 } *attrs;

-/* Structure holding a refcounted hash table. If refcount > 1,
-   it must be first unshared before modified.
*/ -typedef struct shared_hash_def -{ - /* Reference count. */ - int refcount; - - /* Actual hash table. */ - htab_t htab; -} *shared_hash; - /* Structure holding the IN or OUT set for a basic block. */ typedef struct dataflow_set_def { @@ -253,11 +242,11 @@ typedef struct dataflow_set_def /* Attributes for registers (lists of attrs). */ attrs regs[FIRST_PSEUDO_REGISTER]; - /* Variable locations. */ - shared_hash vars; + /* Variable locations, stored in a hash table. */ + htab_t vars; /* Vars that is being traversed. */ - shared_hash traversed_vars; + htab_t traversed_vars; } dataflow_set; /* The structure (one for each basic block) containing the information @@ -379,9 +368,6 @@ static alloc_pool valvar_pool; /* Alloc pool for struct location_chain_def. */ static alloc_pool loc_chain_pool; -/* Alloc pool for struct shared_hash_def. */ -static alloc_pool shared_hash_pool; - /* Alloc pool for struct value_chain_def. */ static alloc_pool value_chain_pool; @@ -395,7 +381,7 @@ static htab_t value_chains; static bool emit_notes; /* Empty shared hashtable. */ -static shared_hash empty_shared_hash; +static htab_t empty_htab; /* Scratch register bitmap used by cselib_expand_value_rtx. */ static bitmap scratch_regs = NULL; @@ -478,7 +464,7 @@ static void **delete_slot_part (dataflow static void delete_variable_part (dataflow_set *, rtx, decl_or_value, HOST_WIDE_INT); static int emit_note_insn_var_location (void **, void *); -static void emit_notes_for_changes (rtx, enum emit_note_where, shared_hash); +static void emit_notes_for_changes (rtx, enum emit_note_where, htab_t); static int emit_notes_for_differences_1 (void **, void *); static int emit_notes_for_differences_2 (void **, void *); static void emit_notes_for_differences (rtx, dataflow_set *, dataflow_set *); @@ -1370,164 +1356,30 @@ attrs_list_mpdv_union (attrs *dstp, attr } } -/* Shared hashtable support. */ - -/* Return true if VARS is shared. 
*/ - -static inline bool -shared_hash_shared (shared_hash vars) -{ - return vars->refcount > 1; -} - -/* Return the hash table for VARS. */ - -static inline htab_t -shared_hash_htab (shared_hash vars) -{ - return vars->htab; -} - /* Return true if VAR is shared, or maybe because VARS is shared. */ static inline bool -shared_var_p (variable var, shared_hash vars) +shared_var_p (variable var) { /* Don't count an entry in the changed_variables table as a duplicate. */ - return ((var->refcount > 1 + (int) var->in_changed_variables) - || shared_hash_shared (vars)); + return (var->refcount - (int) var->in_changed_variables) > 1; } -/* Copy all variables from hash table SRC to hash table DST without rehashing - any values. */ - -static htab_t -htab_dup (htab_t src) -{ - htab_t dst; - - dst = (htab_t) xmalloc (sizeof (*src)); - memcpy (dst, src, sizeof (*src)); - dst->entries = (void **) xmalloc (src->s
[var-tracking] small speed-ups
Hello list, this patch has all of my changes to var-tracking that I believe are worth keeping. These are all minor changes that do not touch algorithmic issues; I lost much time trying to understand what is actually done in var-tracking.c, but I still don't get it. I wish there were some document describing the current implementation. I'd appreciate all the help I can get here. Bootstrapped/tested on x86_64. Thanks to lxo for helping me; hopefully his plan for better var-tracking throughout all optimisation passes will be implemented. This is the only var-tracking related doc I could find... ( http://gcc.gnu.org/wiki/Var_Tracking_Assignments). This patch also includes minor stylistic changes that made the code just a tiny bit more accessible to me; there was so much indirection that it hardly reminded me of C. Let me know if I should remove these parts. :-s For the sake of completeness I'll also post a follow-up patch where I delete/simplify a big part of var-tracking, unfortunately with some impact on performance. 2011-08-22 Dimitrios Apostolou * var-tracking.c (init_attrs_list_set): Remove function, instead use a memset() call to zero the attr list in... (dataflow_set_init). (vars_copy): Remove function because inserting each element into a new hash table was too costly. Replaced with the ... (htab_dup): ... new function that only does a memcpy() of the element table in htab_t, without rehashing any elements. (shared_hash_unshare): Replace the vars_copy() call with htab_dup(), plus do a little extra work (reference counting) which was in vars_copy. (shared_hash_destroy, dataflow_set_destroy): Add an extra "do_free" bool argument, to avoid iterating over hash tables freeing elements, when not needed. (vt_finalize, vt_emit_notes): Call the above with do_free=false since all pools will be freed later.
(dataflow_set_clear, dataflow_set_copy, dataflow_set_union) (dataflow_set_merge, dataflow_set_different, compute_bb_dataflow) (vt_find_locations): Call shared_hash_destroy with do_free=true. (attrs_list_copy): Do not free destination list but reuse already allocated elements if possible. === modified file 'gcc/var-tracking.c' --- gcc/var-tracking.c 2011-06-03 01:42:31 + +++ gcc/var-tracking.c 2011-08-22 09:52:08 + @@ -414,7 +414,6 @@ static hashval_t variable_htab_hash (con static int variable_htab_eq (const void *, const void *); static void variable_htab_free (void *); -static void init_attrs_list_set (attrs *); static void attrs_list_clear (attrs *); static attrs attrs_list_member (attrs, decl_or_value, HOST_WIDE_INT); static void attrs_list_insert (attrs *, decl_or_value, HOST_WIDE_INT, rtx); @@ -423,7 +422,6 @@ static void attrs_list_union (attrs *, a static void **unshare_variable (dataflow_set *set, void **slot, variable var, enum var_init_status); -static void vars_copy (htab_t, htab_t); static tree var_debug_decl (tree); static void var_reg_set (dataflow_set *, rtx, enum var_init_status, rtx); static void var_reg_delete_and_set (dataflow_set *, rtx, bool, @@ -447,7 +445,7 @@ static bool variable_part_different_p (v static bool onepart_variable_different_p (variable, variable); static bool variable_different_p (variable, variable); static bool dataflow_set_different (dataflow_set *, dataflow_set *); -static void dataflow_set_destroy (dataflow_set *); +static void dataflow_set_destroy (dataflow_set *, bool); static bool contains_symbol_ref (rtx); static bool track_expr_p (tree, bool); @@ -1069,14 +1067,14 @@ adjust_insn (basic_block bb, rtx insn) static inline bool dv_is_decl_p (decl_or_value dv) { - return !dv || (int) TREE_CODE ((tree) dv) != (int) VALUE; + return !dv || ((int) TREE_CODE ((tree) dv) != (int) VALUE); } /* Return true if a decl_or_value is a VALUE rtl. 
*/ static inline bool dv_is_value_p (decl_or_value dv) { - return dv && !dv_is_decl_p (dv); + return !dv_is_decl_p (dv); } /* Return the decl in the decl_or_value. */ @@ -1092,7 +1090,7 @@ static inline rtx dv_as_value (decl_or_value dv) { gcc_checking_assert (dv_is_value_p (dv)); - return (rtx)dv; + return (rtx) dv; } /* Return the opaque pointer in the decl_or_value. */ @@ -1191,7 +1189,7 @@ dv_uid2hash (dvuid uid) static inline hashval_t dv_htab_hash (decl_or_value dv) { - return dv_uid2hash (dv_uid (dv)); + return (hashval_t) (dv_uid (dv)); } /* The hash function for variable_htab, computes the hash value @@ -1202,7 +1200,7 @@ variable_htab_hash (const void *x) { const_variable const v = (const_variable) x; - return dv_htab_hash (v->dv); + return (hashval_t) (dv_uid (v->dv)); } /* Compare the declaration of variable X with declaration Y. */ @@ -1211,9 +1209,8 @@ static int variable_htab_eq (const voi
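The htab_dup() idea described in the changelog is the core trick: duplicate a hash table without rehashing by copying the slot array byte for byte, so every element lands in the same slot and the hash function is never called. A minimal sketch on a stand-in struct (field names only loosely follow libiberty's real htab layout):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Stand-in for libiberty's htab: just the fields htab_dup touches.  */
struct htab
{
  size_t size;        /* number of slots */
  size_t n_elements;  /* live entries */
  void **entries;     /* slot array; NULL means empty */
};

/* Duplicate SRC without rehashing: copy the struct, then copy the
   slot array.  Elements keep their slots, so no hashing happens.  */
static struct htab *
htab_dup (const struct htab *src)
{
  struct htab *dst = (struct htab *) malloc (sizeof *dst);
  memcpy (dst, src, sizeof *src);
  dst->entries = (void **) malloc (src->size * sizeof *src->entries);
  memcpy (dst->entries, src->entries, src->size * sizeof *src->entries);
  return dst;
}
```

Compared with inserting every element into a fresh table, this is two memcpy() calls regardless of how bad the hash function or load factor is.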
Re: tree-ssa-structalias.c: alloc_pool for struct equiv_class_label
On Mon, 22 Aug 2011, Richard Guenther wrote: On Mon, Aug 22, 2011 at 9:46 AM, Dimitrios Apostolou wrote: 2011-08-22 Dimitrios Apostolou * tree-ssa-structalias.c (equiv_class_add) (perform_var_substitution, free_var_substitution_info): Created a new equiv_class_pool allocator pool for struct equiv_class_label. Changed the pointer_equiv_class_table and location_equiv_class_table hash tables to not iterate freeing all elements in the end, but just free the pool. Did you check if the hash functions have ever called free()? If so why not use the pool free function so that entries can get re-used? If not, the natural allocator would be an obstack instead. I have not found any relevant call of htab_clear_slot(). I didn't consider obstacks at all for all these cases, thanks for telling me, I'll see where I can use them. As I've noted I have bootstrapped and tested all these changes at least on x86_64 with release-checking enabled, but I plan to test and measure all my changes together later, and hopefully on other platforms in the future. Thanks, Dimitris
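For reference, the obstack pattern Richard suggests looks roughly like this: allocate entries one after another and release them all with a single obstack_free() call at the end of the pass. This sketch uses glibc's <obstack.h>; the struct is a made-up stand-in for equiv_class_label, and the hash is just illustrative:

```c
#include <obstack.h>
#include <stdlib.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free free

/* Toy stand-in for struct equiv_class_label.  */
struct equiv_label
{
  unsigned equivalence_class;
  unsigned hashcode;
};

/* Carve one record out of the obstack; nothing is ever freed
   individually, matching the "entries are never deleted" case.  */
static struct equiv_label *
new_label (struct obstack *ob, unsigned ec)
{
  struct equiv_label *l
    = (struct equiv_label *) obstack_alloc (ob, sizeof *l);
  l->equivalence_class = ec;
  l->hashcode = ec * 2654435761u;  /* Knuth multiplicative hash */
  return l;
}
```

Usage is obstack_init() once, new_label() per entry, then obstack_free (&ob, NULL) to drop the whole pass's allocations in one call.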
Re: Dump stats about hottest hash tables when -fmem-report
Hello, I'm attaching here the final version of statistics dumping, I think I made some stylistic/insignificant changes. Tested on i386 and x86_64. I have withheld the hash table size changes, so please apply if you find everything good. Would you agree on a future patch that would make such statistics being computed only when the gather-mem-stats compile-time variable is set? On Sat, 20 Aug 2011, Richard Guenther wrote: On Fri, Aug 19, 2011 at 6:40 PM, Tom Tromey wrote: Your patch to change the symbol table's load factor is fine technically. I think the argument for putting it in is lacking; what I would like to see is either some rationale explaining that the increased memory use is not important, or some numbers showing that it still performs well on more than a single machine. My reason for wanting this is just that, historically, GCC has been very sensitive to increases in memory use. Alternatively, comments from more active maintainers indicating that they don't care about this would also help your case. I can't approve or reject the libiberty change, just the libcpp one. Yes, memory usage is as important as compile-time. We still have testcases that show a vast imbalance of them. I don't know if the symbol table hash is ever the problem, but changing the generic load factor in libiberty doesn't sound a good idea - maybe instead have a away of specifying that factor per hashtable instance. Or, as usual, improve the hash function to reduce the collision rate and/or to make re-hashing cheaper. Regarding hash table size, I will back off from hashtab.c. I really believe that even though it makes a difference the proper solution would be a much more intrusive one: complete reimplementation. If we primarily care about memory usage we should use a closed-addressing (with linked-lists in each bucket) hash table, and set the load-factor high. 
If we care about cache-efficiency and speed we should use an open-addressing hash table, like the one we have, but decrease the load-factor and resolve collisions by quadratic probing instead of rehashing. Also we should make sure our hash table is always power-of-2 sized, and that hash values are always computed using good and proven hash functions (we use the Bob Jenkins v2 hash; we should upgrade to v3, even though v2 seems very good *when* we actually do use it). That would probably mean the interface should change to disallow custom hash values, and while we do that I'd really like to see new functions dedicated to searching and inserting. I've experimented with some of these changes, but individually they do not offer significant speed-up. I believe the gain will only show if hash tables are completely reimplemented. But regarding symtab.c, I only see benefits. Collisions are reduced and extra memory usage is negligible. For a hash table with 32K elements and 64K slots, the overhead of empty slots is 32K*8 = 256KB. If it were 75% full it would be 128KB. Actual measurements on i386 show negligible overhead; it could be by luck that the final ht size is the same (they are always power-of-2 sized, so it's possible). Anyway I'll hopefully test on other platforms within September and report back. A change that would probably help for symtab would be to store custom string structs that include the string length, so we avoid several strlen() calls. Another could be to not support deletion of strings (since what we are actually doing is string interning, isn't it?). This would make the open-addressing hash table much simpler, and I think there is only one case where we actually delete strings. Have to look further into this one. All comments welcome, Dimitris Changelog: 2011-08-22 Dimitrios Apostolou * cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump stats about cgraph_hash hash table.
* cselib.c, cselib.h (cselib_dump_stats): New function to dump stats about cselib_hash_table. * cselib.c (cselib_finish): Keep statistics by increasing values of new global variables cselib_htab_{num,expands,searches,collisions} if -fmem-report. * emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to dump statistics about mem_attrs_htab hash table. * tree.c (print_type_hash_statistics): Used the new htab_dump_statistics() function. * var-tracking.c (shared_hash_destroy): Keep statistics by increasing values of new global variables vars_htab_{num,expands,searches,collisions} if -fmem-report. (vt_finalize): Keep statistics by increasing values of new global variables cv_htab_{num,expands,searches,collisions} and vc_htab_{num,expands,searches,collisions} if -fmem-report. * var-tracking.c, rtl.h (vt_dump_stats): New function to dump stats about vars->htab, changed_variables and value_chains hash tables.
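The open-addressing scheme sketched above (power-of-two size, quadratic probing instead of rehashing) could look like the following. Triangular-number offsets (h, h+1, h+3, h+6, ...) are known to visit every slot of a power-of-two table exactly once, so a lookup in a non-full table always terminates. Toy code, not GCC's hashtab:

```c
#include <stdint.h>
#include <stddef.h>

#define NSLOTS 8  /* must be a power of two */

/* Find the slot for KEY with hash value HASH in a power-of-two
   table, probing by cumulative triangular numbers.  Returns the
   index of the key if present, else the first empty slot (0 marks
   empty in this toy).  */
static size_t
qfind (const uint32_t *slots, uint32_t key, uint32_t hash)
{
  size_t mask = NSLOTS - 1;
  size_t idx = hash & mask;
  for (size_t i = 1; ; idx = (idx + i) & mask, i++)
    if (slots[idx] == 0 || slots[idx] == key)
      return idx;  /* empty slot (miss) or the key itself (hit) */
}
```

Unlike the current double-hashing scheme, the probe step needs no second hash computation, and the power-of-two mask replaces the modulo by a prime.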
cse.c: preferable()
Attached patch is also posted at bug #19832 and I think resolves it, as well as /maybe/ offers a negligible speedup of 3-4 M instr or a couple milliseconds. I also post it here for comments. 2011-08-13 Dimitrios Apostolou * cse.c (preferable): Make it more readable and slightly faster, without affecting its logic. === modified file 'gcc/cse.c' --- gcc/cse.c 2011-06-02 21:52:46 + +++ gcc/cse.c 2011-08-13 00:54:06 + @@ -720,32 +720,25 @@ approx_reg_cost (rtx x) static int preferable (int cost_a, int regcost_a, int cost_b, int regcost_b) { - /* First, get rid of cases involving expressions that are entirely - unwanted. */ - if (cost_a != cost_b) -{ - if (cost_a == MAX_COST) - return 1; - if (cost_b == MAX_COST) - return -1; -} + int cost_diff = cost_a - cost_b; + int regcost_diff = regcost_a - regcost_b; - /* Avoid extending lifetimes of hardregs. */ - if (regcost_a != regcost_b) + if (cost_diff != 0) { - if (regcost_a == MAX_COST) - return 1; - if (regcost_b == MAX_COST) - return -1; + /* If none of the expressions are entirely unwanted */ + if ((cost_a != MAX_COST) && (cost_b != MAX_COST) + /* AND only one of the regs is HARD_REG */ + && (regcost_diff != 0) + && ((regcost_a == MAX_COST) || (regcost_b == MAX_COST)) + ) + /* Then avoid extending lifetime of HARD_REG */ + return regcost_diff; + + return cost_diff; } - /* Normal operation costs take precedence. */ - if (cost_a != cost_b) -return cost_a - cost_b; - /* Only if these are identical consider effects on register pressure. */ - if (regcost_a != regcost_b) -return regcost_a - regcost_b; - return 0; + /* cost_a == costb, consider effects on register pressure */ + return regcost_diff; } /* Internal function, to compute cost when X is not a register; called
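To check that the rewrite keeps the original ordering semantics, the patched function can be lifted out and exercised standalone. MAX_COST here is a stand-in sentinel value, not GCC's actual definition; the convention is that a positive return means B is preferable, negative means A:

```c
/* The rewritten cse.c:preferable() from the patch above, lifted out
   for standalone testing.  */
#define MAX_COST 65535  /* stand-in for GCC's sentinel */

static int
preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
{
  int cost_diff = cost_a - cost_b;
  int regcost_diff = regcost_a - regcost_b;

  if (cost_diff != 0)
    {
      /* Unless one expression is entirely unwanted, a hard-reg
         lifetime concern (regcost == MAX_COST) takes precedence
         over a plain cost difference.  */
      if (cost_a != MAX_COST && cost_b != MAX_COST
          && regcost_diff != 0
          && (regcost_a == MAX_COST || regcost_b == MAX_COST))
        return regcost_diff;
      return cost_diff;
    }
  /* Equal costs: fall back to register pressure.  */
  return regcost_diff;
}
```

The interesting case is the third one below: A is cheaper by plain cost, but its regcost is MAX_COST, so B still wins, exactly as in the original's "avoid extending lifetimes of hardregs" branch.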
Re: Various minor speed-ups
For whoever is concerned about memory usage, I didn't measure a real increase, besides a few KB. These are very hot allocation pools and allocating too many blocks of 10 elements is suboptimal. 2011-08-22 Dimitrios Apostolou * cselib.c (cselib_init): Increased initial size of elt_list_pool, elt_loc_list_pool, cselib_val_pool, value_pool allocation pools since they are very frequently used. === modified file 'gcc/cselib.c' --- gcc/cselib.c2011-05-31 19:14:21 + +++ gcc/cselib.c2011-08-17 14:03:56 + @@ -2484,12 +2484,12 @@ void cselib_init (int record_what) { elt_list_pool = create_alloc_pool ("elt_list", -sizeof (struct elt_list), 10); +sizeof (struct elt_list), 128); elt_loc_list_pool = create_alloc_pool ("elt_loc_list", -sizeof (struct elt_loc_list), 10); +sizeof (struct elt_loc_list), 128); cselib_val_pool = create_alloc_pool ("cselib_val_list", - sizeof (cselib_val), 10); - value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 100); + sizeof (cselib_val), 128); + value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 128); cselib_record_memory = record_what & CSELIB_RECORD_MEMORY; cselib_preserve_constants = record_what & CSELIB_PRESERVE_CONSTANTS;
Re: mem_attrs_htab
Hi Jakub, I forgot to mention that all patches are against mid-July trunk, I was hoping I'd have no conflicts. Anyway thanks for letting me know, if there are conflicts with my other patches please let me know, and I'll post an updated version at a later date. All your other concerns are valid and I'll try addressing them in the future. I didn't like hashing addresses either, and I was surprised I saw no regressions. Dimitris This patch isn't against the trunk, where p->offset and p->size aren't rtxes anymore, but HOST_WIDE_INTs. Furthermore, it is a bad idea to hash the p->expr address itself, it doesn't make any sense to hash on what p->expr points to in that case. And p->offset and p->size should be ignored if the *known_p corresponding fields are false. So, if you really think using iterative_hash_object is a win, it should be something like: mem_attrs q = *p; q.expr = NULL; if (!q.offset_known_p) q.offset = 0; if (!q.size_known_p) q.size = 0; return iterative_hash_object (q, iterative_hash_expr (p->expr, 0)); (or better yet avoid q.expr = NULL and instead start hashing from the next field after expr). Hashing the struct padding might not be a good idea either. Jakub
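Jakub's normalize-then-hash suggestion, sketched on a toy struct: copy the object, canonicalise the don't-care fields, and clear padding via memset before hashing the bytes. The FNV-1a loop below stands in for GCC's iterative_hash_object, and the struct is illustrative, not the real mem_attrs:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Toy stand-in for mem_attrs.  */
struct attrs
{
  const void *expr;    /* hash its contents, never its address */
  int offset_known_p;
  long offset;         /* garbage when !offset_known_p */
};

/* FNV-1a over N bytes, seeded with H.  */
static uint32_t
hash_bytes (const void *p, size_t n, uint32_t h)
{
  const unsigned char *s = (const unsigned char *) p;
  for (size_t i = 0; i < n; i++)
    h = (h ^ s[i]) * 16777619u;
  return h;
}

static uint32_t
attrs_hash (const struct attrs *a)
{
  struct attrs q;
  memset (&q, 0, sizeof q);  /* also clears any padding bytes */
  q.offset_known_p = a->offset_known_p;
  if (q.offset_known_p)
    q.offset = a->offset;    /* stale offsets never reach the hash */
  /* q.expr stays NULL: the pointer value itself must not be hashed;
     hash the expression's contents separately as the seed.  */
  return hash_bytes (&q, sizeof q, 2166136261u);
}
```

With this normalisation, two attrs that differ only in a stale offset (offset_known_p false) or in the expr pointer's address hash identically, which is exactly the equality relation the table's eq function uses.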
Re: tree-ssa-structalias.c: alloc_pool for struct equiv_class_label
Forgot the patch... On Mon, 22 Aug 2011, Dimitrios Apostolou wrote: 2011-08-22 Dimitrios Apostolou * tree-ssa-structalias.c (equiv_class_add) (perform_var_substitution, free_var_substitution_info): Created a new equiv_class_pool allocator pool for struct equiv_class_label. Changed the pointer_equiv_class_table and location_equiv_class_table hash tables to not iterate freeing all elements in the end, but just free the pool. === modified file 'gcc/tree-ssa-structalias.c' --- gcc/tree-ssa-structalias.c 2011-04-29 10:59:33 + +++ gcc/tree-ssa-structalias.c 2011-08-18 06:53:12 + @@ -1899,6 +1899,9 @@ static htab_t pointer_equiv_class_table; classes. */ static htab_t location_equiv_class_table; +/* Pool of memory for storing the above */ +static alloc_pool equiv_class_pool; + /* Hash function for a equiv_class_label_t */ static hashval_t @@ -1948,7 +1951,8 @@ equiv_class_add (htab_t table, unsigned bitmap labels) { void **slot; - equiv_class_label_t ecl = XNEW (struct equiv_class_label); + equiv_class_label_t ecl += (equiv_class_label_t) pool_alloc (equiv_class_pool); ecl->labels = labels; ecl->equivalence_class = equivalence_class; @@ -2159,10 +2163,14 @@ perform_var_substitution (constraint_gra struct scc_info *si = init_scc_info (size); bitmap_obstack_initialize (&iteration_obstack); + equiv_class_pool = create_alloc_pool ("equiv_class_label pool", + sizeof (struct equiv_class_label), + 64); + /* NULL free function, we'll free the whole pool at the end of the pass. 
*/ pointer_equiv_class_table = htab_create (511, equiv_class_label_hash, - equiv_class_label_eq, free); + equiv_class_label_eq, NULL); location_equiv_class_table = htab_create (511, equiv_class_label_hash, - equiv_class_label_eq, free); + equiv_class_label_eq, NULL); pointer_equiv_class = 1; location_equiv_class = 1; @@ -2269,6 +2277,7 @@ free_var_substitution_info (struct scc_i sbitmap_free (graph->direct_nodes); htab_delete (pointer_equiv_class_table); htab_delete (location_equiv_class_table); + free_alloc_pool (equiv_class_pool); bitmap_obstack_release (&iteration_obstack); }
Re: Various minor speed-ups
2011-08-22 Dimitrios Apostolou * tree-ssa-pre.c (phi_trans_add, init_pre, fini_pre): Added a pool for phi_translate_table elements to avoid free() calls from htab_delete(). === modified file 'gcc/tree-ssa-pre.c' --- gcc/tree-ssa-pre.c 2011-05-04 09:04:53 + +++ gcc/tree-ssa-pre.c 2011-08-17 08:43:23 + @@ -515,6 +515,10 @@ typedef struct expr_pred_trans_d } *expr_pred_trans_t; typedef const struct expr_pred_trans_d *const_expr_pred_trans_t; +/* Pool of memory for the above */ + +static alloc_pool phi_translate_pool; + /* Return the hash value for a phi translation table entry. */ static hashval_t @@ -571,7 +575,8 @@ static inline void phi_trans_add (pre_expr e, pre_expr v, basic_block pred) { void **slot; - expr_pred_trans_t new_pair = XNEW (struct expr_pred_trans_d); + expr_pred_trans_t new_pair += (expr_pred_trans_t) pool_alloc (phi_translate_pool); new_pair->e = e; new_pair->pred = pred; new_pair->v = v; @@ -580,7 +585,8 @@ phi_trans_add (pre_expr e, pre_expr v, b slot = htab_find_slot_with_hash (phi_translate_table, new_pair, new_pair->hashcode, INSERT); - free (*slot); + if (*slot) +pool_free (phi_translate_pool, *slot); *slot = (void *) new_pair; } @@ -4804,8 +4810,12 @@ init_pre (bool do_fre) calculate_dominance_info (CDI_DOMINATORS); bitmap_obstack_initialize (&grand_bitmap_obstack); + phi_translate_pool = create_alloc_pool ("phi_translate_table pool", + sizeof (struct expr_pred_trans_d), + 4096); + /* NULL as free because we'll free the whole pool in the end. */ phi_translate_table = htab_create (5110, expr_pred_trans_hash, -expr_pred_trans_eq, free); +expr_pred_trans_eq, NULL); expression_to_id = htab_create (num_ssa_names * 3, pre_expr_hash, pre_expr_eq, NULL); @@ -4839,6 +4849,7 @@ fini_pre (bool do_fre) free_alloc_pool (bitmap_set_pool); free_alloc_pool (pre_expr_pool); htab_delete (phi_translate_table); + free_alloc_pool (phi_translate_pool); htab_delete (expression_to_id); VEC_free (unsigned, heap, name_to_id);
tree-ssa-structalias.c: alloc_pool for struct equiv_class_label
2011-08-22 Dimitrios Apostolou * tree-ssa-structalias.c (equiv_class_add) (perform_var_substitution, free_var_substitution_info): Created a new equiv_class_pool allocator pool for struct equiv_class_label. Changed the pointer_equiv_class_table and location_equiv_class_table hash tables to not iterate freeing all elements in the end, but just free the pool.
tree-ssa*: reduce malloc() calls by preallocating hot VECs on the stack
2011-08-22 Dimitrios Apostolou Allocate some very frequently used vectors on the stack: * vecir.h: Defined a tree vector on the stack. * tree-ssa-sccvn.c (print_scc, sort_scc, process_scc) (extract_and_process_scc_for_name): Allocate the scc vector on the stack instead of the heap, giving it a minimal initial size instead of 0. * tree-ssa-structalias.c (get_constraint_for_1) (get_constraint_for, get_constraint_for_rhs, do_deref) (get_constraint_for_ssa_var, get_constraint_for_ptr_offset) (get_constraint_for_component_ref, get_constraint_for_address_of) (process_all_all_constraints, do_structure_copy) (make_constraints_to, make_constraint_to, handle_rhs_call) (handle_lhs_call, handle_const_call, handle_pure_call) (find_func_aliases_for_builtin_call, find_func_aliases_for_call) (find_func_aliases, process_ipa_clobber, find_func_clobbers) (create_variable_info_for): Converted the rhsc, lhsc vectors from heap to stack, with a minimal initial size, since they were very frequently allocated. === modified file 'gcc/tree-ssa-structalias.c' --- gcc/tree-ssa-structalias.c 2011-08-18 06:53:12 + +++ gcc/tree-ssa-structalias.c 2011-08-19 09:43:41 + @@ -477,11 +477,14 @@ struct constraint_expr typedef struct constraint_expr ce_s; DEF_VEC_O(ce_s); -DEF_VEC_ALLOC_O(ce_s, heap); -static void get_constraint_for_1 (tree, VEC(ce_s, heap) **, bool, bool); -static void get_constraint_for (tree, VEC(ce_s, heap) **); -static void get_constraint_for_rhs (tree, VEC(ce_s, heap) **); -static void do_deref (VEC (ce_s, heap) **); +DEF_VEC_ALLOC_O_STACK(ce_s); +#define VEC_ce_s_stack_alloc(alloc) \ + VEC_stack_alloc (ce_s, alloc) + +static void get_constraint_for_1 (tree, VEC(ce_s, stack) **, bool, bool); +static void get_constraint_for (tree, VEC(ce_s, stack) **); +static void get_constraint_for_rhs (tree, VEC(ce_s, stack) **); +static void do_deref (VEC (ce_s, stack) **); /* Our set constraints are made up of two constraint expressions, one LHS, and one RHS. 
@@ -2736,7 +2739,7 @@ new_scalar_tmp_constraint_exp (const cha If address_p is true, the result will be taken its address of. */ static void -get_constraint_for_ssa_var (tree t, VEC(ce_s, heap) **results, bool address_p) +get_constraint_for_ssa_var (tree t, VEC(ce_s, stack) **results, bool address_p) { struct constraint_expr cexpr; varinfo_t vi; @@ -2776,12 +2779,12 @@ get_constraint_for_ssa_var (tree t, VEC( for (; vi; vi = vi->next) { cexpr.var = vi->id; - VEC_safe_push (ce_s, heap, *results, &cexpr); + VEC_safe_push (ce_s, stack, *results, &cexpr); } return; } - VEC_safe_push (ce_s, heap, *results, &cexpr); + VEC_safe_push (ce_s, stack, *results, &cexpr); } /* Process constraint T, performing various simplifications and then @@ -2861,7 +2864,7 @@ bitpos_of_field (const tree fdecl) static void get_constraint_for_ptr_offset (tree ptr, tree offset, - VEC (ce_s, heap) **results) + VEC (ce_s, stack) **results) { struct constraint_expr c; unsigned int j, n; @@ -2920,7 +2923,7 @@ get_constraint_for_ptr_offset (tree ptr, c2.type = ADDRESSOF; c2.offset = 0; if (c2.var != c.var) - VEC_safe_push (ce_s, heap, *results, &c2); + VEC_safe_push (ce_s, stack, *results, &c2); temp = temp->next; } while (temp); @@ -2955,7 +2958,7 @@ get_constraint_for_ptr_offset (tree ptr, c2.var = temp->next->id; c2.type = ADDRESSOF; c2.offset = 0; - VEC_safe_push (ce_s, heap, *results, &c2); + VEC_safe_push (ce_s, stack, *results, &c2); } c.var = temp->id; c.offset = 0; @@ -2974,7 +2977,7 @@ get_constraint_for_ptr_offset (tree ptr, as the lhs. 
*/ static void -get_constraint_for_component_ref (tree t, VEC(ce_s, heap) **results, +get_constraint_for_component_ref (tree t, VEC(ce_s, stack) **results, bool address_p, bool lhs_p) { tree orig_t = t; @@ -2999,7 +3002,7 @@ get_constraint_for_component_ref (tree t temp.offset = 0; temp.var = integer_id; temp.type = SCALAR; - VEC_safe_push (ce_s, heap, *results, &temp); + VEC_safe_push (ce_s, stack, *results, &temp); return; } @@ -3021,7 +3024,7 @@ get_constraint_for_component_ref (tree t temp.offset = 0; temp.var = anything_id; temp.type = ADDRESSOF; - VEC_safe_push (ce_s, heap, *results, &temp); + VEC_safe_push (ce_s, stack, *results, &temp); return; } } @@ -3062,7 +3065,7 @@ get_constraint_for_
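The idea behind the stack-allocated VECs above is a small-buffer optimisation: start in a caller-provided stack buffer and fall back to the heap only when the vector outgrows it, so the common short-lived case performs no malloc at all. A self-contained sketch with made-up names (GCC's VEC macros differ in detail):

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

struct ivec
{
  int *data;
  size_t len, cap;
  int on_stack;  /* nonzero while data points into the caller's buffer */
};

static void
ivec_init (struct ivec *v, int *buf, size_t cap)
{
  v->data = buf;
  v->len = 0;
  v->cap = cap;
  v->on_stack = 1;
}

static void
ivec_push (struct ivec *v, int x)
{
  if (v->len == v->cap)
    {
      /* Spill to the heap; the stack buffer is simply abandoned.  */
      size_t ncap = v->cap * 2;
      int *nd = (int *) malloc (ncap * sizeof *nd);
      memcpy (nd, v->data, v->len * sizeof *nd);
      if (!v->on_stack)
        free (v->data);
      v->data = nd;
      v->cap = ncap;
      v->on_stack = 0;
    }
  v->data[v->len++] = x;
}

static void
ivec_release (struct ivec *v)
{
  if (!v->on_stack)
    free (v->data);
}
```

Since rhsc/lhsc vectors usually hold a handful of constraint_exprs, a small initial stack capacity covers the overwhelming majority of calls without touching the allocator.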
graphds.[ch]: alloc_pool for edges
free() was called way too often before, this patch reduces it significantly. Minor speed-up here too, I don't mention it individually since numbers are within noise margins. 2011-08-22 Dimitrios Apostolou * graphds.h (struct graph): Added edge_pool as a pool for allocating edges. * graphds.c (new_graph): Initialise edge_pool. (add_edge): Allocate edge from edge_pool rather than with malloc. (free_graph): Instead of iterating across the graph freeing edges, just destroy the edge_pool. === modified file 'gcc/graphds.c' --- gcc/graphds.c 2009-11-25 10:55:54 + +++ gcc/graphds.c 2011-08-19 16:44:41 + @@ -62,7 +62,8 @@ new_graph (int n_vertices) g->n_vertices = n_vertices; g->vertices = XCNEWVEC (struct vertex, n_vertices); - + g->edge_pool = create_alloc_pool ("edge_pool", + sizeof (struct graph_edge), 32); return g; } @@ -71,7 +72,7 @@ new_graph (int n_vertices) struct graph_edge * add_edge (struct graph *g, int f, int t) { - struct graph_edge *e = XNEW (struct graph_edge); + struct graph_edge *e = (struct graph_edge *) pool_alloc (g->edge_pool); struct vertex *vf = &g->vertices[f], *vt = &g->vertices[t]; @@ -326,19 +327,7 @@ for_each_edge (struct graph *g, graphds_ void free_graph (struct graph *g) { - struct graph_edge *e, *n; - struct vertex *v; - int i; - - for (i = 0; i < g->n_vertices; i++) -{ - v = &g->vertices[i]; - for (e = v->succ; e; e = n) - { - n = e->succ_next; - free (e); - } -} + free_alloc_pool (g->edge_pool); free (g->vertices); free (g); } === modified file 'gcc/graphds.h' --- gcc/graphds.h 2009-02-20 15:20:38 + +++ gcc/graphds.h 2011-08-19 16:44:41 + @@ -18,6 +18,10 @@ You should have received a copy of the G along with GCC; see the file COPYING3. If not see <http://www.gnu.org/licenses/>. */ + +#include "alloc-pool.h" + + /* Structure representing edge of a graph. */ struct graph_edge @@ -44,10 +48,10 @@ struct vertex struct graph { - int n_vertices; /* Number of vertices. */ - struct vertex *vertices; - /* The vertices. 
*/ - htab_t indices; /* Fast lookup for indices. */ + int n_vertices; /* Number of vertices. */ + struct vertex *vertices; /* The vertices. */ + htab_t indices; /* Fast lookup for indices. */ + alloc_pool edge_pool;/* Pool for allocating edges. */ }; struct graph *new_graph (int);
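The alloc_pool used here amortises both allocation and deallocation: edges are carved out of large blocks, and free_graph releases whole blocks instead of walking every edge. A simplified sketch of the pattern (fixed element size, no free list; not GCC's alloc-pool.c):

```c
#include <stdlib.h>
#include <stddef.h>

#define POOL_BLOCK_ELTS 32

struct pool_block { struct pool_block *next; };

struct pool
{
  size_t elt_size;
  size_t n_avail;   /* elements left in the current block */
  char *next_elt;
  struct pool_block *blocks;
};

static void
pool_init (struct pool *p, size_t elt_size)
{
  p->elt_size = elt_size;
  p->n_avail = 0;
  p->next_elt = NULL;
  p->blocks = NULL;
}

/* Hand out the next element, grabbing a fresh block when the
   current one is exhausted.  One malloc per POOL_BLOCK_ELTS edges
   instead of one per edge.  */
static void *
pool_alloc (struct pool *p)
{
  if (p->n_avail == 0)
    {
      struct pool_block *b = (struct pool_block *)
        malloc (sizeof *b + POOL_BLOCK_ELTS * p->elt_size);
      b->next = p->blocks;
      p->blocks = b;
      p->next_elt = (char *) (b + 1);
      p->n_avail = POOL_BLOCK_ELTS;
    }
  p->n_avail--;
  void *r = p->next_elt;
  p->next_elt += p->elt_size;
  return r;
}

/* What free_graph now relies on: one free() per block.  */
static void
pool_free_all (struct pool *p)
{
  while (p->blocks)
    {
      struct pool_block *n = p->blocks->next;
      free (p->blocks);
      p->blocks = n;
    }
}
```

This is why the free_graph hunk above could delete the whole nested loop over vertices and successor edges: destroying the pool drops every edge at once.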
mem_attrs_htab
2011-08-22 Dimitrios Apostolou * emit-rtl.c (mem_attrs_htab_hash): Hash massively by calling iterative_hash(). We disregard the offset,size rtx fields of the mem_attrs struct, but overall this hash is a *huge* improvement to the previous one, it reduces the collisions/searches ratio from 8 to 0.8 for some cases. (init_emit_once): Slightly increase the mem_attrs_htab initial size because it's frequently used and expanded many times. === modified file 'gcc/emit-rtl.c' --- gcc/emit-rtl.c 2011-05-29 17:40:05 + +++ gcc/emit-rtl.c 2011-08-21 04:44:25 + @@ -256,11 +256,10 @@ mem_attrs_htab_hash (const void *x) { const mem_attrs *const p = (const mem_attrs *) x; - return (p->alias ^ (p->align * 1000) - ^ (p->addrspace * 4000) - ^ ((p->offset ? INTVAL (p->offset) : 0) * 5) - ^ ((p->size ? INTVAL (p->size) : 0) * 250) - ^ (size_t) iterative_hash_expr (p->expr, 0)); + /* By massively feeding the mem_attrs struct to iterative_hash() we + disregard the p->offset and p->size rtx, but in total the hash is + quick and good enough. */ + return iterative_hash_object (*p, iterative_hash_expr (p->expr, 0)); } /* Returns nonzero if the value represented by X (which is really a @@ -5494,7 +5500,7 @@ init_emit_once (void) const_fixed_htab = htab_create_ggc (37, const_fixed_htab_hash, const_fixed_htab_eq, NULL); - mem_attrs_htab = htab_create_ggc (37, mem_attrs_htab_hash, + mem_attrs_htab = htab_create_ggc (128, mem_attrs_htab_hash, mem_attrs_htab_eq, NULL); reg_attrs_htab = htab_create_ggc (37, reg_attrs_htab_hash, reg_attrs_htab_eq, NULL);
Various minor speed-ups
Hello list, the follow-up patches are a selection of minor changes introduced at various times during my GSOC project. They are mostly too simple or not important enough to be posted alone, so I'll post them all together under this thread. Nevertheless, they have been carefully selected from a pool of other changes: they are the ones that *do* offer some (minor) speed improvement and have the least impact on memory usage, if any. They have all been tested on x86_64, some also on i386. For production builds I have seen no regression introduced. Thanks, Dimitris
Re: Dump stats about hottest hash tables when -fmem-report
On Fri, 19 Aug 2011, Tom Tromey wrote: I think you are the most likely person to do this sort of testing. You can use machines on the GCC compile farm for this. Your patch to change the symbol table's load factor is fine technically. I think the argument for putting it in is lacking; what I would like to see is either some rationale explaining that the increased memory use is not important, or some numbers showing that it still performs well on more than a single machine. My reason for wanting this is just that, historically, GCC has been very sensitive to increases in memory use. Alternatively, comments from more active maintainers indicating that they don't care about this would also help your case. I can't approve or reject the libiberty change, just the libcpp one. Hi Tom, thanks for your comments. I'm well aware that I should test on more equipment besides i386 and x86_64 and I certainly plan to. It's just that writing patches is one thing, but advocating them on gcc-patches is an equally hard thing, and I plan on doing the latter correctly after GSOC ends. As for the technical side of this patch, I've noticed that while in general there is a good gain to be earned from reduced collisions, there is an overhead in htab_traverse() and htab_delete() that iterate over the array. This is evident primarily in var-tracking, which is more htab-intensive than the rest of the compiler altogether! So I plan to resubmit my patch together with small changes in var-tracking. Thanks, Dimitris
Re: Dump stats about hottest hash tables when -fmem-report
On Tue, 9 Aug 2011, Tom Tromey wrote: "Richard" == Richard Guenther writes: The libcpp part is ok with this change. Richard> Note that sparsely populated hashes come at the cost of increased Richard> cache footprint. Not sure what is more important here though, memory Richard> access or hash computation. I was only approving the change to the dumping. I am undecided about making the hash tables more sparse. Since my Core Quad processor has large caches and the i386 has small pointer size, the few extra empty buckets impose small overhead, which as it seems is minor in comparison to the gains from fewer rehashes. Maybe that's not true on older or alternate equipment. I'd be very interested to hear about runtime measurements on various equipment; please let me know if you do any. Thanks, Dimitris
Re: Dump stats about hottest hash tables when -fmem-report
I forgot to include the dwarf2out.c:file_table. Stats are printed when -g. See attached patch. Additional Changelog: * dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if -fmem-report. Dimitris === modified file 'gcc/dwarf2out.c' --- gcc/dwarf2out.c 2011-06-06 17:46:00 + +++ gcc/dwarf2out.c 2011-08-09 07:56:11 + @@ -24820,6 +24820,13 @@ dwarf2out_finish (const char *filename) add_comp_dir_attribute (comp_unit_die ()); } + if (mem_report) +{ + fprintf(stderr, "\ndwarf2out.c: file_table hash table statistics:\n"); + htab_dump_statistics(file_table, sizeof (struct dwarf_file_data)); +} + + for (i = 0; i < VEC_length (deferred_locations, deferred_locations_list); i++) { add_location_or_const_value_attribute (
Dump stats about hottest hash tables when -fmem-report
Hello list, this is the second part of my patch. It collects and prints some memory statistics for hash tables that I've measured as the hottest ones. Tested together with previous patch on i386. Example output: libcpp symtab string pool: identifiers 32644 (100.00%) entries 32644 (49.81%) deleted 0 slots 65536 string bytes445k (4064 obstack_memory_used) table size 256k searches234217 collisions 86838 coll/search 0.3708 ins/search 0.1394 avg. entry 13.97 bytes (+/- 6.42) longest entry 52 No gimple statistics ??? tree nodes created (No per-node statistics) DECL_DEBUG_EXPR hash: size 1021, 23 elements, 0.00 collisions DECL_VALUE_EXPR hash: size 1021, 0 elements, 0.00 collisions tree.c:type_hash_table stats slots 32749 identifiers 5605 entries 5605 (17.12%) deleted 0 struct htab 60 B table size 127 kB element 8 B total contents 43 kB searches27032 collisions 9148 coll/search 0.3384 emit-rtl.c:mem_attrs_htab hash table: slots 8191 identifiers 2295 entries 2295 (28.02%) deleted 0 struct htab 60 B table size 31 kB element 24 B total contents 53 kB searches7166 collisions 3784 coll/search 0.5280 cgraph.c:cgraph_hash hash table stats: slots 8191 identifiers 198 entries 3100 (37.85%) deleted 2902 struct htab 60 B table size 31 kB element 160 B total contents 30 kB searches103425 collisions 32761 coll/search 0.3168 var-tracking.c stats 2363 vars->htab hash tables: total searches 483055 total collisions68621 total coll/search 0.1421 54 changed_variables hash tables total searches 33417 total collisions14253 total coll/search 0.4265 54 value_chains hash tables total searches 43924 total collisions14027 total coll/search 0.3193 cselib stats for 614 hash tables total searches 52840 total collisions9597 total coll/search 0.1816 Changelog: 2011-08-09 Dimitrios Apostolou * cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump stats about cgraph_hash hash table. * cselib.c, cselib.h (cselib_dump_stats): New function to dump stats about cselib_hash_table. 
        * cselib.c (cselib_finish): Keep statistics by increasing values
        of new global variables cselib_htab_{num,searches,collisions} if
        -fmem-report.
        * emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
        dump statistics about mem_attrs_htab hash table.
        * tree.c (print_type_hash_statistics): Used the new
        htab_dump_statistics() function.
        * var-tracking.c (shared_hash_destroy): Keep statistics by
        increasing values of new global variables
        vars_htab_{num,searches,collisions} if -fmem-report.
        (vt_finalize): Keep statistics by increasing values of new
        global variables cv_htab_{num,searches,collisions} and
        vc_htab_{num,searches,collisions} if -fmem-report.
        * var-tracking.c, rtl.h (vt_dump_stats): New function to dump
        stats about vars->htab, changed_variables and value_chains hash
        tables.
        * toplev.c: Included cselib.h for cselib_dump_stats().
        (dump_memory_report): Call all the above functions to provide
        better statistics.
        * hashtab.c, hashtab.h (htab_dump_statistics, htab_collisions_num)
        (htab_searches_num): New functions for statistics.
        * hashtab.c: Included assert.h for checks in htab_dump_statistics.
        * symtab.c (ht_dump_statistics): Beautified stats output.

Thanks,
Dimitris

=== modified file 'gcc/cgraph.c'
--- gcc/cgraph.c	2011-06-06 17:12:25 +
+++ gcc/cgraph.c	2011-08-09 01:01:19 +
@@ -2897,4 +2897,11 @@ cgraph_used_from_object_file_p (struct c
   return false;
 }

+void
+cgraph_dump_stats (void)
+{
+  fprintf (stderr, "\ncgraph.c:cgraph_hash hash table stats:\n");
+  htab_dump_statistics (cgraph_hash, sizeof (struct cgraph_node));
+}
+
 #include "gt-cgraph.h"

=== modified file 'gcc/cgraph.h'
--- gcc/cgraph.h	2011-05-31 14:58:49 +
+++ gcc/cgraph.h	2011-08-08 23:19:51 +
@@ -540,6 +540,7 @@ bool cgraph_can_remove_if_no_direct
Decrease fill-ratio of hash tables
Hello list, the attached simple patch reproducibly saves a tiny bit of time on various -O0 compilations I've performed, for example 5 ms out of 627 ms. Tested on i386. We trade a little memory (maxmem2.sh from [1] reported 25240 KB instead of 25080 KB without the patch) to let our hash tables be more sparsely populated (always at least half-empty). Since our hash tables resolve conflicts by rehashing, reducing collisions is good.

Another patch that I'll post soon helped me measure the collisions/searches ratio for the hottest hash tables. It was easy to notice that the ratio was too high, reaching 0.5 or even 0.7 in some cases. This made clear that we need some deep refactoring of our hash tables, either changing the hash functions or the complete hash table implementation, so that collision handling involves no rehashing. Unfortunately such attempts failed, either because they were too simple to show any difference, or because they were too intrusive and involved changes everywhere htab_t is used (almost everywhere). This patch is the simplest one to show a positive change, but I believe some careful hash table redesign must be planned. For example, for the mem_attrs_htab hash table the coll/searches ratio is still sometimes higher than 0.5.

Changelog:

2011-08-09  Dimitrios Apostolou

        * symtab.c (ht_lookup_with_hash): Hash table will now be doubled
        when 50% full, not 75%, to reduce collisions.
        * hashtab.c (find_empty_slot_for_expand)
        (htab_find_slot_with_hash): The same.

Thanks,
Dimitris

[1] http://gcc.gnu.org/wiki/PerformanceTesting

=== modified file 'libcpp/symtab.c'
--- libcpp/symtab.c	2009-07-18 02:22:16 +
+++ libcpp/symtab.c	2011-08-09 02:39:45 +
@@ -172,7 +172,7 @@
   HT_STR (node) = (const unsigned char *) obstack_copy0 (&table->stack,
							 str, len);

-  if (++table->nelements * 4 >= table->nslots * 3)
+  if (++table->nelements * 2 > table->nslots)
     /* Must expand the string table.
     */
     ht_expand (table);

=== modified file 'libiberty/hashtab.c'
--- libiberty/hashtab.c	2011-02-03 07:23:20 +
+++ libiberty/hashtab.c	2011-08-09 02:39:45 +
@@ -515,8 +515,9 @@
 }

 /* The following function changes size of memory allocated for the
-   entries and repeatedly inserts the table elements.  The occupancy
-   of the table after the call will be about 50%.  Naturally the hash
+   entries and repeatedly inserts the table elements.  It is designed to
+   be called when table is half-full.  The occupancy
+   of the table after the call will be about 25%.  Naturally the hash
    table must already exist.  Remember also that the place of the
    table entries is changed.  If memory allocation failures are allowed,
    this function will return zero, indicating that the table could not be
@@ -542,7 +543,7 @@
      too full or too empty.  */
   if (elts * 2 > osize || (elts * 8 < osize && osize > 32))
     {
-      nindex = higher_prime_index (elts * 2);
+      nindex = higher_prime_index (elts * 4);
       nsize = prime_tab[nindex].prime;
     }
   else
@@ -648,7 +649,7 @@
   PTR entry;

   size = htab_size (htab);
-  if (insert == INSERT && size * 3 <= htab->n_elements * 4)
+  if (insert == INSERT && htab->n_elements * 2 > size)
     {
       if (htab_expand (htab) == 0)
	return NULL;
Re: [RFC] hard-reg-set.h refactoring
On Mon, 1 Aug 2011, Paolo Bonzini wrote:
> On 08/01/2011 05:57 PM, Dimitrios Apostolou wrote:
>> I don't fully understand the output from -fdump-tree-all, but my
>> conclusion, based also on profiler output and objdump, is that both
>> unrolling and inlining are happening in both versions. Nevertheless I
>> can see that the assembly output is a bit different in the two cases
>> (I can post specific disassembly output if you are interested).
>
> Thanks for checking. Have you tried the idea of passing an unsigned
> HOST_WIDEST_FAST_INT * (or whatever the name) to the target hook?

Keeping my patch exactly the same, just changing hook_void_hard_reg_set to receive a (HOST_WIDEST_FAST_INT *) argument and doing the necessary typecasts added an extra 3 M instructions. But ix86_live_on_entry is called only 1233 times from df-scan.c, which isn't enough to explain all this overhead.

Dimitris
Re: [RFC] hard-reg-set.h refactoring
On Sun, 31 Jul 2011, Paolo Bonzini wrote:
> On Sat, Jul 30, 2011 at 19:21, Dimitrios Apostolou wrote:
>> Nevertheless I'd appreciate comments on whether any part of this patch
>> is worth keeping. FWIW I've profiled this on i386 to be about 4 M
>> instr slower out of ~1.5 G instr. I'll now be checking the profiler to
>> see where exactly the overhead is.
>
> I suggest -fdump-tree-all too, to check if unrolling is happening and
> if not why.

I don't fully understand the output from -fdump-tree-all, but my conclusion, based also on profiler output and objdump, is that both unrolling and inlining are happening in both versions. Nevertheless I can see that the assembly output is a bit different in the two cases (I can post specific disassembly output if you are interested).

My opinion is that the code cleanup is worth the minor overhead, given that there should be no regressions.

Thanks,
Dimitris
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
On Sun, 31 Jul 2011, Steven Bosscher wrote:
> On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher wrote:
>> I'll see if I can test the patch on the compile farm this weekend,
>> just to be sure.
>
> Worked fine with some cross-builds to arm-eabi. Bootstrap on
> ia64-unknown-linux-gnu is in stage2 but it is taking forever (on
> gcc60)... I will not be able to run the test suite anymore this
> weekend, but I'll start the run. Anyway, anything that makes it into
> stage2 on ia64 can't be all bad, eh? ;-)

Steven, thanks a lot for your help! :-) If you have any scripts/guidelines for automatic testing on various platforms on the compile farm, I'd be happy to use them.

Thanks,
Dimitris
[RFC] hard-reg-set.h refactoring
Hello list, the attached patch changes hard-reg-set.h in the following areas: 1) HARD_REG_SET is now always a struct so that it can be used in files where we don't want to include tm.h. Many thanks to Paolo for providing the idea and the original patch. 2) Code for specific HARD_REG_SET_LONG values is deleted and only generic code is left, making the file much more readable/maintainable. I was expecting gcc would unroll, even at -O2, loops with 2-3 iterations, so performance should have been the same. I don't intend for this to go mainline, Jakub has explained on IRC that certain ABIs make it slower to pass structs and we wouldn't want that. Nevertheless I'd appreciate comments on whether any part of this patch is worth keeping. FWIW I've profiled this on i386 to be about 4 M instr slower out of ~1.5 G inst. I'll be now checking the profiler to see where exactly the overhead is. Thanks, Dimitris === modified file 'gcc/hard-reg-set.h' --- gcc/hard-reg-set.h 2011-01-03 20:52:22 + +++ gcc/hard-reg-set.h 2011-07-29 22:32:27 + @@ -24,35 +24,31 @@ along with GCC; see the file COPYING3. /* Define the type of a set of hard registers. */ /* HARD_REG_ELT_TYPE is a typedef of the unsigned integral type which - will be used for hard reg sets, either alone or in an array. - - If HARD_REG_SET is a macro, its definition is HARD_REG_ELT_TYPE, - and it has enough bits to represent all the target machine's hard - registers. Otherwise, it is a typedef for a suitably sized array - of HARD_REG_ELT_TYPEs. HARD_REG_SET_LONGS is defined as how many. + will be used for hard reg sets. An HARD_REG_ELT_TYPE, or an + array of them is wrapped in a struct. Note that lots of code assumes that the first part of a regset is the same format as a HARD_REG_SET. To help make sure this is true, we only try the widest fast integer mode (HOST_WIDEST_FAST_INT) - instead of all the smaller types. 
This approach loses only if - there are very few registers and then only in the few cases where - we have an array of HARD_REG_SETs, so it needn't be as complex as - it used to be. */ - -typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE; - -#if FIRST_PSEUDO_REGISTER <= HOST_BITS_PER_WIDEST_FAST_INT - -#define HARD_REG_SET HARD_REG_ELT_TYPE + instead of all the smaller types. */ +#ifdef ENABLE_RTL_CHECKING +#define gcc_rtl_assert(EXPR) gcc_assert (EXPR) #else +#define gcc_rtl_assert(EXPR) ((void)(0 && (EXPR))) +#endif + +typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE; #define HARD_REG_SET_LONGS \ ((FIRST_PSEUDO_REGISTER + HOST_BITS_PER_WIDEST_FAST_INT - 1) \ / HOST_BITS_PER_WIDEST_FAST_INT) -typedef HARD_REG_ELT_TYPE HARD_REG_SET[HARD_REG_SET_LONGS]; -#endif +#define HARD_REG_SET struct hard_reg_set + +struct hard_reg_set { + HARD_REG_ELT_TYPE elems[HARD_REG_SET_LONGS]; +}; /* HARD_CONST is used to cast a constant to the appropriate type for use with a HARD_REG_SET. */ @@ -89,343 +85,108 @@ typedef HARD_REG_ELT_TYPE HARD_REG_SET[H hard_reg_set_intersect_p (X, Y), which returns true if X and Y intersect. hard_reg_set_empty_p (X), which returns true if X is empty. 
*/ -#define UHOST_BITS_PER_WIDE_INT ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT) -#ifdef HARD_REG_SET +#define HARD_REG_ELT_BITS ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT) #define SET_HARD_REG_BIT(SET, BIT) \ - ((SET) |= HARD_CONST (1) << (BIT)) + hard_reg_set_set_bit (&(SET), (BIT)) #define CLEAR_HARD_REG_BIT(SET, BIT) \ - ((SET) &= ~(HARD_CONST (1) << (BIT))) + hard_reg_set_clear_bit(&(SET), (BIT)) #define TEST_HARD_REG_BIT(SET, BIT) \ - (!!((SET) & (HARD_CONST (1) << (BIT - -#define CLEAR_HARD_REG_SET(TO) ((TO) = HARD_CONST (0)) -#define SET_HARD_REG_SET(TO) ((TO) = ~ HARD_CONST (0)) - -#define COPY_HARD_REG_SET(TO, FROM) ((TO) = (FROM)) -#define COMPL_HARD_REG_SET(TO, FROM) ((TO) = ~(FROM)) - -#define IOR_HARD_REG_SET(TO, FROM) ((TO) |= (FROM)) -#define IOR_COMPL_HARD_REG_SET(TO, FROM) ((TO) |= ~ (FROM)) -#define AND_HARD_REG_SET(TO, FROM) ((TO) &= (FROM)) -#define AND_COMPL_HARD_REG_SET(TO, FROM) ((TO) &= ~ (FROM)) - -static inline bool -hard_reg_set_subset_p (const HARD_REG_SET x, const HARD_REG_SET y) -{ - return (x & ~y) == HARD_CONST (0); -} + hard_reg_set_bit_p((SET), (BIT)) -static inline bool -hard_reg_set_equal_p (const HARD_REG_SET x, const HARD_REG_SET y) -{ - return x == y; -} - -static inline bool -hard_reg_set_intersect_p (const HARD_REG_SET x, const HARD_REG_SET y) -{ - return (x & y) != HARD_CONST (0); -} - -static inline bool -hard_reg_set_empty_p (const HARD_REG_SET x) +static inline void +hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit) { - return x == HARD_CONST (0); -} - +#if HARD_REG_SET_LONGS > 1 + int word = bit / HARD_REG_ELT_BITS; + int bitpos = bit % HARD_REG_ELT_BITS; #else - -#define SET_HARD_REG_BIT(SET, BIT) \ - ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT] \ - |= HARD_CONST (1) << ((BIT) %
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
On Fri, 29 Jul 2011, Kenneth Zadeck wrote:
> I really think that patches of this magnitude having to do with the
> RTL level should be tested on more than one platform.

I'd really appreciate further testing on alternate platforms from whoever does it casually; for me it would take too much time to set up my testing platform on the GCC compile farm, and deadlines are approaching.

Thanks,
Dimitris
Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order
Completely forgot to mention it: tested on i386, no regressions.

Dimitrios
[DF] [performance] generate DF_REF_BASE REFs in REGNO order
Hello list, the attached patch achieves a performance improvement by first recording DF_REF_BASE DEFs within df_get_call_refs() before the DF_REF_REGULARs are recorded in df_defs_record(). BASE DEFs are also recorded in REGNO order. The improvement has been measured as follows, compiling tcp_ipv4.c of the Linux kernel with -O0:

trunk  : 1438.4 M instr, 0.627 s
patched: 1376.5 M instr, 0.604 s

It also includes suggested changes from Paolo discussed on list (subject: what can be in a group set). Many thanks to him for the invaluable help while writing the patch.

For whoever is interested, you can see the two profiles with fully annotated source, before and after the change, at the following links. The big difference is that qsort() is now called only 33 times instead of thousands, from df_sort_and_compress_refs(). Further measurements, comments and ideas for further improvements are welcome.

http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=callgrind-tcp_ipv4-trunk-co-109439-prod.txt
http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=callgrind-tcp_ipv4-df2-co-prod.txt

Changelog:

2011-07-29  Dimitrios Apostolou
	    Paolo Bonzini

        (df_def_record_1): Assert a parallel must contain an EXPR_LIST
        at this point. Receive the LOC and move its extraction...
        (df_defs_record): ... here. Rewrote logic with a switch
        statement instead of multiple if-else.
        (df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions
        that duplicate the logic of df_defs_record() and
        df_def_record_1() but without actually recording any DEFs, only
        marking them in the defs HARD_REG_SET.
        (df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs
        that are the result of the call. Record DF_REF_BASE DEFs in
        REGNO order. Use regs_invalidated_by_call HARD_REG_SET instead
        of regs_invalidated_by_call_regset bitmap.
        (df_insn_refs_collect): Record DF_REF_REGULAR DEFs after
        df_get_call_refs().

Thanks,
Dimitris

P.S.
maraz: that's 4.3% improvement in instruction count, should you start worrying or is it too late? ;-) === modified file 'gcc/df-scan.c' --- gcc/df-scan.c 2011-02-02 20:08:06 + +++ gcc/df-scan.c 2011-07-29 16:01:50 + @@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c rtx, rtx *, basic_block, struct df_insn_info *, enum df_ref_type, int ref_flags); -static void df_def_record_1 (struct df_collection_rec *, rtx, +static void df_def_record_1 (struct df_collection_rec *, rtx *, basic_block, struct df_insn_info *, int ref_flags); static void df_defs_record (struct df_collection_rec *, rtx, @@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x) } -/* Process all the registers defined in the rtx, X. +/* Process all the registers defined in the rtx pointed by LOC. Autoincrement/decrement definitions will be picked up by df_uses_record. */ static void df_def_record_1 (struct df_collection_rec *collection_rec, - rtx x, basic_block bb, struct df_insn_info *insn_info, + rtx *loc, basic_block bb, struct df_insn_info *insn_info, int flags) { - rtx *loc; - rtx dst; - - /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL - construct. */ - if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER) -loc = &XEXP (x, 0); - else -loc = &SET_DEST (x); - dst = *loc; + rtx dst = *loc; /* It is legal to have a set destination be a parallel. */ if (GET_CODE (dst) == PARALLEL) { int i; - for (i = XVECLEN (dst, 0) - 1; i >= 0; i--) { rtx temp = XVECEXP (dst, 0, i); - if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER - || GET_CODE (temp) == SET) - df_def_record_1 (collection_rec, - temp, bb, insn_info, -GET_CODE (temp) == CLOBBER -? 
flags | DF_REF_MUST_CLOBBER : flags); + gcc_assert (GET_CODE (temp) == EXPR_LIST); + df_def_record_1 (collection_rec, &XEXP (temp, 0), + bb, insn_info, flags); } return; } @@ -3003,26 +2990,98 @@ df_defs_record (struct df_collection_rec int flags) { RTX_CODE code = GET_CODE (x); + int i; - if (code == SET || code == CLOBBER) -{ - /* Mark the single def within the pattern. */ - int clobber_flags = flags; - clobber_flags |= (code == CLOBBER) ? DF_REF_MUST_CLOBBER : 0; - df_def_record_1 (collection_rec, x, bb, insn_info, clobber_flags); -} - else if (code == COND_EXEC) + switch (code) { +case SET: + df_def_record_1 (colle
Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version
Bug found at last, it's in the following hunk: the missing ampersand before exit_block_uses... :-@

@@ -3951,7 +3949,7 @@ df_get_exit_block_use_set (bitmap exit_b
     {
       rtx tmp = EH_RETURN_STACKADJ_RTX;
       if (tmp && REG_P (tmp))
-	df_mark_reg (tmp, exit_block_uses);
+	df_mark_reg (tmp, &exit_block_uses);
     }
 #endif

@@ -3961,12 +3959,12 @@ df_get_exit_block_use_set (bitmap exit_b
     {
       rtx tmp = EH_RETURN_HANDLER_RTX;
       if (tmp && REG_P (tmp))
-	df_mark_reg (tmp, exit_block_uses);
+	df_mark_reg (tmp, &exit_block_uses);
     }
 #endif

   /* Mark function return value.  */
-  diddle_return_value (df_mark_reg, (void*) exit_block_uses);
+  diddle_return_value (df_mark_reg, (void*) &exit_block_uses);
 }

Thanks to everyone for looking at my code; it seems to be working now, failing only on some mudflap tests that I've been told to ignore. Expect a patch repost soon :-)

FWIW, test failures in comparison to trunk are the following, but I'll ignore them:

FAIL: libmudflap.cth/pass39-frag.c (rerun 18) execution test
FAIL: libmudflap.cth/pass39-frag.c (rerun 18) output pattern test
FAIL: libmudflap.cth/pass39-frag.c (-static -DSTATIC) (rerun 10) execution test
FAIL: libmudflap.cth/pass39-frag.c (-static -DSTATIC) (rerun 10) output pattern test
FAIL: libmudflap.cth/pass39-frag.c (-O2) (rerun 7) execution test
FAIL: libmudflap.cth/pass39-frag.c (-O2) (rerun 7) output pattern test
FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 5) execution test
FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 5) output pattern test

Dimitris
added some assert checks in hard-reg-set.h
Hello list, the attached patch was tested on i386, and measured to have almost no overhead in runtime, when RTL checks are enabled: Instruction Count before: 2328.6 M Instruction Count after: 2334.4 M which translates to some milliseconds, well within noise area. Changelog: 2011-07-25 Dimitrios Apostolou * hard-reg-set.h (TEST_HARD_REG_BIT, SET_HARD_REG_BIT, CLEAR_HARD_REG_BIT): Added some assert checks for test, set and clear operations of HARD_REG_SETs, enabled when RTL checks are on. Runtime overhead was measured as negligible. Thanks, Dimitris=== modified file 'gcc/hard-reg-set.h' --- gcc/hard-reg-set.h 2011-01-03 20:52:22 + +++ gcc/hard-reg-set.h 2011-07-25 17:06:36 + @@ -41,6 +41,13 @@ along with GCC; see the file COPYING3. typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE; +#ifdef ENABLE_RTL_CHECKING +#define gcc_rtl_assert(EXPR) gcc_assert (EXPR) +#else +#define gcc_rtl_assert(EXPR) ((void)(0 && (EXPR))) +#endif + + #if FIRST_PSEUDO_REGISTER <= HOST_BITS_PER_WIDEST_FAST_INT #define HARD_REG_SET HARD_REG_ELT_TYPE @@ -91,14 +98,35 @@ typedef HARD_REG_ELT_TYPE HARD_REG_SET[H #define UHOST_BITS_PER_WIDE_INT ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT) -#ifdef HARD_REG_SET - #define SET_HARD_REG_BIT(SET, BIT) \ - ((SET) |= HARD_CONST (1) << (BIT)) + hard_reg_set_set_bit (&(SET), (BIT)) #define CLEAR_HARD_REG_BIT(SET, BIT) \ - ((SET) &= ~(HARD_CONST (1) << (BIT))) + hard_reg_set_clear_bit(&(SET), (BIT)) #define TEST_HARD_REG_BIT(SET, BIT) \ - (!!((SET) & (HARD_CONST (1) << (BIT + hard_reg_set_bit_p((SET), (BIT)) + +#ifdef HARD_REG_SET + +static inline void +hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit) +{ + gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + (*s) |= HARD_CONST (1) << bit; +} + +static inline void +hard_reg_set_clear_bit (HARD_REG_SET *s, unsigned int bit) +{ + gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + (*s) &= ~(HARD_CONST (1) << bit); +} + +static inline bool +hard_reg_set_bit_p (const HARD_REG_SET s, unsigned int bit) +{ + 
gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + return ((s >> bit) & HARD_CONST (1)); +} #define CLEAR_HARD_REG_SET(TO) ((TO) = HARD_CONST (0)) #define SET_HARD_REG_SET(TO) ((TO) = ~ HARD_CONST (0)) @@ -137,17 +165,32 @@ hard_reg_set_empty_p (const HARD_REG_SET #else -#define SET_HARD_REG_BIT(SET, BIT) \ - ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT] \ - |= HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT)) - -#define CLEAR_HARD_REG_BIT(SET, BIT) \ - ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT] \ - &= ~(HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT))) - -#define TEST_HARD_REG_BIT(SET, BIT)\ - (!!((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT] \ - & (HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT +static inline void +hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit) +{ + int byte = bit / UHOST_BITS_PER_WIDE_INT; + int bitpos = bit % UHOST_BITS_PER_WIDE_INT; + gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + (*s)[byte] |= HARD_CONST (1) << bitpos; +} + +static inline void +hard_reg_set_clear_bit (HARD_REG_SET *s, unsigned int bit) +{ + int byte = bit / UHOST_BITS_PER_WIDE_INT; + int bitpos = bit % UHOST_BITS_PER_WIDE_INT; + gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + (*s)[byte] &= ~(HARD_CONST (1) << bitpos); +} + +static inline bool +hard_reg_set_bit_p (const HARD_REG_SET s, unsigned int bit) +{ + int byte = bit / UHOST_BITS_PER_WIDE_INT; + int bitpos = bit % UHOST_BITS_PER_WIDE_INT; + gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER); + return ((s[byte] >> bitpos) & HARD_CONST (1)); +} #if FIRST_PSEUDO_REGISTER <= 2*HOST_BITS_PER_WIDEST_FAST_INT #define CLEAR_HARD_REG_SET(TO) \
Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version
That was a bug, indeed, but unfortunately it wasn't the one causing the crash I posted earlier... Even after fixing it I get the same backtrace from gdb. So the petition "spot the bug" holds... Thanks, Dimitris
eliminate bitmap regs_invalidated_by_call_regset
Hello list, the attached patch eliminates regs_invalidated_by_call_regset bitmap and uses instead the original regs_invalidated_by_call HARD_REG_SET. Tested on i386, I had the following two regressions that I'll investigate right on: FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 10) execution test FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 10) output pattern test Performance measured not to be affected, maybe it is now a couple milliseconds faster: Original: PC1:0.878s, PC2:6.55s, 2105.6 M instr Patched : PC1:0.875s, PC2:6.54s, 2104.9 M instr 2011-07-25 Dimitrios Apostolou * df-core.c, df-problems.c, df-scan.c, df.h, reginfo.c, regset.h: Eliminate regs_invalidated_by_call_regset bitmap and use instead the original regs_invalidated_by_call HARD_REG_SET. All comments are welcome, Dimitris=== modified file 'gcc/df-core.c' --- gcc/df-core.c 2011-04-20 18:19:03 + +++ gcc/df-core.c 2011-07-25 13:58:58 + @@ -1886,6 +1886,17 @@ df_print_regset (FILE *file, bitmap r) } +void +df_print_hard_reg_set (FILE *f, HARD_REG_SET r) +{ + unsigned int i; + hard_reg_set_iterator iter; + + EXECUTE_IF_SET_IN_HARD_REG_SET (r, 0, i, iter) +fprintf (f, " %d [%s]", i, reg_names[i]); + fprintf (f, "\n"); +} + /* Write information about registers and basic blocks into FILE. The bitmap is in the form used by df_byte_lr. This is part of making a debugging dump. */ === modified file 'gcc/df-problems.c' --- gcc/df-problems.c 2011-05-04 20:24:15 + +++ gcc/df-problems.c 2011-07-25 13:58:58 + @@ -432,6 +432,7 @@ df_rd_local_compute (bitmap all_blocks) { unsigned int bb_index; bitmap_iterator bi; + hard_reg_set_iterator iter; unsigned int regno; struct df_rd_problem_data *problem_data = (struct df_rd_problem_data *) df_rd->problem_data; @@ -449,7 +450,7 @@ df_rd_local_compute (bitmap all_blocks) } /* Set up the knockout bit vectors to be applied across EH_EDGES. 
*/ - EXECUTE_IF_SET_IN_BITMAP (regs_invalidated_by_call_regset, 0, regno, bi) + EXECUTE_IF_SET_IN_HARD_REG_SET (regs_invalidated_by_call, 0, regno, iter) { if (DF_DEFS_COUNT (regno) > DF_SPARSE_THRESHOLD) bitmap_set_bit (sparse_invalidated, regno); @@ -969,6 +970,29 @@ df_lr_confluence_0 (basic_block bb) bitmap_copy (op1, &df->hardware_regs_used); } +/* to |= from1 & ~from2 + from2 is of type HARD_REG_SET */ + +static bool +bitmap_ior_and_compl_from_hard_reg_set (bitmap to, const_bitmap from1, + HARD_REG_SET from2) +{ + bool ret; + unsigned int i; + bitmap_head from1_tmp; + hard_reg_set_iterator iter; + + bitmap_initialize (&from1_tmp, &bitmap_default_obstack); + bitmap_copy (&from1_tmp, from1); + + /* TODO optimise per-word */ + EXECUTE_IF_SET_IN_HARD_REG_SET (from2, 0, i, iter) +bitmap_clear_bit (&from1_tmp, i); + ret = bitmap_ior_into (to, &from1_tmp); + + bitmap_clear (&from1_tmp); + return ret; +} /* Confluence function that ignores fake edges. */ @@ -983,7 +1007,8 @@ df_lr_confluence_n (edge e) /* ??? Abnormal call edges ignored for the moment, as this gets confused by sibling call edges, which crashes reg-stack. 
*/ if (e->flags & EDGE_EH) -changed = bitmap_ior_and_compl_into (op1, op2, regs_invalidated_by_call_regset); +changed = bitmap_ior_and_compl_from_hard_reg_set (op1, op2, + regs_invalidated_by_call); else changed = bitmap_ior_into (op1, op2); @@ -4450,8 +4475,8 @@ df_md_confluence_n (edge e) return false; if (e->flags & EDGE_EH) -return bitmap_ior_and_compl_into (op1, op2, - regs_invalidated_by_call_regset); +return bitmap_ior_and_compl_from_hard_reg_set (op1, op2, + regs_invalidated_by_call); else return bitmap_ior_into (op1, op2); } === modified file 'gcc/df-scan.c' --- gcc/df-scan.c 2011-02-02 20:08:06 + +++ gcc/df-scan.c 2011-07-25 13:58:58 + @@ -409,7 +409,7 @@ df_scan_start_dump (FILE *file ATTRIBUTE rtx insn; fprintf (file, ";; invalidated by call \t"); - df_print_regset (file, regs_invalidated_by_call_regset); + df_print_hard_reg_set (file, regs_invalidated_by_call); fprintf (file, ";; hardware regs used \t"); df_print_regset (file, &df->hardware_regs_used); fprintf (file, ";; regular block artificial uses \t"); @@ -3317,7 +3317,7 @@ df_get_call_refs (struct df_collection_r int flags) { rtx note; - bitmap_iterator bi; + hard_reg_set_iterator iter; unsigned int ui; bool is_sibling_call; unsigned int i; @@ -3375,7 +3375,7 @@ df_get_call_refs (struct df_collection_r } is_sibling_call = SIBLING_CALL_P (insn_info->in
Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version
Bug found: in df_mark_reg I need to iterate until regno + n, not n. The error is in the following hunk:

--- gcc/df-scan.c	2011-02-02 20:08:06 +
+++ gcc/df-scan.c	2011-07-24 17:16:46 +
@@ -3713,35 +3717,40 @@ df_mark_reg (rtx reg, void *vset)
   if (regno < FIRST_PSEUDO_REGISTER)
     {
       int n = hard_regno_nregs[regno][GET_MODE (reg)];
-      bitmap_set_range (set, regno, n);
+      int i;
+      for (i = regno; i < regno + n; i++)

Many thanks to monoid from IRC for spotting it! I'll post an updated patch soon.

Thanks,
Dimitris
Re: [RFC] Replace some bitmaps with HARD_REG_SETs
Hi Steven,

On Sun, 24 Jul 2011, Steven Bosscher wrote:
> Can you please create your patches with the -p option, so that it's
> easier to see what function you are changing? Also, even for an RFC
> patch a ChangeLog is more than just nice to have ;-)

Do you mean an entry in the ChangeLog file in the root directory? I should update my tree to latest every time for my patch to be valid, right?

> This hunk in df-problems looks odd:
>
> @@ -2511,9 +2535,9 @@
>        if (bb_index == EXIT_BLOCK)
>          {
>            unsigned regno;
> -          bitmap_iterator bi;
> -          EXECUTE_IF_SET_IN_BITMAP (df->exit_block_uses, FIRST_PSEUDO_REGISTER,
> -                                    regno, bi)
>
> So this EXECUTE_IF_SET_IN_BITMAP starts with FIRST_PSEUDO_REGISTER
> (i.e. works on pseudo registers) ...
>
> +          hard_reg_set_iterator iter;
> +          EXECUTE_IF_SET_IN_HARD_REG_SET (df->exit_block_uses, FIRST_PSEUDO_REGISTER,
> +                                          regno, iter)
>              gcc_unreachable ();
>          }
>        else
>
> ... and you change it to work only on hard registers. That code looked
> like a check to verify that exit_block_uses only contains hard
> registers. So you can probably just drop this check.

Thanks for reminding me of this; I had indeed removed this check, I just didn't commit it to my VCS :-) Should I resend my patch with all the suggestions you made?

Thanks,
Dimitris
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
On Fri, 8 Jul 2011, Paolo Bonzini wrote:
> On 07/08/2011 05:51 AM, Dimitrios Apostolou wrote:
>> +  /* first write DF_REF_BASE */
>
> This is not necessary. These uses are written to use_vec, while the
> uses from REG_EQUIV and REG_EQUAL are written to eq_use_vec (see
> df_ref_create_structure).

Thanks for pointing this out, I missed it; it's complex to follow what is written where. Perhaps there is meaning in changing the interface of df_ref_create_structure() to accept the particular vector.

> Also, anyway this wouldn't work because you would have to split the
> loop in two. I'll attribute that to the time of day when you were
> writing the message. :)

Even though I'm having my own disputes with Morpheus, sleep deprivation wasn't the culprit for this. :-) My intention was to keep DF_REF_BASEs as close to each other as possible, because a close-to-sorted array will be sorted faster with the sorting function I'll write... :-) But anyway, as you've explained, it's irrelevant since they are going to different vectors.

Thanks,
Dimitris
Re: what can be in a group set?
Thanks Paolo for the detailed explanation!

On Fri, 8 Jul 2011, Paolo Bonzini wrote:
> That said, changing exit_block_uses and entry_block_defs to
> HARD_REG_SET would be a nice cleanup, but it would also touch target
> code due to targetm.extra_live_on_entry (entry_block_defs);

I've already done that :-p

> I wouldn't bother for now until you're a bit more experienced. Unlike
> invalidated_by_call it shouldn't show up in profiles, or does it?

Indeed it doesn't show; I just wanted to do it as a clean-up for transitioning all relevant sets in struct df_d to HARD_REG_SET. The only problem remaining is that I need a bitmap_copy_from_hard_reg_set() function for df_lr_local_compute(), where the bb_info->use bitmap is initialised from the exit_block_uses HARD_REG_SET.

Dimitris
Re: what can be in a group set?
On Fri, 8 Jul 2011, Paolo Bonzini wrote:
> On 07/08/2011 12:43 PM, Richard Sandiford wrote:
>> The docs also say that the first expr_list can be null:
>>
>>   If @var{lval} is a @code{parallel}, it is used to represent the case
>>   of a function returning a structure in multiple registers. Each
>>   element of the @code{parallel} is an @code{expr_list} whose first
>>   operand is a @code{reg} and whose second operand is a
>>   @code{const_int} representing the offset (in bytes) into the
>>   structure at which the data in that register corresponds. The first
>>   element may be null to indicate that the structure is also passed
>>   partly in memory.
>>
>> but I can't see any code to handle that. Am I missing something, or
>> does the lack of a crash here mean that we can remove the last
>> sentence? (It might have been added for symmetry with argument
>> passing, where this sort of thing is needed. But if it isn't actually
>> used or implemented for returns, it might be less confusing to remove
>> it.)
>
> Indeed. Dimitrios, can you pick up the patch since it will somewhat
> simplify your work to eliminate defs_generated?

I'll certainly try :-)

Paolo, something else: in df_mark_reg(), is it ever possible for regno to be >= FIRST_PSEUDO_REGISTER? An assert I've put doesn't trigger for my simple test :-)

Thanks,
Dimitris
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
On Fri, 8 Jul 2011, Richard Guenther wrote:
> On Fri, Jul 8, 2011 at 5:20 AM, Dimitrios Apostolou wrote:
>> Hello list,
>>
>> The attached patch does two things for df_get_call_refs():
>>
>> * First it uses HARD_REG_SETs for defs_generated and
>>   regs_invalidated_by_call, instead of bitmaps.  Replacing in total more
>>   than 400K calls (for my testcase) to bitmap_bit_p() with the much
>>   faster TEST_HARD_REG_BIT reduces the total instruction count from
>>   about 13M to 1.5M.
>>
>> * Second it produces the REFs in REGNO order, which is important to keep
>>   the collection_rec sorted most times, and avoid expensive calls to
>>   qsort().  Thanks to Paolo Bonzini for idea and mentoring.
>>
>> The second part makes a big difference if accompanied with another patch
>> in df_insn_refs_collect().  I'll post a followup patch, that is
>> unfortunately unstable for some of my tests, so I'd appreciate any
>> comments.
>
> Did you check the impact on memory usage?  I suppose on targets with not
> many hard registers it should even improve, but do we expect memory usage
> to be worse in any case?

Hi Richard,

I didn't check memory usage; is that important? Since struct bitmap is fairly bulky, it would take an arch with lots of hard regs (which one has the most?) before HARD_REG_SET gets bigger. But even then, wouldn't a tradeoff of a few bytes be acceptable for a much faster type? And IMHO it makes the code easier to understand, since once you see HARD_REG_SET you know not to expect anything else.

FWIW I'm now in the process of converting all other bitmap uses for hard regs to HARD_REG_SETs, at least within DF. I'm not sure whether performance gains will be visible, however; not much code is as hot as df_get_call_refs().

Thanks,
Dimitris
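The speed and memory tradeoff being discussed can be sketched outside GCC. The toy code below contrasts the two lookups: a fixed-width set test is a shift and a mask, while a sparse bitmap must first walk a chunk list. The macro and type names are illustrative stand-ins for what lives in GCC's hard-reg-set.h and bitmap.c, not the real definitions:

```c
#include <assert.h>
#include <limits.h>

/* Assumed 64-register target; real targets vary.  */
#define NUM_HARD_REGS 64
#define ULBITS (sizeof (unsigned long) * CHAR_BIT)
typedef unsigned long hard_reg_set_t[(NUM_HARD_REGS + ULBITS - 1) / ULBITS];

/* One divide-by-constant, one shift, one AND -- no pointer chasing,
   no out-of-line call.  This is the TEST_HARD_REG_BIT shape.  */
#define TEST_BIT(set, r) (((set)[(r) / ULBITS] >> ((r) % ULBITS)) & 1UL)
#define SET_BIT(set, r)  ((set)[(r) / ULBITS] |= 1UL << ((r) % ULBITS))

/* A sparse bitmap, by contrast, must search its chunk list before it
   can test anything -- the loop below is roughly what each of the 400K
   bitmap_bit_p() calls paid for.  */
struct chunk { unsigned int index; unsigned long word; struct chunk *next; };

static int
sparse_bit_p (const struct chunk *c, unsigned int r)
{
  for (; c; c = c->next)
    if (c->index == r / ULBITS)
      return (c->word >> (r % ULBITS)) & 1;
  return 0;
}
```

On the memory side, this toy hard_reg_set_t is 8 bytes for 64 registers, while each sparse chunk carries an index and link pointer on top of its payload, which is the "fairly bulky" cost mentioned above.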
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
On Fri, 8 Jul 2011, Jakub Jelinek wrote:
> On Fri, Jul 08, 2011 at 06:20:04AM +0300, Dimitrios Apostolou wrote:
>> The attached patch does two things for df_get_call_refs():
>>
>> * First it uses HARD_REG_SETs for defs_generated and
>>   regs_invalidated_by_call, instead of bitmaps.  Replacing in total more
>>   than 400K calls (for my testcase) to bitmap_bit_p() with the much
>>   faster TEST_HARD_REG_BIT reduces the total instruction count from
>>   about 13M to 1.5M.
>
> Have you verified that collection_rec->def_vec never contains pseudo
> register references?  Otherwise you couldn't use HARD_REG_SET...
> gcc_checking_assert might be useful.

Hi Jakub,

Steve pointed me to the following from the GCC Internals Manual:

  call_insn insns have the same extra fields as insn insns, accessed in
  the same way and in addition contain a field CALL_INSN_FUNCTION_USAGE,
  which contains a list (chain of expr_list expressions) containing use
  and clobber expressions that denote hard registers and MEMs used or
  clobbered by the called function.

So doesn't that mean that for CALL insns it should contain only HARD_REG DEFs? I will of course use an assert to be sure.

Thanks,
Dimitris
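The guard Jakub suggests can be sketched in isolation. Below, a plain assert() stands in for gcc_checking_assert, and the FIRST_PSEUDO_REGISTER value and function name are made up for the example; the point is simply that any regno at or above FIRST_PSEUDO_REGISTER denotes a pseudo and must never index a fixed-width hard register set:

```c
#include <assert.h>
#include <limits.h>

/* FIRST_PSEUDO_REGISTER is target-specific; 53 is only an example.  */
#define FIRST_PSEUDO_REGISTER 53
#define ULBITS (sizeof (unsigned long) * CHAR_BIT)

typedef unsigned long
  hard_reg_set_t[(FIRST_PSEUDO_REGISTER + ULBITS - 1) / ULBITS];

/* Hypothetical helper: record a call-clobbered def, checking first that
   the regno really is a hard register.  In GCC this check would be
   gcc_checking_assert so release builds pay nothing.  */
static void
record_call_def (hard_reg_set_t defs_generated, unsigned int regno)
{
  assert (regno < FIRST_PSEUDO_REGISTER);
  defs_generated[regno / ULBITS] |= 1UL << (regno % ULBITS);
}
```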
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
On Fri, 8 Jul 2011, Steven Bosscher wrote:
> On Fri, Jul 8, 2011 at 5:20 AM, Dimitrios Apostolou wrote:
>> The attached patch does two things for df_get_call_refs():
>
> How did you test this patch?  Normally, a patch submission comes with
> text like, "Bootstrapped & tested on ..., no regressions.".  Also, you
> should write a ChangeLog entry, best included in your mail somewhere at
> the end ;-)

Hi Steven, thanks for the instructions. I've not run the mandatory tests you told me about, only done some minor testing due to lack of time. I'm not yet posting patches for inclusion, but more as an RFC. Should such patches be sent to gcc instead of gcc-patches?

Thanks,
Dimitris
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
And here is the patch that breaks things. By moving df_defs_record() *after* df_get_call_refs() most times collection_rec remains sorted, and about 50M instructions are avoided in qsort() calls of df_canonize_collection_rec(). Unfortunately this does not work. Sometimes cc1 crashes, for example because regstack is empty in subst_stack_regs_in_debug_insn(). Any ideas for ensuring proper ordering of collection_rec?

Thanks,
Dimitris

=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c	2011-07-08 01:28:55 +
+++ gcc/df-scan.c	2011-07-08 03:38:38 +
@@ -3412,8 +3412,9 @@
   VEC_truncate (df_ref, collection_rec->eq_use_vec, 0);
   VEC_truncate (df_mw_hardreg_ptr, collection_rec->mw_vec, 0);
 
-  /* Record register defs.  */
-  df_defs_record (collection_rec, PATTERN (insn_info->insn), bb, insn_info, 0);
+  if (CALL_P (insn_info->insn))
+    df_get_call_refs (collection_rec, bb, insn_info,
+                      (is_cond_exec) ? DF_REF_CONDITIONAL : 0);
 
   /* Process REG_EQUIV/REG_EQUAL notes.  */
   for (note = REG_NOTES (insn_info->insn); note;
@@ -3421,33 +3422,33 @@
     {
       switch (REG_NOTE_KIND (note))
         {
+          /* first write DF_REF_BASE */
+        case REG_NON_LOCAL_GOTO:
+          /* The frame ptr is used by a non-local goto.  */
+          df_ref_record (DF_REF_BASE, collection_rec,
+                         regno_reg_rtx[FRAME_POINTER_REGNUM],
+                         NULL, bb, insn_info,
+                         DF_REF_REG_USE, 0);
+#if !HARD_FRAME_POINTER_IS_FRAME_POINTER
+          df_ref_record (DF_REF_BASE, collection_rec,
+                         regno_reg_rtx[HARD_FRAME_POINTER_REGNUM],
+                         NULL, bb, insn_info,
+                         DF_REF_REG_USE, 0);
+#endif
+          break;
         case REG_EQUIV:
         case REG_EQUAL:
           df_uses_record (collection_rec,
                           &XEXP (note, 0), DF_REF_REG_USE,
                           bb, insn_info, DF_REF_IN_NOTE);
           break;
-        case REG_NON_LOCAL_GOTO:
-          /* The frame ptr is used by a non-local goto.  */
-          df_ref_record (DF_REF_BASE, collection_rec,
-                         regno_reg_rtx[FRAME_POINTER_REGNUM],
-                         NULL, bb, insn_info,
-                         DF_REF_REG_USE, 0);
-#if !HARD_FRAME_POINTER_IS_FRAME_POINTER
-          df_ref_record (DF_REF_BASE, collection_rec,
-                         regno_reg_rtx[HARD_FRAME_POINTER_REGNUM],
-                         NULL, bb, insn_info,
-                         DF_REF_REG_USE, 0);
-#endif
-          break;
         default:
           break;
         }
     }
 
-  if (CALL_P (insn_info->insn))
-    df_get_call_refs (collection_rec, bb, insn_info,
-                      (is_cond_exec) ? DF_REF_CONDITIONAL : 0);
+  /* Record register defs.  */
+  df_defs_record (collection_rec, PATTERN (insn_info->insn), bb, insn_info, 0);
 
   /* Record the register uses.  */
   df_uses_record (collection_rec,
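The motivation for keeping collection_rec sorted is that a linear sortedness check can skip the qsort entirely, which is roughly what df_canonize_collection_rec() does with its ref comparator. A standalone sketch of that "sort only if needed" idea, with regno-keyed integers standing in for df_refs:

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator over regnos; avoids overflow that plain subtraction
   could cause on large unsigned values.  */
static int
cmp_regno (const void *a, const void *b)
{
  unsigned int x = *(const unsigned int *) a;
  unsigned int y = *(const unsigned int *) b;
  return (x > y) - (x < y);
}

/* If the refs were produced in ascending REGNO order, the O(n) scan
   below finds no inversion and qsort is never called.  Only a
   collection built out of order pays the full sort.  */
static void
sort_if_needed (unsigned int *regnos, size_t n)
{
  size_t i;
  for (i = 1; i < n; i++)
    if (regnos[i - 1] > regnos[i])
      {
        qsort (regnos, n, sizeof *regnos, cmp_regno);
        return;
      }
}
```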
Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
To document the gains from the bitmaps, here is (part of) the annotated source from the callgrind profiler, showing instruction counts.

Before:

 1,154,400      if (bitmap_bit_p (regs_invalidated_by_call_regset, i)
 8,080,800  => bitmap.c:bitmap_bit_p (192400x)
 1,021,200          && !bitmap_bit_p (&defs_generated, i)
 5,106,000  => bitmap.c:bitmap_bit_p (170200x)
   340,400          && (!is_sibling_call
         .              || !bitmap_bit_p (df->exit_block_uses, i)
         .              || refers_to_regno_p (i, i+1,
         .                                    crtl->return_rtx, NULL)))
 2,053,500        df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
35,279,934  => df-scan.c:df_ref_record (170200x)
         .                       NULL, bb, insn_info, DF_REF_REG_DEF,
         .                       DF_REF_MAY_CLOBBER | flags);
         .    }

After:

 1,346,800      if (TEST_HARD_REG_BIT (regs_invalidated_by_call, i)
   510,600          && !TEST_HARD_REG_BIT (defs_generated, i)
   340,400          && (!is_sibling_call
         .              || !bitmap_bit_p (df->exit_block_uses, i)
         .              || refers_to_regno_p (i, i+1,
         .                                    crtl->return_rtx, NULL)))
 2,057,200        df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
35,279,934  => df-scan.c:df_ref_record (170200x)
         .                       NULL, bb, insn_info, DF_REF_REG_DEF,
         .                       DF_REF_MAY_CLOBBER | flags);
         .    }

Dimitris
[df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps
Hello list,

The attached patch does two things for df_get_call_refs():

* First it uses HARD_REG_SETs for defs_generated and regs_invalidated_by_call, instead of bitmaps. Replacing in total more than 400K calls (for my testcase) to bitmap_bit_p() with the much faster TEST_HARD_REG_BIT reduces the total instruction count from about 13M to 1.5M.

* Second it produces the REFs in REGNO order, which is important to keep the collection_rec sorted most times, and avoid expensive calls to qsort(). Thanks to Paolo Bonzini for idea and mentoring.

The second part makes a big difference if accompanied with another patch in df_insn_refs_collect(). I'll post a followup patch, that is unfortunately unstable for some of my tests, so I'd appreciate any comments.

Thanks,
Dimitris

=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c	2011-02-02 20:08:06 +
+++ gcc/df-scan.c	2011-07-08 01:28:55 +
@@ -3317,20 +3317,56 @@
    int flags)
 {
   rtx note;
-  bitmap_iterator bi;
-  unsigned int ui;
   bool is_sibling_call;
   unsigned int i;
   df_ref def;
-  bitmap_head defs_generated;
+  HARD_REG_SET defs_generated;
 
-  bitmap_initialize (&defs_generated, &df_bitmap_obstack);
+  CLEAR_HARD_REG_SET (defs_generated);
 
   /* Do not generate clobbers for registers that are the result of the
      call.  This causes ordering problems in the chain building code
      depending on which def is seen first.  */
   FOR_EACH_VEC_ELT (df_ref, collection_rec->def_vec, i, def)
-    bitmap_set_bit (&defs_generated, DF_REF_REGNO (def));
+    SET_HARD_REG_BIT (defs_generated, DF_REF_REGNO (def));
+
+  is_sibling_call = SIBLING_CALL_P (insn_info->insn);
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    {
+      if (i == STACK_POINTER_REGNUM)
+	/* The stack ptr is used (honorarily) by a CALL insn.  */
+	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+		       NULL, bb, insn_info, DF_REF_REG_USE,
+		       DF_REF_CALL_STACK_USAGE | flags);
+      else if (global_regs[i])
+	{
+	  /* Calls to const functions cannot access any global registers and
+	     calls to pure functions cannot set them.  All other calls may
+	     reference any of the global registers, so they are recorded as
+	     used.  */
+	  if (!RTL_CONST_CALL_P (insn_info->insn))
+	    {
+	      df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+			     NULL, bb, insn_info, DF_REF_REG_USE, flags);
+	      if (!RTL_PURE_CALL_P (insn_info->insn))
+		df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+			       NULL, bb, insn_info, DF_REF_REG_DEF, flags);
+	    }
+	}
+      /* TODO HARD_REG_SET set intersection! */
+      else /* !global_regs[i] */
+	/* track Caller-Saved registers */
+	if (TEST_HARD_REG_BIT (regs_invalidated_by_call, i)
+	    && !TEST_HARD_REG_BIT (defs_generated, i)
+	    && (!is_sibling_call
+		|| !bitmap_bit_p (df->exit_block_uses, i)
+		|| refers_to_regno_p (i, i+1,
+				      crtl->return_rtx, NULL)))
+	  df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+			 NULL, bb, insn_info, DF_REF_REG_DEF,
+			 DF_REF_MAY_CLOBBER | flags);
+    }
 
   /* Record the registers used to pass arguments, and explicitly
      noted as clobbered.  */
@@ -3345,7 +3381,7 @@
 	  if (REG_P (XEXP (XEXP (note, 0), 0)))
 	    {
 	      unsigned int regno = REGNO (XEXP (XEXP (note, 0), 0));
-	      if (!bitmap_bit_p (&defs_generated, regno))
+	      if (!TEST_HARD_REG_BIT (defs_generated, regno))
 		df_defs_record (collection_rec, XEXP (note, 0), bb,
 				insn_info, flags);
 	    }
@@ -3355,40 +3391,6 @@
 	}
     }
 
-  /* The stack ptr is used (honorarily) by a CALL insn.  */
-  df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[STACK_POINTER_REGNUM],
-		 NULL, bb, insn_info, DF_REF_REG_USE,
-		 DF_REF_CALL_STACK_USAGE | flags);
-
-  /* Calls to const functions cannot access any global registers and calls to
-     pure functions cannot set them.  All other calls may reference any of the
-     global registers, so they are recorded as used.  */
-  if (!RTL_CONST_CALL_P (insn_info->insn))
-    for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
-      if (global_regs[i])
-	{
-	  df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
-			 NULL, bb, insn_info, DF_REF_REG_USE, flags);
-	  if (!RTL_PURE_CALL_P (insn_info->insn))
-	    df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
-			   NULL, bb, insn_info, DF_REF_REG_DEF, flags);
-	}
-
-  is_sibling_call = SIBLING_CALL_P (insn_info->insn);
-  EXECUTE_IF_SET_
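The patch's "TODO HARD_REG_SET set intersection!" comment hints at replacing the per-register TEST_HARD_REG_BIT pair with one whole-set operation; GCC provides AND_COMPL_HARD_REG_SET for exactly that. A toy word-at-a-time equivalent, computing clobberable = invalidated & ~defs (toy types, not GCC's):

```c
#include <assert.h>
#include <limits.h>

#define NUM_HARD_REGS 64
#define ULBITS (sizeof (unsigned long) * CHAR_BIT)
#define NWORDS ((NUM_HARD_REGS + ULBITS - 1) / ULBITS)

typedef unsigned long hard_reg_set_t[NWORDS];

/* dst = a & ~b, one AND-NOT per machine word instead of two bit tests
   per register -- the set-operation form of the caller-saved check.  */
static void
and_compl_hard_reg_set (hard_reg_set_t dst,
                        const hard_reg_set_t a, const hard_reg_set_t b)
{
  unsigned int w;
  for (w = 0; w < NWORDS; w++)
    dst[w] = a[w] & ~b[w];
}
```

The loop over registers would then only iterate the bits still set in dst, although the sibling-call and exit_block_uses conditions in the patch would still need per-register handling.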
Re: Patch: speed up compiler a little bit by optimizing lookup_attribute() and is_attribute_p()
Hi Nicola, my patch is too simple compared to yours, so feel free to work on it as much as you wish; no need to credit me since you posted yours independently. I just posted mine to note that the inlining part is the one providing most of the performance benefit.

richi: I used always_inline because it is the case that you *never* want it out-of-line, since it's a small wrapper function providing an important performance benefit. Do you think a macro would be better?

Thanks,
Dimitris

P.S. Nicola, if you remember it, please keep me Cc'd in further posts of yours related to performance of gcc.

On Tue, 21 Jun 2011, Nicola Pero wrote:
> Dimitrious
>
> I didn't realize you were working on this.  Your patch is indeed very
> similar. :-)
>
> Can I go ahead and rewrite mine following Richard's suggestions (which
> would make it even more similar to yours), and add your name to the
> ChangeLog entry too ?
>
> Thanks
Re: Patch: speed up compiler a little bit by optimizing lookup_attribute() and is_attribute_p()
FWIW I think that most of the speedup is due to inlining lookup_attribute(). I got almost the same by applying only the attached very simple patch, since strlen() was called too often (according to the profile at [1]). I used the always_inline attribute to avoid using a macro. I was going to send it together with some other changes I'm trying, after proper measurements. Anyway, better late than never. Thanks to Christophe Jaillet (CC'd) for pointing it out to me.

Thanks,
Dimitris

[1] http://gcc.gnu.org/wiki/OptimisingGCC

On Tue, 21 Jun 2011, Richard Guenther wrote:
> On Tue, Jun 21, 2011 at 12:17 PM, Nicola Pero wrote:
>>>> This patch speeds up the C/C++/ObjC/ObjC++ compiler a little bit by
>>>> optimizing lookup_attribute() and is_attribute_p().  The main change
>>>> is that these functions are now inline.
>>> I don't think this is a good idea.
>> Can you explain why ?  You never do in your response :-)  I'm guessing
>> it's because you think the inline functions are too big and that's
>> counter-productive ?
> Yes.  For this case I'd factor out the NULL attribute list check into an
> inline function and keep the non-NULL attribute list pieces out-of-line.
>> Ok ... so is this what you'd like - the common, hot case in the inline
>> function, and the less common, cold case in the out-of-line one ? :-)
> Right.
>> That makes sense - I'll try that out tonight (it takes a few hours to
>> run all the benchmarks). ;-)
> Thanks.  Does it work for all languages?  And yes, I agree it would be
> nice to canonicalize to the form without _s even in the attribute lists
> itself.  Or go one step further - do not store an identifier but an enum
> as we in fact only ever insert known attributes into the list (which
> should be a VEC instead, possibly even sorted to allow binary search).
>> Sure ... one step at a time. :-)
> Of course ;)
>
> Richard.
Thanks

--- gcc-4.6.0.orig/gcc/tree.c	2011-03-14 14:20:48.0 +0200
+++ gcc-4.6.0-mine/gcc/tree.c	2011-06-21 12:57:00.0 +0300
@@ -5230,10 +5230,9 @@ is_attribute_p (const char *attr, const_
    be passed back in if further occurrences are wanted.  */
 
 tree
-lookup_attribute (const char *attr_name, tree list)
+lookup_attribute_len (const char *attr_name, size_t attr_len, tree list)
 {
   tree l;
-  size_t attr_len = strlen (attr_name);
 
   for (l = list; l; l = TREE_CHAIN (l))
     {
--- gcc-4.6.0.orig/gcc/tree.h	2011-03-12 00:38:58.0 +0200
+++ gcc-4.6.0-mine/gcc/tree.h	2011-06-21 13:13:02.0 +0300
@@ -4369,7 +4369,16 @@ extern int is_attribute_p (const char *,
 /* Given an attribute name and a list of attributes, return
    the list element of the attribute or NULL_TREE if not found.  */
-extern tree lookup_attribute (const char *, tree);
+extern tree lookup_attribute_len (const char *, size_t, tree);
+
+/* Implemented as inline so that the compiler optimises away the strlen()
+   for known strings.  */
+static __attribute__((__always_inline__)) tree
+lookup_attribute (const char *attr_name, tree list)
+{
+  return lookup_attribute_len (attr_name, strlen (attr_name), list);
+}
 
 /* Remove any instances of attribute ATTR_NAME in LIST and return the
    modified list.  */
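The pattern in the patch can be demonstrated standalone: an out-of-line worker takes an explicit length, and a tiny always_inline wrapper passes strlen (attr_name); when the name is a string literal, the compiler folds the strlen() to a constant, so the hot path never scans the string. The toy list below stands in for GCC's tree attribute chains (the real lookup also normalizes leading/trailing underscores via is_attribute_p, which this sketch omits):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy attribute list in place of a tree chain.  */
struct attr { const char *name; struct attr *next; };

/* Out-of-line worker: length is supplied by the caller.  */
static struct attr *
lookup_attribute_len (const char *attr_name, size_t attr_len,
                      struct attr *list)
{
  for (; list; list = list->next)
    if (strlen (list->name) == attr_len
        && memcmp (list->name, attr_name, attr_len) == 0)
      return list;
  return NULL;
}

/* always_inline wrapper: for literal arguments the compiler evaluates
   strlen ("...") at compile time, eliminating the library call.  */
static inline __attribute__((__always_inline__)) struct attr *
lookup_attribute (const char *attr_name, struct attr *list)
{
  return lookup_attribute_len (attr_name, strlen (attr_name), list);
}
```

Note the `inline` keyword alongside the attribute: GCC warns about always_inline on functions not also declared inline.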
Re: fix memory leak in gengtype
On Thu, 21 Apr 2011, Laurynas Biveinis wrote:
> :( Why don't you get yourself a compile farm account?
>
> http://gcc.gnu.org/wiki/CompileFarm

Thanks Laurynas, I am absolutely thrilled to see such a variety of hardware! I'll try applying, but I'm not sure I'm eligible; my contributions to OSS are just a few and mostly simple.

Thanks,
Dimitris
Re: fix memory leak in gengtype
On Wed, 20 Apr 2011, Jeff Law wrote:
> On 04/20/11 15:08, Dimitrios Apostolou wrote:
>> Hello list, while trying to build gcc-4.6.0 on my sparcstation, I got
>> gengtype OOM killed.  That's when I noticed that its RAM usage peaks at
>> 150MB, which is a bit excessive for parsing a ~500K text file.  The
>> attached patch fixes the leak and gengtype now uses a peak of 4MB heap.
>> Hopefully I don't do something wrong, since it took me a while to
>> understand those obstacks...
>
> The code in question creates an obstack, allocates (and grows) a single
> object on the obstack, then frees the object.  This leaks the underlying
> obstack structure itself and potentially any chunks that were too small
> to hold the object.

Plus a whole page which is preallocated by the obstack, if I understand correctly. As a result, for each word in the text file we consume 4KB, which are never freed.

> It turns out there's a similar leak in gengtype.c which is fixed in the
> same way.

Nice, thanks for looking deeper into this; I just stopped when memory utilisation seemed OK.

> A quick valgrind test shows that prior to your change gengtype leaked
> roughly 200M, after your change it leaks about 1.3M and after fixing
> gengtype it leaks a little under 300k.  I'll run those changes through
> the usual tests and check in the changes assuming they pass those tests.
> Thanks for the patch!
>
>> P.S. I was trying to test gcc on a rare arch (sparc-unknown-linux-gnu)
>> but unfortunately the sparcstation crashed and burned after this, so I
>> can't continue the build and report back :-(
>
> :( My old PA box has similar problems, though it merely overheats before
> a bootstrap can complete, so in theory I could coax it to finish a
> bootstrap.  Luckily others (particularly John) have stepped in over the
> last decade and taken excellent care of the PA port.

If by PA you mean PA-RISC, I remember when I had access to a Visualize C200 with gentoo on. I loved the machine, but it had an important issue: it was absolutely random whether it would power up when pressing the power button. But as long as we never turned it off, it worked OK :-)

Dimitris
fix memory leak in gengtype
Hello list,

while trying to build gcc-4.6.0 on my sparcstation, I got gengtype OOM killed. That's when I noticed that its RAM usage peaks at 150MB, which is a bit excessive for parsing a ~500K text file. The attached patch fixes the leak and gengtype now uses a peak of 4MB heap. Hopefully I don't do something wrong, since it took me a while to understand those obstacks...

Thanks,
Dimitris

P.S. I was trying to test gcc on a rare arch (sparc-unknown-linux-gnu) but unfortunately the sparcstation crashed and burned after this, so I can't continue the build and report back :-(

--- gcc/gengtype-state.c.orig	2011-04-20 23:06:29.0 +0300
+++ gcc/gengtype-state.c	2011-04-20 23:12:43.0 +0300
@@ -303,7 +303,7 @@
   obstack_1grow (&id_obstack, (char) 0);
   ids = XOBFINISH (&id_obstack, char *);
   sid = state_ident_by_name (ids, INSERT);
-  obstack_free (&id_obstack, ids);
+  obstack_free (&id_obstack, NULL);
   ids = NULL;
   tk = XCNEW (struct state_token_st);
   tk->stok_kind = STOK_NAME;
@@ -408,7 +408,7 @@
   tk->stok_file = state_path;
   tk->stok_next = NULL;
   strcpy (tk->stok_un.stok_string, cstr);
-  obstack_free (&bstring_obstack, cstr);
+  obstack_free (&bstring_obstack, NULL);
   return tk;
 }