failed attempt: retain identifier length from frontend to backend

2012-08-20 Thread Dimitrios Apostolou

Hello,

my last attempt at improving something serious was about three weeks ago: 
trying to keep the lengths of all strings parsed in the frontend for the 
whole compilation, until assembly output. I was hoping that would help in 
using faster hashes (knowing the length lets us hash 4 or 8 bytes per 
iteration), quicker strcmp in various places, and fewer strlen() calls, 
which show up especially in -g3 compilations that store huge macro strings.


I'll post no patch here, since what I currently have is a mess spread 
across 3 different branches, most of which doesn't even compile. I tried 
various approaches. First I tried adding an extra length parameter to all 
relevant functions, starting from the assembly generation and working my 
way upwards. This got too complex, and I'd really like to ask whether you 
see any point in changing target-specific hooks and macros to actually 
accept a length argument (e.g. ASM_OUTPUT_*) or return one 
(e.g. ASM_GENERATE_*). The changes seemed too intrusive for me to continue.


But seeing that the identifier length is already there inside struct 
ht_identifier (or cpp_hashnode) and never lost, I tried the approach of 
having the length at str[-4] for all identifiers. To achieve this I changed 
ht_identifier to store str with the flexible array hack. Unfortunately I 
hadn't noticed that ht_identifier is part of tree_node and also part of too 
many other structs, so changing all those structs to have variable size was 
not without side effects. In the end it compiled, but I got crashes all 
over, and I'm sure I didn't do things right, since I broke things like the 
static assert in libcpp/identifiers.c, which I don't even understand:


 /* We don't need a proxy since the hash table's identifier comes first
in cpp_hashnode.  However, in case this is ever changed, we have a
static assertion for it.  */
-extern char proxy_assertion_broken[offsetof (struct cpp_hashnode, ident) == 0 ? 1 : -1];

Anyway, my last attempt was to decouple ht_identifier completely from trees 
and other structs by storing a pointer to it, but I was pretty worn out and 
quickly gave up after getting errors in gengtype-generated files that I 
didn't even know how to handle.


Was this whole project too ambitious? I'd appreciate any input.


Thanks,
Dimitris



[DF] RFC: obstacks in DF

2012-08-20 Thread Dimitrios Apostolou

Hi,

while I was happily using obstacks in other parts of the compiler, I 
thought they would provide a handy solution for the XNEWVECs/XRESIZEVECs in 
df-scan.c, especially in df_install_refs(), which is the heaviest malloc() 
user after the rest of my patches.


In the process I realised that obstacks weren't exactly suitable (thanks to 
matz for helping on IRC); nevertheless I have a patch which is stable, a 
bit faster and more memory friendly (I don't know why, but RSS for common, 
non-pathological compilations was smaller). However, after trying various 
incorrect changes I'm aware that there are leaks in there that can't be 
closed without dirty hacks, so I gave up.


I'm posting the simplest but stable version of my changes and would really 
like to hear ideas. There are "holes" in the obstack that should have been 
freed, but in the end I didn't see memory increase. I don't know what would 
happen for huge functions that keep the obstack alive for long; I didn't 
have such a testcase. Finally, I think my next try will involve converting 
the chains to a pool_alloc'ed linked list (sketched below); do you think 
that makes sense?



Thanks,
Dimitris
=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2012-07-16 11:32:42 +
+++ gcc/df-scan.c   2012-08-20 14:01:59 +
@@ -184,6 +184,17 @@ struct df_scan_problem_data
   bitmap_obstack insn_bitmaps;
 };
 
+/* Obstack for storing all of
+   insn_info->{defs,uses,eq_uses,mw_hardregs} and
+   bb_info->artificial_{defs,uses}.  */
+static struct obstack df_refs_obstack;
+
+/* Keep the obstack initialised (only 4k overhead) and use this pointer to
+   actually empty it fast.  */
+static void *df_first_refs_obj;
+static int df_refs_obstack_is_init;
+
+
 typedef struct df_scan_bb_info *df_scan_bb_info_t;
 
 
@@ -193,36 +204,13 @@ df_scan_free_internal (void)
 {
   struct df_scan_problem_data *problem_data
 = (struct df_scan_problem_data *) df_scan->problem_data;
-  unsigned int i;
-  basic_block bb;
 
-  /* The vectors that hold the refs are not pool allocated because
- they come in many sizes.  This makes them impossible to delete
- all at once.  */
-  for (i = 0; i < DF_INSN_SIZE(); i++)
-{
-  struct df_insn_info *insn_info = DF_INSN_UID_GET(i);
-  /* Skip the insns that have no insn_info or have been
-deleted.  */
-  if (insn_info)
-   {
- df_scan_free_ref_vec (insn_info->defs);
- df_scan_free_ref_vec (insn_info->uses);
- df_scan_free_ref_vec (insn_info->eq_uses);
- df_scan_free_mws_vec (insn_info->mw_hardregs);
-   }
-}
-
-  FOR_ALL_BB (bb)
-{
-  unsigned int bb_index = bb->index;
-  struct df_scan_bb_info *bb_info = df_scan_get_bb_info (bb_index);
-  if (bb_info)
-   {
- df_scan_free_ref_vec (bb_info->artificial_defs);
- df_scan_free_ref_vec (bb_info->artificial_uses);
-   }
-}
+  /* Delete at once all vectors that hold the refs (all of
+ insn_info->{defs,uses,eq_uses,mw_hardregs} and
+ bb_info->artificial_{defs,uses}) but keep the obstack initialised, so
+ that it's ready for more BBs.  */
+  obstack_free (&df_refs_obstack, df_first_refs_obj);
+  df_first_refs_obj = NULL;
 
   free (df->def_info.refs);
   free (df->def_info.begin);
@@ -364,6 +352,14 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
   bb_info->artificial_uses = NULL;
 }
 
+  if (! df_refs_obstack_is_init)
+{
+  obstack_init (&df_refs_obstack);
+  df_refs_obstack_is_init = 1;
+}
+  gcc_assert (df_first_refs_obj == NULL);
+  df_first_refs_obj = XOBNEW (&df_refs_obstack, void *);
+
   bitmap_initialize (&df->hardware_regs_used, &problem_data->reg_bitmaps);
   bitmap_initialize (&df->regular_block_artificial_uses, 
&problem_data->reg_bitmaps);
   bitmap_initialize (&df->eh_block_artificial_uses, 
&problem_data->reg_bitmaps);
@@ -791,9 +787,15 @@ df_install_ref_incremental (df_ref ref)
 }
 
   ref_rec = *ref_rec_ptr;
+  /* fprintf (stderr, "count:%d ref_rec:%p\n", count, ref_rec); */
   if (count)
 {
-  ref_rec = XRESIZEVEC (df_ref, ref_rec, count+2);
+  /* Impossible to actually know if ref_rec was malloc'ed or obstack'ed,
+     so we might have a leak here!  */
+  df_ref *ref_rec2 = XOBNEWVEC (&df_refs_obstack, df_ref, count+2);
+  memcpy (ref_rec2, ref_rec, count*sizeof(df_ref));
+  /* free (ref_rec); */
+  ref_rec = ref_rec2;
   *ref_rec_ptr = ref_rec;
   ref_rec[count] = ref;
   ref_rec[count+1] = NULL;
@@ -801,7 +803,7 @@ df_install_ref_incremental (df_ref ref)
 }
   else
 {
-  df_ref *ref_rec = XNEWVEC (df_ref, 2);
+  ref_rec = XOBNEWVEC (&df_refs_obstack, df_ref, 2);
   ref_rec[0] = ref;
   ref_rec[1] = NULL;
   *ref_rec_ptr = ref_rec;
@@ -1070,8 +1072,9 @@ df_ref_chain_delete (df_ref *ref_rec)
 
   /* If the list is empty, it has a special shared element that is not
  to be deleted.  */
-  if (*start)
-free (start);
+  /* if (*start) */
+  /*   free (start); */

Re: [graphds.h] Allocate graph from obstack

2012-08-20 Thread Dimitrios Apostolou

Hi Paolo,

On Mon, 20 Aug 2012, Paolo Bonzini wrote:

On 19/08/2012 18:55, Richard Guenther wrote:

Initially I had one obstack per struct graph, which was better than using
XNEW for every edge, but obstack_init() was still called too frequently from
new_graph().

So in this iteration of the patch the obstack is a static global, initialised
once and never freed. Please advise on whether this is acceptable, and also
on where I should initialise the obstack once, to avoid checking whether it's
NULL on every use.

Minor speed gains (a couple of ms), tested with a pre-C++-conversion
snapshot; I'll retest soon and post an update.

Either an obstack per graph or the ability to specify an obstack for allocation.
A global obstack with global lifetime is bad IMHO.


Dimitrios's patch has a per-file obstack with per-pass lifetime


Notice that I never call XOBDELETE with a NULL argument; I only free the 
first object, which means that the 4KB-per-obstack overhead is retained 
until program exit. I did that to save on malloc calls.



Thanks,
Dimitris



Re: alloc_pool for tree-ssa-pre.c:phi_translate_table

2012-08-20 Thread Dimitrios Apostolou

On Mon, 20 Aug 2012, Jakub Jelinek wrote:


I'd note for all the recently posted patches from Dimitrios, the gcc/
prefix doesn't belong to the ChangeLog entry pathnames, the filenames are
relative to the corresponding ChangeLog location.



Ah sorry, that's what the mklog utility generates; it seems I got carried 
away over-using it. :-)



Dimitris



Re: enlarge hot allocation pools

2012-08-19 Thread Dimitrios Apostolou

Hi Steven,

On Sun, 19 Aug 2012, Steven Bosscher wrote:

On Sun, Aug 19, 2012 at 8:31 PM, Dimitrios Apostolou  wrote:

Hello,

2012-08-19  Dimitrios Apostolou  

* gcc/cselib.c (cselib_init): Make allocation pools larger, since
they are hot and show up in the profiler expanding often.
* gcc/df-problems.c (df_chain_alloc): Same.
* gcc/et-forest.c (et_new_occ, et_new_tree): Same.
* gcc/tree-ssa-pre.c (init_pre): Same.


These allocation pools are the ones that I've noticed calling malloc() too
often in order to expand their size.


I don't like the way these pools are sized with a seemingly arbitrary
initial size. They're there to hold something, and they grow because
there are "more somethings" than initially guessed. I think you should
look at what the pools hold and choose an initial size based on some
representative measure. E.g. if a pool holds some kind of expressions,
then you should be able to make an initial guess of the size of the
pool based on the number of pseudos or ssa names. Ideally you'd simply
"derive" the initial pool size by a regression analysis of the final
pool size and some potential representative measures (# of regs, basic
blocks, edges, whatever).


Some time ago I tried changing the pool allocator's growth strategy from 
linear to exponential, doubling the size of the pool chunk every time 
another chunk was needed. It was for the exact same reason, to adapt the 
pool size to its contents. Unfortunately I remember it didn't make a 
difference, so I dropped it and only manually enlarged the important pools, 
which made a small difference. I'll now try deducing size information at 
runtime; thanks for the tip.



Dimitris



enlarge hot allocation pools

2012-08-19 Thread Dimitrios Apostolou

Hello,

2012-08-19  Dimitrios Apostolou  

* gcc/cselib.c (cselib_init): Make allocation pools larger, since
they are hot and show up in the profiler expanding often.
* gcc/df-problems.c (df_chain_alloc): Same.
* gcc/et-forest.c (et_new_occ, et_new_tree): Same.
* gcc/tree-ssa-pre.c (init_pre): Same.


These allocation pools are the ones that I've noticed calling malloc() too 
often in order to expand their size. Also, I don't like the way the pools 
are created in et_new_{occ,tree}(), but I couldn't find a single point to 
move the initialisation to either. Any ideas on this one?



Thanks,
Dimitris

2012-08-19  Dimitrios Apostolou

* gcc/cselib.c (cselib_init): Make allocation pools larger, since
they are hot and show up in the profiler expanding often.
* gcc/df-problems.c (df_chain_alloc): Same.
* gcc/et-forest.c (et_new_occ, et_new_tree): Same.
* gcc/tree-ssa-pre.c (init_pre): Same.


=== modified file 'gcc/cselib.c'
--- gcc/cselib.c2012-08-02 22:39:57 +
+++ gcc/cselib.c2012-08-19 15:13:28 +
@@ -2659,12 +2659,12 @@ void
 cselib_init (int record_what)
 {
   elt_list_pool = create_alloc_pool ("elt_list",
-sizeof (struct elt_list), 10);
+sizeof (struct elt_list), 128);
   elt_loc_list_pool = create_alloc_pool ("elt_loc_list",
-sizeof (struct elt_loc_list), 10);
+sizeof (struct elt_loc_list), 128);
   cselib_val_pool = create_alloc_pool ("cselib_val_list",
-  sizeof (cselib_val), 10);
-  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 100);
+  sizeof (cselib_val), 128);
+  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 128);
   cselib_record_memory = record_what & CSELIB_RECORD_MEMORY;
   cselib_preserve_constants = record_what & CSELIB_PRESERVE_CONSTANTS;
   cselib_any_perm_equivs = false;

=== modified file 'gcc/df-problems.c'
--- gcc/df-problems.c   2012-08-17 09:42:06 +
+++ gcc/df-problems.c   2012-08-19 15:29:37 +
@@ -1993,7 +1993,7 @@ df_chain_alloc (bitmap all_blocks ATTRIB
 {
   df_chain_remove_problem ();
   df_chain->block_pool = create_alloc_pool ("df_chain_block pool",
-sizeof (struct df_link), 50);
+sizeof (struct df_link), 128);
   df_chain->optional_p = true;
 }
 

=== modified file 'gcc/et-forest.c'
--- gcc/et-forest.c 2012-05-31 16:43:31 +
+++ gcc/et-forest.c 2012-08-19 15:50:25 +
@@ -444,8 +444,8 @@ et_new_occ (struct et_node *node)
 {
   struct et_occ *nw;
 
-  if (!et_occurrences)
-et_occurrences = create_alloc_pool ("et_occ pool", sizeof (struct et_occ), 
300);
+  if (!et_occurrences)
+et_occurrences = create_alloc_pool ("et_occ pool", sizeof (struct et_occ), 
1024);
   nw = (struct et_occ *) pool_alloc (et_occurrences);
 
   nw->of = node;
@@ -467,8 +467,8 @@ et_new_tree (void *data)
 {
   struct et_node *nw;
 
-  if (!et_nodes)
-et_nodes = create_alloc_pool ("et_node pool", sizeof (struct et_node), 
300);
+  if (!et_nodes)
+et_nodes = create_alloc_pool ("et_node pool", sizeof (struct et_node), 
512);
   nw = (struct et_node *) pool_alloc (et_nodes);
 
   nw->data = data;

=== modified file 'gcc/tree-ssa-pre.c'
--- gcc/tree-ssa-pre.c  2012-08-18 06:31:26 +
+++ gcc/tree-ssa-pre.c  2012-08-19 15:28:21 +
@@ -4812,9 +4812,9 @@ init_pre (bool do_fre)
   phi_translate_table.create (5110);
   expression_to_id.create (num_ssa_names * 3);
   bitmap_set_pool = create_alloc_pool ("Bitmap sets",
-  sizeof (struct bitmap_set), 30);
+  sizeof (struct bitmap_set), 128);
   pre_expr_pool = create_alloc_pool ("pre_expr nodes",
-sizeof (struct pre_expr_d), 30);
+sizeof (struct pre_expr_d), 32);
   FOR_ALL_BB (bb)
 {
   EXP_GEN (bb) = bitmap_set_new ();



alloc_pool for tree-ssa-pre.c:phi_translate_table

2012-08-19 Thread Dimitrios Apostolou



2012-08-19  Dimitrios Apostolou  

* gcc/tree-ssa-pre.c (phi_translate_pool): New static global
alloc_pool, used for allocating struct expr_pred_trans_d for
phi_translate_table.
(phi_trans_add, init_pre, fini_pre): Use it; avoids thousands of
malloc() and free() calls.


This avoids lots of malloc/free calls and slow iteration during the 
numerous htab_delete() calls in fini_pre(); the pattern is sketched below. 
Tested on a pre-C++-conversion snapshot; I will update the numbers as soon 
as a post-conversion one is available.



Thanks,
Dimitris

2012-08-19  Dimitrios Apostolou

* gcc/tree-ssa-pre.c (phi_translate_pool): New static global
alloc_pool, used for allocating struct expr_pred_trans_d for
phi_translate_table.
(phi_trans_add, init_pre, fini_pre): Use it; avoids thousands of
malloc() and free() calls.

=== modified file 'gcc/tree-ssa-pre.c'
--- gcc/tree-ssa-pre.c  2012-08-17 08:03:54 +
+++ gcc/tree-ssa-pre.c  2012-08-18 16:43:02 +
@@ -486,7 +486,7 @@ static bitmap need_ab_cleanup;
 /* A three tuple {e, pred, v} used to cache phi translations in the
phi_translate_table.  */
 
-typedef struct expr_pred_trans_d : typed_free_remove
+typedef struct expr_pred_trans_d : typed_noop_remove
 {
   /* The expression.  */
   pre_expr e;
@@ -508,6 +508,12 @@ typedef struct expr_pred_trans_d : typed
 } *expr_pred_trans_t;
 typedef const struct expr_pred_trans_d *const_expr_pred_trans_t;
 
+/* Pool of memory for the above.  */
+
+static alloc_pool phi_translate_pool;
+
+/* Return the hash value for a phi translation table entry.  */
+
 inline hashval_t
 expr_pred_trans_d::hash (const expr_pred_trans_d *e)
 {
@@ -561,7 +567,8 @@ static inline void
 phi_trans_add (pre_expr e, pre_expr v, basic_block pred)
 {
   expr_pred_trans_t *slot;
-  expr_pred_trans_t new_pair = XNEW (struct expr_pred_trans_d);
+  expr_pred_trans_t new_pair =
+(expr_pred_trans_t) pool_alloc (phi_translate_pool);
   new_pair->e = e;
   new_pair->pred = pred;
   new_pair->v = v;
@@ -570,7 +577,8 @@ phi_trans_add (pre_expr e, pre_expr v, b
 
   slot = phi_translate_table.find_slot_with_hash (new_pair,
   new_pair->hashcode, INSERT);
-  free (*slot);
+  if (*slot)
+pool_free (phi_translate_pool, *slot);
   *slot = new_pair;
 }
 
@@ -4798,6 +4806,9 @@ init_pre (bool do_fre)
   calculate_dominance_info (CDI_DOMINATORS);
 
   bitmap_obstack_initialize (&grand_bitmap_obstack);
+  phi_translate_pool = create_alloc_pool ("phi_translate_table",
+ sizeof (struct expr_pred_trans_d),
+ 512);
   phi_translate_table.create (5110);
   expression_to_id.create (num_ssa_names * 3);
   bitmap_set_pool = create_alloc_pool ("Bitmap sets",
@@ -4832,6 +4843,7 @@ fini_pre (bool do_fre)
   free_alloc_pool (bitmap_set_pool);
   free_alloc_pool (pre_expr_pool);
   phi_translate_table.dispose ();
+  free_alloc_pool (phi_translate_pool);
   expression_to_id.dispose ();
   VEC_free (unsigned, heap, name_to_id);
 



obstack for equiv_class_label, more vectors on stack

2012-08-19 Thread Dimitrios Apostolou




2012-08-19  Dimitrios Apostolou  

* gcc/tree-ssa-structalias.c: Change declaration of ce_s type
vector from heap to stack. Update all relevant functions to
VEC_alloc() such vector upfront with enough (32) slots so that
malloc() calls are mostly avoided.
(equiv_class_obstack): New global static obstack for allocating
struct equiv_class_label.
(equiv_class_add): Use the above instead of malloc().
(perform_var_substitution): Don't allow entries of
location_equiv_class_table to be freed, because they are free'd...
(free_var_substitution_info): ...here by freeing the obstack.
* gcc/vecir.h: Add declaration of stack allocated tree type vector.
* gcc/tree-ssa-sccvn.c (vn_phi_insert, print_scc, compare_ops)
(sort_scc, copy_reference, extract_and_process_scc_for_name): Use
it, instead of heap allocated vector.



Not all of these places are hot in the profiler, but since I changed a few 
I had to change them all, to completely remove the heap-allocated ce_s 
vector; the usage pattern is sketched below. Passes tests and offers small 
gains (a couple of ms), but expect a more thorough report and testing 
against a new snapshot by next week.



Thanks,
Dimitris

2012-08-19  Dimitrios Apostolou

* gcc/tree-ssa-structalias.c: Change declaration of ce_s type
vector from heap to stack. Update all relevant functions to
VEC_alloc() such vector upfront with enough (32) slots so that
malloc() calls are mostly avoided.
(equiv_class_obstack): New global static obstack for allocating
struct equiv_class_label.
(equiv_class_add): Use the above instead of malloc().
(perform_var_substitution): Don't allow entries of
location_equiv_class_table to be freed, because they are free'd...
(free_var_substitution_info): ...here by freeing the obstack.
* gcc/vecir.h: Add declaration of stack allocated tree type vector.
* gcc/tree-ssa-sccvn.c (vn_phi_insert, print_scc, compare_ops)
(sort_scc, copy_reference, extract_and_process_scc_for_name): Use
it, instead of heap allocated vector.


=== modified file 'gcc/tree-ssa-structalias.c'
--- gcc/tree-ssa-structalias.c  2012-08-16 14:27:51 +
+++ gcc/tree-ssa-structalias.c  2012-08-18 16:43:02 +
@@ -472,11 +472,14 @@ struct constraint_expr
 
 typedef struct constraint_expr ce_s;
 DEF_VEC_O(ce_s);
-DEF_VEC_ALLOC_O(ce_s, heap);
-static void get_constraint_for_1 (tree, VEC(ce_s, heap) **, bool, bool);
-static void get_constraint_for (tree, VEC(ce_s, heap) **);
-static void get_constraint_for_rhs (tree, VEC(ce_s, heap) **);
-static void do_deref (VEC (ce_s, heap) **);
+DEF_VEC_ALLOC_O_STACK(ce_s);
+#define VEC_ce_s_stack_alloc(alloc) \
+  VEC_stack_alloc (ce_s, alloc)
+
+static void get_constraint_for_1 (tree, VEC(ce_s, stack) **, bool, bool);
+static void get_constraint_for (tree, VEC(ce_s, stack) **);
+static void get_constraint_for_rhs (tree, VEC(ce_s, stack) **);
+static void do_deref (VEC (ce_s, stack) **);
 
 /* Our set constraints are made up of two constraint expressions, one
LHS, and one RHS.
@@ -1893,6 +1896,9 @@ static htab_t pointer_equiv_class_table;
classes.  */
 static htab_t location_equiv_class_table;
 
+/* Pool of memory for storing the above.  */
+static struct obstack equiv_class_obstack;
+
 /* Hash function for a equiv_class_label_t */
 
 static hashval_t
@@ -1942,7 +1948,7 @@ equiv_class_add (htab_t table, unsigned
 bitmap labels)
 {
   void **slot;
-  equiv_class_label_t ecl = XNEW (struct equiv_class_label);
+  equiv_class_label_t ecl = XOBNEW (&equiv_class_obstack, struct 
equiv_class_label);
 
   ecl->labels = labels;
   ecl->equivalence_class = equivalence_class;
@@ -2153,10 +2159,12 @@ perform_var_substitution (constraint_gra
   struct scc_info *si = init_scc_info (size);
 
   bitmap_obstack_initialize (&iteration_obstack);
+  gcc_obstack_init (&equiv_class_obstack);
+  /* NULL free function, we'll free the whole pool at the end of the pass. */
   pointer_equiv_class_table = htab_create (511, equiv_class_label_hash,
-  equiv_class_label_eq, free);
+  equiv_class_label_eq, NULL);
   location_equiv_class_table = htab_create (511, equiv_class_label_hash,
-   equiv_class_label_eq, free);
+   equiv_class_label_eq, NULL);
   pointer_equiv_class = 1;
   location_equiv_class = 1;
 
@@ -2263,6 +2271,7 @@ free_var_substitution_info (struct scc_i
   sbitmap_free (graph->direct_nodes);
   htab_delete (pointer_equiv_class_table);
   htab_delete (location_equiv_class_table);
+  obstack_free (&equiv_class_obstack, NULL);
   bitmap_obstack_release (&iteration_obstack);
 }
 
@@ -2741,7 +2750,7 @@ new_scalar_tmp_constraint_exp (const cha
If address_p is t

more malloc mitigation

2012-08-18 Thread Dimitrios Apostolou

Hi,

2012-08-18  Dimitrios Apostolou  

* gcc/tree-ssa-sccvn.c (struct vn_tables_s): Add obstack_start to
mark the first allocated object on the obstack.
(process_scc, allocate_vn_table): Use it.
(init_scc_vn): Don't truncate shared_lookup_references vector.
(prealloc_ref_ops_vec): New static vector.
(create_reference_ops_from_ref, create_reference_ops_from_call):
Use it instead of locally allocated one.
(free_scc_vn): Truncate vectors to 0, but keep them at most 16 slots
long.


I'd have used a stack vector in create_reference_ops_from_{ref,call} 
instead of a static one on the heap, but the functions return heap 
vectors. Can I just cast a stack vector to a heap one?


Also, is it acceptable to leak 4K by never freeing the whole obstack?


Tested on x86 with old snapshot, will post update soon after testing with 
more recent one.



Thanks,
Dimitris

2012-08-18  Dimitrios Apostolou

* gcc/tree-ssa-sccvn.c (struct vn_tables_s): Add obstack_start to
mark the first allocated object on the obstack.
(process_scc, allocate_vn_table): Use it.
(init_scc_vn): Don't truncate shared_lookup_references vector.
(prealloc_ref_ops_vec): New static vector.
(create_reference_ops_from_ref, create_reference_ops_from_call):
Use it instead of locally allocated one.
(free_scc_vn): Truncate vectors to 0, but keep them at most 16 slots
long.


=== modified file 'gcc/tree-ssa-sccvn.c'
--- gcc/tree-ssa-sccvn.c2012-08-16 14:27:51 +
+++ gcc/tree-ssa-sccvn.c2012-08-18 16:43:02 +
@@ -105,6 +105,7 @@ typedef struct vn_tables_s
   htab_t phis;
   htab_t references;
   struct obstack nary_obstack;
+  long *obstack_start;
   alloc_pool phis_pool;
   alloc_pool references_pool;
 } *vn_tables_t;
@@ -973,16 +974,21 @@ copy_reference_ops_from_call (gimple cal
 }
 }
 
+/* Preallocated vector with sufficient space - see free_scc_vn().  Helps to
+   avoid reallocations and wasted space, since when we VEC_copy() it, only
+   its exact length is duplicated.  */
+
+static VEC(vn_reference_op_s, heap) *prealloc_ref_ops_vec;
+
 /* Create a vector of vn_reference_op_s structures from REF, a
-   REFERENCE_CLASS_P tree.  The vector is not shared. */
+   REFERENCE_CLASS_P tree.  The vector is not shared.  */
 
 static VEC(vn_reference_op_s, heap) *
 create_reference_ops_from_ref (tree ref)
 {
-  VEC (vn_reference_op_s, heap) *result = NULL;
-
-  copy_reference_ops_from_ref (ref, &result);
-  return result;
+  VEC_truncate (vn_reference_op_s, prealloc_ref_ops_vec, 0);
+  copy_reference_ops_from_ref (ref, &prealloc_ref_ops_vec);
+  return VEC_copy (vn_reference_op_s, heap, prealloc_ref_ops_vec);
 }
 
 /* Create a vector of vn_reference_op_s structures from CALL, a
@@ -991,10 +997,9 @@ create_reference_ops_from_ref (tree ref)
 static VEC(vn_reference_op_s, heap) *
 create_reference_ops_from_call (gimple call)
 {
-  VEC (vn_reference_op_s, heap) *result = NULL;
-
-  copy_reference_ops_from_call (call, &result);
-  return result;
+  VEC_truncate (vn_reference_op_s, prealloc_ref_ops_vec, 0);
+  copy_reference_ops_from_call (call, &prealloc_ref_ops_vec);
+  return VEC_copy (vn_reference_op_s, heap, prealloc_ref_ops_vec);
 }
 
 /* Fold *& at position *I_P in a vn_reference_op_s vector *OPS.  Updates
@@ -3610,8 +3615,9 @@ process_scc (VEC (tree, heap) *scc)
   htab_empty (optimistic_info->nary);
   htab_empty (optimistic_info->phis);
   htab_empty (optimistic_info->references);
-  obstack_free (&optimistic_info->nary_obstack, NULL);
-  gcc_obstack_init (&optimistic_info->nary_obstack);
+  /* Quick way to empty the obstack but keep it ready for reuse.  */
+  obstack_free (&optimistic_info->nary_obstack,
+   optimistic_info->obstack_start);
   empty_alloc_pool (optimistic_info->phis_pool);
   empty_alloc_pool (optimistic_info->references_pool);
   FOR_EACH_VEC_ELT (tree, scc, i, var)
@@ -3796,6 +3802,7 @@ allocate_vn_table (vn_tables_t table)
   free_reference);
 
   gcc_obstack_init (&table->nary_obstack);
+  table->obstack_start = XOBNEW (&table->nary_obstack, long);
   table->phis_pool = create_alloc_pool ("VN phis",
sizeof (struct vn_phi_s),
30);
@@ -3842,7 +3849,6 @@ init_scc_vn (void)
   gcc_obstack_init (&vn_ssa_aux_obstack);
 
   shared_lookup_phiargs = NULL;
-  shared_lookup_references = NULL;
   rpo_numbers = XNEWVEC (int, last_basic_block);
   rpo_numbers_temp = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
   pre_and_rev_post_order_compute (NULL, rpo_numbers_temp, false);
@@ -3887,9 +3893,19 @@ free_scc_vn (void)
   htab_delete (constant_to_value_id);
   BITMAP_FREE (constant_value_ids);
   VEC_fr

[graphds.h] Allocate graph from obstack

2012-08-18 Thread Dimitrios Apostolou
Initially I had one obstack per struct graph, which was better than using
XNEW for every edge, but obstack_init() was still called too frequently from
new_graph().

So in this iteration of the patch the obstack is a static global, initialised
once and never freed. Please advise on whether this is acceptable, and also
on where I should initialise the obstack once, to avoid checking whether it's
NULL on every use.

Minor speed gains (a couple of ms), tested with a pre-C++-conversion
snapshot; I'll retest soon and post an update.



Thanks,
Dimitris
2012-08-18  Dimitrios Apostolou  

* gcc/graphds.h: Guarded the header file. Included libiberty.h and
obstack.h.
(graph_obstack): New static obstack to allocate graphs and
graph_edges from.
(new_graph, add_edge): Add obstack argument.
(free_graph): Remove.
* gcc/graphds.c (new_graph): Allocate graph, vertices from obstack.
(add_edge): Allocate edge from obstack.
(free_graph): Remove, all graph data is now freed when freeing the
object in the obstack.
* gcc/tree-data-ref.h (free_rdg): Same.
(build_empty_rdg): Add obstack argument.
* gcc/cfgloopanal.c (mark_irreducible_loops): Initialise obstack
if it's not, use it for graph allocations.
* gcc/dominance.c (iterate_fix_dominators): Same.
* gcc/graphite-sese-to-poly.c (build_alias_set_optimal_p)
(build_base_obj_set_for_drs): Same.
* gcc/tree-data-ref.c (rdg_vertex_for_stmt)
(create_rdg_edge_for_ddr, create_rdg_edges_for_scalar)
(create_rdg_edges, create_rdg_vertices)
(known_dependences_p,build_empty_rdg, build_rdg, free_rdg): Same.
* gcc/tree-loop-distribution.c (distribute_loop): Same.


=== modified file 'gcc/cfgloopanal.c'
--- gcc/cfgloopanal.c   2012-05-31 20:19:00 +
+++ gcc/cfgloopanal.c   2012-08-18 16:43:02 +
@@ -93,8 +93,14 @@ mark_irreducible_loops (void)
e->flags &= ~EDGE_IRREDUCIBLE_LOOP;
 }
 
+  if (graph_obstack == NULL)
+{
+  graph_obstack = XNEW (struct obstack);
+  obstack_init (graph_obstack);
+}
+
   /* Create the edge lists.  */
-  g = new_graph (last_basic_block + num);
+  g = new_graph (last_basic_block + num, graph_obstack);
 
   FOR_BB_BETWEEN (act, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, next_bb)
 FOR_EACH_EDGE (e, ei, act->succs)
@@ -134,7 +140,7 @@ mark_irreducible_loops (void)
src = LOOP_REPR (cloop);
  }
 
-   add_edge (g, src, dest)->data = e;
+   add_edge (g, src, dest, graph_obstack)->data = e;
   }
 
   /* Find the strongly connected components.  */
@@ -161,7 +167,8 @@ mark_irreducible_loops (void)
  real->src->flags |= BB_IRREDUCIBLE_LOOP;
   }
 
-  free_graph (g);
+  /* Destroy all data allocated for the graph.  */
+  XOBDELETE (graph_obstack, g);
 
   loops_state_set (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS);
   return irred_loop_found;

=== modified file 'gcc/dominance.c'
--- gcc/dominance.c 2012-08-14 15:59:41 +
+++ gcc/dominance.c 2012-08-18 16:43:02 +
@@ -1341,7 +1341,13 @@ iterate_fix_dominators (enum cdi_directi
 }
   *pointer_map_insert (map, ENTRY_BLOCK_PTR) = (void *) (size_t) n;
 
-  g = new_graph (n + 1);
+  if (graph_obstack == NULL)
+{
+  graph_obstack = XNEW (struct obstack);
+  obstack_init (graph_obstack);
+}
+
+  g = new_graph (n + 1, graph_obstack);
   for (y = 0; y < g->n_vertices; y++)
 g->vertices[y].data = BITMAP_ALLOC (NULL);
   FOR_EACH_VEC_ELT (basic_block, bbs, i, bb)
@@ -1358,7 +1364,7 @@ iterate_fix_dominators (enum cdi_directi
  if (!bitmap_set_bit ((bitmap) g->vertices[dom_i].data, i))
continue;
 
- add_edge (g, dom_i, i);
+ add_edge (g, dom_i, i, graph_obstack);
}
 }
   for (y = 0; y < g->n_vertices; y++)
@@ -1392,7 +1398,8 @@ iterate_fix_dominators (enum cdi_directi
   free (brother);
   free (parent);
 
-  free_graph (g);
+  /* Destroy all data allocated for the graph.  */
+  XOBDELETE (graph_obstack, g);
 }
 
 void

=== modified file 'gcc/graphds.c'
--- gcc/graphds.c   2009-11-25 10:55:54 +
+++ gcc/graphds.c   2012-08-18 16:43:02 +
@@ -53,28 +53,29 @@ dump_graph (FILE *f, struct graph *g)
 }
 }
 
-/* Creates a new graph with N_VERTICES vertices.  */
+/* Creates a new graph with N_VERTICES vertices from obstack O.  */
 
 struct graph *
-new_graph (int n_vertices)
+new_graph (int n_vertices, struct obstack *o)
 {
-  struct graph *g = XNEW (struct graph);
+  struct graph *g = XOBNEW (o, struct graph);
 
   g->n_vertices = n_vertices;
-  g->vertices = XCNEWVEC (struct vertex, n_vertices);
+  g->vertices = XOBNEWVEC (o, struct vertex, n_vertices);
+  memset (g->vertices, 0, n_vertices * sizeof (*g->vertices));
 
   return g;
 }
 
-/* Adds an edge from F to T to graph G.  The new edge is returned.  */
+/

Re: Speedups/Cleanups: End of GSOC patch collection

2012-08-18 Thread Dimitrios Apostolou
Hi, these are the type-safe obstack macros that I use in other patches. As 
requested, I also changed XOBFINISH to return (T *) and changed all 
callers; it felt a little strange changing (char *) to char. I also 
replaced all _obstack_begin() calls in libcpp with obstack_init(), which is 
the documented way of initialising an obstack.



2012-08-18  Dimitrios Apostolou  

* include/libiberty.h (XOBDELETE, XOBGROW, XOBGROWVEC, XOBSHRINK)
(XOBSHRINKVEC, XOBFINISH): New type-safe macros for obstack
operations.
(XOBFINISH): Changed to return (T *) instead of T. All callers
updated.
* libcpp/include/symtab.h (obstack_chunk_alloc)
(obstack_chunk_free): Define, so that obstack_init() can be used.
* libcpp/internal.h (struct cset_converter): Same.
* libcpp/files.c (_cpp_init_files): Changed _obstack_begin() to
obstack_init().
* libcpp/identifiers.c (_cpp_init_hashtable): Same.
* libcpp/symtab.c (ht_create): Same.
* libcpp/init.c (cpp_create_reader): Same.



Thanks,
Dimitris
2012-08-18  Dimitrios Apostolou  

* include/libiberty.h (XOBDELETE, XOBGROW, XOBGROWVEC, XOBSHRINK)
(XOBSHRINKVEC, XOBFINISH): New type-safe macros for obstack
operations.
(XOBFINISH): Changed to return (T *) instead of T. All callers
updated.
* libcpp/include/symtab.h (obstack_chunk_alloc)
(obstack_chunk_free): Define, so that obstack_init() can be used.
* libcpp/internal.h (struct cset_converter): Same.
* libcpp/files.c (_cpp_init_files): Changed _obstack_begin() to
obstack_init().
* libcpp/identifiers.c (_cpp_init_hashtable): Same.
* libcpp/symtab.c (ht_create): Same.
* libcpp/init.c (cpp_create_reader): Same.


=== modified file 'gcc/c-family/c-lex.c'
--- gcc/c-family/c-lex.c2012-06-19 19:55:33 +
+++ gcc/c-family/c-lex.c2012-08-18 13:42:37 +
@@ -1037,7 +1037,7 @@ lex_string (const cpp_token *tok, tree *
   /* We have read one more token than we want.  */
   _cpp_backup_tokens (parse_in, 1);
   if (concats)
-strs = XOBFINISH (&str_ob, cpp_string *);
+strs = XOBFINISH (&str_ob, cpp_string);
 
   if (concats && !objc_string && !in_system_header)
 warning (OPT_Wtraditional,

=== modified file 'gcc/collect2.c'
--- gcc/collect2.c  2012-05-31 20:19:00 +
+++ gcc/collect2.c  2012-08-18 13:42:51 +
@@ -514,7 +514,7 @@ extract_string (const char **pp)
 
   obstack_1grow (&temporary_obstack, '\0');
   *pp = p;
-  return XOBFINISH (&temporary_obstack, char *);
+  return XOBFINISH (&temporary_obstack, char);
 }
 
 void
@@ -535,7 +535,7 @@ dump_ld_file (const char *name, FILE *to
  const char *word, *p;
  char *result;
  obstack_1grow (&temporary_obstack, '\0');
- word = XOBFINISH (&temporary_obstack, const char *);
+ word = XOBFINISH (&temporary_obstack, const char);
 
  if (*word == '.')
++word, putc ('.', to);
@@ -943,7 +943,7 @@ maybe_run_lto_and_relink (char **lto_ld_
 
lto_o_files = XNEWVEC (char *, num_files + 1);
lto_o_files[num_files] = NULL;
-   start = XOBFINISH (&temporary_obstack, char *);
+   start = XOBFINISH (&temporary_obstack, char);
for (i = 0; i < num_files; ++i)
  {
end = start;

=== modified file 'gcc/dbxout.c'
--- gcc/dbxout.c2012-06-24 17:58:46 +
+++ gcc/dbxout.c2012-08-18 13:48:19 +
@@ -864,7 +864,7 @@ dbxout_finish_complex_stabs (tree sym, s
   obstack_1grow (&stabstr_ob, '\0');
 
   len = obstack_object_size (&stabstr_ob);
-  chunk = str = XOBFINISH (&stabstr_ob, char *);
+  chunk = str = XOBFINISH (&stabstr_ob, char);
 
   /* Within the buffer are a sequence of NUL-separated strings,
 each of which is to be written out as a separate stab
@@ -897,7 +897,7 @@ dbxout_finish_complex_stabs (tree sym, s
 comma than to do a two-character fputs.  */
   obstack_grow (&stabstr_ob, "\",", 2);
   len = obstack_object_size (&stabstr_ob);
-  str = XOBFINISH (&stabstr_ob, char *);
+  str = XOBFINISH (&stabstr_ob, char);
 
   fwrite (str, 1, len, asm_out_file);
   DBX_FINISH_STABS (sym, code, line, addr, label, number);

=== modified file 'gcc/gcc.c'
--- gcc/gcc.c   2012-08-15 01:56:07 +
+++ gcc/gcc.c   2012-08-18 13:47:06 +
@@ -1435,7 +1435,7 @@ init_spec (void)
   }
 
 obstack_1grow (&obstack, '\0');
-libgcc_spec = XOBFINISH (&obstack, const char *);
+libgcc_spec = XOBFINISH (&obstack, const char);
   }
 #endif
 #ifdef USE_AS_TRADITIONAL_FORMAT
@@ -1444,7 +1444,7 @@ init_spec (void)
 static const char tf[] = "--traditional-format ";
 obst

Re: Speedups/Cleanups: End of GSOC patch collection

2012-08-18 Thread Dimitrios Apostolou

2012-08-18 Dimitrios Apostolou 

* dwarf2out.c (output_indirect_string): Use
ASM_OUTPUT_INTERNAL_LABEL instead of slower ASM_OUTPUT_LABEL.
* varasm.c (assemble_string): Don't break string in chunks, this
is assembler specific and already done in most versions of
ASM_OUTPUT_ASCII.


I think there is no correctness issue regarding output_indirect_string(), 
since .debug_str labels are always compiler-generated, and this gives a 
small speedup with -g3 debug info.


And regarding assemble_string(), I find it superfluous to break the string
into chunks in two places. I found only the following versions of 
ASM_OUTPUT_ASCII not caring about the string length; I guess those 
assemblers don't have a limit:


arm.c, vmsdbgout.c: ASM_OUTPUT_ASCII
picochip.c: picochip_output_ascii()
pdp11.c: output_ascii()


Thanks,
Dimitris

=== modified file 'gcc/varasm.c'
--- gcc/varasm.c2012-08-15 01:56:07 +
+++ gcc/varasm.c2012-08-16 06:12:28 +
@@ -1726,22 +1726,7 @@ assemble_align (int align)
 void
 assemble_string (const char *p, int size)
 {
-  int pos = 0;
-  int maximum = 2000;
-
-  /* If the string is very long, split it up.  */
-
-  while (pos < size)
-{
-  int thissize = size - pos;
-  if (thissize > maximum)
-   thissize = maximum;
-
-  ASM_OUTPUT_ASCII (asm_out_file, p, thissize);
-
-  pos += thissize;
-  p += thissize;
-}
+  ASM_OUTPUT_ASCII (asm_out_file, p, size);
 }
=== modified file 'gcc/dwarf2out.c'
--- gcc/dwarf2out.c 2012-08-15 01:56:07 +
+++ gcc/dwarf2out.c 2012-08-16 06:19:19 +
@@ -20887,7 +20887,7 @@ output_indirect_string (void **h, void *
   if (node->form == DW_FORM_strp)
 {
   switch_to_section (debug_str_section);
-  ASM_OUTPUT_LABEL (asm_out_file, node->label);
+  ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, node->label);
   assemble_string (node->str, strlen (node->str) + 1);
 }


Speedups/Cleanups: End of GSOC patch collection

2012-08-18 Thread Dimitrios Apostolou

Hello list,

for the following couple of days I'll be posting my collection of patches 
under this thread.


Unless otherwise mentioned they've been bootstrapped and tested on x86, but 
with a three-week-old snapshot, that is, pre-C++-conversion. I plan to test 
again next week with the latest snapshot, so please wait for that before 
applying.


Most things are minor changes and tweaks; some are ported patches from 
last year, some are new. I also have some bigger patches that are flawed 
(some don't even compile), either because of difficulties I encountered or 
simply lack of time. I'll try to post what is presentable from them; I'd 
appreciate discussion on whether/how to continue with those.



Thanks,
Dimitris



Re: [PATCH][C++] Save memory and reallocations in name-lookup

2012-08-17 Thread Dimitrios Apostolou

Hi,

On Fri, 17 Aug 2012, Jakub Jelinek wrote:

On Fri, Aug 17, 2012 at 06:41:37AM -0500, Gabriel Dos Reis wrote:

I am however concerned with:


  static void
  store_bindings (tree names, VEC(cxx_saved_binding,gc) **old_bindings)
  {
!   static VEC(tree,heap) *bindings_need_stored = NULL;


I would be more comfortable to see the cache be on per-scope
(e.g. namespace scope) basis as opposed
a blanket global cache stored in a global variable.


It is not any kind of cache.  In theory it could be an automatic vector
pointer variable; it is only used during that function.  The reason it is a
static variable instead is just to avoid constant allocation/deallocation of
the vector; this way, after the first call it will already be allocated
(but, upon entry to store_bindings, it will always be empty).


Why not use a stack vector of sufficient size for most cases, then? I 
usually do something like:


  VEC (tree, stack) *bindings_need_stored = VEC_alloc (tree, stack, 32);
  ...
  VEC_free (tree, stack, bindings_need_stored);

if I've measured that 32 (or whatever) is enough for most cases. I 
don't have a clue about this particular case though.



Dimitris



Re: Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)

2012-08-07 Thread Dimitrios Apostolou

On Tue, 7 Aug 2012, Ian Lance Taylor wrote:


On Tue, Aug 7, 2012 at 2:24 PM, Dimitrios Apostolou  wrote:


BTW I can't find out why ELF_STRING_LIMIT is only 256; GAS seems to support
arbitrary lengths. I'd have to change my code if we ever set it too high (or
even unlimited), since I allocate the buffer on the stack.


See the comment for ELF_STRING_LIMIT in config/elfos.h.  gas is not
the only ELF assembler.


I've seen it, and thought that non-GAS ELF platforms were supposed to 
define it to override elfos.h; I obviously misunderstood. So now I'm 
curious: is this limit there, dominating all platforms, because of the 
worst assembler? :-)



Dimitris


P.S. there is an obvious typo in the comment, STRING_LIMIT should be 
ELF_STRING_LIMIT (I think I introduced the typo last year...)


Re: Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)

2012-08-07 Thread Dimitrios Apostolou
I should mention that with my patch .ascii is used more aggressively than 
before: if a string is longer than ELF_STRING_LIMIT, all of it will be 
written as .ascii, whereas in the past .string would be used for the 
string's tail. Example diff against the original behaviour:



 .LASF15458:
-   .ascii  "SSA_VAR_P(DECL) (TREE_CODE (DECL) == VA"
-   .string "R_DECL || TREE_CODE (DECL) == PARM_DECL || TREE_CODE (DECL) == RESULT_DECL 
|| (TREE_CODE (DECL) == SSA_NAME && (TREE_CODE (SSA_NAME_VAR (DECL)) == VAR_DECL || 
TREE_CODE (SSA_NAME_VAR (DECL)) == PARM_DECL || TREE_CODE (SSA_NAME_VAR (DECL)) == 
RESULT_DECL)))"
+   .ascii  "SSA_VAR_P(DECL) (TREE_CODE (DECL) == VAR_DECL || TREE_CODE (D"
+   .ascii  "ECL) == PARM_DECL || TREE_CODE (DECL) == RESULT_DECL || (TREE"
+   .ascii  "_CODE (DECL) == SSA_NAME && (TREE_CODE (SSA_NAME_VAR (DECL)) "
+   .ascii  "== VAR_DECL || TREE_CODE (SSA_NAME_VAR (DECL)) == PARM_DECL |"
+   .ascii  "| TREE_CODE (SSA_NAME_VAR (DECL)) == RESULT_DECL)))\000"


BTW I can't find out why ELF_STRING_LIMIT is only 256; GAS seems to support 
arbitrary lengths. I'd have to change my code if we ever set it too high 
(or even unlimited), since I allocate the buffer on the stack.



Dimitris



Re: add strnlen to libiberty (was Re: Assembly output optimisations)

2012-08-06 Thread Dimitrios Apostolou

On Mon, 6 Aug 2012, Ian Lance Taylor wrote:


On Mon, Aug 6, 2012 at 9:34 PM, Dimitrios Apostolou  wrote:

As an addendum to my previous patch, I made an attempt to properly add
strnlen() to libiberty, with the code copied from gnulib. Unfortunately it
seems I've messed it up somewhere since defining HAVE_STRNLEN to 0 doesn't
seem to build strnlen.o for me. Any ideas?


What do you mean by "defining HAVE_STRNLEN to 0"?  The thing that will
control building strnlen.o is having AC_REPLACE_FUNCS in
libiberty/configure.in fail to find strnlen.  One way you can test this
is by adding this to libiberty/config.cache:

ac_cv_func_strnlen=${ac_cv_func_strnlen=yes}

before invoking configure.


Thanks Ian, that helped a lot. I changed ac_cv_func_strnlen=no and 
reconfigured with -C, and strnlen.o was built. :-)


What else is missing to make this patch appropriate for libiberty? Should 
I change the prolog in strnlen.c, since I only copied it intact from 
gnulib?



Thanks,
Dimitris



add strnlen to libiberty (was Re: Assembly output optimisations)

2012-08-06 Thread Dimitrios Apostolou

As an addendum to my previous patch, I made an attempt to properly add
strnlen() to libiberty, with the code copied from gnulib. Unfortunately it 
seems I've messed it up somewhere since defining HAVE_STRNLEN to 0 doesn't 
seem to build strnlen.o for me. Any ideas?



Thanks,
Dimitris
=== modified file 'libiberty/Makefile.in'
--- libiberty/Makefile.in   2012-04-27 14:14:14 +
+++ libiberty/Makefile.in   2012-08-07 03:52:53 +
@@ -151,7 +151,7 @@ CFILES = alloca.c argv.c asprintf.c atex
 spaces.c splay-tree.c stack-limit.c stpcpy.c stpncpy.c \
 strcasecmp.c strchr.c strdup.c strerror.c strncasecmp.c\
 strncmp.c strrchr.c strsignal.c strstr.c strtod.c strtol.c \
-strtoul.c strndup.c strverscmp.c   \
+strtoul.c strndup.c strnlen.c strverscmp.c \
timeval-utils.c tmpnam.c\
unlink-if-ordinary.c\
vasprintf.c vfork.c vfprintf.c vprintf.c vsnprintf.c vsprintf.c \
@@ -218,6 +218,7 @@ CONFIGURED_OFILES = ./asprintf.$(objext)
 ./strncmp.$(objext) ./strndup.$(objext) ./strrchr.$(objext)\
 ./strstr.$(objext) ./strtod.$(objext) ./strtol.$(objext)   \
 ./strtoul.$(objext) ./strverscmp.$(objext) \
+   ./strnlen.$(objext) \
./tmpnam.$(objext)  \
./vasprintf.$(objext) ./vfork.$(objext) ./vfprintf.$(objext)\
 ./vprintf.$(objext) ./vsnprintf.$(objext) ./vsprintf.$(objext) \
@@ -622,8 +623,7 @@ $(CONFIGURED_OFILES): stamp-picdir
else true; fi
$(COMPILE.c) $(srcdir)/crc32.c $(OUTPUT_OPTION)
 
-./dwarfnames.$(objext): $(srcdir)/dwarfnames.c $(INCDIR)/dwarf2.h \
-   $(INCDIR)/dwarf2.def
+./dwarfnames.$(objext): $(srcdir)/dwarfnames.c $(INCDIR)/dwarf2.h
if [ x"$(PICFLAG)" != x ]; then \
  $(COMPILE.c) $(PICFLAG) $(srcdir)/dwarfnames.c -o pic/$@; \
else true; fi
@@ -656,7 +656,8 @@ $(CONFIGURED_OFILES): stamp-picdir
else true; fi
$(COMPILE.c) $(srcdir)/fibheap.c $(OUTPUT_OPTION)
 
-./filename_cmp.$(objext): $(srcdir)/filename_cmp.c config.h 
$(INCDIR)/filenames.h \
+./filename_cmp.$(objext): $(srcdir)/filename_cmp.c config.h 
$(INCDIR)/ansidecl.h \
+   $(INCDIR)/filenames.h $(INCDIR)/hashtab.h \
$(INCDIR)/safe-ctype.h
if [ x"$(PICFLAG)" != x ]; then \
  $(COMPILE.c) $(PICFLAG) $(srcdir)/filename_cmp.c -o pic/$@; \
@@ -757,7 +758,7 @@ $(CONFIGURED_OFILES): stamp-picdir
$(COMPILE.c) $(srcdir)/insque.c $(OUTPUT_OPTION)
 
 ./lbasename.$(objext): $(srcdir)/lbasename.c config.h $(INCDIR)/ansidecl.h \
-   $(INCDIR)/filenames.h $(INCDIR)/libiberty.h \
+   $(INCDIR)/filenames.h $(INCDIR)/hashtab.h $(INCDIR)/libiberty.h \
$(INCDIR)/safe-ctype.h
if [ x"$(PICFLAG)" != x ]; then \
  $(COMPILE.c) $(PICFLAG) $(srcdir)/lbasename.c -o pic/$@; \
@@ -1043,7 +1044,7 @@ $(CONFIGURED_OFILES): stamp-picdir
else true; fi
$(COMPILE.c) $(srcdir)/splay-tree.c $(OUTPUT_OPTION)
 
-./stack-limit.$(objext): $(srcdir)/stack-limit.c config.h
+./stack-limit.$(objext): $(srcdir)/stack-limit.c config.h $(INCDIR)/ansidecl.h
if [ x"$(PICFLAG)" != x ]; then \
  $(COMPILE.c) $(PICFLAG) $(srcdir)/stack-limit.c -o pic/$@; \
else true; fi
@@ -1104,6 +1105,12 @@ $(CONFIGURED_OFILES): stamp-picdir
else true; fi
$(COMPILE.c) $(srcdir)/strndup.c $(OUTPUT_OPTION)
 
+./strnlen.$(objext): $(srcdir)/strnlen.c
+   if [ x"$(PICFLAG)" != x ]; then \
+ $(COMPILE.c) $(PICFLAG) $(srcdir)/strnlen.c -o pic/$@; \
+   else true; fi
+   $(COMPILE.c) $(srcdir)/strnlen.c $(OUTPUT_OPTION)
+
 ./strrchr.$(objext): $(srcdir)/strrchr.c $(INCDIR)/ansidecl.h
if [ x"$(PICFLAG)" != x ]; then \
  $(COMPILE.c) $(PICFLAG) $(srcdir)/strrchr.c -o pic/$@; \

=== modified file 'libiberty/config.in'
--- libiberty/config.in 2011-07-22 08:33:37 +
+++ libiberty/config.in 2012-08-07 03:29:18 +
@@ -262,6 +262,9 @@
 /* Define to 1 if you have the `strndup' function. */
 #undef HAVE_STRNDUP
 
+/* Define to 1 if you have the `strnlen' function. */
+#undef HAVE_STRNLEN
+
 /* Define to 1 if you have the `strrchr' function. */
 #undef HAVE_STRRCHR
 

=== modified file 'libiberty/configure'
--- libiberty/configure 2012-01-23 06:25:28 +
+++ libiberty/configure 2012-08-07 03:10:54 +
@@ -5340,6 +5340,7 @@ funcs="$funcs strchr"
 funcs="$funcs strdup"
 funcs="$funcs strncasecmp"
 funcs="$funcs strndup"
+funcs="$funcs strnlen"
 funcs="$funcs strrchr"
 funcs="$funcs strstr"
 funcs="$funcs strtod"
@@ -5380,8 +5381,8 @@ if test "x" = "y"; then
 random realpath rename rindex \
 sbrk setenv setproctitle setrlimit sigsetmask snprintf spawnve spawnvpe \
  stpcpy stpncpy strcasecmp strchr strdup \
-

Assembly output optimisations (was: PR 51094 - fprint_w() in output_addr_const() reinstated)

2012-08-06 Thread Dimitrios Apostolou

Hello list,

these clean-ups and minor speedups complete some TODOs and semi-finished 
changes I have gathered in the ELF backend. In a nutshell:


Fixed comment style, used INT_BITS_STRLEN_BOUND from gnulib to be 
future-proof regarding the string length of integer representations, 
replaced long arguments in the fast printing functions with HOST_WIDE_INT, 
which is always at least as wide (also asserted that), and converted some 
never-negative ints to unsigned. Guarded the 
output.h:default_elf_asm_output_* declarations, mimicking varasm.c (I'm not 
exactly sure why this is guarded in the first place). Changed 
default_elf_asm_output_* to be clearer and faster: they now fwrite() line 
by line instead of putting char by char. Implemented fast octal output in 
default_elf_asm_output_*; this should give a good boost to -flto, but I 
haven't measured a big testcase for this one.


All in all I get a speed-up of ~30 M instructions out of ~2 G, for a -g3 
compilation of reload.c. Actually, saving all the putc() calls gives a more 
significant gain, but I lost a tiny bit of it by converting the [sf]print_* 
functions from long to HOST_WIDE_INT, for PR 51094. So on i586, where 
HOST_WIDE_INT is 8 bytes wide, I can see slow calls to __u{div,mod}di3 
taking place. I don't know whether there is any point in writing LEB128 
values greater than 2^31, but I could change all that to 
HOST_WIDEST_FAST_INT if you think so.


Time savings are minor too, about 10 ms out of 0.85 s. Memory usage is the 
same. Bootstrapped on x86, no regressions in the C and C++ testsuites.



Thanks Andreas, hp, and Mike for your comments. Mike, I'd appreciate it if 
you elaborated on how to speed up sprint_uw_rev(); I don't think I 
understood what you have in mind.


Thanks,
Dimitris

2012-08-07  Dimitrios Apostolou

* final.c: Assert that HOST_WIDE_INT is at least as wide as long.
(output_addr_const): Use fprint_w() instead of fprintf() when
CONST_INT or CONST_FIXED.
(_sprint_uw): New static function.
(sprint_ul_rev): Change to:
(_sprint_uw_rev): Accept HOST_WIDE_INT arg instead of
long. Changed i to unsigned.
(INT_BITS_STRLEN_BOUND): Copied from gnulib.
(HOST_WIDE_INT_STRING_LEN): Define.
(fprint_ul, sprint_ul): Change to:
(fprint_uw, sprint_uw): Accept HOST_WIDE_INT arg instead of
long. Changed counter variables to unsigned.
(fprint_uw_hex): Renamed from fprint_whex
* output.h (fprint_ul, sprint_ul): Remove declarations.
(fprint_w, fprint_uw, sprint_uw): Declare.
(default_elf_asm_output_limited_string)
(default_elf_asm_output_ascii): wrap in #ifdef ELF_ASCII_ESCAPES
(fprint_uw_hex): Renamed from fprint_whex
* elfos.h (ASM_GENERATE_INTERNAL_LABEL): Use sprint_uw() instead
of sprint_ul().
(ASM_OUTPUT_ASCII): Removed the semicolon at the end of the macro.
* i386.c (print_reg): Use fprint_uw() instead of fprint_ul().
* dwarf2asm.c (asm_output_data_sleb128): Change fprintf() to
fputs() plus fprint_w(). Change fputc() to putc() in hot path.
(dw2_assemble_integer, dw2_asm_output_data)
(dw2_asm_output_data_uleb128): fprint_whex() renamed to
fprint_uw_hex().
* dwarf2out.c (dwarf2out_source_line): Changed comment. Use
fprint_uw() instead of fprint_ul().
* varasm.c (_elf_escape_char): New static function that writes a
char to a string according to ELF_ASCII_ESCAPES.
(_elf_output_ascii_line): New static function that writes to file
a single .ascii assembler declaration.
(default_elf_asm_output_limited_string)
(default_elf_asm_output_ascii): Rewrote functions so that they
fwrite() a full assembler line instead of putting char by char.
=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h  2012-06-19 19:55:33 +
+++ gcc/config/elfos.h  2012-08-06 03:19:16 +
@@ -119,7 +119,7 @@ see the files COPYING3 and COPYING.RUNTI
   (LABEL)[0] = '*';\
   (LABEL)[1] = '.';\
   __p = stpcpy (&(LABEL)[2], PREFIX);  \
-  sprint_ul (__p, (unsigned long) (NUM));  \
+  sprint_uw (__p, (unsigned HOST_WIDE_INT) (NUM)); \
 }  \
   while (0)
 
@@ -418,7 +418,7 @@ see the files COPYING3 and COPYING.RUNTI
 
 #undef  ASM_OUTPUT_ASCII
 #define ASM_OUTPUT_ASCII(FILE, STR, LENGTH)\
-  default_elf_asm_output_ascii ((FILE), (STR), (LENGTH));
+  default_elf_asm_output_ascii ((FILE), (STR), (LENGTH))
 
 /* Allow the use of the -frecord-gcc-switches switch via the
elf_record_gcc_switches function defined in varasm.c.  */

=== modified file 'gcc/config/i386/i386.c'
--- gcc/config/i386/i386.c  2012-07-28 09:16:52 +
+++ gcc/c

Re: [libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)

2012-08-05 Thread Dimitrios Apostolou

On Sat, 4 Aug 2012, Ian Lance Taylor wrote:


On Fri, 3 Aug 2012, Ian Lance Taylor wrote:


I'm not sure where you are looking.  I only see one call to
_obstack_begin in the gcc directory, and it could easily be replaced
with a call to obstack_specify_allocation instead.


In libcpp/ mostly, but it's only 4 cases so that's not too many; I can 
replace those with obstack_init/obstack_begin.



+#define XOBFINISH(O, PT)   ((PT) obstack_finish ((O)))



For XOBNEW, etc., we use (T *) rather than (PT).  Using (PT) seems
error-prone--it's the only use of the obstack with a different type
parameter.  Why not use T rather than PT here, and return (T *)?



I'd have to change many (about 60) occurrences of XOBFINISH if I change that.
I'd go for it if I were sure it's what we want; it can be a separate patch
later on.


I'm sorry to ask you to change a lot of code, but it simply doesn't
make sense to me to have all but one macro take the type as an
argument, and have one macro take a pointer to the type.  They really
have to be consistent.


No problem, if anyone else doesn't object I'll change those (in a second 
patch, right?).



Thanks,
Dimitris



Re: [libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)

2012-08-04 Thread Dimitrios Apostolou

On Fri, 3 Aug 2012, Ian Lance Taylor wrote:


2012-08-04 Dimitrios Apostolou 

* libiberty.h
(XOBDELETE,XOBGROW,XOBGROWVEC,XOBSHRINK,XOBSHRINKVEC): New
type-safe macros for obstack allocation.
(XOBFINISH): Renamed argument to PT since it is a pointer to T.



+/* Type-safe obstack allocator. You must first initialize the obstack with
+   obstack_init() or _obstack_begin().


This should recommend obstack_init, or obstack_begin, but not
_obstack_begin.  Also obstack_specify_allocation and
obstack_specify_allocation_with_arg are OK, so really it might be
better not to list the functions, but simply say "You must first
initialize the obstack."


Grep reveals that 9 out of 10 times we use _obstack_begin(), and we set 
alignment to 0 (isn't that suboptimal?).





+   T: Type,  O: Obstack,  N: Number of elements,  S: raw Size,


s/Size/size/


+#define XOBSHRINK(O, T)obstack_blank ((O), -1 * sizeof (T))
+#define XOBSHRINKVEC(O, T, N)  obstack_blank ((O), -1 * sizeof (T) * (N))


These are hard to use safely.  I'm not sure we should define them at all.


I've already used XOBSHRINK and it looks clear to me, but I could use 
obstack_blank() directly if necessary.





+#define XOBFINISH(O, PT)   ((PT) obstack_finish ((O)))


For XOBNEW, etc., we use (T *) rather than (PT).  Using (PT) seems
error-prone--it's the only use of the obstack with a different type
parameter.  Why not use T rather than PT here, and return (T *)?


I'd have to change many (about 60) occurrences of XOBFINISH if I change 
that. I'd go for it if I were sure it's what we want; it can be a separate 
patch later on.



Thanks,
Dimitris



[libiberty] add obstack macros (was Re: PR #53525 - track-macro-expansion performance regression)

2012-08-03 Thread Dimitrios Apostolou
Hi Dodji, I appreciate your review but I'm replying for now only for the 
libiberty.h part (attached updated patch), which I've needed elsewhere and 
seems the easiest to quickly apply.


On Wed, 18 Jul 2012, Dodji Seketeli wrote:



=== modified file 'include/libiberty.h'
--- include/libiberty.h 2011-09-28 19:04:30 +
+++ include/libiberty.h 2012-07-07 16:04:01 +


[...]


-/* Type-safe obstack allocator.  */
+/* Type-safe obstack allocator. You must first initialize the obstack with
+   obstack_init() or _obstack_begin(). */


Why not just using the _obstack_begin function?  Why introducing the
use of the obstack_init macro?


Please correct me if I'm wrong, but my impression is that obstack_init
is the documented way (see [1]); _obstack_begin seems hidden and gives 
access to dangerous (performance-wise) parameters, even though we always 
use it with the defaults (hmmm, double checking I see that we mostly 
set size 0, which is the default, but also alignment 0???).


[1] 
http://www.gnu.org/software/libc/manual/html_node/Preparing-for-Obstacks.html



-#define XOBFINISH(O, T) ((T) obstack_finish ((O)))


I think this change is not needed.  You basically remove this line to
replace it with the hunk below, that comes later in the patch:


+#define XOBFINISH(O, PT)   ((PT) obstack_finish ((O)))


So I believe you could do away with the change.



I just wanted to point out that it expects and returns a pointer to the 
type, e.g. after you XOBGROW(O, int, DATA) you should XOBFINISH(O, int *); 
a concrete example follows. It would probably be better for XOBFINISH to 
accept T and return (T *), but I didn't want to change the existing status 
quo.


I think I addressed your other concerns (style, comments, unneeded macro 
params). I've CC'd the maintainers. What do you think of the attached patch?



Thanks,
Dimitris


P.S. Personally I prefer the original obstack interface, i.e. 
obstack_grow(), obstack_finish(), etc. I just added the macros to follow 
the existing style and avoid typecasting because of C++. Let me know if 
it's OK to use obstacks directly and I'll back out the patch.





2012-08-04 Dimitrios Apostolou 

* libiberty.h
(XOBDELETE,XOBGROW,XOBGROWVEC,XOBSHRINK,XOBSHRINKVEC): New
type-safe macros for obstack allocation.
(XOBFINISH): Renamed argument to PT since it is a pointer to T.
=== modified file 'include/libiberty.h'
--- include/libiberty.h 2011-09-28 19:04:30 +
+++ include/libiberty.h 2012-08-03 23:51:55 +
@@ -361,12 +361,21 @@ extern unsigned int xcrc32 (const unsign
 #define XDUPVAR(T, P, S1, S2)  ((T *) xmemdup ((P), (S1), (S2)))
 #define XRESIZEVAR(T, P, S)((T *) xrealloc ((P), (S)))
 
-/* Type-safe obstack allocator.  */
+/* Type-safe obstack allocator. You must first initialize the obstack with
+   obstack_init() or _obstack_begin().
+   T: Type,  O: Obstack,  N: Number of elements,  S: raw Size,
+   PT: Pointer to T,  D: Data,  P: Pointer to element.  */
 
 #define XOBNEW(O, T)   ((T *) obstack_alloc ((O), sizeof (T)))
 #define XOBNEWVEC(O, T, N) ((T *) obstack_alloc ((O), sizeof (T) * (N)))
 #define XOBNEWVAR(O, T, S) ((T *) obstack_alloc ((O), (S)))
-#define XOBFINISH(O, T) ((T) obstack_finish ((O)))
+#define XOBDELETE(O, P)(obstack_free ((O), (P)))
+
+#define XOBGROW(O, T, D)   obstack_grow ((O), (D), sizeof (T))
+#define XOBGROWVEC(O, T, D, N) obstack_grow ((O), (D), sizeof (T) * (N))
+#define XOBSHRINK(O, T)obstack_blank ((O), -1 * sizeof (T))
+#define XOBSHRINKVEC(O, T, N)  obstack_blank ((O), -1 * sizeof (T) * (N))
+#define XOBFINISH(O, PT)   ((PT) obstack_finish ((O)))
 
 /* hex character manipulation routines */
 



Re: cosmetic change - simplify cse.c:preferable()

2012-08-03 Thread Dimitrios Apostolou

On Thu, 19 Jul 2012, Richard Guenther wrote:


I don't think it's any good or clearer to understand.


Hi Richi, I had forgotten I prepared this for PR #19832, maybe you want to 
take a look. FWIW, with my patch applied there is a difference of ~3 M 
instr, which is almost unmeasurable in time. But we can close the PR even 
with the simple patch Cristophe posted, since mine does not make 
things clearer.



Thanks,
Dimitris



Re: ping again - Show hash table stats when -fmem-report

2012-08-03 Thread Dimitrios Apostolou
I'm always forgetting something; this time it was the changelog, see the 
attached (same as before, nothing significant changed).


On Fri, 3 Aug 2012, Dimitrios Apostolou wrote:

Hi, I've updated this patch to trunk and rebootstrapped it, so I'm 
resubmitting it. I'm also trying to CC all relevant maintainers; sorry 
for spamming, but even though this patch is simple, it touches many parts.


I'm also adding stats to pointer_{set,map} but it will come in a separate 
patch. The notes quoted from earlier mail still apply:


On Sun, 8 Jul 2012, Dimitrios Apostolou wrote:


Hi,

This patch adds many nice stats about hash tables when gcc is run with 
-fmem-report. Attached patch tested on x86, no regressions.


Also attached is sample output of -fmem-report when compiling reload.c with 
-O2 -g. Especially interesting is the extreme htab usage in var-tracking 
(of course we already knew that... :-) and maybe symtab's slightly high 
collision rate (given it's a very hot hash table). Moreover it's notable 
that since last year the collision rate for mem_attrs_htab has been reduced 
from around 8 to 1; still not good enough, but a major improvement.


Jan: It has negligible runtime overhead; that's why I'm pushing it again 
as it is. Having compile-time instrumentation is something different, but 
still I'm not quite sure if it's worth the effort. IMHO doing separate 
builds to get memory info is tedious and even I've stopped doing them. If 
it has no overhead maybe it's better as a runtime option?

Anyway, compile-time instrumentation can come later.

Finally I repost some of my notes I'd like to get feedback on:

* Is it OK that I included <assert.h> in libiberty's hashtab.c? What's the 
proper way to assert stuff, since gcc_assert() is not accessible?


* Many hash tables are created with htab_create_ggc(), for example 
referenced_vars and default_defs in tree-ssa.c. I collect statistics in 
delete_tree_ssa() but maybe some are not deallocated in there but 
automatically garbage collected?


* Obviously hash table sizes are inflated where two entries might reference 
the same element (for example in symtab_hash) but I don't handle this.


* Changes that reduce load factor have been backed out since they brought 
controversy. I still think they are good, at least for symtab. I'll measure 
numbers separately for this after this patch makes it.



Thanks,
Dimitris
2012-05-21 Dimitrios Apostolou 

Print various statistics about hash tables when called with
-fmem-report. If the tables are created once use
htab_dump_statistics(), if they are created/destroyed multiple
times then introduce global variables to track statistics.

* cgraph.c, cgraph.h:
(cgraph_dump_stats): New function to dump stats about hash tables.
* gcc/symtab.c, cgraph.h:
(symtab_dump_stats): New function to dump stats about hash tables.
* cgraph.c: (call_site_hash_{num,expands,searches,collisions}):
New globals to keep statistics about call_site_hash hash tables.
(cgraph_remove_node_callees,cgraph_remove_node): if (mem_report)
then keep statistics about hash tables.
* cselib.c, cselib.h (cselib_dump_stats): New function to dump
stats about cselib_hash_table.
* cselib.c (cselib_htab_{num,expands,searches,collisions}): New
globals to keep hash table statistics.
(cselib_finish): if (mem_report) keep hash table statistics.
* dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if
-fmem_report.
* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
dump statistics about mem_attrs_htab hash table.
* tree.h, tree-ssa.c (tree_ssa_dump_stats): New function to print
statistics about all referenced_vars and default_defs hash tables.
* tree-ssa.c (default_defs_{num,expands,searches,collisions})
(referenced_vars_{num,expands,searches,collisions}): new globals
to keep statistics about hash tables.
(delete_tree_ssa): Keep statistics for hash tables by
increasing the new global variables.
* tree.c (dump_tree_statistics): Call tree_ssa_dump_stats().
(print_type_hash_statistics): Used the new htab_dump_statistics()
function.
* var-tracking.c (vars_htab_{num,expands,searches,collisions})
(dropval_htab_{num,expands,searches,collisions})
(cv_htab_{num,expands,searches,collisions}): new globals to keep
hash table statistics.
(shared_hash_destroy, vt_finalize): Keep statistics by
increasing values of new global variables if -fmem-report.
* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
stats about vars->htab, dropped_variables and value_chains hash
tables.
* toplev.c: Included cselib.h for cselib_dump_stats().
(dump_memory_report): Call all the above functions to provide better statistics.

Re: ping again - Show hash table stats when -fmem-report

2012-08-03 Thread Dimitrios Apostolou
Hi, I've updated this patch to trunk and rebootstrapped it, so I'm 
resubmitting it. I'm also trying to CC all relevant maintainers; 
sorry for spamming, but even though this patch is simple, it touches many 
parts.


I'm also adding stats to pointer_{set,map} but it will come in a separate 
patch. The notes quoted from earlier mail still apply:


On Sun, 8 Jul 2012, Dimitrios Apostolou wrote:


Hi,

This patch adds many nice stats about hash tables when gcc is run with 
-fmem-report. Attached patch tested on x86, no regressions.


Also attached is sample output of -fmem-report when compiling reload.c with 
-O2 -g. Especially interesting is the extreme htab usage in var-tracking 
(of course we already knew that... :-) and maybe symtab's slightly high 
collision rate (given it's a very hot hash table). Moreover it's notable that 
since last year the collision rate for mem_attrs_htab has been reduced from 
around 8 to 1; still not good enough, but a major improvement.


Jan: It has negligible runtime overhead; that's why I'm pushing it again 
as it is. Having compile-time instrumentation is something different, but 
still I'm not quite sure if it's worth the effort. IMHO doing separate builds 
to get memory info is tedious and even I've stopped doing them. If it has no 
overhead maybe it's better as a runtime option?

Anyway, compile-time instrumentation can come later.

Finally I repost some of my notes I'd like to get feedback on:

* Is it OK that I included <assert.h> in libiberty's hashtab.c? What's the 
proper way to assert stuff, since gcc_assert() is not accessible?


* Many hash tables are created with htab_create_ggc(), for example 
referenced_vars and default_defs in tree-ssa.c. I collect statistics in 
delete_tree_ssa() but maybe some are not deallocated in there but 
automatically garbage collected?


* Obviously hash table sizes are inflated where two entries might reference 
the same element (for example in symtab_hash) but I don't handle this.


* Changes that reduce load factor have been backed out since they brought 
controversy. I still think they are good, at least for symtab. I'll measure 
numbers separately for this after this patch makes it.



Thanks,
Dimitris
=== modified file 'gcc/Makefile.in'
--- gcc/Makefile.in 2012-07-26 13:10:04 +
+++ gcc/Makefile.in 2012-08-03 14:37:47 +
@@ -2679,7 +2679,7 @@ toplev.o : toplev.c $(CONFIG_H) $(SYSTEM
langhooks.h insn-flags.h $(CFGLOOP_H) hosthooks.h \
$(CGRAPH_H) $(COVERAGE_H) alloc-pool.h $(GGC_H) \
$(OPTS_H) params.def tree-mudflap.h $(TREE_PASS_H) $(GIMPLE_H) \
-   tree-ssa-alias.h $(PLUGIN_H) realmpfr.h tree-diagnostic.h \
+   tree-ssa-alias.h $(PLUGIN_H) cselib.h realmpfr.h tree-diagnostic.h \
$(TREE_PRETTY_PRINT_H) opts-diagnostic.h $(COMMON_TARGET_H)
 
 hwint.o : hwint.c $(CONFIG_H) $(SYSTEM_H) $(DIAGNOSTIC_CORE_H)

=== modified file 'gcc/cgraph.c'
--- gcc/cgraph.c2012-07-16 11:32:42 +
+++ gcc/cgraph.c2012-08-03 14:37:47 +
@@ -1066,6 +1066,13 @@ cgraph_update_edges_for_call_stmt (gimpl
 }
 
 
+/* Keep statistics about call_site_hash hash tables. */
+static unsigned int call_site_hash_num;
+static unsigned long call_site_hash_expands;
+static unsigned long call_site_hash_searches;
+static unsigned long call_site_hash_collisions;
+
+
 /* Remove all callees from the node.  */
 
 void
@@ -1096,6 +1103,16 @@ cgraph_node_remove_callees (struct cgrap
   node->callees = NULL;
   if (node->call_site_hash)
 {
+  if (mem_report)
+   {
+ call_site_hash_num++;
+ call_site_hash_expands +=
+   htab_expands_num (node->call_site_hash);
+ call_site_hash_searches +=
+   htab_searches_num (node->call_site_hash);
+ call_site_hash_collisions +=
+   htab_collisions_num (node->call_site_hash);
+   }
   htab_delete (node->call_site_hash);
   node->call_site_hash = NULL;
 }
@@ -1252,6 +1269,16 @@ cgraph_remove_node (struct cgraph_node *
   node->symbol.decl = NULL;
   if (node->call_site_hash)
 {
+  if (mem_report)
+   {
+ call_site_hash_num++;
+ call_site_hash_expands +=
+   htab_expands_num (node->call_site_hash);
+ call_site_hash_searches +=
+   htab_searches_num (node->call_site_hash);
+ call_site_hash_collisions +=
+   htab_collisions_num (node->call_site_hash);
+   }
   htab_delete (node->call_site_hash);
   node->call_site_hash = NULL;
 }
@@ -2457,4 +2484,18 @@ verify_cgraph (void)
   FOR_EACH_FUNCTION (node)
 verify_cgraph_node (node);
 }
+
+void
+cgraph_dump_stats (void)
+{
+  fprintf (stderr, "\ncgraph.c hash table stats:\n");
+  fprintf (stderr, "\t%u cgraph_node->call_site_hash hash tables:\n",
+  call_site_hash_num);
+  fprintf (stderr, "

PR 51094 - fprint_w() in output_addr_const() reinstated

2012-07-09 Thread Dimitrios Apostolou
Since output_addr_const() shows up pretty hot in the compiler, I reinstated 
the fprint_w() call in place of fprintf().


This patch relies on two things: two's-complement representation for 
negative integers, and HOST_WIDE_INT being at least as large a type as long 
on all platforms (will it always be?).
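
As an aside, here is a standalone sketch (my own illustration, using plain
long instead of HOST_WIDE_INT) of why the magnitude is computed as
(unsigned) (~value) + 1 rather than by negating first: for the most
negative value, -value overflows signed arithmetic, while the bitwise form
stays entirely within unsigned arithmetic:

#include <stdio.h>
#include <limits.h>

static void
print_magnitude (long value)
{
  unsigned long mag;

  if (value >= 0)
    mag = (unsigned long) value;
  else
    /* Two's-complement negation done in unsigned arithmetic; never
       overflows, even for LONG_MIN where plain -value would be
       undefined behaviour.  */
    mag = (unsigned long) ~value + 1;

  printf ("|%ld| == %lu\n", value, mag);
}

int main (void)
{
  print_magnitude (-42);
  print_magnitude (LONG_MIN);
  return 0;
}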


Bootstrapped/tested on i386, regtested on x86_64 multilib, 
i386-pc-solaris2.10 (thanks ro), i686-darwin9 (thanks iains).



2012-07-09 Dimitrios Apostolou 

* final.c, output.h (fprint_w): New function to write a
HOST_WIDE_INT to a file, fast.
* final.c (output_addr_const): Use fprint_w() instead of
fprintf().
(sprint_ul_rev): New static function to write a
HOST_WIDE_INT to a string in reverse, fast.



Thanks,
Dimitris
=== modified file 'gcc/final.c'
--- gcc/final.c 2012-06-24 17:58:46 +
+++ gcc/final.c 2012-07-09 04:17:48 +
@@ -3693,7 +3693,7 @@ output_addr_const (FILE *file, rtx x)
   break;
 
 case CONST_INT:
-  fprintf (file, HOST_WIDE_INT_PRINT_DEC, INTVAL (x));
+  fprint_w (file, INTVAL (x));
   break;
 
 case CONST:
@@ -3827,11 +3827,12 @@ fprint_whex (FILE *f, unsigned HOST_WIDE
 }
 }
 
-/* Internal function that prints an unsigned long in decimal in reverse.
-   The output string IS NOT null-terminated. */
+/* Internal function that writes to a string an unsigned HOST_WIDE_INT or an
+   unsigned long in decimal in reverse.  The output string IS NOT
+   null-terminated. */
 
 static int
-sprint_ul_rev (char *s, unsigned long value)
+sprint_ul_rev (char *s, unsigned HOST_WIDE_INT value)
 {
   int i = 0;
   do
@@ -3849,6 +3850,32 @@ sprint_ul_rev (char *s, unsigned long va
   return i;
 }
 
+/* Write a signed HOST_WIDE_INT as decimal to a file, fast. */
+
+void
+fprint_w (FILE *f, HOST_WIDE_INT value)
+{
+  /* python says: len(str(2**64)) == 20 */
+  char s[20];
+  int i;
+
+  if (value >= 0)
+i = sprint_ul_rev (s, value);
+  else
+{
+  i = sprint_ul_rev (s, (unsigned HOST_WIDE_INT) (~value) + 1);
+  putc('-', f);
+}
+
+  /* It's probably too small so don't bother with string reversal and fputs */
+  do
+{
+  i--;
+  putc (s[i], f);
+}
+  while (i != 0);
+}
+
 /* Write an unsigned long as decimal to a file, fast. */
 
 void

=== modified file 'gcc/output.h'
--- gcc/output.h2012-06-24 17:58:46 +
+++ gcc/output.h2012-07-08 09:24:59 +
@@ -125,6 +125,7 @@ extern void output_addr_const (FILE *, r
 #define ATTRIBUTE_ASM_FPRINTF(m, n) ATTRIBUTE_NONNULL(m)
 #endif
 
+extern void fprint_w (FILE *, HOST_WIDE_INT);
 extern void fprint_whex (FILE *, unsigned HOST_WIDE_INT);
 extern void fprint_ul (FILE *, unsigned long);
 extern int sprint_ul (char *, unsigned long);



cosmetic change - simplify cse.c:preferable()

2012-07-08 Thread Dimitrios Apostolou

Hello,

I've had this patch some time now, it's simple and cosmetic only, I 
had done it while trying to understand expression costs in CSE. I 
think it's more readable than the previous one. FWIW it passed all tests 
on x86.



Thanks,
Dimitris=== modified file 'gcc/cse.c'
--- gcc/cse.c   2012-06-15 09:22:00 +
+++ gcc/cse.c   2012-07-08 07:28:52 +
@@ -713,32 +713,25 @@ approx_reg_cost (rtx x)
 static int
 preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
 {
-  /* First, get rid of cases involving expressions that are entirely
- unwanted.  */
-  if (cost_a != cost_b)
-{
-  if (cost_a == MAX_COST)
-   return 1;
-  if (cost_b == MAX_COST)
-   return -1;
-}
+  int cost_diff = cost_a - cost_b;
+  int regcost_diff = regcost_a - regcost_b;
 
-  /* Avoid extending lifetimes of hardregs.  */
-  if (regcost_a != regcost_b)
+  if (cost_diff != 0)
 {
-  if (regcost_a == MAX_COST)
-   return 1;
-  if (regcost_b == MAX_COST)
-   return -1;
+  /* If none of the expressions are entirely unwanted */
+  if ((cost_a != MAX_COST) && (cost_b != MAX_COST)
+ /* AND only one of the regs is HARD_REG */
+ && (regcost_diff != 0)
+ && ((regcost_a == MAX_COST) || (regcost_b == MAX_COST))
+ )
+   /* Then avoid extending lifetime of HARD_REG */
+   return regcost_diff;
+
+  return cost_diff;
 }
 
-  /* Normal operation costs take precedence.  */
-  if (cost_a != cost_b)
-return cost_a - cost_b;
-  /* Only if these are identical consider effects on register pressure.  */
-  if (regcost_a != regcost_b)
-return regcost_a - regcost_b;
-  return 0;
+  /* cost_a == costb, consider effects on register pressure */
+  return regcost_diff;
 }
 
 /* Internal function, to compute cost when X is not a register; called



PR #53525 - track-macro-expansion performance regression

2012-07-07 Thread Dimitrios Apostolou
With the attached patches I introduce four new obstacks in struct 
cpp_reader to replace the malloc()/realloc() calls when expanding macros. 
Numbers have been posted in the PR, but to summarize:


before: 0.785 s or  2201 M instr
after:  0.760 s or  2108 M instr

Memory overhead is some tens of kilobytes in the worst case. Tested on x86, 
no regressions. I think this version of the patch is OK to merge, besides 
some TODO comments (I'd appreciate opinions on them) and the following 
maybe:


IMHO the new obstack_{mark,release} functions are the ones that will be 
harder to apply, because they make gcc's obstacks even more different 
from libc's. I sent the patch to libc-alpha but the feedback I got was 
that I should first make the two obstack versions (gcc, libc) identical, 
and then try to push changes. I've noted the primary differences and plan 
on tackling this, but it's not as trivial as it seems.


So if it's not OK, where could the new obstack_{mark,release} go?
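
For reviewers unfamiliar with the idea, a minimal sketch of the
mark/release pattern follows (the helper names are hypothetical; the
actual functions in the patch may differ). A mark records the obstack's
current allocation point, and a release frees everything allocated after
it in a single call:

#include <stdlib.h>
#include <string.h>
#include <obstack.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free  free

/* A "mark" is simply the address the next allocation would return.  */
static void *
mark_obstack (struct obstack *ob)
{
  return obstack_alloc (ob, 0);
}

/* Release frees everything allocated since MARK was taken.  */
static void
release_obstack (struct obstack *ob, void *mark)
{
  obstack_free (ob, mark);
}

int main (void)
{
  struct obstack ob;
  obstack_init (&ob);

  void *mark = mark_obstack (&ob);
  /* Scratch allocations for one macro expansion...  */
  char *tmp = (char *) obstack_alloc (&ob, 128);
  memset (tmp, 0, 128);
  /* ...all thrown away at once, with no per-object free calls.  */
  release_obstack (&ob, mark);

  obstack_free (&ob, NULL);
  return 0;
}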


Thanks,
Dimitris
=== modified file 'include/libiberty.h'
--- include/libiberty.h 2011-09-28 19:04:30 +
+++ include/libiberty.h 2012-07-07 16:04:01 +
@@ -361,12 +361,19 @@ extern unsigned int xcrc32 (const unsign
 #define XDUPVAR(T, P, S1, S2)  ((T *) xmemdup ((P), (S1), (S2)))
 #define XRESIZEVAR(T, P, S)((T *) xrealloc ((P), (S)))
 
-/* Type-safe obstack allocator.  */
+/* Type-safe obstack allocator. You must first initialize the obstack with
+   obstack_init() or _obstack_begin(). */
 
 #define XOBNEW(O, T)   ((T *) obstack_alloc ((O), sizeof (T)))
 #define XOBNEWVEC(O, T, N) ((T *) obstack_alloc ((O), sizeof (T) * (N)))
 #define XOBNEWVAR(O, T, S) ((T *) obstack_alloc ((O), (S)))
-#define XOBFINISH(O, T) ((T) obstack_finish ((O)))
+#define XOBDELETE(O, T, P) (obstack_free ((O), (P)))
+
+#define XOBGROW(O, T, D)   obstack_grow ((O), (D), sizeof (T))
+#define XOBGROWVEC(O, T, D, N) obstack_grow ((O), (D), sizeof (T) * (N))
+#define XOBSHRINK(O, T)obstack_blank ((O), -1 * sizeof (T))
+#define XOBSHRINKVEC(O, T, N)  obstack_blank ((O), -1 * sizeof (T) * (N))
+#define XOBFINISH(O, PT)   ((PT) obstack_finish ((O)))
 
 /* hex character manipulation routines */
 

=== modified file 'libcpp/init.c'
--- libcpp/init.c   2012-04-30 11:43:43 +
+++ libcpp/init.c   2012-07-07 16:04:01 +
@@ -241,10 +241,20 @@ cpp_create_reader (enum c_lang lang, has
   /* The expression parser stack.  */
   _cpp_expand_op_stack (pfile);
 
+#define obstack_chunk_alloc ((void *(*) (long)) xmalloc)
+#define obstack_chunk_free  ((void (*) (void *)) free)
+
   /* Initialize the buffer obstack.  */
-  _obstack_begin (&pfile->buffer_ob, 0, 0,
- (void *(*) (long)) xmalloc,
- (void (*) (void *)) free);
+  obstack_init(&pfile->buffer_ob);
+
+  /* Initialize the macro obstacks. */
+  obstack_init (&pfile->exp_ob);
+  if (CPP_OPTION (pfile, track_macro_expansion))
+{
+  obstack_init (&pfile->virt_locs_ob);
+  obstack_init (&pfile->arg_locs_ob);
+  obstack_init (&pfile->exp_locs_ob);
+}
 
   _cpp_init_files (pfile);
 
@@ -253,6 +263,9 @@ cpp_create_reader (enum c_lang lang, has
   return pfile;
 }
 
+#undef obstack_chunk_alloc
+#undef obstack_chunk_free
+
 /* Set the line_table entry in PFILE.  This is called after reading a
PCH file, as the old line_table will be incorrect.  */
 void
@@ -289,6 +302,14 @@ cpp_destroy (cpp_reader *pfile)
 deps_free (pfile->deps);
   obstack_free (&pfile->buffer_ob, 0);
 
+  obstack_free (&pfile->exp_ob, 0);
+  if (CPP_OPTION (pfile, track_macro_expansion))
+{
+  obstack_free (&pfile->virt_locs_ob, 0);
+  obstack_free (&pfile->arg_locs_ob, 0);
+  obstack_free (&pfile->exp_locs_ob, 0);
+}
+
   _cpp_destroy_hashtable (pfile);
   _cpp_cleanup_files (pfile);
   _cpp_destroy_iconv (pfile);

=== modified file 'libcpp/internal.h'
--- libcpp/internal.h   2012-05-29 09:36:29 +
+++ libcpp/internal.h   2012-07-07 17:18:53 +
@@ -555,6 +555,13 @@ struct cpp_reader
   /* If non-null, the lexer will use this location for the next token
  instead of getting a location from the linemap.  */
   source_location *forced_token_location_p;
+
+  /* Obstacks used to speed up macro expansion and virt_locs tracking. */
+  struct obstack exp_ob;   /* for expanding macro arguments */
+  /* The rest are used only when -ftrack-macro-expansion */
+  struct obstack exp_locs_ob;  /* virt_locs of expanded macro arguments */
+  struct obstack arg_locs_ob;  /* virt_locs of macro arguments */
+  struct obstack virt_locs_ob; /* virt locs for all other macros */
 };
 
 /* Character classes.  Based on the more primitive macros in safe-ctype.h.

=== modified file 'libcpp/macro.c'
--- libcpp/macro.c  2012-05-29 14:53:50 +
+++ libcpp/macro.c  2012-07-07 18:08:44 +
@@ -24,10 +24,20 @@ along with this program; see the file CO
  You are forbidden to forbid anyone else to use, share and improve
  what you give them

ping again - Show hash table stats when -fmem-report

2012-07-07 Thread Dimitrios Apostolou
-  fprintf (stderr, "slots\t\t%lu\n",
-  (unsigned long) table->nslots);
-  fprintf (stderr, "deleted\t\t%lu\n",
+  fprintf (stderr, "\tentries\t\t%lu (%.2f%%)\n",
+  (unsigned long) nelts, nelts * 100.0 / table->nslots);
+  fprintf (stderr, "\tdeleted\t\t%lu\n",
   (unsigned long) deleted);
-  fprintf (stderr, "bytes\t\t%lu%c (%lu%c overhead)\n",
+  fprintf (stderr, "\tslots\t\t%u\n",
+  table->nslots);
+  fprintf (stderr, "\tstring bytes\t%lu%c (%lu%c obstack_memory_used)\n",
   SCALE (total_bytes), LABEL (total_bytes),
-  SCALE (overhead), LABEL (overhead));
-  fprintf (stderr, "table size\t%lu%c\n",
+  SCALE (obmem), LABEL (obmem));
+  fprintf (stderr, "\ttable size\t%lu%c\n",
   SCALE (headers), LABEL (headers));
 
   exp_len = (double)total_bytes / (double)nelts;
   exp2_len = exp_len * exp_len;
   exp_len2 = (double) sum_of_squares / (double) nelts;
 
-  fprintf (stderr, "coll/search\t%.4f\n",
+  fprintf (stderr, "\texpansions\t%u\n",
+  table->expands);
+  fprintf (stderr, "\tsearches\t%u\n",
+  table->searches);
+  fprintf (stderr, "\tcollisions\t%u\n",
+  table->collisions);
+  fprintf (stderr, "\tcoll/search\t%.4f\n",
   (double) table->collisions / (double) table->searches);
-  fprintf (stderr, "ins/search\t%.4f\n",
+  fprintf (stderr, "\tins/search\t%.4f\n",
   (double) nelts / (double) table->searches);
-  fprintf (stderr, "avg. entry\t%.2f bytes (+/- %.2f)\n",
+  fprintf (stderr, "\tavg. entry\t%.2f bytes (+/- %.2f)\n",
   exp_len, approx_sqrt (exp_len2 - exp2_len));
-  fprintf (stderr, "longest entry\t%lu\n",
+  fprintf (stderr, "\tlongest entry\t%lu\n",
   (unsigned long) longest);
 #undef SCALE
 #undef LABEL

=== modified file 'libiberty/hashtab.c'
--- libiberty/hashtab.c 2011-02-03 07:23:20 +
+++ libiberty/hashtab.c 2012-07-07 19:10:57 +
@@ -58,6 +58,7 @@ Boston, MA 02110-1301, USA.  */
 #endif
 
 #include <stdio.h>
+#include <assert.h>
 
 #include "libiberty.h"
 #include "ansidecl.h"
@@ -563,6 +564,7 @@ htab_expand (htab_t htab)
   htab->size_prime_index = nindex;
   htab->n_elements -= htab->n_deleted;
   htab->n_deleted = 0;
+  htab->expands++;
 
   p = oentries;
   do
@@ -812,6 +814,107 @@ htab_collisions (htab_t htab)
   return (double) htab->collisions / (double) htab->searches;
 }
 
+/* Return the number of expands */
+
+unsigned int
+htab_expands_num (htab_t htab)
+{
+  return htab->expands;
+}
+
+/* Return the number of collisions */
+
+unsigned int
+htab_collisions_num (htab_t htab)
+{
+  return htab->collisions;
+}
+
+/* Return the number of searches */
+
+unsigned int
+htab_searches_num (htab_t htab)
+{
+  return htab->searches;
+}
+
+/* Dump allocation statistics to stderr. If elem_size > 0 display total memory
+ * usage of hash table too. */
+
+void
+htab_dump_statistics (htab_t table, size_t elem_size)
+{
+  size_t n_valid, headers, contents, empties, deleted, total;
+  void **p, **limit;
+
+#define SCALE(x) ((unsigned long) ((x) < 1024*10 \
+ ? (x) \
+ : ((x) < 1024*1024*10 \
+? (x) / 1024 \
+: (x) / (1024*1024))))
+#define LABEL(x) ((x) < 1024*10 ? ' ' : ((x) < 1024*1024*10 ? 'k' : 'M'))
+
+  deleted = n_valid = empties = 0;
+  p = table->entries;
+  limit = p + table->size;
+  do
+if (*p == HTAB_DELETED_ENTRY)
+  ++deleted;
+else if (*p == HTAB_EMPTY_ENTRY)
+  ++empties;
+else
+  ++n_valid;
+  while (++p < limit);
+
+  assert (deleted == table->n_deleted);
+  assert (empties == table->size - table->n_elements);
+  assert (n_valid + deleted == table->n_elements);
+
+  headers = table->size * sizeof (*(table->entries));
+  contents = n_valid * elem_size;
+  total = sizeof (*table) + headers + contents;
+
+  fprintf (stderr, "\tslots\t\t%lu\n",
+  (unsigned long) table->size);
+
+  fprintf (stderr, "\tused\t\t%lu (%.2f%%)\n",
+  (unsigned long) table->n_elements,
+  table->n_elements * 100.0 / table->size);
+  fprintf (stderr, "\t\tvalid\t\t%lu\n",
+  (unsigned long) n_valid);
+  fprintf (stderr, "\t\tdeleted\t\t%lu\n",
+  (unsigned long) table->n_deleted);
+
+  fprintf(stderr, "\ttotal size\t%lu %cB\n",
+ SCALE (total), LABEL (total));
+  fprintf (stderr, "\t\tstruct htab\t%lu\t B\n",
+  (unsigned long) sizeof (*table));
+  fprintf (stderr, "\t\ttable\t\t%lu\t%cB\n",
+  SCALE (headers), LABEL (headers));
+  if (elem_size > 0)
+{
+  fprintf (stderr, "\t\tone element\t%lu\t B\n",
+  (u

Re: [line-map] simple oneliner that speeds up track-macro-expansion

2012-06-04 Thread Dimitrios Apostolou

Hi Dodji,

On Mon, 4 Jun 2012, Dodji Seketeli wrote:


Hello Dimitrios,

I cannot approve or deny your patch, but I have one question.



Who should I CC then? I saw that you have commits in that file.


I am wondering why this change implies better performance.

Is this because when we later want to encode a new line/column, and
hit the spot below, (in linemap_line_start via
linemap_position_for_column), we call less linemap_add (and thus
allocate less maps):


Almost. To be exact we don't even enter linemap_line_start() because of 
the check in linemap_position_for_column(). As a result we avoid 
almost 450K calls to linemap_line_start() (and many calls to malloc() 
subsequently), for my example case of compiling reload.c. Here is the 
source annotated by callgrind:



Before:

.  source_location
.  linemap_position_for_column (struct line_maps *set, unsigned int 
to_column)
5,193,344  {
  649,168source_location r = set->highest_line;
.
3,895,008linemap_assert
.  (!linemap_macro_expansion_map_p (LINEMAPS_LAST_ORDINARY_MAP 
(set)));
.
1,298,336if (to_column >= set->max_column_hint)
.  {
1,696,684if (r >= 0xC000000 || to_column > 100000)
.   {
. /* Running low on source_locations - disable column numbers.  
*/
. return r;
.   }
.else
.   {
. struct line_map *map = LINEMAPS_LAST_ORDINARY_MAP (set);
5,938,394 r = linemap_line_start (set, SOURCE_LINE (map, r), to_column 
+ 50);
36,425,117  => line-map.c:linemap_line_start (424171x)
.   }
.  }
  649,168r = r + to_column;
1,298,336if (r >= set->highest_location)
  640,785  set->highest_location = r;
.return r;
3,895,008  }



After

.  source_location
.  linemap_position_for_column (struct line_maps *set, unsigned int 
to_column)
5,193,344  {
  649,168source_location r = set->highest_line;
.
3,895,008linemap_assert
.  (!linemap_macro_expansion_map_p (LINEMAPS_LAST_ORDINARY_MAP 
(set)));
.
1,298,336if (to_column >= set->max_column_hint)
.  {
   48if (r >= 0xC000000 || to_column > 100000)
.   {
. /* Running low on source_locations - disable column numbers.  
*/
. return r;
.   }
.else
.   {
. struct line_map *map = LINEMAPS_LAST_ORDINARY_MAP (set);
  168 r = linemap_line_start (set, SOURCE_LINE (map, r), to_column 
+ 50);
2,573  => line-map.c:linemap_line_start (12x)
.   }
.  }
  649,168r = r + to_column;
1,298,336if (r >= set->highest_location)
  640,805  set->highest_location = r;
.return r;
3,895,008  }



Dimitris


[line-map] simple oneliner that speeds up track-macro-expansion

2012-06-03 Thread Dimitrios Apostolou

Hello list,

the attached simple patch is a small part of the fix in Bug #53525, but it 
is independent and offers a noticeable performance improvement for 
track-macro-expansion. Tested on x86 with no regressions, besides memory 
usage going up by some KB.


I don't get why the max_column_hint was zeroed in every macro. Maybe there 
are pathological cases that I don't see?



2012-06-04  Dimitrios Apostolou 

* line-map.c (linemap_enter_macro): Don't zero max_column_hint in
every macro. This improves performance by reducing the number of
reallocations when track-macro-expansion is on.


Thanks
Dimitris
=== modified file 'libcpp/line-map.c'
--- libcpp/line-map.c   2012-04-30 11:42:12 +
+++ libcpp/line-map.c   2012-05-27 06:52:08 +
@@ -331,7 +331,6 @@ linemap_enter_macro (struct line_maps *s
  num_tokens * sizeof (source_location));
 
   LINEMAPS_MACRO_CACHE (set) = LINEMAPS_MACRO_USED (set) - 1;
-  set->max_column_hint = 0;
 
   return map;
 }



Re: [DF] Generate REFs in REGNO order

2012-05-22 Thread Dimitrios Apostolou


On Tue, 22 May 2012, Paolo Bonzini wrote:


On 22/05/2012 18:26, Dimitrios Apostolou wrote:


You are right, and I noticed that if we reverse (actually put straight)
the loop for the PARALLEL defs inside df_defs_record() then the speedup
stands for both x86 and ppc64.

The following patch was tested on x86, do you think it is meaningful for
the generic case?


It's really a lottery, but if it doesn't make things worse for x86 why not.



It doesn't, so please apply, even though I'll keep trying to understand 
this in depth and see if I can think of a proper way of generating DEFs in 
the proper order.



Dimitris



Re: [DF] Generate REFs in REGNO order

2012-05-22 Thread Dimitrios Apostolou

On Tue, 22 May 2012, Paolo Bonzini wrote:

On 21/05/2012 19:49, Dimitrios Apostolou wrote:


Thanks for reviewing, in the meantime I'll try to figure out why this
patch doesn't offer any speed-up on ppc64 (doesn't break anything
though), so expect a followup by tomorrow.


Perhaps you hit this?

 else if (GET_CODE (XEXP (note, 0)) == CLOBBER)
   {
 if (REG_P (XEXP (XEXP (note, 0), 0)))
   {
 unsigned int regno = REGNO (XEXP (XEXP (note, 0), 0));
 if (!TEST_HARD_REG_BIT (defs_generated, regno))
   df_defs_record (collection_rec, XEXP (note, 0), bb,
   insn_info, flags);
   }


You are right, and I noticed that if we reverse (actually put straight) 
the loop for the PARALLEL defs inside df_defs_record() then the speedup 
stands for both x86 and ppc64.


The following patch was tested on x86, do you think it is meaningful for 
the generic case?


--- gcc/df-scan.c   2012-05-06 04:08:43 +
+++ gcc/df-scan.c   2012-05-22 13:08:33 +
@@ -3010,7 +3010,7 @@ df_defs_record (struct df_collection_rec
   break;

 case PARALLEL:
-  for (i = XVECLEN (x, 0) - 1; i >= 0; i--)
+  for (i = 0; i < XVECLEN (x, 0); i++)
df_defs_record (collection_rec, XVECEXP (x, 0, i),
bb, insn_info, flags);
   break;



Thanks,
Dimitris



Re: [DF] Generate REFs in REGNO order

2012-05-21 Thread Dimitrios Apostolou

Hi Paolo,

On Mon, 21 May 2012, Paolo Bonzini wrote:

On 20/05/2012 20:50, Dimitrios Apostolou wrote:



Paolo: I couldn't find a single test-case where the mw_reg_pool was
heavily used so I reduced its size. You think it's OK for all archs?



Makes sense, we can see if something breaks.  I'll commit the patch
tomorrow after a re-review.


Hopefully nothing will break; it will only become slower if the 
mw_reg_pool gets resized according to requests. But I can't find a 
mw_reg_pool-intensive testcase, so I guess this won't happen.


Thanks for reviewing, in the meantime I'll try to figure out why this 
patch doesn't offer any speed-up on ppc64 (doesn't break anything though), 
so expect a followup by tomorrow.



Thanks,
Dimitris



Re: Show hash table stats when -fmem-report

2012-05-21 Thread Dimitrios Apostolou


One line patch to update Makefile.


2012-05-21 Dimitrios Apostolou 

* gcc/Makefile.in: (toplev.o) toplev.o depends on cselib.h.



=== modified file 'gcc/Makefile.in'
--- gcc/Makefile.in 2012-05-04 20:04:47 +
+++ gcc/Makefile.in 2012-05-21 14:08:45 +
@@ -2751,7 +2751,7 @@ toplev.o : toplev.c $(CONFIG_H) $(SYSTEM
langhooks.h insn-flags.h $(CFGLAYOUT_H) $(CFGLOOP_H) hosthooks.h \
$(CGRAPH_H) $(COVERAGE_H) alloc-pool.h $(GGC_H) $(INTEGRATE_H) \
$(OPTS_H) params.def tree-mudflap.h $(TREE_PASS_H) $(GIMPLE_H) \
-   tree-ssa-alias.h $(PLUGIN_H) realmpfr.h tree-diagnostic.h \
+   tree-ssa-alias.h $(PLUGIN_H) cselib.h realmpfr.h tree-diagnostic.h \
tree-pretty-print.h opts-diagnostic.h $(COMMON_TARGET_H)

 hwint.o : hwint.c $(CONFIG_H) $(SYSTEM_H) $(DIAGNOSTIC_CORE_H)



Show hash table stats when -fmem-report

2012-05-21 Thread Dimitrios Apostolou

Hello list,

I ported a patch from last year's GSOC to current trunk, removing 
deprecated hash tables and adding some new ones. I also backed out a 
change that reduced collisions by decreasing load factor because it had 
created controversy, so this patch should be easily applicable.


CC'd whoever is relevant with subsystems or had commented on this last 
year. Some notes:


 * Is it OK that I included <assert.h> in libiberty's hashtab.c?
 * Many hash tables are created with htab_create_ggc(), for example 
referenced_vars and default_defs in tree-ssa.c. I collect statistics in 
delete_tree_ssa() but maybe some are not deallocated in there but 
automatically garbage collected?
 * Obviously hash table sizes are inflated where two entries might 
reference the same element (for example in symtab_hash) but I don't handle 
this.
 * Maybe the best overall solution would be to add a string parameter to 
htab_create*() and store statistics in an internal hash table according to 
this string. Statistics would then be printed for all hash tables by 
iterating this internal table. Since it would cause a significant 
slowdown, -fmem-report would be better as a compile-time option than a 
run-time one. This is probably overkill, so I think I'll skip it.



Thanks,
Dimitris



2012-05-21 Dimitrios Apostolou 

Print various statistics about hash tables when called with
-fmem-report. If the tables are created once use
htab_dump_statistics(), if they are created/destroyed multiple
times then introduce global variables to track statistics.

* cgraph.c, cgraph.h:
(cgraph_dump_stats): New function to dump stats about hash tables.
* gcc/symtab.c, cgraph.h:
(symtab_dump_stats): New function to dump stats about hash tables.
* cgraph.c: (call_site_hash_{num,expands,searches,collisions}):
New globals to keep statistics about call_site_hash hash tables.
(cgraph_remove_node_callees,cgraph_remove_node): if (mem_report)
then keep statistics about hash tables.
* cselib.c, cselib.h (cselib_dump_stats): New function to dump
stats about cselib_hash_table.
* cselib.c (cselib_htab_{num,expands,searches,collisions}): New
globals to keep hash table statistics.
(cselib_finish): if (mem_report) keep hash table statistics.
* dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if
-fmem_report.
* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
dump statistics about mem_attrs_htab hash table.
* tree.h, tree-ssa.c (tree_ssa_dump_stats): New function to print
statistics about all referenced_vars and default_defs hash tables.
* tree-ssa.c (default_defs_{num,expands,searches,collisions})
(referenced_vars_{num,expands,searches,collisions}): new globals
to keep statistics about hash tables.
(delete_tree_ssa): Keep statistics for hash tables by
increasing the new global variables.
* tree.c (dump_tree_statistics): Call tree_ssa_dump_stats().
(print_type_hash_statistics): Used the new htab_dump_statistics()
function.
* var-tracking.c (vars_htab_{num,expands,searches,collisions})
(dropval_htab_{num,expands,searches,collisions})
(cv_htab_{num,expands,searches,collisions}): new globals to keep
hash table statistics.
(shared_hash_destroy, vt_finalize): Keep statistics by
increasing values of new global variables if -fmem-report.
* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
stats about vars->htab, dropped_variables and value_chains hash
tables.
* toplev.c: Included cselib.h for cselib_dump_stats().
(dump_memory_report): Call all the above functions to provide
better statistics.

* hashtab.c, hashtab.h: Added "expands" variable to htab_t for
tracking total times the hash table was expanded.
* hashtab.c, hashtab.h (htab_dump_statistics, htab_collisions_num)
	(htab_searches_num, htab_expands_num): New functions for 
statistics.

* hashtab.c: Included assert.h for checks in htab_dump_statistics.
* cgraph.h, varpool.c (varpool_dump_stats): New function to dump
stats about varpool_hash hash table.

* libcpp/symtab.c, symtab.h: Added "expands" variable to
hash_table type for tracking total times the hash table was
expanded.
* symtab.c (ht_dump_statistics): Beautified stats output.



Number of expanded macros: 23397
Average number of tokens per macro expansion: 10

Line Table allocations during the compilation process
Number of ordinary maps used:  576 
Ordinary map used size: 13k
Number of ordinary maps allocated:1365 
Ordinary maps allocated size:   31k
Number of macro maps used:  20k
Macro 

[DF] Generate REFs in REGNO order

2012-05-20 Thread Dimitrios Apostolou

Hello list,

I'm resubmitting this patch from last year's GSOC, which speeds up 
compilation by avoiding thousands of calls to qsort(). I measured its 
impact on compilation speed again, this time compiling (cc1) gcc's 
reload.c on i386:


orig:   0.734s
patched:0.720s


Tested on i686, ppc64. No regressions.
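
To illustrate the mechanism with a toy example (my own sketch, not the GCC
code): when the refs come out already sorted by regno, a single linear
pass detects that and the qsort() call is skipped entirely:

#include <stdlib.h>

static int
ref_cmp (const void *a, const void *b)
{
  unsigned ra = *(const unsigned *) a, rb = *(const unsigned *) b;
  return (ra > rb) - (ra < rb);
}

/* Sort only when an out-of-order pair is actually found.  */
static void
sort_refs_if_needed (unsigned *regnos, size_t n)
{
  size_t i;
  for (i = 1; i < n; i++)
    if (regnos[i - 1] > regnos[i])
      {
        qsort (regnos, n, sizeof *regnos, ref_cmp);
        return;
      }
  /* Already sorted: the common case when DEFs were recorded in
     REGNO order.  */
}

int main (void)
{
  unsigned sorted[] = { 0, 2, 5, 9 };   /* linear pass only, no qsort */
  unsigned unsorted[] = { 5, 2, 9, 0 }; /* falls back to qsort */
  sort_refs_if_needed (sorted, 4);
  sort_refs_if_needed (unsorted, 4);
  return 0;
}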

Paolo: I couldn't find a single test-case where the mw_reg_pool was 
heavily used so I reduced its size. You think it's OK for all archs?



2012-05-20  Dimitrios Apostolou  
Paolo Bonzini  

Provide almost 2% speedup on -O0 compilations by generating the
DF_REF_BASE register defs in REGNO order, so that collection_rec
is already sorted and most qsort() calls are avoided. In detail:

(df_def_record_1): Assert a parallel must contain an EXPR_LIST at
this point.  Receive the LOC and move its extraction...
(df_defs_record): ... here. Rewrote logic with a switch statement
instead of multiple if-else.
(df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions
that duplicate the logic of df_defs_record() and df_def_record_1()
but without actually recording any DEFs, only marking them in
the defs HARD_REG_SET.
(df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs that
are the result of the call. Record DF_REF_BASE DEFs in REGNO
order. Use HARD_REG_SET instead of bitmap for
regs_invalidated_by_call. Changed defs_generated from bitmap to
HARD_REG_SET, it's much faster.
(df_insn_refs_collect): Record DF_REF_REGULAR DEFs after
df_get_call_refs().
(df_scan_alloc): Rounded up allocation pools size, reduced the
mw_reg_pool size, it was unnecessarily large.



Thanks,
Dimitris
=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2012-02-29 08:12:04 +
+++ gcc/df-scan.c   2012-05-20 15:46:06 +
@@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c
   rtx, rtx *,
   basic_block, struct df_insn_info *,
   enum df_ref_type, int ref_flags);
-static void df_def_record_1 (struct df_collection_rec *, rtx,
+static void df_def_record_1 (struct df_collection_rec *, rtx *,
 basic_block, struct df_insn_info *,
 int ref_flags);
 static void df_defs_record (struct df_collection_rec *, rtx,
@@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 {
   struct df_scan_problem_data *problem_data;
   unsigned int insn_num = get_max_uid () + 1;
-  unsigned int block_size = 400;
+  unsigned int block_size = 512;
   basic_block bb;
 
   /* Given the number of pools, this is really faster than tearing
@@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 sizeof (struct df_reg_info), block_size);
   problem_data->mw_reg_pool
 = create_alloc_pool ("df_scan mw_reg",
-sizeof (struct df_mw_hardreg), block_size);
+sizeof (struct df_mw_hardreg), block_size / 16);
 
   bitmap_obstack_initialize (&problem_data->reg_bitmaps);
   bitmap_obstack_initialize (&problem_data->insn_bitmaps);
@@ -2917,40 +2917,27 @@ df_read_modify_subreg_p (rtx x)
 }
 
 
-/* Process all the registers defined in the rtx, X.
+/* Process all the registers defined in the rtx pointed by LOC.
Autoincrement/decrement definitions will be picked up by
df_uses_record.  */
 
 static void
 df_def_record_1 (struct df_collection_rec *collection_rec,
- rtx x, basic_block bb, struct df_insn_info *insn_info,
+ rtx *loc, basic_block bb, struct df_insn_info *insn_info,
 int flags)
 {
-  rtx *loc;
-  rtx dst;
-
- /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL
- construct.  */
-  if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER)
-loc = &XEXP (x, 0);
-  else
-loc = &SET_DEST (x);
-  dst = *loc;
+  rtx dst = *loc;
 
   /* It is legal to have a set destination be a parallel. */
   if (GET_CODE (dst) == PARALLEL)
 {
   int i;
-
   for (i = XVECLEN (dst, 0) - 1; i >= 0; i--)
{
  rtx temp = XVECEXP (dst, 0, i);
- if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER
- || GET_CODE (temp) == SET)
-   df_def_record_1 (collection_rec,
- temp, bb, insn_info,
-GET_CODE (temp) == CLOBBER
-? flags | DF_REF_MUST_CLOBBER : flags);
+ gcc_assert (GET_CODE (temp) == EXPR_LIST);
+ df_def_record_1 (collection_rec, &XEXP (temp, 0),
+  bb, insn_info, flags);
}
   return;
 }
@@ -3004,26 +2991,98 @@ df_defs_record (struct df_collection_rec
int flags)
 {
   RTX_CODE code = GET

Re: [PATCH] Fix Linux/sparc build after generic asm output optimizations.

2011-11-12 Thread Dimitrios Apostolou

Hi,

On Sat, 12 Nov 2011, Eric Botcazou wrote:


We just need to declare it in system.h in order to use the definition in
libiberty.


OK, this should be fine.



Do the patches I sent for bug #51094 solve the problems?

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51094


Thanks,
Dimitris




Re: [PATCH] Fix Linux/sparc build after generic asm output optimizations.

2011-11-11 Thread Dimitrios Apostolou

Hi David,

I couldn't imagine such breakage... If too many platforms break, perhaps we 
should undo the optimisation - see the attached patch.



Thanks,
Dimitris


P.S. see also bug #51094 I've attached some more fixes
=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h  2011-10-30 01:45:46 +
+++ gcc/config/elfos.h  2011-11-12 02:51:39 +
@@ -125,9 +125,6 @@ see the files COPYING3 and COPYING.RUNTI
 }  \
   while (0)
 
-#undef TARGET_ASM_INTERNAL_LABEL
-#define TARGET_ASM_INTERNAL_LABEL default_elf_internal_label
-
 /* Output the label which precedes a jumptable.  Note that for all svr4
systems where we actually generate jumptables (which is to say every
svr4 target except i386, where we use casesi instead) we put the jump-


Re: repost: [DF] Use HARD_REG_SETs instead of bitmaps

2011-11-06 Thread Dimitrios Apostolou

On Mon, 7 Nov 2011, Jakub Jelinek wrote:

On Mon, Nov 07, 2011 at 12:01:29AM +0200, Dimitrios Apostolou wrote:

On Sun, 6 Nov 2011, Joern Rennecke wrote:

But where HARD_REG_SETS make no material difference in speed, and the
compilation unit has no other tight coupling with tm.h, it would really
be cleaner to move from HARD_REG_SETS to a target-independent type,
like sbitmap or bitmap.  Maybe we want something more lightweight than
an sbitmap, like passing in a HARD_REG_ELT_TYPE *base, int n_elements
pair to describe the data - n_elements would be set to HARD_REG_SET_LONGS.
When you are doing anything time-critical with hard register sets,
there is little point having the same size stored in multiple places,
you might as well keep it in a register and pass it around.
Basic operations could presumably be inlined, so the main overhead would
be loop overhead.  With the right loop unrolling applied, these loops
should be well predictable.


Working with bitmaps in gcc made me really miss the simple bitmap.h
of the linux kernel. I'm not sure if we can borrow from there. More
info at:

http://lxr.linux.no/#linux+v3.1/include/linux/bitmap.h


Hm, what else is hard-reg-set.h?


It's about the same performance-wise, but:

* much uglier because of all the #ifdef cruft
* less useful since it can't be used as a generic bitmap
** Closest generic bitmap is sbitmap.h, but IMHO it's more complex than it 
should be.


I'd love to have the kernel bitmap.h in libiberty.



When you are dealing with pseudos etc.
sparse bitmaps are very often better.


Indeed. They are much better memory-wise, but they have a non-negligible 
overhead performance-wise. I've noted somewhere to replace all hot bitmaps 
with ebitmaps and see what happens, whenever I have the time...



Dimitris



Re: repost: [DF] Use HARD_REG_SETs instead of bitmaps

2011-11-06 Thread Dimitrios Apostolou

On Sun, 6 Nov 2011, Joern Rennecke wrote:

But where HARD_REG_SETS make no material difference in speed, and the
compilation unit has no other tight coupling with tm.h, it would really
be cleaner to move from HARD_REG_SETS to a target-independent type,
like sbitmap or bitmap.  Maybe we want something more lightweight than
an sbitmap, like passing in a HARD_REG_ELT_TYPE *base, int n_elements
pair to describe the data - n_elements would be set to HARD_REG_SET_LONGS.
When you are doing anything time-critical with hard register sets,
there is little point having the same size stored in multiple places,
you might as well keep it in a register and pass it around.
Basic operations could presumably be inlined, so the main overhead would
be loop overhead.  With the right loop unrolling applied, these loops
should be well predictable.


Working with bitmaps in gcc made me really miss the simple bitmap.h of 
the linux kernel. I'm not sure if we can borrow from there. More info at:


http://lxr.linux.no/#linux+v3.1/include/linux/bitmap.h


Dimitris


P.S. we badly need a gxr site :-)


Re: [dwarf2out, elfos] Output assembly faster

2011-10-30 Thread Dimitrios Apostolou
Hello list, I updated this patch to latest trunk and also incorporated 
apinski's suggestion to use stpcpy() instead of a custom loop. 
Bootstrapped and tested again on i386, no regressions. The comments from 
my previous email still apply, I never got a reply:


On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:

Hello again,

most of this patch was posted at the beginning of my GSOC adventure and 
primarily is about replacing fprintf() calls with custom faster ones. From 
that time I changed minor things you suggested, omitted some complexities 
that offered minor speed-up, and made the code as clear as possible, while 
following more closely coding guidelines.


Bootstrapped and tested on both i386 and x86_64, showing runtime improvement 
when compiling with debug info enabled.


The only thing I am not sure is the differentiation between HOST_WIDE_INT and 
long, so I provided two versions of the same function. Please let me know if 
I should handle this differently.



Future speedup in assembly generation *is possible* but requires more 
intrusive changes, maybe making the assembly hard for a human to read:

* Always use hex numbers, they are much faster to produce (see the sketch 
after this list)
* Call the GNU assembler with the -s parameter, though it's pretty hard to 
be compliant.
* Even further in the future we could generate binary data, if we *know* the 
assembler is GAS.
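
A rough standalone sketch of why hex wins (my own illustration, in the
same digits-in-reverse style as the patch's helpers): each hex digit costs
one shift and one mask, whereas decimal needs a divide and a modulo per
digit:

#include <stdio.h>

/* Write V in hex into S, least significant digit first; returns the
   digit count.  The output IS NOT null-terminated, as in the patch.  */
static int
sprint_hex_rev (char *s, unsigned long v)
{
  static const char digits[] = "0123456789abcdef";
  int i = 0;
  do
    {
      s[i++] = digits[v & 0xf]; /* low nibble: a mask */
      v >>= 4;                  /* next nibble: a shift */
    }
  while (v != 0);
  return i;
}

int main (void)
{
  char buf[16];
  int n = sprint_hex_rev (buf, 0xdeadbeefUL);
  while (n--)
    putchar (buf[n]);
  putchar ('\n'); /* prints deadbeef */
  return 0;
}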




Slightly more descriptive changelog:

2011-08-12  Dimitrios Apostolou  

* final.c, output.h (fprint_whex, fprint_w, fprint_ul, sprint_ul):
New functions serving as fast replacements for fprintf() integer
to string conversions. They were used in the following changes.
* final.c (sprint_ul_rev): Internal helper for the above.
(output_addr_const): case CONST_INT: don't use fprintf().
* elfos.h (ASM_GENERATE_INTERNAL_LABEL): Don't use sprintf("%u"),
use sprint_ul() and stpcpy() which are much faster.
(TARGET_ASM_INTERNAL_LABEL): Define as default_elf_internal_label.
(ELF_ASCII_ESCAPES, ELF_STRING_LIMIT): Are the old ESCAPES and
STRING_LIMIT macros.
(ASM_OUTPUT_LIMITED_STRING, ASM_OUTPUT_ASCII): Macros now just
call respective functions that provide the same
functionality. Those are default_elf_asm_output_limited_string()
and default_elf_asm_output_ascii() in varasm.c.
* varasm.c: Fixed some whitespace inconsistencies.
(default_elf_asm_output_limited_string)
(default_elf_asm_output_ascii): The above macros from elfos.h are
implemented here as these functions, avoiding superfluous calls to
fprintf().
(default_elf_internal_label): Hook for
targetm.asm_out.internal_label and ASM_OUTPUT_DEBUG_LABEL.
* i386.c: Don't call fprintf("%u") but fprint_ul() instead.
* defaults.h (ASM_OUTPUT_LABEL, ASM_OUTPUT_INTERNAL_LABEL):
Expanded the macros on multiple lines for readability.
(ASM_OUTPUT_LABELREF): Have two calls to fputs() instead of one to
asm_fprintf().
* dwarf2asm.c (dw2_assemble_integer, dw2_asm_output_data)
(dw2_asm_data_uleb128, dw2_asm_delta_uleb128)
(dw2_asm_delta_sleb128): Convert fprintf() calls to the new
faster functions.
* dwarf2out.c (dwarf2out_source_line): Convert fprintf() calls to
the new faster functions.


I had measured during the summer ([1] before, [2] after; if necessary I'll 
refresh the measurements) that this gives a good 2-3% improvement, especially 
when debug info is enabled. The major obstacle was that it made the code less 
readable, but I think the new function names are more understandable.


[1] 
http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=4.6.0-baseline_prof.txt
[2] 
http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=4.6.0-changed1_prof.txt

Jakub, you had raised concerns that it's better to avoid this optimisation 
completely and make gcc always optimise fprintf("%d") calls, to keep the 
code readable. Do you think that's the case still?


I appreciate all comments,
Dimitris


P.S. Cc'd i386 backend maintainers=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h  2010-11-21 00:54:14 +
+++ gcc/config/elfos.h  2011-10-29 22:57:35 +
@@ -117,10 +117,17 @@ see the files COPYING3 and COPYING.RUNTI
 #define ASM_GENERATE_INTERNAL_LABEL(LABEL, PREFIX, NUM)\
   do   \
 {  \
-  sprintf (LABEL, "*.%s%u", PREFIX, (unsigned) (NUM)); \
+  char *__p;   \
+  (LABEL)[0] = '*';\
+  (LABEL)[1] = '.';

Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-10-08 Thread Dimitrios Apostolou

Hello all,

I received my GSOC t-shirt yesterday which reminded me I have a promise to 
keep... After realising that it can take forever to find enough free time 
to work on GCC, I decided to work a couple of hours whenever I can and 
post updates to my patches as time permits. Hopefully some of them will 
make it into 4.7?


On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:


For the record I'm posting here the final version of this patch, in case it 
gets applied. It adds minor stylistic fixes, plus a small change in 
alloc_pool sizes. Any further testing I do will be posted under this thread.


The previously posted Changelog applies, with the following addition:

(df_scan_alloc): Rounded up allocation pools size, reduced the
mw_reg_pool size, it was unnecessarily large.

Paolo, did I assume correctly that the mw_reg_pool is significantly smaller 
than the rest? That was the case on i386; I assumed it would be similar on 
other archs as well.




The attached patch (df2b.diff, exactly the same as the one in the parent 
email) applies successfully to the latest gcc snapshot. In addition to 
previous testing (i386, x86_64) I've just finished testing on 
sparc-linux-gnu at the GCC compile farm with no regressions. Finally, I 
think Steven's tests on IA64 went OK. Wasn't testing the only thing 
holding this patch back?


On sparc the runtime of compiling df-scan.c seems to have been reduced from 
34s to 33s user time, for a debug build 
(--enable-checking=assert,misc,runtime,rtl,df). But measurements are too 
flaky since the node is busy.




The complete changelog is the following:

2011-07-29  Dimitrios Apostolou  
Paolo Bonzini  

(df_def_record_1): Assert a parallel must contain an EXPR_LIST at
this point.  Receive the LOC and move its extraction...
(df_defs_record): ... here. Rewrote logic with a switch statement
instead of multiple if-else.
(df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions
that duplicate the logic of df_defs_record() and df_def_record_1()
but without actually recording any DEFs, only marking them in
the defs HARD_REG_SET.
(df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs that
are the result of the call. Record DF_REF_BASE DEFs in REGNO
order. Use regs_invalidated_by_call HARD_REG_SET instead of
regs_invalidated_by_call_regset bitmap.
(df_insn_refs_collect): Record DF_REF_REGULAR DEFs after
df_get_call_refs().
(df_scan_alloc): Rounded up allocation pools size, reduced the
mw_reg_pool size, it was unnecessarily large.


Thanks,
Dimitris=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-08-22 15:17:18 +
@@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c
   rtx, rtx *,
   basic_block, struct df_insn_info *,
   enum df_ref_type, int ref_flags);
-static void df_def_record_1 (struct df_collection_rec *, rtx,
+static void df_def_record_1 (struct df_collection_rec *, rtx *,
 basic_block, struct df_insn_info *,
 int ref_flags);
 static void df_defs_record (struct df_collection_rec *, rtx,
@@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 {
   struct df_scan_problem_data *problem_data;
   unsigned int insn_num = get_max_uid () + 1;
-  unsigned int block_size = 400;
+  unsigned int block_size = 512;
   basic_block bb;
 
   /* Given the number of pools, this is really faster than tearing
@@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 sizeof (struct df_reg_info), block_size);
   problem_data->mw_reg_pool
 = create_alloc_pool ("df_scan mw_reg",
-sizeof (struct df_mw_hardreg), block_size);
+sizeof (struct df_mw_hardreg), block_size / 16);
 
   bitmap_obstack_initialize (&problem_data->reg_bitmaps);
   bitmap_obstack_initialize (&problem_data->insn_bitmaps);
@@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x)
 }
 
 
-/* Process all the registers defined in the rtx, X.
+/* Process all the registers defined in the rtx pointed by LOC.
Autoincrement/decrement definitions will be picked up by
df_uses_record.  */
 
 static void
 df_def_record_1 (struct df_collection_rec *collection_rec,
- rtx x, basic_block bb, struct df_insn_info *insn_info,
+ rtx *loc, basic_block bb, struct df_insn_info *insn_info,
 int flags)
 {
-  rtx *loc;
-  rtx dst;
-
- /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL
- construct.  */
-  if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER)
-loc = &XEXP (x, 0);
-  else
-loc = &SET_DEST (x);
-  dst = *loc;
+  rtx dst = *loc;
 
   /* It is legal to have a se

Re: [var-tracking] small speed-ups

2011-08-23 Thread Dimitrios Apostolou

On Tue, 23 Aug 2011, Jakub Jelinek wrote:

On Tue, Aug 23, 2011 at 02:40:56PM +0300, Dimitrios Apostolou wrote:


  dst->vars = (shared_hash) pool_alloc (shared_hash_pool);
  dst->vars->refcount = 1;
  dst->vars->htab
-= htab_create (MAX (src1_elems, src2_elems), variable_htab_hash,
+= htab_create (2 * MAX (src1_elems, src2_elems), variable_htab_hash,
   variable_htab_eq, variable_htab_free);


This looks wrong, 2 * max is definitely too much.


For a hash table to fit N elements, it has to have at least 4/3*N
slots, or 2*N slots if htab has the 50% load factor I was proposing.


For var-tracking, 50% load factor is IMHO a very bad idea, memory
consumption of var-tracking is already high right now, we IMHO don't have
the luxury to waste more RAM.


Agreed; then I'll change it to 4/3 * MAX so that I avoid expansions.
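
As a quick sanity check of the arithmetic (my own sketch; libiberty's htab
expands once n_elements * 4 >= size * 3, i.e. at a 3/4 load factor):

#include <stdio.h>

/* To insert N elements with no expansion, ask for the smallest size
   with size * 3 > N * 4 when creating the table.  */
static size_t
slots_for (size_t n_elements)
{
  return n_elements * 4 / 3 + 1;
}

int main (void)
{
  size_t n = 3000; /* e.g. MAX (src1_elems, src2_elems) */
  printf ("need >= %zu slots for %zu elements (vs 2 * N = %zu)\n",
          slots_for (n), n, 2 * n);
  return 0;
}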


In total for dataflow_set_destroy I can see that calls to
attrs_list_clear() have been reduced from 500K to 250K, and I can
also see a reduction of free() calls from htab_delete(), from 30K to


free calls?  If you avoid free calls, then you end up with a memory leak.
I can understand when you avoid pool_free calls...


You are right, I just mentioned the total difference in free() calls from 
all my patches. But in this part there is no free() involved, so the major 
gain should be from avoiding htab_delete() iterating many times over the 
hash tables. Annotated source from the callgrind profiler (shows 
instruction count):


Before:

 .  void
15,820,597  htab_delete (htab_t htab)
   250,914  {
41,819size_t size = htab_size (htab);
41,819PTR *entries = htab->entries;
 .int i;
 .
83,638if (htab->del_f)
66,493,884  for (i = size - 1; i >= 0; i--)
49,825,998if (entries[i] != HTAB_EMPTY_ENTRY && entries[i] != 
HTAB_DELETED_ENTRY)
 1,813,018  (*htab->del_f) (entries[i]);
 1,800,731  => tree-into-ssa.c:def_blocks_free (10988x)
   345,910  => tree-into-ssa.c:repl_map_free (2815x)
53,950  => tree-scalar-evolution.c:del_scev_info (1197x)
   139,354  => tree-ssa-loop-im.c:memref_free (198x)
   281,777  => tree-ssa-sccvn.c:free_phi (3788x)
81,726  => tree-ssa-uncprop.c:equiv_free (463x)
17,359,572  => var-tracking.c:variable_htab_free (835512x)
42,161  => cfgloop.c:loop_exit_free (720x)
   284,904  => ira-costs.c:cost_classes_del (2270x)
   157  => tree-ssa-loop-im.c:vtoe_free (1x)
11,454,221  => ???:free (37228x)
 1,460,684  => tree-ssa-sccvn.c:free_reference (11329x)
 .
   125,457if (htab->free_f != NULL)



After:

 .  void
 6,543,474  htab_delete (htab_t htab)
   250,914  {
41,819size_t size = htab_size (htab);
41,819PTR *entries = htab->entries;
 .int i;
 .
83,638if (htab->del_f)
29,288,268  for (i = size - 1; i >= 0; i--)
21,927,330    if (entries[i] != HTAB_EMPTY_ENTRY && entries[i] != HTAB_DELETED_ENTRY)
 1,738,584  (*htab->del_f) (entries[i]);
   139,344  => tree-ssa-loop-im.c:memref_free (198x)
   157  => tree-ssa-loop-im.c:vtoe_free (1x)
81,762  => tree-ssa-uncprop.c:equiv_free (463x)
   281,884  => tree-ssa-sccvn.c:free_phi (3788x)
42,145  => cfgloop.c:loop_exit_free (720x)
   345,870  => tree-into-ssa.c:repl_map_free (2815x)
 1,462,000  => tree-ssa-sccvn.c:free_reference (11329x)
 1,800,951  => tree-into-ssa.c:def_blocks_free (10988x)
16,518,004  => var-tracking.c:variable_htab_free (824010x)
   284,979  => ira-costs.c:cost_classes_del (2270x)
 9,080,245  => ???:free (11513x)
53,921  => tree-scalar-evolution.c:del_scev_info (1197x)
 .
   125,457if (htab->free_f != NULL)



So if the part relevant to this patch is the number of calls to 
variable_htab_free, I suppose it won't make a big difference.


But if I take a look at the top callers and top callees of htab_delete():

Before:

 19,590,149  < tree-ssa-structalias.c:solve_constraints (2058x) [cc1]
 33,299,064  < var-tracking.c:shared_hash_destroy (6923x) [cc1]
 68,941,719  < tree-ssa-pre.c:execute_pre (2058x) [cc1]
134,915,334  *  hashtab.c:htab_delete [cc1]
 17,359,572  >   var-tracking.c:variable_htab_free (835512x) [cc1]
 11,454,221  >   ???:free (108456x) [libc-2.12.so]
  1,800,731  >   tree-into-ssa.c:def_blocks_free (10988x) [cc1]


After:

  8,756,328  < tree-ssa-structalias.c:delete_points_to_sets (1029x) [cc1]
  9,316,165  < tree-ssa-dom.c:tree_ssa_dominator_optimize (562x) [cc1]
 83,696,211  < var-tracking.c:shared_hash_destroy (6923x) [cc1]
 60,459,493  *  hashtab.c:htab_delete [cc1]
 16,518,004  >   var-tracking.c:variable_htab_free (824010x) [cc1]
  9,080,245  >   ???:free (82741x) [libc-2.12.so]
  1,800,951  >   tree-into-ssa.c:def_blocks_free (10988x) [cc1]


Then I'm more confused, htab_delete() seems to be much colder, pr

Re: [var-tracking] small speed-ups

2011-08-23 Thread Dimitrios Apostolou

Hi jakub,

On Mon, 22 Aug 2011, Jakub Jelinek wrote:

On Mon, Aug 22, 2011 at 01:30:33PM +0300, Dimitrios Apostolou wrote:


@@ -1191,7 +1189,7 @@ dv_uid2hash (dvuid uid)
 static inline hashval_t
 dv_htab_hash (decl_or_value dv)
 {
-  return dv_uid2hash (dv_uid (dv));
+  return (hashval_t) (dv_uid (dv));
 }


Why?  dv_uid2hash is an inline that does exactly that.


@@ -1202,7 +1200,7 @@ variable_htab_hash (const void *x)
 {
   const_variable const v = (const_variable) x;

-  return dv_htab_hash (v->dv);
+  return (hashval_t) (dv_uid (v->dv));
 }


Why?


 /* Compare the declaration of variable X with declaration Y.  */
@@ -1211,9 +1209,8 @@ static int
 variable_htab_eq (const void *x, const void *y)
 {
   const_variable const v = (const_variable) x;
-  decl_or_value dv = CONST_CAST2 (decl_or_value, const void *, y);

-  return (dv_as_opaque (v->dv) == dv_as_opaque (dv));
+  return (v->dv) == y;
 }


Why?


I was hoping you'd ask, so I'll ask back :-) Why are we doing it that way?
Why so much indirection in the first place? Why create inline functions
just to typecast, and why do we need this CONST_CAST2 ugliness in C code? I
bet there are things I don't understand, so I'd be happy to listen...


The reason I did this simplification (and many more I didn't publish)
within var-tracking is that it hurt my brain to follow the logic. Even
with the help of TAGS I have a limited stack depth before I forget where I
began diving into TAG declarations. Well, in var-tracking this limit was
surpassed by far...





@@ -1397,19 +1398,40 @@ shared_var_p (variable var, shared_hash
  || shared_hash_shared (vars));
 }

+/* Copy all variables from hash table SRC to hash table DST without rehashing
+   any values.  */
+
+static htab_t
+htab_dup (htab_t src)
+{
+  htab_t dst;
+
+  dst = (htab_t) xmalloc (sizeof (*src));
+  memcpy (dst, src, sizeof (*src));
+  dst->entries = (void **) xmalloc (src->size * sizeof (*src->entries));
+  memcpy (dst->entries, src->entries,
+ src->size * sizeof (*src->entries));
+  return dst;
+}
+


This certainly doesn't belong here, it should go into libiberty/hashtab.c
and prototype into include/hashtab.h.  It relies on hashtab.c
implementation details.


OK, I'll do that in the future. Should I also move some other htab
functions I saw in var-tracking and rtl? FOR_EACH_HTAB_ELEMENT comes to
mind, and probably others too.





@@ -2034,7 +2041,8 @@ val_resolve (dataflow_set *set, rtx val,
 static void
 dataflow_set_init (dataflow_set *set)
 {
-  init_attrs_list_set (set->regs);
+  /* Initialize the set (array) SET of attrs to empty lists.  */
+  memset (set->regs, 0, sizeof (set->regs));
   set->vars = shared_hash_copy (empty_shared_hash);
   set->stack_adjust = 0;
   set->traversed_vars = NULL;


I'd say you should instead just implement init_attrs_list_set inline using
memset.


It's used only once, that's why I deleted the function. I'll bring it back 
if you think it helps.
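For reference, the inline variant Jakub suggests could be as small as this
(a sketch, assuming SET points at the FIRST_PSEUDO_REGISTER-sized regs
array):

  /* Initialize the set (array) SET of attrs to empty lists.  */
  static inline void
  init_attrs_list_set (attrs *set)
  {
    memset (set, 0, FIRST_PSEUDO_REGISTER * sizeof (*set));
  }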





   dst->vars = (shared_hash) pool_alloc (shared_hash_pool);
   dst->vars->refcount = 1;
   dst->vars->htab
-= htab_create (MAX (src1_elems, src2_elems), variable_htab_hash,
+= htab_create (2 * MAX (src1_elems, src2_elems), variable_htab_hash,
   variable_htab_eq, variable_htab_free);


This looks wrong, 2 * max is definitely too much.


For a hash table to fit N elements, it has to have at least 4/3*N 
slots, or 2*N slots if htab has the 50% load factor I was proposing.



@@ -8996,11 +9006,13 @@ vt_finalize (void)

   FOR_ALL_BB (bb)
 {
-  dataflow_set_destroy (&VTI (bb)->in);
-  dataflow_set_destroy (&VTI (bb)->out);
+  /* The "false" do_free parameter means to not bother to iterate and free
+all hash table elements, since we'll destroy the pools. */
+  dataflow_set_destroy (&VTI (bb)->in, false);
+  dataflow_set_destroy (&VTI (bb)->out, false);
   if (VTI (bb)->permp)
{
- dataflow_set_destroy (VTI (bb)->permp);
+ dataflow_set_destroy (VTI (bb)->permp, false);
  XDELETE (VTI (bb)->permp);
}
 }



How much does this actually speed things up (not freeing pool-allocated
stuff during finalization)?  Is it really worth it?


In total for dataflow_set_destroy I can see that calls to
attrs_list_clear() have been reduced from 500K to 250K, and I can also see
a reduction of free() calls from htab_delete(), from 30K to 10K. I'm
willing to bet that much of this is because of this change. I have kept
only the changes that showed a difference, and I remember clearly that
var-tracking iterates over hash tables too much, either directly or
via htab_traverse()/htab_delete().



Thanks,
Dimitris



Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-08-22 Thread Dimitrios Apostolou

On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:

Hi Steven,

On Mon, 1 Aug 2011, Steven Bosscher wrote:
On Sun, Jul 31, 2011 at 11:59 PM, Steven Bosscher  
wrote:
On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher  
wrote:

I'll see if I can test the patch on the compile farm this weekend,
just to be sure.


Bootstrap on
ia64-unknown-linux-gnu is in stage2 but it is taking forever (on
gcc60)...


Just to be clear, it is taking forever without the patch too. I did
time -O2 non-bootstrap builds but there was no difference worth
mentioning.


Did the testsuite finish with no regressions on IA64?

I'd test on a sparcstation but unfortunately the machine crashed before
finishing and I can't regain access to it. I'll hopefully test it in some
other platform by next week, I'm curious to actually measure runtime
there.


For the record I'm posting here the final version of this patch, in case 
it gets applied. It adds minor stylistic fixes, plus a small change in 
alloc_pool sizes. Any further testing I do will be posted under this 
thread.


 The previously posted Changelog applies, with the following addition:

(df_scan_alloc): Rounded up the allocation pool sizes; reduced the
mw_reg_pool size, which was unnecessarily large.

Paolo, did I assume correctly that the mw_reg_pool is significantly
smaller than the rest? That was the case on i386, and I assumed it would be
similar on other architectures as well.



Thanks,
Dimitris=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-08-22 15:17:18 +
@@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c
   rtx, rtx *,
   basic_block, struct df_insn_info *,
   enum df_ref_type, int ref_flags);
-static void df_def_record_1 (struct df_collection_rec *, rtx,
+static void df_def_record_1 (struct df_collection_rec *, rtx *,
 basic_block, struct df_insn_info *,
 int ref_flags);
 static void df_defs_record (struct df_collection_rec *, rtx,
@@ -318,7 +318,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 {
   struct df_scan_problem_data *problem_data;
   unsigned int insn_num = get_max_uid () + 1;
-  unsigned int block_size = 400;
+  unsigned int block_size = 512;
   basic_block bb;
 
   /* Given the number of pools, this is really faster than tearing
@@ -347,7 +347,7 @@ df_scan_alloc (bitmap all_blocks ATTRIBU
 sizeof (struct df_reg_info), block_size);
   problem_data->mw_reg_pool
 = create_alloc_pool ("df_scan mw_reg",
-sizeof (struct df_mw_hardreg), block_size);
+sizeof (struct df_mw_hardreg), block_size / 16);
 
   bitmap_obstack_initialize (&problem_data->reg_bitmaps);
   bitmap_obstack_initialize (&problem_data->insn_bitmaps);
@@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x)
 }
 
 
-/* Process all the registers defined in the rtx, X.
+/* Process all the registers defined in the rtx pointed by LOC.
Autoincrement/decrement definitions will be picked up by
df_uses_record.  */
 
 static void
 df_def_record_1 (struct df_collection_rec *collection_rec,
- rtx x, basic_block bb, struct df_insn_info *insn_info,
+ rtx *loc, basic_block bb, struct df_insn_info *insn_info,
 int flags)
 {
-  rtx *loc;
-  rtx dst;
-
- /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL
- construct.  */
-  if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER)
-loc = &XEXP (x, 0);
-  else
-loc = &SET_DEST (x);
-  dst = *loc;
+  rtx dst = *loc;
 
   /* It is legal to have a set destination be a parallel. */
   if (GET_CODE (dst) == PARALLEL)
 {
   int i;
-
   for (i = XVECLEN (dst, 0) - 1; i >= 0; i--)
{
  rtx temp = XVECEXP (dst, 0, i);
- if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER
- || GET_CODE (temp) == SET)
-   df_def_record_1 (collection_rec,
- temp, bb, insn_info,
-GET_CODE (temp) == CLOBBER
-? flags | DF_REF_MUST_CLOBBER : flags);
+ gcc_assert (GET_CODE (temp) == EXPR_LIST);
+ df_def_record_1 (collection_rec, &XEXP (temp, 0),
+  bb, insn_info, flags);
}
   return;
 }
@@ -3003,26 +2990,98 @@ df_defs_record (struct df_collection_rec
int flags)
 {
   RTX_CODE code = GET_CODE (x);
+  int i;
 
-  if (code == SET || code == CLOBBER)
-{
-  /* Mark the single def within the pattern.  */
-  int clobber_flags = flags;
-  clobber_flags |= (code == CLOBBER) ? DF_REF_MUST_CLOBBER : 0;
-  df_def_record_1 (collection_rec, x, bb, insn_info, clobber_flags);
-}
-  else if (code == COND_EXEC)
+  switch (code)
 {
+

Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-08-22 Thread Dimitrios Apostolou

Hi Steven,

On Mon, 1 Aug 2011, Steven Bosscher wrote:

On Sun, Jul 31, 2011 at 11:59 PM, Steven Bosscher  wrote:

On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher  wrote:

I'll see if I can test the patch on the compile farm this weekend,
just to be sure.


Bootstrap on
ia64-unknown-linux-gnu is in stage2 but it is taking forever (on
gcc60)...


Just to be clear, it is taking forever without the patch too. I did
time -O2 non-bootstrap builds but there was no difference worth
mentioning.


Did the testsuite finish with no regressions on IA64?

I'd test on a sparcstation but unfortunately the machine crashed before
finishing and I can't regain access to it. I'll hopefully test it in some
other platform by next week, I'm curious to actually measure runtime
there.


Thanks for helping!
Dimitris


[dwarf2out, elfos] Output assembly faster

2011-08-22 Thread Dimitrios Apostolou

Hello again,

most of this patch was posted at the beginning of my GSOC adventure and is
primarily about replacing fprintf() calls with custom, faster ones. Since
then I have changed minor things you suggested, omitted some complexities
that offered only minor speed-ups, and made the code as clear as possible,
while following the coding guidelines more closely.


Bootstrapped and tested on both i386 and x86_64, showing runtime 
improvement when compiling with debug info enabled.


The only thing I am not sure is the differentiation between HOST_WIDE_INT 
and long, so I provided two versions of the same function. Please let me 
know if I should handle this differently.



Future speedup in assembly generation *is possible* but requires more
intrusive changes, maybe making the assembly hard for humans to read:

* Always use hex numbers, they are much faster to produce
* Call GNU assembler with -s parameter, though it's pretty hard to be 
compliant.
* Even further in the future we could generate binary data, if we *know* 
the assembler is GAS.



Changelog:

2011-08-12  Dimitrios Apostolou  

* final.c, output.h (fprint_whex, fprint_w, fprint_ul, sprint_ul):
New functions serving as fast replacements for fprintf() integer
to string conversions. They were used in the following changes.
* final.c (sprint_ul_rev): Internal helper for the above.
(output_addr_const): case CONST_INT: don't use fprintf().
* elfos.h (ASM_GENERATE_INTERNAL_LABEL): Don't use sprintf("%u"),
use sprint_ul() and custom loop which are much faster.
(TARGET_ASM_INTERNAL_LABEL): Define as default_elf_internal_label.
(ELF_ASCII_ESCAPES, ELF_STRING_LIMIT): Are the old ESCAPES and
STRING_LIMIT macros.
(ASM_OUTPUT_LIMITED_STRING, ASM_OUTPUT_ASCII): Macros now just
call respective functions that provide now the same
functionality. Those are default_elf_asm_output_limited_string()
and default_elf_asm_output_ascii() in varasm.c.
* varasm.c: Fixed some whitespace inconsistencies.
(default_elf_asm_output_limited_string)
(default_elf_asm_output_ascii): The above macros from elfos.h are
implemented here, avoiding superfluous calls to fprintf().
(default_elf_internal_label): Hook for
targetm.asm_out.internal_label and ASM_OUTPUT_DEBUG_LABEL.
* i386.c: Don't call fprintf("%u") but fprint_ul() instead.
* defaults.h (ASM_OUTPUT_LABEL, ASM_OUTPUT_INTERNAL_LABEL):
Expanded the macros on multiple lines for readability.
(ASM_OUTPUT_LABELREF): Have two calls to fputs() instead of one to
asm_fprintf().
* dwarf2asm.c (dw2_assemble_integer, dw2_asm_output_data)
(dw2_asm_data_uleb128, dw2_asm_delta_uleb128)
(dw2_asm_delta_sleb128): Convert fprintf() calls to faster,
simpler functions.
* dwarf2out.c (dwarf2out_source_line): Convert fprintf() calls to
faster, simpler functions.



Thanks,
Dimitris
=== modified file 'gcc/config/elfos.h'
--- gcc/config/elfos.h  2010-11-21 00:54:14 +
+++ gcc/config/elfos.h  2011-08-22 11:39:43 +
@@ -117,10 +117,23 @@ see the files COPYING3 and COPYING.RUNTI
 #define ASM_GENERATE_INTERNAL_LABEL(LABEL, PREFIX, NUM)\
   do   \
 {  \
-  sprintf (LABEL, "*.%s%u", PREFIX, (unsigned) (NUM)); \
+  int __i = 2; \
+  int __j = 0; \
+  (LABEL)[0] = '*';\
+  (LABEL)[1] = '.';\
+  do   \
+   {   \
+ (LABEL)[__i] = (PREFIX)[__j]; \
+ __i++; __j++; \
+   }   \
+  while ((PREFIX)[__j] != '\0');   \
+  sprint_ul (&(LABEL)[__i], (unsigned long) (NUM));\
 }  \
   while (0)
 
+#undef TARGET_ASM_INTERNAL_LABEL
+#define TARGET_ASM_INTERNAL_LABEL default_elf_internal_label
+
 /* Output the label which precedes a jumptable.  Note that for all svr4
systems where we actually generate jumptables (which is to say every
svr4 target except i386, where we use casesi instead) we put the jump-
@@ -371,7 +384,7 @@ see the files COPYING3 and COPYING.RUNTI
the i386) don't know about that.  Also, we don't use \v
since some versions of gas, such as 2.2 did not accept it.  */
 
-#define ESCAPES \
+#define ELF_ASCII_ESCAPES \
 "\1\

Re: Dump stats about hottest hash tables when -fmem-report

2011-08-22 Thread Dimitrios Apostolou
I should note here that specialised hash-tables in pointer-set.c have a 
load-factor of at most 25%. Also another very fast hash table I've 
studied, dense_hash_map from google's sparse_hash_table, has a load factor 
of 50% max.


As I understand it a good hash function gives a perfectly random value up 
to N. So if we have only 25% of slots empty, like we do with today's 75% 
load factor, chances we won't have collision are small. And collisions 
cost more for an open-addressing hash table.


But for implementing a closed-addressing hash-table with linked lists for 
collision resolution, we'd need allocator pools within libiberty. Is it 
OK to move alloc-pool.[ch] into libiberty as a first step?
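To make that concrete, a closed-addressing table would chain colliding
entries per bucket, with the chain nodes drawn from a pool; a hypothetical
node type (an illustration only, not existing libiberty code):

  /* One entry in a bucket's collision chain; allocating these from an
     alloc_pool lets htab_delete release them all at once.  */
  struct chained_entry
  {
    hashval_t hash;               /* cached hash value of VALUE */
    void *value;                  /* user data */
    struct chained_entry *next;   /* next entry in the same bucket */
  };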



Dimitris




[var-tracking] [not-good!] disable shared_hash and other simplifications

2011-08-22 Thread Dimitrios Apostolou

Hello,

the attached patch applies after my previous one, and actually cancels all 
runtime gains from it. It doesn't make things worse than initially, so 
it's not *that* bad.


While trying to understand var-tracking I deleted the whole shared hash
table concept and some other indirections. It didn't make things much
worse, and it makes me think that getting rid of other indirections, like
the reference counting of shared variables, would actually help. But it got
too complicated and I had to give up.


While I don't understand variable tracking, I couldn't help but notice 
some things that could be discussed:


* This is the part of the compiler that makes the heaviest use of hash
tables. htab_traverse(), htab_delete(), FOR_EACH_HTAB_ELEMENT (should that
be moved into libiberty?) all iterate way too many times over hash tables.
This is all because decl/value UIDs are sparse.
* Since var-tracking happens per-function, is there a way to produce
per-function UIDs and use those to index a regular array instead of a
sparse hash table? These UIDs could be produced at the tree/rtx level, and
I bet other parts of the compiler could make good use of them.
* Another alternative would be to store variables in a vector, sorted by
UID value (see the sketch after this list). That way checking membership
would be O(log n), and merging would be a very simple operation. Still,
merging dataflow sets is way too complex to follow, so I didn't manage to
change this.
* From a very wide field of view, all this DF solving reminded me a lot of
what I've seen in df-*.c. Why can't variable tracking be integrated into
the main DF pass of the compiler, the one that happens *after* register
allocation?
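A sketch of the sorted-vector lookup mentioned above (hypothetical names;
the real var-tracking types differ):

  /* Find the variable with UID in a VEC sorted by increasing UID;
     O(log n) per lookup, and merging two sets becomes a linear walk.  */
  static variable
  find_var_by_uid (VEC (variable, heap) *vars, unsigned int uid)
  {
    int lo = 0, hi = (int) VEC_length (variable, vars) - 1;
    while (lo <= hi)
      {
        int mid = lo + (hi - lo) / 2;
        variable v = VEC_index (variable, vars, mid);
        unsigned int vuid = (unsigned int) dv_uid (v->dv);
        if (vuid == uid)
          return v;
        else if (vuid < uid)
          lo = mid + 1;
        else
          hi = mid - 1;
      }
    return NULL;
  }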



Thanks,
Dimitris
=== modified file 'gcc/var-tracking.c'
--- gcc/var-tracking.c  2011-08-22 08:27:47 +
+++ gcc/var-tracking.c  2011-08-22 10:36:15 +
@@ -233,17 +233,6 @@ typedef struct attrs_def
   HOST_WIDE_INT offset;
 } *attrs;
 
-/* Structure holding a refcounted hash table.  If refcount > 1,
-   it must be first unshared before modified.  */
-typedef struct shared_hash_def
-{
-  /* Reference count.  */
-  int refcount;
-
-  /* Actual hash table.  */
-  htab_t htab;
-} *shared_hash;
-
 /* Structure holding the IN or OUT set for a basic block.  */
 typedef struct dataflow_set_def
 {
@@ -253,11 +242,11 @@ typedef struct dataflow_set_def
   /* Attributes for registers (lists of attrs).  */
   attrs regs[FIRST_PSEUDO_REGISTER];
 
-  /* Variable locations.  */
-  shared_hash vars;
+  /* Variable locations, stored in a hash table.  */
+  htab_t vars;
 
   /* Vars that is being traversed.  */
-  shared_hash traversed_vars;
+  htab_t traversed_vars;
 } dataflow_set;
 
 /* The structure (one for each basic block) containing the information
@@ -379,9 +368,6 @@ static alloc_pool valvar_pool;
 /* Alloc pool for struct location_chain_def.  */
 static alloc_pool loc_chain_pool;
 
-/* Alloc pool for struct shared_hash_def.  */
-static alloc_pool shared_hash_pool;
-
 /* Alloc pool for struct value_chain_def.  */
 static alloc_pool value_chain_pool;
 
@@ -395,7 +381,7 @@ static htab_t value_chains;
 static bool emit_notes;
 
 /* Empty shared hashtable.  */
-static shared_hash empty_shared_hash;
+static htab_t empty_htab;
 
 /* Scratch register bitmap used by cselib_expand_value_rtx.  */
 static bitmap scratch_regs = NULL;
@@ -478,7 +464,7 @@ static void **delete_slot_part (dataflow
 static void delete_variable_part (dataflow_set *, rtx,
  decl_or_value, HOST_WIDE_INT);
 static int emit_note_insn_var_location (void **, void *);
-static void emit_notes_for_changes (rtx, enum emit_note_where, shared_hash);
+static void emit_notes_for_changes (rtx, enum emit_note_where, htab_t);
 static int emit_notes_for_differences_1 (void **, void *);
 static int emit_notes_for_differences_2 (void **, void *);
 static void emit_notes_for_differences (rtx, dataflow_set *, dataflow_set *);
@@ -1370,164 +1356,30 @@ attrs_list_mpdv_union (attrs *dstp, attr
 }
 }
 
-/* Shared hashtable support.  */
-
-/* Return true if VARS is shared.  */
-
-static inline bool
-shared_hash_shared (shared_hash vars)
-{
-  return vars->refcount > 1;
-}
-
-/* Return the hash table for VARS.  */
-
-static inline htab_t
-shared_hash_htab (shared_hash vars)
-{
-  return vars->htab;
-}
-
 /* Return true if VAR is shared, or maybe because VARS is shared.  */
 
 static inline bool
-shared_var_p (variable var, shared_hash vars)
+shared_var_p (variable var)
 {
   /* Don't count an entry in the changed_variables table as a duplicate.  */
-  return ((var->refcount > 1 + (int) var->in_changed_variables)
- || shared_hash_shared (vars));
+  return (var->refcount - (int) var->in_changed_variables) > 1;
 }
 
-/* Copy all variables from hash table SRC to hash table DST without rehashing
-   any values.  */
-
-static htab_t
-htab_dup (htab_t src)
-{
-  htab_t dst;
-
-  dst = (htab_t) xmalloc (sizeof (*src));
-  memcpy (dst, src, sizeof (*src));
-  dst->entries = (void **) xmalloc (src->s

[var-tracking] small speed-ups

2011-08-22 Thread Dimitrios Apostolou

Hello list,

this patch has all of my changes to var-tracking that I believe are worth
keeping. These are all minor changes not touching algorithmic issues; I
lost much time trying to understand what var-tracking.c actually does, but
I still don't get it. I wish there were a document describing the current
implementation. I'd appreciate all the help I can get here.


Bootstrapped/tested on x86_64.

Thanks to lxo for helping me, hopefully his plan for better var-tracking 
throughout all optimisation passes will be implemented. This is the only 
var-tracking related doc I could find... ( 
http://gcc.gnu.org/wiki/Var_Tracking_Assignments).


This patch also includes minor stylistic changes that made the code just a
tiny bit more accessible to me; there was so much indirection that it
hardly reminded me of C. Let me know if I should remove these parts. :-s
For the sake of completeness I'll also post a follow-up patch where I
delete/simplify a big part of var-tracking, unfortunately with some impact
on performance.



2011-08-22  Dimitrios Apostolou  

* var-tracking.c (init_attrs_list_set): Remove function, instead
use a memset() call to zero the attr list in...
(dataflow_set_init).
(vars_copy): Remove function because inserting each element into a
new hash table was too costly. Replaced with the ...
(htab_dup): ... new function that only does a memcpy() of the
element table in htab_t, without rehashing any elements.
(shared_hash_unshare): Replace the vars_copy() call with
htab_dup(), plus do a little extra work (reference counting) which
was in vars_copy.
(shared_hash_destroy, dataflow_set_destroy): Add an extra
"do_free" bool argument, to avoid iterating over hash tables
freeing elements, when not needed.
(vt_finalize, vt_emit_notes): Call the above with do_free=false
since all pools will be freed later.
(dataflow_set_clear, dataflow_set_copy, dataflow_set_union)
(dataflow_set_merge, dataflow_set_different, compute_bb_dataflow)
(vt_find_locations): Call shared_hash_destroy with do_free=true.
(attrs_list_copy): Do not free destination list but reuse already
allocated elements if possible.
=== modified file 'gcc/var-tracking.c'
--- gcc/var-tracking.c  2011-06-03 01:42:31 +
+++ gcc/var-tracking.c  2011-08-22 09:52:08 +
@@ -414,7 +414,6 @@ static hashval_t variable_htab_hash (con
 static int variable_htab_eq (const void *, const void *);
 static void variable_htab_free (void *);
 
-static void init_attrs_list_set (attrs *);
 static void attrs_list_clear (attrs *);
 static attrs attrs_list_member (attrs, decl_or_value, HOST_WIDE_INT);
 static void attrs_list_insert (attrs *, decl_or_value, HOST_WIDE_INT, rtx);
@@ -423,7 +422,6 @@ static void attrs_list_union (attrs *, a
 
 static void **unshare_variable (dataflow_set *set, void **slot, variable var,
enum var_init_status);
-static void vars_copy (htab_t, htab_t);
 static tree var_debug_decl (tree);
 static void var_reg_set (dataflow_set *, rtx, enum var_init_status, rtx);
 static void var_reg_delete_and_set (dataflow_set *, rtx, bool,
@@ -447,7 +445,7 @@ static bool variable_part_different_p (v
 static bool onepart_variable_different_p (variable, variable);
 static bool variable_different_p (variable, variable);
 static bool dataflow_set_different (dataflow_set *, dataflow_set *);
-static void dataflow_set_destroy (dataflow_set *);
+static void dataflow_set_destroy (dataflow_set *, bool);
 
 static bool contains_symbol_ref (rtx);
 static bool track_expr_p (tree, bool);
@@ -1069,14 +1067,14 @@ adjust_insn (basic_block bb, rtx insn)
 static inline bool
 dv_is_decl_p (decl_or_value dv)
 {
-  return !dv || (int) TREE_CODE ((tree) dv) != (int) VALUE;
+  return !dv || ((int) TREE_CODE ((tree) dv) != (int) VALUE);
 }
 
 /* Return true if a decl_or_value is a VALUE rtl.  */
 static inline bool
 dv_is_value_p (decl_or_value dv)
 {
-  return dv && !dv_is_decl_p (dv);
+  return !dv_is_decl_p (dv);
 }
 
 /* Return the decl in the decl_or_value.  */
@@ -1092,7 +1090,7 @@ static inline rtx
 dv_as_value (decl_or_value dv)
 {
   gcc_checking_assert (dv_is_value_p (dv));
-  return (rtx)dv;
+  return (rtx) dv;
 }
 
 /* Return the opaque pointer in the decl_or_value.  */
@@ -1191,7 +1189,7 @@ dv_uid2hash (dvuid uid)
 static inline hashval_t
 dv_htab_hash (decl_or_value dv)
 {
-  return dv_uid2hash (dv_uid (dv));
+  return (hashval_t) (dv_uid (dv));
 }
 
 /* The hash function for variable_htab, computes the hash value
@@ -1202,7 +1200,7 @@ variable_htab_hash (const void *x)
 {
   const_variable const v = (const_variable) x;
 
-  return dv_htab_hash (v->dv);
+  return (hashval_t) (dv_uid (v->dv));
 }
 
 /* Compare the declaration of variable X with declaration Y.  */
@@ -1211,9 +1209,8 @@ static int
 variable_htab_eq (const voi

Re: tree-ssa-structalias.c: alloc_pool for struct equiv_class_label

2011-08-22 Thread Dimitrios Apostolou

On Mon, 22 Aug 2011, Richard Guenther wrote:


On Mon, Aug 22, 2011 at 9:46 AM, Dimitrios Apostolou  wrote:


2011-08-22  Dimitrios Apostolou  

       * tree-ssa-structalias.c (equiv_class_add)
       (perform_var_substitution, free_var_substitution_info): Created a
       new equiv_class_pool allocator pool for struct
       equiv_class_label. Changed the pointer_equiv_class_table and
       location_equiv_class_table hash tables to not iterate freeing all
       elements in the end, but just free the pool.


Did you check if the hash functions have ever called free()?  If so why
not use the pool free function so that entries can get re-used?  If not,
the natural allocator would be an obstack instead.


I have not found any relevant call of htab_clear_slot(). I didn't consider
obstacks at all for these cases; thanks for telling me, I'll see where I
can use them. As I've noted, I have bootstrapped and tested all these
changes at least on x86_64 with release checking enabled, but I plan to
test and measure all my changes together later, and hopefully on other
platforms in the future.



Thanks,
Dimitris


Re: Dump stats about hottest hash tables when -fmem-report

2011-08-22 Thread Dimitrios Apostolou

Hello,

I'm attaching here the final version of the statistics dumping; I think I
made only stylistic/insignificant changes. Tested on i386 and x86_64. I
have withheld the hash table size changes, so please apply if everything
looks good.


Would you agree to a future patch that would make such statistics be
computed only when the gather-mem-stats compile-time variable is set?



On Sat, 20 Aug 2011, Richard Guenther wrote:

On Fri, Aug 19, 2011 at 6:40 PM, Tom Tromey  wrote:


Your patch to change the symbol table's load factor is fine technically.
I think the argument for putting it in is lacking; what I would like to
see is either some rationale explaining that the increased memory use is
not important, or some numbers showing that it still performs well on
more than a single machine.  My reason for wanting this is just that,
historically, GCC has been very sensitive to increases in memory use.
Alternatively, comments from more active maintainers indicating that
they don't care about this would also help your case.

I can't approve or reject the libiberty change, just the libcpp one.


Yes, memory usage is as important as compile-time.  We still have testcases
that show a vast imbalance of them.  I don't know if the symbol table hash
is ever the problem, but changing the generic load factor in libiberty doesn't
sound a good idea - maybe instead have a away of specifying that factor
per hashtable instance.  Or, as usual, improve the hash function to
reduce the collision rate and/or to make re-hashing cheaper.


Regarding hash table size, I will back off from hashtab.c. I really
believe that, even though it makes a difference, the proper solution would
be a much more intrusive one: a complete reimplementation.


If we primarily care about memory usage we should use a closed-addressing
hash table (with a linked list in each bucket) and set the load factor
high. If we care about cache efficiency and speed we should use an
open-addressing hash table, like the one we have, but decrease the load
factor and resolve collisions by quadratic probing instead of rehashing.
We should also make sure our hash table is always power-of-2 sized, and
that hash values are always computed using good, proven hash functions (we
use the Bob Jenkins v2 hash; we should upgrade to v3, even though v2 seems
very good *when* we actually do use it). That would probably mean changing
the interface to disallow custom hash values, and while we are at it I'd
really like to see new functions dedicated to searching and inserting.
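A sketch of what the probe loop could become with a power-of-2 table
(illustration only, not the current hashtab.c code):

  /* SIZE is a power of 2.  Growing the step by one per probe yields
     triangular-number offsets, which are known to visit every slot of
     a power-of-2 table, so no separate rehash is ever needed.  */
  size_t mask = size - 1;
  size_t idx = hash & mask;
  size_t step = 1;
  while (entries[idx] != HTAB_EMPTY_ENTRY)
    {
      if (entries[idx] != HTAB_DELETED_ENTRY
          && (*eq_f) (entries[idx], key))
        return &entries[idx];
      idx = (idx + step) & mask;
      step++;
    }
  return &entries[idx];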


I've experimented with some of these changes, but individually they do not
offer a significant speed-up. I believe the gains will only show if the
hash tables are completely reimplemented.


But regarding symtab.c, I only see benefits. Collisions are reduced and
the extra memory usage is negligible. For a hash table with 32K elements
and 64K slots, the overhead of empty slots is 32K*8 = 256KB; if the table
were 75% full it would be 128KB. Actual measurements on i386 show
negligible overhead; it could be luck that the final table size is the
same (tables are always power-of-2 sized, so it's possible). Anyway, I
hope to test on other platforms within September and report back.


A change that would probably help for symtab would be to store custom
string structs that include the string length, so that we avoid several
strlen() calls. Another could be to not support deletion of strings (since
what we are actually doing is string interning, isn't it?). That would
make the open-addressing hash table much simpler, and I think there is
only one place where we actually delete strings. I have to look further
into this one.
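For instance, a hypothetical entry layout carrying the length (not the
current symtab.c representation):

  /* Interned string with its length stored alongside, so lookups can
     compare lengths first and then memcmp, never calling strlen.  */
  struct lenstring
  {
    unsigned int len;
    const unsigned char *str;
  };

Two entries are then equal iff their len fields match and memcmp over len
bytes returns zero.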



All comments welcome,
Dimitris




Changelog:

2011-08-22  Dimitrios Apostolou  

* cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump
stats about cgraph_hash hash table.
* cselib.c, cselib.h (cselib_dump_stats): New function to dump
stats about cselib_hash_table.
* cselib.c (cselib_finish): Keep statistics by increasing values
of new global variables
cselib_htab_{num,expands,searches,collisions} if -fmem-report.
* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
dump statistics about mem_attrs_htab hash table.
* tree.c (print_type_hash_statistics): Used the new
htab_dump_statistics() function.
* var-tracking.c (shared_hash_destroy): Keep statistics by
increasing values of new global variables
vars_htab_{num,expands,searches,collisions} if -fmem-report.
(vt_finalize): Keep statistics by increasing values of new global
variables cv_htab_{num,expands,searches,collisions} and
vc_htab_{num,expands,searches,collisions} if -fmem-report.
* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
stats about vars->htab, changed_variables and value_chains hash
tables.
 

cse.c: preferable()

2011-08-22 Thread Dimitrios Apostolou
The attached patch is also posted at bug #19832 and I think it resolves
it, as well as /maybe/ offering a negligible speedup of 3-4M instructions,
or a couple of milliseconds. I also post it here for comments.



2011-08-13  Dimitrios Apostolou  

* cse.c (preferable): Make it more readable and slightly faster,
without affecting its logic.

=== modified file 'gcc/cse.c'
--- gcc/cse.c   2011-06-02 21:52:46 +
+++ gcc/cse.c   2011-08-13 00:54:06 +
@@ -720,32 +720,25 @@ approx_reg_cost (rtx x)
 static int
 preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
 {
-  /* First, get rid of cases involving expressions that are entirely
- unwanted.  */
-  if (cost_a != cost_b)
-{
-  if (cost_a == MAX_COST)
-   return 1;
-  if (cost_b == MAX_COST)
-   return -1;
-}
+  int cost_diff = cost_a - cost_b;
+  int regcost_diff = regcost_a - regcost_b;
 
-  /* Avoid extending lifetimes of hardregs.  */
-  if (regcost_a != regcost_b)
+  if (cost_diff != 0)
 {
-  if (regcost_a == MAX_COST)
-   return 1;
-  if (regcost_b == MAX_COST)
-   return -1;
+  /* If neither of the expressions is entirely unwanted */
+  if ((cost_a != MAX_COST) && (cost_b != MAX_COST)
+ /* AND only one of the regs is HARD_REG */
+ && (regcost_diff != 0)
+ && ((regcost_a == MAX_COST) || (regcost_b == MAX_COST))
+ )
+   /* Then avoid extending lifetime of HARD_REG */
+   return regcost_diff;
+
+  return cost_diff;
 }
 
-  /* Normal operation costs take precedence.  */
-  if (cost_a != cost_b)
-return cost_a - cost_b;
-  /* Only if these are identical consider effects on register pressure.  */
-  if (regcost_a != regcost_b)
-return regcost_a - regcost_b;
-  return 0;
+  /* cost_a == cost_b, consider effects on register pressure */
+  return regcost_diff;
 }
 
 /* Internal function, to compute cost when X is not a register; called



Re: Various minor speed-ups

2011-08-22 Thread Dimitrios Apostolou
For whoever is concerned about memory usage: I didn't measure a real
increase beyond a few KB. These are very hot allocation pools, and
allocating many blocks of only 10 elements each is suboptimal.




2011-08-22  Dimitrios Apostolou  

* cselib.c (cselib_init): Increased initial size of elt_list_pool,
elt_loc_list_pool, cselib_val_pool, value_pool allocation pools
since they are very frequently used.


=== modified file 'gcc/cselib.c'
--- gcc/cselib.c2011-05-31 19:14:21 +
+++ gcc/cselib.c2011-08-17 14:03:56 +
@@ -2484,12 +2484,12 @@ void
 cselib_init (int record_what)
 {
   elt_list_pool = create_alloc_pool ("elt_list",
-sizeof (struct elt_list), 10);
+sizeof (struct elt_list), 128);
   elt_loc_list_pool = create_alloc_pool ("elt_loc_list",
-sizeof (struct elt_loc_list), 10);
+sizeof (struct elt_loc_list), 128);
   cselib_val_pool = create_alloc_pool ("cselib_val_list",
-  sizeof (cselib_val), 10);
-  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 100);
+  sizeof (cselib_val), 128);
+  value_pool = create_alloc_pool ("value", RTX_CODE_SIZE (VALUE), 128);
   cselib_record_memory = record_what & CSELIB_RECORD_MEMORY;
   cselib_preserve_constants = record_what & CSELIB_PRESERVE_CONSTANTS;
 



Re: mem_attrs_htab

2011-08-22 Thread Dimitrios Apostolou

Hi Jakub,

I forgot to mention that all patches are against mid-July trunk; I was
hoping I'd have no conflicts. Anyway, thanks for letting me know. If there
are conflicts with my other patches, please let me know and I'll post an
updated version at a later date.


All your other concerns are valid and I'll try addressing them in the 
future. I didn't like hashing addresses either, and I was surprised I saw 
no regressions.



Dimitris




This patch isn't against the trunk, where p->offset and p->size aren't rtxes
anymore, but HOST_WIDE_INTs.  Furthermore, it is a bad idea to hash
the p->expr address itself, it doesn't make any sense to hash on what
p->expr points to in that case.  And p->offset and p->size should be ignored
if the *known_p corresponding fields are false.  So, if you really think
using iterative_hash_object is a win, it should be something like:
 mem_attrs q = *p;
 q.expr = NULL;
 if (!q.offset_known_p) q.offset = 0;
 if (!q.size_known_p) q.size = 0;
 return iterative_hash_object (q, iterative_hash_expr (p->expr, 0));
(or better yet avoid q.expr = NULL and instead start hashing from the next
field after expr).  Hashing the struct padding might not be a good idea
either.

Jakub



Re: tree-ssa-structalias.c: alloc_pool for struct equiv_class_label

2011-08-22 Thread Dimitrios Apostolou

Forgot the patch...

On Mon, 22 Aug 2011, Dimitrios Apostolou wrote:


2011-08-22  Dimitrios Apostolou  

* tree-ssa-structalias.c (equiv_class_add)
(perform_var_substitution, free_var_substitution_info): Created a
new equiv_class_pool allocator pool for struct
equiv_class_label. Changed the pointer_equiv_class_table and
location_equiv_class_table hash tables to not iterate freeing all
elements in the end, but just free the pool.


=== modified file 'gcc/tree-ssa-structalias.c'
--- gcc/tree-ssa-structalias.c  2011-04-29 10:59:33 +
+++ gcc/tree-ssa-structalias.c  2011-08-18 06:53:12 +
@@ -1899,6 +1899,9 @@ static htab_t pointer_equiv_class_table;
classes.  */
 static htab_t location_equiv_class_table;
 
+/* Pool of memory for storing the above */
+static alloc_pool equiv_class_pool;
+
 /* Hash function for a equiv_class_label_t */
 
 static hashval_t
@@ -1948,7 +1951,8 @@ equiv_class_add (htab_t table, unsigned 
 bitmap labels)
 {
   void **slot;
-  equiv_class_label_t ecl = XNEW (struct equiv_class_label);
+  equiv_class_label_t ecl
+= (equiv_class_label_t) pool_alloc (equiv_class_pool);
 
   ecl->labels = labels;
   ecl->equivalence_class = equivalence_class;
@@ -2159,10 +2163,14 @@ perform_var_substitution (constraint_gra
   struct scc_info *si = init_scc_info (size);
 
   bitmap_obstack_initialize (&iteration_obstack);
+  equiv_class_pool = create_alloc_pool ("equiv_class_label pool",
+   sizeof (struct equiv_class_label),
+   64);
+  /* NULL free function, we'll free the whole pool at the end of the pass. */
   pointer_equiv_class_table = htab_create (511, equiv_class_label_hash,
-  equiv_class_label_eq, free);
+  equiv_class_label_eq, NULL);
   location_equiv_class_table = htab_create (511, equiv_class_label_hash,
-   equiv_class_label_eq, free);
+   equiv_class_label_eq, NULL);
   pointer_equiv_class = 1;
   location_equiv_class = 1;
 
@@ -2269,6 +2277,7 @@ free_var_substitution_info (struct scc_i
   sbitmap_free (graph->direct_nodes);
   htab_delete (pointer_equiv_class_table);
   htab_delete (location_equiv_class_table);
+  free_alloc_pool (equiv_class_pool);
   bitmap_obstack_release (&iteration_obstack);
 }
 



Re: Various minor speed-ups

2011-08-22 Thread Dimitrios Apostolou


2011-08-22  Dimitrios Apostolou  

* tree-ssa-pre.c (phi_trans_add, init_pre, fini_pre): Added a pool
for phi_translate_table elements to avoid free() calls from
htab_delete().

=== modified file 'gcc/tree-ssa-pre.c'
--- gcc/tree-ssa-pre.c  2011-05-04 09:04:53 +
+++ gcc/tree-ssa-pre.c  2011-08-17 08:43:23 +
@@ -515,6 +515,10 @@ typedef struct expr_pred_trans_d
 } *expr_pred_trans_t;
 typedef const struct expr_pred_trans_d *const_expr_pred_trans_t;
 
+/* Pool of memory for the above */
+
+static alloc_pool phi_translate_pool;
+
 /* Return the hash value for a phi translation table entry.  */
 
 static hashval_t
@@ -571,7 +575,8 @@ static inline void
 phi_trans_add (pre_expr e, pre_expr v, basic_block pred)
 {
   void **slot;
-  expr_pred_trans_t new_pair = XNEW (struct expr_pred_trans_d);
+  expr_pred_trans_t new_pair
+= (expr_pred_trans_t) pool_alloc (phi_translate_pool);
   new_pair->e = e;
   new_pair->pred = pred;
   new_pair->v = v;
@@ -580,7 +585,8 @@ phi_trans_add (pre_expr e, pre_expr v, b
 
   slot = htab_find_slot_with_hash (phi_translate_table, new_pair,
   new_pair->hashcode, INSERT);
-  free (*slot);
+  if (*slot)
+pool_free (phi_translate_pool, *slot);
   *slot = (void *) new_pair;
 }
 
@@ -4804,8 +4810,12 @@ init_pre (bool do_fre)
   calculate_dominance_info (CDI_DOMINATORS);
 
   bitmap_obstack_initialize (&grand_bitmap_obstack);
+  phi_translate_pool = create_alloc_pool ("phi_translate_table pool",
+ sizeof (struct expr_pred_trans_d),
+ 4096);
+  /* NULL as free because we'll free the whole pool in the end. */
   phi_translate_table = htab_create (5110, expr_pred_trans_hash,
-expr_pred_trans_eq, free);
+expr_pred_trans_eq, NULL);
   expression_to_id = htab_create (num_ssa_names * 3,
  pre_expr_hash,
  pre_expr_eq, NULL);
@@ -4839,6 +4849,7 @@ fini_pre (bool do_fre)
   free_alloc_pool (bitmap_set_pool);
   free_alloc_pool (pre_expr_pool);
   htab_delete (phi_translate_table);
+  free_alloc_pool (phi_translate_pool);
   htab_delete (expression_to_id);
   VEC_free (unsigned, heap, name_to_id);
 



tree-ssa-structalias.c: alloc_pool for struct equiv_class_label

2011-08-22 Thread Dimitrios Apostolou


2011-08-22  Dimitrios Apostolou  

* tree-ssa-structalias.c (equiv_class_add)
(perform_var_substitution, free_var_substitution_info): Created a
new equiv_class_pool allocator pool for struct
equiv_class_label. Changed the pointer_equiv_class_table and
location_equiv_class_table hash tables to not iterate freeing all
elements in the end, but just free the pool.




tree-ssa*: reduce malloc() calls by preallocating hot VECs on the stack

2011-08-22 Thread Dimitrios Apostolou


2011-08-22  Dimitrios Apostolou  

Allocate some very frequently used vectors on the stack:
* vecir.h: Defined a tree vector on the stack.
* tree-ssa-sccvn.c (print_scc, sort_scc, process_scc)
(extract_and_process_scc_for_name): Allocate the scc vector on the
stack instead of the heap, giving it a minimal initial size
instead of 0.
* tree-ssa-structalias.c (get_constraint_for_1)
(get_constraint_for, get_constraint_for_rhs, do_deref)
(get_constraint_for_ssa_var, get_constraint_for_ptr_offset)
(get_constraint_for_component_ref, get_constraint_for_address_of)
(process_all_all_constraints, do_structure_copy)
(make_constraints_to, make_constraint_to, handle_rhs_call)
(handle_lhs_call, handle_const_call, handle_pure_call)
(find_func_aliases_for_builtin_call, find_func_aliases_for_call)
(find_func_aliases, process_ipa_clobber, find_func_clobbers)
(create_variable_info_for): Converted the rhsc, lhsc vectors from
heap to stack, with a minimal initial size, since they were very
frequently allocated.
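For illustration, the usage pattern this enables looks roughly like so (a
hedged sketch; the exact stack-VEC calls are assumed from vec.h, not
quoted from the patch):

  /* The initial 4 slots live on the stack via alloca; the vector only
     falls back to heap storage if it outgrows them.  */
  VEC (ce_s, stack) *rhsc = VEC_alloc (ce_s, stack, 4);
  /* ... collect constraint expressions into rhsc ... */
  VEC_free (ce_s, stack, rhsc);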
=== modified file 'gcc/tree-ssa-structalias.c'
--- gcc/tree-ssa-structalias.c  2011-08-18 06:53:12 +
+++ gcc/tree-ssa-structalias.c  2011-08-19 09:43:41 +
@@ -477,11 +477,14 @@ struct constraint_expr
 
 typedef struct constraint_expr ce_s;
 DEF_VEC_O(ce_s);
-DEF_VEC_ALLOC_O(ce_s, heap);
-static void get_constraint_for_1 (tree, VEC(ce_s, heap) **, bool, bool);
-static void get_constraint_for (tree, VEC(ce_s, heap) **);
-static void get_constraint_for_rhs (tree, VEC(ce_s, heap) **);
-static void do_deref (VEC (ce_s, heap) **);
+DEF_VEC_ALLOC_O_STACK(ce_s);
+#define VEC_ce_s_stack_alloc(alloc) \
+  VEC_stack_alloc (ce_s, alloc)
+
+static void get_constraint_for_1 (tree, VEC(ce_s, stack) **, bool, bool);
+static void get_constraint_for (tree, VEC(ce_s, stack) **);
+static void get_constraint_for_rhs (tree, VEC(ce_s, stack) **);
+static void do_deref (VEC (ce_s, stack) **);
 
 /* Our set constraints are made up of two constraint expressions, one
LHS, and one RHS.
@@ -2736,7 +2739,7 @@ new_scalar_tmp_constraint_exp (const cha
If address_p is true, the result will be taken its address of.  */
 
 static void
-get_constraint_for_ssa_var (tree t, VEC(ce_s, heap) **results, bool address_p)
+get_constraint_for_ssa_var (tree t, VEC(ce_s, stack) **results, bool address_p)
 {
   struct constraint_expr cexpr;
   varinfo_t vi;
@@ -2776,12 +2779,12 @@ get_constraint_for_ssa_var (tree t, VEC(
   for (; vi; vi = vi->next)
{
  cexpr.var = vi->id;
- VEC_safe_push (ce_s, heap, *results, &cexpr);
+ VEC_safe_push (ce_s, stack, *results, &cexpr);
}
   return;
 }
 
-  VEC_safe_push (ce_s, heap, *results, &cexpr);
+  VEC_safe_push (ce_s, stack, *results, &cexpr);
 }
 
 /* Process constraint T, performing various simplifications and then
@@ -2861,7 +2864,7 @@ bitpos_of_field (const tree fdecl)
 
 static void
 get_constraint_for_ptr_offset (tree ptr, tree offset,
-  VEC (ce_s, heap) **results)
+  VEC (ce_s, stack) **results)
 {
   struct constraint_expr c;
   unsigned int j, n;
@@ -2920,7 +2923,7 @@ get_constraint_for_ptr_offset (tree ptr,
  c2.type = ADDRESSOF;
  c2.offset = 0;
  if (c2.var != c.var)
-   VEC_safe_push (ce_s, heap, *results, &c2);
+   VEC_safe_push (ce_s, stack, *results, &c2);
  temp = temp->next;
}
  while (temp);
@@ -2955,7 +2958,7 @@ get_constraint_for_ptr_offset (tree ptr,
  c2.var = temp->next->id;
  c2.type = ADDRESSOF;
  c2.offset = 0;
- VEC_safe_push (ce_s, heap, *results, &c2);
+ VEC_safe_push (ce_s, stack, *results, &c2);
}
  c.var = temp->id;
  c.offset = 0;
@@ -2974,7 +2977,7 @@ get_constraint_for_ptr_offset (tree ptr,
as the lhs.  */
 
 static void
-get_constraint_for_component_ref (tree t, VEC(ce_s, heap) **results,
+get_constraint_for_component_ref (tree t, VEC(ce_s, stack) **results,
  bool address_p, bool lhs_p)
 {
   tree orig_t = t;
@@ -2999,7 +3002,7 @@ get_constraint_for_component_ref (tree t
   temp.offset = 0;
   temp.var = integer_id;
   temp.type = SCALAR;
-  VEC_safe_push (ce_s, heap, *results, &temp);
+  VEC_safe_push (ce_s, stack, *results, &temp);
   return;
 }
 
@@ -3021,7 +3024,7 @@ get_constraint_for_component_ref (tree t
temp.offset = 0;
temp.var = anything_id;
temp.type = ADDRESSOF;
-   VEC_safe_push (ce_s, heap, *results, &temp);
+   VEC_safe_push (ce_s, stack, *results, &temp);
return;
  }
 }
@@ -3062,7 +3065,7 @@ get_constraint_for_

graphds.[ch]: alloc_pool for edges

2011-08-22 Thread Dimitrios Apostolou
free() was called way too often before; this patch reduces that
significantly. There is a minor speed-up here too; I don't mention it
individually since the numbers are within noise margins.



2011-08-22  Dimitrios Apostolou  

* graphds.h (struct graph): Added edge_pool as a pool for
allocating edges.
* graphds.c (new_graph): Initialise edge_pool.
(add_edge): Allocate edge from edge_pool rather than with malloc.
(free_graph): Instead of iterating across the graph freeing edges,
just destroy the edge_pool.
=== modified file 'gcc/graphds.c'
--- gcc/graphds.c   2009-11-25 10:55:54 +
+++ gcc/graphds.c   2011-08-19 16:44:41 +
@@ -62,7 +62,8 @@ new_graph (int n_vertices)
 
   g->n_vertices = n_vertices;
   g->vertices = XCNEWVEC (struct vertex, n_vertices);
-
+  g->edge_pool = create_alloc_pool ("edge_pool",
+   sizeof (struct graph_edge), 32);
   return g;
 }
 
@@ -71,7 +72,7 @@ new_graph (int n_vertices)
 struct graph_edge *
 add_edge (struct graph *g, int f, int t)
 {
-  struct graph_edge *e = XNEW (struct graph_edge);
+  struct graph_edge *e = (struct graph_edge *) pool_alloc (g->edge_pool);
   struct vertex *vf = &g->vertices[f], *vt = &g->vertices[t];
 
 
@@ -326,19 +327,7 @@ for_each_edge (struct graph *g, graphds_
 void
 free_graph (struct graph *g)
 {
-  struct graph_edge *e, *n;
-  struct vertex *v;
-  int i;
-
-  for (i = 0; i < g->n_vertices; i++)
-{
-  v = &g->vertices[i];
-  for (e = v->succ; e; e = n)
-   {
- n = e->succ_next;
- free (e);
-   }
-}
+  free_alloc_pool (g->edge_pool);
   free (g->vertices);
   free (g);
 }

=== modified file 'gcc/graphds.h'
--- gcc/graphds.h   2009-02-20 15:20:38 +
+++ gcc/graphds.h   2011-08-19 16:44:41 +
@@ -18,6 +18,10 @@ You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
+
+#include "alloc-pool.h"
+
+
 /* Structure representing edge of a graph.  */
 
 struct graph_edge
@@ -44,10 +48,10 @@ struct vertex
 
 struct graph
 {
-  int n_vertices;  /* Number of vertices.  */
-  struct vertex *vertices;
-   /* The vertices.  */
-  htab_t indices;  /* Fast lookup for indices.  */
+  int n_vertices;  /* Number of vertices.  */
+  struct vertex *vertices; /* The vertices.  */
+  htab_t indices;  /* Fast lookup for indices.  */
+  alloc_pool edge_pool;/* Pool for allocating edges. */
 };
 
 struct graph *new_graph (int);



mem_attrs_htab

2011-08-22 Thread Dimitrios Apostolou


2011-08-22  Dimitrios Apostolou  

* emit-rtl.c (mem_attrs_htab_hash): Hash the whole struct by calling
iterative_hash(). We disregard the offset and size rtx fields of the
mem_attrs struct, but overall this hash is a *huge* improvement over
the previous one; it reduces the collisions/searches ratio from 8
to 0.8 in some cases.
(init_emit_once): Slightly increase the mem_attrs_htab initial
size because it's frequently used and expanded many times.
=== modified file 'gcc/emit-rtl.c'
--- gcc/emit-rtl.c  2011-05-29 17:40:05 +
+++ gcc/emit-rtl.c  2011-08-21 04:44:25 +
@@ -256,11 +256,10 @@ mem_attrs_htab_hash (const void *x)
 {
   const mem_attrs *const p = (const mem_attrs *) x;
 
-  return (p->alias ^ (p->align * 1000)
- ^ (p->addrspace * 4000)
- ^ ((p->offset ? INTVAL (p->offset) : 0) * 5)
- ^ ((p->size ? INTVAL (p->size) : 0) * 250)
- ^ (size_t) iterative_hash_expr (p->expr, 0));
+  /* By feeding the whole mem_attrs struct to iterative_hash() we
+ disregard the p->offset and p->size rtxes, but in total the hash is
+ quick and good enough.  */
+  return iterative_hash_object (*p, iterative_hash_expr (p->expr, 0));
 }
 
 /* Returns nonzero if the value represented by X (which is really a
@@ -5494,7 +5500,7 @@ init_emit_once (void)
   const_fixed_htab = htab_create_ggc (37, const_fixed_htab_hash,
  const_fixed_htab_eq, NULL);
 
-  mem_attrs_htab = htab_create_ggc (37, mem_attrs_htab_hash,
+  mem_attrs_htab = htab_create_ggc (128, mem_attrs_htab_hash,
mem_attrs_htab_eq, NULL);
   reg_attrs_htab = htab_create_ggc (37, reg_attrs_htab_hash,
reg_attrs_htab_eq, NULL);



Various minor speed-ups

2011-08-22 Thread Dimitrios Apostolou

Hello list,

the follow-up patches are a selection of minor changes introduced at
various times during my GSOC project. They are mostly too simple or not
important enough to be posted alone, so I'll post them all together under
this thread. Nevertheless, they have been carefully selected from a pool
of other changes, and they are the ones that *do* offer some (minor) speed
improvement and have the least impact on memory usage, if any.


They have all been tested on x86_64, some also on i386. For production 
builds I have seen no regression introduced.



Thanks,
Dimitris



Re: Dump stats about hottest hash tables when -fmem-report

2011-08-19 Thread Dimitrios Apostolou

On Fri, 19 Aug 2011, Tom Tromey wrote:


I think you are the most likely person to do this sort of testing.
You can use machines on the GCC compile farm for this.

Your patch to change the symbol table's load factor is fine technically.
I think the argument for putting it in is lacking; what I would like to
see is either some rationale explaining that the increased memory use is
not important, or some numbers showing that it still performs well on
more than a single machine.  My reason for wanting this is just that,
historically, GCC has been very sensitive to increases in memory use.
Alternatively, comments from more active maintainers indicating that
they don't care about this would also help your case.

I can't approve or reject the libiberty change, just the libcpp one.


Hi Tom, thanks for your comments. I'm well aware that I should test on 
more equipment besides i386 and x86_64 and I certainly plan to. It's just 
that writing patches is one thing, but advocating them on gcc-patches is 
an equally hard thing, and I plan on doing the latter correctly after GSOC 
ends.


As for the technical side of this patch, I've noticed that while in
general there is a good gain to be had from reduced collisions, there is
an overhead in htab_traverse() and htab_delete(), which iterate over the
whole array. This is evident primarily in var-tracking, which is more
htab-intensive than the rest of the compiler altogether! So I plan to
resubmit my patch together with small changes in var-tracking.



Thanks,
Dimitris





Re: Dump stats about hottest hash tables when -fmem-report

2011-08-09 Thread Dimitrios Apostolou

On Tue, 9 Aug 2011, Tom Tromey wrote:


"Richard" == Richard Guenther  writes:



The libcpp part is ok with this change.


Richard> Note that sparsely populated hashes come at the cost of increased
Richard> cache footprint.  Not sure what is more important here though, memory
Richard> access or hash computation.

I was only approving the change to the dumping.
I am undecided about making the hash tables more sparse.


Since my Core Quad processor has large caches and the i386 has a small
pointer size, the few extra empty buckets impose little overhead, which it
seems is minor in comparison to the gains from fewer rehashes.


Maybe that's not true on older or alternative equipment. I'd be very
interested to hear about runtime measurements on various equipment; please
let me know if you do any.



Thanks,
Dimitris




Re: Dump stats about hottest hash tables when -fmem-report

2011-08-09 Thread Dimitrios Apostolou
I forgot to include the dwarf2out.c:file_table. Stats are printed when -g. 
See attached patch. Additional Changelog:


* dwarf2out.c (dwarf2out_finish): Call htab_dump_statistics() if
-fmem-report.


Dimitris

=== modified file 'gcc/dwarf2out.c'
--- gcc/dwarf2out.c 2011-06-06 17:46:00 +
+++ gcc/dwarf2out.c 2011-08-09 07:56:11 +
@@ -24820,6 +24820,13 @@ dwarf2out_finish (const char *filename)
add_comp_dir_attribute (comp_unit_die ());
 }
 
+  if (mem_report)
+{
+  fprintf(stderr, "\ndwarf2out.c: file_table hash table statistics:\n");
+  htab_dump_statistics(file_table, sizeof (struct dwarf_file_data));
+}
+
+
   for (i = 0; i < VEC_length (deferred_locations, deferred_locations_list); 
i++)
 {
   add_location_or_const_value_attribute (



Dump stats about hottest hash tables when -fmem-report

2011-08-08 Thread Dimitrios Apostolou

Hello list,

this is the second part of my patch. It collects and prints some memory 
statistics for hash tables that I've measured as the hottest ones. Tested 
together with previous patch on i386. Example output:



libcpp symtab string pool:
identifiers 32644 (100.00%)
entries 32644 (49.81%)
deleted 0
slots   65536
string bytes    445k (4064  obstack_memory_used)
table size  256k
searches234217
collisions  86838
coll/search 0.3708
ins/search  0.1394
avg. entry  13.97 bytes (+/- 6.42)
longest entry   52
No gimple statistics

??? tree nodes created

(No per-node statistics)
DECL_DEBUG_EXPR  hash: size 1021, 23 elements, 0.00 collisions
DECL_VALUE_EXPR  hash: size 1021, 0 elements, 0.00 collisions

tree.c:type_hash_table stats
slots   32749
identifiers 5605
entries 5605 (17.12%)
deleted 0
struct htab 60  B
table size  127 kB
element 8  B
total contents  43 kB
searches27032
collisions  9148
coll/search 0.3384

emit-rtl.c:mem_attrs_htab hash table:
slots   8191
identifiers 2295
entries 2295 (28.02%)
deleted 0
struct htab 60  B
table size  31 kB
element 24  B
total contents  53 kB
searches7166
collisions  3784
coll/search 0.5280

cgraph.c:cgraph_hash hash table stats:
slots   8191
identifiers 198
entries 3100 (37.85%)
deleted 2902
struct htab 60  B
table size  31 kB
element 160  B
total contents  30 kB
searches103425
collisions  32761
coll/search 0.3168

var-tracking.c stats
2363 vars->htab hash tables:
total searches  483055
total collisions68621
total coll/search   0.1421
54 changed_variables hash tables
total searches  33417
total collisions14253
total coll/search   0.4265
54 value_chains hash tables
total searches  43924
total collisions14027
total coll/search   0.3193

cselib stats for 614 hash tables
total searches  52840
total collisions9597
total coll/search   0.1816



Changelog:

2011-08-09  Dimitrios Apostolou  

* cgraph.c, cgraph.h (cgraph_dump_stats): New function to dump
stats about cgraph_hash hash table.
* cselib.c, cselib.h (cselib_dump_stats): New function to dump
stats about cselib_hash_table.
* cselib.c (cselib_finish): Keep statistics by increasing values
of new global variables cselib_htab_{num,searches,collisions} if
-fmem-report.
* emit-rtl.c, emit-rtl.h (mem_attrs_dump_stats): New function to
dump statistics about mem_attrs_htab hash table.
* tree.c (print_type_hash_statistics): Use the new
htab_dump_statistics() function.
* var-tracking.c (shared_hash_destroy): Keep statistics by
increasing values of new global variables
vars_htab_{num,searches,collisions} if -fmem-report.
(vt_finalize): Keep statistics by increasing values of new global
variables cv_htab_{num,searches,collisions} and
vc_htab_{num,searches,collisions} if -fmem-report.
* var-tracking.c, rtl.h (vt_dump_stats): New function to dump
stats about vars->htab, changed_variables and value_chains hash
tables.
* toplev.c: Include cselib.h for cselib_dump_stats().
(dump_memory_report): Call all the above functions to provide
better statistics.
* hashtab.c, hashtab.h (htab_dump_statistics, htab_collisions_num)
(htab_searches_num): New functions for statistics.
* hashtab.c: Include assert.h for checks in htab_dump_statistics.
* symtab.c (ht_dump_statistics): Beautify stats output.


Thanks,
Dimitris
=== modified file 'gcc/cgraph.c'
--- gcc/cgraph.c2011-06-06 17:12:25 +
+++ gcc/cgraph.c2011-08-09 01:01:19 +
@@ -2897,4 +2897,11 @@ cgraph_used_from_object_file_p (struct c
   return false;
 }
 
+void
+cgraph_dump_stats (void)
+{
+  fprintf (stderr, "\ncgraph.c:cgraph_hash hash table stats:\n");
+  htab_dump_statistics (cgraph_hash, sizeof (struct cgraph_node));
+}
+
 #include "gt-cgraph.h"

=== modified file 'gcc/cgraph.h'
--- gcc/cgraph.h2011-05-31 14:58:49 +
+++ gcc/cgraph.h2011-08-08 23:19:51 +
@@ -540,6 +540,7 @@ bool cgraph_can_remove_if_no_direct

Decrease fill-ratio of hash tables

2011-08-08 Thread Dimitrios Apostolou

Hello list,

the attached simple patch reproducibly saves a tiny bit of time on various
-O0 compilations I've performed, for example 5 ms out of 627 ms. Tested on
i386.


We trade a little bit of memory (maxmem2.sh from [1] reported 25240 KB
instead of 25080 KB without the patch) to allow our hash tables to be more
sparsely populated (always at least half-empty). Since our hash tables
resolve conflicts by rehashing, reducing collisions is good.
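
To put rough numbers on it: under the usual uniform-hashing
approximation, an unsuccessful lookup in an open-addressing table costs
about 1/(1-load) probes, so capping the load factor at 0.5 instead of
0.75 roughly halves the expected probe count. A quick standalone
back-of-the-envelope check (plain C, not GCC code):

#include <stdio.h>

int
main (void)
{
  double loads[] = { 0.5, 0.75 };
  unsigned int i;

  /* Expected probes for an unsuccessful search with uniform hashing:
     1 / (1 - load).  Load 0.75 gives ~4 probes, load 0.5 gives ~2.  */
  for (i = 0; i < 2; i++)
    printf ("load %.2f -> ~%.1f probes\n",
            loads[i], 1.0 / (1.0 - loads[i]));
  return 0;
}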


Another patch that I'll post soon helped me measure the
collisions/searches ratio for the hottest hash tables. It was easy to
notice that the ratio was too high, reaching 0.5 or even 0.7 in some
cases. This made it clear that we need some deep refactoring of our hash
tables, either changing the hash functions or the whole hash table
implementation, so that collision handling does not involve any
rehashing. Unfortunately such attempts failed, either because they were
too simple to show any difference, or because they were too intrusive and
involved changes everywhere htab_t is used (almost everywhere).


This patch is the simplest one that shows a positive change, but I
believe some careful hash table redesign must be planned. For example,
the coll/search ratio for the mem_attrs_htab hash table is still
sometimes higher than 0.5.


Changelog:

2011-08-09  Dimitrios Apostolou  

* symtab.c (ht_lookup_with_hash): Hash table will now be doubled
when 50% full, not 75%, to reduce collisions.
* hashtab.c (find_empty_slot_for_expand)
(htab_find_slot_with_hash): The same.



Thanks,
Dimitris



[1] http://gcc.gnu.org/wiki/PerformanceTesting
=== modified file 'libcpp/symtab.c'
--- libcpp/symtab.c 2009-07-18 02:22:16 +
+++ libcpp/symtab.c 2011-08-09 02:39:45 +
@@ -172,7 +172,7 @@
 HT_STR (node) = (const unsigned char *) obstack_copy0 (&table->stack,
   str, len);
 
-  if (++table->nelements * 4 >= table->nslots * 3)
+  if (++table->nelements * 2 > table->nslots)
 /* Must expand the string table.  */
 ht_expand (table);
 

=== modified file 'libiberty/hashtab.c'
--- libiberty/hashtab.c 2011-02-03 07:23:20 +
+++ libiberty/hashtab.c 2011-08-09 02:39:45 +
@@ -515,8 +515,9 @@
 }
 
 /* The following function changes size of memory allocated for the
-   entries and repeatedly inserts the table elements.  The occupancy
-   of the table after the call will be about 50%.  Naturally the hash
+   entries and repeatedly inserts the table elements.  It is designed to
+   be called when the table is half-full.  The occupancy
+   of the table after the call will be about 25%.  Naturally the hash
table must already exist.  Remember also that the place of the
table entries is changed.  If memory allocation failures are allowed,
this function will return zero, indicating that the table could not be
@@ -542,7 +543,7 @@
  too full or too empty.  */
   if (elts * 2 > osize || (elts * 8 < osize && osize > 32))
 {
-  nindex = higher_prime_index (elts * 2);
+  nindex = higher_prime_index (elts * 4);
   nsize = prime_tab[nindex].prime;
 }
   else
@@ -648,7 +649,7 @@
   PTR entry;
 
   size = htab_size (htab);
-  if (insert == INSERT && size * 3 <= htab->n_elements * 4)
+  if (insert == INSERT && htab->n_elements * 2 > size)
 {
   if (htab_expand (htab) == 0)
return NULL;


Re: [RFC] hard-reg-set.h refactoring

2011-08-01 Thread Dimitrios Apostolou

On Mon, 1 Aug 2011, Paolo Bonzini wrote:


On 08/01/2011 05:57 PM, Dimitrios Apostolou wrote:


I don't fully understand the output from -fdump-tree-all, but my
conclusion, based also on profiler output and objdump, is that both
unrolling and inlining are happening in both versions. Nevertheless I can
see that the assembly output is a bit different in the two cases (I can
post specific disassembly output if you are interested).


Thanks for checking.

Have you tried the idea of passing an unsigned HOST_WIDEST_FAST_INT * (or 
whatever the name) to the target hook?


Keeping my patch exactly the same, just changing the
hook_void_hard_reg_set hook to receive a (HOST_WIDEST_FAST_INT *) arg and
doing the necessary typecasts added an extra 3 M instructions.


But the ix86_live_on_entry hook is only called 1233 times from
df-scan.c. That isn't enough to explain all this overhead.



Dimitris



Re: [RFC] hard-reg-set.h refactoring

2011-08-01 Thread Dimitrios Apostolou

On Sun, 31 Jul 2011, Paolo Bonzini wrote:

On Sat, Jul 30, 2011 at 19:21, Dimitrios Apostolou  wrote:

Nevertheless I'd appreciate comments on whether any part of this patch is
worth keeping. FWIW I've profiled this on i386 to be about 4 M instr slower
out of ~1.5 G inst. I'll be now checking the profiler to see where exactly
the overhead is.


I suggest -fdump-tree-all too, to check if unrolling is happening and
if not why.


I don't fully understand the output from -fdump-tree-all, but my
conclusion, based also on profiler output and objdump, is that both
unrolling and inlining are happening in both versions. Nevertheless I can
see that the assembly output is a bit different in the two cases (I can
post specific disassembly output if you are interested).


My opinion is that code cleanup is worth the minor overhead, given that 
there should be no regressions.



Thanks,
Dimitris



Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-07-31 Thread Dimitrios Apostolou

On Sun, 31 Jul 2011, Steven Bosscher wrote:


On Fri, Jul 29, 2011 at 11:48 PM, Steven Bosscher  wrote:

I'll see if I can test the patch on the compile farm this weekend,
just to be sure.


Worked fine with some cross-builds to arm-eabi. Bootstrap on
ia64-unknown-linux-gnu is in stage2 but it is taking forever (on
gcc60)... I will not be able to run the test suite anymore this
weekend, but I'll start the run.

Anyway, anything that makes it into stage2 on ia64 can't be all bad, eh? ;-)


Steven thanks a lot for your help! :-)

If you have any scripts/guidelines for automatic testing on various 
platforms on the compile farm, I'd be happy to use them.



Thanks,
Dimitris



[RFC] hard-reg-set.h refactoring

2011-07-30 Thread Dimitrios Apostolou

Hello list,

the attached patch changes hard-reg-set.h in the following areas:

1) HARD_REG_SET is now always a struct so that it can be used in files 
where we don't want to include tm.h. Many thanks to Paolo for providing 
the idea and the original patch.


2) Code for specific HARD_REG_SET_LONGS values is deleted and only
generic code is left, making the file much more readable/maintainable. I
was expecting gcc to unroll, even at -O2, loops with 2-3 iterations, so
performance should have been the same.
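
To illustrate point 2, the generic style that replaces the per-size
variants looks roughly like this (hard_reg_set_ior is a hypothetical
name for illustration, not a function from the patch):

static inline void
hard_reg_set_ior (HARD_REG_SET *to, const HARD_REG_SET *from)
{
  unsigned int i;

  /* Constant, tiny trip count; GCC should unroll this at -O2.  */
  for (i = 0; i < HARD_REG_SET_LONGS; i++)
    to->elems[i] |= from->elems[i];
}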



I don't intend for this to go mainline, Jakub has explained on IRC that 
certain ABIs make it slower to pass structs and we wouldn't want that. 
Nevertheless I'd appreciate comments on whether any part of this patch is 
worth keeping. FWIW I've profiled this on i386 to be about 4 M instr 
slower out of ~1.5 G inst. I'll be now checking the profiler to see where 
exactly the overhead is.



Thanks,
Dimitris
=== modified file 'gcc/hard-reg-set.h'
--- gcc/hard-reg-set.h  2011-01-03 20:52:22 +
+++ gcc/hard-reg-set.h  2011-07-29 22:32:27 +
@@ -24,35 +24,31 @@ along with GCC; see the file COPYING3.  
 /* Define the type of a set of hard registers.  */
 
 /* HARD_REG_ELT_TYPE is a typedef of the unsigned integral type which
-   will be used for hard reg sets, either alone or in an array.
-
-   If HARD_REG_SET is a macro, its definition is HARD_REG_ELT_TYPE,
-   and it has enough bits to represent all the target machine's hard
-   registers.  Otherwise, it is a typedef for a suitably sized array
-   of HARD_REG_ELT_TYPEs.  HARD_REG_SET_LONGS is defined as how many.
+   will be used for hard reg sets.  A HARD_REG_ELT_TYPE, or an
+   array of them, is wrapped in a struct.
 
Note that lots of code assumes that the first part of a regset is
the same format as a HARD_REG_SET.  To help make sure this is true,
we only try the widest fast integer mode (HOST_WIDEST_FAST_INT)
-   instead of all the smaller types.  This approach loses only if
-   there are very few registers and then only in the few cases where
-   we have an array of HARD_REG_SETs, so it needn't be as complex as
-   it used to be.  */
-
-typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE;
-
-#if FIRST_PSEUDO_REGISTER <= HOST_BITS_PER_WIDEST_FAST_INT
-
-#define HARD_REG_SET HARD_REG_ELT_TYPE
+   instead of all the smaller types. */
 
+#ifdef ENABLE_RTL_CHECKING
+#define gcc_rtl_assert(EXPR) gcc_assert (EXPR)
 #else
+#define gcc_rtl_assert(EXPR) ((void)(0 && (EXPR)))
+#endif
+
+typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE;
 
 #define HARD_REG_SET_LONGS \
  ((FIRST_PSEUDO_REGISTER + HOST_BITS_PER_WIDEST_FAST_INT - 1)  \
   / HOST_BITS_PER_WIDEST_FAST_INT)
-typedef HARD_REG_ELT_TYPE HARD_REG_SET[HARD_REG_SET_LONGS];
 
-#endif
+#define HARD_REG_SET struct hard_reg_set
+
+struct hard_reg_set {
+  HARD_REG_ELT_TYPE elems[HARD_REG_SET_LONGS];
+};
 
 /* HARD_CONST is used to cast a constant to the appropriate type
for use with a HARD_REG_SET.  */
@@ -89,343 +85,108 @@ typedef HARD_REG_ELT_TYPE HARD_REG_SET[H
hard_reg_set_intersect_p (X, Y), which returns true if X and Y intersect.
hard_reg_set_empty_p (X), which returns true if X is empty.  */
 
-#define UHOST_BITS_PER_WIDE_INT ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT)
 
-#ifdef HARD_REG_SET
+#define HARD_REG_ELT_BITS ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT)
 
 #define SET_HARD_REG_BIT(SET, BIT)  \
- ((SET) |= HARD_CONST (1) << (BIT))
+  hard_reg_set_set_bit (&(SET), (BIT))
 #define CLEAR_HARD_REG_BIT(SET, BIT)  \
- ((SET) &= ~(HARD_CONST (1) << (BIT)))
+  hard_reg_set_clear_bit(&(SET), (BIT))
 #define TEST_HARD_REG_BIT(SET, BIT)  \
- (!!((SET) & (HARD_CONST (1) << (BIT))))
-
-#define CLEAR_HARD_REG_SET(TO) ((TO) = HARD_CONST (0))
-#define SET_HARD_REG_SET(TO) ((TO) = ~ HARD_CONST (0))
-
-#define COPY_HARD_REG_SET(TO, FROM) ((TO) = (FROM))
-#define COMPL_HARD_REG_SET(TO, FROM) ((TO) = ~(FROM))
-
-#define IOR_HARD_REG_SET(TO, FROM) ((TO) |= (FROM))
-#define IOR_COMPL_HARD_REG_SET(TO, FROM) ((TO) |= ~ (FROM))
-#define AND_HARD_REG_SET(TO, FROM) ((TO) &= (FROM))
-#define AND_COMPL_HARD_REG_SET(TO, FROM) ((TO) &= ~ (FROM))
-
-static inline bool
-hard_reg_set_subset_p (const HARD_REG_SET x, const HARD_REG_SET y)
-{
-  return (x & ~y) == HARD_CONST (0);
-}
+  hard_reg_set_bit_p((SET), (BIT))
 
-static inline bool
-hard_reg_set_equal_p (const HARD_REG_SET x, const HARD_REG_SET y)
-{
-  return x == y;
-}
-
-static inline bool
-hard_reg_set_intersect_p (const HARD_REG_SET x, const HARD_REG_SET y)
-{
-  return (x & y) != HARD_CONST (0);
-}
-
-static inline bool
-hard_reg_set_empty_p (const HARD_REG_SET x)
+static inline void
+hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit)
 {
-  return x == HARD_CONST (0);
-}
-
+#if HARD_REG_SET_LONGS > 1
+  int word = bit / HARD_REG_ELT_BITS;
+  int bitpos = bit % HARD_REG_ELT_BITS;
 #else
-
-#define SET_HARD_REG_BIT(SET, BIT) \
-  ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT]  \
-   |= HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT))

Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-07-29 Thread Dimitrios Apostolou

On Fri, 29 Jul 2011, Kenneth Zadeck wrote:

I really think that patches of this magnitude having to do with the rtl
level should be tested on more than one platform.


I'd really appreciate further testing on other platforms from whoever
does it casually; for me it would take too much time to set up my testing
platform on the GCC compile farm, and deadlines are approaching.



Thanks,
Dimitris



Re: [DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-07-29 Thread Dimitrios Apostolou


Completely forgot it: Tested on i386, no regressions.


Dimitrios


[DF] [performance] generate DF_REF_BASE REFs in REGNO order

2011-07-29 Thread Dimitrios Apostolou

Hello list,

the attached patch achieves a performance improvement by first recording
DF_REF_BASE DEFs within df_get_call_refs(), before the DF_REF_REGULARs
are recorded in df_defs_record(). BASE DEFs are also recorded in REGNO
order. The improvement has been measured as follows, for compiling
tcp_ipv4.c of the linux kernel with -O0 optimisation:


trunk  : 1438.4 M instr, 0.627s
patched: 1376.5 M instr, 0.604s

It also includes suggested changes from Paolo discussed on the list
(subject: what can be in a group set). Many thanks to him for the
invaluable help while writing the patch.


For whoever is interested, you can see the two profiles with fully
annotated source before and after the change at the following links. The
big difference is that qsort() is now called from
df_sort_and_compress_refs() only 33 times instead of thousands.
Further measurements, comments and ideas for further improvements are
welcome.


http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=callgrind-tcp_ipv4-trunk-co-109439-prod.txt
http://gcc.gnu.org/wiki/OptimisingGCC?action=AttachFile&do=view&target=callgrind-tcp_ipv4-df2-co-prod.txt
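
For context, the reason REGNO order pays off: df_sort_and_compress_refs()
only needs qsort() when the collected refs arrive out of order, so a
cheap sortedness check skips the sort almost every time. The pattern, as
a standalone sketch in plain C (not the actual df-scan.c code):

#include <stddef.h>
#include <stdlib.h>

static int
cmp_unsigned (const void *a, const void *b)
{
  unsigned int x = *(const unsigned int *) a;
  unsigned int y = *(const unsigned int *) b;
  return (x > y) - (x < y);
}

/* Sort only when the input is not already sorted; with refs produced
   in REGNO order this branch is almost never taken.  */
static void
maybe_sort (unsigned int *v, size_t n)
{
  size_t i;

  for (i = 1; i < n; i++)
    if (v[i - 1] > v[i])
      {
        qsort (v, n, sizeof *v, cmp_unsigned);
        return;
      }
}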


Changelog:

2011-07-29  Dimitrios Apostolou  
Paolo Bonzini  

* df-scan.c (df_def_record_1): Assert a parallel must contain an
EXPR_LIST at this point.  Receive the LOC and move its extraction...
(df_defs_record): ... here.  Rewrite the logic with a switch statement
instead of multiple if-else tests.
(df_find_hard_reg_defs, df_find_hard_reg_defs_1): New functions
that duplicate the logic of df_defs_record() and df_def_record_1()
but without actually recording any DEFs, only marking them in
the defs HARD_REG_SET.
(df_get_call_refs): Call df_find_hard_reg_defs() to mark DEFs that
are the result of the call. Record DF_REF_BASE DEFs in REGNO
order. Use regs_invalidated_by_call HARD_REG_SET instead of
regs_invalidated_by_call_regset bitmap.
(df_insn_refs_collect): Record DF_REF_REGULAR DEFs after
df_get_call_refs().


Thanks,
Dimitris


P.S. maraz: that's a 4.3% improvement in instruction count; should you
start worrying, or is it too late? ;-)
=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-07-29 16:01:50 +
@@ -111,7 +111,7 @@ static void df_ref_record (enum df_ref_c
   rtx, rtx *,
   basic_block, struct df_insn_info *,
   enum df_ref_type, int ref_flags);
-static void df_def_record_1 (struct df_collection_rec *, rtx,
+static void df_def_record_1 (struct df_collection_rec *, rtx *,
 basic_block, struct df_insn_info *,
 int ref_flags);
 static void df_defs_record (struct df_collection_rec *, rtx,
@@ -2916,40 +2916,27 @@ df_read_modify_subreg_p (rtx x)
 }
 
 
-/* Process all the registers defined in the rtx, X.
+/* Process all the registers defined in the rtx pointed by LOC.
Autoincrement/decrement definitions will be picked up by
df_uses_record.  */
 
 static void
 df_def_record_1 (struct df_collection_rec *collection_rec,
- rtx x, basic_block bb, struct df_insn_info *insn_info,
+ rtx *loc, basic_block bb, struct df_insn_info *insn_info,
 int flags)
 {
-  rtx *loc;
-  rtx dst;
-
- /* We may recursively call ourselves on EXPR_LIST when dealing with PARALLEL
- construct.  */
-  if (GET_CODE (x) == EXPR_LIST || GET_CODE (x) == CLOBBER)
-loc = &XEXP (x, 0);
-  else
-loc = &SET_DEST (x);
-  dst = *loc;
+  rtx dst = *loc;
 
   /* It is legal to have a set destination be a parallel. */
   if (GET_CODE (dst) == PARALLEL)
 {
   int i;
-
   for (i = XVECLEN (dst, 0) - 1; i >= 0; i--)
{
  rtx temp = XVECEXP (dst, 0, i);
- if (GET_CODE (temp) == EXPR_LIST || GET_CODE (temp) == CLOBBER
- || GET_CODE (temp) == SET)
-   df_def_record_1 (collection_rec,
- temp, bb, insn_info,
-GET_CODE (temp) == CLOBBER
-? flags | DF_REF_MUST_CLOBBER : flags);
+ gcc_assert (GET_CODE (temp) == EXPR_LIST);
+ df_def_record_1 (collection_rec, &XEXP (temp, 0),
+  bb, insn_info, flags);
}
   return;
 }
@@ -3003,26 +2990,98 @@ df_defs_record (struct df_collection_rec
int flags)
 {
   RTX_CODE code = GET_CODE (x);
+  int i;
 
-  if (code == SET || code == CLOBBER)
-{
-  /* Mark the single def within the pattern.  */
-  int clobber_flags = flags;
-  clobber_flags |= (code == CLOBBER) ? DF_REF_MUST_CLOBBER : 0;
-  df_def_record_1 (collection_rec, x, bb, insn_info, clobber_flags);
-}
-  else if (code == COND_EXEC)
+  switch (code)
 {
+case SET:
+  df_def_record_1 (collection_rec, &SET_DEST (x), bb, insn_info, flags);

Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version

2011-07-26 Thread Dimitrios Apostolou
Bug found at last: it's in the following hunk; the ampersand in
&exit_block_uses is wrong... :-@




@@ -3951,7 +3949,7 @@ df_get_exit_block_use_set (bitmap exit_b
 {
   rtx tmp = EH_RETURN_STACKADJ_RTX;
   if (tmp && REG_P (tmp))
-   df_mark_reg (tmp, exit_block_uses);
+   df_mark_reg (tmp, &exit_block_uses);
 }
 #endif

@@ -3961,12 +3959,12 @@ df_get_exit_block_use_set (bitmap exit_b
 {
   rtx tmp = EH_RETURN_HANDLER_RTX;
   if (tmp && REG_P (tmp))
-   df_mark_reg (tmp, exit_block_uses);
+   df_mark_reg (tmp, &exit_block_uses);
 }
 #endif

   /* Mark function return value.  */
-  diddle_return_value (df_mark_reg, (void*) exit_block_uses);
+  diddle_return_value (df_mark_reg, (void*) &exit_block_uses);
 }



Thanks to everyone for looking into my code; it seems to be working now,
failing only on some mudflap tests that I've been told to ignore. Expect
a patch repost soon :-)


FWIW test failures in comparison to trunk are the following, but I'll 
ignore them:


FAIL: libmudflap.cth/pass39-frag.c (rerun 18) execution test
FAIL: libmudflap.cth/pass39-frag.c (rerun 18) output pattern test
FAIL: libmudflap.cth/pass39-frag.c (-static -DSTATIC) (rerun 10) execution test
FAIL: libmudflap.cth/pass39-frag.c (-static -DSTATIC) (rerun 10) output pattern 
test
FAIL: libmudflap.cth/pass39-frag.c (-O2) (rerun 7) execution test
FAIL: libmudflap.cth/pass39-frag.c (-O2) (rerun 7) output pattern test
FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 5) execution test
FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 5) output pattern test



Dimitris



added some assert checks in hard-reg-set.h

2011-07-25 Thread Dimitrios Apostolou

Hello list,

the attached patch was tested on i386 and measured to have almost no
runtime overhead when RTL checks are enabled:


Instruction Count before: 2328.6 M
Instruction Count after:  2334.4 M

which translates to some milliseconds, well within the noise.

Changelog:


2011-07-25  Dimitrios Apostolou 

	* hard-reg-set.h (TEST_HARD_REG_BIT, SET_HARD_REG_BIT,
CLEAR_HARD_REG_BIT): Add assert checks for the test, set and clear
operations of HARD_REG_SETs, enabled when RTL checks are on. Runtime
overhead was measured as negligible.




Thanks,
Dimitris=== modified file 'gcc/hard-reg-set.h'
--- gcc/hard-reg-set.h  2011-01-03 20:52:22 +
+++ gcc/hard-reg-set.h  2011-07-25 17:06:36 +
@@ -41,6 +41,13 @@ along with GCC; see the file COPYING3.  
 
 typedef unsigned HOST_WIDEST_FAST_INT HARD_REG_ELT_TYPE;
 
+#ifdef ENABLE_RTL_CHECKING
+#define gcc_rtl_assert(EXPR) gcc_assert (EXPR)
+#else
+#define gcc_rtl_assert(EXPR) ((void)(0 && (EXPR)))
+#endif
+
+
 #if FIRST_PSEUDO_REGISTER <= HOST_BITS_PER_WIDEST_FAST_INT
 
 #define HARD_REG_SET HARD_REG_ELT_TYPE
@@ -91,14 +98,35 @@ typedef HARD_REG_ELT_TYPE HARD_REG_SET[H
 
 #define UHOST_BITS_PER_WIDE_INT ((unsigned) HOST_BITS_PER_WIDEST_FAST_INT)
 
-#ifdef HARD_REG_SET
-
 #define SET_HARD_REG_BIT(SET, BIT)  \
- ((SET) |= HARD_CONST (1) << (BIT))
+  hard_reg_set_set_bit (&(SET), (BIT))
 #define CLEAR_HARD_REG_BIT(SET, BIT)  \
- ((SET) &= ~(HARD_CONST (1) << (BIT)))
+  hard_reg_set_clear_bit(&(SET), (BIT))
 #define TEST_HARD_REG_BIT(SET, BIT)  \
- (!!((SET) & (HARD_CONST (1) << (BIT))))
+  hard_reg_set_bit_p((SET), (BIT))
+
+#ifdef HARD_REG_SET
+
+static inline void
+hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit)
+{
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  (*s) |= HARD_CONST (1) << bit;
+}
+
+static inline void
+hard_reg_set_clear_bit (HARD_REG_SET *s, unsigned int bit)
+{
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  (*s) &= ~(HARD_CONST (1) << bit);
+}
+
+static inline bool
+hard_reg_set_bit_p (const HARD_REG_SET s, unsigned int bit)
+{
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  return ((s >> bit) & HARD_CONST (1));
+}
 
 #define CLEAR_HARD_REG_SET(TO) ((TO) = HARD_CONST (0))
 #define SET_HARD_REG_SET(TO) ((TO) = ~ HARD_CONST (0))
@@ -137,17 +165,32 @@ hard_reg_set_empty_p (const HARD_REG_SET
 
 #else
 
-#define SET_HARD_REG_BIT(SET, BIT) \
-  ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT]  \
-   |= HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT))
-
-#define CLEAR_HARD_REG_BIT(SET, BIT)   \
-  ((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT]  \
-   &= ~(HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT)))
-
-#define TEST_HARD_REG_BIT(SET, BIT)\
-  (!!((SET)[(BIT) / UHOST_BITS_PER_WIDE_INT]   \
-  & (HARD_CONST (1) << ((BIT) % UHOST_BITS_PER_WIDE_INT
+static inline void
+hard_reg_set_set_bit (HARD_REG_SET *s, unsigned int bit)
+{
+  int byte = bit / UHOST_BITS_PER_WIDE_INT;
+  int bitpos = bit % UHOST_BITS_PER_WIDE_INT;
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  (*s)[byte] |= HARD_CONST (1) << bitpos;
+}
+
+static inline void
+hard_reg_set_clear_bit (HARD_REG_SET *s, unsigned int bit)
+{
+  int byte = bit / UHOST_BITS_PER_WIDE_INT;
+  int bitpos = bit % UHOST_BITS_PER_WIDE_INT;
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  (*s)[byte] &= ~(HARD_CONST (1) << bitpos);
+}
+
+static inline bool
+hard_reg_set_bit_p (const HARD_REG_SET s, unsigned int bit)
+{
+  int byte = bit / UHOST_BITS_PER_WIDE_INT;
+  int bitpos = bit % UHOST_BITS_PER_WIDE_INT;
+  gcc_rtl_assert (bit < FIRST_PSEUDO_REGISTER);
+  return ((s[byte] >> bitpos) & HARD_CONST (1));
+}
 
 #if FIRST_PSEUDO_REGISTER <= 2*HOST_BITS_PER_WIDEST_FAST_INT
 #define CLEAR_HARD_REG_SET(TO)  \



Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version

2011-07-25 Thread Dimitrios Apostolou
That was a bug, indeed, but unfortunately it wasn't the one causing the 
crash I posted earlier... Even after fixing it I get the same backtrace 
from gdb.


So the "spot the bug" challenge still stands...


Thanks,
Dimitris



eliminate bitmap regs_invalidated_by_call_regset

2011-07-25 Thread Dimitrios Apostolou

Hello list,

the attached patch eliminates the regs_invalidated_by_call_regset bitmap
and uses the original regs_invalidated_by_call HARD_REG_SET instead.
Tested on i386; I had the following two regressions that I'll investigate
right away:


 FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 10) execution test
 FAIL: libmudflap.cth/pass39-frag.c (-O3) (rerun 10) output pattern test



Performance was measured not to be affected; maybe it is now a couple of
milliseconds faster:


Original: PC1:0.878s, PC2:6.55s, 2105.6 M instr
Patched : PC1:0.875s, PC2:6.54s, 2104.9 M instr


2011-07-25  Dimitrios Apostolou 

	* df-core.c, df-problems.c, df-scan.c, df.h, reginfo.c, regset.h: 
Eliminate regs_invalidated_by_call_regset bitmap and use instead the 
original regs_invalidated_by_call HARD_REG_SET.



All comments are welcome,
Dimitris=== modified file 'gcc/df-core.c'
--- gcc/df-core.c   2011-04-20 18:19:03 +
+++ gcc/df-core.c   2011-07-25 13:58:58 +
@@ -1886,6 +1886,17 @@ df_print_regset (FILE *file, bitmap r)
 }
 
 
+void
+df_print_hard_reg_set (FILE *f, HARD_REG_SET r)
+{
+  unsigned int i;
+  hard_reg_set_iterator iter;
+
+  EXECUTE_IF_SET_IN_HARD_REG_SET (r, 0, i, iter)
+fprintf (f, " %d [%s]", i, reg_names[i]);
+  fprintf (f, "\n");
+}
+
 /* Write information about registers and basic blocks into FILE.  The
bitmap is in the form used by df_byte_lr.  This is part of making a
debugging dump.  */

=== modified file 'gcc/df-problems.c'
--- gcc/df-problems.c   2011-05-04 20:24:15 +
+++ gcc/df-problems.c   2011-07-25 13:58:58 +
@@ -432,6 +432,7 @@ df_rd_local_compute (bitmap all_blocks)
 {
   unsigned int bb_index;
   bitmap_iterator bi;
+  hard_reg_set_iterator iter;
   unsigned int regno;
   struct df_rd_problem_data *problem_data
 = (struct df_rd_problem_data *) df_rd->problem_data;
@@ -449,7 +450,7 @@ df_rd_local_compute (bitmap all_blocks)
 }
 
   /* Set up the knockout bit vectors to be applied across EH_EDGES.  */
-  EXECUTE_IF_SET_IN_BITMAP (regs_invalidated_by_call_regset, 0, regno, bi)
+  EXECUTE_IF_SET_IN_HARD_REG_SET (regs_invalidated_by_call, 0, regno, iter)
 {
   if (DF_DEFS_COUNT (regno) > DF_SPARSE_THRESHOLD)
bitmap_set_bit (sparse_invalidated, regno);
@@ -969,6 +970,29 @@ df_lr_confluence_0 (basic_block bb)
 bitmap_copy (op1, &df->hardware_regs_used);
 }
 
+/* TO |= FROM1 & ~FROM2.
+   FROM2 is of type HARD_REG_SET.  */
+
+static bool
+bitmap_ior_and_compl_from_hard_reg_set (bitmap to, const_bitmap from1,
+   HARD_REG_SET from2)
+{
+  bool ret;
+  unsigned int i;
+  bitmap_head from1_tmp;
+  hard_reg_set_iterator iter;
+
+  bitmap_initialize (&from1_tmp, &bitmap_default_obstack);
+  bitmap_copy (&from1_tmp, from1);
+
+  /* TODO optimise per-word */
+  EXECUTE_IF_SET_IN_HARD_REG_SET (from2, 0, i, iter)
+bitmap_clear_bit (&from1_tmp, i);
+  ret = bitmap_ior_into (to, &from1_tmp);
+
+  bitmap_clear (&from1_tmp);
+  return ret;
+}
 
 /* Confluence function that ignores fake edges.  */
 
@@ -983,7 +1007,8 @@ df_lr_confluence_n (edge e)
   /* ??? Abnormal call edges ignored for the moment, as this gets
  confused by sibling call edges, which crashes reg-stack.  */
   if (e->flags & EDGE_EH)
-changed = bitmap_ior_and_compl_into (op1, op2, 
regs_invalidated_by_call_regset);
+changed = bitmap_ior_and_compl_from_hard_reg_set (op1, op2,
+ regs_invalidated_by_call);
   else
 changed = bitmap_ior_into (op1, op2);
 
@@ -4450,8 +4475,8 @@ df_md_confluence_n (edge e)
 return false;
 
   if (e->flags & EDGE_EH)
-return bitmap_ior_and_compl_into (op1, op2,
- regs_invalidated_by_call_regset);
+return bitmap_ior_and_compl_from_hard_reg_set (op1, op2,
+  regs_invalidated_by_call);
   else
 return bitmap_ior_into (op1, op2);
 }

=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-07-25 13:58:58 +
@@ -409,7 +409,7 @@ df_scan_start_dump (FILE *file ATTRIBUTE
   rtx insn;
 
   fprintf (file, ";;  invalidated by call \t");
-  df_print_regset (file, regs_invalidated_by_call_regset);
+  df_print_hard_reg_set (file, regs_invalidated_by_call);
   fprintf (file, ";;  hardware regs used \t");
   df_print_regset (file, &df->hardware_regs_used);
   fprintf (file, ";;  regular block artificial uses \t");
@@ -3317,7 +3317,7 @@ df_get_call_refs (struct df_collection_r
   int flags)
 {
   rtx note;
-  bitmap_iterator bi;
+  hard_reg_set_iterator iter;
   unsigned int ui;
   bool is_sibling_call;
   unsigned int i;
@@ -3375,7 +3375,7 @@ df_get_call_refs (struct df_collection_r
}
 
   is_sibling_call = SIBLING_CALL_P (insn_info->insn);

Re: [RFC] Replace some bitmaps with HARD_REG_SETs - second version

2011-07-25 Thread Dimitrios Apostolou
Bug found: in df_mark_reg I need to iterate until regno + n, not n. The
error is in the following hunk:

--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-07-24 17:16:46 +
@@ -3713,35 +3717,40 @@ df_mark_reg (rtx reg, void *vset)
   if (regno < FIRST_PSEUDO_REGISTER)
 {
   int n = hard_regno_nregs[regno][GET_MODE (reg)];
-  bitmap_set_range (set, regno, n);
+  int i;
+  for (i = regno; i < n; i++)
+    SET_HARD_REG_BIT (*set, i);

Many thanks to monoid from IRC for spotting it! I'll post an updated patch
soon.


Thanks, 
Dimitris




Re: [RFC] Replace some bitmaps with HARD_REG_SETs

2011-07-24 Thread Dimitrios Apostolou

Hi Steven,

On Sun, 24 Jul 2011, Steven Bosscher wrote:

Can you please create your patches with the -p option, so that it's
easier to see what function you are changing? Also, even for an RFC
patch a ChangeLog is more than just nice to have ;-)


Do you mean an entry in the ChangeLog file in the root directory? I
should update my tree to the latest every time for my patch to be valid,
right?




This hunk in df-problems looks odd:

@@ -2511,9 +2535,9 @@
  if (bb_index == EXIT_BLOCK)
{
  unsigned regno;
- bitmap_iterator bi;
- EXECUTE_IF_SET_IN_BITMAP (df->exit_block_uses, FIRST_PSEUDO_REGISTER,
-   regno, bi)

So this EXECUTE_IF_SET_IN_BITMAP starts with FIRST_PSEUDO_REGISTER
(i.e. works on pseudo registers) ...

+ hard_reg_set_iterator iter;
+ EXECUTE_IF_SET_IN_HARD_REG_SET (df->exit_block_uses, 
FIRST_PSEUDO_REGISTER,
+ regno, iter)
gcc_unreachable ();
}
  else

... and you change it to work only on hard registers. That code looked
like a check to verify that exit_block_uses only contains hard
registers. So you can probably just drop this check.


Thanks for reminding me of this; I had indeed removed the check, I just
hadn't committed it to my VCS :-)


Should I resend my patch with all the suggestions you made?


Thanks,
Dimitris



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-08 Thread Dimitrios Apostolou

On Fri, 8 Jul 2011, Paolo Bonzini wrote:

On 07/08/2011 05:51 AM, Dimitrios Apostolou wrote:

+ /* first write DF_REF_BASE */


This is not necessary.  These uses are written to use_vec, while the uses 
from REG_EQUIV and REG_EQUAL are written to eq_use_vec (see 
df_ref_create_structure).


Thanks for pointing this out; I missed it, as it's complex to follow
what is written where. Perhaps it would make sense to change the
interface of df_ref_create_structure() to accept the particular vector.




Also, anyway this wouldn't work because you would have to split the loop in 
two.  I'll attribute that to the time of day when you were writing the 
message. :)


Even though I'm having my own disputes with Morpheus, sleep deprivation
wasn't the culprit for this. :-) My intention was to keep DF_REF_BASEs
close to each other, because a nearly sorted array will be sorted faster
with the sorting function I'll write... :-)


But anyway, as you've explained, it's irrelevant since they go to
different vectors.
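
For completeness, the kind of sort I had in mind above: a plain insertion
sort runs in O(n + inversions), i.e. nearly linear on an almost-sorted
array. A standalone sketch:

#include <stddef.h>

static void
insertion_sort (unsigned int *v, size_t n)
{
  size_t i, j;

  for (i = 1; i < n; i++)
    {
      unsigned int key = v[i];

      /* Shift the few out-of-place elements right; cheap when the
         array is already nearly sorted.  */
      for (j = i; j > 0 && v[j - 1] > key; j--)
        v[j] = v[j - 1];
      v[j] = key;
    }
}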



Thanks,
Dimitris



Re: what can be in a group set?

2011-07-08 Thread Dimitrios Apostolou

Thanks Paolo for the detailed explanation!

On Fri, 8 Jul 2011, Paolo Bonzini wrote:


That said, changing exit_block_uses and entry_block_defs to HARD_REG_SET would
be a nice cleanup, but it would also touch target code due to

 targetm.extra_live_on_entry (entry_block_defs);



I've already done that :-p


I wouldn't bother for now until you're a bit more experienced.  Unlike
invalidated_by_call it shouldn't show up in profiles, or does it?


Indeed it doesn't show; I just wanted to do it as a clean-up for
transitioning all relevant sets in struct df_d to HARD_REG_SET.


The only remaining problem is that I need a
bitmap_copy_from_hard_reg_set() function for df_lr_local_compute(), where
the bb_info->use bitmap is initialised from the exit_block_uses
HARD_REG_SET.
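
Something along these lines is what I have in mind (a sketch only; the
name and exact shape are my assumption, not existing GCC code):

static void
bitmap_copy_from_hard_reg_set (bitmap to, const HARD_REG_SET from)
{
  unsigned int i;
  hard_reg_set_iterator iter;

  /* Start from an empty bitmap and copy each hard reg bit over.  */
  bitmap_clear (to);
  EXECUTE_IF_SET_IN_HARD_REG_SET (from, 0, i, iter)
    bitmap_set_bit (to, i);
}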



Dimitris



Re: what can be in a group set?

2011-07-08 Thread Dimitrios Apostolou

On Fri, 8 Jul 2011, Paolo Bonzini wrote:

On 07/08/2011 12:43 PM, Richard Sandiford wrote:

The docs also say that the first expr_list can be null:

   If @var{lval} is a @code{parallel}, it is used to represent the case of
   a function returning a structure in multiple registers.  Each element
   of the @code{parallel} is an @code{expr_list} whose first operand is a
   @code{reg} and whose second operand is a @code{const_int} representing 
the

   offset (in bytes) into the structure at which the data in that register
   corresponds.  The first element may be null to indicate that the 
structure

   is also passed partly in memory.

but I can't see any code to handle that.  Am I missing something,
or does the lack of a crash here mean that we can remove the last
sentence?

(It might have been added for symmetry with argument passing, where this
sort of thing is needed.  But if it isn't actually used or implemented for
returns, it might be less confusing to remove it.)


Indeed.  Dimitrios, can you pick up the patch since it will somewhat simplify 
your work to eliminate defs_generated?


I'll certainly try :-)

Paolo, something else: in df_mark_reg(), is it ever possible for regno
to be >= FIRST_PSEUDO_REGISTER? An assert I've added doesn't trigger for
my simple test :-)



Thanks,
Dimitris



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-08 Thread Dimitrios Apostolou

On Fri, 8 Jul 2011, Richard Guenther wrote:

On Fri, Jul 8, 2011 at 5:20 AM, Dimitrios Apostolou  wrote:

Hello list,

The attached patch does two things for df_get_call_refs():
* First it uses HARD_REG_SETs for defs_generated and
regs_invalidated_by_call, instead of bitmaps. Replacing in total more than
400K calls (for my testcase) to bitmap_bit_p() with the much faster
TEST_HARD_REG_BIT reduces the total instruction count from about 13M to
1.5M.
+ Second it produces the REFs in REGNO order, which is important to keep
the collection_rec sorted most of the time and avoid expensive calls to
qsort(). Thanks to Paolo Bonzini for the idea and mentoring.

The second part makes a big difference if accompanied by another patch in
df_insn_refs_collect(). I'll post a followup patch that is unfortunately
unstable for some of my tests, so I'd appreciate any comments.


Did you check the impact on memory usage?  I suppose on targets
with not many hard registers it should even improve, but do we expect
memory usage to be worse in any case?


Hi Richard, I didn't check memory usage; is that important? Since struct
bitmap is fairly bulky, it should take an arch with lots of hard regs for
memory usage to get worse (which one has the most?).


But still, wouldn't a tradeoff of a few bytes be acceptable for a much
faster type? And IMHO it makes the code easier to understand, since once
you see HARD_REG_SET you know exactly what to expect. FWIW I'm now in the
process of converting all other bitmap uses for hard regs to
HARD_REG_SETs, at least within DF. I'm not sure whether performance gains
will be visible, however, since not much code is as hot as
df_get_call_refs().



Thanks,
Dimitris



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-08 Thread Dimitrios Apostolou

On Fri, 8 Jul 2011, Jakub Jelinek wrote:

On Fri, Jul 08, 2011 at 06:20:04AM +0300, Dimitrios Apostolou wrote:

The attached patch does two things for df_get_call_refs():
* First it uses HARD_REG_SETs for defs_generated and
regs_invalidated_by_call, instead of bitmaps. Replacing in total
more than 400K calls (for my testcase) to bitmap_bit_p() with the
much faster TEST_HARD_REG_BIT, reduces the total instruction count
from about 13M to 1.5M.


Have you verified that collection_rec->def_vec never contains pseudo
register references?  Otherwise you couldn't use
HARD_REG_SET... gcc_checking_assert might be useful.



Hi Jakub, Steve pointed me to the following from the GCC Internals Manual:

call_insn insns have the same extra fields as insn insns, accessed in the 
same way and in addition contain a field CALL_INSN_FUNCTION_USAGE, which 
contains a list (chain of expr_list expressions) containing use and 
clobber expressions that denote hard registers and MEMs used or clobbered 
by the called function.



So doesn't that mean that for CALL insns it should contain only HARD_REG
DEFs? I will of course use an assert to be sure.
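
Roughly like this, reusing the iteration style from the patch and the
gcc_checking_assert Jakub mentioned (a sketch; i and def are the locals
already declared in df_get_call_refs()):

  FOR_EACH_VEC_ELT (df_ref, collection_rec->def_vec, i, def)
    gcc_checking_assert (DF_REF_REGNO (def) < FIRST_PSEUDO_REGISTER);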



Thanks,
Dimitris



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-08 Thread Dimitrios Apostolou

On Fri, 8 Jul 2011, Steven Bosscher wrote:

On Fri, Jul 8, 2011 at 5:20 AM, Dimitrios Apostolou  wrote:

The attached patch does two things for df_get_call_refs():


How did you test this patch?

Normally, a patch submission comes with text like, "Bootstrapped &
tested on ..., no regressions.". Also, you chould write a ChangeLog
entry, best included in your mail somewhere at the end ;-)


Hi Steven, thanks for the instructions. I haven't run the mandatory tests
you told me about, only done some minor testing, due to lack of time. I'm
not posting patches for inclusion yet, but more as an RFC. Should such
patches be sent to gcc instead of gcc-patches?



Thanks,
Dimitris



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-07 Thread Dimitrios Apostolou
And here is the patch that breaks things. By moving df_defs_record()
*after* df_get_call_refs(), collection_rec remains sorted most of the
time, and about 50M instructions are avoided in the qsort() calls of
df_canonize_collection_rec().


Unfortunately this does not work. Sometimes cc1 crashes, for example 
because regstack is empty in subst_stack_regs_in_debug_insn().


Any ideas for ensuring proper ordering of collection_rec?


Thanks,
Dimitris=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-07-08 01:28:55 +
+++ gcc/df-scan.c   2011-07-08 03:38:38 +
@@ -3412,8 +3412,9 @@
   VEC_truncate (df_ref, collection_rec->eq_use_vec, 0);
   VEC_truncate (df_mw_hardreg_ptr, collection_rec->mw_vec, 0);
 
-  /* Record register defs.  */
-  df_defs_record (collection_rec, PATTERN (insn_info->insn), bb, insn_info, 0);
+  if (CALL_P (insn_info->insn))
+df_get_call_refs (collection_rec, bb, insn_info,
+ (is_cond_exec) ? DF_REF_CONDITIONAL : 0);
 
   /* Process REG_EQUIV/REG_EQUAL notes.  */
   for (note = REG_NOTES (insn_info->insn); note;
@@ -3421,33 +3422,33 @@
 {
   switch (REG_NOTE_KIND (note))
 {
+ /* first write DF_REF_BASE */
+case REG_NON_LOCAL_GOTO:
+  /* The frame ptr is used by a non-local goto.  */
+  df_ref_record (DF_REF_BASE, collection_rec,
+ regno_reg_rtx[FRAME_POINTER_REGNUM],
+ NULL, bb, insn_info,
+ DF_REF_REG_USE, 0);
+#if !HARD_FRAME_POINTER_IS_FRAME_POINTER
+  df_ref_record (DF_REF_BASE, collection_rec,
+ regno_reg_rtx[HARD_FRAME_POINTER_REGNUM],
+ NULL, bb, insn_info,
+ DF_REF_REG_USE, 0);
+#endif
+  break;
 case REG_EQUIV:
 case REG_EQUAL:
   df_uses_record (collection_rec,
   &XEXP (note, 0), DF_REF_REG_USE,
   bb, insn_info, DF_REF_IN_NOTE);
   break;
-case REG_NON_LOCAL_GOTO:
-  /* The frame ptr is used by a non-local goto.  */
-  df_ref_record (DF_REF_BASE, collection_rec,
- regno_reg_rtx[FRAME_POINTER_REGNUM],
- NULL, bb, insn_info,
- DF_REF_REG_USE, 0);
-#if !HARD_FRAME_POINTER_IS_FRAME_POINTER
-  df_ref_record (DF_REF_BASE, collection_rec,
- regno_reg_rtx[HARD_FRAME_POINTER_REGNUM],
- NULL, bb, insn_info,
- DF_REF_REG_USE, 0);
-#endif
-  break;
 default:
   break;
 }
 }
 
-  if (CALL_P (insn_info->insn))
-df_get_call_refs (collection_rec, bb, insn_info,
- (is_cond_exec) ? DF_REF_CONDITIONAL : 0);
+  /* Record register defs.  */
+  df_defs_record (collection_rec, PATTERN (insn_info->insn), bb, insn_info, 0);
 
   /* Record the register uses.  */
   df_uses_record (collection_rec,



Re: [df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-07 Thread Dimitrios Apostolou
To document the gains from dropping the bitmaps, here is (part of) the
annotated source from the callgrind profiler, showing instruction counts.
Before:


 1,154,400  if (bitmap_bit_p(regs_invalidated_by_call_regset, i)
 8,080,800  => bitmap.c:bitmap_bit_p (192400x)
 1,021,200  && !bitmap_bit_p (&defs_generated, i)
 5,106,000  => bitmap.c:bitmap_bit_p (170200x)
   340,400  && (!is_sibling_call
 .  || !bitmap_bit_p (df->exit_block_uses, i)
 .  || refers_to_regno_p (i, i+1,
 .crtl->return_rtx, NULL)))
 2,053,500df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i
35,279,934  => df-scan.c:df_ref_record (170200x)
 .   NULL, bb, insn_info, DF_REF_REG_DEF,
 .   DF_REF_MAY_CLOBBER | flags);
 .  }


After:


 1,346,800  if (TEST_HARD_REG_BIT(regs_invalidated_by_call, i)
   510,600  && !TEST_HARD_REG_BIT (defs_generated, i)
   340,400  && (!is_sibling_call
 .  || !bitmap_bit_p (df->exit_block_uses, i)
 .  || refers_to_regno_p (i, i+1,
 .crtl->return_rtx, NULL)))
 2,057,200df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i
35,279,934  => df-scan.c:df_ref_record (170200x)
 .   NULL, bb, insn_info, DF_REF_REG_DEF,
 .   DF_REF_MAY_CLOBBER | flags);
 .  }



Dimitris



[df-scan.c] Optimise DF_REFs ordering in collection_rec, use HARD_REG_SETs instead of bitmaps

2011-07-07 Thread Dimitrios Apostolou

Hello list,

The attached patch does two things for df_get_call_refs():
* First it uses HARD_REG_SETs for defs_generated and
regs_invalidated_by_call, instead of bitmaps. Replacing in total more
than 400K calls (for my testcase) to bitmap_bit_p() with the much faster
TEST_HARD_REG_BIT reduces the total instruction count from about 13M to
1.5M.
* Second it produces the REFs in REGNO order, which is important to keep
the collection_rec sorted most of the time and avoid expensive calls to
qsort(). Thanks to Paolo Bonzini for the idea and mentoring.


The second part makes a big difference if accompanied by another patch
in df_insn_refs_collect(). I'll post a followup patch that is
unfortunately unstable for some of my tests, so I'd appreciate any
comments.



Thanks,
Dimitris
=== modified file 'gcc/df-scan.c'
--- gcc/df-scan.c   2011-02-02 20:08:06 +
+++ gcc/df-scan.c   2011-07-08 01:28:55 +
@@ -3317,20 +3317,56 @@
   int flags)
 {
   rtx note;
-  bitmap_iterator bi;
-  unsigned int ui;
   bool is_sibling_call;
   unsigned int i;
   df_ref def;
-  bitmap_head defs_generated;
+  HARD_REG_SET defs_generated;
 
-  bitmap_initialize (&defs_generated, &df_bitmap_obstack);
+  CLEAR_HARD_REG_SET(defs_generated);
 
   /* Do not generate clobbers for registers that are the result of the
  call.  This causes ordering problems in the chain building code
  depending on which def is seen first.  */
   FOR_EACH_VEC_ELT (df_ref, collection_rec->def_vec, i, def)
-bitmap_set_bit (&defs_generated, DF_REF_REGNO (def));
+SET_HARD_REG_BIT (defs_generated, DF_REF_REGNO (def));
+
+  is_sibling_call = SIBLING_CALL_P (insn_info->insn);
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+{
+  if (i == STACK_POINTER_REGNUM)
+   /* The stack ptr is used (honorarily) by a CALL insn.  */
+   df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+  NULL, bb, insn_info, DF_REF_REG_USE,
+  DF_REF_CALL_STACK_USAGE | flags);
+  else if (global_regs[i])
+   {
+ /* Calls to const functions cannot access any global registers and
+calls to pure functions cannot set them.  All other calls may
+reference any of the global registers, so they are recorded as
+used. */
+ if (!RTL_CONST_CALL_P (insn_info->insn))
+   {
+ df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+NULL, bb, insn_info, DF_REF_REG_USE, flags);
+ if (!RTL_PURE_CALL_P (insn_info->insn))
+   df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+  NULL, bb, insn_info, DF_REF_REG_DEF, flags);
+   }
+   }
+  /* TODO HARD_REG_SET set intersection! */
+  else /* !global_regs[i] */
+   /* track Caller-Saved registers */
+   if (TEST_HARD_REG_BIT(regs_invalidated_by_call, i)
+   && !TEST_HARD_REG_BIT (defs_generated, i)
+   && (!is_sibling_call
+   || !bitmap_bit_p (df->exit_block_uses, i)
+   || refers_to_regno_p (i, i+1,
+ crtl->return_rtx, NULL)))
+ df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
+NULL, bb, insn_info, DF_REF_REG_DEF,
+DF_REF_MAY_CLOBBER | flags);
+}
 
   /* Record the registers used to pass arguments, and explicitly
  noted as clobbered.  */
@@ -3345,7 +3381,7 @@
  if (REG_P (XEXP (XEXP (note, 0), 0)))
{
  unsigned int regno = REGNO (XEXP (XEXP (note, 0), 0));
- if (!bitmap_bit_p (&defs_generated, regno))
+ if (!TEST_HARD_REG_BIT (defs_generated, regno))
df_defs_record (collection_rec, XEXP (note, 0), bb,
insn_info, flags);
}
@@ -3355,40 +3391,6 @@
}
 }
 
-  /* The stack ptr is used (honorarily) by a CALL insn.  */
-  df_ref_record (DF_REF_BASE, collection_rec, 
regno_reg_rtx[STACK_POINTER_REGNUM],
-NULL, bb, insn_info, DF_REF_REG_USE,
-DF_REF_CALL_STACK_USAGE | flags);
-
-  /* Calls to const functions cannot access any global registers and calls to
- pure functions cannot set them.  All other calls may reference any of the
- global registers, so they are recorded as used.  */
-  if (!RTL_CONST_CALL_P (insn_info->insn))
-for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
-  if (global_regs[i])
-   {
- df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
-NULL, bb, insn_info, DF_REF_REG_USE, flags);
- if (!RTL_PURE_CALL_P (insn_info->insn))
-   df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
-  NULL, bb, insn_info, DF_REF_REG_DEF, flags);
-   }
-
-  is_sibling_call = SIBLING_CALL_P (insn_info->insn);
-  EXECUTE_IF_SET_IN_BITMAP (regs_invalidated_by_call_regset, 0, ui, bi)

Re: Patch: speed up compiler a little bit by optimizing lookup_attribute() and is_attribute_p()

2011-06-21 Thread Dimitrios Apostolou

Hi Nicola,

my patch is very simple compared to yours; feel free to work on it as
much as you wish, and no need to credit me since you posted yours
independently. I just posted mine to note that the inlining part is the
one providing most of the performance benefit.


richi: I used always_inline because it is the case that you *never* want 
it to be out-of-line, since it's a small wrapper function providing 
important performance benefit. Do you think a macro would be better?



Thanks,
Dimitris

P.S. Nicola, if you remember, please keep me Cc'd on further posts of
yours related to GCC performance.



On Tue, 21 Jun 2011, Nicola Pero wrote:

Dimitrios

I didn't realize you were working on this.  Your patch is indeed very similar. 
:-)

Can I go ahead and rewrite mine following Richard's suggestions (which
would make it even more similar to yours), and add your name to the
ChangeLog entry too?

Thanks


Re: Patch: speed up compiler a little bit by optimizing lookup_attribute() and is_attribute_p()

2011-06-21 Thread Dimitrios Apostolou
FWIW I think that most of the speedup is due to inlining
lookup_attribute(). I got almost the same speedup by applying only the
attached very simple patch, since strlen() was called too often
(according to the profile at [1]). I used the always_inline attribute to
avoid using a macro.


I was going to send it together with some other changes I'm trying,
after proper measurements. Anyway, better late than never. Thanks to
Christophe Jaillet (CC'd) for pointing it out to me.



Thanks,
Dimitris


[1] http://gcc.gnu.org/wiki/OptimisingGCC




On Tue, 21 Jun 2011, Richard Guenther wrote:


On Tue, Jun 21, 2011 at 12:17 PM, Nicola Pero
 wrote:

This patch speeds up the C/C++/ObjC/ObjC++ compiler a little bit by optimizing
lookup_attribute() and is_attribute_p().  The main change is that these 
functions
are now inline.


I don't think this is a good idea.


Can you explain why?  You never do in your response :-)

I'm guessing it's because you think the inline functions are too big and that's
counter-productive?


Yes.


For this case I'd factor out the NULL attribute list check into an
inline function and keeping the non-NULL attribute list pieces out-of-line.


Ok ... so is this what you'd like - the common, hot case in the inline function,
and the less common, cold case in the out-of-line one? :-)


Right.


That makes sense - I'll try that out tonight (it takes a few hours to run all
the benchmarks). ;-)


Thanks.




Does it work for all languages?  And yes, I agree it would be nice to
canonicalize to the form without _s even in the attribute lists itself.
Or go one step further - do not store an identifier but an enum as we
in fact only ever insert known attributes into the list (which should
be a VEC instead, possibly even sorted to allow binary search).


Sure ... one step at a time. :-)


Of course ;)

Richard.


Thanks


--- gcc-4.6.0.orig/gcc/tree.c   2011-03-14 14:20:48.0 +0200
+++ gcc-4.6.0-mine/gcc/tree.c   2011-06-21 12:57:00.0 +0300
@@ -5230,10 +5230,9 @@ is_attribute_p (const char *attr, const_
be passed back in if further occurrences are wanted.  */
 
 tree
-lookup_attribute (const char *attr_name, tree list)
+lookup_attribute_len (const char *attr_name, size_t attr_len, tree list)
 {
   tree l;
-  size_t attr_len = strlen (attr_name);
 
   for (l = list; l; l = TREE_CHAIN (l))
 {

--- gcc-4.6.0.orig/gcc/tree.h   2011-03-12 00:38:58.0 +0200
+++ gcc-4.6.0-mine/gcc/tree.h   2011-06-21 13:13:02.0 +0300
@@ -4369,7 +4369,16 @@ extern int is_attribute_p (const char *,
 /* Given an attribute name and a list of attributes, return the list element
of the attribute or NULL_TREE if not found.  */
 
-extern tree lookup_attribute (const char *, tree);
+extern tree lookup_attribute_len (const char *, size_t, tree);
+
+/* Implemented as inline so that the compiler optimises away the strlen()
+   for known strings.  */
+static inline __attribute__((__always_inline__)) tree
+lookup_attribute (const char *attr_name, tree list)
+{
+  return lookup_attribute_len (attr_name, strlen (attr_name), list);
+}
+
 
 /* Remove any instances of attribute ATTR_NAME in LIST and return the
modified list.  */


Re: fix memory leak in gengtype

2011-04-21 Thread Dimitrios Apostolou

On Thu, 21 Apr 2011, Laurynas Biveinis wrote:


:( Why don't you get yourself a compile farm account?
http://gcc.gnu.org/wiki/CompileFarm



Thanks Laurynas, I am absolutely thrilled to see such a variety of
hardware! I'll try applying, but I'm not sure I'm eligible; my
contributions to OSS are few and mostly simple.



Thanks,
Dimitris



Re: fix memory leak in gengtype

2011-04-20 Thread Dimitrios Apostolou

On Wed, 20 Apr 2011, Jeff Law wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 04/20/11 15:08, Dimitrios Apostolou wrote:

Hello list,

while trying to build gcc-4.6.0 on my sparcstation, I got gengtype OOM
killed. That's when I noticed that its RAM usage peaks at 150MB, which
is a bit excessive for parsing a ~500K text file.

The attached patch fixes the leak and gengtype now uses a peak of 4MB
heap. Hopefully I didn't do something wrong, since it took me a while to
understand those obstacks...

The code in question creates an obstack, allocates (and grows) a single
object on the obstack, then frees the object.  This leaks the underlying
obstack structure itself and potentially any chunks that were too small
to hold the object.


Plus a whole page that is preallocated by the obstack, if I understand
correctly. As a result, we consume 4KB for each word in the text file,
and it is never freed.
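
For anyone following along, the pattern boils down to this minimal
standalone sketch (not the actual gengtype code):

#include <stdlib.h>
#include <obstack.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free free

static void
demo (void)
{
  struct obstack ob;
  char *s;

  obstack_init (&ob);
  obstack_grow0 (&ob, "token", 5);
  s = (char *) obstack_finish (&ob);

  /* Frees the object but keeps the obstack struct and its
     preallocated chunk alive: the leak described above.  */
  obstack_free (&ob, s);

  /* This is what actually releases everything.  */
  obstack_free (&ob, NULL);
}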




It turns out there's a similar leak in gengtype.c which is fixed in the
same way.



Nice, thanks for looking deeper into this; I just stopped when memory
utilisation seemed OK.



A quick valgrind test shows that prior to your change gengtype leaked
roughly 200M, after your change it leaks about 1.3M and after fixing
gengtype it leaks a little under 300k.

I'll run those changes through the usual tests and check in the changes
assuming they pass those tests.

Thanks for the patch!



P.S. I was trying to test gcc on a rare arch (sparc-unknown-linux-gnu)
but unfortunately the sparcstation crashed and burned after this, so I
can't continue the build and report back :-(

:(  My old PA box has similar problems, though it merely overheats
before a bootstrap can complete, so in theory I could coax it to finish
a bootstrap.   Luckily others (particularly John) have stepped in over
the last decade and taken excellent care of the PA port.


If by PA you mean PA-RISC, I remember when I had access to a Visualize
C200 with Gentoo on it. I loved the machine, but it had one important
issue: whether it would power up when pressing the power button was
absolutely random. But as long as we never turned it off, it worked fine
:-)



Dimitris


fix memory leak in gengtype

2011-04-20 Thread Dimitrios Apostolou

Hello list,

while trying to build gcc-4.6.0 on my sparcstation, I got gengtype OOM 
killed. That's when I noticed that its RAM usage peaks at 150MB, which is 
a bit excessive for parsing a ~500K text file.


The attached patch fixes the leak and gengtype now uses a peak of 4MB
heap. Hopefully I didn't do something wrong, since it took me a while to
understand those obstacks...



Thanks,
Dimitris


P.S. I was trying to test gcc on a rare arch (sparc-unknown-linux-gnu) but 
unfortunately the sparcstation crashed and burned after this, so I can't 
continue the build and report back :-(
--- gcc/gengtype-state.c.orig   2011-04-20 23:06:29.0 +0300
+++ gcc/gengtype-state.c2011-04-20 23:12:43.0 +0300
@@ -303,7 +303,7 @@
   obstack_1grow (&id_obstack, (char) 0);
   ids = XOBFINISH (&id_obstack, char *);
   sid = state_ident_by_name (ids, INSERT);
-  obstack_free (&id_obstack, ids);
+  obstack_free (&id_obstack, NULL);
   ids = NULL;
   tk = XCNEW (struct state_token_st);
   tk->stok_kind = STOK_NAME;
@@ -408,7 +408,7 @@
   tk->stok_file = state_path;
   tk->stok_next = NULL;
   strcpy (tk->stok_un.stok_string, cstr);
-  obstack_free (&bstring_obstack, cstr);
+  obstack_free (&bstring_obstack, NULL);
 
   return tk;
 }