On 05/10/2023 22:39, Jørgen Kvalsvik wrote:
On 05/10/2023 21:59, Jan Hubicka wrote:
>> Like Wahlen et al this implementation records coverage in fixed-size
>> bitsets which gcov knows how to interpret.  This is very fast, but
>> introduces a limit on the number of terms in a single boolean
>> expression: the number of bits in a gcov_unsigned_type (which is
>> typedef'd to uint64_t), so for most practical purposes this would be
>> acceptable.  This limitation is in the implementation and not the
>> algorithm, so support for more conditions can be added by also
>> introducing arbitrary-sized bitsets.
> This should not be too hard to do - if the conditional is more complex
> you simply introduce more than one counter for it, right?  How many
> times does this trigger on GCC sources?

It shouldn't be, no.  But when dynamic bitsets are on the table it would
be much better to length-encode in smaller multiples than the 64-bit
counters.  Most expressions are small (<4 terms), so the savings would be
substantial.  I opted for the fixed-size approach to start with because
it is much simpler and would not introduce any branching or decisions in
the instrumentation.
I just posted v6 of this patch, and the bitsets are still fixed size. I consider dynamic bitsets out of scope for this particular effort, although I think it might be worth pursuing later.
>> For space overhead, the instrumentation needs two accumulators
>> (gcov_unsigned_type) per condition in the program, which will be
>> written to the gcov file.  In addition, every function gets a pair of
>> local accumulators, but these accumulators are reused between
>> conditions in the same function.  For time overhead, there is a zeroing
>> of the local accumulators for every condition and one or two bitwise
>> operations on every edge taken in an expression.
>>
>> In action it looks pretty similar to branch coverage.  The -g short opt
>> carries no significance, but was chosen because it was an available
>> option with the upper-case letter still free.
>>
>> gcov --conditions:
>>
>>         3:   17:void fn (int a, int b, int c, int d) {
>>         3:   18:    if ((a && (b || c)) && d)
>> conditions covered 3/8
>> condition  0 not covered (true)
>> condition  0 not covered (false)
>> condition  1 not covered (true)
>> condition  2 not covered (true)
>> condition  3 not covered (true)
> It seems understandable, but for bigger conditionals I guess it will be
> a bit hard to make sense of which condition numbers correspond to which
> parts of the actual source code.  We could probably also show the
> conditions as ranges in the conditional?  I am adding David Malcolm to
> CC, he may have some ideas.  I wonder how much this information is
> confused by early optimizations happening before coverage profiling?
Yes, but I could not find gcov mechanisms strong enough to make a truly
better report, and the json + extra tooling is more in line with what you
want for large conditionals, I think.  I also deliberately print one unit
of information per line to play nicer with grep, so it currently looks
like:
conditions covered 3/8
condition  0 not covered (true false)
...

I think improving gcov's printing abilities would do wonders, but I
couldn't find anything that would support it currently.  Did I miss
something?
>> Some expressions, mostly those without else-blocks, are effectively
>> "rewritten" in the CFG construction, making the algorithm unable to
>> distinguish them:
>>
>> and.c:
>>     if (a && b && c)
>>         x = 1;
>>
>> ifs.c:
>>     if (a)
>>         if (b)
>>             if (c)
>>                 x = 1;
>>
>> gcc will build the same graph for both these programs, and gcov will
>> report both as 3-term expressions.  It is vital that it is not
>> interpreted the other way around (which is consistent with the shape of
>> the graph), because otherwise the masking would be wrong for the and.c
>> program, which is a more severe error.  While surprising, users would
>> probably expect some minor rewriting of semantically identical
>> expressions.
>>
>> and.c.gcov:
>>     #####:    2:    if (a && b && c)
>> conditions covered 6/6
>>     #####:    3:        x = 1;
>>
>> ifs.c.gcov:
>>     #####:    2:    if (a)
>>     #####:    3:        if (b)
>>     #####:    4:            if (c)
>>     #####:    5:                x = 1;
>> conditions covered 6/6
> Maybe one can use location information to distinguish those cases?
> Don't we store discriminator info about individual statements that is
> also used for auto-FDO?

That is one possibility, which I tried for a bit, but abandoned to focus
on getting the rest of the algorithm right.  I am sure it can be
revisited (possibly as a future improvement) and weighed against always
emitting an else block (see
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631254.html)
This has not been changed in v6. I think the always-else approach is nicer.
>> gcc/ChangeLog:
>>
>>     * builtins.cc (expand_builtin_fork_or_exec): Check
>>     profile_condition_flag.
>>     * collect2.cc (main): Add -fno-profile-conditions to OBSTACK.
>>     * common.opt: Add new options -fprofile-conditions and
>>     * doc/gcov.texi: Add --conditions documentation.
>>     * doc/invoke.texi: Add -fprofile-conditions documentation.
>>     * gcc.cc: Link gcov on -fprofile-conditions.
>>     * gcov-counter.def (GCOV_COUNTER_CONDS): New.
>>     * gcov-dump.cc (tag_conditions): New.
>>     * gcov-io.h (GCOV_TAG_CONDS): New.
>>     (GCOV_TAG_CONDS_LENGTH): Likewise.
>>     (GCOV_TAG_CONDS_NUM): Likewise.
>>     * gcov.cc (class condition_info): New.
>>     (condition_info::condition_info): New.
>>     (condition_info::popcount): New.
>>     (struct coverage_info): New.
>>     (add_condition_counts): New.
>>     (output_conditions): New.
>>     (print_usage): Add -g, --conditions.
>>     (process_args): Likewise.
>>     (output_intermediate_json_line): Output conditions.
>>     (read_graph_file): Read conditions counters.
>>     (read_count_file): Read conditions counters.
>>     (file_summary): Print conditions.
>>     (accumulate_line_info): Accumulate conditions.
>>     (output_line_details): Print conditions.
>>     * ipa-inline.cc (can_early_inline_edge_p): Check
>>     profile_condition_flag.
>>     * ipa-split.cc (pass_split_functions::gate): Likewise.
>>     * passes.cc (finish_optimization_passes): Likewise.
>>     * profile.cc (find_conditions): New declaration.
>>     (cov_length): Likewise.
>>     (cov_blocks): Likewise.
>>     (cov_masks): Likewise.
>>     (cov_free): Likewise.
>>     (instrument_decisions): New.
>>     (read_thunk_profile): Control output to file.
>>     (branch_prob): Call find_conditions, instrument_decisions.
>>     (init_branch_prob): Add total_num_conds.
>>     (end_branch_prob): Likewise.
>>     * tree-profile.cc (struct conds_ctx): New.
>>     (CONDITIONS_MAX_TERMS): New.
>>     (EDGE_CONDITION): New.
>>     (cmp_index_map): New.
>>     (index_of): New.
>>     (block_conditional_p): New.
>>     (edge_conditional_p): New.
>>     (single): New.
>>     (single_edge): New.
>>     (contract_edge): New.
>>     (contract_edge_up): New.
>>     (ancestors_of): New.
>>     (struct outcomes): New.
>>     (conditional_succs): New.
>>     (condition_index): New.
>>     (masking_vectors): New.
>>     (cond_reachable_from): New.
>>     (neighborhood): New.
>>     (isolate_expression): New.
>>     (emit_bitwise_op): New.
>>     (make_index_map_visit): New.
>>     (make_index_map): New.
>>     (collect_conditions): New.
>>     (yes): New.
>>     (struct condcov): New.
>>     (cov_length): New.
>>     (cov_blocks): New.
>>     (cov_masks): New.
>>     (cov_free): New.
>>     (find_conditions): New.
>>     (instrument_decisions): New.
>>     (tree_profiling): Check profile_condition_flag.
>>     (pass_ipa_tree_profile::gate): Likewise.
>>
>> gcc/testsuite/ChangeLog:
>>
>>     * lib/gcov.exp: Add condition coverage test function.
>>     * g++.dg/gcov/gcov-18.C: New test.
>>     * gcc.misc-tests/gcov-19.c: New test.
>>     * gcc.misc-tests/gcov-20.c: New test.
>>     * gcc.misc-tests/gcov-21.c: New test.
>> ---
>>  gcc/builtins.cc                        |    2 +-
>>  gcc/collect2.cc                        |    7 +-
>>  gcc/common.opt                         |    8 +
>>  gcc/doc/gcov.texi                      |   37 +
>>  gcc/doc/invoke.texi                    |   19 +
>>  gcc/gcc.cc                             |    4 +-
>>  gcc/gcov-counter.def                   |    3 +
>>  gcc/gcov-dump.cc                       |   24 +
>>  gcc/gcov-io.h                          |    3 +
>>  gcc/gcov.cc                            |  197 +++-
>>  gcc/ipa-inline.cc                      |    2 +-
>>  gcc/ipa-split.cc                       |    3 +-
>>  gcc/passes.cc                          |    3 +-
>>  gcc/profile.cc                         |   84 +-
>>  gcc/testsuite/g++.dg/gcov/gcov-18.C    |  234 +++++
>>  gcc/testsuite/gcc.misc-tests/gcov-19.c | 1249 ++++++++++++++++++++++++
>>  gcc/testsuite/gcc.misc-tests/gcov-20.c |   22 +
>>  gcc/testsuite/gcc.misc-tests/gcov-21.c |   16 +
>>  gcc/testsuite/lib/gcov.exp             |  191 +++-
>>  gcc/tree-profile.cc                    | 1048 +++++++++++++++++++-
>>  libgcc/libgcov-merge.c                 |    5 +
>>  21 files changed, 3135 insertions(+), 26 deletions(-)
>>  create mode 100644 gcc/testsuite/g++.dg/gcov/gcov-18.C
>>  create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-19.c
>>  create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-20.c
>>  create mode 100644 gcc/testsuite/gcc.misc-tests/gcov-21.c
>>
>> @@ -392,6 +394,28 @@ tag_arcs (const char *filename ATTRIBUTE_UNUSED,
>>      }
>>  }
>>
>> +static void
>> +tag_conditions (const char *filename ATTRIBUTE_UNUSED,
>> +                unsigned tag ATTRIBUTE_UNUSED, int length ATTRIBUTE_UNUSED,
>> +                unsigned depth)
> There should be block comments before functions
Done.
>> @@ -134,6 +135,28 @@ public:
>>      vector<unsigned> lines;
>>  };
>>
>> +class condition_info
> ... and datastructures.
Done.
>> +/* Output conditions (modified condition/decision coverage) */
>> +
>> +static int flag_conditions = 0;
> This is really a bool.  Gcov was not updated for a while, I think you
> can simply convert other flag_* to bools as well.

Done, but only for my flag.  I can change the remaining flags in another
patch.
>> diff --git a/gcc/ipa-split.cc b/gcc/ipa-split.cc
>> index 6730f4f9d0e..276ead617c9 100644
>> --- a/gcc/ipa-split.cc
>> +++ b/gcc/ipa-split.cc
>> @@ -1930,7 +1930,8 @@ pass_split_functions::gate (function *)
>>    /* When doing profile feedback, we want to execute the pass after
>>       profiling is read.  So disable one in early optimization.  */
>>    return (flag_partial_inlining
>> -          && !profile_arc_flag && !flag_branch_probabilities);
>> +          && !profile_arc_flag && !flag_branch_probabilities
>> +          && !profile_condition_flag);
> This should be unnecessary - profile_condition is not used for profile
> feedback.
Removed in v6.
>>  }
>>  } // anon namespace
>>
>> diff --git a/gcc/passes.cc b/gcc/passes.cc
>> index 6f894a41d22..02194fe286f 100644
>> --- a/gcc/passes.cc
>> +++ b/gcc/passes.cc
>> @@ -352,7 +352,8 @@ finish_optimization_passes (void)
>>    gcc::dump_manager *dumps = m_ctxt->get_dumps ();
>>
>>    timevar_push (TV_DUMP);
>> -  if (profile_arc_flag || flag_test_coverage || flag_branch_probabilities)
>> +  if (profile_arc_flag || profile_condition_flag || flag_test_coverage
>> +      || flag_branch_probabilities)
>>      {
>>        dumps->dump_start (pass_profile_1->static_pass_number, NULL);
>>        end_branch_prob ();
> I think this whole code is unnecessary since branch probability is now
> an IPA pass and runs on the whole program at once, so the stats done by
> end_branch_prob can be output from the pass itself.
Done, removed.
> (Originally the pass was done as part of the optimization queue and we
> needed a place to output stats at the end of compilation.)

>> @@ -1397,10 +1415,18 @@ branch_prob (bool thunk)
>>
>>    /* Write the data from which gcov can reconstruct the basic block
>>       graph and function line numbers (the gcno file).  */
>> +  output_to_file = false;
>>    if (coverage_begin_function (lineno_checksum, cfg_checksum))
>>      {
>>        gcov_position_t offset;
>> +      /* The condition coverage needs a deeper analysis to identify
>> +       * expressions of conditions, which means it is not yet ready to
>> +       * write to the gcno file.  It will write its entries later, but
>> +       * needs to know if it should do it in the first place, which is
>> +       * controlled by the return value of coverage_begin_function.  */
> No * on every line.
Done, except I missed a few. I have fixed them locally, my apologies.
>> +      output_to_file = true;
>> +
>>        /* Basic block flags */
>>        offset = gcov_write_tag (GCOV_TAG_BLOCKS);
>>        gcov_write_unsigned (n_basic_blocks_for_fn (cfun));
>>
>> +  if (coverage_counter_alloc (GCOV_COUNTER_CONDS, 2 * nconds))
>> +    {
>> +      /* Add two extra variables to the function for the local
>> +         accumulators, which are zero'd on the entry of a new
>> +         conditional.  The local accumulators are shared between
>> +         decisions in order to use less stack space.  */
> Shouldn't we be able to allocate the stack space?
>> +      tree accu[2] = {
>> +        build_decl (UNKNOWN_LOCATION, VAR_DECL,
>> +                    get_identifier ("__accu_t"), get_gcov_type ()),
>> +        build_decl (UNKNOWN_LOCATION, VAR_DECL,
>> +                    get_identifier ("__accu_f"), get_gcov_type ()),
> Can't those be just SSA names?

They can, and it was a lot harder than I thought to get it working
(largely due to my unfamiliarity with it).  v6 uses SSA form and passes
the test suite.
>> +      };
>> +
>> +      gcov_position_t offset {};
>> +      if (output_to_file)
>> +        offset = gcov_write_tag (GCOV_TAG_CONDS);
>> +
>> +      for (unsigned i = 0; i < nconds; ++i)
>> +        {
>> +          array_slice<basic_block> expr = cov_blocks (cov, i);
>> +          array_slice<gcov_type_unsigned> masks = cov_masks (cov, i);
>> +          gcc_assert (expr.is_valid ());
>> +          gcc_assert (masks.is_valid ());
>> +
>> +          int terms = instrument_decisions (expr, i, accu, masks.begin ());
>> +          if (output_to_file)
>> +            {
>> +              gcov_write_unsigned (expr.front ()->index);
>> +              gcov_write_unsigned (terms);
>> +            }
>> +        }
>> +      if (output_to_file)
>> +        gcov_write_length (offset);
>> +    }
>> +  cov_free (cov);
>> +  }
>> +
>>    /* For each edge not on the spanning tree, add counting code.  */
>>    if (profile_arc_flag
>>        && coverage_counter_alloc (GCOV_COUNTER_ARCS, num_instrumented))
>>      {
>>        unsigned n_instrumented;
>>
>> -      gimple_init_gcov_profiler ();
>> -
>>        n_instrumented = instrument_edges (el);
>>
>>        gcc_assert (n_instrumented == num_instrumented);
>>
>>        if (flag_profile_values)
>>          instrument_values (values);
>> -
>> -      /* Commit changes done by instrumentation.  */
>> -      gsi_commit_edge_inserts ();
>>      }
>>
>>    free_aux_for_edges ();
>>    values.release ();
>>    free_edge_list (el);
>>
>> +  /* Commit changes done by instrumentation.  */
>> +  gsi_commit_edge_inserts ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>
>>    if (flag_branch_probabilities
>>        && (profile_status_for_fn (cfun) == PROFILE_READ))
>>
>> @@ -1669,6 +1740,7 @@ init_branch_prob (void)
>>    total_num_passes = 0;
>>    total_num_times_called = 0;
>>    total_num_branches = 0;
>> +  total_num_conds = 0;
>>    for (i = 0; i < 20; i++)
>>      total_hist_br_prob[i] = 0;
>>  }
>>
>> @@ -1708,5 +1780,7 @@ end_branch_prob (void)
>>                  (total_hist_br_prob[i] + total_hist_br_prob[19-i]) * 100
>>                  / total_num_branches, 5*i, 5*i+5);
>>      }
>> +      fprintf (dump_file, "Total number of conditions: %d\n",
>> +               total_num_conds);
>>      }
>>  }
>>
>> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
>> index da300d5f9e8..c8b917afb9a 100644
>> --- a/gcc/tree-profile.cc
>> +++ b/gcc/tree-profile.cc
>> @@ -58,6 +58,8 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "alloc-pool.h"
>>  #include "symbol-summary.h"
>>  #include "symtab-thunks.h"
>> +#include "cfganal.h"
>> +#include "cfgloop.h"
>>
>>  static GTY(()) tree gcov_type_node;
>>  static GTY(()) tree tree_interval_profiler_fn;
>>
>> @@ -73,6 +75,1046 @@ static GTY(()) tree ic_tuple_var;
>>  static GTY(()) tree ic_tuple_counters_field;
>>  static GTY(()) tree ic_tuple_callee_field;
>>
>> +namespace
>> +{
> Maybe some overall comment on what this code is doing (which you
> describe in the email).  Also it may make sense to place it in a new
> source file: it is long enough.

I added a comment, but did not move it out of tree-profile.cc.  As you
say, it is large enough, and I will do it if you want me to, but I still
find it manageable.
>> +/* Some context and reused instances between function calls.  Large
>> +   embedded buffers are used to up-front request enough memory for most
>> +   programs and merge them into a single allocation, at the cost of
>> +   using more memory in the average case.  Some numbers from linux
>> +   v5.13, which is assumed to be a reasonably diverse code base: 75% of
>> +   the functions in linux have less than 16 nodes in the CFG and approx
>> +   2.5% have more than 64 nodes.  The functions that go beyond a few
>> +   dozen nodes tend to be very large (>100), and so 64 seems like a
>> +   good balance.
>> +
>> +   This is really just a performance balance of the cost of allocation
>> +   and wasted memory.  */
>> +struct conds_ctx
>> +{
>> +  /* Bitmap of the processed blocks.  Bit n set means
>> +     basic_block->index has been processed either explicitly or as a
>> +     part of an expression.  */
>> +  auto_sbitmap marks;
>> +
>> +  /* This is both a reusable shared allocation and is used to return
>> +     single expressions, which means for most code it should only hold
>> +     a couple of elements.  */
>> +  auto_vec<basic_block, 32> blocks;
>> +
>> +  /* Map from basic_block->index to an ordering so that for a single
>> +     expression (a || b && c) =>
>> +     index_map[a] < index_map[b] < index_map[c].  The values do not
>> +     have to be consecutive and can be interleaved by values from other
>> +     expressions, so comparisons only make sense for blocks that belong
>> +     to the same expression.  */
>> +  auto_vec<int, 64> index_map;
>> +
>> +  /* Pre-allocate bitmaps and vectors for per-function bookkeeping.
>> +     This is pure instance reuse and the bitmaps carry no data between
>> +     function calls.  */
>> +  auto_vec<basic_block, 64> B1;
>> +  auto_vec<basic_block, 64> B2;
>> +  auto_sbitmap G1;
>> +  auto_sbitmap G2;
>> +  auto_sbitmap G3;
>> +
>> +  explicit conds_ctx (unsigned size) noexcept (true) : marks (size),
>> +    G1 (size), G2 (size), G3 (size)
>> +  {
>> +    bitmap_clear (marks);
>> +  }
>> +
>> +  /* Mark a node as processed so nodes are not processed twice, for
>> +     example in loops or gotos.  */
>> +  void mark (const basic_block b) noexcept (true)
>> +  {
>> +    gcc_assert (!bitmap_bit_p (marks, b->index));
>> +    bitmap_set_bit (marks, b->index);
>> +  }
>> +
>> +  /* Mark nodes as processed so they are not processed twice.  */
>> +  void mark (const vec<basic_block>& bs) noexcept (true)
>> +  {
>> +    for (const basic_block b : bs)
>> +      mark (b);
>> +  }
>> +
>> +  /* Check if all nodes are marked.  A successful run should visit &
>> +     mark every reachable node exactly once.  */
>> +  bool all_marked (const vec<basic_block>& reachable) const noexcept (true)
>> +  {
>> +    for (const basic_block b : reachable)
>> +      if (!bitmap_bit_p (marks, b->index))
>> +        return false;
>> +    return true;
>> +  }
>> +};
>> +
>> +/* Only instrument terms with fewer than the number of bits in a (wide)
>> +   gcov integer, which is probably 64.  The algorithm itself does not
>> +   impose this limitation, but it makes for a simpler implementation.
>> +
>> +   * Allocating the output data structure (coverage_counter_alloc ())
>> +     can assume pairs of gcov_type_unsigned and not use a separate
>> +     length field.
>> +   * A pair of gcov_type_unsigned can be used as accumulators.
>> +   * Updating accumulators can use the bitwise operations |=, &= and
>> +     not custom operators that work for arbitrary-sized bit-sets.
>> +
>> +   Most real-world code should be unaffected by this, but it is
>> +   possible (especially for generated code) to exceed this limit.  */
>> +#define CONDITIONS_MAX_TERMS (sizeof (gcov_type_unsigned) * BITS_PER_UNIT)
> You also need to consider the target number of bits in the gcov type:
> targets can change the 64-bit default to something else.  There is
> gcov_type_node from which you can get the number of bits.

Ah, yes, that was bad.  What I have done is change the compiler storage
type for bitsets to uint64_t, assume for now that the target
gcov_type_unsigned is <= 64 bits, and move the target check to
TYPE_PRECISION.
>> +
>> +/* Special cases of the single_*_p and single_*_edge functions in
>> +   basic-block.h that don't consider exception handling or other
>> +   complex edges.  This helps create a view of the CFG with only normal
>> +   edges - if a basic block has both an outgoing fallthrough and
>> +   exceptional edge [1], it should be considered a single-successor.
>> +
>> +   [1] if this is not possible, these functions can be removed and
>> +       replaced by their basic-block.h cousins.  */
> This should be achievable with -fnon-call-exceptions.  For example,
> division can throw an exception then.
>> +bool
>> +single (const vec<edge, va_gc> *edges)
> Maybe single_p, to keep naming similar to basic-block.h?
Done.
>> +{
>> +  int n = EDGE_COUNT (edges);
>> +  if (n == 0)
>> +    return false;
>> +
>> +  for (edge e : edges)
>> +    if (e->flags & EDGE_COMPLEX)
>> +      n -= 1;
>> +
>> +  return n == 1;
>> +}
>> +
>> +/* Get the single, non-complex edge.  Behavior is undefined if edges
>> +   has more than 1 non-complex edge.  */
>> +edge
>> +single_edge (const vec<edge, va_gc> *edges)
>> +{
> Perhaps to keep behaviour consistent with basic-block.h you want to add
> gcc_checking_assert (single_p (edges));
Done.
>> +/* Find the set {ancestors (p) intersect G} where ancestors is the
>> +   recursive set of predecessors for p.  Limiting to the ancestors that
>> +   are also in G (see cond_reachable_from) and by q is an optimization,
>> +   as ancestors outside G have no effect when isolating expressions.
>> +
>> +   dfs_enumerate_from () does not work, as the filter function needs
>> +   edge information and dfs_enumerate_from () only considers blocks.  */
>> +void
>> +ancestors_of (basic_block p, basic_block q, const sbitmap G,
>> +              sbitmap ancestors)
> The coding style does not really like using an uppercase letter for
> variables.  I see it was a set in the paper, but one can give it a bit
> more descriptive name in the source code.
Done.
>> +{
>> +  if (!bitmap_bit_p (G, p->index))
>> +    return;
>> +
>> +  bitmap_set_bit (ancestors, p->index);
>> +  bitmap_set_bit (ancestors, q->index);
>> +  if (p == q)
>> +    return;
>> +
>> +  auto_vec<basic_block, 16> stack;
>> +  stack.safe_push (p);
>> +
>> +  while (!stack.is_empty ())
>> +    {
>> +      basic_block b = stack.pop ();
>> +      if (single (b->preds))
>> +        {
>> +          edge e = single_edge (b->preds);
>> +          e = contract_edge_up (e);
>> +          b = e->dest;
>> +        }
> So this walks a chain of single-pred edges until something non-single is
> found, and only those are added to the ancestor bitmap?  Why are
> single-edge basic blocks not inserted into the ancestor bitmap?

This patch also adds the intermediate nodes to the bitmap, which
simplified a few things in isolate_expression, and made the new
instrument_decisions quite nice.
>> +
>> +      for (edge e : b->preds)
>> +        {
>> +          basic_block src = e->src;
>> +          if (bitmap_bit_p (ancestors, e->src->index))
>> +            continue;
>> +          if (!bitmap_bit_p (G, e->src->index))
>> +            continue;
>> +          bitmap_set_bit (ancestors, src->index);
> bitmap_set_bit returns a boolean value for whether the bit was newly
> set or previously 1, so you can avoid the bitmap_bit_p above.
Done.
>> diff --git a/libgcc/libgcov-merge.c b/libgcc/libgcov-merge.c
>> index 5d6e17d1483..eed3556373b 100644
>> --- a/libgcc/libgcov-merge.c
>> +++ b/libgcc/libgcov-merge.c
>> @@ -33,6 +33,11 @@ void __gcov_merge_add (gcov_type *counters
>>                                         __attribute__ ((unused)),
>>                        unsigned n_counters __attribute__ ((unused))) {}
>>  #endif
>>
>> +#ifdef L_gcov_merge_ior
>> +void __gcov_merge_ior (gcov_type *counters __attribute__ ((unused)),
>> +                       unsigned n_counters __attribute__ ((unused))) {}
>> +#endif
> Why is this necessary?
To build in freestanding environments, proposed here: https://inbox.sourceware.org/gcc-patches/ad3f098f-9caa-b656-47e8-6cd1b216f...@embedded-brains.de/
> The profiling part looks good to me.  I am worried about the pattern
> matching of gimple to recognize conditionals, especially after the
> early optimizations have run.  Does it work reasonably with -O2?
> Perhaps it would make sense to schedule this kind of profiling early,
> shortly after build_ssa, which would make it incompatible with profile
> feedback, but I think that would be OK.

I would assume scheduling earlier would improve the accuracy also under
optimization (which is nice, but usually these measurements require
-O0).
> Richi, can you please look at the gimple matching part?
>
> Honza
When doing the SSA form instrumentation I found a peculiar problem with this program:
int setjmp003 (int a)
{
    if (a || setjmp (buf))
        a += 12;
    return a;
}

void jump ()
{
    longjmp (buf, 1);
}

This gives me coverages like 35/4, 36/4.  The setjmp creates a central
node with complex incoming edges, and an outgoing complex edge to the
start of the right operand (the longjmp resume).
Now, the behaviour of this program is undefined, which makes me curious
about some of the mechanics.  I tried to poison the counters on the
complex edge into (_ || setjmp) so that whenever the longjmp happens, the
mask would be all-zero and the counter update a no-op.  That failed with
complex edges not being splittable.  I tried selecting the poisoned value
in a phi node and picking it on the incoming complex edge, but the
complex edge never seemed to be picked in the phi.  Is it not possible to
use complex edges in phi nodes, or is there some way I can detect that
the longjmp happened so I can poison the counters?  Or is this behaviour
ok considering it is already undefined behaviour?
Thanks, Jørgen