https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
--- Comment #20 from Alexander Monakov <amonakov at gcc dot gnu.org> --- (note that if you uninclude the testcase and compile with -fno-exceptions it's much faster) On the smaller testcase from comment 14, prune_unused_phi_nodes invokes gcc_qsort 53386 times. There are two distinct phases. In the first phase, the count of struct dom_dfsnum to sort grows in a roughly linear fashion up to 23437 on the 12294'th invocation. Hence this first phase is quadratic in the overall number of processed dom_dfsnum-s. In the second phase, it almost always sorts exactly 7 elements for the remaining ~41000 invocations. The number of pairwise comparisons performed by gcc_qsort is approximately (n+1)*(log_2(n)-1), which results in 1.8e9 comparisons overall for the 53386 invocations. perf shows 10.9e9 cycles spent under gcc_qsort, i.e. 6 cycles per comparison, which looks about right. It's possible to reduce that further by switching from classic to bidirectional merge, and using cmovs instead of bitwise arithmetic for branchless selects. > I'll note the swapping of 8 bytes is a bit odd and it seems to be > if-converted, thus always doing a write. That is not a swap. That's the merge step of a mergesort, we are taking the smaller element from the heads of two arrays and moving it to the tail of a third array. Basically there's quadratic behavior in tree-into-ssa, gcc_qsort shows relatively higher on the profile because the log_2(n) factor becomes noticeable. Hope that helps!