Hi,

On 2019-03-02 18:11:43 -0500, Tom Lane wrote:
> On test cases like "pg_bench -S" it seems to be pretty much within the
> noise level of being the same speed as HEAD.

I think that might be because its bottlenecks are just elsewhere
(e.g. it's very context-switch heavy, and builds very few lists of any
length).

FWIW, even just taking context switches out of the equation leads to
a ~5-6% benefit in a simple statement:

DO $f$BEGIN FOR i IN 1..500000 LOOP EXECUTE $s$SELECT aid, bid, abalance, 
filler FROM pgbench_accounts WHERE aid = 2045530;$s$;END LOOP;END;$f$;

master:
+    6.05%  postgres  postgres            [.] AllocSetAlloc
+    5.52%  postgres  postgres            [.] base_yyparse
+    2.51%  postgres  postgres            [.] palloc
+    1.82%  postgres  postgres            [.] hash_search_with_hash_value
+    1.61%  postgres  postgres            [.] core_yylex
+    1.57%  postgres  postgres            [.] SearchCatCache1
+    1.43%  postgres  postgres            [.] expression_tree_walker.part.4
+    1.09%  postgres  postgres            [.] check_stack_depth
+    1.08%  postgres  postgres            [.] MemoryContextAllocZeroAligned

patch v3:
+    5.77%  postgres  postgres            [.] base_yyparse
+    4.88%  postgres  postgres            [.] AllocSetAlloc
+    1.95%  postgres  postgres            [.] hash_search_with_hash_value
+    1.89%  postgres  postgres            [.] core_yylex
+    1.64%  postgres  postgres            [.] SearchCatCache1
+    1.46%  postgres  postgres            [.] expression_tree_walker.part.0
+    1.45%  postgres  postgres            [.] palloc
+    1.18%  postgres  postgres            [.] check_stack_depth
+    1.13%  postgres  postgres            [.] MemoryContextAllocZeroAligned
+    1.04%  postgres  libc-2.28.so        [.] _int_malloc
+    1.01%  postgres  postgres            [.] nocachegetattr

And even just pgbenching the EXECUTEd statement above gives me a
reproducible ~3.5% gain when using -M simple, and ~3% when using -M
prepared.

Note that when not using prepared statements (a pretty important
workload, especially as long as we don't have a pooling solution that
actually allows using prepared statements across connections), even after
the patch most of the allocator overhead is still from list allocations,
but it's near exclusively just the "create a new list" case:

+    5.77%  postgres  postgres            [.] base_yyparse
-    4.88%  postgres  postgres            [.] AllocSetAlloc
   - 80.67% AllocSetAlloc
      - 68.85% AllocSetAlloc
         - 57.65% palloc
            - 50.30% new_list (inlined)
               - 37.34% lappend
                  + 12.66% pull_var_clause_walker
                  + 8.83% build_index_tlist (inlined)
                  + 8.80% make_pathtarget_from_tlist
                  + 8.73% get_quals_from_indexclauses (inlined)
                  + 8.73% distribute_restrictinfo_to_rels
                  + 8.68% RewriteQuery
                  + 8.56% transformTargetList
                  + 8.46% make_rel_from_joinlist
                  + 4.36% pg_plan_queries
                  + 4.30% add_rte_to_flat_rtable (inlined)
                  + 4.29% build_index_paths
                  + 4.23% match_clause_to_index (inlined)
                  + 4.22% expression_tree_mutator
                  + 4.14% transformFromClause
                  + 1.02% get_index_paths
               + 17.35% list_make1_impl
               + 16.56% list_make1_impl (inlined)
               + 15.87% lcons
               + 11.31% list_copy (inlined)
               + 1.58% lappend_oid
            + 12.90% expression_tree_mutator
            + 9.73% get_relation_info
            + 4.71% bms_copy (inlined)
            + 2.44% downcase_identifier
            + 2.43% heap_tuple_untoast_attr
            + 2.37% add_rte_to_flat_rtable (inlined)
            + 1.69% btbeginscan
            + 1.65% CreateTemplateTupleDesc
            + 1.61% core_yyalloc (inlined)
            + 1.59% heap_copytuple
            + 1.54% text_to_cstring (inlined)
            + 0.84% ExprEvalPushStep (inlined)
            + 0.84% ExecInitRangeTable
            + 0.84% scanner_init
            + 0.83% ExecInitRangeTable
            + 0.81% CreateQueryDesc
            + 0.81% _bt_search
            + 0.77% ExecIndexBuildScanKeys
            + 0.66% RelationGetIndexScan
            + 0.65% make_pathtarget_from_tlist


Given how hard it is to improve performance with costs as flatly
distributed as in the above profiles, I actually think these are quite
promising results.

I'm not even convinced that it makes all that much sense to measure
end-to-end performance here; it might be worthwhile to measure with a
debugging function that allows exercising parsing, parse analysis,
rewriting etc. at configurable loop counts. Given the relatively evenly
distributed profiles, we're going to have to make a few different
improvements to make headway, and it's hard to see the benefits of
individual ones if you only look at the overall numbers.

Greetings,

Andres Freund
