Hi,

On 2019-03-02 18:11:43 -0500, Tom Lane wrote:
> On test cases like "pg_bench -S" it seems to be pretty much within the
> noise level of being the same speed as HEAD.

I think that might be because its bottleneck is just elsewhere
(e.g. very context switch heavy, very few lists of any length).

FWIW, even just taking context switches out of the equation leads to
a ~5-6% benefit in a simple statement:

DO $f$BEGIN FOR i IN 1..500000 LOOP EXECUTE $s$SELECT aid, bid, abalance, filler FROM pgbench_accounts WHERE aid = 2045530;$s$;END LOOP;END;$f$;

master:
+    6.05%  postgres  postgres            [.] AllocSetAlloc
+    5.52%  postgres  postgres            [.] base_yyparse
+    2.51%  postgres  postgres            [.] palloc
+    1.82%  postgres  postgres            [.] hash_search_with_hash_value
+    1.61%  postgres  postgres            [.] core_yylex
+    1.57%  postgres  postgres            [.] SearchCatCache1
+    1.43%  postgres  postgres            [.] expression_tree_walker.part.4
+    1.09%  postgres  postgres            [.] check_stack_depth
+    1.08%  postgres  postgres            [.] MemoryContextAllocZeroAligned

patch v3:
+    5.77%  postgres  postgres            [.] base_yyparse
+    4.88%  postgres  postgres            [.] AllocSetAlloc
+    1.95%  postgres  postgres            [.] hash_search_with_hash_value
+    1.89%  postgres  postgres            [.] core_yylex
+    1.64%  postgres  postgres            [.] SearchCatCache1
+    1.46%  postgres  postgres            [.] expression_tree_walker.part.0
+    1.45%  postgres  postgres            [.] palloc
+    1.18%  postgres  postgres            [.] check_stack_depth
+    1.13%  postgres  postgres            [.] MemoryContextAllocZeroAligned
+    1.04%  postgres  libc-2.28.so        [.] _int_malloc
+    1.01%  postgres  postgres            [.] nocachegetattr

And even just pgbenching the EXECUTEd statement above gives me a
reproducible ~3.5% gain when using -M simple, and ~3% when using
-M prepared.

Note that when not using prepared statements (a pretty important
workload, especially as long as we don't have a pooling solution that
actually allows using prepared statements across connections), even
after the patch most of the allocator overhead is still from list
allocations, but it's nearly exclusively just the "create a new list"
case:

+    5.77%  postgres  postgres            [.] base_yyparse
-    4.88%  postgres  postgres            [.] AllocSetAlloc
   - 80.67% AllocSetAlloc
      - 68.85% AllocSetAlloc
         - 57.65% palloc
            - 50.30% new_list (inlined)
               - 37.34% lappend
                  + 12.66% pull_var_clause_walker
                  + 8.83% build_index_tlist (inlined)
                  + 8.80% make_pathtarget_from_tlist
                  + 8.73% get_quals_from_indexclauses (inlined)
                  + 8.73% distribute_restrictinfo_to_rels
                  + 8.68% RewriteQuery
                  + 8.56% transformTargetList
                  + 8.46% make_rel_from_joinlist
                  + 4.36% pg_plan_queries
                  + 4.30% add_rte_to_flat_rtable (inlined)
                  + 4.29% build_index_paths
                  + 4.23% match_clause_to_index (inlined)
                  + 4.22% expression_tree_mutator
                  + 4.14% transformFromClause
                  + 1.02% get_index_paths
               + 17.35% list_make1_impl
               + 16.56% list_make1_impl (inlined)
               + 15.87% lcons
               + 11.31% list_copy (inlined)
               + 1.58% lappend_oid
            + 12.90% expression_tree_mutator
            + 9.73% get_relation_info
            + 4.71% bms_copy (inlined)
            + 2.44% downcase_identifier
            + 2.43% heap_tuple_untoast_attr
            + 2.37% add_rte_to_flat_rtable (inlined)
            + 1.69% btbeginscan
            + 1.65% CreateTemplateTupleDesc
            + 1.61% core_yyalloc (inlined)
            + 1.59% heap_copytuple
            + 1.54% text_to_cstring (inlined)
            + 0.84% ExprEvalPushStep (inlined)
            + 0.84% ExecInitRangeTable
            + 0.84% scanner_init
            + 0.83% ExecInitRangeTable
            + 0.81% CreateQueryDesc
            + 0.81% _bt_search
            + 0.77% ExecIndexBuildScanKeys
            + 0.66% RelationGetIndexScan
            + 0.65% make_pathtarget_from_tlist

Given how hard it is to improve performance when costs are as flatly
distributed as in the above profiles, I actually think these are quite
promising results.

I'm not even convinced that it makes all that much sense to measure
end-to-end performance here; it might be worthwhile to measure with a
debugging function that allows exercising parsing, parse-analysis,
rewrite etc. at configurable loop counts. Given the relatively evenly
distributed profiles, we're going to have to make a few different
improvements to make headway, and it's hard to see the benefits of
individual ones if you look at the overall numbers.

Greetings,

Andres Freund