Re: Lazy JIT IR code generation to increase JIT speed with partitions

Luc Vlaming Mon, 12 Apr 2021 05:12:28 -0700

On 18-01-2021 08:47, Luc Vlaming wrote:

Hi everyone, Andres,
On 03-01-2021 11:05, Luc Vlaming wrote:
On 30-12-2020 14:23, Luc Vlaming wrote:
On 30-12-2020 02:57, Andres Freund wrote:
Hi,

Great to see work in this area!
I would like this topic to somehow progress and was wondering what other benchmarks / tests would be needed to have some progress? I've so far provided benchmarks for small(ish) queries and some tpch numbers, assuming those would be enough.
On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
I would like to propose a small patch to the JIT machinery which makes the
IR code generation lazy. The reason for postponing the generation of the IR
code is that with partitions we get an explosion in the number of JIT
functions generated as many child tables are involved, each with their own
JITted functions, especially when e.g. partition-aware joins/aggregates are
enabled. However, only a fraction of those functions is actually executed
because the Parallel Append node distributes the workers among the nodes.
With the attached patch we get a lazy generation which makes that this is no
longer a problem.
I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the case in
your example.

If you e.g. look at a query like
SELECT blub, count(*), sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
on a table without indexes, you would end up with functions for

- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projection

with your patch each of these would be emitted separately, instead of
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation ends
up with copies of the inlined code).


As far as I can see you've basically falsified the second part of this
comment (which you moved):
+
+    /*
+     * Don't immediately emit nor actually generate the function.
+     * instead do so the first time the expression is actually evaluated.
+     * That allows to emit a lot of functions together, avoiding a lot of
+     * repeated llvm and memory remapping overhead. It also helps with not
+     * compiling functions that will never be evaluated, as can be the case
+     * if e.g. a parallel append node is distributing workers between its
+     * child nodes.
+     */
-    /*
-     * Don't immediately emit function, instead do so the first time the
-     * expression is actually evaluated. That allows to emit a lot of
-     * functions together, avoiding a lot of repeated llvm and memory
-     * remapping overhead.
-     */
Greetings,

Andres Freund
Hi,

Happy to help out, and thanks for the info and suggestions.
Also, I should have first searched pgsql-hackers and the like, as I just found out there are already discussions about this in [1] and [2]. However, I think the approach I took can be taken independently and then other solutions could be added on top.
Assuming I understood all suggestions correctly, the ideas so far are:
1. Add a LLVMAddMergeFunctionsPass so that duplicate code is removed and not optimized several times (see [1]). Requires all code to be emitted in the same module.
2. JIT only parts of the plan, based on cost (see [2]).
3. Cache compilation results to avoid recompilation. This would either need a shm capable optimized IR cache or would not work with parallel workers.
4. Lazy JITting (this patch).

An idea that might not have been presented in the mailing list yet(?):
5. Only JIT in nodes that process a certain amount of rows. Assuming there is a constant overhead for JITting and the goal is to gain runtime.
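To make idea 5 a bit more concrete, here is a minimal sketch of the break-even check it implies. This is only an illustration: the struct, its fields and the function name are hypothetical and exist neither in the patch nor in PostgreSQL, and the assumption of a fixed per-row gain is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost model, not part of the patch or of PostgreSQL. */
typedef struct JitCostModel
{
    double compile_overhead_ms; /* assumed constant cost of JITting a node */
    double gain_per_row_ms;     /* assumed runtime saved per processed row */
} JitCostModel;

/*
 * Return true when the expected runtime saving over all rows exceeds the
 * assumed constant compilation overhead.
 */
static bool
node_worth_jitting(const JitCostModel *model, double estimated_rows)
{
    assert(model != NULL);

    return estimated_rows * model->gain_per_row_ms > model->compile_overhead_ms;
}
```

With such a check, a node expected to see only a few rows would keep using the interpreter, while a node scanning many rows would be JITted.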
Going forward I would first try to see if my current approach can work out. The only idea that would be counterproductive to my solution would be solution 1. Afterwards I'd like to continue with either solution 2, 5, or 3 in the hopes that we can reduce JIT overhead to a minimum and can therefore apply it more broadly.
To test out why and where the JIT performance decreased with my solution I improved the test script and added various queries to model some of the cases I think we should care about. I have not (yet) done big scale benchmarks as these queries seemed to already show enough problems for now. Now there are 4 queries which test JITting with/without partitions, and with varying amounts of workers and rowcounts. I hope these are indeed a somewhat representative set of queries.
As pointed out the current patch does create a degradation in performance wrt queries that are not partitioned (basically q3 and q4). After looking into those queries I noticed two things:
- q3 is very noisy wrt JIT timings. This seems to be the result of something wrt parallel workers starting up the JITting and creating very high amounts of noise (e.g. inlining timings varying between 3.8s and 6.2s)
- q4 seems very stable with JIT timings (after the first run).
I'm wondering if this could mean that with parallel workers quite a lot of time is spent on startup of the llvm machinery and this gets noisy because of OS interaction and the like?
Either way I took q4 to try and fix the regression and noticed something interesting, given the comment from Andres: the generation and inlining time actually decreased, but the optimization and emission time increased. After trying out various things in the llvm_optimize_module function and googling a bit it seems that the LLVMPassManagerBuilderUseInlinerWithThreshold adds some very expensive passes. I tried to construct some queries where this would actually gain us but couldn't (yet).
For v2 of the patch-set the first patch slightly changes how we optimize the code, which removes the aforementioned degradations in the queries. The second patch then makes partitions work a lot better, but interestingly q4 now also gets a lot faster while q3 somehow does not.
Because these findings contradict the suggestions/findings from Andres I'm wondering what I'm missing. I would continue and do some TPC-H like tests on top, but apart from that I'm not entirely sure where we are supposed to gain most from the call to LLVMPassManagerBuilderUseInlinerWithThreshold(). The reason is that from the scenarios I now tested it seems that the pain is actually in the code optimization and possibly rather specific passes, and not necessarily in how many modules are emitted.
If there are more / better queries / datasets / statistics I can run and gather I would be glad to do so :) To me the current results seem however fairly promising.
Looking forward to your thoughts & suggestions.

With regards,
Luc
Swarm64

===================================
Results from the test script on my machine:

  parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
  query1: HEAD       - 08.088901 #runs=5 #JIT=12014
  query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
  query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
  query2: HEAD       - 17.628126 #runs=5 #JIT=24074
  query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
  query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
  query3: HEAD       - 00.220141 #runs=5 #JIT=29
  query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
  query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
  query4: HEAD       - 00.052305 #runs=100 #JIT=10
  query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
  query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3

  parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
  query1: HEAD       - 14.922044 #runs=5 #JIT=102104
  query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
  query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
  query2: HEAD       - 18.477133 #runs=5 #JIT=40122
  query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
  query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
  query3: HEAD       - 00.235587 #runs=5 #JIT=209
  query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
  query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
  query4: HEAD       - 00.052534 #runs=100 #JIT=10
  query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
  query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3

  parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
  query1: HEAD       - 12.696588 #runs=5 #JIT=102104
  query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
  query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
  query2: HEAD       - 12.091824 #runs=5 #JIT=40122
  query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
  query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
  query3: HEAD       - 00.122208 #runs=5 #JIT=209
  query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
  query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
  query4: HEAD       - 00.033125 #runs=100 #JIT=10
  query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
  query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3

  parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
  query1: HEAD       - 02.760343 #runs=5 #JIT=102104
  query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
  query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
  query2: HEAD       - 02.396965 #runs=5 #JIT=40122
  query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
  query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
  query3: HEAD       - 00.186633 #runs=5 #JIT=209
  query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
  query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
  query4: HEAD       - 00.013277 #runs=100 #JIT=10
  query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
  query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3

  parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
  query1: HEAD       - 02.339973 #runs=5 #JIT=102104
  query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
  query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
  query2: HEAD       - 02.268987 #runs=5 #JIT=40122
  query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
  query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
  query3: HEAD       - 00.084531 #runs=5 #JIT=209
  query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
  query3: HEAD+01+02 - 00.08668  #runs=5 #JIT=127
  query4: HEAD       - 00.005371 #runs=100 #JIT=10
  query4: HEAD+01    - 00.0053   #runs=100 #JIT=10
  query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3

===================================
[1] https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com
[2] https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
Hi,

Did some TPCH testing today on a TPCH 100G to see regressions there.
Results:

query   HEAD    patched  speedup
1       9.49    9.25     1.03
3       11.87   11.65    1.02
4       23.74   21.24    1.12
5       11.66   11.07    1.05
6       7.82    7.72     1.01
7       12.1    11.23    1.08
8       12.99   11.2     1.16
9       71.2    68.05    1.05
10      17.72   17.31    1.02
11      4.75    4.16     1.14
12      10.47   10.27    1.02
13      38.23   38.71    0.99
14      8.69    8.5      1.02
15      12.63   12.6     1.00
19      8.56    8.37     1.02
22      10.34   9.25     1.12
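
The speedup figures in these results are consistent with the HEAD time divided by the patched time, so values above 1 mean the patched build was faster. A minimal sketch of that arithmetic follows; the helper names `speedup` and `total_speedup` are hypothetical and only serve as illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Per-query speedup: HEAD runtime divided by patched runtime. */
static double
speedup(double head, double patched)
{
    assert(patched > 0.0);

    return head / patched;
}

/* Speedup over a whole set of queries, based on the summed runtimes. */
static double
total_speedup(const double *head, const double *patched, size_t nqueries)
{
    double head_sum = 0.0;
    double patched_sum = 0.0;

    for (size_t i = 0; i < nqueries; i++)
    {
        head_sum += head[i];
        patched_sum += patched[i];
    }

    return speedup(head_sum, patched_sum);
}
```

Note that a total based on summed runtimes is dominated by the long-running queries, so it complements the per-query numbers rather than replacing them.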

Cheers,
Luc
Kind regards,
Luc

Providing a new set of rebased patches with a better description in the hopes this helps reviewability. Also registering this to the next CF to increase visibility.


Regards,
Luc

>From 98b793bf4c80802e8da79626febc3c4dc567a844 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Mon, 12 Apr 2021 11:10:58 +0200
Subject: [PATCH v3 2/2] Do not generate the IR code ahead of time, but lazy
 upon first call.

Now we combine the IR generation, optimization and emission passes
all in the same time, which seems to gain us in almost all cases,
also (very) small queries with JIT enabled.

For plans where functions might actually not be called at all we actually
gain several factors runtime. This happens e.g. in plans with parallel append
nodes as there the workers get distributed and are unlikely to participate
in all subplans. Most common case for this is usage of queries
on partitioned tables combined with JIT.
---
 src/backend/jit/llvm/llvmjit_expr.c | 96 +++++++++++++++++------------
 1 file changed, 57 insertions(+), 39 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f9cc790c7..74c31051cd 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
 } CompiledExprState;
 
 
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
 
 static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,64 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
 					   lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
 					   ((LLVMValueRef[]){__VA_ARGS__}))
 
-
 /*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
  */
 bool
 llvm_compile_expr(ExprState *state)
 {
 	PlanState  *parent = state->parent;
-	char	   *funcname;
-
 	LLVMJitContext *context = NULL;
 
+
+	/*
+	 * Right now we don't support compiling expressions without a parent, as
+	 * we need access to the EState.
+	 */
+	Assert(parent);
+
+	llvm_enter_fatal_on_oom();
+
+	/* get or create JIT context */
+	if (parent->state->es_jit)
+		context = (LLVMJitContext *) parent->state->es_jit;
+	else
+	{
+		context = llvm_create_context(parent->state->es_jit_flags);
+		parent->state->es_jit = &context->base;
+	}
+
+	/*
+	 * Don't immediately emit nor actually generate the function.
+	 * Instead do so the first time the expression is actually evaluated.
+	 * This helps with not compiling functions that will never be evaluated,
+	 * as can be the case if e.g. a parallel append node is distributing
+	 * workers between its child nodes.
+	 */
+	{
+
+		CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+		cstate->context = context;
+
+		state->evalfunc = ExecCompileExpr;
+		state->evalfunc_private = cstate;
+	}
+
+	llvm_leave_fatal_on_oom();
+
+	return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+	CompiledExprState *cstate = state->evalfunc_private;
+	LLVMJitContext *context = cstate->context;
+
 	LLVMBuilderRef b;
 	LLVMModuleRef mod;
 	LLVMValueRef eval_fn;
@@ -125,31 +172,16 @@ llvm_compile_expr(ExprState *state)
 
 	llvm_enter_fatal_on_oom();
 
-	/*
-	 * Right now we don't support compiling expressions without a parent, as
-	 * we need access to the EState.
-	 */
-	Assert(parent);
-
-	/* get or create JIT context */
-	if (parent->state->es_jit)
-		context = (LLVMJitContext *) parent->state->es_jit;
-	else
-	{
-		context = llvm_create_context(parent->state->es_jit_flags);
-		parent->state->es_jit = &context->base;
-	}
-
 	INSTR_TIME_SET_CURRENT(starttime);
 
 	mod = llvm_mutable_module(context);
 
 	b = LLVMCreateBuilder();
 
-	funcname = llvm_expand_funcname(context, "evalexpr");
+	cstate->funcname = llvm_expand_funcname(context, "evalexpr");
 
 	/* create function */
-	eval_fn = LLVMAddFunction(mod, funcname,
+	eval_fn = LLVMAddFunction(mod, cstate->funcname,
 							  llvm_pg_var_func_type("TypeExprStateEvalFunc"));
 	LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
 	LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2356,30 +2388,16 @@ llvm_compile_expr(ExprState *state)
 
 	LLVMDisposeBuilder(b);
 
-	/*
-	 * Don't immediately emit function, instead do so the first time the
-	 * expression is actually evaluated. That allows to emit a lot of
-	 * functions together, avoiding a lot of repeated llvm and memory
-	 * remapping overhead.
-	 */
-	{
-
-		CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
-		cstate->context = context;
-		cstate->funcname = funcname;
-
-		state->evalfunc = ExecRunCompiledExpr;
-		state->evalfunc_private = cstate;
-	}
-
 	llvm_leave_fatal_on_oom();
 
 	INSTR_TIME_SET_CURRENT(endtime);
 	INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
 						  endtime, starttime);
 
-	return true;
+	/* remove indirection via this function for future calls */
+	state->evalfunc = ExecRunCompiledExpr;
+
+	return ExecRunCompiledExpr(state, econtext, isNull);
 }
 
 /*
-- 
2.25.1
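
For readers who want to see the control flow of the patch above in isolation, here is a minimal standalone sketch of the same trampoline idea: the entry point initially refers to a function that does the expensive work once and then replaces itself with the fast path. It uses no PostgreSQL or LLVM code. All names (`LazyExpr`, `compile_on_first_call`, `run_compiled`, `lazy_expr_init`) are hypothetical, and the "compilation" is only a counter standing in for IR generation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct LazyExpr LazyExpr;
typedef int (*EvalFunc) (LazyExpr *expr, int input);

struct LazyExpr
{
    EvalFunc evalfunc;      /* current entry point, cf. ExprState->evalfunc */
    int      addend;        /* stand-in for the expression to evaluate */
    int      compile_count; /* how often the expensive step has run */
};

/* Fast path: what every call after the first one reaches directly. */
static int
run_compiled(LazyExpr *expr, int input)
{
    return input + expr->addend;
}

/* First-call path: do the expensive work, then remove the indirection. */
static int
compile_on_first_call(LazyExpr *expr, int input)
{
    expr->compile_count++;          /* stand-in for IR generation */
    expr->evalfunc = run_compiled;  /* future calls skip this function */

    return run_compiled(expr, input);
}

/* Setup only records what is needed later; nothing is compiled here. */
static void
lazy_expr_init(LazyExpr *expr, int addend)
{
    assert(expr != NULL);

    expr->evalfunc = compile_on_first_call;
    expr->addend = addend;
    expr->compile_count = 0;
}
```

An expression that is initialized but never evaluated never pays the compilation cost, which is the effect the patch relies on for subplans a worker does not participate in.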

>From 73eba2697046eeb3984f144555bb0b32e13162b7 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Mon, 12 Apr 2021 11:05:05 +0200
Subject: [PATCH v3 1/2] Improve jitting performance by not emitting the
 LLVMPassManagerBuilderUseInlinerWithThreshold pass.

This pass contains some very expensive parts which in experiments so
far do not gain us much compared to the runtime we gain afterwards
when running queries. Instead now emit the simpler inliner passes
also for PGJIT_OPT3.

To monitor how many modules are now created an extra statistic is
added so we can test effectively how much performance is
gained / lost because of this.
---
 src/backend/commands/explain.c |  1 +
 src/backend/jit/llvm/llvmjit.c | 17 ++++++-----------
 src/include/jit/jit.h          |  3 +++
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b62a76e7e5..ac97e9b44b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -894,6 +894,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 		appendStringInfoString(es->str, "JIT:\n");
 		es->indent++;
 
+		ExplainPropertyInteger("Modules", NULL, ji->created_modules, es);
 		ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
 
 		ExplainIndentText(es);
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 98a27f08bf..18797e696b 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -241,6 +241,8 @@ llvm_mutable_module(LLVMJitContext *context)
 		context->module = LLVMModuleCreateWithName("pg");
 		LLVMSetTarget(context->module, llvm_triple);
 		LLVMSetDataLayout(context->module, llvm_layout);
+
+		context->base.instr.created_modules++;
 	}
 
 	return context->module;
@@ -578,12 +580,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
 	llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
 
-	if (context->base.flags & PGJIT_OPT3)
-	{
-		/* TODO: Unscientifically determined threshold */
-		LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
-	}
-	else
+	if (!(context->base.flags & PGJIT_OPT3))
 	{
 		/* we rely on mem2reg heavily, so emit even in the O0 case */
 		LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
@@ -611,11 +608,9 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
 													llvm_mpm);
 	/* always use always-inliner pass */
-	if (!(context->base.flags & PGJIT_OPT3))
-		LLVMAddAlwaysInlinerPass(llvm_mpm);
-	/* if doing inlining, but no expensive optimization, add inlining pass */
-	if (context->base.flags & PGJIT_INLINE
-		&& !(context->base.flags & PGJIT_OPT3))
+	LLVMAddAlwaysInlinerPass(llvm_mpm);
+	/* if doing inlining, add inlining pass */
+	if (context->base.flags & PGJIT_INLINE)
 		LLVMAddFunctionInliningPass(llvm_mpm);
 	LLVMRunPassManager(llvm_mpm, context->module);
 	LLVMDisposePassManager(llvm_mpm);
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b634df30b9..7a080074c0 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -26,6 +26,9 @@
 
 typedef struct JitInstrumentation
 {
+	/* number of emitted modules */
+	size_t		created_modules;
+
 	/* number of emitted functions */
 	size_t		created_functions;
 
-- 
2.25.1
