Hi,

>>
>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code, I often not notice a README exists next to it).
>
> Great, I'd try this from tomorrow. 

I have done it.  For now I chose a README because this feature changes
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high-level design into any one of them looks inappropriate. So the
high-level design is here, and the detailed design for each step is in
the comments around the code.  Hope this is helpful!

The problem:
-------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree-ing
it right away or by leaving it for ResetExprContext. Which of the two to
use is sometimes debated. The more serious problem is that detoasting is
expensive, especially for data types like jsonb or array, whose values
might be very large. In the example below, the detoasting happens twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

With the shared-detoast-datum feature, we just need to detoast once for
each tuple, and discard the result as soon as the tuple is no longer
needed. FWIW this issue may exist for small numeric and text values as
well, because of the SHORT_TOAST feature, where the length header uses
1 byte rather than 4 bytes.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoasted
datum back to slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done.
- It reuses the lifespan of the TupleTableSlot system to manage memory.
  It is natural to think of the detoasted datum as a tts_value that just
  happens to be in detoasted format. Since TupleTableSlot already has a
  clear lifespan (ExecClearTuple, ExecCopySlot, etc.), it is easy to
  reuse it for managing the detoasted datum's memory.
- The existing projection method copies the detoasted datum (an int64)
  automatically to the next node's slot while keeping the ownership
  unchanged, so only the slot where the detoast really happened is in
  charge of its lifespan.
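
To make the reasoning above concrete, here is a toy, standalone C model
of the write-back idea. It is only a sketch under stated assumptions:
ToySlot, toy_detoast, toy_get_var_toast and toy_clear are hypothetical
names invented for illustration, a plain bitmask stands in for the
Bitmapset, malloc/free stands in for the slot's memory context, and the
bitmask test replaces the real VARATT_IS_EXTENDED check. It is not the
patch's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NATTS 4

/* Hypothetical stand-in for TupleTableSlot; not the real struct. */
typedef struct ToySlot
{
    const char *values[TOY_NATTS];  /* plays the role of tts_values[*] */
    unsigned    pre_detoasted;      /* bit i set: this slot owns values[i] */
    int         detoast_calls;      /* instrumentation for the example */
} ToySlot;

/* Pretend detoast: the expensive part is modelled as allocate-and-copy. */
static char *
toy_detoast(ToySlot *slot, const char *stored)
{
    size_t      len = strlen(stored) + 1;
    char       *copy = malloc(len);

    assert(copy != NULL);
    memcpy(copy, stored, len);
    slot->detoast_calls++;
    return copy;
}

/* Analogue of an EEOP_*_VAR_TOAST step: expand once, write back to the slot. */
static const char *
toy_get_var_toast(ToySlot *slot, int attnum)
{
    assert(attnum >= 0 && attnum < TOY_NATTS);
    if (!(slot->pre_detoasted & (1u << attnum)))
    {
        slot->values[attnum] = toy_detoast(slot, slot->values[attnum]);
        slot->pre_detoasted |= 1u << attnum;
    }
    return slot->values[attnum];
}

/* Analogue of invalidating the slot: free only what this slot expanded. */
static void
toy_clear(ToySlot *slot)
{
    for (int i = 0; i < TOY_NATTS; i++)
    {
        if (slot->pre_detoasted & (1u << i))
        {
            free((void *) slot->values[i]);
            slot->values[i] = NULL;
        }
    }
    slot->pre_detoasted = 0;
}
```

Two expressions reading the same attribute through toy_get_var_toast
pay for one expansion only, and toy_clear releases exactly the entries
recorded in the bitmask, leaving everything else alone.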

So the final data change is adding the field below to TupleTableSlot.

typedef struct TupleTableSlot
{
..

    /*
     * The attributes whose values are the detoasted version in tts_values[*],
     * if so these memory needs some extra clean-up. These memory can't be put
     * into ecxt_per_tuple_memory since many of them needs a longer life
     * span.
     *
     * These memory is put into TupleTableSlot.tts_mcxt and be clear
     * whenever the tts_values[*] is invalidated.
     */
    Bitmapset   *pre_detoasted_attrs;
};

Assume that which Vars should use this feature has already been decided
in createplan.c and setrefs.c. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST are then used. While evaluating these
steps, the code below is used.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum           oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
                                                                (struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        Assert(oldDatum != slot->tts_values[attnum]);
        slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}

Since I don't want an extra run-time check to see whether a detoast
should happen, I introduced 3 new steps.
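
The trade-off can be sketched with a toy interpreter: the membership
test is done once when the step is built, and evaluation only
dispatches on the opcode. ToyOpcode, ToyStep, toy_init_var_step and
toy_step_detoasts are hypothetical names for illustration, and the
bitmask is an assumed stand-in for the per-node attribute set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes mirroring EEOP_SCAN_VAR / EEOP_SCAN_VAR_TOAST. */
typedef enum ToyOpcode
{
    TOY_SCAN_VAR,
    TOY_SCAN_VAR_TOAST
} ToyOpcode;

typedef struct ToyStep
{
    ToyOpcode   opcode;
    int         attnum;
} ToyStep;

/* Init time: consult the set once and bake the answer into the step. */
static ToyStep
toy_init_var_step(int attnum, unsigned pre_detoast_attrs)
{
    ToyStep     step;

    assert(attnum >= 0 && attnum < 32);
    step.attnum = attnum;
    step.opcode = (pre_detoast_attrs & (1u << attnum)) ?
        TOY_SCAN_VAR_TOAST : TOY_SCAN_VAR;
    return step;
}

/* Run time: no set lookup, only opcode dispatch. */
static bool
toy_step_detoasts(const ToyStep *step)
{
    switch (step->opcode)
    {
        case TOY_SCAN_VAR_TOAST:
            return true;        /* the detoast path would run here */
        case TOY_SCAN_VAR:
            return false;       /* plain read of the slot value */
    }
    return false;
}
```

Plain Var steps therefore pay nothing extra per tuple; only steps built
with the _TOAST opcode reach the detoast path.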

When should the detoasted datum be freed? It depends on when the slot's
tts_values[*] is invalidated. ExecClearTuple is the obvious case, but
any TupleTableSlotOps callback that sets tts_nvalid = 0 tells us that no
one will use the datums in tts_values[*] any more, so that is also the
time to release them based on slot->pre_detoasted_attrs.

Now comes the createplan.c/setrefs.c part, which decides which Vars
should use the shared detoast feature. The guidelines are:

1. A detoast is needed for the given expression.
2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoasted datum back to tts_values[*], the tuple becomes bigger. If
   we do this blindly, it would be harmful to the ORDER / HASH style
   nodes.

The high-level data flow is:

1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
   information caused by SORT/HASH style nodes, and save it to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses to the Var for each
   expression, so it is a good time to track the attribute number and
   see whether the Var is accessed directly or indirectly. Usually,
   accessing a Var indirectly means a detoast would happen, for
   example in an expression like a > 3. However, some known expressions
   like VAR IS NULL are ignored. The output is
   {Scan|Join}.xxx_reference_attrs.

As a result, the following fields are added to the Scan and Join plan
nodes.

typedef struct Scan
{
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
    ..
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;
} Join;


Note that I used '_reference_' rather than '_detoast_' here because at
this stage I still don't know whether it is a toastable attribute; that
is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState, which are passed into the expression
   engine at the ExecInitExprRec stage. There the
   EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps are finally generated, and
   everything is connected with ExecSlotDetoastDatum!
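
The set arithmetic behind step 3 can be sketched as follows: keep the
referenced attributes that are toastable, then remove the forbidden
ones. This is a simplified model, assuming at most 32 attributes and
using a plain bitmask (bit i = varattno - 1) instead of a Bitmapset;
ToyAttrSet and toy_final_pre_detoast_attrs are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per attribute: bit i stands for attribute index i. */
typedef unsigned int ToyAttrSet;

/*
 * final = (reference_attrs INTERSECT toastable attrs) MINUS forbid_attrs
 *
 * toastable[i] says whether attribute i can be toasted at all; in the
 * patch this is derived from the tuple descriptor.
 */
static ToyAttrSet
toy_final_pre_detoast_attrs(ToyAttrSet reference_attrs,
                            const bool *toastable, int natts,
                            ToyAttrSet forbid_attrs)
{
    ToyAttrSet  toast_attrs = 0;

    assert(natts >= 0 && natts <= 32);

    if (reference_attrs == 0)
        return 0;

    for (int i = 0; i < natts; i++)
    {
        if (toastable[i])
            toast_attrs |= 1u << i;
    }

    return (reference_attrs & toast_attrs) & ~forbid_attrs;
}
```

For example, with attributes 0..2 referenced, attribute 0 not toastable
and attribute 2 forbidden because of a small-tlist node, only
attribute 1 remains eligible.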


Testing
-------

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;

cat 1.sql

select * from t where a > 0;

In this test, current master runs the detoast twice for each datum: once
in numeric_gt and once in numeric_out. This feature detoasts only once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched(Bitmapset): 30.881ms


Then we can see the perf report as below:

-    7.34%     0.00%  postgres  postgres  [.] ExecSlotDetoastDatum (inlined)
   - ExecSlotDetoastDatum (inlined)
      - 3.47% bms_add_member
       - 3.06% bms_make_singleton (inlined)
        - palloc0
         1.30% AllocSetAlloc

-    5.99%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
    2.64% bms_next_member
    1.17% bms_del_members
    0.94% AllocSetFree

One of the reasons is that a Bitmapset deallocates its memory when all
the bits are deleted (since commit 00b41463c), so we have to allocate
memory again the next time a member is added. This happens 100000 times
in the above workload.

Then I introduced a bitset data structure (commit 0002), which is pretty
much like the Bitmapset but does not deallocate its memory when all the
bits are unset, and used it in this feature (commit 0003). The result
then became 28.715ms.
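
The property that matters here can be shown with a minimal fixed-capacity
bitset. This is an illustrative sketch, not the bitset from commit 0002:
ToyBitset and the toy_bitset_* functions are hypothetical names, and
malloc/calloc stand in for palloc.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

typedef struct ToyBitset
{
    size_t      nwords;         /* capacity, fixed at creation */
    unsigned long *words;       /* allocated once, kept across clears */
} ToyBitset;

static ToyBitset *
toy_bitset_create(size_t nbits)
{
    ToyBitset  *bs = malloc(sizeof(ToyBitset));

    assert(bs != NULL && nbits > 0);
    bs->nwords = (nbits + TOY_BITS_PER_WORD - 1) / TOY_BITS_PER_WORD;
    bs->words = calloc(bs->nwords, sizeof(unsigned long));
    assert(bs->words != NULL);
    return bs;
}

static void
toy_bitset_add(ToyBitset *bs, size_t bit)
{
    assert(bit / TOY_BITS_PER_WORD < bs->nwords);
    bs->words[bit / TOY_BITS_PER_WORD] |= 1UL << (bit % TOY_BITS_PER_WORD);
}

static bool
toy_bitset_is_member(const ToyBitset *bs, size_t bit)
{
    assert(bit / TOY_BITS_PER_WORD < bs->nwords);
    return (bs->words[bit / TOY_BITS_PER_WORD] >>
            (bit % TOY_BITS_PER_WORD)) & 1UL;
}

/* Unset every bit but keep the allocation for the next tuple. */
static void
toy_bitset_clear(ToyBitset *bs)
{
    memset(bs->words, 0, bs->nwords * sizeof(unsigned long));
}

static void
toy_bitset_free(ToyBitset *bs)
{
    free(bs->words);
    free(bs);
}
```

Because toy_bitset_clear only zeroes the words, a per-tuple cycle of
add, clear, add performs no allocation after the initial create.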

-    5.22%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
      - 2.82% bitset_next_member
       1.69% bms_next_member_internal (inlined)
       0.95% bitset_next_member
    0.66% AllocSetFree

Here we can see that the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. If we put the detoasted datums into
a dedicated memory context, we can save the cost of
bitset_next_member, since we can discard all the memory at once and use
MemoryContextReset instead of AllocSetFree (commit 0004). The result
then became 27.840ms.
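
Why a reset is cheaper than per-datum frees can be illustrated with a
toy bump allocator. This is only an analogy under stated assumptions:
ToyArena and the toy_arena_* functions are hypothetical, and a real
PostgreSQL memory context is considerably more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ALIGN 16            /* assumed allocation alignment */

typedef struct ToyArena
{
    char       *base;           /* one block obtained up front */
    size_t      size;
    size_t      used;           /* bump pointer */
} ToyArena;

static ToyArena
toy_arena_create(size_t size)
{
    ToyArena    a;

    a.base = malloc(size);
    assert(a.base != NULL);
    a.size = size;
    a.used = 0;
    return a;
}

/* Hand out the next aligned chunk, or NULL if the block is exhausted. */
static void *
toy_arena_alloc(ToyArena *a, size_t n)
{
    size_t      need = (n + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
    void       *p;

    if (need > a->size - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += need;
    return p;
}

/* Discard every allocation at once; no per-chunk bookkeeping is walked. */
static void
toy_arena_reset(ToyArena *a)
{
    a->used = 0;
}

static void
toy_arena_destroy(ToyArena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

With all per-tuple detoasted datums placed in such an arena, clearing
the slot becomes a single reset, so there is no need to iterate over
the set of detoasted attributes to free them one by one.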

So the final result for case 1: 

master: 30.218 ms
patched(Bitmapset): 30.881ms
patched(bitset): 28.715ms
patched(bitset + tts_value_mctx): 27.840 ms


Big jsonbs test:

create table b(big jsonb);

insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);

insert into b select (select big from b) from generate_series(1, 1000)i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I ran the TPC-H scale 10 workload against both the master and patched
versions; the memory usage looks stable.

In progress work:

I'm still running TPC-H scale 100 to see if there are any interesting
findings; that is in progress. As for scale 10:

master: 55s
patched: 56s

The difference comes from the q9 plan changing a bit; finding the real
cause needs some more time. Since this patch doesn't affect best-path
generation, the plan change does not look reasonable to me.

-- 
Best Regards
Andy Fan

From cce81190cca7209cbb8aa314a27c52832fbe0a2d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi....@alibaba-inc.com>
Date: Tue, 20 Feb 2024 14:16:10 +0800
Subject: [PATCH v8 1/5] Shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com
---
 src/backend/executor/execExpr.c         |  35 +-
 src/backend/executor/execExprInterp.c   | 180 ++++++++
 src/backend/executor/execTuples.c       | 134 ++++++
 src/backend/executor/execUtils.c        |   2 +
 src/backend/executor/nodeHashjoin.c     |   2 +
 src/backend/executor/nodeMergejoin.c    |   2 +
 src/backend/executor/nodeNestloop.c     |   1 +
 src/backend/jit/llvm/llvmjit_expr.c     |  26 +-
 src/backend/jit/llvm/llvmjit_types.c    |   1 +
 src/backend/optimizer/plan/createplan.c | 107 ++++-
 src/backend/optimizer/plan/setrefs.c    | 551 +++++++++++++++++++-----
 src/include/executor/execExpr.h         |  12 +
 src/include/executor/tuptable.h         |  56 +++
 src/include/nodes/execnodes.h           |  14 +
 src/include/nodes/plannodes.h           |  53 +++
 src/tools/pgindent/typedefs.list        |   2 +
 16 files changed, 1072 insertions(+), 106 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3181b1136a..779fcfaab1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,51 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3f20f1dd31..dc0db12ff4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -158,6 +159,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -166,6 +170,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		Assert(oldDatum != slot->tts_values[attnum]);
+		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +340,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -597,6 +677,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2137,6 +2236,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2275,6 +2410,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..2d11e5e8b3 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
+	 * let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +657,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +794,9 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -904,6 +930,9 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1150,7 +1179,10 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
+		slot->pre_detoasted_attrs = NULL;
 	}
+	else
+		slot->pre_detoasted_attrs = NULL;
 
 	/*
 	 * And allow slot type specific initialization.
@@ -1288,6 +1320,8 @@ void
 ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 					  TupleDesc tupdesc)	/* new tuple descriptor */
 {
+	MemoryContext old;
+
 	Assert(!TTS_FIXED(slot));
 
 	/* For safety, make sure slot is empty before changing it */
@@ -1304,6 +1338,8 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		pfree(slot->tts_values);
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
+	if (slot->pre_detoasted_attrs)
+		bms_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1319,6 +1355,10 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(Datum));
 	slot->tts_isnull = (bool *)
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
+
+	old = MemoryContextSwitchTo(slot->tts_mcxt);
+	slot->pre_detoasted_attrs = NULL;
+	MemoryContextSwitchTo(old);
 }
 
 /* --------------------------------
@@ -1810,12 +1850,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,83 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0c448422e2..74563c3454 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 47c9daf402..1dcf0c2fd8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -178,4 +178,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..8acb48240e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -346,6 +348,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * Now that the plan tree is completely built, walk it to determine
+	 * which expressions must not use the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +386,101 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *	 Walk the Plan tree top-down to gather the Vars which should be kept
+ * as small as possible, and record them in Plan.forbid_pre_detoast_vars.
+ *
+ * plan: the plan node being walked.
+ * small_tlist: a list of expressions which this plan node should output
+ * as small as possible.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse into the subplans. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For sort-like nodes, we want the output of the subplan to be as
+		 * small as possible. The subplan's other expressions, such as its
+		 * quals, have no such restriction since they are not output to the
+		 * upper nodes. So set small_tlist to the subplan's targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If left_small_tlist is set, the left subplan's output should be as
+		 * small as possible, so handle it like the child of a sort-like node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+
+		/*
+		 * The righttree is a Hash node, which is handled by its own rule, so
+		 * the small_tlist passed here does not matter; we just need to
+		 * recurse into that subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Otherwise, just push this node's forbid_pre_detoast_vars down to
+		 * its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according to the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * Fast path: without a small_tlist, no Var in the targetlist can be a
+	 * member of it. This is probably a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4893,6 +4996,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..f9b30903c2 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,48 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted at the expression level; this is judged
+ * at the fix_scan/join_expr stage. The recursion level is tracked while we
+ * walk down to a Var. If the level is greater than 1 when we reach it, the
+ * Var needs a detoast in this expression list. There are some exceptions;
+ * see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* true if the level was increased for the node being walked. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attributes within an expression.
+ *
+ * XXX: this design was intended to skip the pre-detoast logic if a Var
+ * only needs to be detoasted *once*, but for now this context is only
+ * maintained at the expression level rather than the plan tree level, so
+ * it can't detect whether a Var will be detoasted 2+ times at the plan
+ * level. Recording how many times a Var is detoasted at the plan tree
+ * level is complex, so until we decide it is a must, I am not willing to
+ * make too many changes here.
+ */
+typedef struct
+{
+	/* attrs of Vars that have been accessed once so far. */
+	Bitmapset  *existing_attrs;
+	/* attrs of Vars that have been accessed 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +109,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +168,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +199,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +232,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +673,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +692,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +715,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +761,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +779,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +802,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +825,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +853,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +872,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +892,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +911,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +930,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +945,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +990,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1051,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1106,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1156,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1179,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1236,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1302,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1312,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1478,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1534,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1737,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1747,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1815,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2236,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given node could detoast a Var directly. If so, increase
+ * the level and set ctx->level_added to true; otherwise set it to false.
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes can never detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* don't pre-detoast, so that pg_column_compression still works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		Add the Var's attribute to the pre-detoast attrs when the checks pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first access: add it to the final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect whether a Var is
+		 * detoasted twice. The reasons are: 1. the context is not maintained
+		 * at the Plan node level, so if the Var is detoasted in both the
+		 * targetlist and the qual, we can't detect it. 2. even if we tracked
+		 * it per plan node, it still wouldn't help for the across-nodes case.
+		 *
+		 * So for now, I just disable it and always add the attr.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2125,18 +2339,23 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': collects the attrs that may be detoasted by this
+ * 	expr; NULL means the caller is not interested in them.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2386,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2410,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2429,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2442,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2451,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2532,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2582,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2597,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2631,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2641,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3287,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3301,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3338,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3356,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3372,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3393,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3432,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3446,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3510,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3666,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a28ddcdd77..9304786bb2 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * Compute non-system Var value with shared-detoast-datum logic.  Using
+	 * dedicated steps rather than adding extra logic to the existing steps
+	 * is for performance: this way, we decide whether the extra logic is
+	 * needed once at the ExecInitExpr stage rather than on every
+	 * ExecInterpExpr call.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -830,5 +841,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..3951ffc495 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,18 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes whose values in tts_values[*] are the detoasted version;
+	 * their memory needs some extra clean-up. This memory can't be put into
+	 * ecxt_per_tuple_memory since much of it needs a longer life span, for
+	 * example the Datum in an outer join. It is put into
+	 * TupleTableSlot.tts_mcxt and is freed whenever tts_values[*] is
+	 * invalidated.
+	 *
+	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
+	 */
+	Bitmapset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +439,32 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory allocated by the pre-detoast-datum logic.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	attnum = -1;
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	slot->pre_detoasted_attrs = bms_del_members(slot->pre_detoasted_attrs, slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +483,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't depend on any external storage, so
+	 * nothing needs to be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +531,25 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	if (dstslot->tts_nvalid > 0 && srcslot->tts_nvalid > 0)
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..30fdb37d1c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1481,6 +1481,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final set of attributes to which the pre-detoast-attrs logic
+	 * should be applied on the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2010,6 +2016,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final set of attributes to which the pre-detoast-attrs logic
+	 * should be applied on the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2771,4 +2784,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..ea5033aaa0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars to which the shared-detoast-datum logic should not be
+	 * applied, since upper nodes like Sort/Hash want them as small as
+	 * possible. It's a subset of the targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is essential information for figuring out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +806,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is essential information for figuring out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +897,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1621,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf..ac55727e05 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4049,6 +4049,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1
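
As a reading aid for the setrefs.c changes above: the level counter is
incremented on entry to each expression node and decremented on exit, and
a Var is recorded only when it sits below the top level (level > 1) and has
a positive varattno. The sketch below models just that idea with a
hypothetical, simplified node type (Ex, WalkCtx and collect_reference_attrs
are illustrative names, not part of the patch), and it does not model the
node types for which increase_level_for_pre_detoast leaves the level
unchanged. A uint64_t stands in for the Bitmapset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified expression node; not PostgreSQL's Node. */
typedef enum
{
	EX_VAR,
	EX_FUNC
} ExKind;

typedef struct Ex
{
	ExKind		kind;
	int			varattno;		/* 1-based; meaningful for EX_VAR only */
	const struct Ex *args[2];	/* children; meaningful for EX_FUNC only */
} Ex;

typedef struct WalkCtx
{
	int			level;			/* current nesting depth */
	uint64_t	reference_attrs;	/* bit (varattno - 1) set if referenced */
} WalkCtx;

/*
 * Walk the tree, bumping the level on entry and restoring it on exit.
 * A Var is recorded only when it is nested inside another expression.
 */
static void
collect_reference_attrs(const Ex *node, WalkCtx *ctx)
{
	size_t		i;

	if (node == NULL)
		return;

	ctx->level += 1;

	if (node->kind == EX_VAR)
	{
		/* Skip bare top-level Vars and non-positive attribute numbers. */
		if (ctx->level > 1 && node->varattno > 0 && node->varattno <= 64)
			ctx->reference_attrs |= UINT64_C(1) << (node->varattno - 1);
	}
	else
	{
		for (i = 0; i < 2; i++)
			collect_reference_attrs(node->args[i], ctx);
	}

	ctx->level -= 1;
}
```

With this model, a Var used as an argument of a function or operator is
collected, while a Var that is the whole expression is not, which mirrors
the "level <= 1" early return in add_pre_detoast_vars.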

From bd3dc494097811542fd0e9bd73d876f534543e54 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi....@alibaba-inc.com>
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v8 2/5] Introduce a Bitset data struct.

While Bitmapset is designed for a variable number of bits, Bitset is
designed for a fixed number of bits: the length must be specified at
bitset_init time and stays unchanged for the whole lifespan. Because
of this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory. This
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +++++++++++++++++-
 src/backend/nodes/outfuncs.c                  |  51 +++++
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++++++++++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control
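
To illustrate the commit message above outside of PostgreSQL, here is a
minimal standalone sketch of a fixed-length bit set whose clear operation
zeroes the words but keeps the allocation. It is only an approximation of
the idea: FixedBits and the fb_* functions are hypothetical names, calloc
stands in for palloc0, and a plain uint64_t stands in for bitmapword.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FB_BITS_PER_WORD 64

typedef struct FixedBits
{
	int			nwords;			/* fixed at creation time */
	uint64_t	words[];		/* really [nwords] */
} FixedBits;

/* Allocate a zeroed set able to hold at least nbits bits. */
static FixedBits *
fb_init(size_t nbits)
{
	size_t		nwords = (nbits + FB_BITS_PER_WORD - 1) / FB_BITS_PER_WORD;
	FixedBits  *fb;

	if (nbits == 0)
		return NULL;

	fb = calloc(1, sizeof(FixedBits) + nwords * sizeof(uint64_t));
	if (fb != NULL)
		fb->nwords = (int) nwords;
	return fb;
}

/* The set never grows, so an out-of-range member is a caller bug. */
static void
fb_add_member(FixedBits *fb, int x)
{
	assert(fb != NULL && x >= 0 && x / FB_BITS_PER_WORD < fb->nwords);
	fb->words[x / FB_BITS_PER_WORD] |= UINT64_C(1) << (x % FB_BITS_PER_WORD);
}

static bool
fb_is_member(const FixedBits *fb, int x)
{
	assert(x >= 0);
	if (fb == NULL || x / FB_BITS_PER_WORD >= fb->nwords)
		return false;
	return ((fb->words[x / FB_BITS_PER_WORD] >> (x % FB_BITS_PER_WORD)) & 1) != 0;
}

/* Unset every bit but keep the memory, so the set can be reused. */
static void
fb_clear(FixedBits *fb)
{
	if (fb != NULL)
		memset(fb->words, 0, sizeof(uint64_t) * (size_t) fb->nwords);
}
```

Because the pointer stays valid across fb_clear, a caller such as a
per-tuple loop can reuse one set without reallocating it each time.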

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset; the size is rounded up to a whole number of words.
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only; the allocated memory is kept.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(int x, Bitset *a)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	if (a == NULL)
+		return false;
+
+	if (wordnum >= a->nwords)
+		return false;
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+int
+bitset_next_member(const Bitset *a, int prevbit)
+{
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
+
+/*
+ * bitset_to_bitmap - build a legal bitmapset from bitset.
+ */
+Bitmapset *
+bitset_to_bitmap(Bitset *a)
+{
+	int			n;
+
+	bool		found = false;	/* any non-empty bits */
+	Bitmapset  *result;
+	int			i;
+
+	if (a == NULL)
+		return NULL;
+
+	n = a->nwords - 1;
+	do
+	{
+		if (a->words[n] > 0)
+		{
+			found = true;
+			break;
+		}
+	} while (--n >= 0);
+
+	if (!found)
+		return NULL;
+
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(n + 1));
+	result->type = T_Bitmapset;
+	result->nwords = n + 1;
+
+	Assert(result->nwords <= a->nwords);
+
+	i = 0;
+	do
+	{
+		result->words[i] = a->words[i];
+	} while (++i < result->nwords);
+
+	return result;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2c30bba212..f2ee806ef1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -331,6 +331,43 @@ outBitmapset(StringInfo str, const Bitmapset *bms)
 	appendStringInfoChar(str, ')');
 }
 
+
+
+/*
+ * outBitsetInternal -
+ *	similar to outBitmapset, but for Bitset.
+ */
+static void
+outBitsetInternal(struct StringInfoData *str,
+				  const struct Bitset *bs,
+				  bool asBitmap)
+{
+	int			x;
+
+	appendStringInfoChar(str, '(');
+	if (asBitmap)
+		appendStringInfoChar(str, 'b');
+	else
+		appendStringInfo(str, "bs");
+	x = -1;
+	while ((x = bitset_next_member(bs, x)) >= 0)
+		appendStringInfo(str, " %d", x);
+	appendStringInfoChar(str, ')');
+}
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+void
+outBitset(struct StringInfoData *str,
+		  const struct Bitset *bs)
+{
+	outBitsetInternal(str, bs, false);
+}
+
+
 /*
  * Print the value of a Datum given its type.
  */
@@ -783,3 +820,17 @@ bmsToString(const Bitmapset *bms)
 	outBitmapset(&str, bms);
 	return str.data;
 }
+
+/*
+ * bitsetToString -
+ *	   similar to bmsToString, but for Bitset
+ */
+char *
+bitsetToString(const struct Bitset *bs, bool asBitmap)
+{
+	StringInfoData str;
+
+	initStringInfo(&str);
+	outBitsetInternal(&str, bs, asBitmap);
+	return str.data;
+}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 906e8dcc15..95ff37c6e9 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -55,6 +55,24 @@ typedef struct Bitmapset
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
 } Bitmapset;
 
+/*
+ * While Bitmapset is designed for a variable number of bits, Bitset is
+ * designed for a fixed number of bits: the length must be specified at
+ * bitset_init time and stays unchanged for the whole lifespan. Because
+ * of this, some operations on Bitset are simpler than on Bitmapset.
+ *
+ * bitset_clear unsets all the bits but keeps the allocated memory; this
+ * capability is impossible for Bitmapset for some solid reasons.
+ *
+ * Also, for performance, the Bitset functions drop some unlikely checks
+ * and replace them with Asserts.
+ */
+
+typedef struct Bitset
+{
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+} Bitset;
 
 /* result of bms_subset_compare */
 typedef enum
@@ -124,4 +142,14 @@ extern uint32 bms_hash_value(const Bitmapset *a);
 extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
+extern Bitset *bitset_init(size_t size);
+extern void bitset_clear(Bitset *a);
+extern void bitset_free(Bitset *a);
+extern bool bitset_is_empty(Bitset *a);
+extern Bitset *bitset_copy(Bitset *a);
+extern void bitset_add_member(Bitset *a, int x);
+extern void bitset_del_member(Bitset *a, int x);
+extern int	bitset_is_member(int bit, Bitset *a);
+extern int	bitset_next_member(const Bitset *a, int prevbit);
+extern Bitmapset *bitset_to_bitmap(Bitset *a);
 #endif							/* BITMAPSET_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..4d13107990 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -186,16 +186,20 @@ castNodeImpl(NodeTag type, void *ptr)
  * nodes/{outfuncs.c,print.c}
  */
 struct Bitmapset;				/* not to include bitmapset.h here */
+struct Bitset;					/* not to include bitmapset.h here */
 struct StringInfoData;			/* not to include stringinfo.h here */
 
 extern void outNode(struct StringInfoData *str, const void *obj);
 extern void outToken(struct StringInfoData *str, const char *s);
 extern void outBitmapset(struct StringInfoData *str,
 						 const struct Bitmapset *bms);
+extern void outBitset(struct StringInfoData *str, const struct Bitset *bs);
+
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 					 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
+extern char *bitsetToString(const struct Bitset *bs, bool asBitmap);
 
 /*
  * nodes/{readfuncs.c,read.c}
diff --git a/src/test/modules/test_misc/Makefile b/src/test/modules/test_misc/Makefile
index 39c6c2014a..af96604096 100644
--- a/src/test/modules/test_misc/Makefile
+++ b/src/test/modules/test_misc/Makefile
@@ -2,6 +2,17 @@
 
 TAP_TESTS = 1
 
+MODULE_big = test_misc
+OBJS = \
+	$(WIN32RES) \
+	test_misc.o
+PGFILEDESC = "test_misc"
+
+EXTENSION = test_misc
+DATA = test_misc--1.0.sql
+
+REGRESS = test_bitset
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/test_misc/README b/src/test/modules/test_misc/README
index 4876733fa2..ec426c4ad5 100644
--- a/src/test/modules/test_misc/README
+++ b/src/test/modules/test_misc/README
@@ -1,4 +1,2 @@
-This directory doesn't actually contain any extension module.
-
-What it is is a home for otherwise-unclassified TAP tests that exercise core
+What it is is a home for otherwise-unclassified tests that exercise core
 server features.  We might equally well have called it, say, src/test/misc.
diff --git a/src/test/modules/test_misc/expected/test_bitset.out b/src/test/modules/test_misc/expected/test_bitset.out
new file mode 100644
index 0000000000..3d0302d30d
--- /dev/null
+++ b/src/test/modules/test_misc/expected/test_bitset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_misc;
+SELECT test_bitset();
+ test_bitset 
+-------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 964d95db26..a23f3e3f47 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -1,5 +1,22 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+test_misc_sources = files(
+    'test_misc.c',
+)
+
+if host_system == 'windows'
+  test_misc_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_misc',
+    '--FILEDESC', 'test_misc - ',])
+endif
+
+test_misc = shared_module('test_misc',
+  test_misc_sources,
+  kwargs: pg_test_mod_args,
+)
+
+test_install_libs += test_misc
+
 tests += {
   'name': 'test_misc',
   'sd': meson.current_source_dir(),
diff --git a/src/test/modules/test_misc/sql/test_bitset.sql b/src/test/modules/test_misc/sql/test_bitset.sql
new file mode 100644
index 0000000000..0f73bbf532
--- /dev/null
+++ b/src/test/modules/test_misc/sql/test_bitset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_misc;
+
+SELECT test_bitset();
diff --git a/src/test/modules/test_misc/test_misc--1.0.sql b/src/test/modules/test_misc/test_misc--1.0.sql
new file mode 100644
index 0000000000..79afaa6263
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc--1.0.sql
@@ -0,0 +1,5 @@
+\echo Use "CREATE EXTENSION test_misc" to load this file. \quit
+
+CREATE FUNCTION test_bitset()
+       RETURNS pg_catalog.void
+       AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_misc/test_misc.c b/src/test/modules/test_misc/test_misc.c
new file mode 100644
index 0000000000..70d0255ada
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.c
@@ -0,0 +1,118 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_misc.c
+ *
+ * Copyright (c) 2022-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_misc/test_misc.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "fmgr.h"
+#include "nodes/bitmapset.h"
+#include "nodes/nodes.h"
+#define BIT_ADD 0
+#define BIT_DEL 1
+#ifdef USE_ASSERT_CHECKING
+static void compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op);
+#endif
+PG_MODULE_MAGIC;
+/* Test basic Bitset functionality */
+PG_FUNCTION_INFO_V1(test_bitset);
+Datum
+test_bitset(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ASSERT_CHECKING
+	Bitset	   *bs;
+	Bitset	   *bs2;
+	char	   *str1,
+			   *str2,
+			   *empty_str;
+	Bitmapset  *bms = NULL;
+	int			i;
+
+	empty_str = bmsToString(NULL);
+	/* size = 0 */
+	bs = bitset_init(0);
+	Assert(bs == NULL);
+	bitset_clear(bs);
+	Assert(bitset_is_empty(bs));
+	/* bitset_add_member(bs, 0); // crash. */
+	/* bitset_del_member(bs, 0); // crash. */
+	Assert(!bitset_is_member(0, bs));
+	Assert(bitset_next_member(bs, -1) == -2);
+	bs2 = bitset_copy(bs);
+	Assert(bs2 == NULL);
+	bitset_free(bs);
+	bitset_free(bs2);
+	/* size == 68, nword == 2 */
+	bs = bitset_init(68);
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_ADD);
+	}
+	Assert(!bitset_is_empty(bs));
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_DEL);
+	}
+	Assert(bitset_is_empty(bs));
+	bitset_clear(bs);
+	str1 = bitsetToString(bs, true);
+	Assert(strcmp(str1, empty_str) == 0);
+	bms = bitset_to_bitmap(bs);
+	str2 = bmsToString(bms);
+	Assert(strcmp(str1, str2) == 0);
+	bms = bitset_to_bitmap(NULL);
+	Assert(strcmp(bmsToString(bms), empty_str) == 0);
+	bitset_free(bs);
+#endif
+	PG_RETURN_VOID();
+}
+#ifdef USE_ASSERT_CHECKING
+static void
+compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op)
+{
+	char	   *str1,
+			   *str2,
+			   *str3,
+			   *str4;
+	Bitmapset  *bms3;
+	Bitset	   *bs4;
+
+	if (op == BIT_ADD)
+	{
+		*bms = bms_add_member(*bms, member);
+		bitset_add_member(bs, member);
+		Assert(bms_is_member(member, *bms));
+		Assert(bitset_is_member(member, bs));
+	}
+	else if (op == BIT_DEL)
+	{
+		*bms = bms_del_member(*bms, member);
+		bitset_del_member(bs, member);
+		Assert(!bms_is_member(member, *bms));
+		Assert(!bitset_is_member(member, bs));
+	}
+	else
+		Assert(false);
+	/* compare the remaining bits */
+	str1 = bmsToString(*bms);
+	str2 = bitsetToString(bs, true);
+	Assert(strcmp(str1, str2) == 0);
+	/* test bitset_to_bitmap */
+	bms3 = bitset_to_bitmap(bs);
+	str3 = bmsToString(bms3);
+	Assert(strcmp(str3, str2) == 0);
+	/* test bitset_copy */
+	bs4 = bitset_copy(bs);
+	str4 = bitsetToString(bs4, true);
+	Assert(strcmp(str3, str4) == 0);
+	pfree(str1);
+	pfree(str2);
+	pfree(str3);
+	pfree(str4);
+}
+#endif
diff --git a/src/test/modules/test_misc/test_misc.control b/src/test/modules/test_misc/test_misc.control
new file mode 100644
index 0000000000..48fd08758f
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.control
@@ -0,0 +1,4 @@
+comment = 'Test misc'
+default_version = '1.0'
+module_pathname = '$libdir/test_misc'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ac55727e05..152e586252 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -268,6 +268,7 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+Bitset
 Block
 BlockId
 BlockIdData
-- 
2.34.1
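
For reviewers who want to see the intended Bitset behaviour in isolation, here is a minimal standalone sketch. It is not the code from this patch: DemoBitset and the demo_* functions are hypothetical names, the 64-bit word size is an assumption, and malloc/calloc stand in for palloc. It only illustrates the property the series relies on, namely that the word array is sized once at init time and clearing the bits keeps the allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BITS_PER_WORD 64	/* assumption: 64-bit words */

/* Hypothetical stand-in for the patch's Bitset; NOT the PostgreSQL code. */
typedef struct DemoBitset
{
	size_t		nwords;			/* number of words, fixed at init time */
	uint64_t	words[];		/* the bits themselves */
} DemoBitset;

/* Allocate a zeroed set able to hold bits 0 .. nbits-1; NULL when nbits == 0. */
static DemoBitset *
demo_bitset_init(size_t nbits)
{
	size_t		nwords;
	DemoBitset *bs;

	if (nbits == 0)
		return NULL;

	nwords = (nbits + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD;
	bs = calloc(1, sizeof(DemoBitset) + nwords * sizeof(uint64_t));
	if (bs != NULL)
		bs->nwords = nwords;
	return bs;
}

/* Set bit x; the caller guarantees x is within the size given at init. */
static void
demo_bitset_add_member(DemoBitset *bs, int x)
{
	bs->words[x / DEMO_BITS_PER_WORD] |= UINT64_C(1) << (x % DEMO_BITS_PER_WORD);
}

/* Test bit x; a NULL set or an out-of-range x is simply "not a member". */
static bool
demo_bitset_is_member(int x, const DemoBitset *bs)
{
	if (bs == NULL || x < 0 || (size_t) x / DEMO_BITS_PER_WORD >= bs->nwords)
		return false;
	return (bs->words[x / DEMO_BITS_PER_WORD] >> (x % DEMO_BITS_PER_WORD)) & 1;
}

/* Clear every bit but keep the allocation, so the set can be refilled cheaply. */
static void
demo_bitset_clear(DemoBitset *bs)
{
	if (bs != NULL)
		memset(bs->words, 0, sizeof(uint64_t) * bs->nwords);
}

/* True when no bit is set (a NULL set counts as empty). */
static bool
demo_bitset_is_empty(const DemoBitset *bs)
{
	size_t		i;

	if (bs == NULL)
		return true;
	for (i = 0; i < bs->nwords; i++)
	{
		if (bs->words[i] != 0)
			return false;
	}
	return true;
}
```

The point of the design is that a clear followed by an add touches no allocator at all, which is what a per-tuple cycle wants.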

>From 99fe617f2e09955f1ef638b360bc7b819779b18f Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi....@alibaba-inc.com>
Date: Sat, 24 Feb 2024 21:26:19 +0800
Subject: [PATCH v8 3/5] Use bitset instead of Bitmapset for pre_detoast_attrs

Due to commit 00b41463c, when all the bits in a Bitmapset are unset, the
memory under it is released automatically. This causes extra overhead
for the shared-detoast-datum feature, so use Bitset instead.
---
 src/backend/executor/execExprInterp.c |  2 +-
 src/backend/executor/execTuples.c     |  8 ++++----
 src/include/executor/tuptable.h       | 17 ++++++++++++-----
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dc0db12ff4..878235ae76 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -202,7 +202,7 @@ ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
 																(struct varlena *) oldDatum));
 		Assert(slot->tts_nvalid > attnum);
 		Assert(oldDatum != slot->tts_values[attnum]);
-		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		bitset_add_member(slot->pre_detoasted_attrs, attnum);
 		MemoryContextSwitchTo(old);
 	}
 }
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 2d11e5e8b3..830e1d4ea6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -179,7 +179,7 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
-		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+		if (bitset_is_member(natt, slot->pre_detoasted_attrs))
 			/* it has been in slot->tts_mcxt already. */
 			continue;
 
@@ -1179,7 +1179,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
-		slot->pre_detoasted_attrs = NULL;
+		slot->pre_detoasted_attrs = bitset_init(tupleDesc->natts);
 	}
 	else
 		slot->pre_detoasted_attrs = NULL;
@@ -1339,7 +1339,7 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
 	if (slot->pre_detoasted_attrs)
-		bms_free(slot->pre_detoasted_attrs);
+		bitset_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1357,7 +1357,7 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
 
 	old = MemoryContextSwitchTo(slot->tts_mcxt);
-	slot->pre_detoasted_attrs = NULL;
+	slot->pre_detoasted_attrs = bitset_init(tupdesc->natts);
 	MemoryContextSwitchTo(old);
 }
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 3951ffc495..8f3eba7fbb 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -138,9 +138,16 @@ typedef struct TupleTableSlot
 	 * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
 	 * invalidated.
 	 *
+	 * Bitset rather than Bitmapset is chosen here because when all the
+	 * members of a Bitmapset are deleted, its memory is deallocated
+	 * automatically.  That is too expensive here, since we need to delete
+	 * all the members in each ExecClearTuple and repopulate the set when
+	 * the detoasted datum is stored into tts_values[*].  This cycle repeats
+	 * for every tuple during execution.
+	 *
 	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
 	 */
-	Bitmapset	   *pre_detoasted_attrs;
+	Bitset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -448,12 +455,12 @@ ExecFreePreDetoastDatum(TupleTableSlot *slot)
 	int			attnum;
 
 	attnum = -1;
-	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
 	{
 		pfree((void *) slot->tts_values[attnum]);
 	}
 
-	slot->pre_detoasted_attrs = bms_del_members(slot->pre_detoasted_attrs, slot->pre_detoasted_attrs);
+	bitset_clear(slot->pre_detoasted_attrs);
 }
 
 
@@ -536,9 +543,9 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 		int			attnum = -1;
 		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
 
-		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
 
-		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
 		{
 			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
 			Size		len;
-- 
2.34.1

>From 83ea86a095bdf6f5c829299906e9f5cfdee9e2b4 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi....@alibaba-inc.com>
Date: Sun, 25 Feb 2024 17:25:36 +0800
Subject: [PATCH v8 4/5] Add tts_values_mctx to TupleTableSlot

This context holds the detoasted datums for now.  With this change, we
no longer need to iterate pre_detoasted_attrs and pfree each datum
during ExecClearTuple; instead we just reset the memory context, which
is more efficient. The cost is that a 1KB MemoryContext is added to
each TupleTableSlot unconditionally.

As of now, slot->pre_detoasted_attrs is still needed in the ExecCopySlot
case.
---
 src/backend/executor/execExprInterp.c |  2 +-
 src/backend/executor/execTuples.c     | 50 ++++++++++++++++-----------
 src/backend/nodes/bitmapset.c         |  9 -----
 src/include/executor/tuptable.h       | 17 +++++----
 src/include/nodes/bitmapset.h         | 13 ++++++-
 5 files changed, 53 insertions(+), 38 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 878235ae76..13abe54c0b 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -195,7 +195,7 @@ ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
 		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
 	{
 		Datum		oldDatum;
-		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_values_mctx);
 
 		oldDatum = slot->tts_values[attnum];
 		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 830e1d4ea6..3a56d1c55d 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -375,6 +375,15 @@ tts_heap_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+	/*
+	 * tts_values is treated as invalidated since tts_nvalid will be set to
+	 * 0, so let's free the pre-detoasted datums.
+	 *
+	 * ExecFreePreDetoastDatum is called before tts_nvalid is set to 0
+	 * because its fast path returns early when tts_nvalid is already 0.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -399,13 +408,6 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
-
-	/*
-	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
-	 * let's free the pre-detoast datum.
-	 */
-	ExecFreePreDetoastDatum(slot);
-
 }
 
 static void
@@ -463,6 +465,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	tts_heap_clear(slot);
 
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	hslot->tuple = tuple;
 	hslot->off = 0;
@@ -472,8 +477,6 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -552,6 +555,10 @@ tts_minimal_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -584,9 +591,6 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
-
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -646,6 +650,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	Assert(TTS_EMPTY(slot));
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 	slot->tts_nvalid = 0;
 	mslot->off = 0;
 
@@ -658,8 +665,6 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
-	/* tts_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -756,6 +761,10 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	 * into the non-materialized tuple (which might be gone when accessed).
 	 */
 	bslot->base.off = 0;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 
 	if (!bslot->base.tuple)
@@ -794,9 +803,6 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
-
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -896,6 +902,10 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 	}
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	bslot->base.tuple = tuple;
 	bslot->base.off = 0;
@@ -930,9 +940,6 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
-
-	/* tts_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1166,6 +1173,9 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 		slot->tts_flags |= TTS_FLAG_FIXED;
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
+	slot->tts_values_mctx = GenerationContextCreate(slot->tts_mcxt,
+													"tts_value_ctx",
+													ALLOCSET_START_SMALL_SIZES);
 	slot->tts_nvalid = 0;
 
 	if (tupleDesc != NULL)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 40cfea2308..3e1ccfc8dd 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1485,15 +1485,6 @@ bitset_init(size_t size)
 	return result;
 }
 
-/*
- * bitset_clear - clear the bits only, but the memory is still there.
- */
-void
-bitset_clear(Bitset *a)
-{
-	if (a != NULL)
-		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
-}
 
 void
 bitset_free(Bitset *a)
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 8f3eba7fbb..53782a993f 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -20,6 +20,7 @@
 #include "access/tupdesc.h"
 #include "nodes/bitmapset.h"
 #include "storage/buf.h"
+#include "utils/memutils.h"
 
 /*----------
  * The executor stores tuples in a "tuple table" which is a List of
@@ -127,6 +128,7 @@ typedef struct TupleTableSlot
 #define FIELDNO_TUPLETABLESLOT_ISNULL 6
 	bool	   *tts_isnull;		/* current per-attribute isnull flags */
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
+	MemoryContext tts_values_mctx; /* reset when tts_values[*] is invalidated. */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
 
@@ -452,14 +454,11 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 static inline void
 ExecFreePreDetoastDatum(TupleTableSlot *slot)
 {
-	int			attnum;
-
-	attnum = -1;
-	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
-	{
-		pfree((void *) slot->tts_values[attnum]);
-	}
+	/* Fast path: already called when tts_nvalid was set to 0. */
+	if (slot->tts_nvalid == 0)
+		return;
 
+	MemoryContextResetOnly(slot->tts_values_mctx);
 	bitset_clear(slot->pre_detoasted_attrs);
 }
 
@@ -545,6 +544,10 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
 
+		MemoryContextSwitchTo(old);
+
+		old = MemoryContextSwitchTo(dstslot->tts_values_mctx);
+
 		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
 		{
 			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 95ff37c6e9..e433842217 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -143,7 +143,6 @@ extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
 extern Bitset *bitset_init(size_t size);
-extern void bitset_clear(Bitset *a);
 extern void bitset_free(Bitset *a);
 extern bool bitset_is_empty(Bitset *a);
 extern Bitset *bitset_copy(Bitset *a);
@@ -152,4 +151,16 @@ extern void bitset_del_member(Bitset *a, int x);
 extern int	bitset_is_member(int bit, Bitset *a);
 extern int	bitset_next_member(const Bitset *a, int prevbit);
 extern Bitmapset *bitset_to_bitmap(Bitset *a);
+
+/*
+ * bitset_clear - clear all the bits but keep the allocated memory.
+ * This is used in a performance-critical path, so it is inlined.
+ */
+static inline void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
 #endif							/* BITMAPSET_H */
-- 
2.34.1
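
The trade-off in patch 4/5 above (one context reset instead of a pfree per detoasted datum) can be illustrated with a minimal standalone sketch. It is not PostgreSQL code: DemoArena and the demo_arena_* functions are hypothetical, and a plain bump allocator stands in for the generation context behind tts_values_mctx. It only shows why a reset is cheap: the cost does not depend on how many datums were allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump arena; NOT PostgreSQL's MemoryContext implementation. */
typedef struct DemoArena
{
	char	   *base;			/* start of the backing block */
	size_t		size;			/* total bytes in the block */
	size_t		used;			/* bytes handed out so far */
} DemoArena;

/* Allocate the backing block once; returns 1 on success, 0 on failure. */
static int
demo_arena_init(DemoArena *a, size_t size)
{
	a->base = malloc(size);
	a->size = size;
	a->used = 0;
	return a->base != NULL;
}

/* Hand out len bytes, rounded up to 8; NULL when the block is exhausted. */
static void *
demo_arena_alloc(DemoArena *a, size_t len)
{
	size_t		aligned;
	void	   *p;

	if (len > a->size)
		return NULL;
	aligned = (len + 7) & ~(size_t) 7;
	if (aligned > a->size - a->used)
		return NULL;
	p = a->base + a->used;
	a->used += aligned;
	return p;
}

/* Drop every allocation at once, in O(1), keeping the block for reuse. */
static void
demo_arena_reset(DemoArena *a)
{
	a->used = 0;
}

/* Release the backing block itself. */
static void
demo_arena_destroy(DemoArena *a)
{
	free(a->base);
	a->base = NULL;
	a->size = 0;
	a->used = 0;
}
```

In this picture, each detoasted datum is one demo_arena_alloc call, and clearing the slot is one demo_arena_reset regardless of how many attributes were detoasted. The price is the up-front block per slot, which corresponds to the unconditional per-slot cost mentioned in the commit message.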

>From 75e03c2042d96a8df0586f27fe141523059d5c85 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi....@alibaba-inc.com>
Date: Mon, 26 Feb 2024 09:58:37 +0800
Subject: [PATCH v8 5/5] Provide a build option for review purposes.

When DEBUG_PRE_DETOAST_DATUM is defined, each attribute that is
pre-detoasted, and the plan node it belongs to, is logged at INFO level.
---
 src/backend/executor/execExpr.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 779fcfaab1..45c2c625b2 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -949,6 +949,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
 											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_INNER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+								elog(INFO,
+									 "EEOP_INNER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+									 state->flags,
+									 plan->startup_cost,
+									 plan->total_cost,
+									 attnum);
+#endif
 							}
 							else
 							{
@@ -961,6 +969,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
 											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+#endif
 							}
 							else
 								scratch.opcode = EEOP_OUTER_VAR;
@@ -974,6 +990,15 @@ ExecInitExprRec(Expr *node, ExprState *state,
 																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+#endif
 							}
 							else
 								scratch.opcode = EEOP_SCAN_VAR;
-- 
2.34.1
