I gather this is important when large_upd_rate=rate(cross-page update bytes
for tuples larger than fillfactor) exceeds small_ins_rate=rate(insert bytes
for tuples NOT larger than fillfactor).  That is a plausible outcome when
inserts are rare, and table bloat then accrues at
large_upd_rate-small_ins_rate.  I agree this patch improves behavior.

Does anyone have a strong opinion on whether to back-patch?  I am weakly
inclined not to back-patch, because today's behavior might happen to perform
better when large_upd_rate-small_ins_rate<0.  Besides the usual choices of
back-patching or not back-patching, we could back-patch with a stricter
threshold.  Suppose we accepted pages for larger-than-fillfactor tuples when
the pages have at least
BLCKSZ-SizeOfPageHeaderData-sizeof(ItemIdData)-MAXALIGN(MAXALIGN(SizeofHeapTupleHeader)+1)+1
bytes of free space.  That wouldn't reuse a page containing a one-column
tuple, but it would reuse a page having up to eight line pointers.

On Fri, Mar 19, 2021 at 02:16:22PM -0400, John Naylor wrote:
> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c
> @@ -335,11 +335,24 @@ RelationGetBufferForTuple(Relation relation, Size len,

> +     const Size      maxPaddedFsmRequest = MaxHeapTupleSize -
> +     (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));

In evaluating whether this is a good choice of value, I think about the
expected page lifecycle.  A tuple barely larger than fillfactor (roughly
len=1+BLCKSZ*fillfactor/100) will start on a roughly-empty page.  As long as
the tuple exists, the server will skip that page for inserts.  Updates can
cause up to floor(99/fillfactor) same-size versions of the tuple to occupy the
page simultaneously, creating that many line pointers.  At the fillfactor=10
minimum, it's good to accept otherwise-empty pages having at least nine line
pointers, so a page can restart the aforementioned lifecycle.  Tolerating even
more line pointers helps when updates reduce tuple size or when the page was
used for smaller tuples before it last emptied.  At the BLCKSZ=8192 default,
this maxPaddedFsmRequest allows 36 line pointers (good or somewhat high).  At
the BLCKSZ=1024 minimum, it allows 4 line pointers (low).  At the BLCKSZ=32768
maximum, 146 (likely excessive).  I'm not concerned about optimizing
non-default block sizes, so let's keep your proposal.

Comments and the maxPaddedFsmRequest variable name use the term "fsm" for things
not specific to the FSM.  For example, the patch's test case doesn't use the
FSM.  (That is fine.  Ordinarily, RelationGetTargetBlock() furnishes its
block.  Under CLOBBER_CACHE_ALWAYS, the "try the last page" logic does so.  An
FSM-using test would contain a VACUUM.)  I plan to commit the attached
version; compared to v5, it updates comments and renames this variable.

Thanks,
nm
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Accept slightly-filled pages for tuples larger than fillfactor.
    
    We always inserted a larger-than-fillfactor tuple into a newly-extended
    page, even when existing pages were empty or contained nothing but an
    unused line pointer.  This was unnecessary relation extension.  Start
    tolerating page usage up to 1/8 the maximum space that could be taken up
    by line pointers.  This is somewhat arbitrary, but it should allow more
    cases to reuse pages.  This has no effect on tables with fillfactor=100
    (the default).
    
    John Naylor and Floris van Nee.  Reviewed by Matthias van de Meent.
    Reported by Floris van Nee.
    
    Discussion: https://postgr.es/m/6e263217180649339720afe2176c5...@opammb0562.comp.optiver.com

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 75223c9..08b4e1b 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -317,10 +317,10 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
  *     BULKWRITE buffer selection strategy object to the buffer manager.
  *     Passing NULL for bistate selects the default behavior.
  *
- *     We always try to avoid filling existing pages further than the fillfactor.
- *     This is OK since this routine is not consulted when updating a tuple and
- *     keeping it on the same page, which is the scenario fillfactor is meant
- *     to reserve space for.
+ *     We don't fill existing pages further than the fillfactor, except for large
+ *     tuples in nearly-empty tables.  This is OK since this routine is not
+ *     consulted when updating a tuple and keeping it on the same page, which is
+ *     the scenario fillfactor is meant to reserve space for.
  *
  *     ereport(ERROR) is allowed here, so this routine *must* be called
  *     before any (unlogged) changes are made in buffer pool.
@@ -334,8 +334,10 @@ RelationGetBufferForTuple(Relation relation, Size len,
        bool            use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
        Buffer          buffer = InvalidBuffer;
        Page            page;
-       Size            pageFreeSpace = 0,
-                               saveFreeSpace = 0;
+       Size            nearlyEmptyFreeSpace,
+                               pageFreeSpace = 0,
+                               saveFreeSpace = 0,
+                               targetFreeSpace = 0;
        BlockNumber targetBlock,
                                otherBlock;
        bool            needLock;
@@ -358,6 +360,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
        saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                       HEAP_DEFAULT_FILLFACTOR);
 
+       /*
+        * Since pages without tuples can still have line pointers, we consider
+        * pages "empty" when the unavailable space is slight.  This threshold is
+        * somewhat arbitrary, but it should prevent most unnecessary relation
+        * extensions while inserting large tuples into low-fillfactor tables.
+        */
+       nearlyEmptyFreeSpace = MaxHeapTupleSize -
+               (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
+       if (len + saveFreeSpace > nearlyEmptyFreeSpace)
+               targetFreeSpace = Max(len, nearlyEmptyFreeSpace);
+       else
+               targetFreeSpace = len + saveFreeSpace;
+
        if (otherBuffer != InvalidBuffer)
                otherBlock = BufferGetBlockNumber(otherBuffer);
        else
@@ -376,13 +391,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
         * When use_fsm is false, we either put the tuple onto the existing 
target
         * page or extend the relation.
         */
-       if (len + saveFreeSpace > MaxHeapTupleSize)
-       {
-               /* can't fit, don't bother asking FSM */
-               targetBlock = InvalidBlockNumber;
-               use_fsm = false;
-       }
-       else if (bistate && bistate->current_buf != InvalidBuffer)
+       if (bistate && bistate->current_buf != InvalidBuffer)
                targetBlock = BufferGetBlockNumber(bistate->current_buf);
        else
                targetBlock = RelationGetTargetBlock(relation);
@@ -393,7 +402,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
                 * We have no cached target page, so ask the FSM for an initial
                 * target.
                 */
-       if (len + saveFreeSpace > MaxHeapTupleSize)
+               targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
        }
 
        /*
@@ -517,7 +526,7 @@ loop:
                }
 
                pageFreeSpace = PageGetHeapFreeSpace(page);
-               if (len + saveFreeSpace <= pageFreeSpace)
+               if (targetFreeSpace <= pageFreeSpace)
                {
                        /* use this page as future insert target, too */
                        RelationSetTargetBlock(relation, targetBlock);
@@ -550,7 +559,7 @@ loop:
                targetBlock = RecordAndGetPageWithFreeSpace(relation,
                                                            targetBlock,
                                                            pageFreeSpace,
-                                                           len + saveFreeSpace);
+                                                           targetFreeSpace);
        }
 
        /*
@@ -582,7 +591,7 @@ loop:
                         * Check if some other backend has extended a block for us while
                         * we were waiting on the lock.
                         */
-                       targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+                       targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
 
                        /*
                         * If some other waiter has already extended the 
relation, we
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index cf13e60..1aff62c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -676,7 +676,11 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 
                if (len + saveFreeSpace > pageFreeSpace)
                {
-                       /* Doesn't fit, so write out the existing page */
+                       /*
+                        * Doesn't fit, so write out the existing page.  It always
+                        * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
+                        * enforce saveFreeSpace unconditionally.
+                        */
 
                        /* XLOG stuff */
                        if (RelationNeedsWAL(state->rs_new_rel))
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index da50ee3..5063a3d 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -82,6 +82,27 @@ select col1, col2, char_length(col3) from inserttest;
 
 drop table inserttest;
 --
+-- tuple larger than fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+-- create page w/ free space in range [nearlyEmptyFreeSpace, MaxHeapTupleSize)
+INSERT INTO large_tuple_test (select 1, NULL);
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+ pg_size_pretty 
+----------------
+ 8192 bytes
+(1 row)
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+DROP TABLE large_tuple_test;
+--
 -- check indirection (field/array assignment), cf bug #14265
 --
 -- these tests are aware that transformInsertStmt has 3 separate code paths
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 963faa1..bfaa8a3 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -38,6 +38,28 @@ select col1, col2, char_length(col3) from inserttest;
 drop table inserttest;
 
 --
+-- tuple larger than fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+
+-- create page w/ free space in range [nearlyEmptyFreeSpace, MaxHeapTupleSize)
+INSERT INTO large_tuple_test (select 1, NULL);
+
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+
+DROP TABLE large_tuple_test;
+
+--
 -- check indirection (field/array assignment), cf bug #14265
 --
 -- these tests are aware that transformInsertStmt has 3 separate code paths