On Wed, 9 Sep 2020 at 05:38, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Wed, Sep 9, 2020 at 3:47 AM David Rowley <dgrowle...@gmail.com> wrote:
> > On Tue, 8 Sep 2020 at 12:08, Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > One thought is that if we're going to copy everything out and back in
> > > again, we might want to consider doing it in a
> > > memory-prefetcher-friendly order.  Would it be a good idea to
> > > rearrange the tuples to match line pointer order, so that the copying
> > > work and also later sequential scans are in a forward direction?
> >
> > That's an interesting idea but wouldn't that require both the copy to
> > the separate buffer *and* a qsort? That's the worst of both
> > implementations. We'd need some other data structure too in order to
> > get the index of the sorted array by reverse lineitem pointer, which
> > might require an additional array and an additional sort.
>
> Well I may not have had enough coffee yet but I thought you'd just
> have to spin through the item IDs twice.  Once to compute sum(lp_len)
> so you can compute the new pd_upper, and the second time to copy the
> tuples from their random locations on the temporary page to new
> sequential locations, so that afterwards item ID order matches offset
> order.

I think you were adequately caffeinated.  You're right that this is
fairly simple to do, but it looks even simpler than looping over the
array twice.  I think it's just a matter of looping over the
itemidbase backwards and putting the higher itemid tuples at the end
of the page. I've done it this way in the attached patch.
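
For reference, stripped of the memcpy collapsing described further down,
the non-presorted path amounts to something like this (a condensed sketch
using the itemIdCompactData/itemIdCompact types from the attached patch,
not the patch code itself):

static void
compactify_tuples_sketch(itemIdCompact itemidbase, int nitems, Page page)
{
    PageHeader      phdr = (PageHeader) page;
    PGAlignedBlock  scratch;
    Offset          upper = phdr->pd_special;
    int             i;

    /* stage the entire tuple area so we're free to overwrite it */
    memcpy(scratch.data + phdr->pd_upper,
           (char *) page + phdr->pd_upper,
           phdr->pd_special - phdr->pd_upper);

    /*
     * Walk the itemidbase array backwards, laying tuples down from
     * pd_special, so that afterwards line pointer order matches offset
     * order.
     */
    for (i = nitems - 1; i >= 0; i--)
    {
        itemIdCompact   itemidptr = &itemidbase[i];
        ItemId          lp = PageGetItemId(page, itemidptr->offsetindex + 1);

        upper -= itemidptr->alignedlen;
        memcpy((char *) page + upper,
               scratch.data + itemidptr->itemoff,
               itemidptr->alignedlen);
        lp->lp_off = upper;
    }

    phdr->pd_upper = upper;
}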

I also added a presorted path which falls back on doing memmoves
without the temp buffer when the itemidbase array indicates that
higher lineitems all have higher offsets.  I'm doing the presort check
in the calling function since that already loops over the lineitems.
When the tuples are already in offset order, we can just memmove them
in reverse order without overwriting any tuples that have yet to be
moved.
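
The presorted path is then the same backwards walk, but moving each tuple
in place (again just a sketch, minus the collapsing of adjacent calls that
the patch does):

static void
compactify_tuples_presorted_sketch(itemIdCompact itemidbase, int nitems, Page page)
{
    PageHeader  phdr = (PageHeader) page;
    Offset      upper = phdr->pd_special;
    int         i;

    for (i = nitems - 1; i >= 0; i--)
    {
        itemIdCompact   itemidptr = &itemidbase[i];
        ItemId          lp = PageGetItemId(page, itemidptr->offsetindex + 1);

        /*
         * Processing the highest-offset tuple first means the destination
         * never overlaps a tuple that still has to move, so no temp buffer
         * is needed; memmove() copes with a tuple overlapping itself.
         */
        upper -= itemidptr->alignedlen;
        memmove((char *) page + upper,
                (char *) page + itemidptr->itemoff,
                itemidptr->alignedlen);
        lp->lp_off = upper;
    }

    phdr->pd_upper = upper;
}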

Also, I added code to collapse the memcpys and memmoves for adjacent
tuples so that we perform the minimal number of calls to those
functions. Once we've previously compacted a page, the code seems able
to reduce the number of calls significantly.  I added some logging and
reviewed it after a run of the benchmark, and saw that for about 192
tuples we're mostly doing just 3-4 memcpys in the non-presorted path
and just 2 memmoves in the presorted code path.  I also found that in
my test the presorted path was only taken 12.39% of the time. Trying
with 120 million UPDATEs instead of 12 million in the test reduced
this to just 10.89%.  It seems that it'll just be 1 or 2 tuples
spoiling it, since new tuples will still be added earlier in the page
after we free up space to add more.
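
To give an idea of the collapsing, adjacent source ranges are merged and
only flushed as a single call when a gap appears.  Here's a toy standalone
example, with made-up offsets/lengths (not taken from the benchmark), that
just counts the calls needed:

#include <stdio.h>

typedef struct { int off; int len; } Chunk;

int
main(void)
{
    /* hypothetical tuples, listed in descending offset order */
    Chunk   chunks[] = {{8000, 64}, {7936, 64}, {7872, 64}, {7600, 64}, {7536, 64}};
    int     nchunks = sizeof(chunks) / sizeof(chunks[0]);
    int     calls = 1;
    int     i;

    for (i = 1; i < nchunks; i++)
    {
        /* a new copy call is only needed when the ranges aren't adjacent */
        if (chunks[i].off + chunks[i].len != chunks[i - 1].off)
            calls++;
    }

    printf("%d chunks collapse into %d copy calls\n", nchunks, calls);
    return 0;
}

which prints "5 chunks collapse into 2 copy calls".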

I also experimented with collapsing the memcpys used for copying into
the temp buffer.  Performance got a little worse from doing that, so
I've left that code #ifdef'd out.

With the attached v3, performance is better. The test now runs
recovery in 65.6 seconds, vs master's 148.5 seconds, so about 2.2x
faster.

We should probably consider what else can be done to try to write
pages with tuples for earlier lineitems placed earlier in the page.
VACUUM FULL and friends will switch back to the opposite order when
rewriting the heap.

Also, with my missing libc debug symbols now fixed, here's the updated profile:

  24.90%  postgres  postgres            [.] PageRepairFragmentation
  15.26%  postgres  libc-2.31.so        [.] __memmove_avx_unaligned_erms
   9.61%  postgres  postgres            [.] hash_search_with_hash_value
   8.03%  postgres  postgres            [.] compactify_tuples
   6.25%  postgres  postgres            [.] XLogReadBufferExtended
   3.74%  postgres  postgres            [.] PinBuffer
   2.25%  postgres  postgres            [.] hash_bytes
   1.79%  postgres  postgres            [.] heap_xlog_update
   1.47%  postgres  postgres            [.] LWLockRelease
   1.44%  postgres  postgres            [.] XLogReadRecord
   1.33%  postgres  postgres            [.] PageGetHeapFreeSpace
   1.16%  postgres  postgres            [.] DecodeXLogRecord
   1.13%  postgres  postgres            [.] pg_comp_crc32c_sse42
   1.12%  postgres  postgres            [.] LWLockAttemptLock
   1.09%  postgres  postgres            [.] StartupXLOG
   0.90%  postgres  postgres            [.] ReadBuffer_common
   0.84%  postgres  postgres            [.] SlruSelectLRUPage
   0.74%  postgres  libc-2.31.so        [.] __memcmp_avx2_movbe
   0.68%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
   0.66%  postgres  postgres            [.] PageAddItemExtended
   0.66%  postgres  postgres            [.] PageIndexTupleOverwrite
   0.62%  postgres  postgres            [.] smgropen
   0.60%  postgres  postgres            [.] ReadPageInternal
   0.57%  postgres  postgres            [.] GetPrivateRefCountEntry
   0.52%  postgres  postgres            [.] heap_redo
   0.51%  postgres  postgres            [.] AdvanceNextFullTransactionIdPastXid

David
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index d708117a40..339dfc0763 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -411,51 +411,187 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
 }
 
 /*
- * sorting support for PageRepairFragmentation and PageIndexMultiDelete
+ * Item pruning support for PageRepairFragmentation and PageIndexMultiDelete
  */
-typedef struct itemIdSortData
+typedef struct itemIdCompactData
 {
        uint16          offsetindex;    /* linp array index */
        int16           itemoff;                /* page offset of item data */
        uint16          alignedlen;             /* MAXALIGN(item data len) */
-} itemIdSortData;
-typedef itemIdSortData *itemIdSort;
-
-static int
-itemoffcompare(const void *itemidp1, const void *itemidp2)
-{
-       /* Sort in decreasing itemoff order */
-       return ((itemIdSort) itemidp2)->itemoff -
-               ((itemIdSort) itemidp1)->itemoff;
-}
+} itemIdCompactData;
+typedef itemIdCompactData *itemIdCompact;
 
 /*
  * After removing or marking some line pointers unused, move the tuples to
  * remove the gaps caused by the removed items.
+ *
+ * Callers may pass 'presorted' as true if the itemidbase array is sorted in
+ * ascending order of itemoff.  This allows a slightly more optimal code path
+ * to be taken.
  */
 static void
-compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
+compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorted)
 {
        PageHeader      phdr = (PageHeader) page;
        Offset          upper;
        int                     i;
 
-       /* sort itemIdSortData array into decreasing itemoff order */
-       qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
-                 itemoffcompare);
-
-       upper = phdr->pd_special;
-       for (i = 0; i < nitems; i++)
+       if (presorted)
        {
-               itemIdSort      itemidptr = &itemidbase[i];
-               ItemId          lp;
+               Offset  copy_tail;
+               Offset  copy_head;
+               itemIdCompact   itemidptr;
+
+               /*
+                * The order of lineitem offsets is already in the optimal order, i.e.,
+                * lower item pointers have a lower offset.  This allows us to move
+                * the tuples up to the end of the heap in reverse order without
+                * having to worry about overwriting memory of other tuples during the
+                * move operation.
+                */
+
+               /*
+                * Double check the caller didn't mess up while setting the presorted
+                * flag to true.
+                */
+#ifdef USE_ASSERT_CHECKING
+               {
+                       Offset lastoff = 0;
+
+                       for (i = 0; i < nitems; i++)
+                       {
+                               itemidptr = &itemidbase[i];
+                               if (lastoff > itemidptr->itemoff)
+                                       Assert(false);
+                               lastoff = itemidptr->itemoff;
+                       }
+               }
+#endif /* USE_ASSERT_CHECKING */
 
-               lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-               upper -= itemidptr->alignedlen;
+               /*
+                * Do the tuple compactification.  Collapse memmove calls for adjacent
+                * tuples.
+                */
+               upper = phdr->pd_special;
+               copy_tail = copy_head = itemidbase[nitems - 1].itemoff + itemidbase[nitems - 1].alignedlen;
+               for (i = nitems - 1; i >= 0; i--)
+               {
+                       ItemId          lp;
+
+                       itemidptr = &itemidbase[i];
+                       lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+
+                       if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+                       {
+                               memmove((char *) page + upper,
+                                               page + copy_head,
+                                               copy_tail - copy_head);
+                               copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+                       }
+                       upper -= itemidptr->alignedlen;
+                       copy_head = itemidptr->itemoff;
+
+                       lp->lp_off = upper;
+
+               }
+
+               /* move the remaining chunk */
                memmove((char *) page + upper,
-                               (char *) page + itemidptr->itemoff,
-                               itemidptr->alignedlen);
-               lp->lp_off = upper;
+                               page + copy_head,
+                               copy_tail - copy_head);
+       }
+       else
+       {
+               Offset  copy_tail;
+               Offset  copy_head;
+               itemIdCompact   itemidptr;
+               PGAlignedBlock scratch;
+               char       *scratchptr = scratch.data;
+
+               /*
+                * The tuples in the itemidbase array may be in any order so in order to
+                * move these to the end of the page we must make a temp copy of each
+                * tuple before we copy them back into the page at the new position.
+                *
+                * If a large number of the tuples have been pruned (>75%) then we'll copy
+                * these into the temp buffer tuple-by-tuple, otherwise, we'll just copy
+                * the entire tuple section of the page in a single memcpy().  Doing this
+                * saves us doing large copies when we're pruning most of the tuples.
+                */
+               if (nitems < PageGetMaxOffsetNumber(page) / 4)
+               {
+                       for (i = 0; i < nitems; i++)
+                       {
+                               itemIdCompact   itemidptr = &itemidbase[i];
+                               memcpy(scratchptr + itemidptr->itemoff, page + itemidptr->itemoff,
+                                          itemidptr->alignedlen);
+                       }
+               }
+               else
+               {
+#ifdef NOTUSED
+                       /*
+                        * Try to do the minimal amount of copying to get the required
+                        * tuples into the temp buffer.
+                        */
+                       copy_tail = copy_head = itemidbase[0].itemoff + itemidbase[0].alignedlen;
+                       for (i = 0; i < nitems; i++)
+                       {
+                               itemidptr = &itemidbase[i];
+
+                               if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+                               {
+                                       memcpy((char *) scratchptr + copy_head,
+                                                  page + copy_head,
+                                                  copy_tail - copy_head);
+                                       copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+                               }
+                               copy_head = itemidptr->itemoff;
+                       }
+
+                       /* Copy the remaining chunk */
+                       memcpy((char *) scratchptr + copy_head,
+                                  page + copy_head,
+                                  copy_tail - copy_head);
+#else
+                       /* Copy the entire tuple space */
+                       memcpy(scratchptr + phdr->pd_upper,
+                                  page + phdr->pd_upper,
+                                  phdr->pd_special - phdr->pd_upper);
+#endif
+               }
+
+               /*
+                * Do the tuple compactification.  Collapse memmove calls for adjacent
+                * tuples.
+                */
+               upper = phdr->pd_special;
+               copy_tail = copy_head = itemidbase[nitems - 1].itemoff + itemidbase[nitems - 1].alignedlen;
+               for (i = nitems - 1; i >= 0; i--)
+               {
+                       ItemId          lp;
+
+                       itemidptr = &itemidbase[i];
+                       lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+
+                       if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+                       {
+                               memcpy((char *) page + upper,
+                                          scratchptr + copy_head,
+                                          copy_tail - copy_head);
+                               copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+                       }
+                       upper -= itemidptr->alignedlen;
+                       copy_head = itemidptr->itemoff;
+
+                       lp->lp_off = upper;
+
+               }
+
+               /* Copy the remaining chunk */
+               memcpy((char *) page + upper,
+                          scratchptr + copy_head,
+                          copy_tail - copy_head);
        }
 
        phdr->pd_upper = upper;
@@ -477,14 +613,16 @@ PageRepairFragmentation(Page page)
        Offset          pd_lower = ((PageHeader) page)->pd_lower;
        Offset          pd_upper = ((PageHeader) page)->pd_upper;
        Offset          pd_special = ((PageHeader) page)->pd_special;
-       itemIdSortData itemidbase[MaxHeapTuplesPerPage];
-       itemIdSort      itemidptr;
+       Offset          last_offset;
+       itemIdCompactData itemidbase[MaxHeapTuplesPerPage];
+       itemIdCompact   itemidptr;
        ItemId          lp;
        int                     nline,
                                nstorage,
                                nunused;
        int                     i;
        Size            totallen;
+       bool            presorted = true; /* For now */
 
        /*
         * It's worth the trouble to be more paranoid here than in most places,
@@ -509,6 +647,7 @@ PageRepairFragmentation(Page page)
        nline = PageGetMaxOffsetNumber(page);
        itemidptr = itemidbase;
        nunused = totallen = 0;
+       last_offset = 0;
        for (i = FirstOffsetNumber; i <= nline; i++)
        {
                lp = PageGetItemId(page, i);
@@ -518,6 +657,12 @@ PageRepairFragmentation(Page page)
                        {
                                itemidptr->offsetindex = i - 1;
                                itemidptr->itemoff = ItemIdGetOffset(lp);
+
+                               if (last_offset > itemidptr->itemoff)
+                                       presorted = false;
+                               else
+                                       last_offset = itemidptr->itemoff;
+
                                if (unlikely(itemidptr->itemoff < (int) pd_upper ||
                                                         itemidptr->itemoff >= (int) pd_special))
                                        ereport(ERROR,
@@ -552,7 +697,7 @@ PageRepairFragmentation(Page page)
                                         errmsg("corrupted item lengths: total %u, available space %u",
                                                        (unsigned int) totallen, pd_special - pd_lower)));
 
-               compactify_tuples(itemidbase, nstorage, page);
+               compactify_tuples(itemidbase, nstorage, page, presorted);
        }
 
        /* Set hint bit for PageAddItem */
@@ -831,9 +976,9 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
        Offset          pd_lower = phdr->pd_lower;
        Offset          pd_upper = phdr->pd_upper;
        Offset          pd_special = phdr->pd_special;
-       itemIdSortData itemidbase[MaxIndexTuplesPerPage];
+       itemIdCompactData itemidbase[MaxIndexTuplesPerPage];
        ItemIdData      newitemids[MaxIndexTuplesPerPage];
-       itemIdSort      itemidptr;
+       itemIdCompact   itemidptr;
        ItemId          lp;
        int                     nline,
                                nused;
@@ -932,7 +1077,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
        phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);
 
        /* and compactify the tuple data */
-       compactify_tuples(itemidbase, nused, page);
+       compactify_tuples(itemidbase, nused, page, false);
 }
 
 
