Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas  wrote:
> That's probably how it's going to go, but hey, doesn't hurt to ask :-).

I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real world workloads.
OTOH, perhaps the risk is so low that we might as well target
backbranches.

How do you feel about it?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 24/07/2019 21:02, Peter Geoghegan wrote:

On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?


I was hoping that you'd work on it.   :-)


That's probably how it's going to go, but hey, doesn't hurt to ask :-).


Any reason to think that it's much different to what you've done with GiST?


No, it should be very similar.

- Heikki




Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:
> Pushed this now, to master and REL_12_STABLE.
>
> Now, B-tree indexes still have the same problem, in all versions. Any
> volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it.   :-)

Any reason to think that it's much different to what you've done with GiST?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 22/07/2019 16:09, Heikki Linnakangas wrote:

Unless something comes up, I'll commit this tomorrow.


Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any 
volunteers to write a similar fix for B-trees?


- Heikki




Re: GiST VACUUM

2019-07-22 Thread Heikki Linnakangas

On 28/06/2019 01:02, Thomas Munro wrote:

On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.


I'm a bit uneasy about making this into a generally available function
for reuse.  I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids.  The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.


Fair point.


I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation.  I was planning to think harder about that for v13.
Obviously that doesn't help you now.  So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?


I came up with the attached. Instead of having a general purpose "widen" 
function, it adds GetFullRecentGlobalXmin(), to return a 64-bit version 
of RecentGlobalXmin. That's enough for this Gist vacuum patch.
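
For illustration, here is a rough, deliberately standalone sketch of the widening arithmetic such a helper has to do (this is not the actual varsup.c code; it assumes the 32-bit xid is known to be at most about 2 billion transactions behind the next XID, and it ignores the special XIDs below FirstNormalTransactionId):

#include <stdint.h>

/* Illustrative only: widen a 32-bit xid using the next full (epoch|xid) counter. */
static uint64_t
widen_xid(uint64_t next_full_xid, uint32_t xid)
{
    uint32_t    next_epoch = (uint32_t) (next_full_xid >> 32);
    uint32_t    next_xid = (uint32_t) next_full_xid;
    uint32_t    epoch;

    /* An xid numerically above nextXid must belong to the previous epoch. */
    epoch = (xid <= next_xid) ? next_epoch : next_epoch - 1;

    return ((uint64_t) epoch << 32) | xid;
}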


In addition to that change, I modified the GistPageSetDeleted(), 
GistPageSetDeleteXid(), GistPageGetDeleteXid() inline functions a bit. I 
merged GistPageSetDeleted() and GistPageSetDeleteXid() into a single 
function, to make sure that the delete-XID is always set when a page is 
marked as deleted. And I modified GistPageGetDeleteXid() to return 
NormalTransactionId (or a FullTransactionId version of it, to be 
precise), for Gist pages that were deleted with older PostgreSQL v12 
beta versions. The previous patch tripped an assertion in that case, 
which is not nice for people binary-upgrading from earlier beta versions.
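
Roughly, the backward-compatibility part of GistPageGetDeleteXid() could look like the sketch below. This is illustrative only: GISTDeletedPageContents is the struct mentioned elsewhere in this thread, and the exact fallback constant and helper used to build it are my assumptions, not necessarily the committed code.

static inline FullTransactionId
GistPageGetDeleteXid(Page page)
{
    Assert(GistPageIsDeleted(page));

    /* Is the full delete-XID actually stored in the page contents? */
    if (((PageHeader) page)->pd_lower >=
        MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents))
        return ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid;

    /*
     * The page was deleted by an older v12 beta that didn't store the XID
     * here.  Return an ancient-looking normal XID, so that the page is
     * considered recyclable rather than tripping an assertion.
     */
    return FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
}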


I did some testing of this with XID wraparound, and it seems to work. I 
used the attached bash script for the testing, with a helper contrib 
module to consume XIDs faster. It's not very well commented and probably 
not too useful for anyone, but I'm attaching it here mainly as a note to 
future-self, if we need to revisit this.


Unless something comes up, I'll commit this tomorrow.

- Heikki
>From bdeb2467211a1ab9e733e070d54dce97c05cf18c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 22 Jul 2019 15:57:01 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 169bf6fcfed..e9ca4b82527 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			

Re: GiST VACUUM

2019-07-16 Thread Amit Kapila
On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro  wrote:
>
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:
> > * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> > function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse.  I think we should instead come up with a very small number
> of sources of fxids that are known to be free of races because of some
> well explained interlocking.
>

I have two more cases in the undo patch series where the same function is
needed and it is safe to use it there.  The first place is twophase.c,
for rolling back prepared transactions, where we know that we don't
support aborted xacts that are older than 2 billion, so we can rely on
such a function.  We also need it in undodiscard.c to compute the
value of oldestFullXidHavingUnappliedUndo.  See the usage of
GetEpochForXid in the undo-processing patches.  Now, we might find a way to
avoid using it in one of these places by doing some more work, but I am not
sure we can avoid it in all three places (one proposed by this
patch and two by the undo patchset).

> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids.  The rationale
> for writing this function instead of an earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
>
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation.  I was planning to think harder about that for v13.
> Obviously that doesn't help you now.  So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
>

Do you suggest open-coding it in all three places for now?  I am
not against open-coding the logic for now, but I am not sure we can
eliminate the need for it, because even if we can do what you are saying with
the procarray.c-managed xid horizons, I think we need to do something
about clog.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: GiST VACUUM

2019-07-03 Thread Peter Geoghegan
On Thu, Apr 4, 2019 at 8:15 AM Heikki Linnakangas  wrote:
> I think we should fix this in a similar manner in B-tree, too, but that
> can be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is
> stored in the page opaque field.

What Postgres versions will the B-Tree fix end up targeting? Sounds
like you plan to backpatch all the way?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-06-27 Thread Thomas Munro
On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:
> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse.  I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids.  The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation.  I was planning to think harder about that for v13.
Obviously that doesn't help you now.  So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

[1] 
https://github.com/EnterpriseDB/zheap/commit/1203c2fa49f5f872f42ea4a02ccba2b191c45f32

-- 
Thomas Munro
https://enterprisedb.com




Re: GiST VACUUM

2019-06-27 Thread Heikki Linnakangas

On 27/06/2019 20:15, Andrey Borodin wrote:

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
    U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
    TransactionId latestRemovedXid;

    // sleep(100500 hours); latestRemovedXid becomes xid from future

    latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);

    ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and become an 
xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is from 
the ancient past already.


Good question. No, that can't happen, because this code is in the WAL 
redo function. In a standby, the next XID counter only moves forward 
when a WAL record is replayed that advances it, and all WAL records are 
replayed serially, so that can't happen when we're in the middle of 
replaying this record. A comment on that would be good, though.
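
For instance, the comment could sit right above that check in the redo routine; a sketch of what it might say (the wording is only illustrative, the code itself is the snippet quoted above):

    /*
     * latestRemovedFullXid was valid when the record was written.  During
     * recovery, the next-XID counter only advances when a WAL record that
     * advances it is replayed, and records are replayed serially, so it
     * cannot move while we are inside this redo function.  Hence, if the
     * deletion is less than 2^31 XIDs old, the 32-bit truncation below is
     * unambiguous; otherwise no running snapshot can see it, and we skip
     * the conflict check entirely.
     */
    nextFullXid = ReadNextFullTransactionId();
    diff = U64FromFullTransactionId(nextFullXid) -
        U64FromFullTransactionId(latestRemovedFullXid);
    if (diff < MaxTransactionId / 2)
    {
        TransactionId latestRemovedXid;

        latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
        ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
    }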


When I originally had the check like above in the code that created the 
WAL record, it had exactly that problem, because in the master the next 
XID counter can advance concurrently.


- Heikki




Re: GiST VACUUM

2019-06-27 Thread Andrey Borodin



> On 27 June 2019, at 16:38, Heikki Linnakangas wrote:
> 
> I haven't done any testing on this yet. Andrey, would you happen to have an 
> environment ready to test this?

The patches do not deadlock and do delete pages on the "rescan test" - the setup that we 
used to detect "left jumps" during development of physical vacuum. check-world is 
happy on my machine.
I really like that there is now GISTDeletedPageContents and we do not just cast 
*(FullTransactionId *) PageGetContents(page).

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
    U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
    TransactionId latestRemovedXid;

    // sleep(100500 hours); latestRemovedXid becomes xid from future

    latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);

    ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and become an 
xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is from 
the ancient past already. 

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-27 Thread Andrey Borodin



> On 27 June 2019, at 16:38, Heikki Linnakangas wrote:
> 
> I haven't done any testing on this yet. Andrey, would you happen to have an 
> environment ready to test this?

Thanks!

I will do some testing this evening (UTC+5). But I'm not sure I can reliably 
test wraparound of xids...
I will try to break the code with the usual setup which we used to stress vacuum with 
deletes and inserts.

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-27 Thread Heikki Linnakangas

On 26/06/2019 06:07, Thomas Munro wrote:

On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier  wrote:

On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:

I feel a little uncomfortable with the number 0x7fffffff right in the code.


PG_INT32_MAX...


MaxTransactionId / 2?


Yeah, that makes sense.

Here's a new version of the patches. Changes:

* I changed the reuse-logging so that we always write a reuse WAL 
record, even if the deleteXid is very old. I tried to avoid that with 
the check for MaxTransactionId / 2 or 0x7fff, but it had some 
problems. In the previous patch version, it wasn't just an optimization. 
Without the check, we would write 32-bit XIDs to the log that had 
already wrapped around, and that caused the standby to unnecessarily 
wait for or kill backends. But checking for MaxTransactionId / 2 isn't 
quite enough: there was a small chance that the next XID counter 
advanced just after checking for that, so that we still logged an XID 
that had just wrapped around. A more robust way to deal with this is to 
log a full 64-bit XID, and check for wraparound at redo in the standby. 
And if we do that, trying to optimize this in the master doesn't seem 
that important anymore. So in this patch version, we always log the 
64-bit XID, and do the MaxTransactionId / 2 check when replaying the WAL 
record instead.
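
In other words, the page-reuse WAL record itself carries the full 64-bit XID, conceptually something like the sketch below (the field names are illustrative, not necessarily the committed layout), and the redo side applies the MaxTransactionId / 2 check shown elsewhere in this thread before truncating it to 32 bits:

typedef struct gistxlogPageReuse
{
    RelFileNode node;                       /* index the recycled page belongs to */
    BlockNumber block;                      /* page being reused */
    FullTransactionId latestRemovedFullXid; /* 64-bit delete XID of that page */
} gistxlogPageReuse;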


* I moved the logic to extend a 32-bit XID to 64-bits to a new helper 
function in varsup.c.


* Instead of storing just a naked FullTransactionId in the "page 
contents" of a deleted page, I created a new struct for that. The effect 
is the same, but I think the struct clarifies what's happening, and it's 
a good location to attach a comment explaining it.
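
As a sketch, such a struct can be as small as the following (the name GISTDeletedPageContents is mentioned elsewhere in this thread; the comment wording here is mine):

/* Contents of a deleted GiST page: just the horizon needed to recycle it. */
typedef struct GISTDeletedPageContents
{
    /* last xid which could see the page in a scan */
    FullTransactionId deleteXid;
} GISTDeletedPageContents;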


* Fixed the mixup between < and >

I haven't done any testing on this yet. Andrey, would you happen to have 
an environment ready to test this?


- Heikki
>From 7fd5901e15ac9e0f1928eeecbb9ae1212bacf3f3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7da..79030e77153 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -858,12 +856,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 * leaf/inner is enough to recognize split for root
 	 */
 }
-else if (GistFollowRight(stack->page) ||
-		 stack->parent->lsn < GistPageGetNSN(stack->page))
+else if ((GistFollowRight(stack->page) ||
+		  stack->parent->lsn < GistPageGetNSN(stack->page)) &&
+		 GistPageIsDeleted(stack->page))
 {
 	/*
-	 * The page was split while we momentarily unlocked the
-	 * page. Go back to parent.
+	 * The page was split or deleted while we momentarily
+	 * unlocked the page. Go back to parent.
 	 */
 	UnlockReleaseBuffer(stack->buffer);
 	xlocked = false;
@@ -872,18 +871,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 }
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-UnlockReleaseBuffer(stack->buffer);
-xlocked = 

Re: GiST VACUUM

2019-06-25 Thread Thomas Munro
On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier  wrote:
> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> > I feel a little uncomfortable with the number 0x7fffffff right in the code.
>
> PG_INT32_MAX...

MaxTransactionId / 2?

-- 
Thomas Munro
https://enterprisedb.com




Re: GiST VACUUM

2019-06-25 Thread Michael Paquier
On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> I feel a little uncomfortable with the number 0x7fffffff right in the code.

PG_INT32_MAX...
--
Michael




Re: GiST VACUUM

2019-06-25 Thread Andrey Borodin
Hi!

Thanks for clarification, now I understand these patches better.

> On 25 June 2019, at 13:10, Heikki Linnakangas wrote:
> 
>> Also, I did not understand this optimization:
>> +/*
>> + * We can skip this if the page was deleted so long ago, that no scan
>> + * can possibly still see it, even in a standby. One measure might be
>> + * anything older than the table's frozen-xid, but we don't have that
>> + * at hand here. But anything older than 2 billion, from the next XID,
>> + * is surely old enough, because you would hit XID wraparound at that
>> + * point.
>> + */
>> +nextxid = ReadNextFullTransactionId();
>> +diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
>> +if (diff < 0x7fffffff)
>> +    return;
>> A standby can be lagging months behind the primary and, theoretically, close
>> the gap in one sudden WAL leap...
> It would still process the WAL one WAL record at a time, even if it's lagging 
> months behind. It can't just jump over 2 billion XIDs.
I feel a little uncomfortable with the number 0x7fffffff right in the code.

Thanks!

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-25 Thread Heikki Linnakangas

(Thanks for the reminder on this, Michael!)

On 05/04/2019 08:39, Andrey Borodin wrote:

On 4 April 2019, at 20:15, Heikki Linnakangas wrote:
I suggest that we do the attached. It fixes this for GiST. The
patch expands the "deletion XID" to 64 bits, and changes
where it's stored. Instead of storing it in pd_prune_xid, it's stored
in the page contents. Luckily, a deleted page has no real content.


So, we store the full xid right after the page header?


Yep.


+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+   Assert(PageIsEmpty(page));
+   ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+   *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at the invalid offset number)
untouched. I do not know whether it is done for a reason or not.
No. Take a look at the PageGetItemId() macro; it subtracts one from the 
offset number. But in any case, that's not really relevant here, because 
this patch stores the transaction id directly as the page content. There 
are no itemids at all on a deleted page.



Also, I did not understand this optimization:
+   /*
+    * We can skip this if the page was deleted so long ago, that no scan
+    * can possibly still see it, even in a standby. One measure might be
+    * anything older than the table's frozen-xid, but we don't have that
+    * at hand here. But anything older than 2 billion, from the next XID,
+    * is surely old enough, because you would hit XID wraparound at that
+    * point.
+    */
+   nextxid = ReadNextFullTransactionId();
+   diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+   if (diff < 0x7fffffff)
+       return;

A standby can be lagging months behind the primary and, theoretically, close
the gap in one sudden WAL leap...
It would still process the WAL one WAL record at a time, even if it's 
lagging months behind. It can't just jump over 2 billion XIDs.



Also, I think that the comparison sign should be >, not <.


Ah, good catch! And it shows that this needs more testing..

- Heikki




Re: GiST VACUUM

2019-06-24 Thread Michael Paquier
Heikki,

On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:
> I think we should fix this in a similar manner in B-tree, too, but that can
> be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is stored
> in the page opaque field.

This open item has been around for a couple of weeks already.  Are you
planning to tackle it soon?
--
Michael




Re: GiST VACUUM

2019-04-04 Thread Andrey Borodin
Hi!

> On 4 April 2019, at 20:15, Heikki Linnakangas wrote:
> 
> On 25/03/2019 15:20, Heikki Linnakangas wrote:
>> On 24/03/2019 18:50, Andrey Borodin wrote:
I was working on a new version of the gist check in amcheck and understood one 
>>> more thing:
>>> 
>>> /* Can this page be recycled yet? */
>>> bool
>>> gistPageRecyclable(Page page)
>>> {
>>>  return PageIsNew(page) ||
>>>  (GistPageIsDeleted(page) &&
>>>   TransactionIdPrecedes(GistPageGetDeleteXid(page), 
>>> RecentGlobalXmin));
>>> }
>>> 
>>> Here RecentGlobalXmin can wrap around and the page will become unrecyclable for 
>>> half of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
>>> InvalidTransactionId before doing RecordFreeIndexPage()?
>>> (Seems like the same applies to GIN...)
>> True, and B-tree has the same issue. I thought I saw a comment somewhere
>> in the B-tree code about that earlier, but now I can't find it. I
>> must've imagined it.
>> We could reset it, but that would require dirtying the page. That would
>> be just extra I/O overhead, if the page gets reused before XID
>> wraparound. We could avoid that if we stored the full XID+epoch, not
>> just XID. I think we should do that in GiST, at least, where this is
>> new. In the B-tree, it would require some extra code to deal with
>> backwards-compatibility, but maybe it would be worth it even there.
> 
> I suggest that we do the attached. It fixes this for GiST. The patch 
> expands the "deletion XID" to 64 bits, and changes where it's stored. Instead 
> of storing it in pd_prune_xid, it's stored in the page contents. Luckily, a 
> deleted page has no real content.

So, we store the full xid right after the page header?
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+   Assert(PageIsEmpty(page));
+   ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+   *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at the invalid offset number) untouched. I do 
not know whether it is done for a reason or not.


Also, I did not understand this optimization:
+   /*
+    * We can skip this if the page was deleted so long ago, that no scan
+    * can possibly still see it, even in a standby. One measure might be
+    * anything older than the table's frozen-xid, but we don't have that
+    * at hand here. But anything older than 2 billion, from the next XID,
+    * is surely old enough, because you would hit XID wraparound at that
+    * point.
+    */
+   nextxid = ReadNextFullTransactionId();
+   diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+   if (diff < 0x7fffffff)
+       return;

A standby can be lagging months behind the primary and, theoretically, close the gap 
in one sudden WAL leap... Also, I think that the comparison sign should be >, not 
<.


Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-04-04 Thread Heikki Linnakangas

On 25/03/2019 15:20, Heikki Linnakangas wrote:

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on a new version of the gist check in amcheck and understood one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
  return PageIsNew(page) ||
  (GistPageIsDeleted(page) &&
   TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)


True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.

We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.


I suggest that we do the attached. It fixes this for GiST. The patch 
expands the "deletion XID" to 64 bits, and changes where it's 
stored. Instead of storing it in pd_prune_xid, it's stored in the page 
contents. Luckily, a deleted page has no real content.
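
With a 64-bit deletion XID stored on the page, the recycle test can then compare full XIDs and is immune to wraparound. A sketch of what it could look like, assuming a helper that returns RecentGlobalXmin widened to 64 bits (such as the GetFullRecentGlobalXmin() discussed elsewhere in this thread):

/* Can this page be recycled yet?  (wraparound-safe sketch) */
bool
gistPageRecyclable(Page page)
{
    if (PageIsNew(page))
        return true;
    if (GistPageIsDeleted(page))
    {
        FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
        FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();

        /* Full XIDs never wrap, so this comparison stays valid forever. */
        return FullTransactionIdPrecedes(deletexid_full, recentxmin_full);
    }
    return false;
}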


I think we should fix this in a similar manner in B-tree, too, but that 
can be done separately. For B-tree, we need to worry about 
backwards-compatibility, but that seems simple enough; we just need to 
continue to understand old deleted pages, where the deletion XID is 
stored in the page opaque field.


- Heikki
>From b7897577c83a81ec04394ce7113d1d8a47804086 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c840..028b06b264 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -693,14 +693,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -719,9 +720,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -842,12 +840,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 * leaf/inner is enough to recognize split for root
 	 */
 }
-else if (GistFollowRight(stack->page) ||
-		 stack->parent->lsn < GistPageGetNSN(stack->page))
+else if ((GistFollowRight(stack->page) ||
+		  stack->parent->lsn < GistPageGetNSN(stack->page)) &&
+		 GistPageIsDeleted(stack->page))
 {
 	/*
-	 * The page was split while we momentarily unlocked the
-	 * page. Go back to parent.
+	 * The page was split or deleted while we momentarily
+	 * unlocked the page. Go back to parent.
 	 */
 	UnlockReleaseBuffer(stack->buffer);
 	xlocked = false;
@@ -856,18 +855,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 }
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-UnlockReleaseBuffer(stack->buffer);
-xlocked = false;
-state.stack = stack = stack->parent;
-continue;
-	

Re: GiST VACUUM

2019-03-25 Thread Heikki Linnakangas

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on a new version of the gist check in amcheck and understood one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
 return PageIsNew(page) ||
 (GistPageIsDeleted(page) &&
  TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)


True, and B-tree has the same issue. I thought I saw a comment somewhere 
in the B-tree code about that earlier, but now I can't find it. I 
must've imagined it.


We could reset it, but that would require dirtying the page. That would 
be just extra I/O overhead, if the page gets reused before XID 
wraparound. We could avoid that if we stored the full XID+epoch, not 
just XID. I think we should do that in GiST, at least, where this is 
new. In the B-tree, it would require some extra code to deal with 
backwards-compatibility, but maybe it would be worth it even there.


- Heikki



Re: GiST VACUUM

2019-03-24 Thread Andrey Borodin



> On 22 March 2019, at 17:03, Heikki Linnakangas wrote:
> 

I was working on a new version of the gist check in amcheck and understood one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
return PageIsNew(page) ||
(GistPageIsDeleted(page) &&
 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-22 Thread Andrey Borodin



> On 22 March 2019, at 19:37, Heikki Linnakangas wrote:
> 
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
> 
> And committed. Thanks Andrey!
> 
> - Heikki

Cool! Thank you very much! At the beginning I could not imagine how many caveats 
there are.

> On 22 March 2019, at 18:28, Heikki Linnakangas wrote:
> 
> * Currently, a search needs to scan all items on a page. If the keys are 
> small, that can be pretty slow. Divide each page further into e.g. 4 
> sub-pages, with a "bounding box" key for each sub-page, to speed up search.
BTW, I already have an intra-page indexing patch. But now it obviously needs a 
rebase :) Along with the gist amcheck patch.

Thanks!

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 22/03/2019 13:37, Heikki Linnakangas wrote:

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!


This caused the buildfarm to go pink... I was able to reproduce it, by 
running the regression tests in one terminal, and repeatedly running 
"VACUUM;" in another terminal. Strange that it seems to happen all the 
time on the buildfarm, but never happened to me locally.


Anyway, I'm investigating.

- Heikki




Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!

- Heikki



Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 22/03/2019 10:00, Andrey Borodin wrote:

On 22 March 2019, at 1:04, Heikki Linnakangas wrote:

PS. for Gist, we could almost use the LSN / NSN mechanism to detect
the case that a deleted page is reused: Add a new field to the GiST
page header, to store a new "deleteNSN" field. When a page is
deleted, the deleted page's deleteNSN is set to the LSN of the
deletion record. When the page is reused, the deleteNSN field is
kept unchanged. When you follow a downlink during search, if you
see that the page's deleteNSN > parent's LSN, you know that it was
concurrently deleted and recycled, and should be ignored. That
would allow reusing deleted pages immediately. Unfortunately that
would require adding a new field to the gist page header/footer,
which requires upgrade work :-(. Maybe one day, we'll bite the
bullet. Something to keep in mind, if we have to change the page
format anyway, for some reason.


Yeah, the same day we will get rid of invalid tuples. I can make a
patch for v13. Actually, I have a lot of patches that I want in GiST
in v13. Or v14.


Cool! Here's my wishlist:

* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are 
small, that can be pretty slow. Divide each page further into e.g. 4 
sub-pages, with a "bounding box" key for each sub-page, to speed up search.


- Heikki



Re: GiST VACUUM

2019-03-22 Thread Andrey Borodin


> On 22 March 2019, at 1:04, Heikki Linnakangas wrote:
> ...
> When I started testing this, I quickly noticed that empty pages were not 
> being deleted nearly as much as I expected. I tracked it to this check in 
> gistdeletepage:
> 
>> +   if (GistFollowRight(leafPage)
>> +   || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
>> +   {
>> +   /* Don't mess with a concurrent page split. */
>> +   return false;
>> +   }
> 
> That NSN test was bogus. It prevented the leaf page from being reused, if the 
> parent page was *ever* split after the leaf page was created. I don't see any 
> reason to check the NSN here.
That's true. This check made sense only when the parent page was not locked (and 
it seems like the comparison should be the opposite). When both pages are locked, this 
test makes no sense at all. The check was incorrectly "fixed" by me when transitioning 
from single-scan delete to two-scan delete during summer 2018. Just wondering 
how hard it is to simply delete a page...

>> Though, I'm not sure it is important for GIN. The scariest thing that can
>> happen: it will return the same tid twice. But it is doing a bitmap scan,
>> you cannot return the same bit twice...
> 
> Hmm. Could it return a completely unrelated tuple?
No, I do not think so; it will do comparisons on posting tree tuples.

> We don't always recheck the original index quals in a bitmap index scan, 
> IIRC. Also, a search might get confused if it's descending down a posting 
> tree, and lands on a different kind of a page, altogether.
Yes, the search will return an error, the user will reissue the query, and everything will 
be almost fine.

> PS. for Gist, we could almost use the LSN / NSN mechanism to detect the case 
> that a deleted page is reused: Add a new field to the GiST page header, to 
> store a new "deleteNSN" field. When a page is deleted, the deleted page's 
> deleteNSN is set to the LSN of the deletion record. When the page is reused, 
> the deleteNSN field is kept unchanged. When you follow a downlink during 
> search, if you see that the page's deleteNSN > parent's LSN, you know that it 
> was concurrently deleted and recycled, and should be ignored. That would 
> allow reusing deleted pages immediately. Unfortunately that would require 
> adding a new field to the gist page header/footer, which requires upgrade 
> work :-(. Maybe one day, we'll bite the bullet. Something to keep in mind, if 
> we have to change the page format anyway, for some reason.
Yeah, the same day we will get rid of invalid tuples. I can make a patch for 
v13. Actually, I have a lot of patches that I want in GiST in v13. Or v14.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-21 Thread Heikki Linnakangas

On 21/03/2019 18:06, Andrey Borodin wrote:

On 21 March 2019, at 2:30, Heikki Linnakangas wrote:
one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a
deleted page is recycled, to prevent the deletion from being
replayed too early in the hot standby. See _bt_getbuf() and
btree_xlog_reuse_page(). I think we need to do something similar in
GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making
the changes is pretty straightforward, but it's a bit cumbersome to
test.


I've tried to deal with it and got stuck...


So, I came up with the attached. I basically copy-pasted the page-reuse 
WAL-logging stuff from nbtree.


When I started testing this, I quickly noticed that empty pages were not 
being deleted nearly as much as I expected. I tracked it to this check 
in gistdeletepage:



+   if (GistFollowRight(leafPage)
+   || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+   {
+   /* Don't mess with a concurrent page split. */
+   return false;
+   }


That NSN test was bogus. It prevented the leaf page from being reused, 
if the parent page was *ever* split after the leaf page was created. I 
don't see any reason to check the NSN here. The NSN is normally used to 
detect if a (leaf) page has been concurrently split, when you descend 
the tree. We don't need to care about that here; as long as the 
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can 
find the downlink and the page is empty, we can delete it.
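
Conceptually, the remaining guard in gistdeletepage() then only needs something like the following (a hypothetical sketch of the conditions, not the actual patch code):

    /* Hypothetical sketch: is it safe to delete this empty leaf now? */
    if (GistFollowRight(leafPage))
        return false;       /* split not completed; no downlink in the parent yet */
    if (PageGetMaxOffsetNumber(leafPage) > 0)
        return false;       /* somebody inserted into it; not empty anymore */
    /* otherwise, find the downlink on the locked parent and mark the leaf deleted */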


After removing that bogus NSN check, page reuse became much more 
effective. I've been testing this by running this test script repeatedly:


--
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/

insert into gist_point_tbl (id, p)
   select nextval('gist_test_seq'), point(nextval('gist_test_seq'), 
1000 + g) from generate_series(1, 1) g;


delete from gist_point_tbl where id < currval('gist_test_seq') - 2;
vacuum gist_point_tbl;

select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
--

It inserts a bunch of rows, deletes a bunch of older rows, and vacuums. 
The interesting thing here is that the key values keep "moving", so that 
new tuples are added to different places than where old ones are 
removed. That's the case where page reuse is needed.


Before this patch, the index bloated very quickly. With the patch, it 
still bloats, because we still don't delete internal pages. But it's a 
small fraction of the bloat you got before.


Attached is the latest patch version, to be applied on top of the 
IntegerSet patch.



I think we should make the B-tree WAL record for this shared, remove
BlockNumber and other unused stuff, leaving just the xid and db oid. And
reuse this record for B-tree, GiST and GIN (yeah, it is not checking
for that conflict).
Good point. For now, I didn't try to generalize this, but perhaps we 
should.



Though, I'm not sure it is important for GIN. The scariest thing that can
happen: it will return the same tid twice. But it is doing a bitmap scan,
you cannot return the same bit twice...


Hmm. Could it return a completely unrelated tuple? We don't always 
recheck the original index quals in a bitmap index scan, IIRC. Also, a 
search might get confused if it's descending down a posting tree, and 
lands on a different kind of a page, altogether.


Alexander, you added the mechanism to GIN recently, to prevent pages 
from being reused too early (commit 52ac6cd2d0). Do we need something 
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug 
from happening in hot standby?



PS. for Gist, we could almost use the LSN / NSN mechanism to detect the 
case that a deleted page is reused: Add a new field to the GiST page 
header, to store a new "deleteNSN" field. When a page is deleted, the 
deleted page's deleteNSN is set to the LSN of the deletion record. When 
the page is reused, the deleteNSN field is kept unchanged. When you 
follow a downlink during search, if you see that the page's deleteNSN > 
parent's LSN, you know that it was concurrently deleted and recycled, 
and should be ignored. That would allow reusing deleted pages 
immediately. Unfortunately that would require adding a new field to the 
gist page header/footer, which requires upgrade work :-(. Maybe one day, 
we'll bite the bullet. Something to keep in mind, if we have to change 
the page format anyway, for some reason.
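
As a hypothetical sketch of that idea, the descent code could check the new field when following a downlink; GistPageGetDeleteNSN() here is an imagined accessor for the proposed field, not existing code:

    /*
     * Hypothetical: the page we landed on was deleted (and possibly already
     * recycled) after we read its downlink in the parent.  The deletion
     * record's LSN was stamped into the page as deleteNSN and survives reuse,
     * so it being newer than the parent's LSN tells us to ignore this page.
     */
    if (GistPageGetDeleteNSN(page) > parentlsn)
    {
        UnlockReleaseBuffer(buffer);
        /* back off to the parent and rechoose the child */
    }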


- Heikki
>From d7a77ad483251b62514778d2235516f6f9237ad7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan GiST two times. At first pass we make note of empty
pages and internal pages. At second pass we scan through internal pages
looking for 

Re: GiST VACUUM

2019-03-21 Thread Andrey Borodin



> On 21 March 2019, at 2:30, Heikki Linnakangas wrote:
> one remaining issue that needs to be fixed:
> 
> During Hot Standby, the B-tree code writes a WAL record, when a deleted 
> page is recycled, to prevent the deletion from being replayed too early 
> in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
> think we need to do something similar in GiST.
> 
> I'll try fixing that tomorrow, unless you beat me to it. Making the 
> changes is pretty straightforward, but it's a bit cumbersome to test.

I've tried to deal with it and got stuck... I think we should make the B-tree WAL 
record for this shared, remove BlockNumber and other unused stuff, leaving just 
the xid and db oid.
And reuse this record for B-tree, GiST and GIN (yeah, it is not checking for 
that conflict).

Though, I'm not sure it is important for GIN. The scariest thing that can happen: 
it will return the same tid twice. But it is doing a bitmap scan, you cannot return 
the same bit twice...

Eventually, hash, spgist and others will have the same thing too.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-20 Thread Heikki Linnakangas

On 15/03/2019 20:25, Andrey Borodin wrote:

On 11 March 2019, at 20:03, Heikki Linnakangas wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

One thing still bothers me. Let's assume that we have an internal
page with 2 deletable leaves. We lock these leaves in the order of the
items on the internal page. Is it possible that the 2nd page has a
follow-right link to the 1st, and someone will lock the 2nd page, try to
lock the 1st, and deadlock with VACUUM?


Hmm. If the follow-right flag is set on a page, it means that its
right sibling doesn't have a downlink in the parent yet.
Nevertheless, I think I'd sleep better, if we acquired the locks in
left-to-right order, to be safe.

>

Actually, I did not find lock coupling in the GiST code. But I decided
to lock just two pages at once (leaf, then parent, for every pair).
PFA v22 with this concurrency logic.


Good. I just noticed that the README actually does say explicitly that 
the child must be locked before the parent.


I rebased this over the new IntegerSet implementation, from the other 
thread, and did another round of refactoring, cleanups, etc. Attached is 
a new version of this patch. I'm also including the IntegerSet patch 
here, for convenience, but it's the same patch I posted at [1].


It's in pretty good shape, but there's one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a deleted 
page is recycled, to prevent the deletion from being replayed too early 
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
think we need to do something similar in GiST.


I'll try fixing that tomorrow, unless you beat me to it. Making the 
changes is pretty straightforward, but it's a bit cumbersome to test.


[1] 
https://www.postgresql.org/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb...@iki.fi


- Heikki
>From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manully, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb7...@iki.fi
---
 src/backend/lib/Makefile  |2 +-
 src/backend/lib/README|2 +
 src/backend/lib/integerset.c  | 1039 +
 src/include/lib/integerset.h  |   25 +
 src/test/modules/Makefile |1 +
 src/test/modules/test_integerset/.gitignore   |4 +
 src/test/modules/test_integerset/Makefile |   21 +
 src/test/modules/test_integerset/README   |7 +
 .../expected/test_integerset.out  |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |8 +
 .../modules/test_integerset/test_integerset.c |  622 ++
 .../test_integerset/test_integerset.control   |4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+   ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- 

Re: GiST VACUUM

2019-03-20 Thread Heikki Linnakangas

On 15/03/2019 20:25, Andrey Borodin wrote:

On 11 March 2019, at 20:03, Heikki Linnakangas wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

One thing still bothers me. Let's assume that we have an internal
page with 2 deletable leaves. We lock these leaves in the order of the
items on the internal page. Is it possible that the 2nd page has a
follow-right link to the 1st, and someone will lock the 2nd page, try to
lock the 1st, and deadlock with VACUUM?


Hmm. If the follow-right flag is set on a page, it means that its
right sibling doesn't have a downlink in the parent yet.
Nevertheless, I think I'd sleep better, if we acquired the locks in
left-to-right order, to be safe.

>

Actually, I did not find lock coupling in the GiST code. But I decided
to lock just two pages at once (leaf, then parent, for every pair).
PFA v22 with this concurrency logic.


Good. I just noticed that the README actually does say explicitly that 
the child must be locked before the parent.


I rebased this over the new IntegerSet implementation, from the other 
thread, and did another round of refactoring, cleanups, etc. Attached is 
a new version of this patch. I'm also including the IntegerSet patch 
here, for convenience, but it's the same patch I posted at [1].


It's in pretty good shape, but there's one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a deleted 
page is recycled, to prevent the deletion from being replayed too early 
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
think we need to do something similar in GiST.


I'll try fixing that tomorrow, unless you beat me to it. Making the 
changes is pretty straightforward, but it's a bit cumbersome to test.


[1] 
https://www.postgresql.org/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb...@iki.fi


- Heikki
>From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manully, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb7...@iki.fi
---
 src/backend/lib/Makefile  |2 +-
 src/backend/lib/README|2 +
 src/backend/lib/integerset.c  | 1039 +
 src/include/lib/integerset.h  |   25 +
 src/test/modules/Makefile |1 +
 src/test/modules/test_integerset/.gitignore   |4 +
 src/test/modules/test_integerset/Makefile |   21 +
 src/test/modules/test_integerset/README   |7 +
 .../expected/test_integerset.out  |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |8 +
 .../modules/test_integerset/test_integerset.c |  622 ++
 .../test_integerset/test_integerset.control   |4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+   ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- 

Re: GiST VACUUM

2019-03-15 Thread Jeff Janes
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas  wrote:

> On 05/03/2019 02:26, Andrey Borodin wrote:
> >> I also tried your amcheck tool with this. It did not report any
> >> errors.
> >>
> >> Attached is also latest version of the patch itself. It is the
> >> same as your latest patch v19, except for some tiny comment
> >> kibitzing. I'll mark this as Ready for Committer in the commitfest
> >> app, and will try to commit it in the next couple of days.
> >
> > That's cool! I'll work on 2nd step of these patchset to make
> > blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!
>

Thank you.  This is a transformational change; it will allow GiST indexes
larger than RAM to be used in some cases where they were simply not
feasible to use before.  On a HDD, it resulted in a 50 fold improvement in
vacuum time, and the machine went from unusably unresponsive to merely
sluggish during the vacuum.  On a SSD (albeit a very cheap laptop one, and
exposed from Windows host to Ubuntu over VM Virtual Box) it is still a 30
fold improvement, from a far faster baseline.  Even on an AWS instance with
a "GP2" SSD volume, which normally shows little benefit from sequential
reads, I get a 3 fold speed up.

I also ran this through a lot of crash-recovery testing using simulated
torn-page writes using my traditional testing harness with high concurrency
(AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and
did not encounter any problems.  I tested both with btree_gist on a scalar
int, and on tsvector with each tsvector having 101 tokens.

I did notice that the space freed up in the index by vacuum doesn't seem to
get re-used very efficiently, but that is an ancestral problem independent
of this change.

Cheers,

Jeff


Re: GiST VACUUM

2019-03-15 Thread Andrey Borodin
Hi!

> On 11 Mar 2019, at 20:03, Heikki Linnakangas wrote:
> 
> On 10/03/2019 18:40, Andrey Borodin wrote:
>> Here's new version of the patch for actual page deletion.
>> Changes:
>> 1. Fixed possible concurrency issue:
>> We were locking child page while holding lock on internal page. Notes
>> in GiST README recommend locking child before parent. Thus now we
>> unlock internal page before locking children for deletion, and lock
>> it again, check that everything is still on it's place and delete
>> only then.
> 
> Good catch. The implementation is a bit broken, though. This segfaults:
Sorry for the noise. A few lines of old code leaked into the patch between 
testing and mailing.

> BTW, we don't seem to have test coverage for this in the regression tests. 
> The tests that delete some rows, where you updated the comment, doesn't 
> actually seem to produce any empty pages that could be deleted.
I've updated the test to delete more rows. But it did not trigger the previous 
bug anyway; we had to delete everything for that case.

> 
>> One thing still bothers me. Let's assume that we have internal page
>> with 2 deletable leaves. We lock these leaves in order of items on
>> internal page. Is it possible that 2nd page have follow-right link on
>> 1st and someone will lock 2nd page, try to lock 1st and deadlock with
>> VACUUM?
> 
> Hmm. If the follow-right flag is set on a page, it means that its right 
> sibling doesn't have a downlink in the parent yet. Nevertheless, I think I'd 
> sleep better, if we acquired the locks in left-to-right order, to be safe.
Actually, I did not find lock coupling in the GiST code. But I decided to lock 
just two pages at once (leaf, then parent, for every pair). PFA v22 with this 
concurrency logic.

> 
> An easy cop-out would be to use LWLockConditionalAcquire, and bail out if we 
> can't get the lock. But if it's not too complicated, it'd be nice to acquire 
> the locks in the correct order to begin with.
> 
> I did a round of cleanup and moving things around, before I bumped into the 
> above issue. I'm including them here as separate patches, for easier review, 
> but they should of course be squashed into yours. For convenience, I'm 
> including your patches here as well, so that all the patches you need are in 
> the same place, but they are identical to what you posted.
I've rebased all these patches so that VACUUM should work on every commit. 
Let's just squash them for the next iteration; it was quite a messy rebase.
BTW, you deleted numEmptyPages; this makes the code cleaner, but that variable 
could stop gistvacuum_recycle_pages() early, once everything is recycled already.


Thanks!

Best regards, Andrey Borodin.



0001-Add-block-set-data-structure-v22.patch
Description: Binary data


0002-Delete-pages-during-GiST-VACUUM-v22.patch
Description: Binary data


0003-Minor-cleanup-v22.patch
Description: Binary data


0004-Move-the-page-deletion-logic-to-separate-function-v2.patch
Description: Binary data


0005-Remove-numEmptyPages-.-it-s-not-really-needed-v22.patch
Description: Binary data


0006-Misc-cleanup-v22.patch
Description: Binary data


Re: GiST VACUUM

2019-03-11 Thread Heikki Linnakangas

On 10/03/2019 18:40, Andrey Borodin wrote:

Here's new version of the patch for actual page deletion.
Changes:

1. Fixed possible concurrency issue:

We were locking child page while holding lock on internal page. Notes
in GiST README recommend locking child before parent. Thus now we
unlock internal page before locking children for deletion, and lock
it again, check that everything is still on it's place and delete
only then.


Good catch. The implementation is a bit broken, though. This segfaults:

create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);

insert into gist_point_tbl (id, p)
  select g,  point((1+g) % 100, (1+g) % 100) from generate_series(1, 
1) g;

insert into gist_point_tbl (id, p)
  select -g, point(1000, 1000) from generate_series(1, 1) g;

delete from gist_point_tbl where id >= 0;
vacuum verbose gist_point_tbl;

BTW, we don't seem to have test coverage for this in the regression 
tests. The tests that delete some rows, where you updated the comment, 
don't actually seem to produce any empty pages that could be deleted.



One thing still bothers me. Let's assume that we have internal page
with 2 deletable leaves. We lock these leaves in order of items on
internal page. Is it possible that 2nd page have follow-right link on
1st and someone will lock 2nd page, try to lock 1st and deadlock with
VACUUM?


Hmm. If the follow-right flag is set on a page, it means that its right 
sibling doesn't have a downlink in the parent yet. Nevertheless, I think 
I'd sleep better, if we acquired the locks in left-to-right order, to be 
safe.


An easy cop-out would be to use LWLockConditionalAcquire, and bail out 
if we can't get the lock. But if it's not too complicated, it'd be nice 
to acquire the locks in the correct order to begin with.
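
For illustration, a minimal sketch of that cop-out, assuming the caller 
already holds a pin on the candidate leaf buffer (the helper name is made 
up here, it is not from the attached patches):

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: try to take the exclusive content lock on a
 * candidate leaf without waiting, and tell the caller to skip the leaf
 * if somebody else already holds the lock.
 */
static bool
try_lock_leaf_for_deletion(Buffer leafbuf)
{
	/* ConditionalLockBuffer grabs the exclusive lock only if it is free */
	if (!ConditionalLockBuffer(leafbuf))
		return false;		/* bail out; leave this leaf for the next VACUUM */

	return true;			/* caller now holds the exclusive lock */
}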


I did a round of cleanup and moving things around, before I bumped into 
the above issue. I'm including them here as separate patches, for easier 
review, but they should of course be squashed into yours. For 
convenience, I'm including your patches here as well, so that all the 
patches you need are in the same place, but they are identical to what 
you posted.


- Heikki
>From a09f14de32f3040b19238d239b7f025f345e940b Mon Sep 17 00:00:00 2001
From: Andrey 
Date: Fri, 8 Mar 2019 21:19:32 +0500
Subject: [PATCH 1/6] Add block set data structure

Currently we have Bitmapset which works only with 32 bit ints.
Furthermore it is not very space-efficient in case of sparse
bitmap.

In this commit we invent the block set: a statically typed radix tree
for keeping BlockNumbers. This can still be improved in many ways
applicable to bitmaps, most notably in-pointer lists could be used.
But for the sake of code simplicity it is plain in its structure for now.

The block set is introduced to support efficient page deletion in
GiST's VACUUM.
---
 src/backend/lib/Makefile  |   4 +-
 src/backend/lib/blockset.c| 201 ++
 src/include/lib/blockset.h|  24 +++
 src/test/modules/test_blockset/.gitignore |   4 +
 src/test/modules/test_blockset/Makefile   |  21 ++
 src/test/modules/test_blockset/README |   8 +
 .../test_blockset/expected/test_blockset.out  |   7 +
 .../test_blockset/sql/test_blockset.sql   |   3 +
 .../test_blockset/test_blockset--1.0.sql  |   8 +
 .../modules/test_blockset/test_blockset.c | 182 
 .../test_blockset/test_blockset.control   |   4 +
 11 files changed, 464 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/blockset.c
 create mode 100644 src/include/lib/blockset.h
 create mode 100644 src/test/modules/test_blockset/.gitignore
 create mode 100644 src/test/modules/test_blockset/Makefile
 create mode 100644 src/test/modules/test_blockset/README
 create mode 100644 src/test/modules/test_blockset/expected/test_blockset.out
 create mode 100644 src/test/modules/test_blockset/sql/test_blockset.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset--1.0.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset.c
 create mode 100644 src/test/modules/test_blockset/test_blockset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..9601894f07c 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o blockset.o bloomfilter.o dshash.o \
+   hyperloglog.o ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/blockset.c b/src/backend/lib/blockset.c
new file mode 100644
index 000..c07b2974b1e
--- /dev/null
+++ b/src/backend/lib/blockset.c

Re: GiST VACUUM

2019-03-10 Thread Andrey Borodin

> On 5 Mar 2019, at 18:21, Heikki Linnakangas wrote:
> 
> On 05/03/2019 02:26, Andrey Borodin wrote:
>> 
>> That's cool! I'll work on 2nd step of these patchset to make
>> blockset data structure prettier and less hacky.
> 
> Committed the first patch. Thanks for the patch!

That's cool! Thanks!

> I'll change the status of this patch to "Waiting on Author", to reflect the 
> state of the 2nd patch, since you're working on the radix tree blockset stuff.

Here's new version of the patch for actual page deletion.
Changes:
1. Fixed possible concurrency issue:
We were locking the child page while holding a lock on the internal page. Notes 
in the GiST README recommend locking the child before the parent.
Thus we now unlock the internal page before locking children for deletion, then 
lock it again, check that everything is still in its place, and only then delete.
One thing still bothers me. Let's assume that we have an internal page with 2 
deletable leaves. We lock these leaves in the order of the items on the internal 
page. Is it possible that the 2nd page has a follow-right link to the 1st, and 
someone will lock the 2nd page, try to lock the 1st, and deadlock with VACUUM?
2. Added radix-tree-based block set to lib, with tests.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM.patch
Description: Binary data


0001-Add-block-set-data-structure.patch
Description: Binary data


Re: GiST VACUUM

2019-03-05 Thread Heikki Linnakangas

On 05/03/2019 02:26, Andrey Borodin wrote:
I also tried your amcheck tool with this. It did not report any 
errors.


Attached is also latest version of the patch itself. It is the
same as your latest patch v19, except for some tiny comment
kibitzing. I'll mark this as Ready for Committer in the commitfest
app, and will try to commit it in the next couple of days.


That's cool! I'll work on 2nd step of these patchset to make
blockset data structure prettier and less hacky.


Committed the first patch. Thanks for the patch!

I did some last minute copy-editing on the comments, and fixed one 
little thing that we missed earlier: the check for "invalid tuples" that 
might be left over in pg_upgraded pre-9.1 indexes, was lost at some 
point. I added that check back. (It would be nice if GiST indexes had a 
metadata page with a version number, so we could avoid that work in the 
99% of cases that that check is not needed, but that's a different story.)


I'll change the status of this patch to "Waiting on Author", to reflect 
the state of the 2nd patch, since you're working on the radix tree 
blockset stuff.


- Heikki



Re: GiST VACUUM

2019-03-04 Thread Andrey Borodin
Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together 
seems so obvious now, but it never actually crossed my mind before.

> On 4 Mar 2019, at 18:04, Heikki Linnakangas wrote:
> Thanks! As I noted at 
> https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi,
>  the test patch left the index corrupt. I fixed it so that it leaves behind 
> incompletely split pages, without the corruption, see attached. It's similar 
> to yours, but let me recap what it does:
> 
> * Hacks gistbuild(), create 100 empty pages immediately after the root pages. 
> They are leaked, so they won't be reused until the a VACUUM sees them and 
> puts them to the FSM
> 
> * Hacks gistinserttuples(), to leave the split incompleted with 50% 
> probability
> 
> * In vacuum, print a line to the log whenever it needs to "jump left"
> 
> I used this patch, with the attached test script that's similar to yours, but 
> it also tries to verify that the index returns correct results. It prints a 
> result set like this:
> 
>   sum
> -
> -364450
>  364450
> (2 rows)
> 
> If the two numbers add up to 0, the index seems valid. And you should see 
> "RESCAN" lines in the log, to show that jumping left happened. Because the 
> behavior is random and racy, you may need to run the script many times to see 
> both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. 
> Especially the "FollowRight" case happens less frequently than the NSN case, 
> you might need to run the script > 10 times to see it.
Great! I've repeated your tests on my machine; I observe similar frequencies for 
the causes of rescan left jumps.

> I also tried your amcheck tool with this. It did not report any errors.
> 
> Attached is also latest version of the patch itself. It is the same as your 
> latest patch v19, except for some tiny comment kibitzing. I'll mark this as 
> Ready for Committer in the commitfest app, and will try to commit it in the 
> next couple of days.

That's cool! I'll work on the 2nd step of this patchset to make the blockset 
data structure prettier and less hacky.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 02:47, Andrey Borodin wrote:

On 2 Jan 2019, at 20:35, Heikki Linnakangas wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We 
discussed the need for testing earlier. Did you write some test scripts for 
this, or how have you been testing?

Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight GiST sometimes forget to clear follow-right marker simulating crash of an 
insert. This fills logs with "fixing incomplete split" messages. Search for "REMOVE 
THIS" to disable these ill-behavior triggers.
To trigger NSN jump GiST allocate empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).


Thanks! As I noted at 
https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, 
the test patch left the index corrupt. I fixed it so that it leaves 
behind incompletely split pages, without the corruption, see attached. 
It's similar to yours, but let me recap what it does:


* Hacks gistbuild() to create 100 empty pages immediately after the root 
page. They are leaked, so they won't be reused until a VACUUM sees 
them and puts them in the FSM


* Hacks gistinserttuples() to leave the split incomplete with 50% 
probability


* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to 
yours, but it also tries to verify that the index returns correct 
results. It prints a result set like this:


   sum
-
 -364450
  364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should 
see "RESCAN" lines in the log, to show that jumping left happened. 
Because the behavior is random and racy, you may need to run the script 
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED 
BY FollowRight" cases. Especially the "FollowRight" case happens less 
frequently than the NSN case, you might need to run the script > 10 
times to see it.


I also tried your amcheck tool with this. It did not report any errors.

Attached is also latest version of the patch itself. It is the same as 
your latest patch v19, except for some tiny comment kibitzing. I'll mark 
this as Ready for Committer in the commitfest app, and will try to 
commit it in the next couple of days.


- Heikki


gist-vacuum-test.sh
Description: application/shellscript
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..cad4a2a46e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1187,6 +1187,8 @@ gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
 			InvalidBuffer, InvalidBuffer, false, false);
 }
 
+static bool HACK = false;
+
 /* 
  * An extended workhorse version of gistinserttuple(). This version allows
  * inserting multiple tuples, or replacing a single tuple with multiple tuples.
@@ -1230,6 +1232,14 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
 
 	/* Insert the tuple(s) to the page, splitting the page if necessary */
+	if (BufferIsValid(leftchild) && HACK)
+	{
+		/* skip actually inserting */
+		splitinfo = NULL;
+		is_split = false;
+	}
+	else
+	{
 	is_split = gistplacetopage(state->r, state->freespace, giststate,
 			   stack->buffer,
 			   tuples, ntup,
@@ -1238,6 +1248,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 			   &splitinfo,
 			   true,
 			   state->heapRel);
+	}
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1256,7 +1267,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	 * didn't have to split, release it ourselves.
 	 */
 	if (splitinfo)
+	{
+		if (random() % 2 == 0)
+			HACK = true;
 		gistfinishsplit(state, stack, giststate, splitinfo, unlockbuf);
+		HACK = false;
+	}
 	else if (unlockbuf)
 		LockBuffer(stack->buffer, GIST_UNLOCK);
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index bd142a3560..fdfc54b009 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -201,6 +201,9 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	for (int i = 0; i < 100; i++)
+		ReleaseBuffer(ReadBuffer(index, P_NEW));
+
 	/*
 	 * Do the heap scan.
 	 */
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 1cbcb038f7..c306621e1e 100644
--- 

Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 21:26, Andrey Borodin wrote:

On 3 Jan 2019, at 23:47, Andrey Borodin wrote:



* Bitmapset stores 32 bit signed integers, but BlockNumber is
unsigned. So this will fail with an index larger than 2^31
blocks. That's perhaps academical, I don't think anyone will try
to create a 16 TB GiST index any time soon. But it feels wrong to
introduce an arbitrary limitation like that.


Looks like bitmapset is unsuitable for collecting block numbers due
to the interface. Let's just create custom container for this? Or
there is one other option: for each block number throw away sign
bit and consider page potentially internal and potentially empty
leaf if (blkno & 0x7FF) is in bitmapset. And then we will have
to create at least one 17Tb GiST to check it actually works.


Heikki, how do you think, is implementing our own radix tree for this
is viable solution? I've written working implementation with 4-level
statically typed tree. If we follow this route, probably, there must
be tests for this data structure.


Yeah, seems reasonable.

- Heikki



Re: GiST VACUUM

2019-01-04 Thread Andrey Borodin
On 3 Jan 2019, at 23:47, Andrey Borodin wrote:
> 
>> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So 
>> this will fail with an index larger than 2^31 blocks. That's perhaps 
>> academical, I don't think anyone will try to create a 16 TB GiST index any 
>> time soon. But it feels wrong to introduce an arbitrary limitation like that.
> Looks like bitmapset is unsuitable for collecting block numbers due to the 
> interface. Let's just create custom container for this?
> Or there is one other option: for each block number throw away sign bit and 
> consider page potentially internal and potentially empty leaf if (blkno & 
> 0x7FF) is in bitmapset.
> And then we will have to create at least one 17Tb GiST to check it actually 
> works.

Heikki, what do you think: is implementing our own radix tree a viable
solution for this? I've written a working implementation with a 4-level
statically typed tree. If we follow this route, there should probably be
tests for this data structure.
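
For illustration only, here is a minimal sketch of what such a statically 
typed, 4-level tree could look like (this is an assumption about the shape, 
not the attached radix_tree.diff):

#include "postgres.h"
#include "storage/block.h"

/* 8 bits of the BlockNumber per level, 256-bit bitmap at the leaf level */
typedef struct BlockSetLeaf { uint64 words[4]; } BlockSetLeaf;
typedef struct BlockSetL2 { BlockSetLeaf *children[256]; } BlockSetL2;
typedef struct BlockSetL3 { BlockSetL2 *children[256]; } BlockSetL3;
typedef struct BlockSet { BlockSetL3 *children[256]; } BlockSet;

static void
blockset_set(BlockSet *set, BlockNumber blkno)
{
	BlockSetL3 *l3 = set->children[(blkno >> 24) & 0xFF];
	BlockSetL2 *l2;
	BlockSetLeaf *leaf;

	if (l3 == NULL)
		l3 = set->children[(blkno >> 24) & 0xFF] = palloc0(sizeof(BlockSetL3));
	l2 = l3->children[(blkno >> 16) & 0xFF];
	if (l2 == NULL)
		l2 = l3->children[(blkno >> 16) & 0xFF] = palloc0(sizeof(BlockSetL2));
	leaf = l2->children[(blkno >> 8) & 0xFF];
	if (leaf == NULL)
		leaf = l2->children[(blkno >> 8) & 0xFF] = palloc0(sizeof(BlockSetLeaf));
	leaf->words[(blkno & 0xFF) >> 6] |= UINT64CONST(1) << (blkno & 0x3F);
}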

Best regards, Andrey Borodin.


radix_tree.diff
Description: Binary data


Re: GiST VACUUM

2019-01-03 Thread Andrey Borodin
Cool, thanks!

> On 2 Jan 2019, at 20:35, Heikki Linnakangas wrote:
> 
> In patch #1, to do the vacuum scan in physical order:
> ...
> I think this is ready to be committed, except that I didn't do any testing. 
> We discussed the need for testing earlier. Did you write some test scripts 
> for this, or how have you been testing?
Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch


0002-Test-left-jumps-v18.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
Description: Binary data

To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, 
simulating a crash of an insert. This fills the logs with "fixing incomplete split" 
messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).

> 
> Patch #2:

> 
> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So 
> this will fail with an index larger than 2^31 blocks. That's perhaps 
> academical, I don't think anyone will try to create a 16 TB GiST index any 
> time soon. But it feels wrong to introduce an arbitrary limitation like that.
Looks like bitmapset is unsuitable for collecting block numbers due to the 
interface. Let's just create custom container for this?
Or there is one other option: for each block number throw away sign bit and 
consider page potentially internal and potentially empty leaf if (blkno & 
0x7FF) is in bitmapset.
And then we will have to create at least one 17Tb GiST to check it actually 
works.

> * I was surprised that the bms_make_empty() function doesn't set the 'nwords' 
> field to match the allocated size. Perhaps that was intentional, so that you 
> don't need to scan the empty region at the end, when you scan through all 
> matching bits? Still, seems surprising, and needs a comment at least.
I explicitly set nwords to zero. Looking at the code now, I do not think this 
nwords==0 is a very good idea. It's probably efficient, but it's hacky and 
creates implicit expectations in the code.

> * We're now scanning all internal pages in the 2nd phase. Not only those 
> internal pages that contained downlinks to empty leaf pages. That's probably 
> OK, but the comments need updating on that.
Adjusted comments. That if-statement before the loop
> if (vstate.emptyPages > 0)
seems superfluous, but I kept it until we solve the problem with the 31-bit 
Bitmapset.
> * In general, comments need to be fixed and added in several places. For 
> example, there's no clear explanation on what the "delete XID", stored in 
> pd_prune_xid, means. (I know what it is, because I'm familiar with the same 
> mechanism in B-tree, but it's not clear from the code itself.)
I've added a comment to GistPageSetDeleteXid().

* In this check
> if (GistPageIsDeleted(page) && 
> TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
I've switched from RecentGlobalDataXmin to RecentGlobalXmin, because we have 
done the same in the similar mechanism for GIN (for uniformity with B-tree).


Thanks for working on this!


Best regards, Andrey Borodin.




0002-Delete-pages-during-GiST-VACUUM-v19.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v19.patch
Description: Binary data


Re: GiST VACUUM

2019-01-02 Thread Heikki Linnakangas

On 02/01/2019 17:35, Heikki Linnakangas wrote:

On 28/10/2018 19:32, Andrey Borodin wrote:

Hi everyone!


On 2 Oct 2018, at 6:14, Michael Paquier wrote:
Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.


I'm doing preps for CF.
Here's rebased version.


Thanks, I had another look at these.


Here are new patch versions, with the fixes I mentioned. Forgot to 
attach these earlier.


- Heikki
>From 46456dbf1d07aaf3e6963035a02aaa060decace3 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sun, 28 Oct 2018 22:20:58 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v18-heikki

---
 src/backend/access/gist/gist.c   |   8 +-
 src/backend/access/gist/gistvacuum.c | 430 +++
 2 files changed, 247 insertions(+), 191 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index a2cb84800e8..d42e810c6b3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -38,7 +38,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 GISTSTATE *giststate, List *splitinfo, bool releasebuf);
-static void gistvacuumpage(Relation rel, Page page, Buffer buffer,
+static void gistprunepage(Relation rel, Page page, Buffer buffer,
 			   Relation heapRel);
 
 
@@ -261,7 +261,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 */
 	if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
 	{
-		gistvacuumpage(rel, page, buffer, heapRel);
+		gistprunepage(rel, page, buffer, heapRel);
 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
 	}
 
@@ -1544,11 +1544,11 @@ freeGISTstate(GISTSTATE *giststate)
 }
 
 /*
- * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ * gistprunepage() -- try to remove LP_DEAD items from the given page.
  * Function assumes that buffer is exclusively locked.
  */
 static void
-gistvacuumpage(Relation rel, Page page, Buffer buffer, Relation heapRel)
+gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 {
 	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c779..4fb32bf76bf 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
+	 * still need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = 

Re: GiST VACUUM

2019-01-02 Thread Heikki Linnakangas

On 28/10/2018 19:32, Andrey Borodin wrote:

Hi everyone!


On 2 Oct 2018, at 6:14, Michael Paquier wrote:
Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.


I'm doing preps for CF.
Here's rebased version.


Thanks, I had another look at these.

In patch #1, to do the vacuum scan in physical order:

* The starting NSN was not acquired correctly for unlogged and temp 
relations. They don't use WAL, so their NSN values are based on the 
'unloggedLSN' counter, rather than the current WAL insert pointer. So we must 
use gistGetFakeLSN() rather than GetInsertRecPtr() for them. Fixed that. 
(A sketch of the distinction follows after this list.)


* Adjusted comments a little bit, mostly by copy-pasting the 
better-worded comments from the corresponding nbtree code, and ran pgindent.
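
For illustration, a minimal sketch of that first point; the helper name is 
an assumption, not the committed code:

#include "postgres.h"
#include "access/gist_private.h"
#include "access/xlog.h"
#include "utils/rel.h"

/* Pick the starting NSN for the vacuum scan. */
static GistNSN
vacuum_start_nsn(Relation rel)
{
	if (RelationNeedsWAL(rel))
		return GetInsertRecPtr();	/* WAL-logged: current WAL insert pointer */
	else
		return gistGetFakeLSN(rel);	/* unlogged/temp: based on the fake LSN counter */
}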


I think this is ready to be committed, except that I didn't do any 
testing. We discussed the need for testing earlier. Did you write some 
test scripts for this, or how have you been testing?



Patch #2:

* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. 
So this will fail with an index larger than 2^31 blocks. That's perhaps 
academical, I don't think anyone will try to create a 16 TB GiST index 
any time soon. But it feels wrong to introduce an arbitrary limitation 
like that.


* I was surprised that the bms_make_empty() function doesn't set the 
'nwords' field to match the allocated size. Perhaps that was 
intentional, so that you don't need to scan the empty region at the end, 
when you scan through all matching bits? Still, seems surprising, and 
needs a comment at least.


* We're now scanning all internal pages in the 2nd phase. Not only those 
internal pages that contained downlinks to empty leaf pages. That's 
probably OK, but the comments need updating on that.


* In general, comments need to be fixed and added in several places. For 
example, there's no clear explanation on what the "delete XID", stored 
in pd_prune_xid, means. (I know what it is, because I'm familiar with 
the same mechanism in B-tree, but it's not clear from the code itself.)


These can be fixed, they're not show-stoppers, but patch #2 isn't quite 
ready yet.


- Heikki



Re: GiST VACUUM

2018-11-30 Thread Dmitry Dolgov
> On Sun, Oct 28, 2018 at 6:32 PM Andrey Borodin  wrote:
>
> Hi everyone!
>
> > On 2 Oct 2018, at 6:14, Michael Paquier wrote:
> > Andrey, your latest patch does not apply.  I am moving this to the next
> > CF, waiting for your input.
>
> I'm doing preps for CF.
> Here's rebased version.

Looks like this patch has been waiting for input since August. Could any of the
reviewers (Heikki?) please take a look at the latest version? In the meantime
I'm moving it to the next CF.



Re: GiST VACUUM

2018-10-28 Thread Andrey Borodin
Hi everyone!

> On 2 Oct 2018, at 6:14, Michael Paquier wrote:
> Andrey, your latest patch does not apply.  I am moving this to the next
> CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v17.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v17.patch
Description: Binary data
 



Re: GiST VACUUM

2018-10-01 Thread Michael Paquier
On Mon, Aug 06, 2018 at 11:12:00PM +0500, Andrey Borodin wrote:
> Done. Added function bms_make_empty(int size)

Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.
--
Michael


signature.asc
Description: PGP signature


Re: GiST VACUUM

2018-08-06 Thread Andrey Borodin
Hi!

PFA v16.

> On 5 Aug 2018, at 21:45, Andrey Borodin wrote:
>> On 5 Aug 2018, at 16:18, Heikki Linnakangas wrote:
>> 
>> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
>> bytes per internal page in total, while a bitmap consumes one bit per page, 
>> leaf or internal. If I'm doing
>> my math right, assuming the ratio of leaf pages vs internal pages is
>> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
>> internal page, vs. 200 bits. An array, with 4 bytes per block number,
>> would be the winner here.
> We do not know size of this array beforehand. I can implement normal 
> ArrayList though (with repallocing array) or linked list of chunks. Or 
> anything from data structures zoo.
> Or just stick with bitmap (my preferred way).
Done.
>> 
>>> But I have to note that default growth strategy of Bitmap is not good: we 
>>> will be repallocing byte by byte.
>> 
>> True, that repallocing seems bad. You could force it to be pre-allocated
>> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
> OK, I'll think of proper resize function (ensure capacity, to be precise).
Done. Added function bms_make_empty(int size)

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v16.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v16.patch
Description: Binary data


Re: GiST VACUUM

2018-08-05 Thread Andrey Borodin
Hi!

> On 5 Aug 2018, at 16:18, Heikki Linnakangas wrote:
> 
> On 31/07/18 23:06, Andrey Borodin wrote:
>>> On a typical GiST index, what's the ratio of leaf vs. internal
>>> pages? Perhaps an array would indeed be better.
> >
>> Typical GiST has around 200 tuples per internal page. I've switched
>> to List since it's more efficient than bitmap.
> 
> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
> bytes per internal page in total, while a bitmap consumes one bit per page, 
> leaf or internal. If I'm doing
> my math right, assuming the ratio of leaf pages vs internal pages is
> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
> internal page, vs. 200 bits. An array, with 4 bytes per block number,
> would be the winner here.
We do not know the size of this array beforehand. I could implement a normal 
ArrayList though (with a repalloc'd array), or a linked list of chunks, or 
anything from the data structures zoo.
Or just stick with the bitmap (my preferred way).
> 
>> But I have to note that default growth strategy of Bitmap is not good: we 
>> will be repallocing byte by byte.
> 
> True, that repallocing seems bad. You could force it to be pre-allocated
> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
OK, I'll think of a proper resize function (ensure capacity, to be precise).

Best regards, Andrey Borodin.




Re: GiST VACUUM

2018-08-05 Thread Heikki Linnakangas

On 31/07/18 23:06, Andrey Borodin wrote:

On a typical GiST index, what's the ratio of leaf vs. internal
pages? Perhaps an array would indeed be better.

>

Typical GiST has around 200 tuples per internal page. I've switched
to List since it's more efficient than bitmap.


Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
bytes per internal page in total, while a bitmap consumes one bit per 
page, leaf or internal. If I'm doing
my math right, assuming the ratio of internal pages to leaf pages is
1:200, a List actually consumes more memory than a bitmap; 256 bits per
internal page, vs. 200 bits. An array, with 4 bytes per block number,
would be the winner here.
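
To spell out that arithmetic, a stand-alone back-of-the-envelope sketch 
using the numbers above (the 1:200 internal-to-leaf ratio is just the 
assumption from this paragraph):

#include <stdio.h>

int
main(void)
{
	const double leaves_per_internal = 200.0;	/* assumed ratio */
	double		list_bits = (16 + 16) * 8;		/* ListCell + palloc chunk header, per internal page */
	double		bitmap_bits = 1.0 + leaves_per_internal;	/* 1 bit per page, amortized per internal page */
	double		array_bits = 4 * 8;				/* one 4-byte block number per internal page */

	printf("per internal page: list=%.0f bits, bitmap=~%.0f bits, array=%.0f bits\n",
		   list_bits, bitmap_bits, array_bits);
	return 0;
}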

But I have to note that default growth strategy of Bitmap is not 
good: we will be repallocing byte by byte.


True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.

- Heikki



Re: GiST VACUUM

2018-07-31 Thread Andrey Borodin
Hi! Thanks for looking into the patch!

> On 30 Jul 2018, at 18:39, Heikki Linnakangas wrote:
> 
> On 29/07/18 14:47, Andrey Borodin wrote:
>> Fixed both problems. PFA v14.
> 
> Thanks, took a really quick look at this.
> 
> The text being added to README is outdated for these latest changes.
Fixed.
> 
>> In second step I still use paloc's memory, but only to store two
>> bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
>> physical scan only reads internal pages. I can omit that bitmap, if
>> I'll scan everything. Also, I can replace emptyLeafs bitmap with
>> array\list, but I do not really think it will be big.
> 
> On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps 
> an array would indeed be better.
A typical GiST has around 200 tuples per internal page. I've switched to a List 
since it's more efficient than a bitmap.
> If you have a really large index, the bitmaps can take a fair amount of 
> memory, on top of the memory used for tracking the dead TIDs. I.e. that 
> memory will be in addition to maintenance_work_mem. That's not nice, but I 
> think it's OK in practice, and not worth spending too much effort to 
> eliminate. For a 1 TB index with default 8k block size, the two bitmaps will 
> take 32 MB of memory in total. If you're dealing with a database of that 
> size, you ought to have some memory to spare. But if an array would use less 
> memory, that'd be better.

> 
> If you go with bitmaps, please use the existing Bitmapset instead of rolling 
> your own. Saves some code, and it provides more optimized routines for 
> iterating through all the set bits, too (bms_next_member()). Another 
> possibility would be to use Tidbitmap, in the "lossy" mode, i.e. add the 
> pages with tbm_add_page(). That might save some memory, compared to 
> Bitmapset, if the bitmap is very sparse. Not sure how it compares with a 
> plain array.
Yeah, I've stopped reinventing the wheel there. But I have to note that the 
default growth strategy of Bitmapset is not good: we will be repalloc'ing byte by byte.

> 
> A straightforward little optimization would be to skip scanning the internal 
> pages, when the first scan didn't find any empty pages. And stop the scan of 
> the internal pages as soon as all the empty pages have been recycled.
Done.

PFA v15.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v15.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v15.patch
Description: Binary data


Re: GiST VACUUM

2018-07-30 Thread Heikki Linnakangas

On 29/07/18 14:47, Andrey Borodin wrote:

Fixed both problems. PFA v14.


Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.


In second step I still use paloc's memory, but only to store two
bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
physical scan only reads internal pages. I can omit that bitmap, if
I'll scan everything. Also, I can replace emptyLeafs bitmap with
array\list, but I do not really think it will be big.


On a typical GiST index, what's the ratio of leaf vs. internal pages? 
Perhaps an array would indeed be better. If you have a really large 
index, the bitmaps can take a fair amount of memory, on top of the 
memory used for tracking the dead TIDs. I.e. that memory will be in 
addition to maintenance_work_mem. That's not nice, but I think it's OK 
in practice, and not worth spending too much effort to eliminate. For a 
1 TB index with default 8k block size, the two bitmaps will take 32 MB 
of memory in total. If you're dealing with a database of that size, you 
ought to have some memory to spare. But if an array would use less 
memory, that'd be better.


If you go with bitmaps, please use the existing Bitmapset instead of 
rolling your own. Saves some code, and it provides more optimized 
routines for iterating through all the set bits, too 
(bms_next_member()). Another possibility would be to use Tidbitmap, in 
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might 
save some memory, compared to Bitmapset, if the bitmap is very sparse. 
Not sure how it compares with a plain array.
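
For example, a minimal usage sketch of the existing Bitmapset API for this 
purpose (illustrative only, not from the patches):

#include "postgres.h"
#include "nodes/bitmapset.h"

static void
remember_and_walk_internal_pages(void)
{
	/* Note: members are ints, hence the 2^31-block limitation discussed elsewhere in the thread */
	Bitmapset  *internal_pages = NULL;
	int			blkno;

	internal_pages = bms_add_member(internal_pages, 42);
	internal_pages = bms_add_member(internal_pages, 7);

	blkno = -1;
	while ((blkno = bms_next_member(internal_pages, blkno)) >= 0)
		elog(DEBUG1, "internal page %d", blkno);

	bms_free(internal_pages);
}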


A straightforward little optimization would be to skip scanning the 
internal pages, when the first scan didn't find any empty pages. And 
stop the scan of the internal pages as soon as all the empty pages have 
been recycled.


- Heikki



Re: GiST VACUUM

2018-07-29 Thread Andrey Borodin
Hi!

Thank you!

> On 29 Jul 2018, at 14:04, Thomas Munro wrote:
> 
> On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin  wrote:
>>> On 21 Jul 2018, at 17:11, Andrey Borodin wrote:
>>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>> 
>> Just in case, here's second part of patch series with actual page deletion.
> 
> Hi Andrey,
> 
> Cfbot says:
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146
> 
> That's because you declared a new variable after some other
> statements.  You can't do that in old school C89.
Yep, I mismerged the patch steps and created a misplaced declaration.

> https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951
> 
> That segfaulted here:
> 
> #0 0x004ab620 in setinternal (state=,
> state=, blkno=368) at gistvacuum.c:44
> 44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
> 
> internalPagesMap was NULL, or pointed to memory that was too small and
> happened to be near an unmapped region (unlikely?), or was some other
> corrupted address?
Yes, there was a conditionally uninitialized variable, mapNumPages. I do not know 
why it didn't trigger a failure last time, but now it reproduces reliably on my 
machine.

Fixed both problems. PFA v14.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v14.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v14.patch
Description: Binary data


Re: GiST VACUUM

2018-07-29 Thread Thomas Munro
On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin  wrote:
>> On 21 Jul 2018, at 17:11, Andrey Borodin wrote:
>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>
> Just in case, here's second part of patch series with actual page deletion.

Hi Andrey,

Cfbot says:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146

That's because you declared a new variable after some other
statements.  You can't do that in old school C89.
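
For reference, a tiny stand-alone example of that rule (illustrative only):

#include <stdio.h>

int
main(void)
{
	printf("hello\n");
	int			x = 1;	/* OK in C99 and later; a strict C89 compiler rejects a declaration after a statement */

	return x;
}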

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951

That segfaulted here:

#0 0x004ab620 in setinternal (state=,
state=, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);

internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: GiST VACUUM

2018-07-23 Thread Andrey Borodin
Hi!

> On 21 Jul 2018, at 17:11, Andrey Borodin wrote:
> 
> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's second part of patch series with actual page deletion.

I was considering further decreasing the memory footprint by using Bloom filters 
instead of a bitmap, but that would create considerably more work for the CPU to 
compute hashes.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v13.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v13.patch
Description: Binary data


Re: GiST VACUUM

2018-07-21 Thread Andrey Borodin
Hi!

> On 19 Jul 2018, at 23:26, Andrey Borodin wrote:
> 
> I'm working on triggering left split during vacuum. Will get back when done. 
> Thanks!

Here's a patch including some messy hacks to trigger NSN and FollowRight jumps 
during VACUUM.

To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, 
simulating a crash of an insert. This fills the logs with "fixing incomplete split" 
messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

gistvacuumcleanup() was constantly generating left jumps because 0 was used 
instead of the real start NSN, so I moved the NSN acquisition to gistvacuumscan(). 
Also fixed some comments.

gistvacuumcleanup() will have the same effect as gistbulkdelete(); is that OK?

To reproduce left jumps, run ./rescantest.sh.
The script contains variables for my local paths.


Best regards, Andrey Borodin.


0001-Physical-GiST-scan-in-VACUUM-v13.patch
Description: Binary data


Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin



> On 19 Jul 2018, at 16:28, Heikki Linnakangas wrote:
> Hmm. So, while we are scanning the right sibling, which was moved to 
> lower-numbered block because of a concurrent split, the original page is 
> split again? That's OK, we've already scanned all the tuples on the original 
> page, before we recurse to deal with the right sibling. (The corresponding 
> B-tree code also releases the lock on the original page when recursing)
Seems right.

> 
> I did some refactoring, to bring this closer to the B-tree code, for the sake 
> of consistency. See attached patch. This also eliminates the 2nd pass by 
> gistvacuumcleanup(), in case we did that in the bulkdelete-phase already.
Thanks!

> 
> There was one crucial thing missing: in the outer loop, we must ensure that 
> we scan all pages, even those that were added after the vacuum started.
Correct. There is quite neat logic behind the order of acquiring npages, comparing, 
and vacuuming the page. The notes in the FIXME look correct, except for the function names.

> There's a comment explaining that in btvacuumscan(). This version fixes that.
> 
> I haven't done any testing on this. Do you have any test scripts you could 
> share?
I just use a simple test that sets up replication and does random inserts and 
vacuums. Not rocket science, just a mutated script:
for i in $(seq 1 12); do 
size=$((100 * 2**$i))
./psql postgres -c "create table x as select cube(random()) c from 
generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "drop table x;"
./psql postgres -c "create table x as select cube(random()) c from 
generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "insert into x select cube(random()) c from 
generate_series(1,$size) y;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "VACUUM FULL x;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "drop table x;"
done

> I think we need some repeatable tests for the concurrent split cases.
It is hard to trigger left splits until we delete pages. I'll try to hack 
gistNewBuffer() to cause something similar.

> Even if it involves gdb or some other hacks that we can't include in the 
> regression test suite, we need something now, while we're hacking on this.
> 
> One subtle point, that I think is OK, but gave me a pause, and probably 
> deserves comment somewhere: A concurrent root split can turn a leaf page into 
> one internal (root) page, and two new leaf pages. The new root page is placed 
> in the same block as the old page, while both new leaf pages go to freshly 
> allocated blocks. If that happens while vacuum is running, might we miss the 
> new leaf pages? As the code stands, we don't do the "follow-right" dance on 
> internal pages, so we would not recurse into the new leaf pages. At first, I 
> thought that's a problem, but I think we can get away with it. The only 
> scenario where a root split happens on a leaf page, is when the index has 
> exactly one page, a single leaf page. Any subsequent root splits will split 
> an internal page rather than a leaf page, and we're not bothered by those. In 
> the case that a root split happens on a single-page index, we're OK, because 
> we will always scan that page either before, or after the split. If we scan 
> the single page before the split, we see all the leaf tuples on that page. If 
> we scan the single page after the split, it means that we start the scan 
> after the split, and we will see both leaf pages as we continue the scan.
Yes, only page 0 may change type, and page 0 cannot split to left.


I'm working on triggering left split during vacuum. Will get back when done. 
Thanks!

Best regards, Andrey Borodin.


Re: GiST VACUUM

2018-07-19 Thread Heikki Linnakangas

On 19/07/18 14:42, Andrey Borodin wrote:


19.07.2018, 15:20, "Heikki Linnakangas" :


On 19/07/18 13:52, Andrey Borodin wrote:

 Hi!

 On 19 Jul 2018, at 1:12, Heikki Linnakangas wrote:

 Yeah, please, I think this is the way to go.


 Here's v11 divided into proposed steps.


Thanks, one quick question:

 /* We should not unlock buffer if we are going to jump left */
 if (needScan)
 {
     GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
     ptr->buffer = buffer;
     ptr->next = bufferStack;
     bufferStack = ptr;
 }
 else
     UnlockReleaseBuffer(buffer);


Why? I don't see any need to keep the page locked, when we "jump left".


Because it can split to the left again, given that we release lock.


Hmm. So, while we are scanning the right sibling, which was moved to 
lower-numbered block because of a concurrent split, the original page is 
split again? That's OK, we've already scanned all the tuples on the 
original page, before we recurse to deal with the right sibling. (The 
corresponding B-tree code also releases the lock on the original page 
when recursing)


I did some refactoring, to bring this closer to the B-tree code, for the 
sake of consistency. See attached patch. This also eliminates the 2nd 
pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase 
already.


There was one crucial thing missing: in the outer loop, we must ensure 
that we scan all pages, even those that were added after the vacuum 
started. There's a comment explaining that in btvacuumscan(). This 
version fixes that.
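
To make that concrete, here is a minimal sketch of the outer-loop shape, 
reusing GistVacState and gistvacuumpage() from the attached patch (this is 
an illustration, not the patch itself):

/*
 * Keep re-reading the relation length, so that pages added by concurrent
 * index extension after the scan started are visited as well.
 */
static void
scan_all_pages(Relation rel, GistVacState *vstate)
{
	BlockNumber num_pages;
	BlockNumber blkno = GIST_ROOT_BLKNO + 1;

	for (;;)
	{
		num_pages = RelationGetNumberOfBlocks(rel);
		if (blkno >= num_pages)
			break;
		for (; blkno < num_pages; blkno++)
			gistvacuumpage(vstate, blkno, blkno);
	}
}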


I haven't done any testing on this. Do you have any test scripts you 
could share? I think we need some repeatable tests for the concurrent 
split cases. Even if it involves gdb or some other hacks that we can't 
include in the regression test suite, we need something now, while we're 
hacking on this.


One subtle point, that I think is OK, but gave me a pause, and probably 
deserves comment somewhere: A concurrent root split can turn a leaf page 
into one internal (root) page, and two new leaf pages. The new root page 
is placed in the same block as the old page, while both new leaf pages 
go to freshly allocated blocks. If that happens while vacuum is running, 
might we miss the new leaf pages? As the code stands, we don't do the 
"follow-right" dance on internal pages, so we would not recurse into the 
new leaf pages. At first, I thought that's a problem, but I think we can 
get away with it. The only scenario where a root split happens on a leaf 
page, is when the index has exactly one page, a single leaf page. Any 
subsequent root splits will split an internal page rather than a leaf 
page, and we're not bothered by those. In the case that a root split 
happens on a single-page index, we're OK, because we will always scan 
that page either before, or after the split. If we scan the single page 
before the split, we see all the leaf tuples on that page. If we scan 
the single page after the split, it means that we start the scan after 
the split, and we will see both leaf pages as we continue the scan.


- Heikki
>From 9978fd22dd7b52b1b3f509f53fbafa505f68b573 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 19 Jul 2018 15:25:58 +0300
Subject: [PATCH 1/1] Physical GiST scan in VACUUM v12

---
 src/backend/access/gist/gistvacuum.c | 431 ---
 1 file changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..180cc6c63a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,38 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   GistNSN startNSN);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	GistNSN		startNSN;
+
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) 

Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin
19.07.2018, 15:20, "Heikki Linnakangas" :
> On 19/07/18 13:52, Andrey Borodin wrote:
>> Hi!
>>
>>> On 19 Jul 2018, at 1:12, Heikki Linnakangas wrote:
>>> Yeah, please, I think this is the way to go.
>>
>> Here's v11 divided into proposed steps.
>
> Thanks, one quick question:
>
>      /* We should not unlock buffer if we are going to jump left */
>      if (needScan)
>      {
>          GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
>          ptr->buffer = buffer;
>          ptr->next = bufferStack;
>          bufferStack = ptr;
>      }
>      else
>          UnlockReleaseBuffer(buffer);
>
> Why? I don't see any need to keep the page locked, when we "jump left".

Because it can split to the left again, given that we release lock.

Best regards, Andrey Borodin.

Re: GiST VACUUM

2018-07-19 Thread Heikki Linnakangas

On 19/07/18 13:52, Andrey Borodin wrote:

Hi!


On 19 Jul 2018, at 1:12, Heikki Linnakangas wrote:

Yeah, please, I think this is the way to go.


Here's v11 divided into proposed steps.


Thanks, one quick question:


/* We should not unlock buffer if we are going to jump left */
if (needScan)
{
    GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
    ptr->buffer = buffer;
    ptr->next = bufferStack;
    bufferStack = ptr;
}
else
    UnlockReleaseBuffer(buffer);


Why? I don't see any need to keep the page locked, when we "jump left".

- Heikki



Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin
Hi!

> On 19 Jul 2018, at 1:12, Heikki Linnakangas wrote:
> 
> Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

In the second step I still use palloc'd memory, but only to store two bitmaps: 
a bitmap of internal pages and a bitmap of empty leaves. The second physical scan 
only reads internal pages. I can omit that bitmap if I scan everything.
Also, I can replace the emptyLeafs bitmap with an array/list, but I do not really 
think it will be big.

Anyway I propose focusing on first step.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v11.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v11.patch
Description: Binary data


Re: GiST VACUUM

2018-07-18 Thread Heikki Linnakangas

On 18/07/18 21:27, Andrey Borodin wrote:

Hi!


On 18 Jul 2018, at 16:02, Heikki Linnakangas  wrote:

, but I think it would be better to split this into two patches as
follows:

1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first
patch, which just switches to doing physical-order scan, first.


This seems like a very disproportionate division of complexity. The first
patch (PFA) is very simple. All work is done in one cycle, without
memorizing anything. Actually, you do not even need to rescan
rightlinks: there can be no splits to the left when no pages are
deleted.


Heh, good point.

I googled around and bumped into an older patch to do this: 
https://www.postgresql.org/message-id/1135121410099068%40web30j.yandex.ru. 
Unfortunately, Костя never got around to updating the patch, and it was 
forgotten. But the idea seemed sound even back then.


As noted in that thread, there might be deleted pages in the index in 
some rare circumstances, even though we don't recycle empty pages: if 
the index was upgraded from a very old version, as VACUUM FULL used to 
recycle empty pages, or if you crash just when extending the index, and 
end up with newly-initialized but unused pages that way. So we do need 
to handle the concurrent split scenario, even without empty page recycling.



If you think this is the proper way to go - OK, I'll prepare
a better version of the attached diff (by omitting tail recursion and
adding more comments).


Yeah, please, I think this is the way to go.

- Heikki



Re: GiST VACUUM

2018-07-18 Thread Andrey Borodin
Hi!

> On 18 Jul 2018, at 16:02, Heikki Linnakangas  wrote:
> 
> In the corresponding B-tree code, we don't do actual recursion, but a 
> hand-optimized "tail recursion", to avoid stack overflow if there are a lot 
> of splits. I think we need to do something like that here, too. I don't think 
> it's safe to assume that we have enough stack space for the recursion. You 
> have a check_stack_depth() in the function, so you'll get ERROR, but it sucks 
> if VACUUM errors out because of that.
Ok, will do that.

> 
> 
> It's not cool to use up to 'maintenance_work_mem' of memory for holding the 
> in-memory graph. VACUUM already uses up to that much memory to hold the list 
> of dead TIDs, so we would use up to 2x maintenance_work_mem in total.
> 
> The only reason we still need the logical scan is to support page deletion, 
> when there is not enough memory available. Can we get rid of that altogether? 
> I think I'd prefer some other scheme to deal with that situation. For 
> example, we could make note, in memory, of the empty pages during the 
> physical scan, and perform a second physical scan of the index to find their 
> downlinks. Two full-index scans is not nice, but it's actually not that bad 
> if it's done in physical order.
I think this is a good idea. I will implement this.

> And you could have some heuristics, e.g. only do it if more than 10% of the 
> pages were deletable.
> 
> Sorry to ask you to rewrite this again
Piece of cake :)

> , but I think it would be better to split this into two patches as follows:
> 
> 1st patch: Scan the index in physical rather than logical order. No attempt 
> at deleting empty pages yet.
> 
> 2nd patch: Add support for deleting empty pages.
> 
> I would be more comfortable reviewing and committing that first patch, which 
> just switches to doing physical-order scan, first.

This seems like a very disproportionate division of complexity. The first patch (PFA) is 
very simple. All work is done in one cycle, without memorizing anything. 
Actually, you do not even need to rescan rightlinks: there can be no splits to 
the left when no pages are deleted.
If you think this is the proper way to go - OK, I'll prepare a better version of the 
attached diff (by omitting tail recursion and adding more comments).


Best regards, Andrey Borodin.


gist_phisical_vacuum_v1.diff
Description: Binary data


Re: GiST VACUUM

2018-07-18 Thread Heikki Linnakangas

On 17/07/18 21:41, Andrey Borodin wrote:

I was checking WAL replay of new scheme to log page deletes and found
a bug there (incorrect value of deleted downlink in WAL record).
Here's fixed patch v10.

Also I've added support to WAL identification for new record, done
some improvements to comments and naming in data structures.


Thanks!


+   /*
+* If this page was splitted after start of the VACUUM we have 
to
+* revisit rightlink, if it points to block we already scanned.
+* This is recursive revisit, should not be deep, but we check
+* the possibility of stack overflow anyway.
+*/
+   if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) 
&&
+   (opaque->rightlink != InvalidBlockNumber) && 
(opaque->rightlink < blkno))
+   {
+   gistbulkdeletephysicalcanpage(info, stats, 
callback, callback_state, opaque->rightlink, startNSN, graph);
+   }


In the corresponding B-tree code, we don't do actual recursion, but 
a hand-optimized "tail recursion", to avoid stack overflow if there are 
a lot of splits. I think we need to do something like that here, too. I 
don't think it's safe to assume that we have enough stack space for the 
recursion. You have a check_stack_depth() in the function, so you'll get 
ERROR, but it sucks if VACUUM errors out because of that.



It's not cool to use up to 'maintenance_work_mem' of memory for holding 
the in-memory graph. VACUUM already uses up to that much memory to hold 
the list of dead TIDs, so we would use up to 2x maintenance_work_mem in 
total.


The only reason we still need the logical scan is to support page 
deletion, when there is not enough memory available. Can we get rid of 
that altogether? I think I'd prefer some other scheme to deal with that 
situation. For example, we could make note, in memory, of the empty 
pages during the physical scan, and perform a second physical scan of 
the index to find their downlinks. Two full-index scans is not nice, but 
it's actually not that bad if it's done in physical order. And you could 
have some heuristics, e.g. only do it if more than 10% of the pages were 
deletable.


Sorry to ask you to rewrite this again, but I think it would be better 
to split this into two patches as follows:


1st patch: Scan the index in physical rather than logical order. No 
attempt at deleting empty pages yet.


2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first patch, 
which just switches to doing physical-order scan, first.


- Heikki



Re: GiST VACUUM

2018-07-17 Thread Andrey Borodin
Hi!

> On 16 Jul 2018, at 21:24, Andrey Borodin  wrote:
> 

I was checking WAL replay of new scheme to log page deletes and found a bug 
there (incorrect value of deleted downlink in WAL record). Here's fixed patch 
v10.

Also I've added support to WAL identification for new record, done some 
improvements to comments and naming in data structures.

Thanks!

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v10.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v10.patch
Description: Binary data


Re: GiST VACUUM

2018-07-16 Thread Andrey Borodin


> On 16 Jul 2018, at 18:58, Robert Haas  wrote:
> 
> On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas  wrote:
>> I'm still a bit scared about using pd_prune_xid to store the XID that
>> prevents recycling the page too early. Can we use some field in
>> GISTPageOpaqueData for that, similar to how the B-tree stores it in
>> BTPageOpaqueData?
> 
> What's your reason for being scared?  It seems to me that the
> alternatives being proposed (altering the size of the special space,
> or sometimes repurposing a field for some other purpose) have their
> own associated scariness.

Thanks, that's exactly what I've been thinking about regarding where to store this xid.

Here's v9 of the patch, it uses pd_prune_xid, but it is abstracted to 
GistPageGetDeleteXid() \ GistPageSetDeleteXid() so that we can change the way 
we store it easily.
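
Just to illustrate the abstraction, the accessors could be as thin as this (a sketch
only; see the attached patch for the real definitions, which may differ):

/* sketch: the deletion xid lives in the otherwise-unused pd_prune_xid for now */
#define GistPageGetDeleteXid(page) \
	(((PageHeader) (page))->pd_prune_xid)
#define GistPageSetDeleteXid(page, xid) \
	(((PageHeader) (page))->pd_prune_xid = (xid))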

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v9.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v9.patch
Description: Binary data


Re: GiST VACUUM

2018-07-16 Thread Robert Haas
On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas  wrote:
> I'm still a bit scared about using pd_prune_xid to store the XID that
> prevents recycling the page too early. Can we use some field in
> GISTPageOpaqueData for that, similar to how the B-tree stores it in
> BTPageOpaqueData?

What's your reason for being scared?  It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece
of seemingly heap-specific data that is kept in the generic page
header rather than in the heap's special space.  Other AMs like btree
or zheap may have different needs; one size does not fit all.  That
said, since getting rid of pd_prune_xid seems impractical, maybe it
makes more sense to reuse it than to insist on leaving it idle and
consuming space elsewhere in the page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GiST VACUUM

2018-07-14 Thread Andrey Borodin


> On 14 Jul 2018, at 14:39, Heikki Linnakangas  wrote:
> 
> On 14/07/18 10:26, Andrey Borodin wrote:
>> This is a tradeoff between a complex concurrency feature and the possibility
>> of a few dead tuples being left behind after VACUUM. I want to understand: is
>> there something dangerous about these dead tuples?
> Yeah, that's bad. When a new heap tuple is inserted in the same location, the 
> old index tuple will point to the new, unrelated, tuple, and you will get 
> incorrect query results.

PFA v8: at the beginning of the physical scan I grab GetInsertRecPtr(), and if a leaf 
page has a rightlink pointing to an earlier block in the file and an NSN higher than the 
LSN taken at the start, we go back and scan that page one more time. This is recursive.
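
In code, the re-visit condition is essentially the following (sketched here with a
made-up helper name; the attached patch is the authoritative version):

/* startNSN is captured with GetInsertRecPtr() before the scan begins */
static bool
gistvacuum_page_needs_rescan(Page page, BlockNumber orig_blkno, GistNSN startNSN)
{
	GISTPageOpaque opaque = GistPageGetOpaque(page);

	/*
	 * The page was split after the scan started, and the new right sibling
	 * landed on a block number we have already passed.
	 */
	return (GistFollowRight(page) || GistPageGetNSN(page) > startNSN) &&
		opaque->rightlink != InvalidBlockNumber &&
		opaque->rightlink < orig_blkno;
}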


I'm still not very comfortable with storing the deletion xid in the opaque data: we 
reuse the rightlink field under very specific conditions instead of using the totally 
unused pd_prune_xid. We have a remark in bufpage.h:
 * pd_prune_xid is a hint field that helps determine whether pruning will be
 * useful.  It is currently unused in index pages.
Or maybe we should use some other place on those empty 8kB pages.

Deleted pages do not have a rightlink now, but in B-trees the rightlink on deleted 
pages is actively used.
We cannot reuse the NSN either: the rightlink is useless without the NSN anyway. We 
cannot reuse the flags, since the page is marked deleted in the flags, after all. The 
uint16 gist_page_id is just not enough.

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v8.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v8.patch
Description: Binary data


Re: GiST VACUUM

2018-07-14 Thread Andrey Borodin


> On 14 Jul 2018, at 0:28, Heikki Linnakangas  wrote:
> 
> On 13/07/18 21:28, Andrey Borodin wrote:
>>> On 13 Jul 2018, at 18:25, Heikki Linnakangas  wrote:
>>> Looking at the second patch, to scan the GiST index in physical
>>> order, that seems totally unsafe, if there are any concurrent page
>>> splits. In the logical scan, pushStackIfSplited() deals with that,
>>> by comparing the page's NSN with the parent's LSN. But I don't see
>>> anything like that in the physical scan code.
>> A leaf page can be pointed to by an internal page and a rightlink
>> simultaneously. The purpose of the NSN is to visit this page exactly once
>> by following only one of the two links in a scan. This is achieved
>> naturally if we read everything from the beginning to the end. (That
>> is how I understand it; I can be wrong.)
> 
> The scenario where this fails goes like this:
> 
> 1. Vacuum scans physical pages 1-10
> 2. A concurrent insertion splits page 15. The new left half stays on page 15, 
> but the new right half goes to page 5
> 3. Vacuum scans pages 11-20
> 
> Now, if there were any dead tuples on the right half of the split, moved to 
> page 5, the vacuum would miss them.
> 
> The way this is handled in B-tree is that when a page is split, the page is 
> stamped with current "vacuum cycle id". When the vacuum scan encounters a 
> page with the current cycle id, whose right-link points to a lower-numbered 
> page, it immediately follows the right link, and re-scans it. I.e. in the 
> above example, if it was a B-tree, in step 3 when vacuum scans page 15, it 
> would see that it was concurrently split. It would immediately vacuum page 5 
> again, before continuing the scan in physical order.
> 
> We'll need to do something similar in GiST.

OK, I will do this.

This is a tradeoff between a complex concurrency feature and the possibility of a few 
dead tuples being left behind after VACUUM. I want to understand: is there something 
dangerous about these dead tuples?
There is one more serious race condition: the result of the first scan is just a hint 
about where to look for downlinks to empty pages. If an internal page splits between the 
scan and the cleanup, the offsets of the downlinks will change; the cleanup will lock the 
pages, see non-empty pages, and will not delete them (no dead tuples are left behind, 
just empty leaves that were not deleted).


Best regards, Andrey Borodin.


Re: GiST VACUUM

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 Jul 2018, at 18:25, Heikki Linnakangas  wrote:

Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.


A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of the NSN is to visit this page exactly once
by following only one of the two links in a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)


The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on 
page 15, but the new right half goes to page 5

3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved 
to page 5, the vacuum would miss them.


The way this is handled in B-tree is that when a page is split, the page 
is stamped with current "vacuum cycle id". When the vacuum scan 
encounters a page with the current cycle id, whose right-link points to 
a lower-numbered page, it immediately follows the right link, and 
re-scans it. I.e. in the above example, if it was a B-tree, in step 3 
when vacuum scans page 15, it would see that it was concurrently split. 
It would immediately vacuum page 5 again, before continuing the scan in 
physical order.


We'll need to do something similar in GiST.
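
For reference, the nbtree check is roughly this (paraphrased from btvacuumpage(); a
couple of extra conditions are omitted):

/* a leaf was split during this vacuum, and its new right sibling went to a
 * lower-numbered block that the scan has already passed */
if (P_ISLEAF(opaque) &&
	opaque->btpo_cycleid == vstate->cycleid &&
	!P_RIGHTMOST(opaque) &&
	opaque->btpo_next < orig_blkno)
	recurse_to = opaque->btpo_next;	/* re-vacuum that page before moving on */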

- Heikki



Re: GiST VACUUM

2018-07-13 Thread Andrey Borodin


> On 13 Jul 2018, at 18:10, Heikki Linnakangas  wrote:
> 
 But the situation in gistdoinsert(), where you encounter a deleted leaf 
 page, could happen during normal operation, if vacuum runs concurrently 
 with an insert. Insertion locks only one page at a time, as it descends 
 the tree, so after it has released the lock on the parent, but before it 
 has locked the child, vacuum might have deleted the page. In the latest 
 patch, you're checking for that just before swapping the shared lock for 
 an exclusive one, but I think that's wrong; you need to check for that 
 after swapping the lock, because otherwise vacuum might delete the page 
 while you're not holding the lock.
>>> Looks like a valid concern, I'll move that code again.
>> Done.
> 
> Ok, the comment now says:
> 
>> +/*
>> + * Leaf pages can be left deleted but still referenced 
>> in case of
>> + * crash during VACUUM's gistbulkdelete()
>> + */
> 
> But that's not accurate, right? You should never see deleted pages after a 
> crash, because the parent is updated in the same WAL record as the child 
> page, right?
Fixed the comment.
> 
> I'm still a bit scared about using pd_prune_xid to store the XID that 
> prevents recycling the page too early. Can we use some field in 
> GISTPageOpaqueData for that, similar to how the B-tree stores it in 
> BTPageOpaqueData?
There is no room in the opaque data, but technically the whole page is just a tombstone 
until it is reused, so we can pick an arbitrary place. PFA v7, where the xid after 
deletion is stored in the opaque data, but we can use other places, like the line pointer 
array or opaque-1.

> On 13 Jul 2018, at 18:25, Heikki Linnakangas  wrote:
> 
> Looking at the second patch, to scan the GiST index in physical order, that 
> seems totally unsafe, if there are any concurrent page splits. In the logical 
> scan, pushStackIfSplited() deals with that, by comparing the page's NSN with 
> the parent's LSN. But I don't see anything like that in the physical scan 
> code.

A leaf page can be pointed to by an internal page and a rightlink simultaneously. The 
purpose of the NSN is to visit this page exactly once by following only one of the two 
links in a scan. This is achieved naturally if we read everything from the beginning to 
the end. (That is how I understand it; I can be wrong.)

> I think we can do something similar in the physical scan: remember the 
> current LSN position at the beginning of the vacuum, and compare with that. 
> The B-tree code uses the "cycle ID" for similar purposes.
> 
> Do we still need the separate gistvacuumcleanup() pass, if we scan the index 
> in physical order in the bulkdelete pass already?

We do not need to gather stats there, but we are doing RecordFreeIndexPage() 
and IndexFreeSpaceMapVacuum(). Is it correct to move this to the first scan?

Best regards, Andrey Borodin.



0002-Physical-GiST-scan-during-VACUUM-v7.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v7.patch
Description: Binary data


Re: GiST VACUUM

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 16:41, Andrey Borodin wrote:
On 12 Jul 2018, at 21:07, Andrey Borodin  wrote:
On 12 Jul 2018, at 20:40, Heikki Linnakangas  wrote:
Actually, now that I think about it more, I'm not happy with leaving orphaned 
pages like that behind. Let's WAL-log the removal of the downlink, and 
marking the leaf pages as deleted, in one WAL record, to avoid that.


OK, will do this. But this will complicate WAL replay seriously, and I do not 
know a proper way to test that (BTW there is GiST amcheck in progress, but I 
decided to leave it for a while).

Done. Now WAL record for deleted page also removes downlink from internal page.
I had to use IndexPageTupleDelete() instead of IndexPageTupleMultiDelete(), but
I do not think it will have any impact on performance.


Yeah, I think that's fine, this isn't that performance-critical.

But the situation in gistdoinsert(), where you encounter a deleted leaf page, 
could happen during normal operation, if vacuum runs concurrently with an 
insert. Insertion locks only one page at a time, as it descends the tree, so 
after it has released the lock on the parent, but before it has locked the 
child, vacuum might have deleted the page. In the latest patch, you're 
checking for that just before swapping the shared lock for an exclusive one, 
but I think that's wrong; you need to check for that after swapping the lock, 
because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.


Ok, the comment now says:


+   /*
+* Leaf pages can be left deleted but still referenced 
in case of
+* crash during VACUUM's gistbulkdelete()
+*/


But that's not accurate, right? You should never see deleted pages after 
a crash, because the parent is updated in the same WAL record as the 
child page, right?


I'm still a bit scared about using pd_prune_xid to store the XID that 
prevents recycling the page too early. Can we use some field in 
GISTPageOpaqueData for that, similar to how the B-tree stores it in 
BTPageOpaqueData?


- Heikki



Re: GiST VACUUM

2018-07-12 Thread Andrey Borodin



> On 12 Jul 2018, at 20:40, Heikki Linnakangas  wrote:
> 
> On 12/07/18 19:06, Andrey Borodin wrote:
>>> On 11 Jul 2018, at 0:07, Heikki Linnakangas  wrote:
>>> This seems misplaced. This code deals with internal pages, and as
>>> far as I can see, this patch never marks internal pages as deleted,
>>> only leaf pages. However, we should have something like this in the
>>> leaf-page branch, to deal with the case that an insertion lands on
>>> a page that was concurrently deleted. Did you have any tests, where
>>> an insertion runs concurrently with vacuum, that would exercise
>>> this?
>> That bug could manifest only in case of crash between removing
>> downlinks and marking pages deleted.
> 
> Hmm. The downlink is removed first, so I don't think you can see that 
> situation after a crash. After a crash, you might have some empty, orphaned, 
> pages that have already been unlinked from the parent, but a search/insert 
> should never encounter them.
> 
> Actually, now that I think about it more, I'm not happy with leaving orphaned 
> pages like that behind. Let's WAL-log the removal of the downlink, and 
> marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not 
know a proper way to test that (BTW there is GiST amcheck in progress, but I 
decided to leave it for a while).

> 
> But the situation in gistdoinsert(), where you encounter a deleted leaf page, 
> could happen during normal operation, if vacuum runs concurrently with an 
> insert. Insertion locks only one page at a time, as it descends the tree, so 
> after it has released the lock on the parent, but before it has locked the 
> child, vacuum might have deleted the page. In the latest patch, you're 
> checking for that just before swapping the shared lock for an exclusive one, 
> but I think that's wrong; you need to check for that after swapping the lock, 
> because otherwise vacuum might delete the page while you're not holding the 
> lock.
Looks like a valid concern, I'll move that code again.
> 
>> I do not know how to test this
>> reliably. Internal pages are locked before leaves and locks are
>> coupled. No concurrent backend can see downlinks to pages being
>> deleted, unless a crash happens.
> 
> Are you sure? At a quick glance, I don't think the locks are coupled.


Sorry for overquoting

+   /* rescan inner pages that had empty child pages */
+   foreach(cell,rescanList)
+   {
+   Buffer  buffer;
+   Pagepage;
+   OffsetNumber i,
+   maxoff;
+   IndexTuple  idxtuple;
+   ItemId  iid;
+   OffsetNumber todelete[MaxOffsetNumber];
+   Buffer  buftodelete[MaxOffsetNumber];
+   int ntodelete = 0;
+
+   buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 
(BlockNumber)lfirst_int(cell),
+   
RBM_NORMAL, info->strategy);
+   LockBuffer(buffer, GIST_EXCLUSIVE);
Here's first lock
+   gistcheckpage(rel, buffer);
+   page = (Page) BufferGetPage(buffer);
+
+   Assert(!GistPageIsLeaf(page));
+
+   maxoff = PageGetMaxOffsetNumber(page);
+
+   for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = 
OffsetNumberNext(i))
+   {
+   Buffer  leafBuffer;
+   PageleafPage;
+
+   iid = PageGetItemId(page, i);
+   idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+   leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, 
ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+   RBM_NORMAL, 
info->strategy);
+   LockBuffer(leafBuffer, GIST_EXCLUSIVE);
now locks are coupled in a top-down descent


> 
> We do need some way of testing this..

Can we test replication of concurrent VACUUM and inserts in the existing test suite? 
I just do not know.
I can do these tests manually if that is enough.


Best regards, Andrey Borodin.


Re: GiST VACUUM

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 Jul 2018, at 0:07, Heikki Linnakangas  wrote:

This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?


That bug could manifest only in case of crash between removing
downlinks and marking pages deleted.


Hmm. The downlink is removed first, so I don't think you can see that 
situation after a crash. After a crash, you might have some empty, 
orphaned, pages that have already been unlinked from the parent, but a 
search/insert should never encounter them.


Actually, now that I think about it more, I'm not happy with leaving 
orphaned pages like that behind. Let's WAL-log the removal of the 
downlink, and marking the leaf pages as deleted, in one WAL record, to 
avoid that.


But the situation in gistdoinsert(), where you encounter a deleted leaf 
page, could happen during normal operation, if vacuum runs concurrently 
with an insert. Insertion locks only one page at a time, as it descends 
the tree, so after it has released the lock on the parent, but before it 
has locked the child, vacuum might have deleted the page. In the latest 
patch, you're checking for that just before swapping the shared lock for 
an exclusive one, but I think that's wrong; you need to check for that 
after swapping the lock, because otherwise vacuum might delete the page 
while you're not holding the lock.
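
In other words, the ordering needs to be something like this (only a sketch of the
idea, not actual patch code; this sits inside gistdoinsert()'s descent loop):

LockBuffer(stack->buffer, GIST_UNLOCK);
LockBuffer(stack->buffer, GIST_EXCLUSIVE);
xlocked = true;
stack->page = (Page) BufferGetPage(stack->buffer);

/*
 * Only now is the check safe: while we hold the exclusive lock, a
 * concurrent VACUUM can no longer delete the page under us.
 */
if (GistPageIsDeleted(stack->page))
{
	UnlockReleaseBuffer(stack->buffer);
	xlocked = false;
	state.stack = stack = stack->parent;	/* retry from the parent */
	continue;
}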



I do not know how to test this
reliably. Internal pages are locked before leaves and locks are
coupled. No concurrent backend can see downlinks to pages being
deleted, unless a crash happens.


Are you sure? At a quick glance, I don't think the locks are coupled.

We do need some way of testing this..

- Heikki



Re: GiST VACUUM

2018-07-12 Thread Andrey Borodin
Hi!

PFA v5 of the patch series.

> On 11 Jul 2018, at 0:07, Heikki Linnakangas  wrote:
> 
> This seems misplaced. This code deals with internal pages, and as far as I 
> can see, this patch never marks internal pages as deleted, only leaf pages. 
> However, we should have something like this in the leaf-page branch, to deal 
> with the case that an insertion lands on a page that was concurrently 
> deleted. Did you have any tests, where an insertion runs concurrently with 
> vacuum, that would exercise this?

That bug could manifest only in case of crash between removing downlinks and 
marking pages deleted. I do not know how to test this reliably.
Internal pages are locked before leaves and locks are coupled. No concurrent 
backend can see downlinks to pages being deleted, unless a crash happens.

I've moved the code covering this situation into the leaf code path and added a 
comment.

> 
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it 
> records the parent of every empty leaf page it encounters. In the second 
> phase, it scans every leaf page of that parent, not only those leaves that 
> were seen as empty.

It is fixed in second patch of the series.

> 
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted 
> page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but 
> here it's used for a critical check. This seems to be the same mechanism we 
> use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, 
> not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, 
> not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for 
> explanation. This patch is missing any comments to explain how this works in 
> GiST.

I've replaced the usage of GetCurrentTransactionId() with ReadNewTransactionId() 
and added an explanation of what is going on. Also, I've added comments about that 
pd_prune_xid usage. There is no other use for this field in GiST, and there is no other 
room to place this xid on a page without a pg_upgrade.

> 
> If you crash in the middle of gistbulkdelete(), after it has removed the 
> downlink from the parent, but before it has marked the leaf page as deleted, 
> the leaf page is "leaked". I think that's acceptable, but a comment at least 
> would be good.

Added explanatory comment between WAL-logging downlink removal and marking 
pages deleted.


Thank you for reviewing the patch!

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v5.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v5.patch
Description: Binary data


Re: GiST VACUUM

2018-07-11 Thread Andrey Borodin
Hi, Heikki!

Thanks for looking into the patch!

> On 11 Jul 2018, at 0:07, Heikki Linnakangas  wrote:
> 
> I'm now looking at the first patch in this series, to allow completely empty 
> GiST pages to be recycled. I've got some questions:
> 
>> --- a/src/backend/access/gist/gist.c
>> +++ b/src/backend/access/gist/gist.c
>> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size 
>> freespace, GISTSTATE *giststate)
>>  GISTInsertStack *item;
>>  OffsetNumber downlinkoffnum;
>> +if(GistPageIsDeleted(stack->page))
>> +{
>> +UnlockReleaseBuffer(stack->buffer);
>> +xlocked = false;
>> +state.stack = stack = stack->parent;
>> +continue;
>> +}
>>  downlinkoffnum = gistchoose(state.r, stack->page, itup, 
>> giststate);
>>  iid = PageGetItemId(stack->page, downlinkoffnum);
>>  idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
> 
> This seems misplaced. This code deals with internal pages, and as far as I 
> can see, this patch never marks internal pages as deleted, only leaf pages. 
> However, we should have something like this in the leaf-page branch, to deal 
> with the case that an insertion lands on a page that was concurrently deleted.
That's a bug. Will fix this.

> Did you have any tests, where an insertion runs concurrently with vacuum, 
> that would exercise this?
Yes, I've tried to test this, but, obviously, not enough. I'll think more about 
how to deal with it.

> 
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it 
> records the parent of every empty leaf page it encounters. In the second 
> phase, it scans every leaf page of that parent, not only those leaves that 
> were seen as empty.
Yes, in the first patch there is the simplest gistbulkdelete(); the second patch will 
remember the line pointers of downlinks to empty leaves.

> 
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted 
> page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but 
> here it's used for a critical check. This seems to be the same mechanism we 
> use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, 
> not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, 
> not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for 
> explanation. This patch is missing any comments to explain how this works in 
> GiST.
Will look into this. I remember it was OK half a year ago, but now it is clear to me 
that I should have documented that part back when I understood it.

> 
> If you crash in the middle of gistbulkdelete(), after it has removed the 
> downlink from the parent, but before it has marked the leaf page as deleted, 
> the leaf page is "leaked". I think that's acceptable, but a comment at least 
> would be good.
I was considering doing reverse-split (page merge) concurrency like in Lanin 
and Shasha's paper, but it is just too complex for little purpose. Will add 
comments on possible orphan pages.

Many thanks! I hope to post updated patch series this week.


Best regards, Andrey Borodin.


Re: GiST VACUUM

2018-07-10 Thread Heikki Linnakangas
I'm now looking at the first patch in this series, to allow completely 
empty GiST pages to be recycled. I've got some questions:



--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, 
GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
 
+			if(GistPageIsDeleted(stack->page))

+   {
+   UnlockReleaseBuffer(stack->buffer);
+   xlocked = false;
+   state.stack = stack = stack->parent;
+   continue;
+   }
downlinkoffnum = gistchoose(state.r, stack->page, itup, 
giststate);
iid = PageGetItemId(stack->page, downlinkoffnum);
idxtuple = (IndexTuple) PageGetItem(stack->page, iid);


This seems misplaced. This code deals with internal pages, and as far as 
I can see, this patch never marks internal pages as deleted, only leaf 
pages. However, we should have something like this in the leaf-page 
branch, to deal with the case that an insertion lands on a page that was 
concurrently deleted. Did you have any tests, where an insertion runs 
concurrently with vacuum, that would exercise this?


The code in gistbulkdelete() seems pretty expensive. In the first phase, 
it records the parent of every empty leaf page it encounters. In the 
second phase, it scans every leaf page of that parent, not only those 
leaves that were seen as empty.


I'm a bit wary of using pd_prune_xid for the checks to determine if a 
deleted page can be recycled yet. In heap pages, pd_prune_xid is just a 
hint, but here it's used for a critical check. This seems to be the same 
mechanism we use in B-trees, but in B-trees, we store the XID in 
BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use 
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See 
comments in _bt_unlink_halfdead_page() for explanation. This patch is 
missing any comments to explain how this works in GiST.
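
To spell out what "critical check" means here: before a deleted page can be put back
into the FSM, something like the following has to hold (a sketch only, written against
the patch's use of pd_prune_xid; nbtree does the equivalent in _bt_page_recyclable()
using BTPageOpaqueData.xact):

/* recycle the deleted page only when no running transaction can still see it */
if (GistPageIsDeleted(page) &&
	TransactionIdPrecedes(((PageHeader) page)->pd_prune_xid, RecentGlobalXmin))
	RecordFreeIndexPage(rel, blkno);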


If you crash in the middle of gistbulkdelete(), after it has removed the 
downlink from the parent, but before it has marked the leaf page as 
deleted, the leaf page is "leaked". I think that's acceptable, but a 
comment at least would be good.


- Heikki



Re: GiST VACUUM

2018-05-16 Thread Andrey Borodin
Hi, Thomas!

> On 17 May 2018, at 9:40, Thomas Munro  wrote:
> 
> On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin  wrote:
>> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.
> 
> Hi Andrey,
> 
> FYI Windows doesn't like this patch[1].
> 
> + uint16_t flags;
> 
> I think that needs to be uint16?


Thanks! Fixed.

Best regards, Andrey Borodin.


0001-Delete-pages-during-GiST-VACUUM-v4.patch
Description: Binary data


0002-Physical-GiST-scan-during-VACUUM-v4.patch
Description: Binary data


Re: GiST VACUUM

2018-05-16 Thread Thomas Munro
On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin  wrote:
> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch[1].

+ uint16_t flags;

I think that needs to be uint16?

[1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.16

-- 
Thomas Munro
http://www.enterprisedb.com