Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-12 Thread Amul Sul
Thanks a lot Tom.

On Tue, Jul 13, 2021 at 2:37 AM Tom Lane  wrote:
>
> Amul Sul  writes:
> > [ v5_Add-RelationGetSmgr-inline-function.patch ]
>
> Pushed with minor cosmetic adjustments.
>
> RelationCopyStorage() kind of gives me the willies.
> It's not really an smgr-level function, but we call it
> everywhere with smgr pointers that belong to relcache entries:
>
> /* copy main fork */
> -   RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
> +   RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
> rel->rd_rel->relpersistence);
>
> So that would fail hard if a relcache flush could occur inside
> that function.  It seems impossible today, so I settled for
> just annotating the function to that effect.  But it won't
> surprise me a bit if somebody breaks it in future due to not
> having read/understood the comment.
>
> regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-12 Thread Tom Lane
Amul Sul  writes:
> [ v5_Add-RelationGetSmgr-inline-function.patch ]

Pushed with minor cosmetic adjustments.

RelationCopyStorage() kind of gives me the willies.
It's not really an smgr-level function, but we call it
everywhere with smgr pointers that belong to relcache entries:

/* copy main fork */
-   RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+   RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
rel->rd_rel->relpersistence);

So that would fail hard if a relcache flush could occur inside
that function.  It seems impossible today, so I settled for
just annotating the function to that effect.  But it won't
surprise me a bit if somebody breaks it in future due to not
having read/understood the comment.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-11 Thread Amul Sul
On Fri, Jul 9, 2021 at 7:30 PM Alvaro Herrera  wrote:
>
> On 2021-Jul-09, Amul Sul wrote:
>
> > > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:
>
> > > > The point of the static-inline function idea was to be cheap enough
> > > > that it isn't worth worrying about this sort of risky optimization.
> > > > Given that an smgr function is sure to involve some kernel calls,
> > > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > > So where I was hoping to get to is that smgr objects are *only*
> > > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > > other pointers to them across any non-smgr operations.
>
> > Herewith attached version did the same, thanks.
>
> I think it would be valuable to have a comment in that function to point
> out what is the function there for.

Thanks for the suggestion, added the same in the attached version.

Regards,
Amul


v5_Add-RelationGetSmgr-inline-function.patch
Description: Binary data


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-09 Thread Alvaro Herrera
On 2021-Jul-09, Amul Sul wrote:

> > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:

> > > The point of the static-inline function idea was to be cheap enough
> > > that it isn't worth worrying about this sort of risky optimization.
> > > Given that an smgr function is sure to involve some kernel calls,
> > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > So where I was hoping to get to is that smgr objects are *only*
> > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > other pointers to them across any non-smgr operations.

> Herewith attached version did the same, thanks.

I think it would be valuable to have a comment in that function to point
out what is the function there for.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."  (E. Dijkstra)




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-09 Thread Amul Sul
On Wed, Jul 7, 2021 at 9:44 AM Amul Sul  wrote:
>
> On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:
> >
> > Amul Sul  writes:
> > > On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> > >  wrote:
> > >> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> > >> >member alone and there's not the previous call to
> > >> RelationGetSmgr just above. How about using a temporary variable?
> > >>
> > >> SMgrRelation srel = RelationGetSmgr(index);
> > >> smgrwrite(srel, ...);
> > >> log_newpage(srel->..);
> >
> > > Understood.  Used a temporary variable for the place where
> > > RelationGetSmgr() calls are placed too close or in a loop.
> >
> > [ squint... ]  Doesn't this risk introducing exactly the sort of
> > cache-clobber hazard we're trying to prevent?  That is, the above is
> > not safe unless you are *entirely* certain that there is not and never
> > will be any possibility of a relcache flush before you are done using
> > the temporary variable.  Otherwise it can become a dangling pointer.
> >
>
> Yeah, there will a hazard, even if we sure right but cannot guarantee future
> changes in any subroutine that could get call in between.
>
> > The point of the static-inline function idea was to be cheap enough
> > that it isn't worth worrying about this sort of risky optimization.
> > Given that an smgr function is sure to involve some kernel calls,
> > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > So where I was hoping to get to is that smgr objects are *only*
> > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > other pointers to them across any non-smgr operations.
> >
>
> Ok, will revert changes added in  the previous version, thanks.
>

Herewith attached version did the same, thanks.

Regards,
Amul


v4_Add-RelationGetSmgr-inline-function.patch
Description: Binary data


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-06 Thread Amul Sul
On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:
>
> Amul Sul  writes:
> > On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> >  wrote:
> >> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> >> >member alone and there's not the previous call to
> >> RelationGetSmgr just above. How about using a temporary variable?
> >>
> >> SMgrRelation srel = RelationGetSmgr(index);
> >> smgrwrite(srel, ...);
> >> log_newpage(srel->..);
>
> > Understood.  Used a temporary variable for the place where
> > RelationGetSmgr() calls are placed too close or in a loop.
>
> [ squint... ]  Doesn't this risk introducing exactly the sort of
> cache-clobber hazard we're trying to prevent?  That is, the above is
> not safe unless you are *entirely* certain that there is not and never
> will be any possibility of a relcache flush before you are done using
> the temporary variable.  Otherwise it can become a dangling pointer.
>

Yeah, there will a hazard, even if we sure right but cannot guarantee future
changes in any subroutine that could get call in between.

> The point of the static-inline function idea was to be cheap enough
> that it isn't worth worrying about this sort of risky optimization.
> Given that an smgr function is sure to involve some kernel calls,
> I doubt it's worth sweating over an extra test-and-branch beforehand.
> So where I was hoping to get to is that smgr objects are *only*
> referenced by RelationGetSmgr() calls and nobody ever keeps any
> other pointers to them across any non-smgr operations.
>

Ok, will revert changes added in  the previous version, thanks.

Regards,
Amul




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-06 Thread Tom Lane
Amul Sul  writes:
> On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
>  wrote:
>> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
>> >member alone and there's not the previous call to
>> RelationGetSmgr just above. How about using a temporary variable?
>> 
>> SMgrRelation srel = RelationGetSmgr(index);
>> smgrwrite(srel, ...);
>> log_newpage(srel->..);

> Understood.  Used a temporary variable for the place where
> RelationGetSmgr() calls are placed too close or in a loop.

[ squint... ]  Doesn't this risk introducing exactly the sort of
cache-clobber hazard we're trying to prevent?  That is, the above is
not safe unless you are *entirely* certain that there is not and never
will be any possibility of a relcache flush before you are done using
the temporary variable.  Otherwise it can become a dangling pointer.

The point of the static-inline function idea was to be cheap enough
that it isn't worth worrying about this sort of risky optimization.
Given that an smgr function is sure to involve some kernel calls,
I doubt it's worth sweating over an extra test-and-branch beforehand.
So where I was hoping to get to is that smgr objects are *only*
referenced by RelationGetSmgr() calls and nobody ever keeps any
other pointers to them across any non-smgr operations.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-05 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I looked through the patch. Looks good to me. 

CFbot tests are passing and, as I got it from the thread, nobody opposes this 
refactoring, so, move it to RFC status.

The new status of this patch is: Ready for Committer


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Amul Sul
On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul  wrote in
> > On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
> >  wrote:
> > > +   smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, 
> > > BLOOM_METAPAGE_BLKNO,
> > >   (char *) metapage, true);
> > > -   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > > +   log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, 
> > > INIT_FORKNUM,
> > >
> > > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > > to leave the line alone..  I don't mind other sccessive calls if any
> > > since what I don't like is the notation there.
> > >
> >
> > Perhaps, isn't that bad. It is good to follow the practice of using
> > RelationGetSmgr() for rd_smgr access, IMHO.
>
> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> >member alone and there's not the previous call to
> RelationGetSmgr just above. How about using a temporary variable?
>
>   SMgrRelation srel = RelationGetSmgr(index);
>   smgrwrite(srel, ...);
>   log_newpage(srel->..);
>

Understood.  Used a temporary variable for the place where
RelationGetSmgr() calls are placed too close or in a loop.

Please have a look at the attached version, thanks for the review.

Regards,
Amul
From d3043a97044d506ab9255c6d6d446ad962ee308c Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Mon, 19 Apr 2021 03:16:27 -0400
Subject: [PATCH] Add RelationGetSmgr inline function

---
 contrib/amcheck/verify_nbtree.c   |  3 +-
 contrib/bloom/blinsert.c  |  7 ++--
 contrib/pg_prewarm/autoprewarm.c  |  4 +-
 contrib/pg_prewarm/pg_prewarm.c   |  5 +--
 contrib/pg_visibility/pg_visibility.c |  7 ++--
 src/backend/access/gist/gistbuild.c   | 15 
 src/backend/access/hash/hashpage.c|  4 +-
 src/backend/access/heap/heapam_handler.c  |  9 +++--
 src/backend/access/heap/rewriteheap.c | 11 ++
 src/backend/access/heap/visibilitymap.c   | 46 +--
 src/backend/access/nbtree/nbtree.c|  7 ++--
 src/backend/access/nbtree/nbtsort.c   | 16 +++-
 src/backend/access/spgist/spginsert.c | 15 
 src/backend/access/table/tableam.c|  8 ++--
 src/backend/catalog/heap.c|  2 -
 src/backend/catalog/index.c   |  5 +--
 src/backend/catalog/storage.c | 18 -
 src/backend/commands/tablecmds.c  |  9 +++--
 src/backend/storage/buffer/bufmgr.c   | 25 
 src/backend/storage/freespace/freespace.c | 40 ++--
 src/include/utils/rel.h   | 21 ++-
 21 files changed, 122 insertions(+), 155 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b31906cbcfd..7f34eed04ba 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -301,8 +301,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 		bool		heapkeyspace,
 	allequalimage;
 
-		RelationOpenSmgr(indrel);
-		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
+		if (!smgrexists(RelationGetSmgr(indrel), MAIN_FORKNUM))
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("index \"%s\" lacks a main relation fork",
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c34a640d1c4..ae10174d94a 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,6 +164,7 @@ void
 blbuildempty(Relation index)
 {
 	Page		metapage;
+	SMgrRelation reln = RelationGetSmgr(index);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -177,9 +178,9 @@ blbuildempty(Relation index)
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
+	smgrwrite(reln, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(>smgr_rnode.node, INIT_FORKNUM,
 BLOOM_METAPAGE_BLKNO, metapage, true);
 
 	/*
@@ -187,7 +188,7 @@ blbuildempty(Relation index)
 	 * write did not go through shared_buffers and therefore a concurrent
 	 * checkpoint may have moved the redo pointer past our xlog record.
 	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	smgrimmedsync(reln, INIT_FORKNUM);
 }
 
 /*
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b3f73ea92d6..0289ea657cb 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -514,15 +514,13 @@ autoprewarm_database_main(Datum main_arg)
 			old_blk->filenode != blk->filenode ||
 			old_blk->forknum != blk->forknum)
 		{
-			RelationOpenSmgr(rel);
-
 			/*
 			 * smgrexists is not safe for illegal forknum, hence check whether
 			 * the passed forknum is valid before using it in smgrexists.
 			 */
 			if (blk->forknum > InvalidForkNumber &&
 

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Kyotaro Horiguchi
At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul  wrote in 
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
>  wrote:
> > +   smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, 
> > BLOOM_METAPAGE_BLKNO,
> >   (char *) metapage, true);
> > -   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +   log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, 
> > INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone..  I don't mind other sccessive calls if any
> > since what I don't like is the notation there.
> >
> 
> Perhaps, isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.

I don't mind RelationGetSmgr(index)->smgr_rnode alone or
>member alone and there's not the previous call to
RelationGetSmgr just above. How about using a temporary variable?

  SMgrRelation srel = RelationGetSmgr(index);
  smgrwrite(srel, ...);
  log_newpage(srel->..);


> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
> >
> 
> Sorry, I didn't get you. Do I need to move this to some other bucket?

As discussed in the other branch, I agree that it is registered to the
next CF, not registered as an open items of this cycle.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Kyotaro Horiguchi
At Mon, 19 Apr 2021 09:19:36 -0400, Tom Lane  wrote in 
> Michael Paquier  writes:
> > On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> >> Isn't this a kind of open item?
> 
> > This does not qualify as an open item because it is not an actual bug
> > IMO, neither is it a defect of the existing code, so it seems
> > appropriate to me to not list it.
> 
> Agreed, but by the same token, rushing it into v14 doesn't have any
> clear benefit.  I'd be inclined to leave it for v15 at this point,
> especially since we don't seem to have 100% consensus on the details.

Thanks. Seems reasonable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
>> Isn't this a kind of open item?

> This does not qualify as an open item because it is not an actual bug
> IMO, neither is it a defect of the existing code, so it seems
> appropriate to me to not list it.

Agreed, but by the same token, rushing it into v14 doesn't have any
clear benefit.  I'd be inclined to leave it for v15 at this point,
especially since we don't seem to have 100% consensus on the details.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Michael Paquier
On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> Isn't this a kind of open item?

This does not qualify as an open item because it is not an actual bug
IMO, neither is it a defect of the existing code, so it seems
appropriate to me to not list it.
--
Michael


signature.asc
Description: PGP signature


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Justin Pryzby
On Mon, Apr 19, 2021 at 04:27:25PM +0530, Amul Sul wrote:
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi  
> wrote:
> >
> > At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul  wrote in
> > > On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier  
> > > wrote:
> > > >
> > > > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > > > We forgot this patch earlier in the commitfest.  Do people think we
> > > > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > > > safety feature poised to prevent more bugs than it's likely to
> > > > > introduce.
> > > >
> > > > No objections from here to do that now even after feature freeze.  I
> > > > also wonder, while looking at that, why you don't just remove the last
> > > > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > > > to use RelationOpenSmgr() anymore, and it could just be removed from
> > > > rel.h.
> > >
> > > Agree, did the same in the attached version, thanks.
> >
> > +   smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, 
> > BLOOM_METAPAGE_BLKNO,
> >   (char *) metapage, true);
> > -   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +   log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, 
> > INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone..  I don't mind other sccessive calls if any
> > since what I don't like is the notation there.
> >
> 
> Perhaps, isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.
> 
> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
> >
> 
> Sorry, I didn't get you. Do I need to move this to some other bucket?

It's not a new feature, and shouldn't wait for July's CF since it's targetting
v14.

The original crash was fixed by Tom by commit 9d523119f.

So it's not exactly an "open item" for v14, but there's probably no better
place for it, so you could add it if you think it's at risk of being forgotten
(again).

https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items

-- 
Justin




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Amul Sul
On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
 wrote:
>
> At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul  wrote in
> > On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier  
> > wrote:
> > >
> > > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > > We forgot this patch earlier in the commitfest.  Do people think we
> > > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > > safety feature poised to prevent more bugs than it's likely to
> > > > introduce.
> > >
> > > No objections from here to do that now even after feature freeze.  I
> > > also wonder, while looking at that, why you don't just remove the last
> > > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > > to use RelationOpenSmgr() anymore, and it could just be removed from
> > > rel.h.
> >
> > Agree, did the same in the attached version, thanks.
>
> +   smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
>   (char *) metapage, true);
> -   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> +   log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>
> At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> to leave the line alone..  I don't mind other sccessive calls if any
> since what I don't like is the notation there.
>

Perhaps, isn't that bad. It is good to follow the practice of using
RelationGetSmgr() for rd_smgr access, IMHO.

> > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
>
> Isn't this a kind of open item?
>

Sorry, I didn't get you. Do I need to move this to some other bucket?

Regards,
Amul




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Kyotaro Horiguchi
At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul  wrote in 
> On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier  wrote:
> >
> > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > We forgot this patch earlier in the commitfest.  Do people think we
> > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > safety feature poised to prevent more bugs than it's likely to
> > > introduce.
> >
> > No objections from here to do that now even after feature freeze.  I
> > also wonder, while looking at that, why you don't just remove the last
> > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > to use RelationOpenSmgr() anymore, and it could just be removed from
> > rel.h.
> 
> Agree, did the same in the attached version, thanks.

+   smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
  (char *) metapage, true);
-   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+   log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,

At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
to leave the line alone..  I don't mind other sccessive calls if any
since what I don't like is the notation there.

> P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Isn't this a kind of open item?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Amul Sul
On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier  wrote:
>
> On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > We forgot this patch earlier in the commitfest.  Do people think we
> > should still get it in on this cycle?  I'm +1 on that, since it's a
> > safety feature poised to prevent more bugs than it's likely to
> > introduce.
>
> No objections from here to do that now even after feature freeze.  I
> also wonder, while looking at that, why you don't just remove the last
> call within src/backend/catalog/heap.c.  This way, nobody is tempted
> to use RelationOpenSmgr() anymore, and it could just be removed from
> rel.h.

Agree, did the same in the attached version, thanks.

Regards,
Amul

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/


v2_Add-RelationGetSmgr-inline-function.patch
Description: Binary data


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-19 Thread Michael Paquier
On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> We forgot this patch earlier in the commitfest.  Do people think we
> should still get it in on this cycle?  I'm +1 on that, since it's a
> safety feature poised to prevent more bugs than it's likely to
> introduce.

No objections from here to do that now even after feature freeze.  I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c.  This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.
--
Michael


signature.asc
Description: PGP signature


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-04-09 Thread Alvaro Herrera
On 2021-Mar-25, Amul Sul wrote:

> Ok, in the attached patch, I have added the inline function to rel.h, and for
> that, I end up including smgr.h to rel.h. I tried to replace all rel->rd_smgr
> by RelationGetSmgr() function and removed the RelationOpenSmgr() call from
> the nearby to it which I don't think needed at all.

We forgot this patch earlier in the commitfest.  Do people think we
should still get it in on this cycle?  I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

-- 
Álvaro Herrera   Valdivia, Chile




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-25 Thread Amul Sul
On Thu, Mar 25, 2021 at 12:10 PM Kyotaro Horiguchi
 wrote:
>
> Sorry for the bug.
>
> At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane  wrote in
> > Amul Sul  writes:
> > > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
> > >> static inline struct SMgrRelationData *
> > >> RelationGetSmgr(Relation rel)
> > >> {
> > >> if (unlikely(rel->rd_smgr == NULL))
> > >> RelationOpenSmgr(rel);
> > >> return rel->rd_smgr;
> > >> }
> >
> > > A quick question: Can't it be a macro instead of an inline function
> > > like other macros we have in rel.h?
> >
> > The multiple-evaluation hazard seems like an issue.  We've tolerated
> > such hazards in the past, but mostly just because we weren't relying
> > on static inlines being available, so there wasn't a good way around
> > it.
> >
> > Also, the conditional evaluation here would look rather ugly
> > in a macro, I think, if indeed you could do it at all without
> > provoking compiler warnings.
>
> FWIW, +1 for the function as is.
>

Ok, in the attached patch, I have added the inline function to rel.h, and for
that, I end up including smgr.h to rel.h. I tried to replace all rel->rd_smgr
by RelationGetSmgr() function and removed the RelationOpenSmgr() call from
the nearby to it which I don't think needed at all.

Regards,
Amul
From dfff4920996aaa443986d43a7bb1231a0230a1e5 Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Thu, 25 Mar 2021 06:27:58 -0400
Subject: [PATCH] Add RelationGetSmgr() inline function

---
 contrib/amcheck/verify_nbtree.c   |  3 +-
 contrib/bloom/blinsert.c  |  6 +--
 contrib/pg_prewarm/autoprewarm.c  |  4 +-
 contrib/pg_prewarm/pg_prewarm.c   |  5 +--
 contrib/pg_visibility/pg_visibility.c |  7 ++--
 src/backend/access/gist/gistbuild.c   | 14 +++
 src/backend/access/hash/hashpage.c|  4 +-
 src/backend/access/heap/heapam_handler.c  |  7 ++--
 src/backend/access/heap/rewriteheap.c | 11 ++
 src/backend/access/heap/visibilitymap.c   | 46 ++-
 src/backend/access/nbtree/nbtree.c|  8 ++--
 src/backend/access/nbtree/nbtsort.c   | 14 ++-
 src/backend/access/spgist/spginsert.c | 16 
 src/backend/access/table/tableam.c|  7 +---
 src/backend/catalog/index.c   |  5 +--
 src/backend/catalog/storage.c | 17 +
 src/backend/commands/tablecmds.c  |  7 ++--
 src/backend/storage/buffer/bufmgr.c   | 24 +++-
 src/backend/storage/freespace/freespace.c | 40 ++--
 src/include/utils/rel.h   | 13 +++
 20 files changed, 118 insertions(+), 140 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b31906cbcfd..7f34eed04ba 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -301,8 +301,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 		bool		heapkeyspace,
 	allequalimage;
 
-		RelationOpenSmgr(indrel);
-		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
+		if (!smgrexists(RelationGetSmgr(indrel), MAIN_FORKNUM))
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("index \"%s\" lacks a main relation fork",
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index d37ceef753a..79fb92debc5 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -178,9 +178,9 @@ blbuildempty(Relation index)
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
+	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 BLOOM_METAPAGE_BLKNO, metapage, true);
 
 	/*
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
 	 * write did not go through shared_buffers and therefore a concurrent
 	 * checkpoint may have moved the redo pointer past our xlog record.
 	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b3f73ea92d6..0289ea657cb 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -514,15 +514,13 @@ autoprewarm_database_main(Datum main_arg)
 			old_blk->filenode != blk->filenode ||
 			old_blk->forknum != blk->forknum)
 		{
-			RelationOpenSmgr(rel);
-
 			/*
 			 * smgrexists is not safe for illegal forknum, hence check whether
 			 * the passed forknum is valid before using it in smgrexists.
 			 */
 			if (blk->forknum > InvalidForkNumber &&
 blk->forknum <= MAX_FORKNUM &&
-smgrexists(rel->rd_smgr, blk->forknum))
+smgrexists(RelationGetSmgr(rel), blk->forknum))
 nblocks = RelationGetNumberOfBlocksInFork(rel, blk->forknum);
 			else
 

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-25 Thread Kyotaro Horiguchi
Sorry for the bug.

At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane  wrote in 
> Amul Sul  writes:
> > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
> >> static inline struct SMgrRelationData *
> >> RelationGetSmgr(Relation rel)
> >> {
> >> if (unlikely(rel->rd_smgr == NULL))
> >> RelationOpenSmgr(rel);
> >> return rel->rd_smgr;
> >> }
> 
> > A quick question: Can't it be a macro instead of an inline function
> > like other macros we have in rel.h?
> 
> The multiple-evaluation hazard seems like an issue.  We've tolerated
> such hazards in the past, but mostly just because we weren't relying
> on static inlines being available, so there wasn't a good way around
> it.
> 
> Also, the conditional evaluation here would look rather ugly
> in a macro, I think, if indeed you could do it at all without
> provoking compiler warnings.

FWIW, +1 for the function as is.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Tom Lane
Amul Sul  writes:
> On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
>> static inline struct SMgrRelationData *
>> RelationGetSmgr(Relation rel)
>> {
>> if (unlikely(rel->rd_smgr == NULL))
>> RelationOpenSmgr(rel);
>> return rel->rd_smgr;
>> }

> A quick question: Can't it be a macro instead of an inline function
> like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue.  We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Amul Sul
On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
>
> Amul Sul  writes:
> > On Tue, Mar 23, 2021 at 8:59 PM Tom Lane  wrote:
> >> On closer inspection, I believe the true culprit is c6b92041d,
> >> which did this:
> >> -   heap_sync(state->rs_new_rel);
> >> +   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
> >> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> > So we also need to make sure of the
> > RelationOpenSmgr() call before smgrimmedsync() as proposed previously.
>
> I wonder if we should try to get rid of this sort of bug by banning
> direct references to rd_smgr?  That is, write the above and all
> similar code like
>
> smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
>
> where we provide something like
>
> static inline struct SMgrRelationData *
> RelationGetSmgr(Relation rel)
> {
> if (unlikely(rel->rd_smgr == NULL))
> RelationOpenSmgr(rel);
> return rel->rd_smgr;
> }
>
> and then we could get rid of most or all other RelationOpenSmgr calls.
>

+1

> This might create more code bloat than it's really worth, but
> it'd be a simple and mechanically-checkable scheme.

I think that will be fine, one-time pain. If you want I will do those changes.

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

Regards,
Amul




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Tom Lane
Amul Sul  writes:
> On Tue, Mar 23, 2021 at 8:59 PM Tom Lane  wrote:
>> On closer inspection, I believe the true culprit is c6b92041d,
>> which did this:
>> -   heap_sync(state->rs_new_rel);
>> +   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>> heap_sync was careful about opening rd_smgr, the new code not so much.

> So we also need to make sure of the
> RelationOpenSmgr() call before smgrimmedsync() as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr?  That is, write the above and all
similar code like

smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-23 Thread Amul Sul
On Tue, Mar 23, 2021 at 8:59 PM Tom Lane  wrote:
>
> I wrote:
> > Michael Paquier  writes:
> >> One bisect later, the winner is:
> >> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> >> author: Tom Lane 
> >> date: Sun, 30 Aug 2020 12:21:51 -0400
> >> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
>
> > I think that's an artifact.  That commit didn't touch anything related to
> > relation opening or closing.  What it could have done, though, is change
> > CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> > thus causing us to follow the buggy code path where before we didn't.
>
> On closer inspection, I believe the true culprit is c6b92041d,
> which did this:
>
>  */
> if (RelationNeedsWAL(state->rs_new_rel))
> -   heap_sync(state->rs_new_rel);
> +   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>
> logical_end_heap_rewrite(state);
>
> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> I read the rest of that commit and didn't see any other equivalent
> bugs, but I might've missed something.
>

I too didn't find any other place replacing heap_sync() or equivalent place from
this commit where smgr* operation reaches without necessary precautions call.
heap_sync() was calling RelationOpenSmgr() through FlushRelationBuffers() before
it reached smgrimmedsync().  So we also need to make sure of the
RelationOpenSmgr() call before smgrimmedsync() as proposed previously.

Regards,
Amul




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-23 Thread Tom Lane
I wrote:
> Michael Paquier  writes:
>> One bisect later, the winner is:
>> commit: 3d351d916b20534f973eda760cde17d96545d4c4
>> author: Tom Lane 
>> date: Sun, 30 Aug 2020 12:21:51 -0400
>> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I think that's an artifact.  That commit didn't touch anything related to
> relation opening or closing.  What it could have done, though, is change
> CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> thus causing us to follow the buggy code path where before we didn't.

On closer inspection, I believe the true culprit is c6b92041d,
which did this:

 */
if (RelationNeedsWAL(state->rs_new_rel))
-   heap_sync(state->rs_new_rel);
+   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
logical_end_heap_rewrite(state);

heap_sync was careful about opening rd_smgr, the new code not so much.

I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-23 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
>> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
>> but the test is quick enough to reproduce.  It would be good to bisect
>> the origin point here as a first step.

> One bisect later, the winner is:
> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> author: Tom Lane 
> date: Sun, 30 Aug 2020 12:21:51 -0400
> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I am too tired to poke at that today, so I'll try tomorrow.  Tom may
> beat me at that though.

I think that's an artifact.  That commit didn't touch anything related to
relation opening or closing.  What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

The interesting question here seems to be "why didn't the existing
CLOBBER_CACHE_ALWAYS buildfarm testing catch this?".  It looks to me like
the answer is that it only happens for an empty table (or at least one
where the data pattern is such that we skip the RelationOpenSmgr call
earlier in end_heap_rewrite) and we don't happen to be exercising that
exact scenario in the regression tests.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-23 Thread Michael Paquier
On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
> but the test is quick enough to reproduce.  It would be good to bisect
> the origin point here as a first step.

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane 
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I am too tired to poke at that today, so I'll try tomorrow.  Tom may
beat me at that though.
--
Michael


signature.asc
Description: PGP signature


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-23 Thread Michael Paquier
On Tue, Mar 23, 2021 at 10:52:09AM +0530, Neha Sharma wrote:
> Sure, will give a regression run with CCA enabled.

I can confirm the regression between 13 and HEAD, so I have added an
open item.  It would be good to figure out the root issue here, and I
am ready to bet that the problem is deeper than it looks and that more
code paths could be involved.

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce.  It would be good to bisect
the origin point here as a first step.
--
Michael


signature.asc
Description: PGP signature


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Neha Sharma
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul  wrote:

> On Mon, Mar 22, 2021 at 3:03 PM Amit Langote 
> wrote:
> >
> > On Mon, Mar 22, 2021 at 5:26 PM Amul Sul  wrote:
> > > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > > rwstate->rs_new_rel->rd_smgr correctly but next line
> tuplesort_begin_cluster()
> > > get called which cause the system cache invalidation and due to CCA
> setting,
> > > wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the
> subsequent
> > > operations and causes segmentation fault.
> > >
> > > By calling RelationOpenSmgr() before calling smgrimmedsync() in
> > > end_heap_rewrite() would fix the failure. Did the same in the attached
> patch.
> >
> > That makes sense.  I see a few commits in the git history adding
> > RelationOpenSmgr() before a smgr* operation, whenever such a problem
> > would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> > for example.
> >
>
> Thanks for the confirmation.
>
> > I do wonder if there are still other smgr* operations in the source
> > code that are preceded by operations that would invalidate the
> > SMgrRelation that those smgr* operations would be called with.  For
> > example, the smgrnblocks() in gistBuildCallback() may get done too
> > late than a corresponding RelationOpenSmgr() on the index relation.
> >
>
> I did the check for gistBuildCallback() by adding Assert(index->rd_smgr)
> before
> smgrnblocks() with CCA setting and didn't see any problem there.
>
> I think the easiest way to find that is to run a regression suite with CCA
> build, perhaps, there is no guarantee that regression will hit all smgr*
> operations, but that might hit most of them.


Sure, will give a regression run with CCA enabled.

>
> Regards,
> Amul
>

Regards,
Neha Sharma


Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Amul Sul
On Mon, Mar 22, 2021 at 3:03 PM Amit Langote  wrote:
>
> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul  wrote:
> > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > rwstate->rs_new_rel->rd_smgr correctly but next line 
> > tuplesort_begin_cluster()
> > get called which cause the system cache invalidation and due to CCA setting,
> > wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the 
> > subsequent
> > operations and causes segmentation fault.
> >
> > By calling RelationOpenSmgr() before calling smgrimmedsync() in
> > end_heap_rewrite() would fix the failure. Did the same in the attached 
> > patch.
>
> That makes sense.  I see a few commits in the git history adding
> RelationOpenSmgr() before a smgr* operation, whenever such a problem
> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> for example.
>

Thanks for the confirmation.

> I do wonder if there are still other smgr* operations in the source
> code that are preceded by operations that would invalidate the
> SMgrRelation that those smgr* operations would be called with.  For
> example, the smgrnblocks() in gistBuildCallback() may get done too
> late than a corresponding RelationOpenSmgr() on the index relation.
>

I did the check for gistBuildCallback() by adding Assert(index->rd_smgr)  before
smgrnblocks() with CCA setting and didn't see any problem there.

I think the easiest way to find that is to run a regression suite with CCA
build, perhaps, there is no guarantee that regression will hit all smgr*
operations, but that might hit most of them.

Regards,
Amul




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Amit Langote
On Mon, Mar 22, 2021 at 5:26 PM Amul Sul  wrote:
> In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> rwstate->rs_new_rel->rd_smgr correctly but next line tuplesort_begin_cluster()
> get called which cause the system cache invalidation and due to CCA setting,
> wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the subsequent
> operations and causes segmentation fault.
>
> By calling RelationOpenSmgr() before calling smgrimmedsync() in
> end_heap_rewrite() would fix the failure. Did the same in the attached patch.

That makes sense.  I see a few commits in the git history adding
RelationOpenSmgr() before a smgr* operation, whenever such a problem
would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
for example.

I do wonder if there are still other smgr* operations in the source
code that are preceded by operations that would invalidate the
SMgrRelation that those smgr* operations would be called with.  For
example, the smgrnblocks() in gistBuildCallback() may get done too
late than a corresponding RelationOpenSmgr() on the index relation.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-22 Thread Amul Sul
In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly but next line tuplesort_begin_cluster()
get called which cause the system cache invalidation and due to CCA setting,
wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the subsequent
operations and causes segmentation fault.

By calling RelationOpenSmgr() before calling smgrimmedsync() in
end_heap_rewrite() would fix the failure. Did the same in the attached patch.

Regards,
Amul



On Mon, Mar 22, 2021 at 11:53 AM Neha Sharma
 wrote:
>
> Hello,
>
> While executing the below test case server crashed with Segfault 11 on master 
> branch.
> I have enabled the CLOBBER_CACHE_ALWAYS in src/include/pg_config_manual.h
>
> Issue is only reproducing on master branch.
>
> Test Case:
> CREATE TABLE sm_5_323_table (col1 numeric);
> CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
>
> CLUSTER sm_5_323_table USING sm_5_323_idx;
>
> \! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h 
> localhost -p 5432 -d postgres
>
> Test case output:
> edb@edb:~/PGClobber_build/postgresql/inst/bin$ ./psql postgres
> psql (14devel)
> Type "help" for help.
>
> postgres=# CREATE TABLE sm_5_323_table (col1 numeric);
> CREATE TABLE
> postgres=# CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
> CREATE INDEX
> postgres=# CLUSTER sm_5_323_table USING sm_5_323_idx;
> CLUSTER
> postgres=# \! /PGClobber_build/postgresql/inst/bin/clusterdb -t 
> sm_5_323_table -U edb -h localhost -p 5432 -d postgres
> clusterdb: error: clustering of table "sm_5_323_table" in database "postgres" 
> failed: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> Stack Trace:
> Core was generated by `postgres: edb postgres 127.0.0.1(50978) CLUSTER
>  '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x55e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, 
> behavior=1) at md.c:485
> 485 if (reln->md_num_open_segs[forknum] > 0)
> (gdb) bt
> #0  0x55e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, 
> behavior=1) at md.c:485
> #1  0x55e5c85eb2f0 in mdnblocks (reln=0x0, forknum=MAIN_FORKNUM) at 
> md.c:768
> #2  0x55e5c85eb61b in mdimmedsync (reln=0x0, 
> forknum=forknum@entry=MAIN_FORKNUM) at md.c:930
> #3  0x55e5c85ec6e5 in smgrimmedsync (reln=, 
> forknum=forknum@entry=MAIN_FORKNUM) at smgr.c:662
> #4  0x55e5c81ae28b in end_heap_rewrite (state=state@entry=0x55e5ca5d1d70) 
> at rewriteheap.c:342
> #5  0x55e5c81a32ea in heapam_relation_copy_for_cluster 
> (OldHeap=0x7f212ce41ba0, NewHeap=0x7f212ce41058, OldIndex=, 
> use_sort=, OldestXmin=,
> xid_cutoff=, multi_cutoff=0x7ffcba6ebe64, 
> num_tuples=0x7ffcba6ebe68, tups_vacuumed=0x7ffcba6ebe70, 
> tups_recently_dead=0x7ffcba6ebe78) at heapam_handler.c:984
> #6  0x55e5c82f218a in table_relation_copy_for_cluster 
> (tups_recently_dead=0x7ffcba6ebe78, tups_vacuumed=0x7ffcba6ebe70, 
> num_tuples=0x7ffcba6ebe68, multi_cutoff=0x7ffcba6ebe64,
> xid_cutoff=0x7ffcba6ebe60, OldestXmin=, 
> use_sort=, OldIndex=0x7f212ce40670, NewTable=0x7f212ce41058, 
> OldTable=0x7f212ce41ba0)
> at ../../../src/include/access/tableam.h:1656
> #7  copy_table_data (pCutoffMulti=, pFreezeXid= pointer>, pSwapToastByContent=, verbose=, 
> OIDOldIndex=,
> OIDOldHeap=16384, OIDNewHeap=) at cluster.c:908
> #8  rebuild_relation (verbose=, indexOid=, 
> OldHeap=) at cluster.c:604
> #9  cluster_rel (tableOid=, indexOid=, 
> params=) at cluster.c:427
> #10 0x55e5c82f2b7f in cluster (pstate=pstate@entry=0x55e5ca5315c0, 
> stmt=stmt@entry=0x55e5ca510368, isTopLevel=isTopLevel@entry=true) at 
> cluster.c:195
> #11 0x55e5c85fcbc6 in standard_ProcessUtility (pstmt=0x55e5ca510430, 
> queryString=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
> queryEnv=0x0, dest=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:822
> #12 0x55e5c85fd436 in ProcessUtility (pstmt=pstmt@entry=0x55e5ca510430, 
> queryString=, context=context@entry=PROCESS_UTILITY_TOPLEVEL, 
> params=,
> queryEnv=, dest=dest@entry=0x55e5ca510710, 
> qc=0x7ffcba6ec340) at utility.c:525
> #13 0x55e5c85f6148 in PortalRunUtility 
> (portal=portal@entry=0x55e5ca570d70, pstmt=pstmt@entry=0x55e5ca510430, 
> isTopLevel=isTopLevel@entry=true,
> setHoldSnapshot=setHoldSnapshot@entry=false, 
> dest=dest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1159
> #14 0x55e5c85f71a4 in PortalRunMulti (portal=portal@entry=0x55e5ca570d70, 
> isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
> dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, 
> qc=qc@entry=0x7ffcba6ec340) at pquery.c:1305
> #15 0x55e5c85f8823 in PortalRun (portal=portal@entry=0x55e5ca570d70, 
> count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
>