Re: In-place persistance change of a relation

2020-11-11 Thread Kyotaro Horiguchi
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost  wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost  
> > wrote in 
> > > * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > > trials of several ways, I drifted to the following way after poking
> > > > several ways.
> > > > 
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > > 
> > > > It always skips table copy in the SET UNLOGGED case, and only when
> > > > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > > > working by some brief testing by hand.
> > > 
> > > Somehow missed that this patch more-or-less does what I was referring to
> > > down-thread, but I did want to mention that it looks like it's missing a
> > > necessary FlushRelationBuffers() call before the sync, otherwise there
> > > could be dirty buffers for the relation that's being set to LOGGED (with
> > > wal_level=minimal), which wouldn't be good.  See the comments above
> > > smgrimmedsync().
> > 
> > Right. Thanks.  However, since SetRelFileNodeBuffersPersistence()
> > called just above scans shared buffers so I don't want to just call
> > FlushRelationBuffers() separately.  Instead, I added buffer-flush to
> > SetRelFileNodeBuffersPersistence().
> 
> Maybe I'm missing something, but it sure looks like in the patch that
> SetRelFileNodeBuffersPersistence() is being called after the
> smgrimmedsync() call, and I don't think you get to just switch the order
> of those- the sync is telling the kernel to make sure it's written to
> disk, while the FlushBuffer() is just writing it into the kernel but
> doesn't provide any guarantee that the data has actually made it to
> disk.  We have to FlushBuffer() first, and then call smgrimmedsync().
> Perhaps there's a way to avoid having to go through shared buffers
> twice, and I generally agreed it'd be good if we could avoid doing so,
> but this approach doesn't look like it actually works.

Yeah, sorry for the rare-baked version.. I was confused about the
order at the time.  The next version works like this:

LOGGED->UNLOGGED


for each relations:
   minimal>
  
  if it is index call ambuildempty() (which syncs the init fork)
  else WAL-log smgr_create then sync the init file.
  
...
commit time:
  
abort time:
  
  

UNLOGGED->LOGGED


for each relations:
   minimal>
  
  
  
...
commit time:
  
  
abort time:
  


> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> > 
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> >   be safely roll-backed. (It might be better to drop buffers).
> 
> Not sure if it'd be better to drop buffers or not, but figuring out how
> to deal with rollback seems pretty important.  How is the persistence
> change in the catalog not WAL-logged though..?

Rollback works as the above. Buffer persistence change is registered
in pending-deletes.  Persistence change in catalog is rolled back in
the ordinary way (or automatically).

If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL.  However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.

> > - This version handles indexes but not yet handle toast relatins.
> 
> Would need to be fixed, of course.

Fixed.

> > - tableAMs are supposed to support this feature. (but I'm not sure
> >   it's worth allowing them not to do so).
> 
> Seems like they should.

Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that.  That introduces new
include of amapi.h to storage.c, which is a bit uneasy.

The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted.  However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.

As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-place persistance change of a relation

2020-11-11 Thread Stephen Frost
Greetings,

* Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost  wrote 
> in 
> > * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > trials of several ways, I drifted to the following way after poking
> > > several ways.
> > > 
> > > 1. Flip BM_PERMANENT of active buffers
> > > 2. adding/removing init fork
> > > 3. sync files,
> > > 4. Flip pg_class.relpersistence.
> > > 
> > > It always skips table copy in the SET UNLOGGED case, and only when
> > > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > > working by some brief testing by hand.
> > 
> > Somehow missed that this patch more-or-less does what I was referring to
> > down-thread, but I did want to mention that it looks like it's missing a
> > necessary FlushRelationBuffers() call before the sync, otherwise there
> > could be dirty buffers for the relation that's being set to LOGGED (with
> > wal_level=minimal), which wouldn't be good.  See the comments above
> > smgrimmedsync().
> 
> Right. Thanks.  However, since SetRelFileNodeBuffersPersistence()
> called just above scans shared buffers so I don't want to just call
> FlushRelationBuffers() separately.  Instead, I added buffer-flush to
> SetRelFileNodeBuffersPersistence().

Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk.  We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.

> FWIW this is a revised version of the PoC, which has some known
> problems.
> 
> - Flipping of Buffer persistence is not WAL-logged nor even be able to
>   be safely roll-backed. (It might be better to drop buffers).

Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important.  How is the persistence
change in the catalog not WAL-logged though..?

> - This version handles indexes but not yet handle toast relatins.

Would need to be fixed, of course.

> - tableAMs are supposed to support this feature. (but I'm not sure
>   it's worth allowing them not to do so).

Seems like they should.

Thanks,

Stephen


signature.asc
Description: PGP signature