On Tuesday, September 8, 2020 1:02 PM, Amit Kapila wrote:

Hello,

> On Mon, Sep 7, 2020 at 1:33 PM [email protected]
> <[email protected]> wrote:
> >
> > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <[email protected]> wrote:
> > > >
> > > > Amit Kapila <[email protected]> writes:
> > > > > Even if the relation is locked, background processes like
> > > > > checkpointer can still touch the relation which might cause
> > > > > problems. Consider a case where we extend the relation but
> > > > > didn't flush the newly added pages. Now during truncate
> > > > > operation, checkpointer can still flush those pages which can
> > > > > cause trouble for truncate. But, I think in the recovery path
> > > > > such cases won't cause a problem.
> > > >
> > > > I wouldn't count on that staying true ...
> > > >
> > > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jC[email protected]
> > > >
> > >
> > > I don't think that proposal will matter after commit c5315f4f44
> > > because we are caching the size/blocks for recovery while doing
> > > extend (smgrextend). In the above scenario, we would have cached the
> > > blocks which will be used at later point of time.
> > >
> >
> > I'm guessing we can still pursue this idea of improving the recovery path first.
> >
>
> I think so.
Alright, so I've updated the patch; it now passes the regression and TAP tests,
and it compiles and builds as intended.
> > I'm working on an updated patch version, because the CFBot's telling
> > that postgres fails to build (one of the recovery TAP tests fails).
> > I'm still working on refactoring my patch, but have yet to find a proper
> solution at the moment.
> > So I'm going to continue my investigation.
> >
> > Attached is an updated WIP patch.
> > I'd appreciate if you could take a look at the patch as well.
> >
>
> So, I see the below log as one of the problems:
> 2020-09-07 06:20:33.918 UTC [10914] LOG: redo starts at 0/15FFEC0
> 2020-09-07 06:20:33.919 UTC [10914] FATAL: unexpected data beyond EOF
> in block 1 of relation base/13743/24581
>
> This indicates that we missed invalidating some buffer which should have
> been invalidated. If you are able to reproduce this locally then I suggest to
> first
> write a simple patch without the check of the threshold, basically in recovery
> always try to use the new way to invalidate the buffer. That will reduce the
> scope of the code that can create a problem. Let us know if the problem still
> exists and share the logs. BTW, I think I see one problem in the code:
>
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> + bufHdr->firstDelBlock[j])
>
> Here, I think you need to use 'i' not 'j' for forkNum and
> firstDelBlock as those are arrays w.r.t forks. That might fix the
> problem but I am not sure as I haven't tried to reproduce it.
Thanks for the advice. Right, that turned out to be the cause of the error,
and fixing it (indexing by fork) resolved the case.
I also followed Tsunakawa-san's advice to use more meaningful iterator names
instead of "i" and "j", for readability.
I also added a new function, DropRelFileNodeBuffersOfFork, which is used when
the relation fork is bigger than the threshold, i.e. when
(nblocks > BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for that function.
However, as expected from the previous discussions, this path is a bit slower than
the standard buffer invalidation process, because the whole of shared buffers is
scanned nfork times.
Currently, I set the threshold to (NBuffers / 500).
Feedback on the patch/testing are very much welcome.
Best regards,
Kirk Jamison
v12-Speedup-dropping-of-relation-buffers-during-recovery.patch
