RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila 
> wrote in
 > I also can't think of a way to use an optimized path for such cases
> > but I don't agree with your comment on if it is common enough that we
> > leave this optimization entirely for the truncate path.
> 
> An ugly way to cope with it would be to let other smgr functions
> manage the cached value, for example, by calling smgrnblocks while
> InRecovery.  Or letting smgr remember the maximum block number ever
> accessed.  But we cannot fully rely on that since smgr can be closed
> midst of a session and smgr doesn't offer such persistence.  In the
> first place smgr doesn't seem to be the place to store such persistent
> information.

Yeah, considering the future evolution of this patch to operations during 
normal running, I don't think that would be a good fit, either.

Then, as we're currently targeting just recovery, the options we can take 
are below.  Which one would you vote for?  My choice would be (3) > (2) > (1).


(1)
Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
This brings the most safety and code consistency.
But this would not benefit from the optimization for TRUNCATE in unexpectedly many 
cases -- when TOAST storage exists but is not written to, or the FSM/VM is not 
updated after a checkpoint.


(2)
Use the cached flag in VACUUM (0003), but use InRecovery instead of the cached 
flag in TRUNCATE (0004).
This benefits from the optimization in all cases.
But this lacks code consistency.
You may be worried about safety if the startup process smgrclose()s the relation 
after the shared buffer flushing hits disk full.  However, the startup process 
doesn't smgrclose(), so it should be safe.  Even if the startup process did 
smgrclose(), the worst consequence would be a PANIC shutdown after repeated failure 
of checkpoints due to lingering orphaned dirty shared buffers.  Accept it as 
Thomas-san's devil's suggestion.


(3)
Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
This benefits from the optimization in all cases.
The code is consistent and smaller.
As for the safety, this is the same as (2), but it applies to VACUUM as well.


Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread k.jami...@fujitsu.com
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: 
> From: Kyotaro Horiguchi 
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila
> >  wrote in
>  > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that
> > > we leave this optimization entirely for the truncate path.
> >
> > An ugly way to cope with it would be to let other smgr functions
> > manage the cached value, for example, by calling smgrnblocks while
> > InRecovery.  Or letting smgr remember the maximum block number ever
> > accessed.  But we cannot fully rely on that since smgr can be closed
> > midst of a session and smgr doesn't offer such persistence.  In the
> > first place smgr doesn't seem to be the place to store such persistent
> > information.
> 
> Yeah, considering the future evolution of this patch to operations during
> normal running, I don't think that would be a good fit, either.
> 
> Then, the as we're currently targeting just recovery, the options we can take
> are below.  Which would vote for?  My choice would be (3) > (2) > (1).
> 
> 
> (1)
> Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
> This brings the most safety and code consistency.
> But this would not benefit from optimization for TRUNCATE in unexpectedly
> many cases -- when TOAST storage exists but it's not written, or FSM/VM is
> not updated after checkpoint.
> 
> 
> (2)
> Use the cached flag in VACUUM (0003), but use InRecovery instead of the
> cached flag in TRUNCATE (0004).
> This benefits from the optimization in all cases.
> But this lacks code consistency.
> You may be afraid of safety if the startup process smgrclose()s the relation
> after the shared buffer flushing hits disk full.  However, startup process
> doesn't smgrclose(), so it should be safe.  Just in case the startup process
> smgrclose()s, the worst consequence is PANIC shutdown after repeated
> failure of checkpoints due to lingering orphaned dirty shared buffers.  Accept
> it as Thomas-san's devil's suggestion.
> 
> 
> (3)
> Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
> This benefits from the optimization in all cases.
> The code is consistent and smaller.
> As for the safety, this is the same as (2), but it applies to VACUUM as well.

If we want code consistency, then we'd go with either 1 or 3.
And if we want to take the benefits of the optimization for both
DropRelFileNodeBuffers and DropRelFileNodesAllBuffers, then I'd choose 3.
However, if the reviewers and committer want to make use of the "cached" flag,
then we can live with the "cached" value in place there even if it's not common to
get the optimization for the TRUNCATE path; only VACUUM would take the most
benefit.
My vote is also (3), then (2), then (1).

Regards,
Kirk Jamison




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-09 Thread Amit Kapila
On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila  
> wrote in
> > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> >  wrote:
> > >
> > > From: Jamison, Kirk/ジャミソン カーク 
> > > > Because one of the rel's cached value was false, it forced the
> > > > full-scan path for TRUNCATE.
> > > > Is there a possible workaround for this?
> > >
> > > Hmm, the other two relfilenodes are for the TOAST table and index of the 
> > > target table.  I think the INSERT didn't access those TOAST relfilenodes 
> > > because the inserted data was stored in the main storage.  But TRUNCATE 
> > > always truncates all the three relfilenodes.  So, the standby had not 
> > > opened the relfilenode for the TOAST stuff or cached its size when 
> > > replaying the TRUNCATE.
> > >
> > > I'm afraid this is more common than we can ignore and accept the slow 
> > > traditional path, but I don't think of a good idea to use the cached flag.
> > >
> >
> > I also can't think of a way to use an optimized path for such cases
> > but I don't agree with your comment on if it is common enough that we
> > leave this optimization entirely for the truncate path.
>
> Mmm. At least btree doesn't need to call smgrnblocks except at
> expansion, so we cannot get to the optimized path in major cases of
> truncation involving btree (and/or maybe other indexes).
>

AFAICS, btree insert should call smgrnblocks via
btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
Similarly, delete should also call smgrnblocks. Can you be a bit more
specific about the btree case you have in mind?

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-09 Thread Kyotaro Horiguchi
At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila  wrote 
in 
> On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila  
> > wrote in
> > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> > >  wrote:
> > > >
> > > > From: Jamison, Kirk/ジャミソン カーク 
> > > > > Because one of the rel's cached value was false, it forced the
> > > > > full-scan path for TRUNCATE.
> > > > > Is there a possible workaround for this?
> > > >
> > > > Hmm, the other two relfilenodes are for the TOAST table and index of 
> > > > the target table.  I think the INSERT didn't access those TOAST 
> > > > relfilenodes because the inserted data was stored in the main storage.  
> > > > But TRUNCATE always truncates all the three relfilenodes.  So, the 
> > > > standby had not opened the relfilenode for the TOAST stuff or cached 
> > > > its size when replaying the TRUNCATE.
> > > >
> > > > I'm afraid this is more common than we can ignore and accept the slow 
> > > > traditional path, but I don't think of a good idea to use the cached 
> > > > flag.
> > > >
> > >
> > > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that we
> > > leave this optimization entirely for the truncate path.
> >
> > Mmm. At least btree doesn't need to call smgrnblocks except at
> > expansion, so we cannot get to the optimized path in major cases of
> > truncation involving btree (and/or maybe other indexes).
> >
> 
> AFAICS, btree insert should call smgrnblocks via
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> Similarly delete should also call smgrnblocks. Can you be bit more
> specific related to the btree case you have in mind?

Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is
called during buffer loading while in recovery. So, smgrnblocks is called
for indexes if any update happens on the heap relation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-09 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> called during buffer loading while recovery. So, smgrnblock is called
> for indexes if any update happens on the heap relation.

I had misunderstood; I thought you were saying there's no problem with the TOAST 
index because TRUNCATE creates the meta page, resulting in the caching of the page 
and the size of the relation.  Anyway, I'm relieved the concern has disappeared.

Then, I'd like to hear your vote on the options in my previous mail...


Regards
Takayuki Tsunakawa





Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-09 Thread Amit Kapila
On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila  
> wrote in
> > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > expansion, so we cannot get to the optimized path in major cases of
> > > truncation involving btree (and/or maybe other indexes).
> > >
> >
> > AFAICS, btree insert should call smgrnblocks via
> > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> > Similarly delete should also call smgrnblocks. Can you be bit more
> > specific related to the btree case you have in mind?
>
> Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> called during buffer loading while recovery. So, smgrnblock is called
> for indexes if any update happens on the heap relation.
>

Okay, so this means that we can get the benefit of the optimization in
many cases in the Truncate code path as well even if we use the 'cached'
flag? If so, then I would prefer to keep the code consistent for both the
vacuum and truncate recovery code paths.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote: 
> On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila
> >  wrote in
> > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > > expansion, so we cannot get to the optimized path in major cases
> > > > of truncation involving btree (and/or maybe other indexes).
> > > >
> > >
> > > AFAICS, btree insert should call smgrnblocks via
> > >
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExte
> nded->XLogReadBufferExtended->smgrnblocks.
> > > Similarly delete should also call smgrnblocks. Can you be bit more
> > > specific related to the btree case you have in mind?
> >
> > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> > called during buffer loading while recovery. So, smgrnblock is called
> > for indexes if any update happens on the heap relation.
> >
> 
> Okay, so this means that we can get the benefit of optimization in many cases
> in the Truncate code path as well even if we use 'cached'
> flag? If so, then I would prefer to keep the code consistent for both vacuum
> and truncate recovery code path.

Yes, I have tested that the optimization works for index relations.

I have attached V34, following the condition that we use the "cached" flag
for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for
consistency.
I added a comment in 0004 about the limitation of the optimization when there
are TOAST relations that use a NON-PLAIN strategy, i.e. the optimization works
if the data types used are integers, OID, bytea, etc., but for TOAST-able data
types like text, the optimization will be skipped and a full scan is forced
during recovery.

Regards,
Kirk Jamison



v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> I added comment in 0004 the limitation of optimization when there are TOAST
> relations that use NON-PLAIN strategy. i.e. The optimization works if the data
> types used are integers, OID, bytea, etc. But for TOAST-able data types like 
> text,
> the optimization will be skipped and force a full scan during recovery.

bytea is a TOAST-able type.


+   /*
+* Enter the optimization if the total number of blocks to be
+* invalidated for all relations is below the full scan threshold.
+*/
+   if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

Checking cached here doesn't seem to be necessary, because if cached is false, 
the control goes to the full scan path as below:

+   if (!cached)
+   goto buffer_full_scan;
+
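
To illustrate the control flow I mean, here is a small standalone sketch (this is
not the patch code; the function, the data, and the threshold value are all made
up for illustration).  By the time the threshold test runs, every fork size has
already come from the cache, so "cached" cannot be false there:

#include <stdbool.h>
#include <stdio.h>

#define BUF_DROP_FULL_SCAN_THRESHOLD 32     /* made-up value for the sketch */

static void
drop_buffers(int nforks, const int *fork_nblocks, const bool *fork_cached)
{
    int     nBlocksToInvalidate = 0;
    bool    cached = true;
    int     i;

    for (i = 0; i < nforks; i++)
    {
        cached = fork_cached[i];    /* stands in for smgrnblocks(..., &cached) */
        if (!cached)
            goto buffer_full_scan;  /* size unknown: optimization impossible */
        nBlocksToInvalidate += fork_nblocks[i];
    }

    /*
     * "cached" is necessarily true if we get here, so the "cached &&" part
     * of the threshold test is redundant.
     */
    if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
    {
        printf("optimized path: invalidate %d blocks\n", nBlocksToInvalidate);
        return;
    }

buffer_full_scan:
    printf("full buffer pool scan\n");
}

int
main(void)
{
    int     nblocks[] = {4, 2, 1};
    bool    cached[] = {true, true, true};

    drop_buffers(3, nblocks, cached);
    return 0;
}

So the threshold condition could be reduced to just the block-count test, which
is what I was suggesting above.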


Regards
Takayuki Tsunakawa



Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread Amit Kapila
On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
 wrote:
>
> Yes, I have tested that optimization works for index relations.
>
> I have attached the V34, following the conditions that we use "cached" flag
> for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers() for
> consistency.
> I added comment in 0004 the limitation of optimization when there are TOAST
> relations that use NON-PLAIN strategy. i.e. The optimization works if the data
> types used are integers, OID, bytea, etc. But for TOAST-able data types like 
> text,
> the optimization will be skipped and force a full scan during recovery.
>

AFAIU, it won't take the optimization path only when we have a TOAST
relation but there is no insertion corresponding to it. If so, then we
don't need to mention it specifically, because there are other similar
cases where the optimization won't work, like when during recovery we
only have to perform the TRUNCATE.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
>  wrote:
> >
> > Yes, I have tested that optimization works for index relations.
> >
> > I have attached the V34, following the conditions that we use "cached"
> > flag for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers()
> > for consistency.
> > I added comment in 0004 the limitation of optimization when there are
> > TOAST relations that use NON-PLAIN strategy. i.e. The optimization
> > works if the data types used are integers, OID, bytea, etc. But for
> > TOAST-able data types like text, the optimization will be skipped and force 
> > a
> full scan during recovery.
> >
> 
> AFAIU, it won't take optimization path only when we have TOAST relation but
> there is no insertion corresponding to it. If so, then we don't need to 
> mention
> it specifically because there are other similar cases where the optimization
> won't work like when during recovery we have to just perform TRUNCATE.
> 

Right, I forgot to add that there should be an update, such as an insert, to the
TOAST relation for the truncate optimization to work. However, that is limited to
TOAST relations with the PLAIN strategy. I have tested with the text data type,
with inserts before truncate, and it did not enter the optimization path. OTOH,
it worked for a data type like integer. So should I still not include that
information?

Also, I will remove the unnecessary "cached" check from the line that Tsunakawa-san
mentioned. I will wait for a few more comments before re-uploading what will
hopefully be the final version, including the test for truncate.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to 
> > mention
> > it specifically because there are other similar cases where the optimization
> > won't work like when during recovery we have to just perform TRUNCATE.
> >
> 
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path. OTOH,
> It worked for data type like integer. So should I still not include that 
> information?

What's valuable in a code comment describing the remaining issue is that the 
reader can find clues as to whether it is related to the problem he/she has hit, 
and/or how to solve the issue.  I don't think the current comment is so bad in 
that regard, but it seems better to add:

* The condition of the issue: the table's ancillary storage (index, TOAST 
table, FSM, VM, etc.) was not updated during recovery.
(As an aside, "during recovery" here does not mean "after the last checkpoint" 
but "from the start of recovery", because the standby experiences many 
checkpoints (the correct term is restartpoints in case of standby).)

* The cause as a hint to solve the issue: The startup process does not find 
page modification WAL records.  As a result, it won't call 
XLogReadBufferExtended() and smgrnblocks() called therein, so the relation/fork 
size is not cached.
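
For illustration, the added note might read roughly like this (the wording is
just my sketch, not the patch text):

 * XXX Currently we fall back to scanning the whole buffer pool when the size
 * of a relation fork is not cached.  That happens when the table's ancillary
 * storage -- indexes, the TOAST table and its index, FSM, VM, etc. -- was not
 * updated from the start of recovery: the startup process then sees no
 * page-modification WAL record for the fork, never calls
 * XLogReadBufferExtended()/smgrnblocks() for it, and so never caches its size.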


Regards
Takayuki Tsunakawa




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread Amit Kapila
On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
 wrote:
>
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > Yes, I have tested that optimization works for index relations.
> > >
> > > I have attached the V34, following the conditions that we use "cached"
> > > flag for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers()
> > > for consistency.
> > > I added comment in 0004 the limitation of optimization when there are
> > > TOAST relations that use NON-PLAIN strategy. i.e. The optimization
> > > works if the data types used are integers, OID, bytea, etc. But for
> > > TOAST-able data types like text, the optimization will be skipped and 
> > > force a
> > full scan during recovery.
> > >
> >
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to 
> > mention
> > it specifically because there are other similar cases where the optimization
> > won't work like when during recovery we have to just perform TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path.
>

I think you are seeing this because the text datatype allows creating TOAST
storage but your data is small enough that it does not actually get toasted.

> OTOH,
> It worked for data type like integer.
>

It is not related to any datatype; it can happen whenever we don't
have any operation on any of the forks after the start of recovery.

> So should I still not include that information?
>

I think we can extend your existing comment like: "Otherwise if the
size of a relation fork is not cached, we proceed to a full scan of
the whole buffer pool. This can happen if there is no update to a
particular fork during recovery."

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread tsunakawa.ta...@fujitsu.com
From: tsunakawa.ta...@fujitsu.com 
> What's valuable as a code comment to describe the remaining issue is that the

You can attach XXX or FIXME in front of the issue description for easier 
search.  (XXX appears to be used much more often in Postgres.)


Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-13 Thread k.jami...@fujitsu.com
On Friday, December 11, 2020 10:27 AM, Amit Kapila wrote:
> On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
>  wrote:
> > So should I still not include that information?
> >
> 
> I think we can extend your existing comment like: "Otherwise if the size of a
> relation fork is not cached, we proceed to a full scan of the whole buffer 
> pool.
> This can happen if there is no update to a particular fork during recovery."

Attached are the final updated patches.
I followed this advice and updated the source code comment a little bit.
There are no changes from the previous version except that and the removal of
the unnecessary "cached" condition which Tsunakawa-san mentioned.

Below is also the updated recovery performance test results for TRUNCATE.
(1000 tables, 1MB per table, results measured in seconds)
| s_b   | Master | Patched | % Reg   |
|-------|--------|---------|---------|
| 128MB | 0.406  | 0.406   | 0%      |
| 512MB | 0.506  | 0.406   | -25%    |
| 1GB   | 0.806  | 0.406   | -99%    |
| 20GB  | 15.224 | 0.406   | -3650%  |
| 100GB | 81.506 | 0.406   | -19975% |

Because of the size of the relation, it is expected to enter the full-scan path
for the 128MB shared_buffers setting, and there was no regression there.
Similar to previous test results, the recovery time was constant
across all shared_buffers settings with the patches applied.
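
(For reference, % Reg here appears to be (patched - master) / patched; for the
20GB row, for example, (0.406 - 15.224) / 0.406 is roughly -36.5, shown as
-3650%, so the large negative values mean large improvements rather than
regressions.)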

Regards,
Kirk Jamison



v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-13 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> Attached are the final updated patches.

Looks good, and the patch remains ready for committer.  (Personally, I wanted 
the code comment to touch upon the TOAST and FSM/VM for the reader, because we 
couldn't think of those possibilities and took some time to find why the 
optimization path wasn't taken.)


Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-17 Thread Tang, Haiying
Hello Kirk,

I noticed you have pushed a new version of your patch which has some changes 
for TRUNCATE on TOAST relations. 
Although you've done a performance test for the changed part, I'd like to 
double-check your patch (hope you don't mind).
Below are the updated recovery performance test results for your new patch. All 
seems good.

*TOAST relation with PLAIN strategy like integer:
1. Recovery after VACUUM test results (average of 15 times)
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
---------------------------------------------------------------------------
128M            2.111        1.604         -24%
10G             57.135       1.878         -97%
20G             167.122      1.932         -99%

2. Recovery after TRUNCATE test results (average of 15 times)
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
---------------------------------------------------------------------------
128M            2.326        1.718         -26%
10G             82.397       1.738         -98%
20G             169.275      1.718         -99%

*TOAST relation with NON-PLAIN strategy like text/varchar:
1. Recovery after VACUUM test results (average of 15 times)
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
---------------------------------------------------------------------------
128M            3.174        2.493         -21%
10G             72.716       2.246         -97%
20G             163.660      2.474         -98%

2. Recovery after TRUNCATE test results (average of 15 times): Although it looks
like there are some improvements after the patch is applied, I think that's
because of the average calculation. TRUNCATE results should be similar between
master and patched because they all do a full scan.
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
---------------------------------------------------------------------------
128M            4.978        4.958         0%
10G             97.048       88.751        -9%
20G             183.230      173.226       -5%

[Machine spec]
CPU : 40 processors  (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

[Failover test data]
Total table Size: 600M
Table: 1 tables (1000 rows per table)

[Configure in postgresql.conf]
autovacuum = off
wal_level = replica
max_wal_senders = 5
max_locks_per_transaction = 1

If you have any questions on my test results, please let me know.

Regards
Tang




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Amit Kapila
On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Andres Freund 
>
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > int nforks,
> > +   BlockNumber
> > *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > + /* Check that it is in the buffer pool. If not, do 
> > nothing.
> > */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED);
> > + buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > + bufHdr->tag.forkNum == forkNum[i] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[i])
> > + InvalidateBuffer(bufHdr);   /* releases
> > spinlock */
> > + else
> > + UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock, 
> > and
> > have done a lookup in the mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum? And why are we doing so holding the buffer header
> > lock, which is essentially a spinlock, so should only ever be held for very 
> > short
> > portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc, but 
> > there
> > the situation is different: We haven't done a buffer mapping lookup, and we
> > don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after the 
> hash table has been looked up.  As an aside, InvalidateBuffer() requires the 
> caller to hold the buffer header spinlock and doesn't hold the buffer 
> partition lock.
>

This answers the second part of the question but what about the first
part (We hold a buffer partition lock, and have done a lookup in the
mapping table. Why are we then rechecking the
relfilenode/fork/blocknum?)

I think we don't need such a check, rather we can have an Assert
corresponding to that if-condition in the patch. I understand it is
safe to compare relfilenode/fork/blocknum but it might confuse readers
of the code.

I have started doing minor edits to the patch especially planning to
write a theory why is this optimization safe and here is what I can
come up with: "To remove all the pages of the specified relation forks
from the buffer pool, we need to scan the entire buffer pool but we
can optimize it by finding the buffers from BufMapping table provided
we know the exact size of each fork of the relation. The exact size is
required to ensure that we don't leave any buffer for the relation
being dropped as otherwise the background writer or checkpointer can
lead to a PANIC error while flushing buffers corresponding to files
that don't exist.

To know the exact size, we rely on the size cached for each fork by us
during recovery which limits the optimization to recovery and on
standbys but we can easily extend it once we have shared cache for
relation size.

In recovery, we cache the value returned by the first lseek(SEEK_END)
and future writes keep the cached value up-to-date. See
smgrextend. It is possible that the value of the first lseek is
smaller than the actual number of existing blocks in the file due to
buggy Linux kernels that might not have accounted for the recent
write. But that should be fine because there must not be any buffers
after that file size.

XXX We would make the extra lseek call for the unoptimized paths but
that is okay because we do it just for the first fork and we anyway
have to scan the entire buffer pool the cost of which is so high that
the extra lseek call won't make any visible difference. However, we
can use InRecovery flag to avoid the additional cost but that doesn't
seem worth it."
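
To make the caching rule above concrete, here is a tiny standalone model (this
is not PostgreSQL code; the Fake* names and the simplified behavior are only
for illustration):

#include <stdbool.h>
#include <stdio.h>

#define InvalidBlockNumber 0xFFFFFFFFu

typedef struct FakeSMgrRelation
{
    unsigned    nblocks_on_disk;    /* stands in for lseek(SEEK_END) */
    unsigned    cached_nblocks;     /* InvalidBlockNumber if not cached */
} FakeSMgrRelation;

static bool in_recovery = true;

/* Like the patched smgrnblocks(): also reports whether the value was cached. */
static unsigned
fake_smgrnblocks(FakeSMgrRelation *reln, bool *cached)
{
    if (in_recovery && reln->cached_nblocks != InvalidBlockNumber)
    {
        *cached = true;
        return reln->cached_nblocks;
    }
    *cached = false;
    if (in_recovery)
        reln->cached_nblocks = reln->nblocks_on_disk;   /* remember first lookup */
    return reln->nblocks_on_disk;
}

/* Like smgrextend(): the write keeps the cached value up to date. */
static void
fake_smgrextend(FakeSMgrRelation *reln, unsigned blockno)
{
    reln->nblocks_on_disk = blockno + 1;
    if (reln->cached_nblocks != InvalidBlockNumber)
        reln->cached_nblocks = blockno + 1;
}

int
main(void)
{
    FakeSMgrRelation rel = {4, InvalidBlockNumber};
    bool        cached;

    /* first call: size comes from "disk" and gets cached */
    printf("nblocks=%u cached=%d\n", fake_smgrnblocks(&rel, &cached), cached);
    /* extend keeps the cache current; the next call is answered from the cache */
    fake_smgrextend(&rel, 4);
    printf("nblocks=%u cached=%d\n", fake_smgrnblocks(&rel, &cached), cached);
    return 0;
}

The point is just that once recovery has looked a fork's size up once, later
callers can trust the cached value because every extension goes through the
same code and keeps it current.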

Thoughts?

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> This answers the second part of the question but what about the first
> part (We hold a buffer partition lock, and have done a lookup in th
> mapping table. Why are we then rechecking the
> relfilenode/fork/blocknum?)
> 
> I think we don't need such a check, rather we can have an Assert
> corresponding to that if-condition in the patch. I understand it is
> safe to compare relfilenode/fork/blocknum but it might confuse readers
> of the code.

Hmm, you're right.  I thought someone else could steal the found buffer and use 
it for another block because the buffer mapping lwlock is released without 
pinning the buffer or acquiring the buffer header spinlock.  However, in this 
case (replay of TRUNCATE during recovery), nobody steals the buffer: bgwriter 
or checkpointer doesn't use a buffer for a new block, and the client backend 
waits for AccessExclusive lock.


> I have started doing minor edits to the patch especially planning to
> write a theory why is this optimization safe and here is what I can
> come up with:

Thank you, that's fluent and easier to understand.


Regards
Takayuki Tsunakawa





Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Amit Kapila
On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in th
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found buffer and 
> use it for another block because the buffer mapping lwlock is released 
> without pinning the buffer or acquiring the buffer header spinlock.
>

Okay, I see your point.

>  However, in this case (replay of TRUNCATE during recovery), nobody steals 
> the buffer: bgwriter or checkpointer doesn't use a buffer for a new block, 
> and the client backend waits for AccessExclusive lock.
>
>

Why would all client backends wait for AccessExclusive lock on this
relation? Say, a client needs a buffer for some other relation and
that might evict this buffer after we release the lock on the
partition. In StrategyGetBuffer, it is important to either have a pin
on the buffer or the buffer header itself must be locked to avoid
getting picked as victim buffer. Am I missing something?

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Kyotaro Horiguchi
At Tue, 22 Dec 2020 01:42:55 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in th
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> > 
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
> 
> Hmm, you're right.  I thought someone else could steal the found
> buffer and use it for another block because the buffer mapping
> lwlock is released without pinning the buffer or acquiring the
> buffer header spinlock.  However, in this case (replay of TRUNCATE
> during recovery), nobody steals the buffer: bgwriter or checkpointer
> doesn't use a buffer for a new block, and the client backend waits
> for AccessExclusive lock.

Mmm. If that is true, doesn't the unoptimized path also need the
rechecking?

The AEL doesn't work at the buffer-block level. No new block can be allocated
for the relation, but BufferAlloc can still steal the block for other
relations since the AEL doesn't protect each individual buffer block.  Am I
still missing something?


> > I have started doing minor edits to the patch especially planning to
> > write a theory why is this optimization safe and here is what I can
> > come up with:
> 
> Thank you, that's fluent and easier to understand.

+1

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Kyotaro Horiguchi
At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Amit Kapila 
> > > This answers the second part of the question but what about the first
> > > part (We hold a buffer partition lock, and have done a lookup in th
> > > mapping table. Why are we then rechecking the
> > > relfilenode/fork/blocknum?)
> > >
> > > I think we don't need such a check, rather we can have an Assert
> > > corresponding to that if-condition in the patch. I understand it is
> > > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > > of the code.
> >
> > Hmm, you're right. I thought someone else could steal the found buffer and 
> > use it for another block because the buffer mapping lwlock is released 
> > without pinning the buffer or acquiring the buffer header spinlock.
> >
> 
> Okay, I see your point.
> 
> >  However, in this case (replay of TRUNCATE during recovery), nobody steals 
> > the buffer: bgwriter or checkpointer doesn't use a buffer for a new block, 
> > and the client backend waits for AccessExclusive lock.
> >
> >

I understood that you are thinking that the rechecking is useless.

> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

I think exactly like that.  If we acquire the bufHdr lock before
releasing the partition lock, that steal doesn't happen but it doesn't
seem good as a locking protocol.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

Ouch, right.  (The year-end business must be making me crazy...)

So, there are two choices here:

1) The current patch.
2) Acquire the buffer header spinlock before releasing the buffer mapping 
lwlock, and eliminate the buffer tag comparison as follows:

  BufTableLookup();
  LockBufHdr();
  LWLockRelease();
  InvalidateBuffer();

I think both are okay.  If I must choose either, I kind of prefer 1), because 
LWLockRelease() could take longer time to wake up other processes waiting on 
the lwlock, which is not very good to do while holding a spinlock.
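
For comparison, the ordering in the current patch (choice 1) is roughly the
following, based on the hunk quoted upthread (shown here only as a sketch of the
call sequence):

  LWLockAcquire(bufPartitionLock, LW_SHARED);
  buf_id = BufTableLookup();
  LWLockRelease();
  bufHdr = GetBufferDescriptor(buf_id);
  buf_state = LockBufHdr(bufHdr);
  /* recheck the buffer tag here: the buffer could have been evicted and reused
     for another page between LWLockRelease() and LockBufHdr() */
  InvalidateBuffer();   /* or UnlockBufHdr() if the tag no longer matches */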


Regards
Takayuki Tsunakawa





Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Amit Kapila
On Tue, Dec 22, 2020 at 8:18 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> Ouch, right.  (The year-end business must be making me crazy...)
>
> So, there are two choices here:
>
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> lwlock, and eliminate the buffer tag comparison as follows:
>
>   BufTableLookup();
>   LockBufHdr();
>   LWLockRelease();
>   InvalidateBuffer();
>
> I think both are okay.  If I must choose either, I kind of prefer 1), because 
> LWLockRelease() could take longer time to wake up other processes waiting on 
> the lwlock, which is not very good to do while holding a spinlock.
>
>

I also prefer (1). I will add some comments about the locking protocol
in the next version of the patch.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Amit Kapila
On Tue, Dec 22, 2020 at 8:12 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila  
> wrote in
>
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> I think exactly like that.  If we acquire the bufHdr lock before
> releasing the partition lock, that steal doesn't happen but it doesn't
> seem good as a locking protocol.
>

Right, so let's keep the code as it is but I feel it is better to add
some comments explaining the rationale behind this code.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread Kyotaro Horiguchi
At Tue, 22 Dec 2020 02:48:22 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
> 
> Ouch, right.  (The year-end business must be making me crazy...)
> 
> So, there are two choices here:
> 
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> lwlock, and eliminate the buffer tag comparison as follows:
> 
>   BufTableLookup();
>   LockBufHdr();
>   LWLockRelease();
>   InvalidateBuffer();
> 
> I think both are okay.  If I must choose either, I kind of prefer 1), because 
> LWLockRelease() could take longer time to wake up other processes waiting on 
> the lwlock, which is not very good to do while holding a spinlock.

I like, as said before, the current patch.

regareds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> Mmm. If that is true, doesn't the unoptimized path also need the
> rechecking?

Yes, the traditional processing does the recheck after acquiring the buffer 
header spinlock.


Regards
Takayuki Tsunakawa







RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread k.jami...@fujitsu.com
On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote:
> I have started doing minor edits to the patch especially planning to write a
> theory why is this optimization safe and here is what I can come up with: 
> "To
> remove all the pages of the specified relation forks from the buffer pool, we
> need to scan the entire buffer pool but we can optimize it by finding the
> buffers from BufMapping table provided we know the exact size of each fork
> of the relation. The exact size is required to ensure that we don't leave any
> buffer for the relation being dropped as otherwise the background writer or
> checkpointer can lead to a PANIC error while flushing buffers corresponding
> to files that don't exist.
> 
> To know the exact size, we rely on the size cached for each fork by us during
> recovery which limits the optimization to recovery and on standbys but we
> can easily extend it once we have shared cache for relation size.
> 
> In recovery, we cache the value returned by the first lseek(SEEK_END) and
> the future writes keeps the cached value up-to-date. See smgrextend. It is
> possible that the value of the first lseek is smaller than the actual number 
> of
> existing blocks in the file due to buggy Linux kernels that might not have
> accounted for the recent write. But that should be fine because there must
> not be any buffers after that file size.
> 
> XXX We would make the extra lseek call for the unoptimized paths but that is
> okay because we do it just for the first fork and we anyway have to scan the
> entire buffer pool the cost of which is so high that the extra lseek call 
> won't
> make any visible difference. However, we can use InRecovery flag to avoid the
> additional cost but that doesn't seem worth it."
> 
> Thoughts?

+1
Thank you very much for expanding the comments to carefully explain the reason
why the optimization is safe. I was also struggling to explain it completely,
but your description also covers the possibility of extending the optimization
in the future once we have a shared cache for relation sizes. So I like this
addition.

(Also, it seems that we have concluded to retain the locking mechanism of the
existing patch based on the recent email exchanges. Both the traditional path
and the optimized path do the rechecking. So there seems to be no problem, and
I'm definitely fine with it.)

Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 8:30 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 22 Dec 2020 02:48:22 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in
> > From: Amit Kapila 
> > > Why would all client backends wait for AccessExclusive lock on this
> > > relation? Say, a client needs a buffer for some other relation and
> > > that might evict this buffer after we release the lock on the
> > > partition. In StrategyGetBuffer, it is important to either have a pin
> > > on the buffer or the buffer header itself must be locked to avoid
> > > getting picked as victim buffer. Am I missing something?
> >
> > Ouch, right.  (The year-end business must be making me crazy...)
> >
> > So, there are two choices here:
> >
> > 1) The current patch.
> > 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> > lwlock, and eliminate the buffer tag comparison as follows:
> >
> >   BufTableLookup();
> >   LockBufHdr();
> >   LWLockRelease();
> >   InvalidateBuffer();
> >
> > I think both are okay.  If I must choose either, I kind of prefer 1), 
> > because LWLockRelease() could take longer time to wake up other processes 
> > waiting on the lwlock, which is not very good to do while holding a 
> > spinlock.
>
> I like, as said before, the current patch.
>

Attached, please find the updated patch with the following
modifications: (a) updated comments at various places, especially to
tell why this is a safe optimization, (b) merged the patch for
extending the smgrnblocks and vacuum optimization patch, (c) made
minor cosmetic changes and ran pgindent, and (d) updated commit
message. BTW, this optimization will help not only vacuum but also
truncate when it is done in the same transaction in which the relation
is created.  I would like to see certain tests to ensure that the
value we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I see
that some testing has been done earlier [1] for this threshold but I
am still not able to conclude. The criterion for finding the right
threshold should be the maximum size of a relation to be truncated
above which we don't get a benefit from this optimization.

One idea could be to remove "nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
uses the optimized path for the tests. Then use relation sizes of
NBuffers/128, NBuffers/256, NBuffers/512 for the different values of
shared_buffers such as 128MB, 1GB, 20GB, 100GB.

Apart from tests, do let me know if you are happy with the changes in
the patch? Next, I'll look into DropRelFileNodesAllBuffers()
optimization patch.

[1] - 
https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.


v36-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: Binary data


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila  wrote:
>
> Apart from tests, do let me know if you are happy with the changes in
> the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> optimization patch.
>

Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]

1.
DropRelFileNodesAllBuffers()
{
..
+buffer_full_scan:
+ pfree(block);
+ nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
+ for (i = 0; i < n; i++)
+ nodes[i] = smgr_reln[i]->smgr_rnode.node;
+
..
}

How is it correct to assign the nodes array directly from smgr_reln? There
is no one-to-one correspondence. If you see the code before the patch, the
passed array can have a mix of temp and non-temp relation information.

2.
+ for (i = 0; i < n; i++)
  {
- pfree(nodes);
+ for (j = 0; j <= MAX_FORKNUM; j++)
+ {
+ /*
+ * Assign InvalidblockNumber to a block if a relation
+ * fork does not exist, so that we can skip it later
+ * when dropping the relation buffers.
+ */
+ if (!smgrexists(smgr_reln[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);

Similar to the above, how can we assume the smgr_reln array has only non-local
relations? Have we tried the case with a mix of temp and non-temp
relations?

In this code, I am slightly worried about the additional cost of
checking smgrexists each time. Consider a case where there are many
relations and only one or a few of them have not cached the information;
in such a case we will pay the cost of smgrexists for many relations
without even going to the optimized path. Can we avoid that in some
way, or at least reduce its usage to only when it is required? One idea
could be that we first check if the nblocks information is cached and
if so then we don't need to call smgrnblocks, otherwise, check if it
exists. For this, we need an API like smgrnblocks_cached, something we
discussed earlier but preferred the current API. Do you have any
better ideas?
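
Roughly what I have in mind, just as a sketch (the function name
smgrnblocks_cached, its shape, and the per-fork cached-size field below are
hypothetical at this point, not something in the current patches):

BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
    /* return the remembered size if we have one, without touching the file */
    if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)  /* hypothetical field */
        return reln->smgr_cached_nblocks[forknum];
    return InvalidBlockNumber;      /* not cached */
}

The caller could first try this for each relation and fork, and only fall back
to the full buffer pool scan (or to smgrexists()/smgrnblocks()) when it returns
InvalidBlockNumber, so the smgrexists() cost is not paid on the common path
where everything is already cached.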


[1] - 
https://www.postgresql.org/message-id/OSBPR01MB2341882F416A282C3F7D769DEFC70%40OSBPR01MB2341.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: 
> Attached, please find the updated patch with the following modifications, (a)
> updated comments at various places especially to tell why this is a safe
> optimization, (b) merged the patch for extending the smgrnblocks and
> vacuum optimization patch, (c) made minor cosmetic changes and ran
> pgindent, and (d) updated commit message. BTW, this optimization will help
> not only vacuum but also truncate when it is done in the same transaction in
> which the relation is created.  I would like to see certain tests to ensure 
> that
> the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I
> see that some testing has been done earlier [1] for this threshold but I am 
> not
> still able to conclude. The criteria to find the right threshold should be 
> what is
> the maximum size of relation to be truncated above which we don't get
> benefit with this optimization.
> 
> One idea could be to remove "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> always use optimized path for the tests. Then use the relation size as
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold settings, 
and thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch.

Thank you, Amit.
That looks neater, combining the previous patches 0002-0003, so I am +1
on the changes because of the clearer explanations for the threshold and the
optimization path in DropRelFileNodeBuffers. Thanks for cleaning up my patch sets.
I hope we don't forget the 0001 patch's assertion in smgrextend() to ensure that
we do it safely too and that we are not InRecovery.

> [1] -
> https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9
> FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Wed, Dec 23, 2020 at 6:30 AM k.jami...@fujitsu.com
 wrote:
>
> On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote:
>
> > Apart from tests, do let me know if you are happy with the changes in the
> > patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch.
>
> Thank you, Amit.
> That looks more neat, combining the previous patches 0002-0003, so I am +1
> with the changes because of the clearer explanations for the threshold and
> optimization path in DropRelFileNodeBuffers. Thanks for cleaning my patch 
> sets.
> Hope we don't forget the 0001 patch's assertion in smgrextend() to ensure 
> that we
> do it safely too and that we are not InRecovery.
>

I think the 0001 is mostly for test purposes but we will see once the
main patches are ready.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Amit Kapila
On Tue, Dec 22, 2020 at 5:41 PM Amit Kapila  wrote:
>
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila  wrote:
> >
> > Apart from tests, do let me know if you are happy with the changes in
> > the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
>
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> 
>
> In this code, I am slightly worried about the additional cost of each
> time checking smgrexists. Consider a case where there are many
> relations and only one or few of them have not cached the information,
> in such a case we will pay the cost of smgrexists for many relations
> without even going to the optimized path. Can we avoid that in some
> way or at least reduce its usage to only when it is required? One idea
> could be that we first check if the nblocks information is cached and
> if so then we don't need to call smgrnblocks, otherwise, check if it
> exists. For this, we need an API like smgrnblocks_cahced, something we
> discussed earlier but preferred the current API. Do you have any
> better ideas?
>

One more idea which is not better than what I mentioned above is that
we completely avoid calling smgrexists and rely on smgrnblocks. It
will throw an error in case the particular fork doesn't exist and we
can use try .. catch to handle it. I just mentioned it as it came
across my mind but I don't think it is better than the previous one.

One more thing about the patch:
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
+
+ if (!cached)
+ goto buffer_full_scan;

Why do we need to use goto here? We can simply break from the loop and
then check if (cached && nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
possible without much complexity.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> +
> + if (!cached)
> + goto buffer_full_scan;
> 
> Why do we need to use goto here? We can simply break from the loop and
> then check if (cached && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> possible without much complexity.

That's because two for loops are nested -- breaking there only exits the inner 
loop.  (I thought the same as you at first... And while I understand some people 
have an allergy to goto, I think modest use of goto keeps the code readable.)


Regards
Takayuki Tsunakawa






Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Kyotaro Horiguchi
At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> > 
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
> 
> That's because two for loops are nested -- breaking there only exits the 
> inner loop.  (I thought the same as you at first... And I understand some 
> people have alergy to goto, I think modest use of goto makes the code 
> readable.)

I don't strongly oppose gotos, but in this case the outer loop can
break on the same condition as the inner loop, since cached is true
whenever the inner loop runs to the end. The variable cached needs to be
initialized to true, instead of false, though.

The same pattern is seen in the tree.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread Zhihong Yu
Hi,
It is possible to come out of the nested loop without goto.

+   bool    cached = true;
...
+    * to that fork during recovery.
+    */
+   for (i = 0; i < n && cached; i++)
...
+   if (!cached)
+       break;

Here I changed the initial value of cached to true so that we enter the
outer loop.
In place of the original goto, we break out of the inner loop and exit the
outer loop.
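
For illustration, the complete goto-free structure would look roughly like
this (a sketch only; variable names such as smgr_reln, block and
nBlocksToInvalidate are taken from the patch excerpts in this thread, and
the surrounding declarations are assumed):

    bool        cached = true;
    BlockNumber nBlocksToInvalidate = 0;

    for (i = 0; i < n && cached; i++)
    {
        for (j = 0; j <= MAX_FORKNUM; j++)
        {
            /* Skip forks that do not exist. */
            if (!smgrexists(smgr_reln[i], j))
            {
                block[i][j] = InvalidBlockNumber;
                continue;
            }

            /* Get the number of blocks for a relation's fork. */
            block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);

            /* Size not cached for this fork: take the full-scan path. */
            if (!cached)
                break;

            nBlocksToInvalidate += block[i][j];
        }
    }

    if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
    {
        /* optimized path: look up the individual blocks in the buffer table */
    }
    else
    {
        /* traditional path: full scan of the buffer pool */
    }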

Cheers

On Tue, Dec 22, 2020 at 8:22 PM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:

> From: Amit Kapila 
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop.  (I thought the same as you at first... And I understand some
> people have alergy to goto, I think modest use of goto makes the code
> readable.)
>
>
> Regards
> Takayuki Tsunakawa
>
>
>
>
>


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila 
> wrote:
> > Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
> 
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> =
> ===
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> + pfree(block);
> + nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
> + for (i = 0; i < n; i++)
> + nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
> 
> How is it correct to assign nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before patch, the passed
> array can have mixed of temp and non-temp relation information.

You are right. I mistakenly removed the nodes array that should have been
allocated for the non-local relations. So I fixed that by doing:

SMgrRelation   *rels;

rels = palloc(sizeof(SMgrRelation) * nnodes);  /* non-local relations */

/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nnodes; i++)
{
    if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
    {
        if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
            DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
    }
    else
        rels[n++] = smgr_reln[i];
}
...
if (n == 0)
{
    pfree(rels);
    return;
}
...
/* traditional path: */
pfree(block);
nodes = palloc(sizeof(RelFileNode) * n);  /* non-local relations */
for (i = 0; i < n; i++)
    nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> + for (i = 0; i < n; i++)
>   {
> - pfree(nodes);
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + /*
> + * Assign InvalidblockNumber to a block if a relation
> + * fork does not exist, so that we can skip it later
> + * when dropping the relation buffers.
> + */
> + if (!smgrexists(smgr_reln[i], j))
> + {
> + block[i][j] = InvalidBlockNumber;
> + continue;
> + }
> +
> + /* Get the number of blocks for a relation's fork */ block[i][j] =
> + smgrnblocks(smgr_reln[i], j, &cached);
> 
> Similar to above, how can we assume smgr_reln array has all non-local
> relations? Have we tried the case with mix of temp and non-temp relations?

Similar to above reply.

> In this code, I am slightly worried about the additional cost of each time
> checking smgrexists. Consider a case where there are many relations and only
> one or few of them have not cached the information, in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way or at least reduce its usage to
> only when it is required? One idea could be that we first check if the nblocks
> information is cached and if so then we don't need to call smgrnblocks,
> otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point: let's say there are 100 relations, and
the first 99 non-local relations happen to enter the optimization path, but the
last one does not; calling smgrexists() for all of them would be too costly and a
waste of time in that case.
The only solution I could think of, as you mentioned, is to reintroduce the new
API which we discussed before: smgrnblocks_cached().
It's possible that we call smgrexists() only if smgrnblocks_cached() returns
InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached(), which will
include the "cached" flag parameter, and remove the additional flag
modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers() will use the new API.

Thoughts?


Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 1:07 PM k.jami...@fujitsu.com
 wrote:
>
> On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
>
> > In this code, I am slightly worried about the additional cost of each time
> > checking smgrexists. Consider a case where there are many relations and only
> > one or few of them have not cached the information, in such a case we will
> > pay the cost of smgrexists for many relations without even going to the
> > optimized path. Can we avoid that in some way or at least reduce its usage 
> > to
> > only when it is required? One idea could be that we first check if the 
> > nblocks
> > information is cached and if so then we don't need to call smgrnblocks,
> > otherwise, check if it exists. For this, we need an API like
> > smgrnblocks_cached, something we discussed earlier but preferred the
> > current API. Do you have any better ideas?
>
> Right. I understand the point that let's say there are 100 relations, and
> the first 99 non-local relations happen to enter the optimization path, but 
> the last
> one does not, calling smgrexist() would be too costly and waste of time in 
> that case.
> The only solution I could think of as you mentioned is to reintroduce the new 
> API
> which we discussed before: smgrnblocks_cached().
> It's possible that we call smgrexist() only if smgrnblocks_cached() returns
> InvalidBlockNumber.
> So if everyone agrees, we can add that API smgrnblocks_cached() which will
> Include the "cached" flag parameter, and remove the additional flag 
> modifications
> from smgrnblocks(). In this case, both DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() will use the new API.
>

Yeah, let's do it that way unless anyone has a better idea to suggest.
--
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 10:42 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in
> > From: Amit Kapila 
> > > + /* Get the number of blocks for a relation's fork */
> > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > +
> > > + if (!cached)
> > > + goto buffer_full_scan;
> > >
> > > Why do we need to use goto here? We can simply break from the loop and
> > > then check if (cached && nBlocksToInvalidate <
> > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > > possible without much complexity.
> >
> > That's because two for loops are nested -- breaking there only exits the 
> > inner loop.  (I thought the same as you at first... And I understand some 
> > people have alergy to goto, I think modest use of goto makes the code 
> > readable.)
>
> I don't strongly oppose to goto's but in this case the outer loop can
> break on the same condition with the inner loop, since cached is true
> whenever the inner loop runs to the end. It is needed to initialize
> the variable cache with true, instead of false, though.
>

+1. I think it is better to avoid goto here as it can be done without
introducing any complexity or making code any less readable.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread k.jami...@fujitsu.com
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> >
> > At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com"
> >  wrote in
> > > From: Amit Kapila 
> > > > + /* Get the number of blocks for a relation's fork */ block[i][j]
> > > > + = smgrnblocks(smgr_reln[i], j, &cached);
> > > > +
> > > > + if (!cached)
> > > > + goto buffer_full_scan;
> > > >
> > > > Why do we need to use goto here? We can simply break from the loop
> > > > and then check if (cached && nBlocksToInvalidate <
> > > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid
> goto
> > > > if possible without much complexity.
> > >
> > > That's because two for loops are nested -- breaking there only exits
> > > the inner loop.  (I thought the same as you at first... And I
> > > understand some people have alergy to goto, I think modest use of
> > > goto makes the code readable.)
> >
> > I don't strongly oppose to goto's but in this case the outer loop can
> > break on the same condition with the inner loop, since cached is true
> > whenever the inner loop runs to the end. It is needed to initialize
> > the variable cache with true, instead of false, though.
> >
> 
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making code any less readable.

I also do not mind, so I have removed the goto and followed the advice
of all reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following
changes:

1. I have removed the modifications in smgrnblocks(). The modifications to
other functions that use smgrnblocks() in the previous patch versions were
also reverted.
2. Introduced a new API smgrnblocks_cached() instead, which returns either
a cached size for the specified fork or InvalidBlockNumber.
Since InvalidBlockNumber is used, I think it is logical not to use the additional
boolean parameter "cached" in the function, as it would be redundant.
(Although in 0003 I still use a "cached" boolean variable locally to avoid the goto.)
This function is called in both DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers(). A rough sketch of the API is below.
3. Modified some minor comments in the patch and commit logs.
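
For illustration, the new API could look roughly like this (a sketch based
only on the description above; the smgr_cached_nblocks field name and the
InRecovery guard are assumptions about the patch internals, not a quote
from the attached v37 patches):

    /*
     * smgrnblocks_cached() -- return the cached size of a fork, or
     * InvalidBlockNumber when the size is not cached (sketch only).
     */
    BlockNumber
    smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
    {
        /* The cached value is only trusted during recovery. */
        if (InRecovery &&
            reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];

        return InvalidBlockNumber;
    }

    /* Caller side (sketch): fall back to the full scan when not cached. */
    nForkBlock = smgrnblocks_cached(smgr_reln[i], j);
    if (nForkBlock == InvalidBlockNumber)
        break;      /* size not cached -- use the traditional full scan */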

It compiles and passes the regression tests too.
Your feedback is definitely welcome.

Regards,
Kirk Jamison


v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description:  v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch


v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> It compiles. Passes the regression tests too.
> Your feedbacks are definitely welcome.

The code looks correct and has become even more compact.  Remains ready for 
committer.


Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Tang, Haiying
Hi Amit, Kirk

>One idea could be to remove "nBlocksToInvalidate <
>BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
>nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
>use optimized path for the tests. Then use the relation size as
>NBuffers/128, NBuffers/256, NBuffers/512 for different values of
>shared buffers as 128MB, 1GB, 20GB, 100GB.

I followed your idea: I removed the check and used different relation sizes for
different shared_buffers settings of 128MB, 1GB, 20GB, and 50GB (my environment
can't support 100GB, so I chose 50GB).
According to the results, all three thresholds get the optimization, even
NBuffers/128, when shared_buffers > 128MB.
IMHO, NBuffers/128 is the maximum relation size among the three thresholds for
which we get the optimization. Please let me know if I did something wrong.

Recovery-after-VACUUM test results are shown below under 'Optimized percentage'
and 'Optimization details (unit: second)':
(512), (256), (128): means the relation size is NBuffers/512, NBuffers/256,
NBuffers/128
%reg: means (patched(512) - master(512)) / master(512), and likewise for (256) and (128)

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
------------------------------------------------
128M            0%         -1%        -1%
1G              -65%       -49%       -62%
20G             -98%       -98%       -98%
50G             -99%       -99%       -99%

Optimization details (unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
------------------------------------------------------------------------------------------------
128M            0.108        0.108         0.109        0.108         0.109        0.108
1G              0.310        0.107         0.410        0.208         0.811        0.309
20G             94.493       1.511         188.777      3.014         380.633      6.020
50G             537.978      3.815         867.453      7.524         1559.076     15.541

Test preparation:
Below is the number of test tables for the different shared_buffers settings. Each
table's size is 8kB, so I use table amount = NBuffers/(512 or 256 or 128):
shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
--------------------------------------------------------------------
128M            16384     32            64            128
1G              131072    256           512           1024
20G             2621440   5120          10240         20480
50G             6553600   12800         25600         51200

Besides, I also did a single-table performance test.
Still, NBuffers/128 is the max relation size for which we get the optimization.

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
------------------------------------------------
128M            0%         0%         -1%
1G              0%         1%         0%
20G             0%         -24%       -25%
50G             0%         -24%       -20%

Optimization details (unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
------------------------------------------------------------------------------------------------
128M            0.107        0.107         0.108        0.108         0.108        0.107
1G              0.108        0.108         0.107        0.108         0.108        0.108
20G             0.208        0.208         0.409        0.309         0.409        0.308
50G             0.309        0.308         0.408        0.309         0.509        0.408

Any questions on my test results are welcome.

Regards,
Tang




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Amit Kapila
On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying
 wrote:
>
> Hi Amit, Kirk
>
> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
> >use optimized path for the tests. Then use the relation size as
> >NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.
>
> I followed your idea to remove check and use different relation size for 
> different shared buffers as 128M,1G,20G,50G(my environment can't support 
> 100G, so I choose 50G).
> According to results, all three thresholds can get optimized, even 
> NBuffers/128 when shared_buffers > 128M.
> IMHO, I think NBuffers/128 is the maximum relation size we can get 
> optimization in the three thresholds, Please let me know if I made something 
> wrong.
>

But how can we conclude that NBuffers/128 is the maximum relation size?
Because the maximum size would be where the performance is worse than
the master, no? I guess we need to try NBuffers/64, NBuffers/32, and so
on, till we get the threshold where master performs better.

> Recovery after vacuum test results as below ' Optimized percentage' and ' 
> Optimization details(unit: second)' shows:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256, 
> NBuffers/128
> %reg: means (patched(512)- master(512))/ master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)   %reg(256)   %reg(128)
> -
> 128M0%  -1% -1%
> 1G  -65%-49%-62%
> 20G -98%-98%-98%
> 50G -99%-99%-99%
>
> Optimization details(unit: second):
> shared_buffers  master(512) patched(512)master(256) patched(256)  
>   master(128) patched(128)
> -
> 128M0.108   0.108   0.109   0.108 
>   0.109   0.108
> 1G  0.310   0.107   0.410   0.208 
>   0.811   0.309
> 20G 94.493  1.511   188.777 3.014 
>   380.633 6.020
> 50G 537.978 3.815   867.453 7.524 
>   1559.07615.541
>

I think we should find a better way to display these numbers because
in cases like where master takes 537.978s and patch takes 3.815s, it
is clear that patch has reduced the time by more than 100 times
whereas in your table it shows 99%.

> Test prepare:
> Below is test table amount for different shared buffers. Each table size is 
> 8k,
>

Table size should be more than 8k to get all this data because 8k
means just one block. I guess either it is a typo or some other
mistake.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread k.jami...@fujitsu.com
On Thu, December 24, 2020 6:02 PM JST, Tang, Haiying wrote:
> Hi Amit, Kirk
> 
> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> always
> >use optimized path for the tests. Then use the relation size as
> >NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> >buffers as 128MB, 1GB, 20GB, 100GB.
> 
> I followed your idea to remove check and use different relation size for
> different shared buffers as 128M,1G,20G,50G(my environment can't support
> 100G, so I choose 50G).
> According to results, all three thresholds can get optimized, even
> NBuffers/128 when shared_buffers > 128M.
> IMHO, I think NBuffers/128 is the maximum relation size we can get
> optimization in the three thresholds, Please let me know if I made something
> wrong.
 

Hello Tang,
Thank you very much again for testing. Perhaps there is a confusing part in the
presented table where you indicated master(512), master(256), master(128),
because master is not supposed to use BUF_DROP_FULL_SCAN_THRESHOLD
and just executes the existing default full scan of NBuffers.
Or have I misunderstood something?

> Recovery after vacuum test results as below ' Optimized percentage' and '
> Optimization details(unit: second)' shows:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512)- master(512))/ master(512)
> 
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -------------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
> 
> Optimization details (unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -------------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
> 
> Test prepare:
> Below is test table amount for different shared buffers. Each table size is 8k,
> so I use table amount = NBuffers/(512 or 256 or 128):
> shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
> --------------------------------------------------------------------
> 128M            16384     32            64            128
> 1G              131072    256           512           1024
> 20G             2621440   5120          10240         20480
> 50G             6553600   12800         25600         51200
> 
> Besides, I also did single table performance test.
> Still, NBuffers/128 is the max relation size which we can get optimization.
> 
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -------------------------------------------------
> 128M            0%         0%         -1%
> 1G              0%         1%         0%
> 20G             0%         -24%       -25%
> 50G             0%         -24%       -20%
> 
> Optimization details (unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -------------------------------------------------------------------------------------------------
> 128M            0.107        0.107         0.108        0.108         0.108        0.107
> 1G              0.108        0.108         0.107        0.108         0.108        0.108
> 20G             0.208        0.208         0.409        0.309         0.409        0.308
> 50G             0.309        0.308         0.408        0.309         0.509        0.408

I will also post results from my machine in the next email.
Adding to what Amit mentioned, we should also test NBuffers/64, etc.,
until we determine which threshold performs worse than master.


Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Amit Kapila
On Wed, Dec 23, 2020 at 6:27 PM k.jami...@fujitsu.com
 wrote:
>
>
> It compiles. Passes the regression tests too.
> Your feedbacks are definitely welcome.
>

Thanks, the patches look good to me now. I have slightly edited the
patches for comments and commit messages, and removed the duplicate
code/check in smgrnblocks. I have changed the order of the patches (moved
the Assert-related patch to last because, as mentioned earlier, I am not
sure if we want to commit it). We might still have to change the scan
threshold value based on your and Tang-san's results.

-- 
With Regards,
Amit Kapila.


v38-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: Binary data


v38-0002-Optimize-DropRelFileNodesAllBuffers-for-recovery.patch
Description: Binary data


v38-0003-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description: Binary data


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Tang, Haiying
Hi Kirk,


>Perhaps there is a confusing part in the presented table where you indicated 
>master(512), master(256), master(128). 
>Because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and 
>just execute the existing default full scan of NBuffers.
>Or I may have misunderstood something?

Sorry for the confusion; I didn't make it clear. I didn't use
BUF_DROP_FULL_SCAN_THRESHOLD for master.
Master(512) means the test table amount for master is the same as for patched(512),
and likewise for master(256) and master(128).
I marked 512/256/128 only to distinguish the master results for the three
thresholds (applied in the patches).

Regards
Tang




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Tang, Haiying
Hi Amit,

>But how can we conclude NBuffers/128 is the maximum relation size?
>Because the maximum size would be where the performance is worse than 
>the master, no? I guess we need to try by NBuffers/64, NBuffers/32, 
> till we get the threshold where master performs better.

You are right, we should keep on testing until no optimization.

>I think we should find a better way to display these numbers because in 
>cases like where master takes 537.978s and patch takes 3.815s

Yeah, I think we can change the %reg formula from (patched- master)/ master to 
(patched- master)/ patched.

>Table size should be more than 8k to get all this data because 8k means 
>just one block. I guess either it is a typo or some other mistake.

8k here is the relation size, not data size. 
For example, when I tested recovery performance of 400M relation size, I used 
51200 tables(8k per table).
Please let me know if you think this is not appropriate.

Regards
Tang







Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Amit Kapila
On Fri, Dec 25, 2020 at 9:28 AM Tang, Haiying
 wrote:
>
> Hi Amit,
>
> >But how can we conclude NBuffers/128 is the maximum relation size?
> >Because the maximum size would be where the performance is worse than
> >the master, no? I guess we need to try by NBuffers/64, NBuffers/32,
> > till we get the threshold where master performs better.
>
> You are right, we should keep on testing until no optimization.
>
> >I think we should find a better way to display these numbers because in
> >cases like where master takes 537.978s and patch takes 3.815s
>
> Yeah, I think we can change the %reg formula from (patched- master)/ master 
> to (patched- master)/ patched.
>
> >Table size should be more than 8k to get all this data because 8k means
> >just one block. I guess either it is a typo or some other mistake.
>
> 8k here is the relation size, not data size.
> For example, when I tested recovery performance of 400M relation size, I used 
> 51200 tables(8k per table).
> Please let me know if you think this is not appropriate.
>

I think one table with a varying amount of data is sufficient for the
vacuum test. I think with a larger number of tables there is a greater
chance of variation. We previously used multiple tables in one of
the tests because of the Truncate operation (which uses
DropRelFileNodesAllBuffers and takes multiple relations as input), and
that is not true for the Vacuum operation which I suppose you are testing
here.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread Tang, Haiying
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum 
>test. 
>I think with more number of tables there is a greater chance of variation. 
>We have previously used multiple tables in one of the tests because of the 
>Truncate operation (which uses DropRelFileNodesAllBuffers that takes multiple 
>relations as input) 
>and that is not true for Vacuum operation which I suppose you are testing here.

Thanks for your advice and kind explanation. I'll continue the threshold test
with a single table.

Regards,
Tang




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-28 Thread Tang, Haiying
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum 
>test. 
>I think with more number of tables there is a greater chance of variation. 
>We have previously used multiple tables in one of the tests because of 
>the Truncate operation (which uses DropRelFileNodesAllBuffers that 
>takes multiple relations as input) and that is not true for Vacuum operation 
>which I suppose you are testing here.

I retested performance on a single table several times; the table size varies with
BUF_DROP_FULL_SCAN_THRESHOLD for the different shared_buffers settings.
When shared_buffers is below 20GB, there were no significant changes between
master (HEAD) and patched.
And according to the results compared between 20GB and 100GB, we can get the
optimization up to NBuffers/128, but there is no benefit at NBuffers/256.
I've tested many times and mostly got the same results; I don't know why.
But if I use 5 tables (each table sized at BUF_DROP_FULL_SCAN_THRESHOLD),
then we do get a benefit at NBuffers/256.

Here are my test results for a single table. If you have any questions or
suggestions, kindly let me know.

%reg= (patched- master(HEAD))/ patched
Optimized percentage:
shared_buffers  %reg(NBuffers/512)  %reg(NBuffers/256)  %reg(NBuffers/128)  %reg(NBuffers/64)  %reg(NBuffers/32)  %reg(NBuffers/16)  %reg(NBuffers/8)  %reg(NBuffers/4)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
128M            0%                  0%                  -1%                 0%                 1%                 0%                 0%                0%
1G              -1%                 0%                  -1%                 0%                 0%                 0%                 0%                0%
20G             0%                  0%                  -33%                0%                 0%                 -13%               0%                0%
100G            -32%                0%                  -49%                0%                 10%                30%                0%                6%

Result details (unit: second):
patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
128M            0.107         0.107         0.107         0.107        0.108        0.107        0.108       0.208
1G              0.107         0.107         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.208         0.308         0.308         0.409        0.609        0.808        1.511       2.713
100G            0.309         0.408         0.609         1.010        2.011        5.017        6.620       13.931

master (HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
128M            0.107         0.107         0.108         0.107        0.107        0.107        0.108       0.208
1G              0.108         0.107         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.309         0.409         0.409        0.609        0.910        1.511       2.712
100G            0.408         0.408         0.909         1.010        1.811        3.515        6.619       13.032

Regards
Tang





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-29 Thread Tang, Haiying
Hi Amit,

In my last mail
(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
I sent you the performance test results (run only once) on a single table.
Here are my retested results (averaged over 15 runs), which I think are more
accurate.

In terms of 20GB and 100GB, the optimization on 100GB is linear, but on 20GB it is
nonlinear (this also includes test results on shared_buffers of 50GB/60GB), so it's a
little difficult for me to decide the threshold from the two.
If we consider only 100GB, I think NBuffers/32 is the optimized max relation size.
But I don't know how to judge for 20GB. If you have any suggestions, kindly let
me know.

#%reg                128M   1G     20G    100G
------------------------------------------------
%reg(NBuffers/512)   0%     -1%    -5%    -26%
%reg(NBuffers/256)   0%     0%     5%     -20%
%reg(NBuffers/128)   -1%    -1%    -10%   -16%
%reg(NBuffers/64)    -1%    0%     0%     -8%
%reg(NBuffers/32)    0%     0%     -2%    -4%
%reg(NBuffers/16)    0%     0%     -6%    4%
%reg(NBuffers/8)     1%     0%     2%     -2%
%reg(NBuffers/4)     0%     0%     2%     2%

Optimization details (unit: second):
patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
--------------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.107        0.107        0.108       0.208
1G              0.107         0.108         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.199         0.299         0.317         0.408        0.591        0.900        1.561       2.866
100G            0.318         0.381         0.645         0.992        1.913        3.640        6.615       13.389

master (HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
--------------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.108         0.108        0.107        0.107        0.107       0.208
1G              0.108         0.108         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.283         0.350         0.408        0.601        0.955        1.529       2.806
100G            0.400         0.459         0.751         1.068        1.984        3.506        6.735       13.101

Regards
Tang




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-30 Thread Amit Kapila
On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying
 wrote:
>
> Hi Amit,
>
> In last 
> mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> I've sent you the performance test results(run only 1 time) on single table. 
> Here is my the retested results(average by 15 times) which I think is more 
> accurate.
>
> In terms of 20G and 100G, the optimization on 100G is linear, but 20G is 
> nonlinear(also include test results on shared buffers of 50G/60G), so it's a 
> little difficult to decide the threshold from the two for me.
> If just consider 100G, I think NBuffers/32 is the optimized max relation 
> size. But I don't know how to judge for 20G. If you have any suggestion, 
> kindly let me know.
>

Considering these results, NBuffers/64 seems a good threshold, as beyond
that there is no big advantage. BTW, it is not clear why the advantage
for a single table is not as big as for multiple tables with the Truncate
command. Can you share your exact test steps for any one of the tests?
Also, did you set autovacuum = off for these tests? If not, the
results might not be reliable, because before you run the test via the
Vacuum command, autovacuum would have done that work.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-02 Thread k.jami...@fujitsu.com
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote:
> On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying
>  wrote:
> >
> > Hi Amit,
> >
> > In last
> >
> mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e625718
> 2
> > 564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> > I've sent you the performance test results(run only 1 time) on single table.
> Here is my the retested results(average by 15 times) which I think is more
> accurate.
> >
> > In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
> nonlinear(also include test results on shared buffers of 50G/60G), so it's a
> little difficult to decide the threshold from the two for me.
> > If just consider 100G, I think NBuffers/32 is the optimized max relation 
> > size.
> But I don't know how to judge for 20G. If you have any suggestion, kindly let
> me know.
> >
> 
> Considering these results NBuffers/64 seems a good threshold as beyond
> that there is no big advantage. BTW, it is not clear why the advantage for
> single table is not as big as multiple tables with the Truncate command. Can
> you share your exact test steps for any one of the tests?
> Also, did you change autovacumm = off for these tests, if not then the results
> might not be reliable because before you run the test via Vacuum command
> autovacuum would have done that work?

Happy new year. The V38 LGTM.
Apologies for a bit of delay on posting the test results, but since it's the
start of commitfest, here it goes and the results were interesting.

I executed a VACUUM test using the same approach that Tsunakawa-san did in [1],
but this time measuring the total time that DropRelFileNodeBuffers() took.
I used only a single relation and tried various sizes using the threshold values
NBuffers/512..NBuffers/1, as advised by Amit.

Example of relation sizes for NBuffers/512.
100GB shared_buffers:  200 MB 
20GB shared_buffers:   40 MB
1GB shared_buffers:2 MB
128MB shared_buffers:  0.25 MB

The regression, which means the patch performs worse than master, only happens
for relation sizes of NBuffers/2 and larger, for all shared_buffers settings. The
fastest performance on a single relation was with the relation size NBuffers/512.

[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size),
M_XXX (Master, Nbuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b | 
|--||||| 
| P_512|  0.012594  |  0.001989  |  0.000081  |  0.000012  | 
| M_512|  0.208757  |  0.046212  |  0.002013  |  0.000295  | 
| P_256|  0.026311  |  0.004416  |  0.000129  |  0.000021  | 
| M_256|  0.241017  |  0.047234  |  0.002363  |  0.000298  | 
| P_128|  0.044684  |  0.009784  |  0.000290  |  0.000042  | 
| M_128|  0.253588  |  0.047952  |  0.002454  |  0.000319  | 
| P_64 |  0.065806  |  0.017444  |  0.000521  |  0.000075  | 
| M_64 |  0.268311  |  0.050361  |  0.002730  |  0.000339  | 
| P_32 |  0.121441  |  0.033431  |  0.001646  |  0.000112  | 
| M_32 |  0.285254  |  0.061486  |  0.003640  |  0.000364  | 
| P_16 |  0.255503  |  0.065492  |  0.001663  |  0.000144  | 
| M_16 |  0.377013  |  0.081613  |  0.003731  |  0.000372  | 
| P_8  |  0.560616  |  0.109509  |  0.005954  |  0.000465  | 
| M_8  |  0.692596  |  0.112178  |  0.006667  |  0.000553  | 
| P_4  |  1.109437  |  0.162924  |  0.011229  |  0.000861  | 
| M_4  |  1.162125  |  0.178764  |  0.011635  |  0.000935  | 
| P_2  |  2.202231  |  0.317832  |  0.020783  |  0.002646  | 
| M_2  |  1.583959  |  0.306269  |  0.015705  |  0.002021  | 
| P_1  |  3.080032  |  0.632747  |  0.032183  |  0.002660  | 
| M_1  |  2.705485  |  0.543970  |  0.030658  |  0.001941  | 

%reg = (Patched - Master)/Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b | 
|--||||| 
| %reg_512 | -1557.587% | -2223.006% | -2385.185% | -2354.167% | 
| %reg_256 | -816.041%  | -969.691%  | -1731.783% | -1319.048% | 
| %reg_128 | -467.514%  | -390.123%  | -747.008%  | -658.333%  | 
| %reg_64  | -307.727%  | -188.704%  | -423.992%  | -352.000%  | 
| %reg_32  | -134.891%  | -83.920%   | -121.097%  | -225.970%  | 
| %reg_16  | -47.557%   | -24.614%   | -124.279%  | -157.390%  | 
| %reg_8   | -23.542%   | -2.437%| -11.967%   | -19.010%   | 
| %reg_4   | -4.749%| -9.722%| -3.608%| -8.595%| 
| %reg_2   | 28.075%| 3.638% | 24.436%| 23.615%| 
| %reg_1   | 12.160%| 14.030%| 4.739% | 27.010%| 
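
As a sanity check of the formula: for NBuffers/512 at 100 GB shared_buffers,
(0.012594 - 0.208757) / 0.012594 is roughly -15.58, i.e. about -1558%, which
matches the -1557.587% in the first row above.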

Since our goal is to get the approximate threshold where the cost of
finding the buffers to be invalidated gets higher in the optimized path than
in the traditional path:
A. Traditional Path
 1. For each buffer in shared_buffers, compare the relfilenode.
 2. LockBufHdr()
 3. Compare block number, InvalidateBuffers() if it matches.

Re: [Patch] Optimize dropping of relation buffers using dlist

2021-01-03 Thread Amit Kapila
On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com
 wrote:
>
> Happy new year. The V38 LGTM.
> Apologies for a bit of delay on posting the test results, but since it's the
> start of commitfest, here it goes and the results were interesting.
>
> I executed a VACUUM test using the same approach that Tsunakawa-san did in 
> [1],
> but only this time, the total time that DropRelFileNodeBuffers() took.
>

Please specify the exact steps, like did you delete all the rows from
a table, or some of them, or none, before performing Vacuum? How did you
measure this time; did you remove the cached check? It would be
better if you share the scripts and/or the exact steps so that the
same can be used by others to reproduce.

> I used only a single relation, tried with various sizes using the values of 
> threshold:
> NBuffers/512..NBuffers/1, as advised by Amit.
>
> Example of relation sizes for NBuffers/512.
> 100GB shared_buffers:  200 MB
> 20GB shared_buffers:   40 MB
> 1GB shared_buffers:2 MB
> 128MB shared_buffers:  0.25 MB
>
..
>
> Although the above table shows that NBuffers/2 would be the
> threshold, I know that the cost would vary depending on the machine
> specs. I think I can suggest the threshold and pick one from among
> NBuffers/2, NBuffers/4 or NBuffers/8, because their values are closer
> to the InvalidatedBuffers.
>

Hmm, in the tests done by Tang, the results indicate that in some
cases the patched version is slower even at NBuffers/32, so I am not sure
if we can go to the values shown by you unless she is doing something
wrong. I think the difference in results could be because both of you
are using different techniques to measure the timings, so it might be
better if both of you can share scripts or the exact steps used to measure
the time, and the other can use the same technique and see if we are
getting consistent results.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-03 Thread Tang, Haiying
Hi Amit,

Sorry for my late reply. Here are my answers for your earlier questions.

>BTW, it is not clear why the advantage for single table is not as big as 
>multiple tables with the Truncate command
I guess it's the number of table blocks that caused this difference. For the single
table test, the number of blocks per table is the threshold itself.
For the multiple-table tests, the number of blocks per table is a value (like one,
or dozens, or hundreds) which is far below the threshold.
The closer the table's block count is to the threshold, the smaller the advantage.

I tested the 3 situations below with 50 tables when shared_buffers = 20G / 100G.
1. For multiple tables which had one or dozens or hundreds of blocks (far below the
threshold) per table, we got a significant improvement, like [1].
2. For multiple tables which have half the threshold blocks per table, the advantage
becomes smaller, like [2].
3. For multiple tables which have exactly the threshold blocks per table, the
advantage becomes even smaller, like [3].

[1].  247 blocks per table
s_b     master  patched  %reg((patched-master)/patched)
--------------------------------------------------------
20GB    1.109   0.108    -927%
100GB   3.113   0.108    -2782%

[2].  NBuffers/256/2 blocks per table
s_b     master  patched  %reg
------------------------------
20GB    2.012   1.210    -66%
100GB   10.226  6.4      -60%

[3].  NBuffers/256 blocks per table
s_b     master  patched  %reg
------------------------------
20GB    3.868   2.412    -60%
100GB   14.977  10.591   -41%

>Can you share your exact test steps for any one of the tests? Also, did you 
>change autovacumm = off for these tests?
Yes, I configured a streaming replication environment as Kirk did before.
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

Test steps: 
e.g. shared_buffers=20G, NBuffers/512, table blocks= 20*1024*1024/8/512=5120 . 
table size(kB)= 20*1024*1024/512=40960kB 
1. (Master) create table test(id int, v_ch varchar, v_ch1 varchar); 
2. (Master) insert about 40MB data to table.
3. (Master) delete from table (all rows of table) 
4. (Standby) To test with failover, pause the WAL replay on standby server.
   SELECT pg_wal_replay_pause();
5. (Master) VACUUM;
6. (Master) Stop primary server. pg_ctl stop -D $PGDATA -w
7. (Standby) Resume WAL replay and promote standby. (get the recovery time from this step)

Regards
Tang




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread k.jami...@fujitsu.com
On Sunday, January 3, 2021 10:35 PM (JST), Amit Kapila wrote:
> On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com
>  wrote:
> >
> > Happy new year. The V38 LGTM.
> > Apologies for a bit of delay on posting the test results, but since
> > it's the start of commitfest, here it goes and the results were interesting.
> >
> > I executed a VACUUM test using the same approach that Tsunakawa-san
> > did in [1], but only this time, the total time that DropRelFileNodeBuffers()
> took.
> >
> 
> Please specify the exact steps like did you deleted all the rows from a table 
> or
> some of it or none before performing Vacuum? How did you measure this
> time, did you removed the cached check? It would be better if you share the
> scripts and or the exact steps so that the same can be used by others to
> reproduce.

Basically, I used the TimestampDifference function in DropRelFileNodeBuffers().
I also executed DELETE before VACUUM.
I also removed the nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD check
and used the threshold as the relation size.
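
Roughly, the throwaway timing code was along these lines (a sketch only, not
the exact code, and not part of the patch; it just wraps the function body
with GetCurrentTimestamp()/TimestampDifference() from utils/timestamp.h):

    TimestampTz start_time = GetCurrentTimestamp();    /* at function entry */
    TimestampTz end_time;
    long        secs;
    int         usecs;

    /* ... existing body of DropRelFileNodeBuffers() ... */

    end_time = GetCurrentTimestamp();                   /* just before returning */
    TimestampDifference(start_time, end_time, &secs, &usecs);
    elog(LOG, "DropRelFileNodeBuffers() took %ld.%06d s", secs, usecs);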

> Hmm, in the tests done by Tang, the results indicate that in some cases the
> patched version is slower at even NBuffers/32, so not sure if we can go to
> values shown by you unless she is doing something wrong. I think the
> difference in results could be because both of you are using different
> techniques to measure the timings, so it might be better if both of you can
> share scripts or exact steps used to measure the time and other can use the
> same technique and see if we are getting consistent results.

Right, since we want consistent results, please disregard the approach that I used.
I will resume testing similar to Tang's, because she also executed the original
failover test which I have been doing before.
To avoid confusion and to check whether my results and Tang's are consistent,
I also did the recovery/failover test for VACUUM on a single relation, which I
will send in a separate email after this.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread k.jami...@fujitsu.com
On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will resume the test similar to Tang, because she also executed the original
> failover test which I have been doing before.
> To avoid confusion and to check if the results from mine and Tang are
> consistent, I also did the recovery/failover test for VACUUM on single 
> relation,
> which I will send in a separate email after this.

A. Test to find the right THRESHOLD

So below are the procedures and results of the VACUUM recovery performance
test on a single relation.
I followed the advice below and applied the supplementary patch on top of V39:
Test-for-threshold.patch
This ensures that we always enter the optimized path.
We then use the threshold as the relation size.

> >One idea could be to remove "nBlocksToInvalidate < 
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && 
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it 
> >always use optimized path for the tests. Then use the relation size 
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of 
> >shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script
to test from NBuffers/512 up to NBuffers/8 relation size per shared_buffers setting.
I did not go beyond NBuffers/8 because it took too much time, and I could
already observe significant results by that point.

[Vacuum Recovery Performance on Single Relation]
1. Setup synchronous streaming replication. I used the configuration
  written at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the correct
   size for the set shared_buffers by commenting out the right size in "insert"
   of rel.sh script. (rel.sh insert)
4. [Primary] Delete table. (rel.sh delete)
5. [Standby] Optional: To double-check that DELETE is reflected on standby.
   SELECT count(*) FROM tableXXX;
  Make sure it returns 0.
6. [Standby] Pause WAL replay. (rel.sh pause)
  (This script will execute SELECT pg_wal_replay_pause(); .)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop)
   (The script will execute pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby.
  (rel.sh resume)
  It basically prints a timestamp when resuming WAL replay,
  and prints another timestamp when the promotion is done.
  Compute the time difference.

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB  | 1GB| 20GB   | 100GB| 
|--||||--| 
| NB/512   | 0.000% | 0.000% | 0.000% | -32.680% | 
| NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   | 
| NB/128   | 0.000% | 0.000% | 0.000% | -16.502% | 
| NB/64| 0.000% | 0.000% | 0.000% | -9.841%  | 
| NB/32| 0.000% | 0.000% | 0.000% | -6.219%  | 
| NB/16| 0.000% | 0.000% | 0.000% | 3.323%   | 
| NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |

For 100GB shared_buffers, we can observe regression
beyond NBuffers/32. So with this, we can conclude
that NBuffers/32 is the right threshold.
For NBuffers/16 and beyond, the patched version performs
worse than master. In other words, the cost of finding
the buffers to be invalidated becomes higher in the optimized path
than in the traditional path.

So in attached V39 patches, I have updated the threshold
BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
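
Concretely, the change is only the divisor in the threshold definition, roughly as
below (a sketch using the names from this thread; the exact definition in v39 may
differ slightly):

    /*
     * The full scan of the buffer pool is avoided when the total number of
     * blocks to be invalidated is below this fraction of NBuffers.  With a
     * 100 GB buffer pool (NBuffers = 13,107,200 8 kB buffers) this is about
     * 409,600 blocks, i.e. roughly a 3 GB relation.
     */
    #define BUF_DROP_FULL_SCAN_THRESHOLD    (uint64) (NBuffers / 32)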

2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB | 
|--|---|---|---|---| 
| NB/512   | 0.106 | 0.106 | 0.106 | 0.206 | 
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/128   | 0.106 | 0.106 | 0.206 | 0.506 | 
| NB/64| 0.106 | 0.106 | 0.306 | 0.907 | 
| NB/32| 0.106 | 0.106 | 0.406 | 1.508 | 
| NB/16| 0.106 | 0.106 | 0.706 | 3.109 | 
| NB/8 | 0.106 | 0.106 | 1.307 | 6.614 |

3. MASTER
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB | 
|--|---|---|---|---| 
| NB/512   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/128   | 0.106 | 0.106 | 0.206 | 0.606 | 
| NB/64| 0.106 | 0.106 | 0.306 | 1.006 | 
| NB/32| 0.106 | 0.106 | 0.406 | 1.608 | 
| NB/16| 0.106 | 0.106 | 0.706 | 3.009 | 
| NB/8 | 0.106 | 0.106 | 1.307 | 6.114 |

I used the following configurations:
[postgesql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 1
max_wal_size = 20GB

# For streaming replication from primary. Don't uncomment on Standby.
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'

# For Standby. Don't uncomment on Primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

--
B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this one,

RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread Tang, Haiying
Hi Kirk,

>And if you want to test, I have already indicated the detailed steps including 
>the scripts I used. Have fun testing!

Thank you for sharing the test steps and scripts. I'd like to take a look at
them and redo some of the tests using my machine. I'll send my test results in a
separate email after this.

Regards,
Tang




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread Tang, Haiying
>I'd like take a look at them and redo some of the tests using my machine. I'll 
>send my test reults in a separate email after this.

I did the same tests with Kirk's scripts using the latest patch on my own
machine. The results look pretty good and similar to Kirk's.

average of 5 runs.

[VACUUM failover test for 1000 relations] Unit is second, 
%reg=(patched-master)/ master

| s_b   | Master| Patched   | %reg  | 
|--|---|--|--| 
| 128 MB| 0.408 | 0.308 |  -24.44%  | 
| 1 GB  | 0.809 | 0.308 |  -61.94%  | 
| 20 GB | 12.529| 0.308 |  -97.54%  | 
| 100 GB| 59.310| 0.369 |  -99.38%  |

[TRUNCATE failover test for 1000 relations] Unit is second, 
%reg=(patched-master)/ master

| s_b   | Master| Patched   | %reg  | 
|--|---|--|--| 
| 128 MB| 0.287 | 0.207 |  -27.91%  | 
| 1 GB  | 0.688 | 0.208 |  -69.84%  | 
| 20 GB | 12.449| 0.208 |  -98.33%  | 
| 100 GB| 61.800| 0.207 |  -99.66%  |

Besides, I did the test for the threshold value again. (I rechecked my test process
and found that I had forgotten to check the data synchronization state on the
standby, which may have introduced some NOISE into my earlier results.)
The following results show we can't get the optimization beyond NBuffers/32, just
like Kirk's test results, so I agree with Kirk on the threshold value.

%regression:
| rel_size |128MB|1GB|20GB| 100GB |
|--||||---| 
| NB/512   | 0% | 0% | 0% | -48%  | 
| NB/256   | 0% | 0% | 0% | -33%  | 
| NB/128   | 0% | 0% | 0% | -9%   | 
| NB/64| 0% | 0% | 0% | -5%   | 
| NB/32| 0% | 0% |-4% | -3%   | 
| NB/16| 0% | 0% |-4% | 1%| 
| NB/8 | 1% | 0% | 1% | 3%|

Optimization details (unit: second):
patched:
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8
--------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.106        0.107        0.107        0.107
1G              0.107         0.107         0.107         0.107        0.107        0.107        0.107
20G             0.107         0.108         0.207         0.307        0.442        0.876        1.577
100G            0.208         0.308         0.559         1.060        1.961        4.567        7.922

master:
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8
--------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.107        0.107        0.106
1G              0.107         0.107         0.107         0.107        0.107        0.107        0.107
20G             0.107         0.107         0.208         0.308        0.457        0.910        1.560
100G            0.308         0.409         0.608         1.110        2.011        4.516        7.721

[Specs]
CPU : 40 processors  (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

Any questions about my tests are welcome.

Regards,
Tang





Re: [Patch] Optimize dropping of relation buffers using dlist

2021-01-07 Thread Amit Kapila
On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com
 wrote:
>
> [Results for VACUUM on single relation]
> Average of 5 runs.
>
> 1. % REGRESSION
> % Regression: (patched - master)/master
>
> | rel_size | 128MB  | 1GB| 20GB   | 100GB|
> |--||||--|
> | NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
> | NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
> | NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
> | NB/64| 0.000% | 0.000% | 0.000% | -9.841%  |
> | NB/32| 0.000% | 0.000% | 0.000% | -6.219%  |
> | NB/16| 0.000% | 0.000% | 0.000% | 3.323%   |
> | NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |
>
> For 100GB shared_buffers, we can observe regression
> beyond NBuffers/32. So with this, we can conclude
> that NBuffers/32 is the right threshold.
> For NBuffers/16 and beyond, the patched performs
> worse than master. In other words, the cost of for finding
> to be invalidated buffers gets higher in the optimized path
> than the traditional path.
>
> So in attached V39 patches, I have updated the threshold
> BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
>

Thanks for the detailed tests. NBuffers/32 seems like an appropriate
value for the threshold based on these results. I would like to
slightly modify part of the commit message in the first patch as below
[1], otherwise, I am fine with the changes. Unless you or anyone else
has any more comments, I am planning to push the 0001 and 0002
sometime next week.

[1]
"The recovery path of DropRelFileNodeBuffers() is optimized so that
scanning of the whole buffer pool can be avoided when the number of
blocks to be truncated in a relation is below a certain threshold. For
such cases, we find the buffers by doing lookups in BufMapping table.
This improves the performance by more than 100 times in many cases
when several small tables (tested with 1000 relations) are truncated
and where the server is configured with a large value of shared
buffers (greater than 100GB)."

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-07 Thread k.jami...@fujitsu.com
On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote:
> 
> On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com
>  wrote:
> >
> > [Results for VACUUM on single relation]
> > Average of 5 runs.
> >
> > 1. % REGRESSION
> > % Regression: (patched - master)/master
> >
> > | rel_size | 128MB  | 1GB| 20GB   | 100GB|
> > |--||||--|
> > | NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
> > | NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
> > | NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
> > | NB/64| 0.000% | 0.000% | 0.000% | -9.841%  |
> > | NB/32| 0.000% | 0.000% | 0.000% | -6.219%  |
> > | NB/16| 0.000% | 0.000% | 0.000% | 3.323%   |
> > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |
> >
> > For 100GB shared_buffers, we can observe regression
> > beyond NBuffers/32. So with this, we can conclude
> > that NBuffers/32 is the right threshold.
> > For NBuffers/16 and beyond, the patched performs
> > worse than master. In other words, the cost of for finding
> > to be invalidated buffers gets higher in the optimized path
> > than the traditional path.
> >
> > So in attached V39 patches, I have updated the threshold
> > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
> >
> 
> Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> value for the threshold based on these results. I would like to
> slightly modify part of the commit message in the first patch as below
> [1], otherwise, I am fine with the changes. Unless you or anyone else
> has any more comments, I am planning to push the 0001 and 0002
> sometime next week.
> 
> [1]
> "The recovery path of DropRelFileNodeBuffers() is optimized so that
> scanning of the whole buffer pool can be avoided when the number of
> blocks to be truncated in a relation is below a certain threshold. For
> such cases, we find the buffers by doing lookups in BufMapping table.
> This improves the performance by more than 100 times in many cases
> when several small tables (tested with 1000 relations) are truncated
> and where the server is configured with a large value of shared
> buffers (greater than 100GB)."

Thank you for taking a look at the results of the tests. They are also
consistent with the results from Tang.
The commit message LGTM.

Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2021-01-07 Thread Kyotaro Horiguchi
At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote:
> > 
> > On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > [Results for VACUUM on single relation]
> > > Average of 5 runs.
> > >
> > > 1. % REGRESSION
> > > % Regression: (patched - master)/master
> > >
> > > | rel_size | 128MB  | 1GB| 20GB   | 100GB|
> > > |--||||--|
> > > | NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
> > > | NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
> > > | NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
> > > | NB/64| 0.000% | 0.000% | 0.000% | -9.841%  |
> > > | NB/32| 0.000% | 0.000% | 0.000% | -6.219%  |
> > > | NB/16| 0.000% | 0.000% | 0.000% | 3.323%   |
> > > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |
> > >
> > > For 100GB shared_buffers, we can observe regression
> > > beyond NBuffers/32. So with this, we can conclude
> > > that NBuffers/32 is the right threshold.
> > > For NBuffers/16 and beyond, the patched version performs
> > > worse than master. In other words, the cost of finding the
> > > buffers to be invalidated becomes higher in the optimized path
> > > than in the traditional path.
> > >
> > > So in attached V39 patches, I have updated the threshold
> > > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
> > >
> > 
> > Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> > value for the threshold based on these results. I would like to
> > slightly modify part of the commit message in the first patch as below
> > [1], otherwise, I am fine with the changes. Unless you or anyone else
> > has any more comments, I am planning to push the 0001 and 0002
> > sometime next week.
> > 
> > [1]
> > "The recovery path of DropRelFileNodeBuffers() is optimized so that
> > scanning of the whole buffer pool can be avoided when the number of
> > blocks to be truncated in a relation is below a certain threshold. For
> > such cases, we find the buffers by doing lookups in BufMapping table.
> > This improves the performance by more than 100 times in many cases
> > when several small tables (tested with 1000 relations) are truncated
> > and where the server is configured with a large value of shared
> > buffers (greater than 100GB)."
> 
> Thank you for taking a look at the results of the tests. And it's also 
> consistent with the results from Tang too.
> The commit message LGTM.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-06-16 Thread k.jami...@fujitsu.com
Hi,

Since the last posted version of the patch no longer applies, attached is a rebased version.
Performance results and some benefits and challenges were described upthread.
I'd appreciate your feedback/comments.

Regards,
Kirk Jamison


v8-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v8-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-02-04 Thread k.jami...@fujitsu.com
Hi,

I have rebased the patch to keep the CFbot happy.
Apparently, in the previous patch there was a possibility of an infinite loop
when allocating buffers, so I fixed that part and also removed some stray whitespace.

Kindly check the attached V6 patch.
Any thoughts on this?

Regards,
Kirk Jamison


v6-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v6-Optimize-dropping-of-relation-buffers-using-dlist.patch


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-02-05 Thread Robert Haas
On Tue, Feb 4, 2020 at 4:57 AM k.jami...@fujitsu.com
 wrote:
> Kindly check the attached V6 patch.
> Any thoughts on this?

Unfortunately, I don't have time for detailed review of this. I am
suspicious that there are substantial performance regressions that you
just haven't found yet. I would not take the position that this is a
completely hopeless approach, or anything like that, but neither would
I conclude that the tests shown so far are anywhere near enough to be
confident that there are no problems.

Also, systems with very large shared_buffers settings are becoming
more common, and probably will continue to become more common, so I
don't think we can dismiss that as an edge case any more. People don't
want to run with an 8GB cache on a 1TB server.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-29 Thread Konstantin Knizhnik




On 17.06.2020 09:14, k.jami...@fujitsu.com wrote:

Hi,

Since the last posted version of the patch fails, attached is a rebased version.
Written upthread were performance results and some benefits and challenges.
I'd appreciate your feedback/comments.

Regards,
Kirk Jamison
As far as I understand, this patch can provide a significant performance
improvement only in the case of recovery of truncates of a large number of
tables. You have added a shared hash of relation buffers, and that certainly
adds some extra overhead. According to your latest results this overhead is
quite small, but it will be hard to prove that there will be no noticeable
regression on some workloads.

I wonder if you have considered the case of a local hash (maintained only
during recovery)?
If this is after-crash recovery, then there is no concurrent access to shared
buffers and this hash will be up-to-date.
In the case of a hot-standby replica we can use some simple invalidation (just
one flag or counter which indicates that the buffer cache was updated).
This hash can also be constructed on demand when DropRelFileNodeBuffers is
called for the first time (so we have to scan all buffers once, but subsequent
drop operations will be fast).


I have not thought much about it, but it seems to me that as far as this
problem only affects recovery, we do not need a shared hash for it.
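
Just to illustrate the kind of thing I mean (a minimal sketch under my own
assumptions; the names are hypothetical and this is not a concrete proposal):
a backend-local dynahash, built by the startup process, keyed by RelFileNode
and pointing at the buffers loaded for that relation during recovery.

/* Illustrative sketch only; names are hypothetical, not from any patch. */
#include "lib/ilist.h"
#include "storage/relfilenode.h"
#include "utils/hsearch.h"

typedef struct RecoveryRelBufItem
{
    dlist_node  node;           /* links buffers of one relation */
    int         buf_id;
} RecoveryRelBufItem;

typedef struct RecoveryRelBufEntry
{
    RelFileNode rnode;          /* hash key */
    dlist_head  buffers;        /* list of RecoveryRelBufItem */
} RecoveryRelBufEntry;

static HTAB *RecoveryRelBufHash = NULL;

static void
recovery_relbuf_init(void)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(RelFileNode);
    ctl.entrysize = sizeof(RecoveryRelBufEntry);

    /* backend-local (not shared); only the startup process would touch it */
    RecoveryRelBufHash = hash_create("recovery relation buffers",
                                     1024, &ctl,
                                     HASH_ELEM | HASH_BLOBS);
}

New entries would call dlist_init() on the buffers list, and
DropRelFileNodeBuffers could then walk just that list instead of scanning
all of shared buffers.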






RE: [Patch] Optimize dropping of relation buffers using dlist

2020-07-30 Thread k.jami...@fujitsu.com
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
> On 17.06.2020 09:14, k.jami...@fujitsu.com wrote:
> > Hi,
> >
> > Since the last posted version of the patch fails, attached is a rebased 
> > version.
> > Written upthread were performance results and some benefits and challenges.
> > I'd appreciate your feedback/comments.
> >
> > Regards,
> > Kirk Jamison

> As far as i understand this patch can provide significant improvement of
> performance only in case of recovery  of truncates of large number of tables. 
> You
> have added shared hash of relation buffers and certainly if adds some extra
> overhead. According to your latest results this overhead is quite small. But 
> it will
> be hard to prove that there will be no noticeable regression at some 
> workloads.

Thank you for taking a look at this.

Yes, one of the aims is to speed up recovery of truncations, but at the same
time the patch also improves autovacuum, vacuum, and relation/index truncate
executions.
I showed pgbench results above for different types of workloads,
but I am not sure if those are convincing enough...

> I wonder if you have considered case of local hash (maintained only during
> recovery)?
> If there is after-crash recovery, then there will be no concurrent access to 
> shared
> buffers and this hash will be up-to-date.
> in case of hot-standby replica we can use some simple invalidation (just one 
> flag
> or counter which indicates that buffer cache was updated).
> This hash also can be constructed on demand when DropRelFileNodeBuffers is
> called first time (so w have to scan all buffers once, but subsequent drop
> operation will be fast.
> 
> i have not thought much about it, but it seems to me that as far as this 
> problem
> only affects recovery, we do not need shared hash for it.
> 

The idea of the patch is to mark the relation buffers to be dropped after
scanning the whole of shared buffers, store them in shared memory maintained
in a dlist, and traverse the dlist on the next scan.
But I understand the point that this is expensive and may cause overhead,
which is why I tried to define a macro to limit the number of pages that we
can cache, for cases where the lookup cost can be problematic (i.e. too many
pages in a relation).

#define BUF_ID_ARRAY_SIZE 100
int buf_id_array[BUF_ID_ARRAY_SIZE];
int forknum_indexes[BUF_ID_ARRAY_SIZE];

/* In DropRelFileNodeBuffers: fetch the cached buffers in batches */
do
{
    nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array,
                              lengthof(buf_id_array));
    for (i = 0; i < nbufs; i++)
    {
        ...     /* invalidate each returned buffer */
    }
} while (nbufs == lengthof(buf_id_array));


Perhaps the patch adds too much complexity, so we may want to keep it simpler,
or commit it piece by piece?
I will look further into your suggestion of maintaining a local hash only
during recovery.
Thank you for the suggestion.

Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-30 Thread Konstantin Knizhnik
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I have tested this patch on various workloads and hardware (including a Power2
server with 384 virtual cores)
and didn't find a performance regression.

The new status of this patch is: Ready for Committer


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-07-30 Thread k.jami...@fujitsu.com
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested
> 
> I have tested this patch at various workloads and hardware (including Power2
> server with 384 virtual cores) and didn't find performance regression.
> 
> The new status of this patch is: Ready for Committer

Thank you very much, Konstantin, for testing the patch on different workloads.
I wonder if I need to modify some documentation.
I'll leave the final review to the committer(s) as well.

Regards,
Kirk Jamison


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-31 Thread Tom Lane
Robert Haas  writes:
> Unfortunately, I don't have time for detailed review of this. I am
> suspicious that there are substantial performance regressions that you
> just haven't found yet. I would not take the position that this is a
> completely hopeless approach, or anything like that, but neither would
> I conclude that the tests shown so far are anywhere near enough to be
> confident that there are no problems.

I took a quick look through the v8 patch, since it's marked RFC, and
my feeling is about the same as Robert's: it is just about impossible
to believe that doubling (or more) the amount of hashtable manipulation
involved in allocating a buffer won't hurt common workloads.  The
offered pgbench results don't reassure me; we've so often found that
pgbench fails to expose performance problems, except maybe when it's
used just so.

But aside from that, I noted a number of things I didn't like a bit:

* The amount of new shared memory this needs seems several orders
of magnitude higher than what I'd call acceptable: according to my
measurements it's over 10KB per shared buffer!  Most of that is going
into the CachedBufTableLock data structure, which seems fundamentally
misdesigned --- how could we be needing a lock per map partition *per
buffer*?  For comparison, the space used by buf_table.c is about 56
bytes per shared buffer; I think this needs to stay at least within
hailing distance of there.

* It is fairly suspicious that the new data structure is manipulated
while holding per-partition locks for the existing buffer hashtable.
At best that seems bad for concurrency, and at worst it could result
in deadlocks, because I doubt we can assume that the new hash table
has partition boundaries identical to the old one.

* More generally, it seems like really poor design that this has been
written completely independently of the existing buffer hash table.
Can't we get any benefit by merging them somehow?

* I do not like much of anything in the code details.  "CachedBuf"
is as unhelpful as could be as a data structure identifier --- what
exactly is not "cached" about shared buffers already?  "CombinedLock"
is not too helpful either, nor could I find any documentation explaining
why you need to invent new locking technology in the first place.
At best, CombinedLockAcquireSpinLock seems like a brute-force approach
to an undocumented problem.

* The commentary overall is far too sparse to be of any value ---
basically, any reader will have to reverse-engineer your entire design.
That's not how we do things around here.  There should be either a README,
or a long file header comment, explaining what's going on, how the data
structure is organized, and what the locking requirements are.
See src/backend/storage/buffer/README for the sort of documentation
that I think this needs.

Even if I were convinced that there's no performance gotchas,
I wouldn't commit this in anything like its current form.

Robert again:
> Also, systems with very large shared_buffers settings are becoming
> more common, and probably will continue to become more common, so I
> don't think we can dismiss that as an edge case any more. People don't
> want to run with an 8GB cache on a 1TB server.

I do agree that it'd be great to improve this area.  Just not convinced
that this is how.

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-31 Thread Andres Freund
Hi,

On 2020-07-31 13:39:37 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > Unfortunately, I don't have time for detailed review of this. I am
> > suspicious that there are substantial performance regressions that you
> > just haven't found yet. I would not take the position that this is a
> > completely hopeless approach, or anything like that, but neither would
> > I conclude that the tests shown so far are anywhere near enough to be
> > confident that there are no problems.
> 
> I took a quick look through the v8 patch, since it's marked RFC, and
> my feeling is about the same as Robert's: it is just about impossible
> to believe that doubling (or more) the amount of hashtable manipulation
> involved in allocating a buffer won't hurt common workloads.  The
> offered pgbench results don't reassure me; we've so often found that
> pgbench fails to expose performance problems, except maybe when it's
> used just so.

Indeed. The buffer mapping hashtable already is visible as a major
bottleneck in a number of workloads. Even in readonly pgbench if s_b is
large enough (so the hashtable is larger than the cache). Not to speak
of things like a cached sequential scan with a cheap qual and wide rows.


> Robert again:
> > Also, systems with very large shared_buffers settings are becoming
> > more common, and probably will continue to become more common, so I
> > don't think we can dismiss that as an edge case any more. People don't
> > want to run with an 8GB cache on a 1TB server.
> 
> I do agree that it'd be great to improve this area.  Just not convinced
> that this is how.

Wonder if the temporary fix is just to do explicit hashtable probes for
all pages iff the size of the relation is < s_b / 500 or so. That'll
address the case where small tables are frequently dropped - and
dropping large relations is more expensive from the OS and data loading
perspective, so it's not gonna happen as often.
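
(Very roughly, and only as a sketch with made-up variable names, the gate
could look like the following in DropRelFileNodeBuffers; how to get a
trustworthy nblocks cheaply is the open question:)

    BlockNumber nblocks = smgrnblocks(smgr, forkNum);  /* must be reliable */

    if (nblocks != InvalidBlockNumber &&
        nblocks < (BlockNumber) (NBuffers / 500))
    {
        /* probe the buffer mapping table for each block of the fork */
    }
    else
    {
        /* fall back to the full scan of shared buffers */
    }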

Greetings,

Andres Freund




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-31 Thread Tom Lane
Andres Freund  writes:
> Indeed. The buffer mapping hashtable already is visible as a major
> bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> large enough (so the hashtable is larger than the cache). Not to speak
> of things like a cached sequential scan with a cheap qual and wide rows.

To be fair, the added overhead is in buffer allocation not buffer lookup.
So it shouldn't add cost to fully-cached cases.  As Tomas noted upthread,
the potential trouble spot is where the working set is bigger than shared
buffers but still fits in RAM (so there's no actual I/O needed, but we do
still have to shuffle buffers a lot).

> Wonder if the temporary fix is just to do explicit hashtable probes for
> all pages iff the size of the relation is < s_b / 500 or so. That'll
> address the case where small tables are frequently dropped - and
> dropping large relations is more expensive from the OS and data loading
> perspective, so it's not gonna happen as often.

Oooh, interesting idea.  We'd need a reliable idea of how long the
relation had been (preferably without adding an lseek call), but maybe
that's do-able.

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-07-31 Thread Andres Freund
Hi,

On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Indeed. The buffer mapping hashtable already is visible as a major
> > bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> > large enough (so the hashtable is larger than the cache). Not to speak
> > of things like a cached sequential scan with a cheap qual and wide rows.
> 
> To be fair, the added overhead is in buffer allocation not buffer lookup.
> So it shouldn't add cost to fully-cached cases.  As Tomas noted upthread,
> the potential trouble spot is where the working set is bigger than shared
> buffers but still fits in RAM (so there's no actual I/O needed, but we do
> still have to shuffle buffers a lot).

Oh, right, not sure what I was thinking.


> > Wonder if the temporary fix is just to do explicit hashtable probes for
> > all pages iff the size of the relation is < s_b / 500 or so. That'll
> > address the case where small tables are frequently dropped - and
> > dropping large relations is more expensive from the OS and data loading
> > perspective, so it's not gonna happen as often.
> 
> Oooh, interesting idea.  We'd need a reliable idea of how long the
> relation had been (preferably without adding an lseek call), but maybe
> that's do-able.

IIRC we already do smgrnblocks nearby, when doing the truncation (to
figure out which segments we need to remove). Perhaps we can arrange to
combine the two? The layering probably makes that somewhat ugly :(

We could also just use pg_class.relpages. It'll probably mostly be
accurate enough?

Or we could just cache the result of the last smgrnblocks call...
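
(As a sketch of that last idea -- with a hypothetical smgr_cached_nblocks
field added to SMgrRelationData, so this is not what any patch necessarily
does:)

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    /* hypothetical new field, remembered until the next smgr invalidation */
    reln->smgr_cached_nblocks[forknum] = result;

    return result;
}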


One of the cases where this type of strategy is most interesting to me is
the partial truncations that autovacuum does... There we even know the
range of tables ahead of time.

Greetings,

Andres Freund




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-05 Thread k.jami...@fujitsu.com
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

> Hi,
> 
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > Indeed. The buffer mapping hashtable already is visible as a major
> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > is large enough (so the hashtable is larger than the cache). Not to
> > > speak of things like a cached sequential scan with a cheap qual and wide
> rows.
> >
> > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > upthread, the potential trouble spot is where the working set is
> > bigger than shared buffers but still fits in RAM (so there's no actual
> > I/O needed, but we do still have to shuffle buffers a lot).
> 
> Oh, right, not sure what I was thinking.
> 
> 
> > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > That'll address the case where small tables are frequently dropped -
> > > and dropping large relations is more expensive from the OS and data
> > > loading perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > relation had been (preferably without adding an lseek call), but maybe
> > that's do-able.
> 
> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure 
> out
> which segments we need to remove). Perhaps we can arrange to combine the
> two? The layering probably makes that somewhat ugly :(
> 
> We could also just use pg_class.relpages. It'll probably mostly be accurate
> enough?
> 
> Or we could just cache the result of the last smgrnblocks call...
> 
> 
> One of the cases where this type of strategy is most intersting to me is the 
> partial
> truncations that autovacuum does... There we even know the range of tables
> ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment on the added overhead on BufferAlloc.
Regarding the case where the patch would potentially affect workloads that fit
into RAM but not into shared buffers, could one of Andres' suggested ideas
above address that, in addition to this patch's possible shared invalidation
fix? Could that settle the added overhead in BufferAlloc() as a temporary fix?
Thomas Munro is also working on caching relation sizes [1], maybe that way we
could get the latest known relation size. Currently, it's possible only during
recovery in smgrnblocks.

Tom Lane wrote:
> But aside from that, I noted a number of things I didn't like a bit:
> 
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my measurements
> it's over 10KB per shared buffer!  Most of that is going into the
> CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> how could we be needing a lock per map partition *per buffer*?  For 
> comparison,
> the space used by buf_table.c is about 56 bytes per shared buffer; I think 
> this
> needs to stay at least within hailing distance of there.
> 
> * It is fairly suspicious that the new data structure is manipulated while 
> holding
> per-partition locks for the existing buffer hashtable.
> At best that seems bad for concurrency, and at worst it could result in 
> deadlocks,
> because I doubt we can assume that the new hash table has partition boundaries
> identical to the old one.
> 
> * More generally, it seems like really poor design that this has been written
> completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?

The original aim is just to shorten the recovery process, and the speedup
of both the vacuum and truncate processes is just an added bonus.
Given that we don't yet have a shared invalidation mechanism in place, such as
a radix-tree buffer mapping, which is complex, I thought a patch like mine
could be an alternative approach to that. So I want to improve the patch
further.
I hope you can help me clarify the direction, so that I can avoid going
farther away from what the community wants.
 1. Both normal operations and recovery process
 2. Improve recovery process only

For 1, the current patch aims to touch on that, but further design improvement
is needed.
It would be ideal to modify BufferDesc, but it cannot be expanded any further
because that would exceed the CPU cache line size. So I added new data
structures (hash table, dlist, lock) instead of modifying the existing ones.
The new hash table is kept consistent with the old one by using the same
RelFileNode in the key and by taking a lock when inserting and deleting
buffers from the buffer table, as well as during lookups. As for the partition
locking, I added it to reduce
lock contentio

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-06 Thread Tomas Vondra

On Thu, Aug 06, 2020 at 01:23:31AM +, k.jami...@fujitsu.com wrote:

On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.


Hi,

On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Indeed. The buffer mapping hashtable already is visible as a major
> > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > is large enough (so the hashtable is larger than the cache). Not to
> > speak of things like a cached sequential scan with a cheap qual and wide
rows.
>
> To be fair, the added overhead is in buffer allocation not buffer lookup.
> So it shouldn't add cost to fully-cached cases.  As Tomas noted
> upthread, the potential trouble spot is where the working set is
> bigger than shared buffers but still fits in RAM (so there's no actual
> I/O needed, but we do still have to shuffle buffers a lot).

Oh, right, not sure what I was thinking.


> > Wonder if the temporary fix is just to do explicit hashtable probes
> > for all pages iff the size of the relation is < s_b / 500 or so.
> > That'll address the case where small tables are frequently dropped -
> > and dropping large relations is more expensive from the OS and data
> > loading perspective, so it's not gonna happen as often.
>
> Oooh, interesting idea.  We'd need a reliable idea of how long the
> relation had been (preferably without adding an lseek call), but maybe
> that's do-able.

IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
which segments we need to remove). Perhaps we can arrange to combine the
two? The layering probably makes that somewhat ugly :(

We could also just use pg_class.relpages. It'll probably mostly be accurate
enough?

Or we could just cache the result of the last smgrnblocks call...


One of the cases where this type of strategy is most intersting to me is the 
partial
truncations that autovacuum does... There we even know the range of tables
ahead of time.


Konstantin tested it on various workloads and saw no regression.


Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding a non-trivial amount of code
to a place that can be quite hot.

And I can trivially reproduce measurable (and significant) regression
using a very simple pgbench read-only test, with amount of data that
exceeds shared buffers but fits into RAM.

The following numbers are from a x86_64 machine with 16 cores (32 w HT),
64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so 16GB,
i.e. twice the SB size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

              1 client    8 clients    16 clients
   -----------------------------------------------
   master        38249       236336        368591
   patched       35853       217259        349248
                   -6%          -8%           -5%

These are averages of the runs, but the conclusions for the medians are almost
exactly the same.


But I understand the sentiment on the added overhead on BufferAlloc.
Regarding the case where the patch would potentially affect workloads
that fit into RAM but not into shared buffers, could one of Andres'
suggested idea/s above address that, in addition to this patch's
possible shared invalidation fix? Could that settle the added overhead
in BufferAlloc() as temporary fix?


Not sure.


Thomas Munro is also working on caching relation sizes [1], maybe that
way we could get the latest known relation size. Currently, it's
possible only during recovery in smgrnblocks.


It's not clear to me how knowing the relation size would help reduce
the overhead of this patch?

Can't we somehow identify cases when this optimization might help and
only actually enable it in those cases? Like in a recovery, with a lot
of truncates, or something like that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


run.sh
Description: Bourne shell script
1 client
38408.098942
37148.085739
38553.640548
37243.143856
38487.044062
37225.251112
38584.828645
37474.801133
38506.580090
37942.415423
38438.285457
38249.098321
37232.708808
37368.309774
38268.592606
8 clients
242598.113719
240728.545096
238609.937059
236060.993323
235654.171372
238064.785008
244072.880501
233852.274273
229449.357520
238786.728666
229605.148726
233260.304044
232185.071739
242031.025232
236335.954147
16 clients
365702.997309
366334.247594
371966.060740
376383.183023
375279.446966
375416.168457
368591.075982
367195.919635
355262.316767
368375.432125
361657.667234
386455.075762
365156.675183
372176.374557
376528.470288
1 client
35725.116701
34158.764487
36346.391570
35951.323419
33164.079463
35317.282661
35991.253869
36031.895254
36446.26343

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-06 Thread Amit Kapila
On Thu, Aug 6, 2020 at 6:53 AM k.jami...@fujitsu.com
 wrote:
>
> On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
>
> Hi,
> Thank you for your constructive review and comments.
> Sorry for the late reply.
>
> > Hi,
> >
> > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > Indeed. The buffer mapping hashtable already is visible as a major
> > > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > > is large enough (so the hashtable is larger than the cache). Not to
> > > > speak of things like a cached sequential scan with a cheap qual and wide
> > rows.
> > >
> > > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > > upthread, the potential trouble spot is where the working set is
> > > bigger than shared buffers but still fits in RAM (so there's no actual
> > > I/O needed, but we do still have to shuffle buffers a lot).
> >
> > Oh, right, not sure what I was thinking.
> >
> >
> > > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > > That'll address the case where small tables are frequently dropped -
> > > > and dropping large relations is more expensive from the OS and data
> > > > loading perspective, so it's not gonna happen as often.
> > >
> > > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > > relation had been (preferably without adding an lseek call), but maybe
> > > that's do-able.
> >
> > IIRC we already do smgrnblocks nearby, when doing the truncation (to figure 
> > out
> > which segments we need to remove). Perhaps we can arrange to combine the
> > two? The layering probably makes that somewhat ugly :(
> >
> > We could also just use pg_class.relpages. It'll probably mostly be accurate
> > enough?
> >
> > Or we could just cache the result of the last smgrnblocks call...
> >
> >
> > One of the cases where this type of strategy is most intersting to me is 
> > the partial
> > truncations that autovacuum does... There we even know the range of tables
> > ahead of time.
>
> Konstantin tested it on various workloads and saw no regression.
> But I understand the sentiment on the added overhead on BufferAlloc.
> Regarding the case where the patch would potentially affect workloads that 
> fit into
> RAM but not into shared buffers, could one of Andres' suggested idea/s above 
> address
> that, in addition to this patch's possible shared invalidation fix? Could 
> that settle
> the added overhead in BufferAlloc() as temporary fix?
>

Yes, I think so. Because as far as I can understand he is suggesting
to do changes only in the Truncate/Vacuum code path. Basically, I
think you need to change DropRelFileNodeBuffers or similar functions.
There shouldn't be any change in the BufferAlloc or the common code
path, so there is no question of regression in such cases. I am not
sure what you have in mind for this but feel free to clarify your
understanding before implementation.

> Thomas Munro is also working on caching relation sizes [1], maybe that way we
> could get the latest known relation size. Currently, it's possible only during
> recovery in smgrnblocks.
>
> Tom Lane wrote:
> > But aside from that, I noted a number of things I didn't like a bit:
> >
> > * The amount of new shared memory this needs seems several orders of
> > magnitude higher than what I'd call acceptable: according to my measurements
> > it's over 10KB per shared buffer!  Most of that is going into the
> > CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> > how could we be needing a lock per map partition *per buffer*?  For 
> > comparison,
> > the space used by buf_table.c is about 56 bytes per shared buffer; I think 
> > this
> > needs to stay at least within hailing distance of there.
> >
> > * It is fairly suspicious that the new data structure is manipulated while 
> > holding
> > per-partition locks for the existing buffer hashtable.
> > At best that seems bad for concurrency, and at worst it could result in 
> > deadlocks,
> > because I doubt we can assume that the new hash table has partition 
> > boundaries
> > identical to the old one.
> >
> > * More generally, it seems like really poor design that this has been 
> > written
> > completely independently of the existing buffer hash table.
> > Can't we get any benefit by merging them somehow?
>
> The original aim is to just shorten the recovery process, and eventually the 
> speedup
> on both vacuum and truncate process are just added bonus.
> Given that we don't have a shared invalidation mechanism in place yet like 
> radix tree
> buffer mapping which is complex, I thought a patch like mine could be an 
> alternative
> approach to that. So I want to improve the patch further.
> I hope you can help me clarify the direction, so that I can avoid going 
> farther away
> f

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-06 Thread Amit Kapila
On Fri, Aug 7, 2020 at 3:03 AM Tomas Vondra
 wrote:
>
> >But I understand the sentiment on the added overhead on BufferAlloc.
> >Regarding the case where the patch would potentially affect workloads
> >that fit into RAM but not into shared buffers, could one of Andres'
> >suggested idea/s above address that, in addition to this patch's
> >possible shared invalidation fix? Could that settle the added overhead
> >in BufferAlloc() as temporary fix?
>
> Not sure.
>
> >Thomas Munro is also working on caching relation sizes [1], maybe that
> >way we could get the latest known relation size. Currently, it's
> >possible only during recovery in smgrnblocks.
>
> It's not clear to me how would knowing the relation size help reducing
> the overhead of this patch?
>

AFAICU the idea is to directly call BufTableLookup (similar to how we
do in BufferAlloc) to find the buf_id in function
DropRelFileNodeBuffers and then invalidate the required buffers.  And,
we need to do this when the size of the relation is less than some
threshold. So, I think the crux would be to reliably get the number of
blocks information. So, probably relation size cache stuff might help.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-06 Thread Amit Kapila
On Sat, Aug 1, 2020 at 1:53 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund  writes:
>
> > > Wonder if the temporary fix is just to do explicit hashtable probes for
> > > all pages iff the size of the relation is < s_b / 500 or so. That'll
> > > address the case where small tables are frequently dropped - and
> > > dropping large relations is more expensive from the OS and data loading
> > > perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > relation had been (preferably without adding an lseek call), but maybe
> > that's do-able.
>
> IIRC we already do smgrnblocks nearby, when doing the truncation (to
> figure out which segments we need to remove). Perhaps we can arrange to
> combine the two? The layering probably makes that somewhat ugly :(
>
> We could also just use pg_class.relpages. It'll probably mostly be
> accurate enough?
>

Don't we need the accurate 'number of blocks' if we want to invalidate
all the buffers? Basically, I think we need to perform BufTableLookup
for all the blocks in the relation and then Invalidate all buffers.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-06 Thread Tom Lane
Amit Kapila  writes:
> On Sat, Aug 1, 2020 at 1:53 AM Andres Freund  wrote:
>> We could also just use pg_class.relpages. It'll probably mostly be
>> accurate enough?

> Don't we need the accurate 'number of blocks' if we want to invalidate
> all the buffers? Basically, I think we need to perform BufTableLookup
> for all the blocks in the relation and then Invalidate all buffers.

Yeah, there is no room for "good enough" here.  If a dirty buffer remains
in the system, the checkpointer will eventually try to flush it, and fail
(because there's no file to write it to), and then checkpointing will be
stuck.  So we cannot afford to risk missing any buffers.

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Konstantin Knizhnik




On 07.08.2020 00:33, Tomas Vondra wrote:


Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding non-trivial amount of code
to a place that can be quite hot.


Sorry that I have not explained my test scenarios.
As Postgres is a pgbench-oriented database :) I have also used pgbench:
the read-only case and simple updates.
For this patch the most critical factor is the number of buffer allocations,
so I used a small enough database (scale=100), but shared buffers were set
to 1Gb.
As a result, all data is cached in memory (in the file system cache), but
there is intensive swapping at the Postgres buffer manager level.
I have tested it both with a relatively small (100) and a large (1000)
number of clients.
I repeated these tests on my notebook (quad-core, 16Gb RAM, SSD) and on an IBM
Power2 server with about 380 virtual cores and about 1Tb of memory.
In the last case the results vary very much (I think because of the NUMA
architecture), but I failed to find any noticeable regression in the patched
version.



But I have to agree that adding a parallel hash (in addition to the existing
buffer manager hash) is not such a good idea.

This cache really quite frequently becomes a bottleneck.
My explanation of why I have not observed any noticeable regression is
that this patch uses almost the same lock partitioning schema as the one
already used, so it does not add that many new conflicts. Maybe in the case of
the Power2 server the overhead of NUMA is much higher than the other factors
(although a shared hash is one of the main things suffering from the NUMA
architecture).
But in principle I agree that having two independent caches may decrease
speed by up to two times (or even more).


I hope that everybody will agree that this problem is really critical.
It is certainly not the most common case that there are hundreds of
relations which are frequently truncated. But having quadratic complexity
in the drop function is not acceptable from my point of view.
And it is not only a recovery-specific problem; this is why a solution with
a local cache is not enough.


I do not know a good solution to the problem. Just some thoughts:
- We can somehow combine the locking used for the main buffer manager cache
(by relid/blockno) and the cache for relid. That would eliminate the double
locking overhead.
- We can use something like a sorted tree (like std::map) instead of a hash;
it would allow locating blocks both by relid/blockno and by relid only.





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread k.jami...@fujitsu.com
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote:
Hi,
> On Thu, Aug 6, 2020 at 6:53 AM k.jami...@fujitsu.com 
> wrote:
> >
> > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
> >
> > Hi,
> > Thank you for your constructive review and comments.
> > Sorry for the late reply.
> >
> > > Hi,
> > >
> > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > > Andres Freund  writes:
> > > > > Indeed. The buffer mapping hashtable already is visible as a
> > > > > major bottleneck in a number of workloads. Even in readonly
> > > > > pgbench if s_b is large enough (so the hashtable is larger than
> > > > > the cache). Not to speak of things like a cached sequential scan
> > > > > with a cheap qual and wide
> > > rows.
> > > >
> > > > To be fair, the added overhead is in buffer allocation not buffer 
> > > > lookup.
> > > > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > > > upthread, the potential trouble spot is where the working set is
> > > > bigger than shared buffers but still fits in RAM (so there's no
> > > > actual I/O needed, but we do still have to shuffle buffers a lot).
> > >
> > > Oh, right, not sure what I was thinking.
> > >
> > >
> > > > > Wonder if the temporary fix is just to do explicit hashtable
> > > > > probes for all pages iff the size of the relation is < s_b / 500 or 
> > > > > so.
> > > > > That'll address the case where small tables are frequently
> > > > > dropped - and dropping large relations is more expensive from
> > > > > the OS and data loading perspective, so it's not gonna happen as 
> > > > > often.
> > > >
> > > > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > > > relation had been (preferably without adding an lseek call), but
> > > > maybe that's do-able.
> > >
> > > IIRC we already do smgrnblocks nearby, when doing the truncation (to
> > > figure out which segments we need to remove). Perhaps we can arrange
> > > to combine the two? The layering probably makes that somewhat ugly
> > > :(
> > >
> > > We could also just use pg_class.relpages. It'll probably mostly be
> > > accurate enough?
> > >
> > > Or we could just cache the result of the last smgrnblocks call...
> > >
> > >
> > > One of the cases where this type of strategy is most intersting to
> > > me is the partial truncations that autovacuum does... There we even
> > > know the range of tables ahead of time.
> >
> > Konstantin tested it on various workloads and saw no regression.
> > But I understand the sentiment on the added overhead on BufferAlloc.
> > Regarding the case where the patch would potentially affect workloads
> > that fit into RAM but not into shared buffers, could one of Andres'
> > suggested idea/s above address that, in addition to this patch's
> > possible shared invalidation fix? Could that settle the added overhead in
> BufferAlloc() as temporary fix?
> >
> 
> Yes, I think so. Because as far as I can understand he is suggesting to do 
> changes
> only in the Truncate/Vacuum code path. Basically, I think you need to change
> DropRelFileNodeBuffers or similar functions.
> There shouldn't be any change in the BufferAlloc or the common code path, so
> there is no question of regression in such cases. I am not sure what you have 
> in
> mind for this but feel free to clarify your understanding before 
> implementation.
>
> > Thomas Munro is also working on caching relation sizes [1], maybe that
> > way we could get the latest known relation size. Currently, it's
> > possible only during recovery in smgrnblocks.
> >
> > Tom Lane wrote:
> > > But aside from that, I noted a number of things I didn't like a bit:
> > >
> > > * The amount of new shared memory this needs seems several orders of
> > > magnitude higher than what I'd call acceptable: according to my
> > > measurements it's over 10KB per shared buffer!  Most of that is
> > > going into the CachedBufTableLock data structure, which seems
> > > fundamentally misdesigned --- how could we be needing a lock per map
> > > partition *per buffer*?  For comparison, the space used by
> > > buf_table.c is about 56 bytes per shared buffer; I think this needs to 
> > > stay at
> least within hailing distance of there.
> > >
> > > * It is fairly suspicious that the new data structure is manipulated
> > > while holding per-partition locks for the existing buffer hashtable.
> > > At best that seems bad for concurrency, and at worst it could result
> > > in deadlocks, because I doubt we can assume that the new hash table
> > > has partition boundaries identical to the old one.
> > >
> > > * More generally, it seems like really poor design that this has
> > > been written completely independently of the existing buffer hash table.
> > > Can't we get any benefit by merging them somehow?
> >
> > The original aim is to just shorten the recovery process, and
> > eventually the speedup on both vacuum and truncate process are just added
> bonus.
> > Given that we don't have a shared invalidation mechanism in place y

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tomas Vondra

On Fri, Aug 07, 2020 at 10:08:23AM +0300, Konstantin Knizhnik wrote:



On 07.08.2020 00:33, Tomas Vondra wrote:


Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding non-trivial amount of code
to a place that can be quite hot.


Sorry, that I have not explained  my test scenarios.
As Postgres is a pgbench-oriented database :) I have also used pgbench:
the read-only case and simple updates.
For this patch most critical is number of buffer allocations,
so I used small enough database (scale=100), but shared buffer was set 
to 1Gb.
As a result, all data is cached in memory (in file system cache), but 
there is intensive swapping at Postgres buffer manager level.
I have tested it both with relatively small (100) and large (1000) 
number of clients.


I repeated this tests at my notebook (quadcore, 16Gb RAM, SSD) and IBM 
Power2 server with about 380 virtual cores  and about 1Tb of memory.
In the last case the results vary very much (I think because of the NUMA
architecture), but I failed to find any noticeable regression in the
patched version.




IMO using such high numbers of clients is pointless - it's perfectly
fine to test just a single client, and the 'basic overhead' should be
visible. It might have some impact on concurrency, but I think that's
just a secondary effect I think. In fact, I wouldn't be surprised if
high client counts made it harder to observe the overhead, due to
concurrency problems (I doubt you have a laptop with this many cores).

Another thing you might try doing is using taskset to attach processes
to particular CPU cores, and also make sure there's no undesirable
influence from CPU power management etc. Laptops are very problematic in
this regard, but even servers can have that enabled in BIOS.



But I have to agree that adding parallel hash (in addition to existed 
buffer manager hash) is not so good idea.

This cache really quite frequently becomes bottleneck.
My explanation of why I have not observed some noticeable regression 
was that this patch uses almost the same lock partitioning schema
as already used it adds not so much new conflicts. May be in case of 
POwer2 server, overhead of NUMA is much higher than other factors
(although shared hash is one of the main thing suffering from NUMA 
architecture).
But in principle I agree that having two independent caches may 
decrease speed up to two times  (or even more).


I hope that everybody will agree that this problem is really critical. 
It is certainly not the most common case when there are hundreds of 
relation which are frequently truncated. But having quadratic 
complexity in drop function is not acceptable from my point of view.
And it is not only recovery-specific problem, this is why solution 
with local cache is not enough.




Well, ultimately it's a balancing act - we need to consider the risk of
regressions vs. how common the improved scenario is. I've seen multiple
applications that e.g. drop many relations (after all, that's why I
optimized that in 9.3), so it's not an entirely bogus case.


I do not know good solution of the problem. Just some thoughts.
- We can somehow combine locking used for main buffer manager cache 
(by relid/blockno) and cache for relid. It will eliminates double 
locking overhead.
- We can use something like sorted tree (like std::map) instead of 
hash - it will allow to locate blocks both by relid/blockno and by 
relid only.




I don't know. I think the ultimate problem here is that we're adding
code to a fairly hot codepath - it does not matter if it's a hash, a list,
a std::map or something else. All of that has overhead.

That's the beauty of Andres' proposal to just loop over the blocks of
the relation and evict them one by one - that adds absolutely nothing to
BufferAlloc.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:03 AM Tom Lane  wrote:
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.

This comment suggests another possible approach to the problem, which
is to just make a note someplace in shared memory when we drop a
relation. If we later find any of its buffers, we drop them without
writing them out. This is not altogether simple, because (1) we don't
have infinite room in shared memory to accumulate such notes and (2)
it's not impossible for the OID counter to wrap around and permit the
creation of a new relation with the same OID, which would be a problem
if the previous note is still around.

But this might be solvable. Suppose we create a shared hash table
keyed by  with room for 1 entry per 1000 shared
buffers. When you drop a relation, you insert into the hash table.
Periodically you "clean" the hash table by marking all the entries,
scanning shared buffers to remove any matches, and then deleting all
the marked entries. This should be done periodically in the
background, but if you try to drop a relation and find the hash table
full, or you try to create a relation and find the OID of your new
relation in the hash table, then you have to clean synchronously.

Right now, the cost of dropping K relation when N shared buffers is
O(KN). But with this approach, you only have to incur the O(N)
overhead of scanning shared_buffers when the hash table fills up, and
the hash table size is proportional to N, so the amortized complexity
is O(K); that is, dropping relations takes time proportional to the
number of relations being dropped, but NOT proportional to the size of
shared_buffers, because as shared_buffers grows the hash table gets
proportionally bigger, so that scans don't need to be done as
frequently.
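
(To put assumed numbers on that: with shared_buffers = 8GB there are about
1,048,576 buffers, so the table would hold roughly 1,000 entries; a full
cleaning scan of shared buffers is then needed at most once per ~1,000
dropped relations, which works out to on the order of 1,000 buffer-header
visits per drop, independent of how big shared_buffers is.)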

Andres's approach (retail hash table lookups just for blocks less than
the relation size, rather than a full scan) is going to help most with
small relations, whereas this approach helps with relations of any
size, but if you're trying to drop a lot of relations, they're
probably small, and if they are large, scanning shared buffers may not
be the dominant cost, since unlinking the files also takes time. Also,
this approach might turn out to slow down buffer eviction too much.
That could maybe be mitigated by having some kind of cheap fast-path
that gets used when the hash table is empty (like an atomic flag that
indicates whether a hash table probe is needed), and then trying hard
to keep it empty most of the time (e.g. by aggressive background
cleaning, or by ruling that after some number of hash table lookups
the next process to evict a buffer is forced to perform a cleanup).
But you'd probably have to try it to see how well you can do.
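
(A minimal sketch of what the eviction-side check could look like, with
hypothetical names -- DroppedRelHash, dropped_rel_count -- and locking
elided; not a worked-out design:)

    /* in the buffer-eviction loop, before flushing a dirty victim buffer */
    if (pg_atomic_read_u32(&dropped_rel_count) > 0 &&
        hash_search(DroppedRelHash, &bufHdr->tag.rnode,
                    HASH_FIND, NULL) != NULL)
    {
        /* its relation was recently dropped: discard instead of writing */
        InvalidateBuffer(bufHdr);
        continue;               /* pick another victim */
    }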

It's also possible to combine the two approaches. Small relations
could use Andres's approach while larger ones could use this approach;
or you could insert both large and small relations into this hash
table but use different strategies for cleaning out shared_buffers
depending on the relation size (which could also be preserved in the
hash table).

Just brainstorming here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 12:03 AM Tom Lane  wrote:
>> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
>> in the system, the checkpointer will eventually try to flush it, and fail
>> (because there's no file to write it to), and then checkpointing will be
>> stuck.  So we cannot afford to risk missing any buffers.

> This comment suggests another possible approach to the problem, which
> is to just make a note someplace in shared memory when we drop a
> relation. If we later find any of its buffers, we drop them without
> writing them out. This is not altogether simple, because (1) we don't
> have infinite room in shared memory to accumulate such notes and (2)
> it's not impossible for the OID counter to wrap around and permit the
> creation of a new relation with the same OID, which would be a problem
> if the previous note is still around.

Interesting idea indeed.

As for (1), maybe we don't need to keep the info in shmem.  I'll just
point out that the checkpointer has *already got* a complete list of all
recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
imagine looking aside at that to discover that a dirty buffer belongs to a
recently-dropped relation.  pendingUnlinks would need to be converted to a
hashtable to make searches cheap, and it's not very clear what to do in
backends that haven't got access to that table, but maybe we could just
accept that backends that are forced to flush dirty buffers might do some
useless writes in such cases.

As for (2), the reason why we have that list is that the physical unlink
doesn't happen till after the next checkpoint.  So with some messing
around here, we could probably guarantee that every buffer belonging
to the relation has been scanned and deleted before the file unlink
happens --- and then, even if the OID counter has wrapped around, the
OID won't be reassigned to a new relation before that happens.

In short, it seems like maybe we could shove the responsibility for
cleaning up dropped relations' buffers onto the checkpointer without
too much added cost.  A possible problem with this is that recycling
of those buffers will happen much more slowly than it does today,
but maybe that's okay?

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:09 PM Tom Lane  wrote:
> As for (1), maybe we don't need to keep the info in shmem.  I'll just
> point out that the checkpointer has *already got* a complete list of all
> recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
> imagine looking aside at that to discover that a dirty buffer belongs to a
> recently-dropped relation.  pendingUnlinks would need to be converted to a
> hashtable to make searches cheap, and it's not very clear what to do in
> backends that haven't got access to that table, but maybe we could just
> accept that backends that are forced to flush dirty buffers might do some
> useless writes in such cases.

I don't see how that can work. It's not that the writes are useless;
it's that they will fail outright because the file doesn't exist.

> As for (2), the reason why we have that list is that the physical unlink
> doesn't happen till after the next checkpoint.  So with some messing
> around here, we could probably guarantee that every buffer belonging
> to the relation has been scanned and deleted before the file unlink
> happens --- and then, even if the OID counter has wrapped around, the
> OID won't be reassigned to a new relation before that happens.

This seems right to me, though.

> In short, it seems like maybe we could shove the responsibility for
> cleaning up dropped relations' buffers onto the checkpointer without
> too much added cost.  A possible problem with this is that recycling
> of those buffers will happen much more slowly than it does today,
> but maybe that's okay?

I suspect it's going to be advantageous to try to make cleaning up
dropped buffers quick in normal cases and allow it to fall behind only
when someone is dropping a lot of relations in quick succession, so
that buffer eviction remains cheap in normal cases. I hadn't thought
about the possible negative performance consequences of failing to put
buffers on the free list, but that's another reason to try to make it
fast.

My viewpoint on this is - I have yet to see anybody really get hosed
because they drop one relation and that causes a full scan of
shared_buffers. I mean, it's slightly expensive, but computers are
fast. Whatever. What hoses people is dropping a lot of relations in
quick succession, either by spamming DROP TABLE commands or by running
something like DROP SCHEMA, and then suddenly they're scanning
shared_buffers over and over again, and their standby is doing the
same thing, and now it hurts. The problem on the standby is actually
worse than the problem on the primary, because the primary can do
other things while one process sits there and thinks about
shared_buffers for a long time, but the standby can't, because the
startup process is all there is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 12:09 PM Tom Lane  wrote:
>> ... it's not very clear what to do in
>> backends that haven't got access to that table, but maybe we could just
>> accept that backends that are forced to flush dirty buffers might do some
>> useless writes in such cases.

> I don't see how that can work. It's not that the writes are useless;
> it's that they will fail outright because the file doesn't exist.

At least in the case of segment zero, the file will still exist.  It'll
have been truncated to zero length, and if the filesystem is stupid about
holes in files then maybe a write to a high block number would consume
excessive disk space, but does anyone still care about such filesystems?
I don't remember at the moment how we handle higher segments, but likely
we could make them still exist too, postponing all the unlinks till after
checkpoint.  Or we could just have the backends give up on recycling a
particular buffer if they can't write it (which is the response to an I/O
failure already, I hope).

> My viewpoint on this is - I have yet to see anybody really get hosed
> because they drop one relation and that causes a full scan of
> shared_buffers. I mean, it's slightly expensive, but computers are
> fast. Whatever. What hoses people is dropping a lot of relations in
> quick succession, either by spamming DROP TABLE commands or by running
> something like DROP SCHEMA, and then suddenly they're scanning
> shared_buffers over and over again, and their standby is doing the
> same thing, and now it hurts.

Yeah, trying to amortize the cost across multiple drops seems like
what we really want here.  I'm starting to think about a "relation
dropper" background process, which would be somewhat like the checkpointer
but it wouldn't have any interest in actually doing buffer I/O.
We'd send relation drop commands to it, and it would scan all of shared
buffers and flush related buffers, and then finally do the file truncates
or unlinks.  Amortization would happen by considering multiple target
relations during any one scan over shared buffers.  I'm not very clear
on how this would relate to the checkpointer's handling of relation
drops, but it could be worked out; if we were lucky maybe the checkpointer
could stop worrying about that.
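
To make that shape concrete, a hypothetical request record and loop outline
(nothing below exists in the tree; the names are invented):

    /* Hypothetical message handed to the relation-dropper process */
    typedef struct RelDropRequest
    {
        RelFileNodeBackend rnode;          /* relation being dropped */
        BlockNumber        firstDelBlock;  /* 0 = drop the whole relation */
    } RelDropRequest;

    /*
     * Hypothetical main loop: absorb a batch of RelDropRequests, make a
     * single pass over shared buffers invalidating any buffer that matches
     * one of them, and only then truncate or unlink the underlying files.
     * One sweep is thereby amortized over all relations dropped meanwhile.
     */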

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:52 PM Tom Lane  wrote:
> At least in the case of segment zero, the file will still exist.  It'll
> have been truncated to zero length, and if the filesystem is stupid about
> holes in files then maybe a write to a high block number would consume
> excessive disk space, but does anyone still care about such filesystems?
> I don't remember at the moment how we handle higher segments, but likely
> we could make them still exist too, postponing all the unlinks till after
> checkpoint.  Or we could just have the backends give up on recycling a
> particular buffer if they can't write it (which is the response to an I/O
> failure already, I hope).

None of this sounds very appealing. Postponing the unlinks means
postponing recovery of the space at the OS level, which I think will
be noticeable and undesirable for users. The other notions all seem to
involve treating as valid on-disk states we currently treat as
invalid, and our sanity checks in this area are already far too weak.
And all you're buying for it is putting a hash table that would
otherwise be shared memory into backend-private memory, which seems
like quite a minor gain. Having that information visible to everybody
seems a lot cleaner.

> Yeah, trying to amortize the cost across multiple drops seems like
> what we really want here.  I'm starting to think about a "relation
> dropper" background process, which would be somewhat like the checkpointer
> but it wouldn't have any interest in actually doing buffer I/O.
> We'd send relation drop commands to it, and it would scan all of shared
> buffers and flush related buffers, and then finally do the file truncates
> or unlinks.  Amortization would happen by considering multiple target
> relations during any one scan over shared buffers.  I'm not very clear
> on how this would relate to the checkpointer's handling of relation
> drops, but it could be worked out; if we were lucky maybe the checkpointer
> could stop worrying about that.

I considered that, too, but it might be overkill. I think that one
scan of shared_buffers every now and then might be cheap enough that
we could just not worry too much about which process gets stuck doing
it. So for example if the number of buffers allocated since the hash
table ended up non-empty reaches NBuffers, the process wanting to do
the next eviction gets handed the job of cleaning it out. Or maybe the
background writer could help; it's not like it does much anyway, zing.
It's possible that a dedicated process is the right solution, but we
might want to at least poke a bit at other alternatives.
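
Hand-wavy sketch of that trigger, with invented names, just to show the shape:

    /* allocs_since_drop: shared counter of buffer allocations since the
     * dropped-relations table last became non-empty (hypothetical) */
    if (!DroppedRelTableIsEmpty() &&
        pg_atomic_fetch_add_u32(&allocs_since_drop, 1) + 1 >= (uint32) NBuffers)
    {
        /* this evicting backend gets stuck with the cleanup sweep */
        CleanDroppedRelationBuffers();
        pg_atomic_write_u32(&allocs_since_drop, 0);
    }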

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-08 Thread Amit Kapila
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund  wrote:
> >> We could also just use pg_class.relpages. It'll probably mostly be
> >> accurate enough?
>
> > Don't we need the accurate 'number of blocks' if we want to invalidate
> > all the buffers? Basically, I think we need to perform BufTableLookup
> > for all the blocks in the relation and then Invalidate all buffers.
>
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.
>

Right, this reminds me of the discussion we had last time on this
topic where we decided that we can't even rely on using smgrnblocks to
find the exact number of blocks because lseek might lie about the EOF
position [1]. So, we anyway need some mechanism to push the
information about the "to be truncated or dropped relations" to the
background worker (checkpointer and/or others) to avoid flush
issues. But maybe it is better to push the responsibility of
invalidating the buffers for truncated/dropped relations to the
background process. However, I feel that in cases where the relation
size is greater than the number of shared buffers, there might not be
much benefit in pushing this operation to the background unless there
are already a few other relation entries (for dropped relations), so
that the cost of scanning the buffers can be amortized.

[1] - https://www.postgresql.org/message-id/16664.1435414204%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-08 Thread Amit Kapila
On Fri, Aug 7, 2020 at 11:03 PM Robert Haas  wrote:
>
> On Fri, Aug 7, 2020 at 12:52 PM Tom Lane  wrote:
> > At least in the case of segment zero, the file will still exist.  It'll
> > have been truncated to zero length, and if the filesystem is stupid about
> > holes in files then maybe a write to a high block number would consume
> > excessive disk space, but does anyone still care about such filesystems?
> > I don't remember at the moment how we handle higher segments,
> >

We do unlink them and register the request to forget the Fsync
requests for those. See mdunlinkfork.

> > but likely
> > we could make them still exist too, postponing all the unlinks till after
> > checkpoint.  Or we could just have the backends give up on recycling a
> > particular buffer if they can't write it (which is the response to an I/O
> > failure already, I hope).
> >

Note that we don't often try to flush the buffers from the backend. We
first try to forward the request to the checkpointer's queue, and only
if that queue is full does the backend try to flush it itself, so even
if we decide to give up flushing such a buffer (where we get an error)
via the backend, it shouldn't impact very many cases. I am not sure,
but if we can somehow reliably distinguish this type of error from any
other I/O failure, then we can probably give up on flushing this buffer
and continue, or maybe just retry pushing the request to the checkpointer.

>
> None of this sounds very appealing. Postponing the unlinks means
> postponing recovery of the space at the OS level, which I think will
> be noticeable and undesirable for users. The other notions all seem to
> involve treating as valid on-disk states we currently treat as
> invalid, and our sanity checks in this area are already far too weak.
> And all you're buying for it is putting a hash table that would
> otherwise be shared memory into backend-private memory, which seems
> like quite a minor gain. Having that information visible to everybody
> seems a lot cleaner.
>

One more benefit of giving this responsibility to a single process like
the checkpointer is that we can avoid unlinking the relation until we
have scanned all the buffers corresponding to it. Keeping the
information in shared memory and allowing other processes to work on it
surely has its own merit, namely that such buffers might get
invalidated faster, but I am not sure we can then retain the benefit of
the other approach, which is to perform all such buffer invalidation
before unlinking the relation's first segment.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-17 Thread Amit Kapila
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund  wrote:
> >> We could also just use pg_class.relpages. It'll probably mostly be
> >> accurate enough?
>
> > Don't we need the accurate 'number of blocks' if we want to invalidate
> > all the buffers? Basically, I think we need to perform BufTableLookup
> > for all the blocks in the relation and then Invalidate all buffers.
>
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.
>

Today, again thinking about this point it occurred to me that during
recovery we can reliably find the relation size and after Thomas's
recent commit c5315f4f44 (Cache smgrnblocks() results in recovery), we
might not need to even incur the cost of lseek. Why don't we fix this
first for 'recovery' (by following something on the lines of what
Andres suggested) and then later once we have a generic mechanism for
"caching the relation size" [1], we can do it for non-recovery paths.
I think that will at least address the reported use case with some
minimal changes.

[1] - 
https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: 
> On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund 
> wrote:
> > >> We could also just use pg_class.relpages. It'll probably mostly be
> > >> accurate enough?
> >
> > > Don't we need the accurate 'number of blocks' if we want to
> > > invalidate all the buffers? Basically, I think we need to perform
> > > BufTableLookup for all the blocks in the relation and then Invalidate all
> buffers.
> >
> > Yeah, there is no room for "good enough" here.  If a dirty buffer
> > remains in the system, the checkpointer will eventually try to flush
> > it, and fail (because there's no file to write it to), and then
> > checkpointing will be stuck.  So we cannot afford to risk missing any
> buffers.
> >
> 
> Today, again thinking about this point it occurred to me that during recovery
> we can reliably find the relation size and after Thomas's recent commit
> c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> following something on the lines of what Andres suggested) and then later
> once we have a generic mechanism for "caching the relation size" [1], we can
> do it for non-recovery paths.
> I think that will at least address the reported use case with some minimal
> changes.
> 
> [1] -
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 

Attached is an updated V9 version with minimal code changes only and
avoids the previous overhead in the BufferAlloc. This time, I only updated
the recovery path as suggested by Amit, and followed Andres' suggestion
of referring to the cached blocks in smgrnblocks.
The layering is kinda tricky so the logic may be wrong. But as of now,
it passes the regression tests. I'll follow up with the performance results.
It seems there's regression for smaller shared_buffers. I'll update if I find 
bugs.
But I'd also appreciate your reviews in case I missed something.
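
To help with the review, here is roughly the shape of the targeted path the
patch aims for (a simplified sketch of the idea, not the exact patch code;
the re-check of the tag under the buffer header spinlock and error handling
are elided):

    /* Recovery-only fast path near DropRelFileNodeBuffers() (sketch) */
    nblocks = smgrnblocks(smgropen(rnode.node, rnode.backend), forkNum[i]);

    if (InRecovery && nblocks < BUF_DROP_THRESHOLD)
    {
        for (blkno = firstDelBlock[i]; blkno < nblocks; blkno++)
        {
            INIT_BUFFERTAG(newTag, rnode.node, forkNum[i], blkno);
            newHash = BufTableHashCode(&newTag);
            newPartitionLock = BufMappingPartitionLock(newHash);

            LWLockAcquire(newPartitionLock, LW_SHARED);
            buf_id = BufTableLookup(&newTag, newHash);
            LWLockRelease(newPartitionLock);

            if (buf_id < 0)
                continue;       /* block is not in shared buffers */

            bufHdr = GetBufferDescriptor(buf_id);
            /* lock the buffer header, re-check the tag, then invalidate */
        }
    }
    else
    {
        /* fall back to the existing full scan of shared buffers */
    }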

Regards,
Kirk Jamison




v9-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v9-Speedup-dropping-of-relation-buffers-during-recovery.patch


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Kyotaro Horiguchi
Hello.

At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: 
> > Today, again thinking about this point it occurred to me that during 
> > recovery
> > we can reliably find the relation size and after Thomas's recent commit
> > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> > following something on the lines of what Andres suggested) and then later
> > once we have a generic mechanism for "caching the relation size" [1], we can
> > do it for non-recovery paths.
> > I think that will at least address the reported use case with some minimal
> > changes.
> > 
> > [1] -
> > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

Isn't a relation always locked access-exclusively at truncation
time?  If so, isn't even the result of lseek reliable enough? And if
we don't care about the cost of lseek, we can do the same optimization
for non-recovery paths as well. Since we perform the actual file
truncation just afterwards anyway, I think the cost of lseek is
negligible here.

> Attached is an updated V9 version with minimal code changes only and
> avoids the previous overhead in the BufferAlloc. This time, I only updated
> the recovery path as suggested by Amit, and followed Andres' suggestion
> of referring to the cached blocks in smgrnblocks.
> The layering is kinda tricky so the logic may be wrong. But as of now,
> it passes the regression tests. I'll follow up with the performance results.
> It seems there's regression for smaller shared_buffers. I'll update if I find 
> bugs.
> But I'd also appreciate your reviews in case I missed something.

BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum
number of file pages for which we do a relation-targeted search for
buffers; otherwise we scan through all buffers. On the other hand, the
latest patch just leaves the buffers alone for relation forks larger
than the threshold.

I think we should determine whether to do a targeted scan or a full scan
based on the ratio of the (expected maximum) total number of pages for
all (specified) forks in a relation against the total number of buffers.

By the way

> #define BUF_DROP_THRESHOLD  500 /* NBuffers divided by 2 */

NBuffers is not a constant. Even if we wanted to set the macro as
described in the comment, we should have used (NBuffers/2) instead of
"500". But I suppose you might wanted to use (NBuffders / 500) as Tom
suggested upthread.  And the name of the macro seems too generic. I
think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be
better.
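
To illustrate (this is just a sketch; the actual divisor and the exact total
to compare against still need discussion):

    /* if the comment is taken literally */
    #define BUF_DROP_FULLSCAN_THRESHOLD  ((uint32) (NBuffers / 2))

    /* or, if the fraction suggested upthread was intended */
    #define BUF_DROP_FULLSCAN_THRESHOLD  ((uint32) (NBuffers / 500))

    /* nblocks_total = sum of to-be-dropped pages over all specified forks */
    if (nblocks_total < BUF_DROP_FULLSCAN_THRESHOLD)
        /* targeted BufTableLookup() per block */ ;
    else
        /* full scan of shared buffers */ ;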

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
 




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Kyotaro Horiguchi
I'd like make a subtle correction.

At Wed, 02 Sep 2020 10:31:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> By the way
> 
> > #define BUF_DROP_THRESHOLD  500 /* NBuffers divided by 2 */
> 
> NBuffers is not a constant. Even if we wanted to set the macro as
> described in the comment, we should have used (NBuffers/2) instead of
> "500". But I suppose you might wanted to use (NBuffders / 500) as Tom
> suggested upthread.  And the name of the macro seems too generic. I

It was Andres, not Tom, who made that suggestion. Sorry for the mistake.

> think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be
> better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Amit Kapila
On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com" 
>  wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during 
> > > recovery
> > > we can reliably find the relation size and after Thomas's recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> > > even incur the cost of lseek. Why don't we fix this first for 'recovery' 
> > > (by
> > > following something on the lines of what Andres suggested) and then later
> > > once we have a generic mechanism for "caching the relation size" [1], we 
> > > can
> > > do it for non-recovery paths.
> > > I think that will at least address the reported use case with some minimal
> > > changes.
> > >
> > > [1] -
> > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
>
> Isn't a relation always locked access-exclusively at truncation
> time?  If so, isn't even the result of lseek reliable enough?
>

Even if the relation is locked, background processes like checkpointer
can still touch the relation which might cause problems. Consider a
case where we extend the relation but didn't flush the newly added
pages. Now during truncate operation, checkpointer can still flush
those pages which can cause trouble for truncate. But, I think in the
recovery path such cases won't cause a problem.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Tom Lane
Amit Kapila  writes:
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I wouldn't count on that staying true ...

https://www.postgresql.org/message-id/ca+hukgj8nrsqgkzensnrc2mfrobv-jcnacbyvtpptk2a9yy...@mail.gmail.com

regards, tom lane




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Wednesday, September 2, 2020 10:31 AM, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com"
>  wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during
> > > recovery we can reliably find the relation size and after Thomas's
> > > recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not
> > > need to even incur the cost of lseek. Why don't we fix this first
> > > for 'recovery' (by following something on the lines of what Andres
> > > suggested) and then later once we have a generic mechanism for
> > > "caching the relation size" [1], we can do it for non-recovery paths.
> > > I think that will at least address the reported use case with some
> > > minimal changes.
> > >
> > > [1] -
> > >
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 
> Isn't a relation always locked access-exclusively at truncation time?  If so,
> isn't even the result of lseek reliable enough? And if we don't care about the
> cost of lseek, we can do the same optimization for non-recovery paths as well.
> Since we perform the actual file truncation just afterwards anyway, I think
> the cost of lseek is negligible here.

The reason for that is that when I read the comment in smgrnblocks() in
smgr.c, I thought smgrnblocks can only be reliably used during recovery
to ensure that we have the correct size.
Please correct me if my understanding is wrong, and I'll fix the patch.

 * For now, we only use cached values in recovery due to lack of a shared
 * invalidation mechanism for changes in file size.
 */
if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
    return reln->smgr_cached_nblocks[forknum];

> > Attached is an updated V9 version with minimal code changes only and
> > avoids the previous overhead in the BufferAlloc. This time, I only
> > updated the recovery path as suggested by Amit, and followed Andres'
> > suggestion of referring to the cached blocks in smgrnblocks.
> > The layering is kinda tricky so the logic may be wrong. But as of now,
> > it passes the regression tests. I'll follow up with the performance results.
> > It seems there's regression for smaller shared_buffers. I'll update if I 
> > find
> bugs.
> > But I'd also appreciate your reviews in case I missed something.
> 
> BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum
> number of file pages for which we do a relation-targeted search for buffers;
> otherwise we scan through all buffers. On the other hand, the latest patch just
> leaves the buffers alone for relation forks larger than the threshold.

Right, I missed the condition for that part. It is fixed in the latest one.
 
> I think we should determine whether to do a targeted scan or a full scan based
> on the ratio of the (expected maximum) total number of pages for all (specified)
> forks in a relation against the total number of buffers.

> By the way
> 
> > #define BUF_DROP_THRESHOLD  500 /* NBuffers divided
> by 2 */
> 
> NBuffers is not a constant. Even if we wanted to set the macro as described
> in the comment, we should have used (NBuffers/2) instead of "500". But I
> suppose you might have wanted to use (NBuffers / 500) as Tom suggested
> upthread.  And the name of the macro seems too generic. I think more
> specific names like BUF_DROP_FULLSCAN_THRESHOLD would be better.

Fixed.

Thank you for the review!
Attached is the v10 of the patch.

Best regards,
Kirk Jamison


v10-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v10-Speedup-dropping-of-relation-buffers-during-recovery.patch


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-02 Thread Amit Kapila
On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Even if the relation is locked, background processes like checkpointer
> > can still touch the relation which might cause problems. Consider a
> > case where we extend the relation but didn't flush the newly added
> > pages. Now during truncate operation, checkpointer can still flush
> > those pages which can cause trouble for truncate. But, I think in the
> > recovery path such cases won't cause a problem.
>
> I wouldn't count on that staying true ...
>
> https://www.postgresql.org/message-id/ca+hukgj8nrsqgkzensnrc2mfrobv-jcnacbyvtpptk2a9yy...@mail.gmail.com
>

I don't think that proposal will matter after commit c5315f4f44
because we are caching the size/blocks for recovery while doing extend
(smgrextend). In the above scenario, we would have cached the blocks
which will be used at later point of time.
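
For reference, after that commit smgrextend() keeps the cached size up to
date along these lines (paraphrasing the committed code from memory, not an
exact quote):

    if (reln->smgr_cached_nblocks[forknum] == blocknum)
        reln->smgr_cached_nblocks[forknum] = blocknum + 1;
    else
        reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;

so a later smgrnblocks() call during recovery can be answered from the cache
without an lseek.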

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread k.jami...@fujitsu.com
On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Even if the relation is locked, background processes like
> > > checkpointer can still touch the relation which might cause
> > > problems. Consider a case where we extend the relation but didn't
> > > flush the newly added pages. Now during truncate operation,
> > > checkpointer can still flush those pages which can cause trouble for
> > > truncate. But, I think in the recovery path such cases won't cause a
> problem.
> >
> > I wouldn't count on that staying true ...
> >
> >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> OBV-jC
> > nacbyvtpptk2a9yy...@mail.gmail.com
> >
> 
> I don't think that proposal will matter after commit c5315f4f44 because we are
> caching the size/blocks for recovery while doing extend (smgrextend). In the
> above scenario, we would have cached the blocks which will be used at later
> point of time.
> 

Hi,

I'm guessing we can still pursue this idea of improving the recovery path first.

I'm working on an updated patch version, because the CFBot is reporting
that postgres fails to build (one of the recovery TAP tests fails).
I'm still working on refactoring my patch, but have yet to find a proper 
solution at the moment.
So I'm going to continue my investigation.

Attached is an updated WIP patch.
I'd appreciate it if you could take a look at the patch as well.

In addition, also attached are the regression logs for the failure and the
other logs accompanying it: wal_optimize_node_minimal and
wal_optimize_node_replica.

The failure reported in my session was:
t/018_wal_optimize.pl  18/34 Bailout called.
Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Best regards,
Kirk Jamison


v11-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v11-Speedup-dropping-of-relation-buffers-during-recovery.patch


regress_log_018_wal_optimize
Description: regress_log_018_wal_optimize


018_wal_optimize_node_minimal.log
Description: 018_wal_optimize_node_minimal.log


018_wal_optimize_node_replica.log
Description: 018_wal_optimize_node_replica.log


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread Amit Kapila
On Mon, Sep 7, 2020 at 1:33 PM k.jami...@fujitsu.com
 wrote:
>
> On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > Even if the relation is locked, background processes like
> > > > checkpointer can still touch the relation which might cause
> > > > problems. Consider a case where we extend the relation but didn't
> > > > flush the newly added pages. Now during truncate operation,
> > > > checkpointer can still flush those pages which can cause trouble for
> > > > truncate. But, I think in the recovery path such cases won't cause a
> > problem.
> > >
> > > I wouldn't count on that staying true ...
> > >
> > >
> > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> > OBV-jC
> > > nacbyvtpptk2a9yy...@mail.gmail.com
> > >
> >
> > I don't think that proposal will matter after commit c5315f4f44 because we 
> > are
> > caching the size/blocks for recovery while doing extend (smgrextend). In the
> > above scenario, we would have cached the blocks which will be used at later
> > point of time.
> >
>
> I'm guessing we can still pursue this idea of improving the recovery path 
> first.
>

I think so.

> I'm working on an updated patch version, because the CFBot is reporting
> that postgres fails to build (one of the recovery TAP tests fails).
> I'm still working on refactoring my patch, but have yet to find a proper 
> solution at the moment.
> So I'm going to continue my investigation.
>
> Attached is an updated WIP patch.
> I'd appreciate it if you could take a look at the patch as well.
>

So, I see the below log as one of the problems:
2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
in block 1 of relation base/13743/24581

This indicates that we missed invalidating some buffer which should
have been invalidated. If you are able to reproduce this locally, then
I suggest first writing a simple patch without the threshold check;
basically, in recovery, always try to use the new way to invalidate
the buffers. That will reduce the scope of the code that can create a
problem. Let us know if the problem still exists and share the logs.
BTW, I think I see one problem in the code:

if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[j] &&
+ bufHdr->tag.blockNum >= firstDelBlock[j])

Here, I think you need to use 'i' not 'j' for forkNum and
firstDelBlock as those are arrays w.r.t forks. That might fix the
problem but I am not sure as I haven't tried to reproduce it.
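
To spell it out, assuming 'i' is the index over the fork arrays in the
enclosing loop, the condition should presumably read (just restating the
suggested fix, not tested):

    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
        bufHdr->tag.forkNum == forkNum[i] &&
        bufHdr->tag.blockNum >= firstDelBlock[i])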

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum >= firstDelBlock[j])
> 
> Here, I think you need to use 'i' not 'j' for forkNum and
> firstDelBlock as those are arrays w.r.t forks. That might fix the
> problem but I am not sure as I haven't tried to reproduce it.


(1)
+   INIT_BUFFERTAG(newTag, rnode.node, forkNum[j], firstDelBlock[j]);

And you need to use i here, too.

I advise you to suspect any character, any word, and any sentence.  I've found 
many bugs for others so far.  I'm afraid you're just seeing the code flow.


(2)
+   LWLockAcquire(newPartitionLock, LW_SHARED);
+   buf_id = BufTableLookup(&newTag, newHash);
+   LWLockRelease(newPartitionLock);
+
+   bufHdr = GetBufferDescriptor(buf_id);

Check the result of BufTableLookup() and do nothing if the block is not in the 
shared buffers.
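
That is, something along these lines (sketch only):

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);

    if (buf_id < 0)
        continue;   /* block is not in shared buffers; nothing to invalidate */

    bufHdr = GetBufferDescriptor(buf_id);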


(3)
+   else
+   {
+   for (j = BUF_DROP_FULLSCAN_THRESHOLD; j < NBuffers; j++)
+   {

What's the meaning of this loop?  I don't understand the start condition.  
Should j be initialized to 0?


(4)
+#define BUF_DROP_FULLSCAN_THRESHOLD  (NBuffers / 2)

Wasn't it 500 instead of 2?  Anyway, I think we need to discuss this threshold 
later.


(5)
+   if (((int)nblocks) < BUF_DROP_FULLSCAN_THRESHOLD)

It's better to define BUF_DROP_FULLSCAN_THRESHOLD as a uint32 value instead of
casting the type here, as these values are block counts.


Regards
Takayuki Tsunakawa

 


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread tsunakawa.ta...@fujitsu.com
From: tsunakawa.ta...@fujitsu.com 
> (1)
> +   INIT_BUFFERTAG(newTag, rnode.node, forkNum[j], firstDelBlock[j]);
> 
> And you need to use i here, too.

I remember the books "Code Complete" and/or "Readable Code" suggest using
meaningful loop variable names like fork_num and block_count to prevent this
type of mistake.


Regards
Takayuki Tsunakawa


 

