Re: Diagnostic comment in LogicalIncreaseXminForSlot

2021-08-06 Thread Amit Kapila
On Mon, Jul 12, 2021 at 5:28 PM Ashutosh Bapat
 wrote:
>
> On Mon, Jul 12, 2021 at 8:39 AM Amit Kapila  wrote:
>>
>> On Mon, Jul 5, 2021 at 12:54 PM Masahiko Sawada  
>> wrote:
>> >
>> > On Fri, May 21, 2021 at 6:00 PM Ashutosh Bapat
>> >  wrote:
>> > >
>> > > It's there in CF. I am fine with PG-15. It will be good to patch the 
>> > > back-branches to have this extra diagnostic information available.
>> >
>> > The patch looks good to me.
>> >
>>
>> {
>>   slot->candidate_catalog_xmin = xmin;
>>   slot->candidate_xmin_lsn = current_lsn;
>> + elog(DEBUG1, "got new catalog_xmin %u at %X/%X", xmin,
>> + LSN_FORMAT_ARGS(current_lsn));
>>   }
>>   SpinLockRelease(&slot->mutex);
>>
>> Today, again looking at this patch, I don't think doing elog inside
>> spinlock is a good idea. We can either release spinlock before it or
>> use some variable to remember that we need to write such an elog and
>> do it after releasing the lock. What do you think?
>
>
> The elog will be effective only under DEBUG1; otherwise it will be almost a
> NOOP. I am wondering whether it's worth adding a bool assignment and moving
> the elog out only for DEBUG1.
>

If you don't like adding another variable, then we can probably release the
spinlock in each of the if .. else branches. As mentioned previously, it
doesn't seem a good idea to use elog inside a spinlock, so we either
need to find some way to avoid that or will probably drop this patch.
Do let me know what you or others prefer here.
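
For illustration, a minimal sketch of the "remember and log after release"
approach might look like this (the flag name is hypothetical; this is not a
finished patch):

    bool    updated_xmin = false;

    SpinLockAcquire(&slot->mutex);
    if (/* same condition as today in LogicalIncreaseXminForSlot */ true)
    {
        slot->candidate_catalog_xmin = xmin;
        slot->candidate_xmin_lsn = current_lsn;
        updated_xmin = true;    /* remember; log only after the release */
    }
    SpinLockRelease(&slot->mutex);

    if (updated_xmin)
        elog(DEBUG1, "got new catalog_xmin %u at %X/%X", xmin,
             LSN_FORMAT_ARGS(current_lsn));

This keeps the elog outside the spinlock while still reporting the same
information.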

-- 
With Regards,
Amit Kapila.




Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-06 Thread Amit Kapila
On Fri, Aug 6, 2021 at 9:57 PM Japin Li  wrote:
>
> >
> > Hmm yes, it cannot cover all cases. I had somehow misunderstood that
> > the subscriber knows which relations are associated with which
> > publications. Given that the subscriber doesn’t know which relations
> > are associated with which publications (and the set of subscribed
> > relations on the subscriber becomes out of date once the publication
> > is updated), I think it's impossible to achieve that DROP PUBLICATION
> > drops relations that are associated with only the publication being
> > dropped.
> >
> >> Do we have better ideas to fix or shall we just
> >> say that we will refresh based on whatever current set of relations
> >> are present in publication to be dropped?
> >
> > I don't have better ideas. It seems to me that it's prudent that we
> > accept the approach in v3 patch and refresh all publications in DROP
> > PUBLICATION cases.
> >
>
> Maybe refreshing all publications in PUBLICATION cases is a simple way to
> fix the problem.
>

Do you mean to say we should do it for both Add and Drop, or just for Drop?
Actually, doing it for both will make the behavior consistent, but doing it
just for Drop might be preferred by some users. I think it is better
to be consistent here, but I am fine if you and others feel it is
better to refresh all publications only in the Drop case.

-- 
With Regards,
Amit Kapila.




Re: Alias collision in `refresh materialized view concurrently`

2021-08-06 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Aug 06, 2021 at 10:48:40AM -0400, Tom Lane wrote:
>> Given that it took this long to notice the problem at all, maybe
>> this is not a fix to cram in on the weekend before the release wrap.
>> But I don't see why we need to settle for "mostly works" when
>> "always works" is barely any harder.

> Yes, I would vote to delay that for a couple of days.  That's not
> worth taking a risk for.

I went ahead and created the patch, including test case, and it
seems fine.  So I'm leaning towards pushing that tomorrow.  Mainly
because I don't want to have to document "we partially fixed this"
in this release set and then "we really fixed it" three months from
now.

regards, tom lane




Re: Alias collision in `refresh materialized view concurrently`

2021-08-06 Thread Michael Paquier
On Fri, Aug 06, 2021 at 10:48:40AM -0400, Tom Lane wrote:
> AFAICT that works and generates the identical parse tree to the original
> coding.  The only place touched by the patch where it's a bit difficult to
> make the syntax unambiguous this way is
> 
> "CREATE TEMP TABLE %s AS "
> "SELECT _$mv.ctid AS tid, _$newdata "
> "FROM %s _$mv FULL JOIN %s _$newdata ON (",
> 
> because newdata.* would ordinarily get expanded to multiple columns
> if it's at the top level of a SELECT list, and that's not what we want.
> However, that's easily fixed using the same hack as in ruleutils.c's
> get_variable: add a no-op cast to the table's rowtype.  So this
> would look like
> 
> appendStringInfo(&querybuf,
>  "CREATE TEMP TABLE %s AS "
>  "SELECT mv.ctid AS tid, newdata.*::%s "
>  "FROM %s mv FULL JOIN %s newdata ON (",
>  diffname, matviewname, matviewname, tempname);

Smart piece.  I hadn't thought of that.

> Given that it took this long to notice the problem at all, maybe
> this is not a fix to cram in on the weekend before the release wrap.
> But I don't see why we need to settle for "mostly works" when
> "always works" is barely any harder.

Yes, I would vote to delay that for a couple of days.  That's not
worth taking a risk for.
--
Michael


signature.asc
Description: PGP signature


Re: ECPG bug fix: DECLARE STATEMENT and DEALLOCATE, DESCRIBE

2021-08-06 Thread Peter Geoghegan
On Fri, Jul 30, 2021 at 3:01 PM Peter Geoghegan  wrote:
> The RMT discussed this recently. We are concerned about this issue,
> including how it has been handled so far.
>
> If you're unable to resolve the issue (or at least commit time to
> resolving the issue) then the RMT will call for the revert of the
> original feature patch.

The RMT continues to be concerned about the lack of progress on this
open item, especially the lack of communication from Michael Meskes,
the committer of the original patch (commit ad8305a). We are prepared
to exercise our authority to resolve open items directly. The only
fallback option available to us is a straight revert of commit
ad8305a.

We ask that Michael Meskes give a status update here no later than
23:59 on Fri 13 Aug ("anywhere on earth" timezone). This update should
include a general assessment of the problem, a proposed resolution
(e.g., committing the proposed patch from Hayato Kuroda), and an
estimate of when we can expect the problem to be fully resolved. If
Michael is unable to provide a status update by that deadline, or if
Michael is unable to commit to a reasonably prompt resolution for this
open item, then the RMT will call for a revert without further delay.

The RMT's first responsibility is to ensure that PostgreSQL 14 is
released on schedule. We would prefer to avoid a revert, but we cannot
allow this to drag on indefinitely.

--
Peter Geoghegan




Re: Use generation context to speed up tuplesorts

2021-08-06 Thread Tomas Vondra

On 8/6/21 3:07 PM, David Rowley wrote:

On Wed, 4 Aug 2021 at 02:10, Tomas Vondra  wrote:

A review would be nice, although it can wait - It'd be interesting to
know if those patches help with the workload(s) you've been looking at.


I tried out the v2 set of patches using the attached scripts.  The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.

I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see both patched and unpatched
the allocations were 272 bytes each yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests the number of bytes used for each
allocation is larger with the generation context. 56 vs 48.  This is
due to the GenerationChunk struct size being larger than the Allocset's
version by 8 bytes.  This is because it also holds the
GenerationBlock.  So with the patch there are some cases where we'll
use slightly more memory.

Additional tests:

1. Sort 1 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.



Thanks for the scripts and the spreadsheet with results.

I doubt quadrupling the allocations would help very much, but I suspect 
the problem might be in the 0004 patch - at least that's what shows 
a regression in my results. Could you try with just 0001-0003 applied?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-06 Thread Soumyadeep Chakraborty
Rebased. Also added a stronger check to see if the standby is stuck in
recovery_min_apply_delay:

$node_standby->poll_query_until('postgres', qq{
SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
WHERE backend_type='startup';
}) or die "Timed out checking if startup is in recovery_min_apply_delay";

Attached v3.

Regards,
Soumyadeep (VMware)
From 441716f45d0fbffde1135ba007c3a6c5b6b464de Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v3 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 +---
 src/test/recovery/t/005_replay_delay.pl | 36 +++--
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ec6a834be..3e05119212b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..4f409ad10d1 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,7 +56,6 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
-
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('standby2');
 $node_standby2->init_from_backup($node_primary, $backup_name,
@@ -110,3 +109,36 @@ $node_standby2->poll_query_until('postgres',
 $node_standby2->promote;
 $node_standby2->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery()")
   or die "Timed out while waiting for promotion to finish";
+
+# Now test to see if updates to recovery_min_apply_delay apply when a standby is
+# waiting for a recovery delay to elapse.
+
+# First, set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(51, 60))");
+
+# The standby's flush LSN should be caught up.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+
+# The standby's replay LSN will not be caught up as it should be waiting in
+# recovery_min_apply_delay.
+$node_standby->poll_query_until('postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO DEFAULT;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
-- 
2.25.1



Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-06 Thread Masahiko Sawada
On Fri, Aug 6, 2021 at 2:50 PM Amit Kapila  wrote:
>
> On Fri, Aug 6, 2021 at 10:09 AM Amit Kapila  wrote:
> >
> > On Fri, Aug 6, 2021 at 5:09 AM Masahiko Sawada  
> > wrote:
> > >
> >
> > But, isn't this happening because of your suggestion to compare the
> > current set of relations with relations from publications that doesn't
> > need to be removed? Do we have better ideas to fix or shall we just
> > say that we will refresh based on whatever current set of relations
> > are present in publication to be dropped?
> >
>
> The other options could be that (a) for drop publication, we refresh
> all the publications, (b) for both add/drop publication, we refresh
> all the publications.
>
> Any better ideas?
>
> As this is introduced in PG-14 via the below commit, I am adding
> authors and committer to see if they have any better ideas.
>
> commit 82ed7748b710e3ddce3f7ebc74af80fe4869492f
> Author: Peter Eisentraut 
> Date:   Tue Apr 6 10:44:26 2021 +0200
>
> ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION
>
>  Author: Japin Li 
>  Author: Bharath Rupireddy 
>

I've added this to Open Items.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [bug] Logical Decoding of relation rewrite with toast does not reset toast_hash

2021-08-06 Thread David Zhang
Hi Drouvot,
I can reproduce the issue you mentioned on REL_12_STABLE as well as the Master 
branch, but the patch doesn't apply to REL_12_STABLE. After applying it to the 
Master branch, it returns some weird results when running the query for the 
first time. 
As you can see in the log below, after executing the query `select 
* from pg_logical_slot_get_changes('bdt_slot', null, null);` for the first 
time, it returns some extra data.
david:postgres$ psql -d postgres
psql (15devel)
Type "help" for help.

postgres=# \q
david:postgres$ psql -d postgres
psql (15devel)
Type "help" for help.

postgres=# select 
pg_create_logical_replication_slot('bdt_slot','test_decoding');
 pg_create_logical_replication_slot 
-------------------------------------
 (bdt_slot,0/1484FA8)
(1 row)

postgres=# CREATE TABLE tbl1 (a INT, b TEXT);
CREATE TABLE
postgres=# CREATE TABLE tbl2 (a INT);
CREATE TABLE
postgres=# ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
ALTER TABLE
postgres=# 
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ;
INSERT 0 1
postgres=*# ALTER TABLE tbl1 ADD COLUMN id serial primary key;
ALTER TABLE
postgres=*# INSERT INTO tbl2 VALUES(1);
INSERT 0 1
postgres=*# commit;
COMMIT
postgres=# 
postgres=# select * from pg_logical_slot_get_changes('bdt_slot', null, null);
    lsn    | xid |                          data
-----------+-----+------------------------------------------------
 [output truncated]

Re: Numeric x^y for negative x

2021-08-06 Thread Dean Rasheed
On Fri, 6 Aug 2021 at 21:26, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > On Fri, 6 Aug 2021 at 17:15, Tom Lane  wrote:
> >> Looks plausible by eyeball (I've not tested).
>
> > So, I have back-branch patches for this ready to go. The question is,
> > is it better to push now, or wait until after next week's releases?
>
> I'd push now, given we have a failing buildfarm member.
>
> Admittedly, there may be nobody else using that compiler out in
> the real world, but we don't know that.
>

OK. Will do.

Regards,
Dean




Re: Numeric x^y for negative x

2021-08-06 Thread Tom Lane
Dean Rasheed  writes:
> On Fri, 6 Aug 2021 at 17:15, Tom Lane  wrote:
>> Looks plausible by eyeball (I've not tested).

> So, I have back-branch patches for this ready to go. The question is,
> is it better to push now, or wait until after next week's releases?

I'd push now, given we have a failing buildfarm member.

Admittedly, there may be nobody else using that compiler out in
the real world, but we don't know that.

regards, tom lane




Re: Alias collision in `refresh materialized view concurrently`

2021-08-06 Thread Tom Lane
I wrote:
> I just came across this issue while preparing the release notes.
> ISTM that people have expended a great deal of effort on a fundamentally
> unreliable solution, when a reliable one is easily available.

Concretely, I propose the attached.

regards, tom lane

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 9493b227b4..512b00bc65 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -537,9 +537,12 @@ transientrel_destroy(DestReceiver *self)
 /*
  * Given a qualified temporary table name, append an underscore followed by
  * the given integer, to make a new table name based on the old one.
+ * The result is a palloc'd string.
  *
- * This leaks memory through palloc(), which won't be cleaned up until the
- * current memory context is freed.
+ * As coded, this would fail to make a valid SQL name if the given name were,
+ * say, "FOO"."BAR".  Currently, the table name portion of the input will
+ * never be double-quoted because it's of the form "pg_temp_NNN", cf
+ * make_new_heap().  But we might have to work harder someday.
  */
 static char *
 make_temptable_name_n(char *tempname, int n)
@@ -627,16 +630,20 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 	 * that in a way that allows showing the first duplicated row found.  Even
 	 * after we pass this test, a unique index on the materialized view may
 	 * find a duplicate key problem.
+	 *
+	 * Note: here and below, we use "tablename.*::tablerowtype" as a hack to
+	 * keep ".*" from being expanded into multiple columns in a SELECT list.
+	 * Compare ruleutils.c's get_variable().
 	 */
 	resetStringInfo(&querybuf);
 	appendStringInfo(&querybuf,
-	 "SELECT _$newdata FROM %s _$newdata "
-	 "WHERE _$newdata IS NOT NULL AND EXISTS "
-	 "(SELECT 1 FROM %s _$newdata2 WHERE _$newdata2 IS NOT NULL "
-	 "AND _$newdata2 OPERATOR(pg_catalog.*=) _$newdata "
-	 "AND _$newdata2.ctid OPERATOR(pg_catalog.<>) "
-	 "_$newdata.ctid)",
-	 tempname, tempname);
+	 "SELECT newdata.*::%s FROM %s newdata "
+	 "WHERE newdata.* IS NOT NULL AND EXISTS "
+	 "(SELECT 1 FROM %s newdata2 WHERE newdata2.* IS NOT NULL "
+	 "AND newdata2.* OPERATOR(pg_catalog.*=) newdata.* "
+	 "AND newdata2.ctid OPERATOR(pg_catalog.<>) "
+	 "newdata.ctid)",
+	 tempname, tempname, tempname);
 	if (SPI_execute(querybuf.data, false, 1) != SPI_OK_SELECT)
 		elog(ERROR, "SPI_exec failed: %s", querybuf.data);
 	if (SPI_processed > 0)
@@ -663,9 +670,9 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 	resetStringInfo(&querybuf);
 	appendStringInfo(&querybuf,
 	 "CREATE TEMP TABLE %s AS "
-	 "SELECT _$mv.ctid AS tid, _$newdata "
-	 "FROM %s _$mv FULL JOIN %s _$newdata ON (",
-	 diffname, matviewname, tempname);
+	 "SELECT mv.ctid AS tid, newdata.*::%s AS newdata "
+	 "FROM %s mv FULL JOIN %s newdata ON (",
+	 diffname, tempname, matviewname, tempname);
 
 	/*
 	 * Get the list of index OIDs for the table from the relcache, and look up
@@ -757,9 +764,9 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 if (foundUniqueIndex)
 	appendStringInfoString(, " AND ");
 
-leftop = quote_qualified_identifier("_$newdata",
+leftop = quote_qualified_identifier("newdata",
 	NameStr(attr->attname));
-rightop = quote_qualified_identifier("_$mv",
+rightop = quote_qualified_identifier("mv",
 	 NameStr(attr->attname));
 
 generate_operator_clause(,
@@ -787,8 +794,8 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 	Assert(foundUniqueIndex);
 
 	appendStringInfoString(&querybuf,
-		   " AND _$newdata OPERATOR(pg_catalog.*=) _$mv) "
-		   "WHERE _$newdata IS NULL OR _$mv IS NULL "
+		   " AND newdata.* OPERATOR(pg_catalog.*=) mv.*) "
+		   "WHERE newdata.* IS NULL OR mv.* IS NULL "
 		   "ORDER BY tid");
 
 	/* Create the temporary "diff" table. */
@@ -814,10 +821,10 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 	/* Deletes must come before inserts; do them first. */
 	resetStringInfo(&querybuf);
 	appendStringInfo(&querybuf,
-	 "DELETE FROM %s _$mv WHERE ctid OPERATOR(pg_catalog.=) ANY "
-	 "(SELECT _$diff.tid FROM %s _$diff "
-	 "WHERE _$diff.tid IS NOT NULL "
-	 "AND _$diff._$newdata IS NULL)",
+	 "DELETE FROM %s mv WHERE ctid OPERATOR(pg_catalog.=) ANY "
+	 "(SELECT diff.tid FROM %s diff "
+	 "WHERE diff.tid IS NOT NULL "
+	 "AND diff.newdata IS NULL)",
 	 matviewname, diffname);
 	if (SPI_exec(querybuf.data, 0) != SPI_OK_DELETE)
 		elog(ERROR, "SPI_exec failed: %s", querybuf.data);
@@ -825,8 +832,8 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 	/* Inserts go last. */
 	resetStringInfo(&querybuf);
 	appendStringInfo(&querybuf,
-	 "INSERT INTO %s SELECT (_$diff._$newdata).* "
-	 "FROM %s _$diff WHERE tid IS NULL",
+	 "INSERT INTO %s SELECT (diff.newdata).* "
+	 "FROM %s diff WHERE tid 

Re: Numeric x^y for negative x

2021-08-06 Thread Dean Rasheed
On Fri, 6 Aug 2021 at 17:15, Tom Lane  wrote:
>
> > I guess the best thing to do is just test the value against
> > PG_INT32_MIN/MAX, which is what int84() does. There are 2 other places
> > in numeric.c that use similar code to check for int16/32 overflow, so
> > it's possible that they're broken in the same way on that platform,
> > but they aren't covered by the regression tests, so it's also possible
> > that they're OK. Anyway, something like the attached seems likely to
> > be safer.
>
> Looks plausible by eyeball (I've not tested).
>

So, I have back-branch patches for this ready to go. The question is,
is it better to push now, or wait until after next week's releases?

Regards,
Dean




Re: OpenSSL 3.0.0 compatibility

2021-08-06 Thread Tom Lane
Daniel Gustafsson  writes:
> Until there is an animal running OpenSSL 3.0.0 in the buildfarm I think this
> should be HEAD only.  Further down the line we need to support OpenSSL 3 in 
> all
> backbranches IMO since they are all equally likely to be compiled against it,
> but not until we can regularly test against it in the farm.

Works for me.

regards, tom lane




Re: OpenSSL 3.0.0 compatibility

2021-08-06 Thread Daniel Gustafsson
> On 6 Aug 2021, at 21:01, Tom Lane  wrote:
> 
> Peter Eisentraut  writes:
>> Are you going to commit these?

Absolutely, a combination of unplanned home renovations and vacations changed
my plans a bit recently.

> Note that with release wraps scheduled for Monday, we are probably
> already past the time when it'd be wise to push anything that has
> a significant chance of introducing portability issues.  There's
> just not much time to deal with it if the buildfarm shows problems.
> So unless you intend this as HEAD-only, I'd counsel waiting for the
> release window to pass.

Until there is an animal running OpenSSL 3.0.0 in the buildfarm I think this
should be HEAD only.  Further down the line we need to support OpenSSL 3 in all
backbranches IMO since they are all equally likely to be compiled against it,
but not until we can regularly test against it in the farm.

--
Daniel Gustafsson   https://vmware.com/





Re: OpenSSL 3.0.0 compatibility

2021-08-06 Thread Tom Lane
Peter Eisentraut  writes:
> Are you going to commit these?

Note that with release wraps scheduled for Monday, we are probably
already past the time when it'd be wise to push anything that has
a significant chance of introducing portability issues.  There's
just not much time to deal with it if the buildfarm shows problems.
So unless you intend this as HEAD-only, I'd counsel waiting for the
release window to pass.

regards, tom lane




Release notes for August minor releases

2021-08-06 Thread Tom Lane
I've pushed the first draft for $SUBJECT at

https://git.postgresql.org/pg/commitdiff/2f38ec6a157b60291ece1deb072bfff84f317334

Please send comments/corrections by Sunday.

regards, tom lane




Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Mahendra Singh Thalor
On Fri, 6 Aug 2021 at 21:17, Dilip Kumar  wrote:
>
> On Fri, Aug 6, 2021 at 8:53 PM Himanshu Upadhyaya
>  wrote:
> >
> > Hi Sadhu,
> >
> > Patch working as expected with shared tables, just one Minor comment on the 
> > patch.
> > +   if (!dbentry)
> > +   return;
> > +
> > +   /*
> > +* We simply throw away all the shared table entries by recreating 
> > new
> > +* hash table for them.
> > +*/
> > +   if (dbentry->tables != NULL)
> > +   hash_destroy(dbentry->tables);
> > +   if (dbentry->functions != NULL)
> > +   hash_destroy(dbentry->functions);
> > +
> > +   dbentry->tables = NULL;
> > +   dbentry->functions = NULL;
> > +
> > +   /*
> > +* This creates empty hash tables for tables and functions.
> > +*/
> > +   reset_dbentry_counters(dbentry);
> >
> > We already have the above code for non-shared tables, can we restrict 
> > duplicate code?
> > one solution I can think of, if we can have a list with two elements and 
> > iterate each element with
> > these common steps?
>
> Another idea could be that instead of putting this logic in
> pgstat_recv_resetcounter(), we can have this logic in pg_stat_reset()
> or maybe in pgstat_reset_counters().  So the
> pgstat_recv_resetcounter() logic remains the same, and I think that
> logic is much cleaner, i.e. whatever dboid it got in the message, it
> will reset the stats for that dboid.
>
> So now it depends upon the callers what they want to do; in this
> case the caller of pgstat_recv_resetcounter() can send two
> messages: one for MyDatabaseOid, which it is already doing, and another
> message for InvalidOid.
>
Hi,
I reviewed the patch; please find my review comments below:

1)
In pgstat_recv_resetcounter, first we are checking for m_databaseid.

+++ b/src/backend/postmaster/pgstat.c
@@ -5092,7 +5092,6 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter
*msg, int len)
  * Lookup the database in the hashtable.  Nothing to do if not there.
  */
 dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

 if (!dbentry)
 return;
 If we don't find any dbentry, then we return, but I think we
should still try to reset the stats for shared tables. I may be wrong because
I haven't tested this.

2)
+
+/*
+ * Lookup for the shared tables also to reset the stats
+ */
+dbentry = pgstat_get_db_entry(InvalidOid, false);
+if (!dbentry)
+return;

I think we should always get a dbentry for shared tables, so we can add an
assert here.

+Assert (dbentry);

3)
 pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 {
 PgStat_StatDBEntry *dbentry;
+    bool        found;

 dbentry = pgstat_get_db_entry(msg->m_databaseid, false);

@@ -5168,13 +5192,41 @@
pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int
len)
 /* Set the reset timestamp for the whole database */
 dbentry->stat_reset_timestamp = GetCurrentTimestamp();

-/* Remove object if it exists, ignore it if not */
+/* Remove object if it exists, if not then may be it's a shared table */
 if (msg->m_resettype == RESET_TABLE)
+{
 (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
+   HASH_REMOVE, &found);
+if (!found)
+{
+/* If we didn't find it, maybe it's a shared table */
+dbentry = pgstat_get_db_entry(InvalidOid, false);
+if (dbentry)
+{
+/* Set the reset timestamp for the whole database */
+dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+(void) hash_search(dbentry->tables, (void *)
&(msg->m_objectid),
+   HASH_REMOVE, NULL);
+}
+}
+}
 else if (msg->m_resettype == RESET_FUNCTION)
+{
 (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
+   HASH_REMOVE, &found);
+if (!found)
+{
+/* If we didn't find it, maybe it's a shared table */
+dbentry = pgstat_get_db_entry(InvalidOid, false);
+if (dbentry)
+{
+/* Set the reset timestamp for the whole database */
+dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+(void) hash_search(dbentry->functions, (void *)
&(msg->m_objectid),
+   HASH_REMOVE, NULL);
+}
+}
+}
 }

Above code can be replaced with:
@@ -5160,7 +5162,10 @@
pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int
len)
 {
PgStat_StatDBEntry *dbentry;

-   dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+   if (IsSharedRelation(msg->m_objectid))
+   dbentry = pgstat_get_db_entry(InvalidOid, false);
+   else
+   
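
The quoted replacement is cut off above; presumably it would continue along
these lines (a sketch only, based on the visible context):

    if (IsSharedRelation(msg->m_objectid))
        dbentry = pgstat_get_db_entry(InvalidOid, false);
    else
        dbentry = pgstat_get_db_entry(msg->m_databaseid, false);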

Re: OpenSSL 3.0.0 compatibility

2021-08-06 Thread Peter Eisentraut

On 20.07.21 01:23, Daniel Gustafsson wrote:

So I think your proposed patch is sound and a good short-term and low-risk 
solution

The attached 0001 disables the padding.  I've tested this with OpenSSL 1.0.1,
1.0.2, 1.1.1 and Git HEAD at e278127cbfa2709d.

Another aspect of OpenSSL 3 compatibility is that of legacy cipher support, and
as we concluded upthread it's best to leave that to the user to define in
openssl.cnf.  The attached 0002 adds alternative output files for 3.0.0
installations without the legacy provider loaded, as well as adds a note in the
pgcrypto docs to enable it in case DES is needed.  It does annoy me a bit that
we don't load the openssl.cnf file for 1.0.1 if we start mentioning it in the
docs for other versions, but it's probably not worth the effort to fix it given
the lack of complaints so far (it needs a call to OPENSSL_config(NULL); guarded
to HAVE_ macros for 1.0.1).


Are you going to commit these?




Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-06 Thread Japin Li



Hi,

Sorry for the late reply.  Having read this thread, I see the problem is caused
by my misunderstanding of AlterSubscription_refresh().  My apologies.

On Fri, 06 Aug 2021 at 14:12, Masahiko Sawada  wrote:
> On Fri, Aug 6, 2021 at 1:39 PM Amit Kapila  wrote:
>>
>> On Fri, Aug 6, 2021 at 5:09 AM Masahiko Sawada  wrote:
>> >
>> > On Thu, Aug 5, 2021 at 11:40 PM houzj.f...@fujitsu.com
>> >  wrote:
>> > >
>> > > > To summary, I think that what we want to do in DROP SUBSCRIPTION cases 
>> > > > is to
>> > > > drop relations from pg_subscription_rel that are no longer included in 
>> > > > the set of
>> > > > after-calculated publications. In the latter example, when ALTER
>> > > > SUBSCRIPTION ... DROP PUBLICATION pub12, we should compare the current
>> > > > set of relations associated with {pub12, pub13} to the set of 
>> > > > relcations associated
>> > > > with pub13, not pub12. Then we can find out t2 is no longer associated 
>> > > > with the
>> > > > after-calculated publication (i.g., pub13) so remove it.
>> > >
>> > > Thanks for the review. You are right, I missed this point.
>> > > Attach new version patch which fix these problems(Also add a new 
>> > > pl-test).
>> > >
>> >
>> > Thank you for updating the patch!
>> >
>> > Here are some comments:
>> >
>> > I think that it still could happen that DROP PULIBCATION misses
>> > dropping relations. Please consider the following case:
>> >
>> > On publisher and subscriber:
>> > create table t1 (a int);
>> > create table t2 (a int);
>> > create table t3 (a int);
>> >
>> > On publisher:
>> > create publication pub12 for table t1, t2;
>> > create publication pub3 for table t3;
>> >
>> > On subscriber:
>> > create subscription sub connection 'dbname=postgres' publication pub12, 
>> > pub3;
>> >
>> > On publisher:
>> > alter publication pub3 add table t1;
>> >
>> > On subscriber:
>> > alter subscription sub drop publication pub12;
>> > select srsubid, srrelid::regclass::text, srsubstate, srsublsn from
>> > pg_subscription_rel;
>> >  srsubid | srrelid | srsubstate | srsublsn
>> > -+-++---
>> >16393 | t3  | r  | 0/1707CE0
>> >16393 | t1  | r  | 0/1707D18
>> >
>> > With v3 patch, pg_subscription_rel has entries for t3 and t1. But if
>> > ALTER SUBSCRIPTION ... DROP PUBLICATION pub12 refreshes only relations
>> > associated with pub12, t1 should be removed and be added when we
>> > refresh pub3.
>> >
>>
>> But, isn't this happening because of your suggestion to compare the
>> current set of relations with relations from publications that doesn't
>> need to be removed?
>
> Hmm yes, it cannot cover all cases. I had somehow misunderstood that
> the subscriber knows which relations are associated with which
> publications. Given that the subscriber doesn’t know which relations
> are associated with which publications (and the set of subscribed
> relations on the subscriber becomes out of date once the publication
> is updated), I think it's impossible to achieve that DROP PUBLICATION
> drops relations that are associated with only the publication being
> dropped.
>
>> Do we have better ideas to fix or shall we just
>> say that we will refresh based on whatever current set of relations
>> are present in publication to be dropped?
>
> I don't have better ideas. It seems to me that it's prudent that we
> accept the approach in v3 patch and refresh all publications in DROP
> PUBLICATION cases.
>

Maybe refreshing all publications in PUBLICATION cases is a simple way to
fix the problem.

>>
>> One more thing, I think it would be better to write a separate refresh
>> function for add/drop rather than adding checks in the current refresh
>> functionality. We can extract common functionality code which can be
>> called from the different refresh functions.
>
> +1
>

Agreed.

> Regards,


--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Numeric x^y for negative x

2021-08-06 Thread Tom Lane
Dean Rasheed  writes:
> So the "test for overflow by reverse-conversion" obviously isn't
> working as intended, and it's going through power_var_int() when it
> shouldn't. I don't think there's anything wrong with that code, so I
> think this is a compiler bug.

Yeah, looks like one.

> I guess the best thing to do is just test the value against
> PG_INT32_MIN/MAX, which is what int84() does. There are 2 other places
> in numeric.c that use similar code to check for int16/32 overflow, so
> it's possible that they're broken in the same way on that platform,
> but they aren't covered by the regression tests, so it's also possible
> that they're OK. Anyway, something like the attached seems likely to
> be safer.

Looks plausible by eyeball (I've not tested).

regards, tom lane




Re: Numeric x^y for negative x

2021-08-06 Thread Dean Rasheed
On Fri, 6 Aug 2021 at 03:58, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > On Thu, 5 Aug 2021 at 17:04, Tom Lane  wrote:
> >> It looks like castoroides is not happy with this patch:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides=2021-08-01%2008%3A52%3A43
>
> > Hmm, there's something very weird going on there.
>
> Yeah.  I tried to reproduce this on the gcc compile farm's Solaris 10
> machine, but the test passed fine for me.  The only obvious configuration
> difference I can find is that that machine has
>
> $ cc -V
> cc: Sun C 5.10 SunOS_sparc Patch 141861-10 2012/11/07
>
> whereas castorides' compiler seems to be a few years older.  So this
> does seem like it could be a compiler bug.
>

Ah, so the latest test results from castoroides confirm my previous
hypothesis, that it isn't even reaching the new code in power_var():

  0.99 ^ 233000 returned 1.0199545627709647
  0.99 ^ 70 returned 0.9396000441558118

which are actually the results you'd get if you just cast the exponent
to an int32, throwing away the top 32 bits and compute the results:

  0.99 ^ -197580800 = 1.0199545627709647
  0.99 ^ 623009792 = 0.9396000441558118

So the "test for overflow by reverse-conversion" obviously isn't
working as intended, and it's going through power_var_int() when it
shouldn't. I don't think there's anything wrong with that code, so I
think this is a compiler bug.

I guess the best thing to do is just test the value against
PG_INT32_MIN/MAX, which is what int84() does. There are 2 other places
in numeric.c that use similar code to check for int16/32 overflow, so
it's possible that they're broken in the same way on that platform,
but they aren't covered by the regression tests, so it's also possible
that they're OK. Anyway, something like the attached seems likely to
be safer.
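
To illustrate the difference outside of numeric.c, here is a standalone sketch
of the two idioms (illustrative only, not PostgreSQL code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Overflow test by reverse-conversion: narrows first, then compares.
     * This is the idiom that appears to be miscompiled on that platform. */
    static bool
    fits_in_int32_reverse(int64_t val, int32_t *out)
    {
        *out = (int32_t) val;
        return ((int64_t) *out == val);
    }

    /* Safer: explicit range check before narrowing. */
    static bool
    fits_in_int32_range(int64_t val, int32_t *out)
    {
        if (val < INT32_MIN || val > INT32_MAX)
            return false;
        *out = (int32_t) val;
        return true;
    }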

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 13bb968..0c210a3
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -4289,11 +4289,13 @@ numericvar_to_int32(const NumericVar *va
 	if (!numericvar_to_int64(var, &val))
 		return false;
 
+	if (unlikely(val < PG_INT32_MIN) || unlikely(val > PG_INT32_MAX))
+		return false;
+	
 	/* Down-convert to int4 */
 	*result = (int32) val;
 
-	/* Test for overflow by reverse-conversion. */
-	return ((int64) *result == val);
+	return true;
 }
 
 Datum
@@ -4373,15 +4375,14 @@ numeric_int2(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
  errmsg("smallint out of range")));
 
-	/* Down-convert to int2 */
-	result = (int16) val;
-
-	/* Test for overflow by reverse-conversion. */
-	if ((int64) result != val)
+	if (unlikely(val < PG_INT16_MIN) || unlikely(val > PG_INT16_MAX))
 		ereport(ERROR,
 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
  errmsg("smallint out of range")));
 
+	/* Down-convert to int2 */
+	result = (int16) val;
+
 	PG_RETURN_INT16(result);
 }
 
@@ -10186,10 +10187,7 @@ power_var(const NumericVar *base, const
 
 		if (numericvar_to_int64(exp, &expval64))
 		{
-			int			expval = (int) expval64;
-
-			/* Test for overflow by reverse-conversion. */
-			if ((int64) expval == expval64)
+			if (expval64 >= PG_INT32_MIN && expval64 <= PG_INT32_MAX)
 			{
 /* Okay, select rscale */
 rscale = NUMERIC_MIN_SIG_DIGITS;
@@ -10197,7 +10195,7 @@ power_var(const NumericVar *base, const
 rscale = Max(rscale, NUMERIC_MIN_DISPLAY_SCALE);
 rscale = Min(rscale, NUMERIC_MAX_DISPLAY_SCALE);
 
-power_var_int(base, expval, result, rscale);
+power_var_int(base, (int) expval64, result, rscale);
 return;
 			}
 		}
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
new file mode 100644
index 80b42f8..3707aff
--- a/src/test/regress/expected/numeric.out
+++ b/src/test/regress/expected/numeric.out
@@ -1154,6 +1154,55 @@ SELECT * FROM fract_only;
 (7 rows)
 
 DROP TABLE fract_only;
+-- Check conversion to integers
+SELECT (-9223372036854775808.5)::int8; -- should fail
+ERROR:  bigint out of range
+SELECT (-9223372036854775808.4)::int8; -- ok
+ int8 
+--
+ -9223372036854775808
+(1 row)
+
+SELECT 9223372036854775807.4::int8; -- ok
+int8 
+-
+ 9223372036854775807
+(1 row)
+
+SELECT 9223372036854775807.5::int8; -- should fail
+ERROR:  bigint out of range
+SELECT (-2147483648.5)::int4; -- should fail
+ERROR:  integer out of range
+SELECT (-2147483648.4)::int4; -- ok
+int4 
+-
+ -2147483648
+(1 row)
+
+SELECT 2147483647.4::int4; -- ok
+int4
+
+ 2147483647
+(1 row)
+
+SELECT 2147483647.5::int4; -- should fail
+ERROR:  integer out of range
+SELECT (-32768.5)::int2; -- should fail
+ERROR:  smallint out of range
+SELECT (-32768.4)::int2; -- ok
+  int2  
+
+ -32768
+(1 row)
+
+SELECT 32767.4::int2; -- ok
+ int2  
+---
+ 32767
+(1 row)
+
+SELECT 

Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Dilip Kumar
On Fri, Aug 6, 2021 at 8:53 PM Himanshu Upadhyaya
 wrote:
>
> Hi Sadhu,
>
> Patch working as expected with shared tables, just one Minor comment on the 
> patch.
> +   if (!dbentry)
> +   return;
> +
> +   /*
> +* We simply throw away all the shared table entries by recreating new
> +* hash table for them.
> +*/
> +   if (dbentry->tables != NULL)
> +   hash_destroy(dbentry->tables);
> +   if (dbentry->functions != NULL)
> +   hash_destroy(dbentry->functions);
> +
> +   dbentry->tables = NULL;
> +   dbentry->functions = NULL;
> +
> +   /*
> +* This creates empty hash tables for tables and functions.
> +*/
> +   reset_dbentry_counters(dbentry);
>
> We already have the above code for non-shared tables, can we restrict 
> duplicate code?
> one solution I can think of, if we can have a list with two elements and 
> iterate each element with
> these common steps?

Another idea could be that instead of putting this logic in
pgstat_recv_resetcounter(), we can have this logic in pg_stat_reset()
or maybe in pgstat_reset_counters().  So the
pgstat_recv_resetcounter() logic remains the same, and I think that
logic is much cleaner, i.e. whatever dboid it got in the message, it
will reset the stats for that dboid.

So now it depends upon the callers what they want to do; in this
case the caller of pgstat_recv_resetcounter() can send two
messages: one for MyDatabaseOid, which it is already doing, and another
message for InvalidOid.
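
A rough sketch of that idea, assuming the second message is sent from the
backend side (e.g. from pgstat_reset_counters()) and pgstat_recv_resetcounter()
stays unchanged:

    void
    pgstat_reset_counters(void)
    {
        PgStat_MsgResetcounter msg;

        if (pgStatSock == PGINVALID_SOCKET)
            return;

        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
        msg.m_databaseid = MyDatabaseId;
        pgstat_send(&msg, sizeof(msg));

        /* second message targets the shared-object entry (dboid = InvalidOid) */
        msg.m_databaseid = InvalidOid;
        pgstat_send(&msg, sizeof(msg));
    }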

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Himanshu Upadhyaya
Hi Sadhu,

Patch working as expected with shared tables, just one Minor comment on the
patch.
+   if (!dbentry)
+   return;
+
+   /*
+* We simply throw away all the shared table entries by recreating
new
+* hash table for them.
+*/
+   if (dbentry->tables != NULL)
+   hash_destroy(dbentry->tables);
+   if (dbentry->functions != NULL)
+   hash_destroy(dbentry->functions);
+
+   dbentry->tables = NULL;
+   dbentry->functions = NULL;
+
+   /*
+* This creates empty hash tables for tables and functions.
+*/
+   reset_dbentry_counters(dbentry);

We already have the above code for non-shared tables, can we restrict
duplicate code?
one solution I can think of, if we can have a list with two elements and
iterate each element with
these common steps?
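
For example, a rough sketch of that idea (illustrative only, not an actual
patch):

    Oid         dbids[] = {msg->m_databaseid, InvalidOid};

    for (int i = 0; i < lengthof(dbids); i++)
    {
        dbentry = pgstat_get_db_entry(dbids[i], false);
        if (!dbentry)
            continue;

        if (dbentry->tables != NULL)
            hash_destroy(dbentry->tables);
        if (dbentry->functions != NULL)
            hash_destroy(dbentry->functions);
        dbentry->tables = NULL;
        dbentry->functions = NULL;

        /* recreate empty hash tables for tables and functions */
        reset_dbentry_counters(dbentry);
    }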

Thanks,
Himanshu

On Fri, Aug 6, 2021 at 5:40 PM Himanshu Upadhyaya <
upadhyaya.himan...@gmail.com> wrote:

> Hi Sadhu,
>
>
> > The call to “pg_stat_reset“ does reset all the statistics data for
> > tables belonging to the current database but does not take care of
> > shared tables e.g pg_attribute.
>
> pg_attribute is not a shared catalog; is the mentioned scenario also
> applicable to a few other tables?
>
> I have just tried it without your patch:
>
> postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
>  relid | schemaname |   relname| heap_blks_read | heap_blks_hit |
> idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
> tidx_blks_read | tidx_blks_hit
>
> ---++--++---+---+--+-+++---
>   1249 | pg_catalog | pg_attribute | 29 |   522 |
> 8 |  673 | ||
>  |
> (1 row)
>
> postgres=# select pg_stat_reset();
>  pg_stat_reset
> ---
>
> (1 row)
>
> postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
>  relid | schemaname |   relname| heap_blks_read | heap_blks_hit |
> idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
> tidx_blks_read | tidx_blks_hit
>
> ---++--++---+---+--+-+++---
>   1249 | pg_catalog | pg_attribute |  0 | 0 |
> 0 |0 | ||
>  |
>
>
> We are able to reset the stats of pg_attribute without your patch.
>
> Thanks,
> Himanshu
>
> On Fri, Aug 6, 2021 at 1:56 PM Sadhuprasad Patro 
> wrote:
>
>> Hi,
>>
>> The call to “pg_stat_reset“ does reset all the statistics data for
>> tables belonging to the current database but does not take care of
>> shared tables e.g pg_attribute. Similarly to reset the statistics at
>> table level, the call to “pg_stat_reset_single_table_counters“ does
>> not take care of shared tables.
>>
>> When we reset all the statistics using the call “pg_stat_reset”, the
>> postgres process internally makes a call to “pgstat_recv_resetcounter“,
>> which resets the statistics of all the tables of the current database,
>> but does not reset the statistics of shared objects (database ID
>> 'InvalidOid').
>>
>> The same fix is made in the internal function
>> “pgstat_recv_resetsinglecounter“ to reset the statistics for the
>> shared table for the call "pg_stat_reset_single_table_counters".
>>
>> --
>> thank u
>> SADHU PRASAD
>> EnterpriseDB: http://www.enterprisedb.com
>>
>


Re: Alias collision in `refresh materialized view concurrently`

2021-08-06 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jun 02, 2021 at 03:44:45PM +0530, Bharath Rupireddy wrote:
>> Thanks.The changes with that approach are very minimal. PSA v5 and let
>> me know if any more changes are needed.

> Simple enough, so applied and back-patched.

I just came across this issue while preparing the release notes.
ISTM that people have expended a great deal of effort on a fundamentally
unreliable solution, when a reliable one is easily available.
The originally complained-of issue was that a user-chosen column name
could collide with the query-chosen table name:

ERROR: column reference "mv" is ambiguous
LINE 1: ...alog.=) mv.mv AND newdata OPERATOR(pg_catalog.*=) mv) WHERE ...

This is true, but it's self-inflicted damage, because all you have
to do is write the query so that mv is clearly a table name:

... mv.mv AND newdata.* OPERATOR(pg_catalog.*=) mv.*) WHERE ...

AFAICT that works and generates the identical parse tree to the original
coding.  The only place touched by the patch where it's a bit difficult to
make the syntax unambiguous this way is

"CREATE TEMP TABLE %s AS "
"SELECT _$mv.ctid AS tid, _$newdata "
"FROM %s _$mv FULL JOIN %s _$newdata ON (",

because newdata.* would ordinarily get expanded to multiple columns
if it's at the top level of a SELECT list, and that's not what we want.
However, that's easily fixed using the same hack as in ruleutils.c's
get_variable: add a no-op cast to the table's rowtype.  So this
would look like

appendStringInfo(&querybuf,
 "CREATE TEMP TABLE %s AS "
 "SELECT mv.ctid AS tid, newdata.*::%s "
 "FROM %s mv FULL JOIN %s newdata ON (",
 diffname, matviewname, matviewname, tempname);

Given that it took this long to notice the problem at all, maybe
this is not a fix to cram in on the weekend before the release wrap.
But I don't see why we need to settle for "mostly works" when
"always works" is barely any harder.

regards, tom lane




Re: RFC: Improve CPU cache locality of syscache searches

2021-08-06 Thread John Naylor
On Thu, Aug 5, 2021 at 4:12 PM Andres Freund  wrote:
> I have wondered before whether we should have syscache definitions
generate
> code specific to each syscache definition. I do think that'd give a good
bit
> of performance boost. But I don't see a trivial way to get there without
> notational overhead.
>
> We could define syscaches in a header using a macro that's defined
differently
> in syscache.c than everywhere else. The header would declare a set of
> functions for each syscache, syscache.c would define them to call an
> always_inline function with the relevant constants.
>
> Or perhaps we should move syscache definitions into the pg_*.h headers,
and
> generate the relevant code as part of their processing. That seems like it
> could be nice from a modularity POV alone. And it could do better than the
> current approach, because we could hardcode the types for columns in the
> syscache definition without increasing notational overhead.

I had code lying around to generate the info needed for initialization,
but I apparently deleted it and never emailed it. :-(

If we went that far, I wonder if we could specialize the cc_bucket for the
actual types, rather than just the number of Datums, which would further save
on space. More complex, though.

Also, if comparisons got cheaper, maybe we could get away with not storing
the full hash in the bucket in favor of a tag of the 16 highest bits. (The
catctup would still need the full hash for invalidation, at least).

Another idea I came across while researching in-memory key-value stores is
"bulk chaining" -- see [1] for an example. In that case, our example
2-cacheline bucket would have a dlist_head pointing not to the catctups,
but to another bucket. So that throws away my L1/L2 analogy and uses a
chain of buckets, each of which contain multiple sets of
hashes/keys/pointers. That's closer to a normal collision chain, but with
better cache locality. If we did that, I imagine the catctup could dispense
with storing its Datum array and its dlist_node entirely.

[1]
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf
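
To make the bulk-chaining idea concrete, a bucket might look roughly like this
(the layout, field names and slot count are purely illustrative, not existing
PostgreSQL code):

    #define CC_BUCKET_SLOTS 4

    typedef struct cc_bucket
    {
        uint16      hashtags[CC_BUCKET_SLOTS];  /* top 16 bits of each hash */
        void       *entries[CC_BUCKET_SLOTS];   /* pointers to cached tuples */
        struct cc_bucket *next;                 /* next bucket in the chain */
    } cc_bucket;

Each bucket then packs several tag/pointer pairs into one or two cache lines,
and the chain is only followed on overflow.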

--
John Naylor
EDB: http://www.enterprisedb.com


[PATCH] OpenSSL: Mark underlying BIO with the appropriate type flags

2021-08-06 Thread Itamar Gafni
Hi Everyone,

I've run into an issue where the OpenSSL API function "SSL_get_fd" fails 
because the underlying BIO object created by Postgres is not flagged properly.

Prior to OpenSSL version 1.1.0, the BIO methods object would be copied 
directly from the existing socket type and then its read/write functions would 
be replaced.
With 1.1.0 and up, the object is created from scratch and all its methods 
are initialized to those of the socket type, except read/write, which are 
custom.
In this newer code path, a new type is assigned by calling "BIO_get_new_index", 
but the related type flags aren't added.

For more information please see: 
https://www.openssl.org/docs/man1.1.0/man3/BIO_get_new_index.html
In this case, the type should have both BIO_TYPE_SOURCE_SINK and 
BIO_TYPE_DESCRIPTOR.

The proposed patch will add these flags to the BIO type on creation. I have 
compiled it with OpenSSL, enabled encryption and verified that basic queries 
work fine.

I don't believe this affects Postgres, since the code has been this way for 5 
years. I ran into it because I'm writing auditing code that hooks on Postgres 
calls. I've already found a workaround by adding these flags myself with an 
additional hook, but thought it would be worth bringing up here and see if you 
think it's worth patching.
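
For reference, the OpenSSL >= 1.1.0 path would presumably change along these
lines (a sketch only; the method name string and error handling are
illustrative):

    int         bio_index = BIO_get_new_index();

    if (bio_index == -1)
        return NULL;
    /* mark the custom BIO as a descriptor-based source/sink */
    bio_index |= (BIO_TYPE_SOURCE_SINK | BIO_TYPE_DESCRIPTOR);
    my_bio_methods = BIO_meth_new(bio_index, "PostgreSQL socket");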

Regards,
Itamar Gafni
Imperva


001-Add-type-flags-to-SSL-BIO.patch
Description: 001-Add-type-flags-to-SSL-BIO.patch


[PATCH]Remove obsolete macro CHECKFLOATVAL in btree_gist

2021-08-06 Thread tanghy.f...@fujitsu.com
Hi

Commit 6bf0bc842 replaced float.c's CHECKFLOATVAL() macro with static inline 
subroutines, but the fix introduced a performance regression, as Tom Lane 
pointed out and fixed in 607f8ce74.

I found an obsolete CHECKFLOATVAL usage in contrib/btree_gist and tried to fix 
it according to 607f8ce74.
The attached patch has been tested on master. All TAP tests passed.

Regards,
Tang


0001-Remove-obsolete-macro-CHECKFLOATVAL.patch
Description: 0001-Remove-obsolete-macro-CHECKFLOATVAL.patch


Re: logical replication empty transactions

2021-08-06 Thread Ajin Cherian
On Mon, Aug 2, 2021 at 7:20 PM Amit Kapila  wrote:
>
> On Fri, Jul 23, 2021 at 3:39 PM Ajin Cherian  wrote:
> >
>
> Let's first split the patch for prepared and non-prepared cases as
> that will help to focus on each of them separately.

As a first shot, I have split the patch into prepared and non-prepared cases.

regards,
Ajin Cherian
Fujitsu Australia


v12-0002-Skip-empty-prepared-transactions-for-logical-rep.patch
Description: Binary data


v12-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: Added schema level support for publication.

2021-08-06 Thread vignesh C
On Fri, Aug 6, 2021 at 4:02 PM Amit Kapila  wrote:
>
> On Fri, Aug 6, 2021 at 2:16 PM vignesh C  wrote:
> >
> > On Thu, Aug 5, 2021 at 3:54 PM Amit Kapila  wrote:
> > >
> > >
> > > Few more comments:
> > > ===
> > > 1. Do we need the previous column 'puballtables' after adding pubtype
> > > or pubkind in pg_publication?
> >
> > I felt this should be retained as old clients will still be using 
> > puballtables, e.g., in the case of an old client executing \dRp+ commands.
> >
>
> But do we guarantee that old clients work with newer server versions?
> For example, psql docs say: "psql works best with servers of the same
> or an older major version. Backslash commands are particularly likely
> to fail if the server is of a newer version than psql itself." [1]
> (See Notes Section). Similarly, pg_dump docs say: "However, pg_dump
> cannot dump from PostgreSQL servers newer than its own major version;
> it will refuse to even try, rather than risk making an invalid dump."
> [2] (See Notes Section).
>
> It seems Sawada-San has the same question and IIUC docs suggest we
> don't need such compatibility, so what makes you think we need it?

Ok, I was not sure if we could remove any system table columns, hence I
had retained it. It seems that my understanding was wrong. I will
remove this column in the next version of the patch.

Regards,
Vignesh




Re: Use generation context to speed up tuplesorts

2021-08-06 Thread David Rowley
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra  wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.

I tried out the v2 set of patches using the attached scripts.  The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.

I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see both patched and unpatched
the allocations were 272 bytes each yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests the number of bytes used for each
allocation is larger with the generation context. 56 vs 48.  This is
due to the GenerationChunk struct size being larger than the Allocset's
version by 8 bytes.  This is because it also holds the
GenerationBlock.  So with the patch there are some cases where we'll
use slightly more memory.

Additional tests:

1. Sort 1 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it might be worth seeing what happens if,
instead of doubling the allocations each time, we quadruple them
instead. I didn't try this.

David
#!/bin/bash
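# Benchmark (as described in the mail above): repeatedly time an ORDER BY
# with a large OFFSET via EXPLAIN ANALYZE and pgbench, then widen the table
# by one extra bigint column per iteration so the tuple width grows while
# the sorted data stays the same.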

sec=5
dbname=postgres
records=100
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" 
$dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
echo $i
psql -c "explain analyze select * from t order by a offset 10" 
$dbname | grep -E "Memory|Disk"
echo "select * from t order by a offset 10" > bench.sql
for loops in {1..3}
do
pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
done

psql -c "alter table t add column c$i bigint" $dbname
psql -c "update t set c$i = a" $dbname
psql -c "vacuum full t" $dbname
psql -c "vacuum freeze t" $dbname
done



generation context tuplesort.ods
Description: application/vnd.oasis.opendocument.spreadsheet
#!/bin/bash
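# Same benchmark as the script above, run with a different row count
# (the "records" setting below).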

sec=5
dbname=postgres
records=1
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" 
$dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
echo $i
psql -c "explain analyze select * from t order by a offset 10" 
$dbname | grep -E "Memory|Disk"
echo "select * from t order by a offset 10" > bench.sql
for loops in {1..3}
do
pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
done

psql -c "alter table t add column c$i bigint" $dbname
psql -c "update t set c$i = a" $dbname
psql -c "vacuum full t" $dbname
psql -c "vacuum freeze t" $dbname
done



Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?

2021-08-06 Thread Andrew Dunstan


On 8/5/21 11:29 PM, Andres Freund wrote:
> Hi,
>
> When testing EXEC_BACKEND on linux I see occasional test failures as long as I
> don't disable ASLR. There's a code comment to that effect:
>
>* If testing EXEC_BACKEND on Linux, you should run this as root before
>* starting the postmaster:
>*
>* echo 0 >/proc/sys/kernel/randomize_va_space
>
> but I don't like doing that on a system wide basis.
>
> Linux allows disabling ASLR on a per-process basis using
> personality(ADDR_NO_RANDOMIZE). There's a wrapper binary to do that as well,
> setarch --addr-no-randomize.
>
> I was wondering if we should have postmaster do personality(ADDR_NO_RANDOMIZE)
> for EXEC_BACKEND builds? It seems nicer to make it automatically work than
> have people remember that they need to call "setarch --addr-no-randomize make 
> check".
>
> Not that it actually matters for EXEC_BACKEND, but theoretically doing
> personality(ADDR_NO_RANDOMIZE) in postmaster is a tad more secure than doing
> it via setarch, as in the personality() case postmaster's layout itself is
> still randomized...
>
>
> Or perhaps we should just add a comment mentioning setarch.
>

If we can set it conveniently then that seems worth doing.
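For illustration, a minimal sketch of what the personality() approach could
look like (untested; the function name and where it gets called from are
invented here, this is not a posted patch):

#ifdef EXEC_BACKEND
#ifdef __linux__
#include <sys/personality.h>

/*
 * Disable address-space randomization for this process and everything it
 * fork()s/exec()s, so that EXEC_BACKEND child processes can map shared
 * memory at the same addresses as the postmaster.
 */
static void
disable_aslr_for_exec_backend(void)
{
	int			persona = personality(0xffffffffUL);	/* query current persona */

	if (persona != -1 && (persona & ADDR_NO_RANDOMIZE) == 0 &&
		personality(persona | ADDR_NO_RANDOMIZE) == -1)
		elog(LOG, "could not disable ASLR: %m");
}
#endif
#endif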


(Thinks: do we have non-Windows buildfarm members doing EXEC_BACKEND?)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Mahendra Singh Thalor
On Fri, 6 Aug 2021 at 17:40, Himanshu Upadhyaya
 wrote:
>
> Hi Sadhu,
>
>
> > The call to “pg_stat_reset“ does reset all the statistics data for
> > tables belonging to the current database but does not take care of
> > shared tables e.g pg_attribute.
>
> pg_attribute is not a shared catalog, is the mentioned scenario is also 
> applicable for few others tables?

Yes, I think Sadhu mentioned pg_attribute by mistake.

With the patch, I checked pg_database and verified that we are resetting its stats.

psql (15devel)
Type "help" for help.

postgres=# SELECT * FROM pg_statio_all_tables  where relid=1262;
 relid | schemaname |   relname   | heap_blks_read | heap_blks_hit |
idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
tidx_blks_read | tidx_blks_hit
---++-++---+---+--+-+++---
  1262 | pg_catalog | pg_database |  1 | 2 |
  2 |0 |   0 |  0 |
  0 | 0
(1 row)

postgres=#
postgres=# select pg_stat_reset();
 pg_stat_reset
---

(1 row)

postgres=# SELECT * FROM pg_statio_all_tables  where relid=1262;
 relid | schemaname |   relname   | heap_blks_read | heap_blks_hit |
idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
tidx_blks_read | tidx_blks_hit
---++-++---+---+--+-+++---
  1262 | pg_catalog | pg_database |  0 | 0 |
  0 |0 |   0 |  0 |
  0 | 0
(1 row)

postgres=#


Shared tables are:
1. pg_database -- DatabaseRelationId 1262
2. pg_tablespace -- TableSpaceRelationId 1213
3. pg_authid -- AuthIdRelationId 1260
4. pg_auth_members -- AuthMemRelationId 1261
5. pg_shdescription -- SharedDescriptionRelationId 2396
6. pg_shdepend -- SharedDependRelationId 1214
7. pg_shseclabel -- SharedSecLabelRelationId 3592
8. pg_db_role_setting -- DbRoleSettingRelationId 2694
9. pg_replication_origin -- ReplicationOriginRelationId 6000
10. pg_subscription -- SubscriptionRelationId 6100

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
>
> I have just tried it with-out your patch:
>
> postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
>  relid | schemaname |   relname| heap_blks_read | heap_blks_hit | 
> idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | 
> tidx_blks_read | tidx_blks_hit
> ---++--++---+---+--+-+++---
>   1249 | pg_catalog | pg_attribute | 29 |   522 | 
> 8 |  673 | |||
> (1 row)
>
> postgres=# select pg_stat_reset();
>  pg_stat_reset
> ---
>
> (1 row)
>
> postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
>  relid | schemaname |   relname| heap_blks_read | heap_blks_hit | 
> idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | 
> tidx_blks_read | tidx_blks_hit
> ---++--++---+---+--+-+++---
>   1249 | pg_catalog | pg_attribute |  0 | 0 | 
> 0 |0 | |||
>
>
> We are able to reset the stats of pg_attibute without your patch.
>
> Thanks,
> Himanshu
>
> On Fri, Aug 6, 2021 at 1:56 PM Sadhuprasad Patro  wrote:
>>
>> Hi,
>>
>> The call to “pg_stat_reset“ does reset all the statistics data for
>> tables belonging to the current database but does not take care of
>> shared tables e.g pg_attribute. Similarly to reset the statistics at
>> table level, the call to “pg_stat_reset_single_table_counters“ does
>> not take care of shared tables.
>>
>> When we reset all the statistics using the call “pg_stat_reset”, the
>> postgres process internally makes calls to “
>> pgstat_recv_resetcounter“, which resets the statistics of all the
>> tables of the current database. But not resetting the statistics of
>> shared objects using database ID as 'InvalidOid'.
>>
>> The same fix is made in the internal function
>> “pgstat_recv_resetsinglecounter“ to reset the statistics for the
>> shared table for the call "pg_stat_reset_single_table_counters".
>>
>> --
>> thank u
>> SADHU PRASAD
>> EnterpriseDB: http://www.enterprisedb.com



--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Himanshu Upadhyaya
Hi Sadhu,


> The call to “pg_stat_reset“ does reset all the statistics data for
> tables belonging to the current database but does not take care of
> shared tables e.g pg_attribute.

pg_attribute is not a shared catalog; is the mentioned scenario also
applicable to a few other tables?

I have just tried it with-out your patch:

postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
 relid | schemaname |   relname| heap_blks_read | heap_blks_hit |
idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
tidx_blks_read | tidx_blks_hit
---++--++---+---+--+-+++---
  1249 | pg_catalog | pg_attribute | 29 |   522 |
  8 |  673 | ||
 |
(1 row)

postgres=# select pg_stat_reset();
 pg_stat_reset
---

(1 row)

postgres=# SELECT * FROM pg_statio_all_tables  where relid=1249;
 relid | schemaname |   relname| heap_blks_read | heap_blks_hit |
idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit |
tidx_blks_read | tidx_blks_hit
---++--++---+---+--+-+++---
  1249 | pg_catalog | pg_attribute |  0 | 0 |
  0 |0 | ||
 |


We are able to reset the stats of pg_attribute without your patch.

Thanks,
Himanshu

On Fri, Aug 6, 2021 at 1:56 PM Sadhuprasad Patro  wrote:

> Hi,
>
> The call to “pg_stat_reset“ does reset all the statistics data for
> tables belonging to the current database but does not take care of
> shared tables e.g pg_attribute. Similarly to reset the statistics at
> table level, the call to “pg_stat_reset_single_table_counters“ does
> not take care of shared tables.
>
> When we reset all the statistics using the call “pg_stat_reset”, the
> postgres process internally makes calls to “
> pgstat_recv_resetcounter“, which resets the statistics of all the
> tables of the current database. But not resetting the statistics of
> shared objects using database ID as 'InvalidOid'.
>
> The same fix is made in the internal function
> “pgstat_recv_resetsinglecounter“ to reset the statistics for the
> shared table for the call "pg_stat_reset_single_table_counters".
>
> --
> thank u
> SADHU PRASAD
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Mahendra Singh Thalor
On Fri, 6 Aug 2021 at 13:56, Sadhuprasad Patro  wrote:
>
> Hi,
>
> The call to “pg_stat_reset“ does reset all the statistics data for
> tables belonging to the current database but does not take care of
> shared tables e.g pg_attribute. Similarly to reset the statistics at
> table level, the call to “pg_stat_reset_single_table_counters“ does
> not take care of shared tables.
>
> When we reset all the statistics using the call “pg_stat_reset”, the
> postgres process internally makes calls to “
> pgstat_recv_resetcounter“, which resets the statistics of all the
> tables of the current database. But not resetting the statistics of
> shared objects using database ID as 'InvalidOid'.
>
> The same fix is made in the internal function
> “pgstat_recv_resetsinglecounter“ to reset the statistics for the
> shared table for the call "pg_stat_reset_single_table_counters".
>

Hi Sadhu,
I had an initial look at the patch and will look more in the coming days.
I have the below comments.

1. Can you please share the test case to reproduce the issue, and please
add a test case to the patch.

2. diff --git a/src/backend/postmaster/pgstat.c
b/src/backend/postmaster/pgstat.c
index 56755cb92b..f272931276 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5092,7 +5092,6 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter
*msg, int len)
 * Lookup the database in the hashtable.  Nothing to do if not there.
 */
dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
if (!dbentry)
return;

I think, by mistake, you removed one line in the patch.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: Added schema level support for publication.

2021-08-06 Thread Amit Kapila
On Fri, Aug 6, 2021 at 2:02 PM vignesh C  wrote:
>
> On Wed, Aug 4, 2021 at 4:10 PM Amit Kapila  wrote:
> >
> > On Tue, Aug 3, 2021 at 8:38 PM vignesh C  wrote:
>
> > 6.
> > + {PublicationSchemaRelationId, /* PUBLICATIONSCHEMAMAP */
> > + PublicationSchemaPsnspcidPspubidIndexId,
> > + 2,
> > + {
> > + Anum_pg_publication_schema_psnspcid,
> > + Anum_pg_publication_schema_pspubid,
> > + 0,
> > + 0
> > + },
> >
> > Why don't we keep pubid as the first column in this index?
>
> I wanted to keep it similar to PUBLICATIONRELMAP, should we keep it as
> it is, thoughts?
>

Okay, I see your point. I think for PUBLICATIONRELMAP, we need it
because it is searched using only the relid in
GetRelationPublications; similarly, in the patch, you are using
schema_oid in GetSchemaPublications, so probably that will help. I was
wondering why you haven't directly used the cache in
GetSchemaPublications, similar to GetRelationPublications? It seems
there is a danger of concurrent object drop. Can you please check how
the safety is ensured, say, when someone wants to drop the
corresponding relation/schema or publication? Another point: why
can't we use the other index (where the index is on relid or
schema_oid (PublicationSchemaObjectIndexId))?

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-08-06 Thread Amit Kapila
On Fri, Aug 6, 2021 at 2:16 PM vignesh C  wrote:
>
> On Thu, Aug 5, 2021 at 3:54 PM Amit Kapila  wrote:
> >
> >
> > Few more comments:
> > ===
> > 1. Do we need the previous column 'puballtables' after adding pubtype
> > or pubkind in pg_publication?
>
> I felt this should be retained as old client will still be using 
> puballtables, like in case of old client executing \dRp+  commands.
>

But do we guarantee that old clients work with newer server versions?
For example, psql docs say: "psql works best with servers of the same
or an older major version. Backslash commands are particularly likely
to fail if the server is of a newer version than psql itself." [1]
(See Notes Section). Similarly, pg_dump docs say: "However, pg_dump
cannot dump from PostgreSQL servers newer than its own major version;
it will refuse to even try, rather than risk making an invalid dump."
[2] (See Notes Section).

It seems Sawada-San has the same question and IIUC docs suggest we
don't need such compatibility, so what makes you think we need it?

[1] - https://www.postgresql.org/docs/devel/app-psql.html
[2] - https://www.postgresql.org/docs/devel/app-pgdump.html

-- 
With Regards,
Amit Kapila.




RE: [PATCH]Comment improvement in publication.sql

2021-08-06 Thread tanghy.f...@fujitsu.com
Hi

I saw some inaccurate comments for AlterPublicationStmt structure when
reviewing patches related to publication[1].

The variable 'tables' is used for 'ALTER PUBLICATION ... ADD/DROP/SET TABLE',
but the comments only say 'ADD/DROP'. How about adding 'SET' to the comments?

typedef struct AlterPublicationStmt
{
	NodeTag		type;
	char	   *pubname;		/* Name of the publication */

	/* parameters used for ALTER PUBLICATION ... WITH */
	List	   *options;		/* List of DefElem nodes */

	/* parameters used for ALTER PUBLICATION ... ADD/DROP TABLE */
	List	   *tables;			/* List of tables to add/drop */
	bool		for_all_tables;	/* Special publication for all tables in db */
	DefElemAction tableAction;	/* What action to perform with the tables */
} AlterPublicationStmt;
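For illustration, the updated comments could read roughly like this (a
sketch only; see the attached patch for the actual change):

	/* parameters used for ALTER PUBLICATION ... ADD/DROP/SET TABLE */
	List	   *tables;			/* List of tables to add/drop/set */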

It's also a comment improvement, so I added this change to this patch.
A new version of the patch is attached; please have a look.

[1] 
https://www.postgresql.org/message-id/OS0PR01MB61132C2C4E2232258EB192FDFBF19%40OS0PR01MB6113.jpnprd01.prod.outlook.com

Regards
Tang


v3-improvement-on-comment.patch
Description: v3-improvement-on-comment.patch


Re: Added schema level support for publication.

2021-08-06 Thread vignesh C
On Thu, Aug 5, 2021 at 3:54 PM Amit Kapila  wrote:
>
> On Wed, Aug 4, 2021 at 4:10 PM Amit Kapila 
wrote:
> >
> > On Tue, Aug 3, 2021 at 8:38 PM vignesh C  wrote:
> > >
> > > Thanks for reporting this, this is fixed in the v18 patch attached.
> > >
> >
> > I have started looking into this patch and below are some initial
comments.
> >
>
> Few more comments:
> ===
> 1. Do we need the previous column 'puballtables' after adding pubtype
> or pubkind in pg_publication?

I felt this should be retained as old clients will still be using
puballtables, e.g., an old client executing \dRp+ commands.

> 2.
> @@ -224,6 +279,20 @@ CreatePublication(ParseState *pstate,
> CreatePublicationStmt *stmt)
> ..
> + nspcrel = table_open(NamespaceRelationId, ShareUpdateExclusiveLock);
> + PublicationAddSchemas(puboid, schemaoidlist, true, NULL);
> + table_close(nspcrel, ShareUpdateExclusiveLock);
>
> What is the reason for opening and taking a lock on
> NamespaceRelationId? Do you want to avoid dropping the corresponding
> schema during this duration? If so, that is not sufficient because
> what if somebody drops it just before you took lock on
> NamespaceRelationId. I think you need to use LockDatabaseObject to
> avoid dropping the schema and note that it should be unlocked only at
> the end of the transaction similar to what we do for tables. I guess
> you need to add this locking inside the function
> PublicationAddSchemas. Also, it would be good if you can add few
> comments in this part of the code to explain the reason for locking.

Modified.

> 3. The function PublicationAddSchemas is called from multiple places
> in the patch but the locking protection is not there at all places. I
> think if you add locking as suggested in the previous point then it
> should be okay. I think you need similar locking for
> PublicationDropSchemas.

Modified.

> 4.
> @@ -421,16 +537,84 @@ AlterPublicationTables(AlterPublicationStmt
> *stmt, Relation rel,
>   PublicationAddTables(pubid, rels, true, stmt);
>
>   CloseTableList(delrels);
> + if (pubform->pubtype == PUBTYPE_EMPTY)
> + UpdatePublicationTypeTupleValue(rel, tup,
> + Anum_pg_publication_pubtype,
> + PUBTYPE_TABLE);
>   }
>
> At the above and all other similar places, the patch first updates the
> pg_publication after performing the rel/schema operation. Isn't it
> better to first update pg_publication to remain in sync with how
> CreatePublication works? I am not able to see any issue with the way
> you have it in the patch but it is better to keep the code consistent
> across various similar functions to avoid confusion in the future.

Modified.

Thanks for the comments, the changes for the above are available in the v19
patch posted at [1].

[1] -
https://www.postgresql.org/message-id/CALDaNm3BMLBpWOSdS3Q2vwpsM%3D0yovpJm8dKHRqNyFpANbrhpw%40mail.gmail.com

Regards,
Vignesh


Re: Added schema level support for publication.

2021-08-06 Thread vignesh C
On Wed, Aug 4, 2021 at 8:08 PM tanghy.f...@fujitsu.com <
tanghy.f...@fujitsu.com> wrote:
>
> On Tuesday, August 3, 2021 11:08 PM vignesh C  wrote:
> >
> > Thanks for reporting this, this is fixed in the v18 patch attached.
>
> Thanks for fixing it.
>
> Few suggestions for V18:
>
> 1.
> +# Clean up the tables on both publisher and subscriber as we don't need
them
> +$node_publisher->safe_psql('postgres', "DROP SCHEMA sch1 cascade");
> +$node_publisher->safe_psql('postgres', "DROP SCHEMA sch2 cascade");
> +$node_publisher->safe_psql('postgres', "DROP SCHEMA sch3 cascade");
> +$node_subscriber->safe_psql('postgres', "DROP SCHEMA sch1 cascade");
> +$node_subscriber->safe_psql('postgres', "DROP SCHEMA sch2 cascade");
> +$node_subscriber->safe_psql('postgres', "DROP SCHEMA sch3 cascade");
>
> Should we change the comment to "Clean up the schemas ... ", instead of
'tables'?

Modified.

> 2.
> +$result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM sch1.tab3");
>
> Spaces are used here(and some other places), but in most places we use a
TAB, so
> I think it's better to change it to a TAB.

Modified.

> 3.
> List   *tables; /* List of tables to
add/drop */
> boolfor_all_tables; /* Special publication for all
tables in db */
> DefElemAction tableAction;  /* What action to perform with
the tables */
> +   List   *schemas;/* Optional list of schemas */
>  } AlterPublicationStmt;
>
> Should we change the comment here to 'List of schemas to add/drop', then
it can
> be consistent with the comment for 'tables'.

Modified.

> I also noticed that 'tableAction' variable is used when we add/drop/set
schemas,
> so maybe the variable name is not suitable any more.

Changed the variable name.

> Besides, the following comment is above these codes. Should we add some
comments
> for schema?
>
> /* parameters used for ALTER PUBLICATION ... ADD/DROP TABLE */

Modified

> And it says 'add/drop', do we need to add 'set'? (it's not related to this
> patch, so I think I can add it in another thread[1] if needed, which is
related
> to comment improvement)

You can include the change in the patch posted.

> 4.
> I saw the existing tests about permissions in publication.sql, should we
add
> tests for schema publication? Like this:
>
> diff --git a/src/test/regress/sql/publication.sql
b/src/test/regress/sql/publication.sql
> index 33dbdf7bed..c19337631e 100644
> --- a/src/test/regress/sql/publication.sql
> +++ b/src/test/regress/sql/publication.sql
> @@ -160,16 +160,19 @@ GRANT CREATE ON DATABASE regression TO
regress_publication_user2;
>  SET ROLE regress_publication_user2;
>  SET client_min_messages = 'ERROR';
>  CREATE PUBLICATION testpub2;  -- ok
> +CREATE PUBLICATION testpub3;  -- ok
>  RESET client_min_messages;
>
>  ALTER PUBLICATION testpub2 ADD TABLE testpub_tbl1;  -- fail
> +ALTER PUBLICATION testpub3 ADD SCHEMA pub_test;  -- fail
>
>  SET ROLE regress_publication_user;
>  GRANT regress_publication_user TO regress_publication_user2;
>  SET ROLE regress_publication_user2;
>  ALTER PUBLICATION testpub2 ADD TABLE testpub_tbl1;  -- ok
> +ALTER PUBLICATION testpub3 ADD SCHEMA pub_test;  -- ok
>
> -DROP PUBLICATION testpub2;
> +DROP PUBLICATION testpub2, testpub3;
>
>  SET ROLE regress_publication_user;
>  REVOKE CREATE ON DATABASE regression FROM regress_publication_user2;

Added.

The changes for the above are available in the v19 patch posted at [1].

[1] -
https://www.postgresql.org/message-id/CALDaNm3BMLBpWOSdS3Q2vwpsM%3D0yovpJm8dKHRqNyFpANbrhpw%40mail.gmail.com

Regards,
Vignesh


Gather performance analysis

2021-08-06 Thread Dilip Kumar
Hi,

I have been working on analyzing the performance of sending tuples
from workers to the Gather node using the tuple queue.  In the past there
were many off-list discussions around this area; basically, the main
point is that when "shm_mq" was implemented, it was perhaps one of the
best ways to do this at the time.  But now we have other choices, like
DSA for allocating shared memory on demand and shared temporary files
for a non-blocking tuple queue.

So my motivation for looking into this area is that we now have more
flexible alternatives, so can we use them to make Gather faster, and if
so then:
1. Can we actually reduce the tuple transfer cost and enable
parallelism in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g., to implement a
redistribute operator where we need to transfer data between the
workers?

IMHO for #1, it will be good enough if we can make the tuple transfer
faster, but for #2, we will also need a) faster tuple transfer, because
then we will have to transfer tuples between the workers as well, and
b) an infinite, non-blocking tuple queue (maybe using a shared temp
file) so that there is no deadlock while workers are redistributing
tuples to each other.

So I have done some quick performance tests and analysis using perf,
and some experiments with small prototypes for targeting a different
set of problems.

--Setup
SET parallel_tuple_cost TO 0   -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,2) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
   - 43.57% shm_mq_receive
  - 78.94% shm_mq_receive_bytes
 - 91.27% pg_atomic_read_u64
- pg_atomic_read_u64_impl
   - apic_timer_interrupt
 smp_apic_timer_interrupt

Perf analysis: Worker Node
  - 99.14% shm_mq_sendv
 - 74.10% shm_mq_send_bytes
+ 42.35% shm_mq_inc_bytes_written
- 32.56% pg_atomic_read_u64
   - pg_atomic_read_u64_impl
  - 86.27% apic_timer_interrupt
+ 17.93% WaitLatch

From the perf results and also from the code analysis I can think of
two main problems here:
1. Synchronization between the worker and gather node: just to
identify the bytes written and read, they need to do at least 2-3
atomic operations for each tuple, and I think that carries a huge
penalty due to a) frequent cache line invalidation and b) a lot of
atomic operations.

2. If the tuple queue is full then the worker might need to wait for
the gather to consume the tuple.

Experiment #1:
As part of this experiment, I have modified the sender to keep a
local copy of "mq_bytes_read" and "mq_bytes_written" in the local mqh
handle so that we don't need to frequently read/write the cache-sensitive
shared memory variables.  So now we only read from/write to shared
memory under the below conditions:

1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read and also inform the reader about
the new writes.
2) After every 4kB written, update the shared memory variables and
inform the reader.
3) On detach, flush any remaining data.
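To make the batching idea concrete, here is a rough, self-contained sketch
of the sender side (illustrative only -- the names are invented and this is
not the actual shm_mq change I tested):

#define SYNC_THRESHOLD 4096		/* publish to shared memory every ~4kB */

typedef struct BatchingSender
{
	pg_atomic_uint64 *shared_bytes_written;	/* read by the receiver */
	pg_atomic_uint64 *shared_bytes_read;	/* written by the receiver */
	uint64		ring_size;
	uint64		local_bytes_written;	/* private running total */
	uint64		last_published;			/* value last pushed to shared memory */
	uint64		cached_bytes_read;		/* receiver position, refreshed lazily */
} BatchingSender;

static void
publish_writes(BatchingSender *s)
{
	pg_atomic_write_u64(s->shared_bytes_written, s->local_bytes_written);
	s->last_published = s->local_bytes_written;
	/* SetLatch() on the receiver's latch would go here */
}

/* returns false if the queue is full and the caller must wait */
static bool
batched_send(BatchingSender *s, Size nbytes)
{
	/* condition 1: not enough space according to our stale view of the reader */
	if (s->local_bytes_written + nbytes - s->cached_bytes_read > s->ring_size)
	{
		s->cached_bytes_read = pg_atomic_read_u64(s->shared_bytes_read);
		publish_writes(s);		/* let the reader drain what we have so far */
		if (s->local_bytes_written + nbytes - s->cached_bytes_read > s->ring_size)
			return false;
	}

	/* ... copy nbytes into the ring buffer here ... */
	s->local_bytes_written += nbytes;

	/* condition 2: publish after every ~4kB written */
	if (s->local_bytes_written - s->last_published >= SYNC_THRESHOLD)
		publish_writes(s);

	return true;
}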

Machine information:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
CPU(s):56
On-line CPU(s) list:   0-55
Thread(s) per core:2
Core(s) per socket:14
Socket(s): 2
NUMA node(s):  2

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
 Execution Time: 31627.492 ms

2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
 Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

Observation:
- As expected, the results show that forcing the parallelism (by
reducing the parallel_tuple_cost) drastically impacts the
performance.
- But in the same scenario, with the patch, we can see a huge gain of ~40%.
- Even if we compare it with the non-parallel plan we have a gain of ~25%.
- With this, I think we can conclude that there is a huge potential
for improvement if we communicate the tuples in batches. 1) One simple
approach is what I used in my experiment; I think we can do some
optimization in the reader as well: instead of reading bytes_written
from shared memory every time, remember the previous value and only
read back the updated value from shared memory once we have exhausted
that.  2) Instead of copying the whole tuple into the tuple queue, we
can store the dsa_pointers of the tuple batch; I think Thomas Munro
also suggested a similar approach to Robert (I got to know this in an
off-list discussion with Robert).

Experiment #2: See the behavior by increasing the parallel tuple queue
size on head
(for this I created a small patch to make parallel_tuple_queue size
configurable)

-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS 

Re: Added schema level support for publication.

2021-08-06 Thread Masahiko Sawada
On Wed, Aug 4, 2021 at 12:08 AM vignesh C  wrote:
>
> On Tue, Aug 3, 2021 at 12:00 PM tanghy.f...@fujitsu.com
>  wrote:
> >
> > On Monday, August 2, 2021 11:40 PM vignesh C wrote:
> > >
> > > Thanks for the comments, attached v17 patches has the fixes for the same.
> >
> > Thanks for your new patch.
> >
> > I saw the following warning when compiling. It seems we don't need this 
> > variable any more.
> >
> > publicationcmds.c: In function ‘AlterPublicationSchemas’:
> > publicationcmds.c:592:15: warning: unused variable ‘oldlc’ 
> > [-Wunused-variable]
> >ListCell   *oldlc;
> >^
>
> Thanks for reporting this, this is fixed in the v18 patch attached.

I've also started reviewing this patch. I've not looked at the patch
yet but here are initial comments/questions based on using this
feature:

The pg_publication catalog still has the puballtables column, but is it
still necessary? IIUC, since pubtype = 'a' means publishing all tables in
the database, puballtables seems no longer necessary.

---
Suppose that a parent table and its child table are defined in
different schemas, there is a publication for the schema where only
the parent table is defined, and the subscriber subscribes to that
publication: should changes for the child table be replicated to the
subscriber?

In FOR TABLE cases, i.e., where the subscriber subscribes to a
publication that is only for the parent table, changes for its child
table are replicated to the subscriber.

As far as I tested v18 patch, changes for the child table are not
replicated in FOR SCHEMA cases. Here is the test script:

On publisher and subscriber:
create schema p_schema;
create schema c_schema;
create table p_schema.p (a int) partition by list (a);
create table c_schema.c partition of p_schema.p for values in (1);

On publisher:
create publication pub_p_schema for schema p_schema;

On subscriber:
create subscription pub connection 'dbname=postgres' publication pub_p_schema;

On publisher:
insert into p_schema.p values (1);
select * from p_schema.p;
 a
---
 1
(1 row)

On subscriber:
select * from p_schema.p;
 a
---

(0 rows)

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Support reset of Shared objects statistics in "pg_stat_reset" function

2021-08-06 Thread Sadhuprasad Patro
Hi,

The call to “pg_stat_reset“ does reset all the statistics data for
tables belonging to the current database but does not take care of
shared tables e.g pg_attribute. Similarly to reset the statistics at
table level, the call to “pg_stat_reset_single_table_counters“ does
not take care of shared tables.

When we reset all the statistics using the call "pg_stat_reset", the
postgres process internally calls
"pgstat_recv_resetcounter", which resets the statistics of all the
tables of the current database, but it does not reset the statistics of
shared objects, which are kept under database ID 'InvalidOid'.

The same fix is made in the internal function
"pgstat_recv_resetsinglecounter" to reset the statistics of
shared tables for the call "pg_stat_reset_single_table_counters".
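
In rough C terms, the shape of the change is something like the following
(a sketch of the idea only; the attached patch is the authoritative version,
and reset_dbentry_counters here stands in for whatever helper actually
clears a database entry):

static void
pgstat_reset_current_and_shared(Oid dbid)
{
	PgStat_StatDBEntry *dbentry;

	/* reset the current database's tables, as before */
	dbentry = pgstat_get_db_entry(dbid, false);
	if (dbentry != NULL)
		reset_dbentry_counters(dbentry);	/* hypothetical helper */

	/* additionally reset the entry for shared catalogs, kept under InvalidOid */
	dbentry = pgstat_get_db_entry(InvalidOid, false);
	if (dbentry != NULL)
		reset_dbentry_counters(dbentry);
}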

--
thank u
SADHU PRASAD
EnterpriseDB: http://www.enterprisedb.com


0001-pg_stat_reset-and-pg_stat_reset_single_table_counter.patch
Description: Binary data


Re: archive status ".ready" files may be created too early

2021-08-06 Thread Kyotaro Horiguchi
At Fri, 6 Aug 2021 00:21:34 +, "Bossart, Nathan"  
wrote in 
> On 8/5/21, 12:39 AM, "Kyotaro Horiguchi"  wrote:
> > Honestly I don't like having this initialization in XLogWrite.  We
> > should and I think can initialize it earlier.  It seems to me the most
> > appropriate timing to initialize the variable is just before running
> > the end-of-recovery checkpoint).  Since StartupXLOG knows the first
> > segment to write .  If it were set to 0, that doesn't matter at all.
> > We can get rid of the new symbol by doing this.
> 
> This seems like a good idea to me.  I made this change in v5.  I
> performed some basic testing, and it seems to reliably initialize
> lastNotifiedSeg correctly.
> 
> > +   /*
> > +* EndOfLog resides on the next segment of the last finished one. Set 
> > the
> > +* last finished segment as lastNotifiedSeg now.  In the case where the
> > +* last crash has left the last several segments not being marked as
> > +* .ready, the checkpoint just after does that for all finished 
> > segments.
> > +* There's a corner case where the checkpoint advances segment, but that
> > +* ends up at most with a duplicate archive notification.
> > +*/
> 
> I'm not quite following the corner case you've described here.  Is it
> possible that the segment that EndOfLog points to will be eligible for
> removal after the checkpoint?

Archiving doesn't immediately mean removal.  A finished segment
ought to be archived right away.  Since the EndOfLog segment must not
get marked .ready, setting lastNotifiedSeg to the previous segment is
quite right, but if the end-of-recovery checkpoint advances the segment,
EndOfLog is marked .ready at the XLogFlush just after.  But, sorry,
what I forgot at the time was that the checkpoint also moves
lastNotifiedSeg.  So that corner case does not exist.

> In v5 of the patch, I've also added an extra call to
> NotifySegmentsReadyForArchive() in the same place we previously
> created the .ready files.  I think this helps notify archiver sooner
> in certain cases (e.g., asynchronous commit).

In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock
across the .ready file creations. Since the notification loop doesn't
need the hash itself, can the loop be taken out of the lock section?

current:

    LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
    last_notified = GetLastNotifiedSegment();
    latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified,
                                                         flushed, &found);

    if (found)
    {
        SetLastNotifiedSegment(latest_boundary_seg - 1);
        for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
            XLogArchiveNotifySeg(seg, false);

        RemoveRecordBoundariesUpTo(latest_boundary_seg);

        PgArchWakeup();
    }
    LWLockRelease(ArchNotifyLock);

But we can release the lock earlier.

    LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
    last_notified = GetLastNotifiedSegment();
    latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified,
                                                         flushed, &found);

    if (found)
    {
        SetLastNotifiedSegment(latest_boundary_seg - 1);
        RemoveRecordBoundariesUpTo(latest_boundary_seg);
    }
    LWLockRelease(ArchNotifyLock);

    if (found)
    {
        for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
            XLogArchiveNotifySeg(seg, false);

        PgArchWakeup();
    }

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-06 Thread Masahiko Sawada
On Fri, Aug 6, 2021 at 1:39 PM Amit Kapila  wrote:
>
> On Fri, Aug 6, 2021 at 5:09 AM Masahiko Sawada  wrote:
> >
> > On Thu, Aug 5, 2021 at 11:40 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > > To summary, I think that what we want to do in DROP SUBSCRIPTION cases 
> > > > is to
> > > > drop relations from pg_subscription_rel that are no longer included in 
> > > > the set of
> > > > after-calculated publications. In the latter example, when ALTER
> > > > SUBSCRIPTION ... DROP PUBLICATION pub12, we should compare the current
> > > > set of relations associated with {pub12, pub13} to the set of 
> > > > relcations associated
> > > > with pub13, not pub12. Then we can find out t2 is no longer associated 
> > > > with the
> > > > after-calculated publication (i.g., pub13) so remove it.
> > >
> > > Thanks for the review. You are right, I missed this point.
> > > Attach new version patch which fix these problems(Also add a new pl-test).
> > >
> >
> > Thank you for updating the patch!
> >
> > Here are some comments:
> >
> > I think that it still could happen that DROP PULIBCATION misses
> > dropping relations. Please consider the following case:
> >
> > On publisher and subscriber:
> > create table t1 (a int);
> > create table t2 (a int);
> > create table t3 (a int);
> >
> > On publisher:
> > create publication pub12 for table t1, t2;
> > create publication pub3 for table t3;
> >
> > On subscriber:
> > create subscription sub connection 'dbname=postgres' publication pub12, 
> > pub3;
> >
> > On publisher:
> > alter publication pub3 add table t1;
> >
> > On subscriber:
> > alter subscription sub drop publication pub12;
> > select srsubid, srrelid::regclass::text, srsubstate, srsublsn from
> > pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn
> > -+-++---
> >16393 | t3  | r  | 0/1707CE0
> >16393 | t1  | r  | 0/1707D18
> >
> > With v3 patch, pg_subscription_rel has entries for t3 and t1. But if
> > ALTER SUBSCRIPTION ... DROP PUBLICATION pub12 refreshes only relations
> > associated with pub12, t1 should be removed and be added when we
> > refresh pub3.
> >
>
> But, isn't this happening because of your suggestion to compare the
> current set of relations with relations from publications that doesn't
> need to be removed?

Hmm, yes, it cannot cover all cases. I had somehow misunderstood that
the subscriber knows which relations are associated with which
publications. Given that the subscriber doesn’t know which relations
are associated with which publications (and the set of subscribed
relations on the subscriber becomes out of date once the publication
is updated), I think it's impossible to guarantee that DROP PUBLICATION
drops only the relations that are associated with just the publication
being dropped.

> Do we have better ideas to fix or shall we just
> say that we will refresh based on whatever current set of relations
> are present in publication to be dropped?

I don't have better ideas. It seems to me that it's prudent that we
accept the approach in the v3 patch and refresh all publications in DROP
PUBLICATION cases.

>
> One more thing, I think it would be better to write a separate refresh
> function for add/drop rather than adding checks in the current refresh
> functionality. We can extract common functionality code which can be
> called from the different refresh functions.

+1

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: RFC: Improve CPU cache locality of syscache searches

2021-08-06 Thread Andres Freund
Hi,

On Thu, Aug 5, 2021, at 22:20, Yura Sokolov wrote:
> Andres Freund wrote 2021-08-06 06:49:
> > Hi,
> > 
> > On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
> >> Why don't use simplehash or something like that? Open-addressing 
> >> schemes
> >> show superior cache locality.
> > 
> > I thought about that as well - but it doesn't really resolve the 
> > question of
> > what we want to store in-line in the hashtable and what not. We can't 
> > store
> > the tuples themselves in the hashtable for a myriad of reasons (need 
> > pointer
> > stability, they're variably sized, way too large to move around 
> > frequently).
> > 
> > 
> >> Well, simplehash entry will be 24 bytes this way. If simplehash 
> >> template
> >> supports external key/element storage, then it could be shrunk to 16 
> >> bytes,
> >> and syscache entries will not need dlist_node. (But it doesn't at the
> >> moment).
> > 
> > I think storing keys outside of the hashtable entry defeats the purpose 
> > of the
> > open addressing, given that they are always checked and that our 
> > conflict
> > ratio should be fairly low.
> 
> It's opposite: if conflict ratio were high, then key outside of 
> hashtable will
> be expensive, since lookup to non-matched key will cost excess memory 
> access.
> But with low conflict ratio we will usually hit matched entry at first 
> probe.
> And since we will use entry soon, it doesn't matter when it will go to 
> CPU L1
> cache: during lookup or during actual usage.

Often enough it does matter, because there will be earlier dependencies on 
whether a lookup is a cache hit/miss than on the content of the cached tuple.
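
For what it's worth, the trade-off being discussed looks roughly like this
in struct form (illustrative declarations only, not from any posted patch,
and not meant to match the exact entry sizes quoted above):

/*
 * (a) keys stored inline in the open-addressing entry: a larger entry,
 *     but a probe can usually reject a non-matching slot without touching
 *     the out-of-line tuple.
 */
typedef struct CatCacheEntryInlineKeys
{
	uint32		hash;			/* full hash value for cheap comparisons */
	Datum		keys[4];		/* lookup keys copied into the entry */
	struct catctup *ct;			/* the tuple itself stays out of line */
} CatCacheEntryInlineKeys;

/*
 * (b) only a pointer stored inline: a smaller entry, but every probe that
 *     needs to compare keys must chase the pointer first.
 */
typedef struct CatCacheEntryExternalKeys
{
	uint32		hash;
	struct catctup *ct;			/* keys are read through this pointer */
} CatCacheEntryExternalKeys;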

Regards,

Andres